A MapReduce Program Example, Details Decide Success or Failure (Part 3): Combiner

Posted by self_control on 2016-05-27
In the previous post we wrote an MR program that counts the occurrences of each character a~z in the input files. From the Counter values in the run log, we could see how much data was transferred over the network between map and reduce.
This post introduces the Combiner, a very useful component whose main purpose is to reduce that network transfer. The idea: on each node running a map task, the map output is aggregated locally first and only then sent to the
reducer, so you can think of it as a map-side reduce.
Code first. Since our combiner logic is identical to the reducer logic, we can simply set MyWordCountReducer as the job's combiner:

    job.setCombinerClass(MyWordCountJob.MyWordCountReducer.class);
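One line is all it takes. As a side note, the combiner does not have to be the reducer class: any Reducer will do, as long as its input and output key/value types both match the map output types, because the framework may run the combiner zero, one, or several times per map task, so its logic must be safe to apply repeatedly (summing counts is). A minimal standalone sketch, using the hypothetical class name MyWordCountCombiner:

    // Hypothetical standalone combiner, equivalent to reusing MyWordCountReducer.
    // Input and output types must both be (Text, IntWritable), the map output
    // types, because its output may be fed through another combine pass.
    public static class MyWordCountCombiner
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable partial = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int n = 0;
            for (IntWritable v : values) {
                n += v.get(); // sum the 1s emitted by this map task
            }
            partial.set(n);
            context.write(key, partial); // a partial count; the reducer merges them
        }
    }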

The key thing to look at is the run log:

    File System Counters
            FILE: Number of bytes read=422
            FILE: Number of bytes written=338601
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=556
            HDFS: Number of bytes written=103
            HDFS: Number of read operations=12
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=2
    Job Counters
            Launched map tasks=3
            Launched reduce tasks=1
            Data-local map tasks=3
            Total time spent by all maps in occupied slots (ms)=355336
            Total time spent by all reduces in occupied slots (ms)=66400
    Map-Reduce Framework
            Map input records=8
            Map output records=137
            Map output bytes=822
            Map output materialized bytes=434
            Input split bytes=399
            Combine input records=137
            Combine output records=52
            Reduce input groups=25
            Reduce shuffle bytes=434
            Reduce input records=52
            Reduce output records=25
            Spilled Records=104
            Shuffled Maps =3
            Failed Shuffles=0
            Merged Map outputs=3
            GC time elapsed (ms)=274
            CPU time spent (ms)=3430
            Physical memory (bytes) snapshot=1078874112
            Virtual memory (bytes) snapshot=3947868160
            Total committed heap usage (bytes)=884539392
    Shuffle Errors
            BAD_ID=0
            CONNECTION=0
            IO_ERROR=0
            WRONG_LENGTH=0
            WRONG_MAP=0
            WRONG_REDUCE=0
    File Input Format Counters
            Bytes Read=157
    File Output Format Counters
            Bytes Written=103
The log shows that the map input and output record counts are still 8 and 137; adding the combiner does not change them. But the reduce input records dropped from 137 (the value in the previous run, without a combiner) to 52. In other words, the amount of data transferred over the network went down.
The counters also show exactly where the change comes from: the 137 records the maps emitted were processed by the combine step (Combine input records=137), and the 52 records it produced (Combine output records=52) went on to the reducer for the final aggregation.
Note: the combine step is a map-side aggregation with no network transfer, and it only aggregates the output of the map task on its own node, not the data from all nodes.
Only the reduce step processes the data from all nodes.
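To put rough numbers on it (a sketch, assuming each map task spilled its output once, so the combiner ran once per map): each of the 3 map tasks emits at most one combined record per distinct letter it saw, so the 137 raw (letter, 1) pairs collapse into 52 partial counts, roughly 17 distinct letters per map, which the reducer then merges into the 25 final output groups.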
In the next post we will introduce another way to implement combiner-like functionality: in-map aggregation.

Finally, here is the complete code:


    package wordcount;

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.log4j.Logger;

    public class MyWordCountJob extends Configured implements Tool {
        Logger log = Logger.getLogger(MyWordCountJob.class);

        public static class MyWordCountMapper extends
                Mapper<LongWritable, Text, Text, IntWritable> {
            Logger log = Logger.getLogger(MyWordCountJob.class);

            Text mapKey = new Text();
            IntWritable mapValue = new IntWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // emit (letter, 1) for every a~z character in the line
                for (char c : value.toString().toLowerCase().toCharArray()) {
                    if (c >= 'a' && c <= 'z') {
                        mapKey.set(String.valueOf(c));
                        context.write(mapKey, mapValue);
                    }
                }
            }
        }

        // Used both as the reducer and as the combiner: it just sums partial
        // counts, so applying it on the map side first does not change the result.
        public static class MyWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            IntWritable rvalue = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int n = 0;
                for (IntWritable value : values) {
                    n += value.get();
                }
                rvalue.set(n);
                context.write(key, rvalue);
            }
        }

        @Override
        public int run(String[] args) throws Exception {
            // validate the parameters
            if (args.length != 2) {
                return -1;
            }

            Job job = Job.getInstance(getConf(), "MyWordCountJob");
            job.setJarByClass(MyWordCountJob.class);

            Path inPath = new Path(args[0]);
            Path outPath = new Path(args[1]);

            // remove the output directory if it already exists
            outPath.getFileSystem(getConf()).delete(outPath, true);
            TextInputFormat.setInputPaths(job, inPath);
            TextOutputFormat.setOutputPath(job, outPath);

            job.setMapperClass(MyWordCountJob.MyWordCountMapper.class);
            job.setReducerClass(MyWordCountJob.MyWordCountReducer.class);
            job.setCombinerClass(MyWordCountJob.MyWordCountReducer.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) {
            int result = 0;
            try {
                result = ToolRunner.run(new Configuration(), new MyWordCountJob(), args);
            } catch (Exception e) {
                e.printStackTrace();
            }
            System.exit(result);
        }
    }
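To try it out, a usage sketch (the jar name and HDFS paths below are placeholders):

    hadoop jar mywordcount.jar wordcount.MyWordCountJob /user/hadoop/wordcount/in /user/hadoop/wordcount/out

The combiner needs no change on the command line; it is wired up entirely inside run().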
In the next post, we will use in-map aggregation to optimize this job; in some situations it can be more efficient than a combiner:
A MapReduce Program Example, Details Decide Success or Failure (Part 4): In-Map Aggregation

From the "ITPUB Blog": http://blog.itpub.net/30066956/viewspace-2107885/. Please credit the source when reposting; otherwise legal liability may be pursued.
