A MapReduce Example Program, Details Make the Difference (3): Combiner

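In the previous post we wrote a MapReduce program that counts how often each letter a~z occurs in the input files. The Counter values in the job log show how much data is transferred over the network from map to reduce.
This post introduces the Combiner, an important component whose main use is to reduce that network transfer. It aggregates the map output on each node that runs a map task before the data is sent to the reducers, so it can be thought of as a map-side reduce.
Code first. Our combiner logic is the same as the reducer logic, so we simply set MyWordCountReducer as the job's combiner.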

job.setCombinerClass(MyWordCountJob.MyWordCountReducer.class);
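
Reusing the reducer as the combiner works here because summing is associative and commutative: adding up partial sums gives the same total as adding up the raw values. As far as I know, Hadoop also makes no promise about how many times the combiner runs on a map task's output (possibly not at all), so the result must not depend on it. The sketch below is plain Java with no Hadoop dependency; the class name CombinerSafety and the sample values are made up for illustration. It contrasts a sum, which is safe to combine, with a mean, which is not.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerSafety {

    // Sum all values, which is what MyWordCountReducer does for one key.
    static int sum(List<Integer> values) {
        int n = 0;
        for (int v : values) {
            n += v;
        }
        return n;
    }

    // Plain arithmetic mean, used here only as a counter-example.
    static double mean(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    public static void main(String[] args) {
        // Hypothetical values for one key, emitted by two map tasks.
        List<Integer> mapTask1 = Arrays.asList(1, 1, 1);
        List<Integer> mapTask2 = Arrays.asList(1);

        // Without a combiner the reducer sees all four values.
        int direct = sum(Arrays.asList(1, 1, 1, 1));
        // With a combiner the reducer sees one partial sum per map task.
        int combined = sum(Arrays.asList(sum(mapTask1), sum(mapTask2)));
        System.out.println(direct + " " + combined);          // prints "4 4"

        // A mean of partial means is not the overall mean.
        double directMean = mean(Arrays.asList(1.0, 2.0, 3.0, 10.0));
        double combinedMean = mean(Arrays.asList(
                mean(Arrays.asList(1.0, 2.0, 3.0)),
                mean(Arrays.asList(10.0))));
        System.out.println(directMean + " " + combinedMean);  // prints "4.0 6.0"
    }
}
```

If a job needed an average, the reducer could not simply be reused as the combiner; the combiner would have to pass along something that can still be merged, such as a sum together with a count.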

Now look closely at the job log:

File System Counters
FILE: Number of bytes read=422
FILE: Number of bytes written=338601
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=556
HDFS: Number of bytes written=103
HDFS: Number of read operations=12
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=3
Launched reduce tasks=1
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=355336
Total time spent by all reduces in occupied slots (ms)=66400
Map-Reduce Framework
Map input records=8
Map output records=137
Map output bytes=822
Map output materialized bytes=434
Input split bytes=399
Combine input records=137
Combine output records=52
Reduce input groups=25
Reduce shuffle bytes=434
Reduce input records=52
Reduce output records=25
Spilled Records=104
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=274
CPU time spent (ms)=3430
Physical memory (bytes) snapshot=1078874112
Virtual memory (bytes) snapshot=3947868160
Total committed heap usage (bytes)=884539392
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=157
File Output Format Counters
Bytes Written=103

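The log shows 8 map input records and 137 map output records; adding the combiner does not change these. What changes is the reduce input, which drops from 137 records to 52, so less data is sent over the network.
The counters also show that the combiner is what caused the change. Combine input records=137 matches the 137 map output records, and Combine output records=52 matches Reduce input records=52: the map output passed through the combiner, and its 52 output records went on to the reducer for further aggregation.
Note: the combiner aggregates on the map side, with no network transfer, and it only sees the map output on the node where it runs, not the data from all nodes.
Only the reducer processes data from every node.
The next post introduces in-map aggregation, another way to get a combiner-like effect.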
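
To make the map-side aggregation concrete, here is a minimal sketch that imitates the flow in plain Java, without Hadoop. The class LocalCombineSimulation, its method names and the three tiny input splits are all hypothetical, and real Hadoop does the combining during spill and merge rather than in one pass like this. It only shows that each map task sums its own output and the reducer then merges the partial sums.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalCombineSimulation {

    // Counts the (letter, 1) records one map task would emit without a combiner.
    static int mapOutputRecords(String split) {
        int records = 0;
        for (char c : split.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                records++;
            }
        }
        return records;
    }

    // Imitates the mapper plus combiner for ONE map task:
    // emit (letter, 1) for each a-z character, then sum per letter locally.
    static Map<Character, Integer> mapAndCombine(String split) {
        Map<Character, Integer> partial = new TreeMap<>();
        for (char c : split.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                partial.merge(c, 1, Integer::sum);
            }
        }
        return partial;
    }

    // Imitates the reducer: merge the partial sums from all map tasks.
    static Map<Character, Integer> reduce(List<Map<Character, Integer>> partials) {
        Map<Character, Integer> totals = new TreeMap<>();
        for (Map<Character, Integer> partial : partials) {
            for (Map.Entry<Character, Integer> e : partial.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical input: one string per map task.
        String[] splits = {"aab", "abc", "bcc"};

        int mapRecords = 0;
        int combineRecords = 0;
        List<Map<Character, Integer>> partials = new ArrayList<>();
        for (String split : splits) {
            mapRecords += mapOutputRecords(split);
            Map<Character, Integer> partial = mapAndCombine(split);
            combineRecords += partial.size();
            partials.add(partial);
        }

        System.out.println("Map output records: " + mapRecords);          // 9
        System.out.println("Combine output records: " + combineRecords);  // 7
        System.out.println("Reduce output: " + reduce(partials));         // {a=3, b=3, c=3}
    }
}
```

The combined record count (7) is still larger than the number of distinct letters (3) because every map task emits its own partial sum for each letter it saw. The same effect explains why the job above sends 52 records to the reducer for only 25 reduce groups.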

Finally, here is the complete code:

package wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

public class MyWordCountJob extends Configured implements Tool {
    Logger log = Logger.getLogger(MyWordCountJob.class);

    // Emits (letter, 1) for every a-z character in each input line.
    public static class MyWordCountMapper extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        Logger log = Logger.getLogger(MyWordCountJob.class);
        // Reused for every record to avoid allocating new objects.
        Text mapKey = new Text();
        IntWritable mapValue = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (char c : value.toString().toLowerCase().toCharArray()) {
                if (c >= 'a' && c <= 'z') {
                    mapKey.set(String.valueOf(c));
                    context.write(mapKey, mapValue);
                }
            }
        }
    }

    // Sums the counts for one letter. Used as both the reducer and the combiner.
    public static class MyWordCountReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        IntWritable rvalue = new IntWritable(1);

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int n = 0;
            for (IntWritable value : values) {
                n += value.get();
            }
            rvalue.set(n);
            context.write(key, rvalue);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // Validate the parameters: input path and output path.
        if (args.length != 2) {
            System.err.println("Usage: MyWordCountJob <input path> <output path>");
            return -1;
        }

        Job job = Job.getInstance(getConf(), "MyWordCountJob");
        job.setJarByClass(MyWordCountJob.class);

        Path inPath = new Path(args[0]);
        Path outPath = new Path(args[1]);
        // Delete the output directory if it already exists, so the job can be rerun.
        outPath.getFileSystem(getConf()).delete(outPath, true);
        TextInputFormat.setInputPaths(job, inPath);
        TextOutputFormat.setOutputPath(job, outPath);

        job.setMapperClass(MyWordCountJob.MyWordCountMapper.class);
        job.setReducerClass(MyWordCountJob.MyWordCountReducer.class);
        // The combiner reuses the reducer, because summing partial sums is safe.
        job.setCombinerClass(MyWordCountJob.MyWordCountReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        int result = 0;
        try {
            result = ToolRunner.run(new Configuration(), new MyWordCountJob(), args);
        } catch (Exception e) {
            e.printStackTrace();
            // Report failure to the caller instead of exiting with 0.
            result = 1;
        }
        System.exit(result);
    }
}

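In the next post we use in-map aggregation to optimize our job. In some situations it can be more efficient than a combiner.
Next: A MapReduce Example Program, Details Make the Difference (4): In-Map Aggregation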

Source: ITPUB Blog, link: http://blog.itpub.net/30066956/viewspace-2107885/. If you repost this article, please credit the source; otherwise legal action may be taken.
