I. Environment
1. Hadoop 0.20.2
2. Operating system: Linux
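II. Background
1. While writing MapReduce code recently, I kept wanting to count how often certain kinds of bad records occur. Writing those counts into the reduce output was too messy, so I looked for a way to report statistics separately.
2. Chapter 8, section 1 of "Hadoop: The Definitive Guide" shows that custom counters can be defined, but its examples are written against 0.19. Many of the calls no longer match 0.20.2, and the changes needed are fairly large.
3. For these two reasons, I wrote this note as a record.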
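III. Implementation
1. Prerequisite: an input file in which a well-formed record has 3 fields separated by tabs ("\t"). It contains 2 malformed records, one with 2 fields and one with 4 fields. The contents are: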
jim 1 28
kate 0 26
tom 1
kaka 1 22
lily 0 29 22
2. Count the malformed records. I did not write a reducer because no output is needed. The code is as follows:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MyCounter {

    // The mapper emits nothing, so the output key/value types are placeholders.
    public static class MyCounterMap extends Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Fields are tab-separated; a well-formed record has exactly 3.
            String[] fields = value.toString().split("\t");
            if (fields.length > 3) {
                // Counters are addressed by group name and counter name.
                context.getCounter("ErrorCounter", "toolong").increment(1);
            } else if (fields.length < 3) {
                context.getCounter("ErrorCounter", "tooshort").increment(1);
            }
        }
    }

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: MyCounter <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "MyCounter");
        job.setJarByClass(MyCounter.class);

        // No reducer class is set; only the mapper's counters matter here.
        job.setMapperClass(MyCounterMap.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

3. The launch command is as follows:
hadoop jar /jz/jar/Hadoop_Test.jar jz.MyCounter /jz/* /jz/06
Records with fewer than 3 fields are counted under tooshort, and records with more than 3 fields are counted under toolong.

4. The result is as follows (see the ErrorCounter group, which was highlighted in red in the original post):

10/08/04 17:29:15 INFO mapred.JobClient: Job complete: job_201008032120_0019
10/08/04 17:29:15 INFO mapred.JobClient: Counters: 18
10/08/04 17:29:15 INFO mapred.JobClient: Job Counters
10/08/04 17:29:15 INFO mapred.JobClient: Launched reduce tasks=1
10/08/04 17:29:15 INFO mapred.JobClient: Rack-local map tasks=1
10/08/04 17:29:15 INFO mapred.JobClient: Launched map tasks=6
10/08/04 17:29:15 INFO mapred.JobClient: ErrorCounter
10/08/04 17:29:15 INFO mapred.JobClient: tooshort=1
10/08/04 17:29:15 INFO mapred.JobClient: toolong=1
10/08/04 17:29:15 INFO mapred.JobClient: FileSystemCounters
10/08/04 17:29:15 INFO mapred.JobClient: FILE_BYTES_READ=6
10/08/04 17:29:15 INFO mapred.JobClient: HDFS_BYTES_READ=47
10/08/04 17:29:15 INFO mapred.JobClient: FILE_BYTES_WRITTEN=234
10/08/04 17:29:15 INFO mapred.JobClient: Map-Reduce Framework
10/08/04 17:29:15 INFO mapred.JobClient: Reduce input groups=0
10/08/04 17:29:15 INFO mapred.JobClient: Combine output records=0
10/08/04 17:29:15 INFO mapred.JobClient: Map input records=5
10/08/04 17:29:15 INFO mapred.JobClient: Reduce shuffle bytes=36
10/08/04 17:29:15 INFO mapred.JobClient: Reduce output records=0
10/08/04 17:29:15 INFO mapred.JobClient: Spilled Records=0
10/08/04 17:29:15 INFO mapred.JobClient: Map output bytes=0
10/08/04 17:29:15 INFO mapred.JobClient: Combine input records=0
10/08/04 17:29:15 INFO mapred.JobClient: Map output records=0
10/08/04 17:29:15 INFO mapred.JobClient: Reduce input records=0
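
The two ErrorCounter values in the log can be checked locally without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: it applies the same tab split and length checks as the mapper to the five sample records, and uses a map in place of the job's counters. The class name `FieldCountTally`, the `tally` helper, and the `SAMPLE` array are hypothetical names introduced only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Local stand-in for the mapper's counting logic; no Hadoop classes involved.
class FieldCountTally {

    // The five sample records from the input file, with explicit tab separators.
    static final String[] SAMPLE = {
        "jim\t1\t28",
        "kate\t0\t26",
        "tom\t1",
        "kaka\t1\t22",
        "lily\t0\t29\t22"
    };

    // Hypothetical helper: tallies malformed records by field count.
    static Map<String, Long> tally(String[] lines) {
        Map<String, Long> counters = new LinkedHashMap<>();
        counters.put("tooshort", 0L);
        counters.put("toolong", 0L);
        for (String line : lines) {
            // Same split as the mapper. Note that String.split drops
            // trailing empty fields, so "tom\t1\t" would count as 2 fields.
            int fields = line.split("\t").length;
            if (fields > 3) {
                counters.merge("toolong", 1L, Long::sum);
            } else if (fields < 3) {
                counters.merge("tooshort", 1L, Long::sum);
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        System.out.println(tally(SAMPLE)); // prints {tooshort=1, toolong=1}
    }
}
```

The tallies match the tooshort=1 and toolong=1 lines in the job output, and the five records correspond to "Map input records=5".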

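IV. Summary
1. "Hadoop: The Definitive Guide" explains this clearly, but because the versions differ, many of the methods differ too. To summarize, the main differences are:
An enum type is no longer required, counter names no longer need a properties file, and the calls are all wrapped in the context object.
2. The book also computes a percentage. A small change to the code achieves this: divide the error count by the total record count and express the result as a percentage.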
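
The percentage calculation can be sketched as follows. This is plain Java under stated assumptions, not the book's code: the class name `ErrorPercentage` and the `percent` helper are hypothetical, and the input values are hard-coded from the job output above (tooshort=1, toolong=1, Map input records=5) rather than read from a running job.

```java
// Sketch of the error-percentage calculation; values are taken from the log above.
class ErrorPercentage {

    // Hypothetical helper: share of malformed records, in percent.
    static double percent(long errorCount, long totalCount) {
        if (totalCount == 0) {
            return 0.0; // avoid dividing by zero on empty input
        }
        // Multiply by 100.0 first so the division is done in floating point.
        return 100.0 * errorCount / totalCount;
    }

    public static void main(String[] args) {
        long tooShort = 1;        // ErrorCounter: tooshort
        long tooLong = 1;         // ErrorCounter: toolong
        long mapInputRecords = 5; // Map-Reduce Framework: Map input records
        System.out.println(percent(tooShort + tooLong, mapInputRecords) + "%"); // prints 40.0%
    }
}
```

The error count goes in the numerator and the total in the denominator; swapping them would give 2.5 instead of a percentage of bad records.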
Original post: http://blog.csdn.net/dajuezhao/article/details/5788705

Custom counters in Hadoop
