A MapReduce Program Example, Details Determine Success (Part 2): Observing the Log and Counters

Posted by self_control on 2016-05-27
Writing a MapReduce program: http://blog.itpub.net/30066956/viewspace-2107549/

Below is a MapReduce program that counts how many times each of the single characters a~z appears in the input files.

```java
package wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

public class MyWordCountJob extends Configured implements Tool {
    Logger log = Logger.getLogger(MyWordCountJob.class);

    public static class MyWordCountMapper extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        Logger log = Logger.getLogger(MyWordCountJob.class);

        // Reuse the output key/value objects instead of allocating one per record.
        Text mapKey = new Text();
        IntWritable mapValue = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (letter, 1) for every lower-cased character in a..z.
            for (char c : value.toString().toLowerCase().toCharArray()) {
                if (c >= 'a' && c <= 'z') {
                    mapKey.set(String.valueOf(c));
                    context.write(mapKey, mapValue);
                }
            }
        }
    }

    public static class MyWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        IntWritable rvalue = new IntWritable(1);

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the 1s emitted for this letter.
            int n = 0;
            for (IntWritable value : values) {
                n += value.get();
            }
            rvalue.set(n);
            context.write(key, rvalue);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // Validate the parameters.
        if (args.length != 2) {
            System.err.println("Usage: MyWordCountJob <input path> <output path>");
            return -1;
        }

        Job job = Job.getInstance(getConf(), "MyWordCountJob");
        job.setJarByClass(MyWordCountJob.class);

        Path inPath = new Path(args[0]);
        Path outPath = new Path(args[1]);

        // Remove the output directory if it already exists.
        outPath.getFileSystem(getConf()).delete(outPath, true);
        TextInputFormat.setInputPaths(job, inPath);
        TextOutputFormat.setOutputPath(job, outPath);

        job.setMapperClass(MyWordCountJob.MyWordCountMapper.class);
        job.setReducerClass(MyWordCountJob.MyWordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        int result = 0;
        try {
            result = ToolRunner.run(new Configuration(), new MyWordCountJob(), args);
        } catch (Exception e) {
            e.printStackTrace();
            result = 1;
        }
        System.exit(result);
    }
}
```
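Since the focus of this part is Counters, note that the job above relies only on Hadoop's built-in counters; it is a one-liner to add your own. The fragment below is a sketch only (the group "MyStats" and counter name "NON_LETTER_CHARS" are invented names, not part of the original program, and it needs the Hadoop client libraries plus `import org.apache.hadoop.mapreduce.TaskCounter;` to compile):

```java
// In map(): count the characters we skip. Dynamic counters show up in the
// job log and the web UI next to the built-in ones.
// ("MyStats"/"NON_LETTER_CHARS" are invented names for this sketch.)
if (c >= 'a' && c <= 'z') {
    mapKey.set(String.valueOf(c));
    context.write(mapKey, mapValue);
} else {
    context.getCounter("MyStats", "NON_LETTER_CHARS").increment(1);
}

// In run(), after job.waitForCompletion(true) returns, any counter can also
// be read programmatically instead of scraping the console log:
long mapOut = job.getCounters()
        .findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
long skipped = job.getCounters()
        .findCounter("MyStats", "NON_LETTER_CHARS").getValue();
log.info("map output records=" + mapOut + ", non-letter chars=" + skipped);
```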

Input files:


```
[train@sandbox MyWordCount]$ hdfs dfs -ls mrdemo
Found 3 items
-rw-r--r-- 3 train hdfs 34 2016-05-11 01:41 mrdemo/demoinput1.txt
-rw-r--r-- 3 train hdfs 42 2016-05-11 01:41 mrdemo/demoinput2.txt
-rw-r--r-- 3 train hdfs 81 2016-05-11 01:41 mrdemo/demoinput3.txt
```

```
[train@sandbox MyWordCount]$ hdfs dfs -cat mrdemo/*input*.txt
hello world
how are you
i am hero
what is your name
where are you come from
abcdefghijklmnopqrsturwxyz
abcdefghijklmnopqrsturwxyz
abcdefghijklmnopqrsturwxyz
```

Run the MR job.

First, take a look at the output file. As expected, the job produced the count of each character (note: this is not the main point).


```
a 8
b 3
c 4
d 4
e 11
f 4
g 3
h 8
i 5
j 3
k 3
l 6
m 7
n 4
o 12
p 3
q 3
r 13
s 4
t 4
u 6
w 7
x 3
y 6
z 3
```


Now look at the job's execution log. The lines to focus on are the number of splits, the launched map and reduce tasks, and the record counters (this is the main point).

```
[train@sandbox MyWordCount]$ hadoop jar mywordcount.jar mrdemo/ mrdemo/output
16/05/11 04:00:45 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/192.168.252.131:8050
16/05/11 04:00:46 INFO input.FileInputFormat: Total input paths to process : 3
16/05/11 04:00:46 INFO mapreduce.JobSubmitter: number of splits:3
16/05/11 04:00:46 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/05/11 04:00:46 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
16/05/11 04:00:46 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
16/05/11 04:00:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1462517728035_0048
16/05/11 04:00:47 INFO impl.YarnClientImpl: Submitted application application_1462517728035_0048 to ResourceManager at sandbox.hortonworks.com/192.168.252.131:8050
16/05/11 04:00:47 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1462517728035_0048/
16/05/11 04:00:47 INFO mapreduce.Job: Running job: job_1462517728035_0048
16/05/11 04:00:55 INFO mapreduce.Job: Job job_1462517728035_0048 running in uber mode : false
16/05/11 04:00:55 INFO mapreduce.Job: map 0% reduce 0%
16/05/11 04:01:10 INFO mapreduce.Job: map 33% reduce 0%
16/05/11 04:01:11 INFO mapreduce.Job: map 100% reduce 0%
16/05/11 04:01:19 INFO mapreduce.Job: map 100% reduce 100%
16/05/11 04:01:19 INFO mapreduce.Job: Job job_1462517728035_0048 completed successfully
16/05/11 04:01:19 INFO mapreduce.Job: Counters: 43
        File System Counters
                FILE: Number of bytes read=1102
                FILE: Number of bytes written=339257
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=556
                HDFS: Number of bytes written=103
                HDFS: Number of read operations=12
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=3
                Launched reduce tasks=1
                Data-local map tasks=3
                Total time spent by all maps in occupied slots (ms)=314904
                Total time spent by all reduces in occupied slots (ms)=34648
        Map-Reduce Framework
                Map input records=8
                Map output records=137
                Map output bytes=822
                Map output materialized bytes=1114
                Input split bytes=399
                Combine input records=0
                Combine output records=0
                Reduce input groups=25
                Reduce shuffle bytes=1114
                Reduce input records=137
                Reduce output records=25
                Spilled Records=274
                Shuffled Maps =3
                Failed Shuffles=0
                Merged Map outputs=3
                GC time elapsed (ms)=241
                CPU time spent (ms)=3340
                Physical memory (bytes) snapshot=1106452480
                Virtual memory (bytes) snapshot=3980922880
                Total committed heap usage (bytes)=884604928
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=157
        File Output Format Counters
                Bytes Written=103
```
From the log we can read the following: the id of this job is job_1462517728035_0048; the input files were read as 3 splits, so the job launched 3 map tasks and 1 reduce task.
The map output record count is 137, and the reduce input record count is also 137. In other words, all 137 (letter, 1) records were transferred over the network to the reduce task. (Each record serializes to a 2-byte Text key plus a 4-byte IntWritable value, which lines up with Map output bytes = 137 x 6 = 822.)
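Those counter values can be sanity-checked off-cluster. The standalone Java snippet below (no Hadoop required; the sample lines are copied from the input files shown above) replays the mapper's per-character logic and the reducer's summing, confirming 137 emitted records across 25 distinct keys:

```java
import java.util.TreeMap;

public class CounterCheck {
    // The eight input lines from demoinput1..3.txt above.
    public static final String[] SAMPLE = {
        "hello world", "how are you", "i am hero",
        "what is your name", "where are you come from",
        "abcdefghijklmnopqrsturwxyz",
        "abcdefghijklmnopqrsturwxyz",
        "abcdefghijklmnopqrsturwxyz"
    };

    // Replays the mapper + reducer logic: per-letter totals for a..z.
    public static TreeMap<Character, Integer> letterCounts(String[] lines) {
        TreeMap<Character, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (char c : line.toLowerCase().toCharArray()) {
                if (c >= 'a' && c <= 'z') {
                    counts.merge(c, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        TreeMap<Character, Integer> counts = letterCounts(SAMPLE);
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        // Matches "Map output records=137" and "Reduce input groups=25" in the log.
        System.out.println("total=" + total + " groups=" + counts.size());
        // Matches the output file above.
        System.out.println("a=" + counts.get('a') + " e=" + counts.get('e'));
    }
}
```

The per-letter values also agree with the output file (a=8, e=11, r=13, and so on).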
In the next article, we will use a combiner to optimize this MapReduce job.


Next: A MapReduce Program Example, Details Determine Success (Part 3): Combiner

From the "ITPUB blog", link: http://blog.itpub.net/30066956/viewspace-2107875/. If you reproduce this article, please credit the source; otherwise legal liability may be pursued.
