Details Determine Success: A Hands-on MapReduce Task, the Map Join

Posted by self_control on 2016-06-14

1. Task Description

There are two files on HDFS.

Student information file: hdfs://***.***.***:8020/user/train/joinjob/student.txt

Comma-separated. The first column is the student ID, which is unique per student; the second column is the student name.


2016001,Join
2016002,Abigail
2016003,Abby
2016004,Alexandra
2016005,Cathy
2016006,Katherine

Student score file: hdfs://***.***.***:8020/user/train/joinjob/student_score.txt

Comma-separated. The first column is the student ID, the second is the course code, and the third is the score.


2016001,YY,60
2016001,SX,88
2016001,YW,91
2016002,SX,77
2016002,YW,33
.............

The expected result file looks like this:


2016001,Join,YY,60
2016001,Join,SX,88
2016001,Join,YW,91
2016002,Abigail,SX,77
2016002,Abigail,YW,33
2016002,Abigail,YY,56
2016003,Abby,YY,34
2016003,Abby,SX,84
2016003,Abby,YW,69
2016004,Alexandra,YY,89
2016004,Alexandra,SX,84
.......

2. Task Analysis

  1. This is a task that joins two datasets. A join can be done on the map side or on the reduce side. When one dataset is small enough to fit entirely in memory, choose a map-side join; when both datasets are large, use a reduce-side join, where a Bloom filter can improve efficiency. This task uses a map-side join.
  2. The joined records can be written out directly, so no reduce phase is needed.
  3. Techniques used in this task: the DistributedCache, and reading local files inside a MapReduce task.
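Hadoop aside, the core idea in point 1 — load the small dataset into an in-memory hash map, then stream the large dataset past it one record at a time — can be sketched in plain Java. The class and method names below are illustrative only, not part of the actual job:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal, Hadoop-free sketch of a map-side (hash) join.
public class HashJoinSketch {
    public static List<String> join(List<String> students, List<String> scores) {
        // Build the lookup table from the small dataset: id -> name.
        Map<String, String> idToName = new HashMap<>();
        for (String line : students) {
            String[] f = line.split(",");
            idToName.put(f[0], f[1]);
        }
        // Stream the large dataset and emit the joined records.
        List<String> out = new ArrayList<>();
        for (String line : scores) {
            String[] f = line.split(",");
            out.add(f[0] + "," + idToName.get(f[0]) + "," + f[1] + "," + f[2]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> students = List.of("2016001,Join", "2016002,Abigail");
        List<String> scores = List.of("2016001,YY,60", "2016002,SX,77");
        join(students, scores).forEach(System.out::println);
    }
}
```

The small dataset is read exactly once per task, and each large-dataset record is joined with a single O(1) lookup, which is why this pattern avoids the shuffle that a reduce-side join requires.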

3. Implementation Approach

  1. Distribute the student information file to every task node via the distributed cache.
  2. Use the student score file as the map input.
  3. In the mapper's setup method, read the locally cached student information and store it in a HashMap, keyed by student ID with the name as the value.
  4. In the map method, for each score record, look up the student's name in that HashMap by student ID.
  5. Wrap the joined record in a Text object and emit it.

4. Implementation Code


package join;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapJoinDirectOutPutJob extends Configured implements Tool {

    private static final String INPATH_SCORE = "joinjob/student_score.txt";
    private static final String OUTPATH = "joinjob/output";

    public static class MapJoinDirectOutPutMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        // In-memory lookup table built from the cached student file: id -> name.
        private final Map<String, String> map = new HashMap<String, String>();
        private final Text newValue = new Text();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            // The cache file was added with the fragment "#studentLink.txt" in run(),
            // so it appears in the task's working directory under that symlink name.
            FileSystem fs = FileSystem.getLocal(conf);
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("studentLink.txt"))));
            try {
                String current;
                while ((current = br.readLine()) != null) {
                    String[] words = current.split(",");
                    map.put(words[0], words[1]); // student ID -> student name
                }
            } finally {
                br.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is a score record: id,course,score.
            String[] words = StringUtils.split(value.toString(), ',');
            String name = map.get(words[0]);
            // Emit id,name,course,score with a null key: direct output, no reduce.
            newValue.set(words[0] + "," + name + "," + words[1] + "," + words[2]);
            context.write(NullWritable.get(), newValue);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "MapJoinDirectOutPutJob");
        job.setJarByClass(getClass());
        Configuration conf = job.getConfiguration();

        Path in = new Path(INPATH_SCORE);
        Path out = new Path(OUTPATH);
        // Remove any previous output so the job can be rerun.
        FileSystem.get(conf).delete(out, true);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(MapJoinDirectOutPutMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        // Map-only job: no reduce phase.
        job.setNumReduceTasks(0);

        // Cache the student file on every task node; the "#studentLink.txt" fragment
        // creates a symlink of that name in each task's working directory.
        URI uri = new URI("hdfs://***.***.***.***:8020/user/train/joinjob/student.txt#studentLink.txt");
        job.addCacheFile(uri);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        int r = 0;
        try {
            r = ToolRunner.run(new Configuration(), new MapJoinDirectOutPutJob(), args);
        } catch (Exception e) {
            e.printStackTrace();
            r = 1; // signal failure to the caller
        }
        System.exit(r);
    }
}
Permalink: http://blog.itpub.net/30066956/viewspace-2120063/


