Details Decide Success: A Hands-On MapReduce Task, the Inverted Index

Posted by self_control on 2016-06-15
Today I happened across a blog post describing how to build an inverted index with MapReduce. Let's take that as the hands-on task for this post.

1. Task Description


There are three files on HDFS; their contents are shown in the left box of the figure above, and the right box shows the result file after processing.
An inverted index (also called a reverse index, postings file, or inverted file) is an indexing method that stores, for full-text search, a mapping from each word to its locations in a document or a set of documents. It is the most widely used data structure in document retrieval systems: given a word, it lets you quickly fetch the list of documents containing that word. An inverted index has two main parts, the term dictionary and the postings (inverted) file.
What makes this task different from a classic inverted index is that it also records each word's frequency within each file.
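Before reaching for MapReduce, the target data structure can be sketched in a few lines of plain Java. This is only an illustrative in-memory version; the file names and contents below are hypothetical (inferred from the result listing later in the post), and `build` is a helper invented for this sketch, not part of the article's job:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Minimal in-memory sketch of an inverted index with per-file term counts:
// word -> (fileName -> count). TreeMap keeps words sorted, as in the final output.
public class InvertedIndexSketch {
    public static Map<String, Map<String, Integer>> build(Map<String, String> files) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> e : files.entrySet()) {
            for (String w : e.getValue().split(" ")) {
                // computeIfAbsent creates the per-word map; merge sums the counts.
                index.computeIfAbsent(w, k -> new LinkedHashMap<>())
                     .merge(e.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("1.txt", "MapReduce is simple");
        files.put("2.txt", "MapReduce is powerful is simple");
        files.put("3.txt", "Hello MapReduce MapReduce bye");
        System.out.println(build(files));
    }
}
```

The MapReduce job in this post produces the same mapping, just computed in a distributed way, one word group per reduce call.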

2. Implementation Approach

  • First, note that the result includes the file name. There are two ways to get it: (1) write a custom InputFormat whose custom RecordReader obtains the Path directly from the InputSplit, and from it the file name; or (2) in the Mapper, retrieve the split from the context and read the file name from it. This task uses the second approach.
  • In the mapper, wrap the file name and the word in a custom key; use IntWritable as the value. The map method emits an IntWritable of 1 for each word.
  • Control how keys are grouped on their way into the reduce function, so that keys with the same word go into the same reduce call. This requires a custom GroupingComparator.

3. Implementation Code

The custom key, WordKey. Note: there is a deliberately planted pit (bug) in here.


package indexinverted;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class WordKey implements WritableComparable<WordKey> {

        private String fileName;
        private String word;

        @Override
        public void write(DataOutput out) throws IOException {
                out.writeUTF(fileName);
                out.writeUTF(word);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
                this.fileName = in.readUTF();
                this.word = in.readUTF();
        }

        @Override
        public int compareTo(WordKey key) {
                int r = fileName.compareTo(key.fileName);
                if (r == 0)
                        r = word.compareTo(key.word);
                return r;
        }

        public String getFileName() {
                return fileName;
        }

        public void setFileName(String fileName) {
                this.fileName = fileName;
        }

        public String getWord() {
                return word;
        }

        public void setWord(String word) {
                this.word = word;
        }
}
Mapper, Reducer, and IndexInvertedGroupingComparator (I like to make small classes like these inner classes of the Job class; it keeps the code simpler).
The output formatting in the reduce function is a bit fiddly; don't pay too much attention to it.


package indexinverted;

import java.io.IOException;
import java.util.LinkedHashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

public class MyIndexInvertedJob extends Configured implements Tool {

        public static class IndexInvertedMapper extends Mapper<LongWritable, Text, WordKey, IntWritable> {
                private WordKey newKey = new WordKey();
                private IntWritable ONE = new IntWritable(1);
                private String fileName;

                @Override
                protected void map(LongWritable key, Text value, Context context)
                                throws IOException, InterruptedException {
                        newKey.setFileName(fileName);
                        String[] words = value.toString().split(" ");
                        for (String w : words) {
                                newKey.setWord(w);
                                context.write(newKey, ONE);
                        }
                }

                @Override
                protected void setup(Context context) throws IOException, InterruptedException {
                        // Get the name of the file this split was read from.
                        FileSplit inputSplit = (FileSplit) context.getInputSplit();
                        fileName = inputSplit.getPath().getName();
                }
        }

        public static class IndexInvertedReducer extends Reducer<WordKey, IntWritable, Text, Text> {
                private Text outputKey = new Text();

                @Override
                protected void reduce(WordKey key, Iterable<IntWritable> values, Context context)
                                throws IOException, InterruptedException {
                        outputKey.set(key.getWord());
                        // Accumulate per-file counts; LinkedHashMap preserves insertion order.
                        LinkedHashMap<String, Integer> map = new LinkedHashMap<String, Integer>();
                        for (IntWritable v : values) {
                                if (map.containsKey(key.getFileName())) {
                                        map.put(key.getFileName(), map.get(key.getFileName()) + v.get());
                                } else {
                                        map.put(key.getFileName(), v.get());
                                }
                        }
                        // Format the counts as {(file,count),(file,count)}.
                        StringBuilder sb = new StringBuilder();
                        sb.append("{");
                        for (String k : map.keySet()) {
                                sb.append("(").append(k).append(",").append(map.get(k)).append(")").append(",");
                        }
                        sb.deleteCharAt(sb.length() - 1).append("}");
                        context.write(outputKey, new Text(sb.toString()));
                }
        }

        public static class IndexInvertedGroupingComparator extends WritableComparator {
                Logger log = Logger.getLogger(getClass());

                public IndexInvertedGroupingComparator() {
                        super(WordKey.class, true);
                }

                @Override
                public int compare(WritableComparable a, WritableComparable b) {
                        WordKey key1 = (WordKey) a;
                        WordKey key2 = (WordKey) b;
                        log.info("==============key1.getWord().compareTo(key2.getWord()):" + key1.getWord().compareTo(key2.getWord()));
                        return key1.getWord().compareTo(key2.getWord());
                }
        }

        @Override
        public int run(String[] args) throws Exception {
                Job job = Job.getInstance(getConf(), "IndexInvertedJob");
                job.setJarByClass(getClass());
                Configuration conf = job.getConfiguration();

                Path in = new Path("myinvertedindex/");
                Path out = new Path("myinvertedindex/output");
                FileSystem.get(conf).delete(out, true);
                FileInputFormat.setInputPaths(job, in);
                FileOutputFormat.setOutputPath(job, out);

                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);

                job.setMapperClass(IndexInvertedMapper.class);
                job.setMapOutputKeyClass(WordKey.class);
                job.setMapOutputValueClass(IntWritable.class);

                job.setReducerClass(IndexInvertedReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(Text.class);

                job.setGroupingComparatorClass(IndexInvertedGroupingComparator.class);

                return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) {
                int r = 0;
                try {
                        r = ToolRunner.run(new Configuration(), new MyIndexInvertedJob(), args);
                } catch (Exception e) {
                        e.printStackTrace();
                }
                System.exit(r);
        }
}

4. Checking the Results

hadoop dfs -cat myinvertedindex/output/part-r-00000 


MapReduce {(1.txt,1)}
is {(1.txt,1)}
simple {(1.txt,1)}
MapReduce {(2.txt,1)}
is {(2.txt,2)}
powerful {(2.txt,1)}
simple {(2.txt,1)}
Hello {(3.txt,1)}
MapReduce {(3.txt,2)}
bye {(3.txt,1)}
Inspecting the output reveals a problem: entries for the same word were not merged together. What could be the reason?

5. Filling in the Pit

Inspecting the output revealed a problem: entries for the same word were not merged together. Why?
On what basis does the GroupingComparator do its work? Does configuring it so that keys with the same word feed the same reduce call guarantee that all identical words actually end up in one reduce call?
No! The GroupingComparator only compares whether two adjacent keys are equal. So for the result to be correct, the keys' sort order must be consistent with the GroupingComparator's ordering.
The problem is that WordKey sorts first by file name and then by word, so adjacent keys share a file name, not a word.
The fix is to change WordKey's compareTo method. The repaired code is as follows:


package indexinverted;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class WordKey implements WritableComparable<WordKey> {

        private String fileName;
        private String word;

        @Override
        public void write(DataOutput out) throws IOException {
                out.writeUTF(fileName);
                out.writeUTF(word);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
                this.fileName = in.readUTF();
                this.word = in.readUTF();
        }

        @Override
        public int compareTo(WordKey key) {
                // Sort by word first so that equal words are adjacent in the sorted
                // key stream; the file name is only a tie-breaker.
                int r = word.compareTo(key.word);
                if (r == 0)
                        r = fileName.compareTo(key.fileName);
                return r;
        }

        public String getFileName() {
                return fileName;
        }

        public void setFileName(String fileName) {
                this.fileName = fileName;
        }

        public String getWord() {
                return word;
        }

        public void setWord(String word) {
                this.word = word;
        }
}
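Why the sort order matters can be demonstrated without a cluster. The sketch below is a simplified simulation (the `Key` record and `reduceCalls` helper are invented for illustration): Hadoop walks the sorted key stream and starts a new reduce call whenever the grouping comparator says two adjacent keys differ, so grouping by word only works if equal words are adjacent after sorting:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simulates how reduce groups are formed from a sorted key stream:
// a new group starts whenever two ADJACENT keys have different words.
public class GroupingSketch {
    record Key(String fileName, String word) {}

    public static int reduceCalls(List<Key> keys, Comparator<Key> sortOrder) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(sortOrder);                 // shuffle-phase sort
        int calls = 0;
        Key prev = null;
        for (Key k : sorted) {
            // Grouping comparator: compare adjacent keys by word only.
            if (prev == null || !prev.word().equals(k.word())) calls++;
            prev = k;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Key> keys = List.of(
            new Key("1.txt", "is"), new Key("2.txt", "is"),
            new Key("1.txt", "simple"), new Key("2.txt", "simple"));
        // Buggy order: file first, then word -> words alternate, 4 reduce calls.
        Comparator<Key> fileFirst = Comparator.comparing(Key::fileName).thenComparing(Key::word);
        // Fixed order: word first, then file -> equal words adjacent, 2 reduce calls.
        Comparator<Key> wordFirst = Comparator.comparing(Key::word).thenComparing(Key::fileName);
        System.out.println(reduceCalls(keys, fileFirst)); // 4
        System.out.println(reduceCalls(keys, wordFirst)); // 2
    }
}
```

With the buggy file-first order the stream sorts to (1.txt,is), (1.txt,simple), (2.txt,is), (2.txt,simple): the words alternate, so each key opens its own group, which is exactly the unmerged output seen in section 4.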


Run result:


Hello {(3.txt,1)}
MapReduce {(1.txt,1),(2.txt,1),(3.txt,2)}
bye {(3.txt,1)}
is {(1.txt,1),(2.txt,2)}
powerful {(2.txt,1)}
simple {(1.txt,1),(2.txt,1)}








From the ITPUB blog: http://blog.itpub.net/30066956/viewspace-2120238/. Please credit the source if reprinting.
