Details Make or Break It: MapReduce Hands-On, Inverted Index

Posted by self_control on 2016-06-15

Today I happened across a blog post on building an inverted index with MapReduce, so let's use that task as this article's hands-on exercise.

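1. Task description

There are three files on HDFS. In the original post a figure showed their contents in its left box and the finished result file in its right box.
An inverted index, also called a reverse index, postings file, or inverted file, is an indexing method that stores, for full-text search, a mapping from each word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems. With an inverted index you can quickly look up the list of documents that contain a given word. An inverted index has two main parts: the "term dictionary" and the "inverted file".
This task differs from the traditional inverted index task in that it also records how many times the word occurs in each file.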
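
To make the target data structure concrete before moving to MapReduce, here is a minimal in-memory sketch of an inverted index with per-file counts. It uses only the Java standard library. The class name `InvertedIndexSketch`, the method `build`, and the sample file contents are assumptions made for illustration; they are not taken from the original post.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {

    /** Builds word -> (file name -> occurrence count) from file name -> content. */
    static Map<String, Map<String, Integer>> build(Map<String, String> files) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String word : file.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // blank content yields a single empty token
                }
                // Create the postings map on first sight of the word,
                // then add 1 to this file's count.
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(file.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical file contents, chosen only for illustration.
        Map<String, String> files = new LinkedHashMap<>();
        files.put("1.txt", "MapReduce is simple");
        files.put("2.txt", "MapReduce is powerful is simple");
        files.put("3.txt", "Hello MapReduce bye MapReduce");
        build(files).forEach((word, postings) -> System.out.println(word + " " + postings));
    }
}
```

The MapReduce job in this article produces the same shape of data, but distributes the counting across map and reduce tasks instead of holding everything in one map.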

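2. Approach

First, note that the result contains file names. There are two ways to get them: 1. write a custom InputFormat and, in its custom RecordReader, obtain the Path directly from the InputSplit and from that the file name; 2. in the Mapper, fetch the split through the context and read the file name from it. This task uses the second approach.
In the mapper, wrap the file name and the word in a custom key. The value is an IntWritable; map simply emits an IntWritable with value 1 for every word.
Control the grouping of keys entering the reduce function so that keys with the same word go into the same reduce call. This requires a custom GroupingComparator.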

3. Implementation

The custom key, WordKey. Note that this code contains a deliberately planted pitfall.

package indexinverted;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Composite map output key: the file a word came from, plus the word itself.
public class WordKey implements WritableComparable<WordKey> {
    private String fileName;
    private String word;

    // Fields are written and read back in the same order.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeUTF(word);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.fileName = in.readUTF();
        this.word = in.readUTF();
    }

    // Sort order used during the shuffle: file name first, then word.
    @Override
    public int compareTo(WordKey key) {
        int r = fileName.compareTo(key.fileName);
        if (r == 0) {
            r = word.compareTo(key.word);
        }
        return r;
    }

    public String getFileName() {
        return fileName;
    }

    public void setFileName(String fileName) {
        this.fileName = fileName;
    }

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }
}
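
The `write` and `readFields` methods above only work as a pair if the fields are read back in the order they were written. The following sketch checks that round trip with plain `java.io` streams, without Hadoop. The class `KeyRoundTripSketch` and its two helper methods are hypothetical names introduced for illustration; the field order simply mirrors WordKey.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class KeyRoundTripSketch {

    /** Serializes the two fields in the same order WordKey.write uses. */
    static byte[] write(String fileName, String word) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(fileName);
            out.writeUTF(word);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back in the same order; returns {fileName, word}. */
    static String[] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String fileName = in.readUTF();
            String word = in.readUTF();
            return new String[] {fileName, word};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] fields = read(write("1.txt", "MapReduce"));
        System.out.println(fields[0] + " / " + fields[1]);
    }
}
```

`writeUTF` prefixes each string with a two-byte length, so a single encoded string is limited to 65535 bytes. That is ample for words and file names.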

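The Mapper, the Reducer, and IndexInvertedGroupingComparator (I like to make small classes nested classes of the Job class, which keeps the code simple).
The output formatting in the reduce function is a bit fiddly; there is no need to pay much attention to it.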

package indexinverted;

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

public class MyIndexInvertedJob extends Configured implements Tool {

    public static class IndexInvertedMapper
            extends Mapper<LongWritable, Text, WordKey, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final WordKey newKey = new WordKey();
        private String fileName;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // With TextInputFormat the split is a FileSplit, which carries the file path.
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            fileName = inputSplit.getPath().getName();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            newKey.setFileName(fileName);
            // Split on runs of whitespace and skip empty tokens, so repeated
            // spaces do not produce an empty "word".
            for (String w : value.toString().split("\\s+")) {
                if (w.isEmpty()) {
                    continue;
                }
                newKey.setWord(w);
                context.write(newKey, ONE);
            }
        }
    }

    public static class IndexInvertedReducer extends Reducer<WordKey, IntWritable, Text, Text> {
        private final Text outputKey = new Text();

        @Override
        protected void reduce(WordKey key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            outputKey.set(key.getWord());
            // file name -> count, kept in the order the keys arrive
            Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
            for (IntWritable v : values) {
                // Hadoop refreshes the key object as the values iterator advances,
                // so getFileName() returns the file of the current value.
                String file = key.getFileName();
                Integer old = counts.get(file);
                counts.put(file, old == null ? v.get() : old + v.get());
            }
            // Format as {(file,count),(file,count)}
            StringBuilder sb = new StringBuilder("{");
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                sb.append("(").append(e.getKey()).append(",").append(e.getValue()).append("),");
            }
            sb.deleteCharAt(sb.length() - 1).append("}"); // drop the trailing comma
            context.write(outputKey, new Text(sb.toString()));
        }
    }

    // Groups reduce input by word only, ignoring the file name.
    public static class IndexInvertedGroupingComparator extends WritableComparator {
        Logger log = Logger.getLogger(getClass());

        public IndexInvertedGroupingComparator() {
            super(WordKey.class, true);
        }

        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            WordKey key1 = (WordKey) a;
            WordKey key2 = (WordKey) b;
            int r = key1.getWord().compareTo(key2.getWord());
            // Logged on every comparison, for observation only.
            log.info("==============key1.getWord().compareTo(key2.getWord()):" + r);
            return r;
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "IndexInvertedJob");
        job.setJarByClass(getClass());
        Configuration conf = job.getConfiguration();

        Path in = new Path("myinvertedindex/");
        Path out = new Path("myinvertedindex/output");
        // Remove output left over from a previous run.
        FileSystem.get(conf).delete(out, true);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(IndexInvertedMapper.class);
        job.setMapOutputKeyClass(WordKey.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setReducerClass(IndexInvertedReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setGroupingComparatorClass(IndexInvertedGroupingComparator.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        int r = 1; // stays non-zero if run() throws
        try {
            r = ToolRunner.run(new Configuration(), new MyIndexInvertedJob(), args);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(r);
    }
}
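
The output formatting in the reduce function can be isolated and tried out without Hadoop. The sketch below is one possible alternative that uses `java.util.StringJoiner` to avoid trimming a trailing comma by hand. The class `PostingsFormatSketch` and the method `format` are hypothetical names, not part of the original job.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class PostingsFormatSketch {

    /** Renders file -> count pairs as {(file,count),(file,count)}. */
    static String format(Map<String, Integer> counts) {
        // Separator ",", prefix "{", suffix "}": no trailing comma to remove.
        StringJoiner joiner = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            joiner.add("(" + e.getKey() + "," + e.getValue() + ")");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Accumulate counts the same way the reducer does: one increment per value.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.merge("1.txt", 1, Integer::sum);
        counts.merge("2.txt", 1, Integer::sum);
        counts.merge("2.txt", 1, Integer::sum);
        System.out.println(format(counts));
    }
}
```

An empty map renders as `{}` here, although a reduce call always receives at least one value, so that case does not arise in the job itself.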

4. Checking the results

hadoop dfs -cat myinvertedindex/output/part-r-00000

MapReduce {(1.txt,1)}
is {(1.txt,1)}
simple {(1.txt,1)}
MapReduce {(2.txt,1)}
is {(2.txt,2)}
powerful {(2.txt,1)}
simple {(2.txt,1)}
Hello {(3.txt,1)}
MapReduce {(3.txt,2)}
bye {(3.txt,1)}

The output reveals a problem: entries for the same word were not merged. What could be the cause?

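5. Filling in the pit

On what basis does the GroupingComparator actually work? Does configuring it to group by word guarantee that every key with the same word ends up in the same reduce call?
No. The GroupingComparator only compares adjacent keys in the sorted key stream to decide whether they are equal. For the result to be correct, the sort order of the keys must therefore be consistent with the ordering used by the GroupingComparator.
The problem is that WordKey sorts by file first and then by word, so adjacent keys share a file rather than a word.
So WordKey's compareTo method needs to change. The fixed code follows: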

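package indexinverted;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Composite map output key: the file a word came from, plus the word itself.
public class WordKey implements WritableComparable<WordKey> {
    private String fileName;
    private String word;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeUTF(word);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.fileName = in.readUTF();
        this.word = in.readUTF();
    }

    // The fix: sort by word first, then by file name, so that keys with the
    // same word are adjacent when the GroupingComparator compares them.
    @Override
    public int compareTo(WordKey key) {
        int r = word.compareTo(key.word);
        if (r == 0) {
            r = fileName.compareTo(key.fileName);
        }
        return r;
    }

    // Not needed with a single reducer. The default HashPartitioner uses
    // hashCode(), so hashing on the word alone keeps every key for a word
    // in one partition if more than one reducer is configured.
    @Override
    public int hashCode() {
        return word.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WordKey)) {
            return false;
        }
        WordKey other = (WordKey) o;
        return word.equals(other.word) && fileName.equals(other.fileName);
    }

    public String getFileName() {
        return fileName;
    }

    public void setFileName(String fileName) {
        this.fileName = fileName;
    }

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }
}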

Result after the fix

Hello {(3.txt,1)}
MapReduce {(1.txt,1),(2.txt,1),(3.txt,2)}
bye {(3.txt,1)}
is {(1.txt,1),(2.txt,2)}
powerful {(2.txt,1)}
simple {(1.txt,1),(2.txt,1)}
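
The interaction between sort order and grouping can be reproduced without a cluster. The sketch below is a simplified simulation, not Hadoop's actual merge code: it sorts a list of keys with a given comparator and then starts a new group whenever the word differs from the previous key's word, which is the adjacent-key comparison described in this article. `GroupingSketch`, `Key`, `groupWords`, and the two comparator constants are hypothetical names, and the sample keys are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GroupingSketch {

    /** Simplified stand-in for WordKey: just the two fields. */
    static final class Key {
        final String fileName;
        final String word;

        Key(String fileName, String word) {
            this.fileName = fileName;
            this.word = word;
        }
    }

    /** The original (buggy) order: file name first, then word. */
    static final Comparator<Key> FILE_THEN_WORD = (a, b) -> {
        int r = a.fileName.compareTo(b.fileName);
        return r != 0 ? r : a.word.compareTo(b.word);
    };

    /** The fixed order: word first, then file name. */
    static final Comparator<Key> WORD_THEN_FILE = (a, b) -> {
        int r = a.word.compareTo(b.word);
        return r != 0 ? r : a.fileName.compareTo(b.fileName);
    };

    /**
     * Sorts the keys, then opens a new group whenever the word differs from
     * the previous key's word. Returns the word of each group, in order.
     */
    static List<String> groupWords(List<Key> keys, Comparator<Key> sortOrder) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(sortOrder);
        List<String> groups = new ArrayList<>();
        Key previous = null;
        for (Key k : sorted) {
            if (previous == null || !previous.word.equals(k.word)) {
                groups.add(k.word); // one simulated reduce call starts here
            }
            previous = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = new ArrayList<>();
        keys.add(new Key("1.txt", "MapReduce"));
        keys.add(new Key("1.txt", "is"));
        keys.add(new Key("2.txt", "MapReduce"));
        keys.add(new Key("2.txt", "is"));
        keys.add(new Key("3.txt", "MapReduce"));
        System.out.println("file, then word: " + groupWords(keys, FILE_THEN_WORD));
        System.out.println("word, then file: " + groupWords(keys, WORD_THEN_FILE));
    }
}
```

With the file-first order each word opens a new group once per file, which matches the fragmented output in section 4. With the word-first order each word opens exactly one group.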

From "ITPUB Blog", link: http://blog.itpub.net/30066956/viewspace-2120238/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
