MapReduce in Practice: Inverted Index

Published 2014-12-24


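1. Introduction to the Inverted Index

An inverted index, also commonly called a postings file or inverted file, is an indexing method that stores, for full-text search, a mapping from each word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.

There are two forms of inverted index:

A record-level inverted index (or inverted file index) contains, for each word, a list of the documents that reference it.
A word-level inverted index (or full inverted index) additionally contains the position of each word within a document.

The latter form supports more functionality (phrase search, for example), but takes more time and space to build.

Example:

Taking English as an example, here are the texts to be indexed: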

T₀ = "it is what it is"
T₁ = "what is it"
T₂ = "it is a banana"

From these we get the following inverted file index:

 "a":      {2}
 "banana": {2}
 "is":     {0, 1, 2}
 "it":     {0, 1, 2}
 "what":   {0, 1}

A search for the terms "what", "is" and "it" corresponds to the set {0,1}∩{0,1,2}∩{0,1,2}={0,1}.
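The lookup above can be reproduced in a few lines of plain Java. This is only an illustrative sketch, not part of the MapReduce job: the class name RecordLevelIndex and the methods build and searchAll are hypothetical names chosen for this example, and words are assumed to be separated by whitespace.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class for illustration only.
class RecordLevelIndex {
    // Builds word -> sorted set of zero-based document numbers.
    static Map<String, Set<Integer>> build(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Assumption: words are separated by whitespace.
            for (String word : docs.get(docId).split("\\s+")) {
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // Returns the documents that contain every query word (set intersection).
    static Set<Integer> searchAll(Map<String, Set<Integer>> index, String... words) {
        Set<Integer> result = null;
        for (String word : words) {
            Set<Integer> postings = index.getOrDefault(word, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        Map<String, Set<Integer>> index = build(docs);
        System.out.println(index);
        // prints {a=[2], banana=[2], is=[0, 1, 2], it=[0, 1, 2], what=[0, 1]}
        System.out.println(searchAll(index, "what", "is", "it"));
        // prints [0, 1]
    }
}
```

Sorted sets are used here only so that the printed index matches the listing above; any set implementation would give the same intersection.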

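For the same texts, we get the following full inverted index, in which each entry is a pair made up of a document number and the word's position within that document. Both the document numbers and the word positions are zero-based.

So "banana": {(2, 3)} means that "banana" appears in the third document (T₂), where it is the fourth word (position 3).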

"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2), (2, 0)} 
"what":   {(0, 2), (1, 0)}

If we run a phrase search for "what is it", every word of the phrase appears in both document 0 and document 1, but the words occur consecutively only in document 1.
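The phrase search can be sketched the same way on top of the full inverted index. Again this is an assumption-laden illustration rather than the article's MapReduce code: PositionalIndex, build, occursAt and searchPhrase are hypothetical names, and each (document, position) pair is stored as a two-element int array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for illustration only.
class PositionalIndex {
    // Builds word -> list of {docId, position} pairs, both zero-based.
    static Map<String, List<int[]>> build(List<String> docs) {
        Map<String, List<int[]>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] words = docs.get(docId).split("\\s+");
            for (int pos = 0; pos < words.length; pos++) {
                index.computeIfAbsent(words[pos], w -> new ArrayList<>())
                     .add(new int[] {docId, pos});
            }
        }
        return index;
    }

    // True if the index records the word at exactly (docId, pos).
    static boolean occursAt(Map<String, List<int[]>> index, String word, int docId, int pos) {
        for (int[] p : index.getOrDefault(word, Collections.<int[]>emptyList())) {
            if (p[0] == docId && p[1] == pos) {
                return true;
            }
        }
        return false;
    }

    // Returns the documents in which the phrase's words sit at consecutive positions.
    static List<Integer> searchPhrase(Map<String, List<int[]>> index, String phrase) {
        String[] words = phrase.split("\\s+");
        List<Integer> hits = new ArrayList<>();
        for (int[] start : index.getOrDefault(words[0], Collections.<int[]>emptyList())) {
            boolean match = true;
            for (int i = 1; i < words.length && match; i++) {
                match = occursAt(index, words[i], start[0], start[1] + i);
            }
            if (match && !hits.contains(start[0])) {
                hits.add(start[0]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        Map<String, List<int[]>> index = build(docs);
        System.out.println(searchPhrase(index, "what is it")); // prints [1]
    }
}
```

In document 0 the word "what" sits at position 2, but position 3 holds "it" rather than "is", so only document 1 survives the consecutive-position check.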

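2. Analysis and Design

(1) Map phase

The input files are first processed with the default TextInputFormat class, which yields the offset and content of each line of text. The Map phase must then analyze each input <key, value> pair to obtain the three pieces of information the inverted index needs: the word, the document URI, and the term frequency, as shown in the figure.

Two problems arise. First, a <key, value> pair can hold only two values, so without a custom Hadoop data type, two of the three values have to be merged into one and used as either the key or the value.

Second, with the word and URI merged into the key, a single Reduce phase cannot both count term frequencies and generate the document list, so a Combine phase is added to do the term-frequency counting.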

public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
    private Text keyInfo = new Text();   // holds the combined word and URI
    private Text valueInfo = new Text(); // holds the term frequency
    private FileSplit split;             // holds the FileSplit object

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Get the FileSplit that this <key, value> pair belongs to
        split = (FileSplit) context.getInputSplit();
        StringTokenizer itr = new StringTokenizer(value.toString());

        while (itr.hasMoreTokens()) {
            // The key combines the word and the URI, e.g. "MapReduce:1.txt"
            keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
            // The term frequency starts at 1
            valueInfo.set("1");
            context.write(keyInfo, valueInfo);
        }
    }
}
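To see what the mapper emits without a cluster, the following plain-Java sketch mimics the body of map() for a single line. It is an approximation under stated assumptions: MapPhaseSketch is a hypothetical class, the Hadoop types are replaced by strings, and the URI "0.txt" stands in for whatever split.getPath() would return.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-in for the mapper; it does not use the Hadoop API.
class MapPhaseSketch {
    // Returns one {key, value} pair per token: key = "word:uri", value = "1".
    static List<String[]> map(String line, String uri) {
        List<String[]> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new String[] {itr.nextToken() + ":" + uri, "1"});
        }
        return out;
    }

    public static void main(String[] args) {
        // "0.txt" is an assumed file name used only for this trace.
        for (String[] pair : map("it is what it is", "0.txt")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

For the line "it is what it is" this produces five pairs, with "it:0.txt" and "is:0.txt" each appearing twice; those repeated keys are what the next phase adds up.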

(2) Combine phase

Values that share the same key are summed to get the frequency of a word within one document, as shown in the figure.

public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {
    private Text info = new Text();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Count the term frequency
        int sum = 0;
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        int splitIndex = key.toString().indexOf(":");

        // The new value combines the URI and the term frequency
        info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
        // The new key is the word alone
        key.set(key.toString().substring(0, splitIndex));
        context.write(key, info);
    }
}
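The re-keying done by the combiner can be traced in isolation. The sketch below is a plain-Java approximation, not the Hadoop class itself: CombinePhaseSketch and its combine method are hypothetical, and the keys and the host name in the example are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the combiner; it does not use the Hadoop API.
class CombinePhaseSketch {
    // Input:  key = "word:uri", values = the "1" strings emitted by the map phase.
    // Output: {word, "uri:sum"}.
    static String[] combine(String key, List<String> values) {
        int sum = 0;
        for (String value : values) {
            sum += Integer.parseInt(value);
        }
        // Split at the FIRST colon, so a URI such as "hdfs://host:9000/..." stays intact.
        // Assumption: the word itself contains no colon.
        int splitIndex = key.indexOf(":");
        String word = key.substring(0, splitIndex);
        String uriAndCount = key.substring(splitIndex + 1) + ":" + sum;
        return new String[] {word, uriAndCount};
    }

    public static void main(String[] args) {
        String[] out = combine("is:0.txt", Arrays.asList("1", "1"));
        System.out.println(out[0] + "\t" + out[1]); // prints: is<TAB>0.txt:2
    }
}
```

Note that Hadoop treats a combiner as an optional optimization and may apply it zero or more times to a map task's output, so this trace corresponds to the case where it runs exactly once for each "word:uri" key.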

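(3) Reduce phase

After the two phases above, the Reduce phase only needs to combine the values that share the same key into the format required by the inverted index file; the rest can be left to the MapReduce framework.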

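public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    // The method must be named reduce (not reducer) to override Reducer.reduce;
    // otherwise the framework runs the default identity reduce instead.
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Build the document list
        StringBuilder fileList = new StringBuilder();
        for (Text value : values) {
            fileList.append(value.toString()).append(";");
        }
        result.set(fileList.toString());
        context.write(key, result);
    }
}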
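The grouping and concatenation performed at this stage can also be traced in plain Java. The following is a sketch under assumptions rather than the job itself: ReducePhaseSketch and reduceAll are hypothetical names, the input pairs are what the Combine phase would produce for the three sample texts if they were stored as files named 0.txt, 1.txt and 2.txt, and a sorted map stands in for the framework's shuffle.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for shuffle + reduce; it does not use the Hadoop API.
class ReducePhaseSketch {
    // Groups {word, "uri:count"} pairs by word and joins each group with ";".
    static Map<String, String> reduceAll(List<String[]> combined) {
        Map<String, StringBuilder> grouped = new TreeMap<>();
        for (String[] pair : combined) {
            grouped.computeIfAbsent(pair[0], k -> new StringBuilder())
                   .append(pair[1]).append(";");
        }
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, StringBuilder> e : grouped.entrySet()) {
            out.put(e.getKey(), e.getValue().toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed Combine output for the sample texts (file names are made up).
        List<String[]> combined = Arrays.asList(
            new String[] {"it", "0.txt:2"}, new String[] {"is", "0.txt:2"},
            new String[] {"what", "0.txt:1"}, new String[] {"what", "1.txt:1"},
            new String[] {"is", "1.txt:1"}, new String[] {"it", "1.txt:1"},
            new String[] {"it", "2.txt:1"}, new String[] {"is", "2.txt:1"},
            new String[] {"a", "2.txt:1"}, new String[] {"banana", "2.txt:1"});
        for (Map.Entry<String, String> e : reduceAll(combined).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In this sketch the documents within each list appear in input order; in a real job the order in which values reach reduce() is not guaranteed unless a secondary sort is configured.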

The complete code is as follows:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndex {
    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
        private Text keyInfo = new Text();
        private Text valueInfo = new Text();
        private FileSplit split;
        
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            split = (FileSplit)context.getInputSplit();
            StringTokenizer itr = new StringTokenizer(value.toString());
            
            while(itr.hasMoreTokens()) {
                keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
        
    }
    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {
        private Text info = new Text();
        public void reduce(Text key, Iterable<Text>values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for(Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            int splitIndex= key.toString().indexOf(":");
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }
    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        // Named reduce (not reducer) so that it actually overrides Reducer.reduce
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            StringBuilder fileList = new StringBuilder();
            for (Text value : values) {
                fileList.append(value.toString()).append(";");
            }
            result.set(fileList.toString());
            context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: InvertedIndex <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "InvertedIndex");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setCombinerClass(InvertedIndexCombiner.class);
        job.setReducerClass(InvertedIndexReducer.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

References

http://zh.wikipedia.org/wiki/%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95

Liu Peng, 《實戰Hadoop：開啟通向雲端計算的捷徑》
