Search Engines: MapReduce in Practice (Inverted Index)

Posted by Thinkgamer_gyt on 2015-07-28


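1. Introduction to the inverted index

An inverted index, also called a postings file or inverted file, is an indexing method that stores, for full-text search, a mapping from each word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.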

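There are two forms of inverted index:

A record-level inverted index (or inverted file index) contains, for each word, a list of the documents that contain it.
A word-level inverted index (or full inverted index) additionally contains the position of each word within each document.

The latter form supports more functionality (phrase search, for example), but takes more time and space to build.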

Example:

Taking English as an example, here are the texts to be indexed:

T₀ = "it is what it is"
T₁ = "what is it"
T₂ = "it is a banana"

From these we get the following inverted file index:

"a":      {2}
"banana": {2}
"is":     {0, 1, 2}
"it":     {0, 1, 2}
"what":   {0, 1}

A search for the terms "what", "is" and "it" corresponds to the set {0,1}∩{0,1,2}∩{0,1,2}={0,1}.
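The record-level index and the intersection query above can be reproduced with a few lines of plain Java. This is only a minimal in-memory sketch, not part of the MapReduce job; the class and method names (RecordLevelIndexSketch, build, search) are made up for illustration, and tokenization is assumed to be a simple split on whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class, used only to illustrate the example above.
class RecordLevelIndexSketch {

    // Build word -> sorted set of document IDs; document i is docs.get(i).
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String word : docs.get(docId).trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // skip the empty token produced by an empty document
                }
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // AND query: intersect the document sets of all search terms.
    static SortedSet<Integer> search(Map<String, SortedSet<Integer>> index, String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> postings = index.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(postings); // copy, so the index itself is never modified
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        Map<String, SortedSet<Integer>> index = build(docs);
        System.out.println(index);                             // {a=[2], banana=[2], is=[0, 1, 2], it=[0, 1, 2], what=[0, 1]}
        System.out.println(search(index, "what", "is", "it")); // [0, 1]
    }
}
```

Sorted sets are used here only so that the printed output matches the listing above; a hash-based set would work just as well for the intersection.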

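For the same texts, we get the following full inverted index, in which each entry is a pair made up of a document number and the position of the word within that document. Like the document numbers, the word positions start from zero.

So "banana": {(2, 3)} means that "banana" occurs in the third document (T₂), where it is the fourth word (position 3).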

"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2), (2, 0)}
"what":   {(0, 2), (1, 0)}

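If we run a phrase search for "what is it", every word of the phrase is found in both document 0 and document 1, but the words appear consecutively, in that order, only in document 1.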
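The word-level index and the phrase search can be sketched in plain Java in the same way. Again this is a minimal in-memory illustration under simple assumptions (whitespace tokenization, exact word matches); the names PositionalIndexSketch, build and phraseSearch are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class, used only to illustrate the example above.
class PositionalIndexSketch {

    // Build word -> (document ID -> ascending list of word positions), all zero-based.
    static Map<String, Map<Integer, List<Integer>>> build(List<String> docs) {
        Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] words = docs.get(docId).trim().split("\\s+");
            for (int pos = 0; pos < words.length; pos++) {
                if (words[pos].isEmpty()) {
                    continue; // skip the empty token produced by an empty document
                }
                index.computeIfAbsent(words[pos], w -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(pos);
            }
        }
        return index;
    }

    // Return the documents in which the words of the phrase occur consecutively and in order.
    static SortedSet<Integer> phraseSearch(Map<String, Map<Integer, List<Integer>>> index, String phrase) {
        String[] terms = phrase.trim().split("\\s+");
        SortedSet<Integer> hits = new TreeSet<>();
        Map<Integer, List<Integer>> first = index.getOrDefault(terms[0], new TreeMap<>());
        for (Map.Entry<Integer, List<Integer>> entry : first.entrySet()) {
            int docId = entry.getKey();
            for (int start : entry.getValue()) {
                boolean match = true;
                // The i-th term must sit exactly i positions after the first term.
                for (int i = 1; i < terms.length && match; i++) {
                    List<Integer> positions = index.getOrDefault(terms[i], new TreeMap<>())
                                                   .getOrDefault(docId, new ArrayList<>());
                    match = positions.contains(start + i);
                }
                if (match) {
                    hits.add(docId);
                    break; // one occurrence per document is enough
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        Map<String, Map<Integer, List<Integer>>> index = build(docs);
        System.out.println(index.get("is"));                   // {0=[1, 4], 1=[1], 2=[1]}
        System.out.println(phraseSearch(index, "what is it")); // [1]
    }
}
```

In document 0, "what" is at position 2 but "is" is not at position 3, so the phrase fails there; in document 1 the three words sit at positions 0, 1 and 2.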

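2. Analysis and design

(1) Map stage

The input files are first processed with the default TextInputFormat class, which yields the offset of each line in the text together with the line's content. The Map stage must then analyze each input <key, value> pair to obtain the three pieces of information an inverted index needs: the word, the document URI, and the term frequency (see the figure in the original post).

This raises two problems. First, a <key, value> pair can hold only two values, so unless a custom Hadoop data type is used, two of the three values have to be merged into one and used as either the key or the value.

Second, a single Reduce stage cannot both count term frequencies and build the document list, so a Combine stage has to be added to do the term-frequency counting. Note that Hadoop treats a combiner as an optimization that may run zero, one, or several times, and that map output is partitioned by the map output key (here the word plus the URI), so this design relies on the combiner running exactly once and on the job using a single reduce task, as the complete code does.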

public static class Map extends Mapper<Object,Text,Text,Text>{
        private Text keyInfo = new Text();
        private Text valueInfo = new Text();
        private FileSplit split;  // the input split, which carries the path of the file being read

        @Override
        public void map(Object key,Text value,Context context) throws IOException, InterruptedException{
            split = (FileSplit)context.getInputSplit();     // get the split (and so the file path) the current line belongs to
            StringTokenizer itr = new StringTokenizer(value.toString());
            while(itr.hasMoreTokens()){
                keyInfo.set(itr.nextToken()+"+"+split.getPath().toString());   // the key is the word and the URI joined by "+"
                valueInfo.set("1");                                            // the value is set to 1
                context.write(keyInfo,valueInfo);
            }
        }
    }

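(2) Combine stage

The values that share a key are summed, which gives the frequency of a word within a document (see the figure in the original post).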

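public static class Combiner extends Reducer<Text,Text,Text,Text>{
        private Text info = new Text();

        @Override
        public void reduce(Text key,Iterable<Text>values,Context context) throws IOException, InterruptedException{
            // Sum the "1" values emitted by the mapper for this word+URI key
            int sum = 0;
            for(Text value:values){
                sum += Integer.parseInt(value.toString());
            }

            // The three commented-out lines below do the same as the four lines after them
//            int index = key.toString().indexOf("+");
//            info.set(key.toString().substring(index+1)+":"+sum);
//            key.set(key.toString().substring(0,index));

            // Split the incoming key at "+" (this assumes neither the word nor the path contains "+")
            String record = key.toString();
            String[] str = record.split("[+]");
            info.set(str[1]+":"+sum);    // the new value is "URI:frequency"
            key.set(str[0]);             // the new key is the word alone
            context.write(key,info);
        }
    }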

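(3) Reduce stage

After the two stages above, the Reduce stage only has to combine the values that share a key into the format the inverted index file requires; everything else can be left to the MapReduce framework.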

public static class Reduce extends Reducer<Text,Text,Text,Text>{
        private Text result = new Text();

        @Override
        public void reduce(Text key,Iterable<Text>values,Context context) throws IOException, InterruptedException{
            // Concatenate every "URI:frequency" value of this word into one line
            StringBuilder value = new StringBuilder();
            for(Text value1:values){
                value.append(value1.toString()).append(" ; ");
            }
            result.set(value.toString());
            context.write(key,result);
        }
    }
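To make the data flow through the three stages easier to follow, here is a plain-Java simulation that runs without Hadoop. It is only a sketch under stated assumptions: the file names file0.txt to file2.txt are hypothetical stand-ins for the document URIs, a TreeMap stands in for the framework's sort and group step, and the combine step is assumed to run exactly once over all map output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone simulation; it does not use any Hadoop classes.
class PipelineSketch {

    // Map stage: emit ("word+uri", "1") for every token of every file.
    static List<String[]> mapStage(Map<String, String> files) {
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            StringTokenizer itr = new StringTokenizer(file.getValue());
            while (itr.hasMoreTokens()) {
                out.add(new String[] { itr.nextToken() + "+" + file.getKey(), "1" });
            }
        }
        return out;
    }

    // Combine stage: sum the counts per "word+uri" key, then re-key to (word, "uri:count").
    static List<String[]> combineStage(List<String[]> pairs) {
        Map<String, Integer> sums = new TreeMap<>(); // sorted, like keys arriving at a combiner
        for (String[] pair : pairs) {
            sums.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            String[] parts = e.getKey().split("[+]"); // assumes no "+" in the word or file name
            out.add(new String[] { parts[0], parts[1] + ":" + e.getValue() });
        }
        return out;
    }

    // Reduce stage: concatenate all "uri:count" values that belong to the same word.
    static Map<String, String> reduceStage(List<String[]> pairs) {
        Map<String, StringBuilder> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new StringBuilder()).append(pair[1]).append(" ; ");
        }
        Map<String, String> out = new TreeMap<>();
        grouped.forEach((word, sb) -> out.put(word, sb.toString()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("file0.txt", "it is what it is"); // hypothetical file names
        files.put("file1.txt", "what is it");
        files.put("file2.txt", "it is a banana");

        Map<String, String> index = reduceStage(combineStage(mapStage(files)));
        index.forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

For the word "is", for example, the map stage emits "is+file0.txt" twice and "is+file1.txt" and "is+file2.txt" once each; the combine stage turns these into file0.txt:2, file1.txt:1 and file2.txt:1 under the key "is"; and the reduce stage joins them into a single line.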

The complete code is as follows:

package ReverseIndex;

import java.io.*;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReverseIndex {

   public static class Map extends Mapper<Object,Text,Text,Text>{
       private Text keyInfo = new Text();
       private Text valueInfo = new Text();
       private FileSplit split; // the input split, which carries the path of the file being read

       @Override
       public void map(Object key,Text value,Context context) throws IOException, InterruptedException{
           split = (FileSplit)context.getInputSplit();     // get the split (and so the file path) the current line belongs to
           StringTokenizer itr = new StringTokenizer(value.toString());
           while(itr.hasMoreTokens()){
               keyInfo.set(itr.nextToken()+"+"+split.getPath().toString());   // the key is the word and the URI joined by "+"
               valueInfo.set("1");                                            // the value is set to 1
               context.write(keyInfo,valueInfo);
           }
       }
   }

   public static class Combiner extends Reducer<Text,Text,Text,Text>{
       private Text info = new Text();

       @Override
       public void reduce(Text key,Iterable<Text>values,Context context) throws IOException, InterruptedException{
           int sum = 0;
           for(Text value:values){
               sum += Integer.parseInt(value.toString());
           }

           // The three commented-out lines below do the same as the four lines after them,
           // just implemented differently
           // int index = key.toString().indexOf("+");
           // info.set(key.toString().substring(index+1)+":"+sum);
           // key.set(key.toString().substring(0,index));

           // Split the incoming key at "+" (this assumes neither the word nor the path contains "+")
           String record = key.toString();
           String[] str = record.split("[+]");
           info.set(str[1]+":"+sum);
           key.set(str[0]);
           context.write(key,info);
       }
   }

   public static class Reduce extends Reducer<Text,Text,Text,Text>{
       private Text result = new Text();

       @Override
       public void reduce(Text key,Iterable<Text>values,Context context) throws IOException, InterruptedException{
           // Concatenate every "URI:frequency" value of this word into one line
           StringBuilder value = new StringBuilder();
           for(Text value1:values){
               value.append(value1.toString()).append(" ; ");
           }
           result.set(value.toString());
           context.write(key,result);
       }
   }

   public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
       Job job = new Job();
       job.setJarByClass(ReverseIndex.class);

       job.setNumReduceTasks(1); // use a single reduce task; an ordinary small test does not need more

       job.setMapperClass(Map.class);
       job.setMapOutputKeyClass(Text.class);
       job.setMapOutputValueClass(Text.class);

       job.setCombinerClass(Combiner.class);

       job.setReducerClass(Reduce.class);

       job.setOutputKeyClass(Text.class);
       job.setOutputValueClass(Text.class);

       FileInputFormat.addInputPath(job, new Path("/thinkgamer/input"));
       FileOutputFormat.setOutputPath(job, new Path("/thinkgamer/output"));

       System.exit(job.waitForCompletion(true) ? 0 : 1);
   }

}
