Indexing parent-child relationships with Lucene join

Posted by zhanlijun on 2015-04-11

http://www.cnblogs.com/LBSer/p/4417074.html

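1 Background

Presenting services (such as group-buying deals or direct-connect bookings) per merchant (POI) is becoming increasingly common (Figure 1a). For example, categories such as food and hotels now show a POI list page on mobile instead of a deal list.

Figure 1. a: information displayed per merchant; b: join illustration

This complicates filtering. Filtering used to be flat: filtering a POI list used only POI attributes (rating, category, and so on), and filtering a deal list used only deal attributes (room availability, price, and so on). Filtering is now hierarchical: we need to filter POIs by the attributes of their deals. For example, we may need a list of hotels that each have at least one deal priced between 100 and 200.

This kind of filtering is essentially a join: its core is associating POIs with deals. From a database perspective (Figure 1b), we have a poi table and a deal table, and the deal table stores a foreign key (parentid) that identifies the POI each deal belongs to. The filter above takes three steps: 1) select the deals priced between 100 and 200 (giving the deals with dealid 2 and 3); 2) look up the POI of each deal (giving poiid 1 and 1); 3) deduplicate, because several deals may map to the same POI and each POI must be returned only once.

We currently use Lucene to serve filtering, so how does Lucene handle a filter that involves a join?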
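The three steps can be sketched in plain Java, with no Lucene or database involved. The `Deal` class, the method names and the prices are made up for illustration; only the parent relationships (deal1 under poi 2, deal2 and deal3 under poi 1) follow the example, and the price bounds are assumed to be inclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DealJoinSketch {

    /** Hypothetical deal row: its own id, the owning POI id (foreign key), and a price. */
    static final class Deal {
        final int dealId;
        final int parentId;
        final double price;

        Deal(int dealId, int parentId, double price) {
            this.dealId = dealId;
            this.parentId = parentId;
            this.price = price;
        }
    }

    /** Made-up rows: deal1 belongs to poi 2, deal2 and deal3 belong to poi 1. */
    static List<Deal> sampleDeals() {
        return Arrays.asList(
                new Deal(1, 2, 50.0),
                new Deal(2, 1, 150.0),
                new Deal(3, 1, 180.0));
    }

    /** Returns the distinct POI ids owning at least one deal priced in [min, max]. */
    static List<Integer> poiIdsWithDealInRange(List<Deal> deals, double min, double max) {
        // Step 3: a set drops duplicate POI ids; LinkedHashSet keeps first-seen order.
        Set<Integer> poiIds = new LinkedHashSet<>();
        for (Deal deal : deals) {
            if (deal.price >= min && deal.price <= max) { // step 1: filter deals by price
                poiIds.add(deal.parentId);                // step 2: follow the foreign key
            }
        }
        return new ArrayList<>(poiIds);
    }

    public static void main(String[] args) {
        System.out.println(poiIdsWithDealInRange(sampleDeals(), 100, 200)); // prints [1]
    }
}
```

Both matching deals point at poi 1, so the deduplicated result contains that POI once.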

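2 Lucene join solutions

In our application, each POI is stored as a document and each deal is stored as a document as well. The core of the join is associating the POI documents with the deal documents. Lucene offers two kinds of join, query time join and index time join, which are described in turn below.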

2.1. query time join

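query time join links deal documents to POI documents in a way similar to a database foreign key.

a) Indexing

Create the POI documents and the deal documents separately. When creating a deal document, use a field to link the deal to its POI. In this example that field is parentid, which stores the poiid of the deal's POI and can be thought of as a foreign key.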

public static Document createPoiDocument(PoiMsg poiMsg) {
   Document document = new Document();
   document.add(new StringField("poiid", String.valueOf(poiMsg.getId()), Field.Store.YES));
   document.add(new StringField("name", poiMsg.getName(), Field.Store.YES));
   return document;
}

public static Document createDealDocument(DealModel dealModel, PoiMsg poiMsg) {
   Document document = new Document();
   document.add(new StringField("did", String.valueOf(dealModel.getDid()), Field.Store.YES));
   document.add(new StringField("name", dealModel.getBrandName(), Field.Store.YES));
   document.add(new DoubleField("price", dealModel.getPrice(), Field.Store.YES));
   document.add(new StringField("parentid", String.valueOf(poiMsg.getId()), Field.Store.YES));
   return document;
}

IndexWriter writer = new IndexWriter(directory, config);
writer.addDocument(createPoiDocument(poiMsg1)); 
writer.addDocument(createPoiDocument(poiMsg2));
writer.addDocument(createDealDocument(dealModel1, poiMsg2));
writer.addDocument(createDealDocument(dealModel2, poiMsg1));
writer.addDocument(createDealDocument(dealModel3, poiMsg1));

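b) Querying

Two queries are needed: first query the deal documents, then use the parentid values of the matching deals to query the POI documents.

1) The first query runs inside JoinUtil.createJoinQuery. It creates a TermsCollector, which collects the parentid field of every document matching fromQuery, and then builds a TermsQuery from the collected terms.

In this example the collector sees the parentid values "1" and "1" (from deal2 and deal3), which amount to the single distinct term "1".

2) The TermsQuery is then executed. It looks for documents whose toField value is among the terms collected by the TermsCollector, and so finds the document whose toField is "1".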

IndexSearcher indexSearcher = new IndexSearcher(indexReader);
String fromFields = "parentid";
// price is indexed as a DoubleField, so the range query must be a double range as well
Query fromQuery = NumericRangeQuery.newDoubleRange("price", 100.0, 200.0, false, false);
String toFields = "poiid";
Query toQuery = JoinUtil.createJoinQuery(fromFields, false, toFields, fromQuery, indexSearcher, ScoreMode.Max);
TopDocs results = indexSearcher.search(toQuery, 10);

Core of JoinUtil.createJoinQuery:
 TermsCollector termsCollector = TermsCollector.create(fromField, multipleValuesPerDocument);
 fromSearcher.search(fromQuery, termsCollector);
 return new TermsQuery(toField, fromQuery, termsCollector.getCollectorTerms());
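To make the two passes concrete, here is a plain-Java model of the idea. It is not Lucene's implementation: the maps stand in for the stored parentid values and for the postings of the poiid field, the method names are invented, and the doc ids are assumed to be zero-based in the order the documents were added above (poi1=0, poi2=1, deal1=2, deal2=3, deal3=4).

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class QueryTimeJoinSketch {

    /** Pass 1: collect the distinct "from" field values of the matching child docs. */
    static Set<String> collectTerms(Map<Integer, String> fromFieldByDoc,
                                    Collection<Integer> matchingDocs) {
        Set<String> terms = new TreeSet<>();
        for (Integer docId : matchingDocs) {
            String value = fromFieldByDoc.get(docId);
            if (value != null) {
                terms.add(value); // duplicates such as "1", "1" collapse to one term
            }
        }
        return terms;
    }

    /** Pass 2: union the postings of every collected term in the "to" field. */
    static SortedSet<Integer> docsMatchingTerms(Map<String, List<Integer>> toFieldPostings,
                                                Set<String> terms) {
        SortedSet<Integer> hits = new TreeSet<>();
        for (String term : terms) {
            hits.addAll(toFieldPostings.getOrDefault(term, Collections.<Integer>emptyList()));
        }
        return hits;
    }

    /** Runs both passes on the article's five documents (assumed doc ids, see lead-in). */
    static SortedSet<Integer> demo() {
        Map<Integer, String> parentIdByDoc = new HashMap<>();
        parentIdByDoc.put(2, "2"); // deal1 -> poi 2
        parentIdByDoc.put(3, "1"); // deal2 -> poi 1
        parentIdByDoc.put(4, "1"); // deal3 -> poi 1

        Map<String, List<Integer>> poiIdPostings = new HashMap<>();
        poiIdPostings.put("1", Arrays.asList(0)); // poi1 is doc 0
        poiIdPostings.put("2", Arrays.asList(1)); // poi2 is doc 1

        List<Integer> dealsInPriceRange = Arrays.asList(3, 4); // deal2 and deal3
        Set<String> terms = collectTerms(parentIdByDoc, dealsInPriceRange);
        return docsMatchingTerms(poiIdPostings, terms);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [0]
    }
}
```

The second pass only needs the distinct terms, which is why deduplication falls out of the term collection rather than requiring a separate step.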

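c) Pros and cons

The advantage of query time join is that it is intuitive and flexible. Its drawbacks are that results cannot be scored and sorted, and that running two queries reduces performance.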

2.2. index time join

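query time join establishes the relationship by explicitly adding a "foreign key" to each deal document: after finding the matching deals, it collects their parentid values and then runs a second query for the POI documents whose poiid is in that set. If the POI document could be located as soon as a deal is found, the join would be far more efficient. That is exactly what index time join does: it uses a clever scheme to map deal document ids to POI document ids.

a) Principle

How do we get from a deal document id to a POI document id?

In Lucene, doc ids are assigned incrementally: each document written increases the doc id by 1 (a simplification that is good enough here). index time join requires documents to be written in a fixed order: child documents first, then their parent document. For example, suppose we have two POIs, poi1 and poi2, where poi1 owns deal2 and deal3 and poi2 owns only deal1. We then write deal2 and deal3 first, followed by the poi1 document they belong to, and so on, ending up with the structure shown in Figure 2.

Once the index is built this way, we have the set of parent document ids (3, 5). When a query on deal attributes yields a deal document id, say deal3 with document id 2, we only need to find the first id in the parent set that is greater than 2, which here is 3.

Figure 2

Lucene implements its own BitSet to store these ids; Lucene's internal implementation is shown in Figure 3.

Figure 3. Implementation principle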
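The lookup described above can be sketched with `java.util.BitSet` standing in for Lucene's own bit set. The class and method names are invented for this example, and the doc ids follow the numbering used in Figure 2 (deal2=1, deal3=2, poi1=3, deal1=4, poi2=5) rather than anything read from a real index.

```java
import java.util.BitSet;

public class ParentLookupSketch {

    /** Bit set with one bit per parent (POI) document, as in Figure 2. */
    static BitSet sampleParents() {
        BitSet parents = new BitSet();
        parents.set(3); // poi1, written right after deal2 and deal3
        parents.set(5); // poi2, written right after deal1
        return parents;
    }

    /**
     * Returns the doc id of the parent that closes the block containing
     * childDocId, i.e. the first parent id greater than the child id,
     * or -1 if no parent follows.
     */
    static int parentOf(BitSet parentBits, int childDocId) {
        return parentBits.nextSetBit(childDocId + 1);
    }

    public static void main(String[] args) {
        BitSet parents = sampleParents();
        System.out.println(parentOf(parents, 2)); // deal3 -> prints 3 (poi1)
        System.out.println(parentOf(parents, 4)); // deal1 -> prints 5 (poi2)
    }
}
```

Because children are always written immediately before their parent, a single forward scan over the bit set is enough and no foreign-key field is needed.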

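b) Indexing

As the principle above shows, we need to build an index with a hierarchical layout.

First build a list of documents in which the last element must be the POI and all preceding elements are its deals. Then call writer.addDocuments(documents) to write the list as one block.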

public static Document createPoiDocument(PoiMsg poiMsg) {
   Document document = new Document();
   document.add(new StringField("poiid", String.valueOf(poiMsg.getId()), Field.Store.YES));
   document.add(new StringField("name", poiMsg.getName(), Field.Store.YES));
   document.add(new StringField("doctype", "poi", Field.Store.YES));
   return document;
}

public static Document createDealDocument(DealModel dealModel) {
   Document document = new Document();
   document.add(new StringField("did", String.valueOf(dealModel.getDid()), Field.Store.YES));
   document.add(new StringField("name", dealModel.getBrandName(), Field.Store.YES));
   document.add(new DoubleField("price", dealModel.getPrice(), Field.Store.YES));
   return document;
}

IndexWriter writer = new IndexWriter(directory, config);
List<Document> documents = new ArrayList<Document>();
// Block 1: the deals of poi1 first, then poi1 itself as the last document
documents.add(createDealDocument(dealModel2));
documents.add(createDealDocument(dealModel3));
documents.add(createPoiDocument(poiMsg1));
writer.addDocuments(documents); // addDocuments (plural) writes the list as one block
documents.clear();
// Block 2: the deal of poi2, then poi2
documents.add(createDealDocument(dealModel1));
documents.add(createPoiDocument(poiMsg2));
writer.addDocuments(documents);

c) Querying

// dealQuery (the query on deal fields) and sort (the POI sort order) are built elsewhere
Filter poiFilter = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term(PoiLuceneField.ATTR_DOCTYPE, "poi")))); // matches the POI (parent) documents
ToParentBlockJoinQuery query = new ToParentBlockJoinQuery(dealQuery, poiFilter, ScoreMode.Max);
ToParentBlockJoinCollector collector = new ToParentBlockJoinCollector(
                    sort, // sort
                    (getOffset() + getLimit()),             // numHits: POIs needed to fill the requested page
                    true,           // trackScores
                    false           // trackMaxScore
            );
indexSearcher.search(query, collector); // search() returns void; hits accumulate in the collector
Sort childSort = new Sort(new SortField(DealLuceneField.ATTR_PRICE, SortField.Type.DOUBLE));
TopGroups<Integer> hits = collector.getTopGroups(
                    query,
                    childSort,
                    getOffset(),   // parent doc offset
                    100,  // maxDocsPerGroup
                    0,   // withinGroupOffset
                    true // fillSortFields
            );
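The shape of the grouped result may be easier to picture with a plain-Java model. This is only a sketch of the idea (group the matching deals under their POI, sort each group by price, page over the POIs); it is not how Lucene computes TopGroups. The method name and the sample data are invented, and POIs are ordered by id here purely as an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TopGroupsSketch {

    /**
     * Groups matching deal prices by POI id, sorts each group by ascending
     * price, keeps at most maxDocsPerGroup prices per POI, and returns the
     * POIs at positions [offset, offset + limit) in POI id order.
     * poiIds[i] is the POI that owns the deal priced prices[i].
     */
    static Map<Integer, List<Double>> topGroups(int[] poiIds, double[] prices,
                                                int offset, int limit, int maxDocsPerGroup) {
        // Parent-level grouping; TreeMap orders the POIs by id (assumed sort order).
        Map<Integer, List<Double>> groups = new TreeMap<>();
        for (int i = 0; i < poiIds.length; i++) {
            groups.computeIfAbsent(poiIds[i], k -> new ArrayList<>()).add(prices[i]);
        }

        Map<Integer, List<Double>> page = new LinkedHashMap<>();
        int index = 0;
        for (Map.Entry<Integer, List<Double>> entry : groups.entrySet()) {
            if (index >= offset && page.size() < limit) {
                List<Double> sorted = new ArrayList<>(entry.getValue());
                Collections.sort(sorted); // child-level sort, like childSort on price
                int keep = Math.min(maxDocsPerGroup, sorted.size());
                page.put(entry.getKey(), new ArrayList<>(sorted.subList(0, keep)));
            }
            index++;
        }
        return page;
    }

    public static void main(String[] args) {
        int[] poiIds = {1, 1, 2};
        double[] prices = {180.0, 150.0, 120.0};
        // prints {1=[150.0, 180.0], 2=[120.0]}
        System.out.println(topGroups(poiIds, prices, 0, 10, 100));
    }
}
```

Sorting happens at two levels: the collector's sort orders the POIs, while childSort orders the deals inside each POI.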

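3 In practice

According to the official documentation, index time join is more efficient, running more than 30% faster than query time join. We therefore use index time join in our project, and the service is currently running well.

Other articles in this search practice series:

How Lucene's term dictionary is implemented

A summary of optimizing Lucene index file size

Learning to rank in practice

How does Lucene quickly look up field values, nearest distance and similar information by docId?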
