BLOG - Personal Blog System Development Summary, Part 2: Implementing Blog Post Search with Lucene

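Posted by CAFE_BABE on 2018-03-20

Original URL: https://juejin.im/post/5ab0e5d3518825557208371f

Since the previous post, the site has improved in many places. The most visible change is a polished UI, and several features were added or rounded out. This post summarizes the blog post search module of the blog system.

GitHub: DuanJiaNing/BlogSystem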

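Overview

The site classifies each post by category and by label, and uses the Lucene full-text search framework. Post search breaks down into the following modes:

Search by keyword
Search by category
Search by label
Advanced search (restricting category, label, and keyword at the same time)

All of these modes sort their results by the default sort rule or by one the caller specifies.

All four modes are used on the blogger's home page. In the original screenshot, the red, green, and blue outlines mark the entry points for the first three modes, respectively.

The entry point for advanced search is the purple circle, which opens a pop-up dialog.

The following sections start from the front-end JavaScript and trace the steps a search request goes through before it finally gets its data. Before that, here is the corresponding back-end search API: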

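API: list a given blogger's posts

Endpoint: /blog
Response format: json
Request method: GET
Request parameters:

Name	Type	Required	Description	Default
bloggerId	int	yes	Blogger id
cids	string	no	Ids of the blogger's post categories; multiple ids are separated by an ASCII comma ","	no restriction
lids	string	no	Ids of the blogger's labels; multiple ids are separated by an ASCII comma ","	no restriction
kword	string	no	Keyword	no restriction
offset	int	no	Start index in the result set	0
rows	int	no	Number of rows in the result set	10
sort	string	no	Field the result set is sorted by; see "Blog sort criteria"	view_count
order	string	no	Sort direction of the result set: "desc" for descending, "asc" for ascending	desc

Request example:

Descending order, category id restricted to 1 or 2, label restricted to 2, blogger id 1:
http://localhost:8080/blog/1/?order=desc&cids=1,2&lids=2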

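JavaScript Ajax request

The code below is not identical to the source; some distracting code has been removed to make it easier to read.

/**
 * Reload the post list
 * @param offset result set offset
 * @param rows number of rows
 */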
function filterBloggerBlog(offset, rows) {
    $.get(
        '/blog',
        {
            bloggerId: pageOwnerBloggerId,
            offset: offset,
            rows: rows,
            cids: filterData.cids,
            lids: filterData.lids,
            kword: filterData.kword,
            sort: filterData.sort,
            order: filterData.order
        },
        function (result) {
            // ...
        }, 'json'
    );
}

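As the code above shows, when sending a search request you only need to adjust filterData to get the desired data from the back end, and changing offset and rows gives you pagination.

For example, to implement search by category, set filterData.cids (the ids of the target categories, separated by an ASCII comma ",") and set the other members of filterData to null. After the request is sent, the back end returns the data matching those restrictions.

Likewise, advanced search only needs to copy the user's input (keyword, category ids, label ids, and sort rule) into filterData and send the request.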

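Implementing search in the Java back end

The back-end API method first has to perform some validation: account validation, sort-rule validation, and category and label validation. Once all incoming data is valid, it calls the service to fetch the data.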
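The parameter handling that this validation implies could look roughly like the sketch below. It is not taken from the BlogSystem source: the class `BlogQueryParams` and both method names are hypothetical, and the only facts borrowed from the API description are the comma-separated id format and the `asc` / `desc` values with `desc` as the default.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the BlogSystem source. */
final class BlogQueryParams {

    private BlogQueryParams() {
    }

    /**
     * Parses a comma-separated id list such as "1,2" into an int array.
     * Returns null when the parameter is absent or blank, which the API
     * treats as "no restriction".
     */
    static int[] parseIds(String csv) {
        if (csv == null || csv.trim().isEmpty()) {
            return null;
        }
        String[] parts = csv.split(",");
        if (parts.length == 0) {
            // Input such as "," contains separators but no ids at all.
            throw new IllegalArgumentException("no ids in: " + csv);
        }
        int[] ids = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            try {
                ids[i] = Integer.parseInt(parts[i].trim());
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("invalid id: " + parts[i], e);
            }
        }
        return ids;
    }

    /** Returns "asc" or "desc", falling back to the documented default "desc". */
    static String normalizeOrder(String order) {
        if (order == null || order.trim().isEmpty()) {
            return "desc";
        }
        String lower = order.trim().toLowerCase(Locale.ROOT);
        if (!lower.equals("asc") && !lower.equals("desc")) {
            throw new IllegalArgumentException("invalid order: " + order);
        }
        return lower;
    }
}
```

A controller could call `BlogQueryParams.parseIds(cids)` and `BlogQueryParams.parseIds(lids)` before invoking the service, and turn an `IllegalArgumentException` into an error response.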

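Search service

The search service is BlogRetrievalService. It extends BlogFilterAbstract, an abstract class that provides the common implementation of the BlogFilter interface for post search. BlogFilter defines the following methods: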

public interface BlogFilter<T> {

    /**
     * Fully restricted search (including keyword)
     *
     * @param categoryIds categories of the blogger to restrict to; pass null for no restriction
     * @param labelIds    labels of the blogger to restrict to; pass null for no restriction
     * @param keyWord     keyword; pass null for no restriction
     * @param bloggerId   blogger id
     * @param offset      start position in the result set
     * @param rows        number of rows
     * @param sortRule    sort rule; null means no constraint
     * @param status      post status
     * @return query result
     */
    T listFilterAll(int[] categoryIds, int[] labelIds, String keyWord, int bloggerId, int offset, int rows,
                    BlogSortRule sortRule, BlogStatusEnum status);

    /**
     * Label and category search (no keyword)
     *
     * @param categoryIds categories of the blogger to restrict to
     * @param labelIds    labels of the blogger to restrict to
     * @param bloggerId   blogger id
     * @param offset      start position in the result set
     * @param rows        number of rows
     * @param sortRule    sort rule; null means no constraint
     * @param status      post status
     * @return query result
     */
    T listFilterByLabelAndCategory(int[] categoryIds, int[] labelIds, int bloggerId, int offset, int rows,
                                   BlogSortRule sortRule, BlogStatusEnum status);

    /**
     * Get the total number of results of the last search
     *
     * @return count
     */
    int getFilterCount();

}


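The two key methods differ mainly in whether the search conditions include a keyword. With a keyword, Lucene is required; without one, the search flow is simpler.

The three methods defined by BlogFilter are implemented in BlogFilterAbstract, which in turn declares an abstract method, constructResult, for building the final result.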


    /**
     * Build the result set. statistics has already been filtered and sorted, so its order can be used to produce the final result
     *
     * @param blogHashMap          map from post id to post
     * @param statistics           sorted list of post statistics
     * @param blogIdMapCategoryIds map from post id to the array of category ids the post belongs to
     * @return final result
     */
    protected abstract T constructResult(Map<Integer, Blog> blogHashMap, List<BlogStatistics> statistics, Map<Integer, int[]> blogIdMapCategoryIds);


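The return value of this method is handed to the Controller, which then passes it to the front end as JSON via Spring MVC.

The first parameter, blogHashMap, holds all posts that match the conditions. The second parameter, statistics, holds the statistics of those posts (view count, comment count, like count, collection count, and so on), already sorted.

In the post search module, only the construction of the result set differs between subclasses. Filtering by condition and sorting are common to all of them, and that common part is implemented by BlogFilterAbstract.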
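This division of labor is the template method pattern. The sketch below shows the shape of it with deliberately simplified, hypothetical types: `Post` and `Stats` stand in for Blog and BlogStatistics, `FilterSketch` stands in for BlogFilterAbstract, and the category map parameter is left out. It always sorts by view count in descending order, which is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for Blog (hypothetical fields). */
final class Post {
    final int id;
    final String title;

    Post(int id, String title) {
        this.id = id;
        this.title = title;
    }
}

/** Simplified stand-in for BlogStatistics (hypothetical fields). */
final class Stats {
    final int blogId;
    final int viewCount;

    Stats(int blogId, int viewCount) {
        this.blogId = blogId;
        this.viewCount = viewCount;
    }
}

/** The shared steps live in the base class; only result construction is abstract. */
abstract class FilterSketch<T> {

    /** Indexes the posts by id, sorts the statistics, then delegates to the subclass. */
    T sortAndConstruct(List<Post> posts, List<Stats> stats) {
        Map<Integer, Post> byId = new HashMap<>();
        for (Post p : posts) {
            byId.put(p.id, p);
        }
        List<Stats> sorted = new ArrayList<>(stats);
        sorted.sort(Comparator.comparingInt((Stats s) -> s.viewCount).reversed());
        return constructResult(byId, sorted);
    }

    /** Subclasses decide what the final result looks like. */
    protected abstract T constructResult(Map<Integer, Post> byId, List<Stats> sorted);
}

/** One possible subclass: returns post titles in sorted order. */
final class TitleListFilter extends FilterSketch<List<String>> {

    @Override
    protected List<String> constructResult(Map<Integer, Post> byId, List<Stats> sorted) {
        List<String> titles = new ArrayList<>();
        for (Stats s : sorted) {
            Post p = byId.get(s.blogId);
            if (p != null) {
                titles.add(p.title);
            }
        }
        return titles;
    }
}
```

Another subclass could return full DTOs or only ids from the same sorted input without touching the filtering or sorting code.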

Search flow

Search with a keyword:

    @Override
    public T listFilterAll(int[] categoryIds, int[] labelIds, String keyWord,
                           int bloggerId, int offset, int rows, BlogSortRule sortRule,
                           BlogStatusEnum status) {

        if (StringUtils.isEmpty(keyWord)) {
            // Label and category search
            return listFilterByLabelAndCategory(categoryIds, labelIds, bloggerId, offset, rows, sortRule, status);
        } else {
            // With a keyword, the search has to go through Lucene
            return filterByLucene(keyWord, categoryIds, labelIds, bloggerId, offset, rows, sortRule, status);
        }

    }

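listFilterAll makes the final check on whether the request really carries a keyword. If it does, it calls filterByLucene to run the search; otherwise it falls back to listFilterByLabelAndCategory.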

    /**
     * When the keyword is not null, full-text search goes through Lucene
     *
     * @param keyWord     keyword
     * @param categoryIds category ids
     * @param labelIds    label ids
     * @param bloggerId   blogger id
     * @param offset      result offset
     * @param rows        number of result rows
     * @param sortRule    sort rule
     * @param status      post status
     * @return the filtered and sorted result set
     */
    protected T filterByLucene(String keyWord, int[] categoryIds, int[] labelIds,
                               int bloggerId, int offset, int rows, BlogSortRule sortRule,
                               BlogStatusEnum status) {

        // ------------------------ keyword filtering
        int[] ids;
        try {
            // Search hits cannot be paged the way SQL LIMIT does, so all hits are fetched at once here; paging via a cache is being considered for later
            ids = luceneIndexManager.search(keyWord, 10000);
        } catch (IOException | ParseException e) {
            e.printStackTrace();
            throw new LuceneException(e);
        }
        // The keyword is the primary condition
        if (CollectionUtils.isEmpty(ids)) return null;

        // Post ids found by the keyword search
        List<Integer> filterByLuceneIds = new ArrayList<>();
        // UPDATE: take the first `rows` results
        int row = Math.min(rows, ids.length);
        for (int i = 0; i < row; i++) filterByLuceneIds.add(ids[i]);

        // ---------------------- category and label filtering
        Map<Integer, int[]> map = getMapFilterByLabelAndCategory(bloggerId, categoryIds, labelIds, status);
        Integer[] mids = map.keySet().toArray(new Integer[map.size()]);
        // Post ids found by the category and label search
        List<Integer> filterByOtherIds = Arrays.asList(mids);

        // The intersection of the two is the final result set
        List<Integer> resultIds = filterByLuceneIds.stream().filter(filterByOtherIds::contains).collect(Collectors.toList());
        if (CollectionUtils.isEmpty(resultIds)) return null;

        // Build the result: sort and reassemble
        count.set(resultIds.size());
        List<Blog> blogs = blogDao.listBlogByBlogIds(resultIds, status.getCode(), offset, rows);
        return sortAndConstructResult(blogs, sortRule, map);
    }


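Search flow:

1. Use the keyword to find the ids of matching posts in the Lucene index (the index is built the first time a blogger creates a post and is updated whenever a post is added, modified, or deleted). Full-text search matches against the post title, content, summary, and keywords.
2. Take the target number of rows (this part of the code still needs work).
3. Fetch the posts for that id set and filter them by category and label.
4. Query the post statistics and sort.
5. Call constructResult to build the result and return it.

A search without a keyword simply skips steps 1 and 2 and starts from step 3: it filters all of the given blogger's posts by category and label, sorts them, and builds and returns the result.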
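The row-selection step is flagged as unfinished. One way to finish it, offered here as an assumption rather than as the project's actual fix, is to intersect the complete Lucene hit list with the category and label result first and only then cut out the requested page, so that the total count and the page both refer to the same filtered list. `PagingSketch` and its two methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the BlogSystem source. */
final class PagingSketch {

    private PagingSketch() {
    }

    /**
     * Keeps the ids from luceneIds that are also in allowedIds.
     * The order of luceneIds is preserved.
     */
    static List<Integer> intersect(int[] luceneIds, Set<Integer> allowedIds) {
        List<Integer> result = new ArrayList<>();
        for (int id : luceneIds) {
            if (allowedIds.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    /**
     * Returns the ids in [offset, offset + rows), or an empty list when
     * offset lies past the end of the list.
     */
    static List<Integer> page(List<Integer> ids, int offset, int rows) {
        if (offset < 0 || rows < 0) {
            throw new IllegalArgumentException("offset and rows must be non-negative");
        }
        if (offset >= ids.size()) {
            return new ArrayList<>();
        }
        // Compute the end index in long arithmetic so offset + rows cannot overflow.
        int end = (int) Math.min((long) ids.size(), (long) offset + rows);
        return new ArrayList<>(ids.subList(offset, end));
    }
}
```

With this ordering, the size of the intersected list is the total hit count to report, and `page(...)` yields the ids whose posts need to be loaded. Passing a `Set` for the allowed ids also makes each membership check constant time instead of a scan of a list.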

Sorting the results

    // Sort the filtered posts and reassemble the result set
    private T sortAndConstructResult(List<Blog> blogs, BlogSortRule sortRule, Map<Integer, int[]> map) {

        // Used for sorting
        List<BlogStatistics> temp = new ArrayList<>();

        // Makes reassembly after sorting easier
        Map<Integer, Blog> blogHashMap = new HashMap<>();
        for (Blog blog : blogs) {
            int blogId = blog.getId();
            BlogStatistics statistics = statisticsDao.getStatistics(blogId);
            temp.add(statistics);
            blogHashMap.put(blogId, blog);
        }

        BlogListItemComparatorFactory factory = new BlogListItemComparatorFactory();
        temp.sort(factory.get(sortRule.getRule(), sortRule.getOrder()));

        return constructResult(blogHashMap, temp, map);

    }

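Sorting the query results relies on the post statistics and is implemented with java.util.Comparator. The comparator sorts the list of post statistics, and the result is then reassembled by walking that sorted list.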
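A comparator factory of this kind might be sketched as follows. This is not the project's BlogListItemComparatorFactory: `PostStats` and `StatsComparators` are hypothetical, and of the rule names only `view_count` appears in the API description above; `comment_count` is an assumed name used to show a second branch.

```java
import java.util.Comparator;

/** Hypothetical, trimmed-down statistics record. */
final class PostStats {
    final int blogId;
    final int viewCount;
    final int commentCount;

    PostStats(int blogId, int viewCount, int commentCount) {
        this.blogId = blogId;
        this.viewCount = viewCount;
        this.commentCount = commentCount;
    }
}

/** Hypothetical factory that maps a rule name and an order to a Comparator. */
final class StatsComparators {

    private StatsComparators() {
    }

    static Comparator<PostStats> get(String rule, String order) {
        Comparator<PostStats> ascending;
        switch (rule) {
            case "view_count":
                ascending = Comparator.comparingInt(s -> s.viewCount);
                break;
            case "comment_count": // assumed rule name
                ascending = Comparator.comparingInt(s -> s.commentCount);
                break;
            default:
                throw new IllegalArgumentException("unknown sort rule: " + rule);
        }
        // Anything other than "asc" is treated as descending, matching the API default.
        return "asc".equals(order) ? ascending : ascending.reversed();
    }
}
```

Calling `list.sort(StatsComparators.get("view_count", "desc"))` then orders a `List<PostStats>` from most viewed to least viewed.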
