Lucene, a lightweight search engine, is seriously powerful! Solr and ES are both built on it

Posted by 孙半仙人 on 2024-03-11

Original URL: https://www.cnblogs.com/sun2020/p/18067127

Solr

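I. Basics

1. What Lucene is

Lucene is a local full-text search engine, a library that runs inside your own application. Solr and Elasticsearch are both built on top of Lucene.

Lucene suits lightweight full-text search. My server was short on resources and ES would have eaten too much of them, so I went with Lucene.

2. How an inverted index works

Full-text search relies on an inverted index. So what is an inverted index?

First, a Chinese analyzer extracts every keyword from a document. For example, 我愛中國 ("I love China") is split into 我, 愛, and 中國, and each of these terms points back to the document '我愛中國'.
Next, the mapping from each keyword to the documents that contain it is stored.
Finally, the keywords themselves are sorted and indexed.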
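
The three steps above can be sketched in plain Java. This is a toy model and not how Lucene stores its index: the class name `InvertedIndexSketch` is made up for illustration, and the terms are passed in by hand instead of coming from a real analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index (hypothetical class): term -> ids of the documents containing it. */
public class InvertedIndexSketch {

    // TreeMap keeps the terms sorted, mirroring the "sort the keywords" step.
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private final List<String> documents = new ArrayList<>();

    /** Stores the document, records each of its pre-tokenized terms, and returns the doc id. */
    public int add(String text, List<String> terms) {
        int docId = documents.size();
        documents.add(text);
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Returns the texts of all documents that contain the term, in insertion order. */
    public List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        Set<Integer> docIds = postings.getOrDefault(term, Collections.emptySet());
        for (int docId : docIds) {
            hits.add(documents.get(docId));
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        // Terms are supplied by hand here; a real analyzer would produce them.
        index.add("我愛中國", Arrays.asList("我", "愛", "中國"));
        index.add("我愛Lucene", Arrays.asList("我", "愛", "Lucene"));
        System.out.println(index.search("中國"));
        System.out.println(index.search("愛"));
    }
}
```

Looking up a term is a single map lookup, no matter how many documents are stored, which is why full-text engines index by term rather than scanning every document.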

3. Comparison with a traditional database

DB	Lucene
Table	Index
Row	Document
Column	Field

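4. Data types

Common field types

StringField: a string field indexed as a single token without analysis, suitable for exact matching. (Sorting on it additionally needs a doc-values field such as SortedDocValuesField.)
TextField: an analyzed (tokenized) string field, suitable for full-text search.
IntPoint, LongPoint, FloatPoint, DoublePoint: the numeric field types for integers and floating-point numbers in the Lucene 8.x line used below. The names IntField, LongField, FloatField, and DoubleField belong to other Lucene versions.
Dates: Lucene core has no dedicated date field class (DateField is a Solr type). A date is usually indexed as a long timestamp or as a formatted string, which is what this article does with createTime.
Binary data: stored through StoredField with a byte[] value, for example small images or files. As far as I know, Lucene 8 has no class named BinaryField.
StoredField: a stored-only field for raw data that does not need to be indexed, such as the original document content or other attached information.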

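A Lucene analyzer breaks text into individual terms. Lucene ships with several analyzers; common ones include:

StandardAnalyzer: Lucene's default analyzer. It tokenizes by Unicode text segmentation rules (roughly, at whitespace and punctuation) and lowercases the tokens.
CJKAnalyzer: designed for Chinese, Japanese, and Korean text. It indexes CJK characters as overlapping two-character pairs (bigrams) rather than dictionary words.
KeywordAnalyzer: does no tokenization. It treats the whole input as a single term and is commonly used for exact matching.
SimpleAnalyzer: a very simple analyzer that splits text at non-letter characters and lowercases the tokens.
WhitespaceAnalyzer: splits text at whitespace only and does no other processing; in particular, it does not lowercase.

For Chinese, though, we usually use the third-party IKAnalyzer, which has to be added as a dependency in the POM.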
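
To make the difference between two of the built-in analyzers concrete, here is a plain-Java imitation of what WhitespaceAnalyzer and SimpleAnalyzer do to the same input. It is only a rough sketch that does not call Lucene, and `TokenizeSketch` and its method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java imitation (hypothetical class) of two simple tokenization strategies. */
public class TokenizeSketch {

    /** Like WhitespaceAnalyzer: split at whitespace, keep case and punctuation. */
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Like SimpleAnalyzer: split at every non-letter character, then lowercase. */
    public static List<String> simpleTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Open-Source Lucene 8!";
        System.out.println(whitespaceTokens(text)); // [Open-Source, Lucene, 8!]
        System.out.println(simpleTokens(text));     // [open, source, lucene]
    }
}
```

Neither strategy helps with Chinese, where words are not separated by spaces, which is why a dictionary-based analyzer such as IKAnalyzer is used in the rest of this article.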

II. Best practices

1. Importing dependencies

<lucene.version>8.1.1</lucene.version>
<IKAnalyzer-lucene.version>8.0.0</IKAnalyzer-lucene.version>

<!--============lucene start================-->
<!-- Lucene core library -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>${lucene.version}</version>
</dependency>

<!-- Lucene query parser -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>${lucene.version}</version>
</dependency>

<!-- Lucene's common analyzers -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>${lucene.version}</version>
</dependency>

<!-- Lucene highlighter -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>${lucene.version}</version>
</dependency>

<!-- IK analyzer -->
<dependency>
    <groupId>com.jianggujin</groupId>
    <artifactId>IKAnalyzer-lucene</artifactId>
    <version>${IKAnalyzer-lucene.version}</version>
</dependency>
<!--============lucene end================-->

2. Creating an index

First define the basic index metadata: the index name and the fields.

/**
 * @author: sunhhw
 * @date: 2023/12/25 17:39
 * @description: Defines the index name and document fields for articles
 */
public interface IArticleIndex {

    /**
     * Index name
     */
    String INDEX_NAME = "article";

    // --------------------- Document fields ---------------------
    String COLUMN_ID = "id";
    String COLUMN_ARTICLE_NAME = "articleName";
    String COLUMN_COVER = "cover";
    String COLUMN_SUMMARY = "summary";
    String COLUMN_CONTENT = "content";
    String COLUMN_CREATE_TIME = "createTime";
}

Create the index and add documents

/**
 * Create the index (if needed) and add documents
 *
 * @param indexName    index name (subdirectory under the configured index root)
 * @param documentList documents to add
 */
public void addDocument(String indexName, List<Document> documentList) {
    // Location of the index, e.g. indexDir = /app/blog/index/article
    String indexDir = luceneProperties.getIndexDir() + File.separator + indexName;
    try {
        File file = new File(indexDir);
        // Create the directory if it does not exist
        if (!file.exists()) {
            FileUtils.forceMkdir(file);
        }
        // Chinese analyzer
        Analyzer analyzer = new IKAnalyzer();
        // Configuration for the index writer
        IndexWriterConfig conf = new IndexWriterConfig(analyzer);
        // try-with-resources closes the writer (releasing write.lock) even if adding fails
        try (Directory directory = FSDirectory.open(Paths.get(indexDir));
             IndexWriter indexWriter = new IndexWriter(directory, conf)) {
            indexWriter.addDocuments(documentList);
            // Commit the changes
            indexWriter.commit();
            log.info("[Bulk add to index] total: {}", documentList.size());
        }
    } catch (Exception e) {
        log.error("[Failed to create index] indexDir:{}", indexDir, e);
        throw new UtilsException("Failed to create index", e);
    }
}

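Watch out for one pitfall here: the IndexWriter must always be closed (indexWriter.close(), or try-with-resources). Otherwise the write.lock file lock stays held and later operations on the index fail.
indexWriter.addDocuments(documentList) adds documents in bulk; to add a single document, use indexWriter.addDocument().

Unit test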

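@Test
public void create_index_test() {
    // The original snippet used an undefined i; a loop is assumed here (10 is an arbitrary count)
    List<Document> documentList = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
        ArticlePO articlePO = new ArticlePO();
        articlePO.setArticleName("git的基本使用" + i);
        articlePO.setContent("這裡是git的基本是用的內容" + i);
        articlePO.setSummary("測試摘要" + i);
        articlePO.setId(String.valueOf(i));
        articlePO.setCreateTime(LocalDateTime.now());
        documentList.add(buildDocument(articlePO));
    }
    // addDocument takes a list, so the whole batch is added in one call
    LuceneUtils.X.addDocument(IArticleIndex.INDEX_NAME, documentList);
}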

private Document buildDocument(ArticlePO articlePO) {
    Document document = new Document();
    LocalDateTime createTime = articlePO.getCreateTime();
    String format = LocalDateTimeUtil.format(createTime, DateTimeFormatter.ISO_LOCAL_DATE);

    // The ID must not be tokenized, so use a StringField
    document.add(new StringField(IArticleIndex.COLUMN_ID, articlePO.getId() == null ? "" : articlePO.getId(), Field.Store.YES));
    // The title (articleName) must be searchable, so it is tokenized and stored
    document.add(new TextField(IArticleIndex.COLUMN_ARTICLE_NAME, articlePO.getArticleName() == null ? "" : articlePO.getArticleName(), Field.Store.YES));
    // The summary must be searchable, so it is tokenized and stored
    document.add(new TextField(IArticleIndex.COLUMN_SUMMARY, articlePO.getSummary() == null ? "" : articlePO.getSummary(), Field.Store.YES));
    // The content must be searchable, so it is tokenized and stored
    document.add(new TextField(IArticleIndex.COLUMN_CONTENT, articlePO.getContent() == null ? "" : articlePO.getContent(), Field.Store.YES));
    // The cover is not indexed at all; it is only stored so it can be returned with hits for display
    document.add(new StoredField(IArticleIndex.COLUMN_COVER, articlePO.getCover() == null ? "" : articlePO.getCover()));
    // The creation time is not tokenized; it is only needed for display
    document.add(new StringField(IArticleIndex.COLUMN_CREATE_TIME, format, Field.Store.YES));
    return document;
}

3. Updating a document

Index update method

/**
 * Update a document
 *
 * @param indexName index name
 * @param document  the new document
 * @param condition term identifying the document(s) to replace
 */
public void updateDocument(String indexName, Document document, Term condition) {
    String indexDir = luceneProperties.getIndexDir() + File.separator + indexName;
    try {
        // Chinese analyzer
        Analyzer analyzer = new IKAnalyzer();
        // Configuration for the index writer
        IndexWriterConfig conf = new IndexWriterConfig(analyzer);
        // Open the index; try-with-resources closes the writer even if the update fails
        try (Directory directory = FSDirectory.open(Paths.get(indexDir));
             IndexWriter indexWriter = new IndexWriter(directory, conf)) {
            indexWriter.updateDocument(condition, document);
            indexWriter.commit();
        }
    } catch (Exception e) {
        log.error("[Failed to update document] indexDir:{},document:{},condition:{}", indexDir, document, condition, e);
        throw new ServiceException();
    }
}

Unit test

@Test
public void update_document_test() {
    ArticlePO articlePO = new ArticlePO();
    articlePO.setArticleName("git的基本使用=編輯");
    articlePO.setContent("這裡是git的基本是用的內容=編輯");
    articlePO.setSummary("測試摘要=編輯");
    articlePO.setId("2");
    articlePO.setCreateTime(LocalDateTime.now());
    Document document = buildDocument(articlePO);
    LuceneUtils.X.updateDocument(IArticleIndex.INDEX_NAME, document, new Term("id", "2"));
}

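An update first deletes every document that matches the term and then adds the new document. So an existing record is replaced, and if nothing matches, a new record is simply added.
new Term("id", "2") is the match condition, much like where id = 2 in a database.
IArticleIndex.INDEX_NAME = article is the index name.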
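
The "replace if present, otherwise add" behavior can be modeled without Lucene. The sketch below is only an in-memory analogy: `UpdateSemanticsSketch` and its methods are hypothetical names, and each document is reduced to a map of field names to string values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory analogy (hypothetical class) of update-by-term: delete matches, then add. */
public class UpdateSemanticsSketch {

    private final List<Map<String, String>> docs = new ArrayList<>();

    public void add(Map<String, String> doc) {
        docs.add(doc);
    }

    /** Removes every document whose field equals value, then appends the new document. */
    public void update(String field, String value, Map<String, String> doc) {
        docs.removeIf(d -> value.equals(d.get(field)));
        docs.add(doc);
    }

    public int size() {
        return docs.size();
    }

    /** Small helper that builds a two-field document. */
    public static Map<String, String> doc(String id, String articleName) {
        Map<String, String> d = new HashMap<>();
        d.put("id", id);
        d.put("articleName", articleName);
        return d;
    }

    public static void main(String[] args) {
        UpdateSemanticsSketch index = new UpdateSemanticsSketch();
        index.add(doc("1", "first"));
        index.add(doc("2", "second"));
        index.update("id", "2", doc("2", "second, edited")); // replaces the existing record
        index.update("id", "9", doc("9", "brand new"));      // nothing matches, so it is added
        System.out.println(index.size());
    }
}
```

One consequence worth keeping in mind: if the term matches several documents, all of them are removed and a single new document takes their place, so the condition should target a unique field such as the ID.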

4. Deleting documents

Document deletion method

/**
 * Delete documents
 *
 * @param indexName index name
 * @param condition term identifying the documents to delete
 */
public void deleteDocument(String indexName, Term condition) {
    String indexDir = luceneProperties.getIndexDir() + File.separator + indexName;
    try {
        // Configuration for the index writer (deleting by term does not use the analyzer)
        IndexWriterConfig conf = new IndexWriterConfig();
        // Open the index; try-with-resources closes the writer even if the delete fails
        try (Directory directory = FSDirectory.open(Paths.get(indexDir));
             IndexWriter indexWriter = new IndexWriter(directory, conf)) {
            indexWriter.deleteDocuments(condition);
            indexWriter.commit();
        }
    } catch (Exception e) {
        log.error("[Failed to delete document] indexDir:{},condition:{}", indexDir, condition, e);
        throw new ServiceException();
    }
}

Unit test

@Test
public void delete_document_test() {
    LuceneUtils.X.deleteDocument(IArticleIndex.INDEX_NAME, new Term(IArticleIndex.COLUMN_ID, "1"));
}

Deleting a document works much like updating one.

5. Deleting an index

This clears all the data under the index.

/**
 * Delete an index (remove all of its documents)
 *
 * @param indexName index name
 */
public void deleteIndex(String indexName) {
    String indexDir = luceneProperties.getIndexDir() + File.separator + indexName;
    try {
        // Configuration for the index writer
        IndexWriterConfig conf = new IndexWriterConfig();
        // Open the index; try-with-resources closes the writer even if the delete fails
        try (Directory directory = FSDirectory.open(Paths.get(indexDir));
             IndexWriter indexWriter = new IndexWriter(directory, conf)) {
            indexWriter.deleteAll();
            indexWriter.commit();
        }
    } catch (Exception e) {
        log.error("[Failed to delete index] indexDir:{}", indexDir, e);
        throw new ServiceException();
    }
}

6. Basic queries

TermQuery

Term term = new Term("title", "lucene");
Query query = new TermQuery(term);

The code above matches documents whose "title" field contains exactly the term "lucene".

PhraseQuery

PhraseQuery.Builder builder = new PhraseQuery.Builder();
builder.add(new Term("content", "open"));
builder.add(new Term("content", "source"));
PhraseQuery query = builder.build();

The code above finds documents whose "content" field contains the phrase "open source".

BooleanQuery

TermQuery query1 = new TermQuery(new Term("title", "lucene"));
TermQuery query2 = new TermQuery(new Term("author", "john"));
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(query1, BooleanClause.Occur.MUST);
builder.add(query2, BooleanClause.Occur.MUST);
BooleanQuery query = builder.build();

The code above uses a Boolean query to match documents whose "title" field contains "lucene" and whose "author" field contains "john".

WildcardQuery

WildcardQuery query = new WildcardQuery(new Term("title", "lu*n?e"));

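The code above runs a wildcard query on the "title" field. It matches terms that start with "lu", followed by any run of characters (* matches zero or more), then "n", then exactly one arbitrary character (?), and finally "e". Note that "lucene" itself does not match, because ? demands one extra character between the "n" and the final "e".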
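
To check which terms a pattern like "lu*n?e" accepts, the wildcard rules can be imitated with a regular expression in plain Java. This is a simplified sketch that ignores escape characters and does not use Lucene; `WildcardSketch` is a hypothetical helper.

```java
import java.util.regex.Pattern;

/** Simplified wildcard matcher (hypothetical class): '*' = any run of characters, '?' = exactly one. */
public class WildcardSketch {

    /** Translates the wildcard pattern into an equivalent regular expression. */
    public static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote literal characters so regex metacharacters stay literal
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True if the whole term matches the wildcard pattern. */
    public static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("lu*n?e", "lucene"));  // false: nothing sits between "n" and the final "e"
        System.out.println(matches("lu*n?e", "lucenne")); // true
        System.out.println(matches("lu*n?e", "lunge"));   // true: * matches the empty string
    }
}
```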

MultiFieldQueryParser

String[] fields = {"title", "content", "author"};
Analyzer analyzer = new StandardAnalyzer();

MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, analyzer);
Query query = parser.parse("lucene search");

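a. Searches the three fields "title", "content", and "author" for the keywords "lucene search".
b. MultiFieldQueryParser combines the per-field queries with OR by default, so a match in any one field is enough.

MultiFieldQueryParser can be wrapped into a simple search utility class; this is the approach used most often.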

/**
 * Keyword search
 *
 * @param indexName index name
 * @param keyword   search keyword
 * @param columns   fields to search
 * @param current   current page (1-based)
 * @param size      number of results per page
 * @return the documents on the requested page (an empty list if the query fails)
 */
public List<Document> search(String indexName, String keyword, String[] columns, int current, int size) {
    String indexDir = luceneProperties.getIndexDir() + File.separator + indexName;
    // Open the index directory; try-with-resources closes the reader even if the search throws
    try (Directory directory = FSDirectory.open(Paths.get(indexDir));
         IndexReader reader = DirectoryReader.open(directory)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        // Chinese analyzer
        Analyzer analyzer = new IKAnalyzer();
        // Query parser
        QueryParser parser = new MultiFieldQueryParser(columns, analyzer);
        // Parse the search keyword
        Query query = parser.parse(keyword);
        // Fetch the top `limit` hits, enough to cover every page up to the current one
        int limit = current * size;
        TopDocs topDocs = searcher.search(query, limit);
        // Array of matching documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // Compute the start and end positions of the page
        int start = (current - 1) * size;
        int end = Math.min(start + size, scoreDocs.length);
        // Collect the documents on the requested page
        List<Document> documents = new ArrayList<>();
        for (int i = start; i < end; i++) {
            Document doc = searcher.doc(scoreDocs[i].doc);
            documents.add(doc);
        }
        return documents;
    } catch (Exception e) {
        log.error("Lucene query failed: ", e);
        // Return an empty list rather than null so callers can iterate safely
        return Collections.emptyList();
    }
}
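
The paging arithmetic in the search method is easy to get wrong at the edges, so here it is isolated as a small plain-Java sketch. `PageWindowSketch` and `window` are hypothetical names, and rejecting pages below 1 is an added assumption (the method above does not validate its arguments).

```java
/** Page-window arithmetic (hypothetical helper) for a 1-based page number. */
public class PageWindowSketch {

    /**
     * Returns {start, end}, the half-open range [start, end) of hit positions
     * that belong to the given page, clamped to the number of available hits.
     */
    public static int[] window(int current, int size, int totalHits) {
        if (current < 1 || size < 1 || totalHits < 0) {
            throw new IllegalArgumentException("current and size must be >= 1, totalHits >= 0");
        }
        // long arithmetic avoids int overflow for very large page numbers
        long start = Math.min((long) (current - 1) * size, totalHits);
        long end = Math.min(start + size, totalHits);
        return new int[] {(int) start, (int) end};
    }

    public static void main(String[] args) {
        int[] first = window(1, 10, 25);   // positions 0..9
        int[] last = window(3, 10, 25);    // positions 20..24, a short final page
        int[] beyond = window(4, 10, 25);  // start == end, so the page is empty
        System.out.println(first[0] + "-" + first[1]);
        System.out.println(last[0] + "-" + last[1]);
        System.out.println(beyond[0] + "-" + beyond[1]);
    }
}
```

Because the search fetches the top current * size hits and then discards everything before the window, deep pages get steadily more expensive. For a small blog index that is fine.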

7. Keyword highlighting

@Test
public void searchArticle() throws InvalidTokenOffsetsException, IOException, ParseException {
    String keyword = "安裝";
    String[] fields = {IArticleIndex.COLUMN_CONTENT, IArticleIndex.COLUMN_ARTICLE_NAME};
    // First fetch the list of matching documents
    List<Document> documentList = LuceneUtils.X.search(IArticleIndex.INDEX_NAME, keyword, fields, 1, 100);

    // Chinese analyzer
    Analyzer analyzer = new IKAnalyzer();
    // Query parser over the searched fields
    QueryParser queryParser = new MultiFieldQueryParser(fields, analyzer);
    // Parse the keyword; the terms of this query are what gets highlighted
    Query query = queryParser.parse(keyword);
    // HTML wrapped around each highlighted term
    Formatter formatter = new SimpleHTMLFormatter("<span style=\"color: #f73131\">", "</span>");
    QueryScorer scorer = new QueryScorer(query);
    Highlighter highlighter = new Highlighter(formatter, scorer);
    // Set the fragment size, i.e. the length of the snippet that is displayed
    highlighter.setTextFragmenter(new SimpleFragmenter(50));
    List<SearchArticleVO> list = new ArrayList<>();

    for (Document doc : documentList) {
        SearchArticleVO articleVO = new SearchArticleVO();
        articleVO.setId(doc.get(IArticleIndex.COLUMN_ID));
        articleVO.setCover(doc.get(IArticleIndex.COLUMN_COVER));
        articleVO.setArticleName(doc.get(IArticleIndex.COLUMN_ARTICLE_NAME));
        articleVO.setSummary(doc.get(IArticleIndex.COLUMN_SUMMARY));
        articleVO.setCreateTime(LocalDate.parse(doc.get(IArticleIndex.COLUMN_CREATE_TIME)));
        for (String field : fields) {
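            // Generate the highlight for this field of the document
            String text = doc.get(field);
            // Tokenize the text with the given analyzer
            TokenStream tokenStream = TokenSources.getTokenStream(field, text, analyzer);
            // The single best-scoring fragment is enough
            String bestFragment = highlighter.getBestFragment(tokenStream, text);
            if (StringUtils.isNotBlank(bestFragment)) {
                // Use the highlighted result; only the first fragment is taken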
                if (field.equals(IArticleIndex.COLUMN_ARTICLE_NAME)) {
                    articleVO.setArticleName(bestFragment);
                }
                if (field.equals(IArticleIndex.COLUMN_CONTENT)) {
                    articleVO.setSummary(bestFragment);
                }
            }
        }
        list.add(articleVO);
    }
}

I'm 一零貳肆 (1024), a blogger who writes about Java technology and everyday life.
