Loading an Elasticsearch Index into the Lucene API in Java

Posted by banq on 2019-01-09

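Every once in a while, Elasticsearch crashes unexpectedly (or unintentionally). In my case, it was a hardware failure during heavy IO on Elasticsearch (let's assume I had no replicas, or that I managed to crash the whole cluster). After some research, I found that the crash had corrupted the state files of many indices. I figured that since Elasticsearch uses Lucene, I should be able to load my data and reindex it using the Lucene API.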

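Important note: for this guide to work on an index, the _source field must be enabled in its mapping. If you have disabled the _source field for any reason, this code will not work for your index. Elasticsearch recommends always keeping it enabled.
Before we start, let's define two important terms.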

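Index: the place where data is stored in Elasticsearch.
Shard, or Lucene index instance (!): each index consists of one or more shards. A shard is a standalone search engine that indexes and serves queries for a subset of the data in an Elasticsearch cluster.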

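As these definitions show, our data lives in an index, and the index is distributed across multiple shards, each of which is actually a Lucene index!
Let's look at how Elasticsearch lays out its data and try to find these Lucene indices. You can configure the data path in elasticsearch.yml with the path.data key.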

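Under the indices directory there is a folder with a name like eCCAJ-x6SOuN6w7vqr4tGQ. (Explaining everything in there is a separate topic. For simplicity, just know that the state files are generated by Elasticsearch. Each one is a SMILE-encoded file containing Elasticsearch metadata such as the index name, node ID, and so on. You can inspect the files with a hex editor.)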

Let's curl our Elasticsearch instance to list our indices.

$ curl elastic-host:9200/_cat/indices
yellow open bad_ip eCCAJ-x6SOuN6w7vqr4tGQ 5 1 4609 0 1.2mb 1.2mb

This is my index, which holds some malicious IP data. Let's look at its directory.

$ ls my-elasticsearch/data/nodes/0/indices/eCCAJ-x6SOuN6w7vqr4tGQ
0      1      2      3      4      _state

These are the shards of our index, that is, the Lucene index instances we are going to load.
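To make the layout concrete, here is a minimal sketch that lists the shard folders of one index and prints the path of each Lucene index. It uses only the Java standard library. The class and method names (ShardPaths, isShardDir, luceneIndexPath) are made up for this illustration, and the sketch assumes the numbered-folder layout from the listing above, with the Lucene files in an "index" subfolder of each shard.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

class ShardPaths {
    /** Shard folders are named by number (0, 1, 2, ...); _state is not a shard. */
    static boolean isShardDir(String name) {
        return name.matches("\\d+");
    }

    /** Assumption: the Lucene files of a shard sit in its "index" subfolder. */
    static Path luceneIndexPath(Path indexDir, String shardName) {
        return indexDir.resolve(shardName).resolve("index");
    }

    public static void main(String[] args) throws IOException {
        // Example path from this article; adjust it to your own path.data.
        Path indexDir = Paths.get("my-elasticsearch/data/nodes/0/indices/eCCAJ-x6SOuN6w7vqr4tGQ");
        if (!Files.isDirectory(indexDir)) {
            System.err.println("Not a directory: " + indexDir);
            return;
        }
        // Files.list returns a stream that must be closed, hence try-with-resources.
        try (Stream<Path> children = Files.list(indexDir)) {
            children.map(p -> p.getFileName().toString())
                    .filter(ShardPaths::isShardDir)
                    .sorted() // plain string order, good enough for a listing
                    .forEach(name -> System.out.println(luceneIndexPath(indexDir, name)));
        }
    }
}
```

Each printed path is a candidate for the luceneIndexPath variable used in the loading code later in this article.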

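Tip: if your Elasticsearch instance is down at this point, you can simply run cat on the state files to see which index they belong to. Most of the content is not human-readable, but you should at least be able to spot your index name.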
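If the cat output is too noisy, the same idea can be sketched in Java: scan the file and keep only the runs of printable ASCII characters, much like the Unix strings tool. This is a rough illustration, not a SMILE parser. The class name StateStrings and the minimum run length of 4 are arbitrary choices, and it assumes the index name is stored as plain ASCII bytes in the state file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class StateStrings {
    /** Returns every run of at least minLen printable ASCII bytes found in data. */
    static List<String> printableRuns(byte[] data, int minLen) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            if (b >= 0x20 && b <= 0x7e) {
                // Printable ASCII: extend the current run.
                current.append((char) b);
            } else {
                // Anything else ends the run; keep it only if it is long enough.
                if (current.length() >= minLen) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        if (current.length() >= minLen) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StateStrings <path-to-state-file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (String run : printableRuns(data, 4)) {
            System.out.println(run);
        }
    }
}
```

Pass it the path of a file from the _state folder; the index name should be among the printed strings.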

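Important note #2: before going on to the code, make sure your Lucene indices are not corrupted. To check and repair your Lucene indices (or Elasticsearch shards), I strongly recommend the CheckIndex tool.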

Now that we have found the Lucene indices, let's write some code to rescue our data. I'll create a sample Maven project. These are the dependencies we need:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.5.0</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.5.0</version>
</dependency>

Important note #3: check which Lucene version your Elasticsearch instance uses, and use the same Lucene API version here. Run curl elastic-host:9200 and look for the lucene_version key.
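If you would rather pull the value out programmatically, the following sketch extracts lucene_version from a response body with a regular expression. The response string in main is a hypothetical, heavily trimmed example rather than real output, and the class name LuceneVersion is made up for this illustration. A real project would more likely use a JSON library.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LuceneVersion {
    // Matches "lucene_version" : "<value>" with optional whitespace around the colon.
    private static final Pattern KEY =
            Pattern.compile("\"lucene_version\"\\s*:\\s*\"([^\"]+)\"");

    /** Returns the lucene_version value if the key is present in the given JSON text. */
    static Optional<String> extract(String json) {
        Matcher m = KEY.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical, trimmed response body; a real one has more fields.
        String body = "{ \"version\" : { \"lucene_version\" : \"7.5.0\" } }";
        System.out.println(extract(body).orElse("lucene_version not found"));
    }
}
```

The extracted value is what goes into the version elements of the two Maven dependencies.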

For simplicity, I'll write all the code in the main method and let exceptions propagate. How you structure your own code is up to you.

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Main {
    public static void main(String[] args) throws IOException {
        // Path to the Lucene index of shard 0 of the bad_ip index
        String luceneIndexPath = "my-elasticsearch/data/nodes/0/indices/eCCAJ-x6SOuN6w7vqr4tGQ/0/index";
        Directory index = FSDirectory.open(Paths.get(luceneIndexPath));

        // Open a read-only view of the index
        IndexReader reader = DirectoryReader.open(index);

        // Print the document count of this shard
        System.out.println(reader.maxDoc());

        reader.close();
        index.close();
    }
}

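Run it and you will see the number of documents in your shard. In my case, shard 0 holds 952 documents. If you do this for every primary shard and sum the results, you get the total number of documents in your Elasticsearch index. Note that maxDoc() also counts deleted documents that have not been merged away yet, so the sum matches only when the shards contain no deletions; numDocs() counts live documents only.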

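The Directory class comes from the Lucene API, but it is not very different from the Java I/O library. It exists to abstract over different storage implementations, such as an index stored in a database. The actual instance for my index is MMapDirectory. It may differ if your configuration is different. (Again, that is another topic.)
IndexReader is an abstract class that provides a point-in-time view for accessing an index.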

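Through our reader variable we can access a lot of information, such as segment information and the documents themselves. What we want is the documents. Let's write that part of the code.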

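// Goes inside main, before reader.close().
// Also needs: import org.apache.lucene.document.Document;
for (int i = 0; i < reader.maxDoc(); i++) {
    // Only read while the reader still reflects the latest state of the index
    if (((DirectoryReader) reader).isCurrent()) {
        // Load the stored fields of the document at position i
        Document document = reader.document(i);

        // _source is stored as bytes; decode it into the original JSON string
        String source = document.getBinaryValue("_source").utf8ToString();
        System.out.println(source);
    }
}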

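What we did is a simple loop that fetches the document at each position. From each document we take only the _source key, which is where the JSON lives. It is stored in binary form, so we first get the binary value and then convert it with utf8ToString.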
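Since the goal is to reindex the rescued data, one possible next step is to turn the recovered JSON strings into a request body for the Elasticsearch bulk API, which expects one action line followed by one document line per entry. The sketch below is only an illustration under several assumptions: the class name BulkBody and the "ip" field in the sample documents are made up, the target index is assumed to be given in the request path (the exact path depends on your Elasticsearch version), and document IDs are not restored, so Elasticsearch would generate new ones.

```java
import java.util.Arrays;
import java.util.List;

class BulkBody {
    /**
     * The bulk format needs each document on a single line. Valid JSON never has
     * raw line breaks inside string values, so replacing them with spaces is safe.
     */
    static String toSingleLine(String json) {
        return json.replaceAll("[\\r\\n]+", " ");
    }

    /** Builds a bulk body: an empty "index" action line before every document line. */
    static String build(List<String> sources) {
        StringBuilder body = new StringBuilder();
        for (String source : sources) {
            body.append("{\"index\":{}}\n");
            body.append(toSingleLine(source)).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Hypothetical documents standing in for the strings printed by the loop.
        List<String> sources = Arrays.asList(
                "{\"ip\":\"10.0.0.1\"}",
                "{\n\"ip\":\"10.0.0.2\"\n}");
        System.out.print(build(sources));
    }
}
```

In the loop, you would collect each source string into a list instead of printing it, then send the resulting body to the bulk endpoint of a healthy cluster.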
