How Elasticsearch Near-Real-Time Search Works Under the Hood

Posted by 偶尔发呆 on 2024-06-17

Original URL: https://www.cnblogs.com/allenwas3/p/18252857

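We all know that search in Elasticsearch is near real-time: right after a document is written, an immediate search (as opposed to a lookup by id) will not find it. The root cause lies in the API that Lucene provides, which is itself not real-time. Elasticsearch builds on top of Lucene and, through a few enhancements, makes queries near real-time and lookups by id real-time. This article looks at how the near-real-time part works.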

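For each index, ES creates a scheduled task that covers the shards of that index; note that the task is constructed from an IndexService, not from an individual shard. As the name suggests, AsyncRefreshTask is that task.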

// org.elasticsearch.index.IndexService.AsyncRefreshTask
static final class AsyncRefreshTask extends BaseAsyncTask {

    AsyncRefreshTask(IndexService indexService) {
        super(indexService, indexService.getIndexSettings().getRefreshInterval());
    }

    @Override
    protected void runInternal() {
        indexService.maybeRefreshEngine(false);
    }

    @Override
    protected String getThreadPool() {
        return ThreadPool.Names.REFRESH;
    }

    @Override
    public String toString() {
        return "refresh";
    }
}

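How often does this task run? By default, once per second. The interval is controlled by the dynamic, index-scoped setting index.refresh_interval, whose minimum allowed value is -1.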

// org.elasticsearch.index.IndexSettings
public static final TimeValue DEFAULT_REFRESH_INTERVAL = new TimeValue(1, TimeUnit.SECONDS);
public static final Setting<TimeValue> INDEX_REFRESH_INTERVAL_SETTING = Setting.timeSetting(
    "index.refresh_interval",
    DEFAULT_REFRESH_INTERVAL,
    new TimeValue(-1, TimeUnit.MILLISECONDS),
    Property.Dynamic,
    Property.IndexScope
);

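Where does the task actually get scheduled? On line 31 of the code below, where it is handed to the thread pool. The check on line 27 also shows that the task is only scheduled when the interval is greater than zero.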

 1    abstract static class BaseAsyncTask extends AbstractAsyncTask {
 2 
 3         protected final IndexService indexService;
 4 
 5         BaseAsyncTask(final IndexService indexService, final TimeValue interval) {
 6             super(indexService.logger, indexService.threadPool, interval, true);
 7             this.indexService = indexService;
 8             rescheduleIfNecessary();
 9         }
10 
11         @Override
12         protected boolean mustReschedule() {
13             // don't re-schedule if the IndexService instance is closed or if the index is closed
14             return indexService.closed.get() == false
15                 && indexService.indexSettings.getIndexMetadata().getState() == IndexMetadata.State.OPEN;
16         }
17     }
18 
19     // org.elasticsearch.common.util.concurrent.AbstractAsyncTask#rescheduleIfNecessary
20     public synchronized void rescheduleIfNecessary() {
21         if (isClosed()) {
22             return;
23         }
24         if (cancellable != null) {
25             cancellable.cancel();
26         }
27         if (interval.millis() > 0 && mustReschedule()) {
28             if (logger.isTraceEnabled()) {
29                 logger.trace("scheduling {} every {}", toString(), interval);
30             }
31             cancellable = threadPool.schedule(this, interval, getThreadPool());
32             isScheduledOrRunning = true;
33         } else {
34             logger.trace("scheduled {} disabled", toString());
35             cancellable = null;
36             isScheduledOrRunning = false;
37         }
38     }
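The scheduling pattern above can be modelled with plain JDK classes. The following is a minimal sketch, not Elasticsearch code: SelfReschedulingTask and all of its members are hypothetical names, and a ScheduledExecutorService stands in for the ES ThreadPool. It assumes that the `true` flag passed to the AbstractAsyncTask constructor means the task re-arms itself after every run, which is my reading of the code rather than something the excerpt states.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for AbstractAsyncTask: a one-shot task that re-arms itself after each run.
class SelfReschedulingTask implements Runnable {
    private final ScheduledExecutorService scheduler;
    private final long intervalMillis;
    private final Runnable body;
    private final AtomicInteger runs = new AtomicInteger();
    private ScheduledFuture<?> pending; // guarded by "this"
    private boolean closed;             // guarded by "this"

    SelfReschedulingTask(ScheduledExecutorService scheduler, long intervalMillis, Runnable body) {
        this.scheduler = scheduler;
        this.intervalMillis = intervalMillis;
        this.body = body;
    }

    // Same shape as rescheduleIfNecessary(): cancel any pending run, then schedule
    // exactly one future run, but only when the interval is positive.
    synchronized boolean rescheduleIfNecessary() {
        if (closed) {
            return false;
        }
        if (pending != null) {
            pending.cancel(false);
        }
        if (intervalMillis > 0) {
            pending = scheduler.schedule(this, intervalMillis, TimeUnit.MILLISECONDS);
            return true;
        }
        pending = null; // a non-positive interval disables periodic execution
        return false;
    }

    @Override
    public void run() {
        synchronized (this) {
            pending = null; // the scheduled run has started, nothing is pending now
        }
        try {
            body.run();
            runs.incrementAndGet();
        } finally {
            rescheduleIfNecessary(); // re-arm after every run, even if the body threw
        }
    }

    synchronized void close() {
        closed = true;
        if (pending != null) {
            pending.cancel(false);
            pending = null;
        }
    }

    // Polls until at least n runs have completed or the timeout elapses; returns whether n was reached.
    boolean awaitRuns(int n, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (runs.get() < n && System.nanoTime() < deadline) {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return runs.get() >= n;
    }
}
```

Because each run schedules the next one only after it finishes, a slow refresh delays the following refresh instead of piling up overlapping runs. Passing an interval of -1 to this sketch never schedules anything, which matches the `interval.millis() > 0` check in the original.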

As for the execution itself, whether a refresh actually runs depends on a series of conditions. Here we focus on getEngine().refreshNeeded().

// org.elasticsearch.index.IndexService#maybeRefreshEngine
private void maybeRefreshEngine(boolean force) {
    if (indexSettings.getRefreshInterval().millis() > 0 || force) {
        for (IndexShard shard : this.shards.values()) {
            try {
                shard.scheduledRefresh();
            } catch (IndexShardClosedException | AlreadyClosedException ex) {
                // fine - continue;
            }
        }
    }
}

// org.elasticsearch.index.shard.IndexShard#scheduledRefresh
/**
 * Executes a scheduled refresh if necessary.
 *
 * @return <code>true</code> iff the engine got refreshed otherwise <code>false</code>
 */
public boolean scheduledRefresh() {
    verifyNotClosed();
    boolean listenerNeedsRefresh = refreshListeners.refreshNeeded();
    if (isReadAllowed() && (listenerNeedsRefresh || getEngine().refreshNeeded())) {
        if (listenerNeedsRefresh == false // if we have a listener that is waiting for a refresh we need to force it
            && isSearchIdle()
            && indexSettings.isExplicitRefresh() == false
            && active.get()) { // it must be active otherwise we might not free up segment memory once the shard became inactive
            // lets skip this refresh since we are search idle and
            // don't necessarily need to refresh. the next searcher access will register a refreshListener and that will
            // cause the next schedule to refresh.
            final Engine engine = getEngine();
            engine.maybePruneDeletes(); // try to prune the deletes in the engine if we accumulated some
            setRefreshPending(engine);
            return false;
        } else {
            if (logger.isTraceEnabled()) {
                logger.trace("refresh with source [schedule]");
            }
            return getEngine().maybeRefresh("schedule");
        }
    }
    final Engine engine = getEngine();
    engine.maybePruneDeletes(); // try to prune the deletes in the engine if we accumulated some
    return false;
}
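The branching in scheduledRefresh() can be condensed into a pure function, which makes the three possible outcomes easier to see. This is a simplified sketch under assumptions: ScheduledRefreshDecision, Outcome and decide are hypothetical names, the six booleans stand for the calls made in the original method, and explicitRefreshInterval is assumed to correspond to indexSettings.isExplicitRefresh(), meaning the index has an explicitly configured refresh interval.

```java
// Hypothetical, simplified model of the decision made in IndexShard#scheduledRefresh.
final class ScheduledRefreshDecision {

    enum Outcome {
        REFRESH,           // the engine's maybeRefresh("schedule") would be called
        SKIP_SEARCH_IDLE,  // refresh is needed but deferred because nobody is searching
        NOTHING_TO_DO      // no refresh is needed, or reads are not allowed yet
    }

    static Outcome decide(boolean readAllowed,
                          boolean listenerNeedsRefresh,
                          boolean engineNeedsRefresh,
                          boolean searchIdle,
                          boolean explicitRefreshInterval,
                          boolean shardActive) {
        if (readAllowed && (listenerNeedsRefresh || engineNeedsRefresh)) {
            // A waiting listener always forces the refresh; otherwise an idle, active shard
            // without an explicit refresh interval may skip this round.
            boolean canSkip = !listenerNeedsRefresh
                && searchIdle
                && !explicitRefreshInterval
                && shardActive;
            return canSkip ? Outcome.SKIP_SEARCH_IDLE : Outcome.REFRESH;
        }
        return Outcome.NOTHING_TO_DO;
    }
}
```

For example, `decide(true, false, true, true, false, true)` yields SKIP_SEARCH_IDLE, while flipping listenerNeedsRefresh to true yields REFRESH. The sketch leaves out the side effects of the original (pruning deletes and marking the refresh as pending).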

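Whether a refresh is needed ultimately comes down to the isCurrent() method of Lucene's DirectoryReader. As its Javadoc explains, the method returns false once the index has changed since the reader was opened, which is why refreshNeeded() tests for isCurrent() == false.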

// org.elasticsearch.index.engine.Engine#refreshNeeded
public boolean refreshNeeded() {
    if (store.tryIncRef()) {
        /*
          we need to inc the store here since we acquire a searcher and that might keep a file open on the
          store. this violates the assumption that all files are closed when
          the store is closed so we need to make sure we increment it here
         */
        try {
            try (Searcher searcher = acquireSearcher("refresh_needed", SearcherScope.EXTERNAL)) {
                return searcher.getDirectoryReader().isCurrent() == false;
            }
        } catch (IOException e) {
            logger.error("failed to access searcher manager", e);
            failEngine("failed to access searcher manager", e);
            throw new EngineException(shardId, "failed to access searcher manager", e);
        } finally {
            store.decRef();
        }
    }
    return false;
}

// org.apache.lucene.index.DirectoryReader#isCurrent
/**
 * Check whether any new changes have occurred to the index since this reader was opened.
 *
 * If this reader was created by calling open, then this method checks if any further commits
 * (see IndexWriter.commit) have occurred in the directory.
 *
 * If instead this reader is a near real-time reader (ie, obtained by a call to open(IndexWriter),
 * or by calling openIfChanged on a near real-time reader), then this method checks if either a new
 * commit has occurred, or any new uncommitted changes have taken place via the writer. Note that
 * even if the writer has only performed merging, this method will still return false.
 *
 * In any event, if this returns false, you should call openIfChanged to get a new reader that
 * sees the changes.
 */
public abstract boolean isCurrent() throws IOException;

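Being used to writing CRUD business code, I assumed at first glance that IndexService managed all indices. After reading the source more carefully, it turns out that each index has its own IndexService instance, and each shard of that index (an IndexShard) has its own Engine instance, as the shard-level getEngine() calls above show. Now let's look at what the refresh operation actually does. It ultimately calls the openIfChanged method of Lucene's DirectoryReader, and the new reader that this method returns can search the newly written documents.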

// org.elasticsearch.index.engine.InternalEngine#maybeRefresh
@Override
public boolean maybeRefresh(String source) throws EngineException {
    return refresh(source, SearcherScope.EXTERNAL, false);
}

// org.elasticsearch.index.engine.InternalEngine#refresh
final boolean refresh(String source, SearcherScope scope, boolean block) throws EngineException {
    // both refresh types will result in an internal refresh but only the external will also
    // pass the new reader reference to the external reader manager.
    final long localCheckpointBeforeRefresh = localCheckpointTracker.getProcessedCheckpoint();
    boolean refreshed;
    try {
        // refresh does not need to hold readLock as ReferenceManager can handle correctly if the engine is closed in mid-way.
        if (store.tryIncRef()) {
            // increment the ref just to ensure nobody closes the store during a refresh
            try {
                // even though we maintain 2 managers we really do the heavy-lifting only once.
                // the second refresh will only do the extra work we have to do for warming caches etc.
                ReferenceManager<ElasticsearchDirectoryReader> referenceManager = getReferenceManager(scope);
                // it is intentional that we never refresh both internal / external together
                if (block) {
                    referenceManager.maybeRefreshBlocking();
                    refreshed = true;
                } else {
                    refreshed = referenceManager.maybeRefresh();
                }
            } finally {
                store.decRef();
            }
            if (refreshed) {
                lastRefreshedCheckpointListener.updateRefreshedCheckpoint(localCheckpointBeforeRefresh);
            }
        } else {
            refreshed = false;
        }
    } catch (AlreadyClosedException e) {
        failOnTragicEvent(e);
        throw e;
    } catch (Exception e) {
        try {
            failEngine("refresh failed source[" + source + "]", e);
        } catch (Exception inner) {
            e.addSuppressed(inner);
        }
        throw new RefreshFailedEngineException(shardId, e);
    }
    assert refreshed == false || lastRefreshedCheckpoint() >= localCheckpointBeforeRefresh
        : "refresh checkpoint was not advanced; "
            + "local_checkpoint="
            + localCheckpointBeforeRefresh
            + " refresh_checkpoint="
            + lastRefreshedCheckpoint();
    // TODO: maybe we should just put a scheduled job in threadPool?
    // We check for pruning in each delete request, but we also prune here e.g. in case a delete burst comes in and then no more deletes
    // for a long time:
    maybePruneDeletes();
    mergeScheduler.refreshConfig();
    return refreshed;
}

//org.elasticsearch.index.engine.ElasticsearchReaderManager#refreshIfNeeded
class ElasticsearchReaderManager extends ReferenceManager<ElasticsearchDirectoryReader> {
    @Override
    protected ElasticsearchDirectoryReader refreshIfNeeded(ElasticsearchDirectoryReader referenceToRefresh) throws IOException {
        return (ElasticsearchDirectoryReader) DirectoryReader.openIfChanged(referenceToRefresh);
    }
}
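To make the relationship between isCurrent() and openIfChanged() concrete, here is a toy model of a point-in-time reader. It is only an illustration and does not use Lucene: ToyIndex and ToyReader are hypothetical classes, and the version counter is an assumed stand-in for whatever Lucene tracks internally. The one behaviour borrowed from the real API is that openIfChanged returns null when nothing has changed.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical and do not use Lucene.
final class ToyIndex {
    private final List<String> docs = new ArrayList<>();
    private long version; // bumped on every write

    synchronized void add(String doc) {
        docs.add(doc);
        version++;
    }

    synchronized long version() {
        return version;
    }

    // Opens a point-in-time reader over the documents written so far.
    synchronized ToyReader open() {
        return new ToyReader(this, version, new ArrayList<>(docs));
    }
}

final class ToyReader {
    private final ToyIndex index;
    private final long version;
    private final List<String> snapshot;

    ToyReader(ToyIndex index, long version, List<String> snapshot) {
        this.index = index;
        this.version = version;
        this.snapshot = snapshot;
    }

    // True while nothing has been written since this reader was opened.
    boolean isCurrent() {
        return index.version() == version;
    }

    // Mirrors the contract of DirectoryReader.openIfChanged: null when nothing changed,
    // otherwise a new reader that sees the latest writes.
    ToyReader openIfChanged() {
        return isCurrent() ? null : index.open();
    }

    boolean contains(String doc) {
        return snapshot.contains(doc);
    }
}
```

A reader opened before a write keeps serving its old snapshot, so the new document stays invisible until something calls openIfChanged() and swaps in the returned reader. In this model, the periodic refresh task plays the role of that caller.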

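The code is long, but the conclusion is simple: ES uses a scheduled task to refresh each index periodically, which turns Lucene's non-real-time search into near-real-time search.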
