The Elasticsearch Transaction Log (translog)

Posted by 偶尔发呆 on 2024-06-19

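The translog is the component Elasticsearch relies on for data durability and crash recovery. Every index shard has its own translog, and every operation that adds, updates, or deletes index data is recorded in it.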

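At its core, the translog is a rolling log file. Compared with a write to Lucene, appending to a log file is a relatively lightweight operation. The translog is synced to disk either on every request or periodically, depending on the durability setting.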

Writing to the translog:

// org.elasticsearch.index.translog.Translog#add
/**
 * Adds an operation to the transaction log.
 *
 * @param operation the operation to add
 * @return the location of the operation in the translog
 * @throws IOException if adding the operation to the translog resulted in an I/O exception
 */
public Location add(final Operation operation) throws IOException {
    final ReleasableBytesStreamOutput out = new ReleasableBytesStreamOutput(bigArrays);
    try {
        final long start = out.position();
        out.skip(Integer.BYTES);
        writeOperationNoSize(new BufferedChecksumStreamOutput(out), operation);
        final long end = out.position();
        final int operationSize = (int) (end - Integer.BYTES - start);
        out.seek(start);
        out.writeInt(operationSize);
        out.seek(end);
        final BytesReference bytes = out.bytes();
        try (ReleasableLock ignored = readLock.acquire()) {
            ensureOpen();
            if (operation.primaryTerm() > current.getPrimaryTerm()) {
                assert false
                    : "Operation term is newer than the current term; "
                        + "current term["
                        + current.getPrimaryTerm()
                        + "], operation term["
                        + operation
                        + "]";
                throw new IllegalArgumentException(
                    "Operation term is newer than the current term; "
                        + "current term["
                        + current.getPrimaryTerm()
                        + "], operation term["
                        + operation
                        + "]"
                );
            }
            return current.add(bytes, operation.seqNo());
        }
    } catch (final AlreadyClosedException | IOException ex) {
        closeOnTragicEvent(ex);
        throw ex;
    } catch (final Exception ex) {
        closeOnTragicEvent(ex);
        throw new TranslogException(shardId, "Failed to write operation [" + operation + "]", ex);
    } finally {
        Releasables.close(out);
    }
}
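The size-prefix step in add (reserve four bytes, write the body, then go back and fill in the length) is easier to see in isolation. The following is a minimal sketch and not Elasticsearch code: FramedRecord, frame, and unframe are hypothetical names, and the layout, a size field that covers the payload plus a trailing CRC32, is an assumption made for the example.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical helper for illustration only; not part of Elasticsearch.
final class FramedRecord {

    /** Assumed layout: [int size][payload][int crc32], where size = payload length + 4. */
    static byte[] frame(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + payload.length + Integer.BYTES);
        int start = buf.position();
        buf.position(start + Integer.BYTES);   // reserve the size slot, like out.skip(Integer.BYTES)
        buf.put(payload);
        buf.putInt(checksum(payload));         // checksum follows the payload
        int end = buf.position();
        int size = end - Integer.BYTES - start;
        buf.putInt(start, size);               // back-fill the size at the reserved slot
        return buf.array();
    }

    /** Reads one framed record and verifies its checksum. */
    static byte[] unframe(byte[] framed) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        int size = buf.getInt();
        byte[] payload = new byte[size - Integer.BYTES];
        buf.get(payload);
        int stored = buf.getInt();
        if (stored != checksum(payload)) {
            throw new IllegalStateException("checksum mismatch");
        }
        return payload;
    }

    private static int checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return (int) crc.getValue();
    }
}
```

Writing the size last means the writer never has to know the serialized length in advance, while a reader can still skip a whole record after reading only its first four bytes.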

Operation is an interface, and Index and Delete are among its implementations. Every insert, update, and delete on index data in ES can be expressed as an Index or a Delete.

public interface Operation 
public static class Index implements Operation
public static class Delete implements Operation 

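current is a TranslogWriter. TranslogWriter#add does not touch the disk by itself: it only appends the bytes to an in-memory output stream. The buffered bytes are written out and fsynced later, by a scheduled task when durability is async, or as part of the request when durability is request.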

// org.elasticsearch.index.translog.TranslogWriter#add
/**
 * Add the given bytes to the translog with the specified sequence number; returns the location the bytes were written to.
 *
 * @param data  the bytes to write
 * @param seqNo the sequence number associated with the operation
 * @return the location the bytes were written to
 * @throws IOException if writing to the translog resulted in an I/O exception
 */
public Translog.Location add(final BytesReference data, final long seqNo) throws IOException {
    long bufferedBytesBeforeAdd = this.bufferedBytes;
    if (bufferedBytesBeforeAdd >= forceWriteThreshold) {
        writeBufferedOps(Long.MAX_VALUE, bufferedBytesBeforeAdd >= forceWriteThreshold * 4);
    }

    final Translog.Location location;
    synchronized (this) {
        ensureOpen();
        if (buffer == null) {
            buffer = new ReleasableBytesStreamOutput(bigArrays);
        }
        assert bufferedBytes == buffer.size();
        final long offset = totalOffset;
        totalOffset += data.length();
        data.writeTo(buffer);

        assert minSeqNo != SequenceNumbers.NO_OPS_PERFORMED || operationCounter == 0;
        assert maxSeqNo != SequenceNumbers.NO_OPS_PERFORMED || operationCounter == 0;

        minSeqNo = SequenceNumbers.min(minSeqNo, seqNo);
        maxSeqNo = SequenceNumbers.max(maxSeqNo, seqNo);

        nonFsyncedSequenceNumbers.add(seqNo);

        operationCounter++;

        assert assertNoSeqNumberConflict(seqNo, data);

        location = new Translog.Location(generation, offset, data.length());
        bufferedBytes = buffer.size();
    }

    return location;
}
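The bookkeeping in TranslogWriter#add boils down to this: remember the running offset, append to the buffer, and hand back a (generation, offset, size) triple. Here is a minimal sketch of that idea. BufferedAppender and its nested Location are hypothetical stand-ins rather than the real classes, and the Long.MAX_VALUE / Long.MIN_VALUE sentinels are an assumption that replaces SequenceNumbers.NO_OPS_PERFORMED.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch; field names echo TranslogWriter but this is not the real class.
final class BufferedAppender {

    static final class Location {
        final long generation;
        final long offset;
        final int size;

        Location(long generation, long offset, int size) {
            this.generation = generation;
            this.offset = offset;
            this.size = size;
        }
    }

    private final long generation;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private long totalOffset;                 // offset at which the next record will start
    private long minSeqNo = Long.MAX_VALUE;   // assumed sentinel: no operation seen yet
    private long maxSeqNo = Long.MIN_VALUE;   // assumed sentinel: no operation seen yet
    private int operationCounter;

    BufferedAppender(long generation) {
        this.generation = generation;
    }

    /** Appends to the in-memory buffer only; nothing reaches the disk here. */
    synchronized Location add(byte[] data, long seqNo) {
        final long offset = totalOffset;
        totalOffset += data.length;
        buffer.write(data, 0, data.length);
        minSeqNo = Math.min(minSeqNo, seqNo);
        maxSeqNo = Math.max(maxSeqNo, seqNo);
        operationCounter++;
        return new Location(generation, offset, data.length);
    }

    synchronized int bufferedBytes() { return buffer.size(); }
    synchronized long minSeqNo() { return minSeqNo; }
    synchronized long maxSeqNo() { return maxSeqNo; }
    synchronized int totalOperations() { return operationCounter; }
}
```

Because the offset is assigned inside the synchronized block, two concurrent writers can never receive overlapping locations, even though neither of them waits for an fsync.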

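Next comes the scheduled task behind asynchronous flushing. File-backed logs tend to follow the same pattern: either fsync asynchronously on a timer, or fsync on every request (a synchronous write). The default interval for asynchronous fsync is 5 s (the index.translog.sync_interval setting).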

// org.elasticsearch.index.IndexService.AsyncTranslogFSync
/**
 * FSyncs the translog for all shards of this index in a defined interval.
 */
static final class AsyncTranslogFSync extends BaseAsyncTask {

    AsyncTranslogFSync(IndexService indexService) {
        super(indexService, indexService.getIndexSettings().getTranslogSyncInterval());
    }

    @Override
    protected String getThreadPool() {
        return ThreadPool.Names.FLUSH;
    }

    @Override
    protected void runInternal() {
        indexService.maybeFSyncTranslogs();
    }

    void updateIfNeeded() {
        final TimeValue newInterval = indexService.getIndexSettings().getTranslogSyncInterval();
        if (newInterval.equals(getInterval()) == false) {
            setInterval(newInterval);
        }
    }

    @Override
    public String toString() {
        return "translog_sync";
    }
}

// org.elasticsearch.index.IndexService#maybeFSyncTranslogs
private void maybeFSyncTranslogs() {
    if (indexSettings.getTranslogDurability() == Translog.Durability.ASYNC) {
        for (IndexShard shard : this.shards.values()) {
            try {
                if (shard.isSyncNeeded()) {
                    shard.sync();
                }
            } catch (AlreadyClosedException ex) {
                // fine - continue;
            } catch (IOException e) {
                logger.warn("failed to sync translog", e);
            }
        }
    }
}

// org.elasticsearch.index.shard.IndexShard#sync()
public void sync() throws IOException {
    verifyNotClosed();
    getEngine().syncTranslog();
}

// org.elasticsearch.index.engine.InternalEngine#syncTranslog
@Override
public void syncTranslog() throws IOException {
    translog.sync();
    revisitIndexDeletionPolicyOnTranslogSynced();
}

// org.elasticsearch.index.translog.TranslogWriter#syncUpTo
/**
 * Syncs the translog up to at least the given offset unless already synced
 *
 * @return <code>true</code> if this call caused an actual sync operation
 */
final boolean syncUpTo(long offset) throws IOException {
    if (lastSyncedCheckpoint.offset < offset && syncNeeded()) {
        synchronized (syncLock) { // only one sync/checkpoint should happen concurrently but we wait
            if (lastSyncedCheckpoint.offset < offset && syncNeeded()) {
                // double checked locking - we don't want to fsync unless we have to and now that we have
                // the lock we should check again since if this code is busy we might have fsynced enough already
                final Checkpoint checkpointToSync;
                final List<Long> flushedSequenceNumbers;
                final ReleasableBytesReference toWrite;
                try (ReleasableLock toClose = writeLock.acquire()) {
                    synchronized (this) {
                        ensureOpen();
                        checkpointToSync = getCheckpoint();
                        toWrite = pollOpsToWrite();
                        if (nonFsyncedSequenceNumbers.isEmpty()) {
                            flushedSequenceNumbers = null;
                        } else {
                            flushedSequenceNumbers = nonFsyncedSequenceNumbers;
                            nonFsyncedSequenceNumbers = new ArrayList<>(64);
                        }
                    }

                    try {
                        // Write ops will release operations.
                        writeAndReleaseOps(toWrite);
                        assert channel.position() == checkpointToSync.offset;
                    } catch (final Exception ex) {
                        closeWithTragicEvent(ex);
                        throw ex;
                    }
                }
                // now do the actual fsync outside of the synchronized block such that
                // we can continue writing to the buffer etc.
                try {
                    assert lastSyncedCheckpoint.offset != checkpointToSync.offset || toWrite.length() == 0;
                    if (lastSyncedCheckpoint.offset != checkpointToSync.offset) {
                        channel.force(false);
                    }
                    writeCheckpoint(checkpointChannel, checkpointPath, checkpointToSync);
                } catch (final Exception ex) {
                    closeWithTragicEvent(ex);
                    throw ex;
                }
                if (flushedSequenceNumbers != null) {
                    flushedSequenceNumbers.forEach(persistedSequenceNumberConsumer::accept);
                }
                assert lastSyncedCheckpoint.offset <= checkpointToSync.offset
                    : "illegal state: " + lastSyncedCheckpoint.offset + " <= " + checkpointToSync.offset;
                lastSyncedCheckpoint = checkpointToSync; // write protected by syncLock
                return true;
            }
        }
    }
    return false;
}
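The shape of syncUpTo (check, lock, check again, drain the buffer under a short lock, then fsync outside it) can be reproduced with plain JDK classes. This is a minimal sketch under stated assumptions: SyncingLog and all of its members are hypothetical, it tracks a bare offset instead of a Checkpoint, it writes no checkpoint file, and it wraps IOException in UncheckedIOException only to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of buffered writes plus double-checked fsync; not TranslogWriter.
final class SyncingLog implements AutoCloseable {
    private final Path path;
    private final FileChannel channel;
    private final Object syncLock = new Object();
    private ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private long totalOffset;               // bytes accepted so far (buffered or written)
    private volatile long lastSyncedOffset; // bytes known to be fsynced

    private SyncingLog(Path path, FileChannel channel) {
        this.path = path;
        this.channel = channel;
    }

    static SyncingLog openTemp() {
        try {
            Path p = Files.createTempFile("translog-sketch", ".log");
            return new SyncingLog(p, FileChannel.open(p, StandardOpenOption.WRITE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Buffers the bytes in memory and returns the offset just past them. */
    synchronized long add(byte[] data) {
        buffer.write(data, 0, data.length);
        totalOffset += data.length;
        return totalOffset;
    }

    private synchronized boolean syncNeeded() {
        return totalOffset != lastSyncedOffset;
    }

    /** Returns true only if this call performed the write and fsync. */
    boolean syncUpTo(long offset) {
        if (lastSyncedOffset < offset && syncNeeded()) {
            synchronized (syncLock) {            // one sync at a time; others wait here
                if (lastSyncedOffset < offset && syncNeeded()) { // re-check after waiting
                    final byte[] toWrite;
                    final long offsetToSync;
                    synchronized (this) {        // short lock: just swap the buffer out
                        toWrite = buffer.toByteArray();
                        buffer = new ByteArrayOutputStream();
                        offsetToSync = totalOffset;
                    }
                    try {
                        ByteBuffer bytes = ByteBuffer.wrap(toWrite);
                        while (bytes.hasRemaining()) {
                            channel.write(bytes);
                        }
                        channel.force(false);    // fsync data without holding the add() lock
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                    lastSyncedOffset = offsetToSync;
                    return true;
                }
            }
        }
        return false;
    }

    long fileSize() {
        try {
            return Files.size(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            channel.close();
            Files.deleteIfExists(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller that arrives while another thread is already syncing blocks on syncLock, then usually finds its offset already covered and returns false without a second fsync. That is what the second check buys.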

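Now for reading. The translog is read in two situations. The first is crash recovery: if ES stops before the Lucene data has been persisted, the operations are replayed from the translog after restart (provided the translog itself was persisted). The second is real-time lookup by id: a newly written document cannot be found immediately by a search on other criteria, but a get by id does find it, because at that point the latest version of the document can be read from the translog.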

Reading from the translog:

// org.elasticsearch.index.translog.Translog#readOperation(org.elasticsearch.index.translog.Translog.Location)
/**
 * Reads and returns the operation from the given location if the generation it references is still available. Otherwise
 * this method will return <code>null</code>.
 */
public Operation readOperation(Location location) throws IOException {
    try (ReleasableLock ignored = readLock.acquire()) {
        ensureOpen();
        if (location.generation < getMinFileGeneration()) {
            return null;
        }
        if (current.generation == location.generation) {
            // no need to fsync here the read operation will ensure that buffers are written to disk
            // if they are still in RAM and we are reading onto that position
            return current.read(location);
        } else {
            // read backwards - it's likely we need to read on that is recent
            for (int i = readers.size() - 1; i >= 0; i--) {
                TranslogReader translogReader = readers.get(i);
                if (translogReader.generation == location.generation) {
                    return translogReader.read(location);
                }
            }
        }
    } catch (final Exception ex) {
        closeOnTragicEvent(ex);
        throw ex;
    }
    return null;
}
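The lookup order in readOperation (reject generations that were trimmed, try the current generation first, then scan older readers from newest to oldest) can be sketched with in-memory byte arrays. GenerationLookup, LogFile, rollGeneration, and trimBelow are hypothetical names introduced for this example, and each generation is assumed to be a byte array rather than a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of locating a record by (generation, offset, size).
final class GenerationLookup {

    static final class Location {
        final long generation;
        final int offset;
        final int size;

        Location(long generation, int offset, int size) {
            this.generation = generation;
            this.offset = offset;
            this.size = size;
        }
    }

    static final class LogFile {
        final long generation;
        final byte[] contents;

        LogFile(long generation, byte[] contents) {
            this.generation = generation;
            this.contents = contents;
        }

        byte[] read(Location location) {
            return Arrays.copyOfRange(contents, location.offset, location.offset + location.size);
        }
    }

    private final List<LogFile> readers = new ArrayList<>(); // older generations, oldest first
    private LogFile current;

    GenerationLookup(LogFile first) {
        this.current = first;
    }

    /** The current generation becomes read-only and a new one takes its place. */
    void rollGeneration(LogFile next) {
        readers.add(current);
        current = next;
    }

    /** Drops older generations that are no longer needed. */
    void trimBelow(long minGeneration) {
        readers.removeIf(r -> r.generation < minGeneration);
    }

    private long minGeneration() {
        return readers.isEmpty() ? current.generation : readers.get(0).generation;
    }

    /** Returns the bytes at the location, or null if that generation is gone. */
    byte[] read(Location location) {
        if (location.generation < minGeneration()) {
            return null;
        }
        if (current.generation == location.generation) {
            return current.read(location);
        }
        for (int i = readers.size() - 1; i >= 0; i--) { // newest first: recent reads are likelier
            LogFile reader = readers.get(i);
            if (reader.generation == location.generation) {
                return reader.read(location);
            }
        }
        return null;
    }
}
```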

This is where the Translog.Location returned by the write path above comes into play: the reader uses the Location to go to the given position in the translog and decode the Operation stored there.

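Now look at how ES serves a get by id. Line 16 fetches the versionValue from the version map; for newly indexed data it is an IndexVersionValue. Line 45 reads the translog at the Location carried by that value. As the comment on line 41 notes, this branch is taken only when get.isReadFromTranslog() is set; otherwise the engine calls refreshIfNeeded and answers from a searcher.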

 1 // org.elasticsearch.index.engine.InternalEngine#get
 2 @Override
 3 public GetResult get(
 4     Get get,
 5     MappingLookup mappingLookup,
 6     DocumentParser documentParser,
 7     Function<Engine.Searcher, Engine.Searcher> searcherWrapper
 8 ) {
 9     assert Objects.equals(get.uid().field(), IdFieldMapper.NAME) : get.uid().field();
10     try (ReleasableLock ignored = readLock.acquire()) {
11         ensureOpen();
12         if (get.realtime()) {
13             final VersionValue versionValue;
14             try (Releasable ignore = versionMap.acquireLock(get.uid().bytes())) {
15                 // we need to lock here to access the version map to do this truly in RT
16                 versionValue = getVersionFromMap(get.uid().bytes());
17             }
18             if (versionValue != null) {
19                 if (versionValue.isDelete()) {
20                     return GetResult.NOT_EXISTS;
21                 }
22                 if (get.versionType().isVersionConflictForReads(versionValue.version, get.version())) {
23                     throw new VersionConflictEngineException(
24                         shardId,
25                         "[" + get.id() + "]",
26                         get.versionType().explainConflictForReads(versionValue.version, get.version())
27                     );
28                 }
29                 if (get.getIfSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO
30                     && (get.getIfSeqNo() != versionValue.seqNo || get.getIfPrimaryTerm() != versionValue.term)) {
31                     throw new VersionConflictEngineException(
32                         shardId,
33                         get.id(),
34                         get.getIfSeqNo(),
35                         get.getIfPrimaryTerm(),
36                         versionValue.seqNo,
37                         versionValue.term
38                     );
39                 }
40                 if (get.isReadFromTranslog()) {
41                     // this is only used for updates - API _GET calls will always read form a reader for consistency
42                     // the update call doesn't need the consistency since it's source only + _parent but parent can go away in 7.0
43                     if (versionValue.getLocation() != null) {
44                         try {
45                             final Translog.Operation operation = translog.readOperation(versionValue.getLocation());
46                             if (operation != null) {
47                                 return getFromTranslog(get, (Translog.Index) operation, mappingLookup, documentParser, searcherWrapper);
48                             }
49                         } catch (IOException e) {
50                             maybeFailEngine("realtime_get", e); // lets check if the translog has failed with a tragic event
51                             throw new EngineException(shardId, "failed to read operation from translog", e);
52                         }
53                     } else {
54                         trackTranslogLocation.set(true);
55                     }
56                 }
57                 assert versionValue.seqNo >= 0 : versionValue;
58                 refreshIfNeeded("realtime_get", versionValue.seqNo);
59             }
60             return getFromSearcher(get, acquireSearcher("realtime_get", SearcherScope.INTERNAL, searcherWrapper), false);
61         } else {
62             // we expose what has been externally expose in a point in time snapshot via an explicit refresh
63             return getFromSearcher(get, acquireSearcher("get", SearcherScope.EXTERNAL, searcherWrapper), false);
64         }
65     }
66 }

// org.elasticsearch.index.engine.IndexVersionValue
final class IndexVersionValue extends VersionValue {

    private static final long RAM_BYTES_USED = RamUsageEstimator.shallowSizeOfInstance(IndexVersionValue.class);

    private final Translog.Location translogLocation;

    IndexVersionValue(Translog.Location translogLocation, long version, long seqNo, long term) {
        super(version, seqNo, term);
        this.translogLocation = translogLocation;
    }

    @Override
    public long ramBytesUsed() {
        return RAM_BYTES_USED + RamUsageEstimator.shallowSizeOf(translogLocation);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        if (super.equals(o) == false) return false;
        IndexVersionValue that = (IndexVersionValue) o;
        return Objects.equals(translogLocation, that.translogLocation);
    }

    @Override
    public int hashCode() {
        return Objects.hash(super.hashCode(), translogLocation);
    }

    @Override
    public String toString() {
        return "IndexVersionValue{" + "version=" + version + ", seqNo=" + seqNo + ", term=" + term + ", location=" + translogLocation + '}';
    }

    @Override
    public Translog.Location getLocation() {
        return translogLocation;
    }
}
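To see why a get by id can return a document that a search cannot yet see, it helps to model the version map as an id-to-log-position index that sits in front of the searchable data. The sketch below is a deliberately simplified model and not how InternalEngine is implemented: RealtimeGetSketch and every member in it are hypothetical, a List stands in for the translog, a HashMap stands in for refreshed Lucene segments, and the version map is simply cleared on refresh.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a version map in front of a searcher.
final class RealtimeGetSketch {

    static final class VersionEntry {
        final int logIndex;     // position of the latest operation in the log
        final boolean deleted;

        VersionEntry(int logIndex, boolean deleted) {
            this.logIndex = logIndex;
            this.deleted = deleted;
        }
    }

    private final List<String> log = new ArrayList<>();                 // stands in for the translog
    private final Map<String, VersionEntry> versionMap = new HashMap<>();
    private final Map<String, String> searchable = new HashMap<>();     // stands in for refreshed segments

    void index(String id, String source) {
        log.add(source);
        versionMap.put(id, new VersionEntry(log.size() - 1, false));
    }

    void delete(String id) {
        versionMap.put(id, new VersionEntry(-1, true));
    }

    /** Makes pending changes searchable, after which the map entries are no longer needed. */
    void refresh() {
        for (Map.Entry<String, VersionEntry> e : versionMap.entrySet()) {
            if (e.getValue().deleted) {
                searchable.remove(e.getKey());
            } else {
                searchable.put(e.getKey(), log.get(e.getValue().logIndex));
            }
        }
        versionMap.clear();
    }

    /** Real-time get: consult the version map first, then fall back to searchable data. */
    String get(String id) {
        VersionEntry v = versionMap.get(id);
        if (v != null) {
            return v.deleted ? null : log.get(v.logIndex);
        }
        return searchable.get(id);
    }

    /** What a search would see: only data that has been refreshed. */
    boolean visibleToSearch(String id) {
        return searchable.containsKey(id);
    }
}
```

In this model, index followed immediately by get returns the new source while visibleToSearch is still false; after refresh both agree.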

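Finally, there is the checkpoint file. It records the translog's metadata and tells ES from which position in the translog to start replaying operations. In recoverFromTranslogInternal, replay covers the sequence numbers from localCheckpoint + 1 up to recoverUpToSeqNo.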

// org.elasticsearch.index.engine.InternalEngine#recoverFromTranslogInternal
private void recoverFromTranslogInternal(TranslogRecoveryRunner translogRecoveryRunner, long recoverUpToSeqNo) throws IOException {
    final int opsRecovered;
    final long localCheckpoint = getProcessedLocalCheckpoint();
    if (localCheckpoint < recoverUpToSeqNo) {
        try (Translog.Snapshot snapshot = translog.newSnapshot(localCheckpoint + 1, recoverUpToSeqNo)) {
            opsRecovered = translogRecoveryRunner.run(this, snapshot);
        } catch (Exception e) {
            throw new EngineException(shardId, "failed to recover from translog", e);
        }
    } else {
        opsRecovered = 0;
    }
    // flush if we recovered something or if we have references to older translogs
    // note: if opsRecovered == 0 and we have older translogs it means they are corrupted or 0 length.
    assert pendingTranslogRecovery.get() : "translogRecovery is not pending but should be";
    pendingTranslogRecovery.set(false); // we are good - now we can commit
    logger.trace(
        () -> format(
            "flushing post recovery from translog: ops recovered [%s], current translog generation [%s]",
            opsRecovered,
            translog.currentFileGeneration()
        )
    );
    flush(false, true);
    translog.trimUnreferencedReaders();
}
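The recovery step amounts to replaying every logged operation whose sequence number lies above the local checkpoint and at or below the upper bound. The following is a minimal sketch of that filter-and-apply loop. ReplaySketch, Op, and recover are hypothetical names; the index is modelled as a Map, a null source is assumed to mean a delete, and both ends of the replay range are assumed to be inclusive.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of replaying logged operations above a checkpoint.
final class ReplaySketch {

    static final class Op {
        final long seqNo;
        final String id;
        final String source; // null means delete (assumption of this sketch)

        Op(long seqNo, String id, String source) {
            this.seqNo = seqNo;
            this.id = id;
            this.source = source;
        }
    }

    /**
     * Applies every operation with localCheckpoint < seqNo <= recoverUpToSeqNo
     * to the index and returns how many operations were replayed.
     */
    static int recover(List<Op> translog, long localCheckpoint, long recoverUpToSeqNo, Map<String, String> index) {
        if (localCheckpoint >= recoverUpToSeqNo) {
            return 0; // everything up to the bound is already in the index
        }
        int opsRecovered = 0;
        for (Op op : translog) {
            if (op.seqNo <= localCheckpoint || op.seqNo > recoverUpToSeqNo) {
                continue; // already persisted, or beyond the requested bound
            }
            if (op.source == null) {
                index.remove(op.id);
            } else {
                index.put(op.id, op.source);
            }
            opsRecovered++;
        }
        return opsRecovered;
    }
}
```

Skipping operations at or below the checkpoint is what keeps replay cheap: only the tail of the log that was not yet persisted elsewhere has to be applied again.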
