The translog is a key component by which Elasticsearch guarantees data durability and disaster recovery. Each index shard has its own translog, and additions and updates to indexed data are recorded in it.
The translog is essentially a rolling log file. Compared with a Lucene write, appending to a log file is a relatively lightweight operation; the translog is periodically synced to disk.
Writing to the translog:
```java
// org.elasticsearch.index.translog.Translog#add
/**
 * Adds an operation to the transaction log.
 *
 * @param operation the operation to add
 * @return the location of the operation in the translog
 * @throws IOException if adding the operation to the translog resulted in an I/O exception
 */
public Location add(final Operation operation) throws IOException {
    final ReleasableBytesStreamOutput out = new ReleasableBytesStreamOutput(bigArrays);
    try {
        final long start = out.position();
        out.skip(Integer.BYTES);
        writeOperationNoSize(new BufferedChecksumStreamOutput(out), operation);
        final long end = out.position();
        final int operationSize = (int) (end - Integer.BYTES - start);
        out.seek(start);
        out.writeInt(operationSize);
        out.seek(end);
        final BytesReference bytes = out.bytes();
        try (ReleasableLock ignored = readLock.acquire()) {
            ensureOpen();
            if (operation.primaryTerm() > current.getPrimaryTerm()) {
                assert false
                    : "Operation term is newer than the current term; "
                        + "current term["
                        + current.getPrimaryTerm()
                        + "], operation term["
                        + operation
                        + "]";
                throw new IllegalArgumentException(
                    "Operation term is newer than the current term; "
                        + "current term["
                        + current.getPrimaryTerm()
                        + "], operation term["
                        + operation
                        + "]"
                );
            }
            return current.add(bytes, operation.seqNo());
        }
    } catch (final AlreadyClosedException | IOException ex) {
        closeOnTragicEvent(ex);
        throw ex;
    } catch (final Exception ex) {
        closeOnTragicEvent(ex);
        throw new TranslogException(shardId, "Failed to write operation [" + operation + "]", ex);
    } finally {
        Releasables.close(out);
    }
}
```
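The core of this method is size-prefix framing: reserve `Integer.BYTES` for a length slot, write the serialized operation, then seek back and fill in its size. A minimal stand-alone sketch of that pattern, using a plain `ByteBuffer` instead of ES's `ReleasableBytesStreamOutput` (`SizePrefixedFrame` and `frame` are hypothetical names, not ES code):

```java
import java.nio.ByteBuffer;

public class SizePrefixedFrame {
    public static byte[] frame(byte[] operation) {
        ByteBuffer out = ByteBuffer.allocate(Integer.BYTES + operation.length);
        out.position(Integer.BYTES);                        // like out.skip(Integer.BYTES): leave room for the size
        out.put(operation);                                 // like writeOperationNoSize(...)
        int operationSize = out.position() - Integer.BYTES; // end - Integer.BYTES - start
        out.putInt(0, operationSize);                       // seek back to the slot and write the size
        return out.array();
    }

    public static void main(String[] args) {
        byte[] framed = frame(new byte[] {1, 2, 3});
        System.out.println(framed.length);                    // 4-byte header + 3 payload bytes = 7
        System.out.println(ByteBuffer.wrap(framed).getInt()); // prefix holds the payload size: 3
    }
}
```

The prefix lets a reader (or recovery) know exactly how many bytes belong to the next operation without a delimiter scan.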
Operation is an interface; Index and Delete are its implementations. Adds, updates, and deletes of indexed data in ES can all be expressed with Index and Delete.

```java
public interface Operation
public static class Index implements Operation
public static class Delete implements Operation
```

current is a TranslogWriter. Translog writes are flushed to disk asynchronously: an operation is first written to an in-memory output stream, and a scheduled task syncs it to disk later.
```java
// org.elasticsearch.index.translog.TranslogWriter#add
/**
 * Add the given bytes to the translog with the specified sequence number; returns the location the bytes were written to.
 *
 * @param data the bytes to write
 * @param seqNo the sequence number associated with the operation
 * @return the location the bytes were written to
 * @throws IOException if writing to the translog resulted in an I/O exception
 */
public Translog.Location add(final BytesReference data, final long seqNo) throws IOException {
    long bufferedBytesBeforeAdd = this.bufferedBytes;
    if (bufferedBytesBeforeAdd >= forceWriteThreshold) {
        writeBufferedOps(Long.MAX_VALUE, bufferedBytesBeforeAdd >= forceWriteThreshold * 4);
    }

    final Translog.Location location;
    synchronized (this) {
        ensureOpen();
        if (buffer == null) {
            buffer = new ReleasableBytesStreamOutput(bigArrays);
        }
        assert bufferedBytes == buffer.size();
        final long offset = totalOffset;
        totalOffset += data.length();
        data.writeTo(buffer);

        assert minSeqNo != SequenceNumbers.NO_OPS_PERFORMED || operationCounter == 0;
        assert maxSeqNo != SequenceNumbers.NO_OPS_PERFORMED || operationCounter == 0;

        minSeqNo = SequenceNumbers.min(minSeqNo, seqNo);
        maxSeqNo = SequenceNumbers.max(maxSeqNo, seqNo);

        nonFsyncedSequenceNumbers.add(seqNo);

        operationCounter++;

        assert assertNoSeqNumberConflict(seqNo, data);

        location = new Translog.Location(generation, offset, data.length());
        bufferedBytes = buffer.size();
    }

    return location;
}
```
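Stripped of thresholds and assertions, the buffering above reduces to a tiny model: under a lock, append the bytes to an in-memory buffer, advance a running offset, and hand back a (generation, offset, size) location. `MiniTranslogWriter` below is a hypothetical miniature of that idea, not ES code:

```java
import java.io.ByteArrayOutputStream;

public class MiniTranslogWriter {
    // mirrors Translog.Location: which file generation, at what offset, how many bytes
    public record Location(long generation, long offset, int size) {}

    private final long generation;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream(); // in-memory, not yet on disk
    private long totalOffset = 0; // running offset within this generation

    public MiniTranslogWriter(long generation) { this.generation = generation; }

    public synchronized Location add(byte[] data) {
        long offset = totalOffset;     // where this operation starts
        totalOffset += data.length;    // advance for the next operation
        buffer.writeBytes(data);
        return new Location(generation, offset, data.length);
    }

    public static void main(String[] args) {
        MiniTranslogWriter w = new MiniTranslogWriter(1);
        System.out.println(w.add(new byte[10])); // Location[generation=1, offset=0, size=10]
        System.out.println(w.add(new byte[5]));  // Location[generation=1, offset=10, size=5]
    }
}
```

The returned Location is what makes fast positional reads possible later; no scan of the log is ever needed to find an operation again.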
The scheduled task that performs the asynchronous flush. File-durability schemes generally follow one of two patterns: flush asynchronously on a timer, or flush on every request (i.e., synchronous writes). The default async flush interval is 5 s.
```java
// org.elasticsearch.index.IndexService.AsyncTranslogFSync
/**
 * FSyncs the translog for all shards of this index in a defined interval.
 */
static final class AsyncTranslogFSync extends BaseAsyncTask {

    AsyncTranslogFSync(IndexService indexService) {
        super(indexService, indexService.getIndexSettings().getTranslogSyncInterval());
    }

    @Override
    protected String getThreadPool() {
        return ThreadPool.Names.FLUSH;
    }

    @Override
    protected void runInternal() {
        indexService.maybeFSyncTranslogs();
    }

    void updateIfNeeded() {
        final TimeValue newInterval = indexService.getIndexSettings().getTranslogSyncInterval();
        if (newInterval.equals(getInterval()) == false) {
            setInterval(newInterval);
        }
    }

    @Override
    public String toString() {
        return "translog_sync";
    }
}

// org.elasticsearch.index.IndexService#maybeFSyncTranslogs
private void maybeFSyncTranslogs() {
    if (indexSettings.getTranslogDurability() == Translog.Durability.ASYNC) {
        for (IndexShard shard : this.shards.values()) {
            try {
                if (shard.isSyncNeeded()) {
                    shard.sync();
                }
            } catch (AlreadyClosedException ex) {
                // fine - continue;
            } catch (IOException e) {
                logger.warn("failed to sync translog", e);
            }
        }
    }
}

// org.elasticsearch.index.shard.IndexShard#sync()
public void sync() throws IOException {
    verifyNotClosed();
    getEngine().syncTranslog();
}

// org.elasticsearch.index.engine.InternalEngine#syncTranslog
@Override
public void syncTranslog() throws IOException {
    translog.sync();
    revisitIndexDeletionPolicyOnTranslogSynced();
}

// org.elasticsearch.index.translog.TranslogWriter#syncUpTo
/**
 * Syncs the translog up to at least the given offset unless already synced
 *
 * @return <code>true</code> if this call caused an actual sync operation
 */
final boolean syncUpTo(long offset) throws IOException {
    if (lastSyncedCheckpoint.offset < offset && syncNeeded()) {
        synchronized (syncLock) { // only one sync/checkpoint should happen concurrently but we wait
            if (lastSyncedCheckpoint.offset < offset && syncNeeded()) {
                // double checked locking - we don't want to fsync unless we have to and now that we have
                // the lock we should check again since if this code is busy we might have fsynced enough already
                final Checkpoint checkpointToSync;
                final List<Long> flushedSequenceNumbers;
                final ReleasableBytesReference toWrite;
                try (ReleasableLock toClose = writeLock.acquire()) {
                    synchronized (this) {
                        ensureOpen();
                        checkpointToSync = getCheckpoint();
                        toWrite = pollOpsToWrite();
                        if (nonFsyncedSequenceNumbers.isEmpty()) {
                            flushedSequenceNumbers = null;
                        } else {
                            flushedSequenceNumbers = nonFsyncedSequenceNumbers;
                            nonFsyncedSequenceNumbers = new ArrayList<>(64);
                        }
                    }

                    try {
                        // Write ops will release operations.
                        writeAndReleaseOps(toWrite);
                        assert channel.position() == checkpointToSync.offset;
                    } catch (final Exception ex) {
                        closeWithTragicEvent(ex);
                        throw ex;
                    }
                }
                // now do the actual fsync outside of the synchronized block such that
                // we can continue writing to the buffer etc.
                try {
                    assert lastSyncedCheckpoint.offset != checkpointToSync.offset || toWrite.length() == 0;
                    if (lastSyncedCheckpoint.offset != checkpointToSync.offset) {
                        channel.force(false);
                    }
                    writeCheckpoint(checkpointChannel, checkpointPath, checkpointToSync);
                } catch (final Exception ex) {
                    closeWithTragicEvent(ex);
                    throw ex;
                }
                if (flushedSequenceNumbers != null) {
                    flushedSequenceNumbers.forEach(persistedSequenceNumberConsumer::accept);
                }
                assert lastSyncedCheckpoint.offset <= checkpointToSync.offset
                    : "illegal state: " + lastSyncedCheckpoint.offset + " <= " + checkpointToSync.offset;
                lastSyncedCheckpoint = checkpointToSync; // write protected by syncLock
                return true;
            }
        }
    }
    return false;
}
```
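The async-durability pattern above can be sketched in a few lines: writes only advance an in-memory watermark, and a background task periodically syncs whatever is outstanding. `AsyncFsyncSketch` is a hypothetical toy (the real fsync is `channel.force(false)`; here "sync" just records the watermark), and the 50 ms interval stands in for ES's 5 s default:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class AsyncFsyncSketch {
    private final AtomicLong written = new AtomicLong(); // bytes appended to the in-memory buffer
    private final AtomicLong synced = new AtomicLong();  // bytes already "forced" to disk

    void write(long bytes) { written.addAndGet(bytes); }            // stand-in for TranslogWriter#add
    boolean syncNeeded()   { return synced.get() < written.get(); } // mirrors IndexShard#isSyncNeeded
    void sync()            { synced.set(written.get()); }           // stand-in for channel.force(false)
    long syncedBytes()     { return synced.get(); }

    public static void main(String[] args) throws InterruptedException {
        AsyncFsyncSketch log = new AsyncFsyncSketch();
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        // ES defaults index.translog.sync_interval to 5s; 50ms here so the demo finishes quickly
        pool.scheduleWithFixedDelay(() -> { if (log.syncNeeded()) log.sync(); }, 50, 50, TimeUnit.MILLISECONDS);
        log.write(42);               // the write returns before the data is durable
        Thread.sleep(300);           // give the background task a chance to run
        pool.shutdown();
        System.out.println(log.syncedBytes()); // 42 once the periodic sync has fired
    }
}
```

The trade-off is the same as in ES's `Durability.ASYNC` mode: writes are fast because they never wait on the disk, but up to one sync interval of acknowledged operations can be lost on a crash.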
Next, let's look at reading from the translog. When is the translog read? One case is disaster recovery: if Lucene data has not yet been persisted when ES shuts down, the data is recovered from the translog on restart (provided the translog itself was persisted). The other case is realtime get-by-id: if we index a new document and immediately query it by some condition other than its id, the document is not found; but querying by id does find it, because the latest document data can still be read from the translog.
Reading from the translog:
```java
// org.elasticsearch.index.translog.Translog#readOperation(org.elasticsearch.index.translog.Translog.Location)
/**
 * Reads and returns the operation from the given location if the generation it references is still available. Otherwise
 * this method will return <code>null</code>.
 */
public Operation readOperation(Location location) throws IOException {
    try (ReleasableLock ignored = readLock.acquire()) {
        ensureOpen();
        if (location.generation < getMinFileGeneration()) {
            return null;
        }
        if (current.generation == location.generation) {
            // no need to fsync here the read operation will ensure that buffers are written to disk
            // if they are still in RAM and we are reading onto that position
            return current.read(location);
        } else {
            // read backwards - it's likely we need to read on that is recent
            for (int i = readers.size() - 1; i >= 0; i--) {
                TranslogReader translogReader = readers.get(i);
                if (translogReader.generation == location.generation) {
                    return translogReader.read(location);
                }
            }
        }
    } catch (final Exception ex) {
        closeOnTragicEvent(ex);
        throw ex;
    }
    return null;
}
```
The Translog.Location object returned by the translog write method above now comes into play: the Location points to the exact position in the translog to read from, so the Operation can be looked up in reverse.
Looking at ES's get-by-id method: line 16 pulls the versionValue out of the version map (for newly indexed data it is an IndexVersionValue), and line 45 reads the translog at the Location it carries.
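Mechanically, a positional read is just "seek to offset, read size bytes". `LocationRead` below is an illustrative stand-in (the real `readOperation` also matches the generation and verifies the size prefix and checksum, which this sketch omits):

```java
import java.util.Arrays;

public class LocationRead {
    // Slice `size` bytes starting at `offset` out of the raw translog bytes
    public static byte[] read(byte[] translogFile, long offset, int size) {
        return Arrays.copyOfRange(translogFile, (int) offset, (int) offset + size);
    }

    public static void main(String[] args) {
        byte[] file = {9, 9, 1, 2, 3, 9};                      // earlier ops, then our 3-byte op, then more
        System.out.println(Arrays.toString(read(file, 2, 3))); // [1, 2, 3]
    }
}
```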
```java
1 // org.elasticsearch.index.engine.InternalEngine#get
2 @Override
3 public GetResult get(
4     Get get,
5     MappingLookup mappingLookup,
6     DocumentParser documentParser,
7     Function<Engine.Searcher, Engine.Searcher> searcherWrapper
8 ) {
9     assert Objects.equals(get.uid().field(), IdFieldMapper.NAME) : get.uid().field();
10     try (ReleasableLock ignored = readLock.acquire()) {
11         ensureOpen();
12         if (get.realtime()) {
13             final VersionValue versionValue;
14             try (Releasable ignore = versionMap.acquireLock(get.uid().bytes())) {
15                 // we need to lock here to access the version map to do this truly in RT
16                 versionValue = getVersionFromMap(get.uid().bytes());
17             }
18             if (versionValue != null) {
19                 if (versionValue.isDelete()) {
20                     return GetResult.NOT_EXISTS;
21                 }
22                 if (get.versionType().isVersionConflictForReads(versionValue.version, get.version())) {
23                     throw new VersionConflictEngineException(
24                         shardId,
25                         "[" + get.id() + "]",
26                         get.versionType().explainConflictForReads(versionValue.version, get.version())
27                     );
28                 }
29                 if (get.getIfSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO
30                     && (get.getIfSeqNo() != versionValue.seqNo || get.getIfPrimaryTerm() != versionValue.term)) {
31                     throw new VersionConflictEngineException(
32                         shardId,
33                         get.id(),
34                         get.getIfSeqNo(),
35                         get.getIfPrimaryTerm(),
36                         versionValue.seqNo,
37                         versionValue.term
38                     );
39                 }
40                 if (get.isReadFromTranslog()) {
41                     // this is only used for updates - API _GET calls will always read form a reader for consistency
42                     // the update call doesn't need the consistency since it's source only + _parent but parent can go away in 7.0
43                     if (versionValue.getLocation() != null) {
44                         try {
45                             final Translog.Operation operation = translog.readOperation(versionValue.getLocation());
46                             if (operation != null) {
47                                 return getFromTranslog(get, (Translog.Index) operation, mappingLookup, documentParser, searcherWrapper);
48                             }
49                         } catch (IOException e) {
50                             maybeFailEngine("realtime_get", e); // lets check if the translog has failed with a tragic event
51                             throw new EngineException(shardId, "failed to read operation from translog", e);
52                         }
53                     } else {
54                         trackTranslogLocation.set(true);
55                     }
56                 }
57                 assert versionValue.seqNo >= 0 : versionValue;
58                 refreshIfNeeded("realtime_get", versionValue.seqNo);
59             }
60             return getFromSearcher(get, acquireSearcher("realtime_get", SearcherScope.INTERNAL, searcherWrapper), false);
61         } else {
62             // we expose what has been externally expose in a point in time snapshot via an explicit refresh
63             return getFromSearcher(get, acquireSearcher("get", SearcherScope.EXTERNAL, searcherWrapper), false);
64         }
65     }
66 }
```
```java
final class IndexVersionValue extends VersionValue {

    private static final long RAM_BYTES_USED = RamUsageEstimator.shallowSizeOfInstance(IndexVersionValue.class);

    private final Translog.Location translogLocation;

    IndexVersionValue(Translog.Location translogLocation, long version, long seqNo, long term) {
        super(version, seqNo, term);
        this.translogLocation = translogLocation;
    }

    @Override
    public long ramBytesUsed() {
        return RAM_BYTES_USED + RamUsageEstimator.shallowSizeOf(translogLocation);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        if (super.equals(o) == false) return false;
        IndexVersionValue that = (IndexVersionValue) o;
        return Objects.equals(translogLocation, that.translogLocation);
    }

    @Override
    public int hashCode() {
        return Objects.hash(super.hashCode(), translogLocation);
    }

    @Override
    public String toString() {
        return "IndexVersionValue{" + "version=" + version + ", seqNo=" + seqNo + ", term=" + term + ", location=" + translogLocation + '}';
    }

    @Override
    public Translog.Location getLocation() {
        return translogLocation;
    }
}
```
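Putting the pieces together, the realtime-get behavior described earlier can be modeled with a version map keyed by `_id` that remembers each not-yet-refreshed document's translog Location. `RealtimeGetSketch` is a hypothetical toy model (plain maps stand in for the Lucene searcher and for `readOperation`), not ES code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RealtimeGetSketch {
    record Location(long generation, long offset, int size) {}

    private final Map<String, Location> versionMap = new ConcurrentHashMap<>(); // _id -> translog Location
    private final Map<String, String> searcher = new ConcurrentHashMap<>();     // stand-in for a refreshed Lucene reader
    private final Map<Location, String> translog = new ConcurrentHashMap<>();   // stand-in for Translog#readOperation

    void index(String id, String source, Location loc) {
        translog.put(loc, source);
        versionMap.put(id, loc);   // visible to get-by-id immediately, before any refresh
    }

    void refresh() {               // refresh makes docs searchable and drains the version map
        versionMap.forEach((id, loc) -> searcher.put(id, translog.get(loc)));
        versionMap.clear();
    }

    String get(String id) {        // realtime get: version map first, searcher second
        Location loc = versionMap.get(id);
        return loc != null ? translog.get(loc) : searcher.get(id);
    }

    String search(String id) { return searcher.get(id); } // non-id queries see only refreshed docs

    public static void main(String[] args) {
        RealtimeGetSketch engine = new RealtimeGetSketch();
        engine.index("doc1", "{\"f\":1}", new Location(1, 0, 7));
        System.out.println(engine.search("doc1")); // null - not refreshed yet
        System.out.println(engine.get("doc1"));    // {"f":1} - served from the translog
    }
}
```

This mirrors why a freshly indexed document is invisible to searches but visible to get-by-id: search goes through the refreshed reader, while get-by-id consults the version map and reads the source straight from the translog.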
Finally, there is the checkpoint file. The checkpoint file records the translog's metadata and tells ES from which position in the translog to start replaying data.
```java
// org.elasticsearch.index.engine.InternalEngine#recoverFromTranslogInternal
private void recoverFromTranslogInternal(TranslogRecoveryRunner translogRecoveryRunner, long recoverUpToSeqNo) throws IOException {
    final int opsRecovered;
    final long localCheckpoint = getProcessedLocalCheckpoint();
    if (localCheckpoint < recoverUpToSeqNo) {
        try (Translog.Snapshot snapshot = translog.newSnapshot(localCheckpoint + 1, recoverUpToSeqNo)) {
            opsRecovered = translogRecoveryRunner.run(this, snapshot);
        } catch (Exception e) {
            throw new EngineException(shardId, "failed to recover from translog", e);
        }
    } else {
        opsRecovered = 0;
    }
    // flush if we recovered something or if we have references to older translogs
    // note: if opsRecovered == 0 and we have older translogs it means they are corrupted or 0 length.
    assert pendingTranslogRecovery.get() : "translogRecovery is not pending but should be";
    pendingTranslogRecovery.set(false); // we are good - now we can commit
    logger.trace(
        () -> format(
            "flushing post recovery from translog: ops recovered [%s], current translog generation [%s]",
            opsRecovered,
            translog.currentFileGeneration()
        )
    );
    flush(false, true);
    translog.trimUnreferencedReaders();
}
```
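The recovery decision above boils down to a sequence-number window: the local checkpoint marks the highest seqNo already safely processed, so only translog operations in `(localCheckpoint, recoverUpToSeqNo]` need replaying. `RecoveryWindow` below is a hypothetical sketch of that filter, with a plain list of seqNos standing in for a translog snapshot:

```java
import java.util.List;
import java.util.stream.Collectors;

public class RecoveryWindow {
    // Replay only ops strictly above the local checkpoint, up to the requested bound,
    // mirroring translog.newSnapshot(localCheckpoint + 1, recoverUpToSeqNo)
    public static List<Long> opsToReplay(List<Long> translogSeqNos, long localCheckpoint, long recoverUpToSeqNo) {
        return translogSeqNos.stream()
            .filter(seq -> seq > localCheckpoint && seq <= recoverUpToSeqNo)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // the index is durable up to seqNo 2; the translog holds ops 0..5
        System.out.println(opsToReplay(List.of(0L, 1L, 2L, 3L, 4L, 5L), 2, Long.MAX_VALUE)); // [3, 4, 5]
    }
}
```

Operations at or below the checkpoint are already reflected in the last Lucene commit, which is why replaying them again would be wasted (or, without idempotent handling, incorrect) work.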