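PostgreSQL Source Code Walkthrough (106) - WAL#3 (Insert & WAL-heap_i...
This section covers the WAL-related processing performed when data is inserted, mainly the heap_insert -> XLogInsert path.
I. Data Structures
Static variables
Shared globally within the (backend) process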
/*
* An array of XLogRecData structs, to hold registered data.
*/
static XLogRecData *rdatas;
static int num_rdatas; /* entries currently used */
static int max_rdatas; /* allocated size */
/* has XLogBeginInsert() been called? */
static bool begininsert_called = false;
Macros and type definitions
typedef char* Pointer; /* generic pointer */
typedef Pointer Page;//Page
#define XLOG_HEAP_INSERT 0x00
/*
* Pointer to a location in the XLOG. These pointers are 64 bits wide,
* because we don't want them ever to overflow.
*/
typedef uint64 XLogRecPtr;
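A minimal sketch of how a 64-bit XLogRecPtr is usually shown to humans: it is split into its high and low 32-bit halves and printed as %X/%X, the same form the WAL_DEBUG code in XLogInsertRecord uses. The helper names lsn_hi, lsn_lo and format_lsn are made up for this illustration, and the typedef is redeclared locally so that the snippet compiles on its own.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local stand-in for PostgreSQL's typedef, so the sketch compiles alone. */
typedef uint64_t XLogRecPtr;

/* Hypothetical helpers: the two 32-bit halves of an LSN. */
static uint32_t
lsn_hi(XLogRecPtr lsn)
{
    return (uint32_t) (lsn >> 32);
}

static uint32_t
lsn_lo(XLogRecPtr lsn)
{
    return (uint32_t) lsn;
}

/* Hypothetical helper: render an LSN as "hi/lo" in upper-case hex. */
static void
format_lsn(XLogRecPtr lsn, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    snprintf(buf, buflen, "%X/%X",
             (unsigned int) lsn_hi(lsn), (unsigned int) lsn_lo(lsn));
}
```

For the EndPos value 5411228216 that appears in the trace in part III, format_lsn produces 1/4288CA38.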
/*
* Additional macros for access to page headers. (Beware multiple evaluation
* of the arguments!)
*/
#define PageGetLSN(page) \
PageXLogRecPtrGet(((PageHeader) (page))->pd_lsn)
#define PageSetLSN(page, lsn) \
PageXLogRecPtrSet(((PageHeader) (page))->pd_lsn, lsn)
/* Buffer size required to store a compressed version of backup block image */
#define PGLZ_MAX_BLCKSZ PGLZ_MAX_OUTPUT(BLCKSZ)
/*
* Fake spinlock implementation using semaphores --- slow and prone
* to fall foul of kernel limits on number of semaphores, so don't use this
* unless you must! The subroutines appear in spin.c.
*/
typedef int slock_t;
XLogCtl
Total shared-memory state for XLOG.
/*
* Total shared-memory state for XLOG.
*/
typedef struct XLogCtlData
{
XLogCtlInsert Insert; /* insertion control state */
/* Protected by info_lck: */
XLogwrtRqst LogwrtRqst;
XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */
uint32 ckptXidEpoch; /* nextXID & epoch of latest checkpoint */
TransactionId ckptXid;
XLogRecPtr asyncXactLSN; /* LSN of newest async commit/abort */
XLogRecPtr replicationSlotMinLSN; /* oldest LSN needed by any slot */
XLogSegNo lastRemovedSegNo; /* latest removed/recycled XLOG segment */
/* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */
XLogRecPtr unloggedLSN;
slock_t ulsn_lck;
/* Time and LSN of last xlog segment switch. Protected by WALWriteLock. */
pg_time_t lastSegSwitchTime;
XLogRecPtr lastSegSwitchLSN;
/*
* Protected by info_lck and WALWriteLock (you must hold either lock to
* read it, but both to update)
*/
XLogwrtResult LogwrtResult;
/*
* Latest initialized page in the cache (last byte position + 1).
*
* To change the identity of a buffer (and InitializedUpTo), you need to
* hold WALBufMappingLock. To change the identity of a buffer that's
* still dirty, the old page needs to be written out first, and for that
* you need WALWriteLock, and you need to ensure that there are no
* in-progress insertions to the page by calling
* WaitXLogInsertionsToFinish().
*/
XLogRecPtr InitializedUpTo;
/*
* These values do not change after startup, although the pointed-to pages
* and xlblocks values certainly do. xlblock values are protected by
* WALBufMappingLock.
*/
char *pages; /* buffers for unwritten XLOG pages */
XLogRecPtr *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
int XLogCacheBlck; /* highest allocated xlog buffer index */
/*
* Shared copy of ThisTimeLineID. Does not change after end-of-recovery.
* If we created a new timeline when the system was started up,
* PrevTimeLineID is the old timeline's ID that we forked off from.
* Otherwise it's equal to ThisTimeLineID.
*/
TimeLineID ThisTimeLineID;
TimeLineID PrevTimeLineID;
/*
* SharedRecoveryInProgress indicates if we're still in crash or archive
* recovery. Protected by info_lck.
*/
bool SharedRecoveryInProgress;
/*
* SharedHotStandbyActive indicates if we're still in crash or archive
* recovery. Protected by info_lck.
*/
bool SharedHotStandbyActive;
/*
* WalWriterSleeping indicates whether the WAL writer is currently in
* low-power mode (and hence should be nudged if an async commit occurs).
* Protected by info_lck.
*/
bool WalWriterSleeping;
/*
* recoveryWakeupLatch is used to wake up the startup process to continue
* WAL replay, if it is waiting for WAL to arrive or failover trigger file
* to appear.
*/
Latch recoveryWakeupLatch;
/*
* During recovery, we keep a copy of the latest checkpoint record here.
* lastCheckPointRecPtr points to start of checkpoint record and
* lastCheckPointEndPtr points to end+1 of checkpoint record. Used by the
* checkpointer when it wants to create a restartpoint.
*
* Protected by info_lck.
*/
XLogRecPtr lastCheckPointRecPtr;
XLogRecPtr lastCheckPointEndPtr;
CheckPoint lastCheckPoint;
/*
* lastReplayedEndRecPtr points to end+1 of the last record successfully
* replayed. When we're currently replaying a record, ie. in a redo
* function, replayEndRecPtr points to the end+1 of the record being
* replayed, otherwise it's equal to lastReplayedEndRecPtr.
*/
XLogRecPtr lastReplayedEndRecPtr;
TimeLineID lastReplayedTLI;
XLogRecPtr replayEndRecPtr;
TimeLineID replayEndTLI;
/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
TimestampTz recoveryLastXTime;
/*
* timestamp of when we started replaying the current chunk of WAL data,
* only relevant for replication or archive recovery
*/
TimestampTz currentChunkStartTime;
/* Are we requested to pause recovery? */
bool recoveryPause;
/*
* lastFpwDisableRecPtr points to the start of the last replayed
* XLOG_FPW_CHANGE record that instructs full_page_writes is disabled.
*/
XLogRecPtr lastFpwDisableRecPtr;
slock_t info_lck; /* locks shared variables shown above */
} XLogCtlData;
static XLogCtlData *XLogCtl = NULL;
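II. Source Code Walkthrough
heap_insert
Its main job is to insert a tuple into the heap; part of that work is the WAL (XLog) handling.
See PostgreSQL Source Code Walkthrough (104) - WAL#1 (Insert & WAL - heap_insert function #1).
XLogInsert
Inserts an XLOG record with the specified RMID and info bytes; the body of the record is the data and buffer references registered earlier through XLogRegister* calls.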
/*
* Insert an XLOG record having the specified RMID and info bytes, with the
* body of the record being the data and buffer references registered earlier
* with XLogRegister* calls.
*
* Returns XLOG pointer to end of record (beginning of next record).
* This can be used as LSN for data pages affected by the logged action.
* (LSN is the XLOG point up to which the XLOG must be flushed to disk
* before the data page can be written out. This implements the basic
* WAL rule "write the log before the data".)
*/
XLogRecPtr
XLogInsert(RmgrId rmid, uint8 info)
{
XLogRecPtr EndPos; /* XLogRecPtr is a uint64 */
/* XLogBeginInsert() must have been called. */
if (!begininsert_called)
elog(ERROR, "XLogBeginInsert was not called");
/*
* The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and
* XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
*/
if ((info & ~(XLR_RMGR_INFO_MASK |
XLR_SPECIAL_REL_UPDATE |
XLR_CHECK_CONSISTENCY)) != 0)
elog(PANIC, "invalid xlog info mask %02X", info);
TRACE_POSTGRESQL_WAL_INSERT(rmid, info);
/*
* In bootstrap mode, we don't actually log anything but XLOG resources;
* return a phony record pointer.
*/
if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
{
XLogResetInsertion();
EndPos = SizeOfXLogLongPHD; /* start of 1st chkpt record */
return EndPos;
}
do
{
/* retry until XLogInsertRecord returns a valid position */
XLogRecPtr RedoRecPtr;
bool doPageWrites;
XLogRecPtr fpw_lsn;
XLogRecData *rdt;
/*
* Get values needed to decide whether to do full-page writes. Since
* we don't yet have an insertion lock, these could change under us,
* but XLogInsertRecord will recheck them once it has a lock.
*/
GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
&fpw_lsn);
/* curinsert_flags is a uint8 */
EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags);
} while (EndPos == InvalidXLogRecPtr);
XLogResetInsertion();
return EndPos;
}
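The info-byte check at the top of XLogInsert can be tried in isolation. The sketch below is an illustration rather than PostgreSQL code: the flag values are the ones I believe xlogrecord.h defines for the releases of this period (treat them as assumptions), and the helper names info_bits_allowed and rmgr_info are hypothetical. The high four bits belong to the resource manager, and the two listed flags are the only low bits a caller may set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, redeclared here so the sketch is self-contained. */
#define XLR_INFO_MASK           0x0F
#define XLR_RMGR_INFO_MASK      0xF0
#define XLR_SPECIAL_REL_UPDATE  0x01
#define XLR_CHECK_CONSISTENCY   0x02
#define XLOG_HEAP_INSERT        0x00

/* Hypothetical helper: true if 'info' sets only bits a caller may set. */
static bool
info_bits_allowed(uint8_t info)
{
    return (info & ~(XLR_RMGR_INFO_MASK |
                     XLR_SPECIAL_REL_UPDATE |
                     XLR_CHECK_CONSISTENCY)) == 0;
}

/*
 * Hypothetical helper: the rmgr-specific part of xl_info, extracted the
 * same way XLogInsertRecord does before testing for XLOG_SWITCH.
 */
static uint8_t
rmgr_info(uint8_t xl_info)
{
    return (uint8_t) (xl_info & ~XLR_INFO_MASK);
}
```

With these definitions, the info value 0 used for a plain heap insert passes the check, while a value such as 0x04 would trigger the PANIC branch.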
XLogInsertRecord
Inserts an XLOG record represented by an already-constructed chain of data chunks.
/*
* Insert an XLOG record represented by an already-constructed chain of data
* chunks. This is a low-level routine; to construct the WAL record header
* and data, use the higher-level routines in xloginsert.c.
*
* If 'fpw_lsn' is valid, it is the oldest LSN among the pages that this
* WAL record applies to, that were not included in the record as full page
* images. If fpw_lsn <= RedoRecPtr, the function does not perform the
* insertion and returns InvalidXLogRecPtr. The caller can then recalculate
* which pages need a full-page image, and retry. If fpw_lsn is invalid, the
* record is always inserted.
*
* 'flags' gives more in-depth control on the record being inserted. See
* XLogSetRecordFlags() for details.
*
* The first XLogRecData in the chain must be for the record header, and its
* data must be MAXALIGNed. XLogInsertRecord fills in the xl_prev and
* xl_crc fields in the header, the rest of the header must already be filled
* by the caller.
*
* Returns XLOG pointer to end of record (beginning of next record).
* This can be used as LSN for data pages affected by the logged action.
* (LSN is the XLOG point up to which the XLOG must be flushed to disk
* before the data page can be written out. This implements the basic
* WAL rule "write the log before the data".)
*/
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata,
XLogRecPtr fpw_lsn,
uint8 flags)
{
XLogCtlInsert *Insert = &XLogCtl->Insert; /* XLOG insertion control state */
pg_crc32c rdata_crc; /* pg_crc32c is a uint32 */
bool inserted;
XLogRecord *rechdr = (XLogRecord *) rdata->data;
uint8 info = rechdr->xl_info & ~XLR_INFO_MASK;
bool isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
info == XLOG_SWITCH);
XLogRecPtr StartPos;
XLogRecPtr EndPos;
bool prevDoPageWrites = doPageWrites;
/* we assume that all of the record header is in the first chunk */
Assert(rdata->len >= SizeOfXLogRecord);
/* cross-check on whether we should be here or not */
if (!XLogInsertAllowed())
elog(ERROR, "cannot make new WAL entries during recovery");
/*----------
*
* We have now done all the preparatory work we can without holding a
* lock or modifying shared state. From here on, inserting the new WAL
* record to the shared WAL buffer cache is a two-step process:
*
* 1. Reserve the right amount of space from the WAL. The current head of
* reserved space is kept in Insert->CurrBytePos, and is protected by
* insertpos_lck.
*
* 2. Copy the record to the reserved WAL space. This involves finding the
* correct WAL buffer containing the reserved space, and copying the
* record in place. This can be done concurrently in multiple processes.
*
* To keep track of which insertions are still in-progress, each concurrent
* inserter acquires an insertion lock. In addition to just indicating that
* an insertion is in progress, the lock tells others how far the inserter
* has progressed. There is a small fixed number of insertion locks,
* determined by NUM_XLOGINSERT_LOCKS. When an inserter crosses a page
* boundary, it updates the value stored in the lock to the how far it has
* inserted, to allow the previous buffer to be flushed.
*
* Holding onto an insertion lock also protects RedoRecPtr and
* fullPageWrites from changing until the insertion is finished.
*
* Step 2 can usually be done completely in parallel. If the required WAL
* page is not initialized yet, you have to grab WALBufMappingLock to
* initialize it, but the WAL writer tries to do that ahead of insertions
* to avoid that from happening in the critical path.
*
*----------
*/
START_CRIT_SECTION();
if (isLogSwitch)
WALInsertLockAcquireExclusive();
else
WALInsertLockAcquire();
/*
* Check to see if my copy of RedoRecPtr is out of date. If so, may have
* to go back and have the caller recompute everything. This can only
* happen just after a checkpoint, so it's better to be slow in this case
* and fast otherwise.
*
* Also check to see if fullPageWrites or forcePageWrites was just turned
* on; if we weren't already doing full-page writes then go back and
* recompute.
*
* If we aren't doing full-page writes then RedoRecPtr doesn't actually
* affect the contents of the XLOG record, so we'll update our local copy
* but not force a recomputation. (If doPageWrites was just turned off,
* we could recompute the record without full pages, but we choose not to
* bother.)
*/
if (RedoRecPtr != Insert->RedoRecPtr)
{
Assert(RedoRecPtr < Insert->RedoRecPtr);
RedoRecPtr = Insert->RedoRecPtr;
}
doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
if (doPageWrites &&
(!prevDoPageWrites ||
(fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
{
/*
* Oops, some buffer now needs to be backed up that the caller didn't
* back up. Start over.
*/
WALInsertLockRelease();
END_CRIT_SECTION();
return InvalidXLogRecPtr;
}
/*
* Reserve space for the record in the WAL. This also sets the xl_prev
* pointer.
*/
if (isLogSwitch)
inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
else
{
ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
&rechdr->xl_prev);
inserted = true;
}
if (inserted)
{
/*
* Now that xl_prev has been filled in, calculate CRC of the record
* header.
*/
rdata_crc = rechdr->xl_crc;
COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc));
FIN_CRC32C(rdata_crc);
rechdr->xl_crc = rdata_crc;
/*
* All the record data, including the header, is now ready to be
* inserted. Copy the record in the space reserved.
*/
CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
StartPos, EndPos);
/*
* Unless record is flagged as not important, update LSN of last
* important record in the current slot. When holding all locks, just
* update the first one.
*/
if ((flags & XLOG_MARK_UNIMPORTANT) == 0)
{
int lockno = holdingAllLocks ? 0 : MyLockNo;
WALInsertLocks[lockno].l.lastImportantAt = StartPos;
}
}
else
{
/*
* This was an xlog-switch record, but the current insert location was
* already exactly at the beginning of a segment, so there was no need
* to do anything.
*/
}
/*
* Done! Let others know that we're finished.
*/
WALInsertLockRelease();
MarkCurrentTransactionIdLoggedIfAny();
END_CRIT_SECTION();
/*
* Update shared LogwrtRqst.Write, if we crossed page boundary.
*/
if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
{
SpinLockAcquire(&XLogCtl->info_lck);
/* advance global request to include new block(s) */
if (XLogCtl->LogwrtRqst.Write < EndPos)
XLogCtl->LogwrtRqst.Write = EndPos;
/* update local result copy while I have the chance */
LogwrtResult = XLogCtl->LogwrtResult;
SpinLockRelease(&XLogCtl->info_lck);
}
/*
* If this was an XLOG_SWITCH record, flush the record and the empty
* padding space that fills the rest of the segment, and perform
* end-of-segment actions (eg, notifying archiver).
*/
if (isLogSwitch)
{
TRACE_POSTGRESQL_WAL_SWITCH();
XLogFlush(EndPos);
/*
* Even though we reserved the rest of the segment for us, which is
* reflected in EndPos, we return a pointer to just the end of the
* xlog-switch record.
*/
if (inserted)
{
EndPos = StartPos + SizeOfXLogRecord;
if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
{
uint64 offset = XLogSegmentOffset(EndPos, wal_segment_size);
if (offset == EndPos % XLOG_BLCKSZ)
EndPos += SizeOfXLogLongPHD;
else
EndPos += SizeOfXLogShortPHD;
}
}
}
#ifdef WAL_DEBUG /* debug-only code */
if (XLOG_DEBUG)
{
static XLogReaderState *debug_reader = NULL;
StringInfoData buf;
StringInfoData recordBuf;
char *errormsg = NULL;
MemoryContext oldCxt;
oldCxt = MemoryContextSwitchTo(walDebugCxt);
initStringInfo(&buf);
appendStringInfo(&buf, "INSERT @ %X/%X: ",
(uint32) (EndPos >> 32), (uint32) EndPos);
/*
* We have to piece together the WAL record data from the XLogRecData
* entries, so that we can pass it to the rm_desc function as one
* contiguous chunk.
*/
initStringInfo(&recordBuf);
for (; rdata != NULL; rdata = rdata->next)
appendBinaryStringInfo(&recordBuf, rdata->data, rdata->len);
if (!debug_reader)
debug_reader = XLogReaderAllocate(wal_segment_size, NULL, NULL);
if (!debug_reader)
{
appendStringInfoString(&buf, "error decoding record: out of memory");
}
else if (!DecodeXLogRecord(debug_reader, (XLogRecord *) recordBuf.data,
&errormsg))
{
appendStringInfo(&buf, "error decoding record: %s",
errormsg ? errormsg : "no error message");
}
else
{
appendStringInfoString(&buf, " - ");
xlog_outdesc(&buf, debug_reader);
}
elog(LOG, "%s", buf.data);
pfree(buf.data);
pfree(recordBuf.data);
MemoryContextSwitchTo(oldCxt);
}
#endif
/*
* Update our global variables
*/
ProcLastRecPtr = StartPos;
XactLastRecEnd = EndPos;
return EndPos;
}
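The interplay between the "start over" test in XLogInsertRecord and the do/while loop in XLogInsert can be modelled with a toy. In the sketch below only must_reassemble follows the condition shown in the source above; ToyWal, toy_insert_record and toy_insert are hypothetical stand-ins that ignore locking, buffers and the record itself. In the real code the refreshed values come from GetFullPageWriteInfo at the top of the loop, which the toy approximates by copying the shared redo pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* The "start over" condition, as written in XLogInsertRecord. */
static bool
must_reassemble(bool doPageWrites, bool prevDoPageWrites,
                XLogRecPtr fpw_lsn, XLogRecPtr RedoRecPtr)
{
    return doPageWrites &&
        (!prevDoPageWrites ||
         (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr));
}

/* Toy shared state (hypothetical, not a PostgreSQL structure). */
typedef struct ToyWal
{
    XLogRecPtr shared_redo;   /* plays the role of Insert->RedoRecPtr */
    XLogRecPtr insert_pos;    /* end of the last inserted record */
    int        attempts;      /* number of insertion attempts so far */
} ToyWal;

/* Toy insert: refuses the record if the caller's redo pointer was stale. */
static XLogRecPtr
toy_insert_record(ToyWal *wal, XLogRecPtr caller_redo,
                  XLogRecPtr page_lsn, uint32_t len)
{
    /* The caller attached a page image only if page_lsn <= caller_redo. */
    XLogRecPtr fpw_lsn = (page_lsn > caller_redo) ? page_lsn
                                                  : InvalidXLogRecPtr;

    wal->attempts++;
    if (must_reassemble(true, true, fpw_lsn, wal->shared_redo))
        return InvalidXLogRecPtr;
    wal->insert_pos += len;
    return wal->insert_pos;
}

/* Toy version of the retry loop in XLogInsert. */
static XLogRecPtr
toy_insert(ToyWal *wal, XLogRecPtr *local_redo,
           XLogRecPtr page_lsn, uint32_t len)
{
    XLogRecPtr end;

    do
    {
        end = toy_insert_record(wal, *local_redo, page_lsn, len);
        if (end == InvalidXLogRecPtr)
            *local_redo = wal->shared_redo;   /* refresh, then reassemble */
    } while (end == InvalidXLogRecPtr);
    return end;
}
```

If a checkpoint has moved the shared redo pointer past the page's LSN while the caller still holds the old value, the first attempt is rejected and the second one, made with the refreshed pointer, succeeds.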
III. Trace Analysis
The test script is as follows:
insert into t_wal_partition(c1,c2,c3) VALUES(0,'HASH0','HAHS0');
Start gdb, set a breakpoint, and stop in XLogInsert
(gdb) b XLogInsert
Breakpoint 1 at 0x5652d6: file xloginsert.c, line 420.
(gdb) c
Continuing.
Breakpoint 1, XLogInsert (rmid=10 '\n', info=0 '\000') at xloginsert.c:420
420 if (!begininsert_called)
XLogBeginInsert() must already have been called at this point
(gdb) n
The caller may set the rmgr bits, XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY; all other bits are reserved
427 if ((info & ~(XLR_RMGR_INFO_MASK |
(gdb) n
432 TRACE_POSTGRESQL_WAL_INSERT(rmid, info);
Check for bootstrap mode, then enter the loop
(gdb) n
438 if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
(gdb)
457 GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
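Get the values needed to decide whether to do full-page writes (the first two prints below run before the call on line 457 has executed)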
(gdb) p *RedoRecPtr
$1 = 1166604425
(gdb) p doPageWrites
$2 = false
(gdb) n
459 rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
(gdb) p RedoRecPtr
$3 = 5411227832
(gdb) p doPageWrites
$4 = true
Assemble the record and obtain rdt
(gdb) n
462 EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags);
(gdb) p *rdt
$5 = {next = 0x2a911b8, data = 0x2a8f460 <incomplete sequence \322>, len = 51}
XLogInsertRecord: step into the function, which is called with
fpw_lsn=0, flags=1 '\001'
(gdb) step
XLogInsertRecord (rdata=0xf9cc70 <hdr_rdt>, fpw_lsn=0, flags=1 '\001') at xlog.c:970
970 XLogCtlInsert *Insert = &XLogCtl->Insert;
XLogInsertRecord: get the insertion control structure
(gdb) n
973 XLogRecord *rechdr = (XLogRecord *) rdata->data;
(gdb) p *Insert
$6 = {insertpos_lck = 0 '\000', CurrBytePos = 5395369608, PrevBytePos = 5395369552, pad = '\000' <repeats 127 times>,
RedoRecPtr = 5411227832, forcePageWrites = false, fullPageWrites = true, exclusiveBackupState = EXCLUSIVE_BACKUP_NONE,
nonExclusiveBackups = 0, lastBackupStart = 0, WALInsertLocks = 0x7fa2523d4100}
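CurrBytePos and PrevBytePos in this structure are "usable byte positions" that do not count WAL page headers, which is why they differ from the LSNs printed later in the trace. The sketch below models the conversion on what XLogBytePosToRecPtr in xlog.c does, under assumptions that are not stated in the article: 8192-byte WAL blocks, 16MB segments, a 24-byte short page header and a 40-byte long page header. The function name byte_pos_to_rec_ptr is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Assumed build defaults; a real server may be configured differently. */
#define XLOG_BLCKSZ        8192
#define WAL_SEGMENT_SIZE   (16 * 1024 * 1024)
#define SIZE_OF_SHORT_PHD  24   /* stands in for SizeOfXLogShortPHD */
#define SIZE_OF_LONG_PHD   40   /* stands in for SizeOfXLogLongPHD */

#define USABLE_BYTES_IN_PAGE  (XLOG_BLCKSZ - SIZE_OF_SHORT_PHD)
#define USABLE_BYTES_IN_SEGMENT \
    ((uint64_t) (WAL_SEGMENT_SIZE / XLOG_BLCKSZ) * USABLE_BYTES_IN_PAGE - \
     (SIZE_OF_LONG_PHD - SIZE_OF_SHORT_PHD))

/* Convert a usable byte position into an LSN by adding back page headers. */
static XLogRecPtr
byte_pos_to_rec_ptr(uint64_t bytepos)
{
    uint64_t fullsegs = bytepos / USABLE_BYTES_IN_SEGMENT;
    uint64_t bytesleft = bytepos % USABLE_BYTES_IN_SEGMENT;
    uint64_t seg_offset;

    if (bytesleft < XLOG_BLCKSZ - SIZE_OF_LONG_PHD)
    {
        /* Fits on the first page of the segment, after the long header. */
        seg_offset = bytesleft + SIZE_OF_LONG_PHD;
    }
    else
    {
        /* Skip the first page, then whole pages with short headers. */
        uint64_t fullpages;

        seg_offset = XLOG_BLCKSZ;
        bytesleft -= XLOG_BLCKSZ - SIZE_OF_LONG_PHD;
        fullpages = bytesleft / USABLE_BYTES_IN_PAGE;
        bytesleft = bytesleft % USABLE_BYTES_IN_PAGE;
        seg_offset += fullpages * XLOG_BLCKSZ + bytesleft + SIZE_OF_SHORT_PHD;
    }
    return fullsegs * (uint64_t) WAL_SEGMENT_SIZE + seg_offset;
}
```

Under these assumptions the numbers in the trace line up: CurrBytePos = 5395369608 maps to 5411228000, the StartPos reserved for this record, and PrevBytePos = 5395369552 maps to 5411227944, the value later stored in xl_prev.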
XLogInsertRecord: initialize the local variables
(gdb) n
974 uint8 info = rechdr->xl_info & ~XLR_INFO_MASK;
(gdb)
975 bool isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
(gdb)
979 bool prevDoPageWrites = doPageWrites;
(gdb)
982 Assert(rdata->len >= SizeOfXLogRecord);
(gdb)
(gdb) p *rechdr
$7 = {xl_tot_len = 210, xl_xid = 1948, xl_prev = 0, xl_info = 0 '\000', xl_rmid = 10 '\n', xl_crc = 3212449170}
(gdb) p info
$8 = 0 '\000'
(gdb) p isLogSwitch
$9 = false
(gdb) p prevDoPageWrites
$10 = true
XLogInsertRecord: run the sanity checks, start the critical section, and acquire a WAL insertion lock
(gdb) n
985 if (!XLogInsertAllowed())
(gdb)
1020 START_CRIT_SECTION();
(gdb)
1021 if (isLogSwitch)
(gdb)
1024 WALInsertLockAcquire();
(gdb)
1042 if (RedoRecPtr != Insert->RedoRecPtr)
(gdb)
XLogInsertRecord: compare RedoRecPtr with the shared copy and update doPageWrites
(gdb) p RedoRecPtr
$11 = 5411227832
(gdb) p Insert->RedoRecPtr
$12 = 5411227832
(gdb) n
1047 doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
(gdb)
1049 if (doPageWrites &&
(gdb) p doPageWrites
$13 = true
(gdb) n
1050 (!prevDoPageWrites ||
(gdb)
1049 if (doPageWrites &&
XLogInsertRecord: reserve space for the record in the WAL; this also sets the xl_prev pointer
(gdb)
1050 (!prevDoPageWrites ||
(gdb)
1066 if (isLogSwitch)
(gdb)
1070 ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
(gdb)
1072 inserted = true;
(gdb) p rechdr->xl_tot_len
$14 = 210
(gdb) p StartPos
$15 = 5411228000
(gdb) p EndPos
$16 = 5411228216
(gdb) p *rechdr->xl_prev
Cannot access memory at address 0x14288c928
(gdb) p rechdr->xl_prev
$17 = 5411227944
(gdb)
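The reserved range is 216 bytes long (5411228216 - 5411228000) although xl_tot_len is 210, because the space reserved for a record is rounded up to the maximum alignment. A small sketch of that arithmetic, assuming an 8-byte MAXIMUM_ALIGNOF as on typical 64-bit builds; the macro is redefined locally and end_pos_same_page is a hypothetical helper that only covers records that stay within one WAL page.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed: 8-byte maximum alignment, as on common 64-bit platforms. */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(len) \
    (((uint64_t) (len) + (MAXIMUM_ALIGNOF - 1)) & \
     ~((uint64_t) (MAXIMUM_ALIGNOF - 1)))

/*
 * Hypothetical helper: end position of a record that does not cross a
 * WAL page boundary, so no page header has to be skipped.
 */
static uint64_t
end_pos_same_page(uint64_t start_pos, uint32_t xl_tot_len)
{
    return start_pos + MAXALIGN(xl_tot_len);
}
```

The same rounding explains the gap to the previous record: StartPos - xl_prev = 5411228000 - 5411227944 = 56 bytes, which is already a multiple of 8.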
XLogInsertRecord: now that xl_prev has been filled in, compute the CRC of the record header
(gdb) n
1075 if (inserted)
(gdb)
1081 rdata_crc = rechdr->xl_crc;
(gdb)
1082 COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc));
(gdb)
1083 FIN_CRC32C(rdata_crc);
(gdb)
1084 rechdr->xl_crc = rdata_crc;
(gdb)
1090 CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
(gdb) p rdata_crc
$18 = 2310972234
(gdb) p *rechdr
$19 = {xl_tot_len = 210, xl_xid = 1948, xl_prev = 5411227944, xl_info = 0 '\000', xl_rmid = 10 '\n', xl_crc = 2310972234}
(gdb)
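The CRC step continues a running CRC rather than starting from scratch: xl_crc arrives holding a partial value (in PostgreSQL it is accumulated over the record data while the record is assembled), the header bytes up to xl_crc are added, and only then is the value finalized. The sketch below shows that incremental property with a plain bitwise CRC-32C. It is an illustration, not PostgreSQL's implementation, and the names crc32c_update, CRC32C_INIT and CRC32C_FIN are made up here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32C_INIT    0xFFFFFFFFu
#define CRC32C_FIN(c)  ((c) ^ 0xFFFFFFFFu)

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *) data;

    while (len-- > 0)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}
```

Feeding "12345" and then "6789" gives the same final value as feeding "123456789" in one call, which is the property that lets the header be added to the CRC after the payload.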
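XLogInsertRecord: all the record data, including the header, is ready; copy the record into the reserved space.
Unless the record is flagged as not important, update the LSN of the last important record in the current slot.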
(gdb) n
1098 if ((flags & XLOG_MARK_UNIMPORTANT) == 0)
(gdb)
1100 int lockno = holdingAllLocks ? 0 : MyLockNo;
(gdb)
(gdb) n
1102 WALInsertLocks[lockno].l.lastImportantAt = StartPos;
(gdb)
1117 WALInsertLockRelease();
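XLogInsertRecord: done; let the other inserters know that we are finished.
If a page boundary had been crossed, the shared LogwrtRqst.Write would be updated (in this run execution goes straight from line 1126 to line 1142, so it is not).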
(gdb)
1117 WALInsertLockRelease();
(gdb) n
1119 MarkCurrentTransactionIdLoggedIfAny();
(gdb)
1121 END_CRIT_SECTION();
(gdb)
1126 if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
(gdb)
1142 if (isLogSwitch)
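The page-boundary test on line 1126 is plain integer division. A short sketch with the values from this run, assuming the default XLOG_BLCKSZ of 8192; page_number and crosses_page are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed default WAL block size. */
#define XLOG_BLCKSZ 8192

/* Hypothetical helper: index of the WAL page containing 'pos'. */
static uint64_t
page_number(uint64_t pos)
{
    return pos / XLOG_BLCKSZ;
}

/* Same comparison as in XLogInsertRecord. */
static bool
crosses_page(uint64_t start_pos, uint64_t end_pos)
{
    return page_number(start_pos) != page_number(end_pos);
}
```

StartPos 5411228000 and EndPos 5411228216 both fall on page 660550 (offsets 2400 and 2616), so the branch that updates LogwrtRqst.Write is skipped here.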
XLogInsertRecord: update the global variables and return
(gdb)
1220 ProcLastRecPtr = StartPos;
(gdb)
1221 XactLastRecEnd = EndPos;
(gdb)
1223 return EndPos;
(gdb)
1224 }
Back in XLogInsert: reset the insertion state and return EndPos
(gdb)
XLogInsert (rmid=10 '\n', info=0 '\000') at xloginsert.c:463
463 } while (EndPos == InvalidXLogRecPtr);
(gdb) n
465 XLogResetInsertion();
(gdb)
467 return EndPos;
(gdb)
468 }
(gdb) p EndPos
$20 = 5411228216
(gdb)
$21 = 5411228216
(gdb) n
heap_insert (relation=0x7fa280616228, tup=0x2b15440, cid=0, options=0, bistate=0x0) at heapam.c:2590
2590 PageSetLSN(page, recptr);
(gdb)
DONE!
From the ITPUB blog, link: http://blog.itpub.net/6906/viewspace-2374783/. Please credit the source when reposting; otherwise legal liability may be pursued.