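PostgreSQL Source Code Walkthrough (113) - WAL#9 (Insert & WAL - CopyXL...
This installment traces the implementation of ReserveXLogInsertLocation and CopyXLogRecordToWAL. ReserveXLogInsertLocation reserves the right amount of space for an XLOG record, and CopyXLogRecordToWAL copies the record into that reserved space in the WAL buffers.
1. Data structures
Global variables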
/* flags for the in-progress insertion */
static uint8 curinsert_flags = 0;

/*
 * These are used to hold the record header while constructing a record.
 * 'hdr_scratch' is not a plain variable, but is palloc'd at initialization,
 * because we want it to be MAXALIGNed and padding bytes zeroed.
 *
 * For simplicity, it's allocated large enough to hold the headers for any
 * WAL record.
 */
static XLogRecData hdr_rdt;
static char *hdr_scratch = NULL;

#define SizeOfXlogOrigin    (sizeof(RepOriginId) + sizeof(char))

#define HEADER_SCRATCH_SIZE \
    (SizeOfXLogRecord + \
     MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
     SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)

/*
 * An array of XLogRecData structs, to hold registered data.
 */
static XLogRecData *rdatas;
static int  num_rdatas;         /* entries currently used */
static int  max_rdatas;         /* allocated size */

/* whether XLogBeginInsert has been called */
static bool begininsert_called = false;

static XLogCtlData *XLogCtl = NULL;

/*
 * A chain of XLogRecDatas to hold the "main data" of a WAL record, registered
 * with XLogRegisterData(...).
 */
static XLogRecData *mainrdata_head;
static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
static uint32 mainrdata_len;    /* total # of bytes in chain */

/*
 * ProcLastRecPtr points to the start of the last XLOG record inserted by the
 * current backend.  It is updated for all inserts.  XactLastRecEnd points to
 * end+1 of the last record, and is reset when we end a top-level transaction,
 * or start a new one; so it can be used to tell if the current transaction has
 * created any XLOG records.
 *
 * While in parallel mode, this may not be fully up to date.  When committing,
 * a transaction can assume this covers all xlog records written either by the
 * user backend or by any parallel worker which was present at any point during
 * the transaction.  But when aborting, or when still in parallel mode, other
 * parallel backends may have written WAL records at later LSNs than the value
 * stored here.  The parallel leader advances its own copy, when necessary,
 * in WaitForParallelWorkersToFinish.
 */
XLogRecPtr  ProcLastRecPtr = InvalidXLogRecPtr;
XLogRecPtr  XactLastRecEnd = InvalidXLogRecPtr;
XLogRecPtr  XactLastCommitEnd = InvalidXLogRecPtr;

/* For WALInsertLockAcquire/Release functions */
static int  MyLockNo = 0;
static bool holdingAllLocks = false;

/*
 * Private, possibly out-of-date copy of shared LogwrtResult.
 * See discussion above.
 */
static XLogwrtResult LogwrtResult = {0, 0};

/*
 * The number of bytes in a WAL segment usable for WAL data
 * (page headers are not counted).
 */
static int  UsableBytesInSegment;
Macro definitions
Flags used by XLogRegisterBuffer
/* flags for XLogRegisterBuffer */
#define REGBUF_FORCE_IMAGE  0x01    /* force a full-page image */
#define REGBUF_NO_IMAGE     0x02    /* don't take a full-page image */
#define REGBUF_WILL_INIT    (0x04 | 0x02)   /* page will be re-initialized at
                                             * replay (implies NO_IMAGE) */
#define REGBUF_STANDARD     0x08    /* page follows "standard" page layout,
                                     * (data between pd_lower and pd_upper
                                     * will be skipped) */
#define REGBUF_KEEP_DATA    0x10    /* include data even if a full-page image
                                     * is taken */

/*
 * Flag bits for the record being inserted, set using XLogSetRecordFlags().
 */
#define XLOG_INCLUDE_ORIGIN     0x01    /* include the replication origin */
#define XLOG_MARK_UNIMPORTANT   0x02    /* record not important for durability */

#define XLogSegmentOffset(xlogptr, wal_segsz_bytes) \
    ((xlogptr) & ((wal_segsz_bytes) - 1))

/*
 * Calculate the amount of space left on the page after 'endptr'. Beware
 * multiple evaluation!
 */
#define INSERT_FREESPACE(endptr) \
    (((endptr) % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr) % XLOG_BLCKSZ))
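XLogSegmentOffset uses a bit mask instead of a modulo, which only works because the WAL segment size is a power of two. The following is a minimal sketch of that idea, not PostgreSQL code: demo_segment_offset and DEMO_WAL_SEGSZ are hypothetical names, and the 16 MB segment size is an assumption (the default).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: the default 16 MB WAL segment size. */
#define DEMO_WAL_SEGSZ ((uint64_t) 16 * 1024 * 1024)

/* Offset of a WAL position within its segment, using the mask form. */
static uint64_t
demo_segment_offset(uint64_t xlogptr, uint64_t segsz)
{
    /* The mask is equivalent to "%" only for power-of-two sizes. */
    assert(segsz != 0 && (segsz & (segsz - 1)) == 0);
    return xlogptr & (segsz - 1);
}
```

Under these assumptions, the StartPos value seen later in the trace (5514538672) lies 11611824 bytes into its segment, and the mask gives the same answer as `5514538672 % DEMO_WAL_SEGSZ`.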
XLogRecData
The functions in xloginsert.c construct a chain of XLogRecData structs to represent the final WAL record.
/*
 * The functions in xloginsert.c construct a chain of XLogRecData structs
 * to represent the final WAL record.
 */
typedef struct XLogRecData
{
    struct XLogRecData *next;   /* next struct in chain, or NULL */
    char       *data;           /* start of rmgr data to include */
    uint32      len;            /* length of rmgr data to include */
} XLogRecData;
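2. Source code walkthrough
ReserveXLogInsertLocation
Reserves the right amount of space in the WAL (buffers) for a record of the given size. *StartPos is set to the beginning of the reserved section and *EndPos to its end+1. *PrevPtr is set to the beginning of the previous record; it is used to set the xl_prev field of this record.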
/*
 * Reserves the right amount of space for a record of given size from the WAL.
 * *StartPos is set to the beginning of the reserved section, *EndPos to
 * its end+1. *PrevPtr is set to the beginning of the previous record; it is
 * used to set the xl_prev of this record.
 *
 * This is the performance critical part of XLogInsert that must be serialized
 * across backends. The rest can happen mostly in parallel. Try to keep this
 * section as short as possible, insertpos_lck can be heavily contended on a
 * busy system.
 *
 * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
 * where we actually copy the record to the reserved space.
 */
static void
ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
                          XLogRecPtr *PrevPtr)
{
    XLogCtlInsert *Insert = &XLogCtl->Insert;   /* insertion control data */
    uint64      startbytepos;   /* start position */
    uint64      endbytepos;     /* end position */
    uint64      prevbytepos;    /* start of the previous record */

    size = MAXALIGN(size);      /* align the size */

    /* All (non xlog-switch) records should contain data. */
    Assert(size > SizeOfXLogRecord);

    /*
     * The duration the spinlock needs to be held is minimized by minimizing
     * the calculations that have to be done while holding the lock. The
     * current tip of reserved WAL is kept in CurrBytePos, as a byte position
     * that only counts "usable" bytes in WAL, that is, it excludes all WAL
     * page headers. The mapping between "usable" byte positions and physical
     * positions (XLogRecPtrs) can be done outside the locked region, and
     * because the usable byte position doesn't include any headers, reserving
     * X bytes from WAL is almost as simple as "CurrBytePos += X".
     */
    SpinLockAcquire(&Insert->insertpos_lck);    /* acquire the lock */

    startbytepos = Insert->CurrBytePos;
    endbytepos = startbytepos + size;
    prevbytepos = Insert->PrevBytePos;
    /* advance the shared positions */
    Insert->CurrBytePos = endbytepos;
    Insert->PrevBytePos = startbytepos;

    SpinLockRelease(&Insert->insertpos_lck);    /* release the lock */

    /* output values: map the usable byte positions to physical positions */
    *StartPos = XLogBytePosToRecPtr(startbytepos);
    *EndPos = XLogBytePosToEndRecPtr(endbytepos);
    *PrevPtr = XLogBytePosToRecPtr(prevbytepos);

    /*
     * Check that the conversions between "usable byte positions" and
     * XLogRecPtrs work consistently in both directions.
     */
    Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
    Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
    Assert(XLogRecPtrToBytePos(*PrevPtr) == prevbytepos);
}

/*
 * Converts a "usable byte position" to XLogRecPtr. A usable byte position
 * is the position starting from the beginning of WAL, excluding all WAL
 * page headers.
 */
static XLogRecPtr
XLogBytePosToRecPtr(uint64 bytepos)
{
    uint64      fullsegs;
    uint64      fullpages;
    uint64      bytesleft;
    uint32      seg_offset;
    XLogRecPtr  result;

    fullsegs = bytepos / UsableBytesInSegment;
    bytesleft = bytepos % UsableBytesInSegment;

    if (bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD)
    {
        /* fits on first page of segment */
        seg_offset = bytesleft + SizeOfXLogLongPHD;
    }
    else
    {
        /* account for the first page on segment with long header */
        seg_offset = XLOG_BLCKSZ;
        bytesleft -= XLOG_BLCKSZ - SizeOfXLogLongPHD;

        fullpages = bytesleft / UsableBytesInPage;
        bytesleft = bytesleft % UsableBytesInPage;

        seg_offset += fullpages * XLOG_BLCKSZ + bytesleft + SizeOfXLogShortPHD;
    }

    XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, wal_segment_size, result);

    return result;
}

/*
 * The number of bytes in a WAL segment usable for WAL data
 * (page headers are not counted).
 */
static int  UsableBytesInSegment;
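To make the mapping concrete, here is a standalone sketch of the same arithmetic as XLogBytePosToRecPtr. It is an illustration under stated assumptions rather than the PostgreSQL implementation: all demo_ and DEMO_ names are hypothetical, and the constants assume 8 kB WAL pages, 16 MB segments, a 40-byte long page header and a 24-byte short page header. The usable-bytes-per-segment formula is my reading of how xlog.c derives UsableBytesInSegment.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed build-time values (defaults on a typical 64-bit platform). */
#define DEMO_BLCKSZ     8192u                  /* XLOG_BLCKSZ */
#define DEMO_SEGSZ      (16u * 1024u * 1024u)  /* wal_segment_size */
#define DEMO_LONG_PHD   40u                    /* SizeOfXLogLongPHD */
#define DEMO_SHORT_PHD  24u                    /* SizeOfXLogShortPHD */

#define DEMO_USABLE_IN_PAGE (DEMO_BLCKSZ - DEMO_SHORT_PHD)
#define DEMO_USABLE_IN_SEG \
    ((uint64_t) (DEMO_SEGSZ / DEMO_BLCKSZ) * DEMO_USABLE_IN_PAGE - \
     (DEMO_LONG_PHD - DEMO_SHORT_PHD))

/* Map a usable byte position to a physical WAL position. */
static uint64_t
demo_bytepos_to_recptr(uint64_t bytepos)
{
    uint64_t fullsegs = bytepos / DEMO_USABLE_IN_SEG;
    uint64_t bytesleft = bytepos % DEMO_USABLE_IN_SEG;
    uint64_t seg_offset;

    if (bytesleft < DEMO_BLCKSZ - DEMO_LONG_PHD)
    {
        /* Fits on the first page, right after the long header. */
        seg_offset = bytesleft + DEMO_LONG_PHD;
    }
    else
    {
        /* Skip the first page, then count full pages with short headers. */
        bytesleft -= DEMO_BLCKSZ - DEMO_LONG_PHD;
        seg_offset = DEMO_BLCKSZ
            + (bytesleft / DEMO_USABLE_IN_PAGE) * DEMO_BLCKSZ
            + (bytesleft % DEMO_USABLE_IN_PAGE) + DEMO_SHORT_PHD;
    }

    assert(seg_offset < DEMO_SEGSZ);
    return fullsegs * DEMO_SEGSZ + seg_offset;
}
```

With these assumed constants, the CurrBytePos value from the trace in section 3 (5498377520) maps to 5514538672, which is the StartPos that gdb reports there.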
CopyXLogRecordToWAL
CopyXLogRecordToWAL is a subroutine of XLogInsertRecord. It copies an XLOG record into the already-reserved area of the WAL.
/*
 * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
 * area in the WAL.
 */
static void
CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
                    XLogRecPtr StartPos, XLogRecPtr EndPos)
{
    char       *currpos;        /* current pointer into the WAL buffer */
    int         freespace;      /* free space left on the current page */
    int         written;        /* bytes written so far */
    XLogRecPtr  CurrPos;        /* current WAL position */
    XLogPageHeader pagehdr;     /* page header */

    /*
     * Get a pointer to the right place in the right WAL buffer to start
     * inserting to.
     */
    CurrPos = StartPos;
    currpos = GetXLogBuffer(CurrPos);
    freespace = INSERT_FREESPACE(CurrPos);

    /*
     * there should be enough space for at least the first field (xl_tot_len)
     * on this page.
     */
    Assert(freespace >= sizeof(uint32));

    /* Copy record data */
    written = 0;
    while (rdata != NULL)
    {
        char       *rdata_data = rdata->data;
        int         rdata_len = rdata->len;

        while (rdata_len > freespace)
        {
            /*
             * Write what fits on this page, and continue on the next page.
             */
            /* make sure we are past the (short) page header on this page */
            Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || freespace == 0);
            memcpy(currpos, rdata_data, freespace);
            rdata_data += freespace;
            rdata_len -= freespace;
            written += freespace;
            CurrPos += freespace;

            /*
             * Get pointer to beginning of next page, and set the xlp_rem_len
             * in the page header. Set XLP_FIRST_IS_CONTRECORD.
             *
             * It's safe to set the contrecord flag and xlp_rem_len without a
             * lock on the page. All the other flags were already set when the
             * page was initialized, in AdvanceXLInsertBuffer, and we're the
             * only backend that needs to set the contrecord flag.
             */
            currpos = GetXLogBuffer(CurrPos);
            pagehdr = (XLogPageHeader) currpos;
            pagehdr->xlp_rem_len = write_len - written;
            pagehdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;

            /* skip over the page header */
            if (XLogSegmentOffset(CurrPos, wal_segment_size) == 0)
            {
                /* first page of a segment: long header */
                CurrPos += SizeOfXLogLongPHD;
                currpos += SizeOfXLogLongPHD;
            }
            else
            {
                /* any other page: short header */
                CurrPos += SizeOfXLogShortPHD;
                currpos += SizeOfXLogShortPHD;
            }
            freespace = INSERT_FREESPACE(CurrPos);
        }

        /* check again; at this point rdata_len <= freespace */
        Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || rdata_len == 0);
        memcpy(currpos, rdata_data, rdata_len);
        currpos += rdata_len;
        CurrPos += rdata_len;
        freespace -= rdata_len;
        written += rdata_len;

        rdata = rdata->next;    /* next chunk in the chain */
    }
    Assert(written == write_len);

    /*
     * If this was an xlog-switch, it's not enough to write the switch record,
     * we also have to consume all the remaining space in the WAL segment.  We
     * have already reserved that space, but we need to actually fill it.
     */
    if (isLogSwitch && XLogSegmentOffset(CurrPos, wal_segment_size) != 0)
    {
        /* An xlog-switch record doesn't contain any data besides the header */
        Assert(write_len == SizeOfXLogRecord);

        /* Assert that we did reserve the right amount of space */
        Assert(XLogSegmentOffset(EndPos, wal_segment_size) == 0);

        /* Use up all the remaining space on the current page */
        CurrPos += freespace;

        /*
         * Cause all remaining pages in the segment to be flushed, leaving the
         * XLog position where it should be, at the start of the next segment.
         * We do this one page at a time, to make sure we don't deadlock
         * against ourselves if wal_buffers < wal_segment_size.
         */
        while (CurrPos < EndPos)
        {
            /*
             * The minimal action to flush the page would be to call
             * WALInsertLockUpdateInsertingAt(CurrPos) followed by
             * AdvanceXLInsertBuffer(...).  The page would be left initialized
             * mostly to zeros, except for the page header (always the short
             * variant, as this is never a segment's first page).
             *
             * The large vistas of zeros are good for compressibility, but the
             * headers interrupting them every XLOG_BLCKSZ (with values that
             * differ from page to page) are not.  The effect varies with
             * compression tool, but bzip2 for instance compresses about an
             * order of magnitude worse if those headers are left in place.
             *
             * Rather than complicating AdvanceXLInsertBuffer itself (which is
             * called in heavily-loaded circumstances as well as this lightly-
             * loaded one) with variant behavior, we just use GetXLogBuffer
             * (which itself calls the two methods we need) to get the pointer
             * and zero most of the page.  Then we just zero the page header.
             */
            currpos = GetXLogBuffer(CurrPos);
            MemSet(currpos, 0, SizeOfXLogShortPHD);    /* zero the page header */

            CurrPos += XLOG_BLCKSZ;     /* move on to the next page */
        }
    }
    else
    {
        /* Align the end position, so that the next record starts aligned */
        CurrPos = MAXALIGN64(CurrPos);
    }

    if (CurrPos != EndPos)
        elog(PANIC, "space reserved for WAL record does not match what was written");
}
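The page-crossing logic of the copy loop is easier to follow when reduced to pure position arithmetic. The sketch below tracks only how the WAL position advances over a chain of chunk lengths, skipping a page header at every page boundary; it performs no memcpy and does not set xlp_rem_len. It is a simplified illustration, not the PostgreSQL code: the demo_ names are hypothetical and the constants (8 kB pages, 16 MB segments, 40-byte long and 24-byte short headers) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BLCKSZ     8192u                  /* assumed XLOG_BLCKSZ */
#define DEMO_SEGSZ      (16u * 1024u * 1024u)  /* assumed wal_segment_size */
#define DEMO_LONG_PHD   40u                    /* assumed SizeOfXLogLongPHD */
#define DEMO_SHORT_PHD  24u                    /* assumed SizeOfXLogShortPHD */

/* Bytes left on the page containing pos (0 exactly on a page boundary). */
static uint32_t
demo_freespace(uint64_t pos)
{
    uint32_t off = (uint32_t) (pos % DEMO_BLCKSZ);

    return off == 0 ? 0 : DEMO_BLCKSZ - off;
}

/*
 * Return the (unaligned) WAL position just past the last byte after
 * "copying" nchunks chunks of the given lengths starting at start.
 */
static uint64_t
demo_copy_end(uint64_t start, const uint32_t *lens, size_t nchunks)
{
    uint64_t pos = start;
    uint32_t freespace = demo_freespace(pos);

    /* Mirrors the original check: room for xl_tot_len on the first page. */
    assert(freespace >= sizeof(uint32_t));

    for (size_t i = 0; i < nchunks; i++)
    {
        uint32_t len = lens[i];

        while (len > freespace)
        {
            /* Fill the rest of this page. */
            len -= freespace;
            pos += freespace;
            /* Now on a page boundary: skip the next page's header. */
            pos += (pos % DEMO_SEGSZ == 0) ? DEMO_LONG_PHD : DEMO_SHORT_PHD;
            freespace = demo_freespace(pos);
        }
        pos += len;
        freespace -= len;
    }
    return pos;
}
```

For the four chunks of the record traced in section 3 (46, 5, 20 and 3 bytes, starting at 5514538672) this yields 5514538746, which matches the CurrPos printed by gdb before the final MAXALIGN64. A 30-byte chunk that starts 10 bytes before the end of an ordinary page ends 24 + 20 bytes into the next page.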
3. Trace analysis
The test script is as follows:
drop table t_wal_longtext;
create table t_wal_longtext(c1 int not null,c2 varchar(3000),c3 varchar(3000),c4 varchar(3000));
insert into t_wal_longtext(c1,c2,c3,c4)
select i,rpad('C2-'||i,3000,'2'),rpad('C3-'||i,3000,'3'),rpad('C4-'||i,3000,'4')
from generate_series(1,7) as i;
ReserveXLogInsertLocation
Insert a row:
insert into t_wal_longtext(c1,c2,c3,c4) VALUES(8,'C2-8','C3-8','C4-8');
Set a breakpoint and step into ReserveXLogInsertLocation:
(gdb) b ReserveXLogInsertLocation
Breakpoint 1 at 0x54d574: file xlog.c, line 1244.
(gdb) c
Continuing.

Breakpoint 1, ReserveXLogInsertLocation (size=74, StartPos=0x7ffebea9d768, EndPos=0x7ffebea9d760, PrevPtr=0x244f4c8) at xlog.c:1244
1244        XLogCtlInsert *Insert = &XLogCtl->Insert;
(gdb)
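Input parameters:
size=74 is the size of the XLOG record to be inserted; the other three are output values to be filled in.
Continue execution.
Alignment: 74 -> 80. MAXALIGN rounds the size up to a multiple of the platform's maximum alignment, which is 8 bytes here (the size of a uint64).
(gdb) n
1249        size = MAXALIGN(size);
(gdb)
1252        Assert(size > SizeOfXLogRecord);
(gdb) p size
$1 = 80
(gdb)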
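The rounding from 74 to 80 can be reproduced with the usual add-and-mask idiom. This is a minimal sketch assuming an 8-byte maximum alignment; demo_typealign, demo_maxalign and DEMO_MAXIMUM_ALIGNOF are hypothetical names chosen for illustration, not PostgreSQL's macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed: the maximum alignment is 8, as on typical 64-bit platforms. */
#define DEMO_MAXIMUM_ALIGNOF 8u

/* Round len up to the next multiple of alignval (a power of two). */
static uint64_t
demo_typealign(uint64_t alignval, uint64_t len)
{
    assert(alignval != 0 && (alignval & (alignval - 1)) == 0);
    return (len + (alignval - 1)) & ~(alignval - 1);
}

/* Round len up to the assumed maximum alignment. */
static uint64_t
demo_maxalign(uint64_t len)
{
    return demo_typealign(DEMO_MAXIMUM_ALIGNOF, len);
}
```

A value that is already a multiple of 8 is returned unchanged, so aligning twice is harmless.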
Inspect the insertion control data (Insert), where:
CurrBytePos = 5498377520, which is 0x147BA9530 in hexadecimal
PrevBytePos = 5498377464, which is 0x147BA94F8 in hexadecimal
RedoRecPtr = 5514382312, which is 0x148AECBE8 in hexadecimal --> corresponds to "Latest checkpoint's REDO location" in pg_control (a physical XLogRecPtr, unlike the two byte positions above)
(gdb) n
1264        SpinLockAcquire(&Insert->insertpos_lck);
(gdb)
1266        startbytepos = Insert->CurrBytePos;
(gdb) p *Insert
$2 = {insertpos_lck = 1 '\001', CurrBytePos = 5498377520, PrevBytePos = 5498377464, pad = '\000' <repeats 127 times>, RedoRecPtr = 5514382312, forcePageWrites = false, fullPageWrites = true, exclusiveBackupState = EXCLUSIVE_BACKUP_NONE, nonExclusiveBackups = 0, lastBackupStart = 0, WALInsertLocks = 0x7f97d1eeb100}
(gdb)
Set the corresponding values.
Note that the positions kept in Insert do not count page headers: they are pure usable WAL data bytes, so the values are smaller than the corresponding physical positions in the WAL segment files.
(gdb) n
1267        endbytepos = startbytepos + size;
(gdb)
1268        prevbytepos = Insert->PrevBytePos;
(gdb)
1269        Insert->CurrBytePos = endbytepos;
(gdb)
1270        Insert->PrevBytePos = startbytepos;
(gdb)
1272        SpinLockRelease(&Insert->insertpos_lck);
(gdb)
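The critical section stepped through above boils down to a read of two positions and two assignments under a lock. Here is a minimal sketch of that pattern using a C11 atomic_flag as a stand-in spinlock. DemoInsert, demo_insert_init and demo_reserve are hypothetical names, and PostgreSQL's own spinlock implementation is platform-specific and different from this.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct DemoInsert
{
    atomic_flag lock;   /* stands in for insertpos_lck */
    uint64_t    curr;   /* next free usable byte position */
    uint64_t    prev;   /* usable byte position of the previous record */
} DemoInsert;

static void
demo_insert_init(DemoInsert *ins, uint64_t curr, uint64_t prev)
{
    atomic_flag_clear(&ins->lock);
    ins->curr = curr;
    ins->prev = prev;
}

/* Reserve size bytes; the caller is expected to pass an aligned size. */
static void
demo_reserve(DemoInsert *ins, uint64_t size,
             uint64_t *start, uint64_t *end, uint64_t *prev)
{
    assert(size % 8 == 0);

    /* Spin until the flag is acquired. */
    while (atomic_flag_test_and_set_explicit(&ins->lock, memory_order_acquire))
        ;

    /* Keep the locked region to a handful of loads and stores. */
    *start = ins->curr;
    *end = *start + size;
    *prev = ins->prev;
    ins->curr = *end;
    ins->prev = *start;

    atomic_flag_clear_explicit(&ins->lock, memory_order_release);
}
```

Initialized with the CurrBytePos and PrevBytePos values from the gdb session and called with size 80, the sketch returns start 5498377520, end 5498377600 and prev 5498377464, and leaves curr at 5498377600 and prev at 5498377520 for the next caller.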
As described above, the "usable byte positions" now have to be converted to XLogRecPtrs.
Compute the physical start, end and previous-record positions:
StartPos = 5514538672, 0x148B12EB0
EndPos = 5514538752, 0x148B12F00
PrevPtr = 5514538616, 0x148B12E78
(gdb) n
1274        *StartPos = XLogBytePosToRecPtr(startbytepos);
(gdb)
1275        *EndPos = XLogBytePosToEndRecPtr(endbytepos);
(gdb)
1276        *PrevPtr = XLogBytePosToRecPtr(prevbytepos);
(gdb)
1282        Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
(gdb) p *StartPos
$4 = 5514538672
(gdb) p *EndPos
$5 = 5514538752
(gdb) p *PrevPtr
$6 = 5514538616
(gdb)
The assertions confirm that the conversion is consistent in both directions.
(gdb) n
1283        Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
(gdb)
1284        Assert(XLogRecPtrToBytePos(*PrevPtr) == prevbytepos);
(gdb)
1285    }
(gdb)
XLogInsertRecord (rdata=0xf9cc70 <hdr_rdt>, fpw_lsn=5514538520, flags=1 '\001') at xlog.c:1072
1072            inserted = true;
(gdb)
DONE!
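The reverse mapping checked by those assertions can be sketched in the same way. The function below is an illustration of the idea behind XLogRecPtrToBytePos, not a copy of it: the demo_ names are hypothetical, and the constants again assume 8 kB pages, 16 MB segments, a 40-byte long header and a 24-byte short header.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLCKSZ     8192u                  /* assumed XLOG_BLCKSZ */
#define DEMO_SEGSZ      (16u * 1024u * 1024u)  /* assumed wal_segment_size */
#define DEMO_LONG_PHD   40u                    /* assumed SizeOfXLogLongPHD */
#define DEMO_SHORT_PHD  24u                    /* assumed SizeOfXLogShortPHD */

#define DEMO_USABLE_IN_PAGE (DEMO_BLCKSZ - DEMO_SHORT_PHD)
#define DEMO_USABLE_IN_SEG \
    ((uint64_t) (DEMO_SEGSZ / DEMO_BLCKSZ) * DEMO_USABLE_IN_PAGE - \
     (DEMO_LONG_PHD - DEMO_SHORT_PHD))

/* Map a physical WAL position back to a usable byte position. */
static uint64_t
demo_recptr_to_bytepos(uint64_t ptr)
{
    uint64_t fullsegs = ptr / DEMO_SEGSZ;
    uint64_t fullpages = (ptr % DEMO_SEGSZ) / DEMO_BLCKSZ;
    uint64_t offset = ptr % DEMO_BLCKSZ;
    uint64_t result = fullsegs * DEMO_USABLE_IN_SEG;

    if (fullpages == 0)
    {
        /* First page of the segment: only a long header to subtract. */
        if (offset > 0)
        {
            assert(offset >= DEMO_LONG_PHD);
            result += offset - DEMO_LONG_PHD;
        }
    }
    else
    {
        /* First page, then the full pages before the current one. */
        result += (DEMO_BLCKSZ - DEMO_LONG_PHD)
            + (fullpages - 1) * DEMO_USABLE_IN_PAGE;
        if (offset > 0)
        {
            assert(offset >= DEMO_SHORT_PHD);
            result += offset - DEMO_SHORT_PHD;
        }
    }
    return result;
}
```

Under these assumptions StartPos 5514538672 maps back to 5498377520. The difference of 16161152 bytes is the total size of all page headers that precede the record.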
CopyXLogRecordToWAL - Scenario 1: the record does not cross a WAL page
The test script is as follows:
insert into t_wal_longtext(c1,c2,c3,c4) VALUES(8,'C2-8','C3-8','C4-8');
This continues the trace of the previous SQL statement.
Set a breakpoint and step into CopyXLogRecordToWAL:
(gdb) b CopyXLogRecordToWAL
Breakpoint 3 at 0x54dcdf: file xlog.c, line 1479.
(gdb) c
Continuing.

Breakpoint 3, CopyXLogRecordToWAL (write_len=74, isLogSwitch=false, rdata=0xf9cc70 <hdr_rdt>, StartPos=5514538672, EndPos=5514538752) at xlog.c:1479
1479        CurrPos = StartPos;
(gdb)
Input parameters:
write_len=74, --> number of bytes to write
isLogSwitch=false, --> whether this is a log switch (it is not)
rdata=0xf9cc70 <hdr_rdt>, --> address of the data to write
StartPos=5514538672, --> start position
EndPos=5514538752 --> end position
(gdb) n
1480        currpos = GetXLogBuffer(CurrPos);
(gdb)
Get a pointer to the right place in the right WAL buffer to start inserting.
Step into GetXLogBuffer; its input parameter ptr is 5514538672, the start position.
(gdb) step
GetXLogBuffer (ptr=5514538672) at xlog.c:1854
1854        if (ptr / XLOG_BLCKSZ == cachedPage)
(gdb) p ptr / 8192    --> integer division, giving the page number
$7 = 673161
(gdb)
(gdb) p cachedPage
$8 = 673161
(gdb)
GetXLogBuffer: ptr / XLOG_BLCKSZ == cachedPage, so the cached-page fast path is taken.
Note: cachedPage is a static variable; where exactly it is assigned is left for a later analysis.
(gdb) n
1856            Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
(gdb)
1857            Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
(gdb)
1858            return cachedPos + ptr % XLOG_BLCKSZ;
GetXLogBuffer: cachedPos points to the start of the page, which begins with an XLogPageHeader struct.
(gdb) p *((XLogPageHeader) cachedPos)
$14 = {xlp_magic = 53400, xlp_info = 5, xlp_tli = 1, xlp_pageaddr = 5514534912, xlp_rem_len = 71}
(gdb)
(gdb) x/24bx (0x7f97d29fe000)
0x7f97d29fe000: 0x98 0xd0 0x05 0x00 0x01 0x00 0x00 0x00
0x7f97d29fe008: 0x00 0x20 0xb1 0x48 0x01 0x00 0x00 0x00
0x7f97d29fe010: 0x47 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Back in CopyXLogRecordToWAL, the buffer address is 0x7f97d29feeb0, that is, the page start 0x7f97d29fe000 plus the in-page offset 0xeb0.
(gdb) n
1945    }
(gdb)
CopyXLogRecordToWAL (write_len=74, isLogSwitch=false, rdata=0xf9cc70 <hdr_rdt>, StartPos=5514538672, EndPos=5514538752) at xlog.c:1481
1481        freespace = INSERT_FREESPACE(CurrPos);
(gdb)
(gdb) p currpos
$16 = 0x7f97d29feeb0 ""
(gdb)
Compute the free space and make sure this page has room for at least the first field, xl_tot_len (4 bytes).
(gdb) n
1487        Assert(freespace >= sizeof(uint32));
(gdb) p freespace
$21 = 4432
(gdb)
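The numbers seen in GetXLogBuffer and in the free-space check all follow from one division of the WAL position by the page size. The sketch below recomputes them; DemoPageLoc and demo_locate are hypothetical names, and the 8 kB page size is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLCKSZ 8192u   /* assumed XLOG_BLCKSZ */

typedef struct DemoPageLoc
{
    uint64_t pageno;     /* ptr / page size, compared with cachedPage */
    uint64_t pageaddr;   /* WAL position of the first byte of the page */
    uint32_t offset;     /* byte offset of ptr within the page */
    uint32_t freespace;  /* bytes left on the page from ptr onward */
} DemoPageLoc;

static DemoPageLoc
demo_locate(uint64_t ptr)
{
    DemoPageLoc loc;

    loc.pageno = ptr / DEMO_BLCKSZ;
    loc.offset = (uint32_t) (ptr % DEMO_BLCKSZ);
    loc.pageaddr = ptr - loc.offset;
    /* Same rule as INSERT_FREESPACE: a page boundary has no room left. */
    loc.freespace = (loc.offset == 0) ? 0 : DEMO_BLCKSZ - loc.offset;

    assert(loc.pageaddr == loc.pageno * DEMO_BLCKSZ);
    return loc;
}
```

For ptr = 5514538672 this gives page number 673161, page address 5514534912 (the xlp_pageaddr in the page header), offset 3760 (0xeb0) and 4432 bytes of free space, in line with the gdb output.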
Start copying the record data.
(gdb) n
1490        written = 0;    --> tracks the number of bytes written so far
(gdb)
1491        while (rdata != NULL)
The contents of rdata are analyzed in section 4; continue execution.
(gdb) n
1493            char       *rdata_data = rdata->data;
(gdb)
1494            int         rdata_len = rdata->len;
(gdb)
1496            while (rdata_len > freespace)
(gdb) p rdata_len
$34 = 46
(gdb) p freespace
$35 = 4432
(gdb)
rdata_len < freespace, so the inner loop is skipped.
The second assertion passes, and the memory copy is performed.
(gdb) n
1536            Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || rdata_len == 0);
(gdb)
1537            memcpy(currpos, rdata_data, rdata_len);
(gdb)
1538            currpos += rdata_len;
(gdb)
1539            CurrPos += rdata_len;
(gdb)
1540            freespace -= rdata_len;
(gdb)
1541            written += rdata_len;
(gdb)
1543            rdata = rdata->next;
(gdb)
1491        while (rdata != NULL)
(gdb) p currpos
$36 = 0x7f97d29feede ""
(gdb) p CurrPos
$37 = 5514538718
(gdb) p freespace
$38 = 4386
(gdb) p written
$39 = 46
(gdb)
rdata has four parts in total; the second, third and fourth parts are written in the same way.
...
1491        while (rdata != NULL)
(gdb)
1493            char       *rdata_data = rdata->data;
(gdb)
1494            int         rdata_len = rdata->len;
(gdb)
1496            while (rdata_len > freespace)
(gdb)
1536            Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || rdata_len == 0);
(gdb)
1537            memcpy(currpos, rdata_data, rdata_len);
(gdb)
1538            currpos += rdata_len;
(gdb)
1539            CurrPos += rdata_len;
(gdb)
1540            freespace -= rdata_len;
(gdb)
1541            written += rdata_len;
(gdb)
1543            rdata = rdata->next;
(gdb)
1491        while (rdata != NULL)
(gdb)
All 74 bytes have been written.
(gdb)
1545        Assert(written == write_len);
(gdb) p written
$40 = 74
(gdb)
No log-switch handling is needed.
Align CurrPos:
(gdb) n
1552        if (isLogSwitch && XLogSegmentOffset(CurrPos, wal_segment_size) != 0)
(gdb)
1599            CurrPos = MAXALIGN64(CurrPos);
(gdb) p CurrPos
$41 = 5514538746
(gdb) n
1602        if (CurrPos != EndPos)
(gdb) p CurrPos
$42 = 5514538752
(gdb)
(gdb) p 5514538746 % 8
$44 = 2    --> 6 padding bytes are needed: 5514538746 --> 5514538752
After alignment CurrPos must equal EndPos; otherwise the function raises a PANIC.
(gdb) p EndPos
$45 = 5514538752
The call returns:
(gdb) n
1604    }
(gdb)
XLogInsertRecord (rdata=0xf9cc70 <hdr_rdt>, fpw_lsn=5514538520, flags=1 '\001') at xlog.c:1098
1098            if ((flags & XLOG_MARK_UNIMPORTANT) == 0)
(gdb)
DONE!
CopyXLogRecordToWAL - Scenario 2: the record crosses a WAL page
To be analyzed in a later installment.
4. WAL record revisited
In memory, the WAL record is passed around as rdata, which here points to the global static variable hdr_rdt of type XLogRecData. The XLOG record is organized as a linked list of XLogRecData structs (a nice design: the writer does not need to know the record's structure, it simply writes out the list elements one by one).
rdata consists of 4 parts:
Part 1 is XLogRecord (24 bytes) + XLogRecordBlockHeader (4 bytes, followed by the 12-byte RelFileNode and the 4-byte block number) + XLogRecordDataHeaderShort (2 bytes), 46 bytes in total
Part 2 is xl_heap_header, 5 bytes
Part 3 is the tuple data, 20 bytes
Part 4 is xl_heap_insert, 3 bytes
------------------------------------------------------------------- 1
(gdb) p *rdata
$22 = {next = 0x244f2c0, data = 0x244f4c0 "J", len = 46}
(gdb) p *(XLogRecord *)rdata->data    --> XLogRecord
$27 = {xl_tot_len = 74, xl_xid = 2268, xl_prev = 5514538616, xl_info = 0 '\000', xl_rmid = 10 '\n', xl_crc = 1158677949}
(gdb) p *(XLogRecordBlockHeader *)(0x244f4c0+24)    --> XLogRecordBlockHeader
$29 = {id = 0 '\000', fork_flags = 32 ' ', data_length = 25}
(gdb) x/2bx (0x244f4c0+44)    --> XLogRecordDataHeaderShort
0x244f4ec: 0xff 0x03
------------------------------------------------------------------- 2
(gdb) p *rdata->next
$23 = {next = 0x244f2d8, data = 0x7ffebea9d830 "\004", len = 5}
(gdb) p *(xl_heap_header *)rdata->next->data
$32 = {t_infomask2 = 4, t_infomask = 2050, t_hoff = 24 '\030'}
------------------------------------------------------------------- 3
(gdb) p *rdata->next->next
$24 = {next = 0x244f2a8, data = 0x24e6a2f "", len = 20}
(gdb) x/20bc 0x24e6a2f
0x24e6a2f: 0 '\000' 8 '\b' 0 '\000' 0 '\000' 0 '\000' 11 '\v' 67 'C' 50 '2'
0x24e6a37: 45 '-' 56 '8' 11 '\v' 67 'C' 51 '3' 45 '-' 56 '8' 11 '\v'
0x24e6a3f: 67 'C' 52 '4' 45 '-' 56 '8'
(gdb)
------------------------------------------------------------------- 4
(gdb) p *rdata->next->next->next
$25 = {next = 0x0, data = 0x7ffebea9d840 "\b", len = 3}
(gdb)
(gdb) p *(xl_heap_insert *)rdata->next->next->next->data
$33 = {offnum = 8, flags = 0 '\000'}
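The "write the list elements one by one" idea can be shown with a tiny chain of chunks that is flattened into a contiguous buffer. This is only a sketch of the pattern: DemoRecData mirrors the shape of XLogRecData but is a hypothetical type, demo_flatten is a hypothetical helper, and nothing here deals with WAL pages or headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same shape as XLogRecData: a singly linked list of (pointer, length). */
typedef struct DemoRecData
{
    struct DemoRecData *next;   /* next chunk in the chain, or NULL */
    const char *data;           /* start of this chunk's bytes */
    uint32_t    len;            /* number of bytes in this chunk */
} DemoRecData;

/* Copy every chunk, in order, into dst; return the bytes written. */
static size_t
demo_flatten(const DemoRecData *rdata, char *dst, size_t cap)
{
    size_t written = 0;

    for (; rdata != NULL; rdata = rdata->next)
    {
        assert(written + rdata->len <= cap);
        memcpy(dst + written, rdata->data, rdata->len);
        written += rdata->len;
    }
    return written;
}
```

For the record above, the four chunk lengths 46 + 5 + 20 + 3 add up to 74, the xl_tot_len stored in the XLogRecord header. The copy routine never needs to know what each chunk contains.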
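5. References
PostgreSQL Source Code Walkthrough (4) - Inserting Data #3 (heap_insert)
A Brief Analysis of the PostgreSQL Transaction Log (WAL) Structure
PostgreSQL Source Code Walkthrough (110) - WAL#6 (Insert & WAL - XLogRecordAssemble, the record assembly function)
PostgreSQL Source Code Walkthrough (111) - WAL#7 (Insert & WAL - XLogRecordAssemble-FPW)
PostgreSQL Source Code Walkthrough (112) - WAL#8 (XLogCtrl data structures)
PG Source Code
From the "ITPUB Blog", link: http://blog.itpub.net/6906/viewspace-2374769/. If you repost this article, please credit the source; otherwise legal liability will be pursued.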