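PostgreSQL Source Code Walkthrough (159) - PG Tools #6 (What does pg_rewind do)
In a PostgreSQL HA environment built on streaming replication, a network or hardware failure may cause the Standby node to be promoted to Master while the database on the Old Master node remains undamaged. Once the fault has been resolved, the Old Master can become a Standby of the New Master by running the pg_rewind tool, without having to be rebuilt from a backup.
What exactly does pg_rewind do when it runs? This section takes a closer look.
0. Principles
In a PostgreSQL HA environment, promoting the Standby to Master switches it to a new timeline, for example from 1 to 2. The Old Master stays on its original timeline, for example 1. So how does pg_rewind let the Old Master read the relevant data from the New Master and become the new Standby?
In short, it takes the following steps:
1. Determine the position at which the New Master and the Old Master were last consistent. Up to that position their data is identical. The point of divergence is obtained by comparing the timeline history files of the New Master and the Old Master (located under $PGDATA/pg_wal/ and named XX.history); pg_rewind then uses the last checkpoint before that point as its starting position.
2. Starting from the checkpoint obtained in the previous step, the Old Master reads the WAL records in its local WAL files, finds the blocks that changed after that checkpoint, and stores the block numbers and related information as a linked list.
3. Using the block information from step 2, copy the corresponding blocks from the New Master and overwrite the same blocks on the Old Master.
4. Copy all other files from the New Master apart from the data files, including configuration files (if the data files were copied as well, this would differ little from rebuilding from a backup).
5. The Old Master starts the database and replays the WAL records from that checkpoint onward.
As shown in the figure below:
After the switchover, the New Master moves to timeline n + 1. With pg_rewind, the Old Master can resynchronize with the New Master from the point of divergence and become the New Standby node.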
1. Data Structures
XLogRecPtr
A 64-bit address in the WAL record address space.
/*
* Pointer to a location in the XLOG. These pointers are 64 bits wide,
* because we don't want them ever to overflow.
*/
typedef uint64 XLogRecPtr;
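The main function below prints WAL positions with the format %X/%X by splitting the 64-bit value into its high and low 32 bits. The following is a small sketch of that convention and its inverse; format_lsn and parse_lsn are hypothetical helper names used only for illustration, and the typedef is redeclared with standard C types so the snippet stands alone.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Write ptr as "HIGH/LOW" in hexadecimal; returns the snprintf result. */
int
format_lsn(XLogRecPtr ptr, char *buf, size_t buflen)
{
    assert(buf != NULL);
    return snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
                    (uint32_t) (ptr >> 32), (uint32_t) ptr);
}

/* Parse "HIGH/LOW" back into a 64-bit position; returns 1 on success. */
int
parse_lsn(const char *s, XLogRecPtr *out)
{
    uint32_t hi;
    uint32_t lo;

    assert(s != NULL && out != NULL);
    if (sscanf(s, "%" SCNx32 "/%" SCNx32, &hi, &lo) != 2)
        return 0;
    *out = ((XLogRecPtr) hi << 32) | lo;
    return 1;
}
```

Comparing two positions needs no parsing at all: because XLogRecPtr is a plain integer, checks such as ControlFile_target.checkPoint >= divergerec in main are ordinary integer comparisons.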
TimeLineID
Timeline ID
typedef uint32 TimeLineID;
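The timeline ID also determines the name of the timeline history file mentioned in the principles section: the ID is written as eight hexadecimal digits followed by .history. The sketch below builds that name; history_file_name is a hypothetical helper introduced for illustration, not a PostgreSQL function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TimeLineID;

/* Build the history file name for a timeline, e.g. 00000002.history. */
int
history_file_name(TimeLineID tli, char *buf, size_t buflen)
{
    assert(buf != NULL);
    return snprintf(buf, buflen, "%08X.history", (unsigned int) tli);
}
```

Calling history_file_name(2, buf, sizeof buf) fills buf with 00000002.history, the file a node on timeline 2 would keep under $PGDATA/pg_wal/.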
2. Source Code Walkthrough
The pg_rewind source code is fairly simple; see the comments for details.
int
main(int argc, char **argv)
{
static struct option long_options[] = {
{"help", no_argument, NULL, '?'},
{"target-pgdata", required_argument, NULL, 'D'},
{"source-pgdata", required_argument, NULL, 1},
{"source-server", required_argument, NULL, 2},
{"version", no_argument, NULL, 'V'},
{"dry-run", no_argument, NULL, 'n'},
{"no-sync", no_argument, NULL, 'N'},
{"progress", no_argument, NULL, 'P'},
{"debug", no_argument, NULL, 3},
{NULL, 0, NULL, 0}
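};//command-line options
int option_index;//index of the matched long option
int c;//option character returned by getopt_long
XLogRecPtr divergerec;//point of divergence
int lastcommontliIndex;//index of the last common timeline in the history
XLogRecPtr chkptrec;//location of the checkpoint record
TimeLineID chkpttli;//timeline of that checkpoint
XLogRecPtr chkptredo;//REDO location of that checkpoint
size_t size;
char *buffer;//buffer
bool rewind_needed;//whether a rewind is needed
XLogRecPtr endrec;//end location
TimeLineID endtli;//end timeline
ControlFileData ControlFile_new;//new control file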
set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_rewind"));
progname = get_progname(argv[0]);
/* Process command-line arguments */
if (argc > 1)
{
if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-?") == 0)
{
usage(progname);
exit(0);
}
if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
{
puts("pg_rewind (PostgreSQL) " PG_VERSION);
exit(0);
}
}
while ((c = getopt_long(argc, argv, "D:nNP", long_options, &option_index)) != -1)
{
switch (c)
{
case '?':
fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
exit(1);
case 'P':
showprogress = true;
break;
case 'n':
dry_run = true;
break;
case 'N':
do_sync = false;
break;
case 3:
debug = true;
break;
case 'D': /* -D or --target-pgdata */
datadir_target = pg_strdup(optarg);
break;
case 1: /* --source-pgdata */
datadir_source = pg_strdup(optarg);
break;
case 2: /* --source-server */
connstr_source = pg_strdup(optarg);
break;
}
}
if (datadir_source == NULL && connstr_source == NULL)
{
fprintf(stderr, _("%s: no source specified (--source-pgdata or --source-server)\n"), progname);
fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
exit(1);
}
if (datadir_source != NULL && connstr_source != NULL)
{
fprintf(stderr, _("%s: only one of --source-pgdata or --source-server can be specified\n"), progname);
fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
exit(1);
}
if (datadir_target == NULL)
{
fprintf(stderr, _("%s: no target data directory specified (--target-pgdata)\n"), progname);
fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
exit(1);
}
if (optind < argc)
{
fprintf(stderr, _("%s: too many command-line arguments (first is \"%s\")\n"),
progname, argv[optind]);
fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
exit(1);
}
/*
* Don't allow pg_rewind to be run as root, to avoid overwriting the
* ownership of files in the data directory. We need only check for root
* -- any other user won't have sufficient permissions to modify files in
* the data directory.
*/
#ifndef WIN32
if (geteuid() == 0)
{
//running as root
fprintf(stderr, _("cannot be executed by \"root\"\n"));
fprintf(stderr, _("You must run %s as the PostgreSQL superuser.\n"),
progname);
exit(1);
}
#endif
get_restricted_token(progname);
/* Set mask based on PGDATA permissions */
if (!GetDataDirectoryCreatePerm(datadir_target))
{
fprintf(stderr, _("%s: could not read permissions of directory \"%s\": %s\n"),
progname, datadir_target, strerror(errno));
exit(1);
}
umask(pg_mode_mask);
/* Connect to remote server */
if (connstr_source)
libpqConnect(connstr_source);
/*
* Ok, we have all the options and we're ready to start. Read in all the
* information we need from both clusters.
*/
//read the target's control file
buffer = slurpFile(datadir_target, "global/pg_control", &size);
digestControlFile(&ControlFile_target, buffer, size);
pg_free(buffer);
//read the source's control file
buffer = fetchFile("global/pg_control", &size);
digestControlFile(&ControlFile_source, buffer, size);
pg_free(buffer);
sanityChecks();
/*
* If both clusters are already on the same timeline, there's nothing to
* do.
*/
if (ControlFile_target.checkPointCopy.ThisTimeLineID == ControlFile_source.checkPointCopy.ThisTimeLineID)
{
printf(_("source and target cluster are on the same timeline\n"));
rewind_needed = false;
}
else
{
//find the point of divergence
findCommonAncestorTimeline(&divergerec, &lastcommontliIndex);
printf(_("servers diverged at WAL location %X/%X on timeline %u\n"),
(uint32) (divergerec >> 32), (uint32) divergerec,
targetHistory[lastcommontliIndex].tli);
/*
* Check for the possibility that the target is in fact a direct
* ancestor of the source. In that case, there is no divergent history
* in the target that needs rewinding.
*/
if (ControlFile_target.checkPoint >= divergerec)
{
//the target's last checkpoint is at or after the point of divergence: a rewind is needed
rewind_needed = true;
}
else
{
//the target's last checkpoint is before the point of divergence
XLogRecPtr chkptendrec;
/* Read the checkpoint record on the target to see where it ends. */
chkptendrec = readOneRecord(datadir_target,
ControlFile_target.checkPoint,
targetNentries - 1);
/*
* If the histories diverged exactly at the end of the shutdown
* checkpoint record on the target, there are no WAL records in
* the target that don't belong in the source's history, and no
* rewind is needed.
*/
if (chkptendrec == divergerec)
rewind_needed = false;
else
rewind_needed = true;
}
}
if (!rewind_needed)
{
//no rewind is needed, exit
printf(_("no rewind required\n"));
exit(0);
}
//find the last checkpoint on the target before the point of divergence
findLastCheckpoint(datadir_target, divergerec,
lastcommontliIndex,
&chkptrec, &chkpttli, &chkptredo);
printf(_("rewinding from last common checkpoint at %X/%X on timeline %u\n"),
(uint32) (chkptrec >> 32), (uint32) chkptrec,
chkpttli);
/*
* Build the filemap, by comparing the source and target data directories.
*/
//create an empty filemap, then fill it from the source and target file lists
filemap_create();
pg_log(PG_PROGRESS, "reading source file list\n");
fetchSourceFileList();
pg_log(PG_PROGRESS, "reading target file list\n");
traverse_datadir(datadir_target, &process_target_file);
/*
* Read the target WAL from last checkpoint before the point of fork, to
* extract all the pages that were modified on the target cluster after
* the fork. We can stop reading after reaching the final shutdown record.
* XXX: If we supported rewinding a server that was not shut down cleanly,
* we would need to replay until the end of WAL here.
*/
//add the pages changed on the target to the filemap
pg_log(PG_PROGRESS, "reading WAL in target\n");
extractPageMap(datadir_target, chkptrec, lastcommontliIndex,
ControlFile_target.checkPoint);
filemap_finalize();
if (showprogress)
calculate_totals();
/* this is too verbose even for verbose mode */
//print the filemap only in debug mode
if (debug)
print_filemap();
/*
* Ok, we're ready to start copying things over.
*/
if (showprogress)
{
pg_log(PG_PROGRESS, "need to copy %lu MB (total source directory size is %lu MB)\n",
(unsigned long) (filemap->fetch_size / (1024 * 1024)),
(unsigned long) (filemap->total_size / (1024 * 1024)));
fetch_size = filemap->fetch_size;
fetch_done = 0;
}
/*
* This is the point of no return. Once we start copying things, we have
* modified the target directory and there is no turning back!
*/
//carry out the actions recorded in the filemap
executeFileMap();
progress_report(true);
//create the backup_label file and update the control file
pg_log(PG_PROGRESS, "\ncreating backup label and updating control file\n");
createBackupLabel(chkptredo, chkpttli, chkptrec);
/*
* Update control file of target. Make it ready to perform archive
* recovery when restarting.
*
* minRecoveryPoint is set to the current WAL insert location in the
* source server. Like in an online backup, it's important that we recover
* all the WAL that was generated while we copied the files over.
*/
//start the new control file as a copy of the source's control file
memcpy(&ControlFile_new, &ControlFile_source, sizeof(ControlFileData));
if (connstr_source)
{
//get the source's current WAL insert location
endrec = libpqGetCurrentXlogInsertLocation();
//get the source's timeline
endtli = ControlFile_source.checkPointCopy.ThisTimeLineID;
}
else
{
endrec = ControlFile_source.checkPoint;
endtli = ControlFile_source.checkPointCopy.ThisTimeLineID;
}
//set minRecoveryPoint and the state, then write out the control file
ControlFile_new.minRecoveryPoint = endrec;
ControlFile_new.minRecoveryPointTLI = endtli;
ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
update_controlfile(datadir_target, progname, &ControlFile_new, do_sync);
pg_log(PG_PROGRESS, "syncing target data directory\n");
//fsync the target data directory
syncTargetDirectory();
printf(_("Done!\n"));
return 0;
}
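The branch in main that decides whether a rewind is needed can be restated as a small pure function. This is a sketch under assumptions, not code from pg_rewind: the function name and its parameters are hypothetical, and the end of the target's checkpoint record is passed in directly, whereas main reads it with readOneRecord only when it is required.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Restates the decision made in main:
 *  - same timeline on both sides: nothing to do
 *  - target checkpoint at or after the point of divergence: rewind
 *  - otherwise rewind unless the histories diverged exactly at the end
 *    of the target's checkpoint record
 */
bool
rewind_needed(bool same_timeline,
              XLogRecPtr target_checkpoint,
              XLogRecPtr target_checkpoint_end,
              XLogRecPtr divergerec)
{
    /* a record always ends after the position where it starts */
    assert(target_checkpoint < target_checkpoint_end);

    if (same_timeline)
        return false;
    if (target_checkpoint >= divergerec)
        return true;
    return target_checkpoint_end != divergerec;
}
```

The last case covers an Old Master that was shut down cleanly before the promotion: its shutdown checkpoint ends exactly where the New Master's timeline begins, so it holds no WAL that is missing from the New Master's history.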
3. Trace Analysis
N/A
4. References
From the ITPUB blog, link: http://blog.itpub.net/6906/viewspace-2639738/. Please credit the source when reposting; otherwise legal liability may be pursued.