A case study: why the PostgreSQL recovery (startup) process hung

Posted by 德哥 on 2015-12-10

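Recently I ran into a rather odd problem in a PostgreSQL primary/standby setup that spans a WAN.
The primary and standby sitting on opposite sides of a WAN is not the important part. What matters is that the WAL archive is also reached across the WAN, and the standby accesses the archived files through NFS.
The standby fetches archived WAL over NFS and connects to the primary over TCP for streaming replication.
For some unknown reason NFS broke: the standby could no longer access the archived files, and any command that touched the NFS mount hung.
Below I describe the symptoms first and then trace the cause through the PostgreSQL source code.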
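1. The standby's restore_command (cp /nfsdir/%f %p) hung while copying one of the archived xlog files.
2. I killed that cp process by hand, and the standby database crashed immediately afterwards.
3. On restart, the standby hung again, this time in cp /nfsdir/00000009.history %p. That file does not actually exist: the primary is on timeline 8 and so is the standby. So why would the standby go looking for the history file of a timeline that does not exist?
The answer has to be found in the source code.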
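    When setting up streaming replication we usually set recovery_target_timeline=latest. The point is to have a hot standby that automatically follows the upstream if it switches to a new timeline.
    This setting is exactly what explains why the standby looks for a nonexistent timeline history file during recovery: to find the latest timeline, it has to probe for the one after the timeline it already knows.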

See the source:
src/backend/access/transam/timeline.c
/*
 * Find the newest existing timeline, assuming that startTLI exists.
 *
 * Note: while this is somewhat heuristic, it does positively guarantee
 * that (result + 1) is not a known timeline, and therefore it should
 * be safe to assign that ID to a new timeline.
 */
TimeLineID
findNewestTimeLine(TimeLineID startTLI)
{
        TimeLineID      newestTLI;
        TimeLineID      probeTLI;

        /*
         * The algorithm is just to probe for the existence of timeline history
         * files.  XXX is it useful to allow gaps in the sequence?
         */
        newestTLI = startTLI;

        for (probeTLI = startTLI + 1;; probeTLI++)      /* Author's note: the problem is here; it probes whether the next timeline exists. */
        {
                if (existsTimeLineHistory(probeTLI))
                {
                        newestTLI = probeTLI;           /* probeTLI exists */
                }
                else
                {
                        /* doesn't exist, assume we're done */
                        break;
                }
        }

        return newestTLI;
}
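
To make the effect of this loop concrete, here is a minimal, self-contained C sketch. It is not PostgreSQL's actual code. The names find_newest_timeline, archive_has_history, history_file_name and last_probed are hypothetical, and the existence check is a stub that pretends the archive holds history files for timelines 2 through 8 only. The one detail taken from PostgreSQL is the file name layout: the timeline ID as eight hex digits followed by .history.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TimeLineID;

/* Hypothetical callback type: "does a history file exist for this timeline?" */
typedef bool (*HistoryExistsFn)(TimeLineID tli);

/* Records the last timeline the probe asked about (for illustration only). */
TimeLineID last_probed = 0;

/* Stub archive for this incident: nothing exists beyond timeline 8. */
bool
archive_has_history(TimeLineID tli)
{
    last_probed = tli;
    return tli >= 2 && tli <= 8;
}

/* Build the history file name; returns the length snprintf reports. */
int
history_file_name(char *buf, size_t len, TimeLineID tli)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%08X.history", tli);
}

/* Same probing idea as findNewestTimeLine, with the existence check injected. */
TimeLineID
find_newest_timeline(TimeLineID start, HistoryExistsFn exists)
{
    TimeLineID newest = start;

    for (TimeLineID probe = start + 1;; probe++)
    {
        if (!exists(probe))
            break;              /* first missing timeline ends the search */
        newest = probe;
    }
    return newest;
}
```

Calling find_newest_timeline(8, archive_has_history) returns 8, but only after asking about timeline 9. On the standby that question is answered by fetching the history file from the archive with restore_command, which is why a hung NFS mount stalls startup on a file that was never going to exist.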

src/backend/access/transam/xlog.c
/*
 * See if there is a recovery command file (recovery.conf), and if so
 * read in parameters for archive recovery and XLOG streaming.
 *
 * The file is parsed using the main configuration parser.
 */
static void
readRecoveryCommandFile(void)
......
                else if (strcmp(item->name, "recovery_target_timeline") == 0)
                {
                        rtliGiven = true;
                        if (strcmp(item->value, "latest") == 0)
                                rtli = 0;
.....
        /*
         * If user specified recovery_target_timeline, validate it or compute the
         * "latest" value.  We can't do this until after we've gotten the restore
         * command and set InArchiveRecovery, because we need to fetch timeline
         * history files from the archive.
         */
        if (rtliGiven)
        {
                if (rtli)
                {
                        /* Timeline 1 does not have a history file, all else should */
                        if (rtli != 1 && !existsTimeLineHistory(rtli))
                                ereport(FATAL,
                                                (errmsg("recovery target timeline %u does not exist",
                                                                rtli)));
                        recoveryTargetTLI = rtli;
                        recoveryTargetIsLatest = false;
                }
                else
                {
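                        /*
                         * Author's note: this is the branch that matters here. I had set
                         * recovery_target_timeline=latest, so findNewestTimeLine is called.
                         * The control file is on timeline 8, so it goes looking for
                         * 00000009.history.
                         */
                        /* We start the "latest" search from pg_control's timeline */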
                        recoveryTargetTLI = findNewestTimeLine(recoveryTargetTLI);
                        recoveryTargetIsLatest = true;
                }
        }

Once the cause was clear, I fixed the NFS problem and restarted the database. It came up normally and no longer hangs.
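
A side note on step 2, the crash after killing cp. restore_command is run through the shell, and the startup process then inspects the wait status. As far as I recall from the recovery code, a command that died from a signal, or whose shell reports an exit status above 125, is treated as fatal rather than as "file not in the archive", so recovery aborts. Treat that as my reading rather than something verified in this post. The sketch below only shows the generic POSIX mechanism for telling the two cases apart; CmdOutcome and classify_command are hypothetical names, not PostgreSQL ones.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical outcome categories for this sketch. */
typedef enum
{
    CMD_OK,                     /* command exited with status 0 */
    CMD_FAILED,                 /* ordinary non-zero exit, e.g. file missing */
    CMD_SIGNALED                /* killed by a signal, or shell status > 125 */
} CmdOutcome;

/*
 * Run a command through the shell with system() and classify its wait status.
 * Shells report a status above 128 when the program they started died on a
 * signal, so high exit codes are grouped with real signal deaths (assumption:
 * this mirrors how the recovery code decides to give up).
 */
CmdOutcome
classify_command(const char *cmd)
{
    int rc = system(cmd);

    if (rc == -1)
        return CMD_FAILED;      /* the shell could not be started at all */
    if (WIFSIGNALED(rc))
        return CMD_SIGNALED;    /* the shell itself was killed by a signal */
    if (WIFEXITED(rc) && WEXITSTATUS(rc) > 125)
        return CMD_SIGNALED;    /* shell reports its child died abnormally */
    if (WIFEXITED(rc) && WEXITSTATUS(rc) == 0)
        return CMD_OK;
    return CMD_FAILED;
}
```

With this classification, a cp that fails because the file is absent lands in CMD_FAILED, while a cp that was killed by hand lands in CMD_SIGNALED, which would match the standby going down right after the kill.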

[References]
1. src/backend/access/transam/xlog.c
2. src/backend/access/transam/timeline.c
