A Case Analysis of Why the PostgreSQL Recovery (startup) Process Hangs

Posted by 德哥 on 2015-12-10
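I recently ran into a rather odd problem in a PostgreSQL primary/standby environment that spans a WAN.
The primary and standby are separated by a WAN, but that is not the key point. The key point is that the WAL archive is also reached across the WAN, and the standby accesses the archived files over NFS.
The standby fetches archived WAL over NFS and connects to the primary over TCP for streaming replication.
For some unknown reason the NFS mount developed a problem: the standby could no longer access the archived files, and any command that touched the NFS mount would hang.
Below I first describe the symptoms, then analyze the cause from the PostgreSQL source code.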
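1. The standby's restore_command (cp /nfsdir/%f %p) hung while copying one particular archived xlog file.
2. I killed this cp command by hand, and the standby database crashed immediately afterwards.
3. I restarted the standby and found it hung at cp /nfsdir/00000009.history %p. However, 00000009.history does not actually exist: the primary is on timeline 8 and the standby is also on timeline 8. So why does the standby go looking for a timeline history file that does not exist?
The answer lies in the source code.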
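    When setting up streaming replication we usually configure recovery_target_timeline=latest, so that if the upstream switches to a new timeline, the standby follows it automatically.
    This setting explains why standby recovery looks for a timeline history file that does not exist: to find the latest timeline, the startup process probes for the history file of the next timeline, and that probe fetches from the archive through restore_command. With the NFS mount hung, the probe hangs as well.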

See the source code:
src/backend/access/transam/timeline.c
/*
 * Find the newest existing timeline, assuming that startTLI exists.
 *
 * Note: while this is somewhat heuristic, it does positively guarantee
 * that (result + 1) is not a known timeline, and therefore it should
 * be safe to assign that ID to a new timeline.
 */
TimeLineID
findNewestTimeLine(TimeLineID startTLI)
{
        TimeLineID      newestTLI;
        TimeLineID      probeTLI;

        /*
         * The algorithm is just to probe for the existence of timeline history
         * files.  XXX is it useful to allow gaps in the sequence?
         */
        newestTLI = startTLI;

        for (probeTLI = startTLI + 1;; probeTLI++)      /* Annotation: this is where the problem lies. It probes whether the next timeline exists. */
        {
                if (existsTimeLineHistory(probeTLI))
                {
                        newestTLI = probeTLI;           /* probeTLI exists */
                }
                else
                {
                        /* doesn't exist, assume we're done */
                        break;
                }
        }

        return newestTLI;
}
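
To make the probing behaviour concrete, here is a minimal standalone sketch of the same loop. It is not the server code: fake_history_exists is a hypothetical stand-in for existsTimeLineHistory that simply pretends timelines 2 through 8 have history files, and probe_count and last_probed are illustrative counters added only to observe the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TimeLineID;

/* Illustrative counters (not part of PostgreSQL). */
int        probe_count = 0;
TimeLineID last_probed = 0;

/*
 * Hypothetical stand-in for existsTimeLineHistory(): assume that only
 * timelines 2..8 have a history file. In the real server this check
 * goes to the archive, which is where the hang was observed.
 */
bool
fake_history_exists(TimeLineID tli)
{
    probe_count++;
    last_probed = tli;
    return tli >= 2 && tli <= 8;
}

/* Same shape as findNewestTimeLine(): probe upward until a miss. */
TimeLineID
find_newest_timeline_sketch(TimeLineID start_tli)
{
    TimeLineID newest = start_tli;

    assert(start_tli >= 1);     /* timeline IDs start at 1 */

    for (TimeLineID probe = start_tli + 1;; probe++)
    {
        if (!fake_history_exists(probe))
            break;              /* first missing history file ends the search */
        newest = probe;
    }
    return newest;
}
```

Starting from timeline 8, the sketch returns 8 but still performs one probe, for timeline 9. The loop can only stop after it has asked for a history file that is not there, so a lookup of a nonexistent file is expected on every start with this setting.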

src/backend/access/transam/xlog.c
/*
 * See if there is a recovery command file (recovery.conf), and if so
 * read in parameters for archive recovery and XLOG streaming.
 *
 * The file is parsed using the main configuration parser.
 */
static void
readRecoveryCommandFile(void)
......
                else if (strcmp(item->name, "recovery_target_timeline") == 0)
                {
                        rtliGiven = true;
                        if (strcmp(item->value, "latest") == 0)
                                rtli = 0;
.....
        /*
         * If user specified recovery_target_timeline, validate it or compute the
         * "latest" value.  We can't do this until after we've gotten the restore
         * command and set InArchiveRecovery, because we need to fetch timeline
         * history files from the archive.
         */
        if (rtliGiven)
        {
                if (rtli)
                {
                        /* Timeline 1 does not have a history file, all else should */
                        if (rtli != 1 && !existsTimeLineHistory(rtli))
                                ereport(FATAL,
                                                (errmsg("recovery target timeline %u does not exist",
                                                                rtli)));
                        recoveryTargetTLI = rtli;
                        recoveryTargetIsLatest = false;
                }
                else
                {
                        /* We start the "latest" search from pg_control's timeline */
                        /* Annotation: this is the branch taken here. I configured recovery_target_timeline=latest, so findNewestTimeLine is called. The control file says timeline 8, so it looks for 00000009.history. */
                        recoveryTargetTLI = findNewestTimeLine(recoveryTargetTLI);
                        recoveryTargetIsLatest = true;
                }
        }
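
As a small illustration of where the requested file name comes from, the sketch below formats a timeline ID the way history files are named: eight uppercase hexadecimal digits followed by .history. The function name history_file_name_sketch is hypothetical and exists only for this example; it is a standalone approximation, not the server's own code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TimeLineID;

/*
 * Hypothetical helper: write "<8 hex digits>.history" into buf.
 * Returns the value returned by snprintf (the length of the full name).
 */
int
history_file_name_sketch(char *buf, size_t buflen, TimeLineID tli)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%08X.history", (unsigned int) tli);
}
```

With the control file on timeline 8, the first probe is for timeline 9, which this sketch formats as 00000009.history. That is the file the hung cp command was trying to copy from /nfsdir.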

With the cause identified, I fixed the NFS problem and restarted the database, and the startup process no longer hung.

[References]
1. src/backend/access/transam/xlog.c
2. src/backend/access/transam/timeline.c

