PostgreSQL 跟蹤checkpointer出現死鎖
gdb跟蹤checkpointer程式,出現死鎖,Mark一下.
跟蹤checkpointer程式,檢視共享記憶體中的信(heckpointerShmem->requests)
(gdb) p CheckpointerShmem->requests[150] ... $16 = {rnode = {spcNode = 1663, dbNode = 16402, relNode = 26185}, forknum = MAIN_FORKNUM, segno = 0} (gdb) p CheckpointerShmem->requests[200] Cannot access memory at address 0xf9fb18 ...
然後,請求checkpoint的程式報錯
testdb=# update t_wal_ckpt set c2 = 'C2#'||substr(c2,4,40); UPDATE 8192 testdb=# checkpoint; 2019-01-07 12:30:32.114 CST [1418] PANIC: stuck spinlock detected at RequestCheckpoint, checkpointer.c:1050 2019-01-07 12:30:32.114 CST [1418] STATEMENT: checkpoint; 2019-01-07 12:30:37.081 CST [1390] PANIC: stuck spinlock detected at FirstCallSinceLastCheckpoint, checkpointer.c:1376 2019-01-07 12:30:38.610 CST [1370] LOG: background writer process (PID 1390) was terminated by signal 6: Aborted 2019-01-07 12:30:38.610 CST [1370] LOG: terminating any other active server processes 2019-01-07 12:30:38.611 CST [1392] WARNING: terminating connection because of crash of another server process 2019-01-07 12:30:38.611 CST [1392] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2019-01-07 12:30:38.611 CST [1392] HINT: In a moment you should be able to reconnect to the database and repeat your command. 2019-01-07 12:30:38.613 CST [1558] WARNING: terminating connection because of crash of another server process 2019-01-07 12:30:38.613 CST [1558] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2019-01-07 12:30:38.613 CST [1558] HINT: In a moment you should be able to reconnect to the database and repeat your command. PANIC: stuck spinlock detected at RequestCheckpoint, checkpointer.c:1050 server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: 2019-01-07 12:30:54.041 CST [1560] FATAL: the database system is in recovery mode Failed. !>
嘗試重新連線,發現DB已coredump.
[xdb@localhost ~]$ [xdb@localhost ~]$ psql -d testdb 2019-01-07 14:10:16.114 CST [1629] FATAL: the database system is in recovery mode psql: FATAL: the database system is in recovery mode
執行恢復
[xdb@localhost ~]$ pg_ctl start pg_ctl: another server might be running; trying to start server anyway waiting for server to start....2019-01-07 14:11:50.821 CST [1632] FATAL: lock file "postmaster.pid" already exists 2019-01-07 14:11:50.821 CST [1632] HINT: Is another postmaster (PID 1370) running in data directory "/data/xdb/pg111db"? stopped waiting pg_ctl: could not start server Examine the log output. [xdb@localhost ~]$ find /data/xdb -name postmaster.pid /data/xdb/pg111db/postmaster.pid [xdb@localhost ~]$ rm -rf /data/xdb/pg111db/postmaster.pid [xdb@localhost ~]$ pg_ctl start waiting for server to start....2019-01-07 14:12:44.578 CST [1639] LOG: could not bind IPv6 address "::1": Address already in use [xdb@localhost ~]$ ps -ef|grep postgres xdb 1370 1 0 12:01 pts/0 00:00:02 /appdb/atlasdb/pg11.1/bin/postgres xdb 1389 1370 0 12:01 ? 00:00:00 [postgres] <defunct> xdb 1641 1332 0 14:12 pts/0 00:00:00 grep --color=auto postgres [xdb@localhost ~]$ kill -9 1370 [xdb@localhost ~]$ pg_ctl start waiting for server to start....2019-01-07 14:13:33.125 CST [1648] LOG: listening on IPv6 address "::1", port 5432 2019-01-07 14:13:33.125 CST [1648] LOG: listening on IPv4 address "127.0.0.1", port 5432 2019-01-07 14:13:33.142 CST [1648] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" .2019-01-07 14:13:34.361 CST [1649] LOG: database system was interrupted; last known up at 2019-01-07 12:26:22 CST 2019-01-07 14:13:34.818 CST [1649] LOG: database system was not properly shut down; automatic recovery in progress 2019-01-07 14:13:34.863 CST [1649] LOG: redo starts at 1/48F9ED08 .2019-01-07 14:13:35.467 CST [1649] LOG: invalid record length at 1/4914FF58: wanted 24, got 0 2019-01-07 14:13:35.467 CST [1649] LOG: redo done at 1/4914FF30 2019-01-07 14:13:35.467 CST [1649] LOG: last completed transaction was at log time 2019-01-07 12:28:37.521542+08 2019-01-07 14:13:35.977 CST [1648] LOG: database system is ready to accept connections done server started
經分析,是因為共享記憶體結構中的CheckpointerShmem->ckpt_lck導致的.
在跟蹤checkpointer程式時,執行
SpinLockRelease(&CheckpointerShmem->ckpt_lck);
釋放lock後,不再出現上述問題.
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/6906/viewspace-2374768/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- PostgreSQL死鎖相關SQL
- PostgreSQL 死鎖異常SQL
- 一個ORACLE死鎖問題的追蹤Oracle
- 基於行跟蹤的ROWDEPENDENCIES實現資訊變化跟蹤
- 面試:什麼是死鎖,如何避免或解決死鎖;MySQL中的死鎖現象,MySQL死鎖如何解決面試MySql
- PostgreSQL DBA(56) - Why does checkpointer use so much memorySQL
- Java多執行緒-死鎖的出現和解決Java執行緒
- 在Unity中實現手部跟蹤Unity
- [zt] oracle跟蹤檔案與跟蹤事件Oracle事件
- oracle跟蹤檔案與跟蹤事件(zt)Oracle事件
- oracle跟蹤檔案和跟蹤事件(zt)Oracle事件
- 模擬SQLserver死鎖現象SQLServer
- Java鎖——死鎖Java
- 死鎖
- PostgreSQL10.0preview功能增強-CLOGoldestXID跟蹤SQLViewGo
- SQ死鎖及死鎖的解決
- Apache Bloodhound 0.8 釋出,專案跟蹤工具Apache
- Java 實現執行緒死鎖Java執行緒
- 面試官:什麼是死鎖?怎麼排查死鎖?怎麼避免死鎖?面試
- sqlnet跟蹤SQL
- ORACLE 跟蹤工具Oracle
- 什麼是死鎖?如何解決死鎖?
- 殺死Oracle死鎖程式Oracle
- PostgreSQL模擬兩個update語句死鎖-利用掃描方法SQL
- PostgreSQL 原始碼解讀(213)- 後臺程式#12(checkpointer-CheckpointWriteDelay)SQL原始碼
- 死鎖分析
- oracle 死鎖Oracle
- 在 WebSphere Application Server V6.1 應用程式中跟蹤死鎖WebAPPServer
- PostgreSQL:鎖SQL
- mysqldump跟蹤匯出來東西是否排序了MySql排序
- Firefox 63 釋出,增強跟蹤保護Firefox
- 利用行SCN實現表變化跟蹤
- 檢視oracle死鎖程式並結束死鎖Oracle
- 例項詳解 Java 死鎖與破解死鎖Java
- MySQL死鎖系列-線上死鎖問題排查思路MySql
- 死磕Synchronized底層實現–偏向鎖synchronized
- 作業系統(5) 死鎖的概念 死鎖產生的必要條件 死鎖的處理策略 預防死鎖 避免死鎖 死鎖的檢測和解除 銀行家演算法作業系統演算法
- PostgreSQL 原始碼解讀(15)- Insert語句(執行過程跟蹤)SQL原始碼