pg_terminate_backend()運維改造--列印呼叫記錄及被終止query文字
同學,別再說不是你kill的會話了,這個鍋DBA不背。
一、場景
A和B兩位同學由於某個重要SQL會話被kill,爭論得面紅耳赤,原因是一個表加欄位沒生效。A同學是一位業務研發,B同學是一位
DBA。 B同學看了資料庫日誌,列印如下:
2023-02-26 11:25:01.925 CST,"prouser","prodb",1514,"192.168.2.5",63fad0a5.5ea,4,"idle in transaction",2023-02-26 11:23:17 CST,7/50609,43757341,FATAL,57P01, "terminating connection due to administrator command",,,,,,,,,"psql","client backend",,-7856242769374161054
從日誌看對應時間有一個idle in transaction的session被kill了,但是並沒記錄被kill的具體query,
沒有十足的證據就是對應sql 被kill。 只能是懷疑alter table執行完後,在idle in transaction狀態session被kill,導致事務未提交alter table未生效。當然這是根據
日誌做 的推測,沒有十足的證據,被A同學懟的沒有半點脾氣。
二、分析
idle,idle in transaction狀態的會話被terminate後不會列印對應的query,具體可以看下程式接收到SIGTERM後
ProcessInterrupts和errfinish的處理邏輯。只有active也就是正在執行的被terminate後可以列印出具體的query,
但是也存在兩個因素無法證明是誰執行的terminate。
1、只記錄被terminate的程式資訊,不記錄操作terminate的程式資訊,也就是不記錄“兇手” ;
2、直接在資料庫後臺kill -15 pid,也是同樣的效果,使用pg_terminate_backend() 其實就是封裝了kill -15 pid,
因此DBA同學也沒法完全自證排除嫌疑。
這種情況下,免不了要互相扯皮,甩鍋。
那麼只要記錄pg_terminate_backend() 的呼叫操作記錄,同時記錄kill的會話對應的query資訊,這裡就非常清晰了,
不用再浪費時間扯皮了。看起來改造pg_terminate_backend()函式就可以了。
我們都知道當例項發生crash後postmaster會記錄異常的process資訊,假如是某個query導致了OOM,
那麼日誌會列印出對應程式的query。看了下具體的實現,是postmaster在處理子程式退出時LogChildExit函式記錄退出子程式
的資訊。但是,對於普通backend子程式的FATAL這種level是不呼叫LogChildExit的,被pg_terminate_backend() kill的就是
FATAL型別的報錯。
/* * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer, * walwriter, autovacuum, archiver or background worker. * * The objectives here are to clean up our local state about the child * process, and to signal all other remaining children to quickdie. */ static void HandleChildCrash(int pid, int exitstatus, const char *procname) { dlist_mutable_iter iter; slist_itersiter; Backend *bp; booltake_action; /* * We only log messages and send signals if this is the first process * crash and we're not doing an immediate shutdown; otherwise, we're only * here to update postmaster's idea of live processes. If we have already * signaled children, nonzero exit status is to be expected, so don't * clutter log. */ /* 當子程式是FatalError退出時是不會呼叫LogChildExit函記錄程式資訊的 */ take_action = !FatalError && Shutdown != ImmediateShutdown; if (take_action) { LogChildExit(LOG, procname, pid, exitstatus); ereport(LOG, (errmsg("terminating any other active server processes"))); SetQuitSignalReason(PMQUIT_FOR_CRASH); } /* 省略部分程式碼行 */ }
改動Postmaster不合適,代價太高。修改pg_terminate_backend()函式,直接從LogChildExit中copy記錄子程式的邏輯即可。
三、方案
/* * Send a signal to terminate a backend process. This is allowed if you are a * member of the role whose process is being terminated. If the timeout input * argument is 0, then this function just signals the backend and returns * true. If timeout is nonzero, then it waits until no process has the given * PID; if the process ends within the timeout, true is returned, and if the * timeout is exceeded, a warning is emitted and false is returned. * * Note that only superusers can signal superuser-owned processes. */ Datum pg_terminate_backend(PG_FUNCTION_ARGS) { intpid; intr; inttimeout;/* milliseconds */ /* Modify by Nickyoung at 2023-02-26 AM */ /* * size of activity_buffer is arbitrary, but set equal to default * track_activity_query_size */ char activity_buffer[1024]; const char *activity = NULL; /* End at 2023-02-26 AM */ pid = PG_GETARG_INT32(0); timeout = PG_GETARG_INT64(1); if (timeout < 0) ereport(ERROR, (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), errmsg("\"timeout\" must not be negative"))); /* 從共享記憶體查詢pid對應query文字 */ activity = pgstat_get_crashed_backend_activity(pid, activity_buffer, sizeof(activity_buffer)); /* 向對應pid傳送SIGTERM終止程式 */ r = pg_signal_backend(pid, SIGTERM); if (r == SIGNAL_BACKEND_NOSUPERUSER) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("must be a superuser to terminate superuser process"))); if (r == SIGNAL_BACKEND_NOPERMISSION) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("must be a member of the role whose process is being terminated or member of pg_signal_backend"))); /* Modify by Nickyoung at 2023-02-26 AM */ /* * Record the operation of using pg_terminate_backend (PID) * to kill the session and the terminated query */ /* 如果SIGTERM傳送成功,那麼列印pg_terminate_backend()函式呼叫記錄,以及被終止的query文字 */ if (r != SIGNAL_BACKEND_ERROR) { ereport(WARNING, (errmsg("process is terminated by: select pg_terminate_backend(%d), query is: %s" ,pid ,activity))); } /* End at 2023-02-26 AM */ /* Wait only on success and if actually requested */ if (r == SIGNAL_BACKEND_SUCCESS && timeout > 0) PG_RETURN_BOOL(pg_wait_until_termination(pid, timeout)); else PG_RETURN_BOOL(r == SIGNAL_BACKEND_SUCCESS); }
四、驗證
1、session1開啟事務執行alter table,未提交,程式處於idle in transaction狀態
testdb=> begin; BEGIN testdb=*> alter table instance_list add column type varchar(50) not null default 'rds'; ALTER TABLE testdb=*>
2、session2 查詢表等鎖,程式active狀態
testdb=> select * from instance_list limit 1;
3、session3 kill所有admin賬戶的query
testdb=> select pg_terminate_backend(pid),* from pg_stat_activity where usename='admin' and pid <> pg_backend_pid(); WARNING: process is terminated by: select pg_terminate_backend(9680), query is: alter table instance_list add column type varchar(50) not null default 'rds'; WARNING: process is terminated by: select pg_terminate_backend(9744), query is: select * from instance_list limit 1; pg_terminate_backend | datid | datname | pid | leader_pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend _start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query_id | query | backend_type ----------------------+-------+---------+------+------------+----------+---------+------------------+-------------+-----------------+-------------+---------------- ---------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------+-------------------- -+-------------+--------------+----------------------+-------------------------------------------------------------------------------+---------------- t | 24583 | testdb | 9680 | | 24582 | admin | psql | | | -1 | 2023-02-26 14:4 6:09.621174+08 | 2023-02-26 14:46:12.540059+08 | 2023-02-26 14:46:17.352702+08 | 2023-02-26 14:46:17.354734+08 | Client | ClientRead | idle in transaction | 43785370 | | -7856242769374161054 | alter table instance_list add column type varchar(50) not null default 'rds'; | client backend t | 24583 | testdb | 9744 | | 24582 | admin | psql | | | -1 | 2023-02-26 14:4 6:45.833136+08 | 2023-02-26 14:46:48.770609+08 | 2023-02-26 14:46:48.770609+08 | 2023-02-26 14:46:48.770626+08 | Lock | relation | active | | 43785370 | | select * from instance_list limit 1; | client backend (2 rows) testdb=>
可以看到執行pg_terminate_backend(pid)後列印了對應的資訊
WARNING: process is terminated by: select pg_terminate_backend(9680), query is: alter table instance_list add column type varchar(50) not null default 'rds'; WARNING: process is terminated by: select pg_terminate_backend(9744), query is: select * from instance_list limit 1;
檢視日誌已列印了pg_terminate_backend()函式呼叫記錄,以及被終止的query文字
2023-02-26 14:48:37.792 CST,"admin","testdb",9850,"192.168.2.6",63fb0068.267a,4,"SELECT",2023-02-26 14:47:04 CST,7/21,0,WARNING,01000,"process is terminated by: select pg_terminate_backend(9680), query is: alter table instance_list add column type varchar(50) not null default 'rds';",,,,,,,,,"psql","client backend",,-3764573643027268885 2023-02-26 14:48:37.792 CST,"admin","testdb",9680,"192.168.2.6",63fb0031.25d0,1,"idle in transaction",2023-02-26 14:46:09 CST,4/16,43785370,FATAL,57P01,"terminating connection due to administrator command",,,,,,,,,"psql","client backend",,-7856242769374161054 2023-02-26 14:48:37.792 CST,"admin","testdb",9850,"192.168.2.6",63fb0068.267a,5,"SELECT",2023-02-26 14:47:04 CST,7/21,0,WARNING,01000,"process is terminated by: select pg_terminate_backend(9744), query is: select * from instance_list limit 1;",,,,,,,,,"psql","client backend",,-3764573643027268885 2023-02-26 14:48:37.792 CST,"admin","testdb",9744,"192.168.2.6",63fb0055.2610,1,"SELECT waiting",2023-02-26 14:46:45 CST,5/18,0,FATAL,57P01,"terminating connection due to administrator command",,,,,,"select * from instance_list limit 1;",15,,"psql","client backend",,0
這樣就能抓到操作pg_terminate_backend()的兇手了,再也不用扯皮嘍。
不過,因為和postmaster處理不同,postmaster是回收子程式過程中記錄要退出子程式的query資訊,子程式這裡已經在退出
邏輯中不再執行新sql了,postmaster記錄的就是子程式最終執行的query。
而我們這裡的方案是兩個同級的子程式,先獲取了query,然後緊接著去傳送訊號終止程式,這是個非同步的過程,有細微的
時間差 (納秒級別),獲取query過程中假如連線數非常多,那麼輪詢匹配pid的過程可能就會久一些。
那麼假如批次select pg_terminate_backend(pid),pid,state,query from pg_stat_activity where xxx;殺連線的過程中,
對於執行sql很快的活躍連線(比如單條sql執行1ms都不到這種)可能就會存在本來要殺這個sql,實際終止時已經執行到了
其他sql。
我用pgbench做了測試,例項連線數20000的情況下,批次 殺2000 active連線,大約會出現20個pid是這個情況。
修改為先發訊號終止程式,再獲取query,這個又會出現採集不到query的情況,可能在採集時程式就已經退出了,
獲取的query就是NULL。
其實換個思路來想,假如我想殺一個pid ,這個pid執行的sql很快等我操作完pg_terminate_backend(pid)後可能已經是在
執行其他query了。這種情況下是不可避免的,不過這個方案對於pg_terminate_backend(pid)的操作記錄可以很準確的抓取到。
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/70033867/viewspace-2986768/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- Linux運維學習記錄07Linux運維
- Linux運維學習記錄02Linux運維
- Liunx常用運維命令整理記錄運維
- linux-把終端輸出記錄到文字Linux
- java呼叫DLL,呼叫北洋印表機列印二維碼標籤Java
- docker無法啟動排查運維記錄Docker運維
- Android低記憶體終止守護程式Android記憶體
- python:運用while迴圈怎麼進行終止PythonWhile
- Django搭建CMDB系統完整[14](運維記錄)Django運維
- Spring Security OAuth正式終止維護,已從官網下架SpringOAuth
- mysql運維利器percona-toolkit工具之pt-query-digestMySql運維
- 你可以終止 forEach 嗎?
- 終止非同步任務非同步
- 正常終止expdp作業
- CentOS 將於年底終止!CentOS
- 終止指定埠的程式
- 兩階段終止模式模式
- Linux終端記錄神器Linux
- MAC 終端 命令 記錄Mac
- 運維指令碼備忘錄運維指令碼
- 記錄 - 網站被 HACK!網站
- Windows 程式的建立和終止Windows
- AngularJS終止了長期支援AngularJS
- Docker 容器優雅終止方案Docker
- 3.6 延遲例項終止
- response()->download();下載中途終止
- Thread執行緒終止interruptthread執行緒
- 1_文字記錄說明
- 老凡的運維筆記 | 智慧化運維知多少?運維筆記
- iOS 稽核被拒記錄 2.5.1iOS
- AI,被“橫掃記錄”反噬?AI
- Activiti7 結束/終止流程
- Go的優雅終止姿勢Go
- Linux 下使用 killall 命令終止程式Linux
- 利器 | Terminal & Shell 改造記錄 Windows Terminal + Oh My ZSH + TmuxWindowsUX
- ES 筆記十:Query String & Simple Query String筆記
- 運維除錯記錄:Ubuntu14.04下部署Opendaylight Nitrogen叢集運維除錯Ubuntu
- ORACLE OGG運維及日常監控Oracle運維