The failcount problem with Linux-HA resources

Posted by babyyellow on 2012-09-19

Let's start with an example:

[root@node96 ~]# ps -ef |grep postgres
postgres  4076     1  0 17:17 ?        00:00:00 /data/postgresql-9.2.0/bin/postgres -D /usr/local/pgsql/data -c config_file=/usr/local/pgsql/data/postgresql.conf
postgres  4128  4076  0 17:17 ?        00:00:00 postgres: logger process                                                                                        
postgres  4130  4076  0 17:17 ?        00:00:00 postgres: checkpointer process                                                                                  
postgres  4131  4076  0 17:17 ?        00:00:00 postgres: writer process                                                                                        
postgres  4132  4076  0 17:17 ?        00:00:00 postgres: wal writer process                                                                                    
postgres  4133  4076  0 17:17 ?        00:00:00 postgres: autovacuum launcher process                                                                           
postgres  4134  4076  0 17:17 ?        00:00:00 postgres: archiver process                                                                                      
postgres  4135  4076  0 17:17 ?        00:00:00 postgres: stats collector process                                                                               
postgres  4229  4076  0 17:17 ?        00:00:00 postgres: wal sender process repl 192.168.11.95(35071) streaming 0/170131D8                                     
root      4462  4420  0 17:17 pts/4    00:00:00 su - postgres
postgres  4463  4462  0 17:17 pts/4    00:00:00 -bash
postgres  4493  4463  0 17:17 pts/4    00:00:00 psql
postgres  4494  4076  0 17:17 ?        00:00:00 postgres: postgres postgres [local] idle                                                                        
root      6115 20538  0 17:23 pts/2    00:00:00 grep postgres

We find the PostgreSQL postmaster PID: 4076.

First, look at the resource status:
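(The status listings in this post look like one-shot crm_mon output; an invocation that also prints inactive resources and fail counts would be something like the following, though the exact flags used are an assumption:)

crm_mon -1rf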

============
Last updated: Wed Sep 19 17:26:06 2012
Last change: Wed Sep 19 17:16:56 2012 via crmd on node95
Stack: openais
Current DC: node96 - partition with quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
2 Nodes configured, 2 expected votes
5 Resources configured.
============

Node node95: online
        fence_vm96      (stonith:fence_vmware) Started
Node node96: online
        ClusterIp       (ocf::heartbeat:IPaddr2) Started
        fence_vm95      (stonith:fence_vmware) Started
        ping    (ocf::pacemaker:ping) Started
        postgres_res    (ocf::heartbeat:pgsql) Started

Inactive resources:


Migration summary:
* Node node95:
* Node node96:

Then we kill -9 the database's main process.
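Using the postmaster PID found above:

kill -9 4076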

Then check the resource status again:

============
Last updated: Wed Sep 19 17:29:06 2012
Last change: Wed Sep 19 17:16:56 2012 via crmd on node95
Stack: openais
Current DC: node96 - partition with quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
2 Nodes configured, 2 expected votes
5 Resources configured.
============

Node node95: online
        fence_vm96      (stonith:fence_vmware) Started
Node node96: online
        ClusterIp       (ocf::heartbeat:IPaddr2) Started
        fence_vm95      (stonith:fence_vmware) Started
        ping    (ocf::pacemaker:ping) Started
        postgres_res    (ocf::heartbeat:pgsql) Started

Inactive resources:


Migration summary:
* Node node95:
* Node node96:
   postgres_res: migration-threshold=1000000 fail-count=1

Failed actions:
    postgres_res_monitor_30000 (node=node96, call=45, rc=7, status=complete): not running

The database resource was restarted on node96, and a failure record has appeared at the bottom of the output.


Take a look at the logs:

Sep 19 17:29:05 node96 pgsql(postgres_res)[7479]: ERROR: command failed: su postgres -c cd /usr/local/pgsql/data; kill -s 0 4076 >/dev/null 2>&1             /* the database health check failed
Sep 19 17:29:05 node96 pgsql(postgres_res)[7479]: INFO: PostgreSQL is down
Sep 19 17:29:05 node96 crmd[1925]:     info: process_lrm_event: LRM operation postgres_res_monitor_30000 (call=45, rc=7, cib-update=380, confirmed=false) not running
Sep 19 17:29:05 node96 crmd[1925]:     info: process_graph_event: Action postgres_res_monitor_30000 arrived after a completed transition
Sep 19 17:29:05 node96 crmd[1925]:     info: abort_transition_graph: process_graph_event:481 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=postgres_res_last_failure_0, magic=0:7;5:128:0:87f9cd86-767b-4162-a0c8-da0217d89baf, cib=0.115.12) : Inactive graph
Sep 19 17:29:05 node96 crmd[1925]:  warning: update_failcount: Updating failcount for postgres_res on node96 after failed monitor: rc=7 (update=value++, time=1348046945)   /* the failcount in the cluster state is updated

--------/ The lines below are the policy engine's (pengine's) information
Sep 19 17:29:05 node96 crmd[1925]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL rigin=abort_transition_graph ]
Sep 19 17:29:05 node96 attrd[1923]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-postgres_res (1)
Sep 19 17:29:05 node96 attrd[1923]:   notice: attrd_perform_update: Sent update 133: fail-count-postgres_res=1
Sep 19 17:29:05 node96 attrd[1923]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-postgres_res (1348046945)
Sep 19 17:29:05 node96 pengine[1924]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 19 17:29:05 node96 pengine[1924]:   notice: unpack_rsc_op: Operation monitor found resource postgres_res active on node95
Sep 19 17:29:05 node96 pengine[1924]:  warning: unpack_rsc_op: Processing failed op postgres_res_last_failure_0 on node96: not running (7)
Sep 19 17:29:05 node96 crmd[1925]:     info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-node96-fail-count-postgres_res, name=fail-count-postgres_res, value=1, magic=NA, cib=0.115.13) : Transient attribute: update
Sep 19 17:29:05 node96 attrd[1923]:   notice: attrd_perform_update: Sent update 135: last-failure-postgres_res=1348046945
Sep 19 17:29:05 node96 pengine[1924]:   notice: LogActions: Recover postgres_res#011(Started node96)
Sep 19 17:29:05 node96 crmd[1925]:     info: handle_response: pe_calc calculation pe_calc-dc-1348046945-250 is obsolete
Sep 19 17:29:05 node96 crmd[1925]:     info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-node96-last-failure-postgres_res, name=last-failure-postgres_res, value=1348046945, magic=NA, cib=0.115.14) : Transient attribute: update
Sep 19 17:29:05 node96 pengine[1924]:   notice: process_pe_message: Transition 129: PEngine Input stored in: /var/lib/pengine/pe-input-285.bz2
Sep 19 17:29:05 node96 pengine[1924]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 19 17:29:05 node96 pengine[1924]:   notice: unpack_rsc_op: Operation monitor found resource postgres_res active on node95
Sep 19 17:29:05 node96 pengine[1924]:  warning: unpack_rsc_op: Processing failed op postgres_res_last_failure_0 on node96: not running (7)
Sep 19 17:29:05 node96 pengine[1924]:   notice: common_apply_stickiness: postgres_res can fail 999999 more times on node96 before being forced off                   /* here the failcount is mentioned: 999999 more chances to fail remain

-----------/ The logs below show the service being restarted on the local node.
Sep 19 17:29:05 node96 pengine[1924]:   notice: LogActions: Recover postgres_res#011(Started node96)
Sep 19 17:29:05 node96 crmd[1925]:   notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE rigin=handle_response ]
Sep 19 17:29:05 node96 crmd[1925]:     info: do_te_invoke: Processing graph 130 (ref=pe_calc-dc-1348046945-251) derived from /var/lib/pengine/pe-input-286.bz2
Sep 19 17:29:05 node96 crmd[1925]:     info: te_rsc_command: Initiating action 6: stop postgres_res_stop_0 on node96 (local)
Sep 19 17:29:05 node96 lrmd: [1922]: info: cancel_op: operation monitor[45] on ocf::pgsql::postgres_res for client 1925, its parameters: CRM_meta_depth=[0] pgdba=[postgres] pgdb=[postgres] pgdata=[/usr/local/pgsql/data] config=[/usr/local/pgsql/data/postgresql.conf] depth=[0] psql=[/usr/local/pgsql/bin/psql] pgctl=[/usr/local/pgsql/bin/pg_ctl] start_opt=[] crm_feature_set=[3.0.6] CRM_meta_on_fail=[standby] CRM_meta_name=[monitor] CRM_meta_interval=[30000] CRM_meta_timeout=[30000]  cancelled
Sep 19 17:29:05 node96 lrmd: [1922]: info: rsc:postgres_res:46: stop
Sep 19 17:29:05 node96 crmd[1925]:     info: process_lrm_event: LRM operation postgres_res_monitor_30000 (call=45, status=1, cib-update=0, confirmed=true) Cancelled
Sep 19 17:29:05 node96 pengine[1924]:   notice: process_pe_message: Transition 130: PEngine Input stored in: /var/lib/pengine/pe-input-286.bz2
Sep 19 17:29:05 node96 pgsql(postgres_res)[7523]: ERROR: command failed: su postgres -c cd /usr/local/pgsql/data; kill -s 0 4076 >/dev/null 2>&1
Sep 19 17:29:05 node96 crmd[1925]:     info: process_lrm_event: LRM operation postgres_res_stop_0 (call=46, rc=0, cib-update=384, confirmed=true) ok
Sep 19 17:29:05 node96 crmd[1925]:     info: te_rsc_command: Initiating action 17: start postgres_res_start_0 on node96 (local)
Sep 19 17:29:05 node96 lrmd: [1922]: info: rsc:postgres_res:47: start
Sep 19 17:29:05 node96 pgsql(postgres_res)[7561]: ERROR: command failed: su postgres -c cd /usr/local/pgsql/data; kill -s 0 4076 >/dev/null 2>&1
Sep 19 17:29:05 node96 pgsql(postgres_res)[7561]: INFO: server starting
Sep 19 17:29:05 node96 pgsql(postgres_res)[7561]: INFO: PostgreSQL start command sent.
Sep 19 17:29:05 node96 pgsql(postgres_res)[7561]: WARNING: psql: could not connect to server: Connection refused Is the server running locally and accepting connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
Sep 19 17:29:05 node96 pgsql(postgres_res)[7561]: WARNING: PostgreSQL postgres isn't running
Sep 19 17:29:05 node96 pgsql(postgres_res)[7561]: WARNING: Connection error (connection to the server went bad and the session was not interactive) occurred while executing the psql command.
Sep 19 17:29:06 node96 pgsql(postgres_res)[7561]: INFO: PostgreSQL is started.


What we originally wanted was for the resource to fail over to the standby, but it did not. The cause is presumably this failcount at work.

Pacemaker has a parameter for this:
migration-threshold=N
Once a resource has failed N times on the local node, it is moved to the standby node, and it will never move back,
unless an administrator clears the failcount on that node.
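For instance, a minimal sketch of setting the threshold with crm_resource (the value 3 is purely illustrative; the Migration summary below shows this cluster still at 1000000, i.e. effectively never migrating):

crm_resource --meta -r postgres_res -p migration-threshold -v 3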

It is this part of the output:
Migration summary:
* Node node95:
* Node node96:
   postgres_res: migration-threshold=1000000 fail-count=1

The command is simple:

crm resource cleanup
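For this cluster, using the resource and node names from the status output above:

crm resource cleanup postgres_res node96

After the cleanup, the fail-count entry disappears from the Migration summary.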

There is also another parameter, failure-timeout=N:
the failure timeout, i.e. how long a recorded failure is remembered before it expires.
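A sketch of setting it, again with crm_resource (600 seconds is an arbitrary illustration):

crm_resource --meta -r postgres_res -p failure-timeout -v 600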

If both parameters are set, then once the migration-threshold is reached and the application has switched to the standby, after failure-timeout=N has elapsed the resource may switch back to the original primary. Whether it actually moves back is constrained by two things: the stickiness and the constraint scores.
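Stickiness is what normally holds the resource in place; a sketch of setting a cluster-wide default in crm shell syntax (100 is an arbitrary score):

crm configure rsc_defaults resource-stickiness=100

With a positive stickiness the resource prefers to stay where it is currently running; with 0 it is free to fail back once the failure record expires.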

Two unusual failure cases also come into play here: a start failure and a stop failure on the local node. A start failure immediately updates the failcount to INFINITY, which causes a switch to the standby node.
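The recorded failcount can be inspected directly (crm shell syntax; after a start failure it would be expected to read INFINITY):

crm resource failcount postgres_res show node96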

For a stop failure, if a fence device is enabled, the fence device carries out a fencing action and the resource is then switched to the standby.

If no fence device is enabled, the cluster will keep trying to stop the application over and over, and the application will never be switched to the standby.
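This per-operation behavior is governed by the on-fail attribute; a minimal sketch in crm shell syntax (resource parameters omitted for brevity, timeouts are illustrative, and on-fail=fence only makes sense when a working STONITH device is configured):

crm configure primitive postgres_res ocf:heartbeat:pgsql \
    op monitor interval=30s timeout=30s on-fail=restart \
    op stop interval=0 timeout=60s on-fail=fence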

From the ITPUB blog: http://blog.itpub.net/133735/viewspace-744252/. Please credit the source when reprinting.
