Nagios in Practice: Monitoring a Sharded MongoDB Cluster

Posted by mchdba on 2014-10-10


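1. Downloading the monitoring plugin
The MongoDB plugin can be fetched with: git clone git://github.com/mzupan/nagios-plugin-mongodb.git. I did not have Git installed at first, so a fellow user (草根) downloaded it for me, and I then uploaded it to a CSDN resource page. The new download address is: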
 

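2. Adding the new MongoDB check commands

The MongoDB service shares a physical machine with a MySQL replica. Basic Nagios monitoring and MySQL service monitoring were already in place on that host, so only the MongoDB commands and services need to be added on top. For Nagios monitoring of MySQL, see http://blog.itpub.net/26230597/viewspace-760141/ and http://blog.itpub.net/26230597/viewspace-1217246/. The MongoDB check commands to add are as follows: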

[root@wgq objects]# cd /usr/local/nagios/etc/objects
[root@wgq objects]# vim commands.cfg
define command {
    command_name check_mongodb
    command_line $USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ -A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$
}

define command {
    command_name check_mongodb_database
    command_line $USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ -A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$ -d $ARG5$
}

define command {
    command_name check_mongodb_collection
    command_line $USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ -A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$ -d $ARG5$ -c $ARG6$
}

define command {
    command_name check_mongodb_replicaset
    command_line $USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ -A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$ -r $ARG5$
}

define command {
    command_name check_mongodb_query
    command_line $USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ -A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$ -q $ARG5$
}
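To see how a service's check_command maps onto these definitions, here is a minimal Python sketch of the macro substitution Nagios performs: the text before the first `!` selects the command, and the remaining `!`-separated fields fill `$ARG1$`, `$ARG2$`, and so on. The helper name `expand_check_command`, the host address 10.0.0.5, and the plugin directory /usr/local/nagios/libexec are assumptions made for illustration; this is not Nagios's own code.

```python
import re

# Command templates copied from commands.cfg above (only two shown).
COMMANDS = {
    "check_mongodb": (
        "$USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ "
        "-A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$"
    ),
    "check_mongodb_database": (
        "$USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ "
        "-A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$ -d $ARG5$"
    ),
}


def expand_check_command(check_command, host_address, user1):
    """Expand 'name!arg1!arg2...' into a full command line (illustrative only)."""
    name, *args = check_command.split("!")
    macros = {"USER1": user1, "HOSTADDRESS": host_address}
    for i, value in enumerate(args, start=1):
        macros["ARG%d" % i] = value
    # Replace each $NAME$ macro; in this sketch unknown macros become empty.
    return re.sub(r"\$(\w+)\$", lambda m: macros.get(m.group(1), ""), COMMANDS[name])


# Hypothetical host address and plugin directory.
print(expand_check_command("check_mongodb!connect!30000!2!5",
                           "10.0.0.5", "/usr/local/nagios/libexec"))
```

Reading the expanded line makes it easy to check that each `!` field lands on the intended plugin option (-A action, -P port, -W warning, -C critical).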

3. Adding the MongoDB monitoring services
The MongoDB services also have to be added separately, as follows:

# Check MongoDB connection time: warning above 2 seconds, critical above 5 seconds
define service{
        host_name dbm1slave1
        service_description Mongo Connect Check
        check_command check_mongodb!connect!30000!2!5
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check the percentage of MongoDB connections in use: warning at 70%, critical at 80%
define service{
        host_name dbm1slave1
        service_description Mongo Free Connections
        check_command check_mongodb!connections!27017!70!80
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check MongoDB replication lag so that the primary and the secondaries stay in sync: warning above 15 seconds, critical above 30 seconds
define service{
        host_name dbm1slave1
        service_description Mongo Replication Lag
        check_command check_mongodb!replication_lag!27017!15!30
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check MongoDB memory usage; the thresholds (in GB) depend on the total memory of the machine running MongoDB
define service{
        host_name dbm1slave1
        service_description Mongo Memory Usage
        check_command check_mongodb!memory!27017!20!28
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check MongoDB mapped memory usage; the thresholds depend on the total memory of the machine running MongoDB
define service{
        host_name dbm1slave1
        service_description Mongo Mapped Memory Usage
        check_command check_mongodb!memory_mapped!27017!20!28
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check the lock time percentage: warning when lock time reaches 5% of MongoDB's run time, critical above 10%
define service{
        host_name dbm1slave1
        service_description Mongo Lock Percentage
        check_command check_mongodb!lock!27017!5!10
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check Average Flush Time: average flush time of the MongoDB server; warning above 100 ms, critical above 200 ms
define service{
        host_name dbm1slave1
        service_description Mongo Flush Average
        check_command check_mongodb!flushing!27017!100!200
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check Last Flush Time: duration of the most recent flush; warning above 200 ms, critical above 400 ms
define service{
        host_name dbm1slave1
        service_description Mongo Last Flush Time
        check_command check_mongodb!last_flush_time!27017!200!400
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check status of mongodb replicaset: replication state of this member
define service{
        host_name dbm1slave1
        service_description MongoDB state
        check_command check_mongodb!replset_state!27017!0!0
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check status of index miss ratio: ratio of index misses (not the hit rate)
define service{
        host_name dbm1slave1
        service_description MongoDB Index Miss Ratio
        check_command check_mongodb!index_miss_ratio!27017!.005!.01
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check number of databases and number of collections
define service{
        host_name dbm1slave1
        service_description MongoDB Number of databases
        check_command check_mongodb!databases!27017!300!500
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

define service{
        host_name dbm1slave1
        service_description MongoDB Number of collections
        check_command check_mongodb!collections!27017!300!500
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check size of a database
define service{
        host_name dbm1slave1
        service_description MongoDB Database size your-database
        check_command check_mongodb_database!database_size!27017!300!500!your-database
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check index size of a database
define service{
        host_name dbm1slave1
        service_description MongoDB Database index size your-database
        check_command check_mongodb_database!database_indexes!27017!50!100!your-database
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check index size of a collection
define service{
        host_name dbm1slave1
        service_description MongoDB Collection index size your-collection
        check_command check_mongodb_collection!collection_indexes!27017!50!100!your-database!your-collection
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check the primary server of replicaset
define service{
        host_name dbm1slave1
        service_description MongoDB Replicaset Master Monitor: your-replicaset
        check_command check_mongodb_replicaset!replica_primary!27017!0!1!your-replicaset
        # Example: check_command check_mongodb_replicaset!replica_primary!27017!0!1!shard2
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check the number of queries per second (here: update operations); warning at 150, critical at 200
define service{
        host_name dbm1slave1
        service_description MongoDB Updates per Second
        check_command check_mongodb_query!queries_per_second!27017!150!200!update
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check Primary Connection: connection time to the replica set primary; warning above 2 seconds, critical above 4 seconds
define service{
        host_name dbm1slave1
        service_description Mongo Primary Connect Check
        check_command check_mongodb!connect_primary!27017!2!4
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }

# Check Collection State: run against every host in the MongoDB server group to verify the availability of an important collection (locks, timeouts, server configuration); an alert is raised as soon as the test query fails.
define service{
        host_name dbm1slave1
        service_description Mongo Collection State
        check_command check_mongodb_collection!collection_state!27017!0!0!your-database!your-collection
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
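Each service above hands a warning and a critical threshold to the plugin, and Nagios turns the plugin's exit code into a state: 0 is OK, 1 is WARNING, 2 is CRITICAL, 3 is UNKNOWN. The sketch below shows the usual "larger is worse" comparison under that convention. `classify` is a hypothetical helper written for this illustration, not the plugin's actual code.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(value, warning, critical):
    """Return a Nagios exit code for a metric where larger values are worse."""
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


# Connect check from the first service: warning at 2 s, critical at 5 s.
for seconds in (0.3, 2.5, 6.0):
    print(seconds, classify(seconds, warning=2, critical=5))
```

With this comparison order, a warning threshold set above the critical threshold would never fire, so it is worth keeping warning below critical in every check_command.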

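4. Viewing some of the monitoring results

Once the services are configured on the Nagios side, restart Nagios with service nagios restart. After a few minutes the Nagios web interface shows the complete set of MongoDB service checks.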




5. Determining the MongoDB architecture from ps

[root@db-m1-slave-1 ~]# ps -eaf|grep mongo

mongodb   2457     1  0  2013 ?        2-03:39:08 ./mongod --configsvr --dbpath /home/data/mongodb/config --port 20000 --logpath /home/data/mongodb/config.log --logappend --fork

mongodb   2804     1  0  2013 ?        1-10:02:33 mongos --configdb 192.168.12.62:20000,192.168.12.63:20000,192.168.12.72:20000 --port 30000 --chunkSize 64 --logpath /home/data/mongodb/mongos.log --logappend --fork

mongodb   3072     1  0  2013 ?        1-10:17:20 mongod --shardsvr --replSet shard1 --port 27017 --dbpath /home/data/mongodb/shard11 --oplogSize 2048 --logpath /home/data/mongodb/shard11.log --logappend --fork

root     11179  9391  0 11:14 pts/1    00:00:00 grep mongo

mongodb  30414     1  0 Feb14 ?        1-06:20:50 mongod --shardsvr --replSet shard2 --port 27018 --dbpath /home/data/mongodb/shard21 --oplogSize 2048 --logpath /home/data/mongodb/shard21.log --logappend --fork

[root@db-m1-slave-1 ~]#

 

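There are four mongo processes:

a)         The process started with "--configdb" is the cluster's entry point.

b)         Shard Server: a process started with "--shardsvr --replSet" is a member of one shard's replica set and stores the actual data chunks; here these are the MongoDB instances on ports 27017 and 27018. To find out whether the instance on port 27017 is the primary or a secondary, log in to port 27017 and run rs.status().

c)         Config Server: the process started with "--configsvr" stores the metadata for the whole cluster, including chunk information; here it is the MongoDB instance on port 20000.

d)         Route Server: the process started as "mongos --configdb" is the front-end router. Clients connect through it, and it makes the whole cluster look like a single database that applications can use transparently; here it is the MongoDB instance on port 30000.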
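The classification above can be sketched in a few lines of Python that look only at the start-up flags. The function names `mongo_role` and `option_value` are made up for this example, and the command lines are shortened copies of the ps output shown earlier.

```python
def mongo_role(cmdline):
    """Guess the cluster role of a mongo process from its command line."""
    args = cmdline.split()
    if "--configsvr" in args:
        return "config server"
    if "--shardsvr" in args:
        return "shard server"
    if "--configdb" in args:
        return "route server (mongos)"
    return "unknown"


def option_value(cmdline, flag):
    """Return the argument that follows `flag`, or None if the flag is absent."""
    args = cmdline.split()
    if flag in args and args.index(flag) + 1 < len(args):
        return args[args.index(flag) + 1]
    return None


# Command lines abbreviated from the ps output above.
processes = [
    "./mongod --configsvr --dbpath /home/data/mongodb/config --port 20000",
    "mongos --configdb 192.168.12.62:20000,192.168.12.63:20000,192.168.12.72:20000 --port 30000",
    "mongod --shardsvr --replSet shard1 --port 27017",
    "mongod --shardsvr --replSet shard2 --port 27018",
]
for cmd in processes:
    print(option_value(cmd, "--port"), mongo_role(cmd), option_value(cmd, "--replSet"))
```

Knowing which port plays which role is what decides the port passed to each check: the connect check above goes through the mongos on 30000, while the replica set checks target a shard member on 27017.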


6. Errors encountered during debugging

Error 1

[root@wgq nagios ~]# tail -f /usr/local/nagios/var/nagios.log

[1412819956] Warning: Return code of 13 for check of service 'Mongo Memory Usage' on host 'dbm1slave1' was out of bounds.

[1412819956] SERVICE ALERT: dbm1slave1;Mongo Memory Usage;CRITICAL;SOFT;1;(Return code of 13 is out of bounds)

[1412819975] Warning: Return code of 13 for check of service 'Mongodb Connect Check' on host 'dbm1slave1' was out of bounds.

[1412819975] SERVICE ALERT: dbm1slave1;Mongodb Connect Check;CRITICAL;SOFT;1;(Return code of 13 is out of bounds)

[1412820058] Warning: Return code of 13 for check of service 'Mongo Free Connections' on host 'dbm1slave1' was out of bounds.

 

The nagios user needs to own the plugin file and have execute permission on it:

chmod 770 /usr/lib/nagios/plugins/check_mongodb.py

chown -R nagios.nagios /usr/lib/nagios/plugins/check_mongodb.py
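On Linux, 13 is also the numeric value of EACCES (permission denied), which is at least consistent with the permission fix above. As a rough way to confirm that a plugin file carries the execute bits that mode 770 sets, the following Python sketch inspects the file mode. It runs against a throwaway temporary file rather than the real plugin, and `is_executable_by_owner_and_group` is a name invented for this example.

```python
import os
import stat
import tempfile


def is_executable_by_owner_and_group(path):
    """True if both the owner and group execute bits are set (as with mode 770)."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IXUSR) and bool(mode & stat.S_IXGRP)


# Demonstrate on a temporary file instead of the real check_mongodb.py.
with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
    demo_path = tmp.name

os.chmod(demo_path, 0o640)          # no execute bits, like a freshly copied file
before = is_executable_by_owner_and_group(demo_path)
os.chmod(demo_path, 0o770)          # same mode as the chmod command above
after = is_executable_by_owner_and_group(demo_path)
os.remove(demo_path)
print(before, after)
```

Pointing the function at the path actually used in command_line shows whether the fix was applied to the file Nagios executes.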

 

Error 2

The Status Information column of the monitoring page shows the error No module named pymongo:

This message means the pymongo module is missing. Install it with easy_install pymongo, as shown below:

[root@wgq objects]# easy_install pymongo

Searching for pymongo

Reading

Best match: pymongo 2.7.2

......

zip_safe flag not set; analyzing archive contents...

Adding pymongo 2.7.2 to easy-install.pth file

 

Installed /usr/lib/python2.6/site-packages/pymongo-2.7.2-py2.6-linux-x86_64.egg

Processing dependencies for pymongo

Finished processing dependencies for pymongo
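After installing, it can help to confirm that the interpreter which runs the plugin can actually find the module. The sketch below is a generic check written for Python 3.4 or later (the system in this post runs Python 2.6, where `importlib.util.find_spec` is not available); `module_available` is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be located by this interpreter's import system."""
    return importlib.util.find_spec(name) is not None


if module_available("pymongo"):
    print("pymongo is importable")
else:
    print("pymongo is missing for this interpreter; the plugin would fail to import it")
```

If several Python versions are installed, the check should be run with the same interpreter that executes check_mongodb.py, since each one has its own site-packages directory.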

 


Source: ITPUB blog, link: http://blog.itpub.net/26230597/viewspace-1293589/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
