Managing Processes with Monit on Linux: An Overview

Posted by 散盡浮華 on 2017-01-20

 

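For operations staff, Monit is an invaluable tool. It is a feature-rich utility for monitoring processes, files, directories and devices on Unix platforms. It can automatically repair processes that have stopped running, which makes it especially well suited to handling software failures that have many possible causes.
Monit works well for local monitoring and can also monitor remote services. With a little setup, a monitored service process is restarted promptly after it goes down, with no manual intervention needed. It is an excellent system monitoring tool.
Consider the following two scenarios:
1) Repeated email reminders
By default, if a service goes down, Monit sends only one alert email no matter how long the service stays down. The next email arrives when the service recovers.
If you want to keep receiving alerts over multiple cycles even though the service state has not changed (it is still down), add this statement:
alert foo@bar with reminder on 10 cycles
This sends a reminder every 10 cycles for as long as the failure persists.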
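As a rough worked example, assuming a reminder is sent once every N cycles while the failure persists, the time between two reminder mails is the poll interval multiplied by N. The helper name reminder_interval is hypothetical, and the 120-second poll interval is borrowed from the "set daemon 120" line used later in this article.

```shell
#!/bin/bash
# Hypothetical helper: seconds between two reminder mails,
# assuming one reminder is sent every <cycles> poll cycles.
reminder_interval() {
    local poll_seconds=$1   # value given to "set daemon"
    local cycles=$2         # value given to "with reminder on N cycles"
    echo $(( poll_seconds * cycles ))
}

# 120-second polling with a reminder on 10 cycles: prints 1200 (20 minutes)
reminder_interval 120 10
```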

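2) Reducing false alarms
Monit sometimes raises false alarms, as any monitoring software does. Most are caused by poor network conditions. For example, if Monit finds a service stopped and the service comes back quickly, there is no need for one email after another. Configure it like this:
if failed host 172.16.5.1 port 8599 for 3 times within 4 cycles then alert
This means: send an email alert only if port 8599 (my eMule port) is unreachable in three out of four cycles.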
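To make the rule concrete, here is a small simulation of "for 3 times within 4 cycles". It is only a sketch of the counting logic as described above, not Monit's implementation; the function name first_alert_cycle and the 1/0 encoding of check results are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch of the "for X times within Y cycles" rule (not Monit's own code).
# Arguments: X, Y, then one result per cycle (1 = check failed, 0 = check passed).
# Prints the 1-based cycle at which the alert would first fire, or "none".
first_alert_cycle() {
    local need=$1 window=$2
    shift 2
    local results=("$@")
    local i j failures start
    for (( i = 0; i < ${#results[@]}; i++ )); do
        failures=0
        # The window covers the current cycle and the (Y - 1) cycles before it.
        start=$(( i - window + 1 ))
        if (( start < 0 )); then start=0; fi
        for (( j = start; j <= i; j++ )); do
            failures=$(( failures + results[j] ))
        done
        if (( failures >= need )); then
            echo $(( i + 1 ))
            return 0
        fi
    done
    echo "none"
}

# failed, ok, failed, failed, ok: prints 4
first_alert_cycle 3 4 1 0 1 1 0
```

With isolated failures such as 1 0 0 0 1 0 0 1, no four-cycle window ever contains three failures, so the same call prints "none" and no email would be sent.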

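Without further ado, here is a walkthrough of deploying a Monit monitoring environment.

Background:
As the number of production servers grows and open-source software and tools are used more widely, services stopping on their own or becoming unresponsive happens from time to time. A large share of these incidents are caused by instability in the software itself or by limits on the machine's hardware resources. In principle, the root cause of each should be found and prevented from recurring. In reality, much software is not as stable as it should be, and upgrading hardware costs money, so in a clustered environment with redundancy, simply restarting the service becomes the most practical choice. This is not difficult, and there are many ways to do it, such as adding an Action or Command to Zabbix or Nagios alerts, or running your own script from cron. The tool introduced below, Monit, is built specifically for this job. Its main strengths are a simple, readable configuration file, support for monitoring both processes and system state, and flexible checks with configurable methods and intervals, followed by alerting and responses (restarting a service, running a command, and so on).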

Deploying Monit on CentOS 6:
1) Install the EPEL repository
[root@bastion-IDC src]# wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm
[root@bastion-IDC src]# rpm -ivh epel-release-latest-6.noarch.rpm

2) Install monit
[root@bastion-IDC src]# yum install -y monit

--------------------------------------------------------------------------------------------------------------
Alternatively, download a prebuilt binary package: https://mmonit.com/monit/dist/binary/
# wget https://mmonit.com/monit/dist/binary/5.20.0/monit-5.20.0-linux-x86.tar.gz
# tar -zvxf monit-5.20.0-linux-x86.tar.gz
# mv monit-5.20.0 /usr/local/monit
# cp /usr/local/monit/conf/monitrc /etc/
Then edit the configuration file /etc/monitrc. The configuration file is monit.conf on CentOS 6 and monitrc on CentOS 7.
-------------------------------------------------------------------------------------------------------------

3) Monit configuration (official documentation: https://mmonit.com/monit/documentation/monit.html)
The Monit configuration file is /etc/monit.conf. Back up the default file first, then write your own configuration.
[root@bastion-IDC src]# cp /etc/monit.conf /etc/monit.conf.bak
[root@bastion-IDC src]# cat /etc/monit.conf         # custom configuration follows
set daemon 120              # Poll at 2-minute intervals (the value is in seconds); Monit does not monitor in real time
set logfile /home/monit/log/monit.log                  # Monit's log file
set alert zhouwei@chinabank.com.cn with reminder on 1 cycle      # Send alerts to this address and repeat the reminder on every cycle while the failure persists. Add one line per address for multiple recipients; the "with ..." part is optional.
#set mailserver mail.tildeslash.com, mail.foo.bar port 10025, localhost with timeout 15 seconds
set mailserver 10.10.9.109             # mail server
set httpd port 2812 and use address 10.10.8.2            # port and IP address of the HTTP status page
allow localhost           # Allow localhost to connect
allow 10.10.8.0/24            # Allow this subnet to connect
allow admin:nishiwode      # Allow Basic Auth: user name and password
# all system: load average, memory usage, CPU usage
check system 10.10.8.2
   if loadavg (1min) > 4 then alert
   if loadavg (5min) > 2 then alert
   if memory usage > 75% then alert
   if cpu usage (user) > 70% then alert
   if cpu usage (system) > 30% then alert
   if cpu usage (wait) > 20% then alert
# all disk: disk space usage
check device data with path /dev/sda2
   if space usage > 90% then alert
   if inode usage > 85% then alert
check device home with path /dev/sda3
   if space usage > 85% for 5 cycles then alert      # alert if space usage stays above 85% for 5 consecutive cycles
   if inode usage > 85% for 5 cycles then alert
# all rsync
#10.10.8.2
check process sshd with pidfile /var/run/sshd.pid        # monitor the SSH service
   start program "/etc/init.d/sshd start"
   stop program "/etc/init.d/sshd stop"
   if failed host 127.0.0.1 port 22 protocol ssh then restart
   if 3 restarts within 5 cycles then timeout              # after 3 restarts within 5 cycles, time out and stop monitoring this service

check process httpd with pidfile /var/run/httpd.pid       # monitor the HTTP service
   start program = "/etc/init.d/httpd start"
   stop program = "/etc/init.d/httpd stop"
   if failed host 127.0.0.1 port 80 protocol http then restart
   if 5 restarts within 5 cycles then timeout

check process web_lb with pidfile /data/v20/server/web_lb/httpd.pid     # monitor a custom service
   start program = "/data/v20/bin/lb.sh"                 # start script
   stop program = "/data/v20/bin/lb_stop.sh"         # stop script
   if failed host 10.10.8.2 port 16101 proto http then restart
   if failed host 10.10.8.2 port 16101 proto http for 5 times within 5 cycles then exec "/data/v20/bin/lb_pay.sh"
   if failed host 10.10.8.2 port 16102 type TCPSSL proto http then restart
   if failed host 10.10.8.2 port 16102 type TCPSSL proto http for 5 times within 5 cycles then exec "/data/v20/bin/lb_pay.sh"

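4) Starting Monit (the production configuration below sets Monit's HTTP port to 30000; the packaged default shown later in this article is 2812. It is best to map the local hostname to 127.0.0.1 in /etc/hosts.)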
[root@bastion-IDC src]# /etc/init.d/monit start/stop/reload/status/restart

[root@bastion-IDC ~]# monit -t                     # check that the Monit configuration is valid
[root@bastion-IDC ~]# monit reload              # reload the Monit configuration
[root@bastion-IDC ~]# monit status             # show the state of the monitored processes

If starting Monit produces the following error:

Cannot translate 'huanqiu_web2' to FQDN name -- Name or service not known
Generated unique Monit id af76cbce671f323782e09e0d114857fd and stored to '/root/.monit.id'
Reinitializing monit daemon
No daemon process found

Solution:
Add a host mapping to the local /etc/hosts, for example:
127.0.0.1 huanqiu_web2
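The mapping can also be added by a small script. The following is a minimal sketch; the function name ensure_hosts_entry is hypothetical, and the hosts file path is a parameter so that the function can be tried on a copy before touching /etc/hosts.

```shell
#!/bin/bash
# Sketch: make sure a hosts file maps a host name to 127.0.0.1.
# ensure_hosts_entry is a hypothetical helper, not part of Monit.
ensure_hosts_entry() {
    local name=$1 hosts_file=${2:-/etc/hosts}
    # Look for the name as a whole field (after the address) on any non-comment line.
    if awk -v n="$name" '
            !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file"; then
        echo "already mapped: $name"
    else
        printf '127.0.0.1 %s\n' "$name" >> "$hosts_file"
        echo "added: $name"
    fi
}

# Example, run as root:
#   ensure_hosts_entry "$(hostname)" /etc/hosts
```

Running it a second time leaves the file unchanged, so it is safe to call from a provisioning script.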

---------------------------------------------A configuration used in production------------------------------------------------

[root@huanqiu_web1 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1 huanqiu_web1
 
[root@huanqiu_web1 ~]# cat /etc/monit.conf
set daemon 30
set logfile syslog facility log_daemon
set pidfile /var/run/monit.pid
set httpd port 30000
use address 127.0.0.1
allow 127.0.0.1
 
check process nginx with pidfile /Data/app/nginx/logs/nginx.pid
        start program = "/Data/app/nginx/sbin/nginx"
        stop program = "/Data/app/nginx/sbin/nginx -s stop"
 
check process php-fpm with pidfile /Data/app/php5.6.26/var/run/php-fpm.pid 
       start program = "/Data/app/php5.6.26/sbin/php-fpm"
       stop program = "/bin/bash -c 'kill -s SIGTERM `ps -ef|grep /Data/app/php5.6.26/etc/php-fpm.conf|grep -v grep|awk -F" " '{print $2}'`'"

check process mysql with pidfile /Data/app/mysql5.1.57/var/dev-new-test.pid
       start program = "/Data/app/mysql5.1.57/bin/mysqld_safe --defaults-file=/Data/app/mysql5.1.57/my.cnf &"
       stop program = "/bin/bash -c 'kill -s SIGTERM `ps -ef|grep mysqld_safe|grep -v grep|awk -F" " '{print $2}'`'"

check process tomcat-7-admin-wls matching "/Data/app/tomcat-7-wls/conf"
        start program = "/Data/app/tomcat-7-wls/bin/startup.sh"
        stop program = "/bin/bash -c 'kill -s SIGTERM `ps -ef|grep /Data/app/tomcat-7-wls/conf|grep -v grep|awk -F" " '{print $2}'`'"
 
check process tomcat-7-wls matching "/Data/app/tomcat-7-wls/conf"
        start program = "/Data/app/tomcat-7-wls/bin/startup.sh"
        stop program = "/bin/bash -c 'kill -s SIGTERM `ps -ef|grep /Data/app/tomcat-7-wls/conf|grep -v grep|awk -F" " '{print $2}'`'"
 
check process tomcat-7 matching "/Data/app/tomcat-7/conf"
        start program = "/Data/app/tomcat-7/bin/startup.sh"
        stop program = "/bin/bash -c 'kill -s SIGTERM `ps -ef|grep /Data/app/tomcat-7/conf|grep -v grep|awk -F" " '{print $2}'`'"

check process tomcat-7-banshanbandao matching "/Data/app/tomcat-7-banshanbandao/conf"
        start program = "/Data/app/tomcat-7-banshanbandao/bin/startup.sh"
        stop program = "/bin/bash -c 'kill -s SIGTERM `ps -ef|grep /Data/app/tomcat-7-banshanbandao/conf|grep -v grep|awk -F" " '{print $2}'`'"

check process vpn matching "/etc/vpnc/vpnc-script"
        start program = "/bin/sh /bin/vpn_start"
        stop program = "/bin/bash -c 'kill -s SIGTERM `ps -ef|grep vpnc-script|grep -v grep|awk -F" " '{print $2}'`'"

[root@huanqiu_web1 ~]# monit -t
Control file syntax OK
[root@huanqiu_web1 ~]# /etc/init.d/monit start
Starting monit:                                            [  OK  ]
 
[root@huanqiu_web1 ~]# lsof -i:30000
COMMAND  PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
monit   6109 root    5u  IPv4 2438183462      0t0  TCP localhost:30000 (LISTEN)

[root@huanqiu_web1 ~]# monit reload
Reinitializing monit daemon
 
[root@huanqiu_web1 ~]# monit status
The Monit daemon 5.14 uptime: 8m 

Process 'nginx'
  status                            Running
  monitoring status                 Monitored
  pid                               499
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            28d 20h 17m 
  children                          8
  memory                            19.6 MB
  memory total                      381.6 MB
  memory percent                    0.0%
  memory percent total              0.5%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 22 Mar 2017 11:32:42

Process 'php-fpm'
  status                            Running
  monitoring status                 Monitored
  pid                               3153
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            43d 19h 26m 
  children                          16
  memory                            8.7 MB
  memory total                      352.3 MB
  memory percent                    0.0%
  memory percent total              0.5%
  cpu percent                       0.0%
  cpu percent total                 0.1%
  data collected                    Wed, 22 Mar 2017 11:32:42

Process 'mysql'
  status                            Running
  monitoring status                 Monitored
  pid                               46403
  parent pid                        46254
  uid                               500
  effective uid                     500
  gid                               500
  uptime                            93d 0h 34m 
  children                          0
  memory                            317.8 MB
  memory total                      317.8 MB
  memory percent                    0.4%
  memory percent total              0.4%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 22 Mar 2017 11:32:42

Process 'tomcat-7-admin-wls'
  status                            Running
  monitoring status                 Monitored
  pid                               34188
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            4d 19h 15m 
  children                          0
  memory                            803.6 MB
  memory total                      803.6 MB
  memory percent                    1.2%
  memory percent total              1.2%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 22 Mar 2017 11:32:42

Process 'tomcat-7-wls'
  status                            Running
  monitoring status                 Monitored
  pid                               34188
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            4d 19h 15m 
  children                          0
  memory                            803.6 MB
  memory total                      803.6 MB
  memory percent                    1.2%
  memory percent total              1.2%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 22 Mar 2017 11:32:42

Process 'tomcat-7'
  status                            Running
  monitoring status                 Monitored
  pid                               14524
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            5d 21h 43m 
  children                          0
  memory                            581.2 MB
  memory total                      581.2 MB
  memory percent                    0.9%
  memory percent total              0.9%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 22 Mar 2017 11:32:42

Process 'tomcat-7-banshanbandao'
  status                            Running
  monitoring status                 Monitored
  pid                               29217
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            117d 0h 35m 
  children                          0
  memory                            1.4 GB
  memory total                      1.4 GB
  memory percent                    2.1%
  memory percent total              2.1%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 22 Mar 2017 11:32:42

Process 'vpn'
  status                            Running
  monitoring status                 Monitored
  pid                               13774
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            1h 36m 
  children                          0
  memory                            2.4 MB
  memory total                      2.4 MB
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 22 Mar 2017 11:32:42

System 'huanqiu_web1'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.00] [0.04] [0.09]
  cpu                               0.6%us 0.1%sy 0.0%wa
  memory usage                      5.1 GB [8.0%]
  swap usage                        0 B [0.0%]
  data collected                    Wed, 22 Mar 2017 11:32:42
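The stop program lines in the configuration above locate the PID with ps, grep and awk. Where a service writes a pidfile, a small helper script can do the same job more predictably. This is a minimal sketch: the function name stop_by_pidfile and the 10-second default grace period are assumptions, not part of the original setup.

```shell
#!/bin/bash
# Hypothetical helper: send SIGTERM to the process recorded in a pidfile
# and wait up to <grace> seconds for it to exit.
stop_by_pidfile() {
    local pidfile=$1 grace=${2:-10}
    local pid waited=0
    [ -r "$pidfile" ] || { echo "no pidfile: $pidfile" >&2; return 1; }
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests whether the process exists.
    kill -0 "$pid" 2>/dev/null || { echo "not running: $pid" >&2; return 1; }
    kill -s TERM "$pid"
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            echo "still running after ${grace}s: $pid" >&2
            return 1
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    echo "stopped: $pid"
}

# Example: stop_by_pidfile /Data/app/nginx/logs/nginx.pid
```

Saved as a script, it could be referenced from a stop program line in place of the inline ps pipeline.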

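5) How Monit identifies the processes it monitors
a) By the process's pid file: with pidfile
b) By matching a keyword in the process's command line: matching. You can run "monit procmatch <pattern>" on the command line to find a pattern that matches only the intended process.
Whether it is the PID in a pid file or a keyword, it must be unique. If the matching pattern is not unique, monitoring will not work as intended.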
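A quick way to see whether a candidate keyword is unique is to count the command lines that contain it. The sketch below reads one command line per input line; the function name count_matches is hypothetical, and it uses fixed-string matching for simplicity, which is an assumption that differs from regular-expression matching.

```shell
#!/bin/bash
# Sketch: count how many command lines (one per input line) contain a pattern.
# count_matches is a hypothetical helper; -F treats the pattern as a fixed string.
count_matches() {
    local pattern=$1
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's status at zero.
    grep -c -F -- "$pattern" || true
}

# Live usage (ps output is captured first so that grep cannot count itself):
#   procs=$(ps -eo args)
#   printf '%s\n' "$procs" | count_matches "/Data/app/tomcat-7/conf"
```

A result of 1 means the keyword identifies a single process. Note that a short keyword such as /Data/app/tomcat-7 would also match /Data/app/tomcat-7-wls, which is why the configurations in this article include the trailing /conf.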

Here are several monitoring entries commonly used in day-to-day work:
[root@bastion-IDC ~]# cat /etc/monit.conf

set daemon 30
set logfile syslog facility log_daemon
set pidfile /var/run/monit.pid
set httpd port 30000
use address 127.0.0.1
allow 127.0.0.1
............
check process nginx with pidfile /usr/local/nginx/logs/nginx.pid
        start program = "/usr/local/nginx/sbin/nginx"
        stop program = "/usr/local/nginx/sbin/nginx -s stop"

check process nginx with pidfile /webserver/nginx/run/nginx.pid 
        start program = "/webserver/init.d/nginx start" with timeout 10 seconds 
        stop program = "/webserver/init.d/nginx stop" 
        if failed host heylinux.com port 80 protocol http with timeout 10 seconds then restart 
        if 3 restarts within 5 cycles then timeout group webserver

check process php-fpm with pidfile /var/run/php-fpm/php-fpm.pid
       start program = "/etc/init.d/php-fpm start"
       stop program = "/etc/init.d/php-fpm stop"

check process mysqld  with pidfile "/letv/mysql2/data/cdn.oss.letv.com.pid"
       start program = "/etc/init.d/mysqld start"
       stop program = "/etc/init.d/mysqld stop"
       if failed host 127.0.0.1 port 3306 then restart

check process mysql with pidfile /webserver/mysql/run/mysqld.pid 
       start program = "/webserver/init.d/mysqld start" with timeout 10 seconds 
       stop program = "/webserver/init.d/mysqld stop" 
       if failed port 3307 protocol mysql with timeout 10 seconds then restart 
       if 3 restarts within 5 cycles then timeout group webserver

check process memcached with pidfile "/var/run/memcached/memcached.pid"
       start program = "/etc/init.d/memcached start"
       stop program = "/etc/init.d/memcached stop"
       if failed host 127.0.0.1 port 11211 protocol memcache then restart

check process zabbix    with pidfile "/usr/local/zabbix/zabbix_agentd.pid"
        start program = "/usr/local/zabbix/sbin/zabbix_agentd -c /usr/local/zabbix/conf/zabbix_agentd.conf"
        stop program = "/bin/bash -c 'kill -s SIGTERM `cat /usr/local/zabbix/zabbix_agentd.pid`'"
        if failed host 127.0.0.1 port 10050 type tcp 2 times within 2 cycles then restart

check process httpd
        with pidfile "/usr/local/apache/logs/httpd.pid"
        start program = "/usr/local/apache/bin/httpd -k start"
        stop  program = "/bin/bash -c 'kill -s SIGTERM `cat /usr/local/apache/logs/httpd.pid`'"

check process redis
        with pidfile "/var/run/redis.pid"
        start program = "/usr/local/bin/redis-server /letv/uss/redis/redis.conf"
        stop  program = "/bin/bash -c 'kill -s SIGTERM `cat /var/run/redis.pid`'"

check process rsync with pidfile "/var/run/rsyncd.pid"
        start program = "/usr/bin/rsync --daemon"
        stop program = "/bin/bash -c 'kill -s SIGTERM `cat /var/run/rsyncd.pid`'"

check process pytask.py matching "/letv/p2sp/offline/pytask.py"
        start program = "/usr/bin/python /letv/p2sp/offline/pytask.py"
        stop program = "/bin/bash -c 'kill -s SIGTERM `ps -ef|grep offline/pytask.py|grep -v grep|awk -F" " '{print $2}'`'"

check process pytimed.py matching "/letv/p2sp/offline/pytimed.py"
        start program = "/usr/bin/python /letv/p2sp/offline/pytimed.py"
        stop program = "/bin/bash -c 'kill -s SIGTERM `ps -ef|grep offline/pytimed.py|grep -v grep|awk -F" " '{print $2}'`'"

check process hadoop with pidfile "/usr/local/hadoop/pids/hadoop-hadoop-datanode.pid"
     start program = "/usr/bin/sudo -u hadoop  -i  hadoop-daemon.sh start  datanode"
     stop program = "/usr/bin/sudo -u hadoop  -i  hadoop-daemon.sh stop  datanode"

check process ETMDaemon matching "/letv/p2sp/xware/lib/ETMDaemon"
        start program = "/letv/p2sp/xware/portal"
        stop program = "/bin/bash -c 'kill -s SIGTERM `ps -ef|grep ETMDaemon|grep -v grep |awk '{print $2}'`'"

If there are many monitoring entries and you do not want them all in /etc/monit.conf, use the include statement, for example:
[root@bastion-IDC ~]# vim /etc/monit.conf
.......
include /etc/services.cfg

Then create /etc/services.cfg and put the process checks in that file.

[root@bastion-IDC ~]# vim /etc/services.cfg
check process nginx with pidfile /usr/local/nginx/logs/nginx.pid
        start program = "/usr/local/nginx/sbin/nginx"
        stop program = "/usr/local/nginx/sbin/nginx -s stop"
check process php-fpm with pidfile /var/run/php-fpm/php-fpm.pid
       start program = "/etc/init.d/php-fpm start"
       stop program = "/etc/init.d/php-fpm stop"
check process mysqld  with pidfile "/letv/mysql2/data/cdn.oss.letv.com.pid"
       start program = "/etc/init.d/mysqld start"
       stop program = "/etc/init.d/mysqld stop"
.........

Email notification can be configured in monit.conf like this:

# mail-server
set mailserver  smtp.huanqiu.cn port 587
# email-format
set mail-format {
 from: monit@huanqiu.cn
 subject: $SERVICE $EVENT at $DATE on $HOST
 message: Monit $ACTION $SERVICE $EVENT at $DATE on $HOST : $DESCRIPTION.

       Yours sincerely,
          Monit

  }

set alert wangshibo@huanqiu.cn

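Monit provides several built-in variables ($DATE, $EVENT, $HOST and so on), so you can customize the email content to suit your needs. To send mail from the machine running Monit, a sendmail-compatible program (such as postfix or ssmtp) must be installed.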
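To illustrate how the placeholders in the subject line are filled in, the following sketch imitates the substitution in shell. It is not Monit's own code; the function name render_subject and the sample values (service, event text, date, host) are assumptions chosen for the example.

```shell
#!/bin/bash
# Sketch: replace the $SERVICE, $EVENT, $DATE and $HOST placeholders of a
# mail-format subject with sample values (this only imitates the substitution).
render_subject() {
    local template=$1 service=$2 event=$3 date=$4 host=$5
    local out=$template
    out=${out//\$SERVICE/$service}
    out=${out//\$EVENT/$event}
    out=${out//\$DATE/$date}
    out=${out//\$HOST/$host}
    printf '%s\n' "$out"
}

# Single quotes keep the shell from expanding the placeholders itself.
render_subject '$SERVICE $EVENT at $DATE on $HOST' \
    nginx 'Does not exist' 'Wed, 22 Mar 2017 11:32:42' huanqiu_web1
```

The call prints "nginx Does not exist at Wed, 22 Mar 2017 11:32:42 on huanqiu_web1", which is the shape of subject line the mail-format block above would produce.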

Monitoring some performance metrics of the local machine:

check system 127.0.0.1
    if loadavg (5min) > 4 for 4 times within 5 cycles then exec "/etc/monit/script/sendsms sysload 5min >4"
    if memory usage > 90% then exec "/etc/monit/script/sendsms 127.0.0.1 memory usage>90%"
    if cpu usage (user)  > 70% for 4 times within 5 cycles then exec "/etc/monit/script/sendsms cpu(user) >70%"
    if cpu usage (system) > 30% for 4 times within 5 cycles then exec "/etc/monit/script/sendsms cpu(system) >30% "
    if cpu usage (wait)  > 20% for 4 times within 5 cycles then exec "/etc/monit/script/sendsms system busy! cpu(wait) >20%"

Monitoring selected ports on a remote machine:

check host Unicom_mobi with address 211.90.246.51
      if failed icmp type echo count 10 with timeout 20 seconds then exec "/etc/monit/script/sendsms Unicom_mobi  211.90.246.51 ping failed!"
      if failed port 22 type tcp with timeout 10 seconds for 2 times within 3 cycles then exec "/etc/monit/script/sendsms unicom 211.90.246.51:2222 connect failed!"
      if failed port 9528 type tcp with timeout 10 seconds for 2 times within 3 cycles then exec "/etc/monit/script/sendsms unicom 211.90.246.51:9528 connect failed!"
      if failed port 9529 type tcp with timeout 10 seconds for 2 times within 3 cycles then exec "/etc/monit/script/sendsms unicom 211.90.246.51:9529 connect failed!"
      if failed port 9530 type tcp with timeout 10 seconds for 2 times within 3 cycles then exec "/etc/monit/script/sendsms unicom 211.90.246.51:9530 connect failed!"

A strength of Monit is that it can restart services and run custom scripts when a monitored condition fails, for example:

check filesystem root with path /dev/mapper/VolGroup00-LogVol00
      if space usage > 80% for 5 times within 15 cycles then exec "/etc/monit/script/clear_core.sh"
         else if succeed for 1 times within 2 cycles then exec "/etc/monit/script/sendsms '/dev/sda1 usage > 90% clear core file succeed!'>/dev/null 2"

A few more small configuration examples:

1) Monitor CPU and memory usage on the local server
check system localhost
    if loadavg (1min) > 10 then alert
    if loadavg (5min) > 6 then alert
    if memory usage > 75% then alert
    if cpu usage (user) > 70% then alert
    if cpu usage (system) > 60% then alert
    if cpu usage (wait) > 75% then alert

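If an alert should fire only after the condition has held for several consecutive cycles, rather than on a single reading, configure it like this: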
    if loadavg (1min) > 10 for 2 cycles then alert

2) Set up a check for a remote SMTP server (for example 192.168.1.102), assuming the server runs SMTP, IMAP and SSH services.
check host MAIL with address 192.168.1.102
   if failed icmp type echo within 10 cycles then alert
   if failed port 25  protocol smtp then alert
      else if recovered then exec "/scripts/mail-script"
   if failed port 22  protocol ssh  then alert
   if failed port 143 protocol imap then alert
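This checks whether the remote host responds to ICMP. If no ICMP reply is received within 10 cycles, an alert is sent. If the SMTP protocol check on port 25 fails, an alert is sent; when the check succeeds again after a failure, a script (/scripts/mail-script) is run. If the SSH check on port 22 or the IMAP check on port 143 fails, an alert is sent as well.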


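3) Existence test:
When Monit finds that a file does not exist or a service is not running, its default action is to restart it.
check file test.txt with path /home/laicb/test.txt
   if does not exist for 5 cycles then alert
Note:
This checks a file (check file ... with path). If the path were only /home/laicb, Monit would report that the path is not of a valid type, because it is a directory.
To check a directory, use check directory instead of check file.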

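4) Resource test:
Some resource tests can be used in a check system entry, some in a check process entry, and some in both.
if cpu is greater than 50% for 5 cycles then restart
Note: "is greater than" means ">". For memory tests the value can also be given in units such as B, kB, MB or GB instead of %.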

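5) Timestamp test:
The timestamp refers to the creation, modification and access times in a file's attributes. (exec below means run the command that follows.)
Change form: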
 check file httpd.conf with path /usr/local/apache/conf/httpd.conf
   if changed timestamp
     then exec "/usr/local/apache/bin/apachectl graceful"
Constant form:
check file stored.ckp with path /msg-foo/config/stored.ckp
   if timestamp > 1 minute then alert

6) File size test:
This can only be used in a check file entry.
check file test.txt with path /home/laicb/test.txt
    if does not exist for 5 cycles then alert
    if changed size for  1 cycles then alert            # if no cycle count is given, the service details show "for 5 times within 5 cycles"

If the file size changes, the status output shows "size changed" afterwards.


7) Permission test:
check file monit.bin with path "/usr/local/bin/monit"
       if failed permission 0555 then unmonitor    # if the permissions of /usr/local/bin/monit are not 0555, stop monitoring it

check file passwd with path /etc/passwd
       if failed uid root then unmonitor    # if /etc/passwd is not owned by root, stop monitoring it
 
8) PID test
 check process sshd with pidfile /var/run/sshd.pid
       if changed pid then exec "/my/script"

9) Uptime test:
check process myapp with pidfile /var/run/myapp.pid
    start program = "/etc/init.d/myapp start"
    stop program = "/etc/init.d/myapp stop"
    if uptime > 3 days then restart

10) Monitor host connectivity
 check host www.huanqiu.com with address www.huanqiu.com
       if failed icmp type echo count 5 with timeout 15 seconds
          then alert

11) Apache process monitoring
 check process apache with pidfile /var/run/httpd.pid
       start program = "/etc/init.d/httpd start"
       stop program  = "/etc/init.d/httpd stop"
       if cpu > 40% for 2 cycles then alert
       if totalcpu > 60% for 2 cycles then alert
       if totalcpu > 80% for 5 cycles then restart
       if mem > 100 MB for 5 cycles then stop
       if loadavg(5min) greater than 10.0 for 8 cycles then stop

-------------------------------------------------------------------------------------------
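Here are fixes for several situations in which Monit could not be used:
1) Monit connection error (the HTTP port section was missing from the configuration)
Troubleshooting showed that the following two lines were missing from the Monit configuration file /etc/monitrc: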
set httpd port 30000
allow 127.0.0.1

After adding the two lines above, Monit worked normally again.
[root@cdn ~]# cat /etc/monitrc
set daemon 30
set logfile syslog facility log_daemon
set pidfile /var/run/monit.pid
set httpd port 30000
allow 127.0.0.1
#allow admin:TVA3z3i
..........
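A quick way to catch this before reloading is to look for the two lines in the control file. The sketch below is an illustration only: the function name check_httpd_stanza is hypothetical, and the patterns are simple approximations rather than a full parser for the configuration syntax.

```shell
#!/bin/bash
# Sketch: report which httpd-related lines are missing from a Monit
# control file. check_httpd_stanza is a hypothetical helper.
check_httpd_stanza() {
    local conf=$1 missing=0
    grep -Eq '^[[:space:]]*set[[:space:]]+httpd[[:space:]]+port[[:space:]]+[0-9]+' "$conf" \
        || { echo "missing: set httpd port <n>"; missing=1; }
    grep -Eq '^[[:space:]]*allow[[:space:]]+[^[:space:]]+' "$conf" \
        || { echo "missing: allow <host or user:password>"; missing=1; }
    if [ "$missing" -eq 0 ]; then
        echo "httpd stanza looks complete"
    fi
    return "$missing"
}

# Example: check_httpd_stanza /etc/monitrc
```

The function returns a non-zero status when anything is missing, so it can guard a reload, for example: check_httpd_stanza /etc/monitrc && monit reload.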

2) Another error (fixed by changing 127.0.0.1 to localhost)
monit -t and monit reload report no errors, but monit status fails as follows:
[root@cdn ~]# monit status
monit: cannot read status from the monit daemon

Check the log:
[root@cdn ~]# tail -f /var/log/messages
Aug 18 19:27:21 cdn monit[14491]: monit: Denied connection from non-authorized client [220.181.153.243]
Aug 18 19:27:21 cdn monit[16899]: monit: cannot read status from the monit daemon

Solution:
Change "allow 127.0.0.1" to "allow localhost" in the Monit configuration file.
[root@cdn ~]# vim /etc/monitrc
set daemon 30
set logfile syslog facility log_daemon
set pidfile /var/run/monit.pid
set httpd port 30000
allow localhost
.........

This resolves the problem.

[root@cdn ~]# monit status
The Monit daemon 5.3.2 uptime: 5m
.........................
Process 'nginx_down'
status Running
monitoring status Monitored
pid 18671
parent pid 1
uptime 145d 5h 17m
children 8
memory kilobytes 484
memory kilobytes total 4572
memory percent 0.0%
memory percent total 0.0%
cpu percent 0.0%
cpu percent total 0.0%
data collected Mon, 18 Aug 2014 19:34:25
.............

3) The following error was not resolved by either of the two fixes above (fixed by adding use address 127.0.0.1)
[root@182 conf]# monit -t
Control file syntax OK
[root@182 conf]# monit reload
Reinitializing monit daemon
[root@182 conf]# monit status
monit: error connecting to the monit daemon

Check the Monit configuration file:
[root@182 conf]# cat /etc/monitrc.bak
set daemon 30
set httpd port 30000
allow 127.0.0.1
set logfile syslog facility log_daemon
set pidfile /var/run/monit.pid
..................

Final solution:
Add "use address 127.0.0.1" to the Monit configuration file.
[root@182 conf]# cat /etc/monitrc
set daemon 30
set httpd port 30000
use address 127.0.0.1
allow 127.0.0.1
set logfile syslog facility log_daemon
set pidfile /var/run/monit.pid
.............

Check again; the problem is resolved:
[root@182 conf]# monit status
The Monit daemon 5.3.2 uptime: 7m
..............
Process 'rsync'
status Running
monitoring status Monitored
pid 13519
parent pid 1
uptime 393d 9h 8m
children 0
memory kilobytes 540
memory kilobytes total 540
memory percent 0.0%
memory percent total 0.0%
cpu percent 0.0%
cpu percent total 0.0%
data collected Thu, 21 Aug 2014 19:24:58
..............

Note:
The third fix above is the most complete. If monit status still shows the following after use address 127.0.0.1 has been added:
[root@ly-u-gfs1 ~]# monit status
monit: error connecting to the monit daemon

then wait a short while; after a little time Monit responds normally:
[root@ly-u-gfs1 ~]# monit status
The Monit daemon 5.3.2 uptime: 6m

Process 'net-snmp'
status Running
monitoring status Monitored
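The waiting can be scripted with a small retry loop. This is a sketch under the assumption that monit status exits with a non-zero status while the daemon cannot be reached; the function name wait_until and the attempt and delay values are hypothetical.

```shell
#!/bin/bash
# Sketch: retry a command until it succeeds or the attempts run out.
# wait_until is a hypothetical helper; its arguments are the number of
# attempts, the delay in seconds between attempts, then the command.
wait_until() {
    local attempts=$1 delay=$2
    shift 2
    local n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "succeeded after $n attempt(s)"
            return 0
        fi
        sleep "$delay"
        n=$(( n + 1 ))
    done
    echo "gave up after $attempts attempt(s)" >&2
    return 1
}

# Example (needs a running Monit installation):
#   monit reload && wait_until 10 3 monit status
```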

-------------------------------------------------------------------------------------------------------------------------
Deploying Monit on CentOS 7:
[root@linux-node2 ~]# yum update
[root@linux-node2 ~]# yum install -y monit
[root@linux-node2 ~]# rpm -ql monit
/etc/logrotate.d/monit
/etc/monit.d
/etc/monit.d/logging
/etc/monitrc
/usr/bin/monit
/usr/lib/systemd/system/monit.service
/usr/share/doc/monit-5.14
/usr/share/doc/monit-5.14/COPYING
/usr/share/doc/monit-5.14/README
/usr/share/man/man1/monit.1.gz
/var/log/monit.log
[root@linux-node2 ~]# monit -V
This is Monit version 5.14
Copyright (C) 2001-2016 Tildeslash Ltd. All Rights Reserved.

[root@linux-node2 ~]# monit -v
Adding host allow 'localhost'
Skipping redundant host 'localhost'
Adding credentials for user 'admin'
Runtime constants:
Control file = /etc/monitrc
Log file = /var/log/monit.log
Pid file = /run/monit.pid
Id file = /root/.monit.id
State file = /root/.monit.state
Debug = True
Log = True
Use syslog = False
Is Daemon = True
Use process engine = True
Poll time = 30 seconds with start delay 0 seconds
Expect buffer = 256 bytes
Mail from = (not defined)
Mail subject = (not defined)
Mail message = (not defined)
Start monit httpd = True
httpd bind address = localhost
httpd portnumber = 2812
httpd ssl = Disabled
httpd signature = Enabled
httpd auth. style = Basic Authentication and Host/Net allow list

The service list contains the following entries:

System Name = linux-node2.openstack
Monitoring mode = active

View the default configuration file:
[root@linux-node2 ~]# grep -v '^#' /etc/monitrc
set daemon 30 # check services at 30 seconds intervals
set logfile syslog

set httpd port 2812 and
use address localhost         # only accept connection from localhost
allow localhost                   # allow localhost to connect to the server and
allow admin:monit              # require user 'admin' with password 'monit'
allow @monit                    # allow users of group 'monit' to connect (rw)
allow @users readonly       # allow users of group 'users' to connect readonly

include /etc/monit.d/*

[root@linux-node2 ~]# cat /etc/logrotate.d/monit
/var/log/monit.log {
missingok
notifempty
size 100k
create 0644 root root
postrotate
/bin/systemctl reload monit.service > /dev/null 2>&1 || :
endscript
}

[root@linux-node2 ~]# cat /etc/monit.d/logging
# log to monit.log
set logfile /var/log/monit.log                     # log output and log rotation are preconfigured (the default polling interval is 30 seconds, per "set daemon 30" above)

Configure Monit:
[root@linux-node2 ~]# vim /etc/monitrc

set daemon  5   
set logfile syslog
 
set httpd port 2812 and
    use address localhost
    allow localhost       
    allow admin:monit     
    allow @monit          
    allow @users readonly    
 
include /etc/monit.d/*
 
check process sshd with pidfile /var/run/sshd.pid
    start program "/usr/bin/systemctl start sshd.service"
    stop program "/usr/bin/systemctl stop sshd.service"
    if failed port 22 protocol ssh then restart
    if 5 restart within 5 cycles then timeout
 
check process apache with pidfile /etc/httpd/run/httpd.pid
  start program = "/usr/bin/systemctl start httpd" with timeout 60 seconds
  stop program  = "/usr/bin/systemctl stop httpd"
  if failed host linux-node2.openstack port 80 protocol http
     and request "/readme.html"
     then restart
  if 3 restarts within 5 cycles then timeout
  group apache
 
check process mariadb with pidfile "/var/lib/mysql/linux-node2.pid"
    start = "/usr/bin/systemctl start mariadb.service"
    stop = "/usr/bin/systemctl stop mariadb.service"
    if failed host 127.0.0.1 port 3306 protocol mysql then restart
    if 5 restarts within 5 cycles then timeout

----------------------------------------------------------------------------------------------
Find the pid file of the mysql service:
MariaDB [(none)]> show variables like "%pid%";
+---------------+--------------------------------+
| Variable_name | Value |
+---------------+--------------------------------+
| pid_file | /var/lib/mysql/linux-node2.pid |
+---------------+--------------------------------+
1 row in set (0.00 sec)
--------------------------------------------------------------------------------------------

Start Monit:
[root@linux-node2 ~]# systemctl enable monit.service
Created symlink from /etc/systemd/system/multi-user.target.wants/monit.service to /usr/lib/systemd/system/monit.service.
[root@linux-node2 ~]# systemctl start monit.service
[root@linux-node2 ~]# lsof -i:2812
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
monit 89106 root 5u IPv4 172788270 0t0 TCP localhost:atmtcp (LISTEN)
[root@linux-node2 ~]# systemctl status monit.service
● monit.service - Pro-active monitoring utility for unix systems
Loaded: loaded (/usr/lib/systemd/system/monit.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2017-02-03 10:47:22 CST; 50s ago
Main PID: 89106 (monit)
CGroup: /system.slice/monit.service
└─89106 /usr/bin/monit -I

Feb 03 10:47:22 linux-node2.openstack systemd[1]: Started Pro-active monitoring utility for unix systems.
Feb 03 10:47:23 linux-node2.openstack systemd[1]: Starting Pro-active monitoring utility for unix systems...
Feb 03 10:47:23 linux-node2.openstack monit[89106]: /etc/monitrc:20: Program does not exist: 'systemctl'
Feb 03 10:47:23 linux-node2.openstack monit[89106]: /etc/monitrc:21: Program does not exist: 'systemctl'
Feb 03 10:47:23 linux-node2.openstack monit[89106]: Starting Monit 5.14 daemon with http interface at [localhost]:2812

View Monit status:

[root@linux-node2 ~]# monit status
The Monit daemon 5.14 uptime: 9m 

Process 'sshd'
  status                            Running
  monitoring status                 Monitored
  pid                               1755
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            86d 19h 39m 
  children                          6
  memory                            3.5 MB
  memory total                      25.1 MB
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  port response time                0.021s to [localhost]:22 type TCP/IP protocol SSH
  data collected                    Fri, 03 Feb 2017 10:57:20

Process 'apache'
  status                            Not monitored
  monitoring status                 Not monitored
  data collected                    Fri, 03 Feb 2017 10:50:21

Process 'mariadb'
  status                            Running
  monitoring status                 Monitored
  pid                               46235
  parent pid                        1
  uid                               27
  effective uid                     27
  gid                               27
  uptime                            29d 16h 1m 
  children                          0
  memory                            296.1 MB
  memory total                      296.1 MB
  memory percent                    0.4%
  memory percent total              0.4%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  port response time                0.001s to [127.0.0.1]:3306 type TCP/IP protocol MYSQL
  data collected                    Fri, 03 Feb 2017 10:57:20

System 'linux-node2.openstack'
  status                            Running
  monitoring status                 Monitored
  load average                      [2.01] [1.86] [1.94]
  cpu                               5.2%us 2.0%sy 0.0%wa
  memory usage                      44.0 GB [70.1%]
  swap usage                        2.6 MB [0.1%]
  data collected                    Fri, 03 Feb 2017 10:57:20

Reload the Monit service:
[root@linux-node2 ~]# monit reload
Reinitializing monit daemon

Confirm that Monit restarts processes automatically
Stop the nginx process, then watch monit.log:
[root@linux-node2 ~]# systemctl stop nginx.service
[root@linux-node2 ~]# tailf /var/log/monit.log
[CST Apr 5 21:35:18] error : 'nginx' process is not running
[CST Apr 5 21:35:18] info : 'nginx' trying to restart
[CST Apr 5 21:35:18] info : 'nginx' start: /usr/bin/systemctl

Configure start at boot. The command differs by system and version; this shows how to enable start at boot on CentOS 7.
[root@linux-node2 ~]# systemctl list-unit-files | grep monit.service
monit.service disabled
[root@linux-node2 ~]# systemctl enable monit.service
ln -s '/usr/lib/systemd/system/monit.service' '/etc/systemd/system/multi-user.target.wants/monit.service'
[root@linux-node2 ~]# systemctl list-unit-files | grep monit.service
monit.service enabled
-------------------------------------------------------------------------------------------------------------------------
