Deploying a System Log Alerting Platform on AWS with ELK

Posted by welchang on 2021-09-09


Preface

In operations work, troubleshooting speed tends to be proportional to the granularity of the monitoring system; only with monitoring in place can faults be found and fixed quickly.

Before this system was deployed, all platform system logs went through Graylog + Zabbix, which alerted on error keywords appearing in the logs. That approach exposed a number of shortcomings in day-to-day operations (not detailed here). After weighing several factors, we replaced the log alerting stack; the chosen solution is: ELK + Kafka + Filebeat + Elastalert.

This article is organized around two requirements:

Alerting on server logins outside working hours

Alerting on error keywords in system logs

Architecture

(Architecture diagram)

Service selection

name                          version  info
Amazon Elasticsearch Service  v6.2     AWS official deployment guide
Logstash                      v6.2.3   same version as ES
Filebeat                      v6.2.3   same version as ES
Confluent (Kafka)             v4.0     Confluent's distribution is recommended (see note below)
Elastalert                    v0.1.29  X-Pack was considered first, but AWS does not support it yet

Note: Confluent is the big-data company that Kafka co-creator Neha Narkhede founded together with other former LinkedIn colleagues after leaving LinkedIn; it focuses on enterprise applications of Kafka.

Deployment

Operating system used in this article: CentOS release 6.6

Filebeat

# Download the package (standard Elastic artifacts URL for 6.2.3)
$ curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-6.2.3-x86_64.rpm

# Install
$ sudo rpm -vi filebeat-6.2.3-x86_64.rpm

Logstash

# Import the Elastic GPG key and add the Yum repo
$ rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch

$ cat <<EOF > /etc/yum.repos.d/logstash.repo
[logstash-6.x]
name=Elastic repository for 6.x packages
baseurl=https://artifacts.elastic.co/packages/6.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
EOF

# Install
yum install logstash -y

Elastalert

# Install directly with pip
$ pip install elastalert

# If dependency errors occur, these are the commonly needed development packages
$ yum install -y zlib openssl openssl-devel gcc gcc-c++ Xvfb libXfont Xorg libffi libffi-devel python-cffi python-devel libxslt-devel libxml2-devel zlib-devel bzip2-devel xz-libs wget

Configuration

Filebeat

/etc/filebeat/filebeat.yml

filebeat.config:
  prospectors:
    path: /etc/filebeat/conf/*.yml
    reload.enabled: true
    reload.period: 10s

output.kafka:
  # kafkaNode is the host running the Kafka broker
  hosts: ["kafkaNode:9092"]
  # The Kafka topic is taken from the custom field fields.out_topic
  topic: "%{[fields][out_topic]}"
  partition.round_robin:
    reachable_only: false

/etc/filebeat/conf/base.yml

# Collect system logs
- type: log
  paths:
    - /var/log/messages
    - /var/log/syslog*
  exclude_files: [".gz$"]
  exclude_lines: ["ssh_host_dsa_key"]
  tags: ["system_log"]
  scan_frequency: 1s
  fields:
    # Extra field to identify the source client
    server_name: client01
    # Kafka topic for this input
    out_topic: "system_log"
  multiline:
    pattern: '^\s'
    match: after

# Collect login logs
- type: log
  paths:
    - /var/log/secure*
    - /var/log/auth.log*
  tags: ["system_secure"]
  exclude_files: [".gz$"]
  scan_frequency: 1s
  fields:
    server_name: client01
    out_topic: "system_secure"
  multiline:
    pattern: '^\s'
    match: after

Logstash

/etc/logstash/conf.d/system_log.conf

input {
    kafka {
        bootstrap_servers => "kafkaNode:9092"
        consumer_threads => 3
        topics => ["system_log"]
        auto_offset_reset => "latest"
        codec => "json"
    }
}

filter {
    # Drop logstash's own stdout log to avoid a feedback loop
    if [source] == "/var/log/logstash-stdout.log" {
        drop {}
    }
    if [fields][out_topic] == "system_log" {
        date { match => [ "[system][syslog][timestamp]", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ] }
        grok {
            match => { "message" => ["%{SYSLOGTIMESTAMP:[system][syslog][timestamp]} %{SYSLOGHOST:[system][syslog][hostname]} %{DATA:[system][syslog][program]}(?:\[%{POSINT:[system][syslog][pid]}\])?: %{GREEDYMULTILINE:[system][syslog][message]}"] }
            pattern_definitions => { "GREEDYMULTILINE" => "(.|\n)*" }
            remove_field => "message"
        }
    }
}

output {
    elasticsearch {
        hosts => ["<Amazon ES endpoint>"]
        index => "%{[fields][out_topic]}_%{+YYYYMMdd}"
        document_type => "%{[@metadata][type]}"
    }
}
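To make the intent of that grok concrete, here is a rough Python equivalent (an illustration only; the grok filter above is what actually runs). The sample line and group names mirror the pattern's captures:

import re

# Simplified Python analogue of the syslog grok pattern above (illustration only);
# re.DOTALL plays the role of GREEDYMULTILINE for multi-line messages
SYSLOG_RE = re.compile(
    r'(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) '
    r'(?P<hostname>\S+) '
    r'(?P<program>[\w./-]+)(?:\[(?P<pid>\d+)\])?: '
    r'(?P<message>.*)', re.DOTALL)

line = "Apr 25 09:14:55 client01 sshd[1234]: error: maximum authentication attempts exceeded"
print(SYSLOG_RE.match(line).groupdict())
# {'timestamp': 'Apr 25 09:14:55', 'hostname': 'client01',
#  'program': 'sshd', 'pid': '1234',
#  'message': 'error: maximum authentication attempts exceeded'}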

/etc/logstash/conf.d/secure_log.conf

input {
    kafka {
        bootstrap_servers => "kafkaNode:9092"
        consumer_threads => 3
        topics => ["system_secure"]
        auto_offset_reset => "latest"
        codec => "json"
    }
}

filter {
    if [fields][out_topic] == "system_secure" {
        grok {
          match => { "message" => ["%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} sshd(?:\[%{POSINT:[system][auth][pid]}\])?: %{DATA:[system][auth][ssh][event]} %{DATA:[system][auth][ssh][method]} for (invalid user )?%{DATA:[system][auth][user]} from %{IPORHOST:[system][auth][ssh][ip]} port %{NUMBER:[system][auth][ssh][port]} ssh2(: %{GREEDYDATA:[system][auth][ssh][signature]})?",
                   "%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} sshd(?:\[%{POSINT:[system][auth][pid]}\])?: %{DATA:[system][auth][ssh][event]} user %{DATA:[system][auth][user]} from %{IPORHOST:[system][auth][ssh][ip]}",
                   "%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} sshd(?:\[%{POSINT:[system][auth][pid]}\])?: Did not receive identification string from %{IPORHOST:[system][auth][ssh][dropped_ip]}",
                   "%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} sudo(?:\[%{POSINT:[system][auth][pid]}\])?: \s*%{DATA:[system][auth][user]} :( %{DATA:[system][auth][sudo][error]} ;)? TTY=%{DATA:[system][auth][sudo][tty]} ; PWD=%{DATA:[system][auth][sudo][pwd]} ; USER=%{DATA:[system][auth][sudo][user]} ; COMMAND=%{GREEDYDATA:[system][auth][sudo][command]}",
                   "%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} groupadd(?:\[%{POSINT:[system][auth][pid]}\])?: new group: name=%{DATA:[system][auth][groupadd][name]}, GID=%{NUMBER:[system][auth][groupadd][gid]}",
                   "%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} useradd(?:\[%{POSINT:[system][auth][pid]}\])?: new user: name=%{DATA:[system][auth][user][add][name]}, UID=%{NUMBER:[system][auth][user][add][uid]}, GID=%{NUMBER:[system][auth][user][add][gid]}, home=%{DATA:[system][auth][user][add][home]}, shell=%{DATA:[system][auth][user][add][shell]}$",
                   "%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} %{DATA:[system][auth][program]}(?:\[%{POSINT:[system][auth][pid]}\])?: %{GREEDYMULTILINE:[system][auth][message]}"] }
          pattern_definitions => { "GREEDYMULTILINE" => "(.|\n)*" }
          remove_field => "message"
        }
    }
}

output {
    elasticsearch {
        hosts => ["<Amazon ES endpoint>"]
        index => "%{[fields][out_topic]}_%{+YYYYMMdd}"
        document_type => "%{[@metadata][type]}"
    }
}

Kafka

# Import the Confluent GPG key and add the Yum repos
# (URLs follow the Confluent 4.0 docs; the dist path below is the RHEL/CentOS 7 one, adjust to your release)
rpm --import https://packages.confluent.io/rpm/4.0/archive.key

cat <<EOF > /etc/yum.repos.d/confluent.repo
[Confluent.dist]
name=Confluent repository (dist)
baseurl=https://packages.confluent.io/rpm/4.0/7
gpgcheck=1
gpgkey=https://packages.confluent.io/rpm/4.0/archive.key
enabled=1

[Confluent]
name=Confluent repository
baseurl=https://packages.confluent.io/rpm/4.0
gpgcheck=1
gpgkey=https://packages.confluent.io/rpm/4.0/archive.key
enabled=1
EOF

yum install confluent-platform-oss-2.11
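Before wiring the whole pipeline together, a quick smoke test of the broker saves time later. The sketch below assumes the kafka-python client (pip install kafka-python); it publishes one test event to system_log and reads it back. If it hangs with localhost connection errors, see the Zookeeper pitfall at the end of this article.

# Broker smoke test; assumes `pip install kafka-python`
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafkaNode:9092")
producer.send("system_log", b'{"message": "smoke test"}')
producer.flush()

consumer = KafkaConsumer("system_log",
                         bootstrap_servers="kafkaNode:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)  # stop iterating after 5s of silence
for record in consumer:
    print(record.value)  # expect the test event back
    break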

Elastalert

Elastalert can be deployed on any server that can reach the ES endpoint. Note that modules.eagle_post.EagleAlerter and the blacklist_v2 rule type referenced in the config files below are custom modifications, covered later in this article.
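Besides the rule files, Elastalert needs a global config.yaml (the Docker start.sh later in this article expects it at /opt/elastalert/config/config.yaml). The following is a minimal sketch; the interval values are assumptions to tune:

# config/config.yaml; minimal sketch, intervals are assumptions
rules_folder: rules                  # directory containing the rule files below
run_every:
  minutes: 1                         # how often Elastalert queries ES
buffer_time:
  minutes: 15                        # window of recent data each query covers
es_host: <Amazon ES endpoint>
es_port: 80
writeback_index: elastalert_status   # index Elastalert uses for its own state
alert_time_limit:
  days: 2                            # retry window for failed alerts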

rules/system_log.yaml

es_host: <Amazon ES endpoint>
es_port: 80
name: system log rule
type: blacklist_v2
index: system_log*
timeframe:
    minutes: 1
# Field to watch
compare_key: system.syslog.message
# Alert when any of these keywords appears; extend as needed
blacklist_v2:
  - "ERROR"
  - "error"
alert: "modules.eagle_post.EagleAlerter"
eagle_post_url: "<eagle>"
eagle_post_all_values: False
eagle_post_payload:
  server: "fields.server_name"
  info: "system.syslog.message"
  source: "source"

rules/system_secure.yaml

es_host: <Amazon ES endpoint>
es_port: 80
name: system secure rule
type: frequency
index: system_secure*
num_events: 1
timeframe:
    minutes: 1
filter:
- query:
    wildcard:
       system.auth.user: "*"
alert: "modules.eagle_post.EagleAlerter"
eagle_post_url: "<eagle>"
eagle_post_all_values: False
# Working-hours window; the custom alerter only fires outside it (see sketch below)
eagle_time_start: "09:00"
eagle_time_end: "18:00"
eagle_post_payload:
  user: "system.auth.user"
  server: "fields.server_name"
  ip: "system.auth.ssh.ip"
  event: "system.auth.ssh.event"
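eagle_time_start and eagle_time_end define the working-hours window; the custom alerter in the next section suppresses alerts inside it and fires outside it. A quick sketch of that gate logic:

import dateutil.parser

# The gate implemented by EagleAlerter below: alert only OUTSIDE working hours
start = dateutil.parser.parse("09:00").time()
end = dateutil.parser.parse("18:00").time()

for sample in ("12:30", "20:45"):
    now = dateutil.parser.parse(sample).time()
    fires = not (start < now < end)
    print(sample + " -> " + ("alert" if fires else "suppressed"))
# 12:30 -> suppressed
# 20:45 -> alert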

Elastalert

Custom type and alert

To feed alerts into Eagle (our in-house unified API platform), we first tried the built-in http_post alert type, but found it could not pass fields from the ES result as POST parameters. So we made some simple modifications and added a new rule type, allowing alerts to plug into Eagle seamlessly.

Alert

modules/eagle_post.py

Save the module into site-packages/elastalert:

import json
import requests
import dateutil.parser
import datetime

from elastalert.alerts import Alerter
from elastalert.util import EAException
from elastalert.util import elastalert_logger
from elastalert.util import lookup_es_key


class EagleAlerter(Alerter):
    def __init__(self, rule):
        super(EagleAlerter, self).__init__(rule)
        # Working-hours window during which alerts are suppressed
        self.post_time_start = self.rule.get('eagle_time_start', '00:00')
        self.post_time_end = self.rule.get('eagle_time_end', '00:00')
        # POST endpoint
        self.post_url = self.rule.get('eagle_post_url', '')
        self.post_payload = self.rule.get('eagle_post_payload', {})
        self.post_static_payload = self.rule.get('eagle_post_static_payload', {})
        self.post_all_values = self.rule.get('eagle_post_all_values', False)
        self.post_lock = False

    def alert(self, matches):
        if not self.post_url:
            elastalert_logger.info('Please input eagle url!')
            return False
        for match in matches:
            # Start from the whole match if requested
            payload = match if self.post_all_values else {}
            # Resolve each configured payload field from the ES match
            for post_key, es_key in self.post_payload.items():
                payload[post_key] = lookup_es_key(match, es_key)
            # Current time and the configured window
            login_time = datetime.datetime.now().time()
            time_start = dateutil.parser.parse(self.post_time_start).time()
            time_end = dateutil.parser.parse(self.post_time_end).time()
            # Suppress the alert inside the working-hours window
            self.post_lock = not (time_start < login_time < time_end)
            # Merge static and dynamic payloads
            data = self.post_static_payload
            data.update(payload)
            # Send the alert
            if self.post_lock:
                myRequests = requests.Session()
                myRequests.post(url=self.post_url, data=data, verify=False)
                elastalert_logger.info("[-] eagle alert sent.")
            else:
                elastalert_logger.info("[*] nothing to do.")

    def get_info(self):
        return {'type': 'http_post'}
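The eagle_post_payload mapping in the rule files resolves dotted keys such as fields.server_name against each match via elastalert's lookup_es_key helper. A simplified sketch of what that resolution does (the real helper in elastalert.util also copes with field names containing literal dots):

# Simplified illustration of how dotted payload keys resolve against a match
def lookup_sketch(match, dotted_key):
    value = match
    for part in dotted_key.split('.'):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

match = {"fields": {"server_name": "client01"},
         "system": {"auth": {"user": "root"}}}
print(lookup_sketch(match, "fields.server_name"))  # client01
print(lookup_sketch(match, "system.auth.user"))    # root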

Type

While using the built-in blacklist type we found that it only does exact (whole-value) matching. To make rule files easier to write, we made a small modification so entries match as substrings.

elastalert/ruletypes.py

# Addition
class BlacklistV2Rule(CompareRule):
    required_options = frozenset(['compare_key', 'blacklist_v2'])

    def __init__(self, rules, args=None):
        super(BlacklistV2Rule, self).__init__(rules, args=None)
        self.expand_entries('blacklist_v2')

    def compare(self, event):
        term = lookup_es_key(event, self.rules['compare_key'])
        if term is None:
            return False
        # Loop over the configured keywords. This costs some performance,
        # but it stays until a better approach is found.
        for keyword in self.rules['blacklist_v2']:
            if keyword in term:
                return True
        return False
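The practical difference from the stock blacklist type in one example: blacklist compares the whole field value against the list, while blacklist_v2 treats each entry as a substring:

# Stock blacklist vs blacklist_v2 semantics on the same log message
message = "kernel: EXT4-fs error (device xvda1): ext4_find_entry: reading directory lblock 0"
keywords = ["ERROR", "error"]

print(message in keywords)                        # False: exact match never fires here
print(any(word in message for word in keywords))  # True:  substring match fires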

elastalert/config.py

# Addition: register the new type
rules_mapping = {
    'frequency': ruletypes.FrequencyRule,
    'any': ruletypes.AnyRule,
    'spike': ruletypes.SpikeRule,
    'blacklist': ruletypes.BlacklistRule,
    'blacklist_v2': ruletypes.BlacklistV2Rule,
    'whitelist': ruletypes.WhitelistRule,
    'change': ruletypes.ChangeRule,
    'flatline': ruletypes.FlatlineRule,
    'new_term': ruletypes.NewTermsRule,
    'cardinality': ruletypes.CardinalityRule,
    'metric_aggregation': ruletypes.MetricAggregationRule,
    'percentage_match': ruletypes.PercentageMatchRule,
}

elastalert/schema.yaml

# Addition
  - title: BlacklistV2
    required: [blacklist_v2, compare_key]
    properties:
      type: {enum: [blacklist_v2]}
      compare_key: {'items': {'type': 'string'}, 'type': ['string', 'array']}
      blacklist_v2: {type: array, items: {type: string}}
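With the schema patched, a rule can be validated without waiting for a real alert: elastalert ships an elastalert-test-rule helper that checks the rule file against the schema and dry-runs it over recent data.

# Validate the rule file and dry-run it against recent data
elastalert-test-rule rules/system_log.yaml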

Packaging into Docker

A simple Dockerfile for reference:

FROM python:2.7-alpine

ENV SITE_PACKAGES /usr/local/lib/python2.7/site-packages/elastalert

WORKDIR /opt/elastalert

RUN apk update && \
    apk add gcc ca-certificates openssl-dev openssl libffi-dev musl-dev tzdata openntpd && \
    pip install elastalert && \
    cp -rf /usr/share/zoneinfo/Asia/Taipei /etc/localtime

COPY ./ /opt/elastalert

CMD ["/opt/elastalert/start.sh"]
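Build and run it roughly as follows (the image name is arbitrary; the config, rules and modules directories are baked in by the COPY step above):

docker build -t elastalert .
docker run -d --name elastalert elastalert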

start.sh

#!/bin/sh

SITE_PATH=/usr/local/lib/python2.7/site-packages/elastalert
CONFIG=/opt/elastalert/config/config.yaml
MODULES=/opt/elastalert/modules

# Copy the custom alerter module into the installed package, if present
if [ -d "${MODULES}" ]
then
    cp -rf ${MODULES} ${SITE_PATH}
    echo "[-] Copy ${MODULES} to ${SITE_PATH}"
fi

# Overlay the patched sources (ruletypes.py, config.py, schema.yaml)
cp -rf elastalert/* ${SITE_PATH}/
echo "[-] Copy elastalert/* to ${SITE_PATH}"

python -m elastalert.elastalert --verbose --config ${CONFIG}

With this groundwork in place, the image was added to our Bee container management platform for automated builds.

Results

(Screenshot of the resulting alerts)

Pitfalls encountered

Zookeeper

Problem

Older Kafka releases depend on Zookeeper, and a default installation registers the broker under the address localhost, which produces the following symptoms:

Filebeat error log

2018-04-25T09:14:55.590+0800    INFO    kafka/log.go:36    client/metadata fetching metadata for [[[system_log] kafkaNode:9092]] from broker %!s(MISSING)
2018-04-25T09:14:55.591+0800    INFO    kafka/log.go:36    producer/broker/[[0]] starting up
2018-04-25T09:14:55.591+0800    INFO    kafka/log.go:36    producer/broker/[[0 %!d(string=system_log) 0]] state change to [open] on %!s(MISSING)/%!d(MISSING)
2018-04-25T09:14:55.591+0800    INFO    kafka/log.go:36    producer/leader/[[system_log %!s(int32=0) %!s(int32=0)]]/%!d(MISSING) selected broker %!d(MISSING)
2018-04-25T09:14:55.591+0800    INFO    kafka/log.go:36    producer/leader/[[system_secure %!s(int32=0) %!s(int=3)]]/%!d(MISSING) state change to [retrying-%!d(MISSING)]
2018-04-25T09:14:55.591+0800    INFO    kafka/log.go:36    producer/leader/[[system_secure %!s(int32=0) %!s(int32=0)]]/%!d(MISSING) abandoning broker %!d(MISSING)
2018-04-25T09:14:55.592+0800    INFO    kafka/log.go:36    producer/broker/[[0]] shut down
2018-04-25T09:14:55.592+0800    INFO    kafka/log.go:36    Failed to connect to broker [[localhost:9092 dial tcp [::1]:9092: getsockopt: connection refused]]: %!s(MISSING)

The log shows two addresses: the Kafka address we configured, plus an unexpected localhost address.

Filebeat establishes the initial connection to Kafka just fine, but the broker metadata it then receives (the address the broker registered in Zookeeper) advertises localhost, which cannot be reached.
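The broker's self-registration can also be confirmed programmatically before touching anything; a small sketch assuming the kazoo Zookeeper client (pip install kazoo):

# Inspect the broker registration znode; assumes `pip install kazoo`
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="kafkaNode:2181")
zk.start()
data, _stat = zk.get("/brokers/ids/0")
print(json.loads(data))   # check "host" and "endpoints"; here they say localhost
zk.stop()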

Fix

# In the Zookeeper CLI, inspect the broker registration
get /brokers/ids/0
{"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT"},"endpoints":["PLAINTEXT://localhost:9092"],"jmx_port":-1,"host":"localhost","timestamp":"1523429158364","port":9092,"version":4}
cZxid = 0x1d
ctime = Wed Apr 11 14:45:58 CST 2018
mZxid = 0x1d
mtime = Wed Apr 11 14:45:58 CST 2018
pZxid = 0x1d
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x162b374170d0000
dataLength = 188
numChildren = 0

# The registered address is localhost; change it
set /brokers/ids/0 {"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT"},"endpoints":["PLAINTEXT://kafkaNode:9092"],"jmx_port":9999,"host":"kafkaNode","timestamp":"1523429158364","port":9092,"version":4}

Restart Kafka after the change and the problem is resolved.
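Editing the znode by hand works, but the broker rewrites its registration on restart; the lasting fix is to advertise the right address in the broker config so Kafka registers it itself (the path below assumes Confluent's RPM layout):

# /etc/kafka/server.properties
advertised.listeners=PLAINTEXT://kafkaNode:9092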
