Deploying a System Log Alerting Platform on AWS with ELK
Preface
In operations work, the speed of troubleshooting is usually proportional to the granularity of the monitoring system: faults can only be diagnosed quickly when monitoring is in place.
Before this system was deployed, all platform system logs went through Graylog + Zabbix, which alerted on error keywords found in the logs. That approach exposed a number of shortcomings in day-to-day operations (not detailed here). After weighing the options, the log alerting stack was replaced with: ELK + Kafka + Filebeat + Elastalert.
This article is organized around two requirements:
- alerting on server logins outside working hours
- alerting on error keywords appearing in system logs
Architecture
Architecture diagram (image). The flow is: Filebeat ships logs into Kafka, Logstash consumes them from Kafka and writes to Amazon Elasticsearch Service, and Elastalert polls ES and fires the alerts.
服務選型
name version info
Amazon Elasticsearch Service v6.2 AWK官網部署教程
Logstash v6.2.3 選用與ES相同版本
Filebeat v6.2.3 選用與ES相同版本
Confluent(Kafka) v4.0 這裡推薦 Confluent 的版本,Confluent 是 kafka 作者 Neha Narkhede 從 Linkedin 出來之後聯合 LinkedIn 前員工建立的大資料公司,專注於 kafka 的企業應用。
Elastalert v0.1.29 原先考慮採用X-Pack但由於AWS目前還不支援
Deployment
Operating system used throughout this article: CentOS release 6.6
Filebeat
# Download the package
$ curl -L -O
# Install
$ sudo rpm -vi filebeat-6.2.3-x86_64.rpm
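Once configured (see the Configuration section below), the service can be enabled and started. A minimal sketch, assuming the SysV init script that the RPM installs on CentOS 6:
# Enable at boot and start now (CentOS 6, SysV init)
$ sudo chkconfig filebeat on
$ sudo service filebeat start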
Logstash
# Import the repository GPG key
$ rpm --import
$ cat <<EOF > /etc/yum.repos.d/logstash.repo
[logstash-6.x]
name=Elastic repository for 6.x packages
baseurl=
gpgcheck=1
gpgkey=
enabled=1
autorefresh=1
type=rpm-md
EOF
# Install
yum install logstash -y
Elastalert
# Install directly with pip
$ pip install elastalert
# If dependency errors occur, the following development packages are commonly required
$ yum install -y zlib openssl openssl-devel gcc gcc-c++ Xvfb libXfont Xorg libffi libffi-devel python-cffi python-devel libxslt-devel libxml2-devel zlib-devel bzip2-devel xz-libs wget
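Elastalert keeps its own state (silences, alert history) in ES metadata indices. The pip package ships a helper that creates them; run it once and enter the Amazon ES endpoint when prompted:
# Create the elastalert_status and related metadata indices
$ elastalert-create-index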
Configuration
Filebeat
/etc/filebeat/filebeat.yml
filebeat.config:
  prospectors:
    path: /etc/filebeat/conf/*.yml
    reload.enabled: true
    reload.period: 10s
output.kafka:
  # kafkaNode is the host running the Kafka service
  hosts: ["kafkaNode:9092"]
  # The Kafka topic is taken from fields.out_topic
  topic: "%{[fields][out_topic]}"
  partition.round_robin:
    reachable_only: false
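Before restarting Filebeat, the configuration can be sanity-checked with the test subcommand available in the 6.x CLI:
# Validate the main config file (prospector files under conf/ are loaded at runtime)
$ filebeat test config -c /etc/filebeat/filebeat.yml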
/etc/filebeat/conf/base.yml
# Collect system logs
- type: log
  paths:
    - /var/log/messages
    - /var/log/syslog*
  exclude_files: [".gz$"]
  exclude_lines: ["ssh_host_dsa_key"]
  tags: ["system_log"]
  scan_frequency: 1s
  fields:
    # Extra field used to identify the source client
    server_name: client01
    # Kafka topic
    out_topic: "system_log"
  multiline:
    pattern: '^\s'
    match: after
# Collect login logs
- type: log
  paths:
    - /var/log/secure*
    - /var/log/auth.log*
  tags: ["system_secure"]
  exclude_files: [".gz$"]
  scan_frequency: 1s
  fields:
    server_name: client01
    out_topic: "system_secure"
  multiline:
    pattern: '^\s'
    match: after
Logstash
/etc/logstash/conf.d/system_log.conf
input {
  kafka {
    bootstrap_servers => "kafkaNode:9092"
    consumer_threads => 3
    topics => ["system_log"]
    auto_offset_reset => "latest"
    codec => "json"
  }
}
filter {
  # Drop Logstash's own log to avoid a feedback loop
  if [source] == "/var/log/logstash-stdout.log" {
    drop {}
  }
  if [fields][out_topic] == "system_log" {
    date { match => [ "[system][syslog][timestamp]", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ] }
    grok {
      match => { "message" => ["%{SYSLOGTIMESTAMP:[system][syslog][timestamp]} %{SYSLOGHOST:[system][syslog][hostname]} %{DATA:[system][syslog][program]}(?:\[%{POSINT:[system][syslog][pid]}\])?: %{GREEDYMULTILINE:[system][syslog][message]}"] }
      pattern_definitions => { "GREEDYMULTILINE" => "(.|\n)*" }
      remove_field => "message"
    }
  }
}
output {
  elasticsearch {
    hosts => ["<Amazon ES endpoint>"]
    index => "%{[fields][out_topic]}_%{+YYYYMMdd}"
    document_type => "%{[@metadata][type]}"
  }
}
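A pipeline file can be syntax-checked before it is loaded; a quick check, assuming the RPM's default layout:
# Verify the pipeline without starting Logstash
$ /usr/share/logstash/bin/logstash --path.settings /etc/logstash --config.test_and_exit -f /etc/logstash/conf.d/system_log.conf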
/etc/logstash/conf.d/secure_log.conf
input {
  kafka {
    bootstrap_servers => "kafkaNode:9092"
    consumer_threads => 3
    topics => ["system_secure"]
    auto_offset_reset => "latest"
    codec => "json"
  }
}
filter {
  if [fields][out_topic] == "system_secure" {
    grok {
      match => { "message" => ["%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} sshd(?:\[%{POSINT:[system][auth][pid]}\])?: %{DATA:[system][auth][ssh][event]} %{DATA:[system][auth][ssh][method]} for (invalid user )?%{DATA:[system][auth][user]} from %{IPORHOST:[system][auth][ssh][ip]} port %{NUMBER:[system][auth][ssh][port]} ssh2(: %{GREEDYDATA:[system][auth][ssh][signature]})?",
                  "%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} sshd(?:\[%{POSINT:[system][auth][pid]}\])?: %{DATA:[system][auth][ssh][event]} user %{DATA:[system][auth][user]} from %{IPORHOST:[system][auth][ssh][ip]}",
                  "%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} sshd(?:\[%{POSINT:[system][auth][pid]}\])?: Did not receive identification string from %{IPORHOST:[system][auth][ssh][dropped_ip]}",
                  "%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} sudo(?:\[%{POSINT:[system][auth][pid]}\])?: \s*%{DATA:[system][auth][user]} :( %{DATA:[system][auth][sudo][error]} ;)? TTY=%{DATA:[system][auth][sudo][tty]} ; PWD=%{DATA:[system][auth][sudo][pwd]} ; USER=%{DATA:[system][auth][sudo][user]} ; COMMAND=%{GREEDYDATA:[system][auth][sudo][command]}",
                  "%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} groupadd(?:\[%{POSINT:[system][auth][pid]}\])?: new group: name=%{DATA:[system][auth][groupadd][name]}, GID=%{NUMBER:[system][auth][groupadd][gid]}",
                  "%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} useradd(?:\[%{POSINT:[system][auth][pid]}\])?: new user: name=%{DATA:[system][auth][user][add][name]}, UID=%{NUMBER:[system][auth][user][add][uid]}, GID=%{NUMBER:[system][auth][user][add][gid]}, home=%{DATA:[system][auth][user][add][home]}, shell=%{DATA:[system][auth][user][add][shell]}$",
                  "%{SYSLOGTIMESTAMP:[system][auth][timestamp]} %{SYSLOGHOST:[system][auth][hostname]} %{DATA:[system][auth][program]}(?:\[%{POSINT:[system][auth][pid]}\])?: %{GREEDYMULTILINE:[system][auth][message]}"] }
      pattern_definitions => { "GREEDYMULTILINE" => "(.|\n)*" }
      remove_field => "message"
    }
  }
}
output {
  elasticsearch {
    hosts => ["<Amazon ES endpoint>"]
    index => "%{[fields][out_topic]}_%{+YYYYMMdd}"
    document_type => "%{[@metadata][type]}"
  }
}
Kafka
# Import the repository GPG key
rpm --import
cat <<EOF > /etc/yum.repos.d/confluent.repo
[Confluent.dist]
name=Confluent repository (dist)
baseurl=
gpgcheck=1
gpgkey=
enabled=1
[Confluent]
name=Confluent repository
baseurl=
gpgcheck=1
gpgkey=
enabled=1
EOF
yum install confluent-platform-oss-2.11
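With the packages in place, ZooKeeper and the broker can be started and the two topics pre-created. A sketch using the command wrappers the Confluent packages install under /usr/bin; the partition and replication counts are placeholders to adjust:
# Start ZooKeeper and the Kafka broker
$ zookeeper-server-start -daemon /etc/kafka/zookeeper.properties
$ kafka-server-start -daemon /etc/kafka/server.properties
# Pre-create the topics Filebeat will publish to (sizes are placeholders)
$ kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic system_log
$ kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic system_secure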
Elastalert
Elastalert can be deployed on any server that can reach ES. In the rule files below, modules.eagle_post.EagleAlerter and blacklist_v2 are custom modifications, described later in this article.
rules/system_log.yaml
es_host: <Amazon ES endpoint>
es_port: 80
name: system log rule
type: blacklist_v2
index: system_log*
timeframe:
  minutes: 1
# Field to monitor
compare_key: system.syslog.message
# Alert when any of the following keywords appears; extend as needed
blacklist_v2:
- "ERROR"
- "error"
alert: "modules.eagle_post.EagleAlerter"
eagle_post_url: "<eagle>"
eagle_post_all_values: False
eagle_post_payload:
  server: "fields.server_name"
  info: "system.syslog.message"
  source: "source"
rules/system_secure.yaml
es_host: <Amazon ES endpoint>
es_port: 80
name: system secure rule
type: frequency
index: system_secure*
num_events: 1
timeframe:
  minutes: 1
filter:
- query:
    wildcard:
      system.auth.user: "*"
alert: "modules.eagle_post.EagleAlerter"
eagle_post_url: "<eagle>"
eagle_post_all_values: False
# Working hours; logins inside this window are not alerted on
eagle_time_start: "09:00"
eagle_time_end: "18:00"
eagle_post_payload:
  user: "system.auth.user"
  server: "fields.server_name"
  ip: "system.auth.ssh.ip"
  event: "system.auth.ssh.event"
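Both rules can be dry-run against ES before going live, using the elastalert-test-rule helper installed with the pip package:
# Run a rule against recent data without actually sending alerts
$ elastalert-test-rule rules/system_log.yaml
$ elastalert-test-rule rules/system_secure.yaml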
Elastalert
Custom type and alerter
To feed alerts into Eagle (our in-house unified API platform), I first tried the built-in http_post alert type, but found no way to pass fields from the ES result as POST parameters. So I made small modifications and added a custom alerter and rule type, which plug into Eagle seamlessly.
Alert
modules/eagle_post.py
Save the module into site-packages/elastalert:
import json
import requests
import dateutil.parser
import datetime

from elastalert.alerts import Alerter
from elastalert.util import EAException
from elastalert.util import elastalert_logger
from elastalert.util import lookup_es_key


class EagleAlerter(Alerter):
    def __init__(self, rule):
        super(EagleAlerter, self).__init__(rule)
        # Time window during which alerts are suppressed (working hours)
        self.post_time_start = self.rule.get('eagle_time_start', '00:00')
        self.post_time_end = self.rule.get('eagle_time_end', '00:00')
        # POST endpoint
        self.post_url = self.rule.get('eagle_post_url', '')
        self.post_payload = self.rule.get('eagle_post_payload', {})
        self.post_static_payload = self.rule.get('eagle_post_static_payload', {})
        self.post_all_values = self.rule.get('eagle_post_all_values', False)
        self.post_lock = False

    def alert(self, matches):
        if not self.post_url:
            elastalert_logger.info('Please input eagle url!')
            return False
        for match in matches:
            # Start from the full match document if requested
            payload = match if self.post_all_values else {}
            # Build the payload dict from the configured ES fields
            for post_key, es_key in self.post_payload.items():
                payload[post_key] = lookup_es_key(match, es_key)
            # Current time and the configured window boundaries
            login_time = datetime.datetime.now().time()
            time_start = dateutil.parser.parse(self.post_time_start).time()
            time_end = dateutil.parser.parse(self.post_time_end).time()
            # Only alert outside the window
            self.post_lock = not (time_start < login_time < time_end)
            # Merge static and dynamic payloads (copy, so the static dict is not mutated)
            data = dict(self.post_static_payload)
            data.update(payload)
            # Send the alert
            if self.post_lock:
                myRequests = requests.Session()
                myRequests.post(url=self.post_url, data=data, verify=False)
                elastalert_logger.info("[-] eagle alert sent.")
            else:
                elastalert_logger.info("[*] nothing to do.")

    def get_info(self):
        return {'type': 'http_post'}
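The alerter also reads eagle_post_static_payload, which the rules above do not use: constant key/value pairs merged into every POST. A hypothetical rule snippet (the field names env and alert_group are made up for illustration):
eagle_post_static_payload:
  # Constants attached to every alert; names here are hypothetical
  env: "production"
  alert_group: "ops"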
Type
While using the built-in blacklist type, I found that it only does exact matching. To make rule files easier to write, I modified it slightly to match substrings.
elastalert/ruletypes.py
# Addition
class BlacklistV2Rule(CompareRule):
    required_options = frozenset(['compare_key', 'blacklist_v2'])

    def __init__(self, rules, args=None):
        super(BlacklistV2Rule, self).__init__(rules, args=None)
        self.expand_entries('blacklist_v2')

    def compare(self, event):
        term = lookup_es_key(event, self.rules['compare_key'])
        if term is None:
            return False
        # Loop over the configured keywords; substring matching costs some
        # performance, but it will do until a better solution is found
        for i in self.rules['blacklist_v2']:
            if i in term:
                return True
        return False
elastalert/config.py
# Addition
rules_mapping = {
    'frequency': ruletypes.FrequencyRule,
    'any': ruletypes.AnyRule,
    'spike': ruletypes.SpikeRule,
    'blacklist': ruletypes.BlacklistRule,
    'blacklist_v2': ruletypes.BlacklistV2Rule,
    'whitelist': ruletypes.WhitelistRule,
    'change': ruletypes.ChangeRule,
    'flatline': ruletypes.FlatlineRule,
    'new_term': ruletypes.NewTermsRule,
    'cardinality': ruletypes.CardinalityRule,
    'metric_aggregation': ruletypes.MetricAggregationRule,
    'percentage_match': ruletypes.PercentageMatchRule,
}
elastalert/schema.yaml
# Addition
- title: BlacklistV2
  required: [blacklist_v2, compare_key]
  properties:
    type: {enum: [blacklist_v2]}
    compare_key: {'items': {'type': 'string'}, 'type': ['string', 'array']}
    blacklist_v2: {type: array, items: {type: string}}
Packaging into Docker
A simple Dockerfile for reference:
FROM python:2.7-alpine
ENV SITE_PACKAGES /usr/local/lib/python2.7/site-packages/elastalert
WORKDIR /opt/elastalert
RUN apk update && \
    apk add gcc ca-certificates openssl-dev openssl libffi-dev musl-dev tzdata openntpd && \
    pip install elastalert && \
    cp -rf /usr/share/zoneinfo/Asia/Taipei /etc/localtime
COPY ./ /opt/elastalert
CMD ["/opt/elastalert/start.sh"]
start.sh
#!/bin/sh
SITE_PATH=/usr/local/lib/python2.7/site-packages/elastalert
CONFIG=/opt/elastalert/config/config.yaml
MODULES=/opt/elastalert/modules
# Copy the custom alerter module into the installed package
if [ -d "${MODULES}" ]
then
    cp -rf ${MODULES} ${SITE_PATH}
    echo "[-] Copy ${MODULES} to ${SITE_PATH}"
fi
# Overlay the patched ruletypes.py, config.py and schema.yaml
cp -rf elastalert/* ${SITE_PATH}/
echo "[-] Copy elastalert/* to ${SITE_PATH}"
python -m elastalert.elastalert --verbose --config ${CONFIG}
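Building and running the image looks roughly like this; the image name and host config path below are made-up examples:
# Build the image and run it with the rule/config directory mounted in
$ docker build -t elastalert-eagle .
$ docker run -d --name elastalert -v /srv/elastalert/config:/opt/elastalert/config elastalert-eagle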
With this groundwork in place, the image was added to the Bee container management platform (in-house) for automated builds.
Results
Screenshot of the alert results (image)
Pitfalls
ZooKeeper
Problem description
Older Kafka versions depend on ZooKeeper, and a default installation registers the broker address as localhost, which produces the following symptoms:
Filebeat error log:
2018-04-25T09:14:55.590+0800 INFO kafka/log.go:36 client/metadata fetching metadata for [[[system_log] kafkaNode:9092]] from broker %!s(MISSING)
2018-04-25T09:14:55.591+0800 INFO kafka/log.go:36 producer/broker/[[0]] starting up
2018-04-25T09:14:55.591+0800 INFO kafka/log.go:36 producer/broker/[[0 %!d(string=system_log) 0]] state change to [open] on %!s(MISSING)/%!d(MISSING)
2018-04-25T09:14:55.591+0800 INFO kafka/log.go:36 producer/leader/[[system_log %!s(int32=0) %!s(int32=0)]]/%!d(MISSING) selected broker %!d(MISSING)
2018-04-25T09:14:55.591+0800 INFO kafka/log.go:36 producer/leader/[[system_secure %!s(int32=0) %!s(int=3)]]/%!d(MISSING) state change to [retrying-%!d(MISSING)]
2018-04-25T09:14:55.591+0800 INFO kafka/log.go:36 producer/leader/[[system_secure %!s(int32=0) %!s(int32=0)]]/%!d(MISSING) abandoning broker %!d(MISSING)
2018-04-25T09:14:55.592+0800 INFO kafka/log.go:36 producer/broker/[[0]] shut down
2018-04-25T09:14:55.592+0800 INFO kafka/log.go:36 Failed to connect to broker [[localhost:9092 dial tcp [::1]:9092: getsockopt: connection refused]]: %!s(MISSING)
Two addresses appear in the log: the configured Kafka address, and a localhost address.
This is because Filebeat does establish its initial connection to kafkaNode, but the broker metadata it then fetches advertises the address registered in ZooKeeper, localhost:9092, which Filebeat cannot reach.
Solution
# In zookeeper-shell, inspect the broker registration
# get /brokers/ids/0
{"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT"},"endpoints":["PLAINTEXT://localhost:9092"],"jmx_port":-1,"host":"localhost","timestamp":"1523429158364","port":9092,"version":4}
cZxid = 0x1d
ctime = Wed Apr 11 14:45:58 CST 2018
mZxid = 0x1d
mtime = Wed Apr 11 14:45:58 CST 2018
pZxid = 0x1d
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x162b374170d0000
dataLength = 188
numChildren = 0
# The registered address is localhost; change it
set /brokers/ids/0 {"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT"},"endpoints":["PLAINTEXT://kafkaNode:9092"],"jmx_port":9999,"host":"kafkaNode","timestamp":"1523429158364","port":9092,"version":4}
Restart the services after the change, and the problem is resolved.
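Note that the broker re-registers this znode every time it starts, so editing it by hand only lasts until the next restart. The durable fix, as far as I know, is to have the broker advertise a reachable address in its own config:
# /etc/kafka/server.properties: advertise an address clients can reach
advertised.listeners=PLAINTEXT://kafkaNode:9092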