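Requirement:
Configure monitoring and alerting for the ELBv2 load balancers in each AWS region. Monitoring uses AWS CloudWatch; alerting uses AWS SNS.
Difficulty:
Each region has several ELBv2 load balancers, and each load balancer has listeners on ports 80 and 443. Four alarms (HealthyHostCount, UnHealthyHostCount, HTTP 5XX, and HTTP 4XX) have to be configured for every listener, so doing this by hand is a lot of work.
Solution:
Use the AWS CLI to create the alarms in bulk.
The AWS CLI commands involved are: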
aws elbv2 describe-load-balancers; //returns information about every load balancer in the current region
aws elbv2 describe-listeners --load-balancer-arn XXXXXXX; //returns the listeners of one load balancer
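Both commands print JSON, so their output can be parsed with the json module. The sketch below shows one way to pull out the fields the script relies on (LoadBalancerName, LoadBalancerArn, Port, TargetGroupArn). The sample payloads are trimmed, made-up data, and the helper names load_balancer_arns and listener_target_groups are hypothetical, not part of the original script.

```python
import json

# Trimmed, made-up sample of `aws elbv2 describe-load-balancers` output.
SAMPLE_LBS = """
{"LoadBalancers": [
  {"LoadBalancerName": "demo-lb",
   "LoadBalancerArn": "arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/app/demo-lb/50dc6c495c0c9188"}
]}
"""

# Trimmed, made-up sample of `aws elbv2 describe-listeners` output.
SAMPLE_LISTENERS = """
{"Listeners": [
  {"Port": 80,
   "DefaultActions": [{"Type": "forward",
     "TargetGroupArn": "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/demo-tg-80/73e2d6bc24d8a067"}]},
  {"Port": 443,
   "DefaultActions": [{"Type": "forward",
     "TargetGroupArn": "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/demo-tg-443/0123456789abcdef"}]}
]}
"""


def load_balancer_arns(lbs_json):
    """Map load balancer name -> load balancer ARN."""
    return {lb["LoadBalancerName"]: lb["LoadBalancerArn"]
            for lb in json.loads(lbs_json)["LoadBalancers"]}


def listener_target_groups(listeners_json):
    """Map listener port -> target group ARN of the first default action."""
    result = {}
    for listener in json.loads(listeners_json)["Listeners"]:
        arn = listener["DefaultActions"][0].get("TargetGroupArn")
        if arn is not None:  # skip actions that do not forward to a target group
            result[listener["Port"]] = arn
    return result


if __name__ == "__main__":
    print(load_balancer_arns(SAMPLE_LBS))
    print(listener_target_groups(SAMPLE_LISTENERS))
```

Parsing with json.loads instead of eval also copes with JSON literals such as true, false, and null, which are not valid Python expressions.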
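What needs to be configured:
- Each region needs its own SNS topic, and the script runs on the ops machine of that region, so for every run you log in to the right ops machine and change the SNS ARN to the one for that region, as in the following code: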
if __name__ == '__main__':
    # 1. Get the ARNs of the ELBv2 load balancers
    print("1. Getting the ARNs of all ELBv2 load balancers in the current region.")
    arndict = getAllArn()
    sns = "arn:aws:sns:ap-south-1:2617960:AWS_Alert_Ops"
    getSimpleArnAndTaggroup(arndict,sns)
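- The ops machine that runs the script needs IAM authorization; at the moment the ops machines in all regions have this permission.
- After logging in to the ops machine, run:
aws configure //press Enter to skip the key prompts and enter only the region of this ops machine. The regions Meitu currently uses are:
ap-south-1, //Mumbai
us-west-2, //Oregon
ap-southeast-1, //Singapore
ap-northeast-1, //Tokyo
- Finally, run the script: python aws_elb_add_alert.py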
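The per-region edits described above (SNS ARN and CLI region) could also be derived from a region name instead of being changed by hand. The following is a minimal sketch under several assumptions: the topic is called AWS_Alert_Ops in every region, the account ID 123456789012 is a placeholder, and the helper names sns_topic_arn and cli_prefix are hypothetical. It uses the AWS CLI's global --region option; whether one ops machine is actually allowed to manage other regions depends on the IAM setup, which the article does not state.

```python
# Regions listed in the article.
REGIONS = ["ap-south-1", "us-west-2", "ap-southeast-1", "ap-northeast-1"]


def sns_topic_arn(region, account_id, topic="AWS_Alert_Ops"):
    """Build the SNS topic ARN for one region (topic name is an assumption)."""
    return "arn:aws:sns:%s:%s:%s" % (region, account_id, topic)


def cli_prefix(region, awscli="/usr/bin/aws"):
    """Prefix for AWS CLI calls that pins the region explicitly."""
    return "%s --region %s" % (awscli, region)


if __name__ == "__main__":
    for region in REGIONS:
        # 123456789012 is a placeholder account ID, not a real one.
        print("%s -> %s" % (cli_prefix(region), sns_topic_arn(region, "123456789012")))
```

With helpers like these, the region would be the only value that changes between runs.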
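What each argument of the aws commands in the script means (see the AWS CLI documentation for more detail; the // notes below are annotations for the reader and are not part of the real command):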
def getHealthyHostCountComm(lbname,port,taggroup,lbarn,sns):
    comm = '''
    %s cloudwatch put-metric-alarm \
    --alarm-name 'AWS_ELB_%s_PORT_%s_HealthyHostCount' \ //name of the alarm
    --alarm-description 'aws elb HealthyHostCount' \ //description of the alarm
    --metric-name HealthyHostCount \ //metric to alarm on, e.g. HTTPCode_Target_4XX_Count, UnHealthyHostCount, HealthyHostCount
    --namespace AWS/ApplicationELB \ //namespace of the metric, e.g. AWS/ApplicationELB, AWS/RDS, AWS/EC2
    --statistic Maximum \ //statistic applied to the metric within each period
    --period 60 \ //length of each period, in seconds
    --threshold 0 \ //value the statistic is compared against
    --evaluation-periods 1 \ //number of most recent periods evaluated; total window = periods * period length
    --datapoints-to-alarm 1 \ //number of datapoints within those periods that must breach the threshold to trigger the alarm
    --comparison-operator LessThanOrEqualToThreshold \ //how the statistic is compared with the threshold: GreaterThanOrEqualToThreshold, GreaterThanThreshold, LessThanThreshold, LessThanOrEqualToThreshold
    --treat-missing-data notBreaching \ //how missing datapoints are treated: missing, notBreaching, breaching, ignore
    --alarm-actions '%s' \ //action taken when the alarm fires; here the ARN of the SNS topic
    --dimensions 'Name=TargetGroup,Value=targetgroup/%s' 'Name=LoadBalancer,Value=app/%s' ''' %(Contants['AWSCLI'],lbname,port,sns,taggroup,lbarn)
    return comm
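The interplay of --threshold, --comparison-operator, --evaluation-periods, and --datapoints-to-alarm is easier to see with a small simulation. The sketch below is a simplified model and not CloudWatch's actual evaluation logic: it looks at the most recent N datapoints, treats a missing datapoint (None) as not breaching, and reports an alarm when at least M of them breach. The function name is_alarming is hypothetical.

```python
import operator

# Comparison operators accepted by --comparison-operator in the script above.
COMPARATORS = {
    "GreaterThanOrEqualToThreshold": operator.ge,
    "GreaterThanThreshold": operator.gt,
    "LessThanThreshold": operator.lt,
    "LessThanOrEqualToThreshold": operator.le,
}


def is_alarming(datapoints, threshold, comparison,
                evaluation_periods, datapoints_to_alarm):
    """Simplified "M out of N" check.

    datapoints: one statistic value per period, oldest first;
                None stands for a missing datapoint (treated as notBreaching).
    """
    compare = COMPARATORS[comparison]
    window = datapoints[-evaluation_periods:]  # the N most recent periods
    breaching = sum(1 for value in window
                    if value is not None and compare(value, threshold))
    return breaching >= datapoints_to_alarm


if __name__ == "__main__":
    # HealthyHostCount: Maximum <= 0 in 1 of 1 periods.
    print(is_alarming([0], 0, "LessThanOrEqualToThreshold", 1, 1))
    # 3 of 5 one-minute sums at or above 10.
    print(is_alarming([12, 3, 15, None, 10], 10,
                      "GreaterThanOrEqualToThreshold", 5, 3))
```

In this model a single period with no healthy hosts is enough to trip the HealthyHostCount alarm, while a 3-of-5 setting tolerates isolated spikes.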
Final code:
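#!/usr/bin/python
# -*- coding: utf-8 -*-
# @Version : 1.0
# @Time : 2018/5/28 18:12
# @Author : *************
# @File : aws_elb_add_alert.py
# @Function: bulk-configure alarms for AWS ELB
# @Note : each region has its own ops machine, so run this script separately in every region, or use ansible
#           The commands module and dict.iteritems() exist only in Python 2, so run this with a Python 2 interpreter
import os,sys,commands,re,json
Contants = {
    "AWSCLI":'/usr/bin/aws',
    "AWSREGION":['ap-south-1','us-west-2','ap-southeast-1','ap-northeast-1'] # Mumbai, Oregon, Singapore, Tokyo
}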
# Auto-vivifying dict: reading a missing key creates a nested CreateDict for it
class CreateDict(dict):
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value
#########################################################################################################
# Alarm configuration
# HealthyHostCount: checked every minute. Alarms when the maximum number of healthy hosts
# in the period is less than or equal to 0, i.e. there were no healthy hosts at all.
def getHealthyHostCountComm(lbname,port,taggroup,lbarn,sns):
    comm = '''
    %s cloudwatch put-metric-alarm \
    --alarm-name 'AWS_ELB_%s_PORT_%s_HealthyHostCount' \
    --alarm-description 'aws elb HealthyHostCount' \
    --metric-name HealthyHostCount \
    --namespace AWS/ApplicationELB \
    --statistic Maximum \
    --period 60 \
    --threshold 0 \
    --evaluation-periods 1 \
    --datapoints-to-alarm 1 \
    --comparison-operator LessThanOrEqualToThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions '%s' \
    --dimensions 'Name=TargetGroup,Value=targetgroup/%s' 'Name=LoadBalancer,Value=app/%s' ''' %(Contants['AWSCLI'],lbname,port,sns,taggroup,lbarn)
    return comm
# UnHealthyHostCount: checked every five minutes. Alarms when the minimum number of unhealthy hosts
# in the period is greater than or equal to 1, i.e. at least one host was unhealthy for the whole period.
def getUnHealthyHostCountComm(lbname,port,taggroup,lbarn,sns):
    comm = '''
    %s cloudwatch put-metric-alarm \
    --alarm-name 'AWS_ELB_%s_PORT_%s_UnHealthyHostCount' \
    --alarm-description 'aws elb UnHealthyHostCount' \
    --metric-name UnHealthyHostCount \
    --namespace AWS/ApplicationELB \
    --statistic Minimum \
    --period 300 \
    --threshold 1 \
    --evaluation-periods 1 \
    --datapoints-to-alarm 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions '%s' \
    --dimensions 'Name=TargetGroup,Value=targetgroup/%s' 'Name=LoadBalancer,Value=app/%s' ''' %(Contants['AWSCLI'],lbname,port,sns,taggroup,lbarn)
    return comm
# HTTP_5XX: one-minute period, one period evaluated. Alarms when that single datapoint breaches,
# i.e. the number of target 5xx responses in the minute is 10 or more.
# This alarm uses only the LoadBalancer dimension, so taggroup is unused here.
def getHTTP_5XXComm(lbname,port,taggroup,lbarn,sns):
    comm = '''
    %s cloudwatch put-metric-alarm \
    --alarm-name 'AWS_ELB_%s_PORT_%s_HTTP_5XX' \
    --alarm-description 'aws elb http 5xx alert' \
    --metric-name HTTPCode_Target_5XX_Count \
    --namespace AWS/ApplicationELB \
    --statistic Sum \
    --period 60 \
    --threshold 10 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --treat-missing-data notBreaching \
    --evaluation-periods 1 \
    --datapoints-to-alarm 1 \
    --alarm-actions '%s' \
    --dimensions 'Name=LoadBalancer,Value=app/%s' '''%(Contants['AWSCLI'],lbname,port,sns,lbarn)
    return comm
# HTTP_4XX: one-minute period, five periods evaluated. Alarms when 3 of the 5 datapoints breach,
# i.e. the number of target 4xx responses in a minute is 10 or more.
# HTTPCode_Target_4XX_Count is a count, not a percentage, so the threshold is 10 responses
# and no --unit Percent is passed (a unit that does not match the metric's data would leave
# the alarm without datapoints to evaluate).
def getHTTP_4XXComm(lbname,port,taggroup,lbarn,sns):
    comm = '''
    %s cloudwatch put-metric-alarm \
    --alarm-name 'AWS_ELB_%s_PORT_%s_HTTP_4XX' \
    --alarm-description 'aws elb http 4xx alert' \
    --metric-name HTTPCode_Target_4XX_Count \
    --namespace AWS/ApplicationELB \
    --statistic Sum \
    --period 60 \
    --threshold 10 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --treat-missing-data notBreaching \
    --evaluation-periods 5 \
    --datapoints-to-alarm 3 \
    --alarm-actions '%s' \
    --dimensions 'Name=TargetGroup,Value=targetgroup/%s' 'Name=LoadBalancer,Value=app/%s' ''' %(Contants['AWSCLI'],lbname,port,sns,taggroup,lbarn)
    return comm
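# Run a shell command and return its output
def execCommand(comm):
    try:
        (exitstatus, outtext) = commands.getstatusoutput(comm)
        if exitstatus != 0:
            # Report failures instead of silently ignoring the exit status
            print("command exited with status %s" % exitstatus)
        return outtext
    except Exception as e:
        print(e)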
# Get the basic information of every ELBv2 load balancer in the current region
def getAllArn():
    comm1 = "%s elbv2 describe-load-balancers" % Contants['AWSCLI']
    # The CLI prints JSON, so parse it with json.loads rather than eval
    AllLb2Details = json.loads(execCommand(comm1))['LoadBalancers']
    arndict = CreateDict()
    for lb in AllLb2Details:
        lbarn = lb["LoadBalancerArn"]
        lbname = lb["LoadBalancerName"]
        arndict[lbname]["lbarn"] = lbarn
        lbgroup = arndict[lbname]["lbgroup"]  # created even if the load balancer has no usable listener
        comm2 = "%s elbv2 describe-listeners --load-balancer-arn %s" %(Contants['AWSCLI'],lbarn)
        alllisten = json.loads(execCommand(comm2))['Listeners']
        for listener in alllisten:
            # A default action that does not forward to a target group has no TargetGroupArn; skip it
            taggroup = listener['DefaultActions'][0].get('TargetGroupArn')
            if taggroup is None:
                continue
            port = listener['Port']
            lbgroup[port] = taggroup
    return json.dumps(arndict)
# Shorten the ARNs to the values the CloudWatch dimensions expect, then create the four alarms for every listener
def getSimpleArnAndTaggroup(arndict,sns):
    # arndict is the JSON string returned by getAllArn
    for lbname,lbvalue in json.loads(arndict).iteritems():
        lbarn = re.split(r'loadbalancer/app/',lbvalue['lbarn'])[-1]
        for port,taggroup in lbvalue.get('lbgroup', {}).iteritems():
            print("######################################################")
            taggroup = re.split(r':targetgroup/',taggroup)[-1]
            print("#####Configuring HealthyHostCountAlert#####")
            comm1 = getHealthyHostCountComm(lbname,port,taggroup,lbarn,sns)
            print(comm1)
            execCommand(comm1)
            print("#####Configuring UnHealthyHostCountAlert#####")
            comm2 = getUnHealthyHostCountComm(lbname,port,taggroup,lbarn,sns)
            print(comm2)
            execCommand(comm2)
            print("#####Configuring HTTP_5XX#####")
            comm3 = getHTTP_5XXComm(lbname,port,taggroup,lbarn,sns)
            print(comm3)
            execCommand(comm3)
            print("#####Configuring HTTP_4XX#####")
            comm4 = getHTTP_4XXComm(lbname,port,taggroup,lbarn,sns)
            print(comm4)
            execCommand(comm4)
if __name__ == '__main__':
    # 1. Get the ARNs of the ELBv2 load balancers
    print("1. Getting the ARNs of all ELBv2 load balancers in the current region.")
    arndict = getAllArn()
    sns = "arn:aws:sns:us-west-2:217608:AWS_Alert_Ops"
    getSimpleArnAndTaggroup(arndict,sns)
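The script's getSimpleArnAndTaggroup shortens each ARN before using it in --dimensions. As a standalone illustration of that step, the sketch below turns full ARNs into the app/... and targetgroup/... dimension values using plain string splitting instead of re.split. The ARNs are made-up examples with a placeholder account ID, and the helper names lb_dimension, tg_dimension, and dimensions_arg are hypothetical.

```python
def lb_dimension(lb_arn):
    """'...:loadbalancer/app/name/id' -> 'app/name/id'."""
    return lb_arn.split(":loadbalancer/", 1)[1]


def tg_dimension(tg_arn):
    """'...:targetgroup/name/id' -> 'targetgroup/name/id'."""
    return "targetgroup/" + tg_arn.split(":targetgroup/", 1)[1]


def dimensions_arg(lb_arn, tg_arn):
    """Build the value passed to --dimensions for a per-target-group alarm."""
    return "'Name=TargetGroup,Value=%s' 'Name=LoadBalancer,Value=%s'" % (
        tg_dimension(tg_arn), lb_dimension(lb_arn))


if __name__ == "__main__":
    # Made-up ARNs; 123456789012 is a placeholder account ID.
    lb = ("arn:aws:elasticloadbalancing:us-west-2:123456789012:"
          "loadbalancer/app/my-load-balancer/50dc6c495c0c9188")
    tg = ("arn:aws:elasticloadbalancing:us-west-2:123456789012:"
          "targetgroup/my-targets/73e2d6bc24d8a067")
    print(dimensions_arg(lb, tg))
```

Note that this sketch, like the script, assumes Application Load Balancers, whose ARNs contain loadbalancer/app/.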