[Repost] SkyWalking alarm configuration

Posted by 紫陌花间客 on 2024-04-11

Original URL: https://www.cnblogs.com/richered/p/18129165

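When doing one's duty counts as an achievement, when basic morality counts as a virtue, and when everyone has lost the ability to judge values correctly, society is not far from collapse.

--- 《中國是部金融史》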

Reposted from: https://www.cnblogs.com/gaoyanbing/p/17867548.html

1. Overview

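SkyWalking alarming works by periodically polling the tracing data collected by skywalking-oap and checking it against the configured alarm rules (for example service response time or service response time percentiles). When a threshold is reached, the corresponding alarm message is sent. Alarm messages are delivered by calling webhook endpoints asynchronously from a thread pool. The webhook endpoint is user-defined, so any notification channel can be implemented behind it, such as DingTalk or email alerts. Alarm messages can also be viewed in RocketBot, the UI.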
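
To make the webhook step concrete, here is a minimal sketch of what a custom receiver might do with an incoming alarm batch. It assumes the request body is a JSON array of objects with fields such as `scope`, `name`, `ruleName`, `alarmMessage` and `startTime` (epoch milliseconds), which is how I understand the 8.x webhook format; check the payload your OAP actually sends. The function name `format_alarms` and the sample values are hypothetical.

```python
import json
from datetime import datetime, timezone


def format_alarms(body: str) -> list[str]:
    """Turn a webhook request body (a JSON array of alarms) into text lines."""
    lines = []
    for alarm in json.loads(body):
        # startTime is assumed to be epoch milliseconds.
        started = datetime.fromtimestamp(alarm["startTime"] / 1000, tz=timezone.utc)
        lines.append(
            f"[{alarm['scope']}] {alarm['name']} ({alarm['ruleName']}) "
            f"at {started:%Y-%m-%d %H:%M:%S} UTC: {alarm['alarmMessage']}"
        )
    return lines


# Hypothetical sample payload, for illustration only.
sample = json.dumps([{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "serviceA",
    "id0": "12",
    "id1": "",
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "Response time of service serviceA is more than 1000ms",
    "startTime": 1560524171000,
}])

for line in format_alarms(sample):
    print(line)
```

A real receiver would wrap this in an HTTP handler and forward each line to email, chat, or whatever channel is needed.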

The 8.4.0 release that I deployed in an earlier post supports the following alarm hooks:

Plain webhook
gRPCHook
Slack Chat Hook
WeChat Hook (WeChat alerts)
Dingtalk Hook (DingTalk alerts)
Feishu Hook (Feishu alerts)

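2. Alarm rules

2.1 Default alarm rules

In SkyWalking an alarm rule is called a rule. A default installation of the SkyWalking OAP server ships with a rule configuration file, alarm-settings.yml, in the config folder under the installation directory. The same is true when it runs in a container.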

# kubectl -n monitoring exec -it skywalking-oap-57d7f454f5-w4k4j -- bash
bash-5.0# pwd
/skywalking       
bash-5.0# cat config/alarm-settings.yml

The default alarm rule configuration file looks like this:

rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_avg_rule:
#    metrics-name: endpoint_avg
#    op: ">"
#    threshold: 1000
#    period: 10
#    count: 2
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/

2.2 Alarm rules in detail

Let's take one of the default rules and go through it:

rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.

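The comment at the top states that a rule name must be unique and must end with _rule; here it is service_resp_time_rule (service response time).

metrics-name: the metric to alarm on; the metric value must be a long, double or int
op: how the metric value is compared with the threshold; here, greater than
threshold: the threshold; here 1000, in milliseconds
period: the length of time over which the metric is evaluated, i.e. the alarm check window, in minutes
count: how many times the condition must be met within the period before an alarm is triggered
silence-period: how long identical alarms are suppressed; defaults to the same value as period. Put simply, if an alarm fires at time N, the same alarm is not fired again until N + silence-period has passed, similar to alert inhibition in Alertmanager
message: the alarm message body; variables in it are substituted automatically when the message is sent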
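
The way period, count and silence-period interact is easier to see in a small simulation. The sketch below is a simplified model of the semantics described above, not the OAP implementation: it assumes one metric value per minute, treats the silence period as a number of skipped checks, and uses the hypothetical function name `evaluate`.

```python
import operator
from collections import deque

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}


def evaluate(values, op=">", threshold=1000, period=10, count=3, silence_period=5):
    """Return the minute indexes at which an alarm would fire.

    values: one metric value per minute (e.g. average response time in ms).
    """
    compare = OPS[op]
    window = deque(maxlen=period)  # match flags for the last `period` minutes
    silent_left = 0                # checks still to skip after an alarm
    fired = []
    for minute, value in enumerate(values):
        window.append(compare(value, threshold))
        if silent_left > 0:
            silent_left -= 1
            continue
        if sum(window) >= count:
            fired.append(minute)
            silent_left = silence_period
    return fired


# Three slow minutes followed by twelve healthy ones.
print(evaluate([1200, 1300, 1500] + [200] * 12))  # -> [2, 8]
```

In this model the second alarm at minute 8 comes from the same three slow minutes: they are still inside the 10-minute window when the silence ends. A longer silence-period would suppress it.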

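Beyond these, there are also optional (advanced) rule settings.

With this, the meaning of every default rule listed above can be worked out, in order:

1 Average service response time exceeded 1 second in 3 of the last 10 minutes
2 Service success rate was below 80% in 2 of the last 10 minutes
3 Service response time percentiles (p50, p75, p90, p95, p99) exceeded 1 second in 3 of the last 10 minutes
4 Average service instance response time exceeded 1 second in 2 of the last 10 minutes
5 Average database access response time exceeded 1 second in 2 of the last 10 minutes
6 Average endpoint relation response time exceeded 1 second in 2 of the last 10 minutes
7 Average endpoint response time exceeded 1 second in 2 of the last 10 minutes
  This rule is disabled by default, with the note: because there are far more endpoints than services and instances, active endpoint metric alarms cost more memory than service and service instance metric alarms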
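
Two of the thresholds above are not plain millisecond values. The sketch below assumes that service_sla is stored in units of 1/10000 (so 8000 means 80%) and that the five comma-separated percentile thresholds map to p50, p75, p90, p95 and p99 in that order, as the default message suggests. The helper names and the observed values are hypothetical.

```python
PERCENTILES = ("p50", "p75", "p90", "p95", "p99")


def sla_percent(raw: int) -> float:
    """Convert a raw service_sla value (0..10000) into a percentage."""
    return raw / 100


def breached_percentiles(threshold: str, observed: dict[str, float]) -> list[str]:
    """List the percentiles whose observed value exceeds its own threshold."""
    limits = [int(part) for part in threshold.split(",")]
    return [name for name, limit in zip(PERCENTILES, limits)
            if observed[name] > limit]


print(sla_percent(8000))
print(breached_percentiles(
    "1000,1000,1000,1000,1000",
    {"p50": 300, "p75": 600, "p90": 950, "p95": 1200, "p99": 2500},
))
```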

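3. Custom alarm rules

Most of SkyWalking's configuration is set through application.yml and system environment variables. The following dynamic configuration sources are also supported:

gRPC service
Zookeeper
Etcd
Consul
Apollo
Nacos
k8s configmap

According to the SkyWalking dynamic configuration documentation, when dynamic configuration is enabled, the key alarm.default.alarm-settings overrides the default configuration file alarm-settings.yml.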

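This post covers a SkyWalking deployment based on k8s and helm, so custom configuration can be injected through k8s-configmap. The resulting section of the SkyWalking configuration file is shown below. It contains many variables; reading the chart shows that it already has logic to inject all of them depending on whether dynamic configuration is enabled, so there is no need to declare them in values.yaml.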

cluster:
  selector: ${SW_CLUSTER:standalone}
...
  kubernetes:
    namespace: ${SW_CLUSTER_K8S_NAMESPACE:default}
    labelSelector: ${SW_CLUSTER_K8S_LABEL:app=collector,release=skywalking}
    uidEnvName: ${SW_CLUSTER_K8S_UID:SKYWALKING_COLLECTOR_UID}
...
configuration:
  selector: ${SW_CONFIGURATION:k8s-configmap}
...
  k8s-configmap:
      # Sync period in seconds. Defaults to 60 seconds.
      period: ${SW_CONFIG_CONFIGMAP_PERIOD:60}
      # Which namespace is configmap deployed in.
      namespace: ${SW_CLUSTER_K8S_NAMESPACE:default}
      # Labelselector is used to locate specific configmap
      labelSelector: ${SW_CLUSTER_K8S_LABEL:app=collector,release=skywalking}
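
Each `${NAME:default}` placeholder in this file takes its value from the environment variable NAME and falls back to the text after the colon. The sketch below illustrates that substitution rule only. It is not the OAP's own resolver, which may handle cases (such as nesting) that this regular expression does not, and the function name `resolve` is hypothetical.

```python
import os
import re

# Matches ${NAME:default}; the default may be empty but cannot contain '}'.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):([^}]*)\}")


def resolve(text: str, env=None) -> str:
    """Replace each ${NAME:default} with env[NAME], or the default if unset."""
    env = os.environ if env is None else env
    return PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(2)), text)


line = "selector: ${SW_CONFIGURATION:k8s-configmap}"
print(resolve(line, env={}))                              # falls back to the default
print(resolve(line, env={"SW_CONFIGURATION": "nacos"}))   # environment wins
```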

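Webhook settings for the alarm backend can be added alongside the custom alarm rules. For the configmap syntax, see the official helm configmap example.

Here I only changed the default rule messages into Chinese and left every rule parameter as it was. I also added a DingTalk webhook. The steps are as follows.

Edit the chart's values.yaml to enable dynamic configuration: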

...
oap:
  name: oap
  dynamicConfigEnabled: true # enable dynamic configuration
...

Edit oap-configmap.yaml in the chart's templates to add the custom rules and the DingTalk webhook:

{{- if .Values.oap.dynamicConfigEnabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: skywalking-dynamic-config
  labels:
    app: {{ template "skywalking.name" . }}
    release: {{ .Release.Name }}
    component: {{ .Values.oap.name }}
data:
  alarm.default.alarm-settings: |-
    rules:
      # Rule unique name, must be ended with `_rule`.
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 5
        message: 最近3分鐘內服務 {name} 的平均響應時間超過1秒
      service_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_sla
        op: "<"
        threshold: 8000
        # The length of time to evaluate the metrics
        period: 10
        # How many times after the metrics match the condition, will trigger alarm
        count: 2
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 3
        message: 最近2分鐘內服務 {name} 的成功率低於80％
      service_resp_time_percentile_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_percentile
        op: ">"
        threshold: 1000,1000,1000,1000,1000
        period: 10
        count: 3
        silence-period: 5
        message: 最近3分鐘的服務 {name} 的響應時間百分比超過1秒
      service_instance_resp_time_rule:
        metrics-name: service_instance_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 5
        message: 最近2分鐘內服務例項 {name} 的平均響應時間超過1秒
      database_access_resp_time_rule:
        metrics-name: database_access_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        # message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
        message: 最近2分鐘內資料庫訪問 {name} 的平均響應時間超過1秒
      endpoint_relation_resp_time_rule:
        metrics-name: endpoint_relation_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        message: 最近2分鐘內端點關係 {name} 的平均響應時間超過1秒
    dingtalkHooks:
      textTemplate: |-
        {
          "msgtype": "text",
          "text": {
            "content": "SkyWalking 鏈路追蹤告警: \n %s."
          }
        }
      webhooks:
        - url: https://oapi.dingtalk.com/robot/send?access_token=<DingTalk robot access token>
          secret: <DingTalk robot signing secret>
{{- end }}
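
For readers who want to test a signed DingTalk robot independently of SkyWalking, the sketch below shows what the `secret` and `textTemplate` settings amount to. The signing follows DingTalk's documented scheme for signed custom robots (HMAC-SHA256 over the timestamp, a newline, and the secret, then Base64 and URL encoding). As far as I know the OAP does this itself when `secret` is set, so nothing here needs to be deployed. The token, secret, and function names are hypothetical, and the template filling assumes the alarm text contains no characters that would break the JSON.

```python
import base64
import hashlib
import hmac
import json
import urllib.parse

# Raw string so that \n stays a JSON escape, as in the YAML literal block.
TEMPLATE = r'''{
  "msgtype": "text",
  "text": {
    "content": "SkyWalking alarm: \n %s."
  }
}'''


def signed_url(base_url: str, secret: str, timestamp_ms: int) -> str:
    """Append the timestamp and sign parameters a signed robot expects."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(secret.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    return f"{base_url}&timestamp={timestamp_ms}&sign={sign}"


def render_body(text_template: str, alarm_message: str) -> dict:
    """Fill the single %s slot of the template and parse the result as JSON."""
    return json.loads(text_template % alarm_message)


# Placeholder token and secret, for illustration only.
url = signed_url("https://oapi.dingtalk.com/robot/send?access_token=TOKEN",
                 "SECdemo", 1700000000000)
print(url)
print(render_body(TEMPLATE, "service demo is slow"))
```

Posting the rendered body as JSON to the signed URL should deliver a test message, provided the timestamp is current when the request is made.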

After the changes, run helm to upgrade:

# ls                                                                                
skywalking
# helm -n monitoring upgrade skywalking skywalking --values ./skywalking/values.yaml
# helm -n monitoring list                                                           
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
skywalking      monitoring      3               2021-03-22 13:35:36.779541 +0800 CST    deployed        skywalking-4.0.0
# helm -n monitoring history skywalking                                             
REVISION        UPDATED                         STATUS          CHART                   APP VERSION     DESCRIPTION                                                                              
1               Sun Mar 21 17:45:34 2021        superseded      skywalking-4.0.0                        Install complete                                                                         
2               Mon Mar 22 13:35:36 2021        deployed        skywalking-4.0.0                        Upgrade complete

Watch the pod status until everything is running:

# kubectl -n monitoring get pods                               
NAME                              READY   STATUS      RESTARTS   AGE
elasticsearch-logging-0           1/1     Running     0          19h
elasticsearch-logging-1           1/1     Running     0          19h
elasticsearch-logging-2           1/1     Running     0          19h
skywalking-es-init-ktdcn          0/1     Completed   0          19h
skywalking-oap-7bbb775965-49895   1/1     Running     0          15s
skywalking-oap-7bbb775965-s89dz   1/1     Running     0          43s
skywalking-ui-698cdb4dbc-mjl2m    1/1     Running     0          19h

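4. Testing the alarm

To test the alarm, I asked a developer on the business team to add a simple URL to the project that delays for 5 seconds before returning.

Then request the application's /api/timeout with a browser or Postman to test it.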
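
For anyone without a business application at hand, a slow endpoint like this can be sketched with the standard library alone. This is only a stand-in under stated assumptions: the original endpoint lived in the project's own code base and was traced by a SkyWalking agent, which this sketch leaves out, so on its own it would not show up in SkyWalking. The helper names and the short demo delay are hypothetical.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(delay_seconds: float):
    class TimeoutHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/api/timeout":
                self.send_error(404)
                return
            time.sleep(delay_seconds)  # simulate the slow call
            body = b"done"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the demo quiet

    return TimeoutHandler


def start_server(delay_seconds: float, port: int = 0) -> ThreadingHTTPServer:
    """Serve /api/timeout on 127.0.0.1 in a background thread (port 0 = any free port)."""
    server = ThreadingHTTPServer(("127.0.0.1", port), make_handler(delay_seconds))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def timed_get(host: str, port: int, path: str) -> tuple[int, float]:
    """Return (HTTP status, elapsed seconds) for one GET request."""
    started = time.monotonic()
    conn = http.client.HTTPConnection(host, port, timeout=30)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return response.status, time.monotonic() - started
    finally:
        conn.close()
```

Calling `start_server(5)` and then `timed_get("127.0.0.1", server.server_address[1], "/api/timeout")` a few times per minute reproduces the request pattern used in the test described above.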

Check the trace in the SkyWalking UI.

Check the alarm page.

Check the alarm message in DingTalk.

That completes the alarm configuration in SkyWalking ~

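Addendum: notes taken at a SkyWalking online sharing session on how to use SkyWalking to locate problems:

Get the overall picture from the SkyWalking topology map
Monitoring and alerting: use metrics/tracing to confirm that a fault exists (alert on metrics, compare against tracing statistics)
Locate the fault: use the tracing call relationships to determine which service or endpoint it occurs in
Use profiling (a newer SkyWalking capability) or common traditional performance analysis methods to pin down the problem on a single node (for example CPU, memory, IO, network -> dynamic trace sampling -> flame graph); this resolves roughly 99.9% of problems

Appendix:

DingTalk hooks configuration: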

# Note: tee overwrites the target file; use tee -a to append to existing rules.
tee <your_skywalking_path>/config/alarm-settings.yml <<-'EOF'
dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking Alarm: \n %s."
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=<access_token>
      secret: <signing secret>
EOF
