IoT 邊緣叢集基於 Kubernetes Events 的告警通知實現

東風微鳴發表於2023-02-16

背景

邊緣叢集(基於 樹莓派 + K3S) 需要實現基本的告警功能。

邊緣叢集限制

  1. CPU/記憶體/儲存 資源緊張,無法支撐至少需要 2GB 以上記憶體和大量儲存的基於 Prometheus 的完整監控體系方案(即使是基於 Prometheus Agent, 也無法支撐) (需要避免額外的儲存和計算資源消耗)
  2. 網路條件,無法支撐監控體系,因為監控體系一般都需要每 1min 定時(或每時每刻)傳輸資料,且資料量不小;
    1. 存在 5G 收費網路的情況,且訪問的目的端地址需要開通許可權,且按照流量收費,且因為 5G 網路條件,網路傳輸能力受限,且不穩定(可能會在一段時間內離線);

關鍵需求

總結下來,關鍵需求如下:

  1. 實現對邊緣叢集異常的及時告警,需要知道邊緣叢集正在發生的異常情況;
  2. 網路:網路條件情況較差,網路流量少,只只能開通極少數目的端地址,可以容忍網路不穩定(一段時間內離線)的情況;
  3. 資源:需要儘量避免額外的儲存和計算資源消耗

方案

綜上所訴,採用如下方案實現:

基於 Kubernetes Events 的告警通知

架構圖

kubernetes-events-arch

技術方案規劃

  1. 從 Kubernetes 的各項資源收集 Events, 如:
    1. pod
    2. node
    3. kubelet
    4. crd
    5. ...
  2. 透過 kubernetes-event-exporter 元件來實現對 Kubernetes Events 的收集;
  3. 只篩選 Warning 級別 Events 供告警通知(後續,條件可以進一步定義)
  4. 告警透過 飛書 webhook 等通訊工具進行傳送(後續,傳送渠道可以增加)

實施步驟

手動方式:

在邊緣叢集上,執行如下操作:

1. 建立 roles

如下:

cat << _EOF_ | kubectl apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: event-exporter-extra
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: monitoring
  name: event-exporter
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: event-exporter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
  - kind: ServiceAccount
    namespace: monitoring
    name: event-exporter
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: event-exporter-extra
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: event-exporter-extra
subjects:
  - kind: ServiceAccount
    namespace: kube-event-export
    name: event-exporter
_EOF_

2. 建立 kubernetes-event-exporter config

如下:

cat << _EOF_ | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: monitoring
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"      
        - drop:
            - type: "Normal"
          match:
            - receiver: "feishu"                     
    receivers:
      - name: "dump"
        stdout: {}
      - name: "feishu"
        webhook:
          endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
          headers:
            Content-Type: application/json
          layout:
            msg_type: interactive
            card:
              config:
                wide_screen_mode: true
                enable_forward: true
              header:
                title:
                  tag: plain_text
                  content: XXX IoT K3S 叢集告警
                template: red
              elements:
                - tag: div
                  text: 
                    tag: lark_md
                    content: "**EventType:**  {{ .Type }}\n**EventKind:**  {{ .InvolvedObject.Kind }}\n**EventReason:**  {{ .Reason }}\n**EventTime:**  {{ .LastTimestamp }}\n**EventMessage:**  {{ .Message }}"
      
_EOF_

? 注意:

  • endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..." 按需修改為對應的 webhook endpoint, ❌切記勿對外公佈!!!
  • content: XXX IoT K3S 叢集告警: 按需調整為方便快速識別的名稱,如:"家裡測試 K3S 叢集告警"

3. 建立 Deployment

cat << _EOF_ | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: event-exporter
      version: v1
  template:
    metadata:
      labels:
        app: event-exporter
        version: v1
    spec:
      volumes:
        - name: cfg
          configMap:
            name: event-exporter-cfg
            defaultMode: 420
        - name: localtime
          hostPath:
            path: /etc/localtime
            type: ''
        - name: zoneinfo
          hostPath:
            path: /usr/share/zoneinfo
            type: ''
      containers:
        - name: event-exporter
          image: ghcr.io/opsgenie/kubernetes-event-exporter:v0.11
          args:
            - '-conf=/data/config.yaml'
          env:
            - name: TZ
              value: Asia/Shanghai
          volumeMounts:
            - name: cfg
              mountPath: /data
            - name: localtime
              readOnly: true
              mountPath: /etc/localtime
            - name: zoneinfo
              readOnly: true
              mountPath: /usr/share/zoneinfo
          imagePullPolicy: IfNotPresent
      serviceAccount: event-exporter
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-role.kubernetes.io/controlplane
                    operator: In
                    values:
                      - 'true'
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: In
                    values:
                      - 'true'
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-role.kubernetes.io/master
                    operator: In
                    values:
                      - 'true'    
      tolerations:
        - key: node-role.kubernetes.io/controlplane
          value: 'true'
          effect: NoSchedule
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule      
_EOF_

? 說明:

  1. event-exporter-cfg 相關配置,是用於載入以 ConfigMap 形式儲存的配置檔案;
  2. localtime zoneinfo TZ 相關配置,是用於修改該 pod 的時區為Asia/Shanghai, 以使得最終顯示的通知效果為 CST 時區;
  3. affinity tolerations 相關配置,是為了確保:無論如何,優先排程到 master node 上去,按需調整,此處是因為 master 往往在邊緣叢集中作為閘道器存在,配置較高,且線上時間較長;

自動化部署

效果:安裝 K3S 時就自動部署

在 K3S server 所在節點,/var/lib/rancher/k3s/server/manifests/ 目錄(如果沒有該目錄就先建立)下,建立 event-exporter.yaml

---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: event-exporter-extra
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: monitoring
  name: event-exporter
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: event-exporter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
  - kind: ServiceAccount
    namespace: monitoring
    name: event-exporter
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: event-exporter-extra
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: event-exporter-extra
subjects:
  - kind: ServiceAccount
    namespace: kube-event-export
    name: event-exporter
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: monitoring
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"      
        - drop:
            - type: "Normal"
          match:
            - receiver: "feishu"                     
    receivers:
      - name: "dump"
        stdout: {}
      - name: "feishu"
        webhook:
          endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/dc4fd384-996b-4d20-87cf-45b3518869ec"
          headers:
            Content-Type: application/json
          layout:
            msg_type: interactive
            card:
              config:
                wide_screen_mode: true
                enable_forward: true
              header:
                title:
                  tag: plain_text
                  content: xxxK3S叢集告警
                template: red
              elements:
                - tag: div
                  text: 
                    tag: lark_md
                    content: "**EventType:**  {{ .Type }}\n**EventKind:**  {{ .InvolvedObject.Kind }}\n**EventReason:**  {{ .Reason }}\n**EventTime:**  {{ .LastTimestamp }}\n**EventMessage:**  {{ .Message }}"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: event-exporter
      version: v1
  template:
    metadata:
      labels:
        app: event-exporter
        version: v1
    spec:
      volumes:
        - name: cfg
          configMap:
            name: event-exporter-cfg
            defaultMode: 420
        - name: localtime
          hostPath:
            path: /etc/localtime
            type: ''
        - name: zoneinfo
          hostPath:
            path: /usr/share/zoneinfo
            type: ''
      containers:
        - name: event-exporter
          image: ghcr.io/opsgenie/kubernetes-event-exporter:v0.11
          args:
            - '-conf=/data/config.yaml'
          env:
            - name: TZ
              value: Asia/Shanghai
          volumeMounts:
            - name: cfg
              mountPath: /data
            - name: localtime
              readOnly: true
              mountPath: /etc/localtime
            - name: zoneinfo
              readOnly: true
              mountPath: /usr/share/zoneinfo
          imagePullPolicy: IfNotPresent
      serviceAccount: event-exporter
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-role.kubernetes.io/controlplane
                    operator: In
                    values:
                      - 'true'
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: In
                    values:
                      - 'true'
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-role.kubernetes.io/master
                    operator: In
                    values:
                      - 'true'    
      tolerations:
        - key: node-role.kubernetes.io/controlplane
          value: 'true'
          effect: NoSchedule
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule  

之後啟動 K3S 就會自動部署。

?️Reference:
自動部署 manifests 和 Helm charts | Rancher 文件

最終效果

如下圖:

image-20220413122040530

?️參考文件

三人行, 必有我師; 知識共享, 天下為公. 本文由東風微鳴技術部落格 EWhisper.cn 編寫.

相關文章