Longhorn, Enterprise-Grade Cloud-Native Distributed Storage for Containers - Monitoring (Prometheus + AlertManager + Grafana)

Published by 為少 on 2021-08-24

This content is based on the official Longhorn 1.1.2 English documentation.

Table of Contents

  1. Set up Prometheus and Grafana to monitor Longhorn
  2. Integrate Longhorn metrics into the Rancher monitoring system
  3. Longhorn monitoring metrics
  4. Support for kubelet volume metrics
  5. Example Longhorn alert rules

Set up Prometheus and Grafana to monitor Longhorn

Overview

Longhorn natively exposes metrics in Prometheus text format on a REST endpoint at http://LONGHORN_MANAGER_IP:PORT/metrics.
For a description of all available metrics, see the Longhorn monitoring metrics section below.
You can use any collecting tool, such as Prometheus, Graphite, or Telegraf, to scrape these metrics, and then visualize the collected data with a tool such as Grafana.
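
As a quick sanity check, you can fetch the endpoint directly. Below is a minimal sketch, assuming the default longhorn-backend service in the longhorn-system namespace and its default port 9500 (verify both against your installation):

$ kubectl -n longhorn-system port-forward service/longhorn-backend 9500:9500 &
$ curl -s http://localhost:9500/metrics | head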

This document presents an example setup for monitoring Longhorn. The monitoring system uses Prometheus for collecting data and alerting, and Grafana for visualizing/dashboarding the collected data. At a high level, the monitoring system consists of:

  • The Prometheus server scrapes and stores time-series data from the Longhorn metrics endpoint. It is also responsible for generating alerts based on the configured rules and the collected data, and then sends those alerts to Alertmanager.
  • Alertmanager then manages those alerts, including silencing, inhibition, aggregation, and sending out notifications via methods such as email, on-call notification systems, and chat platforms.
  • Grafana queries the Prometheus server for data and draws dashboards for visualization.

The figure below depicts the detailed architecture of the monitoring system.

There are 2 components in the figure above that have not been mentioned yet:

  • The Longhorn backend service is a service pointing to the set of Longhorn manager pods. Longhorn's metrics are exposed in the Longhorn manager pods at the endpoint http://LONGHORN_MANAGER_IP:PORT/metrics.
  • The Prometheus operator makes running Prometheus on top of Kubernetes very easy. The operator watches 3 custom resources: ServiceMonitor, Prometheus, and AlertManager. When users create those custom resources, the Prometheus operator deploys and manages the Prometheus server and AlertManager with the user-specified configuration.

Installation

This instruction installs all components into the monitoring namespace. To install them into a different namespace, change the field namespace: OTHER_NAMESPACE.

Create the monitoring namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
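
Save each manifest in this guide to a file and apply it with kubectl. For example, assuming the namespace manifest above is saved as namespace.yaml (the file name is arbitrary):

$ kubectl apply -f namespace.yaml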

Install the Prometheus Operator

Deploy the Prometheus Operator together with its required ClusterRole, ClusterRoleBinding, and ServiceAccount:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-operator
subjects:
- kind: ServiceAccount
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
rules:
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
- apiGroups:
  - apiextensions.k8s.io
  resourceNames:
  - alertmanagers.monitoring.coreos.com
  - podmonitors.monitoring.coreos.com
  - prometheuses.monitoring.coreos.com
  - prometheusrules.monitoring.coreos.com
  - servicemonitors.monitoring.coreos.com
  - thanosrulers.monitoring.coreos.com
  resources:
  - customresourcedefinitions
  verbs:
  - get
  - update
- apiGroups:
  - monitoring.coreos.com
  resources:
  - alertmanagers
  - alertmanagers/finalizers
  - prometheuses
  - prometheuses/finalizers
  - thanosrulers
  - thanosrulers/finalizers
  - servicemonitors
  - podmonitors
  - prometheusrules
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - delete
- apiGroups:
  - ""
  resources:
  - services
  - services/finalizers
  - endpoints
  verbs:
  - get
  - create
  - update
  - delete
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/component: controller
        app.kubernetes.io/name: prometheus-operator
        app.kubernetes.io/version: v0.38.3
    spec:
      containers:
      - args:
        - --kubelet-service=kube-system/kubelet
        - --logtostderr=true
        - --config-reloader-image=jimmidyson/configmap-reload:v0.3.0
        - --prometheus-config-reloader=quay.io/prometheus-operator/prometheus-config-reloader:v0.38.3
        image: quay.io/prometheus-operator/prometheus-operator:v0.38.3
        name: prometheus-operator
        ports:
        - containerPort: 8080
          name: http
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          allowPrivilegeEscalation: false
      nodeSelector:
        beta.kubernetes.io/os: linux
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccountName: prometheus-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http
    port: 8080
    targetPort: http
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
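
After applying the manifests above, you can confirm that the operator is up before moving on:

$ kubectl -n monitoring rollout status deployment/prometheus-operator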

Install the Longhorn ServiceMonitor

The Longhorn ServiceMonitor has a label selector app: longhorn-manager to select the Longhorn backend service.
Later on, the Prometheus CRD can include the Longhorn ServiceMonitor so that the Prometheus server can discover all Longhorn manager pods and their endpoints.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: monitoring
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager

Install and configure Prometheus AlertManager

  1. Create a highly available Alertmanager deployment with 3 instances:

    apiVersion: monitoring.coreos.com/v1
    kind: Alertmanager
    metadata:
      name: longhorn
      namespace: monitoring
    spec:
      replicas: 3
    
  2. The Alertmanager instances will not start unless a valid configuration is given. For more explanation of Alertmanager configuration, see the official Alertmanager documentation. The code below gives an example configuration:

    global:
      resolve_timeout: 5m
    route:
      group_by: [alertname]
      receiver: email_and_slack
    receivers:
    - name: email_and_slack
      email_configs:
      - to: <the email address to send notifications to>
        from: <the sender address>
        smarthost: <the SMTP host through which emails are sent>
        # SMTP authentication information.
        auth_username: <the username>
        auth_identity: <the identity>
        auth_password: <the password>
        headers:
          subject: 'Longhorn-Alert'
        text: |-
          {{ range .Alerts }}
            *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
            *Description:* {{ .Annotations.description }}
            *Details:*
            {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
            {{ end }}
          {{ end }}
      slack_configs:
      - api_url: <the Slack webhook URL>
        channel: <the channel or user to send notifications to>
        text: |-
          {{ range .Alerts }}
            *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
            *Description:* {{ .Annotations.description }}
            *Details:*
            {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
            {{ end }}
          {{ end }}
    

    Save the Alertmanager configuration above in a file named alertmanager.yaml and use kubectl to create a secret from it.

    The Alertmanager instance requires the secret resource to be named following the format alertmanager-{ALERTMANAGER_NAME}.
    In the previous step, the name of the Alertmanager is longhorn, so the secret name must be alertmanager-longhorn:

    $ kubectl create secret generic alertmanager-longhorn --from-file=alertmanager.yaml -n monitoring
    
  3. To be able to view the web UI of Alertmanager, expose it through a Service. A simple way is to use a Service of type NodePort:

    apiVersion: v1
    kind: Service
    metadata:
      name: alertmanager-longhorn
      namespace: monitoring
    spec:
      type: NodePort
      ports:
      - name: web
        nodePort: 30903
        port: 9093
        protocol: TCP
        targetPort: web
      selector:
        alertmanager: longhorn
    

    After creating the above service, you can access the web UI of Alertmanager via a node's IP and the port 30903.

    Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of Alertmanager over a TLS connection, as sketched below.
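
    The snippet below is a minimal sketch of such an Ingress, assuming an NGINX ingress controller is installed; the host alertmanager.example.com and the TLS secret alertmanager-tls are placeholders to replace with your own values, and the Service above is assumed to have been changed to type ClusterIP:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: alertmanager-longhorn
      namespace: monitoring
    spec:
      ingressClassName: nginx
      tls:
      - hosts:
        - alertmanager.example.com        # placeholder host
        secretName: alertmanager-tls      # placeholder TLS secret
      rules:
      - host: alertmanager.example.com
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: alertmanager-longhorn
                port:
                  number: 9093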

Install and configure the Prometheus server

  1. Create a PrometheusRule custom resource that defines the alert conditions:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: longhorn
        role: alert-rules
      name: prometheus-longhorn-rules
      namespace: monitoring
    spec:
      groups:
      - name: longhorn.rules
        rules:
        - alert: LonghornVolumeUsageCritical
          annotations:
            description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% used for
              more than 5 minutes.
            summary: Longhorn volume capacity is over 90% used.
          expr: 100 * (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) > 90
          for: 5m
          labels:
            issue: Longhorn volume {{$labels.volume}} usage on {{$labels.node}} is critical.
            severity: critical
    

    See https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#alerting-rules for more information on how to define alert rules.

  2. If RBAC authorization is activated, create a ClusterRole and ClusterRoleBinding for the Prometheus Pods:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
      namespace: monitoring
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: prometheus
      namespace: monitoring
    rules:
    - apiGroups: [""]
      resources:
      - nodes
      - services
      - endpoints
      - pods
      verbs: ["get", "list", "watch"]
    - apiGroups: [""]
      resources:
      - configmaps
      verbs: ["get"]
    - nonResourceURLs: ["/metrics"]
      verbs: ["get"]
    
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: monitoring
    
  3. Create a Prometheus custom resource. Notice that we select the Longhorn service monitor and the Longhorn rules in the spec:

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      replicas: 2
      serviceAccountName: prometheus
      alerting:
        alertmanagers:
          - namespace: monitoring
            name: alertmanager-longhorn
            port: web
      serviceMonitorSelector:
        matchLabels:
          name: longhorn-prometheus-servicemonitor
      ruleSelector:
        matchLabels:
          prometheus: longhorn
          role: alert-rules
    
  4. To be able to view the web UI of the Prometheus server, expose it through a Service. A simple way is to use a Service of type NodePort:

    apiVersion: v1
    kind: Service
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      type: NodePort
      ports:
      - name: web
        nodePort: 30904
        port: 9090
        protocol: TCP
        targetPort: web
      selector:
        prometheus: prometheus
    

    After creating the above service, you can access the web UI of the Prometheus server via a node's IP and the port 30904.

    At this point, you should be able to see all Longhorn manager targets as well as the Longhorn rules in the Targets and Rules sections of the Prometheus server UI.

    Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of the Prometheus server over a TLS connection, similar to the Alertmanager example above. You can also verify the setup from the command line, as shown below.
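
    To verify the scrape targets without opening the UI, you can also query the Prometheus HTTP API; a quick sketch, where NODE_IP is a placeholder for one of your node IPs:

    $ curl -s http://NODE_IP:30904/api/v1/targets | grep longhorn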

Install Grafana

  1. Create the Grafana datasource ConfigMap:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: grafana-datasources
      namespace: monitoring
    data:
      prometheus.yaml: |-
        {
            "apiVersion": 1,
            "datasources": [
                {
                   "access":"proxy",
                    "editable": true,
                    "name": "prometheus",
                    "orgId": 1,
                    "type": "prometheus",
                    "url": "http://prometheus:9090",
                    "version": 1
                }
            ]
        }
    
  2. Create the Grafana Deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: grafana
      namespace: monitoring
      labels:
        app: grafana
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: grafana
      template:
        metadata:
          name: grafana
          labels:
            app: grafana
        spec:
          containers:
          - name: grafana
            image: grafana/grafana:7.1.5
            ports:
            - name: grafana
              containerPort: 3000
            resources:
              limits:
                memory: "500Mi"
                cpu: "300m"
              requests:
                memory: "500Mi"
                cpu: "200m"
            volumeMounts:
              - mountPath: /var/lib/grafana
                name: grafana-storage
              - mountPath: /etc/grafana/provisioning/datasources
                name: grafana-datasources
                readOnly: false
          volumes:
            - name: grafana-storage
              emptyDir: {}
            - name: grafana-datasources
              configMap:
                  defaultMode: 420
                  name: grafana-datasources
    
  3. Expose Grafana on NodePort 32000:

    apiVersion: v1
    kind: Service
    metadata:
      name: grafana
      namespace: monitoring
    spec:
      selector:
        app: grafana
      type: NodePort
      ports:
        - port: 3000
          targetPort: 3000
          nodePort: 32000
    

    Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose Grafana over a TLS connection.

  4. Access the Grafana dashboard using any node IP on port 32000. The default credentials are:

    User: admin
    Pass: admin
    
  5. Install the Longhorn dashboard:

    Once inside Grafana, import the prebuilt Longhorn dashboard: https://grafana.com/grafana/dashboards/13032

    See https://grafana.com/docs/grafana/latest/reference/export_import/ for instructions on how to import a Grafana dashboard.

    After a successful import, you should see the Longhorn dashboard.

Integrate Longhorn metrics into the Rancher monitoring system

About the Rancher monitoring system

Using Rancher, you can monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments through integration with Prometheus, a leading open-source monitoring solution.

See https://rancher.com/docs/rancher/v2.x/en/monitoring-alerting/ for instructions on how to deploy/enable the Rancher monitoring system.

Longhorn 指標新增到 Rancher 監控系統

If you are using Rancher to manage your Kubernetes cluster and have already enabled Rancher monitoring, you can add Longhorn metrics to Rancher monitoring simply by deploying the following ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: longhorn-system
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager

Once the ServiceMonitor is created, Rancher will automatically discover all Longhorn metrics.

You can then set up Grafana dashboards for visualization.

Longhorn monitoring metrics

Volume

| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_volume_actual_size_bytes | The actual space used by each replica of the volume on the corresponding node | longhorn_volume_actual_size_bytes{node="worker-2",volume="testvol"} 1.1917312e+08 |
| longhorn_volume_capacity_bytes | The configured size of this volume, in bytes | longhorn_volume_capacity_bytes{node="worker-2",volume="testvol"} 6.442450944e+09 |
| longhorn_volume_state | The state of this volume: 1=creating, 2=attached, 3=detached, 4=attaching, 5=detaching, 6=deleting | longhorn_volume_state{node="worker-2",volume="testvol"} 2 |
| longhorn_volume_robustness | The robustness of this volume: 0=unknown, 1=healthy, 2=degraded, 3=faulted | longhorn_volume_robustness{node="worker-2",volume="testvol"} 1 |
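
As an illustration of how these metrics combine in PromQL, the sketch below queries the per-volume space usage percentage through the Prometheus HTTP API exposed earlier (NODE_IP is a placeholder for one of your node IPs):

$ curl -s 'http://NODE_IP:30904/api/v1/query' \
      --data-urlencode 'query=100 * longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes'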

Node

| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_node_status | The status of this node: 1=true, 0=false | longhorn_node_status{condition="ready",condition_reason="",node="worker-2"} 1 |
| longhorn_node_count_total | The total number of nodes in the Longhorn system | longhorn_node_count_total 4 |
| longhorn_node_cpu_capacity_millicpu | The maximum allocatable CPU on this node | longhorn_node_cpu_capacity_millicpu{node="worker-2"} 2000 |
| longhorn_node_cpu_usage_millicpu | The CPU usage on this node | longhorn_node_cpu_usage_millicpu{node="worker-2"} 186 |
| longhorn_node_memory_capacity_bytes | The maximum allocatable memory on this node | longhorn_node_memory_capacity_bytes{node="worker-2"} 4.031229952e+09 |
| longhorn_node_memory_usage_bytes | The memory usage on this node | longhorn_node_memory_usage_bytes{node="worker-2"} 1.833582592e+09 |
| longhorn_node_storage_capacity_bytes | The storage capacity of this node | longhorn_node_storage_capacity_bytes{node="worker-3"} 8.3987283968e+10 |
| longhorn_node_storage_usage_bytes | The used storage of this node | longhorn_node_storage_usage_bytes{node="worker-3"} 9.060941824e+09 |
| longhorn_node_storage_reservation_bytes | The storage reserved for other applications and the system on this node | longhorn_node_storage_reservation_bytes{node="worker-3"} 2.519618519e+10 |

Disk

| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_disk_capacity_bytes | The storage capacity of this disk | longhorn_disk_capacity_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 8.3987283968e+10 |
| longhorn_disk_usage_bytes | The used storage of this disk | longhorn_disk_usage_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 9.060941824e+09 |
| longhorn_disk_reservation_bytes | The storage reserved for other applications and the system on this disk | longhorn_disk_reservation_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 2.519618519e+10 |

Instance Manager

| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_instance_manager_cpu_usage_millicpu | The CPU usage of this Longhorn instance manager | longhorn_instance_manager_cpu_usage_millicpu{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 80 |
| longhorn_instance_manager_cpu_requests_millicpu | The requested CPU resources in Kubernetes of this Longhorn instance manager | longhorn_instance_manager_cpu_requests_millicpu{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 250 |
| longhorn_instance_manager_memory_usage_bytes | The memory usage of this Longhorn instance manager | longhorn_instance_manager_memory_usage_bytes{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 2.4072192e+07 |
| longhorn_instance_manager_memory_requests_bytes | The requested memory in Kubernetes of this Longhorn instance manager | longhorn_instance_manager_memory_requests_bytes{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 0 |

Manager

| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_manager_cpu_usage_millicpu | The CPU usage of this Longhorn manager | longhorn_manager_cpu_usage_millicpu{manager="longhorn-manager-5rx2n",node="worker-2"} 27 |
| longhorn_manager_memory_usage_bytes | The memory usage of this Longhorn manager | longhorn_manager_memory_usage_bytes{manager="longhorn-manager-5rx2n",node="worker-2"} 2.6144768e+07 |

Support for kubelet volume metrics

About kubelet volume metrics

The kubelet exposes the following metrics:

  1. kubelet_volume_stats_capacity_bytes
  2. kubelet_volume_stats_available_bytes
  3. kubelet_volume_stats_used_bytes
  4. kubelet_volume_stats_inodes
  5. kubelet_volume_stats_inodes_free
  6. kubelet_volume_stats_inodes_used

These metrics measure information related to the PVC filesystem inside the Longhorn block device.

They are different from the longhorn_volume_* metrics, which measure information specific to the Longhorn block device.

You can set up a monitoring system that scrapes the kubelet metrics endpoint to get the status of PVCs and to set up alerts for abnormal events, such as a PVC about to run out of storage space.

A popular monitoring setup is prometheus-operator/kube-prometheus-stack, which scrapes the kubelet_volume_stats_* metrics and provides dashboards and alert rules for them.
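
If you run the Prometheus setup from this document instead, the sketch below shows what such an alert might look like, assuming your Prometheus also scrapes the kubelet; the rule name and the 10% threshold are illustrative, not taken from the Longhorn docs:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-kubelet-volume-rules    # hypothetical name
  namespace: monitoring
spec:
  groups:
  - name: kubelet-volume.rules
    rules:
    - alert: PersistentVolumeClaimFillingUp
      annotations:
        description: PVC {{$labels.persistentvolumeclaim}} in namespace {{$labels.namespace}} has less than 10% free space left.
        summary: A PersistentVolumeClaim is almost full.
      # kubelet_volume_stats_* metrics are exposed by the kubelet (see the list above)
      expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.10
      for: 5m
      labels:
        severity: warning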

Longhorn CSI plugin support

v1.1.0 中,Longhorn CSI 外掛根據 CSI spec 支援 NodeGetVolumeStats RPC。

This allows the kubelet to query the Longhorn CSI plugin for the status of a PVC.

The kubelet then exposes that information in the kubelet_volume_stats_* metrics.

Example Longhorn alert rules

We provide a couple of example Longhorn alert rules below for your reference. See the Longhorn monitoring metrics section above for a list of all available Longhorn metrics and build your own alert rules.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-longhorn-rules
  namespace: monitoring
spec:
  groups:
  - name: longhorn.rules
    rules:
    - alert: LonghornVolumeActualSpaceUsedWarning
      annotations:
        description: The actual space used by Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary: The actual used space of Longhorn volume is over 90% of the capacity.
      expr: (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) * 100 > 90
      for: 5m
      labels:
        issue: The actual used space of Longhorn volume {{$labels.volume}} on {{$labels.node}} is high.
        severity: warning
    - alert: LonghornVolumeStatusCritical
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} has been Faulted for
          more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Faulted
      expr: longhorn_volume_robustness == 3
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Faulted.
        severity: critical
    - alert: LonghornVolumeStatusWarning
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Degraded for
          more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Degraded
      expr: longhorn_volume_robustness == 2
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Degraded.
        severity: warning
    - alert: LonghornNodeStorageWarning
      annotations:
        description: The used storage of node {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary: The used storage of the node is over 70% of the capacity.
      expr: (longhorn_node_storage_usage_bytes / longhorn_node_storage_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornDiskStorageWarning
      annotations:
        description: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary: The used storage of the disk is over 70% of the capacity.
      expr: (longhorn_disk_usage_bytes / longhorn_disk_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornNodeDown
      annotations:
        description: There are {{$value}} Longhorn nodes which have been offline for more than 5 minutes.
        summary: Longhorn nodes are offline
      expr: longhorn_node_count_total - (count(longhorn_node_status{condition="ready"}==1) OR on() vector(0))
      for: 5m
      labels:
        issue: There are {{$value}} Longhorn nodes offline.
        severity: critical
    - alert: LonghornInstanceManagerCPUUsageWarning
      annotations:
        description: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has CPU usage / CPU request at {{$value}}% for
          more than 5 minutes.
        summary: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has CPU usage / CPU request over 300%.
      expr: (longhorn_instance_manager_cpu_usage_millicpu/longhorn_instance_manager_cpu_requests_millicpu) * 100 > 300
      for: 5m
      labels:
        issue: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} consumes 3 times the CPU request.
        severity: warning
    - alert: LonghornNodeCPUUsageWarning
      annotations:
        description: Longhorn node {{$labels.node}} has CPU usage / CPU capacity at {{$value}}% for
          more than 5 minutes.
        summary: Longhorn node {{$labels.node}} experiences high CPU pressure for more than 5m.
      expr: (longhorn_node_cpu_usage_millicpu / longhorn_node_cpu_capacity_millicpu) * 100 > 90
      for: 5m
      labels:
        issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
        severity: warning

See https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#alerting-rules for more information on how to define alert rules.
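
After applying a PrometheusRule, you can confirm that the Prometheus server has loaded the rule group through its HTTP API; a quick sketch against the NodePort service created earlier (NODE_IP is a placeholder):

$ curl -s http://NODE_IP:30904/api/v1/rules | grep longhorn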
