This content is based on the official Longhorn 1.1.2 English documentation.
Series
- What is Longhorn?
- Longhorn, an enterprise-grade cloud-native distributed container storage solution: design, architecture, and concepts
- Longhorn, enterprise-grade cloud-native distributed container storage: deployment
- Longhorn, enterprise-grade cloud-native distributed container storage: Volumes and Nodes
- Longhorn, enterprise-grade cloud-native distributed container storage: K8S resource configuration examples
Table of Contents
- Setting up Prometheus and Grafana to monitor Longhorn
- Integrating Longhorn metrics into the Rancher monitoring system
- Longhorn monitoring metrics
- Support for Kubelet Volume metrics
- Longhorn alert rule examples

Setting up Prometheus and Grafana to monitor Longhorn
Overview
Longhorn natively exposes metrics in Prometheus text format on a REST endpoint, http://LONGHORN_MANAGER_IP:PORT/metrics.
See the Longhorn monitoring metrics section below for descriptions of all available metrics.
You can use any collecting tool, such as Prometheus, Graphite, or Telegraf, to scrape these metrics, then visualize the collected data with a tool such as Grafana.
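For a first look at the raw endpoint, you can port-forward one longhorn-manager pod and fetch the metrics directly. The pod name placeholder and port 9500 reflect a default Longhorn install and may need adjusting for your environment:

# Pick any longhorn-manager pod; 9500 is the manager's default listen port in a default install.
kubectl -n longhorn-system port-forward pod/<longhorn-manager-pod> 9500:9500 &
curl -s http://localhost:9500/metrics | grep ^longhorn_ | head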
This document presents an example setup for monitoring Longhorn. The monitoring system uses Prometheus for collecting data and alerting, and Grafana for visualizing/dashboarding the collected data. At a high level, the monitoring system consists of:
- The Prometheus server, which scrapes and stores time-series data from the Longhorn metrics endpoint. Prometheus is also responsible for generating alerts based on the configured rules and the collected data. The Prometheus server then sends alerts to the Alertmanager.
- The Alertmanager, which manages those alerts, including silencing, inhibition, aggregation, and sending out notifications via methods such as email, on-call notification systems, and chat platforms.
- Grafana, which queries the Prometheus server for data and draws dashboards for visualization.
The diagram below depicts the detailed architecture of the monitoring system.
There are 2 components in the picture that have not been mentioned yet:
- The Longhorn backend service is a service pointing to the set of Longhorn manager pods. Longhorn's metrics are exposed in the Longhorn manager pods at the endpoint http://LONGHORN_MANAGER_IP:PORT/metrics.
- The Prometheus Operator makes running Prometheus on top of Kubernetes very easy. The operator watches 3 custom resources: ServiceMonitor, Prometheus, and AlertManager. When users create those custom resources, the Prometheus Operator deploys and manages the Prometheus server and AlertManager with the user-specified configuration.
Installation
Follow these instructions to install all components into the monitoring namespace. To install them into a different namespace, change the field namespace: OTHER_NAMESPACE.
Create the monitoring namespace:
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
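Each manifest in this guide can be applied with kubectl; for example, with the namespace above saved to a file (the file name is illustrative):

kubectl apply -f monitoring-namespace.yaml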
Install the Prometheus Operator
Deploy the Prometheus Operator and its required ClusterRole, ClusterRoleBinding, and ServiceAccount:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-operator
subjects:
- kind: ServiceAccount
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
rules:
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
- apiGroups:
  - apiextensions.k8s.io
  resourceNames:
  - alertmanagers.monitoring.coreos.com
  - podmonitors.monitoring.coreos.com
  - prometheuses.monitoring.coreos.com
  - prometheusrules.monitoring.coreos.com
  - servicemonitors.monitoring.coreos.com
  - thanosrulers.monitoring.coreos.com
  resources:
  - customresourcedefinitions
  verbs:
  - get
  - update
- apiGroups:
  - monitoring.coreos.com
  resources:
  - alertmanagers
  - alertmanagers/finalizers
  - prometheuses
  - prometheuses/finalizers
  - thanosrulers
  - thanosrulers/finalizers
  - servicemonitors
  - podmonitors
  - prometheusrules
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - delete
- apiGroups:
  - ""
  resources:
  - services
  - services/finalizers
  - endpoints
  verbs:
  - get
  - create
  - update
  - delete
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/component: controller
        app.kubernetes.io/name: prometheus-operator
        app.kubernetes.io/version: v0.38.3
    spec:
      containers:
      - args:
        - --kubelet-service=kube-system/kubelet
        - --logtostderr=true
        - --config-reloader-image=jimmidyson/configmap-reload:v0.3.0
        - --prometheus-config-reloader=quay.io/prometheus-operator/prometheus-config-reloader:v0.38.3
        image: quay.io/prometheus-operator/prometheus-operator:v0.38.3
        name: prometheus-operator
        ports:
        - containerPort: 8080
          name: http
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          allowPrivilegeEscalation: false
      nodeSelector:
        beta.kubernetes.io/os: linux
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccountName: prometheus-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http
    port: 8080
    targetPort: http
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
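After applying the bundle, a quick sanity check might look like this (output will vary by cluster; the CRD names are those listed in the ClusterRole above):

kubectl -n monitoring get deploy prometheus-operator
kubectl get crd | grep monitoring.coreos.com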
Install the Longhorn ServiceMonitor
The Longhorn ServiceMonitor has a label selector app: longhorn-manager to select the Longhorn backend service.
Later on, the Prometheus custom resource can include the Longhorn ServiceMonitor so that the Prometheus server can discover all Longhorn manager pods and their endpoints.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: monitoring
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager
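Before relying on discovery, it is worth confirming that the Longhorn backend service actually carries the app: longhorn-manager label and a port named manager. This check assumes a default Longhorn install in longhorn-system:

kubectl -n longhorn-system get svc -l app=longhorn-manager
kubectl -n longhorn-system get svc longhorn-backend -o jsonpath='{.spec.ports[*].name}'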
Install and configure the Prometheus AlertManager

- Create a highly available Alertmanager deployment with 3 instances:

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: longhorn
  namespace: monitoring
spec:
  replicas: 3
- The Alertmanager instances will not start unless a valid configuration is provided. See the Alertmanager configuration documentation for more explanation. The code below gives an example configuration:

global:
  resolve_timeout: 5m
route:
  group_by: [alertname]
  receiver: email_and_slack
receivers:
- name: email_and_slack
  email_configs:
  - to: <the email address to send notifications to>
    from: <the sender address>
    smarthost: <the SMTP host through which emails are sent>
    # SMTP authentication information.
    auth_username: <the username>
    auth_identity: <the identity>
    auth_password: <the password>
    headers:
      subject: 'Longhorn-Alert'
    text: |-
      {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
      *Description:* {{ .Annotations.description }}
      *Details:*
      {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      {{ end }}
  slack_configs:
  - api_url: <the Slack webhook URL>
    channel: <the channel or user to send notifications to>
    text: |-
      {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
      *Description:* {{ .Annotations.description }}
      *Details:*
      {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      {{ end }}
- Save the Alertmanager configuration above in a file named alertmanager.yaml and use kubectl to create a secret from it. Alertmanager instances require the secret resource naming to follow the format alertmanager-{ALERTMANAGER_NAME}. In the previous step, the name of the Alertmanager is longhorn, so the secret name must be alertmanager-longhorn:

$ kubectl create secret generic alertmanager-longhorn --from-file=alertmanager.yaml -n monitoring
- To be able to view the web UI of the Alertmanager, expose it through a Service. A simple way to do this is to use a Service of NodePort type:

apiVersion: v1
kind: Service
metadata:
  name: alertmanager-longhorn
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30903
    port: 9093
    protocol: TCP
    targetPort: web
  selector:
    alertmanager: longhorn
After creating the above service, you can access the web UI of the Alertmanager via a node's IP on port 30903.
Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of the Alertmanager over a TLS connection.
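With the secret and service in place, the instances should come up. One way to check, assuming the Prometheus Operator's convention of labeling Alertmanager pods with alertmanager=<name>:

kubectl -n monitoring get pods -l alertmanager=longhorn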
Install and configure the Prometheus server

- Create a PrometheusRule custom resource that defines the alert conditions:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-longhorn-rules
  namespace: monitoring
spec:
  groups:
  - name: longhorn.rules
    rules:
    - alert: LonghornVolumeUsageCritical
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% used for more than 5 minutes.
        summary: Longhorn volume capacity is over 90% used.
      expr: 100 * (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) > 90
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} usage on {{$labels.node}} is critical.
        severity: critical

See https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#alerting-rules for more information on how to define alert rules; two quick validation sketches follow at the end of this section.
- If RBAC authorization is activated, create a ServiceAccount, ClusterRole, and ClusterRoleBinding for the Prometheus pods:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
  namespace: monitoring
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
- Create the Prometheus custom resource. Notice that we select the Longhorn service monitor and the Longhorn rules in the spec:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus
  alerting:
    alertmanagers:
    - namespace: monitoring
      name: alertmanager-longhorn
      port: web
  serviceMonitorSelector:
    matchLabels:
      name: longhorn-prometheus-servicemonitor
  ruleSelector:
    matchLabels:
      prometheus: longhorn
      role: alert-rules
- To be able to view the web UI of the Prometheus server, expose it through a Service. A simple way to do this is to use a Service of NodePort type:

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30904
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: prometheus
After creating the above service, you can access the web UI of the Prometheus server via a node's IP on port 30904.
At this point, you should be able to see all Longhorn manager targets as well as the Longhorn rules in the Targets and Rules sections of the Prometheus server UI.
Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of the Prometheus server over a TLS connection.
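For the validation sketches mentioned in the first step: once the targets are up, a per-volume usage percentage should return series in the Prometheus expression browser (metric names are described in the tables later in this article; values will differ per cluster):

100 * longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes

And if you also keep the rule group in a plain rules file (a top-level groups: key, i.e. the spec.groups portion of the PrometheusRule), promtool, which ships with Prometheus, can lint it; the file name here is illustrative:

promtool check rules longhorn-rules.yaml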
Install Grafana

- Create the Grafana datasource ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  prometheus.yaml: |-
    {
        "apiVersion": 1,
        "datasources": [
            {
                "access": "proxy",
                "editable": true,
                "name": "prometheus",
                "orgId": 1,
                "type": "prometheus",
                "url": "http://prometheus:9090",
                "version": 1
            }
        ]
    }
- Create the Grafana deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      name: grafana
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:7.1.5
        ports:
        - name: grafana
          containerPort: 3000
        resources:
          limits:
            memory: "500Mi"
            cpu: "300m"
          requests:
            memory: "500Mi"
            cpu: "200m"
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: grafana-storage
        - mountPath: /etc/grafana/provisioning/datasources
          name: grafana-datasources
          readOnly: false
      volumes:
      - name: grafana-storage
        emptyDir: {}
      - name: grafana-datasources
        configMap:
          defaultMode: 420
          name: grafana-datasources
- Expose Grafana on NodePort 32000:

apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    app: grafana
  type: NodePort
  ports:
  - port: 3000
    targetPort: 3000
    nodePort: 32000
Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose Grafana over a TLS connection.

- Access the Grafana dashboard using any node IP on port 32000. The default credentials are:

User: admin
Pass: admin
- Install the Longhorn dashboard. Once inside Grafana, import the prebuilt Longhorn dashboard: https://grafana.com/grafana/dashboards/13032
See https://grafana.com/docs/grafana/latest/reference/export_import/ for instructions on how to import a Grafana dashboard.
When successful, you should see the Longhorn dashboard.
Integrating Longhorn metrics into the Rancher monitoring system

About the Rancher monitoring system
Using Rancher, you can monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments through integration with Prometheus, a leading open-source monitoring solution.
See https://rancher.com/docs/rancher/v2.x/en/monitoring-alerting/ for how to deploy/enable the Rancher monitoring system.
Adding Longhorn metrics to the Rancher monitoring system
If you use Rancher to manage your Kubernetes cluster and already enabled Rancher monitoring, you can add Longhorn metrics to Rancher monitoring by simply deploying the following ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: longhorn-system
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager
Once the ServiceMonitor is created, Rancher will automatically discover all Longhorn metrics.
You can then set up a Grafana dashboard for visualization.
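A quick way to confirm the resource is in place for Rancher's Prometheus to pick up (names as defined in the manifest above):

kubectl -n longhorn-system get servicemonitor longhorn-prometheus-servicemonitor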
Longhorn monitoring metrics

Volume

Metric | Description | Example |
---|---|---|
longhorn_volume_actual_size_bytes | The actual space used by each replica of the volume on the corresponding node | longhorn_volume_actual_size_bytes{node="worker-2",volume="testvol"} 1.1917312e+08 |
longhorn_volume_capacity_bytes | The configured size of this volume, in bytes | longhorn_volume_capacity_bytes{node="worker-2",volume="testvol"} 6.442450944e+09 |
longhorn_volume_state | The state of this volume: 1=creating, 2=attached, 3=detached, 4=attaching, 5=detaching, 6=deleting | longhorn_volume_state{node="worker-2",volume="testvol"} 2 |
longhorn_volume_robustness | The robustness of this volume: 0=unknown, 1=healthy, 2=degraded, 3=faulted | longhorn_volume_robustness{node="worker-2",volume="testvol"} 1 |
Node

Metric | Description | Example |
---|---|---|
longhorn_node_status | The status of this node: 1=true, 0=false | longhorn_node_status{condition="ready",condition_reason="",node="worker-2"} 1 |
longhorn_node_count_total | The total number of nodes in the Longhorn system | longhorn_node_count_total 4 |
longhorn_node_cpu_capacity_millicpu | The maximum allocatable CPU on this node | longhorn_node_cpu_capacity_millicpu{node="worker-2"} 2000 |
longhorn_node_cpu_usage_millicpu | The CPU usage on this node | longhorn_node_cpu_usage_millicpu{node="worker-2"} 186 |
longhorn_node_memory_capacity_bytes | The maximum allocatable memory on this node | longhorn_node_memory_capacity_bytes{node="worker-2"} 4.031229952e+09 |
longhorn_node_memory_usage_bytes | The memory usage on this node | longhorn_node_memory_usage_bytes{node="worker-2"} 1.833582592e+09 |
longhorn_node_storage_capacity_bytes | The storage capacity of this node | longhorn_node_storage_capacity_bytes{node="worker-3"} 8.3987283968e+10 |
longhorn_node_storage_usage_bytes | The used storage of this node | longhorn_node_storage_usage_bytes{node="worker-3"} 9.060941824e+09 |
longhorn_node_storage_reservation_bytes | The storage reserved for other applications and the system on this node | longhorn_node_storage_reservation_bytes{node="worker-3"} 2.519618519e+10 |
Disk

Metric | Description | Example |
---|---|---|
longhorn_disk_capacity_bytes | The storage capacity of this disk | longhorn_disk_capacity_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 8.3987283968e+10 |
longhorn_disk_usage_bytes | The used storage of this disk | longhorn_disk_usage_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 9.060941824e+09 |
longhorn_disk_reservation_bytes | The storage reserved for other applications and the system on this disk | longhorn_disk_reservation_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 2.519618519e+10 |
Instance Manager

Metric | Description | Example |
---|---|---|
longhorn_instance_manager_cpu_usage_millicpu | The CPU usage of this Longhorn instance manager | longhorn_instance_manager_cpu_usage_millicpu{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 80 |
longhorn_instance_manager_cpu_requests_millicpu | The requested CPU resources in Kubernetes of this Longhorn instance manager | longhorn_instance_manager_cpu_requests_millicpu{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 250 |
longhorn_instance_manager_memory_usage_bytes | The memory usage of this Longhorn instance manager | longhorn_instance_manager_memory_usage_bytes{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 2.4072192e+07 |
longhorn_instance_manager_memory_requests_bytes | The requested memory in Kubernetes of this Longhorn instance manager | longhorn_instance_manager_memory_requests_bytes{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 0 |
Manager

Metric | Description | Example |
---|---|---|
longhorn_manager_cpu_usage_millicpu | The CPU usage of this Longhorn manager | longhorn_manager_cpu_usage_millicpu{manager="longhorn-manager-5rx2n",node="worker-2"} 27 |
longhorn_manager_memory_usage_bytes | The memory usage of this Longhorn manager | longhorn_manager_memory_usage_bytes{manager="longhorn-manager-5rx2n",node="worker-2"} 2.6144768e+07 |
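As a usage sketch built on the encodings above, the following PromQL returns volumes that are attached (state 2) but not healthy (robustness != 1). This is an illustrative query, not part of the official examples:

(longhorn_volume_state == 2) and on(volume, node) (longhorn_volume_robustness != 1)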
Support for Kubelet Volume metrics

About Kubelet Volume metrics

The kubelet natively exposes the following metrics:
kubelet_volume_stats_capacity_bytes
kubelet_volume_stats_available_bytes
kubelet_volume_stats_used_bytes
kubelet_volume_stats_inodes
kubelet_volume_stats_inodes_free
kubelet_volume_stats_inodes_used
These metrics measure information related to the PVC's filesystem inside a Longhorn block device.
They are different from the longhorn_volume_* metrics, which measure information specific to the Longhorn block device itself.
You can set up a monitoring system that scrapes the kubelet metrics endpoint to obtain a PVC's status and set up alerts for abnormal events, such as a PVC being about to run out of storage space.
A popular monitoring setup is prometheus-operator/kube-prometheus-stack, which scrapes kubelet_volume_stats_* metrics and provides dashboards and alert rules for them.
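For example, here is a sketch of one such alert rule, in the same rules format used elsewhere in this article; the 10% threshold and the alert name are illustrative, and it assumes kubelet_volume_stats_* metrics are already being scraped:

# Illustrative rule, to be placed under a PrometheusRule's spec.groups[].rules.
- alert: LonghornPVCStorageFillingUp
  annotations:
    description: PVC {{$labels.persistentvolumeclaim}} in namespace {{$labels.namespace}} has less than 10% of its filesystem space left.
    summary: PVC is running out of filesystem space.
  expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.10
  for: 5m
  labels:
    severity: warning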
Longhorn CSI plugin support

In v1.1.0, the Longhorn CSI plugin supports the NodeGetVolumeStats RPC according to the CSI spec.
This allows the kubelet to query the Longhorn CSI plugin for a PVC's status.
The kubelet then exposes that information in the kubelet_volume_stats_* metrics.
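As a hedged usage sketch, the per-PVC filesystem usage in percent (the namespace filter is illustrative; label names follow the standard kubelet metrics):

100 * kubelet_volume_stats_used_bytes{namespace="default"} / kubelet_volume_stats_capacity_bytes{namespace="default"}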
Longhorn alert rule examples

Below are a couple of example Longhorn alert rules for your reference. See the metrics list above for all available Longhorn metrics, from which you can build your own alert rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-longhorn-rules
  namespace: monitoring
spec:
  groups:
  - name: longhorn.rules
    rules:
    - alert: LonghornVolumeActualSpaceUsedWarning
      annotations:
        description: The actual space used by Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% capacity for more than 5 minutes.
        summary: The actual used space of Longhorn volume is over 90% of the capacity.
      expr: (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) * 100 > 90
      for: 5m
      labels:
        issue: The actual used space of Longhorn volume {{$labels.volume}} on {{$labels.node}} is high.
        severity: warning
    - alert: LonghornVolumeStatusCritical
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Faulted for more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Faulted
      expr: longhorn_volume_robustness == 3
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Faulted.
        severity: critical
    - alert: LonghornVolumeStatusWarning
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Degraded for more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Degraded
      expr: longhorn_volume_robustness == 2
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Degraded.
        severity: warning
    - alert: LonghornNodeStorageWarning
      annotations:
        description: The used storage of node {{$labels.node}} is at {{$value}}% capacity for more than 5 minutes.
        summary: The used storage of the node is over 70% of the capacity.
      expr: (longhorn_node_storage_usage_bytes / longhorn_node_storage_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornDiskStorageWarning
      annotations:
        description: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is at {{$value}}% capacity for more than 5 minutes.
        summary: The used storage of the disk is over 70% of the capacity.
      expr: (longhorn_disk_usage_bytes / longhorn_disk_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornNodeDown
      annotations:
        description: There are {{$value}} Longhorn nodes which have been offline for more than 5 minutes.
        summary: Longhorn nodes are offline
      expr: longhorn_node_count_total - (count(longhorn_node_status{condition="ready"}==1) OR on() vector(0))
      for: 5m
      labels:
        issue: There are {{$value}} Longhorn nodes offline
        severity: critical
    - alert: LonghornInstanceManagerCPUUsageWarning
      annotations:
        description: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has CPU usage / CPU request of {{$value}}% for more than 5 minutes.
        summary: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has CPU usage / CPU request over 300%.
      expr: (longhorn_instance_manager_cpu_usage_millicpu/longhorn_instance_manager_cpu_requests_millicpu) * 100 > 300
      for: 5m
      labels:
        issue: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} consumes 3 times the CPU request.
        severity: warning
    - alert: LonghornNodeCPUUsageWarning
      annotations:
        description: Longhorn node {{$labels.node}} has CPU usage / CPU capacity of {{$value}}% for more than 5 minutes.
        summary: Longhorn node {{$labels.node}} experiences high CPU pressure for more than 5m.
      expr: (longhorn_node_cpu_usage_millicpu / longhorn_node_cpu_capacity_millicpu) * 100 > 90
      for: 5m
      labels:
        issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
        severity: warning
See https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#alerting-rules for more information on how to define alert rules.