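Monitoring CPU and GPU Status of KubeEdge Edge Nodes (Jetson) on KubeSphere with Prometheus

Posted by kubesphere on 2024-04-11

Author: Zhu Yaguang (朱亞光), engineer at Zhejiang Lab, cloud-native and open-source enthusiast.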

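Observability of KubeSphere Edge Nodes

In edge computing scenarios, KubeSphere builds on KubeEdge to distribute and manage applications and workloads uniformly across the cloud and edge nodes, which addresses the need to deliver, operate, and control applications on a very large number of edge and end devices.

According to the KubeSphere support matrix, only K8s 1.23.x supports edge computing, and the KubeSphere console does not display monitoring information such as resource usage for edge nodes.

This article builds an integrated cloud-edge computing platform on KubeSphere and KubeEdge and uses Prometheus to monitor the status of an Nvidia Jetson edge device, giving KubeSphere observability on edge nodes.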

Component      Version
KubeSphere     3.4.1
containerd     1.7.2
K8s            1.26.0
KubeEdge       1.15.1
Jetson model   NVIDIA Jetson Xavier NX (16 GB RAM)
Jtop           4.2.7
JetPack        5.1.3-b29
Docker         24.0.5

Deploying the K8s Environment

See the KubeSphere deployment documentation. KubeKey can quickly deploy a K8s cluster.

# Deploy a single-master K8s cluster in all-in-one mode

./kk create cluster --with-kubernetes v1.26.0 --with-kubesphere v3.4.1 --container-manager containerd

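Deploying the KubeEdge Environment

Deploy KubeEdge by following the article "Deploying the Latest Version of KubeEdge on KubeSphere".

Enabling log queries on edge nodes

  1. vim /etc/kubeedge/config/edgecore.yaml

  2. Set enable=true (under modules.edgeStream)

Once this is enabled, pod logs can be queried easily, which helps locate problems.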

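Modifying the KubeSphere Configuration

Enabling the KubeEdge edge node plugin

  1. Modify configmap--ClusterConfiguration

  2. Set advertiseAddress to the address of the physical machine where cloudhub runs

KubeSphere documentation on enabling edge nodes: https://www.kubesphere.io/zh/docs/v3.3/pluggable-components/kubeedge/.

After this change the edge node is displayed, but without CPU or memory information. The reason is that the edge node has no node-exporter pod.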

Modifying node-exporter affinity

Running kubectl get ds -n kubesphere-monitoring-system shows that node-exporter is not deployed to edge nodes.

Change the affinity to:

    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/edgetest  # changed here so that the affinity rule no longer excludes edge nodes
                operator: DoesNotExist
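
To see why renaming the key lets node-exporter land on edge nodes, the matching rule can be sketched in a few lines of Python. This only illustrates the Exists and DoesNotExist operators and is not the scheduler's actual code. The function name, the sample label sets, and the assumption that the original key was node-role.kubernetes.io/edge are all illustrative.

```python
# Minimal sketch of how a nodeAffinity matchExpressions list is evaluated.
# Only the Exists and DoesNotExist operators are modelled here.

def matches(node_labels, match_expressions):
    """Return True if every expression holds for the node's labels."""
    for expr in match_expressions:
        present = expr["key"] in node_labels
        if expr["operator"] == "Exists" and not present:
            return False
        if expr["operator"] == "DoesNotExist" and present:
            return False
    return True

# Hypothetical label sets for a cloud node and a KubeEdge edge node.
cloud_node = {"kubernetes.io/os": "linux"}
edge_node = {"kubernetes.io/os": "linux", "node-role.kubernetes.io/edge": ""}

# Assumed original rule (excludes edge nodes) and the modified rule from above.
original_rule = [{"key": "node-role.kubernetes.io/edge", "operator": "DoesNotExist"}]
modified_rule = [{"key": "node-role.kubernetes.io/edgetest", "operator": "DoesNotExist"}]

print(matches(edge_node, original_rule))   # False: the edge node is excluded
print(matches(edge_node, modified_rule))   # True: the edge node is now eligible
print(matches(cloud_node, modified_rule))  # True: cloud nodes are unaffected
```

Because no node carries the edgetest label, the DoesNotExist expression is true everywhere and the rule stops filtering anything.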

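node-exporter is now scheduled onto the edge node, but its pod fails to start.

Inspecting the failing pod with kubectl edit shows that the node-exporter pod contains two containers, and that the kube-rbac-proxy container is the one failing to start. Its logs show that kube-rbac-proxy tries to read the two environment variables KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT, fails to get them, and therefore cannot start.

In a K8s cluster, the environment variables KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT are added to every pod at creation time so that processes inside the pod can reach kube-apiserver. In pods created on a KubeEdge edge node, these two variables exist but are empty.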
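
The failure mode can be reproduced outside the cluster with a small Python sketch. It mimics how an in-cluster client typically builds the API server address from the two variables. The function name and the sample values are assumptions for illustration, not kube-rbac-proxy's code.

```python
# Sketch: derive the API server URL from the two service environment
# variables, treating "present but empty" the same as "missing".

def apiserver_url(env):
    host = env.get("KUBERNETES_SERVICE_HOST", "")
    port = env.get("KUBERNETES_SERVICE_PORT", "")
    if not host or not port:
        raise RuntimeError("KUBERNETES_SERVICE_HOST/PORT unset or empty")
    return f"https://{host}:{port}"

# A pod on a regular node (hypothetical values).
cloud_pod_env = {"KUBERNETES_SERVICE_HOST": "10.96.0.1",
                 "KUBERNETES_SERVICE_PORT": "443"}
# A pod on an edge node: the variables exist but are empty.
edge_pod_env = {"KUBERNETES_SERVICE_HOST": "", "KUBERNETES_SERVICE_PORT": ""}

print(apiserver_url(cloud_pod_env))
try:
    apiserver_url(edge_pod_env)
except RuntimeError as err:
    print("in-cluster config unavailable:", err)
```

In a real pod the same check can be run against os.environ instead of the sample dictionaries.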

We asked the KubeEdge developers, who said that setting these two environment variables will be added in KubeEdge 1.17. See:
https://github.com/wackxu/kubeedge/blob/4a7c00783de9b11e56e56968b2cc950a7d32a403/docs/proposals/edge-pod-list-watch-natively.md

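In addition, installing EdgeMesh is recommended. Once it is installed, pods on the edge can reach kubernetes.default.svc.cluster.local:443.

Deploying EdgeMesh

  1. Configure the cloudcore configmap

    Run kubectl edit cm cloudcore -n kubeedge and set dynamicController=true.

    After the change, restart cloudcore (the pod name will differ in your cluster): kubectl delete pod cloudcore-776ffcbbb9-s6ff8 -n kubeedge

  2. Configure the edgecore modules: enable metaServer and set clusterDNS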

    $ vim /etc/kubeedge/config/edgecore.yaml
    
    modules:
      ...
      metaManager:
        metaServer:
          enable: true   # configure here
    ...
    
    modules:
      ...
      edged:
        ...
        tailoredKubeletConfig:
          ...
          clusterDNS:     # configure here
          - 169.254.96.16
    ...
    
    # restart edgecore
    $ systemctl restart edgecore
    

After the change, verify that it took effect.

$ curl 127.0.0.1:10550/api/v1/services

{"apiVersion":"v1","items":[{"apiVersion":"v1","kind":"Service","metadata":{"creationTimestamp":"2021-04-14T06:30:05Z","labels":{"component":"apiserver","provider":"kubernetes"},"name":"kubernetes","namespace":"default","resourceVersion":"147","selfLink":"default/services/kubernetes","uid":"55eeebea-08cf-4d1a-8b04-e85f8ae112a9"},"spec":{"clusterIP":"10.96.0.1","ports":[{"name":"https","port":443,"protocol":"TCP","targetPort":6443}],"sessionAffinity":"None","type":"ClusterIP"},"status":{"loadBalancer":{}}},{"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"prometheus.io/port":"9153","prometheus.io/scrape":"true"},"creationTimestamp":"2021-04-14T06:30:07Z","labels":{"k8s-app":"kube-dns","kubernetes.io/cluster-service":"true","kubernetes.io/name":"KubeDNS"},"name":"kube-dns","namespace":"kube-system","resourceVersion":"203","selfLink":"kube-system/services/kube-dns","uid":"c221ac20-cbfa-406b-812a-c44b9d82d6dc"},"spec":{"clusterIP":"10.96.0.10","ports":[{"name":"dns","port":53,"protocol":"UDP","targetPort":53},{"name":"dns-tcp","port":53,"protocol":"TCP","targetPort":53},{"name":"metrics","port":9153,"protocol":"TCP","targetPort":9153}],"selector":{"k8s-app":"kube-dns"},"sessionAffinity":"None","type":"ClusterIP"},"status":{"loadBalancer":{}}}],"kind":"ServiceList","metadata":{"resourceVersion":"377360","selfLink":"/api/v1/services"}}
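
The raw JSON is hard to read at a glance. As a rough sketch, the same check can be done in Python by listing only the namespace and name of each returned Service. The helper names are hypothetical, and the trimmed sample payload stands in for the real response shown above.

```python
import json
import urllib.request

def service_names(payload):
    """Return 'namespace/name' for every item in a ServiceList JSON document."""
    doc = json.loads(payload)
    return [f'{item["metadata"]["namespace"]}/{item["metadata"]["name"]}'
            for item in doc.get("items", [])]

def fetch_services(url="http://127.0.0.1:10550/api/v1/services", timeout=5):
    """Fetch the ServiceList from the local metaServer (run this on the edge node)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

# Trimmed-down sample with the same shape as the response above.
sample = ('{"kind":"ServiceList","items":['
          '{"metadata":{"name":"kubernetes","namespace":"default"}},'
          '{"metadata":{"name":"kube-dns","namespace":"kube-system"}}]}')
print(service_names(sample))  # ['default/kubernetes', 'kube-system/kube-dns']
```

On the edge node, service_names(fetch_services()) should list the cluster's Services if metaServer is working.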

  3. Install EdgeMesh

    git clone https://github.com/kubeedge/edgemesh.git
    cd edgemesh
    
    kubectl apply -f build/crds/istio/
    
    kubectl apply -f build/agent/resources/
    

dnsPolicy

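After EdgeMesh is deployed, the two environment variables in node-exporter on the edge node are still empty, and kubernetes.default.svc.cluster.local:443 is still unreachable. The cause is a wrong DNS server configuration in the pod: it should be 169.254.96.16, but the pod has the same DNS configuration as the host.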

kubectl exec -it node-exporter-hcmfg -n kubesphere-monitoring-system -- sh
Defaulted container "node-exporter" out of: node-exporter, kube-rbac-proxy
$ cat /etc/resolv.conf
nameserver 127.0.0.53

Change dnsPolicy to ClusterFirstWithHostNet and restart node-exporter. The DNS configuration is then correct.

kubectl edit ds node-exporter -n kubesphere-monitoring-system

  dnsPolicy: ClusterFirstWithHostNet
  hostNetwork: true
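
A quick way to confirm the effect is to check which nameservers the pod's /etc/resolv.conf lists. The sketch below parses resolv.conf text and looks for the clusterDNS address configured in edgecore.yaml. The function names and the "after" sample are assumptions for illustration.

```python
EDGEMESH_DNS = "169.254.96.16"  # clusterDNS value configured in edgecore.yaml

def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf content."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers

def uses_cluster_dns(resolv_conf_text, expected=EDGEMESH_DNS):
    return expected in nameservers(resolv_conf_text)

# Content seen in the pod before the change (from the output above).
before = "nameserver 127.0.0.53\n"
# Hypothetical content after switching to ClusterFirstWithHostNet.
after = "nameserver 169.254.96.16\nsearch svc.cluster.local cluster.local\n"

print(uses_cluster_dns(before))  # False
print(uses_cluster_dns(after))   # True
```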

Adding environment variables

vim /etc/systemd/system/edgecore.service

Environment=METASERVER_DUMMY_IP=kubernetes.default.svc.cluster.local
Environment=METASERVER_DUMMY_PORT=443

Restart edgecore after the change.

systemctl daemon-reload
systemctl restart edgecore

node-exporter is now Running!

Running curl http://127.0.0.1:9100/metrics on the edge node shows that data from the edge node is being collected.
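
The /metrics output uses the Prometheus text exposition format. The sketch below is a deliberately simplified parser for that output, useful for spot-checking a few values. It assumes one sample per line with no trailing timestamp, and the sample text is made up rather than taken from the Jetson.

```python
def parse_metrics(text):
    """Parse simple exposition-format lines into (name, series, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP and TYPE lines
            continue
        series, value = line.rsplit(None, 1)   # the value is the last field
        name = series.split("{", 1)[0]         # metric name without labels
        samples.append((name, series, float(value)))
    return samples

# Hypothetical excerpt of node-exporter output.
sample = '''# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_memory_MemTotal_bytes 1.6e+10
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
'''

for name, series, value in parse_metrics(sample):
    print(name, value)
```

For anything beyond a spot check, a full parser such as the one shipped with prometheus_client is the safer choice.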

Finally, we can expose KubeSphere's Prometheus service (the k8s instance) through a NodePort so that it can be viewed in a browser.

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.39.1
  name: prometheus-k8s-nodeport
  namespace: kubesphere-monitoring-system
spec:
  ports:
  - port: 9090
    targetPort: 9090
    protocol: TCP
    nodePort: 32143
  selector:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: NodePort

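Prometheus is then reachable at the master IP on port 32143, where the edge node's node-exporter data can be queried.

The console now also shows CPU and memory information.

With CPU and memory done, the GPU is next.

Monitoring Jetson GPU Status

Installing Jtop

Jetson is an ARM device on which nvidia-smi cannot be run, so Jtop has to be installed instead.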

sudo apt-get install python3-pip python3-dev -y
sudo -H pip3 install jetson-stats
sudo systemctl restart jtop.service

Installing the Jetson GPU Exporter

Following a blog post, build the Jetson GPU Exporter image. A matching Grafana dashboard is also available there.

Dockerfile

FROM python:3-buster
RUN pip install --upgrade pip && pip install -U jetson-stats prometheus-client
RUN mkdir -p /root
COPY jetson_stats_prometheus_collector.py /root/jetson_stats_prometheus_collector.py
WORKDIR /root
USER root
RUN chmod +x /root/jetson_stats_prometheus_collector.py
ENTRYPOINT ["python3", "/root/jetson_stats_prometheus_collector.py"]

Code of jetson_stats_prometheus_collector.py

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import atexit
import os
from jtop import jtop, JtopException
from prometheus_client.core import InfoMetricFamily, GaugeMetricFamily, REGISTRY, CounterMetricFamily
from prometheus_client import make_wsgi_app
from wsgiref.simple_server import make_server

class CustomCollector(object):
    def __init__(self):
        atexit.register(self.cleanup)
        self._jetson = jtop()
        self._jetson.start()

    def cleanup(self):
        print("Closing jetson-stats connection...")
        self._jetson.close()

    def collect(self):
        # spin=True means ok() does not wait for the next data read to complete
        if self._jetson.ok(spin=True):
            #
            # Board info
            #
            i = InfoMetricFamily('gpu_info_board', 'Board sys info', labels=['board_info'])
            i.add_metric(['info'], {
                'machine': self._jetson.board['info']['machine'] if 'machine' in self._jetson.board.get('info', {}) else self._jetson.board['hardware']['Module'],
                'jetpack': self._jetson.board['info']['jetpack'] if 'jetpack' in self._jetson.board.get('info', {}) else self._jetson.board['hardware']['Jetpack'],
                'l4t':  self._jetson.board['info']['L4T'] if 'L4T' in self._jetson.board.get('info', {}) else self._jetson.board['hardware']['L4T']
                })
            yield i

            i = InfoMetricFamily('gpu_info_hardware', 'Board hardware info', labels=['board_hw'])
            i.add_metric(['hardware'], {
                'codename': self._jetson.board['hardware'].get('Codename', self._jetson.board['hardware'].get('CODENAME', 'unknown')),
                'soc': self._jetson.board['hardware'].get('SoC', self._jetson.board['hardware'].get('SOC', 'unknown')),
                'module': self._jetson.board['hardware'].get('P-Number', self._jetson.board['hardware'].get('MODULE', 'unknown')),
                'board': self._jetson.board['hardware'].get('699-level Part Number', self._jetson.board['hardware'].get('BOARD', 'unknown')),
                'cuda_arch_bin': self._jetson.board['hardware'].get('CUDA Arch BIN', self._jetson.board['hardware'].get('CUDA_ARCH_BIN', 'unknown')),
                'serial_number': self._jetson.board['hardware'].get('Serial Number', self._jetson.board['hardware'].get('SERIAL_NUMBER', 'unknown')),
                })
            yield i

            #
            # NV power mode
            #
            i = InfoMetricFamily('gpu_nvpmode', 'NV power mode', labels=['nvpmode'])
            i.add_metric(['mode'], {'mode': self._jetson.nvpmodel.name})
            yield i

            #
            # System uptime
            #
            g = GaugeMetricFamily('gpu_uptime', 'System uptime', labels=['uptime'])
            days = self._jetson.uptime.days
            seconds = self._jetson.uptime.seconds
            hours = seconds//3600
            minutes = (seconds//60) % 60
            g.add_metric(['days'], days)
            g.add_metric(['hours'], hours)
            g.add_metric(['minutes'], minutes)
            yield g

            #
            # CPU usage
            #
            g = GaugeMetricFamily('gpu_usage_cpu', 'CPU % schedutil', labels=['cpu'])
            g.add_metric(['cpu_1'], self._jetson.stats['CPU1'] if ('CPU1' in self._jetson.stats and isinstance(self._jetson.stats['CPU1'], int)) else 0)
            g.add_metric(['cpu_2'], self._jetson.stats['CPU2'] if ('CPU2' in self._jetson.stats and isinstance(self._jetson.stats['CPU2'], int)) else 0)
            g.add_metric(['cpu_3'], self._jetson.stats['CPU3'] if ('CPU3' in self._jetson.stats and isinstance(self._jetson.stats['CPU3'], int)) else 0)
            g.add_metric(['cpu_4'], self._jetson.stats['CPU4'] if ('CPU4' in self._jetson.stats and isinstance(self._jetson.stats['CPU4'], int)) else 0)
            g.add_metric(['cpu_5'], self._jetson.stats['CPU5'] if ('CPU5' in self._jetson.stats and isinstance(self._jetson.stats['CPU5'], int)) else 0)
            g.add_metric(['cpu_6'], self._jetson.stats['CPU6'] if ('CPU6' in self._jetson.stats and isinstance(self._jetson.stats['CPU6'], int)) else 0)
            g.add_metric(['cpu_7'], self._jetson.stats['CPU7'] if ('CPU7' in self._jetson.stats and isinstance(self._jetson.stats['CPU7'], int)) else 0)
            g.add_metric(['cpu_8'], self._jetson.stats['CPU8'] if ('CPU8' in self._jetson.stats and isinstance(self._jetson.stats['CPU8'], int)) else 0)
            yield g

            #
            # GPU usage
            #
            g = GaugeMetricFamily('gpu_usage_gpu', 'GPU % schedutil', labels=['gpu'])
            g.add_metric(['val'], self._jetson.stats['GPU'])
            yield g

            #
            # Fan usage
            #
            g = GaugeMetricFamily('gpu_usage_fan', 'Fan usage', labels=['fan'])
            g.add_metric(['speed'], self._jetson.fan.get('speed', self._jetson.fan.get('pwmfan', {'speed': [0] })['speed'][0]))
            yield g

            #
            # Sensor temperatures
            #
            g = GaugeMetricFamily('gpu_temperatures', 'Sensor temperatures', labels=['temperature'])
            keys = ['AO', 'GPU', 'Tdiode', 'AUX', 'CPU', 'thermal', 'Tboard']
            for key in keys:
                if key in self._jetson.temperature:
                    g.add_metric([key.lower()], self._jetson.temperature[key]['temp'] if isinstance(self._jetson.temperature[key], dict) else self._jetson.temperature.get(key, 0))
            yield g
            #
            # Power
            #
            g = GaugeMetricFamily('gpu_usage_power', 'Power usage', labels=['power'])
            if isinstance(self._jetson.power, dict):
                g.add_metric(['cv'], self._jetson.power['rail']['VDD_CPU_CV']['avg'] if 'VDD_CPU_CV' in self._jetson.power['rail'] else self._jetson.power['rail'].get('CV', { 'avg': 0 }).get('avg'))
                g.add_metric(['gpu'], self._jetson.power['rail']['VDD_GPU_SOC']['avg'] if 'VDD_GPU_SOC' in self._jetson.power['rail'] else self._jetson.power['rail'].get('GPU', { 'avg': 0 }).get('avg'))
                g.add_metric(['sys5v'], self._jetson.power['rail']['VIN_SYS_5V0']['avg'] if 'VIN_SYS_5V0' in self._jetson.power['rail'] else self._jetson.power['rail'].get('SYS5V', { 'avg': 0 }).get('avg'))
            if isinstance(self._jetson.power, tuple):
                g.add_metric(['cv'], self._jetson.power[1]['CV']['cur'] if 'CV' in self._jetson.power[1] else 0)
                g.add_metric(['gpu'], self._jetson.power[1]['GPU']['cur'] if 'GPU' in self._jetson.power[1] else 0)
                g.add_metric(['sys5v'], self._jetson.power[1]['SYS5V']['cur'] if 'SYS5V' in self._jetson.power[1] else 0)
            yield g

            #
            # Processes
            #
            try:
                processes = self._jetson.processes
                # key exists in dict
                i = InfoMetricFamily('gpu_processes', 'Process usage', labels=['process'])
                for index in range(len(processes)):
                    i.add_metric(['info'], {
                        'pid': str(processes[index][0]),
                        'user': processes[index][1],
                        'gpu': processes[index][2],
                        'type': processes[index][3],
                        'priority': str(processes[index][4]),
                        'state': processes[index][5],
                        'cpu': str(processes[index][6]),
                        'memory': str(processes[index][7]),
                        'gpu_memory': str(processes[index][8]),
                        'name': processes[index][9],
                    })
                yield i
            except AttributeError:
                # this jtop version does not expose the processes attribute; skip the metric
                pass

if __name__ == '__main__':
    port = os.environ.get('PORT', 9998)
    REGISTRY.register(CustomCollector())
    app = make_wsgi_app()
    httpd = make_server('', int(port), app)
    print('Serving on port: ', port)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        print('Goodbye!')
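
The eight near-identical CPU lines and the uptime arithmetic in the collector can be checked without a Jetson. The sketch below reimplements just those two pieces as plain functions over ordinary Python values. The function names and the sample stats dictionary are hypothetical, and the treatment of offline cores as a non-integer value is an assumption about jtop's output.

```python
from datetime import timedelta

def cpu_samples(stats, max_cores=8):
    """Return (label, value) pairs for cpu_1..cpu_N, using 0 for missing or
    non-integer entries, the same rule the collector applies per core."""
    samples = []
    for n in range(1, max_cores + 1):
        value = stats.get(f"CPU{n}")
        samples.append((f"cpu_{n}", value if isinstance(value, int) else 0))
    return samples

def split_uptime(uptime):
    """Split a timedelta into (days, hours, minutes) as the collector does."""
    hours = uptime.seconds // 3600
    minutes = (uptime.seconds // 60) % 60
    return uptime.days, hours, minutes

# Hypothetical stats: six cores, one of them reported as a non-integer.
fake_stats = {"CPU1": 12, "CPU2": 7, "CPU3": "OFF", "CPU4": 0, "CPU5": 33, "CPU6": 5}
print(cpu_samples(fake_stats))
print(split_uptime(timedelta(days=2, hours=3, minutes=4, seconds=5)))  # (2, 3, 4)
```

Inside collect(), a loop over cpu_samples(self._jetson.stats) calling g.add_metric([label], value) would be equivalent to the eight explicit lines.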

Remember to label the Jetson board so that the GPU exporter runs only on Jetson nodes. On any other node it would report errors because there is no data to collect.

kubectl label node edge-wpx machine.type=jetson

Creating KubeSphere Resources

Create a ServiceAccount, DaemonSet, Service, and ServiceMonitor so that the data collected by jetson-exporter is made available to KubeSphere's Prometheus.

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: jetson-exporter
      app.kubernetes.io/part-of: kube-prometheus
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: jetson-exporter
        app.kubernetes.io/part-of: kube-prometheus
        app.kubernetes.io/version: 1.0.0
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/edge
                operator: Exists
      containers:
      - image: jetson-status-exporter:v1
        imagePullPolicy: IfNotPresent
        name: jetson-exporter
        resources:
          limits:
            cpu: "1"
            memory: 500Mi
          requests:
            cpu: 102m
            memory: 180Mi
        ports:
        - containerPort: 9998
          hostPort: 9998
          name: http
          protocol: TCP
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run/jtop.sock
          name: jtop-sock
          readOnly: true
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      hostPID: true
      nodeSelector:
        kubernetes.io/os: linux
        machine.type: jetson
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: jetson-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
          path: /run/jtop.sock
          type: Socket
        name: jtop-sock
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http
    port: 9998
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/vendor: kubesphere
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 1m
    port: http
    relabelings:
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    - action: labeldrop
      regex: (service|endpoint|container)
    scheme: http
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: jetson-exporter
      app.kubernetes.io/part-of: kube-prometheus
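
The two relabelings in the ServiceMonitor determine what the scraped series look like. As a rough illustration, the sketch below applies the same replace and labeldrop rules to a plain dictionary of labels. It models only these two actions, assumes Prometheus's fully anchored regex matching, and uses a made-up target label set.

```python
import re

def relabel(labels):
    """Apply the ServiceMonitor's replace and labeldrop rules to a label dict."""
    out = dict(labels)

    # action: replace, copying the pod's node name into "instance"
    source = out.get("__meta_kubernetes_pod_node_name", "")
    match = re.fullmatch(r"(.*)", source, flags=re.DOTALL)
    if match:
        out["instance"] = match.group(1)   # replacement: $1

    # action: labeldrop, removing labels whose name matches the regex
    drop = re.compile(r"(service|endpoint|container)")
    return {k: v for k, v in out.items() if not drop.fullmatch(k)}

# Hypothetical labels of a discovered jetson-exporter target.
target = {
    "__meta_kubernetes_pod_node_name": "edge-wpx",
    "instance": "192.168.1.20:9998",
    "job": "jetson-exporter",
    "service": "jetson-exporter",
    "endpoint": "http",
    "container": "jetson-exporter",
}

result = relabel(target)
print(result["instance"])  # edge-wpx
print(sorted(result))
```

The practical effect is that each GPU series is identified by the node name in the instance label, which is easier to match against node-exporter data than an IP and port.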

After deployment, the jetson-exporter pod is Running.

Restart the Prometheus pod so that it reloads its configuration. The new GPU exporter target then appears in the Prometheus UI.

kubectl delete pod prometheus-k8s-0 -n kubesphere-monitoring-system

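Viewing GPU Monitoring Data in the KubeSphere Front End

The front end requires changes to the KubeSphere console code. Since that is front-end work, it is not covered in detail here.

Next, expose the Prometheus Service port through a NodePort (the same prometheus-k8s-nodeport Service shown earlier) so that the front end can query GPU status over the HTTP API.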

HTTP API

Query an instant value:
get http://masterip:32143/api/v1/query?query=gpu_info_board_info&time=1711431293.686
get http://masterip:32143/api/v1/query?query=gpu_info_hardware_info&time=1711431590.574
get http://masterip:32143/api/v1/query?query=gpu_usage_gpu&time=1711431590.574
Here query is the name of the metric to query and time is the evaluation timestamp.

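Query the values collected over a time range:
get http://10.11.140.87:32143/api/v1/query_range?query=gpu_usage_gpu&start=1711428221.998&end=1711431821.998&step=14
Here query is the name of the metric to query, start and end are the start and end timestamps, and step is the interval between points in seconds.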
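
The same requests can be assembled programmatically. The sketch below builds the two kinds of URL with the standard library, estimates how many points a range query returns per series, and reads values out of an instant-query response. The helper names and the sample response body are illustrative assumptions, and masterip remains a placeholder as in the examples above.

```python
import json
import math
from urllib.parse import urlencode

BASE = "http://masterip:32143"  # placeholder host, as in the examples above

def instant_query_url(query, time=None, base=BASE):
    params = {"query": query}
    if time is not None:
        params["time"] = time
    return f"{base}/api/v1/query?{urlencode(params)}"

def range_query_url(query, start, end, step, base=BASE):
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"

def expected_points(start, end, step):
    """Upper bound on points per series: one at start, then one per step up to end."""
    return math.floor((end - start) / step) + 1

def instant_values(body):
    """Extract (labels, value) pairs from an instant-query JSON response."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise ValueError("query failed")
    return [(r["metric"], float(r["value"][1])) for r in doc["data"]["result"]]

print(instant_query_url("gpu_usage_gpu", 1711431590.574))
print(range_query_url("gpu_usage_gpu", 1711428221.998, 1711431821.998, 14))
print(expected_points(1711428221.998, 1711431821.998, 14))  # 258

# Hypothetical response body in the shape the instant-query endpoint returns.
sample_body = ('{"status":"success","data":{"resultType":"vector","result":['
               '{"metric":{"__name__":"gpu_usage_gpu","instance":"edge-wpx"},'
               '"value":[1711431590.574,"37"]}]}}')
print(instant_values(sample_body))
```

With a one-hour window and step=14, each series yields at most 258 points, which is worth keeping in mind when choosing a step for front-end charts.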

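With this, KubeSphere successfully monitors the GPU status of a Jetson KubeEdge edge node.

Summary

Building on KubeEdge, we achieved observability of edge devices in the KubeSphere front end, including observability of GPU information.

To monitor CPU and memory on edge nodes, we first modified the affinity so that the node-exporter bundled with KubeSphere can collect monitoring data from edge nodes, and then used KubeEdge's EdgeMesh to make the collected data available to KubeSphere's Prometheus. This provides CPU and memory monitoring.

To monitor GPU status on edge nodes, we installed jtop to obtain data such as GPU utilization and temperature, then built a Jetson GPU Exporter that passes the information from jtop to KubeSphere's Prometheus. By modifying the code of ks-console, the KubeSphere front end, the console retrieves the Prometheus data over the HTTP API. This provides monitoring of GPU utilization and related information.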
