Author: Zhu Yaguang, engineer at Zhejiang Lab, cloud-native and open-source enthusiast.
Observability for KubeSphere Edge Nodes
In edge-computing scenarios, KubeSphere builds on KubeEdge to distribute and manage applications and workloads uniformly across the cloud and edge nodes, covering application delivery, operations, and control for large fleets of edge devices and endpoints.
According to the KubeSphere support matrix, only K8s 1.23.x is supported for edge computing, and the KubeSphere console does not display monitoring information such as resource usage for edge nodes.
This article builds a unified cloud-edge computing platform on KubeSphere and KubeEdge, and uses Prometheus to monitor the state of an Nvidia Jetson edge device, bringing KubeSphere observability to edge nodes.
| Component | Version |
|---|---|
| KubeSphere | 3.4.1 |
| containerd | 1.7.2 |
| K8s | 1.26.0 |
| KubeEdge | 1.15.1 |
| Jetson model | NVIDIA Jetson Xavier NX (16GB RAM) |
| Jtop | 4.2.7 |
| JetPack | 5.1.3-b29 |
| Docker | 24.0.5 |
Deploy the K8s Environment
See the KubeSphere deployment documentation. KubeKey can quickly deploy a K8s cluster:

```shell
# all-in-one deployment of a single-master K8s cluster
./kk create cluster --with-kubernetes v1.26.0 --with-kubesphere v3.4.1 --container-manager containerd
```
Deploy the KubeEdge Environment
Follow "Deploying the Latest KubeEdge on KubeSphere" to deploy KubeEdge.
Enable Log Query on Edge Nodes

Edit the edgecore configuration on the edge node:

```shell
vim /etc/kubeedge/config/edgecore.yaml
```

and set, in the edgeStream module (which powers log query):

```yaml
enable: true
```

With this enabled, you can query pod logs on edge nodes, which makes troubleshooting much easier.
Modify the KubeSphere Configuration

Enable the KubeEdge edge node plugin:

- Edit the ClusterConfiguration ConfigMap
- Set advertiseAddress to the address of the physical machine running cloudhub

Documentation for enabling edge nodes in KubeSphere: https://www.kubesphere.io/zh/docs/v3.3/pluggable-components/kubeedge/.

After these changes the edge node shows up in the console, but without CPU and memory information; it turns out there is no node-exporter pod on the edge node.
Modify the node-exporter Affinity

```shell
kubectl get ds -n kubesphere-monitoring-system
```

The node-exporter DaemonSet carries a node affinity rule that keeps it off edge nodes. Change it to:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/edgetest  # change the key here so the affinity rule no longer matches
            operator: DoesNotExist
```
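Why renaming the key defeats the rule follows from how a DoesNotExist requirement is evaluated against a node's labels. A minimal sketch of that matching logic (a hypothetical helper, not the scheduler's actual code):

```python
def node_matches(match_expressions, node_labels):
    """Evaluate a nodeSelectorTerm's matchExpressions against a node's labels."""
    for expr in match_expressions:
        key, op = expr["key"], expr["operator"]
        if op == "DoesNotExist" and key in node_labels:
            return False  # the rule excludes nodes that carry this label
        if op == "Exists" and key not in node_labels:
            return False
    return True

edge_labels = {"node-role.kubernetes.io/edge": ""}
# Original rule: edge nodes are excluded from scheduling.
print(node_matches([{"key": "node-role.kubernetes.io/edge", "operator": "DoesNotExist"}], edge_labels))
# After renaming the key to "edgetest", no node carries it, so every node matches.
print(node_matches([{"key": "node-role.kubernetes.io/edgetest", "operator": "DoesNotExist"}], edge_labels))
```

With the renamed key, the DoesNotExist requirement is vacuously satisfied on every node, so the DaemonSet can land on edge nodes too.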
node-exporter is now scheduled onto the edge node, but its pods fail to start.

Inspecting the failing pod with kubectl edit shows that node-exporter consists of two containers, and the kube-rbac-proxy container is the one failing. Its logs reveal that kube-rbac-proxy tries to read the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT environment variables, fails to get them, and therefore exits.

In a K8s cluster, every pod is created with the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT environment variables so that processes inside the pod can reach kube-apiserver. In pods created on KubeEdge edge nodes, these two variables exist but are empty.

The KubeEdge developers say support for populating these variables will be added in KubeEdge 1.17; see:
https://github.com/wackxu/kubeedge/blob/4a7c00783de9b11e56e56968b2cc950a7d32a403/docs/proposals/edge-pod-list-watch-natively.md.

In the meantime, they recommend installing EdgeMesh, after which pods on edge nodes can reach kubernetes.default.svc.cluster.local:443.
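In-cluster clients assemble the API server address from exactly these two variables, which is why empty values are fatal. A minimal sketch (a hypothetical helper, mirroring what client libraries such as client-go do internally):

```python
import os

def apiserver_url():
    """Derive the in-cluster API server URL the way in-cluster clients do."""
    # These two variables are injected into every pod by the kubelet;
    # on KubeEdge edge nodes (before 1.17) they exist but are empty.
    host = os.environ.get("KUBERNETES_SERVICE_HOST", "")
    port = os.environ.get("KUBERNETES_SERVICE_PORT", "")
    if not host or not port:
        raise RuntimeError("in-cluster config unavailable")
    return f"https://{host}:{port}"
```

An empty value raises here, analogous to kube-rbac-proxy refusing to start.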
EdgeMesh Deployment

First, configure the cloudcore ConfigMap:

```shell
kubectl edit cm cloudcore -n kubeedge
```

Set dynamicController=true, then restart cloudcore:

```shell
kubectl delete pod cloudcore-776ffcbbb9-s6ff8 -n kubeedge
```

Next, configure the edgecore module: set metaServer=true and clusterDNS:

```shell
$ vim /etc/kubeedge/config/edgecore.yaml
```

```yaml
modules:
  ...
  metaManager:
    metaServer:
      enable: true  # set this
  ...
  edged:
    ...
    tailoredKubeletConfig:
      ...
      clusterDNS:  # set this
      - 169.254.96.16
      ...
```

```shell
# restart edgecore
$ systemctl restart edgecore
```

After the changes, verify that they took effect:
```shell
$ curl 127.0.0.1:10550/api/v1/services
```

```json
{"apiVersion":"v1","items":[{"apiVersion":"v1","kind":"Service","metadata":{"creationTimestamp":"2021-04-14T06:30:05Z","labels":{"component":"apiserver","provider":"kubernetes"},"name":"kubernetes","namespace":"default","resourceVersion":"147","selfLink":"default/services/kubernetes","uid":"55eeebea-08cf-4d1a-8b04-e85f8ae112a9"},"spec":{"clusterIP":"10.96.0.1","ports":[{"name":"https","port":443,"protocol":"TCP","targetPort":6443}],"sessionAffinity":"None","type":"ClusterIP"},"status":{"loadBalancer":{}}},{"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"prometheus.io/port":"9153","prometheus.io/scrape":"true"},"creationTimestamp":"2021-04-14T06:30:07Z","labels":{"k8s-app":"kube-dns","kubernetes.io/cluster-service":"true","kubernetes.io/name":"KubeDNS"},"name":"kube-dns","namespace":"kube-system","resourceVersion":"203","selfLink":"kube-system/services/kube-dns","uid":"c221ac20-cbfa-406b-812a-c44b9d82d6dc"},"spec":{"clusterIP":"10.96.0.10","ports":[{"name":"dns","port":53,"protocol":"UDP","targetPort":53},{"name":"dns-tcp","port":53,"protocol":"TCP","targetPort":53},{"name":"metrics","port":9153,"protocol":"TCP","targetPort":9153}],"selector":{"k8s-app":"kube-dns"},"sessionAffinity":"None","type":"ClusterIP"},"status":{"loadBalancer":{}}}],"kind":"ServiceList","metadata":{"resourceVersion":"377360","selfLink":"/api/v1/services"}}
```
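The response is a standard ServiceList object. A small sketch (hypothetical helper, names are mine) of pulling the service names and cluster IPs out of such a payload:

```python
import json

def list_services(payload: str):
    """Return (namespace, name, clusterIP) for each item in a ServiceList JSON."""
    data = json.loads(payload)
    return [
        (item["metadata"]["namespace"], item["metadata"]["name"],
         item["spec"].get("clusterIP", "None"))
        for item in data.get("items", [])
    ]

# Abbreviated version of the payload returned above:
sample = '''{"apiVersion":"v1","kind":"ServiceList","items":[
  {"metadata":{"name":"kubernetes","namespace":"default"},
   "spec":{"clusterIP":"10.96.0.1"}},
  {"metadata":{"name":"kube-dns","namespace":"kube-system"},
   "spec":{"clusterIP":"10.96.0.10"}}]}'''
print(list_services(sample))
```

Seeing the `kubernetes` and `kube-dns` services in the output confirms that the metaServer on port 10550 is serving cluster state locally on the edge node.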
Finally, install EdgeMesh:

```shell
git clone https://github.com/kubeedge/edgemesh.git
cd edgemesh
kubectl apply -f build/crds/istio/
kubectl apply -f build/agent/resources/
```
dnsPolicy

After EdgeMesh is deployed, the two environment variables in the edge node's node-exporter are still empty, and kubernetes.default.svc.cluster.local:443 is still unreachable. The reason is that the pod's DNS server is misconfigured: it should be 169.254.96.16, but the pod inherits the host's DNS configuration instead.

```shell
kubectl exec -it node-exporter-hcmfg -n kubesphere-monitoring-system -- sh
Defaulted container "node-exporter" out of: node-exporter, kube-rbac-proxy
$ cat /etc/resolv.conf
nameserver 127.0.0.53
```

Change dnsPolicy to ClusterFirstWithHostNet and restart node-exporter; the DNS configuration is then correct.

```shell
kubectl edit ds node-exporter -n kubesphere-monitoring-system
```

```yaml
dnsPolicy: ClusterFirstWithHostNet
hostNetwork: true
```
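A quick sanity check after the restart is to confirm the pod now points at the EdgeMesh DNS (169.254.96.16) rather than the host's stub resolver. A tiny sketch (hypothetical helper) of parsing resolv.conf contents:

```python
def nameservers(resolv_conf: str):
    """Extract nameserver addresses from resolv.conf contents."""
    return [
        parts[1]
        for line in resolv_conf.splitlines()
        if (parts := line.split()) and len(parts) >= 2 and parts[0] == "nameserver"
    ]

# Before the fix the pod inherits the host's stub resolver:
print(nameservers("nameserver 127.0.0.53"))
# After dnsPolicy: ClusterFirstWithHostNet it should show the cluster DNS:
print(nameservers("search localdomain\nnameserver 169.254.96.16"))
```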
Add Environment Variables

```shell
vim /etc/systemd/system/edgecore.service
```

```ini
Environment=METASERVER_DUMMY_IP=kubernetes.default.svc.cluster.local
Environment=METASERVER_DUMMY_PORT=443
```

After the change, restart edgecore:

```shell
systemctl daemon-reload
systemctl restart edgecore
```
node-exporter finally reaches Running!

On the edge node, run:

```shell
curl http://127.0.0.1:9100/metrics
```

You can see that edge-node metrics are now being collected.
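node-exporter serves the Prometheus text exposition format. A minimal sketch (hypothetical helper, ignoring optional trailing timestamps) of parsing such output into name/value samples:

```python
def parse_exposition(text: str) -> dict:
    """Parse Prometheus text-format samples into {metric-with-labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE and comment lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples

# Abbreviated node-exporter style output:
sample = """# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.84
node_memory_MemFree_bytes 2.1e+09"""
print(parse_exposition(sample))
```

This is essentially what Prometheus does on every scrape of the 9100 endpoint.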
Finally, we can expose KubeSphere's Prometheus service through a NodePort so the data can be viewed from outside the cluster:
```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.39.1
  name: prometheus-k8s-nodeport
  namespace: kubesphere-monitoring-system
spec:
  ports:
  - port: 9090
    targetPort: 9090
    protocol: TCP
    nodePort: 32143
  selector:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: NodePort
```
Accessing the master IP on port 32143 now returns the edge node's node-exporter data, and the console shows CPU and memory information.

With CPU and memory sorted out, the next step is the GPU.
Monitoring Jetson GPU State

Install Jtop

Jetson is an ARM device on which nvidia-smi cannot run, so we install Jtop instead:

```shell
sudo apt-get install python3-pip python3-dev -y
sudo -H pip3 install jetson-stats
sudo systemctl restart jtop.service
```
Install the Jetson GPU Exporter

Following a reference blog post, build the Jetson GPU Exporter image; matching Grafana dashboards are also available there.

Dockerfile:

```dockerfile
FROM python:3-buster
RUN pip install --upgrade pip && pip install -U jetson-stats prometheus-client
RUN mkdir -p /root
COPY jetson_stats_prometheus_collector.py /root/jetson_stats_prometheus_collector.py
WORKDIR /root
USER root
RUN chmod +x /root/jetson_stats_prometheus_collector.py
ENTRYPOINT ["python3", "/root/jetson_stats_prometheus_collector.py"]
```
jetson_stats_prometheus_collector.py:

```python
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import atexit
import os
from jtop import jtop, JtopException
from prometheus_client.core import InfoMetricFamily, GaugeMetricFamily, REGISTRY, CounterMetricFamily
from prometheus_client import make_wsgi_app
from wsgiref.simple_server import make_server


class CustomCollector(object):
    def __init__(self):
        atexit.register(self.cleanup)
        self._jetson = jtop()
        self._jetson.start()

    def cleanup(self):
        print("Closing jetson-stats connection...")
        self._jetson.close()

    def collect(self):
        # spin=True: do not block waiting for the next data read to complete
        if self._jetson.ok(spin=True):
            #
            # Board info
            #
            i = InfoMetricFamily('gpu_info_board', 'Board sys info', labels=['board_info'])
            i.add_metric(['info'], {
                'machine': self._jetson.board['info']['machine'] if 'machine' in self._jetson.board.get('info', {}) else self._jetson.board['hardware']['Module'],
                'jetpack': self._jetson.board['info']['jetpack'] if 'jetpack' in self._jetson.board.get('info', {}) else self._jetson.board['hardware']['Jetpack'],
                'l4t': self._jetson.board['info']['L4T'] if 'L4T' in self._jetson.board.get('info', {}) else self._jetson.board['hardware']['L4T']
            })
            yield i
            i = InfoMetricFamily('gpu_info_hardware', 'Board hardware info', labels=['board_hw'])
            i.add_metric(['hardware'], {
                'codename': self._jetson.board['hardware'].get('Codename', self._jetson.board['hardware'].get('CODENAME', 'unknown')),
                'soc': self._jetson.board['hardware'].get('SoC', self._jetson.board['hardware'].get('SOC', 'unknown')),
                'module': self._jetson.board['hardware'].get('P-Number', self._jetson.board['hardware'].get('MODULE', 'unknown')),
                'board': self._jetson.board['hardware'].get('699-level Part Number', self._jetson.board['hardware'].get('BOARD', 'unknown')),
                'cuda_arch_bin': self._jetson.board['hardware'].get('CUDA Arch BIN', self._jetson.board['hardware'].get('CUDA_ARCH_BIN', 'unknown')),
                'serial_number': self._jetson.board['hardware'].get('Serial Number', self._jetson.board['hardware'].get('SERIAL_NUMBER', 'unknown')),
            })
            yield i
            #
            # NV power mode
            #
            i = InfoMetricFamily('gpu_nvpmode', 'NV power mode', labels=['nvpmode'])
            i.add_metric(['mode'], {'mode': self._jetson.nvpmodel.name})
            yield i
            #
            # System uptime
            #
            g = GaugeMetricFamily('gpu_uptime', 'System uptime', labels=['uptime'])
            days = self._jetson.uptime.days
            seconds = self._jetson.uptime.seconds
            hours = seconds // 3600
            minutes = (seconds // 60) % 60
            g.add_metric(['days'], days)
            g.add_metric(['hours'], hours)
            g.add_metric(['minutes'], minutes)
            yield g
            #
            # CPU usage
            #
            g = GaugeMetricFamily('gpu_usage_cpu', 'CPU % schedutil', labels=['cpu'])
            g.add_metric(['cpu_1'], self._jetson.stats['CPU1'] if ('CPU1' in self._jetson.stats and isinstance(self._jetson.stats['CPU1'], int)) else 0)
            g.add_metric(['cpu_2'], self._jetson.stats['CPU2'] if ('CPU2' in self._jetson.stats and isinstance(self._jetson.stats['CPU2'], int)) else 0)
            g.add_metric(['cpu_3'], self._jetson.stats['CPU3'] if ('CPU3' in self._jetson.stats and isinstance(self._jetson.stats['CPU3'], int)) else 0)
            g.add_metric(['cpu_4'], self._jetson.stats['CPU4'] if ('CPU4' in self._jetson.stats and isinstance(self._jetson.stats['CPU4'], int)) else 0)
            g.add_metric(['cpu_5'], self._jetson.stats['CPU5'] if ('CPU5' in self._jetson.stats and isinstance(self._jetson.stats['CPU5'], int)) else 0)
            g.add_metric(['cpu_6'], self._jetson.stats['CPU6'] if ('CPU6' in self._jetson.stats and isinstance(self._jetson.stats['CPU6'], int)) else 0)
            g.add_metric(['cpu_7'], self._jetson.stats['CPU7'] if ('CPU7' in self._jetson.stats and isinstance(self._jetson.stats['CPU7'], int)) else 0)
            g.add_metric(['cpu_8'], self._jetson.stats['CPU8'] if ('CPU8' in self._jetson.stats and isinstance(self._jetson.stats['CPU8'], int)) else 0)
            yield g
            #
            # GPU usage
            #
            g = GaugeMetricFamily('gpu_usage_gpu', 'GPU % schedutil', labels=['gpu'])
            g.add_metric(['val'], self._jetson.stats['GPU'])
            yield g
            #
            # Fan usage
            #
            g = GaugeMetricFamily('gpu_usage_fan', 'Fan usage', labels=['fan'])
            g.add_metric(['speed'], self._jetson.fan.get('speed', self._jetson.fan.get('pwmfan', {'speed': [0]})['speed'][0]))
            yield g
            #
            # Sensor temperatures
            #
            g = GaugeMetricFamily('gpu_temperatures', 'Sensor temperatures', labels=['temperature'])
            keys = ['AO', 'GPU', 'Tdiode', 'AUX', 'CPU', 'thermal', 'Tboard']
            for key in keys:
                if key in self._jetson.temperature:
                    g.add_metric([key.lower()], self._jetson.temperature[key]['temp'] if isinstance(self._jetson.temperature[key], dict) else self._jetson.temperature.get(key, 0))
            yield g
            #
            # Power
            #
            g = GaugeMetricFamily('gpu_usage_power', 'Power usage', labels=['power'])
            if isinstance(self._jetson.power, dict):
                g.add_metric(['cv'], self._jetson.power['rail']['VDD_CPU_CV']['avg'] if 'VDD_CPU_CV' in self._jetson.power['rail'] else self._jetson.power['rail'].get('CV', {'avg': 0}).get('avg'))
                g.add_metric(['gpu'], self._jetson.power['rail']['VDD_GPU_SOC']['avg'] if 'VDD_GPU_SOC' in self._jetson.power['rail'] else self._jetson.power['rail'].get('GPU', {'avg': 0}).get('avg'))
                g.add_metric(['sys5v'], self._jetson.power['rail']['VIN_SYS_5V0']['avg'] if 'VIN_SYS_5V0' in self._jetson.power['rail'] else self._jetson.power['rail'].get('SYS5V', {'avg': 0}).get('avg'))
            if isinstance(self._jetson.power, tuple):
                g.add_metric(['cv'], self._jetson.power[1]['CV']['cur'] if 'CV' in self._jetson.power[1] else 0)
                g.add_metric(['gpu'], self._jetson.power[1]['GPU']['cur'] if 'GPU' in self._jetson.power[1] else 0)
                g.add_metric(['sys5v'], self._jetson.power[1]['SYS5V']['cur'] if 'SYS5V' in self._jetson.power[1] else 0)
            yield g
            #
            # Processes
            #
            try:
                processes = self._jetson.processes
                i = InfoMetricFamily('gpu_processes', 'Process usage', labels=['process'])
                for index in range(len(processes)):
                    i.add_metric(['info'], {
                        'pid': str(processes[index][0]),
                        'user': processes[index][1],
                        'gpu': processes[index][2],
                        'type': processes[index][3],
                        'priority': str(processes[index][4]),
                        'state': processes[index][5],
                        'cpu': str(processes[index][6]),
                        'memory': str(processes[index][7]),
                        'gpu_memory': str(processes[index][8]),
                        'name': processes[index][9],
                    })
                yield i
            except AttributeError:
                # this jtop version does not expose a processes attribute; emit nothing
                i = 0


if __name__ == '__main__':
    port = os.environ.get('PORT', 9998)
    REGISTRY.register(CustomCollector())
    app = make_wsgi_app()
    httpd = make_server('', int(port), app)
    print('Serving on port: ', port)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        print('Goodbye!')
```
Remember to label the Jetson board so the GPU exporter runs only on Jetson nodes; on other nodes it would fail because there is no jtop data to collect:

```shell
kubectl label node edge-wpx machine.type=jetson
```
Create the KubeSphere Resources

Create a ServiceAccount, DaemonSet, Service, and ServiceMonitor so that the data collected by jetson-exporter is fed to KubeSphere's Prometheus.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: jetson-exporter
      app.kubernetes.io/part-of: kube-prometheus
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: jetson-exporter
        app.kubernetes.io/part-of: kube-prometheus
        app.kubernetes.io/version: 1.0.0
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/edge
                operator: Exists
      containers:
      - image: jetson-status-exporter:v1
        imagePullPolicy: IfNotPresent
        name: jetson-exporter
        resources:
          limits:
            cpu: "1"
            memory: 500Mi
          requests:
            cpu: 102m
            memory: 180Mi
        ports:
        - containerPort: 9998
          hostPort: 9998
          name: http
          protocol: TCP
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run/jtop.sock
          name: jtop-sock
          readOnly: true
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      hostPID: true
      nodeSelector:
        kubernetes.io/os: linux
        machine.type: jetson
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: jetson-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
          path: /run/jtop.sock
          type: Socket
        name: jtop-sock
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http
    port: 9998
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/vendor: kubesphere
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 1m
    port: http
    relabelings:
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    - action: labeldrop
      regex: (service|endpoint|container)
    scheme: http
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: jetson-exporter
      app.kubernetes.io/part-of: kube-prometheus
```
After deployment, the jetson-exporter pod is Running.

Restart the Prometheus pod to reload its configuration; the newly added GPU exporter then appears as a target in the Prometheus UI:

```shell
kubectl delete pod prometheus-k8s-0 -n kubesphere-monitoring-system
```

Viewing GPU Monitoring Data in the KubeSphere Frontend

The frontend requires changes to the KubeSphere console code; since that is pure frontend work, it is not covered in detail here.
Next, expose the Prometheus Service through a NodePort (the prometheus-k8s-nodeport Service created earlier) so the frontend can query GPU state over the HTTP API.
HTTP API

Instant queries:

```shell
GET http://masterip:32143/api/v1/query?query=gpu_info_board_info&time=1711431293.686
GET http://masterip:32143/api/v1/query?query=gpu_info_hardware_info&time=1711431590.574
GET http://masterip:32143/api/v1/query?query=gpu_usage_gpu&time=1711431590.574
```

Here query is the metric name to look up and time is the evaluation timestamp.

Range queries over a time window:

```shell
GET http://10.11.140.87:32143/api/v1/query_range?query=gpu_usage_gpu&start=1711428221.998&end=1711431821.998&step=14
```

Here query is the metric name, start and end bound the window, and step is the resolution interval.
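The console code ultimately just assembles URLs like these. A small sketch (hypothetical helpers, names are mine) of building them with the standard library:

```python
from urllib.parse import urlencode

def instant_query_url(base, metric, ts=None):
    """Build a Prometheus /api/v1/query URL for an instant vector."""
    params = {"query": metric}
    if ts is not None:
        params["time"] = ts
    return f"{base}/api/v1/query?{urlencode(params)}"

def range_query_url(base, metric, start, end, step):
    """Build a Prometheus /api/v1/query_range URL for a time window."""
    params = {"query": metric, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"

print(instant_query_url("http://masterip:32143", "gpu_usage_gpu", 1711431590.574))
print(range_query_url("http://masterip:32143", "gpu_usage_gpu", 1711428221.998, 1711431821.998, 14))
```

Using urlencode keeps metric names and timestamps properly escaped if they ever contain special characters (e.g. label matchers in the query).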
With that, KubeSphere successfully monitors the GPU state of the Jetson running as a KubeEdge edge node.

Summary

Building on KubeEdge, we brought edge-device observability, including GPU information, to the KubeSphere frontend.

For edge-node CPU and memory monitoring, we first changed the affinity rule so that KubeSphere's bundled node-exporter can collect metrics on edge nodes, then used KubeEdge's EdgeMesh to feed the collected data to KubeSphere's Prometheus.

For edge-node GPU monitoring, we installed jtop to obtain GPU utilization, temperature, and other data, developed a Jetson GPU Exporter to ship jtop's data to KubeSphere's Prometheus, and modified the KubeSphere frontend (ks-console) to fetch the Prometheus data over the HTTP API, giving us GPU utilization and related metrics in the UI.

This article was published via the blog multi-posting platform OpenWrite.