With the rapid development of AI and large models, sharing GPU resources in the cloud has become necessary: it lowers hardware cost, improves resource utilization, and meets the demand for large-scale parallel compute in model training and inference.
Kubernetes' built-in resource scheduling can only allocate GPUs by card count, yet for deep-learning workloads the resource under the most pressure is usually GPU memory, so whole-card scheduling wastes a lot of capacity.
There are currently two approaches to GPU sharing. One splits a physical GPU into multiple virtual GPUs (vGPUs) and schedules by vGPU count; the other schedules by GPU memory.
This article walks through installing the Kubernetes components needed to schedule GPU resources by GPU memory.
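As a preview of what changes for workloads: with the default NVIDIA device plugin a container can only request whole cards, while the memory-based scheme installed below requests GPU memory in GiB. A hypothetical side-by-side sketch (both resource names are real; the surrounding pod spec is omitted):

```yaml
# Default device plugin: whole GPUs only
resources:
  limits:
    nvidia.com/gpu: 1        # one entire card
---
# GPU-memory scheduling (installed in this article)
resources:
  limits:
    aliyun.com/gpu-mem: 3    # 3 GiB of GPU memory on a shared card
```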
System information
- OS: CentOS Stream 8
- Kernel: 4.18.0-490.el8.x86_64
- Driver: NVIDIA-Linux-x86_64-470.182.03
- Docker: 20.10.24
- Kubernetes: 1.24.0
1. Install the NVIDIA driver
Download and install the driver from the NVIDIA website: https://www.nvidia.com/Download/index.aspx?lang=en-us
2. Install Docker
Install Docker or another container runtime yourself. If you use a different runtime, refer to the NVIDIA installation guide when doing step 3: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installation-guide
Note: NVIDIA officially supports Docker, containerd, and Podman, but this guide has only been verified with Docker; watch for differences if you use another runtime.
3. Install the NVIDIA Container Toolkit
- Set up the repository and GPG key
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
- Install the toolkit
sudo dnf clean expire-cache --refresh
sudo dnf install -y nvidia-container-toolkit
- Register the NVIDIA runtime in the Docker configuration
sudo nvidia-ctk runtime configure --runtime=docker
- Edit /etc/docker/daemon.json and make nvidia the default container runtime (required)
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
- Restart Docker and verify the runtime works
sudo systemctl restart docker
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
If you see output like the following, the configuration works:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 34C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
4. Install the K8s GPU share scheduler extender
- First deploy the scheduler extender by applying the following YAML:
# rbac.yaml
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: gpushare-schd-extender
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - create
      - patch
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - update
      - patch
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - bindings
      - pods/binding
    verbs:
      - create
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
      - list
      - watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpushare-schd-extender
  namespace: kube-system
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: gpushare-schd-extender
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpushare-schd-extender
subjects:
  - kind: ServiceAccount
    name: gpushare-schd-extender
    namespace: kube-system
# deployment.yaml
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: gpushare-schd-extender
  namespace: kube-system
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: gpushare
      component: gpushare-schd-extender
  template:
    metadata:
      labels:
        app: gpushare
        component: gpushare-schd-extender
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      hostNetwork: true
      tolerations:
        - effect: NoSchedule
          operator: Exists
          key: node-role.kubernetes.io/master
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
          operator: Exists
        - effect: NoSchedule
          operator: Exists
          key: node.cloudprovider.kubernetes.io/uninitialized
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      serviceAccount: gpushare-schd-extender
      containers:
        - name: gpushare-schd-extender
          image: registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-schd-extender:1.11-d170d8a
          env:
            - name: LOG_LEVEL
              value: debug
            - name: PORT
              value: "12345"
# service.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: gpushare-schd-extender
  namespace: kube-system
  labels:
    app: gpushare
    component: gpushare-schd-extender
spec:
  type: NodePort
  ports:
    - port: 12345
      name: http
      targetPort: 12345
      nodePort: 32766
  selector:
    # select the gpushare-schd-extender pods
    app: gpushare
    component: gpushare-schd-extender
- Add the scheduler policy configuration file under /etc/kubernetes
# scheduler-policy-config.yaml
---
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
extenders:
  # Calling the extender through the Service DNS name did not work in testing; the NodePort address must be used
  - urlPrefix: "http://gpushare-schd-extender.kube-system:12345/gpushare-scheduler"
    filterVerb: filter
    bindVerb: bind
    enableHTTPS: false
    nodeCacheCapable: true
    managedResources:
      - name: aliyun.com/gpu-mem
        ignoredByScheduler: false
    ignorable: false
Note: replace http://gpushare-schd-extender.kube-system:12345 above with {nodeIP}:{NodePort of gpushare-schd-extender} from your own deployment, otherwise the scheduler will not be able to reach the extender.
Query the NodePort with the following command:
kubectl get service gpushare-schd-extender -n kube-system -o jsonpath='{.spec.ports[?(@.name=="http")].nodePort}'
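As a sketch of how the final urlPrefix is composed, assuming the node IP 192.168.0.71 (from the sample output later in this article) and the NodePort 32766 (as pinned in service.yaml above) — substitute your own values:

```shell
# Example values only; replace with your node's IP and the NodePort
# reported by the kubectl query above.
NODE_IP="192.168.0.71"
NODE_PORT="32766"
echo "urlPrefix: http://${NODE_IP}:${NODE_PORT}/gpushare-scheduler"
```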
- Edit the Kubernetes scheduler manifest /etc/kubernetes/manifests/kube-scheduler.yaml
1. Add the following flag to the scheduler's command list:
- --config=/etc/kubernetes/scheduler-policy-config.yaml
2. Mount the policy file into the static pod
Under volumeMounts: add
- mountPath: /etc/kubernetes/scheduler-policy-config.yaml
  name: scheduler-policy-config
  readOnly: true
Under volumes: add
- hostPath:
    path: /etc/kubernetes/scheduler-policy-config.yaml
    type: FileOrCreate
  name: scheduler-policy-config
Note: be very careful with these edits; a mistake here can break the scheduler in confusing ways.
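Putting the pieces together, the relevant parts of kube-scheduler.yaml after the edits look roughly like this. This is a sketch: your kubeadm-generated manifest will contain more flags, mounts, and a different image tag, all of which should be left untouched; only the lines marked "added" are new.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - name: kube-scheduler
      image: registry.k8s.io/kube-scheduler:v1.24.0  # keep your existing image
      command:
        - kube-scheduler
        - --kubeconfig=/etc/kubernetes/scheduler.conf
        - --config=/etc/kubernetes/scheduler-policy-config.yaml  # added
        # ...keep the rest of your existing flags...
      volumeMounts:
        - mountPath: /etc/kubernetes/scheduler.conf
          name: kubeconfig
          readOnly: true
        - mountPath: /etc/kubernetes/scheduler-policy-config.yaml  # added
          name: scheduler-policy-config
          readOnly: true
  volumes:
    - hostPath:
        path: /etc/kubernetes/scheduler.conf
        type: FileOrCreate
      name: kubeconfig
    - hostPath:  # added
        path: /etc/kubernetes/scheduler-policy-config.yaml
        type: FileOrCreate
      name: scheduler-policy-config
```

Because kube-scheduler is a static pod, the kubelet restarts it automatically when this file changes; watch `kube-scheduler` in kube-system to confirm it comes back healthy.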
- Configure RBAC and install the device plugin
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
5. Label the GPU nodes
kubectl label node <target_node> gpushare=true
6. Install the kubectl GPU inspection plugin
cd /usr/bin/
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x /usr/bin/kubectl-inspect-gpushare
7. Verification
- Query GPU memory allocation across the cluster with kubectl:
# kubectl inspect gpushare
NAME IPADDRESS GPU0(Allocated/Total) GPU Memory(GiB)
cn-shanghai.i-uf61h64dz1tmlob9hmtb 192.168.0.71 6/15 6/15
cn-shanghai.i-uf61h64dz1tmlob9hmtc 192.168.0.70 3/15 3/15
------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
9/30 (30%)
- Create a workload that requests GPU memory and watch how it is scheduled:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pod specification
    metadata:
      labels:
        app: binpack-1
    spec:
      tolerations:
        - effect: NoSchedule
          key: cloudClusterNo
          operator: Exists
      containers:
        - name: binpack-1
          image: cheyang/gpu-player:v2
          resources:
            limits:
              # unit: GiB
              aliyun.com/gpu-mem: 3
8. Troubleshooting
If some component fails to install, check the device-plugin pod logs:
kubectl get po -n kube-system -o=wide | grep gpushare-device
kubectl logs -n kube-system <pod_name>
References:
NVIDIA Container Toolkit installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
Alibaba Cloud GPU share scheduler installation: https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md