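A project at my company recently needed to run GPU workloads in a k8s cluster, so I looked into how to set it up. The deployment steps are below.
1. First, you need a working k8s cluster. For cluster deployment, see the article "kubeadm安裝k8s" (installing k8s with kubeadm).
2. Prepare the GPU node
2.1 Install the driver
apt-get install cuda-drivers-455 # install the version that matches your setup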
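Before moving on, it can be worth confirming that the driver is visible on the node. The sketch below is only an illustration: `list_gpus` is a hypothetical helper name, and it relies on `nvidia-smi`, which is normally installed together with the driver.

```shell
# Hypothetical helper: list each GPU with the driver version that nvidia-smi reports.
list_gpus() {
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}

# Only call it where nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  list_gpus
fi
```

If nothing is printed, or `nvidia-smi` reports an error, the driver is probably not loaded yet; a reboot after installation is often needed.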
2.2 Install nvidia-docker2
<!-- Note that you need to install the nvidia-docker2 package and not the nvidia-container-toolkit. This is because the new --gpus options hasn't reached kubernetes yet -->
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2

## Add the following to /etc/docker/daemon.json so that nvidia becomes the default runtime
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

## Restart docker
sudo systemctl restart docker
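One way to confirm the daemon.json edit is to read the default runtime back out of the file. This is a minimal sketch, not part of nvidia-docker2: `default_runtime_of` is a made-up helper name, and it uses a simple sed pattern rather than a real JSON parser, so it assumes the key and its value sit on the same line.

```shell
# Hypothetical helper: print the "default-runtime" value from a Docker daemon.json.
# Assumes the key and its value are on the same line; this is not a full JSON parser.
default_runtime_of() {
  sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Only inspect the real file if it exists on this machine.
if [ -f /etc/docker/daemon.json ]; then
  echo "default runtime: $(default_runtime_of /etc/docker/daemon.json)"
fi
```

After docker has been restarted, `docker info` should likewise report nvidia as the default runtime.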
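2.3 Install nvidia-device-plugin in the k8s cluster
This is what makes the cluster GPU-aware.
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.7.3/nvidia-device-plugin.yml
# If the file is unreachable because of network problems, open https://github.com/NVIDIA/k8s-device-plugin/blob/v0.7.3/nvidia-device-plugin.yml in a browser,
## copy its contents into a local file, and create it from there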
The device plugin is a DaemonSet that automatically lets you:
Expose the number of GPUs on each node of your cluster
Keep track of the health of your GPUs
Run GPU-enabled containers in your Kubernetes cluster
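Next, join the node to the k8s cluster.
Once the steps above have completed successfully, run the following commands. Output similar to the image below means the plugin is installed.
kubectl get pod --all-namespaces | grep nvidia
kubectl describe node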
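The part of the `kubectl describe node` output that matters here is the `nvidia.com/gpu` entry, which the device plugin adds under Capacity and Allocatable. A minimal sketch for pulling those lines out follows; `gpu_resource_lines` is a hypothetical helper name, and it reads the describe output from standard input.

```shell
# Hypothetical helper: keep only the "nvidia.com/gpu: N" lines from
# `kubectl describe node` output supplied on standard input.
gpu_resource_lines() {
  grep -E '^[[:space:]]*nvidia\.com/gpu:' || true
}

# Usage (needs access to the cluster):
#   kubectl describe node | gpu_resource_lines
```

Empty output suggests that the plugin has not registered any GPUs on the node yet.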
3. Run GPU Jobs
# cat nvidia-gpu-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
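Note that GPUs are not shared between containers, so each container's limit counts separately and the whole pod has to fit on a single node. With two containers requesting 2 GPUs each, this manifest needs a node with 4 allocatable GPUs. The sketch below adds up the `nvidia.com/gpu` limits in a manifest; `total_gpu_limit` is a hypothetical helper, and the awk pattern assumes a normally formatted YAML file with each limit on its own line.

```shell
# Hypothetical helper: sum the nvidia.com/gpu limits declared in a pod manifest.
# Assumes each limit is written on its own line as "nvidia.com/gpu: N".
total_gpu_limit() {
  awk '/nvidia\.com\/gpu:/ { sub(/#.*/, ""); total += $2 } END { print total + 0 }' "$1"
}

# Only run against the demo manifest if it is present in the current directory.
if [ -f nvidia-gpu-demo.yaml ]; then
  echo "GPUs required on one node: $(total_gpu_limit nvidia-gpu-demo.yaml)"
fi
```

If no node has that many GPUs free, the pod is likely to stay Pending, so lowering the limits is the first thing to try on a smaller machine.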
kubectl apply -f nvidia-gpu-demo.yaml
kubectl exec -it xxx-76dd5bd849-hlmdr -- bash # replace with the name of your pod (gpu-pod for the manifest above)
# nvidia-smi
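Because the demo pod has two containers, it can help to wait for it to become Ready and then name the container explicitly when running `nvidia-smi`. This is a minimal sketch, assuming the pod and container names from the demo manifest (`gpu-pod`, `cuda-container`); `check_gpu_in_pod` is a hypothetical wrapper, and the 120s timeout is an arbitrary choice.

```shell
# Hypothetical wrapper: wait for a pod to be Ready, then run nvidia-smi
# in the named container. $1 is the pod name, $2 is the container name.
check_gpu_in_pod() {
  kubectl wait --for=condition=Ready "pod/$1" --timeout=120s &&
    kubectl exec "$1" -c "$2" -- nvidia-smi
}

# Usage (needs access to the cluster):
#   check_gpu_in_pod gpu-pod cuda-container
```

Without `-c`, `kubectl exec` picks a default container (normally the first one listed in the manifest), so naming it avoids ambiguity.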
That is all it takes for a basic setup in which k8s schedules GPU tasks.