1. Install the GPU driver and CUDA
Download the driver from https://www.nvidia.com/Download/index.aspx?lang=en-us and pick the one that matches your graphics card model.
Install CUDA from: https://developer.nvidia.com/cuda-toolkit-archive
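Before moving on, it can help to confirm that the driver and CUDA tools are actually reachable. The following is a minimal sketch and not part of the original steps: check_gpu_stack is a hypothetical helper name, and it assumes nvidia-smi and nvcc are on PATH (nvcc often lives under /usr/local/cuda/bin and may need to be added to PATH first).

```shell
# Hypothetical helper: report whether the driver and CUDA tools are on PATH.
# Returns 0 only if both nvidia-smi and nvcc are found.
check_gpu_stack() {
    rc=0
    for tool in nvidia-smi nvcc; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}
```

If both tools are found, running nvidia-smi and nvcc --version prints the driver and CUDA compiler versions respectively.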
2. Install nvidia-docker2
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum-config-manager --enable nvidia-container-toolkit-experimental
sudo yum install nvidia-docker2
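For other distributions, see:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-yum-or-dnf
This repository contains both nvidia-docker2 and nvidia-container-toolkit; either of the two should be enough to give containers access to the GPU.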
Modify the Docker configuration so that nvidia is the default runtime:
sudo vim /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
sudo systemctl restart docker
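After the restart, one way to confirm the change took effect is to ask Docker which runtime it uses by default. This is a minimal sketch under stated assumptions: default_runtime_is_nvidia is a hypothetical helper, and it relies on docker info exposing the DefaultRuntime field through --format.

```shell
# Hypothetical helper: succeed only if Docker's default runtime is "nvidia".
# Returns 2 if docker info itself fails (e.g. the daemon is not running).
default_runtime_is_nvidia() {
    runtime=$(docker info --format '{{.DefaultRuntime}}' 2>/dev/null) || return 2
    echo "default runtime: ${runtime}"
    [ "$runtime" = "nvidia" ]
}
```

A non-zero return after editing daemon.json usually means the file was not picked up, for example because of a JSON syntax error or because Docker was not restarted.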
3. Install k8s-device-plugin
wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.0/nvidia-device-plugin.yml
kubectl create -f nvidia-device-plugin.yml
Check that the plugin is running:
kubectl get pods -n kube-system
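If no nvidia-device-plugin-daemonset pod is running, check whether the DaemonSet created from nvidia-device-plugin.yml is healthy. The manifest deploys it into the kube-system namespace, so describe it there and read its events:
kubectl describe daemonset nvidia-device-plugin-daemonset -n kube-system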
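Once the plugin pod is running, a quick end-to-end check is to schedule a pod that requests one GPU. The sketch below is illustrative and not from the original notes: gpu_test_pod_manifest and the pod name gpu-test are hypothetical, and the image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 is assumed to be pullable from your cluster (substitute any CUDA image you have).

```shell
# Hypothetical helper: print a minimal Pod manifest that requests one GPU.
# The image name is an assumption; replace it with one available to you.
gpu_test_pod_manifest() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

To try it, pipe the manifest into kubectl with gpu_test_pod_manifest | kubectl apply -f - and then read the result with kubectl logs gpu-test. A pod that stays Pending points to a scheduling problem rather than a problem inside the container.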
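4. Troubleshooting
A single-machine K8s cluster cannot schedule GPU pods
Remove the taint from the master node. By default a K8s cluster does not allow ordinary pods, GPU pods included, to be scheduled onto the master node, so the taint has to be removed first.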
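The taint removal described above might look like the following. This is a sketch with assumptions: untaint_control_plane is a hypothetical helper, and the taint key depends on the Kubernetes version (node-role.kubernetes.io/master on older clusters, node-role.kubernetes.io/control-plane on newer ones), so both are tried and a failure for either one is ignored.

```shell
# Hypothetical helper: remove the master/control-plane NoSchedule taint
# from all nodes. The trailing "-" after the key tells kubectl to remove it.
untaint_control_plane() {
    for key in node-role.kubernetes.io/master node-role.kubernetes.io/control-plane; do
        # kubectl exits non-zero when the taint is absent; that is fine here.
        kubectl taint nodes --all "${key}-" || true
    done
}
```

This is only appropriate for a single-node or test cluster; on a multi-node cluster the taint normally stays in place so that workloads land on worker nodes.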
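Insufficient nvidia.com/gpu
This problem showed up together with errors in the plugin's logs (viewed with kubectl logs). When it occurs, carefully check that nvidia is configured as the default runtime in /etc/docker/daemon.json: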
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
If you changed the file, restart Docker.
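To see whether each node actually advertises GPUs once the runtime is fixed, the allocatable nvidia.com/gpu count can be listed per node. This is a minimal sketch, assuming kubectl access to the cluster; nodes_without_gpu is a hypothetical filter that reads "name count" lines on stdin, and the custom-columns expression in the usage comment is an assumption about how your kubectl renders the field.

```shell
# Hypothetical filter: print the names of nodes whose allocatable GPU count
# is missing or zero. Expects lines of the form "<node-name> <gpu-count>".
nodes_without_gpu() {
    awk 'NF > 0 && ($2 == "<none>" || $2 == "" || $2 == "0") { print $1 }'
}

# Usage (requires a reachable cluster):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | nodes_without_gpu
```

Any node printed by the filter is not offering GPUs to the scheduler, which is consistent with pods reporting Insufficient nvidia.com/gpu.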