Configuring Kubernetes (k8s) to Use GPUs

Posted by 清风6661 on 2024-07-23

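1. Install the GPU driver and CUDA

Download the driver from https://www.nvidia.com/Download/index.aspx?lang=en-us, choosing the package that matches your system and graphics card model.

Install CUDA from https://developer.nvidia.com/cuda-toolkit-archive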
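
Once both are installed, a quick sanity check can confirm that the tools are visible on the host. The sketch below assumes that nvidia-smi was installed with the driver and nvcc with the CUDA toolkit. check_tool is a hypothetical helper name, and nvcc may only be found after the CUDA bin directory (commonly /usr/local/cuda/bin) has been added to PATH.

```shell
# check_tool is a hypothetical helper: run a command if it is on PATH,
# otherwise print a note and return non-zero.
check_tool() {
    tool="$1"
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        "$tool" "$@"
    else
        echo "$tool: not found on PATH"
        return 1
    fi
}

# nvidia-smi ships with the driver; it lists the GPUs the driver can see.
check_tool nvidia-smi || true
# nvcc ships with the CUDA toolkit; it prints the installed toolkit version.
check_tool nvcc --version || true
```

If nvidia-smi fails here, the later container and Kubernetes steps will not see a GPU either, so it is worth fixing the driver first.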

2. Install nvidia-docker2

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum-config-manager --enable nvidia-container-toolkit-experimental
sudo yum install nvidia-docker2

For other distributions, see:

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-yum-or-dnf

This repository contains both nvidia-docker2 and nvidia-container-toolkit; either one should be enough to give containers access to the GPU.

Modify the Docker configuration so that nvidia is the default runtime:

vim /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
systemctl restart docker
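
After the restart it is worth confirming that Docker actually picked up the new default. This is a minimal sketch, assuming the file lives at /etc/docker/daemon.json as above. default_runtime is a hypothetical helper that simply extracts the key from the file, while docker info --format '{{.DefaultRuntime}}' asks the running daemon.

```shell
# default_runtime is a hypothetical helper: print the "default-runtime"
# value found in a daemon.json file (path given as the first argument).
default_runtime() {
    sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

cfg=/etc/docker/daemon.json
if [ -f "$cfg" ]; then
    echo "configured default runtime: $(default_runtime "$cfg")"
fi

# Ask the running daemon which runtime it uses by default; it should be "nvidia".
if command -v docker >/dev/null 2>&1; then
    docker info --format '{{.DefaultRuntime}}' || echo "docker daemon not reachable"
fi
```

If the file says nvidia but the daemon reports something else, the restart most likely did not take effect or the JSON file failed to parse.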

3. Install k8s-device-plugin

wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.0/nvidia-device-plugin.yml
kubectl create -f nvidia-device-plugin.yml

Check that the plugin is running:

kubectl get pods -n kube-system

If no nvidia-device-plugin-daemonset pod is running, check whether the DaemonSet defined in nvidia-device-plugin.yml was created and is healthy. Describing the DaemonSet shows its status and recent events:

kubectl describe daemonset nvidia-device-plugin-daemonset -n kube-system
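
Once the plugin pod is running, a throwaway pod that requests one GPU is a simple end-to-end check. The following is a sketch, not an official manifest: the pod name gpu-test, the helper write_gpu_test_pod and the file path are made up for illustration, and the image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 is assumed to be pullable from your cluster (substitute any CUDA image you have).

```shell
# write_gpu_test_pod is a hypothetical helper: write a pod manifest that
# requests one nvidia.com/gpu to the file given as the first argument.
write_gpu_test_pod() {
    cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Assumed image; replace with a CUDA image available to your cluster.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

manifest=/tmp/gpu-test-pod.yaml
write_gpu_test_pod "$manifest"

if command -v kubectl >/dev/null 2>&1; then
    kubectl apply -f "$manifest" || echo "kubectl apply failed"
    # The pod stays Pending if no node advertises nvidia.com/gpu.
    kubectl get pod gpu-test || true
fi
```

A pod that stays Pending with an "Insufficient nvidia.com/gpu" event points to the runtime configuration covered in the troubleshooting section.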

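4. Troubleshooting

GPU pods cannot be scheduled on a single-node K8S cluster

Remove the taint from the master node. By default the master node carries a taint that keeps ordinary pods, including GPU pods, from being scheduled onto it, so on a single-node cluster the taint has to be removed.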
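
The taint can be removed with kubectl taint. This is a sketch, assuming a kubeadm-style cluster: the taint key is node-role.kubernetes.io/master on older releases and node-role.kubernetes.io/control-plane on newer ones, so the hypothetical helper untaint_control_plane tries both and ignores the one that is absent.

```shell
# untaint_control_plane is a hypothetical helper: remove the NoSchedule
# taint from all nodes, trying both the old and the new taint key.
untaint_control_plane() {
    for key in node-role.kubernetes.io/master node-role.kubernetes.io/control-plane; do
        # The trailing "-" removes the taint; a missing key is not an error here.
        kubectl taint nodes --all "${key}:NoSchedule-" 2>/dev/null || true
    done
}

if command -v kubectl >/dev/null 2>&1; then
    untaint_control_plane
    # Show the remaining taints on each node.
    kubectl describe nodes | grep -i taints || true
else
    echo "kubectl not found on PATH"
fi
```

Only do this on a single-node or test cluster; on a multi-node cluster the taint protects the control plane from ordinary workloads.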

Insufficient nvidia.com/gpu

When querying the plugin's logs with kubectl logs, the following error appeared:

[Screenshot of the error output]

When this problem occurs, check carefully whether nvidia is configured as the default runtime in /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

If you changed the file, restart Docker (systemctl restart docker).
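
To confirm the fix, check that the node now advertises the nvidia.com/gpu resource. This is a minimal check, assuming kubectl is configured for the cluster; gpu_lines is a hypothetical helper that filters the output of kubectl describe nodes down to the lines mentioning nvidia.com/gpu.

```shell
# gpu_lines is a hypothetical helper: keep only the lines of its input
# that mention the nvidia.com/gpu resource.
gpu_lines() {
    grep 'nvidia\.com/gpu' || true
}

if command -v kubectl >/dev/null 2>&1; then
    # A non-zero nvidia.com/gpu count under Capacity and Allocatable means
    # the device plugin has registered the GPUs with the kubelet.
    kubectl describe nodes | gpu_lines
fi
```

No output at all means no node is advertising a GPU yet, which matches the "Insufficient nvidia.com/gpu" symptom.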
