GPU Environment Setup Guide: Using the GPU Operator to Accelerate Kubernetes GPU Setup

Published by 探索云原生 on 2024-11-14

[Figure: gpu-operator.png]

This article shares how to use the GPU Operator to quickly set up a GPU-enabled Kubernetes environment.

1. Overview

The previous article, GPU Usage Guide: How to Use GPUs on Bare Metal, in Docker, and in K8s, covered how to use GPUs in bare-metal, Docker, and Kubernetes environments.

The overall process is fairly straightforward, but because components such as the GPU Driver and the Container Toolkit have to be installed on every node, it becomes tedious once the cluster grows large.

To solve this problem, NVIDIA released the GPU Operator. The GPU Operator aims to simplify using GPUs in Kubernetes by automating the installation of the GPU driver, the Container Toolkit, the Device Plugin, monitoring, and other components.

It essentially automates everything that previously had to be installed and configured by hand, greatly simplifying GPU usage in a k8s environment.

ps: this only works with NVIDIA GPUs; for other vendors, installation is still largely manual.

2. Components

This section goes over the components involved in the GPU Operator and what each one does.

The NVIDIA GPU Operator consists of the following components:

  • NFD (Node Feature Discovery): labels nodes with information such as the CPU ID, kernel version, OS version, and whether the node has a GPU. The label to watch for is nvidia.com/gpu.present=true; a node carrying this label is a GPU node.
  • GFD (GPU Feature Discovery): collects the GPU device attributes of a node (GPU driver version, GPU model, and so on) and exposes them as node labels. It is deployed as a DaemonSet in the k8s cluster, and its Pods only run on nodes labeled nvidia.com/gpu.present=true.
    • Newer versions of GFD have moved into NVIDIA/k8s-device-plugin.
  • NVIDIA Driver Installer: installs the NVIDIA GPU driver on the node in a container-based way. It is deployed as a DaemonSet, and its Pods only run on nodes labeled nvidia.com/gpu.present=true.
  • NVIDIA Container Toolkit Installer: makes GPU devices usable inside containers. It is deployed as a DaemonSet, and likewise its Pods only run on nodes labeled nvidia.com/gpu.present=true.
  • NVIDIA Device Plugin: exposes GPU devices as Kubernetes extended resources so that workloads can request them. It is deployed as a DaemonSet, and its Pods only run on nodes labeled nvidia.com/gpu.present=true.
  • DCGM Exporter: periodically collects the status of the node's GPU devices (current temperature, total memory, used memory, utilization, and so on) and exposes it as Metrics, to be used together with Prometheus and Grafana. It is deployed as a DaemonSet, and its Pods only run on nodes labeled nvidia.com/gpu.present=true.

First come GFD and NFD. Both discover information about a Node and attach it to the k8s Node object as labels; in particular, the nvidia.com/gpu.present=true label marks a node as having a GPU, and only nodes carrying this label get the remaining components installed.
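
As a quick check (assuming kubectl access to the cluster), the GPU nodes can be listed by that label:

# List all nodes that have been marked as GPU nodes
kubectl get nodes -l nvidia.com/gpu.present=true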

Then come the Driver Installer and the Container Toolkit Installer, which install the GPU driver and the container toolkit.

Next, the device-plugin lets k8s discover GPU resource information for scheduling and management.

Finally, the exporter collects GPU metrics and exposes them in Prometheus Metrics format for GPU monitoring.

Together, these components automate just about everything that previously had to be configured by hand.
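
As a quick sanity check of the monitoring piece, the metrics can be scraped directly from a dcgm-exporter Pod. A minimal sketch; the app=nvidia-dcgm-exporter label selector and the default 9400 metrics port are assumptions based on a default gpu-operator install:

# Port-forward a dcgm-exporter Pod and pull one GPU utilization metric
POD=$(kubectl -n gpu-operator get pod -l app=nvidia-dcgm-exporter -o name | head -n 1)
kubectl -n gpu-operator port-forward "$POD" 9400:9400 &
PF_PID=$!
sleep 2
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
kill $PF_PID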

The NVIDIA GPU Operator deploys the components in the following order, and if one component fails to deploy, the components after it are not deployed:

  • NVIDIA Driver Installer
  • NVIDIA Container Toolkit Installer
  • NVIDIA Device Plugin
  • DCGM Exporter
  • GFD

Each component is deployed as a DaemonSet, and the Pods controlled by each DaemonSet only run on nodes carrying the label nvidia.com/gpu.present=true.

For nodes where the driver has already been installed manually, the Operator instead sets nvidia.com/gpu.deploy.driver=pre-installed, so the driver DaemonSet skips them (more on this in the deployment section).
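
The per-component deploy labels on a node can be inspected like this (NODE_NAME is a placeholder; requires jq):

# Show the nvidia.com/gpu.deploy.* labels the operator manages on a node
kubectl get node NODE_NAME -o json \
  | jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com/gpu.deploy")))'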

GFD & NFD

  • GFD: GPU Feature Discovery
  • NFD: Node Feature Discovery

As the names suggest, these two components discover node information and GPU information and attach them to the k8s Node object as Labels.

Labels added by NFD are prefixed with feature.node.kubernetes.io, for example:

feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/system-os_release.ID=ubuntu
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04

GFD, on the other hand, mainly records GPU information:

nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=2
nvidia.com/cuda.driver.major=535
nvidia.com/cuda.driver.minor=161
nvidia.com/gpu.product=Tesla-T4
nvidia.com/gpu.memory=15360
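
These labels can be read straight off the Node object, for example (NODE_NAME is a placeholder; requires jq):

# Print all nvidia.com/* labels published for a node, one key=value per line
kubectl get node NODE_NAME -o json \
  | jq -r '.metadata.labels | to_entries[] | select(.key | startswith("nvidia.com/")) | "\(.key)=\(.value)"'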

Driver Installer

NVIDIA provides a container-based way to install the NVIDIA driver, and this is what the GPU Operator uses to install the nvidia driver.

Once the NVIDIA driver is installed in containerized form, the overall architecture looks like the figure below:

[Figure: gpu-operator-driver-container.png]

The DaemonSet corresponding to the Driver Installer component is nvidia-driver-daemonset-5.15.0-105-generic-ubuntu22.04.

The image used by that DaemonSet is:

root@test:~# kgo get ds nvidia-driver-daemonset-5.15.0-105-generic-ubuntu22.04 -oyaml|grep image
        image: nvcr.io/nvidia/driver:535-5.15.0-105-generic-ubuntu22.04

The DaemonSet name/image is made up of several parts:

  • nvidia-driver-daemonset is the prefix
  • 5.15.0-105-generic is the kernel version, as reported by uname -r
  • ubuntu22.04 is the OS version, as reported by cat /etc/os-release
  • 535 is the GPU Driver version; here it means driver branch 535 is installed, and it can be specified at deployment time

The GPU Operator automatically derives the DaemonSet image from the node's kernel version and operating system. Because it runs as a DaemonSet, every node runs the same Pod, which means all GPU nodes in the cluster must have the same OS and kernel version.

ps: if the GPU driver has already been installed on a node manually, the GPU Operator detects this and does not start the Installer Pod on that node, so that node's OS and kernel version no longer matter.
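
A quick way to see, on a given node, the two values that end up in that tag:

# Run directly on the node
uname -r                                        # kernel version, e.g. 5.15.0-105-generic
. /etc/os-release && echo "${ID}${VERSION_ID}"  # OS, e.g. ubuntu22.04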

NVIDIA Container Toolkit Installer

This component installs the NVIDIA Container Toolkit.

When installing manually there are two steps:

  • 1) Install the NVIDIA Container Toolkit
  • 2) Modify the Runtime configuration to use nvidia-runtime

This inserts nvidia-container-runtime into the call chain so that GPU-related operations can be handled.

[Figure: nv-container-runtime-call-flow]

The Installer performs exactly these two steps:

  • 1) Move the command-line tools and library files that make up the NVIDIA Container Toolkit from the container into the /usr/local/nvidia/toolkit directory
  • 2) Create the nvidia-container-runtime configuration file config.toml under /usr/local/nvidia/toolkit/.config/nvidia-container-runtime, and set nvidia-container-cli.root to /run/nvidia/driver.
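
After the Installer has run, both steps can be verified directly on a GPU node (paths taken from the list above):

# The toolkit binaries and libraries copied to the host
ls /usr/local/nvidia/toolkit/
# The generated config; nvidia-container-cli.root should point at /run/nvidia/driver
grep -A 5 'nvidia-container-cli' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml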

3. Deployment

Refer to the official documentation: operator-install-guide

Preparation

Requirements:

1) GPU nodes must all run the same operating system.

  • If the driver is installed manually on a node beforehand, that node may run a different OS
  • There is no requirement on CPU-only nodes, since gpu-operator only runs on GPU nodes

2) GPU nodes must all use the same container engine, e.g. all containerd or all docker.

3) If Pod Security Admission (PSA) is in use, the gpu-operator namespace must be labeled for privileged mode:

kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

4) The cluster must not already have NFD installed; if it is, disable the NFD deployment when installing gpu-operator.

Check whether NFD is already deployed in the cluster with the following command:

kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'

If it returns true, NFD is already installed in the cluster.

Deploying with Helm

Official documentation: operator-install-guide

# Add the nvidia helm repo and update it
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
# Install with the default configuration
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator

# If the gpu driver was installed manually beforehand, disable driver installation in the operator
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.enabled=false

Once deployed, a Pod is started on each node to install the driver. If a node already has the driver installed, gpu-operator does not start the driver-install Pod there; the selection is done via labels:

  • Nodes without a driver are labeled nvidia.com/gpu.deploy.driver=true, meaning the driver needs to be installed
  • Nodes where the driver was installed manually are labeled nvidia.com/gpu.deploy.driver=pre-installed, and the DaemonSet does not run on them

Note that NVIDIA does not publish an image for every OS + kernel version combination; check the NVIDIA/driver tags page in advance for the driver versions NVIDIA currently provides.
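
If the default driver branch does not match what is available for your kernel/OS combination, a specific driver version can be pinned at install time. A sketch, assuming the chart's driver.version value; pick a version that actually exists as a tag under NVIDIA/driver:

# Install while pinning the driver version (example value; check the NVIDIA/driver tags page)
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.version=535.183.06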

Testing

After deployment, the relevant Pods are started in the gpu-operator namespace. Check that they are running: apart from one Completed Pod, everything else should be in the Running state.

root@test:~# kubectl -n gpu-operator get po
NAME                                                           READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-jdqpb                                    1/1     Running     0             35d
gpu-operator-67f8b59c9b-k989m                                  1/1     Running     6 (35d ago)   35d
nfd-node-feature-discovery-gc-5644575d55-957rp                 1/1     Running     6 (35d ago)   35d
nfd-node-feature-discovery-master-5bd568cf5c-c6t9s             1/1     Running     6 (35d ago)   35d
nfd-node-feature-discovery-worker-sqb7x                        1/1     Running     6 (35d ago)   35d
nvidia-container-toolkit-daemonset-rqgtv                       1/1     Running     0             35d
nvidia-cuda-validator-9kqnf                                    0/1     Completed   0             35d
nvidia-dcgm-exporter-8mb6v                                     1/1     Running     0             35d
nvidia-device-plugin-daemonset-7nkjw                           1/1     Running     0             35d
nvidia-driver-daemonset-5.15.0-105-generic-ubuntu22.04-g5dgx   1/1     Running     5 (35d ago)   35d
nvidia-operator-validator-6mqlm                                1/1     Running     0             35d

Then exec into the nvidia-driver-daemonset-xxx Pod, which handles the GPU Driver installation. Inside this Pod you can run the nvidia-smi command, for example to view GPU information:

root@j99cloudvm:~# kubectl -n gpu-operator exec -it nvidia-driver-daemonset-5.15.0-105-generic-ubuntu22.04-g5dgx -- nvidia-smi
Defaulted container "nvidia-device-plugin" out of: nvidia-device-plugin, config-manager, toolkit-validation (init), config-manager-init (init)
Wed Jul 17 01:49:35 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:00:07.0 Off |                    0 |
|  0%   46C    P0    88W / 300W |    484MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          Off  | 00000000:00:08.0 Off |                    0 |
|  0%   48C    P0    92W / 300W |  40916MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Finally, check the node object:

$ kubectl get node xxx -oyaml
status:
  addresses:
  - address: 172.18.187.224
    type: InternalIP
  - address: izj6c5dnq07p1ic04ei9vwz
    type: Hostname
  allocatable:
    cpu: "4"
    ephemeral-storage: "189889991571"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 15246720Ki
    nvidia.com/gpu: "1"
    pods: "110"
  capacity:
    cpu: "4"
    ephemeral-storage: 206043828Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 15349120Ki
    nvidia.com/gpu: "1"
    pods: "110"

Confirm that capacity includes the GPU; normally it should, like this:

  capacity:
    nvidia.com/gpu: "1"

At this point the GPU Operator has been installed successfully and K8s can see the GPUs on the node, so GPUs can now be used in Pods.

Create a test Pod that requests one GPU:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
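
Save the manifest (the file name below is just an example) and apply it, then watch the Pod until it finishes:

kubectl apply -f cuda-vectoradd.yaml
kubectl get pod cuda-vectoradd -w   # wait until STATUS shows Completed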

A healthy Pod produces logs like this:

$ kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

With that, we can use GPUs in k8s.


4. How It Works

This section analyzes, at a high level, how the Driver Installer and the NVIDIA Container Toolkit Installer components are implemented.

Driver Installer

As mentioned earlier, NVIDIA provides a container-based way to install the NVIDIA driver, and the GPU Operator installs the nvidia driver this way.

Once the NVIDIA driver is installed in containerized form, the overall architecture looks like the figure below:

[Figure: gpu-operator-driver-container.png]

Installation

The DaemonSet corresponding to the Driver Installer component is nvidia-driver-daemonset-5.15.0-105-generic-ubuntu22.04.

The image used by that DaemonSet is:

root@test:~# kgo get ds nvidia-driver-daemonset-5.15.0-105-generic-ubuntu22.04 -oyaml|grep image
        image: nvcr.io/nvidia/driver:535-5.15.0-105-generic-ubuntu22.04

The DaemonSet name/image is made up of several parts:

  • nvidia-driver-daemonset is the prefix
  • 5.15.0-105-generic is the kernel version, as reported by uname -r
  • ubuntu22.04 is the OS version, as reported by cat /etc/os-release
  • 535 is the GPU Driver version; here it means driver branch 535 is installed, and it can be specified at deployment time

Take a look at the Pod logs:

root@test:~# kubectl -n gpu-operator logs -f nvidia-driver-daemonset-5.15.0-105-generic-ubuntu22.04-g5dgx

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver branch 535 for Linux kernel version 5.15.0-105-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Installing NVIDIA driver kernel modules...
Reading package lists...
Building dependency tree...
Reading state information...
The following packages were automatically installed and are no longer required:
 ...
Setting up linux-modules-nvidia-535-server-5.15.0-105-generic (5.15.0-105.115+1) ...
linux-image-nvidia-5.15.0-105-generic: constructing .ko files
nvidia-drm.ko: OK
nvidia-modeset.ko: OK
nvidia-peermem.ko: OK
nvidia-uvm.ko: OK
nvidia.ko: OK
Processing triggers for linux-image-5.15.0-105-generic (5.15.0-105.115) ...
/etc/kernel/postinst.d/dkms:
 * dkms: running auto installation service for kernel 5.15.0-105-generic
   ...done.
Parsing kernel module parameters...
Loading ipmi and i2c_core kernel modules...
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
+ modprobe nvidia-uvm
+ modprobe nvidia-modeset
+ set +o xtrace -o nounset
Starting NVIDIA persistence daemon...
Mounting NVIDIA driver rootfs...
Done, now waiting for signal

As the logs show, the driver is installed first, and then several kernel modules are loaded.

To be able to install the driver from inside a container, the Pod mounts the driver-related host directories into the container via hostPath:

      volumes:
      - hostPath:
          path: /run/nvidia
          type: DirectoryOrCreate
        name: run-nvidia
      - hostPath:
          path: /etc/os-release
          type: ""
        name: host-os-release
      - hostPath:
          path: /run/nvidia-topologyd
          type: DirectoryOrCreate
        name: run-nvidia-topologyd
      - hostPath:
          path: /run/nvidia/validations
          type: DirectoryOrCreate
        name: run-nvidia-validations
      - hostPath:
          path: /sys
          type: Directory

Image Build

Let's look at how the image is built, based on the Dockerfile; the CentOS 8 Dockerfile serves as the example.

The file comes from: https://gitlab.com/nvidia/container-images/driver/-/blob/master/centos8/Dockerfile

FROM nvidia/cuda:11.4.1-base-centos8

ENV NVIDIA_VISIBLE_DEVICES=void

RUN NVIDIA_GPGKEY_SUM=d0664fbbdb8c32356d45de36c5984617217b2d0bef41b93ccecd326ba3b80c87 && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/D42D0685.pub | sed '/^Version/d' > /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA && \
    echo "$NVIDIA_GPGKEY_SUM  /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA" | sha256sum -c --strict -

# First install some dependencies
RUN dnf install -y \
        ca-certificates \
        curl \
        gcc \
        glibc.i686 \
        make \
        dnf-utils \
        kmod && \
    rm -rf /var/cache/dnf/*

RUN curl -fsSL -o /usr/local/bin/donkey https://github.com/3XX0/donkey/releases/download/v1.1.0/donkey && \
    curl -fsSL -o /usr/local/bin/extract-vmlinux https://raw.githubusercontent.com/torvalds/linux/master/scripts/extract-vmlinux && \
    chmod +x /usr/local/bin/donkey /usr/local/bin/extract-vmlinux

#ARG BASE_URL=http://us.download.nvidia.com/XFree86/Linux-x86_64
ARG BASE_URL=https://us.download.nvidia.com/tesla
ARG DRIVER_VERSION
ENV DRIVER_VERSION=$DRIVER_VERSION

# Then download and install the driver package. Note --no-kernel-module: only the userspace part is installed here
RUN cd /tmp && \
    curl -fSsl -O $BASE_URL/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run && \
    sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run -x && \
    cd NVIDIA-Linux-x86_64-$DRIVER_VERSION && \
    ./nvidia-installer --silent \
                       --no-kernel-module \
                       --install-compat32-libs \
                       --no-nouveau-check \
                       --no-nvidia-modprobe \
                       --no-rpms \
                       --no-backup \
                       --no-check-for-alternate-installs \
                       --no-libglx-indirect \
                       --no-install-libglvnd \
                       --x-prefix=/tmp/null \
                       --x-module-path=/tmp/null \
                       --x-library-path=/tmp/null \
                       --x-sysconfig-path=/tmp/null && \
    mkdir -p /usr/src/nvidia-$DRIVER_VERSION && \
    mv LICENSE mkprecompiled kernel /usr/src/nvidia-$DRIVER_VERSION && \
    sed '9,${/^\(kernel\|LICENSE\)/!d}' .manifest > /usr/src/nvidia-$DRIVER_VERSION/.manifest && \
    rm -rf /tmp/*

COPY nvidia-driver /usr/local/bin

WORKDIR /usr/src/nvidia-$DRIVER_VERSION

ARG PUBLIC_KEY=empty
COPY ${PUBLIC_KEY} kernel/pubkey.x509

ARG PRIVATE_KEY

# Remove cuda repository to avoid GPG errors
RUN rm -f /etc/yum.repos.d/cuda.repo

# Add NGC DL license from the CUDA image
RUN mkdir /licenses && mv /NGC-DL-CONTAINER-LICENSE /licenses/NGC-DL-CONTAINER-LICENSE

ENTRYPOINT ["nvidia-driver", "init"]

The nvidia-driver executed at the end is a shell script; its init function looks like this:

init() {
    echo -e "\n========== NVIDIA Software Installer ==========\n"
    echo -e "Starting installation of NVIDIA driver version ${DRIVER_VERSION} for Linux kernel version ${KERNEL_VERSION}\n"

    exec 3> ${PID_FILE}
    if ! flock -n 3; then
        echo "An instance of the NVIDIA driver is already running, aborting"
        exit 1
    fi
    echo $$ >&3

    trap "echo 'Caught signal'; exit 1" HUP INT QUIT PIPE TERM
    trap "_shutdown" EXIT

    _unload_driver || exit 1
    _unmount_rootfs

    if _kernel_requires_package; then
        _update_package_cache
        _resolve_kernel_version || exit 1
        _install_prerequisites
        _create_driver_package
        #_remove_prerequisites
        _cleanup_package_cache
    fi

    _install_driver
    _load_driver
    _mount_rootfs
    _write_kernel_update_hook

    echo "Done, now waiting for signal"
    sleep infinity &
    trap "echo 'Caught signal'; _shutdown && { kill $!; exit 0; }" HUP INT QUIT PIPE TERM
    trap - EXIT
    while true; do wait $! || continue; done
    exit 0
}

The _install_driver function then installs the driver. Since the userspace part was already installed when the image was built, --kernel-module-only is passed here to restrict the installation to the kernel modules.

This is also why the container-based installation is so fast: the userspace part of the driver was already installed at image build time.

# Link and install the kernel modules from a precompiled package using the nvidia-installer.
_install_driver() {
    local install_args=()

    echo "Installing NVIDIA driver kernel modules..."
    cd /usr/src/nvidia-${DRIVER_VERSION}
    rm -rf /lib/modules/${KERNEL_VERSION}/video

    if [ "${ACCEPT_LICENSE}" = "yes" ]; then
        install_args+=("--accept-license")
    fi
    nvidia-installer --kernel-module-only --no-drm --ui=none --no-nouveau-check ${install_args[@]+"${install_args[@]}"}
}

_load_driver loads the relevant kernel modules:

# Load the kernel modules and start persistenced.
_load_driver() {
    echo "Loading ipmi and i2c_core kernel modules..."
    modprobe -a i2c_core ipmi_msghandler ipmi_devintf

    echo "Loading NVIDIA driver kernel modules..."
    modprobe -a nvidia nvidia-uvm nvidia-modeset

    echo "Starting NVIDIA persistence daemon..."
    nvidia-persistenced --persistence-mode
}

_mount_rootfs mounts the driver rootfs under ${RUN_DIR}/driver (i.e. /run/nvidia/driver):

# Mount the driver rootfs into the run directory with the exception of sysfs.
_mount_rootfs() {
    echo "Mounting NVIDIA driver rootfs..."
    mount --make-runbindable /sys
    mount --make-private /sys
    mkdir -p ${RUN_DIR}/driver
    mount --rbind / ${RUN_DIR}/driver
}

That is the driver installation flow, and it matches the Pod logs we saw earlier.

Uninstallation is simply the reverse of these operations.

NVIDIA Container Toolkit Installer

This component installs the NVIDIA Container Toolkit.

When installing manually there are two steps:

  • 1) Install the NVIDIA Container Toolkit
  • 2) Modify the Runtime configuration to use nvidia-runtime

This inserts nvidia-container-runtime into the call chain so that GPU-related operations can be handled.

[Figure: nv-container-runtime-call-flow]

The Installer performs exactly these two steps:

  • 1) Move the command-line tools and library files that make up the NVIDIA Container Toolkit from the container into the /usr/local/nvidia/toolkit directory
  • 2) Create the nvidia-container-runtime configuration file config.toml under /usr/local/nvidia/toolkit/.config/nvidia-container-runtime, and set nvidia-container-cli.root to /run/nvidia/driver.

Installation

The DaemonSet for this Installer is nvidia-container-toolkit-daemonset.

The Pod's startup command is as follows:

      containers:
      - args:
        - /bin/entrypoint.sh
        command:
        - /bin/bash
        - -c      

The content of entrypoint.sh is stored in the nvidia-container-toolkit-entrypoint ConfigMap:

apiVersion: v1
data:
  entrypoint.sh: |-
    #!/bin/bash

    set -e

    driver_root=/run/nvidia/driver
    driver_root_ctr_path=$driver_root
    if [[ -f /run/nvidia/validations/host-driver-ready ]]; then
      driver_root=/
      driver_root_ctr_path=/host
    fi

    export NVIDIA_DRIVER_ROOT=$driver_root
    export DRIVER_ROOT_CTR_PATH=$driver_root_ctr_path

    #
    # The below delay is a workaround for an issue affecting some versions
    # of containerd starting with 1.6.9. Staring with containerd 1.6.9 we
    # started seeing the toolkit container enter a crashloop whereby it
    # would recieve a SIGTERM shortly after restarting containerd.
    #
    # Refer to the commit message where this workaround was implemented
    # for additional details:
    #   https://github.com/NVIDIA/gpu-operator/commit/963b8dc87ed54632a7345c1fcfe842f4b7449565
    #
    sleep 5

    exec nvidia-toolkit

It sets a few driver-related environment variables; the actual configuration work is done by the exec nvidia-toolkit line.

This Pod likewise mounts host directories into the container via hostPath so that it can modify files on the host:

      volumes:
      - hostPath:
          path: /run/nvidia
          type: DirectoryOrCreate
        name: nvidia-run-path
      - hostPath:
          path: /run/nvidia/validations
          type: DirectoryOrCreate
        name: run-nvidia-validations
      - hostPath:
          path: /run/nvidia/driver
          type: ""
        name: driver-install-path
      - hostPath:
          path: /usr/local/nvidia
          type: ""
        name: toolkit-install-dir
      - hostPath:
          path: /run/containers/oci/hooks.d
          type: ""
        name: crio-hooks
      - hostPath:
          path: /dev/char
          type: ""
        name: host-dev-char
      - hostPath:
          path: /var/run/cdi
          type: DirectoryOrCreate
        name: cdi-root
      - hostPath:
          path: /etc/docker
          type: DirectoryOrCreate
        name: docker-config
      - hostPath:
          path: /var/run
          type: ""
        name: docker-socket

Check the Pod logs to see the installation flow:

root@test:~# kubectl -n gpu-operator logs -f nvidia-container-toolkit-daemonset-rqgtv
# Installing the container toolkit
time="2024-06-12T02:07:58Z" level=info msg="Parsing arguments"
time="2024-06-12T02:07:58Z" level=info msg="Starting nvidia-toolkit"
time="2024-06-12T02:07:58Z" level=info msg="Verifying Flags"
time="2024-06-12T02:07:58Z" level=info msg=Initializing
time="2024-06-12T02:07:58Z" level=info msg="Installing toolkit"
time="2024-06-12T02:07:58Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"

# Updating the configuration
time="2024-06-12T02:07:58Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2024-06-12T02:07:58Z" level=info msg="Skipping unset option: nvidia-container-runtime.debug"
time="2024-06-12T02:07:58Z" level=info msg="Skipping unset option: nvidia-container-runtime.log-level"
time="2024-06-12T02:07:58Z" level=info msg="Skipping unset option: nvidia-container-runtime.mode"
time="2024-06-12T02:07:58Z" level=info msg="Skipping unset option: nvidia-container-runtime.modes.cdi.annotation-prefixes"
time="2024-06-12T02:07:58Z" level=info msg="Skipping unset option: nvidia-container-runtime.runtimes"
time="2024-06-12T02:07:58Z" level=info msg="Skipping unset option: nvidia-container-cli.debug"
Using config:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]
  log-level = "info"
  mode = "auto"
  runtimes = ["docker-runc", "runc"]

  [nvidia-container-runtime.modes]

    [nvidia-container-runtime.modes.cdi]
      default-kind = "management.nvidia.com/gpu"

    [nvidia-container-runtime.modes.csv]
      mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
  path = "/usr/local/nvidia/toolkit/nvidia-container-runtime-hook"
  skip-mode-detection = true

[nvidia-ctk]
  path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2024-06-12T02:07:58Z" level=info msg="Setting up runtime"
time="2024-06-12T02:07:58Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2024-06-12T02:07:58Z" level=info msg="Successfully parsed arguments"
time="2024-06-12T02:07:58Z" level=info msg="Starting 'setup' for docker"
time="2024-06-12T02:07:58Z" level=info msg="Loading docker config from /runtime/config-dir/daemon.json"
time="2024-06-12T02:07:58Z" level=info msg="Successfully loaded config"
time="2024-06-12T02:07:58Z" level=info msg="Flushing config to /runtime/config-dir/daemon.json"

Just like a manual installation, it happens in two steps.

Looking at the Docker configuration file on the host, it has indeed been modified: default-runtime is now nvidia.

root@test:~# cat /etc/docker/daemon.json
{
    "data-root": "/var/lib/docker",
    "default-runtime": "nvidia",
    "exec-opts": [
        "native.cgroupdriver=systemd"
    ],
    "log-driver": "json-file",
    "log-opts": {
        "max-file": "3",
        "max-size": "100m"
    },
    "registry-mirrors": [
        "https://docker.chenby.cn"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
        },
        "nvidia-cdi": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"
        },
        "nvidia-experimental": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental"
        },
        "nvidia-legacy": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy"
        }
    },
    "storage-driver": "overlay2",
    "storage-opts": [
        "overlay2.override_kernel_check=true"
    ]
}
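
The same can be confirmed from the Docker side, since docker info prints the registered runtimes and the default runtime:

# docker info should now list the nvidia runtimes and report nvidia as the default runtime
docker info | grep -iE 'default runtime|runtimes'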

That is the installation part of the NVIDIA Container Toolkit Installer; the concrete code is covered in the next section, Image Build.

Image Build

The Installer code has been merged into the tools directory of the nvidia-container-toolkit repository, with a separate implementation for each Runtime. The containerd implementation, for example, lives in containerd.go; part of it is shown below:

// Setup updates a containerd configuration to include the nvidia-containerd-runtime and reloads it
func Setup(c *cli.Context, o *options) error {
	log.Infof("Starting 'setup' for %v", c.App.Name)

	cfg, err := containerd.New(
		containerd.WithPath(o.Config),
		containerd.WithRuntimeType(o.runtimeType),
		containerd.WithUseLegacyConfig(o.useLegacyConfig),
		containerd.WithContainerAnnotations(o.containerAnnotationsFromCDIPrefixes()...),
	)
	if err != nil {
		return fmt.Errorf("unable to load config: %v", err)
	}

	runtimeConfigOverride, err := o.runtimeConfigOverride()
	if err != nil {
		return fmt.Errorf("unable to parse config overrides: %w", err)
	}
	err = o.Configure(cfg, runtimeConfigOverride)
	if err != nil {
		return fmt.Errorf("unable to configure containerd: %v", err)
	}

	err = RestartContainerd(o)
	if err != nil {
		return fmt.Errorf("unable to restart containerd: %v", err)
	}

	log.Infof("Completed 'setup' for %v", c.App.Name)

	return nil
}
// Cleanup reverts a containerd configuration to remove the nvidia-containerd-runtime and reloads it
func Cleanup(c *cli.Context, o *options) error {
	log.Infof("Starting 'cleanup' for %v", c.App.Name)

	cfg, err := containerd.New(
		containerd.WithPath(o.Config),
		containerd.WithRuntimeType(o.runtimeType),
		containerd.WithUseLegacyConfig(o.useLegacyConfig),
		containerd.WithContainerAnnotations(o.containerAnnotationsFromCDIPrefixes()...),
	)
	if err != nil {
		return fmt.Errorf("unable to load config: %v", err)
	}

	err = o.Unconfigure(cfg)
	if err != nil {
		return fmt.Errorf("unable to unconfigure containerd: %v", err)
	}

	err = RestartContainerd(o)
	if err != nil {
		return fmt.Errorf("unable to restart containerd: %v", err)
	}

	log.Infof("Completed 'cleanup' for %v", c.App.Name)

	return nil
}

Here:

  • Setup modifies the Runtime configuration to add the nvidia runtime
  • Cleanup removes the nvidia runtime from the Runtime configuration
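
On a containerd node, the effect of Setup can be spot-checked in the containerd config (the default config path is assumed here):

# After Setup has run, containerd's config should contain an nvidia runtime entry
grep -A 3 'runtimes.nvidia' /etc/containerd/config.toml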

The corresponding Dockerfile looks like this:

# Copyright (c) 2019-2021, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

ARG GOLANG_VERSION=x.x.x
ARG VERSION="N/A"

FROM nvidia/cuda:12.5.1-base-ubi8 as build

RUN yum install -y \
    wget make git gcc \
     && \
    rm -rf /var/cache/yum/*

ARG GOLANG_VERSION=x.x.x
RUN set -eux; \
    \
    arch="$(uname -m)"; \
    case "${arch##*-}" in \
        x86_64 | amd64) ARCH='amd64' ;; \
        ppc64el | ppc64le) ARCH='ppc64le' ;; \
        aarch64 | arm64) ARCH='arm64' ;; \
        *) echo "unsupported architecture" ; exit 1 ;; \
    esac; \
    wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-${ARCH}.tar.gz \
    | tar -C /usr/local -xz


ENV GOPATH /go
ENV PATH $GOPATH/bin:/usr/local/go/bin:$PATH

WORKDIR /build
COPY . .

# NOTE: Until the config utilities are properly integrated into the
# nvidia-container-toolkit repository, these are built from the `tools` folder
# and not `cmd`.
RUN GOPATH=/artifacts go install -ldflags="-s -w -X 'main.Version=${VERSION}'" ./tools/...


FROM nvidia/cuda:12.5.1-base-ubi8

ENV NVIDIA_DISABLE_REQUIRE="true"
ENV NVIDIA_VISIBLE_DEVICES=void
ENV NVIDIA_DRIVER_CAPABILITIES=utility

ARG ARTIFACTS_ROOT
ARG PACKAGE_DIST
COPY ${ARTIFACTS_ROOT}/${PACKAGE_DIST} /artifacts/packages/${PACKAGE_DIST}

WORKDIR /artifacts/packages

ARG PACKAGE_VERSION
ARG TARGETARCH
ENV PACKAGE_ARCH ${TARGETARCH}
RUN PACKAGE_ARCH=${PACKAGE_ARCH/amd64/x86_64} && PACKAGE_ARCH=${PACKAGE_ARCH/arm64/aarch64} && \
    yum localinstall -y \
    ${PACKAGE_DIST}/${PACKAGE_ARCH}/libnvidia-container1-1.*.rpm \
    ${PACKAGE_DIST}/${PACKAGE_ARCH}/libnvidia-container-tools-1.*.rpm \
    ${PACKAGE_DIST}/${PACKAGE_ARCH}/nvidia-container-toolkit*-${PACKAGE_VERSION}*.rpm

WORKDIR /work

COPY --from=build /artifacts/bin /work

ENV PATH=/work:$PATH

LABEL io.k8s.display-name="NVIDIA Container Runtime Config"
LABEL name="NVIDIA Container Runtime Config"
LABEL vendor="NVIDIA"
LABEL version="${VERSION}"
LABEL release="N/A"
LABEL summary="Automatically Configure your Container Runtime for GPU support."
LABEL description="See summary"

RUN mkdir /licenses && mv /NGC-DL-CONTAINER-LICENSE /licenses/NGC-DL-CONTAINER-LICENSE

ENTRYPOINT ["/work/nvidia-toolkit"]

This part is fairly simple: it compiles the binaries and installs a few RPM packages they depend on.


5. Summary

This article showed how to use the GPU Operator to automate the deployment of the GPU Driver, NVIDIA Container Toolkit, device-plugin, exporter, and other components, making it quick to use GPUs in a k8s environment.

It then briefly analyzed how the Driver Installer and NVIDIA Container Toolkit Installer components work.


The GPU Operator greatly simplifies the otherwise tedious process of using GPUs in k8s, but it also has some drawbacks:

  • The Driver Installer runs as a DaemonSet, so every node runs the same Pod, yet the image tag is composed of driver version + kernel version + OS version; as a result, all GPU nodes in the cluster must run the same OS and kernel
  • The NVIDIA Container Toolkit Installer also runs as a DaemonSet, and the Runtime has to be specified at install time, which forces all nodes in the cluster to use the same Container Runtime

6. References

gpu-operator getting-started

About the NVIDIA GPU Operator

nvidia-container-toolkit

NVIDIA GPU Operator分析一:NVIDIA驅動安裝
