How to Build a Machine Learning Server

Posted by NeoNexus on 2024-03-27

Original URL: https://www.cnblogs.com/NeoNexus/p/18099946

Version: V1.0

Author: NeoNexus

Date: 2024.03.26

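Design requirements for the server: high performance, support for several people developing at the same time, a clear architecture, and easy maintenance later on. This document was written to record how that was achieved.

It is based on an up-to-date architecture and up-to-date hardware.

Notice: under revision

This article is in Preview and no responsibility is taken for its content. Comments and discussion are welcome.
Some sections have not been filled in yet; the article will be formally released once they are.

Contact (WeChat): NeoNexusX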

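How to Build a Machine Learning Server
- Notice: under revision
System Information
- System Installation
Hardware Configuration
- Hardware Installation Guide
- CPU
- GPU
- Disk Partitioning Result
- Ethernet and IP Settings
Basic Configuration
- JetBrains IDE & VSCode Installation
  - What are JetBrains shell scripts for?
- MATLAB Installation and Configuration
- RStudio Server Installation
- VSCode Installation and Configuration
- Sending Software Shortcuts (.desktop) to User Desktops
- Intranet Disk Mapping
  - Using the SAMBA Service
- Installing Git
- User and Group Management
  - Viewing File Permissions
  - Viewing Current Users
- Docker Deployment
Deep Learning Configuration
- Installing Python and Pip
- Installing the CUDA Toolkit
- Installing cuDNN
- Installing the Anaconda Environment
- PyTorch Installation
- NVIDIA Container Toolkit
  - Running a Docker deployment test
  - This should be configured with a Dockerfile; to be updated later, done manually for now.
  - Optional: operating the Docker daemon rootless
References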

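System Information

System Installation

Installing the operating system is not covered in detail here. I recommend Ventoy for the bootable USB drive: it is quick and convenient, can hold several images at once, and does not need to be rebuilt for each one. Note that you need to partition the disk manually during installation; my partitioning result is as follows:

After installing, first confirm that the system version and related details are what you expect:

Run: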

uname -m && cat /etc/*release

Output:

x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
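
The manual check above can also be scripted. The following is a minimal sketch, assuming the distribution provides an os-release style file (on Ubuntu 22.04 this is /etc/os-release); the helper name check_release is hypothetical and not part of the original setup.

```shell
# check_release FILE EXPECTED_ID EXPECTED_VERSION_ID  (hypothetical helper)
# Succeeds when the os-release style FILE reports the expected distribution.
check_release() {
    file=$1 want_id=$2 want_ver=$3
    # Source the file in a subshell so its variables do not leak out.
    (
        . "$file"
        [ "$ID" = "$want_id" ] && [ "$VERSION_ID" = "$want_ver" ]
    )
}

# Demo with a temporary file holding two values copied from the output above;
# on a real machine pass /etc/os-release instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID=ubuntu
VERSION_ID="22.04"
EOF
if check_release "$sample" ubuntu 22.04; then
    echo "release matches"
else
    echo "release differs"
fi
rm -f "$sample"
```

On the server itself the call would be check_release /etc/os-release ubuntu 22.04, which is convenient at the top of a provisioning script.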

GCC version:

gcc --version

bionet@Bionet:/usr/local/cuda-12.4$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

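CUDA will be configured later, so its requirements are listed here first; check whether your system meets them:

The figure below comes from the official CUDA documentation: 1. Introduction, Installation Guide for Linux 12.4 documentation (nvidia.com)

Hardware Configuration

Hardware Installation Guide

The server has several "incomplete" PCIe slots. What does incomplete mean? See the figure below:

They are not suitable for graphics cards, so adapter cards are used to install M.2 NVMe SSDs on PCIe instead, which is stable and fast.

Sequential reads and writes are 5 to 10 times faster than ordinary SATA. The test results are as follows: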

CPU

bionet@Bionet:~$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           2
    Stepping:            1
    CPU max MHz:         3000.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            4199.71
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-7,16-23
  NUMA node1 CPU(s):     8-15,24-31

GPU

bionet@Bionet:~$ nvidia-smi
Sat Mar 23 19:30:36 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:04:00.0 Off |                  N/A |
|  0%   27C    P8              16W / 300W |      1MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:09:00.0 Off |                  Off |
|  0%   29C    P8              20W / 450W |      1MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:86:00.0 Off |                  N/A |
|  0%   28C    P8              13W / 300W |      1MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        Off | 00000000:8A:00.0 Off |                  N/A |
|  0%   23C    P8               7W / 370W |      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

A quick query:

bionet@Bionet:~$ nvidia-smi --query-gpu=index,name,uuid,serial --format=csv
index, name, uuid, serial
0, NVIDIA GeForce RTX 2080 Ti, GPU-2fdf7ca3-be62-5646-3d62-2e2db057e8b2, [N/A]
1, NVIDIA GeForce RTX 4090, GPU-3d19dd88-2507-8278-5045-9f68011b7ce0, [N/A]
2, NVIDIA GeForce RTX 2080 Ti, GPU-6384bfe4-3e8a-18a2-2132-fc5e686d1404, [N/A]
3, NVIDIA GeForce RTX 3090, GPU-d91f3e9a-e7d0-4f91-2798-1d8b05587fb6, [N/A]

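Verify that the GPU link is running at its normal rate:

nvidia-smi -i 0 -q

The -i option selects the GPU by ID; 0 is device 0. In the output, find:

A link width of 16x is normal.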
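
Instead of reading the full -q report for each card, the link width can be queried for all GPUs at once. This is a sketch only: the helper report_link_width is a hypothetical name, and the sample lines in the demo are made up for illustration rather than measured on this server. The query fields pcie.link.width.current and pcie.link.width.max are listed by nvidia-smi --help-query-gpu.

```shell
# report_link_width: read "index, name, current_width, max_width" lines on
# stdin and flag every GPU whose current link width is below its maximum.
report_link_width() {
    awk -F', *' '{
        status = ($3 + 0 < $4 + 0) ? "REDUCED" : "ok"
        printf "GPU %s (%s): x%s of x%s %s\n", $1, $2, $3, $4, status
    }'
}

# On the server (needs the NVIDIA driver):
#   nvidia-smi --query-gpu=index,name,pcie.link.width.current,pcie.link.width.max \
#              --format=csv,noheader | report_link_width

# Offline demo with made-up sample lines:
printf '%s\n' '0, Example GPU A, 16, 16' '1, Example GPU B, 8, 16' | report_link_width
```

A card reported as REDUCED is usually sitting in a slot that is physically or electrically narrower than x16, which matches the "incomplete" slots described earlier.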

Disk Partitioning Result

Ethernet and IP Settings

Use the following command to list the installed network adapters that the system detects:

bionet@Bionet:~$ lspci | grep -i 'eth'

Output:

81:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
81:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
# Dual Gigabit Ethernet ports

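Basic Configuration

JetBrains IDE & VSCode Installation

JetBrains IDEs are free for enrolled university students. How do you apply for them?

See this article for details. After applying you have a free account for the whole JetBrains suite, which is very convenient and lets you use all of their IDEs. For using and tuning the IDEs, see my column: Jetbrain入門指南 - 文章分類 - NeoNexus - 部落格園 (cnblogs.com)

To make the IDEs easier to manage and use, Toolbox is used here:

First download Toolbox:

The download is a .tar.gz archive, which we extract with:

tar -zxvf filename.tar.gz

Here -z decompresses with gzip, -x extracts, -v prints verbose output, and -f specifies the archive file name.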

(base) bionet@Bionet:~/Downloads$ tar -zxvf ./jetbrains-toolbox-2.2.3.20090.tar.gz

The result after extraction:

jetbrains-toolbox-2.2.3.20090/
jetbrains-toolbox-2.2.3.20090/jetbrains-toolbox
(base) bionet@Bionet:~/Downloads$ ls
Anaconda3-2024.02-1-Linux-x86_64.sh                jetbrains-toolbox-2.2.3.20090
cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb  jetbrains-toolbox-2.2.3.20090.tar.gz

Move the extracted folder to the directory we have designated. The commands were run as follows:

(base) bionet@Bionet:~/Downloads$ sudo mv jetbrains-toolbox-2.2.3.20090 jetbrain-toolbox-2.2.3
[sudo] password for bionet: 
(base) bionet@Bionet:~/Downloads$ ls
Anaconda3-2024.02-1-Linux-x86_64.sh                jetbrains-toolbox-2.2.3.20090.tar.gz
cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb  jetbrain-toolbox-2.2.3
(base) bionet@Bionet:~/Downloads$ sudo mv ./jetbrain-toolbox-2.2.3 /home/jetbrain-toolbox-2.2.3
(base) bionet@Bionet:~/Downloads$ cd /home/
(base) bionet@Bionet:/home$ ls
anaconda3  bionet  jetbrain-toolbox-2.2.3  lost+found  Neo

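Also add it to the startup applications:

Just add it:

Toolbox then starts once the user logs in:

Signing in is a little slow; just wait.

After signing in, choose the IDE you want and download it:

The result:

Note that the installation path needs to be changed to the designated location:

We create a folder under the administrator's Home specifically for the IDEs, so that each user does not have to download them again. We also need a directory, listed in an environment variable, to hold the launcher scripts of the installed IDEs:

Add a PATH entry to the global profile: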

(base) bionet@Bionet:~/Desktop$ sudo vim /etc/profile

Open the file and add a PATH line at the very bottom:

export PATH="/home/SoftWares/JetBrains/Scripts:$PATH"

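Note that this path must be somewhere everyone has permission to use.

The tools must also be installed somewhere everyone can use, as shown in the figure below (captured one command later).

Make the environment variable take effect:

source /etc/profile

At this point the shell scripts location still reports an error. That is fine: log this user out and back in and the error goes away.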
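
Because /etc/profile can be sourced more than once in a session, a guarded version of the PATH line avoids duplicate entries. This is a minimal sketch; path_prepend is a hypothetical helper name, and only the directory /home/SoftWares/JetBrains/Scripts comes from the setup above.

```shell
# path_prepend DIR: put DIR at the front of PATH unless it is already listed.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, do nothing
        *) PATH="$1:$PATH" ;;
    esac
}

path_prepend "/home/SoftWares/JetBrains/Scripts"
path_prepend "/home/SoftWares/JetBrains/Scripts"   # second call is a no-op
export PATH

# Count how many times the directory now appears in PATH.
echo "$PATH" | tr ':' '\n' | grep -c '^/home/SoftWares/JetBrains/Scripts$'
```

The same function could live in a small file under /etc/profile.d/ (the file name is up to you) instead of being appended to /etc/profile directly, which keeps the distribution's own file untouched.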

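What are JetBrains shell scripts for?

Let's install an IDE first and then discuss this:

Open the settings of an IDE:

Scroll to the bottom:

Enter any short name, for example:

Open a terminal and run PCP; the IDE starts directly.

Note that JetBrains Toolbox can only be used by one user. Every user who wants Toolbox has to install it separately; here it is installed only for the root user, for ease of management.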

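MATLAB Installation and Configuration

The university has purchased a MATLAB license, so follow the university's installation steps:

Jump straight to the download step:

Sign in and download; the download is an installer package:

To deal with the installation path, it can go under the SoftWares directory we created:

Install by following the official tutorial: Download and Install MATLAB - MATLAB & Simulink - MathWorks China

Note that only this needs to be run in this step:

Continue through the installer:

Select all components:

Note that the next page asks where to place the scripts; choose a sensible location. Here the following path was used (no screenshot):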

/home/SoftWares/MATLAB/MATLABScripts

Run it once from the installation directory:

The result:

RStudio Server Installation

R is deployed with Docker:

The RStudio Server images are all listed here: rocker/rstudio Tags | Docker Hub

Create a container directly:

docker run -d -p 8787:8787 -p 8788:22 \
  -v /home/SoftWares/R_Share:/home/rstudio/R_Share \
  -v /etc/timezone:/etc/timezone \
  -v /etc/localtime:/etc/localtime \
  --name R_422 \
  rocker/rstudio:4.2.2

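-v /etc/timezone:/etc/timezone and -v /etc/localtime:/etc/localtime keep the container's time zone in sync with the host.

-v /home/SoftWares/R_Share:/home/rstudio/R_Share mounts a host directory into the container.

--name R_422 names the container R_422.

rocker/rstudio:4.2.2 pulls this version of the image.

-p 8787:8787 -p 8788:22 are the port mappings; host port 8788 is mapped to port 22 in the container.

Enter the container:

docker exec -it R_422 /bin/bash

R_422 is the container name; change it as needed.

Install a few essentials so the container can be managed:

First set a password:

passwd root

Then install SSH to make management easier: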

sudo apt update
sudo apt-get install -y vim openssh-server
sudo apt upgrade

Configure SSH inside the container:

echo "PermitRootLogin yes">>/etc/ssh/sshd_config
echo "export VISIBLE=now" >>/etc/profile

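echo "PermitRootLogin yes">>/etc/ssh/sshd_config appends a line to sshd_config that allows root to log in over SSH.

echo "export VISIBLE=now" >> /etc/profile appends the line export VISIBLE=now to /etc/profile. The line appears to come from Docker's sample sshd setup, where it shows that variables exported in /etc/profile are visible in SSH login sessions; sshd does not read VISIBLE itself, so it is not what makes logins show up in w or who.

Then restart the service: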

service ssh restart

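If you now open another terminal on the host and run the following command, you can see the mapped port (this capture is from a container named pytorch):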

Neo@Bionet:~/Desktop$ docker port pytorch 22
0.0.0.0:10003
[::]:10003

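Now open a remote terminal and connect to the container:

Enter the password and you are logged in.

Why add users? RStudio does not allow root to log in by default, so some users have to be added. My script can add users in batch; just run it: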

sudo ./createuser4R.sh

The default password is the user name followed by 123, i.e. name123.
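
The script createuser4R.sh itself is not reproduced in this article. As an illustration only, the sketch below shows one way such a batch step could look under the stated password convention; make_user_cmds and the example names alice and bob are hypothetical, and this is not the author's actual script. It prints the commands rather than running them, so the list can be reviewed first.

```shell
# make_user_cmds: read one user name per line on stdin and print the commands
# that would create each account with the default password "<name>123".
make_user_cmds() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue            # skip blank lines
        printf 'useradd -m -s /bin/bash %s\n' "$name"
        printf "echo '%s:%s123' | chpasswd\n" "$name" "$name"
    done
}

# Dry run: only prints the commands.
printf '%s\n' alice bob | make_user_cmds
```

To apply the result, the printed commands would be run as root inside the container, for example by piping the output to sh. Since the default passwords are predictable, each user should change theirs with passwd after the first login.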

After logging in you can see:

VSCode Installation and Configuration

First download the VSCode installer from the official site: Visual Studio Code - Code Editing. Redefined

Then run:

(base) bionet@Bionet:~/Desktop$ sudo dpkg -i code_1.87.2-1709912201_amd64.deb

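to install it. The result:

If an error is reported, don't panic: in this case the installer was corrupted in transfer and failed its integrity check. Download it again and rerun the command.

Once installation finishes, it shows up in the application list:

Sending Software Shortcuts (.desktop) to User Desktops

We are not finished yet. After installation the shortcuts still have to be delivered to every user. As a hint, each user needs the corresponding file in their own applications directory to see the application; alternatively, put the shortcut in the system-wide directory. One fairly general approach is shown here:

Desktop shortcuts for system-wide software are normally in this folder: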

/usr/share/applications/

Software installed by a user is in:

~/.local/share/applications/

We need to copy the shortcuts from the installing user's directory into the system-wide directory with the following command:

sudo cp -r  ~/.local/share/applications/.  /usr/share/applications/

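If this cannot be run, switch to the user who installed the software. Here that user is bionet, so run it as bionet. If it still fails, fall back to root: enter the following command and then run the copy (in a root shell ~ refers to /root, so spell out /home/bionet/.local/share/applications/ as the source):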

sudo -i

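Users created later can then see the installed software as well.

Intranet Disk Mapping

File transfer already works by copy and paste, but that is unreliable for larger files, so another service is used to map a disk on the server:

Using the SAMBA Service

Install SAMBA: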

sudo apt-get install samba samba-common-bin

Configure SAMBA:

sudo gedit /etc/samba/smb.conf

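Append the following section at the bottom:

# Name shown for the shared folder
[Storge]
# Description
comment = Bionet No1 WorkStation Storage
# Users allowed to access the share
valid users = Neo,root,Bionet
# Path of the shared directory
path = /home/SAMBA/Storge/
# The share name is visible to others (its contents are not)
browseable = yes
# Writable
writable = yes
# New files get permission 664
create mask = 0664
# New directories get permission 775
directory mask = 0775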

Run the following command to test the configuration:

bionet@Bionet:~$ testparm

Output:

Load smb config files from /etc/samba/smb.conf
Loaded services file OK.
Weak crypto is allowed

Server role: ROLE_STANDALONE

Press enter to see a dump of your service definitions

# Global parameters
[global]
	log file = /var/log/samba/log.%m
	logging = file
	map to guest = Bad User
	max log size = 1000
	obey pam restrictions = Yes
	pam password change = Yes
	panic action = /usr/share/samba/panic-action %d
	passwd chat = *Enter\snew\s*\spassword:* %n\n *Retype\snew\s*\spassword:* %n\n *password\supdated\ssuccessfully* .
	passwd program = /usr/bin/passwd %u
	server role = standalone server
	server string = %h server (Samba, Ubuntu)
	unix password sync = Yes
	usershare allow guests = Yes
	idmap config * : backend = tdb


[printers]
	browseable = No
	comment = All Printers
	create mask = 0700
	path = /var/spool/samba
	printable = Yes


[print$]
	comment = Printer Drivers
	path = /var/lib/samba/printers


[Storge]
	comment = Bionet No1 WorkStation Storage
	create mask = 0664
	directory mask = 0775
	path = /home/SAMBA/Storge/
	read only = No
	valid users = Neo root Bionet

Add an SMB user (it must be an existing Linux user):

bionet@Bionet:~$ sudo smbpasswd -a Neo
New SMB password:
Retype new SMB password:
Added user Neo.

Because the campus network is segmented, this part is on hold for now.

Installing Git

Run:

sudo apt install git

Version:

bionet@Bionet:~$ git --version
git version 2.34.1

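Git needs a user name and related settings. I suggest setting a shared global user name and email first; individual users can use their own keys to sync code when they need to. This is only a brief outline; search online for details.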

git config --global user.name "Your Name"
git config --global user.email "youremail@yourdomain.com"

After configuring, verify:

bionet@Bionet:/usr/local/cuda-12.4$ git config --global user.name "Bionet"
bionet@Bionet:/usr/local/cuda-12.4$ git config --global user.email "Bionet@xmu.edu.cn"
bionet@Bionet:/usr/local/cuda-12.4$ git config --list
user.name=Bionet
user.email=Bionet@xmu.edu.cn

Register a GitHub account (the lab already has one; ask the group lead for details):

Generate a key:

bionet@Bionet:/usr/local/cuda-12.4$ ssh-keygen -t ed25519 -C "BioNet@xmu.edu.cn"
Generating public/private ed25519 key pair.
Enter file in which to save the key (/home/bionet/.ssh/id_ed25519): 
Created directory '/home/bionet/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/bionet/.ssh/id_ed25519
Your public key has been saved in /home/bionet/.ssh/id_ed25519.pub

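Then add the public key to our account. If the system is reinstalled, delete the old key there and generate a new one for authentication.

Before the key is added, authentication fails like this: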

bionet@Bionet:/usr/local/cuda-12.4$ ssh -T git@github.com
git@github.com: Permission denied (publickey).

Once the key has been added, the test succeeds:

(base) bionet@Bionet:~/Desktop$ ssh -T git@github.com
Hi NeoNexusX! You've successfully authenticated, but GitHub does not provide shell access.

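User and Group Management

Managing the permissions of users and groups is very important; sensible permission assignment greatly reduces later maintenance costs. This section briefly introduces Linux file permissions and then uses them to explain how to set up group permissions.

Linux generally divides the identities that can access a file into three classes, owner, group, and others, and each class has read, write, and execute permissions. In the administration of a multi-user computer system (the users need not be logged in at the same time), a permission is the right of a particular user to use a particular system resource, such as a folder or a specific system command.

Read permission:

For a folder, read permission determines whether the user can list the directory contents
For a file, read permission determines whether the user can view the file contents

Write permission:

For a folder, write permission determines whether the user can create, delete, copy into, or move entries in the folder
For a file, write permission determines whether the user can edit the file contents

Execute permission:

For a file, especially a script, execute permission determines whether it can be run; for a folder, it determines whether the user can enter it

The identities mentioned above are described in more detail here:

Owner: the owner of the file, by default the user who created it
Group: users in the group that owns the file
Others: everyone else, i.e. users who are neither the owner nor in that group
Root user: the superuser, who manages ordinary users and has all permissions

Viewing File Permissions

The following command (for example ls -l) shows the permissions of the files in the current directory:

The permission field has 10 characters, with the following meaning:

In the figure above the permission is -rwxr-xr-x, which means: a regular file (-); the owner, bionet, has all permissions (rwx); the owning group has everything except write (r-x); other users have everything except write (r-x).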
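
The same ten-character field maps directly onto the octal numbers used with chmod (r = 4, w = 2, x = 1, summed per class). The helper below is a small sketch of that conversion; perm_to_octal is a hypothetical name, and special bits such as setuid, setgid, and sticky (shown as s or t) are deliberately ignored.

```shell
# perm_to_octal STRING: convert e.g. "-rwxr-xr-x" to "755".
# Character 1 is the file type; characters 2-10 are three rwx triplets.
perm_to_octal() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (g = 0; g < 3; g++) {
            n = 0
            if (substr($0, 2 + g * 3, 1) == "r") n += 4
            if (substr($0, 3 + g * 3, 1) == "w") n += 2
            if (substr($0, 4 + g * 3, 1) == "x") n += 1
            out = out n
        }
        print out
    }'
}

perm_to_octal "-rwxr-xr-x"   # prints 755
perm_to_octal "drwxrwx---"   # prints 770
```

The second example is the mode that chmod 770 produces on a directory: full access for the owner and the group, none for others.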

Viewing Current Users

Run:

Neo@Bionet:/home/jetbrain-toolbox-2.2.3$ getent passwd

Each line of the output has the form: username:password (x):user ID:group ID:description (unused):HOME directory:login shell (bash by default)
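
To make the field layout concrete, a single line can be split on the colons. This is a sketch with a made-up account line (the user demo does not exist on the server); describe_passwd_line is a hypothetical helper name.

```shell
# describe_passwd_line LINE: split one passwd-format line into its seven
# colon-separated fields and print the ones that matter most.
describe_passwd_line() {
    printf '%s\n' "$1" | {
        IFS=: read -r user pw uid gid gecos home shell
        printf 'user=%s uid=%s gid=%s home=%s shell=%s\n' \
            "$user" "$uid" "$gid" "$home" "$shell"
    }
}

# Made-up example line in the same format as the getent passwd output.
describe_passwd_line 'demo:x:1001:1001:Demo User:/home/demo:/bin/bash'
```

On the server, describe_passwd_line "$(getent passwd Neo)" would show the same fields for a real account.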

View the current groups:

getent group

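When dividing up permissions, the places shared by all members of the core group need to be given group permissions. The result:

The three folders in the red box are shared, so they need these permissions: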

sudo chown -R :core /path/to/folder

sudo chmod -R 770 /path/to/folder
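
One side effect of chmod -R 770 is that every regular file in the tree becomes executable. A possible variant, sketched here under the assumption that only directories (and files that are already executable) should keep the x bit, uses the symbolic mode with a capital X. The helper name share_tree is hypothetical, and the demo runs on a throwaway directory rather than on the real shared folders.

```shell
# share_tree DIR: give owner and group read/write on everything under DIR,
# add execute/search only to directories and to files that already have it,
# and remove all access for others.
share_tree() {
    chmod -R u=rwX,g=rwX,o= "$1"
}

# Demo in a temporary directory.
demo=$(mktemp -d)
mkdir "$demo/data"
echo "notes" > "$demo/data/readme.txt"
share_tree "$demo"
ls -ld "$demo/data" | cut -c1-10             # drwxrwx---
ls -l "$demo/data/readme.txt" | cut -c1-10   # -rw-rw----
rm -rf "$demo"

# On the server this would follow the chown shown above, for example:
#   sudo chown -R :core /path/to/folder
#   sudo chmod -R u=rwX,g=rwX,o= /path/to/folder
#   sudo chmod g+s /path/to/folder   # optional: new files inherit the group
```

The optional g+s line sets the setgid bit on the directory so that files created later belong to the core group without another chown.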

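Docker Deployment

Docker is another excellent tool for isolated environments. conda can provide Python virtual environments and switch between them conveniently, but sometimes the development environment involves more than Python: a native library may need a build environment with a particular gcc version, or cross-compilation may require many toolchains. If all of this is done directly on the server, its file system becomes very messy over time, and software version conflicts appear.

It is enough to think of Docker as a lightweight virtual machine, although it is not one: a virtual machine has to provide a whole operating system, whereas Docker only provides the environment the program needs to run. A typical development workflow is:

graph TB; A("Set up local Docker environment")--> B("Write and test code") --> C("Build image") --> D("Deploy and run on server")

This document mainly covers setting up and running the server environment.

Official Docker tutorial: Install Docker Engine on Ubuntu | Docker Docs

First set up Docker's apt repository: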

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

Copy the commands one at a time; that is safer.

Install Docker:

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Run the test image:

sudo docker run hello-world

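Check the version information:

To make later Docker setup easier, we set up a user group first. The Docker daemon binds to a Unix socket rather than a TCP socket, and that socket is owned by root, so other users can normally reach it only through sudo. The way around this is a docker group whose members are granted access. First check whether a docker group already exists, because on some Linux distributions it is created automatically when Docker is installed:

Run: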

getent group

It exists. So we add the users who need Docker to this group; they can then run docker without sudo.

sudo usermod -aG docker $USER

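-aG: these are options of the usermod command:

-a means "append": it tells usermod to add the user to the given groups instead of replacing the user's existing supplementary groups.
-G means "groups": it specifies the supplementary groups to operate on.

Log out and back in so the new membership takes effect, or run the following to activate it in the current shell:

newgrp docker

Verify again that docker can now be used without sudo.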
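
Before retrying docker itself, it can help to confirm that the current shell has actually picked up the new group. A minimal sketch follows; in_group is a hypothetical helper that only inspects a list of group names, and id -nG prints the groups of the current process.

```shell
# in_group GROUP "LIST": succeed when GROUP appears as a whole word in the
# space-separated LIST of group names.
in_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

if in_group docker "$(id -nG)"; then
    echo "docker group is active in this shell"
else
    echo "docker group is not active yet"
fi
```

If the second message appears right after usermod, the session was started before the change; logging in again or running newgrp docker refreshes it.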

Enable start on boot:

Neo@Bionet:~/Desktop$ sudo systemctl enable docker.service
sudo systemctl enable containerd.service
[sudo] password for Neo: 
Synchronizing state of docker.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable docker

To disable it, use:

sudo systemctl disable docker.service
sudo systemctl disable containerd.service

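This completes the basic Docker deployment. Next comes NVIDIA's Docker support, which is covered at the end of the deep learning configuration.

Deep Learning Configuration

Install the NVIDIA driver. Recent Ubuntu versions can install it directly from the driver manager with a few clicks, so it is not described further here. Newer graphics cards are best served by newer drivers:

Installing Python and Pip

Run: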

sudo apt install python3
sudo apt install python3-pip

After installation, switch pip to a mirror index:

bionet@Bionet:~$ cd ~
bionet@Bionet:~$ mkdir .pip
bionet@Bionet:~$ sudo gedit ~/.pip/pip.conf

gedit is the graphical text editor that ships with Ubuntu; if you prefer vim, use this instead:

bionet@Bionet:~$ sudo vim ~/.pip/pip.conf

Enter the following content:

[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple/ 
[install]
trusted-host = pypi.tuna.tsinghua.edu.cn
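
The same file can be written without an editor and without sudo, which keeps it owned by the user rather than by root. This is a sketch only: write_pip_conf is a hypothetical helper, and the demo writes into a temporary directory instead of a real home directory.

```shell
# write_pip_conf DIR: create DIR if needed and write a pip.conf that points
# pip at the Tsinghua mirror used in this guide.
write_pip_conf() {
    mkdir -p "$1"
    cat > "$1/pip.conf" <<'EOF'
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple/
[install]
trusted-host = pypi.tuna.tsinghua.edu.cn
EOF
}

# Demo in a throwaway directory; for a real user the target would be ~/.pip
# (the legacy location used above) or ~/.config/pip.
demo_home=$(mktemp -d)
write_pip_conf "$demo_home/.pip"
cat "$demo_home/.pip/pip.conf"
rm -rf "$demo_home"
```

Each user who wants the mirror would run write_pip_conf "$HOME/.pip" once under their own account.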

Test it:

bionet@Bionet:~$ python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> quit();

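Installing the CUDA Toolkit

The latest CUDA Toolkit, 12.4, is chosen here. My selections on the download page were:

Copy the resulting command and download the installer: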

bionet@Bionet:~$ wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run

Saving to: ‘cuda_12.4.0_550.54.14_linux.run’
cuda_12.4.0_550.54.14_linux.run              100%[==============================================================================================>]   4.15G  89.4MB/s    in 57s     
2024-03-23 20:26:43 (73.9 MB/s) - ‘cuda_12.4.0_550.54.14_linux.run’ saved [4454353277/4454353277]

bionet@Bionet:~$ ls
cuda_12.4.0_550.54.14_linux.run  Desktop  Documents  Downloads  matlab  Music  Pictures  Public  snap  Templates  thinclient_drives  Videos

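Run it with:

sudo sh cuda_12.4.0_550.54.14_linux.run

Note that in the selection screen you should not select the driver; a new driver is already installed and does not need to be installed again:

Whether to select the new driver here depends on this figure:

If your driver does not meet the requirement, I suggest installing the new driver.

As the installer prompts, the toolkit has to be added to the environment variables:

bionet@Bionet:~$ sudo gedit ~/.bashrc

The environment variables are: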

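export CUDA_HOME=/usr/local/cuda-12.4
# Append to any existing LD_LIBRARY_PATH instead of overwriting it
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export PATH=${CUDA_HOME}/bin:${PATH}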

Apply it:

bionet@Bionet:~$ source ~/.bashrc

Test with:

bionet@Bionet:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

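Reaching this point does not yet mean success; the environment is only confirmed once a CUDA program actually runs:

Since CUDA 11.6 the toolkit no longer ships the samples, so clone them from GitHub, build, and run: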

cd /usr/local/cuda-12.4/
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make
./deviceQuery

Output:

bionet@Bionet:/usr/local/cuda-12.4/cuda-samples/Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.2 / 12.4
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 24217 MBytes (25393692672 bytes)
  (128) Multiprocessors, (128) CUDA Cores/MP:    16384 CUDA Cores
  GPU Max Clock rate:                            2580 MHz (2.58 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 75497472 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 9 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA GeForce RTX 3090"
  CUDA Driver Version / Runtime Version          12.2 / 12.4
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 24260 MBytes (25438126080 bytes)
  (082) Multiprocessors, (128) CUDA Cores/MP:    10496 CUDA Cores
  GPU Max Clock rate:                            1755 MHz (1.75 GHz)
  Memory Clock rate:                             9751 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 138 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "NVIDIA GeForce RTX 2080 Ti"
  CUDA Driver Version / Runtime Version          12.2 / 12.4
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 11012 MBytes (11546394624 bytes)
  (068) Multiprocessors, (064) CUDA Cores/MP:    4352 CUDA Cores
  GPU Max Clock rate:                            1650 MHz (1.65 GHz)
  Memory Clock rate:                             7000 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 5767168 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 3: "NVIDIA GeForce RTX 2080 Ti"
  CUDA Driver Version / Runtime Version          12.2 / 12.4
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 11012 MBytes (11546394624 bytes)
  (068) Multiprocessors, (064) CUDA Cores/MP:    4352 CUDA Cores
  GPU Max Clock rate:                            1650 MHz (1.65 GHz)
  Memory Clock rate:                             7000 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 5767168 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 134 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 3090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 2080 Ti (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 2080 Ti (GPU3) : No
> Peer access from NVIDIA GeForce RTX 3090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 3090 (GPU1) -> NVIDIA GeForce RTX 2080 Ti (GPU2) : No
> Peer access from NVIDIA GeForce RTX 3090 (GPU1) -> NVIDIA GeForce RTX 2080 Ti (GPU3) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU2) -> NVIDIA GeForce RTX 3090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU2) -> NVIDIA GeForce RTX 2080 Ti (GPU3) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU3) -> NVIDIA GeForce RTX 3090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU3) -> NVIDIA GeForce RTX 2080 Ti (GPU2) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.4, NumDevs = 4
Result = PASS

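The output reports ECC support as Disabled on the 4090; that is not a big problem and can be dealt with later.

You can likewise run bandwidthTest. After building it, the result is: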

bionet@Bionet:/usr/local/cuda-12.4/cuda-samples/Samples/1_Utilities/bandwidthTest$ ./bandwidthTest 
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA GeForce RTX 4090
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			11.9

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			13.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			3627.5

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
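
To put the host-to-device figures in context, they can be compared with the theoretical rate of the PCIe link. The sketch below assumes the GPU is attached at PCIe 3.0 x16, which is the most the Xeon E5-2620 v4 platform offers; pcie_gbps is a hypothetical helper name, and the formula covers only the 128b/130b line encoding used by PCIe 3.0 to 5.0, not protocol overhead.

```shell
# pcie_gbps GT_PER_S LANES: theoretical one-direction bandwidth in GB/s for a
# link with 128b/130b encoding (PCIe 3.0 to 5.0).
#   GT/s per lane * lanes * 128/130 payload bits, divided by 8 bits per byte
pcie_gbps() {
    awk -v gt="$1" -v lanes="$2" \
        'BEGIN { printf "%.2f\n", gt * lanes * 128 / 130 / 8 }'
}

pcie_gbps 8 16    # PCIe 3.0 x16, prints 15.75
```

Against that ceiling, the measured 11.9 GB/s (host to device) and 13.2 GB/s (device to host) are roughly 76% and 84% of the theoretical rate, which is in the range one would expect for a link running at full x16 width.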

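Installing cuDNN

cuDNN is NVIDIA's library for accelerating deep learning. Note that the version you install must match your CUDA version.

This is the web download route, but the connection from here is poor behind the firewall. I recommend downloading from this page and picking the matching command; I have CUDA 12, so I use the CUDA 12 packages:

Note that this differs from many guides online: the system's native package manager is used for the installation. See: Installing cuDNN on Linux, NVIDIA cuDNN 9.0.0 documentation.

Installing the Anaconda Environment

Download from the Tsinghua mirror: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/?C=M&O=D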

bionet@Bionet:~/Downloads$ chmod +x Anaconda3-2024.02-1-Linux-x86_64.sh 
bionet@Bionet:~/Downloads$ sudo ./Anaconda3-2024.02-1-Linux-x86_64.sh

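Note that I changed the install location to /home (/home/anaconda3 in this setup); /home is large, and I like to keep things tidy.

After installation finishes, answer yes once more; the installer then lists what was installed:

After installing you may find that conda is not available in your shell. It is installed; because the installer ran under sudo, it wrote the initialization into root's shell configuration instead of yours, which is awkward.

So the block has to be copied over manually from root's configuration:

Here is an example: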

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/anaconda3/etc/profile.d/conda.sh" ]; then
        . "/home/anaconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

The paths differ from system to system, so adjust them accordingly.

The commands used:

bionet@Bionet:~/Desktop$ sudo vim ~/.bashrc
bionet@Bionet:~/Desktop$ source ~/.bashrc

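The result after applying it:

Is that the end? Of course not. As someone who is fussy about directory layout, I also want to set the package paths to keep things tidy.

Before starting, switch to the mirror channels:

(base) bionet@Bionet:~$ sudo vim ~/.condarc

Replace the contents with: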

channels:
  - defaults
show_channel_urls: true
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  deepmodeling: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/
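To make the effect of this file concrete, here is a rough sketch of how a channel name is turned into a URL, as I understand conda's behaviour: "defaults" expands to the default_channels list, a name listed under custom_channels becomes its base URL plus the channel name, and anything else falls back to https://conda.anaconda.org. The function resolve_channel is a hypothetical helper written for illustration, and only two of the custom channels are copied here.

```python
MIRROR = "https://mirrors.tuna.tsinghua.edu.cn/anaconda"

# Copied from the .condarc above (custom_channels shortened for brevity).
DEFAULT_CHANNELS = [f"{MIRROR}/pkgs/main", f"{MIRROR}/pkgs/r", f"{MIRROR}/pkgs/msys2"]
CUSTOM_CHANNELS = {"conda-forge": f"{MIRROR}/cloud", "pytorch": f"{MIRROR}/cloud"}


def resolve_channel(name):
    """Return the base URL(s) a channel name maps to under this configuration."""
    if name == "defaults":
        return list(DEFAULT_CHANNELS)
    if name in CUSTOM_CHANNELS:
        # custom_channels maps a channel name to the base URL that hosts it.
        return [f"{CUSTOM_CHANNELS[name].rstrip('/')}/{name}"]
    # Not configured: assume the stock channel alias.
    return [f"https://conda.anaconda.org/{name}"]


if __name__ == "__main__":
    for channel in ("defaults", "pytorch", "nvidia"):
        print(channel, "->", resolve_channel(channel))
```

Under these assumptions, -c pytorch is served from the mirror, while a channel that is not listed, such as nvidia, still goes to the upstream server.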

Now check the resulting paths:

(base) bionet@Bionet:~$ conda info

     active environment : base
    active env location : /home/anaconda3
            shell level : 1
       user config file : /home/bionet/.condarc
 populated config files : /home/bionet/.condarc
          conda version : 24.1.2
    conda-build version : 24.1.2
         python version : 3.11.7.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=broadwell
                          __conda=24.1.2=0
                          __cuda=12.2=0
                          __glibc=2.35=0
                          __linux=6.5.0=0
                          __unix=0=0
       base environment : /home/anaconda3  (read only)
      conda av data dir : /home/anaconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/linux-64
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/noarch
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r/linux-64
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r/noarch
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2/linux-64
                          https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2/noarch
          package cache : /home/anaconda3/pkgs
                          /home/bionet/.conda/pkgs
       envs directories : /home/bionet/.conda/envs
                          /home/anaconda3/envs

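As it turns out, nothing needs changing: the directories are already under /home. Note that this means /home itself (for example /home/anaconda3), not the user's home directory.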
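The same information can be read programmatically, which is handy when checking several accounts on a shared server. The sketch below assumes the JSON form of the command (conda info --json) and its root_prefix, pkgs_dirs, and envs_dirs keys. The helper names summarize_conda_info and read_conda_info are my own.

```python
import json
import subprocess


def summarize_conda_info(info_json):
    """Extract the install prefix, package caches, and env directories from `conda info --json` text."""
    info = json.loads(info_json)
    return {
        "root_prefix": info.get("root_prefix"),
        "pkgs_dirs": info.get("pkgs_dirs", []),
        "envs_dirs": info.get("envs_dirs", []),
    }


def read_conda_info():
    """Run `conda info --json`; return None when conda is missing or the call fails."""
    try:
        done = subprocess.run(
            ["conda", "info", "--json"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return summarize_conda_info(done.stdout)


if __name__ == "__main__":
    print(read_conda_info())
```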

Installing PyTorch

At last we get to this step. Create a new conda environment:

conda create --name test python=3.10
conda activate test

Find the build that matches your setup on the PyTorch website: PyTorch

Copy the command and run it:

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Write a quick script and run it to check:

vim test.py

The script is as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.backends.cudnn as cudnn
from torchvision import datasets, transforms


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)


    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % 10 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                       100. * batch_idx / len(train_loader), loss.item()))


def main():
    cudnn.benchmark = True
    torch.manual_seed(1)
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    print("Using device: {}".format(device))
    kwargs = {'num_workers': 1, 'pin_memory': True}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('./data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=64, shuffle=True, **kwargs)

    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    for epoch in range(1, 11):
        train(model, device, train_loader, optimizer, epoch)

if __name__ == '__main__':
    main()

If it runs without errors, the output should look like this:

(test) bionet@Bionet:~$ python test.py
Using device: cuda
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
100.0%
Extracting ./data/MNIST/raw/train-images-idx3-ubyte.gz to ./data/MNIST/raw
100.0%
Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw

Train Epoch: 1 [0/60000 (0%)]	Loss: 2.327492
Train Epoch: 1 [640/60000 (1%)]	Loss: 2.328194
Train Epoch: 1 [1280/60000 (2%)]	Loss: 2.278235
Train Epoch: 1 [1920/60000 (3%)]	Loss: 2.281009
(part of the output omitted)
Train Epoch: 10 [56320/60000 (94%)]	Loss: 0.196389
Train Epoch: 10 [56960/60000 (95%)]	Loss: 0.387766
Train Epoch: 10 [57600/60000 (96%)]	Loss: 0.109143
Train Epoch: 10 [58240/60000 (97%)]	Loss: 0.077670
Train Epoch: 10 [58880/60000 (98%)]	Loss: 0.182428
Train Epoch: 10 [59520/60000 (99%)]	Loss: 0.392815

Make sure the device reported here is cuda. But hold on: this is only the bare-metal setup, and what we really want is Docker!
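When the device is not cuda, a shorter diagnostic than the full MNIST run can narrow things down. The following is a sketch under the assumption that PyTorch may or may not be importable in the active environment. cuda_report is a hypothetical helper name, and it only calls torch attributes that I am confident exist (torch.version.cuda, torch.cuda.is_available, torch.cuda.device_count, torch.cuda.get_device_name, torch.backends.cudnn.version).

```python
def cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup without training anything."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "devices": [],
    }
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
        report["cudnn_version"] = torch.backends.cudnn.version()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

A built_with_cuda value of None would point to a CPU-only build, while a CUDA build with cuda_available set to False more likely points to a driver or visibility problem.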

NVIDIA Container Toolkit

A classification of NVIDIA's container technologies:

nvidia docker, nvidia docker2, nvidia container toolkits三者的區別-CSDN部落格

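Skipping the history, we go straight to the latest option, nvidia-container-toolkit.

The NVIDIA Container Toolkit provides a suitable environment for running GPU programs inside containers. It also gives some flexibility, such as switching CUDA versions. Most importantly, the same image can run on different machines and different hardware without preparing an identical environment on each one first.

Official tutorial: NVIDIA/nvidia-container-toolkit: Build and run containers leveraging NVIDIA GPUs (github.com)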

Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit 1.14.5 documentation

docker

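The NVIDIA Container Toolkit lets users build and run GPU-accelerated containers. It includes a container runtime library and utilities that automatically configure containers to use NVIDIA GPUs. The product documentation (architecture overview, platform support, and installation and usage guides) is in the documentation repository.

Note that the host does not need the CUDA Toolkit installed for this, but it does need the NVIDIA driver.

First, configure the package repository: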

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

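Here is the same command in a wider terminal, which is easier to read:

Then refresh the package index:

sudo apt-get update

Install the toolkit:

sudo apt-get install -y nvidia-container-toolkit

Once it is installed, move on to configuration.

Use the nvidia-ctk command below to modify Docker's configuration file on the host so that Docker can use the NVIDIA Container Runtime (afterwards, restart the Docker daemon with sudo systemctl restart docker so that the change takes effect):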

sudo nvidia-ctk runtime configure --runtime=docker
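To verify the change, one can look for a runtime named nvidia in Docker's daemon configuration. This sketch assumes the default location /etc/docker/daemon.json and a "runtimes" key, which is where I would expect nvidia-ctk to register the runtime. has_nvidia_runtime and check_daemon_config are hypothetical helper names.

```python
import json
from pathlib import Path

# Default daemon config location on Linux; adjust if your installation differs.
DAEMON_JSON = Path("/etc/docker/daemon.json")


def has_nvidia_runtime(daemon_json_text):
    """Return True if the daemon config text registers a runtime named "nvidia"."""
    config = json.loads(daemon_json_text)
    return "nvidia" in config.get("runtimes", {})


def check_daemon_config(path=DAEMON_JSON):
    """Return True/False for an existing config file, or None if the file is absent."""
    if not path.exists():
        return None
    return has_nvidia_runtime(path.read_text())
```

Calling check_daemon_config() on the host should return True once the runtime has been configured; None would mean the file does not exist at the assumed path.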

Testing the Docker deployment

Start with a simple test:

Run a container from the ubuntu image and print the GPU information visible inside it:

Neo@Bionet:~/Desktop$ docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

The result:

Neo@Bionet:~/Desktop$ docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Mon Mar 25 09:15:30 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:04:00.0 Off |                  N/A |
|  0%   28C    P8              17W / 300W |      1MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:09:00.0 Off |                  Off |
|  0%   31C    P8              34W / 450W |      1MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:86:00.0 Off |                  N/A |
|  0%   28C    P8              12W / 300W |      1MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        Off | 00000000:8A:00.0 Off |                  N/A |
|  0%   27C    P8              20W / 370W |      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

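Next, test a real workload: start a container from an image, set up the environment inside it, and then package it as a new image for everyone to use.

Before running the commands below, run:

newgrp docker

to refresh your group membership. This assumes your account has already been added to the docker group; if not, contact the administrator.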

docker run  -it -p 10003:22 -p 10004:10002 --name pytorch \
 -v /etc/timezone:/etc/timezone \
 -v /etc/localtime:/etc/localtime \
 -v /home/Neo/WorkSpace/_Share:/home/workspace/_share  \
 --gpus all nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04

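-v /etc/timezone:/etc/timezone and -v /etc/localtime:/etc/localtime: these two options mount the host's /etc/timezone and /etc/localtime files at the same paths inside the container, so that the container's time settings match the host's.

-v /home/Neo/WorkSpace/_Share:/home/workspace/_share: this option mounts the host directory /home/Neo/WorkSpace/_Share at /home/workspace/_share inside the container, so that files can be shared between the host and the container.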

Each image tag has the following format:

11.4.0-base-ubuntu20.04

11.4.0 – CUDA version.
base – Image flavor. Common flavors include base and runtime, which differ in what they ship.
ubuntu20.04 – Operating system version.
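The tag layout described above can be split mechanically. This is a small sketch that assumes the first dash-separated field is the CUDA version and the last is the operating system, with everything in between treated as the flavor (so a tag with a cuDNN component yields a flavor such as cudnn8-runtime). CudaTag and parse_cuda_tag are hypothetical names.

```python
from typing import NamedTuple


class CudaTag(NamedTuple):
    cuda_version: str
    flavor: str
    os_version: str


def parse_cuda_tag(tag):
    """Split an nvidia/cuda image tag into CUDA version, flavor, and OS version."""
    parts = tag.split("-")
    if len(parts) < 3:
        raise ValueError(f"unexpected tag layout: {tag!r}")
    # First field: CUDA version; last field: OS; the middle is the flavor.
    return CudaTag(parts[0], "-".join(parts[1:-1]), parts[-1])


if __name__ == "__main__":
    print(parse_cuda_tag("11.4.0-base-ubuntu20.04"))
    print(parse_cuda_tag("12.2.2-cudnn8-runtime-ubuntu22.04"))
```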

For the tags currently available, see:

nvidia/cuda Tags | Docker Hub

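Note that the CUDA Toolkit version must be compatible with the driver version:

See: 1. CUDA 12.4 Release Notes — Release Notes 12.4 documentation (nvidia.com)

The required ports also need to be published. SSH listens on port 22 inside the container, which the docker run command above maps to port 10003 on the host. Container port 10002 is mapped to host port 10004 and kept in reserve.

You are now inside the container and can start working.

Let's rerun the bare-metal example and see what happens: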

root@54990fb612d5:/# nvidia-smi

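If you look closely at the screenshot, you can see the real advantage of Docker: switching CUDA versions. Here the CUDA version has been switched to 12.3.

Whatever you do inside the container also has little effect on other users, which is the best part.

This part should be configured with a Dockerfile; to be updated later, manual steps for now.

Next, install some essential tools for later development: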

apt update
apt-get install sudo
sudo apt-get install -y vim git curl unzip net-tools openssh-server

Change the package sources:

sudo vim /etc/apt/sources.list

Fill in the following content:

For 22.04:

deb https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse

# deb https://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse
# deb-src https://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse

For 20.04:

deb https://mirrors.aliyun.com/ubuntu/ focal main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ focal main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ focal-security main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ focal-security main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ focal-updates main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ focal-updates main restricted universe multiverse

# deb https://mirrors.aliyun.com/ubuntu/ focal-proposed main restricted universe multiverse
# deb-src https://mirrors.aliyun.com/ubuntu/ focal-proposed main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ focal-backports main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ focal-backports main restricted universe multiverse

Remember to update and upgrade again:

sudo apt update
sudo apt upgrade

And don't forget to set a password:

root@49906fdf69a6:/# passwd root
New password: 
Retype new password: 
passwd: password updated successfully

Configure SSH inside the container:

echo "PermitRootLogin yes">>/etc/ssh/sshd_config
echo "export VISIBLE=now" >>/etc/profile

echo "PermitRootLogin yes">>/etc/ssh/sshd_config appends a line to sshd_config that allows root to log in over SSH.

Then restart the service:

service ssh restart

If you now open another terminal on the host and run the following command, you will see:

Neo@Bionet:~/Desktop$ docker port pytorch 22
0.0.0.0:10003
[::]:10003
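Before trying SSH from another machine, it can be useful to confirm that the published port actually accepts TCP connections. The sketch below uses only the standard socket module. port_open is a hypothetical helper, and the host and port in the usage note are the ones used in this article.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, port_open("10.26.58.61", 10003) should return True once sshd is running in the container and the port mapping is in place. A False result only says that the TCP connection failed, not why.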

Now open a terminal on a remote machine and connect to the container:

ssh root@10.26.58.61 -p 10003

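We are in. At this point the image is about 50% done; we still need a Python environment. Here I use Miniconda to provide isolated Python environments. If that seems like too much trouble, plain pip also works, but I won't cover the pip route here.

Also note that you may run into the following error: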

PS C:\Users\NeoNexus\Desktop> ssh root@10.26.58.61 -p 10003
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:8jZY1c5+1VHkr9H0HoLXl6dV6c/oGj1i6HlsPsUCjPA.
Please contact your system administrator.
Add correct host key in C:\\Users\\NeoNexus/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in C:\\Users\\NeoNexus/.ssh/known_hosts:9
Host key for [10.26.58.61]:10003 has changed and you have requested strict checking.

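The same problem shows up in VSCode as follows:

To fix it, delete the corresponding entry from known_hosts, as shown in the screenshot: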
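The stale entry can also be removed without opening the file by hand. OpenSSH's own tool for this is ssh-keygen -R, which for a non-default port takes the form ssh-keygen -R "[10.26.58.61]:10003". As an illustration of what that does, here is a simplified sketch that filters plain-text known_hosts content. It assumes unhashed entries, and drop_host_entries is a hypothetical helper name.

```python
def drop_host_entries(known_hosts_text, host, port=22):
    """Return known_hosts content with entries for host:port removed (unhashed entries only)."""
    # Non-default ports are stored as "[host]:port"; port 22 is stored as the bare host.
    marker = host if port == 22 else f"[{host}]:{port}"
    kept = []
    for line in known_hosts_text.splitlines():
        fields = line.split()
        # The first field may list several comma-separated host names.
        hosts = fields[0].split(",") if fields else []
        if marker in hosts:
            continue
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

Writing the result back to the known_hosts file is left out on purpose. On a real machine I would back up the file first, or simply use ssh-keygen -R.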

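Now install Miniconda:

First download the Miniconda installer from the official site:

Miniconda — Anaconda documentation

Install it to the default location, which keeps things simple, so no path changes this time.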

root@6ee3c0747be2:/# echo "PermitRootLogin yes">>/etc/ssh/sshd_config
root@6ee3c0747be2:/# echo "export VISIBLE=now" >>/etc/profile
root@6ee3c0747be2:/# service ssh restart
 * Restarting OpenBSD Secure Shell server sshd                                                                                                                                                         [ OK ] 
root@6ee3c0747be2:/# cd /home/workspace/_share/
root@6ee3c0747be2:/home/workspace/_share# chmod +x Miniconda3-latest-Linux-x86_64.sh 
root@6ee3c0747be2:/home/workspace/_share# sudo ./Miniconda3-latest-Linux-x86_64.sh

Once the installation finishes, conda is ready to use. Exit the container's shell and enter it again:

root@49906fdf69a6:/home/workspace/_share# exit
exit
Neo@Bionet:~/Desktop$ docker ps -a
CONTAINER ID   IMAGE                                    COMMAND                  CREATED        STATUS        PORTS                                                                                    NAMES
49906fdf69a6   nvidia/cuda:12.3.2-runtime-ubuntu22.04   "/opt/nvidia/nvidia_…"   12 hours ago   Up 12 hours   0.0.0.0:10003->22/tcp, :::10003->22/tcp, 0.0.0.0:10004->10002/tcp, :::10004->10002/tcp   pytorch
Neo@Bionet:~/Desktop$ docker exec -it pytorch /bin/bash
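# You will now see conda start up (note the (base) prompt). Next we install PyTorch and run the test code from above, this time connecting to the container remotely from VSCode.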
(base) root@49906fdf69a6:/# 
(base) root@49906fdf69a6:/#

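After connecting with VSCode over SSH, create a test folder and copy the test code into it:

Install the recommended extensions:

Select the conda installation from earlier as the interpreter:

Once selected, a new virtual environment is created for you automatically:

Open a terminal to install PyTorch:

The command is: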

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

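While it downloads, take a look at the resource usage:

Click the run button in the upper-right corner. The result:

Look closely and you will see that the code is actually running on the CPU! That is wrong, and it is caused by a mismatch between the CUDA version and the driver version. After the fix (the command above is already the corrected one, so a normal install should give the result below):

Now package the container as an image for later use:

Command format: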

docker commit [OPTIONS] CONTAINER [REPOSITORY[:TAG]]

docker commit  -a Neo -m "first commit" pytorch pytorch221_cuda122

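After committing, check the result with docker images:

Once it is an image, save it as a tar archive so it is easy to transfer:

docker save [OPTIONS] IMAGE [IMAGE...]

Example:

docker save -o pytorch221_cuda122.tar pytorch221_cuda122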
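Before copying a saved archive to another machine, it can be worth checking which image tags it really contains. The sketch below assumes the archive includes a top-level manifest.json with a RepoTags list per image, which is the layout I would expect from docker save. repo_tags_in_archive is a hypothetical helper name.

```python
import io
import json
import tarfile


def repo_tags_in_archive(fileobj):
    """List the image tags recorded in manifest.json of a `docker save` archive."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        # extractfile raises KeyError if the archive has no manifest.json.
        manifest = json.load(tar.extractfile("manifest.json"))
    tags = []
    for image in manifest:
        # RepoTags may be null for untagged images.
        tags.extend(image.get("RepoTags") or [])
    return tags
```

A typical call would open the tar file in binary mode, for example with open(path, "rb") as f, and pass f to repo_tags_in_archive. An empty list would suggest that the image was saved by ID and carries no tag.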

Optional: running the Docker daemon in rootless mode

As prompted, restart docker:

systemctl --user restart docker

Neo@Bionet:~/Desktop$ systemctl --user restart docker
Failed to restart docker.service: Unit docker.service not found.

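In practice this just adds a few parameters:

I am stopping here: the server's permissions are not open to me, so rootless mode is on hold for now.

References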

Ubuntu - 測試硬碟讀寫速度 - Citrusliu - 部落格園 (cnblogs.com)

Ubuntu 20.04安裝XRDP遠端桌面服務及xfce輕量桌面 – 技術什錦派 (weizhiyong.com)

Ubuntu 20.04 安裝xfce4桌面、Xrdp遠端桌面 - 掘金 (juejin.cn)

如何在 Ubuntu 20.04 安裝 Xrdp 伺服器 | myfreax

How To Transfer Files Over Remote Desktop (All You Need To Know) (helpwire.app)

debian 關閉gnome - 掘金 (juejin.cn), https://juejin.cn/s/debian%20關閉gnome

【保姆級教程】個人深度學習工作站配置指南 - 知乎 (zhihu.com)

linux之使用者和許可權管理（乾貨） - 知乎 (zhihu.com)

Linux——使用者和許可權、使用者組管理、許可權管理_使用者許可權跟組許可權的關係linux-CSDN部落格

/etc/profile、/etc/bashrc、~/.bash_profile、~/.bashrc 檔案的作用 - 倥傯時光 - 部落格園 (cnblogs.com)

1. Introduction — Installation Guide for Linux 12.4 documentation (nvidia.com)

CUDA Toolkit 12.4 Downloads | NVIDIA Developer

nvidia/cuda Tags | Docker Hub

【Linux】CUDA Toolkit和cuDNN版本對應關係（更新至2022年6月，附官網永久更新連結）_cuda12.0對應cudnn-CSDN部落格

linux新增環境變數 - ilovetesting - 部落格園 (cnblogs.com)

Anaconda下載與安裝詳解 - 上善若淚 - 部落格園 (cnblogs.com)

【anaconda】啟用環境失敗-bash: activate:No such file/沒有那個檔案或目錄_bash: activate: no such file or directory-CSDN部落格

How to SSH into Docker containers | CircleCI

ssh連線docker容器；docker容器設定root密碼_docker容器root使用者密碼-CSDN部落格

使用Docker容器配置ssh服務，遠端直接進入容器_ssh連線docker容器群暉-CSDN部落格

Install Docker Engine on Ubuntu | Docker Docs

Linux post-installation steps for Docker Engine | Docker Docs

Run the Docker daemon as a non-root user (Rootless mode) | Docker Docs
