Environment Setup on a Supercomputing Cloud (GPU Server)

Posted by orion on 2021-11-28

I have recently been setting up the environment for my graduation project on a Paratera supercomputing cloud GPU server (China National Grid, Zone 12), and this post records the process.
First, you can log in to the cloud either through the web interface or through a client (address: https://cloud.paratera.com/ ). Note that Paratera only provides Windows and macOS clients, so Linux users may be stuck with the web interface (or with using pappcloud to ssh in directly and write code in vim :( ; for pappcloud usage, see the "papp_cloud User Manual" downloadable from the official site).
Package management on the cloud is most commonly handled with module (docs: https://modules.readthedocs.io/en/latest/module.html). We can list the available packages with the module avail command:

[macong@paratera01 ~]$ module avail 

------------------------- /usr/share/Modules/modulefiles -------------------------
dot         module-git  module-info modules     null        use.own

-------------------------------- /etc/modulefiles --------------------------------
mpi/compat-openmpi16-x86_64 mpi/mpich-x86_64
mpi/mpich-3.0-x86_64        mpi/openmpi-x86_64
mpi/mpich-3.2-x86_64

----------------------------- /software/modulefiles ------------------------------
alphafold/2.0
anaconda/2.7
anaconda/3.7(default)
anaconda/3.7.4
bcftools/1.10.1
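
Besides module avail, a few other module subcommands come up constantly. A quick sketch using the anaconda module listed above (these are standard Environment Modules commands):

```shell
module load anaconda/3.7.4    # load a package into the current shell
module list                   # show the modules currently loaded
module unload anaconda/3.7.4  # unload one module
module purge                  # unload all loaded modules
```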

For the specifics of creating new environments, see the module documentation; they are not repeated here. One caveat: because PyTorch and TensorFlow depend on different CUDA versions, we recommend installing them into two separate environments.
Then load PyTorch and TensorFlow with their respective commands below.
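As a concrete sketch of creating the two environments: the names torch and tensflow match the session transcripts below, and the version pins are taken from the pip list output shown later, but the exact conda/pip invocations are my assumption, not from the original post.

```shell
module load anaconda/3.7.4

# PyTorch environment
conda create -n torch python=3.7
source activate torch
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 \
    -f https://download.pytorch.org/whl/torch_stable.html
conda deactivate

# TensorFlow environment, kept separate because of its own CUDA/cuDNN needs
conda create -n tensflow python=3.7
source activate tensflow
pip install tensorflow-gpu==2.4.1
```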

1. PyTorch Environment Loading and Testing

Loading the PyTorch 1.9.0 environment:

[macong@paratera01 project]$ module load anaconda/3.7.4
[macong@paratera01 project]$ source activate torch
(torch) [macong@paratera01 project]$

Check the installed torch version:

(torch) [macong@paratera01 ~]$ pip list |grep torch
torch                   1.9.0+cu111
torchvision             0.10.0+cu111

Next, write a test file test_torch.py:

# test_torch.py
import torch
print(torch.cuda.is_available())

Submit it to a GPU compute node with the sub_torch.sh script below. Note that the submission script itself must load the environment (so that it is loaded on the compute node); loading the environment on the login node has no effect there (the login node should only be used for installing dependencies). The export PYTHONUNBUFFERED=1 line makes Python flush its output immediately, so prints show up in the .out file in real time.

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 5
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --no-requeue
module load anaconda/3.7.4
source activate torch
export PYTHONUNBUFFERED=1
python test_torch.py

The submission command is

sbatch sub_torch.sh

(Note: the command is not bash sub_torch.sh; bash runs the script locally instead of submitting it to a compute node.)
Check the queue with squeue:

[macong@paratera01 project]$ squeue
CLUSTER: priv
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

CLUSTER: swarm
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16601003       gpu sub_tens   macong  R    INVALID      1 g0002

After a while, open the slurm-16601003.out file in the same directory; it shows

True

which means the PyTorch environment is configured correctly.
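Beyond the is_available() check, it can be worth confirming that a tensor actually computes on the device. A minimal smoke test (my addition, not from the original post; it falls back to CPU so it runs anywhere):

```python
# gpu_smoke_test.py: run a tiny computation on the GPU when one is visible
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.ones(2, 2, device=device)  # a 2x2 tensor of ones on the chosen device
print(x.sum().item())                # 4.0 on either device
```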

2. TensorFlow Environment Loading and Testing

Loading the TensorFlow 2.4.1 environment:

[macong@paratera01 project]$ module load anaconda/3.7.4
[macong@paratera01 project]$ export LD_LIBRARY_PATH=/home/macong/project/cuda/lib64:$LD_LIBRARY_PATH
[macong@paratera01 project]$ source activate tensflow
(tensflow) [macong@paratera01 project]$ 

After loading, check the tensorflow version:

(tensflow) [macong@paratera01 project]$ pip list |grep tensorflow
tensorflow-estimator    2.4.0
tensorflow-gpu          2.4.1

Next, write the following test_tensorflow.py file:

# test_tensorflow.py
import tensorflow as tf
print(tf.test.is_gpu_available())

Submit it with the sub_tensorflow.sh script below (as before, the script itself must load the environment; also, because TensorFlow needs cuDNN, the CUDA dynamic-library directory has to be added to LD_LIBRARY_PATH here as well).

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 5
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --no-requeue
module load anaconda/3.7.4
export LD_LIBRARY_PATH=/home/macong/project/cuda/lib64:$LD_LIBRARY_PATH
source activate tensflow
export PYTHONUNBUFFERED=1
python test_tensorflow.py 

The submission command is

sbatch sub_tensorflow.sh

The output shows

job 16601097 on cluster swarm

Again, we can check the queue with squeue:

[macong@paratera01 project]$ squeue
CLUSTER: priv
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

CLUSTER: swarm
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16601097       gpu sub_tens   macong  R    INVALID      1 g0039

After a while, the slurm-16601097.out file in the same directory contains a long printout:

2021-11-28 15:29:22.848812: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From test_tensorflow.py:2: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-11-28 15:30:04.558903: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-28 15:30:04.592168: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-11-28 15:30:04.596694: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-11-28 15:30:04.736951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:84:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-11-28 15:30:04.737540: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-11-28 15:30:05.810351: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-11-28 15:30:05.810525: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-11-28 15:30:06.033285: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-11-28 15:30:06.193055: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-11-28 15:30:06.630374: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-11-28 15:30:06.820341: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-11-28 15:30:06.847036: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-11-28 15:30:06.850769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-11-28 15:30:06.850852: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-11-28 15:30:09.592923: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-11-28 15:30:09.593017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-11-28 15:30:09.593043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-11-28 15:30:09.628099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 14761 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:84:00.0, compute capability: 7.0)
True

Of course, we only need to look at the final line: the "True" there means the TensorFlow environment is configured correctly.
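One more note: the log above warns that is_gpu_available() is deprecated, and the warning itself names the replacement. An equivalent check with the non-deprecated API would be:

```python
# test_tensorflow2.py: GPU check without the deprecated helper
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')  # list of visible GPU devices
print(len(gpus) > 0)  # True when at least one GPU is visible
```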

3. Common Commands

(1) squeue

squeue shows the current job queue, as we saw earlier:

[macong@paratera01 project]$ squeue
CLUSTER: priv
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

CLUSTER: swarm
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16601097       gpu sub_tens   macong  R    INVALID      1 g0039

(2) scancel
scancel followed by a job id kills a running job. For example, to kill the running job 16601167:

scancel  16601167

The corresponding slurm-16601167.out file will then show:

slurmstepd: error: *** JOB 16601167 ON g0011 CANCELLED AT 2021-11-28T10:10:00 ***
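
Both squeue and scancel also accept a user filter, which helps when several jobs are queued at once (standard Slurm options, not specific to this cluster):

```shell
squeue -u macong    # show only your own jobs
scancel -u macong   # cancel all of your jobs at once
```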

For more commands, see the "China National Grid Zone 12 User Manual v2.4" on the official site.
