DeepSpeed multi-node, multi-GPU training error: ncclSystemError Last error

Posted by 高颜值的殺生丸 on 2024-04-18

Original URL: https://www.cnblogs.com/LiuXinyu12378/p/18144455

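I have recently been working on distributed training of a large model. I was stuck in this pit for two evenings and finally climbed out of it today.

We are using 2 machines, each with 8×H100.

We ran into the following errors.

Error 1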

10.255.19.85: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
10.255.19.85: Last error:
10.255.19.85: socketStartConnect: Connect to 127.0.0.1<34273> failed : Software caused connection abort

Error 2

10.255.19.82: torch.distributed.DistBackendError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by

Error 3

10.255.19.85: ncclInternalError: Internal check failed.
10.255.19.85: Last error:
10.255.19.85: Bootstrap : no socket interface found

All three errors turned out to have the same root cause: the network interface configuration.
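Since the root cause is the interface configuration, a reasonable first check is which interface actually carries each node's cluster IP. The following is a minimal sketch and not part of the original scripts: `iface_for_ip` is a hypothetical helper that parses the output of `ip -o -4 addr show`, assuming the iproute2 `ip` tool and `awk` are installed on every node.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original post).
# Print the name of the interface that owns a given IPv4 address.
# Expects `ip -o -4 addr show` style lines on stdin, for example:
#   2: eth0    inet 192.0.2.10/24 brd 192.0.2.255 scope global eth0
iface_for_ip() {
    local target="$1"
    awk -v ip="$target" '
        {
            # Field 2 is the interface name, field 4 is ADDRESS/PREFIX.
            split($4, parts, "/")
            if (parts[1] == ip) { print $2; exit }
        }'
}

# Usage, run once on each host listed in the hostfile:
#   ip -o -4 addr show | iface_for_ip 10.255.19.82
#   ip -o -4 addr show | iface_for_ip 10.255.19.85
```

If the two hosts print different names, a single fixed value of NCCL_SOCKET_IFNAME cannot be correct on both of them.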

This was the previous configuration:

hostfile

10.255.19.82 slots=8
10.255.19.85 slots=8

fine-tune.sh

hostfile="/data2/xinyuuliu/Baichuan2-main/fine-tune/hostfile"

export NCCL_SOCKET_IFNAME=enp194s0f0
# export MASTER_ADDR=10.255.19.82   # IP address of the master node
# export MASTER_PORT=29500
# --include localhost:0,1,2,3,4,5,6,7
export NCCL_DEBUG=INFO
# export NCCL_IB_TIMEOUT=22
# export NCCL_IB_GID_INDEX=3
# export NCCL_IB_TC=128
# export NCCL_IB_DISABLE=1

deepspeed --master_addr 10.255.19.82 --master_port 29500 --hostfile=$hostfile fine-tune.py  \
    --report_to "none" \
    --data_path "/data2/xinyuuliu/Baichuan2-main/fine-tune/data/全網評價總結訓練資料.json" \
    --model_name_or_path "/data1/xinyuuliu/Baichuan2-13B-Chat" \
    --output_dir "output_lora_summary" \
    --model_max_length 10000 \
    --num_train_epochs 10 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --save_strategy epoch \
    --learning_rate 2e-4 \
    --lr_scheduler_type constant \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --adam_epsilon 1e-8 \
    --max_grad_norm 1.0 \
    --weight_decay 1e-4 \
    --warmup_ratio 0.0 \
    --logging_steps 1 \
    --gradient_checkpointing True \
    --deepspeed ds_config.json \
    --bf16 True \
    --tf32 True \
    --use_lora True
    # --load_lora_path /data2/xinyuuliu/Baichuan2-main/fine-tune/output_lora4_1_1/checkpoint-7516
    # --use_NEFT True
    # --use_frozen True

 #"/data2/xinyuuliu/baichuan2_7B"

Configuration after the fix

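The problem lies mainly in the NCCL_SOCKET_IFNAME environment variable. Its value is carried over to the other machines, but the interface names there are different, and my two machines in particular have a lot of network interfaces. So when setting it, we need to exclude the unused virtual interfaces and leave only the ones we need.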

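The fix has to be adapted to your actual setup. NVIDIA's official syntax is shown below: to exclude network interfaces, start the list with ^.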

export NCCL_SOCKET_IFNAME=^br-c485a8390817,docker0,eno2,ens6f0,ens6f1,enx46a838614d5f,lo,veth23d7383,br-dcd3e4ec14e7,enp194s0f1,ens6f0,ens6f1,enxe278666d5a52,veth110d0b7,veth215ea4e,veth3203d6b,veth87c3cbf,vethec6fc79,virbr0
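A long exclusion list like the one above is easy to get wrong by hand, so it can be generated instead. The sketch below is an illustration under assumptions, not the author's method: `build_exclude_list` is a hypothetical helper that reads interface names on stdin and emits a `^`-prefixed list of everything except the names passed as arguments. Because the same value reaches every node, the input should be the union of the interface names from all nodes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original post).
# Build an NCCL_SOCKET_IFNAME exclusion value ("^a,b,c") from interface
# names read on stdin, one per line. Names passed as arguments are the
# interfaces to keep, so they are left out of the exclusion list.
build_exclude_list() {
    local name keep skip result=""
    while IFS= read -r name; do
        # Ignore empty lines.
        if [ -z "$name" ]; then continue; fi
        skip=0
        for keep in "$@"; do
            if [ "$name" = "$keep" ]; then skip=1; break; fi
        done
        if [ "$skip" -eq 1 ]; then continue; fi
        # Append with a comma separator, except before the first entry.
        result="${result:+$result,}$name"
    done
    printf '^%s\n' "$result"
}

# Usage sketch. Assumes passwordless ssh to the second node; replace the
# two placeholder arguments with the real training interface on each node.
#   { ls /sys/class/net; ssh 10.255.19.85 ls /sys/class/net; } | sort -u \
#       | build_exclude_list IFACE_ON_NODE_1 IFACE_ON_NODE_2
```

As far as I understand the NCCL documentation, names in this variable are matched as prefixes unless written with `=`, so an entry such as `eno2` would also exclude an interface named `eno20`. Check the generated list against the interfaces you intend to keep.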

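I have too many network interfaces. I used to think it was enough to set this environment variable separately in /etc/profile on each machine and then run source /etc/profile. That turned out to be wrong: the worker node still pointed at the master's interface name, so no socket could be established.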

hostfile="/data2/xinyuuliu/Baichuan2-main/fine-tune/hostfile"

export NCCL_SOCKET_IFNAME=^br-c485a8390817,docker0,eno2,ens6f0,ens6f1,enx46a838614d5f,lo,veth23d7383,br-dcd3e4ec14e7,enp194s0f1,ens6f0,ens6f1,enxe278666d5a52,veth110d0b7,veth215ea4e,veth3203d6b,veth87c3cbf,vethec6fc79,virbr0
# export NCCL_SOCKET_IFNAME=enp194s0f0
# export MASTER_ADDR=10.255.19.82   # IP address of the master node
# export MASTER_PORT=29500
# --include localhost:0,1,2,3,4,5,6,7
export NCCL_DEBUG=INFO
# export NCCL_IB_TIMEOUT=22
# export NCCL_IB_GID_INDEX=3
# export NCCL_IB_TC=128
# export NCCL_IB_DISABLE=1

deepspeed --master_addr 10.255.19.82 --master_port 29500 --hostfile=$hostfile fine-tune.py  \
    --report_to "none" \
    --data_path "/data2/xinyuuliu/Baichuan2-main/fine-tune/data/全網評價總結訓練資料.json" \
    --model_name_or_path "/data1/xinyuuliu/Baichuan2-13B-Chat" \
    --output_dir "output_lora_summary" \
    --model_max_length 10000 \
    --num_train_epochs 10 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --save_strategy epoch \
    --learning_rate 2e-4 \
    --lr_scheduler_type constant \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --adam_epsilon 1e-8 \
    --max_grad_norm 1.0 \
    --weight_decay 1e-4 \
    --warmup_ratio 0.0 \
    --logging_steps 1 \
    --gradient_checkpointing True \
    --deepspeed ds_config.json \
    --bf16 True \
    --tf32 True \
    --use_lora True
    # --load_lora_path /data2/xinyuuliu/Baichuan2-main/fine-tune/output_lora4_1_1/checkpoint-7516
    # --use_NEFT True
    # --use_frozen True

 #"/data2/xinyuuliu/baichuan2_7B"

After this change, training ran successfully on all 16×H100. It feels great.
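With NCCL_DEBUG=INFO set as in the script, the training log can also confirm which interface each node picked. The sketch below is an assumption-laden illustration: `bootstrap_ifaces` is a hypothetical helper, `train.log` is a hypothetical file name, and the "Bootstrap : Using" line format is what I would expect NCCL to print at INFO level, so verify it against your own log before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original post).
# Extract the interface/address token that follows "Bootstrap : Using "
# from NCCL INFO log lines on stdin, and print each distinct value once.
bootstrap_ifaces() {
    sed -n 's/.*Bootstrap : Using \([^ ]*\).*/\1/p' | sort -u
}

# Usage sketch: capture the launcher output, then inspect it.
#   bash fine-tune.sh 2>&1 | tee train.log
#   bootstrap_ifaces < train.log
```

One entry per node, each showing that node's own cluster IP, suggests the exclusion list left the right interface on every machine.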
