Distributed Training: DeepSpeed ZeRO 1/2/3 + Accelerate, Megatron-LM

Posted by ForHHeart on 2024-09-07

Original URL: https://www.cnblogs.com/forhheart/p/18401234

1 Introduction

Github: https://github.com/microsoft/DeepSpeed

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

ZeRO-Offload: Democratizing Billion-Scale Model Training

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

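ZeRO (Zero Redundancy Optimizer) is a distributed data-parallel scheme that removes redundant memory across data-parallel workers. It comes in three stages: Stage 1, Stage 2, and Stage 3. DeepSpeed is Microsoft's official engineering implementation of the ZeRO method described in the papers.

ZeRO-Offload moves work from the GPU to the CPU: it offloads optimizer states and gradients to CPU memory and runs the optimizer update on the CPU, so that larger models fit on limited GPU memory.

ZeRO-Infinity also performs offloading, to both CPU memory and NVMe storage. ZeRO-Offload focuses more on the single-GPU scenario, whereas ZeRO-Infinity is a typical industry-style system aimed at extremely large-scale training.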

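ZeRO++ is a communication optimization for ZeRO Stage 3. It improves three aspects:

- Each server (node) keeps a complete copy of the model parameters, which eliminates cross-server all-gather operations.
- During communication, block-based quantization converts the model parameters from FP16 to INT8.
- The ring-based reduce-scatter is replaced by a hierarchical, quantized all-to-all.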
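The block-based quantization idea can be illustrated with a minimal pure-Python sketch. This is not DeepSpeed's implementation (ZeRO++ uses fused GPU kernels); the function names, the block size of 4, and the symmetric scaling scheme are assumptions chosen only to show why one scale per block preserves small values better than one global scale.

```python
def quantize_blockwise(values, block_size=4):
    """Symmetric per-block INT8 quantization (illustrative, not DeepSpeed's kernel).

    Returns the integer codes and one floating-point scale per block.
    """
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # Map the largest magnitude in the block to 127; avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-127, min(127, round(v / scale))) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Recover approximate values by multiplying each code by its block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Hypothetical weights: one block of small values, one block of large values.
weights = [0.02, -0.01, 0.005, 0.015, 3.0, -2.5, 1.25, 0.75]
codes, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(codes, scales)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scales, max_error)
```

Because the first block has its own scale, values such as 0.005 are not rounded to zero, which is what would happen if a single global scale were derived from the 3.0 in the second block.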

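Megatron-LM is a large-scale language model training framework developed by NVIDIA. Compared with DeepSpeed, it offers stronger model parallelism and pipeline parallelism, while DeepSpeed has the advantage in data parallelism.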

2 Preliminaries

2.1 Distributed Parallelism Strategies

Tensor Parallel, Data Parallel, Model Parallel, Pipeline Parallel

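2.2 Estimating Compute Requirements for LLM Inference and Training

2.2.1 Numeric Precision Formats

For large language models, choosing an appropriate precision format is crucial. High-precision formats such as FP32 suit demanding tasks but consume more resources. FP16 and bfloat16 significantly reduce compute and memory cost while largely preserving model quality. Low-precision formats such as int8 and fp4 are better suited to resource-constrained environments, especially inference, where they improve deployment efficiency by shrinking storage and compute requirements. Using these formats appropriately optimizes performance and resource utilization and lets large language models be applied in a wider range of scenarios.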

Name	Abbreviation	Bytes	Bits
Single-precision floating-point format	fp32	4 Bytes	32 bits
Half-precision floating-point format	fp16	2 Bytes	16 bits
Brain floating-point format	bf16	2 Bytes	16 bits
8-bit integer format	int8	1 Byte	8 bits
4-bit floating-point format	fp4	0.5 Bytes	4 bits
4-bit NormalFloat format	nf4	0.5 Bytes	4 bits

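2.2.2 VRAM Requirements - Inference

Taking LLaMA 2 7B as an example, the VRAM requirements are as follows.

These figures count only the model parameters themselves. They do not include other memory needed at runtime, such as optimizer states and activations.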

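Type	Model precision	Model size	Inference/Training	Minimum VRAM (rough estimate)
Full precision	FP32	7B	Inference	7B * 4 Bytes = 28 GB
Half precision	FP16	7B	Inference	7B * 2 Bytes = 14 GB
Low precision	INT8	7B	Inference	7B * 1 Byte = 7 GB
Low precision	INT4	7B	Inference	7B * 0.5 Bytes = 3.5 GB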
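The rough estimate in the table is simply parameter count times bytes per parameter. A small sketch that reproduces the numbers, assuming 1 GB = 10^9 bytes and an exact 7 billion parameters (the helper name `weight_memory_gb` is hypothetical):

```python
# Bytes per parameter for each format listed in the precision table.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
    "fp4": 0.5,
    "nf4": 0.5,
}


def weight_memory_gb(num_params, dtype):
    """Memory for the model weights only, in GB (1 GB = 1e9 bytes).

    Excludes activations, optimizer states, and any other runtime buffers.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"{dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
```

For a 7B model this prints 28.0, 14.0, 7.0, and 3.5 GB, matching the table rows.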

3 ZeRO

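Today, mixed-precision training and the Adam optimizer are standard for distributed LLM training.

ZeRO divides the GPU memory used on each card during training into two categories (taking the Adam optimizer as an example):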

Model States:
- Parameters (fp16)
- Gradients (fp16)
- Optimizer states (fp32): an fp32 copy of the parameters, momentum, and variance

VRAM calculation: let the number of parameters be \(\Psi\). The model states then require \(2\Psi + 2\Psi + (4\Psi + 4\Psi + 4\Psi) = 16\Psi\) bytes of storage.
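The 16 bytes per parameter can be broken down term by term. The sketch below assumes mixed-precision training with Adam as described above. The 7B parameter count is only an illustrative choice, and the function name is hypothetical.

```python
def model_state_bytes(num_params):
    """Bytes of model states for mixed-precision Adam (the 16 * Psi breakdown)."""
    fp16_params = 2 * num_params          # fp16 working copy of the weights
    fp16_grads = 2 * num_params           # fp16 gradients
    fp32_master_params = 4 * num_params   # fp32 copy of the weights kept by the optimizer
    fp32_momentum = 4 * num_params        # Adam first moment
    fp32_variance = 4 * num_params        # Adam second moment
    return (fp16_params + fp16_grads
            + fp32_master_params + fp32_momentum + fp32_variance)


psi = 7_000_000_000  # illustrative 7B-parameter model
total_gb = model_state_bytes(psi) / 1e9
print(f"model states: {total_gb:.0f} GB")
```

For 7B parameters this gives 112 GB of model states before any activations or buffers, which is why a single GPU cannot hold them without partitioning or offloading.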

Residual States:
- Activations
- Temporary buffers
- Unusable memory fragments (fragmentation)

3.1 Stage 1, 2, 3

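\(P_{os}\) denotes optimizer state partitioning (Stage 1). In the ZeRO paper's notation, \(P_{os+g}\) additionally partitions the gradients (Stage 2), and \(P_{os+g+p}\) additionally partitions the parameters (Stage 3).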
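The effect of each stage on per-GPU model-state memory can be sketched with the formulas from the ZeRO paper, where the 12 bytes per parameter of optimizer state, the 2 bytes of gradients, and the 2 bytes of parameters are successively divided by the data-parallel degree. The function name and the 7.5B / 64-GPU example are illustrative choices, and residual states are ignored.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Per-GPU model-state memory in GB (1 GB = 1e9 bytes) for ZeRO stage 0-3.

    Stage 0 is plain data parallelism with nothing partitioned.
    """
    params = 2 * num_params      # fp16 parameters
    grads = 2 * num_params       # fp16 gradients
    opt = 12 * num_params        # fp32 parameter copy + momentum + variance
    if stage >= 1:               # Stage 1: partition optimizer states
        opt /= num_gpus
    if stage >= 2:               # Stage 2: also partition gradients
        grads /= num_gpus
    if stage >= 3:               # Stage 3: also partition parameters
        params /= num_gpus
    return (params + grads + opt) / 1e9


for stage in range(4):
    gb = zero_model_state_gb(7.5e9, 64, stage)
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

With 7.5B parameters on 64 GPUs, this yields roughly 120 GB without ZeRO, about 31.4 GB for Stage 1, about 16.6 GB for Stage 2, and about 1.9 GB for Stage 3.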

3.2 ZeRO-Offload

3.2.1 Communication Volume Analysis

3.3 ZeRO-Infinity

3.4 ZeRO++

4 DeepSpeed + Accelerate

pip install deepspeed
accelerate config
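During `accelerate config`, DeepSpeed can be selected and, optionally, pointed at a DeepSpeed JSON config file. As a hedged sketch, the script below writes such a file using commonly documented DeepSpeed keys. The file name `ds_config.json`, the batch sizes, and the choice of Stage 2 with CPU optimizer offload are assumptions for illustration, not settings from the original post.

```python
import json

# Illustrative DeepSpeed configuration: ZeRO Stage 2 with optimizer offload to CPU.
# All numeric values are placeholders to be tuned for a real run.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

config_text = json.dumps(ds_config, indent=2)


def write_config(path="ds_config.json"):
    """Write the config to disk so it can be referenced from `accelerate config`."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(config_text)
    return path


print(config_text)
```

After calling `write_config()` and completing `accelerate config`, training would typically be started with `accelerate launch` followed by the training script. Changing `"stage"` to 1 or 3 selects the other ZeRO stages.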

5 Megatron-LM

Github: https://github.com/NVIDIA/Megatron-LM

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Reducing Activation Recomputation in Large Transformer Models
