MinkowskiEngine Multi-GPU Training

Posted by wujianming_110117 on 2021-01-04

Original URL: https://wujianming110117.blog.csdn.net/article/details/112167515

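Currently, MinkowskiEngine supports multi-GPU training through data parallelization. In data parallelization, each training iteration uses a set of mini-batches, and each mini-batch is fed to one replica of the network.
First, define a network.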
import MinkowskiEngine as ME
from examples.minkunet import MinkUNet34C

# Copy the network to GPU
net = MinkUNet34C(3, 20, D=3)
net = net.to(target_device)
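Synchronized batch norm
Next, create a new network in which every ME.MinkowskiBatchNorm is replaced by ME.MinkowskiSyncBatchNorm. Because the batch statistics are synchronized across GPUs, the network can use a large total batch size and keep the same behavior as single-GPU training.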

# Synchronized batch norm
net = ME.MinkowskiSyncBatchNorm.convert_sync_batchnorm(net)
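To see why synchronized statistics let several small per-GPU batches behave like one large batch, the following minimal sketch (plain Python, no GPU needed) merges per-device counts, sums, and sums of squares into global statistics. The names `local_stats` and `combine_stats` are hypothetical, and this only illustrates the idea; it is not MinkowskiEngine's actual implementation.

```python
from statistics import fmean, pvariance


def local_stats(values):
    """Per-device partial statistics: (count, sum, sum of squares)."""
    return len(values), sum(values), sum(v * v for v in values)


def combine_stats(partials):
    """Merge per-device partials into a global mean and biased variance."""
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    var = total_sq / n - mean * mean
    return mean, var


# Hypothetical feature values held by two devices.
device_batches = [[0.5, 1.5, 2.0], [3.0, 4.5, 0.5]]
mean, var = combine_stats([local_stats(b) for b in device_batches])

# Reference: statistics over the concatenated (single large) batch.
flat = [v for b in device_batches for v in b]
print(abs(mean - fmean(flat)) < 1e-9, abs(var - pvariance(flat)) < 1e-9)
```

Each device only has to share three numbers per channel, so the normalization matches what a single GPU would compute on the full batch.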
Next, create replicas of the network and of the final loss layer (if you use one).
import torch.nn as nn
import torch.nn.parallel as parallel

# One copy of the loss layer per device
criterion = nn.CrossEntropyLoss()
criterions = parallel.replicate(criterion, devices)
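Loading multiple batches
During training, every iteration needs one mini-batch per device. The example uses a function that returns a single mini-batch, but you do not have to follow this pattern.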

# Get new data
inputs, labels = [], []
for i in range(num_devices):
    coords, feat, label = data_loader()  # parallel data loaders can be used
    with torch.cuda.device(devices[i]):
        # Same constructor call as in the full example below
        inputs.append(ME.SparseTensor(feat, coords, device=devices[i]))
        labels.append(label.to(devices[i]))
Copying the weights to the devices
First, copy the weights to all devices.
replicas = parallel.replicate(net, devices)
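Applying the replicas to all batches
Next, feed each mini-batch to the corresponding replica of the network on its device. Then feed all output features to the loss layers.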
outputs = parallel.parallel_apply(replicas, inputs, devices=devices)

# Extract features from the sparse tensors to use a pytorch criterion
out_features = [output.F for output in outputs]
losses = parallel.parallel_apply(
    criterions, tuple(zip(out_features, labels)), devices=devices)
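As a conceptual aid, the pattern behind this step is "call the i-th replica on the i-th input, concurrently, and return the outputs in order". The sketch below mimics that pattern with plain Python threads and hypothetical stand-in replicas (simple functions, no GPUs). It is an analogy under those assumptions, not PyTorch's implementation of `parallel_apply`.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_apply_sketch(replicas, inputs):
    """Call replicas[i](inputs[i]) concurrently; return outputs in order."""
    assert len(replicas) == len(inputs)
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(r, x) for r, x in zip(replicas, inputs)]
        return [f.result() for f in futures]


# Hypothetical stand-ins: every "replica" applies the same weight,
# just as every network replica holds a copy of the same parameters.
weight = 2.0
replicas = [lambda batch, w=weight: [w * v for v in batch] for _ in range(2)]
mini_batches = [[1.0, 2.0], [3.0, 4.0]]

outputs = parallel_apply_sketch(replicas, mini_batches)
print(outputs)
```

In the real call, each replica and its input already live on their own GPU, which is why the inputs were created per device in the loading step.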
Gather all losses onto the target device.
loss = parallel.gather(losses, target_device, dim=0).mean()
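One detail worth keeping in mind: taking `.mean()` over the gathered losses weights every device equally. If each per-device loss is already an average over that device's points, and the devices hold different numbers of points (common with sparse tensors), the result differs from the average over all points. The sketch below uses made-up numbers and hypothetical helper names to show the difference; whether it matters depends on your data.

```python
def mean_of_means(per_device_losses):
    """Unweighted average, analogous to gathering losses and calling .mean()."""
    return sum(per_device_losses) / len(per_device_losses)


def weighted_mean(per_device_losses, num_points):
    """Average weighted by the number of points each device processed."""
    total = sum(num_points)
    weighted = sum(loss * n for loss, n in zip(per_device_losses, num_points))
    return weighted / total


# Hypothetical per-device mean losses and point counts.
losses = [1.0, 3.0]
points = [100, 300]

print(mean_of_means(losses))          # 2.0
print(weighted_mean(losses, points))  # 2.5
```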
The rest of training (the backward pass and the optimizer step) is the same as in single-GPU training. See the complete multi-GPU example below for more details.
import os
import argparse
import numpy as np
from time import time
from urllib.request import urlretrieve

try:
    import open3d as o3d
except ImportError:
    raise ImportError("Please install open3d-python with `pip install open3d`.")

import torch
import torch.nn as nn
from torch.optim import SGD

import MinkowskiEngine as ME
from examples.minkunet import MinkUNet34C

import torch.nn.parallel as parallel

# Download the sample point cloud if it is not already present
if not os.path.isfile("1.ply"):
    urlretrieve("http://cvgl.stanford.edu/data2/minkowskiengine/1.ply", "1.ply")

parser = argparse.ArgumentParser()
parser.add_argument("--file_name", type=str, default="1.ply")
parser.add_argument("--batch_size", type=int, default=4)
parser.add_argument("--max_ngpu", type=int, default=2)

# Cache of loaded point clouds, keyed by file name
cache = {}


def load_file(file_name, voxel_size):
    # Read the point cloud once and reuse it on later calls
    if file_name not in cache:
        pcd = o3d.io.read_point_cloud(file_name)
        cache[file_name] = pcd

    pcd = cache[file_name]
    quantized_coords, feats = ME.utils.sparse_quantize(
        np.array(pcd.points, dtype=np.float32),
        np.array(pcd.colors, dtype=np.float32),
        quantization_size=voxel_size,
    )
    # Dummy labels (all zeros); this example only measures speed
    random_labels = torch.zeros(len(feats))

    return quantized_coords, feats, random_labels


def generate_input(file_name, voxel_size):
    # Create a batch, this process is done in a data loader during training in parallel.
    batch = [load_file(file_name, voxel_size)]
    coordinates_, features_, labels_ = list(zip(*batch))
    coordinates, features, labels = ME.utils.sparse_collate(
        coordinates_, features_, labels_
    )

    # Normalize features and create a sparse tensor
    return coordinates, (features - 0.5).float(), labels


if __name__ == "__main__":
    # loss and network
    config = parser.parse_args()
    num_devices = torch.cuda.device_count()
    num_devices = min(config.max_ngpu, num_devices)
    devices = list(range(num_devices))
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print("' WARNING: This example is deprecated. '")
    print("' Please use DistributedDataParallel or pytorch-lightning'")
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print(
        f"Testing {num_devices} GPUs. Total batch size: {num_devices * config.batch_size}"
    )

    # For copying the final loss back to one GPU
    target_device = devices[0]

    # Copy the network to GPU
    net = MinkUNet34C(3, 20, D=3)
    net = net.to(target_device)

    # Synchronized batch norm
    net = ME.MinkowskiSyncBatchNorm.convert_sync_batchnorm(net)
    optimizer = SGD(net.parameters(), lr=1e-1)

    # Copy the loss layer
    criterion = nn.CrossEntropyLoss()
    criterions = parallel.replicate(criterion, devices)
    min_time = np.inf

    for iteration in range(10):
        optimizer.zero_grad()

        # Get new data
        inputs, all_labels = [], []
        for i in range(num_devices):
            coordinates, features, labels = generate_input(config.file_name, 0.05)
            with torch.cuda.device(devices[i]):
                inputs.append(ME.SparseTensor(features, coordinates, device=devices[i]))
                all_labels.append(labels.long().to(devices[i]))

        # The raw version of the parallel_apply
        st = time()
        replicas = parallel.replicate(net, devices)
        outputs = parallel.parallel_apply(replicas, inputs, devices=devices)

        # Extract features from the sparse tensors to use a pytorch criterion
        out_features = [output.F for output in outputs]
        losses = parallel.parallel_apply(
            criterions, tuple(zip(out_features, all_labels)), devices=devices
        )
        loss = parallel.gather(losses, target_device, dim=0).mean()
        # Gradient
        loss.backward()
        optimizer.step()

        t = time() - st
        min_time = min(t, min_time)
        print(
            f"Iteration: {iteration}, Loss: {loss.item()}, Time: {t}, Min time: {min_time}"
        )

        # Must clear cache at regular interval
        if iteration % 10 == 0:
            torch.cuda.empty_cache()

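Speedup experiments
The experiments were run on 4x Titan XP with various total batch sizes, with the load divided evenly among the GPUs. For example, with a total batch size of 8, one GPU processes a batch of 8, two GPUs process a batch of 4 each, and four GPUs process a batch of 2 each.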
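The even split described above can be sketched with a small helper. The function name `per_gpu_batch_size` is hypothetical; it returns `None` when the total batch size cannot be divided evenly, which may be why some GPU counts do not appear for every total batch size in the measurements.

```python
def per_gpu_batch_size(total_batch_size, num_gpus):
    """Return the per-GPU batch size, or None if the split is uneven."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if total_batch_size % num_gpus != 0:
        return None
    return total_batch_size // num_gpus


for total in (8, 18):
    splits = {n: per_gpu_batch_size(total, n) for n in range(1, 5)}
    print(total, splits)
```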

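Total batch size 8:

| Number of GPUs | Batch size per GPU | Time per iteration | Speedup (ideal) |
|---|---|---|---|
| 1 GPU | 8 | 1.611 s | x1 (x1) |
| 2 GPUs | 4 | 0.916 s | x1.76 (x2) |
| 4 GPUs | 2 | 0.689 s | x2.34 (x4) |

Total batch size 12:

| Number of GPUs | Batch size per GPU | Time per iteration | Speedup (ideal) |
|---|---|---|---|
| 1 GPU | 12 | 2.691 s | x1 (x1) |
| 2 GPUs | 6 | 1.413 s | x1.90 (x2) |
| 3 GPUs | 4 | 1.064 s | x2.53 (x3) |
| 4 GPUs | 3 | 1.006 s | x2.67 (x4) |

Total batch size 16:

| Number of GPUs | Batch size per GPU | Time per iteration | Speedup (ideal) |
|---|---|---|---|
| 1 GPU | 16 | 3.543 s | x1 (x1) |
| 2 GPUs | 8 | 1.933 s | x1.83 (x2) |
| 4 GPUs | 4 | 1.322 s | x2.68 (x4) |

Total batch size 18:

| Number of GPUs | Batch size per GPU | Time per iteration | Speedup (ideal) |
|---|---|---|---|
| 1 GPU | 18 | 4.391 s | x1 (x1) |
| 2 GPUs | 9 | 2.114 s | x2.08 (x2) |
| 3 GPUs | 6 | 1.660 s | x2.65 (x3) |

Total batch size 20:

| Number of GPUs | Batch size per GPU | Time per iteration | Speedup (ideal) |
|---|---|---|---|
| 1 GPU | 20 | 4.639 s | x1 (x1) |
| 2 GPUs | 10 | 2.426 s | x1.91 (x2) |
| 4 GPUs | 5 | 1.707 s | x2.72 (x4) |

Total batch size 21:

| Number of GPUs | Batch size per GPU | Time per iteration | Speedup (ideal) |
|---|---|---|---|
| 1 GPU | 21 | 4.894 s | x1 (x1) |
| 3 GPUs | 7 | 1.877 s | x2.61 (x3) |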
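The speedup column is the single-GPU time divided by the multi-GPU time. The short sketch below recomputes it for the total batch size of 8 and also reports parallel efficiency (speedup divided by the number of GPUs). The helper name `speedup_table` is hypothetical, and efficiency is an extra metric added here for illustration; it is not reported in the measurements.

```python
def speedup_table(times_by_gpus):
    """Map GPU count -> (speedup vs. 1 GPU, parallel efficiency)."""
    base = times_by_gpus[1]
    return {
        n: (base / t, base / t / n)
        for n, t in sorted(times_by_gpus.items())
    }


# Seconds per iteration for total batch size 8, copied from the measurements.
times = {1: 1.611, 2: 0.916, 4: 0.689}

for n, (speedup, efficiency) in speedup_table(times).items():
    print(f"{n} GPU(s): speedup x{speedup:.2f}, efficiency {efficiency:.0%}")
```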
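Analysis
With small batch sizes, the speedup is quite modest. With large batch sizes (e.g., 18 and 20), the speedup improves because the thread-initialization overhead is amortized over a larger workload.
In addition, using 4 GPUs is inefficient in every case, and the extra gain over 3 GPUs is very small (x2.65 for 3 GPUs with a total batch size of 18 versus x2.72 for 4 GPUs with a total batch size of 20). We therefore recommend using at most 3 GPUs, with large batch sizes.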

| Number of GPUs | Average speedup (ideal) |
|---|---|
| 1 GPU | x1 (x1) |
| 2 GPUs | x1.90 (x2) |
| 3 GPUs | x2.60 (x3) |
| 4 GPUs | x2.60 (x4) |
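The speedup is modest because of heavy CPU usage. In the Minkowski Engine, all sparse tensor coordinates are managed on the CPU, and building the kernel in-out map requires significant CPU computation. To increase speed, we therefore recommend a faster CPU, since the CPU can become the bottleneck for large point clouds.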
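One way to quantify how much of each iteration does not parallelize is the Karp-Flatt metric, which estimates the serial fraction e from a measured speedup s on p processors as e = (1/s - 1/p) / (1 - 1/p). This metric is brought in here only for illustration and is not part of the original analysis. The sketch applies it to the average speedups reported above.

```python
def karp_flatt(speedup, num_gpus):
    """Experimentally determined serial fraction e = (1/s - 1/p) / (1 - 1/p)."""
    if num_gpus < 2:
        raise ValueError("the metric is defined for at least 2 processors")
    return (1 / speedup - 1 / num_gpus) / (1 - 1 / num_gpus)


# Average speedups copied from the summary of the measurements.
average_speedup = {2: 1.90, 3: 2.60, 4: 2.60}

for gpus, s in average_speedup.items():
    print(f"{gpus} GPUs: estimated serial fraction {karp_flatt(s, gpus):.3f}")
```

A serial fraction that grows with the number of GPUs is consistent with an overhead that increases as GPUs are added, such as CPU-side coordinate management, although the metric alone does not identify the cause.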
