Distributed Training and Multi-GPU Acceleration Strategies

I. Why Use Distributed Training?

Distributed training uses parallel computation to address the following needs:
- Handling very large datasets (TB scale)
- Speeding up model training (ideally close to linear speedup with the number of GPUs)
- Getting past the memory limit of a single GPU
- Training industrial-scale models (e.g. LLaMA, GPT)

II. Single-Node Multi-GPU Training in Practice

1. Data Parallelism Basics
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader, DistributedSampler

# Prepare the dataset
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5,), (0.5,))
])
dataset = torchvision.datasets.MNIST(
    root='./data', train=True, download=True, transform=transform)

# Define the model
class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv2d(1, 32, 3),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.fc_layers = nn.Sequential(
            nn.Linear(1600, 256),  # 64 channels x 5 x 5 feature map for 28x28 MNIST inputs
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        x = self.conv_layers(x)
        x = x.view(x.size(0), -1)
        return self.fc_layers(x)

# Wrap with DataParallel (suited to single-node multi-GPU setups)
model = nn.DataParallel(ConvNet().cuda())
print("Number of GPUs in use:", torch.cuda.device_count())

# Example training loop
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
dataloader = DataLoader(dataset, batch_size=512, shuffle=True)

for epoch in range(2):
    for inputs, labels in dataloader:
        inputs = inputs.cuda()
        labels = labels.cuda()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1} Loss: {loss.item():.4f}")
III. Distributed Data Parallel (DDP)

1. Initializing the Distributed Environment

import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Address and port of the rank-0 process that coordinates the group
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # NCCL is the recommended backend for multi-GPU training
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()
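If you want to smoke-test the DDP code on a machine without GPUs, one option is to fall back to the gloo backend; a hedged variant of setup() under that assumption:

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # NCCL needs GPUs; gloo lets the same script run on CPU-only machines for debugging
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)

With gloo, the model would also stay on the CPU and DDP would be constructed without device_ids.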
2. The Distributed Training Function

def train_ddp(rank, world_size):
    setup(rank, world_size)

    # Distributed sampler: each process gets a distinct shard of the dataset
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=256, sampler=sampler)

    # Build the model on this process's GPU and wrap it with DDP
    model = ConvNet().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    optimizer = torch.optim.Adam(ddp_model.parameters())
    criterion = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle so each epoch uses a different shard split
        for inputs, labels in dataloader:
            inputs = inputs.to(rank)
            labels = labels.to(rank)
            outputs = ddp_model(inputs)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if rank == 0:
            print(f"Epoch {epoch+1} Loss: {loss.item():.4f}")

    cleanup()
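To persist the trained weights, it is common to let only rank 0 write the checkpoint; a minimal sketch that would go just before cleanup() inside train_ddp (the file name is an example):

if rank == 0:
    # Unwrap the DDP container so the saved keys carry no "module." prefix
    torch.save(ddp_model.module.state_dict(), "convnet_ddp.pt")
# Keep other ranks from destroying the process group while rank 0 is still writing
dist.barrier()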
3. Launching the Distributed Run

import torch.multiprocessing as mp

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    print(f"Launching distributed training on {world_size} GPUs")
    # mp.spawn passes the process index (the rank) as the first argument to train_ddp
    mp.spawn(train_ddp, args=(world_size,), nprocs=world_size, join=True)
IV. Advanced Acceleration Strategies

1. Mixed-Precision Training

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, labels in dataloader:
    inputs = inputs.cuda()
    labels = labels.cuda()

    # Run the forward pass and loss computation in mixed precision
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)

    # Scale the loss so fp16 gradients do not underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
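If you also clip gradients, they have to be unscaled first so the clipping threshold applies to the true gradient values. A sketch of the adjusted update step (max_norm=1.0 is just an example value):

scaler.scale(loss).backward()
# Undo the loss scaling before measuring the gradient norm
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)  # the step is skipped automatically if any gradient overflowed
scaler.update()
optimizer.zero_grad()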
2. Gradient Accumulation

accumulation_steps = 4

for i, (inputs, labels) in enumerate(dataloader):
    inputs = inputs.cuda()
    labels = labels.cuda()
    outputs = model(inputs)
    # Divide by the accumulation factor so the summed gradient matches one large batch
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()

    # Update the weights only once every accumulation_steps mini-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
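When gradient accumulation is combined with DDP, every backward() would normally trigger a gradient all-reduce, even on steps where the optimizer does not update. DDP's no_sync() context manager can skip that communication on the accumulation-only steps; a sketch assuming the ddp_model and rank from Section III:

import contextlib

accumulation_steps = 4
for i, (inputs, labels) in enumerate(dataloader):
    inputs = inputs.to(rank)
    labels = labels.to(rank)
    is_update_step = (i + 1) % accumulation_steps == 0
    # Only synchronize gradients across ranks on the step that actually updates the weights
    sync_ctx = contextlib.nullcontext() if is_update_step else ddp_model.no_sync()
    with sync_ctx:
        loss = criterion(ddp_model(inputs), labels) / accumulation_steps
        loss.backward()
    if is_update_step:
        optimizer.step()
        optimizer.zero_grad()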
V. Performance Comparison

1. Comparison of Parallelization Approaches

| Method | GPU Memory Usage | Training Speed (s/epoch) | Resource Utilization |
|---|---|---|---|
| Single GPU | 10.2 GB | 58 s | 35% |
| DataParallel | 10.5 GB | 32 s | 68% |
| DDP | 5.1 GB | 28 s | 92% |
| DDP + mixed precision | 3.2 GB | 22 s | 98% |
VI. Frequently Asked Questions

Q1: What should you do when multi-GPU training runs out of GPU memory?
- Use gradient accumulation (see Section IV)
- Enable activation checkpointing:

from torch.utils.checkpoint import checkpoint

def forward(self, x):
    # Recompute activations in the backward pass instead of storing them
    # (conv_block1 / conv_block2 stand in for any sub-modules of your model)
    x = checkpoint(self.conv_block1, x)
    x = checkpoint(self.conv_block2, x)
    return x
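Applied to the ConvNet from Section II, a hedged sketch would checkpoint the convolutional stack (use_reentrant=False is the variant recommended by recent PyTorch releases and also works when the inputs themselves do not require gradients):

from torch.utils.checkpoint import checkpoint

class CheckpointedConvNet(ConvNet):
    def forward(self, x):
        # Activations of the conv stack are recomputed during backward instead of stored
        x = checkpoint(self.conv_layers, x, use_reentrant=False)
        x = x.view(x.size(0), -1)
        return self.fc_layers(x)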
Q2: How do I resolve deadlocks in distributed training?
- Make sure every process takes part in the same collective operations
- Use torch.distributed.barrier() to synchronize processes where needed (see the sketch below)
- Check that data loading is aligned across ranks: if one rank runs fewer iterations than the others, the remaining ranks block forever waiting for the gradient all-reduce
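A common use of the barrier is to let rank 0 perform one-off work, such as downloading the dataset, while the other ranks wait; a minimal sketch reusing the MNIST dataset and transform from Section II:

# Rank 0 downloads the data; the others wait, then read the already-downloaded files
if rank == 0:
    torchvision.datasets.MNIST(root='./data', train=True, download=True)
dist.barrier()
dataset = torchvision.datasets.MNIST(
    root='./data', train=True, download=True, transform=transform)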
Q3: How do I configure multi-node training?

# Example launch commands for two nodes (add --master_port=<port> on both nodes if the default port is occupied)
# Node 1:
# torchrun --nnodes=2 --node_rank=0 --nproc_per_node=4 --master_addr=192.168.1.1 main.py
# Node 2:
# torchrun --nnodes=2 --node_rank=1 --nproc_per_node=4 --master_addr=192.168.1.1 main.py
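torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT as environment variables for every process, so main.py (not shown in this post; its structure is assumed here) usually initializes from them instead of calling mp.spawn:

import os
import torch
import torch.distributed as dist

def setup_from_env():
    # init_method="env://" reads RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT set by torchrun
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank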
VII. Summary and What's Next

- Key points of this post:
  - Single-node multi-GPU training with DataParallel
  - Distributed Data Parallel (DDP) implementation
  - Mixed-precision training and gradient accumulation
- Coming next: Part 7 dives into the PyTorch ecosystem and builds Transformer models hands-on with Hugging Face!