PyTorch Pocket Reference Notes (11)
Chapter 6: PyTorch Acceleration and Optimization (Performance Improvement), Part 2
Model Parallel Processing
In model parallel processing, the model is split into several pieces across the GPUs, and batches of data are fed to each piece so that the computation runs in parallel.
Model parallel processing is often reserved for cases in which the model does not fit on a single GPU.
As the figure above shows, the model is split into N parts (where N is the number of GPUs). The input data is pipelined to the GPUs, so only the first batches are processed sequentially; after that, all GPUs compute in parallel.
When we pipeline the data, only the first N batches are run in sequence, and then each subsequent run activates all the GPUs.
Model parallelization is not as straightforward as data parallelism; it requires restructuring the model.
You need to define how the model is partitioned and how the data pipeline is built.
Example code:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision.models import AlexNet

class TwoGPUAlexNet(AlexNet):
    """
    Subclass of torchvision's AlexNet.
    In __init__() we describe which pieces of the model go on GPU 0
    and which pieces go on GPU 1. In forward() we pipeline the data
    through each GPU to implement GPU pipelining.
    """
    def __init__(self, num_classes=1000, split_size=20):
        super(TwoGPUAlexNet, self).__init__(num_classes=num_classes)
        # Stage 1 (feature extraction) lives on GPU 0
        self.features.to('cuda:0')
        self.avgpool.to('cuda:0')
        # Stage 2 (classification) lives on GPU 1
        self.classifier.to('cuda:1')
        self.split_size = split_size

    def forward(self, x):
        # Split the batch into micro-batches for pipelining
        splits = iter(x.split(self.split_size, dim=0))
        s_next = next(splits)
        # Run the first micro-batch through stage 1, then move it to GPU 1
        s_prev = self.avgpool(self.features(s_next)).to('cuda:1')
        ret = []
        for s_next in splits:
            # GPU 1 classifies the previous micro-batch...
            ret.append(self.classifier(s_prev.view(s_prev.size(0), -1)))
            # ...while GPU 0 processes the next micro-batch
            s_prev = self.avgpool(self.features(s_next)).to('cuda:1')
        # Classify the final micro-batch on GPU 1
        ret.append(self.classifier(s_prev.view(s_prev.size(0), -1)))
        return torch.cat(ret)
model = TwoGPUAlexNet()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

# Train the model; make sure the labels end up on the last GPU,
# since that is where the model's outputs are produced.
for epoch in range(n_epochs):
    for input, labels in dataloader:
        # Send the inputs and labels to different GPUs
        input, labels = input.to('cuda:0'), labels.to('cuda:1')
        optimizer.zero_grad()
        outputs = model(input)
        loss_fn(outputs, labels).backward()
        optimizer.step()
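The loop above assumes that n_epochs and dataloader are already defined elsewhere. A minimal sketch to make it runnable could look like the following; the epoch count, batch size, image size, and target shape are illustrative assumptions rather than values from the book (the targets are floats shaped like the model output because MSELoss is used above).

from torch.utils.data import DataLoader, TensorDataset

n_epochs = 2  # assumed value, just for illustration
# Random images and float targets shaped like the model output
# (MSELoss requires targets with the same shape as the predictions).
images = torch.randn(64, 3, 224, 224)
targets = torch.randn(64, 1000)
# batch_size=40 gives two micro-batches per batch with split_size=20
dataloader = DataLoader(TensorDataset(images, targets), batch_size=40)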
Combining Data and Model Parallelism
In this case, you will wrap your model using DDP to distribute your data batches among multiple processes. Each process will use multiple GPUs, and your model will be partitioned among each of those GPUs.
Compared with the previous pattern, two changes are needed:
- Modify the multi-GPU model class so that the devices are passed in as arguments.
- In forward(), the output device setting can be omitted, because DDP decides on its own where the input and output data are placed.
class Simple2GPUModel(nn.Module):
    def __init__(self, dev0, dev1):
        super(Simple2GPUModel, self).__init__()
        # The devices are passed in rather than hardcoded
        self.dev0 = dev0
        self.dev1 = dev1
        self.net1 = torch.nn.Linear(10, 10).to(dev0)
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        # First layer runs on dev0, second layer on dev1
        x = x.to(self.dev0)
        x = self.relu(self.net1(x))
        x = x.to(self.dev1)
        return self.net2(x)
def model_parallel_training(rank, world_size):
    print("Running DDP with a model parallel module")
    setup(rank, world_size)

    # Set up the model and devices for this process:
    # each process owns two consecutive GPUs.
    dev0 = rank * 2
    dev1 = rank * 2 + 1
    mp_model = Simple2GPUModel(dev0, dev1)

    # Wrap the model in DDP
    ddp_mp_model = DDP(mp_model)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_mp_model.parameters(), lr=0.001)

    for epoch in range(n_epochs):
        for input, labels in dataloader:
            # Move the inputs and labels to the appropriate device IDs
            input = input.to(dev0)
            labels = labels.to(dev1)
            optimizer.zero_grad()
            # The output is on dev1
            outputs = ddp_mp_model(input)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

    cleanup()
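The function above relies on a setup()/cleanup() pair and on a launcher that spawns one process per model replica, none of which are shown here. A minimal sketch following the standard torch.distributed pattern could look like this; the master address, port, and world size are assumptions for a single machine with four GPUs.

import os
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Assumed single-node rendezvous settings; adjust for your cluster
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

if __name__ == "__main__":
    # Two processes, each holding half of the model on two GPUs,
    # requires four GPUs in total.
    world_size = 2
    mp.spawn(model_parallel_training,
             args=(world_size,),
             nprocs=world_size,
             join=True)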
Distributed Training (Multiple Machines)
PyTorch’s distributed subpackage, torch.distributed, provides a rich set of capabilities to suit a variety of training architectures and hardware platforms.
The torch.distributed subpackage consists of three components: DDP, RPC-based distributed training (RPC), and collective communication (c10d).
I have not worked with this part yet, so this is only a note for the record; I will come back and fill it in when I actually use it.
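For the record, here is a tiny illustration of the collective communication (c10d) layer that the other two components build on: each process contributes a tensor to an all_reduce, after which every process holds the sum. The backend, address, and tensor values below are assumptions chosen just for the example.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run_all_reduce(rank, world_size):
    # The gloo backend works even on CPU-only machines
    dist.init_process_group("gloo",
                            init_method="tcp://127.0.0.1:23456",
                            rank=rank, world_size=world_size)
    t = torch.ones(3) * (rank + 1)
    # After all_reduce every rank holds the element-wise sum
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run_all_reduce, args=(2,), nprocs=2, join=True)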