PyTorch Pocket Reference Notes, Part 10

2021-08-26  深思海数_willschang
PyTorch Pocket Reference

Chapter 6: PyTorch Acceleration and Optimization (Performance Improvement), Part 1

In real-world applications, the data we face is often much larger than in the previous chapters, and the network architectures are more complex (with more parameters), so the model needs to be optimized and accelerated.

Longer training times can become frustrating, especially when you want to conduct many experiments using different hyperparameter configurations.

In this chapter, the author introduces several methods for network acceleration and optimization:

Network Acceleration

TPUs

Google developed its own ASIC for NN acceleration called the TPU.
Because TPUs are designed and optimized specifically for NN computation, they avoid some of the drawbacks of GPUs. Google offers TPUs as a service on Google Cloud, and you can also select a TPU runtime when running Google Colab.
If your budget allows, a TPU is a good choice. I won't demonstrate TPU usage here, since it is not convenient to use at the moment; I'll study it when an actual project calls for it.
If you do need a TPU now, note that PyTorch does not yet natively support TPUs, so you need to install an intermediate package, PyTorch/XLA (Accelerated Linear Algebra), to handle the communication between them. For more details on PyTorch/XLA, see the official documentation: https://github.com/pytorch/xla/

# Download the PyTorch/XLA environment setup script (for use in Colab)
curl 'https://raw.githubusercontent.com/pytorch' \
'/xla/master/contrib/scripts/env-setup.py' \
-o pytorch-xla-env-setup.py

import torch_xla.core.xla_model as xm

# Acquire the TPU (XLA) device, analogous to torch.device('cuda')
device = xm.xla_device()
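The book stops at obtaining the device. As a rough sketch of my own (assuming the standard PyTorch workflow; this is not code from the book), the XLA device is then used much like a CUDA device: move the model and data to it and run the usual training step, with xm.optimizer_step() stepping the optimizer and, with barrier=True, forcing XLA to execute the lazily built graph. The tiny linear model and random data below are just placeholders.

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()                       # the TPU (XLA) device
model = nn.Linear(10, 2).to(device)            # move the model to the TPU core
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(8, 10).to(device)         # data must live on the XLA device too
labels = torch.randint(0, 2, (8,)).to(device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
xm.optimizer_step(optimizer, barrier=True)     # step the optimizer and flush the XLA graph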

Consider using TPUs when:
1. Your model is dominated by dense vector and matrix computations (which TPUs perform very fast).
2. Your model has long training times.
3. You want to run multiple iterations of your entire training loop on TPUs.

Parallel Processing and Distributed Training

Multiple GPUs (single machine, multiple GPUs)
A good setup makes full use of the available hardware rather than leaving resources idle. This section demonstrates how to use multiple GPUs on a single machine.
Using multiple GPUs this way is commonly referred to as parallel processing.

Data parallel processing is more commonly used in practice.

[Figure: data parallel processing]
As the figure shows, data parallel processing splits each batch of data into N parts (N is the number of GPUs, usually a power of 2).
Each GPU holds a copy of the model, and the gradients and loss are computed for each portion of the batch.
The gradients and loss are combined at the end of each iteration.
PyTorch implements data parallelism in two ways: a single-process, multithreaded approach and a multiprocess approach.
Because of Python's GIL (Global Interpreter Lock), the single-process, multithreaded approach yields limited performance gains, so the multiprocess approach is more common in practice.
Single-process, multithreaded approach (nn.DataParallel)
Simply wrap the model with nn.DataParallel; sample code:
# Check how many GPUs the machine has
if torch.cuda.device_count() > 1:
    print('The machine has', torch.cuda.device_count(), 'GPUs available.')
    # Wrap the model for data parallel processing
    model = nn.DataParallel(model)
# Move the model to the target device
model.to(device)
Multiprocess model acceleration with DDP (DistributedDataParallel) [recommended]

Distributed data parallel processing (DDP) can be used with multiple processes on a single machine or with multiple processes across multiple machines.
1. Initialize the process group with torch.distributed
2. Create the local model on its device with nn.Module.to()
3. Wrap the model with DDP from torch.nn.parallel
4. Spawn the worker processes with torch.multiprocessing

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def dist_training_loop(rank, world_size, dataloader, model, loss_fn, optimizer):
    # 1. Initialize the process group (the default env:// init method expects the
    #    MASTER_ADDR and MASTER_PORT environment variables to be set)
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    # 2. Create the local model on this process's device
    model = model.to(rank)
    # 3. Wrap the model with DDP
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = optimizer(ddp_model.parameters(), lr=0.001)

    n_epochs = 10  # illustrative value; set as needed
    for epoch in range(n_epochs):
        for input, labels in dataloader:
            # Move the data to the same device as this model replica
            input = input.to(rank)
            labels = labels.to(rank)
            optimizer.zero_grad()
            outputs = ddp_model(input)
            loss = loss_fn(outputs, labels)
            loss.backward()   # DDP all-reduces gradients across processes here
            optimizer.step()

    dist.destroy_process_group()

"""
DDP broadcasts the model states from the rank0 process to all
the other processes, so we don’t have to worry about the different 
processes having models with different initialized weights.
"""

if __name__ == "__main__":
    world_size = 2
    # 4. Run as main and call mp.spawn() to launch two processes; mp.spawn passes
    #    the rank as the first argument automatically. Note that dataloader, model,
    #    loss_fn, and optimizer must also be included in args to match the
    #    function signature above.
    mp.spawn(dist_training_loop,
             args=(world_size,),
             nprocs=world_size,
             join=True)

【The remaining content below will be fleshed out over the next two or three notes; this note ends here.】

Model parallel processing is often reserved for cases in which the model does not fit on a single GPU.
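As a quick preview, here is a minimal sketch of the idea (my own illustration, not the book's code): place different layers on different GPUs and move the intermediate activations between devices in forward(). The layer sizes and device indices below are placeholders and assume at least two GPUs are available.

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Hypothetical model-parallel split: half the layers on each GPU."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to('cuda:0')
        self.part2 = nn.Linear(512, 10).to('cuda:1')

    def forward(self, x):
        x = self.part1(x.to('cuda:0'))       # first half runs on GPU 0
        return self.part2(x.to('cuda:1'))    # activations move to GPU 1 for the second half

Since the output lives on 'cuda:1', the labels must be moved to the same device before computing the loss.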

Multiple machines, multiple GPUs

Optimization (performance)

-- Hyperparameter tuning
-- Quantization
-- Pruning
