Speeding Up Network Training with PyTorch Lightning

2019-08-23  顾北向南

Reposted from: https://mp.weixin.qq.com/s/WNRz8D9FOlZqTdcjm1usjw
PyTorch Lightning: https://github.com/williamFalcon/pytorch-lightning/projects
User documentation: https://williamfalcon.github.io/pytorch-lightning/

1. Introduction

2. DataLoader
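A minimal sketch of the pattern this tip refers to: wrap the data in a Dataset and iterate it through a DataLoader instead of feeding tensors by hand. The dataset class and tensor shapes below are illustrative, not from the original post:

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    # hypothetical in-memory dataset, used only for illustration
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

dataset = MyDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for x, y in loader:
    ...  # training step goes here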

3. Number of workers in DataLoaders

# slow
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# fast (use 10 workers)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=10)
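A common starting point is to set num_workers close to the number of CPU cores on the machine and tune from there; too many workers adds overhead and can slow loading back down.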

4. Batch size
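A hedged sketch of the usual advice here: once training runs on a GPU (and especially with 16-bit precision), raise the batch size until memory is nearly full, and expect to re-tune the learning rate when the batch size changes. The numbers below are illustrative:

# small batch that made sense on CPU
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=10)

# on a GPU with spare memory, a much larger batch usually trains faster
# (pick the largest value that still fits in GPU memory)
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=10)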

5. Accumulating gradients

# clear last step
optimizer.zero_grad()

# 16 accumulated gradient steps
scaled_loss = 0
for accumulated_step_i in range(16):
    out = model.forward()
    loss = some_loss(out, y)
    loss.backward()

    scaled_loss += loss.item()

# update weights after 16 accumulated steps. effective batch = batch_size * 16
optimizer.step()

# loss is now scaled up by the number of accumulated batches
actual_loss = scaled_loss / 16

# in Lightning, accumulation is a single Trainer flag
trainer = Trainer(accumulate_grad_batches=16)
trainer.fit(model)
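For example, with a per-step batch size of 32 and accumulate_grad_batches=16, the effective batch size is 32 × 16 = 512, while peak memory stays at the level of a single 32-sample batch.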

https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/?source=post_page---------------------------#accumulated-gradients

6. Retained computation graphs

losses = []
...
losses.append(loss)
print(f'current loss: {torch.mean(torch.stack(losses))}')

# bad: the stored tensor still carries its whole computation graph
losses.append(loss)

# good: .item() stores just the python number, so the graph can be freed
losses.append(loss.item())
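The same rule applies to anything kept across steps (running metrics, logging buffers): call .item() or .detach() before storing, otherwise every stored loss keeps its computation graph alive and memory grows each iteration.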

https://github.com/williamFalcon/pytorch-lightning/blob/master/pytorch_lightning/models/trainer.py?source=post_page---------------------------#L767-L768

7. Moving to a single GPU

# put model on GPU
model.cuda(0)

# put data on GPU (cuda on a variable returns a cuda copy)
x = x.cuda(0)

# runs on GPU now
model(x)

https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/?source=post_page---------------------------#single-gpu

# expensive
x = x.cuda(0)

# very expensive
x = x.cpu()
x = x.cuda(0)

# really bad idea: stops all the GPUs until they all catch up
torch.cuda.empty_cache()
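Not from the original post, but a common companion trick for making the unavoidable CPU-to-GPU copies cheaper: pinned host memory plus asynchronous transfers. A rough sketch:

# pinned (page-locked) host memory allows async copies to the GPU
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=10, pin_memory=True)

for x, y in loader:
    # non_blocking=True overlaps the transfer with computation
    # (only effective when the source tensor is in pinned memory)
    x = x.cuda(0, non_blocking=True)
    y = y.cuda(0, non_blocking=True)
    out = model(x)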

8. 16-bit mixed-precision training

# enable 16-bit on the model and the optimizer (NVIDIA apex)
model, optimizers = amp.initialize(model, optimizers, opt_level='O2')

# when doing .backward, let amp do it so it can scale the loss
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

# in Lightning, 16-bit is enabled through trainer flags
trainer = Trainer(amp_level='O2', use_amp=True)
trainer.fit(model)
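The snippet above relies on NVIDIA apex; newer PyTorch releases also ship a native mixed-precision API (torch.cuda.amp) that achieves the same thing without an extra dependency. A rough sketch, assuming the same loop variables as above:

scaler = torch.cuda.amp.GradScaler()

for x, y in loader:
    optimizer.zero_grad()
    # run the forward pass in float16 where it is numerically safe
    with torch.cuda.amp.autocast():
        out = model(x.cuda(0))
        loss = some_loss(out, y.cuda(0))
    # scale the loss to avoid float16 gradient underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()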

9. Moving to multiple GPUs

# copy the model onto each GPU and give a fourth of the batch to each
model = DataParallel(model, device_ids=[0, 1, 2, 3])

# the batch is split across the 4 GPUs; outputs are gathered back onto GPU 0
out = model(x.cuda(0))

# ask lightning to use 4 GPUs for training
trainer = Trainer(gpus=[0, 1, 2, 3])
trainer.fit(model)
# each model is so big we can't fit both in memory on a single GPU
encoder_rnn.cuda(0)
decoder_rnn.cuda(1)

# run input through encoder on GPU 0
out = encoder_rnn(x.cuda(0))

# run output through decoder on the next GPU
out = decoder_rnn(out.cuda(1))

# normally we want to bring all outputs back to GPU 0
out = out.cuda(0)
class MyModule(LightningModule):

    def __init__(self):
        self.encoder = RNN(...)
        self.decoder = RNN(...)

    def forward(self, x):
        # models won't be moved after the first forward because
        # they are already on the correct GPUs
        self.encoder.cuda(0)
        self.decoder.cuda(1)

        out = self.encoder(x)
        out = self.decoder(out.cuda(1))
        return out

# don't pass GPUs to the trainer
model = MyModule()
trainer = Trainer()
trainer.fit(model)
# change these lines
self.encoder = RNN(...)
self.decoder = RNN(...)

# to these
# now each RNN is based on a different gpu set
self.encoder = DataParallel(self.encoder, device_ids=[0, 1, 2, 3])
self.decoder = DataParallel(self.decoder, device_ids=[4, 5, 6, 7])

# in forward...
out = self.encoder(x.cuda(0))

# notice inputs go to the first gpu in device_ids
out = self.decoder(out.cuda(4))  # <--- the 4 here

10. Moving to multiple GPU nodes (8+ GPUs)

def tng_dataloader():
    d = MNIST()

    # 4: add a distributed sampler
    # the sampler sends a portion of the training data to each machine
    dist_sampler = DistributedSampler(d)
    dataloader = DataLoader(d, shuffle=False, sampler=dist_sampler)

def main_process_entrypoint(gpu_nb):
    # 2: set up connections between all gpus across all machines
    # all gpus connect to a single GPU "root"
    # the default uses env://
    world = nb_gpus * nb_nodes
    dist.init_process_group("nccl", rank=gpu_nb, world_size=world)

    # 3: wrap the model in DDP
    torch.cuda.set_device(gpu_nb)
    model.cuda(gpu_nb)
    model = DistributedDataParallel(model, device_ids=[gpu_nb])

    # train your model now...

if __name__ == '__main__':
    # 1: spawn one process per GPU
    # your cluster will call main for each machine
    mp.spawn(main_process_entrypoint, nprocs=8)

# in Lightning: 8 GPUs per node across 128 nodes
trainer = Trainer(nb_gpu_nodes=128, gpus=[0, 1, 2, 3, 4, 5, 6, 7])
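That single Trainer line stands in for steps 1–4 of the manual version: Lightning spawns the processes, initializes the process group, wraps the model in DistributedDataParallel, and adds the DistributedSampler for you.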

示例:https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/multi_node_cluster_template.py?source=post_page---------------------------#L103-L134

11. Faster multi-GPU single-node training

# train on 4 gpus on the same machine, MUCH faster than DataParallel
trainer = Trainer(distributed_backend='ddp', gpus=[0, 1, 2, 3])
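DDP is faster than DataParallel because each GPU runs in its own process with its own copy of the model, so only gradients need to be synchronized; DataParallel instead scatters inputs and gathers outputs through a single process on every batch.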