Distributed Deep Learning Training, with FireCaffe as an Example

2020-03-09  MatrixOnEarth

This post was written in 2015; digging it out and rereading it recently, I find the landscape has not changed one bit.

Paper

Forrest N. Iandola et al., "FireCaffe: near-linear acceleration of deep neural network training on compute clusters", January 2016.

Problem statements from data scientists

Four key pain points summarized by Jeff Dean of Google [1]:

Problem analysis

The speed and scalability of distributed algorithms are almost always limited by the overhead of communication between servers; DNN training is no exception to this rule.
So the design focuses on improving communication, including:

Key take-aways

Parallelism Scheme: Model parallelism or Data Parallelism

Model parallelism

[Figure: model parallelism]
Each worker gets a subset of the model parameters, and the workers communicate by exchanging data gradients and activations. The quantity of data exchanged is:
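A sketch of that quantity, by analogy with the per-layer weight formula in the data-parallelism subsection below (the activation symbols here are my own shorthand, not necessarily the paper's notation): it is the total activation volume over the batch,
|D| = \sum_{L=1}^{\#layers} batch * ch_L * activationW_L * activationH_L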

Data parallelism

[Figure: data parallelism]

Each worker gets a subset of the batch, and the workers communicate by exchanging weight gradient updates \nabla W, where the data quantity of W (and \nabla W) is:
|W| = \sum_{L=1}^{\#layers} ch_L * numFilt_L * filterW_L * filterH_L

Convolution layers and fully connected layers have very different data-to-weight ratios, so they can use different parallelism schemes; a numeric comparison is sketched after the conclusion below.


[Figure: different models prefer different parallelism schemes]

So a basic conclusion is: convolution layers are well suited to data parallelism, and fc layers are well suited to model parallelism.
Furthermore, for more advanced CNNs such as GoogLeNet and ResNet, which are dominated by convolution layers, we can use data parallelism throughout, as this paper does.
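A minimal sketch of the comparison behind this conclusion, assuming an illustrative AlexNet-like layer list (the layer shapes below are my own placeholder numbers, not taken from the paper). It computes per-layer weight count and per-batch activation count, showing that conv layers are weight-light but activation-heavy, while fc layers are the opposite:

```python
# Sketch: compare per-layer weight volume vs. activation volume for conv and fc layers.
# Layer shapes are illustrative only (roughly AlexNet-like), not figures from the FireCaffe paper.

layers = [
    # (name, type, in_ch, num_filters, filter_h, filter_w, out_h, out_w)
    ("conv1", "conv", 3,           96,   11, 11, 55, 55),
    ("conv5", "conv", 192,         256,  3,  3,  13, 13),
    ("fc6",   "fc",   256 * 6 * 6, 4096, 1,  1,  1,  1),
    ("fc7",   "fc",   4096,        4096, 1,  1,  1,  1),
]

batch = 256

for name, ltype, in_ch, num_filt, fh, fw, out_h, out_w in layers:
    weights = in_ch * num_filt * fh * fw            # |W_L| = ch_L * numFilt_L * filterH_L * filterW_L
    activations = batch * num_filt * out_h * out_w  # output activation volume for the whole batch
    print(f"{name:5s} ({ltype:4s}): weights={weights:>12,d}  activations={activations:>14,d}  "
          f"act/weight ratio={activations / weights:10.2f}")
```

Running it, the conv layers show activation/weight ratios in the thousands (cheap to exchange weight gradients, so data parallelism fits), while the fc layers show ratios well below one (cheap to exchange activations, so model parallelism fits).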

Gradient Aggregation Scheme: Parameter Server or Reduction Tree

The picture below shows how a parameter server and a reduction tree work in data parallelism.


[Figure: gradient aggregation schemes]

Parameter Server

Parameter communication time as a function of the number of workers p in the parameter server scheme:
param\_server\_communication\_time=\dfrac{|\nabla W| * p}{BW}
The communication time scales linearly as we increase the number of workers, so a single parameter server becomes the scalability bottleneck.
Microsoft Adam and Google DistBelief relieve this issue by defining a pool of nodes that collectively behave as a parameter server. The bigger the parameter server hierarchy gets, the more it looks like a reduction tree.
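A minimal sketch that plugs illustrative numbers into this cost model (the gradient size and link bandwidth below are my own assumptions, not measurements from the paper), just to show the linear growth:

```python
# Sketch: parameter-server communication time model, t = |grad_W| * p / BW.
# Numbers are illustrative assumptions, not measurements from the paper.

GRAD_BYTES = 250e6   # ~250 MB of gradients (e.g. an AlexNet-scale model in fp32)
BW = 10e9 / 8        # 10 Gbit/s link, expressed in bytes per second

for p in [1, 2, 4, 8, 16, 32, 64, 128]:
    t = GRAD_BYTES * p / BW
    print(f"workers={p:4d}  parameter-server aggregation time ~ {t:6.2f} s per iteration")
```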

Reduction Tree

The idea is the same as allreduce in the message passing model. Parameter communication time as a function of the number of workers p in the reduction tree scheme:
t=\dfrac{|\nabla W| * 2\log_{2}(p)}{BW}
It scales logarithmically with the number of workers.

[Figure: reduction tree scalability]
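A minimal sketch of the reduction-tree idea, assuming in-memory "workers" instead of real network nodes: gradients are reduced up a binary tree and the sum is broadcast back down, so the number of communication rounds grows as 2*log2(p) rather than p.

```python
import math
import numpy as np

# Sketch: binary-tree allreduce of gradients over p simulated workers.
# Each "send" here is an in-memory copy; on a real cluster each level of the
# tree is one round of network communication, hence ~2*log2(p) rounds in total.

def tree_allreduce(grads):
    p = len(grads)
    assert p & (p - 1) == 0, "sketch assumes p is a power of two"
    bufs = [g.copy() for g in grads]
    rounds = 0

    # Reduce phase: at each level, the partner at +stride sends its partial sum down.
    stride = 1
    while stride < p:
        for i in range(0, p, 2 * stride):
            bufs[i] += bufs[i + stride]
        stride *= 2
        rounds += 1

    # Broadcast phase: the root (worker 0) sends the full sum back down the tree.
    stride = p // 2
    while stride >= 1:
        for i in range(0, p, 2 * stride):
            bufs[i + stride] = bufs[i].copy()
        stride //= 2
        rounds += 1

    return bufs, rounds

p = 8
grads = [np.full(4, fill_value=float(i)) for i in range(p)]  # toy per-worker gradients
reduced, rounds = tree_allreduce(grads)
print("summed gradient on every worker:", reduced[0])
print(f"communication rounds: {rounds} (= 2*log2({p}) = {2 * int(math.log2(p))})")
```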

Batch size selection

A larger batch size leads to less frequent communication and therefore enables more scalability in a distributed setting. But with a larger batch size, we need to identify a suitable hyperparameter setting to maintain the speed and accuracy of DNN training.
The hyperparameters include:

Learning rate update rule:
lr = lr_{0}(1 - \dfrac{iter}{max\_iter})^{\alpha}, \alpha = 0.5
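A minimal sketch of this polynomial-decay rule (the base rate and iteration count below are placeholder values, not the paper's settings):

```python
def poly_decay_lr(lr0, it, max_iter, alpha=0.5):
    """Polynomial learning-rate decay: lr = lr0 * (1 - it/max_iter) ** alpha."""
    return lr0 * (1.0 - it / max_iter) ** alpha

# Placeholder values; the paper tunes lr0 together with the (larger) batch size.
lr0, max_iter = 0.01, 100_000
for it in [0, 25_000, 50_000, 75_000, 99_999]:
    print(f"iter {it:6d}: lr = {poly_decay_lr(lr0, it, max_iter):.6f}")
```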
I will write another article on how to choose these hyperparameters according to the batch size.

Results

Final results on a GPU cluster with GoogLeNet.


[Figure: results]

Further thoughts

References

  1. https://www.slideshare.net/AIFrontiers/jeff-dean-trends-and-developments-in-deep-learning-research