TensorFlow Training - Distribute

2019-10-08  左心Chris

https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md

http://sharkdtu.com/posts/dist-tf-evolution.html

http://download.tensorflow.org/paper/whitepaper2015.pdf

https://segmentfault.com/a/1190000008376957

PS-Worker: in-graph replication / between-graph replication
All-Reduce: naive reduce-and-broadcast -> a "half-PS" worker that does the gather and redistribution -> butterfly -> ring all-reduce
MirroredStrategy
MultiWorkerMirroredStrategy
ParameterServerStrategy

Comparison of All-Reduce and PS-Worker

https://zhuanlan.zhihu.com/p/50116885

1 Different distribution strategies

All reduce

Add up the parameters (gradients) from all workers, then synchronize the result to every machine.
https://zhuanlan.zhihu.com/p/79030485
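
A toy NumPy sketch (not TensorFlow API) of what all-reduce computes: every worker ends up holding the sum of all workers' local gradients. Ring all-reduce reaches the same result, but passes chunks around a ring so the communication load is spread evenly across workers.

```python
import numpy as np

# Toy illustration of the all-reduce result: each "worker" holds a local
# gradient; after all-reduce every worker holds the element-wise sum.
local_grads = [
    np.array([1.0, 2.0]),  # worker 0
    np.array([3.0, 4.0]),  # worker 1
    np.array([5.0, 6.0]),  # worker 2
]

reduced = np.sum(local_grads, axis=0)                # reduce: [9., 12.]
all_reduced = [reduced.copy() for _ in local_grads]  # broadcast to every worker

print(all_reduced)  # every worker now sees [9., 12.]
```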

MirroredStrategy

supports synchronous distributed training on multiple GPUs on one machine
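
A minimal Keras sketch of MirroredStrategy; the model shape and the random data are placeholders.

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables created under the scope are mirrored on every GPU;
# gradients are combined with all-reduce at each step.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

x = np.random.rand(256, 10).astype("float32")  # placeholder data
y = np.random.rand(256, 1).astype("float32")
model.fit(x, y, batch_size=32, epochs=1)
```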

MultiWorkerMirroredStrategy

It implements synchronous distributed training across multiple workers, each with potentially multiple GPUs
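
A minimal sketch assuming a two-machine cluster with placeholder host:port addresses; every worker runs the same script with its own index in TF_CONFIG. In 2019-era releases the class sits under tf.distribute.experimental; newer releases also expose tf.distribute.MultiWorkerMirroredStrategy.

```python
import json
import os
import tensorflow as tf

# Placeholder cluster description; every worker runs this same script
# with its own "index" (0 on the first machine, 1 on the second).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
})

# tf.distribute.MultiWorkerMirroredStrategy in newer releases.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")
# Calling model.fit on each worker then runs synchronous training,
# all-reducing gradients across the workers every step.
```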

ParameterServerStrategy

supports parameter server training on multiple machines. In this setup, some machines are designated as workers and some as parameter servers. Each variable of the model is placed on one parameter server. Computation is replicated across all GPUs of all the workers.
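
A minimal sketch of the 2019-era Estimator-based usage, with a placeholder cluster (one chief, two workers, one ps) and placeholder names my_model_fn / train_spec / eval_spec; newer TF versions pair this strategy with a ClusterCoordinator instead.

```python
import json
import os
import tensorflow as tf

# Placeholder cluster: variables live on the ps task, computation on the workers.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["host0:2222"],
        "worker": ["host1:2222", "host2:2222"],
        "ps": ["host3:2222"],
    },
    "task": {"type": "worker", "index": 0},  # differs per process
})

strategy = tf.distribute.experimental.ParameterServerStrategy()

# In the Estimator workflow the strategy is passed in through RunConfig.
config = tf.estimator.RunConfig(train_distribute=strategy)
# estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=config)
# tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```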

2 Evolution

http://sharkdtu.com/posts/dist-tf-evolution.html

1 Basic components

client/master/worker
A server (host:port) corresponds one-to-one with a task; a cluster is made up of servers, and a group of tasks with the same role is called a job. Each server exposes two services, a master service and a worker service. A client connects through a session to the master service of any server in the cluster, and that master partitions the work and dispatches it to the tasks.
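
A minimal TF 1.x sketch of these components, with placeholder addresses; in a real deployment each host runs its own server process.

```python
import tensorflow as tf  # TF 1.x API; under TF 2.x use tf.compat.v1

# One ps job with one task and one worker job with two tasks (placeholder hosts).
cluster = tf.train.ClusterSpec({
    "ps": ["host0:2222"],
    "worker": ["host1:2222", "host2:2222"],
})

# Each process starts exactly one server, identified by (job_name, task_index);
# the server exposes both a master service and a worker service.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# The client connects via a session to any server's master service, which
# partitions the graph and dispatches the pieces to the workers.
with tf.Session(server.target) as sess:
    pass  # build a graph here and sess.run(...) it through the master
```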

2 PS-based distributed TensorFlow programming model

The set of parameter-server tasks forms the ps job
The set of worker tasks forms the worker job

Low-level distributed programming model
High-level distributed programming model

The high-level model uses the Estimator and Dataset high-level APIs; a sketch of the low-level model follows below.
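
A minimal sketch of the low-level (between-graph) PS model with placeholder addresses: variables are pinned to the ps tasks with replica_device_setter while the compute ops stay on the local worker; each worker runs this with its own task index, and each ps process just joins its server.

```python
import tensorflow as tf  # TF 1.x API; under TF 2.x use tf.compat.v1

cluster = tf.train.ClusterSpec({
    "ps": ["host0:2222"],
    "worker": ["host1:2222", "host2:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)
# A ps process would instead run: tf.train.Server(cluster, "ps", 0).join()

# replica_device_setter places variables on the ps tasks (round-robin) and
# keeps the compute ops on this worker -- the between-graph PS pattern.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    x = tf.placeholder(tf.float32, shape=[None, 10])
    w = tf.get_variable("w", shape=[10, 1])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)

# MonitoredTrainingSession handles variable initialization and chief duties.
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=True) as sess:
    pass  # sess.run(train_op, feed_dict={x: batch}) in a training loop
```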

3 All-Reduce-based distributed TensorFlow architecture
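
A minimal sketch of the all-reduce primitive exposed by tf.distribute, using MirroredStrategy locally for illustration; strategy.run was named experimental_run_v2 in the earliest 2.x releases.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

def per_replica_value():
    # Each replica contributes its replica id + 1 as a stand-in "gradient".
    ctx = tf.distribute.get_replica_context()
    return tf.cast(ctx.replica_id_in_sync_group + 1, tf.float32)

per_replica = strategy.run(per_replica_value)
# All-reduce: sum the per-replica values into one result visible everywhere.
total = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)
print(total)  # with N replicas this is 1 + 2 + ... + N
```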
