TensorFlow 1.0 - C2 Guide - 8 Ext
1 TensorFlow architecture
Client
Distributed Master
Worker Services (one for each task)
Kernel Implementations
Client
Create a session, which sends the graph definition to the distributed master as a tf.GraphDef protocol buffer. When the client evaluates a node or nodes in the graph, the evaluation triggers a call to the distributed master to initiate computation.
In short, the client passes the graph definition to the master.
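A toy, pure-Python model of this handshake. The `Master` and `Client` classes below are illustrative stand-ins for TensorFlow's gRPC master service, not real API:

```python
# Toy model of the client/master handshake. Master and Client here are
# illustrative stand-ins, not TensorFlow's actual classes.

class Master:
    def create_session(self, graph_def):
        # The master receives the serialized graph (cf. tf.GraphDef).
        self.graph_def = graph_def
        return "session-0"

    def run_step(self, handle, fetches):
        # Evaluating nodes on the client side triggers computation here.
        return {node: f"<value of {node}>" for node in fetches}

class Client:
    def __init__(self, master, graph_def):
        self.master = master
        # Creating a session ships the graph definition to the master.
        self.handle = master.create_session(graph_def)

    def run(self, fetches):
        return self.master.run_step(self.handle, fetches)

graph_def = {"a": [], "b": [], "add": ["a", "b"]}   # node -> inputs
client = Client(Master(), graph_def)
result = client.run(["add"])   # evaluation triggers master-side computation
```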
Distributed Master
Prunes the graph to obtain the subgraph needed to evaluate the requested nodes
Partitions the graph into pieces, one for each participating device
Caches these graph pieces so they may be reused in subsequent steps
Where graph edges are cut by the partition, the distributed master inserts send and receive nodes to pass information between the distributed tasks (Figure 6).
The distributed master then ships the graph pieces to the distributed tasks.
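The prune-and-partition steps above can be sketched in pure Python, assuming a minimal graph representation (node → list of input nodes, plus a device assignment per node). All names are illustrative, not TensorFlow internals:

```python
# Sketch of the distributed master's graph handling (illustrative only).

def prune(graph, fetches):
    """Keep only the subgraph needed to evaluate the fetched nodes."""
    needed, stack = set(), list(fetches)
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(graph[node])     # walk backwards along inputs
    return {n: ins for n, ins in graph.items() if n in needed}

def partition(graph, device_of):
    """Split the graph per device, inserting Send/Recv node pairs
    wherever an edge is cut by the partition (cf. Figure 6)."""
    pieces = {d: {} for d in set(device_of.values())}
    for node, inputs in graph.items():
        dev, rewritten = device_of[node], []
        for inp in inputs:
            if device_of[inp] == dev:
                rewritten.append(inp)
            else:
                # Cut edge: Send on the producer's device,
                # matching Recv on the consumer's device.
                pieces[device_of[inp]][f"Send:{inp}->{dev}"] = [inp]
                recv = f"Recv:{inp}@{dev}"
                pieces[dev][recv] = []
                rewritten.append(recv)
        pieces[dev][node] = rewritten
    return pieces

graph = {"a": [], "b": [], "mul": ["a", "b"], "unused": ["a"]}
device_of = {"a": "cpu:0", "b": "gpu:0", "mul": "gpu:0", "unused": "cpu:0"}
sub = prune(graph, ["mul"])          # "unused" is pruned away
pieces = partition(sub, device_of)   # edge a->mul is cut: Send/Recv added
```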
Worker Service
Handles requests from the master
Schedules kernel execution for the operations in its local subgraph
Mediates direct communication between tasks
The worker service dispatches kernels to local devices and runs kernels in parallel when possible, for example by using multiple CPU cores or GPU streams.
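A minimal sketch of this dataflow-style dispatch, using a thread pool as the stand-in for CPU cores or GPU streams; each wave of kernels whose inputs are ready runs concurrently. This illustrates the scheduling idea only, not TensorFlow's actual executor:

```python
# Sketch of parallel kernel dispatch (illustrative, not TF's executor).
from concurrent.futures import ThreadPoolExecutor

def run_graph(kernels, inputs):
    """kernels: node -> callable; inputs: node -> list of input nodes.
    Submits every kernel whose inputs are done, one wave at a time,
    so independent kernels run in parallel on the pool."""
    done = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        while len(done) < len(kernels):
            ready = [n for n in kernels
                     if n not in done and all(i in done for i in inputs[n])]
            futs = {n: pool.submit(kernels[n], *[done[i] for i in inputs[n]])
                    for n in ready}
            for n, f in futs.items():
                done[n] = f.result()
    return done

results = run_graph(
    kernels={"a": lambda: 2, "b": lambda: 3,
             "mul": lambda x, y: x * y, "add": lambda x, y: x + y},
    inputs={"a": [], "b": [], "mul": ["a", "b"], "add": ["a", "mul"]},
)
# "a" and "b" run concurrently in the first wave; results["add"] == 8
```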
Send and Recv operations are specialized for different device types:
Transfers between local CPU and GPU devices use the cudaMemcpyAsync() API to overlap computation and data transfer.
Transfers between two local GPUs use peer-to-peer DMA, to avoid an expensive copy via the host CPU.
We also have preliminary support for NVIDIA's NCCL library for multi-GPU communication; see tf.contrib.nccl.
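Whatever the transport (cudaMemcpyAsync, peer-to-peer DMA, NCCL), the Send/Recv pair matches up through a rendezvous keyed by the cut edge. A pure-Python sketch of that key-matching protocol, with illustrative names (only the byte-moving mechanism would differ per device type):

```python
# Sketch of a rendezvous table: Send publishes a tensor under a key,
# the matching Recv retrieves it. Illustrative only.
import queue

class Rendezvous:
    def __init__(self):
        self._slots = {}

    def _slot(self, key):
        # One single-item slot per Send/Recv pair.
        return self._slots.setdefault(key, queue.Queue(maxsize=1))

    def send(self, key, tensor):
        self._slot(key).put(tensor)

    def recv(self, key, timeout=None):
        # Blocks until the matching Send has published the tensor.
        return self._slot(key).get(timeout=timeout)

r = Rendezvous()
key = "edge:a->mul;src=/cpu:0;dst=/gpu:0"   # hypothetical key format
r.send(key, [1.0, 2.0])
received = r.recv(key)   # [1.0, 2.0]
```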
Kernel Implementations
The runtime contains over 200 standard operations including mathematical, array manipulation, control flow, and state management operations.
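Conceptually, each operation is looked up in a registry that maps an (op, device type) pair to a concrete kernel. A toy registry to illustrate the idea (the decorator and naive kernels are hypothetical, not TensorFlow's actual REGISTER_KERNEL_BUILDER machinery):

```python
# Toy kernel registry: (op name, device type) -> implementation.
KERNELS = {}

def register_kernel(op, device):
    def wrap(fn):
        KERNELS[(op, device)] = fn
        return fn
    return wrap

@register_kernel("MatMul", "CPU")
def matmul_cpu(a, b):
    # Naive triple loop standing in for an optimized CPU kernel.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

@register_kernel("Add", "CPU")
def add_cpu(a, b):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]

# Dispatch: the runtime selects the kernel by op name and device type.
out = KERNELS[("MatMul", "CPU")]([[1, 2]], [[3], [4]])   # [[11]]
```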