Distributed Training | Horovod + Keras (1)

2018-12-27  reallocing

Background

The overall structure of a Keras training script that uses Horovod for distributed training:

import tensorflow as tf
import keras
from keras import backend as K
import horovod.keras as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Build model...
loss = ...
model = net()
model.summary()

# Horovod: scale the learning rate by the number of workers.
opt = keras.optimizers.Adam(lr=1.0 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)


model.compile(optimizer=opt, loss=loss, metrics=['mse'])


callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),

    # Horovod: average metrics among workers at the end of every epoch.
    #
    # Note: This callback must be in the list before the ReduceLROnPlateau,
    # TensorBoard or other metrics-based callbacks.
    hvd.callbacks.MetricAverageCallback(),

    # Horovod: using `lr = 1.0 * hvd.size()` from the very beginning leads to worse final
    # accuracy. Scale the learning rate `lr = 1.0` ---> `lr = 1.0 * hvd.size()` during
    # the first five epochs. See https://arxiv.org/abs/1706.02677 for details.
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1),

    # Reduce the learning rate if training plateaues.
    keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1),
]

# Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))


# Alternatively, use model.fit_generator()
model.fit(train_shuffled, train_mask_shuffled, batch_size=10, epochs=200, verbose=1, shuffle=True,
          validation_data=(val_shuffled, val_mask_shuffled), callbacks=callbacks)
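If you switch to `model.fit_generator()`, each worker should only run its share of the batches in an epoch, otherwise every worker replays the full dataset and an "epoch" becomes `hvd.size()` times larger. A common pattern (a sketch; the function name and parameters are my own, not from the original post) is to divide the total step count by the number of workers:

```python
import math

def steps_per_worker(num_samples, batch_size, num_workers):
    """Steps one Horovod worker should run per epoch.

    Total batches in one epoch, split evenly across workers; use
    ceil so no batches are silently dropped when the division is
    uneven. Pass the result as `steps_per_epoch` to fit_generator,
    with `num_workers = hvd.size()`.
    """
    return int(math.ceil(num_samples / batch_size / num_workers))

# e.g. 1000 samples, batch size 10, 2 workers -> 50 steps each
steps = steps_per_worker(1000, 10, 2)
```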

Launch scripts:

Single machine, multiple GPUs:

#!/usr/bin/env bash
mpirun -np 2 \
    -H localhost:2 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python horovod_keras.py

Multiple machines, multiple GPUs:

#!/usr/bin/env bash
mpirun -np 3 \
    -H node1:2,node2:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -x NCCL_SOCKET_IFNAME=^lo,docker0,cni0,virbr0 \
    -mca pml ob1 -mca btl ^openib \
    -mca btl_tcp_if_exclude lo,docker0,cni0,virbr0 \
    python horovod_keras.py

Note that `-H node1:2,node2:1` lists each host and its number of slots (processes, one per GPU), so the slot counts must add up to the number passed to `-np` — here 2 + 1 = 3.

To be improved:

How the data should be partitioned across workers during distributed training.
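The post leaves data partitioning as an open item. One simple scheme (a sketch of my own, not from the original) is to give each worker every `size`-th sample, offset by its rank, so the shards are disjoint and together cover the whole dataset:

```python
import numpy as np

def shard_for_rank(features, labels, rank, size):
    """Return the slice of the dataset owned by one Horovod worker.

    Worker `rank` takes every `size`-th sample starting at index
    `rank`. In the training script you would call this with
    Horovod's values: shard_for_rank(x, y, hvd.rank(), hvd.size()).
    """
    return features[rank::size], labels[rank::size]

# Tiny demonstration with 10 samples split across 2 workers.
x = np.arange(10).reshape(10, 1)
y = np.arange(10)
x0, y0 = shard_for_rank(x, y, rank=0, size=2)
x1, y1 = shard_for_rank(x, y, rank=1, size=2)
```

Shuffling before sharding (with the same seed on every worker) keeps the shards random but still disjoint.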
