Horovod Distributed Training Framework
Installation
- Open MPI:
Download openmpi-4.0.0.tar.gz from the official Open MPI website, then build and install it:
tar -xvzf openmpi-4.0.0.tar.gz
cd openmpi-4.0.0
./configure --prefix="/usr/local/openmpi"
make -j 8
sudo make install
Add the environment variables to ~/.bashrc:
export PATH="$PATH:/usr/local/cuda/bin:/usr/local/openmpi/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/openmpi/lib/"
source ~/.bashrc
Alternatively, open a new terminal for the changes to take effect.
- NCCL 2
Follow the official guide to download the .deb file, then install it:
sudo dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt update
sudo apt install libnccl2=2.4.8-1+cuda10.0 libnccl-dev=2.4.8-1+cuda10.0
- g++-4.8.5
Switch g++ to the specified version:
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-4.8
cd /usr/bin
sudo rm gcc g++
sudo ln -s gcc-4.8 gcc
sudo ln -s g++-4.8 g++
Check the version: gcc -v
- Horovod
Note: CUDA's bin and lib64 directories were already added to the environment variables in the Open MPI step.
HOROVOD_CUDA_HOME=/usr/local/cuda HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_MPI=1 HOROVOD_WITH_TENSORFLOW=1 pip install --ignore-installed --no-cache-dir horovod
Check the build: horovodrun --check-build
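As a quick sanity check after the steps above, a small script can confirm that the tools are visible on PATH and that the library directories made it into LD_LIBRARY_PATH. This is only a sketch: the tool names and directories are taken from the commands in this section, while the helper names `find_tools` and `missing_lib_dirs` are made up for illustration.

```python
import os
import shutil

# Tools the steps above are expected to put on PATH.
REQUIRED_TOOLS = ["mpirun", "nvcc", "gcc", "g++", "horovodrun"]

# Directories added to LD_LIBRARY_PATH in the .bashrc step.
EXPECTED_LIB_DIRS = ["/usr/local/cuda/lib64", "/usr/local/openmpi/lib"]


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_lib_dirs(expected=EXPECTED_LIB_DIRS, env=None):
    """Return the expected directories that are absent from LD_LIBRARY_PATH."""
    env = os.environ if env is None else env
    # Ignore trailing slashes so "/usr/local/openmpi/lib/" matches the expected entry.
    entries = [p.rstrip("/") for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    return [d for d in expected if d.rstrip("/") not in entries]


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool:12s} {path or 'NOT FOUND'}")
    for d in missing_lib_dirs():
        print(f"LD_LIBRARY_PATH is missing {d}")
```

Any tool reported as NOT FOUND usually means the matching export line is not active in the current shell. The script does not replace horovodrun --check-build, which reports what Horovod itself was built with.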
Instructions
See the official Horovod documentation for usage.
examples:
tensorflow_word2vec.py, tensorflow2_mnist.py
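For orientation before reading those examples, here is a minimal sketch of the usual Horovod pattern with TensorFlow 2 Keras: initialize, pin one GPU per process, wrap the optimizer, and broadcast the initial weights from rank 0. It is not taken from the official examples. The model, the synthetic data shapes, and the helper `scaled_learning_rate` are illustrative assumptions, and it assumes a TensorFlow/Keras version compatible with the installed Horovod (older TensorFlow 2.0/2.1 releases may also need experimental_run_tf_function=False in compile).

```python
import importlib.util


def scaled_learning_rate(base_lr, num_workers):
    """Scale the learning rate linearly with the number of workers."""
    return base_lr * num_workers


def main():
    # Imported here so the file still loads on machines without TensorFlow/Horovod.
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one process per GPU

    # Pin each process to a single GPU, chosen by its local rank.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Synthetic stand-in data (hypothetical shapes), so nothing is downloaded.
    x = np.random.rand(1024, 20).astype("float32")
    y = np.random.randint(0, 2, size=(1024,))

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])

    opt = tf.keras.optimizers.SGD(scaled_learning_rate(0.01, hvd.size()))
    opt = hvd.DistributedOptimizer(opt)  # averages gradients across workers
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Broadcast initial weights from rank 0 so all workers start identically.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x, y, batch_size=64, epochs=1, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)


if __name__ == "__main__":
    if importlib.util.find_spec("horovod") is None:
        print("Horovod is not installed; skipping the training sketch.")
    else:
        main()
```

Saved under a name such as train.py (hypothetical), it would be launched on a single machine with four GPUs as: horovodrun -np 4 python train.py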