sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt update
(1) Install the latest version
sudo apt install libnccl2 libnccl-dev
(2) Install a specific version that matches your CUDA version
sudo apt install libnccl2=2.4.8-1+cuda10.0 libnccl-dev=2.4.8-1+cuda10.0
git clone https://github.com/NVIDIA/nccl.git
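cd nccl
make -j12 src.build BUILDDIR=/home/yourname/nccl CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_86,code=sm_86"
(NVCC_GENCODE is optional. If it is omitted, the build targets all supported architectures by default. Setting it speeds up compilation and reduces the binary size. The values compute_86 and sm_86 must match your GPU's compute capability; see https://developer.nvidia.com/cuda-gpus)
- -j12: use 12 cores (parallel jobs); run nproc to see the total number of cores and adjust accordingly;
- BUILDDIR: where the build output is stored; the default is nccl/build. The root user can instead point it at /usr/local/nccl/;
- CUDA_HOME: the CUDA installation directory; the default is /usr/local/cuda, so it can be omitted. Add it if the build reports an error.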
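The NVCC_GENCODE string can be generated from the compute capability rather than typed by hand. The following is a minimal sketch and not part of the original post: gencode_for_cc is a hypothetical helper name, and the 8.6 in the example is only the value used in the make command here.

```shell
# Hypothetical helper: turn a compute capability such as "8.6" into the
# NVCC_GENCODE string expected by the NCCL build.
gencode_for_cc() {
    # Drop the dot: "8.6" -> "86"
    arch=$(printf '%s' "$1" | tr -d '.')
    printf '%s\n' "-gencode=arch=compute_${arch},code=sm_${arch}"
}

# Example (commented out so that sourcing this file does not start a build):
# make -j"$(nproc)" src.build NVCC_GENCODE="$(gencode_for_cc 8.6)"
```

Using nproc for the job count avoids hard-coding -j12 on machines with a different number of cores.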
vim ~/.bashrc
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/yourname/nccl/lib
export PATH=$PATH:/home/yourname/nccl/bin
source ~/.bashrc
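If you prefer not to edit ~/.bashrc by hand, the two export lines can be appended from the shell. This is a sketch under assumptions: append_once is a hypothetical helper, and it only guards against adding the exact same line twice.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (single quotes keep $LD_LIBRARY_PATH unexpanded in the file):
# append_once 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/yourname/nccl/lib' ~/.bashrc
# append_once 'export PATH=$PATH:/home/yourname/nccl/bin' ~/.bashrc
```

Running it repeatedly leaves ~/.bashrc unchanged after the first call, so the paths are not duplicated each time the setup is rerun.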
4. Verify that NCCL is installed correctly:
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make -j12 CUDA_HOME=/usr/local/cuda
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4
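In the command above, -b and -e set the smallest and largest message size, -f is the factor by which the size grows between steps, and -g is the number of GPUs to use. As a rough illustration, the sketch below lists the sizes such a sweep would step through. sweep_sizes is a hypothetical helper, it takes plain byte counts, and it assumes 256M means 256 * 1024 * 1024 bytes.

```shell
# Hypothetical helper: print the message sizes (in bytes) visited when
# starting at $1, multiplying by $3 each step, and stopping after $2.
sweep_sizes() {
    size="$1"
    max="$2"
    factor="$3"
    # A factor of 1 or less would never reach the maximum.
    if [ "$factor" -le 1 ]; then
        echo "factor must be greater than 1" >&2
        return 1
    fi
    while [ "$size" -le "$max" ]; do
        printf '%s\n' "$size"
        size=$((size * factor))
    done
}

# Example, mirroring -b 8 -e 256M -f 2:
# sweep_sizes 8 $((256 * 1024 * 1024)) 2
```

With these values the sweep doubles from 8 bytes up to 268435456 bytes, which is why the benchmark prints one result row per power of two in that range.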
- sudo apt-get install g++
- g++ --version (check the g++ version)
- ompi_info (or mpiexec --version or mpirun --version or mpicxx --showme:version)
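The checks listed above can be run in one pass. The following sketch only reports whether each command is on PATH; report_tools is a hypothetical helper name, and the list of tools in the example is taken from the bullets here.

```shell
# Hypothetical helper: report whether each named command is on PATH.
report_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s: found\n' "$tool"
        else
            printf '%s: missing\n' "$tool"
        fi
    done
}

# Example:
# report_tools g++ ompi_info mpiexec mpirun mpicxx
```

A tool reported as missing still needs to be installed, and the version commands above remain the way to see which version is present.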
pip install tensorflow-gpu -i https://pypi.tuna.tsinghua.edu.cn/simple
import tensorflow as tf
RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
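Solution 1:
- pip install numpy --upgrade (upgrades NumPy; mine was already the latest version, so this had no effect)
Solution 2:
- Uninstall NumPy and reinstall it. (After uninstalling, the reinstall reported that NumPy was already present in the environment, which means there had been two NumPy installations. Running import tensorflow as tf then failed with No module named 'numpy.core._multiarray_umath'; after upgrading NumPy, TensorFlow imported successfully.)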
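The two hexadecimal numbers in the RuntimeError are the NumPy C API version the module was compiled against and the one provided by the NumPy that was actually imported. The sketch below compares them to show which side is behind. diagnose_numpy_api is a hypothetical helper that only parses the message text; it does not inspect the Python environment.

```shell
# Hypothetical helper: read the two API versions out of the error message
# and say which side is older.
diagnose_numpy_api() {
    msg="$1"
    built=$(printf '%s\n' "$msg" | sed -n 's/.*API version \(0x[0-9a-fA-F]*\).*/\1/p')
    have=$(printf '%s\n' "$msg" | sed -n 's/.*numpy is \(0x[0-9a-fA-F]*\).*/\1/p')
    if [ -z "$built" ] || [ -z "$have" ]; then
        echo "unrecognized message"
        return 1
    fi
    # $(( )) accepts hexadecimal constants such as 0xe.
    if [ $(($built)) -gt $(($have)) ]; then
        echo "upgrade NumPy (imported API $have is older than $built)"
    elif [ $(($built)) -lt $(($have)) ]; then
        echo "module was built against an older API ($built) than NumPy provides ($have)"
    else
        echo "API versions match"
    fi
}

# Example:
# diagnose_numpy_api 'module compiled against API version 0xe but this version of numpy is 0xd'
```

For the message in this post the imported NumPy is the older side, which is consistent with a stale second copy shadowing the up-to-date one. Printing numpy.__version__ and numpy.__file__ from Python shows which copy is actually being imported.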
HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
import tensorflow as tf
import horovod.tensorflow as hvd
Original: https://blog.csdn.net/JNash/article/details/122909857
Author: JNash
Title: Notes on configuring NCCL and Horovod over the past two days (original)