Notes on the pitfalls I hit while building tensorflow-gpu from source

Without further ado, here is the combination that finally worked: OS => CentOS 7, CUDA => 10.0, cuDNN => 7.5, NCCL => built from source, TensorFlow => latest version, built from source

First attempt: CUDA => 10.1, cuDNN => 7.5, NCCL => 2.4.2

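1. CUDA ships as a *.run installer. Run it directly with sh ./*.run and follow the prompts. The default path /usr/local/cuda is usually the best choice because it simplifies the later steps.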

Set up the environment by appending the following to the end of /etc/profile:

export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
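
To double-check that the two entries took effect after logging in again (or running source /etc/profile), a small sketch along these lines may help. The helper name missing_entries is made up for illustration, and the directories simply mirror the default /usr/local/cuda install path assumed above.

```python
import os


def missing_entries(value, required):
    """Return the required directories absent from a colon-separated path list."""
    present = {os.path.normpath(p) for p in value.split(os.pathsep) if p}
    return [d for d in required if os.path.normpath(d) not in present]


if __name__ == "__main__":
    # Assumption: CUDA lives in the default location /usr/local/cuda.
    print("PATH missing:",
          missing_entries(os.environ.get("PATH", ""), ["/usr/local/cuda/bin"]))
    print("LD_LIBRARY_PATH missing:",
          missing_entries(os.environ.get("LD_LIBRARY_PATH", ""), ["/usr/local/cuda/lib64"]))
```

An empty list means the directory is already present; a mistyped entry such as /usr/local//lib64 would be reported as missing.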

2. The cuDNN archive unpacks into a folder named cuda. Copy the header and the library files into the matching CUDA directories:

sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64

Make the copied files readable by all users:

sudo chmod a+r /usr/local/cuda/include/cudnn.h 
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*

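Refresh the symlinks. Run this inside the library directory; the versioned file name must match the libcudnn.so.7.x.y file that your cuDNN archive actually ships:

cd /usr/local/cuda/lib64
sudo ln -sf libcudnn.so.7.0.5 libcudnn.so.7
sudo ln -sf libcudnn.so.7 libcudnn.so
sudo ldconfig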
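
A dangling link is easy to create here if the versioned file name is mistyped. As a rough sketch (the helper name resolves_to_file is hypothetical, and the directory is the default install path assumed above), the chain can be followed to confirm it ends at a real file:

```python
import os


def resolves_to_file(link_path):
    """Return the final target of a symlink chain, or None if the chain is broken."""
    target = os.path.realpath(link_path)
    return target if os.path.isfile(target) else None


if __name__ == "__main__":
    # Assumption: cuDNN was copied into the default /usr/local/cuda/lib64.
    for name in ("libcudnn.so", "libcudnn.so.7"):
        path = os.path.join("/usr/local/cuda/lib64", name)
        print(path, "->", resolves_to_file(path))
```

A result of None for either name means the link points at a file that does not exist.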

Check that nvcc works:

nvcc --version
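
Since the rest of this post hinges on which CUDA release is actually active, it can be handy to extract the version programmatically. The sketch below assumes the usual "release X.Y" wording in the nvcc output; parse_nvcc_release is an illustrative name, not part of any tool.

```python
import re
import subprocess


def parse_nvcc_release(output):
    """Extract (major, minor) from `nvcc --version` output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


if __name__ == "__main__":
    try:
        out = subprocess.run(["nvcc", "--version"],
                             capture_output=True, text=True).stdout
    except FileNotFoundError:
        out = ""  # nvcc is not on PATH
    print(parse_nvcc_release(out))
```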

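3. Install NCCL

The official site currently offers only *.rpm packages. I could not find the deb packages mentioned online, so I could not test whether they work, and installed from the rpm instead: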

rpm -ivh nccl*.rpm

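This step only unpacks files, however: they land under /var/nccl*, where there are three more rpm files. Install each of them with rpm in turn.

4. Install bazel

Building TensorFlow requires Google's bazel. The tutorials I found online say to download bazel-0.24.1-dist.zip, unpack it, and compile: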

./compile.sh

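This failed with an error saying that cmake needs to be installed (see step 5 below).

The build then failed with another error (I no longer remember which one), and searching turned up nothing. I downloaded the bazel-0.24.1-installer-linux-x86_64.sh installer instead and ran it directly, which worked!

5. Install cmake

Download a cmake release newer than 3.4, unpack it, then build and install: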

./configure
gmake
make install

Set the environment variable (this assumes cmake ended up under /usr/local/cmake):

PATH=/usr/local/cmake/bin:$PATH
export PATH

6. Build TensorFlow

Follow the configure prompts to choose paths and optional features:

Please specify the location of python. [Default is /usr/bin/python]: 
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: n
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
Do you wish to build TensorFlow with GDR support? [y/N]: N
Do you wish to build TensorFlow with VERBS support? [y/N]: N
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
Do you wish to build TensorFlow with CUDA support? [y/N]: Y
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 10.0]:10.1
Please specify the location where CUDA 10.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.1]:  
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
Please specify the NCCL version you want to use. [Leave empty to default to NCCL 2]: 2.4.2
Please specify the location where NCCL 2 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]: 
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1] 
Do you want to use clang as CUDA compiler? [y/N]: N
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc
Do you wish to build TensorFlow with MPI support? [y/N]: N
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:N

Run the build command:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

It fails with:

Cuda Configuration Error: No library found under: /usr/local/cuda-10.1/lib64/libcublas.so.10.1, /usr/local/cuda-10.1/lib64/stubs/libcublas.so.10.1, /usr/local/cuda-10.1/lib/powerpc64le-linux-gnu/libcublas.so.10.1, /usr/local/cuda-10.1/lib/x86_64-linux-gnu/libcublas.so.10.1, /usr/local/cuda-10.1/lib/x64/libcublas.so.10.1, /usr/local/cuda-10.1/lib/libcublas.so.10.1, /usr/local/cuda-10.1/libcublas.so.10.1
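
The error lists every location the build probed for libcublas.so.10.1. To see where the library really is, a recursive search such as the sketch below can be used. The function name find_library and the search roots are assumptions for illustration; adjust the roots to your system.

```python
import glob
import os


def find_library(roots, pattern):
    """Recursively collect files under each root whose name matches the glob pattern."""
    hits = []
    for root in roots:
        hits.extend(glob.glob(os.path.join(root, "**", pattern), recursive=True))
    return sorted(hits)


if __name__ == "__main__":
    # Assumed search roots; neither is guaranteed to exist on your machine.
    for path in find_library(["/usr/local/cuda-10.1", "/usr/lib64"], "libcublas.so*"):
        print(path)
```

If the only hits are outside the directories named in the error message, the build will not find the library without extra links or configuration.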

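Searching showed that most people consider CUDA 10.1 not yet usable with TensorFlow, so I had to give up on it. Along the way I tried adding a symlink, as suggested in https://github.com/tensorflow/tensorflow/issues/26289: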

sudo ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcublas.so.10.1.0.105 /usr/lib64/libcublas.so.10.0

Rebuilding then produced a new error:

Cuda Configuration Error: None of the libraries match their SONAME: /home/bernard/opt/cuda_test/cuda/lib64/libcublas.so.10.1

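I decided to uninstall 10.1 and install 10.0 instead.

Second attempt: CUDA => 10.0, cuDNN => 7.5, NCCL => 2.4.2

1. Download the CUDA 10.0 installer; everything else stays the same.

2. Building TensorFlow now fails with a new error: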

fatal error: nccl.h: No such file or directory

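nccl.h cannot be found, which means the rpm-based NCCL install above did not work.

Searching suggests installing libnccl2, libnccl-dev, and libnccl-static, but the tutorials online are all for Ubuntu and use apt-get. CentOS only has yum, and trying it there fails with: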

No package "libnccl" available

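3. Uninstall the rpm-installed NCCL, then build and install NCCL from source

Clone the nccl project from GitHub, then build and package it:

cd nccl
# Build the library; output goes to build/ by default
make -j src.build
# CentOS is rpm-based, so install the rpm packaging tools and build rpm packages
sudo yum install rpm-build rpmdevtools
make pkg.redhat.build
# The resulting packages are placed under build/pkg/rpm/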
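
Because the earlier failure was a missing nccl.h, it is worth confirming that the source build produced both the header and the shared library before rebuilding TensorFlow. This is only a sketch: it assumes NCCL's default build/ output directory, and nccl_build_missing is a made-up helper name.

```python
import os


def nccl_build_missing(build_dir):
    """Return the expected NCCL artifacts that are absent from build_dir."""
    expected = [os.path.join("include", "nccl.h"), os.path.join("lib", "libnccl.so")]
    return [rel for rel in expected
            if not os.path.exists(os.path.join(build_dir, rel))]


if __name__ == "__main__":
    # Assumption: run from the nccl checkout with the default build/ directory.
    print("missing:", nccl_build_missing("build"))
```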

4. Rebuild TensorFlow

Please specify the location of python. [Default is /usr/bin/python]: 
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: n
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
Do you wish to build TensorFlow with GDR support? [y/N]: N
Do you wish to build TensorFlow with VERBS support? [y/N]: N
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
Do you wish to build TensorFlow with CUDA support? [y/N]: Y
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 10.0]:
Please specify the location where CUDA 10.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.1]:  
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
Please specify the NCCL version you want to use. [Leave empty to default to NCCL 2]: 
Please specify the location where NCCL 2 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]: 
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1] 
Do you want to use clang as CUDA compiler? [y/N]: N
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc
Do you wish to build TensorFlow with MPI support? [y/N]: N
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:N

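The answers that differ from the first attempt (highlighted in red in the original post) are the CUDA version and the NCCL version, both now left empty to accept the defaults; everything else is unchanged. The build finished after roughly an hour.

Package the result as a whl file: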

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Install it with pip:

pip install /tmp/tensorflow_pkg/*.whl

[Screenshot: successful installation]

5. Test whether TensorFlow can use the GPU

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
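
Reading the device placement log works, but a more direct check is to list the devices TensorFlow can see and keep only the GPUs. The sketch below relies on device_lib.list_local_devices() from the TensorFlow 1.x Python package; gpu_names is an illustrative helper, and the import is wrapped so the script still runs where TensorFlow is not installed.

```python
def gpu_names(devices):
    """Keep the names of devices whose device_type is "GPU"."""
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    try:
        # Requires the freshly built TensorFlow package to be installed.
        from tensorflow.python.client import device_lib
        print(gpu_names(device_lib.list_local_devices()))
    except ImportError:
        print("tensorflow is not installed in this environment")
```

An empty list means the build is running on the CPU only.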

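This raised a very strange error.

At first I assumed the tensorboard dependency had not been built, but the source showed that nothing extra needs to be downloaded. Checking the timestamps of the tensorboard files finally revealed leftovers from an earlier installation that had not been removed cleanly. After removing them with pip uninstall and reinstalling, everything worked.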
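
Leftovers like this are easier to spot when all related packages and their versions are listed side by side. The following is a minimal sketch assuming Python 3.8 or newer (for importlib.metadata); installed_with_prefix is a hypothetical helper, and the optional dists argument exists only to make it testable.

```python
from importlib import metadata


def installed_with_prefix(prefix, dists=None):
    """List (name, version) for installed distributions whose name starts with prefix."""
    dists = metadata.distributions() if dists is None else dists
    found = {(d.metadata.get("Name"), d.version) for d in dists}
    return sorted(item for item in found
                  if item[0] and item[0].lower().startswith(prefix))


if __name__ == "__main__":
    for name, version in installed_with_prefix("tensor"):
        print(name, version)
```

A tensorboard version that does not match the freshly built tensorflow package is a hint that an older install is still around.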

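Summary

In fact, once CUDA and cuDNN are installed you can simply pip install tensorflow-gpu without building it yourself (so cmake and bazel are not needed either). I originally assumed the latest version was not available that way, which is why I built from source; later I found that the prebuilt package is itself built against CUDA 10.0. Still, a build tailored to your own system always works best, haha!

Screenshot of the direct install: AVX2 is not being used, so building from source is still the better option.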
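
Whether a source build can gain anything here depends on the CPU actually supporting AVX2, since the default -march=native flag only enables what the build machine offers. As a rough, Linux-only sketch (cpu_flags is an illustrative name), the flags line of /proc/cpuinfo can be inspected:

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:  # Linux only
            flags = cpu_flags(f.read())
    except OSError:
        flags = set()
    print("avx2 supported by CPU:", "avx2" in flags)
```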