Installing CUDA 8.0, Anaconda 4.4.0 and TensorFlow 1.2.1 on Ubuntu 16.04

Cuda

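If your machine has an NVIDIA card, consider installing CUDA so you can use GPU acceleration later. I previously wrote an article on installing CUDA 7.5 on Ubuntu 14.04 (http://www.linuxidc.com/Linux/2016-11/136768.htm). TensorFlow 1.2 appears to require CUDA Toolkit 8.0, and the process is much the same as before: update the driver (if needed), then download CUDA and cuDNN from the NVIDIA website and install them. I won't repeat the details here.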
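For most NVIDIA cards, CUDA 8.0 requires driver version 367 or later. If your current driver already meets that, the safer choice is not to update the driver when installing CUDA. If a driver update does go wrong, for example with a login loop or no graphical desktop, switch to a text console (Ctrl+Alt+F1). From there, remove the existing driver, disable nouveau (the open-source driver for NVIDIA cards), and then reinstall the driver: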

sudo apt-get remove --purge 'nvidia*'
sudo apt-get update
sudo apt-get install dkms build-essential linux-headers-generic
sudo vim /etc/modprobe.d/blacklist.conf

Add the following lines to blacklist.conf:

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
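
Instead of editing the file in vim, you can print the same entries with a small helper and append them with tee. This is only a sketch, and the function names are hypothetical. The follow-up step, sudo update-initramfs -u, is not part of the original procedure. It is commonly recommended on Ubuntu because it regenerates the initramfs, so the blacklist also applies early in boot.

```shell
# Print the modprobe directives that keep the nouveau driver from loading
# (the same lines listed above for blacklist.conf).
nouveau_blacklist_lines() {
    cat <<'EOF'
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
EOF
}

# Succeed if the given file already contains the "blacklist nouveau" line,
# so the entries are not appended twice.
is_nouveau_blacklisted() {
    grep -qx 'blacklist nouveau' "$1" 2>/dev/null
}
```

For example, if is_nouveau_blacklisted /etc/modprobe.d/blacklist.conf fails, run nouveau_blacklist_lines | sudo tee -a /etc/modprobe.d/blacklist.conf, then run sudo update-initramfs -u.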

sudo service lightdm stop
sudo add-apt-repository ppa:graphics-drivers/ppa && sudo apt-get update
sudo apt-get install nvidia-375

Reboot. If you still can't get into the graphical desktop, reinstall the Unity-related packages, then start the desktop environment with sudo service lightdm start.
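
After the reboot, run a quick sanity check. Confirm that the driver meets the 367 minimum mentioned above and that nouveau is no longer loaded. The sketch below assumes bash, and assumes that nvidia-smi (installed with the driver) supports --query-gpu=driver_version. The helper names are my own, and the check compares only the major version number.

```shell
# Succeed if a driver version string such as "375.66" has a major version
# of at least $2 (default 367, the CUDA 8.0 minimum mentioned above).
driver_meets_min() {
    local version="$1" min_major="${2:-367}"
    local major="${version%%.*}"
    # Reject empty or non-numeric major versions
    case "$major" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$major" -ge "$min_major" ]
}

# Print the driver version reported by nvidia-smi for the first GPU.
installed_driver_version() {
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}

# Succeed if the nouveau kernel module is still loaded.
nouveau_loaded() {
    lsmod | grep -q '^nouveau '
}
```

Usage: driver_meets_min "$(installed_driver_version)" && echo "driver OK for CUDA 8.0". If nouveau_loaded succeeds, the blacklist has not taken effect yet.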

Anaconda

 
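The Anaconda distribution lets you create isolated Python development and runtime environments. Each environment has its own Python runtime, independent of the others, so installing A won't break the environment you set up for B. The latest Anaconda version is 4.4.0, available from https://www.continuum.io/downloads. Installation is easy. For example, for Anaconda for Python 2.7: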

bash ~/Downloads/Anaconda2-4.4.0-Linux-x86_64.sh
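
For scripted installs, the Anaconda installer also has a batch mode: -b accepts the license without prompting, and -p sets the install prefix. As far as I know, batch mode does not edit ~/.bashrc. The hypothetical helper below therefore adds the bin directory to PATH, and only once. The prefix ~/anaconda2 is an assumption; substitute wherever you installed it.

```shell
# Append an Anaconda bin directory to PATH in a shell startup file,
# doing nothing if the exact line is already there.
# Usage: add_conda_to_path <prefix> [rc_file]   (rc_file defaults to ~/.bashrc)
add_conda_to_path() {
    local prefix="$1" rc="${2:-$HOME/.bashrc}"
    # \$PATH stays literal so it is expanded when the rc file is sourced
    local line="export PATH=\"$prefix/bin:\$PATH\""
    # -F: fixed string, -x: whole-line match
    if ! grep -qxF "$line" "$rc" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc"
    fi
}
```

For example, run bash ~/Downloads/Anaconda2-4.4.0-Linux-x86_64.sh -b -p "$HOME/anaconda2", then add_conda_to_path "$HOME/anaconda2", and open a new shell.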

 
After that you can create environments, for example one with Python 3.5 and one with Python 2.7:

conda create --name py35 python=3.5
conda create --name py27 python=2.7

Here py35 and py27 are the environment names. To enter an environment, use:

source activate <env name>

To leave it, use:

source deactivate

To list the packages installed in the currently active environment:

conda list

To delete an environment:

conda remove --name <env name> --all

To list the existing environments:

conda env list

To list the packages installed in a given environment:

conda list --name=<env name>

For more usage details, see https://conda.io/docs/using/envs.html
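
When these commands end up in a setup script, it helps to create an environment only if it is missing. Below is a minimal bash sketch. It assumes that conda env list prints header lines starting with # and puts the environment name in the first column. The names conda_env_exists and ensure_conda_env are hypothetical.

```shell
# Succeed if an environment named $1 appears in `conda env list` output on stdin.
# Lines starting with '#' are headers; otherwise the first column is the name.
conda_env_exists() {
    awk -v name="$1" '$1 !~ /^#/ && $1 == name { found = 1 } END { exit !found }'
}

# Create environment $1 with Python version $2 unless it already exists.
ensure_conda_env() {
    local name="$1" py="$2"
    if ! conda env list | conda_env_exists "$name"; then
        conda create --yes --name "$name" "python=$py"
    fi
}
```

With this in place, ensure_conda_env py27 2.7 and ensure_conda_env py35 3.5 can be rerun safely.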

Once inside an environment, you can install packages with either conda install or the traditional pip install. When the network is slow, downloads may time out:

  ReadTimeoutError: HTTPSConnectionPool(host='pypi.python.org', port=443): Read timed out. 

If the cause really is just a slow connection, you can work around it by increasing the timeout:

pip --default-timeout=10000 install -U <package name>
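
To make a longer timeout stick without passing the flag every time, you can set it once for all pip calls. pip reads the timeout from a config file (the timeout key in the [global] section) or from the PIP_DEFAULT_TIMEOUT environment variable. The sketch below writes the config file. It assumes the per-user location ~/.config/pip/pip.conf (the older ~/.pip/pip.conf is also read). The helper name is hypothetical, and the helper overwrites any existing file at that path.

```shell
# Write a per-user pip config that raises the network timeout (in seconds).
# Usage: write_pip_timeout <seconds> [config_file]
# Warning: overwrites config_file if it already exists.
write_pip_timeout() {
    local seconds="$1"
    local conf="${2:-${XDG_CONFIG_HOME:-$HOME/.config}/pip/pip.conf}"
    mkdir -p "$(dirname "$conf")"
    printf '[global]\ntimeout = %s\n' "$seconds" > "$conf"
}
```

For example, write_pip_timeout 10000 matches the command-line value used above.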

Also, if you run into the following error:

  ValueError: failed to parse CPython 

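Python may be mixing in packages from the per-user site directory in your home directory (for example ~/.local/lib/python2.7/site-packages). One fix is to open anaconda2/lib/python2.7/site.py and set ENABLE_USER_SITE = False.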
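
If you'd rather not edit a file inside the Anaconda installation, use Python's standard PYTHONNOUSERSITE environment variable. It has the same effect for any process that inherits it. The sketch below assumes bash, and the check helper is a hypothetical name. The helper should print False once the user site is disabled.

```shell
# Tell Python not to add the per-user site-packages directory to sys.path
# for this shell and every process started from it.
export PYTHONNOUSERSITE=1

# Print site.ENABLE_USER_SITE for an interpreter (default: python on PATH);
# it should print False while PYTHONNOUSERSITE is set.
user_site_enabled() {
    "${1:-python}" -c 'import site; print(site.ENABLE_USER_SITE)'
}
```

To make the setting permanent, add the export line to ~/.bashrc.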

TensorFlow

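At the time of writing, the latest release is 1.2.1 (1.3 is still a release candidate), so we use v1.2.1 as the example. The most convenient option is to install the prebuilt package: https://www.tensorflow.org/install/install_linux. If you have installed Anaconda, first activate an environment (assuming you have already created a Python 2.7 environment named py27):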

source activate py27

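If you haven't installed Anaconda, skip this step. Then install TensorFlow. The binary URL depends on your Python version and on whether you have a GPU; pick the right one from https://www.tensorflow.org/install/install_linux#the_url_of_the_tensorflow_python_package. The wheel must match the Python version of the active environment. For example, with Python 3.5 (such as the py35 environment) and a GPU, you can use: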

pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.2.1-cp35-cp35m-linux_x86_64.whl
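
A wheel built for one Python version will not install into another: pip rejects it as "not a supported wheel on this platform". It is therefore worth checking the tag before running the command above. In the hedged bash sketch below, python_wheel_tag prints the cpXY tag of the active interpreter, and wheel_matches_python checks a URL for that tag. Both names are hypothetical.

```shell
# Print the CPython wheel tag (e.g. cp27, cp35) of an interpreter,
# defaulting to the python of the active environment.
python_wheel_tag() {
    "${1:-python}" -c 'import sys; print("cp%d%d" % sys.version_info[:2])'
}

# Succeed if the wheel file name in URL $1 carries the Python tag $2.
wheel_matches_python() {
    case "${1##*/}" in
        *-"$2"-*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, in the py27 environment python_wheel_tag prints cp27, so wheel_matches_python fails for the cp35 URL above.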

Then do a quick check that it loads:

python -c "import tensorflow as tf;print(tf.__version__);"

If it prints the version number you just installed, you're basically done.
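
A successful import does not show whether the GPU build actually found the card. One way to check is to list the devices TensorFlow sees via tensorflow.python.client.device_lib. In 1.2, GPU devices are typically named like /gpu:0; later versions use /device:GPU:0, and the check below accepts both. The helper names are hypothetical.

```shell
# Print the names of the devices TensorFlow can use, one per line.
tf_device_names() {
    "${1:-python}" -c '
from tensorflow.python.client import device_lib
for d in device_lib.list_local_devices():
    print(d.name)
'
}

# Succeed if any device name read from stdin is a GPU
# (matches both "/gpu:0" and "/device:GPU:0", case-insensitively).
has_gpu_device() {
    grep -qi 'gpu:'
}
```

Usage, inside the activated environment: tf_device_names | has_gpu_device && echo "GPU visible".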

 
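However, the official prebuilt package is not compiled to use the x86 SIMD instruction set extensions (SSE/AVX/FMA). As a result, it prints messages like these when you run training: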

2017-08-12 20:10:39.973508: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-12 20:10:39.973536: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-12 20:10:39.973541: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-12 20:10:39.973545: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-12 20:10:39.973549: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

The ostrich approach is to raise the log level so you no longer see them:

export TF_CPP_MIN_LOG_LEVEL=2

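But this also filters out other log messages: level 2 hides all of TensorFlow's C++ INFO and WARNING output. Besides, in some cases the x86 SIMD instructions can give a several-fold speedup, so it is worth building a version with these optimizations yourself. First download the source, then check out the branch for the corresponding version (e.g. r1.2):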

git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout r1.2

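Following https://stackoverflow.com/questions/41293077/how-to-compile-tensorflow-with-sse4-2-and-avx-instructions, install the build tool bazel (https://docs.bazel.build/versions/master/install-ubuntu.html). Then run ./configure in the source directory (among other things, it asks whether to build with CUDA support), and build with: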

bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k //tensorflow/tools/pip_package:build_pip_package
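
The bazel command hard-codes the instruction sets. A hedged alternative is to derive the --copt flags from the flags line of /proc/cpuinfo, so that only extensions your CPU actually reports are enabled. Either way, a binary built like this may fail with an illegal-instruction error on a CPU that lacks those extensions. The function name and flag mapping below are my own. GCC's -march=native is another common way to target the build machine.

```shell
# Turn the CPU feature flags in /proc/cpuinfo (or a given file) into the
# gcc --copt options used in the bazel command, keeping only supported ones.
cpu_copts() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    local flags pair flag opt out=""
    # First "flags" line, padded with spaces so " avx " cannot match "avx2"
    flags=" $(grep -m 1 '^flags' "$cpuinfo" | cut -d: -f2) "
    for pair in sse4_1:-msse4.1 sse4_2:-msse4.2 avx:-mavx avx2:-mavx2 fma:-mfma; do
        flag="${pair%%:*}"   # cpuinfo name, e.g. sse4_1
        opt="${pair#*:}"     # gcc option, e.g. -msse4.1
        case "$flags" in
            *" $flag "*) out="$out --copt=$opt" ;;
        esac
    done
    printf '%s\n' "${out# }"
}
```

Usage: bazel build -c opt $(cpu_copts) --copt=-mfpmath=both --config=cuda -k //tensorflow/tools/pip_package:build_pip_package. The $(cpu_copts) expansion is deliberately unquoted so it splits into separate options.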

If you hit the following error during the build:

Loading:
Loading: 0 packages loaded
ERROR: Skipping '//tensorflow/tools/pip_package:build_pip_package': error loading package 'tensorflow/tools/pip_package': Encountered error while reading extension file 'cuda/build_defs.bzl': no such package '@local_config_cuda//cuda': Traceback (most recent call last):

 
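This is a known issue (https://github.com/tensorflow/tensorflow/pull/11949). The fix is in https://github.com/tensorflow/tensorflow/pull/11949/commits/c5d311eaf8cc6471643b5c43810a1feb19662d6c. It doesn't seem to have been cherry-picked into the release branch yet, so cherry-pick it by hand and that should solve the problem.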
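
Cherry-picking by hand might look like the bash sketch below, run inside the tensorflow checkout on the r1.2 branch. It relies on GitHub exposing each pull request as pull/<number>/head on the origin remote. The function name and the DRY_RUN switch are hypothetical, and the pick may still conflict on the release branch.

```shell
# Fetch GitHub pull request $1 from origin and cherry-pick commit $2 onto the
# current branch. With DRY_RUN=1 the git commands are only printed.
cherry_pick_pr_commit() {
    local pr="$1" sha="$2" run=""
    if [ "${DRY_RUN:-0}" = 1 ]; then
        run="echo"
    fi
    # GitHub publishes each pull request as refs/pull/<number>/head
    $run git fetch origin "pull/$pr/head" || return
    $run git cherry-pick "$sha"
}
```

For example, DRY_RUN=1 cherry_pick_pr_commit 11949 c5d311eaf8cc6471643b5c43810a1feb19662d6c prints the two git commands. Drop DRY_RUN=1 to actually run them.
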
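
Once the build succeeds, use the following command to generate the .whl package in a directory of your choice (e.g. ~/tmp/). Then install it the same way as before.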

bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tmp/
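
To install the result, point pip at the wheel that was just written. The exact file name depends on the version and the Python tag. The hypothetical helper below therefore picks the newest tensorflow*.whl in the output directory.

```shell
# Print the most recently modified tensorflow*.whl in a directory
# (default ~/tmp, the output directory used above).
newest_tf_wheel() {
    local dir="${1:-$HOME/tmp}"
    ls -t "$dir"/tensorflow*.whl 2>/dev/null | head -n 1
}
```

Usage: pip install --ignore-installed --upgrade "$(newest_tf_wheel ~/tmp)".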

If you get the following error at run time:

ImportError: Traceback (most recent call last):
  File "tensorflow/python/pywrap_tensorflow.py", line 41, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
ImportError: No module named pywrap_tensorflow_internal

 
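According to https://stackoverflow.com/questions/35953210/error-running-basic-tensorflow-example, the fix is simply to cd to a directory outside the TensorFlow source tree. Inside the source tree, Python imports the tensorflow/ directory there, which lacks the generated pywrap_tensorflow_internal module, instead of the installed package.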
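
To tell quickly whether the current directory is affected, look for the file named in the traceback. Below is a hypothetical bash helper, assuming the layout of the TensorFlow source tree.

```shell
# Succeed if directory $1 (default: current) contains the TensorFlow source
# package. Running `python -c "import tensorflow"` there would import it
# instead of the installed one, because sys.path[0] is the current directory.
tf_source_shadows_install() {
    local dir="${1:-$PWD}"
    [ -f "$dir/tensorflow/python/pywrap_tensorflow.py" ]
}
```

Usage: if tf_source_shadows_install; then cd ~; fi, then rerun the version check from earlier.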
