TensorFlow (8): Installing nvidia-docker2 on Xubuntu 18.04, getting nvidia-smi to work, and starting the TensorFlow GPU image with docker --runtime=nvidia

Preface


Related Golang posts, full category:
https://blog.csdn.net/freewebsys/category_6872378.html

The original link to this article is:
https://blog.csdn.net/freewebsys/article/details/105269765

Do not repost without the author's permission.
Author's blog: http://blog.csdn.net/freewebsys

Background


https://blog.csdn.net/freewebsys/article/details/82291919

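I have set up TensorFlow GPU on Xubuntu before, but that was a direct install on the host,
and because of the CPU in that machine it could not run the latest release, only 1.5. This time the installation and test go through nvidia-docker2 instead.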

1. Install nvidia-docker2, CUDA, and the driver


The documentation below covers a lot of ground, but in practice you just install nvidia-docker2 directly.

https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#how-do-i-install-the-nvidia-driver

It is very simple:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

sudo apt-get install -y nvidia-docker2 cuda-drivers
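
The first command above builds the repository name from /etc/os-release by concatenating ID and VERSION_ID. As a rough illustration of what that one-liner computes, here is a small Python sketch. The function names are my own, and the sample values in the usage note are an assumption about what a typical os-release file contains, not output captured from this machine.

```python
import shlex


def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # Values may be quoted shell-style, e.g. VERSION_ID="18.04".
        parts = shlex.split(raw)
        values[key] = parts[0] if parts else ""
    return values


def distribution_id(text):
    """Return ID followed by VERSION_ID, like the shell one-liner does."""
    info = parse_os_release(text)
    return info.get("ID", "") + info.get("VERSION_ID", "")


def nvidia_docker_list_url(distribution):
    """Build the apt source list URL used by the curl command."""
    return (
        "https://nvidia.github.io/nvidia-docker/%s/nvidia-docker.list"
        % distribution
    )
```

To use it, read /etc/os-release into a string and pass it to distribution_id(); a file containing ID=ubuntu and VERSION_ID="18.04" yields ubuntu18.04, which is then substituted into the list URL.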

https://devblogs.nvidia.com/nvidia-docker-gpu-server-application-deployment-made-easy/


# nvidia-docker -v
Docker version 19.03.8, build afacb8b7f0
# cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
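
The daemon.json above is what makes --runtime=nvidia resolvable. A quick way to check that the entry is present is to parse the file and look for the nvidia key. This is only a sketch: the function name is made up for illustration, and it checks the file contents only, not whether the running daemon has actually loaded them.

```python
import json


def has_nvidia_runtime(daemon_json_text):
    """Return True if the daemon.json text registers a runtime named "nvidia"."""
    config = json.loads(daemon_json_text)
    runtimes = config.get("runtimes", {})
    return "nvidia" in runtimes and bool(runtimes["nvidia"].get("path"))


# The same content as the file shown above.
SAMPLE = """
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
"""

print(has_nvidia_runtime(SAMPLE))  # True
```

On a real machine the text would come from reading /etc/docker/daemon.json.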

# nvidia-docker run nvidia/cuda nvidia-smi
docker: Error response from daemon: Unknown runtime specified nvidia.
See 'docker run --help'.

This error means the Docker daemon does not know a runtime named nvidia; it usually goes away once the daemon has been restarted after /etc/docker/daemon.json was written. With the runtime registered, the next attempt fails at a different point, because no NVIDIA driver is loaded yet:


docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected\\\\n\\\"\"": unknown.
ERRO[0168] error waiting for container: context canceled 
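
The two failures above have different causes, and it is easy to mix them up. The sketch below maps each message to a short label. The labels and the function name are my own, and the mapping reflects my reading of these two errors rather than anything from the NVIDIA documentation.

```python
def diagnose(message):
    """Map a docker error message to a short, hypothetical cause label."""
    text = message.lower()
    if "unknown runtime specified nvidia" in text:
        # dockerd has no "nvidia" entry loaded from daemon.json.
        return "runtime-not-registered"
    if "no cuda-capable device is detected" in text:
        # The runtime hook ran, but the host driver is missing or not loaded.
        return "driver-not-loaded"
    return "unknown"
```

For the first label the thing to look at is daemon.json and a daemon restart; for the second, the host driver, which is what the rest of this section deals with.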

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-installation

https://developer.nvidia.com/cuda-gpus

https://www.nvidia.com/download/index.aspx?lang=en-us

https://developer.nvidia.com/cuda-downloads

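Install from the local installer file. It looks like CUDA itself does not need to be installed on the host; installing the driver alone is enough.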

wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
sudo sh cuda_10.2.89_440.33.01_linux.run


# nvidia-smi 
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

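Install the driver.

Download it first:

https://www.nvidia.com/Download/index.aspx?lang=en-us

My card is a GeForce GTX 750 Ti, an inexpensive card that still supports CUDA.
Pick the entry that matches your own card's model: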

$ lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2)
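
If you want to pull the card name out of the lspci output programmatically, for example to decide which driver to download, something like the following works on the two lines shown above. It is a sketch with hypothetical names, and it only looks at "VGA compatible controller" lines, so cards that lspci reports under a different class would be missed.

```python
import re

# Matches lines such as "01:00.0 VGA compatible controller: <description>".
LSPCI_VGA = re.compile(r"^(?P<slot>\S+) VGA compatible controller: (?P<desc>.+)$")


def nvidia_gpus(lspci_output):
    """Return (pci_slot, model_name) tuples for NVIDIA VGA controllers."""
    gpus = []
    for line in lspci_output.splitlines():
        match = LSPCI_VGA.match(line.strip())
        if not match or "NVIDIA" not in match.group("desc"):
            continue
        desc = match.group("desc")
        # The model name sits in square brackets, e.g. [GeForce GTX 750 Ti].
        bracket = re.search(r"\[([^\]]+)\]", desc)
        gpus.append((match.group("slot"), bracket.group(1) if bracket else desc))
    return gpus
```

Fed the lspci output above, it returns a single entry for slot 01:00.0 with the name GeForce GTX 750 Ti and skips the Intel integrated graphics line.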


Run the driver installer:

chmod 755 NVIDIA-Linux-x86_64-440.64.run
./NVIDIA-Linux-x86_64-440.64.run

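CUDA had already been installed before this step, but there was still no working driver, so the driver has to be installed separately.
The X server must be stopped before the installer will run:

 /etc/init.d/lightdm stop

Then press Ctrl + Alt + F6 to switch to another tty and run the installer there.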

Once the installation succeeds, nvidia-smi shows the GPU information:

$ nvidia-smi
Thu Apr  9 11:12:19 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  On   | 00000000:01:00.0 Off |                  N/A |
| 33%   32C    P8     1W /  38W |   3840MiB /  4043MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9860      C   python                                      3827MiB |
+-----------------------------------------------------------------------------+
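
The table above is meant for people. For scripts, nvidia-smi also has a CSV query mode (--query-gpu with --format=csv,noheader,nounits), which is much easier to parse. Below is a minimal sketch around it; the field selection, the function names, and the dictionary keys are my own choices.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["name", "driver_version", "memory.used", "memory.total"]


def parse_query_output(text):
    """Parse CSV rows produced by nvidia-smi --query-gpu into dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(QUERY_FIELDS):
            continue  # skip blank or unexpected lines
        name, driver, used, total = (field.strip() for field in record)
        rows.append({
            "name": name,
            "driver_version": driver,
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
        })
    return rows


def query_gpus():
    """Run nvidia-smi in CSV query mode; raises if the driver is not working."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.check_output(cmd, universal_newlines=True)
    return parse_query_output(output)
```

Calling query_gpus() on a host without a working driver raises an exception from subprocess, which is a convenient scripted version of the "couldn't communicate with the NVIDIA driver" check.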

Then run nvidia-docker again:

# nvidia-docker run -it --rm  nvidia/cuda nvidia-smi 
Thu Apr  9 03:37:24 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  On   | 00000000:01:00.0 Off |                  N/A |
| 33%   33C    P8     1W /  38W |      1MiB /  4043MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


2. Use the TensorFlow GPU Docker image


https://tensorflow.google.cn/install/gpu

https://tensorflow.google.cn/install/docker

# docker run --runtime=nvidia -it tensorflow/tensorflow:latest-gpu bash

________                               _______________                
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ / 
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

root@dd97a5d0821c:/# python
Python 2.7.17 (default, Nov  7 2019, 10:07:09) 
[GCC 7.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))
2020-04-09 03:51:52.995985: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-04-09 03:51:52.997348: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-04-09 03:51:53.434667: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-09 03:51:53.442235: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-09 03:51:53.442889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 750 Ti computeCapability: 5.0
coreClock: 1.0845GHz coreCount: 5 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 77.49GiB/s
2020-04-09 03:51:53.442917: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-09 03:51:53.442979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-09 03:51:53.444490: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-09 03:51:53.444837: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-09 03:51:53.446391: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-09 03:51:53.447314: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-09 03:51:53.447362: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-09 03:51:53.447468: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-09 03:51:53.448163: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-09 03:51:53.448754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-09 03:51:53.449148: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-09 03:51:53.476294: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3192970000 Hz
2020-04-09 03:51:53.476705: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5586dc0d8ae0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-09 03:51:53.476732: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-09 03:51:53.521245: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-09 03:51:53.521713: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5586dc14e9a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-09 03:51:53.521733: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 750 Ti, Compute Capability 5.0
2020-04-09 03:51:53.521883: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-09 03:51:53.522212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 750 Ti computeCapability: 5.0
coreClock: 1.0845GHz coreCount: 5 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 77.49GiB/s
2020-04-09 03:51:53.522244: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-09 03:51:53.522255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-09 03:51:53.522272: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-09 03:51:53.522283: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-09 03:51:53.522294: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-09 03:51:53.522306: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-09 03:51:53.522316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-09 03:51:53.522375: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-09 03:51:53.522721: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-09 03:51:53.523009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-09 03:51:53.523035: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-09 03:51:53.708581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-09 03:51:53.708619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-04-09 03:51:53.708626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-04-09 03:51:53.708924: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-09 03:51:53.709315: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-09 03:51:53.709631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3655 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0)
tf.Tensor(-606.43463, shape=(), dtype=float32)
>>> 
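
The log above is long; the line that matters is "Created TensorFlow device ... GPU:0". A quieter way to confirm that TensorFlow sees the card is to list the physical GPU devices. This sketch assumes TensorFlow 2.1 or newer, where tf.config.list_physical_devices is available; the helper name is hypothetical, and the import guard is only there so the file can also be loaded on a machine without TensorFlow.

```python
try:
    import tensorflow as tf
except ImportError:  # lets the helper be imported where TensorFlow is absent
    tf = None


def visible_gpus():
    """Return the names of the GPUs TensorFlow can see, or [] if none."""
    if tf is None:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


print(visible_gpus())
```

Inside the container started with --runtime=nvidia the list should contain one entry for the GTX 750 Ti; an empty list means TensorFlow fell back to the CPU.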

The container prints a warning about running as root. To avoid it, run the command with sudo and pass a non-root user with -u:

~$ sudo docker run -u $(id -u):$(id -g) --runtime=nvidia --rm -it tensorflow/tensorflow:latest-gpu bash

________                               _______________                
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ / 
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


You are running this container as user with ID 1000 and group 1000,
which should map to the ID and group for your user on the Docker host. Great!

tf-docker / > 
tf-docker / > nvidia-smi 
Thu Apr  9 04:28:04 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  On   | 00000000:01:00.0 Off |                  N/A |
| 33%   33C    P8     1W /  38W |      1MiB /  4043MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
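
If the container is started from a script rather than typed by hand, the same command line can be assembled as an argument list. The sketch below mirrors the flags used above; the function name and its defaults are assumptions for illustration.

```python
import os


def tf_gpu_run_command(image="tensorflow/tensorflow:latest-gpu",
                       uid=None, gid=None, command="bash"):
    """Build the docker run argument list with a non-root user and GPU runtime."""
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    return [
        "docker", "run",
        "-u", "%d:%d" % (uid, gid),  # same effect as -u $(id -u):$(id -g)
        "--runtime=nvidia",
        "--rm", "-it",
        image, command,
    ]
```

The resulting list can be passed to subprocess.call(); prefix it with "sudo" if your user is not in the docker group, as in the command above.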

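Notes on TensorFlow:

[image]

The BIOS also needs to be configured:
[image]
When choosing the video device in the BIOS, do not select Auto; select Intel HD Graphics. The monitor must also be plugged into the integrated graphics output, so that the NVIDIA card is used only for computation.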

3. Summary


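After a lot of fiddling, nvidia-docker2 is finally installed and starts successfully. Right after the first install there really was no driver on the machine; I wrote down everything I tried along the way, and the CUDA installation on the host turned out to be unnecessary.
The right order should be: install the driver first, then install nvidia-docker2.
Everything starts now. The next step is to keep working on object detection.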
