Bug
I built a personal image on top of the official TensorFlow images from Docker Hub, in my case tensorflow/tensorflow:1.11.0-devel-gpu. Running the container and then:
import tensorflow
throws the following error:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
According to the error tips and the usual Stack Overflow answers, you just need to locate the library and add its directory to the environment variables, and all should be well. The problem is that running this inside the container:
find / -name 'libcuda.so.1'
turns up nothing, so some rather colorful language inevitably followed.
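A side note on why the find comes up empty (my understanding, sketched below, not something the error message spells out): libcuda.so.1 ships with the host's NVIDIA driver, not with CUDA or with the image, so a plain container simply has no copy of it to find; mounting the host's driver libraries into the container is exactly what nvidia-docker is for. On a host with the driver installed, the library is visible in the linker cache:

```shell
# libcuda.so.1 belongs to the NVIDIA *driver* package on the host, so it is
# visible to ldconfig there but absent from a plain (non-nvidia) container.
ldconfig -p | grep libcuda.so.1 || echo "no driver libs visible here"
```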
Solution
The Optional Features section of the TensorFlow Docker Hub page has a sentence about exactly this, which points to nvidia-docker. So, on to installing nvidia-docker, following its installation guide.
The installation process basically follows the official website:
step1:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
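As a quick sanity check on step 1: the `$distribution` variable is just the OS id and version concatenated from /etc/os-release, and it is worth confirming what it resolves to before it gets baked into the repository URLs (the example value below assumes Ubuntu 18.04):

```shell
# Peek at what $distribution resolves to before feeding it into the repo URLs.
# /etc/os-release defines ID (e.g. "ubuntu") and VERSION_ID (e.g. "18.04").
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
echo "$distribution"   # e.g. "ubuntu18.04" on Ubuntu 18.04
```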
step2:
curl -s -L https://nvidia.github.io/nvidia-container-runtime/experimental/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
step3: install nvidia-docker2
sudo apt-get update
sudo apt-get install -y nvidia-docker2
step4: restart the Docker daemon
sudo systemctl restart docker
step5: run a test demo
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
If you see something like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
then nvidia-docker2 was installed successfully. But, true to form for a loser (me), something had to go wrong: sure enough, after copy-pasting step 5, this came back:
Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
Fortunately CSDN is full of talent. It seemed the package simply had not been installed properly; re-running the install, following this article:
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
and with that, nvidia-docker2 counted as successfully installed.
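One way to confirm that the daemon actually picked up the new runtime after the restart (a hypothetical check of my own, guarded so it degrades gracefully on machines without docker) is to look for it in `docker info`:

```shell
# After `systemctl restart docker`, nvidia-docker2 should appear as an extra
# runtime next to the default runc. Guarded for machines without docker.
if command -v docker >/dev/null 2>&1; then
  docker info 2>/dev/null | grep -i 'runtimes' || echo "nvidia runtime not registered"
else
  echo "docker not installed"
fi
```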
So how is the container actually supposed to be run? Back on the TensorFlow Docker Hub page, the section on running containers gives:
docker run -it --rm --runtime=nvidia image_name:tag python
I ran it and promptly got hit over the head:
docker: Error response from daemon: Unknown runtime specified nvidia.
See 'docker run --help'.
At this point I was speechless. A search turned up the posts below, all of which say to modify the contents of /etc/docker/daemon.json; but a quick look showed the daemon.json file already seemed to be in place. Slightly awkward.
docker启动容器报错 Unknown runtime specified nvidia. – luwanglin – 博客园
Docker专题——安装nvidia-docker – 知乎
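For reference, the daemon.json that nvidia-docker2 drops in (and that those posts tell you to check) typically looks like the sketch below; the `path` value is what the package installs, so treat this as an illustration rather than something to hand-edit blindly:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```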
Until I came across this article: ubuntu docker-nvidia安装. Its last sentence says "just change --runtime=nvidia to --gpus all". Trying it out:
docker run -it --rm --gpus all image_name:tag python
And indeed it works. Done!
so why?
Step 5 of the nvidia-docker installation actually already gave the answer away. According to the CSDN article docker学习笔记(9):nvidia-docker安装、部署与使用, the problem is caused by the different nvidia-docker versions, each of which expects a different invocation.
So always look for the most up-to-date articles.
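To summarize my understanding of the version mess (pieced together from the articles above, not verified against every release): each generation of the tooling has its own invocation, which is why commands copied from older posts fail on newer setups:

```shell
# Three generations of running GPU containers (summary; the docker commands
# are shown as comments because they need docker + an NVIDIA GPU to run):
#   nvidia-docker v1 (legacy wrapper binary):
#       nvidia-docker run --rm nvidia/cuda nvidia-smi
#   nvidia-docker2 (registers a "nvidia" runtime in /etc/docker/daemon.json):
#       docker run --rm --runtime=nvidia nvidia/cuda nvidia-smi
#   Docker >= 19.03 with nvidia-container-toolkit (native flag):
#       docker run --rm --gpus all nvidia/cuda nvidia-smi
echo "Docker 19.03+: prefer --gpus all"
```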
Acknowledgement
This post only describes the problem I ran into and how I solved it; how broadly it applies is questionable. I am well aware I was just copying recipes, knowing what works without digging into why, so if anything here is wrong or inappropriate, please point it out~~~
References
- Tensorflow Docker Images
- Installation Guide — NVIDIA Cloud Native Technologies documentation
- Nvidia Docker 工作原理
- 在docker中使用GPU
- ubuntu docker-nvidia安装
Original: https://blog.csdn.net/weixin_45595378/article/details/122328087
Author: weixin_45595378
Title: Problems when building an image from the TensorFlow Docker Hub (使用tensorflow Dockerhub 构建image出现的问题)