tensorflow-gpu Docker Image Installation and Deployment Guide

Posted on 2017-12-15 (Friday) 18:01 in Data

Base Environment

  • tensorflow-gpu 1.4.1
  • CUDA 8
  • cuDNN 6
  • CentOS 7
  • Python 3.6.2
  • portainer 1.15.5 (a lightweight Docker cluster management platform)
  • harbor 1.1.2 (an enterprise-grade registry server for storing and distributing Docker images)
  • kubernetes 1.8.3 (an enterprise-grade container cluster management system)

Prerequisites

Building the tensorflow-gpu Docker Image

Pull and start the base image: nvidia/cuda:8.0-cudnn6-runtime-centos7

  • Pull the nvidia/cuda image

    docker pull nvidia/cuda:8.0-cudnn6-runtime-centos7
    
  • Run the nvidia/cuda image

    docker run -it -v /data:/host_data nvidia/cuda:8.0-cudnn6-runtime-centos7 /bin/bash
    

Build and install Python 3.6.2 from source; see [Building and installing Python] for the full procedure.
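The linked article has the details; below is a minimal sketch of a typical source build inside the container. The --prefix matches the /usr/local/python3 path that appears in the troubleshooting log later in this document; the symlink locations are an assumption.

yum install -y gcc make zlib-devel openssl-devel
curl -O https://www.python.org/ftp/python/3.6.2/Python-3.6.2.tgz
tar -zxvf Python-3.6.2.tgz
cd Python-3.6.2
./configure --prefix=/usr/local/python3
make && make install
# make python3/pip3 visible on PATH for the steps below
ln -s /usr/local/python3/bin/python3 /usr/local/bin/python3
ln -s /usr/local/python3/bin/pip3 /usr/local/bin/pip3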

1. Install tensorflow-gpu 1.4.1

pip3 install -U scikit-image -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install -U h5py -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install -U scipy -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install -U imageio -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install -U tensorflow-gpu==1.4.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
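A quick sanity check after the install (actual GPU access only works once the device nodes and driver libraries from step 5 and the troubleshooting section are in place):

python3 -c "import tensorflow as tf; print(tf.__version__)"   # should print 1.4.1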

2. Install openmpi-2.1.2

[Official download]
Mirror downloads: [Google Drive] [Baidu Netdisk]

Note: OpenMPI 3.0 is already available, but TensorFlow currently only recognizes the 2.x series.

Download openmpi-2.1.2 on the host and build it, then copy the openmpi-2.1.2 folder into the image under /usr/local/.

tar -zxvf openmpi-2.1.2.tar.gz
cd openmpi-2.1.2
./configure --prefix=/usr/local/openmpi-2.1.2 --enable-static --with-cuda   # the trailing flags are important
make all install

Add the following to ~/.bashrc:

export OPENMPI=/usr/local/openmpi-2.1.2
export LD_LIBRARY_PATH=$OPENMPI/lib:$LD_LIBRARY_PATH
export PATH=$OPENMPI/bin:$PATH
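Then reload the shell configuration and verify that the new OpenMPI is the one being picked up:

source ~/.bashrc
which mpirun       # should resolve to /usr/local/openmpi-2.1.2/bin/mpirun
mpirun --version   # should report Open MPI 2.1.2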

3. Install nccl_2.1.2-1

[Official download]

Downloading NCCL requires registering an NVIDIA account (free of charge). Alternatively, download from my mirror: [Google Drive] [Baidu Netdisk]

tar -xvf nccl_2.1.2-1.txz
cp -r ./nccl_2.1.2-1 /usr/local

Add /usr/local/nccl_2.1.2-1/lib to LD_LIBRARY_PATH.
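For example, alongside the OpenMPI exports in ~/.bashrc:

export LD_LIBRARY_PATH=/usr/local/nccl_2.1.2-1/lib:$LD_LIBRARY_PATH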

Very important!! Add the following to ~/.bashrc (replace enp2s0 with the name of your own network interface):

export NCCL_SOCKET_IFNAME=enp2s0

This declares which network interface NCCL uses for socket communication. The ^ symbol means negation: for example, export NCCL_SOCKET_IFNAME=^enp2s0 excludes the enp2s0 interface.
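If you are unsure of the interface name, list the host's interfaces:

ip -o link show   # one line per interface; names such as enp2s0 follow the index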

4. Install horovod

Before installing, make sure CUDA, TensorFlow, and OpenMPI are already installed and that their libraries are all on LD_LIBRARY_PATH, then run:

HOROVOD_NCCL_HOME=/usr/local/nccl_2.1.2-1 HOROVOD_GPU_ALLREDUCE=NCCL pip3 install --no-cache-dir horovod
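Once installed, a training script instrumented with horovod.tensorflow is launched through mpirun. A sketch of a two-node launch follows; the flags mirror the Horovod GPU guide linked in the references, while the host list and train.py are illustrative:

mpirun -np 2 \
    -H 192.168.1.184:1,192.168.1.186:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME -x LD_LIBRARY_PATH \
    python3 train.py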

5. Copy the driver library files

The base CUDA image does not ship the libcuda.so and libnvidia-fatbinaryloader.so library files. Copy the following files from the host's /lib64 (or /usr/lib64) directory into the image's /lib64 directory:

lrwxrwxrwx 1 root root       12 Feb 22 10:03 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       17 Feb 22 10:03 libcuda.so.1 -> libcuda.so.384.69
-rwxr-xr-x 1 root root 13030360 Feb 22 10:03 libcuda.so.384.69
-rwxr-xr-x 1 root root   313832 Feb 22 10:06 libnvidia-fatbinaryloader.so.384.69
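A sketch of the copy, reusing the /data:/host_data mount from the docker run above and the driver version 384.69 shown in the listing (adjust both to your setup):

# on the host
cp /usr/lib64/libcuda.so.384.69 /usr/lib64/libnvidia-fatbinaryloader.so.384.69 /data/
# inside the container
cp /host_data/libcuda.so.384.69 /host_data/libnvidia-fatbinaryloader.so.384.69 /lib64/
ln -sf libcuda.so.384.69 /lib64/libcuda.so.1
ln -sf libcuda.so.1 /lib64/libcuda.so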

6. Commit the new environment as a new image

Look up the current CONTAINER ID:

[root@host ~]# docker ps
CONTAINER ID        IMAGE                                                        COMMAND                  CREATED             STATUS              PORTS               NAMES
4e5918306494        192.168.1.184:5000/nvidia/cuda:8.0-cudnn6-runtime-centos7   "/bin/bash"              4 minutes ago       Up 4 minutes                            ecstatic_saha

Commit the new image:

docker commit 4e5918306494 tensorflow-gpu-1.4.1-py36:1.0.0

Push the image to the local registry

Open the portainer management UI and select endpoint -> images in the left sidebar; the newly committed image appears on the right.

Click the image ID to open it; in the right-hand pane select the registry, enter a tag name, and click Tag to create a new tag at the top of the page (PS: the screenshot shows a tag created earlier, so only one tag is visible; it is for illustration only). Click the "up arrow" icon, and after a minute or two the image is pushed to the local registry.
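The same tag-and-push can be done from the docker CLI; the registry address below is the one that appears elsewhere in this document, and the repository path is illustrative:

docker tag tensorflow-gpu-1.4.1-py36:1.0.0 192.168.1.184:5000/bigdata/tensorflow-gpu-1.4.1-py36:1.0.0
docker push 192.168.1.184:5000/bigdata/tensorflow-gpu-1.4.1-py36:1.0.0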

Deploying the Image and Launching a Training Job

Single-node deployment with portainer

Open the portainer management UI, select Containers on the left, and click Add container on the right.

Fill in a container name, select the image, enter the script to execute in the Command field, and set the working directory.

Mount the local volumes (mainly the NVIDIA CUDA libraries).

In Env, configure the GPU ID: set CUDA_VISIBLE_DEVICES to 0 to use the first GPU.

Enable privileged mode, then click Deploy the container; a roughly equivalent docker run invocation is sketched below. (PS: after startup, you can check the container's status from the container page.)
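For reference, a rough docker run equivalent of these portainer settings (the volume, working directory, and run.sh are illustrative; see the troubleshooting section for the GPU device mounts):

docker run -d --privileged \
    -e CUDA_VISIBLE_DEVICES=0 \
    -v /data:/host_data \
    -w /host_data/my_job \
    tensorflow-gpu-1.4.1-py36:1.0.0 ./run.sh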

Automated cluster deployment with Kubernetes

In the Kubernetes dashboard, click the add button in the upper right, choose to upload a YAML file, select your local YAML file, and click Upload (this is equivalent to running kubectl create -f xxx.yaml).

  • Example YAML configuration file
    apiVersion: extensions/v1beta1    # Deployment lives under the extensions/v1beta1 API version
    kind: Deployment
    metadata: 
      labels:  
        app: tensorflow-py2  
      name: tensorflow-py2  
      namespace: bigdata
    spec:  
      replicas: 1  
      selector:  
        matchLabels:  
          app: tensorflow-py2
      template:  
        metadata:  
          labels:  
            app: tensorflow-py2 
        spec:  
          containers: 
          - name: tensorflow-py2 
            image: 192.168.1.184:5000/bigdata/tensorflow-gpu-py2:1.3.0-gpu
            workingDir: /ceph/docker/katong_2_lemiao/
            resources: 
              limits: 
                alpha.kubernetes.io/nvidia-gpu: 1   
            # env:
            # - name: CUDA_VISIBLE_DEVICES
            #   value: "0"   # Kubernetes 1.5.2 only recognizes one GPU per machine
            command: 
            - ./run.sh
            ports:  
            - containerPort: 8088  
              protocol: TCP  
            #livenessProbe:  
            #  httpGet:  
            #    path: /  
            #    port: 8088  
            #  initialDelaySeconds: 30
            #  timeoutSeconds: 30 
            volumeMounts:
            - mountPath: /ceph/docker
              name: zi-data
            - mountPath: /usr/local/nvidia
              name: nvidia-driver
            - mountPath: /sys/fs/cgroup
              name: cgroup
          # nodeName: 192.168.1.186
          volumes:
          - hostPath:
              path: /ceph/docker
            name: zi-data
          - hostPath:
              path: /var/lib/nvidia-docker/volumes/nvidia_driver/384.69
            name: nvidia-driver
          - hostPath:
              path: /sys/fs/cgroup
            name: cgroup
    
    ---  
    kind: Service  
    apiVersion: v1  
    metadata:  
      labels:  
        app: tensorflow-py2  
      name: tensorflow-py2  
      namespace: bigdata
    spec:  
      type: NodePort  
      ports: 
      - port: 8088  
        targetPort: 8088 
        nodePort: 30009
      selector:  
        app: tensorflow-py2
    
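After the upload, you can watch the rollout from the command line; the namespace and label come from the YAML above, and <pod-name> is a placeholder:

kubectl -n bigdata get pods -l app=tensorflow-py2   # find the pod name
kubectl -n bigdata logs -f <pod-name>               # follow the training output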

Troubleshooting

1. Starting a TensorFlow session in Docker fails with failed call to cuInit: CUDA_ERROR_UNKNOWN, as follows:

[root@466ad9c0689b python3]# python3
Python 3.6.2 (default, Feb 11 2018, 10:08:37)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/usr/local/python3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
>>> sess = tf.Session()
2018-02-22 04:22:12.427870: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-22 04:22:12.430545: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
2018-02-22 04:22:12.430672: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: 466ad9c0689b
2018-02-22 04:22:12.430715: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: 466ad9c0689b
2018-02-22 04:22:12.430840: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Invalid argument: expected %d.%d, %d.%d.%d, or %d.%d.%d.%d form for driver version; got "1"
2018-02-22 04:22:12.431332: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  384.69  Wed Aug 16 19:34:54 PDT 2017
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC)
"""
2018-02-22 04:22:12.431410: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 384.69.0

The cause is clear: no GPU devices were found in the container. The fix is to mount the GPU device nodes:

docker run --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm -itd -v /sys/fs/cgroup:/sys/fs/cgroup {IMAGE-NAME}:{VERSION}

Note: this problem does not occur under k8s, because when we request GPU resources, k8s does the above work for us:

resources:
  limits:
    alpha.kubernetes.io/nvidia-gpu: 1

In addition, NVIDIA's official nvidia-docker also mounts the GPU devices for us.
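For example, nvidia-docker (v1) wraps docker run and injects the device nodes and driver volume automatically, using the same placeholder style as above:

nvidia-docker run -it {IMAGE-NAME}:{VERSION} /bin/bash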

References

Harbor user management, image replication, and Kubernetes integration in practice
CUDA Docker images
https://github.com/uber/horovod
https://github.com/uber/horovod/blob/master/docs/gpus.md
Nvidia nccl-release