nvidia-docker2

堵远航
2023-12-01

1. Installation

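Before following the steps in this article, you need to have the following installed:

NVIDIA driver: Installing the Nvidia GPU driver + CUDA + cuDNN on Kubuntu 16.04
Docker: Get Docker CE for Ubuntu
You also need docker-compose: Install Docker Compose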

If you are running a recent Docker (>= 19.03), the nvidia-container-toolkit package is recommended in place of the nvidia-docker2 package:

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
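
The choice between the two packages depends on the Docker version, so it can help to check that version in a script. The following is a minimal sketch, assuming `sort -V` from GNU coreutils is available; `version_ge` and `pick_gpu_package` are hypothetical helper names for illustration, not part of Docker or the NVIDIA tooling.

```shell
#!/usr/bin/env bash

# version_ge A B: succeed when version string A is >= version string B.
# (Hypothetical helper; relies on GNU `sort -V` for version ordering.)
version_ge() {
    # If B is the smaller (or equal) of the two, it sorts first, so A >= B.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# pick_gpu_package VERSION: print the package this article suggests
# for the given Docker version.
pick_gpu_package() {
    if version_ge "$1" "19.03"; then
        echo "nvidia-container-toolkit"
    else
        echo "nvidia-docker2"
    fi
}
```

To apply it to the local daemon, pass in the server version, for example `pick_gpu_package "$(docker version --format '{{.Server.Version}}')"`.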

The nvidia-docker2 package can still be used. If you really do need nvidia-docker2, follow the steps below.

1, Add the nvidia-docker2 repo

In fact, nvidia-docker 1.0 uses the same repo.

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

2, Install the nvidia-docker2 package

sudo apt-get install nvidia-docker2
sudo pkill -SIGHUP dockerd

3, Configure the daemon's default runtime
Because gemfield needs to orchestrate NVIDIA Docker containers with docker-compose, Docker's default runtime has to be set to nvidia. Put the following in /etc/docker/daemon.json:

cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
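
One way to apply this configuration from a script is sketched below. It writes the same JSON shown above to whatever path you pass in and keeps a backup of any existing file. The function name `write_nvidia_daemon_json` and the `.bak` suffix are assumptions made for illustration.

```shell
#!/usr/bin/env bash

# write_nvidia_daemon_json PATH: write the default-runtime config to PATH.
# (Hypothetical helper; overwrites PATH after saving a copy as PATH.bak.)
write_nvidia_daemon_json() {
    local target="$1"
    # Keep a copy of any existing config before overwriting it.
    if [ -f "$target" ]; then
        cp "$target" "$target.bak"
    fi
    cat > "$target" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}
```

Writing to a scratch path first (for example `write_nvidia_daemon_json /tmp/daemon.json`), reviewing the result, and then copying it to /etc/docker/daemon.json with sudo avoids clobbering other daemon settings by accident.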

In fact, once nvidia-docker2 is installed, the package has already written the following to /etc/docker/daemon.json by default:

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

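You can also configure a registry mirror to speed up image pulls from within mainland China. Note that "registry-mirrors" is a top-level daemon option, not a property of the nvidia runtime:

{
    "default-runtime": "nvidia",
    "registry-mirrors": ["https://gemfield.mirror.aliyuncs.com"],
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}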

4, Restart the Docker service

sudo systemctl restart docker
# Check the status again
systemctl status docker
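
After the restart, it can be worth confirming that the daemon actually picked up the new default runtime. A minimal sketch, assuming the `DefaultRuntime` field reported by `docker info`; the function name is hypothetical:

```shell
#!/usr/bin/env bash

# default_runtime_is_nvidia: succeed when the Docker daemon reports
# "nvidia" as its default runtime. (Hypothetical helper name.)
default_runtime_is_nvidia() {
    [ "$(docker info --format '{{.DefaultRuntime}}' 2>/dev/null)" = "nvidia" ]
}
```

For example, `default_runtime_is_nvidia && echo "default runtime is nvidia"` prints the message only when the setting from daemon.json is in effect.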

2. Testing

1, When running a docker command:

docker run --runtime=nvidia --rm nvidia/cuda bash

2, Check the log output
You can use nvidia-container-cli -k -d /dev/tty info to look into the specific problem (or use the command nvidia-container-cli --debug=/dev/stdout list --compute).
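
The command in step 1 uses the `--runtime=nvidia` form that nvidia-docker2 provides. With Docker >= 19.03 and nvidia-container-toolkit, the `--gpus` flag is the equivalent. The sketch below wraps both forms and runs `nvidia-smi` inside the container as a quick check. The function name `gpu_smoke_test` and its `mode` argument are assumptions for illustration, and the image name is passed in by the caller.

```shell
#!/usr/bin/env bash

# gpu_smoke_test IMAGE [MODE]: run nvidia-smi in a throwaway container.
# MODE "runtime" uses --runtime=nvidia (nvidia-docker2);
# any other value (default "gpus") uses --gpus all (Docker >= 19.03).
# (Hypothetical helper name and arguments.)
gpu_smoke_test() {
    local image="$1"
    local mode="${2:-gpus}"
    if [ "$mode" = "runtime" ]; then
        docker run --rm --runtime=nvidia "$image" nvidia-smi
    else
        docker run --rm --gpus all "$image" nvidia-smi
    fi
}
```

For example, `gpu_smoke_test nvidia/cuda runtime` mirrors the command from step 1; substitute the image tag you actually use. If the container fails to start, the nvidia-container-cli commands from step 2 are the next place to look.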

3. Installing CUDA components inside Docker

Dockerfile

ARG OS_VERSION="7"
ARG CUDA_VERSION="11.0"
ARG CUDNN_VERSION="8.1"


ENV PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${LD_LIBRARY_PATH}
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV NVIDIA_REQUIRE_CUDA="cuda>=11.0 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441"
# cuda: download the runfile installer, install the toolkit, then remove the installer
RUN cd /tmp && \
    wget "https://developer.download.nvidia.com/compute/cuda/11.6.1/local_installers/cuda_11.6.1_510.47.03_linux.run" && \
    chmod +x cuda_11.6.1_510.47.03_linux.run && \
    ./cuda_11.6.1_510.47.03_linux.run --silent --toolkit --samples --librarypath=/usr/local/cuda-${CUDA_VERSION} && \
    ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
    rm -rf cuda_11.6.1_510.47.03_linux.run
 
RUN cd /tmp && \
    wget "https://cudnn-${CUDA_VERSION}-linux-x64-v${CUDNN_VERSION}.tgz" && \
    tar -xvf cudnn-${CUDA_VERSION}-linux-x64-v${CUDNN_VERSION}.tgz && \
    cp -r cuda/include/* /usr/include/ && \
    cp -r cuda/lib64/* /usr/lib64/ && \
    rm -rf *
ENV CUDA_VERSION="11.0"
ENV CUDNN_VERSION="8.1"