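In enterprise production environments, servers are usually barred from the public internet for security reasons, which makes setting up a GPU environment awkward. The GPU driver, docker, and nvidia-docker therefore all have to be installed offline.
Environment: RedHat 7.5 (the kernel must be 3.10 or later)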
[root@localhost ~]# cat /etc/redhat-release
RedHat Linux release 7.5.1804 (Core)
[root@localhost ~]# lshw -numeric -C display
*-display
description: VGA compatible controller
product: ASPEED Graphics Family [1A03:2000]
vendor: ASPEED Technology, Inc. [1A03]
physical id: 0
bus info: pci@0000:03:00.0
version: 41
width: 32 bits
clock: 33MHz
capabilities: pm msi vga_controller bus_master cap_list rom
configuration: driver=ast latency=0
resources: irq:17 memory:98000000-9bffffff memory:9c000000-9c01ffff ioport:2000(size=128)
*-display
description: 3D controller
product: GP104GL [Tesla P4] [10DE:1BB3]
vendor: NVIDIA Corporation [10DE]
physical id: 0
bus info: pci@0000:3b:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:3aff0-3afef iomemory:3aff0-3afef irq:315 memory:b7000000-b7ffffff memory:3affe0000000-3affefffffff memory:3afff0000000-3afff1ffffff
*-display
description: 3D controller
product: GP104GL [Tesla P4] [10DE:1BB3]
vendor: NVIDIA Corporation [10DE]
physical id: 0
bus info: pci@0000:af:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:3eff0-3efef iomemory:3eff0-3efef irq:318 memory:ed000000-edffffff memory:3effe0000000-3effefffffff memory:3efff0000000-3efff1ffffff
The output shows that this machine has two Tesla P4 cards.
Using the card model, download the matching driver, NVIDIA-Linux-x86_64-415.13.run, from https://www.nvidia.cn/Download/index.aspx?lang=cn
# First check whether the nouveau driver is loaded (any output means it has not been disabled yet)
lsmod | grep nouveau
Edit the dist-blacklist.conf file:
vim /lib/modprobe.d/dist-blacklist.conf
Comment out the blacklist nvidiafb line:
#blacklist nvidiafb
Then add the following two lines:
blacklist nouveau
options nouveau modeset=0
# Back up the current initramfs image as a .bak file
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
# Rebuild the initramfs image so the blacklist takes effect
dracut /boot/initramfs-$(uname -r).img $(uname -r)
# Set the default run level to text mode
systemctl set-default multi-user.target
# Reboot the server
reboot
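After the reboot it is worth confirming that nouveau really is gone before running the driver installer. The following is a minimal sketch, not part of the original procedure: `has_nouveau` is a hypothetical helper name that inspects `lsmod` output passed in as a string.

```shell
#!/bin/sh
# has_nouveau is a hypothetical helper (not from the original article).
# It returns 0 when the given lsmod output lists nouveau as a loaded module.
has_nouveau() {
    printf '%s\n' "$1" | grep -q '^nouveau[[:space:]]'
}

# Only query the live system when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    if has_nouveau "$(lsmod)"; then
        echo "nouveau is still loaded: recheck the blacklist and the rebuilt initramfs"
    else
        echo "nouveau is not loaded"
    fi
fi
```

The check only matches lines that start with the module name, so a module that merely depends on nouveau is not mistaken for it.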
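# Although this is an offline install, a full system installation normally already includes the two packages below; if not, download them from https://pkgs.org/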
yum -y install kernel-devel gcc
chmod u+x NVIDIA-Linux-x86_64-415.13.run
./NVIDIA-Linux-x86_64-415.13.run --kernel-source-path=/usr/src/kernels/3.10.0-862.el7.x86_64
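The kernel source path passed to the installer has to match the running kernel. As a rough sketch, the path can be derived from `uname -r` instead of being typed by hand; `kernel_source_path` is a hypothetical helper, and the sketch assumes the kernel-devel package installs its tree under /usr/src/kernels/<release>, as it does on RedHat-family systems.

```shell
#!/bin/sh
# kernel_source_path is a hypothetical helper (not from the original article).
# $1: a kernel release string such as the output of `uname -r`.
kernel_source_path() {
    printf '/usr/src/kernels/%s\n' "$1"
}

KSRC="$(kernel_source_path "$(uname -r)")"
if [ -d "$KSRC" ]; then
    echo "kernel sources found: $KSRC"
else
    echo "missing $KSRC: install the kernel-devel package matching $(uname -r)"
fi

# After the installer finishes, nvidia-smi should list the GPUs.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi failed: the driver may not be loaded"
fi
```

If the directory is missing, the installed kernel-devel version most likely differs from the running kernel, and the installer would fail to build the kernel module.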
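In an offline environment, yum usually has no docker repository, and installing with rpm tends to fail on missing packages: the dependency chain is long, so an offline rpm install rarely succeeds. The most reliable offline route is the static binary package, which can be downloaded from Index of linux/static/stable/x86_64/ (https://download.docker.com/linux/static/stable/x86_64/), for example docker-18.06.3-ce.tgz.
If a version later than this is installed, the subsequent nvidia-docker installation reports that docker-ce is missing, so it is best to choose an 18.03.x or 18.06.x release.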
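The version constraint can be checked before copying a tarball to the offline host. This is only an illustrative sketch: both function names are hypothetical, and the file-name pattern docker-<version>-ce.tgz is assumed from the file names used in this article.

```shell
#!/bin/sh
# Both helpers are hypothetical and only illustrate the version rule above.

# Extract "18.06.3" from a name such as ./docker-18.06.3-ce.tgz.
docker_version_from_tgz() {
    name="${1##*/}"         # strip any leading directory
    name="${name#docker-}"  # strip the "docker-" prefix
    printf '%s\n' "${name%-ce.tgz}"
}

# Return 0 only for the 18.03.x and 18.06.x release lines.
is_recommended_version() {
    case "$1" in
        18.03.*|18.06.*) return 0 ;;
        *) return 1 ;;
    esac
}

VERSION="$(docker_version_from_tgz ./docker-18.06.3-ce.tgz)"
if is_recommended_version "$VERSION"; then
    echo "docker $VERSION is in the recommended range"
else
    echo "docker $VERSION is outside 18.03.x / 18.06.x"
fi
```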
systemctl stop docker  # or: service docker stop
yum remove docker docker-client docker-client-latest docker-common docker-latest docker-latest-logrotate \
docker-logrotate docker-engine docker-ce docker-ce-cli containerd.io
# Use yum -y remove to uninstall anything the next command still lists
yum list installed| grep docker
rm -rf /var/lib/docker
rm -rf /var/lib/containerd
rm -rf /etc/docker
rm -rf /etc/systemd/system/docker.service
Download the installation file docker-18.06.3-ce.tgz.
Create the installation script install-docker.sh:
#!/bin/sh
usage(){
echo "Usage: $0 FILE_NAME_DOCKER_CE_TAR_GZ"
echo " $0 docker-17.09.0-ce.tgz"
echo "Get docker-ce binary from: https://download.docker.com/linux/static/stable/x86_64/"
echo "eg: wget https://download.docker.com/linux/static/stable/x86_64/docker-17.09.0-ce.tgz"
echo ""
}
SYSTEMDDIR=/usr/lib/systemd/system
SERVICEFILE=docker.service
DOCKERDIR=/usr/bin
DOCKERBIN=docker
SERVICENAME=docker
if [ $# -ne 1 ]; then
usage
exit 1
else
FILETARGZ="$1"
fi
if [ ! -f ${FILETARGZ} ]; then
echo "Docker binary tgz files does not exist, please check it"
echo "Get docker-ce binary from: https://download.docker.com/linux/static/stable/x86_64/"
echo "eg: wget https://download.docker.com/linux/static/stable/x86_64/docker-17.09.0-ce.tgz"
exit 1
fi
echo "##unzip : tar xvpf ${FILETARGZ}"
tar xvpf ${FILETARGZ}
echo
echo "##binary : ${DOCKERBIN} copy to ${DOCKERDIR}"
cp -p ${DOCKERBIN}/* ${DOCKERDIR} >/dev/null 2>&1
which ${DOCKERBIN}
echo "##systemd service: ${SERVICEFILE}"
echo "##docker.service: create docker systemd file"
cat >${SYSTEMDDIR}/${SERVICEFILE} <<EOF
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target docker.socket
[Service]
Type=notify
EnvironmentFile=-/run/flannel/docker
WorkingDirectory=/usr/local/bin
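# Images and containers are stored under /data/sys_docker (--graph).
# Note: -H tcp://0.0.0.0:4243 exposes the Docker API without TLS; drop it or firewall it on untrusted networks.
ExecStart=/usr/bin/dockerd \
-H tcp://0.0.0.0:4243 \
-H unix:///var/run/docker.sock \
--selinux-enabled=false \
--log-opt max-size=1g \
--graph=/data/sys_docker
ExecReload=/bin/kill -s HUP \$MAINPID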
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
#TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
echo ""
systemctl daemon-reload
echo "##Service status: ${SERVICENAME}"
systemctl status ${SERVICENAME}
echo "##Service restart: ${SERVICENAME}"
systemctl restart ${SERVICENAME}
echo "##Service status: ${SERVICENAME}"
systemctl status ${SERVICENAME}
echo "##Service enabled: ${SERVICENAME}"
systemctl enable ${SERVICENAME}
echo "## docker version"
docker version
Run the installation:
chmod +x install-docker.sh
./install-docker.sh ./docker-18.06.3-ce.tgz
nvidia-docker2
For convenience, the nvidia-docker package and its dependencies have already been uploaded; download nvidia-docker2 before installing.
# After extracting the archive, install with rpm
rpm -Uvh *.rpm --nodeps --force
Because nvidia-docker starts together with docker, modify or create the docker startup configuration /etc/docker/daemon.json:
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
systemctl restart docker  # or: service docker restart
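Once docker has restarted, the default runtime it reports should be nvidia. The sketch below parses the "Default Runtime:" line that `docker info` prints; `default_runtime` is a hypothetical helper, not part of nvidia-docker or docker.

```shell
#!/bin/sh
# default_runtime is a hypothetical helper (not from the original article).
# $1: the text printed by `docker info`; prints the configured default runtime.
default_runtime() {
    printf '%s\n' "$1" | sed -n 's/^[[:space:]]*Default Runtime:[[:space:]]*//p'
}

# Only query the daemon when the docker client is installed.
if command -v docker >/dev/null 2>&1; then
    RUNTIME="$(default_runtime "$(docker info 2>/dev/null)")"
    if [ "$RUNTIME" = "nvidia" ]; then
        echo "default runtime is nvidia"
    else
        echo "default runtime is '${RUNTIME}': check /etc/docker/daemon.json"
    fi
fi
```

For an end-to-end check, running nvidia-smi inside a container works too, provided a CUDA base image has already been brought onto the offline host with docker load.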