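On a k8s GPU node running Ubuntu, the NVIDIA driver stops working after the machine reboots, and nvidia-smi can no longer be used.
The driver's kernel module is compiled against the kernel packages installed on the machine at installation time. If automatic updates are enabled,
the machine downloads a new kernel whenever one is released and switches to it on the next reboot. The driver module
that was built for the old kernel cannot run on the new kernel, so the GPU becomes unusable.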
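To confirm this diagnosis on an affected node, one option is to check whether an NVIDIA kernel module exists for the kernel that is currently running. The sketch below assumes modules live under /lib/modules/<release>/ and that the module file is named nvidia.ko, possibly with a compression suffix. The function name has_nvidia_module is hypothetical.

```shell
# has_nvidia_module RELEASE [MODULES_ROOT]
# Succeeds if a file named nvidia.ko (or nvidia.ko.<suffix>) exists anywhere
# under MODULES_ROOT/RELEASE. MODULES_ROOT defaults to /lib/modules, which is
# an assumption about the node's layout.
has_nvidia_module() {
    release=$1
    root=${2:-/lib/modules}
    # find prints matching paths; grep -q . succeeds if at least one was printed.
    # Errors from a missing directory are discarded, so that case reports "absent".
    find "$root/$release" -type f -name 'nvidia.ko*' 2>/dev/null | grep -q .
}
```

A typical call would be: has_nvidia_module "$(uname -r)" || echo "no NVIDIA module for the running kernel"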
sudo apt remove unattended-upgrades
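Before removing the package, it may be worth checking whether automatic upgrades are actually switched on. The sketch below assumes the setting lives in /etc/apt/apt.conf.d/20auto-upgrades as a line of the form APT::Periodic::Unattended-Upgrade "1"; and only recognizes plain numeric values. Other files under apt.conf.d can override this, so treat the result as a hint. The function name auto_upgrade_enabled is hypothetical.

```shell
# auto_upgrade_enabled [CONF_FILE]
# Succeeds if CONF_FILE contains an uncommented Unattended-Upgrade setting
# whose quoted value is a non-zero number. CONF_FILE defaults to the path
# Ubuntu commonly uses (assumption).
auto_upgrade_enabled() {
    conf=${1:-/etc/apt/apt.conf.d/20auto-upgrades}
    [ -r "$conf" ] || return 1
    grep -Eq '^[[:space:]]*APT::Periodic::Unattended-Upgrade[[:space:]]+"[1-9][0-9]*";' "$conf"
}
```

A typical call would be: auto_upgrade_enabled && echo "unattended upgrades are enabled"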
Example (install the driver with DKMS so that the module is rebuilt when a new kernel is installed):
sudo sh NVIDIA-Linux-x86_64-515.65.01.run --dkms
If you only want to upgrade the kernel, run:
sudo apt-get install --only-upgrade linux-image-generic
reboot
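Because a new kernel only takes effect after a reboot, it can be useful to compare the running kernel with the newest one installed on disk. The sketch below assumes kernel images are stored as /boot/vmlinuz-<release>, that the bootloader picks the highest version by default, and that GNU sort with -V is available. The function name newest_installed_kernel is hypothetical.

```shell
# newest_installed_kernel [BOOT_DIR]
# Prints the highest kernel release found as BOOT_DIR/vmlinuz-<release>.
# Prints nothing if no kernel image is found. BOOT_DIR defaults to /boot.
newest_installed_kernel() {
    boot=${1:-/boot}
    for f in "$boot"/vmlinuz-*; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$f" ] || continue
        # Strip everything up to and including "vmlinuz-".
        printf '%s\n' "${f##*/vmlinuz-}"
    done | sort -V | tail -n 1
}
```

A typical call would be: [ "$(newest_installed_kernel)" = "$(uname -r)" ] || echo "the next reboot will probably switch kernels"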
If the upgrade fails with:
The following packages have unmet dependencies:
linux-generic : Depends: linux-image-generic (= 5.4.0.100.104) but 5.4.0.125.126 is to be installed
then you can install it as follows:
sudo apt-get update
sudo apt-get purge linux-generic
sudo apt-get install --reinstall linux-generic
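1. If the driver is already installed, you can uninstall it and then reinstall it with DKMS:
Uninstall:
sudo sh NVIDIA-Linux-x86_64-515.65.01.run --uninstall
Install:
sudo sh NVIDIA-Linux-x86_64-515.65.01.run --dkms
The uninstall succeeds even while programs are using the GPU, but programs that use the GPU will then fail, for example: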
sudo docker run -it --gpus=all registry.cn-hangzhou.aliyuncs.com/mkmk/all:gpu-burn-cuda11.1 "/app/gpu_burn" "10"
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ERRO[0000] error waiting for container: context canceled
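The error above says that libnvidia-ml.so.1 cannot be found. One quick way to confirm this on the host is to look for the library in the dynamic linker cache. The sketch below reads the output of ldconfig -p on standard input and assumes its usual layout, where each entry starts with the library name followed by whitespace. The function name has_nvml_library is hypothetical.

```shell
# has_nvml_library
# Reads `ldconfig -p` style output on stdin and succeeds if it lists
# libnvidia-ml.so.1 as a library name.
has_nvml_library() {
    grep -Eq '(^|[[:space:]])libnvidia-ml\.so\.1[[:space:]]'
}
```

A typical call would be: ldconfig -p | has_nvml_library || echo "libnvidia-ml.so.1 is not in the linker cache"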
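2. If the driver is already installed, you can also run the installer again with DKMS without uninstalling first:
Install:
sudo sh NVIDIA-Linux-x86_64-515.65.01.run --dkms
In this case, if a program is using the GPU, the installation cannot proceed and the installer reports: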
ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel. This may be because it is in use (for example, by an
X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for
module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no
GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error
may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
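To check whether any program is using the GPU, run nvidia-smi. If the output shows
No running processes found
then no program is using the GPU. Otherwise, stop the services that are using the GPU first.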
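The check described above can be scripted. The sketch below offers two small helpers that read command output on standard input: one looks for the "No running processes found" line from nvidia-smi, the other reads the use count of the nvidia_uvm module from lsmod, whose columns are assumed to be module name, size, and use count. Both function names are hypothetical.

```shell
# gpu_is_idle
# Reads nvidia-smi output on stdin; succeeds if it reports no running processes.
gpu_is_idle() {
    grep -q 'No running processes found'
}

# uvm_use_count
# Reads lsmod output on stdin; prints the use count (third column) of the
# nvidia_uvm module, or "absent" if the module is not loaded.
uvm_use_count() {
    awk '$1 == "nvidia_uvm" { print $3; found = 1 }
         END { if (!found) print "absent" }'
}
```

Typical calls would be: nvidia-smi | gpu_is_idle && echo "GPU is idle", and lsmod | uvm_use_count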
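3. If reinstalling after an uninstall still reports that a process is using the GPU, or that a driver is already installed, run reboot and then install again.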
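Recommended approach: disable automatic upgrades and install the GPU driver with DKMS.
Recommended installation procedure for the GPU driver:
# Disable automatic updates
sudo apt remove unattended-upgrades
# Install the GPU driver with DKMS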
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
sudo apt-get update
sudo apt-get install gcc ubuntu-make make
sudo apt-get install -y dkms
sudo sh NVIDIA-Linux-x86_64-515.65.01.run --dkms
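After the installation, it may be worth verifying that DKMS has registered the driver for the running kernel. The sketch below reads the output of dkms status on standard input and assumes lines shaped like "nvidia, 515.65.01, 5.4.0-125-generic, x86_64: installed" or "nvidia/515.65.01, 5.4.0-125-generic, x86_64: installed". The exact layout differs between dkms versions, so this is an illustration and not a complete parser. The function name dkms_has_nvidia is hypothetical.

```shell
# dkms_has_nvidia RELEASE
# Reads `dkms status` output on stdin; succeeds if an nvidia entry for the
# given kernel release is marked as installed.
dkms_has_nvidia() {
    release=$1
    # Keep nvidia entries, then entries for exactly this release,
    # then require the "installed" state.
    grep -E '^nvidia[,/]' | grep -F -- ", $release," | grep -q 'installed'
}
```

A typical call would be: dkms status | dkms_has_nvidia "$(uname -r)" || echo "nvidia is not installed in DKMS for this kernel"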