[Blog 509] Keeping the GPU driver working across kernel upgrades on k8s GPU nodes

壤驷康裕
2023-12-01


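Scenario:

On a k8s GPU node running Ubuntu, the NVIDIA driver stops working after a reboot and nvidia-smi can no longer be used.

Cause:

The driver's kernel module is compiled against the kernel that is running when the driver is installed. If automatic updates are enabled,
the machine downloads a new kernel as soon as one is published and boots into it on the next restart. The module built for
the old kernel cannot be loaded by the new one, so the GPU becomes unusable.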
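
The mismatch described above can be spotted before it bites. The following is a minimal sketch, assuming kernel modules live under /lib/modules/<release> (the usual Ubuntu layout) and that the module file is named nvidia.ko, possibly with a compression suffix. nvidia_module_present is a hypothetical helper name, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Sketch: report whether an NVIDIA kernel module exists for a given kernel release.
# nvidia_module_present is a hypothetical helper, not an NVIDIA-provided command.
nvidia_module_present() {
    kver="$1"
    modroot="${2:-/lib/modules}"   # optional second argument: alternative module tree
    # Match nvidia.ko as well as compressed variants such as nvidia.ko.zst
    find "$modroot/$kver" -name 'nvidia.ko*' 2>/dev/null | grep -q .
}

running="$(uname -r)"
if nvidia_module_present "$running"; then
    echo "nvidia module found for kernel $running"
else
    echo "no nvidia module for kernel $running: the driver must be rebuilt"
fi
```

Running the same function with the release of a freshly installed kernel, before rebooting, shows whether that kernel already has a module built for it.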

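Solutions:

1. Disable automatic upgrades

sudo apt remove unattended-upgrades

2. If the kernel really does need upgrading, rebuild the driver after the upgrade

Example:

sudo sh NVIDIA-Linux-x86_64-515.65.01.run --dkms

3. Install the driver with DKMS from the start (recommended), so that the module is rebuilt automatically for each newly installed kernel

sudo sh NVIDIA-Linux-x86_64-515.65.01.run --dkms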
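
Whether the DKMS route actually took effect can be read from `dkms status`. The sketch below is an illustration only: it assumes the status lines look like either `nvidia, 515.65.01, <kernel>, x86_64: installed` (older dkms) or `nvidia/515.65.01, <kernel>, x86_64: installed` (newer dkms), and dkms_nvidia_installed is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: decide from `dkms status` text on stdin whether the nvidia module
# is installed for the given kernel release.
# dkms_nvidia_installed is a hypothetical helper; the line formats are assumptions.
dkms_nvidia_installed() {
    kver="$1"
    # Keep nvidia lines, then lines naming this kernel, then require "installed"
    grep -E '^nvidia[,/]' | grep -F -- ", $kver," | grep -q 'installed'
}

# Usage on a real node (requires dkms):
#   dkms status | dkms_nvidia_installed "$(uname -r)" && echo "DKMS build present"
```

A module that is only listed as added or built, but not installed, makes the function return non-zero.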

How to upgrade the kernel

If you only want to upgrade the kernel, run:

sudo apt-get install --only-upgrade linux-image-generic
sudo reboot
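
Before rebooting it can help to see which kernel images are installed. This is a rough sketch, assuming Ubuntu's /boot/vmlinuz-<release> naming and GNU `sort -V`. newest_installed_kernel is a hypothetical helper name, and the newest image is not necessarily the one the boot loader is configured to start.

```shell
#!/bin/sh
# Sketch: print the newest kernel release that has an image in the boot directory.
# newest_installed_kernel is a hypothetical helper name.
newest_installed_kernel() {
    bootdir="${1:-/boot}"
    # Strip the vmlinuz- prefix, order as version strings, keep the highest
    ls "$bootdir" 2>/dev/null | sed -n 's/^vmlinuz-//p' | sort -V | tail -n 1
}

newest="$(newest_installed_kernel)"
if [ -n "$newest" ] && [ "$newest" != "$(uname -r)" ]; then
    echo "running $(uname -r), newest installed image is $newest"
fi
```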

If the upgrade fails with:

The following packages have unmet dependencies:
linux-generic : Depends: linux-image-generic (= 5.4.0.100.104) but 5.4.0.125.126 is to be installed

reinstall the linux-generic metapackage:
sudo apt-get update
sudo apt-get purge linux-generic
sudo apt-get install --reinstall linux-generic

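How to switch an existing machine's driver to a DKMS install

1. On a machine that already has the driver installed, uninstall it and then reinstall it with DKMS:

Uninstall:
sudo sh NVIDIA-Linux-x86_64-515.65.01.run --uninstall
Install:
sudo sh NVIDIA-Linux-x86_64-515.65.01.run --dkms

The uninstall goes through even while programs are using the GPU, but anything that then tries to use the GPU fails, for example: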
sudo docker run -it  --gpus=all  registry.cn-hangzhou.aliyuncs.com/mkmk/all:gpu-burn-cuda11.1  "/app/gpu_burn"  "10"
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ERRO[0000] error waiting for container: context canceled

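2. On a machine that already has the driver installed, you can also run the DKMS install directly on top of it:

Install:
sudo sh NVIDIA-Linux-x86_64-515.65.01.run --dkms

In this case the install cannot proceed while a program is using the GPU, and the installer reports: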

ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an
         X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for
         module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no
         GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error
         may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.

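To check whether any program is using the GPU, run nvidia-smi. If the process table shows

No running processes found

then nothing is using the GPU. Otherwise, stop the services that are using it first.

3. If a reinstall after uninstalling still reports that a process is using the GPU, or that a driver is already installed, reboot and then install again.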
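
The check in step 2 can also be scripted instead of reading the nvidia-smi table by eye. A minimal sketch, assuming the CSV query output has the form `<pid>, <process name>` with one process per line. gpu_compute_pids is a hypothetical helper name, and the query lists compute processes only, so a graphics client such as an X server would not show up.

```shell
#!/bin/sh
# Sketch: extract the PIDs from nvidia-smi's CSV process listing read on stdin.
# gpu_compute_pids is a hypothetical helper name.
gpu_compute_pids() {
    # The first comma-separated field is the PID; skip anything non-numeric
    awk -F',' '$1 ~ /^[0-9]+$/ { print $1 }'
}

# Usage on a GPU node (prints nothing when no compute process is attached):
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | gpu_compute_pids
```

An empty result suggests that no compute workload is holding the GPU, so the reinstall should not be blocked by one.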

Summary:

Recommended practice: disable automatic upgrades and install the driver with DKMS.

Recommended installation procedure:

# Disable automatic updates
sudo apt remove unattended-upgrades

# Install the driver with DKMS
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
sudo apt-get update
sudo apt-get install gcc ubuntu-make make
sudo apt-get install -y dkms
sudo sh NVIDIA-Linux-x86_64-515.65.01.run --dkms
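
A small preflight check before running the installer can catch missing build tools early. This is a sketch only: missing_tools is a hypothetical helper name, and the kernel headers check at /lib/modules/<release>/build is an added assumption not covered in the steps above.

```shell
#!/bin/sh
# Sketch: print each named command that cannot be found on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing="$(missing_tools gcc make dkms)"
# $missing is left unquoted on purpose so the names print on one line
[ -z "$missing" ] || echo "install first:" $missing

# Assumption: building the module also needs headers for the running kernel,
# which Ubuntu exposes at /lib/modules/<release>/build.
[ -e "/lib/modules/$(uname -r)/build" ] || echo "kernel headers for $(uname -r) not found"
```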