nvidia-smi version mismatch

# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.6 LTS
Release:	20.04
Codename:	focal
# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
# Check which NVIDIA kernel modules are loaded
lsmod | grep nvidia

Check the driver version loaded by the kernel

# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  525.89.02  Wed Feb  1 23:23:25 UTC 2023
GCC version:  gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

Check the DKMS status

# dkms status
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/srv-460.106.00/source/dkms.conf does not exist.
  • The NVIDIA driver versions on this Ubuntu 20.04 host do not match
  • The running kernel has loaded NVIDIA driver version 525.89.02
  • The DKMS configuration still references version 460.106.00
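The NVML error means the loaded kernel module and the user-space library libnvidia-ml report different versions. A minimal sketch for printing both side by side follows. The helper names nvrm_version and nvml_lib_version are made up for illustration, and the script assumes that the NVRM line has the "Kernel Module  <version>" layout shown above and that the real library file is named libnvidia-ml.so.<version>, which may not hold for every driver packaging.

```shell
#!/bin/sh
# Hypothetical helper: read the driver version from NVRM version text on stdin.
# Assumes the "Kernel Module  <version>" layout seen in /proc/driver/nvidia/version.
nvrm_version() {
    sed -n 's/.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Hypothetical helper: resolve libnvidia-ml.so.1 through the linker cache and
# take the version from the real file name (assumed: libnvidia-ml.so.<version>).
nvml_lib_version() {
    path=$(ldconfig -p 2>/dev/null | awk '/libnvidia-ml\.so\.1/ {print $NF; exit}')
    [ -n "$path" ] || return 0
    real=$(readlink -f "$path")
    printf '%s\n' "${real##*libnvidia-ml.so.}"
}

kernel_ver=""
if [ -r /proc/driver/nvidia/version ]; then
    kernel_ver=$(nvrm_version < /proc/driver/nvidia/version)
fi
lib_ver=$(nvml_lib_version)

echo "kernel module version: ${kernel_ver:-not loaded}"
echo "NVML library version:  ${lib_ver:-not found}"
if [ -n "$kernel_ver" ] && [ -n "$lib_ver" ] && [ "$kernel_ver" != "$lib_ver" ]; then
    echo "versions differ"
fi
```

If the two printed versions differ, that is consistent with the "Driver/library version mismatch" message from nvidia-smi.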
# Remove the stale DKMS configuration
sudo dkms remove nvidia/srv-460.106.00 --all

# If the command above fails, delete the directory manually
sudo rm -rf /var/lib/dkms/nvidia/srv-460.106.00
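
Before deleting anything by hand, it can help to list exactly which DKMS entries are broken. The sketch below only prints candidates and removes nothing. list_broken_dkms is a hypothetical helper, and it assumes the usual tree layout of <root>/<module>/<version>/source/dkms.conf, with kernel-* entries inside a module directory being symlinks that should be skipped.

```shell
#!/bin/sh
# Hypothetical helper: print every <root>/<module>/<version> directory
# whose source/dkms.conf is missing. Nothing is deleted.
list_broken_dkms() {
    root=${1:-/var/lib/dkms}
    for dir in "$root"/*/*/; do
        [ -d "$dir" ] || continue            # glob matched nothing
        dir=${dir%/}
        case ${dir##*/} in
            kernel-*) continue ;;            # assumed to be per-kernel symlinks
        esac
        [ -f "$dir/source/dkms.conf" ] || echo "$dir"
    done
}

list_broken_dkms
```

On the host above this would be expected to print /var/lib/dkms/nvidia/srv-460.106.00, the entry that dkms status complains about.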
# Install the driver and the CUDA 12.6 toolkit from the runfile installer
wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux.run
sudo sh cuda_12.6.0_560.28.03_linux.run

# bash cuda_12.6.0_560.28.03_linux.run
===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-12.6/

Please make sure that
 -   PATH includes /usr/local/cuda-12.6/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.6/lib64, or, add /usr/local/cuda-12.6/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.6/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log
# CUDA
export PATH=/usr/local/cuda-12.6/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda-12.6
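
The ${VAR:+:${VAR}} form in the exports appends ":<old value>" only when the variable is already set and non-empty, so an empty LD_LIBRARY_PATH does not leave a trailing colon. The sketch below demonstrates the expansion and shows one possible way to persist the exports without duplicating them. prepend_dir and append_once are hypothetical helpers, and ~/.bashrc as the target file is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: prepend DIR to a colon-separated LIST using the
# same ${VAR:+:${VAR}} expansion as the export lines.
prepend_dir() {
    dir=$1
    list=$2
    printf '%s\n' "${dir}${list:+:${list}}"
}

prepend_dir /usr/local/cuda-12.6/lib64 ""          # prints /usr/local/cuda-12.6/lib64
prepend_dir /usr/local/cuda-12.6/bin /usr/bin      # prints /usr/local/cuda-12.6/bin:/usr/bin

# Hypothetical helper: append LINE to FILE unless an identical line exists.
append_once() {
    file=$1
    line=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example use, left commented out because it edits ~/.bashrc (assumed target):
# append_once "$HOME/.bashrc" 'export PATH=/usr/local/cuda-12.6/bin${PATH:+:${PATH}}'
# append_once "$HOME/.bashrc" 'export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}'
# append_once "$HOME/.bashrc" 'export CUDA_HOME=/usr/local/cuda-12.6'
```

The single quotes keep the expansions literal, so they are evaluated each time the shell starts rather than once when the line is written.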
# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Fri_Jun_14_16:34:21_PDT_2024
Cuda compilation tools, release 12.6, V12.6.20
Build cuda_12.6.r12.6/compiler.34431801_0
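
As a final check, the toolkit release and the driver version can be read back in one pass. This is a rough sketch. nvcc_release is a hypothetical helper that assumes the "release X.Y," wording shown in the nvcc output above, and the nvidia-smi query only succeeds once the driver and library versions agree again.

```shell
#!/bin/sh
# Hypothetical helper: read the toolkit release from `nvcc --version` text on stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

if command -v nvcc >/dev/null 2>&1; then
    echo "nvcc release: $(nvcc --version | nvcc_release)"
else
    echo "nvcc not found on PATH"
fi

if command -v nvidia-smi >/dev/null 2>&1; then
    # Prints one driver version per GPU when NVML initializes correctly.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader \
        || echo "nvidia-smi still fails"
else
    echo "nvidia-smi not found on PATH"
fi
```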