This blog post summarizes common CUDA errors and their solutions~
1. RuntimeError
1.1. RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Error analysis:
The program and code were fine, little GPU memory was in use, and free memory was sufficient, yet the error still occurred. The GPU may be occupied by another process,
or the problem may be cached state left over from a previous training run. Since the job runs inside a Docker container, stop the container first and then restart it~
1.2. RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
Possible causes:
Mismatched PyTorch and CUDA versions
Insufficient GPU memory
Test code referenced from other blogs:
# True: the convolution algorithm selected on each run is deterministic (the default algorithm).
torch.backends.cudnn.deterministic = True
# Spend extra time at startup searching for the fastest convolution
# implementation for each convolution layer, which speeds up the network.
torch.backends.cudnn.benchmark = True
Final solution:
Set num_workers to 0 in the DataLoader.
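A minimal sketch of the fix (the dataset here is a made-up toy tensor): passing num_workers=0 makes the DataLoader load data in the main process instead of spawning worker subprocesses, which sidesteps the cuDNN/worker issues described above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for your real one (assumption for illustration).
dataset = TensorDataset(torch.arange(8).float().unsqueeze(1))

# num_workers=0: all batches are produced in the main process.
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)
batches = [b[0] for b in loader]
```

The trade-off is slower data loading on large datasets, since no parallel workers are used.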
1.3. RuntimeError: CUDA out of memory
①RuntimeError: CUDA out of memory. Tried to allocate 152.00 MiB (GPU 0; 23.65 GiB total capacity; 13.81 GiB already allocated; 118.44 MiB free; 14.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The GPU ran out of memory even though local GPU resources should be sufficient. During PyTorch training, the forward pass and backpropagation (activations, gradients, optimizer state) occupy a large amount of GPU memory, so the batch size needs to be reduced.
Solution:
Reduce the batch size, i.e. the number of samples per training step.
Release cached GPU memory: torch.cuda.empty_cache()
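The two remedies above can be combined into a simple retry loop: on an out-of-memory error, halve the batch size and try again. This is a hedged sketch with no PyTorch dependency; `run_step` is a hypothetical stand-in for your actual forward/backward pass, and in real code you would also call torch.cuda.empty_cache() before retrying.

```python
def run_with_oom_fallback(run_step, batch_size, min_batch=1):
    """Call run_step(batch_size); on a CUDA out-of-memory RuntimeError,
    halve the batch size and retry until it fits or min_batch is passed."""
    while batch_size >= min_batch:
        try:
            return run_step(batch_size)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # not an OOM error: re-raise unchanged
            # Real code: torch.cuda.empty_cache() here before retrying.
            batch_size //= 2
    raise RuntimeError("could not fit even the minimum batch size")


# Usage with a fake step that only fits batches of 8 or fewer:
def fake_step(bs):
    if bs > 8:
        raise RuntimeError("CUDA out of memory")
    return bs

fitted = run_with_oom_fallback(fake_step, 32)  # → 8
```

Note that torch.cuda.empty_cache() only returns cached blocks to the driver; it cannot free tensors that are still referenced.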
②torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.65 GiB total capacity; 22.73 GiB already allocated; 116.56 MiB free; 22.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Error analysis:
During deep learning model training, the code does not release GPU memory after each run, so leftover processes keep it occupied.
Solution:
Check GPU usage with nvidia-smi.
At this point no program appears to be running, yet GPU memory is still occupied, as shown in the figure.
Use fuser to query:
fuser -v /dev/nvidia*
(Optional) If the above command reports that fuser is not found, install it:
apt-get install psmisc
If "Unable to locate package XXX" appears, run first:
apt-get update
Force-kill (-9) the process with the following command:
kill -9 PID
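The query-and-kill steps above can be scripted. A hedged sketch: extract the PIDs from `fuser -v /dev/nvidia*` output and build the corresponding kill commands. The sample output below is made up for illustration; real fuser output varies by system.

```python
import re

# Hypothetical sample of `fuser -v /dev/nvidia*` output (assumption).
sample = """\
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root       3019 F...m  python
                     root       3051 F...m  python
"""

def extract_pids(fuser_output):
    """Pull process IDs out of fuser -v output lines (PID column
    followed by the ACCESS field, which starts with 'F' for mmap'd files)."""
    pids = []
    for line in fuser_output.splitlines():
        m = re.search(r"\s(\d+)\s+F", line)
        if m:
            pids.append(int(m.group(1)))
    return pids

kill_cmds = [f"kill -9 {pid}" for pid in extract_pids(sample)]
```

Double-check the PIDs belong to your stale training job before killing them.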
Example diagram
This releases the GPU memory~