PyTorch - Multi-GPU Training of Large Models: "CUDA error: an illegal memory access was encountered"

Author's CSDN blog: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/133640212


Error log:

# ...
  File "lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in fit
    self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "lib/python3.7/site-packages/pytorch_lightning/trainer/call.py", line 63, in _call_and_handle_interrupt
    trainer._teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1121, in _teardown
    self.strategy.teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/strategies/horovod.py", line 241, in teardown
    super().teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/strategies/parallel.py", line 114, in teardown
    super().teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 499, in teardown
    self.accelerator.teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/accelerators/cuda.py", line 76, in teardown
    torch.cuda.empty_cache()
  File "lib/python3.7/site-packages/torch/cuda/memory.py", line 125, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
# ...

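Key error: "CUDA error: an illegal memory access was encountered". As the log itself warns, CUDA kernel errors can be reported asynchronously, so the torch.cuda.empty_cache() call in the teardown path may only be where the error surfaced, not where it originated.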
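
The log suggests passing CUDA_LAUNCH_BLOCKING=1 when debugging. A minimal sketch of one way to do that from Python follows, assuming the lines sit at the very top of the training entry script, before torch is imported. The helper name launch_blocking_enabled is hypothetical and only used for illustration.

```python
import os

# Assumption: this runs before torch is imported, so the flag is already
# visible when the CUDA runtime initializes.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def launch_blocking_enabled():
    """Return True if synchronous kernel launches were requested (hypothetical helper)."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


print("CUDA_LAUNCH_BLOCKING enabled:", launch_blocking_enabled())
```

With synchronous launches the traceback is more likely to point at the operation that actually failed, at the cost of slower training, so this is a setting for debugging runs only. Setting the variable in the shell that launches the job should have the same effect.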

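Cause: in this case, GPU memory was exhausted. Lowering the configuration parameters that drive GPU memory usage, such as the size of the input features, resolved it.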
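
To get a feel for how much the input feature size matters, the following rough sketch estimates the memory of a single dense tensor. The shapes are hypothetical and not taken from the original model, and the estimate ignores activations, gradients, and optimizer state, which usually dominate in real training.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense tensor (4 bytes per element = float32)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Hypothetical feature shapes: (batch, length, length, channels).
full = tensor_bytes((1, 512, 512, 128))
half = tensor_bytes((1, 256, 256, 128))

print(f"full size: {full / 2**20:.0f} MiB")
print(f"half size: {half / 2**20:.0f} MiB")
```

For a feature that grows with the square of a length dimension, halving that length cuts the tensor to a quarter of its size, which is why a modest reduction in input size can be enough to bring memory back under the limit.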

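Watching GPU memory usage in W&B (Weights & Biases) also helps catch the problem early. For example, when GPU memory usage sits at 100%, a memory overflow is likely:
[Figure: high GPU memory usage]
Normal usage, at 83%:
[Figure: GPU memory usage]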
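
The same check can be sketched in code, as a complement to a dashboard. The example below assumes a PyTorch version that provides torch.cuda.mem_get_info; the function names and the 0.95 threshold are illustrative choices, not part of the original setup. It returns None when torch or a GPU is not available.

```python
def used_fraction(free_bytes, total_bytes):
    """Fraction of device memory currently in use."""
    return 1.0 - free_bytes / total_bytes


def report_gpu_memory(threshold=0.95):
    """Print per-GPU memory usage and return {device_index: used_fraction}."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available() or not hasattr(torch.cuda, "mem_get_info"):
        return None

    usage = {}
    for index in range(torch.cuda.device_count()):
        # mem_get_info reports device-wide (free, total) bytes,
        # including memory held by other processes.
        free_bytes, total_bytes = torch.cuda.mem_get_info(index)
        usage[index] = used_fraction(free_bytes, total_bytes)
        flag = "  <-- close to the limit" if usage[index] >= threshold else ""
        print(f"GPU {index}: {usage[index]:.0%} used{flag}")
    return usage


report_gpu_memory()
```

Calling report_gpu_memory() every few hundred training steps gives an early warning in the job log when usage approaches 100%.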

Reposted from blog.csdn.net/u012515223/article/details/133640212