Pitfalls encountered in DDP distributed training

UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero

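The cause is that GPU memory is still occupied, so the next launch reports that no GPU is available. The fix is to unload and then reload the nvidia_uvm kernel module, which releases that memory: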

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
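
rmmod refuses to unload a module that is still in use, so it can help to look at the module's use count before and after the two commands above. The following is only a minimal sketch: the helper name module_use_count is hypothetical, and it assumes a Linux system where /proc/modules lists one module per line as "name size use_count dependencies state address".

```python
from pathlib import Path


def module_use_count(name, modules_text):
    """Return the use count of a kernel module, or None if it is not loaded.

    modules_text is the content of /proc/modules, where each line has the
    form: name size use_count dependencies state address
    """
    for line in modules_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == name:
            return int(fields[2])
    return None


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    # On systems without /proc/modules, treat the module list as empty.
    text = proc_modules.read_text() if proc_modules.exists() else ""
    count = module_use_count("nvidia_uvm", text)
    if count is None:
        print("nvidia_uvm is not loaded")
    else:
        print(f"nvidia_uvm is loaded, use count = {count}")
```

If the use count is not zero, some process is probably still holding the GPU, and that process has to exit before rmmod can succeed.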

[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:12355 (errno: 98 - Address already in use).

A process from a previous DDP run was never released and is still holding the port, so it has to be killed manually.

Run nvidia-smi to see which process it is, and note its PID.

Then kill it:

kill -9 36132 # replace 36132 with the PID found above
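
If the leftover process cannot be killed right away, another way around the "Address already in use" error is to check the port before launching and fall back to a free one. This is a minimal sketch, not the original author's method: the helper names port_is_free and find_free_port are hypothetical, and it assumes the training script reads its rendezvous port from the MASTER_PORT environment variable.

```python
import os
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically errno 98 (EADDRINUSE) on Linux.
            return False
        return True


def find_free_port(host=""):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    port = 12355  # the port from the warning above
    if not port_is_free(port):
        port = find_free_port()
    # Assumption: the DDP setup code picks the port up from MASTER_PORT.
    os.environ["MASTER_PORT"] = str(port)
    print(f"MASTER_PORT={port}")
```

The port should be chosen once in the launcher, before the worker processes are spawned, so that every rank sees the same value. Another process could still grab the port between the check and the actual bind, so this lowers the chance of a collision rather than ruling it out.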


Reposted from blog.csdn.net/weixin_44506963/article/details/142671425