UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero
Cause: GPU memory is still occupied, so a new run reports that no GPU is available. Unload and reload the nvidia_uvm kernel module to release it:
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
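The two commands above can be wrapped so that the reload stops early when the unload fails. This is only a sketch: `reload_uvm` is a hypothetical helper name, and it assumes `sudo`, `rmmod`, `modprobe` and `lsmod` are available on the machine.

```shell
# reload_uvm is a hypothetical helper name; it wraps the two commands above.
reload_uvm() {
    # rmmod refuses to unload a module that is still in use, so stop here
    # rather than running modprobe against a module that never unloaded.
    if ! sudo rmmod nvidia_uvm; then
        echo "rmmod nvidia_uvm failed (module busy or not loaded)" >&2
        return 1
    fi
    sudo modprobe nvidia_uvm || return 1
    # Succeeds only if the module shows up in the loaded-module list again.
    lsmod | grep -q '^nvidia_uvm'
}
```

Calling `reload_uvm` and then starting the training script again should show whether the warning is gone. If the unload step fails, some process is probably still holding the GPU and has to be stopped first.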
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:12355 (errno: 98 - Address already in use).
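The DDP processes from the previous run were not released and still hold the port, so they have to be killed manually.
Run nvidia-smi to see which processes they are and note the PID, then: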
kill -9 36132 # replace 36132 with the PID noted from nvidia-smi
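A slightly gentler variant of the kill step, as a sketch: send SIGTERM first and fall back to SIGKILL only if the process is still alive after a few seconds. `stop_pid` is a hypothetical helper name and the 5-second wait is an arbitrary assumption.

```shell
# stop_pid is a hypothetical helper: SIGTERM first, SIGKILL as a last resort.
stop_pid() {
    pid="$1"
    # kill -0 sends no signal; it only checks that the process can be signalled.
    kill -0 "$pid" 2>/dev/null || return 0
    kill -TERM "$pid" 2>/dev/null
    i=0
    while [ "$i" -lt 5 ]; do        # wait up to about 5 seconds (assumed value)
        sleep 1
        kill -0 "$pid" 2>/dev/null || return 0
        i=$((i + 1))
    done
    kill -KILL "$pid" 2>/dev/null   # same effect as kill -9
}

# Ways to find the PID (not run here):
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
#   lsof -i :12355
# Then: stop_pid 36132
```

The port lookup with `lsof` assumes that tool is installed. It can help when the stale process no longer appears in the nvidia-smi process list but still holds port 12355.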