[Distributed multi-GPU training issue]: error: unrecognized arguments / Error initializing torch.distributed using env:// rendezvous

Problem description

During distributed training you may run into the following errors:

error: unrecognized arguments: --local-rank=2
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK

Cause analysis:


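Pay attention to how the local rank argument is spelled. torch 1.x and torch 2.x use different spellings, and because of this small detail the flag passed by the launcher is one the script never defined:

--local-rank  # torch 2.x
--local_rank  # torch 1.x

For the same reason, launching with torch.distributed.launch keeps failing until the script accepts the spelling the launcher uses.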
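The mismatch can be reproduced with argparse alone, without torch. The following is a minimal sketch; the helper name `build_parser` and the use of `parse_known_args` (which collects unrecognized flags instead of exiting) are illustrative choices, not part of the original training script.

```python
import argparse


def build_parser(option_strings):
    """Build a parser that knows only the given spellings of the local-rank flag.

    Hypothetical helper for illustration only.
    """
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument(*option_strings, type=int, default=0, dest="local_rank")
    return parser


# A script that registers only the underscore spelling.
underscore_only = build_parser(["--local_rank"])

# The launcher passes the hyphen spelling. argparse does not treat '-' and '_'
# as interchangeable in option names, so the flag is left unrecognized and the
# default value is kept. With parse_args() this would abort with
# "error: unrecognized arguments: --local-rank=2".
args_old, unknown_old = underscore_only.parse_known_args(["--local-rank=2"])

# Registering both spellings on one argument accepts either convention.
both_spellings = build_parser(["--local_rank", "--local-rank"])
args_new, unknown_new = both_spellings.parse_known_args(["--local-rank=2"])
```

Here `unknown_old` holds the rejected flag, while `args_new.local_rank` receives the value once both spellings are registered.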

Solution:


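1: Define the argument as follows, adding --local-rank alongside --local_rank:

    import argparse
    import os

    parser = argparse.ArgumentParser()
    # torch 1.x passes --local_rank; torch 2.x passes --local-rank
    parser.add_argument('--local_rank', type=int, default=0)
    parser.add_argument('--local-rank', type=int, default=0)
    args = parser.parse_args()
    # Both flags store into args.local_rank; export it if LOCAL_RANK is unset
    if 'LOCAL_RANK' not in os.environ:
        os.environ['LOCAL_RANK'] = str(args.local_rank)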

Then launch with the command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train.py

2: On the command line, replace torch.distributed.launch with torch.distributed.run:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 train.py
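With torch.distributed.run, each worker gets its rank information through environment variables (LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) rather than through a command-line flag, so the script can read the local rank from the environment. The sketch below shows one way to do that in train.py; the helper names `resolve_local_rank` and `missing_rendezvous_vars` are hypothetical and are not torch APIs.

```python
import os

# Variables the env:// rendezvous reads when they are not passed explicitly.
RENDEZVOUS_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def resolve_local_rank(cli_value=None, environ=None):
    """Return the local rank, preferring the LOCAL_RANK environment variable.

    Hypothetical helper: falls back to a value parsed from the command line
    (for launchers that pass a flag), and finally to 0 for single-process runs.
    """
    env = os.environ if environ is None else environ
    if "LOCAL_RANK" in env:
        return int(env["LOCAL_RANK"])
    if cli_value is not None:
        return int(cli_value)
    return 0


def missing_rendezvous_vars(environ=None):
    """List the env:// variables that are not set (hypothetical helper).

    A non-empty result before init_process_group(init_method="env://") usually
    means the script was started directly instead of through a launcher.
    """
    env = os.environ if environ is None else environ
    return [name for name in RENDEZVOUS_VARS if name not in env]
```

Calling `missing_rendezvous_vars()` at startup and printing the result is a cheap way to tell whether the "environment variable RANK" error comes from how the script was launched.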

That is all.
