RuntimeError: CUDA error: an illegal instruction was encountered

A PyTorch training run that had been going fine suddenly crashed:

Traceback (most recent call last):
  File "main_multi_model_test.py", line 147, in <module>
    main()
  File "main_multi_model_test.py", line 119, in main
    train_loss, train_acc, train_bacc = train(model, optimizer, train_loader, criterions, taskNum, )
  File "../train.py", line 65, in train
    optimizer.step()
  File "/home/user1/miniconda3/lib/python3.7/site-packages/torch/optim/adam.py", line 93, in step
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: CUDA error: an illegal instruction was encountered
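
A note on reading this traceback: CUDA kernels are launched asynchronously, so the Python line reported (here `optimizer.step()` inside `adam.py`) is not necessarily where the fault actually occurred. One common way to narrow it down is to rerun with the environment variable `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous. Below is a minimal sketch of such a rerun; the helper names `blocking_env` and `rerun` are hypothetical, and the command line in the usage comment is an assumption pieced together from the log (the original post does not show the full command).

```python
import os
import subprocess
import sys


def blocking_env():
    """Return a copy of the current environment with synchronous CUDA launches enabled."""
    env = dict(os.environ)
    # Must be set before the child process initializes CUDA.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun(script_args):
    """Run a Python command in a child process with the debug environment; return its exit code."""
    completed = subprocess.run([sys.executable, *script_args], env=blocking_env())
    return completed.returncode


# Hypothetical usage (arguments assumed from the log, not confirmed by the post):
# rerun(["main_multi_model_test.py", "-lr", "0.0001", "-gpu", "2"])
```

Running in blocking mode is slower, so it is probably only worth doing for the debugging run. If the traceback then points to a different line, that line is the more likely culprit.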

Here is the training code:

def train(model, optimizer, train_loader, criterions, taskNum, ):
    model.train()
    # taskNum = 40
    taskAttrNum = 40 // taskNum  # number of attributes predicted per task
    train_loss, corrects, tns, tps, Nns, Nps = [0] * taskNum, [0] * taskNum, [0] * taskNum, [0] * taskNum, [0] * taskNum, [0] * taskNum

    samplesNum = 0
    for i, (inputs, labels) in enumerate(train_loader):
        inputs = inputs.cuda(non_blocking=True)  
        labels = labels.cuda(non_blocking=True)  

        outputs = model(inputs)  
        batch_size = labels.size(0)

        loss, correct, tn, tp, Nn, Np = acc_bacc(criterions, outputs, labels, taskNum)

        loss_sum = sum(loss)
        optimizer.zero_grad()
        loss_sum.backward()
        optimizer.step()

        # Detach before accumulating so the running totals do not keep autograd history alive
        train_loss = [train_loss[i]+loss[i].detach() for i in range(taskNum)]
        corrects = [corrects[i]+correct[i] for i in range(taskNum)]

        tns = [tns[i]+tn[i] for i in range(taskNum)]
        tps = [tps[i]+tp[i] for i in range(taskNum)]
        Nns = [Nns[i]+Nn[i] for i in range(taskNum)]
        Nps = [Nps[i]+Np[i] for i in range(taskNum)]

        samplesNum += batch_size
        acc = [100 * corrects[i] / (samplesNum * taskAttrNum) for i in range(taskNum)]
        bacc = list_bacc(tns, tps, Nns, Nps)

        assert len(train_loss) == len(acc) == len(bacc) == taskNum

        loss_avg = sum(train_loss)/ (samplesNum*taskNum)
        acc_avg = sum(acc) / len(acc)
        bacc_avg = sum(bacc) / len(bacc)
    return loss_avg, acc_avg, bacc_avg

The training settings were:

python3.7 
-lr 0.0001
-gpu 2
Model:Arcface, TaskNum:40, Bits/Attr:1
ir_se_50 model generated
Loaded weights /home/user1/Downloads/model_ir_se50.pth
Use CelebA+Lfw-a train set weights
Using image size: 112
Using Mixed CelebA train + 80% Lfw-a train set
Using Mixed CelebA Val + 20% Lfw-a Train as Validation set
Save in ckpt/0716114937_Arcface_t40_bs128_lr0.0001

Oddly, the same code does not crash when run with other parameter settings…

Possible fix:
1. Switch to a different machine
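
Before switching machines, it may help to record the state of the GPU in question, since this kind of error is sometimes attributed to hardware or driver problems (that is an assumption here, not something the run above confirms). The sketch below queries `nvidia-smi` for a few basic fields; the helper name `gpu_snapshot` is hypothetical, and the GPU index 2 in the usage comment comes from the `-gpu 2` setting in the log.

```python
import shutil
import subprocess

# Fields passed to nvidia-smi's --query-gpu option.
QUERY_FIELDS = "name,driver_version,temperature.gpu,memory.used,memory.total"


def gpu_snapshot(gpu_index):
    """Return one CSV line describing the GPU, or None if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        [
            "nvidia-smi",
            "-i", str(gpu_index),
            f"--query-gpu={QUERY_FIELDS}",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


# Hypothetical usage: print(gpu_snapshot(2))
```

Comparing the snapshot from the crashing machine with one from a machine where the same settings run cleanly can hint at whether the driver version or the card itself differs.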


Reposted from blog.csdn.net/qxqxqzzz/article/details/107442690