[Notes] Fixing "CUDA Error: no kernel image is available for execution on the device" when learning ChatGLM2

While learning ChatGLM2 with chatglm2-6b-int4, I ran into the following error when using model.half().cuda():

        CUDA Error: no kernel image is available for execution on the device

If you just want to get the model running and do not mind the speed, you can try the simple workaround below:

1. Load the model through a local copy of the model code, so that the code is easy to modify
 

from chatglm2_6b_int4.configuration_chatglm import *
from chatglm2_6b_int4.modeling_chatglm import *
from chatglm2_6b_int4.tokenization_chatglm import *
from chatglm2_6b_int4.quantization import *

tokenizer = ChatGLMTokenizer.from_pretrained("chatglm2_6b_int4/")
model = ChatGLMForConditionalGeneration.from_pretrained("chatglm2_6b_int4/").half().cuda()

2. Modify the extract_weight_to_half function in chatglm2_6b_int4/quantization.py
 

# Original CUDA kernel launch, commented out:
# func(
#     gridDim,
#     blockDim,
#     0,
#     stream,
#     [
#         ctypes.c_void_p(weight.data_ptr()),
#         ctypes.c_void_p(scale_list.data_ptr()),
#         ctypes.c_void_p(out.data_ptr()),
#         ctypes.c_int32(n),
#         ctypes.c_int32(m),
#     ],
# )

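# Plain PyTorch replacement. Assuming weight is an int8 tensor, each byte holds
# two signed 4-bit values: >> 4 extracts the high nibble, and (<< 4) >> 4
# extracts the low nibble with sign extension.
out[:, 0::2] = scale_list.view(-1,1) * (weight >> 4)
out[:, 1::2] = scale_list.view(-1,1) * ((weight << 4) >> 4)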
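
To see why the two shifts recover the packed values, here is a minimal sketch on a tiny hand-made example. It uses NumPy rather than torch so that it runs without a GPU or the model. The helper name unpack_int4, the sample values, and the packing layout (high nibble first) are assumptions made for illustration, not code taken from quantization.py.

```python
import numpy as np

def unpack_int4(packed, scale, dtype=np.float16):
    """Dequantize int8 bytes that each hold two signed 4-bit values (hypothetical helper)."""
    n, half_m = packed.shape
    out = np.empty((n, half_m * 2), dtype=dtype)
    # Arithmetic right shift keeps the sign of the high nibble.
    out[:, 0::2] = scale.reshape(-1, 1) * (packed >> 4)
    # Shift the low nibble to the top of the byte, then shift back to sign-extend it.
    out[:, 1::2] = scale.reshape(-1, 1) * ((packed << 4) >> 4)
    return out

# Four 4-bit values in [-8, 7], packed pairwise into two int8 bytes.
values = np.array([[-8, 7, 3, -1]], dtype=np.int8)
packed = ((values[:, 0::2] << 4) | (values[:, 1::2] & 0x0F)).astype(np.int8)

scale = np.array([0.5], dtype=np.float16)  # one scale per output row
out = unpack_int4(packed, scale)
print(packed.tolist())  # [[-121, 63]]
print(out.tolist())     # [[-4.0, 3.5, 1.5, -0.5]]
```

The output has twice as many columns as the packed input, which is why the replacement writes to the even and odd columns separately.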

3. Modify the quant_gemv function in chatglm2_6b_int4/quantization.py
 

# Original CUDA kernel launch, commented out:
# func(
#     gridDim,
#     blockDim,
#     shm_size,
#     stream,
#     [
#         ctypes.c_void_p(weight.data_ptr()),
#         ctypes.c_void_p(input.data_ptr()),
#         ctypes.c_void_p(scale_list.data_ptr()),
#         ctypes.c_void_p(out.data_ptr()),
#         ctypes.c_int32(m),
#         ctypes.c_int32(k),
#     ],
# )

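# Plain PyTorch replacement: dequantize the weight, then do an ordinary matmul.
if input.dtype == torch.float:
    source_bit_width = 8
elif input.dtype == torch.float16:
    source_bit_width = 4
else:
    assert False, f"unsupported input type: {input.dtype}"

# Note: the unpacking below writes two columns per packed byte, so the shapes
# only line up when source_bit_width == 4 (float16 input).
tmp = torch.empty(weight.size(0), weight.size(1) * (8 // source_bit_width), dtype=input.dtype, device="cuda")
tmp[:, 0::2] = scale_list.view(-1,1) * (weight >> 4)           # high nibble
tmp[:, 1::2] = scale_list.view(-1,1) * ((weight << 4) >> 4)    # low nibble, sign-extended
out = torch.matmul(input, tmp.transpose(1,0))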
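
As a sanity check on the shapes involved, the dequantize-then-matmul idea can be sketched on random data. This is a minimal NumPy sketch under stated assumptions, not the code from quantization.py: the function name dequant_matmul, the sizes, and the packing layout (high nibble first) are all chosen for illustration.

```python
import numpy as np

def dequant_matmul(x, packed, scale):
    """x: (batch, m) float, packed: (n, m // 2) int8, scale: (n,) float -> (batch, n)."""
    n, half_m = packed.shape
    w = np.empty((n, half_m * 2), dtype=x.dtype)
    w[:, 0::2] = scale.reshape(-1, 1) * (packed >> 4)          # high nibble
    w[:, 1::2] = scale.reshape(-1, 1) * ((packed << 4) >> 4)   # low nibble, sign-extended
    return x @ w.T

rng = np.random.default_rng(0)
n, m, batch = 3, 8, 2

# Random signed 4-bit values, packed two per byte.
values = rng.integers(-8, 8, size=(n, m)).astype(np.int8)
packed = ((values[:, 0::2] << 4) | (values[:, 1::2] & 0x0F)).astype(np.int8)
scale = rng.random(n).astype(np.float32) + 0.1
x = rng.random((batch, m)).astype(np.float32)

out = dequant_matmul(x, packed, scale)
# Reference: multiply by the unpacked weights directly.
reference = x @ (scale.reshape(-1, 1) * values).T
print(out.shape, np.allclose(out, reference))  # (2, 3) True
```

The full-size weight matrix is materialized on every call, which is one plausible reason this workaround is much slower than a fused kernel.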

In my tests this runs, at roughly the same speed as running chatglm2-6b on the CPU.


Reposted from blog.csdn.net/miles2007/article/details/132805941