Hugging Face's transformers library provides the AutoModel family of classes, which automatically load and instantiate models such as GPT and ChatGLM. The device_map parameter of AutoModel.from_pretrained() makes single-machine, multi-GPU inference possible.
Analysis of the device_map Parameter
device_map is an important parameter of AutoModel.from_pretrained(). It specifies which computing device each component of the model should be loaded onto, so that resources are allocated and used efficiently. The parameter is particularly useful for model-parallel or distributed setups.
The device_map parameter accepts several preset strategies: "auto", "balanced", "balanced_low_0", and "sequential". They behave as follows (a sketch for comparing their placements follows the list):
- "auto" and "balanced": split the model evenly across all available GPUs. "auto" may adopt more efficient allocation strategies in future releases, while the behavior of "balanced" stays stable. (Use as needed.)
- "balanced_low_0": balances the model across all GPUs except the first one, keeping GPU 0 mostly free. This suits workloads that need extra room on GPU 0, for example running the iterative generate loop there. (Recommended.)
- "sequential": fills the GPUs in order, starting from GPU 0 and ending with the last GPU (the last GPU is often not fully used; unlike "balanced_low_0", it spares the last GPU rather than the first, and the split is not balanced). In practice GPU 0 tends to run out of video memory with this strategy. (Not recommended.)
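To see how each strategy actually places the layers, you can load the model with each option and inspect model.hf_device_map, the module-to-device mapping that transformers records whenever device_map is used. Below is a minimal exploration sketch, assuming two visible GPUs and the local ChatGLM2-6B checkpoint used later in this article; loading a 6B model four times is slow and memory-hungry, so this is only for comparing placements, not for production use.

import torch
from transformers import AutoModel

model_path = "./model/chatglm2-6b"  # local checkpoint path, adjust to your setup

for strategy in ("auto", "balanced", "balanced_low_0", "sequential"):
    model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map=strategy)
    # hf_device_map records which device each module was dispatched to
    print(strategy, "->", model.hf_device_map)
    del model
    torch.cuda.empty_cache()  # release the weights before trying the next strategy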
Code Example for device_map="auto"
Our environment here is a single machine with two GPUs. We load the ChatGLM2-6B model with device_map="auto" and observe how the GPUs are used. The sample code is as follows:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # expose GPUs 0 and 1 to this process

import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model
model_path = "./model/chatglm2-6b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")

text = '什么是机器学习?'  # "What is machine learning?"
inputs = tokenizer(text, return_tensors="pt").to(model.device)  # move inputs to the model's first device
print(inputs)

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
- Before the program runs, GPU 0 already has about 13 GB of video memory occupied by other programs.
- After loading with device_map="auto", GPUs 0 and 1 each occupy about 6-7 GB of video memory for this model.
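If you prefer to check the footprint from inside the script rather than from a tool such as nvidia-smi, PyTorch can report what this process has allocated on each GPU. Note that it cannot see memory held by other programs, such as the ~13 GB already on GPU 0. A small sketch:

import torch

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3  # tensors currently held by this process
    reserved = torch.cuda.memory_reserved(i) / 1024**3    # memory reserved by PyTorch's caching allocator
    print(f"GPU {i}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")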
Manual Configuration
- Configuring a single GPU
device = "cuda:1"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map=device)
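With a single-device device_map the whole model sits on that GPU, so the inputs must be moved there as well. A minimal usage sketch, reusing the tokenizer from the example above:

text = '什么是机器学习?'  # "What is machine learning?"
inputs = tokenizer(text, return_tensors="pt").to(device)  # inputs must live on the same GPU as the model
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))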
- Configuring multiple GPUs
Suppose you want some parts of the model on the first GPU and the rest on the second. You then need to know the model's layer names, or allocate layers sensibly based on the size of each component. The exact layer names depend on the actual model, so the following is only a conceptual example:
device = {
    "transformer.h.0": "cuda:0",  # place the first block on GPU 0
    "transformer.h.1": "cuda:1",  # place the second block on GPU 1
    # ... continue assigning devices according to the model structure
}
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map=device)
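Rather than guessing layer names, you can list them from an empty (meta-device) copy of the model and let accelerate propose a mapping under explicit per-GPU memory budgets, then adjust it by hand before loading the real weights. A sketch assuming the accelerate package is installed and reusing model_path from the earlier example; the 6 GiB / 10 GiB budgets are placeholders for your own limits.

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
with init_empty_weights():
    # Build the model structure without allocating any real weights
    empty_model = AutoModel.from_config(config, trust_remote_code=True)

# Top-level module names that can serve as keys in a device_map dict
print([name for name, _ in empty_model.named_children()])

# Let accelerate propose a placement under per-GPU memory budgets
device_map = infer_auto_device_map(empty_model, max_memory={0: "6GiB", 1: "10GiB"})
print(device_map)

model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map=device_map)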