【Accelerate】accelerate-large-models (RuntimeError: Expected all tensors to be on the same device……)


1. Load and run the large model

1.1 General model

  1. Create model
  2. Load weights into memory
  3. Load the weights into the model
  4. Move the model to the inference device
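
As a minimal sketch of these four steps (with a toy nn.Linear standing in for a real architecture and a hypothetical model_weights.pth file):

import torch
import torch.nn as nn

model = nn.Linear(10, 10)                      # 1. create the model
state_dict = torch.load("model_weights.pth")   # 2. load the weights into memory
model.load_state_dict(state_dict)              # 3. load the weights into the model
model.to("cuda:0")                             # 4. move the model to the inference device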

1.2 Large model

  1. Create an empty (i.e. no weights) model
  2. Determine the location of each layer (when multiple devices are available)
  3. Load part of the weights into memory
  4. Load these weights into the empty model
  5. Move the weights onto the device used for inference
  6. Repeat step 3 for the next set of weights until all weights are loaded

2. Create an empty model

For example, the following tensor cannot be created on most machines, because at the default FP32 precision it would need 100,000 × 100,000 × 4 bytes ≈ 40 GB of RAM:

import torch

large_tensor = torch.randn(100000, 100000)

It works on the meta device, which records only the shape and dtype and allocates no memory:

import torch

large_tensor = torch.randn(100000, 100000, device="meta")
# tensor(..., device='meta', size=(100000, 100000))

In practice, we cannot go through the modeling code and change the device attribute of every tensor it creates.

Therefore, we generally instantiate it like this (taking BLOOM as an example):

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
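
A quick sanity check (a minimal sketch): every parameter of the model created this way lives on the meta device and consumes no memory.

print(next(model.parameters()).device)
# meta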

3. Compute device mapping

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-13b")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(model)

Running this returns the following device map:

{'model.decoder.embed_tokens': 0,
 'model.decoder.embed_positions': 0,
 'model.decoder.final_layer_norm': 0,
 'model.decoder.layers.0': 0,
 'model.decoder.layers.1': 0,
 ...
 'model.decoder.layers.9': 0,
 'model.decoder.layers.10.self_attn': 0,
 'model.decoder.layers.10.activation_fn': 0,
 'model.decoder.layers.10.self_attn_layer_norm': 0,
 'model.decoder.layers.10.fc1': 'cpu',
 'model.decoder.layers.10.fc2': 'cpu',
 'model.decoder.layers.10.final_layer_norm': 'cpu',
 'model.decoder.layers.11': 'cpu',
 ...
 'model.decoder.layers.17': 'cpu',
 'model.decoder.layers.18.self_attn': 'cpu',
 'model.decoder.layers.18.activation_fn': 'cpu',
 'model.decoder.layers.18.self_attn_layer_norm': 'cpu',
 'model.decoder.layers.18.fc1': 'disk',
 'model.decoder.layers.18.fc2': 'disk',
 'model.decoder.layers.18.final_layer_norm': 'disk',
 'model.decoder.layers.19': 'disk',
 ...
 'model.decoder.layers.39': 'disk',
 'lm_head': 'disk'}

From the results we can see:

  • Layers 0 to 9 are on GPU 0.
  • The first part of layer 10 is on GPU 0 and the second part is on CPU.
  • Layers 11 to 17 are located on the CPU.
  • The first part of layer 18 is on the CPU and the second part is on the disk.
  • Layers 19 to 39 are located on the disk.

This is not feasible, since each decoder layer must run entirely on a single device.

Therefore, we should tell Accelerate which module classes must never be split across devices:

device_map = infer_auto_device_map(model, no_split_module_classes=["OPTDecoderLayer"])

The full code is as follows:

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-13b")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
device_map = infer_auto_device_map(model, no_split_module_classes=["OPTDecoderLayer"])

This will return:

{'model.decoder.embed_tokens': 0,
 'model.decoder.embed_positions': 0,
 'model.decoder.final_layer_norm': 0,
 'model.decoder.layers.0': 0,
 'model.decoder.layers.1': 0,
 ...
 'model.decoder.layers.9': 0,
 'model.decoder.layers.10': 'cpu',
 'model.decoder.layers.11': 'cpu',
 ...
 'model.decoder.layers.17': 'cpu',
 'model.decoder.layers.18': 'disk',
 ...
 'model.decoder.layers.39': 'disk',
 'lm_head': 'disk'}

Explanation of the string modes that device_map accepts:

  • "auto" or "balanced": Accelerate will split the weights to ensure equal load on each GPU.
  • "balanced_low_0": Accelerate will split the weights to ensure equal load on each GPU, but the first GPU will try to keep as few weights as possible (useful when you want to process the output of the model on one GPU, such as using a generating function hour).
  • "sequential": Accelerate will fill the GPUs sequentially (so the last GPU may not be used at all).

4. Sharding state dicts

4.1 Traditional save/load weights

# Save the model weights
torch.save(my_model.state_dict(), 'model_weights.pth')

# Reload them
new_model = ModelClass()
new_model.load_state_dict(torch.load('model_weights.pth'))

4.2 Large models (sharded checkpoints)

Rather than one big file containing all the weights, large models on the Hugging Face Hub are saved and shared as several shards, each holding part of the weights.

If you go to the BLOOM model page, you will see 72 files named pytorch_model_xxxxx-of-00072.bin (each approximately 7.19 GB), each containing part of the model weights. With this format we can load one shard of the state dict into memory, put its weights into the model, move them to the correct device, and discard the shard before moving on to the next one.
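
In pure Accelerate, this shard-by-shard loading is what load_checkpoint_and_dispatch does: it loads each shard, places its weights according to the device map, and frees the shard before moving to the next one. A minimal sketch, assuming the shards have been downloaded into a local weights/ folder (a hypothetical path):

from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "facebook/opt-13b"
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# "weights/" is a hypothetical local folder containing the sharded checkpoint.
model = load_checkpoint_and_dispatch(
    model, "weights/", device_map="auto", no_split_module_classes=["OPTDecoderLayer"]
)

In transformers, passing device_map directly to from_pretrained performs the equivalent loading internally: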

import torch
from transformers import AutoModelForCausalLM

# Will error
checkpoint = "facebook/opt-13b"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.float16)

If you don't have enough GPU memory and CPU RAM, you'll get an error message indicating that you need to pass a folder to which the weights destined for disk can be offloaded.

The error message is as follows:

ValueError: The current `device_map` had weights offloaded to the disk. Please provide an 
`offload_folder` for them.

Solution:

import torch
from transformers import AutoModelForCausalLM

# Will go out of RAM on Colab
checkpoint = "facebook/opt-13b"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", offload_folder="offload", torch_dtype=torch.float16
)

If you try to load a very large model that requires disk offloading in addition to CPU offloading, you may run out of RAM when loading the last few shards of the checkpoint, because the part of the model destined for the CPU is still sitting in memory. If this is the case, pass offload_state_dict=True to temporarily offload that CPU part to disk while the shards are being loaded, and reload it into RAM once all the weights have been processed.

import torch
from transformers import AutoModelForCausalLM

checkpoint = "facebook/opt-13b"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", offload_folder="offload", offload_state_dict = True, torch_dtype=torch.float16
)

This fits on Colab, but comes very close to using all the available RAM, so it may still run out of memory while generating predictions. To get a model we can actually use, we need to offload one more layer to disk. We can do this by taking the device_map we computed in the previous section, tweaking it slightly, and passing it to the from_pretrained call:

import torch
from transformers import AutoModelForCausalLM

checkpoint = "facebook/opt-13b"
device_map["model.decoder.layers.37"] = "disk"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map=device_map, offload_folder="offload", offload_state_dict=True, torch_dtype=torch.float16
)
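
Once the model has been loaded with this device_map, inference works the same as for a regular model; a minimal sketch of a generation call (the prompt and token count are arbitrary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("Hello, my name is", return_tensors="pt")

# The input ids start on the device holding the first layers (GPU 0 here);
# the hooks described in the next section handle the rest of the data movement.
outputs = model.generate(inputs["input_ids"].to(0), max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))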

5. Running a model split across multiple devices

Hooks are a PyTorch API that lets you register functions to be executed before each forward call.

We can't use them directly, because they only support positional arguments and no keyword arguments in the forward pass, but Accelerate adapts the same idea. Once the model is loaded, the dispatch_model function adds hooks to each module and submodule that are executed before and after each forward pass (a simplified sketch follows the list below). They will:

  • Make sure all inputs and weights of the module are on the same device;
  • If the weights have been offloaded to the CPU, move them to GPU 0 before the forward pass and back to the CPU after;
  • If the weights have been offloaded to disk, load them into RAM before the forward pass, then load onto GPU 0, and free that memory afterwards.
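
As a simplified illustration of the idea (not Accelerate's actual implementation, which uses its own hook class so it can also handle keyword arguments and offloaded weights), a plain PyTorch forward pre-hook that moves the positional inputs onto the device of the module's weights could look like this:

import torch
import torch.nn as nn

def move_inputs_to_weight_device(module, args):
    # Runs before module.forward: move every tensor input to the device
    # where the module's own weights live.
    device = next(module.parameters()).device
    return tuple(a.to(device) if isinstance(a, torch.Tensor) else a for a in args)

layer = nn.Linear(4, 4)
if torch.cuda.is_available():
    layer.to("cuda:0")
layer.register_forward_pre_hook(move_inputs_to_weight_device)

# The input starts on the CPU; the hook moves it next to the weights
# before the forward pass runs.
out = layer(torch.randn(2, 4))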

6. Summary

Video explanation: https://www.youtube.com/watch?v=MWCSGj9jEAo

This method requires computing the device placement ahead of time, and every layer must reside entirely on a single device.

Reference: https://github.com/huggingface/blog/blob/main/accelerate-large-models.md

Origin blog.csdn.net/qq_44824148/article/details/133466351