Seamlessly supporting the Hugging Face community, Colossal-AI makes it easy to accelerate large models at low cost

Large models have become a major trend in AI, not only sweeping performance leaderboards but also powering many interesting applications.
For example, Copilot, the automatic code-suggestion and completion tool developed by Microsoft and OpenAI, has become a great assistant for programmers and noticeably improves their productivity.
OpenAI recently released DALL·E 2, a text-to-image model whose generated images are hard to tell from real ones, and Google quickly followed with Imagen. When it comes to large models, the big companies are competing fiercely, no less so than in computer vision.

Text-to-image generation example: "A statue of a Greek man tripped by a cat" (left two columns: Imagen; right two columns: DALL·E 2)

The remarkable capabilities that come with model scale have driven explosive growth in the size of pre-trained models in recent years. However, training or even fine-tuning large models requires very expensive hardware, often dozens or hundreds of GPUs. In addition, existing deep learning frameworks such as PyTorch and TensorFlow struggle to handle large models effectively and usually require professional AI systems engineers to adapt and optimize them for specific models.
More importantly, not every laboratory or R&D team has the budget to call on large-scale GPU clusters whenever needed, let alone individual developers with a single graphics card. So although large models attract plenty of attention, the high barrier to entry keeps them out of reach for most people.
The core reason large models are expensive to use is the limited GPU memory: GPUs compute fast, but their memory capacity is small and cannot hold a large model. To address this pain point, Colossal-AI uses a heterogeneous memory system that efficiently combines GPU memory with low-cost CPU memory. On a personal PC with a single GPU, it can train GPT models with up to 18 billion parameters, increasing trainable model capacity by more than tenfold. This greatly lowers the barrier to downstream tasks such as fine-tuning and inference with large AI models, and it also extends easily to large-scale distributed training.
Hugging Face provides implementations of more than 50,000 AI models for the deep learning community, including large models such as GPT and OPT, and has become one of the most popular AI libraries.
Colossal-AI seamlessly supports Hugging Face community models, making large models accessible to every developer. Next, using OPT, the large model released by Meta, as an example, we show how to use Colossal-AI for low-cost training and fine-tuning of large models by adding just a few lines of code.

Accelerate large model OPT at low cost

OPT model

OPT, short for Open Pretrained Transformer, is a large-scale Transformer model released by Meta (Facebook) AI Lab as a counterpart to GPT-3. Unlike OpenAI, which has not released the weights of GPT-3, Meta AI has open-sourced all of the code and model weights, greatly promoting the democratization of large AI models; every developer can build personalized downstream tasks on top of them. Next, we fine-tune for causal language modeling using the pre-trained OPT weights provided by Hugging Face.

Add configuration file

To use the powerful features of Colossal-AI, users do not need to change their training logic; they simply add a short configuration file to give the model the desired capabilities, such as mixed precision, gradient accumulation, multi-dimensional parallel training, and redundant memory optimization.
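For illustration, a configuration file that turns on mixed precision and gradient accumulation could look roughly like the following. This is a minimal sketch following the config-file conventions in Colossal-AI's documentation of this period; the exact keys (fp16, gradient_accumulation, clip_grad_norm) should be checked against your installed version:

# Illustrative config sketch -- verify keys against your Colossal-AI version
from colossalai.amp import AMP_TYPE

fp16 = dict(mode=AMP_TYPE.TORCH)   # mixed-precision training via torch.cuda.amp
gradient_accumulation = 4          # accumulate gradients over 4 micro-batches
clip_grad_norm = 1.0               # gradient clipping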
Taking heterogeneous training on a single GPU as an example, we only need to add the relevant items to the configuration file, where tensor_placement_policy determines the heterogeneous training strategy. This parameter can be cuda, cpu, or auto, and each strategy has its own advantages:
  • cuda: all model parameters are placed on the GPU, suitable for traditional scenarios where training fits in GPU memory without offloading;
  • cpu: model parameters are placed in CPU memory, and only the weights currently participating in computation are kept in GPU memory, suitable for training extremely large models;
  • auto: the amount of parameters kept in GPU memory is determined automatically from real-time memory information, maximizing GPU memory utilization and reducing data transfer between CPU and GPU.
For most users, it is enough to select the auto strategy: Colossal-AI then dynamically chooses the best heterogeneous placement in real time to maximize computing efficiency.
 
from colossalai.zero.shard_utils import TensorShardStrategy

zero = dict(model_config=dict(shard_strategy=TensorShardStrategy(),
                              tensor_placement_policy="auto"),
            optimizer_config=dict(gpu_margin_mem_ratio=0.8))

Launch and run

Once the configuration file is ready, we can enable the declared new features by inserting just a few lines of code.
First, launch Colossal-AI with the configuration file via a single line of code. Colossal-AI automatically initializes the distributed environment, reads the configuration, and injects the configured features into components such as the model and optimizer.
 
colossalai.launch_from_torch(config='./configs/colossalai_zero.py')
Next, users can define datasets, models, optimizers, loss functions, and so on as usual, for example directly with native PyTorch code. The only change when defining the model is to wrap its initialization in a ZeroInitContext. In this example, we use the OPTForCausalLM model and pre-trained weights provided by Hugging Face and fine-tune on the Wikitext dataset.
 
import torch
from transformers import OPTForCausalLM
from colossalai.zero.init_ctx import ZeroInitContext

with ZeroInitContext(target_device=torch.cuda.current_device(),
                     shard_strategy=shard_strategy,
                     shard_param=True):
    model = OPTForCausalLM.from_pretrained(
                'facebook/opt-1.3b',
                config=config
            )
Then, simply call colossalai.initialize to inject the heterogeneous memory features defined in the configuration file into the training engine and enable them.
 
engine, train_dataloader, eval_dataloader, lr_scheduler = colossalai.initialize(model=model,
                                                                               optimizer=optimizer,
                                                                               criterion=criterion,
                                                                               train_dataloader=train_dataloader,
                                                                               test_dataloader=eval_dataloader,
                                                                               lr_scheduler=lr_scheduler)
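
After initialization, training runs through the returned engine instead of the raw model and optimizer. Below is a minimal training-loop sketch, not the exact code of the original example: it assumes the dataloader yields Hugging Face-style batches containing input_ids and attention_mask, and it omits logging, evaluation, and checkpointing.

# Minimal training-loop sketch (assumes Hugging Face-style batches)
engine.train()
for step, batch in enumerate(train_dataloader):
    batch = {k: v.cuda() for k, v in batch.items()}
    engine.zero_grad()
    # for causal language modeling, the input tokens also serve as labels
    outputs = engine(input_ids=batch['input_ids'],
                     attention_mask=batch['attention_mask'],
                     labels=batch['input_ids'])
    loss = outputs['loss']
    engine.backward(loss)   # backward pass goes through the engine
    engine.step()           # parameter update goes through the engine
    lr_scheduler.step()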

Significant advantages

On a single GPU, Colossal-AI's automatic auto strategy shows significant advantages over Microsoft DeepSpeed's ZeRO Offloading strategy across different model scales, achieving speedups of up to 40%. Traditional deep learning frameworks such as PyTorch simply cannot run models this large on a single GPU.
 
For parallel training on 8 GPUs, just add -nprocs 8 to the launch command!

The secret behind it

Such significant improvements come from Gemini, Colossal-AI's efficient heterogeneous memory management subsystem. Put simply, Gemini warms up during the first few training steps, collecting memory-consumption information from PyTorch's dynamic computation graph. After warm-up, before an operator is computed, Gemini uses the collected memory-usage records to reserve the operator's peak memory on the computing device, moving some model tensors from GPU memory to CPU memory as needed.

Gemini's built-in memory manager tags every tensor with a state, such as HOLD, COMPUTE, or FREE. Based on continuously queried memory usage, it keeps converting tensor states and adjusting tensor placement. Compared with the static partitioning of DeepSpeed's ZeRO Offload, Colossal-AI's Gemini uses GPU and CPU memory more efficiently, maximizing model capacity and balancing training speed even when hardware is extremely limited.
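To make the idea concrete, here is a toy sketch of this kind of state-driven placement. It is purely illustrative and not Colossal-AI's actual Gemini implementation; all names below (TensorState, ToyMemoryManager, prepare_operator) are hypothetical.

# Toy illustration of state-driven tensor placement -- NOT the real Gemini code
from enum import Enum, auto

class TensorState(Enum):
    HOLD = auto()     # resident, not needed by the upcoming operator
    COMPUTE = auto()  # must be on the GPU for the upcoming operator
    FREE = auto()     # no longer needed; memory may be reclaimed

class ToyMemoryManager:
    def __init__(self, gpu_capacity_mb):
        self.capacity = gpu_capacity_mb
        self.size = {}          # tensor name -> size in MB
        self.state = {}         # tensor name -> TensorState
        self.on_gpu = set()     # names currently resident in GPU memory

    def register(self, name, size_mb):
        self.size[name] = size_mb
        self.state[name] = TensorState.HOLD
        self.on_gpu.add(name)

    def gpu_free(self):
        return self.capacity - sum(self.size[n] for n in self.on_gpu)

    def prepare_operator(self, needed, peak_mb):
        """Before an operator runs: mark its tensors COMPUTE, keep them on the
        GPU, and evict HOLD tensors to CPU until the recorded peak fits."""
        for name in self.state:
            self.state[name] = TensorState.COMPUTE if name in needed else TensorState.HOLD
        self.on_gpu.update(needed)
        evictable = sorted(self.on_gpu - set(needed), key=lambda n: -self.size[n])
        for name in evictable:
            if self.gpu_free() >= peak_mb:
                break
            self.on_gpu.discard(name)   # "move" this HOLD tensor to CPU memory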
       
 
For GPT, a representative large model, Colossal-AI can train a model with up to 1.5 billion parameters on an ordinary gaming laptop with an RTX 2060 (6 GB); on a personal computer with an RTX 3090 (24 GB) it can directly train an 18-billion-parameter model; and on professional cards such as the Tesla V100, Colossal-AI also delivers significant improvements.

Convenient and efficient parallel scaling

To train the world's largest and most advanced AI models in the shortest time, efficient distributed parallel scaling is still indispensable. Complex parallel strategies can be enabled automatically with a simple declaration; Colossal-AI does not require intruding into the code and manually handling complex low-level logic, as other systems and frameworks do.
 
parallel = dict(
    pipeline=2,
    tensor=dict(mode='2.5d', depth=1, size=4)
)
In massively parallel scenarios scaling to dozens or even hundreds of GPUs, Colossal-AI still delivers significant speedups and resource savings compared with existing systems such as NVIDIA's Megatron-LM. For super-large AI models such as GPT-3, this can translate into millions of yuan saved in pre-training costs.
 
 
Colossal-AI solutions have been successfully deployed by well-known companies in industries such as autonomous driving, cloud computing, retail, pharmaceuticals, and chips, and have been widely praised.
Colossal-AI focuses on building its open-source community: it provides Chinese tutorials, runs user communities and forums, iterates quickly on user feedback, and keeps adding cutting-edge applications such as PaLM and AlphaFold.
Since going open source, Colossal-AI has repeatedly ranked first on the GitHub and Papers With Code trending lists, attracting attention at home and abroad alongside many star open-source projects with tens of thousands of stars.

Portal

Project address:

Recruitment

Luchen Technology is hiring full-time and intern engineers for core systems R&D in AI distributed systems, architecture, compilers, networking, CUDA, SaaS, and Kubernetes, as well as open-source community operations and sales staff.
Luchen Technology offers competitive compensation, and especially outstanding candidates may also apply for remote work. You are also welcome to refer outstanding talent to Luchen Technology; if a candidate you recommend signs with Luchen Technology, we will offer a referral bonus ranging from several thousand to tens of thousands of yuan.
Work locations: Beijing, Singapore, and the United States (transfers between them are possible).
Resume submission email: [email protected]
