KubeAI large model inference acceleration practice | Dewu Technology

1. Background

Recently, we have deployed dedicated large-model inference clusters in batches in our production environment, increasing the inference speed of models up to 70B in size by 50%, significantly reducing deployment costs, and running them stably in production. This article draws on our experience deploying these inference clusters and shares methods that effectively improve the inference speed of large models. At the end, we also recommend several large model inference frameworks that performed well in our evaluations. We hope these suggestions help readers choose an inference framework suited to their own projects.

Hyung Won Chung, a scientist at OpenAI, pointed out in his 2023 public lecture "Large Language Models" [8] that certain capabilities of large models only emerge once the models reach a certain scale. It follows that the parameter counts of large models will keep growing; this is the development trend of large models. As parameter counts increase, the demands on inference speed become ever higher. So what methods can be used to improve the inference speed or throughput of large models?

First, we discuss the directions of acceleration optimization for large models. Then, following a timeline, the article introduces some of the more classic and practical large model acceleration technologies in the industry, including but not limited to "FlashAttention" [1] and "PageAttention" [3].

The following are some classic large model inference acceleration technologies in the industry, listed in the chronological order of their appearance. This article attempts to give readers a review of large model acceleration methods following that timeline.

 

In addition to the technologies mentioned above, quantization techniques can also improve the inference speed of large models. We will not discuss them here; we plan to introduce them in a separate article later.

2. Challenges faced by the development of large models

In the future, the parameter counts of large models will certainly keep growing; this is the development trend of large models, and the demands on inference acceleration will become higher and higher.

OpenAI introduced scaling laws for large models in its paper "Scaling Laws for Neural Language Models" [7]. These laws describe the relationship between a model's capabilities and its scale: a model's capabilities depend strongly on its scale, namely the number of model parameters, the size of the dataset, and the amount of compute used during training. In addition, OpenAI scientist Hyung Won Chung pointed out in his 2023 public lecture "Large Language Models" [8] that certain capabilities of large models only emerge once the models reach a certain scale.

 

The figure above is taken from the slides of Hyung Won Chung's talk [8]. Its main point is that as model scale increases, for example from GPT-3 to GPT-4, the capabilities of the model become stronger and new capabilities even emerge.

However, as model size increases, the inference speed of large models gradually decreases, because more parameters require more GPU computation. Slower inference in turn leads to a worse user experience, so accelerating large model inference becomes increasingly important.

 

3. Optimization directions for large model inference acceleration

Llama2 model structure

Let’s first take a brief look at the structure of the Llama 2 model series, referring to the Llama 2 paper [9]. Currently, most generative language models like the Llama series mainly use the Decoder module in the Transformer architecture. On the Huggingface platform, this type of model structure is usually called CausalLM, which is a causal language model.

 

The figure above shows the structure of the Llama 2 model; its core is the attention computation (Llama Attention), which is also the most time-consuming module in the entire inference process. Most of the optimizations that follow are built around Attention. To better understand the structure of the Llama 2 model, we first briefly walk through its inference process. Readers who are not interested can skip this part.

  1. After the user submits a Prompt to the model, the model first predicts the next token and appends the predicted token to the input before predicting again. This process repeats until the model outputs a STOP token, at which point prediction stops and the model returns the final result (a minimal code sketch of this loop follows the list).
  2. To generate each next token, the model must run through N Llama Decoder Layers. Specifically, Llama-2-7B runs 32 such layers, while Llama-2-13B runs 40.
  3. The most critical computation inside the Llama Decoder Layer is the attention computation (Llama Attention). Most of the inference time is spent on Attention, so a variety of optimization techniques have been designed to make this computation more efficient.
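To make this concrete, the following is a minimal sketch of the greedy autoregressive loop using the Hugging Face transformers API; the model id and generation length are assumptions for illustration. For clarity it recomputes the whole sequence at every step, whereas real serving stacks reuse the per-layer key/value states (the KV Cache discussed later).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Each step runs all 32 decoder layers of the 7B model and appends the
# predicted token to the input, until an EOS (stop) token is produced.
for _ in range(64):
    with torch.no_grad():
        logits = model(input_ids).logits          # (1, seq_len, vocab_size)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```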

What are the acceleration directions for large model inference?

From the structural analysis of the Llama 2 model, we can conclude that the large model exhibits the following characteristics during the inference calculation process:

  1. During the entire inference process, the most time-consuming part is the attention (Attention) computation. Optimizing the speed of the Attention computation can significantly improve overall inference performance.
  2. During the attention computation, the key-value cache (KV Cache) occupies a large amount of GPU memory. Taking the 13B model as an example, processing a single Prompt sequence requires roughly 3GB of additional GPU memory (a rough estimate follows this list), and this memory is frequently allocated and released, producing a large number of fragments. Reducing this fragmentation can also improve the throughput of large models.
  3. During inference, the GPU needs to load and compute over a huge number of parameters. The 7B model has 7 billion parameters, the 13B model has 13 billion, and the recently released DBRX has about 130 billion. Handling these parameters efficiently leaves further room for optimization.
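As a rough sanity check of the 3GB figure in point 2, the sketch below estimates the KV Cache size of a Llama-2-13B-style model. The layer count, hidden size, and fp16 assumption follow the published Llama 2 configuration; the 4096-token sequence length is just an illustrative choice.

```python
# Rough KV Cache size estimate for a Llama-2-13B-style model (fp16 activations).
num_layers  = 40      # Llama-2-13B decoder layers
hidden_size = 5120    # 40 heads x 128 head_dim
bytes_fp16  = 2
seq_len     = 4096    # illustrative sequence length

# K and V each store one hidden_size vector per layer per token.
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_fp16
total_gb = kv_bytes_per_token * seq_len / 1024**3
print(f"{kv_bytes_per_token / 1024**2:.2f} MB per token, "
      f"{total_gb:.1f} GB for a {seq_len}-token sequence")
# -> about 0.78 MB per token, roughly 3.1 GB for the full sequence
```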

In response to the above three characteristics, the industry has currently proposed a variety of effective optimization methods, typically as follows:

 

1. FlashAttention-Attention calculation speed optimization

FlashAttention[1] speeds up the Attention operator without changing its computation results. FlashAttention has demonstrated significant performance gains across a variety of models and tasks. For example, on models such as BERT-large and GPT-2, it achieves end-to-end speedups of 15% up to 3x compared with baseline implementations.

2. PageAttention-KV Cache memory management optimization

The goal of PageAttention[3] is to reduce GPU memory fragmentation. The vLLM system built on PageAttention can increase the throughput of popular large language models (LLMs) by more than 10x while keeping the latency distribution smooth.

3. MOE-reduce model parameters during inference

The goal of MOE (Mixture of Experts) [4] is to reduce the number of parameters involved in calculation during model inference.

Experimental results: the Mixtral model outperforms Llama 2 70B in most benchmarks, and its inference is 6x faster. The model supports multiple languages, has strong code generation capabilities, and can be fine-tuned to follow specific instructions, achieving a high score on the MT-Bench benchmark.

We will introduce each of the above directions in detail later.

4. FlashAttention-Attention operator calculation optimization

Two papers describe the FlashAttention optimizations to the Attention operator: FlashAttention-1[1] and FlashAttention-2[2]. Let's take FlashAttention-1[1] as an example to understand its optimization principle.

Let's first look at the GPU's memory hierarchy; refer to the figure below, taken from the FlashAttention-1 paper [1].

 

The memory hierarchy of the GPU consists of three main parts: SRAM, HBM and DRAM. The following is the reference configuration of the A100 GPU.

SRAM (Static Random Access Memory) has the fastest access speed (19TB/s), but its capacity is relatively small (only 20MB).

HBM (High Bandwidth Memory) provides large storage space (40GB) and high-speed data access (1.5TB/s).

DRAM (Dynamic Random Access Memory), here specifically refers to the main memory outside the GPU, has the largest capacity (more than 1TB), but the slowest access speed (12.8GB/s).

As the configuration above shows, the smaller the memory, the faster its access speed.

 

In the traditional Attention computation, a large number of input/output operations are performed against HBM. The FlashAttention algorithm reorganizes the Attention computation to reduce the number of HBM accesses and thereby improve efficiency, so it is an IO-aware optimization algorithm.

The figure below shows the acceleration method of FlashAttention, from the paper FlashAttention-1[1]

 

FlashAttention uses a clever trick to compute the attention mechanism quickly and memory-efficiently: it avoids processing the entire huge attention matrix at once, which usually requires a lot of memory and compute, by tiling the input data. Imagine we have a huge library (the matrix); the FlashAttention approach is like dividing the books in the library into several small piles and processing only one pile at a time, so we never need to take out all the books and put them on the table at once (which would require a big table and a lot of time).

Specifically, when doing the matrix computations, FlashAttention divides the data into blocks and uses the fast but small SRAM on the GPU for the computation, greatly reducing accesses to the slow but large HBM. This not only speeds up the computation but also significantly reduces the demand on GPU memory.
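Below is a minimal PyTorch sketch of the blockwise "online softmax" idea that FlashAttention is built on; it only tiles the key/value dimension and runs in plain PyTorch for readability, whereas the real FlashAttention kernel also tiles the query dimension and fuses everything into a single CUDA kernel whose working set stays in SRAM.

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """Blockwise attention with an online softmax; matches softmax(qk^T/sqrt(d)) @ v."""
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq_len, 1), float("-inf"), dtype=q.dtype, device=q.device)
    row_sum = torch.zeros(seq_len, 1, dtype=q.dtype, device=q.device)

    for start in range(0, seq_len, block_size):      # iterate over key/value blocks
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = (q @ kb.T) * scale                  # (seq_len, block_size)
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)    # rescale what was accumulated so far
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max

    return out / row_sum

# Sanity check against the naive implementation.
q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4))  # True
```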

By reducing its reliance on slow memory, FlashAttention can significantly speed up model training while maintaining or even improving model quality. For example, it trains BERT-large 15% faster than the MLPerf 1.1 record, trains GPT-2 three times faster than the HuggingFace and Megatron-LM baselines, and achieves up to a 2.4x speedup on long-sequence training.

The figure below, from the Hugging Face blog post on Flash Attention [14], helps illustrate how Flash Attention splits the matrices into blocks.

 

Since Flash Attention can accelerate computation, which frameworks support it? We will recommend some excellent inference frameworks in the second half of the article.
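As one hedged example, recent versions of the Hugging Face transformers library allow requesting the FlashAttention-2 kernel at model load time, provided the flash-attn package is installed and the GPU supports it; the model id below is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # illustrative model id
    torch_dtype=torch.float16,                # FlashAttention requires fp16/bf16
    attn_implementation="flash_attention_2",  # fall back to "sdpa" if flash-attn is unavailable
    device_map="auto",
)
```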

5. PageAttention-GPU memory management optimization

The concept of PageAttention[3] was originally proposed by Woosuk Kwon, the author of vLLM, and it is also the most important optimization strategy in the vLLM inference framework. In his paper, Woosuk Kwon introduced how PageAttention solves a key problem in large language model (LLM) serving: managing memory effectively to improve throughput without increasing latency.

Let's first look at how GPU memory is laid out when serving a large model for inference. The figure below is from the paper [3].

 

This is the memory layout when serving a 13B-parameter language model on an NVIDIA A100: the model parameters occupy 26GB of GPU memory, and the KV Cache of in-flight requests can occupy around 12GB more. As QPS grows, the KV Cache expands rapidly and is frequently allocated and released, generating a large number of GPU memory fragments; if left unaddressed, the system gradually degrades.

 

So how does vLLM solve GPU memory fragmentation through PageAttention? The figure below, from the article [14], shows vLLM's GPU memory management technique.

 

PageAttention works by splitting the key-value cache (KV cache) into fixed-size chunks (or "pages") and allowing these chunks to be stored non-contiguously in memory. This method is inspired by the virtual memory and paging technology of the operating system to manage memory resources more flexibly and efficiently.

In the traditional attention mechanism, a requested KV cache needs to be stored continuously in memory, which leads to two main problems: memory fragmentation and inability to share memory efficiently. Memory fragmentation limits the size of batches, while the inability to share memory results in duplicate data, wasting valuable memory resources.

PageAttention works through the following steps to resolve these issues:

  1. Split the KV cache: Divide the KV cache for each request into multiple smaller chunks, which are fixed in size and can be adjusted based on the specific needs of the model and hardware.
  2. Non-contiguous storage: Unlike traditional KV cache blocks, which are stored contiguously in memory, PageAttention allows these blocks to be distributed non-contiguously in physical memory. In this way, memory blocks can be dynamically allocated and recycled according to actual needs, reducing memory waste.
  3. Dynamic management: PageAttention dynamically manages these memory blocks in a manner similar to virtual memory management in the operating system. The system can optimize memory usage by allocating or releasing KV cache blocks on demand based on current memory usage.
  4. Memory Sharing: PageAttention also supports sharing KV cache blocks between different requests or between different sequences in the same request. This sharing is flexible and can occur on a block level, further reducing memory usage and increasing efficiency.

In this way, PageAttention allows an LLM serving system to significantly increase request throughput by reducing memory waste and enabling memory sharing, while keeping latency unchanged.
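For reference, here is a minimal vLLM serving sketch, assuming the vllm package is installed; the paged KV Cache management described above is handled internally by the engine, so the caller only tunes a few knobs. The model id and parameter values are illustrative rather than a recommended configuration.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # illustrative model id
    gpu_memory_utilization=0.90,        # fraction of GPU memory for weights + paged KV Cache
    # tensor_parallel_size=2,           # optionally shard across GPUs (see the tensor parallelism section)
)

prompts = ["Explain PagedAttention in one sentence."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.8, max_tokens=128))
for out in outputs:
    print(out.outputs[0].text)
```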

Through the PageAttention optimization, vLLM has increased the throughput of LLaMA-7B and LLaMA-13B by more than 10x. The figure below is from the article [11].

 

6. MOE-reduce model parameters during inference

The recently released DBRX, an open source large model with about 130 billion parameters, and Mistral's 8x7B open source model are both built on the MOE architecture. Why do models with ever larger parameter counts adopt the MOE architecture? Let's take Mistral's 8x7B open source model as an example to introduce the performance advantages of the MOE architecture.

 

Before discussing MOE large models, let's compare the structural differences between an ordinary large model and an MOE large model; refer to the figure above. In an MOE large model, the parameters are divided into 8 groups plus a router; each group is called an expert. When a request arrives, the router first selects two of the eight experts, and only these two experts participate in the computation; in an ordinary large model, by contrast, all parameters participate in the GPU computation.

As a result, an MOE large model's inference is roughly four times faster than an ordinary large model of the same overall size.

Let's look at the Mistral MOE implementation. Mistral MOE is the 8x7B large model [12] released by mistral.ai; the figure below, from the paper [12], shows the structure of its expert layer.

 

Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model. It is based on the Mistral 7B architecture, but each layer consists of 8 feed-forward blocks (i.e., experts). As each token is processed, a routing network at each layer selects two experts to process the current state and combines their outputs. Although each token only interacts with two experts, the experts selected at each step can differ, so each token effectively has access to 47B parameters while only using 13B active parameters during inference.
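To make the routing idea concrete, here is a toy top-2 sparse MoE feed-forward layer in PyTorch. It is a simplified sketch of the general technique, not the Mixtral implementation; the class name and layer sizes are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Top-k sparse mixture-of-experts feed-forward layer (illustrative only)."""
    def __init__(self, hidden_dim=512, ffn_dim=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.SiLU(), nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                           # x: (num_tokens, hidden_dim)
        weights, indices = torch.topk(self.router(x), self.top_k)   # pick top-2 experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                        # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(ToyMoELayer()(tokens).shape)  # torch.Size([16, 512])
```

Note that only the two selected experts run for each token, which is why the active parameter count (and hence the inference cost) stays far below the total parameter count.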

Mixtral demonstrates its superior performance on multiple benchmarks, especially in mathematics, code generation and multi-language understanding. Compared to Llama 2 70B and GPT-3.5, Mixtral shows similar or better performance on most evaluation metrics. Specifically, Mixtral uses 5x fewer active parameters (13B) than Llama 2 70B (70B), but performs better or equally in almost all categories.

The MOE large model can increase the number of parameters without reducing the inference speed, which is the development trend of large models in the future.

 

7. Tensor parallelism-accelerating inference with multiple GPUs

If you have multiple GPU cards, you can use tensor parallelism to further accelerate the inference speed of large models.

Imagine you have a very thick book and you want to copy the entire book at once, but your copier can only copy a few pages at a time. At this time, you can divide the book into several parts, copy each part separately, and finally join all the copied parts in order, thus completing the copy of the entire book.

In tensor parallelism, the large model we are dealing with is like that thick book, and each GPU is like a copier. Because a single GPU cannot process the entire large model at once, we divide the model (more precisely, its weight tensors) into several parts and let different GPUs process them separately (like copying different parts of the book). When processing input data, each GPU handles its own part, and the partial results are then stitched together into the complete output.

In this way, by sharing the work, multiple GPUs work together to complete a large task that cannot be completed by a single GPU. This is how tensor parallelism works, and it allows us to handle those very large models.

 

Picture from article[13]

Tensor parallelism is used to deploy a large model across multiple GPUs. Take matrix multiplication as an example: when the input tensor is multiplied by the first weight tensor, the operation can be viewed as splitting the weight tensor by columns, multiplying the input tensor by each column shard separately, and then concatenating the partial products. These partial outputs are gathered from the GPUs and combined to form the final result, as shown in the figure above; see also the article [13].
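The column-split idea can be sketched in a few lines of PyTorch. This toy version runs on a single device and only simulates how the weight matrix would be sharded by columns across GPUs; a real tensor-parallel setup places each shard on a different GPU and gathers the partial outputs with collective communication.

```python
import torch

def column_parallel_matmul(x, weight, num_shards=2):
    """Simulate column-wise tensor parallelism for y = x @ weight on one device."""
    shards = torch.chunk(weight, num_shards, dim=1)   # each column shard would live on its own GPU
    partials = [x @ shard for shard in shards]        # each GPU computes its slice of the output
    return torch.cat(partials, dim=1)                 # gather the column slices into the full output

x = torch.randn(4, 1024)
w = torch.randn(1024, 4096)
print(torch.allclose(column_parallel_matmul(x, w), x @ w, atol=1e-4))  # True
```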

8. Recommended inference frameworks

In the previous sections, we discussed several acceleration and optimization techniques, such as Flash Attention, Page Attention, MOE, and tensor parallelism. Next, based on our own hands-on evaluation, we recommend some inference frameworks that currently perform well.

 

9. Summary and Outlook

In this article, we explored a series of technologies and methods designed to improve large model inference speed, including but not limited to Flash Attention, Page Attention, MOE, and tensor parallelism. By deploying dedicated large model inference clusters in batches in the production environment, we successfully increased inference speed by 50%, including for 70B-scale models, and applied these technologies stably in production, demonstrating the effectiveness and practicality of these optimization methods.

As large models are used in more and more fields, effectively improving inference speed while reducing inference cost has become a challenge. Our practice not only demonstrates some currently available acceleration technologies, but also recommends, based on our experience, several large model inference frameworks that performed well in evaluation. These suggestions are intended to help readers select the inference framework that best suits their needs when faced with many choices.

Looking to the future, with the continuous advancement of technology and the emergence of new algorithms, we believe that more acceleration optimization technologies will be developed to further improve the efficiency of large model inference. Finally, we also look forward to the opportunity to deeply discuss and introduce more new technologies and methods to improve the speed of large model inference in the future.

 

References

[1] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness(https://arxiv.org/abs/2205.14135)

[2] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning(https://arxiv.org/abs/2307.08691)

[3] Efficient Memory Management for Large Language Model Serving with PagedAttention(https://arxiv.org/abs/2309.06180)

[4] mixtral-of-experts(https://mistral.ai/news/mixtral-of-experts/)

[5] Mixtral of Experts(https://arxiv.org/abs/2401.04088)

[6] MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads(https://arxiv.org/pdf/2401.10774.pdf)

[7] Scaling Laws for Neural Language Models(https://arxiv.org/pdf/2001.08361.pdf)

[8] Hyung Won Chung(OpenAI), Large Language Models (in 2023) , talked at Seoul National University

[9] Llama 2: Open Foundation and Fine-Tuned Chat Models(https://arxiv.org/abs/2307.09288)

[10] Attention Is All You Need(https://arxiv.org/pdf/1706.03762.pdf)

[11] https://blog.vllm.ai/2023/06/20/vllm.html

[12] https://arxiv.org/pdf/2401.04088.pdf

[13] https://huggingface.co/docs/text-generation-inference/en/conceptual/tensor_parallelism

[14] https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention


* Text/  linggong

This article is original to Dewu Technology. For more exciting articles, please see: Dewu Technology

Reprinting without the permission of Dewu Technology is strictly prohibited, otherwise legal liability will be pursued according to law!
