Model quantization and its application in LLMs | Dewu Technology

1. Model inference optimization

As models are deployed in more and more scenarios, inference acceleration has long been an important part of AI engineering. In recent years, large models based on the Transformer architecture have become mainstream and achieved SoTA results on a wide range of tasks. Their high training and inference costs make deploying them at a reasonable cost all the more important.

The challenges faced by large model inference mainly include the following two points:

  • The huge memory (GPU memory) requirement, which comes mainly from the model parameters themselves and the transient needs of inference.
    • For a LLaMA2-30B model, simply loading the model into GPU memory takes about 60 GB. During inference, the KV cache of a single token takes about 1.6 MB: 6656 (hidden dim) × 60 (layers) × 2 (K & V) × 2 bytes (fp16); a request of 2048 tokens therefore needs about 3.3 GB of GPU memory (see the rough estimate after this list).
  • Poor parallelism: generation is usually serial in time, which makes the decoding process hard to parallelize and turns it into a computational bottleneck.
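As a back-of-the-envelope check of the KV-cache numbers above, here is a tiny Python sketch (the function name and defaults are only illustrative):

```python
def kv_cache_footprint(hidden_dim=6656, n_layers=60, seq_len=2048, bytes_per_elem=2):
    """Rough KV-cache size: hidden_dim * n_layers * 2 (K and V) * bytes_per_elem per token."""
    per_token = hidden_dim * n_layers * 2 * bytes_per_elem
    return per_token, per_token * seq_len

per_token, total = kv_cache_footprint()
print(f"{per_token / 1e6:.2f} MB per token, {total / 1e9:.2f} GB for a 2048-token request")
# -> roughly 1.6 MB per token and about 3.3 GB per request
```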

Common inference optimization methods include Knowledge Distillation (KD), Pruning and Quantization, as well as various solutions proposed for LLM memory optimization (such as FlashAttention, PagedAttention, etc.).

Distillation directly constructs a small model as a student model and supervises it with the knowledge of the original model through a combination of soft labels and original labels, so that the small model reaches performance comparable to the original model; the large model is then replaced by the small one to improve inference efficiency.

 

[Image source: Knowledge Distillation: A survey, 2021, p2]

Pruning "slims down" the model by removing unimportant weights, improving its inference efficiency. To preserve the model's capability, the pruning process usually needs to be accompanied by fine-tuning on training data. Depending on the dimensions along which weights are pruned, it can be divided into structured pruning and unstructured pruning.

  • Structured pruning: usually prunes unimportant channels in blocks along one or more dimensions of the weight tensor while keeping normal matrix multiplication; however, because pruned channels affect the inference of the preceding and following layers, the logical accuracy of the network needs to be checked.
  • Unstructured pruning: randomly prunes unimportant elements of the weight tensor, so it usually keeps the original weight shape, leading to sparse multiplications that general-purpose hardware cannot exploit; special hardware is needed to obtain acceleration.

So far pruning has seen few applications in LLMs. For example, the Activation-aware pruning work below [1] performs unstructured pruning mainly based on the absolute values of the weights themselves and of the input tensor; this makes the weight tensor itself sparse, but the model's accuracy loss does not meet engineering requirements.

 

[Image source: A Simple and Effective Pruning Approach for Large Language Models, 2023, p2]

As shown below, a recent structured-pruning work [2] uses search methods to find substructures in the model and maintains accuracy through retraining. The accuracy of the pruned model drops considerably relative to the original model; it can only be compared with other smaller models of the same (post-pruning) parameter count to show the value of the method.

 

[Image source: Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, 2023, p3]

 

[Image source: huggingface/Sheared-llama-1.3B]

Quantization has become the first choice for neural networks and LLMs because of the following advantages:

  • An intuitive reduction in GPU memory.
    • LLM weights are generally stored in FP16; after the weights are quantized to int4, their size intuitively shrinks to about 1/4 of the original (in practice slightly more, because of non-quantized embeddings, memory allocation and other overheads), greatly reducing the GPU memory requirement (see the rough arithmetic after this list).
  • Acceleration from operators such as W4A16 and W8A16, which speed up the computation.
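A quick sanity check of the memory figures (a hypothetical helper; embeddings and other non-quantized parts are ignored):

```python
def weight_memory_gb(n_params_billion=30, bits=16):
    """Weight-only memory footprint in GB: parameters * bits / 8 bytes."""
    return n_params_billion * 1e9 * bits / 8 / 1e9

print(weight_memory_gb(30, 16))  # ~60 GB for fp16 weights
print(weight_memory_gb(30, 4))   # ~15 GB for int4 weights, roughly 1/4
```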

2. Introduction to quantization

Basics

The essence of quantization is usually to convert the model's parameters, or the entire model's inference process, from floating point to integer.

The quantization parameters usually consist of two values, scale and zero-point: the former is a floating-point number and the latter an integer. Let x be a tensor (it can be a weight or an intermediate variable of inference); the quantization process can be expressed as follows:

x_int = clamp(round(x / s) + z; q_min, q_max)

Here b denotes the quantization bit width, and q_min and q_max denote the bounds of the integer range; for example, int8 quantization can use [-128, 127], i.e. q_min = -2^(b-1) = -128 and q_max = 2^(b-1) - 1 = 127. clamp(a; q_min, q_max) truncates the input value a to the range [q_min, q_max], x_int denotes the quantized result, and s and z denote the quantization parameters scale and zero-point.

 

 

[Image source: A Survey of Quantization Methods for Efficient Neural Network Inference, 2021, p5; An Introduction to Quantization of Large Language Models, p12]

The dequantization process from integer to floating point is as follows,

x_hat = s · (x_int − z)

Regarding the quantization parameters, there are many algorithms based on search, optimization, LKD (layer-by-layer distillation), etc. for computing their optimal values so as to minimize the accuracy loss caused by quantization; the most direct way to compute scale and zero-point is from the min/max of the tensor elements:

s = (max(x) − min(x)) / (q_max − q_min),   z = round(q_min − min(x) / s)

Below is a minimal sketch of the procedure x → x_int → x_hat, i.e. quantizing a tensor x from fp32 to int8 and dequantizing it back to fp32 (written here in PyTorch, assuming per-tensor asymmetric min/max quantization):

 
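```python
import torch

def quantize(x, bits=8):
    """Asymmetric per-tensor quantization of x to int8 using min/max calibration."""
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    s = (x.max() - x.min()) / (qmax - qmin)          # scale
    z = torch.round(qmin - x.min() / s)              # zero-point
    x_int = torch.clamp(torch.round(x / s) + z, qmin, qmax)
    return x_int.to(torch.int8), s, z

def dequantize(x_int, s, z):
    """Map the integer tensor back to floating point: x_hat = s * (x_int - z)."""
    return s * (x_int.float() - z)

x = torch.randn(4, 4)              # fp32 tensor before quantization
x_int, s, z = quantize(x)
x_hat = dequantize(x_int, s, z)
print(x)       # x before quantization
print(x_hat)   # x_hat after quantize -> dequantize
```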

Printing x before quantization and x_hat after dequantization shows that the two differ only by small rounding errors.

Symmetric / asymmetric

Compared with asymmetric quantization, symmetric quantization is defined by an integer range that is symmetric around zero, i.e. the zero-point in the formula above is 0 and q_max = -q_min, which makes the quantization expression simpler.

Asymmetric quantization helps make full use of the quantization range. For example, the activation tensor output by Conv+ReLU has only positive values; with symmetric quantization, all floating-point values are mapped to [0, 127], leaving half of the range unused, so its quantization precision is worse than that of asymmetric quantization.

 

[Image source: A Survey of Quantization Methods for Efficient Neural Network Inference, 2021, p5]

In practice, the weight tensor is often quantized symmetrically and the input tensor asymmetrically. The following analysis is taken from Qualcomm's quantization white paper [3]. For example, when asymmetric quantization is chosen for both weights and inputs, taking the matrix multiplication of a Linear layer as an example, the expression expands as follows:

W_hat · x_hat = s_W (W_int − z_W) · s_x (x_int − z_x)
            = s_W s_x W_int x_int − s_W s_x z_W x_int − s_W s_x z_x W_int + s_W s_x z_W z_x

  • The first term is a multiplication of integer tensors, which must be computed at inference time;
  • The third and fourth terms involve only scales, zero-points and the integer weights; all of these are known in advance, so they can be precomputed and added as an offset;
  • The second term depends on x_int and has to be recomputed for every inference, which costs extra compute.

Therefore, when the weight quantization is switched to symmetric quantization (z_W = 0), the formula above simplifies as follows: at inference time only the matrix multiplication of the first term has to be computed, while the second term is a precomputed bias:

W_hat · x_hat = s_W s_x W_int x_int − s_W s_x z_x W_int

When both are symmetrically quantized, the expressions are simplified as follows:

W_hat · x_hat = s_W s_x W_int x_int

Compared with the floating-point computation W x of the original model, W_int x_int is a multiplication between integers, which is much faster on NVIDIA GPUs; this is where the large inference speedup of a quantized model comes from.
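To make the precomputed-offset point concrete, here is a minimal PyTorch sketch of an int8 Linear with symmetric weights and asymmetric activations (a hypothetical helper; per-row weight scales and a per-tensor activation scale are assumed, and the integer accumulation is emulated in floating point):

```python
import torch

def quant_linear_int8(W, x):
    """y = W @ x with symmetric weight / asymmetric activation quantization,
    folding the zero-point contribution into a precomputed offset.
    W: (K, M), x: (M, N)."""
    # symmetric weight quantization (z_W = 0), one scale per output row
    s_W = W.abs().amax(dim=1, keepdim=True) / 127
    W_int = torch.clamp(torch.round(W / s_W), -128, 127)

    # asymmetric per-tensor activation quantization
    s_x = (x.max() - x.min()) / 255
    z_x = torch.round(-128 - x.min() / s_x)
    x_int = torch.clamp(torch.round(x / s_x) + z_x, -128, 127)

    # precomputable offset: s_W * s_x * z_x * sum of W_int over the input dim
    offset = s_W * s_x * z_x * W_int.sum(dim=1, keepdim=True)

    # the only runtime matmul is integer-by-integer (emulated in fp here)
    acc = W_int @ x_int
    return s_W * s_x * acc - offset
```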

3. Quantization of LLMs

Challenges in LLM Quantization

From the perspective of model performance, the premise that quantization must satisfy throughout is how to maintain the accuracy of the quantized model, i.e. to make users of the model feel that it keeps the original performance while improving inference efficiency.

The operations that need to be quantized in a neural network are mainly the convolutional layer Conv(x; W) and the fully connected layer Wx, i.e. the weight quantization (WQ) of W and the activation quantization (AQ) of x, following the procedure described in the previous section.

Unlike CNN models or small Transformer models, the activation tensors produced by the matrix multiplications of large Transformer models usually contain more outliers, i.e. values far away from the cluster formed by the bulk of the distribution. These elements, with large absolute values but low frequency, make quantization harder, and how to treat outliers is usually a major difficulty in quantization work: if they are fully covered, the quantization range becomes too large and the effective resolution of quantization shrinks; if they are truncated too aggressively, these large-magnitude values, which usually have a strong influence on the results during inference, lead to poor model performance. The latter problem is particularly obvious in LLM quantization.

The figures below show the element-value statistics of an input tensor of one layer of Resnet18 and of Opt-13B, respectively; sigma denotes the standard deviation of each distribution. The maximum input value of Resnet18 is about 28 sigma, and the proportion of absolute values beyond 6 sigma is 0.05%; the maximum input value of the Opt-13B network is 325 sigma, and the proportion of absolute values beyond 6 sigma is 0.2%. In terms of quantization quality, the int8 accuracy of Resnet18 shows essentially no loss, while the int8 accuracy of Opt-13B collapses.

 

[Image source: An Introduction to Quantization of Large Language Models, p20]

To address the challenge of activation quantization, some solutions try to reduce the difficulty of quantizing activations, such as the idea proposed by SmoothQuant [4].

 

 

[Image source: SmoothQuant,p4]

In the matrix multiplication Y = X·W, SmoothQuant scales down the input tensor as X·diag(s)^(-1) and scales up the weight as diag(s)·W, which reduces the quantization difficulty of the tensor X while keeping the product of the multiplication unchanged. In actual engineering, however, the quantization error introduced by this scheme still has a fairly obvious impact on the inference quality of large models, with noticeable errors even at int8 precision. For example, the following SmoothQuant results on Llama2-7B show that its perplexity is very poor and hard to use in practice.

 
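For intuition, the per-channel smoothing factor of SmoothQuant, s_j = max|X_j|^α / max|W_j|^(1−α), can be sketched as follows (a minimal illustration, not the official implementation):

```python
import torch

def smooth_scales(X, W, alpha=0.5):
    """SmoothQuant-style smoothing factors; X: (tokens, channels), W: (out, channels)."""
    x_max = X.abs().amax(dim=0)                     # per-channel activation magnitude
    w_max = W.abs().amax(dim=0)                     # per-channel weight magnitude
    return (x_max.pow(alpha) / w_max.pow(1 - alpha)).clamp(min=1e-5)

# Migrating quantization difficulty from activations to weights:
# Y = (X / s) @ (W * s).T  ==  X @ W.T   (mathematically unchanged)
```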

Therefore, most solutions that are practical in current engineering deployments are weight-only quantization schemes, i.e. quantization of the activations is given up.

GPTQ

GPTQ was among the earliest quantization schemes to be accepted for engineering deployment; its W8A16 or W4A16 quantization is close to the original model in most scenarios, and the quantization process is very fast.

Quantization process

Taking the basic unit of computation, matrix multiplication, as an example, and starting from the mean squared error of the product before and after weight-only quantization, the following optimization objective can be written:

argmin_{W_hat} || W X − W_hat X ||_2^2

W is a Linear-layer weight in the Transformer, and X denotes the corresponding input. Offline quantization proceeds module by module (Transformer block by Transformer block) and layer by layer within each block (Q, K, V, O, Fc1, Fc2).

The parameters and data are defined as follows:

  • W ∈ R^{K×M}, X ∈ R^{M×N}, Y = W×X ∈ R^{K×N}
  • calibration set: a small amount of data run through the model to observe the value range of each layer's input tensor, on which the quantization is based.

The specific quantization process is as follows (a simplified code sketch follows this list):

  • Compute the Hessian (the Hessian of the optimization objective above with respect to W_hat, not the Hessian used in backpropagation) and add a damping term:

H = 2 X X^T,   H ← H + λI

  • act_order sorting (desc_act, so that columns with similar value ranges are quantized together): based on diag(H), the columns of W are reordered along the M dimension; H is reordered accordingly in both of its dimensions.
  • Compute the inverse H^(-1) (via Cholesky decomposition).
  • For W, along the M dimension, quantize block by block from left to right with block size B = 128; the not-yet-quantized part to the right is updated based on H^(-1) to compensate for the quantization loss.

 

  • (Inner loop) Within each block, quantize column by column, compute the quantization error, and update the not-yet-quantized columns inside the block based on this error.

 

 

  • (Outer loop) After finishing a block, update all the columns that follow it:

 
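A greatly simplified sketch of the loop above (per-row scales, g = -1, no desc_act reordering; an illustration rather than the official GPTQ implementation):

```python
import torch

def gptq_quantize(W, X, bits=4, blocksize=128, damp=0.01):
    """Block-wise/column-wise GPTQ-style quantization sketch.
    W: (K, M) weight, X: (M, N) calibration inputs."""
    K, M = W.shape
    W = W.clone().float()
    qmax = 2 ** (bits - 1) - 1

    # per-row symmetric quantization grid (the g = -1 case)
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax

    # Hessian of the layer objective ||WX - W_hat X||^2, plus damping
    H = 2 * X.float() @ X.float().T
    H += damp * torch.diag(H).mean() * torch.eye(M)

    # upper Cholesky factor of H^{-1}, used for the error-compensation updates
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True)

    Q = torch.zeros_like(W)
    for b in range(0, M, blocksize):                 # outer loop over column blocks
        e = min(b + blocksize, M)
        Err = torch.zeros(K, e - b)
        for j in range(b, e):                        # inner loop: column by column
            w = W[:, j]
            q = torch.clamp(torch.round(w / scale[:, 0]), -qmax - 1, qmax)
            Q[:, j] = q
            err = (w - q * scale[:, 0]) / Hinv[j, j]
            # compensate the not-yet-quantized columns inside the block
            W[:, j + 1:e] -= err.unsqueeze(1) * Hinv[j, j + 1:e].unsqueeze(0)
            Err[:, j - b] = err
        # after the block: propagate its accumulated error to all later columns
        W[:, e:] -= Err @ Hinv[b:e, e:]
    return Q, scale
```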

group_size

  • If group size is not specified, the default g = -1 is used: the quantization parameters are computed over all columns, with one set per row of the weight; for W ∈ R^{K×M}, the number of quantization parameters is K×1.

 

  • If group size is specified, e.g. g = 128, the quantization parameters are computed per group of 128 columns, with one set per row and per group; for W ∈ R^{K×M}, the number of quantization parameters is K×(M/g). A small sketch of both cases follows.

 
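A tiny sketch of the two cases (per-row and per-group symmetric scales are assumed for illustration):

```python
import torch

def weight_scales(W, bits=4, group_size=-1):
    """Per-row (g = -1) or per-group (e.g. g = 128) symmetric scales for W of shape (K, M)."""
    qmax = 2 ** (bits - 1) - 1
    K, M = W.shape
    if group_size == -1:
        return W.abs().amax(dim=1, keepdim=True) / qmax        # shape (K, 1)
    Wg = W.reshape(K, M // group_size, group_size)             # assumes M divisible by g
    return Wg.abs().amax(dim=2) / qmax                          # shape (K, M/g)
```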

Reordering (desc_act)

According to the Hessian matrix H, W is reordered along the M dimension based on diag(H). The purpose is to quantize first the weight columns corresponding to activations with larger absolute values; these columns are regarded as more important, with greater influence on the results during inference, so the goal is to incur as little error as possible when quantizing them, shifting more of the quantization error to the later, less important columns.

Experiments show that desc_act is an effective trick for reducing quantization loss in most tasks.
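In code, the reordering itself is just a permutation driven by diag(H) (a hypothetical standalone helper):

```python
import torch

def desc_act_reorder(W, H):
    """Put the columns with the largest diag(H), i.e. the most 'important' activations,
    first, so they are quantized earliest and absorb the least compensation error."""
    perm = torch.argsort(torch.diag(H), descending=True)
    return W[:, perm], H[perm][:, perm], perm
```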

 

Perplexity of Pygmalion-7B with GPTQ [7]

[Image source: https://huggingface.co/reeducator/vicuna-13b-free/discussions/22]

operator

Strictly speaking, weight-only W4A16 does not bring much efficiency gain over the original W16A16 in the multiplication itself, and the quant/dequant steps are added to inference; but as weight-only has become the mainstream of LLM quantization and its applications keep growing, more and more open-source work focuses on writing efficient W4A16 operators to speed up inference of quantized models. For example, GPTQ's Python package AutoGPTQ has integrated the open-source tool exllama, which rewrites the parallel computation of quantized multiplication based on Triton and CUDA. In exllama/exllama_ext/matrix.cuh you can see dot_product8_h's implementation of out = W_hat·x = (W_int − z)s·x = (W_int − z)x·s.
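Conceptually (shown unfused and in PyTorch rather than CUDA), what such a weight-only kernel computes is simply:

```python
import torch

def w4a16_matmul(W_int, z, s, x):
    """out = W_hat @ x = ((W_int - z) * s) @ x; a real kernel keeps W_int packed
    in 4 bits and fuses the dequantization into the fp16 matrix multiplication."""
    return ((W_int.float() - z) * s) @ x.float()
```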

 

[Image source: https://github.com/turboderp/exllama/blob/3b013cd53c7d413cf99ca04c7c28dd5c95117c0d/exllama_ext/matrix.cuh#L86]

AWQ

Compared with GPTQ, which derives its solution from an optimization problem, AWQ is a quantization scheme based on search.

Using Q(·) to denote the quantization plus dequantization process, the quantization before modification is as follows:

Q(w) = Δ · Round(w / Δ),   Δ = max(|w|) / 2^(N−1)

After the modification, a scaling is applied to W and the quantization process becomes:

Q(w · s) · (x / s) = Δ' · Round(w · s / Δ') · x · (1 / s)

search

AWQ stands for Activation-aware Weight Quantization, meaning that the influence of activation values is taken into account when quantizing the weights. The starting point is that, among the channels of the weight, those whose corresponding activation values are larger are relatively more important, and vice versa; this importance is then reflected by multiplying them with a scaling factor s, whose value is derived from the value range of the input activation tensor:

s = s_X^α,   α* = argmin_{α ∈ [0, 1]} L(s_X^α)

The search criterion is the comparison of the Linear layer's output before and after quantization, and the configuration with the smallest MSE is taken as the optimal solution:

L(s) = || Q(W · diag(s)) (diag(s)^(−1) · X) − W X ||
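A minimal sketch of this grid search (assuming the form s = s_x^α with α searched on a uniform grid; simplified relative to the paper's implementation):

```python
import torch

def awq_search_scale(W, X, n_grid=20, bits=4):
    """Search the per-channel scale s = s_x ** alpha that minimizes output MSE.
    W: (out_features, in_features), X: (tokens, in_features)."""
    qmax = 2 ** (bits - 1) - 1

    def quantize(w):
        # per-output-channel symmetric weight quantization + dequantization
        scale = w.abs().amax(dim=1, keepdim=True) / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    s_x = X.abs().mean(dim=0)              # per-input-channel activation magnitude
    ref = X @ W.T                          # fp reference output of the linear layer
    best_err, best_s = float("inf"), None
    for i in range(n_grid):
        alpha = i / n_grid
        s = s_x.clamp(min=1e-4) ** alpha
        out = (X / s) @ quantize(W * s).T  # Q(W * diag(s)) applied to X * diag(s)^-1
        err = (out - ref).pow(2).mean().item()
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```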

Effect

In terms of model performance, the optimal scaling coefficients are found through a layer-by-layer scale search, yielding the solution with the smallest quantization error. The following comparison from the AWQ paper shows the quantization results on two generations of Llama, measured by perplexity: AWQ is slightly better than GPTQ and the reordered version of GPTQ.

 

[Image source: AWQ, p6]

Judging by the accuracy of actual tasks, AWQ is comparable to the act_order version of GPTQ (GPTQ-R), while being faster than the latter.

 

[Image source: AWQ, p5]

In terms of compute performance, GPTQ has a reorder operation and its matrix multiplication is MV (matrix × vector) with non-contiguous memory access, whereas AWQ has no reorder operation and its matrix multiplication is MM (matrix × matrix), which is faster.

4. Summary

The current SOTA results of LLM quantization work are basically achieved with the weight-only quantization mode; reducing the GPU memory needed to run the model is their main contribution.

In terms of model performance, since quantization loss is unavoidable and LLMs are usually far more sensitive to quantization than traditional CNN models, the quantized LLM does not differ much from the unquantized one on many tasks, but on some tasks it may still fall short.

In terms of model acceleration, weight-only quantization drives low-level acceleration work mainly on multiplication operators such as W4A16, W3A16 and W8A16. Judging from the theoretical numbers reported in the papers, the speedup over the FP16 model is usually only about 1.x to 3.x, and the actually deployed speedup may be lower than that; its acceleration is far behind traditional quantization methods with fully integer multiplication operators such as W4A4 and W8A8.

Overall, quantization work in the LLM field is still preliminary. If a task demands very high model accuracy, it is recommended to rely instead on algorithms and tools that improve per-unit-memory throughput based on KV-cache optimization, such as FlashAttention-2, PagedAttention, etc.

5. References

1. A Simple and Effective Pruning Approach for Large Language Models, 2023.

2. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, 2023.

3. A White Paper on Neural Network Quantization, 2021.

4. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, 2023.

5. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, 2023.

6. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, 2023.

7. Some evaluation on GPTQ performance.

 

*Text/ xujiong

This article is original to Dewu Technology. For more exciting articles, please see: Dewu Technology official website

Reprinting without the permission of Dewu Technology is strictly prohibited, otherwise legal liability will be pursued according to law!
