FQ-ViT, the first fully quantized Vision Transformer method: large models are not far away


Paper address: https://arxiv.org/pdf/2111.13824.pdf

Project code: https://github.com/megvii-research/FQ-ViT

Computer Vision Research Institute column

Quantizing network models can significantly reduce the complexity of model inference, and quantization has been widely used in practical deployments. However, most existing quantization methods are developed mainly for convolutional neural networks and suffer severe accuracy drops when applied to fully quantized Vision Transformers. Today we share a new technique that achieves high-accuracy quantized ViT deployment. Is deploying large AI models still far away from us?


01

Overview

The Transformer is the foundation of today's popular AIGC pre-trained large models, and ViT (Vision Transformer) is what truly brought the Transformer from natural language processing into the vision domain. Looking back at the Transformer's development, roughly three years of dormancy passed between the original Transformer and its application to vision, and another two or three years between ViT and the rise of AIGC. AIGC really started to take off at the end of 2022, when the first wave of interesting AIGC algorithms was gradually released; by now, interesting AIGC projects are emerging one after another.


With the intensive research on the Vision Transformer (ViT) over the past two years, the expressive power of ViT has been continuously improved, and it has achieved substantial performance breakthroughs on most basic vision tasks (classification, detection, segmentation, etc.). However, many practical application scenarios demand real-time inference, and most lightweight ViTs still cannot reach the speed of lightweight CNNs (such as MobileNet) across deployment targets (GPU, CPU, ONNX, mobile, etc.).

Therefore, the researchers revisited the two modules unique to ViT, LayerNorm and Softmax, and identified the causes of the quantization degradation as follows:

  • The researchers found that the inter-channel variation of the LayerNorm input is severe, and the range of some channels even exceeds 40 times the median value. Traditional layer-wise methods cannot handle such large activation fluctuations, which leads to large quantization errors (a simple way to observe this channel-wise variation is sketched after this list).

  • It was also found that the values of the attention map have an extremely uneven distribution: most values are clustered between 0 and 0.01, while a few high attention values are close to 1.
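The channel-wise variation mentioned above is easy to reproduce. Below is a minimal sketch that hooks the inputs of every LayerNorm in a ViT-style model and records their per-channel minimum and maximum. The model name, the use of timm, and the random calibration batch are illustrative assumptions, not part of the FQ-ViT code:

```python
import torch
import torch.nn as nn
import timm  # assumption: timm is installed and can download pretrained weights

# Illustrative model choice; any ViT-style model containing nn.LayerNorm works.
model = timm.create_model("deit_tiny_patch16_224", pretrained=True).eval()

stats = {}  # name -> (channel-wise min, channel-wise max)

def make_pre_hook(name):
    def pre_hook(module, inputs):
        x = inputs[0]                      # [B, N, C] tokens entering LayerNorm
        cmin = x.amin(dim=(0, 1))          # per-channel minimum
        cmax = x.amax(dim=(0, 1))          # per-channel maximum
        if name in stats:
            cmin = torch.minimum(cmin, stats[name][0])
            cmax = torch.maximum(cmax, stats[name][1])
        stats[name] = (cmin, cmax)
    return pre_hook

for name, m in model.named_modules():
    if isinstance(m, nn.LayerNorm):
        m.register_forward_pre_hook(make_pre_hook(name))

images = torch.randn(8, 3, 224, 224)       # stand-in for a real calibration batch
with torch.no_grad():
    model(images)

for name, (cmin, cmax) in stats.items():
    rng = cmax - cmin
    ratio = (rng.max() / rng.median()).item()
    print(f"{name}: max channel range / median channel range = {ratio:.1f}")
```

With real calibration images, the printed ratios make the severity of the inter-channel variation directly visible layer by layer.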

Based on the above analysis, the researchers proposed the Power-of-Two Factor (PTF) to quantize the input of LayerNorm. In this way, the quantization error is greatly reduced, while the overall computational efficiency remains the same as layer-wise quantization thanks to the bit-shift operator. They also proposed Log-Int-Softmax (LIS), which provides higher quantization resolution for small values and more efficient integer-only inference for Softmax. Combining these approaches, this paper achieves post-training quantization of a fully quantized Vision Transformer for the first time.


02

New Framework

The figures below show that there is severe channel-to-channel variation in Vision Transformers compared with CNNs, which leads to unacceptable quantization errors for layer-wise quantization.

[Figure: channel-wise variation of activations in Vision Transformers compared with CNNs]

First, the network quantization notation is defined. Assuming the quantization bit width is b, the quantizer Q(X|b) can be formulated as a function that maps a floating-point number X ∈ R to the nearest quantization bin:

$$Q(X \mid b):\ \mathbb{R} \rightarrow \{0,\ 1,\ \dots,\ 2^{b}-1\}$$

Uniform Quantization

Uniform Quantization is well supported on most hardware platforms. Its quantizer Q(X|b) can be defined as:

$$X_Q = Q(X \mid b) = \mathrm{clip}\!\left(\left\lfloor \frac{X}{s} \right\rceil + zp,\ 0,\ 2^{b}-1\right)$$

$$s = \frac{u - l}{2^{b}-1}, \qquad zp = \mathrm{clip}\!\left(\left\lfloor -\frac{l}{s} \right\rceil,\ 0,\ 2^{b}-1\right)$$

where s (scale) and zp (zero point) are quantization parameters determined by the lower bound l and upper bound u of X, which are usually the minimum and maximum values.
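For concreteness, here is a minimal PyTorch sketch of uniform Min-Max quantization as defined above. The function names, the per-tensor choice of l and u, and the example tensor shape are illustrative and not taken from the FQ-ViT code:

```python
import torch

def uniform_minmax_quant(x: torch.Tensor, b: int = 8):
    """Uniform Min-Max quantization: map x to integer codes in [0, 2^b - 1]."""
    l, u = x.min(), x.max()                                 # lower / upper bound of X
    s = (u - l) / (2 ** b - 1)                              # scale
    zp = torch.clamp(torch.round(-l / s), 0, 2 ** b - 1)    # zero point
    x_q = torch.clamp(torch.round(x / s) + zp, 0, 2 ** b - 1)
    return x_q, s, zp

def uniform_dequant(x_q, s, zp):
    """Map integer codes back to (approximate) floating-point values."""
    return (x_q - zp) * s

# Usage: quantize an illustrative [B, N, C] activation and measure the error.
x = torch.randn(2, 197, 192)
x_q, s, zp = uniform_minmax_quant(x, b=8)
x_hat = uniform_dequant(x_q, s, zp)
print("mean abs error:", (x - x_hat).abs().mean().item())
```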

Log2 Quantization

Log2 Quantization converts the quantization process from linear to exponential. Its quantizer Q(X|b) can be defined as:

$$X_Q = Q(X \mid b) = \mathrm{sign}(X)\cdot \mathrm{clip}\!\left(\left\lfloor -\log_2 \frac{|X|}{\max(|X|)} \right\rceil,\ 0,\ 2^{b-1}-1\right)$$
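For intuition, a minimal sketch of Log2 quantization restricted to non-negative inputs (such as Softmax outputs) follows. This is a simplified illustration rather than the FQ-ViT implementation; in particular, the clamping used to avoid taking the log of exact zeros is an assumption made here:

```python
import torch

def log2_quant(x: torch.Tensor, b: int = 4):
    """Log2 quantization for non-negative x: store exponents instead of linear codes."""
    x = x.clamp(min=1e-12)                         # avoid log of exact zeros (illustrative)
    exps = torch.round(-torch.log2(x / x.max()))   # powers of two below the maximum
    return torch.clamp(exps, 0, 2 ** b - 1)        # b-bit exponent codes

def log2_dequant(x_q: torch.Tensor, x_max: float = 1.0):
    """Dequantize by turning each stored exponent back into a power of two."""
    return x_max * torch.pow(2.0, -x_q)

# Usage: a softmax-like distribution with many tiny values and a few near 1.
attn = torch.softmax(torch.randn(1, 8, 16, 16) * 3, dim=-1)
codes = log2_quant(attn, b=4)
attn_hat = log2_dequant(codes, x_max=attn.max().item())
print("max abs error:", (attn - attn_hat).abs().max().item())
```

Because the codes are exponents, half of the available levels land below max(X)/256, which is exactly the densely populated small-value region of a Softmax output.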

In order to achieve a fully quantized Vision Transformer, the researchers quantize all modules, including Conv, Linear, MatMul, LayerNorm, Softmax, etc. In particular, uniform Min-Max quantization is used for the Conv, Linear and MatMul modules, and the following methods are used for LayerNorm and Softmax.

Power-of-Two Factor for LayerNorm Quantization

During inference, LayerNorm computes the statistics µX, σX at each forward step and normalizes the input X. Then, the affine parameters γ, β rescale the normalized input to another learned distribution.

As explained in the analysis at the beginning, unlike the BatchNorm commonly used in CNNs, LayerNorm cannot be folded into the previous layer because its statistics are computed dynamically, so it must be quantized separately. However, a significant performance drop is observed when post-training quantization is applied to it. Looking at the inputs of the LayerNorm layers, severe inter-channel variation is found.

The researchers proposed a simple and effective LayerNorm quantization method, namely the Power-of-Two Factor (PTF). The core idea of PTF is to equip different channels with different power-of-two factors rather than different quantization parameters. Given the quantization bit width b, the input activation X ∈ R^{B×L×C}, the layer-wise quantization parameters s, zp ∈ R^1, and the PTF α ∈ N^C, the quantized activation X_Q can be formulated as:

$$X_Q = \mathrm{clip}\!\left(\left\lfloor \frac{X}{2^{\alpha} s} \right\rceil + zp,\ 0,\ 2^{b}-1\right)$$

The layer-wise parameters s and zp, together with the per-channel factor α (chosen to minimize each channel's quantization error), are computed during calibration; the exact formulas are given in the paper.
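The following sketch illustrates the PTF idea under simplifying assumptions: a shared s and zp computed from the whole tensor, K = 3 candidate factors, and a brute-force per-channel search for α by reconstruction error. These choices are illustrative and do not reproduce the exact calibration procedure of the FQ-ViT code:

```python
import torch

def ptf_calibrate(x: torch.Tensor, b: int = 8, K: int = 3):
    """Per-channel Power-of-Two Factor on top of shared layer-wise s, zp.

    x: activation of shape [B, L, C] (the input of a LayerNorm).
    Returns (s, zp, alpha) with one integer alpha in [0, K] per channel.
    """
    # Shared parameters sized for the smallest effective step (illustrative choice).
    s = (x.max() - x.min()) / ((2 ** b - 1) * 2 ** K)
    zp = torch.clamp(torch.round(-x.min() / (s * 2 ** K)), 0, 2 ** b - 1)

    C = x.shape[-1]
    alpha = torch.zeros(C, dtype=torch.long)
    for c in range(C):                              # brute-force search per channel
        xc = x[..., c]
        best_err = None
        for a in range(K + 1):
            step = s * (2 ** a)
            xq = torch.clamp(torch.round(xc / step) + zp, 0, 2 ** b - 1)
            err = ((xq - zp) * step - xc).pow(2).sum()
            if best_err is None or err < best_err:
                best_err, alpha[c] = err, a
    return s, zp, alpha

def ptf_quantize(x, s, zp, alpha, b=8):
    """Quantize each channel with its own power-of-two-scaled step 2^alpha * s."""
    step = s * torch.pow(2.0, alpha.float())        # one step size per channel [C]
    return torch.clamp(torch.round(x / step) + zp, 0, 2 ** b - 1)

# Usage on a synthetic activation with strong channel-wise variation.
x = torch.randn(4, 197, 192) * torch.logspace(-1, 1.5, 192)
s, zp, alpha = ptf_calibrate(x)
x_q = ptf_quantize(x, s, zp, alpha)
```

Because all per-channel factors are powers of two, rescaling at inference time reduces to bit-shifts, which is why PTF keeps the efficiency of layer-wise quantization.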

Softmax quantized with Log-Int-Softmax (LIS)

Note that the storage and computation of the attention map is a bottleneck of the Transformer structure, so the researchers hope to quantize it to an extremely low bit width (for example, 4 bits). However, directly applying 4-bit uniform quantization causes severe accuracy degradation. It is observed that the Softmax output is concentrated at fairly small values, while only a few outliers have larger values close to 1. Based on the visualization below, for the densely populated small-value interval, Log2 quantization preserves far more quantization bins than uniform quantization.

[Figure: distribution of Softmax attention values and the quantization bins assigned by uniform vs. Log2 quantization]

Combining Log2 quantization with i-exp (the polynomial approximation of the exponential function proposed in I-BERT), the researchers propose LIS, an integer-only, faster, and lower-power Softmax.

The whole process is shown below.

[Figure: the inference flow of Log-Int-Softmax (LIS)]
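As a rough illustration of why log-domain attention codes are convenient at inference time, the sketch below quantizes a Softmax output to 4-bit exponent codes and applies it to integer values V using only arithmetic shifts. This is a simplified stand-in: the real LIS also replaces the floating-point Softmax itself with the integer-only i-exp approximation, which is omitted here, and all shapes are illustrative:

```python
import torch

def log_int_attention(scores: torch.Tensor, v_int: torch.Tensor, b: int = 4):
    """Apply a log2-quantized attention map to integer values V using shifts only.

    scores: raw attention logits [B, H, N, N]; v_int: int64 values [B, H, N, D].
    The softmax itself is still computed in floating point here for simplicity.
    """
    attn = torch.softmax(scores, dim=-1)
    # 4-bit exponent codes: attn ~ 2^(-code), so a larger code means a smaller weight.
    codes = torch.clamp(torch.round(-torch.log2(attn.clamp(min=1e-12))),
                        0, 2 ** b - 1).to(torch.int64)
    # v * 2^(-code) is approximated by an arithmetic right shift of the integer values.
    shifted = v_int.unsqueeze(-3) >> codes.unsqueeze(-1)    # [B, H, N, N, D]
    return shifted.sum(dim=-2)                              # integer output [B, H, N, D]

# Usage with small illustrative shapes.
B, H, N, D = 1, 2, 8, 16
scores = torch.randn(B, H, N, N) * 4
v_int = torch.randint(-128, 128, (B, H, N, D), dtype=torch.int64)
out_int = log_int_attention(scores, v_int)
print(out_int.shape)
```

Storing only the 4-bit exponent codes shrinks the attention map memory footprint, and the multiply-free accumulation is what makes the integer-only Softmax path attractive on low-power hardware.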

03

Experiment & Visualization

Comparison of top-1 accuracy with state-of-the-art methods on the ImageNet dataset:

[Table: top-1 accuracy comparison on ImageNet]

[Figure: attention maps under uniform quantization and LIS at 8, 6, and 4 bits]

The attention maps above visualize the difference between uniform quantization and LIS. When both use 8 bits, uniform quantization focuses on high-activation regions, while LIS preserves more texture in low-activation regions and thus keeps more of the relative ranking of the attention map. At 8 bits this difference does not matter much. However, when quantizing to lower bit widths, as shown in the 6-bit and 4-bit cases, uniform quantization degrades dramatically and even invalidates all regions of interest, whereas LIS still exhibits acceptable performance similar to 8 bits.

[Figure: channel-wise minimum and maximum values of Vision Transformers and ResNets]


 
