Let's talk about position encoding in large models and its length extrapolation

Author | Wang Jianing 

Compiled by | NewBeeNLP

https://wjn1996.blog.csdn.net/article/details/131753251


Nowadays, many large models support inference over contexts longer than 4096 tokens. For example, GPT-4 supports more than 30k tokens, and ChatGLM2-6B supports text up to 32K. However, due to GPU memory constraints, these models are usually not actually trained on such long texts; the pre-training length is typically only around 4k.

Therefore, how to ensure that the model can support, at inference time, lengths far beyond those seen during pre-training is one of the core issues for current large models. We refer to this problem as the extrapolation (length extrapolation) of large models.

The extrapolation of large models is currently approached mainly from two aspects, which are also the two most effective directions for improvement:

  • Finding or designing an appropriate position encoding;

  • Designing a local attention mechanism.

This article conducts an in-depth discussion of the position encoding and extrapolation issues of large models from these two aspects.

1. Basic introduction to position encoding

For a token $w_i$, denote its representation (embedding) vector as $\boldsymbol{x}_i$, and denote a sentence of $N$ tokens as $S_N=\{w_1,w_2,\dots,w_N\}$. The query, key and value of the tokens in this sentence are then produced by mapping functions that take both the token and its position:

$$\boldsymbol{q}_m=f_q(\boldsymbol{x}_m,m),\qquad \boldsymbol{k}_n=f_k(\boldsymbol{x}_n,n),\qquad \boldsymbol{v}_n=f_v(\boldsymbol{x}_n,n)$$

where $m$ and $n$ denote the $m$-th and $n$-th tokens.

1.1 Absolute position encoding

In the Transformer, sine and cosine functions are used to represent the absolute position. The formula is as follows:

$$PE_{(t,\,2i)}=\sin\!\left(\frac{t}{10000^{2i/d}}\right),\qquad PE_{(t,\,2i+1)}=\cos\!\left(\frac{t}{10000^{2i/d}}\right)$$

This encoding method is also called Sinusoidal encoding. Intuitively, the representation of the $t$-th position is a $d$-dimensional vector whose even-indexed elements are sine values and whose odd-indexed elements are cosine values.
The visualization is as follows:

[Figure: heat-map visualization of the Sinusoidal position encodings]
  • Adjacent positions have very similar encoding vectors while distant positions have very different ones, indicating that the sine- and cosine-based absolute encoding can reflect positional correlation;

  • No positions need to be learned explicitly, which improves efficiency.

Finally, the mapping function can be defined as follows; that is, the input representation is the token representation plus its corresponding absolute position representation:

$$f_{t:t\in\{q,k,v\}}(\boldsymbol{x}_i,i)=\boldsymbol{W}_{t}(\boldsymbol{x}_i+\boldsymbol{p}_i)$$

where $\boldsymbol{p}_i$ is the position vector of position $i$. In other words, the position representation and the word representation are usually simply added together.
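As a concrete illustration, here is a minimal numpy sketch of the Sinusoidal encoding and of adding it to the token embeddings; the function name and shapes are illustrative, not from the original post.

import numpy as np

def sinusoidal_position_encoding(max_len, d):
    """Sinusoidal absolute position encoding: even dimensions use sin,
    odd dimensions use cos, with geometrically increasing wavelengths."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d, 2)[None, :]                 # (1, d/2)
    angles = positions / np.power(10000.0, dims / d)   # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Input representation = token embedding + absolute position encoding, e.g.:
# x = token_embeddings + sinusoidal_position_encoding(seq_len, d)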

1.2 Relative position encoding

(1) Explicit relative position
For the tokens at the $m$-th and $n$-th positions, the relative position can be expressed as

$$r=\mathrm{clip}(m-n,\ r_{\min},\ r_{\max})$$

i.e., the relative distance between the two tokens, clipped by a maximum and a minimum value (the relative position can neither exceed the maximum nor fall below the minimum).

Therefore, compared with absolute positions, only a finite set of relative position vectors is needed: when computing the attention between two tokens, it suffices to inject the representation of their relative position into the attention computation. The relative position vectors themselves can be either trainable embeddings or trigonometric (Sinusoidal) encodings.

In this way, a limited number of position encodings can express relative positions of any length (because of the clipping); whether trainable vectors or trigonometric functions are chosen, text of any length can be handled.

These relative position vectors are usually added directly to the word (key/value) representations inside the attention computation.
Reference paper: "Self-Attention with Relative Position Representations"
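A small sketch of the clipped relative-position indices used in this scheme (the clipping distance k and the index shift are illustrative choices, not values from the paper):

import numpy as np

def clipped_relative_positions(seq_len, k=16):
    """Relative positions clip(n - m, -k, k), shifted to [0, 2k] so that
    they can index a single embedding table with 2k + 1 entries."""
    pos = np.arange(seq_len)
    rel = np.clip(pos[None, :] - pos[:, None], -k, k)  # entry (m, n) = clip(n - m)
    return rel + k

# Example: clipped_relative_positions(6, k=2) for a 6-token sentence.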

(2) Transformer-XL (XLNet)

The QK product of the $i$-th and $j$-th positions (with absolute position vectors added to the inputs) is decomposed, and learnable parameters are introduced:

$$\boldsymbol{q}_i^{\top}\boldsymbol{k}_j=(\boldsymbol{x}_i+\boldsymbol{p}_i)^{\top}\boldsymbol{W}_q^{\top}\boldsymbol{W}_k(\boldsymbol{x}_j+\boldsymbol{p}_j)=\boldsymbol{x}_i^{\top}\boldsymbol{W}_q^{\top}\boldsymbol{W}_k\boldsymbol{x}_j+\boldsymbol{x}_i^{\top}\boldsymbol{W}_q^{\top}\boldsymbol{W}_k\boldsymbol{p}_j+\boldsymbol{p}_i^{\top}\boldsymbol{W}_q^{\top}\boldsymbol{W}_k\boldsymbol{x}_j+\boldsymbol{p}_i^{\top}\boldsymbol{W}_q^{\top}\boldsymbol{W}_k\boldsymbol{p}_j$$

where $\boldsymbol{p}_i$ and $\boldsymbol{p}_j$ are the (learnable) position vectors of the $i$-th and $j$-th tokens. Through this decomposition, position information is injected into the attention computation.

This kind of representation is usually integrated via the absolute positions during the attention computation.

(3) Improvements of Transformer-XL

In this decomposition, the second and fourth terms replace the absolute position representation with a relative one, and at the same time two new trainable parameters $\boldsymbol{u}$ and $\boldsymbol{v}$ are introduced.

This representation method was used for the first time in the T5 model. Reference paper: "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"

This kind of representation is usually integrated via the relative positions during the attention computation.

(4) Trainable bias term

The position representation in Transformer-XL decomposes QK into four terms, the last three of which contain position-related parameters. The sum of these last three terms can be abstracted directly into a bias (offset) term:

$$\boldsymbol{q}_i^{\top}\boldsymbol{k}_j=\boldsymbol{x}_i^{\top}\boldsymbol{W}_q^{\top}\boldsymbol{W}_k\boldsymbol{x}_j+\beta_{i,j}$$

Further modifications yield two more variants of this bias term; all three of these methods improve on the representation form of Transformer-XL.

2. RoPE rotary position encoding

The starting point of RoPE (Rotary Position Embedding) is "to achieve relative position encoding by means of absolute position encoding"; one could also say that it combines relative and absolute position encoding.

This is both theoretically elegant and practically effective; for example, it is mainly because of this property that RoPE can be extended to linear attention.

2.1 Introducing complex numbers

Assume that $\boldsymbol{q}_m$ and $\boldsymbol{k}_n$ are two-dimensional row vectors corresponding to positions $m$ and $n$ (i.e., each position is represented by a two-dimensional vector). Each such 2-D vector can be identified with a complex number (real part and imaginary part), and their inner product serves as the corresponding attention score.

The inner product can be computed with the two complex numbers:

$$\langle \boldsymbol{q}_m,\boldsymbol{k}_n\rangle=\mathrm{Re}\big[\boldsymbol{q}_m\boldsymbol{k}_n^{*}\big]$$

where $*$ denotes the complex conjugate and $\mathrm{Re}[\cdot]$ the real part of a complex number.

That is, the inner product of two two-dimensional vectors equals the real part of the product of one of them (viewed as a complex number) with the conjugate of the other.

Therefore, when the absolute positions $m$ and $n$ are injected by multiplying with $e^{\mathrm{i}m\theta}$ and $e^{\mathrm{i}n\theta}$ respectively, we get:

$$\langle \boldsymbol{q}_m e^{\mathrm{i}m\theta},\ \boldsymbol{k}_n e^{\mathrm{i}n\theta}\rangle=\mathrm{Re}\big[\boldsymbol{q}_m\boldsymbol{k}_n^{*}e^{\mathrm{i}(m-n)\theta}\big]$$

The RoPE derivation then works out how the encoding of each position should be chosen so that this holds.

It can be seen that multiplying the queries and keys by $e^{\mathrm{i}m\theta}$ and $e^{\mathrm{i}n\theta}$ at absolute positions $m$ and $n$ is equivalent, under complex multiplication, to multiplying by $e^{\mathrm{i}(m-n)\theta}$, i.e., to a relative position in the complex space. In this way, absolute positions are cleverly converted into relative positions through complex arithmetic.
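A tiny numerical check of the two identities above (the vectors, θ, m and n are arbitrary values chosen just for illustration):

import numpy as np

# Treat the 2-D vectors q and k as complex numbers q1 + i*q2 and k1 + i*k2.
q, k = np.array([0.3, 1.2]), np.array([-0.7, 0.5])
qc, kc = complex(q[0], q[1]), complex(k[0], k[1])
theta, m, n = 0.1, 7, 3

# <q, k> equals the real part of q times the conjugate of k.
assert np.isclose(np.dot(q, k), (qc * kc.conjugate()).real)

# Injecting absolute positions m and n via e^{i m theta} and e^{i n theta}:
score = ((qc * np.exp(1j * m * theta)) * (kc * np.exp(1j * n * theta)).conjugate()).real
# Shifting both positions by the same offset leaves the score unchanged,
# i.e. the score only depends on the relative position m - n.
shifted = ((qc * np.exp(1j * (m + 5) * theta)) * (kc * np.exp(1j * (n + 5) * theta)).conjugate()).real
assert np.isclose(score, shifted)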

The geometric meaning of complex multiplication is a rotation of the vector. Suppose the two-dimensional vector $\boldsymbol{q}_m=(q_m^{(1)},q_m^{(2)})$ is encoded at position $m$; then:

$$f_q(\boldsymbol{q}_m,m)=\begin{pmatrix}\cos m\theta & -\sin m\theta\\ \sin m\theta & \cos m\theta\end{pmatrix}\begin{pmatrix}q_m^{(1)}\\ q_m^{(2)}\end{pmatrix}$$

which is equivalent to rotating the two-dimensional vector by the angle $m\theta$; the product of the rotation matrix with the vector simply mixes the two components of $\boldsymbol{q}_m$ (and likewise for $\boldsymbol{k}_n$).

When the vector dimension is $d$ ($d$ even), this can be extended to a block-diagonal rotation:

$$\boldsymbol{R}_m=\begin{pmatrix}\cos m\theta_1 & -\sin m\theta_1 & \cdots & 0 & 0\\ \sin m\theta_1 & \cos m\theta_1 & \cdots & 0 & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2}\\ 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}\end{pmatrix}$$

Every two dimensions form a two-dimensional pair; the $d/2$ pairs are rotated independently and then concatenated, which directly gives a $d$-dimensional rotary position encoding.
Interpretation of the extension from two dimensions to higher dimensions: Transformer upgrade path: 4. Rotary position encoding for two-dimensional positions - Scientific Spaces; Transformer upgrade path: 6. Completeness analysis of rotary position encoding - Scientific Spaces.

2.2 Implementation of RoPE

[Figure: illustration of the RoPE implementation, using the example sentence "Enhanced Transformer with Rotary Position Embedding"]
  • When a sentence such as "Enhanced Transformer with Rotary Position Embedding" is input, we first obtain its Query and Key vectors q and k, both of dimension d, and group adjacent elements of each vector into pairs, giving d/2 groups (in the lower-left part of the figure, two elements of the same color form one group; each group of a token is a two-dimensional vector);

  • The absolute position number of each word is obtained (the sentence has 6 words, numbered 1, 2, 3, 4, 5, 6). Taking the word "Enhanced" as an example, its first pair of elements has rotation frequency θ1 and position m = 1, so its new element values are obtained by applying the rotary position encoding, i.e., rotating the pair by the angle mθ1;

  • All d/2 pairs of every word are "rotated" in this way, yielding the new position encodings (lower-right corner of the figure).

An element-wise ("linear") implementation of RoPE is as follows:

$$\boldsymbol{R}_m\boldsymbol{x}=\begin{pmatrix}x_1\\ x_2\\ x_3\\ x_4\\ \vdots\\ x_{d-1}\\ x_d\end{pmatrix}\otimes\begin{pmatrix}\cos m\theta_1\\ \cos m\theta_1\\ \cos m\theta_2\\ \cos m\theta_2\\ \vdots\\ \cos m\theta_{d/2}\\ \cos m\theta_{d/2}\end{pmatrix}+\begin{pmatrix}-x_2\\ x_1\\ -x_4\\ x_3\\ \vdots\\ -x_d\\ x_{d-1}\end{pmatrix}\otimes\begin{pmatrix}\sin m\theta_1\\ \sin m\theta_1\\ \sin m\theta_2\\ \sin m\theta_2\\ \vdots\\ \sin m\theta_{d/2}\\ \sin m\theta_{d/2}\end{pmatrix}$$

where $\otimes$ denotes element-wise multiplication.
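A minimal numpy sketch of this element-wise form, pairing adjacent dimensions as described above (the function name and the pairing convention are illustrative; some real implementations pair dimension i with i + d/2 instead):

import numpy as np

def apply_rope(x, base=10000.0):
    """Rotary position encoding for x of shape (seq_len, d) with d even:
    the pair (x[2i], x[2i+1]) at position m is rotated by the angle m * theta_i,
    where theta_i = base ** (-2i / d)."""
    seq_len, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)               # (d/2,)
    angles = np.arange(seq_len)[:, None] * theta[None, :]   # (seq_len, d/2)
    cos = np.repeat(np.cos(angles), 2, axis=-1)             # (seq_len, d)
    sin = np.repeat(np.sin(angles), 2, axis=-1)
    # (x1, x2, x3, x4, ...) -> (-x2, x1, -x4, x3, ...)
    x_rot = np.stack([-x[:, 1::2], x[:, 0::2]], axis=-1).reshape(seq_len, d)
    return x * cos + x_rot * sin

# q and k (each of shape (seq_len, d)) are transformed with apply_rope before
# the usual dot-product attention; the resulting scores depend only on m - n.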

2.3 Properties of RoPE

(1) Long-range decay

[Figure: the inner product decays as the relative distance increases]

From the figure we can see that as the relative distance increases, the inner product tends to decay. Therefore, choosing $\theta_i=10000^{-2i/d}$ can indeed bring a certain degree of long-range decay. Of course, this is not the only choice that produces long-range decay; almost any smooth, monotonic function would do. If this choice is treated merely as an initialization and $\theta$ is made a trainable parameter, it turns out that $\theta$ is not significantly updated after training for a while, so it can simply be fixed.

(2) Advantages

  1. Encodes the absolute position with a rotation matrix;

  2. Incorporates explicit relative-position dependence into the self-attention computation;

  3. Flexible (unrestricted) sequence length;

  4. Inter-token dependency that gradually decays (attenuates) as the relative distance increases;

  5. "Arms" linear self-attention with relative position encoding.

Specifically, RoPE uses a rotation matrix to encode absolute position while incorporating explicit relative position dependence into the self-attention formulation .

[The two core points are the "rotation matrix" and the "explicit relative position dependence"].

3. Length extrapolation for long texts

Extrapolation means that for long-text representation, the model only needs to be trained on a limited length, and at inference time the length can be extended to several times the training length while still maintaining good performance.

Length extrapolation is a problem of mismatched lengths between training and prediction, which mainly manifests itself in two aspects:

  • Position encodings (whether absolute or relative) that were never seen during training are used at prediction time;

  • The number of tokens processed by the attention mechanism at prediction time far exceeds that during training.

A simple and effective way to solve the extrapolation problem of long text is Attention Mask, as shown in the figure:

[Figure: sliding-window attention mask]
  • Through a sliding-window-like structure, each token is constrained to compute attention only with tokens in a local region, so the relative position never exceeds the window size, which solves the first problem;

  • Attention is computed only within the window, which avoids the final weights being over-"smoothed" by averaging the attention of a huge number of tokens.

In implementation, this essentially amounts to subtracting a matrix from the attention scores after they are computed (before normalization); its shape is shown in the figure below:

[Figure: the matrix subtracted from the attention scores]

The blue region (the local region inside the sliding window) is 0, meaning the original pre-normalization attention score is kept; the other regions hold a very large constant (e.g., the largest representable integer), so after subtraction the attention score becomes a very small number that is almost 0 after softmax normalization.
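A minimal sketch of such a matrix, written as an additive bias that is 0 inside the (causal) window and a very negative number outside it, which is equivalent to the subtraction described above (the window size and the constant are illustrative):

import numpy as np

def sliding_window_bias(seq_len, window, neg=-1e9):
    """Additive attention bias: 0 inside the causal sliding window,
    a very negative constant elsewhere, so that out-of-window scores
    become (almost) 0 after softmax normalization."""
    pos = np.arange(seq_len)
    rel = pos[:, None] - pos[None, :]          # query position - key position
    inside = (rel >= 0) & (rel < window)       # causal and within the window
    return np.where(inside, 0.0, neg)

# scores = q @ k.T / np.sqrt(d) + sliding_window_bias(seq_len, window=512)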

3.1 ALiBi

Paper: "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation"

The idea is the same as above, except that the matrix M is replaced by a linear bias; that is, the pre-normalization attention score becomes:

$$\boldsymbol{q}_m^{\top}\boldsymbol{k}_n-\lambda|m-n|$$

where $\lambda$ is a hyperparameter (a slope), which can be set to a different value for each head of the Transformer's multi-head attention. The shape of the bias matrix $\lambda|m-n|$ is as follows:

[Figure: the ALiBi linear bias matrix]

Compared with the original method, the larger the relative distance, the larger the bias $\lambda|m-n|$, so the farther apart two tokens are, the smaller their attention weight after normalization. Compared with the "sliding window" approach, which is hard (attention is computed inside the window and not at all outside it), ALiBi is relatively soft (nearby tokens get larger attention, distant ones smaller).
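A minimal sketch of this bias (the geometric head slopes follow the common choice suggested in the ALiBi paper; applying it in the symmetric |m - n| form shown here is a simplification of the paper's causal setting):

import numpy as np

def alibi_bias(seq_len, num_heads):
    """ALiBi-style additive bias -lambda_h * |m - n| with a different
    slope lambda_h for each attention head."""
    # Geometric slopes 2^(-8/H), 2^(-16/H), ... as suggested in the ALiBi paper.
    slopes = 2.0 ** (-8.0 * (np.arange(num_heads) + 1) / num_heads)
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])           # |m - n|
    return -slopes[:, None, None] * dist[None, :, :]     # (num_heads, seq, seq)

# Per head h:  scores_h = q_h @ k_h.T / np.sqrt(d) + alibi_bias(seq_len, H)[h]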

3.2 KERPLE

Paper: "KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation"

KERPLE makes some improvements to ALiBi by introducing two learnable parameters $r_1$ and $r_2$ to learn the local region "dynamically". As shown in the figure below (the left side is the original form), the λ matrix of ALiBi is made dynamic through these parameters:

[Figure: KERPLE's parameterized generalization of the ALiBi bias matrix]

Two modes are defined, a power mode and a logarithmic mode, corresponding to the forms without and with a logarithm respectively:

[Figure: the power and logarithmic forms of the KERPLE bias]

In the logarithmic mode, c is a constant that controls the overall magnitude; with it fixed, the form is equivalent to ALiBi. Su Shen's simplified version writes the two modes as:

$$\text{power mode: }\ \boldsymbol{q}_m^{\top}\boldsymbol{k}_n-r_1|m-n|^{r_2},\qquad \text{logarithmic mode: }\ \boldsymbol{q}_m^{\top}\boldsymbol{k}_n-r_1\log\!\left(1+r_2|m-n|\right)$$

with trainable $r_1,r_2>0$.
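A small sketch of the two simplified bias forms quoted above (r1 and r2 stand in for the learnable positive parameters; the default values here are placeholders):

import numpy as np

def kerple_bias(seq_len, r1=1.0, r2=1.0, mode="log"):
    """KERPLE-style additive bias with two positive learnable parameters:
    power mode:       -r1 * |m - n| ** r2
    logarithmic mode: -r1 * log(1 + r2 * |m - n|)"""
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :]).astype(float)
    if mode == "power":
        return -r1 * dist ** r2
    return -r1 * np.log1p(r2 * dist)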

3.3 Sandwich

Paper: "Receptive Field Alignment Enables Transformer Length Extrapolation"

Sandwich and KERPLE were proposed by the same authors; it makes a small modification to KERPLE, rewriting the bias as the inner product of the two positions' Sinusoidal encodings, i.e., the pre-normalization score becomes:

$$\boldsymbol{q}_m^{\top}\boldsymbol{k}_n+\lambda\,\boldsymbol{p}_m^{\top}\boldsymbol{p}_n$$

where $\boldsymbol{p}_m$ and $\boldsymbol{p}_n$ are Sinusoidal position encodings:

[Figure: the Sandwich bias expressed with Sinusoidal position encodings]

Since the Sinusoidal inner-product bias is equivalent in monotonicity to the biases above (its penalty likewise grows with the relative distance), Sandwich is essentially the same idea wearing a new face.

3.4 XPOS

Paper: "A Length-Extrapolatable Transformer"
Reference interpretation: Transformer upgrade path: 7. Length extrapolation and local attention

XPOS introduces local attention on top of RoPE. The essence of RoPE is:

$$\boldsymbol{q}_m=\boldsymbol{R}_m\boldsymbol{q},\qquad \boldsymbol{k}_n=\boldsymbol{R}_n\boldsymbol{k},\qquad \boldsymbol{q}_m^{\top}\boldsymbol{k}_n=\boldsymbol{q}^{\top}\boldsymbol{R}_m^{\top}\boldsymbol{R}_n\boldsymbol{k}=\boldsymbol{q}^{\top}\boldsymbol{R}_{n-m}\boldsymbol{k}$$

where:

$$\boldsymbol{R}_{n-m}=\boldsymbol{R}_m^{\top}\boldsymbol{R}_n$$

As introduced in Section 2, RoPE realizes relative positions through absolute positions by means of complex rotations. XPOS additionally introduces a new scalar $\xi$:

$$\boldsymbol{q}_m=\boldsymbol{R}_m\boldsymbol{q}\,\xi^{m},\qquad \boldsymbol{k}_n=\boldsymbol{R}_n\boldsymbol{k}\,\xi^{-n},\qquad \boldsymbol{q}_m^{\top}\boldsymbol{k}_n=\boldsymbol{q}^{\top}\boldsymbol{R}_{n-m}\boldsymbol{k}\,\xi^{m-n}$$

Since the exponent here is the relative position $m-n$ rather than $|m-n|$, XPOS is restricted to the unidirectional (causal) Transformer, which guarantees $m-n\ge 0$ and thus avoids negative exponents.

XPOS also designs a locally-aware attention mechanism, Blockwise Causal Attention, to further improve local attention and the extrapolation on long texts.
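A minimal sketch of the scalar scaling described above, applied on top of already-rotated (RoPE) queries and keys (the value of ξ is illustrative; the paper itself uses a dimension-dependent decay rather than a single scalar, and long sequences would need the exponent offset for numerical stability):

import numpy as np

def apply_xpos_scale(x, xi=0.98, sign=+1):
    """Multiply the vector at position p by xi ** (sign * p):
    sign = +1 for queries (xi^m) and sign = -1 for keys (xi^-n), so the
    attention score between positions m and n carries the factor xi^(m - n)."""
    seq_len = x.shape[0]
    scale = xi ** (sign * np.arange(seq_len, dtype=float))
    return x * scale[:, None]

# With q, k already rotated by RoPE (e.g. apply_rope from the sketch in 2.2):
# q = apply_xpos_scale(q, sign=+1)
# k = apply_xpos_scale(k, sign=-1)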

4. Other explorations of extrapolation

(1) Mixed attention mask

Typical representatives for long-text position representation include Transformer-XL, BigBird, and LongFormer. Besides the local attention mechanism, they also introduce attention at random positions:

[Figure: attention patterns: random attention, sliding-window attention, global attention, and their combination]

As shown above, the second picture is local (sliding-window) attention, and the third is limited global perception (for example, only the first two tokens can see all tokens); the first picture is a random mask that alleviates overly hard local attention. Mixing the three kinds of attention gives the fourth picture, and this mixture is also a commonly used method when training large models on very long texts.

(2) Random position representation

Paper: "Randomized Positional Encodings Boost Length Generalization of Transformers"

[Figure: illustration of randomized positional encodings]

With absolute position representations, positions beyond the training length are out of vocabulary (OOV). Random position encoding adopts the following strategy during training:

[Figure: the random position-sampling strategy used during training]

The corresponding code is also very simple:

import numpy as np

def random_position_ids(N, L=2048):
    """Randomly pick N distinct integers from [0, L) and sort them in ascending order."""
    return np.sort(np.random.permutation(L)[:N])

Su Shen's new exploration of random position encoding:

[Figure: Su Shen's variant of random position encoding]

The corresponding code is:

import numpy as np

def random_position_ids(N):
    """First randomly sample a maximum position n, then take N evenly spaced points in [0, n]."""
    n = sample_from_xxx()  # sampling of n is left unspecified in the original
    return np.linspace(0, 1, N) * n

(3) Attention Scale

The original attention computation is:

$$\mathrm{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})=\mathrm{softmax}\!\left(\frac{\boldsymbol{Q}\boldsymbol{K}^{\top}}{\sqrt{d}}\right)\boldsymbol{V}$$

It is simply changed to the following:

$$\mathrm{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})=\mathrm{softmax}\!\left(\frac{\log_{m}n}{\sqrt{d}}\,\boldsymbol{Q}\boldsymbol{K}^{\top}\right)\boldsymbol{V}$$

where $m$ is the maximum length during training and $n$ is the position at prediction time; generally $n>m$. Intuitively, the attention logits are rescaled directly according to the position: when $n$ is far beyond $m$, attention spread over so many tokens would otherwise become overly "flat" after normalization, and the factor $\log_m n>1$ compensates for this, which helps with extrapolation.
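A minimal sketch of this rescaling, applied per query position (flooring the factor at 1 for positions shorter than the training length is a common convenience, not something stated above):

import numpy as np

def logn_scaled_scores(q, k, train_len=2048):
    """Attention logits with the log-n scale: the logits of the query at
    position n are multiplied by log_{train_len}(n), so that attention over
    contexts longer than the training length does not become too flat."""
    seq_len, d = q.shape
    positions = np.arange(1, seq_len + 1)
    scale = np.maximum(1.0, np.log(positions) / np.log(train_len))  # log_m(n)
    return scale[:, None] * (q @ k.T) / np.sqrt(d)

# attn = softmax(logn_scaled_scores(q, k), axis=-1)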

(4) Global dependencies

If you look at the sliding-window approach within a single Transformer layer, it is essentially similar to an N-gram model whose order equals the window length $w$, as shown in the figure below:

[Figure: one layer of window attention behaves like an N-gram model over the window]

If the Transformer has more layers, then, starting from the input layer, the information inside a window of length $w$ can be propagated to a wider region after passing through $L$ layers; the receptive field grows to roughly $(w-1)L+1$, as shown in the following figure:

[Figure: stacking several window-attention layers enlarges the receptive field]

A new idea given by Su Shen is: with an $L$-layer Transformer, use this expansion property in the earlier layers so that by the last layer each position already covers a long span, and then apply the log-n Attention Scale method mentioned above in the last layer; the information expanded by the earlier layers can then quickly interact with all tokens in the last layer. Su Shen's original description is:

[Figure: excerpt from Su Shen's original post describing this combination]

This combination of local attention + Attention Scale is a very clever idea, and experiments show that the extrapolation of this strategy is impressive.


