Position Encoding in the T5 Model

Programming Languages 2022-06-10 09:30:16


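Introduction

T5 is a pretrained language model built on the standard Transformer encoder-decoder architecture. It casts almost every NLP task as a seq2seq problem, and at the time of its release it achieved the best results on the GLUE and SuperGLUE leaderboards. This post explains how its position encoding works and examines it experimentally using the Hugging Face T5 source code.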

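Position Encoding in the Transformer

The Transformer is an architecture that Google proposed in 2017 for machine translation. It is a typical encoder-decoder model whose core component is self-attention. Because self-attention is order-independent (it operates on its input as a set), a Transformer is usually given an explicit position signal.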
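To make the order-independence concrete, here is a minimal sketch in PyTorch. The helper `self_attention`, the toy sizes, and the random weights are all illustrative assumptions, not part of T5. It checks that shuffling the input tokens only shuffles the output rows in the same way, so without a position signal the module cannot tell one ordering from another.

```python
import torch

torch.manual_seed(0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no position signal (illustrative helper)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)        # (seq_len, seq_len)
    return torch.softmax(scores, dim=-1) @ v

seq_len, d_model = 5, 8
x = torch.randn(seq_len, d_model)
# Small random projections, scaled so the scores stay in a moderate range.
w_q, w_k, w_v = (torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(3))

perm = torch.randperm(seq_len)
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the input rows only shuffles the output rows in the same way.
order_blind = torch.allclose(out[perm], out_shuffled, atol=1e-5)
print(order_blind)
```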

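The earliest approaches all used absolute position encoding, which considers only the absolute position of each token. As shown in the figure above, absolute position encoding adds the position information directly to the input embedding at the input stage. For example, the original Transformer encodes positions with sinusoidal functions; alternatively, the model can learn the position embeddings itself.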
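As a rough illustration of the sinusoidal variant, the sketch below builds a sin/cos table and adds it to toy input embeddings. The function name `sinusoidal_position_encoding` and the sizes are assumptions made for this example; the base of 10000 follows the original Transformer formulation.

```python
import math
import torch

def sinusoidal_position_encoding(seq_len, d_model):
    """Fixed sin/cos position table of shape (seq_len, d_model); d_model must be even."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                                     # (d_model / 2,)
    table = torch.zeros(seq_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    table[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return table

token_embeddings = torch.randn(6, 16)                 # toy (seq_len, d_model) input embeddings
pe = sinusoidal_position_encoding(6, 16)
encoder_input = token_embeddings + pe                 # position is injected once, at the input
print(encoder_input.shape)
```

The point to notice is that the position signal depends only on each token's own index and enters once, before the first layer.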

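Relative position encoding does not use a fixed embedding for each position. Instead, it produces a different learned embedding according to the offset between the key and the query being compared in the self-attention mechanism. Relation-aware Self-attention (see juejin.cn/post/710563…), for example, injects relative position information while computing self-attention. Compared with absolute encoding, relative position encoding is more flexible in form and can in principle handle inputs of arbitrary length. The position encoding used in T5 is a relative position encoding.

A First Look at the T5 Architecture

As mentioned above, T5 is a typical Transformer. Let's print the model to get an intuitive feel for it.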

import torch, math
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5Model

model = T5Model.from_pretrained("t5-small")
print(model)

The output is as follows:

T5Model(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseReluDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
/
/ too long; blocks (1)-(5) omitted
/
    (final_layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerCrossAttention(
            (EncDecAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (2): T5LayerFF(
            (DenseReluDense): T5DenseReluDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
/
/ too long; blocks (1)-(5) omitted
/
    (final_layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
)

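The output above shows the full structure of the smallest T5 model (t5-small). We can see that:

T5 as a whole consists of two parts, the Encoder and the Decoder.
The Encoder and the Decoder are each an instance of the T5Stack class.
Each T5Stack contains a ModuleList of T5Blocks, followed by a final T5LayerNorm and a Dropout layer.
Each T5Block is a basic Transformer layer. In the Encoder it consists of a T5LayerSelfAttention followed by a T5LayerFF; in the Decoder a T5LayerCrossAttention sits between the two.
Each T5LayerSelfAttention contains four Linear layers, q, k, v and o. q, k and v are the Q, K and V projections of the Transformer, and o maps the self-attention output from dimension d_intermediate back to d_model so it can be passed on (in t5-small the two happen to be equal: d_intermediate = d_model = 512).
Only the attention module in the first layer of the Encoder and of the Decoder has a relative_attention_bias Embedding at the end. To keep the relative position computation simple, T5 lets all layers within the Encoder (and likewise within the Decoder) share one set of embedding parameters, while each head gets its own values. The relative_attention_bias Embedding therefore has shape (num_relative_buckets, num_heads), which is (32, 8) here.

This relative_attention_bias embedding matrix is what T5 uses for relative position encoding.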

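Self-attention Recap

With the structure of T5 in mind, let's look at how its relative position encoding works. First, recall how self-attention is computed. For a self-attention module with H heads, let the inputs $x_i$ and $x_j$ denote the tokens at positions i and j. Self-attention then computes the following:

First, for the h-th head, the Q vector of token i is multiplied with the K vector of token j, giving the score $e_{ij}^{(h)}$.

Next, softmax is applied to these scores to obtain the coefficients $\alpha$.

Finally, the V vectors are summed with weights $\alpha$ to give the output of the attention module.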
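Written out in one common notation, the three steps look as follows. This is a sketch: the per-head projection matrices $W^{Q,(h)}$, $W^{K,(h)}$, $W^{V,(h)}$, the output symbol $z_i^{(h)}$, and the $\sqrt{d_k}$ scaling of the original Transformer are assumptions about notation rather than something stated in this post.

$$
e_{ij}^{(h)} = \frac{\left(x_i W^{Q,(h)}\right)\left(x_j W^{K,(h)}\right)^{\top}}{\sqrt{d_k}}
$$

$$
\alpha_{ij}^{(h)} = \frac{\exp\left(e_{ij}^{(h)}\right)}{\sum_{k} \exp\left(e_{ik}^{(h)}\right)}
$$

$$
z_i^{(h)} = \sum_{j} \alpha_{ij}^{(h)} \left(x_j W^{V,(h)}\right)
$$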

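Relative Position Encoding in T5

Overall, the relative position encoding used in T5 is fairly simple. As the printed model structure shows, T5 does not add a position embedding to the input embedding. Instead, it adds a relative position embedding to the product of Q and K in self-attention, that is, before the softmax. The embedding parameters are stored in the first layer of the Encoder (and of the Decoder), and in the Hugging Face implementation the bias computed there is reused by the self-attention of the later layers. That is:

$$\hat{e}_{ij}^{(h)} = e_{ij}^{(h)} + r_{ij}$$

Afterwards, $\hat{e}_{ij}^{(h)}$ is fed into the softmax, and the rest of the computation is the same as in ordinary self-attention.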
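The following minimal sketch shows where the bias enters the computation. The tensors are random and `position_bias` is a hypothetical stand-in for the $r_{ij}$ values; in the real model they come from the learned relative_attention_bias table.

```python
import torch

torch.manual_seed(0)
num_heads, seq_len, d_head = 2, 4, 8

q = torch.randn(num_heads, seq_len, d_head)
k = torch.randn(num_heads, seq_len, d_head)
v = torch.randn(num_heads, seq_len, d_head)

# Hypothetical bias: one scalar per (head, query position i, key position j).
position_bias = torch.randn(num_heads, seq_len, seq_len)

scores = q @ k.transpose(-1, -2)             # e_ij^(h), shape (heads, seq, seq)
scores_hat = scores + position_bias          # e_hat_ij^(h) = e_ij^(h) + r_ij
alpha = torch.softmax(scores_hat, dim=-1)    # softmax over key positions j
output = alpha @ v                           # weighted sum of V, shape (heads, seq, d_head)
print(output.shape)
```

Because the bias is added to the logits rather than to the token embeddings, the position information never mixes with the content vectors themselves.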

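The relative position encoding $r_{ij}$ is a scalar, and there are 32 types in total; in other words, T5 distinguishes 32 kinds of relative position. However, the input length of T5 is 512 tokens, so the number of possible relative offsets is far greater than 32. T5 therefore uses a bucketing scheme that maps any number of relative offsets onto these 32 final types.

For example, intuitively the relative positions of nearby tokens matter more and therefore need to be represented more precisely. So for tokens $x_i$ and $x_j$ whose offset is small (with the default settings, |i-j| < 8), each offset gets its own exact type: i-j=0 is one type, and i-j=1, i-j=-3 and so on are each separate types. Distant offsets, in contrast, do not need to be very precise. The difference between i-j=94 and i-j=95, for instance, means little to the model, so T5 groups such distant offsets into shared types whose ranges grow roughly logarithmically with distance. For example, the $r_{ij}$ for every offset from i-j=33 to i-j=45 is the same position embedding.

Finally, all tokens that are very far apart fall into a single class; for example, every $r_{ij}$ with i-j>128 belongs to one class. Each class is a scalar that can be added directly to $e_{ij}^{(h)}$.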
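The bucketing rule can be sketched for a single integer offset in plain Python. This is an illustrative scalar restatement of the bidirectional case, not the library code: the name `bucket_of` is made up for this example, and the offset follows the convention of key position minus query position.

```python
import math

def bucket_of(offset, num_buckets=32, max_distance=128):
    """Scalar sketch of bidirectional bucketing; offset = key position - query position."""
    half = num_buckets // 2                 # 16 buckets per direction
    base = half if offset > 0 else 0        # positive offsets use the upper half
    n = abs(offset)
    max_exact = half // 2                   # distances 0..7 each get their own bucket
    if n < max_exact:
        return base + n
    # Larger distances share logarithmically growing bins, capped at the last bucket.
    log_bin = int(
        math.log(n / max_exact) / math.log(max_distance / max_exact) * (half - max_exact)
    )
    return base + min(max_exact + log_bin, half - 1)

for offset in (-3, -1, 0, 1, 3, 94, 95, 500):
    print(offset, bucket_of(offset))
```

Small offsets print distinct bucket ids, while 94, 95 and 500 all land in the same final bucket for their direction.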

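Note that the relative position parameters live only in the first self-attention layer of the Encoder and of the Decoder, and that they are not shared across heads. In the Hugging Face implementation, the bias computed in that first layer is passed on to the self-attention of the remaining layers, so it affects every layer rather than only the first.

Code Experiments

Now that we understand how T5 encodes relative positions, let's look at the implementation. The reference code is the Hugging Face T5 code, which has a dedicated function for computing the relative position buckets.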

def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
    """
    Adapted from Mesh Tensorflow:
    https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593

    Translate relative position to a bucket number for relative attention. The relative position is defined as
    memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to
    position. If bidirectional=False, then positive relative positions are invalid. We use smaller buckets for
    small absolute relative_position and larger buckets for larger absolute relative_positions. All relative
    positions >=max_distance map to the same bucket. All relative positions <=-max_distance map to the same bucket.
    This should allow for more graceful generalization to longer sequences than the model has been trained on

    Args:
        relative_position: an int32 Tensor
        bidirectional: a boolean - whether the attention is bidirectional
        num_buckets: an integer
        max_distance: an integer

    Returns:
        a Tensor with the same shape as relative_position, containing int32 values in the range [0, num_buckets)
    """
    relative_buckets = 0
    if bidirectional:
        num_buckets //= 2
        relative_buckets += (relative_position > 0).to(torch.long) * num_buckets
        relative_position = torch.abs(relative_position)
    else:
        relative_position = -torch.min(relative_position, torch.zeros_like(relative_position))
    # now relative_position is in the range [0, inf)

    # half of the buckets are for exact increments in positions
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact

    # The other half of the buckets are for logarithmically bigger bins in positions up to max_distance
    relative_postion_if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).to(torch.long)
    relative_postion_if_large = torch.min(
        relative_postion_if_large, torch.full_like(relative_postion_if_large, num_buckets - 1)
    )

    relative_buckets += torch.where(is_small, relative_position, relative_postion_if_large)
    return relative_buckets
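
This function takes a relative position and returns the corresponding bucket type (32 in total). Its docstring defines the relative position as memory_position - query_position, so the simple script below passes that value in and prints its negation, which makes the first output column read as i-j.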

for i in range(-15, 15):
    relative_position = torch.tensor([int(i)])
    encoded_position = _relative_position_bucket(relative_position=relative_position, bidirectional=True, num_buckets=32, max_distance=128)
    print(-i, encoded_position)

The output is as follows:

15 tensor([9])
14 tensor([9])
13 tensor([9])
12 tensor([9])
11 tensor([8])
10 tensor([8])
9 tensor([8])
8 tensor([8])
7 tensor([7])
6 tensor([6])
5 tensor([5])
4 tensor([4])
3 tensor([3])
2 tensor([2])
1 tensor([1])
0 tensor([0])
-1 tensor([17])
-2 tensor([18])
-3 tensor([19])
-4 tensor([20])
-5 tensor([21])
-6 tensor([22])
-7 tensor([23])
-8 tensor([24])
-9 tensor([24])
-10 tensor([24])
-11 tensor([24])
-12 tensor([25])
-13 tensor([25])
-14 tensor([25])
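
The first column is the value of i-j, the exact relative offset. The second column is the bucket, the relative position type it maps to. When the offset is small, every position has its own type (for -8<i-j<8); when the offset is larger, several positions are grouped into one type (for example 12>i-j>7).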
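To connect the bucket ids back to attention, here is a sketch of how they could be turned into the additive bias. It loosely follows the idea of the Hugging Face implementation but is not a copy of it: the compact helper `bucket`, the randomly initialised `nn.Embedding`, and the sequence lengths are assumptions made for illustration.

```python
import math
import torch
from torch import nn

def bucket(relative_position, num_buckets=32, max_distance=128):
    """Compact bidirectional variant of the bucketing function shown above."""
    half = num_buckets // 2
    result = (relative_position > 0).long() * half
    n = relative_position.abs()
    max_exact = half // 2
    # clamp(min=1) only avoids log(0); those entries are replaced by the exact branch below.
    large = max_exact + (
        torch.log(n.clamp(min=1).float() / max_exact)
        / math.log(max_distance / max_exact)
        * (half - max_exact)
    ).long()
    large = large.clamp(max=half - 1)
    return result + torch.where(n < max_exact, n, large)

num_buckets, num_heads = 32, 8
# Randomly initialised stand-in for relative_attention_bias; a real model loads trained weights.
relative_attention_bias = nn.Embedding(num_buckets, num_heads)

query_len, key_len = 6, 6
query_pos = torch.arange(query_len).unsqueeze(1)   # (query_len, 1)
key_pos = torch.arange(key_len).unsqueeze(0)       # (1, key_len)
relative_position = key_pos - query_pos            # entry [i, j] = j - i

buckets = bucket(relative_position)                # (query_len, key_len) bucket ids
bias = relative_attention_bias(buckets)            # (query_len, key_len, num_heads)
bias = bias.permute(2, 0, 1).unsqueeze(0)          # (1, num_heads, query_len, key_len)
print(buckets)
print(bias.shape)
```

Every pair of positions with the same offset reads the same row of the table, so the bias depends only on the distance between tokens, and each of the 8 columns supplies the value for one head.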