WaveNet causal convolution and Transformer architecture analysis

WaveNet: a cross-domain application of convolutional networks to sequence modeling

In 2017, Google deployed WaveNet as the speech synthesizer behind Google Assistant. WaveNet is a sequence-processing model built on one-dimensional convolutions, and its core idea is the dilated causal convolution.

What is causal convolution

In plain terms, "causal" means the present can only be the result of what came before: the data at the current time step must have evolved from earlier data. Now suppose we run an ordinary convolution with kernel size 2: convolving $x_0$ and $x_1$ gives $x_0'$, and convolving $x_1$ and $x_2$ gives $x_1'$. The output at time 0 now depends on the input at time 1, and the output at time 1 depends on the input at time 2; the current step is being influenced by the next step, which breaks causality. The fix is padding: put some zeros in front of $x_0$ (and, if needed, after $x_n$), so that $x_0$ is convolved with the padded zeros to produce $x_0'$, $x_0$ and $x_1$ produce $x_1'$, and so on. Each output then depends only on the current and earlier inputs, and there is no more confusion about the order of time.
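As a minimal sketch of this idea (not WaveNet's actual code), a causal convolution can be built in PyTorch by left-padding the input with kernel_size - 1 zeros before an ordinary Conv1d, so the output at step t only sees inputs at steps t and earlier:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    # A 1-D convolution that only looks at the current and past time steps.
    def __init__(self, in_channels, out_channels, kernel_size):
        super(CausalConv1d, self).__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)

    def forward(self, x):
        # x: (batch, channels, time); pad only on the left so no future information leaks in
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)

x = torch.randn(1, 1, 8)            # a toy sequence of 8 time steps
y = CausalConv1d(1, 1, kernel_size=2)(x)
print(y.shape)                      # torch.Size([1, 1, 8]): same length, causally aligned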

What is Dilated Convolution

After fixing this timing-alignment problem, the convolution still has one big weakness: the receptive field is too small.
[Figure: a stack of causal convolution layers and the receptive field it covers]
For example, after laboriously stacking four convolution layers (kernel size 2), the network can only see 5 consecutive time steps, while the earlier fill-in-the-blank example needs a receptive field of at least 13 time steps. Simply piling on more convolution layers would make the amount of computation very large, which is why dilated convolution was proposed:
[Figure: dilated convolution, where the same kernel covers a wider area as the dilation rate grows]
With dilation=1 (an ordinary convolution), a 3×3 kernel only sweeps over a 3×3 area; with dilation=2, the same 3×3 kernel sweeps over a 5×5 area.
The figure below, quoted from Zhihu, shows a PyTorch implementation of WaveNet's dilated causal convolutions:
[Figure: WaveNet-style stack of dilated causal convolutions with the dilation doubling at each layer]
If the n-th layer uses dilation = 2^n, the receptive field grows exponentially with depth, so a modest number of layers is enough to cover the entire sequence and the extracted features become truly global. The extracted features can then be processed further. (A hands-on project is being updated.)
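A rough sketch of how dilation = 2^n per layer grows the receptive field; this only illustrates the convolutional skeleton, while WaveNet itself additionally uses gated activations and residual/skip connections:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    # Stack of causal convolutions with dilations 1, 2, 4, 8, ...
    def __init__(self, channels, kernel_size=2, num_layers=4):
        super(DilatedCausalStack, self).__init__()
        self.layers = nn.ModuleList()
        self.pads = []
        for n in range(num_layers):
            dilation = 2 ** n
            # left padding of (kernel_size - 1) * dilation keeps the layer causal and length-preserving
            self.pads.append((kernel_size - 1) * dilation)
            self.layers.append(nn.Conv1d(channels, channels, kernel_size, dilation=dilation))

    def forward(self, x):
        for pad, conv in zip(self.pads, self.layers):
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return x

# With kernel_size=2 and 4 layers, the receptive field is 1 + 1 + 2 + 4 + 8 = 16 steps,
# versus only 5 steps for 4 undilated layers.
x = torch.randn(1, 16, 32)
print(DilatedCausalStack(16)(x).shape)   # torch.Size([1, 16, 32])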

Transformer

architecture analysis

Overall architecture

input part

[Figure: schematic of the input part]
Includes word embedding layers and positional encoders for source and target texts.

Word embedding layer nn.Embedding

We have already used it in previous projects; here is a simple demonstration.

import torch
import torch.nn as nn

sentence = 'How are you'
input_size = len(sentence)
output_size = 10
# input_size is the vocabulary size of the source text (here simply the length of the string),
# output_size is the dimensionality each token index is mapped to
embedding = nn.Embedding(input_size, output_size)
# Suppose 'How', 'are', 'you' map to the indices [1, 2, 3]; after the embedding,
# the batch of three tokens has shape [1, 3, 10]
x = embedding(torch.tensor([[1, 2, 3]]))
print(x)
print(x.shape)
>>>
tensor([[[-0.1730, -0.2589, -0.4128,  1.1708, -0.5708, -1.5719, -0.5521,
           0.7226,  1.7971, -1.1838],
         [-2.4243, -1.0639, -1.1274, -0.2122,  0.5868, -1.8033, -1.0478,
          -0.0812, -0.1956, -1.3679],
         [-1.9887,  0.2366, -1.5908, -1.8331, -1.7438,  0.2815,  0.6011,
           1.6243, -1.7086, -1.0831]]], grad_fn=<EmbeddingBackward>)
torch.Size([1, 3, 10])

Packaged as a class

import math
import torch
import torch.nn as nn
from torch.autograd import Variable

class Embeddings(nn.Module):
    def __init__(self, vocab, d_model):
        # vocab: vocabulary size
        # d_model: word embedding dimension
        super(Embeddings, self).__init__()
        self.d_model = d_model
        self.vocab = vocab
        self.lut = nn.Embedding(vocab, d_model)

    def forward(self, x):
        # x: tensor of token indices obtained by mapping the input text through the vocabulary
        # scale by sqrt(d_model) so the embeddings are not drowned out by the positional encoding
        return self.lut(x) * math.sqrt(self.d_model)

vocab = 1000
d_model = 512
x = Variable(torch.LongTensor([[100, 2, 421, 508], [491, 998, 1, 221]]))
emb = Embeddings(vocab, d_model)
embr = emb(x)
print(embr)
print(embr.shape)
>>>
tensor([[[-38.0086, -28.3689,  11.5360,  ..., -17.3576, -13.9426,  12.3066],
         [ 18.3994, -25.3799, -10.7227,  ...,  -5.3271,   8.5594, -47.6293],
         [ -9.8023,  13.2265,   0.6361,  ...,  22.1892,  32.0531,   0.2602],
         [ 26.9539,  21.3255, -19.0987,  ..., -22.8677,   1.2920,  15.5454]],

        [[-14.5776,  22.1955, -39.4145,  ..., -28.2664,  41.6184,  -5.1912],
         [ -9.5976,  13.2798,  12.4504,  ...,  33.1238, -29.1298,  39.2560],
         [ -9.4381,  -7.8411,  37.6495,  ...,  34.4752, -13.9440,  -3.5493],
         [ 12.4780, -13.1469,  -1.5811,  ...,  17.1686, -24.5159, -31.4329]]],
       grad_fn=<MulBackward0>)
torch.Size([2, 4, 512])

Positional encoding

Because the encoder structure of the Transformer does not itself process word-position information, a positional encoder has to be added after the embedding layer. It injects into the word-embedding tensor the information that can produce different semantics depending on where a word sits, making up for the missing position information (for example, "a before b" versus "b before a" changes the meaning). Since we want to add position information, we can simply add it onto the output of the embedding layer. The Transformer uses trigonometric (sinusoidal) positional encoding, because sine and cosine keep the added positional values within [-1, 1], so the values never become too large or too small even at long distances.
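For reference, the sinusoidal encoding defined in the original paper "Attention Is All You Need" is

$PE_{(pos,\,2i)} = \sin\left(\dfrac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$

The div_term in the code below computes the $1/10000^{2i/d_{model}}$ factor in log space, which is numerically more stable.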

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        # d_model: word embedding dimension
        # dropout: dropout rate
        # max_len: maximum sentence length
        super(PositionalEncoding, self).__init__()
        self.d_model = d_model
        self.dropout = nn.Dropout(p=dropout)
        self.max_len = max_len
        # initialize a positional encoding matrix of shape max_len * d_model
        pe = torch.zeros(max_len, d_model)
        # initialize an absolute-position matrix; here a word's absolute position is simply its index,
        # so use arange to get a vector of consecutive integers, then unsqueeze to add a dimension.
        # Passing 1 expands along that axis and turns the vector into a max_len * 1 matrix
        position = torch.arange(0, max_len).unsqueeze(1)
        # with the absolute positions initialized, we need to fold them into the positional encoding.
        # The simplest idea is to transform the max_len * 1 position matrix into max_len * d_model
        # and write it over the initial matrix; this transformation uses a 1 * (d_model/2) vector div_term.
        # We also want this transformation to scale the absolute positions down to sufficiently small
        # numbers, which helps gradient descent converge faster.

        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # pe is still a two-dimensional matrix; to match the output of the embedding layer
        # it needs an extra (batch) dimension
        pe = pe.unsqueeze(0)

        # finally, register pe as a model buffer: something that helps the model but is neither a
        # structural hyperparameter nor a parameter, and does not need to be updated by the optimizer.
        # Once registered, it is saved and reloaded together with the model's structure and parameters.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # before adding, adapt pe to x by slicing its second dimension (the sentence-length dimension),
        # because the default max_len of 5000 is far longer than any real sentence.
        # Wrap it in Variable so it has the same form as x and does not require gradients.
        x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)
        return self.dropout(x)

d_model = 512    # must match the embedding dimension of embr
dropout = 0.1
max_len = 60
x = embr
pe = PositionalEncoding(d_model, dropout, max_len)
pe_result = pe(x)
print(pe_result)
print(pe_result.shape)
>>>
tensor([[[-42.2318, -30.4098,  12.8177,  ..., -18.1752, -15.4918,  14.7851],
         [ 21.3788, -27.5995, -11.0009,  ...,  -4.8079,   9.5106,  -0.0000],
         [ -9.8811,  14.2337,   0.0000,  ...,  25.7657,  35.6148,   1.4002],
         [ 30.1056,  22.5950, -20.9485,  ..., -24.2974,   1.4359,  18.3838]],

        [[-16.1973,  25.7728, -43.7939,  ..., -30.2960,  46.2427,  -4.6569],
         [ -9.7290,  15.3556,  14.7470,  ...,   0.0000, -32.3663,  44.7289],
         [ -9.4765,  -9.1747,  42.8732,  ...,  39.4169, -15.4931,  -2.8326],
         [ 14.0212, -15.7077,  -1.4844,  ...,  20.1873, -27.2396,  -0.0000]]],
       grad_fn=<MulBackward0>)
torch.Size([2, 4, 512])

output section

[Figure: the output part, a linear projection followed by softmax]
Linear layer followed by softmax classification

import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    def __init__(self, d_model, vocab_size):
        # d_model: word embedding dimension; vocab_size: size of the target vocabulary
        super(Generator, self).__init__()
        self.project = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # project each position onto the vocabulary and return log-probabilities
        return F.log_softmax(self.project(x), dim=-1)
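A quick usage sketch, continuing from the pe_result above; the target vocabulary size of 1000 is just an assumption for the demo:

# d_model matches the embedding dimension, vocab_size is the (assumed) target vocabulary size
gen = Generator(d_model=512, vocab_size=1000)
out = gen(pe_result)
print(out.shape)   # torch.Size([2, 4, 1000]): a log-probability over the vocabulary at every position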

Encoder part

[Figure: the encoder stack]
N encoder layers are stacked (N = 6 in the original paper). Each encoder layer consists of two sublayer structures: a multi-head self-attention sublayer with normalization and a residual connection, and a feed-forward fully connected sublayer with normalization and a residual connection.
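For comparison, PyTorch ships this composition as built-in modules; the following is a rough equivalent of the N = 6 stack, not the hand-written version this walkthrough builds (tensors are sequence-first by default, i.e. (seq_len, batch, d_model)):

import torch
import torch.nn as nn

# one encoder layer = self-attention sublayer + feed-forward sublayer,
# each wrapped in a residual connection and layer normalization
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, dropout=0.1)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # N = 6 as in the paper

src = torch.randn(4, 2, 512)       # (seq_len, batch, d_model)
print(encoder(src).shape)          # torch.Size([4, 2, 512])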

Attention mechanism

We know from recurrent neural networks that if information from earlier time steps is iteratively passed into the current state, the current output can, to some extent, reflect its semantic association with earlier states. However, a single-layer RNN generally does not perform well, so people stack many RNN layers; and once the dataset becomes complex, whether we use convolutional or recurrent networks, the computation becomes very time-consuming. We need a more direct way to model these associations.
When we look at a picture, we can tell whether it is a cat or a dog without examining the whole image; looking at the right parts is enough. In other words, we focus on features rather than on everything at once, and the most critical information can still be extracted. In decoders that add an attention mechanism, three matrices Q, K and V are used. To explain what they mean, here is an analogy:
Suppose we have an article and want to describe it with a few keywords. To keep everyone's answers pointing in the same direction, some key words are given as hints; these hints are K. The information of the whole article is Q, and the answer you give after reading the article is V. If you read the article only superficially, the best answer you can give is just the hints themselves, that is, V = K. But as you reread and understand more deeply, V gradually changes; this process of refining V is what we call the attention mechanism. There is also a special case: the article is so simple that Q, K and the V you arrive at are all the same text. This is the self-attention mechanism. A general attention mechanism represents the text with words other than the text itself, while self-attention represents the text with the text itself: it extracts keywords from the text to describe the text, an extraction of its own features.
For example:
conventional attention mechanism: Q-I love beautiful and rich China; K-I love China; V-I love China
Self-attention mechanism: Q-I love China; K-I love China; V-I love China
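A minimal sketch of the scaled dot-product attention computation softmax(QK^T / sqrt(d_k)) V used by the Transformer; the mask argument is explained in the mask-tensor section below:

import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=None):
    # query/key/value: (..., seq_len, d_k)
    d_k = query.size(-1)
    # similarity scores between every query position and every key position
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # positions where mask == 0 are not allowed to be attended to,
        # so give them a very large negative score before the softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    # weighted sum of the values
    return torch.matmul(p_attn, value), p_attn

q = k = v = torch.randn(2, 4, 512)       # self-attention: Q, K and V are all the same text
out, attn = attention(q, k, v)
print(out.shape, attn.shape)             # torch.Size([2, 4, 512]) torch.Size([2, 4, 4])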

residual layer

If no residual connection is added, the value passed to the next unit is
$y = F(x, \{W_i\})$
When the number of layers is very large, backpropagation keeps multiplying factors that can be smaller than 1, and the derivative tends toward zero:
$\frac{\partial y}{\partial x_i} = \frac{\partial F(x, \{W_i\})}{\partial x_i} \approx 0$
This is very unfavorable for gradient updates. If a residual connection is added:
$y = F(x, \{W_i\}) + x$
$\frac{\partial y}{\partial x_i} = \frac{\partial \left( F(x, \{W_i\}) + x \right)}{\partial x_i} = \frac{\partial F(x, \{W_i\})}{\partial x_i} + 1 \approx 1$
so even when the first term vanishes, the gradient flowing through each module stays close to 1. This makes it possible to stack many layers with nothing more than a simple additive shortcut. In ResNet, residual blocks have been stacked to around a thousand layers.
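In Transformer implementations this pattern is usually wrapped as a "sublayer connection": layer normalization, the sublayer itself, dropout, then the residual add. A minimal sketch in the same style as the classes above (the pre-norm ordering here is one common choice, not the only one):

import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    # residual connection around any sublayer: x + dropout(sublayer(norm(x)))
    def __init__(self, size, dropout=0.1):
        super(SublayerConnection, self).__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # the "+ x" term is what keeps the gradient close to 1
        return x + self.dropout(sublayer(self.norm(x)))

x = torch.randn(2, 4, 512)
sub = SublayerConnection(512)
y = sub(x, nn.Linear(512, 512))    # any sublayer whose output shape matches x
print(y.shape)                     # torch.Size([2, 4, 512])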

mask tensor

Mask tensors first appear in the Transformer and really shine in BERT. A mask simply covers some information so the model cannot see it. Why do that? We read in order, from front to back. Take the sentence: "The weather was great today, I went to play golf, but I dropped my phone, so I went to get it repaired in the evening." Reading it in full, you know what happens later. But if all you had read was "the weather was great today, I went to play golf, and in the evening I got my phone repaired", would you know that the phone was dropped in the middle? As a training setup, we do not want the model to explain earlier content by peeking at later information; we want it to learn the language using only what is already known. So when the model has only read "the weather is good and I went to play", it should learn to infer that what follows might be a nice dinner, a won match, sudden rain, and so on. These guesses may not match the actual continuation, but we want the model to dig out as much of the deeper meaning of the available information as it can. The role of the mask, then, is to cover the information after the current time step.
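A minimal sketch of such a look-ahead mask: a lower-triangular matrix in which position i may only attend to positions up to i; 1 (True) means visible and 0 (False) means covered, matching the mask == 0 convention in the attention sketch above:

import torch

def subsequent_mask(size):
    # the strict upper triangle marks the "future" positions that must be covered
    future = torch.triu(torch.ones(1, size, size), diagonal=1)
    return (future == 0)      # True where attention is allowed

print(subsequent_mask(4))
# tensor([[[ True, False, False, False],
#          [ True,  True, False, False],
#          [ True,  True,  True, False],
#          [ True,  True,  True,  True]]])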

Multi-head self-attention layer

[Figure: multi-head attention, h parallel scaled dot-product attention heads]
By the definition of the attention mechanism, Q, K and V express the semantics from different angles, and the multi-head design is about extracting those semantics better. The mask ensures that each position can only see information up to the current step. "Multi-head" means splitting the current representation into h pieces and sending them through fully connected layers whose weight matrices are square, i.e. the input and output dimensions stay the same. The point of the split is to let the model treat one piece of information as several separate segments, understand each segment on its own, and thereby achieve a better overall semantic extraction.
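A condensed sketch of one common way to implement the multi-head layer, reusing the attention function from the sketch above: four square d_model by d_model linear layers (for Q, K, V and the output), with the projected tensors reshaped into h heads of size d_model / h:

import torch
import torch.nn as nn

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # four square projections: W_Q, W_K, W_V and the final output projection
        self.linears = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(4)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)            # same mask for every head
        nbatch = query.size(0)
        # project, then split d_model into h heads of size d_k
        query, key, value = [
            lin(x).view(nbatch, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]
        x, _ = attention(query, key, value, mask=mask, dropout=self.dropout)
        # merge the heads back together and apply the output projection
        x = x.transpose(1, 2).contiguous().view(nbatch, -1, self.h * self.d_k)
        return self.linears[-1](x)

mha = MultiHeadedAttention(h=8, d_model=512)
x = torch.randn(2, 4, 512)
print(mha(x, x, x).shape)    # torch.Size([2, 4, 512])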

decoder part

[Figure: the decoder stack]
N decoder layers are stacked. Each decoder layer consists of three sublayers: a (masked) multi-head self-attention sublayer with normalization and a residual connection; a multi-head attention sublayer over the encoder output, with normalization and a residual connection; and a feed-forward fully connected sublayer with normalization and a residual connection.
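As with the encoder, PyTorch's built-in modules mirror this three-sublayer structure; a rough sketch (again not the hand-written version built in this series), where tgt_mask carries the look-ahead mask from the mask-tensor section:

import torch
import torch.nn as nn

layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, dim_feedforward=2048, dropout=0.1)
decoder = nn.TransformerDecoder(layer, num_layers=6)    # N = 6 decoder layers

memory = torch.randn(4, 2, 512)    # encoder output, (src_len, batch, d_model)
tgt = torch.randn(3, 2, 512)       # target embeddings, (tgt_len, batch, d_model)
# additive look-ahead mask: -inf strictly above the diagonal, 0 elsewhere
tgt_mask = torch.triu(torch.full((3, 3), float('-inf')), diagonal=1)
print(decoder(tgt, memory, tgt_mask=tgt_mask).shape)    # torch.Size([3, 2, 512])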


Origin blog.csdn.net/D_Ddd0701/article/details/122524016