I'm a beginner writing these notes to record what I'm learning. I hope they also help others who are just getting started, and I'd be grateful if more experienced readers could point out any mistakes. (Content will be removed immediately upon any copyright request.)
I. Principles
Input: the hidden states, with shape (b, s, h); the final output has the same shape (b, s, h).
If a memory input (mem) is also given, it too goes through LayerNorm and is fed into self attention.
1. LayerNorm
In short, LayerNorm normalizes all the features of each individual sample. This stabilizes the current layer's activations, helps avoid vanishing or exploding gradients, and makes it easier for subsequent layers to keep learning.
Formula: $y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$ (where $\epsilon$ prevents the denominator from being zero)
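To make the formula concrete, here is a minimal sketch (the shapes and names are my own illustration, not from the original post) that normalizes each sample over its feature dimension and checks the result against torch.nn.LayerNorm:

import torch

b, s, h = 2, 4, 8                      # batch, sequence, hidden
x = torch.randn(b, s, h)

# Manual LayerNorm over the last (feature) dimension, per sample.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
eps = 1e-5
y_manual = (x - mu) / torch.sqrt(var + eps)   # gamma=1, beta=0

# PyTorch's built-in LayerNorm (weight initialized to 1, bias to 0).
y_torch = torch.nn.LayerNorm(h, eps=eps)(x)

print(torch.allclose(y_manual, y_torch, atol=1e-5))  # True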
2. Self attention
For details, see the post CogView中的Self attention_tt丫的博客-CSDN博客.
3. Residual connections
For an introduction to residual structures, see 深度学习之Resnet详解|CSDN创作打卡_tt丫的博客-CSDN博客_resnet学习.
Residual connections are used to solve the degradation problem of deep networks: instead of asking a block to learn a full mapping, the block learns a residual f(x) on top of an identity shortcut, so the output is x + f(x). A minimal sketch follows.
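As an illustration only (my own sketch, not code from this post), a residual connection simply adds the block's input back onto its output:

import torch

class ResidualBlock(torch.nn.Module):
    """Wraps any sub-layer f so the block computes x + f(x)."""
    def __init__(self, f):
        super().__init__()
        self.f = f

    def forward(self, x):
        # Identity shortcut: gradients can flow straight through the "+".
        return x + self.f(x)

block = ResidualBlock(torch.nn.Linear(8, 8))
print(block(torch.randn(2, 8)).shape)  # torch.Size([2, 8])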
4. MLP
A two-layer feed-forward network applied after attention; as the forward walkthrough below shows, it expands the hidden size from h to 4h, applies a nonlinearity, and projects back to h.
II. Code Analysis
1. __init__
(1) Parameter setup
class GPT2ParallelTransformerLayer(torch.nn.Module):
    """A single layer transformer for GPT2.

    We use the following notation:
        h: hidden size
        n: number of attention heads
        b: batch size
        s: sequence length
    Transformer layer takes input with size [b, s, h] and returns an
    output of the same size.

    Arguments:
        hidden_size: The hidden size of the self attention.
        num_attention_heads: number of attention heads in the self
                             attention.
        attention_dropout_prob: dropout probability of the attention
                                score in self attention.
        output_dropout_prob: dropout probability for the outputs
                             after self attention and final output.
        layernorm_epsilon: epsilon used in layernorm to avoid
                           division by zero.
        init_method: initialization method used for the weights. Note
                     that all biases are initialized to zero and
                     layernorm weights are initialized to one.
        output_layer_init_method: output layers (attention output and
                                  mlp output) initialization. If None,
                                  use `init_method`.
    """
    def __init__(self,
                 hidden_size,
                 num_attention_heads,
                 attention_dropout_prob,
                 output_dropout_prob,
                 layernorm_epsilon,
                 init_method,
                 output_layer_init_method=None,
                 query_window=128,
                 key_window_times=6,
                 scale_normalization=True
                 ):
        super(GPT2ParallelTransformerLayer, self).__init__()
- hidden_size: hidden size of the self attention module (the dimension of the embedding vectors);
- num_attention_heads: number of attention heads in the self attention module;
- attention_dropout_prob: dropout probability applied to the attention scores inside the attention module;
- output_dropout_prob: dropout probability applied to the outputs after the output layers;
- layernorm_epsilon: the ε used in layernorm to avoid division by zero (keeps the denominator from being 0);
- init_method: initialization method for the weights (all biases are initialized to 0, and layernorm weights to 1);
- output_layer_init_method: initialization method for the output layers (attention output and mlp output); if None, init_method is used;
- query_window: size of the sliding window used in sparse attention;
- key_window_times: number of key windows usable in sparse attention;
- scale_normalization: whether to apply the LayerNorm class to the outputs of the third and fourth sub-layers (to keep their values small).
An example instantiation with these arguments is sketched below.
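A hedged instantiation sketch for orientation (the hyperparameter values and the init function are my own placeholders, not values from CogView):

import torch

def init_method(weight):
    # Placeholder initializer: normal with a small std, as is common for GPT-style models.
    torch.nn.init.normal_(weight, mean=0.0, std=0.02)

layer = GPT2ParallelTransformerLayer(
    hidden_size=1024,
    num_attention_heads=16,
    attention_dropout_prob=0.1,
    output_dropout_prob=0.1,
    layernorm_epsilon=1e-5,
    init_method=init_method,
    output_layer_init_method=None,   # falls back to init_method
    query_window=128,
    key_window_times=6,
    scale_normalization=True)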
(2) Output and input layers
Output layer: if output_layer_init_method is None, the output-layer weights are initialized with init_method;
Input layer: apply LayerNorm to the input via the LayerNorm class;
        # Set output layer initialization if not provided.
        if output_layer_init_method is None:
            output_layer_init_method = init_method
        # Layernorm on the input data.
        self.input_layernorm = LayerNorm(hidden_size, eps=layernorm_epsilon)
The LayerNorm class — normalizes its input, after first scaling the values down
from apex.normalization.fused_layer_norm import FusedLayerNorm

# Divide the input by 1/8 of its maximum absolute value before normalizing.
class LayerNorm(FusedLayerNorm):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def forward(self, x):
        return super().forward(x / (x.abs().max().detach() / 8))
This subclasses the FusedLayerNorm class from apex.normalization.fused_layer_norm, and its forward first divides the input by one eighth of the input's maximum absolute value (for FusedLayerNorm itself, see apex.normalization.fused_layer_norm — Apex 0.1.0 documentation). Note that layer normalization is (up to the small ε) invariant to scaling its input by a positive constant, so this division does not change the normalized result; it just keeps the values entering the kernel small.
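A quick sanity check of that scale-invariance claim (a standalone sketch using torch.nn.LayerNorm in place of apex's FusedLayerNorm, which behaves the same here):

import torch

h = 16
ln = torch.nn.LayerNorm(h)
x = torch.randn(4, h) * 1000                 # deliberately large activations

scaled = x / (x.abs().max().detach() / 8)    # the trick used by the LayerNorm subclass
print(torch.allclose(ln(x), ln(scaled), atol=1e-4))  # True (up to eps effects)
print(scaled.abs().max())                    # 8.0 -- the values stay small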
(3) Self attention module initialization
The self attention sub-layer is initialized by instantiating the GPT2ParallelSelfAttention class
        # Self attention.
        self.attention = GPT2ParallelSelfAttention(
            hidden_size,
            num_attention_heads,
            attention_dropout_prob,
            output_dropout_prob,
            init_method,
            output_layer_init_method=output_layer_init_method,
            query_window=query_window,
            key_window_times=key_window_times)
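GPT2ParallelSelfAttention itself is model-parallel and supports a sparse windowed variant. As a rough mental model only, here is a minimal dense multi-head self-attention sketch (my own simplification; the query_window/key_window_times sparse logic and the mem input are omitted):

import math
import torch

class MiniSelfAttention(torch.nn.Module):
    def __init__(self, hidden_size, num_heads, attn_dropout, out_dropout):
        super().__init__()
        self.h, self.n = hidden_size, num_heads
        self.qkv = torch.nn.Linear(hidden_size, 3 * hidden_size)
        self.proj = torch.nn.Linear(hidden_size, hidden_size)
        self.attn_drop = torch.nn.Dropout(attn_dropout)
        self.out_drop = torch.nn.Dropout(out_dropout)

    def forward(self, x, ltor_mask):
        b, s, h = x.shape
        d = h // self.n
        # [b, s, 3h] -> three [b, n, s, d] tensors
        q, k, v = self.qkv(x).view(b, s, 3, self.n, d).permute(2, 0, 3, 1, 4)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # [b, n, s, s]
        scores = scores * ltor_mask - 1e4 * (1.0 - ltor_mask)  # left-to-right mask
        probs = self.attn_drop(torch.softmax(scores, dim=-1))
        out = (probs @ v).transpose(1, 2).reshape(b, s, h)     # back to [b, s, h]
        return self.out_drop(self.proj(out))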
(4) Defining the normalization constraints — LayerNorm
        # Layernorm after the self attention.
        self.post_attention_layernorm = LayerNorm(hidden_size,
                                                  eps=layernorm_epsilon)  # constrain the values with LayerNorm
        self.scale_normalization = scale_normalization
        # Whether to normalize the outputs of the third and fourth sub-layers.
        if scale_normalization:
            self.third_layernorm = LayerNorm(hidden_size,
                                             eps=layernorm_epsilon)
            self.fourth_layernorm = LayerNorm(hidden_size,
                                              eps=layernorm_epsilon)
(5) MLP definition
        # MLP
        self.mlp = GPT2ParallelMLP(
            hidden_size,
            output_dropout_prob,
            init_method,
            output_layer_init_method=output_layer_init_method)
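GPT2ParallelMLP is the model-parallel feed-forward block. As a rough single-GPU equivalent (my own sketch, assuming the usual GPT-2 h -> 4h -> nonlinearity -> h shape described in the forward walkthrough below):

import torch

class MiniMLP(torch.nn.Module):
    def __init__(self, hidden_size, output_dropout_prob):
        super().__init__()
        self.fc1 = torch.nn.Linear(hidden_size, 4 * hidden_size)  # h -> 4h
        self.fc2 = torch.nn.Linear(4 * hidden_size, hidden_size)  # 4h -> h
        self.drop = torch.nn.Dropout(output_dropout_prob)

    def forward(self, x):
        # Nonlinear transformation between the two projections.
        return self.drop(self.fc2(torch.nn.functional.gelu(self.fc1(x))))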
2. forward
    def forward(self, hidden_states, ltor_mask, pivot_idx=None, is_sparse=0, mem=None):
        # hidden_states: [b, s, h], the previous layer's output is this layer's input
        # ltor_mask: [1, 1, s, s], the (left-to-right) attention mask matrix
(1) Layer 1: LayerNorm on the input
        # Layer norm at the beginning of the transformer layer.
        layernorm_output1 = self.input_layernorm(hidden_states)
        mem = self.input_layernorm(mem) if mem is not None else None  # the memory states are LayerNorm-ed too
(2) Layer 2: self attention
The output size is [b, s, h]
        # Self attention.
        attention_output = self.attention(layernorm_output1, ltor_mask, pivot_idx, is_sparse, mem)
(3) Layer 3: third LayerNorm
If scale_normalization is set, the output of the second (attention) layer is LayerNorm-ed, forming the third layer
        # Third LayerNorm
        if self.scale_normalization:
            attention_output = self.third_layernorm(attention_output)
(4) Layer 4
First build the residual connection: add the attention output back onto the layer input
        # Residual connection.
        layernorm_input = hidden_states + attention_output
Then apply LayerNorm to the sum
        # Layer norm post the self attention.
        layernorm_output = self.post_attention_layernorm(layernorm_input)
Finally run it through the MLP, i.e. a nonlinear transformation: h -> 4*h -> h
        # MLP.
        mlp_output = self.mlp(layernorm_output)
If scale_normalization is set, the output of the fourth (MLP) layer is LayerNorm-ed as well
        # Fourth LayerNorm
        if self.scale_normalization:
            mlp_output = self.fourth_layernorm(mlp_output)
(5) Output: the second residual connection
        # Second residual connection.
        output = layernorm_input + mlp_output
        return output
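Putting the whole forward pass together, here is a compact sketch of the layer's data flow (my own simplification: MiniSelfAttention and MiniMLP from the sketches above stand in for the parallel modules, and the sparse/memory arguments are dropped):

import torch

class MiniTransformerLayer(torch.nn.Module):
    """Mirrors GPT2ParallelTransformerLayer's forward order:
    LN -> attention -> (LN) -> residual -> LN -> MLP -> (LN) -> residual."""
    def __init__(self, h, n, attn_drop=0.1, out_drop=0.1, eps=1e-5, scale_normalization=True):
        super().__init__()
        self.input_layernorm = torch.nn.LayerNorm(h, eps=eps)
        self.attention = MiniSelfAttention(h, n, attn_drop, out_drop)  # sketch from section (3)
        self.post_attention_layernorm = torch.nn.LayerNorm(h, eps=eps)
        self.mlp = MiniMLP(h, out_drop)                                # sketch from section (5)
        self.scale_normalization = scale_normalization
        if scale_normalization:
            self.third_layernorm = torch.nn.LayerNorm(h, eps=eps)
            self.fourth_layernorm = torch.nn.LayerNorm(h, eps=eps)

    def forward(self, hidden_states, ltor_mask):
        x = self.input_layernorm(hidden_states)
        attn = self.attention(x, ltor_mask)
        if self.scale_normalization:
            attn = self.third_layernorm(attn)
        residual = hidden_states + attn                  # first residual connection
        mlp_out = self.mlp(self.post_attention_layernorm(residual))
        if self.scale_normalization:
            mlp_out = self.fourth_layernorm(mlp_out)
        return residual + mlp_out                        # second residual connection

b, s, h, n = 2, 16, 64, 4
mask = torch.tril(torch.ones(1, 1, s, s))                # left-to-right mask
layer = MiniTransformerLayer(h, n)
print(layer(torch.randn(b, s, h), mask).shape)           # torch.Size([2, 16, 64])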
Feel free to point out any mistakes in the comments. Thanks, everyone!