1 Weight Decay

1.1 Squared L2 norm as a hard constraint

$\min \ell(\mathbf{w}, b) \quad \text { subject to }\|\mathbf{w}\|^{2} \leq \theta$

The bias $b$ is usually left unconstrained.
A smaller $\theta$ means stronger regularization.

1.2 Squared L2 norm as a soft constraint

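For every $\theta$ there is a $\lambda$ such that the constrained problem above is equivalent to
$\min \ell(\mathbf{w}, b)+\frac{\lambda}{2}\|\mathbf{w}\|^{2}$
The hyperparameter $\lambda$ controls how much weight the regularization term carries:
- $\lambda = 0$ : no effect
- $\lambda \rightarrow \infty$ : $\mathbf{w}^{*} \rightarrow \mathbf{0}$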
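
The name "weight decay" can be read off the gradient step for this objective. As a sketch, assuming plain gradient descent with a learning rate $\eta$ (a symbol introduced here, not used elsewhere in these notes):
$\begin{aligned} \mathbf{w}_{t+1} &=\mathbf{w}_{t}-\eta\left(\frac{\partial \ell}{\partial \mathbf{w}_{t}}+\lambda \mathbf{w}_{t}\right) \\ &=(1-\eta \lambda)\, \mathbf{w}_{t}-\eta \frac{\partial \ell}{\partial \mathbf{w}_{t}} \end{aligned}$
When $0<\eta\lambda<1$, each step first shrinks the weights by the factor $1-\eta\lambda$ and then applies the usual gradient update.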

2 Weight Decay Code Implementation
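
As a rough illustration of the soft constraint from section 1.2, the sketch below fits a small linear model by gradient descent with the penalty $\frac{\lambda}{2}\|\mathbf{w}\|^{2}$ added to a mean squared error loss. It uses only the Python standard library so that the arithmetic stays visible. The function names (`train_ridge`, `squared_norm`), the synthetic data, the learning rate, and the epoch count are all assumptions made for this example.

```python
import random


def train_ridge(xs, ys, lam, lr=0.05, epochs=500):
    """Gradient descent on 1/(2n) * sum(err^2) + (lam / 2) * ||w||^2.

    The bias b is not penalized, matching the notes above.
    """
    dim = len(xs[0])
    n = len(xs)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        # The penalty contributes lam * w to the weight gradient only
        w = [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def squared_norm(w):
    return sum(wi * wi for wi in w)


# Synthetic data (hypothetical): y = 2 * x1 - 3 * x2 + 1 + small noise
random.seed(0)
true_w, true_b = [2.0, -3.0], 1.0
xs = [[random.gauss(0, 1) for _ in range(2)] for _ in range(50)]
ys = [sum(wi * xi for wi, xi in zip(true_w, x)) + true_b + random.gauss(0, 0.1)
      for x in xs]

for lam in (0.0, 1.0, 10.0):
    w, b = train_ridge(xs, ys, lam)
    print(f"lambda={lam}: ||w||^2 = {squared_norm(w):.4f}")
```

The printed squared norm should shrink as `lam` grows, which is the behaviour described by the two limiting cases $\lambda = 0$ and $\lambda \rightarrow \infty$.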

3 Dropout

Definition: inject noise between layers

3.1 Adding noise without introducing bias

Adding noise to $\mathbf{x}$ gives $\mathbf{x}'$, and we want the expectation to be unchanged:
$\mathbf{E}\left[\mathbf{x}^{\prime}\right]=\mathbf{x}$
Dropout perturbs each element as follows:
$x_{i}^{\prime}=\left\{\begin{array}{ll} 0 & \text { with probability } p \\ \frac{x_{i}}{1-p} & \text { otherwise } \end{array}\right.$
Proof:
$\begin{aligned} \mathbf{E}\left[x_{i}^{\prime}\right] &=p \cdot 0+(1-p) \frac{x_{i}}{1-p} \\ &=x_{i} \end{aligned}$
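
The expectation above can also be checked numerically. The following is a minimal Monte Carlo sketch in plain Python. The helper name `dropout_scalar`, the sample count, and the chosen values of `x` and `p` are assumptions for illustration only.

```python
import random


def dropout_scalar(x, p, rng):
    """Return 0 with probability p, otherwise x / (1 - p)."""
    if p >= 1.0:
        return 0.0
    return 0.0 if rng.random() < p else x / (1.0 - p)


rng = random.Random(0)
x, p, n = 2.0, 0.5, 100_000

# Average many independent perturbed copies of the same x
mean = sum(dropout_scalar(x, p, rng) for _ in range(n)) / n
print(f"empirical mean of x' = {mean:.3f} (x = {x})")
```

With enough samples the empirical mean should land close to `x`, even though every individual sample is either `0` or `x / (1 - p)`.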

3.2 Using dropout

Dropout is usually applied to the output of a hidden fully connected layer:
$\begin{aligned} \mathbf{h} &=\sigma\left(\mathbf{W}_{1} \mathbf{x}+\mathbf{b}_{1}\right) \\ \mathbf{h}^{\prime} &=\text { dropout }(\mathbf{h}) \\ \mathbf{o} &=\mathbf{W}_{2} \mathbf{h}^{\prime}+\mathbf{b}_{2} \\ \mathbf{y} &=\operatorname{softmax}(\mathbf{o}) \end{aligned}$
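
The four equations above can be traced step by step in code. The sketch below is a dependency-free forward pass, not a reference implementation. It assumes ReLU for the activation $\sigma$, random illustrative weights, and a `training` flag that turns dropout off at inference time (standard practice, though not stated in these notes). All function names are hypothetical.

```python
import math
import random


def affine(W, x, b):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def dropout(h, p, rng, training):
    """Zero each element with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return list(h)
    if p == 1.0:
        return [0.0] * len(h)
    return [0.0 if rng.random() < p else hi / (1.0 - p) for hi in h]


def softmax(o):
    m = max(o)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in o]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, params, p, rng, training):
    W1, b1, W2, b2 = params
    h = relu(affine(W1, x, b1))              # h  = sigma(W1 x + b1)
    h_prime = dropout(h, p, rng, training)   # h' = dropout(h)
    o = affine(W2, h_prime, b2)              # o  = W2 h' + b2
    return softmax(o)                        # y  = softmax(o)


rng = random.Random(0)
W1 = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]
b1 = [0.0] * 4
W2 = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]
b2 = [0.0] * 2
params = (W1, b1, W2, b2)
x = [1.0, -0.5, 2.0]

y_train = forward(x, params, 0.5, rng, training=True)
y_eval = forward(x, params, 0.5, rng, training=False)
print("training mode:", y_train)
print("inference mode:", y_eval)
```

Note that only the hidden activations `h` are perturbed. The input `x` and the output `o` are left untouched.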

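3.3 Summary

Dropout controls model complexity by randomly setting some outputs to 0.
It is commonly applied to the hidden-layer outputs of a multilayer perceptron.
The dropout probability is the hyperparameter that controls model complexity.

4 Dropout Code Implementation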

# -*- coding: utf-8 -*- 
# @Time : 2021/9/13 16:54 
# @Author : Amonologue
# @software : pycharm   
# @File : Dropout_from_zero.py
import torch


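def dropout_layer(X, dropout):
    """Zero each element of X with probability `dropout` and rescale the rest."""
    assert 0 <= dropout <= 1
    # Every element is dropped
    if dropout == 1:
        return torch.zeros_like(X)
    # Nothing is dropped
    if dropout == 0:
        return X
    # torch.rand samples uniformly from [0, 1), so each element is kept
    # with probability 1 - dropout
    mask = (torch.rand(X.shape) > dropout).float()
    # Scale the survivors by 1 / (1 - dropout) to keep the expectation unchanged
    return mask * X / (1 - dropout)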


if __name__ == '__main__':
    X = torch.arange(16, dtype=torch.float32).reshape((2, 8))
    print(X)
    print(dropout_layer(X, 0))
    print(dropout_layer(X, 0.5))
    print(dropout_layer(X, 0.5))
    print(dropout_layer(X, 1))

Output:
1)tensor([[ 0., 1., 2., 3., 4., 5., 6., 7.], [ 8., 9., 10., 11., 12., 13., 14., 15.]])
2)tensor([[ 0., 1., 2., 3., 4., 5., 6., 7.], [ 8., 9., 10., 11., 12., 13., 14., 15.]])
3)tensor([[ 0., 2., 0., 6., 0., 0., 0., 14.], [ 0., 0., 0., 22., 0., 26., 0., 0.]])
4)tensor([[ 0., 2., 4., 0., 0., 10., 12., 14.], [ 0., 0., 0., 0., 0., 26., 0., 0.]])
5)tensor([[0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0.]])

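Outputs 2 to 5 show that, depending on the dropout probability, elements at random positions of X are set to 0. With a probability of 0.5, the surviving elements are doubled (for example, 1 becomes 2).

Outputs 3 and 4 also show that two calls with the same dropout probability give different results, because a new random mask is drawn on every call.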
