Study notes based on the Dive into Deep Learning (《动手学深度学习》) course on the Boyu AI learning platform. Original link: https://www.boyuai.com/elites/course/cZu18YmweLv10OeV/jupyter/6X2EcSYKYpTDlzAKwQhNi

Thanks to the Boyu platform, Datawhale, Heywhale (和鲸), and AWS for providing this free learning opportunity! Overall impression: the Boyu courses are well made and very systematic; each higher-level course introduces the prerequisite basics it needs, which suits learners with a weaker background like me. Learners who want to shore up the basics may also want to look at Boyu's other courses:
Math fundamentals: https://www.boyuai.com/elites/course/D91JM0bv72Zop1D3
Machine learning fundamentals: https://www.boyuai.com/elites/course/5ICEBwpbHVwwnK3C
Main topics:
Batch normalization (BatchNormalization) and residual networks; convex optimization; gradient descent
1. Batch Normalization (BatchNormalization) and Residual Networks
Batch Normalization (BatchNormalization)
Standardizing the input (shallow models)
After processing, each feature has mean 0 and standard deviation 1 over all samples in the dataset. Standardizing the input makes the distributions of the different features similar.
Batch normalization (deep models)
Uses the mean and standard deviation over a mini-batch to continually adjust the network's intermediate outputs, making the values of the intermediate outputs at every layer more stable. Essentially it is a standardization of the data.
1. Batch normalization for fully connected layers
Position: between the affine transformation and the activation function in the fully connected layer. Fully connected layer:
$$\boldsymbol{x} = \boldsymbol{W}\boldsymbol{u} + \boldsymbol{b} \\ output =\phi(\boldsymbol{x})$$
Batch normalization (make the mean 0 and the standard deviation 1):
$$output=\phi(\text{BN}(\boldsymbol{x}))$$
$$\boldsymbol{y}^{(i)} = \text{BN}(\boldsymbol{x}^{(i)})$$
$$\boldsymbol{\mu}_\mathcal{B} \leftarrow \frac{1}{m}\sum_{i = 1}^{m} \boldsymbol{x}^{(i)},$$
$$\boldsymbol{\sigma}_\mathcal{B}^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m}(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_\mathcal{B})^2,$$
$$\hat{\boldsymbol{x}}^{(i)} \leftarrow \frac{\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_\mathcal{B}}{\sqrt{\boldsymbol{\sigma}_\mathcal{B}^2 + \epsilon}},$$
where $\epsilon > 0$ is a small constant that keeps the denominator greater than 0.
$${\boldsymbol{y}}^{(i)} \leftarrow \boldsymbol{\gamma} \odot \hat{\boldsymbol{x}}^{(i)} + \boldsymbol{\beta}.$$
Learnable parameters are introduced: a scale parameter $\boldsymbol{\gamma}$ and a shift parameter $\boldsymbol{\beta}$. If $\boldsymbol{\gamma} = \sqrt{\boldsymbol{\sigma}_\mathcal{B}^2 + \epsilon}$ and $\boldsymbol{\beta} = \boldsymbol{\mu}_\mathcal{B}$, batch normalization has no effect: the layer recovers the original $\boldsymbol{x}$. (For a fully connected layer, the mini-batch being normalized has shape $m \times d$.)
2. Batch normalization for convolutional layers
Position: after the convolution computation, before the activation function. If the convolution outputs multiple channels, batch normalization is applied to each channel separately, and every channel has its own scale and shift parameters. Computation: for a single channel, with batch size $m$ and a convolution output of size $p \times q$, batch normalization is applied jointly to the $m \times p \times q$ elements of that channel, all sharing the same mean and variance.
3. Batch normalization at prediction time
Training: statistics are computed batch by batch, using each batch's own mean and variance. Prediction: the sample mean and variance over the whole training dataset are estimated with moving averages.
Implementation from scratch
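A minimal from-scratch sketch of the computation above in plain NumPy (not the course's PyTorch version; `momentum` and the moving-average update are one common convention for the prediction-time statistics):

```python
import numpy as np

def batch_norm(X, gamma, beta, moving_mean, moving_var,
               eps=1e-5, momentum=0.9, training=True):
    """Batch norm for a fully connected layer: X has shape (m, d)."""
    if not training:
        # At prediction time, use the moving-average statistics.
        X_hat = (X - moving_mean) / np.sqrt(moving_var + eps)
    else:
        mu = X.mean(axis=0)                 # per-feature mean over the batch
        var = ((X - mu) ** 2).mean(axis=0)  # per-feature variance
        X_hat = (X - mu) / np.sqrt(var + eps)
        # Update the moving averages used at prediction time.
        moving_mean = momentum * moving_mean + (1 - momentum) * mu
        moving_var = momentum * moving_var + (1 - momentum) * var
    Y = gamma * X_hat + beta                # scale and shift
    return Y, moving_mean, moving_var

X = np.random.randn(8, 4) * 3 + 5
Y, mm, mv = batch_norm(X, np.ones(4), np.zeros(4),
                       np.zeros(4), np.ones(4))
print(Y.mean(axis=0), Y.std(axis=0))  # ≈ 0 and ≈ 1 per feature
```

With `gamma = 1` and `beta = 0` every output feature is standardized, exactly as in the formulas.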
Residual Networks (ResNet)
(Kaiming He's team won the 2015 ImageNet image recognition competition with it.) The problem in deep learning: once a deep CNN reaches a certain depth, blindly adding more layers does not bring further gains in classification performance; instead it makes convergence slower and accuracy worse.
Residual Block
Identity mapping: left: $f(x)=x$; right: $f(x)-x=0$ (small deviations around the identity mapping are easier to capture).
In a residual block, the input $x$ can propagate forward faster through the cross-layer data path. The layers themselves are not altered.
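Following the structure described above, a residual block can be written in PyTorch roughly as in the course notebook (the `use_1x1conv` option handles shortcuts that must change the channel count or resolution):

```python
import torch
from torch import nn
import torch.nn.functional as F

class Residual(nn.Module):
    """A residual block: output is relu(f(X) + X)."""
    def __init__(self, in_channels, out_channels, use_1x1conv=False, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels,
                               kernel_size=3, padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels,
                               kernel_size=3, padding=1)
        # A 1x1 conv matches channels/shape on the shortcut when needed.
        self.conv3 = (nn.Conv2d(in_channels, out_channels,
                                kernel_size=1, stride=stride)
                      if use_1x1conv else None)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)   # shortcut path
        return F.relu(Y + X)    # add the input back in

blk = Residual(3, 3)
X = torch.rand(4, 3, 6, 6)
print(blk(X).shape)  # torch.Size([4, 3, 6, 6])
```

With `use_1x1conv=True, stride=2` the same block halves the spatial size while changing channels.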
ResNet model
Convolution (64, 7×7, 3)
Batch normalization
Max pooling (3×3, 2)
Residual blocks ×4 (a residual block with stride 2 reduces the height and width between consecutive modules)
Global average pooling
Fully connected layer
Densely Connected Networks (DenseNet)
Blocks are joined by concatenation: the channel counts of A and B are added together (channels are concatenated, not summed).
Main building blocks:
Dense block: defines how the input and the outputs are concatenated.
Transition layer: controls the number of channels so it does not grow too large.
Dense block
A block takes `in_channels` channels and outputs `out_channels` channels; output and input are concatenated to give `in_channels + out_channels` channels, which in turn becomes the input of the next convolution, and so on in a loop.
Transition layer
$1\times1$ convolutional layer: reduces the number of channels. Stride-2 average pooling layer: halves the height and width. Together they reduce the model complexity.
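The two building blocks can be sketched in PyTorch as follows (a sketch in the spirit of the course code; `growth` is my name for the per-convolution output channel count):

```python
import torch
from torch import nn

def conv_block(in_c, out_c):
    # BN -> ReLU -> Conv, the ordering DenseNet uses
    return nn.Sequential(nn.BatchNorm2d(in_c), nn.ReLU(),
                         nn.Conv2d(in_c, out_c, kernel_size=3, padding=1))

class DenseBlock(nn.Module):
    def __init__(self, num_convs, in_c, growth):
        super().__init__()
        self.net = nn.ModuleList(
            conv_block(in_c + i * growth, growth) for i in range(num_convs))
        self.out_channels = in_c + num_convs * growth

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            X = torch.cat((X, Y), dim=1)  # concatenate along the channel axis
        return X

def transition_block(in_c, out_c):
    # 1x1 conv shrinks channels; stride-2 avg pooling halves height/width
    return nn.Sequential(nn.BatchNorm2d(in_c), nn.ReLU(),
                         nn.Conv2d(in_c, out_c, kernel_size=1),
                         nn.AvgPool2d(kernel_size=2, stride=2))

blk = DenseBlock(2, 3, growth=10)
X = torch.rand(4, 3, 8, 8)
Y = blk(X)
print(Y.shape)          # torch.Size([4, 23, 8, 8]): 3 + 2*10 channels
tr = transition_block(23, 10)
print(tr(Y).shape)      # torch.Size([4, 10, 4, 4])
```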
2. Convex Optimization
Optimization and estimation
Although optimization methods can minimize the loss function value in deep learning, the goal an optimization method pursues is essentially different from the goal of deep learning.
Goal of the optimization method: the loss on the training set.
Goal of deep learning: the loss on the test set (generalization).
Challenges for optimization in deep learning:
Local minima
Saddle points (points where all first-order partial derivatives are 0 and the eigenvalues of the Hessian matrix have both positive and negative values)
Vanishing gradients
Local minima
$$f(x) = x\cos \pi x$$
Saddle points
The gradient is 0.
e.g. the Hessian matrix
$$A=\left[\begin{array}{cccc}{\frac{\partial^{2} f}{\partial x_{1}^{2}}} & {\frac{\partial^{2} f}{\partial x_{1} \partial x_{2}}} & {\cdots} & {\frac{\partial^{2} f}{\partial x_{1} \partial x_{n}}} \\ {\frac{\partial^{2} f}{\partial x_{2} \partial x_{1}}} & {\frac{\partial^{2} f}{\partial x_{2}^{2}}} & {\cdots} & {\frac{\partial^{2} f}{\partial x_{2} \partial x_{n}}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {\frac{\partial^{2} f}{\partial x_{n} \partial x_{1}}} & {\frac{\partial^{2} f}{\partial x_{n} \partial x_{2}}} & {\cdots} & {\frac{\partial^{2} f}{\partial x_{n}^{2}}}\end{array}\right]$$
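A concrete check (my own small example, not from the course): $f(x, y) = x^2 - y^2$ has zero gradient at the origin, and the mixed-sign eigenvalues of its Hessian confirm that the origin is a saddle point:

```python
import numpy as np

# f(x, y) = x**2 - y**2 has zero gradient at the origin, but its
# Hessian there has eigenvalues of both signs -> a saddle point.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])  # Hessian of f at (0, 0)
eigvals = np.linalg.eigvalsh(H)  # returned in ascending order
print(eigvals)  # [-2.  2.]: one negative, one positive
```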
Vanishing gradients
Convexity
The properties of convex functions are helpful when studying the optimization of loss functions: the region around a local minimum often exhibits the characteristics of a convex function, even though the function as a whole is not convex.
Sets
A set is convex if, for any two points in it, every point on the line segment connecting them also lies in the set. (Referring to the course figures: set 1 is not convex; sets 2–5 are convex; the intersection of two convex sets is convex (6); the union of convex sets is not necessarily convex (7).)
Jensen's inequality
For a convex function, the expectation of the function value is greater than or equal to the function of the expectation:
$$\sum_{i} \alpha_{i} f\left(x_{i}\right) \geq f\left(\sum_{i} \alpha_{i} x_{i}\right) \text { and } E_{x}[f(x)] \geq f\left(E_{x}[x]\right)$$
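A quick numerical sanity check of Jensen's inequality with the convex function $f(x) = e^x$ (my own example):

```python
import numpy as np

# For convex f, E[f(x)] >= f(E[x]); try f(x) = exp(x) on standard normals.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
lhs = np.exp(x).mean()   # E[f(x)], should be near e**0.5 ≈ 1.65
rhs = np.exp(x.mean())   # f(E[x]), should be near 1
print(lhs, rhs)
```

The inequality holds exactly for the empirical sample as well, since the sample mean is itself a convex combination.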
Properties
No local minima
Relationship with convex sets
Second-order condition
Convex functions and second derivatives
$$f^{''}(x) \ge 0 \Longleftrightarrow f(x) \text{ is convex}$$
Necessity ($\Leftarrow$): for a convex function,
$$\frac{1}{2} f(x+\epsilon)+\frac{1}{2} f(x-\epsilon) \geq f\left(\frac{x+\epsilon}{2}+\frac{x-\epsilon}{2}\right)=f(x)$$
Hence:
$$f^{\prime \prime}(x)=\lim _{\varepsilon \rightarrow 0} \frac{\frac{f(x+\epsilon) - f(x)}{\epsilon}-\frac{f(x) - f(x-\epsilon)}{\epsilon}}{\epsilon}$$
$$f^{\prime \prime}(x)=\lim _{\varepsilon \rightarrow 0} \frac{f(x+\epsilon)+f(x-\epsilon)-2 f(x)}{\epsilon^{2}} \geq 0$$
Sufficiency ($\Rightarrow$): let $a < x < b$ be three points in the domain of $f(x)$; by the Lagrange mean value theorem,
$$\begin{array}{l}{f(x)-f(a)=(x-a) f^{\prime}(\alpha) \text { for some } \alpha \in[a, x] \text { and }} \\ {f(b)-f(x)=(b-x) f^{\prime}(\beta) \text { for some } \beta \in[x, b]}\end{array}$$
By monotonicity of $f'$ (since $f'' \geq 0$), we have $f^{\prime}(\beta) \geq f^{\prime}(\alpha)$, hence:
$$\begin{aligned} f(b)-f(a) &=f(b)-f(x)+f(x)-f(a) \\ &=(b-x) f^{\prime}(\beta)+(x-a) f^{\prime}(\alpha) \\ & \geq(b-a) f^{\prime}(\alpha) \end{aligned}$$
Constrained optimization
$$\begin{array}{l}{\underset{\mathbf{x}}{\operatorname{minimize}} f(\mathbf{x})} \\ {\text { subject to } c_{i}(\mathbf{x}) \leq 0 \text { for all } i \in\{1, \ldots, N\}}\end{array}$$
Lagrange multipliers
(Boyd & Vandenberghe, 2004)
$$L(\mathbf{x}, \alpha)=f(\mathbf{x})+\sum_{i} \alpha_{i} c_{i}(\mathbf{x}) \text { where } \alpha_{i} \geq 0$$
Penalty terms
To encourage $c_i(x) \leq 0$, the term $\alpha_i c_i(x)$ is added to the objective function, just like the penalty $\frac{\lambda}{2} \|w\|^2$ in the multilayer perceptron chapter.
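As a small illustration of the penalty idea (my own sketch, not from the course): minimize $f(x) = (x-2)^2$ subject to $x \leq 1$ by penalizing the constraint violation quadratically; as the penalty weight $\rho$ grows, the unconstrained minimizer approaches the boundary $x = 1$:

```python
def minimize_penalized(rho, steps=200):
    # Gradient descent on f(x) + rho * max(0, c(x))**2 with c(x) = x - 1.
    x, eta = 0.0, 0.5 / (1.0 + rho)   # step size chosen to keep the iteration stable
    for _ in range(steps):
        grad = 2 * (x - 2) + 2 * rho * max(0.0, x - 1)
        x -= eta * grad
    return x

for rho in [1, 10, 100]:
    print(rho, minimize_penalized(rho))
# the minimizer moves toward x = 1 as rho increases
```

In closed form the penalized minimizer is $(2+\rho)/(1+\rho)$, which tends to 1 as $\rho \to \infty$.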
Projection
$$\operatorname{Proj}_{X}(\mathbf{x})=\underset{\mathbf{x}^{\prime} \in X}{\operatorname{argmin}}\left\|\mathbf{x}-\mathbf{x}^{\prime}\right\|_{2}$$
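For one case where the argmin has a closed form: projecting onto a Euclidean ball simply rescales any point outside the ball back to its surface (a standard example, not from the course):

```python
import numpy as np

def project_to_ball(x, radius=1.0):
    """Euclidean projection of x onto the L2 ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

print(project_to_ball(np.array([3.0, 4.0])))  # [0.6 0.8]: same direction, norm clipped to 1
```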
3. Gradient Descent
(Boyd & Vandenberghe, 2004)
One-dimensional gradient descent
Claim: moving the variable against the gradient direction decreases the function value.
Taylor expansion:
$$f(x+\epsilon)=f(x)+\epsilon f^{\prime}(x)+\mathcal{O}\left(\epsilon^{2}\right)$$
Substitute a step of $\eta f^{\prime}(x)$ along the negative gradient direction:
$$f\left(x-\eta f^{\prime}(x)\right)=f(x)-\eta f^{\prime 2}(x)+\mathcal{O}\left(\eta^{2} f^{\prime 2}(x)\right)$$
$$f\left(x-\eta f^{\prime}(x)\right) \lesssim f(x)$$
$$x \leftarrow x-\eta f^{\prime}(x)$$

e.g. $f(x) = x^2$:

```python
import numpy as np
import d2lzh_pytorch as d2l  # course helper package (provides set_figsize, plt)

def f(x):
    return x ** 2  # objective function

def gradf(x):
    return 2 * x  # its derivative

def gd(eta):
    x = 10  # initial value
    results = [x]
    for i in range(10):
        x -= eta * gradf(x)
        results.append(x)
    print('epoch 10, x:', x)
    return results

res = gd(0.2)

def show_trace(res):
    n = max(abs(min(res)), abs(max(res)))
    f_line = np.arange(-n, n, 0.01)
    d2l.set_figsize((3.5, 2.5))
    d2l.plt.plot(f_line, [f(x) for x in f_line], '-')
    d2l.plt.plot(res, [f(x) for x in res], '-o')
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('f(x)')

show_trace(res)
```
Learning rate

```python
show_trace(gd(0.05))  # learning rate too small: slow progress
show_trace(gd(1.1))   # learning rate too large: the iterates diverge
```
Local minima
e.g. $f(x) = x\cos cx$:

```python
c = 0.15 * np.pi

def f(x):
    return x * np.cos(c * x)

def gradf(x):
    return np.cos(c * x) - c * x * np.sin(c * x)

show_trace(gd(2))  # too large a learning rate traps the iterates near a local minimum
```
Multivariate gradient descent

```python
def train_2d(trainer, steps=20):
    x1, x2 = -5, -2
    results = [(x1, x2)]
    for i in range(steps):
        x1, x2 = trainer(x1, x2)
        results.append((x1, x2))
    print('epoch %d, x1 %f, x2 %f' % (i + 1, x1, x2))
    return results

def show_trace_2d(f, results):
    d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
    x1, x2 = np.meshgrid(np.arange(-5.5, 1.0, 0.1), np.arange(-3.0, 1.0, 0.1))
    d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
    d2l.plt.xlabel('x1')
    d2l.plt.ylabel('x2')
```
e.g. $f(\mathbf{x}) = x_1^2 + 2x_2^2$:

```python
eta = 0.1

def f_2d(x1, x2):  # objective function
    return x1 ** 2 + 2 * x2 ** 2

def gd_2d(x1, x2):
    return (x1 - eta * 2 * x1, x2 - eta * 4 * x2)

show_trace_2d(f_2d, train_2d(gd_2d))
```
Adaptive methods
Newton's method
(It adapts the step size automatically.) In practice it is rarely used because each iteration is too slow, but it offers a useful way of thinking. It closely resembles the Newton root-finding method from high school.
Taylor expansion at $\mathbf{x} + \epsilon$:
$$f(\mathbf{x}+\epsilon)=f(\mathbf{x})+\epsilon^{\top} \nabla f(\mathbf{x})+\frac{1}{2} \epsilon^{\top} \nabla \nabla^{\top} f(\mathbf{x}) \epsilon+\mathcal{O}\left(\|\epsilon\|^{3}\right)$$
At the minimum, $\nabla f(\mathbf{x})=0$, so we want $\nabla f(\mathbf{x} + \epsilon)=0$. Taking the derivative of the expansion above with respect to $\epsilon$ and ignoring higher-order terms gives:

$$\nabla f(\mathbf{x})+\boldsymbol{H}_{f} \boldsymbol{\epsilon}=0 \text { and hence } \epsilon=-\boldsymbol{H}_{f}^{-1} \nabla f(\mathbf{x})$$

```python
c = 0.5

def f(x):
    return np.cosh(c * x)  # objective

def gradf(x):
    return c * np.sinh(c * x)  # derivative

def hessf(x):
    return c ** 2 * np.cosh(c * x)  # Hessian (second derivative in 1D)

# hide the learning rate for now
def newton(eta=1):
    x = 10
    results = [x]
    for i in range(10):
        x -= eta * gradf(x) / hessf(x)
        results.append(x)
    print('epoch 10, x:', x)
    return results

show_trace(newton())
```
```python
c = 0.15 * np.pi

def f(x):
    return x * np.cos(c * x)

def gradf(x):
    return np.cos(c * x) - c * x * np.sin(c * x)

def hessf(x):
    return - 2 * c * np.sin(c * x) - x * c ** 2 * np.cos(c * x)

show_trace(newton())     # fails with the default learning rate of 1
show_trace(newton(0.5))  # a smaller step makes it converge
```
Convergence analysis
We only consider the convergence rate for a convex function whose minimum satisfies $f''(x^*) > 0$:
Let $x_k$ be the value of $x$ after the $k$-th iteration, and let $e_{k}:=x_{k}-x^{*}$ be the distance from $x_k$ to the minimum $x^{*}$. From $f'(x^{*}) = 0$:
$$0=f^{\prime}\left(x_{k}-e_{k}\right)=f^{\prime}\left(x_{k}\right)-e_{k} f^{\prime \prime}\left(x_{k}\right)+\frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right) \text { for some } \xi_{k} \in\left[x_{k}-e_{k}, x_{k}\right]$$
Dividing both sides by $f''(x_k)$ gives:
$$e_{k}-f^{\prime}\left(x_{k}\right) / f^{\prime \prime}\left(x_{k}\right)=\frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right)$$
Substituting the update equation $x_{k+1} = x_{k} - f^{\prime}\left(x_{k}\right) / f^{\prime \prime}\left(x_{k}\right)$, we get:
$$x_k - x^{*} - f^{\prime}\left(x_{k}\right) / f^{\prime \prime}\left(x_{k}\right) =\frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right)$$
$$x_{k+1} - x^{*} = e_{k+1} = \frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right)$$
When $\frac{1}{2} f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right) \leq c$, we have:
$$e_{k+1} \leq c e_{k}^{2}$$
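This quadratic convergence is easy to see numerically (my own example): running Newton's method on $f(x) = \cosh(x)$, whose minimum is at $x^* = 0$ with $f''(0) = 1 > 0$, the error roughly squares at every step:

```python
import numpy as np

# Newton's method on f(x) = cosh(x): x -= f'(x) / f''(x) = tanh(x).
x, errors = 1.0, []
for _ in range(4):
    x -= np.sinh(x) / np.cosh(x)
    errors.append(abs(x))  # e_k = |x_k - x*| with x* = 0
print(errors)  # each error is roughly a constant times the square of the previous one
```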
Preconditioning (gradient descent aided by the Hessian)
$$\mathbf{x} \leftarrow \mathbf{x}-\eta \operatorname{diag}\left(H_{f}\right)^{-1} \nabla f(\mathbf{x})$$
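A sketch of what this buys on a badly scaled quadratic (my own example; the diagonal Hessian entries are computed by hand here):

```python
import numpy as np

# Preconditioned gradient descent on f(x) = x1**2 + 100 * x2**2.
# Dividing each coordinate's gradient by the corresponding diagonal
# Hessian entry equalizes the effective step sizes across coordinates.
diag_H = np.array([2.0, 200.0])  # diagonal of the (constant) Hessian

def grad(x):
    return np.array([2.0 * x[0], 200.0 * x[1]])

x, eta = np.array([5.0, 5.0]), 0.9
for _ in range(50):
    x -= eta * grad(x) / diag_H  # x <- x - eta * diag(H)^{-1} * grad
print(x)  # very close to the minimum at the origin
```

Plain gradient descent on this function would need a step size small enough for the stiff $x_2$ direction, making progress along $x_1$ very slow.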
Gradient descent with line search (conjugate gradient methods)
Stochastic Gradient Descent
Parameter updates in stochastic gradient descent
For a training dataset with $n$ samples, let $f_i(x)$ be the loss function of the $i$-th sample. The objective function is:
$$f(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} f_{i}(\mathbf{x})$$
Its gradient is:
$$\nabla f(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} \nabla f_{i}(\mathbf{x})$$
A single update using this full gradient has time complexity $\mathcal{O}(n)$.
The stochastic gradient descent update rule, with cost $\mathcal{O}(1)$ per update:
$$\mathbf{x} \leftarrow \mathbf{x}-\eta \nabla f_{i}(\mathbf{x})$$
Moreover, the sampled gradient is an unbiased estimate of the full gradient:

$$\mathbb{E}_{i} \nabla f_{i}(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} \nabla f_{i}(\mathbf{x})=\nabla f(\mathbf{x})$$
e.g. $f(x_1, x_2) = x_1^2 + 2 x_2^2$:

```python
def f(x1, x2):
    return x1 ** 2 + 2 * x2 ** 2  # objective

def gradf(x1, x2):
    return (2 * x1, 4 * x2)  # gradient

def sgd(x1, x2):  # simulate a noisy gradient
    global lr  # learning rate scheduler
    (g1, g2) = gradf(x1, x2)  # compute gradient
    (g1, g2) = (g1 + np.random.normal(0.1), g2 + np.random.normal(0.1))
    eta_t = eta * lr()  # learning rate at time t
    return (x1 - eta_t * g1, x2 - eta_t * g2)  # update variables

eta = 0.1
lr = (lambda: 1)  # constant learning rate
show_trace_2d(f, train_2d(sgd, steps=50))  # the iterates still jitter around the optimum
```
Dynamic learning rates
Learning rate schedules

$$\begin{array}{ll}{\eta(t)=\eta_{i} \text { if } t_{i} \leq t \leq t_{i+1}} & {\text { piecewise constant }} \\ {\eta(t)=\eta_{0} \cdot e^{-\lambda t}} & {\text { exponential }} \\ {\eta(t)=\eta_{0} \cdot(\beta t+1)^{-\alpha}} & {\text { polynomial }}\end{array}$$

```python
import math

def exponential():
    global ctr  # number of updates so far
    ctr += 1
    return math.exp(-0.1 * ctr)

ctr = 1
lr = exponential  # set up learning rate
show_trace_2d(f, train_2d(sgd, steps=1000))

def polynomial():
    global ctr
    ctr += 1
    return (1 + 0.1 * ctr) ** (-0.5)

ctr = 1
lr = polynomial  # set up learning rate
show_trace_2d(f, train_2d(sgd, steps=50))
```
Minibatch stochastic gradient descent
A compromise between (full-batch) gradient descent and stochastic gradient descent.

```python
def get_data_ch7():  # this function is saved in the d2lzh_pytorch package for later use
    data = np.genfromtxt('/home/kesci/input/airfoil4755/airfoil_self_noise.dat',
                         delimiter='\t')
    data = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize
    # first 1500 samples (5 features each)
    return (torch.tensor(data[:1500, :-1], dtype=torch.float32),
            torch.tensor(data[:1500, -1], dtype=torch.float32))

features, labels = get_data_ch7()
features.shape

import pandas as pd
df = pd.read_csv('/home/kesci/input/airfoil4755/airfoil_self_noise.dat',
                 delimiter='\t', header=None)
df.head(10)
```
How the parameters are updated:

```python
def sgd(params, states, hyperparams):
    # params: model parameters; states: unused here; hyperparams: holds the learning rate
    for p in params:
        p.data -= hyperparams['lr'] * p.grad.data
```
```python
# This function is saved in the d2lzh_pytorch package for later use.
def train_ch7(optimizer_fn, states, hyperparams, features, labels,
              batch_size=10, num_epochs=2):
    # initialize the model
    net, loss = d2l.linreg, d2l.squared_loss

    w = torch.nn.Parameter(torch.tensor(np.random.normal(0, 0.01, size=(features.shape[1], 1)),
                                        dtype=torch.float32),
                           requires_grad=True)  # weight
    b = torch.nn.Parameter(torch.zeros(1, dtype=torch.float32), requires_grad=True)

    def eval_loss():
        return loss(net(features, w, b), labels).mean().item()

    ls = [eval_loss()]
    data_iter = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(features, labels), batch_size, shuffle=True)

    for _ in range(num_epochs):
        start = time.time()
        for batch_i, (X, y) in enumerate(data_iter):
            l = loss(net(X, w, b), y).mean()  # use the mean loss
            # zero the gradients
            if w.grad is not None:
                w.grad.data.zero_()
                b.grad.data.zero_()
            l.backward()
            optimizer_fn([w, b], states, hyperparams)  # update the model parameters
            if (batch_i + 1) * batch_size % 100 == 0:
                ls.append(eval_loss())  # record the training loss every 100 samples
    # print the result and plot
    print('loss: %f, %f sec per epoch' % (ls[-1], time.time() - start))
    d2l.set_figsize()
    d2l.plt.plot(np.linspace(0, num_epochs, len(ls)), ls)
    d2l.plt.xlabel('epoch')
    d2l.plt.ylabel('loss')

def train_sgd(lr, batch_size, num_epochs=2):
    train_ch7(sgd, None, {'lr': lr}, features, labels, batch_size, num_epochs)
```
Comparison:

```python
train_sgd(1, 1500, 6)  # batch_size = n: (full-batch) gradient descent
train_sgd(0.005, 1)    # batch_size = 1: stochastic gradient descent
train_sgd(0.05, 10)    # batch_size = 10: minibatch stochastic gradient descent
```