Computational graphs
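Any function can be represented as a computational graph whose nodes are the individual computation steps. In the linear classifier shown in the figure above, the inputs are $x$ and $W$; the $*$ node is a matrix multiplication that computes $W*x$ and outputs the score vector. Another node is the hinge loss, which computes the data loss term $L_{i}$, and the regularization term appears at the bottom right. The total loss $L$ at the end is the sum of the regularization term and the data term. Once the graph is drawn, the chain rule gives the gradient at every node.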
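To make the graph concrete, the sketch below writes each node as a small Python function and composes them into the total loss. The text only names a hinge loss and a regularization term, so the multiclass hinge form with margin 1, the L2 penalty, the weight `lam`, and all function names are assumptions chosen for illustration.

```python
def matvec(W, x):
    """Node '*': matrix-vector product W * x, one score per class."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def hinge_loss(scores, correct_class, margin=1.0):
    """Node 'hinge loss': data loss L_i for one example (assumed multiclass form)."""
    s_correct = scores[correct_class]
    return sum(
        max(0.0, s - s_correct + margin)
        for j, s in enumerate(scores)
        if j != correct_class
    )


def l2_regularization(W, lam):
    """Regularization node (assumed L2): lam times the sum of squared weights."""
    return lam * sum(w * w for row in W for w in row)


def total_loss(W, x, correct_class, lam=0.1):
    """Final node: L = data term + regularization term."""
    scores = matvec(W, x)
    return hinge_loss(scores, correct_class) + l2_regularization(W, lam)
```

Each function corresponds to one node, so the backward pass can later be written node by node in the same order reversed.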
Chain rule for (x+y)z
Let $f(x, y, z)=(x+y)z$ and $q(x, y)=x+y$, so that $f=qz$. Then

$$\frac{\partial f}{\partial x}=\frac{\partial f}{\partial q}\times \frac{\partial q}{\partial x}=z\times 1=z$$

$$\frac{\partial f}{\partial y}=\frac{\partial f}{\partial q}\times \frac{\partial q}{\partial y}=z\times 1=z$$

$$\frac{\partial f}{\partial z}=q=x+y$$
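The three formulas can be sketched as a forward pass followed by a backward pass. The function name and the sample inputs in the usage line are hypothetical; the text does not give numeric values for this example.

```python
def forward_backward(x, y, z):
    """Evaluate f = (x + y) * z and its three partial derivatives."""
    # Forward pass: compute the intermediate node q, then the output f.
    q = x + y
    f = q * z

    # Backward pass: apply the chain rule from the output toward the inputs.
    df_dq = z            # f = q * z, so df/dq = z
    df_dz = q            # f = q * z, so df/dz = q = x + y
    df_dx = df_dq * 1.0  # dq/dx = 1
    df_dy = df_dq * 1.0  # dq/dy = 1
    return f, df_dx, df_dy, df_dz


# Illustrative inputs only.
f, df_dx, df_dy, df_dz = forward_backward(-2.0, 5.0, -4.0)
```

Note that `df_dx` and `df_dy` both equal `z`, and `df_dz` equals `x + y`, matching the derivation.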
Example: computing gradients backward
The forward pass through the computational graph is shown in the figure. The backward pass proceeds as follows.

The gradient at the output starts at 1.

For the node $f(x)=\frac{1}{x}$, the derivative is $f'(x)=-\frac{1}{x^{2}}$. Substituting $x=1.37$ gives $f'(x)=-0.53$, so the gradient at this node's input is $-0.53\times 1=-0.53$.

For the node $f(x)=x+1$, the derivative is $f'(x)=1$, so the gradient is $1\times (-0.53)=-0.53$.

For the node $f(x)=e^{x}$, the derivative is $f'(x)=e^{x}$. Substituting $x=-1$ gives $f'(x)=0.37$, so the gradient is $0.37\times (-0.53)=-0.20$.

Continuing in the same way gives the gradients of all the remaining nodes.

The boxed part of the figure above is in fact the sigmoid function. Instead of working backward one step at a time from the output to the node whose gradient is 0.20, that gradient can be obtained directly from the derivative of the sigmoid.
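The three steps worked through above can be replayed in code. This is a minimal sketch: the function and variable names are made up, and it covers only the three nodes discussed in the text ($e^{x}$, $x+1$, $\frac{1}{x}$), starting from the value $-1$ that enters the exponential.

```python
import math


def backprop_chain(u, upstream=1.0):
    """Forward and backward pass through exp -> (+1) -> (1/x)."""
    # Forward pass: u -> e = exp(u) -> s = e + 1 -> out = 1 / s
    e = math.exp(u)
    s = e + 1.0
    out = 1.0 / s

    # Backward pass: local derivative times the upstream gradient at each node.
    grad_s = (-1.0 / s ** 2) * upstream  # through f(x) = 1/x
    grad_e = 1.0 * grad_s                # through f(x) = x + 1
    grad_u = math.exp(u) * grad_e        # through f(x) = e^x
    return out, grad_s, grad_e, grad_u


out, grad_s, grad_e, grad_u = backprop_chain(-1.0)
print(round(grad_s, 2), round(grad_e, 2), round(grad_u, 2))
```

Each gradient is the product of a local derivative, evaluated at the value seen in the forward pass, and the gradient arriving from the node after it.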
Derivative of the sigmoid
$$\sigma(x)=\frac{1}{1+e^{-x}}$$

$$\frac{d\sigma(x)}{dx}=\frac{e^{-x}}{(1+e^{-x})^{2}}=\left(\frac{1+e^{-x}-1}{1+e^{-x}}\right)\left(\frac{1}{1+e^{-x}}\right)=(1-\sigma(x))\,\sigma(x)$$
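A quick way to gain confidence in the closed form is to compare it with a central finite difference. The sketch below does that; the helper names and the step size `h` are assumptions, not part of the original notes.

```python
import math


def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Closed-form derivative: (1 - sigma(x)) * sigma(x)."""
    s = sigmoid(x)
    return (1.0 - s) * s


def numerical_grad(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 1.0, 3.0):
    # The two columns should agree to many decimal places.
    print(x, sigmoid_grad(x), numerical_grad(sigmoid, x))
```

Evaluating `sigmoid_grad(1.0)` gives roughly 0.20, which agrees with the gradient reached step by step in the previous example if the boxed sigmoid receives the input 1 (an inference from the value $-1$ entering the exponential).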
Backpropagation with vectors
As shown in the figure above, differentiating $f$ with respect to $q_{i}$ gives $\frac{\partial f}{\partial q_{i}}=2q_{i}$, so the gradient that flows back to $q$ is

$$\begin{bmatrix} 0.44 \\ 0.52 \\ \end{bmatrix}$$

Differentiating $q_{1}$ (that is, $W_{1, 1}x_{1}+W_{1, 2}x_{2}$) with respect to each entry of $W$ gives

$$\frac{\partial q_{1}}{\partial W_{1, 1}}=x_{1}=0.2$$

$$\frac{\partial q_{1}}{\partial W_{1, 2}}=x_{2}=0.4$$

$$\frac{\partial q_{1}}{\partial W_{2, 1}}=0$$

$$\frac{\partial q_{1}}{\partial W_{2, 2}}=0$$

Likewise, for $q_{2}$:

$$\frac{\partial q_{2}}{\partial W_{1, 1}}=0$$

$$\frac{\partial q_{2}}{\partial W_{1, 2}}=0$$

$$\frac{\partial q_{2}}{\partial W_{2, 1}}=x_{1}=0.2$$

$$\frac{\partial q_{2}}{\partial W_{2, 2}}=x_{2}=0.4$$

That is:
$$\frac{\partial q_{k}}{\partial W_{i, j}}=1_{k=i}x_{j}$$

Here $1_{k=i}$ is an indicator: $1_{k=i}=1$ if $k=i$, and $0$ otherwise. Therefore:
$$\frac{\partial f}{\partial W_{i, j}}=\sum_{k}\frac{\partial f}{\partial q_{k}}\frac{\partial q_{k}}{\partial W_{i, j}}=\sum_{k}(2q_{k})(1_{k=i}x_{j})=2q_{i}x_{j}$$

Hence:
$$\frac{\partial f}{\partial W_{1, 1}}=2q_{1}x_{1}=0.088$$

$$\frac{\partial f}{\partial W_{1, 2}}=2q_{1}x_{2}=0.176$$

$$\frac{\partial f}{\partial W_{2, 1}}=2q_{2}x_{1}=0.104$$

$$\frac{\partial f}{\partial W_{2, 2}}=2q_{2}x_{2}=0.208$$

Altogether:

$$\frac{\partial f}{\partial W}= \begin{bmatrix} 0.088 & 0.176 \\ 0.104 & 0.208 \\ \end{bmatrix}$$
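The result $\frac{\partial f}{\partial W_{i, j}}=2q_{i}x_{j}$ can be reproduced with a few lines of plain Python. The sketch uses the $W$ and $x$ values that appear in this section and assumes $f=\sum_{k}q_{k}^{2}$, which is consistent with $\frac{\partial f}{\partial q_{i}}=2q_{i}$; the function names are hypothetical.

```python
def matvec(W, x):
    """q = W * x for a matrix given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def grad_wrt_W(W, x):
    """Gradient of f = sum_k q_k^2 with respect to W, where q = W * x."""
    q = matvec(W, x)
    grad_q = [2.0 * q_k for q_k in q]  # df/dq_k = 2 * q_k
    # df/dW[i][j] = grad_q[i] * x[j], because q_i is the only
    # component of q that depends on row i of W.
    return [[g_i * x_j for x_j in x] for g_i in grad_q]


W = [[0.1, 0.5], [-0.3, 0.8]]
x = [0.2, 0.4]
dW = grad_wrt_W(W, x)
print([[round(v, 3) for v in row] for row in dW])
```

The gradient has the same shape as $W$, and each row is the corresponding entry of $2q$ times the input vector $x$.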
Continuing, differentiating $q_{1}$ with respect to $x_{1}$ gives

$$\frac{\partial q_{1}}{\partial x_{1}}=W_{1, 1}=0.1$$

Similarly:

$$\frac{\partial q_{1}}{\partial x_{2}}=W_{1, 2}=0.5$$

$$\frac{\partial q_{2}}{\partial x_{1}}=W_{2, 1}=-0.3$$

$$\frac{\partial q_{2}}{\partial x_{2}}=W_{2, 2}=0.8$$

That is:
$$\frac{\partial q_{k}}{\partial x_{i}}=W_{k, i}$$

$$\frac{\partial f}{\partial x_{i}}=\sum_{k}\frac{\partial f}{\partial q_{k}}\frac{\partial q_{k}}{\partial x_{i}}=\sum_{k}2q_{k}W_{k, i}$$

Therefore:
$$\frac{\partial f}{\partial x_{1}}=2q_{1}W_{1, 1}+2q_{2}W_{2, 1}=-0.112$$

$$\frac{\partial f}{\partial x_{2}}=2q_{1}W_{1, 2}+2q_{2}W_{2, 2}=0.636$$

So:
$$\frac{\partial f}{\partial x}=\begin{bmatrix} -0.112 \\ 0.636 \\ \end{bmatrix}$$
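The sum $\frac{\partial f}{\partial x_{i}}=\sum_{k}2q_{k}W_{k, i}$ can be checked the same way. As before, this is only a sketch: it assumes $f=\sum_{k}q_{k}^{2}$ with $q=Wx$, reuses the numbers from this section, and the function names are hypothetical.

```python
def matvec(W, x):
    """q = W * x for a matrix given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def grad_wrt_x(W, x):
    """Gradient of f = sum_k q_k^2 with respect to x, where q = W * x."""
    q = matvec(W, x)
    grad_q = [2.0 * q_k for q_k in q]  # df/dq_k = 2 * q_k
    # df/dx[i] = sum_k grad_q[k] * W[k][i]: every q_k depends on x_i,
    # so the contributions from all rows of W are added up.
    return [
        sum(grad_q[k] * W[k][i] for k in range(len(W)))
        for i in range(len(x))
    ]


W = [[0.1, 0.5], [-0.3, 0.8]]
x = [0.2, 0.4]
dx = grad_wrt_x(W, x)
print([round(v, 3) for v in dx])
```

Unlike the gradient with respect to $W$, each component of this gradient sums over all components of $q$, which is the transpose of $W$ applied to $2q$.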
Finally: