Derivation of the Backpropagation Algorithm for Convolutional Neural Networks (CNNs)


1. Fully Connected Layer

Consistent with the backpropagation algorithm for deep neural networks (DNNs), define the auxiliary variables:

$$
\left\{\begin{aligned}
&\delta^L = \frac{\partial J}{\partial z^L} = \frac{\partial J}{\partial a^L} \odot \sigma'(z^L)\\
&\delta^l = (W^{l+1})^T\delta^{l+1}\odot \sigma'(z^l)
\end{aligned}\right. \tag{1}
$$

From these we obtain the gradients of the parameters $W$ and $b$:

$$
\left\{\begin{aligned}
&\frac{\partial J}{\partial W^l} = \frac{\partial J}{\partial z^l} \frac{\partial z^l}{\partial W^l} = \delta^l(a^{l-1})^T\\
&\frac{\partial J}{\partial b^l} = \frac{\partial J}{\partial z^l} \frac{\partial z^l}{\partial b^l} = \delta^l
\end{aligned}\right.
$$
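
The following is a minimal NumPy sketch of Eq. (1) and the gradient formulas above for a single fully connected layer with a sigmoid activation; the function and variable names (`fc_backward`, `delta_next`, and so on) are illustrative and not taken from the original post.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def fc_backward(delta_next, W_next, z, a_prev):
    """Backpropagate through one fully connected layer.

    delta_next : (n_{l+1}, 1)   error of layer l+1, i.e. dJ/dz^{l+1}
    W_next     : (n_{l+1}, n_l) weights of layer l+1
    z          : (n_l, 1)       pre-activation of layer l
    a_prev     : (n_{l-1}, 1)   activation of layer l-1
    """
    # delta^l = (W^{l+1})^T delta^{l+1} ⊙ σ'(z^l)
    delta = (W_next.T @ delta_next) * sigmoid_prime(z)
    # dJ/dW^l = delta^l (a^{l-1})^T,  dJ/db^l = delta^l
    dW = delta @ a_prev.T
    db = delta
    return delta, dW, db
```

Starting from $\delta^L$ at the output layer and calling this repeatedly for $l = L-1, \dots, 2$ reproduces the full backward pass.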

Let me explain why Eq. (1) uses the element-wise (Hadamard) product:

Take the squared error loss function as an example:

$$
J = \frac{1}{2}\left\Vert a^L - y\right\Vert_2^2
= \frac{1}{2}\sum_{i=1}^N\left(a_i^L - y_i\right)^2
= \frac{1}{2}\sum_{i=1}^N\left(\sigma(z_i^L) - y_i\right)^2
$$

$$
\delta^L = \frac{\partial J}{\partial z^L}
= \begin{bmatrix} \dfrac{\partial J}{\partial z_1^L} \\[2ex] \dfrac{\partial J}{\partial z_2^L} \\[2ex] \vdots \\[1ex] \dfrac{\partial J}{\partial z_N^L} \end{bmatrix}
= \begin{bmatrix} (a_1^L-y_1)\,\sigma'(z_1^L) \\[1ex] (a_2^L-y_2)\,\sigma'(z_2^L) \\[1ex] \vdots \\[1ex] (a_N^L-y_N)\,\sigma'(z_N^L) \end{bmatrix}
= (a^L-y)\odot\sigma'(z^L)
= \frac{\partial J}{\partial a^L} \odot \sigma'(z^L)
$$

We can also see this from the dimensions of the vectors:

The vectors $a^L - y$ and $\sigma'(z^L)$ have the same shape, both in $\mathbb{R}^{N\times 1}$, so the only product that makes sense is the element-wise one.

If instead we use the cross-entropy loss function (with a softmax output layer, so that $a^L = \text{softmax}(z^L)$):

$$
J = -\sum_{i=1}^N y_i\log a_i^L
$$

$$
\delta^L = \frac{\partial J}{\partial z^L} = a^L - y
$$

and the $\sigma'(z^L)$ term no longer appears.
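
As a quick sanity check of this claim, the sketch below (assuming, as this formula usually does, a softmax output layer) compares $a^L - y$ against a finite-difference gradient of the cross-entropy loss; all names here are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.normal(size=5)
y = np.zeros(5)
y[2] = 1.0                                  # one-hot target

analytic = softmax(z) - y                   # the claimed delta^L = a^L - y
numeric = np.zeros_like(z)
eps = 1e-6
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # prints a tiny value: the two gradients agree
```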

2. Pooling Layer

Let the input of the pooling layer be $a^l$ and its output $z^{l+1}$, so that:

$$
z^{l+1} = \text{pool}(a^{l})
$$

$$
\delta^{l} = \frac{\partial J}{\partial z^{l}}
= \frac{\partial J}{\partial z^{l+1}} \frac{\partial z^{l+1}}{\partial a^{l}}\frac{\partial a^{l}}{\partial z^{l}}
= \text{upsample}(\delta^{l+1})\odot \sigma'(z^l)
$$

Here, upsample means that during backpropagation the matrix $\delta^{l+1}$ is restored to the size it had before pooling. There are two cases (a small NumPy sketch of both rules follows at the end of this section):

  1. For max pooling, each element of $\delta^{l+1}$ is placed at the position where the forward pass found the maximum of its window, so the location of the maximum element in every window must be recorded during the forward pass.
  2. For average pooling, each element of $\delta^{l+1}$ is divided by the window size (i.e. averaged) and the result is written into every position of the corresponding window.

For example, if the pooling kernel is $2\times2$, then:

$$
\delta^{l+1} = \left( \begin{array}{cc} 2 & 8 \\ 4 & 6 \end{array} \right)
\xrightarrow{\text{Max upsample}}
\left( \begin{array}{cccc} 2&0&0&0 \\ 0&0&0&8 \\ 0&4&0&0 \\ 0&0&6&0 \end{array} \right)
$$

$$
\delta^{l+1} = \left( \begin{array}{cc} 2 & 8 \\ 4 & 6 \end{array} \right)
\xrightarrow{\text{Average upsample}}
\left( \begin{array}{cccc} 0.5&0.5&2&2 \\ 0.5&0.5&2&2 \\ 1&1&1.5&1.5 \\ 1&1&1.5&1.5 \end{array} \right)
$$
Note that for the average case it is tempting to think the backward pass simply copies each gradient value into every position of the corresponding window. A small example shows why the gradient has to be averaged instead:

Suppose we average four variables $a, b, c, d$ to obtain $z$, that is:

$$
z=\frac{1}{4}(a+b+c+d)
$$

Then the derivative of $z$ with respect to each variable is $1/4$. If the gradient accumulated at $z$ during backpropagation is $\delta$, then

$$
\left\{\begin{aligned}
&\frac{\partial J}{\partial a} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial a} = \frac{1}{4}\delta\\
&\frac{\partial J}{\partial b} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial b} = \frac{1}{4}\delta\\
&\frac{\partial J}{\partial c} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial c} = \frac{1}{4}\delta\\
&\frac{\partial J}{\partial d} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial d} = \frac{1}{4}\delta
\end{aligned}\right.
$$

which makes the averaging rule easy to understand.
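
Below is a minimal NumPy sketch of the two upsample rules for a non-overlapping $2\times2$ pooling window. The helper names (`upsample_max`, `upsample_avg`, `max_pos`) are illustrative; `max_pos[i, j]` is assumed to hold the (row, column) offset of the maximum inside window $(i, j)$, recorded during the forward pass.

```python
import numpy as np

def upsample_max(delta_next, max_pos, k=2):
    """Place each delta at the position that produced the max in its window."""
    H, W = delta_next.shape
    out = np.zeros((H * k, W * k))
    for i in range(H):
        for j in range(W):
            r, c = max_pos[i, j]                 # offset recorded in the forward pass
            out[i * k + r, j * k + c] = delta_next[i, j]
    return out

def upsample_avg(delta_next, k=2):
    """Spread each delta evenly (divided by the window size) over its window."""
    return np.kron(delta_next, np.ones((k, k))) / (k * k)

delta_next = np.array([[2., 8.],
                       [4., 6.]])
max_pos = np.array([[[0, 0], [1, 1]],
                    [[0, 1], [1, 0]]])           # offsets matching the Max example above
print(upsample_max(delta_next, max_pos))         # reproduces the Max upsample matrix
print(upsample_avg(delta_next))                  # reproduces the Average upsample matrix
```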

3. Convolutional Layer

Forward pass of a convolutional layer:

$$
a^{l+1} = \sigma(z^{l+1}) = \sigma(a^l * W^{l+1} + b^{l+1})
$$

$$
\delta^{l} = \frac{\partial J}{\partial z^{l}}
= \frac{\partial J}{\partial z^{l+1}} \frac{\partial z^{l+1}}{\partial a^{l}}\frac{\partial a^{l}}{\partial z^{l}}
= \delta^{l+1} * \text{Rotation180}(W^{l+1})\odot \sigma'(z^l)
$$

Here Rotation180 means the kernel $W$ is rotated by 180 degrees, i.e. flipped vertically and then horizontally. Also note that $\delta^{l+1}$ must be zero-padded appropriately before this convolution; with stride 1 the required padding is $p' = k - p - 1$, where $k$ is the kernel size and $p$ is the padding used in the forward pass.

For a detailed derivation, see https://www.cnblogs.com/pinard/p/6494810.html

Gradients of the parameters $W$ and $b$:

$$
\left\{\begin{aligned}
&\frac{\partial J}{\partial W^l} = \frac{\partial J}{\partial z^l} \frac{\partial z^l}{\partial W^l} = a^{l-1}*\delta^l\\
&\frac{\partial J}{\partial b^l} = \frac{\partial J}{\partial z^l} \frac{\partial z^l}{\partial b^l} = \sum_{u,v}(\delta^l)_{u,v}
\end{aligned}\right.
$$

Note that the gradient with respect to $W$ involves no rotation, and $\sum\limits_{u,v}(\delta^l)_{u,v}$ sums the entries of $\delta^l$ over all spatial positions $u, v$, collapsing each channel of $\delta^l$ into a single scalar (one shared bias per output channel).
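
The following is a minimal single-channel, stride-1 sketch of these formulas using SciPy; the function names are illustrative. As in most deep-learning implementations, the forward "*" is computed as a cross-correlation; under that convention `convolve2d` (which itself flips its kernel by 180°) with `mode='full'` is exactly "pad $\delta^{l+1}$ with $p' = k - 1$ zeros and correlate with $\text{Rotation180}(W^{l+1})$".

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def conv_forward(a_prev, W, b):
    """z^{l+1} = a^l * W^{l+1} + b^{l+1}, computed as a 'valid' cross-correlation."""
    return correlate2d(a_prev, W, mode='valid') + b

def conv_backward_delta(delta_next, W_next, z):
    """delta^l = delta^{l+1} * Rotation180(W^{l+1}) ⊙ σ'(z^l).

    convolve2d rotates the kernel by 180° itself, and mode='full' supplies the
    p' = k - 1 zero padding (the p = 0, stride-1 case of p' = k - p - 1).
    """
    return convolve2d(delta_next, W_next, mode='full') * sigmoid_prime(z)

def conv_param_grads(delta, a_prev):
    """dJ/dW^l = a^{l-1} * delta^l (no rotation); dJ/db^l = sum of delta^l."""
    dW = correlate2d(a_prev, delta, mode='valid')
    db = delta.sum()
    return dW, db
```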

4. References

With thanks to https://www.cnblogs.com/pinard/p/6494810.html
