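Overview

Gradient descent needs the gradient of the loss function with respect to every parameter in order to find a minimum. Backpropagation is the method used to compute that gradient; in essence, it applies the chain rule to differentiate with respect to each parameter.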

The network propagation process


Forward propagation

Initialize the parameters for an $L$-layer neural network.
Implement the forward propagation module.
- Complete the LINEAR part of a layer’s forward propagation step (resulting in $Z^{[l]}$ ).
- We give you the ACTIVATION function (relu/sigmoid).
- Combine the previous two steps into a new [LINEAR->ACTIVATION] forward function.
- Stack the [LINEAR->RELU] forward function L-1 times (for layers 1 through L-1) and add a [LINEAR->SIGMOID] at the end (for the final layer $L$). This gives you a new L_model_forward function.
Compute the loss.
Implement the backward propagation module (denoted in red in the figure below).
- Complete the LINEAR part of a layer’s backward propagation step.
- Compute the gradient of the ACTIVATION function (relu_backward/sigmoid_backward).
- Combine the previous two steps into a new [LINEAR->ACTIVATION] backward function.
- Stack [LINEAR->RELU] backward L-1 times and add [LINEAR->SIGMOID] backward in a new L_model_backward function
Finally update the parameters.
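The final step of the outline, the parameter update, can be sketched as one gradient descent step. The dictionary layout (keys such as "W1" with matching "dW1") and the `learning_rate` default are assumptions made for illustration; the text does not fix them.

```python
import numpy as np

def update_parameters(parameters, grads, learning_rate=0.01):
    """One gradient descent step: theta <- theta - learning_rate * d(theta).

    `parameters` maps names such as "W1", "b1" to arrays and `grads` maps
    "dW1", "db1" to arrays of the same shape (hypothetical naming).
    """
    updated = {}
    for name, value in parameters.items():
        updated[name] = value - learning_rate * grads["d" + name]
    return updated

# Tiny usage example with made-up values.
params = {"W1": np.ones((2, 3)), "b1": np.zeros((2, 1))}
grads = {"dW1": np.ones((2, 3)), "db1": np.ones((2, 1))}
new_params = update_parameters(params, grads, learning_rate=0.1)
print(new_params["W1"][0, 0], new_params["b1"][0, 0])
```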

Input data and the shape of each layer

If the input $X$ has shape $(12288, 209)$ (with $m=209$ examples), then:

Layer	Shape of W	Shape of b	Activation	Shape of Activation
Layer 1	$(n^{[1]},12288)$	$(n^{[1]},1)$	$Z^{[1]} = W^{[1]} X + b^{[1]}$	$(n^{[1]},209)$
Layer 2	$(n^{[2]}, n^{[1]})$	$(n^{[2]},1)$	$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$	$(n^{[2]}, 209)$
$\vdots$	$\vdots$	$\vdots$	$\vdots$	$\vdots$
Layer L-1	$(n^{[L-1]}, n^{[L-2]})$	$(n^{[L-1]}, 1)$	$Z^{[L-1]} = W^{[L-1]} A^{[L-2]} + b^{[L-1]}$	$(n^{[L-1]}, 209)$
Layer L	$(n^{[L]}, n^{[L-1]})$	$(n^{[L]}, 1)$	$Z^{[L]} = W^{[L]} A^{[L-1]} + b^{[L]}$	$(n^{[L]}, 209)$
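The shapes in the table can be checked with a small initialization sketch. The hidden layer sizes in `layer_dims`, the 0.01 scaling and the function name `initialize_parameters` are assumptions chosen for illustration; only the input size 12288 comes from the text.

```python
import numpy as np

def initialize_parameters(layer_dims, seed=0):
    """Create W^[l] with shape (n^[l], n^[l-1]) and b^[l] with shape (n^[l], 1)."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        # Small random weights and zero biases (a common choice, assumed here).
        parameters[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return parameters

layer_dims = [12288, 20, 7, 1]  # hypothetical n^[0..3]; n^[0] = 12288 as in the text
params = initialize_parameters(layer_dims)
for l in range(1, len(layer_dims)):
    print(f"W{l}", params[f"W{l}"].shape, f"b{l}", params[f"b{l}"].shape)
```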

When we compute $WX + b$ in Python, broadcasting is carried out: $b$ is added to every column of $WX$. For example, if:

$W = \begin{bmatrix} j & k & l\\ m & n & o \\ p & q & r \end{bmatrix}\;\;\; X = \begin{bmatrix} a & b & c\\ d & e & f \\ g & h & i \end{bmatrix} \;\;\; b =\begin{bmatrix} s \\ t \\ u \end{bmatrix}\tag{2}$

Then $WX + b$ will be:

$WX + b = \begin{bmatrix} (ja + kd + lg) + s & (jb + ke + lh) + s & (jc + kf + li)+ s\\ (ma + nd + og) + t & (mb + ne + oh) + t & (mc + nf + oi) + t\\ (pa + qd + rg) + u & (pb + qe + rh) + u & (pc + qf + ri)+ u \end{bmatrix}\tag{3}$
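The broadcasting behaviour can be confirmed with NumPy. The numeric values below are arbitrary placeholders; the point is only that a column vector of shape (3, 1) is added to each column of the product.

```python
import numpy as np

W = np.arange(9.0).reshape(3, 3)         # placeholder 3x3 weights
X = np.arange(9.0, 18.0).reshape(3, 3)   # placeholder 3x3 input, one example per column
b = np.array([[1.0], [2.0], [3.0]])      # bias with shape (3, 1)

Z = W @ X + b                            # b is broadcast across the 3 columns

# The same result with the bias copied explicitly into every column.
Z_explicit = W @ X + np.tile(b, (1, X.shape[1]))
print(np.array_equal(Z, Z_explicit))
```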

Forward propagation


Three functions:

LINEAR
LINEAR -> ACTIVATION where ACTIVATION will be either ReLU or Sigmoid.
[LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID (whole model)

The linear forward module (vectorized over all the examples) computes the following equations:

$Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}$

where $A^{[0]} = X$.

$A^{[l]} = g(Z^{[l]}) = g(W^{[l]}A^{[l-1]} +b^{[l]})$, where the activation $g$ can be sigmoid(), relu(), and so on.
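The three functions listed above can be sketched in NumPy as follows. `L_model_forward` is the name used in the outline; `linear_activation_forward`, the parameter dictionary layout and the cache contents are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def relu(Z):
    return np.maximum(0.0, Z)

def linear_activation_forward(A_prev, W, b, activation):
    """LINEAR -> ACTIVATION: Z = W A_prev + b, then A = g(Z)."""
    Z = W @ A_prev + b
    if activation == "relu":
        A = relu(Z)
    elif activation == "sigmoid":
        A = sigmoid(Z)
    else:
        raise ValueError(f"unknown activation: {activation}")
    return A, Z

def L_model_forward(X, parameters):
    """[LINEAR -> RELU] x (L-1) -> LINEAR -> SIGMOID, with A^[0] = X."""
    L = len(parameters) // 2          # each layer contributes one W and one b
    A = X
    caches = []                       # (A_prev, Z) per layer, kept for the backward pass
    for l in range(1, L + 1):
        activation = "sigmoid" if l == L else "relu"
        A_prev = A
        A, Z = linear_activation_forward(A_prev, parameters[f"W{l}"], parameters[f"b{l}"], activation)
        caches.append((A_prev, Z))
    return A, caches
```

Caching `A_prev` and `Z` for each layer is a design choice that avoids recomputing them during backward propagation.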

Loss Function

Compute the loss to check whether your model is actually learning.
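The text does not say which loss is used. Since the final layer is a sigmoid, a minimal sketch might assume binary cross-entropy averaged over the $m$ examples; the function name `compute_cost` and the clipping constant `eps` are also assumptions.

```python
import numpy as np

def compute_cost(AL, Y, eps=1e-12):
    """Binary cross-entropy averaged over the m examples (assumed loss).

    AL: sigmoid outputs with shape (1, m); Y: labels in {0, 1} with shape (1, m).
    """
    m = Y.shape[1]
    AL = np.clip(AL, eps, 1.0 - eps)   # avoid log(0)
    cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return float(cost)

Y = np.array([[1, 0]])
print(compute_cost(np.array([[0.9, 0.1]]), Y))   # confident and correct: small cost
print(compute_cost(np.array([[0.1, 0.9]]), Y))   # confident and wrong: large cost
```

If the cost does not decrease over training iterations, the model is not learning.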

Backward propagation

Remember that back propagation is used to calculate the gradient of the loss function with respect to the parameters.

Backpropagation formulas

Fully connected layers


First compute the error of each unit in the output layer $L$. The error $\delta^l_j$ of unit $j$ in layer $l$ is defined as:
$\delta^l_j = \frac{\partial C}{\partial z^l_j}.$
Applying the chain rule at the output layer gives:

$\delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}= \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j).$
Once the output layer is done, work backwards through the earlier layers, expressing the error of the current layer $l$ in terms of the error of the next layer $l+1$:

$\begin{eqnarray} \delta^l_j & = & \frac{\partial C}{\partial z^l_j} \tag{40}\\ & = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{41}\\ & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k, \tag{42}\end{eqnarray}$
Here $j$ is a neuron in the current layer, which is connected to several neurons in the next layer (indexed by $k$). The last line swaps the two factors on the right and substitutes the definition of the error $\delta^{l+1}_k$. Note also that:
$\begin{eqnarray} z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k. \tag{43}\end{eqnarray}$

Differentiating (43) with respect to $z^l_j$ gives:

$\begin{eqnarray} \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \tag{44}\end{eqnarray}$

Substituting (44) into (42) yields:

$\begin{eqnarray} \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \tag{45}\end{eqnarray}$
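Equation (45) can be checked numerically on a tiny network. The sketch below assumes a sigmoid activation and a quadratic cost $C = \frac{1}{2}\lVert a^L - y \rVert^2$; neither is fixed by the text, and all sizes and values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W2 = rng.standard_normal((2, 3))   # w^{l+1}_{kj}: row k (next layer), column j (current layer)
b2 = rng.standard_normal(2)
z1 = rng.standard_normal(3)        # z^l for the current layer
y = np.array([0.0, 1.0])

def cost_from_z1(z):
    """Quadratic cost as a function of the current layer's z (assumed cost)."""
    a2 = sigmoid(W2 @ sigmoid(z) + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

# Analytic errors: output layer first, then one step back using (45).
z2 = W2 @ sigmoid(z1) + b2
delta2 = (sigmoid(z2) - y) * sigmoid_prime(z2)
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

# Central finite differences of C with respect to each z^l_j.
h = 1e-6
numeric = np.array([(cost_from_z1(z1 + h * e) - cost_from_z1(z1 - h * e)) / (2 * h)
                    for e in np.eye(3)])
print(np.allclose(delta1, numeric, atol=1e-7))
```

The product `W2.T @ delta2` is the sum over $k$ in (45) written as a matrix-vector product.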

W, b
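Following the summary near the end of this article (the gradient of W is the layer's error times the transpose of the previous layer's $a$, and the gradient of b is the error itself), a minimal single-example sketch might look like this. The sigmoid activation, the quadratic cost and all values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
a_prev = rng.standard_normal(3)    # a^{l-1}
W = rng.standard_normal((2, 3))    # W^l
b = rng.standard_normal(2)         # b^l
y = np.array([1.0, 0.0])

def cost(W_, b_):
    """Quadratic cost on the layer output (assumed cost)."""
    a = sigmoid(W_ @ a_prev + b_)
    return 0.5 * np.sum((a - y) ** 2)

# Error of this layer, then the gradients expressed through it.
a = sigmoid(W @ a_prev + b)
delta = (a - y) * a * (1.0 - a)    # sigma'(z) = a (1 - a) for the sigmoid
dW = np.outer(delta, a_prev)       # dC/dW_{jk} = delta_j * a^{l-1}_k
db = delta                         # dC/db_j = delta_j

# Finite-difference check of both gradients.
h = 1e-6
dW_num = np.zeros_like(W)
for j in range(W.shape[0]):
    for k in range(W.shape[1]):
        E = np.zeros_like(W)
        E[j, k] = h
        dW_num[j, k] = (cost(W + E, b) - cost(W - E, b)) / (2 * h)
db_num = np.array([(cost(W, b + h * e) - cost(W, b - h * e)) / (2 * h) for e in np.eye(2)])
print(np.allclose(dW, dW_num, atol=1e-7), np.allclose(db, db_num, atol=1e-7))
```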

CNN

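How backpropagation differs in a CNN

In an ordinary neural network the input and output $a$, $z$ of each layer are just vectors, whereas in a CNN $a$ and $z$ are three-dimensional tensors, that is, each is made up of several sub-matrices.
The pooling layer has no activation function. We can take its activation to be σ(z)=z, so that the derivative of the pooling layer's activation is 1.
During forward propagation the pooling layer compresses its input, so when we propagate the error back to the previous layer we need to upsample.
A convolutional layer obtains its output by summing several matrix convolutions, unlike an ordinary network, which obtains its output directly by matrix multiplication. The recursion for the previous layer's error is therefore different when backpropagating through a convolutional layer.
For a convolutional layer, because W is applied through convolution, the way the gradients of W and b for all of the layer's kernels are derived from the layer's error is also different.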

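Pooling layer

Given the error of the pooling layer, we derive the error of the previous hidden layer. Let $\delta^l_k$ be the $k$-th sub-matrix of the layer-$l$ error.
First expand $\delta^l_k$ back to the size it had before pooling.

With MAX pooling, suppose the maxima recorded during forward propagation were at the top-left, bottom-right, top-right and bottom-left positions of their pooling regions. Each error value is then placed at the recorded position of its region, and all other entries are zero.
With Average pooling, each error value is instead spread evenly over the entries of its pooling region.
The resulting matrix is the error matrix after upsampling, so the formula for deriving the previous layer's error from the next layer's error is: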

$\delta_k^{l-1} = \frac{\partial J(W,b)}{\partial a_k^{l-1}} \frac{\partial a_k^{l-1}}{\partial z_k^{l-1}} = upsample(\delta_k^l) \odot \sigma^{'}(z_k^{l-1})$
In simplified form:

$\delta^{l-1} = upsample(\delta^l) \odot \sigma^{'}(z^{l-1})$
Compare this with the fully connected case:

$\begin{eqnarray} \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \tag{45}\end{eqnarray}$
The differences are:

There is no weight matrix W.
The error is upsampled.

Pooling layer: it has no W or b, so there are no gradients of W or b to compute.
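The upsample step for both pooling types can be sketched as follows. The 2x2 error matrix, the 2x2 pooling size and the function names are hypothetical, and the recorded MAX positions are assumed to be listed in row-major order of the pooling regions. The result would still have to be multiplied elementwise by $\sigma'(z^{l-1})$ to obtain $\delta^{l-1}$.

```python
import numpy as np

def upsample_average(delta, pool=2):
    """Spread each error value evenly over its pool x pool region."""
    return np.kron(delta, np.ones((pool, pool))) / (pool * pool)

def upsample_max(delta, argmax_rows, argmax_cols, pool=2):
    """Put each error value at the position of the recorded maximum; zeros elsewhere."""
    out = np.zeros((delta.shape[0] * pool, delta.shape[1] * pool))
    for i in range(delta.shape[0]):
        for j in range(delta.shape[1]):
            out[i * pool + argmax_rows[i, j], j * pool + argmax_cols[i, j]] = delta[i, j]
    return out

delta = np.array([[2.0, 8.0],
                  [4.0, 6.0]])            # hypothetical pooling-layer error

# Recorded maxima inside each region: top-left, bottom-right, top-right, bottom-left.
argmax_rows = np.array([[0, 1], [0, 1]])
argmax_cols = np.array([[0, 1], [1, 0]])

print(upsample_max(delta, argmax_rows, argmax_cols))
print(upsample_average(delta))
```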

Convolutional layer

Given the error of the convolutional layer, derive the error of the previous hidden layer:

$\delta^{l-1} = \delta^{l}\frac{\partial z^{l}}{\partial z^{l-1}} = \delta^{l}*rot180(W^{l}) \odot \sigma^{'}(z^{l-1})$
Compare this with the fully connected case; the difference is that the weights are rotated by 180°:
$\begin{eqnarray} \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \tag{45}\end{eqnarray}$
Weight gradient

Comparison: the 180° rotation operation.
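Bias
The error $\delta$ is a three-dimensional tensor, whereas b is just a vector, so the gradient of b cannot simply be set equal to the error. The usual approach is to sum the entries of each sub-matrix of the error separately, which gives an error vector; this vector is the gradient of b.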
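The error recursion for the convolutional layer can be checked numerically in a simplified setting. The sketch assumes a single channel, stride 1, no padding, a sigmoid activation in the previous layer and a hypothetical cost $C = \frac{1}{2}\sum (z^l)^2$, and it reads the forward "convolution" as the sliding-window product (cross-correlation) that CNN layers usually compute. The helper names are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def correlate2d_valid(x, k):
    """Slide k over x without padding (the usual CNN 'convolution')."""
    rows = x.shape[0] - k.shape[0] + 1
    cols = x.shape[1] - k.shape[1] + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def backprop_to_input(delta, W):
    """delta * rot180(W): zero-pad delta, then slide the rotated kernel over it."""
    kh, kw = W.shape
    padded = np.pad(delta, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    return correlate2d_valid(padded, np.rot90(W, 2))

rng = np.random.default_rng(0)
z_prev = rng.standard_normal((5, 5))   # z^{l-1}
W = rng.standard_normal((3, 3))        # kernel W^l

def cost(z):
    return 0.5 * np.sum(correlate2d_valid(sigmoid(z), W) ** 2)

delta = correlate2d_valid(sigmoid(z_prev), W)   # dC/dz^l for this cost
s = sigmoid(z_prev)
delta_prev = backprop_to_input(delta, W) * s * (1.0 - s)

# Finite-difference check of dC/dz^{l-1}.
h = 1e-6
numeric = np.zeros_like(z_prev)
for p in range(z_prev.shape[0]):
    for q in range(z_prev.shape[1]):
        E = np.zeros_like(z_prev)
        E[p, q] = h
        numeric[p, q] = (cost(z_prev + E) - cost(z_prev - E)) / (2 * h)
print(np.allclose(delta_prev, numeric, atol=1e-6))
```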

For the reason behind the 180° rotation, see:
卷积神经网络(CNN)反向传播算法 - 刘建平

In a nutshell

DNN:


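The error of layer $L$, $\delta^L_j$, is the gradient of the loss function with respect to $a$, multiplied elementwise (Hadamard product) by the derivative of that layer's activation function.
The error of layer $l$, $\delta^l_j$, is the product of the transposed weight matrix of layer $l+1$ and the error of layer $l+1$, multiplied elementwise by the derivative of layer $l$'s activation function.
From the error, compute the gradients of W and b:
- W: the matrix product of this layer's error and the transpose of the previous layer's $a$.
- b: this layer's error itself.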
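For a single training example, the statements above can be written in matrix form as follows. This is only a restatement in the symbols of equations (40) to (45); the notation $\nabla_a C$ for the gradient of the cost with respect to the output activations is an assumption introduced here.

```latex
\begin{aligned}
\delta^{L} &= \nabla_{a} C \odot \sigma'(z^{L}) \\
\delta^{l} &= \left( (W^{l+1})^{T} \delta^{l+1} \right) \odot \sigma'(z^{l}) \\
\frac{\partial C}{\partial W^{l}} &= \delta^{l} \, (a^{l-1})^{T} \\
\frac{\partial C}{\partial b^{l}} &= \delta^{l}
\end{aligned}
```

The second line is equation (45) with the sum over $k$ written as a product with the transposed weight matrix.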

CNN

Convolution error: the convolution of the next layer's error with the weight matrix rotated by 180°, multiplied elementwise by the derivative of this layer's activation function.

$\delta^{l-1} = \delta^{l}\frac{\partial z^{l}}{\partial z^{l-1}} = \delta^{l}*rot180(W^{l}) \odot \sigma^{'}(z^{l-1})$
Pooling error: the next layer's error, upsampled, then multiplied elementwise by the derivative of this layer's activation function.

$\delta^{l-1} = upsample(\delta^l) \odot \sigma^{'}(z^{l-1})$
Weights W: the previous layer's $a$ convolved with this layer's error.

$\frac{\partial J(W,b)}{\partial W^{l}} = \frac{\partial J(W,b)}{\partial z^{l}}\frac{\partial z^{l}}{\partial W^{l}} =a^{l-1} *\delta^l$
Bias b: sum the entries of each sub-matrix of $\delta^l$ separately to obtain an error vector, which is the gradient of b:

$\frac{\partial J(W,b)}{\partial b^{l}} = \sum\limits_{u,v}(\delta^l)_{u,v}$
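The two gradient formulas for the convolutional layer can be checked with a small sketch. It assumes a single channel, stride 1, no padding and a hypothetical cost $C = \frac{1}{2}\sum (z^l)^2$, and it reads $*$ as the sliding-window product used in CNN layers. The last lines show the per-sub-matrix sum for a three-dimensional error with made-up values.

```python
import numpy as np

def correlate2d_valid(x, k):
    """Slide k over x without padding (the usual CNN 'convolution')."""
    rows = x.shape[0] - k.shape[0] + 1
    cols = x.shape[1] - k.shape[1] + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

rng = np.random.default_rng(0)
a_prev = rng.standard_normal((5, 5))   # a^{l-1}
W = rng.standard_normal((3, 3))        # kernel W^l
b = 0.5                                # scalar bias of this single kernel

def cost(W_, b_):
    z = correlate2d_valid(a_prev, W_) + b_
    return 0.5 * np.sum(z ** 2)

delta = correlate2d_valid(a_prev, W) + b   # dC/dz^l for this cost
dW = correlate2d_valid(a_prev, delta)      # a^{l-1} * delta^l, same shape as W
db = np.sum(delta)                         # sum over all positions (u, v)

# Finite-difference check of both gradients.
h = 1e-6
dW_num = np.zeros_like(W)
for u in range(W.shape[0]):
    for v in range(W.shape[1]):
        E = np.zeros_like(W)
        E[u, v] = h
        dW_num[u, v] = (cost(W + E, b) - cost(W - E, b)) / (2 * h)
db_num = (cost(W, b + h) - cost(W, b - h)) / (2 * h)
print(np.allclose(dW, dW_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))

# With several kernels the error is 3-D; summing each sub-matrix gives one value per kernel.
delta_3d = rng.standard_normal((4, 3, 3))
db_vec = delta_3d.sum(axis=(1, 2))         # shape (4,)
```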
