1 Introduction
Deep convolutional neural networks (CNNs) owe their power to an end-to-end differentiable structure: every step from the input to the loss has an explicit mathematical expression whose gradient can be computed via the chain rule. Below we derive the forward and backward passes of a typical CNN layer by layer, formula by formula, and give a pure NumPy implementation in which `# Eq.(n)` comments mark the correspondence between code and formulas.
2 Notation and Conventions

| Symbol | Meaning |
|---|---|
| $\mathbf{X}\in\mathbb{R}^{C_\text{in}\times H \times W}$ | Input feature map of the current layer |
| $\mathbf{W}\in\mathbb{R}^{C_\text{out}\times C_\text{in}\times K\times K}$ | Convolution kernel weights |
| $\mathbf{b}\in\mathbb{R}^{C_\text{out}}$ | Convolution biases |
| $\mathcal{L}$ | Total loss |
| $\odot$ | Hadamard (element-wise) product |

By default the stride is $s=1$ and no padding is used; any other stride or padding is stated explicitly in the corresponding formula.
3 Forward-Pass Derivation
3.1 Convolution Layer
$$\boxed{\mathbf{Y}_{k}(i,j)=\sum_{c=1}^{C_\text{in}}\sum_{u=0}^{K-1}\sum_{v=0}^{K-1} \mathbf{W}_{k,c}(u,v)\,\mathbf{X}_{c}(i+u,j+v)+b_k}\tag{1}$$
where $k=1,\dots,C_\text{out}$. The output size is:
$$H_\text{out}=H-K+1,\qquad W_\text{out}=W-K+1\tag{2}$$
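Since Eq.(1) with $s=1$ and no padding is exactly a "valid" cross-correlation, it can be sanity-checked channel by channel against SciPy. The following is a minimal sketch, assuming SciPy is available; the shapes are arbitrary illustrative choices:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8, 8))    # C_in = 3, H = W = 8
W = rng.standard_normal((3, 5, 5))    # kernel of one output channel k, K = 5
b_k = 0.1

# Eq.(1): per-channel "valid" cross-correlation, summed over channels, plus the bias
Y_k = sum(correlate2d(X[c], W[c], mode="valid") for c in range(3)) + b_k
print(Y_k.shape)  # (4, 4) -- matches Eq.(2): 8 - 5 + 1 = 4
```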
3.2 Nonlinear Activation (ReLU)
$$\boxed{\mathbf{Z}=\mathrm{ReLU}(\mathbf{Y})=\max(0,\mathbf{Y})}\tag{3}$$
3.3 Pooling Layer (Max-Pooling)
Taking a $2\times2$ window with stride 2 as an example:
$$\boxed{\mathbf{P}(c,i,j)=\max_{\substack{0\le u<2\\0\le v<2}} \mathbf{Z}\bigl(c,2i+u,2j+v\bigr)}\tag{4}$$
3.4 Fully-Connected Layer
Given an input vector $\mathbf{p}\in\mathbb{R}^{d_\text{in}}$, weights $\mathbf{W}_\text{fc}\in\mathbb{R}^{d_\text{out}\times d_\text{in}}$, and bias $\mathbf{b}_\text{fc}\in\mathbb{R}^{d_\text{out}}$:
$$\boxed{\mathbf{s}=\mathbf{W}_\text{fc}\,\mathbf{p}+\mathbf{b}_\text{fc}}\tag{5}$$
3.5 Softmax and Cross-Entropy Loss
For a sample's score vector $\mathbf{s}$:
$$\hat{y}_k=\frac{e^{s_k}}{\sum_{t} e^{s_t}},\qquad \mathcal{L}=-\log \hat{y}_{y^*}\tag{6}$$
where $y^*$ is the ground-truth class label.
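In the implementation below, the softmax is evaluated in the standard numerically stable form, which subtracts the maximum score before exponentiating and leaves Eq.(6) mathematically unchanged:

$$\hat{y}_k=\frac{e^{\,s_k-\max_t s_t}}{\sum_{t} e^{\,s_t-\max_t s_t}}$$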
4 Backward-Pass Derivation
4.1 Softmax Gradient
$$\frac{\partial \mathcal{L}}{\partial s_k} = \hat{y}_k - \mathbf{1}_{\{k = y^*\}}\tag{7}$$
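Equation (7) follows directly by substituting Eq.(6) into the loss and differentiating with respect to $s_k$:

$$\mathcal{L}=-\log\hat{y}_{y^*}=-s_{y^*}+\log\sum_{t}e^{s_t},\qquad
\frac{\partial\mathcal{L}}{\partial s_k}=\frac{e^{s_k}}{\sum_{t}e^{s_t}}-\mathbf{1}_{\{k=y^*\}}=\hat{y}_k-\mathbf{1}_{\{k=y^*\}}$$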
4.2 Fully-Connected Layer Gradients
$$\frac{\partial\mathcal{L}}{\partial \mathbf{W}_\text{fc}}=\frac{\partial\mathcal{L}}{\partial\mathbf{s}}\,\mathbf{p}^\top\tag{8}$$
$$\frac{\partial\mathcal{L}}{\partial \mathbf{b}_\text{fc}}=\frac{\partial\mathcal{L}}{\partial\mathbf{s}}\tag{9}$$
$$\frac{\partial\mathcal{L}}{\partial \mathbf{p}}=\mathbf{W}_\text{fc}^\top\,\frac{\partial\mathcal{L}}{\partial\mathbf{s}}\tag{10}$$
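Equations (8)–(10) can be verified numerically with a central finite difference. The sketch below uses arbitrary small dimensions and a linear surrogate loss $\mathcal{L}=\bigl(\partial\mathcal{L}/\partial\mathbf{s}\bigr)\cdot\mathbf{s}$, so the upstream gradient is a fixed vector `ds`; it checks Eq.(8):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.standard_normal(4)            # d_in = 4
W_fc = rng.standard_normal((3, 4))    # d_out = 3
b_fc = rng.standard_normal(3)
ds = rng.standard_normal(3)           # upstream gradient dL/ds

dW_analytic = np.outer(ds, p)         # Eq.(8)

eps = 1e-6
dW_numeric = np.zeros_like(W_fc)
for i in range(3):
    for j in range(4):
        Wp, Wm = W_fc.copy(), W_fc.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        # surrogate loss L = ds . s, so dL/dW should match Eq.(8)
        dW_numeric[i, j] = (ds @ (Wp @ p + b_fc) - ds @ (Wm @ p + b_fc)) / (2 * eps)

print(np.allclose(dW_analytic, dW_numeric, atol=1e-6))  # True
```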
4.3 Convolution Layer Gradients
Let $\delta_{k}(i,j)=\dfrac{\partial\mathcal{L}}{\partial \mathbf{Y}_{k}(i,j)}$ denote the gradient arriving at the convolution output.
For the weights:
$$\boxed{\frac{\partial\mathcal{L}}{\partial \mathbf{W}_{k,c}(u,v)}=\sum_{i,j}\delta_{k}(i,j)\,\mathbf{X}_{c}(i+u,j+v)}\tag{11}$$
For the input:
$$\boxed{\frac{\partial\mathcal{L}}{\partial \mathbf{X}_{c}(i,j)}=\sum_{k}\sum_{u,v}\delta_{k}(i-u,j-v)\,\mathbf{W}_{k,c}(u,v)}\tag{12}$$
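In Eq.(12) the sum runs only over valid indices $0\le i-u<H_\text{out}$ and $0\le j-v<W_\text{out}$; out-of-range terms of $\delta_k$ are treated as zero. With that convention, Eq.(12) is exactly a "full" convolution of $\delta_k$ with the kernel (equivalently, a full cross-correlation with the kernel rotated by $180^\circ$):

$$\frac{\partial\mathcal{L}}{\partial \mathbf{X}_{c}}=\sum_{k}\delta_{k}\ast\mathbf{W}_{k,c}\quad\text{(full mode, i.e. }\delta_k\text{ zero-padded by }K-1\text{ on each side)}$$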
4.4 Pooling Layer Gradient
In the backward pass, the upstream gradient is routed to the position that attained the maximum in the forward pass; all other positions receive zero:
$$\frac{\partial\mathcal{L}}{\partial \mathbf{Z}(c,h,w)}=\begin{cases}\dfrac{\partial\mathcal{L}}{\partial \mathbf{P}(c,i,j)} & (h,w)=\arg\max\text{ of window }(i,j)\\[4pt]0 & \text{otherwise}\end{cases}\tag{13}$$
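For the non-overlapping $2\times2$, stride-2 case, Eq.(13) can also be written as a single array operation: upsample the pooled gradient and multiply by the argmax mask recorded in the forward pass. A minimal sketch, using a hypothetical helper `maxpool_backward_vectorized` and the boolean `mask` produced by the `maxpool_forward` implementation in Section 5:

```python
import numpy as np

def maxpool_backward_vectorized(dP, mask, size=2):
    # Eq.(13) for non-overlapping windows: repeat dP over each pooling window,
    # then keep only the positions where the forward maximum was taken
    up = dP.repeat(size, axis=1).repeat(size, axis=2)
    return mask * up
```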
4.5 Parameter Update (SGD Example)
$$\theta\leftarrow\theta-\eta\,\frac{\partial\mathcal{L}}{\partial\theta}\tag{14}$$
5 Code Implementation and Formula Correspondence
Note: the following is a minimal runnable pure-NumPy implementation. Every key gradient computation is annotated with `# Eq.(n)` to mark the formula it implements, so the code can be read side by side with the derivation above.
import numpy as np
def conv_forward(X, W, b, stride=1):
    # X: (C_in, H, W), W: (C_out, C_in, K, K), b: (C_out,)
    C_out, C_in, K, _ = W.shape
    H_out = (X.shape[1] - K) // stride + 1  # Eq.(2)
    W_out = (X.shape[2] - K) // stride + 1
    Y = np.zeros((C_out, H_out, W_out))
    for k in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                region = X[:, i*stride:i*stride+K, j*stride:j*stride+K]  # X_c(i+u, j+v)
                Y[k, i, j] = np.sum(W[k] * region) + b[k]  # Eq.(1)
    cache = (X, W, b, Y, stride)
    return Y, cache
def relu_forward(Y):
Z = np.maximum(0, Y) # Eq.(3)
cache = Y
return Z, cache
def maxpool_forward(Z, size=2, stride=2):
    # Z: (C, H, W); the mask trick below assumes non-overlapping windows (size == stride)
    C, H, W = Z.shape
    H_out, W_out = (H - size) // stride + 1, (W - size) // stride + 1
    P = np.zeros((C, H_out, W_out))
    mask = np.zeros_like(Z, dtype=bool)  # remembers where each maximum came from
    for c in range(C):
        for i in range(H_out):
            for j in range(W_out):
                region = Z[c, i*stride:i*stride+size, j*stride:j*stride+size]
                idx = np.unravel_index(np.argmax(region), region.shape)
                mask[c, i*stride+idx[0], j*stride+idx[1]] = True
                P[c, i, j] = region[idx]  # Eq.(4)
    cache = (mask, size, stride)
    return P, cache
def fc_forward(p, W_fc, b_fc):
s = W_fc @ p + b_fc # Eq.(5)
cache = (p, W_fc, b_fc)
return s, cache
def softmax_loss(s, y_true):
exp_s = np.exp(s - np.max(s))
probs = exp_s / np.sum(exp_s)
loss = -np.log(probs[y_true]) # Eq.(6)
ds = probs
ds[y_true] -= 1 # Eq.(7)
return loss, ds
# ----------------- backward functions ----------------- #
def fc_backward(ds, cache):
p, W_fc, b_fc = cache
dW = np.outer(ds, p) # Eq.(8)
db = ds.copy() # Eq.(9)
dp = W_fc.T @ ds # Eq.(10)
return dp, dW, db
def maxpool_backward(dP, cache):
    mask, size, stride = cache
    dZ = np.zeros_like(mask, dtype=float)
    C, H_out, W_out = dP.shape
    for c in range(C):
        for i in range(H_out):
            for j in range(W_out):
                # route the gradient to the argmax position of each window, zero elsewhere
                dZ[c, i*stride:i*stride+size, j*stride:j*stride+size] += \
                    mask[c, i*stride:i*stride+size, j*stride:j*stride+size] * dP[c, i, j]  # Eq.(13)
    return dZ
def relu_backward(dZ, cache):
Y = cache
    dY = dZ * (Y > 0)  # derivative of ReLU: 1 where Y > 0, 0 otherwise
return dY
def conv_backward(dY, cache):
    X, W, b, Y, stride = cache
    C_out, C_in, K, _ = W.shape
    dX = np.zeros_like(X)
    dW = np.zeros_like(W)
    db = np.zeros_like(b)
    _, H_out, W_out = dY.shape
    for k in range(C_out):
        db[k] = np.sum(dY[k])  # bias gradient: sum of dY over the k-th output map
        for i in range(H_out):
            for j in range(W_out):
                region = X[:, i*stride:i*stride+K, j*stride:j*stride+K]
                dW[k] += dY[k, i, j] * region  # Eq.(11)
                dX[:, i*stride:i*stride+K, j*stride:j*stride+K] += dY[k, i, j] * W[k]  # Eq.(12)
    return dX, dW, db
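To see how the pieces fit together, the following sketch (arbitrary toy sizes and learning rate, reusing the functions defined above) runs one forward/backward pass on random data and applies the SGD update of Eq.(14):

```python
rng = np.random.default_rng(0)

# Tiny network: conv (4 filters, 3x3) -> ReLU -> 2x2 max-pool -> FC -> softmax loss
X = rng.standard_normal((1, 8, 8))
W = 0.1 * rng.standard_normal((4, 1, 3, 3))
b = np.zeros(4)
W_fc = 0.1 * rng.standard_normal((10, 4 * 3 * 3))
b_fc = np.zeros(10)
y_true = 3
lr = 1e-2

# Forward pass: Eq.(1)-(6)
Y, conv_cache = conv_forward(X, W, b)
Z, relu_cache = relu_forward(Y)
P, pool_cache = maxpool_forward(Z)
p = P.reshape(-1)
s, fc_cache = fc_forward(p, W_fc, b_fc)
loss, ds = softmax_loss(s, y_true)

# Backward pass: Eq.(7)-(13)
dp, dW_fc, db_fc = fc_backward(ds, fc_cache)
dP = dp.reshape(P.shape)
dZ = maxpool_backward(dP, pool_cache)
dY = relu_backward(dZ, relu_cache)
dX, dW, db = conv_backward(dY, conv_cache)

# SGD update: Eq.(14), in place on each parameter array
for theta, dtheta in [(W, dW), (b, db), (W_fc, dW_fc), (b_fc, db_fc)]:
    theta -= lr * dtheta
print("loss =", loss)
```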
How to read this
- First work through formulas (1)–(14) to understand the mathematical derivations.
- Search the code for `# Eq.(n)` to see how each formula is realized as array operations.
- If you need GPU support or automatic differentiation, the same annotation style can be carried over when porting the code to PyTorch / JAX; see the cross-check sketch below.
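For example, a minimal cross-check of `conv_forward` / `conv_backward` against PyTorch's autograd might look like the following sketch (assuming PyTorch is installed; `F.conv2d` performs the same "valid" cross-correlation as Eq.(1) when called without padding):

```python
import numpy as np
import torch
import torch.nn.functional as F

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8, 8))
W = rng.standard_normal((4, 3, 3, 3))
b = np.zeros(4)

Y, cache = conv_forward(X, W, b)
dY = rng.standard_normal(Y.shape)       # arbitrary upstream gradient
dX, dW, db = conv_backward(dY, cache)

Xt = torch.tensor(X, requires_grad=True)
Wt = torch.tensor(W, requires_grad=True)
bt = torch.tensor(b, requires_grad=True)
Yt = F.conv2d(Xt.unsqueeze(0), Wt, bt).squeeze(0)   # same computation as Eq.(1)
Yt.backward(torch.tensor(dY))
print(np.allclose(dX, Xt.grad.numpy()),
      np.allclose(dW, Wt.grad.numpy()),
      np.allclose(db, bt.grad.numpy()))  # expected: True True True
```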
6 Conclusion
By writing out explicit formulas layer by layer and hand-coding the gradients in pure NumPy, we have laid out every detail of a CNN's forward and backward passes. With these derivations in hand, it becomes much easier to pinpoint numerical errors and performance bottlenecks when relying on a framework's automatic differentiation, providing a solid foundation for hyperparameter tuning and model innovation.