[Deep Learning in Practice] A Detailed Derivation of CNN Forward and Backward Propagation: Matching Equations to Code

Table of Contents

  1. Introduction
  2. Notation and Conventions
  3. Forward Propagation
    1. Convolution Layer
    2. Nonlinear Activation (ReLU)
    3. Max-Pooling Layer
    4. Fully-Connected Layer
    5. Softmax and Cross-Entropy Loss
  4. Backward Propagation
    1. Softmax Gradient
    2. Fully-Connected Layer Gradients
    3. Convolution Layer Gradients
    4. Pooling Layer Gradient
    5. Parameter Update
  5. Code Implementation with Equation Cross-References
  6. Conclusion

1 Introduction

Deep convolutional neural networks (CNNs) owe their power to an end-to-end differentiable structure: every step from the input to the loss has an explicit mathematical expression, and its gradient can be computed with the chain rule. Below we derive the forward and backward passes of a typical CNN layer by layer, equation by equation, and provide a pure NumPy implementation in which the key lines are annotated with # Eq.(n) to indicate the corresponding equation.

2 Notation and Conventions

Symbol  Meaning
$\mathbf{X}\in\mathbb{R}^{C_\text{in}\times H \times W}$ : input feature map of the current layer
$\mathbf{W}\in\mathbb{R}^{C_\text{out}\times C_\text{in}\times K\times K}$ : convolution kernel weights
$\mathbf{b}\in\mathbb{R}^{C_\text{out}}$ : convolution biases
$\mathcal{L}$ : total loss
$\odot$ : Hadamard (element-wise) product

By default the stride is $s=1$ and no padding is used; any non-default stride or padding is noted explicitly in the corresponding equation.

3 Forward Propagation

3.1 Convolution Layer

$$\boxed{\mathbf{Y}_{k}(i,j)=\sum_{c=1}^{C_\text{in}}\sum_{u=0}^{K-1}\sum_{v=0}^{K-1} \mathbf{W}_{k,c}(u,v)\,\mathbf{X}_{c}(i+u,j+v)+b_k}\tag{1}$$

where $k=1,\dots,C_\text{out}$. The output size is
$$H_\text{out}=H-K+1,\quad W_\text{out}=W-K+1\tag{2}$$
For example, a $32\times 32$ input convolved with a $K=5$ kernel yields a $28\times 28$ output.

3.2 Nonlinear Activation (ReLU)

$$\boxed{\mathbf{Z}=\mathrm{ReLU}(\mathbf{Y})=\max(0,\mathbf{Y})}\tag{3}$$

3.3 Max-Pooling Layer

Taking a $2\times 2$ window with stride 2 as an example:
$$\boxed{\mathbf{P}(c,i,j)=\max_{\substack{0\le u<2\\0\le v<2}} \mathbf{Z}\bigl(c,2i+u,2j+v\bigr)}\tag{4}$$

3.4 Fully-Connected Layer

Given the input vector $\mathbf{p}\in\mathbb{R}^{d_\text{in}}$, weights $\mathbf{W}_\text{fc}\in\mathbb{R}^{d_\text{out}\times d_\text{in}}$, and bias $\mathbf{b}_\text{fc}\in\mathbb{R}^{d_\text{out}}$:

$$\boxed{\mathbf{s}=\mathbf{W}_\text{fc}\,\mathbf{p}+\mathbf{b}_\text{fc}}\tag{5}$$

3.5 Softmax and Cross-Entropy Loss

For a sample's score vector $\mathbf{s}$:

$$\hat{y}_k=\frac{e^{s_k}}{\sum_{t} e^{s_t}},\qquad \mathcal{L}=-\log \hat{y}_{y^*}\tag{6}$$

where $y^*$ is the ground-truth label.


4 Backward Propagation

4.1 Softmax Gradient

$$\frac{\partial \mathcal{L}}{\partial s_k} = \hat{y}_k - \mathbf{1}_{\{k = y^*\}}\tag{7}$$
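
This follows directly from expanding Eq. (6); writing the intermediate step out:

$$\mathcal{L}=-\log\hat{y}_{y^*}=-s_{y^*}+\log\sum_{t}e^{s_t}
\quad\Longrightarrow\quad
\frac{\partial\mathcal{L}}{\partial s_k}
=-\mathbf{1}_{\{k=y^*\}}+\frac{e^{s_k}}{\sum_{t}e^{s_t}}
=\hat{y}_k-\mathbf{1}_{\{k=y^*\}}$$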

4.2 Fully-Connected Layer Gradients

$$\frac{\partial\mathcal{L}}{\partial \mathbf{W}_\text{fc}}=\frac{\partial\mathcal{L}}{\partial\mathbf{s}}\,\mathbf{p}^\top\tag{8}$$

$$\frac{\partial\mathcal{L}}{\partial \mathbf{b}_\text{fc}}=\frac{\partial\mathcal{L}}{\partial\mathbf{s}}\tag{9}$$

$$\frac{\partial\mathcal{L}}{\partial \mathbf{p}}=\mathbf{W}_\text{fc}^\top\,\frac{\partial\mathcal{L}}{\partial\mathbf{s}}\tag{10}$$
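
Element-wise, these are just the chain rule applied to Eq. (5): each score $s_i$ depends linearly on $\mathbf{p}$ through row $i$ of $\mathbf{W}_\text{fc}$, so

$$\frac{\partial\mathcal{L}}{\partial W_{\text{fc},\,ij}}
=\frac{\partial\mathcal{L}}{\partial s_i}\,p_j,
\qquad
\frac{\partial\mathcal{L}}{\partial p_j}
=\sum_i W_{\text{fc},\,ij}\,\frac{\partial\mathcal{L}}{\partial s_i},$$

which stack into the outer product of Eq. (8) and the matrix-vector product of Eq. (10).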

4.3 Convolution Layer Gradients

Let $\delta_{k}(i,j)=\dfrac{\partial\mathcal{L}}{\partial \mathbf{Y}_{k}(i,j)}$ denote the gradient arriving at the convolution output.

For the weights:

$$\boxed{\frac{\partial\mathcal{L}}{\partial \mathbf{W}_{k,c}(u,v)}=\sum_{i,j}\delta_{k}(i,j)\,\mathbf{X}_{c}(i+u,j+v)}\tag{11}$$

For the input:

$$\boxed{\frac{\partial\mathcal{L}}{\partial \mathbf{X}_{c}(i,j)}=\sum_{k}\sum_{u,v}\delta_{k}(i-u,j-v)\,\mathbf{W}_{k,c}(u,v)}\tag{12}$$
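
Both results are the chain rule applied to Eq. (1): each output $\mathbf{Y}_k(i,j)$ depends linearly on the weights and on one $K\times K$ patch of the input, so

$$\frac{\partial\mathcal{L}}{\partial \mathbf{W}_{k,c}(u,v)}
=\sum_{i,j}\frac{\partial\mathcal{L}}{\partial \mathbf{Y}_{k}(i,j)}\,
\frac{\partial \mathbf{Y}_{k}(i,j)}{\partial \mathbf{W}_{k,c}(u,v)}
=\sum_{i,j}\delta_{k}(i,j)\,\mathbf{X}_{c}(i+u,j+v),$$

and analogously for Eq. (12), where $\delta_k$ is treated as zero outside the valid output range. Equation (12) is a "full" correlation of $\delta$ with the kernel, i.e. a convolution of $\delta$ with the spatially flipped kernel.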

4.4 Pooling Layer Gradient

In the backward pass, the upstream gradient is routed to the position that attained the maximum in the forward pass; every other position receives 0:

$$\frac{\partial\mathcal{L}}{\partial \mathbf{Z}(c,h,w)}=
\begin{cases}
\dfrac{\partial\mathcal{L}}{\partial \mathbf{P}(c,i,j)} & (h,w)=\arg\max\ \text{window}_{ij}\\
0 & \text{otherwise}
\end{cases}\tag{13}$$

4.5 Parameter Update (SGD Example)

$$\theta\leftarrow\theta-\eta\,\frac{\partial\mathcal{L}}{\partial\theta}\tag{14}$$
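
As a minimal sketch of Eq. (14) (the helper name sgd_update and the learning-rate default are illustrative assumptions, not part of the listing in Section 5):

def sgd_update(params, grads, lr=1e-2):
    # Vanilla SGD, Eq.(14): theta <- theta - eta * dL/dtheta
    for theta, dtheta in zip(params, grads):
        theta -= lr * dtheta   # in-place, so callers keep the same array objects

A typical call would be sgd_update([W, b, W_fc, b_fc], [dW, db, dW_fc, db_fc]) after one backward pass; momentum or Adam would change only this update rule, not the gradients.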


5 Code Implementation with Equation Cross-References

Note: the following is a minimal, runnable, pure-NumPy implementation. Every key gradient computation is annotated with # Eq.(n) giving the number of the corresponding equation, so it can be read side by side with the derivation above.

import numpy as np

def conv_forward(X, W, b, stride=1):
    # X: (C_in, H, W), W: (C_out, C_in, K, K)
    C_out, C_in, K, _ = W.shape
    H_out = (X.shape[1] - K) // stride + 1  # Eq.(2)
    W_out = (X.shape[2] - K) // stride + 1
    Y = np.zeros((C_out, H_out, W_out))
    for k in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                region = X[:, i*stride:i*stride+K, j*stride:j*stride+K]  # corresponds to X_c(i+u, j+v)
                Y[k, i, j] = np.sum(W[k] * region) + b[k]                # Eq.(1)
    cache = (X, W, b, Y, stride)
    return Y, cache

def relu_forward(Y):
    Z = np.maximum(0, Y)  # Eq.(3)
    cache = Y
    return Z, cache

def maxpool_forward(Z, size=2, stride=2):
    # Non-overlapping windows (size == stride) are assumed, matching the 2x2/stride-2 case of Eq.(4).
    C, H, W = Z.shape
    H_out, W_out = H // size, W // size
    P = np.zeros((C, H_out, W_out))
    mask = np.zeros_like(Z, dtype=bool)
    for c in range(C):
        for i in range(H_out):
            for j in range(W_out):
                region = Z[c, i*stride:(i+1)*stride, j*stride:(j+1)*stride]
                idx = np.unravel_index(np.argmax(region), region.shape)
                mask[c, i*stride+idx[0], j*stride+idx[1]] = True
                P[c, i, j] = region[idx]                 # Eq.(4)
    cache = (mask, stride)
    return P, cache

def fc_forward(p, W_fc, b_fc):
    s = W_fc @ p + b_fc         # Eq.(5)
    cache = (p, W_fc, b_fc)
    return s, cache

def softmax_loss(s, y_true):
    exp_s = np.exp(s - np.max(s))
    probs = exp_s / np.sum(exp_s)
    loss = -np.log(probs[y_true])       # Eq.(6)
    ds = probs.copy()                   # copy so probs itself is left unchanged
    ds[y_true] -= 1                     # Eq.(7)
    return loss, ds

# ----------------- backward functions ----------------- #
def fc_backward(ds, cache):
    p, W_fc, b_fc = cache
    dW = np.outer(ds, p)        # Eq.(8)
    db = ds.copy()              # Eq.(9)
    dp = W_fc.T @ ds            # Eq.(10)
    return dp, dW, db

def maxpool_backward(dP, cache):
    mask, stride = cache
    dZ = np.zeros_like(mask, dtype=float)
    C, H_out, W_out = dP.shape
    for c in range(C):
        for i in range(H_out):
            for j in range(W_out):
                dZ[c, i*stride:(i+1)*stride, j*stride:(j+1)*stride] += \
                    mask[c, i*stride:(i+1)*stride, j*stride:(j+1)*stride] * dP[c, i, j]  # Eq.(13)
    return dZ

def relu_backward(dZ, cache):
    Y = cache
    dY = dZ * (Y > 0)  # ReLU derivative: pass gradients only where Y > 0
    return dY

def conv_backward(dY, cache):
    X, W, b, Y, stride = cache
    C_out, C_in, K, _ = W.shape
    dX = np.zeros_like(X)
    dW = np.zeros_like(W)
    db = np.zeros_like(b)
    _, H_out, W_out = dY.shape

    for k in range(C_out):
        db[k] = np.sum(dY[k])          # bias gradient: sum dY over all output positions
        for i in range(H_out):
            for j in range(W_out):
                region = X[:, i*stride:i*stride+K, j*stride:j*stride+K]
                dW[k] += dY[k, i, j] * region                              # Eq.(11)
                dX[:, i*stride:i*stride+K, j*stride:j*stride+K] += dY[k, i, j] * W[k]  # Eq.(12)

    return dX, dW, db
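
To see how the pieces connect, here is a small end-to-end sketch on random data; the shapes (a 1-channel 8x8 input, two 3x3 filters, 10 classes) and the variable names are illustrative assumptions, not part of the original listing:

np.random.seed(0)
X    = np.random.randn(1, 8, 8)                 # toy input
W    = np.random.randn(2, 1, 3, 3) * 0.1        # two 3x3 filters
b    = np.zeros(2)
W_fc = np.random.randn(10, 2 * 3 * 3) * 0.1     # pooled map is 2 x 3 x 3 = 18 features
b_fc = np.zeros(10)
y_true = 3

# Forward: conv -> ReLU -> max-pool -> flatten -> FC -> softmax / cross-entropy
Y, conv_cache = conv_forward(X, W, b)           # (2, 6, 6), Eq.(1)-(2)
Z, relu_cache = relu_forward(Y)                 # Eq.(3)
P, pool_cache = maxpool_forward(Z)              # (2, 3, 3), Eq.(4)
p = P.reshape(-1)
s, fc_cache   = fc_forward(p, W_fc, b_fc)       # Eq.(5)
loss, ds      = softmax_loss(s, y_true)         # Eq.(6)-(7)

# Backward: reverse order, reshaping the FC input gradient back into the pooled map
dp, dW_fc, db_fc = fc_backward(ds, fc_cache)    # Eq.(8)-(10)
dP = dp.reshape(P.shape)
dZ = maxpool_backward(dP, pool_cache)           # Eq.(13)
dY = relu_backward(dZ, relu_cache)
dX, dW, db = conv_backward(dY, conv_cache)      # Eq.(11)-(12)

print(loss, dW.shape, dW_fc.shape)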

How to read this

  1. First work through Equations (1)–(14) to understand the mathematical derivation.
  2. Search the code for # Eq.(n) to see how each equation turns into an array operation (a finite-difference sanity check is sketched right after this list).
  3. For GPU execution or automatic differentiation, the same annotation style can be carried over when porting to PyTorch / JAX.
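
As one such check, the analytic weight gradient from conv_backward can be compared against a central-difference estimate. This is a minimal sketch reusing the variables from the end-to-end example above; the helper name and the quoted tolerance are illustrative assumptions:

def numerical_grad_W(X, W, b, W_fc, b_fc, y_true, eps=1e-5):
    # Central-difference estimate of dL/dW through the whole network
    num = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        for sign in (+1.0, -1.0):
            W_pert = W.copy()
            W_pert[idx] += sign * eps
            Y, _ = conv_forward(X, W_pert, b)
            Z, _ = relu_forward(Y)
            P, _ = maxpool_forward(Z)
            s, _ = fc_forward(P.reshape(-1), W_fc, b_fc)
            loss, _ = softmax_loss(s, y_true)
            num[idx] += sign * loss
        num[idx] /= 2 * eps
    return num

# Typically agrees with dW to roughly 1e-6 (exact agreement is not expected at ReLU/max kinks):
# print(np.max(np.abs(numerical_grad_W(X, W, b, W_fc, b_fc, y_true) - dW)))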

6 Conclusion

By writing out explicit equations layer by layer and hand-coding the gradients in pure NumPy, we have walked through every detail of a CNN's forward and backward passes. With these derivations in hand, you can pinpoint numerical errors and performance bottlenecks far more precisely when relying on a framework's automatic differentiation, laying a solid foundation for tuning and for model innovation.



Reposted from blog.csdn.net/l35633/article/details/147155457