1 Introduction
Deep convolutional neural networks (CNNs) owe their power to an end-to-end differentiable structure: every step from the input to the loss has an explicit mathematical expression whose gradient can be computed via the chain rule. Below we derive the forward and backward passes of a typical CNN layer by layer, formula by formula, and give a pure NumPy implementation in which `# Eq.(n)` comments mark the correspondence between code and formulas.
2 Notation and Conventions

| Symbol | Meaning |
|---|---|
| $\mathbf{X}\in\mathbb{R}^{C_\text{in}\times H \times W}$ | Input feature map of the current layer |
| $\mathbf{W}\in\mathbb{R}^{C_\text{out}\times C_\text{in}\times K\times K}$ | Convolution kernel weights |
| $\mathbf{b}\in\mathbb{R}^{C_\text{out}}$ | Convolution biases |
| $\mathcal{L}$ | Total loss |
| $\odot$ | Hadamard (element-wise) product |

By default the stride is $s=1$ and no padding is used; any other stride or padding is stated explicitly in the corresponding formula.
3 Forward-Pass Derivation
3.1 Convolution Layer
$$\boxed{\mathbf{Y}_{k}(i,j)=\sum_{c=1}^{C_\text{in}}\sum_{u=0}^{K-1}\sum_{v=0}^{K-1} \mathbf{W}_{k,c}(u,v)\,\mathbf{X}_{c}(i+u,j+v)+b_k}\tag{1}$$
where $k=1,\dots,C_\text{out}$. The output size is:
$$H_\text{out}=H-K+1,\qquad W_\text{out}=W-K+1\tag{2}$$
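Since Eq.(1) with $s=1$ and no padding is exactly a "valid" cross-correlation, it can be sanity-checked channel by channel against SciPy. The following is a minimal sketch, assuming SciPy is available; the shapes are arbitrary illustrative choices:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8, 8))    # C_in = 3, H = W = 8
W = rng.standard_normal((3, 5, 5))    # kernel of one output channel k, K = 5
b_k = 0.1

# Eq.(1): per-channel "valid" cross-correlation, summed over channels, plus the bias
Y_k = sum(correlate2d(X[c], W[c], mode="valid") for c in range(3)) + b_k
print(Y_k.shape)  # (4, 4) -- matches Eq.(2): 8 - 5 + 1 = 4
```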
3.2 Nonlinear Activation (ReLU)
$$\boxed{\mathbf{Z}=\mathrm{ReLU}(\mathbf{Y})=\max(0,\mathbf{Y})}\tag{3}$$
3.3 Pooling Layer (Max-Pooling)
Taking a $2\times2$ window with stride 2 as an example:
$$\boxed{\mathbf{P}(c,i,j)=\max_{\substack{0\le u<2\\0\le v<2}} \mathbf{Z}\bigl(c,2i+u,2j+v\bigr)}\tag{4}$$
3.4 Fully-Connected Layer
Given an input vector $\mathbf{p}\in\mathbb{R}^{d_\text{in}}$, weights $\mathbf{W}_\text{fc}\in\mathbb{R}^{d_\text{out}\times d_\text{in}}$, and bias $\mathbf{b}_\text{fc}\in\mathbb{R}^{d_\text{out}}$:
$$\boxed{\mathbf{s}=\mathbf{W}_\text{fc}\,\mathbf{p}+\mathbf{b}_\text{fc}}\tag{5}$$
3.5 Softmax and Cross-Entropy Loss
For a sample's score vector $\mathbf{s}$:
$$\hat{y}_k=\frac{e^{s_k}}{\sum_{t} e^{s_t}},\qquad \mathcal{L}=-\log \hat{y}_{y^*}\tag{6}$$
where $y^*$ is the ground-truth class label.
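In the implementation below, the softmax is evaluated in the standard numerically stable form, which subtracts the maximum score before exponentiating and leaves Eq.(6) mathematically unchanged:

$$\hat{y}_k=\frac{e^{\,s_k-\max_t s_t}}{\sum_{t} e^{\,s_t-\max_t s_t}}$$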
4 Backward-Pass Derivation
4.1 Softmax Gradient
$$\frac{\partial \mathcal{L}}{\partial s_k} = \hat{y}_k - \mathbf{1}_{\{k = y^*\}}\tag{7}$$
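Equation (7) follows directly by substituting Eq.(6) into the loss and differentiating with respect to $s_k$:

$$\mathcal{L}=-\log\hat{y}_{y^*}=-s_{y^*}+\log\sum_{t}e^{s_t},\qquad
\frac{\partial\mathcal{L}}{\partial s_k}=\frac{e^{s_k}}{\sum_{t}e^{s_t}}-\mathbf{1}_{\{k=y^*\}}=\hat{y}_k-\mathbf{1}_{\{k=y^*\}}$$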
4.2 Fully-Connected Layer Gradients
$$\frac{\partial\mathcal{L}}{\partial \mathbf{W}_\text{fc}}=\frac{\partial\mathcal{L}}{\partial\mathbf{s}}\,\mathbf{p}^\top\tag{8}$$
$$\frac{\partial\mathcal{L}}{\partial \mathbf{b}_\text{fc}}=\frac{\partial\mathcal{L}}{\partial\mathbf{s}}\tag{9}$$
$$\frac{\partial\mathcal{L}}{\partial \mathbf{p}}=\mathbf{W}_\text{fc}^\top\,\frac{\partial\mathcal{L}}{\partial\mathbf{s}}\tag{10}$$
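Equations (8)–(10) can be verified numerically with a central finite difference. The sketch below uses arbitrary small dimensions and a linear surrogate loss $\mathcal{L}=\bigl(\partial\mathcal{L}/\partial\mathbf{s}\bigr)\cdot\mathbf{s}$, so the upstream gradient is a fixed vector `ds`; it checks Eq.(8):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.standard_normal(4)            # d_in = 4
W_fc = rng.standard_normal((3, 4))    # d_out = 3
b_fc = rng.standard_normal(3)
ds = rng.standard_normal(3)           # upstream gradient dL/ds

dW_analytic = np.outer(ds, p)         # Eq.(8)

eps = 1e-6
dW_numeric = np.zeros_like(W_fc)
for i in range(3):
    for j in range(4):
        Wp, Wm = W_fc.copy(), W_fc.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        # surrogate loss L = ds . s, so dL/dW should match Eq.(8)
        dW_numeric[i, j] = (ds @ (Wp @ p + b_fc) - ds @ (Wm @ p + b_fc)) / (2 * eps)

print(np.allclose(dW_analytic, dW_numeric, atol=1e-6))  # True
```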
4.3 Convolution Layer Gradients
Let $\delta_{k}(i,j)=\dfrac{\partial\mathcal{L}}{\partial \mathbf{Y}_{k}(i,j)}$ denote the gradient arriving at the convolution output.
For the weights:
$$\boxed{\frac{\partial\mathcal{L}}{\partial \mathbf{W}_{k,c}(u,v)}=\sum_{i,j}\delta_{k}(i,j)\,\mathbf{X}_{c}(i+u,j+v)}\tag{11}$$
For the input:
$$\boxed{\frac{\partial\mathcal{L}}{\partial \mathbf{X}_{c}(i,j)}=\sum_{k}\sum_{u,v}\delta_{k}(i-u,j-v)\,\mathbf{W}_{k,c}(u,v)}\tag{12}$$
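In Eq.(12) the sum runs only over valid indices $0\le i-u<H_\text{out}$ and $0\le j-v<W_\text{out}$; out-of-range terms of $\delta_k$ are treated as zero. With that convention, Eq.(12) is exactly a "full" convolution of $\delta_k$ with the kernel (equivalently, a full cross-correlation with the kernel rotated by $180^\circ$):

$$\frac{\partial\mathcal{L}}{\partial \mathbf{X}_{c}}=\sum_{k}\delta_{k}\ast\mathbf{W}_{k,c}\quad\text{(full mode, i.e. }\delta_k\text{ zero-padded by }K-1\text{ on each side)}$$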
4.4 Pooling Layer Gradient
In the backward pass, the upstream gradient is routed to the position that attained the maximum in the forward pass; all other positions receive zero:
$$\frac{\partial\mathcal{L}}{\partial \mathbf{Z}(c,h,w)}=\begin{cases}\dfrac{\partial\mathcal{L}}{\partial \mathbf{P}(c,i,j)} & (h,w)=\arg\max\text{ of window }(i,j)\\[4pt]0 & \text{otherwise}\end{cases}\tag{13}$$
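For the non-overlapping $2\times2$, stride-2 case, Eq.(13) can also be written as a single array operation: upsample the pooled gradient and multiply by the argmax mask recorded in the forward pass. A minimal sketch, using a hypothetical helper `maxpool_backward_vectorized` and the boolean `mask` produced by the `maxpool_forward` implementation in Section 5:

```python
import numpy as np

def maxpool_backward_vectorized(dP, mask, size=2):
    # Eq.(13) for non-overlapping windows: repeat dP over each pooling window,
    # then keep only the positions where the forward maximum was taken
    up = dP.repeat(size, axis=1).repeat(size, axis=2)
    return mask * up
```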
4.5 Parameter Update (SGD Example)
$$\theta\leftarrow\theta-\eta\,\frac{\partial\mathcal{L}}{\partial\theta}\tag{14}$$
5 Code Implementation and Formula Correspondence
Note: the following is a minimal runnable pure-NumPy implementation. Every key gradient computation is annotated with `# Eq.(n)` to mark the formula it implements, so the code can be read side by side with the derivation above.
import numpy as np
def conv_forward(X, W, b, stride=1):
    # X: (C_in, H, W), W: (C_out, C_in, K, K), b: (C_out,)
    C_out, C_in, K, _ = W.shape
    H_out = (X.shape[1] - K) // stride + 1  # Eq.(2)
    W_out = (X.shape[2] - K) // stride + 1
    Y = np.zeros((C_out, H_out, W_out))
    for k in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                region = X[:, i*stride:i*stride+K, j*stride:j*stride+K]  # X_c(i+u, j+v)
                Y[k, i, j] = np.sum(W[k] * region) + b[k]  # Eq.(1)
    cache = (X, W, b, Y, stride)
    return Y, cache
def relu_forward(Y):
Z = np.maximum(0, Y) # Eq.(3)
cache = Y
return Z, cache
def maxpool_forward(Z, size=2, stride=2):
    # Z: (C, H, W); the mask trick below assumes non-overlapping windows (size == stride)
    C, H, W = Z.shape
    H_out, W_out = (H - size) // stride + 1, (W - size) // stride + 1
    P = np.zeros((C, H_out, W_out))
    mask = np.zeros_like(Z, dtype=bool)  # remembers where each maximum came from
    for c in range(C):
        for i in range(H_out):
            for j in range(W_out):
                region = Z[c, i*stride:i*stride+size, j*stride:j*stride+size]
                idx = np.unravel_index(np.argmax(region), region.shape)
                mask[c, i*stride+idx[0], j*stride+idx[1]] = True
                P[c, i, j] = region[idx]  # Eq.(4)
    cache = (mask, size, stride)
    return P, cache
def fc_forward(p, W_fc, b_fc):
s = W_fc @ p + b_fc # Eq.(5)
cache = (p, W_fc, b_fc)
return s, cache
def softmax_loss(s, y_true):
exp_s = np.exp(s - np.max(s))
probs = exp_s / np.sum(exp_s)
loss = -np.log(probs[y_true]) # Eq.(6)
ds = probs
ds[y_true] -= 1 # Eq.(7)
return loss, ds
# ----------------- backward functions ----------------- #
def fc_backward(ds, cache):
p, W_fc, b_fc = cache
dW = np.outer(ds, p) # Eq.(8)
db = ds.copy() # Eq.(9)
dp = W_fc.T @ ds # Eq.(10)
return dp, dW, db
def maxpool_backward(dP, cache):
    mask, size, stride = cache
    dZ = np.zeros_like(mask, dtype=float)
    C, H_out, W_out = dP.shape
    for c in range(C):
        for i in range(H_out):
            for j in range(W_out):
                # route the gradient to the argmax position of each window, zero elsewhere
                dZ[c, i*stride:i*stride+size, j*stride:j*stride+size] += \
                    mask[c, i*stride:i*stride+size, j*stride:j*stride+size] * dP[c, i, j]  # Eq.(13)
    return dZ
def relu_backward(dZ, cache):
Y = cache
    dY = dZ * (Y > 0)  # derivative of ReLU: 1 where Y > 0, 0 otherwise
return dY
def conv_backward(dY, cache):
    X, W, b, Y, stride = cache
    C_out, C_in, K, _ = W.shape
    dX = np.zeros_like(X)
    dW = np.zeros_like(W)
    db = np.zeros_like(b)
    _, H_out, W_out = dY.shape
    for k in range(C_out):
        db[k] = np.sum(dY[k])  # bias gradient: sum of dY over the k-th output map
        for i in range(H_out):
            for j in range(W_out):
                region = X[:, i*stride:i*stride+K, j*stride:j*stride+K]
                dW[k] += dY[k, i, j] * region  # Eq.(11)
                dX[:, i*stride:i*stride+K, j*stride:j*stride+K] += dY[k, i, j] * W[k]  # Eq.(12)
    return dX, dW, db
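To see how the pieces fit together, the following sketch (arbitrary toy sizes and learning rate, reusing the functions defined above) runs one forward/backward pass on random data and applies the SGD update of Eq.(14):

```python
rng = np.random.default_rng(0)

# Tiny network: conv (4 filters, 3x3) -> ReLU -> 2x2 max-pool -> FC -> softmax loss
X = rng.standard_normal((1, 8, 8))
W = 0.1 * rng.standard_normal((4, 1, 3, 3))
b = np.zeros(4)
W_fc = 0.1 * rng.standard_normal((10, 4 * 3 * 3))
b_fc = np.zeros(10)
y_true = 3
lr = 1e-2

# Forward pass: Eq.(1)-(6)
Y, conv_cache = conv_forward(X, W, b)
Z, relu_cache = relu_forward(Y)
P, pool_cache = maxpool_forward(Z)
p = P.reshape(-1)
s, fc_cache = fc_forward(p, W_fc, b_fc)
loss, ds = softmax_loss(s, y_true)

# Backward pass: Eq.(7)-(13)
dp, dW_fc, db_fc = fc_backward(ds, fc_cache)
dP = dp.reshape(P.shape)
dZ = maxpool_backward(dP, pool_cache)
dY = relu_backward(dZ, relu_cache)
dX, dW, db = conv_backward(dY, conv_cache)

# SGD update: Eq.(14), in place on each parameter array
for theta, dtheta in [(W, dW), (b, db), (W_fc, dW_fc), (b_fc, db_fc)]:
    theta -= lr * dtheta
print("loss =", loss)
```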
How to read this
- First work through formulas (1)–(14) to understand the mathematical derivations.
- Search the code for `# Eq.(n)` to see how each formula is realized as array operations.
- If you need GPU support or automatic differentiation, the same annotation style can be carried over when porting the code to PyTorch / JAX; see the cross-check sketch below.
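For example, a minimal cross-check of `conv_forward` / `conv_backward` against PyTorch's autograd might look like the following sketch (assuming PyTorch is installed; `F.conv2d` performs the same "valid" cross-correlation as Eq.(1) when called without padding):

```python
import numpy as np
import torch
import torch.nn.functional as F

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8, 8))
W = rng.standard_normal((4, 3, 3, 3))
b = np.zeros(4)

Y, cache = conv_forward(X, W, b)
dY = rng.standard_normal(Y.shape)       # arbitrary upstream gradient
dX, dW, db = conv_backward(dY, cache)

Xt = torch.tensor(X, requires_grad=True)
Wt = torch.tensor(W, requires_grad=True)
bt = torch.tensor(b, requires_grad=True)
Yt = F.conv2d(Xt.unsqueeze(0), Wt, bt).squeeze(0)   # same computation as Eq.(1)
Yt.backward(torch.tensor(dY))
print(np.allclose(dX, Xt.grad.numpy()),
      np.allclose(dW, Wt.grad.numpy()),
      np.allclose(db, bt.grad.numpy()))  # expected: True True True
```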
6 Conclusion
By writing out explicit formulas layer by layer and hand-coding the gradients in pure NumPy, we have laid out every detail of a CNN's forward and backward passes. With these derivations in hand, it becomes much easier to pinpoint numerical errors and performance bottlenecks when relying on a framework's automatic differentiation, providing a solid foundation for hyperparameter tuning and model innovation.