Building a Two-Layer Neural Network with NumPy

While learning PyTorch, I came across the topic in the title, so here is a brief note on my own understanding of it.
Video: here
Timestamp: 66:32
Assume for now that the network has no bias term $b$; the network is then structured as follows:
$$
\begin{aligned}
hidden &= ReLU(x * w_1) \\
\hat y &= hidden * w_2 \\
loss &= (\hat y - y)^2
\end{aligned}
$$
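As a quick illustration, these three lines map directly onto NumPy. The dimensions below are made up just for this sketch; the full training loop from the video appears further down.

import numpy as np

# Toy shapes, chosen only for illustration
N, D_in, H, D_out = 4, 3, 5, 2
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

hidden = np.maximum(x.dot(w1), 0)    # hidden = ReLU(x * w1)
y_hat = hidden.dot(w2)               # y_hat = hidden * w2
loss = np.square(y_hat - y).sum()    # loss = (y_hat - y)^2, summed over the batch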
Rearranging slightly, the loss function can be written as:
$$
\begin{aligned}
loss &= (ReLU(x * w_1) * w_2 - y)^2 \\
&= (ReLU(XW_1)W_2 - Y)^T (ReLU(XW_1)W_2 - Y)
\end{aligned}
$$
Now take the partial derivative with respect to $w_1$, writing $Z = ReLU(XW_1)W_2 - Y$:
$$
\begin{aligned}
\frac{\partial loss}{\partial w_1}
&= \frac{\partial Z^T Z}{\partial w_1} \\
&= \frac{\partial Z^T Z}{\partial Z} * \frac{\partial Z}{\partial W_1} \\
&= 2Z * \frac{\partial (ReLU(XW_1)W_2 - Y)}{\partial W_1} \\
&= 2(\hat Y - Y) * \frac{\partial (XW_1W_2 - Y)}{\partial W_1}, \quad \text{when } XW_1 \geqslant 0 \\
&= 2X^T(\hat Y - Y)W_2^T, \quad \text{when } XW_1 \geqslant 0
\end{aligned}
$$
Similarly,
$$
\frac{\partial loss}{\partial w_2} = 2 W_1^T X^T (\hat Y - Y)
$$
Here the condition $XW_1 \geqslant 0$ is simply dropped: when computing the gradients, the piecewise nature of $ReLU$ is not treated as a case split. In other words, the gradients are derived as if the activation were not there at all, which simplifies the problem.
Note, however, that $ReLU$ is still applied in the forward pass when computing $(\hat Y - Y)$; in practice this speeds up convergence.
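For comparison, when the $ReLU$ is not dropped, the exact gradients carry an elementwise mask $\mathbb{1}[XW_1 \geqslant 0]$; this is standard backprop (and is what the commented-out lines in the code below compute), written out here only for reference:
$$
\begin{aligned}
\frac{\partial loss}{\partial w_2} &= ReLU(XW_1)^T \, 2(\hat Y - Y) \\
\frac{\partial loss}{\partial w_1} &= X^T \Big( 2(\hat Y - Y) W_2^T \odot \mathbb{1}[XW_1 \geqslant 0] \Big)
\end{aligned}
$$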

import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    # loss = (y_pred - y) ** 2, so d(loss)/d(y_pred) = 2 * (y_pred - y)
    grad_y_pred = 2.0 * (y_pred - y)

    # Original (exact) backprop from the video, kept for reference:
    # grad_w2 = h_relu.T.dot(grad_y_pred)
    # grad_h_relu = grad_y_pred.dot(w2.T)
    # grad_h = grad_h_relu.copy()
    # grad_h[h < 0] = 0
    # grad_w1 = x.T.dot(grad_h)

    # Simplified gradients from the derivation above (ReLU ignored in the backward pass)
    grad_w1 = x.T.dot(grad_y_pred).dot(w2.T)
    grad_w2 = w1.T.dot(x.T).dot(grad_y_pred)
        
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

[Note] This code comes from the video linked above; the only part modified in this post is the gradient computation. It is meant simply as a note for understanding and reference.
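As a small sanity check (my own addition, not from the video), the simplified gradients can be compared against the exact backprop gradients on one random batch; the two differ only through the entries that the ReLU clamps to zero.

import numpy as np

np.random.seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

# Forward pass
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)
grad_y_pred = 2.0 * (y_pred - y)

# Exact gradients (with the ReLU mask)
grad_w2_exact = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1_exact = x.T.dot(grad_h)

# Simplified gradients (ReLU ignored in the backward pass)
grad_w1_simple = x.T.dot(grad_y_pred).dot(w2.T)
grad_w2_simple = w1.T.dot(x.T).dot(grad_y_pred)

# How far apart the two versions are
print(np.linalg.norm(grad_w1_exact - grad_w1_simple))
print(np.linalg.norm(grad_w2_exact - grad_w2_simple))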

Reposted from blog.csdn.net/qq_26460841/article/details/112908044