Algorithm Reinforcement: Backpropagation

Backpropagation

Backpropagation is used to avoid recomputing the same paths through the network: each intermediate derivative is computed once and reused by the layers before it.
For convenience, here is the forward-propagation process from earlier:
Z_1 = W_1 X + b_1
H_1 = \mathrm{ReLU}(Z_1)
Z_2 = W_2 H_1 + b_2
H_2 = \mathrm{ReLU}(Z_2)
Z_3 = W_3 H_2 + b_3
\hat{y} = \mathrm{sigmoid}(Z_3)
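
To make the shapes concrete, here is a minimal numpy sketch of this forward pass. The helper names (forward_pass, relu, sigmoid) are illustrative, not from the original post; X is (n_x, m) with one training example per column:

import numpy as np

def relu(z):
    # Element-wise max(0, z)
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(X, W1, b1, W2, b2, W3, b3):
    Z1 = np.dot(W1, X) + b1   # (n1, m)
    H1 = relu(Z1)
    Z2 = np.dot(W2, H1) + b2  # (n2, m)
    H2 = relu(Z2)
    Z3 = np.dot(W3, H2) + b3  # (1, m)
    y_hat = sigmoid(Z3)
    return H1, H2, y_hat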
We also copy the loss function here:
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right) = -\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(\hat{y}^{(i)}\right) + \left(1-y^{(i)}\right) \log \left(1-\hat{y}^{(i)}\right)\right] + \frac{\lambda}{2 m}\|w\|_{F}^{2}
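
This cost can be computed directly from the forward-pass output. A minimal sketch, assuming Y and y_hat are (1, m) arrays and weights is a list of the weight matrices (the function name and arguments are illustrative):

import numpy as np

def compute_cost(y_hat, Y, weights, lambd):
    m = Y.shape[1]
    # Average cross-entropy over the m examples
    cross_entropy = -np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))
    # L2 regularization: lambda/(2m) times the summed squared Frobenius norms
    l2 = (lambd / (2.0 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy + l2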
Note: to keep things intuitive, the transposes required when differentiating with respect to matrices are not written out.

The first step is to differentiate with respect to z_3:
\frac{\partial J}{\partial z_{3}} = \frac{\partial J}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_{3}} = \hat{y} - y = \delta_{3}
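
The compact form \hat{y} - y comes from the cross-entropy derivative cancelling against the sigmoid derivative (the regularization term does not depend on z_3, so it drops out here). For a single example:

\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}, \qquad \frac{\partial \hat{y}}{\partial z_{3}} = \hat{y}(1-\hat{y})

\frac{\partial L}{\partial z_{3}} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y}) = -y(1-\hat{y}) + (1-y)\hat{y} = \hat{y} - y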
Next we differentiate with respect to the parameters w and b:
\begin{aligned} \frac{\partial J}{\partial w_{3}} &= \frac{\partial J}{\partial z_{3}} \frac{\partial z_{3}}{\partial w_{3}} = \delta_{3} H_{2} + \frac{1}{m} \lambda w_{3} \\ \frac{\partial J}{\partial b_{3}} &= \frac{\partial J}{\partial z_{3}} \frac{\partial z_{3}}{\partial b_{3}} = \delta_{3} \end{aligned}

That completes the gradients for w_3 and b_3. The remaining layers are essentially the same: apply the chain rule and work backwards one layer at a time.
\begin{aligned} \frac{\partial J}{\partial z_{2}} &= \frac{\partial J}{\partial z_{3}} \frac{\partial z_{3}}{\partial H_{2}} \frac{\partial H_{2}}{\partial z_{2}} = \delta_{3} w_{3} \, \mathrm{relu}^{\prime}\left(z_{2}\right) = \delta_{2} \\ \frac{\partial J}{\partial w_{2}} &= \frac{\partial J}{\partial z_{2}} \frac{\partial z_{2}}{\partial w_{2}} = \delta_{2} H_{1} + \frac{1}{m} \lambda w_{2} \\ \frac{\partial J}{\partial b_{2}} &= \frac{\partial J}{\partial z_{2}} \frac{\partial z_{2}}{\partial b_{2}} = \delta_{2} \end{aligned}
The same applies to W_1 and b_1:
\begin{aligned} \frac{\partial J}{\partial z_{1}} &= \frac{\partial J}{\partial z_{2}} \frac{\partial z_{2}}{\partial H_{1}} \frac{\partial H_{1}}{\partial z_{1}} = \delta_{2} w_{2} \, \mathrm{relu}^{\prime}\left(z_{1}\right) = \delta_{1} \\ \frac{\partial J}{\partial w_{1}} &= \frac{\partial J}{\partial z_{1}} \frac{\partial z_{1}}{\partial w_{1}} = \delta_{1} x + \frac{1}{m} \lambda w_{1} \\ \frac{\partial J}{\partial b_{1}} &= \frac{\partial J}{\partial z_{1}} \frac{\partial z_{1}}{\partial b_{1}} = \delta_{1} \end{aligned}
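
With H_0 \equiv X and \delta_3 = \hat{y} - y as the starting point, the per-layer steps above collapse into a single recurrence, which is exactly what the loop in the code further below implements:

\delta_{l} = \delta_{l+1} w_{l+1} \, \mathrm{relu}^{\prime}(z_{l}), \qquad \frac{\partial J}{\partial w_{l}} = \delta_{l} H_{l-1} + \frac{1}{m} \lambda w_{l}, \qquad \frac{\partial J}{\partial b_{l}} = \delta_{l}

The implementation restores the transposes and the \frac{1}{m} averaging over the m examples that these formulas omit for readability.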

One thing to note first: when a scalar is differentiated with respect to a matrix, the result has the same shape as that matrix.
\frac{\partial J}{\partial w_{3}} = \frac{\partial J}{\partial z_{3}} \frac{\partial z_{3}}{\partial w_{3}} = \delta_{3} H_{2}
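
Restoring the omitted transpose, \delta_3 H_2^{\top} indeed has the shape of w_3: with one output unit, \delta_3 is (1, m) and H_2 is (n_2, m), so the product is (1, n_2). A quick numpy check (the sizes here are arbitrary, chosen only for illustration):

import numpy as np

m, n2 = 5, 4                    # batch size and hidden width, arbitrary
delta3 = np.random.randn(1, m)  # dJ/dZ3: one value per example
H2 = np.random.randn(n2, m)     # second hidden layer activations
dW3 = np.dot(delta3, H2.T)      # same shape as W3, i.e. (1, n2)
print(dW3.shape)                # (1, 4)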

import numpy as np

def backward_propagation(X, Y, Weight, bias, H, activation, lambd=0.0):
    # Weight: {'W1': ..., 'WL': ...}, H: {'H1': ..., 'HL': ...} cached by the forward pass,
    # activation: hidden-layer activation names, e.g. ['relu', 'relu']
    m = X.shape[1]
    gradients = {}
    L = len(Weight)
    # Output layer: dJ/dZ_L = y_hat - Y for sigmoid + cross-entropy
    gradients['dZ' + str(L)] = H['H' + str(L)] - Y
    gradients['dW' + str(L)] = 1./m * np.dot(gradients['dZ' + str(L)], H['H' + str(L-1)].T) \
                               + 1./m * lambd * Weight['W' + str(L)]
    gradients['db' + str(L)] = 1./m * np.sum(gradients['dZ' + str(L)], axis=1, keepdims=True)
    # Hidden layers, from layer L-1 back to layer 1
    for l in range(L-1, 0, -1):
        gradients['dH' + str(l)] = np.dot(Weight['W' + str(l+1)].T, gradients['dZ' + str(l+1)])
        if activation[l-1] == 'relu':
            # relu'(z) is 1 where the activation is positive, 0 elsewhere
            gradients['dZ' + str(l)] = np.multiply(gradients['dH' + str(l)], np.int64(H['H' + str(l)] > 0))
        elif activation[l-1] == 'tanh':
            # tanh'(z) = 1 - tanh(z)^2 = 1 - H_l^2
            gradients['dZ' + str(l)] = np.multiply(gradients['dH' + str(l)], 1 - np.power(H['H' + str(l)], 2))
        # Layer 1 is fed by the input X rather than a cached activation
        H_prev = X if l == 1 else H['H' + str(l-1)]
        gradients['dW' + str(l)] = 1./m * np.dot(gradients['dZ' + str(l)], H_prev.T) \
                                   + 1./m * lambd * Weight['W' + str(l)]
        gradients['db' + str(l)] = 1./m * np.sum(gradients['dZ' + str(l)], axis=1, keepdims=True)

    return gradients
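
A quick smoke test of backward_propagation with arbitrary layer sizes and randomly initialized parameters (the dictionary keys follow what the function expects; the forward pass is inlined just for this test):

np.random.seed(0)
m = 8                                    # batch size, arbitrary
X = np.random.randn(3, m)                # 3 input features
Y = (np.random.rand(1, m) > 0.5) * 1.0   # binary labels

Weight = {'W1': np.random.randn(4, 3) * 0.1,
          'W2': np.random.randn(4, 4) * 0.1,
          'W3': np.random.randn(1, 4) * 0.1}
bias = {'b1': np.zeros((4, 1)), 'b2': np.zeros((4, 1)), 'b3': np.zeros((1, 1))}

# Forward pass, caching each activation under the keys backward_propagation reads
H = {'H1': np.maximum(0, np.dot(Weight['W1'], X) + bias['b1'])}
H['H2'] = np.maximum(0, np.dot(Weight['W2'], H['H1']) + bias['b2'])
H['H3'] = 1.0 / (1.0 + np.exp(-(np.dot(Weight['W3'], H['H2']) + bias['b3'])))

grads = backward_propagation(X, Y, Weight, bias, H, ['relu', 'relu'], lambd=0.1)
print(grads['dW1'].shape, grads['dW2'].shape, grads['dW3'].shape)  # (4, 3) (4, 4) (1, 4)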


def update_parameters(Weight, bias, gradients, lr=0.1):
    # Gradient-descent update; lr is the learning rate.
    # Too small and the network converges very slowly; too large and it
    # oscillates around the minimum and fails to converge.
    for i in range(1, len(Weight)+1):
        Weight['W'+str(i)] -= lr * gradients['dW'+str(i)]
        bias['b'+str(i)] -= lr * gradients['db'+str(i)]
    return Weight, bias
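
Continuing the smoke test above, a full gradient-descent loop simply alternates the forward pass, backward_propagation, and update_parameters (the epoch count and learning rate here are arbitrary):

for epoch in range(1000):
    # Forward pass: recompute and cache every layer's activation
    H['H1'] = np.maximum(0, np.dot(Weight['W1'], X) + bias['b1'])
    H['H2'] = np.maximum(0, np.dot(Weight['W2'], H['H1']) + bias['b2'])
    H['H3'] = 1.0 / (1.0 + np.exp(-(np.dot(Weight['W3'], H['H2']) + bias['b3'])))
    # Backward pass and gradient-descent update
    grads = backward_propagation(X, Y, Weight, bias, H, ['relu', 'relu'], lambd=0.1)
    Weight, bias = update_parameters(Weight, bias, grads, lr=0.1)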
