Machine Learning (3): Backpropagation Derivation & Neural Networks


Conventions

The neural network used in this section is shown in the figure:

[Figure: a three-layer feed-forward neural network]

where:

  • There are three layers in total, numbered 0, 1, 2 (layer 0 is the input layer)
  • The input is a vector $x$ with two components $x_1$ and $x_2$
  • $a^{(i)}$ denotes the output vector of layer $i$, and $a_j^{(i)}$ denotes its $j$-th component. Note that $x = a^{(0)}$
  • $W^{(i)}$ denotes the weights (edges) connecting layer $i-1$ to layer $i$; $w_{jk}^{(i)}$ is the weight on the edge connecting node $j$ of layer $i$ to node $k$ of layer $i-1$. Note that the index order $jk$ runs right to left: the first index is the node in layer $i$, the second the node in layer $i-1$
  • $z^{(i)} = W^{(i)} \times a^{(i-1)}$
  • $a^{(i)} = \mathrm{sigmoid}(z^{(i)})$, i.e. the activation function used throughout this network is the sigmoid (a small numeric sketch of one forward pass follows this list)
  • $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$
  • The loss function is the cross-entropy $J(W, x, y) = -\sum_{i=1}^2 \left[ y_i \ln(a_i^{(2)}) + (1-y_i) \ln(1 - a_i^{(2)}) \right]$ (the leading minus sign makes it a quantity to minimize)
  • For simplicity, bias terms are not considered
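To make the conventions concrete, here is a minimal sketch of one forward pass through a 2-2-2 network of this shape. The weight values below are made-up numbers chosen only for illustration; they are not taken from the post.

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

# Hypothetical weights: W1 connects layer 0 (2 nodes) to layer 1 (2 nodes),
# W2 connects layer 1 to layer 2. w_jk is the edge from node k of the
# previous layer to node j of the current layer.
W1 = np.array([[0.1, 0.4],
               [0.8, 0.6]])
W2 = np.array([[0.3, 0.9],
               [0.5, 0.2]])

x = np.array([[1.0], [0.5]])   # a^(0) = x, a column vector with two components

z1 = W1 @ x                    # z^(1) = W^(1) a^(0)
a1 = sigmoid(z1)               # a^(1) = sigmoid(z^(1))
z2 = W2 @ a1                   # z^(2) = W^(2) a^(1)
a2 = sigmoid(z2)               # a^(2) = sigmoid(z^(2)), the network output
print(a2)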

Derivation

Our goal is to minimize $J(W, x, y)$, where $W$ collects all the weights, while $x$ and $y$ are given data and therefore constants. So we need the gradient $\frac{\partial J}{\partial W}$; concretely, we need $\frac{\partial J}{\partial W^{(2)}}$ and $\frac{\partial J}{\partial W^{(1)}}$:

$$\frac{\partial J}{\partial W^{(2)}} = \frac{\partial J}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial z^{(2)}} \times \frac{\partial z^{(2)}}{\partial W^{(2)}}$$

$$\frac{\partial J}{\partial a^{(2)}} = -\left( y \ln(a^{(2)}) + (1-y)\ln(1-a^{(2)}) \right)' = -\frac{y}{a^{(2)}} + \frac{1-y}{1 - a^{(2)}}$$

$$\frac{\partial a^{(2)}}{\partial z^{(2)}} = a^{(2)} \cdot (1-a^{(2)})$$
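This follows from differentiating the sigmoid directly:

$$\frac{d}{dz}\,\mathrm{sigmoid}(z) = \frac{d}{dz}\frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = \mathrm{sigmoid}(z)\,(1-\mathrm{sigmoid}(z))$$

Since $a^{(2)} = \mathrm{sigmoid}(z^{(2)})$, the derivative is $a^{(2)}(1-a^{(2)})$.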

Combining the two expressions above:

$$\frac{\partial J}{\partial z^{(2)}} = \frac{\partial J}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial z^{(2)}} = (1-y)\cdot a^{(2)} - y \cdot (1-a^{(2)})$$

Since $y$ can only take the value 0 or 1, when $y = 0$ the expression above equals

$$a^{(2)} = a^{(2)} - 0 = a^{(2)} - y$$

and when $y = 1$ it equals

$$-(1 - a^{(2)}) = a^{(2)} - 1 = a^{(2)} - y$$

So in both cases

$$\frac{\partial J}{\partial z^{(2)}} = a^{(2)} - y$$

We write this quantity as the error term $\delta^{(2)}$ of the output layer:

$$\delta^{(2)} = \frac{\partial J}{\partial z^{(2)}} = a^{(2)} - y$$
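As a quick sanity check (not in the original post), the scalar case of this result can be verified numerically with a centered finite difference:

import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def loss(z, y):
    # scalar cross-entropy loss for a single output, as defined above
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y          # dJ/dz = a - y
print(numeric, analytic)           # the two values should agree closely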

Since

$$z^{(2)} = W^{(2)} \times a^{(1)}$$

each component $z_j^{(2)}$ depends on $w_{jk}^{(2)}$ only through the term $w_{jk}^{(2)} a_k^{(1)}$, so

$$\frac{\partial z_j^{(2)}}{\partial w_{jk}^{(2)}} = a_k^{(1)}$$

Putting it all together (written as an outer product so the shape matches $W^{(2)}$):

$$\frac{\partial J}{\partial W^{(2)}} = \frac{\partial J}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial z^{(2)}} \times \frac{\partial z^{(2)}}{\partial W^{(2)}} = \delta^{(2)} \, (a^{(1)})^T$$
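A quick dimension check (assuming column vectors, as in the conventions above): both $\delta^{(2)}$ and $a^{(1)}$ are $2 \times 1$ vectors, so the outer product has the same shape as $W^{(2)}$:

$$\delta^{(2)} \in \mathbb{R}^{2 \times 1}, \quad a^{(1)} \in \mathbb{R}^{2 \times 1} \;\Rightarrow\; \frac{\partial J}{\partial W^{(2)}} = \delta^{(2)} (a^{(1)})^T \in \mathbb{R}^{2 \times 2}$$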


Now we turn to $\frac{\partial J}{\partial W^{(1)}}$:

$$\frac{\partial J}{\partial W^{(1)}} = \frac{\partial J}{\partial a^{(1)}} \times \frac{\partial a^{(1)}}{\partial z^{(1)}} \times \frac{\partial z^{(1)}}{\partial W^{(1)}}$$

Since

$$z^{(2)} = W^{(2)} \times a^{(1)}$$

we have

$$\frac{\partial z^{(2)}}{\partial a^{(1)}} = W^{(2)}$$

and therefore (transposing so the dimensions match)

$$\frac{\partial J}{\partial a^{(1)}} = \frac{\partial J}{\partial z^{(2)}} \times \frac{\partial z^{(2)}}{\partial a^{(1)}} = W^{(2)T} \times \delta^{(2)}$$

Since

$$\frac{\partial a^{(1)}}{\partial z^{(1)}} = a^{(1)} \cdot (1 - a^{(1)})$$

it follows that (with $\odot$ denoting the element-wise product)

$$\frac{\partial J}{\partial z^{(1)}} = \frac{\partial J}{\partial a^{(1)}} \times \frac{\partial a^{(1)}}{\partial z^{(1)}} = W^{(2)T} \delta^{(2)} \odot a^{(1)} \odot (1 - a^{(1)})$$

$$\delta^{(1)} = \frac{\partial J}{\partial z^{(1)}} = W^{(2)T} \delta^{(2)} \odot a^{(1)} \odot (1 - a^{(1)})$$

Also, since

$$z^{(1)} = W^{(1)} \times a^{(0)}$$

we have, exactly as before,

$$\frac{\partial z_j^{(1)}}{\partial w_{jk}^{(1)}} = a_k^{(0)}$$

Putting it all together,

$$\frac{\partial J}{\partial W^{(1)}} = \frac{\partial J}{\partial a^{(1)}} \times \frac{\partial a^{(1)}}{\partial z^{(1)}} \times \frac{\partial z^{(1)}}{\partial W^{(1)}} = \delta^{(1)} \, (a^{(0)})^T$$
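Generalizing the two layers above to an arbitrary layer $l$ gives the recursion that the implementation below uses:

$$\delta^{(L)} = a^{(L)} - y, \qquad \delta^{(l)} = W^{(l+1)T} \delta^{(l+1)} \odot a^{(l)} \odot (1 - a^{(l)}), \qquad \frac{\partial J}{\partial W^{(l)}} = \delta^{(l)} \, (a^{(l-1)})^T$$

The implementation also includes bias terms, which the derivation above skipped; their gradient is simply $\frac{\partial J}{\partial b^{(l)}} = \delta^{(l)}$.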

Neural Network Implementation

import numpy as np
import matplotlib.pyplot as plt

def init_model(X, Y, layers):
    # Build a dict holding the weight matrix W_l and bias vector B_l of every layer.
    # `layers` lists the hidden-layer sizes; input/output sizes come from X and Y.
    model = {}
    nodes = [X.shape[0]]            # layer 0: number of input features
    nodes.extend(layers)            # hidden layers
    nodes.append(Y.shape[0])        # output layer
    model['depth'] = len(layers) + 1
    for n in range(model['depth']):
        # W_{n+1} connects layer n (nodes[n] units) to layer n+1 (nodes[n+1] units)
        model['W' + str(n + 1)] = np.random.rand(nodes[n + 1], nodes[n])
        model['B' + str(n + 1)] = np.random.rand(nodes[n + 1], 1)
    return model

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def forward_propagation(X, Y, model):
    # Forward pass: A0 = X, Z_l = W_l A_{l-1} + B_l, A_l = sigmoid(Z_l).
    # Intermediate Z_l and A_l are cached in the model for use by backpropagation.
    result = X.copy()
    model['A0'] = result.copy()
    for i in range(model['depth']):
        W = model['W' + str(i + 1)]
        B = model['B' + str(i + 1)]
        result = np.dot(W, result) + B
        model['Z' + str(i + 1)] = result.copy()
        result = sigmoid(result)
        model['A' + str(i + 1)] = result.copy()
    return result

def cross_entropy(Y, result):
    # Element-wise cross-entropy loss -[y ln(a) + (1 - y) ln(1 - a)]
    return -Y * np.log(result) - (1 - Y) * np.log(1 - result)

def cost(Y, result):
    # Average loss over the m training examples (columns of Y)
    return np.sum(cross_entropy(Y, result)) / Y.shape[1]

def backward_propagation(Y, model, learning_rate):
    # Backward pass, following the derivation above:
    #   delta_L = A_L - Y
    #   delta_l = W_{l+1}^T delta_{l+1} * A_l * (1 - A_l)
    #   dW_l    = delta_l A_{l-1}^T / m,   dB_l = sum(delta_l) / m
    L = model['depth']
    m = Y.shape[1]
    delta = model['A' + str(L)] - Y
    model['dW' + str(L)] = np.dot(delta, model['A' + str(L - 1)].T) / m
    model['dB' + str(L)] = np.sum(delta, axis=1, keepdims=True) / m
    for l in range(L - 1, 0, -1):
        delta = np.dot(model['W' + str(l + 1)].T, delta) * model['A' + str(l)] * (1 - model['A' + str(l)])
        model['dW' + str(l)] = np.dot(delta, model['A' + str(l - 1)].T) / m
        model['dB' + str(l)] = np.sum(delta, axis=1, keepdims=True) / m

    # Gradient descent step on every layer
    for l in range(model['depth'], 0, -1):
        model['W' + str(l)] -= learning_rate * model['dW' + str(l)]
        model['B' + str(l)] -= learning_rate * model['dB' + str(l)]

def train(X, Y, model, learning_rate=0.05, times=50, show_cost_history=True):
    # Repeatedly run a forward pass, then a backward pass (which also updates the
    # weights), and record the cost so the training curve can be plotted.
    cost_history = []
    for t in range(times):
        result = forward_propagation(X, Y, model)
        backward_propagation(Y, model, learning_rate)
        cost_history.append(cost(Y, result))
    if show_cost_history:
        plt.plot(cost_history)
        plt.show()

if __name__ == '__main__':
    # Toy data set: 1000 random points in the unit square; label 1 iff both
    # coordinates are below 0.5, label 0 otherwise.
    m = 1000
    X = np.random.rand(2, m)
    Y = np.zeros((1, m))
    for i in range(m):
        if (X[0, i] < 0.5) and (X[1, i] < 0.5):
            Y[0, i] = 1

    # One hidden layer with 2 units
    model = init_model(X, Y, [2])
    train(X, Y, model)
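As an optional sanity check (not part of the original post), the analytic gradients from backward_propagation can be compared against numerical finite differences. Calling backward_propagation with learning_rate=0 fills in the dW values without changing the weights. The sketch below can be appended to the script above:

def gradient_check(X, Y, model, eps=1e-5):
    # Compare the analytic gradient dW_l[0, 0] of each layer with a centered
    # finite-difference estimate of the cost.
    forward_propagation(X, Y, model)
    backward_propagation(Y, model, learning_rate=0)   # store dW/dB, leave W/B unchanged
    for l in range(1, model['depth'] + 1):
        W = model['W' + str(l)]
        analytic = model['dW' + str(l)][0, 0]
        W[0, 0] += eps
        cost_plus = cost(Y, forward_propagation(X, Y, model))
        W[0, 0] -= 2 * eps
        cost_minus = cost(Y, forward_propagation(X, Y, model))
        W[0, 0] += eps                                 # restore the original weight
        numeric = (cost_plus - cost_minus) / (2 * eps)
        print('layer %d: analytic %.6f, numeric %.6f' % (l, analytic, numeric))

# Example usage after building a model:
#   model = init_model(X, Y, [2])
#   gradient_check(X, Y, model)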




Reposted from blog.csdn.net/vinceee__/article/details/88047176