Optimization Algorithms for Deep Neural Networks

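New optimization algorithms for neural networks are proposed all the time, but few of them generalize well. In his course, Andrew Ng introduces several that both optimize effectively and generalize well: mini-batch gradient descent, gradient descent with momentum, RMSprop, and Adam. The sections below go through the principle and code of each algorithm in turn, then compare how much each one improves a neural network's performance.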

The program requires the following libraries:

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math
import sklearn
import sklearn.datasets


from opt_utils import *
from testCases_opt import *


plt.rcParams['figure.figsize'] = (7.0, 4.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

opt_utils and testCases_opt are helper modules provided by Andrew Ng and are available for download.

I. Gradient descent

Backpropagation yields the gradients dW and db, and gradient descent then uses them to update the parameters W and b at each iteration.

def update_parameters_with_gd(parameters, grads, learning_rate):

    L = len(parameters) // 2

    for l in range(L):

        parameters['W' + str(l+1)] =  parameters['W' + str(l+1)] - learning_rate * \
                                     grads['dW' + str(l+1)]

        parameters['b' + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * \
                                     grads['db' + str(l+1)]
         
    return parameters
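
As a quick sanity check of the rule W = W - learning_rate * dW, the sketch below applies one step to a made-up single-layer network. The parameter and gradient values are illustrative assumptions, and the function is repeated from above only so that the snippet runs on its own.

```python
import numpy as np

def update_parameters_with_gd(parameters, grads, learning_rate):
    # Same rule as above: one gradient descent step per layer.
    L = len(parameters) // 2
    for l in range(L):
        parameters['W' + str(l+1)] = parameters['W' + str(l+1)] - learning_rate * grads['dW' + str(l+1)]
        parameters['b' + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * grads['db' + str(l+1)]
    return parameters

# Hypothetical one-layer network; the numbers are made up for illustration.
toy_parameters = {'W1': np.array([[1.0, -2.0]]), 'b1': np.array([[0.5]])}
toy_grads = {'dW1': np.array([[0.5, 0.5]]), 'db1': np.array([[-1.0]])}

toy_parameters = update_parameters_with_gd(toy_parameters, toy_grads, learning_rate=0.1)
print(toy_parameters['W1'])  # each weight moved by -0.1 * 0.5
print(toy_parameters['b1'])  # the bias moved by -0.1 * (-1.0)
```

Each parameter moves against its gradient by an amount proportional to the learning rate, which is all that plain gradient descent does.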

Usually all m examples in the training set are processed together in each iteration; this variant is called batch gradient descent.

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation over the whole training set
    a, caches = forward_propagation(X, parameters)
    # Compute the cost
    cost = compute_cost(a, Y)
    # Backward propagation
    grads = backward_propagation(a, caches, parameters)
    # Update the parameters
    parameters = update_parameters(parameters, grads)

At the other extreme, each iteration processes a single example; this is called stochastic gradient descent.

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)

for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation on the j-th example only
        a, caches = forward_propagation(X[:,j], parameters)
        # Compute the cost
        cost = compute_cost(a, Y[:,j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update the parameters
        parameters = update_parameters(parameters, grads)

II. Mini-batch gradient descent

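Batch gradient descent suits small training sets (typically fewer than 2,000 examples), because every iteration processes all m examples, which is time-consuming. When the data set is very large (5 million examples, say), mini-batch gradient descent is the better choice: the training set is split into a number of mini-batches that are processed one at a time.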
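
To make the trade-off concrete, the sketch below counts how many parameter updates one pass over the training set (one epoch) produces under each scheme. The helper name updates_per_epoch is hypothetical, and the training set size is the 5 million example case mentioned above.

```python
import math

def updates_per_epoch(m, batch_size):
    # One update per batch; a final partial batch still counts as one update.
    return math.ceil(m / batch_size)

m = 5_000_000  # the large training set used as an example above

print(updates_per_epoch(m, m))    # batch gradient descent
print(updates_per_epoch(m, 64))   # mini-batch gradient descent with batch size 64
print(updates_per_epoch(m, 1))    # stochastic gradient descent
```

Batch gradient descent makes a single update per epoch, stochastic gradient descent makes one per example, and mini-batch gradient descent sits in between.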

Step 1: Shuffle. After shuffling, X[i] and Y[i] must still correspond to each other.


Step 2: Partition. The mini-batch size usually depends on the memory available on the CPU or GPU; common sizes are 64, 128, 256 and 512.


If the number of examples is not divisible by mini_batch_size, the last mini-batch has size:

m - mini_batch_size * math.floor(m / mini_batch_size)
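
For instance, the test output further down shows mini-batches of 64, 64 and 20 columns, which corresponds to m = 148 with mini_batch_size = 64. The small check below, written only as an illustration, evaluates the expression for that case; the value of m is inferred from the output rather than stated in the source.

```python
import math

m = 148               # inferred from the 64 + 64 + 20 columns in the test output
mini_batch_size = 64

num_complete = math.floor(m / mini_batch_size)
last_size = m - mini_batch_size * num_complete

print(num_complete, last_size)
```

The result is the same as the remainder m % mini_batch_size, which is the condition random_mini_batches tests before building the final batch.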

# Helper for the momentum method (section III): one zero-initialized velocity per parameter.
def initialize_velocity(parameters):
    L = len(parameters) // 2
    v = {}

    for l in range(L):
        v['dW' + str(l+1)] = np.zeros(parameters['W' + str(l+1)].shape)
        v['db' + str(l+1)] = np.zeros(parameters['b' + str(l+1)].shape)

    return v

def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):

    np.random.seed(seed)
    m = X.shape[1]
    mini_batches = []

    permutation = list(np.random.permutation(m))
    shuffled_X = X[:,permutation]
    shuffled_Y = Y[:,permutation].reshape((1,m))

    num_complete_minibatches = math.floor(m / mini_batch_size)
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:,k * mini_batch_size : (k + 1)* mini_batch_size]
        mini_batch_Y = shuffled_Y[:,k * mini_batch_size : (k + 1)* mini_batch_size]

        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size : m]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size : m]

        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches

X_assess, Y_assess, mini_batch_size = random_mini_batches_test_case()
mini_batches = random_mini_batches(X_assess, Y_assess, mini_batch_size)

print ("shape of the 1st mini_batch_X: " + str(mini_batches[0][0].shape))
print ("shape of the 2nd mini_batch_X: " + str(mini_batches[1][0].shape))
print ("shape of the 3rd mini_batch_X: " + str(mini_batches[2][0].shape))
print ("shape of the 1st mini_batch_Y: " + str(mini_batches[0][1].shape))
print ("shape of the 2nd mini_batch_Y: " + str(mini_batches[1][1].shape)) 
print ("shape of the 3rd mini_batch_Y: " + str(mini_batches[2][1].shape))
print ("mini batch sanity check: " + str(mini_batches[0][0][0][0:3]))

Output:

shape of the 1st mini_batch_X: (12288, 64)
shape of the 2nd mini_batch_X: (12288, 64)
shape of the 3rd mini_batch_X: (12288, 20)
shape of the 1st mini_batch_Y: (1, 64)
shape of the 2nd mini_batch_Y: (1, 64)
shape of the 3rd mini_batch_Y: (1, 20)
mini batch sanity check: [ 0.90085595 -0.7612069   0.2344157 ]
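
Step 1 requires that samples and labels stay paired after shuffling. The toy sketch below shows why indexing X and Y with the same permutation achieves this. The data and the labeling rule (label = 10 times the first feature) are assumptions chosen only to make the pairing easy to verify.

```python
import numpy as np

rng = np.random.default_rng(0)        # any seed works; 0 is arbitrary
X = np.arange(10).reshape(2, 5)       # 5 toy samples stored as columns
Y = X[0:1, :] * 10                    # toy label tied to each sample

permutation = rng.permutation(X.shape[1])
shuffled_X = X[:, permutation]
shuffled_Y = Y[:, permutation]

# Every shuffled label still belongs to the sample in the same column.
print(np.array_equal(shuffled_Y, shuffled_X[0:1, :] * 10))
```

Shuffling X and Y with two independent permutations would break this correspondence, which is why random_mini_batches draws a single permutation and reuses it.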

III. Gradient descent with momentum

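The cost usually depends on many variables. Take the simplest two-dimensional case: the cost surface is bowl-shaped, and the successive points computed during the iterations oscillate along the vertical axis while approaching the minimum only gradually along the horizontal axis. To speed up the algorithm, we would like to damp the vertical oscillation so that the descent reaches the minimum quickly, like the boulder of Sisyphus rolling back down the hill in one go. Gradient descent with momentum was proposed for exactly this purpose.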

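The update rule for gradient descent with momentum is as follows, where alpha is the learning rate (the same rule applies to b with db):

v_dW = beta * v_dW + (1 - beta) * dW
W = W - alpha * v_dW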


def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):

    L = len(parameters) // 2

    for l in range(L):
        v['dW' + str(l+1)] = beta *  v['dW' + str(l+1)] + (1 - beta) * grads['dW' + str(l+1)]
        v['db' + str(l+1)] = beta *  v['db' + str(l+1)] + (1 - beta) * grads['db' + str(l+1)]

        parameters['W' + str(l+1)] = parameters['W' + str(l+1)] - learning_rate * v['dW' + str(l+1)]
        parameters['b' + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * v['db' + str(l+1)]

    return parameters, v
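
To see how the velocity damps oscillation, the sketch below feeds a made-up gradient sequence into the same averaging rule: one component always points the same way, while the other flips sign on every iteration. The gradient values are assumptions for illustration, not taken from the course.

```python
import numpy as np

beta = 0.9
v = np.zeros(2)  # velocity for a hypothetical two-parameter problem

for t in range(1, 51):
    # Made-up gradients: the first component is consistent,
    # the second flips sign on every iteration (the "oscillating" direction).
    grad = np.array([1.0, (-1.0) ** t])
    v = beta * v + (1 - beta) * grad

print(v)  # first entry stays close to 1, second entry is much smaller in magnitude
```

The consistent direction keeps almost its full size in v, while the alternating direction largely cancels out, so the parameter update moves steadily toward the minimum with less zig-zagging.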

Verify it with the helper test case provided by Andrew Ng:

parameters, grads, v = update_parameters_with_momentum_test_case()
parameters, v = update_parameters_with_momentum(parameters, grads, v, beta = 0.9, learning_rate = 0.01)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))

Output:

W1 = [[ 1.62544598 -0.61290114 -0.52907334]
 [-1.07347112  0.86450677 -2.30085497]]
b1 = [[ 1.74493465]
 [-0.76027113]]
W2 = [[ 0.31930698 -0.24990073  1.4627996 ]
 [-2.05974396 -0.32173003 -0.38320915]
 [ 1.13444069 -1.0998786  -0.1713109 ]]
b2 = [[-0.87809283]
 [ 0.04055394]
 [ 0.58207317]]
v["dW1"] = [[-0.11006192  0.11447237  0.09015907]
 [ 0.05024943  0.09008559 -0.06837279]]
v["db1"] = [[-0.01228902]
 [-0.09357694]]
v["dW2"] = [[-0.02678881  0.05303555 -0.06916608]
 [-0.03967535 -0.06871727 -0.08452056]
 [-0.06712461 -0.00126646 -0.11173103]]
v["db2"] = [[0.02344157]
 [0.16598022]
 [0.07420442]]

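Because v is initialized to 0, the values computed during the first iterations of the momentum algorithm are biased well below the true average. This can be corrected with v = v / (1 - pow(beta, t)), where t is the iteration count, but in practice the correction is usually unnecessary: the number of iterations is typically large, and the bias of the first few iterations fades quickly.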
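
The effect of the correction is easiest to see with a constant gradient. In the sketch below (the gradient value 2.0 is an arbitrary assumption), the uncorrected v starts at about a tenth of the true value and climbs slowly, while the corrected value matches the gradient from the first iteration.

```python
beta = 0.9
grad = 2.0        # a constant, made-up gradient
v = 0.0

for t in range(1, 6):
    v = beta * v + (1 - beta) * grad
    v_corrected = v / (1 - beta ** t)
    print(t, round(v, 4), round(v_corrected, 4))
```

With a constant gradient, v after t steps equals (1 - beta^t) * grad, so dividing by (1 - beta^t) recovers the gradient exactly, up to floating-point rounding.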

IV. Adam

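Adam is essentially a combination of momentum and RMSprop. The update rule is as follows, where alpha is the learning rate and t is the number of updates performed so far (the same rule applies to b with db):

v_dW = beta1 * v_dW + (1 - beta1) * dW
v_corrected = v_dW / (1 - beta1^t)
s_dW = beta2 * s_dW + (1 - beta2) * dW^2
s_corrected = s_dW / (1 - beta2^t)
W = W - alpha * v_corrected / (sqrt(s_corrected) + epsilon)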


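Adam stands for adaptive moment estimation. beta1, taken from momentum, controls the exponentially weighted average of dW, called the first moment; beta2, taken from RMSprop, controls the exponentially weighted average of (dW)^2, called the second moment. Both v and s must be initialized as zero matrices with the same shapes as W and b.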

def initialize_adam(parameters):
    L = len(parameters) // 2
    v = {}
    s = {}

    for l in range(L):
        v["dW" + str(l+1)] = np.zeros(parameters['W' + str(l+1)].shape)
        v["db" + str(l+1)] = np.zeros(parameters['b' + str(l+1)].shape)
        s["dW" + str(l+1)] = np.zeros(parameters['W' + str(l+1)].shape)
        s["db" + str(l+1)] = np.zeros(parameters['b' + str(l+1)].shape)

    return v, s
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
                                beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8):

    L = len(parameters) // 2
    v_corrected = {}
    s_corrected = {}
    

    for l in range(L):

        v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1 - beta1) *  grads["dW" + str(l+1)]
        v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1 - beta1) *  grads["db" + str(l+1)]
        
        v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1 - beta1 ** t)
        v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1 - beta1 ** t)
                
        s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1 - beta2) * (grads["dW" + str(l+1)] ** 2)
        s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1 - beta2) * (grads["db" + str(l+1)] ** 2)
        
        s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1 - beta2 ** t)
        s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1 - beta2 ** t)

        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * (v_corrected["dW" + str(l+1)] / \
                                     (np.sqrt(s_corrected["dW" + str(l+1)]) + epsilon))

        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * (v_corrected["db" + str(l+1)] / \
                                     (np.sqrt(s_corrected["db" + str(l+1)]) + epsilon))        

    return parameters, v, s
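
One consequence of dividing the first moment by the square root of the second is that the step size becomes largely independent of the gradient's scale. The sketch below works through the very first update (t = 1, with v and s starting at zero) for made-up gradients spanning several orders of magnitude; it mirrors the formulas in the function above but is written out flat as an illustration.

```python
import numpy as np

learning_rate, beta1, beta2, epsilon = 0.01, 0.9, 0.999, 1e-8

# Made-up gradients spanning several orders of magnitude.
grad = np.array([1e-3, 1.0, -1e3])

t = 1
v = (1 - beta1) * grad            # first moment, starting from v = 0
s = (1 - beta2) * grad ** 2       # second moment, starting from s = 0
v_corrected = v / (1 - beta1 ** t)
s_corrected = s / (1 - beta2 ** t)

step = learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
print(step)  # every entry has magnitude close to learning_rate
```

On the first update the corrected moments reduce to grad and grad squared, so each parameter moves by roughly learning_rate in the direction opposite to its gradient, whatever the gradient's size.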

Verify it with the helper test case provided by Andrew Ng:

parameters, grads, v, s = update_parameters_with_adam_test_case()
parameters, v, s  = update_parameters_with_adam(parameters, grads, v, s, t = 2)

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))

Output:

W1 = [[ 1.63178673 -0.61919778 -0.53561312]
 [-1.08040999  0.85796626 -2.29409733]]
b1 = [[ 1.75225313]
 [-0.75376553]]
W2 = [[ 0.32648046 -0.25681174  1.46954931]
 [-2.05269934 -0.31497584 -0.37661299]
 [ 1.14121081 -1.09244991 -0.16498684]]
b2 = [[-0.88529979]
 [ 0.03477238]
 [ 0.57537385]]
v["dW1"] = [[-0.11006192  0.11447237  0.09015907]
 [ 0.05024943  0.09008559 -0.06837279]]
v["db1"] = [[-0.01228902]
 [-0.09357694]]
v["dW2"] = [[-0.02678881  0.05303555 -0.06916608]
 [-0.03967535 -0.06871727 -0.08452056]
 [-0.06712461 -0.00126646 -0.11173103]]
v["db2"] = [[0.02344157]
 [0.16598022]
 [0.07420442]]
s["dW1"] = [[0.00121136 0.00131039 0.00081287]
 [0.0002525  0.00081154 0.00046748]]
s["db1"] = [[1.51020075e-05]
 [8.75664434e-04]]
s["dW2"] = [[7.17640232e-05 2.81276921e-04 4.78394595e-04]
 [1.57413361e-04 4.72206320e-04 7.14372576e-04]
 [4.50571368e-04 1.60392066e-07 1.24838242e-03]]
s["db2"] = [[5.49507194e-05]
 [2.75494327e-03]
 [5.50629536e-04]]

V. Improving the neural network

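The previous four sections showed how these optimization algorithms are implemented. We now apply them one by one and observe how each improves the network model.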

1. Training data: the "moon" dataset provided by Andrew Ng is used as the training set.

train_X, train_Y = load_dataset()

2. Building the network: the optimizer argument selects which optimization algorithm is used.

def model(X, Y, layers_dims, optimizer, learning_rate = 0.0007, mini_batch_size = 64, beta = 0.9,
          beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8, num_epochs = 10000, print_cost = True):

    L = len(layers_dims)
    costs = []
    t = 0
    seed = 0

    parameters = initialize_parameters(layers_dims)

    if optimizer == 'gd':
        pass
    elif optimizer == 'momentum':
        v = initialize_velocity(parameters)
    elif optimizer == 'adam':
        v, s =initialize_adam(parameters)


    for i in range(num_epochs):

        seed = seed + 1
        mini_batches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch in mini_batches:

            (minibatch_X, minibatch_Y) = minibatch

            a3, caches = forward_propagation(minibatch_X, parameters)

            cost = compute_cost(a3, minibatch_Y)

            grads = backward_propagation(minibatch_X, minibatch_Y, caches)

            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)

            elif optimizer == 'momentum':
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)

            elif optimizer == 'adam':
                t = t + 1  # Adam's bias correction needs the update count, starting at 1
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t, learning_rate,
                                                               beta1, beta2, epsilon)

        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i:%f"%(i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    plt.plot(costs)
    plt.ylabel("cost")
    plt.xlabel("epochs (per 100)")
    plt.title("Learning Rate =" + str(learning_rate))
    plt.show()

    return parameters

3. Mini-batch gradient descent

Running the code below shows the result of mini-batch gradient descent:

train_X, train_Y = load_dataset()

layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer = 'gd')

predictions = predict(train_X, train_Y, parameters)

plt.title("Model with GD optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

Output:

Cost after epoch 0:0.686257
Cost after epoch 1000:0.638242
Cost after epoch 2000:0.624268
Cost after epoch 3000:0.611973
Cost after epoch 4000:0.569783
Cost after epoch 5000:0.485165
Cost after epoch 6000:0.539755
Cost after epoch 7000:0.442252
Cost after epoch 8000:0.419857
Cost after epoch 9000:0.436197

Accuracy: 0.7966666666666666


4. Optimization with momentum

Running the code below shows the result of optimization with momentum:

train_X, train_Y = load_dataset()

layers_dims = [train_X.shape[0], 5,2,1]
parameters = model(train_X, train_Y, layers_dims, beta = 0.9, optimizer = 'momentum')

predictions = predict(train_X, train_Y, parameters)

plt.title("Model with momentum optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

Output:

Cost after epoch 0:0.686286
Cost after epoch 1000:0.638342
Cost after epoch 2000:0.624358
Cost after epoch 3000:0.612031
Cost after epoch 4000:0.569856
Cost after epoch 5000:0.485280
Cost after epoch 6000:0.539752
Cost after epoch 7000:0.442376
Cost after epoch 8000:0.420052
Cost after epoch 9000:0.436641

Accuracy: 0.7966666666666666

5. Adam

Running the code below shows the result of optimization with Adam:

train_X, train_Y = load_dataset()

layers_dims = [train_X.shape[0], 5,2,1]
parameters = model(train_X, train_Y, layers_dims, optimizer = 'adam')

predictions = predict(train_X, train_Y, parameters)

plt.title("Model with adam optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

Output:

Cost after epoch 0:0.684965
Cost after epoch 1000:0.181558
Cost after epoch 2000:0.169563
Cost after epoch 3000:0.218465
Cost after epoch 4000:0.154296
Cost after epoch 5000:0.149509
Cost after epoch 6000:0.165925
Cost after epoch 7000:0.026825
Cost after epoch 8000:0.092551
Cost after epoch 9000:0.154314

Accuracy: 0.94

VI. Analysis

Optimizer    Accuracy    Cost curve
GD           0.797       large oscillations, slow descent
Momentum     0.797       large oscillations, slow descent
Adam         0.94        small oscillations, fast descent

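Comparing the three optimizers, Adam clearly performs best, since it combines momentum and RMSprop. In addition, because the training uses mini-batches, much less CPU or GPU memory is needed than when the whole training set is processed at once.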


Reposted from blog.csdn.net/u013093426/article/details/80953716