Stochastic Gradient Descent vs. Batch Gradient Descent

Reposted from: http://www.cnblogs.com/rcfeng/p/3958926.html

For the underlying theory and formulas, see Note 1 of Andrew Ng's Stanford open course lecture notes.
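For quick reference, these are the quantities that note defines and that the code below implements. This summary is my own; it keeps the 1/2 factor in the cost, as the code does, and omits the 1/m factor:

h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2, \qquad J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2

Batch gradient descent updates every parameter once per pass over all m samples:

\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}

Stochastic gradient descent applies the update for each sample i as soon as it is processed:

\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}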

Understanding the two algorithms through code

1. Python implementation of stochastic gradient descent:

#!/usr/bin/python
# coding=utf-8

'''
Created on 2014-09-06

@author: Ryan C. F.
'''

#Training data set
#each element in x represents (x0,x1,x2)
x = [(1,0.,3) , (1,1.,3) ,(1,2.,3), (1,3.,2) , (1,4.,4)]
#y[i] is the output of y = theta0 * x[0] + theta1 * x[1] +theta2 * x[2]
y = [95.364,97.217205,75.195834,60.105519,49.342380]


#convergence threshold on the change of the cost function
epsilon = 0.0001
#learning rate
alpha = 0.01
diff = [0,0]
error1 = 0
error0 =0
m = len(x)


#init the parameters to zero
theta0 = 0
theta1 = 0
theta2 = 0

while True:

    #calculate the parameters: theta is updated after every single training sample
    for i in range(m):
    
        diff[0] = y[i]-( theta0 + theta1 * x[i][1] + theta2 * x[i][2] )
        
        theta0 = theta0 + alpha * diff[0]* x[i][0]
        theta1 = theta1 + alpha * diff[0]* x[i][1]
        theta2 = theta2 + alpha * diff[0]* x[i][2]
    
    #calculate the cost function over the whole training set
    error1 = 0
    for lp in range(len(x)):
        error1 += ( y[lp]-( theta0 + theta1 * x[lp][1] + theta2 * x[lp][2] ) )**2/2
    
    if abs(error1-error0) < epsilon:
        break
    else:
        error0 = error1
    
    print(' theta0 : %f, theta1 : %f, theta2 : %f, error1 : %f' % (theta0, theta1, theta2, error1))

print('Done: theta0 : %f, theta1 : %f, theta2 : %f' % (theta0, theta1, theta2))
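For comparison, here is a minimal vectorized sketch of the same per-sample update using NumPy. This is my own addition rather than part of the original post; it assumes NumPy is available and reuses the same training data, learning rate, and stopping rule:

import numpy as np

# same training data as above, as arrays
X = np.array([(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)])
Y = np.array([95.364, 97.217205, 75.195834, 60.105519, 49.342380])

alpha = 0.01        # learning rate
epsilon = 0.0001    # convergence threshold
theta = np.zeros(3) # [theta0, theta1, theta2]
error0 = 0.0

while True:
    # stochastic gradient descent: theta changes after every single sample
    for xi, yi in zip(X, Y):
        diff = yi - xi.dot(theta)
        theta += alpha * diff * xi
    # cost over the whole training set
    error1 = 0.5 * np.sum((Y - X.dot(theta)) ** 2)
    if abs(error1 - error0) < epsilon:
        break
    error0 = error1

print('theta:', theta)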

2. Python implementation of batch gradient descent:

#!/usr/bin/python
# coding=utf-8

'''
Created on 2014-09-06

@author: Ryan C. F.
'''

#Training data set
#each element in x represents (x0,x1,x2)
x = [(1,0.,3) , (1,1.,3) ,(1,2.,3), (1,3.,2) , (1,4.,4)]
#y[i] is the output of y = theta0 * x[0] + theta1 * x[1] +theta2 * x[2]
y = [95.364,97.217205,75.195834,60.105519,49.342380]


#convergence threshold on the change of the cost function
epsilon = 0.000001
#learning rate
alpha = 0.001
diff = [0,0]
error1 = 0
error0 =0
m = len(x)

#init the parameters to zero
theta0 = 0
theta1 = 0
theta2 = 0
while True:

    #calculate the parameters: accumulate the gradient over all samples,
    #then update theta once per pass
    sum0 = 0
    sum1 = 0
    sum2 = 0
    for i in range(m):
        #begin batch gradient descent
        diff[0] = y[i]-( theta0 + theta1 * x[i][1] + theta2 * x[i][2] ) #theta still holds the values from before this pass
        sum0 = sum0 + alpha * diff[0]* x[i][0]
        sum1 = sum1 + alpha * diff[0]* x[i][1]
        sum2 = sum2 + alpha * diff[0]* x[i][2]
        #end batch gradient descent
    theta0 = theta0 + sum0
    theta1 = theta1 + sum1
    theta2 = theta2 + sum2
    #calculate the cost function over the whole training set
    error1 = 0
    for lp in range(len(x)):
        error1 += ( y[lp]-( theta0 + theta1 * x[lp][1] + theta2 * x[lp][2] ) )**2/2
    
    if abs(error1-error0) < epsilon:
        break
    else:
        error0 = error1
    
    print(' theta0 : %f, theta1 : %f, theta2 : %f, error1 : %f' % (theta0, theta1, theta2, error1))

print('Done: theta0 : %f, theta1 : %f, theta2 : %f' % (theta0, theta1, theta2))
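The batch version differs only in where the update happens. Again as a sketch of my own under the same assumptions as the NumPy version above, the gradient is accumulated over all samples and theta is touched only once per pass:

import numpy as np

X = np.array([(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)])
Y = np.array([95.364, 97.217205, 75.195834, 60.105519, 49.342380])

alpha = 0.001         # learning rate
epsilon = 0.000001    # convergence threshold
theta = np.zeros(3)
error0 = 0.0

while True:
    # batch gradient descent: one theta update per full pass over the data
    diff = Y - X.dot(theta)           # residual for every sample
    theta += alpha * X.T.dot(diff)    # sum of the per-sample gradient steps
    error1 = 0.5 * np.sum((Y - X.dot(theta)) ** 2)
    if abs(error1 - error0) < epsilon:
        break
    error0 = error1

print('theta:', theta)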

Comparing the batch gradient descent and stochastic gradient descent code above, the difference between the two is easy to see:

1. Stochastic gradient descent updates all of the theta parameters every time it processes a single new training sample:

for i in range(m):

    diff[0] = y[i]-( theta0 + theta1 * x[i][1] + theta2 * x[i][2] )

    theta0 = theta0 + alpha * diff[0]* x[i][0]
    theta1 = theta1 + alpha * diff[0]* x[i][1]
    theta2 = theta2 + alpha * diff[0]* x[i][2]

2. Batch gradient descent updates the theta parameters only once per pass, after it has iterated over all of the training samples:

#calculate the parameters: accumulate the gradient over all samples,
#then update theta once per pass
sum0 = 0
sum1 = 0
sum2 = 0
for i in range(m):
    #begin batch gradient descent
    diff[0] = y[i]-( theta0 + theta1 * x[i][1] + theta2 * x[i][2] )
    sum0 = sum0 + alpha * diff[0]* x[i][0]
    sum1 = sum1 + alpha * diff[0]* x[i][1]
    sum2 = sum2 + alpha * diff[0]* x[i][2]
    #end batch gradient descent
theta0 = theta0 + sum0
theta1 = theta1 + sum1
theta2 = theta2 + sum2

Therefore, when the number of samples is very large, batch gradient descent has to work through every sample before it can update theta even once, so it takes far longer than stochastic gradient descent. On the other hand, stochastic gradient descent can end its iterations too early, so the values it obtains are only close to a local optimum, rather than reaching the local optimum the way batch gradient descent does.

In my view, this is the most essential difference between batch gradient descent and stochastic gradient descent.
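One way to make the "only close to the optimum" point concrete on this tiny data set is to compare the thetas printed by each loop against the exact least-squares solution. This check is my own addition and uses numpy.linalg.lstsq:

import numpy as np

X = np.array([(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)])
Y = np.array([95.364, 97.217205, 75.195834, 60.105519, 49.342380])

# exact minimizer of the least-squares cost, for reference
theta_exact, *_ = np.linalg.lstsq(X, Y, rcond=None)
print('least-squares theta:', theta_exact)

# compare against the values printed by the SGD / batch loops above, e.g.
# print(np.abs(np.array([theta0, theta1, theta2]) - theta_exact))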



