Introduction to Machine Learning (3): Gradient Descent

Introduction:

Strictly speaking, gradient descent (Gradient Descent) is not a machine learning algorithm in itself but an optimization algorithm: it searches for the parameters that minimize a loss function in order to find the optimal solution. Because it is simple and works well, it is widely used inside machine learning algorithms.

How It Works:

Suppose that on a two-dimensional plane we have a function of one variable, like the following:

[Figure: the curve of a one-variable loss function with a single lowest point]

To find the lowest point of this curve, we can define an update rule: θ = θ − η·J′(θ), where J′(θ) is the derivative of the loss J at θ. Repeatedly applying this rule moves the point downhill; as it descends, the magnitude of J′(θ) shrinks, and at the lowest point J′(θ) = 0, so θ stops changing. At that moment we have found the minimum.
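To make the idea concrete, here is a minimal sketch of this update rule applied to a made-up one-variable loss J(θ) = (θ − 2.5)² + 1 (the loss, starting point, and learning rate here are purely illustrative choices, not anything from this post's data):

def J(theta):
    # a made-up convex loss with its minimum at theta = 2.5
    return (theta - 2.5) ** 2 + 1

def dJ(theta):
    # derivative of the loss above
    return 2 * (theta - 2.5)

theta = 0.0      # starting point
eta = 0.1        # learning rate
epsilon = 1e-8   # stop when the loss barely changes any more

while True:
    last_theta = theta
    theta = theta - eta * dJ(theta)   # step against the derivative
    if abs(J(theta) - J(last_theta)) < epsilon:
        break

print(theta)  # ends up very close to 2.5, where the derivative is 0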

Here η is called the "learning rate". It is easy to see that the smaller η is, the more times we have to apply the update, and the slower we reach the answer. But η must not be too large either, otherwise the very first step may jump past the lowest point and subsequent steps may drift further and further away:
[Figure: with a learning rate that is too large, each step overshoots the minimum and lands further away]
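As a quick illustration of that failure mode, re-running the same made-up loss from the sketch above with an overly large learning rate makes each step jump past the minimum and land further away than before:

def J(theta):
    return (theta - 2.5) ** 2 + 1

def dJ(theta):
    return 2 * (theta - 2.5)

theta = 0.0
eta = 1.1          # too large for this loss
for _ in range(10):
    theta = theta - eta * dJ(theta)
    print(theta, J(theta))   # theta swings ever further from 2.5 and the loss keeps growing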

Gradient descent is not limited to two-dimensional data; it applies just as well in higher dimensions. When we generalize it to multiple dimensions, the derivative becomes the gradient, the vector of partial derivatives:

∇J(θ) = ( ∂J/∂θ₀, ∂J/∂θ₁, …, ∂J/∂θₙ ), and the update becomes θ = θ − η·∇J(θ)

When applying this in multiple dimensions, we can use the multiple linear regression model from the previous post to compute the result: writing the linear-regression expressions in matrix form gives us formulas we can implement directly in code.
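Concretely, once a column of ones is prepended to the data (the x_b matrix in the code below), the loss and its gradient take the matrix form that the implementation uses, with m the number of samples:

J(θ) = (1/m) · Σᵢ ( y⁽ⁱ⁾ − X_b⁽ⁱ⁾·θ )²

∇J(θ) = (2/m) · X_bᵀ · ( X_b·θ − y )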

Let me also introduce stochastic gradient descent here (the method above is usually called batch gradient descent); its purpose is to speed up the computation. The batch method has to go through all of the samples to compute every single step, which costs a lot of time, while experiments show that even if we randomly pick one sample at a time to compute the gradient, we still end up close to the minimum (essentially trading a little precision for time).
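Compared with the batch version, only two things change in the code below: the gradient is estimated from one randomly chosen sample i instead of all m samples, and the learning rate decays as training goes on so that the later, noisier steps stay small:

gradientᵢ(θ) = 2 · X_b⁽ⁱ⁾ᵀ · ( X_b⁽ⁱ⁾·θ − y⁽ⁱ⁾ )

η(t) = t0 / (t + t1)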

That is the general idea. Next, let's implement the algorithm above in Python.

Code:

1. Encapsulating the algorithm

# Gradient descent for multiple linear regression, wrapped in a class
import numpy as np
from sklearn.metrics import r2_score


class LinearRegression:
    def __init__(self):
        self.coef_ = None           # coefficients theta_1 ... theta_n
        self.interception_ = None   # intercept theta_0
        self._theta = None

    def fit_normal(self, x_train, y_train):
        """Fit by the closed-form normal equation."""
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"

        x_b = np.hstack([np.ones((len(x_train), 1)), x_train])
        self._theta = np.linalg.inv(x_b.T.dot(x_b)).dot(x_b.T).dot(y_train)

        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def fit_gd(self, x_train, y_train, eta=0.01, n_iters=1e4):
        """Fit by batch gradient descent."""
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"

        def lose(theta, x_b, y):
            # mean squared error loss
            try:
                return np.sum((y - x_b.dot(theta)) ** 2) / len(x_b)
            except Exception:
                return float("inf")

        def Derivative(theta, x_b, y):
            # gradient of the loss, computed from all samples (vectorized)
            return x_b.T.dot(x_b.dot(theta) - y) * 2 / len(x_b)

        def gradient_descent(x_b, y, init_theta, eta, epsilon=1e-8):
            theta = init_theta
            i_iters = 0
            while i_iters < n_iters:
                gradient = Derivative(theta, x_b, y)
                last_theta = theta
                theta = theta - eta * gradient
                # stop when the loss almost stops decreasing
                if abs(lose(theta, x_b, y) - lose(last_theta, x_b, y)) < epsilon:
                    break
                i_iters += 1
            return theta

        x_b = np.hstack([np.ones((len(x_train), 1)), x_train])
        initial_theta = np.zeros(x_b.shape[1])
        self._theta = gradient_descent(x_b, y_train, initial_theta, eta, epsilon=1e-8)
        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def fit_random_gd(self, x_train, y_train, n_iters=5, t0=5, t1=50):
        """Fit by stochastic gradient descent; n_iters is the number of passes over the data."""
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        assert n_iters >= 1

        def Derivative(theta, x_b_i, y_i):
            # gradient estimated from a single sample
            return x_b_i * (x_b_i.dot(theta) - y_i) * 2

        def random_gradient_descent(x_b, y, initial_theta):
            def learning_rate(t):
                # learning rate decays as the step count t grows
                return t0 / (t + t1)

            theta = initial_theta
            m = len(x_b)
            for cur_iter in range(n_iters):
                # shuffle the samples once per pass so every sample gets used
                indexes = np.random.permutation(m)
                x_b_new = x_b[indexes]
                y_new = y[indexes]
                for i in range(m):
                    gradient = Derivative(theta, x_b_new[i], y_new[i])
                    theta = theta - learning_rate(cur_iter * m + i) * gradient
            return theta

        x_b = np.hstack([np.ones((len(x_train), 1)), x_train])
        initial_theta = np.zeros(x_b.shape[1])
        self._theta = random_gradient_descent(x_b, y_train, initial_theta)
        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, x_predict):
        assert self.interception_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert x_predict.shape[1] == len(self.coef_), \
            "the number of features of x_predict must be equal to x_train"

        x_b = np.hstack([np.ones((len(x_predict), 1)), x_predict])
        return x_b.dot(self._theta)

    def score(self, x_test, y_test):
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

2. Main script:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from LR_GD_class import LinearRegression
from sklearn.preprocessing import StandardScaler

boston = datasets.load_boston()
x = boston.data
y = boston.target
x = x[y < 50.0]
y = y[y < 50.0]

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)

standardScaler = StandardScaler()
standardScaler.fit(x_train)
x_train_stand = standardScaler.transform(x_train)
x_test_stand = standardScaler.transform(x_test)

# Batch gradient descent
lin_reg = LinearRegression()
lin_reg.fit_gd(x_train_stand, y_train)
print(lin_reg.score(x_test_stand, y_test))

# Stochastic gradient descent
lin_reg1 = LinearRegression()
lin_reg1.fit_random_gd(x_train_stand, y_train, n_iters=50)
print(lin_reg1.score(x_test_stand, y_test))
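
To get a feel for the speed/accuracy trade-off mentioned earlier, one rough way to compare the two fits is simply to time them with the standard time module. This is a quick sketch appended to the script above; the exact numbers will depend on your machine and on the random shuffling:

import time

start = time.time()
lin_reg_batch = LinearRegression().fit_gd(x_train_stand, y_train)
print("batch GD:     ", time.time() - start, "s, R^2 =", lin_reg_batch.score(x_test_stand, y_test))

start = time.time()
lin_reg_sgd = LinearRegression().fit_random_gd(x_train_stand, y_train, n_iters=50)
print("stochastic GD:", time.time() - start, "s, R^2 =", lin_reg_sgd.score(x_test_stand, y_test))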


I hope this is helpful. If you like it, you are welcome to follow my WeChat official account, where I post my study notes so that we can learn together!


Reposted from blog.csdn.net/Rosen_er/article/details/104378129