一个简单的线性拟合问题，到底有多少种做法

相信大家都做过线性拟合问题吧，其实就是给很多点，来求线性方程的斜率和截距。早在高中数学就有这类问题，我记得很清楚，如果出现在试卷中，一般出现在解答题的第二题左右，高中中的做法就是最小二乘法，代入公式，求斜率和截距，说句好听，就是送分题。

在科学计算中，也是采用ols(普通最小二乘法)进行回归分析。OLS 全称ordinary least squares,是回归分析(regression analysis)最根本的一个形式。

深度学习

这个就是三大深度学习框架的入门demo。

（1） Keras

Keras 的核心数据结构是 model，一种组织网络层的方式。最简单的模型是 Sequential 顺序模型，它由多个网络层线性堆叠。

import keras
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from pylab import mpl
#正常显示画图时出现的中文
mpl.rcParams['font.sans-serif']=['SimHei']
from keras import layers

x = np.linspace(0,50,50)
y  =  3 * x + 7 + np.random.randn(50) *5 
plt.scatter(x,y)

通过numpy随机绘制了50个散点图，如下图所示

下面使用keras搭建最简单的顺序Sequential模型

# 顺序模型
model = keras.Sequential() 
# 输入输出是一维特征
model.add(layers.Dense(1,input_dim=1)) 
model.summary()
# 编译模型 
# 使用adam优化器，损失函数mse均方差，在compile加入精确度
model.compile(optimizer='adam',loss='mse',metrics=['acc'])
# 训练模型 epochs 参数是把数据训练3000遍
history = model.fit(x, y, epochs=3000)
plt.scatter(x, y, c='r')
plt.plot(x, model.predict(x))

运行代码，结果如下图所示

（2）Pytorch

我们创建一个由方程 $y = 2x + 0.2$ 产生的数据集，并通过torch.rand()函数制造噪音

import torch as t
from matplotlib import pyplot as plt
from torch.autograd import Variable
from torch import nn
# 创建数据集
x = Variable(t.unsqueeze(t.linspace(-1, 1, 100), dim=1))
y = Variable(x * 2 + 0.2 + t.rand(x.size()))
plt.scatter(x.data.numpy(),y.data.numpy())
plt.show()

制造出来的数据集，如下图所示

下面我们就要开始定义我们的模型，这里定义的是一个输入层和输出层都只有一维的模型，并定义出损失函数MSE和优化函数sgd，这里使用均方误差作为损失函数。

class LinearRegression(t.nn.Module):
    def __init__(self):
        #继承父类构造函数
        super(LinearRegression, self).__init__() 
        #输入和输出的维度都是1
        self.linear = t.nn.Linear(1, 1) 
    def forward(self, x):
        out = self.linear(x)
        return out
        
model = LinearRegression()#实例化对象
num_epochs = 1000#迭代次数
learning_rate = 1e-2#学习率0.01
Loss = t.nn.MSELoss()#损失函数
optimizer = t.optim.SGD(model.parameters(), lr=learning_rate)#优化函数

遍历每次epoch，计算出loss，反向传播计算梯度，不断的更新梯度，使用梯度下降进行优化。

for epoch in range(num_epochs):
    # 预测
    y_pred= model(x)
    # 计算loss
    loss = Loss(y_pred, y)
    #清空上一步参数值
    optimizer.zero_grad()
    #反向传播
    loss.backward()
    #更新参数
    optimizer.step()
    if epoch % 200 == 0:
        print("[{}/{}] loss:{:.4f}".format(epoch+1, num_epochs, loss))

plt.scatter(x.data.numpy(), y.data.numpy())
plt.plot(x.data.numpy(), y_pred.data.numpy(), 'r-',lw=5)
plt.text(0.5, 0,'Loss=%.4f' % loss.data.item(), fontdict={'size': 20, 'color':  'red'})
plt.show()
####结果如下####
[1/1000] loss:1.7766
[201/1000] loss:0.1699
[401/1000] loss:0.0816
[601/1000] loss:0.0759
[801/1000] loss:0.0755

运行代码，结果如下图所示

（3）tensorflow

下面，我们采用tensorflow进行线性拟合

首先生成我们的数据，数据标签通过真实函数加上高斯噪声得到。

然后为了进行梯度计算,定义了变量w和偏移量b，初值都设为0.

import tensorflow as tf
import numpy as np

# 一些参数
learning_rate = 0.01  # 学习率
training_steps = 1000  # 训练次数
display_step = 50  # 训练50次输出一次

# 训练数据
X = np.linspace(0, 1, 50).reshape((-1,1))
Y =  2* X+ 1 + np.random.normal(0,0.1, X.shape)

n_samples = 50

# 随机初始化权重和偏置
W = tf.Variable(np.random.randn(), name="weight")
b = tf.Variable(np.random.randn(), name="bias")

# 线性回归函数
def linear_regression(x):
    return W*x + b

# 损失函数
def mean_square(y_pred, y_true):
    return tf.reduce_sum(tf.pow(y_pred-y_true, 2)) / (2 * n_samples)

# 优化器采用随机梯度下降(SGD)
optimizer = tf.optimizers.SGD(learning_rate)

# 计算梯度，更新参数
def run_optimization():
    # tf.GradientTape()梯度带，可以查看每一次epoch的参数值
    with tf.GradientTape() as g:
        pred = linear_regression(X)
        loss = mean_square(pred, Y)
    # 计算梯度
    gradients = g.gradient(loss, [W, b])
    # 更新W，b
    optimizer.apply_gradients(zip(gradients, [W, b]))

# 开始训练
for step in range(1, training_steps+1):
    run_optimization()
    if step % display_step == 0:
        pred = linear_regression(X)
        loss = mean_square(pred, Y)
        print("step: %i, loss: %f, W: %f, b: %f" % (step, loss, W.numpy(), b.numpy()))

import matplotlib.pyplot as plt
plt.plot(X, Y, 'ro', label='Original data')
plt.plot(X, np.array(W * X + b), label='Fitted line')
plt.legend()
plt.show()

上面就是深度学习的做法，你们会看到一列蒙蔽，下面我来演示机器学习做法

机器学习

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0,30,50)
y = x+ 2*np.random.rand(50) 
plt.figure(figsize=(10,8))
from sklearn.linear_model import LinearRegression  #导入线性回归
model = LinearRegression()  #初始化模型
x1 = x.reshape(-1,1)   # 将行变列  得到x坐标
y1 = y.reshape(-1,1)   # 将行变列  得到y坐标
model.fit(x1,y1)  #训练数据
plt.scatter(x,y)
x_test = np.linspace(0,40).reshape(-1,1)
plt.plot(x_test,model.predict(x_test))
model.coef_   #array([[1.00116024]])   斜率
model.intercept_   # array([0.86175551])  截距

Numpy

import numpy as np
x = np.linspace(0,30,50)
y = x+ 1 + np.random.normal(0,0.1, 50)
z1 = np.polyfit(x,y,1) #一次多项式拟合，相当于线性拟合
z1 # [1.00895356, 0.71872268]
p1 = np.poly1d(z1)  
p1 # array([1.00032794, 0.9799152 ])

Scipy

from scipy.stats  import  linregress
x = np.linspace(0,30,50)
y = x + 2 +np.random.normal(0)
slope, intercept, r_value, p_value, std_err = linregress(x, y)
print("slope: %f    intercept: %f" % (slope, intercept)) # slope: 1.000000    intercept: 1.692957
print("R-squared: %f" % r_value**2) #R-squared: 1.000000
plt.figure(figsize=(10,8))
plt.plot(x, y, 'o', label='original data')
plt.plot(x, intercept + slope*x, 'r', label='fitted line')
plt.legend()
plt.show()

Statsmodels

#  线性模型
import statsmodels.api as sm
import numpy as np
x = np.linspace(0,10,100)
y = 3*x + np.random.randn()+ 10
# Fit and summarize OLS model
X = sm.add_constant(x)
mod = sm.OLS(y,X)
result = mod.fit()
print(result.params)
print(result.summary())
[9.65615842 3.        ]


  OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 7.546e+31
Date:                Thu, 25 Jul 2019   Prob (F-statistic):               0.00
Time:                        21:10:18   Log-Likelihood:                 3082.0
No. Observations:                 100   AIC:                            -6160.
Df Residuals:                      98   BIC:                            -6155.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          9.6562      2e-15   4.83e+15      0.000       9.656       9.656
x1             3.0000   3.45e-16   8.69e+15      0.000       3.000       3.000
==============================================================================
Omnibus:                        4.067   Durbin-Watson:                   0.161
Prob(Omnibus):                  0.131   Jarque-Bera (JB):                4.001
Skew:                           0.446   Prob(JB):                        0.135
Kurtosis:                       2.593   Cond. No.                         11.7
==============================================================================

R

R语言通过lm()函数创建预测变量与响应变量之间的关系线性回归模型。下面是我的笔记关于体重和身高的回归

> x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
> y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
> lm(y~x)
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
   -38.4551       0.6746  
> summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3002 -1.6629  0.0412  1.8944  3.9775 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -38.45509    8.04901  -4.778  0.00139 ** 
x             0.67461    0.05191  12.997 1.16e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared:  0.9548,	Adjusted R-squared:  0.9491 
F-statistic: 168.9 on 1 and 8 DF,  p-value: 1.164e-06

lm(y~x)返回的是截距和斜率，即身高和体重的线性回归方程：y = 0.6746x -38.4551

SPSS

SPSS也可以做回归分析。

数据源

在菜单栏中选择【分析】，在下拉菜单中选择【回归】，右侧弹出子菜单中选择【散线性】，弹出线性回归窗口，将“cholesterol“变量移至右侧的【因变量】框中，”time”变量移至右侧的【自变量】框选择因变量和自变量，如下图所示

tep1：在菜单栏中选择【分析】，在下拉菜单中选择【一般线性模型】中选择【多变量】，弹出“多变量”窗口，将“身高”、“体重”选入右侧【因变量】框中，将“班级”选入【固定因子】框中，如下图所示

Step2：点击【事后比较】在【多变量：实测平均值的事后比较】窗口中，将“class”选入右侧【事后检验】中，并选中【图基】，点击【继续】，如下图所示。

Stata

SPSS出来了，我会放过Stata?

. set obs 10
number of observations (_N) was 0, now 10
. gen x=_n
. gen y= x+runiform()
. list
     +---------------+
     |  x          y |
     |---------------|
  1. |  1   1.348872 |
  2. |  2   2.266886 |
  3. |  3   3.136646 |
  4. |  4   4.028557 |
  5. |  5   5.868933 |
     |---------------|
  6. |  6   6.350855 |
  7. |  7   7.071105 |
  8. |  8   8.323368 |
  9. |  9   9.555103 |
 10. | 10   10.87599 |
     +---------------+

. reg y x

      Source |       SS           df       MS      Number of obs   =        10
-------------+----------------------------------   F(1, 8)         =   1107.40
       Model |  89.9664664         1  89.9664664   Prob > F        =    0.0000
    Residual |  .649928434         8  .081241054   R-squared       =    0.9928
-------------+----------------------------------   Adj R-squared   =    0.9919
       Total |  90.6163949         9  10.0684883   Root MSE        =    .28503

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.044271   .0313806    33.28   0.000     .9719076    1.116635
       _cons |   .1391392   .1947113     0.71   0.495    -.3098658    .5881443
------------------------------------------------------------------------------

在结果界面中， Coef为1.044271表示回归斜率，_cons为.1391392表示回归截距，R-squared和Adj R-squared分别为0.9928和0.9919，说明回归方程拟合效果很好。

如果我们需要绘制对应图形，只需要使用命令avplots就可以绘制折线图，如下图所示。