Ridge Regression with scikit-learn and pandas

The loss function for Ridge regression is:
$J(\theta) = \frac{1}{2}(X\theta - Y)^T (X\theta - Y) + \frac{1}{2}\alpha\|\theta\|_2^2$
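
Setting the gradient of this loss to zero gives the closed-form minimizer θ = (XᵀX + αI)⁻¹XᵀY. The sketch below checks that formula against scikit-learn's Ridge. It is only an illustration: the data is synthetic and the coefficient values are made up, and the intercept is switched off so that both sides solve exactly the same problem.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data (an assumption; not the dataset used in this post)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=50)

alpha = 1.0

# Closed-form minimizer of J: theta = (X^T X + alpha * I)^(-1) X^T Y
theta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)

# scikit-learn's Ridge without an intercept minimizes the same objective
# up to a constant factor, so it should reach the same theta
theta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, Y).coef_

print(theta_closed)
print(theta_sklearn)
```

The factor 1/2 in front of both terms does not change where the minimum is, which is why the two results should agree.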
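To implement an algorithm, we usually settle on the model first and then derive the objective function from it. Machine learning is built on data, so processing and analyzing the data is indispensable, and once the algorithm is implemented the model still has to be evaluated and compared.
We set up the linear regression model as follows:
$PE = \theta_0 + \theta_1 \cdot AT + \theta_2 \cdot V + \theta_3 \cdot AP + \theta_4 \cdot RH$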

Reading the data and splitting the dataset

import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model

# The four input columns are the features; PE is the target
data = pd.read_csv(r'...\data.csv')
X = data[['AT', 'V', 'AP', 'RH']]
y = data[['PE']]


# Split the dataset (by default 75% for training, 25% for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
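
To see what the split produces without the real CSV, here is a minimal sketch on a made-up DataFrame. The column names match the ones used above, but the 100 rows of random values are purely an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same column names as the post (values are made up)
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 5)),
                    columns=['AT', 'V', 'AP', 'RH', 'PE'])
X = data[['AT', 'V', 'AP', 'RH']]
y = data[['PE']]

# With no test_size given, a quarter of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

# Every row ends up in exactly one of the two sets
print(set(X_train.index).isdisjoint(X_test.index))
```

Fixing random_state makes the shuffle reproducible, so repeated runs give the same split.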

Running Ridge regression with scikit-learn

from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)
# The hyperparameter alpha has a large effect on the result; start with 1 and look at the output

print(ridge.coef_)
print(ridge.intercept_)
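
The introduction mentions that a fitted model still needs to be evaluated. One possible way to do that is to score the model on the held-out test set, as sketched below. The data here is synthetic, and the intercept of 450 and the coefficient values are hypothetical numbers chosen only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; true_theta and the offset 450 are hypothetical
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=['AT', 'V', 'AP', 'RH'])
true_theta = np.array([-2.0, -0.3, 0.1, -0.2])
noise = rng.normal(scale=0.5, size=200)
y = pd.DataFrame({'PE': 450.0 + X.to_numpy() @ true_theta + noise})

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
ridge = Ridge(alpha=1).fit(X_train, y_train)

# Evaluate on data the model has not seen during fitting
y_pred = ridge.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = ridge.score(X_test, y_test)

print('MSE :', mse)
print('RMSE:', rmse)
print('R^2 :', r2)
```

RMSE is in the same units as PE, which makes it easier to read than MSE when comparing models.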

Choosing the hyperparameter and studying how it relates to the regression coefficients θ

from sklearn.linear_model import RidgeCV
ridgecv=RidgeCV(alphas=[0.01,0.1,0.5,1,3,5,7,10,20,100])
ridgecv.fit(X_train,y_train)
ridgecv.alpha_
# alpha_ holds the best hyperparameter among the candidates
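
A self-contained sketch of the same selection step on synthetic data follows (the data and its coefficients are assumptions). When no cv argument is passed, RidgeCV uses an efficient form of leave-one-out cross-validation to compare the candidates, and the selected value is always one of the alphas supplied.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)

# Same candidate grid as in the post
alphas = [0.01, 0.1, 0.5, 1, 3, 5, 7, 10, 20, 100]
ridgecv = RidgeCV(alphas=alphas).fit(X, y)

print(ridgecv.alpha_)   # the candidate with the best cross-validation score
print(ridgecv.coef_)    # coefficients refit with that alpha
```

If the chosen value sits at either end of the grid, it is usually worth widening the grid and running the search again.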

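From the Ridge loss function we can see that the larger the hyperparameter α is, the more heavily the regularization term penalizes large coefficients, so the regression coefficients θ shrink and eventually approach 0. Conversely, the smaller α is, the weaker the regularization term, and the closer θ gets to the coefficients of ordinary linear regression.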
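
Both ends of this behavior can be checked numerically. The sketch below, on synthetic data with made-up coefficients, fits Ridge for a range of alphas, prints the norm of θ for each, and compares the smallest alpha with an ordinary least-squares fit. The intercept is switched off in both models to keep the comparison direct.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)

# Ordinary least squares as the alpha -> 0 reference point
ols = LinearRegression(fit_intercept=False).fit(X, y)

alphas = [1e-8, 1e-2, 1, 1e2, 1e4, 1e6]
norms = []
for a in alphas:
    coef = Ridge(alpha=a, fit_intercept=False).fit(X, y).coef_
    norms.append(np.linalg.norm(coef))
    print(a, norms[-1])

# With a tiny alpha the ridge solution is essentially the OLS solution
tiny = Ridge(alpha=1e-8, fit_intercept=False).fit(X, y).coef_
print(np.abs(tiny - ols.coef_).max())
```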

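# Study the relationship between the hyperparameter and the regression coefficients θ
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)
# To be honest, there is a lot about matrices I still don't fully understand, including why X and y are built this way.
# What the code does: X[i, j] = 1 / (i + j + 1), which is the 10x10 Hilbert matrix, and y is a vector of ten ones.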


n_alphas = 200
# Use 200 hyperparameter values
alphas = np.logspace(-10, -2, n_alphas)
# The values are log-spaced between 10^-10 and 10^-2


clf= linear_model.Ridge(fit_intercept=False)
coefs = []
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)



# Plot the coefficient paths
ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale('log')
# Reverse the x axis so that alpha runs from large to small
ax.set_xlim(ax.get_xlim()[::-1])
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
# Fit the axes tightly to the data
plt.show()
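
A note on the X used in the script above: with broadcasting, the entry in row i and column j is 1 / (i + j + 1), which makes X the 10x10 Hilbert matrix. That matrix is badly ill-conditioned, which is probably why it was chosen: with almost no regularization the coefficients are extremely sensitive, and the plot shows them settling down as alpha grows. The sketch below only checks the construction and prints the condition number; the interpretation is my reading, not something stated in the original code.

```python
import numpy as np

# Same construction as in the script above
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])

# Build the Hilbert matrix explicitly from its definition H[i, j] = 1 / (i + j + 1)
i, j = np.indices((10, 10))
hilbert = 1.0 / (i + j + 1)

print(np.allclose(X, hilbert))   # the two constructions agree
print(np.allclose(X, X.T))       # the matrix is symmetric
print(np.linalg.cond(X))         # a very large condition number
```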
