The previous post in this series, "Tianchi learning competition: Industrial Steam Prediction 3 — Model Training", trained several machine-learning models; this post covers how to evaluate and tune them.
1 Model evaluation methods
1 Underfitting and overfitting
2 Generalization and regularization
3 Regression metrics and how to call them
(1) Mean absolute error (MAE)
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,y_pred)
(2) Mean squared error (MSE)
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test,y_pred)
(3) Root mean squared error (RMSE)
import numpy as np
from sklearn.metrics import mean_squared_error
pred_error=mean_squared_error(y_test,y_pred)
np.sqrt(pred_error)
(4) R-squared (coefficient of determination)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
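These four metrics can be exercised together on a small made-up example (the y values below are arbitrary, chosen only to show the call signatures):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# made-up ground truth and predictions, for illustration only
y_test = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_test, y_pred)   # 0.5
mse = mean_squared_error(y_test, y_pred)    # 0.375
rmse = np.sqrt(mse)                         # ≈ 0.612
r2 = r2_score(y_test, y_pred)               # ≈ 0.949
print(mae, mse, rmse, r2)
```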
4 Cross-validation
(1) Simple (hold-out) cross-validation:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(iris.data,iris.target,test_size=.4,random_state=0)
(2) K-fold cross-validation:
from sklearn.model_selection import KFold
kf=KFold(n_splits=10)
(3) Leave-one-out cross-validation:
from sklearn.model_selection import LeaveOneOut
loo=LeaveOneOut()
(4) Leave-P-out cross-validation:
from sklearn.model_selection import LeavePOut
lpo=LeavePOut(p=5)
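The splitters above only define a splitting strategy; the actual train/test indices come from their `split` method. A minimal sketch with a made-up 10-sample array:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 made-up samples, 2 features each
kf = KFold(n_splits=5)

fold_sizes = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    # each fold holds out 2 samples and trains on the remaining 8
    fold_sizes.append((len(train_index), len(test_index)))
    print("fold", fold, "train:", train_index, "test:", test_index)
```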
2 Model tuning
1 Tuning
2 Grid search
Train a model for every candidate parameter combination and keep the combination that scores best.
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
iris=load_iris()
X_train,X_test,Y_train,Y_test=train_test_split(iris.data,iris.target,random_state=0)
print("size of training set:{} size of testing set:{}".format(X_train.shape[0],X_test.shape[0]))
best_score=0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for c in [0.001,0.01,0.1,1,10,100]:
        svm=SVC(gamma=gamma,C=c)
        svm.fit(X_train,Y_train)
        score=svm.score(X_test,Y_test)
        if score>best_score:
            best_score=score
            best_parameters={'gamma':gamma,'C':c}
print("Best score:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
The output shows that the combination gamma=0.001, C=100 scores best:
size of training set:112 size of testing set:38
Best score:0.97
Best parameters:{'gamma': 0.001, 'C': 100}
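Hand-rolling the double loop works, but sklearn's GridSearchCV wraps the same search together with cross-validation; a sketch of the equivalent search on iris (selecting parameters on CV folds of the training set instead of on the test set):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, random_state=0)

param_grid = {'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
              'C': [0.001, 0.01, 0.1, 1, 10, 100]}
# 5-fold cross-validation over the training set picks the parameters
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, Y_train)
print(grid.best_params_, grid.best_score_)
print(grid.score(X_test, Y_test))  # final check on held-out data
```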
3 Learning curves
3 Competition model validation and tuning
3.1 Model overfitting and underfitting
1 Base code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn import preprocessing
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split #data splitting
from sklearn.metrics import mean_squared_error #evaluation metric
from sklearn.linear_model import LinearRegression #linear model
from sklearn.neighbors import KNeighborsRegressor #k-nearest-neighbors regressor
from sklearn.tree import DecisionTreeRegressor #decision-tree regressor
from sklearn.ensemble import RandomForestRegressor #random-forest regressor
from lightgbm import LGBMRegressor #LightGBM regressor
from sklearn.svm import SVR #support vector regressor
from sklearn.linear_model import SGDRegressor
#read the data
train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data=pd.read_csv(train_data_file,sep='\t',encoding='utf-8')
test_data=pd.read_csv(test_data_file,sep='\t',encoding='utf-8')
#min-max normalization, fitted on the training features only
features_columns=[col for col in train_data.columns if col not in ['target']]
min_max_scaler=preprocessing.MinMaxScaler()
min_max_scaler=min_max_scaler.fit(train_data[features_columns])
train_data_scaler=min_max_scaler.transform(train_data[features_columns])
test_data_scaler=min_max_scaler.transform(test_data[features_columns])
train_data_scaler=pd.DataFrame(train_data_scaler)
train_data_scaler.columns=features_columns
test_data_scaler=pd.DataFrame(test_data_scaler)
test_data_scaler.columns=features_columns
train_data_scaler['target']=train_data['target']
#PCA: keep enough components to explain 90% of the variance (16 for this dataset)
pca=PCA(n_components=0.9)
new_train_pca_16=pca.fit_transform(train_data_scaler.iloc[:,0:-1])
new_test_pca_16=pca.transform(test_data_scaler)
new_train_pca_16=pd.DataFrame(new_train_pca_16)
new_test_pca_16=pd.DataFrame(new_test_pca_16)
new_train_pca_16['target']=train_data_scaler['target']
#use the 16 PCA features
new_train_pca_16=new_train_pca_16.fillna(0)
train=new_train_pca_16.drop(columns=['target']) #features only; keeping 'target' here would leak the label
target=new_train_pca_16['target']
#split the data: 80% training, 20% validation
train_data,test_data,train_target,test_target=train_test_split(train,target,test_size=0.2,random_state=0)
2 Underfitting
clf=SGDRegressor(max_iter=500,tol=1e-2)
clf.fit(train_data,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data))
score_test=mean_squared_error(test_target,clf.predict(test_data))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
3 Overfitting
from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(5)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)
clf=SGDRegressor(max_iter=1000,tol=1e-3)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
4 Proper fit
from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(3)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)
clf=SGDRegressor(max_iter=1000,tol=1e-3)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
3.2 Model regularization
1 L2 regularization
Add a regularization term to the proper-fit code above.
from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(3)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)
clf=SGDRegressor(max_iter=1000,tol=1e-3,penalty='l2',alpha=0.0001)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
2 L1 regularization
Same as above.
from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(3)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)
clf=SGDRegressor(max_iter=1000,tol=1e-3,penalty='l1',alpha=0.0001)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
3 ElasticNet: weighted combination of the L1 and L2 penalties
from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(3)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)
clf=SGDRegressor(max_iter=1000,tol=1e-3,penalty='elasticnet',l1_ratio=0.9,alpha=.0001)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
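For reference, the same three penalties are also available as dedicated sklearn estimators (Ridge, Lasso, ElasticNet), which use exact solvers instead of SGD; a sketch on made-up data (the coefficients and alpha values here are illustrative, not tuned for the competition):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.RandomState(0)
X = rng.randn(100, 4)                                  # made-up features
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + 0.05 * rng.randn(100)

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2 penalty: shrinks weights
lasso = Lasso(alpha=0.1).fit(X, y)                     # L1 penalty: zeroes some weights
enet = ElasticNet(alpha=0.1, l1_ratio=0.9).fit(X, y)   # weighted L1+L2 mix

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))  # the two irrelevant features drop to 0
print(np.round(enet.coef_, 2))
```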
3.3 Model cross-validation
1 Simple (hold-out) cross-validation: the 80/20 train_test_split in the base code above already implements this.
2 K-fold cross-validation
from sklearn.model_selection import KFold
kf=KFold(n_splits=5)
for k,(train_index,test_index) in enumerate(kf.split(train)):
    train_data,test_data=train.values[train_index],train.values[test_index]
    train_target,test_target=target[train_index],target[test_index]
    clf=SGDRegressor(max_iter=1000,tol=1e-3)
    clf.fit(train_data,train_target)
    score_train=mean_squared_error(train_target,clf.predict(train_data))
    score_test=mean_squared_error(test_target,clf.predict(test_data))
    print("fold",k,"SGDRegressor train MSE:",score_train)
    print("fold",k,"SGDRegressor test MSE:",score_test,"\n")
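The whole fold loop above can be collapsed into a single `cross_val_score` call; a sketch on made-up data (used here so the snippet runs without the competition files):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(200, 5)                                   # made-up features
y = X @ np.array([1.0, 2.0, 0.5, -1.0, 0.3]) + 0.1 * rng.randn(200)

# sklearn maximizes scores, so MSE is exposed as its negative
scores = cross_val_score(SGDRegressor(max_iter=1000, tol=1e-3), X, y,
                         cv=KFold(n_splits=5),
                         scoring='neg_mean_squared_error')
print(-scores)         # per-fold MSE
print(-scores.mean())  # mean MSE over the 5 folds
```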
3 Leave-one-out cross-validation
from sklearn.model_selection import LeaveOneOut
loo=LeaveOneOut()
for k,(train_index,test_index) in enumerate(loo.split(train)):
    train_data,test_data=train.values[train_index],train.values[test_index]
    train_target,test_target=target[train_index],target[test_index]
    clf=SGDRegressor(max_iter=1000,tol=1e-3)
    clf.fit(train_data,train_target)
    score_train=mean_squared_error(train_target,clf.predict(train_data))
    score_test=mean_squared_error(test_target,clf.predict(test_data))
    print("sample",k,"SGDRegressor train MSE:",score_train)
    print("sample",k,"SGDRegressor test MSE:",score_test,"\n")
    if k>=9:  # full LOO is expensive; stop after the first 10 splits
        break
4 Leave-P-out cross-validation
from sklearn.model_selection import LeavePOut
lpo=LeavePOut(p=10)
for k,(train_index,test_index) in enumerate(lpo.split(train)):
    train_data,test_data=train.values[train_index],train.values[test_index]
    train_target,test_target=target[train_index],target[test_index]
    clf=SGDRegressor(max_iter=1000,tol=1e-3)
    clf.fit(train_data,train_target)
    score_train=mean_squared_error(train_target,clf.predict(train_data))
    score_test=mean_squared_error(test_target,clf.predict(test_data))
    print("split",k,"SGDRegressor train MSE:",score_train)
    print("split",k,"SGDRegressor test MSE:",score_test,"\n")
    if k>=9:  # the number of leave-10-out splits is huge; stop after the first 10
        break
3.4 Hyperparameter space and tuning
1 Exhaustive grid search
#train a random forest on the data, tuning it with exhaustive grid search
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
rf=RandomForestRegressor()
parameters={'n_estimators':[50,100,200],'max_depth':[1,2,3]}
clf=GridSearchCV(rf,parameters,cv=5)
clf.fit(train_data,train_target)
score_test=mean_squared_error(test_target,clf.predict(test_data))
print("RandomForestRegressor GridSearchCV test MSE:",score_test)
print(sorted(clf.cv_results_.keys())) #fit times and validation metrics for every candidate
2 Randomized parameter search
#train a random forest on the data, tuning it with randomized parameter search
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
rf=RandomForestRegressor()
parameters={
    'n_estimators':[50,100,200,300],
    'max_depth':[1,2,3,4,5]
}
clf=RandomizedSearchCV(rf,parameters,cv=5)
clf.fit(train_data,train_target)
print('Best parameters found are:',clf.best_params_)
score_test=mean_squared_error(test_target,clf.predict(test_data))
print("RandomForestRegressor RandomizedSearchCV test MSE:",score_test)
print(sorted(clf.cv_results_.keys())) #fit times and validation metrics for every candidate
3 LightGBM tuning
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV
parameters={'learning_rate':[0.01,0.1,1],'n_estimators':[20,40]}
clf=GridSearchCV(lgb.LGBMRegressor(num_leaves=31),parameters,cv=5)
clf.fit(train_data,train_target)
score_test=mean_squared_error(test_target,clf.predict(test_data))
print("LGBMRegressor GridSearchCV test MSE:",score_test)
3.5 Learning curves and validation curves
1 Learning curve
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import learning_curve
train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data=pd.read_csv(train_data_file,sep='\t',encoding='utf-8')
test_data=pd.read_csv(test_data_file,sep='\t',encoding='utf-8')
def plot_learning_curve(estimator,title,x,y,ylim=None,cv=None,n_jobs=1,train_sizes=np.linspace(.1,1.0,5)):
    plt.figure(figsize=(18,10),dpi=150)
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes,train_scores,test_scores=learning_curve(estimator,x,y,cv=cv,n_jobs=n_jobs,train_sizes=train_sizes)
    train_scores_mean=np.mean(train_scores,axis=1)
    train_scores_std=np.std(train_scores,axis=1)
    test_scores_mean=np.mean(test_scores,axis=1)
    test_scores_std=np.std(test_scores,axis=1)
    plt.grid()
    plt.fill_between(train_sizes,train_scores_mean-train_scores_std,train_scores_mean+train_scores_std,alpha=0.1,color='r')
    plt.fill_between(train_sizes,test_scores_mean-test_scores_std,test_scores_mean+test_scores_std,alpha=0.1,color='g')
    plt.plot(train_sizes,train_scores_mean,'o-',color='r',label="training score")
    plt.plot(train_sizes,test_scores_mean,'o-',color='g',label="cross-validation score")
    plt.legend(loc="best")
    return plt
x=train_data[test_data.columns].values
y=train_data['target'].values
title="SGDRegressor"
cv=ShuffleSplit(n_splits=100,test_size=0.2,random_state=0)
estimator=SGDRegressor()
plot_learning_curve(estimator,title,x,y,ylim=(0.7,1.01),cv=cv,n_jobs=-1).show()
2 Validation curve
Plot the validation curve for an SGDRegressor trained on the data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import validation_curve
train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data=pd.read_csv(train_data_file,sep='\t',encoding='utf-8')
test_data=pd.read_csv(test_data_file,sep='\t',encoding='utf-8')
x=train_data[test_data.columns].values
y=train_data['target'].values
param_range=[0.1,0.01,0.001,0.0001,0.00001,0.000001]
train_scores,test_scores=validation_curve(SGDRegressor(max_iter=1000,tol=1e-3,penalty='l1'),\
    x,y,param_name='alpha',param_range=param_range,cv=10,scoring='r2',n_jobs=1)
train_scores_mean=np.mean(train_scores,axis=1)
train_scores_std=np.std(train_scores,axis=1)
test_scores_mean=np.mean(test_scores,axis=1)
test_scores_std=np.std(test_scores,axis=1)
plt.title("Validation Curve with SGDRegressor")
plt.xlabel("alpha")
plt.ylabel("Score")
plt.ylim(0.0,1.1)
plt.semilogx(param_range,train_scores_mean,label="Training score",color='r')
plt.fill_between(param_range,train_scores_mean-train_scores_std,train_scores_mean+train_scores_std,alpha=0.2,color='r')
plt.semilogx(param_range,test_scores_mean,label="Cross-validation score",color='g')
plt.fill_between(param_range,test_scores_mean-test_scores_std,test_scores_mean+test_scores_std,alpha=0.2,color='g')
plt.legend(loc="best")
plt.show()