本篇为总结篇,包含赛题全部的处理代码。
目录
1 模型优化
1.1 模型融合技术
模型融合即先产生一组个体学习器,再用某种策略将它们结合起来,以加强模型效果。理论上,在个体分类器相互独立的理想假设下,随着集成中分类器数目的增加,集成学习器的错误率会呈指数级下降,最终趋于零。
综合个体学习器的优势是能够降低预测误差,优化整体模型的性能。而且个体学习器的准确性越高,多样性越大,模型融合的提升效果越好。
模型融合技术可以分为两类:
(1)个体学习器间不存在强依赖关系、可同时生成的并行化方法,如Bagging和随机森林;
(2)个体学习器间存在强依赖关系、必须串行生成的序列化方法,如Boosting。
1 Bagging和随机森林
Bagging方法采用的是自助采样法(Bootstrap sampling),即对于m个样本的原始训练集,每次先随机采集一个样本放入采样集,接着把该样本放回,也就是说下次采样时该样本仍有可能被采集到,这样采集m次,最终可以得到m个样本的采样集。由于是随机采样,因此每次的采样集和原始的训练集不同,和其他采样集也不同,这样就可以得到多个不同的弱学习器。随机森林是对Bagging方法的改进,其改进之处有两点:一是基本学习器限定为决策树;二是除了在Bagging的样本上加上扰动,在属性上也加上扰动,相当于在决策树学习的过程中引入了随机属性选择。对基决策树的每个结点,先从该结点的属性集合中随机选择一个包含k个属性的子集,然后从这个子集中选择一个最优属性用于划分。
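下面给出一个极简的示意代码(基于sklearn,数据为随机生成,模型与参数均为示例假设,并非赛题代码),用于对比Bagging与随机森林:Bagging对样本做自助采样,随机森林则通过max_features在每个结点上再引入随机属性选择:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)
# Bagging:自助采样训练多棵决策树,预测时取平均
bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=0)
# 随机森林:在Bagging基础上,每个结点只从随机抽取的属性子集(max_features)中选最优划分属性
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', random_state=0)
for name, model in [('Bagging', bagging), ('RandomForest', rf)]:
    score = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(name, 'CV MSE:', -score.mean())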
2 Boosting
Boosting方法中著名的算法有AdaBoost算法和提升树(Boosting Tree)系列算法。在提升树系列算法中,应用最广泛的是梯度提升树(Gradient Boosting Tree),下面逐一简要介绍。
(1)AdaBoost算法:是加法模型、损失函数为指数函数、学习算法为前向分布算法时的二分类算法。
(2)提升树:是加法模型、学习算法为前向分布算法时的算法,基本学习器限定为决策树。对于二分类问题,损失函数为指数函数,此时就是把AdaBoost算法中的基本学习器限定为二叉决策树;对于回归问题,损失函数为平方误差,此时拟合的是当前模型的残差。
(3)梯度提升树:是对提升树算法的改进。提升树算法只适用于损失函数为指数函数或平方误差的情形,而对于一般的损失函数,梯度提升树算法用损失函数的负梯度在当前模型下的取值作为残差的近似值(下面给出一个"拟合残差"的极简示意)。
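为直观理解"每一轮用新的基学习器去拟合负梯度(平方损失下即残差)"这一思想,下面给出一个手写的极简示意(数据随机生成,轮数、学习率、树深均为示例假设,仅作演示):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1) * 6
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

pred = np.full_like(y, y.mean())      # 初始模型:用常数(均值)作为预测
learning_rate = 0.5
for step in range(3):
    residual = y - pred               # 平方损失下,负梯度恰好等于残差
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)  # 用新的树拟合残差
    pred += learning_rate * tree.predict(X)
    print('round', step + 1, 'train MSE:', np.mean((y - pred) ** 2))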
1.2 预测结果融合策略
1 投票机制(Voting)
(1)硬投票:对多个模型的预测结果直接进行投票,投票数最多的类即为最终预测的类别。
(2)软投票:和硬投票原理相同,但增加了设置权重的功能,可以为不同模型设置不同的权重,进而区分模型不同的重要程度。
2 软投票示例
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions
clf1=LogisticRegression(random_state=0,solver='lbfgs',multi_class='auto')
clf2=RandomForestClassifier(random_state=0,n_estimators=100)
clf3=SVC(random_state=0,probability=True,gamma='auto')
eclf=EnsembleVoteClassifier(clfs=[clf1,clf2,clf3],weights=[2,1,1],voting='soft')
x,y=iris_data()
x=x[:,[0,2]]
gs=gridspec.GridSpec(1,4)
fig=plt.figure(figsize=(16,4))
for clf, lab, grd in zip([clf1, clf2, clf3, eclf],
                         ['Logistic Regression', 'RandomForest', 'RBF kernel SVM', 'Ensemble'],
                         itertools.product([0, 1], repeat=2)):
    clf.fit(x, y)
    ax = plt.subplot(gs[0, grd[0] * 2 + grd[1]])
    fig = plot_decision_regions(X=x, y=y, clf=clf, legend=2)
    plt.title(lab)
plt.show()
3 Averaging和Ranking
Averaging的原理是将模型结果的平均值作为最终的预测值,也可以使用加权平均的方法。但其也存在问题:如果不同回归方法预测结果的波动幅度相差比较大,那么波动小的回归结果在融合时起的作用就比较小。
Ranking的思想和Averaging的一致,只是由于上述平均法存在一定的问题,这里采用了对排名做平均的方法。如果有权重,则求出n个模型的加权排名之和,即为最后的结果(见下方示意)。
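下面用一个极简的示意对比Averaging与Ranking(预测值与权重均为编造的示例数据,仅演示写法):
import numpy as np
from scipy.stats import rankdata

# 三个模型对同一批样本的预测值(编造的示例数据)
pred1 = np.array([0.2, 0.8, 0.4, 0.9])
pred2 = np.array([0.3, 0.7, 0.5, 0.8])
pred3 = np.array([0.1, 0.9, 0.3, 0.7])
preds = np.vstack([pred1, pred2, pred3])
weights = np.array([0.5, 0.3, 0.2])   # 假设的模型权重

mean_blend = preds.mean(axis=0)                              # 简单平均
weighted_blend = np.average(preds, axis=0, weights=weights)  # 加权平均
# Ranking:先把每个模型的预测转成排名再做(加权)平均,避免各模型预测值波动幅度不同带来的影响
ranks = np.vstack([rankdata(p) for p in preds])
rank_blend = np.average(ranks, axis=0, weights=weights)
print(mean_blend, weighted_blend, rank_blend)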
4 Blending
Blending 是把原始的训练集先分成两部分,如70%的数据作为新的训练集,剩下30%的数据作为测试集。
在第一层中,用70%的数据训练多个模型,然后预测剩余30%数据的label。在第二层中,直接把第一层模型对这30%数据的预测结果作为新特征继续训练即可。
Blending的优点:Blending比Stacking简单(不用进行k次交叉验证来获得stacker feature),也避开了一些信息泄露问题,因为generalizers和stacker使用了不一样的数据集。
Blending的缺点:
(1)使用的数据较少(第二阶段的blender只使用了训练集的一小部分,如上例中的30%)。
(2)blender可能会过拟合。
说明:对于实践中的结果而言,Stacking和Blending的效果差不多。
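下面是Blending流程的一个示意实现(基于sklearn与随机生成的数据,基模型与blender均为示例假设,并非赛题最终方案):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# 把训练集再拆成70%/30%
X_t1, X_t2, y_t1, y_t2 = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

# 第一层:用70%的数据训练多个基模型
base_models = [RandomForestRegressor(n_estimators=100, random_state=0),
               GradientBoostingRegressor(random_state=0)]
for m in base_models:
    m.fit(X_t1, y_t1)

# 第二层:用基模型在30%数据上的预测值作为新特征训练blender
blend_train = np.column_stack([m.predict(X_t2) for m in base_models])
blender = LinearRegression().fit(blend_train, y_t2)

# 预测:测试集同样先经过基模型得到新特征,再由blender给出最终结果
blend_test = np.column_stack([m.predict(X_test) for m in base_models])
print('blending MSE:', mean_squared_error(y_test, blender.predict(blend_test)))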
5 Stacking
Stacking的基本原理是用训练好的所有基模型对训练集进行预测,将第j个基模型对第i个训练样本的预测值作为新训练集中第i个样本的第j个特征值,最后基于新的训练集进行训练。同理,预测的过程也要先经过所有基模型的预测形成新的测试集,最后再对新的测试集进行预测。
Stacking是一种分层模型集成框架。以两层为例:第一层由多个基学习器组成,其输入为原始训练集;第二层的模型则是以第一层基学习器的输出作为训练集进行训练,从而得到完整的 Stacking 模型。Stacking 两层模型都使用了全部的训练集数据。
下面举例进一步说明:
(1)有训练集和测试集两组数据,并将训练集分成5份:train1,train2,train3,train4,train5。
(2)选定基模型。这里假定我们选择了xgboost、lightgbm、randomforest作为基模型。比如xgboost模型部分,依次用train1,train2,train3,train4,train5作为验证集,其余4份作为训练集,进行5折交叉验证的模型训练,再在测试集上进行预测。这样会得到在训练集上由xgboost模型训练出来的5份predictions和在测试集上的1份预测值B1,然后将这5份predictions纵向重叠合并起来得到A1。lightgbm模型和randomforest模型部分同理。
(3)在三个基模型训练完毕后,将三个模型在训练集上的预测值分别作为3个"特征"A1,A2,A3,然后使用这3个特征训练第二层的LR模型。
(4)在三个基模型对测试集预测得到的"特征"值(B1,B2,B3)的基础上,使用训练好的LR模型进行预测,得出最终的预测类别或概率。
说明:在做Stacking的过程中,如果将第一层模型的预测值和原始特征合并加入第二层模型的训练中,则可以使模型的效果更好,还可以防止模型的过拟合。
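下面给出上述流程的一个简化示意(用sklearn自带模型代替文中的xgboost/lightgbm/randomforest,并用cross_val_predict一次性得到5折的out-of-fold预测;测试集预测这里简化为由全量训练集重新拟合的基模型给出,与文中"5份预测取平均"略有差别,数据也只用iris示例):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

base_models = [RandomForestClassifier(n_estimators=100, random_state=0),
               GradientBoostingClassifier(random_state=0),
               ExtraTreesClassifier(n_estimators=100, random_state=0)]

# 步骤(2):每个基模型做5折交叉验证,得到训练集上的out-of-fold预测A1/A2/A3;
# 测试集预测B1/B2/B3这里简化为用全量训练集重新拟合后的基模型给出
train_meta = np.column_stack([cross_val_predict(m, X_train, y_train, cv=5) for m in base_models])
test_meta = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in base_models])

# 步骤(3)(4):用A1/A2/A3训练LR模型,再在B1/B2/B3上预测最终类别
lr = LogisticRegression(max_iter=1000).fit(train_meta, y_train)
print('stacking acc:', accuracy_score(y_test, lr.predict(test_meta)))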
2 赛题模型融合
导入库:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split #切分数据
import warnings
warnings.filterwarnings("ignore")
前5篇为本篇打好基础,在前面的基础上处理好数据(可视化部分可以注释掉):
train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data=pd.read_csv(train_data_file,sep='\t',encoding='utf-8')
test_data=pd.read_csv(test_data_file,sep='\t',encoding='utf-8')
#合并训练数据和测试数据
train_data["oringin"]="train" #用于区分训练集
test_data["oringin"]="test"
data_all=pd.concat([train_data,test_data],axis=0,ignore_index=True)
#查看合并后的数据以及新增的列
print(data_all.columns)
# 探索数据分布
dist_cols=6
dist_rows=len(data_all.columns)
plt.figure(figsize=(2*dist_cols,2*dist_rows))
i=1
for col in data_all.columns[0:-2]:
    ax = plt.subplot(dist_rows, dist_cols, i)
    g = sns.kdeplot(data_all[col][(data_all["oringin"] == "train")], color="Red", shade=True)
    g = sns.kdeplot(data_all[col][(data_all["oringin"] == "test")], color="Blue", shade=True)
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    ax = ax.legend(["train", "test"])
    i += 1
plt.show()
#数据清洗
data_all.drop(["V5","V9","V11","V17","V22","V28"],axis=1,inplace=True)
print(data_all.columns)
#特征可视化
data_train1 = data_all[data_all["oringin"] == "train"].drop("oringin", axis=1)
fcols = 6
frows = len(train_data.columns)
plt.figure(figsize=(2 * fcols, 2* frows))
i = 0
for col in data_train1.columns:
    i += 1
    ax = plt.subplot(frows, fcols, i)
    sns.regplot(x=col, y='target', data=train_data, ax=ax,
                scatter_kws={'marker': '.', 's': 3, 'alpha': 0.3},
                line_kws={'color': 'k'})
    plt.xlabel(col)
    plt.ylabel('target')
    i += 1
    ax = plt.subplot(frows, fcols, i)
    sns.distplot(train_data[col].dropna(), fit=stats.norm)
    plt.xlabel(col)
plt.show()
# 找出相关程度
plt.figure(figsize=(20, 16)) # 指定绘图对象宽度和高度
colnm = data_train1.columns.tolist() # 列表头
mcorr = data_train1[colnm].corr(method="spearman") # 相关系数矩阵,即给出了任意两个变量之间的相关系数
mask = np.zeros_like(mcorr, dtype=bool)  # 构造与mcorr同维数的矩阵,为bool型
mask[np.triu_indices_from(mask)] = True  # 对角线右上侧为True
cmap = sns.diverging_palette(220, 10, as_cmap=True) # 返回matplotlib colormap对象
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f') # 热力图(看两两相似度)
plt.show()
#删除小于0.1相关性系数的特征
threshold = 0.1
corr_matrix = data_train1.corr().abs()
drop_col=corr_matrix[corr_matrix["target"]<threshold].index
data_all.drop(drop_col,axis=1,inplace=True)
print(data_all.columns)
#归一化
cols_numeric=list(data_all.columns)
cols_numeric.remove("oringin")
def scale_minmax(col):
    return (col - col.min()) / (col.max() - col.min())
scale_cols = [col for col in cols_numeric if col!='target']
data_all[scale_cols] = data_all[scale_cols].apply(scale_minmax,axis=0)
print(data_all[scale_cols].describe())
#boxcox变换
fcols=6
frows=len(cols_numeric)-1
plt.figure(figsize=(2*fcols,2*frows))
i=0
for var in cols_numeric:
    if var != 'target':
        dat = data_all[[var, 'target']].dropna()
        i += 1
        plt.subplot(frows, fcols, i)
        sns.distplot(dat[var], fit=stats.norm)
        plt.title(var + ' Original')
        plt.xlabel('')
        i += 1
        plt.subplot(frows, fcols, i)
        _ = stats.probplot(dat[var], plot=plt)
        plt.title('skew=' + '{:.4f}'.format(stats.skew(dat[var])))
        plt.xlabel('')
        plt.ylabel('')
        i += 1
        plt.subplot(frows, fcols, i)
        plt.plot(dat[var], dat['target'], '.', alpha=0.5)
        plt.title('corr=' + '{:.2f}'.format(np.corrcoef(dat[var], dat['target'])[0][1]))
        i += 1
        plt.subplot(frows, fcols, i)
        trans_var, lambda_var = stats.boxcox(dat[var].dropna() + 1)
        trans_var = scale_minmax(trans_var)
        sns.distplot(trans_var, fit=stats.norm)
        plt.title(var + ' Transformed')
        plt.xlabel('')
        i += 1
        plt.subplot(frows, fcols, i)
        stats.probplot(trans_var, plot=plt)
        plt.title('skew=' + '{:.4f}'.format(stats.skew(trans_var)))
        plt.xlabel('')
        plt.ylabel('')
        i += 1
        plt.subplot(frows, fcols, i)
        plt.plot(trans_var, dat['target'], '.', alpha=0.5)
        plt.title('corr=' + '{:.2f}'.format(np.corrcoef(trans_var, dat['target'])[0][1]))
plt.show()
cols_transform = data_all.columns[0:-2]
for col in cols_transform:
    data_all.loc[:, col], _ = stats.boxcox(data_all.loc[:, col] + 1)
#目标值处理
print(data_all.target.describe())
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
sns.distplot(data_all.target.dropna(), fit=stats.norm)
plt.subplot(1,2,2)
_=stats.probplot(data_all.target.dropna(), plot=plt)
plt.show()
#对target目标值做指数变换(取1.5的幂),提升正态性
sp = train_data.target
train_data['target1'] = np.power(1.5, sp)
print(train_data.target1.describe())
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
sns.distplot(train_data.target1.dropna(),fit=stats.norm);
plt.subplot(1,2,2)
_=stats.probplot(train_data.target1.dropna(), plot=plt)
plt.show()
获取训练数据和测试数据
#使用简单交叉验证方法对模型验证,划分训练数据集为70%,验证数据集30%
def get_training_data():
    df_train = data_all[data_all["oringin"] == "train"]
    df_train["label"] = train_data.target1
    y = df_train.target
    X = df_train.drop(["oringin", "target", "label"], axis=1)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=100)
    return X_train, X_valid, y_train, y_valid

def get_test_data():
    df_test = data_all[data_all["oringin"] == "test"].reset_index(drop=True)
    return df_test.drop(["oringin", "target"], axis=1)
模型评价函数
def rmse(y_true, y_pred):
    diff = y_pred - y_true
    sum_sq = sum(diff ** 2)
    n = len(y_pred)
    return np.sqrt(sum_sq / n)

def mse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred)
rmse_scorer = make_scorer(rmse, greater_is_better=False)
mse_scorer = make_scorer(mse, greater_is_better=False)
#异常值过滤
# function to detect outliers based on the predictions of a model
def find_outliers(model, X, y, sigma=3):
    # predict y values using model
    model.fit(X, y)
    y_pred = pd.Series(model.predict(X), index=y.index)
    # calculate residuals between the model prediction and true y values
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    # calculate z statistic, define outliers to be where |z|>sigma
    z = (resid - mean_resid) / std_resid
    outliers = z[abs(z) > sigma].index
    # print and plot the results
    print('R2=', model.score(X, y))
    print('rmse=', rmse(y, y_pred))
    print("mse=", mean_squared_error(y, y_pred))
    print('---------------------------------------')
    print('mean of residuals:', mean_resid)
    print('std of residuals:', std_resid)
    print('---------------------------------------')
    print(len(outliers), 'outliers:')
    print(outliers.tolist())
    plt.figure(figsize=(15, 5))
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y, y_pred, '.')
    plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('y')
    plt.ylabel('y_pred')
    ax_132 = plt.subplot(1, 3, 2)
    plt.plot(y, y - y_pred, '.')
    plt.plot(y.loc[outliers], y.loc[outliers] - y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('y')
    plt.ylabel('y - y_pred')
    ax_133 = plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
    z.loc[outliers].plot.hist(color='r', bins=50, ax=ax_133)
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('z')
    plt.savefig('outliers.png')
    return outliers
# get training data
X_train, X_valid,y_train,y_valid = get_training_data()
test=get_test_data()
# find and remove outliers using a Ridge model
outliers = find_outliers(Ridge(), X_train, y_train)
X_outliers=X_train.loc[outliers]
y_outliers=y_train.loc[outliers]
X_t=X_train.drop(outliers)
y_t=y_train.drop(outliers)
#定义方法获取去除异常值的训练数据,深copy
def get_trainning_data_omitoutliers():
    y = y_t.copy()
    X = X_t.copy()
    return X, y
采用网格搜索训练模型(后面训练时调用)
def train_model(model, param_grid=[], X=[], y=[], splits=5, repeats=5):
    # 获取数据
    if len(y) == 0:
        X, y = get_trainning_data_omitoutliers()
    # 交叉验证
    rkfold = RepeatedKFold(n_splits=splits, n_repeats=repeats)
    # 网格搜索最佳参数
    if len(param_grid) > 0:
        gsearch = GridSearchCV(model, param_grid, cv=rkfold,
                               scoring="neg_mean_squared_error",
                               verbose=1, return_train_score=True)
        # 训练
        gsearch.fit(X, y)
        # 最好的模型
        model = gsearch.best_estimator_
        best_idx = gsearch.best_index_
        # 获取交叉验证评价指标
        grid_results = pd.DataFrame(gsearch.cv_results_)
        cv_mean = abs(grid_results.loc[best_idx, 'mean_test_score'])
        cv_std = grid_results.loc[best_idx, 'std_test_score']
    # 没有网格搜索
    else:
        grid_results = []
        cv_results = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=rkfold)
        cv_mean = abs(np.mean(cv_results))
        cv_std = np.std(cv_results)
        model.fit(X, y)  # cross_val_score不会在全量数据上拟合模型,这里补一次训练,供后续predict使用
    # 合并数据
    cv_score = pd.Series({'mean': cv_mean, 'std': cv_std})
    # 预测
    y_pred = model.predict(X)
    # 模型性能的统计数据
    print('----------------------')
    print(model)
    print('----------------------')
    print('score=', model.score(X, y))
    print('rmse=', rmse(y, y_pred))
    print('mse=', mse(y, y_pred))
    print('cross_val: mean=', cv_mean, ', std=', cv_std)
    # 残差分析与可视化
    y_pred = pd.Series(y_pred, index=y.index)
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    z = (resid - mean_resid) / std_resid
    n_outliers = sum(abs(z) > 3)
    outliers = z[abs(z) > 3].index
    plt.figure(figsize=(15, 5))
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y, y_pred, '.')
    plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.xlabel('y')
    plt.ylabel('y_pred')
    plt.title('corr = {:.3f}'.format(np.corrcoef(y, y_pred)[0][1]))
    ax_132 = plt.subplot(1, 3, 2)
    plt.plot(y, y - y_pred, '.')
    plt.plot(y.loc[outliers], y.loc[outliers] - y_pred.loc[outliers], 'ro')
    plt.xlabel('y')
    plt.ylabel('y - y_pred')
    plt.title('std resid = {:.3f}'.format(std_resid))
    ax_133 = plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
    z.loc[outliers].plot.hist(color='r', bins=50, ax=ax_133)
    plt.xlabel('z')
    plt.title('{:.0f} samples with z>3'.format(n_outliers))
    plt.show()
    return model, cv_score, grid_results
定义训练变量存储数据
opt_models = dict()
score_models = pd.DataFrame(columns=['mean','std'])
# no. k-fold splits
splits=5
# no. k-fold iterations
repeats=5
2.1 单一模型预测效果
(可以挑下面几种进行模型组合,没必要全用,需要时自行解除注释)
#1 岭回归
def ridge(score_models):
    from sklearn.linear_model import Ridge
    model = 'Ridge'
    opt_models[model] = Ridge()
    alph_range = np.arange(0.25, 6, 0.25)
    param_grid = {'alpha': alph_range}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=repeats)
    cv_score.name = model
    score_models = score_models.append(cv_score)
    plt.figure()
    plt.errorbar(alph_range, abs(grid_results['mean_test_score']),
                 abs(grid_results['std_test_score']) / np.sqrt(splits * repeats))
    plt.xlabel('alpha')
    plt.ylabel('score')
# ridge(score_models)#调用
#2 Lasso回归
def lasso_model(score_models):
    from sklearn.linear_model import Lasso
    model = 'Lasso'
    opt_models[model] = Lasso()
    alph_range = np.arange(1e-4, 1e-3, 4e-5)
    param_grid = {'alpha': alph_range}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=repeats)
    cv_score.name = model
    score_models = score_models.append(cv_score)
    plt.figure()
    plt.errorbar(alph_range, abs(grid_results['mean_test_score']),
                 abs(grid_results['std_test_score']) / np.sqrt(splits * repeats))
    plt.xlabel('alpha')
    plt.ylabel('score')
    plt.show()
# lasso_model(score_models)
#3 ElasticNet回归
def elasticNet(score_models):
    from sklearn.linear_model import ElasticNet
    model = 'ElasticNet'
    opt_models[model] = ElasticNet()
    alpha_range = np.arange(1e-4, 1e-3, 1e-4)
    param_grid = {'alpha': alpha_range,
                  'l1_ratio': np.arange(0.1, 1.0, 0.1)}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=1)
    cv_score.name = model
    score_models = score_models.append(cv_score)
# elasticNet(score_models)
#4 SVR回归
def LinearSVR(score_models):
    from sklearn.svm import LinearSVR
    model = 'LinearSVR'
    opt_models[model] = LinearSVR()
    crange = np.arange(0.1, 1.0, 0.1)
    param_grid = {'C': crange, 'max_iter': [1000]}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=repeats)
    cv_score.name = model
    score_models = score_models.append(cv_score)
    plt.figure()
    plt.errorbar(crange, abs(grid_results['mean_test_score']),
                 abs(grid_results['std_test_score']) / np.sqrt(splits * repeats))
    plt.xlabel('C')
    plt.ylabel('score')
    plt.show()
# LinearSVR(score_models)
#5 K近邻
def KNeighbors(score_models):
    from sklearn.neighbors import KNeighborsRegressor
    model = 'KNeighbors'
    opt_models[model] = KNeighborsRegressor()
    param_grid = {'n_neighbors': np.arange(3, 11, 1)}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=1)
    cv_score.name = model
    score_models = score_models.append(cv_score)
    plt.figure()
    plt.errorbar(np.arange(3, 11, 1), abs(grid_results['mean_test_score']),
                 abs(grid_results['std_test_score']) / np.sqrt(splits * 1))
    plt.xlabel('n_neighbors')
    plt.ylabel('score')
    plt.show()
# KNeighbors(score_models)
2.2 模型融合Boosting方法
#1 GBDT模型
def GradientBoosting(score_models):
    from sklearn.ensemble import GradientBoostingRegressor
    model = 'GradientBoosting'
    opt_models[model] = GradientBoostingRegressor()
    param_grid = {'n_estimators': [150, 250, 300], 'max_depth': [1, 2, 3], 'min_samples_split': [5, 6, 7]}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=1)
    cv_score.name = model
    score_models = score_models.append(cv_score)
# GradientBoosting(score_models)
#2 XGB模型
def XGB(score_models):
    from xgboost.sklearn import XGBRegressor
    model = 'XGB'
    opt_models[model] = XGBRegressor(objective='reg:squarederror')
    param_grid = {'n_estimators': [100, 200, 300, 400, 500], 'max_depth': [1, 2, 3]}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=splits, repeats=1)
    cv_score.name = model
    score_models = score_models.append(cv_score)
# XGB(score_models)
#3 随机森林模型
def RandomForest(score_models):
    from sklearn.ensemble import RandomForestRegressor
    model = 'RandomForest'
    opt_models[model] = RandomForestRegressor()
    param_grid = {'n_estimators': [100, 150, 200], 'max_features': [8, 12, 16, 20, 24], 'min_samples_split': [2, 4, 6]}
    opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                            splits=5, repeats=1)
    cv_score.name = model
    score_models = score_models.append(cv_score)
# RandomForest(score_models)
2.3 多模型预测Bagging方法
def model_predict(test_data, test_y=[], stack=False):
    i = 0
    y_predict_total = np.zeros((test_data.shape[0],))
    for model in opt_models.keys():
        if model != "LinearSVR" and model != "KNeighbors":
            y_predict = opt_models[model].predict(test_data)
            y_predict_total += y_predict
            i += 1
        if len(test_y) > 0:
            print("{}_mse:".format(model), mean_squared_error(y_predict, test_y))
    y_predict_mean = np.round(y_predict_total / i, 6)
    if len(test_y) > 0:
        print("mean_mse:", mean_squared_error(y_predict_mean, test_y))
    else:
        y_predict_mean = pd.Series(y_predict_mean)
    return y_predict_mean
model_predict(X_valid,y_valid)
保存预测结果:
y_ = model_predict(test)
y_.to_csv('./predict.txt',header = None,index = False)
2.4 多模型融合Stacking方法
参看《kaggle—HousePrice房价预测项目实战》后面的stacking方法。
本次使用Stacking方法时,模型在验证集上的分数很好,但在测试集预测上效果不佳,出现了过拟合;相比之下,神经网络的效果更明显一些。最后使用了单一模型的组合,经过调参,最终模型分数达到0.1157,目前能排到前三百名。
由于最终排名还没有公布,参考已经出现的排名,大概在280名左右。
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone
from sklearn.model_selection import KFold

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
        # Now train the cloned meta-model using the out-of-fold predictions as new features
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    # Do the predictions of all base models on the test data and use the averaged predictions as
    # meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)
简单模型融合
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # 遍历所有模型,分别训练克隆出来的模型
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        for model in self.models_:
            model.fit(X, y)
        return self

    # 预估,并对预估结果值做average
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
def gsearch(model, param_grid, X, y, scoring='neg_mean_squared_error', splits=5, repeats=1, n_jobs=-1):
    # p次k折交叉验证
    from sklearn.model_selection import RepeatedKFold
    from sklearn.model_selection import GridSearchCV
    rkfold = RepeatedKFold(n_splits=splits, n_repeats=repeats, random_state=0)
    model_gs = GridSearchCV(model, param_grid=param_grid, scoring=scoring, cv=rkfold, verbose=1, n_jobs=n_jobs)
    model_gs.fit(X, y)
    print('参数最佳取值: {0}'.format(model_gs.best_params_))
    print('最小均方误差: {0}'.format(abs(model_gs.best_score_)))
    return model_gs
def rmsle_cv(model=None, X_train_head=None, y_train=None, seed=42):
    n_folds = 5
    kf = KFold(n_folds, shuffle=True, random_state=seed)
    # 返回每折的MSE(对neg_mean_squared_error取负号还原)
    rmse = -cross_val_score(model, X_train_head, y_train, scoring="neg_mean_squared_error", cv=kf)
    return rmse
这里直接调用这些函数即可。
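例如,一种可能的调用示意(基模型与参数为随意选取的示例,并非赛题调参结果;假设前文得到的X_t、y_t以及上面定义的AveragingModels、StackingAveragedModels、rmsle_cv均已运行):
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import GradientBoostingRegressor

base_models = (Ridge(alpha=1.0), Lasso(alpha=0.0005), GradientBoostingRegressor())
meta_model = ElasticNet(alpha=0.001, l1_ratio=0.5)

# 简单平均融合
avg_model = AveragingModels(models=base_models)
print('Averaging CV MSE:', rmsle_cv(avg_model, X_t.values, y_t.values).mean())

# Stacking融合(类内部按位置索引数据,这里传入numpy数组)
stack_model = StackingAveragedModels(base_models=base_models, meta_model=meta_model)
print('Stacking CV MSE:', rmsle_cv(stack_model, X_t.values, y_t.values).mean())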
除了平均组合,也可以加权组合,就像房价预测中的那样。
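加权组合的一个简单示意(预测值与权重均为假设的示例,实际可根据验证集误差或经验设定,一般使权重之和为1):
import numpy as np

# 假设pred_a、pred_b、pred_c是几个模型(或融合方案)在测试集上的预测值,这里用随机数代替
pred_a = np.random.rand(100)
pred_b = np.random.rand(100)
pred_c = np.random.rand(100)

weights = [0.5, 0.3, 0.2]  # 假设的权重
final_pred = weights[0] * pred_a + weights[1] * pred_b + weights[2] * pred_c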