House Price Prediction: Advanced Regression Techniques with Gradient Boosted Trees and Random Forests

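This is a recent regression competition on Kaggle. The official dataset has 1491 samples and 79 features, so we need to do some feature processing and then pick suitable models for the prediction. This post uses two advanced regression techniques, random forests and gradient boosted trees, plus an ensemble of the two.
Competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques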

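Data processing

Because there are many features, we filter them first. A string column is dropped if a single value accounts for more than 95% of its rows, and any column is dropped if more than half of its values are NaN. The Id column and FireplaceQu are dropped by hand. The remaining string columns are encoded as integers 0, 1, 2, 3, and so on, and the remaining missing values are then filled with 0. The data processing code is as follows: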

import numpy as np
import pandas as pd

houseprice_data = pd.read_csv("./data/train.csv")  # load the training data
datalength = len(houseprice_data)  # number of rows

# Data processing

houseprice_col = list(houseprice_data.columns)  # column names
print(houseprice_data.columns)
delcol_list = []  # names of the columns to drop

for col in houseprice_col:
    # String (non-numeric) column. Checking the dtype instead of the type of
    # the first value also works when the first row is NaN.
    if not pd.api.types.is_numeric_dtype(houseprice_data[col]):
        # drop it if the most frequent value covers more than 95% of the rows
        if houseprice_data[col].value_counts().max() > datalength * 0.95:
            delcol_list.append(col)

for col in houseprice_col:  # drop columns in which more than half of the values are NaN
    if houseprice_data[col].count() < datalength / 2:
        delcol_list.append(col)
delcol_list.append('Id')  # the row id is not a feature
delcol_list.append('FireplaceQu')  # special case: also drop this column
houseprice_data_one = houseprice_data.drop(delcol_list, axis=1)  # drop everything in delcol_list

for col in list(houseprice_data_one.columns):  # walk over the remaining columns
    if not pd.api.types.is_numeric_dtype(houseprice_data_one[col]):
        # integer-encode the string columns
        houseprice_data_one[col] = pd.Categorical(houseprice_data_one[col]).codes

houseprice_data_one = houseprice_data_one.fillna(0)  # fill the remaining missing values with 0
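
To make the filtering rule easier to check, here is a minimal sketch that applies it to a small made-up table. The function name `filter_and_encode`, its parameters, and the toy columns are assumptions for illustration and are not part of the competition code. One detail it makes visible: `pd.Categorical(...).codes` encodes a missing string value as -1, so the later `fillna(0)` only affects numeric columns.

```python
import numpy as np
import pandas as pd

def filter_and_encode(df, dominant_frac=0.95, min_non_null_frac=0.5):
    """Hypothetical helper: drop near-constant string columns and
    mostly-empty columns, then integer-encode the remaining strings."""
    n = len(df)
    drop = []
    for col in df.columns:
        is_text = not pd.api.types.is_numeric_dtype(df[col])
        if is_text and df[col].value_counts().max() > n * dominant_frac:
            drop.append(col)  # one value dominates the column
        elif df[col].count() < n * min_non_null_frac:
            drop.append(col)  # more than half of the values are missing
    out = df.drop(columns=drop)
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = pd.Categorical(out[col]).codes  # NaN becomes -1
    return out.fillna(0)  # fills the numeric NaN values

# Made-up data with 20 rows
toy = pd.DataFrame({
    "Street": ["Pave"] * 20,                   # a single value everywhere
    "Alley": [None] * 15 + ["Grvl"] * 5,       # only 25% non-null
    "Zone": ["RL", "RM"] * 9 + [None, "RL"],   # ordinary string column
    "Area": [100.0] * 19 + [np.nan],           # numeric with one NaN
})

result = filter_and_encode(toy)
print(result.columns.tolist())
print(result.tail(3))
```

Street and Alley are removed, Zone is encoded with its missing entry as -1, and the missing Area becomes 0.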

Random forest

With the data processed, we first predict with a random forest:

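from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Train and predict

X = np.array(houseprice_data_one.drop('SalePrice',axis=1))
Y = np.array(houseprice_data_one['SalePrice'])

x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=1)  # split the dataset

clf = RandomForestRegressor()  # random forest with default settings
clf.fit(x_train,y_train)
# score() returns the coefficient of determination R^2
print("Training R^2:",clf.score(x_train,y_train))
print("Test R^2:",clf.score(x_test,y_test))


# Training R^2: 0.9673565610627388
# Test R^2: 0.9035892586277517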

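Cross-validation and normalization

So far we have split the dataset only once, so the result reflects a single partition of the data and is not enough to show that the model works. Next we run cross-validation and also normalize X: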

from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler

# Cross-validation
kf1 = KFold(n_splits=10,shuffle=False)  # 10-fold cross-validation
# Random forest
clfre = RandomForestRegressor(n_estimators=210,max_depth=30,max_features=30)

scaler = MinMaxScaler()  # min-max normalization
X_scaler = scaler.fit_transform(X)

# Cross-validation scores
retrain_scores = []
retest_scores = []
for retrain,retest in kf1.split(X_scaler,Y):
    clfre.fit(X_scaler[retrain],Y[retrain])
    retrain_scores.append(clfre.score(X_scaler[retrain],Y[retrain]))
    retest_scores.append(clfre.score(X_scaler[retest],Y[retest]))

# Scores
print("Cross-validation training score:",np.array(retrain_scores).mean())
print("Cross-validation test score:",np.array(retest_scores).mean())


# Cross-validation training score: 0.9818125682413038
# Cross-validation test score: 0.87427725293159

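With cross-validation, the best mean test score we reached is about 0.874. That is lower than the 0.904 from the single split, but it averages over ten held-out folds and is therefore a more reliable estimate. On the Kaggle platform this model also scored higher than the one evaluated on a single split, which suggests that it generalizes better.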
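
The loop above fits the scaler on all rows before splitting, so each test fold has already influenced the scaling. A common alternative, sketched here under stated assumptions, is to put the scaler and the model in one pipeline so that the scaler is refit on the training part of every fold. The synthetic data from `make_regression`, the smaller forest, and the shuffled 5-fold split are illustrative choices, not the settings used in this post.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the house price features (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

# Scaler and model are fitted together on the training part of each fold
model = make_pipeline(
    MinMaxScaler(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(model, X_demo, y_demo, cv=cv,
                         scoring="r2", return_train_score=True)

print("Mean training R^2:", results["train_score"].mean())
print("Mean test R^2:", results["test_score"].mean())
```

Tree-based models are insensitive to monotonic rescaling of individual features, so the normalization step mainly matters if the same pipeline is later reused with a scale-sensitive model.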

Gradient boosted trees

Next, the gradient boosted trees:

from sklearn.ensemble import GradientBoostingRegressor

train_scoresx = []
test_scoresx = []

kfx = KFold(n_splits=10,shuffle=False)  # 10-fold cross-validation
clfx = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=3, max_features='sqrt', loss='huber', random_state=42)  # gradient boosted trees

scalerx = MinMaxScaler()  # min-max normalization
X_scalerx = scalerx.fit_transform(X)

for train,test in kfx.split(X_scalerx,Y):
    clfx.fit(X_scalerx[train],Y[train])
    train_scoresx.append(clfx.score(X_scalerx[train],Y[train]))
    test_scoresx.append(clfx.score(X_scalerx[test],Y[test]))

# Scores
print("Cross-validation training score:",np.array(train_scoresx).mean())
print("Cross-validation test score:",np.array(test_scoresx).mean())


# Cross-validation training score: 0.994345303778351
# Cross-validation test score: 0.8911478084451863

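Model ensemble

In cross-validation the gradient boosted trees score clearly higher than the random forest, reaching 0.891. Next we ensemble the two models: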

# R^2 (coefficient of determination) of predictions against true values
def score(y_pred,y_true):
    return (1 - ((y_pred - y_true)**2).sum() / ((y_true - y_true.mean())**2).sum())

# Ensemble of gradient boosted trees and random forest
test_scores = []


kfx = KFold(n_splits=10,shuffle=False)  # 10-fold cross-validation

clfx = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=3, max_features='sqrt', loss='huber', random_state=42)  # gradient boosted trees
clfr = RandomForestRegressor(n_estimators=210,max_depth=30,max_features=30)  # random forest

scalerx = MinMaxScaler()  # min-max normalization
X_scalerx = scalerx.fit_transform(X)

for train,test in kfx.split(X_scalerx,Y):
    clfx.fit(X_scalerx[train],Y[train])
    clfr.fit(X_scalerx[train],Y[train])
    # ensemble by averaging the predictions of the two models
    y_pred = (clfx.predict(X_scalerx[test])+clfr.predict(X_scalerx[test]))/2
    test_scores.append(score(y_pred,Y[test]))

# Scores
print("Ensemble score:",np.array(test_scores).mean())

# Ensemble score: 0.889
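
The same prediction averaging can also be written with scikit-learn's `VotingRegressor`, which fits a clone of each estimator and averages their predictions. The sketch below uses synthetic data and small models so that it runs quickly; those settings are assumptions for illustration, not the ones used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)

# Synthetic stand-in for the house price data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0)

# VotingRegressor clones both estimators, fits the clones, and averages them
voter = VotingRegressor([("gbr", gbr), ("rf", rf)]).fit(X_demo, y_demo)

# Manual averaging for comparison; the originals are still unfitted here
gbr.fit(X_demo, y_demo)
rf.fit(X_demo, y_demo)
manual = (gbr.predict(X_demo) + rf.predict(X_demo)) / 2

print("Same predictions:", np.allclose(voter.predict(X_demo), manual))
```

Because `VotingRegressor` is itself an estimator, it can be passed directly to cross-validation utilities instead of writing the fold loop by hand.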

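The ensemble's cross-validation score is slightly lower than that of the gradient boosted trees alone (0.889 versus 0.891), but its score on the Kaggle platform is higher, so the ensemble appears to be the more robust model. To compare the three models properly, use dedicated regression metrics rather than R^2 alone. Note that roc_auc_score is a classification metric and does not apply to this task. Regression options in sklearn.metrics include mean_squared_error and mean_absolute_error, and the competition itself ranks submissions by the RMSE between the logarithms of the predicted and the actual prices.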
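
A minimal sketch of scoring predictions with regression metrics from `sklearn.metrics`. The four prices are made up for illustration, and `manual_r2` is a hypothetical name for the hand-written R^2 formula used in the ensemble code; the sketch also checks that this formula agrees with `r2_score`.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score)

# Made-up sale prices and predictions (illustration only)
y_true = np.array([200000.0, 150000.0, 320000.0, 98000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 105000.0])

def manual_r2(y_pred, y_true):
    """Hand-written R^2: 1 - residual sum of squares / total sum of squares."""
    return 1 - ((y_pred - y_true) ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

r2 = r2_score(y_true, y_pred)                       # note the argument order
mae = mean_absolute_error(y_true, y_pred)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # error in price units
# RMSE on log prices, the form of metric the competition leaderboard uses
rmse_log = np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))

print("R^2:", r2, "matches manual formula:", np.isclose(r2, manual_r2(y_pred, y_true)))
print("MAE:", mae)
print("RMSE:", rmse)
print("RMSE of log prices:", rmse_log)
```

The log-based RMSE penalizes relative errors, so a 10,000 miss on a cheap house counts for more than the same miss on an expensive one.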

Data:
Link: https://pan.baidu.com/s/16Ej8Q5I8ep1g8QBuStN8dw Password: ix62