模型介绍

集成（Ensemble）分类模型便是综合考量多个分类器的预测结果，从而做出决策。
这种“综合考量”的方式大体分为两种：
一种是利用相同的训练数据同时搭建多个独立的分类模型，然后通过投票的方式，以少数服从多数的原则作出最终的分类决策。比较具有代表性的模型为随机森林分类器（Random Forest Classifier）,即在相同训练数据上同时搭建多棵决策树（Decision Tree)。然而，一株标准的决策树会根据每维特征对预测结果的影响程度进行排序，进而决定不同特征从上至下构建分裂节点的顺序；如此一来，所有在随机森林中的决策树都会受到这一策略影响而构建得完全一致，从而丧失了多样性。因此，随机森林分类器在构建的过程中，每一棵决策数都会放弃这一固定的排序算法，转而随机选取特征。
另一种则是按照一定次序搭建多个分类模型。这些模型之间彼此存在依赖关系。一般而言，每一个后续模型的加入都需要对现有集成模型的综合性能有所贡献，进而不断提升更新过后的集成模型的性能，并最终期望借助整合多个分类能力较弱的分类器，搭建出具有更强分类能力的模型。比较具有代表性的当属梯度提升决策树（Gradient Tree Boosting)。与构建随机森林分类器模型不同，这里每一棵决策树在生成的过程中都会尽可能降低整体集成模型在训练集上的拟合误差。

数据描述

使用泰坦尼克号的乘客数据。

代码实战

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
X = titanic[['pclass', 'age', 'sex']]
y = titanic['survived']
X['age'].fillna(X['age'].mean(), inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y ,test_size=0.25, random_state=33)
#对类别型特征进行转化，成为特征向量。
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='record'))
X_test = vec.transform(X_test.to_dict(orient='record'))
#使用单一决策树进行模型训练以及预测分析。
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
dtc_y_pred = dtc.predict(X_test)

#使用随机森林分类器进行集成模型的训练以及预测分析
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_y_pred = rfc.predict(X_test)

#使用梯度提升决策树进行集成模型的训练以及预测分析。
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_y_pred = gbc.predict(X_test)

#输出单一决策树在测试集上的分类准确性，以及更加详细的精确率、召回率、F1指标
print("The accuracy of decision tree is", dtc.score(X_test, y_test))
print(classification_report(dtc_y_pred, y_test))

#输出随机森林分类器在测试集上的分类准确性，以及更加详细的精确率、召回率、F1指标
print("The accuracy of random forest classifier is", rfc.score(X_test, y_test))
print(classification_report(rfc_y_pred, y_test))

#输出梯度提升决策树在测试集上的分类准确性，以及更加详细的精确率、召回率、F1指标。
print("The accuracy of gradient tree boosting is", gbc.score(X_test, y_test))
print(classification_report(gbc_y_pred, y_test))
#out[]  The accuracy of decision tree is 0.7811550151975684
        #              precision    recall  f1-score   support
        # 
        #           0       0.91      0.78      0.84       236
        #           1       0.58      0.80      0.67        93
        # 
        # avg / total       0.81      0.78      0.79       329
        # 
        # The accuracy of random forest classifier is 0.78419452887538
        #              precision    recall  f1-score   support
        # 
        #           0       0.91      0.78      0.84       235
        #           1       0.59      0.80      0.68        94
        # 
        # avg / total       0.82      0.78      0.79       329
        # 
        # The accuracy of gradient tree boosting is 0.790273556231003
        #              precision    recall  f1-score   support
        # 
        #           0       0.92      0.78      0.84       239
        #           1       0.58      0.82      0.68        90
        # 
        # avg / total       0.83      0.79      0.80       329
        # 
        # 
        # Process finished with exit code 0

输出表明：在相同的训练和测试数据条件下，仅仅使用模型的默认配置，梯度上升决策树具有最佳的预测性能，其次是随机森林分类器，最后是单一决策树。
一般而言，工业界为了追求更加强劲的预测性能，经常使用随机森林模型作为基线系统。

特点分析

集成模型可以说是实战应用中最为常见的。相比于其他单一的学习模型，集成模型可以整合多种模型，或者多次就一种类型的模型进行建模。由于模型估计参数的过程也同样受到概率的影响，具有一定的不确定性；因此，集成模型虽然在训练过程中要耗费更多的时间，但是得到的综合模型往往具有更高的表现性能和更好的稳定性。

机器学习8-分类学习-集成模型

模型介绍

数据描述

代码实战

特点分析

猜你喜欢