Importing the data

# Import pandas under the alias pd.
import pandas as pd
# Read the Titanic passenger records over the internet and store them in the variable titanic.
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')

# Hand-pick pclass, age, and sex as the features for predicting whether a passenger survived.
x = titanic[['pclass', 'age', 'sex']]
y = titanic['survived']

Data processing

# Fill missing ages with the mean age of all passengers, so the model can be trained while disturbing the prediction task as little as possible.
# Work on an explicit copy and assign the result back: calling fillna(inplace=True) directly on the selection triggers pandas' SettingWithCopyWarning.
x = x.copy()
x['age'] = x['age'].fillna(x['age'].mean())
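
As a small illustration of the mean-imputation step, the sketch below counts and fills missing ages in a toy DataFrame. The three-row frame and the names toy and filled are hypothetical stand-ins, not rows from the Titanic file.

```python
import pandas as pd

# Hypothetical toy data: one of the three ages is missing.
toy = pd.DataFrame({'age': [20.0, None, 40.0], 'sex': ['male', 'female', 'male']})

# Count missing values per column before imputing.
missing_before = toy.isnull().sum()

# Work on a copy and assign the filled column back instead of using inplace=True.
filled = toy.copy()
filled['age'] = filled['age'].fillna(filled['age'].mean())

print(missing_before['age'])   # number of missing ages
print(filled['age'].tolist())  # the gap is filled with the mean of the known ages
```

Assigning the filled column back to an explicit copy leaves the original frame untouched and makes it unambiguous which object is being modified.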

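# Split the original data, holding out 25% of the passengers for testing.
# train_test_split lives in sklearn.model_selection; the older sklearn.cross_validation module was deprecated in 0.18 and removed in 0.20.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)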
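
To see what test_size=0.25 means in rows, the arithmetic can be sketched as follows. It assumes the file has 1313 rows (the text does not state the row count) and that a fractional test_size is rounded up to a whole number of samples, which is how train_test_split behaves as far as we know. Under those assumptions the result agrees with the support total of 329 in the classification reports.

```python
import math

n_samples = 1313   # assumed row count of titanic.txt (not stated in the text)
test_size = 0.25

# A fractional test_size is rounded up to a whole number of test samples.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```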

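# Convert the categorical features into feature vectors.
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
# pandas expects orient='records'; recent versions reject the abbreviation 'record'.
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
x_test = vec.transform(x_test.to_dict(orient='records'))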
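
To make the vectorization step concrete, here is a minimal sketch of what DictVectorizer does with mixed string and numeric features. The two records and the name vec_demo are hypothetical; they only mimic the shape of the selected features.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy records mimicking the three selected features.
records = [
    {'pclass': '1st', 'age': 29.0, 'sex': 'female'},
    {'pclass': '3rd', 'age': 30.0, 'sex': 'male'},
]

vec_demo = DictVectorizer(sparse=False)
matrix = vec_demo.fit_transform(records)

# String-valued features become one indicator column per value;
# numeric features pass through unchanged.
print(vec_demo.feature_names_)
print(matrix)
```

Because the vectorizer is fitted on the training records only, the test records are mapped onto the same columns with transform rather than fit_transform.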

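Building the models

Using several metrics for evaluating classification performance, we compare a single decision tree (DecisionTreeClassifier), a random forest classifier (RandomForestClassifier), and gradient tree boosting (GradientBoostingClassifier) on the test set.

# Train a single decision tree and predict on the test set.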
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
dtc_y_pred= dtc.predict(x_test)

# Train a random forest classifier, an ensemble model, and predict on the test set.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
rfc_y_pred = rfc.predict(x_test)

# Train gradient-boosted decision trees, an ensemble model, and predict on the test set.
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)
gbc_y_pred = gbc.predict(x_test)
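
Tree-based models in scikit-learn involve randomness (for example, the bootstrap samples drawn by a random forest), so scores from default-configured models can vary slightly between runs. Below is a minimal sketch of pinning this with random_state. It uses synthetic data from make_classification rather than the Titanic features, and the seed 33 and the names rf_a and rf_b are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the Titanic features).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Two forests built with the same seed and the same data.
rf_a = RandomForestClassifier(n_estimators=10, random_state=33).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=10, random_state=33).fit(X_demo, y_demo)

# With identical seeds the two models make identical predictions.
same = (rf_a.predict(X_demo) == rf_b.predict(X_demo)).all()
print(same)
```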

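Model evaluation

# Import classification_report from sklearn.metrics.
from sklearn.metrics import classification_report
# Note: the documented signature is classification_report(y_true, y_pred). The calls here pass the predictions first, so in the printed tables precision and recall are exchanged relative to the usual convention, and support counts predicted labels.
# Print the single decision tree's accuracy on the test set, plus the more detailed precision, recall, and F1 scores.
print('The accuracy of decision tree is', dtc.score(x_test, y_test))
print(classification_report(dtc_y_pred, y_test))
# Print the random forest classifier's accuracy on the test set, plus the more detailed precision, recall, and F1 scores.
print('The accuracy of random forest classifier is', rfc.score(x_test, y_test))
print(classification_report(rfc_y_pred, y_test))
# Print the gradient tree boosting accuracy on the test set, plus the more detailed precision, recall, and F1 scores.
print('The accuracy of gradient tree boosting is', gbc.score(x_test, y_test))
print(classification_report(gbc_y_pred, y_test))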

The accuracy of decision tree is 0.7811550151975684
             precision    recall  f1-score   support

          0       0.91      0.78      0.84       236
          1       0.58      0.80      0.67        93

avg / total       0.81      0.78      0.79       329

The accuracy of random forest classifier is 0.78419452887538
             precision    recall  f1-score   support

          0       0.90      0.78      0.84       233
          1       0.60      0.79      0.68        96

avg / total       0.81      0.78      0.79       329

The accuracy of gradient tree boosting is 0.790273556231003
             precision    recall  f1-score   support

          0       0.92      0.78      0.84       239
          1       0.58      0.82      0.68        90

avg / total       0.83      0.79      0.80       329

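The output shows that, on the same training and test data and using only each model's default configuration, gradient tree boosting gives the best predictive performance (accuracy 0.790), followed by the random forest classifier (0.784) and finally the single decision tree (0.781).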
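
One caveat when reading the reports: scikit-learn documents the signature as classification_report(y_true, y_pred), while the calls in this section pass the predictions first. The hand-computed sketch below, on hypothetical toy labels, shows that swapping the two arguments exchanges precision and recall for a class. The helper precision_recall is illustrative and not part of scikit-learn.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return tp / predicted_pos, tp / actual_pos

# Hypothetical toy labels: four true positives, of which the model finds two.
y_true_demo = [1, 1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 0]

print(precision_recall(y_true_demo, y_pred_demo))  # arguments in the documented order
print(precision_recall(y_pred_demo, y_true_demo))  # arguments swapped
```

Accuracy is unaffected by the order of the two arguments, so the ranking of the three models by accuracy does not depend on it.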