Notes on the Kaggle Titanic Practice Competition

Tags: Kaggle

Original post: http://blog.csdn.net/xuelabizp/article/details/52886054

Reference: https://www.kaggle.com/omarelgabry/titanic/a-journey-through-titanic/comments

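1. Introduction to the Titanic practice competition

Kaggle hosts several kinds of competitions: Featured, Research, Playground and 101, among others. Featured and Research competitions can carry prize money, while Playground and 101 are meant for practice. After you register a new Kaggle account, the site suggests the Titanic practice competition as a first exercise.

The task is to predict whether each passenger survived. The training set contains a number of passenger features (age, sex and so on) together with the survival outcome. You train a model on the training set and then use it to predict survival for the passengers in the test set. The original description reads: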

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

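2. Feature analysis and selection

The passenger features in the training set are PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked.

Going by common sense, PassengerId, Name, Ticket and Embarked are treated as irrelevant features, while Pclass, Sex, Age, SibSp, Parch, Fare and Cabin are treated as relevant.

Cabin ought to be a very important feature: the accident happened at night, when most people were asleep, so the distance from a cabin to the escape routes should bear directly on survival. However, the training set has too many missing values for this feature, so it is left out of this first pass.

The remaining features are analyzed next. Their guessed relevance, from high to low, is: Sex, Age, Fare, SibSp, Parch, Pclass.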

2.1 Sex

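In the film Titanic, the line "women and children first" leaves a strong impression, so let us start with the survival counts and survival rates of women and men.

The figure shows the number and proportion of survivors for each sex. Fewer than 20% of male passengers survived, against more than 70% of female passengers. The gap is large, so Sex is clearly a very important feature. In fact, simply predicting that every man in the test set died and every woman survived already gives 76.555% accuracy.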

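# imports used by the analysis functions in this section
import numpy as np
import matplotlib.pyplot as plt

def sex_analysis(trainDf):
    fig = plt.figure(figsize=(8 ,6))
    maleDf = trainDf[trainDf['Sex'] == 'male']
    femaleDf = trainDf[trainDf['Sex'] == 'female']
    # survival count for men and women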
    ax1 = fig.add_subplot(1, 2, 1)
    ax1.set_title('survival count of both sexes')
    ax1.set_xticks([0, 1])
    ax1.set_xticklabels(['male', 'female'])
    ax1.set_xlabel("Sex")
    ax1.set_ylabel("Survival count")
    ax1.grid()


    maleSurvived = maleDf['Survived'] == 1
    femaleSurvived = femaleDf['Survived'] == 1
    ax1.bar(0, maleSurvived.sum(), align="center", color='b', alpha = 0.5)
    ax1.bar(1, femaleSurvived.sum(), align="center", color='r', alpha = 0.5)
    # survival rate for men and women
    ax2 = fig.add_subplot(1, 2, 2)
    ax2.set_title('survival rate of both sexes')
    ax2.set_xticks([0, 1])
    ax2.set_xticklabels(['male', 'female'])
    ax2.set_xlabel("Sex")
    ax2.set_ylabel("Survival rate")
    ax2.grid()

    ax2.bar(0, float(maleSurvived.sum()) / len(maleDf.index), align="center", color='b', alpha = 0.5)
    ax2.bar(1, float(femaleSurvived.sum()) / len(femaleDf.index), align="center", color='r', alpha = 0.5)
    plt.show()
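
The counts and rates that the plot displays can also be read directly from a groupby, and the same few lines give the "all women survive, all men die" baseline mentioned above. This is a minimal sketch on a tiny made-up sample (not the real train.csv); the names `sample`, `by_sex` and `baseline_acc` are illustrative only.

```python
import pandas as pd

# Tiny made-up sample with the same column names as train.csv (not real data).
sample = pd.DataFrame({
    "Sex":      ["male", "male", "male", "male", "female", "female", "female", "male"],
    "Survived": [0,      0,      1,      0,      1,        1,        0,        0],
})

# Survivor count, group size and survival rate per sex in one pass.
by_sex = sample.groupby("Sex")["Survived"].agg(["sum", "count", "mean"])
print(by_sex)

# Sex-only baseline: predict that every woman survives and every man dies.
baseline_pred = (sample["Sex"] == "female").astype(int)
baseline_acc = (baseline_pred == sample["Survived"]).mean()
print("baseline accuracy: %.3f" % baseline_acc)
```

On the real training set the same two expressions would be applied to trainDf instead of the sample.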

2.2 Age

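In a disaster, children usually have a higher survival rate, because people give up their own chance of escape in favor of the children.

The figure shows a histogram of survivor ages and a histogram of survival rate by age, using 5-year bins: [0,5), [5,10), ..., [80,85). Children under 15 have a relatively high survival rate. Many middle-aged passengers survived in absolute terms, yet their survival rate is comparatively low.

The [80,85) bin shows a 100% survival rate, which is really an outlier: the bin contains a single passenger, so its rate is 100% if he survived and 0% if he died.

In summary, age is a relevant feature, and the data set should gain a Child feature derived from it: Child is set to 1 for young passengers and to 0 for everyone else. The analysis here points to a cutoff of about 15 years; the code in section 3.3 uses 15 or younger.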

def age_analysis(trainDf):
    fig = plt.figure(figsize=(10, 6))
    # Age has missing values; keep only the rows where it is present
    ageDf = trainDf[trainDf['Age'].notnull()].copy()
    # truncate ages to integers, then pick out the survivors
    ageDf['Age'] = ageDf['Age'].astype(int)
    survivedDf   = ageDf[ageDf['Survived'] == 1]

    # histogram of survivor counts by age
    survivedAge = []
    for i in range(1, 18):# all ages lie in [0, 80]; 5-year left-closed bins give 17 bins
        survivedAge.append( len(survivedDf[ (survivedDf['Age'] >= (i-1)*5) & (survivedDf['Age'] < i*5) ]) )

    ax1 = fig.add_subplot(1, 2, 1)
    ax1.set_title('age distribution of survivors')
    ax1.set_xticks(range(5, 85, 5))
    ax1.set_xlabel("Age")
    ax1.set_ylabel("Survival count")
    ax1.set_ylim(0, 45)
    ax1.grid()
    ax1.bar(range(5, 90, 5), survivedAge, width = 3, align='center', color = 'g', alpha = 0.5)

    # survival rate by age bin
    age = []
    for i in range(1, 18):# the same 17 bins, counted over all passengers with a known age
        age.append( len(ageDf[ (ageDf['Age'] >= (i-1)*5) & (ageDf['Age'] < i*5) ]) )

    ax2 = fig.add_subplot(1, 2, 2)
    ax2.set_title('age frequency distribution of survivors')
    ax2.set_xticks(range(5, 85, 5))
    ax2.set_xlabel("Age")
    ax2.set_ylabel("Survival rate")
    ax2.grid()
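    survivedAge = np.array(survivedAge)
    age = np.array(age, dtype=float)
    # a bin with no passengers gives 0/0; suppress the warning and leave it as NaN
    with np.errstate(invalid='ignore'):
        survivedRate = survivedAge / age
    ax2.bar(range(5, 90, 5), survivedRate, width = 3, align='center', color = 'r', alpha = 0.5)
    # average over the 13 bins in [0,65), leaving out the sparse [65,85) bins whose rates are 0% or 100%
    averRate = sum(survivedRate[0:13]) / 13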
    ax2.plot((0,87), (averRate, averRate), color='b', linewidth=2, alpha=0.5, label='average rate')
    ax2.legend(loc='best')
    plt.show()
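
The manual loop over 5-year intervals can be written more compactly with pandas binning. The following is only a sketch of that alternative on made-up ages (not the real data); `pd.cut` with `right=False` produces the same left-closed bins [0,5), [5,10), ... described above, and the last row shows how a bin with a single passenger ends up at a rate of exactly 0 or 1.

```python
import pandas as pd

# Made-up ages and outcomes, only to illustrate the binning (not real data).
sample = pd.DataFrame({
    "Age":      [2, 4, 9, 14, 22, 23, 31, 38, 47, 80],
    "Survived": [1, 1, 0, 1,  0,  1,  0,  1,  0,  1],
})

# Left-closed 5-year bins [0,5), [5,10), ..., [80,85), as in the text.
bins = list(range(0, 90, 5))
sample["AgeBin"] = pd.cut(sample["Age"], bins=bins, right=False)

# Passengers, survivors and survival rate per bin; empty bins are dropped.
per_bin = (sample.groupby("AgeBin", observed=True)["Survived"]
                 .agg(total="count", survived="sum", rate="mean"))
print(per_bin)
```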

2.3 Fare

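Generally, a higher fare goes with higher social status, and such passengers were also more likely to be rescued.

The figure above shows the fare distributions of survivors and victims. Because fares span such a wide range, the details are hard to see, so the x-axis is limited to [0, 200], with the result shown in the figure below:

This view makes it reasonably clear that passengers with expensive tickets survived at a higher rate, while a large number of low-fare passengers died. A likely explanation is that after the hull struck the iceberg the lower part of the ship flooded, and passengers on the lower decks could not get out.

Next, a bar chart of the average fare of survivors and victims:

The average fare is 47.99 for survivors and 21.69 for victims (computed on fares truncated to integers), so Fare is a relevant feature.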

def fare_analysis(trainDf):
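    # work on a copy so the caller's DataFrame is not modified
    trainDf = trainDf.copy()
    # truncate fares to integers, then split into survivors and victims
    trainDf['Fare'] = trainDf['Fare'].astype(int)
    survivedDf = trainDf[trainDf['Survived'] == 1]
    notSurvivedDf = trainDf[trainDf['Survived'] == 0]
    # count the passengers at each integer fare with groupby
    survivedFare = survivedDf.groupby('Fare').size()
    notSurvivedFare = notSurvivedDf.groupby('Fare').size()
    fig1 = plt.figure(figsize=(8, 9))
    # fare distribution of survivors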
    ax1 = fig1.add_subplot(211)
    ax1.set_title('fare distribution of survivors')
    ax1.set_xlabel('Fare')
    ax1.set_ylabel('Count')
    ax1.set_xlim(0, 200)# limit the x-axis so the details are visible
    ax1.bar(survivedFare.index, survivedFare, color='g')
    # fare distribution of victims
    ax2 = fig1.add_subplot(212)
    ax2.set_title('fare distribution of NOT survivors')
    ax2.set_xlabel('Fare')
    ax2.set_ylabel('Count')
    ax2.set_xlim(0, 200)# limit the x-axis so the details are visible
    ax2.bar(notSurvivedFare.index, notSurvivedFare, color='r')
    # bar chart of the average fare of survivors and victims
    y1 = sum(survivedFare.index * survivedFare) / float(survivedFare.sum())
    y2 = sum(notSurvivedFare.index * notSurvivedFare) / float(notSurvivedFare.sum())

    fig2 = plt.figure(figsize=(4, 4))
    ax3 = fig2.add_subplot(111)
    ax3.set_title('average fare')
    ax3.set_xticks([0, 1])
    ax3.set_xticklabels(['survived', 'notSurvived'])
    ax3.set_ylabel("fare")
    print(y1, y2)
    ax3.bar((0, 1), (y1, y2), align='center', alpha=0.5)
    plt.show()
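
The two averages can also be obtained without the intermediate per-fare counts. A minimal sketch on made-up fares (not the real data) follows; it reports the median next to the mean, because a single very expensive ticket pulls the mean up sharply, which is also why section 3.5 prefers the median when filling a missing fare.

```python
import pandas as pd

# Made-up fares (not real data); one very expensive ticket skews the mean.
sample = pd.DataFrame({
    "Fare":     [7, 8, 8, 13, 26, 30, 80, 512],
    "Survived": [0, 0, 0, 1,  0,  1,  1,  1],
})

# Mean and median fare for victims (0) and survivors (1).
fare_stats = sample.groupby("Survived")["Fare"].agg(["mean", "median"])
print(fare_stats)
```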

2.4 Pclass

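Pclass (passenger class) takes three values. The competition site does not explain in any detail what they stand for, so the only option is to plot the data first and then look at it.

The figure shows that class 1 passengers have both the highest survival count and the highest survival rate, so passenger class is a relevant feature.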

def pclass_analysis(trainDf):
    # split the data by Pclass 1, 2 and 3
    pclass1Df = trainDf[trainDf['Pclass'] == 1]
    pclass2Df = trainDf[trainDf['Pclass'] == 2]
    pclass3Df = trainDf[trainDf['Pclass'] == 3]
    # keep the survivors within each class
    survivedPclass1 = pclass1Df[pclass1Df['Survived'] == 1]
    survivedPclass2 = pclass2Df[pclass2Df['Survived'] == 1]
    survivedPclass3 = pclass3Df[pclass3Df['Survived'] == 1]

    fig = plt.figure(figsize=(9,6))
    # bar chart of survivor count per class
    ax1 = fig.add_subplot(121)
    ax1.set_title('survival count of Pclass')
    ax1.set_xticks([0, 1, 2])
    ax1.set_xticklabels(['Pclass1', 'Pclass2', 'Pclass3'])
    ax1.set_ylabel("Survival count")
    ax1.bar(0, len(survivedPclass1.index), color='r', align='center', alpha=0.5)
    ax1.bar(1, len(survivedPclass2.index), color='g', align='center', alpha=0.5)
    ax1.bar(2, len(survivedPclass3.index), color='b', align='center', alpha=0.5)

    # bar chart of survival rate per class
    ax2 = fig.add_subplot(122)
    ax2.set_title('survival rate of Pclass')
    ax2.set_xticks([0, 1, 2])
    ax2.set_xticklabels(['Pclass1', 'Pclass2', 'Pclass3'])
    ax2.set_ylabel("Survival rate")
    ax2.bar(0, len(survivedPclass1.index) / float(len(pclass1Df.index)), color='r', align='center', alpha=0.5)
    ax2.bar(1, len(survivedPclass2.index) / float(len(pclass2Df.index)), color='g', align='center', alpha=0.5)
    ax2.bar(2, len(survivedPclass3.index) / float(len(pclass3Df.index)), color='b', align='center', alpha=0.5)

    plt.show()

2.5 SibSp

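SibSp records the siblings and spouses a passenger had aboard. In general, having companions to help should raise the chance of survival. This first pass only distinguishes passengers with and without siblings or a spouse, ignoring how many.

The figure shows that passengers with siblings or a spouse aboard are in the minority, yet their survival rate is about 10 percentage points higher than that of passengers without, so SibSp is judged to be a relevant feature.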

def sibsp_analysis(trainDf):
    # split by whether the passenger has any siblings or spouse aboard
    sibspDf = trainDf[trainDf['SibSp'] > 0]
    noneSibspDf = trainDf[trainDf['SibSp'] == 0]
    # keep the survivors in each group
    survivedSibsp = sibspDf[sibspDf['Survived'] == 1]
    survivedNoneSibsp = noneSibspDf[noneSibspDf['Survived'] == 1]

    fig = plt.figure(figsize=(9,6))
    # survivor count
    ax1 = fig.add_subplot(121)
    ax1.set_title('survival count with or without sibsp')
    ax1.set_xticks([0, 1])
    ax1.set_xticklabels(['with', 'without'])
    ax1.set_xlabel("Sibsp")
    ax1.set_ylabel("Survival count")
    ax1.grid()
    ax1.bar(0, len(survivedSibsp.index), color='g', align='center', alpha=0.5)
    ax1.bar(1, len(survivedNoneSibsp.index), color='b', align='center', alpha=0.5)

    # survival rate
    ax2 = fig.add_subplot(122)
    ax2.set_title('survival rate with or without sibsp')
    ax2.set_xticks([0, 1])
    ax2.set_xticklabels(['with', 'without'])
    ax2.set_xlabel("Sibsp")
    ax2.set_ylabel("Survival rate")
    ax2.grid()
    ax2.bar(0, len(survivedSibsp.index) / float(len(sibspDf.index)), color='g', align='center', alpha=0.5)
    ax2.bar(1, len(survivedNoneSibsp.index) / float(len(noneSibspDf.index)), color='b', align='center', alpha=0.5)

    plt.show()

2.6 Parch

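Parch records the parents and children a passenger had aboard. This first pass only distinguishes passengers with and without parents or children, ignoring how many.

The figure shows that passengers with parents or children aboard are in the minority, yet their survival rate is nearly 15 percentage points higher than that of passengers without parents or children, so Parch is judged to be a relevant feature.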

def parch_analysis(trainDf):
    # split by whether the passenger has any parents or children aboard
    parchDf = trainDf[trainDf['Parch'] > 0]
    noneParchDf = trainDf[trainDf['Parch'] == 0]
    # keep the survivors in each group
    survivedParch = parchDf[parchDf['Survived'] == 1]
    survivedNoneParch = noneParchDf[noneParchDf['Survived'] == 1]

    fig = plt.figure(figsize=(9,6))
    # survivor count
    ax1 = fig.add_subplot(121)
    ax1.set_title('survival count with or without parch')
    ax1.set_xticks([0, 1])
    ax1.set_xticklabels(['with', 'without'])
    ax1.set_xlabel("Parch")
    ax1.set_ylabel("Survival count")
    ax1.grid()
    ax1.bar(0, len(survivedParch.index), color='g', align='center', alpha=0.5)
    ax1.bar(1, len(survivedNoneParch.index), color='b', align='center', alpha=0.5)

    # survival rate
    ax2 = fig.add_subplot(122)
    ax2.set_title('survival rate with or without parch')
    ax2.set_xticks([0, 1])
    ax2.set_xticklabels(['with', 'without'])
    ax2.set_xlabel("Parch")
    ax2.set_ylabel("Survival rate")
    ax2.grid()
    ax2.bar(0, len(survivedParch.index) / float(len(parchDf.index)), color='g', align='center', alpha=0.5)
    ax2.bar(1, len(survivedNoneParch.index) / float(len(noneParchDf.index)), color='b', align='center', alpha=0.5)

    plt.show()
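
The SibSp and Parch analyses differ only in the column they look at, so the with/without comparison can be factored into one small helper. This is a sketch rather than the author's code: the helper name `rate_with_without` is hypothetical and the sample rows are made up.

```python
import pandas as pd

def rate_with_without(df, col):
    """Survivor count and survival rate for rows where `col` > 0 versus `col` == 0."""
    has_any = (df[col] > 0).map({True: "with", False: "without"})
    return df.groupby(has_any)["Survived"].agg(survived="sum", rate="mean")

# Made-up rows (not real data) with the same column names as train.csv.
sample = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 2, 0],
    "Parch":    [0, 0, 1, 0, 2, 0, 0, 0],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 1],
})

for col in ("SibSp", "Parch"):
    print(col)
    print(rate_with_without(sample, col))
```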

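3. Data preparation

Much of the programming effort in data analysis and modeling goes into data preparation: loading, cleaning, transforming and reshaping.

3.1 Dropping irrelevant features

The irrelevant features are PassengerId, Name, Ticket and Embarked. Cabin is relevant but has too many missing values, so it is dropped as well. PassengerId is kept in the test set for now because the submission file needs it.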

import numpy as np
import pandas as pd

# load the data
trainDf = pd.read_csv("train.csv", dtype={"Age": np.float64})
testDf = pd.read_csv("test.csv", dtype={"Age": np.float64})
# drop the unused columns (PassengerId stays in testDf for the submission file)
trainDf.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked'], axis = 1, inplace = True)
testDf.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis = 1, inplace = True)

3.2 Handling the Sex feature

Convert the Sex values female and male to the numbers 0 and 1.

trainDf['Sex'] = trainDf['Sex'].map({'female': 0, 'male': 1}).astype(int)
testDf['Sex'] = testDf['Sex'].map({'female': 0, 'male': 1}).astype(int)

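3.3 Handling the Age feature

Normally it would be enough to convert Age to an integer. The analysis above, however, showed that younger passengers survive at a higher rate, so Age is used to derive a Child feature: passengers aged 15 or younger get Child = 1 and everyone else gets Child = 0, after which Age itself is dropped. The validation results later in this post also indicate that this improves prediction accuracy.

Age has missing values. As a first approach they are filled with random integers drawn from the interval [mean - std, mean + std); this can be refined later.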

# mean, standard deviation and number of missing values of Age in each set
averageAgeTrain     = trainDf["Age"].mean()
stdAgeTrain         = trainDf["Age"].std()
countNanAgeTrain    = trainDf["Age"].isnull().sum()

averageAgeTest      = testDf["Age"].mean()
stdAgeTest          = testDf["Age"].std()
countNanAgeTest     = testDf["Age"].isnull().sum()

# draw random integer ages from roughly [mean - std, mean + std) for the missing entries;
# the bounds are cast to int because randint expects integer limits
rand1 = np.random.randint(int(averageAgeTrain - stdAgeTrain), int(averageAgeTrain + stdAgeTrain), size = countNanAgeTrain)
rand2 = np.random.randint(int(averageAgeTest - stdAgeTest), int(averageAgeTest + stdAgeTest), size = countNanAgeTest)
trainDf.loc[np.isnan(trainDf["Age"]), "Age"] = rand1
testDf.loc[np.isnan(testDf["Age"]), "Age"] = rand2
# convert Age to int
trainDf['Age'] = trainDf["Age"].astype(int)
testDf['Age']  = testDf['Age'].astype(int)
# derive the Child feature from Age
def build_child(age):
    return 1 if age <= 15 else 0
trainDf['Child'] = trainDf['Age'].apply(build_child)
testDf['Child'] = testDf['Age'].apply(build_child)
trainDf.drop(['Age'], axis = 1, inplace=True)
testDf.drop(['Age'], axis = 1, inplace=True)
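
Because the fill values are random, every run of the script produces slightly different ages and therefore slightly different scores. One way to make runs repeatable is to pass a seeded generator into a small helper. The sketch below assumes NumPy's `default_rng` API; the helper name `fill_age_random`, the seed and the sample ages are all illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

def fill_age_random(age, rng):
    """Fill missing ages with random integers drawn from [mean - std, mean + std)."""
    low = int(age.mean() - age.std())
    high = int(age.mean() + age.std())
    missing = age.isnull()
    filled = age.copy()
    # Generator.integers samples from the half-open interval [low, high).
    filled[missing] = rng.integers(low, high, size=missing.sum())
    return filled.astype(int)

rng = np.random.default_rng(0)  # fixed seed (an arbitrary choice) for repeatable fills
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0])  # made-up values
print(fill_age_random(ages, rng).tolist())
```

The known ages are left untouched; only the two missing entries receive random values.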

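3.4 Handling the Pclass feature

Pclass takes the three values 1, 2 and 3 and is a categorical variable, so it is split into three features, Class1, Class2 and Class3, and the original Pclass column is then dropped.

# create one dummy (0/1) column per Pclass value; all three columns are kept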
pclassDummiesTrain = pd.get_dummies(trainDf['Pclass'])
pclassDummiesTrain.columns = ['Class1', 'Class2', 'Class3']
pclassDummiesTest = pd.get_dummies(testDf['Pclass'])
pclassDummiesTest.columns = ['Class1', 'Class2', 'Class3']
# join the dummy columns
trainDf = trainDf.join(pclassDummiesTrain)
testDf = testDf.join(pclassDummiesTest)
# drop the original Pclass column
trainDf.drop('Pclass', axis=1, inplace=True)
testDf.drop('Pclass', axis=1, inplace=True)
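
Renaming the dummy columns by position works here because all three classes occur in both files. If one class were missing from a split, the rename would fail with a length mismatch. A slightly more defensive variant, sketched below under that assumption, reindexes to a fixed list of classes first; the helper name `pclass_dummies` and the sample rows are hypothetical.

```python
import pandas as pd

def pclass_dummies(df, classes=(1, 2, 3)):
    """One 0/1 column per class, with the same columns whatever values are present."""
    dummies = pd.get_dummies(df["Pclass"])
    # reindex adds an all-zero column for any class missing from this DataFrame
    dummies = dummies.reindex(columns=list(classes), fill_value=0).astype(int)
    dummies.columns = ["Class%d" % c for c in classes]
    return dummies

train_part = pd.DataFrame({"Pclass": [1, 3, 2, 3]})   # made-up rows
test_part = pd.DataFrame({"Pclass": [3, 3, 1]})       # class 2 never appears here

print(pclass_dummies(train_part))
print(pclass_dummies(test_part))
```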

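3.5 Handling the Fare, SibSp and Parch features

Missing Fare values are filled with the median rather than the mean, because fares vary so widely that the mean is not representative. Fare is then converted to an integer.

SibSp and Parch have no missing values and are left unchanged.

# the passenger with ID 1044 in the test set has no fare; fill it with the median
testDf["Fare"] = testDf["Fare"].fillna(testDf["Fare"].median())
# truncate fares to integers
trainDf['Fare'] = trainDf['Fare'].astype(int)
testDf['Fare']  = testDf['Fare'].astype(int)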

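4. Model training and cross-validation

The model chosen for this problem is a random forest. Random forests perform well on many data sets, generalize well, are fairly resistant to overfitting and cope well with high-dimensional data.

The training code stays the same whatever features are selected. Each model is judged with a simple hold-out validation: a random 70% of the training data is used to fit the model and the remaining 30% to validate it, and the accuracy on both parts is printed for comparison.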

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# build the feature matrix, the labels and the test-set features
x = trainDf.drop("Survived", axis=1)
y = trainDf["Survived"]
xPre = testDf.drop("PassengerId", axis=1).copy()
# split the training data: 70% to fit the model, 30% held out for validation
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.3, random_state=1)

# accuracy in percent; a holds the predicted classes, b the actual classes
def cal_correct_rate(a, b):
    a, b = np.asarray(a), np.asarray(b)
    count = a.ravel() == b.ravel()
    correctRate = 100 * float(count.sum()) / a.size
    return correctRate

# Random Forests
randomForest = RandomForestClassifier(n_estimators=100)
randomForest.fit(xTrain, yTrain)
print("Random forest training accuracy: %.2f%%" % (100 * randomForest.score(xTrain, yTrain)))
yHat = randomForest.predict(xTest)
# validate on the held-out part
correctRateRF = cal_correct_rate(yHat, yTest)
print("Random forest validation accuracy: %.2f%%" % correctRateRF)
# predict the test set
yPre = randomForest.predict(xPre)

# save to file
submission = pd.DataFrame({
        "PassengerId": testDf["PassengerId"],
        "Survived": yPre
    })
submission.to_csv('submission.csv', index=False)
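
A single random split gives a validation score that depends on which rows happen to be held out. A k-fold cross-validation averages over several splits and is less sensitive to that. The sketch below uses scikit-learn's `cross_val_score` on synthetic stand-in data (not the Titanic features); the choice of 5 folds and the fixed random seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features (not the Titanic data):
# 200 rows, 4 numeric columns, and a label driven mostly by column 0.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 4))
y_demo = (x_demo[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
# 5 folds: each row is used for validation exactly once.
scores = cross_val_score(model, x_demo, y_demo, cv=5, scoring="accuracy")
print("fold accuracies:", np.round(scores, 3))
print("mean accuracy: %.3f" % scores.mean())
```

With the real data, x and y from the listing above would take the place of x_demo and y_demo.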

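4.1 Using only the Sex feature

Sex is a very important feature, so it is worth trying a prediction with Sex alone. The results:

Training accuracy: 79.78%
Validation accuracy: 78.21%
Submission accuracy: 76.56%

At the time of writing (22:26 on 18 September 2016) there were 4966 teams in the competition, and a prediction based on Sex alone (all men die, all women survive) already ranks 3183rd (top 64.10%).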

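4.2 Using the Sex and Pclass features

Pclass is also a fairly important feature. Building a model on Sex plus the Class1, Class2 and Class3 features derived from Pclass gives:

Training accuracy: 80.52%
Validation accuracy: 77.88%
Submission accuracy: 75.60%

After adding Pclass, the predictions are actually slightly worse than with Sex alone: training accuracy rises, but validation and submission accuracy both drop.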

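4.3 Using Sex, Pclass and Fare

The results:

Training accuracy: 87.64%
Validation accuracy: 78.53%
Submission accuracy: 77.03%

Training accuracy rises by about 7 points, validation accuracy is essentially unchanged and submission accuracy improves slightly. Even so, that small gain moves the ranking from 3183rd to 2984th (top 60.09%).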

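4.4 Using Sex, Pclass, Fare and Child

Child is derived from Age: it is set to 1 when the age is 15 or less and to 0 otherwise. The cutoff of 15 was chosen because the survival rate of the age bins in [0,15) is above the average survival rate across age bins.
The results:

Training accuracy: 89.89%
Validation accuracy: 80.13%
Submission accuracy: 77.99%

Training and validation accuracy both improve slightly and submission accuracy improves by a tiny amount, yet that is enough to move the ranking from 2984th to 2064th (top 41.56%).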

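4.5 Using Sex, Pclass, Fare, Child, SibSp and Parch

SibSp and Parch both describe travelling companions, so they are added together. The results:

Training accuracy: 91.01%
Validation accuracy: 80.61%
Submission accuracy: 78.95%

Training and validation accuracy both improve slightly and submission accuracy improves by a tiny amount, yet that is enough to move the ranking from 2064th to 1361st (top 27.40%).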
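
The experiments in 4.1 to 4.5 all repeat the same procedure with a growing list of columns, which can be expressed as a loop over feature sets. The following is a sketch on synthetic data that only borrows the column names; the survival rule, the seeds and the variable names are made up, so the printed numbers say nothing about the real competition.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the article's column names (not the real data).
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "Sex":   rng.integers(0, 2, n),      # 0 = female, 1 = male, as in section 3.2
    "Fare":  rng.integers(0, 100, n),
    "Child": rng.integers(0, 2, n),
})
# Made-up rule: women and children survive more often, plus noise.
score = 1.5 * (1 - demo["Sex"]) + 0.8 * demo["Child"] + rng.normal(0, 0.7, n)
demo["Survived"] = (score > 1.0).astype(int)

feature_sets = [["Sex"], ["Sex", "Fare"], ["Sex", "Fare", "Child"]]
results = {}
for cols in feature_sets:
    x_tr, x_va, y_tr, y_va = train_test_split(
        demo[cols], demo["Survived"], test_size=0.3, random_state=1)
    clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(x_tr, y_tr)
    # store (training accuracy, validation accuracy) for this feature set
    results[tuple(cols)] = (clf.score(x_tr, y_tr), clf.score(x_va, y_va))
    print(cols, "train %.3f  validation %.3f" % results[tuple(cols)])
```

Comparing the two numbers per row makes the pattern in the tables above easy to spot: a training accuracy that keeps climbing while validation accuracy barely moves is a sign of overfitting rather than real improvement.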

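4.6 Switching to a GBDT model

Training accuracy: 90.64%
Validation accuracy: 80.13%
Submission accuracy: 79.43%

Judging by the validation set, GBDT brings no clear improvement, yet the submission accuracy rises by a little under one percentage point. The test set has only 418 rows, so this amounts to just 2 more correct predictions. With so many participants, though, this small gain moves the ranking from 1361st to 1066th (top 22.68%).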
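
The post does not show the GBDT code or its settings. Assuming scikit-learn's `GradientBoostingClassifier` was the implementation, swapping models only changes the estimator line. The sketch below runs on synthetic data, and the hyperparameters shown are scikit-learn's defaults written out explicitly, not values taken from the original.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic features).
rng = np.random.default_rng(2)
x_demo = rng.normal(size=(300, 5))
y_demo = (x_demo[:, 0] - 0.5 * x_demo[:, 1] + 0.3 * rng.normal(size=300) > 0).astype(int)
x_tr, x_va, y_tr, y_va = train_test_split(x_demo, y_demo, test_size=0.3, random_state=2)

# Assumed settings: these are the library defaults, spelled out for clarity.
gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                  max_depth=3, random_state=2)
gbdt.fit(x_tr, y_tr)
train_acc = gbdt.score(x_tr, y_tr)
valid_acc = gbdt.score(x_va, y_va)
print("GBDT training accuracy: %.2f%%" % (100 * train_acc))
print("GBDT validation accuracy: %.2f%%" % (100 * valid_acc))
```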

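All the relevant features have now been used, yet accuracy is still below 80%. The first leaderboard entry above 80% sits at rank 498 (top 10.03%), and most entries cluster around 80%, which suggests that this is where further gains become hard. The official test set test.csv has 418 rows, so 79.43% corresponds to 332 correct predictions and only a handful more are needed: 3 more would pass 80%, and the 6 more estimated from the accuracy gap would give about 80.9%. Getting there requires further work on the relevant features.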
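
The counts behind these percentages can be checked with a little arithmetic. The sketch below converts the reported leaderboard scores into numbers of correctly classified test rows; the scores and the 418-row total come from the text, while the variable names are illustrative.

```python
import math

n_test = 418        # rows in test.csv, as stated in the text
rf_acc = 0.7895     # random forest submission score from section 4.5
gbdt_acc = 0.7943   # GBDT submission score from section 4.6

correct_rf = round(rf_acc * n_test)
correct_now = round(gbdt_acc * n_test)
# smallest number of correct rows that is strictly above 80%
needed_for_80 = math.floor(0.80 * n_test) + 1

print("random forest correct:", correct_rf)                     # 330
print("GBDT correct:", correct_now)                             # 332
print("needed to pass 80%:", needed_for_80)                     # 335
print("extra rows needed:", needed_for_80 - correct_now)        # 3
print("score with 6 more: %.2f%%" % (100 * (correct_now + 6) / n_test))  # 80.86%
```

Each test row is worth about 0.24 percentage points, which is why a difference of two or three rows is enough to move hundreds of places on the leaderboard.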