Kaggle competition entry topic: Titanic

   First, a brief introduction to Kaggle. Kaggle hosts competitions for machine-learning and data-mining practitioners and learners, and is currently run with Google's backing. Prize pools range from tens of thousands to several million dollars. This article is written purely as a learning exercise; like-minded readers can reach the author through the official-account backend.

   Now, on to today's topic, a Kaggle primer: predicting the survival of passengers on the Titanic.

    The Titanic disaster of 1912 killed 1,502 of the 2,224 people on board (the film's hero among them). In hindsight, we have personal information for the passengers, and for some of them we also know whether they were rescued. By exploring these data we hope to uncover some hidden patterns, and along the way predict whether the remaining passengers survived.

    Step 1: Load data and data overview

# Load the data
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

trainData = pd.read_csv('train.csv')
testData = pd.read_csv('test.csv')

# Data overview
print(trainData.info())

    The general data is as follows:

               

     There are 891 training samples in total, with the following fields:

                PassengerId: passenger ID

                Survived: 0 = died, 1 = survived

                Pclass: economic class (1 = upper, 2 = middle, 3 = lower)

                Name: passenger's name

                Sex: sex

                Age: age

                SibSp: number of siblings/spouses aboard

                Parch: number of parents/children aboard

                Ticket: ticket number

                Fare: fare

                Cabin: cabin number

                Embarked: port of embarkation

     The real data is as follows:


      

     Step 2: Data Exploration

#data exploration
print(trainData.describe())

     The result is as follows:


    From this we can see that the mean of Survived is 0.383838, meaning most passengers did not survive; the mean of Pclass is 2.308642, so passengers skew toward the middle and lower classes; and the mean passenger age is 29.7.

    Now let's dig a little deeper and see whether the features above are related to survival.

    2.1 The relationship between Pclass and Survived

     To understand the relationship between economic class and survival, we draw a stacked bar chart of deaths and survivors per class. The Python code for the plot is as follows:

# Explore the relationship between Pclass and survival
dead = [0, 0, 0]
alive = [0, 0, 0]
trainArr = np.array(trainData)  # columns: 0=PassengerId, 1=Survived, 2=Pclass, ...
for data in trainArr:
    if data[1] == 0:
        dead[data[2] - 1] += 1
    else:
        alive[data[2] - 1] += 1

pos = [1, 2, 3]
ax = plt.figure(figsize=(8, 6)).add_subplot(111)
ax.bar(pos, dead, color='r', alpha=0.6, label='dead')
ax.bar(pos, alive, color='g', bottom=dead, alpha=0.6, label='survived')
ax.legend(fontsize=16, loc='best')
ax.set_xticks(pos)
ax.set_xticklabels(['Pclass%d' % i for i in range(1, 4)], size=15)
ax.set_title('Economic class and survival statistics', size=20)
plt.show()


    The following information can be obtained from the figure:

           The class with the most passengers is Pclass 3, i.e. the lower class;

           The class with the highest death rate is the lower class, and the class with the highest survival rate is the upper class. This matches intuition: passengers of higher economic class were more likely to be rescued.
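     As a cross-check on the hand-counted loop above, the same per-class survival rates can be read off with a pandas groupby. The mini DataFrame here is made-up sample data standing in for train.csv:

```python
import pandas as pd

# Made-up mini-sample with the same columns as train.csv
df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0, 1],
    'Pclass':   [3, 1, 1, 3, 2, 2],
})

# Mean of the 0/1 Survived column per class = survival rate per class
rate = df.groupby('Pclass')['Survived'].mean()
print(rate)
```

On the real training data, this one-liner reproduces the rates behind the bar chart.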

     2.2 The relationship between Age and Survived

     First, look at the age distribution of the passengers. Since some passengers have no age recorded, those passengers are skipped for now. The Python code for the plot is as follows:

ageDis = {}
trainArr = np.array(trainData)  # column 5 = Age
for data in trainArr:
    if np.isnan(data[5]):
        continue
    ageDis[data[5]] = ageDis.get(data[5], 0) + 1
ageDis = sorted(ageDis.items(), key=lambda item: item[0])
age = [d[0] for d in ageDis]
ageCount = [d[1] for d in ageDis]
plt.bar(age, ageCount)
plt.title('Age distribution')
plt.xlabel('age')
plt.ylabel('count')
plt.show()

      As the figure shows, age is roughly normally distributed. To handle the missing ages mentioned above, we can fill each one with a random value drawn from the interval [mean - std, mean + std]. The Python code is as follows:

'''Fill missing ages (this must run on the DataFrame, before converting to a NumPy array)'''
for i in range(len(trainData)):
    if np.isnan(trainData.iloc[i, 5]):
        # mean age is about 29 and std about 14, so draw from [15, 43]
        trainData.iloc[i, 5] = random.randint(29 - 14, 29 + 14)
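      The same idea can be written without a Python loop. Here is a vectorized sketch on a made-up Series, assuming the same [mean - std, mean + std] sampling range as the loop above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ages = pd.Series([22.0, np.nan, 35.0, np.nan, 29.0])  # made-up sample ages

mean, std = ages.mean(), ages.std()
# one uniform draw in [mean - std, mean + std] per missing entry
fills = rng.uniform(mean - std, mean + std, size=ages.isna().sum())
ages.loc[ages.isna()] = fills
print(ages)
```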

      Next we define the age groups: 0-6 are children, 7-17 are juveniles, 18-40 are youth, 41-65 are middle-aged, and 66 and older are elderly. The Python code is as follows:

'''Age-group division'''
def life(age):
    if age >= 66:
        return 4  # elderly
    elif age >= 41:
        return 3  # middle-aged
    elif age >= 18:
        return 2  # youth
    elif age >= 7:
        return 1  # juvenile
    else:
        return 0  # child

      Let's use the above definition to explore the proportion of deaths and survivors in each age group. The Python code is as follows:

'''Deaths and survivors by age group'''
trainArr = np.array(trainData)
aliveCount = [0, 0, 0, 0, 0]
deadCount = [0, 0, 0, 0, 0]
for data in trainArr:
    if data[1] == 1:
        aliveCount[life(data[5])] += 1
    else:
        deadCount[life(data[5])] += 1
pos = [1, 2, 3, 4, 5]
ax = plt.figure(figsize=(8, 6)).add_subplot(111)
ax.bar(pos, deadCount, color='r', alpha=0.6, label='dead')
ax.bar(pos, aliveCount, color='g', bottom=deadCount, alpha=0.6, label='survived')
ax.legend(fontsize=16, loc='best')
ax.set_xticks(pos)
ax.set_xticklabels(['children', 'juvenile', 'youth', 'middle-aged', 'elderly'], size=15)
ax.set_title('Deaths and survivors by age group', size=20)
plt.show()

    As the figure shows, children are the only age group with more survivors than deaths; every other group has more deaths. This likely reflects the children-first evacuation priority.

    2.3 The relationship between sex and Survived

    By common sense, women were given rescue priority over men, so we can expect the female survival rate to be higher. Let's verify this with a plot. The Python code is as follows.

'''The relationship between Sex and Survived'''
trainArr = np.array(trainData)  # column 4 = Sex
aliveCount = [0, 0]  # [female, male]
deadCount = [0, 0]
pos = [1, 2]
for data in trainArr:
    if data[4] == 'male':
        if data[1] == 1:
            aliveCount[1] += 1
        else:
            deadCount[1] += 1
    else:
        if data[1] == 1:
            aliveCount[0] += 1
        else:
            deadCount[0] += 1
ax = plt.figure(figsize=(8, 6)).add_subplot(111)
ax.bar(pos, deadCount, color='r', alpha=0.6, label='dead')
ax.bar(pos, aliveCount, color='g', bottom=deadCount, alpha=0.6, label='survived')
ax.legend(fontsize=16, loc='best')
ax.set_xticks(pos)
ax.set_xticklabels(['female', 'male'], size=15)
ax.set_title('The relationship between gender and survival', size=20)
plt.show()

    Among them, the survival rate of women was 0.74, which was much higher than that of men, 0.19.
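    The same counts can also be tabulated directly with `pd.crosstab` instead of a hand-written loop. Again, a made-up mini-sample stands in for train.csv:

```python
import pandas as pd

# Made-up mini-sample
df = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1],
})

# Rows = sex, columns = outcome, cells = counts
table = pd.crosstab(df['Sex'], df['Survived'])
print(table)
```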

    2.4 The relationship between Embarked and Survived

    Whether the port of embarkation is related to survival is hard to guess in advance, so we simply plot it. The Python code is as follows.

'''The relationship between Embarked and Survived'''
trainArr = np.array(trainData)  # column 11 = Embarked
aliveCount = [0, 0, 0]  # S, Q, C
deadCount = [0, 0, 0]
pos = [1, 2, 3]
for data in trainArr:
    if data[11] == 'S':
        if data[1] == 1:
            aliveCount[0] += 1
        else:
            deadCount[0] += 1
    elif data[11] == 'Q':
        if data[1] == 1:
            aliveCount[1] += 1
        else:
            deadCount[1] += 1
    else:
        if data[1] == 1:
            aliveCount[2] += 1
        else:
            deadCount[2] += 1
ax = plt.figure(figsize=(8, 6)).add_subplot(111)
ax.bar(pos, deadCount, color='r', alpha=0.6, label='dead')
ax.bar(pos, aliveCount, color='g', bottom=deadCount, alpha=0.6, label='survived')
ax.legend(fontsize=16, loc='best')
ax.set_xticks(pos)
ax.set_xticklabels(['S', 'Q', 'C'], size=15)
ax.set_title('The relationship between the port of embarkation and survival', size=20)
plt.show()

  Among them, the survival rate of port S is 0.33, the survival rate of port Q is 0.39, and the survival rate of port C is 0.56. It can be seen that the survival rate of passengers boarding at port C is higher.

   2.5 The relationship between SibSp, Parch and Survived

   Two plots are shown first.



      From the two figures we can see that survival rate is not a simple linear function of SibSp or Parch: as their counts increase, the survival rate first rises and then falls.
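      Since the two figures are not reproduced here, the shape of the relationship can be sketched numerically instead. The sample values below are made up to illustrate the non-monotonic pattern; the real rates come from train.csv:

```python
import pandas as pd

# Made-up mini-sample
df = pd.DataFrame({
    'SibSp':    [0, 0, 1, 1, 1, 2, 3],
    'Survived': [0, 1, 1, 1, 0, 0, 0],
})

# Survival rate rises from SibSp=0 to SibSp=1, then falls: not monotonic
print(df.groupby('SibSp')['Survived'].mean())
```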


   Step 3: A decision-tree-based classification model

   Based on the analysis above, we first build a classification model with a decision tree. The features selected are economic class, sex, age group, and port of embarkation; the target is Survived.

   The raw data is processed to obtain a training data set suitable for decision trees. The Python code is as follows.

'''Build the training set for the decision tree'''
trainArr = np.array(trainData)
data = []
labels = ['Pclass', 'Sex', 'Age', 'Embarked']
for d in trainArr:
    temp = []
    temp.append(str(d[2]))
    temp.append(str(d[4]))
    temp.append(str(life(d[5])))
    temp.append(str(d[11]))
    temp.append('yes' if d[1] == 1 else 'no')
    data.append(temp)

    The constructed decision tree is shown in the figure below.


   As the tree shows, sex is the top split, followed by economic class, then age group, and finally port of embarkation. With the constructed tree we can predict whether a passenger survives. The Python code is as follows.

'''Decision-tree classifier.
   Parameters: inputTree: the built decision tree;
   featureLabels, testVec: feature labels and feature values of the test sample.
'''
def classify(inputTree, featureLabels, testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featureIndex = featureLabels.index(firstStr)
    classLabel = None  # stays None if the feature value is missing from the tree
    for key in secondDict.keys():
        if testVec[featureIndex] == key:
            if isinstance(secondDict[key], dict):
                classLabel = classify(secondDict[key], featureLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel
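  To see the nested-dictionary shape that `classify` expects, here is a self-contained toy example. The `myTree` below is a hypothetical hand-written tree, not the one actually trained in the article, though it puts sex at the root as the text describes:

```python
def classify(inputTree, featureLabels, testVec):
    # Same function as above, repeated so this sketch runs standalone
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featureIndex = featureLabels.index(firstStr)
    classLabel = None
    for key in secondDict:
        if testVec[featureIndex] == key:
            node = secondDict[key]
            classLabel = classify(node, featureLabels, testVec) if isinstance(node, dict) else node
    return classLabel

# Hypothetical tree: sex at the root, then economic class for women
myTree = {'Sex': {'male': 'no',
                  'female': {'Pclass': {'1': 'yes', '2': 'yes', '3': 'no'}}}}
featureLabels = ['Pclass', 'Sex', 'Age', 'Embarked']

print(classify(myTree, featureLabels, ['1', 'female', '2', 'S']))  # yes
print(classify(myTree, featureLabels, ['3', 'male', '2', 'S']))    # no
```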

  The Python code to make predictions for all passengers in the test file is as follows.

'''Predict'''
testArr = np.array(testData)  # test.csv has no Survived column, so indices shift by one
featureLabels = labels[:]
for test in testArr:
    temp = []
    temp.append(str(test[1]))
    temp.append(str(test[3]))
    temp.append(str(life(test[4])))
    temp.append(str(test[10]))
    print(classify(myTree, featureLabels, temp))
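   Kaggle expects a two-column CSV (PassengerId and a 0/1 Survived label). Here is a sketch of converting the printed yes/no predictions into that format, using made-up IDs and predictions in place of the real classify() output:

```python
import pandas as pd

# Hypothetical predictions; the real ones come from classify() over testData
passenger_ids = [892, 893, 894]
preds = ['no', 'yes', 'no']

submission = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': [1 if p == 'yes' else 0 for p in preds],
})
submission.to_csv('submission.csv', index=False)
print(submission)
```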

   With the code above we can predict the survival of every passenger in the test set. Submitting the results to Kaggle gave a public test score of 0.66985, meaning nearly 70% of the predictions were correct. The score is not high, but the whole process was very instructive, and I hope to keep improving it.



   For more content like this, follow the WeChat public account: Dream Chasing Programmer.






