Kaggle introductory learning demo: Titanic - Machine Learning from Disaster

Series of articles catalog:
Classification problem study notes - KNN
Classification problem study notes - Decision Tree



Goal: predict survival on the Titanic and get familiar with the basics of ML.

1. Download the data set and understand the meaning of the fields

First, go to the competition page and download the data set: https://www.kaggle.com/c/titanic/data

import pandas as pd
import numpy as np
import random as rnd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')


train = pd.read_csv('../data/titanic/train.csv')
test = pd.read_csv('../data/titanic/test.csv')
print('Train data shape:',train.shape)
print('Test data shape:',test.shape)

Train data shape: (891, 12)
Test data shape: (418, 11)

# 首先了解字段的含义并且可以看出Age、Cabin、Embarked三个字段是存在缺失值的情况
# survival	Survival	0 = No, 1 = Yes 我们的label 
# pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd 仓位等级
# sex	Sex	
# Age	Age in years	
# sibsp	# of siblings / spouses aboard the Titanic	 是否是带配偶来的
# parch	# of parents / children aboard the Titanic	 带了几个孩子来的
# ticket	Ticket number	 船票号码
# fare	Passenger fare	   船票票价
# cabin	Cabin number	   船舱号码
# embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton 出发港
train.info()


""" 结果如下:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
"""
test.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB"""

From the two info() calls we can see the fields with missing values: Age, Cabin and Embarked in the training data, and Age, Cabin and Fare in the test data.

train.describe() # distribution of the numerical features


train.describe(include=['O']) # distribution of the categorical (object-type) features


pd.concat([train.head(3), train.tail(3)]) # first and last three rows (DataFrame.append was removed in pandas 2.0)


# Visualize the fraction of missing values per column
missing = train.isnull().sum()/len(train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()


From this first pass over the samples we can summarize:
Fields with missing values: Age, Cabin, Embarked, Fare
String-type categorical fields: Sex, Embarked, Cabin
String-type text fields: Name, Ticket
Numerical fields: PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare

Code to distinguish:

numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(train.columns)))
print('Numerical:', numerical_fea)
print('String:', category_fea)

"""
Numerical: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
String: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
"""

# Split the numerical variables into continuous and discrete ones
# Here a field with no more than 20 distinct values is treated as discrete
def get_numerical_serial_fea(data,feas):
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 20:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(train,numerical_fea)

print('Continuous:', numerical_serial_fea)
print('Discrete:', numerical_noserial_fea)
"""
Continuous: ['PassengerId', 'Age', 'Fare']
Discrete: ['Survived', 'Pclass', 'SibSp', 'Parch']
"""

Good. That completes the first step, observing the features. Next comes the data cleaning phase.

2. Data cleaning

1) Feature selection

# Check for columns with more than 50% missing values
have_null_fea_dict = (train.isnull().sum()/len(train)).to_dict()
fea_null_moreThanHalf = {}
for key,value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
print(fea_null_moreThanHalf)

"""
{'Cabin': 0.7710437710437711}
"""

# OK, drop this column: with that much missing, it can hardly tell us anything about the label.
# (Alternatively the missing values could be filled with a placeholder such as "N"; see the sketch after this block.)
# Also drop Name and Ticket: both are string text fields with too many unique values, and my
# hunch is they have little effect on the label. (They could also be kept, encoded and fed to the model.)
train.drop(['Cabin','Name','Ticket'],axis=1,inplace=True)
test.drop(['Cabin','Name','Ticket'],axis=1,inplace=True)
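
As mentioned in the comments, a gentler alternative to dropping Cabin is to keep it behind a placeholder fill. A minimal sketch of that idea; reducing the cabin number to its deck letter is my own assumption about how the field might be used, not something the rest of this walkthrough relies on:

raw = pd.read_csv('../data/titanic/train.csv')  # re-read, since Cabin was just dropped
deck = raw['Cabin'].fillna('N').str[0]          # "C85" -> "C", missing -> "N"
print(deck.value_counts())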

2) Missing value filling

# Fill the missing values in Age, Embarked and Fare
# Age: fill with the median age within each Pclass
# Embarked: fill with the mode ('S')
# Fare (test set only): fill with the median fare
train['Age'] = train.groupby(['Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
test['Age'] = test.groupby(['Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
test['Fare'] = test['Fare'].fillna(test['Fare'].median())
train["Age"] = train["Age"].astype(int)
test["Age"] = test["Age"].astype(int)  # keep the dtype consistent with train
train['Embarked'] = train['Embarked'].fillna('S')
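
A quick sanity check of mine that the fills and drops worked:

# After filling Age/Embarked/Fare and dropping Cabin, nothing should be missing
print(train.isnull().sum().sum())  # expected: 0
print(test.isnull().sum().sum())   # expected: 0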

3) Encoding the string-type categorical fields


# Encode the string-type Sex and Embarked fields as integers
# (using .map avoids pandas chained-assignment warnings)
sex_map = {"male": 0, "female": 1}
embarked_map = {"S": 0, "C": 1, "Q": 2}

train["Sex"] = train["Sex"].map(sex_map).astype(int)
train["Embarked"] = train["Embarked"].map(embarked_map).astype(int)

test["Sex"] = test["Sex"].map(sex_map).astype(int)
test["Embarked"] = test["Embarked"].map(embarked_map).astype(int)

train.head()

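Integer codes like these impose an ordering on Embarked (S < C < Q) that doesn't really exist. An alternative worth knowing, sketched here as an aside but not used in the rest of this post, is one-hot encoding with pd.get_dummies:

# Alternative (not used below): one-hot encode Embarked into indicator columns.
# Since Embarked was just mapped to 0/1/2, the dummies come out as Embarked_0/1/2.
train_ohe = pd.get_dummies(train, columns=['Embarked'])
print(train_ohe.filter(like='Embarked').head())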

3. Model training

# Prepare the data for modeling (also drop PassengerId, which is just an index and carries no signal)
X_train = train.drop(["Survived", "PassengerId"], axis=1)
Y_train = train["Survived"]
X_test  = test.drop("PassengerId", axis=1).copy()

X_train.shape, Y_train.shape, X_test.shape

# ((891, 7), (891,), (418, 7))
# Imports for the models used below
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
# 80.81
# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
# 99.66
# SVM 99.66, awesome
# KNN 
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
# 80.13, rubbish
# Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
# 79.35, rubbish
# Stochastic gradient descent
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
# 62.96, bah, rubbish, disgusting
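# Aside (my addition, a sketch): SGDClassifier is sensitive to feature scales,
# which likely explains the poor score above. Standardizing first usually helps:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
sgd_scaled = make_pipeline(StandardScaler(), SGDClassifier())
sgd_scaled.fit(X_train, Y_train)
round(sgd_scaled.score(X_train, Y_train) * 100, 2)  # typically closer to the other linear models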
# Decision tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
# 100.0, perfect
# Random forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
# 100.0, unbeatable. Let's go with this one
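
One caveat before submitting: every score above is accuracy on the training set itself, so the perfect 100.0 from the decision tree and random forest is mostly memorization. As a sketch of my own, cross-validation gives a more honest estimate:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: accuracy on held-out folds, not on the training data
cv_scores = cross_val_score(random_forest, X_train, Y_train, cv=5)
print(round(cv_scores.mean() * 100, 2))  # expect a number well below 100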

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": Y_pred
})
submission.to_csv('../data/titanic/submission.csv', index=False)

After uploading, the leaderboard shows an accuracy of 74.401%. So much for "unbeatable".

The scores of each model are as follows:

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes',
              'Stochastic Gradient Descent', 'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian,
              acc_sgd, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)

