Predicting survival on the Titanic: getting familiar with the basics of ML
1. Download the data set and understand the meaning of the fields
First, download the data set from the competition page: https://www.kaggle.com/c/titanic/data
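import pandas as pd
import numpy as np
import random as rnd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# scikit-learn models used in the model training section
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier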
train = pd.read_csv('../data/titanic/train.csv')
test = pd.read_csv('../data/titanic/test.csv')
print('Train data shape:',train.shape)
print('Test data shape:',test.shape)
Train data shape: (891, 12)
Test data shape: (418, 11)
# First, understand what each field means. In the training set, Age, Cabin and Embarked contain missing values.
# survival  Survival: 0 = No, 1 = Yes (our label)
# pclass    Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
# sex       Sex
# Age       Age in years
# sibsp     Number of siblings / spouses aboard the Titanic
# parch     Number of parents / children aboard the Titanic
# ticket    Ticket number
# fare      Passenger fare
# cabin     Cabin number
# embarked  Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton
train.info()
""" Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
"""
test.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB"""
The training data (train.csv) has missing values in Age, Cabin and Embarked.
The test data (test.csv) has missing values in Age, Cabin and Fare.
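The same check can be wrapped in a small helper and run on both frames. The sketch below is only illustrative: `missing_summary` is a hypothetical helper name, and the toy frame stands in for train.csv with made-up values.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy stand-in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
print(missing_summary(toy))
```

Calling `missing_summary(train)` and `missing_summary(test)` would give the two lists above in one table each.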
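train.describe() # distribution of the numerical features
train.describe(include=['O']) # distribution of the categorical (object) features
pd.concat([train.head(3), train.tail(3)]) # DataFrame.append was removed in pandas 2.0
# Plot the share of missing values per column
missing = train.isnull().sum()/len(train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()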
From this first look at the samples, we can already gather the following:
Fields with missing values: Age, Cabin, Embarked (train) and Fare (test)
String-typed categorical features: Sex, Embarked, Cabin
String-typed text features: Name, Ticket
Numerical features: PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
Code to tell them apart:
numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(train.columns)))
print('Numerical:',numerical_fea)
print('String:', category_fea)
"""
Numerical: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
String: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
"""
# Split the numerical features into continuous and discrete ones.
# A feature with at most 20 distinct values is treated as discrete.
def get_numerical_serial_fea(data,feas):
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 20:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(train,numerical_fea)
print('Continuous:',numerical_serial_fea)
print('Discrete:',numerical_noserial_fea)
"""
Continuous: ['PassengerId', 'Age', 'Fare']
Discrete: ['Survived', 'Pclass', 'SibSp', 'Parch']
"""
That concludes the first step, observing the features. Next comes data cleaning.
2. Data cleaning
1) Feature selection
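# Find the columns where more than 50% of the values are missing
have_null_fea_dict = (train.isnull().sum()/len(train)).to_dict()
fea_null_moreThanHalf = {}
for key,value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
print(fea_null_moreThanHalf)
"""
{'Cabin': 0.7710437710437711}
"""
# Drop this column: with about 77% of its values missing, it carries little usable signal for the label.
# (The missing values could also be filled with a placeholder such as "N"; maybe next time.)
# Also drop Name and Ticket: both are free-text string features with too many unique values,
# and my subjective impression is that they have little influence on the label.
# (They could instead be encoded and fed to the model; maybe next time.)
train.drop(['Cabin','Name','Ticket'],axis=1,inplace=True)
test.drop(['Cabin','Name','Ticket'],axis=1,inplace=True)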
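The alternative mentioned in the comments, filling the missing Cabin values with "N" instead of dropping the column, could look roughly like the sketch below. It is not what this article does: the helper `cabin_to_deck` and the derived `Deck` column are hypothetical, and the toy values are made up. Keeping only the first letter (the deck) is an extra assumption that reduces the number of categories.

```python
import numpy as np
import pandas as pd

def cabin_to_deck(cabin: pd.Series) -> pd.Series:
    """Map a cabin number such as 'C85' to its deck letter; missing becomes 'N'."""
    return cabin.fillna("N").str[0]

# Toy stand-in for the Cabin column (values are made up for illustration)
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan, "C123"]})
toy["Deck"] = cabin_to_deck(toy["Cabin"])
print(toy["Deck"].tolist())  # ['C', 'N', 'E', 'N', 'C']
```

This would have to run before the `drop` call, and the resulting column would still need to be encoded as numbers before training.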
2) Missing value filling
# Fill the missing values in Age, Embarked and Fare.
# Age: median within each ticket class (Pclass)
# Fare: median of the test set (only test.csv has a missing Fare)
# Embarked: the mode, which is 'S'
train['Age'] = train.groupby(['Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
test['Age'] = test.groupby(['Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
test['Fare'] = test['Fare'].fillna(test['Fare'].median())
train["Age"] = train["Age"].astype(int)
train['Embarked'] = train['Embarked'].fillna('S')
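The code above fills the test set with medians computed on the test set itself. A common variant is to learn the per-class medians from the training data only and reuse them for both frames. The sketch below shows that variant on toy data; `fill_age_by_class` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_age_by_class(train_df: pd.DataFrame, test_df: pd.DataFrame):
    """Fill missing Age with the per-Pclass median learned from train_df only."""
    medians = train_df.groupby("Pclass")["Age"].median()
    train_out, test_out = train_df.copy(), test_df.copy()
    for df in (train_out, test_out):
        # Look up the median of each row's class and use it where Age is missing
        df["Age"] = df["Age"].fillna(df["Pclass"].map(medians))
    return train_out, test_out

# Toy frames (values are made up for illustration)
toy_train = pd.DataFrame({"Pclass": [1, 1, 1, 3, 3, 3],
                          "Age": [40.0, 50.0, np.nan, 20.0, 30.0, np.nan]})
toy_test = pd.DataFrame({"Pclass": [1, 3], "Age": [np.nan, np.nan]})

filled_train, filled_test = fill_age_by_class(toy_train, toy_test)
print(filled_test["Age"].tolist())  # [45.0, 25.0]
```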
3) Encoding the string-typed categorical fields
# Encode the string features Sex and Embarked as integers.
# .map replaces the chained assignment df[col][mask] = value, which triggers
# SettingWithCopyWarning and has no effect under pandas Copy-on-Write.
sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 0, "C": 1, "Q": 2}
for df in (train, test):
    df["Sex"] = df["Sex"].map(sex_codes).astype(int)
    df["Embarked"] = df["Embarked"].map(embarked_codes).astype(int)
train.head()
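Coding Embarked as 0, 1, 2 implies an order between the ports that does not really exist. One possible alternative, shown here only as a sketch on toy data and not used in this article, is one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

# Toy frame (values are made up for illustration)
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", "Q"]})

encoded = toy.copy()
# Sex is binary, so a single 0/1 column is enough
encoded["Sex"] = encoded["Sex"].map({"male": 0, "female": 1})
# Embarked gets one indicator column per port
encoded = pd.get_dummies(encoded, columns=["Embarked"], prefix="Embarked", dtype=int)
print(encoded.columns.tolist())  # ['Sex', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
```

With this encoding, train and test must end up with the same indicator columns, which holds here because both files contain all three ports.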
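3. Model training
# Prepare the data for the models
X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]
X_test = test.copy()
X_train.shape, Y_train.shape, X_test.shape
# Reported output: ((891, 7), (891,), (418, 7))
# Note: the code as shown keeps PassengerId in both frames, which would give 8 columns.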
# Logistic regression (every score below is accuracy on the training set)
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
# 80.81
# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
# 99.66
# SVM reaches 99.66 on the training set, impressive at first sight
# KNN
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
# 80.13, poor
# Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
# 79.35, poor
# Stochastic gradient descent
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
# 62.96, the worst of the lot
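One plausible reason for the low SGD score is feature scale: SGDClassifier is sensitive to unscaled inputs, and columns such as PassengerId and Fare have much larger ranges than Sex or Pclass. This is an assumption about the cause, not something verified in this article. The sketch below uses synthetic data rather than the Titanic set and shows how a StandardScaler could be put in front of the classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned features (not the real Titanic data)
X, y = make_classification(n_samples=300, n_features=7, random_state=0)
X[:, 0] *= 1000.0  # exaggerate one feature's scale, loosely like PassengerId

# Standardise every feature before it reaches the SGD classifier
scaled_sgd = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
scaled_sgd.fit(X, y)
scaled_acc = scaled_sgd.score(X, y)
print(f"training accuracy with scaling: {scaled_acc:.3f}")
```

On the real data the same pipeline would be fitted with `X_train` and `Y_train` in place of `X` and `y`.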
# Decision tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
# 100.0, perfect on the training set
# Random forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
# 100.0, unbeatable on the training set, so this is the one to submit
# Y_pred holds the random forest predictions, the last model fitted above
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": Y_pred
})
submission.to_csv('../data/titanic/submission.csv', index=False)
After uploading, the submission scored 74.401% accuracy on the leaderboard. That is far below the 100% measured on the training set, so the random forest is clearly overfitting.
The training-set scores of each model are as follows:
# Perceptron and Linear SVC were never trained above, so they are left out of the table
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes',
              'Stochastic Gradient Descent', 'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian,
              acc_sgd, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
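All the scores in this table are measured on the same data the models were trained on, which is why a decision tree can reach 100% while the leaderboard score is 74.401%. Cross-validation gives a more honest estimate before submitting. The sketch below uses synthetic data from `make_classification` as a stand-in (it is not the Titanic set) and compares training accuracy with 5-fold cross-validated accuracy for a decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned features; flip_y adds label noise
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           flip_y=0.2, random_state=0)

# Accuracy on the data the tree was fitted on
tree = DecisionTreeClassifier(random_state=0)
train_acc = tree.fit(X, y).score(X, y)

# Accuracy on held-out folds: each fold is scored by a tree that never saw it
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                            X, y, cv=5, scoring="accuracy")

print(f"training accuracy:  {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

On the real data, `cross_val_score(model, X_train, Y_train, cv=5)` could replace `model.score(X_train, Y_train)` for each entry in the table, so that models are ranked by held-out accuracy instead.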