Series articles:
Classification notes: KNN
Classification notes: Decision Trees
Predicting survival on the Titanic: a walkthrough of ML basics
1. Download the dataset and understand the fields
First, download the dataset from the competition page: https://www.kaggle.com/c/titanic/data
import pandas as pd
import numpy as np
import random as rnd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('../data/titanic/train.csv')
test = pd.read_csv('../data/titanic/test.csv')
print('Train data shape:',train.shape)
print('Test data shape:',test.shape)
Train data shape: (891, 12)
Test data shape: (418, 11)
# First, understand what each field means; note that Age, Cabin, and Embarked contain missing values
# survival  Survival  0 = No, 1 = Yes  (our label)
# pclass    Ticket class  1 = 1st, 2 = 2nd, 3 = 3rd
# sex       Sex
# age       Age in years
# sibsp     # of siblings / spouses aboard the Titanic
# parch     # of parents / children aboard the Titanic
# ticket    Ticket number
# fare      Passenger fare
# cabin     Cabin number
# embarked  Port of Embarkation  C = Cherbourg, Q = Queenstown, S = Southampton
train.info()
""" Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
"""
test.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB"""
The training data (train.csv) has missing values in Age, Cabin, and Embarked;
the test data has missing values in Age, Cabin, and Fare.
train.describe() # distribution of the numerical columns
train.describe(include=['O']) # distribution of the categorical (object) columns
pd.concat([train.head(3), train.tail(3)]) # DataFrame.append is deprecated since pandas 1.4
# Visualize the columns that contain missing values
missing = train.isnull().sum()/len(train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
At this point the first pass over the data gives us the following picture:
Columns with missing values: Age, Cabin, Embarked (train) and Age, Cabin, Fare (test)
Categorical string columns: Sex, Embarked, Cabin
Free-text string columns: Name, Ticket
Numerical columns: PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
The same split can be derived in code:
numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea, list(train.columns)))
print('Numerical:', numerical_fea)
print('Categorical:', category_fea)
"""
Numerical: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
Categorical: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
"""
# Split the numerical columns into continuous and discrete variables
# Columns with at most 20 distinct values are treated as discrete
def get_numerical_serial_fea(data, feas):
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        if data[fea].nunique() <= 20:
            numerical_noserial_fea.append(fea)
        else:
            numerical_serial_fea.append(fea)
    return numerical_serial_fea, numerical_noserial_fea

numerical_serial_fea, numerical_noserial_fea = get_numerical_serial_fea(train, numerical_fea)
print('Continuous:', numerical_serial_fea)
print('Discrete:', numerical_noserial_fea)
"""
Continuous: ['PassengerId', 'Age', 'Fare']
Discrete: ['Survived', 'Pclass', 'SibSp', 'Parch']
"""
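With the discrete columns identified, a quick way to see how a discrete feature relates to the label is a grouped mean of the 0/1 Survived column. A sketch on a toy frame standing in for `train` (illustrative values only):

```python
import pandas as pd

# Toy stand-in for the Titanic train frame
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 label per group is the survival rate per group
rate = toy.groupby('Pclass')['Survived'].mean()
print(rate)
```

On the real data this kind of table is what suggests Pclass and Sex as strong predictors.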
That concludes the feature-inspection step; next comes data cleaning.
2. Data cleaning
1) Feature selection
# Find columns where more than 50% of the values are missing
have_null_fea_dict = (train.isnull().sum() / len(train)).to_dict()
fea_null_moreThanHalf = {}
for key, value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
print(fea_null_moreThanHalf)
"""
{'Cabin': 0.7710437710437711}
"""
# Drop Cabin: with about 77% of its values missing it carries little usable signal
# (alternatively, fill the gaps with a sentinel value such as "N" and keep it)
# Also drop Name and Ticket: both are free-text string columns with many unique values,
# so they are unlikely to help the label directly (they could instead be encoded and kept)
train.drop(['Cabin','Name','Ticket'],axis=1,inplace=True)
test.drop(['Cabin','Name','Ticket'],axis=1,inplace=True)
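As noted above, Cabin does not have to be dropped. A minimal sketch of the alternative (on a toy frame standing in for `train`; the `Deck` column name is my own choice): fill the gaps with a sentinel and keep only the deck letter, which is the informative part of the cabin code.

```python
import pandas as pd

# Toy stand-in for the Titanic train frame (illustrative values only)
df = pd.DataFrame({'Cabin': ['C85', None, 'E46', None]})

# Fill missing cabins with a sentinel, then keep only the deck letter
df['Cabin'] = df['Cabin'].fillna('N')
df['Deck'] = df['Cabin'].str[0]
print(df['Deck'].tolist())  # ['C', 'N', 'E', 'N']
```

The deck letter has few distinct values, so it can be encoded like Sex or Embarked later.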
2) Missing-value imputation
# Fill the missing values in Age, Embarked, and Fare
# Age: fill with the median age within each Pclass group
# Embarked: fill with the mode ('S')
train['Age'] = train.groupby(['Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
test['Age'] = test.groupby(['Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
test['Fare'] = test['Fare'].fillna(test['Fare'].median())
train['Age'] = train['Age'].astype(int)
train['Embarked'] = train['Embarked'].fillna('S')
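The group-wise median fill used for Age can be illustrated on a toy frame (illustrative values only): each missing Age receives the median of its own Pclass group rather than the global median.

```python
import pandas as pd
import numpy as np

# Toy stand-in for the Titanic train frame
df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age':    [38.0, np.nan, 40.0, 22.0, np.nan, 26.0],
})

# transform keeps the original index, so each NaN is filled with the
# median Age of its own Pclass group
df['Age'] = df.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.median()))
print(df['Age'].tolist())  # [38.0, 39.0, 40.0, 22.0, 24.0, 26.0]
```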
3) Encoding categorical string columns
# Encode Sex and Embarked as integers. Use .map instead of chained indexing
# (e.g. train["Sex"][train["Sex"] == "male"] = 0), which raises
# SettingWithCopyWarning and can silently assign to a copy
for df in (train, test):
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1}).astype(int)
    df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
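Integer codes impose an artificial ordering on the ports (S < C < Q). A hedged alternative, sketched here on a toy frame, is one-hot encoding with `pd.get_dummies`, which gives each port its own 0/1 column:

```python
import pandas as pd

# Toy stand-in for the Embarked column
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One column per port; avoids implying an order between categories
dummies = pd.get_dummies(df['Embarked'], prefix='Embarked')
print(list(dummies.columns))  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```

For tree models the integer codes are usually fine; for linear models one-hot encoding tends to be the safer choice.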
train.head()
3. Model training
# Prepare the model inputs (the classifiers below need these sklearn imports)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]
X_test = test.copy()
X_train.shape, Y_train.shape, X_test.shape
# ((891, 7), (891,), (418, 7))
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
# 80.81
# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
# 99.66 (training accuracy; an unscaled SVC fits the training set almost perfectly)
# KNN
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
# 80.13
# Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
# 79.35
# Stochastic Gradient Descent
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
# 62.96
# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
# 100.0 (training accuracy; an unconstrained tree can memorize the training set)
# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
# 100.0 (training accuracy); we use this model for the submission
submission = pd.DataFrame({
"PassengerId": test["PassengerId"],
"Survived": Y_pred
})
submission.to_csv('../data/titanic/submission.csv', index=False)
After uploading, the leaderboard accuracy came in at 74.401%; the gap between this and the 100% training accuracy is a clear sign of overfitting.
The training scores of the models side by side:
# Perceptron and Linear SVC were never trained above, so they are omitted here
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes',
              'Stochastic Gradient Descent', 'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian,
              acc_sgd, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
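All the scores above are training-set accuracies, which is why the 100% models still only reach ~74% on the leaderboard. A minimal sketch of the more honest alternative, k-fold cross-validation (shown on synthetic stand-in data from `make_classification`; in the notebook you would pass the real `X_train`, `Y_train` instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X_train, Y_train): 200 rows, 7 features
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# 5-fold CV scores each fold on data the model never trained on,
# unlike model.score(X_train, Y_train)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5)
print(round(scores.mean() * 100, 2))
```

The cross-validated mean is usually much closer to the leaderboard score than the training accuracy is, and it makes model comparison tables like the one above far more trustworthy.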