Kaggle/Titanic: Analysis and Modeling in Python

Titanic is Kaggle's introductory project. This post follows https://www.kaggle.com/startupsci/titanic/titanic-data-science-solutions as a study guide.


1.Workflow stages

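The complete workflow has 7 stages. Kaggle already takes care of stages 1 and 2, so most of the remaining work is data preparation, the so-called "feature engineering", and exploring the data through plots is an essential skill along the way.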

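What does "wrangle" mean in stage 3? Data wrangling is the cleaning and transforming of raw data into a form that can be analyzed and modeled.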

  1. Question or problem definition.
  2. Acquire training and testing data.
  3. Wrangle, prepare, cleanse the data.
  4. Analyze, identify patterns, and explore the data.
  5. Model, predict and solve the problem.
  6. Visualize, report, and present the problem solving steps and final solution.
  7. Supply or submit the results.

2. Analyze by describing data

Early exploration of the dataset with pandas can answer the following questions:

Which features are available in the dataset?
Which features are categorical?
Which features are numerical?
Which features are mixed data types?
Which features may contain errors or typos?
Which features contain blank, null or empty values?
What are the data types for various features?
What is the distribution of numerical feature values across the samples?
What is the distribution of categorical features?
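
Several of these questions map onto one-line pandas calls. The following is a minimal sketch, assuming a DataFrame named train_df with Titanic-style columns; the rows are made-up placeholders, not real records from the competition data.

```python
import pandas as pd

# Hypothetical rows in the shape of the Titanic training data (not real records).
train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "Ticket": ["A 100", "PC 200", "300", "400", "B 500", "600"],
    "Embarked": ["S", "C", "S", "S", None, "S"],
})

# Which features are available?
feature_names = train_df.columns.tolist()

# What are the data types? Text columns are categorical or mixed,
# integer and float columns are numerical.
dtypes = train_df.dtypes

# Which features contain null values, and how many?
null_counts = train_df.isnull().sum()

# Distribution of numerical features (count, mean, std, quartiles).
numeric_summary = train_df.describe()

# Distribution of non-numerical features (count, unique, top, freq).
categorical_summary = train_df.select_dtypes(exclude="number").describe()

print(feature_names)
print(dtypes)
print(null_counts)
print(numeric_summary)
print(categorical_summary)
```

Errors and typos usually cannot be detected automatically; the unique and top rows of the categorical summary are only a starting point for spotting suspicious values.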

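3. Assumptions based on data analysis
Building on "Analyze by describing data", form assumptions in the following categories.
Correlating feature: in this dataset, for example, female passengers have a higher survival rate
Completing feature
Correcting feature
Creating new feature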

4. Analyze by pivoting features | Analyze by visualizing data
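Sections 3 and 4 have to be considered and carried out together; these two steps give a deeper understanding of each feature in the data.
They also determine which features are useful and which are useless and can be dropped.
The assumptions must be backed by evidence from this step; tables and histograms are both good ways to "pivot" the data and reveal its patterns.
When a feature is a categorical variable, use a table to pivot the data.
When a feature is a numerical variable, use a histogram to pivot the data.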
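
Both kinds of pivoting can be sketched with pandas alone. The example below assumes a small made-up sample (not real Titanic records) and hypothetical age bin edges; a plotting library would draw the same bins as bars, but here the counts are tabulated so the sketch has no plotting dependency.

```python
import pandas as pd

# Hypothetical sample, not real Titanic records.
train_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 30.0, 54.0, 2.0, 14.0],
})

# Categorical feature: a table of the mean survival rate per category.
survival_by_sex = (
    train_df[["Sex", "Survived"]]
    .groupby("Sex", as_index=False)
    .mean()
    .sort_values(by="Survived", ascending=False)
)
print(survival_by_sex)

# Numerical feature: bin the values the way a histogram would
# (bin edges are an assumption), then count passengers and survivors per bin.
age_bins = pd.cut(train_df["Age"], bins=[0, 20, 40, 60])
age_histogram = (
    train_df.groupby(age_bins, observed=False)["Survived"]
    .agg(["count", "sum"])
)
print(age_histogram)
```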

4.1 Correlating feature
Correlating numerical features
Correlating numerical and ordinal features
Correlating categorical features
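
One possible way to check such correlations numerically is sketched below. It assumes made-up sample rows (not real Titanic records), and it uses a Pearson correlation and a pivot table rather than plots, which is a simplification chosen for this sketch.

```python
import pandas as pd

# Hypothetical sample, not real Titanic records.
train_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0, 0, 1],
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "female", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2, 3, 2, 1, 2],
    "Age": [22.0, 38.0, 26.0, 35.0, 30.0, 54.0, 2.0, 14.0, 45.0, 27.0],
})

# Numerical feature against the target: Pearson correlation coefficient.
age_corr = train_df["Age"].corr(train_df["Survived"])
print(age_corr)

# Ordinal feature (Pclass) crossed with a categorical feature (Sex):
# mean survival rate in each cell.
rate_table = train_df.pivot_table(
    values="Survived", index="Pclass", columns="Sex", aggfunc="mean"
)
print(rate_table)
```

A single correlation coefficient only captures a linear relationship, so it complements the tables and histograms rather than replacing them.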

5. Wrangle data
This step is where the actual "feature engineering" happens; sections 2, 3 and 4 only analyze the features.
Correcting by dropping features
Creating new feature extracting from existing
Converting a categorical feature
Completing a numerical continuous feature
Create new feature combining existing features
Completing a categorical feature
Converting categorical feature to numeric
Quick completing and converting a numeric feature
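
The operations listed above can be sketched in pandas roughly as follows. This is an illustration under stated assumptions, not the referenced notebook's exact code: the rows and names are invented, the derived columns Title, FamilySize and IsAlone are names chosen for this sketch, and the integer codes for Sex and Embarked are arbitrary.

```python
import pandas as pd

# Hypothetical rows with Titanic-style columns (names and values are made up).
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Anna",
             "Loe, Mr. Peter", "Moe, Master. Tim"],
    "Sex": ["male", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 3, 3],
    "Age": [22.0, 38.0, 26.0, None, 4.0],
    "SibSp": [1, 1, 0, 0, 3],
    "Parch": [0, 0, 0, 0, 1],
    "Ticket": ["A 100", "PC 200", "300", "400", "500"],
    "Fare": [7.25, 71.28, 7.92, None, 21.08],
    "Cabin": [None, "C85", None, None, None],
    "Embarked": ["S", "C", "S", None, "S"],
})

# Correcting by dropping features judged unhelpful.
df = df.drop(columns=["Ticket", "Cabin"])

# Creating a new feature extracted from an existing one: the title in Name.
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Converting a categorical feature to numbers (codes are an assumption).
df["Sex"] = df["Sex"].map({"female": 1, "male": 0}).astype(int)

# Completing a numerical continuous feature: fill missing Age with the
# median of passengers sharing the same Sex and Pclass.
df["Age"] = df["Age"].fillna(
    df.groupby(["Sex", "Pclass"])["Age"].transform("median")
)

# Creating a new feature by combining existing features.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Completing a categorical feature with its most frequent value.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Converting the categorical feature to numeric (codes are an assumption).
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)

# Quick completing of a numeric feature with its median.
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

print(df)
```

Filling Age by group rather than with one global median keeps the completed values closer to similar passengers; whether that helps the model still has to be checked against the data.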











Reposted from blog.csdn.net/sunfoot001/article/details/68953296