Data Restructuring with Pandas

Joyful Pandas

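Notes on joining, grouping, and aggregation from the Datawhale community's Joyful Pandas course:

Joins

Relational joins

  • Left join: keeps every key of the left table and attaches the matching rows of the right table
  • Right join: keeps every key of the right table and attaches the matching rows of the left table
  • Inner join: keeps only the keys present in both tables
  • Outer join: keeps all keys from either table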
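The four join types can be compared on a small example. This is a minimal sketch: the tables `left` and `right` and their values are made up for illustration, and `how` is the `merge` parameter that selects the join type.

```python
import pandas as pd

# Hypothetical toy tables; names and values are made up for illustration.
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Collect the keys that survive each join type.
keys_by_how = {
    how: left.merge(right, on="key", how=how)["key"].tolist()
    for how in ["left", "right", "inner", "outer"]
}

for how, keys in keys_by_how.items():
    print(how, keys)
```

The left join keeps a, b, c; the right join keeps b, c, d; the inner join keeps only b and c; the outer join keeps all four keys.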

Column join

  • df1.merge(df2, left_index = True, right_index = True)

Index join

  • df1.join(df2)
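The two index-based calls above differ in their default join type: `join` defaults to a left join, while `merge` defaults to an inner join. A minimal sketch on made-up frames (the data and column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
df1 = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"y": [10, 20, 30]}, index=["a", "b", "d"])

# join aligns on the index and keeps every row of df1 (left join).
joined = df1.join(df2)

# merge on the index defaults to an inner join, so only shared labels remain.
merged_inner = df1.merge(df2, left_index=True, right_index=True)

# Asking merge for a left join reproduces the result of join.
merged_left = df1.merge(df2, left_index=True, right_index=True, how="left")

print(joined)
print(merged_inner)
```

When both tables share exactly the same index, as in the Titanic tasks later in these notes, the two defaults give the same table.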

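Vertical concatenation

  • df1.append(df2) # same as pd.concat([df1, df2], axis = 0); DataFrame.append was removed in pandas 2.0

Directional concatenation

  • pd.concat([df1, df2], axis = 0) # axis = 0 stacks rows, axis = 1 places columns side by side
  • df1.assign(new_col = s1) # appends the Series s1 to the DataFrame as a new last column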
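A short sketch of the two concatenation directions and of `assign`. The frames `df1`, `df2` and the Series `s1` are made-up examples, and the column name `c` is an assumption.

```python
import pandas as pd

# Hypothetical data for illustration only.
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})
s1 = pd.Series([9, 10], name="c")

# axis=0 stacks the rows of df2 under df1; ignore_index renumbers the rows.
rows = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1 places s1 next to df1 as an extra column, aligned on the index.
cols = pd.concat([df1, s1], axis=1)

# assign takes keyword arguments (column name = values) and returns a new frame.
with_c = df1.assign(c=s1)

print(rows.shape, cols.shape, list(with_c.columns))
```

`assign` leaves `df1` itself unchanged, so the result has to be bound to a name to be kept.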

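Grouping pattern and the groupby object

Grouping pattern
df.groupby(grouping key)[data source].operation

groupby object
gb = df.groupby([..., ...])

  • Attributes
    • ngroups: the number of groups
    • groups: a dictionary mapping each group name to the index labels of its rows
  • Methods
    • size(): the number of rows in each group
    • get_group(): returns the rows that belong to the given group name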
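The attributes and methods listed above can be tried on a small frame. This is only a sketch: the five rows below are made up, and the column names merely imitate the Titanic data used later.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

gb = df.groupby("Sex")

print(gb.ngroups)              # number of groups
print(gb.groups)               # group name -> index labels of its rows
print(gb.size())               # rows per group
print(gb.get_group("female"))  # the rows of one group
```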

Aggregation functions

df.agg()
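As a minimal sketch of how `agg` is typically combined with the grouping pattern above (the toy frame and its column names are assumptions):

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"group": ["x", "x", "y"], "value": [1, 3, 10]})

# One function name: the result is a Series indexed by group.
means = df.groupby("group")["value"].agg("mean")

# A list of function names: the result is a DataFrame, one column per function.
summary = df.groupby("group")["value"].agg(["mean", "max"])

print(means)
print(summary)
```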


Hands-On Data Analysis

Material on data restructuring from the Datawhale community's Hands-On Data Analysis course:

Before starting, import numpy and pandas and load the data

# Import the basic libraries
import numpy as np
import pandas as pd
# Load train-left-up.csv from the data folder
pd.read_csv("data/train-left-up.csv").head()
   PassengerId  Survived  Pclass  Name
0            1         0       3  Braund, Mr. Owen Harris
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...
2            3         1       3  Heikkinen, Miss. Laina
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)
4            5         0       3  Allen, Mr. William Henry

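2 Chapter 2: Data Restructuring

2.4 Merging data

2.4.1 Task 1

Load all the data in the data folder and observe how the datasets relate to each other

# Write your code here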
text_left_up = pd.read_csv("data/train-left-up.csv")
text_left_down = pd.read_csv("data/train-left-down.csv")
text_right_up = pd.read_csv("data/train-right-up.csv")
text_right_down = pd.read_csv("data/train-right-down.csv")
# Write your code here
my_list = [text_left_up, text_left_down, text_right_up, text_right_down]
for i in my_list:
    print(i.info())
    print('=========')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   PassengerId  439 non-null    int64 
 1   Survived     439 non-null    int64 
 2   Pclass       439 non-null    int64 
 3   Name         439 non-null    object
dtypes: int64(3), object(1)
memory usage: 13.8+ KB
None
=========
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   PassengerId  452 non-null    int64 
 1   Survived     452 non-null    int64 
 2   Pclass       452 non-null    int64 
 3   Name         452 non-null    object
dtypes: int64(3), object(1)
memory usage: 14.2+ KB
None
=========
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Sex       439 non-null    object 
 1   Age       352 non-null    float64
 2   SibSp     439 non-null    int64  
 3   Parch     439 non-null    int64  
 4   Ticket    439 non-null    object 
 5   Fare      439 non-null    float64
 6   Cabin     97 non-null     object 
 7   Embarked  438 non-null    object 
dtypes: float64(2), int64(2), object(4)
memory usage: 27.6+ KB
None
=========
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Sex       452 non-null    object 
 1   Age       362 non-null    float64
 2   SibSp     452 non-null    int64  
 3   Parch     452 non-null    int64  
 4   Ticket    452 non-null    object 
 5   Fare      452 non-null    float64
 6   Cabin     107 non-null    object 
 7   Embarked  451 non-null    object 
dtypes: float64(2), int64(2), object(4)
memory usage: 28.4+ KB
None
=========

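[Hint] Recalling the train.csv data we loaded earlier, make a rough guess at what the data above is

2.4.2 Task 2

Use the concat method to merge train-left-up.csv and train-right-up.csv horizontally into one table, and save that table as result_up

# Write your code here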
result_up = pd.concat([text_left_up, text_right_up], axis = 1)
result_up.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0            1         0       3  Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2            3         1       3  Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4            5         0       3  Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

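2.4.3 Task 3

Use the concat method to merge train-left-down and train-right-down horizontally into one table, and save that table as result_down. Then merge result_up and result_down from above vertically into result.

# Write your code here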
result_down = pd.concat([text_left_down, text_right_down], axis = 1)
result_down.head()
   PassengerId  Survived  Pclass  Name                                         Sex     Age   SibSp  Parch  Ticket        Fare    Cabin  Embarked
0          440         0       2  Kvillner, Mr. Johan Henrik Johannesson       male    31.0  0      0      C.A. 18723    10.500  NaN    S
1          441         1       2  Hart, Mrs. Benjamin (Esther Ada Bloomfield)  female  45.0  1      1      F.C.C. 13529  26.250  NaN    S
2          442         0       3  Hampe, Mr. Leon                              male    20.0  0      0      345769        9.500   NaN    S
3          443         0       3  Petterson, Mr. Johan Emil                    male    25.0  1      0      347076        7.775   NaN    S
4          444         1       2  Reynaldo, Ms. Encarnacion                    female  28.0  0      0      230434        13.000  NaN    S
result = pd.concat([result_up, result_down], axis = 0).reset_index(drop = True)
result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

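2.4.4 Task 4

Use DataFrame's own join and append methods to complete Tasks 2 and 3

# Write your code here
result_up = text_left_up.join(text_right_up)
result_down = text_left_down.join(text_right_down)
# DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat([result_up, result_down]) instead
result = result_up.append(result_down).reset_index(drop = True)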
result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

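2.4.5 Task 5

Use the Pandas merge method and DataFrame's append method to complete Tasks 2 and 3

# Write your code here
result_up = text_left_up.merge(text_right_up, left_index = True, right_index = True)
result_down = text_left_down.merge(text_right_down, left_index = True, right_index = True)
# DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat([result_up, result_down]) instead
result = result_up.append(result_down).reset_index(drop = True)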
result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

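[Think] Compare merge, join, and concat: how do they differ and what do they share? In Tasks 4 and 5, why is DataFrame's append method required in both cases? If only merge or join were allowed, could Tasks 4 and 5 still be completed?

merge and join combine tables column-wise, while append combines them row-wise. Official documentation:
pandas.DataFrame.join
pandas.DataFrame.merge
pandas.DataFrame.append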
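The division of labour described above can be sketched on a miniature version of the four Titanic pieces. The frames below are made-up stand-ins for the real CSV files, and `pd.concat` is swapped in for `append` for the row-wise step, since `append` no longer exists in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical miniature versions of the four pieces (two rows each).
left_up = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
right_up = pd.DataFrame({"Sex": ["male", "female"], "Age": [22.0, 38.0]})
left_down = pd.DataFrame({"PassengerId": [3, 4], "Survived": [1, 1]})
right_down = pd.DataFrame({"Sex": ["female", "female"], "Age": [26.0, 35.0]})

# Column-wise step: join and an index-based merge both align rows on the index.
up = left_up.join(right_up)
down = left_down.merge(right_down, left_index=True, right_index=True)

# Row-wise step: concat stacks the two halves; ignore_index renumbers the rows.
result = pd.concat([up, down], ignore_index=True)

print(result)
```

Because `merge` and `join` match rows by key or index, they are a poor fit for stacking two halves that share the same column names, which is why a row-wise method is needed for the second step.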

2.5 Looking at the data from another angle

2.5.1 Task 1

Convert our data into a Series

# Write your code here
result.stack().shape
(9826,)
# Write your code here
result.stack()[0]
PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                               22.0
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Embarked                             S
dtype: object
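In the output above, the length 9826 equals the sum of the non-null counts reported by result.info() (nine columns with 891 values, plus 714, 204, and 889), and Cabin is missing from row 0 because its value there is NaN. Both observations are consistent with stack dropping missing values in the pandas version used here. The sketch below shows the basic reshaping on a made-up frame without missing values, so it does not depend on that behaviour; the names and values are assumptions.

```python
import pandas as pd

# Hypothetical two-row frame with no missing values.
df = pd.DataFrame({"Name": ["Braund", "Cumings"], "Pclass": [3, 1]})

# stack moves the column labels into an inner index level,
# giving a Series indexed by (row label, column name).
stacked = df.stack()
print(stacked.shape)
print(stacked.loc[0])  # all fields of row 0

# unstack reverses the reshaping.
restored = stacked.unstack()
print(restored)
```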

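Review: Earlier we covered the Pandas basics. Chapter 2 moves on to the practical part of data analysis. In Section 1 of Chapter 2 we learned data cleaning, which matters a great deal: only when the data is reasonably clean can the analysis that follows be convincing. In this section we turn to data restructuring, which still belongs to the data understanding (preparation) stage.

2.6 Applying the data

2.6.1 Task 1

Learn about the GroupBy mechanism from the textbook Python for Data Analysis (p. 303), Google, or any other source

# Write your notes here

2.6.2 Task 2

Compute the average fare paid by men and by women on the Titanic

# Write your code here
# df is assumed to be the merged table from section 2.4 (df = result)
df.groupby('Sex')['Fare'].mean()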
Sex
female    44.479818
male      25.523893
Name: Fare, dtype: float64

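Once you understand the GroupBy mechanism, use it to carry out a series of operations and reach our goals.

The following tasks will help you get familiar with the GroupBy mechanism.

2.6.3 Task 3

Count the number of male and female survivors on the Titanic

# Write your code here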
df.groupby('Sex')['Survived'].sum()
Sex
female    233
male      109
Name: Survived, dtype: int64

2.6.4 Task 4

Count the number of survivors in each cabin class

# Write your code here
df.groupby('Pclass')['Survived'].sum()
Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64

[Supplement] The difference between the count and sum methods:
count counts the rows (non-null values) in each group, while sum adds up the 0/1 labels, which gives the number of 1s

print(df.groupby('Pclass')['Survived'].count())
print('=========')
df.Pclass.value_counts()
Pclass
1    216
2    184
3    491
Name: Survived, dtype: int64
=========
3    491
1    216
2    184
Name: Pclass, dtype: int64
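One detail worth keeping in mind is that count only counts non-null values, whereas size counts every row. A minimal sketch on made-up data, with one label deliberately missing to make the difference visible:

```python
import numpy as np
import pandas as pd

# Hypothetical data; one Survived label is deliberately missing.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 2],
    "Survived": [1, 0, 1, 1, np.nan],
})

g = df.groupby("Pclass")["Survived"]

print(g.size())   # rows per group, missing values included
print(g.count())  # non-null values per group
print(g.sum())    # sum of the 0/1 labels, i.e. the number of 1s
```

In the Titanic data the Survived column has no missing values, so there count and size agree.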

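[Hint] In the Survived column, a passenger who survived is recorded as 1 and one who died as 0

[Think] From a data-analysis perspective, what conclusions can be drawn from the statistics above?

# Notes

  1. On average, women paid higher fares than men
  2. Far more women than men survived
  3. Passengers in cabin class 1 were the most likely to survive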
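The third conclusion follows from dividing the survivor counts by the passenger counts per class. The sketch below simply re-enters the numbers printed earlier in these notes (sum and count of Survived per Pclass); nothing else is assumed.

```python
import pandas as pd

# Numbers copied from the outputs above.
survived = pd.Series({1: 136, 2: 87, 3: 119}, name="Survived")
total = pd.Series({1: 216, 2: 184, 3: 491}, name="Total")

# Survival rate per class = survivors / passengers in that class.
rate = (survived / total).round(3)
print(rate)
```

This gives roughly 63% for class 1, 47% for class 2, and 24% for class 3. The same rates could be obtained directly with df.groupby('Pclass')['Survived'].mean(), since the mean of a 0/1 column is the share of 1s.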

[Think] The calculations in Tasks 2 and 3 can be done in one step with the agg() function, and the column names can be changed with rename. Can you write out this process following the hint?

# Notes
df.groupby('Sex').agg({'Fare': 'mean', 'Survived': 'sum'}).rename(
    columns = {'Fare': 'Fare_mean', 'Survived': 'Survived_sum'})
        Fare_mean  Survived_sum
Sex
female  44.479818           233
male    25.523893           109
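An alternative to agg followed by rename is named aggregation, where each keyword names an output column and pairs it with a (column, function) tuple. This is a sketch on made-up rows, not the course's solution; the data values are assumptions.

```python
import pandas as pd

# Hypothetical toy rows that imitate the Titanic columns.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [10.0, 30.0, 50.0, 20.0],
    "Survived": [0, 1, 1, 1],
})

# output column name = (input column, aggregation function)
summary = df.groupby("Sex").agg(
    Fare_mean=("Fare", "mean"),
    Survived_sum=("Survived", "sum"),
)
print(summary)
```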

2.6.5 Task 5

Compute the average fare for each age group within each ticket class

# Write your code here
df['Age_level'] = pd.cut(df.Age, bins = 4)
df.groupby(['Pclass','Age_level'])['Fare'].mean()
Pclass  Age_level      
1       (0.34, 20.315]     116.136705
        (20.315, 40.21]     97.959878
        (40.21, 60.105]     70.386898
        (60.105, 80.0]      59.969050
2       (0.34, 20.315]      24.725834
        (20.315, 40.21]     21.055769
        (40.21, 60.105]     20.254032
        (60.105, 80.0]      10.500000
3       (0.34, 20.315]      16.580693
        (20.315, 40.21]     11.402144
        (40.21, 60.105]     12.248931
        (60.105, 80.0]       7.820000
Name: Fare, dtype: float64
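With an integer bins argument, pd.cut splits the observed age range into that many equal-width intervals and extends the lowest edge slightly so the minimum is included, which is where edges such as 0.34 and 20.315 come from. When round age bands are preferred, explicit edges and labels can be passed instead. A minimal sketch; the ages, edges, and labels below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical ages for illustration.
ages = pd.Series([5, 19, 20, 35, 61, 80])

# bins=4: four equal-width intervals computed from the data range.
equal_width = pd.cut(ages, bins=4)

# Explicit edges with labels; intervals are closed on the right by default,
# so an age of exactly 20 falls into the first band.
labeled = pd.cut(
    ages,
    bins=[0, 20, 40, 60, 80],
    labels=["0-20", "20-40", "40-60", "60-80"],
)

print(equal_width.cat.categories)
print(labeled.tolist())
```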

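2.6.6 Task 6

Merge the results of Tasks 2 and 3 and save them to sex_fare_survived.csv

# Write your code here
sex_fare_survived = pd.concat([df.groupby('Sex')['Fare'].mean(), df.groupby('Sex')['Survived'].sum()], axis = 1)
# Save the merged table, as the task requires
sex_fare_survived.to_csv('sex_fare_survived.csv')
sex_fare_survived
             Fare  Survived
Sex
female  44.479818       233
male    25.523893       109
pd.merge(df.groupby('Sex')['Fare'].mean(), df.groupby('Sex')['Survived'].sum(),on='Sex')
             Fare  Survived
Sex
female  44.479818       233
male    25.523893       109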

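2.6.7 Task 7

Compute the total number of survivors at each age, find the age with the most survivors, and finally compute the survival rate for that age (number of survivors / total number of people)

# Write your code here
survived_age =  df.groupby('Age')['Survived'].sum()
# Write your code here
survived_age.sort_values(ascending= False)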
Age
24.0    15
22.0    11
27.0    11
36.0    11
35.0    11
        ..
20.5     0
23.5     0
24.5     0
28.5     0
40.5     0
Name: Survived, Length: 88, dtype: int64
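# The denominator df.Survived.sum() is the total number of survivors,
# so this is the top age's share of all survivors
survived_age.sort_values(ascending= False).iloc[0] / df.Survived.sum()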
0.043859649122807015
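The phrase "number of survivors / total number of people" can be read in more than one way, and the choice of denominator changes the result. The sketch below computes three readings side by side on made-up data; it is not the course's reference solution, and all values are assumptions.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "Age": [22, 22, 24, 24, 24, 30],
    "Survived": [1, 0, 1, 1, 0, 1],
})

# Survivors and passengers per age.
by_age = df.groupby("Age")["Survived"].agg(survived="sum", total="count")

# Age with the most survivors.
top_age = by_age["survived"].idxmax()
top = by_age.loc[top_age]

# Reading 1: share of all survivors (the denominator used in the code above).
share_of_survivors = top["survived"] / df["Survived"].sum()
# Reading 2: survival rate among passengers of that age.
rate_within_age = top["survived"] / top["total"]
# Reading 3: share of all passengers.
share_of_passengers = top["survived"] / len(df)

print(top_age, share_of_survivors, rate_within_age, share_of_passengers)
```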

Reposted from blog.csdn.net/qq_38869560/article/details/128745496