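These are my study notes on Chapter 5 (Reshaping) of the pandas course, written while studying with the Datawhale study group. They are for reference only, so please be kind if they are not to your taste.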
DataWhale
Reference: https://datawhalechina.github.io/joyful-pandas/build/html/%E7%9B%AE%E5%BD%95/ch5.html
Chapter 5: Reshaping
import numpy as np
import pandas as pd
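I. Reshaping Long and Wide Tables
If a table stores gender in one of its columns, it is a long table with respect to gender. If gender instead appears in the column names, and the cells hold the values of some other related feature, it is a wide table with respect to gender.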
pd.DataFrame({'Gender': ['F', 'F', 'M', 'M'],
              'Height': [163, 160, 175, 180]})

  Gender  Height
0      F     163
1      F     160
2      M     175
3      M     180
pd.DataFrame({'Height: F': [163, 160],
              'Height: M': [175, 180]})

   Height: F  Height: M
0        163        175
1        160        180
1. pivot
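pivot is the typical function for turning a long table into a wide one. Start with an example: the table below stores the Chinese and Math grades of San Zhang and Si Li, and we want to display the Chinese and Math grades as columns.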
df = pd.DataFrame({
    'Class': [1, 1, 2, 2],
    'Name': ['San Zhang', 'San Zhang', 'Si Li', 'Si Li'],
    'Subject': ['Chinese', 'Math', 'Chinese', 'Math'],
    'Grade': [80, 75, 90, 85]})
df

   Class       Name  Subject  Grade
0      1  San Zhang  Chinese     80
1      1  San Zhang     Math     75
2      2      Si Li  Chinese     90
3      2      Si Li     Math     85
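A basic long-to-wide operation has three key elements: the row index after reshaping, the column whose values move to the column index, and the values that fill each row/column pair. They correspond to the index, columns and values parameters of pivot. The column index of the new table consists of the unique values of the columns column, the row index consists of the unique values of the index column, and values names the column whose values are displayed.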
df.pivot(index='Name', columns='Subject', values='Grade')
Subject    Chinese  Math
Name
San Zhang       80    75
Si Li           90    85
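pivot requires uniqueness: each cell of the new table holds exactly one value, so every combination of index and columns values may appear only once in the original table. For example, changing San Zhang's Math to Chinese in the second row makes the pair ('San Zhang', 'Chinese') appear twice. pivot can then no longer decide whether the cell should hold 80 or 75, so it raises an error:
df.loc[1, 'Subject'] = 'Chinese'
try:
    df.pivot(index='Name', columns='Subject', values='Grade')
except Exception as e:
    Err_Msg = e
Err_Msg

ValueError('Index contains duplicate entries, cannot reshape')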
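The uniqueness condition can also be checked before calling pivot. The following is only a minimal sketch on the modified toy data above; the names `df_dup`, `dup_mask` and `can_pivot` are illustrative and do not come from the original notes.

```python
import pandas as pd

# Toy data with the duplicated ('San Zhang', 'Chinese') pair
df_dup = pd.DataFrame({
    'Name': ['San Zhang', 'San Zhang', 'Si Li', 'Si Li'],
    'Subject': ['Chinese', 'Chinese', 'Chinese', 'Math'],
    'Grade': [80, 75, 90, 85]})

# Mark every row whose (Name, Subject) pair occurs more than once
dup_mask = df_dup.duplicated(subset=['Name', 'Subject'], keep=False)
print(df_dup[dup_mask])

# pivot is only safe when no pair is duplicated
can_pivot = not dup_mask.any()
print(can_pivot)
```

If the check fails, the duplicates have to be resolved first, for example by aggregating them.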
Since pandas 1.1.0, the three pivot parameters can be given as lists, which means the result can have multi-level indexes.
df = pd.DataFrame({
    'Class': [1, 1, 2, 2, 1, 1, 2, 2],
    'Name': ['San Zhang', 'San Zhang', 'Si Li', 'Si Li',
             'San Zhang', 'San Zhang', 'Si Li', 'Si Li'],
    'Examination': ['Mid', 'Final', 'Mid', 'Final',
                    'Mid', 'Final', 'Mid', 'Final'],
    'Subject': ['Chinese', 'Chinese', 'Chinese', 'Chinese',
                'Math', 'Math', 'Math', 'Math'],
    'Grade': [80, 75, 85, 65, 90, 85, 92, 88],
    'rank': [10, 15, 21, 15, 20, 7, 6, 2]})
df

   Class       Name Examination  Subject  Grade  rank
0      1  San Zhang         Mid  Chinese     80    10
1      1  San Zhang       Final  Chinese     75    15
2      2      Si Li         Mid  Chinese     85    21
3      2      Si Li       Final  Chinese     65    15
4      1  San Zhang         Mid     Math     90    20
5      1  San Zhang       Final     Math     85     7
6      2      Si Li         Mid     Math     92     6
7      2      Si Li       Final     Math     88     2
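Now we want to move the four categories formed by combining examination type and subject (midterm Chinese, final Chinese, midterm Math, final Math) to the column index, and report both grades and ranks: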
pivot_multi = df.pivot(index=['Class', 'Name'],
                       columns=['Subject', 'Examination'],
                       values=['Grade', 'rank'])
pivot_multi

                  Grade                        rank
Subject         Chinese         Math        Chinese         Math
Examination         Mid  Final   Mid  Final     Mid  Final   Mid  Final
Class Name
1     San Zhang      80     75    90     85      10     15    20      7
2     Si Li          85     65    92     88      21     15     6      2
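The multi-level result can be indexed with tuples or flattened into ordinary column names. This is a minimal sketch that rebuilds the same toy table; the variable names `grades`, `math_final` and `flat`, and the `'_'` separator, are illustrative choices rather than part of the original notes.

```python
import pandas as pd

df = pd.DataFrame({
    'Class': [1, 1, 2, 2, 1, 1, 2, 2],
    'Name': ['San Zhang', 'San Zhang', 'Si Li', 'Si Li',
             'San Zhang', 'San Zhang', 'Si Li', 'Si Li'],
    'Examination': ['Mid', 'Final', 'Mid', 'Final',
                    'Mid', 'Final', 'Mid', 'Final'],
    'Subject': ['Chinese', 'Chinese', 'Chinese', 'Chinese',
                'Math', 'Math', 'Math', 'Math'],
    'Grade': [80, 75, 85, 65, 90, 85, 92, 88],
    'rank': [10, 15, 21, 15, 20, 7, 6, 2]})

pivot_multi = df.pivot(index=['Class', 'Name'],
                       columns=['Subject', 'Examination'],
                       values=['Grade', 'rank'])

# Selecting the outermost column level keeps the two inner levels
grades = pivot_multi['Grade']

# A full column key is a tuple: (value column, Subject, Examination)
col = pivot_multi[('Grade', 'Math', 'Final')]
math_final = col.loc[(1, 'San Zhang')]

# Flatten the three column levels into plain strings such as 'Grade_Math_Final'
flat = pivot_multi.copy()
flat.columns = ['_'.join(key) for key in flat.columns]
print(flat.columns.tolist())
```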
2. pivot_table
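pivot relies on the uniqueness condition. When that condition does not hold, the multiple values that share a row/column combination must first be aggregated into a single value. For example, San Zhang and Si Li each took two Chinese exams and two Math exams, and by the school's rules the final grade is the mean of the two scores. pivot cannot handle this case, whereas pivot_table can.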
df = pd.DataFrame({
    'Name': ['San Zhang', 'San Zhang',
             'San Zhang', 'San Zhang',
             'Si Li', 'Si Li', 'Si Li', 'Si Li'],
    'Subject': ['Chinese', 'Chinese', 'Math', 'Math',
                'Chinese', 'Chinese', 'Math', 'Math'],
    'Grade': [80, 90, 100, 90, 70, 80, 85, 95]})
df

        Name  Subject  Grade
0  San Zhang  Chinese     80
1  San Zhang  Chinese     90
2  San Zhang     Math    100
3  San Zhang     Math     90
4      Si Li  Chinese     70
5      Si Li  Chinese     80
6      Si Li     Math     85
7      Si Li     Math     95
df.pivot_table(index='Name',
               columns='Subject',
               values='Grade',
               aggfunc='mean')

Subject    Chinese  Math
Name
San Zhang       85    95
Si Li           75    90
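aggfunc is not limited to the name of a built-in aggregation. The sketch below, on the same toy data, passes a callable and also turns on the margins option, which adds overall aggregates. The "keep the best attempt" rule and the names `best` and `with_margins` are assumptions made for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['San Zhang', 'San Zhang', 'San Zhang', 'San Zhang',
             'Si Li', 'Si Li', 'Si Li', 'Si Li'],
    'Subject': ['Chinese', 'Chinese', 'Math', 'Math',
                'Chinese', 'Chinese', 'Math', 'Math'],
    'Grade': [80, 90, 100, 90, 70, 80, 85, 95]})

# A callable works as aggfunc too: here the higher of the two scores is kept
best = df.pivot_table(index='Name', columns='Subject',
                      values='Grade', aggfunc=lambda x: x.max())

# margins=True appends an 'All' row and column computed from the raw data
with_margins = df.pivot_table(index='Name', columns='Subject',
                              values='Grade', aggfunc='mean',
                              margins=True)
print(best)
print(with_margins)
```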
3. melt
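Long and wide tables differ only in how the data is presented; the information they contain is equivalent. Since pivot turns a long table into a wide one, the inverse operation turns a wide table into a long one, and melt is the function for this.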
df = pd.DataFrame({
    'Class': [1, 2],
    'Name': ['San Zhang', 'Si Li'],
    'Chinese': [80, 90],
    'Math': [80, 75]})
df

   Class       Name  Chinese  Math
0      1  San Zhang       80    80
1      2      Si Li       90    75
df_melted = df.melt(id_vars=['Class', 'Name'],
                    value_vars=['Chinese', 'Math'],
                    var_name='Subject',
                    value_name='Grade')
df_melted

   Class       Name  Subject  Grade
0      1  San Zhang  Chinese     80
1      2      Si Li  Chinese     90
2      1  San Zhang     Math     80
3      2      Si Li     Math     75
pivot can turn df_melted back into the form of df:
df_unmelted = df_melted.pivot(index=['Class', 'Name'],
                              columns='Subject',
                              values='Grade')
df_unmelted

Subject          Chinese  Math
Class Name
1     San Zhang       80    80
2     Si Li           90    75
# Move the index levels back to ordinary columns and clear the name of the column index
df_unmelted = df_unmelted.reset_index().rename_axis(columns={'Subject': ''})
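The round trip can be verified with equals. Below is a self-contained sketch of the melt and pivot steps on the same toy table; it assumes pandas 1.1.0 or later, since pivot is given a list for index.

```python
import pandas as pd

df = pd.DataFrame({'Class': [1, 2],
                   'Name': ['San Zhang', 'Si Li'],
                   'Chinese': [80, 90],
                   'Math': [80, 75]})

# Wide -> long
df_melted = df.melt(id_vars=['Class', 'Name'],
                    value_vars=['Chinese', 'Math'],
                    var_name='Subject', value_name='Grade')

# Long -> wide, then restore the plain column layout
df_unmelted = (df_melted
               .pivot(index=['Class', 'Name'], columns='Subject', values='Grade')
               .reset_index()
               .rename_axis(columns={'Subject': ''}))

# Check that the round trip reproduces the original table
print(df_unmelted.equals(df))
```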
4. wide_to_long
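In melt, the values compressed out of the column index can only represent a single level of meaning, namely value_name. Suppose the columns encode crossed categories, such as midterm/final and Chinese/Math. If we want to expand the Grade column given by value_name into two columns for Chinese and Math grades, compressing only the midterm/final information, then wide_to_long is the function to use.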
df = pd.DataFrame({
    'Class': [1, 2], 'Name': ['San Zhang', 'Si Li'],
    'Chinese_Mid': [80, 75], 'Math_Mid': [90, 85],
    'Chinese_Final': [80, 75], 'Math_Final': [90, 85]})
df

   Class       Name  Chinese_Mid  Math_Mid  Chinese_Final  Math_Final
0      1  San Zhang           80        90             80          90
1      2      Si Li           75        85             75          85
pd.wide_to_long(df,
                stubnames=['Chinese', 'Math'],
                i=['Class', 'Name'],
                j='Examination',
                sep='_',
                suffix='.+')

                             Chinese  Math
Class Name      Examination
1     San Zhang Mid               80    90
                Final             80    90
2     Si Li     Mid               75    85
                Final             75    85
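To see what wide_to_long does here, the same reshaping can be spelled out with melt, a string split, and pivot. This is only a sketch of one possible decomposition, not how pandas implements it. The intermediate names `long_df` and `Subject_Examination` are made up for illustration, and the row order of the result may differ from the wide_to_long output above.

```python
import pandas as pd

df = pd.DataFrame({
    'Class': [1, 2], 'Name': ['San Zhang', 'Si Li'],
    'Chinese_Mid': [80, 75], 'Math_Mid': [90, 85],
    'Chinese_Final': [80, 75], 'Math_Final': [90, 85]})

# Step 1: melt all score columns into a single long column
long_df = df.melt(id_vars=['Class', 'Name'],
                  var_name='Subject_Examination', value_name='Grade')

# Step 2: split e.g. 'Chinese_Mid' into Subject='Chinese', Examination='Mid'
long_df[['Subject', 'Examination']] = (
    long_df['Subject_Examination'].str.split('_', expand=True))

# Step 3: pivot only Subject back to the columns, keeping Examination in the rows
res = long_df.pivot(index=['Class', 'Name', 'Examination'],
                    columns='Subject', values='Grade')
print(res)
```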
II. Reshaping Indexes
1. stack and unstack
The unstack function moves row index levels into the column index, as in the following simple example:
df = pd.DataFrame(np.ones((4, 2)),
                  index=pd.Index([('A', 'cat', 'big'),
                                  ('A', 'dog', 'small'),
                                  ('B', 'cat', 'big'),
                                  ('B', 'dog', 'small')]),
                  columns=['col_1', 'col_2'])
df

             col_1  col_2
A cat big      1.0    1.0
  dog small    1.0    1.0
B cat big      1.0    1.0
  dog small    1.0    1.0
df.unstack()
      col_1        col_2
        big small    big small
A cat   1.0   NaN    1.0   NaN
  dog   NaN   1.0    NaN   1.0
B cat   1.0   NaN    1.0   NaN
  dog   NaN   1.0    NaN   1.0
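The main parameter of unstack is the level to move. By default the innermost row level is converted, and it becomes the innermost level of the column index. Several levels can also be converted at once: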
df.unstack(2)
      col_1        col_2
        big small    big small
A cat   1.0   NaN    1.0   NaN
  dog   NaN   1.0    NaN   1.0
B cat   1.0   NaN    1.0   NaN
  dog   NaN   1.0    NaN   1.0
df.unstack([0,2])
    col_1                    col_2
        A          B             A          B
      big small  big small     big small  big small
cat   1.0   NaN  1.0   NaN     1.0   NaN  1.0   NaN
dog   NaN   1.0  NaN   1.0     NaN   1.0  NaN   1.0
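stack works in the opposite direction and moves column index levels into the row index. The sketch below applies it to the result of unstack on the same toy table. Whether stack itself drops the rows that consist only of NaN depends on the pandas version, so the sketch calls dropna explicitly; the names `unstacked` and `restored` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((4, 2)),
                  index=pd.Index([('A', 'cat', 'big'),
                                  ('A', 'dog', 'small'),
                                  ('B', 'cat', 'big'),
                                  ('B', 'dog', 'small')]),
                  columns=['col_1', 'col_2'])

# Innermost row level -> innermost column level (missing combinations become NaN)
unstacked = df.unstack()

# Innermost column level -> innermost row level; drop the NaN rows and
# sort so the row order does not depend on the pandas version
restored = unstacked.stack().dropna().sort_index()
print(restored)
```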
III. Other Reshaping Functions
1. crosstab
Count the frequency of each combination of school and transfer status in the learn_pandas dataset:
df = pd.read_csv(r'C:\Users\zhoukaiwei\Desktop\joyful-pandas\data\learn_pandas.csv')
df.head()
                          School      Grade            Name  Gender  Height  Weight Transfer  Test_Number  Test_Date Time_Record
0  Shanghai Jiao Tong University   Freshman    Gaopeng Yang  Female   158.9    46.0        N            1  2019/10/5     0:04:34
1              Peking University   Freshman  Changqiang You    Male   166.5    70.0        N            1   2019/9/4     0:04:20
2  Shanghai Jiao Tong University     Senior         Mei Sun    Male   188.9    89.0        N            2  2019/9/12     0:05:22
3               Fudan University  Sophomore    Xiaojuan Sun  Female     NaN    41.0        N            2   2020/1/3     0:04:08
4               Fudan University  Sophomore     Gaojuan You    Male   174.0    74.0        N            2  2019/11/6     0:05:22
pd.crosstab(index = df.School, columns = df.Transfer)
Transfer                        N  Y
School
Fudan University               38  1
Peking University              28  2
Shanghai Jiao Tong University  53  0
Tsinghua University            62  4
# Equivalent call: pass a dummy values array and count it explicitly via aggfunc.
# Combinations that never occur now show up as NaN instead of 0.
pd.crosstab(index=df.School, columns=df.Transfer,
            values=[0] * df.shape[0], aggfunc='count')

Transfer                          N    Y
School
Fudan University               38.0  1.0
Peking University              28.0  2.0
Shanghai Jiao Tong University  53.0  NaN
Tsinghua University            62.0  4.0
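pivot_table can perform the equivalent operation. Because we are counting how often each combination occurs, the result does not depend on which column is passed to values, as long as that column has no missing values (count skips NaN):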
df.pivot_table(index='School',
               columns='Transfer',
               values='Name',
               aggfunc='count')

Transfer                          N    Y
School
Fudan University               38.0  1.0
Peking University              28.0  2.0
Shanghai Jiao Tong University  53.0  NaN
Tsinghua University            62.0  4.0
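The outputs above differ in how they show a combination that never occurs: 0 from the default crosstab, NaN from pivot_table. The sketch below reproduces this on a small made-up table, since learn_pandas.csv is not bundled here. The frame `toy` and its values are hypothetical, and fill_value=0 is one way to make the two results agree.

```python
import pandas as pd

# Hypothetical stand-in for the School / Transfer columns
toy = pd.DataFrame({
    'School': ['A', 'A', 'A', 'B', 'B'],
    'Transfer': ['N', 'N', 'Y', 'N', 'N'],
    'Name': ['a1', 'a2', 'a3', 'b1', 'b2']})

# crosstab counts combinations and reports absent ones as 0
ct = pd.crosstab(index=toy.School, columns=toy.Transfer)

# pivot_table leaves absent combinations as NaN unless fill_value is given
pt = toy.pivot_table(index='School', columns='Transfer',
                     values='Name', aggfunc='count', fill_value=0)
print(ct)
print(pt)
```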
2. explode
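The explode method expands the elements of a column vertically. The cells to be expanded must hold one of list, tuple, Series or np.ndarray. In the output below, the string and the set are left unchanged.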
df_ex = pd.DataFrame({'A': [[1, 2],
                            'my_str',
                            {1, 2},
                            pd.Series([3, 4])],
                      'B': 1})
df_ex.explode('A')

        A  B
0       1  1
0       2  1
1  my_str  1
2  {1, 2}  1
3       3  1
3       4  1
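A common use of explode is to expand a delimited string column after splitting it into lists. The table `df_tags` and its contents below are hypothetical and serve only to illustrate the pattern.

```python
import pandas as pd

# Hypothetical table: each cell holds a comma-separated list of subjects
df_tags = pd.DataFrame({'Name': ['San Zhang', 'Si Li'],
                        'Subjects': ['Chinese,Math', 'Math']})

# Turn each string into a list, then give every list element its own row
exploded = (df_tags
            .assign(Subjects=df_tags['Subjects'].str.split(','))
            .explode('Subjects'))

# The original row labels are repeated; reset_index renumbers them
renumbered = exploded.reset_index(drop=True)
print(exploded)
print(renumbered)
```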
3. get_dummies
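get_dummies is one of the key functions for feature construction. It converts a categorical feature into indicator variables. For example, converting the Grade column into indicator variables marks the column of the grade a row belongs to with 1 and all other columns with 0: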
pd.get_dummies(df.Grade).head()
   Freshman  Junior  Senior  Sophomore
0         1       0       0          0
1         1       0       0          0
2         0       0       1          0
3         0       0       0          1
4         0       0       0          1
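A few optional parameters of get_dummies are often useful in practice. The sketch below uses a small made-up Series instead of the dataset column. Passing dtype=int is a precaution, because some pandas versions return boolean indicators by default; the names `dummies` and `dummies_k1` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the Grade column
grade = pd.Series(['Freshman', 'Freshman', 'Senior', 'Sophomore', 'Sophomore'],
                  name='Grade')

# dtype=int forces 0/1 output regardless of the version's default
dummies = pd.get_dummies(grade, dtype=int)

# prefix labels the new columns; drop_first removes the first category,
# which is implied when all remaining indicators are 0
dummies_k1 = pd.get_dummies(grade, prefix='Grade', drop_first=True, dtype=int)
print(dummies)
print(dummies_k1)
```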
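IV. Exercises
Ex1: US Illegal Drugs Dataset
Here is a dataset on illegal drugs in the United States, in which SubstanceName and DrugReports are the drug name and the number of reports: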
df = pd.read_csv(r'C:\Users\zhoukaiwei\Desktop\joyful-pandas\data\drugs.csv').sort_values(
    ['State', 'COUNTY', 'SubstanceName'], ignore_index=True)
df.head()

   YYYY State COUNTY  SubstanceName  DrugReports
0  2011    KY  ADAIR  Buprenorphine            3
1  2012    KY  ADAIR  Buprenorphine            5
2  2013    KY  ADAIR  Buprenorphine            4
3  2014    KY  ADAIR  Buprenorphine           27
4  2015    KY  ADAIR  Buprenorphine            5
Convert the data to the following form:
[Figure: Ex5_1.png, the target table layout (image not available)]
A = df.pivot(index=['State', 'COUNTY', 'SubstanceName'],
             columns='YYYY',
             values='DrugReports').reset_index().rename_axis(columns={'YYYY': ''})
A.head()

  State COUNTY  SubstanceName  2010  2011  2012  2013  2014  2015  2016  2017
0    KY  ADAIR  Buprenorphine   NaN   3.0   5.0   4.0  27.0   5.0   7.0  10.0
1    KY  ADAIR        Codeine   NaN   NaN   1.0   NaN   NaN   NaN   NaN   1.0
2    KY  ADAIR       Fentanyl   NaN   NaN   1.0   NaN   NaN   NaN   NaN   NaN
3    KY  ADAIR         Heroin   NaN   NaN   1.0   2.0   NaN   1.0   NaN   2.0
4    KY  ADAIR    Hydrocodone   6.0   9.0  10.0  10.0   9.0   7.0  11.0   3.0
Restore the result of the first question to the original table.
A_melted = A.melt(id_vars=['State', 'COUNTY', 'SubstanceName'],
                  value_vars=A.columns[-8:],
                  var_name='YYYY',
                  value_name='DrugReports').dropna(subset=['DrugReports'])
# Restore the original column order, row order and dtypes before comparing
A_melted = A_melted[df.columns].sort_values(
    ['State', 'COUNTY', 'SubstanceName'], ignore_index=True
).astype({'YYYY': 'int64', 'DrugReports': 'int64'})
A_melted.equals(df)

True
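For each State, compute the yearly total number of reports, with State as the column index and YYYY as the row index. Implement this with two different strategies, the pivot_table function and groupby + unstack, and consider how the two are related.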
B = df.pivot_table(index='YYYY', columns='State',
                   values='DrugReports', aggfunc='sum')
B.head()

State     KY     OH     PA     VA    WV
YYYY
2010   10453  19707  19814   8685  2890
2011   10289  20330  19987   6749  3271
2012   10722  23145  19959   7831  3376
2013   11148  26846  20409  11675  4046
2014   11081  30860  24904   9037  3280
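The notes show only the pivot_table strategy. The groupby + unstack strategy might look like the sketch below, written against a small made-up table with the same column names, because drugs.csv is not bundled here. The frame `toy` and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the drugs data
toy = pd.DataFrame({
    'YYYY': [2010, 2010, 2010, 2011, 2011],
    'State': ['KY', 'KY', 'OH', 'KY', 'OH'],
    'DrugReports': [3, 5, 4, 27, 5]})

# Strategy 1: pivot_table aggregates and reshapes in one call
by_pivot = toy.pivot_table(index='YYYY', columns='State',
                           values='DrugReports', aggfunc='sum')

# Strategy 2: groupby yields a Series indexed by (YYYY, State);
# unstack then moves State from the row index to the column index
by_groupby = toy.groupby(['YYYY', 'State'])['DrugReports'].sum().unstack('State')
print(by_pivot)
print(by_groupby)
```

On this toy table both strategies give the same values. This suggests the connection the exercise asks about: pivot_table can be read as a group-wise aggregation followed by moving one grouping key into the columns.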