[Examples + Hands-on + Demos] Get Started with Pandas in 20 Minutes: the Data Analysis Module, with Over a Hundred Practice Exercises [Answers Included]


#coding:utf8
%matplotlib inline

This is a short introduction to pandas, aimed mainly at new users. For more advanced material, see the Cookbook in the official documentation.

Customarily, we start by importing the following libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Creating objects

Creating a Series by passing a list of values; pandas creates a default integer index:

s = pd.Series([1,3,5,np.nan,6,8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, a datetime index, and labeled columns:

dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
A B C D
2013-01-01 0.194873 0.298287 0.073043 -0.681957
2013-01-02 -0.679429 0.397972 0.887388 1.187169
2013-01-03 0.782244 -0.251828 -0.736243 1.752973
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272
2013-01-05 0.405459 -1.151989 -1.309286 1.023221
2013-01-06 -0.445221 -1.520998 0.717376 1.657476

Creating a DataFrame by passing a dict of objects that can be converted to something Series-like:

df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D' : np.array([3]*4, dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo'})
df2
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo

The columns have the following dtypes:

df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing data

Viewing the top and bottom rows of the frame:

df.head()
A B C D
2013-01-01 0.194873 0.298287 0.073043 -0.681957
2013-01-02 -0.679429 0.397972 0.887388 1.187169
2013-01-03 0.782244 -0.251828 -0.736243 1.752973
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272
2013-01-05 0.405459 -1.151989 -1.309286 1.023221
df.tail(3)
A B C D
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272
2013-01-05 0.405459 -1.151989 -1.309286 1.023221
2013-01-06 -0.445221 -1.520998 0.717376 1.657476

Displaying the index, the columns, and the underlying NumPy data:

df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values
array([[ 0.19487255,  0.29828663,  0.07304296, -0.68195723],
       [-0.67942918,  0.39797196,  0.88738797,  1.18716883],
       [ 0.78224424, -0.25182784, -0.73624252,  1.75297344],
       [-1.8770662 , -0.06096652, -2.10376905, -1.63427152],
       [ 0.40545892, -1.15198867, -1.30928606,  1.02322057],
       [-0.44522115, -1.52099782,  0.71737636,  1.65747636]])

describe() shows a quick statistical summary of the data:

df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.269857 -0.381587 -0.411915 0.550768
std 0.955046 0.785029 1.180805 1.385087
min -1.877066 -1.520998 -2.103769 -1.634272
25% -0.620877 -0.926948 -1.166025 -0.255663
50% -0.125174 -0.156397 -0.331600 1.105195
75% 0.352812 0.208473 0.556293 1.539899
max 0.782244 0.397972 0.887388 1.752973

Transposing the data:

df.T
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.194873 -0.679429 0.782244 -1.877066 0.405459 -0.445221
B 0.298287 0.397972 -0.251828 -0.060967 -1.151989 -1.520998
C 0.073043 0.887388 -0.736243 -2.103769 -1.309286 0.717376
D -0.681957 1.187169 1.752973 -1.634272 1.023221 1.657476

Sorting by an axis (axis=1 sorts the columns; axis=0, the default, sorts the rows):

df.sort_index(axis=1, ascending=False)
D C B A
2013-01-01 -0.681957 0.073043 0.298287 0.194873
2013-01-02 1.187169 0.887388 0.397972 -0.679429
2013-01-03 1.752973 -0.736243 -0.251828 0.782244
2013-01-04 -1.634272 -2.103769 -0.060967 -1.877066
2013-01-05 1.023221 -1.309286 -1.151989 0.405459
2013-01-06 1.657476 0.717376 -1.520998 -0.445221

Sorting by values:

df.sort_values(by='D')
A B C D
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272
2013-01-01 0.194873 0.298287 0.073043 -0.681957
2013-01-05 0.405459 -1.151989 -1.309286 1.023221
2013-01-02 -0.679429 0.397972 0.887388 1.187169
2013-01-06 -0.445221 -1.520998 0.717376 1.657476
2013-01-03 0.782244 -0.251828 -0.736243 1.752973

Selection

Note: while standard Python/NumPy expressions for selecting and setting work fine, for production code we recommend the optimized pandas data-access methods: .at, .iat, .loc and .iloc (the older .ix accessor has been deprecated and removed).

Getting

Selecting a single column, which yields a Series, equivalent to df.A:

df['A']
2013-01-01    0.194873
2013-01-02   -0.679429
2013-01-03    0.782244
2013-01-04   -1.877066
2013-01-05    0.405459
2013-01-06   -0.445221
Freq: D, Name: A, dtype: float64

Selecting rows by slicing with [ ]:

df[0:4]
A B C D
2013-01-01 0.194873 0.298287 0.073043 -0.681957
2013-01-02 -0.679429 0.397972 0.887388 1.187169
2013-01-03 0.782244 -0.251828 -0.736243 1.752973
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272
df['20130102':'20130104']
A B C D
2013-01-02 -0.679429 0.397972 0.887388 1.187169
2013-01-03 0.782244 -0.251828 -0.736243 1.752973
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272

Selection by label

Getting a cross-section using a label:

dates[0]
Timestamp('2013-01-01 00:00:00', freq='D')
df.loc[dates[0]]
A    0.194873
B    0.298287
C    0.073043
D   -0.681957
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on multiple axes by label:

df.loc[:,['A','B']]
A B
2013-01-01 0.194873 0.298287
2013-01-02 -0.679429 0.397972
2013-01-03 0.782244 -0.251828
2013-01-04 -1.877066 -0.060967
2013-01-05 0.405459 -1.151989
2013-01-06 -0.445221 -1.520998
df.loc[:,['A','B']][:3]
A B
2013-01-01 0.194873 0.298287
2013-01-02 -0.679429 0.397972
2013-01-03 0.782244 -0.251828

Label slicing, with both endpoints included:

df.loc['20130102':'20130104',['A','B']]
A B
2013-01-02 -0.679429 0.397972
2013-01-03 0.782244 -0.251828
2013-01-04 -1.877066 -0.060967

Reducing the dimensions of the returned object:

df.loc['20130102',['A','B']]
A   -0.679429
B    0.397972
Name: 2013-01-02 00:00:00, dtype: float64

Getting a scalar value:

df.loc[dates[0],'A']
0.19487255317338711

Getting fast access to a scalar (equivalent to the prior method):

df.at[dates[0],'A']
0.19487255317338711

Selection by position

Selecting via the position of the passed integers:

df.iloc[3]
A   -1.877066
B   -0.060967
C   -2.103769
D   -1.634272
Name: 2013-01-04 00:00:00, dtype: float64

Selecting by integer slices, acting similarly to Python/NumPy:

df.iloc[3:5,0:2]
A B
2013-01-04 -1.877066 -0.060967
2013-01-05 0.405459 -1.151989

Slicing rows only:

df.iloc[1:3,:]
A B C D
2013-01-02 -0.679429 0.397972 0.887388 1.187169
2013-01-03 0.782244 -0.251828 -0.736243 1.752973

Slicing columns only:

df.iloc[:,1:3]
B C
2013-01-01 0.298287 0.073043
2013-01-02 0.397972 0.887388
2013-01-03 -0.251828 -0.736243
2013-01-04 -0.060967 -2.103769
2013-01-05 -1.151989 -1.309286
2013-01-06 -1.520998 0.717376

Getting a single value explicitly:

df.iloc[1,1]
0.39797195640479976

Getting fast access to a single value (equivalent to the prior method):

df.iat[1,1]
0.39797195640479976

Boolean indexing

Using a single column's values to select data:

df[df.A > 0]
A B C D
2013-01-01 0.194873 0.298287 0.073043 -0.681957
2013-01-03 0.782244 -0.251828 -0.736243 1.752973
2013-01-05 0.405459 -1.151989 -1.309286 1.023221

Selecting values with a where condition:

df[df > 0]
A B C D
2013-01-01 0.194873 0.298287 0.073043 NaN
2013-01-02 NaN 0.397972 0.887388 1.187169
2013-01-03 0.782244 NaN NaN 1.752973
2013-01-04 NaN NaN NaN NaN
2013-01-05 0.405459 NaN NaN 1.023221
2013-01-06 NaN NaN 0.717376 1.657476

Using the **isin()** method for filtering:

df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
A B C D E
2013-01-01 0.194873 0.298287 0.073043 -0.681957 one
2013-01-02 -0.679429 0.397972 0.887388 1.187169 one
2013-01-03 0.782244 -0.251828 -0.736243 1.752973 two
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272 three
2013-01-05 0.405459 -1.151989 -1.309286 1.023221 four
2013-01-06 -0.445221 -1.520998 0.717376 1.657476 three
df2[df2['E'].isin(['two', 'four'])]
A B C D E
2013-01-03 0.782244 -0.251828 -0.736243 1.752973 two
2013-01-05 0.405459 -1.151989 -1.309286 1.023221 four

Setting

Setting a new column automatically aligns the data by the index:

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102',periods=6))
s1
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
df['F'] = s1
df
A B C D F
2013-01-01 0.194873 0.298287 0.073043 -0.681957 NaN
2013-01-02 -0.679429 0.397972 0.887388 1.187169 1.0
2013-01-03 0.782244 -0.251828 -0.736243 1.752973 2.0
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272 3.0
2013-01-05 0.405459 -1.151989 -1.309286 1.023221 4.0
2013-01-06 -0.445221 -1.520998 0.717376 1.657476 5.0

Setting values by label:

df.at[dates[0], 'A'] = 0
df
A B C D F
2013-01-01 0.000000 0.298287 0.073043 -0.681957 NaN
2013-01-02 -0.679429 0.397972 0.887388 1.187169 1.0
2013-01-03 0.782244 -0.251828 -0.736243 1.752973 2.0
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272 3.0
2013-01-05 0.405459 -1.151989 -1.309286 1.023221 4.0
2013-01-06 -0.445221 -1.520998 0.717376 1.657476 5.0

Setting values by position:

df.iat[0,1] = 8888
df
A B C D F
2013-01-01 0.000000 8888.000000 0.073043 -0.681957 NaN
2013-01-02 -0.679429 0.397972 0.887388 1.187169 1.0
2013-01-03 0.782244 -0.251828 -0.736243 1.752973 2.0
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272 3.0
2013-01-05 0.405459 -1.151989 -1.309286 1.023221 4.0
2013-01-06 -0.445221 -1.520998 0.717376 1.657476 5.0

Setting by assigning with a NumPy array:

df.loc[:,'D'] = np.array([5] * len(df))
df
A B C D F
2013-01-01 0.000000 8888.000000 0.073043 5 NaN
2013-01-02 -0.679429 0.397972 0.887388 5 1.0
2013-01-03 0.782244 -0.251828 -0.736243 5 2.0
2013-01-04 -1.877066 -0.060967 -2.103769 5 3.0
2013-01-05 0.405459 -1.151989 -1.309286 5 4.0
2013-01-06 -0.445221 -1.520998 0.717376 5 5.0

Setting values with a where operation:

df2 = df.copy()
df2[df2 > 0] = -df2
df2
A B C D F
2013-01-01 0.000000 -8888.000000 -0.073043 -5 NaN
2013-01-02 -0.679429 -0.397972 -0.887388 -5 -1.0
2013-01-03 -0.782244 -0.251828 -0.736243 -5 -2.0
2013-01-04 -1.877066 -0.060967 -2.103769 -5 -3.0
2013-01-05 -0.405459 -1.151989 -1.309286 -5 -4.0
2013-01-06 -0.445221 -1.520998 -0.717376 -5 -5.0

Missing data

pandas primarily uses np.nan to represent missing data; by default it is not included in computations.

reindex() allows you to change/add/delete the index on a specified axis and returns a copy of the data.

dates[0]
Timestamp('2013-01-01 00:00:00', freq='D')
df1 = df.reindex(index=dates[0:4], columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
A B C D F E
2013-01-01 0.000000 8888.000000 0.073043 5 NaN 1.0
2013-01-02 -0.679429 0.397972 0.887388 5 1.0 1.0
2013-01-03 0.782244 -0.251828 -0.736243 5 2.0 NaN
2013-01-04 -1.877066 -0.060967 -2.103769 5 3.0 NaN

Dropping any rows that have missing data:

df1.dropna(how='any')  # how='any' drops a row if any of its values is missing
A B C D F E
2013-01-02 -0.679429 0.397972 0.887388 5 1.0 1.0

Filling missing data:

df1.fillna(value=5)
A B C D F E
2013-01-01 0.000000 8888.000000 0.073043 5 5.0 1.0
2013-01-02 -0.679429 0.397972 0.887388 5 1.0 1.0
2013-01-03 0.782244 -0.251828 -0.736243 5 2.0 5.0
2013-01-04 -1.877066 -0.060967 -2.103769 5 3.0 5.0

Getting the boolean mask where values are NaN:

pd.isnull(df1)
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True

Operations

Statistics

Operations in general exclude missing data.
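For example, mean() skips NaN unless told otherwise via the skipna parameter; a minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# NaN is skipped by default: (1 + 3) / 2
print(s.mean())              # 2.0

# skipna=False propagates the missing value instead of ignoring it
print(s.mean(skipna=False))  # nan
```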

Performing a descriptive statistic:

df.mean()  # axis=0 by default: the mean of each column
A      -0.302336
B    1480.902032
C      -0.411915
D       5.000000
F       3.000000
dtype: float64

The same operation on the other axis:

df.mean(1)  # axis=1: the mean across each row
2013-01-01    2223.268261
2013-01-02       1.321186
2013-01-03       1.358835
2013-01-04       0.791640
2013-01-05       1.388837
2013-01-06       1.750231
Freq: D, dtype: float64

Operating with objects that have different dimensionality requires alignment. In addition, pandas automatically broadcasts along the specified dimension.

s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
# shift(2) moves the values down two positions; the first two entries become NaN
s
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
df
A B C D F
2013-01-01 0.000000 8888.000000 0.073043 5 NaN
2013-01-02 -0.679429 0.397972 0.887388 5 1.0
2013-01-03 0.782244 -0.251828 -0.736243 5 2.0
2013-01-04 -1.877066 -0.060967 -2.103769 5 3.0
2013-01-05 0.405459 -1.151989 -1.309286 5 4.0
2013-01-06 -0.445221 -1.520998 0.717376 5 5.0
df.sub(s, axis='index')
# element-wise subtraction aligned on the index: each row of df minus the value of s for that date
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -0.217756 -1.251828 -1.736243 4.0 1.0
2013-01-04 -4.877066 -3.060967 -5.103769 2.0 0.0
2013-01-05 -4.594541 -6.151989 -6.309286 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN

Apply

Applying functions to the data with apply():

df.apply(np.cumsum)
# np.cumsum is applied to each column, accumulating the values from top to bottom
A B C D F
2013-01-01 0.000000 8888.000000 0.073043 5 NaN
2013-01-02 -0.679429 8888.397972 0.960431 10 1.0
2013-01-03 0.102815 8888.146144 0.224188 15 3.0
2013-01-04 -1.774251 8888.085178 -1.879581 20 6.0
2013-01-05 -1.368792 8886.933189 -3.188867 25 10.0
2013-01-06 -1.814013 8885.412191 -2.471490 30 15.0
df.apply(lambda x: x.max() - x.min())
# for each column, compute the range: maximum minus minimum
A       2.659310
B    8889.520998
C       2.991157
D       0.000000
F       4.000000
dtype: float64

Value counts

s = pd.Series(np.random.randint(0, 7, size=10))
s
0    2
1    5
2    1
3    3
4    2
5    0
6    0
7    1
8    4
9    1
dtype: int32
s.value_counts()
1    3
2    2
0    2
5    1
3    1
4    1
dtype: int64

String methods

Series is equipped with a set of string processing methods in its str attribute that make it easy to operate on each element of the array, as in the snippet below. Note that pattern matching in str generally uses regular expressions by default.

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
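For instance, str.contains() and str.replace() interpret their patterns as regular expressions by default; a minimal sketch (the patterns here are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(['Aaba', 'Baca', np.nan, 'CABA'])

# contains() treats its pattern as a regex by default: '^A' anchors to the start
starts_with_A = s.str.contains('^A')
print(starts_with_A)   # True, False, NaN, False

# replace() with regex=True rewrites every match in each element:
# runs of lowercase 'a' collapse to a single '-'
collapsed = s.str.replace('a+', '-', regex=True)
print(collapsed)       # 'A-b-', 'B-c-', NaN, 'CABA'
```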

Merging

Concat

pandas provides various facilities for easily combining Series and DataFrame objects with various kinds of set logic for the indexes and relational-algebra functionality for join/merge-type operations.

Concatenating pandas objects together with **concat()**:

df = pd.DataFrame(np.random.randn(10,4))
df
0 1 2 3
0 0.639168 0.881631 0.198015 0.393510
1 0.503544 0.155373 1.218277 0.128893
2 -0.661486 1.365067 -0.010755 0.058110
3 0.185698 -0.750695 -0.637134 -1.811947
4 -0.493348 -0.246197 -0.700524 0.692042
5 -2.280015 0.986806 1.297614 -0.749969
6 0.688663 0.088751 -0.164766 -0.165378
7 -0.382894 -0.157371 0.000836 -1.947379
8 -1.618486 0.804667 1.919125 -0.290719
9 0.392898 -0.264556 0.817233 0.680797
#break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces
[          0         1         2         3
 0  0.639168  0.881631  0.198015  0.393510
 1  0.503544  0.155373  1.218277  0.128893
 2 -0.661486  1.365067 -0.010755  0.058110,
           0         1         2         3
 3  0.185698 -0.750695 -0.637134 -1.811947
 4 -0.493348 -0.246197 -0.700524  0.692042
 5 -2.280015  0.986806  1.297614 -0.749969
 6  0.688663  0.088751 -0.164766 -0.165378,
           0         1         2         3
 7 -0.382894 -0.157371  0.000836 -1.947379
 8 -1.618486  0.804667  1.919125 -0.290719
 9  0.392898 -0.264556  0.817233  0.680797]
pd.concat(pieces)
0 1 2 3
0 0.639168 0.881631 0.198015 0.393510
1 0.503544 0.155373 1.218277 0.128893
2 -0.661486 1.365067 -0.010755 0.058110
3 0.185698 -0.750695 -0.637134 -1.811947
4 -0.493348 -0.246197 -0.700524 0.692042
5 -2.280015 0.986806 1.297614 -0.749969
6 0.688663 0.088751 -0.164766 -0.165378
7 -0.382894 -0.157371 0.000836 -1.947379
8 -1.618486 0.804667 1.919125 -0.290719
9 0.392898 -0.264556 0.817233 0.680797

Join

SQL-style merging:

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left
key lval
0 foo 1
1 foo 2
right = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [4, 5]})
right
key lval
0 foo 4
1 foo 5
pd.merge(left, right, on='key')
key lval_x lval_y
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
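merge() performs an inner join by default; the how parameter selects the other SQL-style joins. A minimal sketch with hypothetical key values:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# inner join (the default): only keys present in both frames survive
inner = pd.merge(left, right, on='key')
print(inner)   # one row, key 'foo'

# outer join: every key from both sides is kept, gaps filled with NaN
outer = pd.merge(left, right, on='key', how='outer')
print(outer)   # three rows: 'foo', 'bar', 'baz'
```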

Append

Appending rows to a DataFrame:

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df
A B C D
0 -1.526419 0.868844 -1.379758 0.498004
1 -0.917867 -0.137874 -0.909232 -0.523873
2 1.370409 -0.948766 1.728098 0.361813
3 -1.274621 -1.224051 -0.749470 -2.712027
4 -0.303875 -0.177942 0.496359 0.048004
5 -0.941436 0.044570 -0.229654 0.092941
6 0.465798 -0.835244 0.131745 2.219413
7 0.875844 0.243440 -1.050471 1.761330
s = df.iloc[3]
s
A   -1.274621
B   -1.224051
C   -0.749470
D   -2.712027
Name: 3, dtype: float64
df.append(s, ignore_index=True)
# emits a FutureWarning: DataFrame.append is deprecated; pandas recommends pd.concat instead
A B C D
0 -1.526419 0.868844 -1.379758 0.498004
1 -0.917867 -0.137874 -0.909232 -0.523873
2 1.370409 -0.948766 1.728098 0.361813
3 -1.274621 -1.224051 -0.749470 -2.712027
4 -0.303875 -0.177942 0.496359 0.048004
5 -0.941436 0.044570 -0.229654 0.092941
6 0.465798 -0.835244 0.131745 2.219413
7 0.875844 0.243440 -1.050471 1.761330
8 -1.274621 -1.224051 -0.749470 -2.712027
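Since DataFrame.append has since been removed (pandas 2.x), the modern equivalent uses pd.concat; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
row = df.iloc[1]

# replacement for the deprecated df.append(row, ignore_index=True):
# turn the Series back into a one-row frame, then concatenate
out = pd.concat([df, row.to_frame().T], ignore_index=True)
print(out)   # three rows; the last repeats row 1
```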

Grouping

By "group by" we are referring to a process involving one or more of the following steps:

  • Splitting: dividing the data into groups based on some criteria
  • Applying: executing a function on each group independently
  • Combining: assembling the results into a data structure
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
A B C D
0 foo one -0.491461 -0.550970
1 bar one -0.468956 0.584847
2 foo two 0.461989 0.372785
3 bar three 0.290600 -2.142788
4 foo two 0.448364 -2.036729
5 bar two 0.639793 0.577440
6 foo one 1.335309 -0.582853
7 bar three 0.629100 0.223494

Grouping and then applying sum() to the resulting groups:

df.groupby('A').sum()
# recent pandas versions warn here: pass numeric_only=True to aggregate only the numeric columns
C D
A
bar 1.090537 -0.757006
foo 1.754201 -2.797767

Grouping by multiple columns forms a hierarchical index, to which we then apply the function:

df.groupby(['A','B']).sum()
C D
A B
bar one -0.468956 0.584847
three 0.919700 -1.919294
two 0.639793 0.577440
foo one 0.843848 -1.133823
two 0.910352 -1.663943
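Beyond sum(), agg() applies several aggregations per group in one pass; a minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# agg() computes several statistics per group at once
result = df.groupby('A')['C'].agg(['sum', 'mean'])
print(result)
#      sum  mean
# A
# bar  6.0   3.0
# foo  4.0   2.0
```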

Reshaping

Stacking

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
A B
first second
bar one 0.138584 0.661800
two 0.206128 0.668363
baz one -0.531729 -0.695009
two 0.746672 -0.735620

The **stack()** method "compresses" a level in the DataFrame's columns:

stacked = df2.stack()
stacked
first  second   
bar    one     A    0.138584
               B    0.661800
       two     A    0.206128
               B    0.668363
baz    one     A   -0.531729
               B   -0.695009
       two     A    0.746672
               B   -0.735620
dtype: float64

With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

stacked.unstack()
A B
first second
bar one 0.138584 0.661800
two 0.206128 0.668363
baz one -0.531729 -0.695009
two 0.746672 -0.735620
stacked.unstack(1)
second one two
first
bar A 0.138584 0.206128
B 0.661800 0.668363
baz A -0.531729 0.746672
B -0.695009 -0.735620
stacked.unstack(0)
first bar baz
second
one A 0.138584 -0.531729
B 0.661800 -0.695009
two A 0.206128 0.746672
B 0.668363 -0.735620

Pivot tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
A B C D E
0 one A foo 0.014262 -0.732936
1 one B foo 0.853949 1.719331
2 two C foo -0.344640 0.765832
3 three A bar 1.908338 -1.447838
4 one B bar -1.469287 0.954153
5 one C bar -1.341587 -0.606839
6 two A foo 0.961927 0.054527
7 three B foo 0.829778 -0.581417
8 one C foo 0.623418 -0.456098
9 one A bar 1.745817 -0.684403
10 two B bar -1.457706 -0.272665
11 three C bar 1.336044 2.080870

We can produce a pivot table from this data very easily:

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C bar foo
A B
one A 1.745817 0.014262
B -1.469287 0.853949
C -1.341587 0.623418
three A 1.908338 NaN
B NaN 0.829778
C 1.336044 NaN
two A NaN 0.961927
B -1.457706 NaN
C NaN -0.344640

Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.

rng = pd.date_range('1/1/2012', periods=100, freq='S')
rng[:5]
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
               '2012-01-01 00:00:02', '2012-01-01 00:00:03',
               '2012-01-01 00:00:04'],
              dtype='datetime64[ns]', freq='S')
ts = pd.Series(np.random.randint(0,500,len(rng)), index=rng)
ts
2012-01-01 00:00:00    193
2012-01-01 00:00:01    447
2012-01-01 00:00:02    407
2012-01-01 00:00:03    450
2012-01-01 00:00:04    368
                      ... 
2012-01-01 00:01:35    102
2012-01-01 00:01:36    446
2012-01-01 00:01:37    290
2012-01-01 00:01:38    256
2012-01-01 00:01:39    154
Freq: S, Length: 100, dtype: int32

ts.resample('5Min') resamples the time series ts, changing the frequency of its index.

Specifically:

ts: a time series indexed by datetimes.
'5Min': the new target frequency, one bin every 5 minutes.

resample() groups the original observations into 5-minute bins aligned on 5-minute boundaries; an aggregation function is then applied to the values that fall into each bin.

ts.resample('5Min')
# resample() alone returns a Resampler object; chain an aggregation to get values,
# e.g. ts.resample('5Min').sum()  (the old how='sum' argument has been removed)
<pandas.core.resample.DatetimeIndexResampler object at 0x0000017E3D5A86D0>
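Chaining an aggregation onto the resampler produces actual values; a minimal sketch with a deterministic series (lowercase frequency aliases are used here, which work across pandas versions):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2012-01-01', periods=100, freq='s')
ts = pd.Series(np.arange(100), index=rng)

# downsample 100 one-second points into 5-minute bins, summing each bin;
# here all 100 seconds fall inside a single bin
binned = ts.resample('5min').sum()
print(binned)   # one row: the sum 0 + 1 + ... + 99 = 4950
```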

Time zone representation

rng = pd.date_range('3/6/2012', periods=5, freq='D')
rng
DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
               '2012-03-10'],
              dtype='datetime64[ns]', freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-03-06   -0.967180
2012-03-07   -0.917108
2012-03-08    0.252346
2012-03-09    0.461718
2012-03-10   -0.931543
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-06 00:00:00+00:00   -0.967180
2012-03-07 00:00:00+00:00   -0.917108
2012-03-08 00:00:00+00:00    0.252346
2012-03-09 00:00:00+00:00    0.461718
2012-03-10 00:00:00+00:00   -0.931543
Freq: D, dtype: float64

Converting to another time zone:

ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00   -0.967180
2012-03-06 19:00:00-05:00   -0.917108
2012-03-07 19:00:00-05:00    0.252346
2012-03-08 19:00:00-05:00    0.461718
2012-03-09 19:00:00-05:00   -0.931543
Freq: D, dtype: float64

Converting between time span representations

rng = pd.date_range('1/1/2012', periods=5, freq='M')
rng
DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
               '2012-05-31'],
              dtype='datetime64[ns]', freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-01-31   -0.841095
2012-02-29   -1.548459
2012-03-31   -0.473997
2012-04-30   -0.602313
2012-05-31   -0.519119
Freq: M, dtype: float64
ps = ts.to_period()
ps
2012-01   -0.841095
2012-02   -1.548459
2012-03   -0.473997
2012-04   -0.602313
2012-05   -0.519119
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01   -0.841095
2012-02-01   -1.548459
2012-03-01   -0.473997
2012-04-01   -0.602313
2012-05-01   -0.519119
Freq: MS, dtype: float64

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we take quarterly data whose year ends in November and convert each quarter to 9am on the first day of the quarter's final month:

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='period[Q-NOV]')
ts = pd.Series(np.random.randn(len(prng)), index = prng)
ts[:5]
1990Q1   -0.316562
1990Q2   -0.055698
1990Q3   -0.267225
1990Q4    0.514381
1991Q1   -0.716024
Freq: Q-NOV, dtype: float64
ts.index = prng.asfreq('M', 'end').asfreq('H', 'start') + 9
ts[:6]
1990-02-01 09:00   -0.316562
1990-05-01 09:00   -0.055698
1990-08-01 09:00   -0.267225
1990-11-01 09:00    0.514381
1991-02-01 09:00   -0.716024
1991-05-01 09:00    1.530323
Freq: H, dtype: float64

Categoricals

Since version 0.15, pandas has included categorical data in DataFrame.

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'e', 'e']})
df
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 e
5 6 e

Converting raw_grade to a categorical data type:

df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0    a
1    b
2    b
3    a
4    e
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

Renaming the categories to more meaningful names:

df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
# assigning to .cat.categories directly is deprecated; rename_categories() is the supported way

Reordering the categories and simultaneously adding the missing categories:

df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
0    very good
1         good
2         good
3    very good
4     very bad
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

Sorting is per the order of the categories, not lexical order:

df.sort_values(by="grade")
id raw_grade grade
4 5 e very bad
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good

Grouping by a categorical column also shows empty categories:

df.groupby("grade").size()
grade
very bad     2
bad          0
medium       0
good         2
very good    2
dtype: int64

Plotting

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

(figure: line plot of the cumulative random series)

On a DataFrame, **plot()** conveniently plots all of the columns with labels:

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')

(figure: line plot of all four cumulative columns, with a legend)

Getting data in/out

CSV

Writing to a csv file:

df.to_csv('data/foo.csv')

Reading from a csv file:

pd.read_csv('data/foo.csv')
Unnamed: 0 A B C D
0 2000-01-01 -0.055622 1.889998 -1.094622 0.591568
1 2000-01-02 -0.408948 3.388888 -0.550835 1.362462
2 2000-01-03 -0.041353 1.880637 -2.417740 0.383862
3 2000-01-04 -0.442517 0.807854 -2.409478 1.418281
4 2000-01-05 0.442292 -0.735053 -2.970804 1.435721
... ... ... ... ... ...
995 2002-09-22 -28.470361 14.628723 -68.388462 -37.630887
996 2002-09-23 -28.995055 14.517615 -68.247657 -36.363608
997 2002-09-24 -28.756586 15.852227 -68.328996 -36.032141
998 2002-09-25 -29.331636 15.557680 -68.450159 -35.825427
999 2002-09-26 -31.065863 16.141154 -68.852564 -36.462825

1000 rows × 5 columns
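The "Unnamed: 0" column above appears because the index was written out as an ordinary column; passing index_col and parse_dates when reading restores it. A minimal round-trip sketch (the temp-file path is illustrative):

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2),
                  index=pd.date_range('2000-01-01', periods=3),
                  columns=['A', 'B'])

path = os.path.join(tempfile.gettempdir(), 'roundtrip.csv')
df.to_csv(path)

# index_col=0 reads the first column back as the index instead of leaving
# it as 'Unnamed: 0'; parse_dates=True restores the datetime dtype
back = pd.read_csv(path, index_col=0, parse_dates=True)
print(back.index.dtype)   # datetime64[ns]
```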

HDF5

Reading and writing to HDFStores:

Writing to an HDF5 store:

df.to_hdf('data/foo.h5', 'df')

Reading from an HDF5 store:

pd.read_hdf('data/foo.h5', 'df')
A B C D
2000-01-01 -0.055622 1.889998 -1.094622 0.591568
2000-01-02 -0.408948 3.388888 -0.550835 1.362462
2000-01-03 -0.041353 1.880637 -2.417740 0.383862
2000-01-04 -0.442517 0.807854 -2.409478 1.418281
2000-01-05 0.442292 -0.735053 -2.970804 1.435721
... ... ... ... ...
2002-09-22 -28.470361 14.628723 -68.388462 -37.630887
2002-09-23 -28.995055 14.517615 -68.247657 -36.363608
2002-09-24 -28.756586 15.852227 -68.328996 -36.032141
2002-09-25 -29.331636 15.557680 -68.450159 -35.825427
2002-09-26 -31.065863 16.141154 -68.852564 -36.462825

1000 rows × 4 columns

Excel

Reading and writing to MS Excel:

Writing to an Excel file:

df.to_excel('data/foo.xlsx', sheet_name='Sheet1')

Reading from an Excel file:

pd.read_excel('data/foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Unnamed: 0 A B C D
0 2000-01-01 -0.055622 1.889998 -1.094622 0.591568
1 2000-01-02 -0.408948 3.388888 -0.550835 1.362462
2 2000-01-03 -0.041353 1.880637 -2.417740 0.383862
3 2000-01-04 -0.442517 0.807854 -2.409478 1.418281
4 2000-01-05 0.442292 -0.735053 -2.970804 1.435721
... ... ... ... ... ...
995 2002-09-22 -28.470361 14.628723 -68.388462 -37.630887
996 2002-09-23 -28.995055 14.517615 -68.247657 -36.363608
997 2002-09-24 -28.756586 15.852227 -68.328996 -36.032141
998 2002-09-25 -29.331636 15.557680 -68.450159 -35.825427
999 2002-09-26 -31.065863 16.141154 -68.852564 -36.462825

1000 rows × 5 columns

Pandas practice exercises: table of contents

1.Getting and knowing

  • Chipotle
  • Occupation
  • World Food Facts

2.Filtering and Sorting

  • Chipotle
  • Euro12
  • Fictional Army

3.Grouping

  • Alcohol Consumption
  • Occupation
  • Regiment

4.Apply

  • Students
  • Alcohol Consumption
  • US_Crime_Rates

5.Merge

  • Auto_MPG
  • Fictitious Names
  • House Market

6.Stats

  • US_Baby_Names
  • Wind_Stats

7.Visualization

  • Chipotle
  • Titanic Disaster
  • Scores
  • Online Retail
  • Tips

8.Creating Series and DataFrames

  • Pokemon

9.Time Series

  • Apple_Stock
  • Getting_Financial_Data
  • Investor_Flow_of_Funds_US

10.Deleting

  • Iris
  • Wine

How to use the exercises

Each exercise folder contains three files:

1.Exercises.ipynb

The file without answer code; this is where you do the exercises.

2.Solutions.ipynb

The expected results after running the code (do not modify).

3.Exercise_with_Solutions.ipynb

The file with answer code and comments.

Type your code into Exercises.ipynb and check whether your results match those in Solutions.ipynb; only look at the answers in Exercise_with_Solutions.ipynb if you really cannot complete an exercise.


Reposted from blog.csdn.net/weixin_47723732/article/details/132882530