GroupBy Mechanics
>>> import numpy as np
>>> from pandas import DataFrame,Series
>>> df = DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
>>> df
data1 data2 key1 key2
0 -1.012239 0.381608 a one
1 0.432161 -1.384340 a two
2 0.426435 -1.732019 b one
3 -1.388080 0.839690 b two
4 -0.439888 -0.603553 a one
>>> grouped = df['data1'].groupby(df['key1'])
>>> grouped.mean()
key1
a -0.339989
b -0.480822
Name: data1, dtype: float64
>>> means = df['data1'].groupby([df['key1'],df['key2']]).mean()
>>> means
key1  key2
a     one    -0.726064
      two     0.432161
b     one     0.426435
      two    -1.388080
Name: data1, dtype: float64
>>> means.unstack()
key2       one       two
key1
a    -0.726064  0.432161
b     0.426435 -1.388080
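Because the session above uses random numbers, its results cannot be reproduced exactly. The following is a minimal sketch of the same steps with fixed, made-up values (an assumption for illustration, not the data above), so each mean can be checked by hand.

```python
import pandas as pd

# Fixed values (not the random data above) so the results can be verified by hand.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Group one column by the values of another column, then aggregate.
by_key1 = df["data1"].groupby(df["key1"]).mean()

# Passing two key Series yields a result indexed by a two-level MultiIndex.
means = df["data1"].groupby([df["key1"], df["key2"]]).mean()

# unstack() pivots the inner index level (key2) into columns.
table = means.unstack()

print(by_key1)
print(table)
```

Here group a holds the values 1, 2 and 5, and group b holds 3 and 4, so the per-key1 means are 8/3 and 3.5.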
Using column names directly as group keys
>>> df.groupby('key1').mean()
         data1     data2
key1
a    -0.339989 -0.535428
b    -0.480822 -0.446165
>>> df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
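In the output of df.groupby('key1').mean() above, the non-numeric key2 column was dropped silently. To my knowledge, recent pandas releases (2.0 and later) raise an error in that situation instead, so the sketch below asks for numeric columns explicitly. The values are fixed and made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# key2 is a non-numeric column; numeric_only=True leaves it out of the
# result explicitly instead of relying on it being dropped silently.
means = df.groupby("key1").mean(numeric_only=True)

# size() counts the rows in each group and does not depend on column dtypes.
sizes = df.groupby(["key1", "key2"]).size()

print(means)
print(sizes)
```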
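Iterating over groups
A GroupBy object supports iteration and yields a sequence of 2-tuples, each holding the group name and the corresponding chunk of data.
>>> for name, group in df.groupby('key1'):
...     print(name)
...     print(group)
...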
a
data1 data2 key1 key2
0 -1.012239 0.381608 a one
1 0.432161 -1.384340 a two
4 -0.439888 -0.603553 a one
b
data1 data2 key1 key2
2 0.426435 -1.732019 b one
3 -1.388080 0.839690 b two
With multiple keys, the first element of each tuple is itself a tuple of key values:
>>> for (k1, k2), group in df.groupby(['key1', 'key2']):
...     print(k1, k2)
...     print(group)
...
a one
data1 data2 key1 key2
0 -1.012239 0.381608 a one
4 -0.439888 -0.603553 a one
a two
data1 data2 key1 key2
1 0.432161 -1.38434 a two
b one
data1 data2 key1 key2
2 0.426435 -1.732019 b one
b two
data1 data2 key1 key2
3 -1.38808 0.83969 b two
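The two loops above can be sketched for Python 3 as follows. Instead of printing each chunk, this version records which row labels land in each group, using fixed, made-up values. The names group_rows and pair_rows are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Single key: each iteration yields (group name, sub-DataFrame).
group_rows = {}
for name, group in df.groupby("key1"):
    group_rows[name] = list(group.index)

# Multiple keys: the group name is a tuple of key values.
pair_rows = {}
for (k1, k2), group in df.groupby(["key1", "key2"]):
    pair_rows[(k1, k2)] = list(group.index)

print(group_rows)
print(pair_rows)
```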
You can do whatever you like with these pieces of data, for example, collect them into a dict:
>>> pieces = dict(list(df.groupby('key1')))
>>> pieces['b']
data1 data2 key1 key2
2 0.426435 -1.732019 b one
3 -1.388080 0.839690 b two
>>> pieces['a']
data1 data2 key1 key2
0 -1.012239 0.381608 a one
1 0.432161 -1.384340 a two
4 -0.439888 -0.603553 a one
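When only one piece is needed, building the whole dict is optional. As a small sketch with made-up values, the GroupBy method get_group returns the same chunk that the dict lookup gives:

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

grouped = df.groupby("key1")

# Materialize every group into a dict keyed by group name.
pieces = dict(list(grouped))

# Fetch a single group directly, without building the dict.
b_rows = grouped.get_group("b")

print(pieces["b"])
print(b_rows)
```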
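By default, groupby groups on axis=0, but you can group on any of the other axes as well. For example, the columns of df can be grouped by dtype: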
>>> df.dtypes
data1 float64
data2 float64
key1 object
key2 object
dtype: object
>>> grouped = df.groupby(df.dtypes,axis=1)
>>> dict(list(grouped))
{dtype('O'):   key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one, dtype('float64'):       data1     data2
0 -1.012239  0.381608
1  0.432161 -1.384340
2  0.426435 -1.732019
3 -1.388080  0.839690
4 -0.439888 -0.603553}
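The axis=1 argument to groupby is deprecated in newer pandas releases (2.1 and later, as far as I know), and the suggested replacement is to transpose first. The sketch below assumes that approach. The labels "numeric" and "other" are hypothetical stand-ins for the raw dtype objects used above, and the data is made up.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Map each column name to a coarse label (hypothetical names, chosen here).
kinds = {
    col: "numeric" if is_numeric_dtype(df[col]) else "other"
    for col in df.columns
}

# Transposing turns columns into rows, so an ordinary row-wise groupby
# applies. Each chunk is transposed back to restore the original layout.
# Note: after the round trip the chunks hold object-dtype columns.
pieces = {kind: chunk.T for kind, chunk in df.T.groupby(kinds)}

print(pieces["numeric"])
print(pieces["other"])
```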
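Selecting a column or subset of columns
With large datasets, you will often want to aggregate only a few of the columns, for example: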
>>> df.groupby(['key1','key2'])[['data2']].mean()
              data2
key1 key2
a    one  -0.110972
     two  -1.384340
b    one  -1.732019
     two   0.839690
>>> s_grouped = df.groupby(['key1','key2'])['data2']
>>> s_grouped.mean()
key1  key2
a     one    -0.110972
      two    -1.384340
b     one    -1.732019
      two     0.839690
Name: data2, dtype: float64
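Indexing a GroupBy object with a column name is shorthand for selecting that column first and grouping it by the key columns. A minimal sketch with made-up values shows the equivalence, and also that a list of names gives a DataFrame result while a single name gives a Series:

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Shorthand: index the GroupBy object with a column name.
short = df.groupby(["key1", "key2"])["data2"].mean()

# Longhand: select the column, then group it by the key Series.
long = df["data2"].groupby([df["key1"], df["key2"]]).mean()

# A list of column names keeps the result as a DataFrame.
frame = df.groupby(["key1", "key2"])[["data2"]].mean()

print(short)
print(frame)
```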
Grouping with dicts and Series
>>> people = DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['Joe','Steve','Wes','Jim','Travis'])
>>> people.iloc[2:3, [1, 2]] = np.nan  # .ix was removed from pandas; set columns b and c of row 2 (Wes) by position
>>> people
a b c d e
Joe -0.507204 1.111102 -1.626998 -1.191771 0.386699
Steve 1.225585 1.202014 0.089095 0.004328 -0.660203
Wes -0.641992 NaN NaN -1.612848 0.327813
Jim 1.271822 -0.117422 0.919063 -0.254136 -0.957631
Travis 0.690725 -1.098159 -0.757635 -0.794666 -1.297784
>>> mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
>>> by_column = people.groupby(mapping,axis=1)
>>> by_column.sum()
blue red
Joe -2.818768 0.990597
Steve 0.093423 1.767396
Wes -1.612848 -0.314179
Jim 0.664926 0.196769
Travis -1.552301 -1.705217
>>> map_series = Series(mapping)
>>> map_series
a red
b red
c blue
d blue
e red
f orange
dtype: object
>>> people.groupby(map_series,axis=1).count()
blue red
Joe 2 3
Steve 2 3
Wes 1 2
Jim 2 3
Travis 2 3
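The same column grouping can be sketched without axis=1, which newer pandas releases deprecate, by transposing before and after. This is one possible rewrite under that assumption, with fixed values (0 to 24) rather than the random data above. Note that the unused key 'f' in the mapping causes no problem.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    np.arange(25.0).reshape(5, 5),
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)
people.iloc[2:3, [1, 2]] = np.nan  # blank out b and c for Wes

mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f": "orange"}  # 'f' is unused

# Transpose so the column labels become row labels, group by the mapping,
# aggregate, then transpose back.
by_color_sum = people.T.groupby(mapping).sum().T

# A Series works the same way as the dict.
map_series = pd.Series(mapping)
by_color_count = people.T.groupby(map_series).count().T

print(by_color_sum)
print(by_color_count)
```

For Joe, red is a + b + e = 0 + 1 + 4 = 5 and blue is c + d = 2 + 3 = 5. For Wes, the missing b and c are skipped, so the red count is 2 and the blue count is 1.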
Grouping with functions
>>> import numpy as np
>>> from pandas import DataFrame,Series
>>> people = DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['Joe','Steve','Wes','Jim','Travis'])
>>> people.iloc[2:3, [1, 2]] = np.nan  # .ix was removed from pandas; set columns b and c of row 2 (Wes) by position
>>> people.groupby(len).sum()
a b c d e
3 2.080547 -0.604547 -0.604366 -1.513836 0.497836
5 0.079461 -1.729398 -0.901477 0.569260 0.302427
6 0.005069 -0.035869 -0.793810 1.150144 2.031785
>>> people
a b c d e
Joe 1.119423 -0.345290 0.668423 -0.658008 0.413723
Steve 0.079461 -1.729398 -0.901477 0.569260 0.302427
Wes -0.556755 NaN NaN -0.992753 0.124015
Jim 1.517879 -0.259257 -1.272789 0.136925 -0.039903
Travis 0.005069 -0.035869 -0.793810 1.150144 2.031785
The next example groups first by name length and then by the one/two labels in key_list:
>>> key_list = ['one','one','one','two','two']
>>> people.groupby([len,key_list]).min()
              a         b         c         d         e
3 one -0.556755 -0.345290  0.668423 -0.992753  0.124015
  two  1.517879 -0.259257 -1.272789  0.136925 -0.039903
5 one  0.079461 -1.729398 -0.901477  0.569260  0.302427
6 two  0.005069 -0.035869 -0.793810  1.150144  2.031785
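A function passed as a group key is called once per index value, and its return value becomes the group name. The sketch below repeats both calls with fixed values (0 to 24, an assumption for illustration) so the sums and minimums can be checked by hand.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    np.arange(25.0).reshape(5, 5),
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)
people.iloc[2:3, [1, 2]] = np.nan  # blank out b and c for Wes

# len is applied to each index label: Joe, Wes, Jim -> 3; Steve -> 5; Travis -> 6.
by_len = people.groupby(len).sum()

# Functions can be mixed with arrays or lists of the same length as the index.
key_list = ["one", "one", "one", "two", "two"]
mins = people.groupby([len, key_list]).min()

print(by_len)
print(mins)
```

Column a for the length-3 group is 0 + 10 + 15 = 25, and column b is 1 + 16 = 17 because the missing value for Wes is skipped.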
Grouping by index level
>>> import pandas as pd
>>> columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3]],names = ['cty','tenor'])
>>> hier_df = DataFrame(np.random.randn(4,5),columns = columns)
>>> hier_df
cty          US                            JP
tenor         1         3         5         1         3
0      0.839657  0.656362  1.034138 -1.107702  0.687075
1      0.979355  0.581277  1.024826 -0.617576  0.117190
2      0.579184 -0.629204  1.849724 -0.738685 -1.937523
3      0.168968 -0.352462 -0.791173 -0.628160  0.391682
>>> hier_df.groupby(level='cty',axis=1).count()
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
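Level-based grouping on the columns can also be sketched without axis=1, which newer pandas releases deprecate, by transposing so that the column levels become row index levels. The values below are fixed (0 to 19) and made up for illustration.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)
hier_df = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=columns)

# After transposing, 'cty' is a row index level, so level= applies on axis 0.
# Transposing the result restores one row per original row.
counts = hier_df.T.groupby(level="cty").count().T

print(counts)
```

Each row has three US columns and two JP columns, so every row of counts reads 2 for JP and 3 for US.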