pandas Data Processing in Practice, Part 4 (time series with date_range, binning with cut, grouping with GroupBy)

Time series:

Key function

pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)

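Parameters:

start: str or datetime-like, optional

Left bound for generating dates.

end: str or datetime-like, optional

Right bound for generating dates.

periods: integer, optional

Number of periods to generate.

freq: str or DateOffset, default 'D' (calendar daily)

Frequency strings can have multiples, e.g. '5H'.

tz: str or tzinfo, optional

Time zone name for returning a localized DatetimeIndex, for example 'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is timezone-naive.

normalize: bool, default False

Normalize the start/end dates to midnight before generating the date range.

name: str, default None

Name of the resulting DatetimeIndex.

closed: {None, 'left', 'right'}, optional

Make the interval closed with respect to the given frequency on the 'left', on the 'right', or on both sides (None, the default). Newer pandas releases replace this parameter with inclusive.

**kwargs

For compatibility. Has no effect on the result.

Returns a fixed-frequency DatetimeIndex.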
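The parameter list above can be exercised with a short sketch. It assumes only that pandas is installed; the variable names are made up for illustration. Of start, end, periods and freq, date_range needs three to pin down the range (freq falls back to 'D' when omitted).

```python
import pandas as pd

# start + end: one timestamp per calendar day (freq defaults to 'D'), both ends included
daily = pd.date_range(start="2018-01-01", end="2018-01-10")

# start + periods + freq: the frequency string may carry a multiple such as '2D'
every_other_day = pd.date_range(start="2018-01-01", periods=4, freq="2D")

# start + end + periods, no freq: evenly spaced points between the two bounds
evenly_spaced = pd.date_range(start="2018-01-01", end="2018-01-03", periods=5)

# normalize=True snaps the start to midnight before the range is generated
normalized = pd.date_range(start="2018-01-01 15:30", periods=2, normalize=True)

# name labels the resulting DatetimeIndex
named = pd.date_range(start="2018-01-01", periods=3, name="day")

print(len(daily))                                        # 10
print(list(every_other_day.strftime("%Y-%m-%d")))
print(list(evenly_spaced.strftime("%Y-%m-%d %H:%M")))
print(list(normalized.strftime("%Y-%m-%d %H:%M")))
print(named.name)                                        # day
```

The tz and closed parameters are left out of the sketch because their behavior depends on the installed time zone database and on the pandas version.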

Several ways to generate a time series, and resampling:

In [104]: from datetime import datetime # import the datetime class
     ...: t1 = datetime(2009,10,20) # define a single date directly
     ...:
     ...:

In [105]: t1
Out[105]: datetime.datetime(2009, 10, 20, 0, 0)

In [106]: # build a list of dates
     ...: date_list = [
     ...:     datetime(2018,10,1),
     ...:     datetime(2018,10,2),
     ...:     datetime(2018,10,5),
     ...:     datetime(2018,10,7)
     ...: ]

In [107]: date_list
Out[107]:
[datetime.datetime(2018, 10, 1, 0, 0),
 datetime.datetime(2018, 10, 2, 0, 0),
 datetime.datetime(2018, 10, 5, 0, 0),
 datetime.datetime(2018, 10, 7, 0, 0)]

In [108]: s1 = Series(np.random.randn(4),index=date_list) # give the time series its values, indexed by the dates

In [109]: s1
Out[109]:
2018-10-01    0.433032
2018-10-02   -1.180358
2018-10-05   -1.583058
2018-10-07   -1.200917
dtype: float64

In [110]: s1.values
Out[110]: array([ 0.43303189, -1.1803582 , -1.58305798, -1.20091707])

In [111]: s1.index
Out[111]: DatetimeIndex(['2018-10-01', '2018-10-02', '2018-10-05', '2018-10-07'], dtype='datetime64[ns]', freq=None)

In [112]: # quickly generate a time series: pd.date_range

In [113]: data_list_new = pd.date_range('2018-01-01',periods=100,freq='H') # 100 hourly timestamps starting at 2018-01-01 00:00

In [114]: len(data_list_new)
Out[114]: 100

In [115]: s2 = Series(np.random.rand(100),index=data_list_new)

In [116]: s2.head()
Out[116]:
2018-01-01 00:00:00    0.891556
2018-01-01 01:00:00    0.953536
2018-01-01 02:00:00    0.321705
2018-01-01 03:00:00    0.150378
2018-01-01 04:00:00    0.180122
Freq: H, dtype: float64

In [117]: t_range = pd.date_range('20180101','20181231')

In [118]: t_range
Out[118]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10',
               ...
               '2018-12-22', '2018-12-23', '2018-12-24', '2018-12-25',
               '2018-12-26', '2018-12-27', '2018-12-28', '2018-12-29',
               '2018-12-30', '2018-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')

In [119]: s1 = Series(np.random.randn(len(t_range)),index=t_range)

In [120]: s1.head()
Out[120]:
2018-01-01    0.442134
2018-01-02    1.726818
2018-01-03   -1.157719
2018-01-04    1.179449
2018-01-05    0.974630
Freq: D, dtype: float64

In [121]: # resample the time series

In [122]: s1['2018-01'].mean()
Out[122]: 0.03117062119001378

In [123]: s1_month = s1.resample('M').mean() # resample by month

In [124]: s1_month.index
Out[124]:
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='M')

In [125]: s1.resample('H').bfill().head()
Out[125]:
2018-01-01 00:00:00    0.442134
2018-01-01 01:00:00    1.726818
2018-01-01 02:00:00    1.726818
2018-01-01 03:00:00    1.726818
2018-01-01 04:00:00    1.726818
Freq: H, dtype: float64
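Because the session above uses random values, its numbers cannot be reproduced. The following is a minimal sketch of the same two operations (downsampling to monthly means, upsampling with a fill) on deterministic data. The frequency aliases 'MS' and 'D' are chosen here because they are accepted across pandas versions; they are not the aliases used in the session.

```python
import numpy as np
import pandas as pd

# Deterministic daily series for January and February 2018: values 0, 1, ..., 58
days = pd.date_range("2018-01-01", "2018-02-28")
s = pd.Series(np.arange(len(days), dtype=float), index=days)

# Partial-string selection picks every row that falls in January
january_mean = s.loc["2018-01"].mean()

# Downsampling: one mean per month ('MS' labels each bin by the month start)
monthly = s.resample("MS").mean()

# Upsampling: keep one value every 2 days, then resample back to daily
sparse = s.iloc[:5:2]                        # 2018-01-01, 2018-01-03, 2018-01-05
filled_back = sparse.resample("D").bfill()   # each gap takes the NEXT known value
filled_fwd = sparse.resample("D").ffill()    # each gap takes the PREVIOUS known value

print(january_mean)
print(monthly)
print(filled_back.tolist())
print(filled_fwd.tolist())
```

The January mean from the partial-string selection matches the first entry of the monthly resample, which is the same check the session performs with In [122] and In [123]. The bfill output also explains Out[125]: every hour after midnight takes the value of the following day.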

Data binning:

pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')

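This function turns scattered values into interval data. For example, student scores from 0 to 100 can be split into (0,59], (59,79], (79,89], (89,100]; ages can be segmented the same way. The result has one entry per original value, each replaced by the bin it falls into, and the bins can be given custom labels. An example follows:

   Binning student scores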

In [1]: import numpy as np
   ...: import pandas as pd
   ...: from pandas import Series,DataFrame
   ...:
   ...:

In [2]: score_list = np.random.randint(0,100,size=100) # 100 random student scores; randint's upper bound is exclusive, so values range from 0 to 99

In [3]: score_list
Out[3]:
array([56, 80, 89,  3, 45, 56, 65, 48, 12, 20, 13, 37,  1, 85, 64, 50, 72,
       43,  8, 15,  9, 16, 63, 41, 68, 98,  2, 18, 78, 83, 54, 90, 81, 64,
       98, 48, 52, 67,  1,  7, 24, 98, 83, 57, 57, 36, 90, 48, 59, 72,  4,
        8,  2, 26, 16, 91, 26,  9, 66, 92, 22,  3, 91, 72, 90, 28, 74, 88,
       89, 79, 13, 91, 57, 98, 63, 68, 63, 73, 33, 33, 99, 55, 18, 87, 60,
       53, 24, 77, 85, 70, 57, 58, 75, 86, 88, 43, 52,  4, 71, 16])

In [4]: bins = [0,59,79,89,100] # score intervals: (0,59], (59,79], (79,89], (89,100]

In [5]: score_cut = pd.cut(score_list,bins) # split the scores according to bins with pd.cut()

In [18]: len(score_cut)  # still 100 entries, one per score, now binned; labels can be added
Out[18]: 100

In [6]: score_cut # the return type is pandas.core.arrays.categorical.Categorical
Out[6]:
[(0, 59], (79, 89], (79, 89], (0, 59], (0, 59], ..., (0, 59], (0, 59], (0, 59], (59, 79], (0, 59]]
Length: 100
Categories (4, interval[int64]): [(0, 59] < (59, 79] < (79, 89] < (89, 100]]

In [7]: type(score_cut)
Out[7]: pandas.core.arrays.categorical.Categorical

In [8]: pd.value_counts(score_cut) # number of students in each interval
Out[8]:
(0, 59]      54
(59, 79]     22
(89, 100]    12
(79, 89]     12
dtype: int64
# preparation for later processing
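The edge behavior of pd.cut is easier to see on scores that sit exactly on the bin boundaries. The sketch below uses a hand-picked list (not the random scores from the session) and shows that with the default right=True a score of exactly 0 falls outside (0, 59] and becomes NaN unless include_lowest=True is passed.

```python
import pandas as pd

# Scores chosen to sit on the bin edges
scores = pd.Series([0, 59, 60, 79, 80, 89, 90, 100])
bins = [0, 59, 79, 89, 100]

# Default right=True: intervals are (0, 59], (59, 79], (79, 89], (89, 100]
default_cut = pd.cut(scores, bins)

# include_lowest=True closes the first interval on the left, so 0 is kept
with_lowest = pd.cut(scores, bins, include_lowest=True)

# labels replaces the interval notation with names, one per bin
labeled = pd.cut(scores, bins, include_lowest=True,
                 labels=["low", "ok", "good", "great"])

# Count the members of each bin, keeping the bin order
counts = labeled.value_counts(sort=False)

print(default_cut.isna().tolist())   # only the first entry (score 0) is NaN
print(labeled.tolist())
print(counts)
```

Since np.random.randint(0, 100) can return 0, passing include_lowest=True is a reasonable safeguard for the random scores above; whether the original session needed it depends on the values drawn.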

Binning DataFrame data

We reuse the data above.

In [9]: df = DataFrame() # create an empty DataFrame

In [10]: df['score_list'] = score_list # fill in the scores

In [11]: df.head() # show the first 5 rows
Out[11]:
   score_list
0          56
1          80
2          89
3           3
4          45

In [12]: df['name'] = [pd.util.testing.rands(3) for i in range(100)]
    ...: # pd.util.testing.rands() generates a random string, used here as the student's name; this helper may not be available in recent pandas releases

In [13]: df.head() # show the first 5 students
Out[13]:
   score_list name
0          56  puk
1          80  VUL
2          89  cwz
3           3  uVb
4          45  sRN

In [14]: # add the binning result as a column

In [15]: # add the binning result as a column, grading the score ranges as low, ok, good, great

In [16]: df['Categories'] = pd.cut(df['score_list'],bins,labels=['low','ok','good','great'])

In [17]: df.head(10)
Out[17]:
   score_list name Categories
0          56  puk        low
1          80  VUL       good
2          89  cwz       good
3           3  uVb        low
4          45  sRN        low
5          56  3vM        low
6          65  wp8         ok
7          48  lSF        low
8          12  AkT        low
9          20  tgb        low
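The same table can be rebuilt without pd.util.testing.rands. The sketch below swaps in the standard library for the random names; random_name is a hypothetical helper written for this illustration, and the seeds are there only to make the run repeatable.

```python
import random
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded generator for the scores
random.seed(0)                   # seeded generator for the names


def random_name(length=3):
    """Hypothetical stand-in for rands(): a random string of letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=length))


# 100 scores in [0, 100) and one random 3-character name per student
df = pd.DataFrame({"score_list": rng.integers(0, 100, size=100)})
df["name"] = [random_name() for _ in range(len(df))]

# Bin the score column and store the grade as a new column;
# include_lowest=True keeps a score of exactly 0 in the 'low' bin
bins = [0, 59, 79, 89, 100]
df["Categories"] = pd.cut(df["score_list"], bins,
                          labels=["low", "ok", "good", "great"],
                          include_lowest=True)

print(df.head(10))
print(df["Categories"].value_counts(sort=False))   # students per grade
```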

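Grouping with GroupBy

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)

This function handles grouping. When certain features of the data are of interest, the rows can be split by those features and each group processed on its own. Take this data as an example: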

	date	city	temperature	wind
0	03/01/2016	BJ	8	5
1	17/01/2016	BJ	12	2
2	31/01/2016	BJ	19	2
3	14/02/2016	BJ	-3	3
4	28/02/2016	BJ	19	2
5	13/03/2016	BJ	5	3
6	27/03/2016	SH	-4	4
7	10/04/2016	SH	19	3
8	24/04/2016	SH	20	3
9	08/05/2016	SH	17	3
10	22/05/2016	SH	4	2
11	05/06/2016	SH	-10	4
12	19/06/2016	SH	0	5
13	03/07/2016	SH	-9	5
14	17/07/2016	GZ	10	2
15	31/07/2016	GZ	-1	5
16	14/08/2016	GZ	1	5
17	28/08/2016	GZ	25	4
18	11/09/2016	SZ	20	1
19	25/09/2016	SZ	-10	4

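The data holds weather records for four cities, but in this flat table it is awkward to compute per-city statistics such as the mean, maximum, and minimum, or to plot each city. We can group by 'city' and then work on each group through the attributes of the grouped object, some of which are:

gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight

As the list shows, the grouped object offers many methods for processing data, as well as plotting. Concrete examples follow: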


In [59]: import numpy as np
    ...: import pandas as pd
    ...: from pandas import Series,DataFrame
    ...:
    ...:

In [60]: df = pd.read_csv('city_weather.csv')

In [61]: df.head()
Out[61]:
         date city  temperature  wind
0  03/01/2016   BJ            8     5
1  17/01/2016   BJ           12     2
2  31/01/2016   BJ           19     2
3  14/02/2016   BJ           -3     3
4  28/02/2016   BJ           19     2

In [62]: gb = df.groupby(df['city']) # group by city: BJ, GZ, SH, SZ

gb.<tab> # many attributes are available

gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight

In [65]: gb.groups # group keys and the row indices of each group
Out[65]:
{'BJ': Int64Index([0, 1, 2, 3, 4, 5], dtype='int64'),
 'GZ': Int64Index([14, 15, 16, 17], dtype='int64'),
 'SH': Int64Index([6, 7, 8, 9, 10, 11, 12, 13], dtype='int64'),
 'SZ': Int64Index([18, 19], dtype='int64')}

In [67]: gb.get_group('BJ').mean() # mean temperature and wind for BJ
Out[67]:
temperature    10.000000
wind            2.833333
dtype: float64

In [69]: gb.max()
Out[69]:
            date  temperature  wind
city
BJ    31/01/2016           19     5
GZ    31/07/2016           25     5
SH    27/03/2016           20     5
SZ    25/09/2016           20     4
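Two details of the session are worth a closer look. The date column is read as plain strings, so gb.max() compares it lexicographically: '31/01/2016' wins for BJ even though 13/03/2016 is the later day. Calling mean() on a whole group also relies on pandas skipping the non-numeric columns, which newer releases may not do. The sketch below rebuilds the 20 rows of the table inline (so city_weather.csv is not needed), parses the dates, and selects the numeric columns explicitly; it is one possible approach, not the only one.

```python
import pandas as pd

# The 20 rows of the table above, rebuilt inline
rows = [
    ("03/01/2016", "BJ", 8, 5), ("17/01/2016", "BJ", 12, 2),
    ("31/01/2016", "BJ", 19, 2), ("14/02/2016", "BJ", -3, 3),
    ("28/02/2016", "BJ", 19, 2), ("13/03/2016", "BJ", 5, 3),
    ("27/03/2016", "SH", -4, 4), ("10/04/2016", "SH", 19, 3),
    ("24/04/2016", "SH", 20, 3), ("08/05/2016", "SH", 17, 3),
    ("22/05/2016", "SH", 4, 2), ("05/06/2016", "SH", -10, 4),
    ("19/06/2016", "SH", 0, 5), ("03/07/2016", "SH", -9, 5),
    ("17/07/2016", "GZ", 10, 2), ("31/07/2016", "GZ", -1, 5),
    ("14/08/2016", "GZ", 1, 5), ("28/08/2016", "GZ", 25, 4),
    ("11/09/2016", "SZ", 20, 1), ("25/09/2016", "SZ", -10, 4),
]
df = pd.DataFrame(rows, columns=["date", "city", "temperature", "wind"])

# Parse the day/month/year strings into real timestamps
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

gb = df.groupby("city")

sizes = gb.size()                                        # rows per city
means = gb[["temperature", "wind"]].mean()               # numeric columns only
temp_stats = gb["temperature"].agg(["mean", "max", "min"])
latest = gb["date"].max()                                # chronological maximum

# Iterating over a GroupBy yields (key, sub-DataFrame) pairs
for city, group in gb:
    print(city, len(group), group["temperature"].max())

print(means)
print(latest)
```

With parsed dates the latest BJ record is 2016-03-13, and the BJ means agree with Out[67].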

gb.plot()

For other features, see the official pandas documentation.

Reposted from blog.csdn.net/weixin_42398658/article/details/82936525