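I. Categorical Data
1. Concept:
A "categorical variable" is a qualitative variable that takes only a limited number of values, each of which is a mutually exclusive category or attribute. In pandas its dtype is named category, and it is either
"ordered" (e.g. degree of improvement) or "unordered" (e.g. sex). Categorical data is usually stored as distinct ints, a scheme known as "categorical encoding" or "dictionary encoding"; these ints
are called "category codes" or simply "codes". When the number of distinct values is small relative to the number of rows, this usually speeds up analysis and saves memory, and the categories can be transformed while the codes stay unchanged, e.g.:
① Rename categories
② Add a new category without changing the order of the existing categories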
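The points above can be checked with a minimal sketch. The sample data (a two-value string column repeated many times) and all variable names are assumptions made for illustration; actual memory figures depend on the pandas version and on the data.

```python
import pandas as pd

# Hypothetical sample data: a low-cardinality string column with 100,000 rows.
raw = pd.Series(["male", "female", "female", "male"] * 25_000)
cat = raw.astype("category")

# Each value is stored as a small integer code pointing into `categories`.
codes_before = cat.cat.codes.copy()
print(list(cat.cat.categories))        # ['female', 'male']
print(cat.cat.codes.head().tolist())   # [1, 0, 0, 1, 1]

# Compare the memory footprint of the raw strings and the categorical copy.
raw_bytes = raw.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
print(raw_bytes, cat_bytes)

# Renaming categories or appending a new one leaves the codes untouched.
renamed = cat.cat.rename_categories({"male": "M", "female": "F"})
extended = cat.cat.add_categories(["unknown"])
print((renamed.cat.codes == codes_before).all())    # True
print((extended.cat.codes == codes_before).all())   # True
```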
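2. Creation
(1) By instantiation:
pd.Categorical(<values>[,categories=None,ordered=False,dtype=None,fastpath=False])
#Note: fastpath is an internal parameter and is deprecated in recent pandas releases; do not rely on it
#Parameters:
values: the values; list-like
#values that are not present in categories are replaced with NaN
categories: the full set of categories; Index-like (values must be unique); defaults to the unique values that occur in <values>
#categories may be of any immutable (hashable) type
ordered: whether the categories are ordered; bool
#the order is the order of categories; when categories is omitted, the categories are the sorted unique values, regardless of the order in which the values appear
#Examples: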
>>> pd.Categorical(["A","B","B","C","D","C","A"])
['A', 'B', 'B', 'C', 'D', 'C', 'A']
Categories (4, object): ['A', 'B', 'C', 'D']
>>> pd.Categorical(["A","B","B","C","D","C","A"],ordered=True)
['A', 'B', 'B', 'C', 'D', 'C', 'A']
Categories (4, object): ['A' < 'B' < 'C' < 'D']
>>> pd.Categorical(["A","B","B","D","C","A"],ordered=True)
['A', 'B', 'B', 'D', 'C', 'A']
Categories (4, object): ['A' < 'B' < 'C' < 'D']
>>> pd.Categorical(["A","B","B","C","D","C","A"],categories=["A","B","C"])
['A', 'B', 'B', 'C', NaN, 'C', 'A']
Categories (3, object): ['A', 'B', 'C']
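When the natural ranking differs from alphabetical order, passing categories explicitly together with ordered=True fixes the ranking. The following is a small sketch on made-up survey answers (the labels and variable names are assumptions):

```python
import pandas as pd

# Hypothetical survey answers; the category list gives the ranking, lowest first.
levels = ["low", "medium", "high"]
answers = pd.Categorical(
    ["medium", "low", "high", "low"], categories=levels, ordered=True
)

print(answers.min(), answers.max())     # low high
print((answers > "low").tolist())       # [True, False, True, False]
print(answers.sort_values().tolist())   # ['low', 'low', 'medium', 'high']
```

Without the explicit categories, the inferred order would be alphabetical ('high' < 'low' < 'medium'), which is not the intended ranking.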
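(2) Via the from_codes constructor:
pd.Categorical.from_codes(<codes>[,categories=None,ordered=None,dtype=None])
#Parameters:
codes: the data; int array-like (-1 denotes NaN)
#each int is the position, within categories (or the categories of dtype), of the category to use
categories: the full set of categories; Index-like (values must be unique)
#categories may be of any immutable (hashable) type
ordered: whether the categories are ordered; bool
#the order is the order in which categories are given
dtype: the dtype; CategoricalDtype/"category"
#a CategoricalDtype cannot be combined with categories/ordered; the available categories must be supplied through either dtype or categories
#Examples: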
>>> pd.Categorical.from_codes([1,0,0,1,1,0],categories=["a","b"])
['b', 'a', 'a', 'b', 'b', 'a']
Categories (2, object): ['a', 'b']
>>> pd.Categorical.from_codes([1,0,0,1,1,0],dtype=pd.Categorical(["A","B"]).dtype)
['B', 'A', 'A', 'B', 'B', 'A']
Categories (2, object): ['A', 'B']
>>> pd.Categorical.from_codes([1,0,0,1,1,0],categories=["a","b"],ordered=True)
['b', 'a', 'a', 'b', 'b', 'a']
Categories (2, object): ['a' < 'b']
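The role of -1 and the meaning of a code as a position in categories can be illustrated with a short sketch (the codes and category labels are made up):

```python
import pandas as pd

categories = ["a", "b", "c"]
codes = [0, 2, -1, 1, 2]   # -1 marks a missing value

c = pd.Categorical.from_codes(codes, categories=categories)
print(c.isna().tolist())   # [False, False, True, False, False]

# Round trip: the stored codes match the input codes.
print(c.codes.tolist())    # [0, 2, -1, 1, 2]

# Decoding by hand: look each non-missing code up in the category list.
decoded = [categories[i] if i >= 0 else None for i in c.codes]
print(decoded)             # ['a', 'c', None, 'b', 'c']
```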
(3) Via a Series/DataFrame:
>>> s1=pd.Series(["A","B","C","D"],dtype="category")
>>> s2=pd.Series(["A","B","C","D"])
>>> s2=s2.astype("category")
>>> s1.dtype,s2.dtype
(CategoricalDtype(categories=['A', 'B', 'C', 'D'], ordered=False), CategoricalDtype(categories=['A', 'B', 'C', 'D'], ordered=False))
>>> type(s1.values)
<class 'pandas.core.arrays.categorical.Categorical'>
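For a DataFrame column, an explicit CategoricalDtype can be passed to astype to fix both the allowed categories and their order. A minimal sketch with a made-up table (the column names and values are assumptions):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# An explicit dtype fixes both the allowed categories and their order.
size_type = CategoricalDtype(categories=["S", "M", "L"], ordered=True)

df = pd.DataFrame({"size": ["M", "S", "L", "XL"], "qty": [3, 1, 2, 5]})
df["size"] = df["size"].astype(size_type)

# "XL" is not an allowed category, so it becomes NaN.
print(df["size"].isna().tolist())               # [False, False, False, True]

# Sorting follows the category order (S < M < L); NaN goes last.
print(df.sort_values("size")["qty"].tolist())   # [1, 3, 2, 5]
```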
3. The cat accessor:
The cat accessor of a Series is the entry point to the categorical attributes and methods:
>>> s1.cat.ordered
False
>>> s2=s2.cat.set_categories(["A","B","C","D"],ordered=True)
>>> s2.cat.ordered
True
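A few more uses of the accessor, as a sketch on a small made-up Series. Methods called through cat return a new Series, and the accessor is only available when the dtype is category:

```python
import pandas as pd

s = pd.Series(["B", "A", "C", "A"], dtype="category")

print(list(s.cat.categories))   # ['A', 'B', 'C']
print(s.cat.codes.tolist())     # [1, 0, 2, 0]

# Accessor methods return a new Series; the original is unchanged.
lower = s.cat.rename_categories(str.lower)
print(lower.tolist())           # ['b', 'a', 'c', 'a']
print(s.tolist())               # ['B', 'A', 'C', 'A']

# The accessor only exists for categorical data.
try:
    pd.Series(["A", "B"]).cat
    has_cat = True
except AttributeError as exc:
    has_cat = False
    print("AttributeError:", exc)
```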
4. Attributes and methods
(1) Attributes:
All categories: <c>.categories
#Example (continued):
>>> s1.values.categories
Index(['A', 'B', 'C', 'D'], dtype='object')
######################################################################################################################
All codes: <c>.codes
#Example (continued):
>>> s1.values.codes
array([0, 1, 2, 3], dtype=int8)
######################################################################################################################
Whether the categories are ordered: <c>.ordered
#Example (continued):
>>> s1.values.ordered
False
######################################################################################################################
The dtype: <c>.dtype
#returns a pandas.core.dtypes.dtypes.CategoricalDtype
>>> s1.values.dtype
CategoricalDtype(categories=['A', 'B', 'C', 'D'], ordered=False)
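(2) Methods:
#Note: the inplace parameter shown in the signatures below was deprecated during the pandas 1.x series and removed in pandas 2.0; every method returns a new object, so assign the result instead
Add categories: <c>.add_categories(<new_categories>[,inplace=False])
#Parameters:
new_categories: the categories to add; a category or a list-like of categories
#Example (continued):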
>>> s1.values.add_categories("Z")
['A', 'B', 'C', 'D']
Categories (5, object): ['A', 'B', 'C', 'D', 'Z']
######################################################################################################################
Make the categories ordered: <c>.as_ordered([inplace=False])
#Example (continued):
>>> s1.values.as_ordered()
['A', 'B', 'C', 'D']
Categories (4, object): ['A' < 'B' < 'C' < 'D']
######################################################################################################################
Make the categories unordered: <c>.as_unordered([inplace=False])
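A short sketch of as_unordered on made-up data. It also shows one practical consequence: order comparisons such as < are only defined for ordered categoricals.

```python
import pandas as pd

c = pd.Categorical(["A", "B", "C"], ordered=True)
u = c.as_unordered()
print(c.ordered, u.ordered)   # True False

# Order comparisons need an ordered categorical.
print((c < "B").tolist())     # [True, False, False]
try:
    u < "B"
    comparison_failed = False
except TypeError as exc:
    comparison_failed = True
    print("TypeError:", exc)
```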
######################################################################################################################
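Remove the given categories: <c>.remove_categories(<removals>[,inplace=False])
#Note: values that belonged to a removed category become NaN
#Parameters:
removals: the categories to remove; a category or a list-like of categories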
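A minimal sketch of remove_categories on made-up data; values of the removed category turn into NaN, and asking to remove a category that does not exist raises an error:

```python
import pandas as pd

c = pd.Categorical(["a", "b", "c", "a"])

r = c.remove_categories("b")   # a single category or a list of categories
print(list(r.categories))      # ['a', 'c']
print(r.isna().tolist())       # [False, True, False, False]

# Removing a category that does not exist raises ValueError.
try:
    c.remove_categories(["z"])
    removal_failed = False
except ValueError as exc:
    removal_failed = True
    print("ValueError:", exc)
```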
######################################################################################################################
Remove unused categories: <c>.remove_unused_categories([inplace=False])
#Example (every category here is in use, so nothing is removed):
>>> c=pd.Categorical.from_codes([1,0,0,1,1,0],categories=["a","b"])
>>> c
['b', 'a', 'a', 'b', 'b', 'a']
Categories (2, object): ['a', 'b']
>>> c.remove_unused_categories()
['b', 'a', 'a', 'b', 'b', 'a']
Categories (2, object): ['a', 'b']
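Since both categories are in use in the example above, the call changes nothing there. A sketch with a category that never occurs in the values (the label "z" is made up) shows the effect:

```python
import pandas as pd

# "z" is declared as a category but never appears among the values.
c = pd.Categorical(["a", "b", "a"], categories=["a", "b", "z"])
trimmed = c.remove_unused_categories()

print(list(c.categories))         # ['a', 'b', 'z']
print(list(trimmed.categories))   # ['a', 'b']
print(trimmed.tolist())           # ['a', 'b', 'a']
```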
######################################################################################################################
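Rename categories: <c>.rename_categories(<new_categories>[,inplace=False])
#Note: the number of categories cannot change (a list-like must have the same length as the current categories); values of a renamed category take the new name
#Parameters:
new_categories: the new category names; list-like/dict-like/callable
#a dict-like maps old names to new names; categories missing from the mapping are left unchanged
#Example (continued):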
>>> c.rename_categories(["A","B"])
['B', 'A', 'A', 'B', 'B', 'A']
Categories (2, object): ['A', 'B']
>>> c.rename_categories({"a":"A"})
['b', 'A', 'A', 'b', 'b', 'A']
Categories (2, object): ['A', 'b']
######################################################################################################################
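Reorder categories: <c>.reorder_categories(<new_categories>[,ordered=None,inplace=False])
#Unlike <c>.rename_categories(), the values keep their labels and only the order of the categories changes; new_categories must contain exactly the existing categories. Pass ordered=True to also make the result an ordered categorical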
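A minimal sketch of reorder_categories on made-up labels: the values stay as they are, only the category order (and, optionally, the ordered flag) changes.

```python
import pandas as pd

c = pd.Categorical(["low", "high", "mid", "low"])
print(list(c.categories))        # ['high', 'low', 'mid'] (inferred, sorted)

ranked = c.reorder_categories(["low", "mid", "high"], ordered=True)
print(list(ranked.categories))   # ['low', 'mid', 'high']
print(ranked.tolist())           # ['low', 'high', 'mid', 'low']
print(ranked.max())              # high

# The new list must contain exactly the existing categories.
try:
    c.reorder_categories(["low", "mid"])
    reorder_failed = False
except ValueError as exc:
    reorder_failed = True
    print("ValueError:", exc)
```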
######################################################################################################################
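Set new categories: <c>.set_categories(<new_categories>[,ordered=None,rename=False,inplace=False])
#Note: with the default rename=False, values whose category is not in new_categories become NaN
#Parameters:
new_categories: the new categories; Index-like
rename: whether new_categories is a rename of the old categories (the codes are kept) rather than a new set matched by label; bool
#Example (continued):
>>> c.set_categories(["A","B"])
[NaN, NaN, NaN, NaN, NaN, NaN]
Categories (2, object): ['A', 'B']
>>> c.set_categories(["a"])
[NaN, 'a', 'a', NaN, NaN, 'a']
Categories (1, object): ['a']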
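The difference between the default label matching and rename=True can be seen in a small sketch (the data and labels are made up). With rename=True the existing codes are kept and simply point at the new category list:

```python
import pandas as pd

c = pd.Categorical(["b", "a", "a", "b"])

# Default (rename=False): values are matched by label;
# labels that are absent from the new list become NaN.
kept = c.set_categories(["a", "x"])
print(kept.isna().tolist())   # [True, False, False, True]

# rename=True: the new list is treated as new names for the old
# categories, position by position, so no value is lost here.
renamed = c.set_categories(["A", "B"], rename=True)
print(renamed.tolist())       # ['B', 'A', 'A', 'B']
```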
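II. Method Chaining and the pipe Method
1. Method chaining:
Many of the temporary variables created along the way are never used again in the analysis; method chaining avoids them: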
>>> df=pd.DataFrame({"key":[1,2,1,2,1,1,2,1,2,2],"k1":[23,33,27,34,93,37,18,73,92,34],"k2":[1,2,3,4,5,6,7,8,9,0],"k3":[9,43,23,65,12,76,91,12,32,66]})
>>> df
   key  k1  k2  k3
0    1  23   1   9
1    2  33   2  43
2    1  27   3  23
3    2  34   4  65
4    1  93   5  12
5    1  37   6  76
6    2  18   7  91
7    1  73   8  12
8    2  92   9  32
9    2  34   0  66
>>> result=(df.assign(k2=df.k1-df.k3.mean()).groupby("key").k2.std())
>>> result
key
1    30.835045
2    28.656587
Name: k2, dtype: float64
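To make the saving concrete, here is a sketch that computes the same result twice, once with temporaries and once as a chain. The data repeats the example above; the variable names, and the use of a lambda inside assign (so the chain does not refer to the outer df name), are choices made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "key": [1, 2, 1, 2, 1, 1, 2, 1, 2, 2],
    "k1": [23, 33, 27, 34, 93, 37, 18, 73, 92, 34],
    "k3": [9, 43, 23, 65, 12, 76, 91, 12, 32, 66],
})

# Step by step, with temporary variables.
tmp = df.copy()
tmp["k2"] = tmp["k1"] - tmp["k3"].mean()
grouped = tmp.groupby("key")
stepwise = grouped["k2"].std()

# Chained: each method call works on the result of the previous one.
chained = (
    df.assign(k2=lambda d: d["k1"] - d["k3"].mean())
      .groupby("key")["k2"]
      .std()
)

print(stepwise.equals(chained))   # True
```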
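2. The pipe method:
Apply a function: <s_or_df>.pipe(<func>[,*args,**kwargs])
#equivalent to <func>(<s_or_df>[,*args,**kwargs]), but pipe makes method chaining easier
#Parameters:
func: the function to apply; function
#must accept at least 1 argument, namely <s_or_df>
args,kwargs: the arguments to pass on to <func>
#Example (continued):
>>> def f(x):
...     return x+x
...
>>> df.pipe(f)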
   key   k1  k2   k3
0    2   46   2   18
1    4   66   4   86
2    2   54   6   46
3    4   68   8  130
4    2  186  10   24
5    2   74  12  152
6    4   36  14  182
7    2  146  16   24
8    4  184  18   64
9    4   68   0  132
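The extra arguments of pipe can be sketched as follows. The helpers scale and total are hypothetical functions written for this illustration; the second call uses the (callable, keyword) tuple form of pipe, which passes the data under the named keyword when it is not the function's first parameter.

```python
import pandas as pd

df = pd.DataFrame({"k1": [23, 33, 27], "k3": [9, 43, 23]})

def scale(frame, factor, offset=0):
    """Hypothetical helper: multiply every value, then add an offset."""
    return frame * factor + offset

def total(label, data):
    """Hypothetical helper whose DataFrame argument is not the first one."""
    return f"{label}: {int(data.to_numpy().sum())}"

# Extra positional and keyword arguments are forwarded to the function.
scaled = df.pipe(scale, 2, offset=1)
print(scaled["k1"].tolist())   # [47, 67, 55]

# A (callable, keyword) tuple tells pipe which keyword receives the data.
summary = scaled.pipe((total, "data"), "sum")
print(summary)                 # sum: 322
```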