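Conclusions:
If a feature is a string and its values have an ordinal meaning, use LabelEncoder. Note that the codes are assigned in the sorted order of the strings. Example: a 'grade' feature.
If a feature is a string with no ordinal meaning, use pd.get_dummies() to convert it to one-hot columns (recommended). Example: 'gender'.
If a feature is numeric with no ordinal meaning, use OneHotEncoder() to convert it to one-hot columns. Example: 'label'.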
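The first rule depends on LabelEncoder assigning codes in sorted string order. A minimal sketch of what that means in practice follows. The grade names are hypothetical and chosen only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical grade strings, listed from worst to best.
grades = ["poor", "average", "good", "distinction"]

# LabelEncoder sorts the distinct strings, so the codes follow alphabetical
# order rather than the grade order.
plain = LabelEncoder().fit(grades)
plain_codes = dict(zip(plain.classes_.tolist(),
                       plain.transform(plain.classes_).tolist()))
print(plain_codes)
# {'average': 0, 'distinction': 1, 'good': 2, 'poor': 3}

# Prefixing each grade with its rank makes the sorted order match the grade order.
ranked = [f"{i}-{g}" for i, g in enumerate(grades)]
prefixed = LabelEncoder().fit(ranked)
prefixed_codes = dict(zip(prefixed.classes_.tolist(),
                          prefixed.transform(prefixed.classes_).tolist()))
print(prefixed_codes)
# {'0-poor': 0, '1-average': 1, '2-good': 2, '3-distinction': 3}
```

Without the prefix, 'poor' would receive the highest code, which is why the labels in the walkthrough carry a leading digit.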
The following example walks through two features, 'name' and 'score'.
'''Discretize the continuous feature 'score', convert it to label data, then one-hot encode'''
import pandas as pd
data = pd.DataFrame([['a', 59], ['c', 41], ['e', 66], ['b', 75], ['d', 88], ['f', 91], ['f', 99],
                     ['g', 95], ['h', 83]], columns=['name', 'score'])
===================================
Out[163]:
name score
0 a 59
1 c 41
2 e 66
3 b 75
4 d 88
5 f 91
6 f 99
7 g 95
8 h 83
data['score_cut'] = pd.qcut(data['score'],[0,0.25,0.5,0.75,1],labels=['0-poor','1-average','2-good','3-distinction'])
================================
Out[165]:
name score score_cut
0 a 59 0-poor
1 c 41 0-poor
2 e 66 0-poor
3 b 75 1-average
4 d 88 2-good
5 f 91 2-good
6 f 99 3-distinction
7 g 95 3-distinction
8 h 83 1-average
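# Grade bands are based on the score distribution of the whole class; the leading digit in '0-poor' makes the sorted label order match the grade order when the labels are encoded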
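To see where the grade boundaries come from, the bin edges can be inspected directly. This is a small sketch that rebuilds the same scores as a standalone Series and uses the retbins option of pd.qcut. The variable names are illustrative.

```python
import pandas as pd

scores = pd.Series([59, 41, 66, 75, 88, 91, 99, 95, 83], name="score")
grade_labels = ["0-poor", "1-average", "2-good", "3-distinction"]

# retbins=True additionally returns the bin edges that qcut computed
# from the requested quantiles.
score_cut, bin_edges = pd.qcut(scores, [0, 0.25, 0.5, 0.75, 1],
                               labels=grade_labels, retbins=True)

print(bin_edges)                 # quartile boundaries of the scores
print(score_cut.value_counts())  # number of students in each grade band
```

With these nine scores the quartile boundaries fall on 66, 83 and 91, so a score of 66 is still in the lowest band and 83 is still in the second one, which matches the table above.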
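from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
# Higher scores belong to higher grades, so the top grade is encoded as 3
data['score_encoder'] = label.fit_transform(data['score_cut'])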
=============================================
Out[167]:
name score score_cut score_encoder
0 a 59 0-poor 0
1 c 41 0-poor 0
2 e 66 0-poor 0
3 b 75 1-average 1
4 d 88 2-good 2
5 f 91 2-good 2
6 f 99 3-distinction 3
7 g 95 3-distinction 3
8 h 83 1-average 1
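As a possible alternative to the digit prefix (this is not the approach used above, only a sketch), the ordinal codes can be read from the categorical that pd.qcut returns. The un-prefixed label names below are illustrative.

```python
import pandas as pd

scores = pd.Series([59, 41, 66, 75, 88, 91, 99, 95, 83], name="score")

# Labels without numeric prefixes: the order comes from the labels list itself.
score_cut = pd.qcut(scores, [0, 0.25, 0.5, 0.75, 1],
                    labels=["poor", "average", "good", "distinction"])

# qcut returns an ordered categorical; .cat.codes gives each value's
# position in that category order.
score_codes = score_cut.cat.codes
print(score_codes.tolist())
# [0, 0, 0, 1, 2, 2, 3, 3, 1]
```

The codes are the same as the score_encoder column above, but they no longer depend on how the label strings happen to sort.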
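# If the first column were a class-label column rather than name, its 8 distinct values (f occurs twice in the 9 rows) would have no ordinal meaning, so the column can be one-hot encoded
data = data.rename(columns={'name': 'label'})
# dtype=int keeps the 0/1 output on pandas versions where get_dummies defaults to bool
data = data.join(pd.get_dummies(data['label'], prefix='label', dtype=int))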
============================================================
Out[171]:
label score score_cut score_encoder label_a label_b label_c
0 a 59 0-poor 0 1 0 0
1 c 41 0-poor 0 0 0 1
2 e 66 0-poor 0 0 0 0
3 b 75 1-average 1 0 1 0
4 d 88 2-good 2 0 0 0
5 f 91 2-good 2 0 0 0
6 f 99 3-distinction 3 0 0 0
7 g 95 3-distinction 3 0 0 0
8 h 83 1-average 1 0 0 0
label_d label_e label_f label_g label_h
0 0 0 0 0 0
1 0 0 0 0 0
2 0 1 0 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 0 0 1 0 0
6 0 0 1 0 0
7 0 0 0 1 0
8 0 0 0 0 1
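A quick sanity check on a one-hot result like this one is that every row contains exactly one 1 and that there is one column per distinct value. The sketch below rebuilds the label column as a standalone Series for that purpose.

```python
import pandas as pd

labels = pd.Series(["a", "c", "e", "b", "d", "f", "f", "g", "h"], name="label")

# dtype=int gives 0/1 columns regardless of the pandas version's default dtype.
dummies = pd.get_dummies(labels, prefix="label", dtype=int)

print(dummies.shape)                    # (9, 8): 9 rows, 8 distinct labels
print(dummies.sum(axis=1).eq(1).all())  # True: exactly one 1 per row
print(dummies["label_f"].tolist())      # [0, 0, 0, 0, 0, 1, 1, 0, 0]
```

The label 'f' appears in two rows, which is why nine rows produce only eight indicator columns.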
For how LabelEncoder computes its codes, see: https://www.cnblogs.com/king-lps/p/7846414.html
Some further examples:
'''LabelEncoder and OneHotEncoder'''
data = pd.DataFrame([['b',1],['a',2],['c',3],['c',2]],columns=['name','age'])
# LabelEncoder encodes a non-numeric feature as integers of different magnitudes, so the encoded values imply an ordering
data_copy1 = data.copy(deep=True)
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
data_copy1['name_encoder'] = label.fit_transform(data_copy1['name'])
print(label.classes_)
# Use pd.get_dummies() to one-hot encode the non-numeric feature
data_encoder2 = pd.get_dummies(data_copy1)
# Use OneHotEncoder() to one-hot encode the numeric features
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
data = pd.DataFrame([[1,1],[3,2],[2,3]],columns=['id','age'])
data_encoder1 = ohe.fit_transform(data).toarray()