OneHotEncoder and LabelEncoder: a short summary

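Conclusions:
If a feature is a string and has ordinal meaning, use LabelEncoder. The codes follow the sorted (alphabetical) order of the strings, so the level names must sort in rank order. Example: 'score'.
If a feature is a string with no ordinal meaning, use pd.get_dummies() to convert it to one-hot columns (recommended). Example: 'gender'.
If a feature is numeric with no ordinal meaning, use OneHotEncoder() to convert it to one-hot columns. Example: 'label'.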
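
The first rule relies on LabelEncoder assigning codes in sorted order of the values, not in order of meaning. A minimal sketch of that behaviour follows; the level names (`low`, `medium`, `high`) and their prefixed variants are made up for illustration and are not part of the original data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical level names, chosen only to show the sorting behaviour.
levels = ['low', 'medium', 'high', 'low']

encoder = LabelEncoder()
codes = encoder.fit_transform(levels)

# classes_ is sorted alphabetically, so 'high' gets code 0 and 'medium' gets code 2.
print(encoder.classes_)
print(codes)

# Prefixing each name with its rank makes alphabetical order match the real order.
ranked = ['0-low', '1-medium', '2-high', '0-low']
ranked_codes = LabelEncoder().fit_transform(ranked)
print(ranked_codes)
```

This is why the example below names its levels '0-poor', '1-average', and so on.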

The example below uses two features, 'name' and 'score'.

'''Discretize the continuous feature 'score' into labelled bins, then one-hot encode'''
import pandas as pd
data = pd.DataFrame([['a',59],['c',41],['e',66],['b',75],['d',88],['f',91],['f',99],
                    ['g',95],['h',83]],columns=['name','score'])
===================================
Out[163]: 
  name  score
0    a     59
1    c     41
2    e     66
3    b     75
4    d     88
5    f     91
6    f     99
7    g     95
8    h     83
data['score_cut'] = pd.qcut(data['score'],[0,0.25,0.5,0.75,1],labels=['0-poor','1-average','2-good','3-distinction'])
================================
Out[165]: 
  name  score      score_cut
0    a     59         0-poor
1    c     41         0-poor
2    e     66         0-poor
3    b     75      1-average
4    d     88         2-good
5    f     91         2-good
6    f     99  3-distinction
7    g     95  3-distinction
8    h     83      1-average
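# Levels are assigned by quartile within the class; the leading digit in '0-poor' makes alphabetical order match rank order when encoding
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
data['score_encoder'] = label.fit_transform(data['score_cut'])  # a higher score means a higher level; the top level is encoded as 3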
=============================================
Out[167]: 
  name  score      score_cut  score_encoder
0    a     59         0-poor              0
1    c     41         0-poor              0
2    e     66         0-poor              0
3    b     75      1-average              1
4    d     88         2-good              2
5    f     91         2-good              2
6    f     99  3-distinction              3
7    g     95  3-distinction              3
8    h     83      1-average              1
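
As an aside, the integer levels can also be obtained from pandas alone, without the digit prefix in the label names. The sketch below is an alternative to the LabelEncoder step above, not the method used in this post; the variable names are illustrative.

```python
import pandas as pd

scores = pd.Series([59, 41, 66, 75, 88, 91, 99, 95, 83], name='score')
quantiles = [0, 0.25, 0.5, 0.75, 1]

# labels=False makes qcut return the integer bin index instead of a label.
codes_direct = pd.qcut(scores, quantiles, labels=False)

# With string labels, qcut returns an ordered categorical whose codes follow
# the order in which the labels were passed, not alphabetical order.
cut = pd.qcut(scores, quantiles, labels=['poor', 'average', 'good', 'distinction'])
codes_from_cat = cut.cat.codes

print(codes_direct.tolist())
print(codes_from_cat.tolist())
```

Both lists should match the score_encoder column shown above.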
# If the first column were a class label rather than a name (8 distinct categories here), it has no ordering, so it can be one-hot encoded
data = data.rename(columns={'name': 'label'})
data = data.join(pd.get_dummies(data['label'], prefix='label'))
============================================================
Out[171]: 
  label  score      score_cut  score_encoder  label_a  label_b  label_c  
0     a     59         0-poor              0        1        0        0   
1     c     41         0-poor              0        0        0        1   
2     e     66         0-poor              0        0        0        0   
3     b     75      1-average              1        0        1        0   
4     d     88         2-good              2        0        0        0   
5     f     91         2-good              2        0        0        0   
6     f     99  3-distinction              3        0        0        0   
7     g     95  3-distinction              3        0        0        0   
8     h     83      1-average              1        0        0        0   
   label_d  label_e  label_f  label_g  label_h  
0        0        0        0        0        0  
1        0        0        0        0        0  
2        0        1        0        0        0  
3        0        0        0        0        0  
4        1        0        0        0        0  
5        0        0        1        0        0
6        0        0        1        0        0
7        0        0        0        1        0
8        0        0        0        0        1
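
Depending on the pandas version, the dummy columns may print as True/False rather than 1/0, since recent releases default to a boolean dtype. A small sketch, assuming a pandas version where `get_dummies` accepts the `dtype` argument, that forces integer 0/1 columns (the four-row frame is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'label': ['a', 'c', 'b', 'c']})

# dtype=int asks for 0/1 integer columns regardless of the version's default.
dummies = pd.get_dummies(frame['label'], prefix='label', dtype=int)
result = frame.join(dummies)
print(result)
```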

For how LabelEncoder computes its codes, see: https://www.cnblogs.com/king-lps/p/7846414.html

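Some more examples follow.
'''LabelEncoder and OneHotEncoder'''
import pandas as pd
data = pd.DataFrame([['b',1],['a',2],['c',3],['c',2]],columns=['name','age'])
# LabelEncoder encodes a non-numeric feature as integers of different magnitude, so the result implies an ordering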
data_copy1 = data.copy(deep=True)
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
data_copy1['name_encoder'] = label.fit_transform(data_copy1['name'])
print(label.classes_)

# Use pd.get_dummies() to one-hot encode non-numeric features
data_encoder2 = pd.get_dummies(data_copy1)
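
The post does not show the output of these two steps, so here is a self-contained sketch of what they are expected to produce. It rebuilds the same four-row frame; `encoded` is an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame([['b', 1], ['a', 2], ['c', 3], ['c', 2]], columns=['name', 'age'])

label = LabelEncoder()
data['name_encoder'] = label.fit_transform(data['name'])

# Codes follow the sorted classes: a -> 0, b -> 1, c -> 2.
print(data['name_encoder'].tolist())

# inverse_transform maps the integer codes back to the original strings.
print(label.inverse_transform(data['name_encoder']).tolist())

# Called on a whole DataFrame, get_dummies expands only the string column;
# the numeric columns 'age' and 'name_encoder' are passed through unchanged.
encoded = pd.get_dummies(data)
print(encoded.columns.tolist())
```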

# Use OneHotEncoder() to one-hot encode numeric features
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
data = pd.DataFrame([[1,1],[3,2],[2,3]],columns=['id','age'])
data_encoder1 = ohe.fit_transform(data).toarray()
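
To see how the encoded columns line up with the inputs, the fitted encoder's `categories_` attribute can be inspected. The sketch below repeats the same frame and additionally sets `handle_unknown='ignore'`, which is an addition for illustration and not part of the original example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame([[1, 1], [3, 2], [2, 3]], columns=['id', 'age'])

# handle_unknown='ignore' encodes unseen values as all zeros instead of raising.
ohe = OneHotEncoder(handle_unknown='ignore')
encoded = ohe.fit_transform(data).toarray()

# categories_ lists the sorted unique values per input column; the output
# has one column per category: 3 for 'id' plus 3 for 'age'.
print(ohe.categories_)
print(encoded.shape)
print(encoded)

# An id that was not seen during fit (4 here) yields zeros in the 'id' block.
unseen = pd.DataFrame([[4, 1]], columns=['id', 'age'])
unseen_encoded = ohe.transform(unseen).toarray()
print(unseen_encoded)
```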


Reposted from blog.csdn.net/qq_41716239/article/details/105106812