Handling Text Attributes with One-Hot Encoding


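When a dataset contains text attributes, most machine learning algorithms cannot work with them directly, so the text has to be converted to numbers first. If the categories have a natural order, for example (cold, warm, hot), integer encoding can be used directly. If they have no order, for example (red, green, blue), one-hot encoding can be used instead, because integer codes would suggest an ordering that does not exist.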
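
As a minimal sketch of integer encoding for ordered categories, the mapping below is written out by hand so that the integers follow the intended ranking. The names `temperature_order` and `readings`, and the sample values, are assumptions made for illustration only.

```python
# Hypothetical example: the ranking cold < warm < hot is stated explicitly.
temperature_order = ['cold', 'warm', 'hot']
temperature_to_int = {label: rank for rank, label in enumerate(temperature_order)}

# Illustrative sample data (not from the original post)
readings = ['warm', 'cold', 'hot', 'warm']
ordinal_encoded = [temperature_to_int[label] for label in readings]
print(ordinal_encoded)  # [1, 0, 2, 1]
```

Note that scikit-learn's LabelEncoder assigns integers in sorted label order (cold=0, hot=1, warm=2), so it does not preserve this ranking by itself.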

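One-hot encoding: each value is represented as a vector with one position per category, where the position of the encoded category is 1 and all other positions are 0.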
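
To make the definition concrete, here is a small sketch in plain Python for the (red, green, blue) case. The helper `one_hot` is a hypothetical function written for this illustration, not part of any library.

```python
colors = ['red', 'green', 'blue']

def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    return [1 if category == value else 0 for category in categories]

for color in colors:
    print(color, one_hot(color, colors))
# red [1, 0, 0]
# green [0, 1, 0]
# blue [0, 0, 1]
```

Each vector has exactly one 1, so no category appears larger or smaller than another.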

1. Manual One-Hot Encoding

from numpy import argmax

data = 'hello world'
# All characters that may appear in the input: 26 lowercase letters plus the space
alphabet = 'abcdefghijklmnopqrstuvwxyz '
# Lookup tables between characters and integer indices
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# Integer encoding
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)
# One-hot encoding: one vector of len(alphabet) entries per character
OneHot_Encoder = list()
for i in integer_encoded:
    letter = [0 for _ in range(len(alphabet))]
    letter[i] = 1
    OneHot_Encoder.append(letter)
print(OneHot_Encoder)
# Recover the first character from its one-hot vector
inverted = int_to_char[argmax(OneHot_Encoder[0])]
print(inverted)
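
The loop above can also be written with NumPy indexing. The following is a sketch under the assumption that every character of the input appears in `alphabet`; the variable names `indices`, `one_hot` and `decoded` are chosen for this illustration.

```python
import numpy as np

data = 'hello world'
alphabet = 'abcdefghijklmnopqrstuvwxyz '
char_to_int = {c: i for i, c in enumerate(alphabet)}

# Integer encoding of every character
indices = np.array([char_to_int[ch] for ch in data])

# Row k of the identity matrix is the one-hot vector for index k,
# so indexing with `indices` picks one row per character.
one_hot = np.eye(len(alphabet), dtype=int)[indices]
print(one_hot.shape)  # (11, 27)

# Decode the whole string by taking the argmax of each row
decoded = ''.join(alphabet[i] for i in one_hot.argmax(axis=1))
print(decoded)  # hello world
```

This decodes every character rather than only the first one, which is a convenient check that the encoding is reversible.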

2. One-Hot Encoding with Scikit-Learn

from numpy import argmax
from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Integer encoding (LabelEncoder numbers the labels in sorted order)
data = array(['cold', 'cold', 'warm', 'hot', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold'])
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(data)
print(label_encoded)
# One-hot encoding (the encoder expects a 2-D column, hence the reshape)
onehot_encoder = OneHotEncoder(categories='auto')
onehot_encoded = onehot_encoder.fit_transform(label_encoded.reshape(-1, 1))
# fit_transform returns a sparse matrix by default; convert it to a dense array
onehot = onehot_encoded.toarray()
print(onehot)
# Recover the original label of the first sample
state = label_encoder.inverse_transform([argmax(onehot[0, :])])
print(state)
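
As a possible shortcut, recent versions of scikit-learn (0.20 and later) let OneHotEncoder work on string columns directly, so the LabelEncoder step can be skipped. The sketch below assumes such a version; the explicit `categories` list is an assumption added here to fix the column order as cold, warm, hot.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array(['cold', 'cold', 'warm', 'hot', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold'])
column = data.reshape(-1, 1)  # the encoder expects a 2-D array: one column per feature

# One list of categories per feature; its order defines the one-hot column order
encoder = OneHotEncoder(categories=[['cold', 'warm', 'hot']])
onehot = encoder.fit_transform(column).toarray()
print(onehot[:3])

# inverse_transform maps the one-hot rows back to the original labels
restored = encoder.inverse_transform(onehot).ravel()
print(restored)
```

With `categories='auto'` the columns would instead follow the sorted label order (cold, hot, warm).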



Reposted from www.cnblogs.com/pineapple-chicken/p/12402273.html