Handling Categorical Data in Python

Handling categorical values (`Categorical Data`)

Create the data:

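import pandas as pd

# Toy dataset: a nominal feature (color), an ordinal feature (size),
# a numeric feature (price), and a class label.
df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
df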

----------
   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1

Mapping ordinal features: size

# The integers encode the order M < L < XL.
size_mapping = {
    'M': 1,
    'L': 2,
    'XL': 3
}

df['size'] = df['size'].map(size_mapping)
df

----------
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1
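To recover the original size strings later, the dictionary can be inverted. The following is a minimal sketch that assumes the same three size values as above; `encoded`, `inv_size_mapping`, and `restored` are names introduced here for illustration. Note that `Series.map` yields NaN for any value that is missing from the dictionary, so a quick check is worthwhile.

```python
import pandas as pd

df = pd.DataFrame({'size': ['M', 'L', 'XL']})

size_mapping = {'M': 1, 'L': 2, 'XL': 3}
encoded = df['size'].map(size_mapping)

# Values absent from the dictionary would become NaN; fail early if that happens.
assert encoded.notna().all()

# Invert the dictionary (integer -> label) to restore the original strings.
inv_size_mapping = {v: k for k, v in size_mapping.items()}
restored = encoded.map(inv_size_mapping)
print(restored.tolist())  # ['M', 'L', 'XL']
```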

Encoding class labels: classlabel

import numpy as np

# Assign an integer to each unique class label (np.unique returns them sorted).
class_mapping = {label: idx for idx, label in
                 enumerate(np.unique(df['classlabel']))}
class_mapping

----------
{'class1': 0, 'class2': 1}

df['classlabel'] = df['classlabel'].map(class_mapping)
df

----------
   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0

# Invert the mapping (integer -> label) and restore the original string labels.
inv_class_mapping = {idx: label for label, idx in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)

scikit-learn provides a LabelEncoder class that does the same thing more conveniently:

from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y

----------
array([0, 1, 0])
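The encoder also remembers the labels it saw, so the integer codes can be turned back into strings. A minimal sketch, assuming the same three class labels as above (the `labels` array is written out by hand here so the snippet runs on its own):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['class1', 'class2', 'class1'])

class_le = LabelEncoder()
y = class_le.fit_transform(labels)  # integer codes, assigned in sorted label order

# classes_ holds the learned labels; the position of each label is its code.
print(class_le.classes_)

# inverse_transform maps the integer codes back to the original strings.
restored = class_le.inverse_transform(y)
print(restored)
```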

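Encoding nominal features: color

The approach above used a simple dictionary mapping to turn the ordinal size feature into integers. The same kind of integer encoding can be applied to the color feature, here with LabelEncoder:

blue  -> 0
green -> 1
red   -> 2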

X = df[['color', 'size', 'price']].values
color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X

----------
array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

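If we stopped here and fed this array to a model, we would be making one of the most common mistakes in handling categorical data. Although the color values have no inherent order, a learning algorithm will assume that green (1) is larger than blue (0) and that red (2) is larger than green (1). The assumption is wrong, yet the algorithm may still produce usable predictions; they just will not be optimal.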
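To make the issue concrete, here is a short NumPy sketch. The `codes` array is copied by hand from the encoded color column, and `diff` is a name introduced here for illustration. Any model that relies on numeric differences would treat red as twice as far from blue as green is, although the colors have no such relationship.

```python
import numpy as np

names = ['green', 'red', 'blue']
codes = np.array([1, 2, 0])  # integer codes for green, red, blue

# Pairwise absolute differences between the integer codes.
diff = np.abs(codes[:, None] - codes[None, :])

print(diff[2, 0])  # blue vs green: 1
print(diff[2, 1])  # blue vs red:   2
```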

A common way to avoid this problem is one-hot encoding.

The idea: create a new dummy feature for each unique value in the nominal feature column.

pd.get_dummies(df[['price', 'color', 'size']])

----------
   price  size  color_blue  color_green  color_red
0   10.1     1           0            1          0
1   13.5     2           0            0          1
2   15.3     3           1            0          0
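Recent pandas versions return boolean dummy columns by default, so the output may show True/False instead of 0/1. The sketch below rebuilds the relevant columns by hand so that it runs on its own, and passes `dtype=int` to request 0/1 columns like those shown above; `dummies` and `color_cols` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'size': [1, 2, 3],
    'price': [10.1, 13.5, 15.3],
})

# Only the string column (color) is expanded; numeric columns pass through.
# dtype=int asks for 0/1 integers in the dummy columns.
dummies = pd.get_dummies(df[['price', 'color', 'size']], dtype=int)
print(dummies)

# Each row has exactly one active color column.
color_cols = ['color_blue', 'color_green', 'color_red']
print(dummies[color_cols].sum(axis=1).tolist())  # [1, 1, 1]
```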
