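1. Brief introduction

Categorical variables are similar to enums: each one takes its value from a fixed, limited set of categories.

For example, red, white, and blue are categories of color, while large, medium, and small are categories of size.

Such values usually arrive in the data as English strings like big or red. Most machine learning models cannot work with these strings directly, so we have to either map them to numeric values or drop them altogether.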
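
A minimal sketch of those two options in pandas. The toy table, its column names (`color`, `size`, `price`), and the hand-written mapping `size_order` are illustrative assumptions, not part of the dataset used below.

```python
import pandas as pd

# Toy data (hypothetical): two categorical columns and one numeric column
df = pd.DataFrame({
    "color": ["red", "white", "blue", "red"],
    "size": ["big", "small", "medium", "big"],
    "price": [10.0, 7.5, 8.0, 11.0],
})

# Treat every non-numeric column as categorical
categorical_cols = [c for c in df.columns
                    if not pd.api.types.is_numeric_dtype(df[c])]

# Option A: map each category to a number (mapping chosen by hand here)
size_order = {"small": 0, "medium": 1, "big": 2}
mapped = df.assign(size=df["size"].map(size_order))

# Option B: drop the categorical columns altogether
dropped = df.drop(columns=categorical_cols)

print(categorical_cols)
print(mapped)
print(dropped)
```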

2. Three approaches (explained alongside the code)

First, some preprocessing:

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id') 
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with a missing target; SalePrice is the dependent variable
X.dropna(axis=0, subset=['SalePrice'], inplace=True)

# Separate the target from the predictors
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)
# X now has no missing values (X_test may still have some in its remaining columns)

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)


1) Drop the categorical variables

drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
# exclude=['object'] removes the object-dtype columns, i.e. the categorical data


We can then check the mean_absolute_error obtained with these datasets.
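
One way to run that check, sketched under assumptions: the helper name `score_dataset` and the choice of a `RandomForestRegressor` with 100 trees are illustrative and not given in the text above. The synthetic data exists only so the snippet runs on its own; in practice you would pass `drop_X_train`, `drop_X_valid`, `y_train`, and `y_valid`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def score_dataset(X_train, X_valid, y_train, y_valid):
    """Fit a random forest and return the validation MAE (lower is better)."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)


# Synthetic stand-in data (hypothetical columns, not the real dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"area": rng.uniform(50, 200, 200),
                       "rooms": rng.integers(1, 6, 200)})
y_demo = 1000 * X_demo["area"] + 5000 * X_demo["rooms"] + rng.normal(0, 1000, 200)

Xtr, Xva, ytr, yva = train_test_split(X_demo, y_demo,
                                      train_size=0.8, test_size=0.2,
                                      random_state=0)
mae = score_dataset(Xtr, Xva, ytr, yva)
print(f"Validation MAE: {mae:.1f}")
```

Calling the same helper on each prepared dataset (dropped, label encoded, one-hot encoded) gives comparable numbers, as long as the split and the model settings stay fixed.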

2) Label encoding

That is, map each category to a numeric value.

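① Take special care here: we split the data into train and valid sets at the start. If we simply fit the label encoder on every categorical column of train and then apply it to valid, it will raise an error, because valid may contain categories that never appeared in train.

② Label encoding assumes that the categories have an order. In this example that assumption makes sense, because the categories have a unique ranking. Not every categorical variable has a clear ordering in its values; those that do are called ordinal variables. For tree-based models (such as decision trees and random forests), label encoding of ordinal variables may work well.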
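
A small sketch of why the order matters. `LabelEncoder` numbers the labels in sorted order, which need not match the real ranking. The column name `quality` and its three levels are hypothetical, and `OrdinalEncoder` with an explicit `categories` list is shown as an alternative, not as the method used in the code that follows.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal column
df = pd.DataFrame({"quality": ["low", "high", "medium", "low"]})

# LabelEncoder numbers the labels in sorted order: high=0, low=1, medium=2
alphabetical = LabelEncoder().fit_transform(df["quality"])

# OrdinalEncoder lets us state the intended order explicitly: low < medium < high
ordered = OrdinalEncoder(categories=[["low", "medium", "high"]])
ranked = ordered.fit_transform(df[["quality"]]).ravel()

print(alphabetical)
print(ranked)
```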

# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded:
# those whose set of values is identical in X_train and X_valid
good_label_cols = [col for col in object_cols if
                   set(X_train[col]) == set(X_valid[col])]

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))


from sklearn.preprocessing import LabelEncoder

# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply label encoder: fit on the training column, reuse the same mapping on validation
label_encoder = LabelEncoder()
for col in good_label_cols:
    label_X_train[col] = label_encoder.fit_transform(label_X_train[col])
    label_X_valid[col] = label_encoder.transform(label_X_valid[col])
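
The failure described in ① can be reproduced in a few lines. This is a sketch with made-up color arrays. The second half shows an alternative that is not the approach used above: `OrdinalEncoder` can send unseen categories to a sentinel value (here `-1`, an arbitrary choice) instead of raising.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

train_colors = np.array(["red", "white", "blue"])
valid_colors = np.array(["red", "green"])  # "green" never appears in training

# LabelEncoder raises ValueError when it meets a label it was not fitted on
encoder = LabelEncoder().fit(train_colors)
try:
    encoder.transform(valid_colors)
    unseen_error = None
except ValueError as err:
    unseen_error = err
    print("LabelEncoder failed:", err)

# Alternative: encode unseen categories as -1 instead of failing
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal.fit(train_colors.reshape(-1, 1))
encoded_valid = ordinal.transform(valid_colors.reshape(-1, 1)).ravel()
print(encoded_valid)
```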


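3) One-hot encoding

① One-hot encoding adds one new column per category, so a variable with many categories makes the dataset much wider. For that reason we usually one-hot encode only columns with relatively low cardinality. High-cardinality columns can either be dropped from the dataset or label encoded instead. A common rule of thumb is a threshold of 10: columns with 10 or more distinct categories are dropped.

② Unlike label encoding, one-hot encoding does not assume any ordering of the categories. It can therefore work particularly well when the categorical data has no clear order (for example, "red" is neither more nor less than "yellow"). Categorical variables without an intrinsic ranking are called nominal variables.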
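
To make the "one column per category" idea concrete, here is a minimal sketch on a made-up `color` column (the data is hypothetical). It keeps the encoder's default sparse output and calls `.toarray()` for display.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "yellow", "green", "red"]})

# Default output is a sparse matrix; .toarray() makes it dense for display
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(df[["color"]]).toarray()

# One new column per category, in the encoder's sorted category order
one_hot_df = pd.DataFrame(one_hot, columns=encoder.categories_[0], index=df.index)
print(one_hot_df)

# With handle_unknown="ignore", an unseen category becomes an all-zero row
unseen = encoder.transform(pd.DataFrame({"color": ["purple"]})).toarray()
print(unseen)
```

Each row contains a single 1 marking its category, and no column is treated as larger or smaller than another.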

# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))


from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each low-cardinality categorical column
# (scikit-learn >= 1.2; older versions spell the argument sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

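# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Recent scikit-learn versions require all column names to be strings
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)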

