LabelEncoder & OneHotEncoder

<Excerpted from http://biggyani.blogspot.com/2014/08/using-onehot-with-categorical.html>

Using OneHotEncoder and LabelEncoder with categorical features/columns in a pandas DataFrame, for feature selection and prediction
Many a time, you have a machine learning problem with a data set that has one or more categorical features/columns. Now, there are generally three parts to a machine learning problem: prepare/clean the data, do feature selection, and fit models and predict.

In the feature selection phase, if you plan to use things like chi-square, variance (note: if you have an extremely skewed data set, say with 95% false/0 target values and 5% true/>0 target values, a very low-variance feature might still be an important feature), L1/Lasso-regularized Logistic Regression, Support Vector Machines (with a linear kernel), Principal Component Analysis, etc., you will need to convert your categorical values to one-against-all columns. If you have only categorical values (or a mixture), your target is a class, and you are using trees, information gain, etc. for the feature selection phase, then you will not need this conversion.
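As a minimal sketch of why the conversion matters for this phase: chi-square feature selection requires non-negative feature values, which one-against-all (one-hot) columns satisfy by construction. The data below is an invented toy example, not from the original post.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: three one-hot columns (say, one categorical
# column expanded into one-against-all columns) and a binary target.
X = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

# chi2 only accepts non-negative features, so one-hot columns are safe input.
selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support())  # boolean mask of the 2 selected columns
```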

Similarly, in the fit-models-and-predict phase, if you are using any algorithm other than trees/clustering, where your feature values will be multiplied by coefficients, then you will need to convert your categorical values into one-against-all columns. It is possible, though, that the library you are using in R, MATLAB, Python, or SPSS already has this option built in, so do check before doing the conversion yourself.
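To illustrate the coefficient point with a small sketch (toy data invented here): a label-encoded column forces one coefficient onto an arbitrary integer ordering of the categories, whereas one-hot columns give each category its own coefficient.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Label-encoded column: 0, 1, 2 impose an arbitrary ordering on categories,
# so a single coefficient would multiply that arbitrary scale.
X_label = np.array([[0], [1], [2], [0], [2], [1]])
y = np.array([0, 1, 0, 0, 0, 1])

# One-hot version: each category becomes its own 0/1 column.
X_onehot = np.eye(3)[X_label.ravel()]

clf = LogisticRegression().fit(X_onehot, y)
print(clf.coef_.shape)  # one weight per category column
```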

If you need to do the conversion, this is how you do it in Python using OneHotEncoder and LabelEncoder:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
import pandas as pd

train = pd.read_csv('train.csv')

# insert code to get a list of categorical columns into a variable say categorical_columns
# insert code to take care of the missing values in the columns in whatever way you like,
# but it is important that missing values are replaced.

# Get the categorical values into a 2D numpy array
train_categorical_values = np.array(train[categorical_columns])

# OneHotEncoder will only work on integer categorical values, so if you have strings in your
# categorical columns, you need to use LabelEncoder to convert them first

# do the first column
enc_label = LabelEncoder()
train_data = enc_label.fit_transform(train_categorical_values[:,0])

# do the others
for i in range(1, train_categorical_values.shape[1]):
    enc_label = LabelEncoder()
    train_data = np.column_stack((train_data, enc_label.fit_transform(train_categorical_values[:,i])))

train_categorical_values = train_data.astype(float)

# if you have only integers, you can skip the section above (from "do the first column")
# and uncomment the following line instead
# train_categorical_values = train_categorical_values.astype(float)

enc_onehot = OneHotEncoder()
train_cat_data = enc_onehot.fit_transform(train_categorical_values)
 
# play around and print enc_onehot.n_values_ and enc_onehot.feature_indices_ to see how many unique values are there in each column

# create a list of columns to help create a DataFrame from the np array.
# So, say you have col1 and col2 as the categorical columns with 2 and 3 unique values
# respectively. The following code will give you col1_0, col1_1, col2_0, col2_1, col2_2 as the columns

cols = [categorical_columns[i] + '_' + str(j) for i in range(len(categorical_columns)) for j in range(enc_onehot.n_values_[i])]
train_cat_data_df = pd.DataFrame(train_cat_data.toarray(),columns=cols)

# get these columns back into the data frame
train[cols] = train_cat_data_df[cols]

# append the target column. Obviously, rename it to whatever your target column is
cols.append('target')
# So now you have a dataframe with only the categorical columns and the target. You can now do whatever you want to do with it :)
train_cat_df = train[cols]
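The walkthrough above can be condensed into a self-contained sketch on a toy DataFrame (the column names and data here are made up, not from the original post). Note one assumption: newer scikit-learn versions removed OneHotEncoder's n_values_ attribute, so this sketch uses categories_ to recover the per-column unique-value counts instead.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy frame standing in for train.csv; names are hypothetical.
train = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                      'size':  ['S', 'M', 'M', 'L'],
                      'target': [1, 0, 1, 0]})
categorical_columns = ['color', 'size']

# LabelEncoder turns each string column into integer codes, column by column.
cat_values = train[categorical_columns].to_numpy()
encoded = np.column_stack([LabelEncoder().fit_transform(cat_values[:, i])
                           for i in range(cat_values.shape[1])]).astype(float)

# OneHotEncoder expands each integer column into one-against-all columns.
enc_onehot = OneHotEncoder()
onehot = enc_onehot.fit_transform(encoded)

# Build readable column names from the number of categories per column.
cols = [col + '_' + str(j)
        for col, cats in zip(categorical_columns, enc_onehot.categories_)
        for j in range(len(cats))]
train_cat_df = pd.DataFrame(onehot.toarray(), columns=cols)
train_cat_df['target'] = train['target']
print(train_cat_df)
```

With 3 unique colors and 3 unique sizes, the result has six one-hot columns plus the target.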


Reposted from blog.csdn.net/csiao_Bing/article/details/84972884