[scikit-learn] Some pits of feature binarization encoding function

  • 1. Introduction

  • 2. Origin of the problem

    • 2.1. Dealing with numeric categorical variables

    • 2.2. Dealing with string categorical variables

    • 2.3. Useless attempts

  • 3. Another solution

1. Introduction

In the past few days I have been writing an article, 『优雅高效地数据挖掘——基于Python的sklearn_pandas库』 (roughly, "Elegant and Efficient Data Mining: the Python sklearn_pandas Library"), part of which involves how to perform feature binarization on many columns in batches. In the process I found that the binarization functions in scikit-learn (hereinafter, sklearn) have some pitfalls, which I have discussed with the author of sklearn_pandas on GitHub. I summarize them here for the record.

The sklearn binarization encoding functions involved are: OneHotEncoder(), LabelEncoder(), LabelBinarizer(), and MultiLabelBinarizer().

2. Origin of the problem

First, create some test data:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer

testdata = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish'],
                         'age': [4, 6, 3, 3],
                         'salary': [4, 5, 1, 1]})

Here we treat pet, age, and salary all as categorical features; the difference is that age and salary are numeric while pet is a string. Our goal is simple: binarize them all with one-hot encoding.

2.1. Dealing with numeric categorical variables

Binarizing age looks very simple: call OneHotEncoder directly

OneHotEncoder(sparse=False).fit_transform(testdata.age)  # note: testdata.age is not equivalent to testdata[['age']] here

However, the result is  array([[ 1.,  1.,  1.,  1.]]), which is wrong. The Warning message reveals the reason: in newer versions of sklearn, OneHotEncoder requires a 2-D array as input, while testdata.age returns a Series, which is essentially a 1-D array.
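
The shape mismatch is easy to verify directly (plain pandas, nothing sklearn-specific):

print(testdata.age.shape)        # (4,)   -> a 1-D Series
print(testdata[['age']].shape)   # (4, 1) -> a 2-D DataFrame

So change the call to: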

OneHotEncoder(sparse = False).fit_transform( testdata[['age']] )

We get what we want:

array([[ 0.,  1.,  0.],
      [ 0.,  0.,  1.],
      [ 1.,  0.,  0.],
      [ 1.,  0.,  0.]])

We can one-hot encode  salary with  OneHotEncoder in the same way, and then stack the two results with  numpy.hstack() to get the combined output

import numpy as np

a1 = OneHotEncoder(sparse=False).fit_transform(testdata[['age']])
a2 = OneHotEncoder(sparse=False).fit_transform(testdata[['salary']])
final_output = np.hstack((a1, a2))

However, this code is slightly redundant. Since  OneHotEncoder() accepts 2-D array input, we can simply write

OneHotEncoder(sparse=False).fit_transform(testdata[['age', 'salary']])

The result is

array([[ 0.,  1.,  0.,  0.,  1.,  0.],
      [ 0.,  0.,  1.,  0.,  0.,  1.],
      [ 1.,  0.,  0.,  1.,  0.,  0.],
      [ 1.,  0.,  0.,  1.,  0.,  0.]])

Sometimes, besides the final encoded result, we also want to know which columns of the output come from  age and which come from  salary. This can be achieved through OneHotEncoder()'s built-in  feature_indices_ attribute. Here its value is [0, 3, 6], indicating that columns [0:3] are the binarized encoding of  age and columns [3:6] are that of  salary. Please refer to the sklearn documentation for details.
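
If you prefer not to rely on the exact semantics of that attribute (they vary across sklearn versions), the column boundaries can also be recovered from the data itself. A minimal sketch:

n_age = testdata['age'].nunique()        # 3 distinct ages     -> columns [0:3]
n_salary = testdata['salary'].nunique()  # 3 distinct salaries -> columns [3:6]
encoded = OneHotEncoder(sparse=False).fit_transform(testdata[['age', 'salary']])
age_part = encoded[:, :n_age]                   # the age columns
salary_part = encoded[:, n_age:n_age + n_salary]  # the salary columns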

2.2. Dealing with string categorical variables

Unfortunately, OneHotEncoder cannot directly encode string-type categorical variables; that is, OneHotEncoder().fit_transform(testdata[['pet']]) will raise an error (try it if you don't believe me). Many people have discussed this on Stack Overflow and in sklearn's GitHub issues, but so far no sklearn release has added string support to OneHotEncoder, so one of two workarounds is generally used:

  • Method 1: first use LabelEncoder() to convert the strings into consecutive integer codes, then use OneHotEncoder() to binarize them

  • Method 2: use LabelBinarizer() to binarize directly

Note, however, that both LabelEncoder() and LabelBinarizer() were designed in sklearn to discretize the label y, not the input X, so their input is restricted to a 1-D array, exactly the opposite of OneHotEncoder(), which requires a 2-D array. We therefore have to be extra careful when using them, or we will get wrong results like the array([[ 1.,  1.,  1.,  1.]]) above.

# Method 1: LabelEncoder() + OneHotEncoder()
a = LabelEncoder().fit_transform(testdata['pet'])
OneHotEncoder(sparse=False).fit_transform(a.reshape(-1, 1))  # note: reshape turns a into a 2-D array

# Method 2: LabelBinarizer() directly
LabelBinarizer().fit_transform(testdata['pet'])

The two methods give identical results:

array([[ 1.,  0.,  0.],
      [ 0.,  1.,  0.],
      [ 0.,  1.,  0.],
      [ 0.,  0.,  1.]])

Because LabelEncoder and LabelBinarizer support only 1-D arrays by design, they also cannot accept multiple columns at once the way OneHotEncoder does above; that is, LabelEncoder().fit_transform(testdata[['pet', 'age']]) will raise an error.
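
If you do need to binarize several columns this way, a straightforward workaround (a minimal sketch of the per-column loop that libraries like sklearn_pandas automate) is to encode each column separately and stack the results:

import numpy as np

encoded = np.hstack([LabelBinarizer().fit_transform(testdata[col])
                     for col in ['pet', 'age']])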

2.3. Useless attempts

Still, being as persistent as I am, how could I give up there? I combed carefully through sklearn's API and found something called MultiLabelBinarizer(), which looked like it might solve the problem, so I tried it

MultiLabelBinarizer().fit_transform(testdata[['age','salary']].values)

The output is as follows

array([[0, 0, 1, 0, 0],
      [0, 0, 0, 1, 1],
      [1, 1, 0, 0, 0],
      [1, 1, 0, 0, 0]])

At first glance there seemed to be no problem, but a closer look slapped me in the face! Instead of one-hot encoding each column separately, MultiLabelBinarizer treats the values of these columns as a single set of labels and deduplicates the labels within each row. That is why there is only one 1 in the first row of the result: age and salary both equal 4 in the first row, so MultiLabelBinarizer decides this sample carries just the single label 4...
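
You can confirm this by inspecting the fitted classes_ attribute, which holds the union of the values from both columns as a single label set:

mlb = MultiLabelBinarizer()
mlb.fit_transform(testdata[['age', 'salary']].values)
print(mlb.classes_)  # [1 3 4 5 6] -- one combined label set, not per-column categories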

3. Another solution

In fact, if we step outside scikit-learn, this problem is solved neatly in pandas with its built-in get_dummies function

pd.get_dummies(testdata, columns=testdata.columns)

The result is exactly what we want:

   age_3  age_4  age_6  pet_cat  pet_dog  pet_fish  salary_1  salary_4  salary_5
0    0.0    1.0    0.0      1.0      0.0       0.0       0.0       1.0       0.0
1    0.0    0.0    1.0      0.0      1.0       0.0       0.0       0.0       1.0
2    1.0    0.0    0.0      0.0      1.0       0.0       1.0       0.0       0.0
3    1.0    0.0    0.0      0.0      0.0       1.0       1.0       0.0       0.0

The advantages of get_dummies are:

  1. It is part of pandas itself, so it plays very well with the DataFrame type

  2. It can binarize your columns whether they are numeric or string

  3. It automatically generates descriptive names for the binarized columns (see the example below)
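
For instance, the prefix of the generated names can be customized with the standard prefix argument:

pd.get_dummies(testdata['pet'], prefix='animal')
# columns: animal_cat, animal_dog, animal_fish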

So, have we found the perfect solution? No! get_dummies is great, but it is not a sklearn transformer, so its output has to be fed into the corresponding sklearn module by hand, and it cannot be plugged into a sklearn pipeline the way a transformer can. One more important point:

Unlike a sklearn transformer, get_dummies has no separate  transform method, so once a feature value that never appeared in the training set shows up in the test set, simply calling  get_dummies on the training and test sets separately will produce misaligned data.
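
To make the danger concrete, here is a minimal sketch of the mismatch, along with one common fix: aligning the test dummies to the training columns with DataFrame.reindex (the tiny train/test frames are made up for illustration):

train = pd.DataFrame({'pet': ['cat', 'dog', 'fish']})
test = pd.DataFrame({'pet': ['cat', 'hamster']})   # 'hamster' never seen during training

train_d = pd.get_dummies(train, columns=['pet'])   # pet_cat, pet_dog, pet_fish
test_d = pd.get_dummies(test, columns=['pet'])     # pet_cat, pet_hamster -> misaligned!

# align the test set to the training columns, filling unseen dummies with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)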

So, if anyone has a better solution, please do suggest it. Thank you very much!
