- 1. Introduction
- 2. Origin of the problem
  - 2.1. Dealing with numeric categorical variables
  - 2.2. Dealing with string categorical variables
  - 2.3. Useless attempts
- 3. Another solution
1. Introduction
In the past few days I have been writing an article, 『优雅高效地数据挖掘——基于Python的sklearn_pandas库』, part of which involves how to binarize features in batches. In the process I found that the binarization functions in scikit-learn (hereafter sklearn) have some pitfalls, which I have discussed with the author of sklearn_pandas on GitHub. I summarize them here as a record.
The sklearn binarization functions involved are OneHotEncoder(), LabelEncoder(), LabelBinarizer(), and MultiLabelBinarizer().
2. Origin of the problem
First, create some test data:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
testdata = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish'],
                         'age': [4, 6, 3, 3],
                         'salary': [4, 5, 1, 1]})
Here we treat pet, age, and salary all as categorical features; the difference is that age and salary are numeric types, while pet is a string type. Our goal is simple: binarize them all, i.e. one-hot encode them.
2.1. Dealing with numeric categorical variables
Binarizing age seems very simple: just call OneHotEncoder directly.
OneHotEncoder(sparse=False).fit_transform(testdata.age)  # one might assume testdata.age is equivalent to testdata[['age']] here
However, the result is array([[ 1., 1., 1., 1.]]), which is wrong. The warning message reveals the reason: in newer versions of sklearn, OneHotEncoder requires a 2-D array as input, while testdata.age returns a Series, which is essentially a 1-D array. So change it to:
OneHotEncoder(sparse=False).fit_transform(testdata[['age']])
We get what we want:
array([[ 0., 1., 0.],
[ 0., 0., 1.],
[ 1., 0., 0.],
[ 1., 0., 0.]])
You can handle salary the same way with OneHotEncoder, and then stack the two results with numpy.hstack() to get the combined output:
import numpy
a1 = OneHotEncoder(sparse=False).fit_transform(testdata[['age']])
a2 = OneHotEncoder(sparse=False).fit_transform(testdata[['salary']])
final_output = numpy.hstack((a1, a2))
However, this code is slightly redundant. Since OneHotEncoder() accepts 2-D array input, we can simply write:
OneHotEncoder(sparse=False).fit_transform(testdata[['age', 'salary']])
The result is
array([[ 0., 1., 0., 0., 1., 0.],
[ 0., 0., 1., 0., 0., 1.],
[ 1., 0., 0., 1., 0., 0.],
[ 1., 0., 0., 1., 0., 0.]])
Sometimes, besides the final encoding result, we also want to know which columns in the output belong to the binarized encoding of age and which belong to salary. This can be done with OneHotEncoder()'s built-in feature_indices_ attribute. For example, its value here is [0, 3, 6], meaning that columns [0:3] are the binarized encoding of age and columns [3:6] are that of salary. See the sklearn documentation for details.
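Note that feature_indices_ belongs to the old OneHotEncoder API and no longer exists in recent sklearn versions, but the same slice boundaries can be recovered from the number of distinct categories per column. A minimal sketch using only pandas and numpy:

```python
import numpy as np
import pandas as pd

testdata = pd.DataFrame({'age': [4, 6, 3, 3],
                         'salary': [4, 5, 1, 1]})

# Each feature occupies as many output columns as it has distinct values.
n_categories = [testdata[col].nunique() for col in ['age', 'salary']]  # [3, 3]
boundaries = np.concatenate(([0], np.cumsum(n_categories)))            # [0, 3, 6]
# columns boundaries[0]:boundaries[1] of the encoded output belong to age,
# columns boundaries[1]:boundaries[2] belong to salary
```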
2.2. Dealing with string categorical variables
Unfortunately, OneHotEncoder cannot directly encode string-type categorical variables, which means that OneHotEncoder().fit_transform(testdata[['pet']]) will raise an error (try it if you don't believe it). Many people have discussed this on Stack Overflow and in sklearn's GitHub issues, but so far sklearn has not added OneHotEncoder support for string-type categorical variables, so a workaround is generally used:
- Method 1: first use LabelEncoder() to convert the strings into consecutive numeric values, then use OneHotEncoder() to binarize them
- Method 2: use LabelBinarizer() to binarize directly
Note, however, that both LabelEncoder() and LabelBinarizer() were originally designed in sklearn to discretize the label y, not the input X, so their input is restricted to a 1-D array, which is exactly the opposite of OneHotEncoder()'s 2-D requirement. So we must be extra careful when using them, or we will get wrong results like the array([[ 1., 1., 1., 1.]]) above.
# Method 1: LabelEncoder() + OneHotEncoder()
a = LabelEncoder().fit_transform(testdata['pet'])
OneHotEncoder(sparse=False).fit_transform(a.reshape(-1, 1))  # note: reshape converts a into a 2-D array
# Method 2: use LabelBinarizer() directly
LabelBinarizer().fit_transform(testdata['pet'])
Both methods give the same result:
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
Because LabelEncoder and LabelBinarizer are designed to support only 1-D arrays, they cannot accept multiple columns at once the way OneHotEncoder does above, which means that LabelEncoder().fit_transform(testdata[['pet', 'age']]) will raise an error.
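If you do want LabelEncoder-style integer codes for several columns at once, one pandas-only workaround (a sketch, not part of sklearn) is to encode column by column via category codes, which number the sorted distinct values 0, 1, 2, ... just as LabelEncoder does:

```python
import pandas as pd

testdata = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish'],
                         'age': [4, 6, 3, 3]})

# Encode each column separately: .cat.codes assigns integers to the
# sorted distinct values of that column, mirroring LabelEncoder.
encoded = testdata[['pet', 'age']].apply(lambda s: s.astype('category').cat.codes)
```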
2.3. Useless attempts
But being as persistent as I am, how could I give up there? I combed through sklearn's API and found one called MultiLabelBinarizer(), which seemed able to solve this problem, so I tried it:
MultiLabelBinarizer().fit_transform(testdata[['age','salary']].values)
The output is as follows
array([[0, 0, 1, 0, 0],
[0, 0, 0, 1, 1],
[1, 1, 0, 0, 0],
[1, 1, 0, 0, 0]])
At first glance the result looks fine, but a closer look proved me wrong! MultiLabelBinarizer does not one-hot encode each column separately; instead, it treats the values in these columns as one set of labels per row and deduplicates each row. That is why there is only one 1 in the first row of the result: both age and salary are 4 in the first row, so MultiLabelBinarizer considers that row to have only the single label 4...
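To see what MultiLabelBinarizer is actually doing, here is a plain-Python sketch of its row-as-a-set-of-labels semantics, reproducing the output above:

```python
# Each row is treated as a *set* of labels, so (age=4, salary=4)
# collapses to the single label {4}.
rows = [(4, 4), (6, 5), (3, 1), (3, 1)]
classes = sorted({v for row in rows for v in row})  # [1, 3, 4, 5, 6]
indicator = [[int(c in set(row)) for c in classes] for row in rows]
# indicator[0] == [0, 0, 1, 0, 0] -> only one 1, matching the output above
```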
3. Another solution
In fact, if we step outside scikit-learn, this problem is easily solved in pandas with the built-in get_dummies function:
pd.get_dummies(testdata, columns=testdata.columns)
The result is exactly what we wanted:
age_3 age_4 age_6 pet_cat pet_dog pet_fish salary_1 salary_4 salary_5
0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0
2 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
3 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
The advantages of get_dummies are:
- It is part of pandas itself, so it works seamlessly with the DataFrame type
- It can binarize a column whether it is numeric or string
- It automatically generates names for the binarized variables as instructed
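For instance, the generated column names can be controlled with the prefix parameter (the prefix 'animal' here is just an arbitrary label for illustration):

```python
import pandas as pd

pets = pd.Series(['cat', 'dog', 'dog', 'fish'], name='pet')
# prefix controls how the generated dummy columns are named
out = pd.get_dummies(pets, prefix='animal')
# columns become animal_cat, animal_dog, animal_fish
```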
So, have we found the perfect solution? No! get_dummies is great in every respect, but it is not an sklearn transformer, so its output has to be fed into the corresponding sklearn module manually, and it cannot be plugged into an sklearn pipeline the way a transformer can. More importantly, unlike an sklearn transformer, get_dummies has no transform method, so if a feature value that never appeared in the training set shows up in the test set, naively applying get_dummies to both sets will produce inconsistent data.
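To illustrate that pitfall, here is a small sketch, along with one common workaround: reindexing the test columns against the training columns (an ad-hoc fix, not a substitute for a fitted transformer):

```python
import pandas as pd

train = pd.DataFrame({'pet': ['cat', 'dog']})
test = pd.DataFrame({'pet': ['dog', 'fish']})   # 'fish' never seen in training

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)
# The column sets differ: train has pet_cat/pet_dog, test has pet_dog/pet_fish.
# Align the test set to the training columns, filling unseen ones with 0:
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```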
So, if anyone has a better solution, please suggest it. Thank you very much!