Understanding OneHotEncoder

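OneHotEncoder works on a 2-D array, treating each row as a sample and each column as a feature.
The values that occur in each column are collected and sorted (for example, in ascending order). If the data passed to fit has N columns, the fitted encoder's categories_ contains N arrays, one per column.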
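
As a rough illustration of this rule, the per-column categories can be reproduced with plain NumPy. This is only a sketch of the idea, not scikit-learn's actual implementation: column_categories is a hypothetical helper name, and it assumes the categories are simply the sorted unique values of each column.

```python
import numpy as np

def column_categories(data):
    """Return one sorted array of unique values per column (hypothetical helper)."""
    arr = np.asarray(data)
    # np.unique returns the distinct values of a column in ascending order
    return [np.unique(arr[:, j]) for j in range(arr.shape[1])]

data = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
cats = column_categories(data)
print(len(cats))  # one category array per column
for c in cats:
    print(c)
```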

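For an array passed to transform, the value in the first column is matched against the first array in categories_: the position of the matching category is set to 1 and every other position to 0. The same is done for each column in turn, so the value in the Nth column is matched against the Nth array in categories_. The per-column results are then joined side by side (the same effect as np.c_[]) and returned as the encoded row.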

The same steps are then repeated for every column value of the next row.
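
The procedure just described can be sketched in plain NumPy. This is an illustrative re-implementation under simple assumptions (every value is present in its column's category list), not the library's code; one_hot_rows is a hypothetical name, and the category lists are typed in by hand.

```python
import numpy as np

def one_hot_rows(rows, categories):
    """Encode each row against per-column category arrays (illustrative helper)."""
    rows = np.asarray(rows)
    blocks = []
    for j, cats in enumerate(categories):
        # Compare column j with every category: shape (n_rows, len(cats)),
        # 1.0 where the value equals that category, 0.0 elsewhere.
        blocks.append((rows[:, [j]] == np.asarray(cats)).astype(float))
    # Join the per-column blocks side by side, as np.c_ does.
    return np.c_[tuple(blocks)]

# Assumed categories for three columns
categories = [[0, 1], [0, 1, 2], [0, 1, 2, 3]]
encoded = one_hot_rows([[0, 1, 1], [1, 2, 3]], categories)
print(encoded)
```

Each row ends up with exactly one 1 per original column, and the output width is the sum of the category counts (2 + 3 + 4 = 9 here).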

For example:

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> data = [[0,0,3],[1,1,0],[0,2,1],[1,0,2]]
>>> enc.fit(data)

Warning (from warnings module):
  File "/Users/bnz/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 368
    warnings.warn(msg, FutureWarning)
FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)
>>> enc.categories_
[array([0., 1.]), array([0., 1., 2.]), array([0., 1., 2., 3.])]

>>> enc.transform([[0,1,1]]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0.]])
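
One detail worth checking here: the encoder above was created with handle_unknown='error', so a value that never appeared during fit is rejected at transform time rather than silently encoded as zeros. The sketch below contrasts the two settings. It assumes a scikit-learn version that accepts categories='auto' (0.20 or later); the names strict and lenient are illustrative only.

```python
from sklearn.preprocessing import OneHotEncoder

data = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]

# Default handle_unknown='error': a value unseen during fit is rejected.
strict = OneHotEncoder(categories='auto').fit(data)
try:
    strict.transform([[5, 1, 1]])  # 5 never occurs in the first column
    raised = False
except ValueError:
    raised = True
print(raised)

# handle_unknown='ignore': the unseen value's block is left as all zeros.
lenient = OneHotEncoder(categories='auto', handle_unknown='ignore').fit(data)
row = lenient.transform([[5, 1, 1]]).toarray()
print(row)
```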

>>> data2 = [[0, 0, 3], [3, 1, 0], [0, 2, 1], [1, 0, 2]]
>>> enc.fit(data2)

Warning (from warnings module):
  File "/Users/bnz/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 368
    warnings.warn(msg, FutureWarning)
FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)

>>> enc.categories_
[array([0., 1., 3.]), array([0., 1., 2.]), array([0., 1., 2., 3.])]
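
Both fit calls above emit a FutureWarning, and the warning text itself suggests passing categories='auto'. A minimal sketch of that variant follows, assuming scikit-learn 0.20 or later; the variables cats and width are only there for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

data2 = [[0, 0, 3], [3, 1, 0], [0, 2, 1], [1, 0, 2]]

# categories='auto' derives the categories from the unique values per column
enc = OneHotEncoder(categories='auto')
enc.fit(data2)

cats = [c.tolist() for c in enc.categories_]
print(cats)

# One output column per category across all input columns
width = enc.transform(data2).toarray().shape[1]
print(width)
```

The first column of data2 contains only 0, 1 and 3, so its category array has three entries and no slot is reserved for the missing value 2.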

Reference: https://blog.csdn.net/lanchunhui/article/details/72794317
