Data Processing and Dimensionality Reduction with sklearn

First of all: you can look up any sklearn function on the official site, https://scikit-learn.org/stable/user_guide.html, simply by typing its name into the search box in the top-right corner.
A Chinese translation of the user guide is also available.

sklearn provides the model_selection module for model selection, the preprocessing module for data preprocessing and the decomposition module for feature decomposition. With these three modules you can carry out the data preparation that precedes model building: standardization, binarization, splitting the dataset, cross-validation, PCA dimensionality reduction and so on. The breast cancer dataset is used below to introduce them from a practical, application-oriented angle.

1. Loading the datasets bundled in the datasets module

The datasets module of sklearn bundles several classic datasets for data analysis; you can use them to get familiar with the preprocessing and modeling workflow. The commonly used loading functions and the task type of each dataset are listed in the table below:

Dataset loading function      Task type
load_boston                   regression
fetch_california_housing      regression
load_breast_cancer            classification, clustering
load_iris                     classification, clustering
load_digits                   classification
load_wine                     classification

To use one of these datasets, assign the return value of the corresponding function to a variable. The loaded dataset behaves like a dictionary: for almost every sklearn dataset, the keys data, target, feature_names and DESCR give you the data matrix, the labels, the feature names and the description (which includes the minimum and maximum of each feature), respectively.

Example:
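A minimal sketch of this step (it mirrors the first few cells, In[25] through In[30], of the complete listing at the end):

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()           # a Bunch object that behaves like a dictionary
cancer_data = cancer['data']            # feature matrix, ndarray of shape (569, 30)
cancer_target = cancer['target']        # class labels, ndarray of shape (569,)
cancer_names = cancer['feature_names']  # names of the 30 features
cancer_desc = cancer['DESCR']           # text description, including per-feature min/max
print(cancer_data.shape, cancer_target.shape)  # (569, 30) (569,)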

2. Splitting the data into training and test sets

The model_selection module of sklearn provides the train_test_split function for splitting a dataset.
Its official docstring is as follows:

Signature: train_test_split(*arrays, **options)
Docstring:
Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and
``next(ShuffleSplit().split(X, y))`` and application to input data
into a single call for splitting (and optionally subsampling) data in a
oneliner.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters
----------
*arrays : sequence of indexables with same length / shape[0]
    Allowed inputs are lists, numpy arrays, scipy-sparse
    matrices or pandas dataframes.

test_size : float or int, default=None
    If float, should be between 0.0 and 1.0 and represent the proportion
    of the dataset to include in the test split. If int, represents the
    absolute number of test samples. If None, the value is set to the
    complement of the train size. If ``train_size`` is also None, it will
    be set to 0.25.

train_size : float or int, default=None
    If float, should be between 0.0 and 1.0 and represent the
    proportion of the dataset to include in the train split. If
    int, represents the absolute number of train samples. If None,
    the value is automatically set to the complement of the test size.

random_state : int or RandomState instance, default=None
    Controls the shuffling applied to the data before applying the split.
    Pass an int for reproducible output across multiple function calls.
    See :term:`Glossary <random_state>`.


shuffle : bool, default=True
    Whether or not to shuffle the data before splitting. If shuffle=False
    then stratify must be None.

stratify : array-like, default=None
    If not None, data is split in a stratified fashion, using this as
    the class labels.

Returns
-------
splitting : list, length=2 * len(arrays)
    List containing train-test split of inputs.

    .. versionadded:: 0.16
        If the input is sparse, the output will be a
        ``scipy.sparse.csr_matrix``. Else, output type is the same as the
        input type.

Examples
--------
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]

The commonly used parameters of this function are as follows:

Parameter      Description
*arrays        One or more datasets to be split. For classification or regression, pass the data and the labels; for clustering, pass only the data. No default.
test_size      float, int or None. Size of the test set. A float must lie between 0 and 1 and gives the proportion of the whole dataset assigned to the test set; an int gives the absolute number of test samples. Only one of test_size and train_size needs to be given.
train_size     float, int or None. Size of the training set. Only one of train_size and test_size needs to be given.
random_state   int. Random seed; the same seed reproduces the same split, different seeds give different splits. Defaults to None.
shuffle        boolean. Whether to shuffle the data before splitting. If shuffle=False, stratify must be None.
stratify       array or None. If not None, the data is split in a stratified fashion using the given array as class labels.

train_test_split splits each array passed to it into a training part and a test part. If one array is passed in, you get its randomly split training and test subsets, two outputs in total; if two arrays are passed in (e.g. data and labels), you get a training and a test subset for each, four outputs in total. In other words, several datasets can be passed in at once and they are split consistently.

Example:
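A minimal sketch of the split, assuming the breast_cancer arrays loaded above (it mirrors cells In[31] and In[32] of the complete listing at the end):

from sklearn.model_selection import train_test_split

# data and labels are passed together so that they are split consistently
data_train, data_test, target_train, target_test = train_test_split(
    cancer_data, cancer_target, test_size=0.2, random_state=42)
print(data_train.shape, data_test.shape)      # (455, 30) (114, 30)
print(target_train.shape, target_test.shape)  # (455,) (114,)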
train_test_split is the most commonly used splitting utility; the model_selection module also provides other splitters such as PredefinedSplit and ShuffleSplit, see the official documentation for details.

3. Preprocessing and dimensionality reduction with sklearn transformers

sklearn wraps a large number of feature-processing operations into transformers, which live in the preprocessing module. A transformer exposes three main methods: fit, transform and fit_transform, described below:

Method          Description
fit             Extracts useful information, such as statistics or weight coefficients, by analysing the features and, where relevant, the target values.
transform       Transforms the features. In terms of the information used, a transformation is either information-free or information-based. Information-free transformations use no additional information, e.g. exponential or logarithmic transforms. Information-based transformations are further divided, according to whether the target vector is used, into unsupervised ones, which rely only on feature statistics (e.g. standardization and PCA), and supervised ones, which use both the features and the targets (e.g. model-based feature selection and LDA).
fit_transform   Calls fit and then transform.

sklearn transformers can standardize, normalize, binarize and PCA-reduce the NumPy arrays passed to them.

In data analysis, feature-processing operations must be applied to the training set and the test set separately, with the rules and weight coefficients learned on the training set reused on the test set. Implementing this by hand is rather tedious; sklearn transformers solve the problem nicely.

Example: apply min-max scaling (deviation standardization) to the breast_cancer dataset.
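A minimal sketch (it mirrors cell In[48] of the complete listing at the end): the scaling rule is fitted on the training set only and then reused on the test set.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(data_train)            # learn the min/max rule on the training set only
cancer_trainScaler = scaler.transform(data_train)  # apply the rule to the training set
cancer_testScaler = scaler.transform(data_test)    # reuse the same rule on the test set
print(np.min(cancer_trainScaler), np.max(cancer_trainScaler))  # exactly 0.0 and 1.0
print(np.min(cancer_testScaler), np.max(cancer_testScaler))    # may fall slightly outside [0, 1]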

As you can see, after min-max scaling the minimum and maximum of the training set are confined to the interval [0, 1], while the test set, to which the training-set scaling rule was applied, falls slightly outside [0, 1]; this also confirms that the rule learned on the training set was indeed reused.

The official docstring of MinMaxScaler():

Docstring:  
Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such
that it is in the given range on the training set, e.g. between
zero and one.

The transformation is given by::

    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X_scaled = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero mean,
unit variance scaling.

Read more in the :ref:`User Guide <preprocessing_scaler>`.

Parameters
----------
feature_range : tuple (min, max), default=(0, 1)
    Desired range of transformed data.

copy : bool, default=True
    Set to False to perform inplace row normalization and avoid a
    copy (if the input is already a numpy array).

Attributes
----------
min_ : ndarray of shape (n_features,)
    Per feature adjustment for minimum. Equivalent to
    ``min - X.min(axis=0) * self.scale_``

scale_ : ndarray of shape (n_features,)
    Per feature relative scaling of the data. Equivalent to
    ``(max - min) / (X.max(axis=0) - X.min(axis=0))``

    .. versionadded:: 0.17
       *scale_* attribute.

data_min_ : ndarray of shape (n_features,)
    Per feature minimum seen in the data

    .. versionadded:: 0.17
       *data_min_*

data_max_ : ndarray of shape (n_features,)
    Per feature maximum seen in the data

    .. versionadded:: 0.17
       *data_max_*

data_range_ : ndarray of shape (n_features,)
    Per feature range ``(data_max_ - data_min_)`` seen in the data

    .. versionadded:: 0.17
       *data_range_*

n_samples_seen_ : int
    The number of samples processed by the estimator.
    It will be reset on new calls to fit, but increments across
    ``partial_fit`` calls.

Examples
--------
>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler()
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[1.5 0. ]]

See also
--------
minmax_scale: Equivalent function without the estimator API.

Notes
-----
NaNs are treated as missing values: disregarded in fit, and maintained in
transform.

For a comparison of the different scalers, transformers, and normalizers,
see :ref:`examples/preprocessing/plot_all_scaling.py

Besides the min-max scaler MinMaxScaler, sklearn provides a whole family of preprocessing classes:

Class name            Description
StandardScaler        Standardizes features to zero mean and unit variance
Normalizer            Scales individual samples to unit norm (normalization)
Binarizer             Binarizes quantitative (numerical) features according to a threshold
OneHotEncoder         One-hot encodes qualitative (categorical) features
FunctionTransformer   Applies a user-defined function to the features

These classes will be used in a later case study, so they are not discussed in depth here; a brief sketch follows.
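For a quick taste ahead of that case study, a hedged sketch of two of these classes; data_train is the breast-cancer training split from above, and the colour column is made-up toy data used only to show the call:

from sklearn.preprocessing import StandardScaler, OneHotEncoder

# standardize the numeric breast-cancer features to zero mean and unit variance
std = StandardScaler().fit(data_train)
data_train_std = std.transform(data_train)

# one-hot encode a tiny categorical toy column (categories are sorted alphabetically)
enc = OneHotEncoder().fit([['red'], ['green'], ['blue']])
print(enc.transform([['green']]).toarray())  # [[0. 1. 0.]]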

In addition to these basic feature transformations, sklearn's decomposition module provides dimensionality-reduction and feature-selection algorithms, which are used through the same transformer interface.

Example: apply PCA dimensionality reduction to the breast_cancer dataset.
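A minimal sketch (it mirrors cell In[50] of the complete listing at the end), fitting PCA on the min-max-scaled training set and applying the same projection to the test set:

from sklearn.decomposition import PCA

pca_model = PCA(n_components=10).fit(cancer_trainScaler)   # learn the projection on the scaled training set
cancer_trainPca = pca_model.transform(cancer_trainScaler)  # project the training set
cancer_testPca = pca_model.transform(cancer_testScaler)    # project the test set with the same components
print(cancer_trainPca.shape, cancer_testPca.shape)         # (455, 10) (114, 10)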
The official docstring of PCA():

Init signature:
PCA(
    n_components=None,
    *,
    copy=True,
    whiten=False,
    svd_solver='auto',
    tol=0.0,
    iterated_power='auto',
    random_state=None,
)
Docstring:     
Principal component analysis (PCA).

Linear dimensionality reduction using Singular Value Decomposition of the
data to project it to a lower dimensional space. The input data is centered
but not scaled for each feature before applying the SVD.

It uses the LAPACK implementation of the full SVD or a randomized truncated
SVD by the method of Halko et al. 2009, depending on the shape of the input
data and the number of components to extract.

It can also use the scipy.sparse.linalg ARPACK implementation of the
truncated SVD.

Notice that this class does not support sparse input. See
:class:`TruncatedSVD` for an alternative with sparse data.

Read more in the :ref:`User Guide <PCA>`.

Parameters
----------
n_components : int, float, None or str
    Number of components to keep.
    if n_components is not set all components are kept::

        n_components == min(n_samples, n_features)

    If ``n_components == 'mle'`` and ``svd_solver == 'full'``, Minka's
    MLE is used to guess the dimension. Use of ``n_components == 'mle'``
    will interpret ``svd_solver == 'auto'`` as ``svd_solver == 'full'``.

    If ``0 < n_components < 1`` and ``svd_solver == 'full'``, select the
    number of components such that the amount of variance that needs to be
    explained is greater than the percentage specified by n_components.

    If ``svd_solver == 'arpack'``, the number of components must be
    strictly less than the minimum of n_features and n_samples.

    Hence, the None case results in::

        n_components == min(n_samples, n_features) - 1

copy : bool, default=True
    If False, data passed to fit are overwritten and running
    fit(X).transform(X) will not yield the expected results,
    use fit_transform(X) instead.

whiten : bool, optional (default False)
    When True (False by default) the `components_` vectors are multiplied
    by the square root of n_samples and then divided by the singular values
    to ensure uncorrelated outputs with unit component-wise variances.

    Whitening will remove some information from the transformed signal
    (the relative variance scales of the components) but can sometime
    improve the predictive accuracy of the downstream estimators by
    making their data respect some hard-wired assumptions.

svd_solver : str {'auto', 'full', 'arpack', 'randomized'}
    If auto :
        The solver is selected by a default policy based on `X.shape` and
        `n_components`: if the input data is larger than 500x500 and the
        number of components to extract is lower than 80% of the smallest
        dimension of the data, then the more efficient 'randomized'
        method is enabled. Otherwise the exact full SVD is computed and
        optionally truncated afterwards.
    If full :
        run exact full SVD calling the standard LAPACK solver via
        `scipy.linalg.svd` and select the components by postprocessing
    If arpack :
        run SVD truncated to n_components calling ARPACK solver via
        `scipy.sparse.linalg.svds`. It requires strictly
        0 < n_components < min(X.shape)
    If randomized :
        run randomized SVD by the method of Halko et al.

    .. versionadded:: 0.18.0

tol : float >= 0, optional (default .0)
    Tolerance for singular values computed by svd_solver == 'arpack'.

    .. versionadded:: 0.18.0

iterated_power : int >= 0, or 'auto', (default 'auto')
    Number of iterations for the power method computed by
    svd_solver == 'randomized'.

    .. versionadded:: 0.18.0

random_state : int, RandomState instance, default=None
    Used when ``svd_solver`` == 'arpack' or 'randomized'. Pass an int
    for reproducible results across multiple function calls.
    See :term:`Glossary <random_state>`.

    .. versionadded:: 0.18.0

Attributes
----------
components_ : array, shape (n_components, n_features)
    Principal axes in feature space, representing the directions of
    maximum variance in the data. The components are sorted by
    ``explained_variance_``.

explained_variance_ : array, shape (n_components,)
    The amount of variance explained by each of the selected components.

    Equal to n_components largest eigenvalues
    of the covariance matrix of X.

    .. versionadded:: 0.18

explained_variance_ratio_ : array, shape (n_components,)
    Percentage of variance explained by each of the selected components.

    If ``n_components`` is not set then all components are stored and the
    sum of the ratios is equal to 1.0.

singular_values_ : array, shape (n_components,)
    The singular values corresponding to each of the selected components.
    The singular values are equal to the 2-norms of the ``n_components``
    variables in the lower-dimensional space.

    .. versionadded:: 0.19

mean_ : array, shape (n_features,)
    Per-feature empirical mean, estimated from the training set.

    Equal to `X.mean(axis=0)`.

n_components_ : int
    The estimated number of components. When n_components is set
    to 'mle' or a number between 0 and 1 (with svd_solver == 'full') this
    number is estimated from input data. Otherwise it equals the parameter
    n_components, or the lesser value of n_features and n_samples
    if n_components is None.

n_features_ : int
    Number of features in the training data.

n_samples_ : int
    Number of samples in the training data.

noise_variance_ : float
    The estimated noise covariance following the Probabilistic PCA model
    from Tipping and Bishop 1999. See "Pattern Recognition and
    Machine Learning" by C. Bishop, 12.2.1 p. 574 or
    http://www.miketipping.com/papers/met-mppca.pdf. It is required to
    compute the estimated data covariance and score samples.

    Equal to the average of (min(n_features, n_samples) - n_components)
    smallest eigenvalues of the covariance matrix of X.

See Also
--------
KernelPCA : Kernel Principal Component Analysis.
SparsePCA : Sparse Principal Component Analysis.
TruncatedSVD : Dimensionality reduction using truncated SVD.
IncrementalPCA : Incremental Principal Component Analysis.

References
----------
For n_components == 'mle', this class uses the method of *Minka, T. P.
"Automatic choice of dimensionality for PCA". In NIPS, pp. 598-604*

Implements the probabilistic PCA model from:
Tipping, M. E., and Bishop, C. M. (1999). "Probabilistic principal
component analysis". Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 61(3), 611-622.
via the score and score_samples methods.
See http://www.miketipping.com/papers/met-mppca.pdf

For svd_solver == 'arpack', refer to `scipy.sparse.linalg.svds`.

For svd_solver == 'randomized', see:
*Halko, N., Martinsson, P. G., and Tropp, J. A. (2011).
"Finding structure with randomness: Probabilistic algorithms for
constructing approximate matrix decompositions".
SIAM review, 53(2), 217-288.* and also
*Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011).
"A randomized algorithm for the decomposition of matrices".
Applied and Computational Harmonic Analysis, 30(1), 47-68.*

Examples
--------
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(n_components=2)
>>> print(pca.explained_variance_ratio_)
[0.9924... 0.0075...]
>>> print(pca.singular_values_)
[6.30061... 0.54980...]

>>> pca = PCA(n_components=2, svd_solver='full')
>>> pca.fit(X)
PCA(n_components=2, svd_solver='full')
>>> print(pca.explained_variance_ratio_)
[0.9924... 0.00755...]
>>> print(pca.singular_values_)
[6.30061... 0.54980...]

>>> pca = PCA(n_components=1, svd_solver='arpack')
>>> pca.fit(X)
PCA(n_components=1, svd_solver='arpack')
>>> print(pca.explained_variance_ratio_)
[0.99244...]
>>> print(pca.singular_values_)
[6.30061...]

The commonly used parameters of the PCA class and their effects are as follows:

Parameter      Description
n_components   None, int, float or 'mle'. If not specified, all components are kept; an int reduces the data to that many dimensions; a float between 0 and 1 makes PCA keep as many components as are needed to explain at least that fraction of the variance; 'mle' lets PCA pick the number of components automatically with Minka's MLE based on the variance structure of the features. Defaults to None.
copy           boolean. Whether to copy the original data before running the algorithm. If True, the original data is left unchanged after fitting; if False, it may be overwritten. Defaults to True.
whiten         boolean. Whitening, i.e. rescaling each component of the reduced data so that its variance equals 1. Defaults to False.
svd_solver     'auto', 'full', 'arpack' or 'randomized'. The SVD algorithm to use. 'randomized' suits large, high-dimensional data when only a small fraction of the components is kept; it uses randomized algorithms to speed up the SVD. 'full' is the classic exact SVD implemented via SciPy. 'arpack' targets scenarios similar to 'randomized'; the difference is that 'randomized' uses sklearn's own randomized SVD, while 'arpack' calls SciPy's sparse SVD (scipy.sparse.linalg.svds) directly. 'auto' lets PCA weigh these options and pick a suitable solver. Defaults to 'auto'.
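As a quick illustration of the n_components variants described above, a hedged sketch assuming the min-max-scaled training set cancer_trainScaler from the earlier example; the exact number of components retained depends on the data:

from sklearn.decomposition import PCA

# keep as many components as needed to explain at least 95% of the variance
pca_var = PCA(n_components=0.95, svd_solver='full').fit(cancer_trainScaler)
print(pca_var.n_components_, pca_var.explained_variance_ratio_.sum())

# let Minka's MLE pick the dimensionality automatically (requires svd_solver='full')
pca_mle = PCA(n_components='mle', svd_solver='full').fit(cancer_trainScaler)
print(pca_mle.n_components_)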

The complete code used in this article is as follows:

#!/usr/bin/env python
# coding: utf-8


from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(len(cancer))
print(type(cancer))


# In[25]:


cancer_data = cancer['data']
print(cancer_data.shape)


# In[26]:


print(type(cancer_data)) # the data obtained is a numpy ndarray


# In[27]:


cancer_target = cancer['target']
print(cancer_target)


# In[28]:


cancer_names = cancer['feature_names']
print(cancer_names)


# In[29]:


cancer_desc = cancer['DESCR']
print(cancer_desc)


# In[30]:


print(cancer_data.shape)
print(cancer_target.shape)


# In[31]:


from sklearn.model_selection import train_test_split
print(cancer_data.shape, cancer_target.shape)


# In[32]:


data_train, data_test, target_train, target_test = train_test_split(cancer_data, cancer_target, test_size=0.2, random_state=42)

print(data_train.shape, data_test.shape, target_train.shape, target_test.shape)


# In[48]:


import numpy as np
from sklearn.preprocessing import MinMaxScaler
Scaler = MinMaxScaler().fit(data_train) # learn the scaling rule on the training set
# inspect some intermediate values of the learned rule
print(Scaler.data_max_[0], Scaler.data_min_[0], Scaler.data_range_[0]) # max, min and range of the first feature
# apply the rule to the training set
cancer_trainScaler = Scaler.transform(data_train)
# apply the same rule to the test set
cancer_testScaler = Scaler.transform(data_test)
print("Training-set minimum before min-max scaling:", np.min(data_train))
print("Training-set maximum before min-max scaling:", np.max(data_train))
print("Training-set minimum after min-max scaling:", np.min(cancer_trainScaler))
print("Training-set maximum after min-max scaling:", np.max(cancer_trainScaler))

print('-------------------------------------')
print("Test-set minimum before min-max scaling:", np.min(data_test))
print("Test-set maximum before min-max scaling:", np.max(data_test))
print("Test-set minimum after min-max scaling:", np.min(cancer_testScaler))
print("Test-set maximum after min-max scaling:", np.max(cancer_testScaler))


# In[50]:


from sklearn.decomposition import PCA
pca_model = PCA(n_components=10).fit(cancer_trainScaler) # learn the projection on the min-max-scaled training set
# apply the projection to the training set
cancer_trainPca = pca_model.transform(cancer_trainScaler)
# apply the same projection to the test set
cancer_testPca = pca_model.transform(cancer_testScaler)
print('Training-set dimensions before PCA:', data_train.shape)
print('Training-set dimensions after PCA:', cancer_trainPca.shape)
print('Test-set dimensions before PCA:', data_test.shape)
print('Test-set dimensions after PCA:', cancer_testPca.shape)


Reposted from blog.csdn.net/qq_38048756/article/details/115144438