Getting Valid Data

Scikit-learn does not accept categorical features by default.
The API cannot work with raw category names directly, so they have to be encoded.
(I still have some questions about this point.)
Categorical features need to be encoded numerically.
Convert them to 'dummy variables':
- 0: Observation was NOT that category
- 1: Observation was that category

Dealing with categorical features in Python

Both approaches produce the same kind of encoding:

scikit-learn: OneHotEncoder()
pandas: get_dummies()
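
As a rough check that the two tools agree, the sketch below encodes the same toy column with both. The `color` column and its values are made up for illustration. `OneHotEncoder` returns a sparse matrix by default, so `.toarray()` is used before comparing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical, for illustration only)
df = pd.DataFrame({'color': ['green', 'red', 'blue', 'red']})

# pandas: one indicator column per category, in sorted order
dummies = pd.get_dummies(df['color']).astype(int)

# scikit-learn: fit the encoder, then densify the sparse result
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['color']]).toarray().astype(int)

# Compare the two 0/1 matrices element by element
same = np.array_equal(dummies.to_numpy(), encoded)
print(dummies.columns.tolist())
print(same)
```

One practical difference: a fitted `OneHotEncoder` remembers its categories and can be reused on new data inside a pipeline, while `get_dummies` simply encodes whatever values it is given.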

pd.get_dummies

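Encoding discrete features
Dummy variables can represent categorical variables and the possible influence of non-quantitative factors.
This is how pandas adds dummy variables:
get_dummies is the pandas way to perform one-hot encoding. See the official documentation for the full parameter list.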

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)

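data: the data to process (a DataFrame, Series, or array-like)
prefix: prefix for the new column names; useful when several columns share the same category values
prefix_sep: separator between the prefix and the category value; defaults to an underscore, which is usually fine
dummy_na: whether to treat NA as a category of its own; defaults to False (NA is ignored)
columns: names of the columns to encode; if omitted, all columns with object or category dtype are encoded
drop_first: whether to drop the first level of each encoded column; used in modeling to avoid collinearity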
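
A minimal sketch of how these parameters change the result. The frame below (`color`, `size`, `price`) is hypothetical and exists only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a missing value in 'color'
df = pd.DataFrame({
    'color': ['green', 'red', np.nan, 'red'],
    'size': ['S', 'M', 'M', 'L'],
    'price': [10.0, 12.5, 9.0, 11.0],
})

# Encode only 'color', with a custom prefix and separator
only_color = pd.get_dummies(df, columns=['color'], prefix='c', prefix_sep='-')

# Treat NaN as its own category
with_na = pd.get_dummies(df, columns=['color'], dummy_na=True)

# Drop the first level of each encoded column to avoid collinearity
dropped = pd.get_dummies(df, drop_first=True)

print(only_color.columns.tolist())
print(with_na.columns.tolist())
print(dropped.columns.tolist())
```

Note that the numeric `price` column is passed through unchanged in every call, since only object or category columns are encoded unless `columns` names them explicitly.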

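The pandas get_dummies() function converts a variable with several distinct values into 0/1 indicator columns.
For example: suppose a group of samples has ages 19, 32, 56, and 94, labeled 1, 2, 3, and 4 respectively. The magnitudes of 1, 2, 3, 4 carry no meaning; they only distinguish the ages. In practice these labels should therefore be converted to 0/1 columns: the "19" column is 1 if the sample is 19 years old and 0 otherwise, and so on for the other ages.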
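
The age example can be sketched as follows. The `age` name and the `age` prefix are chosen here for illustration. A Series passed directly is encoded even when it is numeric, whereas numeric columns inside a DataFrame are skipped unless they are listed in `columns`.

```python
import pandas as pd

# Ages from the example, treated as category labels rather than quantities
ages = pd.Series([19, 32, 56, 94], name='age')

# Each distinct age becomes its own indicator column
age_dummies = pd.get_dummies(ages, prefix='age').astype(int)
print(age_dummies)
```

Each row contains a single 1, in the column that matches that sample's age.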

An example:

import pandas as pd

# Build a small DataFrame with two categorical columns
df = pd.DataFrame([
    ['green', 'm'],
    ['red', 'n'],
    ['blue', 'q']])
df.columns = ['color', 'class']

# Create dummy variables for every categorical column: df_dummies
df_dummies = pd.get_dummies(df)

# Print the columns of df_dummies
print(df_dummies.columns)

# Re-encode, dropping the first level of each column
# (here 'color_blue' and 'class_m')
df_dummies = pd.get_dummies(df, drop_first=True)

# Print the reduced DataFrame
print(df_dummies)

Handling Missing Data

Imputing missing values:

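sklearn.preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)

Note: Imputer was deprecated in scikit-learn 0.20 and removed in 0.22. Newer versions provide sklearn.impute.SimpleImputer instead, which has no axis parameter and always imputes column by column.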

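Main parameters:

missing_values: the placeholder for missing values; an integer or NaN (numpy.nan is written as the string 'NaN'); defaults to 'NaN'
strategy: the replacement strategy, a string; defaults to 'mean'
- mean: replace with the mean of the feature column
- median: replace with the median of the feature column
- most_frequent: replace with the most frequent value (mode) of the feature column
axis: the axis to impute along; the default axis=0 means columns, axis=1 means rows
copy: True leaves the original data untouched; False modifies it in place. Even with copy=False, a copy is still made in the following cases:
- X is not an array of floating-point values
- X is sparse and missing_values=0
- axis=0 and X is a CSR matrix
- axis=1 and X is a CSC matrix
statistics_ attribute: with axis=0, the array of fill values for each feature; with axis=1, accessing it raises an error because the attribute does not exist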
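
A small sketch of the three strategies and the `statistics_` attribute. It uses `sklearn.impute.SimpleImputer`, the replacement for `Imputer` in scikit-learn 0.22 and later, rather than the class described above, and the matrix `X` is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with missing entries
X = np.array([[1.0, 2.0],
              [np.nan, 2.0],
              [7.0, np.nan],
              [1.0, 8.0]])

stats = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    X_filled = imp.fit_transform(X)
    # statistics_ holds the fill value computed for each column
    stats[strategy] = imp.statistics_
    print(strategy, imp.statistics_)
```

The fill values are learned in `fit`, so the same fitted imputer can later be applied to a test set with `transform` without recomputing them.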

# Import the imputer and the classifier
# (Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it)
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC

# Set up the imputation transformer: imp
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Instantiate the SVC classifier: clf
clf = SVC()

# Set up the pipeline steps: steps
steps = [('imputation', imp),
         ('SVM', clf)]

Pipeline

A pipeline chains several transformers and a final estimator together to form a machine learning workflow.

# Import necessary modules
# (Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it)
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Set up the pipeline steps: steps
steps = [('imputation', SimpleImputer(strategy='most_frequent')),
         ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
# (X and y are assumed to be the feature matrix and labels loaded earlier)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Compute metrics
print(classification_report(y_test, y_pred))

Output:
                 precision    recall  f1-score   support
    
       democrat       0.99      0.96      0.98        85
     republican       0.94      0.98      0.96        46
    
    avg / total       0.97      0.97      0.97       131
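
For readers without the original dataset, here is a self-contained sketch of the same workflow. The data is synthetic (generated with `make_classification`, with about 10% of the entries blanked out), so it is only a stand-in for the real `X` and `y`, and `SimpleImputer` is used in place of the removed `Imputer`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic data standing in for the original X and y (an assumption)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Blank out roughly 10% of the entries to simulate missing values
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.1] = np.nan

# Imputation followed by the classifier, as one estimator
pipeline = Pipeline([
    ('imputation', SimpleImputer(strategy='most_frequent')),
    ('SVM', SVC()),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit imputer and classifier on the training set only, then predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(pipeline.score(X_test, y_test))
```

Because the imputer sits inside the pipeline, its fill values are learned from the training split alone and then reused on the test split, which avoids leaking test-set information into preprocessing.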
