[One-Week Algorithm Practice, Advanced] Task 2: Feature Engineering

Import the packages used in this task:

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
                            confusion_matrix, f1_score, roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
import warnings

warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)
%matplotlib inline
plt.rc('font', family='SimHei', size=14)
plt.rcParams['axes.unicode_minus']=False
%config InlineBackend.figure_format = 'retina'

Preparing the Data

Importing the Data

Original dataset download link: https://pan.baidu.com/s/1wO9qJRjnrm8uhaSP67K0lw

Note: this is financial data (not raw data; it has already been processed), and the goal is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.

This task uses the dataset that was already cleaned in the previous post ([One-Week Algorithm Practice, Advanced] Task 1: Data Preprocessing):

data_processed = pd.read_csv('data_processed.csv')
data_processed.head()
low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day transd_mcc trans_days_interval_filter trans_days_interval regional_mobility ... loans_latest_day 一线城市 三线城市 二线城市 其他城市 境外 latest_query_time_month latest_query_time_weekday loans_latest_time_month loans_latest_time_weekday
0 0.01 0.99 0.0 0.90 0.55 0.313 17.0 27.0 26.0 3.0 ... 18.0 1 0 0 0 0 4 2 4 3
1 0.02 0.94 2000.0 1.28 1.00 0.458 19.0 30.0 14.0 4.0 ... 2.0 1 0 0 0 0 5 3 5 5
2 0.04 0.96 0.0 1.00 1.00 0.114 13.0 68.0 22.0 1.0 ... 6.0 1 0 0 0 0 5 5 5 1
3 0.00 0.96 2000.0 0.13 0.57 0.777 22.0 14.0 6.0 3.0 ... 4.0 0 1 0 0 0 5 5 5 3
4 0.01 0.99 0.0 0.46 1.00 0.175 13.0 66.0 42.0 1.0 ... 120.0 1 0 0 0 0 4 6 1 6

5 rows × 89 columns

Splitting the Data

Split the original data into features and label:

label = data_processed['status']
data = data_processed.drop(['status'], axis=1)
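
Before going further, it is worth a quick look at the class balance, since it puts the precision/recall numbers later on in context (a short supplementary sketch, not part of the original notebook):

label.value_counts()   # number of non-overdue (0) vs. overdue (1) customers
label.mean()           # share of overdue customers in the sample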

Standardization

scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
data_scaled.head()
low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day transd_mcc trans_days_interval_filter trans_days_interval regional_mobility ... loans_latest_day 一线城市 三线城市 二线城市 其他城市 境外 latest_query_time_month latest_query_time_weekday loans_latest_time_month loans_latest_time_weekday
0 -0.281946 0.613274 -0.493153 -0.019542 -1.295506 -0.315097 -0.120837 -0.090540 0.256651 0.361774 ... -0.696716 0.632505 -0.538091 -0.170886 -0.029907 -0.181666 -0.211985 -0.674125 -0.188457 -0.015516
1 -0.044836 0.267090 0.007392 -0.019011 0.993048 0.534560 0.323209 0.040603 -0.467915 1.483449 ... -0.996528 0.632505 -0.538091 -0.170886 -0.029907 -0.181666 0.513432 -0.161836 0.135729 1.005817
2 0.429385 0.405564 -0.493153 -0.019402 0.993048 -1.481178 -1.008929 1.701755 0.015129 -1.881576 ... -0.921575 0.632505 -0.538091 -0.170886 -0.029907 -0.181666 0.513432 0.862743 0.135729 -1.036849
3 -0.519056 0.405564 0.007392 -0.020619 -1.193793 2.403805 0.989278 -0.658829 -0.950958 0.361774 ... -0.959052 -1.581015 1.858422 -0.170886 -0.029907 -0.181666 0.513432 0.862743 0.135729 -0.015516
4 -0.281946 0.613274 -0.493153 -0.020157 0.993048 -1.123736 -1.008929 1.614326 1.222738 -1.881576 ... 1.214585 0.632505 -0.538091 -0.170886 -0.029907 -0.181666 -0.211985 1.375032 -1.161015 1.516484

5 rows × 88 columns
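
Strictly speaking, fitting the scaler on the full dataset before the later train/test split lets test-set statistics influence the transform. A minimal leakage-free sketch (same StandardScaler, fit on the training rows only; the variable names here are illustrative):

# Hypothetical variant: split first, then fit the scaler on the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(data, label, test_size=0.3, random_state=2018)
scaler = StandardScaler().fit(X_tr)   # statistics come from the training data only
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr), columns=data.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=data.columns, index=X_te.index)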

Feature Selection

Selection by IV

IV stands for Information Value. Only the calculation of IV is introduced here; see the references for more detail.

First compute the WOE (Weight of Evidence) of each bin:

WOE_i = \ln\left(\frac{p(y_i)}{p(n_i)}\right) = \ln\left(\frac{y_i/y_T}{n_i/n_T}\right)

where p(y_i) is the proportion of this bin's overdue customers (status = 1) among all overdue customers in the sample, and p(n_i) is the proportion of this bin's non-overdue customers (status = 0) among all non-overdue customers. y_i is the number of overdue customers in this bin, y_T the total number of overdue customers, n_i the number of non-overdue customers in this bin, and n_T the total number of non-overdue customers.

The IV contribution of a single bin is then:

IV_i = (p(y_i) - p(n_i)) \times WOE_i = (y_i/y_T - n_i/n_T) \times \ln\left(\frac{y_i/y_T}{n_i/n_T}\right)

A feature's IV is the sum of IV_i over all of its bins; it reflects the feature's predictive power, as shown in the table below.

IV            Predictive power
< 0.03        almost none
0.03 ~ 0.09   weak
0.1 ~ 0.29    medium
0.3 ~ 0.49    strong
>= 0.5        extremely high
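
As a tiny worked example with made-up counts (not from this dataset): suppose one bin contains 20 of the 100 overdue customers and 10 of the 400 non-overdue ones; its WOE and IV contribution can be computed directly:

# Toy numbers, purely illustrative
y_i, y_T = 20, 100      # overdue customers in this bin / in the whole sample
n_i, n_T = 10, 400      # non-overdue customers in this bin / in the whole sample

p_y, p_n = y_i / y_T, n_i / n_T   # 0.20 and 0.025
woe = np.log(p_y / p_n)           # ln(8) ≈ 2.08
iv_bin = (p_y - p_n) * woe        # ≈ 0.36; the feature's IV sums this over all bins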

Binning

Before computing IV, the data must first be binned. Binning methods include supervised approaches (chi-square, minimum entropy) and unsupervised approaches (equal width, equal frequency, clustering). Chi-square binning (ChiMerge) is used here; see the references for the other methods.

  1. Initialization

Sort the instances by attribute value; each distinct value starts in its own group.

  2. Merging

(1) Compute the chi-square statistic of every pair of adjacent groups.

(2) Merge the pair of adjacent groups with the smallest chi-square value:

\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(A_{ij}-E_{ij})^2}{E_{ij}}

where A_{ij} is the number of instances of class j in group i, and E_{ij} is the expected frequency of A_{ij}, i.e. E_{ij} = N_i \cdot C_j / N, with N_i the size of group i, C_j the total count of class j across the two groups, and N the total number of instances in the two groups.

(3) Repeat (1) and (2) until every chi-square value is at least the preset threshold, or the number of groups reaches a preset limit (e.g. a minimum of 5 and a maximum of 8 groups).

(The chiMerge function below is adapted, with modifications, from Reference [3].)

def chiMerge(df, col, target, threshold=None):
    ''' Chi-square binning (ChiMerge)
    df: pandas DataFrame
    col: name of the (numeric) column to bin
    target: name of the class-label column
    threshold: chi-square threshold; if None, it is derived from the 95%
               confidence level (degrees of freedom = number of classes - 1)
    return: array of bin edges (left-closed intervals)
    '''
    freq_tab = pd.crosstab(df[col], df[target])
    freq = freq_tab.values  # convert to a numpy array for the computations
    # 1. Initialization: the crosstab index is sorted by value and
    #    every distinct value starts as its own group.
    # Append a value larger than the maximum so the last interval covers all samples.
    cutoffs = np.append(freq_tab.index.values, max(freq_tab.index.values) + 1)
    if threshold is None:
        # No threshold given: use the 95% confidence level
        # (degrees of freedom = number of classes - 1).
        cls_num = freq.shape[-1]
        threshold = stats.chi2.isf(0.05, df=cls_num - 1)
    # 2. Merging
    while True:
        minvalue = np.inf
        minidx = np.inf
        # Compute the chi-square statistic of every pair of adjacent groups.
        for i in range(len(freq) - 1):
            # +1 smoothing avoids zero expected frequencies.
            v = stats.chi2_contingency(freq[i:i+2] + 1, correction=False)[0]
            # Track the smallest value.
            if minvalue > v:
                minvalue = v
                minidx = i
        # If the smallest chi-square value is below the threshold,
        # merge the corresponding adjacent groups and continue.
        if minvalue < threshold:
            freq[minidx] += freq[minidx+1]
            freq = np.delete(freq, minidx+1, 0)
            cutoffs = np.delete(cutoffs, minidx+1, 0)
        else:
            break

    return cutoffs
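
A quick usage sketch (assuming df is a DataFrame that holds the numeric feature together with the status label, as built below for the IV computation):

# Bin edges returned by chiMerge; pd.cut(..., right=False) assigns each value to a left-closed bin
bins = chiMerge(df, 'loans_score', 'status')
pd.cut(df['loans_score'], bins, right=False).value_counts().sort_index()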

Computing IV

def iv_value(df, col, target):
    ''' Compute the IV of a single feature
    df: pandas DataFrame
    col: name of the (numeric) column to evaluate
    target: name of the label column
    return: the feature's IV
    '''
    bins = chiMerge(df, col, target)  # bin edges from chi-square binning
    cats = pd.cut(df[col], bins, right=False)
    # +1 in numerator and denominator to avoid division by zero / log of zero
    temp = (pd.crosstab(cats, df[target]) + 1) / (df[target].value_counts() + 1)
    woe = np.log(temp.iloc[:, 1] / temp.iloc[:, 0])
    iv = sum((temp.iloc[:, 1] - temp.iloc[:, 0]) * woe)

    return iv

Compute the IV of every feature:

iv = []
data_iv = pd.concat([data_scaled, label], axis=1)

for col in data_scaled.columns:
    iv.append(iv_value(data_iv, col, 'status'))

Save and display the results:

iv = np.array(iv)
np.save('iv', iv)
iv = np.load('iv.npy')
iv
array([0.02968667, 0.06475457, 0.06981247, 0.27089581, 0.03955683,
       0.13346826, 0.00854632, 0.03929596, 0.04422897, 0.00559611,
       0.53421682, 0.        , 0.03166467, 0.38242452, 0.92400898,
       0.18871897, 0.11657733, 0.79563374, 0.        , 0.36688692,
       0.06479698, 0.08637859, 0.0315798 , 0.08726314, 0.02813494,
       0.07862981, 0.02872391, 0.00936212, 0.59139039, 0.25168984,
       0.25886425, 0.42645628, 0.32054195, 0.01342581, 0.00419829,
       0.23346355, 0.57449389, 0.        , 0.37383946, 0.14084117,
       0.50192192, 0.01717901, 0.        , 0.00990202, 0.02356634,
       0.02668144, 0.03360329, 0.02932465, 0.00517526, 0.66353628,
       0.        , 0.05768091, 0.03631875, 0.40640499, 0.01445641,
       0.00671275, 0.01300546, 0.00552671, 0.03980268, 0.03645762,
       0.0140021 , 0.65682529, 0.15289713, 0.37204304, 0.05508829,
       0.0192688 , 0.01318021, 0.01300546, 0.01037065, 0.01728017,
       0.25268217, 0.15254589, 0.00475146, 0.00671275, 0.01011964,
       0.03126195, 0.50228468, 0.11432889, 0.07337619, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.03444958,
       0.00903816, 0.01497038, 0.        ])
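
The array above is in column order rather than sorted; to actually list the features by descending IV, a small sketch:

iv_series = pd.Series(iv, index=data_scaled.columns)
iv_series.sort_values(ascending=False).head(10)   # ten strongest features by IV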

Random Forest

  • n_estimators : integer, optional (default=10)

n_estimators is the number of trees in the forest, i.e. the maximum number of weak learners.

Coarse tuning of n_estimators:

param = {'n_estimators': list(range(10, 1001, 50))}
g = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
                       param_grid=param, cv=5)
g.fit(data_scaled, label)
g.best_estimator_
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=810, n_jobs=1,
            oob_score=False, random_state=2018, verbose=0,
            warm_start=False)

Fine tuning of n_estimators:

param = {'n_estimators': list(range(770, 870, 10))}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
                       param_grid=param, cv=5)
forest_grid.fit(data, label)
rnd_clf = forest_grid.best_estimator_
rnd_clf
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=810, n_jobs=1,
            oob_score=False, random_state=2018, verbose=0,
            warm_start=False)

Combined Analysis

Combine the IV values with the random-forest feature importances into one table:

feature_df = pd.DataFrame(np.c_[rnd_clf.feature_importances_, iv.T], 
                          index=data.columns, columns=['随机森林', 'IV值'])
feature_df.head()
随机森林 IV值
low_volume_percent 0.007025 0.029687
middle_volume_percent 0.009346 0.064755
take_amount_in_later_12_month_highest 0.009766 0.069812
trans_amount_increase_rate_lately 0.014802 0.270896
trans_activity_month 0.010418 0.039557

Plot the two against each other, sorted by descending IV and by descending random-forest score respectively:

feature_df.sort_values(by='IV值', ascending=False)\
          .plot(figsize=(8, 6), subplots=True, use_index=False)

[Figure: random-forest importance and IV for each feature, sorted by descending IV]

feature_df.sort_values(by='随机森林', ascending=False)\
          .plot(figsize=(8, 6), subplots=True, use_index=False)

[Figure: random-forest importance and IV for each feature, sorted by descending random-forest importance]

As the figures show, despite some fluctuation, the feature-importance curves from the random forest and from IV follow essentially the same trend. Based on the IV/predictive-power table above, and because the random-forest scores drop off sharply within roughly the first ten features, the features that are both in the top 15 by random-forest score and have an IV of at least 0.3 are kept as the selected feature set.

rnf_sorted = feature_df.sort_values(by='随机森林', ascending=False).iloc[:15, 0]
iv_sorted = feature_df[feature_df['IV值'] >= 0.3]['IV值']
index = pd.DataFrame([rnf_sorted, iv_sorted]).dropna(axis=1).columns
index
Index(['abs', 'apply_score', 'consfin_avg_limit', 'historical_trans_amount',
       'history_fail_fee', 'latest_one_month_fail', 'loans_overdue_count',
       'loans_score', 'max_cumulative_consume_later_1_month',
       'repayment_capability', 'trans_amount_3_month',
       'trans_fail_top_count_enum_last_1_month'],
      dtype='object')

After filtering, 12 features remain; build the dataset restricted to the selected features:

data_del = data_scaled[index]
data_del.head()
abs apply_score consfin_avg_limit historical_trans_amount history_fail_fee latest_one_month_fail loans_overdue_count loans_score max_cumulative_consume_later_1_month repayment_capability trans_amount_3_month trans_fail_top_count_enum_last_1_month
0 -0.200665 0.124820 -1.201348 -0.255030 -0.427773 -0.337569 -0.098210 0.144596 -0.067183 0.020868 -0.049208 -0.346369
1 -0.090524 1.497024 0.238640 0.215237 -0.547614 -0.080162 -0.733973 1.509325 -0.073494 -0.034355 -0.274805 -0.868380
2 -0.312623 1.516627 -0.671941 -0.675385 -0.627508 -0.080162 -0.733973 1.476440 -0.262821 -0.171658 -0.321773 0.697653
3 1.359842 0.360055 0.736282 0.790524 0.331218 -0.337569 0.537552 -0.019829 0.471049 -0.237850 0.505738 -0.346369
4 -0.315531 -0.698503 0.042759 -0.522714 0.291271 -0.337569 1.173315 -1.055708 -0.172665 -0.144424 -0.282697 0.697653

Building the Models

Use sklearn to split the data into a training set and a test set at a 7:3 ratio, with random seed 2018:

X_train, X_test, y_train, y_test = train_test_split(data_del, label, test_size=0.3, random_state=2018)

Check the sizes of the training and test sets:

[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
[(3133, 12), (3133,), (1343, 12), (1343,)]

Model Tuning

Build and train the seven models used in this task: four ensemble models (XGBoost, LightGBM, GBDT, random forest) and three non-ensemble models (logistic regression, SVM, decision tree). Each model is tuned with grid search, using five-fold cross-validation.

Logistic Regression

Selected parameters:

  • C : float, default: 1.0

The smaller C is, the stronger the regularization; C must be a positive float.

  • class_weight : dict or ‘balanced’, default: None

Weights associated with each class; if not given, all classes have weight one. The “balanced” mode uses the values of y to automatically set weights inversely proportional to the class frequencies in the input data, i.e. n_samples / (n_classes * np.bincount(y)).

  • solver : str, {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default: ‘liblinear’

Algorithm used for optimization. For small datasets, ‘liblinear’ is a good choice, while ‘sag’ and ‘saga’ train faster on large ones. For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle the multinomial loss; ‘liblinear’ is limited to one-versus-rest training.

‘newton-cg’, ‘lbfgs’ and ‘sag’ only support the l2 penalty, while ‘liblinear’ and ‘saga’ can also use the l1 penalty.

Note that fast convergence of ‘sag’ and ‘saga’ is only guaranteed on features with roughly the same scale; the data can be preprocessed with a scaler from sklearn.preprocessing.

  • max_iter : int, default: 100

Maximum number of iterations.

Check the best parameters and score:

param_grid = {
    'C': np.arange(0.01, 0.1, 0.01),
    'solver': ['liblinear', 'lbfgs'],
    'class_weight': ['balanced', None]
}
log_grid = GridSearchCV(LogisticRegression(random_state=2018, max_iter=1000), 
                        param_grid, cv=5)
log_grid.fit(X_train, y_train)
log_grid.best_estimator_, log_grid.best_score_
(LogisticRegression(C=0.02, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=1000, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=2018, solver='lbfgs', tol=0.0001,
           verbose=0, warm_start=False), 0.7874241940631982)

SVM

  • C : float, optional (default=1.0)

Penalty coefficient C of the objective function, balancing the classification margin against misclassified samples.

  • kernel : string, optional (default=’rbf’)

Kernel choices: RBF, Linear, Poly, Sigmoid.

  • gamma : float, optional (default=’auto’)

Kernel coefficient (for ‘Poly’, ‘RBF’ and ‘Sigmoid’); by default gamma = 1 / n_features.

param_grid = {
    'C': np.arange(0.1, 5.2, 0.5),
    'gamma': ['auto', 0.01, 0.5],
}

svc_grid = GridSearchCV(SVC(random_state=2018, probability=True), param_grid, cv=5)
svc_grid.fit(X_train, y_train)
svc_grid.best_estimator_, svc_grid.best_score_
(SVC(C=2.1, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=True, random_state=2018, shrinking=True,
   tol=0.001, verbose=False), 0.7871050111714012)

Decision Tree

  • max_depth : int or None, optional (default=None)

Maximum depth of the tree. If None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.

  • max_features : int, float, string or None, optional (default=None)

Number of features considered when looking for the best split; when set to ’auto’, max_features = sqrt(n_features).

  • class_weight : dict, list of dicts, “balanced” or None, default=None

Weights associated with each class; if not given, all classes have weight one. The “balanced” mode uses the values of y to automatically set weights inversely proportional to the class frequencies in the input data, i.e. n_samples / (n_classes * np.bincount(y)).

param_grid = {
    'max_depth': range(2, 8, 1),
    'min_samples_split': range(2, 11, 1)
}
tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=2018), param_grid, cv=5)
tree_grid.fit(X_train, y_train)
tree_grid.best_estimator_, tree_grid.best_score_
(DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, presort=False, random_state=2018,
             splitter='best'), 0.7749760612831152)
param_grid = {
    'min_samples_leaf': range(26, 35, 2),
    'max_features': range(2, 10, 1)
}
tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=2018, max_depth=4,
                                                min_samples_split=2), 
                         param_grid, cv=5)
tree_grid.fit(X_train, y_train)
tree_grid.best_estimator_, tree_grid.best_score_
(DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
             max_features=7, max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=30,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             presort=False, random_state=2018, splitter='best'),
 0.7794446217682732)

Random Forest

param = {'n_estimators': list(range(10, 1001, 50))}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
                           param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
             oob_score=False, random_state=2018, verbose=0,
             warm_start=False), 0.7883817427385892)
forest_grid = forest_grid.best_estimator_

param = {
    'n_estimators': list(range(forest_grid.n_estimators - 40, 
                                    forest_grid.n_estimators + 50, 10))
}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
                           param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
             oob_score=False, random_state=2018, verbose=0,
             warm_start=False), 0.7883817427385892)
forest_grid = forest_grid.best_estimator_

param = {
    'max_depth': range(3, 15, 2),
    'min_samples_split': range(2, 53, 10)
}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018, 
                                                              n_estimators=forest_grid.n_estimators),
                           param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=9, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=30,
             min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
             oob_score=False, random_state=2018, verbose=0,
             warm_start=False), 0.7918927545483562)
param = {'min_samples_leaf': range(1, 10, 2)}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018, max_features='auto',
                                                              max_depth=9,
                                                              n_estimators=860, 
                                                              min_samples_split=30),
                           param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=9, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=30,
             min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
             oob_score=False, random_state=2018, verbose=0,
             warm_start=False), 0.7918927545483562)

GBDT

  • n_estimators : integer, optional (default=100)

The number of boosting stages to perform. Gradient boosting is fairly robust to overfitting, so a larger number usually gives better results.

  • learning_rate : float, optional (default=0.1)

learning_rate shrinks the contribution of each tree; there is a trade-off between learning_rate and n_estimators.

param_grid = {
    'n_estimators': range(80, 150, 10),
    'learning_rate': [0.02, 0.01, 0.04],
}
gbdt = GridSearchCV(GradientBoostingClassifier(random_state=2018), param_grid, cv=5)
gbdt.fit(X_train, y_train)
gbdt.best_estimator_, gbdt.best_score_
(GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.02, loss='deviance', max_depth=3,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=110,
               presort='auto', random_state=2018, subsample=1.0, verbose=0,
               warm_start=False), 0.7864666453878072)
gbdt = gbdt.best_estimator_

param_grid = {
    'max_depth': range(3, 12, 2),
    'min_samples_split': range(20, 41, 5)
}
gbdt = GridSearchCV(GradientBoostingClassifier(random_state=2018, 
                                               n_estimators=gbdt.n_estimators,
                                               learning_rate=gbdt.learning_rate), 
                    param_grid, cv=5)
gbdt.fit(X_train, y_train)
gbdt.best_estimator_, gbdt.best_score_
(GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.02, loss='deviance', max_depth=5,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=35,
               min_weight_fraction_leaf=0.0, n_estimators=110,
               presort='auto', random_state=2018, subsample=1.0, verbose=0,
               warm_start=False), 0.7874241940631982)
gbdt = gbdt.best_estimator_

param_grid = {
    'min_samples_leaf': range(1, 10, 2)
}
gbdt = GridSearchCV(GradientBoostingClassifier(random_state=2018, 
                                               n_estimators=gbdt.n_estimators,
                                               learning_rate=gbdt.learning_rate, 
                                               max_depth=gbdt.max_depth,
                                               min_samples_split=gbdt.min_samples_split), 
                    param_grid, cv=5)
gbdt.fit(X_train, y_train)
gbdt.best_estimator_, gbdt.best_score_
(GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.02, loss='deviance', max_depth=5,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=35,
               min_weight_fraction_leaf=0.0, n_estimators=110,
               presort='auto', random_state=2018, subsample=1.0, verbose=0,
               warm_start=False), 0.7874241940631982)

XGBoost

param_grid = {
    'n_estimators': range(70, 150, 10),
    'learning_rate': [0.02, 0.1, 0.2],
}
xgb = GridSearchCV(XGBClassifier(random_state=2018), param_grid, cv=5)
xgb.fit(X_train, y_train)
xgb.best_estimator_, xgb.best_score_
(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=1, gamma=0, learning_rate=0.2, max_delta_step=0,
        max_depth=3, min_child_weight=1, missing=None, n_estimators=90,
        n_jobs=1, nthread=None, objective='binary:logistic',
        random_state=2018, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
        seed=None, silent=True, subsample=1), 0.7867858282796042)
xgb = xgb.best_estimator_

param_grid = {
    'max_depth': range(1, 4, 1),
    # Note: min_samples_split is not an XGBoost parameter (leaf size is controlled by
    # min_child_weight); it is silently passed through and has no effect on the model.
    'min_samples_split': range(1, 22, 5)
}
xgb = GridSearchCV(XGBClassifier(random_state=2018, 
                                               n_estimators=xgb.n_estimators,
                                               learning_rate=xgb.learning_rate), 
                    param_grid, cv=5)
xgb.fit(X_train, y_train)
xgb.best_estimator_, xgb.best_score_
(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=1, gamma=0, learning_rate=0.2, max_delta_step=0,
        max_depth=3, min_child_weight=1, min_samples_split=1, missing=None,
        n_estimators=90, n_jobs=1, nthread=None,
        objective='binary:logistic', random_state=2018, reg_alpha=0,
        reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
        subsample=1), 0.7867858282796042)

LightGBM

param_grid = {
    'n_estimators': range(70, 150, 10),
    'learning_rate': [0.02, 0.1, 0.2],
}
lgbm = GridSearchCV(LGBMClassifier(random_state=2018), param_grid, cv=5)
lgbm.fit(X_train, y_train)
lgbm.best_estimator_, lgbm.best_score_
(LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
         importance_type='split', learning_rate=0.02, max_depth=-1,
         min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
         n_estimators=90, n_jobs=-1, num_leaves=31, objective=None,
         random_state=2018, reg_alpha=0.0, reg_lambda=0.0, silent=True,
         subsample=1.0, subsample_for_bin=200000, subsample_freq=0),
 0.7826364506862432)
lgbm = lgbm.best_estimator_

param_grid = {
    'max_depth': range(1, 10, 2),
    # Note: min_samples_split is not a LightGBM parameter (LightGBM uses min_child_samples);
    # it is passed through as an extra keyword and has no effect on the model.
    'min_samples_split': range(10, 31, 5)
}
lgbm = GridSearchCV(LGBMClassifier(random_state=2018, 
                                               n_estimators=lgbm.n_estimators,
                                               learning_rate=lgbm.learning_rate), 
                    param_grid, cv=5)
lgbm.fit(X_train, y_train)
lgbm.best_estimator_, lgbm.best_score_
(LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
         importance_type='split', learning_rate=0.02, max_depth=5,
         min_child_samples=20, min_child_weight=0.001, min_samples_split=10,
         min_split_gain=0.0, n_estimators=90, n_jobs=-1, num_leaves=31,
         objective=None, random_state=2018, reg_alpha=0.0, reg_lambda=0.0,
         silent=True, subsample=1.0, subsample_for_bin=200000,
         subsample_freq=0), 0.7832748164698372)

Model Evaluation

Evaluate the seven models:

models = {'随机森林': forest_grid.best_estimator_,
          'GBDT': gbdt.best_estimator_,
          'XGBoost': xgb.best_estimator_,
          'LightGBM': lgbm.best_estimator_,
          '逻辑回归': log_grid.best_estimator_,
          'SVM': svc_grid.best_estimator_,
          '决策树': tree_grid.best_estimator_}

assessments = {
    'Accuracy': [],
    'Precision': [],
    'Recall': [],
    'F1-score': [],
    'AUC': []
} 
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.tight_layout()
for name, model in models.items():
    test_pre = model.predict(X_test)
    train_pre = model.predict(X_train)
    test_proba = model.predict_proba(X_test)[:,1]
    train_proba = model.predict_proba(X_train)[:,1]

    # sklearn metrics expect (y_true, y_pred)
    acc_test = accuracy_score(y_test, test_pre) * 100
    acc_train = accuracy_score(y_train, train_pre) * 100
    accuracy = '训练集:%.2f%%;测试集:%.2f%%' % (acc_train, acc_test)
    assessments['Accuracy'].append(accuracy)

    pre_test = precision_score(y_test, test_pre) * 100
    pre_train = precision_score(y_train, train_pre) * 100
    precision = '训练集:%.2f%%;测试集:%.2f%%' % (pre_train, pre_test)
    assessments['Precision'].append(precision)

    rec_test = recall_score(y_test, test_pre) * 100
    rec_train = recall_score(y_train, train_pre) * 100
    recall = '训练集:%.2f%%;测试集:%.2f%%' % (rec_train, rec_test)
    assessments['Recall'].append(recall)

    f1_test = f1_score(y_test, test_pre) * 100
    f1_train = f1_score(y_train, train_pre) * 100
    f1 = '训练集:%.2f%%;测试集:%.2f%%' % (f1_train, f1_test)
    assessments['F1-score'].append(f1)

    # ROC curves on the test and training sets
    fig = plt.figure(figsize=(8, 6))
    fpr, tpr, thresholds = roc_curve(y_test, test_proba)
    plot_roc_curve(fpr, tpr, label='测试集')
    fpr, tpr, thresholds = roc_curve(y_train, train_proba)
    plot_roc_curve(fpr, tpr, label='训练集')
    plt.title(name)

    auc_test = roc_auc_score(y_test, test_proba) * 100
    auc_train = roc_auc_score(y_train, train_proba) * 100
    auc = '训练集:%.2f%%;测试集:%.2f%%' % (auc_train, auc_test)
    assessments['AUC'].append(auc)
fig = plt.figure(figsize=(8, 6))
for name, model in models.items():
    proba = model.predict_proba(X_test)[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, proba)
    plot_roc_curve(fpr, tpr, label=name)
fig = plt.figure(figsize=(8, 6))
for name, model in models.items():
    proba = model.predict_proba(X_train)[:,1]
    fpr, tpr, thresholds = roc_curve(y_train, proba)
    plot_roc_curve(fpr, tpr, label=name)
ass_df = pd.DataFrame(assessments, index=models.keys())
ass_df
AUC Accuracy F1-score Recall Precision
随机森林 训练集:90.82%;测试集:79.88% 训练集:84.33%;测试集:79.60% 训练集:58.07%;测试集:46.69% 训练集:43.59%;测试集:34.68% 训练集:86.96%;测试集:71.43%
GBDT 训练集:87.87%;测试集:79.15% 训练集:84.26%;测试集:78.78% 训练集:57.17%;测试集:44.01% 训练集:42.18%;测试集:32.37% 训练集:88.68%;测试集:68.71%
XGBoost 训练集:90.41%;测试集:79.28% 训练集:85.06%;测试集:79.23% 训练集:63.03%;测试集:49.18% 训练集:51.15%;测试集:39.02% 训练集:82.10%;测试集:66.50%
LightGBM 训练集:86.70%;测试集:79.53% 训练集:82.41%;测试集:78.93% 训练集:49.77%;测试集:41.41% 训练集:35.00%;测试集:28.90% 训练集:86.12%;测试集:72.99%
逻辑回归 训练集:76.33%;测试集:78.34% 训练集:78.77%;测试集:78.70% 训练集:37.68%;测试集:38.89% 训练集:25.77%;测试集:26.30% 训练集:70.03%;测试集:74.59%
SVM 训练集:80.23%;测试集:74.26% 训练集:80.82%;测试集:77.96% 训练集:43.14%;测试集:34.80% 训练集:29.23%;测试集:22.83% 训练集:82.31%;测试集:73.15%
决策树 训练集:76.63%;测试集:74.19% 训练集:79.29%;测试集:77.14% 训练集:46.41%;测试集:43.46% 训练集:36.03%;测试集:34.10% 训练集:65.20%;测试集:59.90%
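
confusion_matrix was imported at the top but not used above; as a supplementary check, a short sketch for a single model (XGBoost here, chosen arbitrarily):

cm = confusion_matrix(y_test, models['XGBoost'].predict(X_test))
pd.DataFrame(cm, index=['true 0', 'true 1'], columns=['pred 0', 'pred 1'])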

ROC curves:

[Figures: per-model ROC curves on the training and test sets — ensemble models (random forest, GBDT, XGBoost, LightGBM) on the left, non-ensemble models (logistic regression, SVM, decision tree) on the right]

Combined comparison of the ROC curves:

[Figures: ROC curves of all seven models overlaid — training set and test set]

Summary

Compared with the earlier results (see [Data Analysis Practice] Task 1.3, model tuning), every model improves slightly after feature processing and feature selection, and overfitting is also reduced.

Due to time constraints the tuning was not pushed further; further gains would require more work on both hyperparameter tuning and feature engineering.

References

Task description: feature selection — select features using IV and random-forest importance, then evaluate with the 7 models from the earlier practice task (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost and LightGBM).

[1] https://blog.csdn.net/sscc_learning/article/details/78591210, 【评分卡】评分卡入门与创建原则——分箱、WOE、IV、分值分配

[2] https://blog.csdn.net/pylady/article/details/78882220, 特征工程之分箱

[3] https://mp.weixin.qq.com/s?__biz=MzIxNzc1NDgzMw==&mid=2247484031&idx=1&sn=dc6f97982ac958653ba8af8cf75ec0d0&chksm=97f5bfc1a08236d75b13b4e456334e07d4bbff209c9449adf8ce1aae45a52fcb04954584c2ce&mpshare=1&scene=23&srcid=0127eIvjcmFdJMnR2fdaJnFX#rd, python 评分卡建模—实现 WOE 编码及 IV 值计算

[4] https://blog.csdn.net/RuDing/article/details/78332192, Gradient Boosting(GBM) 调参指南

Reposted from blog.csdn.net/bear507/article/details/86696246