My modeling work needed custom evaluation criteria, which means passing a custom scoring method through the `scoring` parameter. I searched online for a long time without finding a good tutorial, so I am writing down the approach I recently implemented. This is my first post.

First, the custom score functions. The line-integral version is not finished yet, so for now the area is computed with a loop:
```python
import pandas as pd


def AR(y_true, y_predict, direction=-1):
    """Compute the model's AR (accuracy ratio); returns a float."""
    datain = pd.DataFrame({'flag': y_true, 'scores': y_predict})
    # direction=-1: higher score means riskier, so sort descending;
    # direction=1: lower score means riskier, so sort ascending.
    if direction == 1:
        df = datain.sort_values(by='scores').loc[:, ['scores', 'flag']]
    elif direction == -1:
        df = datain.sort_values(by='scores', ascending=False).loc[:, ['scores', 'flag']]
    else:
        raise ValueError('direction must be 1 or -1')
    df = df.dropna()
    tot = y_true.shape[0]   # total number of samples
    bad = y_true.sum()      # number of bad samples (flag == 1)
    area = 0.0              # area under the CAP curve, by the trapezoidal rule
    cumpb = 0.0             # cumulative proportion of bads captured so far
    for index, row in df.iterrows():
        area = area + 0.5 * (1 / tot) * (cumpb * 2 + row['flag'] / bad)
        cumpb = cumpb + row['flag'] / bad
    ar = (area - 0.5) / (0.5 * (1 - bad / tot))
    return ar


def KS(y_true, y_predict, direction=-1):
    """Compute the model's KS statistic; returns a float."""
    datain = pd.DataFrame({'flag': y_true, 'scores': y_predict})
    if direction == 1:
        df = datain.sort_values(by='scores').loc[:, ['scores', 'flag']]
    elif direction == -1:
        df = datain.sort_values(by='scores', ascending=False).loc[:, ['scores', 'flag']]
    else:
        raise ValueError('direction must be 1 or -1')
    df = df.dropna()
    tot = y_true.shape[0]
    bad = y_true.sum()
    cumpb = 0.0             # cumulative proportion of bads
    cumpg = 0.0             # cumulative proportion of goods
    ks = 0.0
    for index, row in df.iterrows():
        cumpb = cumpb + row['flag'] / bad
        cumpg = cumpg + (1 - row['flag']) / (tot - bad)
        ks = max(ks, abs(cumpb - cumpg))
    return ks
```
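As a sanity check (my own addition, not part of the original post): for a binary flag with higher score meaning riskier, AR equals the Gini coefficient `2 * AUC - 1`, and KS equals `max(TPR - FPR)` along the ROC curve, so both metrics can be cross-checked against sklearn's built-in ROC utilities on hypothetical toy data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical toy data: flag 1 = bad, higher score = riskier.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 1])
scores = np.array([0.9, 0.3, 0.8, 0.4, 0.6, 0.5, 0.2, 0.7])

# AR (accuracy ratio / Gini) via the identity AR = 2 * AUC - 1.
ar = 2 * roc_auc_score(y_true, scores) - 1

# KS via the ROC curve: KS = max(TPR - FPR).
fpr, tpr, _ = roc_curve(y_true, scores)
ks = float(np.max(tpr - fpr))

print(ar, ks)  # 0.875 0.75
```

The loop-based `AR` and `KS` above should return the same values on this data (with `direction=-1`).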
From the sklearn documentation on custom scoring:
3.3.1.3. Implementing your own scoring object
You can generate even more flexible model scorers by constructing your own scoring object from scratch, without using the `make_scorer` factory. For a callable to be a scorer, it needs to meet the protocol specified by the following two rules:

- It can be called with parameters `(estimator, X, y)`, where `estimator` is the model that should be evaluated, `X` is validation data, and `y` is the ground truth target for `X` (in the supervised case) or `None` (in the unsupervised case).
- It returns a floating point number that quantifies the `estimator` prediction quality on `X`, with reference to `y`. Again, by convention higher numbers are better, so if your scorer returns loss, that value should be negated.
The AR function is defined following the pattern of the 'roc_auc' metric: the first argument is the ground truth, the second is the predicted value, the return value is a float, and any custom parameters come afterwards:

```python
def AR(y_true, y_predict, direction=-1):
```
To use them, wrap AR and KS with `make_scorer`, put the resulting scorers in a dict, and pass it as the `scoring` parameter of GridSearchCV; the `refit` parameter chooses which metric drives model selection:
```python
from sklearn.metrics import make_scorer

AR_score = make_scorer(AR, greater_is_better=True)
KS_score = make_scorer(KS, greater_is_better=True)
scoring = {'AR': AR_score, 'KS': KS_score}
gsearch = GridSearchCV(estimator=bestclf, param_grid=param_test,
                       scoring=scoring, n_jobs=1, cv=ps, refit='AR')
```
For reference, see the sklearn example "Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV".
The results can be inspected in `gsearch.cv_results_`.

The best AR value is available as `gsearch.best_score_` (since `refit='AR'` was specified).

The best KS value can be obtained as follows:
```python
best_index_ks = np.nonzero(gsearch.cv_results_['rank_test_KS'] == 1)[0][0]
best_score_ks = gsearch.cv_results_['mean_test_KS'][best_index_ks]
```

`cv_results_` contains a rank for each metric across the parameter settings; rank 1 marks the best, so locating `rank_test_KS == 1` gives the index of the best KS value.
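Putting the pieces together, here is a minimal end-to-end sketch. The classifier, parameter grid, and data are hypothetical stand-ins for the post's `bestclf`, `param_test`, and `ps`, and the AR/KS stand-ins use ROC identities rather than the loop versions above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV

def AR(y_true, y_predict):
    # Gini identity: AR = 2 * AUC - 1
    return 2 * roc_auc_score(y_true, y_predict) - 1

def KS(y_true, y_predict):
    # KS = max(TPR - FPR) along the ROC curve
    fpr, tpr, _ = roc_curve(y_true, y_predict)
    return float(np.max(tpr - fpr))

# Hypothetical toy data standing in for the real modeling dataset.
rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

scoring = {'AR': make_scorer(AR, greater_is_better=True),
           'KS': make_scorer(KS, greater_is_better=True)}

gsearch = GridSearchCV(RandomForestClassifier(random_state=0),
                       param_grid={'n_estimators': [10, 20]},
                       scoring=scoring, cv=3, refit='AR')
gsearch.fit(X, y)

print(gsearch.best_score_)  # best mean AR, because refit='AR'
best_index_ks = np.nonzero(gsearch.cv_results_['rank_test_KS'] == 1)[0][0]
best_score_ks = gsearch.cv_results_['mean_test_KS'][best_index_ks]
print(best_score_ks)
```

Note that with a plain `make_scorer(AR, greater_is_better=True)`, the metric receives the classifier's hard `predict()` output; to score on continuous probabilities or decision scores instead, `make_scorer` must be told so (e.g. via its probability/threshold option for your sklearn version).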