My modeling work needed custom evaluation criteria, which means passing a custom scoring method through the `scoring` parameter. I searched online for a long time without finding a good tutorial, so I am writing down the approach I recently implemented. This is my first post.

First, the custom score functions. The line-integral version is not finished yet, so for now the area is computed with a loop:
```python
import pandas as pd


def AR(y_true, y_predict, direction=-1):
    """Compute the model's AR (accuracy ratio); returns a float."""
    datain = pd.DataFrame({'flag': y_true, 'scores': y_predict})
    # direction=-1: higher score means riskier, so sort descending;
    # direction=1: lower score means riskier, so sort ascending.
    if direction == 1:
        df = datain.sort_values(by='scores').loc[:, ['scores', 'flag']]
    elif direction == -1:
        df = datain.sort_values(by='scores', ascending=False).loc[:, ['scores', 'flag']]
    else:
        raise ValueError('direction must be 1 or -1')
    df = df.dropna()
    tot = y_true.shape[0]   # total number of samples
    bad = y_true.sum()      # number of bad samples (flag == 1)
    area = 0.0              # area under the CAP curve, by the trapezoidal rule
    cumpb = 0.0             # cumulative proportion of bads captured so far
    for index, row in df.iterrows():
        area = area + 0.5 * (1 / tot) * (cumpb * 2 + row['flag'] / bad)
        cumpb = cumpb + row['flag'] / bad
    ar = (area - 0.5) / (0.5 * (1 - bad / tot))
    return ar


def KS(y_true, y_predict, direction=-1):
    """Compute the model's KS statistic; returns a float."""
    datain = pd.DataFrame({'flag': y_true, 'scores': y_predict})
    if direction == 1:
        df = datain.sort_values(by='scores').loc[:, ['scores', 'flag']]
    elif direction == -1:
        df = datain.sort_values(by='scores', ascending=False).loc[:, ['scores', 'flag']]
    else:
        raise ValueError('direction must be 1 or -1')
    df = df.dropna()
    tot = y_true.shape[0]
    bad = y_true.sum()
    cumpb = 0.0             # cumulative proportion of bads
    cumpg = 0.0             # cumulative proportion of goods
    ks = 0.0
    for index, row in df.iterrows():
        cumpb = cumpb + row['flag'] / bad
        cumpg = cumpg + (1 - row['flag']) / (tot - bad)
        ks = max(ks, abs(cumpb - cumpg))
    return ks
```
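As a sanity check (my own addition, not part of the original post): for a binary flag with higher score meaning riskier, AR equals the Gini coefficient `2 * AUC - 1`, and KS equals `max(TPR - FPR)` along the ROC curve, so both metrics can be cross-checked against sklearn's built-in ROC utilities on hypothetical toy data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical toy data: flag 1 = bad, higher score = riskier.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 1])
scores = np.array([0.9, 0.3, 0.8, 0.4, 0.6, 0.5, 0.2, 0.7])

# AR (accuracy ratio / Gini) via the identity AR = 2 * AUC - 1.
ar = 2 * roc_auc_score(y_true, scores) - 1

# KS via the ROC curve: KS = max(TPR - FPR).
fpr, tpr, _ = roc_curve(y_true, scores)
ks = float(np.max(tpr - fpr))

print(ar, ks)  # 0.875 0.75
```

The loop-based `AR` and `KS` above should return the same values on this data (with `direction=-1`).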
From the sklearn documentation on custom scoring:
3.3.1.3. Implementing your own scoring object
You can generate even more flexible model scorers by constructing your own scoring object from scratch, without using the `make_scorer` factory. For a callable to be a scorer, it needs to meet the protocol specified by the following two rules:

- It can be called with parameters `(estimator, X, y)`, where `estimator` is the model that should be evaluated, `X` is validation data, and `y` is the ground truth target for `X` (in the supervised case) or `None` (in the unsupervised case).
- It returns a floating point number that quantifies the `estimator` prediction quality on `X`, with reference to `y`. Again, by convention higher numbers are better, so if your scorer returns loss, that value should be negated.
The AR function is defined following the pattern of the 'roc_auc' metric: the first argument is the ground truth, the second is the predicted value, the return value is a float, and any custom parameters come afterwards:

```python
def AR(y_true, y_predict, direction=-1):
```
To use them, wrap AR and KS with `make_scorer`, put the resulting scorers in a dict, and pass it as the `scoring` parameter of GridSearchCV; the `refit` parameter chooses which metric drives model selection:
```python
from sklearn.metrics import make_scorer

AR_score = make_scorer(AR, greater_is_better=True)
KS_score = make_scorer(KS, greater_is_better=True)
scoring = {'AR': AR_score, 'KS': KS_score}
gsearch = GridSearchCV(estimator=bestclf, param_grid=param_test,
                       scoring=scoring, n_jobs=1, cv=ps, refit='AR')
```
For reference, see the sklearn example "Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV".
The results can be inspected in `gsearch.cv_results_`.

The best AR value is available as `gsearch.best_score_` (since `refit='AR'` was specified).

The best KS value can be obtained as follows:
```python
best_index_ks = np.nonzero(gsearch.cv_results_['rank_test_KS'] == 1)[0][0]
best_score_ks = gsearch.cv_results_['mean_test_KS'][best_index_ks]
```

`cv_results_` contains a rank for each metric across the parameter settings; rank 1 marks the best, so locating `rank_test_KS == 1` gives the index of the best KS value.
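Putting the pieces together, here is a minimal end-to-end sketch. The classifier, parameter grid, and data are hypothetical stand-ins for the post's `bestclf`, `param_test`, and `ps`, and the AR/KS stand-ins use ROC identities rather than the loop versions above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV

def AR(y_true, y_predict):
    # Gini identity: AR = 2 * AUC - 1
    return 2 * roc_auc_score(y_true, y_predict) - 1

def KS(y_true, y_predict):
    # KS = max(TPR - FPR) along the ROC curve
    fpr, tpr, _ = roc_curve(y_true, y_predict)
    return float(np.max(tpr - fpr))

# Hypothetical toy data standing in for the real modeling dataset.
rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

scoring = {'AR': make_scorer(AR, greater_is_better=True),
           'KS': make_scorer(KS, greater_is_better=True)}

gsearch = GridSearchCV(RandomForestClassifier(random_state=0),
                       param_grid={'n_estimators': [10, 20]},
                       scoring=scoring, cv=3, refit='AR')
gsearch.fit(X, y)

print(gsearch.best_score_)  # best mean AR, because refit='AR'
best_index_ks = np.nonzero(gsearch.cv_results_['rank_test_KS'] == 1)[0][0]
best_score_ks = gsearch.cv_results_['mean_test_KS'][best_index_ks]
print(best_score_ks)
```

Note that with a plain `make_scorer(AR, greater_is_better=True)`, the metric receives the classifier's hard `predict()` output; to score on continuous probabilities or decision scores instead, `make_scorer` must be told so (e.g. via its probability/threshold option for your sklearn version).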