0. Competition Overview
Competition window: April 29, 9:00 - May 12, 17:00.
Schedule: from April 29, 9:00 to May 9, 24:00, the A-board data (test_A榜) is open, and prediction results may be submitted up to 3 times per day; from May 10, 00:00 to May 12, 17:00, the B-board data (test_B榜) is open, likewise with up to 3 submissions per day. Duplicate or incorrectly formatted submissions still consume attempts, so submit carefully; after submitting, be sure to click the "Run" button to see your current personal ranking.
The leaderboard ranks by the "final score", computed as: best A-board score * 0.3 + best B-board score * 0.7. The higher the final score, the higher the rank.
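The final-score rule is a simple weighted average; a quick sketch (the helper name final_score is mine, not part of the competition platform):

```python
def final_score(best_a: float, best_b: float) -> float:
    # Leaderboard rule: best A-board score * 0.3 + best B-board score * 0.7
    return best_a * 0.3 + best_b * 0.7

# The B-board carries most of the weight, so late-stage performance matters more
print(round(final_score(0.95, 0.96), 3))  # -> 0.957
```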
This competition is a Kaggle-style data science challenge: given a bank's anonymized financial data, predict whether each customer will make a deposit (binary classification), with AUC (Area Under Curve) as the evaluation metric. This post shares the data analysis and some common machine learning tricks.
1. Data Preprocessing
The training data consists of 49 features plus a LABEL column (0/1). The features mix numeric (int or float) and string (str) types, and also contain "?" markers and a large number of missing values. Since machine learning models only accept numeric input, the data needs preprocessing.
1.1 Converting missing markers to np.nan
import pandas as pd, numpy as np

train = pd.read_excel('fintech训练营/train.xlsx')
test = pd.read_excel('fintech训练营/test_A榜.xlsx')
datasets = [train, test]
for dataset in datasets:
    for i in dataset.columns:
        # Replace "?" with np.nan
        dataset[i] = dataset[i].apply(lambda x: np.nan if x == '?' else x)
1.2 Encoding categorical features
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
for dataset in datasets:
    for i in dataset.columns:
        if dataset[i].dtype == 'object':
            if i != 'CUST_UID':
                dataset[i] = label.fit_transform(dataset[i])  # e.g. maps "A", "B" to 0, 1
2. Data Exploration
2.1 Checking for class imbalance
train['LABEL'].value_counts()
LABEL
0 30000
1 10000
Name: LABEL, dtype: int64
The counts show a 0:1 ratio of 3:1. A dataset is usually considered imbalanced only when the ratio exceeds roughly 10:1, so no data augmentation is needed here.
2.2 Outliers
import matplotlib.pyplot as plt
import seaborn as sns

cols = 4
rows = 13
plt.figure(figsize = (4*cols, 3*rows))
i = 1
for col in train.columns[2:]:
    if train[col].dtype != 'object':
        ax = plt.subplot(rows, cols, i)
        ax = sns.boxplot(train[col], orient = 'v', width = 0.5)
        ax.set_xlabel(col)
        ax.set_ylabel('frequency')
        ax = ax.legend(['train'])
        i += 1
plt.tight_layout()
plt.show()
The box plots show each feature's outlier distribution, which helps further analysis.
2.3 Kernel density plots
cols = 4
rows = 13
plt.figure(figsize = (4*cols, 3*rows))
i = 1
for col in train.columns[2:]:
    ax = plt.subplot(rows, cols, i)
    ax = sns.kdeplot(train[col].dropna(), color = 'red', shade = True)
    ax = sns.kdeplot(test[col].dropna(), color = 'blue', shade = True)
    ax.set_xlabel(col)
    ax.set_ylabel('frequency')
    ax = ax.legend(['train', 'test'])
    i += 1
plt.tight_layout()
plt.show()
The kernel density plots show each feature's distribution (normal, bimodal, etc.), which helps further analysis.
2.4 Adversarial validation
If the training and test sets follow different distributions, the model may overfit and generalize poorly. Adversarial validation assigns label=1 to training rows and label=0 to test rows, trains a machine learning model to predict that label from all features, and checks the cross-validated AUC (Area Under Curve):
(1) If the AUC is high, some features are distributed differently between train and test (such features can reliably tell the two sets apart, i.e. dataset shift, which hurts prediction quality); drop the features ranked highest by importance and predict again.
(2) Repeat step (1) until the AUC falls to roughly 0.5-0.6.
from sklearn.model_selection import train_test_split

train_new = train.copy()
test_new = test.copy()
train_new = train_new.drop(['CUST_UID', 'LABEL'], axis = 1)  # keep only the features
test_new = test_new.drop(['CUST_UID'], axis = 1)  # keep only the features
train_new['label'] = 1  # training rows get label 1
test_new['label'] = 0   # test rows get label 0
data = pd.concat([train_new, test_new], axis = 0)
test_size_pct = 0.2  # train/validation split
X_train, X_valid, y_train, y_valid = train_test_split(data.drop(['label'], axis = 1), data['label'], test_size = test_size_pct, random_state = 42)
Next we train LightGBM (a gradient-boosting model, hereafter LGBM; install with pip install lightgbm):
from lightgbm import LGBMClassifier
from lightgbm import log_evaluation, early_stopping
lgb = LGBMClassifier(verbosity = -1)
lgb.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['auc'],
        callbacks = [log_evaluation(period = 50), early_stopping(stopping_rounds = 128)])
Training until validation scores don't improve for 128 rounds
[50] valid_0's auc: 0.500989 valid_0's binary_logloss: 0.534569
[100] valid_0's auc: 0.504464 valid_0's binary_logloss: 0.536658
Did not meet early stopping. Best iteration is:
[6] valid_0's auc: 0.510417 valid_0's binary_logloss: 0.532726
Next we evaluate the LGBM predictions:
from sklearn.metrics import roc_auc_score
pred_lgb = lgb.predict_proba(X_valid)[:,1]
roc_auc_score(y_valid, pred_lgb)
0.510416950256567
An AUC of about 0.51 means the train and test feature distributions are close, so no features need to be removed.
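Had the AUC come out high, steps (1)-(2) above would loop: drop the most important feature and re-check. A minimal sketch of that loop, using RandomForestClassifier on synthetic data (the helper adversarial_drop and the toy 'shift' feature are illustrative, not from the competition):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_drop(data: pd.DataFrame, label_col: str = 'label',
                     target_auc: float = 0.6, max_iter: int = 5) -> list:
    """Repeatedly drop the most 'leaky' feature until train/test rows become
    indistinguishable (CV AUC down to roughly 0.5-0.6)."""
    dropped = []
    for _ in range(max_iter):
        feats = [c for c in data.columns if c != label_col]
        clf = RandomForestClassifier(n_estimators = 50, random_state = 42)
        auc = cross_val_score(clf, data[feats], data[label_col],
                              cv = 3, scoring = 'roc_auc').mean()
        if auc <= target_auc:
            break  # train and test are no longer separable
        clf.fit(data[feats], data[label_col])
        worst = feats[int(np.argmax(clf.feature_importances_))]
        dropped.append(worst)
        data = data.drop(columns = [worst])
    return dropped

# Synthetic demo: feature 'shift' almost perfectly separates train (1) from test (0)
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.normal(size = 400), 'b': rng.normal(size = 400)})
df['label'] = np.repeat([1, 0], 200)
df['shift'] = df['label'] + rng.normal(scale = 0.1, size = 400)
print(adversarial_drop(df))  # 'shift' is dropped first
```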
3. Model Prediction
3.1 Initial model training
We first use LGBM to get a rough sense of predictive performance on the training set:
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier, log_evaluation, early_stopping
from sklearn.metrics import roc_auc_score

ignore = ['CUST_UID', 'LABEL']
features = [feat for feat in train.columns if feat not in ignore]
target_feature = 'LABEL'
test_size_pct = 0.10
X_train, X_valid, y_train, y_valid = train_test_split(train[features], train[target_feature], test_size = test_size_pct, random_state = 42)
lgb = LGBMClassifier(verbosity = -1)
lgb.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['auc'],
        callbacks = [log_evaluation(period = 50), early_stopping(stopping_rounds = 128)])
pred_lgb = lgb.predict_proba(X_valid)[:,1]
roc_auc_score(y_valid, pred_lgb)
Training until validation scores don't improve for 128 rounds
[50] valid_0's auc: 0.949571 valid_0's binary_logloss: 0.240469
[100] valid_0's auc: 0.949393 valid_0's binary_logloss: 0.240118
Did not meet early stopping. Best iteration is:
[63] valid_0's auc: 0.949871 valid_0's binary_logloss: 0.239181
0.9498706534679444
3.2 K-fold cross-validation
K-fold cross-validation is a statistical method for estimating machine learning model performance. It measures how well results generalize to an independent dataset, largely avoiding overfitting and strengthening the model's generalization. For details see: https://blog.csdn.net/Rocky6688/article/details/107296546
from sklearn.model_selection import StratifiedKFold
from lightgbm import LGBMClassifier

lgb = LGBMClassifier(learning_rate = 0.05, max_depth = 20, num_leaves = 100, random_state = 1000, verbosity = -1)
strtfdKFold = StratifiedKFold(n_splits = 5, random_state = 100, shuffle = True)
# Pass the features and labels to the StratifiedKFold instance
X_train = train[features]
y_train = train[target_feature]
kfold = strtfdKFold.split(X_train, y_train)
scores = []
for k, (train1, test1) in enumerate(kfold):
    lgb.fit(X_train.iloc[train1, :], y_train.iloc[train1])
    pred_lgb = lgb.predict_proba(X_train.iloc[test1, :])[:, 1]
    score = roc_auc_score(y_train.iloc[test1], pred_lgb)
    scores.append(score)
    print('Fold: %2d, Training/Test Split Distribution: %s, AUC: %s' % (k + 1, np.bincount(y_train.iloc[train1]), score))
print('Cross-Validation AUC: %s +/- %s' % (np.mean(scores), np.std(scores)))
Fold: 1, Training/Test Split Distribution: [24000 8000], AUC: 0.9490365000000001
Fold: 2, Training/Test Split Distribution: [24000 8000], AUC: 0.9481471250000001
Fold: 3, Training/Test Split Distribution: [24000 8000], AUC: 0.9523520416666665
Fold: 4, Training/Test Split Distribution: [24000 8000], AUC: 0.9509735416666666
Fold: 5, Training/Test Split Distribution: [24000 8000], AUC: 0.9490646666666667
Cross-Validation AUC: 0.9499147749999999 +/- 0.0015283908750980848
The parameters above were set arbitrarily, yet the cross-validation (hereafter CV) score already reaches 0.9499, so the model performs quite well out of the box.
3.3 Multi-model prediction
3.3.1 Evaluating individual models
Here we benchmark four strong tree models: LightGBM, XGBoost, CatBoost, and HistGradientBoostingClassifier:
from sklearn import model_selection, ensemble
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from tqdm import tqdm

vote_est = [
    # Ensemble methods: http://scikit-learn.org/stable/modules/ensemble.html
    ('hgbc', ensemble.HistGradientBoostingClassifier(random_state = 42)),
    # lightgbm
    ('lgb', LGBMClassifier(verbosity = -1, random_state = 42)),
    # xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    ('xgb', XGBClassifier(verbosity = 0, random_state = 42)),
    ('cbc', CatBoostClassifier(verbose = 0, random_state = 42))
]
MLA_columns = ['MLA Name', 'MLA Train AUC Mean', 'MLA Test AUC Mean', 'MLA Test AUC 3*STD', 'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)
row_index = 0
cv_split = model_selection.ShuffleSplit(n_splits = 5, test_size = 0.2, train_size = 0.8, random_state = 0)
for i in tqdm(vote_est):
    model = i[1]
    MLA_compare.loc[row_index, 'MLA Name'] = i[0]
    cv_results = model_selection.cross_validate(model, train[features], train[target_feature], cv = cv_split, scoring = 'roc_auc', return_train_score = True)
    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Train AUC Mean'] = cv_results['train_score'].mean()
    MLA_compare.loc[row_index, 'MLA Test AUC Mean'] = cv_results['test_score'].mean()
    # For an unbiased random sample, +/- 3 standard deviations from the mean statistically captures 99.7% of outcomes
    MLA_compare.loc[row_index, 'MLA Test AUC 3*STD'] = cv_results['test_score'].std() * 3  # worst case to expect
    row_index += 1
    del model
MLA_compare.sort_values(by = ['MLA Test AUC Mean'], ascending = False, inplace = True)
MLA_compare
| MLA Name | MLA Train AUC Mean | MLA Test AUC Mean | MLA Test AUC 3*STD | MLA Time |
|---|---|---|---|---|
| lgb | 0.976829 | 0.950642 | 0.007759 | 0.365629 |
| cbc | 0.979338 | 0.950619 | 0.007465 | 14.070166 |
| hgbc | 0.970688 | 0.950087 | 0.007439 | 0.70215 |
| xgb | 0.994984 | 0.947138 | 0.007695 | 0.768368 |
LGBM performs best, while CatBoost takes the longest to train and predict.
3.3.2 Voting
Voting is an ensemble strategy for classification problems. It follows the majority principle: combining multiple models reduces noise and variance and makes the ensemble more robust. In general, a voting ensemble should outperform any single model. There are two main forms:
- Hard voting: the predicted class is the one that receives the most model votes.
- Soft voting: the models' predicted class probabilities are averaged, and the class with the highest average probability is the final prediction.
For details see: https://blog.csdn.net/deephub/article/details/122976720
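The two schemes can be illustrated on toy numbers (the probabilities below are made up):

```python
import numpy as np

# Each row is one model's predicted P(y=1) for three samples
model_probs = np.array([
    [0.9, 0.4, 0.2],
    [0.8, 0.6, 0.1],
    [0.7, 0.3, 0.3],
])

soft = model_probs.mean(axis = 0)              # soft voting: average the probabilities
hard = (model_probs > 0.5).sum(axis = 0) >= 2  # hard voting: majority of per-model class votes

print(soft)  # sample 2's average (0.433...) stays below 0.5, so both schemes agree here
print(hard)
```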
sklearn already implements voting in its ensemble module, so we use it directly. Since the evaluation metric is AUC, only soft voting is tried here.
grid_soft = ensemble.VotingClassifier(estimators = vote_est, voting = 'soft')
grid_soft_cv = model_selection.cross_validate(grid_soft, train[features], train[target_feature],
                                              scoring = 'roc_auc', cv = cv_split, return_train_score = True)
print("Soft Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}".format(grid_soft_cv['train_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}".format(grid_soft_cv['test_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}".format(grid_soft_cv['test_score'].std()*100*3))
Soft Voting w/Tuned Hyperparameters Training w/bin score mean: 98.52
Soft Voting w/Tuned Hyperparameters Test w/bin score mean: 95.20
Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 0.71
The soft-voting test score is 0.952, above the best single model (LGBM at 0.9506). Would the vote improve if we removed the weakest single model, XGBoost (0.947138)? The result:
vote_est2 = [
    # Ensemble methods: http://scikit-learn.org/stable/modules/ensemble.html
    ('hgbc', ensemble.HistGradientBoostingClassifier(random_state = 42)),
    # lightgbm
    ('lgb', LGBMClassifier(verbosity = -1, random_state = 42)),
    ('cbc', CatBoostClassifier(verbose = 0, random_state = 42))
]
grid_soft2 = ensemble.VotingClassifier(estimators = vote_est2, voting = 'soft')
grid_soft_cv2 = model_selection.cross_validate(grid_soft2, train[features], train[target_feature],
                                               scoring = 'roc_auc', cv = cv_split, return_train_score = True)
print("Soft Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}".format(grid_soft_cv2['train_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}".format(grid_soft_cv2['test_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}".format(grid_soft_cv2['test_score'].std()*100*3))
Soft Voting w/Tuned Hyperparameters Training w/bin score mean: 97.77
Soft Voting w/Tuned Hyperparameters Test w/bin score mean: 95.18
Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 0.70
The score actually dropped after removing XGBoost. So as long as no single model is truly weak, a voting ensemble benefits from including as many models as possible for better generalization.
3.3.3 Stacking
Stacking (stacked generalization) trains a model to combine the outputs of other models: first train several different base models, then train a meta-model that takes their predictions as input and produces the final output. For details see: https://blog.csdn.net/ueke1/article/details/137190677
from mlxtend.classifier import StackingCVClassifier
from sklearn import linear_model

train_new = train.copy()
test_new = test.copy()
dataset_net = [train_new, test_new]
for dataset in dataset_net:
    for i in dataset.columns:
        if dataset[i].dtype != 'object':
            dataset[i] = dataset[i].fillna(dataset[i].mean())  # logistic regression cannot handle NaNs
hgbc = ensemble.HistGradientBoostingClassifier(random_state = 42)
lgb = LGBMClassifier(verbosity = -1, random_state = 42)
xgb = XGBClassifier(verbosity = 0, random_state = 42)
cbc = CatBoostClassifier(verbose = 0, random_state = 42)
lr = linear_model.LogisticRegressionCV()
sclf = StackingCVClassifier(classifiers = [hgbc, lgb, cbc],  # first-level classifiers
                            meta_classifier = lr,  # second level: not a second round of stacking, but a logistic regression trained on the base models' outputs to produce the final prediction
                            cv = 5)
strtfdKFold = StratifiedKFold(n_splits = 5)
# Pass the features and labels to the StratifiedKFold instance
X_train = train_new[features]
y_train = train_new[target_feature]
kfold = strtfdKFold.split(X_train, y_train)
scores = []
for k, (train1, test1) in enumerate(kfold):
    sclf.fit(X_train.iloc[train1, :], y_train.iloc[train1])
    pred_lgb = sclf.predict_proba(X_train.iloc[test1, :])[:, 1]
    score = roc_auc_score(y_train.iloc[test1], pred_lgb)
    scores.append(score)
    print('Fold: %2d, Training/Test Split Distribution: %s, AUC: %.3f' % (k + 1, np.bincount(y_train.iloc[train1]), score))
print('\n\nCross-Validation AUC: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
Fold: 1, Training/Test Split Distribution: [24000 8000], AUC: 0.893
Fold: 2, Training/Test Split Distribution: [24000 8000], AUC: 0.882
Fold: 3, Training/Test Split Distribution: [24000 8000], AUC: 0.883
Fold: 4, Training/Test Split Distribution: [24000 8000], AUC: 0.880
Fold: 5, Training/Test Split Distribution: [24000 8000], AUC: 0.882
Cross-Validation AUC: 0.884 +/- 0.005
After trying a few stacking setups, the results were far worse than the single models, so we stick with soft voting.
4. Model Optimization Tricks
4.1 Pseudo-labeling
Pseudo-labeling is a semi-supervised technique: a model trained on labeled data predicts labels for unlabeled data, and those predictions are assigned as class labels, helping the model learn the information hidden in the unlabeled data. Pseudo-labels are usually taken as the class with the highest predicted probability, and can be used to fine-tune the model and improve its generalization. For details see: https://zhuanlan.zhihu.com/p/157325083
Here is a pseudo-labeling class:
from sklearn.utils import shuffle
from sklearn.base import BaseEstimator, RegressorMixin

class PseudoLabeler(BaseEstimator, RegressorMixin):
    def __init__(self, model, test, features, target, sample_rate=0.2, seed=42):
        self.sample_rate = sample_rate
        self.seed = seed
        self.model = model
        self.model.seed = seed
        self.test = test
        self.features = features
        self.target = target

    def get_params(self, deep=True):
        return {
            "sample_rate": self.sample_rate,
            "seed": self.seed,
            "model": self.model,
            "test": self.test,
            "features": self.features,
            "target": self.target
        }

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

    def fit(self, X, y):
        if self.sample_rate > 0.0:
            augmented_train = self.__create_augmented_train(X, y)
            self.model.fit(
                augmented_train[self.features],
                augmented_train[self.target]
            )
        else:
            self.model.fit(X, y)
        return self

    def __create_augmented_train(self, X, y):
        num_of_samples = int(len(self.test) * self.sample_rate)
        # Train the model and create the pseudo-labels
        self.model.fit(X, y)
        pseudo_labels = self.model.predict(self.test[self.features])
        # Add the pseudo-labels to the test set
        augmented_test = self.test.copy(deep=True)
        augmented_test[self.target] = pseudo_labels
        # Take a subset of the pseudo-labeled test set and append it onto
        # the training set
        sampled_test = augmented_test.sample(n=num_of_samples)
        temp_train = pd.concat([X, y], axis=1)
        augmented_train = pd.concat([sampled_test, temp_train])
        return shuffle(augmented_train)

    def predict(self, X):
        return self.model.predict(X)

    def predict_proba(self, X):
        return self.model.predict_proba(X)

    def get_model_name(self):
        return self.model.__class__.__name__
Using the pseudo-labeler:
strtfdKFold = StratifiedKFold(n_splits = 5)
grid_soft = ensemble.VotingClassifier(estimators = vote_est, voting = 'soft')
X_train = train[features]
y_train = train[target_feature]
X_test = test[features]
kfold = strtfdKFold.split(X_train, y_train)
pred = pd.DataFrame()
for k, (train1, test1) in enumerate(kfold):
    pseudo = PseudoLabeler(grid_soft, test, features, target_feature, sample_rate = 1)
    pseudo.fit(X_train.iloc[train1, :], y_train.iloc[train1])
    pred_lgb = pseudo.predict_proba(X_test)[:, 1]
    pred[str(k)] = pred_lgb
pred['result'] = (pred['0'] + pred['1'] + pred['2'] + pred['3'] + pred['4']) / 5
Pseudo-labeling can make the model much more robust and therefore better at predicting unseen data. Due to training-time constraints, the author did not run an offline CV test for pseudo-labeling, but the online score improved, which supports its effectiveness.
4.2 Neural network
The tree models do well here, so how does a neural network (nn) fare? The author wrote a simple multilayer perceptron:
(1) Data loading and preprocessing:
import os, gc, math, time, random, numpy as np, pandas as pd, warnings, torch
warnings.filterwarnings('ignore')
from torch import nn
from torch.utils.data import DataLoader, Dataset
from torch.nn import functional as F
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from transformers import get_constant_schedule_with_warmup, get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
from sklearn.preprocessing import LabelEncoder

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# ====================================================
# CFG
# ====================================================
class CFG:
    seed = 42
    num_hidden1 = 768
    num_hidden2 = 512
    num_hidden3 = 768
    num_hidden4 = 768
    num_output = 2
    print_freq = 100
    scheduler = 'cosine'
    batch_size = 32
    num_workers = 3
    lr = 1e-5
    weight_decay = 0
    epochs = 5
    num_warmup_steps = 0
    num_cycles = 0.5
    n_accumulate = 1
    train = True
    n_fold = 5

def seed_everything(seed=CFG.seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

seed_everything(seed = 42)

train_net = pd.read_excel('fintech训练营/train.xlsx')
for i in train_net.columns:
    train_net[i] = train_net[i].apply(lambda x: np.nan if x == '?' else x)
label = LabelEncoder()
for i in train_net.columns:
    if train_net[i].dtype == 'object':
        if i != 'CUST_UID':
            train_net[i] = label.fit_transform(train_net[i])
    else:
        if i != "LABEL":
            train_net[i] = train_net[i].fillna(train_net[i].mode().values[0])  # fill missing values with the mode
            train_net[i] = (train_net[i] - train_net[i].min()) / (train_net[i].max() - train_net[i].min())  # min-max normalization
ignore = ['CUST_UID', 'LABEL']
CFG.feas = [feat for feat in train_net.columns if feat not in ignore]
CFG.target_fea = 'LABEL'
skf = StratifiedKFold(n_splits = 5, random_state = CFG.seed, shuffle = True)
train_net['fold'] = -1
for i, (_, val_) in enumerate(skf.split(train_net[CFG.feas], train_net[CFG.target_fea])):
    train_net.loc[val_, 'fold'] = int(i)
Framework components:
(1) criterion is the loss function; here we use nn.CrossEntropyLoss.
(2) get_score computes the validation metric, which is AUC.
(3) FeedBackDataset is the Dataset class used to preprocess and feed the data; its __getitem__ defines the format of each item.
(4) custom_collate_fn is the collate function: since FeedBackDataset yields one sample at a time and a batch may contain several, it assembles the individual samples and converts them to tensors.
(5) BPNetModel defines the network structure: forward is the forward pass, and loss.backward() computes gradients. The author wrapped a loss_accumulate function mainly for the case where GPU memory is limited but a large batch size is needed: with, say, batchsize=1 and accumulate=4, it achieves the same effect as batchsize=4.
(6) asMinutes and timeSince are timing helpers for measuring run and remaining time.
(7) train_one_epoch trains a single epoch; it handles logging, parameter updates (optimizer.step), scheduler updates (scheduler.step), gradient zeroing (optimizer.zero_grad), and so on.
(8) valid_one_epoch runs prediction; since the best model is not necessarily the one at the end of the last epoch, it lets us score the validation set during training and pick the best checkpoint.
(9) train_loop initializes the run: it builds the DataLoader, model, optimizer (Adam, SGD), and other components.
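The gradient-accumulation idea behind loss_accumulate in point (5) can be checked with plain NumPy: scaling each micro-batch gradient by its share of the total batch makes the summed gradients equal the full-batch gradient. A minimal sketch under those assumptions (linear model, MSE loss, synthetic data):

```python
import numpy as np

def grad(X, y, w):
    # Gradient of the mean-squared-error loss for a linear model y_hat = X @ w
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(42)
X = rng.normal(size = (8, 3))
y = rng.normal(size = 8)
w = np.zeros(3)

full = grad(X, y, w)  # one big batch of 8

# Accumulate over 4 micro-batches of 2, scaling each gradient by its share (2/8)
acc = np.zeros(3)
for i in range(0, 8, 2):
    acc += grad(X[i:i+2], y[i:i+2], w) * (2 / 8)

assert np.allclose(full, acc)  # accumulation reproduces the full-batch gradient
```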
def criterion(outputs, labels):
    return nn.CrossEntropyLoss(reduction = 'sum')(outputs, labels)

def get_score(outputs, labels):
    outputs = F.softmax(torch.tensor(outputs)).numpy()[:, 1]
    return roc_auc_score(labels, outputs)

class FeedBackDataset(Dataset):
    def __init__(self, data):
        self.data = data[CFG.feas].values
        self.targets = data[CFG.target_fea].values

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, index):
        return {
            'feature': self.data[index],
            'target': self.targets[index]
        }

def custom_collate_fn(batch):
    datas, targets = [], []
    for batchid, data in enumerate(batch):
        datas.append(data['feature'])
        targets.append(data['target'])
    datas = torch.tensor(datas)
    targets = torch.tensor(targets)
    return datas, targets

class BPNetModel(nn.Module):
    def __init__(self):
        super(BPNetModel, self).__init__()
        self.hidden1 = torch.nn.Linear(len(CFG.feas), CFG.num_hidden1)  # hidden layers
        self.hidden2 = torch.nn.Linear(CFG.num_hidden1, CFG.num_hidden2)
        self.hidden3 = torch.nn.Linear(CFG.num_hidden2, CFG.num_hidden3)
        self.hidden4 = torch.nn.Linear(CFG.num_hidden3, CFG.num_hidden4)
        self.out = torch.nn.Linear(CFG.num_hidden4, CFG.num_output)  # output layer
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = x.to(torch.float32)
        x = self.hidden1(x)  # hidden layers use the relu() activation
        x = self.relu(x)
        x = self.hidden2(x)
        x = self.relu(x)
        x = self.hidden3(x)
        x = self.relu(x)
        x = self.hidden4(x)
        x = self.relu(x)
        x = self.out(x)
        return x

    def get_loss(self, inputs):
        inputs, targets = inputs[0].to(device), inputs[1]
        outs = self.forward(inputs)
        loss = criterion(outs, targets.to(device))
        return loss, outs

    def loss_accumulate(self, data_list):
        running_loss = 0
        result = []
        all_bs = sum(data_bts['batchsize'] for data_bts in data_list)
        for data_bts in data_list:
            data = data_bts['data']
            loss, outs = self.get_loss(data)
            loss = loss / all_bs
            loss.backward()
            running_loss += loss.item()
            result.append(outs.detach().to('cpu').numpy())
        return running_loss, all_bs, result

def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return "%dm %ds" % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return "%s (remain %s)" % (asMinutes(s), asMinutes(rs))

def get_scheduler(cfg, optimizer, num_train_steps):
    if cfg.scheduler == 'linear':
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps = cfg.num_warmup_steps, num_training_steps = num_train_steps
        )
    elif cfg.scheduler == 'cosine':
        scheduler = get_cosine_schedule_with_warmup(
            optimizer, num_warmup_steps = cfg.num_warmup_steps, num_training_steps = num_train_steps, num_cycles = cfg.num_cycles
        )
    return scheduler
def train_one_epoch(model, optimizer, scheduler, dataloader, epoch, valid_data):
    model.train()
    dataset_size = 0
    running_loss_awp = 0
    epoch_loss_awp = 0
    # valid_labels = valid_data.label.values
    validDataset = FeedBackDataset(valid_data)
    valid_loader = DataLoader(validDataset,
                              batch_size = CFG.batch_size,
                              shuffle = False,
                              collate_fn = custom_collate_fn,
                              num_workers = CFG.num_workers,
                              pin_memory = True)
    start = end = time.time()
    data_list = []
    target_list = []
    pred_list = []
    for step, data in enumerate(dataloader):
        if os.path.exists('break.txt'):
            raise ValueError('break error')
        batch_size = data[-1].shape[0]
        data_list.append({'data': data, 'batchsize': batch_size})
        target_list.append(data[1].numpy())
        if (step + 1) % CFG.n_accumulate == 0:
            accum_loss, datalist_size, result = model.loss_accumulate(data_list)
            optimizer.step()
            optimizer.zero_grad()
            if scheduler is not None:
                scheduler.step()
            pred_list += result
            data_list = []  # refresh the accumulated data_list
            running_loss_awp += (accum_loss * datalist_size)
            dataset_size += datalist_size
            # average loss
            epoch_loss_awp = running_loss_awp / dataset_size
            end = time.time()
            if step % CFG.print_freq == 0 or step == (len(dataloader) - 1):
                score_train = get_score(np.concatenate(pred_list), np.concatenate(target_list))
                print('Train: [{}] '
                      'Loss: {:.4f} '
                      'Train AUC: {:.4f} '
                      'Step: [{}/{}] '
                      'Elapsed {remain:s} '
                      .format(epoch, epoch_loss_awp, score_train, step + 1, len(dataloader),
                              remain = timeSince(start, float(step + 1) / len(dataloader))))
    pred = valid_one_epoch(model, valid_loader, epoch)
    gc.collect()
    return epoch_loss_awp, pred
@torch.no_grad()
def valid_one_epoch(model, dataloader, epoch):
    dataset_size = 0
    running_loss = 0
    start = end = time.time()
    pred = []
    for step, data in enumerate(dataloader):
        if os.path.exists('break.txt'):
            raise ValueError('break error')
        loss, outputs = model.get_loss(data)
        pred.append(outputs.to('cpu').numpy())
        batch_size = data[-1].shape[0]
        running_loss += (loss.item() * batch_size)
        dataset_size += batch_size
        epoch_loss = running_loss / dataset_size
    print('EVAL: [{}] '
          'Loss: {:.4f} '
          'Step: [{}/{}] '
          'Elapsed {remain:s} '
          .format(epoch, epoch_loss, step + 1, len(dataloader),
                  remain = timeSince(start, float(step + 1) / len(dataloader))))
    pred = np.concatenate(pred)
    model.train()
    return pred
def train_loop(fold):
    train_data = train_net[train_net.fold != fold].reset_index(drop=True)
    valid_data = train_net[train_net.fold == fold].reset_index(drop=True)
    trainDataset = FeedBackDataset(train_data)
    train_loader = DataLoader(trainDataset,
                              batch_size = CFG.batch_size,
                              shuffle = True,
                              collate_fn = custom_collate_fn,
                              num_workers = CFG.num_workers,
                              pin_memory = True)
    model = BPNetModel().to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr = CFG.lr)
    # loop
    best_score = 0
    # General training
    num_train_steps = int(len(train_data) / CFG.batch_size * CFG.epochs)
    scheduler = get_scheduler(CFG, optimizer, num_train_steps)
    for epoch in range(CFG.epochs):
        print(f'-------------epoch:{epoch} training-------------')
        start_time = time.time()
        train_epoch_loss, pred = train_one_epoch(model, optimizer, scheduler, train_loader, epoch, valid_data)
        score = get_score(pred, valid_data[CFG.target_fea].values)
        elapsed = time.time() - start_time
        print(f'Fold {fold} Epoch {epoch} - avg_train_loss: {train_epoch_loss:.4f} time: {elapsed:.0f}s')
        if score > best_score:
            best_score = score
        print(f'AUC score:{score} best_score:{best_score}')
    torch.cuda.empty_cache()
    gc.collect()
    return best_score
Model results:
best_scores = []
if CFG.train:
    for fold in range(CFG.n_fold):
        print(f'-------------fold:{fold} training-------------')
        best_scores.append(train_loop(fold))
print(f"Cross Validation: {np.mean(best_scores)}")
Part of the training log:
-------------epoch:3 training-------------
Train: [3] Loss: 0.5390 Test AUC: 0.6302 Step: [1/1000] Elapsed 0m 0s (remain 1m 7s)
Train: [3] Loss: 0.5257 Test AUC: 0.6642 Step: [101/1000] Elapsed 0m 0s (remain 0m 3s)
Train: [3] Loss: 0.5302 Test AUC: 0.6614 Step: [201/1000] Elapsed 0m 0s (remain 0m 2s)
Train: [3] Loss: 0.5326 Test AUC: 0.6601 Step: [301/1000] Elapsed 0m 0s (remain 0m 1s)
Train: [3] Loss: 0.5342 Test AUC: 0.6591 Step: [401/1000] Elapsed 0m 0s (remain 0m 1s)
Train: [3] Loss: 0.5349 Test AUC: 0.6580 Step: [501/1000] Elapsed 0m 1s (remain 0m 1s)
Train: [3] Loss: 0.5320 Test AUC: 0.6597 Step: [601/1000] Elapsed 0m 1s (remain 0m 0s)
Train: [3] Loss: 0.5323 Test AUC: 0.6583 Step: [701/1000] Elapsed 0m 1s (remain 0m 0s)
Train: [3] Loss: 0.5330 Test AUC: 0.6592 Step: [801/1000] Elapsed 0m 1s (remain 0m 0s)
Train: [3] Loss: 0.5336 Test AUC: 0.6593 Step: [901/1000] Elapsed 0m 1s (remain 0m 0s)
Train: [3] Loss: 0.5328 Test AUC: 0.6590 Step: [1000/1000] Elapsed 0m 1s (remain 0m 0s)
EVAL: [3] Loss: 17.1517 Step: [250/250] Elapsed 0m 0s (remain 0m 0s)
Fold 4 Epoch 3 - avg_train_loss: 0.5328 time: 3s
AUC score:0.6507739583333333 best_score:0.6507739583333333
-------------epoch:4 training-------------
Train: [4] Loss: 0.5875 Test AUC: 0.6727 Step: [1/1000] Elapsed 0m 0s (remain 1m 12s)
Train: [4] Loss: 0.5156 Test AUC: 0.6763 Step: [101/1000] Elapsed 0m 0s (remain 0m 2s)
Train: [4] Loss: 0.5246 Test AUC: 0.6699 Step: [201/1000] Elapsed 0m 0s (remain 0m 2s)
Train: [4] Loss: 0.5281 Test AUC: 0.6632 Step: [301/1000] Elapsed 0m 0s (remain 0m 1s)
Train: [4] Loss: 0.5335 Test AUC: 0.6615 Step: [401/1000] Elapsed 0m 0s (remain 0m 1s)
Train: [4] Loss: 0.5364 Test AUC: 0.6585 Step: [501/1000] Elapsed 0m 1s (remain 0m 1s)
Train: [4] Loss: 0.5355 Test AUC: 0.6569 Step: [601/1000] Elapsed 0m 1s (remain 0m 0s)
Train: [4] Loss: 0.5337 Test AUC: 0.6572 Step: [701/1000] Elapsed 0m 1s (remain 0m 0s)
Train: [4] Loss: 0.5336 Test AUC: 0.6570 Step: [801/1000] Elapsed 0m 1s (remain 0m 0s)
Train: [4] Loss: 0.5325 Test AUC: 0.6593 Step: [901/1000] Elapsed 0m 1s (remain 0m 0s)
Train: [4] Loss: 0.5325 Test AUC: 0.6598 Step: [1000/1000] Elapsed 0m 2s (remain 0m 0s)
EVAL: [4] Loss: 17.1514 Step: [250/250] Elapsed 0m 0s (remain 0m 0s)
Fold 4 Epoch 4 - avg_train_loss: 0.5325 time: 2s
AUC score:0.6505471666666667 best_score:0.6507739583333333
Cross Validation: 0.6568245416666667
This network normalizes the data, fills missing values with the mode, and stacks four linear layers (768, 512, 768, 768) into a simple multilayer perceptron. The baseline CV is only 0.6568, far below the tree models (the data and architecture would likely need further work). Given the time constraints, and since neural networks are usually weaker than tree models on tabular binary classification, we abandon the nn and keep optimizing the tree models.
4.3 Bad Case Analysis
A bad case is, as the name suggests, a sample the model cannot predict correctly. Bad case analysis simply means pulling out the samples the model gets wrong and studying what they have in common (shared feature values, distribution differences, etc.) to improve the model. We won't elaborate here; interested readers can see: https://zhuanlan.zhihu.com/p/104961266
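As a minimal sketch of the first step, collecting the misclassified validation rows (the helper find_bad_cases and the toy data are illustrative, not from the competition):

```python
import numpy as np
import pandas as pd

def find_bad_cases(X_valid: pd.DataFrame, y_valid, proba, threshold: float = 0.5):
    """Return the validation rows the model gets wrong, most confidently wrong first."""
    y = np.asarray(y_valid)
    pred = (np.asarray(proba) >= threshold).astype(int)
    bad = X_valid.copy()
    bad['label'] = y
    bad['proba'] = proba
    bad = bad[pred != y]
    # Confidence of the error: distance of the predicted probability from the true label
    return bad.assign(err = (bad['proba'] - bad['label']).abs()).sort_values('err', ascending = False)

X = pd.DataFrame({'f1': [1.0, 2.0, 3.0, 4.0]})
y = [1, 0, 1, 0]
p = np.array([0.9, 0.8, 0.2, 0.1])  # the model is wrong on rows 1 and 2
print(find_bad_cases(X, y, p)[['label', 'proba']])
```

From here one would inspect the feature distributions of these rows against the correctly predicted ones.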
4.4 Hyperparameter tuning
4.4.1 Grid search
Grid search, also called exhaustive search, tries every candidate value of every hyperparameter (e.g. a learning rate of 1e-5, 1e-4, 1e-3, ...) and keeps the best model. But the combinations multiply: with 5 values for hyperparameter A, 10 for B, and 6 for C, there are already 300 configurations to try, so the search can take a very long time. For details see: https://blog.csdn.net/qq_39521554/article/details/86227582
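The combinatorial blow-up is just a product of the grid sizes; a quick sketch with made-up grids:

```python
from itertools import product

# Hypothetical grids: 5 values for A, 10 for B, 6 for C
grid = {'A': [1, 2, 3, 4, 5],
        'B': list(range(10)),
        'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}

combos = list(product(*grid.values()))
print(len(combos))  # 5 * 10 * 6 = 300 configurations to evaluate
```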
cv_split = model_selection.ShuffleSplit(n_splits = 5, test_size = .2, train_size = .8, random_state = 0)
vote_est = [
    # Ensemble methods: http://scikit-learn.org/stable/modules/ensemble.html
    ('hgbc', ensemble.HistGradientBoostingClassifier()),
    # lightgbm
    ('lgb', LGBMClassifier()),
    # xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    ('xgb', XGBClassifier(verbosity = 0)),
    ('cbc', CatBoostClassifier(verbose = 0))
]
grid_n_estimator = [10, 50, 100, 300]
grid_ratio = [.1, .25, .5, .75, 1.0]
grid_learn = [.01, .03, .05, .1, .25]
grid_max_depth = [2, 4, 6, 8, 10, None]
grid_min_samples = [5, 10, .03, .05, .10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [42]
grid_param = [
    [{
        # hgbc
        'learning_rate': grid_learn,
        'max_depth': [1, 3, 5, 7, 9],
        'max_iter': [50, 100, 200, 500],
        'random_state': grid_seed
    }],
    [{
        # lgb
        'learning_rate': grid_learn,
        'max_depth': [1, 3, 5, 7, 9],
        'n_estimators': grid_n_estimator,
        'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1],
        'reg_alpha': [0, 0.05, 1],
        'reg_lambda': [0, 0.1, 0.5, 1],
        'seed': grid_seed
    }],
    [{
        # xgb
        'learning_rate': grid_learn,
        'max_depth': [1, 3, 5, 7, 9],
        'n_estimators': grid_n_estimator,
        'gamma': [0, 0.2, 0.5],
        'subsample': [0.6, 0.7, 0.8, 0.9, 1],
        'seed': grid_seed,
        'verbosity': [0]
    }],
    [{
        # cbc
        'learning_rate': grid_learn,
        'n_estimators': grid_n_estimator,
        'depth': [4, 5, 6, 7, 8, 9],
        'l2_leaf_reg': [1, 3, 5, 7],
        'seed': grid_seed,
        'verbose': [0]
    }]
]
start_total = time.perf_counter()  # https://docs.python.org/3/library/time.html#time.perf_counter
for clf, param in zip(vote_est, grid_param):  # https://docs.python.org/3/library/functions.html#zip
    start = time.perf_counter()
    best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = 'roc_auc')
    best_search.fit(train[features], train[target_feature])
    run = time.perf_counter() - start
    best_param = best_search.best_params_
    print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(clf[1].__class__.__name__, best_param, run))
run_total = time.perf_counter() - start_total
print('Total optimization time was {:.2f} minutes.'.format(run_total/60))
Part of the search results:
### The best parameter for HistGradientBoostingClassifier is
{'learning_rate': 0.03, 'max_depth': 7, 'max_iter': 500, 'random_state': 0}
with a runtime of 6016.12 seconds.
### The best parameter for LGBMClassifier is
{'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 7, 'n_estimators': 300, 'reg_alpha': 1, 'reg_lambda': 0.1, 'seed': 0}
with a runtime of 33976.22 seconds.
Retraining with the grid-searched parameters brought no improvement, which shows that manually enumerated hyperparameters rarely lift the model much.
4.4.2 Optuna
Optuna is a hyperparameter optimization framework based on Bayesian optimization. Its goal is to find the best hyperparameter combination in as few trials as possible through smart search strategies. For details see the official Optuna documentation: https://zh-cn.optuna.org/index.html
import numpy as np
import pandas as pd
import os
os.chdir(os.path.abspath(os.curdir))
from tqdm import tqdm
from lightgbm import LGBMClassifier
# Common model helpers
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')
from optuna.integration import LightGBMPruningCallback
import optuna
from lightgbm import early_stopping

train = pd.read_excel('fintech训练营/train.xlsx')
test = pd.read_excel('fintech训练营/test_A榜.xlsx')
datasets = [train, test]
for dataset in datasets:
    for i in dataset.columns:
        dataset[i] = dataset[i].apply(lambda x: np.nan if x == '?' else x)
label = LabelEncoder()
for dataset in datasets:
    for i in dataset.columns:
        if dataset[i].dtype == 'object':
            if i != 'CUST_UID':
                dataset[i] = label.fit_transform(dataset[i])
ignore = ['CUST_UID', 'LABEL']
features = [feat for feat in train.columns if feat not in ignore]
target_feature = 'LABEL'

def objective(trial, X, y):
    # Hyperparameter search space
    param_grid = {
        "n_estimators": trial.suggest_categorical("n_estimators", [300, 500, 1000]),
        "learning_rate": trial.suggest_categorical("learning_rate", [0.01, 0.03, 0.05, 0.1, 0.25]),
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "subsample": trial.suggest_float("subsample", 0.001, 0.999),
        "subsample_freq": trial.suggest_categorical("subsample_freq", [1]),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.001, 0.999),
        "random_state": 42,
        "verbosity": -1
    }
    # 5-fold cross-validation
    cv = StratifiedKFold(n_splits = 5, random_state = 100, shuffle = True)
    cv_scores = np.empty(5)
    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        # Fit LGBM
        model = LGBMClassifier(**param_grid)
        model.fit(
            X_train,
            y_train,
            eval_set = [(X_test, y_test)],
            eval_metric = ['auc'],
            callbacks = [early_stopping(stopping_rounds = 100, verbose = 0)]
        )
        # Predict on the held-out fold
        preds = model.predict_proba(X_test)[:, 1]
        # The objective to maximize is the mean AUC
        cv_scores[idx] = roc_auc_score(y_test, preds)
    return np.mean(cv_scores)

study = optuna.create_study(direction = "maximize", study_name = "LGBM Classifier")
func = lambda trial: objective(trial, train[features], train[target_feature])
study.optimize(func, n_trials = 2000)
print('best params:', study.best_params)
print('best score:', study.best_value)
Part of the training log:
[I 2024-08-01 10:45:25,902] A new study created in memory with name: LGBM Classifier [I 2024-08-01 10:45:38,510] Trial 0 finished with value: 0.9422476416666667 and parameters: {'n_estimators': 500, 'learning_rate': 0.05, 'num_leaves': 2660, 'max_depth': 11, 'subsample': 0.0905498028535803, 'subsample_freq': 1, 'colsample_bytree': 0.2742667597572731}. Best is trial 0 with value: 0.9422476416666667.
[I 2024-08-01 10:45:43,009] Trial 1 finished with value: 0.947348975 and parameters: {'n_estimators': 300, 'learning_rate': 0.25, 'num_leaves': 2520, 'max_depth': 4, 'subsample': 0.58422461412904, 'subsample_freq': 1, 'colsample_bytree': 0.41895727668202326}. Best is trial 1 with value: 0.947348975.
[I 2024-08-01 10:45:53,076] Trial 2 finished with value: 0.94371525 and parameters: {'n_estimators': 500, 'learning_rate': 0.03, 'num_leaves': 2220, 'max_depth': 12, 'subsample': 0.027482276053391742, 'subsample_freq': 1, 'colsample_bytree': 0.5719792346676457}. Best is trial 1 with value: 0.947348975.
[I 2024-08-01 10:46:11,290] Trial 3 finished with value: 0.9489362166666666 and parameters: {'n_estimators': 1000, 'learning_rate': 0.05, 'num_leaves': 2420, 'max_depth': 3, 'subsample': 0.4658330549093411, 'subsample_freq': 1, 'colsample_bytree': 0.9729852421378948}. Best is trial 3 with value: 0.9489362166666666.
Compared with grid search, Optuna is far faster and searches the hyperparameter space automatically: we only supply values or ranges (e.g. 0–1) up front and it finds the best combination on its own, which makes it an excellent choice for tuning.
4.5 特征工程
4.5.1 分箱
For non-tree models, binning continuous variables or discrete variables with many distinct values helps reduce the influence of outliers and handle missing values (missing values form their own bin), among other benefits. For details see: https://blog.csdn.net/CarryLvan/article/details/108775507
grid_soft = ensemble.VotingClassifier(estimators=vote_est, voting='soft')  # vote_est: the (name, estimator) list defined earlier in the article
for name in train.columns[2:]:
    train_cut = train.copy()
    if train[name].dtype != 'object':
        train_cut[name] = pd.qcut(train_cut[name].rank(method='first'), 5)  # 5 equal-frequency bins
        train_cut[name] = label.fit_transform(train_cut[name])
    print(name)
    strtfdKFold = StratifiedKFold(n_splits=5, random_state=100, shuffle=True)
    # pass the features and labels to the StratifiedKFold instance
    X_train = train_cut[features]
    y_train = train_cut[target_feature]
    kfold = strtfdKFold.split(X_train, y_train)
    scores = []
    for k, (train1, test1) in enumerate(kfold):
        grid_soft.fit(X_train.iloc[train1, :], y_train.iloc[train1])
        pred_lgb = grid_soft.predict_proba(X_train.iloc[test1, :])[:, 1]
        score = roc_auc_score(y_train.iloc[test1], pred_lgb)
        scores.append(score)
        print('Fold: %2d, Training/Test Split Distribution: %s, AUC: %s' % (k+1, np.bincount(y_train.iloc[train1]), score))
    print('Cross-Validation AUC: %s +/- %s\n' % (np.mean(scores), np.std(scores)))
The author implemented equal-frequency binning, mainly with pandas' pd.qcut function. But since tree models were used, the binned features did not help much (a tree's splits are themselves a form of binning), so this method was ultimately dropped.
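For illustration, here is a minimal, self-contained sketch of equal-frequency binning with pd.qcut on made-up data (not the competition data); note that a missing value simply stays NaN, i.e. it forms its own implicit bin:

```python
import numpy as np
import pandas as pd

# Made-up toy column, purely for illustration
s = pd.Series([1, 3, 5, 7, 9, 11, 13, 15, 17, 19, np.nan])

# 5 equal-frequency bins on the ranks; labels=False returns the integer
# bin index directly, so no extra LabelEncoder pass is needed
binned = pd.qcut(s.rank(method='first'), 5, labels=False)

print(binned.tolist())  # each bin receives 2 of the 10 non-missing values; NaN stays NaN
```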
4.5.2 Hand-crafted features
Feature engineering is arguably the most important module in machine learning: good features extract more information and directly improve predictive performance, while the other tricks merely squeeze out what the existing features allow. So why does this article put feature engineering last? One important reason is that this dataset consists of anonymized, desensitized features, which makes it very hard to build meaningful, strong hand-crafted features. The author guessed at the meaning of some features (omitted here because the data is desensitized) and constructed two strong ones:
train_ft['transfer_amount_avg'] = train_ft['MON_12_EXT_SAM_TRSF_OUT_AMT'] / train_ft['MON_12_EXT_SAM_NM_TRSF_OUT_CNT']
train_ft['dps_cur_month_peak_ratio'] = (train_ft['LAST_12_MON_COR_DPS_TM_PNT_BAL_PEAK_VAL'] - train_ft['CUR_MON_COR_DPS_MON_DAY_AVG_BAL']) / train_ft['CUR_MON_COR_DPS_MON_DAY_AVG_BAL']
These two features did not help much on leaderboard A (presumably because the training and test distributions there were similar), but gave a considerable boost on leaderboard B.
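One caveat with ratio features: if the count column can be zero, the division produces inf, which XGBoost rejects. A minimal sketch of the same ratio construction with inf converted to NaN (the values here are invented for illustration, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_ft; values are invented for illustration only
df = pd.DataFrame({
    'MON_12_EXT_SAM_TRSF_OUT_AMT': [1000.0, 500.0, 0.0],
    'MON_12_EXT_SAM_NM_TRSF_OUT_CNT': [10.0, 0.0, 5.0],
})

# Same ratio feature as above; a zero denominator yields inf
df['transfer_amount_avg'] = (df['MON_12_EXT_SAM_TRSF_OUT_AMT']
                             / df['MON_12_EXT_SAM_NM_TRSF_OUT_CNT'])
# Convert ±inf to NaN so the tree libraries treat it as missing
df['transfer_amount_avg'] = df['transfer_amount_avg'].replace([np.inf, -np.inf], np.nan)

print(df['transfer_amount_avg'].tolist())  # [100.0, nan, 0.0]
```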
4.5.3 Featuretools
Hand-crafting features for anonymized data is hard, so the author took another route: the featuretools automated feature engineering library. It applies operators such as groupby, mean, max, and min to quickly build a rich set of features and thus improve model performance. For details see: https://blog.csdn.net/ShowMeAI/article/details/123650547
import featuretools as ft
train_ft = pd.read_excel('fintech训练营/train.xlsx')
test_ft = pd.read_excel('fintech训练营/test_A榜.xlsx')
datasets = [train_ft,test_ft]
for dataset in datasets:
    for i in dataset.columns:
        dataset[i] = dataset[i].apply(lambda x: np.nan if x == '?' else x)
ignore = ['CUST_UID','LABEL']
categorical = ['MON_12_CUST_CNT_PTY_ID',
'AI_STAR_SCO',
'WTHR_OPN_ONL_ICO',
'SHH_BCK',
'LGP_HLD_CARD_LVL',
'NB_CTC_HLD_IDV_AIO_CARD_SITU']
features = [feat for feat in train_ft.columns if feat not in (ignore + categorical)]
target_feature = 'LABEL'
es = ft.EntitySet(id='fintech')  # the id identifies the entity set
es=es.add_dataframe(
dataframe_name = "fintech_train",
dataframe = train_ft[features],
index = '1',
make_index = True
)
es=es.add_dataframe(
dataframe_name = "fintech_test",
dataframe = test_ft[features],
index = '2',
make_index = True
)
feature_train, feature_defs_train = ft.dfs(entityset=es,
                                           target_dataframe_name='fintech_train',
                                           agg_primitives=["mean", "sum", "mode"],
                                           trans_primitives=['add_numeric', 'subtract_numeric', 'multiply_numeric', 'divide_numeric'],  # generate features by adding/subtracting/multiplying/dividing pairs of columns
                                           max_depth=1)
feature_test, feature_defs_test = ft.dfs(entityset=es,
                                         target_dataframe_name='fintech_test',
                                         agg_primitives=["mean", "sum", "mode"],
                                         trans_primitives=['add_numeric', 'subtract_numeric', 'multiply_numeric', 'divide_numeric'],  # generate features by adding/subtracting/multiplying/dividing pairs of columns
                                         max_depth=1)
feature_train_last = pd.concat([feature_train,train_ft[categorical+[target_feature]]],axis=1)
feature_test_last = pd.concat([feature_test,test_ft[categorical]],axis=1)
label = LabelEncoder()
datasets = [feature_train_last, feature_test_last]
for dataset in datasets:
    for i in categorical:
        dataset[i] = label.fit_transform(dataset[i])
for dataset in datasets:
    for i in dataset.columns:
        # division primitives can produce ±inf, which XGBoost rejects at predict time; convert to NaN
        dataset[i] = dataset[i].replace([np.inf, -np.inf], np.nan)
Using operators such as mean, sum, mode, add_numeric, subtract_numeric, multiply_numeric, and divide_numeric (pairwise add/subtract/multiply/divide of columns), the author generated well over a thousand features (more operators such as cosine, sine, modulo_numeric, percentile, and natural_logarithm can be added via the trans_primitives parameter). Note that division produces inf, which makes XGBoost fail at prediction time, so it must be converted to NaN. A thousand-plus features also take up a lot of memory, making loading, saving, and training all very slow, so here is a way to reduce the memory footprint:
def reduce_mem_usage(df):
    """Iterate through the columns of a dataframe and downcast numeric dtypes
    to reduce memory usage. Downcasting stops at 32-bit types, trading some
    extra savings for numeric precision.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                else:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
Memory usage of dataframe is 1393.43 MB
Memory usage after optimization is: 703.28 MB
Decreased by 49.5%
Memory usage of dataframe is 417.94 MB
Memory usage after optimization is: 209.01 MB
Decreased by 50.0%
By converting the numeric columns, this function greatly shrinks a dataframe's memory footprint. As the run logs above show, it roughly halves memory usage without materially affecting data precision.
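The core idea can be demonstrated in isolation: downcasting 64-bit numeric columns to 32-bit roughly halves their footprint. A toy sketch (made-up data, not the competition set):

```python
import numpy as np
import pandas as pd

# Toy frame: one int64 and one float64 column of 100,000 rows each
df = pd.DataFrame({'a': np.arange(100_000, dtype=np.int64),
                   'b': np.random.rand(100_000)})  # float64

before = df.memory_usage(deep=True).sum()
df['a'] = df['a'].astype(np.int32)    # values fit comfortably in int32
df['b'] = df['b'].astype(np.float32)  # some precision traded for memory
after = df.memory_usage(deep=True).sum()

print(f'{before / 1024:.0f} KB -> {after / 1024:.0f} KB')  # roughly halved
```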
4.5.4 特征筛选
Training and predicting on the featuretools-generated features gives the following results (code as above):
Fold: 1, Training/Test Split Distribution: [24000 8000], AUC: 0.9479184166666665
Fold: 2, Training/Test Split Distribution: [24000 8000], AUC: 0.9481912083333333
Fold: 3, Training/Test Split Distribution: [24000 8000], AUC: 0.9511504583333334
Fold: 4, Training/Test Split Distribution: [24000 8000], AUC: 0.9495554166666667
Fold: 5, Training/Test Split Distribution: [24000 8000], AUC: 0.9486693333333335
Cross-Validation AUC: 0.9490969666666667 +/- 0.0011678401394241225
As you can see, the result is actually worse than without the constructed features: features built by directly crossing columns (add/subtract/multiply/divide) contain a lot of noise, which hurts the model. Here is a feature-selection method that performs well in practice, stepwise feature elimination (illustrated with n features):
(1) Train and predict with the n features, and rank them by the feature importance of a model such as LGBM.
(2) Drop the n/2 features with the lowest importance (the number dropped per round can be adjusted).
(3) Retrain with the remaining n/2 features and repeat steps (1) and (2).
Why this method? It is a variant of a greedy algorithm. Because features are coupled, keeping only the features that rank highly in a single training run may not be optimal, so only part of them is removed each round. Ideally only one feature would be removed per round, but with many features that makes training far too slow, so we trade away a possibly tiny accuracy gain for time.
import numpy as np
import pandas as pd
import os
os.chdir(os.path.abspath(os.curdir))
from tqdm import tqdm
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
#Common Model Helpers
from sklearn.preprocessing import LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt
import matplotlib
matplotlib.use('Agg')
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import time
def get_importance(importance, names, model_type):
    # create arrays from the feature importances and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)
    # build a DataFrame from a dictionary
    data = {'feature_names': feature_names, 'feature_importance': feature_importance}
    fi_df = pd.DataFrame(data)
    # sort the DataFrame by decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
    return fi_df
def score(X_train, y_train):
    start = time.time()
    print(len(X_train.columns))
    params_hgbc = {'max_iter': 300,
                   'learning_rate': 0.03,
                   'max_leaf_nodes': 1980,
                   'max_depth': 6,
                   'random_state': 22934
                   }
    params_lgb = {'n_estimators': 300,
                  'learning_rate': 0.03,
                  'num_leaves': 1240,
                  'max_depth': 8,
                  'subsample': 0.614339420520959,
                  'subsample_freq': 1,
                  'colsample_bytree': 0.9711563047222685,
                  'random_state': 2022
                  }
    params_xgb = {'n_estimators': 300,
                  'learning_rate': 0.03,
                  'num_leaves': 1880,
                  'max_depth': 8,
                  'subsample': 0.7574143599011826,
                  'subsample_freq': 1,
                  'colsample_bytree': 0.682578966844618,
                  'verbosity': 0,
                  'random_state': 2022
                  }
    vote_est_new = [
        # Ensemble Methods: http://scikit-learn.org/stable/modules/ensemble.html
        ('hgbc', ensemble.HistGradientBoostingClassifier(**params_hgbc)),
        # lightgbm
        ('lgb', LGBMClassifier(**params_lgb)),
        # xgboost: http://xgboost.readthedocs.io/en/latest/model.html
        ('xgb', XGBClassifier(**params_xgb)),
    ]
    grid_soft2 = ensemble.VotingClassifier(estimators=vote_est_new, voting='soft', weights=[0.2, 0.4, 0.4])
    strtfdKFold = StratifiedKFold(n_splits=5, random_state=100, shuffle=True)
    # pass the features and labels to the StratifiedKFold instance
    kfold = strtfdKFold.split(X_train, y_train)
    scores = []
    for k, (train1, test1) in enumerate(kfold):
        grid_soft2.fit(X_train.iloc[train1, :], y_train.iloc[train1])
        pred_lgb = grid_soft2.predict_proba(X_train.iloc[test1, :])[:, 1]
        score = roc_auc_score(y_train.iloc[test1], pred_lgb)
        scores.append(score)
        print('Fold: %2d, Training/Test Split Distribution: %s, AUC: %s' % (k+1, np.bincount(y_train.iloc[train1]), score))
    print('\nCross-Validation AUC: %s +/- %s\n' % (np.mean(scores), np.std(scores)))
    print('time:', time.time() - start)
    return np.mean(scores)
train = pd.read_parquet('data/train_feature_last.parquet')
test = pd.read_parquet('data/test_feature_last.parquet')
test_original = pd.read_excel('fintech训练营/test_A榜.xlsx')
ignore = ['CUST_UID','LABEL']
original_feature = list(test_original.columns)
target_feature = 'LABEL'
important=pd.read_csv('important/step_10/important_790.csv')
result_1 = pd.DataFrame(columns=['num','score'])
row = 0
for num in range(780, 100, -10):
    features = [feat for feat in important['feature_names'][:num] if feat not in ignore]
    score_new = score(train[features], train[target_feature])
    lgb = LGBMClassifier()
    lgb.fit(train[features], train[target_feature])
    important = get_importance(lgb.feature_importances_, features, 'LGBM ')
    important.to_csv('important/no_add_original_800/important' + str(num) + '.csv', index=False)
    result_1.loc[row, 'num'] = num
    result_1.loc[row, 'score'] = score_new
    result_1.to_csv('important/no_add_original_800/result.csv', index=False)
    row += 1
The code above selects the highest-scoring feature combination; after tuning with Optuna, the models perform as follows:
| MLA Name | MLA Train AUC Mean | MLA Test AUC Mean | MLA Test AUC 3*STD | MLA Time |
|---|---|---|---|---|
| xgb | 0.997148 | 0.952981 | 0.006871 | 121.238519 |
| lgb | 0.996984 | 0.952962 | 0.007715 | 23.109799 |
| lghc | 0.979313 | 0.952644 | 0.0074 | 26.32936 |
| cbc | 0.986432 | 0.951637 | 0.008834 | 47.699232 |
The CV of every model improved noticeably, and the online score rose along with it, confirming that the strategy works.
5. Competition summary
5.1 Leaderboard A
In leaderboard A the training and test sets shared the same distribution, so everyone's scores were extremely close (even an LGBM baseline reaches 0.93; the author's CV: 0.9539, LB: 0.9551 ranked 53rd, while 0.956 would probably have ranked 1st). The rest of the game was a fight against overfitting.
5.2 Leaderboard B
In leaderboard B the training and test distributions differ, with severe data drift and "poisonous" features (great CV, terrible leaderboard).
(1) The author used adversarial validation to filter out the features that drifted most (adversarial-validation AUC dropped from 0.99 to 0.71).
(2) Built strong hand-crafted features, expanded these base features with featuretools (5000+ features), and pruned them with stepwise feature elimination (down to 180 features).
(3) Augmented the training set with pseudo-labels from leaderboard A (labeled by the A-stage model) and from the B test set.
(4) Used four ensemble tree models (LGBM, XGBoost, CatBoost, HistGradientBoostingClassifier), tuned their hyperparameters with Optuna, and blended them by soft voting with weights LGBM: 0.35, XGBoost: 0.35, CatBoost: 0.15, HistGradientBoostingClassifier: 0.15.
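The adversarial validation in step (1) is not shown in code above. As a generic sketch of the technique (with synthetic columns 'stable' and 'drifted' invented for illustration, not the author's actual code): label training rows 0 and test rows 1, train a classifier to tell them apart, and inspect which features make the two sets distinguishable:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# 'stable' has the same distribution in train and test; 'drifted' shifts in test
train = pd.DataFrame({'stable': rng.normal(0, 1, 2000), 'drifted': rng.normal(0, 1, 2000)})
test = pd.DataFrame({'stable': rng.normal(0, 1, 2000), 'drifted': rng.normal(3, 1, 2000)})

X = pd.concat([train, test], ignore_index=True)
y = np.r_[np.zeros(len(train)), np.ones(len(test))]  # 0 = train row, 1 = test row

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring='roc_auc').mean()
clf.fit(X, y)
print(f'adversarial AUC: {auc:.2f}')                    # high AUC -> the sets are distinguishable
print(dict(zip(X.columns, clf.feature_importances_)))   # 'drifted' should dominate

# dropping the drifting feature should push the AUC back toward 0.5
auc_after = cross_val_score(clf, X[['stable']], y, cv=5, scoring='roc_auc').mean()
print(f'after dropping the drifted feature: {auc_after:.2f}')
```

This mirrors the 0.99 -> 0.71 drop the author reports: once the worst-drifting features are removed, the classifier can no longer separate train from test so easily.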
With these tricks, the author's final leaderboard-B score was CV: 0.863, LB: 0.872, which ranked 4th after the weighted combination with the leaderboard-A score.
The datasets and source code for this article can be downloaded from GitHub: https://github.com/CNLCNL/2022-fintech. I hope these methods and tricks are helpful, thanks for reading!