Copyright notice: this is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/u012736685/article/details/85755779
Data download link (different from before): https://pan.baidu.com/s/1G1b2QJjYkkk7LDfGorbj5Q
Goal: the dataset is financial data (not anonymized), and the task is to predict whether a loan user will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.
Task: data type conversion and missing-value handling (try different fill strategies and compare their effects), plus any other data exploration you find useful.
1. Libraries
# -*- coding:utf-8 -*-
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
2. Reading the Data
file_path = "data.csv"
data = pd.read_csv(file_path, encoding='gbk')
print(data.head())
print(data.shape)
Output:
Unnamed: 0 custid ... latest_query_day loans_latest_day
0 5 2791858 ... 12.0 18.0
1 10 534047 ... 4.0 2.0
2 12 2849787 ... 2.0 6.0
3 13 1809708 ... 2.0 4.0
4 14 2499829 ... 22.0 120.0
[5 rows x 90 columns]
(4754, 90)
Problem encountered:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 0: invalid start byte
Cause: the 'utf-8' codec cannot decode the byte 0xbf, i.e. that byte is not valid UTF-8 at that position.
Solution: pass the encoding explicitly. Verified: both encoding='gbk' and encoding='ISO-8859-1' work.
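When the file's encoding is unknown, one option is to try a few common encodings in order and keep the first that parses. This is a hedged sketch, not from the original post; `read_csv_any` is a hypothetical helper name:

```python
import io
import pandas as pd

def read_csv_any(raw_bytes, encodings=("utf-8", "gbk", "ISO-8859-1")):
    """Hypothetical helper: try each candidate encoding until pandas can decode the bytes."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(raw_bytes), encoding=enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

# Simulate a GBK-encoded CSV like data.csv
raw = "姓名,金额\n张三,12\n".encode("gbk")
df, used = read_csv_any(raw)
```

Because the GBK byte sequence is invalid UTF-8, the helper falls through to 'gbk' and decodes the header correctly.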
3. Data Cleaning: Removing Irrelevant and Redundant Columns
## Drop columns tied to personal identity
data.drop(['custid', 'trade_no', 'bank_card_no', 'id_name'], axis=1, inplace=True)
## Drop columns whose values are all identical
X = data.drop(labels='status', axis=1)
L = []
for col in X:
    if len(X[col].unique()) == 1:
        L.append(col)
for col in L:
    X.drop(col, axis=1, inplace=True)
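The constant-column loop above can also be written with `nunique()`. Note one subtlety: `Series.unique()` counts NaN as a value, while `nunique()` drops NaN by default, so `dropna=False` matches the loop's behaviour more closely. A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],        # informative column
    "b": [7, 7, 7],        # constant -> should be dropped
    "c": ["x", "x", "x"],  # constant -> should be dropped
})

# Columns with a single distinct value carry no information
const_cols = df.columns[df.nunique(dropna=False) == 1]
df = df.drop(columns=const_cols)
```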
4. Data Cleaning: Type Conversion
1. Splitting by data type
Split the features into three groups: numeric, non-numeric, and the label.
Pandas objects provide a select_dtypes() method for selecting features of a given dtype.
Parameters: include keeps the listed dtypes; exclude drops them.
X_num = X.select_dtypes(include='number').copy()
X_str = X.select_dtypes(exclude='number').copy()
y = data['status']
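A minimal self-contained illustration of how `select_dtypes` partitions columns by dtype (toy data, not the real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [1.5, 2.0],   # float  -> numeric
    "count": [3, 4],        # int    -> numeric
    "city": ["bj", "sh"],   # object -> non-numeric
})

num_part = df.select_dtypes(include="number")
str_part = df.select_dtypes(exclude="number")
```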
2. Missing-value handling
Missing values can be surfaced via missing counts or missing rates.
# Use the missing rate (shows each feature's share of missing values), sorted descending (ascending=False)
X_num_miss = (X_num.isnull().sum() / len(X_num)).sort_values(ascending=False)
print(X_num_miss.head())
print('----------' * 5)
X_str_miss = (X_str.isnull().sum() / len(X_str)).sort_values(ascending=False)
print(X_str_miss.head())
Output:
student_feature 0.630627
cross_consume_count_last_1_month 0.089609
latest_one_month_apply 0.063946
query_finance_count 0.063946
latest_six_month_apply 0.063946
dtype: float64
--------------------------------------------------
latest_query_time 0.063946
loans_latest_time 0.062474
reg_preference_for_trad 0.000421
dtype: float64
Analysis: the feature with the highest missing rate is student_feature at 63.0627%, above 50%; every other feature is below 10%.
- High-missing-rate feature: EM imputation or multiple imputation would be appropriate, but both are fairly involved, so here the missing values are treated as their own category and filled with 0.
- Other features: mean, median, mode, etc.
## student_feature: treat missing as its own category, fill with 0
X_num['student_feature'] = X_num['student_feature'].fillna(0)
## Impute the remaining features with the mode
X_num.fillna(X_num.mode().iloc[0, :], inplace=True)
X_str.fillna(X_str.mode().iloc[0, :], inplace=True)
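Since the task asks to try different fill strategies, here is a small self-contained comparison on a toy series with one outlier, showing how each strategy fills the same missing slot differently:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0])

filled = {
    "zero":   s.fillna(0),
    "mean":   s.fillna(s.mean()),        # sensitive to the outlier 10.0
    "median": s.fillna(s.median()),      # robust to the outlier
    "mode":   s.fillna(s.mode().iloc[0]),
}
```

The mean fill (3.75) is pulled up by the outlier, while median and mode both fill with 2.0; comparing downstream model metrics under each choice is the point of the exercise.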
3. Outlier handling
- Clip with the box-plot interquartile range (IQR)
## Outlier handling: clip to the box-plot interquartile range (IQR)
def iqr_outlier(x, thre=1.5):
    x_cl = x.copy()
    q25, q75 = x.quantile(q=[0.25, 0.75])
    iqr = q75 - q25
    top = q75 + thre * iqr
    bottom = q25 - thre * iqr
    x_cl[x_cl > top] = top
    x_cl[x_cl < bottom] = bottom
    return x_cl
X_num_cl = pd.DataFrame()
for col in X_num.columns:
    X_num_cl[col] = iqr_outlier(X_num[col])
X_num = X_num_cl
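The same clipping rule can be expressed more compactly with `Series.clip`; this equivalent sketch checks it on toy data with one extreme value:

```python
import pandas as pd

def iqr_clip(x, thre=1.5):
    # Same IQR fence rule, expressed via Series.clip
    q25, q75 = x.quantile([0.25, 0.75])
    iqr = q75 - q25
    return x.clip(lower=q25 - thre * iqr, upper=q75 + thre * iqr)

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
clipped = iqr_clip(s)
```

Here q25=2, q75=4, so the upper fence is 4 + 1.5*2 = 7 and the outlier 100 is clipped to 7.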
4. Encoding categorical features
- Ordinal encoding: for categories with an inherent order
- One-hot encoding: for unordered categories
X_str_oh = pd.get_dummies(X_str['reg_preference_for_trad'])
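A minimal check of what `get_dummies` produces (toy categories, not the real values of reg_preference_for_trad): each distinct category becomes its own indicator column.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c"], name="category")
oh = pd.get_dummies(s)  # one indicator column per distinct category
```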
5. Date features
X_date = pd.DataFrame()
latest_query = pd.to_datetime(X_str['latest_query_time'])
loans_latest = pd.to_datetime(X_str['loans_latest_time'])
X_date['latest_query_time_year'] = latest_query.dt.year
X_date['latest_query_time_month'] = latest_query.dt.month
X_date['latest_query_time_weekday'] = latest_query.dt.weekday
X_date['loans_latest_time_year'] = loans_latest.dt.year
X_date['loans_latest_time_month'] = loans_latest.dt.month
X_date['loans_latest_time_weekday'] = loans_latest.dt.weekday
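When date columns still contain malformed strings or missing values, `pd.to_datetime(..., errors="coerce")` converts unparseable entries to NaT instead of raising, and the `.dt` accessors then yield NaN. A small illustration:

```python
import pandas as pd

s = pd.Series(["2019-01-05", "not a date", None])
dt = pd.to_datetime(s, errors="coerce")  # unparseable -> NaT
years = dt.dt.year                        # NaT -> NaN in the extracted component
```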
6. Combining the features
X = pd.concat([X_num, X_str_oh, X_date], axis=1, sort=False)
print(X.shape)
5. Splitting the Dataset
## Preprocessing: standardization (note that fit() only learns the parameters; fit_transform() applies them)
# X_std = StandardScaler().fit_transform(X)
## Split the dataset
X_std_train, X_std_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2019)
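To standardize without leaking test-set statistics, the scaler should be fit on the training split only and then used to transform both splits. A self-contained sketch on synthetic data (the real X/y from above are swapped for random arrays here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(2019)
X_demo = rng.rand(100, 5)
y_demo = rng.randint(0, 2, 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2019)

scaler = StandardScaler().fit(X_tr)   # learn mean/std from training data only
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)
```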
6. Building the Models
## Model 1: Logistic Regression
lr = LogisticRegression()
lr.fit(X_std_train, y_train)
## Model 2: Decision Tree
dtc = DecisionTreeClassifier(max_depth=8)
dtc.fit(X_std_train, y_train)
# ## Model 3: SVM
# svm = SVC(kernel='linear', probability=True)
# svm.fit(X_std_train, y_train)
## Model 4: Random Forest
rfc = RandomForestClassifier()
rfc.fit(X_std_train, y_train)
## Model 5: XGBoost
xgbc = xgb.XGBClassifier()
xgbc.fit(X_std_train, y_train)
## Model 6: LightGBM
lgbc = lgb.LGBMClassifier()
lgbc.fit(X_std_train, y_train)
7. Model Evaluation
## Model evaluation
def model_metrics(clf, X_test, y_test):
    y_test_pred = clf.predict(X_test)
    y_test_prob = clf.predict_proba(X_test)[:, 1]
    accuracy = accuracy_score(y_test, y_test_pred)
    print('The accuracy: ', accuracy)
    precision = precision_score(y_test, y_test_pred)
    print('The precision: ', precision)
    recall = recall_score(y_test, y_test_pred)
    print('The recall: ', recall)
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    print('The F1 score: ', f1)
    # Store the AUC in a variable that does not shadow the imported roc_auc_score
    auc = roc_auc_score(y_test, y_test_prob)
    print('The AUC: ', auc)
    print('----------------------------------')
model_metrics(lr, X_std_test, y_test)
model_metrics(dtc, X_std_test, y_test)
model_metrics(rfc, X_std_test, y_test)
model_metrics(xgbc, X_std_test, y_test)
model_metrics(lgbc, X_std_test, y_test)
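AUC usefully complements accuracy when the labels are imbalanced, as with status here, because it is computed from ranked probabilities rather than a fixed threshold. A minimal self-contained check of roc_auc_score on hand-made values:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly -> AUC = 0.75
auc = roc_auc_score(y_true, y_prob)
```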