Import the packages needed for this exercise:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.impute import SimpleImputer
Preparing the Data
Downloading the Dataset
The dataset can be downloaded from https://pan.baidu.com/s/1wO9qJRjnrm8uhaSP67K0lw
Note: this is financial data (not raw data; it has already been processed). Our task is to predict whether a loan user will become overdue. The "status" column is the target label: 0 means not overdue, 1 means overdue.
Importing the Data
Load the raw data from the data.csv file and inspect it:
data_origin = pd.read_csv('data.csv', encoding='gbk')
data_origin.head()
Unnamed: 0 | custid | trade_no | bank_card_no | low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | ... | loans_max_limit | loans_avg_limit | consfin_credit_limit | consfin_credibility | consfin_org_count_current | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | 2791858 | 20180507115231274000000023057383 | 卡号1 | 0.01 | 0.99 | 0 | 0.90 | 0.55 | 0.313 | ... | 2900.0 | 1688.0 | 1200.0 | 75.0 | 1.0 | 2.0 | 1200.0 | 1200.0 | 12.0 | 18.0 |
1 | 10 | 534047 | 20180507121002192000000023073000 | 卡号1 | 0.02 | 0.94 | 2000 | 1.28 | 1.00 | 0.458 | ... | 3500.0 | 1758.0 | 15100.0 | 80.0 | 5.0 | 6.0 | 22800.0 | 9360.0 | 4.0 | 2.0 |
2 | 12 | 2849787 | 20180507125159718000000023114911 | 卡号1 | 0.04 | 0.96 | 0 | 1.00 | 1.00 | 0.114 | ... | 1600.0 | 1250.0 | 4200.0 | 87.0 | 1.0 | 1.0 | 4200.0 | 4200.0 | 2.0 | 6.0 |
3 | 13 | 1809708 | 20180507121358683000000388283484 | 卡号1 | 0.00 | 0.96 | 2000 | 0.13 | 0.57 | 0.777 | ... | 3200.0 | 1541.0 | 16300.0 | 80.0 | 5.0 | 5.0 | 30000.0 | 12180.0 | 2.0 | 4.0 |
4 | 14 | 2499829 | 20180507115448545000000388205844 | 卡号1 | 0.01 | 0.99 | 0 | 0.46 | 1.00 | 0.175 | ... | 2300.0 | 1630.0 | 8300.0 | 79.0 | 2.0 | 2.0 | 8400.0 | 8250.0 | 22.0 | 120.0 |
5 rows × 90 columns
Splitting the Data
Separate the raw data into features and the label:
label = data_origin.status
data = data_origin.drop(['status'], axis=1)
Data Preprocessing
Data Type Analysis
Inspect the dataset's feature information:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 89 columns):
Unnamed: 0 4754 non-null int64
custid 4754 non-null int64
trade_no 4754 non-null object
bank_card_no 4754 non-null object
low_volume_percent 4752 non-null float64
middle_volume_percent 4752 non-null float64
take_amount_in_later_12_month_highest 4754 non-null int64
trans_amount_increase_rate_lately 4751 non-null float64
trans_activity_month 4752 non-null float64
trans_activity_day 4752 non-null float64
transd_mcc 4752 non-null float64
trans_days_interval_filter 4746 non-null float64
trans_days_interval 4752 non-null float64
regional_mobility 4752 non-null float64
student_feature 1756 non-null float64
repayment_capability 4754 non-null int64
is_high_user 4754 non-null int64
number_of_trans_from_2011 4752 non-null float64
first_transaction_time 4752 non-null float64
historical_trans_amount 4754 non-null int64
historical_trans_day 4752 non-null float64
rank_trad_1_month 4752 non-null float64
trans_amount_3_month 4754 non-null int64
avg_consume_less_12_valid_month 4752 non-null float64
abs 4754 non-null int64
top_trans_count_last_1_month 4752 non-null float64
avg_price_last_12_month 4754 non-null int64
avg_price_top_last_12_valid_month 4650 non-null float64
reg_preference_for_trad 4752 non-null object
trans_top_time_last_1_month 4746 non-null float64
trans_top_time_last_6_month 4746 non-null float64
consume_top_time_last_1_month 4746 non-null float64
consume_top_time_last_6_month 4746 non-null float64
cross_consume_count_last_1_month 4328 non-null float64
trans_fail_top_count_enum_last_1_month 4738 non-null float64
trans_fail_top_count_enum_last_6_month 4738 non-null float64
trans_fail_top_count_enum_last_12_month 4738 non-null float64
consume_mini_time_last_1_month 4728 non-null float64
max_cumulative_consume_later_1_month 4754 non-null int64
max_consume_count_later_6_month 4746 non-null float64
railway_consume_count_last_12_month 4742 non-null float64
pawns_auctions_trusts_consume_last_1_month 4754 non-null int64
pawns_auctions_trusts_consume_last_6_month 4754 non-null int64
jewelry_consume_count_last_6_month 4742 non-null float64
source 4754 non-null object
first_transaction_day 4752 non-null float64
trans_day_last_12_month 4752 non-null float64
id_name 4478 non-null object
apply_score 4450 non-null float64
apply_credibility 4450 non-null float64
query_org_count 4450 non-null float64
query_finance_count 4450 non-null float64
query_cash_count 4450 non-null float64
query_sum_count 4450 non-null float64
latest_query_time 4450 non-null object
latest_one_month_apply 4450 non-null float64
latest_three_month_apply 4450 non-null float64
latest_six_month_apply 4450 non-null float64
loans_score 4457 non-null float64
loans_credibility_behavior 4457 non-null float64
loans_count 4457 non-null float64
loans_settle_count 4457 non-null float64
loans_overdue_count 4457 non-null float64
loans_org_count_behavior 4457 non-null float64
consfin_org_count_behavior 4457 non-null float64
loans_cash_count 4457 non-null float64
latest_one_month_loan 4457 non-null float64
latest_three_month_loan 4457 non-null float64
latest_six_month_loan 4457 non-null float64
history_suc_fee 4457 non-null float64
history_fail_fee 4457 non-null float64
latest_one_month_suc 4457 non-null float64
latest_one_month_fail 4457 non-null float64
loans_long_time 4457 non-null float64
loans_latest_time 4457 non-null object
loans_credit_limit 4457 non-null float64
loans_credibility_limit 4457 non-null float64
loans_org_count_current 4457 non-null float64
loans_product_count 4457 non-null float64
loans_max_limit 4457 non-null float64
loans_avg_limit 4457 non-null float64
consfin_credit_limit 4457 non-null float64
consfin_credibility 4457 non-null float64
consfin_org_count_current 4457 non-null float64
consfin_product_count 4457 non-null float64
consfin_max_limit 4457 non-null float64
consfin_avg_limit 4457 non-null float64
latest_query_day 4450 non-null float64
loans_latest_day 4457 non-null float64
dtypes: float64(70), int64(12), object(7)
memory usage: 3.2+ MB
We can see that the dataset has 82 numeric features (70 float64 and 12 int64) and 7 non-numeric (object) features. The student_feature column is mostly missing (only 1756 of 4754 values present), and several other features have around 300 missing values each.
Samples that are missing a large number of features are of little use for training, so we first keep only rows with at least 50 non-null values (i.e. drop samples missing 40 or more of the 89 features), and also drop duplicate rows:
data_del = data.dropna(thresh=50)
data_del = data_del.drop_duplicates()
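As a quick illustration of how the `thresh` argument behaves (a toy frame, not the real dataset): `dropna(thresh=k)` keeps only the rows that have at least k non-null values.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 is complete, row 1 is missing one value, row 2 is mostly missing
toy = pd.DataFrame({
    'a': [1.0, 2.0, np.nan],
    'b': [1.0, np.nan, np.nan],
    'c': [1.0, 3.0, 5.0],
})

kept = toy.dropna(thresh=2)  # keep rows with >= 2 non-null values
print(kept.index.tolist())   # row 2 has only 1 non-null value, so it is dropped
```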
Next, we split the data into numeric and non-numeric parts and process them separately:
object_column = ['trade_no', 'bank_card_no', 'reg_preference_for_trad', 'source',
'id_name', 'latest_query_time', 'loans_latest_time']
data_obj = data_del[object_column]
data_num = data_del.drop(object_column, axis=1)
Dropping Irrelevant Features
First, inspect the non-numeric features:
data_obj.describe()
trade_no | bank_card_no | reg_preference_for_trad | source | id_name | latest_query_time | loans_latest_time | |
---|---|---|---|---|---|---|---|
count | 4476 | 4476 | 4474 | 4476 | 4476 | 4450 | 4457 |
unique | 4476 | 1 | 5 | 1 | 4307 | 207 | 232 |
top | 20180507123817727000000023097250 | 卡号1 | 一线城市 | xs | 李杰 | 2018-04-14 | 2018-05-03 |
freq | 1 | 4476 | 3196 | 4476 | 5 | 423 | 134 |
trade_no is a unique transaction number for each record and id_name is the user's name, so I consider both uninformative here. The bank_card_no and source features each have only 1 unique value, i.e. every sample shares the same value, which cannot help training. We therefore drop these four non-numeric features: bank_card_no, source, trade_no, and id_name.
data_obj = data_obj.drop(['bank_card_no', 'source', 'trade_no', 'id_name'], axis=1)
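Constant columns like bank_card_no and source can also be detected programmatically instead of by eyeballing describe(). A minimal sketch on a toy frame (column names and values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'bank_card_no': ['卡号1'] * 4,   # constant -> carries no information
    'source': ['xs'] * 4,            # constant -> carries no information
    'city': ['一线城市', '二线城市', '一线城市', '境外'],
})

# A column whose nunique() is 1 takes the same value for every sample
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)  # ['bank_card_no', 'source']
```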
For the numeric features, since the dataset does not document what each feature means, for now we only manually drop three features: custid, Unnamed: 0, and student_feature (which is mostly missing). The remaining features will be filtered further in later processing steps.
data_num = data_num.drop(['custid', 'student_feature', 'Unnamed: 0'], axis=1)
Handling Missing Values
Missing values can broadly be handled in two ways: deletion or filling. Deletion can target either rows (samples) or columns (features). Since we have already removed the samples missing many features as well as some useless features, the remaining features contain few missing values, so we will not use deletion here.
There are many ways to fill missing values, and the right choice depends on the feature. Common strategies include mean fill, mode fill, median fill, and forward fill.
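On a toy Series (illustrative values only), the common fill strategies mentioned above look like this:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 2.0])

mean_filled   = s.fillna(s.mean())     # mean fill: (1 + 2 + 4 + 2) / 4 = 2.25
median_filled = s.fillna(s.median())   # median fill: median of [1, 2, 2, 4] = 2.0
mode_filled   = s.fillna(s.mode()[0])  # mode fill: most frequent value = 2.0
ffilled       = s.ffill()              # forward fill: copy the previous value down

print(mean_filled[2])  # 2.25
print(ffilled[4])      # 4.0 (copied from the row above)
```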
For the numeric data, fill with the mean:
imputer = SimpleImputer(strategy='mean')
num = imputer.fit_transform(data_num)
# fit_transform returns a NumPy array; keep the original row index
data_num = pd.DataFrame(num, columns=data_num.columns, index=data_num.index)
For the non-numeric data, use forward fill:
data_obj = data_obj.ffill()
Converting Non-Numeric Features
The reg_preference_for_trad feature needs to be converted to numeric form. There are two ways to do this. The first is to map each category directly to an integer, but the problem with this is that downstream algorithms may treat samples with adjacent encoded values as more similar, even when the categories are actually unrelated. So we use one-hot encoding instead.
Note: one-hot encoding, also called 1-of-N encoding, uses an N-bit state register to encode N states; each state gets its own register bit, and at any time exactly one bit is set.
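To see the difference between the two encodings, here is a toy example with made-up city labels; `pd.get_dummies` is used here only as a compact way to show the one-hot result (the actual processing below uses LabelBinarizer):

```python
import pandas as pd

cities = pd.Series(['一线城市', '二线城市', '一线城市', '境外'])

# Option 1: integer mapping -- implies a spurious ordering between categories
mapped = cities.astype('category').cat.codes

# Option 2: one-hot -- one binary column per category, no implied order
onehot = pd.get_dummies(cities)
print(onehot.shape)  # (4, 3): 4 samples, 3 distinct categories
```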
encoder = LabelBinarizer()
# LabelBinarizer expects a 1-D array of labels, so pass the column as a Series
reg_preference_1hot = encoder.fit_transform(data_obj['reg_preference_for_trad'])
# keep data_obj's row index so the concat below aligns correctly
reg_preference_df = pd.DataFrame(reg_preference_1hot, columns=encoder.classes_,
                                 index=data_obj.index)
data_obj = data_obj.drop(['reg_preference_for_trad'], axis=1)
data_obj = pd.concat([data_obj, reg_preference_df], axis=1)
For the two remaining date features, latest_query_time and loans_latest_time, split each into a month feature and a weekday feature:
data_obj['latest_query_time'] = pd.to_datetime(data_obj['latest_query_time'])
data_obj['latest_query_time_month'] = data_obj['latest_query_time'].dt.month
data_obj['latest_query_time_weekday'] = data_obj['latest_query_time'].dt.weekday
data_obj['loans_latest_time'] = pd.to_datetime(data_obj['loans_latest_time'])
data_obj['loans_latest_time_month'] = data_obj['loans_latest_time'].dt.month
data_obj['loans_latest_time_weekday'] = data_obj['loans_latest_time'].dt.weekday
data_obj = data_obj.drop(['latest_query_time', 'loans_latest_time'], axis=1)
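The `.dt` accessor used above exposes calendar components of a datetime column. A minimal sketch with made-up dates (weekday is 0 for Monday through 6 for Sunday):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2018-04-14', '2018-05-03', None]))

months   = dates.dt.month    # 1-12; missing dates (NaT) stay missing
weekdays = dates.dt.weekday  # Monday=0 ... Sunday=6

print(weekdays[0])  # 5 -- 2018-04-14 was a Saturday
```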
Inspect the result:
data_obj.head()
一线城市 | 三线城市 | 二线城市 | 其他城市 | 境外 | latest_query_time_month | latest_query_time_weekday | loans_latest_time_month | loans_latest_time_weekday | |
---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 2.0 | 4.0 | 3.0 |
1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 3.0 | 5.0 | 5.0 |
2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 5.0 | 5.0 | 1.0 |
3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 5.0 | 5.0 | 5.0 | 3.0 |
4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 6.0 | 1.0 | 6.0 |
Merging
Merge the separately processed numeric and non-numeric data into the final processed dataset.
data_processed = pd.concat([data_num, data_obj], axis=1)
data_processed.head()
low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | trans_days_interval | regional_mobility | ... | loans_latest_day | 一线城市 | 三线城市 | 二线城市 | 其他城市 | 境外 | latest_query_time_month | latest_query_time_weekday | loans_latest_time_month | loans_latest_time_weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.01 | 0.99 | 0.0 | 0.90 | 0.55 | 0.313 | 17.0 | 27.0 | 26.0 | 3.0 | ... | 18.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 2.0 | 4.0 | 3.0 |
1 | 0.02 | 0.94 | 2000.0 | 1.28 | 1.00 | 0.458 | 19.0 | 30.0 | 14.0 | 4.0 | ... | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 3.0 | 5.0 | 5.0 |
2 | 0.04 | 0.96 | 0.0 | 1.00 | 1.00 | 0.114 | 13.0 | 68.0 | 22.0 | 1.0 | ... | 6.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 5.0 | 5.0 | 1.0 |
3 | 0.00 | 0.96 | 2000.0 | 0.13 | 0.57 | 0.777 | 22.0 | 14.0 | 6.0 | 3.0 | ... | 4.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 5.0 | 5.0 | 5.0 | 3.0 |
4 | 0.01 | 0.99 | 0.0 | 0.46 | 1.00 | 0.175 | 13.0 | 66.0 | 42.0 | 1.0 | ... | 120.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 6.0 | 1.0 | 6.0 |
5 rows × 88 columns
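One subtlety worth noting about the merge above: pd.concat(axis=1) aligns rows by index label, not by position. If one frame has had its index reset (which happens when a DataFrame is rebuilt from the imputer's NumPy output without passing index=) while the other kept the original labels, the result silently grows extra rows full of NaN. A toy demonstration:

```python
import pandas as pd

left  = pd.DataFrame({'x': [1, 2]}, index=[0, 2])   # e.g. row 1 was dropped earlier
right = pd.DataFrame({'y': [10, 20]}, index=[0, 1]) # e.g. index was reset to 0..n-1

misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)  # (3, 2): the union of indexes {0, 1, 2}, padded with NaN

# Fix: give both frames the same index before concatenating
aligned = pd.concat([left.reset_index(drop=True),
                     right.reset_index(drop=True)], axis=1)
print(aligned.shape)  # (2, 2)
```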
Saving the Processed Data
data_saved = pd.concat([data_processed, label.loc[data_processed.index]], axis=1)
data_saved.to_csv('data_processed.csv', index=False)