[Weekly Algorithm Practice, Advanced] Task 1: Data Preprocessing

Import the packages needed for this exercise:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing

Preparing the data

Dataset download

The practice data can be downloaded from https://pan.baidu.com/s/1wO9qJRjnrm8uhaSP67K0lw

Note: this is a financial dataset (not raw data; it has already been processed). Our task is to predict whether a loan user will become overdue. The "status" column in the table is the label: 0 means not overdue, 1 means overdue.

Importing the data

Load the raw data from the data.csv file and inspect it:

data_origin = pd.read_csv('data.csv', encoding='gbk')
data_origin.head()
Unnamed: 0 custid trade_no bank_card_no low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day ... loans_max_limit loans_avg_limit consfin_credit_limit consfin_credibility consfin_org_count_current consfin_product_count consfin_max_limit consfin_avg_limit latest_query_day loans_latest_day
0 5 2791858 20180507115231274000000023057383 卡号1 0.01 0.99 0 0.90 0.55 0.313 ... 2900.0 1688.0 1200.0 75.0 1.0 2.0 1200.0 1200.0 12.0 18.0
1 10 534047 20180507121002192000000023073000 卡号1 0.02 0.94 2000 1.28 1.00 0.458 ... 3500.0 1758.0 15100.0 80.0 5.0 6.0 22800.0 9360.0 4.0 2.0
2 12 2849787 20180507125159718000000023114911 卡号1 0.04 0.96 0 1.00 1.00 0.114 ... 1600.0 1250.0 4200.0 87.0 1.0 1.0 4200.0 4200.0 2.0 6.0
3 13 1809708 20180507121358683000000388283484 卡号1 0.00 0.96 2000 0.13 0.57 0.777 ... 3200.0 1541.0 16300.0 80.0 5.0 5.0 30000.0 12180.0 2.0 4.0
4 14 2499829 20180507115448545000000388205844 卡号1 0.01 0.99 0 0.46 1.00 0.175 ... 2300.0 1630.0 8300.0 79.0 2.0 2.0 8400.0 8250.0 22.0 120.0

5 rows × 90 columns

Splitting the data

Split the raw data into a feature set and a label:

label = data_origin.status
data = data_origin.drop(['status'], axis=1)
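At this point it is worth a quick glance at the class balance of the label (a hypothetical extra check, not part of the original pipeline; the toy Series below stands in for the real status column, which has 4754 rows):

```python
import pandas as pd

# Toy stand-in for the status label column.
label = pd.Series([0, 0, 1, 0, 1, 0, 0, 0], name='status')

# value_counts shows how many samples fall in each class;
# normalize=True reports proportions instead of raw counts.
counts = label.value_counts()
ratios = label.value_counts(normalize=True)
print(counts[0], counts[1])   # 6 2
print(ratios[1])              # 0.25
```

A strongly imbalanced label would suggest using stratified splits or class weights later on.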

Data preprocessing

Data type analysis

Inspect the dataset's feature information:

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 89 columns):
Unnamed: 0                                    4754 non-null int64
custid                                        4754 non-null int64
trade_no                                      4754 non-null object
bank_card_no                                  4754 non-null object
low_volume_percent                            4752 non-null float64
middle_volume_percent                         4752 non-null float64
take_amount_in_later_12_month_highest         4754 non-null int64
trans_amount_increase_rate_lately             4751 non-null float64
trans_activity_month                          4752 non-null float64
trans_activity_day                            4752 non-null float64
transd_mcc                                    4752 non-null float64
trans_days_interval_filter                    4746 non-null float64
trans_days_interval                           4752 non-null float64
regional_mobility                             4752 non-null float64
student_feature                               1756 non-null float64
repayment_capability                          4754 non-null int64
is_high_user                                  4754 non-null int64
number_of_trans_from_2011                     4752 non-null float64
first_transaction_time                        4752 non-null float64
historical_trans_amount                       4754 non-null int64
historical_trans_day                          4752 non-null float64
rank_trad_1_month                             4752 non-null float64
trans_amount_3_month                          4754 non-null int64
avg_consume_less_12_valid_month               4752 non-null float64
abs                                           4754 non-null int64
top_trans_count_last_1_month                  4752 non-null float64
avg_price_last_12_month                       4754 non-null int64
avg_price_top_last_12_valid_month             4650 non-null float64
reg_preference_for_trad                       4752 non-null object
trans_top_time_last_1_month                   4746 non-null float64
trans_top_time_last_6_month                   4746 non-null float64
consume_top_time_last_1_month                 4746 non-null float64
consume_top_time_last_6_month                 4746 non-null float64
cross_consume_count_last_1_month              4328 non-null float64
trans_fail_top_count_enum_last_1_month        4738 non-null float64
trans_fail_top_count_enum_last_6_month        4738 non-null float64
trans_fail_top_count_enum_last_12_month       4738 non-null float64
consume_mini_time_last_1_month                4728 non-null float64
max_cumulative_consume_later_1_month          4754 non-null int64
max_consume_count_later_6_month               4746 non-null float64
railway_consume_count_last_12_month           4742 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null int64
pawns_auctions_trusts_consume_last_6_month    4754 non-null int64
jewelry_consume_count_last_6_month            4742 non-null float64
source                                        4754 non-null object
first_transaction_day                         4752 non-null float64
trans_day_last_12_month                       4752 non-null float64
id_name                                       4478 non-null object
apply_score                                   4450 non-null float64
apply_credibility                             4450 non-null float64
query_org_count                               4450 non-null float64
query_finance_count                           4450 non-null float64
query_cash_count                              4450 non-null float64
query_sum_count                               4450 non-null float64
latest_query_time                             4450 non-null object
latest_one_month_apply                        4450 non-null float64
latest_three_month_apply                      4450 non-null float64
latest_six_month_apply                        4450 non-null float64
loans_score                                   4457 non-null float64
loans_credibility_behavior                    4457 non-null float64
loans_count                                   4457 non-null float64
loans_settle_count                            4457 non-null float64
loans_overdue_count                           4457 non-null float64
loans_org_count_behavior                      4457 non-null float64
consfin_org_count_behavior                    4457 non-null float64
loans_cash_count                              4457 non-null float64
latest_one_month_loan                         4457 non-null float64
latest_three_month_loan                       4457 non-null float64
latest_six_month_loan                         4457 non-null float64
history_suc_fee                               4457 non-null float64
history_fail_fee                              4457 non-null float64
latest_one_month_suc                          4457 non-null float64
latest_one_month_fail                         4457 non-null float64
loans_long_time                               4457 non-null float64
loans_latest_time                             4457 non-null object
loans_credit_limit                            4457 non-null float64
loans_credibility_limit                       4457 non-null float64
loans_org_count_current                       4457 non-null float64
loans_product_count                           4457 non-null float64
loans_max_limit                               4457 non-null float64
loans_avg_limit                               4457 non-null float64
consfin_credit_limit                          4457 non-null float64
consfin_credibility                           4457 non-null float64
consfin_org_count_current                     4457 non-null float64
consfin_product_count                         4457 non-null float64
consfin_max_limit                             4457 non-null float64
consfin_avg_limit                             4457 non-null float64
latest_query_day                              4450 non-null float64
loans_latest_day                              4457 non-null float64
dtypes: float64(70), int64(12), object(7)
memory usage: 3.2+ MB

As we can see, the dataset has 82 numeric features (70 float64 and 12 int64) and 7 non-numeric (object) features. The student_feature column is mostly missing, and several other features each have roughly 300 missing values.
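The missing counts summarized above can also be listed directly per column with isnull().sum(); a sketch on a toy frame, since the real CSV is not bundled here:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
    'c': [1, 2, 3, 4],
})

# Count missing values per column and keep only columns that have any,
# sorted so the worst offenders come first.
missing = df.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))
```

Running this on the real data would surface student_feature at the top of the list.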

Samples that are missing a large number of features are of little use for training, so we first drop samples with fewer than 50 non-null values (i.e., missing 40 or more of the 89 features), as well as duplicate samples:

data_del = data.dropna(thresh=50).drop_duplicates()
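Note that thresh gives the minimum number of non-null values a row must have to survive, not the number of missing values allowed; a toy example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 4)))
df.iloc[0, :3] = np.nan   # row 0 keeps only 1 non-null value
df.iloc[1, :1] = np.nan   # row 1 keeps 3 non-null values

# Keep rows with at least 3 non-null values: only row 0 is dropped.
kept = df.dropna(thresh=3)
print(len(kept))   # 2
```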

Next, we separate the data into numeric and non-numeric parts so they can be processed independently:

object_column = ['trade_no', 'bank_card_no', 'reg_preference_for_trad', 'source',
                 'id_name', 'latest_query_time', 'loans_latest_time']
data_obj = data_del[object_column]
data_num = data_del.drop(object_column, axis=1)

Dropping irrelevant features

First, look at the non-numeric features:

data_obj.describe()
trade_no bank_card_no reg_preference_for_trad source id_name latest_query_time loans_latest_time
count 4476 4476 4474 4476 4476 4450 4457
unique 4476 1 5 1 4307 207 232
top 20180507123817727000000023097250 卡号1 一线城市 xs 李杰 2018-04-14 2018-05-03
freq 1 4476 3196 4476 5 423 134

trade_no is the unique number of each transaction and id_name is the user's name, so I consider both useless features. The bank_card_no and source features each have a unique count of 1, meaning every sample shares the same value, which cannot help training. In summary, we drop the four non-numeric features bank_card_no, source, trade_no, and id_name.

data_obj = data_obj.drop(['bank_card_no', 'source', 'trade_no', 'id_name'], axis=1)

As for the numeric features, since the dataset does not document what each feature means, for now we only manually drop three of them: custid, Unnamed: 0, and the mostly-missing student_feature. The remaining features will be filtered further in later processing steps.

data_num = data_num.drop(['custid', 'student_feature', 'Unnamed: 0'], axis=1)

Handling missing values

Missing values can be handled in broadly two ways: deletion or imputation. Deletion can remove either rows (samples) or columns (features); since we have already dropped the samples with many missing features and the useless features, the remaining columns contain few missing values, so we will not delete anything further.

There are many imputation strategies, and the right choice depends on the feature. Common ones include mean fill, mode fill, median fill, forward fill, and so on.
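On a toy Series, these common strategies look roughly like:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

mean_filled   = s.fillna(s.mean())      # mean fill
median_filled = s.fillna(s.median())    # median fill
mode_filled   = s.fillna(s.mode()[0])   # mode fill (first mode if tied)
ffilled       = s.ffill()               # forward fill with the previous value

print(mean_filled[1])   # 3.0 (mean of 1, 3, 5)
print(ffilled[1])       # 1.0
```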

For numeric data, fill with the mean:

imputer = SimpleImputer(strategy='mean')
num = imputer.fit_transform(data_num)
# Keep the original row index so the final axis-1 concatenation stays aligned.
data_num = pd.DataFrame(num, columns=data_num.columns, index=data_num.index)

For non-numeric data, use forward fill:

data_obj = data_obj.ffill()

Converting non-numeric data to numeric types

The reg_preference_for_trad feature must be converted into a numeric feature. There are two ways to do this. One is to map each category directly to an integer, but the problem with that is downstream algorithms may treat samples whose encoded values are numerically close as more similar (when in fact the categories may be unrelated), so we use one-hot encoding here instead.

Aside: one-hot encoding, also known as one-of-N encoding, uses an N-bit state register to encode N states. Each state gets its own register bit, and at any given time exactly one bit is active.

encoder = LabelBinarizer()
# LabelBinarizer expects a 1-D array, so pass the column as a Series.
reg_preference_1hot = encoder.fit_transform(data_obj['reg_preference_for_trad'])
data_obj = data_obj.drop(['reg_preference_for_trad'], axis=1)
# Reuse data_obj's index so pd.concat aligns rows instead of introducing NaNs.
reg_preference_df = pd.DataFrame(reg_preference_1hot, columns=encoder.classes_,
                                 index=data_obj.index)
data_obj = pd.concat([data_obj, reg_preference_df], axis=1)
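As an aside, pd.get_dummies produces the same one-hot expansion and preserves the original row index automatically, which keeps later axis-1 concatenation aligned; a sketch on toy data (the column name here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'city': ['tier1', 'tier2', 'tier1']}, index=[10, 11, 12])

# One column per category; the row index is carried over from df.
dummies = pd.get_dummies(df['city'], prefix='city')
print(list(dummies.columns))   # ['city_tier1', 'city_tier2']
print(dummies.index.tolist())  # [10, 11, 12]
```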

For the two remaining date features, latest_query_time and loans_latest_time, split each into a month feature and a weekday feature:

data_obj['latest_query_time'] = pd.to_datetime(data_obj['latest_query_time'])
data_obj['latest_query_time_month'] = data_obj['latest_query_time'].dt.month
data_obj['latest_query_time_weekday'] = data_obj['latest_query_time'].dt.weekday

data_obj['loans_latest_time'] = pd.to_datetime(data_obj['loans_latest_time'])
data_obj['loans_latest_time_month'] = data_obj['loans_latest_time'].dt.month
data_obj['loans_latest_time_weekday'] = data_obj['loans_latest_time'].dt.weekday

data_obj = data_obj.drop(['latest_query_time', 'loans_latest_time'], axis=1)

Check the processed result:

data_obj.head()
一线城市 三线城市 二线城市 其他城市 境外 latest_query_time_month latest_query_time_weekday loans_latest_time_month loans_latest_time_weekday
0 1.0 0.0 0.0 0.0 0.0 4.0 2.0 4.0 3.0
1 1.0 0.0 0.0 0.0 0.0 5.0 3.0 5.0 5.0
2 1.0 0.0 0.0 0.0 0.0 5.0 5.0 5.0 1.0
3 0.0 1.0 0.0 0.0 0.0 5.0 5.0 5.0 3.0
4 1.0 0.0 0.0 0.0 0.0 4.0 6.0 1.0 6.0

Merging

Concatenate the separately processed numeric and non-numeric data to form the processed dataset.

data_processed = pd.concat([data_num, data_obj], axis=1)
data_processed.head()
low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day transd_mcc trans_days_interval_filter trans_days_interval regional_mobility ... loans_latest_day 一线城市 三线城市 二线城市 其他城市 境外 latest_query_time_month latest_query_time_weekday loans_latest_time_month loans_latest_time_weekday
0 0.01 0.99 0.0 0.90 0.55 0.313 17.0 27.0 26.0 3.0 ... 18.0 1.0 0.0 0.0 0.0 0.0 4.0 2.0 4.0 3.0
1 0.02 0.94 2000.0 1.28 1.00 0.458 19.0 30.0 14.0 4.0 ... 2.0 1.0 0.0 0.0 0.0 0.0 5.0 3.0 5.0 5.0
2 0.04 0.96 0.0 1.00 1.00 0.114 13.0 68.0 22.0 1.0 ... 6.0 1.0 0.0 0.0 0.0 0.0 5.0 5.0 5.0 1.0
3 0.00 0.96 2000.0 0.13 0.57 0.777 22.0 14.0 6.0 3.0 ... 4.0 0.0 1.0 0.0 0.0 0.0 5.0 5.0 5.0 3.0
4 0.01 0.99 0.0 0.46 1.00 0.175 13.0 66.0 42.0 1.0 ... 120.0 1.0 0.0 0.0 0.0 0.0 4.0 6.0 1.0 6.0

5 rows × 88 columns

Saving the processed data

data_saved = pd.concat([data_processed, label], axis=1)
data_saved.to_csv('data_processed.csv', index=False)


Reposted from blog.csdn.net/bear507/article/details/86649069