Python Course Project: Loan Approval Prediction

1. Application Research

Lending is a bank's most basic and most important asset business: it is the main source of a bank's profit, and also a fairly risky asset. The risk lies in the fact that if a borrower is unable to repay, the bank ends up with bad debt and takes a loss. Banks therefore routinely investigate whether or not to issue a loan. This course project applies the numpy and pandas material from the Python course to clean a dataset collected from the web, and then runs logistic regression on the cleaned data to predict whether a loan should be issued.

2. Code Analysis

2.1 Data Preprocessing

import pandas as pd

# Read the raw LendingClub data; the first row is a descriptive header, so skip it.
loans_2007 = pd.read_csv('./LoanStats3a.csv', skiprows=1, low_memory=False)

# Drop every column with fewer than half_count non-NA values,
# i.e. columns where more than half of the entries are missing.
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)

# Drop the free-text / link columns, which carry no predictive signal.
loans_2007 = loans_2007.drop(['desc', 'url'], axis=1)
loans_2007.to_csv('./loans_2007.csv', index=False)
This is an initial cleaning pass: any column in which more than half of the values are missing is dropped, along with the irrelevant desc and url columns, and the result is saved as loans_2007.csv.
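The thresh argument is easy to misread, so here is a tiny illustration (not part of the project code, just a toy DataFrame): with axis=1, dropna keeps only the columns that have at least thresh non-NA values.

import numpy as np
import pandas as pd

# Toy example: the mostly-NaN column is dropped, the full column survives.
demo = pd.DataFrame({
    "mostly_full": [1, 2, 3, 4],
    "mostly_nan": [np.nan, np.nan, np.nan, 4],
})
print(demo.dropna(thresh=2, axis=1).columns.tolist())  # ['mostly_full']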
import pandas as pd
loans_2007 = pd.read_csv("loans_2007.csv")
#loans_2007.drop_duplicates()
print(loans_2007.iloc[0])
print(loans_2007.shape[1])

Print and inspect the data to see what other irrelevant fields remain.

id                                1077501
member_id                      1.2966e+06
loan_amnt                            5000
funded_amnt                          5000
funded_amnt_inv                      4975
term                            36 months
int_rate                           10.65%
installment                        162.87
grade                                   B
sub_grade                              B2
emp_title                             NaN
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
issue_d                          Dec-2011
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
zip_code                            860xx
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                 Jan-1985
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                          83.7%
total_acc                               9
initial_list_status                     f
out_prncp                               0
out_prncp_inv                           0
total_pymnt                       5863.16
total_pymnt_inv                   5833.84
total_rec_prncp                      5000
total_rec_int                      863.16
total_rec_late_fee                      0
recoveries                              0
collection_recovery_fee                 0
last_pymnt_d                     Jan-2015
last_pymnt_amnt                    171.62
last_credit_pull_d               Nov-2016
collections_12_mths_ex_med              0
policy_code                             1
application_type               INDIVIDUAL
acc_now_delinq                          0
chargeoff_within_12_mths                0
delinq_amnt                             0
pub_rec_bankruptcies                    0
tax_liens                               0
Name: 0, dtype: object
52

The printout above shows that there are too many fields: training directly on all 52 features could easily lead to overfitting, so we need to select features further.

loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", /"grade", "sub_grade", "emp_title", "issue_d"], axis=1)
loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", /"total_pymnt_inv", "total_rec_prncp"], axis=1)

Here we drop identity fields such as id and member_id, along with a few opaque codes; those numbers obviously tell the model nothing. Note also that fields like out_prncp and total_pymnt describe what happened after the loan was issued, so keeping them would leak the outcome into the features.

print(loans_2007.iloc[0])
print(loans_2007.shape[1])

Print again to check. The result:

loan_amnt                            5000
term                            36 months
int_rate                           10.65%
installment                        162.87
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                 Jan-1985
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                          83.7%
total_acc                               9
initial_list_status                     f
last_credit_pull_d               Nov-2016
collections_12_mths_ex_med              0
policy_code                             1
application_type               INDIVIDUAL
acc_now_delinq                          0
chargeoff_within_12_mths                0
delinq_amnt                             0
pub_rec_bankruptcies                    0
tax_liens                               0
Name: 0, dtype: object
32

The feature count is now down to 32, and there is little room to cut further. At this point we need to decide on the training target. Clearly the loan outcome is our target, so let's print the loan status and take a look.

print(loans_2007['loan_status'].value_counts())  # distribution of loan statuses

The result:

Fully Paid                                             33902
Charged Off                                             5658
Does not meet the credit policy. Status:Fully Paid      1988
Does not meet the credit policy. Status:Charged Off      761
Current                                                  201
Late (31-120 days)                                        10
In Grace Period                                            9
Late (16-30 days)                                          5
Default                                                    1
Name: loan_status, dtype: int64

Besides the two outcomes we care about, loan_status contains several other states: loans that are still current, payments 31-120 days late, and so on. To keep the computation simple we reduce this to a binary classification problem, repaid in full or not, by keeping only the two dominant statuses and mapping Fully Paid to 1 and Charged Off to 0:

loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]

status_replace = {
    "loan_status" : {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}

loans_2007 = loans_2007.replace(status_replace)  # map statuses to binary labels

The data also contains features that are useless for modelling: a column in which every row holds the same value carries no information for telling borrowers apart, so we drop those columns.

orig_columns = loans_2007.columns
drop_columns = []
for col in orig_columns:
    col_series = loans_2007[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(drop_columns)
print (loans_2007.shape)
loans_2007.to_csv('filtered_loans_2007.csv', index=False)

Next we deal with the features that still contain many NaN values. The options are to fill them in or to delete them; since there are plenty of features left, I take the simple route and delete.

import pandas as pd
loans = pd.read_csv('filtered_loans_2007.csv')
null_counts = loans.isnull().sum()

print(null_counts)

The result:

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1073
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
pymnt_plan                 0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64

We can see that emp_length and pub_rec_bankruptcies contain a lot of NaN values, but we only drop pub_rec_bankruptcies. emp_length records the borrower's employment length; a missing or "n/a" entry does not make the column useless, and it remains a basis for our model, so instead of deleting it we map those values to 0 when we numericalize.

loans = loans.drop("pub_rec_bankruptcies", axis=1)
#loans=loans.drop("emp_length",axis=1)
loans = loans.dropna(axis=0)
print(loans.dtypes.value_counts())

Next we convert the string columns to numeric ones. Two things need doing: (1) turn the plain categorical strings into one-hot encodings; (2) for the percentage strings, strip the trailing % sign and then cast the result to float.

object_columns_df = loans.select_dtypes(include=["object"])  # the string (object) columns
print(object_columns_df.iloc[0])
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    print(loans[c].value_counts())
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}  # map the employment-length strings to numbers via this dict
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")# 去掉百分号,转换成float类型
loans = loans.replace(mapping_dict)
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)
loans = loans.drop("pymnt_plan", axis=1)

Printing the purpose and title columns (counts shown below) reveals that title is just a free-text, finer-grained restatement of purpose, the stated reason for the loan. One-hot encoding either of the two is enough, which is why title was dropped in the code above.
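The listing below was presumably produced, before title was dropped, by calling value_counts on the two columns:

print(loans["purpose"].value_counts())
print(loans["title"].value_counts())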

debt_consolidation    18137
credit_card            4970
other                  3803
home_improvement       2869
major_purchase         2108
small_business         1771
car                    1492
wedding                 932
medical                 667
moving                  557
house                   365
vacation                350
educational             312
renewable_energy         95
Name: purpose, dtype: int64
Debt Consolidation                            2128
Debt Consolidation Loan                       1671
Personal Loan                                  640
Consolidation                                  503
debt consolidation                             483
Credit Card Consolidation                      348
Home Improvement                               344
Debt consolidation                             323
Small Business Loan                            310
Credit Card Loan                               302
Personal                                       296
Consolidation Loan                             254
Home Improvement Loan                          237
personal loan                                  224
personal                                       207
Wedding Loan                                   207
Loan                                           206
consolidation                                  193
Car Loan                                       193
Other Loan                                     177
Wedding                                        151
Credit Card Payoff                             149
Credit Card Refinance                          140
Major Purchase Loan                            136
Consolidate                                    126
Medical                                        114
Credit Card                                    112
home improvement                               105
Credit Cards                                    93
My Loan                                         92
                                              ... 
'71 Bobbed Duece                                 1
Consolidation + Home Improvement                 1
Home Improvement/Consolidation                   1
Kill My Debt                                     1
Finishing my debt                                1
Q-Disc Loan                                      1
CC Ref                                           1
blues in C minor                                 1
J.L.                                             1
Steve's Consolidation Loan                       1
Paying Off High Interest CC Debt                 1
WiseMove4Grandson                                1
4x4 truck                                        1
work                                             1
Fixing car                                       1
mybills                                          1
Pay the piper                                    1
Loan75                                           1
ant3300                                          1
Rolling the high interest into 1 loan            1
Marriage                                         1
Pay On Time                                      1
Consolidation Loan for ED                        1
louisiana purchase                               1
Loan2100 From LendingClub                        1
High Credit Score, Never Missed a Payment!       1
Start On The Right Path                          1
Eye On the Prize                                 1
buissness                                        1
Need capital to fund unique web site             1
Name: title, Length: 19094, dtype: int64

With the data cleaned, save it to csv:

loans.to_csv('cleaned_loans2007.csv', index=False)

2.2 Model Building

Evaluation here is about more than raw accuracy; recall matters too. We break the predictions into the four confusion-matrix cells and track the true positive rate, TPR = TP / (TP + FN), and the false positive rate, FPR = FP / (FP + TN): of the loans that were actually repaid, how many did we approve, and of the loans that were charged off, how many did we wrongly approve.

Expressed in code:

import pandas as pd

# `predictions` is assumed to be a pandas Series of 0/1 model outputs,
# index-aligned with loans["loan_status"].

# False positives: predicted to repay (1), actually charged off (0).
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

Here we use the sklearn framework. When regression models are applied to real problems, the variables under study are often not all interval variables; some are ordinal or categorical, as in binomial problems. For example, suppose we judge from age, sex, body mass index, mean blood pressure, disease index and similar indicators whether a person has diabetes, with Y = 0 meaning healthy and Y = 1 meaning diabetic. The response is a two-point (0-1) variable, so a continuous hypothesis function cannot be used directly to predict Y, which only takes the values 0 and 1.
In short, linear regression handles problems where the dependent variable is continuous; once the dependent variable is categorical, linear regression no longer applies and a logistic regression model is needed. Logistic Regression handles regression problems whose dependent variable is categorical, most commonly binary (binomial), though it extends to multi-class problems as well; in practice it is a classification method. In the binary case the relationship between the probability and the inputs follows an S-shaped (sigmoid) curve, p(y = 1 | x) = 1 / (1 + e^(-(wᵀx + b))).
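The model code below needs a feature matrix features and a target vector target. That split is not shown in this write-up, so the following minimal setup is an assumption:

import pandas as pd

# Assumed setup: load the cleaned data and split off the binary target.
loans = pd.read_csv("cleaned_loans2007.csv")
features = loans.drop("loan_status", axis=1)  # all predictor columns
target = loans["loan_status"]                 # 1 = Fully Paid, 0 = Charged Off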

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold

lr = LogisticRegression()
kf = KFold(n_splits=3)  # 3-fold cross-validation
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))

print(tpr)
print(fpr)
print (predictions[:20])

The results:

0.9992125268800921
0.9975974866013676
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    1
18    1
19    1
dtype: int64

2.3 Model Optimization

There are many ways to optimize a model; here we simply adjust parameters. The printout above shows that the true positive rate and the false positive rate are both close to 1: the model approves virtually every loan, including the ones that were charged off. To maximize profit that is exactly what we do not want, so we need to drive the false positive rate down.

The method is simple: balance the class weights.
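For reference, class_weight="balanced" weights each class inversely to its frequency; sklearn computes roughly n_samples / (n_classes * np.bincount(y)). A quick sketch of that computation, assuming the target series defined earlier:

import numpy as np

# What class_weight="balanced" computes internally:
# weight of class c = n_samples / (n_classes * count of class c)
counts = np.bincount(target)           # [num charged off, num fully paid]
weights = len(target) / (2.0 * counts)
print(dict(zip([0, 1], weights)))      # the minority class gets the larger weight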

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold

lr = LogisticRegression(class_weight="balanced")  # the key change
kf = KFold(n_splits=3)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))

print(tpr)
print(fpr)
print(predictions[:20]) 

The results:

0.6686555410848957
0.39641471077434853
0     1
1     0
2     0
3     1
4     1
5     0
6     0
7     0
8     0
9     0
10    1
11    0
12    1
13    1
14    0
15    0
16    1
17    1
18    1
19    0
dtype: int64

Above, the framework balanced the weights on its own; we can also set the weights ourselves:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold

penalty = {
    0: 4,  # a false positive (approving a charged-off loan) costs four times as much
    1: 1
}

lr = LogisticRegression(class_weight=penalty)
kf = KFold(n_splits=3)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))

print(tpr)
print(fpr)

The results:

0.8168519247660296
0.6451672518942894

3. Extensions

3.1 Python and Its Third-Party Libraries in Data Analysis

Speaking of the libraries I use most often:

numpy and pandas are the standard libraries for matrix computation and the foundation of many machine learning and deep learning libraries. I use them mostly for data cleaning: whether the data comes from a crawler or was collected elsewhere, it is bound to contain plenty of irrelevant fields and NaN values, which is where feature selection and feature extraction come in.

Once the data is processed, for machine learning I usually reach for sklearn, a standard machine learning framework that ships with the usual linear and logistic regression models as well as common tuning tools such as hyperparameter grid search.
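As a small illustration of the grid search mentioned above, here is a sketch that reuses the features and target variables from section 2.2; the parameter grid itself is just an example:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Example grid: regularization strength and class weighting.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "class_weight": ["balanced", {0: 4, 1: 1}],
}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, cv=3, scoring="roc_auc")
search.fit(features, target)
print(search.best_params_, search.best_score_)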

For deep learning I mostly use tensorflow, Google's deep learning framework, which itself builds on numpy.

3.2 Course Takeaways and Suggestions for Improvement

Takeaways: I picked up some basic Python syntax I had never paid attention to before, learned object-oriented GUI development and found it genuinely fun, which led me to teach myself the pyqt5 framework; it has been on hold for a while because of my engineering practicum, and I plan to dig into it properly over the break. I also learned the basics of using numpy and pandas.

Suggestions: the coverage of basic syntax could be compressed, leaving more class time for walking through code.


Reprinted from blog.csdn.net/qq_34788903/article/details/84346232