A Complete Walkthrough of Customer Churn Early Warning (Python, Machine Learning)

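I have collected several versions of this example, but because they were written at different times, most of them run into library version problems of one kind or another. So I carefully rewrote this customer churn early-warning case study from start to finish.

The data we received looks like this. If you need the data package, leave your email address.

First, here is the corresponding explanation of the results.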

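Points to note in this case study. First, the features are rescaled before modeling; the code below does this with StandardScaler, which standardizes each column to zero mean and unit variance rather than min-max normalizing it to [0, 1]. Second, several classifiers are compared. Judged on accuracy alone, random forest, KNN, and regression all score very high, so accuracy is not enough to choose between them and recall has to be considered as well. Ideally the two are combined: the F-score does this in a single number, and so does the area under the precision-recall curve, where the x axis is recall and the y axis is precision.

[Figure: rough sketch of the precision-recall curve]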
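To make the F-score and the area under the precision-recall curve concrete, here is a minimal sketch using scikit-learn's metrics. The labels and scores are invented for illustration and do not come from churn.csv; the 0.5 threshold is also just an assumption.

```python
import numpy as np
from sklearn.metrics import auc, f1_score, precision_recall_curve

# Toy labels and classifier scores, made up purely for illustration
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.4, 0.45, 0.7, 0.8, 0.9])

# Precision and recall at every score threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Area under the curve with recall on the x axis and precision on the y axis
pr_auc = auc(recall, precision)

# F1 is the harmonic mean of precision and recall at one fixed threshold
y_pred = (y_score >= 0.5).astype(int)
f1 = f1_score(y_true, y_pred)

print("Area under the precision-recall curve: %.3f" % pr_auc)
print("F1 at threshold 0.5: %.3f" % f1)
```

A larger area means the classifier keeps precision high as recall increases, which is the behavior we want from a churn model.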

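Why recall matters: if 10 out of 100 people are bad actors and you fail to identify them, the consequences are serious.

The corresponding information and code follow: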
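The arithmetic behind that example can be sketched in a few lines. This assumes a hypothetical classifier that labels all 100 people as harmless: it is still right 90% of the time, yet it catches none of the 10 bad actors.

```python
import numpy as np

# 100 people: 10 bad actors (label 1) and 90 harmless ones (label 0)
y_true = np.array([1] * 10 + [0] * 90)

# Hypothetical classifier that never flags anyone
y_pred = np.zeros(100, dtype=int)

accuracy = np.mean(y_true == y_pred)

# Recall = TP / (TP + FN): the share of real positives that were caught
true_pos = np.sum((y_true == 1) & (y_pred == 1))
false_neg = np.sum((y_true == 1) & (y_pred == 0))
recall = true_pos / (true_pos + false_neg)

print("accuracy = %.2f, recall = %.2f" % (accuracy, recall))
```

Accuracy is 0.90 while recall is 0, so a high accuracy figure alone says very little on imbalanced data.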

from __future__ import division
import pandas as pd
import numpy as np

# Load the data
churn_df = pd.read_csv('churn.csv')
# List the column names
col_names = churn_df.columns.tolist()
print('Column names:', col_names)

# Select the first 6 and the last 6 columns
to_show = col_names[:6] + col_names[-6:]

# Show the first 6 rows
churn_df[to_show].head(6)

# Convert the True/False values in the last column, Churn?, to 1/0
churn_result = churn_df['Churn?']
y = np.where(churn_result == 'True.',1,0)

# Drop the columns we do not need
to_drop = ['State','Area Code','Phone','Churn?']
churn_feat_space = churn_df.drop(to_drop,axis=1)

# These two columns hold yes/no values; convert them to booleans
yes_no_cols = ["Int'l Plan","VMail Plan"]
churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'


features = churn_feat_space.columns
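# The original line was X = churn_feat_space.as_matrix().astype(np.float).
# as_matrix() has been removed from pandas and the np.float alias was removed in NumPy 1.24,
# so use .values and the builtin float instead
X = churn_feat_space.values.astype(float)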
features

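# sklearn.preprocessing provides the data preprocessing tools;
# StandardScaler standardizes each column to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Show the shape of X: 3333 records and 17 features
print("Feature space holds %d observations and %d features" % X.shape)
# The labels in y are 0 and 1
print("Unique target labels:", np.unique(y))
# Print the first record of the matrix
print(X[0])
# Count how many records have y == 0
print(len(y[y == 0]))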
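Normalization and standardization are easy to confuse, so here is a small sketch of the difference on a single made-up column (the values are illustrative, not taken from the churn data). MinMaxScaler rescales to the range [0, 1], while StandardScaler, used above, centers on zero with unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One illustrative feature column with an outlier at the end
col = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Min-max normalization: (x - min) / (max - min), result lies in [0, 1]
minmax = MinMaxScaler().fit_transform(col)

# Standardization: (x - mean) / std, result has mean 0 and std 1
standard = StandardScaler().fit_transform(col)

print("min-max:     ", minmax.ravel())
print("standardized:", standard.ravel())
```

Either scaler can be swapped into the pipeline; distance-based models such as KNN and SVM are the ones most sensitive to the choice.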

# The sklearn.cross_validation module was deprecated in version 0.18 and replaced by
# sklearn.model_selection, so "from sklearn.cross_validation import KFold" becomes the import below.
# KFold now takes n_splits instead of (n, n_folds), and the folds come from kf.split(X)
from sklearn.model_selection import KFold

# X: feature matrix, y: labels to predict, clf_class: classifier class, **kwargs: its parameters
def run_cv(X, y, clf_class, **kwargs):
    # Build a 5-fold cross-validation splitter
    kf = KFold(n_splits=5, shuffle=True)
    y_pred = y.copy()

    # Loop over the folds
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Instantiate the classifier
        clf = clf_class(**kwargs)
        # Fit on the training folds
        clf.fit(X_train, y_train)
        # Predict the held-out fold
        y_pred[test_index] = clf.predict(X_test)
    # Return the out-of-fold predictions
    return y_pred
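For comparison, scikit-learn ships a helper, cross_val_predict, that performs the same out-of-fold prediction loop. The sketch below runs it on synthetic data from make_classification, since churn.csv is not bundled here; the sample size and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the churn feature matrix and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Same idea as run_cv: 5 shuffled folds, each sample predicted exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
y_pred_demo = cross_val_predict(KNeighborsClassifier(), X_demo, y_demo, cv=kf)

print("Cross-validated accuracy: %.3f" % (y_pred_demo == y_demo).mean())
```

Writing the loop by hand, as run_cv does, keeps every step visible; the helper is shorter when you only need the predictions.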

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN
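# Accuracy: compare the true and predicted labels and return the mean
def accuracy(y_true, y_pred):
    # The comparison yields booleans marking whether each prediction is correct;
    # NumPy interprets True and False as 1. and 0.
    return np.mean(y_true == y_pred)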


# High accuracy alone does not show that a model is good. The goal is to detect
# the customers who churn, so recall is what matters most.
# Recall is the share of actual positives that are predicted positive: TP / (TP + FN)
print("Support vector machines:")
print("%.3f" % accuracy(y, run_cv(X,y,SVC)))
print("Random forest:")
print("%.3f" % accuracy(y, run_cv(X,y,RF)))
print("K-nearest neighbors:")
print("%.3f" % accuracy(y, run_cv(X,y,KNN)))
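Since recall is the metric of interest, one way to report it next to accuracy is sketched below. It uses an imbalanced synthetic dataset as a stand-in for the churn data (the 85/15 class split and all parameters are assumptions), and derives recall from the confusion matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_val_predict

# Imbalanced synthetic data: roughly 15% positives, standing in for churners
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Out-of-fold predictions from a random forest
clf = RandomForestClassifier(n_estimators=50, random_state=0)
y_hat = cross_val_predict(clf, X_demo, y_demo, cv=5)

# For binary labels, ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_demo, y_hat).ravel()
recall_manual = tp / (tp + fn)

print("accuracy:  %.3f" % ((tp + tn) / len(y_demo)))
print("precision: %.3f" % precision_score(y_demo, y_hat))
print("recall:    %.3f" % recall_manual)
```

The same three numbers can be printed for each classifier by passing the predictions from run_cv instead of y_hat.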

# Return churn as probabilities so that customers can be prioritized by risk
from sklearn.model_selection import KFold

def run_prob_cv(X, y, clf_class, **kwargs):
    kf = KFold(n_splits=5, shuffle=True)
    y_prob = np.zeros((len(y),2))
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        # Predict probabilities, not classes
        y_prob[test_index] = clf.predict_proba(X_test)
    return y_prob
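Once churn probabilities are available, prioritizing customers is a matter of sorting. The sketch below uses a small hand-written array shaped like the output of run_prob_cv; the values and the customer column (just the row position) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical output shaped like y_prob: column 0 = P(stay), column 1 = P(churn)
y_prob_demo = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
    [0.3, 0.7],
    [1.0, 0.0],
])

# Rank customers from highest to lowest churn probability
ranking = (
    pd.DataFrame({
        "customer": np.arange(len(y_prob_demo)),  # row position as a stand-in ID
        "churn_prob": y_prob_demo[:, 1],
    })
    .sort_values("churn_prob", ascending=False)
    .reset_index(drop=True)
)

# The first rows are the customers to contact first
print(ranking.head(2))
```

In practice the cut-off (top 2 here) would depend on how many customers the retention team can handle.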

import warnings
warnings.filterwarnings('ignore')
# The two lines above are optional
# With 10 estimators, the predicted probabilities are all multiples of 0.1
pred_prob = run_prob_cv(X, y, RF, n_estimators=10)
#print(pred_prob[0])
pred_churn = pred_prob[:,1]
is_churn = y == 1

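# Number of observations assigned each predicted probability
counts = pd.Series(pred_churn).value_counts()
#print(counts)

# Observed churn rate for each predicted probability
true_prob = {}
for prob in counts.index:
    true_prob[prob] = np.mean(is_churn[pred_churn == prob])
# Convert to a Series once, after the loop has filled the dict
true_prob = pd.Series(true_prob)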

# pandas-fu
counts = pd.concat([counts,true_prob], axis=1).reset_index()
counts.columns = ['pred_prob', 'count', 'true_prob']
counts
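The same table of predicted probability, count, and observed churn rate can also be built with a single groupby. This is an alternative sketch, not the original author's method, and it runs on a few made-up values rather than the real predictions.

```python
import numpy as np
import pandas as pd

# Made-up predicted churn probabilities and true outcomes (1 = churned)
pred_demo = np.array([0.0, 0.0, 0.1, 0.1, 0.1, 0.9, 0.9, 1.0])
churn_demo = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Group by predicted probability, then count the rows and average the outcomes
table = (
    pd.DataFrame({"pred_prob": pred_demo, "is_churn": churn_demo})
    .groupby("pred_prob")["is_churn"]
    .agg(count="size", true_prob="mean")
    .reset_index()
)

print(table)
```

If the model is well calibrated, true_prob should sit close to pred_prob in every row.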
