<iframe src="http://nbviewer.jupyter.org/github/littleadams/Uber-Rider-Retension-Prediction/blob/master/Uber%E6%89%93%E8%BD%A6%E7%94%A8%E6%88%B7%E7%95%99%E5%AD%98%E6%83%85%E5%86%B5%E9%A2%84%E6%B5%8B.ipynb" width="850" height="2000"></iframe>
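Uber Rider Retention Prediction
1. Project Overview
1.1 Background and Problem Definition
To identify how rider characteristics affect retention, and so help improve the retention rate, Uber released partial information on 50,000 users who registered an Uber account in January 2014, with the aim of building a reasonably reliable and informative predictive model from it.
To solve this binary classification problem, this project uses Python libraries (Pandas, NumPy, Matplotlib, Scikit-learn, and others) to select, refine, and evaluate classification models. The goal is a reliable model together with the importance of each rider feature for Uber retention.
1.2 Project Workflow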
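- Data preprocessing and feature engineering
  - Exploratory data analysis
  - Missing-value handling for numeric features
  - Outlier detection for numeric features
  - Definition of the retained-user label
  - Data visualization and feature correlation analysis
  - Indicator-variable encoding of categorical features
  - Discretization of date features
- Classification model selection
  - Define the model scoring metric
  - Try several classification models and keep the better-performing one
- Cross-validation and tuning
  - Use GridSearchCV to find good values for max_depth, max_features, and n_estimators
- Model validation and evaluation
  - Use validation curves to evaluate how performance changes with model complexity
- Conclusion
  - Rank the features by importance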
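2. Data Preprocessing and Feature Engineering
2.1 Dataset Overview
- city: the city in which the rider signed up
- phone: the rider's phone operating system
- signup_date: account signup date, in 'YYYY-MM-DD' format
- last_trip_date: date of the rider's most recent trip, in 'YYYY-MM-DD' format
- avg_dist: average trip distance (in miles) during the first 30 days after signup
- avg_rating_by_driver: average rating given to the rider by drivers
- avg_rating_of_driver: average rating the rider gave to their drivers
- surge_pct: percentage of the rider's trips taken under surge pricing
- avg_surge: average surge multiplier on the rider's trips
- trips_in_first_30_days: total number of trips in the first 30 days after signup
- uber_black_user: whether the rider is an Uber Black user
- weekday_pct: percentage of the rider's trips taken on weekdays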
2.2 Exploratory Data Analysis
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load the data
with open('train.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)
df.head()
| | avg_dist | avg_rating_by_driver | avg_rating_of_driver | avg_surge | city | last_trip_date | phone | signup_date | surge_pct | trips_in_first_30_days | uber_black_user | weekday_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.67 | 5.0 | 4.7 | 1.10 | King's Landing | 2014-06-17 | iPhone | 2014-01-25 | 15.4 | 4 | True | 46.2 |
| 1 | 8.26 | 5.0 | 5.0 | 1.00 | Astapor | 2014-05-05 | Android | 2014-01-29 | 0.0 | 0 | False | 50.0 |
| 2 | 0.77 | 5.0 | 4.3 | 1.00 | Astapor | 2014-01-07 | iPhone | 2014-01-06 | 0.0 | 3 | False | 100.0 |
| 3 | 2.36 | 4.9 | 4.6 | 1.14 | King's Landing | 2014-06-29 | iPhone | 2014-01-10 | 20.0 | 9 | True | 80.0 |
| 4 | 3.13 | 4.9 | 4.4 | 1.19 | Winterfell | 2014-03-15 | Android | 2014-01-27 | 11.8 | 14 | False | 82.4 |
# Summary statistics for each numeric field
df.describe()
| | avg_dist | avg_rating_by_driver | avg_rating_of_driver | avg_surge | surge_pct | trips_in_first_30_days | weekday_pct |
|---|---|---|---|---|---|---|---|
| count | 50000.000000 | 49799.000000 | 41878.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 |
| mean | 5.796827 | 4.778158 | 4.601559 | 1.074764 | 8.849536 | 2.278200 | 60.926084 |
| std | 5.707357 | 0.446652 | 0.617338 | 0.222336 | 19.958811 | 3.792684 | 37.081503 |
| min | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2.420000 | 4.700000 | 4.300000 | 1.000000 | 0.000000 | 0.000000 | 33.300000 |
| 50% | 3.880000 | 5.000000 | 4.900000 | 1.000000 | 0.000000 | 1.000000 | 66.700000 |
| 75% | 6.940000 | 5.000000 | 5.000000 | 1.050000 | 8.600000 | 3.000000 | 100.000000 |
| max | 160.960000 | 5.000000 | 5.000000 | 8.000000 | 100.000000 | 125.000000 | 100.000000 |
df1 = df.copy()
2.3 Missing-Value Handling for Numeric Features
# Check missing values in each field
df1.isnull().sum()
avg_dist 0
avg_rating_by_driver 201
avg_rating_of_driver 8122
avg_surge 0
city 0
last_trip_date 0
phone 396
signup_date 0
surge_pct 0
trips_in_first_30_days 0
uber_black_user 0
weekday_pct 0
dtype: int64
Three fields contain missing values: avg_rating_by_driver, avg_rating_of_driver, and phone. They are handled as follows:
* avg_rating_by_driver and avg_rating_of_driver: fill with the median of each field
* phone: drop the affected rows
# avg_rating_by_driver and avg_rating_of_driver: fill with the median of each field
df1['avg_rating_by_driver'] = df1['avg_rating_by_driver'].fillna(df1['avg_rating_by_driver'].median())
df1['avg_rating_of_driver'] = df1['avg_rating_of_driver'].fillna(df1['avg_rating_of_driver'].median())
# phone: drop the affected rows
df1 = df1.dropna(subset=['phone']).reset_index(drop=True)
df1.isnull().sum()
avg_dist 0
avg_rating_by_driver 0
avg_rating_of_driver 0
avg_surge 0
city 0
last_trip_date 0
phone 0
signup_date 0
surge_pct 0
trips_in_first_30_days 0
uber_black_user 0
weekday_pct 0
dtype: int64
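The two imputation steps above can be sketched on a small frame. The toy values below are made up for illustration and are not taken from the dataset. Assigning the result of `fillna` back to the column avoids in-place modification of a column selection, which newer pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from train.json
toy = pd.DataFrame({
    'avg_rating_by_driver': [5.0, np.nan, 4.0, 4.5],
    'avg_rating_of_driver': [np.nan, 4.0, 5.0, 3.0],
    'phone': ['iPhone', None, 'Android', 'iPhone'],
})

# Fill each rating column with its own median
for col in ['avg_rating_by_driver', 'avg_rating_of_driver']:
    toy[col] = toy[col].fillna(toy[col].median())

# Drop rows where the phone type is unknown
toy = toy.dropna(subset=['phone']).reset_index(drop=True)

print(toy)
```

The medians are computed before any rows are dropped, which matches the order of operations used in this section.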
2.4 Outlier Detection for Numeric Features
# Use box plots to show the distribution of each field
fig, ax = plt.subplots(figsize=(10,6))
sns.boxplot(data=df1[['avg_rating_by_driver', 'avg_rating_of_driver', 'avg_surge']])
fig, ax = plt.subplots(figsize=(10,6))
sns.boxplot(data=df1[['avg_dist', 'surge_pct', 'trips_in_first_30_days', 'weekday_pct']])
Count the outliers in each record
# Count, for each record, how many features fall on or beyond the 1.5 * IQR fences
df1['outlier_counts'] = 0
def outlier_count(df, numeric_features):
    for feature in numeric_features:
        Q1 = df[feature].quantile(0.25)
        Q3 = df[feature].quantile(0.75)
        iqr = Q3 - Q1
        # 1 where the value lies on or beyond a fence, 0 otherwise
        outlier_series = ((df[feature] >= Q3 + 1.5 * iqr) | (df[feature] <= Q1 - 1.5 * iqr)).astype(int)
        count1 = outlier_series.sum()
        print('Number of Outliers for %s : %d' % (feature, count1))
        df['outlier_counts'] = df['outlier_counts'] + outlier_series
    return df
Number of inferred outliers in each field
numeric_features = ['avg_rating_by_driver','avg_rating_of_driver', 'avg_surge',
'avg_dist', 'surge_pct', 'trips_in_first_30_days', 'weekday_pct']
df1 = outlier_count(df1, numeric_features)
Number of Outliers for avg_rating_by_driver : 3922
Number of Outliers for avg_rating_of_driver : 3106
Number of Outliers for avg_surge : 8369
Number of Outliers for avg_dist : 4477
Number of Outliers for surge_pct : 6768
Number of Outliers for trips_in_first_30_days : 3153
Number of Outliers for weekday_pct : 0
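To make the 1.5 × IQR rule used by `outlier_count` concrete, here is a minimal worked example on ten made-up values (an assumption for illustration, not data from the notebook). `np.percentile` uses linear interpolation by default, which matches the default of pandas `quantile`.

```python
import numpy as np

# Hypothetical values with one obvious extreme point
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Same inclusive comparison as outlier_count above
is_outlier = (values >= upper) | (values <= lower)

print('Q1 =', q1, 'Q3 =', q3)
print('fences =', lower, upper)
print('outliers =', values[is_outlier])
```

Because the comparison is inclusive, a feature whose quartiles coincide (an IQR of 0) would have every value equal to that quartile flagged. A strict comparison would avoid that edge case.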
Number of records with outliers in at least two features
(df1.outlier_counts >= 2).sum()
7805
Drop the records with outliers in at least two features
df1 = df1.drop(index=df1[df1.outlier_counts >= 2].index).reset_index(drop=True)
df1.shape[0]
41799
df1.drop(['outlier_counts'], axis=1, inplace=True)
2.5 Non-numeric Features
Categorical features: city, phone, uber_black_user
print('city : \n')
print(df1.city.value_counts())
print('\n')
print('phone : \n')
print(df1.phone.value_counts())
print('\n')
print('uber_black_user : \n')
print(df1.uber_black_user.value_counts())
city :
Winterfell 19838
Astapor 13619
King's Landing 8342
Name: city, dtype: int64
phone :
iPhone 28988
Android 12811
Name: phone, dtype: int64
uber_black_user :
False 25469
True 16330
Name: uber_black_user, dtype: int64
Date features: signup_date, last_trip_date
print('signup_date: max = %s, min = %s' % (df1.signup_date.max(), df1.signup_date.min()))
print('last_trip_date: max = %s, min = %s' % (df1.last_trip_date.max(), df1.last_trip_date.min()))
signup_date: max = 2014-01-31, min = 2014-01-01
last_trip_date: max = 2014-07-01, min = 2014-01-01
Riders with at least one trip in the 30 days before 2014-07-01 are defined as retained
df1['retained'] = df1.last_trip_date >= '2014-06-01'
df1.drop(['last_trip_date'], axis=1, inplace=True)
df1.retained.mean()
0.37405201081365586
Retained riders make up about 37% of the riders in the cleaned dataset
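The label definition above compares date strings directly. That works because 'YYYY-MM-DD' strings sort in chronological order. The following is a minimal sketch, on made-up dates, that derives the cutoff from the latest trip date instead of hard-coding it and checks that string and date comparisons agree. The variable names are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical last-trip dates in the same 'YYYY-MM-DD' format as the dataset
last_trips = ['2014-06-17', '2014-05-05', '2014-07-01', '2014-06-01', '2014-01-07']

parsed = [date.fromisoformat(s) for s in last_trips]
cutoff = max(parsed) - timedelta(days=30)

# Comparison on real dates versus comparison on the raw ISO strings
retained_by_date = [d >= cutoff for d in parsed]
retained_by_string = [s >= cutoff.isoformat() for s in last_trips]

print(cutoff, retained_by_date)
```

Deriving the cutoff from the data keeps the 30-day window correct if the snapshot date changes.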
2.6 Data Visualization and Feature Correlation Analysis
For the numeric features, a pair scatter plot and a correlation matrix are used to analyze correlations
num_feat = ['avg_rating_by_driver','avg_rating_of_driver', 'avg_surge',
'avg_dist', 'surge_pct', 'trips_in_first_30_days', 'weekday_pct','retained']
sns.pairplot(df1[num_feat],
hue='retained',
vars=['avg_rating_by_driver','avg_rating_of_driver', 'avg_surge',
'avg_dist', 'surge_pct', 'trips_in_first_30_days', 'weekday_pct']
)
def corr_plot(df, num_feat):
    sns.set(style='white')
    corr = df[num_feat].corr()
    # Mask the upper triangle so that each pair is shown only once
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    f, ax = plt.subplots(figsize=(11,9))
    cmap = sns.diverging_palette(220,10,as_cmap=True)
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=0.4, annot=True,
                square=True, linewidths=0.5, cbar_kws={'shrink':.5}, ax=ax)
corr_plot(df1, num_feat)
For the categorical features, count plots are used for visualization
sns.countplot(x='city', hue='retained', data=df1)
sns.countplot(x='phone', hue='retained', data=df1)
sns.countplot(x='uber_black_user', hue='retained', data=df1)
Convert the categorical and date features into numeric features
df_city = pd.get_dummies(df1.city,prefix='city')
df1 = pd.concat([df1, df_city], axis=1)
df_phone = pd.get_dummies(df1.phone, prefix='phone')
df1 = pd.concat([df1, df_phone], axis=1)
def signup_date_convert(date):
    if date < '2014-01-05':
        return 0
    elif date < '2014-01-10':
        return 1
    elif date < '2014-01-15':
        return 2
    elif date < '2014-01-20':
        return 3
    elif date < '2014-01-25':
        return 4
    else:
        return 5
df1.signup_date = df1.signup_date.apply(signup_date_convert)
df1.drop(['city', 'phone'], axis=1, inplace=True)
df1.head()
| | avg_dist | avg_rating_by_driver | avg_rating_of_driver | avg_surge | signup_date | surge_pct | trips_in_first_30_days | uber_black_user | weekday_pct | retained | city_Astapor | city_King's Landing | city_Winterfell | phone_Android | phone_iPhone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.67 | 5.0 | 4.7 | 1.1 | 5 | 15.4 | 4 | True | 46.2 | True | 0 | 1 | 0 | 0 | 1 |
| 1 | 8.26 | 5.0 | 5.0 | 1.0 | 5 | 0.0 | 0 | False | 50.0 | False | 1 | 0 | 0 | 1 | 0 |
| 2 | 0.77 | 5.0 | 4.3 | 1.0 | 1 | 0.0 | 3 | False | 100.0 | False | 1 | 0 | 0 | 0 | 1 |
| 3 | 10.56 | 5.0 | 3.5 | 1.0 | 1 | 0.0 | 2 | True | 100.0 | True | 0 | 0 | 1 | 0 | 1 |
| 4 | 3.95 | 4.0 | 4.9 | 1.0 | 4 | 0.0 | 1 | False | 100.0 | False | 1 | 0 | 0 | 1 | 0 |
df1.dtypes
avg_dist float64
avg_rating_by_driver float64
avg_rating_of_driver float64
avg_surge float64
signup_date int64
surge_pct float64
trips_in_first_30_days int64
uber_black_user bool
weekday_pct float64
retained bool
city_Astapor uint8
city_King's Landing uint8
city_Winterfell uint8
phone_Android uint8
phone_iPhone uint8
dtype: object
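The if/elif chain in `signup_date_convert` groups signup dates into five-day bins. Assuming every date falls in January 2014 (as the min/max check earlier suggests), the same mapping can be written arithmetically. `signup_date_bin` is a hypothetical helper for illustration, not part of the notebook.

```python
def signup_date_bin(date_str):
    """Map a 'YYYY-MM-DD' date in January 2014 to a bin index from 0 to 5."""
    day = int(date_str[8:10])
    # Days 1-4 -> 0, 5-9 -> 1, ..., 20-24 -> 4, and 25-31 are capped at 5
    return min(day // 5, 5)

bins = {d: signup_date_bin(d)
        for d in ['2014-01-04', '2014-01-05', '2014-01-24', '2014-01-31']}
print(bins)
```

The explicit chain is easier to adapt to uneven bin edges, while the arithmetic form is shorter when the bins are uniform.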
3. Model Selection
Define the model evaluation metric
from sklearn.metrics import roc_auc_score
def performance_metric(y_true, y_pred):
    score = roc_auc_score(y_true, y_pred)
    return score
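Throughout this notebook `performance_metric` is called on the output of `clf.predict`, that is, on hard 0/1 labels. A minimal sketch of why that matters: with hard labels the ROC curve has a single operating point, and the AUC reduces to the mean of the true positive rate and the true negative rate, which can differ noticeably from the AUC computed on predicted probabilities. `pairwise_auc` and the toy values are hypothetical, written in plain Python for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 1]
proba = [0.2, 0.6, 0.4, 0.7, 0.9]         # hypothetical predicted probabilities
labels = [int(p >= 0.5) for p in proba]   # hard labels at a 0.5 cutoff

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)
print(round(auc_from_proba, 4), round(auc_from_labels, 4))
```

With scikit-learn, the probability-based variant would pass `clf.predict_proba(X)[:, 1]` to `roc_auc_score` instead of `clf.predict(X)`. The scores reported in this notebook are the label-based ones.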
Split into training and test sets
from sklearn.model_selection import train_test_split
y = df1['retained']
X = df1.drop(['retained'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
X_test.head()
| | avg_dist | avg_rating_by_driver | avg_rating_of_driver | avg_surge | signup_date | surge_pct | trips_in_first_30_days | uber_black_user | weekday_pct | city_Astapor | city_King's Landing | city_Winterfell | phone_Android | phone_iPhone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15233 | 1.23 | 3.0 | 5.0 | 1.0 | 4 | 0.0 | 2 | False | 100.0 | 0 | 0 | 1 | 0 | 1 |
| 37446 | 15.99 | 5.0 | 5.0 | 1.0 | 2 | 0.0 | 1 | False | 100.0 | 0 | 1 | 0 | 0 | 1 |
| 34578 | 3.03 | 4.8 | 4.0 | 1.0 | 3 | 0.0 | 5 | True | 100.0 | 1 | 0 | 0 | 1 | 0 |
| 32333 | 7.62 | 5.0 | 4.5 | 1.0 | 3 | 0.0 | 0 | True | 30.8 | 0 | 1 | 0 | 0 | 1 |
| 1813 | 25.93 | 5.0 | 4.9 | 1.0 | 2 | 0.0 | 1 | True | 0.0 | 0 | 0 | 1 | 0 | 1 |
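Since only about 37% of riders are retained, one optional refinement is a stratified split, which keeps the class ratio the same in both parts. This is a sketch on synthetic data, not the split used in the notebook. `X_demo` and `y_demo` are made-up stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X_demo = rng.normal(size=(1000, 3))   # hypothetical features
y_demo = rng.rand(1000) < 0.37        # roughly 37% positives, like the retained label

# stratify=y_demo preserves the positive rate in the training and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)

print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```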
Try several classification models and compare them
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
clf_lr = LogisticRegression()
clf_svc = SVC()
clf_knn = KNeighborsClassifier()
clf_rf = RandomForestClassifier()
for clf in (clf_lr, clf_svc, clf_knn, clf_rf):
    clf.fit(X_train, y_train)
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    print(clf.__class__.__name__)
    print('Score on training dataset : ', performance_metric(y_train, y_train_pred))
    print('Score on test dataset : ', performance_metric(y_test, y_test_pred))
LogisticRegression
Score on training dataset : 0.6841050685926416
Score on test dataset : 0.6936245857781221
SVC
Score on training dataset : 0.7794541297288773
Score on test dataset : 0.7409838438260743
KNeighborsClassifier
Score on training dataset : 0.7960169470894022
Score on test dataset : 0.721543165526463
RandomForestClassifier
Score on training dataset : 0.9748921206002594
Score on test dataset : 0.7189725193641473
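The random forest fits the training set far better than the other models (0.975 versus at most 0.796), but its test score (0.719) is below that of SVC (0.741), which points to overfitting at the default settings. It is selected for further refinement and tuning.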
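A single train/test split can make the ranking of models depend on how the split happened to fall. As a hedged alternative, the comparison can be repeated with cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the Uber features, and only two of the four models to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X, y); not the Uber data
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForestClassifier': RandomForestClassifier(n_estimators=50, random_state=1),
}

cv_auc = {}
for name, model in models.items():
    # scoring='roc_auc' scores on probabilities or decision values, not hard labels
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='roc_auc')
    cv_auc[name] = scores.mean()
    print(name, round(cv_auc[name], 4))
```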
4. Model Tuning
Use GridSearchCV for cross-validated tuning
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.metrics import make_scorer
# Tune max_depth and max_features
def rf_fit_model(X, y):
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.3, random_state=1)
    rf_clf = RandomForestClassifier(random_state=1)
    parameters = {
        'max_depth':[1,2,3,4,5,6,7,8,9,10],
        'max_features':[1,2,3,4,5,6,7,8,9,10,11],
        # 'n_estimators':[10,20,30,40,50,60,70,80,90,100],
    }
    scoring_func = make_scorer(performance_metric)
    grid_obj = GridSearchCV(rf_clf, param_grid=parameters, scoring=scoring_func, cv=cv_sets)
    grid_obj = grid_obj.fit(X, y)
    return grid_obj.best_estimator_
clf = rf_fit_model(X_train, y_train)
clf
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=6, max_features=9, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=1, verbose=0, warm_start=False)
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
print("Tuned model produces for training dataset an AUC score of ", performance_metric(y_train, y_train_pred))
print("Tuned model produces for test dataset an AUC score of ", performance_metric(y_test, y_test_pred))
print("Tuned model employs a max_depth of ", clf.get_params()['max_depth'])
print("Tuned model employs a max_features of ", clf.get_params()['max_features'])
Tuned model produces for training dataset an AUC score of 0.7593938939610745
Tuned model produces for test dataset an AUC score of 0.7640681532854529
Tuned model employs a max_depth of 6
Tuned model employs a max_features of 9
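The notebook tunes max_depth and max_features first and n_estimators in a second pass. The three can also be searched jointly, at the cost of more fits. The sketch below shows the idea on synthetic data with a deliberately small grid. The data and grid values are assumptions chosen so that the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Synthetic stand-in for (X_train, y_train); not the Uber data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

cv_sets = ShuffleSplit(n_splits=3, test_size=0.3, random_state=1)
param_grid = {
    'max_depth': [2, 4, 6],
    'max_features': [2, 4],
    'n_estimators': [10, 30],
}

# All three hyperparameters are searched together
grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid=param_grid, scoring='roc_auc', cv=cv_sets)
grid.fit(X_demo, y_demo)

print(grid.best_params_, round(grid.best_score_, 4))
```

A joint search avoids fixing the first two parameters at values that were only optimal for the default number of trees.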
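# With max_depth and max_features tuned, continue by tuning n_estimators
from sklearn.model_selection import GridSearchCV, ShuffleSplit
cv_sets = ShuffleSplit(n_splits=10, test_size=0.3, random_state=1)
# Note: max_features is set to 8 here, although the search above selected 9
rfr = RandomForestClassifier(random_state=1, max_depth=6, max_features=8)
parameters = {'n_estimators':[10,20,30,50,70,100,150,200,500]}
scoring_func = make_scorer(performance_metric)
grid_obj = GridSearchCV(rfr, param_grid=parameters, cv=cv_sets, scoring=scoring_func)
# Note: this search is fitted on the full data (X, y), so X_test is not a clean hold-out for the scores below
grid_obj = grid_obj.fit(X,y)
clf = grid_obj.best_estimator_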
clf
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=6, max_features=8, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
oob_score=False, random_state=1, verbose=0, warm_start=False)
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
print("Tuned model produces for training dataset an AUC score of ", performance_metric(y_train, y_train_pred))
print("Tuned model produces for test dataset an AUC score of ", performance_metric(y_test, y_test_pred))
Tuned model produces for training dataset an AUC score of 0.7555624338810432
Tuned model produces for test dataset an AUC score of 0.7660540396340655
5. Model Validation and Evaluation
Use validation curves to evaluate how model performance changes as max_features and max_depth vary
from sklearn.model_selection import ShuffleSplit, validation_curve
def max_features_curve(X,y):
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.3, random_state=1)
    clf = RandomForestClassifier(max_depth=6, n_estimators=200, random_state=1)
    max_features_range = range(1,12)
    train_scores, test_scores = validation_curve(clf, X, y, param_name='max_features', param_range=max_features_range,
                                                 cv=cv_sets, scoring='roc_auc')
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    # Spread of the validation scores across the CV splits
    test_scores_std = np.std(test_scores, axis=1)
    plt.figure(figsize=(7,5))
    plt.plot(max_features_range, train_scores_mean, '-o', color='r', label='Training_Scores_Mean')
    plt.plot(max_features_range, test_scores_mean, '-o', color='b', label='TestScores_Mean')
    plt.fill_between(max_features_range, train_scores_mean-train_scores_std, train_scores_mean+train_scores_std, color='r', alpha=0.2)
    plt.fill_between(max_features_range, test_scores_mean-test_scores_std, test_scores_mean+test_scores_std, color='b', alpha=0.2)
    plt.title('Random Forest Classifier Complexity Performance')
    plt.legend(loc='best')
    plt.xlabel('max_features')
    plt.ylabel('roc_auc')
    plt.ylim([0.83,0.85])
def max_depth_curve(X,y):
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.3, random_state=1)
    clf = RandomForestClassifier(max_features=8, n_estimators=200, random_state=1)
    max_depth_range = range(1,11)
    train_scores, test_scores = validation_curve(clf, X, y, param_name='max_depth', param_range=max_depth_range,
                                                 cv=cv_sets, scoring='roc_auc')
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    # Spread of the validation scores across the CV splits
    test_scores_std = np.std(test_scores, axis=1)
    plt.figure(figsize=(10,6))
    plt.plot(max_depth_range, train_scores_mean, '-o', color='r', label='Training_Scores_Mean')
    plt.plot(max_depth_range, test_scores_mean, '-o', color='b', label='TestScores_Mean')
    plt.fill_between(max_depth_range, train_scores_mean-train_scores_std, train_scores_mean+train_scores_std, color='r', alpha=0.2)
    plt.fill_between(max_depth_range, test_scores_mean-test_scores_std, test_scores_mean+test_scores_std, color='b', alpha=0.2)
    plt.title('Random Forest Classifier Complexity Performance')
    plt.legend(loc='best')
    plt.xlabel('max_depth')
    plt.ylabel('roc_auc')
    plt.ylim([0.60,0.95])
max_depth_curve(X, y)
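A validation curve can also be read numerically: the gap between the mean training score and the mean validation score at each parameter value is a simple indicator of overfitting. The sketch below computes that gap for a few depths on synthetic data. The data, the depth values, and the small forest size are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, validation_curve

# Synthetic stand-in for (X, y); not the Uber data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
depths = [1, 2, 4, 8]

train_scores, test_scores = validation_curve(
    RandomForestClassifier(n_estimators=30, random_state=1), X_demo, y_demo,
    param_name='max_depth', param_range=depths,
    cv=ShuffleSplit(n_splits=3, test_size=0.3, random_state=1), scoring='roc_auc')

# One row per depth, one column per CV split
gap = train_scores.mean(axis=1) - test_scores.mean(axis=1)
for depth, g in zip(depths, gap):
    print('max_depth =', depth, 'train - validation AUC =', round(float(g), 4))
```

A depth where the validation score has flattened while the gap keeps growing is a reasonable upper bound for the search range.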
clf = RandomForestClassifier(n_estimators=200, max_features=5, max_depth=9, random_state=1)
clf.fit(X_train, y_train)
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
print("Tuned model produces for training dataset an AUC score of ", performance_metric(y_train, y_train_pred))
print("Tuned model produces for test dataset an AUC score of ", performance_metric(y_test, y_test_pred))
Tuned model produces for training dataset an AUC score of 0.7814827397745683
Tuned model produces for test dataset an AUC score of 0.7637462305703645
from sklearn.metrics import roc_curve
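y_test_proba = clf.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, y_test_proba)
# Use the predicted probabilities so that the reported area matches the plotted curve
roc_auc = roc_auc_score(y_test, y_test_proba)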
plt.figure(figsize=(7,5))
plt.title('ROC Curve')
plt.plot(fpr, tpr, label='ROC curve (area=%.4f)' % roc_auc)
plt.plot([0,1.],[0,1.],'--',color='k')
plt.xlim([0,1.0])
plt.ylim([0,1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
6. Conclusion
Feature importances from the final model
feat_importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
# Feature indices sorted by descending importance
indices = np.argsort(feat_importances)[::-1]
df_importances = pd.DataFrame(list(zip(X_test.columns, feat_importances))).sort_values([1], ascending=False)
df_importances.columns = ['feature_name', 'feature_importances']
print('Feature Importances')
print(df_importances)
Feature Importances
feature_name feature_importances
5 surge_pct 0.201706
1 avg_rating_by_driver 0.163218
8 weekday_pct 0.116414
10 city_King's Landing 0.111327
3 avg_surge 0.101603
6 trips_in_first_30_days 0.055413
0 avg_dist 0.049693
7 uber_black_user 0.044412
12 phone_Android 0.039708
13 phone_iPhone 0.036023
2 avg_rating_of_driver 0.027528
9 city_Astapor 0.024283
11 city_Winterfell 0.014679
4 signup_date 0.013995
```
```python
df_importances.plot(kind='bar', color='r',
yerr=std[df_importances.index], align='center')
plt.xticks(range(X_train.shape[1]), df_importances.feature_name, rotation=90)