Predicting User Churn from Large-Scale User Logs with Spark

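Background:

What is LCV?

In marketing, customer lifetime value (CLV, often written CLTV), lifetime customer value (LCV), or lifetime value (LTV) is a prediction of the net profit attributed to the entire future relationship with a customer. It is an important concept because it encourages companies to shift their focus from quarterly profits to the long-term health of their customer relationships, using lifetime value both to assess the past and to plan for the future.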

CLV = (Avg Monthly Revenue per Customer * Gross Margin per Customer) ÷ Monthly Churn Rate
(Note: this is a conceptual model; the actual calculation varies by industry.)
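
As a worked example of the conceptual formula above, here is a minimal Python sketch. The function name and the input figures are hypothetical and chosen only for illustration; they do not come from the Sparkify data.

```python
def customer_lifetime_value(avg_monthly_revenue, gross_margin, monthly_churn_rate):
    """Conceptual CLV: (average monthly revenue * gross margin) / monthly churn rate."""
    if not 0 < monthly_churn_rate <= 1:
        raise ValueError("monthly_churn_rate must be in (0, 1]")
    return avg_monthly_revenue * gross_margin / monthly_churn_rate


# Hypothetical inputs: 10 per month in revenue, 80% gross margin, 4% monthly churn.
clv = customer_lifetime_value(10.0, 0.8, 0.04)
print(round(clv, 2))  # prints 200.0
```

Because the churn rate is the denominator, halving it to 2% would double the value in this model, which is why the rest of this article concentrates on churn.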

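CLV applies the concept of present value to the cash flows attributed to a customer relationship. Because the present value of any stream of future cash flows measures the lump-sum value of that stream today, CLV represents the lump-sum value of the customer relationship today. Put more simply, CLV is the monetary value of a customer relationship to the company. Measuring CLV makes customer segmentation possible, pushes the company toward fine-grained operations that extract the most value from each customer, and improves profitability. It is also an upper limit on the price a company should be willing to pay to acquire a customer relationship, so it bounds how much the marketing department should spend per acquired customer, especially in direct-response marketing.

Keep the following caveats in mind when applying CLV:
The value of a customer relationship usually cannot be read directly from existing metrics, and over-reliance on a single model can lead to inaccurate customer segmentation.
Customer cash flows depend on value scores along several dimensions, including the additional value a customer brings to the product and the company.
CLV can overvalue existing customers at the expense of potential customers.
CLV is a dynamic concept, not a static model.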

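What is the churn rate?
In customer lifetime value theory, the churn rate is a key factor in determining the value of a customer relationship. At a high level, churn rate measures the number of customers who leave within a set period of time. It can be used to measure how much revenue you lose to customer cancellations, or to count the users or accounts that stop using your product or service. In either case, it is the rate at which your customer base shrinks.

The description above is a general model; the exact definition varies from business to business.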
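
One common way to turn this description into a number is to divide the customers lost during a period by the customers present at the start of that period. The sketch below assumes that definition; the function name and the figures are hypothetical.

```python
def churn_rate(customers_at_start, customers_lost):
    """Share of the starting customer base that left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_lost <= customers_at_start:
        raise ValueError("customers_lost must be between 0 and customers_at_start")
    return customers_lost / customers_at_start


# Hypothetical month: 1000 customers at the start, 50 of them cancelled.
rate = churn_rate(1000, 50)
print(f"{rate:.1%}")  # prints 5.0%
```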

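Why does churn rate matter?

Acquiring a new customer costs 5-25 times as much as retaining an existing one.
Reducing customer churn by just 5% can raise profitability by 75%.
Compared with acquisition, improving retention has 2-4 times the impact on growth.
The probability of selling to an existing customer is 60-70%, but only 5-20% for a prospect.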

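Common churn statistics:

Many factors influence customer churn, but one thing to keep in view is that churn depends on the stage of the customer lifecycle. Typically, customers churn at a higher rate at the start of a subscription than they do a few months in. This can happen for many reasons, such as poorly set expectations during the sales process, a sudden change in priorities, or a poor onboarding program. As customers mature, their churn rate stabilizes. It is therefore important to calculate churn separately for new and established customers, so that you neither overestimate steady-state churn nor underestimate early churn.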
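
To make the new-versus-established split concrete, here is a minimal sketch that computes churn separately for the two groups. The three-month cutoff, the function name, and the sample records are assumptions for illustration only.

```python
from collections import defaultdict


def churn_rate_by_segment(customers, new_customer_months=3):
    """customers: iterable of (tenure_months, churned) pairs.

    Returns the churn rate per segment, where "new" means tenure below the cutoff.
    """
    totals = defaultdict(int)
    lost = defaultdict(int)
    for tenure_months, churned in customers:
        segment = "new" if tenure_months < new_customer_months else "established"
        totals[segment] += 1
        lost[segment] += int(churned)
    return {segment: lost[segment] / totals[segment] for segment in totals}


# Hypothetical customers: four in their first months, four long-standing.
sample = [(1, True), (2, False), (1, True), (2, False),
          (12, False), (18, False), (24, True), (30, False)]
rates = churn_rate_by_segment(sample)
print(rates)  # prints {'new': 0.5, 'established': 0.25}
```

A single blended rate over this sample would be 37.5%, which hides both the high early churn and the lower steady-state churn.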

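How do you analyze user churn?

Churn is tied to the user lifecycle. A user's lifecycle within a product can be divided into a trial stage, a novice stage, an exploration stage, a mature stage, and a fatigue stage. Users churn for different reasons at different stages. The causes of churn can be explored from the following angles:

Active churn analysis: the user actively chooses to stop using the service. There can be many reasons, from a change in the customer's business direction, to a preference for a competitor's product, to never having figured out the product or being dissatisfied with it.
Satisfied churn analysis: the customer is happy with the service but still does not renew. These users generally adopted the product for a specific need; stay in touch with them, since they may come back when the need arises again.
Passive churn analysis: the customer did not update their credit card details in time, so the renewal failed. A payment reminder is usually enough: send an email or text message through the channel the customer prefers, or ask the customer to provide a more convenient and reliable payment method.
Vertical churn analysis: "Which segments do the churned customers belong to?" "In which segments is our churn low, and in which is it high?" Vertical analysis is mainly suited to B2B businesses, especially when the company's service targets specific industries.
Cohort churn analysis: "Which month had the most churn?" "How did last quarter's price change affect churn?" This analysis focuses on the broader market environment, policies, campaigns, and similar influences.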

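How do you respond to user churn?

Every company experiences customer churn; it is a routine metric. By studying churn, a company can act proactively: improve its products and services, adjust its market strategy, or offer incentives and retention measures based on analysis of specific customers, all to reduce the churn rate.

The purpose of an early-warning model is to identify users who are likely to churn ahead of time, buying time to retain them.
Five common types of churn early-warning model are listed below.
• Early-warning model based on user attributes
• Early-warning model based on key events
• Early-warning model based on negative experiences
• Early-warning model based on business stickiness
• Early-warning model based on user activity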
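
As a toy illustration of the last item, an activity-based warning can be as simple as flagging users who have been inactive for longer than some threshold. The sketch below is a rule-based stand-in, not a trained model; the 14-day threshold, the function name, and the sample dates are all assumptions.

```python
from datetime import date


def flag_at_risk_users(last_active_by_user, today, max_inactive_days=14):
    """Return the sorted ids of users inactive for more than max_inactive_days."""
    return sorted(
        user_id
        for user_id, last_active in last_active_by_user.items()
        if (today - last_active).days > max_inactive_days
    )


# Hypothetical last-activity dates for three users.
last_active = {
    "u1": date(2018, 11, 28),  # 3 days before the reference date
    "u2": date(2018, 11, 1),   # 30 days before
    "u3": date(2018, 10, 5),   # 57 days before
}
at_risk = flag_at_risk_users(last_active, today=date(2018, 12, 1))
print(at_risk)  # prints ['u2', 'u3']
```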

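Build an operations strategy to prevent churn

  1. Plug the holes that cause churn.
    • Performance optimization, e.g. fixing lag, improving load times, reducing battery drain.
    • Feature optimization, e.g. adding the features that give competitors an edge, so that you match what they offer.
    • Experience optimization, e.g. shortening flows, improving interaction and visual design.
  2. Build barriers to leaving.
    • Accumulated assets, e.g. reading preferences, saved articles, downloaded files, friend relationships, chat history.
    • Higher switching costs, e.g. exclusive features or a proprietary video playback format.
    • Incentives: add a rewards system, similar to a bonus scheme.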

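Run win-back campaigns for churned users

  1. Winning back churned users takes a series of measures; do not expect a single measure to bring everyone back.
  2. Tailor the remedy to the reason for churn and run targeted win-back campaigns.
  3. Use effective channels to reach users, e.g. push notifications, text messages, or win-back through friend connections.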

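Project overview

Grounded in customer lifetime value theory, reducing churn to maximize user value is a long-standing concern for many companies. This project analyzes the large user log files of the Sparkify music app: it explores the data, engineers features, and trains models to predict whether a customer will churn. The project is implemented mainly with the pyspark.ml library.

Problem statement

Users who eventually cancel their account in the log are defined as churned users. The problem to solve is to extract multi-dimensional, per-user features from the user log and build a model that predicts whether a user will churn. This is a binary classification problem.

Evaluation metrics

Given how the model will be used in practice, the evaluation metrics for this project are F1 and accuracy.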
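
To show why both metrics are tracked, here is a plain-Python sketch of accuracy and of the F1 score of the churn class, computed from true and predicted labels. It is independent of Spark, and the labels are made up; note that it computes the binary F1 of the positive class, which can differ from a class-weighted F1 such as the one I understand Spark's MulticlassClassificationEvaluator to report for its f1 metric.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 of the positive class) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Hypothetical imbalanced sample: 8 retained users (0) and 2 churned users (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_retained = [0] * 10                       # never predicts churn
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]         # finds one of the two churners

print(accuracy_and_f1(y_true, always_retained))  # prints (0.8, 0.0)
print(accuracy_and_f1(y_true, one_hit))          # prints (0.8, 0.5)
```

Both predictors reach 80% accuracy, but only F1 reveals that the first one never identifies a churned user, which matters when churned users are the minority class.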

Import the required libraries

# import libraries
import datetime
import random
import re
from time import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
from pandas.plotting import scatter_matrix

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, \
    LogisticRegressionModel, RandomForestClassifier, RandomForestClassificationModel, LinearSVC
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import CountVectorizer, IDF, Normalizer, \
    PCA, RegexTokenizer, Tokenizer, StandardScaler, StopWordsRemover, \
    StringIndexer, VectorAssembler, MaxAbsScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql import SparkSession, Window
# note: min and max below are Spark column functions and shadow the Python built-ins
from pyspark.sql.functions import avg, asc, col, concat, count, desc, explode, \
    isnull, lit, max, min, split, udf, weekofyear
from pyspark.sql.functions import sum as Fsum
from pyspark.sql.types import FloatType, IntegerType, StringType

%matplotlib inline

# create a local Spark session
spark = SparkSession.builder \
    .master("local") \
    .appName("Music App") \
    .getOrCreate()

# load the Sparkify event log
event_data_path = 'mini_sparkify_event_data.json'
df = spark.read.json(event_data_path)

Data inspection and cleaning

Data overview

df.head()  # which fields does the data contain?
Row(artist='Martha Tilston', auth='Logged In', firstName='Colin', gender='M', itemInSession=50, lastName='Freeman', length=277.89016, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Rockpools', status=200, ts=1538352117000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30')
df.count()  # how many records are there?
286500
df.select("userId").dropDuplicates().count()  # count distinct users, since we will build a new dataset with one row per userId and one column per user feature
226
df.printSchema()  # understand each field's meaning, data type, and rough distribution before exploring further
root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)
df.describe().show()  # get a rough sense of how each field is distributed
+-------+------------------+----------+---------+------+------------------+--------+-----------------+------+-----------------+------+-------+--------------------+-----------------+--------------------+------------------+--------------------+--------------------+-----------------+
|summary|            artist|      auth|firstName|gender|     itemInSession|lastName|           length| level|         location|method|   page|        registration|        sessionId|                song|            status|                  ts|           userAgent|           userId|
+-------+------------------+----------+---------+------+------------------+--------+-----------------+------+-----------------+------+-------+--------------------+-----------------+--------------------+------------------+--------------------+--------------------+-----------------+
|  count|            228108|    286500|   278154|278154|            286500|  278154|           228108|286500|           278154|286500| 286500|              278154|           286500|              228108|            286500|              286500|              278154|           286500|
|   mean| 551.0852017937219|      null|     null|  null|114.41421291448516|    null|249.1171819778458|  null|             null|  null|   null|1.535358834084427...|1041.526554973822|            Infinity|210.05459685863875|1.540956889810483...|                null|59682.02278593872|
| stddev|1217.7693079161374|      null|     null|  null|129.76726201140994|    null|99.23517921058361|  null|             null|  null|   null| 3.291321616327586E9|726.7762634630741|                 NaN| 31.50507848842214|1.5075439608226302E9|                null|109091.9499991047|
|    min|               !!!| Cancelled| Adelaida|     F|                 0|   Adams|          0.78322|  free|       Albany, OR|   GET|  About|       1521380675000|                1|Ég Átti Gr...|               200|       1538352117000|"Mozilla/5.0 (Mac...|                 |
|    max| Ólafur Arnalds|Logged Out|   Zyonna|     M|              1321|  Wright|       3024.66567|  paid|Winston-Salem, NC|   PUT|Upgrade|       1543247354000|             2474|Þau hafa slopp...|               404|       1543799476000|Mozilla/5.0 (comp...|               99|
+-------+------------------+----------+---------+------+------------------+--------+-----------------+------+-----------------+------+-------+--------------------+-----------------+--------------------+------------------+--------------------+--------------------+-----------------+

Filter out unregistered users

df.select("userID").show()  # some userId values are empty, which means some events come from users who were not logged in
+------+
|userID|
+------+
|    30|
|     9|
|    30|
|     9|
|    30|
|     9|
|     9|
|    30|
|    30|
|    30|
|     9|
|     9|
|    30|
|     9|
|     9|
|    30|
|     9|
|    74|
|    30|
|     9|
+------+
only showing top 20 rows
df.where(df.userId=="").show()
+------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
|artist|      auth|firstName|gender|itemInSession|lastName|length|level|location|method| page|registration|sessionId|song|status|           ts|userAgent|userId|
+------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
|  null|Logged Out|     null|  null|          100|    null|  null| free|    null|   GET| Home|        null|        8|null|   200|1538355745000|     null|      |
|  null|Logged Out|     null|  null|          101|    null|  null| free|    null|   GET| Help|        null|        8|null|   200|1538355807000|     null|      |
|  null|Logged Out|     null|  null|          102|    null|  null| free|    null|   GET| Home|        null|        8|null|   200|1538355841000|     null|      |
|  null|Logged Out|     null|  null|          103|    null|  null| free|    null|   PUT|Login|        null|        8|null|   307|1538355842000|     null|      |
|  null|Logged Out|     null|  null|            2|    null|  null| free|    null|   GET| Home|        null|      240|null|   200|1538356678000|     null|      |
|  null|Logged Out|     null|  null|            3|    null|  null| free|    null|   PUT|Login|        null|      240|null|   307|1538356679000|     null|      |
|  null|Logged Out|     null|  null|            0|    null|  null| free|    null|   PUT|Login|        null|      100|null|   307|1538358102000|     null|      |
|  null|Logged Out|     null|  null|            0|    null|  null| free|    null|   PUT|Login|        null|      241|null|   307|1538360117000|     null|      |
|  null|Logged Out|     null|  null|           14|    null|  null| free|    null|   GET| Home|        null|      187|null|   200|1538361527000|     null|      |
|  null|Logged Out|     null|  null|           15|    null|  null| free|    null|   PUT|Login|        null|      187|null|   307|1538361528000|     null|      |
|  null|Logged Out|     null|  null|           21|    null|  null| free|    null|   GET| Home|        null|      187|null|   200|1538362007000|     null|      |
|  null|Logged Out|     null|  null|           22|    null|  null| free|    null|   GET| Home|        null|      187|null|   200|1538362095000|     null|      |
|  null|Logged Out|     null|  null|           23|    null|  null| free|    null|   PUT|Login|        null|      187|null|   307|1538362096000|     null|      |
|  null|Logged Out|     null|  null|            0|    null|  null| free|    null|   GET| Home|        null|       27|null|   200|1538363488000|     null|      |
|  null|Logged Out|     null|  null|            1|    null|  null| free|    null|   GET|About|        null|       27|null|   200|1538363494000|     null|      |
|  null|Logged Out|     null|  null|            2|    null|  null| free|    null|   GET| Home|        null|       27|null|   200|1538363503000|     null|      |
|  null|Logged Out|     null|  null|           38|    null|  null| free|    null|   GET| Home|        null|      187|null|   200|1538364254000|     null|      |
|  null|Logged Out|     null|  null|           39|    null|  null| free|    null|   PUT|Login|        null|      187|null|   307|1538364255000|     null|      |
|  null|Logged Out|     null|  null|            0|    null|  null| free|    null|   GET| Home|        null|      257|null|   200|1538364750000|     null|      |
|  null|Logged Out|     null|  null|           47|    null|  null| free|    null|   GET| Home|        null|      100|null|   200|1538370681000|     null|      |
+------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
only showing top 20 rows
df=df.filter(col('userId') != '')
df.count()
278154
df.select("userId").dropDuplicates().count()
225
df.select("artist").show()  # song and artist names start with capital letters
+--------------------+
|              artist|
+--------------------+
|      Martha Tilston|
|    Five Iron Frenzy|
|        Adam Lambert|
|              Enigma|
|           Daft Punk|
|The All-American ...|
|The Velvet Underg...|
|        Starflyer 59|
|                null|
|            Frumpies|
|        Britt Nicole|
|                null|
|Edward Sharpe & T...|
|               Tesla|
|                null|
|         Stan Mosley|
|Florence + The Ma...|
|   Tokyo Police Club|
|             Orishas|
|             Ratatat|
+--------------------+
only showing top 20 rows

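Convert the timestamp field

# makes it easy to compare the listening frequency of churned and retained users by hour, day of week, and day of month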
get_hour = udf(lambda x: datetime.datetime.fromtimestamp(x / 1000.0).hour)
df=df.withColumn("hour", get_hour(df.ts))

get_weekday = udf(lambda x: datetime.datetime.fromtimestamp(x / 1000.0).strftime("%w"))
df=df.withColumn("weekday", get_weekday(df.ts))

get_day = udf(lambda x: datetime.datetime.fromtimestamp(x / 1000.0).day)
df= df.withColumn("day", get_day(df.ts))
df.take(5)
[Row(artist='Martha Tilston', auth='Logged In', firstName='Colin', gender='M', itemInSession=50, lastName='Freeman', length=277.89016, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Rockpools', status=200, ts=1538352117000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30', hour='0', weekday='1', day='1'),
 Row(artist='Five Iron Frenzy', auth='Logged In', firstName='Micah', gender='M', itemInSession=79, lastName='Long', length=236.09424, level='free', location='Boston-Cambridge-Newton, MA-NH', method='PUT', page='NextSong', registration=1538331630000, sessionId=8, song='Canada', status=200, ts=1538352180000, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.103 Safari/537.36"', userId='9', hour='0', weekday='1', day='1'),
 Row(artist='Adam Lambert', auth='Logged In', firstName='Colin', gender='M', itemInSession=51, lastName='Freeman', length=282.8273, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Time For Miracles', status=200, ts=1538352394000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30', hour='0', weekday='1', day='1'),
 Row(artist='Enigma', auth='Logged In', firstName='Micah', gender='M', itemInSession=80, lastName='Long', length=262.71302, level='free', location='Boston-Cambridge-Newton, MA-NH', method='PUT', page='NextSong', registration=1538331630000, sessionId=8, song='Knocking On Forbidden Doors', status=200, ts=1538352416000, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.103 Safari/537.36"', userId='9', hour='0', weekday='1', day='1'),
 Row(artist='Daft Punk', auth='Logged In', firstName='Colin', gender='M', itemInSession=52, lastName='Freeman', length=223.60771, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Harder Better Faster Stronger', status=200, ts=1538352676000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30', hour='0', weekday='1', day='1')]
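
Two details of the conversion above are worth spelling out. `datetime.fromtimestamp` without a time zone uses the local zone of the machine running the code, and a UDF declared without a return type yields string columns, which is why the rows show `hour='0'` rather than `hour=0`. The sketch below is a plain-Python variant, not the notebook's code: the helper name `ts_to_parts` is hypothetical, it pins the zone to UTC (an assumption that happens to match the values printed above), and it returns integers.

```python
from datetime import datetime, timezone


def ts_to_parts(ts_ms):
    """Convert an epoch timestamp in milliseconds to (hour, weekday, day) in UTC.

    weekday follows strftime("%w"): 0 is Sunday, 6 is Saturday.
    """
    dt = datetime.fromtimestamp(ts_ms / 1000.0, tz=timezone.utc)
    return dt.hour, int(dt.strftime("%w")), dt.day


# ts of the first event in the log: 2018-10-01 00:01:57 UTC, a Monday
print(ts_to_parts(1538352117000))  # prints (0, 1, 1)
```

If such a helper were wrapped in a Spark UDF, passing `IntegerType()` as the return type for each extracted field would keep the resulting columns numeric.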

Convert gender and level to numeric values

tran_gender= udf(lambda x:1 if x=="M" else 0, IntegerType())
df=df.withColumn("gender", tran_gender("gender"))
tran_level= udf(lambda x:1 if x=="paid" else 0, IntegerType())
df= df.withColumn("level", tran_level("level"))

Data exploration

Define churned users

df.filter("page = 'Cancellation Confirmation'").show()  # view churned users and the log entry that records each cancellation
+------+---------+---------+------+-------------+---------+------+-----+--------------------+------+--------------------+-------------+---------+----+------+-------------+--------------------+------+----+-------+---+
|artist|     auth|firstName|gender|itemInSession| lastName|length|level|            location|method|                page| registration|sessionId|song|status|           ts|           userAgent|userId|hour|weekday|day|
+------+---------+---------+------+-------------+---------+------+-----+--------------------+------+--------------------+-------------+---------+----+------+-------------+--------------------+------+----+-------+---+
|  null|Cancelled|   Adriel|     1|          104|  Mendoza|  null|    1|  Kansas City, MO-KS|   GET|Cancellation Conf...|1535623466000|      514|null|   200|1538943990000|"Mozilla/5.0 (Mac...|    18|  20|      0|  7|
|  null|Cancelled|    Diego|     1|           56|    Mckee|  null|    1|Phoenix-Mesa-Scot...|   GET|Cancellation Conf...|1537167593000|      540|null|   200|1539033046000|"Mozilla/5.0 (iPh...|    32|  21|      1|  8|
|  null|Cancelled|    Mason|     1|           10|     Hart|  null|    0|  Corpus Christi, TX|   GET|Cancellation Conf...|1533157139000|      174|null|   200|1539318918000|"Mozilla/5.0 (Mac...|   125|   4|      5| 12|
|  null|Cancelled|Alexander|     1|          332|   Garcia|  null|    1|Indianapolis-Carm...|   GET|Cancellation Conf...|1536817381000|      508|null|   200|1539375441000|Mozilla/5.0 (Wind...|   105|  20|      5| 12|
|  null|Cancelled|    Kayla|     0|          273|  Johnson|  null|    1|Philadelphia-Camd...|   GET|Cancellation Conf...|1538333829000|      797|null|   200|1539465584000|Mozilla/5.0 (Wind...|    17|  21|      6| 13|
|  null|Cancelled|    Molly|     0|           29| Harrison|  null|    0|Virginia Beach-No...|   GET|Cancellation Conf...|1534255113000|      843|null|   200|1539588854000|"Mozilla/5.0 (Mac...|   143|   7|      1| 15|
|  null|Cancelled|     Alex|     1|          145|    Hogan|  null|    1|Denver-Aurora-Lak...|   GET|Cancellation Conf...|1535066380000|      842|null|   200|1539729037000|Mozilla/5.0 (Wind...|   101|  22|      2| 16|
|  null|Cancelled|    Davis|     1|           34|     Wang|  null|    1|           Flint, MI|   GET|Cancellation Conf...|1538289776000|      802|null|   200|1539736161000|"Mozilla/5.0 (Win...|   129|   0|      3| 17|
|  null|Cancelled|  Nikolas|     1|          287|    Olsen|  null|    1|Oxnard-Thousand O...|   GET|Cancellation Conf...|1528403713000|      881|null|   200|1539759749000|Mozilla/5.0 (X11;...|   121|   7|      3| 17|
|  null|Cancelled|    Ethan|     1|          176|  Johnson|  null|    1|Lexington-Fayette...|   GET|Cancellation Conf...|1538080987000|      934|null|   200|1539761972000|"Mozilla/5.0 (Win...|    51|   7|      3| 17|
|  null|Cancelled|Christian|     1|          100| Robinson|  null|    1|       Quincy, IL-MO|   GET|Cancellation Conf...|1534942082000|     1092|null|   200|1540050556000|"Mozilla/5.0 (Win...|    87|  15|      6| 20|
|  null|Cancelled|    Molly|     0|           43|Patterson|  null|    1|   Memphis, TN-MS-AR|   GET|Cancellation Conf...|1535498705000|     1029|null|   200|1540062068000|Mozilla/5.0 (X11;...|   122|  19|      6| 20|
|  null|Cancelled|   Sophia|     0|           72|    Perry|  null|    1|Los Angeles-Long ...|   GET|Cancellation Conf...|1533885783000|     1072|null|   200|1540193374000|Mozilla/5.0 (Wind...|    12|   7|      1| 22|
|  null|Cancelled|    Erick|     1|           48|   Brooks|  null|    1|           Selma, AL|   GET|Cancellation Conf...|1537956751000|     1112|null|   200|1540223006000|"Mozilla/5.0 (Win...|    58|  15|      1| 22|
|  null|Cancelled|   Rachel|     0|           11|   Bailey|  null|    1|Albany-Schenectad...|   GET|Cancellation Conf...|1536102943000|     1059|null|   200|1540402387000|Mozilla/5.0 (Wind...|    73|  17|      3| 24|
|  null|Cancelled|  Jeffery|     1|           46|  Wheeler|  null|    1|         Bozeman, MT|   GET|Cancellation Conf...|1533886191000|     1324|null|   200|1540875543000|"Mozilla/5.0 (Win...|     3|   4|      2| 30|
|  null|Cancelled|   Sophia|     0|           18|      Key|  null|    1|Los Angeles-Long ...|   GET|Cancellation Conf...|1537679535000|     1383|null|   200|1541166424000|"Mozilla/5.0 (Mac...|   106|  13|      5|  2|
|  null|Cancelled|    Piper|     0|            8|  Nielsen|  null|    1|New York-Newark-J...|   GET|Cancellation Conf...|1537699856000|     1583|null|   200|1541340091000|"Mozilla/5.0 (Mac...|   103|  14|      0|  4|
|  null|Cancelled|   Teagan|     0|          306|  Roberts|  null|    1|New Philadelphia-...|   GET|Cancellation Conf...|1537634865000|     1519|null|   200|1541463632000|Mozilla/5.0 (Wind...|    28|   0|      2|  6|
|  null|Cancelled|    Alexi|     0|           42|   Warren|  null|    1|Spokane-Spokane V...|   GET|Cancellation Conf...|1532482662000|     1819|null|   200|1542051608000|Mozilla/5.0 (Wind...|    54|  19|      1| 12|
+------+---------+---------+------+-------------+---------+------+-----+--------------------+------+--------------------+-------------+---------+----+------+-------------+--------------------+------+----+-------+---+
only showing top 20 rows
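
The labeling rule implied by this definition can be sketched in plain Python: a user is labeled 1 if any of their events is a `Cancellation Confirmation` page view, and 0 otherwise. This is an illustration on hypothetical events, not the Spark code used in the project; `label_churn` and the sample user ids are made up.

```python
def label_churn(events):
    """events: iterable of dicts with 'userId' and 'page' keys.

    Returns {userId: 1 if the user has a cancellation event, else 0}.
    """
    labels = {}
    for event in events:
        user_id = event["userId"]
        churned = event["page"] == "Cancellation Confirmation"
        # once a user is marked as churned, later events never reset the label
        labels[user_id] = max(labels.get(user_id, 0), int(churned))
    return labels


# Hypothetical events for two users
events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u2", "page": "Thumbs Up"},
]
print(label_churn(events))  # prints {'u1': 1, 'u2': 0}
```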
# view a user's activity log in the period before they churned
df.select(["userId", "firstname", "page", "level", "song","sessionId","itemInSession","day"]).where(df.userId == "28").collect()[-100:]
[Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Songs That Make A Difference', sessionId=1519, itemInSession=205, day='5'),
 Row(userId='28', firstname='Teagan', page='Thumbs Down', level=1, song=None, sessionId=1519, itemInSession=206, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Pages In Blood', sessionId=1519, itemInSession=207, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Capitol City', sessionId=1519, itemInSession=208, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Dressed To Digress', sessionId=1519, itemInSession=209, day='5'),
 Row(userId='28', firstname='Teagan', page='Thumbs Down', level=1, song=None, sessionId=1519, itemInSession=210, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Agua', sessionId=1519, itemInSession=211, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Bohemian Forest', sessionId=1519, itemInSession=212, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Not Exactly', sessionId=1519, itemInSession=213, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Me And You Blues', sessionId=1519, itemInSession=214, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='By Your Side', sessionId=1519, itemInSession=215, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Tears Of The Dragon', sessionId=1519, itemInSession=216, day='5'),
 Row(userId='28', firstname='Teagan', page='Thumbs Up', level=1, song=None, sessionId=1519, itemInSession=217, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='El Salvador', sessionId=1519, itemInSession=218, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Volcan', sessionId=1519, itemInSession=219, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='A Dustland Fairytale', sessionId=1519, itemInSession=220, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Tulinesangala', sessionId=1519, itemInSession=221, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='So Jersey', sessionId=1519, itemInSession=222, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Sala De RecepÃ\x83§Ã\x83£o', sessionId=1519, itemInSession=223, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song="You Know You're Right", sessionId=1519, itemInSession=224, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Some Swedish Trees', sessionId=1519, itemInSession=225, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Bewildered', sessionId=1519, itemInSession=226, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Tudo De Voce', sessionId=1519, itemInSession=227, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='The Cross [Live Audio]', sessionId=1519, itemInSession=228, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Peculiar', sessionId=1519, itemInSession=229, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Berlin', sessionId=1519, itemInSession=230, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Walls Of Huntsville', sessionId=1519, itemInSession=231, day='5'),
 Row(userId='28', firstname='Teagan', page='Thumbs Up', level=1, song=None, sessionId=1519, itemInSession=232, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Prefab', sessionId=1519, itemInSession=233, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='The Life', sessionId=1519, itemInSession=234, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Gears', sessionId=1519, itemInSession=235, day='5'),
 Row(userId='28', firstname='Teagan', page='Downgrade', level=1, song=None, sessionId=1519, itemInSession=236, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Habalim', sessionId=1519, itemInSession=237, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Too Young', sessionId=1519, itemInSession=238, day='5'),
 Row(userId='28', firstname='Teagan', page='Home', level=1, song=None, sessionId=1519, itemInSession=239, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Blood Bank', sessionId=1519, itemInSession=240, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Revelry', sessionId=1519, itemInSession=241, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Take A Walk Around The Table', sessionId=1519, itemInSession=242, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Picture', sessionId=1519, itemInSession=243, day='5'),
 Row(userId='28', firstname='Teagan', page='Thumbs Up', level=1, song=None, sessionId=1519, itemInSession=244, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Easier Said Than Done (Let It Happen Album Version)', sessionId=1519, itemInSession=245, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Walking On A Dream', sessionId=1519, itemInSession=246, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Living Proof', sessionId=1519, itemInSession=247, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Every Little Thing (Album Version)', sessionId=1519, itemInSession=248, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Mykonos', sessionId=1519, itemInSession=249, day='5'),
 Row(userId='28', firstname='Teagan', page='Downgrade', level=1, song=None, sessionId=1519, itemInSession=250, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='FrÃ\x83¡gil', sessionId=1519, itemInSession=251, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Innocent Bones (Album)', sessionId=1519, itemInSession=252, day='5'),
 Row(userId='28', firstname='Teagan', page='Downgrade', level=1, song=None, sessionId=1519, itemInSession=253, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='The God I Know', sessionId=1519, itemInSession=254, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Clones', sessionId=1519, itemInSession=255, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song="You'll Be In My Heart", sessionId=1519, itemInSession=256, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song="You're The One", sessionId=1519, itemInSession=257, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='ReprÃ\x83©sente', sessionId=1519, itemInSession=258, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song="All These Things That I've Done", sessionId=1519, itemInSession=259, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Rosemary', sessionId=1519, itemInSession=260, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song="Le vol d'un ange", sessionId=1519, itemInSession=261, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='End Of The World', sessionId=1519, itemInSession=262, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Mr. Weatherman', sessionId=1519, itemInSession=263, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Open All Night', sessionId=1519, itemInSession=264, day='5'),
 Row(userId='28', firstname='Teagan', page='Downgrade', level=1, song=None, sessionId=1519, itemInSession=265, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Ruska', sessionId=1519, itemInSession=266, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Fools', sessionId=1519, itemInSession=267, day='5'),
 Row(userId='28', firstname='Teagan', page='Thumbs Up', level=1, song=None, sessionId=1519, itemInSession=268, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Almaz', sessionId=1519, itemInSession=269, day='5'),
 Row(userId='28', firstname='Teagan', page='Logout', level=1, song=None, sessionId=1519, itemInSession=270, day='5'),
 Row(userId='28', firstname='Teagan', page='Home', level=1, song=None, sessionId=1519, itemInSession=273, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Hey_ Soul Sister', sessionId=1519, itemInSession=274, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Till The Sky Falls Down', sessionId=1519, itemInSession=275, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Our Song', sessionId=1519, itemInSession=276, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Friday (Instrumental Album Version)', sessionId=1519, itemInSession=277, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Blues In G', sessionId=1519, itemInSession=278, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Sense_ Sensibility', sessionId=1519, itemInSession=279, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Sleep In The Garden', sessionId=1519, itemInSession=280, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='We All Die One Day', sessionId=1519, itemInSession=281, day='5'),
 Row(userId='28', firstname='Teagan', page='Thumbs Down', level=1, song=None, sessionId=1519, itemInSession=282, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Mr. Jones', sessionId=1519, itemInSession=283, day='5'),
 Row(userId='28', firstname='Teagan', page='Thumbs Up', level=1, song=None, sessionId=1519, itemInSession=284, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Nah Let Go', sessionId=1519, itemInSession=285, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='No Rest For The Wicked', sessionId=1519, itemInSession=286, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Cherry Blossom Girl (Radio Mix)', sessionId=1519, itemInSession=287, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Milk', sessionId=1519, itemInSession=288, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Broken Open', sessionId=1519, itemInSession=289, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Sound And Vision (1999 Digital Remaster)', sessionId=1519, itemInSession=290, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song="Ain't Misbehavin", sessionId=1519, itemInSession=291, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Love Of My Life (1993 Digital Remaster)', sessionId=1519, itemInSession=292, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Home', sessionId=1519, itemInSession=293, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song="I Keep Going Back To Joe's", sessionId=1519, itemInSession=294, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Le Million', sessionId=1519, itemInSession=295, day='5'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Free Style (feat. Kevo_ Mussilini & Lyrical 187)', sessionId=1519, itemInSession=296, day='6'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Drive Slow', sessionId=1519, itemInSession=297, day='6'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Stupid Girl', sessionId=1519, itemInSession=298, day='6'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Abstract Art (feat. NO)', sessionId=1519, itemInSession=299, day='6'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Como Duele (Album)', sessionId=1519, itemInSession=300, day='6'),
 Row(userId='28', firstname='Teagan', page='Home', level=1, song=None, sessionId=1519, itemInSession=301, day='6'),
 Row(userId='28', firstname='Teagan', page='NextSong', level=1, song='Back In The Day', sessionId=1519, itemInSession=302, day='6'),
 Row(userId='28', firstname='Teagan', page='Downgrade', level=1, song=None, sessionId=1519, itemInSession=303, day='6'),
 Row(userId='28', firstname='Teagan', page='Downgrade', level=1, song=None, sessionId=1519, itemInSession=304, day='6'),
 Row(userId='28', firstname='Teagan', page='Cancel', level=1, song=None, sessionId=1519, itemInSession=305, day='6'),
 Row(userId='28', firstname='Teagan', page='Cancellation Confirmation', level=1, song=None, sessionId=1519, itemInSession=306, day='6')]
churn_users=df.filter(df.page=="Cancellation Confirmation").select("userId").dropDuplicates()
churn_users_list=[(row['userId']) for row in churn_users.collect()]
churn_users_list
['125',
 '51',
 '54',
 '100014',
 '101',
 '29',
 '100021',
 '87',
 '73',
 '3',
 '28',
 '100022',
 '100025',
 '300007',
 '100006',
 '18',
 '70',
 '100005',
 '17',
 '100007',
 '300001',
 '100009',
 '100015',
 '200024',
 '100003',
 '103',
 '100024',
 '53',
 '122',
 '200017',
 '58',
 '100011',
 '100019',
 '100012',
 '200018',
 '200016',
 '200020',
 '106',
 '143',
 '32',
 '200001',
 '105',
 '200011',
 '100023',
 '100013',
 '100017',
 '121',
 '12',
 '200015',
 '129',
 '200021',
 '100001']
# Create a Churn column that flags every event of users who later confirmed a cancellation
flag_CancellationConfirmation_event = udf(lambda x: 1 if x in churn_users_list else 0, IntegerType())
df= df.withColumn("Churn", flag_CancellationConfirmation_event("userId"))

df.take(5)
[Row(artist='Martha Tilston', auth='Logged In', firstName='Colin', gender=1, itemInSession=50, lastName='Freeman', length=277.89016, level=1, location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Rockpools', status=200, ts=1538352117000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30', hour='0', weekday='1', day='1', Churn=0),
 Row(artist='Five Iron Frenzy', auth='Logged In', firstName='Micah', gender=1, itemInSession=79, lastName='Long', length=236.09424, level=0, location='Boston-Cambridge-Newton, MA-NH', method='PUT', page='NextSong', registration=1538331630000, sessionId=8, song='Canada', status=200, ts=1538352180000, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.103 Safari/537.36"', userId='9', hour='0', weekday='1', day='1', Churn=0),
 Row(artist='Adam Lambert', auth='Logged In', firstName='Colin', gender=1, itemInSession=51, lastName='Freeman', length=282.8273, level=1, location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Time For Miracles', status=200, ts=1538352394000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30', hour='0', weekday='1', day='1', Churn=0),
 Row(artist='Enigma', auth='Logged In', firstName='Micah', gender=1, itemInSession=80, lastName='Long', length=262.71302, level=0, location='Boston-Cambridge-Newton, MA-NH', method='PUT', page='NextSong', registration=1538331630000, sessionId=8, song='Knocking On Forbidden Doors', status=200, ts=1538352416000, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.103 Safari/537.36"', userId='9', hour='0', weekday='1', day='1', Churn=0),
 Row(artist='Daft Punk', auth='Logged In', firstName='Colin', gender=1, itemInSession=52, lastName='Freeman', length=223.60771, level=1, location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Harder Better Faster Stronger', status=200, ts=1538352676000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30', hour='0', weekday='1', day='1', Churn=0)]
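
The labelling step above boils down to a set-membership test on userId. Below is a minimal plain-Python sketch of the same logic on a few made-up events. The `events` list and the helper name `label_churn` are illustrative assumptions and not part of the original notebook.

```python
def label_churn(events):
    """Return a copy of events with a Churn flag (1 if the user ever cancelled)."""
    # Users who reached the cancellation confirmation page at least once.
    churned = {e["userId"] for e in events if e["page"] == "Cancellation Confirmation"}
    # Every event of such a user is flagged, including events before the cancellation.
    return [dict(e, Churn=1 if e["userId"] in churned else 0) for e in events]


# Hypothetical toy log, not taken from the dataset.
events = [
    {"userId": "28", "page": "NextSong"},
    {"userId": "30", "page": "NextSong"},
    {"userId": "28", "page": "Cancellation Confirmation"},
]
labelled = label_churn(events)
print([e["Churn"] for e in labelled])
```

Because the flag is attached per event rather than per user, most rows carry Churn=0 even when a noticeable share of users churned. In Spark, `Column.isin(churn_users_list)` cast to an integer is a built-in alternative that may avoid the overhead of a Python UDF.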
df.select("Churn").show(100) # most rows belong to users who did not cancel
+-----+
|Churn|
+-----+
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    1|
|    0|
|    0|
|    0|
|    1|
|    0|
|    0|
|    0|
|    1|
|    0|
|    0|
|    0|
|    1|
|    1|
|    0|
|    0|
|    0|
|    1|
|    0|
|    0|
|    0|
|    1|
|    1|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
+-----+
only showing top 100 rows
df.describe("itemInSession").show()
+-------+------------------+
|summary|     itemInSession|
+-------+------------------+
|  count|            278154|
|   mean|114.89918174824018|
| stddev|  129.851729399489|
|    min|                 0|
|    max|              1321|
+-------+------------------+

Inspecting key user actions: upgrades and downgrades

df.select(["userId", "page","sessionId"]).where(df.page=="Submit Downgrade").sort(["userId"]).collect()
[Row(userId='100', page='Submit Downgrade', sessionId=1590),
 Row(userId='100004', page='Submit Downgrade', sessionId=112),
 Row(userId='100004', page='Submit Downgrade', sessionId=147),
 Row(userId='100008', page='Submit Downgrade', sessionId=132),
 Row(userId='100009', page='Submit Downgrade', sessionId=126),
 Row(userId='100012', page='Submit Downgrade', sessionId=94),
 Row(userId='100015', page='Submit Downgrade', sessionId=121),
 Row(userId='100016', page='Submit Downgrade', sessionId=190),
 Row(userId='100018', page='Submit Downgrade', sessionId=67),
 Row(userId='100018', page='Submit Downgrade', sessionId=202),
 Row(userId='100025', page='Submit Downgrade', sessionId=123),
 Row(userId='103', page='Submit Downgrade', sessionId=865),
 Row(userId='109', page='Submit Downgrade', sessionId=1705),
 Row(userId='11', page='Submit Downgrade', sessionId=487),
 Row(userId='12', page='Submit Downgrade', sessionId=632),
 Row(userId='13', page='Submit Downgrade', sessionId=1695),
 Row(userId='13', page='Submit Downgrade', sessionId=1695),
 Row(userId='131', page='Submit Downgrade', sessionId=249),
 Row(userId='131', page='Submit Downgrade', sessionId=2041),
 Row(userId='140', page='Submit Downgrade', sessionId=753),
 Row(userId='140', page='Submit Downgrade', sessionId=1643),
 Row(userId='140', page='Submit Downgrade', sessionId=1918),
 Row(userId='141', page='Submit Downgrade', sessionId=479),
 Row(userId='20', page='Submit Downgrade', sessionId=378),
 Row(userId='20', page='Submit Downgrade', sessionId=1875),
 Row(userId='200003', page='Submit Downgrade', sessionId=28),
 Row(userId='200003', page='Submit Downgrade', sessionId=283),
 Row(userId='200009', page='Submit Downgrade', sessionId=251),
 Row(userId='200011', page='Submit Downgrade', sessionId=238),
 Row(userId='200019', page='Submit Downgrade', sessionId=148),
 Row(userId='200020', page='Submit Downgrade', sessionId=222),
 Row(userId='200023', page='Submit Downgrade', sessionId=69),
 Row(userId='200023', page='Submit Downgrade', sessionId=295),
 Row(userId='200025', page='Submit Downgrade', sessionId=115),
 Row(userId='24', page='Submit Downgrade', sessionId=1317),
 Row(userId='25', page='Submit Downgrade', sessionId=1718),
 Row(userId='30', page='Submit Downgrade', sessionId=532),
 Row(userId='300002', page='Submit Downgrade', sessionId=220),
 Row(userId='300004', page='Submit Downgrade', sessionId=267),
 Row(userId='300011', page='Submit Downgrade', sessionId=362),
 Row(userId='300015', page='Submit Downgrade', sessionId=69),
 Row(userId='300021', page='Submit Downgrade', sessionId=144),
 Row(userId='300023', page='Submit Downgrade', sessionId=459),
 Row(userId='35', page='Submit Downgrade', sessionId=812),
 Row(userId='35', page='Submit Downgrade', sessionId=1704),
 Row(userId='38', page='Submit Downgrade', sessionId=313),
 Row(userId='39', page='Submit Downgrade', sessionId=1546),
 Row(userId='39', page='Submit Downgrade', sessionId=1609),
 Row(userId='39', page='Submit Downgrade', sessionId=1985),
 Row(userId='49', page='Submit Downgrade', sessionId=1744),
 Row(userId='54', page='Submit Downgrade', sessionId=859),
 Row(userId='59', page='Submit Downgrade', sessionId=510),
 Row(userId='61', page='Submit Downgrade', sessionId=529),
 Row(userId='61', page='Submit Downgrade', sessionId=2345),
 Row(userId='74', page='Submit Downgrade', sessionId=1101),
 Row(userId='77', page='Submit Downgrade', sessionId=688),
 Row(userId='81', page='Submit Downgrade', sessionId=725),
 Row(userId='85', page='Submit Downgrade', sessionId=734),
 Row(userId='85', page='Submit Downgrade', sessionId=1183),
 Row(userId='9', page='Submit Downgrade', sessionId=1276),
 Row(userId='92', page='Submit Downgrade', sessionId=2125),
 Row(userId='95', page='Submit Downgrade', sessionId=826),
 Row(userId='96', page='Submit Downgrade', sessionId=1689)]
df.select(["userId", "page","sessionId"]).where(df.page=="Submit Upgrade").sort(["userId"]).collect()
[Row(userId='100', page='Submit Upgrade', sessionId=1899),
 Row(userId='100004', page='Submit Upgrade', sessionId=108),
 Row(userId='100004', page='Submit Upgrade', sessionId=112),
 Row(userId='100004', page='Submit Upgrade', sessionId=208),
 Row(userId='100009', page='Submit Upgrade', sessionId=85),
 Row(userId='100012', page='Submit Upgrade', sessionId=59),
 Row(userId='100013', page='Submit Upgrade', sessionId=62),
 Row(userId='100015', page='Submit Upgrade', sessionId=79),
 Row(userId='100015', page='Submit Upgrade', sessionId=162),
 Row(userId='100018', page='Submit Upgrade', sessionId=67),
 Row(userId='100018', page='Submit Upgrade', sessionId=114),
 Row(userId='100023', page='Submit Upgrade', sessionId=61),
 Row(userId='101', page='Submit Upgrade', sessionId=312),
 Row(userId='103', page='Submit Upgrade', sessionId=865),
 Row(userId='103', page='Submit Upgrade', sessionId=916),
 Row(userId='104', page='Submit Upgrade', sessionId=1103),
 Row(userId='105', page='Submit Upgrade', sessionId=104),
 Row(userId='106', page='Submit Upgrade', sessionId=429),
 Row(userId='108', page='Submit Upgrade', sessionId=1228),
 Row(userId='109', page='Submit Upgrade', sessionId=108),
 Row(userId='11', page='Submit Upgrade', sessionId=487),
 Row(userId='11', page='Submit Upgrade', sessionId=1824),
 Row(userId='111', page='Submit Upgrade', sessionId=1416),
 Row(userId='113', page='Submit Upgrade', sessionId=366),
 Row(userId='114', page='Submit Upgrade', sessionId=1013),
 Row(userId='115', page='Submit Upgrade', sessionId=316),
 Row(userId='118', page='Submit Upgrade', sessionId=601),
 Row(userId='12', page='Submit Upgrade', sessionId=379),
 Row(userId='12', page='Submit Upgrade', sessionId=910),
 Row(userId='121', page='Submit Upgrade', sessionId=719),
 Row(userId='122', page='Submit Upgrade', sessionId=1029),
 Row(userId='126', page='Submit Upgrade', sessionId=988),
 Row(userId='128', page='Submit Upgrade', sessionId=834),
 Row(userId='129', page='Submit Upgrade', sessionId=536),
 Row(userId='13', page='Submit Upgrade', sessionId=1524),
 Row(userId='13', page='Submit Upgrade', sessionId=1695),
 Row(userId='131', page='Submit Upgrade', sessionId=249),
 Row(userId='132', page='Submit Upgrade', sessionId=131),
 Row(userId='136', page='Submit Upgrade', sessionId=855),
 Row(userId='137', page='Submit Upgrade', sessionId=408),
 Row(userId='138', page='Submit Upgrade', sessionId=1033),
 Row(userId='139', page='Submit Upgrade', sessionId=172),
 Row(userId='140', page='Submit Upgrade', sessionId=492),
 Row(userId='140', page='Submit Upgrade', sessionId=1056),
 Row(userId='140', page='Submit Upgrade', sessionId=1681),
 Row(userId='140', page='Submit Upgrade', sessionId=2097),
 Row(userId='141', page='Submit Upgrade', sessionId=479),
 Row(userId='142', page='Submit Upgrade', sessionId=141),
 Row(userId='147', page='Submit Upgrade', sessionId=1851),
 Row(userId='152', page='Submit Upgrade', sessionId=1919),
 Row(userId='153', page='Submit Upgrade', sessionId=1719),
 Row(userId='155', page='Submit Upgrade', sessionId=1562),
 Row(userId='16', page='Submit Upgrade', sessionId=185),
 Row(userId='17', page='Submit Upgrade', sessionId=563),
 Row(userId='20', page='Submit Upgrade', sessionId=947),
 Row(userId='20', page='Submit Upgrade', sessionId=1948),
 Row(userId='200002', page='Submit Upgrade', sessionId=163),
 Row(userId='200003', page='Submit Upgrade', sessionId=228),
 Row(userId='200005', page='Submit Upgrade', sessionId=277),
 Row(userId='200008', page='Submit Upgrade', sessionId=32),
 Row(userId='200009', page='Submit Upgrade', sessionId=183),
 Row(userId='200011', page='Submit Upgrade', sessionId=134),
 Row(userId='200014', page='Submit Upgrade', sessionId=361),
 Row(userId='200017', page='Submit Upgrade', sessionId=206),
 Row(userId='200019', page='Submit Upgrade', sessionId=19),
 Row(userId='200020', page='Submit Upgrade', sessionId=204),
 Row(userId='200021', page='Submit Upgrade', sessionId=137),
 Row(userId='200023', page='Submit Upgrade', sessionId=37),
 Row(userId='200023', page='Submit Upgrade', sessionId=216),
 Row(userId='200023', page='Submit Upgrade', sessionId=410),
 Row(userId='200024', page='Submit Upgrade', sessionId=197),
 Row(userId='200025', page='Submit Upgrade', sessionId=115),
 Row(userId='200025', page='Submit Upgrade', sessionId=282),
 Row(userId='23', page='Submit Upgrade', sessionId=2056),
 Row(userId='24', page='Submit Upgrade', sessionId=1149),
 Row(userId='25', page='Submit Upgrade', sessionId=2039),
 Row(userId='26', page='Submit Upgrade', sessionId=2395),
 Row(userId='28', page='Submit Upgrade', sessionId=1367),
 Row(userId='29', page='Submit Upgrade', sessionId=589),
 Row(userId='30', page='Submit Upgrade', sessionId=2244),
 Row(userId='300001', page='Submit Upgrade', sessionId=1),
 Row(userId='300002', page='Submit Upgrade', sessionId=43),
 Row(userId='300002', page='Submit Upgrade', sessionId=224),
 Row(userId='300004', page='Submit Upgrade', sessionId=27),
 Row(userId='300005', page='Submit Upgrade', sessionId=30),
 Row(userId='300006', page='Submit Upgrade', sessionId=410),
 Row(userId='300009', page='Submit Upgrade', sessionId=9),
 Row(userId='300011', page='Submit Upgrade', sessionId=54),
 Row(userId='300011', page='Submit Upgrade', sessionId=412),
 Row(userId='300012', page='Submit Upgrade', sessionId=74),
 Row(userId='300014', page='Submit Upgrade', sessionId=61),
 Row(userId='300015', page='Submit Upgrade', sessionId=15),
 Row(userId='300015', page='Submit Upgrade', sessionId=109),
 Row(userId='300016', page='Submit Upgrade', sessionId=16),
 Row(userId='300018', page='Submit Upgrade', sessionId=127),
 Row(userId='300019', page='Submit Upgrade', sessionId=42),
 Row(userId='300021', page='Submit Upgrade', sessionId=144),
 Row(userId='300023', page='Submit Upgrade', sessionId=479),
 Row(userId='300025', page='Submit Upgrade', sessionId=63),
 Row(userId='32', page='Submit Upgrade', sessionId=540),
 Row(userId='35', page='Submit Upgrade', sessionId=677),
 Row(userId='35', page='Submit Upgrade', sessionId=1181),
 Row(userId='36', page='Submit Upgrade', sessionId=1738),
 Row(userId='37', page='Submit Upgrade', sessionId=893),
 Row(userId='38', page='Submit Upgrade', sessionId=1196),
 Row(userId='39', page='Submit Upgrade', sessionId=619),
 Row(userId='39', page='Submit Upgrade', sessionId=1546),
 Row(userId='39', page='Submit Upgrade', sessionId=1634),
 Row(userId='39', page='Submit Upgrade', sessionId=2459),
 Row(userId='4', page='Submit Upgrade', sessionId=1094),
 Row(userId='40', page='Submit Upgrade', sessionId=744),
 Row(userId='42', page='Submit Upgrade', sessionId=543),
 Row(userId='44', page='Submit Upgrade', sessionId=183),
 Row(userId='45', page='Submit Upgrade', sessionId=1045),
 Row(userId='46', page='Submit Upgrade', sessionId=1008),
 Row(userId='49', page='Submit Upgrade', sessionId=2332),
 Row(userId='50', page='Submit Upgrade', sessionId=1578),
 Row(userId='52', page='Submit Upgrade', sessionId=1407),
 Row(userId='53', page='Submit Upgrade', sessionId=1174),
 Row(userId='54', page='Submit Upgrade', sessionId=1316),
 Row(userId='55', page='Submit Upgrade', sessionId=1830),
 Row(userId='56', page='Submit Upgrade', sessionId=1510),
 Row(userId='58', page='Submit Upgrade', sessionId=325),
 Row(userId='59', page='Submit Upgrade', sessionId=455),
 Row(userId='59', page='Submit Upgrade', sessionId=899),
 Row(userId='6', page='Submit Upgrade', sessionId=483),
 Row(userId='60', page='Submit Upgrade', sessionId=59),
 Row(userId='61', page='Submit Upgrade', sessionId=529),
 Row(userId='61', page='Submit Upgrade', sessionId=1441),
 Row(userId='65', page='Submit Upgrade', sessionId=526),
 Row(userId='66', page='Submit Upgrade', sessionId=1169),
 Row(userId='67', page='Submit Upgrade', sessionId=791),
 Row(userId='69', page='Submit Upgrade', sessionId=414),
 Row(userId='70', page='Submit Upgrade', sessionId=376),
 Row(userId='71', page='Submit Upgrade', sessionId=1142),
 Row(userId='73', page='Submit Upgrade', sessionId=72),
 Row(userId='74', page='Submit Upgrade', sessionId=924),
 Row(userId='74', page='Submit Upgrade', sessionId=1498),
 Row(userId='77', page='Submit Upgrade', sessionId=1057),
 Row(userId='79', page='Submit Upgrade', sessionId=1921),
 Row(userId='81', page='Submit Upgrade', sessionId=952),
 Row(userId='82', page='Submit Upgrade', sessionId=1322),
 Row(userId='83', page='Submit Upgrade', sessionId=1472),
 Row(userId='85', page='Submit Upgrade', sessionId=654),
 Row(userId='85', page='Submit Upgrade', sessionId=1098),
 Row(userId='85', page='Submit Upgrade', sessionId=1303),
 Row(userId='86', page='Submit Upgrade', sessionId=1855),
 Row(userId='87', page='Submit Upgrade', sessionId=1036),
 Row(userId='88', page='Submit Upgrade', sessionId=1065),
 Row(userId='89', page='Submit Upgrade', sessionId=636),
 Row(userId='9', page='Submit Upgrade', sessionId=367),
 Row(userId='9', page='Submit Upgrade', sessionId=1347),
 Row(userId='91', page='Submit Upgrade', sessionId=693),
 Row(userId='92', page='Submit Upgrade', sessionId=501),
 Row(userId='93', page='Submit Upgrade', sessionId=2329),
 Row(userId='96', page='Submit Upgrade', sessionId=2300),
 Row(userId='97', page='Submit Upgrade', sessionId=1359),
 Row(userId='98', page='Submit Upgrade', sessionId=1139),
 Row(userId='99', page='Submit Upgrade', sessionId=699)]

Most frequently visited pages

df.groupBy('Page').count().sort("count", ascending=False).show()
+--------------------+------+
|                Page| count|
+--------------------+------+
|            NextSong|228108|
|           Thumbs Up| 12551|
|                Home| 10082|
|     Add to Playlist|  6526|
|          Add Friend|  4277|
|         Roll Advert|  3933|
|              Logout|  3226|
|         Thumbs Down|  2546|
|           Downgrade|  2055|
|            Settings|  1514|
|                Help|  1454|
|             Upgrade|   499|
|               About|   495|
|       Save Settings|   310|
|               Error|   252|
|      Submit Upgrade|   159|
|    Submit Downgrade|    63|
|              Cancel|    52|
|Cancellation Conf...|    52|
+--------------------+------+

Churn distribution by gender

df_pd=df.dropDuplicates(["userId", "gender"]).groupby(["Churn", "gender"]).count().sort("Churn").toPandas()
sns.barplot(x='Churn', y='count', hue='gender', data=df_pd)
# Encoding: gender 1 = male, level 1 = paid, Churn 1 = churned
[Figure: bar chart of user counts per Churn value, split by gender]

Churn distribution by subscription level

df_pd=df.dropDuplicates(["userId", "level"]).groupby(["Churn", "level"]).count().sort("Churn").toPandas()
sns.barplot(x='Churn', y='count', hue='level', data=df_pd)
[Figure: bar chart of user counts per Churn value, split by level]
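
One detail of the deduplication on (userId, level) is worth spelling out: a user who appears under both the free and the paid level during the log period is counted once in each level. The sketch below reproduces that counting in plain Python on invented rows. The `rows` data is hypothetical and only serves to show the effect.

```python
from collections import Counter

# Hypothetical (userId, level, Churn) triples from an event log; level 1 = paid.
rows = [
    ("28", 0, 1), ("28", 1, 1), ("28", 1, 1),   # user 28 was free, then paid
    ("30", 1, 0), ("30", 1, 0),
    ("9", 0, 0),
]

# Same idea as dropDuplicates(["userId", "level"]) followed by
# groupby(["Churn", "level"]).count(): one row per distinct (user, level) pair.
unique_pairs = {(user, level, churn) for user, level, churn in rows}
counts = Counter((churn, level) for _, level, churn in unique_pairs)
distinct_users = {user for user, _, _ in rows}

print(sorted(counts.items()), len(distinct_users))
```

Here the bars add up to four although there are only three distinct users, so the level chart is best read as "users who were ever at this level" rather than as a partition of the user base.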

Locations with the most churned users

df.select("userId","Churn","location").dropDuplicates().groupby(["location","Churn"]).count().sort(["Churn","count"], ascending=False).show(10)
+--------------------+-----+-----+
|            location|Churn|count|
+--------------------+-----+-----+
|New York-Newark-J...|    1|    5|
|Los Angeles-Long ...|    1|    3|
|Philadelphia-Camd...|    1|    2|
|Phoenix-Mesa-Scot...|    1|    2|
|Spokane-Spokane V...|    1|    2|
|Miami-Fort Lauder...|    1|    2|
|           Flint, MI|    1|    2|
|         Jackson, MS|    1|    2|
|Greenville-Anders...|    1|    1|
|Indianapolis-Carm...|    1|    1|
+--------------------+-----+-----+
only showing top 10 rows

Page actions per session: churned vs. retained users

df.groupBy(['userId','churn']).avg('itemInSession').groupBy('churn').avg('avg(itemInSession)').show()
+-----+-----------------------+
|churn|avg(avg(itemInSession))|
+-----+-----------------------+
|    1|      72.39591226205863|
|    0|      89.13463393625388|
+-----+-----------------------+
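
The query above is a two-level average: first one mean per user, then the mean of those per-user means within each churn group. This keeps a handful of very active users from dominating the comparison. Note also that itemInSession is a running index within a session, so its mean is a proxy for session length rather than an exact action count. A small plain-Python sketch with invented rows shows how the two-level mean differs from pooling all rows:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (userId, churn, itemInSession) rows.
rows = [
    ("a", 1, 0), ("a", 1, 1), ("a", 1, 2),                                 # light user
    ("b", 1, 0), ("b", 1, 10), ("b", 1, 20), ("b", 1, 30), ("b", 1, 40),   # heavy user
    ("c", 0, 4), ("c", 0, 6),
]

per_user = defaultdict(list)
for user, churn, item in rows:
    per_user[(user, churn)].append(item)

# Step 1: one average per user. Step 2: average those values per churn group.
user_means = defaultdict(list)
for (user, churn), items in per_user.items():
    user_means[churn].append(mean(items))
group_means = {churn: mean(vals) for churn, vals in user_means.items()}

# For comparison: pooling all rows lets the heavy user weigh more.
pooled_churn = mean([item for _, churn, item in rows if churn == 1])
print(group_means, pooled_churn)
```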

User activity by time period

def plot_cnt_by_churn(time):
    """
    Plot bar charts of NextSong counts for each value of the `time` column
    ("hour", "weekday" or "day"): one chart for retained users, one for churned users.
    """
    df_pd = df.filter(df.page == "NextSong").groupby("churn", time).count().orderBy(df[time].cast("float")).toPandas()
    df_pd[time] = pd.to_numeric(df_pd[time])
    df_pd[df_pd.churn==0].plot.bar(x=time, y='count', color='burlywood',label='Not churn')
    df_pd[df_pd.churn==1].plot.bar(x=time, y='count', color='lightseagreen', label='Churn')

plot_cnt_by_churn("hour")

plot_cnt_by_churn("weekday")

plot_cnt_by_churn("day")
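
The hour, weekday and day columns appear to be derived from the millisecond ts column earlier in the notebook. As a rough sketch of such a derivation and of the per-group counting that plot_cnt_by_churn performs, the plain-Python example below assumes timestamps are interpreted as UTC. The helper `hour_of` and the toy `events` are illustrative and not taken from the original code.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_of(ts_ms):
    """Hour of day (0-23) for a millisecond epoch timestamp, interpreted as UTC."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).hour


# (ts, page, churn) tuples. The base timestamp appears in the df.take(5) output;
# the offsets, pages and churn flags are made up.
base = 1538352117000
events = [
    (base, "NextSong", 0),
    (base + 3_600_000, "NextSong", 0),
    (base + 3_600_000, "Home", 0),        # ignored: not a song play
    (base + 3_700_000, "NextSong", 1),
]

# Count song plays per (churn, hour), the quantity shown in the bar charts.
song_counts = Counter(
    (churn, hour_of(ts)) for ts, page, churn in events if page == "NextSong"
)
print(sorted(song_counts.items()))
```

If the original columns were computed in a local time zone instead, the hours would shift accordingly, so the time zone is worth fixing explicitly before comparing activity patterns.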

Feature engineering

Cumulative counts of key page actions per user

# for thepage in ["Help","Error","Upgrade","SubmitUpgrade","Downgrade","SubmitDowngrade","Cancel"]:
df_Help=df.select('userID','page')\
.where(df.page == 'Help')\
.groupBy('userID')\
.agg({'page':'count'})\
.withColumnRenamed('count(page)', 'num_Help')

df_Error=df.select('userID','page')\
.where(df.page == 'Error')\
.groupBy('userID')\
.agg({'page':'count'})\
.withColumnRenamed('count(page)', 'num_Error')

df_Upgrade=df.select('userID','page')\
.where(df.page == 'Upgrade')\
.groupBy('userID')\
.agg({'page':'count'})\
.withColumnRenamed('count(page)', 'num_Upgrade')

df_SubmitUpgrade=df.select('userID','page')\
.where(df.page == 'Submit Upgrade')\
.groupBy('userID')\
.agg({'page':'count'})\
.withColumnRenamed('count(page)', 'num_SubmitUpgrade')

df_Downgrade=df.select('userID','page')\
.where(df.page == 'Downgrade')\
.groupBy('userID')\
.agg({'page':'count'})\
.withColumnRenamed('count(page)', 'num_Downgrade')

df_SubmitDowngrade=df.select('userID','page')\
.where(df.page == 'Submit Downgrade')\
.groupBy('userID')\
.agg({'page':'count'})\
.withColumnRenamed('count(page)', 'SubmitDowngrade')

df_Cancel=df.select('userID','page')\
.where(df.page == 'Cancel')\
.groupBy('userID')\
.agg({'page':'count'})\
.withColumnRenamed('count(page)', 'num_Cancel')
df_Cancel.show()
+------+----------+
|userID|num_Cancel|
+------+----------+
|   125|         1|
|    51|         1|
|    54|         1|
|100014|         1|
|   101|         1|
|    29|         1|
|100021|         1|
|    87|         1|
|    73|         1|
|     3|         1|
|    28|         1|
|100022|         1|
|100025|         1|
|300007|         1|
|100006|         1|
|    18|         1|
|    70|         1|
|100005|         1|
|    17|         1|
|100007|         1|
+------+----------+
only showing top 20 rows
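
The seven blocks above differ only in the page name, which is what the commented-out loop hints at. The plain-Python sketch below shows the same counting done in a single pass. The function name `count_key_pages`, the uniform `num_` prefix and the toy `events` are assumptions made for illustration; the original names one of its columns `SubmitDowngrade` without the prefix.

```python
from collections import Counter, defaultdict

# Page names as they appear in the page column.
KEY_PAGES = ["Help", "Error", "Upgrade", "Submit Upgrade",
             "Downgrade", "Submit Downgrade", "Cancel"]


def count_key_pages(events, pages=KEY_PAGES):
    """Map userId -> {"num_<Page>": count} for the selected pages, zero-filled."""
    counts = defaultdict(Counter)
    for user, page in events:
        user_counts = counts[user]   # creates an empty Counter for a new user
        if page in pages:
            user_counts[page] += 1
    return {
        user: {"num_" + page.replace(" ", ""): c[page] for page in pages}
        for user, c in counts.items()
    }


# Hypothetical (userId, page) events.
events = [
    ("125", "Help"), ("125", "Cancel"), ("125", "NextSong"),
    ("30", "Submit Upgrade"),
    ("9", "NextSong"),
]
features = count_key_pages(events)
print(features["125"])
```

Unlike the separate Spark DataFrames, which only contain users with at least one matching event, this version zero-fills missing counts. If the per-page tables are joined later, users absent from a table would come out as null and would likely need a comparable zero fill.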

Cumulative user engagement

df_Add_to_Playlist=df.select('userID','page')\
.where(df.page == 'Add to Playlist')\
.groupBy('userID')\
.agg({'page':'count'})\
.withColumnRenamed('count(page)', 'num_Add_to_Playlist')

df_Add_Friend=df.select('userID','page')\
.where(df.page == 'Add Friend')\
.groupBy('userID')\
.agg({'page':'count'})\
.withColumnRenamed('count(page)', 'num_Add_Friend')

df_Add_Friend.show()
+------+--------------+
|userID|num_Add_Friend|
+------+--------------+
|100010|             4|
|200002|             4|
|    51|            28|
|   124|            74|
|     7|             1|
|    54|            33|
|    15|            31|
|   155|            11|
|   132|            41|
|   154|             3|
|100014|             6|
|   101|            29|
|    11|             6|
|   138|            41|
|300017|            63|
|    29|            47|
|    69|            12|
|100021|             7|
|    42|            52|
|   112|             7|
+------+--------------+
only showing top 20 rows
df_Add_to_Playlist.show()
+------+-------------------+
|userID|num_Add_to_Playlist|
+------+-------------------+
|100010|                  7|
|200002|                  8|
|    51|                 52|
|   124|                118|
|     7|                  5|
|    15|                 59|
|    54|                 72|
|   155|                 24|
|   132|                 38|
|   154|                  1|
|100014|                  7|
|   101|                 61|
|    11|                 20|
|   138|                 67|
|300017|                113|
|    29|                 89|
|    69|                 33|
|100021|                  7|
|    42|                104|
|   112|                  7|
+------+-------------------+
only showing top 20 rows

Per-session statistics

Per-user session duration statistics: average, minimum and maximum session length, in minutes

user_session_time = df.select("userId","sessionId","ts").groupby("userId", "sessionId").agg(((max(df.ts)-min(df.ts))/(1000*60)).alias("sessionTime"))
user_session_time_stat = user_session_time.groupby("userId").agg(avg(user_session_time.sessionTime).alias("avgSessionTime"), min(user_session_time.sessionTime).alias("minSessionTime"), max(user_session_time.sessionTime).alias("maxSessionTime")).sort("userId")
user_session_time_stat.show()
+------+------------------+-------------------+------------------+
|userId|    avgSessionTime|     minSessionTime|    maxSessionTime|
+------+------------------+-------------------+------------------+
|    10|459.74722222222226|  91.41666666666667|1547.9166666666667|
|   100| 316.5190476190476|                0.0|1019.5833333333334|
|100001|148.15833333333333|               63.0|215.28333333333333|
|100002|201.18333333333334|                0.0| 730.4666666666667|
|100003| 99.11666666666667|  3.216666666666667|195.01666666666668|
|100004|185.98650793650793|                0.0| 940.2166666666667|
|100005|120.18666666666668|                0.0|209.21666666666667|
|100006| 93.43333333333334|  93.43333333333334| 93.43333333333334|
|100007|189.41111111111113|                8.9| 675.6333333333333|
|100008| 528.2527777777779|              57.05|            2024.4|
|100009|211.96166666666667|  4.516666666666667| 666.3333333333334|
|100010|154.48333333333332|              22.55|             323.0|
|100011| 44.38333333333333|  44.38333333333333| 44.38333333333333|
|100012|271.43809523809523|                0.0| 729.4166666666666|
|100013|329.99880952380954|                0.0| 858.8166666666667|
|100014| 184.8138888888889|               3.45|            281.75|
|100015|278.59444444444443| 1.4833333333333334| 570.9333333333333|
|100016|265.40416666666664|               53.1|1049.4833333333333|
|100017|199.26666666666668| 199.26666666666668|199.26666666666668|
|100018|193.18730158730162|0.03333333333333333| 940.1666666666666|
+------+------------------+-------------------+------------------+
only showing top 20 rows
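
The duration of a session is taken as the gap between its first and last event, converted from milliseconds to minutes, and the per-user figures are then aggregated over sessions. The plain-Python sketch below walks through both steps on invented rows (the `rows` data is hypothetical). It also shows why minSessionTime can be 0.0: a session with a single event has identical first and last timestamps.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (userId, sessionId, ts in milliseconds) rows.
rows = [
    ("10", 1, 0), ("10", 1, 600_000), ("10", 1, 1_800_000),   # 30-minute session
    ("10", 2, 5_000_000), ("10", 2, 5_600_000),               # 10-minute session
    ("11", 7, 9_000_000),                                     # single event: 0 minutes
]

# Step 1: first and last timestamp of every (user, session).
bounds = {}
for user, session, ts in rows:
    lo, hi = bounds.get((user, session), (ts, ts))
    bounds[(user, session)] = (min(lo, ts), max(hi, ts))

# Step 2: session length in minutes, grouped by user.
minutes = defaultdict(list)
for (user, _), (lo, hi) in bounds.items():
    minutes[user].append((hi - lo) / (1000 * 60))

stats = {u: {"avg": mean(m), "min": min(m), "max": max(m)} for u, m in minutes.items()}
print(stats)
```

In this sketch `min` and `max` are the Python built-ins. In the Spark code they have to be the `pyspark.sql.functions` versions, presumably imported earlier in the notebook, because the built-ins cannot operate on a Column.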
user_session_song = df.select("userId","sessionId","song","page").filter(df.page=="NextSong").groupby("userId", "sessionId").agg(count(df.song).alias("sessionSong"))
user_session_song_stat = user_session_song.groupby("userId").agg(avg(user_session_song.sessionSong).alias("avgSessionSong")).sort("userId")

Per-user session statistics: average number of songs per session

user_session_song_stat.show()
+------+------------------+
|userId|    avgSessionSong|
+------+------------------+
|    10|112.16666666666667|
|   100| 78.88235294117646|
|100001|             33.25|
|100002|             48.75|
|100003|              25.5|
|100004|              47.1|
|100005|              38.5|
|100006|              26.0|
|100007|              47.0|
|100008|128.66666666666666|
|100009|              51.8|
|100010|39.285714285714285|
|100011|              11.0|
|100012| 79.33333333333333|
|100013|              87.0|
|100014|42.833333333333336|
|100015| 66.66666666666667|
|100016|             66.25|
|100017|              52.0|
|100018|              50.1|
+------+------------------+
only showing top 20 rows

Per-user session statistics: average number of thumbs-up and thumbs-down actions per session

user_session_Thumbs_Up = df.select("userId","sessionId","page").filter(df.page=="Thumbs Up").groupby("userId", "sessionId").agg(count(df.page).alias("sessionThumbs_Up"))
user_session_Thumbs_Up_stat = user_session_Thumbs_Up.groupby("userId").agg(avg(user_session_Thumbs_Up.sessionThumbs_Up).alias("avgSessionThumbs_Up")).sort("userId")
user_session_Thumbs_Up_stat.show()
+------+-------------------+
|userId|avgSessionThumbs_Up|
+------+-------------------+
|    10|  6.166666666666667|
|   100|  4.933333333333334|
|100001| 2.6666666666666665|
|100002|                2.5|
|100003|                3.0|
|100004| 2.6923076923076925|
|100005| 2.3333333333333335|
|100006|                2.0|
|100007| 3.1666666666666665|
|100008|                7.4|
|100009|              2.875|
|100010| 2.8333333333333335|
|100012|                3.6|
|100013| 3.5454545454545454|
|100014|                3.4|
|100015|                3.5|
|100016| 3.5714285714285716|
|100017|                2.0|
|100018| 3.5384615384615383|
|100019|                1.0|
+------+-------------------+
only showing top 20 rows
user_session_Thumbs_Down = df.select("userId","sessionId","page").filter(df.page=="Thumbs Down").groupby("userId", "sessionId").agg(count(df.page).alias("sessionThumbs_Down"))
user_session_Thumbs_Down_stat = user_session_Thumbs_Down.groupby("userId").agg(avg(user_session_Thumbs_Down.sessionThumbs_Down).alias("avgSessionThumbs_Down")).sort("userId")
user_session_Thumbs_Down_stat.show()
+------+---------------------+
|userId|avgSessionThumbs_Down|
+------+---------------------+
|    10|                  4.0|
|   100|   2.4545454545454546|
|100001|                  1.0|
|100004|   1.5714285714285714|
|100005|                  1.5|
|100006|                  2.0|
|100007|                  2.0|
|100008|                  2.0|
|100009|                  2.0|
|100010|                 1.25|
|100011|                  1.0|
|100012|                 2.25|
|100013|                  2.5|
|100014|                  1.0|
|100015|                  1.6|
|100016|                 1.25|
|100017|                  1.0|
|100018|                  1.5|
|100019|                  1.0|
|100021|   1.6666666666666667|
+------+---------------------+
only showing top 20 rows
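
Because the rows are filtered to a single page before grouping by session, a session without any such event never enters the average. The averages above are therefore "per session that contains at least one thumbs up (or down)", not "per session overall". The plain-Python sketch below contrasts the two definitions on invented rows for one user. The `rows` data is hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical (userId, sessionId, page) rows: three sessions, one without any Thumbs Up.
rows = [
    ("10", 1, "Thumbs Up"), ("10", 1, "Thumbs Up"), ("10", 1, "NextSong"),
    ("10", 2, "NextSong"),
    ("10", 3, "Thumbs Up"), ("10", 3, "Thumbs Up"),
    ("10", 3, "Thumbs Up"), ("10", 3, "Thumbs Up"),
]

all_sessions = {(user, session) for user, session, _ in rows}
ups = Counter((user, session) for user, session, page in rows if page == "Thumbs Up")

# Filter-then-group, as in the Spark code: sessions with zero Thumbs Up are left out.
avg_active = mean(list(ups.values()))
# Alternative definition: sessions without the event count as zero.
avg_all = mean([ups.get(key, 0) for key in all_sessions])
print(avg_active, avg_all)
```

Either definition can be a reasonable feature. The point is only that users who never perform the action are missing from these tables altogether, which matters once the features are combined.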

Per-user session statistics: average number of adverts played per session

user_session_Roll_Advert = df.select("userId","sessionId","page").filter(df.page=="Roll Advert").groupby("userId", "sessionId").agg(count(df.page).alias("sessionRoll_Advert"))
user_session_Roll_Advert_stat = user_session_Roll_Advert.groupby("userId").agg(avg(user_session_Roll_Advert.sessionRoll_Advert).alias("avgSessionRoll_Advert")).sort("userId")
user_session_Roll_Advert_stat.show()
+------+---------------------+
|userId|avgSessionRoll_Advert|
+------+---------------------+
|    10|                  1.0|
|   100|   3.5714285714285716|
|100001|                  3.5|
|100002|                  1.5|
|100003|                  9.0|
|100004|    6.615384615384615|
|100005|                  4.5|
|100006|                  3.0|
|100007|                  1.0|
|100008|    6.666666666666667|
|100009|                 5.25|
|100010|    8.666666666666666|
|100011|                  2.0|
|100012|    6.333333333333333|
|100013|                4.875|
|100014|                  1.0|
|100015|                  6.9|
|100016|                  4.0|
|100017|                 14.0|
|100018|     8.88888888888889|
+------+---------------------+
only showing top 20 rows
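
The per-session averages above all follow the same two steps: count the matching events per (userId, sessionId), then average those counts per user. The plain-Python sketch below illustrates that logic without Spark. The function name `avg_events_per_session` and the toy log are hypothetical and are not part of the original notebook.

```python
from collections import defaultdict


def avg_events_per_session(events, page):
    """Average number of `page` events per session, for each user.

    `events` is an iterable of (userId, sessionId, page) tuples. Filtering
    happens before grouping, so sessions without a matching event are not
    part of the average, and users without any matching event are absent.
    """
    # Step 1: count matching events per (userId, sessionId).
    per_session = defaultdict(int)
    for user_id, session_id, event_page in events:
        if event_page == page:
            per_session[(user_id, session_id)] += 1

    # Step 2: average the per-session counts per user.
    per_user = defaultdict(list)
    for (user_id, _session_id), n in per_session.items():
        per_user[user_id].append(n)
    return {u: sum(counts) / len(counts) for u, counts in per_user.items()}


# Hypothetical toy log, not taken from the real dataset.
toy_events = [
    ("u1", 1, "Roll Advert"), ("u1", 1, "Roll Advert"), ("u1", 1, "NextSong"),
    ("u1", 2, "Roll Advert"),
    ("u1", 3, "NextSong"),          # session without adverts: ignored
    ("u2", 7, "Thumbs Down"),
]
advert_avg = avg_events_per_session(toy_events, "Roll Advert")
print(advert_avg)
```

Because the filter runs first, the average is taken only over sessions that contain at least one such event. This is also why some userIds are missing from individual tables: a user who never triggered the event has no row at all.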

Diversity of the songs a user listens to

How many distinct songs and how many distinct artists each user has listened to in total

user_song_count=df.select("userId", "song").dropDuplicates().groupby("userId").agg(count(df.song).alias("userSongCount")).sort("userId")
user_artist_count=df.select("userId", "artist").dropDuplicates().groupby("userId").agg(count(df.artist).alias("userArtistCount")).sort("userId")
user_song_count.show(20)
+------+-------------+
|userId|userSongCount|
+------+-------------+
|    10|          629|
|   100|         2302|
|100001|          129|
|100002|          193|
|100003|           51|
|100004|          881|
|100005|          153|
|100006|           26|
|100007|          408|
|100008|          723|
|100009|          501|
|100010|          269|
|100011|           11|
|100012|          453|
|100013|         1041|
|100014|          248|
|100015|          755|
|100016|          493|
|100017|           51|
|100018|          942|
+------+-------------+
only showing top 20 rows
user_artist_count.show(20)
+------+---------------+
|userId|userArtistCount|
+------+---------------+
|    10|            565|
|   100|           1705|
|100001|            125|
|100002|            184|
|100003|             50|
|100004|            733|
|100005|            149|
|100006|             26|
|100007|            357|
|100008|            623|
|100009|            442|
|100010|            252|
|100011|             11|
|100012|            397|
|100013|            826|
|100014|            233|
|100015|            627|
|100016|            431|
|100017|             51|
|100018|            780|
+------+---------------+
only showing top 20 rows

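Features of the user's payment cycle: days from registration to the last logged event (regDay) and the share of log events recorded at the paid level (user_paied)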

user_max_ts = df.groupby("userId").max("ts").sort("userId")
user_reg_ts = df.select("userId", "registration").dropDuplicates().sort("userId")
user_reg_days = user_reg_ts.join(user_max_ts, user_reg_ts.userId == user_max_ts.userId).select(user_reg_ts["userId"], ((user_max_ts["max(ts)"]-user_reg_ts["registration"])/(1000*60*60*24)).alias("regDay"))
user_reg_days.show()
+------+------------------+
|userId|            regDay|
+------+------------------+
|100010| 55.64365740740741|
|200002| 70.07462962962963|
|   125| 71.31688657407408|
|   124|131.55591435185184|
|    51|19.455844907407407|
|     7| 72.77818287037037|
|    15|56.513576388888886|
|    54|110.75168981481481|
|   155|23.556018518518517|
|100014| 85.08340277777778|
|   132|  66.8891087962963|
|   154|23.872037037037035|
|   101|        53.9659375|
|    11|124.47825231481481|
|   138| 66.62668981481481|
|300017| 74.35851851851852|
|100021| 64.73886574074074|
|    29|60.104050925925925|
|    69| 71.42444444444445|
|   112| 87.46262731481481|
+------+------------------+
only showing top 20 rows
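
The regDay column is the difference between two millisecond timestamps divided by the number of milliseconds in a day. A minimal sketch of that arithmetic follows. The helper `days_between` and both timestamps are hypothetical values chosen for illustration.

```python
MS_PER_DAY = 1000 * 60 * 60 * 24  # same divisor as in the join above


def days_between(start_ms, end_ms):
    """Elapsed time between two epoch-millisecond timestamps, in days."""
    return (end_ms - start_ms) / MS_PER_DAY


# Hypothetical timestamps: last event 5.5 days after registration.
registration_ms = 1_538_000_000_000
last_event_ms = registration_ms + 5 * MS_PER_DAY + MS_PER_DAY // 2

reg_day = days_between(registration_ms, last_event_ms)
print(reg_day)  # 5.5
```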
df.createOrReplaceTempView("df_view")
user_paied_day = spark.sql("SELECT userID, SUM(level)/count(level) AS user_paied FROM df_view GROUP BY userID")
user_paied_day.show()
+------+-------------------+
|userID|         user_paied|
+------+-------------------+
|100010|                0.0|
|200002| 0.7468354430379747|
|   125|                0.0|
|    51|                1.0|
|   124|                1.0|
|     7|                0.0|
|    54| 0.8318300843759092|
|    15|                1.0|
|   155| 0.8562874251497006|
|   132| 0.9852430555555556|
|   154|                0.0|
|100014|                1.0|
|   101| 0.9646347138203816|
|    11|0.27004716981132076|
|   138| 0.8938841636289996|
|300017|                1.0|
|    29| 0.8917568692756037|
|    69| 0.9709388971684053|
|100021|                0.0|
|    42| 0.9696969696969697|
+------+-------------------+
only showing top 20 rows

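Average number of days per session for each user (the column is named SessionOfday, but the code divides regDay by sessionCount, so a larger value means less frequent sessions)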

user_session_count = df.select("userId", "sessionId").dropDuplicates().groupby("userId").count()
user_session_count = user_session_count.withColumnRenamed("count", "sessionCount")
user_session_count=user_session_count.join(user_reg_days,user_session_count.userId==user_reg_days.userId).select(user_session_count["userId"], (user_reg_days["regDay"]/user_session_count["sessionCount"]).alias("SessionOfday"))
user_session_count.show()
+------+------------------+
|userId|      SessionOfday|
+------+------------------+
|100010| 7.949093915343916|
|200002|11.679104938271605|
|   125| 71.31688657407408|
|   124| 4.536410839719029|
|    51|1.9455844907407407|
|     7|10.396883267195767|
|    15|3.7675717592592592|
|    54|2.9932889139139136|
|   155| 3.926003086419753|
|100014| 14.18056712962963|
|   132| 4.180569299768519|
|   154| 7.957345679012345|
|   101|        5.39659375|
|    11| 7.779890769675926|
|   138| 4.441779320987654|
|300017|1.1802939447383893|
|100021|12.947773148148148|
|    29|1.7677662037037036|
|    69|  7.93604938271605|
|   112| 8.746262731481481|
+------+------------------+
only showing top 20 rows
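
SessionOfday is regDay divided by sessionCount, so it reads as days per session, and its reciprocal is sessions per day. The sketch below checks this on userId 51, using the two values printed in the tables above. The session count is not shown in the output, so it is inferred here and should be treated as an assumption.

```python
reg_day = 19.455844907407407           # regDay of userId 51
days_per_session = 1.9455844907407407  # SessionOfday of userId 51

# Implied number of sessions (inferred, not printed in the notebook output).
session_count = round(reg_day / days_per_session)

# The quantity the section title suggests: sessions per day.
sessions_per_day = session_count / reg_day

print(session_count, sessions_per_day)
```

If a "sessions per day" feature is what is wanted, the division in the notebook would have to be inverted. The model only sees a monotone transformation of the same information either way.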

The user's current level

user_login = df.select("userId","ts","level").groupby("userId").agg(max(df.ts).alias("finalTime")).sort("userId")
user_recent_level_time =df.select("userId","ts","level")
user_recent_level = user_recent_level_time.join(user_login, [user_login.userId == user_recent_level_time.userId, user_recent_level_time.ts == user_login.finalTime]).select(user_login.userId, "level").sort("userId")
user_login.show()
+------+-------------+
|userId|    finalTime|
+------+-------------+
|    10|1542631788000|
|   100|1543587349000|
|100001|1538498205000|
|100002|1543799476000|
|100003|1539274781000|
|100004|1543459065000|
|100005|1539971825000|
|100006|1538753070000|
|100007|1543491909000|
|100008|1543335219000|
|100009|1540611104000|
|100010|1542823952000|
|100011|1538417085000|
|100012|1541100900000|
|100013|1541184816000|
|100014|1542740649000|
|100015|1543073753000|
|100016|1543335647000|
|100017|1540062847000|
|100018|1543378360000|
+------+-------------+
only showing top 20 rows
user_recent_level_time.show()
+------+-------------+-----+
|userId|           ts|level|
+------+-------------+-----+
|    30|1538352117000|    1|
|     9|1538352180000|    0|
|    30|1538352394000|    1|
|     9|1538352416000|    0|
|    30|1538352676000|    1|
|     9|1538352678000|    0|
|     9|1538352886000|    0|
|    30|1538352899000|    1|
|    30|1538352905000|    1|
|    30|1538353084000|    1|
|     9|1538353146000|    0|
|     9|1538353150000|    0|
|    30|1538353218000|    1|
|     9|1538353375000|    0|
|     9|1538353376000|    0|
|    30|1538353441000|    1|
|     9|1538353576000|    0|
|    74|1538353668000|    0|
|    30|1538353687000|    1|
|     9|1538353744000|    0|
+------+-------------+-----+
only showing top 20 rows
user_recent_level.show()
+------+-----+
|userId|level|
+------+-----+
|    10|    1|
|   100|    1|
|100001|    0|
|100002|    1|
|100003|    0|
|100004|    1|
|100005|    0|
|100006|    0|
|100007|    1|
|100008|    0|
|100009|    0|
|100010|    0|
|100011|    0|
|100012|    0|
|100013|    1|
|100014|    1|
|100015|    1|
|100016|    0|
|100017|    0|
|100018|    0|
+------+-----+
only showing top 20 rows
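
The join above picks, for every user, the level attached to the event with the largest timestamp. The same idea in plain Python, as a sketch only: the function `latest_level` and the toy events are hypothetical.

```python
def latest_level(events):
    """Return {userId: level} taken from each user's most recent event.

    `events` is an iterable of (userId, ts, level) tuples in any order.
    """
    latest = {}
    for user_id, ts, level in events:
        # Keep the event with the largest timestamp seen so far.
        if user_id not in latest or ts > latest[user_id][0]:
            latest[user_id] = (ts, level)
    return {user_id: level for user_id, (_ts, level) in latest.items()}


# Hypothetical events: user "a" upgrades from free (0) to paid (1).
toy_events = [("a", 100, 0), ("a", 200, 1), ("b", 150, 0), ("a", 120, 0)]
current = latest_level(toy_events)
print(current)
```

One difference is worth keeping in mind: the dictionary keeps exactly one entry per user, whereas the join returns one row for every event that shares the final timestamp, so ties can produce repeated rows.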

Silence time

How long it has been since the user's last logged activity

final_time_df=df.select(max(df.ts))
final_time_df.collect()
[Row(max(ts)=1543799476000)]
user_login.createOrReplaceTempView("user_login_view")
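# Note: the reference timestamp 1543622466000 is hardcoded and is earlier than the
# dataset maximum found above (1543799476000), so a user who was active after it
# gets a negative silence value (see userId 100002 below).
user_silence = spark.sql("SELECT userID, 1543622466000-finalTime AS user_silence FROM user_login_view ")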
user_silence.show()
+------+------------+
|userID|user_silence|
+------+------------+
|    10|   990678000|
|   100|    35117000|
|100001|  5124261000|
|100002|  -177010000|
|100003|  4347685000|
|100004|   163401000|
|100005|  3650641000|
|100006|  4869396000|
|100007|   130557000|
|100008|   287247000|
|100009|  3011362000|
|100010|   798514000|
|100011|  5205381000|
|100012|  2521566000|
|100013|  2437650000|
|100014|   881817000|
|100015|   548713000|
|100016|   286819000|
|100017|  3559619000|
|100018|   244106000|
+------+------------+
only showing top 20 rows
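
The negative entry for userId 100002 comes from the reference timestamp in the SQL, which is smaller than the max(ts) collected just before it. The sketch below reproduces two rows of the table from the finalTime values printed earlier. It then shows, as one possible alternative and not the notebook's method, what the same rows look like when the dataset maximum is used as the reference.

```python
MS_PER_DAY = 1000 * 60 * 60 * 24

hardcoded_ref_ms = 1543622466000  # constant used in the SQL above
dataset_max_ms = 1543799476000    # max(ts) returned by final_time_df.collect()

# finalTime values copied from user_login.show()
final_time_ms = {"10": 1542631788000, "100002": 1543799476000}

# Reproduces the user_silence column as printed.
silence_hardcoded = {u: hardcoded_ref_ms - t for u, t in final_time_ms.items()}

# Alternative (an assumption about intent): measure from the last event in the log.
silence_from_max = {u: dataset_max_ms - t for u, t in final_time_ms.items()}
silence_days = {u: ms / MS_PER_DAY for u, ms in silence_from_max.items()}

print(silence_hardcoded)
print(silence_from_max)
```

With the dataset maximum as the reference every value is non-negative, and dividing by `MS_PER_DAY` puts the feature on the same scale as regDay.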

Building the training dataset

df_final=df.select("userId","gender","Churn").dropDuplicates().sort("userId")
df_final.show()
+------+------+-----+
|userId|gender|Churn|
+------+------+-----+
|    10|     1|    0|
|   100|     1|    0|
|100001|     0|    1|
|100002|     0|    0|
|100003|     0|    1|
|100004|     0|    0|
|100005|     1|    1|
|100006|     0|    1|
|100007|     0|    1|
|100008|     0|    0|
|100009|     1|    1|
|100010|     0|    0|
|100011|     1|    1|
|100012|     1|    1|
|100013|     0|    1|
|100014|     1|    1|
|100015|     0|    1|
|100016|     1|    0|
|100017|     1|    1|
|100018|     1|    0|
+------+------+-----+
only showing top 20 rows
df_final.count()
225

Combine the per-user features with joins to form a new dataset

final_data = df_final.join(df_Help, 'userId','left').join(df_Error, 'userId','left').join(df_Upgrade, 'userId','left')\
             .join(df_SubmitUpgrade, 'userId','left').join(df_Downgrade, 'userId','left').join(df_SubmitDowngrade, 'userId','left')\
             .join(df_Cancel, 'userId','left')
final_data = final_data.join(df_Add_to_Playlist, 'userId','left').join(df_Add_Friend, 'userId','left')
final_data = final_data.join(user_session_time_stat, 'userId','left').join(user_session_song_stat, 'userId','left')\
    .join(user_session_Thumbs_Up_stat, 'userId','left').join(user_session_Thumbs_Down_stat, 'userId','left')\
    .join(user_session_Roll_Advert_stat, 'userId','left')
final_data = final_data.join(user_song_count, 'userId','left').join(user_artist_count, 'userId','left')
final_data = final_data.join(user_reg_days, 'userId','left').join(user_paied_day, 'userId','left')
final_data = final_data.join(user_session_count, 'userId','left').join(user_recent_level, 'userId','left')\
    .join(user_silence, 'userId','left')
final_data=final_data.fillna(0)
final_data=final_data.dropDuplicates()
final_data.show(20)
+------+------+-----+--------+---------+-----------+-----------------+-------------+---------------+----------+-------------------+--------------+------------------+------------------+------------------+------------------+-------------------+---------------------+---------------------+-------------+---------------+------------------+-------------------+------------------+-----+------------+
|userId|gender|Churn|num_Help|num_Error|num_Upgrade|num_SubmitUpgrade|num_Downgrade|SubmitDowngrade|num_Cancel|num_Add_to_Playlist|num_Add_Friend|    avgSessionTime|    minSessionTime|    maxSessionTime|    avgSessionSong|avgSessionThumbs_Up|avgSessionThumbs_Down|avgSessionRoll_Advert|userSongCount|userArtistCount|            regDay|         user_paied|      SessionOfday|level|user_silence|
+------+------+-----+--------+---------+-----------+-----------------+-------------+---------------+----------+-------------------+--------------+------------------+------------------+------------------+------------------+-------------------+---------------------+---------------------+-------------+---------------+------------------+-------------------+------------------+-----+------------+
|100010|     0|    0|       2|        0|          2|                0|            0|              0|         0|                  7|             4|154.48333333333332|             22.55|             323.0|39.285714285714285| 2.8333333333333335|                 1.25|    8.666666666666666|          269|            252| 55.64365740740741|                0.0| 7.949093915343916|    0|   798514000|
|200002|     1|    0|       2|        0|          2|                1|            5|              0|         0|                  8|             4|266.40000000000003|              12.1|            497.45|              64.5|                3.5|                  3.0|                 1.75|          378|            339| 70.07462962962963| 0.7468354430379747|11.679104938271605|    1|  1298112000|
|   125|     1|    1|       0|        0|          0|                0|            0|              0|         1|                  0|             0|29.566666666666666|29.566666666666666|29.566666666666666|               8.0|                0.0|                  0.0|                  1.0|            8|              8| 71.31688657407408|                0.0| 71.31688657407408|    0|  4303548000|
|   124|     0|    0|      23|        6|          0|                0|           41|              0|         0|                118|            74| 578.9942528735633|               0.0|1770.6166666666666|145.67857142857142|              7.125|   2.1578947368421053|   1.3333333333333333|         3339|           2232|131.55591435185184|                1.0| 4.536410839719029|    1|    31700000|
|    51|     1|    1|      12|        1|          0|                0|           23|              0|         1|                 52|            28| 872.3566666666666| 67.06666666666666| 2069.383333333333|             211.1|  11.11111111111111|                2.625|                  0.0|         1854|           1385|19.455844907407407|                1.0|1.9455844907407407|    1|  3860494000|
|     7|     1|    0|       1|        1|          2|                0|            0|              0|         0|                  5|             1| 87.64047619047619| 4.066666666666666|             311.3|21.428571428571427|               1.75|                  1.0|                  3.2|          148|            142| 72.77818287037037|                0.0|10.396883267195767|    0|   666855000|
|    15|     1|    0|       8|        2|          0|                0|           28|              0|         0|                 59|            31| 528.2833333333333|               0.0|1455.5666666666666|136.71428571428572|               6.75|   1.5555555555555556|                  1.0|         1707|           1302|56.513576388888886|                1.0|3.7675717592592592|    1|   500648000|
|    54|     0|    1|      17|        1|          1|                1|           39|              1|         1|                 72|            33|322.28963963963963|               0.0|            2043.2| 81.17142857142858|  5.258064516129032|   1.7058823529411764|    3.357142857142857|         2414|           1744|110.75168981481481| 0.8318300843759092|2.9932889139139136|    1|  1570858000|
|   155|     0|    0|       9|        3|          2|                1|           12|              0|         0|                 24|            11| 548.5722222222222|127.63333333333334|1100.0833333333333|136.66666666666666|  9.666666666666666|                  1.0|                  4.0|          759|            643|23.556018518518517| 0.8562874251497006| 3.926003086419753|    1|   216756000|
|100014|     1|    1|       2|        0|          0|                0|            3|              0|         1|                  7|             6| 184.8138888888889|              3.45|            281.75|42.833333333333336|                3.4|                  1.0|                  1.0|          248|            233| 85.08340277777778|                1.0| 14.18056712962963|    1|   881817000|
|   132|     0|    0|      16|        3|          1|                1|           19|              0|         0|                 38|            41|498.95104166666675| 8.533333333333333|1896.8833333333334|             120.5|  6.857142857142857|   1.8888888888888888|                  1.0|         1718|           1299|  66.8891087962963| 0.9852430555555556| 4.180569299768519|    1|   788694000|
|   154|     0|    0|       1|        0|          0|                0|            0|              0|         0|                  1|             3|110.68333333333332|             68.55|             168.1|              28.0| 3.6666666666666665|                  0.0|   3.3333333333333335|           83|             78|23.872037037037035|                0.0| 7.957345679012345|    0|   291901000|
|   101|     1|    1|      12|        3|          1|                1|           22|              0|         1|                 61|            29| 817.5800000000002|             23.85| 3300.016666666667|             179.7|              10.75|                  4.0|   2.6666666666666665|         1608|           1241|        53.9659375| 0.9646347138203816|        5.39659375|    1|  3893429000|
|    11|     0|    0|       3|        1|          9|                2|            5|              1|         0|                 20|             6|161.37708333333333|             19.55|             688.5|           40.4375|  3.076923076923077|   1.2857142857142858|                  3.0|          616|            534|124.47825231481481|0.27004716981132076| 7.779890769675926|    1|   312764000|
|   138|     1|    0|      13|        1|          1|                1|           21|              0|         0|                 67|            41| 564.2266666666667|21.716666666666665|2386.4166666666665|             138.0|  6.785714285714286|                  4.0|   2.4285714285714284|         1791|           1332| 66.62668981481481| 0.8938841636289996| 4.441779320987654|    1|      101000|
|300017|     0|    0|      27|        5|          0|                0|           25|              0|         0|                113|            63|233.32407407407405|               0.0| 987.1666666666666|59.540983606557376|  5.826923076923077|   1.5555555555555556|                  1.1|         3013|           2070| 74.35851851851852|                1.0|1.1802939447383893|    1|   115379000|
|100021|     1|    1|       0|        2|          2|                0|            0|              0|         1|                  7|             7| 215.2266666666667|              30.4| 605.1833333333333|              46.0|               2.75|   1.6666666666666667|                  6.0|          226|            207| 64.73886574074074|                0.0|12.947773148148148|    0|   478684000|
|    29|     1|    1|      28|        0|          5|                1|           18|              0|         1|                 89|            47| 365.7568627450981|               8.1|2167.7833333333333| 89.05882352941177|  5.703703703703703|   1.8333333333333333|   1.8333333333333333|         2562|           1804|60.104050925925925| 0.8917568692756037|1.7677662037037036|    1|  1441435000|
|    69|     0|    0|       7|        4|          1|                1|            9|              0|         0|                 33|            12| 526.6851851851852|11.483333333333333|1263.7833333333333|             125.0|                8.0|                  1.8|                  1.5|         1036|            865| 71.42444444444445| 0.9709388971684053|  7.93604938271605|    1|   627235000|
|   112|     1|    0|       1|        0|          2|                0|            0|              0|         0|                  7|             7| 84.45166666666667|               0.0|319.68333333333334| 23.88888888888889|                1.5|                  1.5|                2.625|          211|            195| 87.46262731481481|                0.0| 8.746262731481481|    0|    33014000|
+------+------+-----+--------+---------+-----------+-----------------+-------------+---------------+----------+-------------------+--------------+------------------+------------------+------------------+------------------+-------------------+---------------------+---------------------+-------------+---------------+------------------+-------------------+------------------+-----+------------+
only showing top 20 rows
final_data.count()
225
final_data.describe().show()
+-------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+-----------------+-----------------+-----------------+-------------------+---------------------+---------------------+-----------------+-----------------+-------------------+------------------+-------------------+------------------+--------------------+
|summary|            userId|             gender|             Churn|          num_Help|         num_Error|       num_Upgrade| num_SubmitUpgrade|     num_Downgrade|   SubmitDowngrade|        num_Cancel|num_Add_to_Playlist|    num_Add_Friend|    avgSessionTime|   minSessionTime|   maxSessionTime|   avgSessionSong|avgSessionThumbs_Up|avgSessionThumbs_Down|avgSessionRoll_Advert|    userSongCount|  userArtistCount|             regDay|        user_paied|       SessionOfday|             level|        user_silence|
+-------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+-----------------+-----------------+-----------------+-------------------+---------------------+---------------------+-----------------+-----------------+-------------------+------------------+-------------------+------------------+--------------------+
|  count|               225|                225|               225|               225|               225|               225|               225|               225|               225|               225|                225|               225|               225|              225|              225|              225|                225|                  225|                  225|              225|              225|                225|               225|                225|               225|                 225|
|   mean|65391.013333333336| 0.5377777777777778|0.2311111111111111|6.4622222222222225|              1.12|2.2177777777777776|0.7066666666666667| 9.133333333333333|              0.28|0.2311111111111111| 29.004444444444445| 19.00888888888889| 288.1659840733499|36.13177777777776|968.0325185185188|70.78971233958933|  4.659001296544653|   1.6517472058249647|   2.9905099013219067|897.7911111111111|696.3777777777777|  79.84568348765428|0.5822655122895655| 11.828952608818401|0.6444444444444445|1.1222788133333333E9|
| stddev|105396.47791907164|0.49968243883744773|0.4224832108996327|7.2425851519011974|1.4726070176973318|2.5585369082956606|0.7338742593737899|11.734412152785014|0.5876709477736184|0.4224832108996327| 32.716653931055426|20.581716728496275|176.84244472357116|90.64909420429706| 723.317736950286| 42.6153697543817| 2.4223461004090705|   0.9260391713518761|   2.2934186342927476|896.3876044550344|603.9518698630802|  37.66147001861254| 0.407810466090919|  15.11686186391195|0.4797486114192829|1.4392218908526242E9|
|    min|                10|                  0|                 0|                 0|                 0|                 0|                 0|                 0|                 0|                 0|                  0|                 0|               7.0|              0.0|              7.0|              3.0|                0.0|                  0.0|                  0.0|                3|                3|0.31372685185185184|               0.0|0.31372685185185184|                 0|          -177010000|
|    max|                99|                  1|                 1|                46|                 7|                15|                 4|                73|                 3|                 1|                240|               143|            1179.9|904.8666666666667|4455.083333333333|286.6666666666667| 13.777777777777779|                  6.0|                 15.0|             5946|             3544|  256.3776736111111|               1.0| 124.55060185185185|                 1|          5205381000|
+-------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+-----------------+-----------------+-----------------+-------------------+---------------------+---------------------+-----------------+-----------------+-------------------+------------------+-------------------+------------------+--------------------+
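
The chain of left joins followed by fillna(0) keeps every user from df_final and writes 0 wherever a feature table has no row for that user. A dictionary-based sketch of that behaviour is shown below. The helper `left_join_fill` is hypothetical, and the values are copied from the avgSessionThumbs_Down and avgSessionRoll_Advert tables printed earlier.

```python
def left_join_fill(base_ids, feature_tables, fill=0):
    """Left-join feature tables onto a list of user ids.

    `feature_tables` maps a column name to a {userId: value} dict.
    Users missing from a table receive `fill`, like fillna(0) after a left join.
    """
    return {
        user_id: {name: table.get(user_id, fill)
                  for name, table in feature_tables.items()}
        for user_id in base_ids
    }


base_ids = ["10", "100001", "100002"]
feature_tables = {
    # userId 100002 has no Thumbs Down row in the table shown earlier.
    "avgSessionThumbs_Down": {"10": 4.0, "100001": 1.0},
    "avgSessionRoll_Advert": {"10": 1.0, "100001": 3.5, "100002": 1.5},
}
joined = left_join_fill(base_ids, feature_tables)
print(joined["100002"])
```

Here a 0 means "no such event was logged". That is a natural default for counts and per-session averages, but it cannot be told apart from a genuinely missing value afterwards.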

Modeling, training, and evaluation

final_data.columns
['userId',
 'gender',
 'Churn',
 'num_Help',
 'num_Error',
 'num_Upgrade',
 'num_SubmitUpgrade',
 'num_Downgrade',
 'SubmitDowngrade',
 'num_Cancel',
 'num_Add_to_Playlist',
 'num_Add_Friend',
 'avgSessionTime',
 'minSessionTime',
 'maxSessionTime',
 'avgSessionSong',
 'avgSessionThumbs_Up',
 'avgSessionThumbs_Down',
 'avgSessionRoll_Advert',
 'userSongCount',
 'userArtistCount',
 'regDay',
 'user_paied',
 'SessionOfday',
 'level',
 'user_silence']
inputcols=['gender',
 'num_Help',
 'num_Error',
 'num_Upgrade',
 'num_SubmitUpgrade',
 'num_Downgrade',
 'SubmitDowngrade',
 'num_Cancel',
 'num_Add_to_Playlist',
 'num_Add_Friend',
 'avgSessionTime',
 'minSessionTime',
 'maxSessionTime',
 'avgSessionSong',
 'avgSessionThumbs_Up',
 'avgSessionThumbs_Down',
 'avgSessionRoll_Advert',
 'userSongCount',
 'userArtistCount',
 'regDay',
 'user_paied',
 'SessionOfday',
 'level',
 'user_silence']
assembler = VectorAssembler(inputCols=inputcols, outputCol="NumFeatures")
dataset=assembler.transform(final_data)
scaler = StandardScaler(inputCol="NumFeatures", outputCol="ScaledNumFeatures", withStd=True)
scalerModel = scaler.fit(dataset)
dataset= scalerModel.transform(dataset)
# dataset = dataset.withColumn('label', dataset['churn'].cast('float')).drop('churn') 

# feature_cols = dataset.drop('label').drop('userId').columns

dataset=dataset.select(col('Churn').alias('label'),col('ScaledNumFeatures').alias('features'))
dataset.take(5)
[Row(label=0, features=DenseVector([0.0, 0.2761, 0.0, 0.7817, 0.0, 0.0, 0.0, 0.0, 0.214, 0.1943, 0.8736, 0.2488, 0.4466, 0.9219, 1.1697, 1.3498, 3.7789, 0.3001, 0.4173, 1.4775, 0.0, 0.5258, 0.0, 0.5548])),
 Row(label=0, features=DenseVector([2.0013, 0.2761, 0.0, 0.7817, 1.3626, 0.4261, 0.0, 0.0, 0.2445, 0.1943, 1.5064, 0.1335, 0.6877, 1.5135, 1.4449, 3.2396, 0.7631, 0.4217, 0.5613, 1.8606, 1.8313, 0.7726, 2.0844, 0.902])),
 Row(label=1, features=SparseVector(24, {0: 2.0013, 7: 2.367, 10: 0.1672, 11: 0.3262, 12: 0.0409, 13: 0.1877, 16: 0.436, 17: 0.0089, 18: 0.0132, 19: 1.8936, 21: 4.7177, 23: 2.9902})),
 Row(label=0, features=DenseVector([0.0, 3.1757, 4.0744, 0.0, 0.0, 3.494, 0.0, 0.0, 3.6067, 3.5954, 3.2741, 0.0, 2.4479, 3.4185, 2.9414, 2.3302, 0.5814, 3.725, 3.6957, 3.4931, 2.4521, 0.3001, 2.0844, 0.022])),
 Row(label=1, features=DenseVector([2.0013, 1.6569, 0.6791, 0.0, 0.0, 1.96, 0.0, 2.367, 1.5894, 1.3604, 4.933, 0.7398, 2.861, 4.9536, 4.5869, 2.8347, 0.0, 2.0683, 2.2932, 0.5166, 2.4521, 0.1287, 2.0844, 2.6823]))]
dataset.count()
225
train,test = dataset.randomSplit([0.8, 0.2], seed=77)
train.count()
174
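
StandardScaler is created here with withStd=True and its default withMean=False, so each feature is divided by its standard deviation without being centered. This can be checked against the printed output: a binary feature with value 1 should become 1/stddev, where stddev is the value reported by final_data.describe(). The sketch below does that arithmetic for three 0/1 columns.

```python
# Standard deviations copied from the final_data.describe() output.
stddev = {
    "gender": 0.49968243883744773,
    "num_Cancel": 0.4224832108996327,
    "level": 0.4797486114192829,
}

# A raw value of 1 scaled by 1/stddev, rounded like the vector printout.
scaled = {name: round(1 / s, 4) for name, s in stddev.items()}
print(scaled)
```

The results agree with the entries 2.0013 (index 0, gender), 2.367 (index 7, num_Cancel), and 2.0844 (index 22, level) in the dataset.take(5) rows. Zeros stay zero, which is why some rows are stored as SparseVector.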

Logistic regression

lr= LogisticRegression(maxIter=10,regParam=0.1)
# rf= RandomForestClassifier(labelCol="label", featuresCol="ScaledNumFeatures",maxDepth=5,seed=17)
# svm=LinearSVC(labelCol="label", featuresCol="ScaledNumFeatures",maxIter=10,regParam=0.1)
paramGrid = ParamGridBuilder().addGrid(lr.regParam,[0.01,0.05,0.1]).build()
crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator= MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=2)
cvModel_lr = crossval.fit(train)
cvModel_lr.avgMetrics

res_lr = cvModel_lr.transform(test)
evaluator = MulticlassClassificationEvaluator(predictionCol = "prediction")
print("The metrics for our Logistic Regression Classifier are as follows :")
print("The F-1 Score is {}".format(evaluator.evaluate(res_lr, {evaluator.metricName : "f1"})))
print("The accuracy is {}".format(evaluator.evaluate(res_lr, {evaluator.metricName : "accuracy"})))
cvModel_lr.save('cvModel_lr.model')
The metrics for our Logistic Regression Classifier are as follows :
The F-1 Score is 0.9799307958477508
The accuracy is 0.9803921568627451
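
The two values printed above can be sanity-checked by hand. The test split holds 225 - 174 = 51 users, and 0.9803921568627451 equals 50/51, which is what accuracy gives for a single misclassified user. The "f1" metric of MulticlassClassificationEvaluator is the support-weighted F-measure over the classes. The sketch below computes both metrics from a confusion matrix. The matrix itself is hypothetical: it is one configuration (42 users of class 0, 9 of class 1, one class-1 user predicted as 0) that reproduces both printed numbers, and the notebook does not show the real one.

```python
def accuracy_and_weighted_f1(confusion):
    """confusion[true_label][predicted_label] -> count.

    Returns (accuracy, F1 averaged over classes weighted by class support).
    """
    labels = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[label][label] for label in labels)

    weighted_sum = 0.0
    for label in labels:
        tp = confusion[label][label]
        fn = sum(confusion[label][p] for p in labels if p != label)
        fp = sum(confusion[t][label] for t in labels if t != label)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        weighted_sum += (tp + fn) * f1  # weight by the class support
    return correct / total, weighted_sum / total


# Hypothetical confusion matrix consistent with the printed metrics.
confusion = {0: {0: 42, 1: 0}, 1: {0: 1, 1: 8}}
accuracy, f1 = accuracy_and_weighted_f1(confusion)
print(accuracy, f1)
```

With only 51 test users, each additional error moves accuracy by about two percentage points, so small differences between models on this split carry little weight.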

Random forest

rf= RandomForestClassifier(maxDepth=5,seed=17)
paramGrid = ParamGridBuilder().addGrid(rf.maxDepth, [3,5]).build()
crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator= MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=2)
cvModel_rf = crossval.fit(train)
cvModel_rf.avgMetrics

res_rf = cvModel_rf.transform(test)
evaluator = MulticlassClassificationEvaluator(predictionCol = "prediction")
print("The metrics for our RandomForest Classifier are as follows :")
print("The F-1 Score is {}".format(evaluator.evaluate(res_rf, {evaluator.metricName : "f1"})))
print("The accuracy is {}".format(evaluator.evaluate(res_rf, {evaluator.metricName : "accuracy"})))
cvModel_rf.save('cvModel_rf.model')
The metrics for our RandomForest Classifier are as follows :
The F-1 Score is 0.9799307958477508
The accuracy is 0.9803921568627451

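Summary

Through data cleaning, inspection, exploration, and feature construction, we reshaped the event log into a dataset with one row per user. On that basis, and without any elaborate hyperparameter tuning, logistic regression and random forest both reach about 0.98 accuracy and F1 on the test split, which suggests that the features carry a strong training signal. Part of that signal probably comes from num_Cancel: in the describe() output it has the same mean and standard deviation as Churn (0.2311 and 0.4225), so it appears to duplicate the label, and the scores should be read with that in mind. The test split also holds only 51 of the 225 users, so the next step is to test on a larger dataset. Because it trains faster, logistic regression would be the first choice.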

Reposted from www.cnblogs.com/ceeyo/p/12933371.html