Notes
SQL structure
scala> orders.show(5)
+--------+-------+--------+------------+---------+-----------------+----------------------+
|order_id|user_id|eval_set|order_number|order_dow|order_hour_of_day|days_since_prior_order|
+--------+-------+--------+------------+---------+-----------------+----------------------+
| 2539329|      1|   prior|           1|        2|               08|                      |
| 2398795|      1|   prior|           2|        3|               07|                  15.0|
|  473747|      1|   prior|           3|        3|               12|                  21.0|
| 2254736|      1|   prior|           4|        4|               07|                  29.0|
|  431534|      1|   prior|           5|        4|               15|                  28.0|
+--------+-------+--------+------------+---------+-----------------+----------------------+
only showing top 5 rows
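Requirement: replace the empty values of days_since_prior_order with 0.
// show() returns Unit, so assign the DataFrame to ord first and call show() separately
scala> val ord = orders.selectExpr("*","if(days_since_prior_order='',0,days_since_prior_order) as dspo").drop("days_since_prior_order")
scala> ord.show(5)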
+--------+-------+--------+------------+---------+-----------------+----+
|order_id|user_id|eval_set|order_number|order_dow|order_hour_of_day|dspo|
+--------+-------+--------+------------+---------+-----------------+----+
| 2539329|      1|   prior|           1|        2|               08|   0|
| 2398795|      1|   prior|           2|        3|               07|15.0|
|  473747|      1|   prior|           3|        3|               12|21.0|
| 2254736|      1|   prior|           4|        4|               07|29.0|
|  431534|      1|   prior|           5|        4|               15|28.0|
+--------+-------+--------+------------+---------+-----------------+----+
only showing top 5 rows
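The same replacement can also be written with the column API instead of a SQL string. The following is only a minimal sketch under stated assumptions: the object name `FillEmptyDspo`, the helper `fillEmpty`, and the three sample rows are made up for illustration, and the column is assumed to be a string in which missing values are empty strings, as in the output above. Unlike the `if(...)` expression, this version also casts the result to double, so `dspo` ends up numeric.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object FillEmptyDspo {
  // Hypothetical helper: empty string -> 0.0, everything else cast to double.
  def fillEmpty(orders: DataFrame): DataFrame =
    orders
      .withColumn(
        "dspo",
        when(col("days_since_prior_order") === "", 0.0)
          .otherwise(col("days_since_prior_order").cast("double")))
      .drop("days_since_prior_order")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; inside spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-empty-dspo")
      .getOrCreate()
    import spark.implicits._

    // Made-up miniature of the orders table (all columns as strings).
    val orders = Seq(
      ("2539329", "1", ""),
      ("2398795", "1", "15.0"),
      ("473747", "1", "21.0")
    ).toDF("order_id", "user_id", "days_since_prior_order")

    fillEmpty(orders).show()
    spark.stop()
  }
}
```

Because `dspo` is already a double here, a later aggregation would not need the extra `cast(dspo as int)` step.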
Average number of days between orders for each user
scala> ord.selectExpr("user_id","cast(dspo as int) as dspo").groupBy("user_id").agg(count("user_id"),sum("dspo"),avg("dspo")).show(4)
+-------+--------------+---------+------------------+
|user_id|count(user_id)|sum(dspo)|         avg(dspo)|
+-------+--------------+---------+------------------+
|    296|             7|       38| 5.428571428571429|
|    467|             6|       53| 8.833333333333334|
|    675|            11|      220|              20.0|
|    691|            23|      303|13.173913043478262|
+-------+--------------+---------+------------------+
only showing top 4 rows
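- avg() computes the mean: the numerator is the sum of the input column, and the denominator is the number of non-null values of that column within each group. Since the empty values were replaced with 0 earlier, that number equals the groupBy().count() row count here, e.g. 38 / 7 ≈ 5.43 for user 296.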
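The relationship between avg(), sum() and count() can be checked directly. The sketch below is an illustration on made-up data, not the real orders table: the object name `AvgCheck`, the helper `summarize`, and the six sample rows are assumptions. User 1 has one null `dspo`, while user 2 has a 0 in the same position.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, count, sum}

object AvgCheck {
  // Hypothetical helper: per-user row count, non-null count, sum, avg and sum/count.
  def summarize(ord: DataFrame): DataFrame =
    ord.groupBy("user_id")
      .agg(
        count("user_id").as("rows"),         // rows in the group
        count("dspo").as("non_null"),        // count(col) skips nulls
        sum("dspo").as("total"),
        avg("dspo").as("avg_dspo"),
        (sum("dspo") / count("dspo")).as("manual_avg"))
      .orderBy("user_id")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; inside spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("avg-check")
      .getOrCreate()
    import spark.implicits._

    // Made-up data: None becomes a null in the dspo column.
    val ord = Seq[(Int, Option[Int])](
      (1, None), (1, Some(15)), (1, Some(21)),
      (2, Some(0)), (2, Some(15)), (2, Some(21))
    ).toDF("user_id", "dspo")

    summarize(ord).show()
    spark.stop()
  }
}
```

Both users have three rows and a total of 36, but user 1 is averaged over two non-null values (18.0) and user 2 over three (12.0). Replacing the missing first-order value with 0 therefore lowers the average compared with leaving it null.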