Spark: spark.sql

Notes


Table structure

scala> orders.show(5)
+--------+-------+--------+------------+---------+-----------------+----------------------+
|order_id|user_id|eval_set|order_number|order_dow|order_hour_of_day|days_since_prior_order|
+--------+-------+--------+------------+---------+-----------------+----------------------+
| 2539329|      1|   prior|           1|        2|               08|                      |
| 2398795|      1|   prior|           2|        3|               07|                  15.0|
|  473747|      1|   prior|           3|        3|               12|                  21.0|
| 2254736|      1|   prior|           4|        4|               07|                  29.0|
|  431534|      1|   prior|           5|        4|               15|                  28.0|
+--------+-------+--------+------------+---------+-----------------+----------------------+
only showing top 5 rows

Requirement: replace the empty values of days_since_prior_order with 0.

scala> val ord = orders.selectExpr("*","if(days_since_prior_order='',0,days_since_prior_order) as dspo").drop("days_since_prior_order")
scala> ord.show(5)   // show() returns Unit, so call it separately to keep ord as a DataFrame
+--------+-------+--------+------------+---------+-----------------+----+
|order_id|user_id|eval_set|order_number|order_dow|order_hour_of_day|dspo|
+--------+-------+--------+------------+---------+-----------------+----+
| 2539329|      1|   prior|           1|        2|               08|   0|
| 2398795|      1|   prior|           2|        3|               07|15.0|
|  473747|      1|   prior|           3|        3|               12|21.0|
| 2254736|      1|   prior|           4|        4|               07|29.0|
|  431534|      1|   prior|           5|        4|               15|28.0|
+--------+-------+--------+------------+---------+-----------------+----+
only showing top 5 rows
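
The same cleanup can also be written with the DataFrame column API instead of a SQL string. The following is a minimal sketch, not the original author's code: the object name `FillEmptyDspo`, the helper `fillEmpty`, the local `SparkSession`, and the three sample rows (copied from the output above) are assumptions made for illustration. The `if(...)` expression above appears to leave `dspo` as a string column (the output shows `0` next to `15.0`), whereas this version casts it to double.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object FillEmptyDspo {
  // Hypothetical helper: map "" to 0.0 and cast every other value to double,
  // then drop the original string column.
  def fillEmpty(orders: DataFrame): DataFrame =
    orders
      .withColumn(
        "dspo",
        when(col("days_since_prior_order") === "", 0.0)
          .otherwise(col("days_since_prior_order").cast("double")))
      .drop("days_since_prior_order")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for a standalone run; spark-shell already provides `spark`.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("fill-empty-dspo")
      .getOrCreate()
    import spark.implicits._

    // Sample rows taken from the orders.show(5) output; all columns kept as strings.
    val orders = Seq(
      ("2539329", "1", ""),
      ("2398795", "1", "15.0"),
      ("473747", "1", "21.0")
    ).toDF("order_id", "user_id", "days_since_prior_order")

    val ord = fillEmpty(orders)
    ord.show()
    spark.stop()
  }
}
```

In spark-shell only the body of `fillEmpty` is needed, since `spark` and the `functions` imports are already in scope.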

Average number of days between orders for each user

scala> ord.selectExpr("user_id","cast(dspo as int) as dspo").groupBy("user_id").agg(count("user_id"),sum("dspo"),avg("dspo")).show(4)
+-------+--------------+---------+------------------+
|user_id|count(user_id)|sum(dspo)|         avg(dspo)|
+-------+--------------+---------+------------------+
|    296|             7|       38| 5.428571428571429|
|    467|             6|       53| 8.833333333333334|
|    675|            11|      220|              20.0|
|    691|            23|      303|13.173913043478262|
+-------+--------------+---------+------------------+
only showing top 4 rows
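  • avg() computes the mean: the input column is the numerator (it is summed), and the denominator is the number of non-null values in the group, which here equals groupBy().count() because the empty values were replaced with 0.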
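
To make the sum/count relationship concrete, here is a small plain-Scala sketch (no Spark required) that recomputes the three aggregates by hand. The object `AvgSketch`, the function `aggregate`, and the use of `Option[Int]` to stand in for SQL NULL are assumptions for illustration. The sample values are only the five `user_id = 1` rows displayed earlier, truncated to integers as `cast(dspo as int)` would do.

```scala
object AvgSketch {
  // Hypothetical helper: (row count, sum, avg) per user.
  // Missing values (None) are skipped in sum and avg, the way SQL aggregates skip NULLs.
  def aggregate(rows: Seq[(Int, Option[Int])]): Map[Int, (Int, Int, Double)] =
    rows.groupBy(_._1).map { case (userId, group) =>
      val present = group.flatMap(_._2) // keep only the values that exist
      val total = present.sum
      // NaN marks "no values" in this sketch; Spark would return NULL instead.
      val avg = if (present.isEmpty) Double.NaN else total.toDouble / present.size
      userId -> (group.size, total, avg)
    }

  def main(args: Array[String]): Unit = {
    val rows: Seq[(Int, Option[Int])] =
      Seq((1, Some(0)), (1, Some(15)), (1, Some(21)), (1, Some(29)), (1, Some(28)))
    println(aggregate(rows)(1))
  }
}
```

For these five rows the count is 5, the sum is 93, and the average is 93 / 5 = 18.6. Because the first order's empty value was turned into 0 rather than left missing, it still counts toward the denominator and pulls the average down.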

Reposted from www.cnblogs.com/blogyuhan/p/9300857.html