Number and Duration of a User's Phone Calls in Each Half-Hour Interval

Hi everyone. Today I came across a use case for analytic (window) functions and wanted to share it.

The user call table has three fields: user id, start time, and end time. Sample data (comma-delimited):

aaa,2018-01-01 08:01:00,2018-01-01 08:08:00
aaa,2018-01-01 08:15:00,2018-01-01 08:20:00
aaa,2018-01-01 08:45:00,2018-01-01 08:48:00

Expected output: user id, the earliest start time in each interval, the number of calls in that interval, and the total duration in minutes.

aaa 2018-01-01 08:01:00  2  12

aaa 2018-01-01 08:45:00  1  3

The test table and the detailed steps follow.

create table login_start_end_time (userid string,start_date string,end_date string) row format delimited fields terminated by ',';
LOAD DATA LOCAL INPATH '/root/test/test.txt' INTO TABLE login_start_end_time;
hive> select * from login_start_end_time;
OK
aaa	2018-01-01 08:01:00	2018-01-01 08:08:00
aaa	2018-01-01 08:15:00	2018-01-01 08:20:00
aaa	2018-01-01 08:45:00	2018-01-01 08:48:00
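
If you don't have a local file to load, the same three rows can be inserted directly. This is just a sketch that assumes Hive 0.14 or later, where `INSERT ... VALUES` is available; it is not part of the original walkthrough.

```sql
-- Assumes Hive 0.14+ (INSERT ... VALUES) and the table created above.
insert into table login_start_end_time values
    ('aaa', '2018-01-01 08:01:00', '2018-01-01 08:08:00'),
    ('aaa', '2018-01-01 08:15:00', '2018-01-01 08:20:00'),
    ('aaa', '2018-01-01 08:45:00', '2018-01-01 08:48:00');
```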

--- Step 1: compute each call's duration (in seconds) and the previous call's end time

select userid,
       start_date,
       end_date,
       unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
       -- for a user's first call there is no previous row, so lag falls back to start_date
       lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
from login_start_end_time;
--- intermediate MapReduce output omitted
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 12.08 sec   HDFS Read: 9181 HDFS Write: 204 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 80 msec
OK
aaa	2018-01-01 08:01:00	2018-01-01 08:08:00	420	2018-01-01 08:01:00
aaa	2018-01-01 08:15:00	2018-01-01 08:20:00	300	2018-01-01 08:08:00
aaa	2018-01-01 08:45:00	2018-01-01 08:48:00	180	2018-01-01 08:20:00

--- Step 2: compute the cumulative duration and the gap between each call and the previous one

select userid,
       start_date,
       end_date,
       long_time,
       sum(long_time) over (distribute by userid sort by start_date) as sum_log,
       unix_timestamp(start_date) - unix_timestamp(last_end_time) as diff_long
from (
    select userid,
           start_date,
           end_date,
           unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
           lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
    from login_start_end_time
) t;
-- intermediate logs omitted
Total MapReduce CPU Time Spent: 25 seconds 120 msec
OK
aaa	2018-01-01 08:01:00	2018-01-01 08:08:00	420	420	0
aaa	2018-01-01 08:15:00	2018-01-01 08:20:00	300	720	420
aaa	2018-01-01 08:45:00	2018-01-01 08:48:00	180	900	1500
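
A note on why `sum(...) over(...)` yields a running total here: when a window has an ordering but no explicit frame, Hive defaults the frame to `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`. The sketch below spells the frame out with the standard `partition by` / `order by` keywords instead of `distribute by` / `sort by`. It uses `ROWS` rather than `RANGE`, which gives the same result as long as a user has no two calls with the same start_date. On the sample data it should reproduce the sum_log values above (420, 720, 900).

```sql
-- Running total of call seconds per user, with the window frame written explicitly.
select userid,
       start_date,
       long_time,
       sum(long_time) over (partition by userid
                            order by start_date
                            rows between unbounded preceding and current row) as sum_log
from (
    select userid,
           start_date,
           unix_timestamp(end_date) - unix_timestamp(start_date) as long_time
    from login_start_end_time
) t;
```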

--- Step 3: integer-divide (cumulative duration + gap) by 30 minutes (1800 seconds) to assign each call to a 30-minute interval

select userid,
       start_date,
       end_date,
       long_time,
       floor((sum_log + diff_long) / (30 * 60)) as time_inter
from (
    select userid,
           start_date,
           end_date,
           long_time,
           sum(long_time) over (distribute by userid sort by start_date) as sum_log,
           unix_timestamp(start_date) - unix_timestamp(last_end_time) as diff_long
    from (
        select userid,
               start_date,
               end_date,
               unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
               lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
        from login_start_end_time
    ) t
) d;
-- intermediate logs omitted
Total MapReduce CPU Time Spent: 25 seconds 30 msec
OK
aaa	2018-01-01 08:01:00	2018-01-01 08:08:00	420	0
aaa	2018-01-01 08:15:00	2018-01-01 08:20:00	300	0
aaa	2018-01-01 08:45:00	2018-01-01 08:48:00	180	1
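
For the sample rows the arithmetic is floor((420+0)/1800) = 0, floor((720+420)/1800) = 0, and floor((900+1500)/1800) = 1. One caveat: sum_log + diff_long adds only the current gap to the cumulative duration, so it is not the time elapsed since the user's first call whenever earlier gaps exist. For a hypothetical user with one-minute calls starting at 08:00, 08:20 and 08:40, the third call gets (180 + 1140) / 1800, which floors to 0, even though it starts 40 minutes after the first. If "interval" means 30-minute windows counted from the user's first call, a different construction is needed. A minimal sketch, not the original method, buckets directly on the seconds elapsed since the first start; `first_start` is a column name introduced only for this example.

```sql
-- Sketch only: bucket each call by seconds elapsed since the user's first call.
-- first_start is a hypothetical column name introduced for this example.
select userid,
       start_date,
       end_date,
       long_time,
       floor((unix_timestamp(start_date) - first_start) / (30 * 60)) as time_inter
from (
    select userid,
           start_date,
           end_date,
           unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
           min(unix_timestamp(start_date)) over (partition by userid) as first_start
    from login_start_end_time
) t;
```

On the three sample rows this gives the same interval numbers as above (0, 0, 1), so the final result is unchanged for this data.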

--- Step 4: group by interval to get each interval's start time, call count, and total duration (minutes)

select userid,
       time_inter + 1,
       min(start_date) as start_date,
       count(1) cnt,
       sum(long_time) / 60 as long_time
from (
    select userid,
           start_date,
           end_date,
           long_time,
           floor((sum_log + diff_long) / (30 * 60)) as time_inter
    from (
        select userid,
               start_date,
               end_date,
               long_time,
               sum(long_time) over (distribute by userid sort by start_date) as sum_log,
               unix_timestamp(start_date) - unix_timestamp(last_end_time) as diff_long
        from (
            select userid,
                   start_date,
                   end_date,
                   unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
                   lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
            from login_start_end_time
        ) t
    ) d
) d1
group by userid, time_inter + 1;
-- intermediate logs omitted
Total MapReduce CPU Time Spent: 39 seconds 0 msec
OK
aaa	1	2018-01-01 08:01:00	2	12.0
aaa	2	2018-01-01 08:45:00	1	3.0
Time taken: 294.195 seconds, Fetched: 2 row(s)
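
The three levels of nesting can be hard to follow. The same four steps can be written with common table expressions (the `WITH` clause, which I believe is available from Hive 0.13 on). This is a sketch of the same logic, not a different algorithm; the CTE names `calls`, `gaps`, `buckets` and the alias `interval_no` are my own.

```sql
with calls as (      -- step 1: duration and previous end time
    select userid,
           start_date,
           end_date,
           unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
           lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
    from login_start_end_time
),
gaps as (            -- step 2: running total and gap to the previous call
    select userid,
           start_date,
           long_time,
           sum(long_time) over (distribute by userid sort by start_date) as sum_log,
           unix_timestamp(start_date) - unix_timestamp(last_end_time) as diff_long
    from calls
),
buckets as (         -- step 3: interval index
    select userid,
           start_date,
           long_time,
           floor((sum_log + diff_long) / (30 * 60)) as time_inter
    from gaps
)
select userid,       -- step 4: aggregate per interval
       time_inter + 1 as interval_no,
       min(start_date) as start_date,
       count(1) as cnt,
       sum(long_time) / 60 as long_time
from buckets
group by userid, time_inter + 1;
```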

My takeaway: the key is to construct an interval-index field that identifies which rows belong to the same time interval.

Reposted from blog.csdn.net/zhaoxiangchong/article/details/115319221