The number and duration of calls made by users in every half-hour interval

Hello everyone. Today I came across an application scenario for analytic (window) functions and would like to share it.

User call table with three fields: user id, start time, end time. The sample data is as follows (the field separator is a comma):

aaa,2018-01-01 08:01:00,2018-01-01 08:08:00
aaa,2018-01-01 08:15:00,2018-01-01 08:20:00
aaa,2018-01-01 08:45:00,2018-01-01 08:48:00

Expected output: user id, the earliest start time in each time period, the number of calls in the period, and the total duration in minutes (the first two calls, 7 + 5 = 12 minutes, fall into the same period; the third call starts a new one):

aaa 2018-01-01 08:01:00  2  12

aaa 2018-01-01 08:45:00  1  3

Below are the test table creation statements and the detailed steps.

create table login_start_end_time (userid string,start_date string,end_date string) row format delimited fields terminated by ',';
LOAD DATA LOCAL INPATH '/root/test/test.txt' INTO TABLE login_start_end_time;
hive> select * from login_start_end_time;
OK
aaa	2018-01-01 08:01:00	2018-01-01 08:08:00
aaa	2018-01-01 08:15:00	2018-01-01 08:20:00
aaa	2018-01-01 08:45:00	2018-01-01 08:48:00

---The first step is to compute the duration of each call and the end time of the previous call

select userid, start_date, end_date,
       unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
       lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
from login_start_end_time;
---the intermediate MapReduce progress log is omitted
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 12.08 sec   HDFS Read: 9181 HDFS Write: 204 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 80 msec
OK
aaa	2018-01-01 08:01:00	2018-01-01 08:08:00	420	2018-01-01 08:01:00
aaa	2018-01-01 08:15:00	2018-01-01 08:20:00	300	2018-01-01 08:08:00
aaa	2018-01-01 08:45:00	2018-01-01 08:48:00	180	2018-01-01 08:20:00
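
The third argument of lag() is the default value used for the first row of each partition, so the first call's last_end_time is its own start_date and its gap works out to 0, as the first output row shows. As a quick standalone sanity check of the duration arithmetic, using the literal timestamps from the first sample row (a constant-only SELECT needs Hive 0.13 or later):

select unix_timestamp('2018-01-01 08:08:00') - unix_timestamp('2018-01-01 08:01:00');  -- 420 seconds = 7 minutes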

---The second step is to compute the cumulative call duration and the gap between each call and the previous one

select userid, start_date, end_date, long_time,
       sum(long_time) over (distribute by userid sort by start_date) as sum_log,
       unix_timestamp(start_date) - unix_timestamp(last_end_time) as diff_long
from (select userid, start_date, end_date,
             unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
             lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
      from login_start_end_time) t;
--the intermediate log is omitted
Total MapReduce CPU Time Spent: 25 seconds 120 msec
OK
aaa	2018-01-01 08:01:00	2018-01-01 08:08:00	420	420	0
aaa	2018-01-01 08:15:00	2018-01-01 08:20:00	300	720	420
aaa	2018-01-01 08:45:00	2018-01-01 08:48:00	180	900	1500
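
Because the over() clause has a sort key, sum(long_time) is a running total per user rather than a grand total; with an ordering and no explicit frame, Hive defaults to range between unbounded preceding and current row. Since start times are distinct here, spelling out an explicit rows frame gives the same result and makes the intent clearer. A sketch for illustration only, using partition by/order by, which Hive accepts interchangeably with distribute by/sort by inside over():

select userid, start_date, long_time,
       sum(long_time) over (partition by userid order by start_date
                            rows between unbounded preceding and current row) as sum_log
from (select userid, start_date,
             unix_timestamp(end_date) - unix_timestamp(start_date) as long_time
      from login_start_end_time) t;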

---The third step takes the cumulative duration plus the gap, divides by 30 minutes and floors the result, assigning each call to a 30-minute period

select userid, start_date, end_date, long_time,
       floor((sum_log + diff_long) / (30 * 60)) as time_inter
from (select userid, start_date, end_date, long_time,
             sum(long_time) over (distribute by userid sort by start_date) as sum_log,
             unix_timestamp(start_date) - unix_timestamp(last_end_time) as diff_long
      from (select userid, start_date, end_date,
                   unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
                   lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
            from login_start_end_time) t) d;
--the intermediate log is omitted
Total MapReduce CPU Time Spent: 25 seconds 30 msec
OK
aaa	2018-01-01 08:01:00	2018-01-01 08:08:00	420	0
aaa	2018-01-01 08:15:00	2018-01-01 08:20:00	300	0
aaa	2018-01-01 08:45:00	2018-01-01 08:48:00	180	1
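
To see how the period number (time_inter) falls out, plug in the sum_log and diff_long values from the step-2 output (30 * 60 = 1800 seconds; again, a constant-only SELECT needs Hive 0.13 or later):

select floor((420 + 0) / 1800)    as call_1,   -- 0 -> first period
       floor((720 + 420) / 1800)  as call_2,   -- 0 -> still the first period
       floor((900 + 1500) / 1800) as call_3;   -- 1 -> a new period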

---The fourth step groups by the period number to get each period's earliest start time, number of calls, and total duration (minutes)

select userid, time_inter + 1, min(start_date) as start_date, count(1) as cnt, sum(long_time) / 60 as long_time
from (select userid, start_date, end_date, long_time,
             floor((sum_log + diff_long) / (30 * 60)) as time_inter
      from (select userid, start_date, end_date, long_time,
                   sum(long_time) over (distribute by userid sort by start_date) as sum_log,
                   unix_timestamp(start_date) - unix_timestamp(last_end_time) as diff_long
            from (select userid, start_date, end_date,
                         unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
                         lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
                  from login_start_end_time) t) d) d1
group by userid, time_inter + 1;
--the intermediate log is omitted
Total MapReduce CPU Time Spent: 39 seconds 0 msec
OK
aaa	1	2018-01-01 08:01:00	2	12.0
aaa	2	2018-01-01 08:45:00	1	3.0
Time taken: 294.195 seconds, Fetched: 2 row(s)
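
For readability, the same logic can also be written with WITH clauses (supported since Hive 0.13). This is only a restatement of the final query above, using partition by/order by, which Hive treats the same as distribute by/sort by inside over():

with base as (
  select userid, start_date,
         unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
         lag(end_date, 1, start_date) over (partition by userid order by start_date) as last_end_time
  from login_start_end_time
),
acc as (
  select userid, start_date, long_time,
         sum(long_time) over (partition by userid order by start_date) as sum_log,
         unix_timestamp(start_date) - unix_timestamp(last_end_time) as diff_long
  from base
),
bucketed as (
  select userid, start_date, long_time,
         floor((sum_log + diff_long) / (30 * 60)) as time_inter
  from acc
)
select userid, time_inter + 1, min(start_date) as start_date,
       count(1) as cnt, sum(long_time) / 60 as long_time
from bucketed
group by userid, time_inter + 1;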

Personal understanding: the key is to construct a period-number field (time_inter) that distinguishes which rows belong to the same time period.

 


Origin blog.csdn.net/zhaoxiangchong/article/details/115319221