hive中的分析函数的典型应用

大家:

  好!今天看到了一个hive的分析函数的题,感觉很有意思。把答案整理了整理,分享出来,希望对大家有用。需求如下所示:

   起初我看到这道题时,感觉应该要用到分析函数,但就不知道怎么用。好在最终,写出来了,思路如下所示:

----测试表的表结构以及数据如下所示:

hive> desc sales;
OK
id                   int                                    
produce_name         string                                  
start_time           date                                    
end_time             date                                    
days                 int                                    
Time taken: 0.354 seconds, Fetched: 5 row(s)
hive> select * from sales;
OK
1 nike 2011-09-01 2011-09-05 5
2 nike 2011-09-03 2011-09-06 4
3 nike 2011-09-09 2011-09-15 7
4 oppo 2011-08-04 2011-08-05 2
5 oppo 2011-08-04 2011-08-15 12
6 vivo 2011-08-15 2011-08-21 7
7 vivo 2011-09-02 2011-09-12 11
Time taken: 0.223 seconds, Fetched: 7 row(s)


---第一步:求出每个开始时间的前一次促销的开始时间和结束时间

select id,produce_name,start_time,end_time,
lag(start_time) over(partition by produce_name order by id) before_start_time,
lag(end_time) over(partition by produce_name order by id) before_end_time
from sales;


---第一步:执行结果(第一步)

1 nike 2011-09-01 2011-09-05 NULL NULL
2 nike 2011-09-03 2011-09-06 2011-09-01 2011-09-05
3 nike 2011-09-09 2011-09-15 2011-09-03 2011-09-06
4 oppo 2011-08-04 2011-08-05 NULL NULL
5 oppo 2011-08-04 2011-08-15 2011-08-04 2011-08-05
6 vivo 2011-08-15 2011-08-21 NULL NULL
7 vivo 2011-09-02 2011-09-12 2011-08-15 2011-08-21


---第二步  依据前一次促销的开始时间和结束时间,开始时间统一为最早的,为后面分组做准备

select produce_name,start_time,max(end_time) end_time from 
(select id,produce_name,case when start_time>=before_start_time and start_time<=before_end_time then before_start_time else start_time end as start_time, end_time
from (select id,produce_name,start_time,end_time,
lag(start_time) over(partition by produce_name order by id) before_start_time,
lag(end_time) over(partition by produce_name order by id) before_end_time
from sales) t) d
group by produce_name,start_time;


--第二步 执行结果(第二步)

nike 2011-09-01 2011-09-06
nike 2011-09-09 2011-09-15
oppo 2011-08-04 2011-08-15
vivo 2011-08-15 2011-08-21
vivo 2011-09-02 2011-09-12


---第三步  依据合并后的开始时间,算出每个时间段内的促销天数之和

select produce_name,start_time,end_time,datediff(end_time,start_time)+1 days from
(select produce_name,start_time,max(end_time) end_time from 
(select id,produce_name,case when start_time>=before_start_time and start_time<=before_end_time then before_start_time else start_time end as start_time, end_time
from (select id,produce_name,start_time,end_time,
lag(start_time) over(partition by produce_name order by id) before_start_time,
lag(end_time) over(partition by produce_name order by id) before_end_time
from sales) t) d
group by produce_name,start_time) e;


--执行结果(第三步)

nike 2011-09-01 2011-09-06 6
nike 2011-09-09 2011-09-15 7
oppo 2011-08-04 2011-08-15 12
vivo 2011-08-15 2011-08-21 7
vivo 2011-09-02 2011-09-12 11

--第四步:依据产品名称,求出最终的促销天数之和

select produce_name,sum(days) from
(select produce_name,start_time,end_time,datediff(end_time,start_time)+1 days from
(select produce_name,start_time,max(end_time) end_time from 
(select id,produce_name,case when start_time>=before_start_time and start_time<=before_end_time then before_start_time else start_time end as start_time, end_time
from (select id,produce_name,start_time,end_time,
lag(start_time) over(partition by produce_name order by id) before_start_time,
lag(end_time) over(partition by produce_name order by id) before_end_time
from sales) t) d
group by produce_name,start_time) e) f
group by produce_name;


---执行结果(第四步)
 

nike 13
oppo 12
vivo 18

说明:  个人的见解,希望对大家有帮助!

后面经过分析,此种方法有问题,如果连续值涉及到了多行,按照偏移量取最小值时就会有问题。

第一步  获取每行的上一次的结束时间

select id,
produce_name,
start_time,
end_time,
lag(end_time,1,start_time) over(partition by produce_name order by id) before_end_time
from sales

效果如下所示:

第二步  依据本行的开始时间和上次的结束时间,计算中间差了多少天。因为datediff函数是差值概念,连续值需要增加1,间隔的天数需要减1

select 
id,
produce_name,
end_time,
start_time,
before_end_time,
datediff(end_time,start_time)+1 as cnt_all,
case when start_time<=before_end_time then 0 else datediff(start_time,before_end_time)-1 end cnt 
from (select id,
produce_name,
start_time,
end_time,
lag(end_time,1,start_time) over(partition by produce_name order by id) before_end_time
from sales ) t

效果如下所示:

结果符合猜想,nike的第三行开始时间是9号,上次结束是6号,中间差了2天

第三步  按照产品汇总取连续值和中间的间隔值求和

select 
produce_name,
min(start_time) as start_time,
max(end_time) as end_time,
datediff(max(end_time),min(start_time))+1 cnt_all,
sum(case when start_time<=before_end_time then 0 else datediff(start_time,before_end_time)-1 end) cnt 
from (select id,
produce_name,
start_time,
end_time,
lag(end_time,1,start_time) over(partition by produce_name order by id) before_end_time
from sales ) t 
group by produce_name

结果如下所示:

第四步  最终的脚本 ,精简下

select 
produce_name,
datediff(max(end_time),min(start_time))+1-
    sum(case when start_time<=before_end_time then 0 else datediff(start_time,before_end_time)-1 end) cnt 
from (select id,
produce_name,
start_time,
end_time,
lag(end_time,1,start_time) over(partition by produce_name order by id) before_end_time
from sales ) t 
group by produce_name

最终结果如下所示:

      个人见解,欢迎指正。

猜你喜欢

转载自blog.csdn.net/zhaoxiangchong/article/details/78523589