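Hive SQL Window Functions:
Core syntax skeleton (the number in front of each clause is its logical execution order):
8 - SELECT
1 - FROM (left_table)
3 - (join_type) JOIN (right_table)
2 - ON
4 - WHERE
5 - GROUP BY
6 - WITH (CUBE | ROLLUP)
7 - HAVING
9 - ORDER BY
10 - LIMIT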
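One practical consequence of this ordering: window functions are computed at the SELECT step, after WHERE has already run, so a filter on a window result has to go in an outer query. A minimal sketch, assuming a hypothetical inline table t built with union all (and an engine version that allows select without from):

```sql
-- Hypothetical inline data standing in for a real table.
with t as (
    select 'alice' as user_name, 300 as pay_amount union all
    select 'bob', 200 union all
    select 'carol', 100
)
select a.user_name,
       a.pay_amount,
       a.rn
from
    (select user_name,
            pay_amount,
            row_number() over (order by pay_amount desc) as rn
     from t) a
where a.rn <= 2;  -- the filter sits outside, where rn already exists
-- returns alice (rn = 1) and bob (rn = 2)
```

Putting rn <= 2 in the inner WHERE would fail, because rn does not exist yet at that step.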
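Window function families:
1. sum(), avg(): cumulative (running aggregate) window functions
2. row_number(), rank(): ranking window functions
3. ntile(): bucketing window function
4. lag(), lead(): offset window functions
Cumulative window functions:
sum(...) over(...): the over clause specifies the columns that the accumulation is partitioned and ordered by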
Eg1:
Monthly payment total for 2018 and the year-to-date cumulative total
select a.month,
       a.pay_amount,
       sum(a.pay_amount) over (order by a.month) -- running total up to this month
from
    (select month(dt) month,
            sum(pay_amount) pay_amount
     from user_trade
     where year(dt) = 2018
     group by month(dt)) a;
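1. partition by splits the rows into groups (in the example below the rows are first grouped by year (2017, 2018) and then ordered by month within each group).
2. order by sets the order in which the rows are accumulated: asc is ascending, desc is descending, and ascending is the default.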
Eg:
Monthly payment total for 2017-2018 and the cumulative total within each year
select a.year,
       a.month,
       a.pay_amount,
       -- partition by groups the rows by year; without it the ordering would interleave rows such as 2018-01 and 2017-03
       sum(a.pay_amount) over (partition by a.year order by a.month)
from
    (select year(dt) year,
            month(dt) month,
            sum(pay_amount) pay_amount
     from user_trade
     where year(dt) in (2017, 2018)
     group by year(dt),
              month(dt)) a;
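Common mistakes:
A. No partitioning
sum(pay_amount) over (order by a.month)
All 24 monthly rows are ordered together, so the months of 2017 and 2018 get mixed.
B. Wrong partitioning columns
sum(pay_amount) over (partition by a.year, a.month order by a.month)
Each of the 24 months becomes its own partition, so nothing accumulates.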
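The effect of the two mistakes can be seen side by side on a tiny data set. This is only a sketch: the inline table monthly and its values are made up, and the columns are named yr and mon to keep them distinct from the year() and month() functions.

```sql
-- Hypothetical monthly totals: two months in each of two years.
with monthly as (
    select 2017 as yr, 1 as mon, 10 as pay_amount union all
    select 2017, 2, 20 union all
    select 2018, 1, 100 union all
    select 2018, 2, 200
)
select yr,
       mon,
       pay_amount,
       sum(pay_amount) over (partition by yr order by mon)      as correct_total,
       sum(pay_amount) over (order by mon)                      as no_partition,
       sum(pay_amount) over (partition by yr, mon order by mon) as over_partitioned
from monthly
order by yr, mon;
-- yr    mon  pay_amount  correct_total  no_partition  over_partitioned
-- 2017  1    10          10             110           10
-- 2017  2    20          30             330           20
-- 2018  1    100         100            110           100
-- 2018  2    200         300            330           200
```

Without partitioning, the two January rows are peers in the ordering and both receive the combined total; with year and month as the partition, every row is alone in its partition and the "cumulative" value is just the row's own amount.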
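avg(...) over(...)
Eg1:
3-month moving average of the monthly payment amount in 2018
Moving average: for measurements (x1, x2, x3, x4, x5, x6, x7), the 3-point moving averages are (x1+x2+x3)/3, (x2+x3+x4)/3, (x3+x4+x5)/3, and so on.
Hive implementation:
select a.month, -- month
       a.pay_amount, -- payment total for the month
       -- moving average over this month and the two before it (the first two months average over fewer rows)
       avg(a.pay_amount) over (order by a.month rows between 2 preceding and current row)
from
    (select month(dt) month,
            sum(pay_amount) pay_amount
     from user_trade
     where year(dt) = 2018
     group by month(dt)) a;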
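Summary:
1. sum(a) over (partition by …b… order by c rows between d1 and d2)
2. avg(a) over (partition by …b… order by c rows between d1 and d2)
a: the column to aggregate
b: the partitioning column(s)
c: the ordering column(s)
d1, d2: the bounds of the row frame
rows between unbounded preceding and current row;
-- this row and all rows before it. It can be omitted when order by is present: the default frame is then range between unbounded preceding and current row, which gives the same result as long as the order by values are unique.
rows between current row and unbounded following;
-- this row and all rows after it
rows between 3 preceding and current row;
-- this row and the 3 rows before it (4 rows)
rows between 3 preceding and 1 following;
-- from 3 rows before to 1 row after (5 rows, including this row)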
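The frame clauses above can be compared on a four-row example. This is a sketch over a hypothetical inline table t with made-up amounts; only the frame differs between the four sum() columns.

```sql
-- Hypothetical data: one amount per month.
with t as (
    select 1 as mon, 10 as amount union all
    select 2, 20 union all
    select 3, 30 union all
    select 4, 40
)
select mon,
       amount,
       sum(amount) over (order by mon rows between unbounded preceding and current row) as running_total,
       sum(amount) over (order by mon rows between current row and unbounded following) as remaining_total,
       sum(amount) over (order by mon rows between 1 preceding and current row)         as last_two,
       sum(amount) over (order by mon rows between 1 preceding and 1 following)         as centered
from t
order by mon;
-- mon  amount  running_total  remaining_total  last_two  centered
-- 1    10      10             100              10        30
-- 2    20      30             90               30        60
-- 3    30      60             70               50        90
-- 4    40      100            40               70        70
```

At the edges of the partition the frame simply contains fewer rows, which is why last_two is 10 on the first row and centered is 70 on the last.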
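Ranking window functions (a common interview topic):
1. row_number() over(...): assigns each row a sequential number in the given order; numbers never repeat.
2. rank() over(...) and dense_rank() over(...): rows with equal ordering values share the same number; rank() then skips the following numbers, dense_rank() does not.
Eg1:
Ranking of users by the number of distinct product categories they bought in January 2018
select user_name,
       count(distinct goods_category),
       row_number() over (order by count(distinct goods_category)),
       rank() over (order by count(distinct goods_category)),
       dense_rank() over (order by count(distinct goods_category))
from user_trade
where substr(dt, 1, 7) = '2018-01'
group by user_name;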
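How the results differ, in brief:
category_count  row_number()  rank()  dense_rank()
1               1             1       1
1               2             1       1
2               3             3       2
Typical use cases:
row_number(): pick exactly the top 2 rows
rank(): exam-score ranking (in the example above nobody is ranked 2nd)
dense_rank(): medal winners in a competition (equal scores share a place)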
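The three-row comparison above can be reproduced with inline data. A sketch, assuming a hypothetical table counts with two tied users; user_name is added as a tie-breaker for row_number() only, because the order among tied rows is otherwise arbitrary.

```sql
-- Hypothetical per-user category counts; alice and bob are tied.
with counts as (
    select 'alice' as user_name, 1 as category_count union all
    select 'bob', 1 union all
    select 'carol', 2
)
select user_name,
       category_count,
       row_number() over (order by category_count, user_name) as rn,
       rank()       over (order by category_count)            as rk,
       dense_rank() over (order by category_count)            as dense_rk
from counts
order by category_count, user_name;
-- user_name  category_count  rn  rk  dense_rk
-- alice      1               1   1   1
-- bob        1               2   1   1
-- carol      2               3   3   2
```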
Eg2:
Select the users ranked 10th, 20th and 30th by payment amount in 2019
Can as be omitted when aliasing a computed column? Yes, in Hive the keyword is optional.
select a.user_name,
       a.pay_amount,
       a.rn
from
    (select user_name, -- user name
            sum(pay_amount) pay_amount, -- total paid per user
            row_number() over (order by sum(pay_amount) desc) rn -- position by total paid
     from user_trade
     where year(dt) = 2019
     group by user_name) a
where a.rn in (10, 20, 30);
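Bucketing window function:
ntile(n) over (partition by a order by b)
n: the number of buckets
a: the partitioning column
b: the ordering column
ntile(n) splits the ordered rows of each partition into n buckets and returns the bucket number of each row.
ntile does not support rows between.
If the rows cannot be divided evenly, the extra rows go to the earliest buckets, starting with the first.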
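The uneven case is easiest to see with seven rows and three buckets. A sketch over a hypothetical inline table t:

```sql
-- Hypothetical data: seven rows cannot be split evenly into three buckets.
with t as (
    select 1 as amount union all
    select 2 union all
    select 3 union all
    select 4 union all
    select 5 union all
    select 6 union all
    select 7
)
select amount,
       ntile(3) over (order by amount) as bucket
from t
order by amount;
-- amount: 1  2  3  4  5  6  7
-- bucket: 1  1  1  2  2  3  3
```

The one leftover row lands in bucket 1, giving bucket sizes of 3, 2 and 2.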
Eg1:
Split the users who paid in January 2019 into 5 groups by payment amount
select user_name,
       sum(pay_amount) pay_amount,
       ntile(5) over (order by sum(pay_amount) desc) level -- buckets follow the same column that is used for ordering
from user_trade
where substr(dt, 1, 7) = '2019-01'
group by user_name;
Eg2:
Select the users in the top 10% by refund amount in 2019
select a.user_name,
       a.refund_amount,
       a.level
from
    (select user_name,
            sum(refund_amount) refund_amount,
            ntile(10) over (order by sum(refund_amount) desc) level -- level 1 is the top tenth
     from user_refund
     where year(dt) = 2019
     group by user_name) a
where a.level = 1;
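Offset window functions:
1. lag(...) over(...): returns the value from n rows before the current row in the sorted order (lag always looks backward)
2. lead(...) over(...): returns the value from n rows after the current row (lead always looks forward)
lag(exp_str, offset, defval) over (partition by … order by …)
lead(exp_str, offset, defval) over (partition by … order by …)
exp_str: the column (or expression) to read
offset: how many rows to shift, i.e. the previous row or the n-th previous row; defaults to 1
defval: the value returned when the offset falls outside the partition; defaults to null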
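Eg1:
Various time offsets for alice and alexander
select user_name,
       dt,
       lag(dt, 1, dt) over (partition by user_name order by dt), -- previous dt, or the row's own dt on the first row
       lag(dt) over (partition by user_name order by dt), -- previous dt, or null on the first row
       lag(dt, 2, dt) over (partition by user_name order by dt), -- dt from two rows back, or the row's own dt
       lag(dt, 2) over (partition by user_name order by dt) -- dt from two rows back, or null
from user_trade
where dt > '0'
  and user_name in ('alice', 'alexander');
Note: lag(dt) offsets dt with the default offset of 1 and the default defval of null;
partition by user_name groups the rows per user, so the offsets are taken within each user's own rows.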
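To make the defaults concrete, here is a sketch with three made-up trade dates for a single user; the inline table trades and its values are hypothetical.

```sql
-- Hypothetical data: three payments by one user.
with trades as (
    select 'alice' as user_name, '2019-01-01' as dt union all
    select 'alice', '2019-01-05' union all
    select 'alice', '2019-02-01'
)
select user_name,
       dt,
       lag(dt)        over (partition by user_name order by dt) as prev_dt,
       lag(dt, 1, dt) over (partition by user_name order by dt) as prev_or_self,
       lead(dt)       over (partition by user_name order by dt) as next_dt
from trades
order by dt;
-- dt          prev_dt     prev_or_self  next_dt
-- 2019-01-01  NULL        2019-01-01    2019-01-05
-- 2019-01-05  2019-01-01  2019-01-01    2019-02-01
-- 2019-02-01  2019-01-05  2019-01-05    NULL
```

The first row has nothing before it, so lag(dt) returns null while lag(dt, 1, dt) falls back to the row's own date; the last row has nothing after it, so lead(dt) returns null.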
Eg2:
Number of users with a gap of more than 100 days between two consecutive payments
select count(distinct a.user_name)
from
    (select user_name,
            dt,
            lead(dt, 1) over (partition by user_name order by dt) lead_dt -- date of the user's next payment
     from user_trade
     where dt > '0') a
where a.lead_dt is not null
  and datediff(a.lead_dt, a.dt) > 100;
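Key exercise:
For each city and each gender, find the top 3 users by payment amount in 2018
The key point is the grouping:
Shanghai  male    top3
          female  top3
Beijing   male    top3
          female  top3
My attempt (incorrect; the corrected version follows it):
First compute each user's total payment for 2018
Then order within each city and gender
Finally filter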
select a.user_name,
(select user_name,
sum(pay_amount) over( partition by city and sex order by sum(pay_amount) ) pay_amount
from user_trade
where year(dt)=‘2018’
group by user_name)a
from user_trade
where a. > 3 and a. =3;
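Corrected version (the attempt above fails because partition by takes a comma-separated column list rather than and, city and sex are read from user_info rather than user_trade, and the outer query never ranks or filters the rows):
select c.user_name,
       c.city,
       c.sex,
       c.pay_amount,
       c.rn
from
    (select a.user_name,
            b.city,
            b.sex,
            a.pay_amount,
            row_number() over (partition by b.city, b.sex order by a.pay_amount desc) rn -- position within each city and gender
     from
         (select user_name,
                 sum(pay_amount) pay_amount
          from user_trade
          where year(dt) = 2018
          group by user_name) a
     left join user_info b on a.user_name = b.user_name) c
where c.rn <= 3;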