Hive的窗口函数(附带上手案例)

版权声明:Please make the source marked https://blog.csdn.net/qq_31807385/article/details/84777349

目录

窗口函数的概述与总结:

可上手案例实操:

总结:


窗口函数的概述与总结:

1.什么时候用开窗函数?开窗函数常结合聚合函数使用,一般来讲聚合后的行数要少于聚合前的行数,但是有时我们既想显示聚集前的数据,又要显示聚集后的数据,这时我们便引入了窗口函数.如下:

+-------+-------------+-------+---------------+--+
| name  |  orderdate  | cost  | sum_window_0  |
+-------+-------------+-------+---------------+--+
| jack  | 2017-01-01  | 10    | 205           |
| jack  | 2017-01-08  | 55    | 205           |
| tony  | 2017-01-07  | 50    | 205           |
| jack  | 2017-01-05  | 46    | 205           |
| tony  | 2017-01-04  | 29    | 205           |
| tony  | 2017-01-02  | 15    | 205           |
| jack  | 2017-02-03  | 23    | 23            |
| mart  | 2017-04-13  | 94    | 341           |
| jack  | 2017-04-06  | 42    | 341           |
| mart  | 2017-04-11  | 75    | 341           |
| mart  | 2017-04-09  | 68    | 341           |
| mart  | 2017-04-08  | 62    | 341           |
| neil  | 2017-05-10  | 12    | 12            |
| neil  | 2017-06-12  | 80    | 80            |
+-------+-------------+-------+---------------+--

2.窗口函数的语法:

UDAF() over (PARTITION By col1,col2 order by col3 窗口子句(rows between .. and ..)) AS 列别名

注意:PARTITION By后可跟多个字段,order By只跟一个字段。

partition by子句:
一旦指定了partition by子句,聚合函数的作用范围就是分区之后的数据,这一点和group by 有些类似

order by子句:
order by子句对字段进行排序,如果order by子句后面没有跟rows between ** and ** 则表示起点到当前行
的聚合。order by后的rows子句近一步限制聚合函数的作用范围。

窗口子句
CURRENT ROW:当前行
n PRECEDING:往前n行数据
n FOLLOWING:往后n行数据
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING表示到后面的终点
窗口子句对聚合函数的聚合范围作进一步的动态划分,没有指定的时候,默认为起点到当前行的聚合

注意:
(1)order by必须跟在partition by后;
(2)Rows必须跟在Order by子;
(3)(partition by .. order by)可替换为(distribute by .. sort by ..)

可上手案例实操:

数据的准备:
[isea@hadoop108 datas]$ cat business.txt 
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

需求:
(1)查询在2017年4月份购买过的顾客及总人数
(2)查询顾客的购买明细及月购买总额
(3)上述的场景, 将每个顾客的cost按照日期进行累加
(4)查询每个顾客上次的购买时间
(5)查询前20%时间的订单信息

创建hive表并导入数据:

create table business(name string,orderdate string,cost int)
row format delimited
fields terminated by ',';

load data local inpath '/opt/module/datas/business.txt' into table business;

0: jdbc:hive2://hadoop108:10000> select * from business;
+----------------+---------------------+----------------+--+
| business.name  | business.orderdate  | business.cost  |
+----------------+---------------------+----------------+--+
| jack           | 2017-01-01          | 10             |
| tony           | 2017-01-02          | 15             |
| jack           | 2017-02-03          | 23             |
| tony           | 2017-01-04          | 29             |
| jack           | 2017-01-05          | 46             |
| jack           | 2017-04-06          | 42             |
| tony           | 2017-01-07          | 50             |
| jack           | 2017-01-08          | 55             |
| mart           | 2017-04-08          | 62             |
| mart           | 2017-04-09          | 68             |
| neil           | 2017-05-10          | 12             |
| mart           | 2017-04-11          | 75             |
| neil           | 2017-06-12          | 80             |
| mart           | 2017-04-13          | 94             |
+----------------+---------------------+----------------+--+


(1)查询在2017年4月份购买过的顾客及总人数:
分析过程:四月份的数据如下:
| jack           | 2017-04-06          | 42             |
| mart           | 2017-04-08          | 62             |
| mart           | 2017-04-09          | 68             |
| mart           | 2017-04-11          | 75             |
| mart           | 2017-04-13          | 94             |

最后的输出结果应该是长成这个样子:
jack 2
mart 2

我们一起来理解一下聚合函数,下面的这个聚合函数,将business表中的所有内容作为输入,输入到聚合函数中去
select count(*) from business;
+------+--+
| _c0  |
+------+--+
| 14   |
+------+--+


下面的这个聚合函数,作用在分组的数据中,这样一来,聚合函数的作用对象就是组,即如果是同一个组,会作为输入,输入到聚合函数中去。
select name,count(*)
from business
where orderdate like '2017-04%'
group by name;

+-------+------+--+
| name  | _c1  |
+-------+------+--+
| jack  | 1    |
| mart  | 4    |
+-------+------+--+
上面的这个结果很明显并不是我们想要的,因为这里是将一个组的内容作为输入输入到聚合函数中去的,所以此时统计的是name的个数。我们
想要的结果是:将这两行的内容作为聚合函数的输入。我们可以使用窗口函数实现:

select name,count(*) over()
from business
where orderdate like '2017-04%'
group by name;

+-------+-----------------+--+
| name  | count_window_0  |
+-------+-----------------+--+
| mart  | 2               |
| jack  | 2               |
+-------+-----------------+--+


(2)查询顾客的购买明细及月购买总额
这里面是月购买总额,所以应该以月份来进行分组,同样月份的进入聚合函数,这里既要显示原来的数据,
又要显示聚合之后的数据,所以使用窗口函数。

select *,sum(cost) over(partition by month(orderdate)) 
from business;

+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | 205           |
| jack           | 2017-01-08          | 55             | 205           |
| tony           | 2017-01-07          | 50             | 205           |
| jack           | 2017-01-05          | 46             | 205           |
| tony           | 2017-01-04          | 29             | 205           |
| tony           | 2017-01-02          | 15             | 205           |
| jack           | 2017-02-03          | 23             | 23            |
| mart           | 2017-04-13          | 94             | 341           |
| jack           | 2017-04-06          | 42             | 341           |
| mart           | 2017-04-11          | 75             | 341           |
| mart           | 2017-04-09          | 68             | 341           |
| mart           | 2017-04-08          | 62             | 341           |
| neil           | 2017-05-10          | 12             | 12            |
| neil           | 2017-06-12          | 80             | 80            |
+----------------+---------------------+----------------+---------------+--+

(3) 将每个顾客的cost按照日期进行累加
这里面需要将每一个顾客的cost累加,所以要对name进行分组,按照日期进行累加,应该对于日期进行排序,
这样才好一行一行的累加。

select *,sum(cost) over(partition by name  order by orderdate)
from business;

+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | 10            |
| jack           | 2017-01-05          | 46             | 56            |
| jack           | 2017-01-08          | 55             | 111           |
| jack           | 2017-02-03          | 23             | 134           |
| jack           | 2017-04-06          | 42             | 176           |
| mart           | 2017-04-08          | 62             | 62            |
| mart           | 2017-04-09          | 68             | 130           |
| mart           | 2017-04-11          | 75             | 205           |
| mart           | 2017-04-13          | 94             | 299           |
| neil           | 2017-05-10          | 12             | 12            |
| neil           | 2017-06-12          | 80             | 92            |
| tony           | 2017-01-02          | 15             | 15            |
| tony           | 2017-01-04          | 29             | 44            |
| tony           | 2017-01-07          | 50             | 94            |
+----------------+---------------------+----------------+---------------+--+

除了上面的这种方式,我们还可以使用窗口子句来实现:
select *,sum(cost) over(partition by name order by orderdate
rows between UNBOUNDED PRECEDING and CURRENT ROW)
from business;
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | 10            |
| jack           | 2017-01-05          | 46             | 56            |
| jack           | 2017-01-08          | 55             | 111           |
| jack           | 2017-02-03          | 23             | 134           |
| jack           | 2017-04-06          | 42             | 176           |
| mart           | 2017-04-08          | 62             | 62            |
| mart           | 2017-04-09          | 68             | 130           |
| mart           | 2017-04-11          | 75             | 205           |
| mart           | 2017-04-13          | 94             | 299           |
| neil           | 2017-05-10          | 12             | 12            |
| neil           | 2017-06-12          | 80             | 92            |
| tony           | 2017-01-02          | 15             | 15            |
| tony           | 2017-01-04          | 29             | 44            |
| tony           | 2017-01-07          | 50             | 94            |
+----------------+---------------------+----------------+---------------+--+


(4)查询每个顾客上次的购买时间
对name 分区,对时间排序,例如下面的这个样子
+----------------+---------------------+----------------+--+
| business.name  | business.orderdate  | business.cost  | date
+----------------+---------------------+----------------+--+
| jack           | 2017-01-01          | 10             | null
| jack           | 2017-01-05          | 46             | 2017-01-01
| jack           | 2017-02-03          | 23             | 2017-01-05 
| jack           | 2017-04-06          | 42             | 2017-02-03
| tony           | 2017-01-02          | 15             | null
| tony           | 2017-01-04          | 29             | 2017-01-02

select *,lag(orderdate,1,'-1') over(partition by name order by orderdate)
from business;
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | lag_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | -1            |
| jack           | 2017-01-05          | 46             | 2017-01-01    |
| jack           | 2017-01-08          | 55             | 2017-01-05    |
| jack           | 2017-02-03          | 23             | 2017-01-08    |
| jack           | 2017-04-06          | 42             | 2017-02-03    |
| mart           | 2017-04-08          | 62             | -1            |
| mart           | 2017-04-09          | 68             | 2017-04-08    |
| mart           | 2017-04-11          | 75             | 2017-04-09    |
| mart           | 2017-04-13          | 94             | 2017-04-11    |
| neil           | 2017-05-10          | 12             | -1            |
| neil           | 2017-06-12          | 80             | 2017-05-10    |
| tony           | 2017-01-02          | 15             | -1            |
| tony           | 2017-01-04          | 29             | 2017-01-02    |
| tony           | 2017-01-07          | 50             | 2017-01-04    |
+----------------+---------------------+----------------+---------------+--+


(5)查询前20%时间的订单信息
20%需要对时间进行排序,取到其中的20%,输入全部,得到20%,使用Ntail聚合函数

t1:
select *,NTILE(5) over(order by orderdate) num
from business ; 
+----------------+---------------------+----------------+------+--+
| business.name  | business.orderdate  | business.cost  | num  |
+----------------+---------------------+----------------+------+--+
| jack           | 2017-01-01          | 10             | 1    |
| tony           | 2017-01-02          | 15             | 1    |
| tony           | 2017-01-04          | 29             | 1    |
| jack           | 2017-01-05          | 46             | 2    |
| tony           | 2017-01-07          | 50             | 2    |
| jack           | 2017-01-08          | 55             | 2    |
| jack           | 2017-02-03          | 23             | 3    |
| jack           | 2017-04-06          | 42             | 3    |
| mart           | 2017-04-08          | 62             | 3    |
| mart           | 2017-04-09          | 68             | 4    |
| mart           | 2017-04-11          | 75             | 4    |
| mart           | 2017-04-13          | 94             | 4    |
| neil           | 2017-05-10          | 12             | 5    |
| neil           | 2017-06-12          | 80             | 5    |
+----------------+---------------------+----------------+------+--+


select * from
(select *,NTILE(5) over(order by orderdate) num
from business ) t1
where num = 1;
+----------+---------------+----------+---------+--+
| t1.name  | t1.orderdate  | t1.cost  | t1.num  |
+----------+---------------+----------+---------+--+
| jack     | 2017-01-01    | 10       | 1       |
| tony     | 2017-01-02    | 15       | 1       |
| tony     | 2017-01-04    | 29       | 1       |
+----------+---------------+----------+---------+--+

总结:

①理解窗口函数的前提是深入理解聚合函数,理解聚合函数,就是要理解聚合函数的作用范围,首先没有任何修饰的聚合函数的作用范围是全体的数据;其次有group by的聚合函数,聚合函数对同组的数据聚合;有了partition by 的范围也是组内的数据;有了窗口子句之后,窗口子句会进一步限制聚合函数的作用范围。②既想显示聚集前的数据,又要显示聚集后的数据,使用窗口函数,因为select 后面的字段必须是聚合函数和group by 字段,如果想显示其他字段,group by做不到,就得使用窗口函数。

猜你喜欢

转载自blog.csdn.net/qq_31807385/article/details/84777349