Window functions and analytic functions in standard SQL
Introduction (why window functions were introduced)
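Regular aggregate functions are usually combined with a GROUP BY clause. Far fewer people know about SQL window functions, which compute over a set of rows and return an aggregate value for every row.
The main advantage of a window function over a regular aggregate is that it does not collapse the rows into groups: every row keeps its own columns and values, and an aggregate value is simply added to each row.
The OVER() clause defines the window (the set of rows over which the window function aggregates). OVER() is discussed in more detail below.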
Types of window functions
Aggregate Window Functions
SUM(), MAX(), MIN(), AVG(), COUNT()
Ranking Window Functions
RANK(), DENSE_RANK(), ROW_NUMBER(), NTILE()
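RANK() and DENSE_RANK() both assign ranks. RANK() leaves gaps in the numbering after ties; DENSE_RANK() does not. The examples later show the difference in detail.
ROW_NUMBER() generates a unique sequential number for each row (within its partition, if one is defined).
NTILE(bucket_num) divides the rows into bucket_num buckets. Each row is assigned to one numbered bucket, and numbering starts at 1.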
NTILE (expr) OVER (
[ PARTITION BY partition_expression ]
[ ORDER BY order_list ]
)
Value Window Functions
LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE()
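LAG (expr[, offset]) returns the value of expr from the row offset rows before the current row. If the second argument is omitted, offset defaults to 1.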
LAG (expr[, offset]) OVER (
[ PARTITION BY partition_expression ]
< ORDER BY order_list >
)
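LEAD (expr[, offset]) returns the value of expr from the row offset rows after the current row. If the second argument is omitted, offset defaults to 1.
FIRST_VALUE() returns the value of expr from the first row of the ordered window.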
FIRST_VALUE(expr) OVER (
[ PARTITION BY partition_expression ]
[ ORDER BY order_list ]
)
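LAST_VALUE() returns the value of expr from the last row of the ordered window frame. When ORDER BY is given without a frame_clause, the default frame ends at the current row, so the frame has to be extended to UNBOUNDED FOLLOWING to get the last row of the whole partition.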
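The value functions above can be tried out with a small sketch. It is not from the original article: it uses Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in for MariaDB, and the sales table and its rows are made up for illustration. It shows FIRST_VALUE() next to LAST_VALUE() with the default frame and with an explicit full-partition frame.

```python
import sqlite3

# Hypothetical sample data: (city, amount). Not part of the article's dataset.
SALES = [("A", 10), ("A", 30), ("A", 20), ("B", 5), ("B", 15)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", SALES)

value_rows = conn.execute(
    """
    SELECT city, amount,
           FIRST_VALUE(amount) OVER (PARTITION BY city ORDER BY amount) AS first_amt,
           -- Default frame ends at the current row, so this is just `amount`.
           LAST_VALUE(amount)  OVER (PARTITION BY city ORDER BY amount) AS last_default,
           -- Explicit frame covering the whole partition.
           LAST_VALUE(amount)  OVER (PARTITION BY city ORDER BY amount
                                     ROWS BETWEEN UNBOUNDED PRECEDING
                                              AND UNBOUNDED FOLLOWING) AS last_full
    FROM sales
    ORDER BY city, amount
    """
).fetchall()

for row in value_rows:
    print(row)
# First row printed: ('A', 10, 10, 10, 30)
```

With the default frame, last_default simply repeats the current row's amount; only the explicit frame returns the largest amount per city.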
Syntax
We use MariaDB as the example database to introduce the syntax.
function (expression) OVER (
[ PARTITION BY expression_list ]
[ ORDER BY order_list [ frame_clause ] ] )
function:
A valid window function
expression_list:
expression | column_name [, expr_list ]
order_list:
expression | column_name [ ASC | DESC ]
[, ... ]
frame_clause:
{ROWS | RANGE} {frame_border | BETWEEN frame_border AND frame_border}
frame_border:
| UNBOUNDED PRECEDING
| UNBOUNDED FOLLOWING
| CURRENT ROW
| expr PRECEDING
| expr FOLLOWING
Examples
Create the database and user
create database test;
create user 'root'@'%' identified with mysql_native_password by 'root123';
# Grant privileges
grant all on test.* to 'root'@'%';
# Revoke privileges (only when the grant needs to be undone)
revoke all on test.* from 'root'@'%';
Create the table and load the initial data
drop table if EXISTS orders;
CREATE TABLE orders
(
order_id INT,
order_date DATE,
customer_name VARCHAR(250),
city VARCHAR(100),
order_amount INT
);
-- SELECT DATE_FORMAT(NOW(), '%m/%d/%Y %H:%i:%S');
-- select STR_TO_DATE('04/01/2017', '%m/%d/%Y');
INSERT INTO orders VALUES
(1001, STR_TO_DATE('04/01/2017', '%m/%d/%Y'), 'David Smith', 'GuildFord', 10000),
(1002, STR_TO_DATE('04/02/2017', '%m/%d/%Y'), 'David Jones', 'Arlington', 20000),
(1003, STR_TO_DATE('04/03/2017', '%m/%d/%Y'), 'John Smith', 'Shalford', 5000),
(1004, STR_TO_DATE('04/04/2017', '%m/%d/%Y'), 'Michael Smith', 'GuildFord', 15000),
(1005, STR_TO_DATE('04/05/2017', '%m/%d/%Y'), 'David Williams', 'Shalford', 7000),
(1006, STR_TO_DATE('04/06/2017', '%m/%d/%Y'), 'Paum Smith', 'GuildFord', 25000),
(1007, STR_TO_DATE('04/10/2017', '%m/%d/%Y'), 'Andrew Smith', 'Arlington', 15000),
(1008, STR_TO_DATE('04/11/2017', '%m/%d/%Y'), 'David Brown', 'Arlington', 2000),
(1009, STR_TO_DATE('04/20/2017', '%m/%d/%Y'), 'Robert Smith', 'Shalford', 1000),
(1010, STR_TO_DATE('04/25/2017', '%m/%d/%Y'), 'Peter Smith', 'GuildFord', 500);
Total sales per city
The regular GROUP BY version looks like this:
SELECT city, SUM(order_amount)
FROM orders GROUP BY city;
The window function version is shown next. PARTITION BY does the grouping for the window function: it defines the window, that is, the set of rows on which the window function operates.
SELECT *, SUM(order_amount) OVER(PARTITION BY city) AS total
FROM orders;
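As the figure below shows, a column named total is added to every row of the original table, holding the total sales of that row's city. Every row still keeps all of its original columns, which is the biggest difference from a regular aggregate.
The other aggregate functions MAX(), MIN(), AVG(), and COUNT() work the same way, so they are not shown one by one here.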
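To make the comparison concrete, here is a minimal sketch that runs both queries. It is an illustration rather than the article's own setup: it assumes Python's built-in sqlite3 module (SQLite 3.25 or later) in place of MariaDB, and a trimmed copy of the orders table that keeps only order_id, city, and order_amount.

```python
import sqlite3

# Trimmed copy of the article's orders rows: (order_id, city, order_amount).
ROWS = [
    (1001, "GuildFord", 10000), (1002, "Arlington", 20000),
    (1003, "Shalford", 5000), (1004, "GuildFord", 15000),
    (1005, "Shalford", 7000), (1006, "GuildFord", 25000),
    (1007, "Arlington", 15000), (1008, "Arlington", 2000),
    (1009, "Shalford", 1000), (1010, "GuildFord", 500),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, city TEXT, order_amount INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ROWS)

# GROUP BY collapses the table to one row per city.
grouped = conn.execute(
    "SELECT city, SUM(order_amount) FROM orders GROUP BY city ORDER BY city"
).fetchall()

# The window version keeps all ten rows and appends the city total to each.
windowed = conn.execute(
    "SELECT order_id, city, order_amount, "
    "SUM(order_amount) OVER (PARTITION BY city) AS total "
    "FROM orders ORDER BY order_id"
).fetchall()

print(grouped)   # [('Arlington', 37000), ('GuildFord', 50500), ('Shalford', 13000)]
for row in windowed:
    print(row)   # first row: (1001, 'GuildFord', 10000, 50500)
```

The GROUP BY result has three rows, while the window result still has ten, each carrying its city's total.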
Rank the orders within each city by order amount, highest first
This mainly uses the ranking window functions.
SELECT *, SUM(order_amount) OVER(PARTITION BY city) AS total
, rank() over(partition by city order by order_amount desc) seq
, dense_rank() OVER(PARTITION BY city ORDER BY order_amount DESC) seq_dense
FROM orders;
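As the figure below shows, the orders of each city are ranked.
The figure does not reveal the difference between rank() and dense_rank(). The difference is that when rows tie, rank() skips the following rank numbers, while dense_rank() does not skip any.
Ranking all orders globally by order amount makes the difference visible.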
SELECT *, SUM(order_amount) OVER(PARTITION BY city) AS total
, RANK() OVER(ORDER BY order_amount DESC) seq
, DENSE_RANK() OVER(ORDER BY order_amount DESC) seq_dense
FROM orders;
To generate a unique sequence number for each row, use ROW_NUMBER():
SELECT *, SUM(order_amount) OVER(PARTITION BY city) AS total
, RANK() OVER(ORDER BY order_amount DESC) seq
, DENSE_RANK() OVER(ORDER BY order_amount DESC) seq_dense
, ROW_NUMBER() OVER(ORDER BY order_amount DESC) seq_row
FROM orders;
The result is shown in the figure.
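The three numbering functions can be compared side by side with a short sketch. It is an assumption-laden illustration, not the article's setup: it uses Python's built-in sqlite3 module (SQLite 3.25 or later) instead of MariaDB, a trimmed orders table, and an extra order_id tie-breaker in the ROW_NUMBER() ordering so that the two 15000 orders come out in a repeatable order.

```python
import sqlite3

# Trimmed copy of the article's orders rows: (order_id, city, order_amount).
ROWS = [
    (1001, "GuildFord", 10000), (1002, "Arlington", 20000),
    (1003, "Shalford", 5000), (1004, "GuildFord", 15000),
    (1005, "Shalford", 7000), (1006, "GuildFord", 25000),
    (1007, "Arlington", 15000), (1008, "Arlington", 2000),
    (1009, "Shalford", 1000), (1010, "GuildFord", 500),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, city TEXT, order_amount INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ROWS)

ranked = conn.execute(
    """
    SELECT order_id, order_amount,
           RANK()       OVER (ORDER BY order_amount DESC) AS seq,
           DENSE_RANK() OVER (ORDER BY order_amount DESC) AS seq_dense,
           -- order_id is added only to make the tie order deterministic.
           ROW_NUMBER() OVER (ORDER BY order_amount DESC, order_id) AS seq_row
    FROM orders
    ORDER BY order_amount DESC, order_id
    """
).fetchall()

for row in ranked:
    print(row)
# Orders 1004 and 1007 tie at 15000:
#   RANK()       gives 3, 3 and then jumps to 5
#   DENSE_RANK() gives 3, 3 and then continues with 4
#   ROW_NUMBER() gives 3, 4
```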
Split the rows of each city into buckets, using NTILE() to assign a bucket number to each row.
SELECT *, SUM(order_amount) OVER(PARTITION BY city) AS total
, RANK() OVER(PARTITION BY city ORDER BY order_amount DESC) seq
, DENSE_RANK() OVER(PARTITION BY city ORDER BY order_amount DESC) seq_dense
, NTILE(2) OVER(PARTITION BY city ORDER BY order_amount DESC) seq_ntile
FROM orders;
The result is shown in the figure below.
NTILE() can also be used without PARTITION BY, in which case the whole table is split into buckets.
SELECT *, SUM(order_amount) OVER(PARTITION BY city) AS total
, RANK() OVER(ORDER BY order_amount DESC) seq
, DENSE_RANK() OVER(ORDER BY order_amount DESC) seq_dense
, ROW_NUMBER() OVER(ORDER BY order_amount DESC) seq_row
, NTILE(5) OVER(ORDER BY order_amount DESC) seq_ntile
FROM orders;
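The bucket assignment can be checked with a small sketch. As an assumption, it runs on Python's built-in sqlite3 module (SQLite 3.25 or later) rather than MariaDB, with a trimmed orders table; it computes both the per-city NTILE(2) and the whole-table NTILE(5) and maps each order_id to its bucket.

```python
import sqlite3

# Trimmed copy of the article's orders rows: (order_id, city, order_amount).
ROWS = [
    (1001, "GuildFord", 10000), (1002, "Arlington", 20000),
    (1003, "Shalford", 5000), (1004, "GuildFord", 15000),
    (1005, "Shalford", 7000), (1006, "GuildFord", 25000),
    (1007, "Arlington", 15000), (1008, "Arlington", 2000),
    (1009, "Shalford", 1000), (1010, "GuildFord", 500),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, city TEXT, order_amount INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ROWS)

bucket_rows = conn.execute(
    """
    SELECT order_id,
           NTILE(2) OVER (PARTITION BY city ORDER BY order_amount DESC) AS per_city,
           NTILE(5) OVER (ORDER BY order_amount DESC) AS whole_table
    FROM orders
    """
).fetchall()

per_city = {order_id: b for order_id, b, _ in bucket_rows}
whole_table = {order_id: b for order_id, _, b in bucket_rows}

# Ten rows in five buckets: two rows per bucket, highest amounts in bucket 1.
print(sorted(whole_table.items()))
# A city with three rows and two buckets puts the extra row in bucket 1.
print(sorted(per_city.items()))
```

When the row count does not divide evenly, as with the three-row cities here, the larger buckets come first.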
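With orders sorted by order amount, get for each order the order id of the row two positions before it
The LAG function returns a value from the row n rows before the current row, and the LEAD function returns a value from the row n rows after it.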
SELECT *, SUM(order_amount) OVER(PARTITION BY city) AS total
, LAG(order_id,2) OVER(ORDER BY order_amount DESC) lag_col
FROM orders;
The result is shown in the figure. LEAD works the same way, so it is not demonstrated.
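For completeness, here is a sketch that runs LAG and LEAD together. It rests on a few assumptions not in the article: Python's built-in sqlite3 module (SQLite 3.25 or later) instead of MariaDB, a trimmed orders table, and order_id added to the ORDER BY as a tie-breaker so the two 15000 orders have a fixed position.

```python
import sqlite3

# Trimmed copy of the article's orders rows: (order_id, city, order_amount).
ROWS = [
    (1001, "GuildFord", 10000), (1002, "Arlington", 20000),
    (1003, "Shalford", 5000), (1004, "GuildFord", 15000),
    (1005, "Shalford", 7000), (1006, "GuildFord", 25000),
    (1007, "Arlington", 15000), (1008, "Arlington", 2000),
    (1009, "Shalford", 1000), (1010, "GuildFord", 500),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, city TEXT, order_amount INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ROWS)

shifted = conn.execute(
    """
    SELECT order_id, order_amount,
           LAG(order_id, 2)  OVER (ORDER BY order_amount DESC, order_id) AS lag_col,
           LEAD(order_id, 2) OVER (ORDER BY order_amount DESC, order_id) AS lead_col
    FROM orders
    ORDER BY order_amount DESC, order_id
    """
).fetchall()

for row in shifted:
    print(row)
# The first two rows have no row two positions earlier, so lag_col is None (SQL NULL);
# the last two rows have no row two positions later, so lead_col is None.
```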
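Per city, ordered by order amount: a running total, and the sum of the current order and the previous one
This is where the frame_clause of window frames is needed; see the syntax section above for its exact form.
In the query below, SUM(order_amount) OVER(PARTITION BY city ORDER BY order_amount ) AS total is the running total.
ROWS BETWEEN 1 preceding AND current ROW restricts the frame to the current order and the one before it.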
SELECT *, SUM(order_amount) OVER(PARTITION BY city ORDER BY order_amount ) AS total
, SUM(order_amount) OVER(PARTITION BY city ORDER BY order_amount ROWS BETWEEN 1 preceding AND current ROW) AS a
FROM orders;
The result is shown in the figure.
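The running total and the two-row sum can be reproduced with the following sketch. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in for MariaDB and a trimmed orders table; the column aliases total and a follow the query above.

```python
import sqlite3

# Trimmed copy of the article's orders rows: (order_id, city, order_amount).
ROWS = [
    (1001, "GuildFord", 10000), (1002, "Arlington", 20000),
    (1003, "Shalford", 5000), (1004, "GuildFord", 15000),
    (1005, "Shalford", 7000), (1006, "GuildFord", 25000),
    (1007, "Arlington", 15000), (1008, "Arlington", 2000),
    (1009, "Shalford", 1000), (1010, "GuildFord", 500),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, city TEXT, order_amount INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ROWS)

running = conn.execute(
    """
    SELECT city, order_amount,
           -- Running total within the city (default frame).
           SUM(order_amount) OVER (PARTITION BY city ORDER BY order_amount) AS total,
           -- Current order plus the one immediately before it.
           SUM(order_amount) OVER (PARTITION BY city ORDER BY order_amount
                                   ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS a
    FROM orders
    ORDER BY city, order_amount
    """
).fetchall()

for row in running:
    print(row)
# GuildFord rows: (500, 500, 500), (10000, 10500, 10500),
#                 (15000, 25500, 25000), (25000, 50500, 40000)
```

The two columns agree for the first two rows of each city and diverge from the third row on, where the two-row frame no longer reaches back to the start of the partition.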
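Window functions and analytic functions in Hive
After working through the standard SQL window and analytic functions above, the section of the official Hive documentation about its window functions becomes easy to follow.
Below we run Hive's window and analytic functions against the same data as above.
Create the table and import the data
Create the orders table. Note that the order_date column is of type date.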
create table orders(
order_id int, order_date date, customer_name string, city string, order_amount int)
row format delimited fields terminated by ','
stored as textfile;
The data file looks like this. Note that the date field must be in the form YYYY-MM-DD, otherwise it cannot be parsed.
1001,2017-04-01,David Smith,GuildFord,10000
1002,2017-04-02,David Jones,Arlington,20000
1003,2017-04-03,John Smith,Shalford,5000
1004,2017-04-04,Michael Smith,GuildFord,15000
1005,2017-04-05,David Williams,Shalford,7000
1006,2017-04-06,Paum Smith,GuildFord,25000
1007,2017-04-10,Andrew Smith,Arlington,15000
1008,2017-04-11,David Brown,Arlington,2000
1009,2017-04-20,Robert Smith,Shalford,1000
1010,2017-04-25,Peter Smith,GuildFord,500
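Assuming the data file is in the directory /usr/local/src/hive/test_data, running load data local inpath '/usr/local/src/hive/test_data' overwrite into table orders; imports the data.
Total sales per city
The figure shows the regular GROUP BY version.
The figure shows the window function version.
The sample statements from the standard SQL section above run in Hive as well and give the same results, so I will not run them one by one here.
Difference between ROWS and RANGE in the frame_clause of window frames
Look at the result of the following SQL: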
SELECT *, SUM(order_amount) OVER(ORDER BY order_amount) AS total
, SUM(order_amount) OVER(ORDER BY order_amount RANGE BETWEEN unbounded preceding AND CURRENT ROW) AS a
, SUM(order_amount) OVER(ORDER BY order_amount ROWS BETWEEN unbounded preceding AND CURRENT ROW) AS b
FROM orders;
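As the figure below shows, RANGE is a logical window: rows whose ORDER BY value equals that of the current row are included as well. ROWS is a physical window: it covers exactly the stated number of rows.
If ORDER BY is not followed by a frame_clause, the default is RANGE BETWEEN unbounded preceding AND CURRENT ROW.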
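The difference only shows up on rows that tie on the ORDER BY value, which in this dataset are the two orders of 15000. The sketch below makes that visible. It is an illustration under assumptions: Python's built-in sqlite3 module (SQLite 3.25 or later) in place of Hive or MariaDB, a trimmed orders table, and the result sorted in Python so the output order is repeatable.

```python
import sqlite3

# Trimmed copy of the article's orders rows: (order_id, city, order_amount).
ROWS = [
    (1001, "GuildFord", 10000), (1002, "Arlington", 20000),
    (1003, "Shalford", 5000), (1004, "GuildFord", 15000),
    (1005, "Shalford", 7000), (1006, "GuildFord", 25000),
    (1007, "Arlington", 15000), (1008, "Arlington", 2000),
    (1009, "Shalford", 1000), (1010, "GuildFord", 500),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, city TEXT, order_amount INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ROWS)

frames = sorted(conn.execute(
    """
    SELECT order_amount,
           SUM(order_amount) OVER (ORDER BY order_amount) AS total,
           SUM(order_amount) OVER (ORDER BY order_amount
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS a,
           SUM(order_amount) OVER (ORDER BY order_amount
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS b
    FROM orders
    """
).fetchall())

for row in frames:
    print(row)
# The two 15000 rows:
#   (15000, 55500, 55500, 40500)
#   (15000, 55500, 55500, 55500)
# total and a always match (default frame is RANGE); b counts physical rows.
```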