Hive SQL Aggregate Functions with rows between / range between Explained

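I. Using rows between and range between

1. Keyword reference

unbounded: no bound
preceding: before the current row
following: after the current row
unbounded preceding: all rows before the current row, i.e. the frame starts at the first row of the partition
n preceding: n rows before the current row
unbounded following: all rows after the current row, i.e. the frame ends at the last row of the partition
n following: n rows after the current row
current row: the current row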

Syntax
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

2. rows between ... and ...

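rows: the frame is determined by row position, i.e. by physical rows.

For example, rows between 1 preceding and 1 following means the frame covers the row before the current row, the current row itself, and the row after it.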
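The positional behaviour of a rows frame can be imitated outside Hive. The following is a minimal Python sketch, not Hive code: the function name rows_frame_sum and the sample values are hypothetical, and the input list is assumed to be one partition that is already sorted in order by order.

```python
def rows_frame_sum(values, preceding, following):
    """Sum over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    `values` is assumed to be a single partition, already in ORDER BY order.
    """
    result = []
    for i in range(len(values)):
        start = max(0, i - preceding)               # clamp at the first row
        end = min(len(values), i + following + 1)   # clamp at the last row
        result.append(sum(values[start:end]))
    return result


# Hypothetical sample values, not taken from the article's table.
print(rows_frame_sum([10, 20, 30, 40], preceding=1, following=1))
# [30, 60, 90, 70]
```

The first and last rows have only two rows in their frame, because there is no row before the first one or after the last one.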

3. range between ... and ...

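range: the frame is determined by value. Rows are sorted by the order by expression, and the frame bounds are obtained by subtracting from and adding to the current row's order by value. Every row whose value falls within those bounds belongs to the frame, so these are logical rows.

For example, sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) partitions by id, sorts by score in ascending order, and then sums the score of every row whose score lies in [current score - 1, current score + 1].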
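The value-based behaviour of a range frame can be imitated in the same spirit. This is a minimal Python sketch, not Hive code: the function name range_frame_sum and the sample values are hypothetical, and it assumes an ascending sort where the summed column is also the order by column, as in the sum(score) example above.

```python
def range_frame_sum(values, preceding, following):
    """Sum over RANGE BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    The frame is defined by value: every row whose ORDER BY value lies in
    [current - preceding, current + following] is included, ties and all.
    """
    return [
        sum(v for v in values if cur - preceding <= v <= cur + following)
        for cur in values
    ]


# Hypothetical sample values with a tie (2, 2) and a gap (2 -> 4).
print(range_frame_sum([1, 2, 2, 4], preceding=1, following=1))
# [5, 5, 5, 4]
```

The row with value 4 only sums itself: its neighbouring row holds 2, which lies outside [3, 5], even though it is physically adjacent.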

This is a mouthful in words; the examples in the next part make it concrete.

II. Examples

1. Data preparation

Assume a table datadev.t_student with the following data:

id score
stu_1 1
stu_1 2
stu_1 3
stu_1 4
stu_1 5
stu_1 5

2. Testing rows between ... and ...

SELECT id, score,
sum(score) over (PARTITION by id) as a1,
sum(score) over (PARTITION by id order by score) as a2,
sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as a3,
sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as a4,
sum(score) over (PARTITION by id order by 1) as a5
from datadev.t_student;

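Analysis of the results:

  1. sum(score) over (PARTITION by id) as a1: sums score over the whole id partition. This is the most familiar form.
  2. sum(score) over (PARTITION by id order by score) as a2: sorts by score and sums from the first row through the current row. Unlike the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame in a3, rows with the same score count as peers and are summed together, so tied rows get the same result. This is similar to how rank treats ties.
  3. sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as a3: sums from the first row through the current row by physical position. Unlike a2, rows with the same score are not peers, so a tied row that comes later is not added to the current row's sum. This is similar to how row_number treats ties.
  4. sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as a4: sums from the first row through the last row, so the result is the same as a1.
  5. sum(score) over (PARTITION by id order by 1) as a5: works the same way as a2. Ordering by the constant 1 is equivalent to every row having the same sort value, so all rows are peers and the whole partition is summed.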
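The five columns can be reproduced with a small simulation. This is a minimal Python sketch that follows the frame rules described in the analysis above; it is not output captured from Hive. The helper name default_frame_sum is hypothetical, and treating order by 1 as a constant sort key follows the article's own reading.

```python
def default_frame_sum(order_keys, values):
    """SUM with ORDER BY and no frame clause, i.e.
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: every row whose
    sort key is <= the current key (ties included) is summed."""
    return [
        sum(v for k, v in zip(order_keys, values) if k <= cur)
        for cur in order_keys
    ]


scores = [1, 2, 3, 4, 5, 5]            # stu_1's scores, sorted ascending
total = sum(scores)

a1 = [total] * len(scores)             # no ORDER BY: whole partition
a2 = default_frame_sum(scores, scores)
a3 = [sum(scores[: i + 1]) for i in range(len(scores))]  # ROWS running total
a4 = [total] * len(scores)             # explicit whole-partition frame
a5 = default_frame_sum([1] * len(scores), scores)        # constant sort key

print("score a1 a2 a3 a4 a5")
for row in zip(scores, a1, a2, a3, a4, a5):
    print(*row)
```

Under these rules a2 and a3 differ only on the first of the two tied rows with score 5: a2 gives 20 for both, while a3 gives 15 and then 20.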

The official documentation explains a1 and a2 as follows:

  • When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
  • When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

Therefore, a1 and a2 are equivalent to the explicit forms a1_explicit and a2_explicit:

SELECT id, score,
sum(score) over (PARTITION by id) as a1,
sum(score) over (PARTITION by id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as a1_explicit,
sum(score) over (PARTITION by id order by score) as a2,
sum(score) over (PARTITION by id order by score RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as a2_explicit
from datadev.t_student;

Official documentation: LanguageManual WindowingAndAnalytics - Apache Hive - Apache Software Foundation

3. Testing range between ... and ...

SELECT id, score,
sum(score) over (PARTITION by id order by score RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as b1,
sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING) as b2
from datadev.t_student;

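Analysis of the results:

  1. sum(score) over (PARTITION by id order by score RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as b1: the frame spans the whole partition, so b1 equals a1 and a4. This frame is not the default once order by is present (the default is then RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), so it has to be written out.
  2. sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING) as b2: partitions by id, sorts by score in ascending order, and takes the current row's score minus one as the lower bound with no upper bound (effectively infinity). The scores of all rows that fall within that range are summed.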

The computation of b2 is as follows:

id     score  value range [score - 1, ∞)  sum               b2
stu_1  1      [0, ∞)                      1+2+3+4+5+5=20    20
stu_1  2      [1, ∞)                      1+2+3+4+5+5=20    20
stu_1  3      [2, ∞)                      2+3+4+5+5=19      19
stu_1  4      [3, ∞)                      3+4+5+5=17        17
stu_1  5      [4, ∞)                      4+5+5=14          14
stu_1  5      [4, ∞)                      4+5+5=14          14
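The b2 column can be checked with a few lines of Python. This is a minimal sketch of the rule "sum every score that is at least the current score minus one", not Hive output; the variable names are illustrative.

```python
scores = [1, 2, 3, 4, 5, 5]  # stu_1's scores, sorted ascending

# b1: RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING (whole partition)
b1 = [sum(scores)] * len(scores)

# b2: RANGE BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING
# lower bound is current score - 1, there is no upper bound
b2 = [sum(v for v in scores if v >= cur - 1) for cur in scores]

print(b1)  # [20, 20, 20, 20, 20, 20]
print(b2)  # [20, 20, 19, 17, 14, 14]
```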

4. Comparing range between ... and ... with rows between ... and ...

SELECT id, score,
sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) as a,
sum(score) over (PARTITION by id order by score ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as b
from datadev.t_student;

Analysis of the results:

For a (range), the frame is the value interval [current score - 1, current score + 1]. For b (rows), the frame is the previous row, the current row, and the next row.

id     score  range interval  range sum     a (range)  rows sum   b (rows)
stu_1  1      [0, 2]          1+2=3         3          1+2=3      3
stu_1  2      [1, 3]          1+2+3=6       6          1+2+3=6    6
stu_1  3      [2, 4]          2+3+4=9       9          2+3+4=9    9
stu_1  4      [3, 5]          3+4+5+5=17    17         3+4+5=12   12
stu_1  5      [4, 6]          4+5+5=14      14         4+5+5=14   14
stu_1  5      [4, 6]          4+5+5=14      14         5+5=10     10
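To see where the two frames diverge, both columns can be computed side by side and only the differing rows printed. This is a minimal Python sketch of the two rules described above, not Hive output; the variable names are illustrative.

```python
scores = [1, 2, 3, 4, 5, 5]  # stu_1's scores, sorted ascending
n = len(scores)

# a: RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING (value interval)
range_sums = [sum(v for v in scores if cur - 1 <= v <= cur + 1) for cur in scores]

# b: ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING (neighbouring physical rows)
rows_sums = [sum(scores[max(0, i - 1): i + 2]) for i in range(n)]

# (row number, score, a, b) for every row where the two frames disagree
differing = [
    (i + 1, scores[i], range_sums[i], rows_sums[i])
    for i in range(n)
    if range_sums[i] != rows_sums[i]
]
print(differing)  # [(4, 4, 17, 12), (6, 5, 14, 10)]
```

The two frames disagree only next to the tied score 5: range pulls in both tied rows by value, while rows stops at the physical neighbours.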

Reference: Hive Windowing and Analytics Functions

Reposted from blog.csdn.net/qq_37771475/article/details/121774383