Hive ROLLUP, GROUPING SETS, and CUBE Aggregations, and the GROUPING Function

References:

1. GROUPING SETS, CUBE and ROLLUP in Hive

https://blog.csdn.net/weixin_34352449/article/details/92169948

2. The GROUPING function

https://blog.csdn.net/u013053796/article/details/18619315

3. GROUP BY ... WITH ROLLUP: totaling on top of grouped statistics

https://blog.csdn.net/u013677636/article/details/52353812

When working with Hive, we constantly run aggregate statistics. The everyday workhorse is GROUP BY, but we often need the same data summarized along several dimensions at once, which is where Hive's multi-dimensional aggregation clauses come in. This article explains the meaning and usage of ROLLUP, GROUPING SETS, and CUBE, illustrating the rules with a worked example.

Background: we have salary records for employees across several companies and departments, and we need salary totals along several dimensions.

First, let's build the base tables:

use data_warehouse_test;


CREATE TABLE IF NOT EXISTS datacube_salary_org (
 company_name STRING COMMENT '公司名称'
 ,dep_name STRING COMMENT '部门名称'
 ,user_id BIGINT COMMENT '用户id'
 ,user_name STRING COMMENT '用户姓名'
 ,salary DECIMAL(10,2) COMMENT '薪水'
 ,create_time DATE COMMENT '创建时间'
 ,update_time DATE COMMENT '修改时间'
) 
PARTITIONED BY(
 pt STRING COMMENT '数据分区'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;


CREATE TABLE IF NOT EXISTS datacube_salary_basic_aggr(
 company_name STRING COMMENT '公司名称'
 ,dep_name STRING COMMENT '部门名称'
 ,user_id BIGINT COMMENT '用户id'
 ,salary DECIMAL(10,2) COMMENT '薪水'
)
STORED AS ORC
;


CREATE TABLE IF NOT EXISTS datacube_salary_dep_aggr(
 company_name STRING COMMENT '公司名称'
 ,dep_name STRING COMMENT '部门名称'
 ,total_salary DECIMAL(10,2) COMMENT '薪水'
)
STORED AS ORC
;


CREATE TABLE IF NOT EXISTS datacube_salary_company_aggr(
 company_name STRING COMMENT '公司名称'
 ,total_salary DECIMAL(10,2) COMMENT '薪水'
)
STORED AS ORC
;


CREATE TABLE IF NOT EXISTS datacube_salary_total_aggr(
 total_salary DECIMAL(10,2) COMMENT '薪水'
)
STORED AS ORC
;

Next, we load the base data in two ways:

1. Inserting directly with SQL

2. Loading from a file

1. Inserting with SQL

INSERT OVERWRITE TABLE datacube_salary_org PARTITION (pt = '20200407') VALUES
 ('s.zh','enginer',1,'szh',28000.0,'2020-04-07','2020-04-07'),
 ('s.zh','enginer',2,'zyq',26000.0,'2020-04-03','2020-04-03'),
 ('s.zh','tester',3,'gkm',20000.0,'2020-04-07','2020-04-07'),
 ('x.qx','finance',4,'pip',13400.0,'2020-04-07','2020-04-07'),
 ('x.qx','finance',5,'kip',24500.0,'2020-04-07','2020-04-07'),
 ('x.qx','finance',6,'zxxc',13000.0,'2020-04-07','2020-04-07'),
 ('x.qx','kiccp',7,'xsz',8600.0,'2020-04-07','2020-04-07')
;

2. Loading from a file

Create a txt file with the following content:

s.zh,engineer,1,szh,28000.0,2020-04-07,2020-04-07
s.zh,engineer,2,zyq,26000.0,2020-04-03,2020-04-03
s.zh,tester,3,gkm,20000.0,2020-04-07,2020-04-07
x.qx,finance,4,pip,13400.0,2020-04-07,2020-04-07
x.qx,finance,5,kip,24500.0,2020-04-07,2020-04-07
x.qx,finance,6,zxxc,13000.0,2020-04-07,2020-04-07
x.qx,kiccp,7,xsz,8600.0,2020-04-07,2020-04-07

We can load this file in either of two ways.

Method 1:

 Create the partition directory

 Upload the file

 Repair the table

hdfs dfs -mkdir /user/hive/warehouse/data_warehouse_test.db/datacube_salary_org/pt=20200406
hdfs dfs -put org_data.txt /user/hive/warehouse/data_warehouse_test.db/datacube_salary_org/pt=20200406

Then run the following statement in the beeline / hive client:

MSCK REPAIR TABLE datacube_salary_org;
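
To confirm the repair registered the new directory, you can list the table's partitions (an optional check, not part of the original walkthrough):

SHOW PARTITIONS datacube_salary_org;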

Method 2:

 Create the partition

 LOAD the file

The standard LOAD syntax is:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]  

  

Create the partition:

ALTER TABLE datacube_salary_org ADD PARTITION (pt = '20200405');

Load the data:

LOAD DATA LOCAL INPATH '/opt/hive/my_script/data_warehouse_test/rollup_table/org_data.txt'  OVERWRITE INTO TABLE datacube_salary_org PARTITION (pt = '20200405');
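
Either way, a quick count per partition confirms the data landed (a minimal sanity check, not part of the original walkthrough):

SELECT
 pt
 ,COUNT(*) AS cnt
FROM datacube_salary_org
GROUP BY
 pt
;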

With the data in place, let's look at how these clauses are used.

 

ROLLUP, GROUPING SETS, CUBE

We will go through GROUPING SETS, ROLLUP, and CUBE one by one.

GROUPING SETS

GROUPING SETS is a clause of GROUP BY that lets you specify multiple grouping combinations in a single statement. It can be read as several GROUP BY queries whose results are combined with UNION ALL; the examples below make this concrete.

The general form is:

SELECT
 a
 ,b
 ...
 ,f
FROM test_table
GROUP BY
 a
 ,b
 ...
 ,f
GROUPING SETS ((?,...,?),(xxx),(yyy))

GROUPING SETS can list any number of grouping combinations, where each combination (?,...,?) is any subset of the GROUP BY columns a through f, without repeats.

A concrete example:

SELECT 
 a
 ,b
 ,SUM(c)
FROM test_table
GROUP BY
 a
 ,b
GROUPING SETS ((a),(a,b),())

is equivalent to:

SELECT 
 a
 ,NULL
 ,SUM(c)
FROM test_table
GROUP BY
 a

UNION ALL

SELECT
 a
 ,b
 ,SUM(c)
FROM test_table
GROUP BY
 a
 ,b

UNION ALL

SELECT
 NULL
 ,NULL
 ,SUM(c)
FROM test_table
;

A practical case: we want to total salaries by company, and overall, in a single statement. How do we write that SQL?

Like this:

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
GROUPING SETS ((company_name), ())
ORDER BY
 grouping__id
;

The output:

INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.13 sec   HDFS Read: 11666 HDFS Write: 175 SUCCESS
INFO  : Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.73 sec   HDFS Read: 7060 HDFS Write: 188 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 860 msec
INFO  : Completed executing command(queryId=hive_20200408032038_18e04047-b8c0-4d07-a5de-00ccbc7cb4cc); Time taken: 51.459 seconds
INFO  : OK
+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 0             | NULL          | NULL      | NULL     | 133500.00     |
| 1             | x.qx          | NULL      | NULL     | 59500.00      |
| 1             | s.zh          | NULL      | NULL     | 74000.00      |
+---------------+---------------+-----------+----------+---------------+

grouping__id 0 is the overall salary total, while grouping__id 1 gives the per-company totals.

How GROUPING__ID is computed, along with the GROUPING function, is covered in another of my articles; a brief explanation also appears later in this one.

We said above that GROUPING SETS is equivalent to GROUP BY + UNION ALL. But are the two really the same? Let's verify with EXPLAIN. Below are the EXPLAIN results for two SQLs that produce identical output.

SQL1

EXPLAIN

SELECT 
*
FROM
(
SELECT
 0 AS mark
 ,NULL
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'

UNION ALL

SELECT
 1 AS mark
 ,company_name
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
) tmp
ORDER BY mark
;

SQL2

EXPLAIN

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
GROUPING SETS ((company_name), ())
ORDER BY
 grouping__id
;

Here are the two EXPLAIN results.

The UNION ALL SQL:

Note: every additional branch in a UNION ALL adds more jobs.

INFO  : Starting task [Stage-4:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200408035719_94eb75b2-c6fc-4804-bacb-d3555c61e7f3); Time taken: 0.016 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1, Stage-3      |
|   Stage-3 is a root stage                          |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             filterExpr: (pt = '20200407') (type: boolean) |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: salary (type: decimal(10,2)) |
|               outputColumnNames: salary            |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: sum(salary)          |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   sort order:                      |
|                   Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col0 (type: decimal(20,2)) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: sum(VALUE._col0)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: 0 (type: int), null (type: string), _col0 (type: decimal(20,2)) |
|             outputColumnNames: _col0, _col1, _col4 |
|             Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Union                                  |
|               Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: _col0 (type: int), _col1 (type: string), _col4 (type: decimal(20,2)) |
|                 outputColumnNames: _col0, _col1, _col4 |
|                 Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: int) |
|                   sort order: +                    |
|                   Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: string), _col4 (type: decimal(20,2)) |
|           TableScan                                |
|             Union                                  |
|               Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: _col0 (type: int), _col1 (type: string), _col4 (type: decimal(20,2)) |
|                 outputColumnNames: _col0, _col1, _col4 |
|                 Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: int) |
|                   sort order: +                    |
|                   Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: string), _col4 (type: decimal(20,2)) |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), null (type: void), null (type: void), VALUE._col3 (type: decimal(20,2)) |
|           outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|           Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 4 Data size: 257 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-3                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             filterExpr: (pt = '20200407') (type: boolean) |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: company_name (type: string), salary (type: decimal(10,2)) |
|               outputColumnNames: company_name, salary |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: sum(salary)          |
|                 keys: company_name (type: string)  |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: string) |
|                   Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: decimal(20,2)) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: sum(VALUE._col0)           |
|           keys: KEY._col0 (type: string)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: 1 (type: int), _col0 (type: string), _col1 (type: decimal(20,2)) |
|             outputColumnNames: _col0, _col1, _col4 |
|             Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

The GROUPING SETS SQL:

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200408035951_8153b57c-9f60-4c2d-bc26-92659fdc8afd); Time taken: 0.007 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1               |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             filterExpr: (pt = '20200407') (type: boolean) |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), salary (type: decimal(10,2)) |
|               outputColumnNames: company_name, dep_name, user_id, salary |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: sum(salary)          |
|                 keys: company_name (type: string), dep_name (type: string), user_id (type: bigint), 0 (type: int) |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|                 Statistics: Num rows: 14 Data size: 680 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: int) |
|                   sort order: ++++                 |
|                   Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: int) |
|                   Statistics: Num rows: 14 Data size: 680 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col4 (type: decimal(20,2)) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: sum(VALUE._col0)           |
|           keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint), KEY._col3 (type: int) |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|           Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: _col3 (type: int), _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col4 (type: decimal(20,2)) |
|             outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Reduce Output Operator                 |
|               key expressions: _col0 (type: int)   |
|               sort order: +                        |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: bigint), _col4 (type: decimal(20,2)) |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: bigint), VALUE._col3 (type: decimal(20,2)) |
|           outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|           Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

Comparing the two plans, the GROUPING SETS version is split into fewer jobs, and in practice it also runs faster than the UNION ALL of GROUP BYs. Here are the execution times side by side.

GROUPING SETS:

INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.62 sec   HDFS Read: 11666 HDFS Write: 175 SUCCESS
INFO  : Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.51 sec   HDFS Read: 7060 HDFS Write: 188 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 130 msec
INFO  : Completed executing command(queryId=hive_20200408045412_4ab9e09f-436e-4433-9a1f-a03d5b32ef3e); Time taken: 49.676 seconds
INFO  : OK
+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 0             | NULL          | NULL      | NULL     | 133500.00     |
| 1             | x.qx          | NULL      | NULL     | 59500.00      |
| 1             | s.zh          | NULL      | NULL     | 74000.00      |
+---------------+---------------+-----------+----------+---------------+

GROUP BY with UNION ALL:

INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.7 sec   HDFS Read: 10541 HDFS Write: 119 SUCCESS
INFO  : Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 4.34 sec   HDFS Read: 10919 HDFS Write: 152 SUCCESS
INFO  : Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 5.08 sec   HDFS Read: 12932 HDFS Write: 188 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 13 seconds 120 msec
INFO  : Completed executing command(queryId=hive_20200408045141_2033cbf6-a457-4bdb-aaec-65900b386972); Time taken: 84.365 seconds
INFO  : OK
+-----------+----------+----------+----------+-------------------+
| tmp.mark  | tmp._c1  | tmp._c2  | tmp._c3  | tmp.total_salary  |
+-----------+----------+----------+----------+-------------------+
| 0         | NULL     | NULL     | NULL     | 133500.00         |
| 1         | x.qx     | NULL     | NULL     | 59500.00          |
| 1         | s.zh     | NULL     | NULL     | 74000.00          |
+-----------+----------+----------+----------+-------------------+

GROUPING SETS is clearly the more efficient form.

ROLLUP

ROLLUP produces hierarchical aggregates, dropping GROUP BY columns one at a time from right to left: first a group for all N columns, then the leftmost N-1, and so on down to the grand total.

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
 WITH ROLLUP
ORDER BY
 grouping__id
;

is equivalent to:

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
GROUPING SETS ((company_name, dep_name, user_id), (company_name, dep_name), (company_name),())
ORDER BY
 grouping__id
;

which in turn is equivalent to:

SELECT 
 company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 

UNION ALL

SELECT 
 company_name
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name

UNION ALL

SELECT 
 company_name
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name

UNION ALL

SELECT 
 NULL
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
;

The result:

INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.21 sec   HDFS Read: 11674 HDFS Write: 563 SUCCESS
INFO  : Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.91 sec   HDFS Read: 7448 HDFS Write: 602 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 8 seconds 120 msec
INFO  : Completed executing command(queryId=hive_20200408052638_740f42b9-6f08-49a6-8123-9a77aedc6b19); Time taken: 50.563 seconds
INFO  : OK
+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 0             | NULL          | NULL      | NULL     | 133500.00     |
| 1             | s.zh          | NULL      | NULL     | 74000.00      |
| 1             | x.qx          | NULL      | NULL     | 59500.00      |
| 3             | s.zh          | tester    | NULL     | 20000.00      |
| 3             | x.qx          | kiccp     | NULL     | 8600.00       |
| 3             | x.qx          | finance   | NULL     | 50900.00      |
| 3             | s.zh          | enginer   | NULL     | 54000.00      |
| 7             | x.qx          | kiccp     | 7        | 8600.00       |
| 7             | x.qx          | finance   | 6        | 13000.00      |
| 7             | x.qx          | finance   | 5        | 24500.00      |
| 7             | x.qx          | finance   | 4        | 13400.00      |
| 7             | s.zh          | enginer   | 2        | 26000.00      |
| 7             | s.zh          | enginer   | 1        | 28000.00      |
| 7             | s.zh          | tester    | 3        | 20000.00      |
+---------------+---------------+-----------+----------+---------------+

Let's briefly go over how GROUPING__ID is computed, using this query:

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
 WITH ROLLUP
ORDER BY
 grouping__id
;
 

GROUP BY has three fields, in order: company_name, dep_name, user_id.

Treat them as a 3-bit binary number, (0/1)(0/1)(0/1), where the low bit corresponds to company_name and the high bit to user_id.

For each output row, a field's bit is 1 when the row is grouped by that field, and 0 when the field has been aggregated away; GROUPING__ID is the decimal value of the resulting binary number. (In Hive 2.3.0 and later this convention is inverted to match the SQL standard, where GROUPING(col) returns 0 for a grouped column, so newer versions will report different IDs; the output in this article follows the older convention.)
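
Applying the rule to the four rollup levels of this query reproduces the IDs in the output above:

(company_name, dep_name, user_id) -> 111 -> GROUPING__ID 7
(company_name, dep_name)          -> 011 -> GROUPING__ID 3
(company_name)                    -> 001 -> GROUPING__ID 1
()                                -> 000 -> GROUPING__ID 0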

1) The full-grouping subset of the ROLLUP

SELECT 
 company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 

The corresponding rows:

+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 7             | x.qx          | kiccp     | 7        | 8600.00       |
| 7             | x.qx          | finance   | 6        | 13000.00      |
| 7             | x.qx          | finance   | 5        | 24500.00      |
| 7             | x.qx          | finance   | 4        | 13400.00      |
| 7             | s.zh          | enginer   | 2        | 26000.00      |
| 7             | s.zh          | enginer   | 1        | 28000.00      |
| 7             | s.zh          | tester    | 3        | 20000.00      |
+---------------+---------------+-----------+----------+---------------+

Because all three fields are in the GROUP BY for these rows, the bits are 111, so GROUPING__ID is 7.

2) The (company_name, dep_name) subset of the ROLLUP

SELECT 
 company_name
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name

The corresponding rows:

+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 3             | s.zh          | tester    | NULL     | 20000.00      |
| 3             | x.qx          | kiccp     | NULL     | 8600.00       |
| 3             | x.qx          | finance   | NULL     | 50900.00      |
| 3             | s.zh          | enginer   | NULL     | 54000.00      |
+---------------+---------------+-----------+----------+---------------+

Here the rows are grouped by company_name and dep_name only. Applying the rule above (low bit = company_name, high bit = user_id), the bits are 011, so GROUPING__ID is 3.
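
To eyeball the bit pattern directly, you can render GROUPING__ID in binary with Hive's built-in bin() function (an optional check, not part of the original walkthrough; the CAST matches bin()'s BIGINT argument):

SELECT
 grouping__id
 ,bin(CAST(grouping__id AS BIGINT)) AS gid_bits
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id
 WITH ROLLUP
ORDER BY
 grouping__id
;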

With GROUPING__ID clear, let's move on to CUBE.

CUBE 

CUBE (as in "data cube") aggregates over every combination of the chosen dimensions. cube(a, b, c) first groups by (a, b, c), then by (a, b), (a, c), (a), (b, c), (b), (c), and finally over the whole table; for n columns it computes all 2^n grouping sets.

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 
 WITH CUBE
ORDER BY
 grouping__id
;
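
As with ROLLUP, the cube can be written as an explicit GROUPING SETS list; for three columns that is all 2^3 = 8 combinations, and the following should produce the same result:

SELECT
 grouping__id
 ,company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY
 company_name
 ,dep_name
 ,user_id
GROUPING SETS ((company_name, dep_name, user_id), (company_name, dep_name), (company_name, user_id), (company_name), (dep_name, user_id), (dep_name), (user_id), ())
ORDER BY
 grouping__id
;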


 

It is also equivalent to the following UNION ALL, one GROUP BY per combination, starting with (company_name, dep_name, user_id):

SELECT
 company_name
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name
 ,user_id 

UNION ALL 

SELECT
 company_name
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,dep_name

UNION ALL 

SELECT
 company_name
 ,NULL
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name
 ,user_id
 

UNION ALL 

SELECT
 company_name
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 company_name

UNION ALL 

SELECT
 NULL
 ,dep_name
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 dep_name
 ,user_id

 
UNION ALL 

SELECT
 NULL
 ,dep_name
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 dep_name

 
UNION ALL 

SELECT
 NULL
 ,NULL
 ,user_id
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY 
 user_id

UNION ALL 

SELECT
 NULL
 ,NULL
 ,NULL
 ,SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'

 
;

The result:

+---------------+---------------+-----------+----------+---------------+
| grouping__id  | company_name  | dep_name  | user_id  | total_salary  |
+---------------+---------------+-----------+----------+---------------+
| 0             | NULL          | NULL      | NULL     | 133500.00     |
| 1             | s.zh          | NULL      | NULL     | 74000.00      |
| 1             | x.qx          | NULL      | NULL     | 59500.00      |
| 2             | NULL          | finance   | NULL     | 50900.00      |
| 2             | NULL          | kiccp     | NULL     | 8600.00       |
| 2             | NULL          | tester    | NULL     | 20000.00      |
| 2             | NULL          | enginer   | NULL     | 54000.00      |
| 3             | s.zh          | tester    | NULL     | 20000.00      |
| 3             | s.zh          | enginer   | NULL     | 54000.00      |
| 3             | x.qx          | kiccp     | NULL     | 8600.00       |
| 3             | x.qx          | finance   | NULL     | 50900.00      |
| 4             | NULL          | NULL      | 7        | 8600.00       |
| 4             | NULL          | NULL      | 5        | 24500.00      |
| 4             | NULL          | NULL      | 4        | 13400.00      |
| 4             | NULL          | NULL      | 3        | 20000.00      |
| 4             | NULL          | NULL      | 2        | 26000.00      |
| 4             | NULL          | NULL      | 1        | 28000.00      |
| 4             | NULL          | NULL      | 6        | 13000.00      |
| 5             | s.zh          | NULL      | 2        | 26000.00      |
| 5             | s.zh          | NULL      | 3        | 20000.00      |
| 5             | x.qx          | NULL      | 5        | 24500.00      |
| 5             | x.qx          | NULL      | 6        | 13000.00      |
| 5             | s.zh          | NULL      | 1        | 28000.00      |
| 5             | x.qx          | NULL      | 7        | 8600.00       |
| 5             | x.qx          | NULL      | 4        | 13400.00      |
| 6             | NULL          | enginer   | 1        | 28000.00      |
| 6             | NULL          | finance   | 4        | 13400.00      |
| 6             | NULL          | tester    | 3        | 20000.00      |
| 6             | NULL          | finance   | 5        | 24500.00      |
| 6             | NULL          | kiccp     | 7        | 8600.00       |
| 6             | NULL          | enginer   | 2        | 26000.00      |
| 6             | NULL          | finance   | 6        | 13000.00      |
| 7             | x.qx          | finance   | 5        | 24500.00      |
| 7             | x.qx          | finance   | 4        | 13400.00      |
| 7             | x.qx          | kiccp     | 7        | 8600.00       |
| 7             | s.zh          | tester    | 3        | 20000.00      |
| 7             | s.zh          | enginer   | 2        | 26000.00      |
| 7             | s.zh          | enginer   | 1        | 28000.00      |
| 7             | x.qx          | finance   | 6        | 13000.00      |
+---------------+---------------+-----------+----------+---------------+
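
Finally, the aggregation tables created at the start of this article can all be populated in a single pass over the data. Below is a sketch using Hive's multi-table INSERT, assuming the pre-2.3.0 GROUPING__ID convention used throughout this article (IDs 7 / 3 / 1 / 0 for the four rollup levels):

FROM
(
 SELECT
  grouping__id AS gid
  ,company_name
  ,dep_name
  ,user_id
  ,SUM(salary) AS total_salary
 FROM datacube_salary_org
 WHERE pt = '20200407'
 GROUP BY
  company_name
  ,dep_name
  ,user_id
  WITH ROLLUP
) t
INSERT OVERWRITE TABLE datacube_salary_total_aggr   -- grand total
 SELECT total_salary WHERE gid = 0
INSERT OVERWRITE TABLE datacube_salary_company_aggr -- per company
 SELECT company_name, total_salary WHERE gid = 1
INSERT OVERWRITE TABLE datacube_salary_dep_aggr     -- per department
 SELECT company_name, dep_name, total_salary WHERE gid = 3
INSERT OVERWRITE TABLE datacube_salary_basic_aggr   -- per employee
 SELECT company_name, dep_name, user_id, total_salary WHERE gid = 7
;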


Reposted from blog.csdn.net/u010003835/article/details/105353510