Hive multi-granularity aggregation optimization: the column-to-rows function lateral view

Today I came across a Hive optimization idea for reducing the number of jobs; it is worth borrowing.

The test table and data are as follows:

create table sale_order as 
select '河南省' as province,'洛阳市' as city,'汝阳县' as country, 100 as sales
union all 
select '河南省' as province,'洛阳市' as city,'伊川县' as country, 200 as sales
union all 
select '河南省' as province,'洛阳市' as city,'汝阳县' as country, 300 as sales
union all 
select '河南省' as province,'许昌市' as city,'许昌县' as country, 500 as sales 
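
As a quick sanity check (a hypothetical verification query, not in the original post), the table then holds these four rows:

select * from sale_order;
-- expected rows:
-- 河南省	洛阳市	汝阳县	100
-- 河南省	洛阳市	伊川县	200
-- 河南省	洛阳市	汝阳县	300
-- 河南省	许昌市	许昌县	500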

The goal is to aggregate the sales amount at the province, city, and county granularities (note that the county is stored in a column named country).

Approach 1: the conventional way, aggregating each level separately and combining the results with UNION ALL

-- basic version: 4 jobs
select '1' as id,province,sum(sales) as sales from sale_order group by province
union all 
select '2' as id,city,sum(sales) as sales from sale_order group by city
union all 
select '3' as id,country,sum(sales) as sales from sale_order group by country  

The execution log is shown below:

hive> select '1' as id,province,sum(sales) as sales from sale_order group by province
    > union all 
    > select '2' as id,city,sum(sales) as sales from sale_order group by city
    > union all 
    > select '3' as id,country,sum(sales) as sales from sale_order group by country;
Query ID = root_20210319133916_04d285a4-0e45-46dd-9313-aaef32a8a946
Total jobs = 4
Launching Job 1 out of 4
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1616110053329_0007, Tracking URL = http://hadoop:8088/proxy/application_1616110053329_0007/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1616110053329_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-03-19 13:40:02,368 Stage-1 map = 0%,  reduce = 0%
2021-03-19 13:40:51,683 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 21.16 sec
2021-03-19 13:41:18,388 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 24.51 sec
2021-03-19 13:41:19,676 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 25.79 sec
MapReduce Total cumulative CPU time: 25 seconds 790 msec
Ended Job = job_1616110053329_0007
Launching Job 2 out of 4
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1616110053329_0008, Tracking URL = http://hadoop:8088/proxy/application_1616110053329_0008/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1616110053329_0008
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2021-03-19 13:41:51,501 Stage-3 map = 0%,  reduce = 0%
2021-03-19 13:42:08,663 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 4.0 sec
2021-03-19 13:42:23,783 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 6.46 sec
MapReduce Total cumulative CPU time: 6 seconds 460 msec
Ended Job = job_1616110053329_0008
Launching Job 3 out of 4
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1616110053329_0009, Tracking URL = http://hadoop:8088/proxy/application_1616110053329_0009/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1616110053329_0009
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 1
2021-03-19 13:42:50,714 Stage-4 map = 0%,  reduce = 0%
2021-03-19 13:43:07,800 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 2.69 sec
2021-03-19 13:43:22,628 Stage-4 map = 100%,  reduce = 100%, Cumulative CPU 5.25 sec
MapReduce Total cumulative CPU time: 5 seconds 250 msec
Ended Job = job_1616110053329_0009
Launching Job 4 out of 4
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1616110053329_0010, Tracking URL = http://hadoop:8088/proxy/application_1616110053329_0010/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1616110053329_0010
Hadoop job information for Stage-2: number of mappers: 3; number of reducers: 0
2021-03-19 13:43:50,093 Stage-2 map = 0%,  reduce = 0%
2021-03-19 13:44:51,565 Stage-2 map = 0%,  reduce = 0%
2021-03-19 13:45:07,140 Stage-2 map = 67%,  reduce = 0%, Cumulative CPU 18.43 sec
2021-03-19 13:45:10,262 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 19.89 sec
MapReduce Total cumulative CPU time: 19 seconds 890 msec
Ended Job = job_1616110053329_0010
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 25.79 sec   HDFS Read: 7128 HDFS Write: 128 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 6.46 sec   HDFS Read: 7106 HDFS Write: 160 SUCCESS
Stage-Stage-4: Map: 1  Reduce: 1   Cumulative CPU: 5.25 sec   HDFS Read: 7102 HDFS Write: 191 SUCCESS
Stage-Stage-2: Map: 3   Cumulative CPU: 19.89 sec   HDFS Read: 9650 HDFS Write: 151 SUCCESS
Total MapReduce CPU Time Spent: 57 seconds 390 msec
OK
3	伊川县	200
3	汝阳县	400
3	许昌县	500
2	洛阳市	600
2	许昌市	500
1	河南省	1100
Time taken: 356.641 seconds, Fetched: 6 row(s)

As the log shows, this launches 4 jobs, one GROUP BY job per granularity plus a final job (Stage-2, with 3 mappers) to merge the three UNION ALL branches, and takes 356.641 seconds in total.

Approach 2: use the column-to-rows function lateral view to pivot the columns to be aggregated into a single column, then aggregate once

-- version 2: 1 job
select type, code, sum(sales) as sales
from (
  select type, code, sales
  from sale_order
  lateral view explode(str_to_map(concat('1=', province, '&2=', city, '&3=', country), '&', '=')) adtable as type, code
) df
group by type, code
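
To see why this works, here is a minimal sketch (the WHERE filter is a hypothetical addition to isolate one source row; it is not in the original query):

-- For the row (河南省, 洛阳市, 汝阳县, 100), concat(...) yields the string
-- '1=河南省&2=洛阳市&3=汝阳县'; str_to_map(..., '&', '=') parses it into a map
-- with keys 1/2/3; and explode(map) emits one (key, value) row per entry,
-- so each source row fans out into three rows, one per granularity.
select type, code, sales
from sale_order
lateral view explode(str_to_map(concat('1=', province, '&2=', city, '&3=', country), '&', '=')) adtable as type, code
where sales = 100;
-- expected rows:
-- 1	河南省	100
-- 2	洛阳市	100
-- 3	汝阳县	100

A single GROUP BY over (type, code) then aggregates all three granularities in one MapReduce job.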

The execution log is shown below:

hive> select type,code,sum(sales) from (select type,code,sales from sale_order lateral view explode(str_to_map(concat('1=',province,'&2=',city,'&3=',country),'&','=')) adtable as type,code ) df group by type,code;
Query ID = root_20210319134854_767b4eea-86f5-4763-88f8-a197dcbb4fa8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1616110053329_0011, Tracking URL = http://hadoop:8088/proxy/application_1616110053329_0011/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1616110053329_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-03-19 13:49:18,665 Stage-1 map = 0%,  reduce = 0%
2021-03-19 13:49:43,007 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.93 sec
2021-03-19 13:50:04,839 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 9.49 sec
2021-03-19 13:50:07,697 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 11.07 sec
MapReduce Total cumulative CPU time: 11 seconds 70 msec
Ended Job = job_1616110053329_0011
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 11.07 sec   HDFS Read: 11114 HDFS Write: 151 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 70 msec
OK
1	河南省	1100
2	洛阳市	600
2	许昌市	500
3	伊川县	200
3	汝阳县	400
3	许昌县	500
Time taken: 75.882 seconds, Fetched: 6 row(s)

As the log shows, this is a single job taking 75.882 seconds in total, less than a quarter of the UNION ALL version's 356.641 seconds.
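
As a side note, not from the original post: on Hive 0.10+ the built-in GROUPING SETS clause can compute the same three granularities in a single job as well, producing NULLs in the non-grouped columns instead of an id column:

select province, city, country, sum(sales) as sales
from sale_order
group by province, city, country
grouping sets ((province), (city), (country));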

If you spot any problems, corrections are welcome.

Reposted from blog.csdn.net/zhaoxiangchong/article/details/115012963