Hive (14): Sorting Clauses in SQL

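1. order by: global sort, ascending by default (numeric columns sort by value, string columns in lexicographic order)

(1) Characteristics

Only one reducer is started to produce the final ordering, so performance suffers badly on large data sets. Even if multiple reducers are configured, the output is still a single file.

(2) Example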

set mapreduce.job.reduces=2;

insert overwrite local directory '/opt/datas/emp_order' row format delimited fields terminated by '\t' select * from emp order by sal;

Result:

Although the reducer count was set to 2, listing /opt/datas/emp_order shows only one file, 000000_0. Because order by sorts globally, set mapreduce.job.reduces=2; has no effect.

7369    SMITH   CLERK   7902    1980-12-17      800.0   \N      20
7900    JAMES   CLERK   7698    1981-12-3       950.0   \N      30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  \N      20
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7934    MILLER  CLERK   7782    1982-1-23       1300.0  \N      10
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  \N      10
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  \N      30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  \N      20
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  \N      20
7902    FORD    ANALYST 7566    1981-12-3       3000.0  \N      20
7839    KING    PRESIDENT       \N      1981-11-17      5000.0  \N      10

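2. sort by: local sort

(1) Characteristics

Rows are sorted within each reducer only. After the rows are split across the reducers, each reducer sorts its own share, so every output file is ordered on its own but the files taken together are not globally ordered.

(2) Example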

set mapreduce.job.reduces=2;
insert overwrite local directory '/opt/datas/emp_sort' row format delimited fields terminated by '\t' select * from emp sort by sal;

Result: the /opt/datas/emp_sort directory contains two files
000000_0  000001_0

[root@bigdata emp_sort]# cat 000000_0 
7369    SMITH   CLERK   7902    1980-12-17      800.0   \N      20
7900    JAMES   CLERK   7698    1981-12-3       950.0   \N      30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  \N      20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  \N      20
7902    FORD    ANALYST 7566    1981-12-3       3000.0  \N      20
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  \N      20
[root@bigdata emp_sort]# cat 000001_0  
7934    MILLER  CLERK   7782    1982-1-23       1300.0  \N      10
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  \N      10
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  \N      30
7839    KING    PRESIDENT       \N      1981-11-17      5000.0  \N      10
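
To make the difference between local and global order concrete, here is a small Python sketch (not Hive code) that uses the sal column copied from the two files above. It checks that each file is sorted on its own, that simply concatenating them is not sorted, and that a merge of the two sorted files would be needed to recover the order that order by produces. The helper name is_sorted is only for illustration.

```python
import heapq

# sal column (6th field) of each reducer's output file, copied from the listing above
file_0 = [800.0, 950.0, 1100.0, 1250.0, 1250.0, 1500.0, 2975.0, 3000.0, 3000.0]
file_1 = [1300.0, 1600.0, 2450.0, 2850.0, 5000.0]

def is_sorted(values):
    """Return True if values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))

print(is_sorted(file_0))            # each file is checked on its own
print(is_sorted(file_1))
print(is_sorted(file_0 + file_1))   # the two files concatenated

# Merging the locally sorted files gives one globally ordered sequence,
# which is what order by would have produced in a single file.
merged = list(heapq.merge(file_0, file_1))
print(is_sorted(merged))
```

So sort by is cheaper because the reducers work in parallel, but anything that needs a total order still has to merge the files afterwards or use order by.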

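3. distribute by: partition rows by the specified column

(1) Characteristics

Rows are partitioned by the specified column, so rows with the same value go to the same reducer. Other operations, such as sort by, then run within each partition.

(2) Example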

set mapreduce.job.reduces=3;

insert overwrite local directory '/opt/datas/emp_dist' row format delimited fields terminated by '\t' select * from emp distribute by deptno sort by sal;

Result: /opt/datas/emp_dist contains three files, because 3 reducers were set
000000_0  000001_0  000002_0

[root@bigdata emp_dist]# cat 000000_0 
7900    JAMES   CLERK   7698    1981-12-3       950.0   \N      30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  \N      30
[root@bigdata emp_dist]# cat 000001_0  
7934    MILLER  CLERK   7782    1982-1-23       1300.0  \N      10
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  \N      10
7839    KING    PRESIDENT       \N      1981-11-17      5000.0  \N      10
[root@bigdata emp_dist]# cat 000002_0  
7369    SMITH   CLERK   7902    1980-12-17      800.0   \N      20
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  \N      20
7566    JONES   MANAGER 7839    1981-4-2        2975.0  \N      20
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  \N      20
7902    FORD    ANALYST 7566    1981-12-3       3000.0  \N      20
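
Why did deptno 30 land in 000000_0, deptno 10 in 000001_0 and deptno 20 in 000002_0? The Python sketch below models it under a simplifying assumption: an integer key hashes to its own value, and the reducer index is that hash modulo the number of reducers. This is only a model of the default partitioning, not Hive's actual code, and the exact behavior can depend on the Hive version and execution engine. The names reducer_for and distribute_and_sort are made up for this illustration.

```python
from collections import defaultdict

NUM_REDUCERS = 3  # mirrors: set mapreduce.job.reduces=3

# (ename, sal, deptno) taken from the emp rows shown above
emp = [
    ("SMITH", 800.0, 20), ("ALLEN", 1600.0, 30), ("WARD", 1250.0, 30),
    ("JONES", 2975.0, 20), ("MARTIN", 1250.0, 30), ("BLAKE", 2850.0, 30),
    ("CLARK", 2450.0, 10), ("SCOTT", 3000.0, 20), ("KING", 5000.0, 10),
    ("TURNER", 1500.0, 30), ("ADAMS", 1100.0, 20), ("JAMES", 950.0, 30),
    ("FORD", 3000.0, 20), ("MILLER", 1300.0, 10),
]

def reducer_for(deptno, num_reducers=NUM_REDUCERS):
    # Assumption: an int key hashes to itself; reducer = hash mod reducer count
    return (deptno & 0x7FFFFFFF) % num_reducers

def distribute_and_sort(rows):
    """Model of: distribute by deptno sort by sal."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[reducer_for(row[2])].append(row)      # distribute by deptno
    return {r: sorted(b, key=lambda row: row[1])      # sort by sal, per reducer
            for r, b in buckets.items()}

result = distribute_and_sort(emp)
for r in sorted(result):
    print(f"00000{r}_0", [name for name, _, _ in result[r]])
```

Under this model 30 % 3 = 0, 10 % 3 = 1 and 20 % 3 = 2, which agrees with the three files listed above. With a different reducer count, several departments could share one file.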

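4. cluster by

(1) Characteristics

Specifies the partitioning column and also sorts by that same column. cluster by sal is equivalent to distribute by sal sort by sal, where both clauses use the same column. This form is rarely used in practice.

(2) Example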

insert overwrite local directory '/opt/datas/emp_cls' row format delimited fields terminated by '\t' select * from emp cluster by sal;
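
The equivalence between cluster by and distribute by plus sort by on the same column can be sketched in Python as a composition of two steps. This is a model, not Hive code: Python's built-in hash() stands in for Hive's partition hash, the reducer count of 3 is a hypothetical setting (the example above does not set one), and the function names are invented for illustration, so the reducer each row ends up in will not necessarily match a real Hive run.

```python
from collections import defaultdict

def distribute_by(rows, key, num_reducers):
    """Send each row to a reducer chosen from its key; hash() is a stand-in."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(key(row)) % num_reducers].append(row)
    return buckets

def sort_by(buckets, key):
    """Sort the rows inside each reducer independently."""
    return {r: sorted(rows, key=key) for r, rows in buckets.items()}

def cluster_by(rows, key, num_reducers):
    """cluster by k modeled as distribute by k followed by sort by k."""
    return sort_by(distribute_by(rows, key, num_reducers), key)

# (ename, sal) pairs taken from the emp table
emp = [("SMITH", 800.0), ("WARD", 1250.0), ("MARTIN", 1250.0), ("MILLER", 1300.0),
       ("TURNER", 1500.0), ("ALLEN", 1600.0), ("SCOTT", 3000.0), ("FORD", 3000.0)]

def sal(row):
    return row[1]

clustered = cluster_by(emp, sal, 3)
for reducer, rows in sorted(clustered.items()):
    print(reducer, rows)
```

Two properties hold in this model: rows with equal sal (WARD and MARTIN, SCOTT and FORD) always share a reducer, and each reducer's rows are ordered by sal.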

Reposted from blog.csdn.net/u010886217/article/details/83891149