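1. order by: global sort; ascending by default, with string columns compared lexicographically and numeric columns numerically
(1) Characteristics
Only one reducer performs the final sort, so performance suffers badly on large data sets. Even if several reducers are configured, the output is still a single file.
(2) Example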
set mapreduce.job.reduces=2;
insert overwrite local directory '/opt/datas/emp_order' row format delimited fields terminated by '\t' select * from emp order by sal;
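Result:
Although the reducer count was set to 2, the directory /opt/datas/emp_order contains only one file, 000000_0. A global sort has to run on a single reducer, so set mapreduce.job.reduces=2; has no effect. Contents of the file: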
7369 SMITH CLERK 7902 1980-12-17 800.0 \N 20
7900 JAMES CLERK 7698 1981-12-3 950.0 \N 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 \N 20
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7934 MILLER CLERK 7782 1982-1-23 1300.0 \N 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 \N 10
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 \N 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 \N 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 \N 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 \N 20
7839 KING PRESIDENT \N 1981-11-17 5000.0 \N 10
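When only the highest or lowest rows are needed, pairing order by with limit keeps the result small, even though the final sort still runs on one reducer. A minimal sketch, assuming the first two emp columns are named empno and ename (only sal and deptno are named in the statements here):

```sql
-- Three highest salaries; desc overrides the default ascending order.
-- empno and ename are assumed column names for the first two fields.
select empno, ename, sal from emp order by sal desc limit 3;
```

With hive.mapred.mode=strict, some Hive versions reject an order by that has no limit, for this same single-reducer reason.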
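2. sort by: local sort
(1) Characteristics
Sorts the rows within each reducer, so each output file is ordered on its own but the result as a whole is not globally ordered.
(2) Example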
set mapreduce.job.reduces=2;
insert overwrite local directory '/opt/datas/emp_sort' row format delimited fields terminated by '\t' select * from emp sort by sal;
Result: the directory /opt/datas/emp_sort contains two files
000000_0 000001_0
[root@bigdata emp_sort]# cat 000000_0
7369 SMITH CLERK 7902 1980-12-17 800.0 \N 20
7900 JAMES CLERK 7698 1981-12-3 950.0 \N 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 \N 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 \N 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 \N 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 \N 20
[root@bigdata emp_sort]# cat 000001_0
7934 MILLER CLERK 7782 1982-1-23 1300.0 \N 10
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 \N 10
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 \N 30
7839 KING PRESIDENT \N 1981-11-17 5000.0 \N 10
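sort by accepts the same asc/desc keywords as order by, applied within each reducer. A sketch that writes each reducer's rows from highest to lowest salary; the output directory name emp_sort_desc is hypothetical:

```sql
set mapreduce.job.reduces=2;
-- Each output file is ordered by sal descending; the files are not ordered relative to each other.
insert overwrite local directory '/opt/datas/emp_sort_desc' row format delimited fields terminated by '\t' select * from emp sort by sal desc;
```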
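3. distribute by: partition rows by the specified column
(1) Characteristics
Routes rows to reducers by the value of the specified column, so rows with the same value land on the same reducer. Other operations, such as sort by, are then applied within each reducer.
(2) Example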
set mapreduce.job.reduces=3;
insert overwrite local directory '/opt/datas/emp_dist' row format delimited fields terminated by '\t' select * from emp distribute by deptno sort by sal;
Result: /opt/datas/emp_dist contains three files, because 3 reducers were set
000000_0 000001_0 000002_0
[root@bigdata emp_dist]# cat 000000_0
7900 JAMES CLERK 7698 1981-12-3 950.0 \N 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 \N 30
[root@bigdata emp_dist]# cat 000001_0
7934 MILLER CLERK 7782 1982-1-23 1300.0 \N 10
7782 CLARK MANAGER 7839 1981-6-9 2450.0 \N 10
7839 KING PRESIDENT \N 1981-11-17 5000.0 \N 10
[root@bigdata emp_dist]# cat 000002_0
7369 SMITH CLERK 7902 1980-12-17 800.0 \N 20
7876 ADAMS CLERK 7788 1987-5-23 1100.0 \N 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 \N 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 \N 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 \N 20
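The file layout above is consistent with rows being routed by the hash of deptno modulo the reducer count: 30 mod 3 = 0, 10 mod 3 = 1, 20 mod 3 = 2. A rough way to preview the assignment, assuming the built-in hash() UDF matches the hash the shuffle uses for an int column (this can differ between Hive versions and execution engines):

```sql
-- pmod keeps the result non-negative; 3 is the reducer count set above.
-- reducer_guess is an illustrative alias, not something Hive reports itself.
select deptno, pmod(hash(deptno), 3) as reducer_guess from emp group by deptno;
```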
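4. cluster by
(1) Characteristics
Partitions by the specified column and also sorts by that same column. cluster by sal is equivalent to distribute by sal sort by sal, where both clauses use the same column. It is rarely used in practice.
(2) Example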
insert overwrite local directory '/opt/datas/emp_cls' row format delimited fields terminated by '\t' select * from emp cluster by sal;
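Spelled out with the two separate clauses, the statement above would read as follows. This is a sketch of the distribute by / sort by form; the directory emp_cls2 is a hypothetical name chosen so the original output is not overwritten:

```sql
-- Same partitioning and per-reducer ordering as cluster by sal.
insert overwrite local directory '/opt/datas/emp_cls2' row format delimited fields terminated by '\t' select * from emp distribute by sal sort by sal;
```

The longer form is the one to reach for when the partition column and the sort column differ, as in distribute by deptno sort by sal.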