2020.9.18 class notes (Hive data sorting: ORDER BY, SORT BY, DISTRIBUTE BY, CLUSTER BY)

Hive Data Sorting

Order, Sort, Cluster, and Distribute By

This describes the syntax of SELECT clauses ORDER BY, SORT BY, CLUSTER BY, and DISTRIBUTE BY. See Select Syntax for general information.

1. ORDER BY

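Features:
  • ORDER BY can take multiple columns.
  • Each column can be sorted in ascending order with ASC (the default) or in descending order with DESC.
  • ORDER BY in Hive behaves the same as in standard SQL: it sorts the query result globally.
  • CASE WHEN and other expressions are supported.
  • Sorting by column position is supported once enabled:
set hive.groupby.orderby.position.alias=true;
Drawback:

Because the sort is global, all data passes through a single Reducer. When the result set is large, processing it in one Reducer hurts performance badly, so filter the data beforehand.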

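Notes:

Enabling strict mode:

set hive.mapred.mode=strict;
  • With MR strict mode enabled, ORDER BY must be followed by a LIMIT clause; otherwise the query fails with an error.
  • With strict mode enabled (hive.mapred.mode=strict), a query on a partitioned table must include a filter on the partition column.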

Syntax of Order By

The ORDER BY syntax in Hive QL is similar to the syntax of ORDER BY in SQL language.

colOrder: ( ASC | DESC )
colNullOrder: (NULLS FIRST | NULLS LAST)           -- (Note: Available in Hive 2.1.0 and later)
orderBy: ORDER BY colName colOrder? colNullOrder? (',' colName colOrder? colNullOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy

There are some limitations in the “order by” clause. In the strict mode (i.e., hive.mapred.mode=strict), the order by clause has to be followed by a “limit” clause. The limit clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish.

Note that columns are specified by name, not by position number. However in Hive 0.11.0 and later, columns can be specified by position when configured as follows:

  • For Hive 0.11.0 through 2.1.x, set hive.groupby.orderby.position.alias to true (the default is false).
  • For Hive 2.2.0 and later, hive.orderby.position.alias is true by default.

The default sorting order is ascending (ASC).

In Hive 2.1.0 and later, specifying the null sorting order for each of the columns in the “order by” clause is supported. The default null sorting order for ASC order is NULLS FIRST, while the default null sorting order for DESC order is NULLS LAST.
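The default NULL placement described above can be sketched in a few lines of Python. This is only an illustration of the ordering rule, not Hive code: None stands in for NULL, and the helper name order_by is made up for this example.

```python
# Sketch of the default NULL placement, with None standing in for NULL.
# order_by is a hypothetical helper, not part of Hive.
def order_by(values, descending=False):
    """Sort values with NULLS FIRST for ASC and NULLS LAST for DESC."""
    nulls = [v for v in values if v is None]
    non_nulls = sorted((v for v in values if v is not None), reverse=descending)
    # ASC: NULLs come first; DESC: NULLs come last.
    return non_nulls + nulls if descending else nulls + non_nulls

rows = [3, None, 1, 2]
asc_result = order_by(rows)
desc_result = order_by(rows, descending=True)
print(asc_result)   # [None, 1, 2, 3]
print(desc_result)  # [3, 2, 1, None]
```

In Hive itself the placement can be overridden per column with NULLS FIRST or NULLS LAST, as the grammar above shows.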

In Hive 3.0.0 and later, order by without limit in subqueries and views will be removed by the optimizer. To disable it, set hive.remove.orderby.in.subquery to false.

2. SORT BY / DISTRIBUTE BY

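SORT BY and DISTRIBUTE BY in Hive:
Because ORDER BY in Hive performs poorly on large data sets, two further needs arose: partial sorting, and routing rows with the same key to the same partition.
These are addressed by SORT BY and DISTRIBUTE BY, which are introduced in turn below.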

SORT BY
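  • SORT BY is a partial sort: it sorts the data within each Reducer, that is, it performs a local sort, and the rows are sorted before they are fed to the reducer.
  • When the number of Reducers is set to 1, the result is the same as ORDER BY.
  • The sort columns must appear in the SELECT column list.
Note:

With SORT BY you can specify the number of reduce tasks to run:

set mapred.reduce.tasks=<number>

Applying a merge sort to the reducer outputs then yields the fully sorted result.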

-- Set the number of reducers to 3
set mapred.reduce.tasks=3;
-- Sort by score in descending order with SORT BY
select * from score sort by stu_score desc;
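
The merge step mentioned above can be sketched in Python as a k-way merge. The three lists below are made-up reducer outputs for the query above (each already in descending order of score), so treat this as an assumption-laden illustration rather than what Hive does internally.

```python
import heapq

# Hypothetical outputs of 3 reducers for "SORT BY stu_score DESC":
# each list is sorted in descending order, but their ranges overlap.
reducer_outputs = [
    [95, 80, 61],
    [99, 72],
    [88, 85, 60],
]

# heapq.merge does a k-way merge of already sorted inputs;
# reverse=True tells it the inputs are in descending order.
merged = list(heapq.merge(*reducer_outputs, reverse=True))
print(merged)  # [99, 95, 88, 85, 80, 72, 61, 60]
```

Because each input is already sorted, the merge only compares the current head of each list, which is much cheaper than re-sorting all rows in one place.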

Syntax of Sort By

The SORT BY syntax is similar to the syntax of ORDER BY in SQL language.

colOrder: ( ASC | DESC )
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy

Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.
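
The difference between numeric and lexicographical order is easy to see on a small sample. The Python sketch below uses made-up values and is only meant to show the two orderings side by side.

```python
# The same four values, stored as strings (sample data for illustration).
values_as_strings = ["10", "9", "2", "100"]

# String columns sort lexicographically: character by character.
lexicographic = sorted(values_as_strings)

# Numeric columns sort by value.
numeric = sorted(int(v) for v in values_as_strings)

print(lexicographic)  # ['10', '100', '2', '9']
print(numeric)        # [2, 9, 10, 100]
```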

In Hive 3.0.0 and later, sort by without limit in subqueries and views will be removed by the optimizer. To disable it, set hive.remove.orderby.in.subquery to false.

Difference between Sort By and Order By

Hive supports SORT BY which sorts the data per reducer. The difference between “order by” and “sort by” is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, “sort by” may give partially ordered final results.

Note: The difference between SORT BY alone on a single column and CLUSTER BY can be confusing. CLUSTER BY partitions by the field, whereas SORT BY, if there are multiple reducers, partitions randomly in order to distribute data (and load) uniformly across the reducers.

Basically, the data in each reducer will be sorted according to the order that the user specified. The following example shows this:

SELECT key, value FROM src SORT BY key ASC, value DESC

The query had 2 reducers, and the output of each is:

0   5
0   3
3   6
9   1

0   4
0   3
1   1
2   5
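
The output above can be reproduced with a small Python simulation. How the eight rows were split between the two reducers before sorting is not stated in the text, so the unsorted input lists below are an assumption; only the sorted results come from the example.

```python
# Assumed unsorted input of each reducer (the split is hypothetical).
reducer_inputs = [
    [(0, 3), (9, 1), (0, 5), (3, 6)],
    [(2, 5), (0, 3), (1, 1), (0, 4)],
]

def sort_key(row):
    # "key ASC, value DESC": negate the value to reverse its order.
    key, value = row
    return (key, -value)

# SORT BY: each reducer sorts only its own rows.
sort_by_output = [sorted(rows, key=sort_key) for rows in reducer_inputs]

# ORDER BY: one reducer sorts all rows together.
all_rows = [row for rows in reducer_inputs for row in rows]
order_by_output = sorted(all_rows, key=sort_key)

print(sort_by_output[0])  # [(0, 5), (0, 3), (3, 6), (9, 1)]
print(sort_by_output[1])  # [(0, 4), (0, 3), (1, 1), (2, 5)]
print(order_by_output)
```

Concatenating the two SORT BY outputs does not give the ORDER BY result: key 0 appears again after key 9.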

Setting Types for Sort By

After a transform, variable types are generally considered to be strings, meaning that numeric data will be sorted lexicographically. To overcome this, a second SELECT statement with casts can be used before using SORT BY.

FROM (FROM (FROM src
            SELECT TRANSFORM(value)
            USING 'mapper'
            AS value, count) mapped
      SELECT cast(value as double) AS value, cast(count as int) AS count
      SORT BY value, count) sorted
SELECT TRANSFORM(value, count)
USING 'reducer'
AS whatever

DISTRIBUTE BY

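DISTRIBUTE BY controls how rows are split on the map side and handed to the reduce side. Hive distributes rows according to the columns that follow DISTRIBUTE BY and the number of reducers, using a hash algorithm by default. SORT BY produces one sorted file per reducer. DISTRIBUTE BY is usually used together with SORT BY.

  • Similar to GROUP BY in standard SQL, in that rows with the same key are grouped together (here, on the same reducer), but without aggregation
  • Rows are distributed according to the given columns and the number of reducers
  • A hash algorithm is used by default
  • The hash code of the distribution column is taken modulo the number of reducers
  • Usually written before the SORT BY clause

-- DISTRIBUTE BY used together with SORT BY: group rows by student, then sort by subject score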
select * from score distribute by stu_id sort by stu_id asc,stu_score desc;

DISTRIBUTE BY controls how the map output is divided among the reducers.
Using DISTRIBUTE BY guarantees that records with the same key are sent to the same Reducer.

-- ASC (ascending) is the default; DESC sorts in descending order
SELECT department_id , name, employee_id, evaluation_score
FROM employee_hr 
DISTRIBUTE BY department_id SORT BY evaluation_score DESC;
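
The "hash modulo number of reducers" rule followed by a per-reducer sort can be sketched in Python. The employee rows are made up, and department_id % num_reducers is used as a stand-in for Hive's hash function, so this is an illustration under those assumptions, not Hive's actual partitioner.

```python
from collections import defaultdict

def distribute_and_sort(rows, num_reducers):
    """Simulate DISTRIBUTE BY department_id SORT BY evaluation_score DESC."""
    reducers = defaultdict(list)
    for department_id, name, score in rows:
        # Assumed partitioner: hash(key) mod number of reducers,
        # with the integer key itself standing in for the hash.
        reducers[department_id % num_reducers].append((department_id, name, score))
    # Each reducer sorts only its own rows by score, descending.
    return {
        r: sorted(assigned, key=lambda row: row[2], reverse=True)
        for r, assigned in sorted(reducers.items())
    }

# Hypothetical sample rows: (department_id, name, evaluation_score)
employees = [
    (1, "Ann", 70),
    (2, "Bob", 90),
    (1, "Cid", 85),
    (3, "Dee", 60),
    (2, "Eve", 75),
]

result = distribute_and_sort(employees, num_reducers=2)
for reducer_id, assigned in result.items():
    print(reducer_id, assigned)
```

Departments 1 and 3 land on the same reducer here, and because the sort is on the score only, their rows are interleaved by score rather than grouped by department. Adding department_id as the first sort column would keep each department's rows adjacent.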

3. CLUSTER BY

CLUSTER BY = DISTRIBUTE BY + SORT BY

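  • ASC|DESC is not supported
  • The sort columns must appear in the SELECT column list
  • To use all Reducers for a global sort, apply CLUSTER BY first and then ORDER BY

CLUSTER BY in Hive and its limitation:

If you want to apply both SORT BY and DISTRIBUTE BY to the same column, you can use CLUSTER BY instead.
Note:
The sort is always ascending (the default order); you cannot specify ASC or DESC.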

Syntax of Cluster By and Distribute By

Cluster By and Distribute By are used mainly with the Transform/Map-Reduce Scripts. However, they are sometimes useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries.

Cluster By is a short-cut for both Distribute By and Sort By.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

For example, we are Distributing By x on the following 5 rows to 2 reducers:

x1
x2
x4
x3
x1

Reducer 1 got

x1
x2
x1

Reducer 2 got

x4
x3

Note that all rows with the same key x1 are guaranteed to be distributed to the same reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent positions.

In contrast, if we use Cluster By x, the two reducers will further sort rows on x:

Reducer 1 got

x1
x1
x2

Reducer 2 got

x3
x4
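
The two results above can be reproduced with a short Python sketch. The partition function is hypothetical: it is hard-coded to send x1 and x2 to the first reducer and x3 and x4 to the second, matching the example, whereas Hive would use a hash of the key.

```python
rows = ["x1", "x2", "x4", "x3", "x1"]

def partition(key):
    # Hypothetical partitioner chosen to match the example above.
    return 0 if key in ("x1", "x2") else 1

def distribute_by(rows, num_reducers):
    """Route each row to a reducer; arrival order is preserved."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[partition(row)].append(row)
    return reducers

def cluster_by(rows, num_reducers):
    """DISTRIBUTE BY followed by an ascending sort inside each reducer."""
    return [sorted(assigned) for assigned in distribute_by(rows, num_reducers)]

distributed = distribute_by(rows, 2)
clustered = cluster_by(rows, 2)
print(distributed)  # [['x1', 'x2', 'x1'], ['x4', 'x3']]
print(clustered)    # [['x1', 'x1', 'x2'], ['x3', 'x4']]
```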

Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.

SELECT col1, col2 FROM t1 CLUSTER BY col1
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1

SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC

FROM (
    FROM pv_users
    MAP ( pv_users.userid, pv_users.date )
    USING 'map_script'
    AS c1, c2, c3
    DISTRIBUTE BY c2
    SORT BY c2, c1) map_output
  INSERT OVERWRITE TABLE pv_users_reduced
    REDUCE ( map_output.c1, map_output.c2, map_output.c3 )
    USING 'reduce_script'
    AS date, count;

Reposted from blog.csdn.net/m0_48758256/article/details/108681201