A look at Hive's order by, sort by, distribute by and cluster by

Category: Other | 2018-08-31 22:11:17

Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission. https://blog.csdn.net/qq_40795214/article/details/82190827

Overview:

Broadly speaking, all four of these clauses sort and group data in Hive. What differs is how the MapReduce job behind each of them runs.

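Details:

order by:

order by performs a global sort over all of the input, and it only ever "wakes up" a single reducer to do the work. It is a bit of a muddlehead: no matter how much data arrives, it starts just one reducer to handle it. That is fine for small inputs, but as the data grows order by struggles badly and may even "go on strike".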
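As a rough illustration, assume a weather table, here called `records2`, with `year` and `temperature` columns (the table and its columns are assumptions for the example). A global sort might then be written as below. The `LIMIT` is optional in general; it is shown because older Hive releases running in strict mode (`hive.mapred.mode=strict`) reject an order by that has no limit.

```sql
-- Global sort: every row is funnelled through a single reducer.
-- records2(year, temperature) is an assumed example table.
FROM records2
SELECT year, temperature
ORDER BY year ASC, temperature DESC
LIMIT 100;  -- bounds the output; required in strict mode on older Hive
```

Whatever the table size, the plan for this query ends in one reducer, which is why it slows down so sharply on large inputs.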

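sort by:

sort by is a local sort. Where order by is lazy and muddle-headed, sort by is the opposite: it is hard-working and can even clone itself. Hive starts one or more reducers for the job, depending on the data volume and the reducer settings, and sort by pitches in on every one of them, so each reducer produces its own sorted file. The output as a whole is globally ordered only when there is a single reducer. The benefit is that the sorting runs in parallel, and the pre-sorted files can make a later global sort cheaper.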
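A minimal sketch of a per-reducer sort, again assuming a `records2` table with `year` and `temperature` columns. The reducer count of 3 is arbitrary and only there to make the effect visible; `mapred.reduce.tasks` is the classic property name, and newer Hadoop versions call it `mapreduce.job.reduces`.

```sql
-- Request several reducers so the local nature of the sort shows up.
SET mapred.reduce.tasks=3;

-- Each reducer sorts only the rows it receives, so the result is
-- three sorted files rather than one globally sorted file.
FROM records2
SELECT year, temperature
SORT BY temperature DESC;
```

Because no distribute by is given, which rows land on which reducer is up to Hive, so the same year can appear in more than one output file.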

distribute by:

distribute by is for the cases where we need to control which reducer a particular row goes to. This is usually done to prepare for an aggregation that may follow.

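Here is the most common example, using weather records with a year column and a temperature column: distribute the rows by year, then sort them by year and temperature inside each reducer.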
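A sketch of such a query, assuming the weather table is called `records2` and has `year` and `temperature` columns. The sort directions are an assumption made for illustration.

```sql
-- Route all rows with the same year to the same reducer,
-- then sort each reducer's rows by year and temperature.
FROM records2
SELECT year, temperature
DISTRIBUTE BY year
SORT BY year ASC, temperature DESC;
```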

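In the example above, the weather data is sorted by year and temperature, and we want all rows for the same year to be handled by the same reducer. Note that the result is sorted within each reducer, not globally: inside one reducer's output the rows for a given year sit together, ordered by temperature, but the output files are not ordered relative to one another. Also, when distribute by meets sort by, distribute by has to come first. This is easy to understand: distribute by must first hand the rows out to the reducers before sort by can go to work inside each of them. Otherwise the reducers would have nothing to do and sort by would make the trip for nothing.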

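cluster by:

Hadoop: The Definitive Guide (2nd edition) describes cluster by as a shorthand for specifying both distribute by and sort by when the two use the same columns.

In other words, going back to the weather example above, when both clauses use the year column, the SQL statement is as follows: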

-- cluster by year stands for: distribute by year sort by year
from records2
select year, temperature
cluster by year;
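For comparison, this is the longhand form that the cluster by query above stands for, written against the same assumed weather table (named `records2` here) with `year` and `temperature` columns:

```sql
-- Longhand equivalent of CLUSTER BY year:
-- distribute and sort on the same column.
FROM records2
SELECT year, temperature
DISTRIBUTE BY year
SORT BY year;
```

If the two clauses need different columns, for example distributing by year but sorting by year and temperature, the shorthand no longer applies and both clauses have to be spelled out.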
