Getting Started with Hive (4): Bucketing

For every table or partition, Hive can further organize the data into buckets; a bucket is a finer-grained division of the data than a partition. Bucketing is defined on a column: Hive hashes the column value and takes the result modulo the number of buckets to decide which bucket a record belongs to.
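As a quick aside (not part of the original walkthrough), you can preview which bucket a value would land in using the built-in hash() and pmod() functions, applied here to the t_sz01 source table used later in this post. This mirrors the bucketing formula on the Hive version used in this post; note that Hive 3.x switched new tables to a murmur hash, where this preview would no longer match.

-- preview a 4-way bucket assignment for each id
select id, pmod(hash(id), 4) as bucket from t_sz01;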

There are two reasons to organize a table (or partition) into buckets:

  • More efficient query processing. Buckets give the table extra structure that Hive can exploit for certain queries. In particular, joining two tables that are both bucketed on the same column (which includes the join column) can be implemented efficiently as a map-side join: rows with equal join-key values land in corresponding buckets, so Hive only needs to join bucket against bucket, which can greatly reduce the amount of data the JOIN must process (a sketch follows this list).
  • More efficient sampling. When working with large datasets, it is very convenient, while developing and debugging queries, to be able to try them out on a small slice of the data first (a concrete sampling example appears later in this post, once the bucket files exist).
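To make the join benefit concrete, here is a minimal sketch of a bucket map join. It assumes a hypothetical second table t_buk2 that is bucketed into 4 buckets on id, just like the t_buk table created in step 1 below; hive.optimize.bucketmapjoin and the MAPJOIN hint are standard Hive mechanisms.

-- hedged sketch: map-side join of two tables that are both bucketed on id
set hive.optimize.bucketmapjoin = true;
select /*+ mapjoin(b) */ a.id, a.name, b.name
from t_buk a join t_buk2 b on a.id = b.id;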

1. Create a bucketed table

create table t_buk(id string,name string)
clustered by (id) -- bucket by id
sorted by (id) -- sort each bucket by id
into 4 buckets -- split into 4 buckets
row format delimited fields terminated by ',';
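As a quick sanity check, you can also print the generated DDL, which should echo the clustered by ... sorted by ... into 4 buckets clauses (in addition to desc formatted in the next step):

show create table t_buk;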

2. View the table structure

0: jdbc:hive2://localhost:10000> desc formatted t_buk;

The output (screenshot omitted from the original) includes the bucketing metadata, such as Num Buckets: 4 and Bucket Columns: [id].

3. Load data

0: jdbc:hive2://localhost:10000> load data local inpath '/home/hadoop/data.txt' into table t_buk;

The result is as follows:

INFO : Loading data to table shizhan03.t_buk from file:/home/hadoop/data.txt
INFO : Table shizhan03.t_buk stats: [numFiles=1, totalSize=46]
No rows affected (0.737 seconds)

Note: data loaded this way is not automatically bucketed; load data simply copies the file into the table directory unchanged. To get properly bucketed data, truncate the table and re-insert the rows with an insert ... select, as in the next two steps.

4. Truncate the table

truncate table t_buk;

5. Insert data

-- this statement launches a MapReduce job
insert into table t_buk select id,name from t_sz03;
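-- at this point no bucketing enforcement has been enabled, so this insert
-- runs as a map-only job (see "reduce tasks is set to 0" in the logs below)
-- and writes a single output file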

The result is as follows:

INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_local1688102775_0001
INFO : The url to track the job: http://localhost:8080/
INFO : Job running in-process (local Hadoop)
INFO : 2018-09-06 10:04:05,158 Stage-1 map = 0%, reduce = 0%
INFO : 2018-09-06 10:04:06,164 Stage-1 map = 100%, reduce = 0%
INFO : Ended Job = job_local1688102775_0001
INFO : Stage-4 is selected by condition resolver.
INFO : Stage-3 is filtered out by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Moving data to: hdfs://langzi01:9000/user/hive/warehouse/shizhan03.db/t_buk/.hive-staging_hive_2018-09-06_10-04-02_708_5179772915570508915-1/-ext-10000 from hdfs://langzi01:9000/user/hive/warehouse/shizhan03.db/t_buk/.hive-staging_hive_2018-09-06_10-04-02_708_5179772915570508915-1/-ext-10002
INFO : Loading data to table shizhan03.t_buk from hdfs://langzi01:9000/user/hive/warehouse/shizhan03.db/t_buk/.hive-staging_hive_2018-09-06_10-04-02_708_5179772915570508915-1/-ext-10000
INFO : Table shizhan03.t_buk stats: [numFiles=1, numRows=5, totalSize=46, rawDataSize=41]
No rows affected (3.81 seconds)

Inspecting the result shows that the data is still a single file:
that is, the path /user/hive/warehouse/shizhan03.db/t_buk contains only one file.
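This can be verified directly from beeline with a dfs command (the warehouse path comes from the load logs above):

-- list the files under the table directory
dfs -ls /user/hive/warehouse/shizhan03.db/t_buk;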

How do we fix this?

6. Set the bucketing parameters

-- enable enforced bucketing, so inserts produce one file per bucket:
set hive.enforce.bucketing = true;
Check the setting:
0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing;
+------------------------------+--+
|             set              |
+------------------------------+--+
| hive.enforce.bucketing=true  |
+------------------------------+--+
1 row selected (0.006 seconds)
set mapreduce.job.reduces=4; -- set the number of reduce tasks to match the 4 buckets

0: jdbc:hive2://localhost:10000> set mapreduce.job.reduces;
+--------------------------+--+
|           set            |
+--------------------------+--+
| mapreduce.job.reduces=4  |
+--------------------------+--+
1 row selected (0.007 seconds)

Check how sort by behaves with 4 reducers:

0: jdbc:hive2://localhost:10000> select * from t_sz01 sort by (id);
INFO  : Number of reduce tasks not specified. Defaulting to jobconf value of: 4
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_local1059625551_0002
INFO  : The url to track the job: http://localhost:8080/
INFO  : Job running in-process (local Hadoop)
INFO  : 2018-09-06 10:26:11,257 Stage-1 map = 100%,  reduce = 0%
INFO  : 2018-09-06 10:26:12,271 Stage-1 map = 100%,  reduce = 100%
INFO  : Ended Job = job_local1059625551_0002
+------------+--------------+--+
| t_sz01.id  | t_sz01.name  |
+------------+--------------+--+
| 2          | lisi         |
| 4          | zhaoliu      |
| 1          | zhangsan     |
| 3          | wangwu       |
| 5          | zhouqi       |
+------------+--------------+--+
5 rows selected (2.81 seconds)

The number of reduce tasks is 4. Note that sort by only sorts within each reducer, so with 4 reducers the combined output above is not globally ordered.

7. Insert the data with distribute by and sort by

insert into table t_buk select id,name from t_sz01 distribute by(id) sort by(id);
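-- distribute by(id) routes each row to a reducer by hash(id), the same function
-- used for bucketing; sort by(id) then orders rows within each reducer, so each
-- of the 4 output files is one sorted bucket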

The execution log is as follows:

0: jdbc:hive2://localhost:10000> insert into table t_buk
0: jdbc:hive2://localhost:10000> select id,name from t_sz01 distribute by(id) sort by(id);
INFO : Number of reduce tasks not specified. Defaulting to jobconf value of: 4
INFO : In order to change the average load for a reducer (in bytes):
INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO :   set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO :   set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_local1435934157_0003
INFO : The url to track the job: http://localhost:8080/
INFO : Job running in-process (local Hadoop)
INFO : 2018-09-06 10:42:08,780 Stage-1 map = 100%, reduce = 100%
INFO : Ended Job = job_local1435934157_0003
INFO : Number of reduce tasks determined at compile time: 4
INFO : In order to change the average load for a reducer (in bytes):
INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO :   set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO :   set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_local1436192478_0004
INFO : The url to track the job: http://localhost:8080/
INFO : Job running in-process (local Hadoop)
INFO : 2018-09-06 10:42:10,446 Stage-2 map = 100%, reduce = 0%
INFO : 2018-09-06 10:42:11,458 Stage-2 map = 100%, reduce = 50%
INFO : 2018-09-06 10:42:12,474 Stage-2 map = 100%, reduce = 75%
INFO : 2018-09-06 10:42:13,479 Stage-2 map = 100%, reduce = 100%
INFO : Ended Job = job_local1436192478_0004
INFO : Loading data to table shizhan03.t_buk from hdfs://langzi01:9000/user/hive/warehouse/shizhan03.db/t_buk/.hive-staging_hive_2018-09-06_10-42-07_217_7738831819154647152-1/-ext-10000
INFO : Table shizhan03.t_buk stats: [numFiles=4, numRows=16, totalSize=189, rawDataSize=173]
No rows affected (6.763 seconds)

Browsing the file system shows that the path /user/hive/warehouse/shizhan03.db/t_buk
now contains 4 files, named:
000000_0, 000001_0, 000002_0, 000003_0

View the contents of each bucket file:
0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/shizhan03.db/t_buk/000000_0;

The contents:

+-------------+--+
| DFS Output  |
+-------------+--+
| 15,spring   |
| 4,一元go    |
| 8,丢件      |
| 11,圣诞浏览 |
+-------------+--+
4 rows selected (0.027 seconds)

0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/shizhan03.db/t_buk/000001_0;

+-------------+--+
| DFS Output  |
+-------------+--+
| 1,11.11     |
| 16,browse   |
| 12,圣诞注册 |
| 9,延时      |
| 5,排行榜    |
+-------------+--+
5 rows selected (0.017 seconds)

0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/shizhan03.db/t_buk/000002_0;

+-----------------+--+
|   DFS Output    |
+-----------------+--+
| 6,邀请          |
| 2,更高补贴      |
| 13,圣诞领取按钮 |
+-----------------+--+
3 rows selected (0.019 seconds)

0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/shizhan03.db/t_buk/000003_0;

+--------------------+--+
|     DFS Output     |
+--------------------+--+
| 14,springFestival  |
| 10,剁手            |
| 7,货损             |
| 3,+5               |
+--------------------+--+
4 rows selected (0.022 seconds)
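Now that the table is genuinely bucketed, the sampling benefit mentioned at the top of this post can be tried directly: TABLESAMPLE can read a single bucket file instead of scanning the whole table. A minimal sketch:

-- read only the first of the 4 buckets
select * from t_buk tablesample(bucket 1 out of 4 on id);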

8. Insert the data with cluster by

-- empty the table first
truncate table t_buk;
-- reinsert the data
insert into table t_buk select id,name from t_sz01 cluster by(id);
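-- note: cluster by(id) is shorthand for distribute by(id) sort by(id),
-- so this produces the same 4 sorted bucket files as step 7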

View the bucketed data:

0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/shizhan03.db/t_buk/000000_0;

Question: with the number of reduce tasks set to 4, can a query against a table that is not bucketed still return all of its data? Yes!

0: jdbc:hive2://localhost:10000> select * from t_sz01 order by name;

INFO  : Number of reduce tasks determined at compile time: 1 
-- order by requires a global sort, so Hive forces the number of reducers to 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_local157551508_0007
INFO  : The url to track the job: http://localhost:8080/
INFO  : Job running in-process (local Hadoop)
INFO  : 2018-09-06 10:59:40,673 Stage-1 map = 100%,  reduce = 100%
INFO  : Ended Job = job_local157551508_0007
+------------+-----------------+--+
| t_sz01.id  |   t_sz01.name   |
+------------+-----------------+--+
| 3          | +5              |
| 1          | 11.11           |
| 16         | browse          |
| 15         | spring          |
| 14         | springFestival  |
| 4          | 一元go            |
| 8          | 丢件              |
| 10         | 剁手              |
| 12         | 圣诞注册            |
| 11         | 圣诞浏览            |
| 13         | 圣诞领取按钮          |
| 9          | 延时              |
| 5          | 排行榜             |
| 2          | 更高补贴            |
| 7          | 货损              |
| 6          | 邀请              |
+------------+-----------------+--+
16 rows selected (1.404 seconds)

9. Ways to save the results of a select query:

-- a) save the results into a new table created from the query
create table t_tmp as select * from t_sz01;
-- b) insert the results into an existing table
insert into table t_tmp select * from t_sz01;
-- c) write the results to a directory (local here; it can also be an HDFS path)
insert overwrite local directory '/home/hadoop/test' select * from t_sz01;
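As the comment on variant c notes, the target can also be HDFS: simply drop the LOCAL keyword (the path below is just an example):

-- variant of c: write the query results to an HDFS directory
insert overwrite directory '/tmp/t_sz01_export' select * from t_sz01;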

Done!!!

Reposted from blog.csdn.net/guo20082200/article/details/82459041