Hive基础练习二

下面是hive基本练习,持续补充中。

Hive导出数据有几种方式,如何导出数据

1.insert

# 分为导出到本地或者hdfs,还可以格式化输出,指定分隔符
# 导出到本地
0: jdbc:hive2://node01:10000> insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu;
INFO  : Compiling command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394): insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stu.id, type:int, comment:null), FieldSchema(name:stu.name, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394); Time taken: 0.107 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394): insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu
INFO  : Query ID = hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
INFO  : Starting Job = job_1573910690864_0002, Tracking URL = http://node01:8088/proxy/application_1573910690864_0002/
INFO  : Kill Command = /kkb/install/hadoop-2.6.0-cdh5.14.2//bin/hadoop job  -kill job_1573910690864_0002
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO  : 2019-11-16 22:19:40,957 Stage-1 map = 0%,  reduce = 0%
INFO  : 2019-11-16 22:19:42,002 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.51 sec
INFO  : MapReduce Total cumulative CPU time: 1 seconds 510 msec
INFO  : Ended Job = job_1573910690864_0002
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Copying data to local directory /kkb/install/hivedatas/stu3 from hdfs://node01:8020/tmp/hive/anonymous/2d04ba8e-9799-4a31-a93d-557db4086e81/hive_2019-11-16_22-19-32_776_5008666227900564137-1/-mr-10000
INFO  : MapReduce Jobs Launched:
INFO  : Stage-Stage-1: Map: 1   Cumulative CPU: 1.51 sec   HDFS Read: 3381 HDFS Write: 285797 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 1 seconds 510 msec
INFO  : Completed executing command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394); Time taken: 10.251 seconds
INFO  : OK
No rows affected (10.383 seconds)
# 查看本地文件
[hadoop@node01 /kkb/install/hivedatas/stu3]$ cat 000000_0
1clyang

# 导出到hdfs
0: jdbc:hive2://node01:10000> insert overwrite directory '/kkb/stu' select * from stu;
INFO  : Compiling command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852): insert overwrite directory '/kkb/stu' select * from stu
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stu.id, type:int, comment:null), FieldSchema(name:stu.name, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852); Time taken: 0.173 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852): insert overwrite directory '/kkb/stu' select * from stu
INFO  : Query ID = hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852
INFO  : Total jobs = 3
INFO  : Launching Job 1 out of 3
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
INFO  : Starting Job = job_1573910690864_0003, Tracking URL = http://node01:8088/proxy/application_1573910690864_0003/
INFO  : Kill Command = /kkb/install/hadoop-2.6.0-cdh5.14.2//bin/hadoop job  -kill job_1573910690864_0003
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO  : 2019-11-16 22:24:13,962 Stage-1 map = 0%,  reduce = 0%
INFO  : 2019-11-16 22:24:15,018 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.46 sec
INFO  : MapReduce Total cumulative CPU time: 1 seconds 460 msec
INFO  : Ended Job = job_1573910690864_0003
INFO  : Starting task [Stage-6:CONDITIONAL] in serial mode
INFO  : Stage-3 is selected by condition resolver.
INFO  : Stage-2 is filtered out by condition resolver.
INFO  : Stage-4 is filtered out by condition resolver.
INFO  : Starting task [Stage-3:MOVE] in serial mode
INFO  : Moving data to: hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10000 from hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10002
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Moving data to: /kkb/stu from hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10000
INFO  : MapReduce Jobs Launched:
INFO  : Stage-Stage-1: Map: 1   Cumulative CPU: 1.46 sec   HDFS Read: 3315 HDFS Write: 286719 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 1 seconds 460 msec
INFO  : Completed executing command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852); Time taken: 9.044 seconds
INFO  : OK
# 查看hdfs
[hadoop@node01 /kkb/install/hivedatas/stu3]$ hdfs dfs -cat /kkb/stu/000000_0
19/11/16 22:26:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1clyang

# 可以指定导出本地格式化分隔符,以导出到本地为例
0: jdbc:hive2://node01:10000> insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu;
INFO  : Compiling command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f): insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stu.id, type:int, comment:null), FieldSchema(name:stu.name, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f); Time taken: 0.128 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f): insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu
INFO  : Query ID = hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
INFO  : Starting Job = job_1573910690864_0005, Tracking URL = http://node01:8088/proxy/application_1573910690864_0005/
INFO  : Kill Command = /kkb/install/hadoop-2.6.0-cdh5.14.2//bin/hadoop job  -kill job_1573910690864_0005
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO  : 2019-11-16 22:31:27,083 Stage-1 map = 0%,  reduce = 0%
INFO  : 2019-11-16 22:31:28,139 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.93 sec
INFO  : MapReduce Total cumulative CPU time: 1 seconds 930 msec
INFO  : Ended Job = job_1573910690864_0005
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Copying data to local directory /kkb/install/hivedatas/stu4 from hdfs://node01:8020/tmp/hive/anonymous/2d04ba8e-9799-4a31-a93d-557db4086e81/hive_2019-11-16_22-31-20_415_1737902713220629568-1/-mr-10000
INFO  : MapReduce Jobs Launched:
INFO  : Stage-Stage-1: Map: 1   Cumulative CPU: 1.93 sec   HDFS Read: 3526 HDFS Write: 286073 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 1 seconds 930 msec
INFO  : Completed executing command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f); Time taken: 8.707 seconds
INFO  : OK
# 查看本地文件,发现以@分隔
[hadoop@node01 /kkb/install/hivedatas/stu4]$ cat 000000_0
1@clyang

2.hadoop命令

数据使用hive保存后存在于hdfs,也可以直接从hdfs将数据拉到本地,使用get命令。

hdfs dfs -get /user/hive/warehouse/student/student.txt /opt/bigdata/data

3.bash shell覆盖追加导出

使用bin/hive -e sql语句或者bin/hive -f sql脚本,将数据覆盖或者追加导出,这里以前者为例,另外sql脚本本质上主要还是sql语句。

# 覆盖写
[hadoop@node01 /kkb/install/hive-1.1.0-cdh5.14.2/bin]$ ./hive -e 'select * from db_hive.stu' > /kkb/install/hivedatas/student2.txt
ls: cannot access /kkb/install/spark/lib/spark-assembly-*.jar: No such file or directory
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/kkb/install/hbase-1.2.0-cdh5.14.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/kkb/install/hadoop-2.6.0-cdh5.14.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-11-16 22:37:46,342 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/11/16 22:37:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Logging initialized using configuration in file:/kkb/install/hive-1.1.0-cdh5.14.2/conf/hive-log4j.properties
OK
Time taken: 6.966 seconds, Fetched: 1 row(s)
You have new mail in /var/spool/mail/root
# 查看结果
[hadoop@node01 /kkb/install/hivedatas]$ cat student2.txt
stu.id  stu.name
1   clyang
# 追加写
[hadoop@node01 /kkb/install/hive-1.1.0-cdh5.14.2/bin]$ ./hive -e 'select * from db_hive.stu' >> /kkb/install/hivedatas/student2.txt
ls: cannot access /kkb/install/spark/lib/spark-assembly-*.jar: No such file or directory
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/kkb/install/hbase-1.2.0-cdh5.14.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/kkb/install/hadoop-2.6.0-cdh5.14.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-11-16 22:39:03,442 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/11/16 22:39:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Logging initialized using configuration in file:/kkb/install/hive-1.1.0-cdh5.14.2/conf/hive-log4j.properties
OK
Time taken: 6.056 seconds, Fetched: 1 row(s)
You have new mail in /var/spool/mail/root
# 查看追加写后结果
[hadoop@node01 /kkb/install/hivedatas]$ cat student2.txt
stu.id  stu.name
1   clyang
stu.id  stu.name
1   clyang

4.export导出到hdfs

# 导出
0: jdbc:hive2://node01:10000> export table stu to '/kkb/studentexport';
INFO  : Compiling command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a): export table stu to '/kkb/studentexport'
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a); Time taken: 0.126 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a): export table stu to '/kkb/studentexport'
INFO  : Starting task [Stage-0:COPY] in serial mode
INFO  : Copying data from file:/tmp/hadoop/e951940a-bcb6-4cd4-be17-0baf5d13615f/hive_2019-11-05_09-43-30_802_7299251851779747447-1/-local-10000/_metadata to hdfs://node01:8020/kkb/studentexport
INFO  : Copying file: file:/tmp/hadoop/e951940a-bcb6-4cd4-be17-0baf5d13615f/hive_2019-11-05_09-43-30_802_7299251851779747447-1/-local-10000/_metadata
INFO  : Starting task [Stage-1:COPY] in serial mode
INFO  : Copying data from hdfs://node01:8020/user/hive/warehouse/db_hive.db/stu to hdfs://node01:8020/kkb/studentexport/data
INFO  : Copying file: hdfs://node01:8020/user/hive/warehouse/db_hive.db/stu/000000_0
INFO  : Completed executing command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a); Time taken: 0.604 seconds
INFO  : OK

# 查看数据
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /kkb/studentexport
19/11/17 20:29:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rwxr-xr-x   3 anonymous supergroup       1330 2019-11-05 09:43 /kkb/studentexport/_metadata
drwxr-xr-x   - anonymous supergroup          0 2019-11-05 09:43 /kkb/studentexport/data
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /kkb/studentexport/data
19/11/17 20:29:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rwxr-xr-x   3 anonymous supergroup          9 2019-11-05 09:43 /kkb/studentexport/data/000000_0
You have new mail in /var/spool/mail/root
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -cat /kkb/studentexport/data/000000_0
19/11/17 20:29:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1clyang

分区和分桶的区别

分区是文件夹范畴的,就是按照文件夹区分,来存储文件,分桶是文件范畴的,将一个文件根据某个字段按hash取余,拆分为几个文件片段保存,它们都有各自的应用场景:

(1)分区用在按照日期,按照天,或者小时来保存数据,后面查询可以根据需求快速定位到数据,避免了速度慢的全表扫描查询。

(2)分桶则是更加细粒度的存储,可以指定桶的个数n,这样一份文件保存会划分为n份,如果想快速查找可以用tablesample(bucket x out of y)来指定要抽样查询的桶表。

另外分区表里面可能有分桶表。

将数据直接上传到分区目录(hdfs)上,让分区表和数据产生关联有哪些方式?

当创建分区表并将数据导入到分区后,发现导入的数据就保存在对应的分区目录下,并且可以正常查询表内容。如果先将数据导入到事先准备好的分区,然后再创建分区表,是查不到数据的,因为还没有建立分区表数据和hive表的映射关系,需要使用命令来修复,此外还有2种方法。

方法1 msck repair table 表名

提前准备好分区,并将数据上传。

[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /mystudentdatas/month=11/
19/11/17 12:36:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   3 hadoop supergroup        199 2019-11-17 12:36 /mystudentdatas/month=11/student.csv

创建表格

0: jdbc:hive2://node01:10000> create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/mystudentdatas';
INFO  : Compiling command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4): create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/mystudentdatas'
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4); Time taken: 0.149 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4): create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/mystudentdatas'
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4); Time taken: 0.271 seconds
INFO  : OK

修复表格,使用msck,修复后就可以查看到表的数据了,映射关系建立。

# 修复表格
0: jdbc:hive2://node01:10000> msck repair table student_partition_me;
INFO  : Compiling command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631): msck repair table student_partition_me
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631); Time taken: 0.011 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631): msck repair table student_partition_me
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631); Time taken: 0.263 seconds
INFO  : OK
No rows affected (0.311 seconds)
# 查询,最后字段为分区字段month
0: jdbc:hive2://node01:10000> select id,name,year,gender,month from student_partition_me;
INFO  : Compiling command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a): select id,name,year,gender,month from student_partition_me
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:year, type:string, comment:null), FieldSchema(name:gender, type:string, comment:null), FieldSchema(name:month, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a); Time taken: 0.133 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a): select id,name,year,gender,month from student_partition_me
INFO  : Completed executing command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a); Time taken: 0.0 seconds
INFO  : OK
+-----+-------+-------------+---------+--------+--+
| id  | name  |    year     | gender  | month  |
+-----+-------+-------------+---------+--------+--+
| 01  | 赵雷    | 1990-01-01  | 男       | 11     |
| 02  | 钱电    | 1990-12-21  | 男       | 11     |
| 03  | 孙风    | 1990-05-20  | 男       | 11     |
| 04  | 李云    | 1990-08-06  | 男       | 11     |
| 05  | 周梅    | 1991-12-01  | 女       | 11     |
| 06  | 吴兰    | 1992-03-01  | 女       | 11     |
| 07  | 郑竹    | 1989-07-01  | 女       | 11     |
| 08  | 王菊    | 1990-01-20  | 女       | 11     |
+-----+-------+-------------+---------+--------+--+
8 rows selected (0.214 seconds)

方法2 alter table 表名 add partition(col=xxx)

将数据上传到hdfs

# 注意这里hdfs数据目录换成studentdatas
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /studentdatas/month=12/
19/11/17 16:51:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   3 hadoop supergroup        199 2019-11-17 16:51 /studentdatas/month=12/student.csv

创建表格

0: jdbc:hive2://node01:10000> create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/studentdatas';
INFO  : Compiling command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699): create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/studentdatas'
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699); Time taken: 0.011 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699): create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/studentdatas'
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699); Time taken: 0.097 seconds
INFO  : OK

使用alter table指定分区

0: jdbc:hive2://node01:10000> alter table student_partition_pa add partition(month='12');
INFO  : Compiling command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662): alter table student_partition_pa add partition(month='12')
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662); Time taken: 0.051 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662): alter table student_partition_pa add partition(month='12')
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662); Time taken: 0.116 seconds
INFO  : OK

查询数据,ok

0: jdbc:hive2://node01:10000> select id,name,year,gender,month from student_partition_pa;
INFO  : Compiling command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb): select id,name,year,gender,month from student_partition_pa
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:year, type:string, comment:null), FieldSchema(name:gender, type:string, comment:null), FieldSchema(name:month, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb); Time taken: 0.092 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb): select id,name,year,gender,month from student_partition_pa
INFO  : Completed executing command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb); Time taken: 0.001 seconds
INFO  : OK
+-----+-------+-------------+---------+--------+--+
| id  | name  |    year     | gender  | month  |
+-----+-------+-------------+---------+--------+--+
| 01  | 赵雷    | 1990-01-01  | 男       | 12     |
| 02  | 钱电    | 1990-12-21  | 男       | 12     |
| 03  | 孙风    | 1990-05-20  | 男       | 12     |
| 04  | 李云    | 1990-08-06  | 男       | 12     |
| 05  | 周梅    | 1991-12-01  | 女       | 12     |
| 06  | 吴兰    | 1992-03-01  | 女       | 12     |
| 07  | 郑竹    | 1989-07-01  | 女       | 12     |
| 08  | 王菊    | 1990-01-20  | 女       | 12     |
+-----+-------+-------------+---------+--------+--+
8 rows selected (0.162 seconds)

方法3 load data inpath 'hdfs文件路径' into table 表名 partition(col名='xxx')

数据上传到hdfs

[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /
19/11/17 17:21:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 13 items
# 上传person.txt到hdfs
-rw-r--r--   3 hadoop      supergroup         68 2019-11-17 17:09 /person.txt

创建表格

0: jdbc:hive2://node01:10000> create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '\t' collection items terminated by ',' location '/persondatas';
INFO  : Compiling command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666): create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '\t' collection items terminated by ',' location '/persondatas'
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666); Time taken: 0.023 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666): create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '\t' collection items terminated by ',' location '/persondatas'
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666); Time taken: 0.101 seconds
INFO  : OK

将hdfs上文件加载到分区目录下

0: jdbc:hive2://node01:10000> load data inpath '/person.txt' into table person_partition partition(age='25');
INFO  : Compiling command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81): load data inpath '/person.txt' into table person_partition partition(age='25')
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81); Time taken: 0.082 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81): load data inpath '/person.txt' into table person_partition partition(age='25')
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table myhive.person_partition partition (age=25) from hdfs://node01:8020/person.txt
INFO  : Starting task [Stage-1:STATS] in serial mode
INFO  : Partition myhive.person_partition{age=25} stats: [numFiles=1, numRows=0, totalSize=68, rawDataSize=0]
INFO  : Completed executing command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81); Time taken: 0.382 seconds
INFO  : OK

查询数据,ok

0: jdbc:hive2://node01:10000> select * from person_partition;
INFO  : Compiling command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392): select * from person_partition
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:person_partition.name, type:string, comment:null), FieldSchema(name:person_partition.citys, type:array<string>, comment:null), FieldSchema(name:person_partition.age, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392); Time taken: 0.099 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392): select * from person_partition
INFO  : Completed executing command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392); Time taken: 0.001 seconds
INFO  : OK
+------------------------+----------------------------------------------+-----------------------+--+
| person_partition.name  |            person_partition.citys            | person_partition.age  |
+------------------------+----------------------------------------------+-----------------------+--+
| yang                   | ["beijing","shanghai","tianjin","hangzhou"]  | 25                    |
| messi                  | ["changchu","chengdu","wuhan"]               | 25                    |
+------------------------+----------------------------------------------+-----------------------+--+

分桶表是否可以通过直接load将数据导入?

桶表需要根据某个字段进行hash取余然后拆分数据保存为不同的文件保存到hdfs,需要通过普通中间表的字段值来计算拆分,因此不能直接load导入,直接导入到hdfs只有一个文件。另外通过桶表的文件类型可以看出,它不是原来的格式了,是一个mr计算后的文件,因此也说明不能用hdfs直接导入。

hive中分区可以提高查询效率,分区是否越多越好,为什么?

hive查询本质上是执行MapReduce任务,如果分区太多,同样体量的数据会产生更多的小文件block块,则会产生的更多的元数据(块的位置、大小等信息),这样对namenode来说压力很大。

另外hive sql会转化为mapreduce任务,分区的一个小文件会对应一个的task,一个task对应一个JVM实例,过多的分区会产生大量的JVM实例,导致JVM频繁的创建与销毁,会降低系统整体性能。

参考博文:
(1)https://www.cnblogs.com/tele-share/p/9829515.html

猜你喜欢

转载自www.cnblogs.com/youngchaolin/p/11877986.html