Overview
By default, a simple HQL query scans the entire table, which performs poorly on large tables. Partitioning solves this problem well, and works much like partitioning in an RDBMS. In Hive, each partition corresponds to a predefined partition column, which maps to a subdirectory under the table's directory in HDFS. When the table is queried, only the required partitions (directories) are read, which greatly reduces query I/O and time. Using partitions is a simple and effective way to improve query performance in Hive.
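For instance, a query that filters on the partition columns of the employee_partitioned table created in the next section reads only the matching partition directory (a sketch; the filter values are illustrative):

```sql
-- Only files under .../year=2018/month=11 are scanned;
-- all other partition directories are pruned.
SELECT name, gender_age.age
FROM employee_partitioned
WHERE year = 2018 AND month = 11;
```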
Creating a Partitioned Table
The syntax for creating a partitioned table adds a PARTITIONED BY clause to the ordinary CREATE TABLE statement to specify the partition columns.
Example
> CREATE TABLE employee_partitioned (
name STRING,
work_place ARRAY<STRING>,
gender_age STRUCT<gender:STRING,age:INT>,
skills_score MAP<STRING,INT>,
depart_title MAP<STRING,ARRAY<STRING>>
)
PARTITIONED BY (year INT, month INT) -- partition column
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
> desc employee_partitioned;
+--------------------------+--------------------------------+----------+
| col_name | data_type | comment |
+--------------------------+--------------------------------+----------+
| name | string | |
| work_place | array<string> | |
| gender_age | struct<gender:string,age:int> | |
| skills_score | map<string,int> | |
| depart_title | map<string,array<string>> | |
| year | int | |
| month | int | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| year | int | |
| month | int | |
+--------------------------+--------------------------------+----------+
Loading Data
> LOAD DATA INPATH '/tmp/employee1.txt'
OVERWRITE INTO TABLE employee_partitioned
PARTITION (year=2018, month=11);
> LOAD DATA INPATH '/tmp/employee2.txt'
OVERWRITE INTO TABLE employee_partitioned
PARTITION (year=2018, month=12);
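Besides LOAD DATA, a partition can also be populated with an INSERT statement; with dynamic partitioning, Hive derives the partition values from the query itself. A hedged sketch (the staging table employee_stage is hypothetical):

```sql
-- Required settings for a fully dynamic-partition insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The last two SELECT columns supply the year and month partition values
INSERT OVERWRITE TABLE employee_partitioned PARTITION (year, month)
SELECT name, work_place, gender_age, skills_score, depart_title, year, month
FROM employee_stage;  -- hypothetical staging table with matching columns
```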
Viewing Partitions
Viewing partitions with SHOW
Listing all partitions of a table
SHOW PARTITIONS table_name;
SHOW PARTITIONS lists all existing partitions of a table, in alphabetical order.
Example
> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition |
+---------------------+
| year=2018/month=11 |
| year=2018/month=12 |
+---------------------+
Listing a subset of partitions
Example: filter the result with a partial partition specification.
> SHOW PARTITIONS employee_partitioned PARTITION(month='11');
+---------------------+
| partition |
+---------------------+
| year=2018/month=11 |
+---------------------+
Viewing extended partition information
SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' PARTITION(partition_spec);
When viewing extended partition information, the table name may not be a regular expression, and the partition specification must be complete; a partial specification is not supported.
Example
> SHOW TABLE EXTENDED LIKE "employee_partitioned" PARTITION (year='2018',month='11');
+----------------------------------------------------+
| tab_name |
+----------------------------------------------------+
| tableName:employee_partitioned |
| owner:hadoop |
| location:hdfs://ns001/tmp/hive/employee_partitioned/year=2018/month=11 |
| inputformat:org.apache.hadoop.mapred.TextInputFormat |
| outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| columns:struct columns { string name, list<string> work_place, struct<gender:string,age:i32> gender_age, map<string,i32> skills_score, map<string,list<string>> depart_title} |
| partitioned:true |
| partitionColumns:struct partition_columns { i32 year, i32 month} |
| totalNumberFiles:1 |
| totalFileSize:228 |
| maxFileSize:228 |
| minFileSize:228 |
| lastAccessTime:1569484958041 |
| lastUpdateTime:1569484989618 |
| |
+----------------------------------------------------+
Using DESCRIBE to view partitions
For Hive 1.x.x and 0.x.x:
-- When no database is specified, the column name is appended to the table name with a dot (DOT)
DESCRIBE [EXTENDED|FORMATTED] table_name[.column_name] PARTITION partition_spec;
-- When a database is specified, the column name is separated from the table name by a space
DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name [column_name] PARTITION partition_spec;
For Hive 2.0 and later:
-- The dot-separated table_name.column_name form is no longer supported
-- The partition specification comes after the table name and before the column name; in earlier versions, the column came between the table name and the partition specification
-- field_name is used for struct fields, '$elem$' for array elements, '$key$' for map keys, and '$value$' for map values
DESCRIBE [EXTENDED | FORMATTED]
[db_name.]table_name PARTITION partition_spec [col_name ( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )* ];
Example
> DESCRIBE employee_partitioned PARTITION (year='2018',month='11');
+--------------------------+--------------------------------+----------+
| col_name | data_type | comment |
+--------------------------+--------------------------------+----------+
| name | string | |
| work_place | array<string> | |
| gender_age | struct<gender:string,age:int> | |
| skills_score | map<string,int> | |
| depart_title | map<string,array<string>> | |
| year | int | |
| month | int | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| year | int | |
| month | int | |
+--------------------------+--------------------------------+----------+
> DESCRIBE employee_partitioned PARTITION (year='2018',month='11') skills_score.$value$;
+-----------+------------+--------------------+
| col_name | data_type | comment |
+-----------+------------+--------------------+
| $value$ | int | from deserializer |
+-----------+------------+--------------------+
With the EXTENDED keyword, the table's metadata is shown in Thrift-serialized form, which is mainly useful for debugging. With the FORMATTED keyword, the metadata is shown in tabular form.
> DESCRIBE FORMATTED employee_partitioned PARTITION (year='2018',month='11');
+---------------------------------+----------------------------------------------------+----------+
| col_name | data_type | comment |
+---------------------------------+----------------------------------------------------+----------+
| name | string | |
| work_place | array<string> | |
| gender_age | struct<gender:string,age:int> | |
| skills_score | map<string,int> | |
| depart_title | map<string,array<string>> | |
| year | int | |
| month | int | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| year | int | |
| month | int | |
| | NULL | NULL |
| Detailed Partition Information | Partition(values:[2018, 11], dbName:test2, tableName:employee_partitioned, createTime:1569484989, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:work_place, type:array<string>, comment:null), FieldSchema(name:gender_age, type:struct<gender:string,age:int>, comment:null), FieldSchema(name:skills_score, type:map<string,int>, comment:null), FieldSchema(name:depart_title, type:map<string,array<string>>, comment:null), FieldSchema(name:year, type:int, comment:null), FieldSchema(name:month, type:int, comment:null)], location:hdfs://ns001/tmp/hive/employee_partitioned/year=2018/month=11, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{mapkey.delim=:, collection.delim=,, serialization.format=|, field.delim=|}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{transient_lastDdlTime=1569484989, totalSize=228, numRows=0, rawDataSize=0, numFiles=1}, catName:hive, writeId:0) | |
+---------------------------------+----------------------------------------------------+----------+
> DESCRIBE FORMATTED employee_partitioned PARTITION (year='2018',month='11') skills_score.$value$;
+-------------+------------+--------------------+-------+------------+-----------------+--------------+--------------+------------+-------------+------------+----------+
| col_name | data_type | min | max | num_nulls | distinct_count | avg_col_len | max_col_len | num_trues | num_falses | bitvector | comment |
+-------------+------------+--------------------+-------+------------+-----------------+--------------+--------------+------------+-------------+------------+----------+
| # col_name | data_type | comment | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| $value$ | int | from deserializer | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+-------------+------------+--------------------+-------+------------+-----------------+--------------+--------------+------------+-------------+------------+----------+
Altering Partitions
Adding columns
Adding columns was covered in the table DDL section. By default, adding a column only changes the table's metadata and does not affect existing partitions; in other words, the new column only applies to partitions created afterwards. To apply the change to all existing partitions as well, add the CASCADE keyword.
Example
> ALTER TABLE employee_partitioned ADD COLUMNS (work string) CASCADE;
> DESCRIBE employee_partitioned PARTITION (year='2018',month='11');
+--------------------------+--------------------------------+----------+
| col_name | data_type | comment |
+--------------------------+--------------------------------+----------+
| name | string | |
| work_place | array<string> | |
| gender_age | struct<gender:string,age:int> | |
| skills_score | map<string,int> | |
| depart_title | map<string,array<string>> | |
| work | string | |
| year | int | |
| month | int | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| year | int | |
| month | int | |
+--------------------------+--------------------------------+----------+
Adding partitions
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
Use ALTER TABLE ... ADD PARTITION to add partitions to a table. Partition values need to be quoted only when they are strings. The LOCATION must be a directory containing the data files. ADD PARTITION changes the table's metadata only and does not load any data, so if there is no data at the partition's location, queries will return no results. If the partition to be added already exists in the table, an error is raised; use IF NOT EXISTS to skip the error.
Example
> ALTER TABLE employee_partitioned ADD IF NOT EXISTS PARTITION (year=2019,month=01);
> show partitions employee_partitioned;
+---------------------+
| partition |
+---------------------+
| year=2018/month=11 |
| year=2018/month=12 |
| year=2019/month=1 |
+---------------------+
Renaming a partition
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;
Example
> ALTER TABLE employee_partitioned PARTITION (year=2019,month=1) RENAME TO PARTITION (year=2019,month=2);
> show partitions employee_partitioned;
+---------------------+
| partition |
+---------------------+
| year=2018/month=11 |
| year=2018/month=12 |
| year=2019/month=2 |
+---------------------+
Exchanging partitions
Partitions can be exchanged (moved) between tables.
-- Move partition from table_name_1 to table_name_2
ALTER TABLE table_name_2 EXCHANGE PARTITION (partition_spec) WITH TABLE table_name_1;
-- multiple partitions
ALTER TABLE table_name_2 EXCHANGE PARTITION (partition_spec, partition_spec2, ...) WITH TABLE table_name_1;
table_name_2 is the destination table and table_name_1 is the source table.
Example
> CREATE TABLE employee_partitioned_copy1 like employee_partitioned;
> ALTER TABLE employee_partitioned_copy1 EXCHANGE PARTITION (year=2019,month=2) WITH TABLE employee_partitioned;
> show partitions employee_partitioned;
+---------------------+
| partition |
+---------------------+
| year=2018/month=11 |
| year=2018/month=12 |
+---------------------+
> show partitions employee_partitioned_copy1;
+--------------------+
| partition |
+--------------------+
| year=2019/month=2 |
+--------------------+
Changing a partition column's data type
Example: change the type of the partition column year from int to string
> ALTER TABLE employee_partitioned PARTITION COLUMN(year string);
> DESC employee_partitioned;
+--------------------------+--------------------------------+----------+
| col_name | data_type | comment |
+--------------------------+--------------------------------+----------+
| name | string | |
| work_place | array<string> | |
| gender_age | struct<gender:string,age:int> | |
| skills_score | map<string,int> | |
| depart_title | map<string,array<string>> | |
| work | string | |
| year | string | |
| month | int | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| year | string | |
| month | int | |
+--------------------------+--------------------------------+----------+
Changing a partition's file format
Example: change the file format of partition (year='2018', month='12') from text to ORC
> DESC FORMATTED employee_partitioned PARTITION (year='2018',month='12');
+-----------------------------------+----------------------------------------------------+-----------------------+
| col_name | data_type | comment |
+-----------------------------------+----------------------------------------------------+-----------------------+
...
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
...
+-----------------------------------+----------------------------------------------------+-----------------------+
> ALTER TABLE employee_partitioned PARTITION (year='2018',month='12') SET FILEFORMAT ORC;
> DESC FORMATTED employee_partitioned PARTITION (year='2018',month='12');
+-----------------------------------+----------------------------------------------------+-----------------------+
| col_name | data_type | comment |
+-----------------------------------+----------------------------------------------------+-----------------------+
...
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.orc.OrcSerde | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
...
+-----------------------------------+----------------------------------------------------+-----------------------+
Note: it is up to the user to ensure that the actual data files in the partition match the new format; otherwise queries will fail. In practice, changing the file format of a single partition is rarely needed.
Changing a partition's storage location
Example: change the storage location of partition (year='2018', month='12')
> DESC FORMATTED employee_partitioned partition(year='2018',month='12');
+-----------------------------------+----------------------------------------------------+-----------------------+
| col_name | data_type | comment |
+-----------------------------------+----------------------------------------------------+-----------------------+
...
| Location: | hdfs://ns001/tmp/hive/employee_partitioned/year=2018/month=12 | NULL |
...
+-----------------------------------+----------------------------------------------------+-----------------------+
> ALTER TABLE employee_partitioned PARTITION (year='2018',month='12') SET LOCATION '/user/hive/warehouse/employee_partitioned/year=2018/month=12';
> DESC FORMATTED employee_partitioned partition(year='2018',month='12');
+-----------------------------------+----------------------------------------------------+-----------------------+
| col_name | data_type | comment |
+-----------------------------------+----------------------------------------------------+-----------------------+
...
| Location: | hdfs://ns001/user/hive/warehouse/employee_partitioned/year=2018/month=12 | NULL |
...
+-----------------------------------+----------------------------------------------------+-----------------------+
Note: changing a partition's location does not move any existing data files; the user must ensure the data files end up at the partition's new location.
Merging small files in a partition
For partitions stored in RCFile or ORC format, the CONCATENATE option can be used to merge many small files into larger files.
Example: merge the files in partition (year='2018', month='12')
ALTER TABLE employee_partitioned PARTITION (year='2018',month='12') CONCATENATE;
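To confirm the merge worked, the totalNumberFiles field reported by SHOW TABLE EXTENDED (introduced earlier) can be compared before and after; a sketch:

```sql
-- Note the file count before merging
SHOW TABLE EXTENDED LIKE "employee_partitioned" PARTITION (year='2018', month='12');
-- Merge small files in the ORC partition
ALTER TABLE employee_partitioned PARTITION (year='2018', month='12') CONCATENATE;
-- totalNumberFiles should now be lower
SHOW TABLE EXTENDED LIKE "employee_partitioned" PARTITION (year='2018', month='12');
```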
Clearing Partitions
There are two ways to clear a partition: DROP deletes the partition together with its data, while TRUNCATE deletes only the data and keeps the partition's metadata.
DROP
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
[IGNORE PROTECTION] [PURGE]; -- (Note: PURGE available in Hive 1.2.0 and later, IGNORE PROTECTION not available 2.0.0 and later)
Example: drop partition (year='2018', month='12')
> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition |
+---------------------+
| year=2018/month=11 |
| year=2018/month=12 |
+---------------------+
> ALTER TABLE employee_partitioned DROP IF EXISTS PARTITION (year='2018',month='12');
> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition |
+---------------------+
| year=2018/month=11 |
+---------------------+
TRUNCATE
Example: truncate the data in partition (year='2018', month='11')
> TRUNCATE TABLE employee_partitioned PARTITION (year='2018',month='11');
> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition |
+---------------------+
| year=2018/month=11 |
+---------------------+
Repairing Partitions
When using Hive, if we use HDFS commands (fs -put or fs -rm) to place data directly into a partitioned table's data directory, or to delete partition data from it, the Hive metastore is not aware of the change, so even though the data exists, queries will not return it. One workaround is to run ALTER TABLE table_name ADD/DROP PARTITION for each added or removed partition to bring the metadata back in line with the actual data layout. But if many partitions were added at once, adding them one by one is inefficient. Hive provides the MSCK repair command to do all of this in a single operation.
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
The default action is ADD PARTITIONS. The DROP PARTITIONS option removes partition information from the metastore for partitions whose data has been deleted from HDFS. SYNC PARTITIONS is equivalent to running both ADD and DROP PARTITIONS.
Example
- Check the current partitions of employee_partitioned; there is only one partition
> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition |
+---------------------+
| year=2018/month=11 |
+---------------------+
- Use HDFS commands to create the subdirectories year=2019/month=1 and year=2019/month=2 under the data directory of employee_partitioned
hadoop fs -mkdir /tmp/hive/employee_partitioned/year=2019/month=1/
hadoop fs -mkdir /tmp/hive/employee_partitioned/year=2019/month=2/
- Then run MSCK to add the partitions
> MSCK REPAIR TABLE employee_partitioned ADD PARTITIONS;
> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition |
+---------------------+
| year=2018/month=11 |
| year=2018/month=12 |
| year=2019/month=1 |
| year=2019/month=2 |
+---------------------+
The table now has four partitions. year=2018/month=12 is a subdirectory left over from an earlier test that was never deleted, so it was recovered as well.
- Next, delete one of the partition directories
hadoop fs -rm -r /tmp/hive/employee_partitioned/year=2019/month=1/
- Run MSCK again
> MSCK REPAIR TABLE employee_partitioned DROP PARTITIONS;
> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition |
+---------------------+
| year=2018/month=11 |
| year=2018/month=12 |
| year=2019/month=2 |
+---------------------+
Note that when adding partitions, the new directory structure must follow the table's partition layout. For the employee_partitioned table above, new partition directories must have the form year=xxx/month=yy; directories in any other format are ignored by MSCK.
For example, create a subdirectory year=2019/test=1:
hadoop fs -mkdir /tmp/hive/employee_partitioned/year=2019/test=1
Run MSCK:
> MSCK REPAIR TABLE employee_partitioned ADD PARTITIONS;
> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition |
+---------------------+
| year=2018/month=11 |
| year=2018/month=12 |
| year=2019/month=2 |
+---------------------+
The partition list is unchanged.
Note: if MSCK fails with FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask, this is because, starting with Hive 1.3, MSCK throws an exception when it finds directories on HDFS whose partition values contain disallowed characters. You can set hive.msck.path.validation to ignore for the session:
set hive.msck.path.validation=ignore;
References
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Book: Apache Hive Essentials, Second Edition (Dayong Du), Chapter 3