参考书籍:《Hive编程指南》
1.创建库
(1)语法结构
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment] //关于数据块的描述
[LOCATION hdfs_path] //指定数据库在HDFS上的存储位置
[WITH DBPROPERTIES (property_name=property_value, ...)]; //指定数据块属性
默认地址:/user/hive/warehouse/db_name.db/table_name/partition_name/…
(2)创建库的方式
# 创建普通的数据库
create database t1;
# 创建库的时候检查存与否
create database if not exists t1;
# 创建库的时候带注释
create database if not exists t2 comment 'learning hive';
2.查看库
(1)查看库的方式
# 查看有哪些数据库
show databases;
+-----------------------+
| database_name |
+-----------------------+
| default |
| information_schema |
| sys |
+-----------------------+
--使用正则表达式匹配来筛选需要的数据库名
show databases like 't*';
+-----------------------+
| database_name |
+-----------------------+
| t1 |
| t2 |
+-----------------------+
Hive会为每个数据库创建一个目录,数据库中的表将会以这个数据库目录的子目录形式存储。
default数据库中的表是个例外,因为这个数据库本身没有自己的目录。
数据库所在的目录位于参数hive.metastore.warehouse.dir所指定的顶层目录之后。
# 显示数据库的详细属性信息
desc database extended t2;
+----------+----------------+----------------------------------------------------+-------------+-------------+-------------+
| db_name | comment | location | owner_name | owner_type | parameters |
+----------+----------------+----------------------------------------------------+-------------+-------------+-------------+
| t2 | learning hive | hdfs://server1:8020/warehouse/tablespace/managed/hive/t2.db | hdfs | USER | |
+----------+----------------+----------------------------------------------------+-------------+-------------+-------------+
# 查看正在使用哪个库
select current_database();
+----------+
| _c0 |
+----------+
| default |
+----------+
# 查看创建库的详细语句
show create database t2;
+----------------------------------------------------+
| createdb_stmt |
+----------------------------------------------------+
| CREATE DATABASE `t2` |
| COMMENT |
| 'learning hive' |
| LOCATION |
| 'hdfs://server1:8020/warehouse/tablespace/managed/hive/t2.db' |
+----------------------------------------------------+
3.删除库
(1)删除语法
drop database t1;
drop database if exists t1;
# 默认情况下hive不允许删除包含表的数据库。用户要么先删除数据库中的表,然后再删除数据库,要么在删除命令的最后面加上关键字cascade,这样就能直接删除库。
drop database if exists t1 cascade;
drop database t1 cascade;
4.切换库
use database_name;
5.创建表
(1)语法
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
(2)关键字说明
•EXTERNAL 创建一个外部表,在建表的同时指定指向实际数据的路径(LOCATION)
•LIKE 创建表时copy已有的表结构,不复制数据
•COMMENT 注释
•PARTITIONED BY 指定分区字段
•ROW FORMAT
DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
| SERDE serde_name [WITH SERDEPROPERTIES
(property_name=property_value, property_name=property_value, ...)]
行设置
•STORED AS //数据存储格式设置
SEQUENCEFILE //序列化文件
| TEXTFILE //普通的文本文件格式
| RCFILE //行列存储相结合的文件
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname //自定义文件格式
如果文件数据是纯文本,可以使用 STORED AS TEXTFILE。如果数据需要压缩,使用 STORED AS SEQUENCE 。
•LOCATION 指定表在HDFS的存储路径
建议:
如果一份数据已经存储在HDFS上,并且要被多个用户或者客户端使用,最好创建外部表。
(3)创建表以及操作
# 创建内部表
create table student_test
(
id int,
name string,
sex string,
age int)
row format delimited fields terminated by ",";
默认创建的表都是内部表(也叫管理表),Hive会控制着数据的生命周期,删除管理表时,也会删除表的数据。
管理表不方便和其他工作共享数据。
# 创建外部表(external)
create external table student_ext
(
id int,
name string,
sex string,
age int)
row format delimited fields terminated by ","
location "/hive/student_ext";
外部表在删除该表时并不会删除掉这份数据,不过描述表的元数据信息会被删除掉
# 创建分区表
create external table student_part
(
id int,
name string,
sex string,
age int)
partitioned by (city string)
row format delimited fields terminated by ","
location "/hive/student_part";
--添加新的分区
alter table student_part add partition (city="beijing");
# 分桶表
create external table student_bck
(
id int,
name string,
sex string,
age int,
department string)
clustered by (id)
sorted by (id asc, name desc) into 4 buckets
row format delimited fields terminated by ","
location "/hive/student_bck";
# like 拷贝一张已经存在的表的表模式
create table student_like like student;
--外部表
create external table student_like_ext like student
location '//hive/student_like_ext'
注意:
如果语句中省略掉external关键字而且原始表是外部表的话,那么生成的新表也将是外部表。
如果语句中省略掉external关键字而且原始表是管理表的话,那么生成的新表也将是管理表。
如果语句中有external关键字而且原始表是管理表的话,那么生成的新表将是外部表。
6.查看表
(1)查看表列表
# 查看当前使用的数据库中有哪些表
show tables;
+-------------------------+
| tab_name |
+-------------------------+
| call_center |
+-------------------------+
# 查看非当前使用的数据库中有哪些表
show tables in t2;
+-------------------------+
| tab_name |
+-------------------------+
| catalog_page |
+-------------------------+
# 查看数据库中以xxx开头的表
show tables like 'call_*';
+-------------------------+
| tab_name |
+-------------------------+
| call_center |
+-------------------------+
(2)查看表信息
desc call_center;
+--------------------+------------+----------+
| col_name | data_type | comment |
+--------------------+------------+----------+
| cc_call_center_sk | int | |
| cc_call_center_id | string | |
| cc_rec_start_date | string | |
| cc_rec_end_date | string | |
| cc_closed_date_sk | int | |
| cc_open_date_sk | int | |
| cc_name | string | |
| cc_class | string | |
| cc_employees | int | |
| cc_sq_ft | int | |
| cc_hours | string | |
| cc_manager | string | |
| cc_mkt_id | int | |
| cc_mkt_class | string | |
| cc_mkt_desc | string | |
| cc_market_manager | string | |
| cc_division | int | |
| cc_division_name | string | |
| cc_company | int | |
| cc_company_name | string | |
| cc_street_number | string | |
| cc_street_name | string | |
| cc_street_type | string | |
| cc_suite_number | string | |
| cc_city | string | |
| cc_county | string | |
| cc_state | string | |
| cc_zip | string | |
| cc_country | string | |
| cc_gmt_offset | float | |
| cc_tax_percentage | float | |
+--------------------+------------+----------+
# 查看表的详细信息(格式不友好)
desc extended call_center;
# 查看表的详细信息(格式友好)包含文件数 分区数等等,查看表的文件个数(用于分析小文件)
desc formatted call_center;
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| cc_call_center_sk | int | |
| cc_call_center_id | string | |
| cc_rec_start_date | string | |
| cc_rec_end_date | string | |
| cc_closed_date_sk | int | |
| cc_open_date_sk | int | |
| cc_name | string | |
| cc_class | string | |
| cc_employees | int | |
| cc_sq_ft | int | |
| cc_hours | string | |
| cc_manager | string | |
| cc_mkt_id | int | |
| cc_mkt_class | string | |
| cc_mkt_desc | string | |
| cc_market_manager | string | |
| cc_division | int | |
| cc_division_name | string | |
| cc_company | int | |
| cc_company_name | string | |
| cc_street_number | string | |
| cc_street_name | string | |
| cc_street_type | string | |
| cc_suite_number | string | |
| cc_city | string | |
| cc_county | string | |
| cc_state | string | |
| cc_zip | string | |
| cc_country | string | |
| cc_gmt_offset | float | |
| cc_tax_percentage | float | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | tpcds_orc_hive_8000 | NULL |
| OwnerType: | USER | NULL |
| Owner: | hdfs | NULL |
| CreateTime: | Wed Dec 09 03:17:21 CST 2020 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://server1:8020/warehouse/tablespace/managed/hive/tpcds_orc_hive_8000.db/call_center | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"cc_call_center_id\":\"true\",\"cc_call_center_sk\":\"true\",\"cc_city\":\"true\",\"cc_class\":\"true\",\"cc_closed_date_sk\":\"true\",\"cc_company\":\"true\",\"cc_company_name\":\"true\",\"cc_country\":\"true\",\"cc_county\":\"true\",\"cc_division\":\"true\",\"cc_division_name\":\"true\",\"cc_employees\":\"true\",\"cc_gmt_offset\":\"true\",\"cc_hours\":\"true\",\"cc_manager\":\"true\",\"cc_market_manager\":\"true\",\"cc_mkt_class\":\"true\",\"cc_mkt_desc\":\"true\",\"cc_mkt_id\":\"true\",\"cc_name\":\"true\",\"cc_open_date_sk\":\"true\",\"cc_rec_end_date\":\"true\",\"cc_rec_start_date\":\"true\",\"cc_sq_ft\":\"true\",\"cc_state\":\"true\",\"cc_street_name\":\"true\",\"cc_street_number\":\"true\",\"cc_street_type\":\"true\",\"cc_suite_number\":\"true\",\"cc_tax_percentage\":\"true\",\"cc_zip\":\"true\"}} |
| | bucketing_version | 2 |
| | numFiles | 1 |
| | numRows | 10 |
| | rawDataSize | 0 |
| | totalSize | 5899 |
| | transactional | true |
| | transactional_properties | default |
| | transient_lastDdlTime | 1607455918 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.orc.OrcSerde | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
# 查看分区信息
show partitions time_dim;
(3)查看建表语句
# 可以看到建表语句的详细信息
show create table call_center;
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `call_center`( |
| `cc_call_center_sk` int, |
| `cc_call_center_id` string, |
| `cc_rec_start_date` string, |
| `cc_rec_end_date` string, |
| `cc_closed_date_sk` int, |
| `cc_open_date_sk` int, |
| `cc_name` string, |
| `cc_class` string, |
| `cc_employees` int, |
| `cc_sq_ft` int, |
| `cc_hours` string, |
| `cc_manager` string, |
| `cc_mkt_id` int, |
| `cc_mkt_class` string, |
| `cc_mkt_desc` string, |
| `cc_market_manager` string, |
| `cc_division` int, |
| `cc_division_name` string, |
| `cc_company` int, |
| `cc_company_name` string, |
| `cc_street_number` string, |
| `cc_street_name` string, |
| `cc_street_type` string, |
| `cc_suite_number` string, |
| `cc_city` string, |
| `cc_county` string, |
| `cc_state` string, |
| `cc_zip` string, |
| `cc_country` string, |
| `cc_gmt_offset` float, |
| `cc_tax_percentage` float) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://server1:8020/warehouse/tablespace/managed/hive/tpcds_orc_hive_8000.db/call_center' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'transactional'='true', |
| 'transactional_properties'='default', |
| 'transient_lastDdlTime'='1607455918') |
+----------------------------------------------------+
7.设置参数
(1)查看当前参数
set hive.auto.convert.join;
+------------------------------+
| set |
+------------------------------+
| hive.auto.convert.join=true |
+------------------------------+
(2)设置参数
set hive.auto.convert.join = false;
8.查看sql的执行计划
explain select * from test_table;
# 查看详细的执行计划(信息会很多)
explain extended select * from test_table;