Tables and syntax in HIVE

1. Hive tables

    Hive uses four kinds of tables: internal (managed) tables, external tables, partitioned tables, and bucketed tables.

1. Internal and external tables

1. Features

    Create a Hive table and then inspect the TBLS table in the metastore database: the table's type is recorded as MANAGED_TABLE. This is the so-called internal (managed) table.

    The defining characteristic of an internal table is that the table exists first and the data comes second: data is uploaded into the HDFS directory corresponding to the table, where Hive manages it.

    In practice, working with an internal table is almost the same as working with a table in an ordinary SQL database.

    In real development, however, it is very likely that data already exists in HDFS, and you want Hive to use that data directly as table content.

    In that case, you can create a Hive table associated with that location and manage the data through it. A table created this way is called an external table.

    The defining characteristic of an external table is that the data exists first and the table comes second: the Hive table is associated with an existing location and manages the data there.

2. Create table

    The statement for creating an internal table is the same as a standard SQL CREATE TABLE statement.

    The syntax for creating an external table is as follows:

create external table .... location 'xxxx';

    Example

    Prepare files in HDFS:

hadoop fs -mkdir /hdata
hadoop fs -put stu.txt /hdata/stu.txt
hadoop fs -put teacher.txt /hdata/teacher.txt  

    Create an external table in hive to manage existing data:

create external table ext_stu(id int ,name string) row format delimited fields terminated by '\t' location '/hdata';

    The following can be observed:

    No table folder is created under /user/hive/warehouse/[db]/.

    A new record appears in the TBLS table of the metastore database, with table type EXTERNAL_TABLE.

    New column records appear in the COLUMNS_V2 table of the metastore database.

    A new record appears in the SDS table of the metastore database, pointing to the location where the real data is stored.

    Querying the table confirms that the data at that location is usable: the external table was created successfully.
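    For example, a quick check query (assuming stu.txt contains tab-separated id and name values):

select * from ext_stu;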

3. Drop table

    The statement for dropping internal and external tables is the same as the standard SQL DROP TABLE statement.

    When a table is dropped:

    An internal table's relevant metadata is deleted from the metastore database, and the table's folder in HDFS is deleted along with the data in it.

    An external table's related metadata is deleted from the metastore database, but the associated folder and the data inside it are not deleted.
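    For example, dropping the external table created earlier removes only its metadata; the files under /hdata remain and can still be listed afterwards:

drop table ext_stu;

hadoop fs -ls /hdata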

2. Partitioned tables

    Hive also supports partitioned tables.

    Partitioning divides the data to improve query efficiency: when a large data set is frequently queried by certain fields, designing the table with those fields as partitions makes such queries much faster.

1. Create a partitioned table

create table book (id bigint, name string) partitioned by (country string) row format delimited fields terminated by '\t';

    When creating a partitioned table, the partition field must not appear in the regular column list; Hive adds it automatically, and it shows up as a column in query results.
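    For example, even though country is not declared in the column list, it can be selected and filtered like an ordinary column (a sketch, assuming data has been loaded):

select id, name, country from book;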

2. Partitioned table loading data

1> Relative path loading

    Load local data using relative paths:

load data local inpath './book_china.txt' overwrite into table book partition (country='china');

2> Absolute path loading

    Load local data with absolute paths:

load data local inpath '/root/book_english.txt' overwrite into table book partition (country='english');

3> Load remote data

    Load data in hdfs:

load data inpath '/book/jp.txt' overwrite into table book partition (country='jp');

    Note how the path is written:

    If no HDFS address is specified in the path, the data is looked up on the local cluster's HDFS by default.

    If an HDFS address is specified, the data is looked up at that address.

    For example: 'hdfs://hadoop:9000/book/jp.txt'

3. Querying data from a partitioned table

select * from book;
select * from book where country='china';

4. Process analysis

    After a partitioned table is created and data is written into it, a sub-level partition folder is created under the table's folder to store the data, and that directory is recorded in the SDS table of the metastore database as a data source folder.

    When a query filters on the partition field, Hive can go directly to the folder corresponding to that field value and return the data under it, which is very efficient.

    Therefore, if a Hive table holds a large amount of data and certain fields are queried frequently, those fields can be designed as partition fields to improve efficiency.
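    As an illustration, with the default warehouse directory the partition folders of book are laid out like this (database name hypothetical, listing abridged):

hadoop fs -ls /user/hive/warehouse/park.db/book
/user/hive/warehouse/park.db/book/country=china
/user/hive/warehouse/park.db/book/country=english
/user/hive/warehouse/park.db/book/country=jp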

5. Multiple partitions

    Hive supports multiple partition fields in a table, which can be declared in sequence when creating the table.

    Create the table:

create table book2 (id bigint, name string) partitioned by (country string,category string) row format delimited fields terminated by '\t';

    Load data:

load data local inpath '/root/work/china.txt' into table book2 partition (country='china',category='xs');

    Multiple partitions form multi-level subdirectories that store the data separately; this likewise improves the efficiency of queries that filter on the partition fields.

6. Adding manually uploaded data

    If a data file is uploaded directly into a table's HDFS directory, a manually created partition directory cannot be used by Hive, because the partition is not recorded in the metastore database.

    To make Hive recognize a partition you created yourself, execute:

ALTER TABLE book2 add  PARTITION (country='jp',category = 'xs') location '/user/hive/warehouse/park2.db/book2/country=jp/category=xs';
ALTER TABLE book2 add  PARTITION (country='jp',category = 'it') location '/user/hive/warehouse/park2.db/book2/country=jp/category=it'; 
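    To confirm that the metastore now recognizes the added partitions, they can be listed:

show partitions book2;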

3. Bucketed tables

1. Introduction

    Hive also supports bucketed tables. Bucketing is a finer-grained way of distributing data; a table can be partitioned and bucketed at the same time.

    The principle of bucketing is to compute a hash over a specified column and store the rows in separate files according to the result.

    The main use of bucketing is sampling:

    For a huge data set, we often need to take a small portion as a sample, verify our queries on the sample, and optimize our programs; a bucketed table makes this kind of data sampling possible.

2. Application steps

    Hive disables bucketing by default; it must be enabled manually with set hive.enforce.bucketing=true;

    Bucketing takes effect during the underlying MapReduce job. Therefore, instead of bucketing the original data table directly, a separate bucketed table is usually created, and the data from the original table is inserted into it; the MapReduce job triggered by that insert performs the actual bucketing.

1> Enable the bucketing function

    Enable bucketing, which forces the number of reducers to match the number of buckets:

hive>set hive.enforce.bucketing=true;

2> Create the main table

    Prepare the main table:

create table teacher(id int,name string) row format delimited fields terminated by '|';

    Load data to the main table:

load data local inpath '/root/work/teachers.txt' into table teacher;

3> Create the bucketed table

    Create a table with buckets:

create table teacher_temp(id int,name string) clustered by (id) into 2 buckets  row format delimited fields terminated by '|';

    Import data from the main table into the bucketed table:

insert overwrite table teacher_temp select * from teacher;

    (Use insert into instead of insert overwrite to append to the table rather than replace its contents.)

    A bucketed table stores its data in buckets according to a hash of the bucketing column; the so-called buckets are simply separate files under the table's folder.
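    As a rough illustration (for an int column, Hive's hash value is the integer itself), with 2 buckets the destination file of each row is determined by hash(id) % 2:

-- bucket 1 (first file):  rows where id % 2 == 0
-- bucket 2 (second file): rows where id % 2 == 1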

4> Test

    Sample part of the data from the bucketed table:

select * from teacher_temp tablesample(bucket 1 out of 2 on id);

    Bucket numbering starts at 1, so the preceding query retrieves the data in the first of the 2 buckets, i.e. half of the table.

    Get 1/4 of the data from the bucketed table:

select * from teacher_temp tablesample(bucket 1 out of 4 on id);

    tablesample is a logical sampling process: even if the number of physical buckets is not 4, data can still be returned, but for the best efficiency the sampling denominator should match the number of physical buckets.

2. HIVE syntax

1. Data types

    Hive data types and their Java counterparts:

    TINYINT : byte

    SMALLINT : short

    INT : int

    BIGINT : long

    BOOLEAN : boolean

    FLOAT : float

    DOUBLE : double

    STRING : String

    TIMESTAMP : Timestamp

    BINARY : byte[]
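    As a quick illustration, a table declaration that exercises several of these types (the table name is hypothetical):

create table type_demo (
    a tinyint, b smallint, c int, d bigint,
    e boolean, f float, g double,
    h string, i timestamp, j binary
);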

2. Syntax

1. Create table

1> Keywords

CREATE TABLE

    Creates a table with the specified name. If a table with the same name already exists, an exception is thrown; the user can add the IF NOT EXISTS option to ignore the exception, but note that in that case the table is still not created and no message is shown.

EXTERNAL

    This keyword lets the user create an external table and specify a path (LOCATION) to the actual data at creation time. When Hive creates an internal table, it moves the data into the path pointed to by the data warehouse; when an external table is created, Hive only records where the data lives and makes no change to the data's location. When a table is dropped, an internal table's metadata and data are deleted together, while an external table loses only its metadata and keeps the data.

LIKE

    Allows the user to copy an existing table's structure, but not its data.

PARTITIONED BY

    Partitioned tables can be created using the PARTITIONED BY statement. A table can have one or more partitions, and each partition exists in a separate directory.

2> Syntax

    Syntax to create a table:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[
[ROW FORMAT row_format] [STORED AS file_format]
| STORED BY 'storage.handler.class.name' [ WITH SERDEPROPERTIES (...) ]  (Note:  only available starting with 0.6.0)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]  (Note:  only available starting with 0.6.0)
[AS select_statement]  (Note: this feature is only available starting with 0.5.0.)

    Syntax to clone a table:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
LIKE existing_table_name
[LOCATION hdfs_path]

    Cloning a table copies only the table structure, not the data.

3> Field

data_type

    data_type can be one of the following values:

    primitive_type, array_type, map_type, struct_type.

primitive_type

    The values of primitive_type are as follows:

    TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING.

array_type

    array_type has only one form: ARRAY<data_type>.

map_type

    map_type has only one form: MAP<primitive_type, data_type>.

struct_type

    struct_type has only one form: STRUCT<col_name : data_type [COMMENT col_comment], ...>.

row_format

DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
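    Putting the complex types above together with a DELIMITED row format, a minimal sketch (all names are hypothetical):

create table complex_demo (
    id int,
    hobbies array<string>,
    scores map<string,int>,
    address struct<city:string, street:string>
)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';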

file_format

    SEQUENCEFILE, TEXTFILE, RCFILE (Note: only available starting with 0.6.0), INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname.

Practice

Create an internal table

create table xx (id int,name string) row format DELIMITED FIELDS TERMINATED BY '\t';

Create an external table

create external table xx (id int,name string) row format DELIMITED FIELDS TERMINATED BY '\t';

Create an external table with partitions

create external table xx (id int,name string) partitioned by (ccc string) row format DELIMITED FIELDS TERMINATED BY '\t';

2. Alter table

1> Add Partitions

    Add partitions.

    Syntax:

ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...

Field:

partition_spec

PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)

2> Drop Partitions

    Delete partitions.

    Syntax:

ALTER TABLE table_name DROP partition_spec, partition_spec, ...
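    For example, removing one of the book2 partitions added earlier:

ALTER TABLE book2 DROP PARTITION (country='jp', category='xs');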

 

3> Rename Table

    Rename a table.

    Syntax:

ALTER TABLE table_name RENAME TO new_table_name

4> Change Column

    Modify column information.

    Syntax:

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]

    This command allows changing a column's name, data type, comment, position, or any combination of these.
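    For example, renaming the name column of book2 and adding a comment (a sketch):

ALTER TABLE book2 CHANGE COLUMN name book_name string COMMENT 'book title';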

5> Add/Replace Columns

    Add or replace columns.

    Syntax:

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

    ADD appends new columns after all existing columns (but before the partition columns); REPLACE replaces all columns in the table.
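    For example, appending a new column to book2 (the column name is hypothetical):

ALTER TABLE book2 ADD COLUMNS (price double COMMENT 'list price');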

3. Show

1> View databases and tables

SHOW DATABASES;
SHOW TABLES;

2> View table names

    List table names matching a pattern:

SHOW TABLES 'page.*';
SHOW TABLES '.*view';

3> View partitions

    View all partitions of a table (an error is reported if the table is not partitioned):

SHOW PARTITIONS page_view;

4> View the table structure

    View a table structure:

DESCRIBE invites;
DESC invites;

5> View partition content

    View partition contents:

SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

    View a limited number of rows using the LIMIT keyword, the same as in Greenplum:

SELECT a.foo FROM invites a limit 3;

    View the table partition definition:

DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');

 

4. Load

    Syntax:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

    The Load operation is a simple copy/move: it copies a local file (with LOCAL) or moves an HDFS file into the location corresponding to the Hive table.

5. Insert

1> Insert query results into a table

    Insert the result of a query into a Hive table.

    Syntax:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement
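    For example, filling one partition of the book table from book2 (a sketch using the tables created earlier):

INSERT OVERWRITE TABLE book PARTITION (country='china') SELECT id, name FROM book2 WHERE country='china';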

2> Write query results out

    Write query results out to the file system.

    Syntax:

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...

    LOCAL: without this keyword the results are written to HDFS by default; with it, they are written to a path on the local disk.
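    For example, exporting the book table to a local directory (the path is hypothetical):

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/book_export' SELECT * FROM book;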

6. Drop

    Dropping an internal table deletes both the table's metadata and its data.

    Dropping an external table removes only the metadata and keeps the data.

    Syntax:

DROP TABLE table_name

7. Limit

    LIMIT restricts the number of records returned by a query.

8. Select

    Query table contents.

    Syntax:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[   CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]

9. Join

    Joins combine multiple tables in one query. Join queries fall into Cartesian product (cross) joins, inner joins, and outer joins; outer joins are further divided into left outer, right outer, and full outer joins.

1> join_table

    Syntax:

table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition

2> Field

table_reference

    table_factor | join_table

table_factor

    tbl_name [alias] | table_subquery alias | ( table_references )

join_condition

    ON equality_expression ( AND equality_expression )*

equality_expression

    expression = expression
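    A few concrete sketches using the ext_stu and teacher tables created earlier (joining on id is purely illustrative):

select s.id, s.name, t.name from ext_stu s join teacher t on s.id = t.id;

select s.id, s.name, t.name from ext_stu s left outer join teacher t on s.id = t.id;

select s.* from ext_stu s left semi join teacher t on s.id = t.id;

    The last form, LEFT SEMI JOIN, returns only the rows of the left table that have a match in the right table.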

 
