[Repost] Four ways to import data into Hive

https://cloud.tencent.com/developer/article/1063706

This article introduces four common ways to import data into Hive: (1) importing data from the local file system into a Hive table; (2) importing data from HDFS into a Hive table; (3) querying data from another table and inserting the results into a Hive table; (4) creating a table and populating it, in one statement, with records queried from another table.

First, importing data from the local file system into a Hive table

Start by creating a table in Hive, as follows:

hive> create table wyp
    > (id int, name string,
    > age int, tel string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 2.832 seconds

This table is very simple: only four fields, whose meanings need no explanation. On the local file system there is a file /home/wyp/wyp.txt, with the following contents:

[wyp@master ~]$ cat wyp.txt
1 wyp 25 13188888888888
2 test 30 13888888888888
3 zs 34 899314121

The columns in wyp.txt are separated by \t, so the data in this file can be imported into the wyp table with the following statement:

hive> load data local inpath 'wyp.txt' into table wyp;
 
Copying data from file:/home/wyp/wyp.txt

Copying file: file:/home/wyp/wyp.txt

Loading data to table default.wyp

Table default.wyp stats:

[num_partitions: 0, num_files: 1, num_rows: 0, total_size: 67]

OK

Time taken: 5.967 seconds

This imports the contents of wyp.txt into the wyp table. You can view the table's data directory with the following command:

hive> dfs -ls /user/hive/warehouse/wyp ;
Found 1 items
-rw-r--r--   3 wyp supergroup 67 2014-02-19 18:23 /user/hive/warehouse/wyp/wyp.txt

Note: unlike the relational databases we are familiar with, Hive does not support directly specifying a set of records in text form inside an insert statement; that is, Hive does not support statements of the form INSERT INTO ... VALUES.
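Later Hive releases (0.14 and up) did add INSERT ... VALUES support; on the version used in this article, a common workaround is to write the new records to a local delimited file and load it. A minimal sketch (the file name and row values here are made up for illustration):

```sql
-- Not supported on the Hive version used in this article:
--   INSERT INTO TABLE wyp VALUES (4, 'lisi', 28, '13712345678');

-- Workaround: put the rows in a tab-delimited local file first, e.g.
--   echo -e "4\tlisi\t28\t13712345678" > /home/wyp/new_rows.txt
-- then load that file into the table:
load data local inpath '/home/wyp/new_rows.txt' into table wyp;
```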

Second, importing data from HDFS into a Hive table

When importing data from the local file system into a Hive table, the data is in fact first copied to a temporary directory on HDFS (typically somewhere under the user's HDFS home directory, e.g. /home/wyp/), and then moved (note: moved, not copied!) from that temporary directory into the Hive table's data directory. That being the case, Hive should certainly also support moving data directly from a directory on HDFS into the data directory of the corresponding Hive table. Consider the file /home/wyp/add.txt; the specific operation is as follows:

[wyp@master /home/q/hadoop-2.2.0]$ bin/hadoop fs -cat /home/wyp/add.txt
5 wyp1 23 131212121212
6 wyp2 24 134535353535
7 wyp3 25 132453535353
8 wyp4 26 154243434355

The above is the data we need to insert. The file is located in the /home/wyp directory on HDFS (unlike in section one, where the file was on the local file system). We can import its contents into a Hive table with the following command:

hive> load data inpath '/home/wyp/add.txt' into table wyp;

Loading data to table default.wyp

Table default.wyp stats:

[num_partitions: 0, num_files: 2, num_rows: 0, total_size: 215]

OK

Time taken: 0.47 seconds



hive> select * from wyp;

OK

5 wyp1 23 131212121212

6 wyp2 24 134535353535

7 wyp3 25 132453535353

8 wyp4 26 154243434355

1 wyp 25 13188888888888

2 test 30 13888888888888

3 zs 34 899314121

Time taken: 0.096 seconds, Fetched: 7 row(s)

We can see from the results above that the data was indeed imported into the wyp table! Note that load data inpath '/home/wyp/add.txt' into table wyp; does not contain the word local; that is the difference from section one.
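To make the difference explicit, the two load forms can be put side by side (a sketch summarizing the behavior described above):

```sql
-- With LOCAL: the source file lives on the local file system; it is
-- copied up to HDFS and then moved into the table's data directory.
load data local inpath '/home/wyp/wyp.txt' into table wyp;

-- Without LOCAL: the source file is already on HDFS; it is MOVED into
-- the table's data directory, so it disappears from /home/wyp/ afterwards.
load data inpath '/home/wyp/add.txt' into table wyp;
```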

Third, querying data from another table and inserting the results into a Hive table

Suppose Hive already has a table named test, created with the following statement:

hive> create table test(

> id int, name string

> ,tel string)

> partitioned by

> (age int)

> ROW FORMAT DELIMITED

> FIELDS TERMINATED BY '\t'

> STORED AS TEXTFILE;

OK

Time taken: 0.261 seconds

The statement is similar to the one that created the wyp table, except that the test table uses age as its partition field. A few words of explanation about partitions:

Partition: in Hive, each partition of a table corresponds to a subdirectory under the table's directory, and all the data of a partition is stored in that subdirectory. For example, if the wyp table had two partition fields, dt and city, then the partition dt=20131218, city=BJ would correspond to the directory /user/hive/warehouse/wyp/dt=20131218/city=BJ, and all data belonging to that partition would be stored in that directory.
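As a concrete sketch (a hypothetical listing, assuming the default warehouse location and the layout just described), a test table partitioned by age would contain one subdirectory per partition value:

```sql
hive> dfs -ls /user/hive/warehouse/test;
drwxr-xr-x   - wyp supergroup  0 ...  /user/hive/warehouse/test/age=25
drwxr-xr-x   - wyp supergroup  0 ...  /user/hive/warehouse/test/age=30
```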

The following statement queries the wyp table and inserts the results into the test table:

hive> insert into table test
    > partition (age='25')
    > select id, name, tel
    > from wyp;
#####################################################################
      a bunch of MapReduce job output is omitted here
#####################################################################
Total MapReduce CPU Time Spent: 1 seconds 310 msec
OK
Time taken: 19.125 seconds



hive> select * from test;
OK
5 wyp1 131212121212 25
6 wyp2 134535353535 25
7 wyp3 132453535353 25
8 wyp4 154243434355 25
1 wyp 13188888888888 25
2 test 13888888888888 25
3 zs 899314121 25
Time taken: 0.126 seconds, Fetched: 7 row(s)

 

A note here: as we said, the traditional insert into table values (field1, field2) form of inserting data is not supported by Hive.

From the output above, we can see that the rows queried from the wyp table were successfully inserted into the test table! If the target table (test) had no partition field, the partition (age='25') clause could be omitted. Of course, we can also specify the partition dynamically, by selecting the partition value inside the select statement:

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into table test
    > partition (age)
    > select id, name,
    > tel, age
    > from wyp;
#####################################################################
      a bunch of MapReduce job output is omitted here
#####################################################################
Total MapReduce CPU Time Spent: 1 seconds 510 msec
OK
Time taken: 17.712 seconds


hive> select * from test;
OK
5 wyp1 131212121212 23
6 wyp2 134535353535 24
7 wyp3 132453535353 25
1 wyp 13188888888888 25
8 wyp4 154243434355 26
2 test 13888888888888 30
3 zs 899314121 34
Time taken: 0.399 seconds, Fetched: 7 row(s)

This method is called a dynamic partition insert. It is disabled in Hive by default, so hive.exec.dynamic.partition.mode must first be set to nonstrict before use. Hive also supports inserting data with insert overwrite; as the word suggests, overwrite means that when this statement is executed, the existing data in the corresponding directory is overwritten, whereas insert into does not overwrite it. Note the difference between the two. An example:

hive> insert overwrite table test

> PARTITION (age)

> select id, name, tel, age

> from wyp;

Even better, Hive also supports multi-table inserts. What does that mean? In Hive, we can turn an insert statement upside down and put the from clause at the front; the result of executing it is the same. Suppose a test3 table exists, as shown below:

hive> show create table test3;

OK

CREATE TABLE test3(

id int,

name string)

Time taken: 0.277 seconds, Fetched: 18 row(s)



hive> from wyp

> insert into table test

> partition(age)

> select id, name, tel, age

> insert into table test3

> select id, name

> where age>25;



hive> select * from test3;

OK

8 wyp4

2 test

3 zs

Time taken: 4.308 seconds, Fetched: 3 row(s)

You can use multiple insert clauses in the same query; the advantage is that the source table only needs to be scanned once to produce multiple disjoint outputs. How cool is that!

Fourth, creating a table and populating it with records queried from another table

In practice, a table's output may be too large to display conveniently on the console; in that case, storing the query result directly in a new Hive table is very convenient. We call this CTAS (create table .. as select), as follows:

hive> create table test4

> as

> select id, name, tel

> from wyp;



hive> select * from test4;

OK

5 wyp1 131212121212

6 wyp2 134535353535

7 wyp3 132453535353

8 wyp4 154243434355

1 wyp 13188888888888

2 test 13888888888888

3 zs 899314121

Time taken: 0.089 seconds, Fetched: 7 row(s)

The data has been inserted into the test4 table. The CTAS operation is atomic, so if the select query fails for some reason, the new table will not be created!


Origin blog.csdn.net/kingdelee/article/details/85125371