-
About Hive
Hive is a Hadoop-based data warehouse tool. It maps structured data files to database tables, provides a simple SQL query capability, and converts SQL-like statements into MapReduce jobs for execution.
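The idea of mapping a data file onto a table can be sketched in a few lines of Python. This is only an illustration of applying a schema when the file is read, not Hive's actual implementation; the schema, the sample rows, and the `read_table` helper are all made up for the example.

```python
import csv
import io

# Hypothetical schema: column names with Python types standing in for Hive column types.
SCHEMA = [("id", int), ("name", str), ("score", float)]


def read_table(text, schema, delimiter="\t"):
    """Apply the schema to each delimited line at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


# Raw file content as it might sit in HDFS (made-up sample data).
RAW = "1\talice\t90.5\n2\tbob\t72.0\n3\tcarol\t88.0\n"

table = read_table(RAW, SCHEMA)

# Rough equivalent of: SELECT name FROM students WHERE score > 80
high_scores = [row["name"] for row in table if row["score"] > 80]
print(high_scores)
```

The file itself stays a plain delimited text file; the table structure exists only in the schema that is applied while reading.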
-
The nature of Hive
Hive translates HQL into MapReduce programs.
-
SQL -> MapReduce principle
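As a rough illustration of the principle, a GROUP BY query can be thought of as a map step that emits the grouping key, a shuffle that collects values per key, and a reduce step that aggregates them. The Python below is a single-process sketch of that idea, not the code Hive generates; the `page_views` table and its rows are hypothetical.

```python
from collections import defaultdict

# Made-up rows of a hypothetical table page_views(user_id, page).
ROWS = [
    ("u1", "home"), ("u2", "home"), ("u1", "cart"),
    ("u3", "home"), ("u2", "cart"), ("u1", "home"),
]


def map_phase(rows):
    """Emit (group-by key, 1) for every input row."""
    for _user_id, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each key's values; here the aggregate is COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


# Sketch of: SELECT page, COUNT(*) FROM page_views GROUP BY page
counts = reduce_phase(shuffle_phase(map_phase(ROWS)))
print(counts)
```

In a real cluster the map and reduce functions run on different machines and the shuffle moves data across the network; the three-stage shape is the part this sketch is meant to show.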
-
Hive's advantages
- Simple and easy to use: provides HQL, a SQL-like query language;
- Scalability: designed for computing over very large data sets (MapReduce as the compute engine, HDFS as the storage system); a Hive cluster can generally be scaled out freely without restarting the service;
- Provides unified metadata management;
- Extensibility: Hive supports user-defined functions, so users can implement their own functions as needed;
- Fault tolerance: fault tolerance is good; a SQL statement can still run to completion when a node has problems;
- Hive is well suited to processing big data and offers no advantage for small data sets;
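To make the user-defined function point concrete: native Hive UDFs are typically written in Java, but Hive's TRANSFORM clause can also stream rows to an external script as tab-separated text. The sketch below shows the per-row logic such a script might contain. The `mask_email` function, the two-column row layout, and the sample rows are assumptions for illustration; a real script would read its lines from standard input.

```python
def mask_email(email):
    """Hypothetical per-row function: hide most of the local part of an address."""
    local, sep, domain = email.partition("@")
    if not sep:
        return email  # not an e-mail address; pass through unchanged
    return local[:1] + "***@" + domain


def transform(lines):
    """Take tab-separated (id, email) rows and yield tab-separated output rows."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines
        row_id, email = line.split("\t")
        yield row_id + "\t" + mask_email(email)


# Made-up input rows; in a streaming script these would come from sys.stdin.
SAMPLE = ["1\talice@example.com\n", "2\tnot-an-email\n"]
masked = list(transform(SAMPLE))
print(masked)
```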
-
Hive's shortcomings
1. HQL has limited expressive power:
(1) iterative algorithms, such as PageRank, cannot be expressed;
(2) it is a poor fit for data mining, for example k-means;
2. Hive is relatively inefficient:
(1) the MapReduce jobs that Hive generates automatically are usually not smart enough;
(2) tuning Hive is difficult and coarse-grained;
(3) Hive offers poor controllability.
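The point about iterative algorithms is easier to see in code. In the minimal PageRank sketch below, every pass needs the ranks produced by the previous pass, and the loop stops on a data-dependent condition, which is the kind of control flow a single declarative query does not provide. The graph, the damping factor, and the tolerance are made-up values for illustration.

```python
# Made-up directed graph: node -> list of nodes it links to.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}


def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    """Power iteration; each pass depends on the ranks from the previous pass."""
    n = len(links)
    ranks = {node: 1.0 / n for node in links}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Every node starts each pass with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / n for node in links}
        for node, targets in links.items():
            share = damping * ranks[node] / len(targets)
            for target in targets:
                new_ranks[target] += share
        delta = sum(abs(new_ranks[node] - ranks[node]) for node in links)
        ranks = new_ranks
        if delta < tol:  # data-dependent stopping condition
            break
    return ranks, iteration


ranks, iterations = pagerank(LINKS)
print(ranks, iterations)
```

Expressed with HQL alone, each pass would have to be a separate query whose result feeds the next one, with an outside driver program deciding when to stop.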
-
Hive architecture
User interface: the client CLI (the hive shell command line), JDBC / ODBC (access to Hive from Java), and the Web UI (access to Hive from a browser);
Metadata (Metastore): the metadata includes the table name, the database the table belongs to (the default database is named default), the table owner, the column and partition fields, the table type (for example, whether it is an external table), the directory where the table's data is stored, and so on. By default the metadata is stored in the bundled Derby database; using MySQL to store the Metastore is recommended;
Hadoop: HDFS is used for storage and MapReduce for computation;
Driver:
(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST). This step is usually done with a third-party library such as ANTLR. The AST is then analyzed, for example to check whether the table exists, whether the fields exist, and whether the SQL semantics are correct;
(2) Compiler (Physical Plan): compiles the AST into a logical execution plan;
(3) Optimizer (Query Optimizer): optimizes the logical execution plan;
(4) Executor (Execution): converts the logical execution plan into a physical plan that can run; for Hive, that means MR / Spark.
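The four driver stages can be mimicked with a toy pipeline. The sketch below handles only statements of the form SELECT columns FROM table and runs against in-memory rows, so it is a simplified illustration of the stage boundaries rather than Hive's real driver; the metastore dictionary, the sample data, and all function names are hypothetical.

```python
# Hypothetical in-memory metastore: table name -> column names.
METASTORE = {"users": ["id", "name", "age"]}
# Made-up table data standing in for files on HDFS.
DATA = {"users": [(1, "alice", 30), (2, "bob", 25)]}


def parse(sql):
    """Parser: turn 'SELECT a, b FROM t' into a tiny AST and validate it."""
    tokens = sql.replace(",", " ").split()
    if len(tokens) < 4 or tokens[0].upper() != "SELECT" or tokens[-2].upper() != "FROM":
        raise ValueError("unsupported statement: " + sql)
    ast = {"columns": tokens[1:-2], "table": tokens[-1]}
    if ast["table"] not in METASTORE:
        raise ValueError("unknown table: " + ast["table"])
    for column in ast["columns"]:
        if column not in METASTORE[ast["table"]]:
            raise ValueError("unknown column: " + column)
    return ast


def compile_plan(ast):
    """Compiler: turn the AST into a logical plan (a list of operators)."""
    return [("scan", ast["table"]), ("project", ast["columns"])]


def optimize(plan):
    """Optimizer: drop a projection that keeps every column in table order."""
    (_, table), (_, columns) = plan
    if columns == METASTORE[table]:
        return [plan[0]]
    return plan


def execute(plan):
    """Executor: run the plan locally; Hive would submit MR / Spark jobs instead."""
    table, rows = None, []
    for op, arg in plan:
        if op == "scan":
            table = arg
            rows = list(DATA[table])
        elif op == "project":
            indexes = [METASTORE[table].index(column) for column in arg]
            rows = [tuple(row[i] for i in indexes) for row in rows]
    return rows


def run(sql):
    """Chain the four stages in the order described above."""
    return execute(optimize(compile_plan(parse(sql))))


result = run("SELECT name, age FROM users")
print(result)
```

The validation in `parse` corresponds to the checks on table and field existence, and `optimize` shows the kind of plan rewrite an optimizer performs, here in its simplest possible form.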
-
Hive compared with SQL databases
Aspect | Hive | SQL database
Query language | HQL | SQL
Data storage location | HDFS | Local FS
Data format | User-defined | Determined by the system
Data updates | Supported since Hive 0.14 | Supported
Indexes | No | Yes
Execution | MapReduce | Executor
Execution latency | High | Low
Scalability | High | Low
Data scale | Large | Small