Hive Introduction and Architecture

  • About Hive

    Hive is a data warehousing tool built on Hadoop. It maps structured data files to database tables and provides a SQL-like query capability, converting SQL-like statements into MapReduce jobs for execution.
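As a minimal sketch of this file-to-table mapping (table name, columns, and file path are all hypothetical), a delimited file on HDFS can be exposed as a table and queried with HQL:

```sql
-- Map a tab-delimited file to a table (hypothetical names).
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Move a data file into the table's HDFS directory.
LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

-- This SQL-like query is compiled into a MapReduce job.
SELECT url, count(*) FROM page_views GROUP BY url;
```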

 

  • The essence of Hive

    Hive translates HQL into MapReduce programs.

 

  • SQL -> MapReduce principle
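The translation can be illustrated with a word-count-style aggregation (the table name is hypothetical); the comments sketch which MapReduce phase handles each part of the query:

```sql
-- Conceptually, a GROUP BY aggregation maps onto MapReduce as follows:
SELECT word, count(1) AS cnt   -- reduce: sum the 1s for each key
FROM words                     -- map: read rows, emit (word, 1) pairs
GROUP BY word;                 -- shuffle: group the pairs by word
```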

 

 

  • Hive's advantages

  1. Simple and easy to use: provides HQL, a SQL-like query language;
  2. Scalable: designed for computation over large data sets, with MapReduce as the compute engine and HDFS as the storage system; a Hive cluster can generally be scaled out without restarting services;
  3. Provides unified metadata management;
  4. Extensible: Hive supports user-defined functions (UDFs), so users can implement their own functions as needed;
  5. Fault-tolerant: good fault tolerance; a SQL job can still complete even when a node has problems;
  6. Hive's advantage lies in processing big data; it offers no advantage for small data sets.
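As an illustration of the UDF support mentioned above (the jar path, class name, and function name here are hypothetical), a custom function written in Java can be registered and called from HQL:

```sql
-- Register a custom UDF packaged in a jar (hypothetical path/class).
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.ToUpperUDF';

-- Use it like a built-in function.
SELECT to_upper(name) FROM users;
```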

 

  • Hive shortcomings

  1. HQL's expressive power is limited:

     (1) iterative algorithms, such as PageRank, cannot be expressed;

     (2) it is not well suited to data mining, e.g. k-means;

  2. Hive's efficiency is relatively low:

     (1) the MapReduce jobs Hive generates automatically are usually not intelligent enough;

     (2) Hive tuning is difficult and coarse-grained;

     (3) Hive offers poor controllability.

 

  • Hive architecture

 

User interfaces (Client) : CLI (the hive shell command line), JDBC/ODBC (Java access to Hive), WebUI (browser access to Hive);

 

Metadata (Metastore) : metadata includes table names, the database each table belongs to (the default is `default`), table owners, column/partition fields, table types (e.g. whether a table is external), the HDFS directory where table data resides, and so on. By default it is stored in the embedded Derby database; using MySQL for the Metastore is recommended.

Hadoop : HDFS is used for storage and MapReduce for computation;
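The kinds of metadata listed above all appear in a table's DDL (database, table, and path names here are hypothetical):

```sql
-- A partitioned external table: the Metastore records the database,
-- owner, columns, partition field, table type (EXTERNAL), and data
-- directory, while the data itself stays in HDFS.
CREATE EXTERNAL TABLE logs.access_log (
  ip  STRING,
  url STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/warehouse/logs/access_log';

-- DESCRIBE FORMATTED prints the metadata the Metastore holds.
DESCRIBE FORMATTED logs.access_log;
```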

 

Driver :

(1) Parser (SQL Parser) : converts the SQL string into an abstract syntax tree (AST); this step is usually done with a third-party library such as ANTLR. The AST is then analyzed to check whether the tables and fields exist and whether the SQL statement contains errors;

(2) Compiler (Physical Plan) : compiles the AST into a logical execution plan;

(3) Optimizer (Query Optimizer) : optimizes the logical execution plan;

(4) Executor (Execution) : converts the logical plan into a physical plan that can actually run, which for Hive means MapReduce or Spark jobs.
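The plan produced by these stages can be inspected with EXPLAIN (the table name is hypothetical); Hive prints the stage graph and operator tree it would execute:

```sql
-- Show the execution plan the driver produced for a query.
EXPLAIN
SELECT url, count(*) AS hits
FROM page_views
GROUP BY url;
```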

 

  • Hive compared with relational databases

Comparison           Hive                          RDBMS
Query language       HQL                           SQL
Data storage         HDFS                          local file system
Data format          user-defined                  determined by the system
Data updates         supported since Hive 0.14     supported
Indexes              none                          yes
Execution            MapReduce                     executor
Execution latency    high                          low
Scalability          high                          low
Data scale           large                         small
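On the data-updates row: since Hive 0.14, UPDATE and DELETE require a bucketed, transactional ORC table (names here are hypothetical, and ACID support must also be enabled in the Hive configuration):

```sql
-- Updates need a transactional table (Hive 0.14+, ORC, bucketed).
CREATE TABLE accounts (
  id      INT,
  balance DOUBLE
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE accounts SET balance = balance + 100 WHERE id = 1;
```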

 


Origin blog.csdn.net/qq_41490561/article/details/104557228