Hive CLI – Migrating to Beeline

How to Use Hive Beeline

Reposted from: http://www.teckstory.com/hadoop-ecosystem/hive-new-cli-beeline-for-hive/

Hive is the data warehouse software of the Hadoop ecosystem. It provides a mechanism to project structure onto large data sets stored in Hadoop and lets you query this data using a SQL-like language called HiveQL. The use case for Hive is well established and it is widely adopted. Hive 1.0.0 was released in February 2015. Before that release, Hive was at version 0.14; the Hive community voted to release the next version, 0.14.1, as 1.0.0. Similarly, the following release, 0.15.0, was renamed 1.1.0.

When it comes to interacting with any database, including Hive, the first and most basic yet powerful method is a command line tool. We have been using the Hive CLI for a long time, but Beeline is now the new Hive CLI, and the old Hive CLI is being deprecated in favor of it. If you are looking for a command line tool to interact with Hive, Beeline is the recommended one. This article covers the most common tasks you used to perform with the old Hive CLI and shows how to do them with Beeline, to give you a jumpstart in migrating from the old CLI to Beeline.

What do you want to do with a command line tool? Let's look at the most common tasks and how to do them using the Beeline CLI. I will use the Cloudera QuickStart VM 5.4.x to execute commands and generate the output for this article. If you are using another Hadoop environment, your output may differ because of version differences. I have also truncated the Beeline prompt in some places to fit the page.

Hive Shell Command (Old CLI)

The hive shell command is the gateway to Hive services. It is used to invoke various Hive services, including the Hive command line interface (CLI). You can get a list of services using the command below.

[cloudera@quickstart ~]$ hive --help
Usage ./hive <parameters> --service serviceName <service parameters>
Service List: beeline cli help hiveburninclient hiveserver2 hiveserver hwi jar lineage metastore metatool orcfiledump rcfilecat schemaTool version

You can see the various services available, including the CLI. The CLI is the most important service and it is the default one, so if you just issue the hive command without specifying a service, it starts the CLI service.

[cloudera@quickstart ~]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>

The hive shell command with the default service is what is known as the "Hive CLI", or Hive command line interface. Notice the warning that the Hive CLI is deprecated and that migration to Beeline is recommended. The services list contains many other services. One of them, hiveserver2, is the most important for the current discussion and is covered in the next section. You may also have noticed that simply executing the hive command connected you to Hive. At this stage, you are ready to execute Hive commands.

HiveServer2

The old Hive CLI connects directly to the Hive metastore and the underlying Hadoop services. It can therefore be used only on hosts that have access to these services, for example cluster nodes (NameNode and DataNodes) or edge nodes. This is rather unrealistic in an enterprise environment, where cluster nodes are usually protected by firewalls and nobody other than admins has direct access to them. Edge nodes are the only place where you may have access to the Hive CLI, but there are usually few of them and everyone has to log in remotely to these systems. This is not how most enterprise database systems work. We needed a client/server architecture with concurrency and authentication, just like other database systems.

This is offered by HiveServer2 and Beeline together.

HiveServer2 is the server. It enables remote clients to send queries to Hive and retrieve the results, and it supports authentication and concurrent access from multiple clients.
Beeline is the client. It is simply a JDBC application, based on the popular SQLLine CLI.

The default authentication mode for HiveServer2 is NONE. However, we can configure username/password authentication as with any other database. Such authentication is typically backed by LDAP: if LDAP is configured, HiveServer2 validates your username and password against it.

With this background, we are ready to take a tour of Beeline. Let's see how to do the basic and most common things we need in day-to-day work with Hive.

Connecting to Hive

You can start the Beeline tool by simply typing the beeline command.

[cloudera@quickstart ~]$ beeline
Beeline version 1.1.0-cdh5.4.2 by Apache Hive

At this point you are not yet connected to Hive. To connect, you have to use the !connect command. Beeline supports two connection modes.

Embedded
Remote

Embedded Mode

To connect in embedded mode, you need to be on a machine where Hive is installed. This is the same type of connection we used with the old Hive CLI. To connect to Hive in embedded mode using Beeline, use the command below.

beeline> !connect jdbc:hive2:// scott tiger
scan complete in 2ms
Connecting to jdbc:hive2://
Added [/usr/lib/hive/lib/hive-contrib.jar] to class path
Added resources: [/usr/lib/hive/lib/hive-contrib.jar]
Connected to: Apache Hive (version 1.1.0-cdh5.4.2)
Driver: Hive JDBC (version 1.1.0-cdh5.4.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://>

In this example, !connect is a Beeline command. It is followed by the connection string and then the username and password: scott is the username and tiger is the password. The credentials are passed here only for the sake of completeness. Because my HiveServer2 is not configured with LDAP to verify them, HiveServer2 ignores them and allows me to connect. If you don't pass a username and password, Beeline will prompt for them.

At this stage, you are connected to Hive and ready to enter your Hive commands interactively. As already discussed, this type of connection is useful for those who have access to cluster nodes.

Remote Mode

Embedded mode is mostly used by admins because they have access to the machines where Hive is installed. For developers, it is almost always a remote mode connection. When you connect to Hive in remote mode, you are in fact interacting with HiveServer2, so you need TCP network connectivity to HiveServer2. To connect to HiveServer2 in remote mode, use the !connect command with the JDBC URL of your database. A simplified format of the JDBC URL is given below.

!connect jdbc:hive2://<host>:<port>/<db> <user> <password>

The detailed format of JDBC URL for HiveServer2 is given below.

jdbc:hive2://<host>:<port>/dbName;sess_var_list?hive_conf_list#hive_var_list

<host>:<port> are the HiveServer2 hostname and port.
dbName is the name of the initial database.
sess_var_list is a semicolon-separated list of key=value pairs of session variables.
hive_conf_list is a semicolon-separated list of key=value pairs of Hive configuration variables for this session.
hive_var_list is a semicolon-separated list of key=value pairs of Hive variables for this session.
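To make the URL format concrete, here is a minimal shell sketch that assembles the URL from its parts. The function name build_hive_url and its argument order are my own, for illustration only (they are not part of Beeline), and each list is assumed to be already formatted as semicolon-separated key=value pairs.

```shell
# Assemble a HiveServer2 JDBC URL from its parts (hypothetical helper).
# $1 = host, $2 = port, $3 = database
# $4 = sess_var_list, $5 = hive_conf_list, $6 = hive_var_list (all optional)
build_hive_url() {
  url="jdbc:hive2://$1:$2/$3"
  # Session variables follow the database name after a semicolon.
  if [ -n "${4:-}" ]; then url="${url};$4"; fi
  # Hive configuration variables follow a question mark.
  if [ -n "${5:-}" ]; then url="${url}?$5"; fi
  # Hive variables follow a hash sign.
  if [ -n "${6:-}" ]; then url="${url}#$6"; fi
  printf '%s\n' "$url"
}
```

For example, build_hive_url 192.168.172.143 10000 test "" "hive.cli.print.current.db=false" "tbl=stock_data" prints jdbc:hive2://192.168.172.143:10000/test?hive.cli.print.current.db=false#tbl=stock_data.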

Hive configuration variables and Hive variables are discussed later in this article. For now, let's connect to Hive.

beeline> !connect jdbc:hive2://192.168.172.143:10000/default scott tiger
scan complete in 2ms
Connecting to jdbc:hive2://192.168.172.143:10000/default
Connected to: Apache Hive (version 1.1.0-cdh5.4.2)
Driver: Hive JDBC (version 1.1.0-cdh5.4.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://192.168.172.143:10000/default>

At this stage, you are now ready to enter your hive commands interactively.

I mentioned earlier that you need TCP network connectivity to connect to HiveServer2 in remote mode. HiveServer2 also supports remote connections over HTTP, but your administrator has to start HiveServer2 in HTTP transport mode. At any given time, HiveServer2 accepts either TCP requests or HTTP requests. If your HiveServer2 is running in HTTP transport mode, you can use the command below to connect to Hive.

!connect jdbc:hive2://<host>:<port>/<db>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>

!connect jdbc:hive2://C15738:10001/default?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice

In HTTP transport mode, the default port is 10001 and the default endpoint is cliservice. At this stage, again, you are ready to enter your Hive commands interactively.

It is not necessary to start Beeline first and then use the !connect command. You can pass the connection parameters to beeline and get connected to Hive automatically on startup.

[cloudera@quickstart ~]$ beeline -u jdbc:hive2://192.168.172.143:10000/test -n scott -p tiger
scan complete in 3ms
Connecting to jdbc:hive2://192.168.172.143:10000/test
Connected to: Apache Hive (version 1.1.0-cdh5.4.2)
Driver: Hive JDBC (version 1.1.0-cdh5.4.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.4.2 by Apache Hive
0: jdbc:hive2://192.168.172.143:10000/test>

In the above example, -u takes the JDBC URL, -n takes the username and -p takes the password.
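One practical point when passing a URL with -u from a shell script: characters such as ; and ? are special to the shell, so a URL like the HTTP transport one shown earlier should be quoted. Here is a minimal sketch. The function names hive_http_url and connect_http are hypothetical helpers of mine, not Beeline features, and the endpoint defaults to cliservice as described above.

```shell
# Build the HTTP-transport JDBC URL used earlier in this article.
# $1 = host, $2 = port, $3 = database, $4 = HTTP endpoint (optional)
hive_http_url() {
  printf 'jdbc:hive2://%s:%s/%s?hive.server2.transport.mode=http;hive.server2.thrift.http.path=%s\n' \
    "$1" "$2" "$3" "${4:-cliservice}"
}

# Start beeline with the URL passed as ONE quoted argument.
# Without the double quotes, the shell would cut the command at ';'.
# $1 = host, $2 = port, $3 = database, $4 = user, $5 = password
connect_http() {
  beeline -u "$(hive_http_url "$1" "$2" "$3")" -n "$4" -p "$5"
}

# Usage (needs a HiveServer2 running in HTTP transport mode):
# connect_http C15738 10001 default scott tiger
```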

Interactively executing your DML/DDL statements

This is the most common requirement. Once you are connected to HiveServer2, running Hive queries interactively from the Beeline command line is very simple. Let us execute a simple select statement.

0: jdbc:hive2://> select stock_symbol,exchange_code,stock_date,stock_open from stock_data limit 5;
+---------------+----------------+-------------+-------------+--+
| stock_symbol  | exchange_code  | stock_date  | stock_open  |
+---------------+----------------+-------------+-------------+--+
| ABB           | NYSE           | 2015-03-31  | 21.15       |
| ABB           | NYSE           | 2015-03-30  | 21.41       |
| ABB           | NYSE           | 2015-03-27  | 21.290001   |
| ABB           | NYSE           | 2015-03-26  | 21.35       |
| ABB           | NYSE           | 2015-03-25  | 21.66       |
+---------------+----------------+-------------+-------------+--+
5 rows selected (0.233 seconds)
0: jdbc:hive2://>

Executing Single Hive query from beeline

You can use the -e option to execute a single Hive query, just as you did with the old Hive CLI. This option is useful for a quick query or when you want to run a query from a shell script.

[cloudera@quickstart ~]$ beeline -u jdbc:hive2://192.168.172.143:10000/test -n scott -p tiger -e "select stock_symbol,exchange_code,stock_date,stock_open from stock_data limit 5"
scan complete in 3ms
Connecting to jdbc:hive2://192.168.172.143:10000/test
Connected to: Apache Hive (version 1.1.0-cdh5.4.2)
Driver: Hive JDBC (version 1.1.0-cdh5.4.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+---------------+----------------+-------------+-------------+--+
| stock_symbol  | exchange_code  | stock_date  | stock_open  |
+---------------+----------------+-------------+-------------+--+
| ABB           | NYSE           | 2015-03-31  | 21.15       |
| ABB           | NYSE           | 2015-03-30  | 21.41       |
| ABB           | NYSE           | 2015-03-27  | 21.290001   |
| ABB           | NYSE           | 2015-03-26  | 21.35       |
| ABB           | NYSE           | 2015-03-25  | 21.66       |
+---------------+----------------+-------------+-------------+--+
5 rows selected (0.863 seconds)
Beeline version 1.1.0-cdh5.4.2 by Apache Hive
Closing: 0: jdbc:hive2://192.168.172.143:10000/test
[cloudera@quickstart ~]$
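For shell scripting, the table borders in the output above get in the way. A minimal sketch of a wrapper that prints only the result rows follows. The function name hive_query is hypothetical. The --silent, --showHeader and --outputformat=csv2 options are Beeline options in the Hive versions I am aware of, but check beeline --help for your version before relying on them.

```shell
# Run one query and print plain comma-separated rows (hypothetical helper).
# $1 = JDBC URL, $2 = user, $3 = password, $4 = HiveQL statement
hive_query() {
  beeline -u "$1" -n "$2" -p "$3" \
    --silent=true --showHeader=false --outputformat=csv2 \
    -e "$4"
}

# Usage (needs a reachable HiveServer2):
# symbols=$(hive_query "jdbc:hive2://192.168.172.143:10000/test" scott tiger \
#   "select distinct stock_symbol from stock_data")
```

Keeping the statement in a single quoted argument means the same function works for any query you pass in from the calling script.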

Executing your hive queries from files

The -f option is an extension of the previous one and is used when you have multiple HQL statements stored in a file. It is very common to combine this feature with single statement execution to build test scripts, deployment scripts and other automation for your project.

[cloudera@quickstart ~]$ beeline -u jdbc:hive2://192.168.172.143:10000/test -n scott -p tiger -f test.hql
scan complete in 3ms
Connecting to jdbc:hive2://192.168.172.143:10000/test
Connected to: Apache Hive (version 1.1.0-cdh5.4.2)
Driver: Hive JDBC (version 1.1.0-cdh5.4.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://192.168.172.143:10000/test> select stock_symbol,exchange_code,stock_date,stock_open from stock_data limit 5;
+---------------+----------------+-------------+-------------+--+
| stock_symbol  | exchange_code  | stock_date  | stock_open  |
+---------------+----------------+-------------+-------------+--+
| ABB           | NYSE           | 2015-03-31  | 21.15       |
| ABB           | NYSE           | 2015-03-30  | 21.41       |
| ABB           | NYSE           | 2015-03-27  | 21.290001   |
| ABB           | NYSE           | 2015-03-26  | 21.35       |
| ABB           | NYSE           | 2015-03-25  | 21.66       |
+---------------+----------------+-------------+-------------+--+
5 rows selected (0.198 seconds)
0: jdbc:hive2://192.168.172.143:10000/test>
Closing: 0: jdbc:hive2://192.168.172.143:10000/test
[cloudera@quickstart ~]$
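In a deployment script it helps to fail early when the script file is missing, before Beeline even opens a connection. A minimal sketch, assuming a hypothetical helper named run_hql_file (the name and the return code 2 are my own choices):

```shell
# Run an HQL script file through beeline (hypothetical helper).
# $1 = JDBC URL, $2 = user, $3 = password, $4 = path to the .hql file
run_hql_file() {
  # Refuse to start beeline if the script cannot be read.
  if [ ! -r "$4" ]; then
    echo "cannot read HQL script: $4" >&2
    return 2
  fi
  beeline -u "$1" -n "$2" -p "$3" -f "$4"
}

# Usage (needs a reachable HiveServer2 and an existing test.hql):
# run_hql_file "jdbc:hive2://192.168.172.143:10000/test" scott tiger test.hql
```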

Initializing your hive environment

In a typical Hive project, you will often need to set some Hive configuration properties to customize Hive's behavior for specific situations. A detailed list of Hive configuration properties is available in the Hive documentation. You can set a configuration property using the --hiveconf option of beeline. The example below shows how to set the hive.cli.print.current.db configuration property.

[cloudera@quickstart ~]$ beeline -u jdbc:hive2://192.168.172.143:10000/test -n scott -p tiger --hiveconf hive.cli.print.current.db=false
scan complete in 3ms
Connecting to jdbc:hive2://192.168.172.143:10000/test
Connected to: Apache Hive (version 1.1.0-cdh5.4.2)
Driver: Hive JDBC (version 1.1.0-cdh5.4.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.4.2 by Apache Hive
0: jdbc:hive2://192.168.172.143:10000/test>

Unfortunately, this configuration variable has no effect in Beeline, and the prompt still shows the database name as part of the connection URL. However, we can use the set <variable_name> command to check the current value of a configuration variable.

0: jdbc:hive2://192.168.172.143:10000/test> set hiveconf:hive.cli.print.current.db;
+-------------------------------------------+--+
|                    set                    |
+-------------------------------------------+--+
| hiveconf:hive.cli.print.current.db=false  |
+-------------------------------------------+--+
1 row selected (0.2 seconds)

You can also use the set command to change a configuration variable while you are in interactive mode.

0: jdbc:hive2://192.168.172.143:10000/test> set hiveconf:hive.cli.print.current.db=true;
No rows affected (0.012 seconds)

If you check the variable again, its value should now be true.
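When a script needs several configuration properties, the --hiveconf option can be repeated once per property. A minimal sketch follows, assuming the usual --hiveconf property=value form. The function name beeline_with_conf is a hypothetical helper, not part of Beeline.

```shell
# Start beeline with any number of Hive configuration properties.
# $1 = JDBC URL, $2 = user, $3 = password, remaining args = property=value
beeline_with_conf() {
  url=$1
  user=$2
  pass=$3
  shift 3
  # Rewrite the argument list in place: for each property=value, append
  # "--hiveconf property=value" at the end and drop the original from the front.
  for kv in "$@"; do
    set -- "$@" --hiveconf "$kv"
    shift
  done
  beeline -u "$url" -n "$user" -p "$pass" "$@"
}

# Usage (needs a reachable HiveServer2):
# beeline_with_conf "jdbc:hive2://192.168.172.143:10000/test" scott tiger \
#   hive.cli.print.current.db=false hive.exec.parallel=true
```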

Setting variables and parameters in hive queries

When you automate something, you will need parameters in your queries. In Hive you can define variables, set their values and then use these variables in your queries. The examples below show how powerful this feature is.

0: jdbc:hive2://> set hivevar:sopen=21.15;
No rows affected (0.01 seconds)
0: jdbc:hive2://> select stock_symbol,exchange_code,stock_date,stock_open from stock_data where stock_open=${hivevar:sopen};
+---------------+----------------+-------------+-------------+--+
| stock_symbol  | exchange_code  | stock_date  | stock_open  |
+---------------+----------------+-------------+-------------+--+
| ABB           | NYSE           | 2015-03-31  | 21.15       |
| ABB           | NYSE           | 2015-01-02  | 21.15       |
| ABB           | NYSE           | 2012-01-31  | 21.15       |
+---------------+----------------+-------------+-------------+--+
3 rows selected (0.273 seconds)

The above example shows that you can define variables and use them in the where clause of your query. Let us take another example.

0: jdbc:hive2://192.168.172.143:10000/test> set hivevar:tbl=stock_data;
No rows affected (0.02 seconds)
0: jdbc:hive2://192.168.172.143:10000/test> select stock_symbol,exchange_code,stock_date,stock_open from ${hivevar:tbl} limit 3;
+---------------+----------------+-------------+-------------+--+
| stock_symbol  | exchange_code  | stock_date  | stock_open  |
+---------------+----------------+-------------+-------------+--+
| ABB           | NYSE           | 2015-03-31  | 21.15       |
| ABB           | NYSE           | 2015-03-30  | 21.41       |
| ABB           | NYSE           | 2015-03-27  | 21.290001   |
+---------------+----------------+-------------+-------------+--+
3 rows selected (0.188 seconds)
0: jdbc:hive2://192.168.172.143:10000/test>

As you can see, a variable can be used in the from clause as well. It also works for column names.

0: jdbc:hive2://> set hivevar:col2=exchange_code;
No rows affected (0.013 seconds)
0: jdbc:hive2://> select stock_symbol,${hivevar:col2},stock_date,stock_open from stock_data limit 3;
+---------------+----------------+-------------+-------------+--+
| stock_symbol  | exchange_code  | stock_date  | stock_open  |
+---------------+----------------+-------------+-------------+--+
| ABB           | NYSE           | 2015-03-31  | 21.15       |
| ABB           | NYSE           | 2015-03-30  | 21.41       |
| ABB           | NYSE           | 2015-03-27  | 21.290001   |
+---------------+----------------+-------------+-------------+--+
3 rows selected (0.143 seconds)

You can use variables in the where clause, as table names and as column names. Because the substitution is textual, it takes you very close to something like dynamic HQL.
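Since a substituted value becomes part of the query text, a script that takes a table name from outside should check it before handing it to Hive. A minimal sketch, assuming the --hivevar name=value option of beeline. The helper names is_safe_identifier and query_table are hypothetical, and the rule "letters, digits and underscores only" is my own conservative choice.

```shell
# Succeed only for names made of letters, digits and underscores.
is_safe_identifier() {
  case $1 in
    ''|*[!A-Za-z0-9_]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Run a parameterized script with the table name passed as a Hive variable.
# $1 = JDBC URL, $2 = user, $3 = password, $4 = table name, $5 = .hql file
query_table() {
  if ! is_safe_identifier "$4"; then
    echo "refusing unsafe table name: $4" >&2
    return 2
  fi
  beeline -u "$1" -n "$2" -p "$3" --hivevar "tbl=$4" -f "$5"
}

# Usage (needs a reachable HiveServer2; query.hql is assumed to contain
# a statement such as: select stock_symbol from ${hivevar:tbl} limit 3;):
# query_table "jdbc:hive2://192.168.172.143:10000/test" scott tiger stock_data query.hql
```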

Some other things in Beeline Hive CLI

Exiting from Beeline --> use the !q or !quit command
Command history --> use the up and down arrow keys
Autocomplete --> press the Tab key
Executing shell commands from the CLI --> use !sh <your_shell_command>
Show help --> use !help in interactive mode, or run beeline -h
Executing commands from a file --> use !run <file_name>
More Beeline commands --> see the Beeline documentation

If you have good shell scripting skills, you can combine all of the Beeline features described so far to build highly flexible, customized automation scripts.