hive principle

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL query capabilities by converting SQL statements into MapReduce jobs. Its main advantage is a low learning cost: simple MapReduce statistics can be written quickly as SQL-like statements, with no need to develop dedicated MapReduce applications. This makes it well suited to statistical analysis in data warehouses.
Hive is a data warehouse infrastructure built on Hadoop. It provides a set of tools for extract, transform, load (ETL) work and a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Hive defines a simple SQL-like query language called HQL, which lets users familiar with SQL query the data. The language also lets developers familiar with MapReduce plug in custom mappers and reducers to handle complex analysis that the built-in ones cannot.
Because Hive uses the SQL-like query language HQL, it is easy to mistake Hive for a database. In fact, from a structural point of view, Hive and a database have almost nothing in common apart from a similar query language. Databases can serve online applications, whereas Hive is designed for data warehousing. Keeping this in mind helps in understanding Hive's characteristics from an application perspective.
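To make the "SQL statement becomes a MapReduce job" idea concrete, here is a minimal conceptual sketch in Python of how a query such as select name, count(*) from test group by name could be broken into map, shuffle and reduce steps. It is not Hive's actual execution plan; the function names and the sample rows are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of a table test(name, age); the data is invented for illustration.
rows = [("zs", 30), ("ls", 25), ("zs", 30), ("ww", 41)]

def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for name, _age in rows:
        yield name, 1

def shuffle(pairs):
    """Collect the emitted values per key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values of each key, which yields COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

The point of the sketch is only that the GROUP BY column becomes the shuffle key and the aggregate becomes the reduce function; Hive generates the equivalent job from the HQL text.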

beeline configuration

cd /opt/soft/hive110/conf/
vim hive-site.xml
hive's hive-site.xml configuration file

<property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
    <description>
      Expects one of [nosasl, none, ldap, kerberos, pam, custom].
      Client authentication types.
        NONE: no authentication check
        LDAP: LDAP/AD based authentication
        KERBEROS: Kerberos/GSSAPI authentication
        CUSTOM: Custom authentication provider
                (Use with property hive.server2.custom.authentication.class)
        PAM: Pluggable authentication module
        NOSASL:  Raw transport
    </description>
  </property>

  <property>
    <name>hive.server2.thrift.client.user</name>
    <value>root</value>
    <description>Username to use against thrift client</description>
  </property>
  <property>
    <name>hive.server2.thrift.client.password</name>
    <value>root</value>
    <description>Password to use against thrift client</description>
  </property>

core-site.xml configuration of hadoop

<configuration>
 <!--Specify the address of namenode-->
    <property>
                <name>fs.defaultFS</name>
                <value>hdfs://192.168.11.207:9000</value>
    </property>
    <!--Used to specify the storage directory for files generated when using hadoop -->
    <property>
             <name>hadoop.tmp.dir</name>
             <!--<value>file:/usr/local/kafka/hadoop-2.7.6/tmp</value>-->
             <value>file:/home/hadoop/temp</value>
    </property>
        <!--Used to set the maximum time for checkpoint backup logs-->
    <!--    <name>fs.checkpoint.period</name> 
        <value>3600</value>
-->
<!-- Proxy-user settings: let the Hadoop user root act on behalf of other users -->
    <property>
        <!-- Groups whose members root may impersonate; * means any group -->
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
    <property>
        <!-- Hosts from which root may impersonate other users; * means any host -->
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>

</configuration>
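With the settings above in place, a client connects to HiveServer2 through beeline. The following is a small sketch that assembles the beeline command line in Python. The host localhost, the database default and the helper name beeline_command are assumptions for illustration; 10000 is HiveServer2's default Thrift port, and root/root are the client credentials from hive-site.xml above.

```python
import shlex

def beeline_command(host, port=10000, database="default", user="root", password="root"):
    """Build the argument list for a beeline connection to HiveServer2."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    # -u is the JDBC URL, -n the user name, -p the password
    return ["beeline", "-u", url, "-n", user, "-p", password]

# "localhost" is a placeholder; use the machine where hiveserver2 is running.
cmd = beeline_command("localhost")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run once hiveserver2 is running.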

sql deduplication

1 distinct

select distinct name,age from test;

2 group by

select name,age from test
group by name,age;

3 Pseudo column deduplication

select id,name,age from test t1
where t1.rowid in (select min(rowid) from test t2 where t1.name=t2.name and t1.age=t2.age);

4 window function over (partition by)

select t.id, t.name, t.age from
(select row_number() over(partition by name,age order by id) rn, test.* from test) t
where t.rn = 1;

important point

The above are four common deduplication methods in Oracle. The first is not recommended because it becomes very slow when the amount of data is very large. MySQL versions before 8.0 do not support window functions, so group by is the recommended way to remove duplicates there; from 8.0 onward, window functions can be used freely.
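The four methods can be compared side by side with a small, self-contained sketch. It uses Python's built-in sqlite3 module rather than Oracle or MySQL, only because SQLite also has a rowid pseudo column and, from SQLite 3.25, window functions; the table contents are invented and the queries select only name and age so that the results are directly comparable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table test (id integer primary key, name text, age integer)")
# Invented sample data containing duplicate (name, age) pairs.
conn.executemany(
    "insert into test (id, name, age) values (?, ?, ?)",
    [(1, "zs", 30), (2, "zs", 30), (3, "ls", 25), (4, "ls", 25), (5, "ww", 41)],
)

queries = {
    "distinct": "select distinct name, age from test",
    "group by": "select name, age from test group by name, age",
    "rowid": """
        select name, age from test t1
        where t1.rowid in (select min(t2.rowid) from test t2
                           where t1.name = t2.name and t1.age = t2.age)""",
    "row_number": """
        select name, age from (
            select row_number() over (partition by name, age order by id) as rn, test.*
            from test) t
        where t.rn = 1""",
}

# Sort each result so the comparison does not depend on row order.
results = {label: sorted(conn.execute(sql).fetchall()) for label, sql in queries.items()}
for label, rows in results.items():
    print(label, rows)
```

All four queries return the same three (name, age) pairs here; the rowid and row_number variants additionally let you keep the other columns (such as id) of the surviving row.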

When to use window functions: when a value has to be computed for every record within its group, a window function is the better choice. Note: for every record! An ordinary aggregate query in MySQL (an aggregate function combined with group by) returns only one record per group.
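The difference between "one record per group" and "one value for every record" is easy to see in a short sketch. It again uses Python's sqlite3 module (window functions need SQLite 3.25 or later) with invented data; the same two queries should behave the same way in MySQL 8.0 and later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table test (id integer, name text, age integer)")
# Invented sample data: two rows named zs, one row named ls.
conn.executemany(
    "insert into test values (?, ?, ?)",
    [(1, "zs", 30), (2, "zs", 35), (3, "ls", 25)],
)

# Aggregate with group by: the three rows collapse to one row per name.
grouped = conn.execute(
    "select name, count(*) from test group by name order by name"
).fetchall()

# Window function: every row is kept and carries the count of its group.
windowed = conn.execute(
    "select id, name, count(*) over (partition by name) from test order by id"
).fetchall()

print(grouped)
print(windowed)
```

The grouped query returns two rows for three input rows, while the windowed query returns all three rows, each annotated with the size of its partition.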

hive installation

cd /opt/jar

tar -zxf hive-1.1.0-cdh5.14.2.tar.gz 

mv hive-1.1.0-cdh5.14.2 /opt/soft/hive110

cd /opt/soft/hive110/conf

vim hive-site.xml #Add the following code
====================================
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/hive/warehouse</value>
</property>
<property>
<name>hive.metastore.local</name>
<value>false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.64.210:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>3090_Cmok</value>
</property>

<property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>hive.server2.thrift.client.user</name>
    <value>root</value>
  </property>
  <property>
    <name>hive.server2.thrift.client.password</name>
    <value>root</value>
  </property>
</configuration>

<!-- MySQL password: replace 3090_Cmok in <value>3090_Cmok</value> with your own -->
<!-- With hive.server2.authentication set to NONE, the root/root client credentials are not checked, which makes connecting easy -->
<!-- If MySQL runs on a remote host, put that host's IP or hostname in the connection URL -->
====================================

2. Configure Hadoop's core-site.xml (note: exactly 5 properties in total, no more, no less!)

   cd /opt/soft/hadoop260/etc/hadoop/
   
   vim core-site.xml


    <property>
                <name>fs.defaultFS</name>
                <value>hdfs://192.168.64.210:9000</value>
    </property>
    <property>
             <name>hadoop.tmp.dir</name>           
             <value>file:/home/hadoop/temp</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
       <property>
                <name>hadoop.proxyuser.root.users</name>
                <value>*</value>
        </property>
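The "exactly 5 properties" rule above can be checked mechanically. The sketch below parses a core-site.xml text with Python's standard xml.etree module and reports any of the five names that are missing. The helper names read_properties and missing_properties are made up for illustration, and the embedded sample simply repeats the values listed above inside a configuration element.

```python
import xml.etree.ElementTree as ET

# The five property names listed in this step.
REQUIRED = [
    "fs.defaultFS",
    "hadoop.tmp.dir",
    "hadoop.proxyuser.root.groups",
    "hadoop.proxyuser.root.hosts",
    "hadoop.proxyuser.root.users",
]

# Sample text mirroring the fragment above; a real check would read the file instead.
SAMPLE = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://192.168.64.210:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>file:/home/hadoop/temp</value></property>
  <property><name>hadoop.proxyuser.root.groups</name><value>*</value></property>
  <property><name>hadoop.proxyuser.root.hosts</name><value>*</value></property>
  <property><name>hadoop.proxyuser.root.users</name><value>*</value></property>
</configuration>"""

def read_properties(xml_text):
    """Return a {name: value} dict for every <property> element."""
    root = ET.fromstring(xml_text)
    return {
        (p.findtext("name") or "").strip(): (p.findtext("value") or "").strip()
        for p in root.iter("property")
    }

def missing_properties(xml_text, required=REQUIRED):
    """List the required property names that are absent from the XML text."""
    props = read_properties(xml_text)
    return [name for name in required if name not in props]

props = read_properties(SAMPLE)
print(len(props), missing_properties(SAMPLE))
```

To run it against the real file, pass the contents of /opt/soft/hadoop260/etc/hadoop/core-site.xml to the two functions instead of SAMPLE.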

3. Copy the MySQL JDBC driver jar into /opt/soft/hive110/lib

4. Configure environment variables

vim /etc/profile

#Hive
export HIVE_HOME=/opt/soft/hive110
export PATH=$PATH:$HIVE_HOME/bin

:wq

source /etc/profile

5. Initialize the database

schematool -dbType mysql -initSchema

6. Start

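zkServer.sh start
start-all.sh

# Each of the next two commands runs in the foreground; start each in its own terminal (or append & to run it in the background)
hive --service metastore

hive --service hiveserver2

# Hive CLI
hive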

7. HQL

show databases;
create database mydemo;
use mydemo;
create table userinfos(userid int,username string,birthday string);
insert into userinfos values(1,'zs',30);
select * from userinfos;

# From the shell: the inserted row is stored as a file under the warehouse directory
hdfs dfs -cat /hive/warehouse/mydemo.db/userinfos/000000_0

Zeppelin installation

tar -zxf zeppelin-0.8.1-bin-all.tgz -C /opt/soft/


cd /opt/soft/

ls

mv zeppelin-0.8.1-bin-all/ zeppelin081

ls

cd /opt/soft/zeppelin081/conf/

ls

cp zeppelin-site.xml.template zeppelin-site.xml

vim zeppelin-site.xml
==============================
<property>
  <name>zeppelin.helium.registry</name>
  <value>helium</value>
</property>
==============================

cp zeppelin-env.sh.template zeppelin-env.sh

vim zeppelin-env.sh
==============================
export JAVA_HOME=/opt/soft/jdk180
export HADOOP_CONF_DIR=/opt/soft/hadoop260/etc/hadoop
==============================

cp /opt/soft/hive110/conf/hive-site.xml /opt/soft/zeppelin081/conf/

1. Copy files

cp /opt/soft/hadoop260/share/hadoop/common/hadoop-common-2.6.0-cdh5.14.2.jar /opt/soft/zeppelin081/interpreter/jdbc/

cp /opt/soft/hive110/lib/hive-jdbc-1.1.0-cdh5.14.2-standalone.jar /opt/soft/zeppelin081/interpreter/jdbc/

2. Configure environment variables

vim /etc/profile
#Zeppelin
export ZEPPELIN_HOME=/opt/soft/zeppelin081
export PATH=$PATH:$ZEPPELIN_HOME/bin

:wq

source /etc/profile

3. Start

cd /opt/soft/zeppelin081/bin/

./zeppelin-daemon.sh start

http://192.168.64.210:8080/ # open this address in a browser to reach Zeppelin

#http://192.168.64.210:50070/ # Hadoop NameNode web UI

cd /opt/
mkdir dir
rm -rf dir
mkdir data

%hive
create external table mydemo.train(
    userid string,
    eventid string,
    invited string,
    partytime string,
    interested string,
    not_interested string
)
row format delimited fields terminated by ','
location '/temp'
TBLPROPERTIES("skip.header.line.count"="1")
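The table definition above tells Hive two things about the files under /temp: fields are separated by commas, and the first line of each file is a header to skip. A minimal Python sketch of that interpretation follows; the file content is invented for illustration and the helper name read_rows is hypothetical.

```python
# Mirrors "skip.header.line.count"="1" and fields terminated by ','.
HEADER_LINES_TO_SKIP = 1
FIELD_DELIMITER = ","
COLUMNS = ["userid", "eventid", "invited", "partytime", "interested", "not_interested"]

# Hypothetical file content; the header names and values are made up.
sample_file = """user,event,invited,timestamp,interested,not_interested
1001,2001,0,2012-10-02 15:53:05,0,0
1002,2002,1,2012-10-03 08:10:00,1,0
"""

def read_rows(text):
    """Skip the header line(s), then split each remaining line into named columns."""
    lines = text.splitlines()[HEADER_LINES_TO_SKIP:]
    return [dict(zip(COLUMNS, line.split(FIELD_DELIMITER))) for line in lines if line]

rows = read_rows(sample_file)
for row in rows:
    print(row)
```

Because every column is declared as string, the values stay as text, just as they do in the dictionaries here.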

1 | zhangsan | male,40 | java,hadoop,spring | fid1:wangwu,fid2:zhangsanfeng
2 | lisi | female,30 | linux,centos | fid1:xiaolifeidao

# Static partition loading: the user must specify the partition explicitly
#hive
load data local inpath

Hive installation & 4 ways to deduplicate in SQL & Zeppelin installation