Hadoop Ecosystem: Hive

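Hive

Introduction to Hive
How Hive Works
Setting Up the Hive Environment
Basic Hive Operations
Replacing the MetaStore Database
Basic Hive Syntax

1. HQL

2. Table Operations

1) Managed tables (MANAGED_TABLE)
2) External tables
3) Partitioned tables [query optimization]
4) Bucketed tables
5) Temporary tables

3. Importing Data

1). Basic import
2). Importing data with the as keyword
3). Importing data with insert
4). Importing data from HDFS
5). Overwriting data during import
6). Uploading files with the HDFS client

4. Exporting Data

1). sqoop
2). Exporting with insert
3). Downloading files with the HDFS client
4). Using a command-line script

5. Hive's Built-in Import and Export Tools
6. MapReduce-Related Configuration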

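Introduction to Hive

What is Hive

    Hive was open-sourced by Facebook and donated to the Apache Software Foundation, where it is a top-level project: hive.apache.org
    Hive is a data warehouse (DataWareHouse) technology built on top of big data infrastructure.
        Database (DataBase)
               small data volume, high value per record
        Data warehouse (DataWareHouse)
               large data volume, low value per record
    Underneath, it relies on HDFS for storage and MapReduce for computation.

Benefits of Hive

With Hive, programmers write SQL statements and Hive translates them into MapReduce jobs, which spares them from writing MapReduce code by hand.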

How Hive Works

Hive compiles most Hive SQL statements into MapReduce jobs and runs those jobs to process the data.
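
To make the translation concrete, here is a minimal Python sketch of how a query such as select age, max(salary) from t_users group by age could be expressed as map, shuffle and reduce steps. It only models the idea; it is not Hive's actual execution plan, and the sample rows and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a t_users table: (name, age, salary).
rows = [("a", 20, 100), ("b", 20, 300), ("c", 30, 800), ("d", 30, 500)]

def map_phase(records):
    # Map: emit the grouping key (age) together with the value to aggregate (salary).
    for _name, age, salary in records:
        yield age, salary

def shuffle(pairs):
    # Shuffle: gather every value that shares a key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: apply the aggregate function (max) to each group.
    return {key: max(values) for key, values in groups.items()}

result = reduce_phase(shuffle(map_phase(rows)))
print(result)
```

A query without aggregation or sorting needs only the map step, which is why simple filtering jobs have no Reduce phase.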

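Setting Up the Hive Environment

1. Linux server: IP address, hostname mapping, firewall disabled, SELinux disabled, passwordless SSH login, JDK
2. Set up the Hadoop environment
3. Install Hive
   3.1 Unpack the Hive archive
   3.2 Rename hive_home/conf/hive-env.sh.template to hive-env.sh and set:
       HADOOP_HOME=/opt/install/hadoop-2.5.2
       export HIVE_CONF_DIR=/opt/install/apache-hive-0.13.1-bin/conf
   3.3 Create two directories in HDFS
       /tmp
       /user/hive/warehouse
       bin/hdfs dfs -mkdir /tmp
       bin/hdfs dfs -mkdir -p /user/hive/warehouse
   3.4 Start Hive
       bin/hive
   3.5 Check the running process with jps
       RunJar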

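Basic Hive Operations

# Create a database
create database [if not exists] test;
# List all databases
 show databases;
# Switch to a database
 use db_name;
# Drop an empty database
 drop database db_name;
# Drop a database together with the tables it contains
 drop database db_name cascade;
# What a database really is
 A Hive database is an HDFS directory: /user/hive/warehouse/test.db

# List all tables in the current database
  show tables;
# Create a table
  create table t_user(
    id int ,
    name string
   )row format delimited fields terminated by '\t';
# What a table really is
  A Hive table is an HDFS directory: /user/hive/warehouse/test.db/t_user
# Drop a table
  drop table t_user;

# Load data into a Hive table
  load data local inpath '/root/hive/data' into table t_user;
# What loading data really does
  load data local inpath '/root/hive/data' into table t_user;
  1. Loading data is essentially an HDFS file upload:
  bin/hdfs dfs -put /root/hive/data /user/hive/warehouse/test.db/t_user
  2. If a file with the same name is loaded again, Hive renames the new file automatically instead of overwriting the old one.
  3. When a table is queried, Hive reads every file in the table's directory and returns their combined contents.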
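
The three points above can be modelled with an ordinary local directory. The sketch below is only an analogy in Python: the local filesystem stands in for HDFS, the "_copy_N" suffix used on a name clash is an assumption made for this sketch, and the sample rows are hypothetical.

```python
import os
import shutil
import tempfile

def load_into_table(src_file, table_dir):
    # "Loading" is a file copy into the table's directory. On a name clash a new
    # name is chosen (the "_copy_N" suffix is an assumption of this sketch).
    os.makedirs(table_dir, exist_ok=True)
    base = os.path.basename(src_file)
    target = os.path.join(table_dir, base)
    n = 0
    while os.path.exists(target):
        n += 1
        target = os.path.join(table_dir, f"{base}_copy_{n}")
    shutil.copy(src_file, target)
    return target

def select_all(table_dir):
    # A query reads every file in the table directory and splits each line on tabs.
    rows = []
    for name in sorted(os.listdir(table_dir)):
        with open(os.path.join(table_dir, name), encoding="utf-8") as f:
            rows.extend(line.rstrip("\n").split("\t") for line in f if line.strip())
    return rows

workdir = tempfile.mkdtemp()
data_file = os.path.join(workdir, "data")
with open(data_file, "w", encoding="utf-8") as f:
    f.write("1\tzhangsan\n2\tlisi\n")  # hypothetical sample rows

table_dir = os.path.join(workdir, "warehouse", "test.db", "t_user")
first = load_into_table(data_file, table_dir)
second = load_into_table(data_file, table_dir)  # same file name loaded twice
rows = select_all(table_dir)
print(sorted(os.listdir(table_dir)), len(rows))
```

Loading the same file twice leaves two files in the directory, so the query returns every row twice.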
  
  
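# SQL (SQL-like: HQL, the Hive Query Language)
select * from t_user;
select id from t_user;
1. Hive translates SQL into MapReduce (a job that only cleans or filters data has no Reduce phase).
2. Hive runs MapReduce for the vast majority of queries, but a plain select * (with or without limit) is served without MapReduce.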

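Replacing the MetaStore Database

The Hive MetaStore records the mapping between HDFS structures (directories and files) and table structures. By default, however, the MetaStore uses an embedded Derby database, which supports only one client at a time.

Replacing Derby with MySQL (or Oracle) as the Hive metadata database

0. Delete the HDFS directory /user/hive/warehouse and create it again
1. Install MySQL on Linux
   yum -y install mysql-server
2. Start the MySQL service and set the administrator password
   service mysqld start
   /usr/bin/mysqladmin -u root password '123456'
3. Enable remote access to MySQL (run the SQL statements in the mysql client, then restart the service)
   GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456';
   flush privileges;
   use mysql;
   delete from user where host like 'hadoop%';
   delete from user where host like 'l%';
   delete from user where host like '1%';
   service mysqld restart
4. Create conf/hive-site.xml
   mv hive-default.xml.template hive-site.xml
   Set the following properties in hive-site.xml (inside the <configuration> element):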
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:mysql://CentOSA:3306/metastore?createDatabaseIfNotExist=true</value>
     <description>the URL of the MySQL database</description>
   </property>

   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>com.mysql.jdbc.Driver</value>
     <description>Driver class name for a JDBC metastore</description>
   </property>

   <property>
     <name>javax.jdo.option.ConnectionUserName</name>
     <value>root</value>
   </property>

   <property>
     <name>javax.jdo.option.ConnectionPassword</name>
     <value>123456</value>
   </property>
5. Copy the MySQL JDBC driver jar into hive_home/lib
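
A typo in a property name is an easy mistake to make in this step. As a hedged illustration, the Python sketch below parses a hive-site.xml fragment and reports which of the four metastore connection properties are missing. The helper names are hypothetical, and the XML string is a shortened copy of the settings above wrapped in the <configuration> root element that the file requires.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the settings above, wrapped in the <configuration> root element.
HIVE_SITE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://CentOSA:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
</configuration>"""

REQUIRED = [
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionDriverName",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword",
]

def read_properties(xml_text):
    # Collect every <property> element as a name -> value mapping.
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

def missing_properties(props):
    # Report the required metastore settings that are absent or empty.
    return [name for name in REQUIRED if not props.get(name)]

props = read_properties(HIVE_SITE)
print(missing_properties(props))
```

To check a real installation, the same functions could be applied to the text read from conf/hive-site.xml.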

Basic Hive Syntax

1. HQL

1. Basic queries
   -- does not start MR
   select * from table_name;
   -- starts MR
   select id from table_name;
2. Conditional queries: where
   select id,name from t_users where name = 'mask1';
   2.1 Comparison operators  =  !=  >  <  >=  <=
       select id,name from t_users where age > 20;
   2.2 Logical operators  and  or  not
       select id,name,age from t_users where name = 'mask' or age>30;
   2.3 Predicates
       between and
       select name,salary from t_users where salary between 100 and 300;
       in
       select name,salary from t_users where salary in (100,300);
       is null
       select name,salary from t_users where salary is null;
       like
       select name,salary from t_users where name like 'mask%';
       select name,salary from t_users where name like 'mask__';
       select name,salary from t_users where name like 'mask%' and length(name) = 6;
3. Sorting: order by [implemented on top of the MapReduce sort: map-side sort, group sort, compareTo]
   select name,salary from t_users order by salary desc;
4. Removing duplicates: distinct
   select distinct(age) from t_users;
5. Paging [MySQL lets you specify the starting row of a page; the Hive release used here (0.13.1) does not]
   select * from t_users limit 3;
6. Aggregate (group) functions  count() avg() max() min() sum()
   count(*) vs count(id): count(*) counts every row, while count(id) counts only the rows where id is not null
7. group by
   select max(salary) from t_users group by age;
   Rule: the select list may contain only the grouping columns and aggregate functions (Oracle raises an error; MySQL does not raise an error, but the result is wrong)
8. having
   Conditions on aggregate functions after grouping go in having
   select max(salary) from t_users group by age having max(salary) > 800;
9. Hive has only limited support for subqueries: they are accepted in the from clause, and from Hive 0.13 also in where with in / exists, subject to restrictions
10. Hive built-in functions
    show functions;

    length(column_name)  returns the length of the string in the column
    substring(column_name,start_pos,total_count)
    concat(col1,col2)
    to_date('yyyy-MM-dd')
    year(date)  returns the year
    month(date)
    date_add
    ....
    select year(to_date('1999-10-11')) ;
11. Multi-table operations
    inner join
    select e.name,e.salary,d.dname
    from t_emp as e
    inner join t_dept as d
    on e.dept_id = d.id;

    left join
    select e.name,e.salary,d.dname
    from t_emp as e
    left join t_dept as d
    on e.dept_id = d.id;

    right join
    select e.name,e.salary,d.dname
    from t_emp as e
    right join t_dept as d
    on e.dept_id = d.id;

    full join [not supported by MySQL]
    select e.name,e.salary,d.dname
    from t_emp as e
    full join t_dept as d
    on e.dept_id = d.id;
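
The four join types differ only in what happens to rows that find no match. The Python sketch below models that difference on two small in-memory lists; the sample rows and the join function are hypothetical, and the row order of a real Hive result may differ.

```python
# Hypothetical rows: employees as (name, salary, dept_id), departments as (id, dname).
emps = [("tom", 100, 1), ("ann", 300, 2), ("bob", 200, None)]
depts = [(1, "dev"), (3, "ops")]

def join(emps, depts, how):
    # Returns (name, salary, dname) tuples for how in: inner, left, right, full.
    out = []
    matched_depts = set()
    for name, salary, dept_id in emps:
        # A null key never matches anything, as in SQL.
        hits = [d for d in depts if dept_id is not None and d[0] == dept_id]
        for d in hits:
            matched_depts.add(d[0])
            out.append((name, salary, d[1]))
        # left/full: keep employees without a department, padded with None.
        if not hits and how in ("left", "full"):
            out.append((name, salary, None))
    # right/full: keep departments without employees, padded with None.
    if how in ("right", "full"):
        for d in depts:
            if d[0] not in matched_depts:
                out.append((None, None, d[1]))
    return out

for how in ("inner", "left", "right", "full"):
    print(how, join(emps, depts, how))
```

Only tom has a matching department, so the inner join returns one row, while the full join returns one row per employee plus the unmatched ops department.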

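2. Table Operations

1) Managed tables (MANAGED_TABLE)

1. Creating a basic managed table
create table if not exists table_name(
column_name data_type,
column_name data_type
)row format delimited fields terminated by '\t' [location 'hdfs_path'];

2. Creating a managed table with the as keyword
create table if not exists table_name [location 'hdfs_path'] as select id,name from t_users;
The table structure is determined by the selected columns, and the query result is inserted into the new table.

3. Creating a managed table with the like keyword
create table if not exists table_name like t_users [location 'hdfs_path'];
The table structure matches the table named after like, but no data is copied, so the new table is empty.

Details

1. Data types: int string varchar char double float boolean
2. location hdfs_path
   Sets where the table is created; the default is /user/hive/warehouse/db_name.db/table_name
   create table t_mask(
   id int,
   name string
   )row format delimited fields terminated by '\t' stored as textfile location '/xiaohei';
   Takeaway: in practice the HDFS directory and files usually exist first, and the table is then created on top of them.
3. Commands for viewing a Hive table's structure
   desc table_name        describe table_name
   desc extended table_name
   desc formatted table_name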

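2) External tables

1. Basic
create external table if not exists table_name(
id int,
name string
) row format delimited fields terminated by '\t' stored as textfile [location 'hdfs_path'];
2. as
Hive does not allow create external table ... as select (CTAS cannot target an external table). Create the external table first, then fill it with insert ... select.
3. like
create external table if not exists table_name like t_users [location 'hdfs_path'];

4. Difference between managed and external tables
drop table t_users_as; Dropping a managed table deletes its metastore entry together with its HDFS directory and data files.
drop table t_user_ex;  Dropping an external table deletes only its metastore entry; the HDFS directory and data files are left in place.
5. Differences in how external and managed tables are used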
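
The drop behaviour described in item 4 can be modelled in a few lines of Python. This is a toy sketch, not Hive code: a dictionary plays the role of the metastore, local directories stand in for HDFS, and the function names are hypothetical.

```python
import os
import shutil
import tempfile

# Toy metastore: table name -> location and table type.
metastore = {}

def create_table(name, location, external=False):
    os.makedirs(location, exist_ok=True)
    metastore[name] = {"location": location, "external": external}

def drop_table(name):
    # The metadata entry is always removed.
    info = metastore.pop(name)
    # Only a managed table takes its directory and data files with it.
    if not info["external"]:
        shutil.rmtree(info["location"])

root = tempfile.mkdtemp()
managed_dir = os.path.join(root, "t_users_as")
external_dir = os.path.join(root, "t_user_ex")
create_table("t_users_as", managed_dir)
create_table("t_user_ex", external_dir, external=True)
for d in (managed_dir, external_dir):
    with open(os.path.join(d, "data"), "w", encoding="utf-8") as f:
        f.write("1\tmask\n")

drop_table("t_users_as")
drop_table("t_user_ex")
print(os.path.exists(managed_dir), os.path.exists(external_dir), metastore)
```

After both drops the managed directory is gone while the external one, including its data file, is still on disk. This is why external tables suit data that other tools also read.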

3) Partitioned tables [query optimization]

Partitioned tables exist to make conditional queries more efficient.

create table t_user_part(
id int,
name string,
age int,
salary int)partitioned by (data string) row format delimited fields terminated by '\t';

load data local inpath '/root/data15' into table t_user_part partition (data='15');
load data local inpath '/root/data16' into table t_user_part partition (data='16');

-- scans the data of every partition (the whole table)
select * from t_user_part;

-- reads only the files of partition data='15'
select id from t_user_part where data='15' and age>20;
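
The reason a partition filter speeds up a query is that each partition is stored in its own subdirectory named column=value, so a filter on the partition column lets whole directories be skipped. The Python sketch below models this layout on the local filesystem; the function names, the part-0 file name and the sample rows are hypothetical.

```python
import os
import tempfile

def partition_dir(table_dir, column, value):
    # Each partition value gets its own subdirectory named column=value.
    return os.path.join(table_dir, f"{column}={value}")

def load_partition(table_dir, column, value, lines):
    # Write the rows into a file under the partition's directory.
    path = partition_dir(table_dir, column, value)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "part-0"), "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

def scan(table_dir, column, wanted=None):
    # With a partition filter only the matching directory is read (pruning);
    # without one, every partition directory is read.
    rows, dirs_read = [], 0
    for name in sorted(os.listdir(table_dir)):
        col, _, value = name.partition("=")
        if col != column or (wanted is not None and value != wanted):
            continue
        dirs_read += 1
        with open(os.path.join(table_dir, name, "part-0"), encoding="utf-8") as f:
            rows.extend(line.rstrip("\n").split("\t") for line in f)
    return rows, dirs_read

table_dir = os.path.join(tempfile.mkdtemp(), "t_user_part")
# Hypothetical rows: id, name, age, salary.
load_partition(table_dir, "data", "15", ["1\ta\t25\t100", "2\tb\t18\t200"])
load_partition(table_dir, "data", "16", ["3\tc\t30\t300"])

all_rows, all_dirs = scan(table_dir, "data")
day15_rows, day15_dirs = scan(table_dir, "data", wanted="15")
# Apply the remaining predicate (age > 20) to the rows of the selected partition.
ids = [r[0] for r in day15_rows if int(r[2]) > 20]
print(all_dirs, day15_dirs, ids)
```

The unfiltered scan touches both partition directories, while the filtered one reads a single directory before the age condition is applied.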

4) Bucketed tables

5) Temporary tables

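3. Importing Data

1). Basic import

   load data local inpath 'local_path' into table table_name;

2). Importing data with the as keyword

   Create the table and import data from a query in one step
   create table if not exists table_name as select id,name from t_users;

3). Importing data with insert

   # the table already exists; import data from a query
   create table t_users_like like t_users;

   insert into table t_users_like select id,name,age,salary from t_users;

4). Importing data from HDFS

   load data inpath 'hdfs_path' into table table_name;
   (the source file is moved into the table directory, not copied)

5). Overwriting data during import

   load data inpath 'hdfs_path' overwrite into table table_name;
   In effect, every file in the table's directory is deleted first and the new file is then put in place.

6). Uploading files with the HDFS client

   bin/hdfs dfs -put /xxxx  /user/hive/warehouse/db_name.db/table_name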

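4. Exporting Data

1). sqoop

     An auxiliary Hadoop tool that moves data in both directions:  HDFS/Hive  <------> RDB (MySQL, Oracle)

2). Exporting with insert

      # the xiaohei directory should not exist beforehand; it is created automatically (overwrite replaces whatever is already in the target directory)
      insert overwrite [local] directory '/root/xiaohei' select name from t_user;

3). Downloading files with the HDFS client

      bin/hdfs dfs -get /user/hive/warehouse/db_name.db/table_name /root/xxxx

4). Using a command-line script

      bin/hive --database 'test' -f /root/hive.sql > /root/result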

5. Hive's Built-in Import and Export Tools

      1. export
      	export table tb_name to 'hdfs_path';
      2. import
      	import table tb_name from 'hdfs_path';

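6. MapReduce-Related Configuration

# MR-related parameters
# number of map tasks: Map --> Split ---> Block (one map task per input split; by default a split corresponds to an HDFS block)
# number of reduce tasks
mapred-site.xml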
<property>
     <name>mapreduce.job.reduces</name>
     <value>1</value>
</property>
hive-site.xml
<!--1G-->
<property>
	  <name>hive.exec.reducers.bytes.per.reducer</name>
	  <value>1000000000</value>
</property>
<property>
     <name>hive.exec.reducers.max</name>
     <value>999</value>
</property>
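
How these settings interact can be sketched as a small calculation. The Python function below is a simplified model, not Hive's source code: it assumes that a positive mapreduce.job.reduces value fixes the reducer count, that -1 means "let Hive estimate", and that the estimate is the input size divided by hive.exec.reducers.bytes.per.reducer, capped by hive.exec.reducers.max. The function name is hypothetical.

```python
import math

def estimate_reducers(input_bytes,
                      bytes_per_reducer=1_000_000_000,  # hive.exec.reducers.bytes.per.reducer
                      max_reducers=999,                 # hive.exec.reducers.max
                      job_reduces=-1):                  # mapreduce.job.reduces (-1: estimate)
    # Assumption of this sketch: an explicit positive reducer count wins.
    if job_reduces > 0:
        return job_reduces
    # Otherwise one reducer per bytes_per_reducer of input, at least 1 ...
    wanted = max(1, math.ceil(input_bytes / bytes_per_reducer))
    # ... and never more than the configured maximum.
    return min(max_reducers, wanted)

print(estimate_reducers(2_500_000_000))
print(estimate_reducers(10**13))
print(estimate_reducers(5 * 10**9, job_reduces=2))
```

With the values shown above, 2.5 GB of input would be spread over 3 reducers, and very large inputs stop at the 999 cap.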


Author: 豆比米大
