I. What is Hive
The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.
In my own words: Hive lets you work with distributed big data using SQL syntax.
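The reasons it came about can be summarized as follows:
1. Working with big data directly through MapReduce is inconvenient (see my earlier MapReduce article for details). You have to write code, and whenever the data cleansing or aggregation rules change, the code has to be rewritten.
2. For anyone coming from a traditional RDBMS (relational database), SQL is more convenient and has a much wider audience.
3. HDFS holds nothing but plain data files; it has no notion of a schema (table names, column names, ...).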
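To make the first point concrete, here is a minimal sketch of an aggregation that would need a hand-written mapper, reducer and driver in MapReduce but is a single statement in Hive. It assumes an emp table with deptno and sal columns (the column names are borrowed from the CREATE TABLE example in this post).

```sql
-- Assumes a table emp(deptno int, sal double, ...) already exists.
-- Headcount and average salary per department.
SELECT deptno,
       COUNT(*) AS headcount,
       AVG(sal) AS avg_sal
FROM emp
GROUP BY deptno;
```

If the statistic changes, only the query text changes; nothing has to be recompiled or repackaged.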
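Why use Hive?
1. Simple and easy to pick up; it is a data warehouse built on top of Hadoop
2. Highly extensible
3. Unified metadata management
Hive's data is stored on HDFS
Its metadata (data that describes the data) is stored in a relational database; in this setup that is MySQL (out of the box Hive uses an embedded Derby database)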
In one sentence:
Hive takes SQL statements and translates them into MapReduce jobs that run on YARN
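One way to see this translation for yourself is EXPLAIN, which prints the execution plan without running the query. A small sketch, assuming the emp table exists:

```sql
-- Assumes the emp table exists.
-- EXPLAIN only prints the plan; it does not launch a job.
EXPLAIN
SELECT deptno, COUNT(*) AS headcount
FROM emp
GROUP BY deptno;
```

The plan is broken into stages. The exact text depends on the Hive version and on the hive.execution.engine setting, so treat the output as something to explore rather than a fixed format.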
Hive's technical architecture
A general understanding is enough here
II. Hive DDL in detail
Official documentation: https://cwiki.apache.org/confluence/display/Hive/
DDL: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Creating a table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
Example:
CREATE TABLE IF NOT EXISTS test (
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
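The syntax above has many optional clauses that the simple example does not touch. As a rough sketch of a few of them together (EXTERNAL, COMMENT, PARTITIONED BY, STORED AS, LOCATION), here is an external, partitioned variant. The table name emp_external and the HDFS path are made up for illustration.

```sql
-- Hypothetical table name and HDFS path; adjust to your own data.
-- The partition column (deptno) must NOT be repeated in the column list.
CREATE EXTERNAL TABLE IF NOT EXISTS emp_external (
  empno int,
  ename string,
  sal   double
)
COMMENT 'Employee records kept outside the Hive warehouse directory'
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/emp_external';
```

Dropping an external table removes only its metadata; the files under LOCATION stay on HDFS. That is the main reason to choose EXTERNAL when other tools share the same data.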
III. Hive DML in detail
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
Loading data
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Example: load the local file emp.txt into the emp table
LOAD DATA LOCAL INPATH '/home/hadoop/data/emp.txt' OVERWRITE INTO TABLE emp;
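The LOCAL and OVERWRITE keywords change the behavior noticeably, so a second sketch may help. Without LOCAL the path refers to HDFS and the file is moved into the table's directory rather than copied; without OVERWRITE the rows are added to whatever the table already holds. The HDFS path below is hypothetical.

```sql
-- Hypothetical HDFS path. No LOCAL: the path is resolved on HDFS and the
-- file is moved (not copied) into the table's directory.
-- No OVERWRITE: existing data in emp is kept.
LOAD DATA INPATH '/user/hadoop/input/emp.txt' INTO TABLE emp;

-- Quick sanity check that the rows are readable.
SELECT * FROM emp LIMIT 5;
```

If the check shows NULL in every column, the field delimiter of the file most likely does not match the one declared when the table was created.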
Exporting data to the local filesystem
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] -- (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...
Example: export data to the local directory /tmp/hive
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT empno,ename,sal,deptno FROM emp;
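Since LOCAL is optional in the syntax, the same statement can also write to HDFS. A minimal sketch, with a made-up target directory:

```sql
-- Hypothetical HDFS target directory.
-- OVERWRITE replaces whatever the directory already contains.
INSERT OVERWRITE DIRECTORY '/tmp/hive_export/emp'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT empno, ename, sal, deptno FROM emp;
```

Because the target directory is overwritten, it is safer to point the export at a dedicated, empty directory than at one that holds other files.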
IV. Tips
Commands:
1. desc formatted emp; -- show detailed information about the emp table
2. set hive.cli.print.current.db=true; -- show the current database in the CLI prompt
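A few more commands in the same spirit, offered as a small sketch for the Hive CLI (they assume the emp table exists):

```sql
-- Print column names above query results in the CLI.
set hive.cli.print.header=true;

-- Show the full CREATE TABLE statement Hive has recorded for emp.
show create table emp;

-- List the tables in the current database.
show tables;
```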

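V. Notes
1. HDFS and YARN must be started before starting Hive
2. Before installing Hive, install a compatible MySQL (it holds the metadata)
3. MySQL must allow remote access; remember to adjust the privileges
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456' WITH GRANT OPTION;
(% means any remote host; to allow only one machine, replace % with that machine's host name. 'root' is the user name to connect with.)
mysql> FLUSH PRIVILEGES; (commonly run afterwards; GRANT itself takes effect immediately, and the flush is strictly needed only if the grant tables were edited directly)
Query OK, 0 rows affected (0.03 sec)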
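On MySQL 8.0 and later the GRANT ... IDENTIFIED BY form is no longer accepted; the account has to be created first and granted privileges separately. A sketch of the equivalent, assuming a dedicated account named hive and a metastore database named metastore (both names are made up here) instead of root on *.*:

```sql
-- Hypothetical user, password and database names; pick your own.
CREATE USER 'hive'@'%' IDENTIFIED BY '123456';
GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'%';
```

Limiting the grant to one database keeps the Hive connection from having full control over the rest of the MySQL server.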
Official site: http://hive.apache.org/
Wiki: https://cwiki.apache.org/confluence/display/HIVE
Download (CDH5 build):
http://archive.cloudera.com/cdh5/cdh/5/
hive-1.1.0-cdh5.15.1.tar.gz (the CDH5 version must match that of your Hadoop install, otherwise it will not work)