Hive Advanced 02

I. Hive Recap

1. Complex data types:

map, struct, array (a sketch follows)
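
A minimal sketch of all three types in one table (the table and column names are made up for illustration):

create table ruoze_complex_demo(
  name string,
  hobbies array<string>,                    -- array
  scores map<string,int>,                   -- map
  address struct<city:string,zip:string>    -- struct
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

-- element access: hobbies[0], scores['math'], address.city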

2. JDBC programming: two lines of sample code on the official website are wrong.

3. ZooKeeper: mostly deployed for HA clusters, with an odd number of nodes.

II. Compression

1. Common compression formats

Format   Tool    Algorithm   File extension
gzip     gzip    DEFLATE     .gz
bzip2    bzip2   bzip2       .bz2
LZO      lzop    LZO         .lzo
Snappy   N/A     Snappy      .snappy

2. Comparing compression formats

Compression ratio

Compression speed / decompression speed

3. Compression in Hadoop

1) Splittability

What difference does splittability make? An example:

1 GB of data:

Uncompressed: Map Tasks = 1024 / 128 = 8, one per 128 MB block.

After gzip compression: gzip is not splittable, so a single Map Task has to process the whole file.

2) Common codecs

Codecs: we only need to declare them in Hadoop's configuration files (see Section III below).
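
A quick sanity check that the native libraries behind these codecs are actually loaded (hadoop checknative is a stock command; the exact paths and versions vary by build):

hadoop checknative
# prints one line per library, for example:
# zlib:    true /lib64/libz.so.1
# snappy:  true ...
# bzip2:   true ...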

3) Where compression fits in a MapReduce job

1:

MapReduce jobs read their input from HDFS.

Compression reduces the disk overhead.

For the input, prefer a splittable compression format,

or a file structure that is splittable by design: ORC or ...

2:

For the intermediate (map) output, prefer a codec that decompresses fast, such as Snappy or LZO.

3:

The final output splits into two cases (a sketch of stages 2 and 3 follows this list):

1) Kept on disk as the end result: use a codec with a high compression ratio.

2) Fed to the next job as input: use a splittable codec.
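
A minimal sketch of wiring stages 2 and 3 together from inside Hive (the property names are standard Hadoop/Hive ones; Snappy and BZip2 are just the choices discussed above):

-- stage 2: compress the map output with a fast codec
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- stage 3: compress the final job output with a high-ratio codec
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;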

III. Configuring Compression in Hadoop

1. core-site.xml


<property>
  <name>io.compression.codecs</name>
  <value>
    org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.DefaultCodec,
    org.apache.hadoop.io.compress.BZip2Codec
  </value>
</property>


2. mapred-site.xml

Whether to compress the final output:


<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>


Which codec to use for the output:


<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>


Which codec the map stage or other steps use:

Official docs: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html (the full property reference lives in mapred-default.xml)

Search for: mapreduce.output.fileoutputformat.compress

Read the neighbouring entries; the other stages follow the same naming pattern.
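
For example, a sketch of the map-output counterparts for mapred-site.xml (the codec choice here is illustrative):

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>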

3. Testing compression

Change into the MapReduce examples directory:

cd /opt/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce

Run:

hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /ruozeinput.txt /ruozedata-compression-out/

Success.
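
A quick way to confirm the output really is compressed (assuming BZip2Codec was configured in mapred-site.xml as above):

hdfs dfs -ls /ruozedata-compression-out/
# expect part files with the codec's extension, e.g. part-r-00000.bz2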

IV. Hive Compression

1. Create the page_views table


create table page_views(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t';


2. Load the external data into the table

load data local inpath '/home/hadoop/data/page_views.dat' overwrite into table page_views;

3. Create another table, compressing the data on its way into Hive

(1) Override the default settings:


SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;

(2) Create a table via CTAS; the data lands compressed:


create table page_views_bzip2
row format delimited fields terminated by '\t'
as select * from page_views;

(3) Check the files on HDFS:

hdfs dfs -du -s -h /user/hive/warehouse/page_views
hdfs dfs -du -s -h /user/hive/warehouse/page_views_bzip2

The compressed files come out about one third smaller.

V. Hive Storage Formats

1. STORED AS file_format

create table ruoze_a(id int) stored as textfile;

2. Inspect the table definition

desc formatted ruoze_a;

The output shows the default input and output formats.
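
The relevant lines of the desc formatted output look roughly like this (trimmed):

SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:    org.apache.hadoop.mapred.TextInputFormat
OutputFormat:   org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat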

3. Spell out the input and output formats, i.e., unfold what textfile stands for.


create table ruoze_b(id int) stored as
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';


4. TEXTFILE

The default; what most tables use today.

5. SEQUENCEFILE


create table ruoze_page_views_seq(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as SEQUENCEFILE;


The obvious load fails: the table is declared as SEQUENCEFILE, but the source file is plain text, so it errors out. The data has to be inserted across from a text table instead.

Error: load data local inpath '/home/hadoop/data/page_views.dat' overwrite into table ruoze_page_views_seq;

insert into table ruoze_page_views_seq select * from ruoze_page_views;

Takeaway:

SEQUENCEFILE makes the data larger: it adds key/value framing and sync markers on top of the raw rows.

6. RCFILE


create table ruoze_page_views_rc(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as rcfile;

insert into table ruoze_page_views_rc select * from ruoze_page_views;


7. ORC (https://blog.csdn.net/dabokele/article/details/51542327)

Compared with RCFile format, for example, ORC file format has many advantages such as:

  • a single file as the output of each task, which reduces the NameNode's load
  • Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
  • light-weight indexes stored within the file
    • skip row groups that don't pass predicate filtering
    • seek to a given row
  • block-mode compression based on data type
    • run-length encoding for integer columns
    • dictionary encoding for string columns
  • concurrent reads of the same file using separate RecordReaders
  • ability to split files without scanning for markers
  • bound the amount of memory needed for reading or writing
  • metadata stored using Protocol Buffers, which allows addition and removal of fields


create table ruoze_page_views_orc(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as orc;

insert into table ruoze_page_views_orc select * from ruoze_page_views;


Takeaway:

Even smaller than the bzip2-compressed data (ORC compresses with ZLIB by default).
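
One way to put numbers on that, assuming the default warehouse location (hdfs dfs -du takes several paths at once):

hdfs dfs -du -s -h /user/hive/warehouse/ruoze_page_views_seq \
                   /user/hive/warehouse/ruoze_page_views_rc \
                   /user/hive/warehouse/ruoze_page_views_orc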

(1) ORC with compression turned off (orc.compress=NONE)


create table ruoze_page_views_orc_null(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="NONE");

insert into table ruoze_page_views_orc_null select * from ruoze_page_views;
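
To confirm what the file footer records, Hive ships an ORC dump tool (the part-file name under the table directory is illustrative):

hive --orcfiledump /user/hive/warehouse/ruoze_page_views_orc_null/000000_0
# the footer output includes a "Compression:" line, NONE for this table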

8. PARQUET (a descendant of Google's Dremel)


create table ruoze_page_views_parquet(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;

insert into table ruoze_page_views_parquet select * from ruoze_page_views;


1) Parquet with compression

set parquet.compression=GZIP;

create table ruoze_page_views_parquet_gzip
row format delimited fields terminated by '\t'
stored as parquet
as select * from ruoze_page_views;
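
The same du check shows what GZIP buys on the Parquet side (default warehouse paths assumed):

hdfs dfs -du -s -h /user/hive/warehouse/ruoze_page_views_parquet \
                   /user/hive/warehouse/ruoze_page_views_parquet_gzip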

VI. Row-oriented vs. Column-oriented Storage

Row-oriented storage:

All fields of one row are stored together, in the same block.

Column-oriented storage:

The data is laid out column by column (a toy illustration follows).
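
A toy illustration of how the same two rows land on disk under each layout:

rows:         (1, "a", 10)  (2, "b", 20)
row-store:    1,a,10 | 2,b,20
column-store: 1,2 | a,b | 10,20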

Summary:

1. Big-data tables tend to have a great many columns while most queries touch only a few of them; column storage reads just the columns a query needs, which cuts down system I/O.

2. For compression, column storage achieves a higher ratio, since the values within one column share a type and tend to resemble each other.

3. But when a query touches many columns, column storage has to reassemble the rows from all of those columns, which costs noticeably more.

VII. Reads, Writes, and Performance

Compare how much data the same query reads from HDFS across the plain (text), ORC, and Parquet tables; the figure under each query is the bytes read:

select count(1) from ruoze_page_views where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';

19022752

select count(1) from ruoze_page_views_orc where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';

1257523

select count(1) from ruoze_page_views_parquet where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';

2687077

3496487


Reposted from blog.csdn.net/Binbinhb/article/details/88391483