Storage Formats in Big Data

Storage formats in Hive

See the official wiki for details:
https://cwiki.apache.org/confluence/display/Hive/Home#Home-UserDocumentation
Hive's default storage format is TextFile:

hive (default)> set hive.default.fileformat;
hive.default.fileformat=TextFile
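The default can also be overridden for the current session; a minimal sketch (SequenceFile is just an example value here, and the accepted set of values depends on the Hive version):

hive (default)> set hive.default.fileformat=SequenceFile;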

A table's storage format can be specified explicitly:


hive (default)> create table t_2(id int)
                stored as
                  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
                  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Or simply:
hive (default)> create table t_3(id int) stored as textfile;

Check the structure of both tables:
hive (default)> desc formatted t_2;   -- likewise for t_3

InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 

Row-oriented storage

Pros: each row is guaranteed to sit in a single block, so whole-row operations like select * are fast.
Cons: if columns A, B, C, and D hold different data types, the whole row can still only use a single compression scheme, so the compression ratio is poor.

Column-oriented storage

Pros: columns of different data types can each be compressed with a suitable codec (handled internally, no manual configuration needed). When a query touches only a few columns, columnar storage is faster: to read just 3 columns, a row-oriented format still has to scan every row in full. How many columns are grouped together is arbitrary.

TextFile format

Hive's default storage format; row-oriented.

In a text file everything is a string, yet Hive data carries a schema (metadata), so columns can hold a variety of value types. TextFile storage therefore has to convert types on the way in and out, which is a comparatively expensive overhead.

SequenceFile format

Records, the stored data, are kept as key-value pairs, so the files are larger. With record-level compression, only the values are compressed.
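SequenceFile compression comes in three types: NONE, RECORD (the value-only compression just mentioned), and BLOCK, which batches many records together and usually compresses better. A minimal sketch of writing with block compression, assuming the older mapred.* property name still honored in this Hadoop 2.6/CDH 5.7 environment:

hive (default)> set hive.exec.compress.output=true;
hive (default)> set mapred.output.compression.type=BLOCK;
hive (default)> insert overwrite table page_views_sequencefile select * from page_views;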

Test the performance:

create table page_views_sequencefile(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
stored as sequencefile;

load data local inpath '/home/hadoop/data/page_views.dat' overwrite into table page_views_sequencefile;

This fails: the file can't be read, because a plain-text file can't be converted directly into another format. The data has to be stored as TextFile first and then written into a table of the target format, here with an insert.

insert into table page_views_sequencefile select * from page_views;

The data actually got bigger, precisely because of the key-value storage layout:
[hadoop@hadoop000 data]$ hadoop fs -du -h /user/hive/warehouse/page_views
18.1 M  18.1 M  /user/hive/warehouse/page_views/page_views.dat
[hadoop@hadoop000 data]$ hadoop fs -du -h /user/hive/warehouse/page_views_sequencefile
20.6 M  20.6 M  /user/hive/warehouse/page_views_sequencefile/000000_0
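Since a SequenceFile is binary, hadoop fs -cat prints gibberish; hadoop fs -text knows how to decode it if you want to eyeball the contents (path taken from the listing above):

[hadoop@hadoop000 data]$ hadoop fs -text /user/hive/warehouse/page_views_sequencefile/000000_0 | head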

RCFile

Hybrid row-column storage: rows are first partitioned into row groups, and within each group the data is laid out column by column.
Test the performance:

create table page_views_rcfile(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
stored as rcfile;


insert into table page_views_rcfile select * from page_views;

 
[hadoop@hadoop000 data]$ hadoop fs -du -h /user/hive/warehouse/page_views
18.1 M  18.1 M  /user/hive/warehouse/page_views/page_views.dat
[hadoop@hadoop000 data]$ hadoop fs -du -h /user/hive/warehouse/page_views_sequencefile
20.6 M  20.6 M  /user/hive/warehouse/page_views_sequencefile/000000_0
Something is off: this session had compression enabled, so the RCFile result below is misleading; switch to a fresh session.
[hadoop@hadoop000 data]$ hadoop fs -du -h /user/hive/warehouse/page_views_rcfile
3.2 M  3.2 M  /user/hive/warehouse/page_views_rcfile/000000_0

In a fresh session:
 hive (default)> insert overwrite table page_views_rcfile select * from page_views;

The normal result:
[hadoop@hadoop000 data]$ hadoop fs -du -h /user/hive/warehouse/page_views_rcfile
17.9 M  17.9 M  /user/hive/warehouse/page_views_rcfile/000000_0

Parquet columnar storage

Parquet and ORC are the two mainstream columnar storage formats.

create table page_views_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
stored as parquet;

insert into table page_views_parquet select * from page_views;

[hadoop@hadoop000 data]$ hadoop fs -du -h /user/hive/warehouse/page_views_parquet
13.1 M  13.1 M  /user/hive/warehouse/page_views_parquet/000000_0

This good without even enabling compression; now let's try compression:
hive (default)> set parquet.compression=gzip;
hive (default)> set parquet.compression;
parquet.compression=gzip

create table page_views_parquet_gzip stored as parquet as select * from page_views;

[hadoop@hadoop000 data]$ hadoop fs -du -h /user/hive/warehouse/page_views_parquet_gzip
3.9 M  3.9 M  /user/hive/warehouse/page_views_parquet_gzip/000000_0
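gzip is not the only choice; parquet.compression also accepts snappy (and uncompressed), trading some compression ratio for speed. A sketch along the same lines, with a hypothetical table name:

hive (default)> set parquet.compression=snappy;
hive (default)> create table page_views_parquet_snappy stored as parquet as select * from page_views;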

ORC storage

create table page_views_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
stored as orc;

insert into table page_views_orc select * from page_views;

[hadoop@hadoop000 data]$ hadoop fs -du -h /user/hive/warehouse/page_views_orc
2.8 M  2.8 M  /user/hive/warehouse/page_views_orc/000000_0
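ORC compresses with ZLIB by default, which is why 2.8 M comes out without setting anything. The codec can also be picked per table through TBLPROPERTIES; a sketch with a hypothetical table name:

hive (default)> create table page_views_orc_snappy
                stored as orc TBLPROPERTIES ("orc.compress"="SNAPPY")
                as select * from page_views;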


Comparing the formats from the query side

hive (default)>  select count(1)  from page_views where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.48 sec   HDFS Read: 1902268 

hive (default)>  select count(1)  from page_views_sequencefile where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.74 sec   HDFS Read: 2050926

hive (default)>  select count(1)  from page_views_rcfile where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.73 sec   HDFS Read: 3725383

hive (default)>  select count(1)  from page_views_parquet where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.55 sec   HDFS Read: 2687017

hive (default)> select count(1)  from page_views_orc where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.8 sec   HDFS Read: 1257463 

HiveServer2

Essentially, you start a server on one machine and remote clients can then connect to it.

Start the server:
[hadoop@hadoop000 bin]$ ./hiveserver2 

[hadoop@hadoop000 bin]$ ls
beeline  ext  hive  hive-config.sh  hiveserver2  metatool  schematool
The simplest way to start a client:
[hadoop@hadoop000 bin]$ beeline
which: no hbase in (/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin:/usr/java/jdk1.7.0_80/bin:/home/hadoop/app/sqoop-1.4.6-cdh5.7.0/bin:/home/hadoop/app/hive-1.1.0-cdh5.7.0/bin:/home/hadoop/app/findbugs-1.3.9/bin:/home/hadoop/app/protobuf/bin:/home/hadoop/app/apache-maven-3.3.3/bin:/usr/java/jdk1.7.0_80/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin)
Beeline version 1.1.0-cdh5.7.0 by Apache Hive
beeline> 
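From this prompt, !connect opens a connection; a minimal sketch (beeline then prompts for a password; just press Enter if none is set):

beeline> !connect jdbc:hive2://localhost:10000 hadoop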

The usual way to connect:
-u is the JDBC URL; -n is the username (an OS user, not a MySQL one); -w points to a password file, which isn't needed here since no password is set. HiveServer2's default port is 10000.
beeline -u jdbc:hive2://localhost:10000/default -n hadoop -w password_file

The port can also be changed:
./hiveserver2 --hiveconf hive.server2.thrift.port=14000
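Then point beeline at the new port with the same flags as before:

beeline -u jdbc:hive2://localhost:14000/default -n hadoop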

Reposted from blog.csdn.net/qq_36459386/article/details/84757155