Hive Statements, Part 2

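Partitioned table concepts

Why partition:
Partitioning avoids full table scans and so speeds up queries; without it, Hive scans the whole table by default.
What to partition by:
Date, region, or any column that spreads the data out.
Partition syntax: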
[PARTITIONED BY (COLUMNNAME COLUMNTYPE [COMMENT 'COLUMN COMMENT'],...)]
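1. Partition column names are not case-sensitive (DAY and day are the same column); partition values are case-sensitive because they become directory names.
2. A partition column is a pseudo-column, but it can still be used in queries.
3. A table can have one or more partition levels, and a partition can contain one or more sub-partitions.
4. Partition columns are columns outside the table's own data columns.
Essence:
A partition is a subdirectory created under the table directory or under another partition's directory; the directory is named after the specified partition column and value (for example dt=2018-03-21).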

Case sensitivity of partition values

load data local inpath '/hivedata/user.txt' into table part4 partition(year='2018',month='03',day='AA');

View partitions:

show partitions part4;

Single-level partition

Create a partitioned table:
create table if not exists part1(
id int,
name string
)
partitioned by (dt string)
row format delimited fields terminated by ','
;

Load data
load data local inpath '/hivedata/user.txt' into table part1 partition(dt='2018-03-21');
load data local inpath '/hivedata/user.txt' into table part1 partition(dt='2018-03-20');
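As a small illustration of why the partition column matters, the sketch below filters on dt so that Hive only needs to read the matching partition directory. It assumes the part1 table and the two partitions loaded above.

```sql
-- List the partitions created by the two load statements.
show partitions part1;

-- Filtering on the partition column lets Hive read only the dt=2018-03-21 directory.
select id, name, dt
from part1
where dt = '2018-03-21'
;
```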

Two-level partition

create table if not exists part2(
id int,
name string
)
partitioned by (year string,month string)
row format delimited fields terminated by ','
;

load data local inpath '/hivedata/user.txt' into table part2 partition(year='2018',month='03');
load data local inpath '/hivedata/user.txt' into table part2 partition(year='2018',month='02');

Three-level partition

create table if not exists part3(
id int,
name string
)
partitioned by (year string,month string,day string)
row format delimited fields terminated by ','
;

load data local inpath '/hivedata/user.txt' into table part3 partition(year='2018',month='03',day='21');  -- directory: month=03

load data local inpath '/hivedata/user.txt' into table part3 partition(year='2018',month='02',day='20');  -- directory: month=02

create table if not exists part4(
id int,
name string
)
partitioned by (year string,month string,DAY string)
row format delimited fields terminated by ','
;

load data local inpath '/hivedata/user.txt' into table part4 partition(year='2018',month='03',day='21');

Modifying partitions:

1. View partitions
show partitions table_name;
2. Add partitions (three alternative forms)
alter table part1 add partition (dt='2019-09-10');
alter table part1 add partition (dt='2019-09-10') location '/user/hive/warehouse/qf1704.db/part1/dt=2019-09-10';
alter table part1 add partition (dt='2019-09-10') partition (dt='2019-09-11');
3. Rename a partition
alter table part1 partition(dt='2019-09-10') rename to partition(dt='2019-09-11');
4. Change a partition's path (note: the HDFS path after location must be a full path)
alter table part1 partition(dt='2019-09-10') set location 'hdfs://hadoop-01:9000/user/hive/warehouse/qf1704.db/part1/dt=2019-09-10';
5. Drop partitions
alter table part1 drop partition(dt='2019-09-10');
alter table part1 drop partition(dt='2019-09-10'),partition(dt='2019-09-11');

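Static partitioning: data is loaded into a partition whose value is given explicitly.
Dynamic partitioning: the partition values are not known in advance; the partitions to create are determined from the values in the data.
Mixed partitioning: static and dynamic partition columns are used together.
Properties for dynamic partitioning:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=strict/nonstrict
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;  -- maximum number of dynamic partitions per node
strict: strict mode requires at least one static partition column.
nonstrict: all partition columns may be dynamic, but it is advisable to estimate the number of dynamic partitions beforehand.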

Example:


Dynamic partition load

insert into dy_part1 partition(dt) select id,name,dt from temp_part;
select * from dy_part1 where dt='2019-06-25';
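The definitions of dy_part1 and temp_part are not shown in these notes, so the following is only a sketch of a setup under which the statements above would run. The column layouts of both tables and the two-level table dy_part2 are assumptions made for illustration.

```sql
-- Hypothetical source table: unpartitioned, with the partition value stored as a regular column.
create table if not exists temp_part(
id int,
name string,
dt string
)
row format delimited fields terminated by ','
;

-- Hypothetical target table, partitioned by dt.
create table if not exists dy_part1(
id int,
name string
)
partitioned by (dt string)
row format delimited fields terminated by ','
;

-- Allow every partition column to be dynamic in this session.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Fully dynamic: the partition column (dt) comes last in the select list.
insert into table dy_part1 partition(dt)
select id, name, dt from temp_part;

-- Mixed partitioning on a hypothetical two-level table: year is static, month is dynamic.
create table if not exists dy_part2(
id int,
name string
)
partitioned by (year string, month string)
;

insert into table dy_part2 partition(year='2019', month)
select id, name, substr(dt, 6, 2) as month from temp_part;
```

The mixed form also works under strict mode, because at least one partition column (year) is static.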

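Notes on partitioned tables:

1. Hive partitions use columns outside the table data; a partition column is a pseudo-column, but it can be used to filter queries.
2. Chinese characters are not recommended in partition columns.
3. Dynamic partitioning is generally discouraged: it runs MapReduce to query the data, and too many partitions can turn the NameNode and ResourceManager into performance bottlenecks. Estimate the number of partitions as far as possible before using it.
4. Partition properties can be changed by modifying both the metadata and the data on HDFS.

Differences between Hive partitions and MySQL partitions

MySQL partitions on a column inside the table, whereas Hive partitions on a column outside the table data.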

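Hive strict mode:

Strict mode blocks 5 kinds of queries:
1. Cartesian products
2. Queries on a partitioned table without a filter on the partition column
3. order by without limit
4. Comparing bigint (8 bytes) with string
5. Comparing bigint with double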
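A sketch of the first three cases follows. It assumes strict mode is switched on through hive.mapred.mode (newer releases split this into hive.strict.checks.* properties), and it borrows the emp and dept tables used later in these notes.

```sql
-- Assumption: this property enables strict mode in the Hive version at hand.
set hive.mapred.mode=strict;

-- 1. Cartesian product: a join without an on condition is rejected.
select e.*, d.* from emp e join dept d;

-- 2. A partitioned table queried without a partition filter is rejected.
select * from part1;

-- 3. order by without limit is rejected.
select * from emp order by empno;

-- Rewritten versions that are expected to pass under strict mode:
select e.*, d.* from emp e join dept d on e.deptno = d.deptno;
select * from part1 where dt = '2018-03-21';
select * from emp order by empno limit 10;
```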

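Hive read/write mode:

Hive is strictly schema-on-read.
MySQL is strictly schema-on-write: data is checked as it is written, and an error is raised if it does not conform.

Bucketing:

Why bucket:

When a single partition or table holds too much data and partitioning cannot divide it any further, bucketing splits the data at a finer granularity.

Keyword: bucket

Purpose:

To store the data in a bucketed structure (rows are hash-distributed on the bucketing column), so that:
	sampling (sample) and joins on the bucketed table run more efficiently
Sampling queries
Faster joins
Buckets can be created inside a partition
Buckets can be created directly in a table
Bucketing uses a column inside the table (it corresponds to the partitioner in MapReduce)

Default: bucketing is implemented by hashing the bucketing column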

Create a bucketed table

create table if not exists buc1(
id int,
name string,
age int
)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
;

Data:
id,name,age
1,aa1,18
2,aa2,19
3,aa3,20
4,aa4,21
5,aa5,22
6,aa6,23
7,aa7,24
8,aa8,25
9,aa9,26

Create a staging table

create table if not exists temp_buc1(
id int,
name string,
age int
)
row format delimited fields terminated by ','
;

Loading a bucketed table with load does not actually bucket the data

load data local inpath '/home/hivedata/buc1.txt' into table buc1;

Load the data into the staging table

load data local inpath '/home/hivedata/buc1.txt' into table temp_buc1;

Populate the bucketed table with a bucketing query

insert overwrite table buc2
select id,name,age from temp_buc1
cluster by (id)
;
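To see which bucket each row would be routed to, the hash can be computed by hand. This is a sketch that assumes the default hash-based bucketing, where a row goes to bucket index (hash & 2147483647) % number_of_buckets; newer Hive versions may use a different hash function for bucketing, so the result is illustrative only.

```sql
-- Compute the 0-based bucket index each staging row would get in a 4-bucket table.
select
id,
name,
(hash(id) & 2147483647) % 4 as bucket_index
from temp_buc1
;
```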

Set the property that enforces bucketing

set hive.enforce.bucketing=false/true
<name>hive.enforce.bucketing</name>
<value>false</value>
<description>Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced.</description>

If the configured number of reducers differs from the total number of buckets, set it manually:

set mapreduce.job.reduces=<number of buckets>;

Create a bucketed table with a sort column

create table if not exists buc2(
id int,
name string,
age int
)
clustered by (id) 
sorted by (id desc) into 8 buckets
row format delimited fields terminated by ','
;

Load data

insert overwrite table buc3
select id,name,age from temp_buc1
distribute by (id) sort by (id asc)
;
This has the same effect as the following statement
insert overwrite table buc2
select id,name,age from buc4
cluster by (id)
;

Bucketed table query examples

select * from buc3;
select * from buc3 tablesample(bucket 1 out of 1 on id);

Query the data in bucket 1:

select * from buc3 tablesample(bucket 1 out of 4 on id);
select * from buc3 tablesample(bucket 2 out of 4 on id);
select * from buc3 tablesample(bucket 1 out of 2 on id);

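tablesample(bucket x out of y on id)
x: the bucket to start reading from
y: the total number of buckets to sample over; y may be a multiple or a factor of the table's bucket count. x must not be greater than y.
Neither compressed nor stretched:
1 1+4
Compressed:
1 1+4/2 1+4/2+4/2
2 2+4/2 2+4/2+4/2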

select * from buc3 tablesample(bucket 1 out of 8 on id);
select * from buc3 tablesample(bucket 2 out of 8 on id);
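One way to read bucket x out of y is as a filter on the hash of the sampling column: a row is kept when its hash modulo y equals x-1. The sketch below spells that out for x=1, y=4. Treating the two queries as equivalent is an assumption based on the default hash-based sampling.

```sql
-- Sampling form: bucket 1 out of 4 on id.
select * from buc3 tablesample(bucket 1 out of 4 on id);

-- Hand-written form of the same condition (x - 1 = 0, y = 4).
select * from buc3
where (hash(id) & 2147483647) % 4 = 0
;
```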

Queries (note: tablesample must come directly after the table name in the from clause)

Query rows with odd ids:

select
*
from buc2 tablesample(bucket 2 out of 2 on id)
where name = "aa1"
;

select
*
from buc2
where name = "aa1"
;

Queries:
select * from temp_buc1 limit 3;
select * from buc1 limit3;  -- bucketed table; without a space, limit3 is parsed as a table alias, so all 9 rows come back
OK
1       aa1     18
2       aa2     19
3       aa3     20
4       aa4     21
5       aa5     22
6       aa6     23
7       aa7     24
8       aa8     25
9       aa9     26
Time taken: 0.109 seconds, Fetched: 9 row(s)
hive> select * from buc1 limit1;
OK
1       aa1     18
2       aa2     19
3       aa3     20
4       aa4     21
5       aa5     22
6       aa6     23
7       aa7     24
8       aa8     25
9       aa9     26
select * from temp_buc1 tablesample(3 rows);
 select * from buc1 tablesample (3 rows);
select * from temp_buc1 tablesample(30 percent);
select * from temp_buc1 tablesample(6B);  -- size units: B k M G T P
hive> select * from buc2 tablesample (4B)
    > ;
OK
4       aa4     21
Time taken: 0.116 seconds, Fetched: 1 row(s)
hive> select * from buc2 tablesample (4b)
    > ;
OK
4       aa4     21
SELECT * from temp_buc1 order by rand() limit 3;
 select * from green order by rand() limit 3;

Partitioning plus bucketing:

The partition keyword comes first, the bucket keyword after it
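A minimal sketch of a table that is both partitioned and bucketed, with partitioned by written before clustered by. The table name part_buc1 is hypothetical; temp_buc1 is the staging table created earlier.

```sql
-- Hypothetical table: partitioned by dt, each partition split into 4 buckets by id.
create table if not exists part_buc1(
id int,
name string,
age int
)
partitioned by (dt string)
clustered by (id) sorted by (id asc) into 4 buckets
row format delimited fields terminated by ','
;

-- Fill one static partition from the staging table; cluster by routes rows to buckets.
insert overwrite table part_buc1 partition(dt='2019-09-10')
select id, name, age from temp_buc1
cluster by id
;
```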

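Bucketing summary

1. Definition
    clustered by (id)         --- specifies the bucketing column
    sorted by (id asc|desc)   --- specifies the sort order, i.e. the order we expect the data to be stored in
    
2. Loading data
    cluster by (id)   --- specifies the column getPartition hashes on; the same column is also the sort column, sorted asc
    distribute by (id)    ---- specifies the column getPartition hashes on
    sort by (name asc | desc) --- specifies the sort column
    
    Difference: with distribute by, the getPartition column and the sort column can be specified separately
    Partitioning uses a column outside the table data; bucketing uses a column inside the table
Bucketing manages data at a finer granularity and is mostly used for sampling and joins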

Query statements

select
from
join 
on
where 
group by 
/grouping sets/ with cube/with rollup

having 
order by
sort by 
limit
union/union all
;

SQL execution order:

FROM <left_table>
ON <join_condition>
<join_type> JOIN <right_table>
WHERE <where_condition>
GROUP BY <group_by_list>
HAVING <having_condition>
SELECT
DISTINCT <select_list>
ORDER BY <order_by_condition>
LIMIT <limit_number>
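In Hive, avoid subqueries where possible, and avoid in and not in (in MySQL they can keep an index from being used).
Avoid join queries where possible, although in practice they cannot be avoided entirely.
Always let the small table drive the large table (the small result set drives the large one).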

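The hardest part of a relational database is modeling.

Complex, intertwined data has to be modeled before it can be stored in a database. A "model" consists of two things: entities and relationships.

An entity is an actual object with its own attributes; it can be thought of as a container for a set of related attributes. A relationship is a link between entities, usually classified as one-to-one, one-to-many, or many-to-many.

join syntax and characteristics:

In a relational database each entity has its own table, all of its attributes are fields of that table, and tables are joined to each other on related fields. Joining tables is therefore the core problem of relational databases.

Joins come in several types.
Inner join (inner join)
Outer join (outer join)
Left join (left join)
Right join (right join)
Full join (full join)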
select * from buc1 full outer join buc2 on buc1.id=buc2.id;
1       aa1     18      1       aa1     18
2       aa2     19      2       aa2     19
3       aa3     20      3       aa3     20
4       aa4     21      4       aa4     21
5       aa5     22      5       aa5     22
6       aa6     23      6       aa6     23
7       aa7     24      7       aa7     24
8       aa8     25      8       aa8     25
9       aa9     26      9       aa9     26
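The most commonly used is left join
inner join:
Left join: the left table is the base table; right-table columns that find no match are filled with null
left join / left semi join / left outer join
Differences between left semi join and left join:
1. Both are driven by the left table, but left semi join drops left rows that have no match in the right table, whereas left join keeps them
2. left semi join can only select columns of the left table; left join can select columns of both
3. left semi join is an optimization of left join
4. left semi join is typically used for existence checks (in / exists)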
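The difference can be seen with a small sketch, assuming the emp and dept tables used later in these notes, joined on deptno.

```sql
-- left semi join: return employees whose deptno exists in dept.
-- Only columns of the left table (e) may appear in the select list.
select
e.*
from emp e
left semi join dept d
on e.deptno = d.deptno
;

-- The same question written with in, for comparison.
select
e.*
from emp e
where e.deptno in (select d.deptno from dept d)
;
```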

right semi join is not supported in Hive; right join / right outer join are

full outer join: takes the union of both sides

inner join (equivalent to: join)


Subqueries

Hive's subquery support is not very friendly; "=" in particular causes many problems
select
e.*
from emp e
where e.deptno = (
select 
d.deptno
from dept d
limit 1
)
;

select
e.*
from emp e
where e.deptno in (
select 
d.deptno
from dept d
)
;

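Differences between inner join and outer join:
A filter on a partition column inside the on clause has no effect for outer join, but does take effect for inner join

There is inner join, but no full inner join
There is full outer join, but no plain outer join
All joins support only equi-joins (= and and). !=, <, >, <>, >=, <= and or are not supported

map-side join

map-side join:
If one of the tables being joined is small, it is cached in memory and the join lookup is done on the map side. This reduces the amount of data Hive has to scan, since the small table is read from the in-memory cache; it is faster and also avoids a large amount of data transfer and shuffle time.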

Note this property:
set hive.auto.convert.join=true;
select
e.*
from u1 d 
join u2 e
on d.id = e.id
;

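Older versions required the hint /*+ MAPJOIN(small_table) */ to mark a join as a map-side join. The hint has been deprecated since Hive 0.7 but still works:
(To be confirmed: this needs retesting to check that it is still effective.)
select
/*+ MAPJOIN(d) */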
e.*
from u1 d 
join u2 e
on d.id = e.id
;
How small a table must be to be converted to a map-side join:
set hive.mapjoin.smalltable.filesize=25000000;   -- about 23.8 MB

Enabled by default in Hive 1.x; the size can be adjusted as needed

set hive.auto.convert.join=true;   -- whether the map-side join optimization is on, default true

Sign of a map join: a hashtable is cached

hive.skewjoin.mapjoin.map.tasks:

For skewed joins

where

where is usually followed by an expression; it may use ordinary (non-aggregate) functions, but not aggregate functions

select
d.*
from dept d 
where  length(d.dname) > 5 
;

group by:

group by: grouping, usually used together with aggregate functions
Every selected column must appear either in group by or inside an aggregate function

A query with group by generally produces a reduce phase

having

Filters the result set after grouping.
select
e.deptno,
count(e.deptno) ct
from emp e
group by e.deptno
having ct > 3
;

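limit:

limit: the number of rows to take from the result set
With set hive.limit.optimize.enable=true, limit no longer scans everything but samples the data according to the requested row count.

Two other settings to be aware of:
hive.limit.row.max.size        controls the maximum sample size
hive.limit.optimize.limit.file the maximum number of files to sample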

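cluster by

cluster by: combines distribute by with an ascending sort by.
It can only sort ascending (the default order); asc or desc cannot be specified. Equivalent to distribute by plus sort by on the same column.
 distribute by (id)    ---- specifies the column getPartition hashes on

Partitioned sorting: distribute by

distribute by: decides which reducer each map output row goes to, based on the column after by and the number of reducers.
By default the hash of the first column of the query decides the reducer. With only one reducer it has no visible effect.

sort by: local sort; order is guaranteed only within each reducer.
order by: global sort; order is guaranteed across the data of all reducers.
With a single reducer the two behave much the same.
Both are usually combined with desc or asc; ascending (asc) is the default.

Drawback of order by:
Because the sort is global, all data passes through a single reducer; with a large result set this hurts performance badly.
Note:
When MapReduce strict mode is on, order by must be accompanied by a limit clause, otherwise an error is raised

Manually set the number of reducers: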
set mapreduce.job.reduces=3;
select
e.empno
from emp e
order by e.empno desc 
;

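Whenever order by is used, the number of reducers is 1.

If sort by and distribute by appear together, distribute by must be written before sort by.

If sort by and distribute by appear together on the same column and sort by is ascending  <==> cluster by that column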
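A short sketch of the ordering and the equivalence, assuming the emp table used elsewhere in these notes has empno and deptno columns.

```sql
-- Use more than one reducer so that the distribution is visible.
set mapreduce.job.reduces=3;

-- distribute by comes first, sort by second; the sort column and direction may differ.
select empno, deptno
from emp
distribute by deptno
sort by empno desc
;

-- Same column for both, ascending: these two queries are equivalent.
select empno, deptno from emp distribute by deptno sort by deptno asc;
select empno, deptno from emp cluster by deptno;
```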

union: merges multiple result sets, removing duplicates and sorting
union all: merges multiple result sets without removing duplicates or sorting.

select
d.deptno as deptno,
d.dname as dname
from dept d
union
select
e.deptno as deptno,
e.ename as dname
from emp e
;

select
d.deptno as deptno,
d.dname as dname
from dept d
union all
select
d.dname as dname,
d.deptno as deptno
from dept d
;

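A single branch of a union does not support order by, cluster by, distribute by, sort by, or limit
The branches of a union must have the same number of columns, in the same order.

distinct: removes duplicates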
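When the combined result does need sorting or limiting, one common workaround is to wrap the union in a subquery and apply order by and limit outside it. A sketch using the dept and emp tables from the examples above (the alias t is arbitrary):

```sql
-- Sort and limit the combined result, not the individual branches.
select
t.deptno,
t.dname
from (
select d.deptno as deptno, d.dname as dname from dept d
union all
select e.deptno as deptno, e.ename as dname from emp e
) t
order by t.deptno
limit 10
;
```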


Hive data types

Basic data types

tinyint		1	-128 ~ 127
smallint	2	-2^15 ~ 2^15-1
int			4
bigint		8
float		4
double		8
boolean		1
string		
binary			bytes
timestamp		2017-06-02 11:36:22

Types that exist in Java but not in Hive:
long
char
short
byte

Complex data types

array : col array<basic type>, indexes start at 0; out-of-range access does not raise an error and returns NULL
map   : column map<string,string> 
struct: col struct
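The notes do not include a struct example, so here is a minimal sketch. The table str1 and its field names are hypothetical; struct fields are read with dot notation.

```sql
-- Hypothetical table with a struct column; its fields are separated by ',' in the data file.
create table if not exists str1(
name string,
addr struct<city:string, street:string>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
;

-- Access a struct field with dot notation.
select name, addr.city from str1;
```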

array

hive> create table if not exists arr2(
    > name string,
    > score array<String>
    > )
    > row format delimited fields terminated by '\t'
    > collection items terminated by ','
    > ;
hive> select * from arr2;
OK
zhangsan        ["78","89","92","96"]
lisi    ["67","75","83","94"]
Time taken: 0.124 seconds, Fetched: 2 row(s)
hive> select name,score[1] from arr2 where size(score)>2;
OK
zhangsan        89
lisi    75
explode: a function that expands an array into one row per element
select explode(score) score from arr2;
OK
78
89
92
96
67
75
83
94
select name,cj from arr2 lateral view explode(score) score as cj;
OK
zhangsan        78
zhangsan        89
zhangsan        92
zhangsan        96
lisi    67
lisi    75
lisi    83
lisi    94

 select name,sum(cj) as scj from arr2 lateral view explode(score) score as cj group by name;
OK
lisi    319.0
zhangsan        355.0

Collecting exploded rows back into an array

How to write data into an array column:
Prepare the data:
create table arr_temp
as
select name,cj from arr1 lateral view explode(score) score as cj;
insert into arr3
select 
name,
collect_set(cast(cj as int)) 
from arr_temp 
group by name;

insert into arr3
select 
name,
collect_list(cast(cj as int)) 
from arr_temp 
group by name;
lisi    [67,75,83,94]
zhangsan        [78,89,92,96]
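The target table arr3 is not defined in these notes. The sketch below assumes a definition that matches the inserts above, and puts collect_set and collect_list side by side to show the difference between them.

```sql
-- Hypothetical definition of the target table used by the inserts above.
create table if not exists arr3(
name string,
score array<int>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
;

-- collect_set drops duplicate values within a group; collect_list keeps them.
select
name,
collect_set(cast(cj as int)) as distinct_scores,
collect_list(cast(cj as int)) as all_scores
from arr_temp
group by name
;
```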

map



Uses a modified version of the arr data
create table if not exists map3(
name string,
score map<string,int>
)
row format delimited fields terminated by ' '
collection items terminated by ','
map keys terminated by ':'
;

Load data:
insert into map3
select name,map('chinese',score1,'math',score2,'english',score3) from map_temp1;
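Once map3 is populated, the map column can be read in a few ways. A sketch, assuming the keys 'chinese', 'math' and 'english' written by the insert above; the aliases subject and points are arbitrary.

```sql
-- Look up one key; a key that is missing from the map yields NULL.
select name, score['math'] as math from map3;

-- List the keys and values of each map, and count its entries.
select name, map_keys(score), map_values(score), size(score) from map3;

-- Turn each key/value pair into its own row.
select name, subject, points
from map3
lateral view explode(score) s as subject, points
;
```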

Correct approach
Merging with concat or concat_ws (on arrays)
select concat('$',array("1","2","3"));
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments '"3"': CONCAT only takes primitive arguments
hive> select concat_ws('&',array("1","2","3"));
OK
1&2&3
Time taken: 0.182 seconds, Fetched: 1 row(s)
select name,str_to_map(concat_ws('&',collect_list(concat_ws(':',score1,cast(score2 as string)))),'&',':')
    > from map_temp1
    > group by name;
    lisi    {"chinese":"60","math":"30","english":"78","nature":null}
wangwu  {"chinese":"89","math":null,"english":"81","nature":"9"}
zhangsan        {"chinese":"90","math":"87","english":"63","nature":"76"}
Time taken: 1.6 seconds, Fetched: 3 row(s)

流浮影
