【Hive】HiveQL实战之分析函数&窗口函数

本篇主要介绍将存储到Mysql的示例数据库Sakila以Sqoop的方式导入到Hive,然后详细讲解Hive的分析函数和窗口函数。

一 实战环境

1 Hive版本
hive> select version();
OK
2.3.3 r8a511e3f79b43d4be41cd231cf5c99e43b248383
Time taken: 0.944 seconds, Fetched: 1 row(s)
hive> 
2 Mysql数据库
mysql> select database();
+------------+
| database() |
+------------+
| sakila     |
+------------+
1 row in set (0.11 sec)
注:本实战使用Mysql示例数据库,具体安装可参考:MySQL之示例数据库Sakila下载及安装。

3 数据准备

1)方法一:MySQL——>HDFS——>Hive
--将Sakila下的City导入至HDFS
[hadoop@strong ~]$ sqoop import --connect jdbc:mysql://strong.hadoop.com:3306/sakila --username root --password root --table city --columns "city_id,city,country_id,last_upda
te" --warehouse-dir /user/sqoop1 -m 1
--查看已导入的数据
[hadoop@strong ~]$ hdfs dfs -cat /user/sqoop1/city/part-m-00000
--在Hive创建city表
hive> create table city(city_id int, city string,country_id int,last_update string)
    > row format delimited 
    > fields terminated by ',';
OK
Time taken: 1.386 seconds
--加载HDFS中的city文件
hive> load data inpath '/user/sqoop1/city/' into table city;
Loading data to table hive.city
OK
Time taken: 4.764 seconds
--查看导入到Hive的数据
hive> select *from city limit 10;
OK
1	A Corua (La Corua)	87	2006-02-15 04:45:25.0
2	Abha	82	2006-02-15 04:45:25.0
3	Abu Dhabi	101	2006-02-15 04:45:25.0
4	Acua	60	2006-02-15 04:45:25.0
5	Adana	97	2006-02-15 04:45:25.0
6	Addis Abeba	31	2006-02-15 04:45:25.0
7	Aden	107	2006-02-15 04:45:25.0
8	Adoni	44	2006-02-15 04:45:25.0
9	Ahmadnagar	44	2006-02-15 04:45:25.0
10	Akishima	50	2006-02-15 04:45:25.0
Time taken: 10.489 seconds, Fetched: 10 row(s)
2)方法二:MySQL——>Hive
--删除HDFS文件
[hadoop@strong ~]$ hdfs dfs -rm -R /user/hadoop/city
--删除City表
hive> drop table city;
OK
Time taken: 0.995 seconds
--将Sakila下的City导入至Hive下的City表
[hadoop@strong ~]$ sqoop import --connect jdbc:mysql://strong.hadoop.com:3306/sakila --username root --password root --table city --columns "city_id,city,country_id,last_upda
te" --hive-import --hive-database hive --create-hive-table --fields-terminated-by ',' -m 1
注:会出现以下错误:Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
解决办法:[hadoop@strong ~]$ cp /usr/local/apache-hive-2.3.3-bin/lib/hive-common-2.3.3.jar /usr/local/sqoop-1.4.7/lib/
--查看导入Hive的数据
hive> select *from city limit 10;
OK
1	A Corua (La Corua)	87	2006-02-15 04:45:25.0
2	Abha	82	2006-02-15 04:45:25.0
3	Abu Dhabi	101	2006-02-15 04:45:25.0
4	Acua	60	2006-02-15 04:45:25.0
5	Adana	97	2006-02-15 04:45:25.0
6	Addis Abeba	31	2006-02-15 04:45:25.0
7	Aden	107	2006-02-15 04:45:25.0
8	Adoni	44	2006-02-15 04:45:25.0
9	Ahmadnagar	44	2006-02-15 04:45:25.0
10	Akishima	50	2006-02-15 04:45:25.0
Time taken: 0.544 seconds, Fetched: 10 row(s)
--查看HDFS文件
[hadoop@strong ~]$ hdfs dfs -ls /user/hadoop/city
ls: `/user/hadoop/city': No such file or directory
注:查看Sqoop Import执行的信息18/06/28 12:34:38 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.可看出不会创建HDFS文件,执行过程中HDFS文件只是一个过渡,是个临时文件,完成后即删除。

二 Hive分析函数实战

1 样例数据
hive> select *from city where country_id in (101,102,108);
OK
city.city_id	city.city	city.country_id	city.last_update
3	Abu Dhabi	101	2006-02-15 04:45:25.0
12	al-Ayn	101	2006-02-15 04:45:25.0
88	Bradford	102	2006-02-15 04:45:25.0
149	Dundee	102	2006-02-15 04:45:25.0
280	Kragujevac	108	2006-02-15 04:45:25.0
312	London	102	2006-02-15 04:45:25.0
368	Novi Sad	108	2006-02-15 04:45:25.0
470	Sharja	101	2006-02-15 04:45:25.0
494	Southampton	102	2006-02-15 04:45:25.0
495	Southend-on-Sea	102	2006-02-15 04:45:25.0
496	Southport	102	2006-02-15 04:45:25.0
500	Stockport	102	2006-02-15 04:45:25.0
589	York	102	2006-02-15 04:45:25.0
Time taken: 0.359 seconds, Fetched: 13 row(s)
2 Row_Number函数
hive> select t.*,row_number()over(partition by country_id) rnk from city t where country_id in(101,102,108);
t.city_id	t.city	t.country_id	t.last_update	rnk
3	Abu Dhabi	101	2006-02-15 04:45:25.0	1
470	Sharja	101	2006-02-15 04:45:25.0	2
12	al-Ayn	101	2006-02-15 04:45:25.0	3
589	York	102	2006-02-15 04:45:25.0	1
500	Stockport	102	2006-02-15 04:45:25.0	2
496	Southport	102	2006-02-15 04:45:25.0	3
495	Southend-on-Sea	102	2006-02-15 04:45:25.0	4
494	Southampton	102	2006-02-15 04:45:25.0	5
312	London	102	2006-02-15 04:45:25.0	6
149	Dundee	102	2006-02-15 04:45:25.0	7
88	Bradford	102	2006-02-15 04:45:25.0	8
280	Kragujevac	108	2006-02-15 04:45:25.0	1
368	Novi Sad	108	2006-02-15 04:45:25.0	2
Time taken: 63.74 seconds, Fetched: 13 row(s)
3 Rank函数
--加上partition by 
hive> select t.*,rank()over(partition by country_id order by city_id) rnk from city t where country_id in(101,102,108);
t.city_id	t.city	t.country_id	t.last_update	rnk
3	Abu Dhabi	101	2006-02-15 04:45:25.0	1
12	al-Ayn	101	2006-02-15 04:45:25.0	2
470	Sharja	101	2006-02-15 04:45:25.0	3
88	Bradford	102	2006-02-15 04:45:25.0	1
149	Dundee	102	2006-02-15 04:45:25.0	2
312	London	102	2006-02-15 04:45:25.0	3
494	Southampton	102	2006-02-15 04:45:25.0	4
495	Southend-on-Sea	102	2006-02-15 04:45:25.0	5
496	Southport	102	2006-02-15 04:45:25.0	6
500	Stockport	102	2006-02-15 04:45:25.0	7
589	York	102	2006-02-15 04:45:25.0	8
280	Kragujevac	108	2006-02-15 04:45:25.0	1
368	Novi Sad	108	2006-02-15 04:45:25.0	2
Time taken: 57.047 seconds, Fetched: 13 row(s)
--不加上partition by 
hive> select t.*,rank()over(order by city_id) rnk from city t where country_id in(101,102,108);
t.city_id	t.city	t.country_id	t.last_update	rnk
3	Abu Dhabi	101	2006-02-15 04:45:25.0	1
12	al-Ayn	101	2006-02-15 04:45:25.0	2
88	Bradford	102	2006-02-15 04:45:25.0	3
149	Dundee	102	2006-02-15 04:45:25.0	4
280	Kragujevac	108	2006-02-15 04:45:25.0	5
312	London	102	2006-02-15 04:45:25.0	6
368	Novi Sad	108	2006-02-15 04:45:25.0	7
470	Sharja	101	2006-02-15 04:45:25.0	8
494	Southampton	102	2006-02-15 04:45:25.0	9
495	Southend-on-Sea	102	2006-02-15 04:45:25.0	10
496	Southport	102	2006-02-15 04:45:25.0	11
500	Stockport	102	2006-02-15 04:45:25.0	12
589	York	102	2006-02-15 04:45:25.0	13
Time taken: 73.697 seconds, Fetched: 13 row(s)
--使用Rank排序会出现跳号的情况,即
hive> select t.*,rank()over(order by country_id) rnk from city t where country_id in(101,102,108);
t.city_id	t.city	t.country_id	t.last_update	rnk
3	Abu Dhabi	101	2006-02-15 04:45:25.0	1
470	Sharja	101	2006-02-15 04:45:25.0	1
12	al-Ayn	101	2006-02-15 04:45:25.0	1
589	York	102	2006-02-15 04:45:25.0	4
500	Stockport	102	2006-02-15 04:45:25.0	4
496	Southport	102	2006-02-15 04:45:25.0	4
495	Southend-on-Sea	102	2006-02-15 04:45:25.0	4
494	Southampton	102	2006-02-15 04:45:25.0	4
312	London	102	2006-02-15 04:45:25.0	4
149	Dundee	102	2006-02-15 04:45:25.0	4
88	Bradford	102	2006-02-15 04:45:25.0	4
280	Kragujevac	108	2006-02-15 04:45:25.0	12
368	Novi Sad	108	2006-02-15 04:45:25.0	12
Time taken: 57.533 seconds, Fetched: 13 row(s)
注:Rank函数排序会出现跳号。

4 Dense_rank函数
hive> select t.*,dense_rank()over(order by country_id) rnk from city t where country_id in(101,102,108);
t.city_id	t.city	t.country_id	t.last_update	rnk
3	Abu Dhabi	101	2006-02-15 04:45:25.0	1
470	Sharja	101	2006-02-15 04:45:25.0	1
12	al-Ayn	101	2006-02-15 04:45:25.0	1
589	York	102	2006-02-15 04:45:25.0	2
500	Stockport	102	2006-02-15 04:45:25.0	2
496	Southport	102	2006-02-15 04:45:25.0	2
495	Southend-on-Sea	102	2006-02-15 04:45:25.0	2
494	Southampton	102	2006-02-15 04:45:25.0	2
312	London	102	2006-02-15 04:45:25.0	2
149	Dundee	102	2006-02-15 04:45:25.0	2
88	Bradford	102	2006-02-15 04:45:25.0	2
280	Kragujevac	108	2006-02-15 04:45:25.0	3
368	Novi Sad	108	2006-02-15 04:45:25.0	3
Time taken: 60.528 seconds, Fetched: 13 row(s)
注:Dense_rank函数排序不会出现跳号。

5 Lead函数
hive> select t.*, lead(city_id)over(partition by country_id order by city_id) next_city  from city t where country_id in(101,102,108);
t.city_id	t.city	t.country_id	t.last_update	next_city
3	Abu Dhabi	101	2006-02-15 04:45:25.0	12
12	al-Ayn	101	2006-02-15 04:45:25.0	470
470	Sharja	101	2006-02-15 04:45:25.0	NULL
88	Bradford	102	2006-02-15 04:45:25.0	149
149	Dundee	102	2006-02-15 04:45:25.0	312
312	London	102	2006-02-15 04:45:25.0	494
494	Southampton	102	2006-02-15 04:45:25.0	495
495	Southend-on-Sea	102	2006-02-15 04:45:25.0	496
496	Southport	102	2006-02-15 04:45:25.0	500
500	Stockport	102	2006-02-15 04:45:25.0	589
589	York	102	2006-02-15 04:45:25.0	NULL
280	Kragujevac	108	2006-02-15 04:45:25.0	368
368	Novi Sad	108	2006-02-15 04:45:25.0	NULL
Time taken: 59.491 seconds, Fetched: 13 row(s)
注: lead(city_id)超出边界,将返回NULL值。

6 LAG函数
hive> select t.*, lag(city_id)over(partition by country_id order by city_id) prior_city  from city t where country_id in(101,102,108);
t.city_id	t.city	t.country_id	t.last_update	prior_city
3	Abu Dhabi	101	2006-02-15 04:45:25.0	NULL
12	al-Ayn	101	2006-02-15 04:45:25.0	3
470	Sharja	101	2006-02-15 04:45:25.0	12
88	Bradford	102	2006-02-15 04:45:25.0	NULL
149	Dundee	102	2006-02-15 04:45:25.0	88
312	London	102	2006-02-15 04:45:25.0	149
494	Southampton	102	2006-02-15 04:45:25.0	312
495	Southend-on-Sea	102	2006-02-15 04:45:25.0	494
496	Southport	102	2006-02-15 04:45:25.0	495
500	Stockport	102	2006-02-15 04:45:25.0	496
589	York	102	2006-02-15 04:45:25.0	500
280	Kragujevac	108	2006-02-15 04:45:25.0	NULL
368	Novi Sad	108	2006-02-15 04:45:25.0	280
Time taken: 56.631 seconds, Fetched: 13 row(s)
--指定偏移量,默认偏移量为1
hive> select t.*, lag(city_id,2,'No Data')over(partition by country_id order by city_id) next_city  from city t where country_id in(101,102,108);
t.city_id	t.city	t.country_id	t.last_update	next_city
3	Abu Dhabi	101	2006-02-15 04:45:25.0	NULL
12	al-Ayn	101	2006-02-15 04:45:25.0	NULL
470	Sharja	101	2006-02-15 04:45:25.0	3
88	Bradford	102	2006-02-15 04:45:25.0	NULL
149	Dundee	102	2006-02-15 04:45:25.0	NULL
312	London	102	2006-02-15 04:45:25.0	88
494	Southampton	102	2006-02-15 04:45:25.0	149
495	Southend-on-Sea	102	2006-02-15 04:45:25.0	312
496	Southport	102	2006-02-15 04:45:25.0	494
500	Stockport	102	2006-02-15 04:45:25.0	495
589	York	102	2006-02-15 04:45:25.0	496
280	Kragujevac	108	2006-02-15 04:45:25.0	NULL
368	Novi Sad	108	2006-02-15 04:45:25.0	NULL
Time taken: 66.487 seconds, Fetched: 13 row(s)
--返回当前值的前一个和后一个
hive> select t.*, lag(city_id)over(partition by country_id order by city_id) prior_city,lead(city_id)over(partition by country_id order by city_id) next_city  from city t where country_id in(101,102,108);
t.city_id	t.city	t.country_id	t.last_update	prior_city	next_city
3	Abu Dhabi	101	2006-02-15 04:45:25.0	NULL	12
12	al-Ayn	101	2006-02-15 04:45:25.0	3	470
470	Sharja	101	2006-02-15 04:45:25.0	12	NULL
88	Bradford	102	2006-02-15 04:45:25.0	NULL	149
149	Dundee	102	2006-02-15 04:45:25.0	88	312
312	London	102	2006-02-15 04:45:25.0	149	494
494	Southampton	102	2006-02-15 04:45:25.0	312	495
495	Southend-on-Sea	102	2006-02-15 04:45:25.0	494	496
496	Southport	102	2006-02-15 04:45:25.0	495	500
500	Stockport	102	2006-02-15 04:45:25.0	496	589
589	York	102	2006-02-15 04:45:25.0	500	NULL
280	Kragujevac	108	2006-02-15 04:45:25.0	NULL	368
368	Novi Sad	108	2006-02-15 04:45:25.0	280	NULL
Time taken: 52.446 seconds, Fetched: 13 row(s)
7 First_Value函数
hive> select t.*, first_value(city_id)over(partition by country_id order by city_id) first_city  from city t where country_id in(101,102,108);
t.city_id	t.city	t.country_id	t.last_update	first_city
3	Abu Dhabi	101	2006-02-15 04:45:25.0	3
12	al-Ayn	101	2006-02-15 04:45:25.0	3
470	Sharja	101	2006-02-15 04:45:25.0	3
88	Bradford	102	2006-02-15 04:45:25.0	88
149	Dundee	102	2006-02-15 04:45:25.0	88
312	London	102	2006-02-15 04:45:25.0	88
494	Southampton	102	2006-02-15 04:45:25.0	88
495	Southend-on-Sea	102	2006-02-15 04:45:25.0	88
496	Southport	102	2006-02-15 04:45:25.0	88
500	Stockport	102	2006-02-15 04:45:25.0	88
589	York	102	2006-02-15 04:45:25.0	88
280	Kragujevac	108	2006-02-15 04:45:25.0	280
368	Novi Sad	108	2006-02-15 04:45:25.0	280
Time taken: 78.285 seconds, Fetched: 13 row(s)
8 Last_value函数
hive> select t.*, last_value(city_id)over(partition by country_id order by city_id rows between unbounded preceding and unbounded following) last_city  from city t where country_id in(101,102,108);
t.city_id	t.city	t.country_id	t.last_update	last_city
3	Abu Dhabi	101	2006-02-15 04:45:25.0	470
12	al-Ayn	101	2006-02-15 04:45:25.0	470
470	Sharja	101	2006-02-15 04:45:25.0	470
88	Bradford	102	2006-02-15 04:45:25.0	589
149	Dundee	102	2006-02-15 04:45:25.0	589
312	London	102	2006-02-15 04:45:25.0	589
494	Southampton	102	2006-02-15 04:45:25.0	589
495	Southend-on-Sea	102	2006-02-15 04:45:25.0	589
496	Southport	102	2006-02-15 04:45:25.0	589
500	Stockport	102	2006-02-15 04:45:25.0	589
589	York	102	2006-02-15 04:45:25.0	589
280	Kragujevac	108	2006-02-15 04:45:25.0	368
368	Novi Sad	108	2006-02-15 04:45:25.0	368
Time taken: 49.807 seconds, Fetched: 13 row(s)
注:在使用first_value和last_value时需加上窗口rows between unbounded preceding and unbounded following,否则会出现预想不到的结果!!!因为默认是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW。

9 Count函数
hive> select country_id,count(1) cqty from city where country_id in(101,102,108) group by country_id;
country_id	cqty
101	3
102	8
108	2
Time taken: 56.987 seconds, Fetched: 3 row(s)
10 Sum函数
hive> select country_id,sum(city_id) csum from city where country_id in(101,102,108) group by country_id;
country_id	csum
101	485
102	3123
108	648
Time taken: 57.993 seconds, Fetched: 3 row(s)
11 Avg函数
hive> select country_id,avg(city_id) cavg from city where country_id in(101,102,108) group by country_id;
country_id	cavg
101	161.66666666666666
102	390.375
108	324.0
Time taken: 54.42 seconds, Fetched: 3 row(s)
12 Max函数
hive> select country_id,max(city_id) cmax from city where country_id in(101,102,108) group by country_id;
country_id	cmax
101	470
102	589
108	368
Time taken: 58.02 seconds, Fetched: 3 row(s)
13 Min函数
hive> select country_id,min(city_id) cmax from city where country_id in(101,102,108) group by country_id;
country_id	cmax
101	3
102	88
108	280
Time taken: 57.602 seconds, Fetched: 3 row(s)
注:以上5个函数都可以在后面添加Over和窗口。
窗口有如下规范:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING



猜你喜欢

转载自blog.csdn.net/alen_liu_sz/article/details/80845565
今日推荐