Hive row_number() Function: Usage Explained with Examples

Contents

I. Introduction to the row_number() function in Hive

II. Usage examples

III. Summary

IV. Appendix



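In Oracle, we often use row_number() over(partition by col1 order by clo2 desc) to pick, among the records that share the same col1 value, the one or several records with the largest clo2. Does Hive provide row_number() as well, and how exactly is it used? The concrete examples below answer both questions.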

I. Introduction to the row_number() function in Hive

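Hive 0.11.0 includes row_number as a built-in window function. It is registered, together with the other window functions, in:

    org.apache.hadoop.hive.ql.exec.FunctionRegistry

    registerHiveUDAFsAsWindowFunctions();
    registerWindowFunction("row_number", new GenericUDAFRowNumber());  // implementation class of row_number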
    registerWindowFunction("rank", new GenericUDAFRank());
    registerWindowFunction("dense_rank", new GenericUDAFDenseRank());
    registerWindowFunction("percent_rank", new GenericUDAFPercentRank());
    registerWindowFunction("cume_dist", new GenericUDAFCumeDist());
    registerWindowFunction("ntile", new GenericUDAFNTile());
    registerWindowFunction("first_value", new GenericUDAFFirstValue());
    registerWindowFunction("last_value", new GenericUDAFLastValue());
    registerWindowFunction(LEAD_FUNC_NAME, new GenericUDAFLead(), false);
    registerWindowFunction(LAG_FUNC_NAME, new GenericUDAFLag(), false);
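
To see how the first three registered functions differ, the sketch below numbers the rows of the tmp_test table from the usage examples section with each of them. It is an illustration only: the aliases rn, rk, and drk are made up here, and the whole table is treated as a single window ordered by col1.

```sql
-- Assumes the tmp_test table and test data from the usage examples section.
-- col1 holds the values 1, 2, 3, 3, 3, 4, 4, so it contains ties.
select col1,
       clo2,
       row_number() over (order by col1) as rn,   -- always unique: 1 through 7
       rank()       over (order by col1) as rk,   -- ties share a rank, gaps follow: 1, 2, 3, 3, 3, 6, 6
       dense_rank() over (order by col1) as drk   -- ties share a rank, no gaps: 1, 2, 3, 3, 3, 4, 4
from tmp_test;
```

Within a group of tied rows, the order in which row_number() hands out its values is not defined, so a deduplication query should sort by a column that is unique inside each partition.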

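II. Usage examples

Goal: deduplicate the tmp_test table on col1, keeping for each col1 value the record with the largest clo2, and load the result into the tmp_test_c table.

1. Create the test table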

create table tmp_test(col1 string,clo2 string);

2. Insert the test data

insert into table tmp_test
select 1,'str1' from dual
union all
select 2,'str2' from dual
union all
select 3,'str3' from dual
union all
select 3,'str31' from dual
union all
select 3,'str33' from dual
union all
select 4,'str41' from dual
union all
select 4,'str42' from dual;

Note: dual is a table that was created in Hive beforehand; Hive does not provide it out of the box.
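
If no such table exists yet, it could be set up roughly as follows. This is a sketch under assumptions: the column name dummy and the value 'X' are arbitrary choices, and INSERT ... VALUES needs Hive 0.14 or later.

```sql
-- Hypothetical one-row helper table that mimics Oracle's dual.
create table if not exists dual(dummy string);

-- Run this only once: the table must contain exactly one row.
insert into table dual values ('X');
```

The single row matters: with n rows in dual, each "select ... from dual" branch of the insert statement above would emit n copies of its record.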

3. View the test data

select * from tmp_test;

col1    clo2
-----   ------
1       str1
2       str2
3       str3
3       str31
3       str33
4       str41
4       str42

4. Query the data with the row_number() function

select t.*,row_number() over(distribute by col1 sort by clo2 desc) rn
from tmp_test t;

col1    clo2    rn
-----   ------  ----
1       str1    1
2       str2    1
3       str33   1
3       str31   2
3       str3    3
4       str42   1
4       str41   2
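
The same query also covers the "several records" case mentioned in the introduction. As a sketch, filtering on rn <= 2 keeps the two largest clo2 values for each col1 (the threshold 2 is only an example):

```sql
-- Top 2 rows per col1, ranked by clo2 in descending order.
select col1, clo2, rn
from
(
select t.*, row_number() over(distribute by col1 sort by clo2 desc) rn
from tmp_test t
) tt
where tt.rn <= 2;
```

Because clo2 is a string column, the sort is lexicographic, which is why 'str33' ranks ahead of 'str31' and 'str3'. With the test data, this query returns every row except (3, str3).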

5. Store the target data in a staging table

drop table if exists tmp_test_c;
create table tmp_test_c
as
select *
from
(
select t.*,row_number() over(distribute by col1 sort by clo2 desc) rn
from tmp_test t
)tt
where tt.rn=1;

select * from tmp_test_c;

col1    clo2    rn
-----   ------  ----
1       str1    1
2       str2    1
3       str33   1
4       str42   1

6. Create a table and insert the data into it

create table tmp_test_d(col1 string,clo2 string);

insert into table tmp_test_d
select col1,clo2
from
(
select t.*,row_number() over(distribute by col1 sort by clo2 desc) rn
from tmp_test t
)tt
where tt.rn=1;

select * from tmp_test_d;

col1    clo2
-----   ------
1       str1
2       str2
3       str33
4       str42
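
A quick way to confirm the deduplication is to look for col1 values that still occur more than once. The check below is a sketch; the alias cnt is arbitrary.

```sql
-- Returns no rows if tmp_test_d holds exactly one record per col1.
select col1, count(*) as cnt
from tmp_test_d
group by col1
having count(*) > 1;
```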

III. Summary

The approach above works on newer Hive versions, but on older versions the query fails and does not run. For example:

select col1,clo2
from
(
select t.*,row_number() over(distribute by col1 sort by clo2 desc) rn
from tmp_test t
)tt
where tt.rn=1;

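Error: FAILED: NullPointerException null

Older versions do not support the over(...) clause. There, the column has to be passed inside the parentheses of row_number(), and the distribution and sorting move into a subquery. Since the built-in window function only exists from Hive 0.11.0 on, this form presumably relies on a row_number UDF having been registered in the session. Rewritten as follows, the query runs: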

select col1,clo2,row_number(clo2) as rn
from
(select col1,clo2
from tmp_test
distribute by col1
sort by clo2 desc) a
where row_number(clo2)=1;
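
When only the top record per group is needed and no window functions are available, a plain aggregate is another option. The sketch below is an alternative to the row_number(clo2) form above rather than part of the original method; the alias max_clo2 is made up here.

```sql
-- One row per col1 with its largest clo2 (string comparison).
select col1, max(clo2) as max_clo2
from tmp_test
group by col1;
```

This only replaces the rn = 1 case, and only when the ordering column itself is the value wanted. Fetching other columns of the winning row, or the top N rows, still needs a ranking function or a join back to the table.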

IV. Appendix

Usage of the row_number() function in Oracle:

select col1,clo2
from
(
select t.*,row_number() over(partition by col1 order by clo2 desc) rn
from tmp_test t
)
where rn=1;

Reposted from blog.csdn.net/silentwolfyh/article/details/89533846