Article directory
- Copyright Notice
- Functions
-
- 1 Function classification
- 2 View function list
- 3 Math functions
- 4 String functions
-
- String length function: length
- String reversal function: reverse
- String concatenation function: concat
- String concatenation function - with delimiter: concat_ws
- String interception function: substr, substring
- String interception function (with length): substr, substring
- String to uppercase function: upper, ucase
- String to lowercase function: lower, lcase
- Remove spaces function: trim
- Function to remove spaces on the left: ltrim
- Function to remove spaces on the right: rtrim
- Regular expression replacement function: regexp_replace
- URL parsing function: parse_url
- Split string function: split
- 5 Date functions
-
- Get the current UNIX timestamp function: unix_timestamp
- UNIX timestamp to date function: from_unixtime
- Date to UNIX timestamp function: unix_timestamp
- Function to convert date in specified format to UNIX timestamp: unix_timestamp
- Date time to date function: to_date
- Date to year function: year
- Date to month function: month
- Date to day function: day
- Date week function: weekofyear
- Date comparison function: datediff
- Date addition function: date_add
- Date subtraction function: date_sub
- 6 Conditional functions
- 7 Conversion functions
- 8 Hive row to column conversion
- 9 Hive table generation function
- 10 Hive’s windowing function
- 11 Hive custom function
-
- Overview
- Custom UDF
-
- Step 1: Create a maven java project and import the jar package
- Step 2: Develop a java class to inherit UDF and overload the evaluate method
- Step 3: Package the project and upload it to hive’s lib directory
- Step 4: Add our jar package
- Step 5: Set the function to associate with our custom function
- Step 6: Use custom functions
- Custom UDTF
Copyright Notice
- The content of this blog is based on my personal study notes from the Dark Horse Programmer course. I hereby declare that all copyrights belong to Dark Horse Programmers or related rights holders. The purpose of this blog is only for personal learning and communication, not commercial use.
- I try my best to ensure accuracy when organizing my study notes, but I cannot guarantee the completeness and timeliness of the content. The content of this blog may become outdated over time or require updating.
- If you are from Dark Horse Programmers or a related rights holder and find any copyright infringement, please contact me promptly and I will delete the content immediately or make the necessary modifications.
- For other readers, please abide by relevant laws, regulations, and ethical principles when reading this blog, refer to it with caution, and bear any resulting risks and responsibilities yourself. Some of the views and opinions in this blog are my own and do not represent the position of Dark Horse Programmers.
Functions
1 Function classification
- Hive's functions are divided into two categories: built-in functions (Built-in Functions) and user-defined functions, UDF (User-Defined Functions)
2 View function list
- Use show functions to view all currently available functions;
- Use describe function extended func_name to see how a specific function is used.
--view all functions
show functions;
--view how a specific function is used
describe function extended func_name;
3 Math functions
Rounding function: round
- Syntax: round(double a)
- Return value: BIGINT
- Description: Returns the double value rounded to the nearest integer
- Example:
select round(3.1415926);
Specify precision rounding function: round
- Syntax: round(double a, int d)
- Return value: DOUBLE
- Description: Returns the double value rounded to d decimal places
- Example:
select round(3.1415926,4);
Round down function: floor
- Syntax: floor(double a)
- Return value: BIGINT
- Description: Returns the largest integer equal to or less than the double variable
- Example:
select floor(3.1415926);
Round up function: ceil
- Syntax: ceil(double a)
- Return value: BIGINT
- Description: Returns the smallest integer equal to or greater than the double variable
- Example:
select ceil(3.1415926);
Get a random number function: rand
- Syntax: rand(),rand(int seed)
- Return value: double
- Description: Returns a random number in the range of 0 to 1. If the seed is specified, a fixed random number will be returned.
- Example:
select rand();
0.5577432776034763
Power operation function: pow
- Syntax: pow(double a, double p)
- Return value: double
- Description: Returns a raised to the power p
- Example:
select pow(2,4) ;
16.0
Absolute value function: abs
- Syntax: abs(double a), abs(int a)
- Return value: double or int (matching the input type)
- Description: Returns the absolute value of a
- Example:
select abs(-3.9);
3.9
4 String functions
String length function: length
- Syntax: length(string A)
- Return value: int
- Description: Returns the length of string A
- Example:
select length('abcedfg');
7
String reversal function: reverse
- Syntax: reverse(string A)
- Return value: string
- Description: Returns string A reversed
- Example:
hive> select reverse('abcedfg');
gfdecba
String concatenation function: concat
- Syntax: concat(string A, string B…)
- Return value: string
- Description: Returns the result after input string concatenation, supports any number of input strings
- Example:
hive> select concat('abc','def','gh');
abcdefgh
String concatenation function - with delimiter: concat_ws
- Syntax: concat_ws(string SEP, string A, string B…)
- Return value: string
- Description: Returns the concatenation of the input strings, with SEP as the separator between each pair
- Example:
hive> select concat_ws(',','abc','def','gh');
abc,def,gh
String interception function: substr, substring
- Syntax: substr(string A, int start), substring(string A, int start)
- Return value: string
- Description: Returns the substring of A from position start to the end of the string
- Example:
hive> select substr('abcde',3);
cde
hive> select substr('abcde',-1);
e
String interception function (with length): substr, substring
- Syntax: substr(string A, int start, int len), substring(string A, int start, int len)
- Return value: string
- Description: Returns the substring of A starting at position start with length len
- Example:
hive> select substr('abcde',3,2);
cd
hive> select substring('abcde',3,2);
cd
hive> select substring('abcde',-2,2);
de
String to uppercase function: upper, ucase
- Syntax: upper(string A), ucase(string A)
- Return value: string
- Description: Returns string A in uppercase
- Example:
hive> select upper('abSEd');
ABSED
hive> select ucase('abSEd');
ABSED
String to lowercase function: lower,lcase
- Syntax: lower(string A), lcase(string A)
- Return value: string
- Description: Returns string A in lowercase
- Example:
hive> select lower('abSEd');
absed
hive> select lcase('abSEd');
absed
Remove spaces function: trim
- Syntax: trim(string A)
- Return value: string
- Description: Removes the spaces on both sides of the string
- Example:
hive> select trim(' abc ');
abc
Function to remove spaces on the left: ltrim
- Syntax: ltrim(string A)
- Return value: string
- Description: Removes the spaces on the left side of the string
- Example:
hive> select ltrim(' abc ');
abc
Function to remove spaces on the right: rtrim
- Syntax: rtrim(string A)
- Return value: string
- Description: Remove the spaces on the right side of the string
- Example:
hive> select rtrim(' abc ');
abc
Regular expression replacement function: regexp_replace
- Syntax: regexp_replace(string A, string B, string C)
- Return value: string
- Description: Replaces the parts of string A that match the Java regular expression B with C
- Note that in some cases escape characters must be used, similar to the regexp_replace function in Oracle
- Example:
hive> select regexp_replace('foobar', 'oo|ar', '');
fb
URL parsing function: parse_url
- Syntax: parse_url(string urlString, string partToExtract [, string keyToExtract])
- Valid values for partToExtract are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO
- Return value: string
- Description: Returns the specified part of the URL.
- Example:
hive> select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST');
facebook.com
hive> select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'PATH');
/path1/p.php
hive> select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY','k1');
v1
Split string function: split
- Syntax: split(string str, string pat)
- Return value: array
- Description: Splits str by the pattern pat and returns the resulting array of strings
- Example:
hive> select split('abtcdtef','t');
["ab","cd","ef"]
5 Date functions
Get the current UNIX timestamp function: unix_timestamp
- Syntax: unix_timestamp()
- Return value: bigint
- Description: Returns the UNIX timestamp of the current time in the current time zone
- Example:
hive> select unix_timestamp();
1323309615
UNIX timestamp to date function: from_unixtime
- Syntax: from_unixtime(bigint unixtime[, string format])
- Return value: string
- Description: Converts a UNIX timestamp (the number of seconds since 1970-01-01 00:00:00 UTC) to a date string in the current time zone
- Example:
hive> select from_unixtime(1323308943,'yyyyMMdd');
20111208
Date to UNIX timestamp function: unix_timestamp
- Syntax: unix_timestamp(string date)
- Return value: bigint
- Description: Converts a date in the format "yyyy-MM-dd HH:mm:ss" to a UNIX timestamp; returns 0 if the conversion fails
- Example:
hive> select unix_timestamp('2011-12-07 13:01:03');
1323234063
Function to convert date in specified format to UNIX timestamp: unix_timestamp
- Syntax: unix_timestamp(string date, string pattern)
- Return value: bigint
- Description: Converts a date in the given pattern format to a UNIX timestamp; returns 0 if the conversion fails
- Example:
hive> select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss');
1323234063
Date time to date function: to_date
- Syntax: to_date(string timestamp)
- Return value: string
- Description: Returns the date part of a datetime value
- Example:
hive> select to_date('2011-12-08 10:03:01');
2011-12-08
Date to year function: year
- Syntax: year(string date)
- Return value: int
- Description: Returns the year of the date
- Example:
hive> select year('2011-12-08 10:03:01');
2011
hive> select year('2012-12-08');
2012
Date to month function: month
- Syntax: month(string date)
- Return value: int
- Description: Returns the month of the date
- Example:
hive> select month('2011-12-08 10:03:01');
12
hive> select month('2011-08-08');
8
Date to day function: day
- Syntax: day(string date)
- Return value: int
- Description: Returns the day of the date
- Example:
hive> select day('2011-12-08 10:03:01');
8
hive> select day('2011-12-24');
24
- Similarly, there are hour, minute, and second functions to obtain hours, minutes, and seconds respectively.
select hour('2023-05-11 10:36:59');
select minute('2023-05-11 10:36:59');
select second('2023-05-11 10:36:59');
Date week function: weekofyear
- Syntax: weekofyear(string date)
- Return value: int
- Description: Returns the week number of the year in which the date falls
- Example:
hive> select weekofyear('2011-12-08 10:03:01');
49
Date comparison function: datediff
- Syntax: datediff(string enddate, string startdate)
- Return value: int
- Description: Returns the number of days from startdate to enddate (enddate minus startdate)
- Example:
hive> select datediff('2012-12-08','2012-05-09');
213
Date addition function: date_add
- Syntax: date_add(string startdate, int days)
- Return value: string
- Description: Returns the date obtained by adding days days to startdate
- Example:
hive> select date_add('2012-12-08',10);
2012-12-18
Date subtraction function: date_sub
- Syntax: date_sub(string startdate, int days)
- Return value: string
- Description: Returns the date obtained by subtracting days days from startdate
- Example:
hive> select date_sub('2012-12-08',10);
2012-11-28
6 Conditional functions
if function: if
- Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
- Return value: T
- Description: Returns valueTrue when testCondition is TRUE; otherwise returns valueFalseOrNull
- Example:
hive> select if(1=2,100,200);
200
hive> select if(1=1,100,200);
100
Conditional judgment function: CASE
- Syntax: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
- Return value: T
- Description: If a equals b, return c; if a equals d, return e; otherwise return f
- Example:
hive> select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end;
mary
hive> select case 200 when 50 then 'tom' when 100 then 'mary' else 'tim' end;
tim
Conditional judgment function: CASE
- Syntax: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
- Return value: T
- Description: If a is TRUE, return b; if c is TRUE, return d; otherwise return e
- Example:
hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end;
mary
hive> select case when 1=1 then 'tom' when 2=2 then 'mary' else 'tim' end;
tom
7 Conversion functions
Hive performs explicit type conversion with the cast() function.
- The cast function can convert, for example, the string "20190607" into int-type data.
- Syntax:
cast(expression as data_type)
cast("20190607" as int)
select cast('2017-06-12' as date) as field;
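As a quick sketch of cast's failure behavior (based on common Hive behavior; verify on your version): a value that cannot be parsed casts to NULL rather than raising an error.

```sql
-- a non-numeric string cannot be parsed as int, so the cast yields NULL
select cast('abc' as int);
-- a well-formed numeric string converts normally
select cast('20190607' as int);
```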
8 Hive row to column conversion
Introduction
- Row-to-column conversion refers to collapsing multiple rows of data into a single field.
- Functions used in Hive row-to-column conversion:
- concat(str1,str2,…) -- concatenates fields or strings
- concat_ws(sep, str1,str2) -- concatenates strings with a delimiter
- collect_set(col) -- deduplicates and aggregates the values of a field into an array-type field
Test Data:
- Field: deptno ename
20 SMITH
30 ALLEN
30 WARD
20 JONES
30 MARTIN
30 BLAKE
10 CLARK
20 SCOTT
10 KING
30 TURNER
20 ADAMS
30 JAMES
20 FORD
10 MILLER
Steps
- Create table
create table emp(
deptno int,
ename string
) row format delimited fields terminated by '\t';
- Insert data:
load data local inpath "/opt/data/emp.txt" into table emp;
- Convert
select deptno,concat_ws("|",collect_set(ename)) as ems from emp group by deptno;
- Row to column with COLLECT_SET(col): the function accepts only primitive data types; its main purpose is to deduplicate and aggregate the values of a field into an array-type field.
- View results
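With the test data above, the grouped output should look roughly like this (the order of names inside each collect_set array is not guaranteed):

```
10  CLARK|KING|MILLER
20  SMITH|JONES|SCOTT|ADAMS|FORD
30  ALLEN|WARD|MARTIN|BLAKE|TURNER|JAMES
```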
9 Hive table generation function
explode function
- explode(col): splits the complex array or map structure in one Hive column into multiple rows
- explode(ARRAY) generates one row for each element in the array
- explode(MAP) generates one row for each key-value pair in the map, with the key as one column and the value as another
data:
10 CLARK|KING|MILLER
20 SMITH|JONES|SCOTT|ADAMS|FORD
30 ALLEN|WARD|MARTIN|BLAKE|TURNER|JAMES
Create table:
create table emp( deptno int, names array<string> )
row format delimited fields terminated by '\t'
collection items terminated by '|';
Insert data
load data local inpath "/server/data/hivedatas/emp3.txt" into table emp;
Query data
select * from emp;
- Query using explode
select explode(names) as name from emp;
LATERAL VIEW (lateral view)
- Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias
- Explanation: used together with split, explode, and other UDTFs; it expands one column of data into multiple rows, and aggregations can then be run over the expanded data.
Column to row
select deptno,name from emp lateral view explode(names) tmp_tb as name;
Reflect function
- The reflect function can support calling the built-in functions in Java in SQL
Use max in java.lang.Math to find the maximum of two columns
--create hive table
create table test_udf(col1 int,col2 int)
row format delimited fields terminated by ',';
--prepare data: test_udf.txt
1,2
4,3
6,4
7,5
5,6
--load data
load data local inpath '/root/hivedata/test_udf.txt' into table test_udf;
--use max in java.lang.Math to find the maximum of the two columns
select reflect("java.lang.Math","max",col1,col2) from test_udf;
Different rows can invoke different Java built-in functions
--create hive table
create table test_udf2(class_name string,method_name string,col1 int , col2 int) row format delimited fields terminated by ',';
--prepare data: test_udf2.txt
java.lang.Math,min,1,2
java.lang.Math,max,2,3
--load data
load data local inpath '/root/hivedata/test_udf2.txt' into table test_udf2;
--run the query
select reflect(class_name,method_name,col1,col2) from test_udf2;
10 Hive’s windowing function
Window functions (1): NTILE, ROW_NUMBER, RANK, DENSE_RANK
Data preparation
cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
cookie2,2018-04-10,2
cookie2,2018-04-11,3
cookie2,2018-04-12,5
cookie2,2018-04-13,6
cookie2,2018-04-14,3
cookie2,2018-04-15,9
cookie2,2018-04-16,7
CREATE TABLE itcast_t2 (
cookieid string,
createtime string, --day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
--load data:
load data local inpath '/root/hivedata/itcast_t2.dat' into table itcast_t2;
ROW_NUMBER
- ROW_NUMBER() generates the sequence number of each row within its partition, in order, starting from 1
SELECT
cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM itcast_t2;
RANK and DENSE_RANK
- RANK() generates the rank of each row within its partition; equal values receive equal ranks, leaving gaps in the ranking.
- DENSE_RANK() generates the rank of each row within its partition; equal values receive equal ranks, with no gaps in the ranking.
SELECT
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM itcast_t2
WHERE cookieid = 'cookie1';
Window functions (2): SUM, AVG, MIN, MAX
Data preparation
--create table:
create table itcast_t1(
cookieid string,
createtime string, --day
pv int
) row format delimited
fields terminated by ',';
--load data:
load data local inpath '/root/hivedata/itcast_t1.dat' into table itcast_t1;
cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
--enable smart local mode
SET hive.exec.mode.local.auto=true;
SUM (the result depends on ORDER BY; the default sort order is ascending)
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime) as pv1
from itcast_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid
order by createtime
rows between unbounded preceding and current row) as pv2
from itcast_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid) as pv3
from itcast_t1; --without an order by clause, all rows in the partition are summed by default
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid
order by createtime
rows between 3 preceding and current row) as pv4
from itcast_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid
order by createtime
rows between 3 preceding and 1 following) as pv5
from itcast_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid
order by createtime
rows between current row and unbounded following) as pv6
from itcast_t1;
--pv1: cumulative pv from the partition start to the current row; e.g. pv1 on the 11th = pv on the 10th + pv on the 11th; on the 12th = 10th + 11th + 12th
--pv2: same as pv1
--pv3: sum of all pv within the partition (cookie1)
--pv4: current row plus the previous 3 rows; e.g. 11th = 10th + 11th; 12th = 10th + 11th + 12th; 13th = 10th + 11th + 12th + 13th; 14th = 11th + 12th + 13th + 14th
--pv5: current row plus the previous 3 rows and the following 1 row; e.g. 14th = 11th + 12th + 13th + 14th + 15th = 5+7+3+2+4 = 21
--pv6: current row plus all following rows; e.g. 13th = 13th + 14th + 15th + 16th = 3+2+4+4 = 13; 14th = 14th + 15th + 16th = 2+4+4 = 10
/*
- If rows between is not specified, the frame defaults to partition start through the current row;
- If order by is not specified, all values within the partition are aggregated;
- The key is understanding rows between, also called the window clause:
- preceding: rows before the current row
- following: rows after the current row
- current row: the current row
- unbounded: the boundary of the partition
- unbounded preceding: from the start of the partition
- unbounded following: to the end of the partition
*/
AVG,MIN,MAX
- AVG, MIN, and MAX are used the same way as SUM
select cookieid,createtime,pv,
avg(pv) over(partition by cookieid order by createtime
rows between unbounded preceding and current row) as pv2
from itcast_t1;
select cookieid,createtime,pv,
max(pv) over(partition by cookieid order by createtime
rows between unbounded preceding and current row) as pv2
from itcast_t1;
select cookieid,createtime,pv,
min(pv) over(partition by cookieid order by createtime
rows between unbounded preceding and current row) as pv2
from itcast_t1;
Window functions (3): LAG, LEAD, FIRST_VALUE, LAST_VALUE
Prepare data
cookie1,2018-04-10 10:00:02,url2
cookie1,2018-04-10 10:00:00,url1
cookie1,2018-04-10 10:03:04,1url3
cookie1,2018-04-10 10:50:05,url6
cookie1,2018-04-10 11:00:00,url7
cookie1,2018-04-10 10:10:00,url4
cookie1,2018-04-10 10:50:01,url5
cookie2,2018-04-10 10:00:02,url22
cookie2,2018-04-10 10:00:00,url11
cookie2,2018-04-10 10:03:04,1url33
cookie2,2018-04-10 10:50:05,url66
cookie2,2018-04-10 11:00:00,url77
cookie2,2018-04-10 10:10:00,url44
cookie2,2018-04-10 10:50:01,url55
CREATE TABLE itcast_t4 (
cookieid string,
createtime string, --页面访问时间
url STRING --被访问页面
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
--load data:
load data local inpath '/root/hivedata/itcast_t4.dat' into table itcast_t4;
LAG
- LAG(col,n,DEFAULT) returns the value of col from the nth row before the current row within the window
- The first parameter is the column name
- The second parameter is n, the number of rows to look back (optional, default 1)
- The third parameter is the default value (used when the nth previous row is NULL; if not specified, NULL is returned)
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM itcast_t4;
--last_1_time: the value 1 row back, with default '1970-01-01 00:00:00'
Row 1 of cookie1: 1 row back is NULL, so the default 1970-01-01 00:00:00 is used
Row 3 of cookie1: 1 row back is the row-2 value, 2018-04-10 10:00:02
Row 6 of cookie1: 1 row back is the row-5 value, 2018-04-10 10:50:01
--last_2_time: the value 2 rows back, with no default specified
Row 1 of cookie1: 2 rows back is NULL
Row 2 of cookie1: 2 rows back is NULL
Row 4 of cookie1: 2 rows back is the row-2 value, 2018-04-10 10:00:02
Row 7 of cookie1: 2 rows back is the row-5 value, 2018-04-10 10:50:01
LEAD
- LEAD(col,n,DEFAULT) returns the value of col from the nth row after the current row within the window
- The first parameter is the column name
- The second parameter is n, the number of rows to look ahead (optional, default 1)
- The third parameter is the default value (used when the nth following row is NULL; if not specified, NULL is returned)
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00')
OVER(PARTITION BY cookieid
ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid
ORDER BY createtime) AS next_2_time
FROM itcast_t4;
FIRST_VALUE
- After sorting within the partition, returns the first value in the frame from the partition start to the current row
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM itcast_t4;
LAST_VALUE
- After sorting within the partition, returns the last value in the frame from the partition start to the current row (i.e. the current row's value)
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM itcast_t4;
- If you want the last value of the whole partition after sorting, a workaround is needed:
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM itcast_t4
ORDER BY cookieid,createtime;
- If ORDER BY is not specified, the row order within the partition is undefined and incorrect results may occur.
SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM itcast_t4;
11 Hive custom function
Overview
- Hive ships with built-in functions, but their number is limited; they can easily be extended with custom UDFs.
- When the built-in functions provided by Hive cannot meet your business needs, consider user-defined functions (UDF: user-defined function).
- User-defined functions fall into three categories:
- UDF (User-Defined-Function)
- One row in, one row out
- UDAF (User-Defined Aggregation Function)
- Aggregation function; many rows in, one row out, similar to count/max/min
- UDTF (User-Defined Table-Generating Functions)
- One row in, many rows out, such as lateral view explode()
Custom UDF
- Programming steps:
- Inherit org.apache.hadoop.hive.ql.exec.UDF
- Implement the evaluate method; evaluate supports overloading
- Precautions:
- A UDF must have a return type and may return null, but the return type cannot be void
- Hadoop writable types such as Text/LongWritable are commonly used in UDFs; plain Java types are not recommended
Step 1: Create a maven java project and import the jar package
<dependencies>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.5</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
</plugins>
</build>
Step 2: Develop a java class to inherit UDF and overload the evaluate method
public class MyUDF extends UDF{
public Text evaluate(final Text s) {
if (null == s) {
return null;
}
//return the uppercase string
return new Text(s.toString().toUpperCase());
}
}
Step 3: Package the project and upload it to hive’s lib directory
Step 4: Add our jar package
- Rename the jar package
cd /export/server/hive-2.7.5/lib
mv original-day_10_hive_udf-1.0-SNAPSHOT.jar my_upper.jar
- Add our jar package to hive client
add jar /export/server/hive-2.7.5/lib/my_upper.jar;
Step 5: Set the function to associate with our custom function
create temporary function my_upper as 'cn.itcast.udf.MyUDF';
Step 6: Use custom functions
select my_upper('abc');
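Assuming the jar and temporary function were registered as in the steps above, the evaluate method from Step 2 simply uppercases its input:

```sql
select my_upper('abc');
-- ABC
```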
Custom UDTF
Requirement
- Customize a UDTF that splits a string on an arbitrary delimiter into independent words, for example:
Source data:
"zookeeper,hadoop,hdfs,hive,MapReduce"
Target data:
zookeeper
hadoop
hdfs
hive
MapReduce
Code
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import java.util.ArrayList;
import java.util.List;
public class MyUDTF extends GenericUDTF {
private final transient Object[] forwardListObj = new Object[1];
@Override
public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
//set up the output column names and their types
List<String> fieldNames = new ArrayList<>();
//set the column name
fieldNames.add("column_01");
List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(); //list of output-value inspectors
//set the value type of the output column
fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
}
@Override
public void process(Object[] objects) throws HiveException {
//1: get the raw data
String args = objects[0].toString();
//2: get the second argument passed in, which here is the delimiter
String splitKey = objects[1].toString();
//3: split the raw data on the given delimiter
String[] fields = args.split(splitKey);
//4: iterate over the split results and emit them
for (String field : fields) {
//put each word into the object array
forwardListObj[0] = field;
//emit the contents of the object array
forward(forwardListObj);
}
}
}
@Override
public void close() throws HiveException {
}
}
Add our jar package
- Upload the packaged jar to the
/export/server/hive/lib
directory on the node3 host, and rename the jar
cd /export/server/hive/lib
mv original-day_10_hive_udtf-1.0-SNAPSHOT.jar my_udtf.jar
- Add the jar to the hive client, putting it on hive's classpath
hive> add jar /export/server/hive/lib/my_udtf.jar;
Create a temporary function associated with the developed udtf code
create temporary function my_udtf as 'cn.itcast.udf.MyUDTF';
Use the custom udtf function
select my_udtf("zookeeper,hadoop,hdfs,hive,MapReduce",",") word;
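Given the process method above, the UDTF emits one row per delimiter-separated token; the expected result looks roughly like this (column header display depends on your client):

```
word
zookeeper
hadoop
hdfs
hive
MapReduce
```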