Hive Sunflower Collection: Complete Collection of Hive Functions


Copyright Notice

  • The content of this blog is based on my personal study notes from the Dark Horse Programmer course. I hereby declare that all copyrights belong to Dark Horse Programmers or related rights holders. The purpose of this blog is only for personal learning and communication, not commercial use.
  • I try my best to ensure accuracy when organizing my study notes, but I cannot guarantee the completeness and timeliness of the content. The content of this blog may become outdated over time or require updating.
  • If you are Dark Horse Programmer or a related rights holder and believe any content infringes your copyright, please contact me promptly and I will delete it immediately or make the necessary modifications.
  • Other readers, please abide by relevant laws, regulations and ethical principles when reading this blog, treat the content with appropriate caution, and assume the resulting risks and responsibilities yourselves. Some of the views and opinions in this blog are my own and do not represent the position of Dark Horse Programmer.

Functions

1 Function classification

  • Hive's functions are divided into two categories: built-in functions (Built-in Functions) and user-defined functions (UDF, User-Defined Functions)

2 View function list

  1. Use show functions to view all currently available functions;
  2. Use describe function extended func_name to see how a function is used.
    -- view all functions
    show functions;
    
    -- view how a specific function is used
    describe function extended func_name;
    

3 Math functions

Rounding function: round

  • Syntax: round(double a)
  • Return value: BIGINT
  • Description: Returns the integer part of the double value, rounded half up
  • Example:
select round(3.1415926);

Specify precision rounding function: round

  • Syntax: round(double a, int d)
  • Return value: DOUBLE
  • Description: Returns the double value rounded to d decimal places
  • Example:
select round(3.1415926,4);

Round down function: floor

  • Syntax: floor(double a)
  • Return value: BIGINT
  • Description: Returns the largest integer equal to or less than the double variable
  • Example:
 select floor(3.1415926); 

Round up function: ceil

  • Syntax: ceil(double a)

  • Return value: BIGINT

  • Description: Returns the smallest integer equal to or greater than the double variable

  • Example:

select ceil(3.1415926);

Get a random number function: rand

  • Syntax: rand(), rand(int seed)
  • Return value: double
  • Description: Returns a random number in the range 0 to 1. If a seed is specified, the generated values are deterministic and repeatable.
  • Example:
 select rand(); 
 0.5577432776034763   
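  • With a seed the result is deterministic; for example (the exact value returned depends on the Hive version):
 select rand(3);   -- the same seed always returns the same result 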

Power operation function: pow

  • Syntax: pow(double a, double p)

  • Return value: double

  • Description: Returns a raised to the power p

  • Example:

select pow(2,4) ; 
16.0 

Absolute value function: abs

  • Syntax: abs(double a), abs(int a)

  • Return value: double or int

  • Description: Returns the absolute value of a

  • Example:

select abs(-3.9); 
3.9

4 String functions

String length function: length

  • Syntax: length(string A)

  • Return value: int

  • Description: Returns the length of string A

  • Example:

select length('abcedfg');  
7 

String reversal function: reverse

  • Syntax: reverse(string A)

  • Return value: string

  • Description: Returns the reverse result of string A

  • Example:

hive> select reverse('abcedfg'); 
gfdecba 

String concatenation function: concat

  • Syntax: concat(string A, string B…)
  • Return value: string
  • Description: Returns the result after input string concatenation, supports any number of input strings
  • Example:
hive> select concat('abc','def','gh'); 
abcdefgh

String concatenation function - with delimiter: concat_ws

  • Syntax: concat_ws(string SEP, string A, string B…)

  • Return value: string

  • Description: Returns the result after concatenating the input strings. SEP represents the separator between each string.

  • Example:

hive> select concat_ws(',','abc','def','gh'); 
abc,def,gh

String interception function: substr, substring

  • Syntax: substr(string A, int start), substring(string A, int start)

  • Return value: string

  • Description: Returns the string from the start position to the end of string A

  • Example:

hive> select substr('abcde',3); 
cde
hive> select substr('abcde',-1); 
e 

String interception function: substr, substring

  • Syntax: substr(string A, int start, int len), substring(string A, int start, int len)

  • Return value: string

  • Description: Returns the substring of A that starts at position start and has length len.

  • Example:

hive> select substr('abcde',3,2); 
cd 
hive> select substring('abcde',3,2); 
cd 
hive> select substring('abcde',-2,2); 
de 

String to uppercase function: upper, ucase

  • Syntax: upper(string A), ucase(string A)

  • Return value: string

  • Description: Returns the uppercase format of string A

  • Example:

hive> select upper('abSEd'); 
ABSED 
hive> select ucase('abSEd'); 
ABSED 

String to lowercase function: lower,lcase

  • Syntax: lower(string A), lcase(string A)

  • Return value: string

  • Description: Returns the lowercase format of string A

  • Example:

hive> select lower('abSEd'); 
absed 
hive> select lcase('abSEd'); 
absed

Remove spaces function: trim

  • Syntax: trim(string A)

  • Return value: string

  • Description: Remove spaces on both sides of the string

  • Example:

hive> select trim(' abc '); 
abc

Function to remove spaces on the left: ltrim

  • Syntax: ltrim(string A)

  • Return value: string

  • Description: Remove the spaces on the left side of the string

  • Example:

hive> select ltrim(' abc '); 
abc

Function to remove spaces on the right: rtrim

  • Syntax: rtrim(string A)
  • Return value: string
  • Description: Remove the spaces on the right side of the string
  • Example:
hive> select rtrim(' abc '); 
abc

Regular expression replacement function: regexp_replace

  • Syntax: regexp_replace(string A, string B, string C)

  • Return value: string

  • Description: Replace the part of string A that matches java regular expression B with C.

  • Note that in some cases escape characters must be used, similar to the regexp_replace function in Oracle.

  • Example:

 hive> select regexp_replace('foobar', 'oo|ar', ''); 
 fb
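  • To treat a regex metacharacter such as the dot literally, escape it with a double backslash; a minimal sketch:
 hive> select regexp_replace('192.168.1.1', '\\.', '-'); 
 192-168-1-1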

URL parsing function: parse_url

  • Syntax: parse_url(string urlString, string partToExtract [, string keyToExtract])
    • Valid values for partToExtract are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO
  • Return value: string
  • Description: Returns the specified part of the URL.
  • Example:
 hive> select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST'); 
 facebook.com  
 hive> select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'PATH'); 
 /path1/p.php  
 hive> select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY','k1'); 
 v1 

Split string function: split

  • Syntax: split(string str, string pat)

  • Return value: array

  • Description: Splits str around matches of the regular expression pat and returns the resulting array of strings.

  • Example:

hive> select split('abtcdtef','t'); 
["ab","cd","ef"]

5 Date functions

Get the current UNIX timestamp function: unix_timestamp

  • Syntax: unix_timestamp()

  • Return value: bigint

  • Description: Returns the UNIX timestamp (in seconds) for the current time

  • Example:

hive> select unix_timestamp(); 
1323309615

UNIX timestamp to date function: from_unixtime

  • Syntax: from_unixtime(bigint unixtime[, string format])

  • Return value: string

  • Description: Converts a UNIX timestamp (the number of seconds since 1970-01-01 00:00:00 UTC) to a date string in the current time zone, optionally using the given format

  • Example:

hive> select from_unixtime(1323308943,'yyyyMMdd'); 
20111208

Date to UNIX timestamp function: unix_timestamp

  • Syntax: unix_timestamp(string date)

  • Return value: bigint

  • Description: Convert a date in "yyyy-MM-dd HH:mm:ss" format to a UNIX timestamp. If the conversion fails, 0 is returned.

  • Example:

hive> select unix_timestamp('2011-12-07 13:01:03'); 
1323234063

Function to convert date in specified format to UNIX timestamp: unix_timestamp

  • Syntax: unix_timestamp(string date, string pattern)

  • Return value: bigint

  • Description: Convert the date in pattern format to a UNIX timestamp. If the conversion fails, 0 is returned.

  • Example:

 hive> select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss'); 
 1323234063

Date time to date function: to_date

  • Syntax: to_date(string timestamp)

  • Return value: string

  • Description: Returns the date part in the datetime field.

  • Example:

hive> select to_date('2011-12-08 10:03:01'); 
2011-12-08

Date to year function: year

  • Syntax: year(string date)

  • Return value: int

  • Description: Returns the year in the date.

  • Example:

hive> select year('2011-12-08 10:03:01'); 
2011 
hive> select year('2012-12-08'); 
2012 

Date to month function: month

  • Syntax: month (string date)

  • Return value: int

  • Description: Returns the month in the date.

  • Example:

hive> select month('2011-12-08 10:03:01'); 
12 
hive> select month('2011-08-08'); 
8

Date to day function: day

  • Syntax: day(string date)
  • Return value: int
  • Description: Returns the day in the date.
  • Example:
hive> select day('2011-12-08 10:03:01'); 
8 
hive> select day('2011-12-24'); 
24
  • Similarly, there are hour, minute, and second functions to obtain hours, minutes, and seconds respectively.
select hour('2023-05-11 10:36:59');
select minute('2023-05-11 10:36:59');
select second('2023-05-11 10:36:59');

Date week function: weekofyear

  • Syntax: weekofyear(string date)

  • Return value: int

  • Description: Returns the week number of the year in which the date falls.

  • Example:

hive> select weekofyear('2011-12-08 10:03:01'); 
49 

Date comparison function: datediff

  • Syntax: datediff(string enddate, string startdate)

  • Return value: int

  • Description: Returns the number of days from startdate to enddate (enddate minus startdate).

  • Example:

hive> select datediff('2012-12-08','2012-05-09'); 
213

Date addition function: date_add

  • Syntax: date_add(string startdate, int days)

  • Return value: string

  • Description: Returns the date that is days days after startdate.

  • Example:

hive> select date_add('2012-12-08',10); 
2012-12-18

Date subtraction function: date_sub

  • Syntax: date_sub(string startdate, int days)

  • Return value: string

  • Description: Returns the date that is days days before startdate.

  • Example:

hive> select date_sub('2012-12-08',10); 
2012-11-28

6 Conditional functions

if function: if

  • Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
  • Return value: T
  • Description: When the condition testCondition is TRUE, return valueTrue; otherwise return valueFalseOrNull
  • Example:
hive> select if(1=2,100,200) ; 
200 
hive> select if(1=1,100,200) ; 
100

Conditional judgment function: CASE

  • Syntax: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
  • Return value: T
  • Description: If a equals b, then return c; if a equals d, then return e; otherwise return f
  • Example:
hive> select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end ; 
mary 
hive> select case 200 when 50 then 'tom' when 100 then 'mary' else 'tim' end ; 
tim 

Conditional judgment function: CASE

  • Syntax: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
  • Return value: T
  • Description: If a is TRUE, return b; if c is TRUE, return d; otherwise return e
  • Example:
hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end ; 
mary 
hive> select case when 1=1 then 'tom' when 2=2 then 'mary' else 'tim' end ; 
tom

7 Conversion functions

Hive supports two kinds of type conversion: implicit conversion and the explicit cast() function.

cast() function

  • The cast function can, for example, convert the string "20190607" into int data. Syntax:
cast(expression as type)
cast("20190607" as int)
select cast('2017-06-12' as date) as field;
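  • If a conversion cannot be performed, cast returns NULL, and casting a decimal to int truncates; a minimal sketch:
select cast('abc' as int);   -- NULL: 'abc' is not a number
select cast(3.99 as int);    -- 3: truncation, not rounding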

8 Hive row to column conversion

Introduction

  1. Row to column conversion refers to converting multiple rows of data into a column of fields.

  2. Functions used in Hive row to column conversion:

  • concat(str1,str2,…) --Field or string concatenation

  • concat_ws(sep, str1,str2) -- concatenate each string with delimiter

  • collect_set(col) --Deduplicate and summarize the values of a certain field to generate an array type field

Test Data:

  • Field: deptno ename
20 SMITH  
30 ALLEN  
30 WARD  
20 JONES  
30 MARTIN  
30 BLAKE  
10 CLARK  
20 SCOTT  
10 KING  
30 TURNER  
20 ADAMS  
30 JAMES  
20 FORD  
10 MILLER

Steps

  1. Create table
create table emp( 
deptno int, 
ename string 
) row format delimited fields terminated by '\t'; 
  2. Insert data:
load data local inpath "/opt/data/emp.txt" into table emp;
  3. Convert

    select deptno,concat_ws("|",collect_set(ename)) as ems 
    from emp group by deptno;
    
    • Row to column, COLLECT_SET(col): The function only accepts basic data types. Its main function is to deduplicate and summarize the values of a certain field to generate an array type field.
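    • For comparison, collect_list aggregates the same way but keeps duplicates; a minimal sketch against the same emp table:
    select deptno,concat_ws("|",collect_list(ename)) as ems 
    from emp group by deptno;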
  4. View results

9 Hive table generation function

explode function

  • explode(col): Split the complex array or map structure in one column of hive into multiple rows.

  • explode(ARRAY) generates one row for each element in the list

  • explode(MAP) generates one row for each key-value pair in the map, with key as one column and value as one column.
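  • For example, exploding a map literal yields one row per entry, with a key column and a value column; a minimal sketch:
select explode(map('a',1,'b',2)) as (key,value);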

Data:

10 CLARK|KING|MILLER 
20 SMITH|JONES|SCOTT|ADAMS|FORD 
30 ALLEN|WARD|MARTIN|BLAKE|TURNER|JAMES

Create table:

create table emp( deptno int, names array<string> ) 
row format delimited fields terminated by '\t' 
collection items terminated by '|';

Insert data

load data local inpath "/server/data/hivedatas/emp3.txt" into table emp; 

Query data

select * from emp;


  • Query using explode
select explode(names) as name from emp;


LATERAL VIEW (side view)

  • Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

  • Explanation: Used with split, explode and other UDTFs. It can split a column of data into multiple rows of data. On this basis, the split data can be aggregated.

Column to row

select deptno,name from emp lateral view explode(names) tmp_tb as name;

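  • Because each exploded name stays paired with its deptno, the split rows can be aggregated; a minimal sketch counting employees per department:
select deptno,count(name) as cnt 
from emp lateral view explode(names) tmp_tb as name 
group by deptno;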

Reflect function

  • The reflect function supports calling Java built-in methods from SQL

Use max from java.lang.Math to find the maximum of two columns

--create hive table 
create table test_udf(col1 int,col2 int) 
row format delimited fields terminated by ',';  
--prepare data: test_udf.txt 
1,2 
4,3 
6,4 
7,5 
5,6  
--load data  
load data local inpath '/root/hivedata/test_udf.txt' into table test_udf;  
--use max from java.lang.Math to find the maximum of the two columns 
select reflect("java.lang.Math","max",col1,col2) from test_udf; 

Different records execute different java built-in functions

--create hive table
create table test_udf2(class_name string,method_name string,col1 int , col2 int) row format delimited fields terminated by ',';

--prepare data: test_udf2.txt
java.lang.Math,min,1,2
java.lang.Math,max,2,3

--load data
load data local inpath '/root/hivedata/test_udf2.txt' into table test_udf2;

--run the query
select reflect(class_name,method_name,col1,col2) from test_udf2;

10 Hive window functions

Window functions (1): NTILE, ROW_NUMBER, RANK, DENSE_RANK

Data preparation

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
cookie2,2018-04-10,2
cookie2,2018-04-11,3
cookie2,2018-04-12,5
cookie2,2018-04-13,6
cookie2,2018-04-14,3
cookie2,2018-04-15,9
cookie2,2018-04-16,7
CREATE TABLE itcast_t2 (
cookieid string,
createtime string,   --day 
pv INT
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
stored as textfile;
  
-- load data:
load data local inpath '/root/hivedata/itcast_t2.dat' into table itcast_t2;

ROW_NUMBER

  • ROW_NUMBER() generates a sequence number for each row within the group, starting from 1 in order
SELECT 
  cookieid,
  createtime,
  pv,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn 
  FROM itcast_t2;

RANK and DENSE_RANK

  • RANK() generates the rank of each row in the group; equal values receive the same rank, leaving gaps in the sequence.

  • DENSE_RANK() generates the rank of each row in the group; equal values receive the same rank, with no gaps in the sequence.

SELECT 
	cookieid,
	createtime,
	pv,
	RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
	DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
	ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
FROM itcast_t2 
WHERE cookieid = 'cookie1';
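
NTILE

  • NTILE(n), the remaining function named in this group's heading, distributes the ordered rows of each partition into n roughly even buckets and returns the bucket number. A minimal sketch against the same itcast_t2 table:
SELECT 
	cookieid,
	createtime,
	pv,
	NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS nt 
FROM itcast_t2;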

Hive analysis window functions (2): SUM, AVG, MIN, MAX

Data preparation

--create table:
create table itcast_t1(
cookieid string,
createtime string,   --day 
pv int
) row format delimited 
fields terminated by ',';

--load data:
load data local inpath '/root/hivedata/itcast_t1.dat' into table itcast_t1;

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4

--enable smart local mode
SET hive.exec.mode.local.auto=true;

SUM (the result is related to ORDER BY, the default is ascending order)

select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime) as pv1 
from itcast_t1;

select cookieid,createtime,pv,
sum(pv) over(partition by cookieid 
			order by createtime 
			rows between unbounded preceding and current row) as pv2
from itcast_t1;

select cookieid,createtime,pv,
sum(pv) over(partition by cookieid) as pv3
from itcast_t1;  --without an order by clause, all rows in the partition are summed by default

select cookieid,createtime,pv,
sum(pv) over(partition by cookieid 
			order by createtime 
			rows between 3 preceding and current row) as pv4
from itcast_t1;

select cookieid,createtime,pv,
sum(pv) over(partition by cookieid 
			order by createtime 
			rows between 3 preceding and 1 following) as pv5
from itcast_t1;

select cookieid,createtime,pv,
sum(pv) over(partition by cookieid 
			order by createtime 
			rows between current row and unbounded following) as pv6
from itcast_t1;

--pv1: cumulative pv within the group from the first row to the current row, e.g. pv1 on the 11th = pv on the 10th + pv on the 11th, the 12th = 10th + 11th + 12th
--pv2: same as pv1
--pv3: sum of all pv within the group (cookie1)
--pv4: current row plus the previous 3 rows within the group, e.g. the 11th = 10th + 11th, the 12th = 10th + 11th + 12th, the 13th = 10th + 11th + 12th + 13th, the 14th = 11th + 12th + 13th + 14th
--pv5: current row plus the previous 3 rows and the following 1 row, e.g. the 14th = 11th + 12th + 13th + 14th + 15th = 5+7+3+2+4 = 21
--pv6: current row plus all following rows, e.g. the 13th = 13th + 14th + 15th + 16th = 3+2+4+4 = 13, the 14th = 14th + 15th + 16th = 2+4+4 = 10

/*
- If rows between is not specified, the default window runs from the first row of the partition to the current row;
- If order by is not specified, all values within the group are accumulated;
- The key is to understand rows between, also called the window clause:
  - preceding: rows before the current row
  - following: rows after the current row
  - current row: the current row
  - unbounded: the partition boundary
  - unbounded preceding: from the first row of the partition
  - unbounded following: to the last row of the partition
 */ 

AVG,MIN,MAX

  • AVG, MIN, MAX and SUM are used the same way
select cookieid,createtime,pv,
avg(pv) over(partition by cookieid order by createtime 
rows between unbounded preceding and current row) as pv2
from itcast_t1;

select cookieid,createtime,pv,
max(pv) over(partition by cookieid order by createtime 
rows between unbounded preceding and current row) as pv2
from itcast_t1;

select cookieid,createtime,pv,
min(pv) over(partition by cookieid order by createtime 
rows between unbounded preceding and current row) as pv2
from itcast_t1;

Hive analysis window functions (3): LAG, LEAD, FIRST_VALUE, LAST_VALUE

Prepare data

cookie1,2018-04-10 10:00:02,url2
cookie1,2018-04-10 10:00:00,url1
cookie1,2018-04-10 10:03:04,1url3
cookie1,2018-04-10 10:50:05,url6
cookie1,2018-04-10 11:00:00,url7
cookie1,2018-04-10 10:10:00,url4
cookie1,2018-04-10 10:50:01,url5
cookie2,2018-04-10 10:00:02,url22
cookie2,2018-04-10 10:00:00,url11
cookie2,2018-04-10 10:03:04,1url33
cookie2,2018-04-10 10:50:05,url66
cookie2,2018-04-10 11:00:00,url77
cookie2,2018-04-10 10:10:00,url44
cookie2,2018-04-10 10:50:01,url55
 
CREATE TABLE itcast_t4 (
cookieid string,
createtime string,  --page visit time
url STRING       --visited page
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
stored as textfile;

--load data:
load data local inpath '/root/hivedata/itcast_t4.dat' into table itcast_t4;

LAG

  • LAG(col,n,DEFAULT) returns the value of the nth row above the current row in the window
    • The first parameter is the column name,
    • The second parameter is n, the number of rows up (optional, default is 1),
    • The third parameter is the default value (when the nth row up is NULL, the default value is taken; if not specified, it is NULL)
SELECT cookieid,
  createtime,
  url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
  LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time 
  FROM itcast_t4;
  
--last_1_time: the value 1 row up, with default '1970-01-01 00:00:00'  
      cookie1 row 1: 1 row up is NULL, so the default 1970-01-01 00:00:00 is taken
      cookie1 row 3: the value 1 row up is row 2's value, 2018-04-10 10:00:02
      cookie1 row 6: the value 1 row up is row 5's value, 2018-04-10 10:50:01
--last_2_time: the value 2 rows up, with no default specified
      cookie1 row 1: 2 rows up is NULL
      cookie1 row 2: 2 rows up is NULL
      cookie1 row 4: the value 2 rows up is row 2's value, 2018-04-10 10:00:02
      cookie1 row 7: the value 2 rows up is row 5's value, 2018-04-10 10:50:01

LEAD

  • LEAD(col,n,DEFAULT) returns the value of the nth row below the current row in the window
    • The first parameter is the column name,
    • The second parameter is n, the number of rows down (optional, default is 1),
    • The third parameter is the default value (when the nth row down is NULL, the default value is taken; if not specified, it is NULL)
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') 
			OVER(PARTITION BY cookieid 
			ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid 
			ORDER BY createtime) AS next_2_time 
FROM itcast_t4;
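  • A common use of LAG/LEAD is computing the time between consecutive page visits; a minimal sketch, assuming createtime is in the default yyyy-MM-dd HH:mm:ss format:
SELECT cookieid,
createtime,
url,
unix_timestamp(createtime) - unix_timestamp(LAG(createtime,1) 
			OVER(PARTITION BY cookieid ORDER BY createtime)) AS gap_seconds 
FROM itcast_t4;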

FIRST_VALUE

  • After sorting within the group, returns the first value from the start of the partition up to the current row
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1 
FROM itcast_t4;

LAST_VALUE

  • After sorting within the group, returns the last value up to the current row
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1 
FROM itcast_t4;

  • If you want to get the last value after sorting in the group, you need to work around it:
SELECT cookieid,
  createtime,
  url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
  FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2 
  FROM itcast_t4 
  ORDER BY cookieid,createtime;
  • If ORDER BY is not specified, the rows are processed in arbitrary order and incorrect results will occur.
SELECT cookieid,
		createtime,
		url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2  
FROM itcast_t4;

11 Hive custom function

Overview

  • Hive comes with some functions, but the number is limited; it can easily be extended with custom UDFs.

  • When the built-in functions provided by Hive cannot meet your business processing needs, you can consider using user-defined functions (UDF: user-defined function).

  • User-defined functions fall into the following three types:

  1. UDF(User-Defined-Function)
    • One in and one out
  2. UDAF(User-Defined Aggregation Function)
    • Aggregation function; multiple in and one out, similar to count/max/min
  3. UDTF(User-Defined Table-Generating Functions)
    • One in and multiple out, such as lateral view explode()

Custom UDF

  • Programming steps:

    1. Inherit org.apache.hadoop.hive.ql.exec.UDF
    2. The evaluate function needs to be implemented; the evaluate function supports overloading;
  • Precautions:

    1. UDF must have a return type and can return null, but the return type cannot be void;
    2. Text/LongWritable and other types are commonly used in UDF, and java types are not recommended;

Step 1: Create a maven java project and import the jar package

<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.5</version>
    </dependency>
</dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>

Step 2: Develop a java class to inherit UDF and overload the evaluate method

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyUDF extends UDF {
    public Text evaluate(final Text s) {
        if (null == s) {
            return null;
        }
        // return the input string in uppercase
        return new Text(s.toString().toUpperCase());
    }
}

Step 3: Package the project and upload it to hive’s lib directory


Step 4: Add our jar package

  • Rename jar package name
cd /export/server/hive-2.7.5/lib
mv original-day_10_hive_udf-1.0-SNAPSHOT.jar my_upper.jar
  • Add our jar package to hive client
add jar /export/server/hive-2.7.5/lib/my_upper.jar;

Step 5: Set the function to associate with our custom function

create temporary function my_upper as 'cn.itcast.udf.MyUDF';

Step 6: Use custom functions

select my_upper('abc');

Custom UDTF

Requirement

  • Customize a UDTF to cut a string with any delimiter into independent words, for example:
Source data:
"zookeeper,hadoop,hdfs,hive,MapReduce"
Target data:
zookeeper
hadoop
hdfs
hive
MapReduce

Code

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;
import java.util.List;

public class MyUDTF extends GenericUDTF {
    private final transient Object[] forwardListObj = new Object[1];

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        //set up the output column names
        List<String> fieldNames = new ArrayList<>();
        //set the column name
        fieldNames.add("column_01");
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(); //list of inspectors

        //set the value type of the output column
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
         
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);

    }

    @Override
    public void process(Object[] objects) throws HiveException {
        //1: get the original data
        String args = objects[0].toString();
        //2: get the second argument passed in, here the delimiter
        String splitKey = objects[1].toString();
        //3: split the original data by the given delimiter
        String[] fields = args.split(splitKey);
        //4: iterate over the split results and write them out
        for (String field : fields) {
            //put each word into the object array
            forwardListObj[0] = field;
            //write out the contents of the object array
            forward(forwardListObj);
        }
        }

    }

    @Override
    public void close() throws HiveException {

    }
}

Add our jar package

  1. Upload the packaged jar to the /export/server/hive/lib directory on the node3 host and rename it
cd /export/server/hive/lib
mv original-day_10_hive_udtf-1.0-SNAPSHOT.jar my_udtf.jar
  2. Add the jar package to the hive client so it is on hive's classpath
hive> add jar /export/server/hive/lib/my_udtf.jar;

Create a temporary function to associate with the developed udtf code

create temporary function my_udtf as 'cn.itcast.udf.MyUDTF';

Use custom udtf function

select my_udtf("zookeeper,hadoop,hdfs,hive,MapReduce",",") as word;

Origin: blog.csdn.net/yang2330648064/article/details/132777834