Built-in operators
Built-in functions
A quick way to test various built-in functions:
1. Create a dual table
create table dual(id string);
2. load data local inpath '/home/hadoop/dual.dat' into table dual; -- dual.dat is a file containing a single line with one space
3. select substr('angelababy',2,3) from dual;
select substr('angelababy',0,3) from dual; -- returns exactly the same result as the next statement
select substr('angelababy',1,3) from dual; -- same as the previous statement: Hive's substr is 1-based, so a start of 0 or 1 both begin at the first character
4. select concat('a','b') from dual;
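The 1-based indexing above can be sketched in Python (whose slicing is 0-based). The helper name `hive_substr` is illustrative, not a Hive API; this is a minimal sketch of the semantics, not Hive itself.

```python
def hive_substr(s, start, length):
    # Mimic Hive's 1-based substr: a start of 0 or 1 both
    # mean "from the first character" (illustrative sketch).
    if start >= 0:
        begin = max(start - 1, 0)
    else:
        begin = len(s) + start  # a negative start counts from the end
    return s[begin:begin + length]

print(hive_substr('angelababy', 2, 3))  # nge
print(hive_substr('angelababy', 0, 3))  # ang
print(hive_substr('angelababy', 1, 3))  # ang
```

This is why the two statements with start 0 and start 1 return identical results.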
Custom Functions and Transforms
When the built-in functions provided by Hive cannot meet your business needs, you can write a user-defined function (UDF).
1. Create a Java project. Extract apache-hive-1.2.2-bin.tar.gz from the Hive installation, find its lib folder, and add the jar packages there to the project's build path.
2. Create the package name cn.itcast.bigdata.udf
3. Create a custom class ToLowerCase that converts the incoming string to lowercase:
package cn.itcast.bigdata.udf;

import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {

    public static HashMap<String, String> provinceMap = new HashMap<String, String>();
    static {
        provinceMap.put("136", "beijing");
        provinceMap.put("137", "shanghai");
        provinceMap.put("138", "shenzhen");
    }

    // must be public and named evaluate -- Hive looks the method up by name
    public String evaluate(String field) {
        String result = field.toLowerCase();
        return result;
    }

    // overloads of evaluate are allowed; Hive picks the one matching the argument type
    public String evaluate(int phonenbr) {
        String pnb = String.valueOf(phonenbr);
        return provinceMap.get(pnb.substring(0, 3)) == null ? "huoxing"
                : provinceMap.get(pnb.substring(0, 3));
    }
}
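The phone-prefix lookup in `evaluate(int)` can be sketched in Python; `province_map` and `evaluate_phone` are illustrative names mirroring the Java class, not Hive APIs.

```python
# Sketch of the ToLowerCase UDF's prefix lookup (not runnable inside Hive).
province_map = {"136": "beijing", "137": "shanghai", "138": "shenzhen"}

def evaluate_phone(phonenbr):
    prefix = str(phonenbr)[:3]  # first three digits of the phone number
    # unknown prefixes fall back to "huoxing", as in the Java ternary
    return province_map.get(prefix, "huoxing")

print(evaluate_phone(13666668888))  # beijing
print(evaluate_phone(13912345678))  # huoxing
```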
4. Build a jar package, upload it to a Hive node's server, and register a Hive function bound to that jar:
hive> add JAR /home/hadoop/udf.jar;
hive> create temporary function tolow as 'cn.itcast.bigdata.udf.ToLowerCase';
5. View the table: select * from t_p;
6. Insert data: insert into t_p values(13,'ANGELA');
7. View again: select * from t_p; -- the name column still shows uppercase data
8. select id, tolow(name) from t_p; -- at select time the function is applied to name, so the output is lowercase
hive> create temporary function getprovince as 'cn.itcast.bigdata.udf.ToLowerCase';
select phonenbr, getprovince(phonenbr), flow from t_flow;
Hive custom function categories
UDF (User-Defined Function): operates on a single row of data and produces a single row of output (e.g. mathematical and string functions).
UDAF (User-Defined Aggregate Function): takes multiple rows of input and produces one row of output (e.g. count, max).
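The distinction can be sketched in Python with ordinary functions (the Hive analogy is illustrative, not Hive code):

```python
# Row-wise (UDF-like): one output value per input row
names = ["Alice", "BOB", "Carol"]
lowered = [n.lower() for n in names]
print(lowered)          # ['alice', 'bob', 'carol']

# Aggregate (UDAF-like): many input rows collapse to one output value
print(len(names))       # 3  (like count)
print(max([5, 3, 4]))   # 5  (like max)
```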
UDF development example
Simple UDF example
1. First develop a java class, inherit UDF, and overload the evaluate method
package cn.itcast.bigdata.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}
2. Make a jar package and upload it to the server
3. Add the jar package to Hive's classpath:
hive> add JAR /home/hadoop/udf.jar;
4. Create a temporary function associated with the developed Java class:
hive> create temporary function strip as 'cn.itcast.bigdata.udf.Lower';
5. You can now use the custom function strip in HQL:
select strip(name), age from t_test;
Json data parsing UDF development
Data preparation:
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
{"movie":"919","rate":"4","timeStamp":"978301368","uid":"1"}
{"movie":"595","rate":"5","timeStamp":"978824268","uid":"1"}
{"movie":"938","rate":"4","timeStamp":"978301752","uid":"1"}
{"movie":"2398","rate":"4","timeStamp":"978302281","uid":"1"}
{"movie":"2918","rate":"4","timeStamp":"978302124","uid":"1"}
{"movie":"1035","rate":"5","timeStamp":"978301753","uid":"1"}
{"movie":"2791","rate":"4","timeStamp":"978302188","uid":"1"}
{"movie":"2687","rate":"3","timeStamp":"978824268","uid":"1"}
{"movie":"2018","rate":"4","timeStamp":"978301777","uid":"1"}
{"movie":"3105","rate":"5","timeStamp":"978301713","uid":"1"}
{"movie":"2797","rate":"4","timeStamp":"978302039","uid":"1"}
1. create table t_json(line string) row format delimited;
2. load data local inpath '/home/hadoop/rating.json' into table t_json;
3. select * from t_json limit 10;
4. Custom class JsonParser
package cn.itcast.bigdata.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import parquet.org.codehaus.jackson.map.ObjectMapper; // from the lib jars added to the project earlier

public class JsonParser extends UDF {
    public String evaluate(String jsonLine) {
        ObjectMapper objectMapper = new ObjectMapper();
        try {
            MovieRateBean bean = objectMapper.readValue(jsonLine, MovieRateBean.class);
            return bean.toString();
        } catch (Exception e) {
            // malformed lines fall through and return ""
        }
        return "";
    }
}
MovieRateBean
package cn.itcast.bigdata.udf;

// {"movie":"1721","rate":"3","timeStamp":"965440048","uid":"5114"}
// The attribute names must match the JSON keys exactly
public class MovieRateBean {
    private String movie;
    private String rate;
    private String timeStamp;
    private String uid;

    // getters and setters omitted here to save space

    @Override
    public String toString() { // return the parsed record as a tab-separated string
        return movie + "\t" + rate + "\t" + timeStamp + "\t" + uid;
    }
}
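What JsonParser.evaluate plus MovieRateBean.toString accomplish can be sketched in Python with the standard json module; `parse_rating_line` is an illustrative name, not part of Hive.

```python
import json

def parse_rating_line(line):
    # Parse one JSON record and join the four fields with tabs,
    # mirroring the Java UDF's behavior (sketch, not Hive code).
    try:
        d = json.loads(line)
        return "\t".join([d["movie"], d["rate"], d["timeStamp"], d["uid"]])
    except Exception:
        return ""  # malformed lines become empty strings, like the Java catch block

print(parse_rating_line('{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}'))
```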
5. Package it, upload it to the server, and add the jar to Hive's classpath.
6. hive> create temporary function parsejson as 'cn.itcast.bigdata.udf.JsonParser';
7. View the data: select parsejson(line) from t_json limit 10;
8. Another approach:
create table rat_json(line string) row format delimited;
load data local inpath '/home/hadoop/rating.json' into table rat_json;
drop table if exists t_rating;
create table t_rating(movieid string, rate int, timestring string, uid string)
row format delimited fields terminated by '\t';
-- split() turns the tab-separated string returned by the custom function parsejson
-- into an array, so each field can be picked out by index
insert overwrite table t_rating
select split(parsejson(line), '\t')[0] as movieid,
       split(parsejson(line), '\t')[1] as rate,
       split(parsejson(line), '\t')[2] as timestring,
       split(parsejson(line), '\t')[3] as uid
from rat_json limit 10;
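The split-and-index step above can be sketched in Python (the sample string stands in for what parsejson(line) would return):

```python
parsed = "1193\t5\t978300760\t1"   # the tab-separated string parsejson returns
fields = parsed.split("\t")        # split yields an array; pick fields by index
movieid, rate, timestring, uid = fields
print(movieid, rate, timestring, uid)
```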
Transform implementation
Hive's TRANSFORM keyword lets SQL call user-written scripts.
It is suitable for implementing functionality that Hive does not provide when you do not want to write a UDF.
Example 1: the following SQL uses weekday_mapper.py to process the data.
CREATE TABLE u_data_new (
    movieid INT,
    rating INT,
    weekday INT,
    userid INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;  -- for a Java program this would be add JAR

INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (movieid, rate, timestring, uid)   -- the four query fields are piped to the python script
USING 'python weekday_mapper.py'
AS (movieid, rating, weekday, userid)        -- the script converts the time string to a day of week and returns 4 fields
FROM t_rating;
The content of weekday_mapper.py is as follows
#!/bin/python
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rating, str(weekday), userid])  # fields joined with the '\t' separator
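The mapper's per-line logic can be factored into a testable function (`map_line` is an illustrative name; the real script reads sys.stdin and prints each result):

```python
import datetime

def map_line(line):
    # Split one tab-separated input record into its four fields
    movieid, rating, unixtime, userid = line.strip().split("\t")
    # isoweekday(): Monday=1 .. Sunday=7; the exact day depends on the local timezone
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return "\t".join([movieid, rating, str(weekday), userid])

print(map_line("1193\t5\t978300760\t1"))
```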
Summary of the steps (transform case):
1. First load the rating.json file into an original Hive table rat_json:
create table rat_json(line string) row format delimited;
load data local inpath '/home/hadoop/rating.json' into table rat_json;
2. Parse the json data into four fields and insert them into a new table t_rating:
insert overwrite table t_rating
select get_json_object(line,'$.movie') as movieid,
       get_json_object(line,'$.rate') as rate,
       get_json_object(line,'$.timeStamp') as timestring,
       get_json_object(line,'$.uid') as uid
from rat_json;
3. Use transform + python to convert unixtime to a weekday.
First edit a python script file:
vi weekday_mapper.py
#!/bin/python
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rating, str(weekday), userid])
Save the file, then add it to Hive's classpath:
hive> add FILE /home/hadoop/weekday_mapper.py;
hive> create TABLE u_data_new as        -- build the new table from the query result
SELECT
TRANSFORM (movieid, rate, timestring, uid)   -- the t_rating fields are passed to the python script as input
USING 'python weekday_mapper.py'
AS (movieid, rate, weekday, uid)             -- the script's output fields populate the new table u_data_new
FROM t_rating;
select distinct(weekday) from u_data_new limit 10;