Built-in operators
Built-in functions
A quick way to test various built-in functions:
1. Create a dual table
create table dual(id string);
2. load data local inpath '/home/hadoop/dual.dat' into table dual; -- dual.dat is a file containing a single line with one space
3. select substr('angelababy',2,3) from dual;
select substr('angelababy',0,3) from dual; -- returns exactly the same result as the next statement
select substr('angelababy',1,3) from dual; -- same as the previous statement: Hive's substr is 1-based, so a start of 0 or 1 both begin at the first character
4. select concat('a','b') from dual;
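The 1-based indexing above can be sketched in Python (whose slicing is 0-based). The helper name `hive_substr` is illustrative, not a Hive API; this is a minimal sketch of the semantics, not Hive itself.

```python
def hive_substr(s, start, length):
    # Mimic Hive's 1-based substr: a start of 0 or 1 both
    # mean "from the first character" (illustrative sketch).
    if start >= 0:
        begin = max(start - 1, 0)
    else:
        begin = len(s) + start  # a negative start counts from the end
    return s[begin:begin + length]

print(hive_substr('angelababy', 2, 3))  # nge
print(hive_substr('angelababy', 0, 3))  # ang
print(hive_substr('angelababy', 1, 3))  # ang
```

This is why the two statements with start 0 and start 1 return identical results.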
Custom Functions and Transforms
When the built-in functions provided by Hive cannot meet your business needs, you can write a user-defined function (UDF).
1. Create a Java project. Extract apache-hive-1.2.2-bin.tar.gz from the Hive installation, find its lib folder, and add the jar packages there to the project's build path.
2. Create the package name cn.itcast.bigdata.udf
3. Create a custom class ToLowerCase that converts the incoming string to lowercase:
package cn.itcast.bigdata.udf;

import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {

    public static HashMap<String, String> provinceMap = new HashMap<String, String>();
    static {
        provinceMap.put("136", "beijing");
        provinceMap.put("137", "shanghai");
        provinceMap.put("138", "shenzhen");
    }

    // must be public and named evaluate -- Hive looks the method up by name
    public String evaluate(String field) {
        String result = field.toLowerCase();
        return result;
    }

    // overloads of evaluate are allowed; Hive picks the one matching the argument type
    public String evaluate(int phonenbr) {
        String pnb = String.valueOf(phonenbr);
        return provinceMap.get(pnb.substring(0, 3)) == null ? "huoxing"
                : provinceMap.get(pnb.substring(0, 3));
    }
}
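The phone-prefix lookup in `evaluate(int)` can be sketched in Python; `province_map` and `evaluate_phone` are illustrative names mirroring the Java class, not Hive APIs.

```python
# Sketch of the ToLowerCase UDF's prefix lookup (not runnable inside Hive).
province_map = {"136": "beijing", "137": "shanghai", "138": "shenzhen"}

def evaluate_phone(phonenbr):
    prefix = str(phonenbr)[:3]  # first three digits of the phone number
    # unknown prefixes fall back to "huoxing", as in the Java ternary
    return province_map.get(prefix, "huoxing")

print(evaluate_phone(13666668888))  # beijing
print(evaluate_phone(13912345678))  # huoxing
```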
4. Build a jar package, upload it to a Hive node's server, and register a Hive function bound to that jar:
hive> add JAR /home/hadoop/udf.jar;
hive> create temporary function tolow as 'cn.itcast.bigdata.udf.ToLowerCase';
5. View the table: select * from t_p;
6. Insert data: insert into t_p values(13,'ANGELA');
7. View again: select * from t_p; -- the name column still shows uppercase data
8. select id, tolow(name) from t_p; -- at select time the function is applied to name, so the output is lowercase
hive> create temporary function getprovince as 'cn.itcast.bigdata.udf.ToLowerCase';
select phonenbr, getprovince(phonenbr), flow from t_flow;
Hive custom function categories
UDF (User-Defined Function): operates on a single row of data and produces a single row of output (e.g. mathematical and string functions).
UDAF (User-Defined Aggregate Function): takes multiple rows of input and produces one row of output (e.g. count, max).
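The distinction can be sketched in Python with ordinary functions (the Hive analogy is illustrative, not Hive code):

```python
# Row-wise (UDF-like): one output value per input row
names = ["Alice", "BOB", "Carol"]
lowered = [n.lower() for n in names]
print(lowered)          # ['alice', 'bob', 'carol']

# Aggregate (UDAF-like): many input rows collapse to one output value
print(len(names))       # 3  (like count)
print(max([5, 3, 4]))   # 5  (like max)
```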
UDF development example
Simple UDF example
1. First develop a java class, inherit UDF, and overload the evaluate method
package cn.itcast.bigdata.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}
2. Make a jar package and upload it to the server
3. Add the jar package to Hive's classpath:
hive> add JAR /home/hadoop/udf.jar;
4. Create a temporary function associated with the developed Java class:
hive> create temporary function strip as 'cn.itcast.bigdata.udf.Lower';
5. You can now use the custom function strip in HQL:
select strip(name), age from t_test;
Json data parsing UDF development
Data preparation:
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
{"movie":"919","rate":"4","timeStamp":"978301368","uid":"1"}
{"movie":"595","rate":"5","timeStamp":"978824268","uid":"1"}
{"movie":"938","rate":"4","timeStamp":"978301752","uid":"1"}
{"movie":"2398","rate":"4","timeStamp":"978302281","uid":"1"}
{"movie":"2918","rate":"4","timeStamp":"978302124","uid":"1"}
{"movie":"1035","rate":"5","timeStamp":"978301753","uid":"1"}
{"movie":"2791","rate":"4","timeStamp":"978302188","uid":"1"}
{"movie":"2687","rate":"3","timeStamp":"978824268","uid":"1"}
{"movie":"2018","rate":"4","timeStamp":"978301777","uid":"1"}
{"movie":"3105","rate":"5","timeStamp":"978301713","uid":"1"}
{"movie":"2797","rate":"4","timeStamp":"978302039","uid":"1"}
1. create table t_json(line string) row format delimited;
2. load data local inpath '/home/hadoop/rating.json' into table t_json;
3. select * from t_json limit 10;
4. Custom class JsonParser
package cn.itcast.bigdata.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import parquet.org.codehaus.jackson.map.ObjectMapper; // from the lib jars added to the project earlier

public class JsonParser extends UDF {
    public String evaluate(String jsonLine) {
        ObjectMapper objectMapper = new ObjectMapper();
        try {
            MovieRateBean bean = objectMapper.readValue(jsonLine, MovieRateBean.class);
            return bean.toString();
        } catch (Exception e) {
            // malformed lines fall through and return ""
        }
        return "";
    }
}
MovieRateBean
package cn.itcast.bigdata.udf;

// {"movie":"1721","rate":"3","timeStamp":"965440048","uid":"5114"}
// The attribute names must match the JSON keys exactly
public class MovieRateBean {
    private String movie;
    private String rate;
    private String timeStamp;
    private String uid;

    // getters and setters omitted here to save space

    @Override
    public String toString() { // return the parsed record as a tab-separated string
        return movie + "\t" + rate + "\t" + timeStamp + "\t" + uid;
    }
}
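What JsonParser.evaluate plus MovieRateBean.toString accomplish can be sketched in Python with the standard json module; `parse_rating_line` is an illustrative name, not part of Hive.

```python
import json

def parse_rating_line(line):
    # Parse one JSON record and join the four fields with tabs,
    # mirroring the Java UDF's behavior (sketch, not Hive code).
    try:
        d = json.loads(line)
        return "\t".join([d["movie"], d["rate"], d["timeStamp"], d["uid"]])
    except Exception:
        return ""  # malformed lines become empty strings, like the Java catch block

print(parse_rating_line('{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}'))
```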
5. Package it, upload it to the server, and add the jar to Hive's classpath.
6. hive> create temporary function parsejson as 'cn.itcast.bigdata.udf.JsonParser';
7. View the data: select parsejson(line) from t_json limit 10;
8. Another approach:
create table rat_json(line string) row format delimited;
load data local inpath '/home/hadoop/rating.json' into table rat_json;
drop table if exists t_rating;
create table t_rating(movieid string, rate int, timestring string, uid string)
row format delimited fields terminated by '\t';
-- split() turns the tab-separated string returned by the custom function parsejson
-- into an array, so each field can be picked out by index
insert overwrite table t_rating
select split(parsejson(line), '\t')[0] as movieid,
       split(parsejson(line), '\t')[1] as rate,
       split(parsejson(line), '\t')[2] as timestring,
       split(parsejson(line), '\t')[3] as uid
from rat_json limit 10;
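The split-and-index step above can be sketched in Python (the sample string stands in for what parsejson(line) would return):

```python
parsed = "1193\t5\t978300760\t1"   # the tab-separated string parsejson returns
fields = parsed.split("\t")        # split yields an array; pick fields by index
movieid, rate, timestring, uid = fields
print(movieid, rate, timestring, uid)
```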
Transform implementation
Hive's TRANSFORM keyword lets SQL call user-written scripts.
It is suitable for implementing functionality that Hive does not provide when you do not want to write a UDF.
Example 1: the following SQL uses weekday_mapper.py to process the data.
CREATE TABLE u_data_new (
    movieid INT,
    rating INT,
    weekday INT,
    userid INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;  -- for a Java program this would be add JAR

INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (movieid, rate, timestring, uid)   -- the four query fields are piped to the python script
USING 'python weekday_mapper.py'
AS (movieid, rating, weekday, userid)        -- the script converts the time string to a day of week and returns 4 fields
FROM t_rating;
The content of weekday_mapper.py is as follows
#!/bin/python
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rating, str(weekday), userid])  # fields joined with the '\t' separator
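The mapper's per-line logic can be factored into a testable function (`map_line` is an illustrative name; the real script reads sys.stdin and prints each result):

```python
import datetime

def map_line(line):
    # Split one tab-separated input record into its four fields
    movieid, rating, unixtime, userid = line.strip().split("\t")
    # isoweekday(): Monday=1 .. Sunday=7; the exact day depends on the local timezone
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return "\t".join([movieid, rating, str(weekday), userid])

print(map_line("1193\t5\t978300760\t1"))
```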
Summary of the steps (transform case):
1. First load the rating.json file into an original Hive table rat_json:
create table rat_json(line string) row format delimited;
load data local inpath '/home/hadoop/rating.json' into table rat_json;
2. Parse the json data into four fields and insert them into a new table t_rating:
insert overwrite table t_rating
select get_json_object(line,'$.movie') as movieid,
       get_json_object(line,'$.rate') as rate,
       get_json_object(line,'$.timeStamp') as timestring,
       get_json_object(line,'$.uid') as uid
from rat_json;
3. Use transform + python to convert unixtime to a weekday.
First edit a python script file:
vi weekday_mapper.py
#!/bin/python
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rating, str(weekday), userid])
Save the file, then add it to Hive's classpath:
hive> add FILE /home/hadoop/weekday_mapper.py;
hive> create TABLE u_data_new as        -- build the new table from the query result
SELECT
TRANSFORM (movieid, rate, timestring, uid)   -- the t_rating fields are passed to the python script as input
USING 'python weekday_mapper.py'
AS (movieid, rate, weekday, uid)             -- the script's output fields populate the new table u_data_new
FROM t_rating;
select distinct(weekday) from u_data_new limit 10;