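6. Important Application Cases for Hive Functions

1. Delimiters in Hive

Default rules
Hive's default serialization class is LazySimpleSerDe, which supports only single-byte delimiters (char) when loading text data. Depending on the delimiter a text file uses, specify it with row format delimited when creating the table.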

row_format
    : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char] -- Available in Hive 0.13 and later
    | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

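Special data
Case 1: the fields in each line are separated by a multi-byte delimiter, for example "||" or "--"
Case 2: a field value itself contains the delimiter

Solution 1: replace the delimiter
For small batches of data, write code that replaces the multi-byte delimiter with a single-byte one
For large volumes of data, do the replacement with a MapReduce (MR) program

Solution 2: load with RegexSerDe
Hive's built-in SerDes
Besides LazySimpleSerDe, the one used most often, Hive ships with many other SerDe classes
Each SerDe parses and loads a different kind of data file; commonly used ones include ORCSerDe, RegexSerDe, and JsonSerDe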

Built-in SerDes
- Avro (Hive 0.9.1 and later)
- ORC (Hive 0.11 and later)
- RegEx
- Thrift
- Parquet (Hive 0.13 and later)
- CSV (Hive 0.13 and later)
- JsonSerDe (Hive 0.12 and later in hcatalog-core)

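Overview
RegexSerDe handles the special-data cases above by matching each line against a regular expression when loading
Using RegexSerDe for multi-byte delimiters
Analyze the data format and build the regular expression

Raw data format:
01||周杰伦||中国||台湾||男||七里香

Regular expression that defines each column (the backslashes are doubled because the pattern is written inside a HiveQL string literal)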
([0-9]*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)

create table singer(
    id string, -- singer id
    name string, -- singer name
    country string, -- country
    province string, -- province
    gender string, -- gender
    works string -- works
)
-- Use RegexSerDe to load the data
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ("input.regex" = "([0-9]*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)");

-- Load data
load data local inpath '/root/hivedata/test01.txt' into table singer;
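
Before creating the table, it can help to check the pattern outside Hive. RegexSerDe is built on Java regular expressions, so a small standalone Java check is a reasonable approximation. The sketch below is not part of Hive: the class name SingerRegexCheck, the helper split, and the ASCII sample row are made up for illustration, and it assumes that a line must match the pattern in full to produce columns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingerRegexCheck {
    // Same pattern as "input.regex"; in Java source, as in the HiveQL string
    // literal, "\\|" denotes the regex \| (a literal pipe character).
    static final Pattern SINGER =
            Pattern.compile("([0-9]*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)");

    // Returns one string per capture group, or null if the whole line does not match.
    static String[] split(String line) {
        Matcher m = SINGER.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1); // groups are numbered from 1
        }
        return cols;
    }

    public static void main(String[] args) {
        // Hypothetical ASCII row in the same layout as the sample data
        String[] cols = split("01||Jay Chou||China||Taiwan||M||Qi Li Xiang");
        System.out.println(String.join(" | ", cols));
    }
}
```

The number of capture groups (six) has to equal the number of columns in the table definition.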

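Using RegexSerDe when the data contains the delimiter
Analyze the data format and build the regular expression

Raw data format:
192.168.88.100 [08/Nov/2020:10:44:33 +0800] "GET /hpsk_sdk/index.html HTTP/1.1" 200 328

Regular expression that defines each column
([^ ]*) ([^}]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9]*) ([^ ]*)

^: matches the start of the input string, unless it is used inside a bracket expression. As the first character inside a bracket expression it negates the set, so the expression matches any character that is not listed. To match a literal ^ character, use \^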

create table apachelog(
    ip string, -- IP address
    stime string, -- time
    method string, -- request method
    url string, -- request URL
    policy string, -- request protocol
    stat string, -- response status
    body string -- size in bytes
)
-- Use RegexSerDe to load the data
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
-- Specify the regular expression
with serdeproperties(
    "input.regex" = "([^ ]*) ([^}]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9]*) ([^ ]*)"
) stored as textfile;
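
The log pattern is less obvious than the "||" one, because the second group [^}]* may contain spaces and relies on backtracking to stop after the timestamp. A quick way to see which text lands in which column is to run the pattern in plain Java. This is only an illustrative sketch: ApacheLogRegexCheck and split are hypothetical names, and whole-line matching is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApacheLogRegexCheck {
    // Same pattern as "input.regex" in the apachelog table definition
    static final Pattern LOG_LINE =
            Pattern.compile("([^ ]*) ([^}]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9]*) ([^ ]*)");

    // Returns the seven captured fields, or null if the line does not match.
    static String[] split(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1);
        }
        return cols;
    }

    public static void main(String[] args) {
        String line = "192.168.88.100 [08/Nov/2020:10:44:33 +0800] "
                + "\"GET /hpsk_sdk/index.html HTTP/1.1\" 200 328";
        for (String col : split(line)) {
            System.out.println(col);
        }
    }
}
```

With this pattern the bracketed timestamp, including its inner space, ends up in the second capture, while the third and fifth captures keep the double quotes that surround the request in the raw line.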

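Solution 3: custom InputFormat
Hive also allows a custom InputFormat to solve the problems above: the custom InputFormat supplies its own parsing logic for reading each line

Custom InputFormat
Same as writing a custom InputFormat in MapReduce: extend TextInputFormat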

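package hive.mr; // matches the class name 'hive.mr.UserInputFormat' used in the table definition

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class UserInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        // UserRecordReader is the custom RecordReader described next
        UserRecordReader reader = new UserRecordReader(job, (FileSplit) genericSplit);
        return reader;
    }
}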

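Custom RecordReader
Same as writing a custom RecordReader in MapReduce: implement the RecordReader interface and its next method

public synchronized boolean next(LongWritable key, Text value) throws IOException {
    while (getFilePosition() <= end) {
        key.set(pos);
        // Read one line into value; newSize is the number of bytes consumed
        int newSize = in.readLine(value, maxLineLength, Math.max(maxBytesToConsume(pos), maxLineLength));
        // Rewrite the two-byte delimiter "||" as the single-byte delimiter "|"
        String str = value.toString().replaceAll("\\|\\|", "\\|");
        value.set(str);
        pos += newSize;
        if (newSize == 0) {
            return false; // nothing left to read
        }
        if (newSize < maxLineLength) {
            return true;
        }
        // The line was too long: skip it and try the next one
        LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
    }
    return false;
}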
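
The only Hive-specific work in next is the delimiter rewrite, which can be tried in isolation. The sketch below compares the replaceAll call used above with the literal String.replace, which needs no regex escaping. DelimiterRewrite and its method names are hypothetical, and the sample row is a made-up ASCII line.

```java
class DelimiterRewrite {
    // Same call as in next(): the regex \|\| is replaced by a literal "|"
    static String withRegex(String line) {
        return line.replaceAll("\\|\\|", "\\|");
    }

    // Literal replacement: no regular expression involved
    static String withLiteral(String line) {
        return line.replace("||", "|");
    }

    public static void main(String[] args) {
        String line = "01||Jay Chou||China||Taiwan||M||Qi Li Xiang";
        System.out.println(withRegex(line));
        System.out.println(withLiteral(line));
    }
}
```

Either way, a single "|" that already occurs inside a field becomes a delimiter after the rewrite, so this approach assumes field values never contain "|".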

Add the custom InputFormat to Hive
Package it as a jar and add the jar to Hive's classpath

-- Create table
create table singer(
    id string, -- singer id
    name string, -- singer name
    country string, -- country
    province string, -- province
    gender string, -- gender
    works string -- works
)
-- Use | as the field delimiter
row format delimited fields terminated by '|'
-- Use the custom class to parse the input
stored as
inputformat 'hive.mr.UserInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

-- Load data
load data local inpath '/root/hivedata/test01.txt' into table singer;

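2. URL Parsing

Basic structure of a URL
PROTOCOL (protocol type)://HOST (domain name)/PATH (access path)?QUERY (parameter data)

URL parsing functions in Hive
Functions
To parse URLs, Hive provides two dedicated functions, parse_url and parse_url_tuple. Both appear in the output of show functions

2.1 Case: get the HOST, PATH, and QUERY of the URL for each ID

parse_url function
Purpose: parse_url is Hive's most basic URL parsing function. Given a part name, it extracts that part from the URL and returns it. It is an ordinary one-to-one function

Syntax:
parse_url(url, partToExtract[, key]) -- extracts a part from a URL

Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO
key: optional; used with QUERY to return the value of a single query parameter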

-- URL parsing
select parse_url('http://facebook.com/path/p1.php?id=10086','HOST');
select parse_url('http://facebook.com/path/p1.php?id=10086&name=allen','QUERY');
select parse_url('http://facebook.com/path/p1.php?id=10086&name=allen','QUERY','name');
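
To make the part names concrete, the sketch below pulls the same pieces out of the example URL with java.net.URI. It is only an analogy for what parse_url returns, not Hive's implementation: the class UrlParts, its two methods, and the choice to support just four part names are all assumptions made for illustration.

```java
import java.net.URI;

class UrlParts {
    // Maps a few parse_url part names onto the matching URI accessor.
    static String part(String url, String name) {
        URI u = URI.create(url);
        switch (name) {
            case "HOST":     return u.getHost();
            case "PATH":     return u.getPath();
            case "QUERY":    return u.getQuery();
            case "PROTOCOL": return u.getScheme();
            default: throw new IllegalArgumentException("unsupported part: " + name);
        }
    }

    // Mimics the three-argument form: the value of one query parameter, or null.
    static String queryValue(String url, String key) {
        String query = URI.create(url).getQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(key)) {
                return pair.substring(eq + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String url = "http://facebook.com/path/p1.php?id=10086&name=allen";
        System.out.println(part(url, "HOST"));
        System.out.println(part(url, "QUERY"));
        System.out.println(queryValue(url, "name"));
    }
}
```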

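Drawback of parse_url
Extracting several parts at once requires calling the function several times

parse_url_tuple function
Purpose: parse_url_tuple is Hive's URL parsing function built on parse_url. It accepts several part names in one call and returns their values as multiple columns. It is a one-to-many function, that is, a UDTF

Syntax:
parse_url_tuple(url, partname1, partname2, ..., partnameN) -- extracts N (N >= 1) parts from a URL

It takes a URL and one or multiple partnames, and returns a tuple.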

-- Create table
create table tb_url(
    id int,
    url string
)row format delimited
fields terminated by '\t';
-- Load data

-- URL parsing
select parse_url_tuple(url,"HOST","PATH") as (host,path) from tb_url;

select parse_url_tuple(url,"PROTOCOL","HOST","PATH") as (protocol,host,path) from tb_url;

select parse_url_tuple(url,"HOST","PATH","QUERY") as (host,path,query) from tb_url;

Limitation of parse_url_tuple
parse_url_tuple is a UDTF, so it must either be the only expression in the select list or be combined with Lateral View; it cannot be selected next to ordinary columns such as id

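3. Row/Column Conversion: Applications and Implementation

3.1 Rows to columns: multiple rows to multiple columns

case when function
Purpose: evaluates conditions on the data and returns a different result for each case, similar to switch case in Java

Syntax 1:
CASE
WHEN condition1 THEN VALUE1
...
WHEN conditionN THEN VALUEN
ELSE default END

Syntax 2:
CASE column
WHEN V1 THEN VALUE1
...
WHEN VN THEN VALUEN
ELSE default END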

Case: multiple rows to multiple columns

col1 col2 col3
a    c    1
a    d    2
a    e    3
b    c    4
b    d    5
b    e    6

Converted to

col1    c   d   e
a       1   2   3
b       4   5   6

-- SQL implementation
select
    col1 as col1,
    max(case col2 when 'c' then col3 else 0 end) as c,
    max(case col2 when 'd' then col3 else 0 end) as d,
    max(case col2 when 'e' then col3 else 0 end) as e
from row2col1
group by col1;

3.2 Rows to columns: multiple rows to a single column

Case

col1 col2 col3
a    b    1
a    b    2
a    b    3
c    d    4
c    d    5
c    d    6

Converted to

col1    col2    col3
a       b       1,2,3
c       d       4,5,6

-- Multiple rows to a single column
select col1,col2,concat_ws(",",collect_list(cast(col3 as string))) as col3
from row2col2
group by col1,col2;
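
The query combines three steps: group by (col1, col2), gather col3 into a list (collect_list), and join the list with commas (concat_ws). As a rough mental model only, the same three steps can be written with Java streams. RowsToSingleColumn and collapse are hypothetical names, and joining col1 and col2 into one string key is a shortcut taken for brevity.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class RowsToSingleColumn {
    // Each row is {col1, col2, col3}; the result maps "col1,col2" to the joined col3 values.
    static Map<String, String> collapse(List<String[]> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> r[0] + "," + r[1],              // group by col1, col2
                LinkedHashMap::new,                  // keep groups in first-seen order
                Collectors.mapping(r -> r[2],        // like collect_list(col3)
                        Collectors.joining(","))));  // like concat_ws(",", ...)
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"a", "b", "1"}, new String[]{"a", "b", "2"},
                new String[]{"a", "b", "3"}, new String[]{"c", "d", "4"},
                new String[]{"c", "d", "5"}, new String[]{"c", "d", "6"});
        collapse(rows).forEach((key, joined) -> System.out.println(key + " -> " + joined));
    }
}
```

The Java version keeps values in input order. In Hive the order of elements inside collect_list is generally not guaranteed unless the input is sorted first, so "1,2,3" should not be relied on without that.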

3.3 Columns to rows: multiple columns to multiple rows

col1    col2    col3    col4
a       1       2       3
b       4       5       6

Converted to

col1 col2 col3
a    c    1
a    d    2
a    e    3
b    c    4
b    d    5
b    e    6

-- Multiple columns to multiple rows
select col1,'c' as col2,col2 as col3 from col2row1
union all
select col1,'d' as col2,col3 as col3 from col2row1
union all
select col1,'e' as col2,col4 as col3 from col2row1;

3.4 Columns to rows: single column to multiple rows

col1    col2    col3
a       b       1,2,3
c       d       4,5,6

Converted to

col1 col2 col3
a    b    1
a    b    2
a    b    3
c    d    4
c    d    5
c    d    6

-- Single column to multiple rows
select col1,col2,lv.col3 as col3
from col2row2
lateral view
explode(split(col3,",")) lv as col3;

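4. JSON Data Processing

4.1 Case: process JSON data and parse each field into a table

Ways to process JSON in Hive
Option 1: use the JSON functions
get_json_object, json_tuple
Both functions can parse each field out of the JSON data individually, so the results can be built into a table

Option 2: load the data with a JSON SerDe
Specify the SerDe when creating the table; JSON files loaded into the table are then parsed automatically into the table's columns

get_json_object
Purpose: parses a JSON string and returns the value of one specified field of the JSON object
Syntax: get_json_object(json_txt, path)
Parameters: the first parameter is the JSON string to parse; the second is the field to return, given as a path of the form $.columnName
Note: each call returns the value of only one field of the JSON object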

-- JSON data
{"device":"device_30","deviceType":"kafka","signal":98.0,"time":1616817201399}
-- Extract fields from the JSON data
select
    -- device name
    get_json_object(json,"$.device") as device,
    -- device type
    get_json_object(json,"$.deviceType") as deviceType,
    -- signal strength
    get_json_object(json,"$.signal") as signal,
    -- time
    get_json_object(json,"$.time") as stime
from tb_json_test1;

json_tuple
Purpose: parses a JSON string and returns the values of several fields as multiple columns in one call
Syntax: json_tuple(jsonStr, p1, p2, ..., pn)
Parameters: the first parameter is the JSON string to parse; the remaining parameters name the fields to return
Note: works like get_json_object but returns several columns from a single call. It is a UDTF, usually used together with lateral view; every returned column is a string

-- Used on its own
select
    -- Parse all fields
    json_tuple(json,"device","deviceType","signal","time") as (device,deviceType,signal,stime)
from tb_json_test1;

-- Used with lateral view
select json,device,deviceType,signal,stime
from tb_json_test1
lateral view json_tuple(json,"device","deviceType","signal","time") b
as device,deviceType,signal,stime;
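4.2 JSONSerde

Purpose
To simplify working with JSON files, Hive has a built-in SerDe dedicated to parsing them. If the table is created with JSONSerde as its parser, each field in the JSON file is parsed into the matching column automatically

-- Create table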
create table tb_json_test2(
    device string,
    deviceType string,
    signal double,
    `time` string
) row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile;

load data local inpath '/root/hivedata/device.json' into table tb_json_test2;

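5. Design and Implementation of Zipper Tables

Background
In practice Hive is mainly used to build offline data warehouses: data is periodically synchronized from various sources into Hive, then transformed layer by layer to serve data applications
For example, new order, user, and store information is synchronized from MySQL into the warehouse every day for order analysis and user analysis

Problem: the data changes; how should inserts and updates be handled?

Option 1: overwrite the old data in Hive with the new data
Pros: simplest to implement and most convenient to use
Cons: no historical state

Option 2: whenever the data changes, build a full snapshot table by date, one table per day
Pros: records the state of all data at each point in time
Cons: redundant storage

Option 3: build a zipper table, which uses time markers to record the validity period of each state of the data that changed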

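5.1 Zipper table design

A zipper table addresses how to store data in a data warehouse when that data changes
The design records a new state only for data that was updated; data that did not change is not stored again. The table therefore holds the state of all data over time, with time markers delimiting the lifetime of each state. Queries can retrieve the state for any required time range. By convention a maximum value such as 9999-12-31 marks the latest state

Implementation steps
Step 1: collect the changed data incrementally into an incremental table
Step 2: merge the Hive zipper table with the incremental table and write the result into a temporary table
Step 3: overwrite the zipper table with the data from the temporary table

5.2 Zipper table implementation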

Data preparation

Sample data:
001 186xxxx1234 laoda   0   sh  2021-01-01  9999-12-31

-- Create the zipper table
create table dw_zipper(
    userid string,
    phone string,
    nick string,
    gender int,
    addr string,
    starttime string,
    endtime string
) row format delimited fields terminated by "\t";

-- Load the sample data
load data local inpath '/root/hivedata/zipper/txt' into table dw_zipper;

Incremental collection

-- Create the ODS-layer incremental table and load data
create table ods_zipper_update(
    userid string,
    phone string,
    nick string,
    gender int,
    addr string,
    starttime string,
    endtime string
) row format delimited fields terminated by '\t';

Load the data with a load statement

Merge the data

-- Create the temporary table
create table tmp_zipper(
    userid string,
    phone string,
    nick string,
    gender int,
    addr string,
    starttime string,
    endtime string
) row format delimited fields terminated by '\t';

-- Merge the historical zipper table with the incremental table
insert overwrite table tmp_zipper
select userid,phone,nick,gender,addr,starttime,endtime
from ods_zipper_update
union all
-- Select every row of the existing zipper table, changing the endtime of the rows being updated to the day before the new row's starttime
select a.userid,a.phone,a.nick,a.gender,a.addr,a.starttime,
    -- Keep the original endtime if the row has no update or is not the current (open) row; otherwise set it to the new row's starttime minus 1 day
    if(b.userid is null or a.endtime < '9999-12-31',a.endtime,date_sub(b.starttime,1)) as endtime
from dw_zipper a left join ods_zipper_update b
on a.userid = b.userid;
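
The endtime rule in the merge is the core of the zipper table, so it is worth tracing on a tiny data set. The sketch below replays the union all and the if(...) expression in plain Java. It is an illustration under assumptions, not the warehouse job: ZipperMerge and Row are hypothetical, Row keeps only four of the seven columns, the rows in main are made up, and at most one incremental row per userid is assumed (with more, the left join would duplicate historical rows).

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ZipperMerge {
    static final String OPEN_END = "9999-12-31"; // marks the current state

    // Simplified row holding only the columns the merge rule needs
    static final class Row {
        final String userid, addr, starttime, endtime;
        Row(String userid, String addr, String starttime, String endtime) {
            this.userid = userid;
            this.addr = addr;
            this.starttime = starttime;
            this.endtime = endtime;
        }
    }

    static List<Row> merge(List<Row> zipper, List<Row> updates) {
        // Stand-in for the left join on userid
        Map<String, Row> updateByUser = new HashMap<>();
        for (Row u : updates) {
            updateByUser.put(u.userid, u);
        }
        // First branch of the union all: every incremental row unchanged
        List<Row> merged = new ArrayList<>(updates);
        // Second branch: every historical row, closing the open one if it was updated
        for (Row a : zipper) {
            Row b = updateByUser.get(a.userid);
            boolean keep = (b == null) || a.endtime.compareTo(OPEN_END) < 0;
            String endtime = keep
                    ? a.endtime
                    : LocalDate.parse(b.starttime).minusDays(1).toString(); // date_sub(b.starttime, 1)
            merged.add(new Row(a.userid, a.addr, a.starttime, endtime));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Row> zipper = Arrays.asList(
                new Row("001", "sh", "2021-01-01", OPEN_END),
                new Row("002", "bj", "2021-01-01", OPEN_END));
        List<Row> updates = Arrays.asList(
                new Row("001", "gz", "2021-01-02", OPEN_END));
        for (Row r : merge(zipper, updates)) {
            System.out.println(r.userid + "\t" + r.addr + "\t" + r.starttime + "\t" + r.endtime);
        }
    }
}
```

In this run the old row for user 001 is closed one day before the new row starts, the new row becomes the open one, and user 002 is carried over untouched.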

Overwrite the zipper table

-- Overwrite the zipper table
insert overwrite table dw_zipper
select * from tmp_zipper;