Business scenario 1:
Loading JSON data into a Hive table:
Create a file people.json on the local Linux filesystem
with the following data:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
Create the table:
CREATE TABLE ods.spark_people_json
(
  `name` STRING,
  `age`  INT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
Load the data into the Hive table:
Note: this step may fail.
Using the JsonSerDe class directly raises an error, because the class is not loaded into Hive's classpath at startup.
The error looks like this:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe
So I first downloaded the jar and then added it in Hive, after which the SerDe can be used:
https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/0.12.0-cdh5.1.4
Jar: hive-hcatalog-core-0.12.0-cdh5.1.4.jar
hive (default)> add jar /var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/hive-hcatalog-core-0.12.0-cdh5.1.4.jar;
Added [/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/hive-hcatalog-core-0.12.0-cdh5.1.4.jar] to class path
Added resources: [/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/hive-hcatalog-core-0.12.0-cdh5.1.4.jar]
hive (default)> load data local inpath '/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/people.json' into table spark_people_json;
FAILED: ParseException line 1:5 character ' ' not supported here
hive (default)> load data local inpath '/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/people.json' into table spark_people_json;
FAILED: SemanticException [Error 10001]: Line 1:104 Table not found 'spark_people_json'
hive (default)> load data local inpath '/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/people.json' into ods.table spark_people_json;
FAILED: ParseException line 1:98 missing TABLE at 'ods' near '<EOF>'
line 1:108 extraneous input 'spark_people_json' expecting EOF near '<EOF>'
hive (default)> load data local inpath '/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/people.json' into table ods.spark_people_json;
Loading data to table ods.spark_people_json
Table ods.spark_people_json stats: [numFiles=1, totalSize=75]
OK
Time taken: 0.736 seconds
hive (default)> select * from ods.spark_people_json;
OK
spark_people_json.name spark_people_json.age
Michael NULL
Michael NULL
Andy 30
Andy 30
Justin 19
Time taken: 0.464 seconds, Fetched: 5 row(s)
hive (default)>
Missing fields come back as NULL.
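The NULL-for-missing-fields behavior matches plain JSON parsing: an absent key simply has no value. As a rough sketch (in Python, not Hive's actual JsonSerDe code) of how each line of people.json maps to a (name, age) row:

```python
import json

# Each line of people.json is one JSON object, as in the file above.
lines = [
    '{"name":"Michael"}',
    '{"name":"Andy", "age":30}',
    '{"name":"Justin", "age":19}',
]

# Mimic the SerDe's column mapping: each top-level key feeds a column,
# and a missing key becomes None (Hive's NULL).
rows = [(obj.get("name"), obj.get("age"))
        for obj in map(json.loads, lines)]

for name, age in rows:
    print(name, age)
# Michael None
# Andy 30
# Justin 19
```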
Business scenario 2:
A column contains JSON; how do you extract a particular value from it?
Parsing JSON:
Two functions are available: get_json_object() and json_tuple().
1. get_json_object()
get_json_object takes the JSON string as its first argument and a JSONPath expression as its second: $ denotes the root of the JSON document, and . or [] navigate into objects and arrays.
SELECT
get_json_object
(
'{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"}}, "email":"amy@only_for_json_udf_test.net", "owner":"tang"}'
,'$.owner');
Returns:
tang
select get_json_object('{"name":"jack","server":"www.qq.com"}','$.server');
Returns:
www.qq.com
How do you extract a value from an array?
SELECT
get_json_object
(
'{"shop":{"book":[{"price":43.3,"type":"art"},{"price":30,"type":"technology"}],"clothes":{"price":19.951,"type":"shirt"}},"name":"jane","age":"23"}'
, '$.shop.book[0].type');
Returns:
art
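For readers without a Hive session handy, the same path lookup can be emulated with Python's standard json module (a sketch of the `$.shop.book[0].type` semantics, not Hive's actual JSONPath engine):

```python
import json

doc = json.loads(
    '{"shop":{"book":[{"price":43.3,"type":"art"},'
    '{"price":30,"type":"technology"}],'
    '"clothes":{"price":19.951,"type":"shirt"}},'
    '"name":"jane","age":"23"}'
)

# Equivalent of get_json_object(..., '$.shop.book[0].type'):
# '.' walks into objects, '[0]' indexes into the array.
value = doc["shop"]["book"][0]["type"]
print(value)  # art
```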
The problem: each call returns only one field. get_json_object accepts exactly two arguments, so extracting several values from the same JSON string means writing a separate get_json_object call for each one, which is tedious. That is where the second function comes in.
2. json_tuple()
select json_tuple('{"name":"jack","server":"www.qq.com"}','server','name');
Returns:
www.qq.com jack
The drawback is that json_tuple cannot handle complex nested JSON (it does not support the "." or "[]" path operators), so pick whichever of the two functions fits the situation.
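json_tuple's behavior, several top-level keys extracted in one pass with no nested paths, can be sketched in Python like so (the json_tuple function below is a hypothetical illustration of the semantics, not Hive's implementation):

```python
import json

def json_tuple(s, *keys):
    """Return the requested top-level keys in one pass,
    None for keys that are absent. Nested paths such as
    'shop.book' are NOT supported, matching Hive's json_tuple."""
    obj = json.loads(s)
    return tuple(obj.get(k) for k in keys)

print(json_tuple('{"name":"jack","server":"www.qq.com"}',
                 'server', 'name'))
# ('www.qq.com', 'jack')
```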