Hive functions: get_json_object and json_tuple for working with JSON data (loading JSON into Hive and parsing JSON)

Scenario 1:

Loading JSON data into a Hive table.

Create a file people.json on the local Linux filesystem with the following contents:

{"name":"Michael"}

{"name":"Andy", "age":30}

{"name":"Justin", "age":19}

Create the table:

CREATE TABLE ods.spark_people_json
(
    `name` STRING,
    `age`  INT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

Load the data into the Hive table.

Note: this step may fail. Referencing the JsonSerDe class directly raises an error, because the class is not on Hive's classpath by default.

The error looks like this:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe
 

So first download the jar, then ADD JAR it into the Hive session:

https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/0.12.0-cdh5.1.4

Jar name: hive-hcatalog-core-0.12.0-cdh5.1.4.jar

hive (default)> add jar /var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/hive-hcatalog-core-0.12.0-cdh5.1.4.jar;
Added [/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/hive-hcatalog-core-0.12.0-cdh5.1.4.jar] to class path
Added resources: [/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/hive-hcatalog-core-0.12.0-cdh5.1.4.jar]
hive (default)> load  data local inpath '/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/people.json' into table spark_people_json;
FAILED: ParseException line 1:5 character ' ' not supported here
hive (default)> load data local inpath '/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/people.json' into table spark_people_json;
FAILED: SemanticException [Error 10001]: Line 1:104 Table not found 'spark_people_json'
hive (default)> load data local inpath '/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/people.json' into ods.table spark_people_json;
FAILED: ParseException line 1:98 missing TABLE at 'ods' near '<EOF>'
line 1:108 extraneous input 'spark_people_json' expecting EOF near '<EOF>'
hive (default)> load data local inpath '/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/people.json' into table ods.spark_people_json;
Loading data to table ods.spark_people_json
Table ods.spark_people_json stats: [numFiles=1, totalSize=75]
OK
Time taken: 0.736 seconds
hive (default)> select * from ods.spark_people_json;
OK
spark_people_json.name	spark_people_json.age
Michael	NULL
Michael	NULL
Andy	30
Andy	30
Justin	19
Time taken: 0.464 seconds, Fetched: 5 row(s)
hive (default)> 

Fields missing from a JSON record come back as NULL.
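Since absent keys surface as NULL, they can be filtered or defaulted with ordinary SQL; a minimal sketch against the table above:

```sql
-- Rows whose source JSON had no "age" key
SELECT name FROM ods.spark_people_json WHERE age IS NULL;

-- Substitute a default value for the missing ages
SELECT name, COALESCE(age, -1) AS age FROM ods.spark_people_json;
```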

Scenario 2:

A column contains a JSON string; how do you extract a particular value from it?

Parsing JSON:

Two functions are available: get_json_object() and json_tuple().

1. get_json_object()

get_json_object takes the JSON string as its first argument and a JSON path as its second; in the path, $ denotes the root of the JSON document, and . or [] navigate into objects and arrays.

SELECT
    get_json_object
    (
    '{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"}}, "email":"amy@only_for_json_udf_test.net", "owner":"tang"}'
    ,'$.owner');

Returns:
tang

select get_json_object('{"name":"jack","server":"www.qq.com"}','$.server');
Returns:
www.qq.com

What if the value is an array?

SELECT
    get_json_object
    (
    '{"shop":{"book":[{"price":43.3,"type":"art"},{"price":30,"type":"technology"}],"clothes":{"price":19.951,"type":"shirt"}},"name":"jane","age":"23"}'
    , '$.shop.book[0].type');

Returns:
art
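The path syntax also accepts a [*] wildcard that matches every element of an array; when several values match, get_json_object packs them into a JSON array string (e.g. something like [43.3,30] for the document above). A sketch:

```sql
SELECT
    get_json_object
    (
    '{"shop":{"book":[{"price":43.3,"type":"art"},{"price":30,"type":"technology"}],"clothes":{"price":19.951,"type":"shirt"}},"name":"jane","age":"23"}'
    , '$.shop.book[*].price');
```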

The catch: each call extracts only one field.

get_json_object accepts exactly two arguments, no more, so pulling several values out of the same JSON document means writing one get_json_object call per field, which is tedious. That is where the second function comes in.
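Concretely, extracting several fields this way means repeating the call once per field:

```sql
SELECT
    get_json_object('{"name":"jack","server":"www.qq.com"}', '$.name')   AS name,
    get_json_object('{"name":"jack","server":"www.qq.com"}', '$.server') AS server;
```

This is verbose, and each call parses the same JSON string again.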

 

2. json_tuple()

select json_tuple('{"name":"jack","server":"www.qq.com"}','server','name');
Returns:
www.qq.com	jack
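When the JSON lives in a table column, json_tuple is typically paired with LATERAL VIEW so that every extracted key becomes a proper column. A sketch, assuming a hypothetical table logs with a STRING column js:

```sql
SELECT j.name, j.server
FROM logs
LATERAL VIEW json_tuple(logs.js, 'name', 'server') j AS name, server;
```

Each row's JSON is parsed once, no matter how many keys are extracted.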

Its drawback: json_tuple cannot handle nested JSON, because it does not support the "." and "[]" path syntax and only extracts top-level keys. Choose between the two functions based on the shape of your data.
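A common workaround for nested documents is to let json_tuple peel off a top-level object (which it returns as a JSON string) and then point get_json_object at that fragment. A sketch, with the input row inlined as a subquery for illustration:

```sql
SELECT get_json_object(j.shop, '$.book[0].type') AS first_book_type
FROM (SELECT '{"shop":{"book":[{"price":43.3,"type":"art"}]},"name":"jane"}' AS js) src
LATERAL VIEW json_tuple(src.js, 'shop') j AS shop;
```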

Reference:

https://blog.csdn.net/lsr40/article/details/79399166


Reposted from blog.csdn.net/weixin_38750084/article/details/93498986