Business scenario 1:
Loading JSON data into a Hive table:
Create a file people.json on the local Linux filesystem
with the following data:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
Create the table:
CREATE TABLE ods.spark_people_json
(
  `name` STRING,
  `age`  INT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
Load the data into the Hive table:
Note: this step may fail.
Using the JsonSerDe class directly raises an error, because the class is not loaded into Hive's classpath at startup.
The error looks like this:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe
So I first downloaded the jar and then added it in Hive, after which the SerDe can be used:
https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/0.12.0-cdh5.1.4
Jar: hive-hcatalog-core-0.12.0-cdh5.1.4.jar
hive (default)> add jar /var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/hive-hcatalog-core-0.12.0-cdh5.1.4.jar;
Added [/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/hive-hcatalog-core-0.12.0-cdh5.1.4.jar] to class path
Added resources: [/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/hive-hcatalog-core-0.12.0-cdh5.1.4.jar]
hive (default)> load data local inpath '/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/people.json' into table spark_people_json;
FAILED: ParseException line 1:5 character ' ' not supported here
hive (default)> load data local inpath '/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/people.json' into table spark_people_json;
FAILED: SemanticException [Error 10001]: Line 1:104 Table not found 'spark_people_json'
hive (default)> load data local inpath '/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/people.json' into ods.table spark_people_json;
FAILED: ParseException line 1:98 missing TABLE at 'ods' near '<EOF>'
line 1:108 extraneous input 'spark_people_json' expecting EOF near '<EOF>'
hive (default)> load data local inpath '/var/lib/hadoop-hdfs/spride_sqoop_beijing/bi_table/tang/people.json' into table ods.spark_people_json;
Loading data to table ods.spark_people_json
Table ods.spark_people_json stats: [numFiles=1, totalSize=75]
OK
Time taken: 0.736 seconds
hive (default)> select * from ods.spark_people_json;
OK
spark_people_json.name spark_people_json.age
Michael NULL
Michael NULL
Andy 30
Andy 30
Justin 19
Time taken: 0.464 seconds, Fetched: 5 row(s)
hive (default)>
Missing fields come back as NULL.
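The NULL-for-missing-fields behavior matches plain JSON parsing: an absent key simply has no value. As a rough sketch (in Python, not Hive's actual JsonSerDe code) of how each line of people.json maps to a (name, age) row:

```python
import json

# Each line of people.json is one JSON object, as in the file above.
lines = [
    '{"name":"Michael"}',
    '{"name":"Andy", "age":30}',
    '{"name":"Justin", "age":19}',
]

# Mimic the SerDe's column mapping: each top-level key feeds a column,
# and a missing key becomes None (Hive's NULL).
rows = [(obj.get("name"), obj.get("age"))
        for obj in map(json.loads, lines)]

for name, age in rows:
    print(name, age)
# Michael None
# Andy 30
# Justin 19
```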
Business scenario 2:
A column contains JSON; how do you extract a particular value from it?
Parsing JSON:
Two functions are available: get_json_object() and json_tuple().
1. get_json_object()
get_json_object takes the JSON string as its first argument and a JSONPath expression as its second: $ denotes the root of the JSON document, and . or [] navigate into objects and arrays.
SELECT
get_json_object
(
'{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"}}, "email":"amy@only_for_json_udf_test.net", "owner":"tang"}'
,'$.owner');
Returns:
tang
select get_json_object('{"name":"jack","server":"www.qq.com"}','$.server');
Returns:
www.qq.com
How do you extract a value from an array?
SELECT
get_json_object
(
'{"shop":{"book":[{"price":43.3,"type":"art"},{"price":30,"type":"technology"}],"clothes":{"price":19.951,"type":"shirt"}},"name":"jane","age":"23"}'
, '$.shop.book[0].type');
Returns:
art
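For readers without a Hive session handy, the same path lookup can be emulated with Python's standard json module (a sketch of the `$.shop.book[0].type` semantics, not Hive's actual JSONPath engine):

```python
import json

doc = json.loads(
    '{"shop":{"book":[{"price":43.3,"type":"art"},'
    '{"price":30,"type":"technology"}],'
    '"clothes":{"price":19.951,"type":"shirt"}},'
    '"name":"jane","age":"23"}'
)

# Equivalent of get_json_object(..., '$.shop.book[0].type'):
# '.' walks into objects, '[0]' indexes into the array.
value = doc["shop"]["book"][0]["type"]
print(value)  # art
```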
The problem: each call returns only one field. get_json_object accepts exactly two arguments, so extracting several values from the same JSON string means writing a separate get_json_object call for each one, which is tedious. That is where the second function comes in.
2. json_tuple()
select json_tuple('{"name":"jack","server":"www.qq.com"}','server','name');
Returns:
www.qq.com jack
The drawback is that json_tuple cannot handle complex nested JSON (it does not support the "." or "[]" path operators), so pick whichever of the two functions fits the situation.
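json_tuple's behavior, several top-level keys extracted in one pass with no nested paths, can be sketched in Python like so (the json_tuple function below is a hypothetical illustration of the semantics, not Hive's implementation):

```python
import json

def json_tuple(s, *keys):
    """Return the requested top-level keys in one pass,
    None for keys that are absent. Nested paths such as
    'shop.book' are NOT supported, matching Hive's json_tuple."""
    obj = json.loads(s)
    return tuple(obj.get(k) for k in keys)

print(json_tuple('{"name":"jack","server":"www.qq.com"}',
                 'server', 'name'))
# ('www.qq.com', 'jack')
```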