Given the following data:
32365 MOVE 1577808000000 {"goodid": 478777, "title": "商品478777", "price": "12000",
"shopid": "1", "mark": "mark"} 6.0.0 android {"browsetype": "chrome", "browseversion": "82,0"}
90339 MOVE 1577808008000 {"goodid": 998446, "title": "商品998446", "price": "12000",
"shopid": "1", "mark": "mark"} 6.0.0 android {"browsetype": "chrome", "browseversion": "82,0"}
10519 ORDER 1577808016000 {"goodid": 914583, "title": "商品914583", "price": "12000",
"shopid": "1", "mark": "mark"} 6.0.0 android {"browsetype": "chrome", "browseversion": "82,0"}
53844 CART 1577808024000 {"goodid": 4592971, "title": "商品4592971", "price": "12000",
"shopid": "1", "mark": "mark"} 6.0.0 android {"appid": "123456", "appversion": "11.0.0"}
Requirement: goodinfo and appinfo are the JSON-formatted fields shown in the sample above; use PySpark to parse out their JSON content (regex extraction is also acceptable) and write the result into a Hive table.
The fields are:
userid int,
action string,
acttime string,
goodinfo string,
version string,
system string,
appinfo string
Approach:
- The task is only to parse the JSON content and move the rows from one table into a new one, so PySpark fits well: the get_json_object function in pyspark.sql.functions handles the extraction (a quick sanity check is sketched right below).
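To see what get_json_object returns before touching the real table, something along these lines can be run in a local pyspark session as a sanity check (the sample JSON string mirrors the goodinfo data above; the session setup here is a minimal sketch, not part of the final job):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").appName("json_check").getOrCreate()

# One row whose goodinfo column holds a JSON string shaped like the sample data
sample = spark.createDataFrame(
    [('{"goodid": 478777, "title": "商品478777", "price": "12000"}',)],
    ["goodinfo"],
)

# get_json_object takes the JSON column and a JSONPath and returns a string (or null)
sample.select(
    F.get_json_object("goodinfo", "$.goodid").alias("goodid"),
    F.get_json_object("goodinfo", "$.title").alias("title"),
    F.get_json_object("goodinfo", "$.price").alias("price"),
).show(truncate=False)

spark.stop()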
Pitfalls:
- Python's default character set does not match the Hadoop platform's, so add the following as the first line of the script so the file is treated as UTF-8:
# -*- coding:utf-8 -*-
- Configure a local Spark: install the findspark module and initialize Spark in local mode (see the short sketch after this list), otherwise you will hit the error
py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.isEncryptionEnabled does not exist in the JVM
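A minimal sketch of that local setup, assuming pip install findspark has already been run and SPARK_HOME points at a local Spark installation:

# findspark puts the local Spark installation on sys.path; calling init() before
# any pyspark import is what avoids the Py4JError quoted above
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke_test").getOrCreate()
print(spark.version)  # a printed version number means the JVM gateway is working
spark.stop()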
Code:
# -*- coding:utf-8 -*-
import findspark
findspark.init()  # must run before importing pyspark in local mode

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

if __name__ == '__main__':
    spark = SparkSession.builder.master("local[*]").appName("logs") \
        .config("hive.metastore.uris", "thrift://single:9083") \
        .enableHiveSupport().getOrCreate()

    df = spark.sql("select * from ods_myshops.ods_logs")

    # Flatten the goodinfo JSON, pick the right keys out of appinfo depending on
    # whether it holds browser info or app info, then overwrite the target table.
    df.withColumn("goodid", F.get_json_object("goodinfo", "$.goodid")) \
      .withColumn("title", F.get_json_object("goodinfo", "$.title")) \
      .withColumn("price", F.get_json_object("goodinfo", "$.price")) \
      .withColumn("shopid", F.get_json_object("goodinfo", "$.shopid")) \
      .withColumn("mark", F.get_json_object("goodinfo", "$.mark")) \
      .withColumn("soft",
                  F.when(F.instr(df['appinfo'], "browsetype") == 0,
                         F.get_json_object("appinfo", "$.appid"))
                   .otherwise(F.get_json_object("appinfo", "$.browsetype"))) \
      .withColumn("soft_version",
                  F.when(F.instr(df['appinfo'], "browsetype") == 0,
                         F.get_json_object("appinfo", "$.appversion"))
                   .otherwise(F.get_json_object("appinfo", "$.browseversion"))) \
      .drop("goodinfo", "appinfo") \
      .write.format("hive").mode("overwrite") \
      .saveAsTable("ods_myshops.ods_newlog")
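After the job runs, a quick check (assuming the same pyspark session, or any session connected to the same metastore) confirms that the JSON fields have been flattened into ordinary columns; the column names follow the code above:

check = spark.sql(
    "select userid, action, goodid, title, price, shopid, mark, soft, soft_version "
    "from ods_myshops.ods_newlog limit 5"
)
check.show(truncate=False)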