SparkSQL series (7): JSON to Map, multi-file output

        Every product in the company reports JSON data to the data warehouse, and because the formats are not unified, processing the results is very troublesome. After discussion, the common fields are pulled out into fixed columns, while the business-specific fields stay inside an extends column, and each line of business writes its own SQL to parse the fields it needs out of extends. This involves converting JSON to a Map, so I am recording it here.

I: JSON to Map

Why we need to convert JSON to a Map

        The company has many products and they report a great deal of data in very non-standard formats; fields with the same name are common, parsing is difficult, and a unified script is needed to extract the fields.

        The reported data looks like: {"id":"7","sex":"7","data":{"sex":"13","class":"7"}}

Import the jar package

        We use fastjson to parse the JSON data into a Map structure; a minimal parsing sketch follows the dependency block below.

    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.47</version>
    </dependency>
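
        A minimal sketch of the conversion itself, assuming fastjson is on the classpath; the sample line is just the one shown above:

        import java.util
        import com.alibaba.fastjson.JSON

        // Parse one reported line into a java.util.HashMap; nested objects such as
        // "data" come back as fastjson objects and need a second parseObject call.
        val line = """{"id":"7","sex":"7","data":{"sex":"13","class":"7"}}"""
        val map: util.HashMap[String, Object] =
                JSON.parseObject(line, classOf[util.HashMap[String, Object]])
        println(map.get("id"))     // "7"
        println(map.get("data"))   // the nested object, which still needs its own parseObject call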

Data

        {"id":"7","sex":"7","da","data":{"name":"7","class":"7","data":{"name":"7","class":"7"}}}
        {"id":"8","name":"8","data":{"sex":"8","class":"8"},"data":{"sex":"8","class":"8"}}
        {"class":"9","data":{"name":"9","sex":"9"}}
        {"id":"10","name":"10","data":{"sex":"10","class":"10"}}
        {"id":"11","class":"11","data":{"name":"11","sex":"11"}}

Code

        import org.apache.spark.sql.SparkSession
        import com.alibaba.fastjson.JSON
        import java.util

        // In this example we extract id as its own column and keep all the remaining fields in an extends column
        val sparkSession = SparkSession.builder().master("local").getOrCreate()
        val nameRDD1df = sparkSession.read.textFile("/software/java/idea/data")

        import sparkSession.implicits._
        import org.apache.spark.sql.functions.col
        val finalResult = nameRDD1df.map(x=>{
                var map:util.HashMap[String, Object] = new util.HashMap[String, Object]()
                try{
                        map = JSON.parseObject(x, classOf[util.HashMap[String, Object]])
                }catch { case e: Exception => e.printStackTrace() }

                var finalMap:util.HashMap[String, Object] = if(map.containsKey("data")){

                        var dataMap:util.HashMap[String, Object] = new util.HashMap[String, Object]()
                        try{
                                dataMap = JSON.parseObject(map.get("data").toString, classOf[util.HashMap[String, Object]])
                        }catch { case e: Exception => e.printStackTrace() }
                        // merge the top-level fields over the nested data fields, then drop id and the raw data blob
                        dataMap.putAll(map); dataMap.remove("id"); dataMap.remove("data")
                        dataMap
                }else {new util.HashMap[String, Object]()}
                val id = if(map.get("id") == null) "" else map.get("id").toString
                (id,JSON.toJSONString(finalMap,false))
        })
        .toDF("id","extends")
        .filter(col("id") =!= "")

        finalResult.show(10, false)
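
        The intro mentioned that each line of business parses its own fields back out of the extends column with SQL. A minimal sketch of what that could look like using Spark SQL's built-in get_json_object; the view name and the fields pulled out here are only illustrative:

        finalResult.createOrReplaceTempView("unified")
        sparkSession.sql(
                """SELECT id,
                  |       get_json_object(`extends`, '$.sex')   AS sex,
                  |       get_json_object(`extends`, '$.class') AS `class`
                  |FROM unified""".stripMargin)
                .show(false)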

II: Multi-file output

        Most of the time with SparkSQL we read one directory and write one directory, but in real use we sometimes need to read data and write it out to multiple directories, distinguished by an id field in the data. I am recording how to do that here; in essence it comes down to partitionBy.

 SparkSQL --->>> partitionBy

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val sparkSession = SparkSession.builder().master("local").getOrCreate()
    // read the JSON data, keep only id and name, and write one sub-directory per distinct id
    sparkSession.read.json("/software/java/idea/data")
        .select("id", "name")
        .write.mode(SaveMode.Append).partitionBy("id")
        .json("/software/java/idea/end")

 Spark Core --->>> custom OutputFormat

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.sql.SparkSession

    // write each record into a sub-directory named after its key
    class RDDMultipleTextOutputFormat[K, V]() extends MultipleTextOutputFormat[K, V]() {
        override def generateFileNameForKeyValue(key: K, value: V, name: String): String = {
            key + "/" + name
        }
    }

    val sparkSession = SparkSession.builder().master("local").getOrCreate()
    val sparkContext = sparkSession.sparkContext

    // the Hadoop output format refuses to write into an existing directory, so clear it first
    val fileSystem = FileSystem.get(sparkContext.hadoopConfiguration)
    fileSystem.delete(new Path("/software/java/idea/end"), true)

    sparkContext.textFile("/software/java/idea/data").map(x => {
        val array = x.split("\\|")
        (array(0) + "=" + array(1), array(2))
    }).saveAsHadoopFile("/software/java/idea/end", classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat[_, _]])
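
        Note that this Spark Core variant assumes pipe-delimited input rather than JSON: array(0) and array(1) form the key, which becomes the output sub-directory name, and array(2) is the value written into the part files by the underlying TextOutputFormat. For example, a hypothetical line 7|smith|beijing would land under end/7=smith/.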


Origin: www.cnblogs.com/wuxiaolong4/p/12590473.html