3. Spark SQL explained

3.1 New starting point: SparkSession

      In older versions, Spark SQL provided two query entry points: SQLContext, for Spark's own SQL queries, and HiveContext, for querying Hive. SparkSession is the newest entry point for Spark SQL queries. It is essentially the combination of SQLContext and HiveContext, so the APIs available on SQLContext and HiveContext can also be used on SparkSession. Internally, SparkSession encapsulates a SparkContext, so the actual computation is still performed by the SparkContext.

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

      SparkSession.builder() is used to create a SparkSession.

      import spark.implicits._ brings in the implicit conversions between RDDs and DataFrames/Datasets (for example toDF and toDS, and later the $-notation), so that these conversion methods can be used on df.
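
      Since SparkSession wraps a SparkContext, the context can be pulled out of the session for RDD-level work (a minimal sketch, assuming the spark session created above):

// The encapsulated SparkContext still performs the actual computation
val sc = spark.sparkContext
val numbersRdd = sc.parallelize(Seq(1, 2, 3))
println(numbersRdd.count()) // 3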

      If Hive support is needed, create the SparkSession with the following statement:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .enableHiveSupport()
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

3.2 Creating DataFrames

      In Spark SQL, SparkSession is the entry point for creating DataFrames and executing SQL. There are three ways to create a DataFrame: converting an existing RDD, querying a Hive table, or reading from a Spark data source (a sketch of the Hive-table route is shown after the two examples below).

      Creating from a Spark data source:

val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

      Converting from an RDD:

/**
Michael, 29
Andy, 30
Justin, 19
  **/
scala> val peopleRdd = sc.textFile("examples/src/main/resources/people.txt")
peopleRdd: org.apache.spark.rdd.RDD[String] = examples/src/main/resources/people.txt
MapPartitionsRDD[18] at textFile at <console>:24

scala> val peopleDF3 = peopleRdd.map(_.split(",")).map(paras => (paras(0),paras(1).trim().toInt)).toDF("name","age")
peopleDF3: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> peopleDF3.show()
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+
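
      The remaining route, returning a DataFrame from a Hive table query, looks roughly as follows (a sketch only: it assumes the session was built with enableHiveSupport() and that a Hive table named people already exists):

// Query an existing Hive table; the result is a DataFrame
val hiveDF = spark.sql("SELECT name, age FROM people")
// Or reference the table directly
val hiveDF2 = spark.table("people")
hiveDF.show()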

3.3 Common DataFrame operations

  3.3.1 DSL style syntax 

// This import is needed to use the $-notation
import spark.implicits._
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+

// Select people older than 21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// Count people by age
df.groupBy("age").count().show()
// +----+-----+
// | age|count|
// +----+-----+
// |  19|    1|
// |null|    1|
// |  30|    1|
// +----+-----+

  3.3.2 SQL style syntax

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

      A temporary view is scoped to the session: once the session ends, the view is no longer available. If the view needs to be valid for the whole application, use a global temporary view. Note that a global temporary view must be referenced by its full path, for example global_temp.people.
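
      A small sketch of the scoping difference, assuming the people view registered above:

// The session-scoped view "people" is visible only in the session that created it;
// from a new session this query would fail with "Table or view not found":
// spark.newSession().sql("SELECT * FROM people").show()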

3.4 Creating a Dataset

      A Dataset is a strongly typed collection of data, so the corresponding type information must be provided.

// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface
case class Person(name: String, age: Long)

// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+

// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

3.5 Dataset and RDD interoperability  

      Spark SQL supports two ways of converting an existing RDD into a Dataset. The conversion requires the schema of the data in the RDD. The first way obtains the schema by reflection, which suits the case where the column names are known in advance. The second way builds the schema through a programmatic interface and applies it to the RDD, which handles the case where the columns are only known at runtime.

  3.5.1 Inferring the schema by reflection

      Spark SQL can automatically convert an RDD that contains case classes into a DataFrame. The case class defines the structure of the table, and its attributes become the column names by reflection. Case classes can also contain complex types such as Seqs or Arrays.

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
.textFile("examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index on the Row object
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// or by field name
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))

  3.5.2 Specifying the schema programmatically

      If the case class cannot be defined in advance, a DataFrame can be created with the following three steps:

      Create an RDD of Rows from the original RDD

      Create the schema, represented by a StructType, matching the structure of the Rows

      Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession

import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string; in practice it would be generated dynamically by the program
val schemaString = "name age"

// Generate the schema (an Array[StructField]) based on the schema string
val fields = schemaString.split(" ")
.map(fieldName => StructField(fieldName, StringType, nullable = true))

// Alternative: choose each field's type by name
// val fields = schemaString.split(" ").map(fieldName => fieldName match {
//   case "name" => StructField(fieldName, StringType, nullable = true)
//   case "age"  => StructField(fieldName, IntegerType, nullable = true) })

val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
import org.apache.spark.sql._
val rowRDD = peopleRDD
.map(_.split(","))
.map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// | Name:   Andy|
// | Name: Justin|
// +-------------+

3.6 Summary of type conversions

      RDD, DataFrame, and Dataset have much in common, but each has its own applicable scenarios, so conversions between them are often needed.

      DataFrame/Dataset to RDD:

val rdd1 = testDF.rdd
val rdd2 = testDS.rdd

      RDD to DataFrame:

import spark.implicits._
val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")

      Usually each line of data is first assembled into a tuple, and then the field names are specified in toDF.
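
      A minimal self-contained sketch of this pattern (the sample data is an assumption for illustration):

import spark.implicits._

// Build an RDD of tuples, then name the columns in toDF
val rdd = spark.sparkContext.parallelize(Seq(("Michael", 29), ("Andy", 30)))
val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")
testDF.show()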

      RDD to Dataset:

import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable // define the field names and types
val testDS = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS

      Note that once the type of each row (the case class) is defined, the field names and types are already given; you only need to fill the case class with the values.

      Dataset to DataFrame:

      This is also very simple, since it only repackages the case class into Rows.

import spark.implicits._ 
val testDF = testDS.toDF

      DataFrame to Dataset:

import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable // define the field names and types
val testDS = testDF.as[Coltest]

      This method requires the type of each column to be given; the as method then converts the DataFrame into a Dataset. This is very convenient when every field of the DataFrame needs to be processed with its concrete type.

      When using these conversions, be sure to add import spark.implicits._, otherwise toDF and toDS cannot be used.

3.7 User-defined functions 

      Users can register custom functions through spark.udf.

  3.7.1 User-defined functions (UDF)

scala> val df = spark.read.json("examples/src/main/resources/people.json") 
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> spark.udf.register("addName", (x:String)=> "Name:"+x)
res5: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> df.createOrReplaceTempView("people")

scala> spark.sql("Select addName(name), age from people").show()
+-----------------+----+
|UDF:addName(name)| age|
+-----------------+----+
|     Name:Michael|null|
|        Name:Andy|  30|
|      Name:Justin|  19|
+-----------------+----+
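
      A UDF can also be applied through the DataFrame DSL instead of SQL (a small sketch, assuming the same df as above):

import org.apache.spark.sql.functions.udf
import spark.implicits._

// Wrap an ordinary Scala function as a column expression
val addName = udf((x: String) => "Name:" + x)
df.select(addName($"name"), $"age").show()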

  3.7.2 User-defined aggregate function  

      Both the strongly typed Dataset and the weakly typed DataFrame provide built-in aggregate functions, such as count(), countDistinct(), avg(), max(), and min(). In addition, users can define their own aggregate functions, as shown in the two subsections below; a short sketch of the built-ins follows first.
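
      As a quick illustration of the built-in aggregates (a minimal sketch, assuming the people DataFrame df from section 3.2):

import org.apache.spark.sql.functions._

// Built-in aggregate functions on an untyped DataFrame
df.agg(count("name"), avg("age"), max("age"), min("age")).show()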

    3.7.2.1 Weakly typed user-defined aggregate functions

      A weakly typed user-defined aggregate function is implemented by extending UserDefinedAggregateFunction. The following custom aggregate function computes the average salary.

import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object MyAverage extends UserDefinedAggregateFunction {

  // Data type of the input arguments of this aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)

  // Data types of the values in the aggregation buffer
  def bufferSchema: StructType = {
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  }

  // Data type of the returned value
  def dataType: DataType = DoubleType

  // Whether this function always returns the same output for identical input
  def deterministic: Boolean = true

  // Initializes the aggregation buffer
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    // total salary
    buffer(0) = 0L
    // number of salaries
    buffer(1) = 0L
  }

  // Merges new input rows into the buffer within the same executor
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }

  // Merges aggregation buffers computed on different executors
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }

  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}
  
// Register the function
spark.udf.register("myAverage", MyAverage)

val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+

    3.7.2.2 Strongly typed user-defined aggregate functions

      A strongly typed custom aggregate function is implemented by extending Aggregator. The example again computes the average salary.

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession

// Since it is strongly typed, case classes can be used
case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  // A zero value for this aggregation: total salary and count of salaries, both initially 0
  def zero: Average = Average(0L, 0L)

  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }

  // Merge intermediate results from different executors
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }

  // Compute the final output
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count

  // Encoder for the intermediate value type; Encoders.product handles Scala tuples and case classes
  def bufferEncoder: Encoder[Average] = Encoders.product

  // Encoder for the final output value type
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

import spark.implicits._
val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+

 
