3.1 The new entry point: SparkSession
In older versions, Spark SQL provided two entry points for SQL queries: SQLContext, for Spark's own SQL queries, and HiveContext, for queries against Hive. SparkSession is the newest entry point for Spark SQL. It is essentially a combination of SQLContext and HiveContext, so the APIs available on SQLContext and HiveContext can also be used on SparkSession. SparkSession wraps a SparkContext internally, so the computation is actually carried out by the SparkContext.
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
SparkSession.builder is used to create a SparkSession.
import spark.implicits._ brings in the implicit conversions that turn RDDs and local collections into DataFrames and Datasets, which makes methods such as toDF and toDS available. It also enables the $"column" notation.
If you need Hive support, create the SparkSession as follows:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .enableHiveSupport()
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
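To make the relationship between SparkSession and SparkContext more concrete, here is a minimal sketch. It assumes a local run: the master("local[*]") setting, the application name, and the sample numbers are illustrative assumptions and are not part of the original example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode with all cores; on a cluster the master is normally set by spark-submit
val spark = SparkSession
  .builder()
  .appName("SparkSession wraps SparkContext")
  .master("local[*]")
  .getOrCreate()

// The SparkContext that actually runs the jobs is exposed by the session
val sc = spark.sparkContext

// RDD work goes through the SparkContext ...
val total = sc.parallelize(Seq(1, 2, 3)).sum()

// ... while SQL goes through the session, which runs on that same context
val answer = spark.sql("SELECT 1 + 1 AS answer").first().getInt(0)

println(s"total = $total, answer = $answer")
```

Both results come from the same underlying SparkContext, so there is no need to create a separate context next to the session.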
3.2 Creating DataFrames
In Spark SQL, SparkSession is the entry point for creating DataFrames and executing SQL. There are three ways to create a DataFrame: convert an existing RDD, query a Hive table, or read from a Spark data source.
Creating from a Spark data source:
val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Converting from an RDD:
/**
Michael, 29
Andy, 30
Justin, 19
**/
scala> val peopleRdd = sc.textFile("examples/src/main/resources/people.txt")
peopleRdd: org.apache.spark.rdd.RDD[String] = examples/src/main/resources/people.txt MapPartitionsRDD[18] at textFile at <console>:24

scala> val peopleDF3 = peopleRdd.map(_.split(",")).map(paras => (paras(0), paras(1).trim().toInt)).toDF("name", "age")
peopleDF3: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> peopleDF3.show()
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+
3.3 Common DataFrame operations
3.3.1 DSL-style syntax
// This import is needed to use the $-notation
import spark.implicits._

// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+

// Select people older than 21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// Count people by age
df.groupBy("age").count().show()
// +----+-----+
// | age|count|
// +----+-----+
// |  19|    1|
// |null|    1|
// |  30|    1|
// +----+-----+
3.3.2 SQL-style syntax
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
A temporary view is scoped to the session that created it: once that session ends, the view is no longer available. If you want a view that stays valid for the whole application, use a global temporary view. Note that a global temporary view must be accessed by its fully qualified name, for example global_temp.people.
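The scoping difference can be checked directly. The following is a minimal sketch, assuming a local session; the view names people_session and people_global and the in-memory sample rows are hypothetical stand-ins for people.json.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Temp view scope")
  .master("local[*]") // assumption: local run
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for people.json
val df = Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age")

// Visible only in the session that created it
df.createOrReplaceTempView("people_session")
// Visible to every session of the application, under the global_temp database
df.createOrReplaceGlobalTempView("people_global")

// A second session that shares the same SparkContext
val other = spark.newSession()

// The new session's catalog does not contain the session-scoped view
val sessionViewVisible = other.catalog.tableExists("people_session")

// The global view is reachable from the new session through its qualified name
val globalCount = other.sql("SELECT * FROM global_temp.people_global").count()

println(s"session view visible elsewhere: $sessionViewVisible, global rows: $globalCount")
```

The sketch uses createOrReplaceGlobalTempView rather than createGlobalTempView only so that it can be re-run without an "already exists" error.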
3.4 Creating Datasets
A Dataset is a strongly typed collection of data, so you must supply the corresponding type information.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface
case class Person(name: String, age: Long)

// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+

// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
3.5 Interoperating between Datasets and RDDs
Spark SQL supports two ways of converting an existing RDD into a Dataset. In both cases the conversion needs the schema of the data in the RDD. The first way uses reflection to infer the schema from the RDD, which suits cases where the column names are known in advance. The second way uses a programmatic interface to build a schema and apply it to the RDD, which handles cases where the columns are only known at runtime.
3.5.1 Inferring the schema by reflection
Spark SQL can automatically convert an RDD of case class objects into a DataFrame. The case class defines the structure of the table: its field names are read by reflection and become the column names. Case classes can also contain complex structures such as Seq or Array.
// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// or by field name
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))
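As a small illustration of the point that case classes may contain complex structures, the sketch below converts an RDD of a case class with a Seq field. The Student class, its data, and the local master setting are hypothetical and serve only this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.ArrayType

// Hypothetical case class with a nested collection field
case class Student(name: String, scores: Seq[Int])

val spark = SparkSession
  .builder()
  .appName("Reflection with nested fields")
  .master("local[*]") // assumption: local run
  .getOrCreate()
import spark.implicits._

// The schema, including the array column, is inferred from the case class by reflection
val students = spark.sparkContext
  .parallelize(Seq(Student("Andy", Seq(90, 85)), Student("Justin", Seq(70))))
  .toDF()

students.printSchema()

// The Seq[Int] field shows up as an array column in the schema
val scoresType = students.schema("scores").dataType

// Going back to the typed view gives access to the Seq as a normal Scala collection
val best = students.as[Student].map(s => s.scores.max).collect().sorted
```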
3.5.2 Specifying the schema programmatically
If case classes cannot be defined ahead of time, a DataFrame can be created in the following three steps:
1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, that matches the structure of the Rows.
3. Apply the schema to the RDD of Rows through the createDataFrame method provided by SparkSession.
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string; in practice it would be generated dynamically by the program
val schemaString = "name age"

// Generate the schema (an Array[StructField]) based on the schema string
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
// Alternative with a type per field (the age values in each Row would then also need toInt):
// val fields = schemaString.split(" ").map {
//   case "name" => StructField("name", StringType, nullable = true)
//   case "age"  => StructField("age", IntegerType, nullable = true)
// }
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
import org.apache.spark.sql._
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+
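The three steps can also be sketched with a typed age column instead of StringType everywhere. This is an illustrative variant, not the original author's code: the in-memory lines, the view name people_typed, and the local master setting are assumptions.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession
  .builder()
  .appName("Typed programmatic schema")
  .master("local[*]") // assumption: local run
  .getOrCreate()

// Hypothetical in-memory lines in the same "name, age" layout as people.txt
val lines = spark.sparkContext.parallelize(Seq("Michael, 29", "Andy, 30", "Justin, 19"))

// Rows: every value must match the type declared for its column, hence toInt for age
val rowRDD = lines
  .map(_.split(","))
  .map(a => Row(a(0).trim, a(1).trim.toInt))

// Schema: one StructField per column, each with its own type
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Apply the schema to the RDD of Rows
val typedDF = spark.createDataFrame(rowRDD, schema)
typedDF.createOrReplaceTempView("people_typed")

// Because age is an integer column, numeric comparisons work without casting
val adults = spark
  .sql("SELECT name FROM people_typed WHERE age > 20")
  .collect()
  .map(_.getString(0))
  .sorted
```

If the Row values and the declared types disagree (for example a String where IntegerType is declared), the error typically only surfaces when an action runs, so it helps to keep the two definitions next to each other.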
3.6 Summary of conversions between types
RDD, DataFrame, and Dataset have a lot in common, but each has its own applicable scenarios, so you often need to convert between them.
DataFrame / Dataset to RDD:
val rdd1 = testDF.rdd
val rdd2 = testDS.rdd
RDD to DataFrame:
import spark.implicits._
val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")
Typically each row's data is written as a tuple, and the field names are then specified in toDF.
RDD to Dataset:
import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable // defines the field names and types
val testDS = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS
Note that defining the type of each row (the case class) already fixes the field names and types, so afterwards you only need to fill the case class with values.
Dataset to DataFrame:
This is also simple, because it only wraps the case class into Row.
import spark.implicits._
val testDF = testDS.toDF
DataFrame to Dataset:
import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable // defines the field names and types
val testDS = testDF.as[Coltest]
This approach supplies the type of each column and then uses the as method to convert the DataFrame into a Dataset, which is convenient when the data arrives as a DataFrame and each field needs to be processed.
When using these conversions, be sure to add import spark.implicits._; otherwise toDF and toDS are not available.
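The conversions summarized in this section can be chained in one place. The sketch below is a minimal round trip under assumed sample data; the Coltest case class mirrors the one used above, while the tuples and the local master setting are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Row type for this sketch, mirroring the Coltest class used in the text
case class Coltest(col1: String, col2: Int)

val spark = SparkSession
  .builder()
  .appName("Conversion round trip")
  .master("local[*]") // assumption: local run
  .getOrCreate()
import spark.implicits._ // required for toDF, toDS and as[...]

// Hypothetical source RDD of tuples
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))

val testDF = rdd.toDF("col1", "col2") // RDD -> DataFrame
val testDS = testDF.as[Coltest]       // DataFrame -> Dataset
val backToDF = testDS.toDF()          // Dataset -> DataFrame

val rowRdd = testDF.rdd               // DataFrame -> RDD[Row]
val objRdd = testDS.rdd               // Dataset -> RDD[Coltest]

// With RDD[Row], fields are looked up by name or index at runtime ...
val fromRow = rowRdd.map(r => r.getAs[String]("col1")).collect().sorted
// ... with RDD[Coltest], they are ordinary, compile-time checked fields
val fromObj = objRdd.map(_.col1).collect().sorted
```

The practical difference is the element type of the resulting RDD: a DataFrame yields Row objects, while a Dataset yields instances of its case class.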
3.7 User-defined functions
Users can register their own functions through spark.udf.
3.7.1 User-defined functions (UDF)
scala> val df = spark.read.json("examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> spark.udf.register("addName", (x: String) => "Name:" + x)
res5: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> df.createOrReplaceTempView("people")

scala> spark.sql("Select addName(name), age from people").show()
+-----------------+----+
|UDF:addName(name)| age|
+-----------------+----+
|     Name:Michael|null|
|        Name:Andy|  30|
|      Name:Justin|  19|
+-----------------+----+
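A registered UDF is called from SQL text. For the DSL style from section 3.3.1, the same function can instead be wrapped with org.apache.spark.sql.functions.udf. The following is a minimal sketch under assumed in-memory data; the column alias labelled_name and the local master setting are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession
  .builder()
  .appName("UDF in DSL style")
  .master("local[*]") // assumption: local run
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for people.json
val df = Seq("Michael", "Andy", "Justin").toDF("name")

// Wrap a plain Scala function as a column-level UDF
val addName = udf((x: String) => "Name:" + x)

// Use it like any other column expression
val labelled = df
  .select(addName($"name").as("labelled_name"))
  .as[String]
  .collect()
  .sorted

labelled.foreach(println)
```

A UDF created this way is only usable through the DataFrame API; to call it from spark.sql it still has to be registered with spark.udf.register.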
3.7.2 User-defined aggregate functions
Both strongly typed Datasets and weakly typed DataFrames provide built-in aggregate functions such as count(), countDistinct(), avg(), max(), and min(). Beyond these, users can define their own aggregate functions.
3.7.2.1 Weakly typed user-defined aggregate functions
A weakly typed user-defined aggregate function is implemented by extending UserDefinedAggregateFunction. The following shows a custom aggregate function that computes the average salary.
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object MyAverage extends UserDefinedAggregateFunction {
  // Data types of the input arguments of the aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of the values in the aggregation buffer
  def bufferSchema: StructType = {
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  }
  // Data type of the return value
  def dataType: DataType = DoubleType
  // Whether the same input always produces the same output
  def deterministic: Boolean = true
  // Initialize the buffer
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    // Holds the salary total
    buffer(0) = 0L
    // Holds the number of salaries
    buffer(1) = 0L
  }
  // Merge a new input row into the buffer (within the same executor)
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merge buffers coming from different executors
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Compute the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}

// Register the function
spark.udf.register("myAverage", MyAverage)

val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+
3.7.2.2 Strongly typed user-defined aggregate functions
A strongly typed custom aggregate function is implemented by extending Aggregator. This example also computes the average salary.
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession

// Because the function is strongly typed, case classes can be used
case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  // A data structure holding the salary total and the salary count, both initially 0
  def zero: Average = Average(0L, 0L)
  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Merge aggregation results coming from different executors
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Compute the output
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Encoder for the intermediate value type, which is a case class
  // Encoders.product is the encoder for Scala tuples and case classes
  def bufferEncoder: Encoder[Average] = Encoders.product
  // Encoder for the final output value
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

import spark.implicits._

val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+