Creating a Dataset

Create from an existing collection: spark.createDataset(collection)

	scala> val ds1 = spark.createDataset(1 to 10)
	ds1: org.apache.spark.sql.Dataset[Int] = [value: int]
	scala> ds1.show
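	// expected output (a sketch; exact padding may differ):
	// a single "value" column listing 1 through 10
	+-----+
	|value|
	+-----+
	|    1|
	|    2|
	|    3|
	|    4|
	|    5|
	|    6|
	|    7|
	|    8|
	|    9|
	|   10|
	+-----+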

Create from an existing RDD: spark.createDataset(RDD)

	scala> val personRDD = sc.textFile("file:///export/person.txt")
	personRDD: org.apache.spark.rdd.RDD[String] = file:///export/person.txt MapPartitionsRDD[33] at textFile at <console>:24

Tab completion in the shell lists the available creation methods:

	scala> spark.createData
	createDataFrame   createDataset

	scala> val ds2 = spark.createDataset(personRDD)
	ds2: org.apache.spark.sql.Dataset[String] = [value: string]

	scala> ds2.show
	+-------------+
	|        value|
	+-------------+
	|1 zhangsan 20|
	|    2 lisi 29|
	|  3 wangwu 25|
	| 4 zhaoliu 30|
	|  5 tianqi 35|
	|    6 kobe 40|
	+-------------+
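
Since ds2 is only a Dataset[String] of raw lines, a common next step is to map it into a typed Dataset. A minimal sketch, assuming the fields are single-space separated and using a hypothetical PersonRecord case class whose fields (id, name, age) are inferred from the sample data:

	scala> case class PersonRecord(id: Int, name: String, age: Int)
	scala> val typedDS = ds2.map { line =>
	     |   val f = line.split(" ")   // assumes single-space-separated fields
	     |   PersonRecord(f(0).toInt, f(1), f(2).toInt)
	     | }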

Create by calling toDS on a collection of case class instances

	scala> case class Person(name:String,age:Int)
	scala> val personDataList = List(Person("zhangsan",18),Person("lisi",28))
	scala> val personDS = personDataList.toDS
	personDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

	scala> personDS.show
	+--------+---+
	|    name|age|
	+--------+---+
	|zhangsan| 18|
	|    lisi| 28|
	+--------+---+
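
Note that toDS comes from spark.implicits._, which spark-shell imports automatically. In a standalone application the import must be explicit. A minimal sketch (object and app names are illustrative):

	import org.apache.spark.sql.SparkSession

	// Case classes used with encoders should be defined at the top level
	case class Person(name: String, age: Int)

	object CreateDatasetApp {
	  def main(args: Array[String]): Unit = {
	    val spark = SparkSession.builder()
	      .appName("CreateDatasetApp")
	      .master("local[*]")
	      .getOrCreate()

	    // spark-shell pre-imports these; applications must do it themselves
	    import spark.implicits._

	    val personDS = List(Person("zhangsan", 18), Person("lisi", 28)).toDS()
	    personDS.show()

	    spark.stop()
	  }
	}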

Convert a DataFrame to a Dataset with as[T]

	scala> case class Person(name:String,age:Long)
	defined class Person
	scala> val jsonDF = spark.read.json("file:///export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/examples/src/main/resources/people.json")
	jsonDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

	scala> val jsonDS = jsonDF.as[Person]
	jsonDS: org.apache.spark.sql.Dataset[Person] = [age: bigint, name: string]

	scala> jsonDS.show
	+----+-------+
	| age|   name|
	+----+-------+
	|null|Michael|
	|  30|   Andy|
	|  19| Justin|
	+----+-------+
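
The case class declares age as Long because Spark's JSON reader infers integral columns as bigint (LongType). Declaring it as Int would make the conversion fail; a quick sketch of what not to do:

	scala> case class PersonInt(name: String, age: Int)
	scala> jsonDF.as[PersonInt]   // fails: cannot up cast `age` from bigint to int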

Reposted from blog.csdn.net/weixin_44429965/article/details/107397804