Background
Test data (people.json) is shipped in Spark's official examples folder. Combined with the official introduction to Dataset usage, we can do some exercises with it. The prepared data can be downloaded here: https://download.csdn.net/download/u013560925/10342251.
The DataFrame layout after the JSON data is read is as follows:
people.json: each record holds a user's name and age
peopleScore.json: each record holds a user's name and score
!! Earlier posts used the 2.1 API; this one uses the newer 2.2 API, so pay attention! The relevant Spark and Spark SQL version information is listed below. If your versions differ, the API usage will differ as well:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.0</version>
    <scope>compile</scope>
</dependency>
The API methods covered in this exercise are: joinWith, groupBy, countDistinct, agg, sample, sort, dropDuplicates, join, mapPartitions.
A Dataset is a strongly typed collection, so operations such as as[Int] are often needed to declare the element or column type; a DataFrame does not require this.
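To make the typed/untyped contrast concrete, here is a minimal sketch. It assumes a `Person(name, age)` case class matching the shape of people.json and builds the data in memory with a local session instead of reading the downloaded file, so it is self-contained:

```scala
import org.apache.spark.sql.SparkSession

object TypedVsUntyped {
  // Assumed record shape for people.json; the real file may differ.
  case class Person(name: String, age: Long)

  def bumpAges(spark: SparkSession): Array[Long] = {
    import spark.implicits._
    // A DataFrame is just Dataset[Row]: column types are only checked at runtime.
    val df = Seq(("Andy", 30L), ("Justin", 19L)).toDF("name", "age")
    // as[Person] converts it into a strongly typed Dataset[Person].
    val ds = df.as[Person]
    // Typed access: p.age is a Long, checked at compile time.
    ds.map(p => p.age + 1).collect().sorted
  }
}
```

With a DataFrame the same logic would need `row.getAs[Long]("age")`, and a typo in the column name would only fail at runtime.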
0. Related initialization and import preparation
Initialization:
val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

// Note: the following imports must be added separately!
import spark.implicits._                 // implicit conversions (toDS, $"col", ...)
import org.apache.spark.sql.functions._  // built-in functions used inside agg
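The post uses personsDS and personScoresDS throughout but never shows how they are built. A minimal sketch follows; the case-class field names are assumptions inferred from the join condition name === n, and in-memory Seqs stand in for `spark.read.json(...)` on the downloaded files so the sketch is self-contained:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetSetup {
  // Field names are assumptions inferred from the join condition name === n.
  case class Person(name: String, age: Long)
  case class Score(n: String, score: Long)

  def build(spark: SparkSession): (Dataset[Person], Dataset[Score]) = {
    import spark.implicits._
    // The original presumably does spark.read.json("people.json").as[Person]
    // (and likewise for peopleScore.json); Seqs keep this runnable anywhere.
    val personsDS = Seq(Person("Andy", 30L), Person("Justin", 19L)).toDS()
    val personScoresDS = Seq(Score("Andy", 95L), Score("Justin", 80L)).toDS()
    (personsDS, personScoresDS)
  }
}
```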
1. joinWith
In Spark 2.1 the join condition could be written directly with string interpolation, e.g. ($"name" === $"n"); in 2.2 it must be expressed with Column objects. Note that unlike join, joinWith keeps both sides typed and returns a Dataset of pairs.
personsDS.joinWith(personScoresDS,personsDS("name")===personScoresDS("n")).show
2. groupBy+countDistinct
personsDS.groupBy("name").agg(sum("age").as[Int],countDistinct("age").as[Int],current_date().as[String]).show()
3. agg
personsDS.groupBy("name").agg(collect_list("name").as[String],collect_set("age").as[Int])
4. sample
personsDS.sample(false, 0.5).show()  // withReplacement = false, fraction = 0.5
5. sort
personsDS.sort("age").show()
6. dropDuplicates
personsDS.dropDuplicates("name").show
7. join
personsDS.join(personScoresDS,personsDS("name")===personScoresDS("n")).show
8. mapPartitions
import scala.collection.mutable.ArrayBuffer

def doubleFunc(iter: Iterator[Person]): Iterator[(String, Long)] = {
  val res = ArrayBuffer[(String, Long)]()
  while (iter.hasNext) {
    val cur = iter.next()
    res += ((cur.name, cur.age + 1000))
  }
  res.iterator
}

personsDS.mapPartitions(doubleFunc)
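Because the function passed to mapPartitions is an ordinary Iterator => Iterator transformation, its logic can be exercised without a cluster. The sketch below reproduces doubleFunc as a standalone object (the Person case class is an assumed shape for the people.json records):

```scala
import scala.collection.mutable.ArrayBuffer

object MapPartitionsSketch {
  // Assumed shape of the people.json records.
  case class Person(name: String, age: Long)

  // Same logic as doubleFunc above: consume one partition's iterator
  // and emit (name, age + 1000) pairs.
  def doubleFunc(iter: Iterator[Person]): Iterator[(String, Long)] = {
    val res = ArrayBuffer[(String, Long)]()
    while (iter.hasNext) {
      val cur = iter.next()
      res += ((cur.name, cur.age + 1000))
    }
    res.iterator
  }
}
```

Unlike map, which is called once per record, mapPartitions calls this function once per partition, which is useful when per-partition setup (e.g. opening a connection) is expensive.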
Conclusion
1. The Spark Core and Spark SQL versions matter a great deal; the same API can behave differently across versions.
2. Use the official API docs (Scaladoc/Javadoc) for the matching version to check how an API is used.
3. Methods found through a Baidu search may well fail to work because they target a different version.