Spark small practice (2): Dataset in action

Background

     The Spark official examples folder provides a test data file, people.json. Combined with the official introduction to Dataset usage, we can do some exercises. The prepared data can be downloaded here: https://download.csdn.net/download/u013560925/10342251.

     The DataFrame contents after reading each JSON file are as follows:

     people.json: a user's name and age

     peopleScore.json: a user's name and score


 !! The earlier posts used the 2.1 API; this one uses the newer 2.2 API, so pay attention! The relevant Spark Core and Spark SQL version information is below; if your versions differ, the API usage will differ as well:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.0</version>
    <scope>compile</scope>
</dependency>

Exercises

        The API methods covered in this exercise are: joinWith, groupBy, countDistinct, agg, sample, sort, dropDuplicates, join, mapPartitions.

       A Dataset is a strongly typed data collection, so operations such as as[Int] are often needed to specify data types, which a DataFrame does not require.

0. Initialization and import preparation

Initialization:

import org.apache.spark.sql.SparkSession

val warehouseLocation = "spark-warehouse" // Hive warehouse directory; adjust for your environment

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

// Note: the following two imports must be added separately!!
import spark.implicits._                // implicit conversions ($"col", .as[T], encoders)
import org.apache.spark.sql.functions._ // built-in functions used inside agg
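
The exercises below operate on two typed Datasets built from the JSON files described in the background. A minimal setup sketch (the case class fields and file paths are assumptions inferred from the join conditions used later; adjust the paths to where you unpacked the data):

case class Person(name: String, age: Long)     // people.json: name + age
case class PersonScore(n: String, score: Long) // peopleScore.json: name (as "n") + score

// Read the JSON files into strongly typed Datasets via .as[T].
val personsDS = spark.read.json("people.json").as[Person]
val personScoresDS = spark.read.json("peopleScore.json").as[PersonScore]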

1. joinWith

In Spark 2.1 the join condition could be written directly with implicit column references such as ($"name" === $"n"); in 2.2 it must be expressed through the parent Datasets' columns:

personsDS.joinWith(personScoresDS, personsDS("name") === personScoresDS("n")).show()
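
Unlike join, joinWith keeps both sides typed: the result is a Dataset[(Person, PersonScore)] whose rows can be pattern-matched as tuples. A small follow-up sketch (the score field name is an assumption from peopleScore.json):

val joined = personsDS.joinWith(personScoresDS, personsDS("name") === personScoresDS("n"))
// Each element is a (Person, PersonScore) pair, so the fields stay typed.
joined.map { case (p, s) => (p.name, p.age, s.score) }.show()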


2. groupBy+countDistinct

Note the as[Int] and as[String] conversions, and note that using built-in functions such as sum inside agg requires the org.apache.spark.sql.functions._ import from step 0.

personsDS.groupBy("name")
  .agg(sum("age").as[Int], countDistinct("age").as[Int], current_date().as[String])
  .show()
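
If you only need readable column names in the output rather than typed columns, Column.alias also works inside agg; a minimal variant of the query above:

personsDS.groupBy("name")
  .agg(sum("age").alias("total_age"), countDistinct("age").alias("distinct_ages"))
  .show()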

3. agg

collect_list gathers the values in each group into a list (keeping duplicates), while collect_set removes duplicates:

personsDS.groupBy("name")
  .agg(collect_list("name").as[String], collect_set("age").as[Int])
  .show()

4. sample

Run it three times; each run draws a different random subset of the data:

personsDS.sample(false,0.5).show()
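
To make a run reproducible, sample also takes an explicit seed as a third argument; a quick sketch:

// Same 50% fraction without replacement, but a fixed seed gives the same draw every run.
personsDS.sample(false, 0.5, 42L).show()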

5. sort

  personsDS.sort("age").show()
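
sort is ascending by default; for descending order pass a Column expression (this relies on the spark.implicits._ import from step 0):

personsDS.sort($"age".desc).show()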

6. dropDuplicates

personsDS.dropDuplicates("name").show
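
dropDuplicates can also deduplicate on several columns at once; called without arguments, it considers all columns. For example:

// Keeps a row only if its (name, age) combination has not been seen before.
personsDS.dropDuplicates("name", "age").show()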
    

7. join

personsDS.join(personScoresDS, personsDS("name") === personScoresDS("n")).show()
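
join returns an untyped DataFrame, so the needed columns are usually selected explicitly afterwards (the score column name is an assumption from peopleScore.json):

personsDS.join(personScoresDS, personsDS("name") === personScoresDS("n"))
  .select(personsDS("name"), $"age", $"score") // drop the duplicate join key "n"
  .show()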


8. mapPartitions

import scala.collection.mutable.ArrayBuffer

// Process one partition at a time: emit (name, age + 1000) for every Person.
def doubleFunc(iter: Iterator[Person]): Iterator[(String, Long)] = {
  val res = ArrayBuffer[(String, Long)]()
  while (iter.hasNext) {
    val cur = iter.next()
    res += ((cur.name, cur.age + 1000))
  }
  res.iterator
}

personsDS.mapPartitions(doubleFunc).show()
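
Since doubleFunc only transforms each element, the same result can be written more idiomatically by mapping over the iterator lazily, without buffering a whole partition in an ArrayBuffer:

personsDS.mapPartitions(iter => iter.map(p => (p.name, p.age + 1000))).show()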


Conclusion

1. The Spark Core and Spark SQL versions matter a great deal; the same API is used differently across versions.

2. Use the official Javadoc/Scaladoc to look up API usage, and make sure it matches your version.

3. Solutions found through a Baidu search are likely to be unusable simply because the versions differ.
