Spark advanced functions in practice: combineByKey

1. About the combineByKey operator

  Function: grouped counting and custom summation.

  Characteristics: operates on (key, value) pair data.

  Implementation steps:

    1. Initialize the data to be processed and apply any required conversion operations.

    2. Check whether a key is being processed for the first time; if so, apply the initializer to create its combiner, otherwise merge the value into the partition-local combiner using the custom merge logic.

    3. Merge the per-partition combiners for each key and return the final result (see the sketch after this list).
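To make the three steps concrete, here is a minimal sketch (not from the original post; the object name CombineByKeyAverage and the sample data are made up) that uses combineByKey's three callbacks to compute a per-key average:

package big.data.analyse.scala.arithmetic

import org.apache.spark.sql.SparkSession

object CombineByKeyAverage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CombineByKeyAverage").master("local[2]").getOrCreate()
    val sc = spark.sparkContext
    sc.setLogLevel("error")

    val scores = sc.parallelize(Seq(("math", 90), ("math", 80), ("english", 70)))

    val averages = scores.combineByKey(
      (v: Int) => (v, 1),                                           // step 1: initialize a (sum, count) combiner the first time a key appears
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // step 2: fold another value into the partition-local combiner
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // step 3: merge combiners coming from different partitions
    ).mapValues { case (sum, count) => sum.toDouble / count }

    averages.collect().foreach(println)
    spark.stop()
  }
}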

2. combineByKey operator code in practice

package big.data.analyse.scala.arithmetic

import org.apache.spark.sql.SparkSession

/**
  * Created by zhen on 2019/9/7.
  */
object CombineByKey {
  def main (args: Array[String]) {
    val spark = SparkSession.builder().appName("CombineByKey").master("local[2]").getOrCreate()
    val sc = spark.sparkContext
    sc.setLogLevel("error")

    val initialScores = Array((("hadoop", "R"), 1), (("hadoop", "java"), 1),
                              (("spark", "scala"), 1), (("spark", "R"), 1), (("spark", "java"), 1))

    val d1 = sc.parallelize(initialScores)

    val result = d1.map(x => (x._1._1, (x._1._2, x._2))).combineByKey(
      (v: (String, Int)) => (v: (String, Int)), // initialization: run the first time a key is encountered
      (c: (String, Int), v: (String, Int)) => (c._1 + "," + v._1, c._2 + v._2), // within-partition merge for keys seen before
      (c1: (String, Int), c2: (String, Int)) => (c1._1 + "," + c2._1, c1._2 + c2._2)) // merge combiners across partitions
      .collect()

    result.foreach(println)
  }
}
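As a side note (not in the original post): Spark's reduceByKey and aggregateByKey are built on top of combineByKey, and because the combiner here has the same type as the value, the same grouped concatenation could also be written with reduceByKey. The hypothetical fragment below assumes the same d1 RDD and would sit inside the same main method:

    val result2 = d1.map(x => (x._1._1, (x._1._2, x._2)))
      .reduceByKey((a, b) => (a._1 + "," + b._1, a._2 + b._2)) // same merge logic; no separate initializer is needed
      .collect()

    result2.foreach(println)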

3. combineByKey operator execution result
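Assuming the sample data as reconstructed above and the default slicing under local[2], the program should print one line per key, pairing the comma-joined values with their count, roughly (hadoop,(R,java,2)) and (spark,(scala,R,java,3)); the exact order of the concatenated values depends on how the data is partitioned.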

  

 


Origin www.cnblogs.com/yszd/p/11481923.html