When is Spark's join a wide dependency and when is it a narrow dependency?

Question:
Refer to the following code:
1. What do the two print statements output, is the corresponding dependency wide or narrow, and why?
2. When is a join a wide dependency and when is it a narrow dependency?


import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinDemo2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName.init).setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    val random = scala.util.Random
    val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx"))
    val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"),
      (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD"))
    val rdd1: RDD[(Int, String)] = sc.makeRDD(col1)
    val rdd2: RDD[(Int, String)] = sc.makeRDD(col2)

    // First join: neither rdd1 nor rdd2 has a partitioner yet
    val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2)
    rdd3.count()
    println(rdd3.dependencies)

    // Second join: both sides are pre-partitioned with the same HashPartitioner(3)
    val rdd4: RDD[(Int, (String, String))] = rdd1.partitionBy(new HashPartitioner(3))
      .join(rdd2.partitionBy(new HashPartitioner(3)))
    rdd4.count()
    println(rdd4.dependencies)

    Thread.sleep(10000000) // keep the application alive so the webUI can be inspected
    sc.stop()
  }
}

Answer:
1. Output of the two print statements: List(org.apache.spark.OneToOneDependency@63acf8f6) and List(org.apache.spark.OneToOneDependency@d9a498)
Corresponding dependencies:
The rdd3 join is a wide dependency; the rdd4 join is a narrow dependency.
Reasons:
1) Look at the DAG diagram in the webUI. The first join is clearly split off from the preceding stage, which shows it is a wide dependency. The second join, performed after partitionBy, is not placed in a stage of its own, which shows it is a narrow dependency. (A text-based check with toDebugString is sketched after the screenshots.)
[webUI DAG screenshot: rdd3 join]
[webUI DAG screenshot: rdd4 join]
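
The same stage split can be seen without the webUI: toDebugString prints the lineage and marks every shuffle boundary. A minimal sketch, assuming the rdd3 and rdd4 from the driver program above:

println(rdd3.toDebugString) // rdd3: its parent RDDs appear below a shuffle boundary introduced by the join itself
println(rdd4.toDebugString) // rdd4: the shuffle boundaries belong to the two partitionBy steps, not to the join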
2) Source-code analysis:
a. The no-argument join method runs first; it simply delegates to the two-argument overload with a default partitioner (an explicit equivalent is sketched after the source):

  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Performs a hash join across the cluster.
   */
  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
    join(other, defaultPartitioner(self, other))
  }
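
In other words, the call in the demo expands to an explicit two-argument join. A minimal sketch, assuming rdd1 and rdd2 from the driver program above and that the helper org.apache.spark.Partitioner.defaultPartitioner is accessible in your Spark version:

import org.apache.spark.Partitioner.defaultPartitioner

// equivalent to rdd1.join(rdd2); defaultPartitioner itself is examined in the next step
val sameAsRdd3 = rdd1.join(rdd2, defaultPartitioner(rdd1, rdd2))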

b. The default partitioner: for the first join it returns a HashPartitioner whose number of partitions equals the default parallelism (under local[*], the total number of cores); for the second join it returns the HashPartitioner(3) we set (a runtime check is sketched after the source):

  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

    val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
      Some(hasPartitioner.maxBy(_.partitions.length))
    } else {
      None
    }

    val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
      rdd.context.defaultParallelism
    } else {
      rdds.map(_.partitions.length).max
    }

    // If the existing max partitioner is an eligible one, or its partitions number is larger
    // than the default number of partitions, use the existing partitioner.
    if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
        defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
      hasMaxPartitioner.get.partitioner.get
    } else {
      new HashPartitioner(defaultNumPartitions)
    }
  }
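
Both choices can be confirmed at runtime. A minimal sketch, assuming the rdd3 and rdd4 from the driver program above (the first value depends on how many cores local[*] sees):

println(rdd3.getNumPartitions)                 // default parallelism, e.g. 8 on an 8-core machine
println(rdd4.partitioner.map(_.numPartitions)) // Some(3): the HashPartitioner(3) we set is kept through the join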

c. Next is the join overload that actually executes. The flatMapValues step in it is a narrow dependency, so if there is a wide dependency it must come from the cogroup operator (a hand-rolled equivalent is sketched after the source):

  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }
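
To make that expansion concrete, the same join can be written by hand as cogroup followed by flatMapValues. A minimal sketch, assuming rdd1 and rdd2 from the driver program above:

val manualJoin: RDD[(Int, (String, String))] =
  rdd1.cogroup(rdd2, new HashPartitioner(3)).flatMapValues { case (vs, ws) =>
    for (v <- vs.iterator; w <- ws.iterator) yield (v, w) // narrow: runs within each partition
  }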

d. Enter the cogroup method. Its core is CoGroupedRDD, built from the two RDDs to be joined and a partitioner. In the first join neither RDD has a partitioner, so both of them must be shuffled according to the partitioner that is passed in, which makes the first join a wide dependency. In the second join both RDDs have already been partitioned by that partitioner, so no further shuffle is needed and it is a narrow dependency (see the experiment sketched after the source).

  /**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
    cg.mapValues { case Array(vs, w1s) =>
      (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
    }
  }
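
The same contrast can be observed on the CoGroupedRDD in isolation. A minimal sketch, assuming rdd1 and rdd2 from the driver program above and using the @DeveloperApi constructor of CoGroupedRDD directly (internal, version-dependent API, so treat this as an experiment rather than production code):

import org.apache.spark.rdd.CoGroupedRDD

val p = new HashPartitioner(3)
val cgShuffled = new CoGroupedRDD[Int](Seq(rdd1, rdd2), p)
println(cgShuffled.dependencies) // two ShuffleDependency: neither parent is partitioned by p
val cgNarrow = new CoGroupedRDD[Int](Seq(rdd1.partitionBy(p), rdd2.partitionBy(p)), p)
println(cgNarrow.dependencies)   // two OneToOneDependency: both parents already use p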

e. Why do both print statements nevertheless show OneToOneDependency? Because rdd3 and rdd4 are the results of flatMapValues, whose direct dependency on its parent is always one-to-one; the wide-versus-narrow decision sits one level lower, in CoGroupedRDD's getDependencies, which returns OneToOneDependency(rdd) for a parent that is already partitioned by the join's partitioner and a ShuffleDependency otherwise (a helper that walks the whole chain is sketched after the source).

  override def getDependencies: Seq[Dependency[_]] = {
    rdds.map { rdd: RDD[_] =>
      if (rdd.partitioner == Some(part)) {
        logDebug("Adding one-to-one dependency with " + rdd)
        new OneToOneDependency(rdd)
      } else {
        logDebug("Adding shuffle dependency with " + rdd)
        new ShuffleDependency[K, Any, CoGroupCombiner](
          rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
      }
    }
  }
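
To see the shuffle behind the first join you therefore have to walk past the flatMapValues/mapValues layers. A minimal sketch (printDeps is a hypothetical helper, not a Spark API), assuming the rdd3 and rdd4 from the driver program above:

def printDeps(rdd: RDD[_], indent: String = ""): Unit =
  rdd.dependencies.foreach { dep =>
    println(indent + dep.getClass.getSimpleName + " -> " + dep.rdd)
    printDeps(dep.rdd, indent + "  ")
  }

printDeps(rdd3) // a ShuffleDependency shows up at the CoGroupedRDD level
printDeps(rdd4) // the CoGroupedRDD level is one-to-one; the shuffles are the earlier partitionBy steps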
2. When is a join a wide dependency and when is it a narrow dependency?
From the analysis above: if the two RDDs being joined already share the same partitioner (and therefore the same number of partitions), equal keys already live in the same partition and the join is a narrow dependency. Conversely, if the RDDs to be joined have no partitioner, or their partitioners differ, the join has to shuffle and is therefore a wide dependency. (A small sketch of the narrow-dependency pattern follows.)
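
A minimal sketch of that narrow-dependency pattern, assuming the rdd1 and rdd2 from the driver program above:

val byKey = new HashPartitioner(3)
val left  = rdd1.partitionBy(byKey).cache()  // the shuffle happens here, once
val right = rdd2.partitionBy(byKey).cache()
val narrowJoin = left.join(right)            // both parents already use byKey, so the join itself is narrow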

Origin blog.csdn.net/weixin_38813363/article/details/111868410