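Contents

Spark AE submits ShuffleQueryStageExec for execution and returns a Future
DAGScheduler collects map results and waits for the map stage to finish
AE receives and processes the stage's MapOutput information
ShuffleMapTask reads shuffle data
The MapStatus object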

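In Spark 3, adaptive execution (AE) splits the original SQL query into multiple QueryStages. After each QueryStage finishes, AE re-optimizes the query plan using the results of the QueryStages completed so far.
The most important input to this re-optimization is the stats object built from the MapStatus information produced by the shuffle write of the preceding QueryStage.
Here is a rough outline of how Spark obtains the stats of a child query stage: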

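Each map task writes its shuffle data to a shuffle file.
When a task finishes, it reports its MapStatus to the MapOutputTrackerMaster on the driver.
Once the MapStatus of every task in the map stage has been reported, the driver marks the stage as successful and returns all of the stage's MapStatus information.
Under AE, this information is filled into the ShuffleQueryStageExec. The internal method mapStats then exposes the actual physical shuffle statistics, which are used by the plan's reOptimize().
For example, when OptimizeSkewedJoin decides whether a join is skewed, it relies on the partition sizes of the join's input operators. Internally, it wraps each completed ShuffleQueryStageExec in a ShuffleStageInfo object to simplify the code.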
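
To make the role of the partition sizes concrete, here is a minimal sketch of a skew check over per-partition byte sizes. It assumes a partition counts as skewed when it is larger than the median size times a factor and also larger than an absolute threshold. The object name SkewSketch, its methods, and the parameter names are hypothetical and are not Spark's API.

```scala
object SkewSketch {
  // Median of the per-partition byte sizes, clamped to at least 1 (0 for empty input).
  def medianSize(sizes: Seq[Long]): Long = {
    if (sizes.isEmpty) 0L
    else {
      val sorted = sizes.sorted
      val n = sorted.length
      val median =
        if (n % 2 == 0) (sorted(n / 2) + sorted(n / 2 - 1)) / 2
        else sorted(n / 2)
      math.max(median, 1L)
    }
  }

  // Indices of partitions that exceed both median * factor and thresholdBytes.
  // `factor` and `thresholdBytes` are illustrative parameters.
  def skewedPartitions(sizes: Seq[Long], factor: Double, thresholdBytes: Long): Seq[Int] = {
    val median = medianSize(sizes)
    sizes.zipWithIndex.collect {
      case (size, idx) if size > median * factor && size > thresholdBytes => idx
    }
  }
}
```

With the sizes 100, 120, 110, 5000 and 90, only the partition at index 3 stands out against the median of 110, which is why accurate per-partition sizes matter for this decision.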

Spark AE submits ShuffleQueryStageExec for execution and returns a Future

A QueryStageExec in Spark AE is executed through its doMaterialize() method, which returns the result of the execution.
Internally, the stage is submitted through SparkContext.submitMapStage(), which returns a Future[MapOutputStatistics].

// ShuffleQueryStageExec
  override def doMaterialize(): Future[Any] = attachTree(this, "execute") {
    shuffle.mapOutputStatisticsFuture
  }

// ShuffleExchangeExec
  // 'mapOutputStatisticsFuture' is only needed when enable AQE.
  @transient override lazy val mapOutputStatisticsFuture: Future[MapOutputStatistics] = {
    if (inputRDD.getNumPartitions == 0) {
      Future.successful(null)
    } else {
      sparkContext.submitMapStage(shuffleDependency)
    }
  }

// SparkContext
  /**
   * Submit a map stage for execution. This is currently an internal API only, but might be
   * promoted to DeveloperApi in the future.
   */
  private[spark] def submitMapStage[K, V, C](dependency: ShuffleDependency[K, V, C])
      : SimpleFutureAction[MapOutputStatistics] = {
    assertNotStopped()
    val callSite = getCallSite()
    var result: MapOutputStatistics = null
    val waiter = dagScheduler.submitMapStage(
      dependency,
      (r: MapOutputStatistics) => { result = r },
      callSite,
      localProperties.get)
    new SimpleFutureAction[MapOutputStatistics](waiter, result)
  }
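
The callback passed to dagScheduler.submitMapStage is what carries the statistics back to the caller. As a rough illustration of that callback-to-future bridge, the sketch below swaps in a plain scala.concurrent.Promise in place of Spark's JobWaiter and SimpleFutureAction. Stats, runStage and awaitStats are hypothetical stand-ins, and the hard-coded sizes are made up.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

object MapStageFutureSketch {
  // Hypothetical stand-in for MapOutputStatistics.
  final case class Stats(shuffleId: Int, bytesByPartitionId: Seq[Long])

  // Hypothetical scheduler: "runs" the stage on another thread and
  // invokes the callback with the stage statistics when it is done.
  def runStage(shuffleId: Int, callback: Stats => Unit): Unit = {
    val stats = Stats(shuffleId, Seq(10L, 20L, 30L))
    val worker = new Thread(new Runnable {
      override def run(): Unit = callback(stats)
    })
    worker.start()
  }

  // Bridges the callback into a Future, similar in spirit to submitMapStage.
  def submitMapStage(shuffleId: Int): Future[Stats] = {
    val promise = Promise[Stats]()
    runStage(shuffleId, stats => promise.success(stats))
    promise.future
  }

  // Convenience helper for callers that want to block for the result.
  def awaitStats(shuffleId: Int): Stats =
    Await.result(submitMapStage(shuffleId), 5.seconds)
}
```

The caller only sees a Future and does not need to know which thread completes it, which is what lets AE submit several stages and react to whichever finishes first.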

DAGScheduler collects map results and waits for the map stage to finish

After a map task finishes, it saves its shuffle write information in a MapStatus object.
Two kinds of MapStatus are actually returned, CompressedMapStatus and HighlyCompressedMapStatus; the details are covered later.

completion = {CompletionEvent@18223} "CompletionEvent(ShuffleMapTask(1, 1),Success,org.apache.spark.scheduler.CompressedMapStatus@51393f72,ArrayBuffer(LongAccumulators...., value"
 task = {ShuffleMapTask@18235} "ShuffleMapTask(1, 1)"
 reason = {Success$@18247} "Success"
 result = {CompressedMapStatus@18236} 
  loc = {BlockManagerId@18254} "BlockManagerId(driver, 10.236.90.37, 62914, None)"
  compressedSizes = {byte[5]@18255} [52, 51, 51, 53, 54]
  _mapTaskId = 3
 accumUpdates = {ArrayBuffer@18248} "ArrayBuffer" size = 35
 taskMetrics = {TaskMetrics@18249} 
 metricPeaks = {long[20]@18250} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 taskInfo = {TaskInfo@18251}

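The task result is wrapped in a CompletionEvent and sent to the scheduler on the driver. Its result field (a CompressedMapStatus here) holds the output information of one map task.

If the task is a ShuffleMapTask, DAGScheduler registers its MapOutput status with the MapOutputTracker.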


  private[scheduler] def handleTaskCompletion(event: CompletionEvent): Unit = {
    val task = event.task
    val stageId = task.stageId
    // ...
    event.reason match {
      case Success =>
        task match {
          // ...
          case smt: ShuffleMapTask =>
            val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
            shuffleStage.pendingPartitions -= task.partitionId
            val status = event.result.asInstanceOf[MapStatus]
            val execId = status.location.executorId
            logDebug("ShuffleMapTask finished on " + execId)
            if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
              logInfo(s"Ignoring possibly bogus $smt completion from executor $execId")
            } else {
              // The epoch of the task is acceptable (i.e., the task was launched after the most
              // recent failure we're aware of for the executor), so mark the task's output as
              // available.
              mapOutputTracker.registerMapOutput(
                shuffleStage.shuffleDep.shuffleId, smt.partitionId, status)
            }
        }
      // ...
    }
  }

// Where MapOutputTrackerMaster stores the stats
val shuffleStatuses = new ConcurrentHashMap[Int, ShuffleStatus]().asScala
// ShuffleStatus
val mapStatuses = new Array[MapStatus](numPartitions)

// MapOutputTracker: add the map status of a task
  def registerMapOutput(shuffleId: Int, mapIndex: Int, status: MapStatus): Unit = {
    shuffleStatuses(shuffleId).addMapOutput(mapIndex, status)
  }
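
The bookkeeping above can be sketched with plain Scala collections. This is a simplified model and not the Spark implementation: Status, ShuffleStatusSketch and Tracker are hypothetical names, and the sketch only tracks how many distinct map outputs have been registered so that a stage can be checked for completeness.

```scala
import scala.collection.concurrent.TrieMap

object TrackerSketch {
  // Hypothetical stand-in for MapStatus: where the output lives and its block sizes.
  final case class Status(location: String, sizes: Seq[Long])

  // One slot per map task; a slot is filled when that task reports its output.
  final class ShuffleStatusSketch(val numMaps: Int) {
    private val statuses = new Array[Status](numMaps)
    private var available = 0

    def addMapOutput(mapIndex: Int, status: Status): Unit = synchronized {
      // Count a map output only the first time its slot is filled,
      // so a re-run of the same task does not inflate the counter.
      if (statuses(mapIndex) == null) available += 1
      statuses(mapIndex) = status
    }

    def numAvailableOutputs: Int = synchronized { available }
    def isAvailable: Boolean = numAvailableOutputs == numMaps
  }

  // Driver-side registry keyed by shuffle id.
  final class Tracker {
    private val shuffleStatuses = TrieMap.empty[Int, ShuffleStatusSketch]

    def registerShuffle(shuffleId: Int, numMaps: Int): Unit = {
      shuffleStatuses.putIfAbsent(shuffleId, new ShuffleStatusSketch(numMaps))
      ()
    }

    def registerMapOutput(shuffleId: Int, mapIndex: Int, status: Status): Unit =
      shuffleStatuses(shuffleId).addMapOutput(mapIndex, status)

    def numAvailableOutputs(shuffleId: Int): Int =
      shuffleStatuses(shuffleId).numAvailableOutputs

    def isAvailable(shuffleId: Int): Boolean =
      shuffleStatuses(shuffleId).isAvailable
  }
}
```

A stage with two map tasks becomes available only after both indices 0 and 1 have been registered; registering index 0 twice still counts as one output.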

When all tasks of a stage have finished, MapOutputTrackerMaster holds the complete ShuffleStatus for that stage.
markMapStageJobsAsFinished is then called to obtain the stage's complete MapOutput statistics and return them through the callback.

  def numAvailableOutputs: Int = mapOutputTrackerMaster.getNumAvailableOutputs(shuffleDep.shuffleId)

  def isAvailable: Boolean = numAvailableOutputs == numPartitions

  private[scheduler] def markMapStageJobsAsFinished(shuffleStage: ShuffleMapStage): Unit = {
    // Mark any map-stage jobs waiting on this stage as finished
    if (shuffleStage.isAvailable && shuffleStage.mapStageJobs.nonEmpty) {
      val stats = mapOutputTracker.getStatistics(shuffleStage.shuffleDep)
      for (job <- shuffleStage.mapStageJobs) {
        markMapStageJobAsFinished(job, stats)
        // job.listener is the JobWaiter; this call enters its callback:
        // job.listener.taskSucceeded(0, stats)
      }
    }
  }
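
The statistics handed to the callback are per-reducer totals. A minimal sketch of that aggregation follows, assuming each map output is given as a sequence of block sizes indexed by reducer id; StatsSketch and bytesByPartitionId are illustrative names here, not the signature of getStatistics.

```scala
object StatsSketch {
  // Sums, for every reduce partition, the block sizes reported by all map tasks.
  // mapOutputs(m)(r) is the size of the block map task m wrote for reducer r.
  def bytesByPartitionId(mapOutputs: Seq[Seq[Long]], numReducers: Int): Seq[Long] = {
    val totals = new Array[Long](numReducers)
    for (sizes <- mapOutputs; r <- 0 until numReducers) {
      totals(r) += sizes(r)
    }
    totals.toSeq
  }
}
```

For two map tasks reporting (1, 2, 3) and (10, 20, 30), the three reduce partitions total 11, 22 and 33 bytes.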

AE receives and processes the stage's MapOutput information

Spark AE receives the result of the previous stage and stores it in the stage's resultOption. From that point on, calling mapStats on the stage node returns the output information of all map tasks.

// AdaptiveSparkPlanExec::private def getFinalPhysicalPlan()

          // Start materialization of all new stages and fail fast if any stages failed eagerly
          reorderedNewStages.foreach { stage =>
            try {
              stage.materialize().onComplete { res =>
                if (res.isSuccess) {
                  events.offer(StageSuccess(stage, res.get))
                } else {
                  events.offer(StageFailure(stage, res.failed.get))
                }
              }(AdaptiveSparkPlanExec.executionContext)
            } catch {
              ...
            }
          }

        // Collect the stage results and set resultOption
        val nextMsg = events.take()
        val rem = new util.ArrayList[StageMaterializationEvent]()
        events.drainTo(rem)
        (Seq(nextMsg) ++ rem.asScala).foreach {
          case StageSuccess(stage, res) =>
            stage.resultOption.set(Some(res))
          case StageFailure(stage, ex) =>
            errors.append(ex)
        }
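
The take-then-drainTo pattern blocks until at least one stage has reported and then picks up everything else that is already queued, so several finished stages are handled in a single re-optimization round. Below is a self-contained sketch of the pattern; the event types are simplified, hypothetical versions of StageSuccess and StageFailure that carry a stage id and a string result.

```scala
import java.util.concurrent.LinkedBlockingQueue
import java.util.{ArrayList => JArrayList}
import scala.collection.mutable.ArrayBuffer

object EventLoopSketch {
  // Simplified, hypothetical event types.
  sealed trait Event
  final case class StageSuccess(stageId: Int, result: String) extends Event
  final case class StageFailure(stageId: Int, error: Throwable) extends Event

  // Blocks for the first event, then drains whatever else is already queued.
  // Returns the results keyed by stage id and the collected errors.
  def drainOnce(events: LinkedBlockingQueue[Event]): (Map[Int, String], Seq[Throwable]) = {
    val first = events.take()
    val rest = new JArrayList[Event]()
    events.drainTo(rest)

    val batch = ArrayBuffer[Event](first)
    val it = rest.iterator()
    while (it.hasNext) batch += it.next()

    var results = Map.empty[Int, String]
    val errors = ArrayBuffer.empty[Throwable]
    batch.foreach {
      case StageSuccess(id, res) => results += (id -> res)
      case StageFailure(_, ex)   => errors += ex
    }
    (results, errors.toList)
  }
}
```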

// ShuffleQueryStageExec
  def mapStats: Option[MapOutputStatistics] = {
    assert(resultOption.get().isDefined, s"${getClass.getSimpleName} should already be ready")
    val stats = resultOption.get().get.asInstanceOf[MapOutputStatistics]
    Option(stats)
  }

ShuffleMapTask reads shuffle data

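When an executor runs a task, it first builds a new iterator from the parent RDD's iterator and then runs the computation on it.
The compute() method of ShuffledRowRDD (the shuffle-read RDD at the start of a stage) picks a reader according to the PartitionSpec type of the current partition and then calls reader.read() to return the data iterator.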

Task:: run()
Task:: runTask()

// MapPartitionsRDD: f is the compute function of the task. The iterator it depends on must be computed first.
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))

RDD::computeOrReadCheckpoint(split, context)
RDD::compute(split, context)

// ShuffledRowRDD::compute()
  override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = {
    // ...
    val reader = split.asInstanceOf[ShuffledRowRDDPartition].spec match {
      case CoalescedPartitionSpec(startReducerIndex, endReducerIndex) =>
        SparkEnv.get.shuffleManager.getReader(
          dependency.shuffleHandle,
          startReducerIndex,
          endReducerIndex,
          context,
          sqlMetricsReporter)

      case PartialReducerPartitionSpec(reducerIndex, startMapIndex, endMapIndex, _) =>
        SparkEnv.get.shuffleManager.getReaderForRange(
          dependency.shuffleHandle,
          startMapIndex,
          endMapIndex,
          reducerIndex,
          reducerIndex + 1,
          context,
          sqlMetricsReporter)

      case PartialMapperPartitionSpec(mapIndex, startReducerIndex, endReducerIndex) =>
        SparkEnv.get.shuffleManager.getReaderForRange(
          dependency.shuffleHandle,
          mapIndex,
          mapIndex + 1,
          startReducerIndex,
          endReducerIndex,
          context,
          sqlMetricsReporter)
    }
    reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(_._2)
  }
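
Each spec type selects a different rectangle of (map, reducer) blocks. The sketch below models that selection with simplified, hypothetical case classes (Coalesced, PartialReducer, PartialMapper) and treats all end indices as exclusive, matching the reducerIndex + 1 and mapIndex + 1 arguments in the code above.

```scala
object PartitionSpecSketch {
  // Simplified, hypothetical versions of the three partition specs.
  sealed trait Spec
  final case class Coalesced(startReducer: Int, endReducer: Int) extends Spec
  final case class PartialReducer(reducer: Int, startMap: Int, endMap: Int) extends Spec
  final case class PartialMapper(map: Int, startReducer: Int, endReducer: Int) extends Spec

  // Returns the (mapIndex, reducerIndex) pairs of the blocks a task would fetch.
  // End indices are exclusive.
  def blocksFor(spec: Spec, numMaps: Int): Seq[(Int, Int)] = {
    val (maps, reducers) = spec match {
      // Several adjacent reducers, all map outputs.
      case Coalesced(start, end) => (0 until numMaps, start until end)
      // One reducer, only a slice of the map outputs (used to split a skewed partition).
      case PartialReducer(r, start, end) => (start until end, r until r + 1)
      // One map output, a range of reducers.
      case PartialMapper(m, start, end) => (m until m + 1, start until end)
    }
    for (m <- maps; r <- reducers) yield (m, r)
  }
}
```

For example, with four map tasks, PartialReducer(2, 0, 2) reads only blocks (0, 2) and (1, 2), so another task can read the remaining map outputs of reducer 2.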

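All readers used by ShuffledRowRDD are returned by the shuffleManager.
For example, SortShuffleManager first looks up the block information for the data described by the current partitionSpec and then reads the data through a BlockStoreShuffleReader.
Merging the map outputs and sorting them both happen inside BlockStoreShuffleReader.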

// SortShuffleManager
  override def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext,
      metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C] = {
    val blocksByAddress = SparkEnv.get.mapOutputTracker.getMapSizesByExecutorId(
      handle.shuffleId, startPartition, endPartition)
    new BlockStoreShuffleReader(
      handle.asInstanceOf[BaseShuffleHandle[K, _, C]], blocksByAddress, context, metrics,
      shouldBatchFetch = canUseBatchFetch(startPartition, endPartition, context))
  }

The MapStatus object

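The sections above outlined how Spark handles map task results. If a map stage has a very large number of map tasks, the objects that hold the MapOutput on the driver also become very large.
To save memory, Spark compresses MapStatus and implements two compressed variants, CompressedMapStatus and HighlyCompressedMapStatus.
The problem discussed here comes from HighlyCompressedMapStatus: for small blocks it stores only their count and avgSize, so data skew in a job cannot be detected correctly. For details, see SPARK-36967.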
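
To see why averaging hides skew, here is a simplified model of the two encodings. It assumes CompressedMapStatus stores one byte per block on a logarithmic scale with base 1.1, and that HighlyCompressedMapStatus keeps exact sizes only for blocks at or above a threshold while reporting every other non-empty block as the average of the small blocks. MapStatusSketch, highlyCompressed and accurateThreshold are illustrative names, and the real classes differ in detail.

```scala
object MapStatusSketch {
  private val LogBase = 1.1

  // One byte per block: roughly ceil(log_1.1(size)), capped at 255.
  def compressSize(size: Long): Byte = {
    if (size == 0) 0
    else if (size <= 1L) 1
    else math.min(255, math.ceil(math.log(size.toDouble) / math.log(LogBase)).toInt).toByte
  }

  // Inverse of compressSize; the result may overestimate the size by up to about 10%.
  def decompressSize(compressed: Byte): Long = {
    if (compressed == 0) 0L
    else math.pow(LogBase, compressed & 0xFF).toLong
  }

  // Simplified "highly compressed" view of one map task's block sizes:
  // empty blocks stay 0, blocks >= accurateThreshold keep their size,
  // every other block is reported as the average of the small blocks.
  def highlyCompressed(sizes: Seq[Long], accurateThreshold: Long): Seq[Long] = {
    val small = sizes.filter(s => s > 0 && s < accurateThreshold)
    val avg = if (small.nonEmpty) small.sum / small.length else 0L
    sizes.map { s =>
      if (s == 0) 0L
      else if (s >= accurateThreshold) s
      else avg
    }
  }
}
```

With block sizes 100, 100, 100 and 9000 and a threshold of 10000, all four blocks are reported as 2325, so the block that is 90 times larger than its neighbours no longer stands out. Lowering the threshold to 5000 keeps the 9000-byte block exact and reports the others as 100.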

How Spark handles shuffle MapStatus information