DAGScheduler: the process of running a Spark job

DAGScheduler: stage division, stage creation and stage submission

Starting from the entry point of a Spark job, I will carefully walk through the steps involved when a Spark job runs: dividing the DAG, creating the set of tasks, allocating resources, scheduling tasks, dispatching tasks to the executors, executing the tasks, returning task results to the driver, and so on. Using the call chain of a running job as the thread that links the spark-core infrastructure modules together, we can get a feel for how each module relates to the whole and form an overall impression of Spark's framework; only then do we break the modules apart and study each one in depth. This step-by-step approach ultimately gives a deeper and more complete grasp of spark-core. Of course, the main purpose of this article is simply to clarify the whole process of running a Spark job.

Entry point: SparkContext.runJob

We know that Spark jobs are executed lazily. The biggest advantage of lazy execution is that a chain of operators can be pipelined together into a single streaming computation; I personally think this is one of the important reasons Spark outperforms MapReduce. In fact, later optimization frameworks built on MapReduce, such as Tez and Mahout, adopted essentially the same key optimization: chain operators so they execute as a pipeline and avoid repeatedly writing intermediate results to disk. Getting back to the method at hand: as its doc comment says, it is the entry point for all action operators in Spark.
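As a quick illustration of this laziness, here is a minimal sketch (it assumes an existing SparkContext named sc): the narrow transformations map and filter are lazy and get pipelined inside a single stage; nothing runs until the action collect() calls SparkContext.runJob.

val data      = sc.parallelize(1 to 100, numSlices = 4)
val pipelined = data.map(_ * 2).filter(_ % 3 == 0)   // no job submitted yet
val result    = pipelined.collect()                  // action -> runJob -> DAGScheduler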

/**
* Run a function on a given set of partitions in an RDD and pass the results to the given
* handler function. This is the main entry point for all actions in Spark.
*
* @param rdd           target RDD to run tasks on
* @param func          a function to run on each partition of the RDD
* @param partitions    set of partitions to run on; some jobs may not want to compute on all
*                      partitions of the target RDD, e.g. for operations like `first()`
* @param resultHandler callback to pass each result to
*/

def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  // call DAGScheduler's runJob method
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  // update the console progress bar
  progressBar.foreach(_.finishAll())
  // handle checkpointing of the RDD
  rdd.doCheckpoint()
}

  • First, the closure is cleaned to remove unnecessary references. This is mainly done to make serialization possible: unnecessary references may point to objects that cannot be serialized, which would make the whole function unserializable. User code is often not written with this pitfall in mind, and Spark takes it into account to minimize the user's effort (see the sketch below).
  • Then DAGScheduler.runJob is called to perform the actual job-submission logic.

The method itself is very simple, so we will not dwell on it.
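Here is a hypothetical illustration of the serialization pitfall that closure cleaning (and careful closure writing) is meant to avoid; the class and names below are made up for the example. Driver is not serializable, and a closure that reads factor directly would capture `this` and fail to serialize; copying the field into a local val keeps the closure clean.

import org.apache.spark.rdd.RDD

class Driver {                          // not Serializable
  val factor = 10
  def scale(rdd: RDD[Int]): Array[Int] = {
    val localFactor = factor            // capture only the value, not `this`
    rdd.map(_ * localFactor).collect()
  }
}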

DAGScheduler.submitJob

After a few intermediate calls, execution eventually reaches this method.

def submitJob[T, U](
  rdd: RDD[T],
  func: (TaskContext, Iterator[T]) => U,
  partitions: Seq[Int],
  callSite: CallSite,
  resultHandler: (Int, U) => Unit,
  properties: Properties): JobWaiter[U] = {
// Check to make sure we are not launching a task on a partition that does not exist.
val maxPartitions = rdd.partitions.length
// check for illegal partition indexes
partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
  throw new IllegalArgumentException(
    "Attempting to access a non-existent partition: " + p + ". " +
      "Total number of partitions: " + maxPartitions)
}

// nextJobId is incremented by 1 each time
val jobId = nextJobId.getAndIncrement()
// if there are zero partitions to run, there is nothing to do; return a successful JobWaiter immediately
if (partitions.size == 0) {
  // Return immediately if the job is running 0 tasks
  return new JobWaiter[U](this, jobId, 0, resultHandler)
}

assert(partitions.size > 0)
val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
// post a job-submitted event to the DAGScheduler's event processing loop
eventProcessLoop.post(JobSubmitted(
  jobId, rdd, func2, partitions.toArray, callSite, waiter,
  SerializationUtils.clone(properties)))
waiter

}

The logic of this method is also very simple: it does a few checks and then posts a JobSubmitted event to the DAGScheduler's internal event processing loop. The DAGScheduler's event processor is a conventional event loop: a single thread takes events off a queue and handles them one at a time; the logic is simple, so we will not expand on it here. After the job-submitted event is posted, DAGScheduler.handleJobSubmitted is eventually called. The DAGScheduler has many other similar handler methods, one per event type; the event dispatching logic lives in DAGSchedulerEventProcessLoop.doOnReceive, which we also skip. Staying on the main line of job execution, we continue with handleJobSubmitted.
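For reference, here is a minimal sketch of the single-threaded event-loop pattern that DAGSchedulerEventProcessLoop follows; the names and shape below are illustrative, not Spark's actual API.

import java.util.concurrent.LinkedBlockingQueue

abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  private val thread = new Thread(name) {
    override def run(): Unit =
      try {
        while (true) onReceive(queue.take())   // events are handled one at a time, in order
      } catch {
        case _: InterruptedException => ()     // stop() interrupts the thread to shut it down
      }
  }
  def start(): Unit = thread.start()
  def stop(): Unit = thread.interrupt()
  def post(event: E): Unit = queue.put(event)  // e.g. post(JobSubmitted(...))
  protected def onReceive(event: E): Unit      // dispatch, e.g. to handleJobSubmitted
}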

handleJobSubmitted

private[scheduler] def handleJobSubmitted(jobId: Int,
  finalRDD: RDD[_],
  func: (TaskContext, Iterator[_]) => _,
  partitions: Array[Int],
  callSite: CallSite,
  listener: JobListener,
  properties: Properties) {
var finalStage: ResultStage = null
try {
  // New stage creation may throw an exception if, for example, jobs are run on a
  // HadoopRDD whose underlying HDFS files have been deleted.
  // create the final stage
  finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
  case e: Exception =>
    logWarning("Creating new stage failed due to exception - job: " + jobId, e)
    listener.jobFailed(e)
    return
}

// record the active job
val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
clearCacheLocs()
logInfo("Got job %s (%s) with %d output partitions".format(
  job.jobId, callSite.shortForm, partitions.length))
logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
logInfo("Parents of final stage: " + finalStage.parents)
logInfo("Missing parents: " + getMissingParentStages(finalStage))

// update some bookkeeping
val jobSubmissionTime = clock.getTimeMillis()
jobIdToActiveJob(jobId) = job
activeJobs += job
finalStage.setActiveJob(job)
val stageIds = jobIdToStageIds(jobId).toArray
val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
// post an event to the listener bus
listenerBus.post(
  SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
// submit the final stage
submitStage(finalStage)

}

  • Update some bookkeeping state (not expanded on here).

  • Create the final stage. This step is what actually divides the RDD dependency graph (the DAG) into stages according to the shuffle dependencies; the last action operator produces a ResultStage, which is then submitted.

The following sections focus on how the DAG is divided into stages and how those stages are created, which is the main responsibility of the DAGScheduler.

Stage division and creation

DAGScheduler.createResultStage

private def createResultStage(
  rdd: RDD[_],
  func: (TaskContext, Iterator[_]) => _,
  partitions: Array[Int],
  jobId: Int,
  callSite: CallSite): ResultStage = {
// first create the parent stages this stage depends on
val parents = getOrCreateParentStages(rdd, jobId)
val id = nextStageId.getAndIncrement()
// with the parent stages in place, the final stage can be created
val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
stageIdToStage(id) = stage
updateJobIdStageIdMaps(jobId, stage)
stage
}

The key step is creating the parent stages.

private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
getShuffleDependencies(rdd).map { shuffleDep =>
  getOrCreateShuffleMapStage(shuffleDep, firstJobId)
}.toList
}

getShuffleDependencies

This method does a depth-first traversal of the RDD lineage using a stack. As you can see, whenever it finds a shuffle dependency it records it and does not continue looking for further shuffle dependencies above that one. So this method only finds the shuffle dependencies immediately above the given RDD; it does not traverse across multiple levels of shuffle dependencies in the whole DAG.
private[scheduler] def getShuffleDependencies(
    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
  val parents = new HashSet[ShuffleDependency[_, _, _]]
  val visited = new HashSet[RDD[_]]
  // depth-first traversal implemented with a stack
  val waitingForVisit = new ArrayStack[RDD[_]]
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      toVisit.dependencies.foreach {
        // if this is a shuffle dependency, record it and do not keep looking
        // for shuffle dependencies above it
        case shuffleDep: ShuffleDependency[_, _, _] =>
          parents += shuffleDep
        case dependency =>
          // for a narrow dependency, keep traversing upwards
          waitingForVisit.push(dependency.rdd)
      }
    }
  }
  parents
}
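To make the traversal concrete, here is a small illustrative lineage (the input path is a placeholder): calling getShuffleDependencies on the final RDD returns only the groupByKey dependency; the reduceByKey dependency further up is not returned, because the traversal stops at the first shuffle dependency it meets on each path.

val words   = sc.textFile("input.txt").flatMap(_.split(" "))        // narrow dependencies
val counts  = words.map(w => (w, 1)).reduceByKey(_ + _)             // shuffle dependency 1
val grouped = counts.map { case (w, c) => (c, w) }.groupByKey()     // shuffle dependency 2
// getShuffleDependencies(grouped) would return only shuffle dependency 2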

getOrCreateShuffleMapStage

We continue with another important method: creating the shuffle map stage.

private def getOrCreateShuffleMapStage(
  shuffleDep: ShuffleDependency[_, _, _],
  firstJobId: Int): ShuffleMapStage = {
shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
  case Some(stage) =>
    stage

  case None =>
    // Create stages for all missing ancestor shuffle dependencies.
    // get all ancestor shuffle dependencies that do not have a stage yet
    getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
      // Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
      // that were not already in shuffleIdToMapStage, it's possible that by the time we
      // get to a particular dependency in the foreach loop, it's been added to
      // shuffleIdToMapStage by the stage creation process for an earlier dependency. See
      // SPARK-13902 for more information.
      if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
        createShuffleMapStage(dep, firstJobId)
      }
    }
    // Finally, create a stage for the given shuffle dependency.
    createShuffleMapStage(shuffleDep, firstJobId)
}
}

As you can see, this method first creates stages for all ancestor shuffle dependencies that do not have one yet, and then creates a stage for the given shuffle dependency itself.
Let's look at how a ShuffleMapStage is actually created:

def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
// as you can see here, the RDD of a ShuffleMapStage is the RDD on the input (map) side of the shuffle
val rdd = shuffleDep.rdd
val numTasks = rdd.partitions.length
// this calls the method that gets or creates the parent stages; these methods actually end up calling each other recursively
val parents = getOrCreateParentStages(rdd, jobId)
val id = nextStageId.getAndIncrement()
// a Stage is just a wrapper around a few references, the more important one being the mapOutputTracker
val stage = new ShuffleMapStage(
  id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)

// update some bookkeeping
stageIdToStage(id) = stage
shuffleIdToMapStage(shuffleDep.shuffleId) = stage
updateJobIdStageIdMaps(jobId, stage)

// register this shuffle with the map output tracker
if (!mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
  // Kind of ugly: need to register RDDs with the cache and map output tracker here
  // since we can't do it in the RDD constructor because # of partitions is unknown
  logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
  mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
}
stage
}

The key steps are:

  • Create all parent stages.
  • Wrap everything into a ShuffleMapStage object; the more important reference it holds is the mapOutputTracker. Its main role is to track the locations of the map-side output of the shuffle stage: the map output (covered in more detail later) is sorted, partitioned and serialized by the shuffleManager, stored by the blockManager and identified by a blockId, and the location information of the map output is reported back to the driver, where a MapOutputTrackerMaster component maintains the output locations of all map tasks of all stages (a rough sketch of this bookkeeping follows this list).
  • Register the newly created shuffle with the MapOutputTrackerMaster, which is essentially just adding an entry to a mapping data structure.
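The following toy classes are only a sketch of the driver-side bookkeeping described above, not Spark's actual classes: the tracker keeps, per shuffleId, one output location per map partition, filled in as map tasks finish.

import scala.collection.mutable

final case class MapOutputLoc(executorId: String, host: String)

class ToyMapOutputTrackerMaster {
  private val shuffleStatuses = mutable.Map[Int, Array[Option[MapOutputLoc]]]()

  // cf. mapOutputTracker.registerShuffle in createShuffleMapStage
  def registerShuffle(shuffleId: Int, numMaps: Int): Unit =
    shuffleStatuses(shuffleId) = Array.fill(numMaps)(Option.empty[MapOutputLoc])

  // called as each map task finishes and reports where its output blocks live
  def registerMapOutput(shuffleId: Int, mapId: Int, loc: MapOutputLoc): Unit =
    shuffleStatuses(shuffleId)(mapId) = Some(loc)

  // reduce-side tasks (and locality computation) ask for these locations later
  def locationsFor(shuffleId: Int): Seq[Option[MapOutputLoc]] =
    shuffleStatuses(shuffleId).toSeq
}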

Summary

To summarize stage creation: the methods above call one another recursively. The RDD dependency graph is traversed depth-first, and every time a shuffle dependency is encountered a ShuffleMapStage is created for it, so all upstream stages are created first and the ResultStage is created last.
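As a concrete illustration, consider the two-shuffle lineage below (a sketch: the input path is a placeholder, and the stage ids assume no stages were created earlier in the application). Calling collect() creates ShuffleMapStage 0 for the map side of the reduceByKey shuffle, then ShuffleMapStage 1 for the map side of the groupByKey shuffle, and finally ResultStage 2 for the partitions that collect() computes.

val counts  = sc.textFile("input.txt").flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
val grouped = counts.map { case (w, c) => (c, w) }.groupByKey()
grouped.collect()   // triggers createResultStage and the recursive parent-stage creation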

Stage submission

Next we look at the last step of the job-running process that the DAGScheduler is responsible for: stage submission.

submitStage

First, the submitStage method.

private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
  logDebug("submitStage(" + stage + ")")
  if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
    val missing = getMissingParentStages(stage).sortBy(_.id)
    logDebug("missing: " + missing)
    if (missing.isEmpty) {
      logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
      submitMissingTasks(stage, jobId.get)
    } else {
      for (parent <- missing) {
        submitStage(parent)
      }
      waitingStages += stage
    }
  }
} else {
  abortStage(stage, "No active job for stage " + stage.id, None)
}
}

This method is relatively simple:

  • First, submit any parent stages that have not run yet and put the current stage into the waiting queue (a minimal sketch of this pattern follows the list).

  • If all parent stages have already finished running, or there are no parent stages, submit the current stage, i.e. call submitMissingTasks.
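Here is a minimal sketch of that submission pattern, using illustrative types rather than Spark's actual API: parents with missing output are submitted first, and the child is parked in a waiting set until its parents finish.

import scala.collection.mutable

case class ToyStage(id: Int, parents: Seq[ToyStage], var done: Boolean = false)

val waitingStages = mutable.Set[ToyStage]()

def submit(stage: ToyStage): Unit = {
  val missing = stage.parents.filterNot(_.done)
  if (missing.isEmpty) {
    println(s"submitting tasks for stage ${stage.id}")   // cf. submitMissingTasks
  } else {
    missing.foreach(submit)       // cf. submitStage(parent)
    waitingStages += stage        // resubmitted when a parent stage completes
  }
}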

submitMissingTasks

private def submitMissingTasks(stage: Stage, jobId: Int) {
logDebug("submitMissingTasks(" + stage + ")")

// First figure out the indexes of partition ids to compute.
// first figure out which partitions have not been computed yet
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

// Use the scheduling pool, job group, description, etc. from an ActiveJob associated
// with this Stage
val properties = jobIdToActiveJob(jobId).properties

// update bookkeeping
runningStages += stage
// SparkListenerStageSubmitted should be posted before testing whether tasks are
// serializable. If tasks are not serializable, a SparkListenerStageCompleted event
// will be posted, which should always come after a corresponding SparkListenerStageSubmitted
// event.
// update the outputCommitCoordinator's internal bookkeeping
stage match {
  case s: ShuffleMapStage =>
    outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
  case s: ResultStage =>
    outputCommitCoordinator.stageStart(
      stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
}

// find the preferred locations of each task; for an ordinary shuffle stage they are computed via the mapOutputTracker
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
  stage match {
    case s: ShuffleMapStage =>
      partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
    case s: ResultStage =>
      partitionsToCompute.map { id =>
        val p = s.partitions(id)
        (id, getPreferredLocs(stage.rdd, p))
      }.toMap
  }
} catch {
  case NonFatal(e) =>
    stage.makeNewStageAttempt(partitionsToCompute.size)
    listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
    abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
    runningStages -= stage
    return
}

// record the stage's latest attempt
stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)

// If there are tasks to execute, record the submission time of the stage. Otherwise,
// post the event without the submission time, which indicates that this stage was
// skipped.
if (partitionsToCompute.nonEmpty) {
  stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
}
// post an event to the listener bus
listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

// TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
// Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
// the serialized copy of the RDD and for each task we will deserialize it, which means each
// task gets a different copy of the RDD. This provides stronger isolation between tasks that
// might modify state of objects referenced in their closures. This is necessary in Hadoop
// where the JobConf/Configuration object is not thread-safe.
// serialize the task; here the RDD and the ShuffleDependency (or func) objects are serialized
var taskBinary: Broadcast[Array[Byte]] = null
try {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  // For ResultTask, serialize and broadcast (rdd, func).
  val taskBinaryBytes: Array[Byte] = stage match {
    case stage: ShuffleMapStage =>
      JavaUtils.bufferToArray(
        closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
    case stage: ResultStage =>
      JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
  }

  // the serialized RDD and ShuffleDependency are shipped to the executors via a broadcast variable
  // a broadcast variable first writes the data to memory or disk through the blockManager, and the executors then fetch it remotely via RPC
  taskBinary = sc.broadcast(taskBinaryBytes)
} catch {
  // In the case of a failure during serialization, abort the stage.
  case e: NotSerializableException =>
    abortStage(stage, "Task not serializable: " + e.toString, Some(e))
    runningStages -= stage

    // Abort execution
    return
  case NonFatal(e) =>
    abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
    runningStages -= stage
    return
}

val tasks: Seq[Task[_]] = try {
  // serialize the accumulator objects used for the task's runtime metrics
  // accumulator serialization has an interesting detail in its readObject method, worth a look
  val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
  stage match {
    case stage: ShuffleMapStage =>
      stage.pendingPartitions.clear()
      // create one Task per partition
      partitionsToCompute.map { id =>
        val locs = taskIdToLocations(id)
        val part = stage.rdd.partitions(id)
        stage.pendingPartitions += id
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
          taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
          Option(sc.applicationId), sc.applicationAttemptId)
      }

    case stage: ResultStage =>
      partitionsToCompute.map { id =>
        val p: Int = stage.partitions(id)
        val part = stage.rdd.partitions(p)
        val locs = taskIdToLocations(id)
        new ResultTask(stage.id, stage.latestInfo.attemptNumber,
          taskBinary, part, locs, id, properties, serializedTaskMetrics,
          Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
      }
  }
} catch {
  case NonFatal(e) =>
    abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
    runningStages -= stage
    return
}

if (tasks.size > 0) {
  logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
    s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
  // here the DAGScheduler hands the baton over to the task scheduler
  taskScheduler.submitTasks(new TaskSet(
    tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
} else {
  // Because we posted SparkListenerStageSubmitted earlier, we should mark
  // the stage as completed here in case there are no tasks to run
  markStageAsFinished(stage, None)

  val debugString = stage match {
    case stage: ShuffleMapStage =>
      s"Stage ${stage} is actually done; " +
        s"(available: ${stage.isAvailable}," +
        s"available outputs: ${stage.numAvailableOutputs}," +
        s"partitions: ${stage.numPartitions})"
    case stage : ResultStage =>
      s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
  }
  logDebug(debugString)

  submitWaitingChildStages(stage)
}
}

This method is long, but it is arguably the most important method in the DAGScheduler's job-submission path. What it mainly does is create a set of tasks from the stage and submit them: one Task is created for each partition that needs to be computed, and all of those tasks form a TaskSet.

  • Update some bookkeeping state.
  • Find the preferred location of each task; for an ordinary shuffle stage, the preferred locations are computed via the mapOutputTracker.
  • Post a stage-submitted event to the event bus.
  • Serialize the RDD together with the ShuffleDependency (for a ShuffleMapStage) or with the computation function func (for a ResultStage) so it can be shipped to the executors via a broadcast variable (a small usage sketch of broadcast variables follows this list).
  • Serialize the accumulator objects used for task metrics; accumulator serialization has an interesting detail in its readObject method, worth a look.
  • Create one Task for each partition to be computed; depending on the stage type, these are either ShuffleMapTasks or ResultTasks.
  • Finally, call TaskScheduler.submitTasks to submit the task set.
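Since taskBinary relies on the broadcast mechanism, here is a small usage sketch of broadcast variables (assuming an existing SparkContext sc): the value is written once through the blockManager on the driver, and each executor fetches it once instead of receiving a copy inside every task closure.

val lookup   = Map("a" -> 1, "b" -> 2)
val bcLookup = sc.broadcast(lookup)
val mapped   = sc.parallelize(Seq("a", "b", "c")).map(w => bcLookup.value.getOrElse(w, 0))
mapped.collect()   // Array(1, 2, 0)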

At this point the DAGScheduler has completed its mission and successfully passed the baton to the TaskScheduler; what follows is the TaskScheduler's show.
Next we will continue by analyzing the work TaskSchedulerImpl does when submitting tasks, mainly resource allocation, which needs to take locality, blacklisting, load balance and other issues into account.

Open questions

  • How are a task's preferred locations computed?
  • What is the role of the outputCommitCoordinator?
  • What is the underlying mechanism of broadcast variables? We will analyze broadcast variables in a dedicated post later; they are in fact built on the blockManager (the block manager is arguably the most important piece of infrastructure).

Origin www.cnblogs.com/zhuge134/p/10961742.html