Task Execution Internals

I. The Task Execution Flow

    1. After receiving a LaunchTask request, the Executor wraps the task in a TaskRunner. The TaskRunner copies the resources the task needs and initializes the related environment; then, in the TaskRunner's run() method (TaskRunner implements Runnable), the task's run() method is called to execute the task:

 override def run(): Unit = {
      threadId = Thread.currentThread.getId
      Thread.currentThread.setName(threadName)
      val threadMXBean = ManagementFactory.getThreadMXBean
      val taskMemoryManager = new TaskMemoryManager(env.memoryManager, taskId)
      val deserializeStartTime = System.currentTimeMillis()
      val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
        threadMXBean.getCurrentThreadCpuTime
      } else 0L
      Thread.currentThread.setContextClassLoader(replClassLoader)
      val ser = env.closureSerializer.newInstance()
      logInfo(s"Running $taskName (TID $taskId)")
      execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
      var taskStart: Long = 0
      var taskStartCpu: Long = 0
      startGCTime = computeTotalGcTime()

      try {
        // Must be set before updateDependencies() is called, in case fetching dependencies
        // requires access to properties contained within (e.g. for access control).
        // Set the task's configuration properties first so they can be used later.
        Executor.taskDeserializationProps.set(taskDescription.properties)

        // Fetch and copy the files and JARs that this task depends on.
        updateDependencies(taskDescription.addedFiles, taskDescription.addedJars)
        
        // Deserialize the serialized task bytes back into a Task object.
        // A class loader is used here because it can dynamically load classes
        // and read resources from the current context.
        task = ser.deserialize[Task[Any]](
          taskDescription.serializedTask, Thread.currentThread.getContextClassLoader)
        task.localProperties = taskDescription.properties
        task.setTaskMemoryManager(taskMemoryManager)

        // If this task has been killed before we deserialized it, let's quit now. Otherwise,
        // continue executing the task.
        val killReason = reasonIfKilled
        if (killReason.isDefined) {
          // Throw an exception rather than returning, because returning within a try{} block
          // causes a NonLocalReturnControl exception to be thrown. The NonLocalReturnControl
          // exception will be caught by the catch block, leading to an incorrect ExceptionFailure
          // for the task.
          throw new TaskKilledException(killReason.get)
        }

        // The purpose of updating the epoch here is to invalidate executor map output status cache
        // in case FetchFailures have occurred. In local mode `env.mapOutputTracker` will be
        // MapOutputTrackerMaster and its cache invalidation is not based on epoch numbers so
        // we don't need to make any special calls here.
        if (!isLocal) {
          logDebug("Task " + taskId + "'s epoch is " + task.epoch)
          env.mapOutputTracker.asInstanceOf[MapOutputTrackerWorker].updateEpoch(task.epoch)
        }

        // Run the actual task and measure its runtime.
        taskStart = System.currentTimeMillis()
        taskStartCpu = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
          threadMXBean.getCurrentThreadCpuTime
        } else 0L
        var threwException = true
        // For a ShuffleMapTask, `value` encapsulates where the computed output was written.
        val value = try {
          // Call the task's run() method and return the result.
          val res = task.run(
            taskAttemptId = taskId,
            attemptNumber = taskDescription.attemptNumber,
            metricsSystem = env.metricsSystem)
          threwException = false
          res
        }
  ..........
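
    To make the launch path in step 1 concrete, here is a minimal, self-contained sketch (hypothetical simplified types, not the real Executor class): a LaunchTask request is turned into a Runnable TaskRunner and handed to the executor's task thread pool, mirroring the shape of Executor.launchTask.

import java.util.concurrent.{ConcurrentHashMap, Executors}

object LaunchTaskSketch {
  final case class TaskDescription(taskId: Long, name: String)

  // Stand-in for the TaskRunner shown above: run() would deserialize the task,
  // fetch its dependencies, and execute it.
  final class TaskRunner(desc: TaskDescription) extends Runnable {
    override def run(): Unit =
      println(s"Running ${desc.name} (TID ${desc.taskId})")
  }

  private val runningTasks = new ConcurrentHashMap[Long, TaskRunner]()
  private val threadPool = Executors.newCachedThreadPool()

  // Register the runner so it can be looked up later (e.g. to kill the task),
  // then execute it on the pool.
  def launchTask(desc: TaskDescription): Unit = {
    val tr = new TaskRunner(desc)
    runningTasks.put(desc.taskId, tr)
    threadPool.execute(tr)
  }

  def main(args: Array[String]): Unit = {
    launchTask(TaskDescription(0L, "task 0.0 in stage 0.0"))
    threadPool.shutdown()
  }
}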

    2. Calling the RDD's iterator() method then executes the operators we defined on the RDD partition that this task corresponds to:

final def run(
      taskAttemptId: Long,
      attemptNumber: Int,
      metricsSystem: MetricsSystem): T = {
      .......
      .......
     try {
      // Delegate to the abstract runTask() method.
      runTask(context)
    }
   .........
  }
// The abstract method is declared as follows:
  def runTask(context: TaskContext): T
// Its implementation is provided by subclasses;
// here we take ShuffleMapTask as an example.
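
    Before looking at the real ShuffleMapTask, here is a minimal sketch of this template-method structure (toy types, not Spark's real signatures): the concrete run() performs the shared setup and delegates the actual work to the abstract runTask() that each subclass implements.

object RunTaskSketch {
  case class TaskContext(taskAttemptId: Long, attemptNumber: Int)

  abstract class Task[T] {
    // Shared harness: build the context, then delegate to the subclass.
    final def run(taskAttemptId: Long, attemptNumber: Int): T = {
      val context = TaskContext(taskAttemptId, attemptNumber)
      runTask(context)
    }

    def runTask(context: TaskContext): T
  }

  // A ShuffleMapTask-like subclass: "computes" its partition and reports a status.
  class ToyShuffleMapTask(data: Seq[Int]) extends Task[String] {
    override def runTask(context: TaskContext): String =
      s"wrote ${data.map(_ * 2).size} records (attempt ${context.attemptNumber})"
  }

  def main(args: Array[String]): Unit =
    println(new ToyShuffleMapTask(Seq(1, 2, 3)).run(taskAttemptId = 0L, attemptNumber = 0))
}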

    (1) This is where the operator function we defined gets applied:

override def compute(split: Partition, context: TaskContext): Iterator[U] =
    // f is the operator/function we defined ourselves; it performs the
    // computation on the RDD partition and returns the new RDD's partition data.
    f(context, split.index, firstParent[T].iterator(split, context))
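
    For intuition about where f comes from: rdd.map(g) builds a MapPartitionsRDD whose f is essentially (context, pid, iter) => iter.map(g). A minimal, self-contained sketch (toy Partition/TaskContext types, not Spark's real classes):

object ComputeSketch {
  case class Partition(index: Int)
  case class TaskContext(stageId: Int, partitionId: Int)

  // A map-style transformation is captured as exactly this shape of f:
  // (context, partitionIndex, parentIterator) => newPartitionIterator.
  def mapF[T, U](g: T => U): (TaskContext, Int, Iterator[T]) => Iterator[U] =
    (_, _, iter) => iter.map(g)

  def main(args: Array[String]): Unit = {
    val f = mapF[Int, Int](_ * 2)
    val split = Partition(index = 0)
    // compute(split, context) boils down to f(context, split.index, parent iterator).
    val result = f(TaskContext(0, 0), split.index, Iterator(1, 2, 3))
    println(result.toList) // List(2, 4, 6)
  }
}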

    (2) The computed results are written to local disk files using the ShuffleManager's ShuffleWriter:

 override def runTask(context: TaskContext): MapStatus = {
  ..........
  ..........
   try {
      // Get the shuffle manager and obtain a writer for this task's partition.
      val manager = SparkEnv.get.shuffleManager
      writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
      // Apply the computation to the RDD's iterator; every record produced is
      // written to its corresponding shuffle output partition.
      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      // The returned MapStatus encapsulates where the computed data was written,
      // i.e. the relevant BlockManager information.
      writer.stop(success = true).get
    }
 ..........
 .........
}
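
    To illustrate the writer contract used above (write the partition's iterator, then stop(success = true) returns the MapStatus), here is a toy in-memory sketch; ToyWriter is hypothetical, not one of Spark's real disk-backed writer implementations:

object ShuffleWriteSketch {
  // Toy stand-in: the real MapStatus records the BlockManager location and the
  // size of the output written for each reduce partition.
  case class MapStatus(location: String, sizesByReduce: Vector[Long])

  trait ShuffleWriter[K, V] {
    def write(records: Iterator[Product2[K, V]]): Unit
    def stop(success: Boolean): Option[MapStatus]
  }

  // Buckets each record by reduce partition, the way a hash-based shuffle would.
  class ToyWriter[K, V](numReducers: Int) extends ShuffleWriter[K, V] {
    private val buckets = Array.fill(numReducers)(List.newBuilder[(K, V)])

    override def write(records: Iterator[Product2[K, V]]): Unit =
      records.foreach { kv =>
        buckets(Math.floorMod(kv._1.hashCode, numReducers)) += ((kv._1, kv._2))
      }

    override def stop(success: Boolean): Option[MapStatus] =
      if (success) Some(MapStatus("localhost", buckets.map(_.result().size.toLong).toVector))
      else None
  }

  def main(args: Array[String]): Unit = {
    val writer = new ToyWriter[String, Int](numReducers = 2)
    writer.write(Iterator("a" -> 1, "b" -> 2, "c" -> 3))
    println(writer.stop(success = true)) // Some(MapStatus(localhost,Vector(...)))
  }
}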

    (3) Each task's MapStatus is sent back to the DAGScheduler on the driver; the aggregated MapStatuses are registered with the MapOutputTracker, and the ResultTask finally fetches and processes the shuffle output.
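
    A toy sketch of that bookkeeping (simplified types and method names, not Spark's real MapOutputTracker API): the driver records one MapStatus per completed map task, and a reduce-side task then asks where to fetch its partition from.

object MapOutputSketch {
  case class MapStatus(location: String, sizesByReduce: Vector[Long])

  class ToyMapOutputTracker(numMaps: Int) {
    private val statuses = new Array[MapStatus](numMaps)

    // Called as each ShuffleMapTask completes.
    def registerMapOutput(mapId: Int, status: MapStatus): Unit =
      statuses(mapId) = status

    // What a reduce-side (result) task needs: where each map task wrote the
    // block for reduce partition `reduceId`, and how many bytes it holds.
    def locationsFor(reduceId: Int): Seq[(String, Long)] =
      statuses.toSeq.map(s => (s.location, s.sizesByReduce(reduceId)))
  }

  def main(args: Array[String]): Unit = {
    val tracker = new ToyMapOutputTracker(numMaps = 2)
    tracker.registerMapOutput(0, MapStatus("executor-1", Vector(10L, 20L)))
    tracker.registerMapOutput(1, MapStatus("executor-2", Vector(5L, 15L)))
    println(tracker.locationsFor(reduceId = 1)) // fetch 20 bytes from executor-1, 15 from executor-2
  }
}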

II. Execution Diagram

    In outline: the driver sends a LaunchTask request → the Executor wraps the task in a TaskRunner and runs it on its thread pool → Task.run() calls runTask() → rdd.iterator() applies the user-defined operators to the task's partition → the ShuffleWriter persists the results to local disk → the resulting MapStatus is reported back and registered with the MapOutputTracker, where the ResultTask looks up the output to fetch.


Reposted from blog.csdn.net/Milkcoffeezhu/article/details/80053897