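I. Task execution flow
1. After the executor receives a LaunchTask request, it wraps the task in a TaskRunner. The TaskRunner copies the resources the task needs and initializes the related environment; then, inside TaskRunner's run() method (TaskRunner implements Runnable), it calls the task's run() method to process the task: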
override def run(): Unit = {
  threadId = Thread.currentThread.getId
  Thread.currentThread.setName(threadName)
  val threadMXBean = ManagementFactory.getThreadMXBean
  val taskMemoryManager = new TaskMemoryManager(env.memoryManager, taskId)
  val deserializeStartTime = System.currentTimeMillis()
  val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime
  } else 0L
  Thread.currentThread.setContextClassLoader(replClassLoader)
  val ser = env.closureSerializer.newInstance()
  logInfo(s"Running $taskName (TID $taskId)")
  execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
  var taskStart: Long = 0
  var taskStartCpu: Long = 0
  startGCTime = computeTotalGcTime()

  try {
    // Must be set before updateDependencies() is called, in case fetching dependencies
    // requires access to properties contained within (e.g. for access control).
    // Makes the task's properties available so they can be used later on.
    Executor.taskDeserializationProps.set(taskDescription.properties)

    // Copy the required resources: the files and JARs the task depends on.
    updateDependencies(taskDescription.addedFiles, taskDescription.addedJars)
    // Deserialize the task itself. The context class loader is passed in because a class
    // loader can load classes dynamically and read resources from the given context.
    task = ser.deserialize[Task[Any]](
      taskDescription.serializedTask, Thread.currentThread.getContextClassLoader)
    task.localProperties = taskDescription.properties
    task.setTaskMemoryManager(taskMemoryManager)

    // If this task has been killed before we deserialized it, let's quit now. Otherwise,
    // continue executing the task.
    val killReason = reasonIfKilled
    if (killReason.isDefined) {
      // Throw an exception rather than returning, because returning within a try{} block
      // causes a NonLocalReturnControl exception to be thrown. The NonLocalReturnControl
      // exception will be caught by the catch block, leading to an incorrect ExceptionFailure
      // for the task.
      throw new TaskKilledException(killReason.get)
    }

    // The purpose of updating the epoch here is to invalidate executor map output status cache
    // in case FetchFailures have occurred. In local mode `env.mapOutputTracker` will be
    // MapOutputTrackerMaster and its cache invalidation is not based on epoch numbers so
    // we don't need to make any special calls here.
    if (!isLocal) {
      logDebug("Task " + taskId + "'s epoch is " + task.epoch)
      env.mapOutputTracker.asInstanceOf[MapOutputTrackerWorker].updateEpoch(task.epoch)
    }

    // Run the actual task and measure its runtime.
    // Time at which the task starts running.
    taskStart = System.currentTimeMillis()
    taskStartCpu = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime
    } else 0L
    var threwException = true
    // For a ShuffleMapTask, value wraps the location of the computed output data.
    val value = try {
      // Call the task's run method and return its result.
      val res = task.run(
        taskAttemptId = taskId,
        attemptNumber = taskDescription.attemptNumber,
        metricsSystem = env.metricsSystem)
      threwException = false
      res
    }
    ......
  }
  ......
}
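The wrapping described in step 1 can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyTask`, `ToyTaskRunner`, `statusLog` and `launch` are hypothetical names invented for illustration and are not Spark classes; the only real API used is the JDK's `java.util.concurrent` thread pool. It shows the general shape: a Runnable wrapper reports RUNNING, runs the task, measures the elapsed time, and records the outcome.

```scala
import java.util.concurrent.{CopyOnWriteArrayList, Executors}

// Toy model: none of these names are Spark's real classes.
object TaskRunnerSketch {
  // Hypothetical stand-in for an already deserialized task.
  trait ToyTask[T] { def run(taskId: Long): T }

  // Status updates are collected locally; a real executor reports them to the driver.
  val statusLog = new CopyOnWriteArrayList[String]()

  // Wraps a task in a Runnable so that a thread pool can execute it.
  class ToyTaskRunner[T](taskId: Long, task: ToyTask[T]) extends Runnable {
    @volatile var result: Option[T] = None
    @volatile var runTimeMs: Long = 0L

    override def run(): Unit = {
      statusLog.add(s"$taskId RUNNING")
      val taskStart = System.currentTimeMillis()
      try {
        result = Some(task.run(taskId))
        statusLog.add(s"$taskId FINISHED")
      } catch {
        case _: Exception => statusLog.add(s"$taskId FAILED")
      } finally {
        // Measure the runtime whether the task succeeded or not.
        runTimeMs = System.currentTimeMillis() - taskStart
      }
    }
  }

  // Submits the runner to a single-thread pool and waits for run() to return.
  def launch[T](taskId: Long, task: ToyTask[T]): Option[T] = {
    val pool = Executors.newFixedThreadPool(1)
    val runner = new ToyTaskRunner(taskId, task)
    try {
      pool.submit(runner).get()
      runner.result
    } finally {
      pool.shutdown()
    }
  }

  def main(args: Array[String]): Unit = {
    val sumTask = new ToyTask[Int] { def run(taskId: Long): Int = (1 to 3).sum }
    println(launch(7L, sumTask))
    println(statusLog)
  }
}
```

The real TaskRunner does much more (dependency fetching, deserialization, kill checks, metrics), so this sketch only mirrors the control flow, not the behavior.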
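2. The task's run() method calls the abstract runTask() method. Once the RDD's iterator() method is called there, the operators we defined are executed on the RDD partition that the task corresponds to: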
final def run(
    taskAttemptId: Long,
    attemptNumber: Int,
    metricsSystem: MetricsSystem): T = {
  .......
  .......
  try {
    // Call the abstract method.
    runTask(context)
  }
  .........
}

// The abstract method is declared as follows.
def runTask(context: TaskContext): T
// The abstract method relies on its subclasses for an implementation;
// here we take ShuffleMapTask as the example.
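The relationship between the final run() and the abstract runTask() is the template method pattern. A minimal sketch, assuming simplified hypothetical types (`ToyContext`, `ToyTask`, `ToyShuffleMapTask`, `ToyResultTask` are invented here and do not match Spark's signatures):

```scala
// Toy model of the template method pattern; not Spark's real Task hierarchy.
object TaskTemplateSketch {
  // Hypothetical, heavily simplified replacement for TaskContext.
  final case class ToyContext(taskAttemptId: Long, attemptNumber: Int)

  abstract class ToyTask[T] {
    // Set once run() has finished, successfully or not.
    var completed = false

    // final: subclasses cannot change the setup and cleanup around runTask().
    final def run(taskAttemptId: Long, attemptNumber: Int): T = {
      val context = ToyContext(taskAttemptId, attemptNumber)
      try {
        runTask(context)
      } finally {
        completed = true
      }
    }

    // Each concrete task type supplies its own logic here.
    def runTask(context: ToyContext): T
  }

  // Assumed behavior for illustration: returns how many records it "wrote".
  class ToyShuffleMapTask(data: Seq[Int]) extends ToyTask[Int] {
    override def runTask(context: ToyContext): Int = data.size
  }

  // Assumed behavior for illustration: returns the final aggregated value.
  class ToyResultTask(data: Seq[Int]) extends ToyTask[Int] {
    override def runTask(context: ToyContext): Int = data.sum
  }

  def main(args: Array[String]): Unit = {
    println(new ToyShuffleMapTask(Seq(1, 2, 3)).run(0L, 0))
    println(new ToyResultTask(Seq(1, 2, 3)).run(1L, 0))
  }
}
```

The caller only ever invokes run(); which runTask() executes depends on the concrete subclass.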
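(1) This is where the operator function we defined is applied:
override def compute(split: Partition, context: TaskContext): Iterator[U] =
  // f is the operator (function) we defined ourselves. It operates on the data of
  // one partition of the parent RDD and returns the data of the new RDD's partition.
  f(context, split.index, firstParent[T].iterator(split, context))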
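How chained operators end up being applied through nested iterator() calls can be sketched with a toy RDD. Everything below is a hypothetical simplification (`ToyRDD`, `ToySourceRDD`, `ToyMapPartitionsRDD` are illustrative names; partitions are plain Int indices, and there is no TaskContext, caching, or checkpointing):

```scala
// Toy model of the compute/iterator chain; not Spark's RDD API.
object ComputeChainSketch {
  abstract class ToyRDD[T] {
    // Produces the data of one partition.
    def compute(split: Int): Iterator[T]

    // In this sketch iterator() simply delegates to compute().
    final def iterator(split: Int): Iterator[T] = compute(split)

    def mapPartitions[U](f: (Int, Iterator[T]) => Iterator[U]): ToyRDD[U] =
      new ToyMapPartitionsRDD[T, U](this, f)

    def map[U](g: T => U): ToyRDD[U] =
      mapPartitions[U]((_, it) => it.map(g))

    def filter(p: T => Boolean): ToyRDD[T] =
      mapPartitions[T]((_, it) => it.filter(p))
  }

  // Leaf RDD backed by in-memory partitions.
  class ToySourceRDD[T](partitions: Vector[Seq[T]]) extends ToyRDD[T] {
    override def compute(split: Int): Iterator[T] = partitions(split).iterator
  }

  // Applies the user function f to the parent's iterator for the same partition.
  class ToyMapPartitionsRDD[T, U](parent: ToyRDD[T], f: (Int, Iterator[T]) => Iterator[U])
      extends ToyRDD[U] {
    override def compute(split: Int): Iterator[U] = f(split, parent.iterator(split))
  }

  def main(args: Array[String]): Unit = {
    val rdd = new ToySourceRDD(Vector(Seq(1, 2, 3), Seq(4, 5, 6))).map(_ * 10).filter(_ > 10)
    println(rdd.iterator(0).toList)
    println(rdd.iterator(1).toList)
  }
}
```

Because each compute() wraps the parent's iterator lazily, no data is processed until the outermost iterator is consumed, and each record flows through all operators in a single pass.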
(2) The computed result is written to local disk files by the ShuffleWriter obtained from the ShuffleManager:
override def runTask(context: TaskContext): MapStatus = {
  ..........
  ..........
  try {
    // The shuffle manager.
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    // Run the defined logic over the RDD's iterator;
    // every returned record is written to the partition it belongs to.
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    // MapStatus wraps information about the computed data, i.e. the related BlockManager information.
    writer.stop(success = true).get
  }
  ..........
  .........
}
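The write/stop sequence above can be approximated with an in-memory toy writer. This is a sketch under loud assumptions: `ToyShuffleWriter` and `ToyMapStatus` are hypothetical, records are bucketed by a simple hash-modulo rule chosen for illustration, and nothing is written to disk, whereas the real writer produces local files.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of a shuffle writer; keeps buckets in memory instead of on disk.
object ShuffleWriteSketch {
  // Hypothetical summary returned to the scheduler: where the output lives
  // and how many records each reduce partition received.
  final case class ToyMapStatus(location: String, partitionSizes: Vector[Int])

  // Maps a possibly negative hash code into the range [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val r = x % mod
    if (r < 0) r + mod else r
  }

  class ToyShuffleWriter[K, V](location: String, numReducers: Int) {
    private val buckets = Vector.fill(numReducers)(ArrayBuffer.empty[(K, V)])

    // Routes every key-value record to the bucket of its reduce partition.
    def write(records: Iterator[Product2[K, V]]): Unit =
      records.foreach { r =>
        buckets(nonNegativeMod(r._1.hashCode, numReducers)) += ((r._1, r._2))
      }

    // Returns the status on success, mirroring writer.stop(success = true).get.
    def stop(success: Boolean): Option[ToyMapStatus] =
      if (success) Some(ToyMapStatus(location, buckets.map(_.size))) else None

    def bucket(i: Int): List[(K, V)] = buckets(i).toList
  }

  def main(args: Array[String]): Unit = {
    val writer = new ToyShuffleWriter[Int, String]("executor-1", 2)
    writer.write(Iterator(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
    println(writer.stop(success = true))
  }
}
```

The point of the sketch is that the map side decides, per record, which reducer will later read it, and the returned status only describes the output rather than carrying the data itself.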
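(3) The MapStatus is sent back to the DAGScheduler. Once the MapStatus results have been collected, the DAGScheduler passes them to the MapOutputTracker, and finally the ResultTask processes the data.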
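The bookkeeping in step (3) can be pictured as a lookup table from shuffle to map outputs. The sketch below is purely illustrative: `ToyMapOutputTracker`, `registerMapOutputs` and `blocksFor` are hypothetical names for this toy, not the real MapOutputTracker interface.

```scala
import scala.collection.mutable

// Toy model of map-output bookkeeping; not Spark's MapOutputTracker.
object MapOutputTrackerSketch {
  // Hypothetical status: where one map task's output lives and the size per reduce partition.
  final case class ToyMapStatus(location: String, partitionSizes: Vector[Int])

  class ToyMapOutputTracker {
    private val statuses = mutable.Map.empty[Int, Vector[ToyMapStatus]]

    // Called once the statuses of all map tasks of a shuffle have been collected.
    def registerMapOutputs(shuffleId: Int, outputs: Vector[ToyMapStatus]): Unit =
      statuses(shuffleId) = outputs

    // For one reducer, lists (location, size) of every non-empty block it has to fetch.
    def blocksFor(shuffleId: Int, reduceId: Int): Vector[(String, Int)] =
      statuses.getOrElse(shuffleId, Vector.empty)
        .map(s => (s.location, s.partitionSizes(reduceId)))
        .filter(_._2 > 0)
  }

  def main(args: Array[String]): Unit = {
    val tracker = new ToyMapOutputTracker
    tracker.registerMapOutputs(0, Vector(
      ToyMapStatus("executor-1", Vector(2, 0)),
      ToyMapStatus("executor-2", Vector(1, 3))))
    println(tracker.blocksFor(0, 0))
    println(tracker.blocksFor(0, 1))
  }
}
```

A task in the next stage would use such a lookup to learn which executors hold the blocks for its partition before fetching them.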
II. Execution diagram