Spark Core Principles: Scheduling Algorithms
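In Spark's Standalone mode, scheduling happens at three levels of granularity. Between applications, Spark applies a conditional FIFO policy. Between jobs and stages, it offers a FIFO mode and a FAIR mode. Between tasks, scheduling is determined jointly by factors such as data locality and delay scheduling. The sections below walk through each of the three algorithms using the source code.
Between applications
In Standalone mode, the Master provides resource management and scheduling. During scheduling, the Master first launches the Drivers of the applications in the waiting list, spreading these Drivers across the cluster's Worker nodes as far as possible. It then allocates resources to the waiting applications according to the memory and CPU usage of the cluster. Allocation is first come, first served: an application that is allocated earlier takes as many qualifying resources as it can, and later applications can only pick from what remains. An application that finds no suitable resources has to wait until other applications release theirs. This is why the policy can be regarded as a conditional FIFO policy. The implementation is as follows: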
private def schedule(): Unit = {
if (state != RecoveryState.ALIVE) {
return
}
// Shuffle the Worker nodes randomly so that Drivers are spread more evenly across the cluster
val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
val numWorkersAlive = shuffledAliveWorkers.size
var curPos = 0
// Launch the Drivers in order, placing them on different Worker nodes where possible
for (driver <- waitingDrivers.toList) {
// iterate over a copy of waitingDrivers
var launched = false
var numWorkersVisited = 0
while (numWorkersVisited < numWorkersAlive && !launched) {
val worker = shuffledAliveWorkers(curPos)
numWorkersVisited += 1
if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
launchDriver(worker, driver)
waitingDrivers -= driver
launched = true
}
curPos = (curPos + 1) % numWorkersAlive
}
}
// Allocate resources, in order, to the waiting applications
startExecutorsOnWorkers()
}
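The round-robin Driver placement above can be tried out in isolation. The following is a minimal sketch under stated assumptions, not the Master's code: Worker, Driver, and place are hypothetical names introduced for illustration, and launching a Driver is modeled simply as subtracting its memory and cores from the chosen Worker.

```scala
import scala.util.Random

object DriverPlacementSketch {
  // Hypothetical stand-ins for WorkerInfo and DriverInfo (illustration only)
  final case class Worker(id: String, var memoryFree: Int, var coresFree: Int)
  final case class Driver(id: String, mem: Int, cores: Int)

  // Returns driverId -> workerId for every Driver that could be placed.
  // Note: this mutates the free resources of the Workers passed in.
  def place(workers: Seq[Worker], drivers: Seq[Driver], rng: Random): Map[String, String] = {
    val shuffled = rng.shuffle(workers.toVector)
    val n = shuffled.size
    var curPos = 0
    var placed = Map.empty[String, String]
    for (driver <- drivers) {
      var launched = false
      var visited = 0
      // Visit each Worker at most once per Driver, continuing from the last position
      while (visited < n && !launched) {
        val w = shuffled(curPos)
        visited += 1
        if (w.memoryFree >= driver.mem && w.coresFree >= driver.cores) {
          w.memoryFree -= driver.mem
          w.coresFree -= driver.cores
          placed = placed + (driver.id -> w.id)
          launched = true
        }
        curPos = (curPos + 1) % n
      }
    }
    placed
  }
}
```

Because curPos is not reset between Drivers, consecutive Drivers start their search on different Workers, which is what spreads them across the cluster. A Driver that fits nowhere is simply left out of the result, mirroring a Driver that stays in the waiting list.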
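When allocating resources to an application, the Master follows one of two Worker allocation strategies. The first spreads the application across as many Workers as possible, which makes full use of the cluster's resources and also favors data locality. The second packs the application onto as few Workers as possible, which suits CPU-intensive workloads that use little memory. The strategy is controlled by the spark.deploy.spreadOut configuration option, which defaults to true, meaning the first strategy of spreading out as much as possible. The code is as follows:
// Returns an array with the number of CPU cores to assign on each usable Worker node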
private def scheduleExecutorsOnWorkers(
app: ApplicationInfo,
usableWorkers: Array[WorkerInfo],
spreadOutApps: Boolean): Array[Int] = {
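// Number of CPU cores each Executor of the application requires
val coresPerExecutor = app.desc.coresPerExecutor
// Minimum number of CPU cores handed to an Executor per allocation: the configured
// cores per Executor if the application sets it, otherwise 1
val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
// If it is not set, the application launches only one Executor per Worker node and gives it as many resources as possible
val oneExecutorPerWorker = coresPerExecutor.isEmpty
val memoryPerExecutor = app.desc.memoryPerExecutorMB
// Number of usable Worker nodes in the cluster
val numUsable = usableWorkers.length
// CPU cores assigned on each Worker node
val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
// Number of Executors assigned on each Worker node
val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker
// CPU cores still to assign: the smaller of the cores the application needs and the cores available
var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
// Returns whether the given Worker node can launch an Executor. The conditions are:
// 1) the cores still to assign >= the minimum cores per Executor;
// 2) the Worker has enough cores: its free cores minus the cores already assigned
// on it >= the minimum cores per Executor.
// If a new Executor would be launched on the Worker, two more conditions apply:
// 1) enough memory: the Worker's free memory minus the memory already assigned on it
// >= the memory per Executor, where the assigned memory is the number of
// assigned Executors times the memory per Executor;
// 2) the Executors assigned to the application in this round plus the Executors it
// already runs < the application's Executor limit.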
def canLaunchExecutor(pos: Int): Boolean = {
val keepScheduling = coresToAssign >= minCoresPerExecutor
val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor
// If we allow multiple executors per worker, then we can always launch new executors.
// Otherwise, if there is already an executor on this worker, just give it more cores.
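// A new Executor is launched when the Worker may host multiple Executors of this application,
// or when no Executor has been assigned on it yet. Multiple Executors per Worker are allowed only
// when the cores per Executor is set; if it is not set, the Worker hosts a single Executor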
val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
if (launchingNewExecutor) {
val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
keepScheduling && enoughCores && enoughMemory && underLimit
} else {
keepScheduling && enoughCores
}
}
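// Launch Executors on the usable Worker nodes. Each allocation on a Worker hands out the minimum
// cores per Executor, and the Workers are polled repeatedly until no Worker can launch an
// Executor or the application's limit is reached. A Worker may be allocated several times: if it
// may host multiple Executors, each allocation launches a new Executor and gives it the cores;
// if it may host only one Executor, each allocation adds the cores to that Executor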
var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
while (freeWorkers.nonEmpty) {
freeWorkers.foreach {
pos =>
var keepScheduling = true
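// Allocate while the keepScheduling flag is true (first allocation, or consolidated mode)
// and the Worker still satisfies the conditions for launching an Executor
while (keepScheduling && canLaunchExecutor(pos)) {
// Each allocation hands out the minimum cores per Executor
coresToAssign -= minCoresPerExecutor
assignedCores(pos) += minCoresPerExecutor
// If the cores per Executor is not set, the Worker launches only one Executor for the
// application; otherwise each allocation launches one new Executor
if (oneExecutorPerWorker) {
assignedExecutors(pos) = 1
} else {
assignedExecutors(pos) += 1
}
// When spreading out, move on to the next Worker right after one allocation on this Worker.
// When consolidating, keep allocating on this Worker until its resources are used up.
// Because the Worker list passed in is sorted by free CPU cores in descending order,
// consolidated mode uses as few Workers as possible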
if (spreadOutApps) {
keepScheduling = false
}
}
}
// Keep only the Workers from the previous round that can still launch an Executor
freeWorkers = freeWorkers.filter(canLaunchExecutor)
}
assignedCores
}
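To see how the spreadOut flag changes the result, the core loop can be reduced to a model that tracks CPU cores only. This is a simplified sketch, assuming memory and the Executor limit never constrain the allocation; SpreadOutSketch and assignCores are hypothetical names, not part of Spark.

```scala
object SpreadOutSketch {
  // coresFree: free cores per Worker, expected in descending order as in the Master
  // coresNeeded: cores the application still wants
  // minCores: cores handed out per allocation step
  def assignCores(
      coresFree: Array[Int],
      coresNeeded: Int,
      minCores: Int,
      spreadOut: Boolean): Array[Int] = {
    val assigned = new Array[Int](coresFree.length)
    var toAssign = math.min(coresNeeded, coresFree.sum)

    // Cores-only version of canLaunchExecutor (memory and limits are ignored here)
    def canLaunch(pos: Int): Boolean =
      toAssign >= minCores && coresFree(pos) - assigned(pos) >= minCores

    var free = coresFree.indices.filter(canLaunch)
    while (free.nonEmpty) {
      free.foreach { pos =>
        var keepScheduling = true
        while (keepScheduling && canLaunch(pos)) {
          toAssign -= minCores
          assigned(pos) += minCores
          // Spread out: one step per Worker per round; otherwise drain this Worker first
          if (spreadOut) keepScheduling = false
        }
      }
      free = free.filter(canLaunch)
    }
    assigned
  }

  def main(args: Array[String]): Unit = {
    println(assignCores(Array(4, 4, 4), 6, 1, spreadOut = true).mkString(","))
    println(assignCores(Array(4, 4, 4), 6, 1, spreadOut = false).mkString(","))
  }
}
```

With three Workers of 4 free cores each and 6 cores requested, spreading out assigns 2 cores on every Worker, while consolidating fills the first Worker with 4 cores and puts the remaining 2 on the second.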
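Between jobs and stages
When a Spark application is submitted, Spark builds a directed acyclic graph (DAG) from the RDD dependencies and hands it to the DAGScheduler, which divides it into jobs and stages. These jobs do not depend on one another. For scheduling among multiple jobs, Spark offers two policies: FIFO, which is currently the default, and FAIR, in which two configuration parameters decide how jobs are prioritized: minShare (the minimum number of tasks) and weight (the weight of the tasks). The scheduling process and code are as follows:
1. Creating the scheduling pools
The TaskSchedulerImpl.initialize method first creates the root scheduling pool, rootPool, and then creates a schedulable builder according to the configured scheduling mode, instantiating either FIFOSchedulableBuilder or FairSchedulableBuilder. Finally, the builder's buildPools method creates the scheduling pools under rootPool. The code is as follows: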
def initialize(backend: SchedulerBackend) {
this.backend = backend
schedulableBuilder = {
// Configure the scheduling pools according to the scheduling mode
schedulingMode match {
// Use FIFO scheduling
case SchedulingMode.FIFO =>
new FIFOSchedulableBuilder(rootPool)
// Use FAIR scheduling
case SchedulingMode.FAIR =>
new FairSchedulableBuilder(rootPool, conf)
case _ =>
throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
s"$schedulingMode")
}
}
schedulableBuilder.buildPools()
}
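2. Adding task sets to the scheduling pool
A stage is first split into a task set. In the TaskSchedulerImpl.submitTasks method, the task set is handed to a TaskSetManager, and that manager is then added to the scheduling pool, where it waits to be allocated resources and run.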
override def submitTasks(taskSet: TaskSet) {
val tasks = taskSet.tasks
logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
this.synchronized {
// Create the manager for this task set, which manages the task set's lifecycle
val manager = createTaskSetManager(taskSet, maxTaskFailures)
val stage = taskSet.stageId
val stageTaskSets =
taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
stageTaskSets.foreach {
case (_, ts) =>
ts.isZombie = true
}
stageTaskSets(taskSet.stageAttemptId) = manager
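// Add the task set manager to the scheduling pool, where the system schedules it centrally.
// The scheduler works at the application level and supports FIFO and FAIR (fair scheduling)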
schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
...
}
...
}
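3. Providing sorted task set managers
When TaskSchedulerImpl.resourceOffers allocates resources, it fetches the sorted task set managers from the root scheduling pool, rootPool. The ordering is provided by the comparator method of the two scheduling algorithms, FIFOSchedulingAlgorithm and FairSchedulingAlgorithm. The code is as follows: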
def resourceOffers(offers: IndexedSeq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
...
// Get the TaskSetManagers sorted according to the scheduling policy
val sortedTaskSets = rootPool.getSortedTaskSetQueue
...
}
(1) The FIFO scheduling policy is implemented as follows:
private[spark] class FIFOSchedulingAlgorithm extends SchedulingAlgorithm {
override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
// Get the job priority, which is actually the job ID
val priority1 = s1.priority
val priority2 = s2.priority
var res = math.signum(priority1 - priority2)
// For the same job, compare the stage IDs next
if (res == 0) {
val stageId1 = s1.stageId
val stageId2 = s2.stageId
res = math.signum(stageId1 - stageId2)
}
res < 0
}
}
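The effect of this comparator is easiest to see on a small list. The sketch below is an illustration only: TaskSetInfo, fifoLessThan, and sortFifo are hypothetical names, and a task set is reduced to its job ID and stage ID.

```scala
object FifoOrderSketch {
  // Hypothetical stand-in for a schedulable: priority is the job ID
  final case class TaskSetInfo(jobId: Int, stageId: Int)

  // Lower job ID first; within the same job, lower stage ID first
  def fifoLessThan(a: TaskSetInfo, b: TaskSetInfo): Boolean =
    if (a.jobId != b.jobId) a.jobId < b.jobId else a.stageId < b.stageId

  def sortFifo(sets: Seq[TaskSetInfo]): Seq[TaskSetInfo] = sets.sortWith(fifoLessThan)
}
```

For example, task sets submitted as (job 2, stage 5), (job 1, stage 7), (job 1, stage 3) are ordered as (1, 3), (1, 7), (2, 5): the earlier job always runs first, and within a job the earlier stage goes first.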
(2) The FAIR scheduling policy is implemented as follows:
private[spark] class FairSchedulingAlgorithm extends SchedulingAlgorithm {
// Compares two schedulables: returns true if the first has higher priority, false if the second does
override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
// Minimum number of tasks
val minShare1 = s1.minShare
val minShare2 = s2.minShare
// Number of running tasks
val runningTasks1 = s1.runningTasks
val runningTasks2 = s2.runningTasks
// Whether the schedulable is starving: it runs fewer tasks than its minimum share
val s1Needy = runningTasks1 < minShare1
val s2Needy = runningTasks2 < minShare2
// Min-share ratio: running tasks / minimum number of tasks
val minShareRatio1 = runningTasks1.toDouble / math.max(minShare1, 1.0)
val minShareRatio2 = runningTasks2.toDouble / math.max(minShare2, 1.0)
// Weight ratio: running tasks / weight
val taskToWeightRatio1 = runningTasks1.toDouble / s1.weight.toDouble
val taskToWeightRatio2 = runningTasks2.toDouble / s2.weight.toDouble
var compare = 0
// Decide the order
if (s1Needy && !s2Needy) {
return true
} else if (!s1Needy && s2Needy) {
return false
} else if (s1Needy && s2Needy) {
compare = minShareRatio1.compareTo(minShareRatio2)
} else {
compare = taskToWeightRatio1.compareTo(taskToWeightRatio2)
}
if (compare < 0) {
true
} else if (compare > 0) {
false
} else {
s1.name < s2.name
}
}
}
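A worked example helps to see the three rules of the fair comparator in action: starving schedulables go first, starving ones are ordered by min-share ratio, and the rest by weight ratio. The following is a minimal sketch, assuming a hypothetical Pool case class with made-up numbers; it follows the logic of the comparator above but is not Spark's class.

```scala
object FairOrderSketch {
  // Hypothetical stand-in for a schedulable pool (illustration only)
  final case class Pool(name: String, minShare: Int, weight: Int, runningTasks: Int)

  def fairLessThan(a: Pool, b: Pool): Boolean = {
    val aNeedy = a.runningTasks < a.minShare
    val bNeedy = b.runningTasks < b.minShare
    val aShare = a.runningTasks.toDouble / math.max(a.minShare, 1)
    val bShare = b.runningTasks.toDouble / math.max(b.minShare, 1)
    val aWeighted = a.runningTasks.toDouble / a.weight
    val bWeighted = b.runningTasks.toDouble / b.weight
    val cmp =
      if (aNeedy && !bNeedy) -1                                   // only a is starving
      else if (!aNeedy && bNeedy) 1                               // only b is starving
      else if (aNeedy && bNeedy) java.lang.Double.compare(aShare, bShare)
      else java.lang.Double.compare(aWeighted, bWeighted)
    // Ties are broken by name so the order is deterministic
    if (cmp != 0) cmp < 0 else a.name < b.name
  }

  def main(args: Array[String]): Unit = {
    val pools = Seq(
      Pool("A", minShare = 2, weight = 1, runningTasks = 1), // starving, ratio 0.5
      Pool("B", minShare = 0, weight = 2, runningTasks = 4), // weight ratio 2.0
      Pool("C", minShare = 0, weight = 1, runningTasks = 1), // weight ratio 1.0
      Pool("D", minShare = 4, weight = 1, runningTasks = 1)) // starving, ratio 0.25
    println(pools.sortWith(fairLessThan).map(_.name).mkString(" "))
  }
}
```

Pools D and A are below their minimum share, so they come first, with D ahead because its min-share ratio (0.25) is lower than A's (0.5). C and B are not starving and are ordered by weight ratio, giving the final order D, A, C, B.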
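Between tasks
Before describing the task scheduling algorithm, this section first introduces two concepts: data locality and delay scheduling.
1. Data locality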
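Computation should run on the node that holds the data wherever possible, which reduces the data transferred over the network and therefore the cost of moving data. If the data is already in the memory of the node running the task, disk I/O is reduced as well. In Spark, the data locality levels from highest to lowest priority are PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, and ANY: ideally the data is in memory where the task runs, next best is the same node (the same machine), then the same rack, and finally any location. The data locality of a task is determined as follows: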
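- If the task belongs to the first stage of a job, the RDD partitions for these tasks all have preferred locations, which are also the preferred locations of the tasks; the data locality is NODE_LOCAL.
- If the task belongs to a later stage of a job, its preferred location can be derived from where the parent stage ran. In this case, if the executor is alive, the data locality is PROCESS_LOCAL; if the executor is not alive but the output of the parent stage exists, the data locality is NODE_LOCAL.
- If there is no preferred location, the data locality is NO_PREF.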
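The priority order of the locality levels can be modeled as an ordered enumeration. This is a small sketch rather than Spark's definition: LocalitySketch is a hypothetical name, and the five-level order, including where NO_PREF sits, follows the order listed in the resourceOffers code quoted in this article.

```scala
object LocalitySketch extends Enumeration {
  // Declaration order defines priority: earlier values are "more local"
  val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value

  // A task at level `condition` may run under the limit `constraint`
  // if it is at least as local as the limit
  def isAllowed(constraint: Value, condition: Value): Boolean = condition <= constraint
}
```

Under this ordering, a PROCESS_LOCAL task is allowed when the limit is NODE_LOCAL, but a RACK_LOCAL task is not, because it is less local than the limit.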
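2. Delay scheduling
When assigning a task to a node, the scheduler first checks whether the task's best node is free. If that node does not have enough resources to run the task, the task waits for a period of time. If the node frees enough resources within the wait, the task runs there; if the resources are still insufficient, the scheduler falls back to the next-best node. This lets tasks run on nodes with a higher data locality level, which reduces disk I/O and network transfer.
- The principle behind Spark's task assignment is to run tasks on nodes with a high data locality level, even if that means waiting for a while.
3. The task scheduling algorithm
TaskSetManager is the core object in task assignment. During its initialization, the addPendingTask method uses each task's preferred locations to populate four lists: pendingTasksForExecutor, pendingTasksForHost, pendingTasksForRack, and pendingTasksWithNoPrefs. From these four lists, the computeValidLocalityLevels method derives the list of data locality levels for the task set, which are matched against the available Worker nodes from the highest level to the lowest. Before matching, getAllowedLocalityLevel returns the data locality level the task set is currently allowed to use; this level is compared with the specified level, and the one with the higher priority is taken. Finally, the scheduler checks whether the given Worker node has a task to run at the resulting level; if so, it returns the task together with its data locality and updates the related bookkeeping. The code is as follows: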
private[spark] def addPendingTask(index: Int) {
for (loc <- tasks(index).preferredLocations) {
loc match {
case e: ExecutorCacheTaskLocation =>
pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer) += index
case e: HDFSCacheTaskLocation =>
val exe = sched.getExecutorsAliveOnHost(loc.host)
exe match {
case Some(set) =>
for (e <- set) {
pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer) += index
}
logInfo(s"Pending task $index has a cached location at ${e.host} " +
", where there are executors " + set.mkString(","))
case None => logDebug(s"Pending task $index has a cached location at ${e.host} " +
", but there are no executors alive there.")
}
case _ =>
}
pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer) += index
for (rack <- sched.getRackForHost(loc.host)) {
pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer) += index
}
}
if (tasks(index).preferredLocations == Nil) {
pendingTasksWithNoPrefs += index
}
allPendingTasks += index // No point scanning this whole list to find the old task there
}
private def computeValidLocalityLevels(): Array[TaskLocality.TaskLocality] = {
import TaskLocality.{
PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY}
val levels = new ArrayBuffer[TaskLocality.TaskLocality]
if (!pendingTasksForExecutor.isEmpty &&
pendingTasksForExecutor.keySet.exists(sched.isExecutorAlive(_))) {
levels += PROCESS_LOCAL
}
if (!pendingTasksForHost.isEmpty &&
pendingTasksForHost.keySet.exists(sched.hasExecutorsAliveOnHost(_))) {
levels += NODE_LOCAL
}
if (!pendingTasksWithNoPrefs.isEmpty) {
levels += NO_PREF
}
if (!pendingTasksForRack.isEmpty &&
pendingTasksForRack.keySet.exists(sched.hasHostAliveOnRack(_))) {
levels += RACK_LOCAL
}
levels += ANY
logDebug("Valid locality levels for " + taskSet + ": " + levels.mkString(", "))
levels.toArray
}
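The rule behind computeValidLocalityLevels is that a level is only worth trying if some pending task prefers a location that is currently alive. The sketch below restates that rule with plain collections; ValidLevelsSketch, validLevels, and the parameter names are hypothetical, and levels are represented as strings for brevity.

```scala
object ValidLevelsSketch {
  // Each pendingBy* map goes from a location (executor id, host, rack)
  // to the indices of pending tasks that prefer that location
  def validLevels(
      pendingByExecutor: Map[String, Seq[Int]],
      pendingByHost: Map[String, Seq[Int]],
      pendingNoPrefs: Seq[Int],
      pendingByRack: Map[String, Seq[Int]],
      aliveExecutors: Set[String],
      aliveHosts: Set[String],
      aliveRacks: Set[String]): Vector[String] = {
    val levels = Vector.newBuilder[String]
    // A level is valid only if at least one preferred location is alive
    if (pendingByExecutor.keySet.exists(aliveExecutors)) levels += "PROCESS_LOCAL"
    if (pendingByHost.keySet.exists(aliveHosts)) levels += "NODE_LOCAL"
    if (pendingNoPrefs.nonEmpty) levels += "NO_PREF"
    if (pendingByRack.keySet.exists(aliveRacks)) levels += "RACK_LOCAL"
    levels += "ANY" // ANY is always valid
    levels.result()
  }
}
```

For instance, if a task prefers executor e1 on host h1, but only a different executor is alive on h1, PROCESS_LOCAL is skipped and the valid levels are NODE_LOCAL and ANY.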
The resourceOffers method is as follows:
def resourceOffers(offers: IndexedSeq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
...
// Shuffle the offers so that tasks are not always placed on the same Workers
val shuffledOffers = shuffleOffers(filteredOffers)
// Build a list of tasks to assign to each worker.
// Holds the tasks that have been assigned resources
val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores / CPUS_PER_TASK))
val availableCpus = shuffledOffers.map(o => o.cores).toArray
val availableSlots = shuffledOffers.map(o => o.cores / CPUS_PER_TASK).sum
// Get the TaskSetManagers sorted according to the scheduling policy
val sortedTaskSets = rootPool.getSortedTaskSetQueue
// If a new Executor has joined, the data locality levels need to be recomputed
for (taskSet <- sortedTaskSets) {
logDebug("parentName: %s, name: %s, runningTasks: %s".format(
taskSet.parent.name, taskSet.name, taskSet.runningTasks))
...
} else {
// Allocate resources to the sorted TaskSetManagers following the nearest-first principle,
// in the order PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
var launchedAnyTask = false
// Record all the executor IDs assigned barrier tasks on.
val addressesWithDescs = ArrayBuffer[(String, TaskDescription)]()
for (currentMaxLocality <- taskSet.myLocalityLevels) {
var launchedTaskAtCurrentMaxLocality = false
do {
launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(taskSet,
currentMaxLocality, shuffledOffers, availableCpus, tasks, addressesWithDescs)
launchedAnyTask |= launchedTaskAtCurrentMaxLocality
} while (launchedTaskAtCurrentMaxLocality)
}
if (!launchedAnyTask) {
taskSet.getCompletelyBlacklistedTaskIfAny(hostToExecutors).foreach {
taskIndex =>
executorIdToRunningTaskIds.find(x => !isExecutorBusy(x._1)) match {
case Some ((executorId, _)) =>
if (!unschedulableTaskSetToExpiryTime.contains(taskSet)) {
blacklistTrackerOpt.foreach(blt => blt.killBlacklistedIdleExecutor(executorId))
val timeout = conf.get(config.UNSCHEDULABLE_TASKSET_TIMEOUT) * 1000
unschedulableTaskSetToExpiryTime(taskSet) = clock.getTimeMillis() + timeout
logInfo(s"Waiting for $timeout ms for completely "
+ s"blacklisted task to be schedulable again before aborting $taskSet.")
abortTimer.schedule(
createUnschedulableTaskSetAbortTimer(taskSet, taskIndex), timeout)
}
case None => // Abort Immediately
logInfo("Cannot schedule any task because of complete blacklisting. No idle" +
s" executors can be found to kill. Aborting $taskSet." )
taskSet.abortSinceCompletelyBlacklisted(taskIndex)
}
}
} else {
...
s"stage ${taskSet.stageId}.")
}
}
}
// TODO SPARK-24823 Cancel a job that contains barrier stage(s) if the barrier tasks don't get
// launched within a configured time.
if (tasks.size > 0) {
hasLaunchedTask = true
}
return tasks
}
Scheduling for a single task set is implemented by the TaskSchedulerImpl.resourceOfferSingleTaskSet method. The code is as follows:
private def resourceOfferSingleTaskSet(
taskSet: TaskSetManager,
maxLocality: TaskLocality,
shuffledOffers: Seq[WorkerOffer],
availableCpus: Array[Int],
tasks: IndexedSeq[ArrayBuffer[TaskDescription]],
addressesWithDescs: ArrayBuffer[(String, TaskDescription)]) : Boolean = {
// Iterate over all Workers and assign tasks to each one
var launchedTask = false
for (i <- 0 until shuffledOffers.size) {
val execId = shuffledOffers(i).executorId
val host = shuffledOffers(i).host
// The Worker has enough CPU cores to run one task
if (availableCpus(i) >= CPUS_PER_TASK) {
try {
// Assign a task to the given Executor, then update the related maps and decrement the available CPUs
for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
tasks(i) += task
val tid = task.taskId
taskIdToTaskSetManager.put(tid, taskSet)
taskIdToExecutorId(tid) = execId
executorIdToRunningTaskIds(execId).add(tid)
availableCpus(i) -= CPUS_PER_TASK
assert(availableCpus(i) >= 0)
if (taskSet.isBarrier) {
addressesWithDescs += (shuffledOffers(i).address.get -> task)
}
launchedTask = true
}
} catch {
case e: TaskNotSerializableException =>
logError(s"Resource offer failed, task set ${taskSet.name} was not serializable")
return launchedTask
}
}
}
return launchedTask
}
Assigning a task to an Executor on a given Worker is done by the TaskSetManager.resourceOffer method. The code is as follows:
def resourceOffer(
execId: String,
host: String,
maxLocality: TaskLocality.TaskLocality)
: Option[TaskDescription] =
{
val offerBlacklisted = taskSetBlacklistHelperOpt.exists {
blacklist =>
blacklist.isNodeBlacklistedForTaskSet(host) ||
blacklist.isExecutorBlacklistedForTaskSet(execId)
}
if (!isZombie && !offerBlacklisted) {
val curTime = clock.getTimeMillis()
var allowedLocality = maxLocality
// If the offer has a locality constraint
if (maxLocality != TaskLocality.NO_PREF) {
// Get the locality level the task set is currently allowed to use; the result of getAllowedLocalityLevel changes over time
allowedLocality = getAllowedLocalityLevel(curTime)
// If the allowed level is less local than maxLocality, cap it at maxLocality
if (allowedLocality > maxLocality) {
// We're not allowed to search for farther-away tasks
allowedLocality = maxLocality
}
}
dequeueTask(execId, host, allowedLocality).map {
case ((index, taskLocality, speculative)) =>
...
// Update the bookkeeping and serialize the task
val serializedTask: ByteBuffer = try {
ser.serialize(task)
} catch {
...
}
if (serializedTask.limit() > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
!emittedTaskSizeWarning) {
emittedTaskSizeWarning = true
logWarning(s"Stage ${task.stageId} contains a task of very large size " +
s"(${serializedTask.limit() / 1024} KB). The maximum recommended task size is " +
s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
}
// Add the task to the list of running tasks
addRunningTask(taskId)
...
}
} else {
None
}
}
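Finally, the TaskSetManager.getAllowedLocalityLevel method returns the data locality level the task set is currently allowed to use. Starting from the level returned last time, the method checks the levels from highest to lowest priority for tasks that still need to run. If a level has no such tasks, the method moves to the next level without waiting. If a level has tasks to run but the wait has exceeded the delay configured for that level, the method also moves to the next level. Otherwise, it returns the current level. The code is as follows: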
private def getAllowedLocalityLevel(curTime: Long): TaskLocality.TaskLocality = {
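// Check whether the given tasks are among the running tasks (copiesRunning) or the successful
// tasks (successful). Tasks in neither still need to run; tasks found there are removed from the four pending lists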
def tasksNeedToBeScheduledFrom(pendingTaskIds: ArrayBuffer[Int]): Boolean = {
...
}
def moreTasksToRunIn(pendingTasks: HashMap[String, ArrayBuffer[Int]]): Boolean = {
...
}
while (currentLocalityIndex < myLocalityLevels.length - 1) {
// Check whether the current data locality level still has tasks to run
val moreTasks = myLocalityLevels(currentLocalityIndex) match {
case TaskLocality.PROCESS_LOCAL => moreTasksToRunIn(pendingTasksForExecutor)
case TaskLocality.NODE_LOCAL => moreTasksToRunIn(pendingTasksForHost)
case TaskLocality.NO_PREF => pendingTasksWithNoPrefs.nonEmpty
case TaskLocality.RACK_LOCAL => moreTasksToRunIn(pendingTasksForRack)
}
if (!moreTasks) {
// No tasks to run at this level, so move to the next data locality level
lastLaunchTime = curTime
logDebug(s"No tasks for locality level ${myLocalityLevels(currentLocalityIndex)}, " +
s"so moving to locality level ${myLocalityLevels(currentLocalityIndex + 1)}")
currentLocalityIndex += 1
} else if (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex)) {
// There are tasks to run, but the wait exceeds the delay configured for this
// data locality level, so also move to the next level
lastLaunchTime += localityWaits(currentLocalityIndex)
logDebug(s"Moving to ${myLocalityLevels(currentLocalityIndex + 1)} after waiting for " +
s"${localityWaits(currentLocalityIndex)}ms")
currentLocalityIndex += 1
} else {
// Return the data locality level that satisfies the conditions
return myLocalityLevels(currentLocalityIndex)
}
}
myLocalityLevels(currentLocalityIndex)
}
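The time-based fallback in getAllowedLocalityLevel can be simulated without a cluster. The following is a minimal sketch under stated assumptions: LocalityClock and its parameters are hypothetical names, levels are plain strings, and hasPending is a caller-supplied function that says whether a level still has tasks to run.

```scala
final class LocalityClock(
    levels: Vector[String],         // valid levels, most local first
    waitsMs: Vector[Long],          // wait per level, same indexing as levels
    hasPending: String => Boolean,  // does this level still have tasks to run?
    startTime: Long) {

  private var index = 0
  private var lastLaunchTime = startTime

  // Returns the least local level that is currently allowed at time `now`
  def allowedLevel(now: Long): String = {
    while (index < levels.length - 1) {
      if (!hasPending(levels(index))) {
        // Nothing to run here: skip the level without waiting
        lastLaunchTime = now
        index += 1
      } else if (now - lastLaunchTime >= waitsMs(index)) {
        // Waited long enough: charge this level's wait and relax locality
        lastLaunchTime += waitsMs(index)
        index += 1
      } else {
        return levels(index)
      }
    }
    levels(index)
  }
}

object LocalityClockDemo {
  def main(args: Array[String]): Unit = {
    val clock = new LocalityClock(
      Vector("PROCESS_LOCAL", "NODE_LOCAL", "ANY"), Vector(3000L, 3000L, 0L), _ => true, 0L)
    println(clock.allowedLevel(1000L)) // still within the first wait
    println(clock.allowedLevel(3500L)) // first wait expired
    println(clock.allowedLevel(7000L)) // second wait expired
  }
}
```

With a 3000 ms wait per level and tasks pending at every level, the allowed level is PROCESS_LOCAL at 1000 ms, relaxes to NODE_LOCAL at 3500 ms, and reaches ANY at 7000 ms. The sketch leaves out the part of the real method where launching a task resets lastLaunchTime, so here the level only ever relaxes.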