Notes on the YARN Source Code (Part 5)

The previous post stopped abruptly at the scheduling of the ApplicationMaster. This post picks up from there.

private static final class AddApplicationToSchedulerTransition extends RMAppTransition {
		@Override
		public void transition(RMAppImpl app, RMAppEvent event) {
			app.handler.handle(new AppAddedSchedulerEvent(app.applicationId, app.submissionContext.getQueue(), app.user,
					app.submissionContext.getReservationID()));
		}
	}

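The previous post ended on the code above. The app.handler in it is actually the event handler of rmDispatcher, on which the following registration was made: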

			rmDispatcher.register(SchedulerEventType.class, schedulerDispatcher);
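
To make the registration above more concrete, here is a minimal sketch of routing events by the class of their type enum. All names in it (DispatcherSketch, SimpleEvent, AppEventType and so on) are hypothetical stand-ins written for illustration; it is not the real YARN dispatcher, which is also asynchronous.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins; these are not the real YARN classes.
class DispatcherSketch {
    interface Event { Enum<?> getType(); }
    interface Handler { void handle(Event event); }

    enum SchedulerEventType { APP_ADDED, APP_ATTEMPT_ADDED }
    enum AppEventType { APP_ACCEPTED }

    static final class SimpleEvent implements Event {
        private final Enum<?> type;
        SimpleEvent(Enum<?> type) { this.type = type; }
        @Override public Enum<?> getType() { return type; }
    }

    static final class Dispatcher implements Handler {
        // One handler per event-type enum class.
        private final Map<Class<?>, Handler> handlers = new HashMap<>();

        void register(Class<?> eventTypeClass, Handler handler) {
            handlers.put(eventTypeClass, handler);
        }

        @Override
        public void handle(Event event) {
            // Look up the handler by the enum class that declares the event's type.
            Handler handler = handlers.get(event.getType().getDeclaringClass());
            if (handler == null) {
                throw new IllegalStateException("No handler for " + event.getType());
            }
            handler.handle(event);
        }
    }

    static List<String> demo() {
        List<String> log = new ArrayList<>();
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.register(SchedulerEventType.class, e -> log.add("scheduler:" + e.getType()));
        dispatcher.register(AppEventType.class, e -> log.add("app:" + e.getType()));
        dispatcher.handle(new SimpleEvent(SchedulerEventType.APP_ADDED));
        dispatcher.handle(new SimpleEvent(AppEventType.APP_ACCEPTED));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the sketch is only that the caller never names the target component: the enum class of the event type decides which handler receives the event.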

This event is of type SchedulerEventType, so it is handed to schedulerDispatcher. Tracing further, it is ultimately handled by the default scheduler, CapacityScheduler. The event's constructor is:

public AppAddedSchedulerEvent(ApplicationId applicationId, String queue, String user, boolean isAppRecovering,
			ReservationId reservationID) {
		super(SchedulerEventType.APP_ADDED);
		this.applicationId = applicationId;
		this.queue = queue;
		this.user = user;
		this.reservationID = reservationID;
		this.isAppRecovering = isAppRecovering;
	}

The constructor fixes the event type as APP_ADDED, which the scheduler matches in the following branch:

case APP_ADDED: {
			AppAddedSchedulerEvent appAddedEvent = (AppAddedSchedulerEvent) event;
			String queueName = resolveReservationQueueName(appAddedEvent.getQueue(), appAddedEvent.getApplicationId(),
					appAddedEvent.getReservationID());
			if (queueName != null) {
				addApplication(appAddedEvent.getApplicationId(), queueName, appAddedEvent.getUser(),
						appAddedEvent.getIsAppRecovering());
			}
		}

The actual handling happens in addApplication. Only the important part is shown:

// Submit to the queue
		try {
			queue.submitApplication(applicationId, user, queueName);
		} catch (AccessControlException ace) {
			// Ignore the exception for recovered app as the app was previously accepted
			if (!isAppRecovering) {
				LOG.info("Failed to submit application " + applicationId + " to queue " + queueName + " from user "
						+ user, ace);
				this.rmContext.getDispatcher().getEventHandler()
						.handle(new RMAppRejectedEvent(applicationId, ace.toString()));
				return;
			}
		}
		// update the metrics
		queue.getMetrics().submitApp(user);
		SchedulerApplication<FiCaSchedulerApp> application = new SchedulerApplication<FiCaSchedulerApp>(queue, user);
		applications.put(applicationId, application);
		LOG.info("Accepted application " + applicationId + " from user: " + user + ", in queue: " + queueName);
		if (isAppRecovering) {
			if (LOG.isDebugEnabled()) {
				LOG.debug(applicationId + " is recovering. Skip notifying APP_ACCEPTED");
			}
		} else {
			rmContext.getDispatcher().getEventHandler()
					.handle(new RMAppEvent(applicationId, RMAppEventType.APP_ACCEPTED));
		}

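The application is first submitted to its queue, and then the dispatcher held by rmContext takes over. This is the global, RM-wide dispatcher. Based on the event type, APP_ACCEPTED, we can find the handler: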

addTransition(RMAppState.SUBMITTED, RMAppState.ACCEPTED, RMAppEventType.APP_ACCEPTED,
							new StartAppAttemptTransition())

private static final class StartAppAttemptTransition extends RMAppTransition {
		@Override
		public void transition(RMAppImpl app, RMAppEvent event) {
			app.createAndStartNewAttempt(false);
		};
	}

private void createAndStartNewAttempt(boolean transferStateFromPreviousAttempt) {
		createNewAttempt();
		handler.handle(new RMAppStartAttemptEvent(currentAttempt.getAppAttemptId(), transferStateFromPreviousAttempt));
	}

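Here another event, RMAppStartAttemptEvent, is submitted, which begins an attempt to start the ApplicationMaster. A single launch may not succeed, so the application is allowed more than one attempt.

We focus on the two calls inside the last method, createAndStartNewAttempt. The first is createNewAttempt: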

private void createNewAttempt() {
		ApplicationAttemptId appAttemptId = ApplicationAttemptId.newInstance(applicationId, attempts.size() + 1);
		RMAppAttempt attempt = new RMAppAttemptImpl(appAttemptId, rmContext, scheduler, masterService,
				submissionContext, conf,
				// The newly created attempt maybe last attempt if (number of
				// previously failed attempts(which should not include Preempted,
				// hardware error and NM resync) + 1) equal to the max-attempt
				// limit.
				maxAppAttempts == (getNumFailedAppAttempts() + 1), amReq);
		attempts.put(appAttemptId, attempt);
		currentAttempt = attempt;
	}

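Creating an attempt requires a good deal of context. The constructor gathers what the ApplicationMaster needs, such as the scheduler, the master service, and the submission context, all of which are closely tied to launching the ApplicationMaster.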
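
The numbering and the "maybe last attempt" flag in createNewAttempt can be sketched in isolation. This is a simplified illustration under stated assumptions: AttemptSketch and its failure counter are hypothetical, whereas the real code derives the count from getNumFailedAppAttempts(), which excludes some failure causes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the attempt bookkeeping in RMAppImpl.
class AttemptSketch {
    private final int maxAppAttempts;
    // attempt number -> whether that attempt may be the last one
    private final Map<Integer, Boolean> attempts = new LinkedHashMap<>();
    private int failedAttempts; // assumption: a plain counter replaces getNumFailedAppAttempts()

    AttemptSketch(int maxAppAttempts) {
        this.maxAppAttempts = maxAppAttempts;
    }

    int createNewAttempt() {
        // Attempt numbers start at 1 and grow with the number of attempts created so far.
        int attemptNumber = attempts.size() + 1;
        // The new attempt may be the last one if one more failure would reach the limit.
        boolean maybeLastAttempt = maxAppAttempts == failedAttempts + 1;
        attempts.put(attemptNumber, maybeLastAttempt);
        return attemptNumber;
    }

    void markCurrentAttemptFailed() {
        failedAttempts++;
    }

    boolean isMaybeLastAttempt(int attemptNumber) {
        return attempts.get(attemptNumber);
    }

    public static void main(String[] args) {
        AttemptSketch app = new AttemptSketch(2);
        int first = app.createNewAttempt();
        app.markCurrentAttemptFailed();
        int second = app.createNewAttempt();
        System.out.println(first + " maybeLast=" + app.isMaybeLastAttempt(first));
        System.out.println(second + " maybeLast=" + app.isMaybeLastAttempt(second));
    }
}
```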

The second call submits yet another event. Events of this type are dispatched as follows:

			rmDispatcher.register(RMAppAttemptEventType.class, new ApplicationAttemptEventDispatcher(rmContext));

addTransition(RMAppAttemptState.NEW, RMAppAttemptState.SUBMITTED, RMAppAttemptEventType.START,
							new AttemptStartedTransition())

private static final class AttemptStartedTransition extends BaseTransition {
		@Override
		public void transition(RMAppAttemptImpl appAttempt, RMAppAttemptEvent event) {

			boolean transferStateFromPreviousAttempt = false;
			if (event instanceof RMAppStartAttemptEvent) {
				transferStateFromPreviousAttempt = ((RMAppStartAttemptEvent) event)
						.getTransferStateFromPreviousAttempt();
			}
			appAttempt.startTime = System.currentTimeMillis();

			// Register with the ApplicationMasterService
			appAttempt.masterService.registerAppAttempt(appAttempt.applicationAttemptId);

			if (UserGroupInformation.isSecurityEnabled()) {
				appAttempt.clientTokenMasterKey = appAttempt.rmContext.getClientToAMTokenSecretManager()
						.createMasterKey(appAttempt.applicationAttemptId);
			}

			// Add the applicationAttempt to the scheduler and inform the scheduler
			// whether to transfer the state from previous attempt.
			appAttempt.eventHandler.handle(new AppAttemptAddedSchedulerEvent(appAttempt.applicationAttemptId,
					transferStateFromPreviousAttempt));
		}
	}

Let us look at this event:

public AppAttemptAddedSchedulerEvent(ApplicationAttemptId applicationAttemptId,
			boolean transferStateFromPreviousAttempt, boolean isAttemptRecovering) {
		super(SchedulerEventType.APP_ATTEMPT_ADDED);
		this.applicationAttemptId = applicationAttemptId;
		this.transferStateFromPreviousAttempt = transferStateFromPreviousAttempt;
		this.isAttemptRecovering = isAttemptRecovering;
	}

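Clearly, the submitted event is of type SchedulerEventType.APP_ATTEMPT_ADDED. Let us follow the corresponding handle method step by step, starting with where eventHandler comes from: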

		this.eventHandler = rmContext.getDispatcher().getEventHandler();

This shows that the event is handed to the global dispatcher:

			rmDispatcher.register(SchedulerEventType.class, schedulerDispatcher);

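By event type, the matching dispatcher is the RM's internal SchedulerEventDispatcher. Its handle logic is actually carried out by an inner EventProcessor, a class that implements Runnable and processes events asynchronously: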

						scheduler.handle(event);			
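
The asynchronous hand-off can be sketched with a blocking queue and a single worker thread. This is only an illustration of the pattern: the class name, the String events, and the stop marker are assumptions made for the example and do not mirror the real SchedulerEventDispatcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: the caller enqueues, a worker thread calls scheduler.handle(event).
class AsyncSchedulerDispatcherSketch {
    interface Scheduler { void handle(String event); }

    private static final String STOP = "__STOP__"; // assumption: a marker event ends the worker
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread processor;

    AsyncSchedulerDispatcherSketch(Scheduler scheduler) {
        this.processor = new Thread(() -> {
            try {
                while (true) {
                    String event = queue.take(); // blocks until an event arrives
                    if (STOP.equals(event)) {
                        return;
                    }
                    scheduler.handle(event);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "scheduler-event-processor");
    }

    void start() { processor.start(); }

    // Returns immediately; the event is processed later on the worker thread.
    void handle(String event) { queue.add(event); }

    void stop() throws InterruptedException {
        queue.add(STOP);
        processor.join();
    }

    static List<String> demo() {
        List<String> handled = new CopyOnWriteArrayList<>();
        AsyncSchedulerDispatcherSketch dispatcher = new AsyncSchedulerDispatcherSketch(handled::add);
        dispatcher.start();
        dispatcher.handle("APP_ADDED");
        dispatcher.handle("APP_ATTEMPT_ADDED");
        try {
            dispatcher.stop();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(handled);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because a single thread drains the queue, events reach the scheduler in the order they were submitted, while the submitting thread is never blocked by the scheduler's work.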

The scheduler here is the RM-wide CapacityScheduler. Here is how it handles this event:

case APP_ATTEMPT_ADDED: {
			AppAttemptAddedSchedulerEvent appAttemptAddedEvent = (AppAttemptAddedSchedulerEvent) event;
			addApplicationAttempt(appAttemptAddedEvent.getApplicationAttemptId(),
					appAttemptAddedEvent.getTransferStateFromPreviousAttempt(),
					appAttemptAddedEvent.getIsAttemptRecovering());
		}

It looks as though this leads to a new event that needs further dispatching. Let us look at the addApplicationAttempt method:

		SchedulerApplication<FiCaSchedulerApp> application = applications.get(applicationAttemptId.getApplicationId());
		CSQueue queue = (CSQueue) application.getQueue();

		FiCaSchedulerApp attempt = new FiCaSchedulerApp(applicationAttemptId, application.getUser(), queue,
				queue.getActiveUsersManager(), rmContext);
		if (transferStateFromPreviousAttempt) {
			attempt.transferStateFromPreviousAttempt(application.getCurrentAppAttempt());
		}
		application.setCurrentAppAttempt(attempt);

		queue.submitApplicationAttempt(attempt, application.getUser());
		LOG.info("Added Application Attempt " + applicationAttemptId + " to scheduler from user "
				+ application.getUser() + " in queue " + queue.getQueueName());
		if (isAttemptRecovering) {
			if (LOG.isDebugEnabled()) {
				LOG.debug(applicationAttemptId + " is recovering. Skipping notifying ATTEMPT_ADDED");
			}
		} else {
			rmContext.getDispatcher().getEventHandler()
					.handle(new RMAppAttemptEvent(applicationAttemptId, RMAppAttemptEventType.ATTEMPT_ADDED));
		}
	

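We will not analyze the earlier part of this code. The focus is on how dispatching continues: another event is submitted to the global dispatcher, this time of type RMAppAttemptEventType.ATTEMPT_ADDED. According to the registration below, it is handed to ApplicationAttemptEventDispatcher: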

	rmDispatcher.register(RMAppAttemptEventType.class, new ApplicationAttemptEventDispatcher(rmContext));

Here is how ApplicationAttemptEventDispatcher handles the event:

			ApplicationAttemptId appAttemptID = event.getApplicationAttemptId();
			ApplicationId appAttemptId = appAttemptID.getApplicationId();
			RMApp rmApp = this.rmContext.getRMApps().get(appAttemptId);
			if (rmApp != null) {
				RMAppAttempt rmAppAttempt = rmApp.getRMAppAttempt(appAttemptID);
				if (rmAppAttempt != null) {
					try {
						rmAppAttempt.handle(event);
					} catch (Throwable t) {
						LOG.error("Error in handling event type " + event.getType() + " for applicationAttempt "
								+ appAttemptId, t);
					}
				}
			}
		

The concrete implementation of rmAppAttempt here is RMAppAttemptImpl. Its handle method is:

		this.writeLock.lock();

		try {
			ApplicationAttemptId appAttemptID = event.getApplicationAttemptId();
			LOG.debug("Processing event for " + appAttemptID + " of type " + event.getType());
			final RMAppAttemptState oldState = getAppAttemptState();
			try {
				/* keep the master in sync with the state machine */
				this.stateMachine.doTransition(event.getType(), event);
			} catch (InvalidStateTransitonException e) {
				LOG.error("Can't handle this event at current state", e);
				/* TODO fail the application on the failed transition */
			}

			if (oldState != getAppAttemptState()) {
				LOG.info(appAttemptID + " State change from " + oldState + " to " + getAppAttemptState());
			}
		} finally {
			this.writeLock.unlock();
		}
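
The shape of this handle method (take the write lock, run the transition, tolerate an invalid event, log a state change) can be reproduced in a small standalone sketch. The two states, the UNEXPECTED event, and the in-memory log list are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of a lock-guarded handle() around a state transition.
class LockedHandleSketch {
    enum State { SUBMITTED, SCHEDULED }
    enum EventType { ATTEMPT_ADDED, UNEXPECTED }

    private final Lock writeLock = new ReentrantReadWriteLock().writeLock();
    private final List<String> log = new ArrayList<>();
    private State state = State.SUBMITTED;

    // Stand-in for stateMachine.doTransition: only one arc is known here.
    private State doTransition(EventType type) {
        if (state == State.SUBMITTED && type == EventType.ATTEMPT_ADDED) {
            return State.SCHEDULED;
        }
        throw new IllegalStateException("Invalid event " + type + " at " + state);
    }

    void handle(EventType type) {
        writeLock.lock();
        try {
            State oldState = state;
            try {
                state = doTransition(type);
            } catch (IllegalStateException e) {
                // The invalid event is logged and the state is left unchanged.
                log.add("ERROR " + e.getMessage());
            }
            if (oldState != state) {
                log.add("State change from " + oldState + " to " + state);
            }
        } finally {
            writeLock.unlock(); // always released, even if the transition throws
        }
    }

    State state() { return state; }
    List<String> log() { return log; }

    public static void main(String[] args) {
        LockedHandleSketch attempt = new LockedHandleSketch();
        attempt.handle(EventType.UNEXPECTED);
        attempt.handle(EventType.ATTEMPT_ADDED);
        attempt.log().forEach(System.out::println);
    }
}
```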
	

This goes through the state machine. The attempt was in state RMAppAttemptState.SUBMITTED, and the submitted event is RMAppAttemptEventType.ATTEMPT_ADDED, so the matching transition is:

addTransition(RMAppAttemptState.SUBMITTED,
							EnumSet.of(RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING, RMAppAttemptState.SCHEDULED),
							RMAppAttemptEventType.ATTEMPT_ADDED, new ScheduleTransition())

Let us look at the code of ScheduleTransition:

public RMAppAttemptState transition(RMAppAttemptImpl appAttempt, RMAppAttemptEvent event) {
			ApplicationSubmissionContext subCtx = appAttempt.submissionContext;
			if (!subCtx.getUnmanagedAM()) {
				// Need reset #containers before create new attempt, because this request
				// will be passed to scheduler, and scheduler will deduct the number after
				// AM container allocated

				// Currently, following fields are all hard code,
				// TODO: change these fields when we want to support
				// priority/resource-name/relax-locality specification for AM containers
				// allocation.
				appAttempt.amReq.setNumContainers(1);
				appAttempt.amReq.setPriority(AM_CONTAINER_PRIORITY);
				appAttempt.amReq.setResourceName(ResourceRequest.ANY);
				appAttempt.amReq.setRelaxLocality(true);

				// AM resource has been checked when submission
				Allocation amContainerAllocation = appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
						Collections.singletonList(appAttempt.amReq), EMPTY_CONTAINER_RELEASE_LIST, null, null);
				if (amContainerAllocation != null && amContainerAllocation.getContainers() != null) {
					assert (amContainerAllocation.getContainers().size() == 0);
				}
				return RMAppAttemptState.SCHEDULED;
			} else {
				// save state and then go to LAUNCHED state
				appAttempt.storeAttempt();
				return RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING;
			}
		}
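
A transition registered with an EnumSet of target states differs from a simple one in that the hook itself returns the next state. The sketch below illustrates that idea under stated assumptions: MultiArcSketch, MultiArc, and the boolean input are hypothetical and much simpler than the state machine factory used in YARN.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a multiple-arc transition: the hook chooses the post-state.
class MultiArcSketch {
    enum State { SUBMITTED, SCHEDULED, LAUNCHED_UNMANAGED_SAVING }

    static final class MultiArc {
        private final State preState;
        private final Set<State> allowedPostStates;
        private final Function<Boolean, State> hook;

        MultiArc(State preState, Set<State> allowedPostStates, Function<Boolean, State> hook) {
            this.preState = preState;
            this.allowedPostStates = allowedPostStates;
            this.hook = hook;
        }

        State fire(State current, boolean unmanagedAM) {
            if (current != preState) {
                throw new IllegalStateException("Invalid event at state " + current);
            }
            State next = hook.apply(unmanagedAM);
            // The returned state must be one of the declared targets.
            if (!allowedPostStates.contains(next)) {
                throw new IllegalStateException("Undeclared post-state " + next);
            }
            return next;
        }
    }

    // Mirrors the branch in the excerpt: managed AMs get scheduled, unmanaged AMs are saved.
    static State onAttemptAdded(State current, boolean unmanagedAM) {
        MultiArc arc = new MultiArc(
                State.SUBMITTED,
                EnumSet.of(State.LAUNCHED_UNMANAGED_SAVING, State.SCHEDULED),
                unmanaged -> unmanaged ? State.LAUNCHED_UNMANAGED_SAVING : State.SCHEDULED);
        return arc.fire(current, unmanagedAM);
    }

    public static void main(String[] args) {
        System.out.println(onAttemptAdded(State.SUBMITTED, false));
        System.out.println(onAttemptAdded(State.SUBMITTED, true));
    }
}
```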

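This is the key part. Once scheduling begins, the attempt has to ask for resources so that it can later be launched in the container it is given.

Here is the resource allocation call: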

Allocation amContainerAllocation = appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
						Collections.singletonList(appAttempt.amReq), EMPTY_CONTAINER_RELEASE_LIST, null, null);

Tracing further, the key lines are:

		FiCaSchedulerApp application = getApplicationAttempt(applicationAttemptId);

return application.getAllocation(getResourceCalculator(), clusterResource, getMinimumResourceCapability());

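We skip the preceding checks and the release of containers and look directly at these two statements. The allocation is produced by another scheduler-side object, the FiCaSchedulerApp. This logic is somewhat involved, so let us go through it thoroughly: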

Earlier, we called CapacityScheduler's addApplication method, which contains these two lines:

SchedulerApplication<FiCaSchedulerApp> application = new SchedulerApplication<FiCaSchedulerApp>(queue, user);
		applications.put(applicationId, application);

After that, we called addApplicationAttempt, which creates a new FiCaSchedulerApp:

SchedulerApplication<FiCaSchedulerApp> application = applications.get(applicationAttemptId.getApplicationId());
		CSQueue queue = (CSQueue) application.getQueue();

		FiCaSchedulerApp attempt = new FiCaSchedulerApp(applicationAttemptId, application.getUser(), queue,
				queue.getActiveUsersManager(), rmContext);
		if (transferStateFromPreviousAttempt) {
			attempt.transferStateFromPreviousAttempt(application.getCurrentAppAttempt());
		}
		application.setCurrentAppAttempt(attempt);

This is the FiCaSchedulerApp that is used later. Here is its constructor:

public FiCaSchedulerApp(ApplicationAttemptId applicationAttemptId, String user, Queue queue,
			ActiveUsersManager activeUsersManager, RMContext rmContext) {
		super(applicationAttemptId, user, queue, activeUsersManager, rmContext);

		RMApp rmApp = rmContext.getRMApps().get(getApplicationId());

		Resource amResource;
		if (rmApp == null || rmApp.getAMResourceRequest() == null) {
			// the rmApp may be undefined (the resource manager checks for this too)
			// and unmanaged applications do not provide an amResource request
			// in these cases, provide a default using the scheduler
			amResource = rmContext.getScheduler().getMinimumResourceCapability();
		} else {
			amResource = rmApp.getAMResourceRequest().getCapability();
		}

		setAMResource(amResource);
	}

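Note that if no resource was specified for the ApplicationMaster at submission, the scheduler's minimum resource capability is used, so it is usually worth configuring. The minimum is read from the scheduler configuration: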

public Resource getMinimumAllocation() {
		int minimumMemory = getInt(YarnConfiguration.RM_SCHEDULER_MINIMUM_ALLOCATION_MB,
				YarnConfiguration.DEFAULT_RM_SCHEDULER_MINIMUM_ALLOCATION_MB);
		int minimumCores = getInt(YarnConfiguration.RM_SCHEDULER_MINIMUM_ALLOCATION_VCORES,
				YarnConfiguration.DEFAULT_RM_SCHEDULER_MINIMUM_ALLOCATION_VCORES);
		return Resources.createResource(minimumMemory, minimumCores);
	}

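Tracing to the source, the default minimum memory is 1 GB (1024 MB) and the default minimum number of vcores is 1. Because these values are loaded from the YARN configuration, the defaults can be overridden there.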
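
The lookup-with-default behaviour can be sketched without Hadoop on the classpath. In the example below, java.util.Properties stands in for the Hadoop Configuration object, which is an assumption made purely for illustration; the two keys are the standard YARN properties yarn.scheduler.minimum-allocation-mb and yarn.scheduler.minimum-allocation-vcores.

```java
import java.util.Properties;

// Hypothetical stand-in for getMinimumAllocation(), using Properties instead of Configuration.
class MinAllocationSketch {
    static final String MIN_MB = "yarn.scheduler.minimum-allocation-mb";
    static final String MIN_VCORES = "yarn.scheduler.minimum-allocation-vcores";
    static final int DEFAULT_MIN_MB = 1024;
    static final int DEFAULT_MIN_VCORES = 1;

    // Returns the configured integer, or the default when the key is absent.
    static int getInt(Properties conf, String key, int defaultValue) {
        String value = conf.getProperty(key);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    // Returns {memoryMb, vcores}.
    static int[] minimumAllocation(Properties conf) {
        return new int[] {
            getInt(conf, MIN_MB, DEFAULT_MIN_MB),
            getInt(conf, MIN_VCORES, DEFAULT_MIN_VCORES)
        };
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        int[] defaults = minimumAllocation(conf);
        System.out.println(defaults[0] + " MB, " + defaults[1] + " vcores");

        conf.setProperty(MIN_MB, "2048"); // override only the memory minimum
        int[] overridden = minimumAllocation(conf);
        System.out.println(overridden[0] + " MB, " + overridden[1] + " vcores");
    }
}
```

In a real cluster these two properties would normally be set in yarn-site.xml rather than in code.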

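Note that once the last statement of ScheduleTransition executes successfully, the state of RMAppAttemptImpl becomes RMAppAttemptState.SCHEDULED. The allocation call itself ends with: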

ContainersAndNMTokensAllocation allocation = pullNewlyAllocatedContainersAndNMTokens();
		return new Allocation(allocation.getContainerList(), getHeadroom(), null, currentContPreemption,
				Collections.singletonList(rr), allocation.getNMTokenList());

Just before the new Allocation is returned there is an important call, pullNewlyAllocatedContainersAndNMTokens, shown above. Its body is:

		List<Container> returnContainerList = new ArrayList<Container>(newlyAllocatedContainers.size());
		List<NMToken> nmTokens = new ArrayList<NMToken>();
		for (Iterator<RMContainer> i = newlyAllocatedContainers.iterator(); i.hasNext();) {
			RMContainer rmContainer = i.next();
			Container container = rmContainer.getContainer();
			try {
				// create container token and NMToken altogether.
				container.setContainerToken(rmContext.getContainerTokenSecretManager().createContainerToken(
						container.getId(), container.getNodeId(), getUser(), container.getResource(),
						container.getPriority(), rmContainer.getCreationTime(), this.logAggregationContext));
				NMToken nmToken = rmContext.getNMTokenSecretManager().createAndGetNMToken(getUser(),
						getApplicationAttemptId(), container);
				if (nmToken != null) {
					nmTokens.add(nmToken);
				}
			} catch (IllegalArgumentException e) {
				// DNS might be down, skip returning this container.
				LOG.error("Error trying to assign container token and NM token to" + " an allocated container "
						+ container.getId(), e);
				continue;
			}
			returnContainerList.add(container);
			i.remove();
			rmContainer.handle(new RMContainerEvent(rmContainer.getContainerId(), RMContainerEventType.ACQUIRED));
		}
		return new ContainersAndNMTokensAllocation(returnContainerList, nmTokens);
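
The loop above uses a pattern worth isolating: a container leaves the pending list only after its tokens were created, and a failure skips it so that it can be picked up on a later call. The sketch below illustrates this with plain strings; createToken and its failure rule are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the pull loop: remove from the pending list only on success.
class PullContainersSketch {
    // Assumption for the demo: token creation fails for ids that start with "bad".
    static String createToken(String containerId) {
        if (containerId.startsWith("bad")) {
            throw new IllegalArgumentException("cannot create token for " + containerId);
        }
        return "token-" + containerId;
    }

    static List<String> pull(List<String> newlyAllocated) {
        List<String> returned = new ArrayList<>(newlyAllocated.size());
        for (Iterator<String> i = newlyAllocated.iterator(); i.hasNext();) {
            String containerId = i.next();
            try {
                createToken(containerId);
            } catch (IllegalArgumentException e) {
                // Skip this container; it stays in newlyAllocated for a later pull.
                continue;
            }
            returned.add(containerId);
            i.remove(); // safe removal while iterating
        }
        return returned;
    }

    public static void main(String[] args) {
        List<String> pending = new ArrayList<>(Arrays.asList("c1", "bad2", "c3"));
        System.out.println("returned=" + pull(pending) + " stillPending=" + pending);
    }
}
```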
	

Pay attention to the event submission here, which is important: an ACQUIRED event is submitted for every RMContainer, and it is handled by this transition:

.addTransition(RMContainerState.ALLOCATED, RMContainerState.ACQUIRED, RMContainerEventType.ACQUIRED,
							new AcquiredTransition())

Note that this belongs to the state machine inside RMContainerImpl. The transition goes from ALLOCATED to ACQUIRED because, during allocation, the corresponding RMContainerImpl has already moved from NEW to ALLOCATED. Here is what AcquiredTransition does:

// Tell the app
			container.eventHandler.handle(new RMAppRunningOnNodeEvent(
					container.getApplicationAttemptId().getApplicationId(), container.nodeId));

This is the key part. The container here is an RMContainerImpl, and the call goes to the handle method of its eventHandler. The event is routed by this registration:

// Register event handler for RmAppEvents
			rmDispatcher.register(RMAppEventType.class, new ApplicationEventDispatcher(rmContext));

The corresponding handling lives in the state machine's transition logic:

addTransition(RMAppState.ACCEPTED, RMAppState.ACCEPTED, RMAppEventType.APP_RUNNING_ON_NODE,
							new AppRunningOnNodeTransition())

Here is the transition:

		public void transition(RMAppImpl app, RMAppEvent event) {
			RMAppRunningOnNodeEvent nodeAddedEvent = (RMAppRunningOnNodeEvent) event;

			// if final state already stored, notify RMNode
			if (isAppInFinalState(app)) {
				app.handler
						.handle(new RMNodeCleanAppEvent(nodeAddedEvent.getNodeId(), nodeAddedEvent.getApplicationId()));
				return;
			}

			// otherwise, add it to ranNodes for further process
			app.ranNodes.add(nodeAddedEvent.getNodeId());
		};
	

Without a doubt, we take the second branch. app.ranNodes records the nodes on which the application has run; we will not go into detail here.

Continuing from the earlier logic:

// AM resource has been checked when submission
				Allocation amContainerAllocation = appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
						Collections.singletonList(appAttempt.amReq), EMPTY_CONTAINER_RELEASE_LIST, null, null);
				if (amContainerAllocation != null && amContainerAllocation.getContainers() != null) {
					assert (amContainerAllocation.getContainers().size() == 0);
				}
				return RMAppAttemptState.SCHEDULED;

Here the attempt enters RMAppAttemptState.SCHEDULED. The transition out of that state is:

addTransition(RMAppAttemptState.SCHEDULED,
							EnumSet.of(RMAppAttemptState.ALLOCATED_SAVING, RMAppAttemptState.SCHEDULED),
							RMAppAttemptEventType.CONTAINER_ALLOCATED, new AMContainerAllocatedTransition())

Tracing to the source, here is the code of AMContainerAllocatedTransition:

// Set the masterContainer
			appAttempt.setMasterContainer(amContainerAllocation.getContainers().get(0));
			RMContainerImpl rmMasterContainer = (RMContainerImpl) appAttempt.scheduler
					.getRMContainer(appAttempt.getMasterContainer().getId());
			rmMasterContainer.setAMContainer(true);
			// The node set in NMTokenSecrentManager is used for marking whether the
			// NMToken has been issued for this node to the AM.
			// When AM container was allocated to RM itself, the node which allocates
			// this AM container was marked as the NMToken already sent. Thus,
			// clear this node set so that the following allocate requests from AM are
			// able to retrieve the corresponding NMToken.
			appAttempt.rmContext.getNMTokenSecretManager().clearNodeSetForAttempt(appAttempt.applicationAttemptId);
			appAttempt.getSubmissionContext().setResource(appAttempt.getMasterContainer().getResource());
			appAttempt.storeAttempt();
			return RMAppAttemptState.ALLOCATED_SAVING;

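As the name suggests, this transition handles the allocation of the AM container. It looks up the RMContainerImpl for the master container from the scheduler and marks it as the AM container. This indicates that interaction with the NodeManager is about to begin, in preparation for launching the ApplicationMaster: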

private void storeAttempt() {
		// store attempt data in a non-blocking manner to prevent dispatcher
		// thread starvation and wait for state to be saved
		LOG.info("Storing attempt: AppId: " + getAppAttemptId().getApplicationId() + " AttemptId: " + getAppAttemptId()
				+ " MasterContainer: " + masterContainer);
		rmContext.getStateStore().storeNewApplicationAttempt(this);
	}

The storeAttempt method hands this RMAppAttemptImpl to the RM's steward, rmContext, whose state store persists the attempt.

Finally, this transition returns a new state: RMAppAttemptState.ALLOCATED_SAVING.

This post ends here. The next one discusses the transition that follows from this state.


Reposted from blog.csdn.net/u013384984/article/details/80296286