大规模集群中,对数据本地化(data loality)的优化,可以减少很多网络IO,变成本地的磁盘IO,而磁盘的IO比网络的IO快很多,这对降低集群的IO负载以及增加集群的吞吐量是很有益处的,所以我对Yarn中的延迟调度进行分析,尝试提高作业中任务所在容器的调度本地化命中率。
我们从自带的延迟调度的单元测试入手进行分析:
@Test
public void testDelayScheduling() {
FSLeafQueue queue = Mockito.mock(FSLeafQueue.class);
Priority prio = Mockito.mock(Priority.class);
Mockito.when(prio.getPriority()).thenReturn(1);
double nodeLocalityThreshold = .5;
double rackLocalityThreshold = .6;
ApplicationAttemptId applicationAttemptId = createAppAttemptId(1, 1);
RMContext rmContext = resourceManager.getRMContext();
FSAppAttempt schedulerApp =
new FSAppAttempt(scheduler, applicationAttemptId, "user1", queue ,
null, rmContext);
// Default level should be node-local
assertEquals(NodeType.NODE_LOCAL, schedulerApp.getAllowedLocalityLevel(
prio, 10, nodeLocalityThreshold, rackLocalityThreshold));
// First five scheduling opportunities should remain node local
for (int i = 0; i < 5; i++) {
schedulerApp.addSchedulingOpportunity(prio);
assertEquals(NodeType.NODE_LOCAL, schedulerApp.getAllowedLocalityLevel(
prio, 10, nodeLocalityThreshold, rackLocalityThreshold));
}
// After five it should switch to rack local
schedulerApp.addSchedulingOpportunity(prio);
assertEquals(NodeType.RACK_LOCAL, schedulerApp.getAllowedLocalityLevel(
prio, 10, nodeLocalityThreshold, rackLocalityThreshold));
// Manually set back to node local
schedulerApp.resetAllowedLocalityLevel(prio, NodeType.NODE_LOCAL);
schedulerApp.resetSchedulingOpportunities(prio);
assertEquals(NodeType.NODE_LOCAL, schedulerApp.getAllowedLocalityLevel(
prio, 10, nodeLocalityThreshold, rackLocalityThreshold));
// Now escalate again to rack-local, then to off-switch
for (int i = 0; i < 5; i++) {
schedulerApp.addSchedulingOpportunity(prio);
assertEquals(NodeType.NODE_LOCAL, schedulerApp.getAllowedLocalityLevel(
prio, 10, nodeLocalityThreshold, rackLocalityThreshold));
}
schedulerApp.addSchedulingOpportunity(prio);
assertEquals(NodeType.RACK_LOCAL, schedulerApp.getAllowedLocalityLevel(
prio, 10, nodeLocalityThreshold, rackLocalityThreshold));
for (int i = 0; i < 6; i++) {
schedulerApp.addSchedulingOpportunity(prio);
assertEquals(NodeType.RACK_LOCAL, schedulerApp.getAllowedLocalityLevel(
prio, 10, nodeLocalityThreshold, rackLocalityThreshold));
}
schedulerApp.addSchedulingOpportunity(prio);
assertEquals(NodeType.OFF_SWITCH, schedulerApp.getAllowedLocalityLevel(
prio, 10, nodeLocalityThreshold, rackLocalityThreshold));
}
我们从自带的单元测试入手:
这两个参数分别设置为了 0.5 和 0.6
double nodeLocalityThreshold = .5;
double rackLocalityThreshold = .6;
schedulerApp.getAllowedLocalityLevel():
来看一下这个方法:
public synchronized NodeType getAllowedLocalityLevel(Priority priority,
int numNodes, double nodeLocalityThreshold, double rackLocalityThreshold) {
// upper limit on threshold
if (nodeLocalityThreshold > 1.0) { nodeLocalityThreshold = 1.0; }
if (rackLocalityThreshold > 1.0) { rackLocalityThreshold = 1.0; }
// If delay scheduling is not being used, can schedule anywhere
if (nodeLocalityThreshold < 0.0 || rackLocalityThreshold < 0.0) {
return NodeType.OFF_SWITCH;
}
// Default level is NODE_LOCAL
if (!allowedLocalityLevel.containsKey(priority)) {
allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
return NodeType.NODE_LOCAL;
}
NodeType allowed = allowedLocalityLevel.get(priority);
// If level is already most liberal, we're done
if (allowed.equals(NodeType.OFF_SWITCH)) return NodeType.OFF_SWITCH;
double threshold = allowed.equals(NodeType.NODE_LOCAL) ? nodeLocalityThreshold :
rackLocalityThreshold;
// Relax locality constraints once we've surpassed threshold.
if (getSchedulingOpportunities(priority) > (numNodes * threshold)) {
if (allowed.equals(NodeType.NODE_LOCAL)) {
allowedLocalityLevel.put(priority, NodeType.RACK_LOCAL);
resetSchedulingOpportunities(priority);
}
else if (allowed.equals(NodeType.RACK_LOCAL)) {
allowedLocalityLevel.put(priority, NodeType.OFF_SWITCH);
resetSchedulingOpportunities(priority);
}
}
return allowedLocalityLevel.get(priority);
}
可以看到这两个参数不设置的情况下,默认的 allowedLocality是 任意节点(OFF_SWITCH),一旦设置了此参数,默认的allowedLocality就是(NODE_LOCAL)即本地节点。这两个参数设置以后,若不满足本地性的要求(当前的本地性可能为 NODE_LOCAL, RACK_LOCAL),可以跳过参数对应的numNodes * threshold次数的心跳机会,来增加数据本地节点分配Container的可能性,当然超过设定次数的调度机会没有调度到我们的本地性需求对应的节点,就会松弛本地化限制(relax locality constraints ),即降低本地性需求进行调度,NODE_LOCAL将降低为RACK_LOCAL需求,若RACK_LOCAL需求降低为OFF_SWITCH需求。理论上和单元测试对应起来了。
接下来到调度器中观察一下实际调度的时候,怎么用到了这个本地化限制allowedLocality:
FSAppAttempt.assignContainer
private Resource assignContainer(FSSchedulerNode node, boolean reserved) {
if (LOG.isTraceEnabled()) {
LOG.trace("Node offered to app: " + getName() + " reserved: " + reserved);
}
Collection<Priority> prioritiesToTry = (reserved) ?
Arrays.asList(node.getReservedContainer().getReservedPriority()) :
getPriorities();
// For each priority, see if we can schedule a node local, rack local
// or off-switch request. Rack of off-switch requests may be delayed
// (not scheduled) in order to promote better locality.
synchronized (this) {
for (Priority priority : prioritiesToTry) {
// Skip it for reserved container, since
// we already check it in isValidReservation.
if (!reserved && !hasContainerForNode(priority, node)) {
continue;
}
addSchedulingOpportunity(priority);
ResourceRequest rackLocalRequest = getResourceRequest(priority,
node.getRackName());
ResourceRequest localRequest = getResourceRequest(priority,
node.getNodeName());
if (localRequest != null && !localRequest.getRelaxLocality()) {
LOG.warn("Relax locality off is not supported on local request: "
+ localRequest);
}
NodeType allowedLocality;
if (scheduler.isContinuousSchedulingEnabled()) {
allowedLocality = getAllowedLocalityLevelByTime(priority,
scheduler.getNodeLocalityDelayMs(),
scheduler.getRackLocalityDelayMs(),
scheduler.getClock().getTime());
} else {
allowedLocality = getAllowedLocalityLevel(priority,
scheduler.getNumClusterNodes(),
scheduler.getNodeLocalityThreshold(),
scheduler.getRackLocalityThreshold());
}
if (rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0
&& localRequest != null && localRequest.getNumContainers() != 0) {
if (LOG.isTraceEnabled()) {
LOG.trace("Assign container on " + node.getNodeName()
+ " node, assignType: NODE_LOCAL" + ", allowedLocality: "
+ allowedLocality + ", priority: " + priority
+ ", app attempt id: " + getApplicationAttemptId());
}
return assignContainer(node, localRequest,
NodeType.NODE_LOCAL, reserved);
}
if (rackLocalRequest != null && !rackLocalRequest.getRelaxLocality()) {
continue;
}
if (rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0
&& (allowedLocality.equals(NodeType.RACK_LOCAL) ||
allowedLocality.equals(NodeType.OFF_SWITCH))) {
if (LOG.isTraceEnabled()) {
LOG.trace("Assign container on " + node.getNodeName()
+ " node, assignType: RACK_LOCAL" + ", allowedLocality: "
+ allowedLocality + ", priority: " + priority
+ ", app attempt id: " + getApplicationAttemptId());
}
return assignContainer(node, rackLocalRequest,
NodeType.RACK_LOCAL, reserved);
}
ResourceRequest offSwitchRequest =
getResourceRequest(priority, ResourceRequest.ANY);
if (offSwitchRequest != null && !offSwitchRequest.getRelaxLocality()) {
continue;
}
if (offSwitchRequest != null &&
offSwitchRequest.getNumContainers() != 0) {
if (!hasNodeOrRackLocalRequests(priority) ||
allowedLocality.equals(NodeType.OFF_SWITCH)) {
if (LOG.isTraceEnabled()) {
LOG.trace("Assign container on " + node.getNodeName()
+ " node, assignType: OFF_SWITCH" + ", allowedLocality: "
+ allowedLocality + ", priority: " + priority
+ ", app attempt id: " + getApplicationAttemptId());
}
return assignContainer(
node, offSwitchRequest, NodeType.OFF_SWITCH, reserved);
}
}
if (LOG.isTraceEnabled()) {
LOG.trace("Can't assign container on " + node.getNodeName()
+ " node, allowedLocality: " + allowedLocality + ", priority: "
+ priority + ", app attempt id: "
+ getApplicationAttemptId());
}
}
}
return Resources.none();
}
这里的Trace日志是后来补充的patch,每一个请求实际调度到的本地化等级就可以从这个日志中看出来。
ResourceRequest rackLocalRequest = getResourceRequest(priority,
node.getRackName());
ResourceRequest localRequest = getResourceRequest(priority,
node.getNodeName());
为了找到这个rackLocalRequest 以及localRequest 的源头:
我们将角度转移到mapreduce端的AM代码:
是资源请求实体,是从AM传送过来的:
taskAttempt.eventHandler.handle(new ContainerRequestEvent(
taskAttempt.attemptId, taskAttempt.resourceCapability,
taskAttempt.dataLocalHosts.toArray(
new String[taskAttempt.dataLocalHosts.size()]),
taskAttempt.dataLocalRacks.toArray(
new String[taskAttempt.dataLocalRacks.size()])));
而这里请求container中包括dataLocalHosts,来自数据输入端本身位置信息确定,splitInfo.getLocations()。
然后通过scheduledRequests调度task的时候:
scheduledRequests.addMap:
void addMap(ContainerRequestEvent event) {
ContainerRequest request = null;
if (event.getEarlierAttemptFailed()) {
....
} else {
// 根据本地性将task attempt放入对应的mapping中
// mapsHostmapping or mapsRackMapping
for (String host : event.getHosts()) {
LinkedList<TaskAttemptId> list = mapsHostMapping.get(host);
if (list == null) {
list = new LinkedList<TaskAttemptId>();
mapsHostMapping.put(host, list);
}
list.add(event.getAttemptID());
if (LOG.isDebugEnabled()) {
LOG.debug("Added attempt req to host " + host);
}
}
for (String rack: event.getRacks()) {
LinkedList<TaskAttemptId> list = mapsRackMapping.get(rack);
if (list == null) {
list = new LinkedList<TaskAttemptId>();
mapsRackMapping.put(rack, list);
}
list.add(event.getAttemptID());
if (LOG.isDebugEnabled()) {
LOG.debug("Added attempt req to rack " + rack);
}
}
request = new ContainerRequest(event, PRIORITY_MAP);
}
maps.put(event.getAttemptID(), request);
addContainerReq(request);
}
ScheduledRequests是RMContainerAllocator的一个内部类,这里调用的是addMap,从字面上看是将map的container request放入集合中。接下来调用:
protected void addContainerReq(ContainerRequest req) {
// Create resource requests
for (String host : req.hosts) {
// Data-local
if (!isNodeBlacklisted(host)) {
addResourceRequest(req.priority, host, req.capability);
}
}
// Nothing Rack-local for now
for (String rack : req.racks) {
addResourceRequest(req.priority, rack, req.capability);
}
// Off-switch
addResourceRequest(req.priority, ResourceRequest.ANY, req.capability);
}
分别添加本地节点的,本地机架的,任意节点的资源请求到对应的资源hash表中。
接下来看资源申请过程:
AM通过RMContainerAllocator.heartbeat()来证明自己存活的同时,对所需资源进行申请:
@Override
protected synchronized void heartbeat() throws Exception {
scheduleStats.updateAndLogIfChanged("Before Scheduling: ");
List<Container> allocatedContainers = getResources();
if (allocatedContainers != null && allocatedContainers.size() > 0) {
scheduledRequests.assign(allocatedContainers);
}
int completedMaps = getJob().getCompletedMaps();
int completedTasks = completedMaps + getJob().getCompletedReduces();
if ((lastCompletedTasks != completedTasks) ||
(scheduledRequests.maps.size() > 0)) {
lastCompletedTasks = completedTasks;
recalculateReduceSchedule = true;
}
if (recalculateReduceSchedule) {
boolean reducerPreempted = preemptReducesIfNeeded();
if (!reducerPreempted) {
// Only schedule new reducers if no reducer preemption happens for
// this heartbeat
scheduleReduces(getJob().getTotalMaps(), completedMaps,
scheduledRequests.maps.size(), scheduledRequests.reduces.size(),
assignedRequests.maps.size(), assignedRequests.reduces.size(),
mapResourceRequest, reduceResourceRequest, pendingReduces.size(),
maxReduceRampupLimit, reduceSlowStart);
}
recalculateReduceSchedule = false;
}
scheduleStats.updateAndLogIfChanged("After Scheduling: ");
}
- List< Container > allocatedContainers = getResources(); 完成了Container的收集
- scheduledRequests.assign(allocatedContainers); 把收集到的Container分给具体的task。
Container收集的具体流程呢?
getResources() -> makeRequest() -> ApplicationMasterProtocol.allocate()然后就通过RPC到了RM端的:
ApplicationMasterService.allocate() -> FairScheduler.allocate():
public Allocation allocate(ApplicationAttemptId appAttemptId,
List<ResourceRequest> ask, List<ContainerId> release,
List<String> blacklistAdditions, List<String> blacklistRemovals) {
//....
synchronized (application) {
if (!ask.isEmpty()) {
application.showRequests();
// Update application requests
application.updateResourceRequests(ask);
application.showRequests();
}
//...
return new Allocation(allocation.getContainerList(),
application.getHeadroom(), preemptionContainerIds, null, null,
allocation.getNMTokenList());
}
}
我们的目的不能忘记:找到这个rackLocalRequest 以及localRequest 的源头
application.updateResourceRequests(ask); 这个过程中:
public synchronized void updateResourceRequests(
List<ResourceRequest> requests) {
if (!isStopped) {
appSchedulingInfo.updateResourceRequests(requests, false);
}
}
把资源请求ask更新到了appSchedulingInfo中
回头看看rackLocalRequest 以及localRequest 怎么得到的:
public synchronized ResourceRequest getResourceRequest(Priority priority, String resourceName) {
return this.appSchedulingInfo.getResourceRequest(priority, resourceName);
}
正是上述过程我们更新到中的信息,源头终于找到。
private Resource assignContainer(FSSchedulerNode node, boolean reserved) {
if (LOG.isTraceEnabled()) {
LOG.trace("Node offered to app: " + getName() + " reserved: " + reserved);
}
Collection<Priority> prioritiesToTry = (reserved) ?
Arrays.asList(node.getReservedContainer().getReservedPriority()) :
getPriorities();
// For each priority, see if we can schedule a node local, rack local
// or off-switch request. Rack of off-switch requests may be delayed
// (not scheduled) in order to promote better locality.
synchronized (this) {
for (Priority priority : prioritiesToTry) {
// Skip it for reserved container, since
// we already check it in isValidReservation.
if (!reserved && !hasContainerForNode(priority, node)) {
continue;
}
addSchedulingOpportunity(priority);
ResourceRequest rackLocalRequest = getResourceRequest(priority,
node.getRackName());
ResourceRequest localRequest = getResourceRequest(priority,
node.getNodeName());
if (localRequest != null && !localRequest.getRelaxLocality()) {
LOG.warn("Relax locality off is not supported on local request: "
+ localRequest);
}
NodeType allowedLocality;
if (scheduler.isContinuousSchedulingEnabled()) {
allowedLocality = getAllowedLocalityLevelByTime(priority,
scheduler.getNodeLocalityDelayMs(),
scheduler.getRackLocalityDelayMs(),
scheduler.getClock().getTime());
} else {
allowedLocality = getAllowedLocalityLevel(priority,
scheduler.getNumClusterNodes(),
scheduler.getNodeLocalityThreshold(),
scheduler.getRackLocalityThreshold());
}
if (rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0
&& localRequest != null && localRequest.getNumContainers() != 0) {
return assignContainer(node, localRequest,
NodeType.NODE_LOCAL, reserved);
}
if (rackLocalRequest != null && !rackLocalRequest.getRelaxLocality()) {
continue;
}
if (rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0
&& (allowedLocality.equals(NodeType.RACK_LOCAL) ||
allowedLocality.equals(NodeType.OFF_SWITCH))) {
return assignContainer(node, rackLocalRequest,
NodeType.RACK_LOCAL, reserved);
}
ResourceRequest offSwitchRequest =
getResourceRequest(priority, ResourceRequest.ANY);
if (offSwitchRequest != null && !offSwitchRequest.getRelaxLocality()) {
continue;
}
if (offSwitchRequest != null &&
offSwitchRequest.getNumContainers() != 0) {
if (!hasNodeOrRackLocalRequests(priority) ||
allowedLocality.equals(NodeType.OFF_SWITCH)) {
return assignContainer(
node, offSwitchRequest, NodeType.OFF_SWITCH, reserved);
}
}
}
}
return Resources.none();
}
我们可以看到,在首先满足 阈值allowedLocality的情况下,优先分配本节点对应优先级的 NODE_LOCAL的需求,若没有NODE_LOCAL需求,再优先满足RACK_LOCAL需求,若该节点针对此优先级没有本地化需求,比如一些fail map 和 reduce task, 则最后分配这些请求。