In standalone mode, Spark supports Master failover: two Masters can be configured, one Active and one Standby. When the Active Master fails, the Standby Master can be switched over to become the Active Master.
The code flow for Master failover is as follows:
1 Set RECOVERY_MODE; if it is not configured, the default value is NONE.
private val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")
Configuration (conf/spark-env.sh):
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/nfs/spark_recovery"
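For the ZOOKEEPER mode described below, the configuration is analogous. A minimal sketch, assuming a three-node ZooKeeper ensemble; the hostnames and the /spark directory are placeholders:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 -Dspark.deploy.zookeeper.dir=/spark"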
2 Set up the persistence engine
// Master onStart
override def onStart(): Unit = {
...
...
val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
case "ZOOKEEPER" =>
logInfo("Persisting recovery state to ZooKeeper")
val zkFactory =
new ZooKeeperRecoveryModeFactory(conf, serializer)
(zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))
case "FILESYSTEM" =>
val fsFactory =
new FileSystemRecoveryModeFactory(conf, serializer)
(fsFactory.createPersistenceEngine(), fsFactory.createLeaderElectionAgent(this))
case "CUSTOM" =>
val clazz = Utils.classForName(conf.get("spark.deploy.recoveryMode.factory"))
val factory = clazz.getConstructor(classOf[SparkConf], classOf[Serializer])
.newInstance(conf, serializer)
.asInstanceOf[StandaloneRecoveryModeFactory]
(factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this))
case _ =>
(new BlackHolePersistenceEngine(), new MonarchyLeaderAgent(this))
}
persistenceEngine = persistenceEngine_
leaderElectionAgent = leaderElectionAgent_
}
As the code shows, the persistence engine can be created in four ways.
(1) ZOOKEEPER. The basic idea is to use ZooKeeper to elect one Master while the others stay in Standby. Connect the standalone cluster's Masters to the same ZooKeeper ensemble and start several of them; using ZooKeeper's leader election and state storage, one Master is elected leader while the rest remain Standby. If the current leader dies, another Master is elected, recovers the old Master's state, and then resumes scheduling. The entire recovery may take 1-2 minutes.
(2) FILESYSTEM. Spark writes the registration information of Applications and Workers, together with their recovery state, to a designated directory. When the Master node goes down, restarting the Master reads the Application and Worker registrations back from that directory. The switchover must be performed manually.
(3) CUSTOM. Users extend the abstract class PersistenceEngine to implement their own persistence engine for saving and recovering Application, Driver, and Worker information, and expose it through the StandaloneRecoveryModeFactory named by spark.deploy.recoveryMode.factory (a sketch follows the BlackHolePersistenceEngine code below).
(4) NONE. Uses BlackHolePersistenceEngine, whose implementation shows that it persists no Application, Driver, or Worker information; on failover, all previous Application, Driver, and Worker information is discarded.
private[master] class BlackHolePersistenceEngine extends PersistenceEngine {
override def persist(name: String, obj: Object): Unit = {} // no-op
override def unpersist(name: String): Unit = {} // no-op
override def read[T: ClassTag](name: String): Seq[T] = Nil // always returns nothing
}
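To make (3) concrete, here is a minimal, hypothetical sketch of a CUSTOM recovery mode: an in-memory PersistenceEngine plus the StandaloneRecoveryModeFactory that creates it. The names InMemoryPersistenceEngine and InMemoryRecoveryModeFactory are made up for illustration; the sketch assumes the @DeveloperApi types PersistenceEngine, StandaloneRecoveryModeFactory, LeaderElectionAgent, and LeaderElectable in org.apache.spark.deploy.master.
import scala.collection.mutable
import scala.reflect.ClassTag

import org.apache.spark.SparkConf
import org.apache.spark.deploy.master.{LeaderElectable, LeaderElectionAgent, PersistenceEngine, StandaloneRecoveryModeFactory}
import org.apache.spark.serializer.Serializer

// Illustrative only: keeps recovery state in a process-local map, so it does
// NOT survive a Master restart. A real engine would write to ZooKeeper, a
// file system, a database, etc.
class InMemoryPersistenceEngine extends PersistenceEngine {
  private val store = mutable.Map.empty[String, Object]

  override def persist(name: String, obj: Object): Unit =
    store.synchronized { store(name) = obj }

  override def unpersist(name: String): Unit =
    store.synchronized { store -= name }

  override def read[T: ClassTag](prefix: String): Seq[T] =
    store.synchronized {
      store.collect { case (k, v) if k.startsWith(prefix) => v.asInstanceOf[T] }.toSeq
    }
}

class InMemoryRecoveryModeFactory(conf: SparkConf, serializer: Serializer)
    extends StandaloneRecoveryModeFactory(conf, serializer) {

  override def createPersistenceEngine(): PersistenceEngine =
    new InMemoryPersistenceEngine

  // Single-master agent that immediately elects the only candidate,
  // mirroring what MonarchyLeaderAgent does for the NONE mode.
  override def createLeaderElectionAgent(master: LeaderElectable): LeaderElectionAgent =
    new LeaderElectionAgent {
      override val masterInstance: LeaderElectable = master
      masterInstance.electedLeader()
    }
}
Activating it is then a matter of configuration (the package name com.example is hypothetical):
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=CUSTOM -Dspark.deploy.recoveryMode.factory=com.example.InMemoryRecoveryModeFactory"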
3 When Master's onStart() is called
Master extends ThreadSafeRpcEndpoint, and ThreadSafeRpcEndpoint extends RpcEndpoint:
private[deploy] class Master(
override val rpcEnv: RpcEnv,
address: RpcAddress,
webUiPort: Int,
val securityMgr: SecurityManager,
val conf: SparkConf)
extends ThreadSafeRpcEndpoint with Logging with LeaderElectable
/**
* A trait that requires RpcEnv thread-safely sending messages to it.
*
* Thread-safety means processing of one message happens before processing of the next message by
* the same [[ThreadSafeRpcEndpoint]]. In the other words, changes to internal fields of a
* [[ThreadSafeRpcEndpoint]] are visible when processing the next message, and fields in the
* [[ThreadSafeRpcEndpoint]] need not be volatile or equivalent.
*
* However, there is no guarantee that the same thread will be executing the same
* [[ThreadSafeRpcEndpoint]] for different messages.
*/
private[spark] trait ThreadSafeRpcEndpoint extends RpcEndpoint
The Inbox class's process() method is where the RpcEndpoint's onStart() gets invoked:
/**
* Process stored messages.
*/
def process(dispatcher: Dispatcher): Unit = {
var message: InboxMessage = null
inbox.synchronized {
if (!enableConcurrent && numActiveThreads != 0) {
return
}
message = messages.poll()
if (message != null) {
numActiveThreads += 1
} else {
return
}
}
while (true) {
safelyCall(endpoint) {
message match {
case RpcMessage(_sender, content, context) =>
try {
endpoint.receiveAndReply(context).applyOrElse[Any, Unit](content, { msg =>
throw new SparkException(s"Unsupported message $message from ${_sender}")
})
} catch {
case NonFatal(e) =>
context.sendFailure(e)
// Throw the exception -- this exception will be caught by the safelyCall function.
// The endpoint's onError function will be called.
throw e
}
case OneWayMessage(_sender, content) =>
endpoint.receive.applyOrElse[Any, Unit](content, { msg =>
throw new SparkException(s"Unsupported message $message from ${_sender}")
})
case OnStart =>
endpoint.onStart() // invoke the RpcEndpoint's onStart() method
if (!endpoint.isInstanceOf[ThreadSafeRpcEndpoint]) {
inbox.synchronized {
if (!stopped) {
enableConcurrent = true
}
}
}
case OnStop =>
val activeThreads = inbox.synchronized { inbox.numActiveThreads }
assert(activeThreads == 1,
s"There should be only a single active thread but found $activeThreads threads.")
dispatcher.removeRpcEndpointRef(endpoint)
endpoint.onStop()
assert(isEmpty, "OnStop should be the last message")
case RemoteProcessConnected(remoteAddress) =>
endpoint.onConnected(remoteAddress)
case RemoteProcessDisconnected(remoteAddress) =>
endpoint.onDisconnected(remoteAddress)
case RemoteProcessConnectionError(cause, remoteAddress) =>
endpoint.onNetworkError(cause, remoteAddress)
}
}
// ... the rest of the loop body (poll the next message, or decrement numActiveThreads and return) is omitted here ...
}
}
The Dispatcher's MessageLoop is what calls inbox.process():
/** Message loop used for dispatching messages. */
private class MessageLoop extends Runnable {
override def run(): Unit = {
try {
while (true) {
try {
val data = receivers.take()
if (data == PoisonPill) {
// Put PoisonPill back so that other MessageLoops can see it.
receivers.offer(PoisonPill)
return
}
data.inbox.process(Dispatcher.this) // call inbox.process()
} catch {
case NonFatal(e) => logError(e.getMessage, e)
}
}
} catch {
case ie: InterruptedException => // exit
}
}
}
The Dispatcher is "a message dispatcher, responsible for routing RPC messages to the appropriate endpoint(s)".
RPC messages are all placed into a LinkedBlockingQueue, from which the Dispatcher's MessageLoop continuously takes and processes them. The RPC mechanism is not described in more detail here; a dedicated section will cover it later.
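To illustrate this dispatch pattern in isolation (a self-contained sketch, not Spark code; all names are made up), here is a loop thread that drains a LinkedBlockingQueue and exits on a sentinel, the way MessageLoop drains receivers and exits on PoisonPill:
import java.util.concurrent.LinkedBlockingQueue

object MessageLoopSketch {
  sealed trait Msg
  final case class Work(payload: String) extends Msg
  case object PoisonPill extends Msg

  def main(args: Array[String]): Unit = {
    val queue = new LinkedBlockingQueue[Msg]()
    val loop = new Thread(() => {
      var running = true
      while (running) {
        queue.take() match {
          case Work(p) =>
            println(s"processing $p") // stands in for data.inbox.process(...)
          case PoisonPill =>
            queue.offer(PoisonPill) // put it back so other loops can also exit
            running = false
        }
      }
    })
    loop.start()
    queue.put(Work("hello"))
    queue.put(Work("world"))
    queue.put(PoisonPill)
    loop.join()
  }
}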
4 Failover event handling
(1) Use the persistence engine to read the persisted storedApps, storedDrivers, and storedWorkers.
(2) If nothing was read, set the Master state to RecoveryState.ALIVE; the flow ends.
(3) If any of storedApps, storedDrivers, or storedWorkers is non-empty, set the Master state to RecoveryState.RECOVERING.
Master state: RecoveryState.RECOVERING
(4) Begin recovering apps, drivers, and workers. Drivers are simply added to the cache; apps and workers are re-registered.
(5) The state of every Application and Worker is set to UNKNOWN.
(6) Send a MasterChanged message, carrying the new masterWebUiUrl, to each app's driver and to each worker.
(7) Drivers and workers that are still alive reply to the new Master with an acknowledgment after receiving its address.
(8) Apps whose drivers acknowledged are set to WAITING; workers that acknowledged are set to ALIVE.
(9) Once every app, driver, and worker has replied (checked by testing whether any app or worker is still UNKNOWN), or the wait has timed out, completeRecovery() processes the apps, drivers, and workers.
Master state: RecoveryState.COMPLETING_RECOVERY
(10) Filter out the apps and workers that did not recover.
(11) Set the state of the recovered Applications to RUNNING.
(12) Re-schedule drivers that were not claimed by any worker.
Master state: RecoveryState.ALIVE
(13) Call schedule() to schedule the Drivers and Applications that are waiting for resources, e.g. launching a Driver on some worker, or launching the Executors an Application needs on Workers.
// master receive
override def receive: PartialFunction[Any, Unit] = {
// a new leader has been elected
case ElectedLeader =>
val (storedApps, storedDrivers, storedWorkers) = persistenceEngine.readPersistedData(rpcEnv) // (1) use the persistence engine to read the persisted storedApps, storedDrivers, storedWorkers
state = if (storedApps.isEmpty && storedDrivers.isEmpty && storedWorkers.isEmpty) {
RecoveryState.ALIVE // (2) nothing was read: set the Master state to RecoveryState.ALIVE; the flow ends
} else {
RecoveryState.RECOVERING // (3) any of storedApps, storedDrivers, storedWorkers is non-empty: set the Master state to RecoveryState.RECOVERING
}
logInfo("I have been elected leader! New state: " + state)
if (state == RecoveryState.RECOVERING) {
beginRecovery(storedApps, storedDrivers, storedWorkers) // (4) begin recovering apps, drivers, and workers
recoveryCompletionTask = forwardMessageThread.schedule(new Runnable {
override def run(): Unit = Utils.tryLogNonFatalError {
self.send(CompleteRecovery) // (9) once the wait times out, send CompleteRecovery, which triggers completeRecovery() to process the apps, drivers, and workers
}
}, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
}
case CompleteRecovery => completeRecovery() // finish the recovery
...
// (7) a live app's driver, after receiving the new Master's address, replies to the new Master with an acknowledgment
case MasterChangeAcknowledged(appId) =>
idToApp.get(appId) match {
case Some(app) =>
logInfo("Application has been re-registered: " + appId)
app.state = ApplicationState.WAITING // (8) apps whose drivers acknowledged are set to WAITING
case None =>
logWarning("Master change ack from unknown app: " + appId)
}
if (canCompleteRecovery) { completeRecovery() } // (9) once every app, driver, and worker has replied, completeRecovery() processes the apps, drivers, and workers
// (7) a live worker, after receiving the new Master's address, replies to the new Master with an acknowledgment
case WorkerSchedulerStateResponse(workerId, executors, driverIds) =>
idToWorker.get(workerId) match {
case Some(worker) =>
logInfo("Worker has been re-registered: " + workerId)
worker.state = WorkerState.ALIVE // (8) workers that acknowledged are set to ALIVE
val validExecutors = executors.filter(exec => idToApp.get(exec.appId).isDefined)
for (exec <- validExecutors) {
val app = idToApp.get(exec.appId).get
val execInfo = app.addExecutor(worker, exec.cores, Some(exec.execId))
worker.addExecutor(execInfo)
execInfo.copyState(exec)
}
for (driverId <- driverIds) {
drivers.find(_.id == driverId).foreach { driver =>
driver.worker = Some(worker)
driver.state = DriverState.RUNNING
worker.addDriver(driver)
}
}
case None =>
logWarning("Scheduler state from unknown worker: " + workerId)
}
if (canCompleteRecovery) { completeRecovery() } // (9) once every app, driver, and worker has replied, completeRecovery() processes the apps, drivers, and workers
...
}
// Drivers are simply added to the cache; apps and workers are re-registered.
private def beginRecovery(storedApps: Seq[ApplicationInfo], storedDrivers: Seq[DriverInfo],
storedWorkers: Seq[WorkerInfo]) {
for (app <- storedApps) {
logInfo("Trying to recover app: " + app.id)
try {
registerApplication(app) // register the Application
app.state = ApplicationState.UNKNOWN // (5) the Application's state is set to UNKNOWN
app.driver.send(MasterChanged(self, masterWebUiUrl)) // (6) tell the app's driver that the Master changed, carrying the new masterWebUiUrl
} catch {
case e: Exception => logInfo("App " + app.id + " had exception on reconnect")
}
}
for (driver <- storedDrivers) {
// Here we just read in the list of drivers. Any drivers associated with now-lost workers
// will be re-launched when we detect that the worker is missing.
drivers += driver // add the driver to the cache
}
for (worker <- storedWorkers) {
logInfo("Trying to recover worker: " + worker.id)
try {
registerWorker(worker) // register the worker
worker.state = WorkerState.UNKNOWN // (5) the Worker's state is set to UNKNOWN
worker.endpoint.send(MasterChanged(self, masterWebUiUrl)) // (6) tell the worker that the Master changed, carrying the new masterWebUiUrl
} catch {
case e: Exception => logInfo("Worker " + worker.id + " had exception on reconnect")
}
}
}
private def completeRecovery() {
// Ensure "only-once" recovery semantics using a short synchronization period.
if (state != RecoveryState.RECOVERING) { return }
state = RecoveryState.COMPLETING_RECOVERY
// Kill off any workers and apps that didn't respond to us.
workers.filter(_.state == WorkerState.UNKNOWN).foreach(
removeWorker(_, "Not responding for recovery")) // (10) filter out workers that did not recover
apps.filter(_.state == ApplicationState.UNKNOWN).foreach(finishApplication) // (10) filter out apps that did not recover
// Update the state of recovered apps to RUNNING
apps.filter(_.state == ApplicationState.WAITING).foreach(_.state = ApplicationState.RUNNING) // (11) set the recovered Applications' state to RUNNING
// Reschedule drivers which were not claimed by any workers --> (12) re-schedule drivers not claimed by any worker
drivers.filter(_.worker.isEmpty).foreach { d =>
logWarning(s"Driver ${d.id} was not found after master recovery")
if (d.desc.supervise) {
logWarning(s"Re-launching ${d.id}")
relaunchDriver(d)
} else {
removeDriver(d.id, DriverState.ERROR, None)
logWarning(s"Did not re-launch ${d.id} because it was not supervised")
}
}
state = RecoveryState.ALIVE
schedule() // (13) call schedule()
logInfo("Recovery complete - resuming operations!")
}
/**
* Schedule the currently available resources among waiting apps. This method will be called
* every time a new app joins or resource availability changes.
*/
// (13) schedule(): schedule the Drivers and Applications that are waiting for resources, e.g. launch a Driver on some worker, or launch the Executors an Application needs on Workers
private def schedule(): Unit = {
if (state != RecoveryState.ALIVE) {
return
}
// Drivers take strict precedence over executors
val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
val numWorkersAlive = shuffledAliveWorkers.size
var curPos = 0
for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
// We assign workers to each waiting driver in a round-robin fashion. For each driver, we
// start from the last worker that was assigned a driver, and continue onwards until we have
// explored all alive workers.
var launched = false
var numWorkersVisited = 0
while (numWorkersVisited < numWorkersAlive && !launched) {
val worker = shuffledAliveWorkers(curPos)
numWorkersVisited += 1
if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
launchDriver(worker, driver)
waitingDrivers -= driver
launched = true
}
curPos = (curPos + 1) % numWorkersAlive
}
}
startExecutorsOnWorkers()
}