Spark Source Code Study (1): The SparkContext Initialization Process

Background

         SparkContext is the entry point of a Spark application; SparkSession also wraps a SparkContext internally. During initialization, SparkContext creates the DAGScheduler, the TaskScheduler, the SchedulerBackend, and the MapOutputTrackerMaster. TaskScheduler and SchedulerBackend are abstractions that are instantiated differently depending on the deployment environment: in standalone mode the TaskScheduler is a TaskSchedulerImpl and the SchedulerBackend is a StandaloneSchedulerBackend, which extends CoarseGrainedSchedulerBackend. StandaloneSchedulerBackend is mainly responsible for communication and resource management, for example registering the application with the Master. The sections below walk through how these components are initialized in the source.
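
         For reference, everything described above is triggered the moment user code constructs a SparkContext. Below is a minimal, runnable sketch against a standalone cluster; the master address spark://master-host:7077 and the object name are made-up examples:

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextInitDemo {
  def main(args: Array[String]): Unit = {
    // "spark.master" and "spark.app.name" are the two settings SparkContext
    // refuses to start without (see the checks in the next section)
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077") // hypothetical standalone master
      .setAppName("sparkcontext-init-demo")

    // constructing the SparkContext runs the whole initialization path below:
    // HeartbeatReceiver, TaskSchedulerImpl, StandaloneSchedulerBackend, DAGScheduler
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 10).sum())
    } finally {
      sc.stop()
    }
  }
}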


Process

 1.SparkContext.scala

    //
    // Everything up to this point is validation and setup of configuration
    // settings, for example:
    if (!_conf.contains("spark.master")) {
      throw new SparkException("A master URL must be set in your configuration")
    }
    if (!_conf.contains("spark.app.name")) {
      throw new SparkException("An application name must be set in your configuration")
    }

    // log out spark.app.name in the Spark driver logs
    logInfo(s"Submitted application: $appName")

    // System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
    if (master == "yarn" && deployMode == "cluster" && !_conf.contains("spark.yarn.app.id")) {
      throw new SparkException("Detected yarn cluster mode, but isn't running on a cluster. " +
        "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
    }

    
    
    
    // (most of the surrounding code omitted)
    //

    // initialization of the main components
    // We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
    // retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
    _heartbeatReceiver = env.rpcEnv.setupEndpoint(
      HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

    // Create and start the scheduler
    val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
    _schedulerBackend = sched
    _taskScheduler = ts
    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    _taskScheduler.start()

    The two most important calls here are:

                SparkContext.createTaskScheduler(this, master, deployMode)

                _taskScheduler.start()


2. The SparkContext.createTaskScheduler(this, master, deployMode) method

          createTaskScheduler decides how to instantiate the scheduler and backend based on the master address set in the configuration. The branches fall into three broad groups: a standalone spark:// URL, a local master, and a catch-all masterUrl case. The masterUrl case covers external cluster managers (for example YARN or Mesos): the master URL is used to look up a matching cluster manager, and that cluster manager object then creates the scheduler and backend (a sketch of that branch follows the standalone code below). The standalone (spark:// URL) branch is initialized as follows:

case SPARK_REGEX(sparkUrl) =>
        val scheduler = new TaskSchedulerImpl(sc)
        val masterUrls = sparkUrl.split(",").map("spark://" + _)
        val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
        
        // important: wire the backend into the scheduler
        scheduler.initialize(backend)
        // return both objects
        (backend, scheduler)
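
       For comparison, the catch-all masterUrl branch mentioned above hands the work off to an ExternalClusterManager. A rough sketch, paraphrased from the Spark 2.x source (error handling trimmed, details vary by version):

case masterUrl =>
        // look up an ExternalClusterManager (e.g. the YARN one) that accepts this URL
        val cm = getClusterManager(masterUrl) match {
          case Some(clusterMgr) => clusterMgr
          case None => throw new SparkException("Could not parse Master URL: '" + master + "'")
        }
        // the cluster manager, not SparkContext, builds and wires the scheduler and backend
        val scheduler = cm.createTaskScheduler(sc, masterUrl)
        val backend = cm.createSchedulerBackend(sc, masterUrl, scheduler)
        cm.initialize(scheduler, backend)
        (backend, scheduler)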

       The scheduler.initialize() method mainly creates the scheduling Pool that holds TaskSetManagers. There are two scheduling modes, FIFO and FAIR, and the TaskSets submitted later are queued in this pool. Source code:

def initialize(backend: SchedulerBackend) {
    this.backend = backend
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, conf)
        case _ =>
          throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
          s"$schedulingMode")
      }
    }
    schedulableBuilder.buildPools()
  }
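
       The scheduling mode checked above comes from the spark.scheduler.mode setting (FIFO by default). A minimal sketch of switching an application to fair scheduling; the pool-definition file path is only an example, and without it FAIR mode still works with a single default pool:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // hypothetical standalone master
  .setAppName("fair-scheduling-demo")
  .set("spark.scheduler.mode", "FAIR")     // selects FairSchedulableBuilder above
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // optional pool definitions (example path)
val sc = new SparkContext(conf)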

3. _taskScheduler.start()

      The previous step, scheduler.initialize(), stored the backend reference, so start() mainly delegates to backend.start(), which registers the application with the Master and has the Master launch executors. The source of TaskSchedulerImpl.start():

 override def start() {
    // the main registration / resource-request logic is delegated to the backend
    backend.start()

    if (!isLocal && conf.getBoolean("spark.speculation", false)) {
      logInfo("Starting speculative execution thread")
      speculationScheduler.scheduleWithFixedDelay(new Runnable {
        override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
          checkSpeculatableTasks()
        }
      }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
    }
  }
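
      The speculation branch at the end of start() is driven purely by configuration. A sketch of turning it on; the property names are standard Spark settings, the values are just examples:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")            // enables the speculation thread seen above
  .set("spark.speculation.interval", "100ms")  // how often checkSpeculatableTasks() runs
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median marks a task as slow
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking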

4. The StandaloneSchedulerBackend.start() method

     StandaloneSchedulerBackend.start():
            1. Builds the Command object from parameters such as driver-url / executor-id / hostname / cores / app-id / worker-url

            2. Builds the ApplicationDescription object, which carries the core count, memory limits, and other descriptive information

            3. Builds a StandaloneAppClient and calls StandaloneAppClient.start()

            4. StandaloneAppClient.start() sends the registration request to the Master through its RPC endpoint

    Source code:

override def start() {
    super.start()

    // SPARK-21159. The scheduler backend should only try to connect to the launcher when in client
    // mode. In cluster mode, the code that submits the application to the Master needs to connect
    // to the launcher instead.
    if (sc.deployMode == "client") {
      launcherBackend.connect()
    }

    // The endpoint for executors to talk to us
    val driverUrl = RpcEndpointAddress(
      sc.conf.get("spark.driver.host"),
      sc.conf.get("spark.driver.port").toInt,
      CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
    val args = Seq(
      "--driver-url", driverUrl,
      "--executor-id", "{{EXECUTOR_ID}}",
      "--hostname", "{{HOSTNAME}}",
      "--cores", "{{CORES}}",
      "--app-id", "{{APP_ID}}",
      "--worker-url", "{{WORKER_URL}}")
    val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
      .map(Utils.splitCommandString).getOrElse(Seq.empty)
    val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath")
      .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)
    val libraryPathEntries = sc.conf.getOption("spark.executor.extraLibraryPath")
      .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)

    // some code omitted here (this is where javaOpts and testingClassPath are built)
    
    val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
      args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
    val webUrl = sc.ui.map(_.webUrl).getOrElse("")
    val coresPerExecutor = conf.getOption("spark.executor.cores").map(_.toInt)
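    // (the computation of initialExecutorLimit is omitted from this excerpt)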
  
  
    val appDesc = ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      webUrl, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
    client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
    client.start()  // send the registration request to the Master over RPC
    launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
   
  }

      The Command object is constructed with the class name org.apache.spark.executor.CoarseGrainedExecutorBackend, and the executor starts from that class's main method. In other words, the Worker launches a JVM running a class with a main method (CoarseGrainedExecutorBackend), passing it the command-line arguments shown above, and that is how an executor process is started.

5.CoarseGrainedExecutorBackend.main()

     On the worker side, execution starts from the main entry point of CoarseGrainedExecutorBackend.

     main() mainly validates the arguments passed in and then calls run(driverUrl, executorId, hostname, cores, appId, workerUrl, userClassPath).

      Inside run(), the main work is to fetch the driver's Spark properties over RPC, build the local SparkEnv, and register the executor's RPC endpoint.
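
      A condensed sketch of that run() logic, paraphrased from the Spark 2.x source (not verbatim: security, logging, and most config handling are omitted, and the exact signatures differ slightly between versions):

    // 1. create a throwaway RpcEnv just to fetch the driver's Spark properties
    val executorConf = new SparkConf
    val fetcher = RpcEnv.create("driverPropsFetcher", hostname, -1, executorConf,
      new SecurityManager(executorConf), clientMode = true)
    val driver = fetcher.setupEndpointRefByURI(driverUrl)
    val cfg = driver.askSync[SparkAppConfig](RetrieveSparkAppConfig)
    fetcher.shutdown()

    // 2. build the executor-side SparkEnv from the properties fetched from the driver
    val driverConf = new SparkConf()
    cfg.sparkProperties.foreach { case (k, v) => driverConf.set(k, v) }
    val env = SparkEnv.createExecutorEnv(
      driverConf, executorId, hostname, cores, cfg.ioEncryptionKey, isLocal = false)

    // 3. register the "Executor" endpoint; from here on it receives RegisteredExecutor,
    //    LaunchTask, etc. from the driver (see the follow-up post)
    env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
      env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
    workerUrl.foreach { url =>
      env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
    }
    env.rpcEnv.awaitTermination()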


Conclusion

     Once CoarseGrainedExecutorBackend is up, its receive method waits for the driver to dispatch tasks (scheduled through the TaskSetManager), and then starts threads to run them. For the details, see the follow-up post: http://blog.csdn.net/u013560925/article/details/79577957
