Spark 2.2 Source Code Reading: SPARK SUBMIT Task Submission

1. Writing the Program

Suppose we have written a program like the following.

package com.llcc.sparkSql.text

import org.apache.spark.sql.{Row, SQLContext, SparkSession}

import org.apache.spark.sql.types.{StringType, StructField, StructType}

object SparkSQLTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val peopleRDD = spark.sparkContext.textFile("J:\\00-IDEA\\work\\SparkSQLTest\\example\\people.txt")
    val schemaString = "name age"
    val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
    val schema = StructType(fields)
    val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim))
    val peopleDF = spark.createDataFrame(rowRDD, schema)
    peopleDF.createTempView("people")
    println("第一个会话"+spark)
    val results = spark.sql("SELECT name FROM people")
    results.show();
    spark.stop();
  }
}

We need to submit this program to the cluster to run it, using the spark-submit script in the $SPARK_HOME/bin directory to submit the user program.

2. spark-submit

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

As shown above, the commonly used options are:

  1. --class: the entry point of your application (e.g. org.apache.spark.examples.SparkPi)
  2. --master: the master URL of the cluster (e.g. spark://192.168.1.10:7077)
  3. --deploy-mode: whether to deploy the driver on a worker node inside the cluster (cluster mode) or locally as an external client (client mode, the default)
  4. --conf: Spark configuration, as key=value pairs
  5. application-jar: path to the jar containing the user program
  6. application-arguments: arguments passed to the user application

An example spark-submit invocation:
./bin/spark-submit \
  --class com.llcc.sparkSql.text.SparkSQLTest \
  --master spark://192.168.1.20:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 2G \
  --total-executor-cores 5 \
  /path/to/SparkSQLTest.jar \
  1000

Note
You can also specify --supervise to make sure the driver is automatically restarted if it fails with a non-zero exit code.
Reference: http://spark.apache.org/docs/2.2.0/submitting-applications.html

  1. --class sets the program entry point to com.llcc.sparkSql.text.SparkSQLTest
  2. --master sets the master URL to spark://192.168.1.20:7077
  3. --deploy-mode sets the deploy mode to cluster
  4. --supervise restarts the application automatically if it fails
  5. --executor-memory 2G gives each executor 2 GB of memory
  6. --total-executor-cores 5 caps the total number of executor CPU cores at 5
  7. /path/to/SparkSQLTest.jar is the program's jar
  8. 1000 is the argument passed to the program

Submitting this Spark job produces the following command-line log:

[root@louisvv bin]# ./spark-submit --class org.apache.spark.examples.SparkPi --master spark://192.168.1.20:7077 --deploy-mode cluster  --executor-memory 2G --total-executor-cores 5 ../examples/jars/spark-examples_2.11-2.1.0.2.6.0.3-8.jar 1000
Running Spark using the REST application submission protocol.
18/04/19 17:03:29 INFO RestSubmissionClient: Submitting a request to launch an application in spark://192.168.1.20:7077.
18/04/19 17:03:40 WARN RestSubmissionClient: Unable to connect to server spark://192.168.1.20:7077.
Warning: Master endpoint spark://192.168.1.20:7077 was not a REST server. Falling back to legacy submission gateway instead.

A WARN appears saying that spark://192.168.1.20:7077 is not a REST server, so the legacy submission gateway is used instead. This warning is explained further below.

The spark-submit script is the entry point. At the end it calls exec to run ${SPARK_HOME}/bin/spark-class with org.apache.spark.deploy.SparkSubmit as the class, and "$@" passes along all of the arguments the script was invoked with (the options shown above). The relevant part of bin/spark-submit:

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

# Here we can see that the spark-class script is executed
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

Now let's look at the spark-class script.

The script first checks the $SPARK_HOME/conf directory, the Spark dependency directory $SPARK_HOME/jars, and the Hadoop dependency directory $HADOOP_HOME/lib, and joins the resulting directories into the LAUNCH_CLASSPATH variable.

$JAVA_HOME/bin/java is assigned to the RUNNER variable.

A build_command() function builds the command to execute; its output is read in a loop and appended to the CMD array, and finally exec runs CMD.


# -z tests whether SPARK_HOME is empty (length zero)
if [ -z "${SPARK_HOME}" ]; then
# source find-spark-home from this script's directory to set SPARK_HOME
  source "$(dirname "$0")"/find-spark-home
fi

# load environment variables from load-spark-env.sh
. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary: this section locates the java command
# -n tests whether the variable is non-empty
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java" # if JAVA_HOME is set, use its java binary
else
  # check whether the java command exists on the PATH
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    # exit if java cannot be found
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find Spark jars. If there are many large jars, placing them in the corresponding directory on every machine beforehand is a possible optimization.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

# if the jars directory does not exist (and we are not testing), report an error and exit
if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  # otherwise set LAUNCH_CLASSPATH to all jars in that directory
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
# if SPARK_PREPEND_CLASSES is non-empty, prepend the launcher build classes
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

# For tests
# if SPARK_TESTING is non-empty, unset YARN_CONF_DIR and HADOOP_CONF_DIR
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
#
# The exit code of the launcher is appended to the output, so the parent shell removes it from the
# command array and checks the value to see if the launcher succeeded.
# Ultimately this is how Spark launches the JVM process whose main class is SparkSubmit.
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?  # 输出返回值
}

# Turn off posix mode since it does not allow process substitution
set +o posix
CMD=()                              # array that will hold the final command
while IFS= read -d '' -r ARG; do    # read build_command's NULL-delimited output into CMD
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}                    # number of elements in the array
LAST=$((COUNT - 1))                 # index of the last element
LAUNCHER_EXIT_CODE=${CMD[$LAST]}    # the last element, i.e. the exit code ($?) appended above

# Certain JVM failures result in errors being printed to stdout (instead of stderr), which causes
# the code that parses the output of the launcher to get confused. In those cases, check if the
# exit code is an integer, and if it's not, handle it as a special error case.
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then    # if the exit code is not numeric, print the launcher output and exit
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

if [ $LAUNCHER_EXIT_CODE != 0 ]; then               # if the launcher failed, exit with its exit code
  exit $LAUNCHER_EXIT_CODE
fi

CMD=("${CMD[@]:0:$LAST}")                           # drop the trailing exit code; CMD now holds the command built by the launcher
exec "${CMD[@]}"                                    # exec the final command


# The final command looks like this:
# /opt/jdk1.8/bin/java -Dhdp.version=2.6.0.3-8 -cp /usr/hdp/current/spark2-historyserver/conf/:/usr/hdp/2.6.0.3-8/spark2/jars/*:/usr/hdp/current/hadoop-client/conf/
#   org.apache.spark.deploy.SparkSubmit \
#  --master spark://192.168.1.20:7077 \
#  --deploy-mode cluster \
# --class org.apache.spark.examples.SparkPi \
#  --executor-memory 2G \
#  --total-executor-cores 5 \
#  ../examples/jars/spark-examples_2.11-2.1.0.2.6.0.3-8.jar \
#  1000

3. Source Code Analysis

The final command specifies org.apache.spark.deploy.SparkSubmit as the program entry point, so let's look at its main function.

It pattern-matches on the parsed action argument; if the action is a submit operation, the submit method is called.

override def main(args: Array[String]): Unit = {
    val appArgs = new SparkSubmitArguments(args)  // parse the spark-submit arguments
    // whether to print the arguments in detail, including defaults
    if (appArgs.verbose) {
      // scalastyle:off println
      printStream.println(appArgs)
      // scalastyle:on println
    }
    // pattern match on appArgs.action
    appArgs.action match {
      // for SUBMIT, call submit; call chain: main -> submit(appArgs) -> runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
      case SparkSubmitAction.SUBMIT => submit(appArgs)   // the child main class is invoked via reflection and differs by deploy mode
      // for KILL, call kill
      case SparkSubmitAction.KILL => kill(appArgs)
      // for REQUEST_STATUS, call requestStatus
      case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
    }
  }

Now let's see what the submit(appArgs) method does.

/**
   * Submit the application using the provided parameters.
   *
   * This runs in two steps. First, we prepare the launch environment by setting up
   * the appropriate classpath, system properties, and application arguments for
   * running the child main class based on the cluster manager and the deploy mode.
   * Second, we use this launch environment to invoke the main method of the child
   * main class.
   */
  @tailrec
  private def submit(args: SparkSubmitArguments): Unit = {
    //  first call prepareSubmitEnvironment to prepare the launch environment
    val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

    /**
      * doRunMain ultimately just calls runMain
      */
    def doRunMain(): Unit = {
      if (args.proxyUser != null) {
        val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
          UserGroupInformation.getCurrentUser())
        try {
          proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
            override def run(): Unit = {
              runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
            }
          })
        } catch {
          case e: Exception =>
            // Hadoop's AuthorizationException suppresses the exception's stack trace, which
            // makes the message printed to the output by the JVM not very helpful. Instead,
            // detect exceptions with empty stack traces here, and treat them differently.
            if (e.getStackTrace().length == 0) {
              // scalastyle:off println
              printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
              // scalastyle:on println
              exitFn(1)
            } else {
              throw e
            }
        }
      } else {

        // call chain: main -> submit(appArgs) -> runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
        runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
      }
    }

     // In standalone cluster mode, there are two submission gateways:
     //   (1) The traditional RPC gateway using o.a.s.deploy.Client as a wrapper
     //   (2) The new REST-based gateway introduced in Spark 1.3
     // The latter is the default behavior as of Spark 1.3, but Spark submit will fail over
     // to use the legacy gateway if the master endpoint turns out to be not a REST server.

    if (args.isStandaloneCluster && args.useRest) {
      try {
        // scalastyle:off println
        printStream.println("Running Spark using the REST application submission protocol.")
        // scalastyle:on println
        // call doRunMain()
        doRunMain()
      } catch {
        // Fail over to use the legacy submission gateway
        case e: SubmitRestConnectionException =>
          printWarning(s"Master endpoint ${args.master} was not a REST server. " +
            "Falling back to legacy submission gateway instead.")
          args.useRest = false
          submit(args)
      }
    // In all other modes, just run the main class as prepared
    } else {
      // in all other modes, call doRunMain directly
      doRunMain()
    }
  }

In this method, the first thing executed is

 val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

Let's step into it.

 /**
   * Prepare the environment for submitting an application.
   * This returns a 4-tuple:
   *   (1) the arguments for the child process,
   *   (2) a list of classpath entries for the child,
   *   (3) a map of system properties, and
   *   (4) the main class for the child
   * Exposed for testing.
   *
   * The child main class differs across deploy modes and is determined mainly by the
   * spark-submit parameters.
   */
  private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)
      : (Seq[String], Seq[String], Map[String, String], String) = {
    // Return values
    val childArgs = new ArrayBuffer[String]()
    val childClasspath = new ArrayBuffer[String]()
    val sysProps = new HashMap[String, String]()
    var childMainClass = ""

    // First set the cluster manager and deploy mode from the master and deploy-mode arguments,
    // then use the other args to fill in childArgs, childClasspath, sysProps and childMainClass,
    // and return the result.
    // Set the cluster manager according to the master argument
    val clusterManager: Int = args.master match {
      case "yarn" => YARN
      // yarn-client and yarn-cluster are deprecated; both resolve to the YARN cluster manager
      case "yarn-client" | "yarn-cluster" =>
        printWarning(s"Master ${args.master} is deprecated since 2.0." +
          " Please use master \"yarn\" with specified deploy mode instead.")
        YARN
      case m if m.startsWith("spark") => STANDALONE
      case m if m.startsWith("mesos") => MESOS
      case m if m.startsWith("local") => LOCAL
      case _ =>
        printErrorAndExit("Master must either be yarn or start with spark, mesos, local")
        -1
    }

    // Set the deploy mode; default is client mode
    var deployMode: Int = args.deployMode match {
      case "client" | null => CLIENT   // client is the default submission mode
      case "cluster" => CLUSTER
      case _ => printErrorAndExit("Deploy mode must be either client or cluster"); -1
    }

    // Because the deprecated way of specifying "yarn-cluster" and "yarn-client" encapsulate both
    // the master and deploy mode, we have some logic to infer the master and deploy mode
    // from each other if only one is specified, or exit early if they are at odds.
    if (clusterManager == YARN) {
      (args.master, args.deployMode) match {
        case ("yarn-cluster", null) =>
          deployMode = CLUSTER
          args.master = "yarn"
        case ("yarn-cluster", "client") =>
          // client deploy mode is not compatible with master "yarn-cluster"
          printErrorAndExit("Client deploy mode is not compatible with master \"yarn-cluster\"")
        case ("yarn-client", "cluster") =>
          // cluster deploy mode is not compatible with master "yarn-client"
          printErrorAndExit("Cluster deploy mode is not compatible with master \"yarn-client\"")
        case (_, mode) =>
          args.master = "yarn"
      }

      // Make sure YARN is included in our build if we're trying to use it
      // make sure the YARN classes are available if we're trying to use YARN
      if (!Utils.classIsLoadable("org.apache.spark.deploy.yarn.Client") && !Utils.isTesting) {
        printErrorAndExit(
          "Could not load YARN classes. " +
          "This copy of Spark may not have been compiled with YARN support.")
      }
    }

    // Update args.deployMode if it is null. It will be passed down as a Spark property later.
    (args.deployMode, deployMode) match {
      case (null, CLIENT) => args.deployMode = "client"
      case (null, CLUSTER) => args.deployMode = "cluster"
      case _ =>
    }
    val isYarnCluster = clusterManager == YARN && deployMode == CLUSTER
    val isMesosCluster = clusterManager == MESOS && deployMode == CLUSTER

    // Resolve maven dependencies if there are any and add classpath to jars. Add them to py-files
    // too for packages that include Python code
    val exclusions: Seq[String] =
      if (!StringUtils.isBlank(args.packagesExclusions)) {
        args.packagesExclusions.split(",")
      } else {
        Nil
      }

    // Create the IvySettings, either load from file or build defaults
    val ivySettings = args.sparkProperties.get("spark.jars.ivySettings").map { ivySettingsFile =>
      SparkSubmitUtils.loadIvySettings(ivySettingsFile, Option(args.repositories),
        Option(args.ivyRepoPath))
    }.getOrElse {
      SparkSubmitUtils.buildIvySettings(Option(args.repositories), Option(args.ivyRepoPath))
    }

    val resolvedMavenCoordinates = SparkSubmitUtils.resolveMavenCoordinates(args.packages,
      ivySettings, exclusions = exclusions)
    if (!StringUtils.isBlank(resolvedMavenCoordinates)) {
      args.jars = mergeFileLists(args.jars, resolvedMavenCoordinates)
      if (args.isPython) {
        args.pyFiles = mergeFileLists(args.pyFiles, resolvedMavenCoordinates)
      }
    }

    // install any R packages that may have been passed through --jars or --packages.
    // Spark Packages may contain R source code inside the jar.
    if (args.isR && !StringUtils.isBlank(args.jars)) {
      RPackageUtils.checkAndBuildRPackage(args.jars, printStream, args.verbose)
    }

    // In client mode, download remote files.
    if (deployMode == CLIENT) {
      val hadoopConf = new HadoopConfiguration()
      // download the primary resource
      args.primaryResource = Option(args.primaryResource).map(downloadFile(_, hadoopConf)).orNull
      // download the jars
      args.jars = Option(args.jars).map(downloadFileList(_, hadoopConf)).orNull
      // download the python files
      args.pyFiles = Option(args.pyFiles).map(downloadFileList(_, hadoopConf)).orNull
      // download the regular files
      args.files = Option(args.files).map(downloadFileList(_, hadoopConf)).orNull
    }

    // Require all python files to be local, so we can add them to the PYTHONPATH
    // In YARN cluster mode, python files are distributed as regular files, which can be non-local.
    // In Mesos cluster mode, non-local python files are automatically downloaded by Mesos.

    if (args.isPython && !isYarnCluster && !isMesosCluster) {
      if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) {
        printErrorAndExit(s"Only local python files are supported: ${args.primaryResource}")
      }
      val nonLocalPyFiles = Utils.nonLocalPaths(args.pyFiles).mkString(",")
      if (nonLocalPyFiles.nonEmpty) {
        printErrorAndExit(s"Only local additional python files are supported: $nonLocalPyFiles")
      }
    }

    // Require all R files to be local
    if (args.isR && !isYarnCluster && !isMesosCluster) {
      if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) {
        printErrorAndExit(s"Only local R files are supported: ${args.primaryResource}")
      }
    }

    // The following modes are not supported or applicable
    (clusterManager, deployMode) match {
      case (STANDALONE, CLUSTER) if args.isPython =>
        printErrorAndExit("Cluster deploy mode is currently not supported for python " +
          "applications on standalone clusters.")
      case (STANDALONE, CLUSTER) if args.isR =>
        printErrorAndExit("Cluster deploy mode is currently not supported for R " +
          "applications on standalone clusters.")
      case (LOCAL, CLUSTER) =>
        printErrorAndExit("Cluster deploy mode is not compatible with master \"local\"")
      case (_, CLUSTER) if isShell(args.primaryResource) =>
        printErrorAndExit("Cluster deploy mode is not applicable to Spark shells.")
      case (_, CLUSTER) if isSqlShell(args.mainClass) =>
        printErrorAndExit("Cluster deploy mode is not applicable to Spark SQL shell.")
      case (_, CLUSTER) if isThriftServer(args.mainClass) =>
        printErrorAndExit("Cluster deploy mode is not applicable to Spark Thrift server.")
      case _ =>
    }

    // If we're running a python app, set the main class to our specific python runner
    if (args.isPython && deployMode == CLIENT) {
      if (args.primaryResource == PYSPARK_SHELL) {
        // main class used when running the pyspark shell
        args.mainClass = "org.apache.spark.api.python.PythonGatewayServer"
      } else {
        // If a python file is provided, add it to the child arguments and list of files to deploy.
        // Usage: PythonAppRunner <main python file> <extra python files> [app arguments]
        args.mainClass = "org.apache.spark.deploy.PythonRunner"
        args.childArgs = ArrayBuffer(args.primaryResource, args.pyFiles) ++ args.childArgs
        if (clusterManager != YARN) {
          // The YARN backend distributes the primary file differently, so don't merge it.
          args.files = mergeFileLists(args.files, args.primaryResource)
        }
      }
      if (clusterManager != YARN) {
        // The YARN backend handles python files differently, so don't merge the lists.
        args.files = mergeFileLists(args.files, args.pyFiles)
      }
      if (args.pyFiles != null) {
        sysProps("spark.submit.pyFiles") = args.pyFiles
      }
    }

    // In YARN mode for an R app, add the SparkR package archive and the R package
    // archive containing all of the built R libraries to archives so that they can
    // be distributed with the job

    if (args.isR && clusterManager == YARN) {
      val sparkRPackagePath = RUtils.localSparkRPackagePath
      if (sparkRPackagePath.isEmpty) {
        printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
      }
      val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
      if (!sparkRPackageFile.exists()) {
        printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
      }
      val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString

      // Distribute the SparkR package.
      // Assigns a symbol link name "sparkr" to the shipped package.
      args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")

      // Distribute the R package archive containing all the built R packages.
      if (!RUtils.rPackages.isEmpty) {
        val rPackageFile =
          RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
        if (!rPackageFile.exists()) {
          printErrorAndExit("Failed to zip all the built R packages.")
        }

        val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
        // Assigns a symbol link name "rpkg" to the shipped package.
        args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
      }
    }

    // TODO: Support distributing R packages with standalone cluster
    if (args.isR && clusterManager == STANDALONE && !RUtils.rPackages.isEmpty) {
      printErrorAndExit("Distributing R packages with standalone cluster is not supported.")
    }

    // TODO: Support distributing R packages with mesos cluster
    if (args.isR && clusterManager == MESOS && !RUtils.rPackages.isEmpty) {
      printErrorAndExit("Distributing R packages with mesos cluster is not supported.")
    }

    // If we're running an R app, set the main class to our specific R runner
    if (args.isR && deployMode == CLIENT) {
      if (args.primaryResource == SPARKR_SHELL) {
        args.mainClass = "org.apache.spark.api.r.RBackend"
      } else {
        // If an R file is provided, add it to the child arguments and list of files to deploy.
        // Usage: RRunner <main R file> [app arguments]
        args.mainClass = "org.apache.spark.deploy.RRunner"
        args.childArgs = ArrayBuffer(args.primaryResource) ++ args.childArgs
        args.files = mergeFileLists(args.files, args.primaryResource)
      }
    }

    if (isYarnCluster && args.isR) {
      // In yarn-cluster mode for an R app, add primary resource to files
      // that can be distributed with the job
      args.files = mergeFileLists(args.files, args.primaryResource)
    }

    // Special flag to avoid deprecation warnings at the client
    sysProps("SPARK_SUBMIT") = "true"

    // A list of rules to map each argument to system properties or command-line options in
    // each deploy mode; we iterate through these below
    val options = List[OptionAssigner](

      // All cluster managers
      OptionAssigner(args.master, ALL_CLUSTER_MGRS, ALL_DEPLOY_MODES, sysProp = "spark.master"),
      OptionAssigner(args.deployMode, ALL_CLUSTER_MGRS, ALL_DEPLOY_MODES,
        sysProp = "spark.submit.deployMode"),
      OptionAssigner(args.name, ALL_CLUSTER_MGRS, ALL_DEPLOY_MODES, sysProp = "spark.app.name"),
      OptionAssigner(args.ivyRepoPath, ALL_CLUSTER_MGRS, CLIENT, sysProp = "spark.jars.ivy"),
      OptionAssigner(args.driverMemory, ALL_CLUSTER_MGRS, CLIENT,
        sysProp = "spark.driver.memory"),
      OptionAssigner(args.driverExtraClassPath, ALL_CLUSTER_MGRS, ALL_DEPLOY_MODES,
        sysProp = "spark.driver.extraClassPath"),
      OptionAssigner(args.driverExtraJavaOptions, ALL_CLUSTER_MGRS, ALL_DEPLOY_MODES,
        sysProp = "spark.driver.extraJavaOptions"),
      OptionAssigner(args.driverExtraLibraryPath, ALL_CLUSTER_MGRS, ALL_DEPLOY_MODES,
        sysProp = "spark.driver.extraLibraryPath"),

      // Yarn only
      OptionAssigner(args.queue, YARN, ALL_DEPLOY_MODES, sysProp = "spark.yarn.queue"),
      OptionAssigner(args.numExecutors, YARN, ALL_DEPLOY_MODES,
        sysProp = "spark.executor.instances"),
      OptionAssigner(args.jars, YARN, ALL_DEPLOY_MODES, sysProp = "spark.yarn.dist.jars"),
      OptionAssigner(args.files, YARN, ALL_DEPLOY_MODES, sysProp = "spark.yarn.dist.files"),
      OptionAssigner(args.archives, YARN, ALL_DEPLOY_MODES, sysProp = "spark.yarn.dist.archives"),
      OptionAssigner(args.principal, YARN, ALL_DEPLOY_MODES, sysProp = "spark.yarn.principal"),
      OptionAssigner(args.keytab, YARN, ALL_DEPLOY_MODES, sysProp = "spark.yarn.keytab"),

      // Other options
      OptionAssigner(args.executorCores, STANDALONE | YARN, ALL_DEPLOY_MODES,
        sysProp = "spark.executor.cores"),
      OptionAssigner(args.executorMemory, STANDALONE | MESOS | YARN, ALL_DEPLOY_MODES,
        sysProp = "spark.executor.memory"),
      OptionAssigner(args.totalExecutorCores, STANDALONE | MESOS, ALL_DEPLOY_MODES,
        sysProp = "spark.cores.max"),
      OptionAssigner(args.files, LOCAL | STANDALONE | MESOS, ALL_DEPLOY_MODES,
        sysProp = "spark.files"),
      OptionAssigner(args.jars, LOCAL, CLIENT, sysProp = "spark.jars"),
      OptionAssigner(args.jars, STANDALONE | MESOS, ALL_DEPLOY_MODES, sysProp = "spark.jars"),
      OptionAssigner(args.driverMemory, STANDALONE | MESOS | YARN, CLUSTER,
        sysProp = "spark.driver.memory"),
      OptionAssigner(args.driverCores, STANDALONE | MESOS | YARN, CLUSTER,
        sysProp = "spark.driver.cores"),
      OptionAssigner(args.supervise.toString, STANDALONE | MESOS, CLUSTER,
        sysProp = "spark.driver.supervise"),
      OptionAssigner(args.ivyRepoPath, STANDALONE, CLUSTER, sysProp = "spark.jars.ivy")
    )

    // In client mode, launch the application main class directly
    // In addition, add the main application jar and any added jars (if any) to the classpath
    // Also add the main application jar and any added jars to classpath in case YARN client
    // requires these jars.
    if (deployMode == CLIENT || isYarnCluster) {
      childMainClass = args.mainClass
      if (isUserJar(args.primaryResource)) {
        childClasspath += args.primaryResource
      }
      if (args.jars != null) { childClasspath ++= args.jars.split(",") }
    }

    if (deployMode == CLIENT) {
      if (args.childArgs != null) { childArgs ++= args.childArgs }
    }

    // Map all arguments to command-line options or system properties for our chosen mode
    for (opt <- options) {
      if (opt.value != null &&
          (deployMode & opt.deployMode) != 0 &&
          (clusterManager & opt.clusterManager) != 0) {
        if (opt.clOption != null) { childArgs += (opt.clOption, opt.value) }
        if (opt.sysProp != null) { sysProps.put(opt.sysProp, opt.value) }
      }
    }

    // Add the application jar automatically so the user doesn't have to call sc.addJar
    // For YARN cluster mode, the jar is already distributed on each node as "app.jar"
    // For python and R files, the primary resource is already distributed as a regular file
    if (!isYarnCluster && !args.isPython && !args.isR) {
      var jars = sysProps.get("spark.jars").map(x => x.split(",").toSeq).getOrElse(Seq.empty)
      if (isUserJar(args.primaryResource)) {
        jars = jars ++ Seq(args.primaryResource)
      }
      sysProps.put("spark.jars", jars.mkString(","))
    }

    // In standalone cluster mode, use the REST client to submit the application (Spark 1.3+).
    // All Spark parameters are expected to be passed to the client through system properties.

    //  In standalone cluster mode the client depends on useRest: if useRest is true the
    //  application is submitted with RestSubmissionClient, otherwise with the legacy Client.
    if (args.isStandaloneCluster) {
      if (args.useRest) {
        childMainClass = "org.apache.spark.deploy.rest.RestSubmissionClient"
        childArgs += (args.primaryResource, args.mainClass)
      } else {
        // In legacy standalone cluster mode, use Client as a wrapper around the user class
        childMainClass = "org.apache.spark.deploy.Client"
        //  if --supervise was set, add it to childArgs
        if (args.supervise) { childArgs += "--supervise" }
        //  add the driverMemory and driverCores settings, if present, to childArgs
        Option(args.driverMemory).foreach { m => childArgs += ("--memory", m) }
        Option(args.driverCores).foreach { c => childArgs += ("--cores", c) }
        childArgs += "launch"
        childArgs += (args.master, args.primaryResource, args.mainClass)
      }
      if (args.childArgs != null) {
        childArgs ++= args.childArgs
      }
    }

    // Let YARN know it's a pyspark app, so it distributes needed libraries.
    if (clusterManager == YARN) {
      if (args.isPython) {
        sysProps.put("spark.yarn.isPython", "true")
      }

      if (args.pyFiles != null) {
        sysProps("spark.submit.pyFiles") = args.pyFiles
      }
    }

    // assure a keytab is available from any place in a JVM
    if (clusterManager == YARN || clusterManager == LOCAL) {
      if (args.principal != null) {
        require(args.keytab != null, "Keytab must be specified when principal is specified")
        if (!new File(args.keytab).exists()) {
          throw new SparkException(s"Keytab file: ${args.keytab} does not exist")
        } else {
          // Add keytab and principal configurations in sysProps to make them available
          // for later use; e.g. in spark sql, the isolated class loader used to talk
          // to HiveMetastore will use these settings. They will be set as Java system
          // properties and then loaded by SparkConf
          sysProps.put("spark.yarn.keytab", args.keytab)
          sysProps.put("spark.yarn.principal", args.principal)

          UserGroupInformation.loginUserFromKeytab(args.principal, args.keytab)
        }
      }
    }

    // In yarn-cluster mode, use yarn.Client as a wrapper around the user class
    if (isYarnCluster) {
      childMainClass = "org.apache.spark.deploy.yarn.Client"
      if (args.isPython) {
        childArgs += ("--primary-py-file", args.primaryResource)
        childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
      } else if (args.isR) {
        val mainFile = new Path(args.primaryResource).getName
        childArgs += ("--primary-r-file", mainFile)
        childArgs += ("--class", "org.apache.spark.deploy.RRunner")
      } else {
        if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
          childArgs += ("--jar", args.primaryResource)
        }
        childArgs += ("--class", args.mainClass)
      }
      if (args.childArgs != null) {
        args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
      }
    }

    if (isMesosCluster) {
      assert(args.useRest, "Mesos cluster mode is only supported through the REST submission API")
      childMainClass = "org.apache.spark.deploy.rest.RestSubmissionClient"
      if (args.isPython) {
        // Second argument is main class
        childArgs += (args.primaryResource, "")
        if (args.pyFiles != null) {
          sysProps("spark.submit.pyFiles") = args.pyFiles
        }
      } else if (args.isR) {
        // Second argument is main class
        childArgs += (args.primaryResource, "")
      } else {
        childArgs += (args.primaryResource, args.mainClass)
      }
      if (args.childArgs != null) {
        childArgs ++= args.childArgs
      }
    }

    // Load any properties specified through --conf and the default properties file
    for ((k, v) <- args.sparkProperties) {
      sysProps.getOrElseUpdate(k, v)
    }

    // Ignore invalid spark.driver.host in cluster modes.
    if (deployMode == CLUSTER) {
      sysProps -= "spark.driver.host"
    }

    // Resolve paths in certain spark properties
    val pathConfigs = Seq(
      "spark.jars",
      "spark.files",
      "spark.yarn.dist.files",
      "spark.yarn.dist.archives",
      "spark.yarn.dist.jars")
    pathConfigs.foreach { config =>
      // Replace old URIs with resolved URIs, if they exist
      sysProps.get(config).foreach { oldValue =>
        sysProps(config) = Utils.resolveURIs(oldValue)
      }
    }

    // Resolve and format python file paths properly before adding them to the PYTHONPATH.
    // The resolving part is redundant in the case of --py-files, but necessary if the user
    // explicitly sets `spark.submit.pyFiles` in his/her default properties file.
    sysProps.get("spark.submit.pyFiles").foreach { pyFiles =>
      val resolvedPyFiles = Utils.resolveURIs(pyFiles)
      val formattedPyFiles = if (!isYarnCluster && !isMesosCluster) {
        PythonRunner.formatPaths(resolvedPyFiles).mkString(",")
      } else {
        // Ignoring formatting python path in yarn and mesos cluster mode, these two modes
        // support dealing with remote python files, they could distribute and add python files
        // locally.
        resolvedPyFiles
      }
      sysProps("spark.submit.pyFiles") = formattedPyFiles
    }

    (childArgs, childClasspath, sysProps, childMainClass)
  }

So in the submit method, prepareSubmitEnvironment is called first to prepare the submission environment.

It first sets the clusterManager and deploy mode from the master and deploy-mode arguments, then uses the other args to fill in childArgs, childClasspath, sysProps and childMainClass, and returns the result.
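For our example submission (a standalone master with --deploy-mode cluster), here is a rough sketch of what the returned 4-tuple could contain, assuming the REST gateway is used and ignoring anything loaded from spark-defaults.conf (the values below are illustrative, not captured output):

// Illustrative sketch of the 4-tuple for the standalone-cluster example above (assumed values)
val childArgs      = Seq("/path/to/SparkSQLTest.jar", "com.llcc.sparkSql.text.SparkSQLTest", "1000")
val childClasspath = Seq.empty[String]   // in standalone cluster mode the user jar is not added to the local classpath
val sysProps       = Map(
  "SPARK_SUBMIT"            -> "true",
  "spark.master"            -> "spark://192.168.1.20:7077",
  "spark.submit.deployMode" -> "cluster",
  "spark.executor.memory"   -> "2G",
  "spark.cores.max"         -> "5",
  "spark.driver.supervise"  -> "true",
  "spark.jars"              -> "/path/to/SparkSQLTest.jar"
)
val childMainClass = "org.apache.spark.deploy.rest.RestSubmissionClient"

The exact contents depend on the submit arguments and defaults; the important point is that prepareSubmitEnvironment decides which main class will actually be launched next.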

Then the following is executed:

 /*  In standalone cluster mode, there are two submission gateways:
    *   1. the traditional RPC gateway, using org.apache.spark.deploy.Client as a wrapper
    *   2. the REST-based gateway introduced in Spark 1.3
    *   The latter is the default as of Spark 1.3, but spark-submit falls back to the legacy
    *   gateway if the master endpoint turns out not to be a REST server.
    */
    if (args.isStandaloneCluster && args.useRest) {
      try {
        // scalastyle:off println
        printStream.println("Running Spark using the REST application submission protocol.")
        // scalastyle:on println
        // call doRunMain()
        doRunMain()
      } catch {
        // Fail over to use the legacy submission gateway
        case e: SubmitRestConnectionException =>
          printWarning(s"Master endpoint ${args.master} was not a REST server. " +
            "Falling back to legacy submission gateway instead.")
          args.useRest = false
          submit(args)
      }
    // In all other modes, just run the main class as prepared
    } else {
      // in all other modes, call doRunMain directly
      doRunMain()
    }

Either way, doRunMain() is eventually called:

 /**
      * doRunMain ultimately just calls runMain
      */
    def doRunMain(): Unit = {
      if (args.proxyUser != null) {
        val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
          UserGroupInformation.getCurrentUser())
        try {
          proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
            override def run(): Unit = {
              runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
            }
          })
        } catch {
          case e: Exception =>
            // Hadoop's AuthorizationException suppresses the exception's stack trace, which
            // makes the message printed to the output by the JVM not very helpful. Instead,
            // detect exceptions with empty stack traces here, and treat them differently.
            if (e.getStackTrace().length == 0) {
              // scalastyle:off println
              printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
              // scalastyle:on println
              exitFn(1)
            } else {
              throw e
            }
        }
      } else {

        // call chain: main -> submit(appArgs) -> runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
        runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
      }
    }

Then runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose) is called:

/**
   * Run the main method of the child class using the provided launch environment.
   *
   * Note that this main class will not be the one provided by the user if we're
   * running cluster deploy mode or python applications.
   *
   * runMain invokes that main method via reflection (mainMethod.invoke).
   *
   * When the deploy mode is client, the main method of the user's own class is executed.
   *
   * When the deploy mode is cluster, it depends on whether REST submission is used: if so,
   * the main method of org.apache.spark.deploy.rest.RestSubmissionClient is executed,
   * otherwise that of org.apache.spark.deploy.Client.
   */
  private def runMain(
      childArgs: Seq[String],
      childClasspath: Seq[String],
      sysProps: Map[String, String],
      childMainClass: String,
      verbose: Boolean): Unit = {
    // scalastyle:off println
    if (verbose) {
      printStream.println(s"Main class:\n$childMainClass")
      printStream.println(s"Arguments:\n${childArgs.mkString("\n")}")
      // sysProps may contain sensitive information, so redact before printing
      printStream.println(s"System properties:\n${Utils.redact(sysProps).mkString("\n")}")
      printStream.println(s"Classpath elements:\n${childClasspath.mkString("\n")}")
      printStream.println("\n")
    }
    // scalastyle:on println

    val loader =
      if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
        new ChildFirstURLClassLoader(new Array[URL](0),
          Thread.currentThread.getContextClassLoader)
      } else {
        new MutableURLClassLoader(new Array[URL](0),
          Thread.currentThread.getContextClassLoader)
      }
    Thread.currentThread.setContextClassLoader(loader)

    //  load the jars onto the classpath via the URLClassLoader
    for (jar <- childClasspath) {
      addJarToClasspath(jar, loader)
    }

    for ((key, value) <- sysProps) {
      System.setProperty(key, value)
    }

    var mainClass: Class[_] = null

    try {
      /**
        * Class.forName(xxx.xx.xx) asks the JVM to locate and load the specified class,
        * which also causes the JVM to run that class's static initializers
        */
      mainClass = Utils.classForName(childMainClass)
    } catch {
      case e: ClassNotFoundException =>
        e.printStackTrace(printStream)
        if (childMainClass.contains("thriftserver")) {
          // scalastyle:off println
          printStream.println(s"Failed to load main class $childMainClass.")
          printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
          // scalastyle:on println
        }
        System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
      case e: NoClassDefFoundError =>
        e.printStackTrace(printStream)
        if (e.getMessage.contains("org/apache/hadoop/hive")) {
          // scalastyle:off println
          printStream.println(s"Failed to load hive class.")
          printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
          // scalastyle:on println
        }
        System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
    }

    // SPARK-4170
    if (classOf[scala.App].isAssignableFrom(mainClass)) {
      printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
    }

    //  get the main method of mainClass
    val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalStateException("The main method in the given main class must be static")
    }

    @tailrec
    def findCause(t: Throwable): Throwable = t match {
      case e: UndeclaredThrowableException =>
        if (e.getCause() != null) findCause(e.getCause()) else e
      case e: InvocationTargetException =>
        if (e.getCause() != null) findCause(e.getCause()) else e
      case e: Throwable =>
        e
    }

    try {
      // invoke the main method of the prepared class, passing in the argument array
      mainMethod.invoke(null, childArgs.toArray)
    } catch {
      case t: Throwable =>
        findCause(t) match {
          case SparkUserAppException(exitCode) =>
            System.exit(exitCode)

          case t: Throwable =>
            throw t
        }
    }
  }

From the above we can see how the prepared child main class is finally invoked. In client mode that is the main method of the program we wrote, com.llcc.sparkSql.text.SparkSQLTest#main(), and our program starts running; in cluster mode the invoked class is the submission client (RestSubmissionClient or Client), which then launches the driver on the cluster.
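To make the reflection step concrete, here is a minimal, self-contained sketch (not Spark source; DemoApp and the "1000" argument are placeholders) of loading a main class by name and invoking it the way runMain does:

import java.lang.reflect.Modifier

// Placeholder standing in for the child main class (the user class in client mode,
// or RestSubmissionClient / Client in cluster mode).
object DemoApp {
  def main(args: Array[String]): Unit =
    println(s"DemoApp started with args: ${args.mkString(", ")}")
}

object ReflectiveLaunch {
  def main(args: Array[String]): Unit = {
    // Load the class by name, as Utils.classForName(childMainClass) does.
    val mainClass = Class.forName("DemoApp")
    // Look up its main(Array[String]) method and check that it is static, mirroring runMain.
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    require(Modifier.isStatic(mainMethod.getModifiers), "The main method must be static")
    // Invoke it with the child arguments, just like mainMethod.invoke(null, childArgs.toArray).
    mainMethod.invoke(null, Array("1000"))
  }
}

From here on, the invoked class takes over: in cluster mode the submission client sends the driver to the cluster, while in client mode the driver (our SparkSQLTest) runs directly inside this spark-submit JVM.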


Reposted from blog.csdn.net/qq_21383435/article/details/80029862