Spark Kernel Analysis (1): Overview of Core Principles

The Spark kernel generally refers to the core running mechanisms of Spark: how the core components operate, how tasks are scheduled, how memory is managed, and the operating principles of the core functions. A solid grasp of the Spark kernel helps us design better Spark code and accurately pinpoint the crux of problems that arise while a project is running.

1. Spark core components

1.1 Driver

The Spark Driver is the node that executes the main method of a Spark job and is responsible for actually running the user's code.

During the execution of a Spark job, the Driver is mainly responsible for the following (a minimal driver sketch follows this list):

(1) Converting the user program into jobs
(2) Scheduling tasks among Executors
(3) Tracking the execution of Executors
(4) Displaying the running status of queries through the UI
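
As a concrete illustration, here is a minimal driver sketch in Scala (the object name and input path are hypothetical, chosen only for the example). The main method is what the Driver process executes; the action at the end turns the user program into a job that the Driver schedules on Executors:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of a driver program. The Driver runs this main method,
// converts the action below into a job, schedules its tasks on Executors,
// and tracks their execution.
object WordCountDriver {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountDriver")
      .getOrCreate()

    // Transformations are lazy: they only describe the computation.
    val counts = spark.sparkContext
      .textFile("hdfs:///tmp/input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The action triggers a job; its progress is visible in the UI.
    println(counts.count())

    spark.stop()
  }
}
```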

1.2 Executor

A Spark Executor node is a JVM process responsible for running the concrete tasks of a Spark job; the tasks are independent of one another. Executor nodes are started together with the Spark application and live for the application's entire life cycle. If an Executor node fails or crashes, the Spark application can continue to execute: the tasks on the failed node are rescheduled onto other Executor nodes.

Executor has two core functions:

(1) Running the tasks that make up the Spark application and returning the results to the Driver process;

(2) Providing in-memory storage, through its own Block Manager, for RDDs that the user program asks to cache. Because RDDs are cached directly inside the Executor process, tasks can make full use of the cached data to speed up computation at runtime.
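
A minimal sketch of how a user program requests caching (the data and sizes are illustrative, not from the original article); the cached partitions are held in memory by each Executor's Block Manager:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch: cache an RDD so later actions reuse data kept in Executor memory
// instead of recomputing the whole lineage.
val spark = SparkSession.builder().appName("CacheSketch").getOrCreate()
val numbers = spark.sparkContext.parallelize(1 to 1000000)

// Ask the Executors' Block Managers to keep the partitions in memory.
numbers.persist(StorageLevel.MEMORY_ONLY)

println(numbers.count())  // first action computes and caches the partitions
println(numbers.sum())    // second action reads from the Executor-side cache
```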

1.3 Master

The Master is Spark's cluster manager, mainly responsible for allocating and managing the resources of the entire cluster.

In YARN deployment mode the Cluster Manager is the ResourceManager; in Mesos deployment mode it is the Mesos Master; in Standalone deployment mode it is the Master.

The resources allocated by the Cluster Manager belong to the first-level allocation: it allocates resources such as memory and CPU on each Worker to the Application, but it is not responsible for the finer-grained allocation of resources within Executors.
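
To make the first-level allocation concrete, here is a hedged sketch of the resource requests a user might configure (all values are illustrative); the Cluster Manager uses these settings to decide how much Worker memory and CPU to grant the Application:

```scala
import org.apache.spark.SparkConf

// Sketch: resource requests that drive the first-level allocation.
// The numbers are examples, not recommendations.
val conf = new SparkConf()
  .setAppName("ResourceRequestSketch")
  .setMaster("yarn")                     // or "spark://host:7077" (Standalone), "mesos://host:5050"
  .set("spark.executor.instances", "4")  // how many Executors to launch
  .set("spark.executor.memory", "2g")    // memory granted to each Executor
  .set("spark.executor.cores", "2")      // CPU cores granted to each Executor
```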

1.4 Worker

The Spark worker node. In YARN deployment mode its role is effectively filled by the NodeManager.

Mainly responsible for the following tasks:

(1) Informing the Cluster Manager of its own memory, CPU, and other resources through the registration mechanism
(2) Creating Executor processes
(3) Further allocating resources and tasks to Executors
(4) Synchronizing resource information and Executor status information to the Cluster Manager

1.5 Application

An application written by the user using the API provided by Spark.

Mainly responsible for the following tasks (a submission sketch follows this list):

(1) The Application performs RDD transformations and builds the DAG through the Spark API, and registers itself with the Cluster Manager through the Driver.
(2) According to the Application's resource requirements, the Cluster Manager allocates Executors, memory, CPU, and other resources to the Application through the first-level allocation.
(3) The Driver allocates Executors and other resources to individual tasks through the secondary allocation, and the Application finally runs tasks on the Executors through the Driver.
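
As a hedged sketch of this life cycle from the user's side, an Application can be submitted programmatically with Spark's SparkLauncher API (the jar path and main class below are hypothetical); the registration and the two levels of allocation then proceed as described above:

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Sketch: submit an Application. The Driver it starts registers with the
// Cluster Manager, which performs the first-level resource allocation.
val handle: SparkAppHandle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")           // hypothetical jar path
  .setMainClass("com.example.WordCountDriver")  // hypothetical main class
  .setMaster("yarn")
  .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")
  .startApplication()

println(handle.getState)  // track the Application's state as it runs
```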

2. Spark general running process

[Figure 1-1: The general running process of Spark]
Figure 1-1 shows the general running process of Spark. Regardless of the deployment mode, once a task is submitted the Driver process is started, and the Driver registers the application with the cluster manager. The cluster manager then allocates and starts Executors according to the task's configuration file. Once all the resources the Driver requires are satisfied, the Driver begins executing the main function. Spark queries are lazily executed: only when execution reaches an action operator does the computation start, working backwards through the lineage. Stages are divided at wide dependencies, each stage corresponds to a taskset, and a taskset contains multiple tasks. Following the locality principle, each task is dispatched to a specific Executor for execution. While tasks run, the Executor keeps communicating with the Driver to report task status.
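
A small sketch of the lazy execution and stage division described above (the data is illustrative): reduceByKey introduces a wide dependency, so the job is split into two stages, which toDebugString makes visible before any computation runs:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: lazy execution and stage division at a wide dependency.
val spark = SparkSession.builder().appName("StageSketch").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq("a", "b", "a", "c"))
  .map(word => (word, 1))   // narrow dependency: stays within a stage
  .reduceByKey(_ + _)       // wide dependency: shuffle, new stage

// Nothing has executed yet; the transformations only built the lineage.
println(pairs.toDebugString) // the shuffle boundary marks the stage split

// The action triggers computation, working backwards through the DAG.
pairs.collect().foreach(println)
```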
