【Spark】Principles

Reprinted from: http://www.cnblogs.com/tgzhu/p/5818374.html

Apache Spark is a big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed by the AMPLab at the University of California, Berkeley in 2009 and became an Apache open-source project in 2010, joining the likes of Hadoop and Storm. Compared with other big data and MapReduce technologies, Spark has the following advantages:

  • Spark provides a comprehensive, unified framework for big data processing across data sets with different properties (text data, graph data, etc.) and data sources of different kinds (batch data or real-time streaming data).
  • According to official figures, Spark can run applications in a Hadoop cluster up to 100 times faster in memory, and up to 10 times faster even when computing on disk.

 Topics:

  • Architecture and Ecosystem
  • Spark and Hadoop
  • Operation process and characteristics
  • Common terms
  • standalone mode
  • yarn cluster
  • RDD running process

Architecture and Ecosystem:


  • Usually, when the amount of data to be processed exceeds what a single machine can handle (for example, a machine with 4 GB of memory that has to process more than 100 GB of data), we can turn to a Spark cluster for the computation. Sometimes the data volume is not that large, but the computation is complex and time-consuming; in that case the cluster's powerful computing resources can also be used to run it in parallel. The architecture schematic is as follows:
  • Spark Core: contains the basic functionality of Spark; in particular, it defines the RDD abstraction and the operations and actions on RDDs (a minimal sketch follows this list). The other Spark libraries are built on top of RDDs and Spark Core.
  • Spark SQL: Provides an API to interact with Spark through Hive Query Language (HiveQL), the SQL variant of Apache Hive. Each database table is treated as an RDD, and Spark SQL queries are translated into Spark operations.
  • Spark Streaming: Process and control real-time data streams. Spark Streaming allows programs to process real-time data like normal RDDs
  • MLlib: A library of common machine learning algorithms implemented as Spark operations on RDDs. This library contains scalable learning algorithms such as classification, regression, etc. that require iteration over large datasets.
  • GraphX: a collection of algorithms and tools for manipulating graphs and performing parallel graph operations and computations. GraphX extends the RDD API with operations for manipulating graphs, creating subgraphs, and accessing all vertices on a path.
  • The composition diagram of the Spark architecture is as follows:
  • Cluster Manager: in Standalone mode this is the Master node, which controls the whole cluster and monitors the Workers; in YARN mode it is the ResourceManager.
  • Worker node: a slave node, responsible for controlling the compute node and starting the Executor or the Driver.
  • Driver: runs the main() function of the Application.
  • Executor: a process launched for an Application on a Worker node; it runs the Application's Tasks.
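
To make the RDD-centric design of Spark Core concrete, here is a minimal sketch (not from the original article) of the distinction between transformations and actions, assuming a spark-shell session in which `sc` (a SparkContext) is already provided:

```scala
// Transformations are lazy: they only describe new RDDs.
val numbers = sc.parallelize(1 to 100)
val squares = numbers.map(n => n * n)      // transformation
val evens   = squares.filter(_ % 2 == 0)   // transformation

// An action triggers the actual computation and returns a result to the driver.
println(evens.count())
```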

Spark and Hadoop:


  • Hadoop has two core modules: the distributed storage module HDFS and the distributed computing module MapReduce.
  • Spark itself does not provide a distributed file system, so much of Spark's analysis relies on Hadoop's distributed file system, HDFS (a minimal read/write sketch follows this list).
  • Both Hadoop's MapReduce and Spark can perform data computation; compared with MapReduce, Spark is faster and provides richer functionality.
  • The relationship diagram is as follows:
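
As a concrete illustration of the HDFS dependency, a minimal sketch (again assuming a spark-shell session with `sc`; the namenode address and paths are placeholders):

```scala
// Read a text file from HDFS; Spark delegates storage to Hadoop's HDFS client.
// "namenode:9000" and the paths below are placeholders for a real cluster.
val logs = sc.textFile("hdfs://namenode:9000/data/access.log")

// Results can be written back to HDFS as well.
logs.filter(_.contains("ERROR")).saveAsTextFile("hdfs://namenode:9000/data/errors")
```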

 Operation process and characteristics:


  • The Spark running flow chart is as follows (a minimal application sketch, with the steps marked as comments, follows this list):
  1. Build the running environment of the Spark Application and start the SparkContext.
  2. The SparkContext applies to the resource manager (which can be Standalone, Mesos, or YARN) for Executor resources, and the resource manager starts the StandaloneExecutorBackend.
  3. The Executors request Tasks from the SparkContext.
  4. The SparkContext distributes the application code to the Executors.
  5. The SparkContext builds a DAG graph, decomposes the DAG into Stages, and sends the TaskSets to the Task Scheduler, which finally sends the Tasks to the Executors to run.
  6. Tasks run on the Executors and release all resources when they finish.
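
A minimal sketch of a driver program (not from the original article) with the steps above marked as comments; local mode is used only to keep the example self-contained:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RunningFlow {
  def main(args: Array[String]): Unit = {
    // Step 1: build the running environment and start the SparkContext.
    // On a real cluster the master URL would point at Standalone, Mesos, or YARN.
    val conf = new SparkConf().setAppName("running-flow").setMaster("local[*]")
    val sc = new SparkContext(conf)   // steps 2-4 (resource request, Executor start,
                                      // code distribution) happen behind this call

    // Step 5: this action makes the SparkContext build a DAG, split it into
    // Stages, and hand TaskSets to the Task Scheduler.
    val count = sc.parallelize(Seq("a", "b", "c")).count()
    println(count)

    // Step 6: all Tasks have finished; stop the context and release resources.
    sc.stop()
  }
}
```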

     Spark running features:

  1. Each Application gets its own dedicated Executor processes, which stay up for the whole lifetime of the Application and run Tasks in a multi-threaded manner. This Application isolation mechanism is advantageous both from a scheduling perspective (each Driver schedules its own Tasks) and from an execution perspective (Tasks from different Applications run in different JVMs). It also means, of course, that data cannot be shared across Spark Applications unless it is written to an external storage system.
  2. Spark is agnostic to the resource manager: all it needs is to acquire Executor processes and keep communicating with them.
  3. The Client that submits the SparkContext should be close to the Worker nodes (the nodes running the Executors), preferably in the same rack, because there is a lot of communication between the SparkContext and the Executors while the Spark Application runs.
  4. Tasks use data locality and speculative execution as optimization mechanisms (see the configuration sketch below).
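
Both optimizations are tunable. A minimal configuration sketch, assuming the standard Spark configuration keys spark.locality.wait and spark.speculation (the values shown are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

// Illustrative settings only; the defaults are usually sensible.
val conf = new SparkConf()
  .setAppName("locality-and-speculation")
  .set("spark.locality.wait", "3s")            // how long to wait for a data-local slot
                                               // before falling back to a less local one
  .set("spark.speculation", "true")            // re-launch suspiciously slow tasks elsewhere
  .set("spark.speculation.multiplier", "1.5")  // "slow" = 1.5x the median task duration
```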

Common terms:


  • Application: refers to a user-written Spark application, which consists of the code of a Driver program and the Executor code distributed across multiple nodes in the cluster.
  • Driver: the Driver in Spark runs the main function of the Application and creates the SparkContext. The purpose of creating the SparkContext is to prepare the running environment of the Spark application: in Spark, the SparkContext is responsible for communicating with the Cluster Manager for resource requests, task allocation, monitoring, and so on. When the Executor side has finished running, the Driver is also responsible for closing the SparkContext. The SparkContext is usually used to represent the Driver.
  • Executor: a process that runs on a Worker node on behalf of an Application. It is responsible for running certain Tasks and for storing data in memory or on disk. Each Application has its own set of Executors. In Spark on YARN mode the process is named CoarseGrainedExecutorBackend. A CoarseGrainedExecutorBackend has one and only one Executor object, which wraps each Task in a TaskRunner and takes an idle thread from the thread pool to run it. The number of Tasks each CoarseGrainedExecutorBackend can run in parallel depends on the number of CPU cores allocated to it.
  • Cluster Manager: refers to an external service that acquires resources on the cluster. There are currently three types:
    1. Standalone: Spark's native resource management, in which the Master is responsible for resource allocation.
    2. Apache Mesos: a resource scheduling framework with good compatibility with Hadoop MapReduce.
    3. Hadoop Yarn: mainly refers to the ResourceManager in Yarn
  • Worker: any node in the cluster that can run Application code. In Standalone mode it refers to the Worker nodes configured through the slaves file; in Spark on YARN mode it is the NodeManager node.
  • Task: a unit of work that is sent to an Executor, analogous to the MapTask and ReduceTask concepts in Hadoop MapReduce. It is the basic unit of execution of an Application. Multiple Tasks form a Stage, and the TaskScheduler is responsible for scheduling and managing Tasks.
  • Job: a parallel computation consisting of multiple Tasks, usually triggered by a Spark Action; an Application often generates multiple Jobs.
  • Stage: each Job is split into multiple sets of Tasks; each set is a TaskSet, also called a Stage. The division and scheduling of Stages is the responsibility of the DAGScheduler. There are two kinds of Stage: non-final Stages (Shuffle Map Stages) and the final Stage (Result Stage). The boundary of a Stage is wherever a shuffle occurs.
  • DAGScheduler: builds the Stage-based DAG (Directed Acyclic Graph) for a Job and submits the Stages to the TaskScheduler. Stages are divided according to the dependencies between RDDs, looking for the scheduling scheme with the least overhead, as shown in the figure below.
  • TaskScheduler: submits TaskSets to the Workers to run; which Tasks each Executor runs is decided here. The TaskScheduler maintains all TaskSets. When an Executor heartbeats to the Driver, the TaskScheduler allocates Tasks according to the remaining resources. It also tracks the running state of all Tasks and retries Tasks that fail. The figure below shows the role of the TaskScheduler.
  • The task scheduler in different operating modes is as follows:
    1. Spark on Standalone mode is TaskScheduler
    2. YARN-Client mode is YarnClientClusterScheduler
    3. YARN-Cluster mode is YarnClusterScheduler
  • The run-level diagram tying these terms together is as follows:
  • Job = multiple Stages, Stage = multiple Tasks of the same kind. Tasks are divided into ShuffleMapTasks and ResultTasks, and Dependencies are divided into ShuffleDependencies and NarrowDependencies (a code sketch of these relationships follows).
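
As a concrete illustration of these relationships, a sketch (assuming a spark-shell session with `sc` available): one action produces one Job; the shuffle introduced by reduceByKey splits it into two Stages; and each Stage runs as many Tasks as its RDD has partitions.

```scala
// Two partitions -> two Tasks per Stage in this tiny example.
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 2)

// Narrow dependency: map stays in the same Stage as parallelize.
val pairs = words.map(w => (w, 1))

// Shuffle (wide) dependency: reduceByKey marks a Stage boundary.
val counts = pairs.reduceByKey(_ + _)

// The action triggers one Job made of two Stages:
//   Stage 0 (ShuffleMapStage, runs ShuffleMapTasks): parallelize + map + map-side shuffle
//   Stage 1 (ResultStage, runs ResultTasks):         reduce-side shuffle + collect
counts.collect().foreach(println)
```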

Spark operating modes:


  • Spark has a variety of flexible operating modes. Deployed on a single machine, it can run either in local mode or in pseudo-distributed mode; deployed on a distributed cluster, there are also many operating modes to choose from, depending on the actual situation of the cluster. The underlying resource scheduling can either rely on an external resource scheduling framework or use Spark's built-in Standalone mode.
  • As for external resource scheduling frameworks, the current implementations include the relatively stable Mesos mode and the Hadoop YARN mode.
  • Local mode: commonly used for local development and testing; it comes in local and local-cluster variants.

standalone: Standalone cluster operation mode


  • Standalone mode uses Spark's own resource scheduling framework
  • It adopts the typical Master/Slaves architecture and uses ZooKeeper to provide HA for the Master.
  • The frame structure diagram is as follows:
  • The main nodes in this mode are the Client node, the Master node, and the Worker nodes. The Driver can run either on the Master node or on the local Client. When a Spark job is submitted with the spark-shell interactive tool, the Driver runs on the Master node; when a job is submitted with the spark-submit tool, or run with new SparkConf().setMaster("spark://master:7077"), the Driver runs on the local Client (see the configuration sketch after the steps below).
  • The running process is as follows: ( refer to: http://blog.csdn.net/gamer_gyt/article/details/51833681 )
  1. The SparkContext connects to the Master, registers with it, and applies for resources (CPU cores and memory).
  2. The Master decides on which Workers to allocate resources, based on the SparkContext's resource request and the information reported in the Workers' heartbeat cycles, acquires the resources on those Workers, and then starts the StandaloneExecutorBackend.
  3. The StandaloneExecutorBackend registers with the SparkContext.
  4. The SparkContext sends the Application code to the StandaloneExecutorBackend. It also parses the Application code, builds the DAG graph, and submits it to the DAGScheduler, which decomposes it into Stages (when an Action operation is encountered, a Job is spawned; each Job contains one or more Stages, and a Stage is generally created before external data is read or a shuffle occurs). The Stages (i.e. TaskSets) are then submitted to the TaskScheduler, which assigns Tasks to the corresponding Workers and finally hands them to the StandaloneExecutorBackend for execution.
  5. The StandaloneExecutorBackend builds an Executor thread pool, starts executing Tasks, and reports to the SparkContext until the Tasks are complete.
  6. After all Tasks are complete, the SparkContext unregisters from the Master and releases the resources.
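
A minimal sketch of the configuration part of a driver program for this mode (the spark://master:7077 URL comes from the text above; the host name and resource values are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Connect the Driver to a standalone Master; "master" and the resource
// values below are placeholders.
val conf = new SparkConf()
  .setAppName("standalone-app")
  .setMaster("spark://master:7077")     // the Master allocates resources (steps 1-2)
  .set("spark.executor.memory", "2g")   // memory requested per Executor
  .set("spark.cores.max", "4")          // total CPU cores for this Application
val sc = new SparkContext(conf)         // StandaloneExecutorBackends start and register (step 3)
```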

yarn:  (Reference: http://blog.csdn.net/gamer_gyt/article/details/51833681)


  • Spark on YARN is divided into two modes according to where the Driver runs in the cluster: YARN-Client mode and YARN-Cluster (also called YARN-Standalone) mode.
  • In YARN-Client mode, the Driver runs locally on the client, which allows the Spark Application to interact with the client. Because the Driver is on the client, its status can be viewed through the web UI, by default at http://hadoop1:4040, while YARN itself is accessed via http://hadoop1:8088.
  • The workflow of YARN-Client consists of the following steps (a minimal configuration sketch follows the list):
  • The Spark YARN Client applies to YARN's ResourceManager to start the ApplicationMaster. At the same time, the DAGScheduler and TaskScheduler are created during SparkContext initialization; since YARN-Client mode is chosen, the program selects YarnClientClusterScheduler and YarnClientSchedulerBackend.
  • After the ResourceManager receives the request, it selects a NodeManager in the cluster, allocates the first Container to the application, and asks it to start the application's ApplicationMaster in that Container. The difference from YARN-Cluster is that this ApplicationMaster does not run the SparkContext; it only contacts the SparkContext for resource allocation.
  • After the SparkContext in the Client is initialized, it establishes communication with the ApplicationMaster, registers with the ResourceManager, and applies for resources (Container) to the ResourceManager according to the task information.
  • Once the ApplicationMaster has obtained the resources (Containers), it communicates with the corresponding NodeManagers and asks them to start the CoarseGrainedExecutorBackend in the obtained Containers. After starting, the CoarseGrainedExecutorBackend registers with the SparkContext in the Client and applies for Tasks.
  • The SparkContext in the Client assigns Tasks to the CoarseGrainedExecutorBackend for execution. The CoarseGrainedExecutorBackend runs the Tasks and reports their status and progress to the Driver, so that the Client can track the status of each Task and restart it if it fails.
  • After the application completes, the Client's SparkContext unregisters from the ResourceManager and shuts itself down.
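
A minimal configuration sketch for a YARN-Client submission (a sketch only: in practice spark-submit --master yarn --deploy-mode client is the usual route; the "yarn" master string and the spark.submit.deployMode key follow newer Spark versions, HADOOP_CONF_DIR/YARN_CONF_DIR must point at the cluster configuration, and the values are illustrative):

```scala
import org.apache.spark.SparkConf

// YARN-Client: the Driver stays on the submitting machine, while the
// Executors run inside YARN Containers. Values below are illustrative.
val conf = new SparkConf()
  .setAppName("yarn-client-app")
  .setMaster("yarn")                         // the ResourceManager is found via the Hadoop config
  .set("spark.submit.deployMode", "client")  // Driver on the client, as described above
  .set("spark.executor.instances", "2")      // number of CoarseGrainedExecutorBackend processes
  .set("spark.executor.memory", "2g")
```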

Spark Cluster mode:

  • In YARN-Cluster mode, after a user submits an application to YARN, YARN will run the application in two stages:
    1. The first stage is to start Spark's Driver as an ApplicationMaster in the YARN cluster;
    2. The second stage is for the ApplicationMaster to create the application, apply to the ResourceManager for resources on its behalf, and start the Executors to run the Tasks, while monitoring the entire run until it completes.
  • The workflow of YARN-Cluster consists of the following steps:
  • Spark Yarn Client submits applications to YARN, including ApplicationMaster programs, commands to start ApplicationMaster, programs that need to be run in Executor, etc.
  • After the ResourceManager receives the request, it selects a NodeManager in the cluster, assigns the first Container to the application, and asks it to start the ApplicationMaster of the application in this Container, where the ApplicationMaster initializes SparkContext, etc.
  • The ApplicationMaster registers with the ResourceManager so that the user can view the application's running status directly through the ResourceManager; it then applies for resources for the tasks through the RPC protocol in a polling manner and monitors their status until the run finishes.
  • Once the ApplicationMaster has obtained the resources (Containers), it communicates with the corresponding NodeManagers and asks them to start the CoarseGrainedExecutorBackend in the obtained Containers. After starting, the CoarseGrainedExecutorBackend registers with the SparkContext in the ApplicationMaster and applies for Tasks. This is the same as in Standalone mode, except that when the SparkContext is initialized in the Spark Application, CoarseGrainedSchedulerBackend is used together with YarnClusterScheduler to schedule tasks; YarnClusterScheduler is just a thin wrapper around TaskSchedulerImpl that adds waiting logic for Executors, and so on.
  • The SparkContext in the ApplicationMaster assigns Tasks to the CoarseGrainedExecutorBackend for execution. The CoarseGrainedExecutorBackend runs the Tasks and reports their status and progress to the ApplicationMaster, so that the ApplicationMaster can track the status of each Task and restart it if it fails.
  • After the application completes, the ApplicationMaster unregisters from the ResourceManager and shuts itself down.

Differences between Spark Client and Spark Cluster:

  • Before going into the deeper difference between YARN-Client and YARN-Cluster, let's clarify one concept: the ApplicationMaster. In YARN, each Application instance has an ApplicationMaster process, which runs in the first Container started for that Application. It is responsible for dealing with the ResourceManager to request resources, and, after obtaining them, for telling the NodeManagers to start Containers on its behalf. At a deeper level, the difference between YARN-Cluster and YARN-Client mode is really a difference in the ApplicationMaster process.
  • In YARN-Cluster mode, the Driver runs inside the AM (ApplicationMaster), which is responsible for applying to YARN for resources and supervising the running status of the job. After the user submits the job, the client can be closed and the job continues running on YARN; therefore YARN-Cluster mode is not suitable for interactive jobs.
  • In YARN-Client mode, the ApplicationMaster only requests Executors from YARN, and the Client communicates with the requested Containers to schedule their work, which means the Client cannot go away.

Think: Which mode do we use when submitting jobs with Spark?

 

RDD running process:


  • Running an RDD in Spark roughly involves the following three steps:
  1. Create the RDD objects.
  2. The DAGScheduler module steps in to compute the dependencies between the RDDs; these dependencies form a DAG.
  3. Each Job is divided into multiple Stages. One main criterion for dividing Stages is whether the input of the current computation is deterministic: if it is, the computation is placed in the same Stage, avoiding the message-passing overhead between multiple Stages.
  • The example diagram is as follows:
  • Let's take the example of classifying names by their initial letters (A-Z) and finding the number of distinct names under each initial, to see how an RDD runs (a code sketch of this example follows the list).
  • Creating RDDs: in the example above, apart from the final collect, which is an Action and does not create an RDD, the first four transformations each create a new RDD. So the first step is to create those RDDs (five pieces of information in total).
  • Creating an execution plan: Spark pipelines as much as possible and divides Stages based on whether the data needs to be reorganized; for example, the groupBy() transformation in this example splits the whole execution plan into a two-stage execution. Eventually a DAG (directed acyclic graph) is generated as the logical execution plan.
  • Scheduling Tasks: each Stage is divided into different Tasks, each of which is a combination of data and computation. All Tasks in the current Stage must complete before the next Stage starts, because the first transformation of the next Stage reorganizes the data and therefore has to wait until all result data of the current Stage has been computed.
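
A sketch of this running example in code (assuming a spark-shell session with `sc`; the HDFS path is a placeholder): four operations build the RDD lineage, collect is the action that triggers execution, and groupByKey is the reorganization point that splits the plan into two Stages.

```scala
// Names grouped by their first letter, with the number of distinct names
// per letter. The input path is a placeholder.
val names = sc.textFile("hdfs://namenode:9000/data/names.txt")

val counts = names
  .map(name => (name.charAt(0), name))  // transformation: (initial, name) pairs
  .groupByKey()                         // transformation: reorganizes data -> Stage boundary
  .mapValues(ns => ns.toSet.size)       // transformation: distinct names per initial

counts.collect().foreach(println)       // action: triggers the two-Stage execution
```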
