Article 1|Spark Overview

Apache Spark was born in 2009 in the AMPLab at the University of California, Berkeley, and was open sourced in 2010. It is now one of the top-level open source projects under the Apache Software Foundation. Spark's goal is to provide a programming model for fast data analysis: it computes in memory, which reduces IO overhead, and it is written in Scala, which gives it an interactive programming experience. After more than ten years of development, Spark has become a popular big data processing platform, and the latest version is Spark 3.0. This article gives an overview of Spark; follow-up articles will discuss specific details. The main content of this article includes:

  • Analysis of search interest in Spark
  • Features of Spark
  • Some important concepts of Spark
  • Overview of Spark components
  • Overview of Spark operating architecture
  • A first taste of Spark programming

Analysis of search interest in Spark

Overview

The figure below shows the search trends for Spark, Hadoop, and Flink in China over the past year:

[Figure: Spark / Hadoop / Flink search trends in China]

The global search trends for Spark, Hadoop and Flink in the past year are as follows:
[Figure: Spark / Hadoop / Flink global search trends]

Regional distribution of search interest in Spark, Hadoop, and Flink in China over the past year (regions ordered by descending Flink search interest):

[Figure: regional distribution of search interest in China]

Regional distribution of global search interest in Spark, Hadoop, and Flink over the past year (regions ordered by descending Flink search interest):

[Figure: regional distribution of global search interest]

Analysis

The four charts above show that over the past year, both in China and globally, search interest in Spark has consistently been higher than in Hadoop and Flink. Flink has developed rapidly in recent years and is backed by Alibaba in China, and its stream-first design makes it the preferred framework for building streaming applications. Even so, although Flink is very popular in China, it is still not as widely adopted as Spark worldwide, so learning and mastering Spark remains a good choice. The two technologies have much in common; if you have mastered Spark, learning Flink afterwards will feel familiar.

Features of Spark

  • High speed

    Apache Spark uses a DAG scheduler, a query optimizer, and a physical execution engine to deliver high performance for both batch and stream processing.

  • Easy to use

    Spark supports writing applications quickly in Java, Scala, Python, R, and SQL, and it provides more than 80 high-level operators for building parallel applications with ease (a short word-count sketch follows this feature list).

  • Versatility

    Spark provides a rich ecosystem stack that includes components for SQL queries, stream computing, machine learning, and graph computing. These components can be combined seamlessly in a single application, and with a one-stop deployment they can handle a wide variety of complex computing scenarios.

  • Various operating modes

    Spark can run in Standalone mode or in environments such as Hadoop, Apache Mesos, and Kubernetes, and it can access data from many sources, including HDFS, Alluxio, Apache Cassandra, Apache HBase, and Apache Hive.
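
As a rough illustration of the high-level operators mentioned above, here is a minimal word-count sketch in Scala (the input path, application name, and local master URL are placeholders, not part of the original article):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // flatMap, map and reduceByKey are three of the 80+ high-level operators
    val counts = sc.textFile("input.txt")          // placeholder input path
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}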

Some important concepts of Spark

  • RDD

    Resilient Distributed Dataset: an abstraction of distributed memory that provides a highly restricted shared-memory model

  • DAG

    Directed Acyclic Graph, reflecting the dependencies between RDDs

  • Application

    A program written by the user on Spark, consisting of a driver program and executors on the cluster

  • Application jar
    A JAR package containing the user-written application code

  • Driver program
    The process that runs the application's main() function and creates the SparkContext

  • Cluster manager
    An external service for acquiring and allocating cluster resources (such as the standalone cluster manager, Mesos, or YARN)

  • Deploy mode

    The deploy mode determines where the driver process runs. In cluster mode, the framework launches the driver on a machine inside the cluster; in client mode, the driver is started on the machine from which the application is submitted

  • Worker node

    Any node in the cluster that can run application code

  • Executor

    A process launched on a Worker node, responsible for running specific tasks and storing data for the application

  • Task
    A unit of work that is sent to and runs in an executor

  • Job
    A job consists of multiple RDDs and a series of operators applied to them; a job is triggered by an action operation (such as save or collect). A small sketch after this concept list shows how an action triggers a job.

  • Stage
    Each job is divided into stages, and each stage consists of a set of tasks; stages depend on one another
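
To make the Job, Stage, and Task concepts concrete, here is a small sketch (assuming an existing SparkContext named sc, as in spark-shell; the numbers are purely illustrative). Transformations such as filter and map are lazy; only an action triggers a job, and a shuffle operator such as reduceByKey introduces a stage boundary:

val nums = sc.parallelize(1 to 100, 4)   // 4 partitions -> up to 4 tasks per stage

// Transformations are lazy: nothing runs yet
val evens = nums.filter(_ % 2 == 0)
val squared = evens.map(n => n * n)

// The action triggers a job; with no shuffle, it runs as a single stage
val total = squared.reduce(_ + _)
println(total)

// reduceByKey causes a shuffle, so the collect() job below has two stages
val byRemainder = nums.map(n => (n % 3, n)).reduceByKey(_ + _)
byRemainder.collect().foreach(println)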

Overview of Spark components

The Spark ecosystem mainly includes components such as Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX, as shown in the figure below:

[Figure: Spark ecosystem components]

  • Spark Core

    Spark Core is the core of Spark and contains its basic functionality, such as in-memory computing, task scheduling, deployment modes, and storage management. Spark Core provides the RDD-based API on which the other high-level APIs are built, and its main role is batch processing.

  • Spark SQL

    Spark SQL is designed to process structured and semi-structured data. It lets users query structured data in Spark programs using SQL or the DataFrame and Dataset APIs, and it supports the Java, Scala, Python, and R languages. Because the DataFrame API provides a unified way to access a variety of data sources (including Hive, Avro, Parquet, ORC, and JDBC), users can connect to any of these sources in the same way. In addition, Spark SQL can use the Hive metastore and therefore integrates closely with Hive, so users can run Hive workloads directly on Spark. Spark SQL can also be accessed through the spark-sql shell. A minimal DataFrame example follows this component list.

  • Spark Streaming

    Spark Streaming is an important Spark module that provides scalable, high-throughput, fault-tolerant processing of real-time data streams. Internally, it splits the live input stream into a series of micro-batches, which are then processed by the Spark engine. Spark Streaming supports multiple data sources, such as Kafka, Flume, and TCP sockets.

  • MLlib

    MLlib is the machine learning library provided by Spark; users can build machine learning applications with the Spark API. Spark is especially good at iterative computation, where its performance can be as much as 100 times that of Hadoop MapReduce. The library contains common machine learning algorithms for classification, regression, clustering, and collaborative filtering, such as logistic regression, support vector machines, random forests, and principal component analysis.

  • GraphX

    GraphX is Spark's API for graph computation and can be thought of as a rewrite and optimization of Pregel on top of Spark. GraphX performs well, offers rich functionality and operators, and can run complex graph algorithms freely over massive amounts of data. It ships with many built-in graph algorithms, including the well-known PageRank.
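
As a small taste of the Spark SQL component described above, the sketch below loads a JSON file into a DataFrame and queries it with both the DataFrame API and SQL. The file path and column names ("name", "age") are made up for illustration, and SparkSession is the usual entry point in Spark 2.x and later:

import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input file with "name" and "age" columns
    val people = spark.read.json("people.json")

    // DataFrame API
    people.filter(people("age") > 30).select("name").show()

    // Equivalent SQL on a temporary view
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}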

Overview of Spark operating architecture

On the whole, the Spark application architecture includes the following main parts:

  • Driver program
  • Master node
  • Worker node
  • Executor
  • Tasks
  • SparkContext

In Standalone mode, the operating architecture is shown in the figure below:

[Figure: Spark running architecture in Standalone mode]

Driver program

The driver program runs the main() function of the Spark application (and creates the SparkContext or SparkSession). The node that runs the driver process is called the driver node. The driver process communicates with the cluster manager and sends scheduled tasks to the executors.

Cluster Manager

The cluster manager is mainly responsible for managing the cluster's resources. Common cluster managers include YARN, Mesos, and Standalone. The Standalone cluster manager consists of two kinds of long-running background processes, one on the Master node and one on each Worker node. A follow-up article on cluster deployment modes will cover this in detail; for now, a general impression is enough.

Worker node

Friends who are familiar with Hadoop will know that Hadoop has namenode and datanode nodes. Spark is similar: the nodes that run specific tasks are called Worker nodes. A Worker node reports its available resources to the Master node, and usually a worker background process runs on each Worker node to start and monitor executors.

Executor

The Master node allocates resources and creates executors on the cluster's Worker nodes, and the driver uses these executors to distribute and run specific tasks. Each application has its own executor processes, which use multiple threads to execute tasks. Executors are mainly responsible for running tasks and storing data.

Task

A task is a unit of work sent to an executor.

SparkContext

SparkContext is the entry point of a Spark session and is used to connect to a Spark cluster. Before submitting an application, the SparkContext must first be initialized. It encapsulates network communication, the storage system, the computing engine, the web UI, and more. Note that there can be only one active SparkContext per JVM process; to create a new one, you must first call stop() on the existing SparkContext.
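
A minimal sketch of initializing and stopping a SparkContext (the application name and master URL are placeholders); in Spark 2.x and later, a SparkSession wraps the SparkContext and exposes it as spark.sparkContext:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("context-sketch")   // placeholder application name
  .setMaster("local[*]")          // placeholder master URL

val sc = new SparkContext(conf)

// ... build RDDs and run jobs here ...

// Only one SparkContext may be active per JVM; stop it before creating a new one
sc.stop()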

A first taste of Spark programming

Grouping and top-N with Spark Core

Description: There is an order data file, order.txt, on HDFS. Its fields are separated by "," and represent, in order, the order id, commodity id, and transaction amount. Sample data:

Order_00001,Pdt_01,222.8
Order_00001,Pdt_05,25.8
Order_00002,Pdt_03,522.8
Order_00002,Pdt_04,122.4
Order_00002,Pdt_05,722.4
Order_00003,Pdt_01,222.8

Question: Use Spark Core to find, for each order, the id of the commodity with the largest transaction amount

Implementation code

import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}

object TopOrderItemCluster {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("top n order and item")
    val sc = new SparkContext(conf)
    val hctx = new HiveContext(sc)
    // Read the order file (the path follows the original code) and split each line on ","
    val orderData = sc.textFile("data.txt")
    val splitOrderData = orderData.map(_.split(","))
    // Map each record to (orderID, (itemID, total))
    val mapOrderData = splitOrderData.map { arrValue =>
      val orderID = arrValue(0)
      val itemID = arrValue(1)
      val total = arrValue(2).toDouble
      (orderID, (itemID, total))
    }
    // Group the (itemID, total) pairs by order id
    val groupOrderData = mapOrderData.groupByKey()

    /**
      ***groupOrderData.foreach(x => println(x))
      ***(Order_00003,CompactBuffer((Pdt_01,222.8)))
      ***(Order_00002,CompactBuffer((Pdt_03,522.8), (Pdt_04,122.4), (Pdt_05,722.4)))
      ***(Order_00001,CompactBuffer((Pdt_01,222.8), (Pdt_05,25.8)))
      */
   
    // For each order, sort its items by transaction amount in descending order and keep the top one
    val topOrderData = groupOrderData.map(tupleData => {
      val orderid = tupleData._1
      val maxTotal = tupleData._2.toArray.sortWith(_._2 > _._2).take(1)
      (orderid, maxTotal)
    })
    topOrderData.foreach(value =>
      println("Order ID with the largest transaction amount: " + value._1 + ", corresponding commodity ID: " + value._2(0)._1)

      /**
        ***Order ID with the largest transaction amount: Order_00003, corresponding commodity ID: Pdt_01
        ***Order ID with the largest transaction amount: Order_00002, corresponding commodity ID: Pdt_05
        ***Order ID with the largest transaction amount: Order_00001, corresponding commodity ID: Pdt_01
        */
    )
    // Build an RDD of Rows from the per-order results
    val RowOrderData = topOrderData.map(value => Row(value._1, value._2(0)._1))
    // Define the schema
    val structType = StructType(Array(
      StructField("orderid", StringType, false),
      StructField("itemid", StringType, false))
    )
    // Convert to a DataFrame
    val orderDataDF = hctx.createDataFrame(RowOrderData, structType)
    // Write the result into Hive via a temporary table
    orderDataDF.registerTempTable("tmptable")
    hctx.sql("CREATE TABLE IF NOT EXISTS orderid_itemid(orderid STRING,itemid STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
    hctx.sql("INSERT INTO orderid_itemid SELECT * FROM tmptable")

    sc.stop()
  }

}

Package the code above and submit it to the cluster to run; you can then open the Hive CLI or the spark-sql shell to view the resulting data in Hive.
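
As a quick check, here is a sketch assuming a spark-shell session started with Hive support (the table name matches the code above; in Spark 1.x the shell provides sqlContext, while in Spark 2.x and later you would use spark.sql instead):

sqlContext.sql("SELECT * FROM orderid_itemid").show()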

Summary

This article gave an overall introduction to Spark, covering an analysis of Spark's search popularity, its main features, some important concepts, and its running architecture, and it closed with a Spark programming example. This is the first article in the Spark series and is meant to give you a feel for the big picture; the next article will cover the Spark Core programming guide.
