
Introduction to Spark Streaming
● Official website: http://spark.apache.org/streaming/
● Overview
Spark Streaming is a real-time computing framework built on top of Spark Core.
Features
Easy to use: you can write streaming programs the same way you write offline batch jobs, in Java, Scala, or Python.
Fault tolerant: Spark Streaming recovers lost work without any additional code or configuration.
Easy integration into the Spark ecosystem: stream processing can be combined with batch processing and interactive queries.
Position in the architecture
Spark Streaming is the real-time computing module within the overall big data computing stack.
Spark Streaming principle
In Spark Streaming, a receiver component (Receiver) runs as a long-running task on an Executor. The Receiver ingests the external data stream and forms an input DStream.
The DStream is divided into batches of RDDs according to a user-defined time interval.
Business code written against the DStream actually operates on the underlying RDDs: the code is executed once for each RDD, that is, once per batch.
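The following is a minimal sketch of this flow in Scala; the socket source on localhost:9999 and the 5-second batch interval are assumptions for illustration, not prescriptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: at least two threads, one for the Receiver and one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("WordCountSketch")
    // The batch interval (5 s here) controls how the input DStream is cut into RDDs
    val ssc = new StreamingContext(conf, Seconds(5))

    // socketTextStream starts a Receiver as a long-running task on an Executor
    val lines = ssc.socketTextStream("localhost", 9999)

    // Operations written against the DStream are applied to each batch's RDD
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // an Output Operation: this is what triggers the computation

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // run until stopped
  }
}
```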
Data abstraction in Streaming
DStream (Discretized Stream): the continuous input data stream, and the output streams produced by applying Spark operators to it, are essentially a series of time-continuous RDDs.
Spark Streaming is therefore quasi-real-time (near-real-time) rather than truly real-time computation; latency within a few seconds (for example, 5 s) is considered acceptable.
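To make the "series of time-continuous RDDs" point concrete, the sketch below (same assumed socket source as above) uses foreachRDD, which hands the program the RDD that was generated for each batch interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsRdds {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamAsRdds")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD exposes the underlying RDD of each batch, one per interval:
    // the DStream is literally a time-indexed sequence of RDDs
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```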

Spark Core
Spark supports a variety of resource-scheduling frameworks and is built on in-memory computing. It provides DAG-based execution management, and RDD lineage guarantees fast, highly fault-tolerant computation. The RDD is the core concept of Spark.
Spark SQL
Spark SQL optimizes SQL queries on top of Spark Core: it converts SQL queries into the corresponding RDDs (via DataFrames) and optimizes them, which simplifies development and improves the efficiency of data cleaning (a sketch appears after this overview).
Spark Streaming
Spark Streaming is a stream-processing framework implemented on top of Spark Core. It achieves stream processing (DStream) through micro-batching, which can bring data latency down to as low as roughly 500 ms, making it a high-throughput, highly fault-tolerant stream-processing framework.
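As a rough illustration of the Spark SQL layer described above (the sample data and table name are made up for this sketch), a query over a temporary view is optimized by Spark and ultimately executed as RDD operations:

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("SqlSketch").getOrCreate()
    import spark.implicits._

    // A DataFrame is the schema-aware, optimizable layer on top of RDDs
    val df = Seq(("alice", 3), ("bob", 5)).toDF("name", "score")
    df.createOrReplaceTempView("scores")

    // The SQL query is optimized and then executed as RDD operations
    spark.sql("SELECT name FROM scores WHERE score > 4").show()

    spark.stop()
  }
}
```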

DStream related operations:
1. Data input: Receiver
2. Data transformation: Transformations
2.1 Stateless: each batch is processed independently of data from previous batches
2.2 Stateful: processing the current batch requires data or intermediate results from previous batches
2.2.1 UpdateStateByKey(func)
2.2.2 Window Operations
3. Data output: Output Operations / Actions
The actual computation of a Spark Streaming program only begins when an Output Operation is invoked.
Transformations
Common Transformations (stateless): the processing of each batch does not depend on data from previous batches.
Special Transformations (stateful): processing the current batch requires data or intermediate results from previous batches.
These include state-tracking transformations (updateStateByKey) and sliding-window transformations; a sketch of both follows the list below.
1. UpdateStateByKey(func)
2. Window Operations (windowing)
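The sketch below shows both stateful patterns, assuming the same socket source as earlier; the local checkpoint directory is an arbitrary example (checkpointing is required by updateStateByKey), and the 30 s window / 10 s slide are illustrative values that must be multiples of the batch interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StatefulSketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/streaming-checkpoint") // required by updateStateByKey

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))

    // 1. updateStateByKey: keeps a running total per key across all batches
    val totals = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))
    }
    totals.print()

    // 2. Window operation: counts over the last 30 s, recomputed every 10 s
    val windowed = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowed.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```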
Output / Action
Output Operations write a DStream's data out to an external database or file system.
Only when an Output Operation is invoked does the Spark Streaming program begin the actual computation, analogous to actions on RDDs; a sketch is shown below.
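A sketch of typical Output Operations follows, again assuming the socket source from earlier; the /tmp output prefix is arbitrary, and the println inside foreachRDD stands in for a real database write (JDBC, HBase, and so on).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OutputSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("OutputSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Output Operations trigger execution, much like actions on RDDs:
    counts.print()                           // print a few elements of each batch
    counts.saveAsTextFiles("/tmp/wc", "txt") // one output directory per batch

    // For external databases, foreachRDD gives full access to each batch's RDD;
    // in real code, open one connection per partition rather than per record
    counts.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { case (word, count) =>
          println(s"$word -> $count") // a real sink call would go here
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```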
