Spark Structured Streaming

https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html


Most streaming engines focus on performing computations on a stream: for example, one can map a stream to run a function on each record, reduce it to aggregate events by time, etc. However, as we worked with users, we found that virtually no use case of streaming engines only involved performing computations on a stream. Instead, stream processing happens as part of a larger application, which we’ll call a continuous application. Here are some examples:

  1. Updating data that will be served in real-time. For instance, developers might want to update a summary table that users will query through a web application. In this case, much of the complexity is in the interaction between the streaming engine and the serving system: for example, can you run queries on the table while the streaming engine is updating it? The “complete” application is a real-time serving system, not a map or reduce on a stream.
  2. Extract, transform and load (ETL). One common use case is continuously moving and transforming data from one storage system to another (e.g. JSON logs to an Apache Hive table). This requires careful interaction with both storage systems to ensure no data is duplicated or lost — much of the logic is in this coordination work.
  3. Creating a real-time version of an existing batch job. This is hard because many streaming systems don’t guarantee their result will match a batch job. For example, we’ve seen companies that built live dashboards using a streaming engine and daily reporting using batch jobs, only to have customers complain that their daily report (or worse, their bill!) did not match the live metrics.
  4. Online machine learning. These continuous applications often combine large static datasets, processed using batch jobs, with real-time data and live prediction serving.

Continuous Applications

We define a continuous application as an end-to-end application that reacts to data in real-time. In particular, we’d like developers to use a single programming interface to support the facets of continuous applications that are currently handled in separate systems, such as query serving or interaction with batch jobs. For example, here is how we would handle the use cases above:

  1. Updating data that will be served in real time. The developer would write a single Spark application that handles both updates and serving (e.g. through Spark’s JDBC server), or would use an API that automatically performs transactional updates on a serving system like MySQL, Redis or Apache Cassandra.
  2. Extract, transform and load (ETL). The developer would simply list the transformations required as in a batch job, and the streaming system would handle coordination with both storage systems to ensure exactly-once processing.
  3. Creating a real-time version of an existing batch job. The streaming system would guarantee results are always consistent with a batch job on the same data.
  4. Online machine learning. The machine learning library would be designed to combine real-time training, periodic batch training, and prediction serving behind the same API.
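
To make items 2 and 3 more concrete, here is a minimal sketch in plain Python, not the Spark API. It assumes a hypothetical `IdempotentSink` that records the last committed batch id alongside its data, and a hypothetical `batch_job` function that is reused unchanged for every micro-batch. All names and the sample events are illustrative assumptions; a real engine would also persist source offsets and handle state recovery.

```python
from collections import Counter


class IdempotentSink:
    """Hypothetical serving store that remembers the last committed batch id."""

    def __init__(self):
        self.counts = Counter()
        self.last_batch_id = -1

    def commit(self, batch_id, updates):
        # A replayed batch (for example after a lost acknowledgement) is
        # skipped, so retries never double count.
        if batch_id <= self.last_batch_id:
            return False
        self.counts.update(updates)
        self.last_batch_id = batch_id
        return True


def batch_job(events):
    """The transformation as it would be written for a batch job."""
    return Counter(event["action"] for event in events)


def run_stream(micro_batches, sink, replay=()):
    """Apply the same batch_job to each micro-batch and commit the result."""
    for batch_id, events in enumerate(micro_batches):
        updates = batch_job(events)
        sink.commit(batch_id, updates)
        if batch_id in replay:
            # Simulate a retry of a batch that was already committed.
            sink.commit(batch_id, updates)


# Illustrative input data (assumed, not from the original post).
events = [
    {"action": "open"},
    {"action": "close"},
    {"action": "open"},
    {"action": "open"},
    {"action": "close"},
]
micro_batches = [events[0:2], events[2:3], events[3:5]]

sink = IdempotentSink()
run_stream(micro_batches, sink, replay={1})

# The streaming result matches the batch job run over the same data.
assert sink.counts == batch_job(events)
```

The design choice illustrated here is that the developer only writes `batch_job`; deduplicating retries and keeping the result consistent with a batch run are pushed into the engine and the sink.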


Reposted from blog.csdn.net/qq_15300683/article/details/80653748