Apache Paimon Streaming Data Lake 0.4 and Future Prospects

Abstract: This article is compiled from a talk at the Apache Paimon Meetup by Li Jinsong (Zhixin), head of Alibaba Cloud's open-source big data table storage team, Alibaba senior technical expert, Apache Flink PMC member, and Paimon PPMC member. The content is divided into four parts:

  1. Challenges in lake storage
  2. Deep dive into Apache Paimon 0.4
  3. Community application practice
  4. Future plans


Paimon 0.4 was released in June this year. It is a very competitive release, and it is also the first version since the project entered the Apache Incubator.

1. Challenges in lake storage

There are three main new scenarios for data lakes:

  • The first scenario is real-time ingestion into the lake: CDC data from databases is updated and written into the data lake in real time, so that the data can be analyzed by various engines as soon as possible.
  • The second scenario is real-time table widening: dimension-table fields are merged into a wide table in real time and made available for downstream queries and streaming reads.
  • The third scenario is real-time streaming reads: providing a message-queue-like streaming read experience, with Changelogs generated based on the primary key.


The pain points of lake ingestion are as follows:

  • Resource consumption vs. real-time performance: update throughput is poor and resource consumption is huge. Copy-on-write (COW) makes updates expensive, merge-on-read (MOR) makes queries slow, and whichever you choose you end up with backpressure.

  • There is a lot to manage in the data lake: compactions, cleanup of small and historical files, and cleanup of expired partitions.

  • Schema evolution: what happens to the lake storage when a column is added upstream? Do you restart the sync job? And a pile of small tables drains both resources and energy.


There are three pain points for wide tables:

  • Resource consumption and real-time performance: throughput and resource usage matter equally.

  • Input diversity: the input may be CDC data, and it may arrive out of order.

  • Reading: reads should be efficient, support projection pushdown, and support streaming reads.


The pain points of streaming reads are as follows:

  • Full-plus-incremental reads in one stream: first read the full snapshot and then follow the incremental changes as one complete stream, not incremental-only reads.

  • Changelog generation: some scenarios require low cost, others require low latency.

  • FileNotFound: the conflict between data lake file cleanup and streaming reads.

  • Lookup join: support for Flink's lookup join.


Apache Paimon is a data lake format designed specifically for CDC processing and stream computing, and it aims to give you a comfortable, automatic data-lake experience.

As the official website states, Apache Paimon supports high-speed data ingestion, Changelog generation, and efficient real-time queries.

2. Deep dive into Apache Paimon 0.4


Paimon's overall architecture is a table format built on top of a data lake (HDFS/OSS/S3): all of its metadata and data are stored on the lake storage. Its metadata can also be synchronized to the Hive Metastore or Alibaba Cloud Data Lake Formation for unified, table-level management. Database Changelogs are synchronized into the lake, and Kafka data can be synchronized as well.

Paimon 0.4 provides schema-evolution synchronization via Flink CDC, including synchronization of an entire MySQL database. Paimon 0.5 will additionally support synchronizing CDC data from Kafka. Beyond CDC, append-only log data can be written into Paimon through Flink batch writes, or written into Paimon through wide-table merging.

On the read side, Paimon supports batch reads and ad-hoc queries from various engines such as Spark, Trino, and StarRocks. Through Flink, its Changelog can be read in a combined full-plus-incremental way, streaming reads preserve data ordering, and Paimon tables can also be used for Flink lookup joins.
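To make this concrete, here is a minimal sketch of using Paimon from Flink's Table API in Java. The warehouse path, table name, and columns are placeholders of my own choosing, and the paimon-flink connector jar is assumed to be on the classpath.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PaimonQuickstartSketch {
    public static void main(String[] args) {
        // Batch mode for an ad-hoc query; streaming mode works the same way.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // All metadata and data live on the lake storage under this warehouse path.
        tEnv.executeSql(
                "CREATE CATALOG paimon WITH ("
                        + " 'type' = 'paimon',"
                        + " 'warehouse' = 'oss://my-bucket/paimon-warehouse')");
        tEnv.useCatalog("paimon");

        // A primary-key table: updates on the same key are merged when reading.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS orders ("
                        + " order_id BIGINT,"
                        + " amount DECIMAL(10, 2),"
                        + " dt STRING,"
                        + " PRIMARY KEY (order_id) NOT ENFORCED)");

        // Ad-hoc batch query; Spark, Trino, StarRocks, etc. read the same files.
        tEnv.executeSql("SELECT dt, SUM(amount) FROM orders GROUP BY dt").print();
    }
}
```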


Paimon is a data lake format plus an LSM architecture. Let me explain why Paimon needs the LSM.

The LSM is a write-friendly format. You can see the whole write path in the figure, but you do not need to understand every detail. The general idea is that writes happen in the Flink sink: when a checkpoint arrives, records buffered in memory are sorted and flushed into Level-0 files.

Thanks to the LSM's native asynchronous compaction, data can be pushed down to the bottom level asynchronously, while minor compactions and merges happen at the upper levels, so the LSM never accumulates too many levels. Merge-read performance stays acceptable without causing large write amplification.

In addition, the Flink sink automatically cleans up expired snapshots and files, and partition cleanup strategies can be configured as well. Overall, Paimon provides high-throughput append writes, low-cost local compaction, fully automatic cleanup, and ordered merging, so write throughput is very high while merge reads do not get too slow.
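Below is a hedged sketch of what this looks like on the write side, assuming the same hypothetical warehouse path as above. The 'bucket' and 'snapshot.time-retained' option names follow the Paimon documentation; everything else is a placeholder.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class PaimonWritePathSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Sorted in-memory buffers are flushed into Level-0 files at each checkpoint.
        env.enableCheckpointing(60_000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        tEnv.executeSql(
                "CREATE CATALOG paimon WITH ("
                        + " 'type' = 'paimon',"
                        + " 'warehouse' = 'oss://my-bucket/paimon-warehouse')");

        // 'bucket' fixes the number of LSM trees per partition; compaction and
        // expired snapshot/file cleanup run automatically inside the Flink sink.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS paimon.`default`.users ("
                        + " user_id BIGINT,"
                        + " name STRING,"
                        + " PRIMARY KEY (user_id) NOT ENFORCED"
                        + ") WITH ("
                        + " 'bucket' = '4',"
                        + " 'snapshot.time-retained' = '1 h')");

        // Each successful commit adds new Level-0 files that are merged asynchronously.
        tEnv.executeSql(
                "INSERT INTO paimon.`default`.users VALUES (1, 'alice'), (2, 'bob')")
            .await();
    }
}
```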


Based on Paimon's design, let's look at production practice at Tongcheng Travel and the gains compared with their original Hudi tables:

  • Resources for lake ingestion were reduced by 30%-40%.

  • Write performance improved by a factor of 3.

  • Some queries became about 7 times faster.


I have just covered the throughput side of Paimon's CDC ingestion. Now let me introduce some of the convenient tooling Paimon offers for CDC ingestion.

In Paimon 0.4 we provided native integration with Flink CDC: ready-made DataStream jobs read Changelog data through Flink CDC and write it into Paimon with schema evolution.

Table synchronization automatically handles table structure changes: adding columns, deleting columns, changing types, renaming columns, and so on. In the table-sync definition you can also add computed columns, define partition columns, define primary keys, and merge sharded databases and tables.

Paimon CDC also provides whole-database synchronization, which syncs all tables of a database into Paimon with a single job and keeps the synchronization resources as low as possible, so you do not have to worry about OOMs or instability. It supports INCLUDING and EXCLUDING table lists, table-name prefixes and suffixes, automatically skipping tables that cannot be synchronized, and dynamically adding new tables.

Paimon 0.5 adds Kafka synchronization: besides Flink CDC, CDC data already sitting in Kafka can be synchronized as well. You can write your databases, such as TiDB, MySQL, or Oracle, into Kafka and then synchronize that data into Paimon with schema evolution.

As you can see, lake ingestion is very simple: the entire synchronization job can be launched with Paimon's Flink action. Paimon also exposes a CDC DataStream API, so you can either run the integrated synchronization jobs directly or build your own Flink schema-evolution pipeline on top of that API.
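The schema-evolution and whole-database synchronization above are launched through Paimon's Flink action jar rather than through SQL. For a single table, an ordinary Flink SQL pipeline also works; the sketch below assumes the flink-connector-mysql-cdc dependency, uses placeholder connection details, and, unlike the action-based synchronization, does not perform schema evolution.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class MySqlToPaimonSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        tEnv.executeSql(
                "CREATE CATALOG paimon WITH ("
                        + " 'type' = 'paimon',"
                        + " 'warehouse' = 'oss://my-bucket/paimon-warehouse')");
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS paimon.`default`.orders ("
                        + " order_id BIGINT,"
                        + " amount DECIMAL(10, 2),"
                        + " dt STRING,"
                        + " PRIMARY KEY (order_id) NOT ENFORCED)");

        // A Flink CDC source for a single MySQL table (connection details are placeholders).
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE mysql_orders ("
                        + " order_id BIGINT,"
                        + " amount DECIMAL(10, 2),"
                        + " dt STRING,"
                        + " PRIMARY KEY (order_id) NOT ENFORCED"
                        + ") WITH ("
                        + " 'connector' = 'mysql-cdc',"
                        + " 'hostname' = 'mysql-host', 'port' = '3306',"
                        + " 'username' = 'user', 'password' = 'pass',"
                        + " 'database-name' = 'shop', 'table-name' = 'orders')");

        // Continuously upsert the MySQL changelog into the Paimon table.
        tEnv.executeSql("INSERT INTO paimon.`default`.orders SELECT * FROM mysql_orders");
    }
}
```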


Paimon supports partial updates: you can set the merge engine to partial-update, so that different streams write different columns of the same row, and the merged result can be read in batch mode later. Paimon even provides streaming reads here: as long as a Changelog Producer is declared, the merged data can be stream-read. Queries on such tables also support efficient column pruning.

In addition, the input of a partial-update table may be out of order, so a Sequence Field can be defined to handle out-of-order data. Paimon 0.5 introduces the concept of Sequence Groups to handle each stream being out of order in its own way: if all streams shared one version field, then after one stream advances the version, a newer record from another stream could be rejected because its version looks older.

For example, if two upstream streams update the table, you define two Sequence Groups, each with its own version field. Each stream then only advances its own version, and no matter how far the two sides drift apart, the final data is updated correctly, as shown in the sketch below.
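A hedged sketch of such a table, assuming the TableEnvironment and paimon catalog from the earlier sketches. The table, column, and group names are hypothetical; the option names ('merge-engine', 'fields.<field>.sequence-group', 'changelog-producer') follow the Paimon documentation.

```java
// Assumes a TableEnvironment `tEnv` with the `paimon` catalog registered,
// as in the earlier sketches; table and column names are hypothetical.
tEnv.executeSql(
        "CREATE TABLE IF NOT EXISTS paimon.`default`.wide_orders ("
                + " order_id BIGINT,"
                + " addr STRING,"
                + " addr_version BIGINT,"  // version column owned by the address stream
                + " pay STRING,"
                + " pay_version BIGINT,"   // version column owned by the payment stream
                + " PRIMARY KEY (order_id) NOT ENFORCED"
                + ") WITH ("
                + " 'merge-engine' = 'partial-update',"
                // Each stream only advances its own sequence group, so out-of-order
                // data in one stream cannot block newer data from the other stream.
                + " 'fields.addr_version.sequence-group' = 'addr',"
                + " 'fields.pay_version.sequence-group' = 'pay',"
                // Required if downstream jobs want to stream-read the merged result.
                + " 'changelog-producer' = 'lookup')");
```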


Streaming reads are one of Paimon's core capabilities and a key point that distinguishes it from other data lakes. Paimon can stream the raw input directly: set changelog-producer = 'input'. If your input is already a complete CDC changelog, this mode is the most efficient and consumes the least resources.

If your input is not a complete changelog, for example partial-update input, then a Changelog has to be generated for downstream streaming reads. Paimon not only supports Changelog generation, it offers two very flexible modes: Lookup mode and Full-Compaction mode.

Lookup mode dynamically looks up the higher-level LSM files at write time, finds the latest data, merges the latest Changelog, and emits it downstream. This is the fastest option and the one we recommend for latencies of 1-3 minutes, but it costs more.

If a job needs very low cost and can tolerate more delay, use Full-Compaction mode. The Changelog is generated during asynchronous full compaction, and the full-compaction interval can be set longer, for example 10 minutes. The cost is lower, but the latency is higher. The sketch below shows how the three modes are configured.
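A hedged sketch of setting the three modes, reusing the placeholder tables from the earlier sketches; the 'changelog-producer' values follow the Paimon documentation, and ALTER TABLE ... SET is standard Flink SQL.

```java
// Assumes a TableEnvironment `tEnv` as in the earlier sketches; table names are placeholders.

// Input is already a complete CDC changelog: just pass it through (cheapest).
tEnv.executeSql(
        "ALTER TABLE paimon.`default`.orders SET ('changelog-producer' = 'input')");

// Input is not a complete changelog (e.g. partial updates): generate it at write
// time by looking up higher LSM levels -- roughly 1-3 minute latency, higher cost.
tEnv.executeSql(
        "ALTER TABLE paimon.`default`.wide_orders SET ('changelog-producer' = 'lookup')");

// Or generate it during asynchronous full compaction -- lower cost, higher latency.
tEnv.executeSql(
        "ALTER TABLE paimon.`default`.wide_orders SET ('changelog-producer' = 'full-compaction')");
```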

We mentioned earlier that there is a conflict between lake storage and streaming reads: FileNotFound. Lake storage has to keep expiring snapshots to limit the number of small files, but if a streaming read still depends on an early snapshot, then once the streaming job fails, the snapshot it was reading may already have been cleaned up and the job cannot recover at all.

For this problem, Paimon introduces the Consumer ID, somewhat similar to Kafka's group ID. It ensures that after the job fails and restarts, the snapshot it was reading has not been cleaned up, as shown below.
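A hedged sketch of a streaming read with a consumer ID, assuming the streaming TableEnvironment from the earlier sketches. The consumer ID is just an arbitrary name chosen by the job, and the 'consumer-id' option follows the Paimon documentation.

```java
// Assumes a streaming TableEnvironment `tEnv` as in the earlier sketches.
// Allow per-query dynamic options via SQL hints (already the default on newer Flink).
tEnv.getConfig().getConfiguration()
        .setString("table.dynamic-table-options.enabled", "true");

// Snapshots still referenced by this consumer are protected from expiration,
// so the job can resume from where it left off after a failure.
tEnv.executeSql(
        "SELECT * FROM paimon.`default`.orders "
                + "/*+ OPTIONS('consumer-id' = 'ods-orders-reader') */")
        .print();
```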


Paimon 0.4 has also made great progress on the ecosystem side, as shown in the figure above.

At the beginning, Paimon only supported Flink: as Flink Table Store, it covered Flink's complete ecosystem and usage patterns.

Paimon 0.4 supports much more. For example, Spark now has batch read, batch write, CREATE TABLE, and ALTER TABLE; Hive has batch read, batch write, CREATE TABLE, and so on; Trino has CREATE TABLE, ALTER TABLE, and other capabilities.

We want two deeply integrated engines, Flink and Spark, where all operations such as batch read, batch write, table creation, and metadata changes are well supported. Beyond that, we hope other engines can at least read Paimon, and ideally support more operations such as CREATE TABLE and writes.

Besides these traditional processing engines, StarRocks, Doris, and SeaTunnel are also integrating Paimon; the code is basically ready and about to be released. MaxCompute and Hologres on Alibaba Cloud, and NetEase's Arctic, are also under development.

3. Community application practice


At present, the main users and contributors in the open source community include Alibaba Cloud, ByteDance, Tongcheng Travel, Bilibili, Zhongyuan Bank, miHoYo, Autohome, and other companies.

Let's take a look at how they use Paimon.


On the Alibaba Cloud computing platform, Paimon holds the number-one position among data lake formats. The goal is for all compute engines on the platform to integrate with Paimon and read Paimon. The deepest integration is with the Realtime Compute for Apache Flink platform and with the open-source big data platform E-MapReduce, where Paimon is expected to replace Hudi as the first choice for real-time lake ingestion.

The figure above shows this setup: data is ingested into the lake through Alibaba Cloud's real-time Flink service, including CTAS-style ingestion, and processed with real-time Flink streaming. We also hope that Paimon data can be queried by MaxCompute and Hologres, and that it is well integrated into the open-source big data platform E-MapReduce.


At ByteDance, engineers use Paimon + Flink as a Streaming Warehouse production system with lineage management and consistent queries. As shown in the figure above, business data flows into the Streaming Warehouse through streaming ETL, similar in concept to a streaming materialized view, so that all Paimon tables can be queried consistently.


At Tongcheng Travel, Paimon was introduced mainly to optimize the original Hudi-based near-real-time data warehouse.

  • In the real-time ODS-layer write scenario, there are 114+ Paimon jobs; the largest daily upsert increment is 20 million+ rows; the largest table holds 9 billion+ rows.

  • In the partial-update scenario, there are 10+ jobs, using true partial updates with Sequence Groups.

  • In the streaming/incremental read scenario, there are 20+ streaming incremental read jobs and 10+ hourly batch incremental read jobs.


Zhongyuan Bank is exploring a streaming data warehouse; miHoYo is exploring unified stream-batch technology; Bilibili is working in the AI direction and considering partial-update scenarios; Chenfeng Information is exploring TB-scale lake ingestion and has built a stream-batch unified data warehouse based on unified Flink stream-batch processing plus Paimon.

4. Future plans


We hope to build a Streaming LakeHouse like this: data enters Paimon through very convenient ingestion, streaming pipelines are built on Paimon's streaming and batch reads, and Paimon has a rich enough ecosystem to be queried by all kinds of engines. This is the general direction for Paimon + Flink going forward.


To build an easy-to-use, simple Streaming LakeHouse, there are roughly three directions.

The first direction:

  • More CDC ingestion paths. For example, the Kafka ingestion just mentioned should become simpler, more natural, and more automatic.

  • At present, Paimon still requires you to set the number of buckets. Too few buckets perform poorly: once the data volume grows, throughput drops. Too many buckets produce lots of small files. Each bucket holds one LSM tree, which already gives fairly good throughput, but tuning is still needed. Therefore Paimon 0.5 will provide dynamic buckets, and the desired state is fully automatic (see the sketch after this list).

  • Tag creation: after real-time ingestion into Paimon, a tag can be created every day for offline production.
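A hypothetical sketch of the dynamic bucket mode: based on the Paimon 0.5 documentation, setting 'bucket' = '-1' on a primary-key table lets Paimon assign buckets automatically instead of requiring manual tuning; everything else here is a placeholder.

```java
// Assumes a TableEnvironment `tEnv` with the `paimon` catalog registered,
// as in the earlier sketches; table and column names are hypothetical.
tEnv.executeSql(
        "CREATE TABLE IF NOT EXISTS paimon.`default`.events ("
                + " event_id BIGINT,"
                + " payload STRING,"
                + " PRIMARY KEY (event_id) NOT ENFORCED"
                // 'bucket' = '-1' enables dynamic bucket mode (Paimon 0.5+),
                // so the bucket count no longer has to be tuned by hand.
                + ") WITH ('bucket' = '-1')");
```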

The second direction: enhanced append-only processing. Previously, Paimon's append-only tables also required a bucket definition, which is a difficult concept to get right. In the future, Paimon should support true offline tables with no buckets at all, and writes to offline tables should include small-file merging, in line with Paimon's fully automatic philosophy.

The third direction: the ecosystem. Besides the StarRocks integration, we hope to build Spark into a second deeply integrated engine for Paimon, with excellent read and write capabilities, making Paimon a complete data lake.


Finally, a review of Paimon's history. It was first discussed in the Flink community in 2021; the first version of Flink Table Store was released in May 2022; version 0.3, the first production-ready release, came out in January 2023; the project entered the Apache Incubator in March 2023 and was renamed Apache Paimon; and Paimon 0.4 was released in June 2023.

Going forward, we hope the CDC real-time data lake becomes fully mature, append-only offline tables become production-ready, the ecosystem is fully connected, and the Spark integration reaches maturity.

Q&A

Q: When writing CDC data into Paimon tables with binlog traffic of about 3,000 records per second plus a full initial load, how should it be tuned? Checkpoints often fail in our current tests.

A: The key is to find where the performance bottleneck is, check whether there is a memory problem, and if necessary look at the jstack output.

Q: Can the table structure be dynamically modified?

A: Yes, either Spark or Flink 1.17 can do it.

Q: When will 0.5 be released?

A: Around August.

Q: What is the latency of streaming reads?

A: The minimum is one checkpoint interval, about 1 minute.

Q: How can we migrate easily from Hudi to Paimon?

A: The newly introduced SparkGenericCatalog is partly intended to let Hudi and Paimon tables coexist.

Q: Can you expand on the Lookup mode for Changelog generation?

A: See the official documentation: Primary Key Table | Apache Paimon

Q: Is the bucket count an important parameter, and how should it be tuned?

A: Yes. Size it based on the data volume, then run the job and verify. The latest version also supports dynamic buckets.

Q: After data has been stored for a while, can the bucket count be adjusted manually? Will existing data be re-bucketed?

A: See Rescale Bucket in the official documentation: Rescale Bucket | Apache Paimon

Q: With out-of-order real-time data, how does Paimon's partial-update prevent old data from overwriting new data? Is there something like a sequence column?

A: Yes, see the sequence field in the official documentation: Primary Key Table | Apache Paimon

Q: Does compaction have a big impact on read and write performance?

A: It affects writes; it is a trade-off between read and write performance.

Follow Paimon

The development of streaming data lakes needs your support.

