Flink Meets Apache Celeborn: A Unified Data Shuffle Service

Author: Xiong Jiashu (Lu Shuang)

We are very pleased to announce that Apache Celeborn (Incubating)[1] officially supports Flink. Celeborn entered the incubator of the Apache Software Foundation (ASF) in December last year and has been steadily improving its stability and flexibility since. The latest version, 0.3.0, adds shuffle support for Flink batch jobs. From now on, Flink and Spark can share a unified shuffle data service, saving resources and further reducing operations and maintenance costs.

Celeborn currently supports most Flink versions. By design, Celeborn integrates comprehensively with Flink in memory management, scheduling strategy, flow control, metrics, and more, ensuring that introducing Celeborn does not regress any of Flink's existing mechanisms. At the same time, Celeborn Master HA, multi-tier storage, and graceful-upgrade capabilities bring gains in elasticity, stability, and performance.

1. Why Apache Celeborn is needed

Flink and Spark are big data computing engines that unify stream and batch processing, and shuffle is a key stage that affects their computing performance. The local disks of computing nodes cannot be arbitrarily large, and both engines provide adaptive schedulers that scale dynamically with the amount of available resources. This requires computing nodes to be able to offload intermediate shuffle data to an external storage service in a timely manner. To improve resource utilization, an independent shuffle service is therefore essential.

At the same time, Celeborn supports multiple efficient data shuffle methods and adapts to multiple deployment modes; its HA architecture and graceful decommissioning also make Celeborn itself elastic. Introducing an independent shuffle service such as Apache Celeborn is therefore the path to real resource elasticity and to improved stability and resource efficiency. The rest of this article focuses on how Celeborn supports Flink and on Celeborn's most important features.

2. Supported engines and features

  • Supports Flink 1.14 and above, including the Flink Adaptive Batch Job Scheduler

  • Supports Spark 2.4 and Spark 3.x, including Spark AQE

  • Stability (Flink)
    • Memory management: integrates with Flink's memory management mechanism and supports a Credit-based flow control mechanism similar to Flink's
    • Tolerates JobManager (JM) and TaskManager (TM) restart and recovery
    • Supports load balancing, connection multiplexing, and a heartbeat mechanism
    • Supports reproducing batch data after a Celeborn Worker failure
  • Multiple deployment modes: supports deployment in Kubernetes and Standalone environments

  • High availability: Celeborn Master supports HA and automatic failover

  • Rolling upgrade: Celeborn clusters support Master/Worker rolling upgrades

  • Shuffle mechanisms: supports both the MapPartition and ReducePartition data shuffle mechanisms, and will integrate the two in future Flink support

  • Performance: MapPartition supports an IO scheduling mechanism similar to Flink's

  • Multi-tier storage: supports SSD/HDD/HDFS storage tiers

3. Key designs and features of the Flink support

3.1 Memory stability and protocol optimization

Celeborn is committed to serving multiple engines as a unified shuffle data service. By enhancing the extensibility of its framework and protocols, Celeborn supports multiple engines in a plug-in manner, which greatly improves component reuse and reduces Celeborn's complexity. Compared with Spark, however, supporting Flink under Flink's strict memory management model is a key challenge for Celeborn. Because Celeborn reuses all of its existing interfaces and protocols for the sake of unification, it cannot share Flink's network stack, and thus cannot use Flink's NetworkBuffer directly. Therefore, to use managed memory as much as possible, avoid OOM, and improve system stability, Celeborn makes several optimizations on the data read and write paths:

  • When writing data, Celeborn wraps the Flink NetworkBuffer that holds the data, achieving zero-copy transmission; the buffer is released immediately after it is sent
  • When reading data, Celeborn implements a decoder in its Flink plugin client that uses Flink buffers directly, so that the memory used for both writing and reading is managed by Flink's memory management. This is consistent with Flink's native memory model, and avoids requiring users to modify job parameters, as well as the memory stability problems that could otherwise arise after adopting Celeborn.
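The zero-copy write path described above can be sketched as follows. This is a minimal, hypothetical model (the names `NetworkBuffer`, `ZeroCopyWriter`, and `RecordingTransport` are illustrative, not Celeborn's actual API): the writer wraps the engine-managed buffer by taking a reference instead of copying the bytes, and drops the reference as soon as the send completes.

```python
class NetworkBuffer:
    """Minimal stand-in for an engine-managed, reference-counted buffer."""
    def __init__(self, data: bytes):
        self.data = data
        self.ref_count = 1          # owned by the producing task

    def retain(self):
        self.ref_count += 1

    def release(self):
        self.ref_count -= 1
        assert self.ref_count >= 0


class RecordingTransport:
    """Toy transport that records what was sent."""
    def __init__(self):
        self.sent = []

    def send(self, data):
        self.sent.append(data)


class ZeroCopyWriter:
    """Wraps the buffer for transmission; no byte copy is made."""
    def __init__(self, transport):
        self.transport = transport

    def write(self, buf: NetworkBuffer):
        buf.retain()                      # hold a reference while in flight
        try:
            self.transport.send(buf.data)  # zero-copy: same bytes object
        finally:
            buf.release()                  # released immediately after sending
```

The key property is that the transport sees the very same bytes object the engine allocated, and the wrapper's reference is gone once the send returns.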


  • Credit-based flow control: to further improve stability on the Flink side, Celeborn also adopts a Credit-based flow control mechanism similar to Flink's when reading data [2]. The data sender (Celeborn Worker) sends data only when the data receiver (TaskManager) has enough buffers to receive it; as the receiver processes data, it feeds the released buffers (credits) back to the sender so that new data can keep flowing. Writing data, meanwhile, fully reuses Celeborn's original, efficient multi-tier storage implementation.
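The credit-based protocol described above can be modeled in a few lines. This is an illustrative simulation, not Celeborn's wire protocol: the sender only transmits while it holds credits announced by the receiver, and each buffer the receiver drains flows back as a new credit.

```python
from collections import deque


class Receiver:
    """Models a TaskManager with a fixed pool of receive buffers."""
    def __init__(self, num_buffers: int):
        self.free_buffers = num_buffers
        self.inbox = deque()

    def announce_credits(self) -> int:
        """Hand all currently free buffers to the sender as credits."""
        credits, self.free_buffers = self.free_buffers, 0
        return credits

    def on_data(self, chunk):
        self.inbox.append(chunk)          # occupies one receive buffer

    def process_one(self) -> int:
        """Drain one buffer; the freed buffer becomes a new credit."""
        self.inbox.popleft()
        return 1


class Sender:
    """Models a Celeborn Worker that respects the receiver's credits."""
    def __init__(self, chunks):
        self.pending = deque(chunks)
        self.credits = 0

    def add_credits(self, n: int):
        self.credits += n

    def send_to(self, receiver: Receiver) -> int:
        sent = 0
        while self.pending and self.credits > 0:
            receiver.on_data(self.pending.popleft())
            self.credits -= 1
            sent += 1
        return sent
```

Because the sender can never outrun the receiver's announced buffers, the receiver cannot be overwhelmed, which is exactly the stability property the mechanism is meant to provide.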


3.2 MapPartition and ReducePartition: two shuffle mechanisms

Celeborn supports two kinds of data shuffle. In MapPartition, each partition's data is written by one upstream map task and read by multiple downstream reduce tasks. In ReducePartition, data belonging to the same partition is pushed by multiple upstream map tasks to the same Worker for aggregation, and is then read directly by the downstream reduce task. Each approach has its own advantages and disadvantages:

  • ReducePartition is expensive to recompute. Usually, all upstream mappers push data into one reduce-partition file, so when that file is lost, all upstream tasks must be recomputed. Celeborn's multi-replica mechanism lowers the probability of data loss, but cannot eliminate it. MapPartition is friendlier to fault-tolerant recovery: re-running the corresponding map task is enough.
  • ReducePartition converts random reads into sequential reads, so the network and disk IO efficiency of reduce tasks during shuffle read can be greatly improved; MapPartition is more flexible and can support various kinds of shuffle edges.

To better support Flink (which introduces two additional shuffle edge types, Rescale and Forward) and to keep recomputation costs small, Celeborn 0.3.0 supports the MapPartition shuffle type. In the current version, Celeborn uses MapPartition to support Flink and ReducePartition to support Spark; future versions will consider implementing, together with Flink, a mechanism for switching between the two dynamically to achieve the best performance.
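As a concrete illustration, pointing a Flink batch job at a Celeborn cluster is a matter of configuration. The keys below follow the Celeborn documentation for the Flink plugin, but the exact key names and the master address depend on your Celeborn and Flink versions and your deployment, so treat this as a sketch to verify against the official docs rather than a definitive reference:

```yaml
# flink-conf.yaml (illustrative; verify keys against the Celeborn docs)

# Use Celeborn's remote shuffle service instead of Flink's local shuffle.
shuffle-service-factory.class: org.apache.celeborn.plugin.flink.RemoteShuffleServiceFactory

# Where the Celeborn Master(s) listen; replace with your own endpoints.
celeborn.master.endpoints: celeborn-master-0:9097

# Celeborn's Flink support targets blocking (batch) shuffle edges.
execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING
```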


3.3 MapPartition data reading, writing, and optimization

In line with the characteristics of Flink's current shuffle, scheduling, and fault tolerance, the MapPartition path adopts Flink's current sort-shuffle approach: a task's output data is sorted before it is written out, and the sorted data is appended to a single file on the Celeborn Worker. On the read path, Celeborn schedules the read requests so that data is always read in ascending order of file offsets; in the best case, the reads become fully sequential. The figure below shows the storage structure of the data files and the IO scheduling process.

[Figure: storage structure and IO scheduling of data files]
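The offset-ordered scheduling idea above can be sketched as a priority queue keyed by file offset. The data structures here are hypothetical, not Celeborn's implementation; the point is only that serving pending read requests in ascending offset order turns scattered subpartition reads into a near-sequential scan of the file.

```python
import heapq


class ReadRequest:
    """A pending request to read one region of the sorted shuffle file."""
    def __init__(self, reader_id: str, offset: int, length: int):
        self.reader_id = reader_id
        self.offset = offset
        self.length = length

    def __lt__(self, other):
        # Heap ordering: the request closest to the file start comes first.
        return self.offset < other.offset


class OffsetOrderedScheduler:
    """Keeps pending read requests in a min-heap keyed by file offset."""
    def __init__(self):
        self.pending = []

    def submit(self, request: ReadRequest):
        heapq.heappush(self.pending, request)

    def next_request(self):
        """Pop the next request in ascending offset order, or None."""
        return heapq.heappop(self.pending) if self.pending else None
```

Whichever order downstream readers submit their requests in, the disk sees them sorted by offset, which is what makes the best case a single sequential read.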

3.4 Celeborn for multiple engines

From the description above, it should be clear that, to the Celeborn service, Flink and Spark differ only in their clients. The two can fully share one Celeborn deployment, which not only saves resources and improves operational efficiency but also keeps the architecture clear.

Let's take a brief look at Celeborn's architecture. Celeborn consists of three main parts: the Celeborn Master, Celeborn Workers, and the Celeborn plugins (Flink, Spark). The Master manages the whole shuffle cluster, including Workers, shuffle resources, and various metadata; Workers are responsible for writing and reading shuffle data. The MapPartition mode used by Flink and the ReducePartition mode used by Spark, described above, reuse all server-side components and share a unified protocol. The Celeborn server is unaware of engine-side differences, so one Celeborn cluster can serve multiple engines at the same time. The figure below shows how Flink and Spark jobs interact with a Celeborn cluster.

[Figure: interaction between Flink/Spark jobs and a Celeborn cluster]

The Celeborn Master uses the Raft protocol to synchronize cluster metadata and Worker and application information. Clients and Workers interact with the leader node, so HA is achieved without relying on external components: when the Master is upgraded or fails, clients and Workers automatically switch to the new leader. In terms of design, Celeborn abstracts concepts and interfaces such as Register Shuffle, Reserve Slots, Partition Split, and Commit, and an engine can implement its management logic entirely through these interfaces in a plug-in fashion. Celeborn's high availability and extensibility therefore make it well suited to onboarding new engines, and users and developers are welcome to help enrich and evolve Celeborn together.
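The plug-in surface described above can be sketched as an abstract interface. Celeborn's real interfaces live on the JVM side; the names below simply mirror the concepts named in the text (register shuffle, reserve slots, partition split, commit), and the in-memory implementation is a toy showing how an engine plugin would drive them.

```python
from abc import ABC, abstractmethod


class ShuffleLifecycle(ABC):
    """Operations an engine plugin drives against the Celeborn cluster."""

    @abstractmethod
    def register_shuffle(self, app_id: str, shuffle_id: int) -> None: ...

    @abstractmethod
    def reserve_slots(self, shuffle_id: int, num_partitions: int) -> list: ...

    @abstractmethod
    def split_partition(self, shuffle_id: int, partition_id: int) -> None: ...

    @abstractmethod
    def commit(self, shuffle_id: int) -> bool: ...


class InMemoryLifecycle(ShuffleLifecycle):
    """Toy implementation: tracks shuffle state in a dict."""

    def __init__(self):
        self.shuffles = {}

    def register_shuffle(self, app_id, shuffle_id):
        self.shuffles[shuffle_id] = {"app": app_id, "slots": [], "committed": False}

    def reserve_slots(self, shuffle_id, num_partitions):
        # Round-robin partitions over two pretend workers.
        slots = [f"worker-{i % 2}" for i in range(num_partitions)]
        self.shuffles[shuffle_id]["slots"] = slots
        return slots

    def split_partition(self, shuffle_id, partition_id):
        # A split allocates a fresh location for the oversized partition.
        self.shuffles[shuffle_id]["slots"].append(f"worker-split-{partition_id}")

    def commit(self, shuffle_id):
        self.shuffles[shuffle_id]["committed"] = True
        return True
```

Because the engine only ever touches this interface, a new engine can be added without changing the server side, which is the extensibility property the text describes.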

3.5 More Celeborn features and optimizations

Celeborn 0.3.0 also adds features such as multi-tier storage and a multi-level blacklist, reduces the number of RPC requests, shortens graceful-upgrade time, and fixes a large number of corner cases. The community is working on the release process for this version; you can follow Celeborn's mailing list or the Apache Celeborn website[3] for the latest release information.

4. Celeborn's production practice inside Alibaba and the road ahead

Celeborn's support for Flink has been proven in production. Inside Alibaba, Celeborn has handled the largest Flink batch jobs, with single-job shuffle volumes of more than 600 TB, and these jobs run smoothly with excellent stability and performance.

In addition, Apache Celeborn's support for Flink received strong support from the flink-remote-shuffle community[4], and many of its designs derive from the flink-remote-shuffle project, for which we express our sincere thanks.

Looking ahead, besides the dynamically switching shuffle mechanism mentioned above, to be implemented together with Flink, the community also plans to add memory as a tier of multi-tier storage and to support Flink Hybrid Shuffle, among other features. Finally, we thank Celeborn's users and developers, and welcome more users and developers to join!

Reference

[1] https://github.com/apache/incubator-celeborn

[2] Analysis of Network Flow Control and Back Pressure: Flink Advanced Tutorials, Alibaba Cloud Community

[3] Apache Celeborn (Incubating) official website

[4] https://github.com/flink-extended/flink-remote-shuffle

For more technical questions about Celeborn, you can join the community DingTalk group: 41594456.


Origin blog.csdn.net/weixin_44904816/article/details/131799220