Lessons from running Flink at large scale at Alibaba: how to significantly reduce HDFS pressure

About the author

Qiu Congxian (Shanzhi) is an Apache Flink Contributor with a master's degree from Central South University. He joined Alibaba's Computing Platform Division in 2018 to focus on Flink core engine development, mainly working on Flink State & Checkpoint.

As is well known, Flink is currently one of the most widely used computation engines. It relies on a checkpoint mechanism for fault tolerance [1]: checkpoints back up state snapshots to a distributed storage system so that jobs can later be restored from them. Inside Alibaba we mainly use HDFS as that storage, and once a cluster runs a certain number of jobs, this puts enormous pressure on HDFS. This article describes a technique that substantially reduces that pressure: small file merging.

Background

Whether FsStateBackend, RocksDBStateBackend, or NiagaraStateBackend is used, during a checkpoint the TM writes the state snapshot to the distributed file system and then sends the file handle to the JM, which stores the metadata of the global checkpoint snapshot, as shown below:

[Figure: the TM uploads state snapshots to the distributed file system and reports the resulting handles to the JM]
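To make the flow concrete, here is a minimal, self-contained sketch in plain Java (the class and method names are invented for illustration and are not Flink's actual classes): every state file is uploaded as its own file, and the resulting handle is what the TM reports to the JM.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the per-file upload flow: every state file becomes
// its own file in "distributed" storage, and the TM reports a handle
// (path + size) to the JM, which keeps the global checkpoint metadata.
public class CheckpointFlowSketch {

    record StateHandle(String path, long size) {}

    static StateHandle uploadStateFile(Path dfsDir, String name, byte[] state) throws Exception {
        Path file = dfsDir.resolve(name);
        Files.write(file, state);                              // TM uploads one file per sst
        return new StateHandle(file.toString(), state.length); // handle reported to the JM
    }

    public static void main(String[] args) throws Exception {
        Path dfs = Files.createTempDirectory("dfs");           // stands in for HDFS
        List<StateHandle> reportedToJm = new ArrayList<>();
        reportedToJm.add(uploadStateFile(dfs, "1.sst", new byte[]{1, 2, 3}));
        reportedToJm.add(uploadStateFile(dfs, "2.sst", new byte[]{4, 5}));
        System.out.println("JM stores the global checkpoint metadata: " + reportedToJm);
    }
}
```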

For full checkpoints, each TM writes all of its checkpoint data into a single file, while for the incremental checkpoints of RocksDBStateBackend / NiagaraStateBackend [2], every sst file is written to the distributed file system as a separate file. When the number of jobs is large and their parallelism is high, this puts enormous pressure on the underlying HDFS: 1) the huge number of RPC requests degrades RPC response time (as shown below); 2) the huge number of files puts a lot of pressure on the NameNode's memory.

[Figures: NameNode RPC request volume and RPC response time]

Flink's ByteStreamStateHandle already tries to mitigate the small-file problem [3]: state smaller than a certain threshold is sent directly to the JM, and the JM writes it out with the checkpoint metadata, so that no small files are produced on the TM side. This approach has its limits, however: if the threshold is set too small, many small files are still generated; if it is set too large, the JM consumes too much memory and risks OOM.
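The trade-off can be pictured in a few lines (a simplified sketch, not Flink's real implementation; in Flink the threshold corresponds roughly to the FsStateBackend memory threshold configuration option):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified sketch of the ByteStreamStateHandle trade-off: state below the
// threshold is shipped inline to the JM (no DFS file, but it occupies JM
// memory); state above the threshold is written to its own DFS file.
public class ThresholdSketch {

    interface Handle {}
    record InlineHandle(byte[] bytes) implements Handle {}   // bytes end up in JM memory
    record FileHandle(String dfsPath) implements Handle {}   // file lives on DFS

    static Handle snapshot(byte[] state, long thresholdBytes, Path dfsDir) throws IOException {
        if (state.length <= thresholdBytes) {
            return new InlineHandle(state);                   // no small file is created
        }
        Path file = Files.createTempFile(dfsDir, "state-", ".bin");
        Files.write(file, state);                             // one more file on DFS
        return new FileHandle(file.toString());
    }
}
```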

1. The small file merging solution

To address the problems above we propose a solution: small file merging.

In the original implementation, each sst file opens its own CheckpointOutputStream, each CheckpointOutputStream corresponds to one FSDataOutputStream, the local file is written to the distributed file system, the FSDataOutputStream is closed, and a StateHandle is produced. As shown below:

[Figure: each sst file is written through its own CheckpointOutputStream / FSDataOutputStream, producing one DFS file per sst file]

With merging enabled, the open FSDataOutputStream is reused until the file size reaches a preset threshold. In other words, multiple sst files are written into the same DFS file, each sst file occupying a segment of that DFS file, and multiple StateHandles end up sharing one physical file, as shown below:

[Figure: multiple sst files share one DFS file, each occupying a segment, and multiple StateHandles point into the same physical file]
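A minimal sketch of the reuse idea (self-contained plain Java with invented names, not the production code): several logical sst files are appended to one underlying output stream, each write returns a handle recording its (file, offset, length) segment, and a new physical file is only opened once the current one exceeds the threshold.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of small file merging on the write path: many logical sst files are
// appended to one shared physical file; each logical file is described by a
// handle of (path, offset, length) instead of owning its own DFS file.
public class MergedStreamSketch {

    record SegmentHandle(String path, long offset, long length) {}

    static final long MAX_PHYSICAL_FILE_SIZE = 64L * 1024 * 1024; // roll-over threshold

    private Path current;
    private OutputStream out;
    private long written;

    SegmentHandle write(byte[] sstBytes) throws IOException {
        if (out == null || written >= MAX_PHYSICAL_FILE_SIZE) {
            roll(); // close the current physical file and open a new one
        }
        long offset = written;
        out.write(sstBytes);
        written += sstBytes.length;
        return new SegmentHandle(current.toString(), offset, sstBytes.length);
    }

    private void roll() throws IOException {
        if (out != null) {
            out.close();
        }
        current = Files.createTempFile("merged-checkpoint-", ".data"); // stands in for a DFS file
        out = new BufferedOutputStream(Files.newOutputStream(current));
        written = 0;
    }

    void close() throws IOException {
        if (out != null) {
            out.close();
        }
    }
}
```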

The following sections describe the implementation in detail. The important points to consider are:

1) Supporting concurrent checkpoints

Flink supports concurrent checkpoints natively. With small file merging, multiple state files are written into the same file in distributed storage; without care, data from different checkpoints would be interleaved or corrupted, so a mechanism is needed to guarantee correctness. This is described in Section 2.1.

2) Preventing accidental file deletion

We use reference counting to track file usage and only consider deleting a file when its reference count drops to zero. Even then, a file could still be deleted by mistake; how we make sure files are never deleted accidentally is explained in Section 2.2.

3) Reducing space amplification

With small file merging, a distributed file cannot be deleted as long as any StateHandle inside it is still in use, so it takes up more space than strictly necessary. Section 2.3 describes our solution to this problem.

4) Exception handling

Section 2.4 explains how exceptions are handled, covering both JM failures and TM failures.

Section 2.5 describes how a snapshot is cancelled on the TM side after a checkpoint is cancelled or fails. If the TM-side snapshot were not cancelled, a normally running job could end up with more concurrent TM-side snapshots than intended.

Section 3 discusses the compatibility of small file merging with the existing mechanism; Section 4 summarizes the advantages and disadvantages of the approach; finally, Section 5 shows the results achieved in our production environment.

2. Design and implementation

In this section we describe the details of small file merging and the design behind each of the points listed above.

First, let us roughly recall the snapshot process on the TM side:

  • The TM side aligns the barriers
  • The TM performs the synchronous part of the snapshot
  • The TM performs the asynchronous part of the snapshot

The sst files are uploaded to the distributed storage system in the third step above. Files belonging to the same checkpoint are uploaded to the same shared file, and uploads for several checkpoints may be in flight at the same time.

2.1 Supporting concurrent checkpoints

Flink supports concurrent checkpoints natively, so the small file merging scheme also has to support them. If sst files belonging to different checkpoints were written into the same distributed file at the same time, the file contents would be corrupted and later restores from that file would fail.

In the proposal of FLINK-11937 [4], we write all state files of one checkpoint into the same HDFS file and state files of different checkpoints into different HDFS files. In other words, HDFS files are not shared across checkpoints, which avoids multiple clients writing to the same file concurrently.

We plan to continue pushing a cross-checkpoint file sharing scheme as follow-up work; of course, even with cross-checkpoint sharing, concurrent checkpoints will still write to different HDFS files.
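One way to picture the per-checkpoint isolation (a sketch that reuses the MergedStreamSketch class from the earlier sketch; all names are invented, not Flink's actual classes): each in-flight checkpoint gets its own shared output stream, keyed by checkpoint id, so concurrent checkpoints never append to the same physical file.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: one merged output stream per in-flight checkpoint, so two concurrent
// checkpoints never append to the same physical DFS file.
public class PerCheckpointStreams {

    private final ConcurrentMap<Long, MergedStreamSketch> streams = new ConcurrentHashMap<>();

    MergedStreamSketch streamFor(long checkpointId) {
        // computeIfAbsent guarantees exactly one stream per checkpoint id
        return streams.computeIfAbsent(checkpointId, id -> new MergedStreamSketch());
    }

    void checkpointFinished(long checkpointId) {
        // the stream is sealed and removed once the checkpoint completes or is aborted
        streams.remove(checkpointId);
    }
}
```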

2.2 Preventing accidental file deletion

Once the underlying files are shared, we use reference counting to track their usage and delete a file when its reference count drops to zero. In some situations, however, a reference count of zero does not mean the file will never be used again, and deleting it at that moment would be a mistake. Below we explain how accidental deletion can happen when concurrent checkpoints are enabled, and how we solve it.

Take the following figure as an example, with maxConcurrentCheckpoints = 2:

[Figure: checkpoints chk-1 (completed), chk-2 and chk-3 (both based on chk-1); 4.sst is registered by both chk-2 and chk-3, and chk-3's copy shares a distributed file with 5.sst]

The figure shows three checkpoints. chk-1 has already completed, and chk-2 and chk-3 are both based on chk-1, with chk-2 completing before chk-3. When chk-3 registers 4.sst, it finds that 4.sst has already been registered by chk-2, so it reuses chk-2's StateHandle for 4.sst, unregisters its own 4.sst and deletes that StateHandle. After chk-3 has processed 4.sst this way, the reference count of the distributed file behind the deleted StateHandle drops to zero. If we deleted that distributed file at this point, we would also delete the content of 5.sst stored in the same file, and it would no longer be possible to restore from chk-3.

The problem is therefore how to decide correctly, when the reference count of a StateHandle's distributed file drops to zero, whether the file may still be referenced later. Our approach is to defer the decision until the whole checkpoint has completed and only then decide whether the distributed file can be deleted: if no completed checkpoint references the file, it can be deleted safely; otherwise it is kept.
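The deferred deletion can be sketched as simple JM-side bookkeeping (invented names, simplified from the description above): dropping to zero only marks a file as a deletion candidate, and the actual delete happens after the pending checkpoint has finished registering its handles.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of JM-side reference counting for merged files: decrementing to zero
// only marks a file as a deletion candidate; the actual delete is deferred
// until the currently pending checkpoints have completed, because one of them
// may still register a handle that lives inside the same physical file.
public class FileRefCounter {

    private final Map<String, Integer> refCounts = new HashMap<>();
    private final Set<String> pendingDelete = new HashSet<>();

    synchronized void register(String physicalFile) {
        refCounts.merge(physicalFile, 1, Integer::sum);
        pendingDelete.remove(physicalFile); // a new reference rescues a candidate
    }

    synchronized void unregister(String physicalFile) {
        int c = refCounts.merge(physicalFile, -1, Integer::sum);
        if (c <= 0) {
            refCounts.remove(physicalFile);
            pendingDelete.add(physicalFile); // candidate only, not deleted yet
        }
    }

    /** Called only after a checkpoint has fully completed its registration phase. */
    synchronized List<String> filesSafeToDelete() {
        List<String> safe = new ArrayList<>(pendingDelete);
        pendingDelete.clear();
        return safe;                          // caller issues the DFS deletes
    }
}
```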

2.3 Reducing space amplification

With small file merging, each sst file corresponds to a segment of a distributed file, as shown below:

[Figure: a distributed file split into segment-1 through segment-4, each segment holding one sst file]

A file can only be deleted once none of its segments is in use any more. In the figure above there are four segments and only segment-4 is still in use, yet the whole file cannot be deleted, so the space of segment-1 through segment-3 is wasted. Data from our production environment shows an overall space amplification (actual space occupied / space actually needed) of 1.3 to 1.6.

To reduce space amplification, an asynchronous thread on the TM compacts files whose amplification exceeds a threshold. Only files that have already been closed are compacted.

The whole compaction process is as follows (a code sketch follows the list):

  • Compute the amplification of each file
  • If the amplification is small, skip to step 7
  • If the amplification of file A exceeds the threshold, generate a corresponding new file A' (if creating the file fails, the TM is responsible for cleaning it up)
  • Record the mapping between A and A'
  • In the next checkpoint X, instead of sending the JM the StateHandles that point to file A, send new StateHandles generated from the information of A'
  • After checkpoint X completes, increase the reference count of A' and decrease the reference count of A; when A's count drops to zero, file A is deleted (if an exception occurs while the JM is increasing the reference count of A', the entire reference count is rebuilt from the last successful checkpoint)
  • The file compaction is complete
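A rough sketch of the compaction trigger described above (invented names; the threshold value is illustrative, not a Flink default):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Sketch of the TM-side compaction trigger: compute each closed file's space
// amplification (physical size / live bytes) and select files above a
// threshold as candidates to be rewritten into a new file A'.
public class CompactionPlanner {

    record MergedFile(String path, long physicalSize, long liveBytes, boolean closed) {}

    static final double AMPLIFICATION_THRESHOLD = 1.5; // illustrative value only

    static List<MergedFile> filesToCompact(Collection<MergedFile> files) {
        List<MergedFile> candidates = new ArrayList<>();
        for (MergedFile f : files) {
            if (!f.closed() || f.liveBytes() == 0) {
                continue;                                   // only closed, non-empty files are compacted
            }
            double amplification = (double) f.physicalSize() / f.liveBytes();
            if (amplification > AMPLIFICATION_THRESHOLD) {
                candidates.add(f);
            }
        }
        return candidates;
    }
}
```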

2.4 Exception handling

During a checkpoint there are two main kinds of failures: JM failures and TM failures. We discuss them separately.

1) JM failure

The JM side mainly keeps the StateHandles and the reference counts of the files. The reference count data does not need to be persisted to external storage, so no special handling and no transaction-like operations are required: if a JM failover happens, the JM simply recovers from the most recent complete checkpoint and rebuilds the reference counts.
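Because the counts are derived data, they can be rebuilt simply by walking the handles of the latest completed checkpoint, roughly as in this sketch (invented handle type, simplified from the description above):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: after a JM failover, reference counts are not recovered from any
// external store; they are rebuilt by scanning the state handles of the most
// recent completed checkpoint and counting how many handles point at each
// underlying physical file.
public class RefCountRebuild {

    record SegmentHandle(String physicalFile, long offset, long length) {}

    static Map<String, Integer> rebuild(List<SegmentHandle> handlesOfLatestCheckpoint) {
        Map<String, Integer> counts = new HashMap<>();
        for (SegmentHandle h : handlesOfLatestCheckpoint) {
            counts.merge(h.physicalFile(), 1, Integer::sum);
        }
        return counts;
    }
}
```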

2) TM failure

TM failures fall into two cases: 1) the file has already been reported to the JM before the failure; 2) the file has not yet been reported to the JM. We discuss them in turn.

① The file has already been reported to the JM

If the file has already been reported to the JM, its reference count is managed by the JM: the JM owns the file, and when its reference count drops to zero the JM deletes it.

② The file has not been reported to the JM

If the file has not been reported to the JM, it is no longer used by anyone, but the JM is not aware of it, so it becomes an orphaned file. For the time being such files are cleaned up by a separate external tool.

2.5 Cancelling the snapshot on the TM side

As mentioned in the earlier sections, when a checkpoint times out or fails we need to cancel the snapshot on the TM side. Flink currently has no corresponding notification mechanism; the related improvement is tracked in FLINK-8871 [5]. We have added the corresponding implementation internally: when a checkpoint fails, the JM sends an RPC to the TM, and upon receiving this RPC the TM cancels the corresponding snapshot.

3. Compatibility

Small file merging supports seamless migration from previous versions. Restoring from a checkpoint taken by a previous version works as follows:

  • Each TM is assigned the state handles it needs for recovery
  • Each TM downloads the data behind its state handles from remote storage
  • The state is restored locally

Small file merging mainly affects step 2: the data behind different kinds of StateHandle is downloaded from remote storage in the appropriate way, so overall compatibility is not affected.
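The adaptation in step 2 can be sketched as follows (invented names; a plain file handle downloads the whole file, while a segment handle reads only its own (offset, length) slice of the shared file):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Sketch of the restore-side adaptation: a segment handle seeks to its offset
// in the shared physical file and reads only its own length, instead of
// downloading a whole file of its own.
public class SegmentDownload {

    static byte[] readSegment(Path sharedFile, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(sharedFile.toFile(), "r")) {
            raf.seek(offset);
            byte[] buf = new byte[length];
            raf.readFully(buf);
            return buf;                      // the bytes of one logical sst file
        }
    }
}
```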

4. Advantages and disadvantages

Advantages: the pressure on HDFS is greatly reduced, including both RPC pressure and NameNode memory pressure.

Disadvantages: multi-threaded upload of state is not supported for now (state upload is currently not the bottleneck of checkpointing).

5. Results in the production environment

After the feature went online, the pressure on the NameNode dropped dramatically. The screenshots below come from an online production cluster: the numbers of file create and close RPCs dropped significantly, and RPC response times also dropped considerably, helping the clusters get through Double Eleven smoothly.

[Figures: NameNode metrics from the production cluster, showing file create/close RPC counts and RPC response times dropping after the rollout]

References

  • [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html
  • [2] https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html
  • [3] https://www.slideshare.net/dataArtisans/stephan-ewen-experiences-running-flink-at-very-large-scale
  • [4] https://issues.apache.org/jira/browse/FLINK-11937
  • [5] https://issues.apache.org/jira/browse/FLINK-8871