One. Flink time types
A record can be associated with three kinds of time: the time the data itself was generated, the time it entered the system, and the time Flink processed it. In Flink these correspond to three time attributes:
Event Time is the time at which each record occurred on the device that produced it. This time is usually embedded in the record before it enters Flink, so the timestamp can be extracted from the record itself. Event Time gives correct results even when events arrive out of order, arrive late, or are replayed from a backup or a persistent log. It is the most valuable of the three, and it is independent of the wall-clock time of any computer or operating system.
Processing Time is the system time of the machine that executes the corresponding operation. It is the simplest notion of time for a stream processing system: all time-based operations (e.g., a Time Window) use the system clock of the machine running the respective operator. However, in a distributed and asynchronous environment Processing Time is not deterministic. It is affected by the speed at which events reach the system (e.g., from a message queue) and by the order in which Flink processes data internally, so Processing Time cannot accurately reflect when the data actually occurred.
Ingestion Time is the time at which an event enters Flink. It is assigned at the Source operator, i.e., it is the time at which the Source obtained the record. Conceptually, Ingestion Time sits between Event Time and Processing Time. Because the timestamp is fixed once at the Source, it is not affected by how records are transferred and ordered inside the distributed Flink system, so it is relatively stable. Like Processing Time, however, Ingestion Time does not exactly reflect when the data actually occurred.
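To make the point about Event Time concrete, here is a minimal sketch in plain Java (no Flink dependency) that buckets records into fixed-size windows using only the timestamp embedded in each record. The class and method names (`EventTimeBuckets`, `windowStart`, `countPerWindow`) are made up for illustration and are not Flink APIs. Because the bucket depends only on the embedded timestamp, the counts are the same regardless of the order in which the records arrive.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: groups records by their embedded event time.
final class EventTimeBuckets {

    // Start of the fixed-size window that contains the given timestamp.
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, windowSizeMillis);
    }

    // Counts records per window using event time only, so the arrival
    // order of the list has no influence on the result.
    static Map<Long, Integer> countPerWindow(List<Long> eventTimesMillis, long windowSizeMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long eventTime : eventTimesMillis) {
            counts.merge(windowStart(eventTime, windowSizeMillis), 1, Integer::sum);
        }
        return counts;
    }
}
```

Doing the same grouping by Processing Time would replace the embedded timestamp with the machine clock read at processing time (for example `System.currentTimeMillis()`), which is why the result would then depend on arrival speed and ordering.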
Two. The Watermark mechanism
As described above, Event Time best reflects the time attribute of the data itself, but events may be delayed or arrive out of order, and Flink can only process records one at a time. How should delayed or out-of-order Event Time be handled?
For example, suppose we need to count how many times an event occurs between 10:00 and 11:00, that is, count the events whose Event Time falls between 10:00 and 11:00. Since events may be delayed or out of order, how does Flink determine that all events that occurred from 10:00 to 11:00 have arrived, so that the statistics can be emitted? Waiting too long delays the output and ties up more system resources.
A Watermark is a marker that travels alongside Event Time. Its content is a timestamp: when a Watermark with timestamp X arrives, it tells Flink that all data with Event Time less than X has arrived. In the example above, once Flink receives a Watermark with timestamp 11:01, it can emit the count of events whose Event Time lies in [10:00, 11:01) and release the resources held for them. Note that the window length also matters: a window is evaluated only once it has collected all of its data.
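The following sketch models this behaviour in plain Java. It is a simplified model under the semantics stated above ("all data with Event Time less than X has arrived"), not Flink's actual window operator; `WatermarkWindowCounter` and its methods are hypothetical names. Events are buffered per window, and a window is emitted and its state dropped only when a watermark reaches the end of that window.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model of event-time windows fired by watermarks.
final class WatermarkWindowCounter {
    private final long windowSizeMillis;
    // Open windows keyed by window start, holding the count seen so far.
    private final TreeMap<Long, Integer> openWindows = new TreeMap<>();

    WatermarkWindowCounter(long windowSizeMillis) {
        this.windowSizeMillis = windowSizeMillis;
    }

    // Buffers an event in the window that contains its event time.
    void onEvent(long eventTimeMillis) {
        long start = eventTimeMillis - Math.floorMod(eventTimeMillis, windowSizeMillis);
        openWindows.merge(start, 1, Integer::sum);
    }

    // A watermark X promises that no event with event time < X is still
    // to come, so every window ending at or before X is complete. Such
    // windows are emitted and their state is released.
    List<String> onWatermark(long watermarkMillis) {
        List<String> fired = new ArrayList<>();
        while (!openWindows.isEmpty()
                && openWindows.firstKey() + windowSizeMillis <= watermarkMillis) {
            Map.Entry<Long, Integer> window = openWindows.pollFirstEntry();
            long end = window.getKey() + windowSizeMillis;
            fired.add("[" + window.getKey() + "," + end + ") count=" + window.getValue());
        }
        return fired;
    }
}
```

With a window size of 100, feeding events 10, 50, 120, 30 and then a watermark of 90 emits nothing, because the window [0,100) might still receive data. A watermark of 100 emits that window with a count of 3, even though the event at 30 arrived after the event at 120.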
Three. Watermark generation
Periodic - a watermark is generated at a fixed time interval or after a certain number of records.
Punctuated - a watermark is generated from the event time according to some logic; for example, each received record generates a Watermark whose timestamp is the event time minus 5 seconds.
Whichever method is used, there must be a mechanism that guarantees the generated watermarks are monotonically increasing.
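Both styles can be sketched with one small generator in plain Java. This is an illustration, not Flink's generator interface; `BoundedDelayWatermarks` and its method names are hypothetical. It uses the "event time minus a fixed delay" rule from the Punctuated example, but applies it to the largest event time seen so far. Computing the watermark from each record's own event time would move the watermark backwards whenever an out-of-order record arrives, so tracking the maximum is one simple way to keep it monotonic.

```java
// Illustrative watermark generator with a fixed out-of-orderness bound.
final class BoundedDelayWatermarks {
    private final long maxDelayMillis;
    private long maxEventTimeSeen = Long.MIN_VALUE;
    private long lastWatermark = Long.MIN_VALUE;

    BoundedDelayWatermarks(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
    }

    // Punctuated style: call for every record and emit the returned watermark.
    long onEvent(long eventTimeMillis) {
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMillis);
        return currentWatermark();
    }

    // Periodic style: call from a timer at a fixed interval.
    long currentWatermark() {
        if (maxEventTimeSeen == Long.MIN_VALUE) {
            return lastWatermark; // nothing seen yet, avoid arithmetic underflow
        }
        // Never hand out a watermark smaller than one already emitted.
        lastWatermark = Math.max(lastWatermark, maxEventTimeSeen - maxDelayMillis);
        return lastWatermark;
    }
}
```

With a delay of 5000 ms, the events 10000, 8000, 12000 produce the watermarks 5000, 5000, 7000. The late record at 8000 does not pull the watermark back to 3000.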
Even with watermarks, what if real-world data cannot be guaranteed to respect them? For example, Flink has already processed the 11:01 watermark, but afterwards it encounters data whose event time lies between 10:00 and 11:00. First, if such cases are rare and do not affect the required accuracy, the data can simply be discarded; if they are relatively frequent, the watermark generation mechanism needs to be adjusted.
Besides discarding data that violates the watermark, there are also approaches that keep it, such as using some mechanism to update results that were emitted earlier. These approaches carry some performance overhead.
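The two options can be contrasted in a small plain-Java model. This is only a sketch under simplifying assumptions: `LateDataPolicy` is a hypothetical class, a record counts as late when the watermark has already passed the end of its window, and window counts are kept in memory indefinitely so that they can be corrected. That retained state is where the overhead of the update approach comes from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative handling of records that arrive after their window was emitted.
final class LateDataPolicy {
    private final long windowSizeMillis;
    private final boolean updateOnLate; // false: discard late data, true: correct old results
    private final Map<Long, Integer> counts = new TreeMap<>();
    private final List<String> corrections = new ArrayList<>();
    private long watermark = Long.MIN_VALUE;
    private int dropped = 0;

    LateDataPolicy(long windowSizeMillis, boolean updateOnLate) {
        this.windowSizeMillis = windowSizeMillis;
        this.updateOnLate = updateOnLate;
    }

    // Watermarks only move forward.
    void onWatermark(long watermarkMillis) {
        watermark = Math.max(watermark, watermarkMillis);
    }

    void onEvent(long eventTimeMillis) {
        long start = eventTimeMillis - Math.floorMod(eventTimeMillis, windowSizeMillis);
        // Late means the watermark already passed the end of this window.
        boolean late = start + windowSizeMillis <= watermark;
        if (late && !updateOnLate) {
            dropped++; // option 1: discard
            return;
        }
        int updated = counts.merge(start, 1, Integer::sum);
        if (late) {
            // Option 2: publish a corrected result for the old window.
            corrections.add("window " + start + " corrected to " + updated);
        }
    }

    int countFor(long windowStart) { return counts.getOrDefault(windowStart, 0); }
    int droppedCount() { return dropped; }
    List<String> corrections() { return corrections; }
}
```

Feeding the event 10, then a watermark of 100, then the event 20 leaves the discard variant with a count of 1 and one dropped record, while the update variant ends with a count of 2 and one correction.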
Four. Code sample
package org.tonny.flink.bi.job.water;