Flink's time semantics and a detailed explanation of Watermark, with examples

1: Introducing time semantics

Let us introduce time semantics with a small example. Many of us play games on the subway to pass the time on the way to and from work. Suppose a game awards n points for clearing n levels within one minute. Zhang San clears 3 levels in the first 45 seconds of the game. Then the subway enters a mountain tunnel and the signal is lost, and during the 35 seconds in the tunnel he clears 5 more levels. When the subway leaves the tunnel, the network recovers and the cached data is sent to the server.

Because there was no signal in the tunnel, the server did not receive the record of those 5 levels in time, and this exposes a problem with the "n points for n levels within one minute" rule. From the game company's side, the server has no record that Zhang San cleared those levels within the minute. From Zhang San's side, even if he did not clear all 5 levels in the remaining 15 seconds of the minute, he did clear levels during that time, in addition to the ones he cleared while he still had signal. The game company must therefore define a complete rule, or the user experience will suffer badly. This small example leads us to the time semantics in Flink.

1.1 Definition of time semantics

In stream processing, time is a core concept. The most important question is how to decide, based on time, which window each record falls into. Flink's stream processing supports several notions of time, as shown in the figure:

[Figure: the notions of time supported by Flink]

Event Time:

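Event Time is the time at which the event was created. It is usually described by a timestamp inside the event; for example, in collected log data every log record carries its own generation time. Flink accesses the event timestamp through a timestamp assigner.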

Ingestion Time:

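Ingestion Time is the time at which an event arrives at the Flink Source. There may be many computation steps between the Source and the downstream operators, and the processing speed of any operator can affect the Processing Time seen by the operators after it. Ingestion Time is defined as the moment the record first enters Flink, so it is not affected by operator processing speed.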

Processing Time:

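Processing Time is the local system time of each operator that performs a time-based operation. It depends on the machine, and it is the default time attribute.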
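To make the difference concrete, here is a minimal sketch in plain Scala (no Flink dependency) that assigns the same records to one-minute windows, first by the time the event happened and then by the time it arrived. The `Reading` class, its field names, and the sample values are hypothetical and serve only as illustration.

```scala
object TimeSemanticsSketch {
  // Hypothetical record: when the event happened vs. when the system saw it (both in seconds)
  final case class Reading(id: String, eventTime: Long, arrivalTime: Long)

  // Start of the fixed-size window that contains the given (non-negative) time
  def windowStart(time: Long, size: Long): Long = time - (time % size)

  // Count records per window; timeOf selects which time attribute drives the assignment
  def countPerWindow(rs: Seq[Reading], size: Long)(timeOf: Reading => Long): Map[Long, Int] =
    rs.groupBy(r => windowStart(timeOf(r), size)).map { case (w, g) => w -> g.size }

  // Made-up sample data: record "b" is delayed in transit by 35 seconds
  val readings: Seq[Reading] = Seq(
    Reading("a", eventTime = 10, arrivalTime = 11),
    Reading("b", eventTime = 50, arrivalTime = 85),
    Reading("c", eventTime = 70, arrivalTime = 71)
  )

  def main(args: Array[String]): Unit = {
    println(countPerWindow(readings, 60)(_.eventTime))   // grouped by when the events happened
    println(countPerWindow(readings, 60)(_.arrivalTime)) // grouped by when the events arrived
  }
}
```

Grouped by event time, the delayed record "b" still counts toward the first minute. Grouped by arrival time, it slips into the second minute, which is the situation Zhang San runs into in the game example.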

1.2 Advantages and disadvantages of the three time semantics:

Event Time:

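A Flink program based on Event Time must define the Event Time and how Watermarks are generated. We can use the time carried by the element itself, or assign an Event Time manually after the element arrives in Flink. The advantage of Event Time is that results are predictable (compare the game example above). The disadvantages are larger buffers, added latency, and more complex debugging and troubleshooting.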

Note: In Flink stream processing, most applications use EventTime. ProcessingTime or IngestionTime is generally a fallback, used only when EventTime is unavailable.

Ingestion Time:

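Ingestion Time is usually a compromise between Event Time and Processing Time. Compared with Event Time, Ingestion Time does not need a complex Watermark setup, so it needs less buffering and has lower latency. Compared with Processing Time, the Ingestion Time of an event is assigned by the Source and used from start to finish of the processing, and later operators are not affected by the processing speed of earlier ones. Results are therefore somewhat more accurate, at a slightly higher computational cost.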

Processing Time:

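Processing Time depends only on the system clock of the machine that is executing the operator. It does not rely on Watermarks and needs no buffering. Processing Time is very simple to implement and has the lowest latency of the three time semantics, but in practice we rarely use it.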

1.3 Enabling EventTime:

Enabling EventTime is simple: call the setStreamTimeCharacteristic method right after creating the environment. (This method is deprecated as of Flink 1.12, where event time became the default.)

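import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// From this call onward, every stream created by env carries this time characteristic
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)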

2: Watermark

2.1 Watermark arises from non-ideal conditions:

In stream processing, an event takes some time to travel from where it is generated, through the source, and on to the operator. In most cases the data reaches the operator in the order in which the events were generated, but the network, distribution, and similar factors can cause disorder. Disorder means that the order in which Flink receives events does not strictly follow the order of their event times.

This creates a problem. Once events can arrive out of order, a window driven only by eventTime cannot know whether all of its data has arrived, yet we cannot wait indefinitely. There must be a mechanism which guarantees that, after a specific time, the window is triggered and its computation runs. That mechanism is the Watermark.

2.2 Watermark overview:

1. Watermark is a mechanism for measuring the progress of Event Time.
2. Watermark is used to handle out-of-order events. Correct handling of out-of-order events is usually achieved by combining the Watermark mechanism with windows.
3. A Watermark in the data stream indicates that all data with a timestamp less than the Watermark has arrived. The execution of a window is therefore also triggered by the Watermark.
4. Watermark can be understood as a delayed trigger mechanism. We set a delay time t for the Watermark. The system tracks the maximum event time maxEventTime among the data that has arrived so far and assumes that all data with eventTime less than maxEventTime - t has arrived. Once maxEventTime - t reaches the end time of a window, that window is triggered.
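Point 4 can be sketched in a few lines of plain Scala. This is only a model of the rule "watermark = maxEventTime - t", not Flink's implementation: Flink emits watermarks periodically rather than after every record, and the names `BoundedDelayWatermark` and `watermarksFor` are made up for this illustration.

```scala
object WatermarkSketch {
  // Tracks the largest event time seen so far and derives the watermark as maxEventTime - delay
  final class BoundedDelayWatermark(delay: Long) {
    // Start high enough that the subtraction below cannot underflow before the first event
    private var maxEventTime: Long = Long.MinValue + delay

    def onEvent(eventTime: Long): Unit =
      maxEventTime = math.max(maxEventTime, eventTime)

    def current: Long = maxEventTime - delay
  }

  // Feed event times in arrival order and return the watermark after each record
  def watermarksFor(eventTimes: Seq[Long], delay: Long): Seq[Long] = {
    val wm = new BoundedDelayWatermark(delay)
    eventTimes.map { t => wm.onEvent(t); wm.current }
  }
}
```

For example, `WatermarkSketch.watermarksFor(Seq[Long](3, 4, 10, 13, 6, 14), 3)` yields 0, 1, 7, 10, 10, 11. The out-of-order record with event time 6 does not move the watermark backward, because only the maximum event time matters.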

2.3 Features of Watermark

1. A watermark is a special data record, essentially a timestamp, that is passed along the stream in the same way as business data. Its purpose is to measure the progress of event time.

2. The watermark must increase monotonically, so that a task's event-time clock only moves forward, never backward.

3. A watermark is tied to the timestamps of the data.

2.4 Watermark illustrated

[Figure: graphical illustration of Watermark]

Note: A Watermark acts as the "closing time" that triggers the preceding window. Once the window is closed, all data whose timestamps fall within the window's range is included in the window's computation.
As long as the watermark has not reached that point, the window is not triggered, no matter how far actual (wall-clock) time advances.
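The "closing time" rule can be written down as two small functions. This is a sketch in plain Scala under two assumptions: tumbling windows aligned to zero with non-negative timestamps, and a window [start, start + size) that fires once the watermark reaches its last timestamp, start + size - 1, which reflects my understanding of Flink's event-time trigger. The function names are hypothetical.

```scala
object WindowTriggerSketch {
  // Start of the tumbling window containing `timestamp` (milliseconds, zero offset assumed)
  def windowStart(timestamp: Long, size: Long): Long = timestamp - (timestamp % size)

  // The window [start, start + size) fires once the watermark reaches its last timestamp
  def fires(start: Long, size: Long, watermark: Long): Boolean =
    watermark >= start + size - 1
}
```

With a 10-second window, a record stamped 1609745003000 ms lands in the window starting at 1609745000000 ms. That window does not fire at a watermark of 1609745007000 ms, but it does fire once the watermark reaches 1609745010000 ms, however much wall-clock time passes in between.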

2.5 Using Watermark

Flink already encapsulates most of the Watermark machinery for us. We only need to set the time semantics to EventTime, call the assignTimestampsAndWatermarks method, and supply an implementation of the TimestampAssigner interface. (Alternatively, assignAscendingTimestamps extracts timestamps directly from ascending data. It suits data that arrives in ideal order. Because real data suffers from network delay and similar problems, this method is rarely used in practice.)

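val env = StreamExecutionEnvironment.getExecutionEnvironment
// Set the time semantics to EventTime (the default is ProcessingTime)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// dataStream is a DataStream[thermometer]; the thermometer case class is defined in section 2.6
dataStream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[thermometer](Time.milliseconds(50)) {
  // The time field holds seconds, while Flink expects timestamps in milliseconds
  override def extractTimestamp(element: thermometer): Long = {
    element.time * 1000L
  }
})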

Here Time.milliseconds(50) is the delay time t described in the Watermark overview above. The inheritance diagram of the TimestampAssigner interface is as follows:

[Figure: inheritance diagram of the TimestampAssigner interface]

2.6 Window test based on time semantics

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.{EventTimeSessionWindows, SlidingEventTimeWindows, TumblingEventTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time


// Case class for one thermometer reading: id, event time in seconds, temperature
case class thermometer(id: String, time: Long, Temp: Double)

object Time_window {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // Set the time semantics to EventTime (the default is ProcessingTime)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // Emit a periodic watermark every 50 ms
    env.getConfig.setAutoWatermarkInterval(50)

    // Read data from a socket text stream
    val inputStream = env.socketTextStream("hadoop102", 7777)

    // First convert each line into the case class
    val dataStream = inputStream
      .map(data => {
        val arr = data.split(",")
        thermometer(arr(0), arr(1).toLong, arr(2).toDouble)
      })
      // .assignAscendingTimestamps(_.time * 1000L)  // Extracts timestamps from ascending data; meant for ideally ordered data.
      // No Watermark delay needs to be defined here, because the Watermark is taken directly from the timestamps of the incoming data.

      // Out-of-order data: the watermark lags the largest timestamp seen so far by 3 seconds
      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[thermometer](Time.seconds(3)) {
        // The time field holds seconds, while Flink expects milliseconds
        override def extractTimestamp(element: thermometer): Long = element.time * 1000L
      })

    val latetag = new OutputTag[(String, Double, Long)]("late")

    val res = dataStream
      .map(data => (data.id, data.Temp, data.time))
      .keyBy(_._1) // group by id
//      .window(TumblingEventTimeWindows.of(Time.seconds(15)))  // underlying tumbling window implementation
//      .window(SlidingEventTimeWindows.of(Time.seconds(15), Time.milliseconds(3)))  // underlying sliding window implementation
//      .window(EventTimeSessionWindows.withGap(Time.seconds(15)))  // session window
//      .countWindow(10)  // tumbling count window
//      .countWindow(10, 2)  // sliding count window
      .timeWindow(Time.seconds(10)) // Flink's shorthand for a tumbling (one argument) or sliding (two arguments) time window
      .allowedLateness(Time.minutes(1)) // accept data that is up to 1 minute late
      .sideOutputLateData(latetag)
      // Every 10 s, compute the minimum temperature of each thermometer and keep the newest record's timestamp
      .reduce((currdata, newdata) => (currdata._1, currdata._2.min(newdata._2), newdata._3))

    res.getSideOutput(latetag).print("late")
    res.print("result")


    env.execute("EventTime_Tumblingwindow test")

  }
}

Thermometer data used in the test

1,1609745003,10.6
1,1609745004,8.2
1,1609745010,25.6
4,1609745003,24.7
5,1609745005,18.4
1,1609745013,7.2
1,1609745006,1.2
1,1609745014,35.4
1,1609745015,17.4
1,1609745023,22.9
1,1609745007,1.6
1,1609745100,7.0

In what follows, timestamps are written by their last three digits (for example, 013 stands for 1609745013). From the input data, the first window covers [000, 010). With the 3-second delay, the record with timestamp 013 advances the Watermark to exactly 010, which triggers the computation for the data in the [000, 010) bucket.

The next window covers [010, 020). When the record with timestamp 023 arrives, the Watermark reaches 020 and the computation for the [010, 020) bucket is triggered. For this window the lowest temperature is 7.2, which comes from the record with timestamp 013. Window [000, 010) had already emitted its result when the record with timestamp 006 arrived. Because the code sets allowedLateness(Time.minutes(1)), which accepts data that is up to 1 minute late, the minimum temperature of window [000, 010) is updated to 1.2, and the timestamp in the result changes to that of the newest record, as the reduce function in the code specifies.

The record with timestamp 007 also belongs to the earlier window [000, 010). Its temperature of 1.6 is not lower than the current minimum of 1.2, so the minimum temperature stays at 1.2, but the timestamp in the result changes to the newest one, 007.
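The walkthrough above can be replayed without a cluster. The following plain-Scala sketch is a simplified model, not Flink itself: it handles only thermometer "1", shortens timestamps to their last three digits in whole seconds, advances the watermark after every record instead of periodically, and compares against the window end directly (for whole-second timestamps this is equivalent to comparing millisecond values against end - 1 ms). The names `LatenessReplay`, `Rec`, `Fired`, and `replay` are made up for this illustration.

```scala
import scala.collection.mutable

object LatenessReplay {
  // One reading of thermometer "1"; ts is the epoch second shortened to its last three digits
  final case class Rec(ts: Long, temp: Double)
  // One emitted window result: window start, minimum temperature, timestamp of the newest record
  final case class Fired(windowStart: Long, minTemp: Double, lastTs: Long)

  val size = 10L     // window length in seconds
  val delay = 3L     // watermark delay in seconds
  val lateness = 60L // allowed lateness in seconds

  def replay(records: Seq[Rec]): Seq[Fired] = {
    val state = mutable.Map.empty[Long, (Double, Long)] // window start -> (min temp, newest ts)
    val out = mutable.ArrayBuffer.empty[Fired]
    var watermark = Long.MinValue

    for (r <- records) {
      val start = r.ts - r.ts % size
      val end = start + size
      if (watermark < end + lateness) { // otherwise the record would go to the late side output
        val minTemp = state.get(start).map(p => math.min(p._1, r.temp)).getOrElse(r.temp)
        state(start) = (minTemp, r.ts) // same shape as the reduce function in the test
        if (watermark >= end) out += Fired(start, minTemp, r.ts) // window already fired: fire again
      }
      val next = math.max(watermark, r.ts - delay)
      // First firing of every window whose end the watermark has just reached
      for (s <- state.keys.toSeq.sorted if s + size > watermark && s + size <= next)
        out += Fired(s, state(s)._1, state(s)._2)
      watermark = next
    }
    out.toSeq
  }

  // Records of thermometer "1" from the test data, in arrival order
  val input: Seq[Rec] = Seq(
    Rec(3, 10.6), Rec(4, 8.2), Rec(10, 25.6), Rec(13, 7.2), Rec(6, 1.2),
    Rec(14, 35.4), Rec(15, 17.4), Rec(23, 22.9), Rec(7, 1.6), Rec(100, 7.0))
}
```

Under these assumptions, `LatenessReplay.replay(LatenessReplay.input)` emits five results in order: window 000 with minimum 8.2 (triggered by record 013), window 000 again with 1.2 and timestamp 006, window 010 with 7.2 (triggered by record 023), window 000 again with 1.2 and timestamp 007, and window 020 with 22.9 (triggered by record 100). Note that the result for window 010 carries timestamp 015, the newest record in that window, because the reduce function always keeps the timestamp of the latest record.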

Origin blog.csdn.net/weixin_44080445/article/details/112131990