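(1) Categories of time
Stream processing in Flink involves several distinct notions of time, as shown in the figure below:
Event time (EventTime): the time at which the event actually occurred at its source
Ingestion time (IngestionTime): the time at which the event arrives in Flink
Processing time (ProcessingTime): the time at which the event is actually processed by an operator
Of these three, event time is the one we care about most.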
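(2) Watermark in detail
(2.1) Watermark illustrated
(2.2) What is a watermark?
A watermark is an extra piece of time information that travels along with the data; in other words, a watermark is a timestamp!
(2.3) How is the watermark computed?
Watermark = largest event time seen so far - maximum allowed delay (out-of-orderness)
Because the formula uses the largest event time seen so far, the watermark only ever rises (grows); it never falls.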
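The formula above can be sketched in plain Java with no Flink dependency. The class name `WatermarkTracker` and its `onEvent` method are made up for this illustration; Flink's own generators follow the same idea but are not reproduced here.

```java
// Illustrative model of "watermark = max event time - max delay"; not Flink's API.
class WatermarkTracker {
    private final long maxDelayMs;   // maximum allowed delay (out-of-orderness)
    private long maxEventTimeMs;     // largest event time seen so far

    WatermarkTracker(long maxDelayMs) {
        this.maxDelayMs = maxDelayMs;
        // Start low enough that subtracting the delay cannot underflow.
        this.maxEventTimeMs = Long.MIN_VALUE + maxDelayMs;
    }

    /** Records one event and returns the watermark after seeing it. */
    long onEvent(long eventTimeMs) {
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        return maxEventTimeMs - maxDelayMs;
    }

    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(2_000L);
        long[] eventTimes = {1_000L, 5_000L, 3_000L, 9_000L};
        for (long t : eventTimes) {
            System.out.println("event=" + t + " watermark=" + tracker.onEvent(t));
        }
    }
}
```

The out-of-order event at 3000 ms leaves the watermark at 3000 ms, which is why the watermark can stall but never move backwards.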
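(2.4) What is a watermark for?
Previously, windows were triggered by system time. For example, the window [10:00:00, 10:00:10) fired as soon as the system clock reached 10:00:10, so data that arrived late could be lost. With watermarks, a window can be triggered by the watermark instead.
In other words, watermarks are what trigger window computation!
(2.5) How does a watermark trigger window computation?
A window computation is triggered when both of these conditions hold:
- the window contains data
- Watermark >= window end time
Note:
The trigger condition can be rearranged as follows:
Watermark >= window end time
Watermark = largest event time seen so far - maximum allowed delay (out-of-orderness)
largest event time seen so far - maximum allowed delay (out-of-orderness) >= window end time
largest event time seen so far >= window end time + maximum allowed delay (out-of-orderness)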
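As a rough check of the rearranged condition, the sketch below computes which tumbling window a timestamp falls into and whether that window would fire. `WindowTrigger`, `windowStart` and `shouldFire` are hypothetical names, and the rule coded here is the simplified one from the text, not Flink's trigger implementation.

```java
// Illustrative check of the trigger rule described above; not Flink's API.
class WindowTrigger {
    /** Start of the tumbling window (zero offset) that contains the timestamp. */
    static long windowStart(long timestampMs, long sizeMs) {
        // floorMod keeps the result correct for negative timestamps as well.
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** Fires when the window holds data and watermark >= window end. */
    static boolean shouldFire(boolean hasData, long maxEventTimeMs, long maxDelayMs, long windowEndMs) {
        long watermark = maxEventTimeMs - maxDelayMs;
        return hasData && watermark >= windowEndMs;
    }

    public static void main(String[] args) {
        long size = 5_000L;
        long delay = 2_000L;
        long start = windowStart(3_000L, size);   // window [0, 5000)
        long end = start + size;
        System.out.println("window=[" + start + "," + end + ")");
        System.out.println("max event time 6000 fires: " + shouldFire(true, 6_000L, delay, end));
        System.out.println("max event time 7000 fires: " + shouldFire(true, 7_000L, delay, end));
    }
}
```

With a 2-second delay, the window ending at 5000 ms fires only once an event with time 7000 ms or later has been seen, which is exactly "window end time + maximum allowed delay".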
Watermark API:
- https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/dev/event_timestamps_watermarks.html
(3) Using EventTime and Watermark
Flink provides two built-in watermark generators:
- Monotonously Increasing Timestamps (timestamps only ever increase; effectively an allowed delay of 0)
WatermarkStrategy.<WaterSensor>forMonotonousTimestamps()
- Fixed Amount of Lateness (a fixed amount of delay is tolerated)
WatermarkStrategy.<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(2))
(3.1) Testing the watermark mechanism with an event-time tumbling window
Code:
package com.aikfk.flink.datastream.bean;

/**
 * @author :caizhengjie
 * @description:TODO
 * @date :2021/3/20 9:19 PM
 * Water-level sensor: receives water-level readings
 * <p>
 * id: sensor ID
 * ts: timestamp
 * vc: water level
 */
public class WaterSensor {
    private String id;
    private Long ts;
    private Integer vc;

    // Public no-arg constructor, needed for Flink to treat this class as a POJO
    public WaterSensor() {
    }

    public WaterSensor(String id, Long ts, Integer vc) {
        this.id = id;
        this.ts = ts;
        this.vc = vc;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public Long getTs() {
        return ts;
    }

    public void setTs(Long ts) {
        this.ts = ts;
    }

    public Integer getVc() {
        return vc;
    }

    public void setVc(Integer vc) {
        this.vc = vc;
    }

    @Override
    public String toString() {
        return "WaterSensor{" +
                "id='" + id + '\'' +
                ", ts=" + ts +
                ", vc=" + vc +
                '}';
    }
}
package com.aikfk.flink.datastream.watermark;

import com.aikfk.flink.datastream.bean.WaterSensor;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

import java.time.Duration;

/**
 * @author :caizhengjie
 * @description:Test the watermark mechanism with an event-time tumbling window
 * @date :2021/3/20 9:21 PM
 */
public class EventTimeTumbling {
    public static void main(String[] args) throws Exception {
        // 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 2. Read lines from the socket and convert them to JavaBeans
        SingleOutputStreamOperator<WaterSensor> waterSensorDS = env.socketTextStream("bigdata-pro-m07", 9999)
                .map(new MapFunction<String, WaterSensor>() {
                    @Override
                    public WaterSensor map(String s) throws Exception {
                        String[] split = s.split(",");
                        return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
                    }
                });

        // 3. Extract the timestamp field and generate watermarks
        SingleOutputStreamOperator<WaterSensor> waterSensorSingleOutputStreamOperator = waterSensorDS
                .assignTimestampsAndWatermarks(WatermarkStrategy
                        // Maximum allowed delay (out-of-orderness)
                        .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                        // Event-time field: ts is in seconds, Flink expects milliseconds
                        .withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>() {
                            @Override
                            public long extractTimestamp(WaterSensor element, long recordTimestamp) {
                                return element.getTs() * 1000L;
                            }
                        }));

        // 4. Key by id
        KeyedStream<WaterSensor, String> keyedStream = waterSensorSingleOutputStreamOperator.keyBy(WaterSensor::getId);

        // 5. Open a 5-second tumbling event-time window
        WindowedStream<WaterSensor, String, TimeWindow> window = keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)));

        // 6. Sum vc within each window
        SingleOutputStreamOperator<WaterSensor> result = window.reduce(new ReduceFunction<WaterSensor>() {
            @Override
            public WaterSensor reduce(WaterSensor t1, WaterSensor t2) throws Exception {
                return new WaterSensor(t1.getId(), t1.getTs(), t1.getVc() + t2.getVc());
            }
        });

        // 7. Print
        result.print();

        // 8. Execute the job
        env.execute();
    }
}
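Test with in-order data:
ws_001,1577844001,1
ws_001,1577844002,1
ws_001,1577844003,1
ws_001,1577844005,1
ws_001,1577844006,1
ws_001,1577844009,1
Output:
WaterSensor{id='ws_001', ts=1577844001, vc=3}
Explanation of the run:
The tumbling window covers 5 seconds of event time, left-closed and right-open: [0,5). Records with event times of 1 to 3 seconds fall into [0,5). The records at 5 and 6 seconds fall into [5,10) and only move the watermark to 3 and 4 seconds, so nothing fires yet. When a record with event time t arrives (say 9 seconds), with a maximum allowed delay of 2 seconds the watermark becomes 7 seconds. Since wm >= 5 seconds, the end of the window, [0,5) fires and the result is vc = 3.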
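Test with out-of-order data:
ws_001,1577844001,1
ws_001,1577844002,1
ws_001,1577844003,1
ws_001,1577844005,1
ws_001,1577844001,1
ws_001,1577844002,1
ws_001,1577844009,1
Output:
WaterSensor{id='ws_001', ts=1577844001, vc=5}
Explanation of the run:
The tumbling window covers 5 seconds of event time, left-closed and right-open: [0,5). Records with event times of 1 to 3 seconds fall into [0,5). Next comes the record at 5 seconds, which falls into [5,10). After that, records at 1 and 2 seconds arrive again; these are late records. When the 5-second record arrived, the watermark was only 3 seconds, below the end of the window, so [0,5) had not been closed and the late 1- and 2-second records still fall into [0,5). When a record with event time t arrives (say 9 seconds), with a maximum allowed delay of 2 seconds the watermark becomes 7 seconds. Since wm >= 5 seconds, the end of the window, [0,5) fires and the result is vc = 5.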
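The two runs above can be replayed without a cluster. The following is a simplified plain-Java model, assuming one key, timestamps in seconds relative to the window origin, and a watermark equal to the largest event time minus the delay; `TumblingWindowSim` and `run` are names invented for this sketch, and Flink's real operator is considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified single-key model of a tumbling event-time window; not Flink's implementation.
class TumblingWindowSim {
    /** Replays {timestampSec, vc} pairs and returns one "[start,end) sum=N" line per fired window. */
    static List<String> run(long[][] events, long sizeSec, long maxDelaySec) {
        TreeMap<Long, Long> sums = new TreeMap<>();   // window start -> running vc sum
        List<String> fired = new ArrayList<>();
        long watermark = Long.MIN_VALUE;
        for (long[] e : events) {
            long ts = e[0];
            long vc = e[1];
            long start = ts - Math.floorMod(ts, sizeSec);
            if (start + sizeSec <= watermark) {
                continue;                              // window already fired: the record is dropped
            }
            sums.merge(start, vc, Long::sum);
            watermark = Math.max(watermark, ts - maxDelaySec);
            // Fire every open window whose end is <= watermark, oldest first.
            while (!sums.isEmpty() && sums.firstKey() + sizeSec <= watermark) {
                Map.Entry<Long, Long> w = sums.pollFirstEntry();
                fired.add("[" + w.getKey() + "," + (w.getKey() + sizeSec) + ") sum=" + w.getValue());
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        long[][] inOrder = {{1, 1}, {2, 1}, {3, 1}, {5, 1}, {6, 1}, {9, 1}};
        long[][] outOfOrder = {{1, 1}, {2, 1}, {3, 1}, {5, 1}, {1, 1}, {2, 1}, {9, 1}};
        System.out.println(run(inOrder, 5, 2));
        System.out.println(run(outOfOrder, 5, 2));
    }
}
```

In both replays only [0,5) fires, with a sum of 3 for the in-order data and 5 for the out-of-order data; [5,10) stays open because the watermark never reaches 10.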
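(3.2) Testing allowed lateness (allowedLateness) and side output (sideOutput) with an event-time tumbling window
Even with watermarks in place, some data may still arrive late. What then? Flink windows can also tolerate late data.
When the window computation is triggered, the current result is computed first, but the window is not closed at that point. From then on, every late record triggers another computation of the window it belongs to (incremental computation).
So when is the window actually closed? Once the watermark reaches window end time + allowed lateness.
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.allowedLateness(Time.seconds(3))
Note: allowed lateness only applies to event-time windows.
Even with allowed lateness the window is eventually closed. What if records still arrive after that? Flink provides side outputs for handling data that arrives after the window has closed.
Code: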
package com.aikfk.flink.datastream.watermark;

import com.aikfk.flink.datastream.bean.WaterSensor;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.OutputTag;

import java.time.Duration;

/**
 * @author :caizhengjie
 * @description:Test allowed lateness and side output with an event-time tumbling window
 * @date :2021/3/20 9:21 PM
 */
public class LateAndSideOutPut {
    public static void main(String[] args) throws Exception {
        // 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 2. Read lines from the socket and convert them to JavaBeans
        SingleOutputStreamOperator<WaterSensor> waterSensorDS = env.socketTextStream("bigdata-pro-m07", 9999)
                .map(new MapFunction<String, WaterSensor>() {
                    @Override
                    public WaterSensor map(String s) throws Exception {
                        String[] split = s.split(",");
                        return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
                    }
                });

        // 3. Extract the timestamp field and generate watermarks
        SingleOutputStreamOperator<WaterSensor> waterSensorSingleOutputStreamOperator = waterSensorDS
                .assignTimestampsAndWatermarks(WatermarkStrategy
                        // Maximum allowed delay (out-of-orderness)
                        .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                        // Event-time field: ts is in seconds, Flink expects milliseconds
                        .withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>() {
                            @Override
                            public long extractTimestamp(WaterSensor element, long recordTimestamp) {
                                return element.getTs() * 1000L;
                            }
                        }));

        // 4. Key by id
        KeyedStream<WaterSensor, String> keyedStream = waterSensorSingleOutputStreamOperator.keyBy(WaterSensor::getId);

        // 5. Open the window, allow late data, and send data that is too late to a side output.
        //    The tag is created once as an anonymous subclass so that Flink can capture the element type.
        OutputTag<WaterSensor> lateTag = new OutputTag<WaterSensor>("Side") {
        };
        WindowedStream<WaterSensor, String, TimeWindow> window = keyedStream
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .allowedLateness(Time.seconds(2))
                .sideOutputLateData(lateTag);

        // 6. Sum vc within each window
        SingleOutputStreamOperator<WaterSensor> result = window.reduce(new ReduceFunction<WaterSensor>() {
            @Override
            public WaterSensor reduce(WaterSensor t1, WaterSensor t2) throws Exception {
                return new WaterSensor(t1.getId(), t1.getTs(), t1.getVc() + t2.getVc());
            }
        });

        // Records that arrived after their window had been closed
        DataStream<WaterSensor> sideOutput = result.getSideOutput(lateTag);

        // 7. Print
        result.print();
        sideOutput.print("Side");

        // 8. Execute the job
        env.execute();
    }
}
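Test data:
ws_001,1577844001,1
ws_001,1577844002,1
ws_001,1577844003,1
ws_001,1577844008,1
ws_001,1577844001,1
ws_001,1577844002,1
ws_001,1577844003,1
ws_001,1577844009,1
ws_001,1577844001,1
ws_001,1577844002,1
Output:
WaterSensor{id='ws_001', ts=1577844001, vc=3}
WaterSensor{id='ws_001', ts=1577844001, vc=4}
WaterSensor{id='ws_001', ts=1577844001, vc=5}
WaterSensor{id='ws_001', ts=1577844001, vc=6}
Side> WaterSensor{id='ws_001', ts=1577844001, vc=1}
Side> WaterSensor{id='ws_001', ts=1577844002, vc=1}
Explanation of the run:
The tumbling window covers 5 seconds of event time, left-closed and right-open: [0,5). Records with event times of 1 to 3 seconds fall into [0,5). Then the 8-second record arrives. With a maximum allowed delay of 2 seconds the watermark is now 6 seconds, which is past the end of the window, so the window fires and the 8-second record produces vc = 3. Because allowed lateness is enabled and set to 2 seconds, the window is not closed yet; it stays open until the watermark reaches 7 seconds. The late records at 1, 2 and 3 seconds that follow therefore still fall into [0,5), and each late record triggers one more computation of its window (incremental computation), giving vc = 4, 5 and 6. When a record with event time t arrives (say 9 seconds), the watermark becomes 7 seconds, so wm >= window end time + allowed lateness and the window is closed. The 1- and 2-second records that arrive afterwards no longer fall into [0,5); they are handled by the side output, which receives data that arrives after the window has closed.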
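To make the three stages (first firing, late firings, side output) easier to follow, here is a simplified plain-Java replay of the test data. It assumes a single key, vc = 1 for every record, timestamps in seconds relative to the window origin, and watermark = largest event time - delay. `AllowedLatenessSim` and its fields are invented for this sketch and do not correspond to Flink classes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Simplified single-key model of allowedLateness + side output; not Flink's implementation.
class AllowedLatenessSim {
    final List<String> results = new ArrayList<>();     // main output: one line per window firing
    final List<String> sideOutput = new ArrayList<>();  // records that arrived after their window closed

    void run(long[] eventTimesSec, long sizeSec, long maxDelaySec, long latenessSec) {
        TreeMap<Long, Integer> counts = new TreeMap<>(); // window start -> number of records so far
        Set<Long> fired = new HashSet<>();               // windows that already produced a first result
        long watermark = Long.MIN_VALUE;
        for (long ts : eventTimesSec) {
            long start = ts - Math.floorMod(ts, sizeSec);
            if (watermark >= start + sizeSec + latenessSec) {
                sideOutput.add("ts=" + ts);              // window closed: goes to the side output
                continue;
            }
            int vc = counts.merge(start, 1, Integer::sum);
            if (fired.contains(start)) {
                results.add(format(start, sizeSec, vc)); // late record: window fires again
            }
            watermark = Math.max(watermark, ts - maxDelaySec);
            final long wm = watermark;
            for (Map.Entry<Long, Integer> w : counts.entrySet()) {
                // First firing: watermark reached the window end and the window has not fired yet.
                if (wm >= w.getKey() + sizeSec && fired.add(w.getKey())) {
                    results.add(format(w.getKey(), sizeSec, w.getValue()));
                }
            }
            // Close windows whose end + allowed lateness has been reached.
            counts.keySet().removeIf(s -> wm >= s + sizeSec + latenessSec);
            fired.removeIf(s -> wm >= s + sizeSec + latenessSec);
        }
    }

    private static String format(long start, long sizeSec, int vc) {
        return "[" + start + "," + (start + sizeSec) + ") vc=" + vc;
    }

    public static void main(String[] args) {
        AllowedLatenessSim sim = new AllowedLatenessSim();
        sim.run(new long[]{1, 2, 3, 8, 1, 2, 3, 9, 1, 2}, 5, 2, 2);
        System.out.println("results    = " + sim.results);
        System.out.println("sideOutput = " + sim.sideOutput);
    }
}
```

The replay yields four firings of [0,5) with vc = 3, 4, 5, 6 and two side-output records (ts = 1 and ts = 2), matching the shape of the job output above.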
(3.3) Testing the watermark mechanism with an event-time sliding window
Code:
package com.aikfk.flink.datastream.watermark;

import com.aikfk.flink.datastream.bean.WaterSensor;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

import java.time.Duration;

/**
 * @author :caizhengjie
 * @description:Test the watermark mechanism with an event-time sliding window
 * @date :2021/3/20 9:21 PM
 */
public class EventTimeSliding {
    public static void main(String[] args) throws Exception {
        // 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 2. Read lines from the socket and convert them to JavaBeans
        SingleOutputStreamOperator<WaterSensor> waterSensorDS = env.socketTextStream("bigdata-pro-m07", 9999)
                .map(new MapFunction<String, WaterSensor>() {
                    @Override
                    public WaterSensor map(String s) throws Exception {
                        String[] split = s.split(",");
                        return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
                    }
                });

        // 3. Extract the timestamp field and generate watermarks
        SingleOutputStreamOperator<WaterSensor> waterSensorSingleOutputStreamOperator = waterSensorDS
                .assignTimestampsAndWatermarks(WatermarkStrategy
                        // Maximum allowed delay (out-of-orderness)
                        .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                        // Event-time field: ts is in seconds, Flink expects milliseconds
                        .withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>() {
                            @Override
                            public long extractTimestamp(WaterSensor element, long recordTimestamp) {
                                return element.getTs() * 1000L;
                            }
                        }));

        // 4. Key by id
        KeyedStream<WaterSensor, String> keyedStream = waterSensorSingleOutputStreamOperator.keyBy(WaterSensor::getId);

        // 5. Open a sliding event-time window: size 6 seconds, slide 2 seconds
        WindowedStream<WaterSensor, String, TimeWindow> window = keyedStream.window(SlidingEventTimeWindows.of(Time.seconds(6), Time.seconds(2)));

        // 6. Sum vc within each window
        SingleOutputStreamOperator<WaterSensor> result = window.reduce(new ReduceFunction<WaterSensor>() {
            @Override
            public WaterSensor reduce(WaterSensor t1, WaterSensor t2) throws Exception {
                return new WaterSensor(t1.getId(), t1.getTs(), t1.getVc() + t2.getVc());
            }
        });

        // 7. Print
        result.print();

        // 8. Execute the job
        env.execute();
    }
}
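Test data:
ws_001,1577844001,1
ws_001,1577844008,1
ws_001,1577844012,1
Output:
WaterSensor{id='ws_001', ts=1577844001, vc=1}
WaterSensor{id='ws_001', ts=1577844001, vc=1}
WaterSensor{id='ws_001', ts=1577844001, vc=1}
WaterSensor{id='ws_001', ts=1577844008, vc=1}
Explanation of the run:
The sliding window has a size of 6 seconds and a slide of 2 seconds. A record with event time 1 second belongs to three windows: [-4,2), [-2,4) and [0,6). When a record with event time 8 seconds arrives, the watermark is 6 seconds, which is >= the end of [0,6) and of the two earlier windows, so all three windows fire and are closed, and three results are printed at once. The 8-second record itself belongs to [4,10), [6,12) and [8,14). To get one more result, input a record with event time 12 seconds: the watermark becomes 10 seconds, which is >= the end of [4,10), so that window fires and produces one result.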
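The window membership used in this explanation can be computed directly. The sketch below is a plain-Java illustration under the assumption of a zero window offset; `SlidingWindowAssigner` and `assign` are names chosen here, not Flink's API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative computation of sliding-window membership; not Flink's API.
class SlidingWindowAssigner {
    /** Returns every [start,end) window that contains the timestamp, latest start first. */
    static List<String> assign(long ts, long size, long slide) {
        List<String> windows = new ArrayList<>();
        // Latest window start that is <= ts and aligned to the slide.
        long lastStart = ts - Math.floorMod(ts, slide);
        // Walk backwards one slide at a time while the window still covers ts.
        for (long start = lastStart; start > ts - size; start -= slide) {
            windows.add("[" + start + "," + (start + size) + ")");
        }
        return windows;
    }

    public static void main(String[] args) {
        System.out.println("ts=1 -> " + assign(1, 6, 2));
        System.out.println("ts=8 -> " + assign(8, 6, 2));
    }
}
```

Each record lands in size / slide = 3 windows, which is why the single record at 1 second yields three separate results once the watermark passes 6 seconds.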
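(3.4) Testing the watermark mechanism with an event-time session window
Session gap: a session window closes once the watermark reaches the event time of the last record in the session plus the gap. The gap is inclusive, so a record that arrives exactly one gap after the previous one still joins the same session.
Code: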
package com.aikfk.flink.datastream.watermark;

import com.aikfk.flink.datastream.bean.WaterSensor;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

import java.time.Duration;

/**
 * @author :caizhengjie
 * @description:Test the watermark mechanism with an event-time session window
 * @date :2021/3/20 9:21 PM
 */
public class EventTimeSession {
    public static void main(String[] args) throws Exception {
        // 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 2. Read lines from the socket and convert them to JavaBeans
        SingleOutputStreamOperator<WaterSensor> waterSensorDS = env.socketTextStream("bigdata-pro-m07", 9999)
                .map(new MapFunction<String, WaterSensor>() {
                    @Override
                    public WaterSensor map(String s) throws Exception {
                        String[] split = s.split(",");
                        return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
                    }
                });

        // 3. Extract the timestamp field and generate watermarks
        SingleOutputStreamOperator<WaterSensor> waterSensorSingleOutputStreamOperator = waterSensorDS
                .assignTimestampsAndWatermarks(WatermarkStrategy
                        // Maximum allowed delay (out-of-orderness)
                        .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                        // Event-time field: ts is in seconds, Flink expects milliseconds
                        .withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>() {
                            @Override
                            public long extractTimestamp(WaterSensor element, long recordTimestamp) {
                                return element.getTs() * 1000L;
                            }
                        }));

        // 4. Key by id
        KeyedStream<WaterSensor, String> keyedStream = waterSensorSingleOutputStreamOperator.keyBy(WaterSensor::getId);

        // 5. Open a session window with a 5-second gap; the session closes once the
        //    watermark reaches the last record's event time + gap (gap is inclusive)
        WindowedStream<WaterSensor, String, TimeWindow> window = keyedStream.window(EventTimeSessionWindows.withGap(Time.seconds(5)));

        // 6. Sum vc within each window
        SingleOutputStreamOperator<WaterSensor> result = window.reduce(new ReduceFunction<WaterSensor>() {
            @Override
            public WaterSensor reduce(WaterSensor t1, WaterSensor t2) throws Exception {
                return new WaterSensor(t1.getId(), t1.getTs(), t1.getVc() + t2.getVc());
            }
        });

        // 7. Print
        result.print();

        // 8. Execute the job
        env.execute();
    }
}
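Test data:
ws_001,1577844002,1
ws_001,1577844007,1
ws_001,1577844014,1
Output:
WaterSensor{id='ws_001', ts=1577844002, vc=2}
Explanation of the run:
The session gap is set to 5 seconds. The first record has event time 2 seconds and the second 7 seconds; neither triggers the window, because a session window fires only when watermark >= event time of the last record in the session + gap (5 seconds). Since the gap is inclusive, the 7-second record joins the same session as the 2-second record. When a record with event time 14 seconds arrives, the watermark is 12 seconds, which is >= 7 + 5, so the window fires and produces a single result that sums the two records (vc = 2).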
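How records are grouped into sessions can be sketched offline. The code below is a batch-style illustration with invented names (`SessionWindowSketch`, `sessions`); it treats the gap as inclusive, as observed in the run above, and does not model the watermark, so it lists every session, including ones that the streaming job has not fired yet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Batch-style illustration of session grouping with an inclusive gap; not Flink's API.
class SessionWindowSketch {
    /** Groups event times into sessions and returns "[start,end) count=N" per session. */
    static List<String> sessions(long[] eventTimes, long gap) {
        List<String> out = new ArrayList<>();
        if (eventTimes.length == 0) {
            return out;
        }
        long[] sorted = eventTimes.clone();
        Arrays.sort(sorted);
        long start = sorted[0];
        long end = sorted[0] + gap;
        int count = 1;
        for (int i = 1; i < sorted.length; i++) {
            long ts = sorted[i];
            if (ts <= end) {
                // Within the gap (inclusive): extend the current session.
                end = ts + gap;
                count++;
            } else {
                out.add("[" + start + "," + end + ") count=" + count);
                start = ts;
                end = ts + gap;
                count = 1;
            }
        }
        out.add("[" + start + "," + end + ") count=" + count);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sessions(new long[]{2, 7, 14}, 5));
    }
}
```

For the test data this gives the session [2,12) with two records and a second session [14,19) with one record. In the streaming job only [2,12) has fired at that point, because the watermark (12 seconds) has not yet reached 19.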
(4) Custom WatermarkStrategy
There are two styles of watermark generation: periodic and punctuated.
Both are written by implementing the WatermarkGenerator interface.
(4.1) Periodic
package com.aikfk.flink.datastream.watermark;

import com.aikfk.flink.datastream.bean.WaterSensor;
import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * @author :caizhengjie
 * @description:Event-time tumbling window with a custom periodic watermark generator
 * @date :2021/3/20 9:21 PM
 */
public class EventTimeTumblingCustomerPeriod {
    public static void main(String[] args) throws Exception {
        // 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 2. Read lines from the socket and convert them to JavaBeans
        SingleOutputStreamOperator<WaterSensor> waterSensorDS = env.socketTextStream("bigdata-pro-m07", 9999)
                .map(new MapFunction<String, WaterSensor>() {
                    @Override
                    public WaterSensor map(String s) throws Exception {
                        String[] split = s.split(",");
                        return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
                    }
                });

        // 3. Extract the timestamp field and generate watermarks with the custom generator
        SingleOutputStreamOperator<WaterSensor> waterSensorSingleOutputStreamOperator = waterSensorDS
                .assignTimestampsAndWatermarks(new WatermarkStrategy<WaterSensor>() {
                    @Override
                    public WatermarkGenerator<WaterSensor> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                        return new MyPeriod(2000L);
                    }
                }.withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>() {
                    @Override
                    public long extractTimestamp(WaterSensor element, long recordTimestamp) {
                        // ts is in seconds, Flink expects milliseconds
                        return element.getTs() * 1000L;
                    }
                }));

        // 4. Key by id
        KeyedStream<WaterSensor, String> keyedStream = waterSensorSingleOutputStreamOperator.keyBy(WaterSensor::getId);

        // 5. Open a 5-second tumbling event-time window
        WindowedStream<WaterSensor, String, TimeWindow> window = keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)));

        // 6. Sum vc within each window
        SingleOutputStreamOperator<WaterSensor> result = window.reduce(new ReduceFunction<WaterSensor>() {
            @Override
            public WaterSensor reduce(WaterSensor t1, WaterSensor t2) throws Exception {
                return new WaterSensor(t1.getId(), t1.getTs(), t1.getVc() + t2.getVc());
            }
        });

        // 7. Print
        result.print();

        // 8. Execute the job
        env.execute();
    }

    /**
     * Custom periodic watermark generator
     */
    public static class MyPeriod implements WatermarkGenerator<WaterSensor> {
        // Largest event timestamp seen so far, in ms
        private long maxTs;
        // Maximum allowed delay, in ms
        private final long maxDelay;

        public MyPeriod(long maxDelay) {
            this.maxDelay = maxDelay;
            // Start low enough that maxTs - maxDelay cannot underflow
            this.maxTs = Long.MIN_VALUE + maxDelay + 1;
        }

        // Called once per element; only tracks the timestamp that the next watermark is based on
        @Override
        public void onEvent(WaterSensor event, long eventTimestamp, WatermarkOutput output) {
            System.out.println("Tracking the largest timestamp seen so far");
            maxTs = Math.max(eventTimestamp, maxTs);
        }

        // Called periodically to emit the watermark; the default interval is 200 ms
        @Override
        public void onPeriodicEmit(WatermarkOutput output) {
            // In effect Flink sets its own clock back by the maximum delay
            System.out.println("Emitting watermark " + (maxTs - maxDelay));
            output.emitWatermark(new Watermark(maxTs - maxDelay));
        }
    }
}
(4.2) Punctuated
package com.aikfk.flink.datastream.watermark;

import com.aikfk.flink.datastream.bean.WaterSensor;
import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * @author :caizhengjie
 * @description:Event-time tumbling window with a custom punctuated watermark generator
 * @date :2021/3/20 9:21 PM
 */
public class EventTimeTumblingCustomerPunt {
    public static void main(String[] args) throws Exception {
        // 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 2. Read lines from the socket and convert them to JavaBeans
        SingleOutputStreamOperator<WaterSensor> waterSensorDS = env.socketTextStream("bigdata-pro-m07", 9999)
                .map(new MapFunction<String, WaterSensor>() {
                    @Override
                    public WaterSensor map(String s) throws Exception {
                        String[] split = s.split(",");
                        return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
                    }
                });

        // 3. Extract the timestamp field and generate watermarks with the custom generator
        SingleOutputStreamOperator<WaterSensor> waterSensorSingleOutputStreamOperator = waterSensorDS
                .assignTimestampsAndWatermarks(new WatermarkStrategy<WaterSensor>() {
                    @Override
                    public WatermarkGenerator<WaterSensor> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                        return new MyPunt(2000L);
                    }
                }.withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>() {
                    @Override
                    public long extractTimestamp(WaterSensor element, long recordTimestamp) {
                        // ts is in seconds, Flink expects milliseconds
                        return element.getTs() * 1000L;
                    }
                }));

        // 4. Key by id
        KeyedStream<WaterSensor, String> keyedStream = waterSensorSingleOutputStreamOperator.keyBy(WaterSensor::getId);

        // 5. Open a 5-second tumbling event-time window
        WindowedStream<WaterSensor, String, TimeWindow> window = keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)));

        // 6. Sum vc within each window
        SingleOutputStreamOperator<WaterSensor> result = window.reduce(new ReduceFunction<WaterSensor>() {
            @Override
            public WaterSensor reduce(WaterSensor t1, WaterSensor t2) throws Exception {
                return new WaterSensor(t1.getId(), t1.getTs(), t1.getVc() + t2.getVc());
            }
        });

        // 7. Print
        result.print();

        // 8. Execute the job
        env.execute();
    }

    /**
     * Custom punctuated watermark generator
     */
    public static class MyPunt implements WatermarkGenerator<WaterSensor> {
        // Largest event timestamp seen so far, in ms
        private long maxTs;
        // Maximum allowed delay, in ms
        private final long maxDelay;

        public MyPunt(long maxDelay) {
            this.maxDelay = maxDelay;
            // Start low enough that maxTs - maxDelay cannot underflow
            this.maxTs = Long.MIN_VALUE + maxDelay + 1;
        }

        // Called for every element; the watermark is emitted right here
        @Override
        public void onEvent(WaterSensor event, long eventTimestamp, WatermarkOutput output) {
            System.out.println("Tracking the largest timestamp seen so far");
            maxTs = Math.max(eventTimestamp, maxTs);
            output.emitWatermark(new Watermark(maxTs - maxDelay));
        }

        // Called periodically; intentionally empty because watermarks are emitted in onEvent
        @Override
        public void onPeriodicEmit(WatermarkOutput output) {
        }
    }
}
Test data:
ws_001,1577844001,1
ws_001,1577844002,1
ws_001,1577844012,1
Output:
Tracking the largest timestamp seen so far
Tracking the largest timestamp seen so far
Tracking the largest timestamp seen so far
WaterSensor{id='ws_001', ts=1577844001, vc=2}
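(5) Watermark propagation with multiple parallel subtasks
Watermark propagation:
- Watermarks are sent downstream by broadcast
- The watermark of a parallel subtask is the minimum of the watermarks it has received from all of its upstream parallel subtasks
- When the watermark value has not advanced, it is not forwarded downstream. Note: watermark generation upstream is unaffected
Summary: with multiple parallel subtasks, the watermark passed downstream is always the smallest one. This is the barrel principle: a barrel holds only as much water as its shortest stave allows!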
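The "smallest watermark wins" rule can be sketched in a few lines of plain Java. `MinWatermarkTracker` and `onWatermark` are hypothetical names used only for this illustration; the sketch models a single downstream subtask with a fixed number of input channels and is not Flink's internal implementation.

```java
import java.util.Arrays;

// Illustrative model of combining watermarks from several upstream channels; not Flink's API.
class MinWatermarkTracker {
    private final long[] channelWatermarks;        // latest watermark received per input channel
    private long lastEmitted = Long.MIN_VALUE;     // last watermark forwarded downstream

    MinWatermarkTracker(int channels) {
        channelWatermarks = new long[channels];
        Arrays.fill(channelWatermarks, Long.MIN_VALUE);
    }

    /** Returns the new combined watermark if it advanced, or null if nothing should be forwarded. */
    Long onWatermark(int channel, long watermark) {
        channelWatermarks[channel] = Math.max(channelWatermarks[channel], watermark);
        long min = Arrays.stream(channelWatermarks).min().getAsLong();
        if (min > lastEmitted) {
            lastEmitted = min;
            return min;
        }
        return null;   // the minimum did not grow, so nothing is sent downstream
    }

    public static void main(String[] args) {
        MinWatermarkTracker tracker = new MinWatermarkTracker(2);
        System.out.println(tracker.onWatermark(0, 5));   // channel 1 has sent nothing yet
        System.out.println(tracker.onWatermark(1, 3));   // minimum of 5 and 3
        System.out.println(tracker.onWatermark(1, 9));   // minimum of 5 and 9
        System.out.println(tracker.onWatermark(1, 10));  // minimum is still 5
    }
}
```

One practical consequence of this rule is that a single slow or idle upstream channel holds back the combined watermark for the whole subtask.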