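This chapter covers the concept of dynamic tables in Flink. Alibaba has an article on the topic that is worth reading first to get a rough idea of what a dynamic table is. Put simply, a dynamic table is a table that changes continuously as the stream feeding it changes. In Alibaba's example, every change to the Stream table triggers a change in the Keyed Table. The Stream table is append-only: rows are only ever added and never retracted. The Keyed Table, by contrast, updates its count values as new rows arrive in the Stream table, and each change to a count in turn changes the second table, which is keyed by count. The second table is the result we actually want. The first is only an intermediate table, but the second cannot be computed without it.
Here is Alibaba's first example written in Java: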
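To make the chain of updates concrete, here is a minimal plain-Java sketch with no Flink dependency. It assumes a simplified model in which each incoming word updates a word-to-count map (the keyed table) and then a count-to-frequency map (the second table), recording every change to the second table as an insert or a retraction. The class name DynamicTableSketch and the "+(count,frequency)" / "-(count,frequency)" notation are made up for illustration, and the order of the changes in a real Flink job may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the two chained aggregations; not Flink code.
final class DynamicTableSketch {

    // Consumes words one by one and returns the change log of the second table.
    // "+(c,f)" inserts the row (count=c, frequency=f); "-(c,f)" retracts it.
    static List<String> changelog(List<String> words) {
        Map<String, Integer> countByWord = new HashMap<>(); // first (keyed) table
        Map<Integer, Long> freqByCount = new HashMap<>();   // second table
        List<String> out = new ArrayList<>();
        for (String w : words) {
            int old = countByWord.getOrDefault(w, 0);
            int updated = old + 1;
            countByWord.put(w, updated);
            if (old > 0) {
                // The word leaves the group of its old count.
                adjust(freqByCount, old, -1, out);
            }
            // The word joins the group of its new count.
            adjust(freqByCount, updated, +1, out);
        }
        return out;
    }

    // Changes the frequency of one count value, retracting the old row first.
    private static void adjust(Map<Integer, Long> freq, int count, long delta, List<String> out) {
        long before = freq.getOrDefault(count, 0L);
        long after = before + delta;
        if (before > 0) {
            out.add("-(" + count + "," + before + ")");
        }
        if (after > 0) {
            freq.put(count, after);
            out.add("+(" + count + "," + after + ")");
        } else {
            freq.remove(count);
        }
    }

    public static void main(String[] args) {
        // Prints +(1,1), -(1,1), +(1,2), -(1,2), +(1,1), +(2,1), one per line.
        changelog(Arrays.asList("hello", "word", "hello")).forEach(System.out::println);
    }
}
```

The second "hello" is the interesting step: the word moves from count 1 to count 2, so the row for count 1 has to be retracted and re-emitted with a lower frequency before the row for count 2 appears.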
// File: RetractionITCase.java
package com.yjp.flink.retraction;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;

public class RetractionITCase {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.getTableEnvironment(env);
        env.getConfig().disableSysoutLogging();
        DataStream<Tuple2<String, Integer>> dataStream = env.fromElements(
                new Tuple2<>("hello", 1),
                new Tuple2<>("word", 1),
                new Tuple2<>("hello", 1),
                new Tuple2<>("bark", 1),
                new Tuple2<>("bark", 1),
                new Tuple2<>("bark", 1),
                new Tuple2<>("bark", 1),
                new Tuple2<>("bark", 1),
                new Tuple2<>("bark", 1),
                new Tuple2<>("flink", 1)
        );
        tEnv.registerDataStream("demo1", dataStream, "word ,num");
        Table table = tEnv.sqlQuery("select * from demo1 ")
                .groupBy("word")
                .select("word AS word ,num.sum AS count")
                .groupBy("count")
                .select("count , word.count as frequency");
        tEnv.toRetractStream(table, Word.class).print();
        env.execute("demo");
    }
}

// File: Word.java
package com.yjp.flink.retraction;

public class Word {
    private Integer count;
    private Long frequency;

    public Word() {
    }

    public Integer getCount() {
        return count;
    }

    public void setCount(Integer count) {
        this.count = count;
    }

    public Long getFrequency() {
        return frequency;
    }

    public void setFrequency(Long frequency) {
        this.frequency = frequency;
    }

    @Override
    public String toString() {
        return "Word{" + "count=" + count + ", frequency=" + frequency + '}';
    }
}
Output:
2> (true,Word{count=1, frequency=1})
2> (false,Word{count=1, frequency=1})
2> (true,Word{count=1, frequency=2})
4> (true,Word{count=3, frequency=1})
4> (false,Word{count=3, frequency=1})
4> (true,Word{count=4, frequency=1})
4> (false,Word{count=4, frequency=1})
2> (false,Word{count=1, frequency=2})
2> (true,Word{count=1, frequency=3})
2> (false,Word{count=1, frequency=3})
3> (true,Word{count=6, frequency=1})
1> (true,Word{count=2, frequency=1})
1> (false,Word{count=2, frequency=1})
1> (true,Word{count=5, frequency=1})
1> (false,Word{count=5, frequency=1})
1> (true,Word{count=2, frequency=1})
2> (true,Word{count=1, frequency=2})
2> (false,Word{count=1, frequency=2})
2> (true,Word{count=1, frequency=3})
2> (false,Word{count=1, frequency=3})
2> (true,Word{count=1, frequency=2})
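Reading the output, the final state we want is: (6,1) for the six occurrences of bark, (2,1) for the two occurrences of hello, and (1,2) for word and flink, which each occur once.
The number before the > sign is the index of the parallel subtask that printed the record, so rows with the same number were handled by the same subtask. A flag of true marks an insert and false marks a retraction; a true row and a false row with the same content cancel each other out. Once the cancelling pairs are removed, what remains matches the expected result. What would happen without retraction is already explained in Alibaba's article.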
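The cancelling described above can be checked mechanically. The following plain-Java sketch (the class name RetractFold and the "flag,count,frequency" encoding are made up for illustration) replays the printed messages in the order shown and keeps only the rows whose inserts are not matched by a retraction.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Replays retract-stream messages and keeps the rows that survive.
final class RetractFold {

    // Each message is "flag,count,frequency", e.g. "true,1,2".
    static Map<String, Integer> fold(List<String> messages) {
        Map<String, Integer> live = new LinkedHashMap<>();
        for (String m : messages) {
            String[] p = m.split(",");
            boolean insert = Boolean.parseBoolean(p[0]);
            String row = "(" + p[1] + "," + p[2] + ")";
            int n = live.getOrDefault(row, 0) + (insert ? 1 : -1);
            if (n == 0) {
                live.remove(row); // insert and retraction cancelled out
            } else {
                live.put(row, n);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        // The 21 messages from the output above, in printed order.
        List<String> printed = Arrays.asList(
                "true,1,1", "false,1,1", "true,1,2", "true,3,1", "false,3,1",
                "true,4,1", "false,4,1", "false,1,2", "true,1,3", "false,1,3",
                "true,6,1", "true,2,1", "false,2,1", "true,5,1", "false,5,1",
                "true,2,1", "true,1,2", "false,1,2", "true,1,3", "false,1,3",
                "true,1,2");
        System.out.println(fold(printed).keySet()); // prints [(6,1), (2,1), (1,2)]
    }
}
```

The three surviving rows are exactly the expected final state: six for bark, two for hello, and two words that occur once.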
Now for Alibaba's second example. Reading it, you may wonder how the StringLast function should be implemented. A Java implementation follows:
// File: ALiTest.java
package com.yjp.flink.retract;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;

public class ALiTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.getTableEnvironment(env);
        env.getConfig().disableSysoutLogging();
        DataStreamSource<Tuple3<String, String, Long>> dataStream = env.fromElements(
                new Tuple3<>("0001", "中通", 1L),
                new Tuple3<>("0002", "中通", 2L),
                new Tuple3<>("0003", "圆通", 3L),
                new Tuple3<>("0001", "圆通", 4L)
        );
        tEnv.registerDataStream("Ali", dataStream, "order_id ,company,timestamp");
        tEnv.registerFunction("agg", new AliAggrete());
        Table table = tEnv.sqlQuery("select * from Ali ")
                .groupBy("order_id")
                .select(" order_id,agg(company,timestamp) As company")
                .groupBy("company")
                .select("company , order_id.count as order_cnt");
        tEnv.toRetractStream(table, ALi.class).print();
        env.execute("ALi");
    }
}
// File: AliAggrete.java
package com.yjp.flink.retract;

import org.apache.flink.table.functions.AggregateFunction;

public class AliAggrete extends AggregateFunction<String, ALiAccum> {

    @Override
    public ALiAccum createAccumulator() {
        return new ALiAccum();
    }

    @Override
    public String getValue(ALiAccum aLiAccum) {
        return aLiAccum.company;
    }

    // Update the intermediate result held in the accumulator.
    public void accumulate(ALiAccum aLiAccum, String company, Long time) {
        if (time > aLiAccum.timeStamp) {
            aLiAccum.company = company;
            // Remember the newest timestamp so that older, late-arriving rows cannot overwrite it.
            aLiAccum.timeStamp = time;
        }
    }

    // public void retract(ALiAccum aLiAccum, String company, Long time) {
    //     aLiAccum.company = company;
    //     aLiAccum.timeStamp = time;
    // }

    // public void resetAccumulator(ALiAccum aLiAccum) {
    //     aLiAccum.company = null;
    //     aLiAccum.timeStamp = 0L;
    // }

    // public void merge(ALiAccum acc, Iterable<ALiAccum> it) {
    //     Iterator<ALiAccum> iter = it.iterator();
    //     while (iter.hasNext()) {
    //         ALiAccum aLiAccum = iter.next();
    //         if (aLiAccum.timeStamp > acc.timeStamp) {
    //             acc.company = aLiAccum.company;
    //         }
    //     }
    // }
}
// File: ALiAccum.java
package com.yjp.flink.retract;

public class ALiAccum {
    public String company = null;
    public Long timeStamp = 0L;
}

// File: ALi.java
package com.yjp.flink.retract;

public class ALi {
    private String company;
    private Long order_cnt;

    public ALi() {
    }

    public String getCompany() {
        return company;
    }

    public void setCompany(String company) {
        this.company = company;
    }

    public Long getOrder_cnt() {
        return order_cnt;
    }

    public void setOrder_cnt(Long order_cnt) {
        this.order_cnt = order_cnt;
    }

    @Override
    public String toString() {
        return "ALi{" + "company='" + company + '\'' + ", order_cnt=" + order_cnt + '}';
    }
}
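Together, these classes implement Alibaba's second example. The explicit timestamp field is not strictly required, because records in a Flink stream can carry a timestamp of their own. Records may arrive out of order, though, so relying on the built-in timestamp is only a safe substitute when ordering does not matter.
Walking through the code: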
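Why out-of-order arrival matters can be seen in a small plain-Java sketch. It assumes a hypothetical Event type holding a company name and an explicit timestamp, and compares taking the last value by arrival order with taking the value that has the largest timestamp. Neither method is part of Flink; both are illustrations only.

```java
import java.util.Arrays;
import java.util.List;

// Compares "last by arrival" with "last by timestamp" on out-of-order input.
final class LastValueSketch {

    // Hypothetical event type: a company name plus an explicit timestamp.
    static final class Event {
        final String company;
        final long timestamp;

        Event(String company, long timestamp) {
            this.company = company;
            this.timestamp = timestamp;
        }
    }

    // "Last" by arrival order: whichever record showed up most recently wins.
    static String lastByArrival(List<Event> events) {
        String last = null;
        for (Event e : events) {
            last = e.company;
        }
        return last;
    }

    // "Last" by timestamp: the record with the largest timestamp wins.
    static String lastByTimestamp(List<Event> events) {
        String last = null;
        long max = Long.MIN_VALUE;
        for (Event e : events) {
            if (e.timestamp > max) {
                max = e.timestamp;
                last = e.company;
            }
        }
        return last;
    }

    public static void main(String[] args) {
        // The newer record (timestamp 4) arrives before the older one (timestamp 1).
        List<Event> outOfOrder = Arrays.asList(new Event("圆通", 4L), new Event("中通", 1L));
        System.out.println(lastByArrival(outOfOrder));   // prints 中通
        System.out.println(lastByTimestamp(outOfOrder)); // prints 圆通
    }
}
```

With in-order input the two methods agree, which is the case where an explicit timestamp field adds nothing.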
tEnv.registerFunction("agg", new AliAggrete());
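This registers our custom aggregate function. Table table = tEnv.sqlQuery("select * from Ali ") turns the stream into the first table, the Stream table. .groupBy("order_id").select(" order_id,agg(company,timestamp) As company") groups by order ID, so rows with the same ID fall into the same group, and the custom aggregate function then returns only the company from the row with the largest timestamp in each group.
Here is how that works. The ALiAccum class ties the company and timestamp fields together. In AggregateFunction<String, ALiAccum>, String is the return type (we want the company name, so it is a String) and ALiAccum is the accumulator type, a POJO holding the two fields we pass in. Flink first calls createAccumulator() to create the data structure that holds the intermediate result, then calls accumulate() to update that intermediate result, and finally calls getValue() to return the value we actually want.
The last step queries the table produced by this aggregation, which gives the result we are after. The main work is implementing the aggregate function yourself. For more detail, see the Alibaba article and the Flink documentation on aggregate functions.
Keep at it, Pikachu.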
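The call order described above (createAccumulator, then accumulate, then getValue) can be sketched in plain Java. The class below is a stand-in written for illustration, not Flink's AggregateFunction. It computes only the final result in one pass rather than emitting incremental retractions, and the names AggLifecycleSketch, Acc, and ordersPerCompany are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Stand-in for the aggregate-function lifecycle; not Flink code.
final class AggLifecycleSketch {

    // Accumulator: the intermediate state kept per group (here, per order_id).
    static final class Acc {
        String company = null;
        long timeStamp = Long.MIN_VALUE;
    }

    // Step 1: create the structure that holds the intermediate result.
    static Acc createAccumulator() {
        return new Acc();
    }

    // Step 2: update the intermediate result for each input row.
    static void accumulate(Acc acc, String company, long time) {
        if (time > acc.timeStamp) {
            acc.company = company;
            acc.timeStamp = time;
        }
    }

    // Step 3: extract the value the query asked for.
    static String getValue(Acc acc) {
        return acc.company;
    }

    // Runs both group-by steps over rows of {order_id, company, timestamp}.
    static Map<String, Long> ordersPerCompany(Object[][] rows) {
        Map<String, Acc> accByOrder = new LinkedHashMap<>();
        for (Object[] r : rows) {
            Acc acc = accByOrder.computeIfAbsent((String) r[0], k -> createAccumulator());
            accumulate(acc, (String) r[1], (Long) r[2]);
        }
        // Second group-by: count orders per latest company.
        Map<String, Long> result = new TreeMap<>();
        for (Acc acc : accByOrder.values()) {
            result.merge(getValue(acc), 1L, Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        Object[][] rows = {
                {"0001", "中通", 1L}, {"0002", "中通", 2L},
                {"0003", "圆通", 3L}, {"0001", "圆通", 4L}
        };
        System.out.println(ordersPerCompany(rows));
    }
}
```

For the four sample rows, order 0001 ends up with 圆通 because its timestamp 4 is newer than 1, so the result has one order for 中通 and two for 圆通.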