Flink Time Types and the Watermark Mechanism

1. Flink Time types

    There are three notions of time for a record: the time at which the data itself was generated, the time at which it entered the system, and the time at which Flink processes it. Accordingly, data in a Flink system may carry three time attributes:

Event Time is the time at which each record occurred on the device that produced it. This time is usually embedded in the record before it enters Flink, so the timestamp can be extracted from the event record. With Event Time, results remain correct even when data arrives out of order or late, or when it is replayed from a backup or a persistent log. This is the most valuable notion of time, and it does not depend on the wall clock of any computer or operating system.

Processing Time is the system time of the machine that performs the corresponding operation. Processing a stream by Processing Time is the simplest approach: all time-based operations (e.g., a Time Window) use the system clock of the machine running the operator. However, in a distributed, asynchronous environment Processing Time is not deterministic. It is affected by the speed at which events reach the system (e.g., from a message queue) and by the order in which Flink processes data internally, so Processing Time cannot accurately reflect the order in which the data actually occurred.

Ingestion Time is the time at which an event enters Flink. It is generated at the Source operator, i.e., it is the time at which the Source obtains the data. Conceptually, Ingestion Time sits between Event Time and Processing Time. Because it is assigned once, when the Source receives the data, it is not affected by how the distributed system transfers and processes the event inside Flink, so it is relatively stable. But like Processing Time, Ingestion Time does not exactly reflect when the data actually occurred.
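To make the distinction concrete, here is a minimal plain-Java sketch (not Flink API) that tags one record with all three times. The class `TimedRecord` and its methods are hypothetical names chosen for illustration, the `word timestamp` line format is an assumption, and clock readings are passed in explicitly so that the example stays deterministic.

```java
// Hypothetical illustration, not Flink API: one record carrying the three notions of time.
final class TimedRecord {
    final String payload;
    final long eventTime;      // set by the producing device and embedded in the record
    final long ingestionTime;  // stamped once, when the source reads the record
    long processingTime;       // clock reading of whichever operator handles the record

    TimedRecord(String payload, long eventTime, long ingestionTime) {
        this.payload = payload;
        this.eventTime = eventTime;
        this.ingestionTime = ingestionTime;
    }

    // Parse a "word timestamp" line (assumed format); sourceClock is the source's clock reading.
    static TimedRecord ingest(String line, long sourceClock) {
        String[] tokens = line.trim().split("\\s+");
        return new TimedRecord(tokens[0], Long.parseLong(tokens[1]), sourceClock);
    }

    // Called by an operator; operatorClock is that machine's clock reading.
    TimedRecord process(long operatorClock) {
        this.processingTime = operatorClock;
        return this;
    }
}
```

Only `eventTime` is fixed by the data itself. Replaying the same line later would change `ingestionTime` and `processingTime` but not `eventTime`, which is why event-time results are reproducible.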

2. The Watermark mechanism

As noted above, Event Time best reflects the time attribute of the data, but with respect to Event Time events may arrive late or out of order, and Flink can only process records one at a time as they arrive. How, then, should late or out-of-order events be handled?

For example, suppose we need to count how many times an event occurs from 10:00 to 11:00, that is, count the events whose Event Time lies between 10:00 and 11:00. Since events may be late or out of order, how does Flink determine that all events that occurred between 10:00 and 11:00 have arrived, so that the result can be emitted? Waiting too long delays the output and ties up more system resources.

A Watermark is a marker tied to Event Time. Its content is a timestamp: when a Watermark with timestamp X arrives, it tells Flink that all data with an Event Time less than X has arrived. In the example above, once Flink receives a Watermark with timestamp 11:01, it can output the count of events whose Event Time falls in [10:00, 11:01) and release the resources they occupied. Note that the window length also matters here: a window is computed only once it has collected all of its data.
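The idea can be sketched in plain Java, without any Flink dependency. The class `EventTimeWindowCounter` below is a hypothetical illustration, not Flink's window operator: it counts events in a single window [start, end) and emits the count once a watermark reaches the window end, following the definition given above (a watermark X means every event with Event Time less than X has arrived).

```java
// Hypothetical illustration, not Flink API: counts events in one event-time window
// [start, end) and emits the count once a watermark reaches the window end.
final class EventTimeWindowCounter {
    private final long start;
    private final long end;
    private long count = 0;
    private boolean fired = false;

    EventTimeWindowCounter(long start, long end) {
        this.start = start;
        this.end = end;
    }

    // Count an event if it belongs to the window; arrival order does not matter.
    // Events outside the window, or arriving after the window has fired, are ignored.
    void onEvent(long eventTime) {
        if (!fired && eventTime >= start && eventTime < end) {
            count++;
        }
    }

    // Returns the final count the first time the watermark covers the whole window,
    // and null otherwise (window still incomplete, or already emitted).
    Long onWatermark(long watermark) {
        if (!fired && watermark >= end) {
            fired = true;
            return count;
        }
        return null;
    }
}
```

With times expressed as minutes since midnight, a window of `new EventTimeWindowCounter(600, 660)` stands for 10:00 to 11:00. A watermark of 630 leaves it open, and a watermark of 661 (11:01) makes it emit its count.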

3. How Watermarks are generated

Periodic - a watermark is generated at a fixed time interval or after a certain number of records.

Punctuated - a watermark is generated from the event time according to some logic; for example, each received record produces a Watermark whose time is the event time minus 5 seconds.

Both approaches have mechanisms to ensure that the generated watermark is monotonically increasing.
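The two styles and the monotonicity guarantee can be sketched as follows. This is a plain-Java illustration under stated assumptions, not Flink's implementation: the class `BoundedLatenessWatermarks` and its method names are hypothetical, and the watermark is taken to be the largest event time seen so far minus a fixed allowance.

```java
// Hypothetical illustration, not Flink API: a watermark generator that trails the
// largest event time seen by a fixed allowance and never moves backwards.
final class BoundedLatenessWatermarks {
    private final long maxOutOfOrderness;
    private long maxTimestamp = Long.MIN_VALUE;
    private long lastEmitted = Long.MIN_VALUE;

    BoundedLatenessWatermarks(long maxOutOfOrderness) {
        this.maxOutOfOrderness = maxOutOfOrderness;
    }

    // Track the largest event time seen so far; out-of-order events do not lower it.
    void onEvent(long eventTime) {
        maxTimestamp = Math.max(maxTimestamp, eventTime);
    }

    // Periodic style: meant to be called on a timer, independently of event arrival.
    long currentWatermark() {
        if (maxTimestamp == Long.MIN_VALUE) {
            return lastEmitted; // no events yet, nothing to advance
        }
        long candidate = maxTimestamp - maxOutOfOrderness;
        lastEmitted = Math.max(lastEmitted, candidate); // enforce monotonic increase
        return lastEmitted;
    }

    // Punctuated style: a watermark is derived immediately from each incoming event.
    long onEventPunctuated(long eventTime) {
        onEvent(eventTime);
        return currentWatermark();
    }
}
```

A larger allowance tolerates more disorder but delays every window result by the same amount, so the value is a trade-off between completeness and latency.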

Even with watermarks, what if real data does not respect them? For example, Flink has already processed the 11:01 watermark, but then receives data whose event time lies between 10:00 and 11:00. If this happens rarely enough that it does not affect the required accuracy, the data can simply be discarded; if it happens relatively often, the watermark generation mechanism needs to be adjusted.

Besides discarding data that violates the watermark, there are approaches that keep it, such as updating previously emitted results through some mechanism; these come at some performance cost.
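The two policies can be compared with a small plain-Java sketch. The class `LateDataPolicyDemo` is hypothetical and is not Flink API; it keeps one count per fixed-size window and, depending on a flag, either discards an event that arrives after the watermark has passed its window or folds it into the previously computed result.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not Flink API: per-window counts with two policies
// for events that arrive after the watermark has passed their window.
final class LateDataPolicyDemo {
    private final long windowSize;
    private final boolean updateOnLate;
    private final Map<Long, Long> counts = new HashMap<>();
    private long watermark = Long.MIN_VALUE;
    private long dropped = 0;

    LateDataPolicyDemo(long windowSize, boolean updateOnLate) {
        this.windowSize = windowSize;
        this.updateOnLate = updateOnLate;
    }

    // Watermarks only ever advance.
    void onWatermark(long wm) {
        watermark = Math.max(watermark, wm);
    }

    // Returns true if the event was counted, false if it was discarded as late.
    boolean onEvent(long eventTime) {
        long windowStart = eventTime - Math.floorMod(eventTime, windowSize);
        boolean late = windowStart + windowSize <= watermark;
        if (late && !updateOnLate) {
            dropped++;
            return false;
        }
        // With updateOnLate, counts for already-complete windows must stay in memory
        // so they can be corrected; that retained state is the performance cost.
        counts.merge(windowStart, 1L, Long::sum);
        return true;
    }

    long countFor(long windowStart) {
        return counts.getOrDefault(windowStart, 0L);
    }

    long droppedCount() {
        return dropped;
    }
}
```

With 60-minute windows and times in minutes since midnight, an event at 630 (10:30) that arrives after a watermark of 661 (11:01) is dropped under the first policy and raises the count of the window starting at 600 under the second.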

4. Code sample

package org.tonny.flink.bi.job.water;

import org.apache.commons.lang3.ArrayUtils;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.StringUtils;

/**
 * On a Linux machine, open a listening socket with: nc -l 9900
 *
 * Input data format:
 * hello1 1567059808519
 * hello2 1567059809519
 * hello3 1567059810519
 */
public class WaterMarkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().disableSysoutLogging(); // turn off log printing
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime); // use event time
        env.setParallelism(1); // set the degree of parallelism
        env.getConfig().setAutoWatermarkInterval(3000); // emit a Watermark every 3 seconds

        DataStream<String> text = env.socketTextStream("localhost", 9900);

        DataStream<Tuple3<String, Long, Integer>> counts = text
                // drop empty lines
                .filter(new FilterClass())
                // split each line into (word, event time, 1)
                .map(new LineSplitter())
                // assign timestamps and watermarks
                .assignTimestampsAndWatermarks(new PeriodicWatermarks())
                .keyBy(0)
                // tumbling window of 60 seconds
                .timeWindow(Time.seconds(60))
                .sum(2);

        counts.print();
        env.execute("Window WordCount");

    }

    public static class PeriodicWatermarks implements AssignerWithPeriodicWatermarks<Tuple3<String, Long, Integer>> {
        private long currentMaxTimestamp = 0L;

        // Maximum tolerated out-of-orderness: the watermark trails the largest timestamp seen by 10 seconds
        private final long maxOutOfOrderness = 10000L;

        // Get the event time
        @Override
        public long extractTimestamp(Tuple3<String, Long, Integer> element, long previousElementTimestamp) {
            if (element == null) {
                return currentMaxTimestamp;
            }

            long timestamp = element.f1;
            currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
            System.out.println("get timestamp is " + timestamp + " currentMaxTimestamp " + currentMaxTimestamp);
            return timestamp;
        }

        // Get the Watermark
        @Override
        public Watermark getCurrentWatermark() {
            System.out.println("wall clock is " + System.currentTimeMillis() + " new watermark " + (currentMaxTimestamp - maxOutOfOrderness));
            return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
        }
    }

    // Build an element from the word and its event time, with the count set to 1
    public static final class LineSplitter implements MapFunction<String, Tuple3<String, Long, Integer>> {
        @Override
        public Tuple3<String, Long, Integer> map(String value) throws Exception {
            if (org.apache.commons.lang3.StringUtils.isBlank(value)) {
                return null;
            }

            String[] tokens = value.toLowerCase().split("\\W+");
            if (ArrayUtils.isEmpty(tokens) || ArrayUtils.getLength(tokens) < 2) {
                return null;
            }
            long eventtime = 0L;
            try {
                eventtime = Long.parseLong(tokens[1]);
            } catch (NumberFormatException e) {
                return null;
            }
            return new Tuple3<String, Long, Integer>(tokens[0], eventtime, 1);
        }
    }

    /**
     * Filters out strings that are null or whitespace-only.
     */
    public static final class FilterClass implements FilterFunction<String> {
        @Override
        public boolean filter(String value) throws Exception {
            return !StringUtils.isNullOrWhitespaceOnly(value);
        }

    }
}

 


Origin www.cnblogs.com/supertonny/p/11430145.html