Types and basic use of the Flink Sink


The sink in Flink is roughly equivalent to an action in Spark, and it is one of the important bases for dividing subtasks.

PrintSink numbering problem

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> streamSource = environment.socketTextStream("test130", 8888);

    streamSource.print("result:").setParallelism(2);

    environment.execute("PrintSink");
}

Test: in the terminal, enter a and b alternately, one per line.
Results:

result::2> b
result::1> a
result::2> b
result::1> a
result::2> b

1. The prefix result: is the identifier we passed to print("result:").
2. The leading number (that is, the id of the subtask) switches back and forth between 1 and 2, because of setParallelism(2).
3. The print sink allows its parallelism to be set, which means it is a parallel sink.
4. Note that Flink numbers subtasks starting from 0, and the parallelism we set is 2, so logically the leading numbers should be 0 and 1. Why are they 1 and 2?

Answer:
In the Flink source code there is a class called PrintSinkOutputWriter;
some of its methods are as follows:

public void open(int subtaskIndex, int numParallelSubtasks) {
    this.stream = !this.target ? System.out : System.err;
    this.completedPrefix = this.sinkIdentifier;
    if (numParallelSubtasks > 1) {
        if (!this.completedPrefix.isEmpty()) {
            this.completedPrefix = this.completedPrefix + ":";
        }
        // the printed index is subtaskIndex + 1
        this.completedPrefix = this.completedPrefix + (subtaskIndex + 1);
    }
    // this is why the output contains the "> " symbol
    if (!this.completedPrefix.isEmpty()) {
        this.completedPrefix = this.completedPrefix + "> ";
    }
}

This code explains why the output of the print() method looks like this:

(subtaskIndex + 1)> value

Default output template:

2> b
1> a
2> b
1> a
2> b
1> a
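The prefix construction can be reproduced outside Flink. The following standalone sketch copies the branches of the open() method shown above (the class name PrefixDemo and the helper buildPrefix are names invented here, not Flink API):

```java
public class PrefixDemo {

    // Mirrors the prefix logic of PrintSinkOutputWriter.open(): append ":" and
    // (subtaskIndex + 1) when parallelism > 1, then "> " if any prefix exists.
    static String buildPrefix(String sinkIdentifier, int subtaskIndex, int numParallelSubtasks) {
        String completedPrefix = sinkIdentifier;
        if (numParallelSubtasks > 1) {
            if (!completedPrefix.isEmpty()) {
                completedPrefix = completedPrefix + ":";
            }
            completedPrefix = completedPrefix + (subtaskIndex + 1);
        }
        if (!completedPrefix.isEmpty()) {
            completedPrefix = completedPrefix + "> ";
        }
        return completedPrefix;
    }

    public static void main(String[] args) {
        // Identifier "result:", parallelism 2, subtask 1 → the "result::2> " seen above
        System.out.println(buildPrefix("result:", 1, 2) + "b"); // result::2> b
        // No identifier at all → the default template "2> b"
        System.out.println(buildPrefix("", 1, 2) + "b");        // 2> b
    }
}
```

This also explains the doubled colon in result::2>: one colon belongs to the identifier we passed in, the other is inserted by open().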

Use of addSink (custom Sink)

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> streamSource = environment.socketTextStream("test130", 8888);

    SingleOutputStreamOperator<Tuple2<String, Integer>> flatMap = streamSource.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
            String[] words = line.split(" ");
            for (String word : words) {
                out.collect(Tuple2.of(word, 1));
            }
        }
    });
    SingleOutputStreamOperator<Tuple2<String, Integer>> sum = flatMap.keyBy(0).sum(1);

    sum.addSink(new RichSinkFunction<Tuple2<String, Integer>>() {
        @Override
        public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
            // get the index of this subtask
            int index = getRuntimeContext().getIndexOfThisSubtask();
            System.out.println("Custom sink: " + index + " -> " + value);
        }
    });

    environment.execute("AddSink");
}

Open port 8888 in a terminal and enter:

nc -lk 8888
flink spark spark

result:

Custom sink: 0 -> (spark,1)
Custom sink: 6 -> (flink,1)
Custom sink: 0 -> (spark,2)

Note:
If you need to get the id of the subtask inside a custom sink, the anonymous implementation must extend RichSinkFunction; a plain SinkFunction does not support this call:

int index = getRuntimeContext().getIndexOfThisSubtask();

Use of csvSink

First of all, note the internal implementation of the CSV sink (see the CsvOutputFormat source):

  1. The default line delimiter is a newline ("\n").
  2. The default field delimiter between values is a comma (",").

Test:
If I run a job that writes its result stream to the path out2 as CSV and enter some text in the terminal, the output path will contain an out2 folder, but its size is 0 at this point. Why?
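The test job itself was shown only as a screenshot in the original; a minimal sketch consistent with the addSink example above might look like the following (the out2 output path is from the text; the word-count pipeline is an assumption):

```java
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> streamSource = environment.socketTextStream("test130", 8888);

    SingleOutputStreamOperator<Tuple2<String, Integer>> sum = streamSource
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.split(" ")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                }
            })
            .keyBy(0)
            .sum(1);

    // writeAsCsv only accepts tuple streams; it uses the default delimiters
    // described above ("\n" between records, "," between tuple fields)
    sum.writeAsCsv("out2");

    environment.execute("CsvSink");
}
```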
Look at the source code: one of the classes is called CsvOutputFormat, and part of its code is posted below:

public void open(int taskNumber, int numTasks) throws IOException {
    super.open(taskNumber, numTasks);
    this.wrt = this.charsetName == null
            ? new OutputStreamWriter(new BufferedOutputStream(this.stream, 4096))
            : new OutputStreamWriter(new BufferedOutputStream(this.stream, 4096), this.charsetName);
}

You can see the number 4096 here: the writer sits on top of a BufferedOutputStream with a 4096-byte buffer.
Data is only written through to the file when the buffer fills up (or when the stream is flushed or closed). When we write only a little data, the 4096-byte limit is never reached, no flush happens, and the output file stays empty.
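The buffering behavior can be demonstrated with plain JDK classes, using the same stream construction as CsvOutputFormat.open() above (the target ByteArrayOutputStream stands in for the file so we can inspect how many bytes actually arrived):

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class BufferDemo {
    public static void main(String[] args) throws IOException {
        // Stand-in for the file stream; lets us count the bytes that reached it
        ByteArrayOutputStream target = new ByteArrayOutputStream();

        // Same shape as in CsvOutputFormat.open(): a 4096-byte buffer in front of the target
        OutputStreamWriter wrt = new OutputStreamWriter(
                new BufferedOutputStream(target, 4096), StandardCharsets.UTF_8);

        wrt.write("flink,1\n"); // far less than 4096 bytes
        System.out.println("bytes in target before flush: " + target.size()); // 0 — still in the buffer

        wrt.flush(); // what effectively happens when the job terminates
        System.out.println("bytes in target after flush: " + target.size());  // 8
    }
}
```

Until flush() is called, the 8 bytes sit in the buffer and the target has received nothing, which is exactly why the out2 file has size 0 while the job is running.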
So we can force the data out as follows:

  1. Disconnect the terminal's port 8888 (stop the nc process).
  2. The program then enters its shutdown path: the socket source fails with an exception, the writer is flushed, and the data is written to the file.
  3. After disconnecting the terminal, a corresponding message appears in the console.

At this time, the target folder has changed, and the output file now contains the data.

Origin blog.csdn.net/Zong_0915/article/details/107799559