Types and basic use of Flink sinks
A sink in Flink plays a role similar to an action in Spark, and it is one of the important bases for dividing a job into subtasks.
PrintSink numbering problem
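import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    // Read lines of text from a socket
    DataStreamSource<String> streamSource = environment.socketTextStream("test130", 8888);
    // Print with the prefix "result:" using two parallel sink subtasks
    streamSource.print("result:").setParallelism(2);
    environment.execute("PrintSink");
}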
Test: type a and b alternately in the terminal, one per line (a, b, a, b, ...)
results:
result::2> b
result::1> a
result::2> b
result::1> a
result::2> b
1. The prefix result: is the argument passed to print().
2. The number after the prefix (derived from the Id of the subTask, as explained below) alternates between 1 and 2 because of setParallelism(2).
3. The print sink accepts a parallelism setting, which means that it is a parallel sink.
4. Flink numbers its subTasks starting from 0, and the parallelism we set is 2. Logically, the numbers should therefore be 0 and 1, so why are they 1 and 2?
Answer:
In the Flink source code there is a class called PrintSinkOutputWriter.
Its open method is as follows:
public void open(int subtaskIndex, int numParallelSubtasks) {
    this.stream = !this.target ? System.out : System.err;
    this.completedPrefix = this.sinkIdentifier;
    if (numParallelSubtasks > 1) {
        if (!this.completedPrefix.isEmpty()) {
            this.completedPrefix = this.completedPrefix + ":";
        }
        // The printed number is the subtask index plus 1
        this.completedPrefix = this.completedPrefix + (subtaskIndex + 1);
    }
    // This is why the output contains the ">" symbol
    if (!this.completedPrefix.isEmpty()) {
        this.completedPrefix = this.completedPrefix + "> ";
    }
}
This code explains why the output of the print() method looks like this:
[subTaskId+1] > value
Default output (when no prefix is passed to print()):
2> b
1> a
2> b
1> a
2> b
1> a
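The prefix rule can also be tried outside Flink. The following is a minimal standalone sketch that copies the logic of the open method quoted above into a plain static method; the class name PrintPrefixDemo and the method name completedPrefix are made up for this illustration and are not part of Flink.

```java
public class PrintPrefixDemo {

    // Hypothetical helper mirroring the prefix logic quoted from PrintSinkOutputWriter.open
    static String completedPrefix(String sinkIdentifier, int subtaskIndex, int numParallelSubtasks) {
        String prefix = sinkIdentifier;
        if (numParallelSubtasks > 1) {
            if (!prefix.isEmpty()) {
                prefix = prefix + ":";
            }
            // The 0-based subtask index is shown as index + 1
            prefix = prefix + (subtaskIndex + 1);
        }
        if (!prefix.isEmpty()) {
            prefix = prefix + "> ";
        }
        return prefix;
    }

    public static void main(String[] args) {
        System.out.println(completedPrefix("result:", 0, 2) + "a"); // result::1> a
        System.out.println(completedPrefix("result:", 1, 2) + "b"); // result::2> b
        System.out.println(completedPrefix("", 1, 2) + "b");        // 2> b
        System.out.println(completedPrefix("", 0, 1) + "a");        // a
    }
}
```

The last call shows a consequence of the same code: with a parallelism of 1 and no prefix, nothing is printed in front of the value.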
Use of addSink (custom Sink)
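import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.util.Collector;

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> streamSource = environment.socketTextStream("test130", 8888);
    // Split each line into words and emit (word, 1)
    SingleOutputStreamOperator<Tuple2<String, Integer>> flatMap = streamSource.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
            String[] words = line.split(" ");
            for (String word : words) {
                Tuple2<String, Integer> of = Tuple2.of(word, 1);
                out.collect(of);
            }
        }
    });
    // Group by the word (field 0) and sum the counts (field 1)
    SingleOutputStreamOperator<Tuple2<String, Integer>> sum = flatMap.keyBy(0).sum(1);
    sum.addSink(new RichSinkFunction<Tuple2<String, Integer>>() {
        @Override
        public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
            // Get the index of the current subtask
            int index = getRuntimeContext().getIndexOfThisSubtask();
            System.out.println("Custom sink: " + index + " -> " + value);
        }
    });
    environment.execute("AddSink");
}
Open port 8888 in a terminal and enter:
nc -lk 8888
flink spark spark
Result:
Custom sink: 0 -> (spark,1)
Custom sink: 6 -> (flink,1)
Custom sink: 0 -> (spark,2)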
Note:
If you need the Id of the subTask inside a custom sink, the sink must extend RichSinkFunction. A plain SinkFunction has no getRuntimeContext() method, so the following call is not available there:
int index = getRuntimeContext().getIndexOfThisSubtask();
Use of csvSink
First, a point worth noting. Looking at the source code, the internal implementation uses these defaults:
- The default line delimiter is a newline (\n)
- The default field delimiter is a comma
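To make the two defaults concrete, here is a rough standalone sketch of what one output row looks like. It is not Flink's implementation: the class CsvRowDemo and the method toCsvLine are hypothetical, and the sketch ignores any other options the real output format may offer.

```java
public class CsvRowDemo {

    static final String FIELD_DELIMITER = ","; // default field delimiter
    static final String LINE_DELIMITER = "\n"; // default line delimiter

    // Hypothetical helper: joins the fields of one record into a single CSV line
    static String toCsvLine(Object... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(FIELD_DELIMITER);
            }
            sb.append(fields[i]);
        }
        return sb.append(LINE_DELIMITER).toString();
    }

    public static void main(String[] args) {
        System.out.print(toCsvLine("spark", 2)); // spark,2
        System.out.print(toCsvLine("flink", 1)); // flink,1
    }
}
```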
Test code:
If I enter some text in the terminal, an out2 folder appears under my output path, but its size is 0 at this point. Why? As shown in the figure:
View the source code:
One of the classes is called CsvOutputFormat; part of its code is shown here:
public void open(int taskNumber, int numTasks) throws IOException {
    super.open(taskNumber, numTasks);
    this.wrt = this.charsetName == null
            ? new OutputStreamWriter(new BufferedOutputStream(this.stream, 4096))
            : new OutputStreamWriter(new BufferedOutputStream(this.stream, 4096), this.charsetName);
}
Note the number 4096 here.
It is the size, in bytes, of the BufferedOutputStream buffer. Buffered data reaches the file only when the buffer fills up or when the stream is flushed or closed. When we write very little data, the buffer never fills, no flush takes place, and the output file stays empty.
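The buffering effect can be reproduced without Flink. The sketch below builds the same kind of writer chain as the code above, but on top of an in-memory ByteArrayOutputStream that stands in for the output file; the class BufferedWriteDemo and the method visibleBytes are hypothetical names used only for this illustration.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class BufferedWriteDemo {

    // Returns {bytes in the target before flush, bytes in the target after flush}
    static int[] visibleBytes(String record) throws IOException {
        ByteArrayOutputStream target = new ByteArrayOutputStream(); // stands in for the file
        Writer wrt = new OutputStreamWriter(
                new BufferedOutputStream(target, 4096), StandardCharsets.UTF_8);
        wrt.write(record);          // a small record stays in the buffers
        int before = target.size();
        wrt.flush();                // pushes the buffered bytes down to the target
        int after = target.size();
        wrt.close();
        return new int[] {before, after};
    }

    public static void main(String[] args) throws IOException {
        int[] sizes = visibleBytes("spark,2\n");
        System.out.println("before flush: " + sizes[0] + " bytes"); // before flush: 0 bytes
        System.out.println("after flush: " + sizes[1] + " bytes");  // after flush: 8 bytes
    }
}
```

Closing the writer also flushes it, which is why ending the job makes the data appear in the file.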
So we can work around this as follows:
- Disconnect the terminal's port 8888.
- The program then begins to shut down: an exception occurs, a flush is performed, and the data is written to the file.
- After the terminal disconnects, a message appears on the console.
At this point, the target folder has changed:
Data content: