Flink 1.17 code example: a simple batch-style WordCount application built with Flink's DataStream API to process a bounded data stream

Complete code


Input data
hello flink
hello world
hello java
Java code
package com.zxl.wc;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * TODO DataStream implementation of WordCount: reading from a file (bounded stream)
 *
 * @author cjp
 * @version 1.0
 */
public class WordCountStreamDemo {

    public static void main(String[] args) throws Exception {

        // TODO 1. Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // TODO 2. Read the data: from a file
        DataStreamSource<String> lineDS = env.readTextFile("input/word.txt");

        // TODO 3. Process the data: split, transform, group, aggregate
        // TODO 3.1 Split and transform
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOneDS = lineDS
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {

                    @Override
                    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
                        // Split the line on spaces
                        String[] words = value.split(" ");
                        for (String word : words) {
                            // Wrap each word in a tuple (word, 1)
                            Tuple2<String, Integer> wordsAndOne = Tuple2.of(word, 1);
                            // Emit the tuple downstream through the collector
                            out.collect(wordsAndOne);
                        }
                    }
                });

        // TODO 3.2 Group
        KeyedStream<Tuple2<String, Integer>, String> wordAndOneKS = wordAndOneDS.keyBy(
                new KeySelector<Tuple2<String, Integer>, String>() {

                    @Override
                    public String getKey(Tuple2<String, Integer> value) throws Exception {
                        return value.f0; // use the first field (the word) as the key
                    }
                }
        );

        // TODO 3.3 Aggregate
        SingleOutputStreamOperator<Tuple2<String, Integer>> sumDS = wordAndOneKS.sum(1);

        // TODO 4. Print the result
        sumDS.print();

        // TODO 5. Execute: similar to the final ssc.start() in Spark Streaming
        env.execute();
    }
}

/**
 * Given an interface A with a single method a():
 *
 * 1. Implementing the interface the usual way:
 *    1.1 Define a class B that implements interface A and method a()
 *    1.2 Create an instance of B:  B b = new B()
 *
 * 2. Anonymous implementation of the interface:
 *    new A() {
 *        a() {
 *            ...
 *        }
 *    }
 */
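
The pattern described in this comment is exactly what the FlatMapFunction and KeySelector arguments above use. A minimal standalone sketch, with a hypothetical Greeter interface standing in for A and greet() standing in for a():

// A minimal sketch of both approaches; Greeter and ConsoleGreeter are
// hypothetical names used only for illustration.
public class AnonymousClassDemo {

    interface Greeter {           // plays the role of interface A
        void greet(String name);  // plays the role of method a()
    }

    // 1.1 Define a class B that implements interface A
    static class ConsoleGreeter implements Greeter {
        @Override
        public void greet(String name) {
            System.out.println("Hello, " + name);
        }
    }

    public static void main(String[] args) {
        // 1.2 Create an instance of B
        Greeter b = new ConsoleGreeter();
        b.greet("flink");

        // 2. Anonymous implementation: define and instantiate in one expression,
        //    just like the FlatMapFunction and KeySelector instances above
        Greeter anonymous = new Greeter() {
            @Override
            public void greet(String name) {
                System.out.println("Hi, " + name);
            }
        };
        anonymous.greet("flink");
    }
}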

This code shows how to use Apache Flink's DataStream API to implement a simple WordCount application over a bounded data stream. The DataStream API is Flink's core API for processing real-time streams, but it can be applied to batch scenarios as well.
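
Since Flink 1.12 the runtime execution mode can be set explicitly, so the exact same DataStream program runs as a batch job over bounded input. A minimal sketch (optional for this demo; the default STREAMING mode also handles bounded sources, it just emits incremental counts):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// In BATCH mode each key produces a single final count,
// instead of one incremental update per input record
env.setRuntimeMode(RuntimeExecutionMode.BATCH);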

Code walkthrough

Creating the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

This creates the Flink execution environment. StreamExecutionEnvironment is the entry point to Flink's stream processing API.
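
getExecutionEnvironment() returns whatever environment fits the context: an embedded local environment when run from the IDE, the cluster's environment when the job is submitted to a cluster. For local debugging with the Flink Web UI (the pom below already includes flink-runtime-web), a sketch:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Local environment with an embedded Web UI, reachable at http://localhost:8081 by default
StreamExecutionEnvironment env =
        StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());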

Reading the data
DataStreamSource<String> lineDS = env.readTextFile("input/word.txt");

Reads text data from the file path "input/word.txt" and emits each line as one String element of the data stream lineDS.
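
Note that readTextFile() is deprecated in Flink 1.17 in favor of the unified FileSource API (the flink-connector-files dependency is already in the pom). A sketch of the equivalent:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStreamSource;

FileSource<String> fileSource = FileSource
        .forRecordStreamFormat(new TextLineInputFormat(), new Path("input/word.txt"))
        .build();
// Bounded source: the job finishes once the file has been fully read
DataStreamSource<String> lineDS =
        env.fromSource(fileSource, WatermarkStrategy.noWatermarks(), "file-source");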

Processing the data
Splitting and transforming
SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOneDS = lineDS
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {

            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
                // Split the line on spaces
                String[] words = value.split(" ");
                for (String word : words) {
                    // Wrap each word in a tuple (word, 1)
                    Tuple2<String, Integer> wordsAndOne = Tuple2.of(word, 1);
                    // Emit the tuple downstream through the collector
                    out.collect(wordsAndOne);
                }
            }
        });

The flatMap function receives each line of text, splits it into individual words, and converts each word into a Tuple2<String, Integer> of the form (word, 1), meaning that word occurred once.
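
The anonymous FlatMapFunction can also be written as a lambda. Because Java erases the generic parameters of Tuple2 at compile time, the output type must then be declared explicitly with returns(); a sketch:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.util.Collector;

SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOneDS = lineDS
        .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
            for (String word : line.split(" ")) {
                out.collect(Tuple2.of(word, 1));
            }
        })
        // required: the lambda's Tuple2 type information is erased at compile time
        .returns(Types.TUPLE(Types.STRING, Types.INT));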

Grouping
KeyedStream<Tuple2<String, Integer>, String> wordAndOneKS = wordAndOneDS.keyBy(
        new KeySelector<Tuple2<String, Integer>, String>() {

            @Override
            public String getKey(Tuple2<String, Integer> value) throws Exception {
                return value.f0; // use the first field (the word) as the key
            }
        }
);

Groups the stream by the first element of each Tuple2, i.e. by the word, so that all tuples for the same word are processed by the same subtask.
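
The KeySelector can be replaced by a lambda as well; no extra type hint is needed here because the key type String is not generic:

KeyedStream<Tuple2<String, Integer>, String> wordAndOneKS =
        wordAndOneDS.keyBy(value -> value.f0);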

Aggregating
SingleOutputStreamOperator<Tuple2<String, Integer>> sumDS = wordAndOneKS.sum(1);

Sums the second element of the tuples (the counter, at tuple position 1) within each group, producing the total number of occurrences of each word.
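
sum(1) is positional shorthand over the tuple; the same aggregation can be written as a general reduce on the keyed stream, a sketch:

SingleOutputStreamOperator<Tuple2<String, Integer>> sumDS =
        wordAndOneKS.reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));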

Printing the result
sumDS.print();

Prints the aggregated results, i.e. each word and its count. In streaming mode the count is emitted incrementally, once per input tuple.
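
print() also accepts a sink identifier that is prefixed to every output line, which helps tell apart the output of multiple print sinks; a sketch:

// Each output line is prefixed with "wc" and the subtask index, e.g. "wc:3> (hello,2)"
sumDS.print("wc");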

Executing
env.execute();

Submits the Flink job and starts execution. Up to this point the program has only built the dataflow graph; nothing runs until execute() is called.
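
execute() optionally takes a job name, which is what appears in the Web UI; executeAsync() is the non-blocking variant that returns a JobClient handle. A sketch:

// Blocking call with an explicit job name
env.execute("WordCountStreamDemo");

// Non-blocking alternative: returns immediately with a handle to the running job
// JobClient client = env.executeAsync("WordCountStreamDemo");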

Flow diagram

The execution flow of the code is visualized below:

+---------------------------------------------------------+
| Create environment:                                     |
| StreamExecutionEnvironment.getExecutionEnvironment()    |
+---------------------------------------------------------+
                          |
                          v
+---------------------------------------------------------+
| Read file:             env.readTextFile()               |
+---------------------------------------------------------+
                          |
                          v
+---------------------------------------------------------+
| Split and transform:   lineDS.flatMap()                 |
+---------------------------------------------------------+
                          |
                          v
+---------------------------------------------------------+
| Group:                 wordAndOneDS.keyBy()             |
+---------------------------------------------------------+
                          |
                          v
+---------------------------------------------------------+
| Aggregate:             wordAndOneKS.sum(1)              |
+---------------------------------------------------------+
                          |
                          v
+---------------------------------------------------------+
| Output:                sumDS.print()                    |
+---------------------------------------------------------+
                          |
                          v
+---------------------------------------------------------+
| Run the Flink job:     env.execute()                    |
+---------------------------------------------------------+

The flow diagram makes the execution path easy to follow, from creating the execution environment all the way to submitting the Flink job.

Log output

(screenshot: console output of the job; each word prints an incremental count such as (hello,1), (hello,2), (hello,3), each line prefixed with the printing subtask's index)

Appendix: pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.zxl</groupId>
    <artifactId>FlinkTutorial-1.17</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <flink.version>1.17.0</flink.version>
    </properties>


    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime-web</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>


        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-files</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-datagen</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.27</version>
        </dependency>

<!-- The JDBC connector for this Flink version is not yet in Maven Central; temporarily use an older version -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-jdbc</artifactId>
            <version>1.16.3</version>
            <!--<version>3.1.0-1.17</version>-->
            <!--<version>1.17-SNAPSHOT</version>-->
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-statebackend-rocksdb</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.3.4</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-statebackend-changelog</artifactId>
            <version>${flink.version}</version>
            <scope>runtime</scope>
        </dependency>

        <dependency>
            <groupId>com.google.code.findbugs</groupId>
            <artifactId>jsr305</artifactId>
            <version>1.3.9</version>
        </dependency>


        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-java-bridge</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner-loader</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-runtime</artifactId>
            <version>${flink.version}</version>
        </dependency>

    </dependencies>

    <repositories>
        <repository>
            <id>apache-snapshots</id>
            <name>apache snapshots</name>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
            <!--<url>https://maven.aliyun.com/repository/apache-snapshots</url>-->
        </repository>
    </repositories>


    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.4</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <excludes>
                                    <exclude>com.google.code.findbugs:jsr305</exclude>
                                    <exclude>org.slf4j:*</exclude>
                                    <exclude>log4j:*</exclude>
                                    <exclude>org.apache.hadoop:*</exclude>
                                </excludes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <!-- Do not copy the signatures in the META-INF folder.
                                    Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers combine.children="append">
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer">
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>


</project>

Reposted from blog.csdn.net/a772304419/article/details/143380465