Flink DataStream Input, Transformation, and Output (Chapter 2)

Using Flink (Chapter 2)

Program Deployment

Local Execution

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.createLocalEnvironment(3)  // specify the program parallelism
//2. Create the DataStream
val text = env.socketTextStream("CentOS", 9999)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")

Remote Deployment

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val text = env.socketTextStream("CentOS", 9999)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")

StreamExecutionEnvironment.getExecutionEnvironment automatically detects the runtime environment. If the program runs inside an IDE such as IDEA, it switches to local mode and the default parallelism equals the number of available CPU threads, equivalent to Spark's local[*]. In a production environment the user must specify the parallelism at submission time with --parallelism.

Deployment Methods

Web UI deployment (omitted); see Flink (Chapter 1).

Deployment via Script

[root@CentOS ~]# cd /usr/flink-1.10.0/
[root@CentOS flink-1.10.0]# ./bin/flink run
--class com.baizhi.quickstart.FlinkWordCountQiuckStart
--detached # detached submission: the client exits automatically after submitting
--parallelism 4 # default parallelism of the program
--jobmanager CentOS:8081 # target JobManager
/root/flink-datastream-1.0-SNAPSHOT.jar
Job has been submitted with JobID f2019219e33261de88a1678fdc78c696

List existing jobs

[root@CentOS flink-1.10.0]# ./bin/flink list --running --jobmanager CentOS:8081
Waiting for response...
------------------ Running/Restarting Jobs -------------------
01.03.2020 05:38:16 : f2019219e33261de88a1678fdc78c696 : Window Stream WordCount
(RUNNING)
--------------------------------------------------------------
No scheduled jobs.
[root@CentOS flink-1.10.0]# ./bin/flink list --all --jobmanager CentOS:8081
Waiting for response...
------------------ Running/Restarting Jobs -------------------
01.03.2020 05:44:29 : ddfc2ddfb6dc05910a887d61a0c01392 : Window Stream WordCount
(RUNNING)
--------------------------------------------------------------
No scheduled jobs.
---------------------- Terminated Jobs -----------------------
01.03.2020 05:36:28 : f216d38bfef7745b36e3151855a18ebd : Window Stream WordCount
(CANCELED)
01.03.2020 05:38:16 : f2019219e33261de88a1678fdc78c696 : Window Stream WordCount
(CANCELED)
--------------------------------------------------------------

Cancel a specific job

[root@CentOS flink-1.10.0]# ./bin/flink cancel --jobmanager CentOS:8081 f2019219e33261de88a1678fdc78c696
Cancelling job f2019219e33261de88a1678fdc78c696.
Cancelled job f2019219e33261de88a1678fdc78c696.

View the program execution plan

[root@CentOS flink-1.10.0]# ./bin/flink info --class com.baizhi.quickstart.FlinkWordCountQiuckStart --parallelism 4 /root/flink-datastream-1.0-SNAPSHOT.jar
----------------------- Execution Plan -----------------------
{
  "nodes": [
    {"id":1,"type":"Source: Socket Stream","pact":"Data Source","contents":"Source: Socket Stream","parallelism":1},
    {"id":2,"type":"Flat Map","pact":"Operator","contents":"Flat Map","parallelism":4,"predecessors":[{"id":1,"ship_strategy":"REBALANCE","side":"second"}]},
    {"id":3,"type":"Map","pact":"Operator","contents":"Map","parallelism":4,"predecessors":[{"id":2,"ship_strategy":"FORWARD","side":"second"}]},
    {"id":5,"type":"aggregation","pact":"Operator","contents":"aggregation","parallelism":4,"predecessors":[{"id":3,"ship_strategy":"HASH","side":"second"}]},
    {"id":6,"type":"Sink: Print to Std. Out","pact":"Data Sink","contents":"Sink: Print to Std. Out","parallelism":4,"predecessors":[{"id":5,"ship_strategy":"FORWARD","side":"second"}]}
  ]
}
--------------------------------------------------------------
No description provided.

You can paste the JSON output into https://flink.apache.org/visualizer/ to view the Flink execution plan graph.

Cross-Platform Submission

object FlinkWordCountQiuckStartCorssPlatform {
    def main(args: Array[String]): Unit = {
        //1. Create the stream execution environment, shipping the job jar to the remote cluster
        var jars="/Users/admin/IdeaProjects/20200203/flink-datastream/target/flink-datastream-1.0-SNAPSHOT.jar"
        val env = StreamExecutionEnvironment.createRemoteEnvironment("CentOS",8081,jars)
        // set the default parallelism
        env.setParallelism(4)
        //2. Create the DataStream
        val text = env.socketTextStream("CentOS", 9999)
        //3. Apply DataStream transformation operators
        val counts = text.flatMap(line=>line.split("\\s+"))
        .map(word=>(word,1))
        .keyBy(0)
        .sum(1)
        //4. Print the results to the console
        counts.print()
        //5. Execute the streaming job
        env.execute("Window Stream WordCount")
    }
}

Repackage the program with mvn before running; then simply run the main function.

Streaming (DataStream API)

DataSource

A data source is where the program reads its input. A source is attached with env.addSource(SourceFunction). Flink ships with many predefined SourceFunction implementations, but users can write their own by implementing the SourceFunction interface (non-parallel) or the ParallelSourceFunction interface (parallel); if lifecycle or state management is needed, extend RichParallelSourceFunction.

File-based

readTextFile

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val text:DataStream[String] = env.readTextFile("hdfs://CentOS:9000/demo/words")
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")

readFile

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
var inputFormat:FileInputFormat[String]=new TextInputFormat(null)
val text:DataStream[String] = env.readFile(inputFormat,"hdfs://CentOS:9000/demo/words")
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")

// Variant: continuously monitor the directory, rescanning it every 1000 ms
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
var inputFormat:FileInputFormat[String]=new TextInputFormat(null)
val text:DataStream[String] = env.readFile(inputFormat,
"hdfs://CentOS:9000/demo/words",FileProcessingMode.PROCESS_CONTINUOUSLY,1000)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")

This method monitors the files under the watched directory; if a file changes, it is re-read in full, which may lead to duplicate processing. In general, do not modify existing files; upload new files instead.

Socket Based

socketTextStream

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val text = env.socketTextStream("CentOS", 9999,'\n',3)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")

Collection-based

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val text = env.fromCollection(List("this is a demo","hello word"))
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")

UserDefinedSource

SourceFunction

import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.util.Random
class UserDefinedNonParallelSourceFunction extends SourceFunction[String]{
    @volatile // prevent threads from caching their own copy of the flag
    var isRunning:Boolean=true
    val lines:Array[String]=Array("this is a demo","hello world","ni hao ma")
    // this method runs the source loop and emits data via sourceContext.collect
    override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
        while(isRunning){
            Thread.sleep(100)
            // send a record downstream
            sourceContext.collect(lines(new Random().nextInt(lines.size)))
        }
    }
    // release resources / stop the loop
    override def cancel(): Unit = {
        isRunning=false
    }
}

ParallelSourceFunction

import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction,SourceFunction}
import scala.util.Random
class UserDefinedParallelSourceFunction extends ParallelSourceFunction[String]{
    @volatile // prevent threads from caching their own copy of the flag
    var isRunning:Boolean=true
    val lines:Array[String]=Array("this is a demo","hello world","ni hao ma")
    // this method runs the source loop and emits data via sourceContext.collect
    override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
        while(isRunning){
            Thread.sleep(100)
            // send a record downstream
            sourceContext.collect(lines(new Random().nextInt(lines.size)))
        }
    }
    // release resources / stop the loop
    override def cancel(): Unit = {
        isRunning=false
    }
}
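
For completeness, the RichParallelSourceFunction variant mentioned in the DataSource overview adds open/close lifecycle hooks for managing per-subtask resources. A minimal sketch, assuming the same random-line generator as above (class name and contents are illustrative):

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import scala.util.Random
class UserDefinedRichParallelSourceFunction extends RichParallelSourceFunction[String]{
    @volatile
    var isRunning:Boolean=true
    val lines:Array[String]=Array("this is a demo","hello world","ni hao ma")
    // open/close come from the Rich* base class and can manage per-subtask resources
    override def open(parameters: Configuration): Unit = println("open connection...")
    override def close(): Unit = println("release connection...")
    override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
        while(isRunning){
            Thread.sleep(100)
            sourceContext.collect(lines(new Random().nextInt(lines.length)))
        }
    }
    override def cancel(): Unit = {
        isRunning=false
    }
}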
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream from the user-defined source
val text = env.addSource[String](new UserDefinedParallelSourceFunction) // or new UserDefinedNonParallelSourceFunction
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
println(env.getExecutionPlan) // print the execution plan
//5. Execute the streaming job
env.execute("Window Stream WordCount")

Kafka Integration

Add the dependency

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.10.0</version>
</dependency>

By default, Flink provides a schema (SimpleStringSchema) that reads only the value of each Kafka record:

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
object KafkaSourceWordCount {
    def main(args: Array[String]): Unit = {
        // create the stream execution environment
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // create the DataStream
        // Kafka configuration
        val props = new Properties()
        props.setProperty("bootstrap.servers", "centos:9092")
        props.setProperty("group.id", "g1")
        val text = env.addSource(new FlinkKafkaConsumer[String]("topic01", new SimpleStringSchema(), props))
        //3. Apply DataStream transformation operators
        val counts= text.flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .keyBy(0)
        .sum(1)
        //4. Print the results to the console
        counts.print()
        //5. Execute the streaming job
        env.execute("Window Stream WordCount")
    }
}

To access the Kafka key, partition, or offset, implement KafkaDeserializationSchema yourself (important to master):

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema
import org.apache.kafka.clients.consumer.ConsumerRecord
class UserDefinedKafkaSchema extends KafkaDeserializationSchema[(String,String,Int,Long)]{
    // a stream never reaches its end
    override def isEndOfStream(t: (String, String, Int, Long)): Boolean = false
    // deserialization logic: extract key, value, partition and offset
    override def deserialize(consumerRecord: ConsumerRecord[Array[Byte], Array[Byte]]): (String, String, Int, Long) = {
        if(consumerRecord.key()!=null){
            (new String(consumerRecord.key()),new String(consumerRecord.value()),consumerRecord.partition(),consumerRecord.offset())
        }else{
            ("",new String(consumerRecord.value()),consumerRecord.partition(),consumerRecord.offset())
        }
    }
    override def getProducedType: TypeInformation[(String, String, Int, Long)] = {
        // requires import org.apache.flink.api.scala._
        createTypeInformation[(String,String,Int,Long)]
    }
}

// create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
// create the DataStream
// Kafka configuration
val props = new Properties()
props.setProperty("bootstrap.servers", "centos:9092")
props.setProperty("group.id", "g1")
val text = env.addSource(new FlinkKafkaConsumer[(String,String,Int,Long)]("topic01", new UserDefinedKafkaSchema(), props))
//3. Apply DataStream transformation operators
val counts= text.flatMap(t => {
    print(t)
    t._2.split(" ")
})
.map(word => (word, 1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")

JSONKeyValueDeserializationSchema

This schema requires that both the key and the value in the Kafka topic are JSON. When constructing it, you can also choose whether to include metadata (topic, partition, offset, etc.). The class is already implemented inside Flink.

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")
//{"id":1,"name":"zhangsan"}
val text = env.addSource(new FlinkKafkaConsumer[ObjectNode]("topic01",new JSONKeyValueDeserializationSchema(true),props))
//t:{"value":{"id":1,"name":"zhangsan"},"metadata":{"offset":0,"topic":"topic01","partition":13}}
text.map(t=> (t.get("value").get("id").asInt(),t.get("value").get("name").asText()))
.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")

Data Sinks

A Data Sink consumes DataStreams and forwards them to files, sockets, external systems, or prints them. Flink comes with a variety of built-in output formats, encapsulated behind operations on DataStreams.

File-based

writeAsText() / TextOutputFormat - writes elements line by line as strings, obtained by calling toString() on each element. (Deprecated)

writeAsCsv(...) / CsvOutputFormat - writes tuples as comma-separated value files. Row and field delimiters are configurable. The value of each field comes from the object's toString() method. (Deprecated)

writeUsingOutputFormat() / FileOutputFormat - method and base class for custom file output. Supports custom object-to-bytes conversion.

Note that the write*() methods on DataStream are mainly intended for debugging purposes.
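
For reference, a minimal sketch of the deprecated writeAsText / writeAsCsv calls, applied to a (String, Int) word-count stream such as the counts stream built in the example below (the local output paths are assumptions):

import org.apache.flink.core.fs.FileSystem
// debug-only file sinks: toString() per element / comma-separated tuple fields
counts.writeAsText("file:///tmp/flink-wc-text", FileSystem.WriteMode.OVERWRITE)
counts.writeAsCsv("file:///tmp/flink-wc-csv", FileSystem.WriteMode.OVERWRITE)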

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val text = env.socketTextStream("CentOS", 9999)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Write the results to a local file. Note: writing to HDFS is buffered, so a large amount of data is needed before results appear
counts.writeUsingOutputFormat(new TextOutputFormat[(String, Int)](new Path("file:///Users/admin/Desktop/flink-results")))
//5. Execute the streaming job
env.execute("Window Stream WordCount")

Note: if the target is changed to HDFS, you need to produce a large amount of data before any output becomes visible, because the HDFS write buffer is fairly large.

The file sinks above do not participate in Flink's checkpointing. In production, flink-connector-filesystem is normally used to write to external file systems.

When a production program needs to write to a file system, the following approach can be used:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-filesystem_2.11</artifactId>
    <version>1.10.0</version>
</dependency>

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
var bucketingSink=StreamingFileSink.forRowFormat(new Path("hdfs://CentOS:9000/bucket-results"), new SimpleStringEncoder[(String,Int)]("UTF-8"))
.withBucketAssigner(new DateTimeBucketAssigner[(String, Int)]("yyyy-MM-dd"))// dynamically generates the output (bucket) path
.build()
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
counts.addSink(bucketingSink)
//5. Execute the streaming job
env.execute("Window Stream WordCount")

Older Flink versions used the following approach:

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
var bucketingSink=new BucketingSink[(String,Int)]("hdfs://CentOS:9000/bucket-results")
bucketingSink.setBucketer(new DateTimeBucketer[(String,Int)]("yyyy-MM-dd"))
bucketingSink.setBatchSize(1024)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
counts.addSink(bucketingSink)
//5. Execute the streaming job
env.execute("Window Stream WordCount")

print() / printToErr()

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
counts.printToErr("test").setParallelism(2)
//5. Execute the streaming job
env.execute("Window Stream WordCount")

printToErr prints its output in red, while print prints in black.

If the parallelism of the output is greater than 1, each output line is prefixed with the id of the subtask that produced it.

UserDefinedSinkFunction

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
class UserDefinedSinkFunction extends RichSinkFunction[(String,Int)]{
    override def open(parameters: Configuration): Unit = {
        println("open connection...")
    }
    override def invoke(value: (String, Int), context: SinkFunction.Context[_]): Unit =
    {
        println("output:"+value)
    }
    override def close(): Unit = {
        println("release connection")
    }
}
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2. Create the DataStream
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
// output the data with the user-defined sink above
counts.addSink(new UserDefinedSinkFunction)
//5. Execute the streaming job
env.execute("Window Stream WordCount")

RedisSink

Add the dependency

<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
    <version>1.0</version>
</dependency>

Reference: https://bahir.apache.org/docs/flink/current/flink-streaming-redis/

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2. Create the DataStream
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")

//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
// single-node Redis connection
var flinkJedisConf = new FlinkJedisPoolConfig.Builder()
.setHost("CentOS")
.setPort(6379)
.build()
counts.addSink(new RedisSink(flinkJedisConf,new UserDefinedRedisMapper()))
//5. Execute the streaming job
env.execute("Window Stream WordCount")

import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
class UserDefinedRedisMapper extends RedisMapper[(String,Int)]{
    override def getCommandDescription: RedisCommandDescription = {
        new RedisCommandDescription(RedisCommand.HSET,"wordcounts")
    }
    override def getKeyFromData(data: (String, Int)): String = data._1
    override def getValueFromData(data: (String, Int)): String = data._2+""
}

Kafka Integration

Add the dependency

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.10.0</version>
</dependency>

Approach 1

import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema
import org.apache.kafka.clients.producer.ProducerRecord
class UserDefinedKafkaSerializationSchema extends KafkaSerializationSchema[(String,Int)]{
    override def serialize(element: (String, Int), timestamp: java.lang.Long):
    ProducerRecord[Array[Byte], Array[Byte]] = {
        return new ProducerRecord("topic01",element._1.getBytes(),element._2.toString.getBytes())
    }
}

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
val props = new Properties()
props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092")
props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG,"100")
props.setProperty(ProducerConfig.LINGER_MS_CONFIG,"500")
//Semantic.EXACTLY_ONCE: enables Kafka transactional (idempotent) writes
//Semantic.AT_LEAST_ONCE: enables the Kafka retries mechanism
val kafkaSink = new FlinkKafkaProducer[(String, Int)]("defult_topic",
new UserDefinedKafkaSerializationSchema, props,Semantic.AT_LEAST_ONCE)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
counts.addSink(kafkaSink)
//5. Execute the streaming job
env.execute("Window Stream WordCount")

The defult_topic above has no actual meaning: the topic that is written to is the one set in the ProducerRecord returned by serialize().
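
Because the topic is decided inside serialize(), the schema can also route records to different topics per record. A minimal sketch, where the routing rule and the topic names topic_error / topic01 are assumptions:

import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema
import org.apache.kafka.clients.producer.ProducerRecord
class RoutingKafkaSerializationSchema extends KafkaSerializationSchema[(String,Int)]{
    override def serialize(element: (String, Int), timestamp: java.lang.Long): ProducerRecord[Array[Byte], Array[Byte]] = {
        // illustrative rule: words starting with "error" go to a separate topic
        val topic = if(element._1.startsWith("error")) "topic_error" else "topic01"
        new ProducerRecord(topic, element._1.getBytes(), element._2.toString.getBytes())
    }
}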

Approach 2 (legacy API)

import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema
class UserDefinedKeyedSerializationSchema extends KeyedSerializationSchema[(String,Int)]{
    override def serializeKey(element: (String, Int)): Array[Byte] = {
        element._1.getBytes()
    }
    override def serializeValue(element: (String, Int)): Array[Byte] = {
        element._2.toString.getBytes()
    }
    // may override the default topic; returning null writes the data to the default topic
    override def getTargetTopic(element: (String, Int)): String = {
        null
    }
}

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
val props = new Properties()
props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092")
props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG,"100")
props.setProperty(ProducerConfig.LINGER_MS_CONFIG,"500")
//Semantic.EXACTLY_ONCE: enables Kafka transactional (idempotent) writes
//Semantic.AT_LEAST_ONCE: enables the Kafka retries mechanism
val kafkaSink = new FlinkKafkaProducer[(String, Int)]("defult_topic",
new UserDefinedKeyedSerializationSchema, props, Semantic.AT_LEAST_ONCE)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
counts.addSink(kafkaSink)
//5. Execute the streaming job
env.execute("Window Stream WordCount")

Operators

Transformation

DataStream → DataStream

Map

dataStream.map ( x => x * 2 )

FlatMap

dataStream.flatMap ( str => str.split(" ") )

Filter

dataStream.filter ( _ != 0 )

DataStream* → DataStream

Union

dataStream.union(otherStream1, otherStream2, ...)

DataStream,DataStream → ConnectedStreams

connect

"Connects" two data streams while retaining their types, which allows sharing state between the two streams.

Unlike union, connect can handle two streams of different element types.

someStream : DataStream[Int] = ...
otherStream : DataStream[String] = ...
val connectedStreams = someStream.connect(otherStream)

ConnectedStreams → DataStream

CoMap, CoFlatMap

connectedStreams.map(
 (_ : Int) => true,
 (_ : String) => false
)
connectedStreams.flatMap(
 (_ : Int) => true,
 (_ : String) => false
)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val text1 = env.socketTextStream("CentOS", 9999)
val text2 = env.socketTextStream("CentOS", 8888)
text1.connect(text2)
.flatMap((line:String)=>line.split("\\s+"),(line:String)=>line.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print("total")
env.execute("Stream WordCount")

DataStream → SplitStream

Split

Splits the stream into two or more streams according to some criterion.

It is typically used to filter stream data, splitting one stream by a condition.

val split = someDataStream.split(
    (num: Int) =>
    (num % 2) match {
        case 0 => List("even")
        case 1 => List("odd")
    }
)

SplitStream → DataStream

Select

val even = split.select("even")
val odd = split.select("odd")
val all = split.select("even","odd")

val env = StreamExecutionEnvironment.getExecutionEnvironment
val text1 = env.socketTextStream("CentOS", 9999)
var splitStream= text1.split(line=> {
    if(line.contains("error")){
        List("error")
    } else{
        List("info")
    }
})
splitStream.select("error").printToErr("error")
splitStream.select("info").print("info")
splitStream.select("error","info").print("all")
env.execute("Stream WordCount")

ProcessFunction

In general, ProcessFunction (with side outputs) is the preferred way to split a stream.

val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("CentOS", 9999)
val errorTag = new OutputTag[String]("error")
val allTag = new OutputTag[String]("all")
val infoStream = text.process(new ProcessFunction[String, String] {
    override def processElement(value: String,
                                ctx: ProcessFunction[String, String]#Context,
                                out: Collector[String]): Unit = {
        if (value.contains("error")) {
            ctx.output(errorTag, value) //边输出
        } else {
            out.collect(value) //正常数据
        }
        ctx.output(allTag, value) //边输出
    }
})
infoStream.getSideOutput(errorTag).printToErr("错误")
infoStream.getSideOutput(allTag).printToErr("所有")
infoStream.print("正常")
env.execute("Stream WordCount")

DataStream → KeyedStream

KeyBy

Logically partitions a stream into disjoint partitions; all elements with the same key go to the same partition. Internally this is implemented with hash partitioning (see the documentation on specifying keys). This transformation returns a KeyedStream. Intuitively, it groups the elements of the DataStream by key.

Partitioning the data this way makes it easy to apply aggregation functions on each group afterwards.

dataStream.keyBy("someKey") // Key by field "someKey"  根据样例类的属性
dataStream.keyBy(0) // Key by the first element of a Tuple	根据元组的位置

KeyedStream → DataStream

Reduce

keyedStream.reduce(_ + _) // incremental aggregation per key

val env = StreamExecutionEnvironment.getExecutionEnvironment
val lines = env.socketTextStream("CentOS", 9999)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy("_1")
.reduce((v1,v2)=>(v1._1,v1._2+v2._2))
.print()
env.execute("Stream WordCount")

Fold

val result: DataStream[String] =
keyedStream.fold("start")((str, i) => { str + "-" + i })

val env = StreamExecutionEnvironment.getExecutionEnvironment
val lines = env.socketTextStream("CentOS", 9999)
lines.flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy("_1")
    .fold((null:String,0:Int))((z,v)=>(v._1,v._2+z._2))
    .print()
env.execute("Stream WordCount")

Aggregations

keyedStream.sum(0)
keyedStream.sum("key")
keyedStream.min(0)
keyedStream.min("key")
keyedStream.max(0)
keyedStream.max("key")
keyedStream.minBy(0)
keyedStream.minBy("key")
keyedStream.maxBy(0)
keyedStream.maxBy("key")

val env = StreamExecutionEnvironment.getExecutionEnvironment
// case class for the sample records below (assumed definition)
case class Emp(name:String, dept:String, salary:Double)
// sample input lines:
// zhangsan R&D 1000
// lisi R&D 9000
// ww Sales 9000
val lines = env.socketTextStream("CentOS", 9999)
lines.map(line=>line.split(" "))
.map(ts=>Emp(ts(0),ts(1),ts(2).toDouble))
.keyBy("dept")
// maxBy returns the complete record with the maximum salary (including the employee name), not just the maximum salary
.maxBy("salary")//Emp(lisi,R&D,9000.0)
.print()
env.execute("Stream WordCount")

If max is used instead, the result is Emp(zhangsan,R&D,9000.0): max only guarantees the aggregated field and the grouping field; the other fields are not taken from the record that holds the maximum.

Physical partitioning

Flink also provides low-level control (if desired) over how a stream is partitioned after a transformation, i.e. the transformed stream can be explicitly repartitioned.

Random partitioning

Partitions elements randomly according to a uniform distribution.

dataStream.shuffle()

Rebalancing (Round-robin partitioning):

Partitions elements round-robin, creating an equal load per partition. Useful for performance optimization in the presence of data skew.

dataStream.rebalance()

Rescaling

Like round-robin partitioning, rescaling also redistributes data cyclically. The difference is that round-robin (rebalance) ships data across the network to all downstream nodes for a global rebalance, whereas rescaling only rebalances between directly connected upstream and downstream operator instances; the concrete assignment depends on the parallelism of the two operators. For example, if the upstream operator has parallelism 2 and the downstream operator has parallelism 4, each upstream subtask routes its data in equal proportion to a fixed pair of downstream subtasks.

dataStream.rescale()
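
A minimal sketch of the 2-to-4 case described above, reusing the CentOS socket source from the earlier examples: each of the two map subtasks only feeds a fixed pair of the four print subtasks.

import org.apache.flink.streaming.api.scala._
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS", 9999)
.map(_.toUpperCase).setParallelism(2) // upstream parallelism 2
.rescale                              // local round-robin between directly connected subtasks
.print().setParallelism(4)            // downstream parallelism 4
env.execute("Rescale Demo")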

Broadcasting

Broadcasts elements to every partition.

dataStream.broadcast

Custom partitioning

A user-defined partitioning strategy:

dataStream.partitionCustom(partitioner, "someKey")
dataStream.partitionCustom(partitioner, 0)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS", 9999)
.map((_,1))
.partitionCustom(new Partitioner[String] {
    override def partition(key: String, numPartitions: Int): Int = {
        key.hashCode & Integer.MAX_VALUE % numPartitions
    }
},_._1)
.print()
.setParallelism(4)
println(env.getExecutionPlan)
env.execute("Stream WordCount")

Task chaining and resource groups

Chaining two operators means placing them in the same thread, which avoids unnecessary thread-handover overhead and improves performance. By default, Flink chains operators whenever possible. For example, the user can call:

StreamExecutionEnvironment.disableOperatorChaining()   // disable Flink's automatic operator chaining

to disable chaining for the whole job, but this is not recommended.

startNewChain

someStream.filter(...).map(...).startNewChain().map(...)

This isolates the first map operator from the filter operator (a new chain starts at that map).

disableChaining

someStream.map(...).disableChaining()

No other operator is allowed to chain with this map operator.

slotSharingGroup

Sets the slot sharing group of an operation. Flink places operators with the same slot sharing group into the same task slot, while operators without that group are kept in other task slots. This can be used to isolate task slots. Downstream operators automatically inherit the slot sharing group of their inputs. By default, all operators belong to the group named "default", so when the user does not partition resources, the number of slots a job needs equals the maximum operator parallelism.

someStream.filter(...).slotSharingGroup("name")

Reposted from blog.csdn.net/origin_cx/article/details/104702519