Real-time WordCount with Flink + Kafka

1. Flink

Flink Introduction

Flink is a distributed processing engine for both streaming and batch data, implemented mainly in Java. Its development currently relies chiefly on contributions from the open source community. Flink's primary target is streaming data: it treats batch data as just an extreme, special case of a stream. In other words, Flink handles every task as a stream, which is its greatest distinguishing feature. Flink also supports fast local iteration as well as some cyclic iterative tasks.

Flink features:

Flink is an open source distributed stream processing framework:
1> It keeps results accurate even when the data source is out of order or data arrives late.
2> It is stateful and fault-tolerant, can recover seamlessly from failures, and maintains exactly-once semantics.
3> It runs distributed at large scale.
4> It is widely used in real-time computing scenarios (Alibaba's Blink, which computes the real-time Double Eleven turnover, is adapted from Flink).

Flink guarantees exactly-once semantics for stateful computations; being stateful means the program can retain information about the data it has already processed.
Flink supports stream processing and event-time window semantics; its windows can be flexibly based on time, counts, sessions, or data-driven triggers.
Flink's fault tolerance is lightweight, letting the system sustain high throughput while providing exactly-once consistency guarantees; it recovers from failures with zero data loss.
Flink achieves high throughput with low latency.
Flink's savepoints provide a version-control mechanism, so an application can be updated, or historical data reprocessed, without losing state and with minimal downtime.
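As a hedged illustration (the original post does not show this step), a savepoint can be triggered manually with the Flink CLI; the job ID below is hypothetical:

bin/flink savepoint 5e20cb6b0f357591171dfcca2eea09de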

2. Kafka

Kafka Introduction

Kafka is an open source stream processing platform developed by the Apache Software Foundation, written in Scala and Java. It is a high-throughput distributed publish-subscribe messaging system that can handle all the activity-stream data of a consumer-scale website. Such activity (page views, searches, and other user actions) is a key ingredient of many social features on the modern web. Because of the throughput required, this data is usually handled by aggregating log files. For systems like Hadoop that analyze log data offline but still face real-time processing constraints, Kafka is a viable solution. Kafka's goal is to unify online and offline message processing through Hadoop's parallel loading mechanism, and to deliver real-time messages across a cluster.

Kafka properties

Kafka is a high-throughput distributed publish-subscribe messaging system with the following characteristics:
1> Messages are persisted in an on-disk data structure that maintains stable performance over long periods, even with terabytes of stored messages.
2> High throughput: even on very ordinary hardware, Kafka can support millions of messages per second.
3> Messages can be partitioned across Kafka servers, with consumption distributed over a cluster of consumer machines.
4> It supports parallel data loading into Hadoop.

Kafka installation, configuration, and basic usage

In this post Flink consumes Kafka data locally to compute a WordCount, so very little Kafka configuration is needed: download the release package from the official Apache website, extract it, and it can be used directly.
Here we create a topic named test; a creation command is sketched below.
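As a hedged sketch (the original post omits this command), the topic can be created as follows, assuming a Kafka release of the same era that still uses a local ZooKeeper on port 2181; newer releases take --bootstrap-server instead of --zookeeper:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test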
Start a producer that feeds the input data stream:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

Start a consumer that monitors the data stream coming from the producer:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning

3. Flink consumes Kafka data

1> Create a Maven project.

2> Add the Flink dependencies to pom.xml:

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>1.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>1.0.0</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka-0.8_2.11</artifactId>
        <version>1.0.0</version>
    </dependency>
</dependencies>
3> Obtain the stream execution environment:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

4> Set the interval at which Flink snapshots the state of the data stream (officially these snapshots are called checkpoints); here one is taken every 1000 ms:

env.enableCheckpointing(1000);
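As a hedged aside not shown in the original post, the checkpointing mode can also be passed explicitly, assuming the Flink 1.x streaming API (import org.apache.flink.streaming.api.CheckpointingMode):

env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE); // exactly-once is also the default mode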

5> Configure the IP and port of Kafka and ZooKeeper (group.id names the Kafka consumer group):

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.1.20:9092");
properties.setProperty("zookeeper.connect", "192.168.1.20:2181");
properties.setProperty("group.id", "test");

6> Create a Flink consumer that reads from the test topic (FlinkKafkaConsumer08 matches the Kafka 0.8 connector dependency above):

FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<String>("test", new SimpleStringSchema(), properties);

7> Turn the Kafka data into Flink's DataStream type:

DataStream<String> stream = env.addSource(myConsumer);

8> Implement the calculation model and output the results. flatMap splits each line into (word, 1) tuples, keyBy(0) groups the tuples by the word field, and sum(1) maintains a running count per word:

DataStream<Tuple2<String, Integer>> counts = stream.flatMap(new LineSplitter()).keyBy(0).sum(1);

counts.print();
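One step the numbered walkthrough leaves implicit: nothing runs until the job is submitted, so the program must end by launching execution (this line also appears in the complete code below):

env.execute("WordCount from Kafka data");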

The concrete logic of the calculation model. For example, the line "Hello, hello world" is lowercased, split on non-word characters, and emitted as (hello,1), (hello,1), (world,1):

public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
    private static final long serialVersionUID = 1L;

    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // Lowercase the line, split on non-word characters, and emit (word, 1) for each token.
        String[] tokens = value.toLowerCase().split("\\W+");
        for (String token : tokens) {
            if (token.length() > 0) {
                out.collect(new Tuple2<String, Integer>(token, 1));
            }
        }
    }
}

4. Verify

1> Type some input into the Kafka producer console.

2> The Flink client immediately outputs the computed result.
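As a hedged illustration (the original screenshots are not reproduced here), typing hello flink hello kafka into the producer console would make the Flink client print running counts roughly like the following, where the leading number is the index of the printing subtask:

1> (hello,1)
2> (flink,1)
1> (hello,2)
2> (kafka,1)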

The complete code

package com.scn;

import java.util.Properties;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.util.Collector;

public class FilnkCostKafka {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(1000);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "192.168.1.20:9092");
        properties.setProperty("zookeeper.connect", "192.168.1.20:2181");
        properties.setProperty("group.id", "test");

        FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<String>("test", new SimpleStringSchema(), properties);

        DataStream<String> stream = env.addSource(myConsumer);

        DataStream<Tuple2<String, Integer>> counts = stream.flatMap(new LineSplitter()).keyBy(0).sum(1);
        counts.print();

        env.execute("WordCount from Kafka data");
    }

    public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        private static final long serialVersionUID = 1L;

        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<String, Integer>(token, 1));
                }
            }
        }
    }
}
