1. Flink
Flink introduction:
Flink is a distributed processing engine for both streaming and batch data, implemented mainly in Java and developed largely through contributions from the open-source community. Flink's primary target is streaming data: batch data is treated as just a special, bounded case of a stream. In other words, Flink handles every task as a stream, which is its defining characteristic. Flink also supports fast local iteration as well as some cyclic iterative tasks.
Flink features:
Flink is an open-source distributed stream-processing framework:
1> It maintains accurate results even when the source data is out of order or arrives late.
2> It is stateful and fault tolerant, and can recover seamlessly from failures while preserving exactly-once semantics.
3> It runs distributed at large scale.
4> It is widely used in real-time computing scenarios (Alibaba's Blink, which computes the real-time turnover for the Double Eleven shopping event, was derived from Flink).
Flink guarantees exactly-once state semantics; being stateful means the program can remember the data it has already processed.
Flink supports stream processing and event-time window semantics, with flexible windows based on time, counts, sessions, or custom data-driven triggers.
Flink's fault tolerance is lightweight: it allows the system to maintain high throughput while still providing exactly-once consistency guarantees, recovering from failures with zero data loss.
Flink delivers high throughput with low latency.
Flink's savepoints provide a state-versioning mechanism, making it possible to update an application or reprocess historical data without losing state and with minimal downtime.
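To make the window semantics above concrete, here is a minimal stdlib-only sketch (an illustration, not Flink's actual window implementation) of how a tumbling time window groups events: each event timestamp is assigned to the window whose start is the timestamp minus (timestamp mod window size).

```java
import java.util.ArrayList;
import java.util.List;

public class TumblingWindowSketch {
    // Assign a timestamp (ms) to the start of its tumbling window of the given size (ms).
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    public static void main(String[] args) {
        long size = 5000; // 5-second tumbling windows
        long[] events = {1200, 4800, 5001, 9999, 10000};
        List<Long> starts = new ArrayList<>();
        for (long t : events) {
            starts.add(windowStart(t, size));
        }
        // Events 1200 and 4800 share window [0, 5000); 5001 and 9999 share [5000, 10000)
        System.out.println(starts); // prints [0, 0, 5000, 5000, 10000]
    }
}
```

Grouping by window start is what lets a streaming engine emit one aggregate per key per window instead of one global aggregate.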
2. Kafka
Kafka introduction
Kafka is an open-source stream-processing platform developed by the Apache Software Foundation and written in Scala and Java. It is a high-throughput distributed publish-subscribe messaging system that can handle all the activity-stream data of a consumer-scale website. This activity (page views, searches, and other user actions) is a key ingredient of many social features on the modern web. Because of the throughput required, such data is usually handled by aggregating logs. For log-data systems like Hadoop that perform offline analysis but are limited when real-time processing is required, Kafka is a viable solution. Kafka's purpose is to unify online and offline message processing through Hadoop's parallel loading mechanism, and to deliver real-time messages across a cluster.
Kafka characteristics
Kafka is a high-throughput distributed publish-subscribe messaging system with the following characteristics:
1> Messages are persisted through an on-disk data structure that maintains stable performance over long periods, even with terabytes of stored messages.
2> High throughput: even on very ordinary hardware, Kafka can support millions of messages per second.
3> Messages can be partitioned across a cluster of Kafka servers and consumed by a cluster of consumer machines.
4> Parallel data loading into Hadoop is supported.
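To illustrate point 3>, a partitioner maps each message key to one of a topic's partitions, so that all messages with the same key land on the same partition. The sketch below is only illustrative: Kafka's real default partitioner uses a murmur2 hash of the serialized key, not Java's `String.hashCode`.

```java
public class PartitionSketch {
    // Illustrative key-to-partition mapping; Kafka's default partitioner uses murmur2.
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is always non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition, preserving per-key ordering.
        System.out.println(partitionFor("user-42", 4) == partitionFor("user-42", 4)); // prints true
    }
}
```

This per-key stickiness is what gives Kafka ordering guarantees within a partition while still scaling writes across brokers.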
Kafka installation, configuration, and basic usage
Because this post only consumes Kafka data locally with Flink to implement WordCount, no elaborate Kafka configuration is needed: download the release from the official Apache website, unpack it, and use it directly.
Here we create a topic named test (on newer Kafka releases, replace --zookeeper localhost:2181 with --bootstrap-server localhost:9092):
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Start a console producer to enter the input data stream:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Start a console consumer to monitor the data stream coming from the producer:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
3. Implementing WordCount with the Flink Java API by consuming Kafka data
1> Create a Maven project
2> Add the dependencies required by Flink and the Flink-Kafka connector to the pom file
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>1.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>1.0.0</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka-0.8_2.11</artifactId>
        <version>1.0.0</version>
    </dependency>
</dependencies>
3> Obtain the Flink StreamExecutionEnvironment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
4> Enable checkpointing and set its interval in milliseconds (Flink's consistent state snapshots are called checkpoints)
env.enableCheckpointing(1000);
5> Configure the IPs and ports of Kafka and ZooKeeper
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.1.20:9092");
properties.setProperty("zookeeper.connect", "192.168.1.20:2181");
properties.setProperty("group.id", "test");
6> Load the Kafka and ZooKeeper configuration into a Flink Kafka consumer
FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<String>("test", new SimpleStringSchema(),properties);
7> Turn the Kafka data into a Flink DataStream
DataStream<String> stream = env.addSource(myConsumer);
8> Define the computation and output the result
DataStream<Tuple2<String, Integer>> counts = stream.flatMap(new LineSplitter()).keyBy(0).sum(1);
counts.print();
The detailed logic of the computation model:
public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
    private static final long serialVersionUID = 1L;

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // Split each line on non-word characters and emit a (token, 1) pair per word
        String[] tokens = value.toLowerCase().split("\\W+");
        for (String token : tokens) {
            if (token.length() > 0) {
                out.collect(new Tuple2<String, Integer>(token, 1));
            }
        }
    }
}
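The tokenizing logic inside LineSplitter can be exercised without a Flink runtime. Below is a stdlib-only sketch of the same lowercase/split/count behavior, where the keyBy(0).sum(1) step is emulated with a plain HashMap:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
    // Same tokenization as LineSplitter, with counting done in a plain map.
    static Map<String, Integer> count(String line) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (token.length() > 0) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // hello -> 2, world -> 1, flink -> 1
        System.out.println(count("Hello world, hello Flink!"));
    }
}
```

The difference in the streaming job is that keyBy(0).sum(1) keeps these per-word counters as managed Flink state and emits an updated count after every incoming record.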
4. Verification
1> Type some words into the Kafka producer console.
2> Observe the running word counts in the Flink client output.
The complete code:
package com.scn;
import java.util.Properties;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.util.Collector;

public class FilnkCostKafka {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(1000);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "192.168.1.20:9092");
        properties.setProperty("zookeeper.connect", "192.168.1.20:2181");
        properties.setProperty("group.id", "test");

        FlinkKafkaConsumer08<String> myConsumer =
                new FlinkKafkaConsumer08<String>("test", new SimpleStringSchema(), properties);

        DataStream<String> stream = env.addSource(myConsumer);

        DataStream<Tuple2<String, Integer>> counts = stream.flatMap(new LineSplitter()).keyBy(0).sum(1);
        counts.print();

        env.execute("WordCount from Kafka data");
    }

    public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        private static final long serialVersionUID = 1L;

        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<String, Integer>(token, 1));
                }
            }
        }
    }
}