Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, stream analytics, data integration, and mission-critical applications.
Core Capabilities
- High Throughput: Deliver messages at network-limited throughput using a cluster of machines, with latencies as low as 2 ms.
- Scalability: Scale production clusters up to a thousand brokers, trillions of messages per day, petabytes of data, and hundreds of thousands of partitions. Elastically expand and contract storage and processing. (1 PB = 1024 TB = 2^50 bytes.)
- Permanent Storage: Store streams of data safely in a distributed, durable, fault-tolerant cluster.
- High Availability: Stretch clusters efficiently across availability zones, or connect separate clusters across geographic regions.
Ecosystem
- Built-in Stream Processing: Process streams of events with joins, aggregations, filters, transformations, and more, using event-time and exactly-once processing.
- Connect to Almost Anything: Kafka's out-of-the-box Connect interface integrates with hundreds of event sources and event sinks, including Postgres, JMS, Elasticsearch, AWS S3, and more.
- Client Libraries: Read, write, and process streams of events in a large number of programming languages.
- Large Ecosystem of Open Source Tools: Take advantage of a wide range of community-driven tools.
Trust and Ease of Use
- Mission Critical: Supports mission-critical use cases with guaranteed ordering, zero message loss, and efficient exactly-once processing.
- Trusted by Thousands of Organizations: Kafka is used by thousands of organizations, from internet giants to automakers to stock exchanges, with over 5 million unique lifetime downloads.
- Huge User Base: Kafka is one of the five most active projects of the Apache Software Foundation, with hundreds of meetups around the world.
- Extensive Online Resources: Extensive documentation, online training, guided tutorials, videos, sample projects, Stack Overflow, and more.
1 Introduction
1.1 What is event streaming?
Event streaming is the digital equivalent of the human body's central nervous system. It is the technological foundation of the "always-on" world, where business is increasingly defined and automated by software, and where the users of software are increasingly themselves software.
Technically speaking, event streaming is the practice of capturing data in real time, in the form of streams of events, from event sources such as databases, sensors, mobile devices, cloud services, and software applications; storing these event streams durably for later retrieval; manipulating, processing, and reacting to the event streams in real time as well as retrospectively; and routing the event streams to different destination technologies as needed. Event streaming thus ensures a continuous flow and interpretation of data so that the right information is in the right place at the right time.
1.2 What can I do with event streaming?
Event streaming is applied to a wide variety of use cases across a large number of industries and organizations. Its many examples include:
- Real-time processing of payments and financial transactions, such as at stock exchanges, banks and insurance companies.
- Track and monitor cars, trucks, fleets and goods in real time, e.g. in the logistics and automotive industries.
- Continuously capture and analyze sensor data from IoT devices or other equipment such as factories and wind farms.
- Collect and immediately respond to customer interactions and orders, such as in retail, hospitality and travel industries, and mobile applications.
- Monitor patient care in the hospital and predict changes in condition to ensure timely treatment in emergencies.
- Connect, store and serve data generated by different parts of the company.
- Serves as the foundation for data platforms, event-driven architectures, and microservices.
1.3 Apache Kafka® is an event streaming platform. What does that mean?
Kafka combines three key capabilities so you can implement end-to-end event streaming use cases with one battle-tested solution:
- Publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.
- Store streams of events durably and reliably for as long as you want.
- Process streams of events as they occur or retrospectively.
All these functions are provided in a distributed, highly scalable, elastic, fault-tolerant and secure manner. Kafka can be deployed on bare metal hardware, virtual machines and containers, on-premises and in the cloud. You can choose between self-managing your Kafka environment and using fully managed services from various vendors.
1.4 In a nutshell, how does Kafka work?
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers in on-premises as well as cloud environments.
Servers: Kafka runs as a cluster of one or more servers that can span multiple data centers or cloud regions. Some of these servers form the storage layer, called brokers. Other servers run Kafka Connect to continuously import and export data as event streams, integrating Kafka with your existing systems such as relational databases as well as other Kafka clusters. To let you implement mission-critical use cases, a Kafka cluster is highly scalable and fault-tolerant: if any of its servers fails, the other servers take over its work, ensuring continuous operation without any data loss.
Clients: They allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner, even in the case of network problems or machine failures. Kafka ships with some such clients included, which are augmented by dozens of clients provided by the Kafka community: clients are available for Java and Scala, including the higher-level Kafka Streams library, as well as for Go, Python, C/C++, and many other programming languages, plus REST APIs.
1.5 Key concepts and terminology
An event records the fact that "something happened" in the world or in your business. It is also called a record or a message in the documentation. When you read or write data to Kafka, you do this in the form of events. Conceptually, an event has a key, a value, a timestamp, and optional metadata headers. Here's an example:
- Event key: “Alice”
- Event value: “Made a payment of $200 to Bob”
- Event timestamp: “Jun. 25, 2020 at 2:06 p.m.”
Producers are those client applications that publish (write) events to Kafka, and consumers are those that subscribe to (read and process) these events. In Kafka, producers and consumers are fully decoupled and agnostic of each other, which is a key design element to achieve the high scalability that Kafka is known for. For example, producers never need to wait for consumers. Kafka provides various guarantees, such as the ability to process events exactly once.
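For illustration, here is a minimal sketch of a Java producer that writes the example event above. It assumes a broker listening on localhost:9092 and a topic named "payments" (both hypothetical for this snippet) and uses only the standard kafka-clients producer API:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PaymentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key "Alice", value describing what happened; Kafka assigns the timestamp if none is given.
            producer.send(new ProducerRecord<>("payments", "Alice", "Made a payment of $200 to Bob"));
        } // close() flushes any buffered records before returning
        // Note that no consumer is involved here: producers and consumers are fully decoupled.
    }
}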
Events are organized and durably stored in topics. Very simplified, a topic is similar to a folder in a filesystem, and the events are the files in that folder. An example topic name could be "payments". Topics in Kafka are always multi-producer and multi-subscriber: a topic can have zero, one, or many producers that write events to it, as well as zero, one, or many consumers that subscribe to these events. Events in a topic can be read as often as needed; unlike traditional messaging systems, events are not deleted after consumption. Instead, you define how long Kafka should retain your events through a per-topic configuration setting, after which older events will be discarded. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is perfectly fine.
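As a non-authoritative illustration of such a per-topic setting, the sketch below uses the Java Admin client to set retention.ms on the hypothetical "payments" topic to seven days (604800000 ms); it assumes a broker at localhost:9092:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "payments");
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "604800000"), // keep events for 7 days
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, Collections.singleton(setRetention)))
                 .all().get(); // block until the change has been applied
        }
    }
}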
Topics are partitioned, meaning a topic is spread over a number of "buckets" located on different Kafka brokers. This distributed placement of your data is very important for scalability because it allows client applications to read and write data to and from many brokers at the same time. When a new event is published to a topic, it is actually appended to one of the topic's partitions. Events with the same event key (e.g., a customer or vehicle ID) are written to the same partition, and Kafka guarantees that any consumer of a given topic partition will always read that partition's events in exactly the same order as they were written.
Figure: This example topic has four partitions P1-P4. Two different producer clients write events to the topic's partitions over the network, publishing new events to the topic independently of each other. Events with the same key (colored in the graph) are written to the same partition. Note that both producers can write to the same partition if appropriate.
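The sketch below demonstrates this behavior: it sends two events with the same key and prints the partition each one landed in; because the key is identical, both records go to the same partition. It assumes a local broker at localhost:9092 and an existing multi-partition topic named "vehicle-positions" (a hypothetical name for this example):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SameKeySamePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata first = producer.send(
                    new ProducerRecord<>("vehicle-positions", "truck-7", "lat=52.51,lon=13.40")).get();
            RecordMetadata second = producer.send(
                    new ProducerRecord<>("vehicle-positions", "truck-7", "lat=52.52,lon=13.41")).get();
            // Same key ("truck-7"), so both events are appended to the same partition.
            System.out.println("first  -> partition " + first.partition());
            System.out.println("second -> partition " + second.partition());
        }
    }
}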
To make your data fault-tolerant and highly available, every topic can be replicated, even across geographic regions or data centers, so that there are always multiple brokers that have a copy of the data in case something goes wrong, you want to do maintenance on the brokers, and so on. A common production setting is a replication factor of 3, i.e., there will always be three copies of your data. This replication is performed at the level of topic partitions.
For getting started, this beginner's article should suffice. The Design section of the documentation explains various Kafka concepts in detail, if you're interested.
1.6 Kafka APIs
In addition to command-line tools for management and administration tasks, Kafka provides five core APIs for Java and Scala:
- The Admin API is used to manage and inspect topics, brokers, and other Kafka objects (a minimal sketch follows this list).
- The Producer API publishes (writes) an event stream to one or more Kafka topics.
- The Consumer API subscribes to (reads) one or more topics and processes the stream of events produced to those topics.
- The Kafka Streams API enables stream processing applications and microservices. It provides higher-level functions for processing event streams, including transformations, stateful operations such as aggregations and joins, windowing, event-time-based processing, and more. Input is read from one or more topics and output is produced to one or more topics, effectively transforming input streams into output streams.
- The Kafka Connect API builds and runs reusable data import/export connectors that can consume (read) or produce (write) streams of events from external systems and applications for integration with Kafka. For example, a connector to a relational database such as PostgreSQL might capture every change to a set of tables. In practice, however, you usually don't need to implement your own connectors, as the Kafka community already provides hundreds of ready-made connectors.
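As a hedged illustration of the Admin API referenced above, here is a small sketch that creates a topic and then lists all topics. It assumes a broker at localhost:9092 and uses a hypothetical topic name:

import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (Admin admin = Admin.create(props)) {
            // 3 partitions, replication factor 1 (fine for a single-broker development setup)
            NewTopic topic = new NewTopic("payments", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get(); // wait for creation to complete
            Set<String> names = admin.listTopics().names().get();
            System.out.println("Topics: " + names);
        }
    }
}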
1.7 What to do next
- To get hands-on experience with Kafka, follow the quickstart.
- To learn about Kafka in more detail, read the documentation. You can also choose books and academic papers on Kafka.
- Browse use cases to see how other users in our global community are getting value from Kafka.
- Join a local Kafka meetup group and watch presentations from Kafka Summit, the main conference for the Kafka community.
2 Quickstart
Step 1: Get Kafka
Download the latest Kafka release and unzip it:
$ tar -xzf kafka_2.13-3.4.0.tgz
$ cd kafka_2.13-3.4.0
Step 2: Start the Kafka environment
Note: Java 8+ must be installed on your local environment
Apache Kafka can be started with either ZooKeeper or KRaft. To get started with one of the two configurations, follow the corresponding section below, but not both.
Kafka with ZooKeeper
Run the following command to start all services in the correct order:
# Start the ZooKeeper service
$ bin/zookeeper-server-start.sh config/zookeeper.properties
Open another terminal session and run:
# Start the Kafka broker service
$ bin/kafka-server-start.sh config/server.properties
Once all services are successfully started, you have a basic Kafka environment up and running and ready to use.
Kafka with KRaft
Generate a cluster UUID
$ KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
Format the log directories
$ bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
Start the Kafka server
$ bin/kafka-server-start.sh config/kraft/server.properties
Once the Kafka server has been successfully started, you will have a basic Kafka environment running and ready to use.
Step 3: Create a topic to store events
Kafka is a distributed event streaming platform that lets you read, write, store, and process events (also called records or messages in the documentation) across many machines.
Examples of events include payment transactions, geolocation updates from mobile phones, shipping orders, sensor measurements from IoT devices or medical equipment, and much more. These events are organized and stored in topics. Very simplified, a topic is similar to a folder in a filesystem, and the events are the files in that folder.
Therefore, before you can write your first event, you must create a topic. Open another terminal session and run:
$ bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092
# ./bin/kafka-topics.sh --help
This tool helps to create, delete, describe, or change a topic.
Option Description
------ -----------
--alter Alter the number of partitions,
replica assignment, and/or
configuration for the topic.
--at-min-isr-partitions if set when describing topics, only
show partitions whose isr count is
equal to the configured minimum.
--bootstrap-server <String: server to REQUIRED: The Kafka server to connect
connect to> to.
#The Kafka server to connect to.
--command-config <String: command Property file containing configs to be
config property file> passed to Admin Client. This is used
only with --bootstrap-server option
for describing and altering broker
configs.
--config <String: name=value> A topic configuration override for the
topic being created or altered. The
following is a list of valid
configurations:
cleanup.policy
compression.type
delete.retention.ms
file.delete.delay.ms
flush.messages
flush.ms
follower.replication.throttled.
replicas
index.interval.bytes
leader.replication.throttled.replicas
local.retention.bytes
local.retention.ms
max.compaction.lag.ms
max.message.bytes
message.downconversion.enable
message.format.version
message.timestamp.difference.max.ms
message.timestamp.type
min.cleanable.dirty.ratio
min.compaction.lag.ms
min.insync.replicas
preallocate
remote.storage.enable
retention.bytes
retention.ms
segment.bytes
segment.index.bytes
segment.jitter.ms
segment.ms
unclean.leader.election.enable
See the Kafka documentation for full
details on the topic configs. It is
supported only in combination with --
create if --bootstrap-server option
is used (the kafka-configs CLI
supports altering topic configs with
a --bootstrap-server option).
--create Create a new topic.
--delete Delete a topic
--delete-config <String: name> A topic configuration override to be
removed for an existing topic (see
the list of configurations under the
--config option). Not supported with
the --bootstrap-server option.
--describe List details for the given topics.
#List details for the given topics.
--disable-rack-aware Disable rack aware replica assignment
--exclude-internal exclude internal topics when running
list or describe command. The
internal topics will be listed by
default
--help Print usage information.
--if-exists if set when altering or deleting or
describing topics, the action will
only execute if the topic exists.
--if-not-exists if set when creating topics, the
action will only execute if the
topic does not already exist.
--list List all available topics.
--partitions <Integer: # of partitions> The number of partitions for the topic
being created or altered (WARNING:
If partitions are increased for a
topic that has a key, the partition
logic or ordering of the messages
will be affected). If not supplied
for create, defaults to the cluster
default.
--replica-assignment <String: A list of manual partition-to-broker
broker_id_for_part1_replica1 : assignments for the topic being
broker_id_for_part1_replica2 , created or altered.
broker_id_for_part2_replica1 :
broker_id_for_part2_replica2 , ...>
--replication-factor <Integer: The replication factor for each
replication factor> partition in the topic being
created. If not supplied, defaults
to the cluster default.
--topic <String: topic> The topic to create, alter, describe
or delete. It also accepts a regular
expression, except for --create
option. Put topic name in double
quotes and use the '\' prefix to
escape regular expression symbols; e.
g. "test\.topic".
#The topic to create, alter, describe (show), or delete. Except with the --create
#option, it also accepts a regular expression. Put the topic name in double quotes
#and use the '\' prefix to escape regular-expression characters, e.g. "test\.topic".
--topic-id <String: topic-id> The topic-id to describe.This is used
only with --bootstrap-server option
for describing topics.
--topics-with-overrides if set when describing topics, only
show topics that have overridden
configs
--unavailable-partitions if set when describing topics, only
show partitions whose leader is not
available
--under-min-isr-partitions if set when describing topics, only
show partitions whose isr count is
less than the configured minimum.
--under-replicated-partitions if set when describing topics, only
show under replicated partitions
--version Display Kafka version.
All of Kafka's command-line tools have additional options: run the kafka-topics.sh command without any arguments to display usage information. For example, it can also show you details such as the partition count of the new topic:
$./bin/kafka-topics.sh --describe --topic quickstart-events --bootstrap-server localhost:9092
Topic: quickstart-events TopicId: 4HukHBG9QYySmD7mtOIYbw PartitionCount: 1 ReplicationFactor: 1 Configs:
Topic: quickstart-events Partition: 0 Leader: 0 Replicas: 0 Isr: 0
Step 4: Write some events to the topic
Kafka clients communicate with Kafka brokers over the network to write (or read) events. Once an event is received, the broker will store the event in a durable and fault-tolerant manner for as long as you need it - maybe even forever.
Run the console producer client to write some events to the topic. By default, each line you enter will cause a separate event to be written to the topic.
$ bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
This is my first event
This is my second event
You can stop the producer client with Ctrl-C at any time.
Step 5: Read Events
Open another terminal session and run the console consumer client to read the events you just created:
$ bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092
This is my first event
This is my second event
# ./bin/kafka-console-consumer.sh --help
This tool helps to read data from Kafka topics and outputs it to standard output.
Option Description
------ -----------
--bootstrap-server <String: server to REQUIRED: The server(s) to connect to.
connect to>
--consumer-property <String: A mechanism to pass user-defined
consumer_prop> properties in the form key=value to
the consumer.
--consumer.config <String: config file> Consumer config properties file. Note
that [consumer-property] takes
precedence over this config.
--enable-systest-events Log lifecycle events of the consumer
in addition to logging consumed
messages. (This is specific for
system tests.)
--formatter <String: class> The name of a class to use for
formatting kafka messages for
display. (default: kafka.tools.
DefaultMessageFormatter)
--formatter-config <String: config Config properties file to initialize
file> the message formatter. Note that
[property] takes precedence over
this config.
--from-beginning If the consumer does not already have
an established offset to consume
from, start with the earliest
message present in the log rather
than the latest message.
# If the consumer does not already have an established offset to consume from,
# start with the earliest message present in the log rather than the latest message.
--group <String: consumer group id> The consumer group id of the consumer.
--help Print usage information.
--include <String: Java regex (String)> Regular expression specifying list of
topics to include for consumption.
--isolation-level <String> Set to read_committed in order to
filter out transactional messages
which are not committed. Set to
read_uncommitted to read all
messages. (default: read_uncommitted)
--key-deserializer <String:
deserializer for key>
--max-messages <Integer: num_messages> The maximum number of messages to
consume before exiting. If not set,
consumption is continual.
--offset <String: consume offset> The offset to consume from (a non-
negative number), or 'earliest'
which means from beginning, or
'latest' which means from end
(default: latest)
--partition <Integer: partition> The partition to consume from.
Consumption starts from the end of
the partition unless '--offset' is
specified.
--property <String: prop> The properties to initialize the
message formatter. Default
properties include:
print.timestamp=true|false
print.key=true|false
print.offset=true|false
print.partition=true|false
print.headers=true|false
print.value=true|false
key.separator=<key.separator>
line.separator=<line.separator>
headers.separator=<line.separator>
null.literal=<null.literal>
key.deserializer=<key.deserializer>
value.deserializer=<value.
deserializer>
header.deserializer=<header.
deserializer>
Users can also pass in customized
properties for their formatter; more
specifically, users can pass in
properties keyed with 'key.
deserializer.', 'value.
deserializer.' and 'headers.
deserializer.' prefixes to configure
their deserializers.
--skip-message-on-error If there is an error when processing a
message, skip it instead of halt.
--timeout-ms <Integer: timeout_ms> If specified, exit if no message is
available for consumption for the
specified interval.
--topic <String: topic> The topic to consume on.
--value-deserializer <String:
deserializer for values>
--version Display Kafka version.
--whitelist <String: Java regex DEPRECATED, use --include instead;
(String)> ignored if --include specified.
Regular expression specifying list
of topics to include for consumption.
You can stop the consumer client with Ctrl-C at any time.
Feel free to experiment: for example, switch back to the producer terminal (previous step) to write additional events and see how the events show up immediately in the consumer terminal.
Because events are stored durably in Kafka, they can be read as many times as you want by as many consumers . You can easily verify this by opening another terminal session and re-running the previous command again.
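For comparison with the console tools, here is a minimal sketch of a Java consumer that reads the same topic. It assumes a broker at localhost:9092 and uses a hypothetical consumer group id; setting auto.offset.reset to earliest gives a new group the same effect as --from-beginning:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class QuickstartConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "quickstart-group");        // hypothetical consumer group id
        props.put("auto.offset.reset", "earliest");       // read from the beginning for a new group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("quickstart-events"));
            while (true) { // poll until the process is stopped, like the console consumer
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}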
Step 6: Import/export your data as streams of events with Kafka Connect
You likely have large amounts of data in existing systems such as relational databases or traditional messaging systems, and many applications that already use these systems. Kafka Connect allows you to continuously ingest data from external systems into Kafka and vice versa . It is an extensible tool that runs connectors that implement custom logic for interacting with external systems. Therefore, it is very easy to integrate existing systems with Kafka. To make this process easier, there are hundreds of these connectors readily available.
In this quickstart, we'll see how to use a simple connector to run Kafka Connect, import data from a file to a Kafka topic, and export data from a Kafka topic to a file.
First, make sure to add connect-file-3.4.0.jar to the plugin.path property in the Connect worker's configuration. For the purpose of this quickstart we'll use a relative path and treat the connectors' package as an uber jar, which works when the quickstart commands are run from the installation directory. Note, however, that for production deployments it is always preferable to use absolute paths. See plugin.path for a detailed description of how to set this configuration.
Edit the config/connect-standalone.properties file, add or change the plugin.path configuration property to match the following, and save the file:
echo "plugin.path=libs/connect-file-3.4.0.jar" >> config/connect-standalone.properties
Then, create some seed data to test with:
echo -e "foo\nbar" > test.txt
Or on Windows:
echo foo> test.txt
echo bar>> test.txt
Next, we'll start the two connectors running in standalone mode, which means they run in a single local dedicated process . We provide three configuration files as parameters. The first is always the configuration of the Kafka Connect process, which contains common configurations such as the broker that Kafka will connect to and the serialization format of the data. The rest of the configuration files individually specify which connectors to create. These files include a unique connector name, the connector class to instantiate, and any other configuration required by the connector.
bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties
These example configuration files, included with Kafka, use the default local cluster configuration you started earlier and create two connectors: the first is a source connector, which reads rows from an input file and produces each row to Kafka topic, and the second is a sink connector that reads messages from a Kafka topic and produces each message as a line in the output file.
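For reference, the two connector config files shipped with Kafka look roughly like this (an illustrative sketch; check the actual files under config/ for the authoritative contents):

# config/connect-file-source.properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test

# config/connect-file-sink.properties
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=test.sink.txt
topics=connect-test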
During startup you'll see a number of log messages, including some indicating that the connectors are being instantiated. Once the Kafka Connect process has started, the source connector should start reading lines from test.txt and producing them to the topic connect-test, and the sink connector should start reading messages from the topic connect-test and writing them to the file test.sink.txt. We can verify that the data has been delivered through the entire pipeline by examining the contents of the output file:
> more test.sink.txt
foo
bar
Note that the data is stored in the Kafka topic connect-test, so we can also run a console consumer to see the data in the topic (or use custom consumer code to process it):
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning
{"schema":{"type":"string","optional":false},"payload":"foo"}
{"schema":{"type":"string","optional":false},"payload":"bar"}
...
The connectors continue to process data, so we can add data to the file and see it move through the pipeline:
echo Another line>> test.txt
Step 7: Process events with Kafka Streams
Once your data is stored in Kafka as events, you can process the data with the Kafka Streams client library for Java/Scala. It allows you to implement mission-critical real-time applications and microservices where the input and/or output data is stored in Kafka topics. Kafka Streams combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology, making these applications highly scalable, elastic, fault-tolerant, and distributed. The library supports exactly-once processing, stateful operations and aggregations, windowing, joins, event-time-based processing, and much more.
To give you a first taste, here's how one would implement the popular WordCount algorithm:
// Build the processing topology: read lines from "quickstart-events",
// split them into words, and count how often each word occurs.
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("quickstart-events");
KTable<String, Long> wordCounts = textLines
    .flatMapValues(line -> Arrays.asList(line.toLowerCase().split(" ")))
    .groupBy((keyIgnored, word) -> word)
    .count();
wordCounts.toStream().to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));
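To actually run this topology you would wrap it in a small driver. Below is a minimal sketch continuing the snippet above (imports omitted, as in the snippet); the application id is hypothetical and a local broker at localhost:9092 is assumed:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-quickstart"); // hypothetical application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed local broker
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start(); // runs until the JVM is stopped
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));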
The Kafka Streams Demo and Application Development Tutorial demonstrates how to write and run such a streaming application from start to finish.
Step 8: Terminate the Kafka environment
Now that you have completed the quickstart, feel free to terminate the Kafka environment, or continue using it.
- Stop the producer and consumer clients with Ctrl-C, if you haven't done so already.
- Stop the Kafka broker with Ctrl-C.
- Finally, if you started Kafka with ZooKeeper, stop the ZooKeeper server with Ctrl-C.
If you also want to delete any data from your local Kafka environment, including any events you created during the process, run the command:
rm -rf /tmp/kafka-logs /tmp/zookeeper /tmp/kraft-combined-logs