Kafka principle analysis

What is a message system?
A system that manages message queues.
What is a message queue?
A software engineering component used for communication between processes, or between threads within the same process.
It uses a queue to pass messages, where passing a message means handing over either control or content.

This raises a question:

Is a message queue used to improve performance and speed up message transmission?
Obviously not. Although a message queue provides data redundancy, it is not a cache.
If speed were the goal, you could write the producer and the consumer in the same program and put a purely in-memory queue between them. With no persistence and no network transmission, that would be faster.
The best description of a message queue is "fire and forget", in other words decoupling: it effectively decouples producers from consumers and reduces the complexity of the system.
A producer's main concern should be its own production work. It should not care who consumes what it produces or how it is consumed. It simply puts what it produces into a warehouse (fire) and can then ignore it (forget), without any further burden.
What happens afterwards, namely how the message is delivered to the consumer, whether that delivery might lose the message, and other reliability issues, is the job of the message queue (which is why a message queue is more than a storage area for intermediate results). It is an intermediate warehouse that deals with consumers and guarantees the reliability of subsequent delivery.
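
A minimal sketch of the "fire and forget" idea, using only Python's standard library: the producer puts items on a queue and moves on, while a separate consumer thread takes them off. The function names and the stop marker are illustrative assumptions. This is the purely in-memory variant described above, so it has no persistence and no delivery guarantees, which is what a real message queue adds.

```python
import queue
import threading

SENTINEL = None  # hypothetical marker telling the consumer to stop


def producer(q, items):
    """Put every item on the queue and return without waiting for consumers."""
    for item in items:
        q.put(item)      # "fire"
    q.put(SENTINEL)      # nothing more to send; the producer can "forget"


def consumer(q, results):
    """Take items off the queue until the stop marker arrives."""
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(item.upper())  # stand-in for real processing


def run_demo(items):
    q = queue.Queue()    # purely in-memory: no persistence, no network
    results = []
    worker = threading.Thread(target=consumer, args=(q, results))
    worker.start()
    producer(q, items)
    worker.join()
    return results
```

Calling `run_demo(["a", "b"])` runs both sides to completion; the producer never references the consumer directly, only the queue.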

Kafka organizes messages by topic;
programs that publish messages to Kafka topics are called producers;
programs that subscribe to topics and consume messages are called consumers;
Kafka runs as a cluster made up of one or more servers, each of which is called a broker;
producers send messages to the Kafka cluster over the network, and the cluster serves those messages to consumers;

Classification of message queues:

Point-to-point message queue (queue):
A message can be consumed by only one consumer, although the queue supports multiple consumers.
Once a message has been consumed it is no longer stored in the queue, so no other consumer can consume it.

Publish-subscribe message queue (topic):
A message can be consumed by multiple consumers, so messages are shared; Kafka follows this model.
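
The difference between the two models can be sketched with two toy classes. The class and method names are hypothetical and this is not how any real broker is implemented; it only shows that a point-to-point queue removes a message on read, while a publish-subscribe topic retains it and lets each subscriber read at its own position.

```python
from collections import deque


class PointToPointQueue:
    """Each message is removed when read, so only one consumer ever sees it."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Messages are retained; every subscriber reads them at its own position."""

    def __init__(self):
        self._log = []
        self._positions = {}  # subscriber name -> index of next message to read

    def publish(self, message):
        self._log.append(message)

    def subscribe(self, name):
        self._positions[name] = 0

    def poll(self, name):
        """Return all messages this subscriber has not yet seen."""
        pos = self._positions[name]
        batch = self._log[pos:]
        self._positions[name] = len(self._log)
        return batch
```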

Why Kafka was built:
to collect activity data: data about user behavior on a website, such as PV and UV;
to collect operational data: performance indicators used to monitor core systems;

The characteristics of these data:
the data is immutable;
the data volume is huge;
the data needs to be processed in real time;

Kafka is a distributed messaging system (cluster):
load balancing + failover

The basic structure of the message system:

The broker is the component that receives messages; each message server can run multiple brokers, and multiple servers form the distributed cluster of the message system;
ZooKeeper acts as the coordinator: each broker registers with ZooKeeper, and ZooKeeper synchronizes that information to the other brokers;
Kafka uses ZooKeeper to store meta information;

Basic concepts:
1. topic: a category of messages handled by Kafka; every message in Kafka belongs to a topic;
2. partition: a physical grouping of a topic's messages; a topic can have multiple partitions, each partition is an ordered queue, and each message in a partition is assigned a sequential id (its offset);
3. message: a single message, the basic unit of communication; a producer sends messages to a topic;
4. producer: a process that publishes messages to a Kafka topic; publishing is called production;
5. consumer: a process that subscribes to topics and processes the messages published to them; this is called consumption;
6. broker: a caching proxy; one of the one or more servers in the Kafka cluster, responsible for actually receiving and processing messages;
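
To make the relationship between topic, partition, and message id concrete, here is a toy in-memory model. The class names are hypothetical and nothing here is persisted or replicated; it only illustrates that each partition is an ordered, append-only sequence in which a message's position serves as its id.

```python
class Partition:
    """An ordered, append-only list of messages; the list index is the offset."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset assigned to this message

    def read(self, offset):
        """Return every message from the given offset onward."""
        return self.messages[offset:]


class Topic:
    """A named group of partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

    def send(self, partition_id, message):
        """Append to the chosen partition and return (partition_id, offset)."""
        offset = self.partitions[partition_id].append(message)
        return partition_id, offset
```

Offsets are only ordered within one partition: two messages sent to different partitions can both receive offset 0.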

Application scenarios of Kafka:
messaging (message system)
website activity tracking
log aggregation (log collection center)

First of all, partitions are not transparent to the application layer: a producer can specify which partition each message it generates is sent to.
A topic is divided into partitions for load balancing. When message order does not matter, a single topic can use multiple partitions whose leaders are spread evenly across all brokers, which relieves hot spots on any single broker.
Another reason is that in the index file corresponding to each .log file (segment), the offset is 32 bits wide. What if the number of records in a single partition exceeds what 32 bits can represent (roughly 4 billion)? Use multiple partitions.
What if an HDD holds 8 TB and an SSD holds 1 TB, and the messages produced under one topic exceed the size of the disk? Again, split the topic into partitions.
If a server is configured with 24 or 36 disks, dividing topics into partitions gives finer granularity, so the load across the disks of a single machine is also more balanced.
In short, if a topic holds too much data, it has to be partitioned.
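
A minimal sketch of how a producer might pick a partition: either the caller names one explicitly, or the key is hashed and reduced modulo the partition count. The function name and signature are hypothetical, and CRC32 is used here only as a stand-in hash; Kafka's own default partitioner uses a different hash function (murmur2).

```python
import zlib


def choose_partition(key, num_partitions, explicit_partition=None):
    """Return the partition index for a message.

    If the caller names a partition, use it; otherwise hash the key so that
    the same key always lands in the same partition.
    """
    if explicit_partition is not None:
        if not 0 <= explicit_partition < num_partitions:
            raise ValueError("partition out of range")
        return explicit_partition
    # zlib.crc32 is a stand-in hash, not what Kafka itself uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Because the mapping depends on `num_partitions`, changing the partition count of a topic changes where existing keys land.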

Available memory and number of partitions: a broker allocates the amount of memory specified by the replica.fetch.max.bytes parameter for each partition. With replica.fetch.max.bytes=1M and 1000 partitions, that is almost 1G of memory. Make sure that the number of partitions multiplied by the largest message size does not exceed the server's memory, otherwise an OOM error will be reported. Similarly, fetch.message.max.bytes on the consumer side specifies the memory needed for the largest message, and again the number of partitions multiplied by that size must not exceed the server's memory. So if you need to send large messages and the amount of memory is fixed, you must either use fewer partitions or use a server with more memory.
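
The arithmetic above can be written out as a small calculation. The helper names are hypothetical, and this is only the rough upper-bound estimate described in the text (partitions multiplied by the per-partition fetch size), not an exact model of broker memory use.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def replica_fetch_memory_bytes(num_partitions, replica_fetch_max_bytes):
    """Estimated memory if every partition uses its full fetch buffer."""
    return num_partitions * replica_fetch_max_bytes


def max_partitions_for_memory(available_bytes, replica_fetch_max_bytes):
    """Largest partition count whose estimated buffers fit in the given memory."""
    return available_bytes // replica_fetch_max_bytes


# The example from the text: 1000 partitions at 1 MiB each.
needed = replica_fetch_memory_bytes(1000, 1 * MIB)
print(f"{needed / GIB:.2f} GiB")  # 0.98 GiB, i.e. "almost 1G"
```

The second helper shows the trade-off in the other direction: with 4 GiB available and an 8 MiB fetch size, `max_partitions_for_memory(4 * GIB, 8 * MIB)` gives 512 partitions.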

1 What is a consumer group?
A consumer group is a scalable, fault-tolerant consumer mechanism provided by Kafka. Since it is a group, it contains multiple consumers (consumer instances) that share a common ID, the group ID. All consumers in the group cooperate to consume all partitions of the subscribed topics, and each partition can be consumed by only one consumer within the same consumer group.
Three characteristics of a consumer group:
● A consumer group contains one or more consumer instances, and a consumer instance can be a process or a thread
● group.id is a string that uniquely identifies a consumer group
● Each partition of a subscribed topic can be assigned to only one consumer within a group (the same partition can, of course, also be assigned to consumers in other groups)
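
The "one partition, one consumer per group" rule can be illustrated with a simple round-robin assignment. The function name is hypothetical, and this is a simplified illustration rather than Kafka's actual assignment logic, which offers several configurable strategies.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group, round-robin."""
    if not consumers:
        raise ValueError("a group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment
```

With five partitions and two consumers, one consumer receives three partitions and the other two. With more consumers than partitions, the extra consumers receive nothing, which is why adding consumers beyond the partition count does not increase parallelism within a group.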
2 Consumer position
While consuming, a consumer needs to record how much data it has consumed, that is, its consumption position. Kafka has a special term for this position: offset. Many message engines store this information on the server side (broker side). The advantage is simplicity of implementation, but there are three main problems: 1. The broker becomes stateful, which hurts scalability; 2. An acknowledgement mechanism has to be introduced to confirm successful consumption; 3. Storing offsets for a large number of consumers requires complex data structures, which wastes resources. Kafka chose a different approach: each consumer group keeps its own offset information, so a single integer is enough to represent a position; in addition, a checkpoint mechanism can be introduced to persist it periodically, which simplifies the implementation of the acknowledgement mechanism.
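
A minimal sketch of the consumer-side approach described above: the position is a single integer held by the consumer, and it is persisted periodically as a checkpoint. The class name, the JSON file format, and the `checkpoint_every` parameter are all assumptions made for illustration; this is not Kafka's storage format.

```python
import json


class OffsetTracker:
    """Keeps a consumer's position as one integer and checkpoints it to a file."""

    def __init__(self, checkpoint_path, checkpoint_every=100):
        self.checkpoint_path = checkpoint_path
        self.checkpoint_every = checkpoint_every
        self.offset = self._load()
        self._since_checkpoint = 0

    def _load(self):
        # Resume from the last checkpoint, or start at 0 if none exists.
        try:
            with open(self.checkpoint_path) as f:
                return json.load(f)["offset"]
        except FileNotFoundError:
            return 0

    def advance(self, count=1):
        """Record that `count` more messages were consumed."""
        self.offset += count
        self._since_checkpoint += count
        if self._since_checkpoint >= self.checkpoint_every:
            self.checkpoint()

    def checkpoint(self):
        """Persist the current position."""
        with open(self.checkpoint_path, "w") as f:
            json.dump({"offset": self.offset}, f)
        self._since_checkpoint = 0
```

If the consumer crashes between checkpoints, it resumes from the last persisted offset and re-reads the messages consumed since then, so a shorter checkpoint interval trades more writes for less reprocessing.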
3 Offset management
3.1 Automatic vs. manual
By default, Kafka commits offsets for you automatically at regular intervals (enable.auto.commit = true). You can also commit offsets manually to keep control yourself. In addition, Kafka periodically saves each group's consumption state as an offset map, as shown in the following figure:

The above figure shows the current consumption of the test-group group.

3.2 Offset commit
Old versions committed offsets to ZooKeeper (no diagram is drawn here). In short, the directory structure is /consumers/<group.id>/offsets/<topic>/<partitionId>. However, ZooKeeper is not well suited to large volumes of read and write operations, especially writes. Kafka therefore provides another solution: it adds the __consumer_offsets topic and writes offset information to it, removing the dependence on ZooKeeper (as far as storing offsets is concerned). Messages in __consumer_offsets record the offset information committed by each consumer group at a given time. Still taking the consumer group in the figure above as an example, the format is roughly as follows:

The __consumer_offsets topic is configured with the compact cleanup policy, so it always retains the latest offset information. This both bounds the overall log size of the topic and preserves the most recent offsets. For the principle behind compaction, see: Log Compaction
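
The effect of compaction can be sketched as "keep only the newest record per key". The function below is a hypothetical illustration on an in-memory list, with keys modeled as (group, topic, partition) tuples; real log compaction runs in the background on log segments and is considerably more involved.

```python
def compact(log):
    """Keep only the last record for each key, preserving log order.

    `log` is a list of (key, value) pairs, oldest first.
    """
    last_index = {key: i for i, (key, _value) in enumerate(log)}
    return [record for i, record in enumerate(log) if last_index[record[0]] == i]
```

For a log in which the key `("test-group", "t", 0)` was committed first at offset 5 and later at offset 9, only the record with 9 survives, so the topic's size grows with the number of distinct keys rather than the number of commits.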
As for which partition of __consumer_offsets each group is stored in, and how to inspect it, see the article: How Kafka reads the content of the offsets topic (__consumer_offsets)
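
As a rough illustration of the group-to-partition mapping, a commonly cited rule is that the partition is the absolute value of the group ID's Java `hashCode` modulo offsets.topic.num.partitions (50 by default). Treat that rule as an assumption to verify against your Kafka version; the function names below are hypothetical, and the sketch reimplements Java's string hash in Python.

```python
def java_string_hash(s):
    """Reproduce Java's String.hashCode() as a signed 32-bit integer.

    Assumes every character is in the Basic Multilingual Plane, so that
    ord(ch) matches the UTF-16 code unit Java would use.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition_for(group_id, num_partitions=50):
    """Assumed rule: abs(hashCode(group_id)) % num_partitions."""
    h = java_string_hash(group_id)
    # The smallest 32-bit integer has no positive counterpart; map it to 0.
    non_negative = 0 if h == -0x80000000 else abs(h)
    return non_negative % num_partitions
```

Under this rule, all offsets committed by one group land in the same partition of the offsets topic, so that is the only partition that needs to be read to inspect the group's commits.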

The design principles of Kafka:
persistence;
the primary constraint is throughput rather than features;
the state of what has been consumed is kept as part of the consumer, not on the server;
it is a distributed system;

The storage structure of kafka in zookeeper:
Kafka stores meta information in ZooKeeper under several directories. The directory structure is as follows:
Through ZooKeeper, a client obtains the broker information and then sends its messages to a broker; the Kafka cluster assigns requests to broker servers at random;

Kafka deployment modes:
1. Single-broker deployment: one broker service is deployed on one message server;
2. Single-machine multi-broker deployment (pseudo-distributed): multiple broker services are deployed on one message server;
3. Multi-machine multi-broker deployment (truly distributed):
there are multiple physical message servers, with one or more broker services deployed on each;
each broker has its own id, which is defined when the broker is created;
broker ids are assigned in sequence;
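
For the single-machine multi-broker case, each broker needs its own id, port, and log directory. The sketch below generates the text of one properties file per broker; the function name, file names, default port, and paths are assumptions for illustration, while `broker.id`, `listeners`, `log.dirs`, and `zookeeper.connect` are broker configuration keys.

```python
def render_broker_configs(num_brokers, base_port=9092,
                          log_root="/tmp/kafka-logs",
                          zookeeper="localhost:2181"):
    """Return {file name: properties text} with one entry per broker.

    Every broker gets a distinct id, listener port, and log directory so
    that several brokers can run side by side on one machine.
    """
    configs = {}
    for broker_id in range(num_brokers):
        lines = [
            f"broker.id={broker_id}",
            f"listeners=PLAINTEXT://:{base_port + broker_id}",
            f"log.dirs={log_root}-{broker_id}",
            f"zookeeper.connect={zookeeper}",
        ]
        configs[f"server-{broker_id}.properties"] = "\n".join(lines) + "\n"
    return configs
```

The function only builds strings; writing them to disk and starting each broker with its own file is left to the deployment scripts.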

Single broker deployment:

Single-machine multi-broker deployment:

Use the jps command to view Kafka's broker processes;
Kafka startup error mentioning UseCompressedOops: edit bin/kafka-run-class.sh and remove this option;

Kafka's configuration file and demo:

