How a ByteDance interviewer asks about message queues: high availability, duplicate consumption, reliable delivery, ordered consumption, message backlog, all in one article

A few words up front

The end of the year is peak job-hopping season, and when candidates go out to interview, many interviewers ask about message queues. A lot of answers come back incomplete; some people know the material in their heads but can't express it well. The root cause is usually an incomplete understanding of the underlying concepts. Today we'll work through this topic together. Note: the article is a bit long. You say you can read it all in one sitting? I don't believe it!!

The article has been included in:

https://github.com/sunshinelyz/technology-binghe

https://gitee.com/binghe001/technology-binghe

What is a message queue?

A message queue (Message Queue) is a container that holds messages while they are in transit, and a means of communication between applications. A sender can return immediately after publishing a message, and the messaging system guarantees reliable delivery. The publisher only writes messages to the queue without caring who will consume them, and the consumer only takes messages from the queue without knowing who published them. This separates production from consumption.

Why use message queues?

Advantages:

  • Asynchronous processing: SMS notifications, terminal status push, app push, user registration, etc.
  • Data synchronization: pushing business data to downstream systems
  • Retry compensation: retrying after a failed accounting operation
  • System decoupling: upstream/downstream communication, terminal anomaly monitoring, distributed event center
  • Traffic peak shaving: order processing in flash-sale scenarios
  • Publish/subscribe: HSF service status change notifications, distributed event center
  • High-concurrency buffering: log services, monitoring reports

The core reasons for using a message queue are: decoupling, asynchrony, and peak shaving.

Disadvantages:

  • Reduced system availability: the more external dependencies a system introduces, the more ways it can fail. How do you keep the message queue itself highly available?
  • Increased system complexity: how do you guarantee messages are not consumed twice? How do you handle message loss? How do you preserve message ordering?
  • Consistency problems: system A finishes processing and returns success, so everyone assumes the request succeeded; but if A actually fans out to systems B, C, and D, and B and D write to their databases successfully while C's write fails, your data is now inconsistent.

The discussion below focuses on two message queues: RabbitMQ and Kafka.

How to ensure the high availability of the message queue?

High availability of RabbitMQ

RabbitMQ's high availability is based on master-slave replication (it is not distributed). RabbitMQ has three modes: stand-alone mode (demo grade), normal cluster mode (no high availability), and mirrored cluster mode (high availability).

  • Normal cluster mode

    Normal cluster mode means starting one RabbitMQ instance on each of several machines. A queue you create lives on only one instance, but every instance synchronizes the queue's metadata (metadata can be thought of as the queue's configuration; with it, any instance can locate the instance that owns the queue). When you consume, if you happen to connect to a different instance, that instance pulls the data from the instance that owns the queue.

    This approach is awkward and not very good. It achieves nothing "distributed"; it is just an ordinary cluster. Either consumers randomly connect to some instance and pull data across the network, or they always connect to the instance that owns the queue. The former adds data-pulling overhead, the latter creates a single-instance performance bottleneck.

    And if the instance that owns the queue goes down, the other instances can no longer pull from it. If you have enabled message persistence so RabbitMQ stores messages on disk, the messages are not necessarily lost, but you must wait for that instance to recover before you can continue consuming from the queue.

    So this is rather embarrassing: there is no high availability to speak of. This scheme mainly improves throughput, by letting multiple nodes in the cluster share the read and write load of the queues.


  • Mirrored cluster mode

This mode is RabbitMQ's genuine high-availability mode. Unlike normal cluster mode, in mirrored cluster mode both the metadata and the messages of every queue you create exist on multiple instances: each RabbitMQ node holds a complete mirror of the queue, meaning all of its data. Every time you write a message to the queue, it is automatically replicated to the queue's mirrors on the other instances.

So how do you enable mirrored cluster mode? It is actually very simple: RabbitMQ has an excellent management console, and you just add a policy in it. The policy specifies mirroring: you can require data to be synchronized to all nodes, or only to a specified number of nodes. When you then create a queue, apply the policy, and data is automatically synchronized to the other nodes.
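As a concrete illustration, the same policy can also be set from the command line. A minimal sketch using the rabbitmqctl CLI (the policy names ha-all / ha-two and the match-everything pattern are just examples; check the docs for your RabbitMQ version, since newer releases replace classic queue mirroring with quorum queues):

```shell
# Mirror every queue (pattern "^" matches all names) to all nodes
rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'

# Or mirror each queue to exactly 2 nodes, and synchronize new
# mirrors automatically when they join
rabbitmqctl set_policy ha-two "^" \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}'
```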

The advantage of this setup is that if any one of your machines goes down, it's fine: the other nodes still hold the queue's complete data, and consumers can simply consume from another node. The disadvantages: first, the performance overhead is high, since every message must be synchronized to all machines, putting heavy pressure on network bandwidth; second, it is still not distributed and has no scalability. If a queue is heavily loaded and you add a machine, the new machine just holds another full copy of the queue's data; there is no way to scale the queue linearly. Think about it: what do you do when the queue holds so much data that a single machine can no longer accommodate it?


Kafka's high availability

The most basic fact about Kafka's architecture: it consists of multiple brokers, each broker being one node. When you create a topic, the topic can be divided into multiple partitions; each partition can live on a different broker, and each partition holds part of the data.

This is a naturally distributed message queue: a topic's data is scattered across multiple machines, each machine holding part of it.

RabbitMQ, in fact, is not a distributed message queue. It is a traditional message queue that merely provides some clustering and HA (High Availability) mechanisms, because no matter how you configure it, a RabbitMQ queue's data lives on a single node; in mirrored cluster mode, every node simply stores the queue's complete data.

Before Kafka 0.8 there was no HA mechanism at all: if any broker went down, the partitions on that broker were dead, unable to be written to or read from. There was no high availability.

For example, suppose you create a topic with 3 partitions, one on each of three machines. If the second machine goes down, 1/3 of the topic's data becomes unavailable (and is possibly lost). That is not high availability.


After Kafka 0.8, an HA mechanism was provided: the replica mechanism. Each partition's data is synchronized to other machines, forming multiple replicas. All replicas elect a leader; production and consumption deal with the leader, and the other replicas are followers. On a write, the leader is responsible for synchronizing the data to all followers; on a read, you simply read from the leader. Why read and write only the leader? It's simple: if you could freely read and write any follower, you would have to worry about data consistency, and the system complexity would be too high and error-prone. Kafka distributes all replicas of a partition evenly across different machines to improve fault tolerance.


This gives you so-called high availability: if a broker goes down, it's fine, because its partitions have replicas on other machines. If the downed broker held the leader of some partition, a new leader is re-elected from the followers, and everyone continues reading and writing the new leader. That is high availability.

When writing data, the producer writes to the leader, the leader writes the data to its local disk, and the followers actively pull data from the leader. Once a follower has caught up, it sends an ack to the leader. After the leader has received acks from all the followers, it returns a write-success response to the producer. (Of course, this is only one of the modes; the behavior can be tuned.)

When consuming, data is read only from the leader, and a message becomes visible to consumers only after it has been synchronized by all the followers, which have acked it back to the leader.

How to ensure that messages are not repeatedly consumed (idempotence)?

First of all, every message queue has this duplicate-consumption problem, because MQ does not guarantee this for you; it is up to our own code. Let's use Kafka to discuss how to achieve it.

Kafka has the concept of an offset: every message written in gets an offset representing its sequence number. After a consumer consumes data, by default it periodically commits the offsets of the messages it has consumed, as if to say: "I've already consumed up to here; if I restart or something, let me continue from the last committed offset."

But things don't always go smoothly. Something we often hit in production: you sometimes restart the system, and it depends on how you restart it. If you're in a hurry and just kill the process, the consumer may have processed some messages without having had time to commit their offsets. That's awkward: after the restart, those few messages get consumed again.
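The scenario above can be sketched in a few lines. This is a toy simulation, not a real Kafka client: a consumer that crashes after processing but before committing its offset will re-read those messages after restart, which is exactly the duplicate-consumption case described.

```python
# Toy simulation of periodic offset commits and a crash before commit.
log = ["m0", "m1", "m2", "m3", "m4"]   # the partition's message log
committed_offset = 0                    # last committed position

def run_consumer(crash_after, commit_every):
    """Consume from the committed offset; maybe crash before committing."""
    global committed_offset
    processed = []
    for i, msg in enumerate(log[committed_offset:], start=committed_offset):
        processed.append(msg)
        if (i + 1) % commit_every == 0:
            committed_offset = i + 1    # periodic "auto" commit
        if len(processed) == crash_after:
            return processed            # simulated crash: no final commit
    committed_offset = len(log)
    return processed

first = run_consumer(crash_after=3, commit_every=2)   # dies after m2
second = run_consumer(crash_after=99, commit_every=2) # restart
print(first)   # ['m0', 'm1', 'm2']
print(second)  # ['m2', 'm3', 'm4']  -> m2 is consumed twice
```

m2 was processed in the first run but its offset was never committed, so the restarted consumer sees it again: at-least-once delivery in action.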

In fact, duplicate consumption itself isn't scary. What's scary is failing to consider how to keep consumption idempotent when duplicates occur.

An example. Suppose a system inserts one row into the database per message consumed. If one message is delivered twice, you insert two rows, and your data is now wrong. But if, on the second delivery, you check whether you've already consumed this message and simply discard it when you have, then only one row is kept and the data stays correct. When the same message arrives twice but only one row ends up in the database, the system is idempotent. Idempotence, in plain terms: no matter how many times the same data or request is replayed, you must make sure the resulting state doesn't change or become corrupted.

So the question becomes: how do you guarantee idempotent consumption from the message queue?

There's no universal answer; you have to think about your business. A few ideas:

  • If you're writing a row to the database, look it up by primary key first. If it's already there, don't insert; do an update instead.
  • If you're writing to Redis, no problem: a SET simply overwrites each time, which is naturally idempotent.
  • If you're in neither of the above scenarios, it's a bit more complex. Have the producer attach a globally unique id to each message (similar to an order id). On consumption, first check in Redis (for example) whether this id has already been consumed. If not, process it and then write the id to Redis; if it has, skip it. This ensures the same message is never processed twice.
  • Rely on a database unique key constraint to reject duplicate inserts. Because of the constraint, a duplicate insert just fails with an error and never produces dirty data in the database.
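The third idea above can be sketched as follows. This is a minimal sketch under assumptions: `redis` here is just a Python set standing in for a real Redis instance, and `db` a list standing in for a database table; the message shape is invented for illustration.

```python
# Idempotent consumer sketch: at-least-once delivery, exactly-once effect.
redis = set()   # ids we have already processed (stand-in for Redis)
db = []         # rows inserted (stand-in for a database table)

def handle(message):
    """Apply a message's business write at most once."""
    msg_id = message["id"]
    if msg_id in redis:                # already consumed: drop duplicate
        return False
    db.append(message["payload"])      # the real business write
    redis.add(msg_id)                  # record the id only after success
    return True

msg = {"id": "order-42", "payload": "insert order row"}
handle(msg)
handle(msg)          # duplicate delivery of the same message
print(len(db))       # 1: the duplicate was discarded
```

Note the ordering: the id is recorded only after the write succeeds, so a crash between the two steps causes a retry rather than a lost message, which is the right trade-off for at-least-once systems.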


Of course, exactly how to make MQ consumption idempotent always depends on the specific business.

How to ensure reliable transmission of messages (without loss)?

This one is non-negotiable. The basic principle of using MQ is that there can be neither more data nor less. "No more" is the duplicate-consumption problem above; "no less" means data must not be lost. Messages for things like billing and deductions absolutely must not be lost.

Data loss can happen at the producer, inside MQ, or at the consumer. Let's analyze RabbitMQ and Kafka separately.

How RabbitMQ guarantees the reliability of messages


The producer loses data

When the producer sends data to RabbitMQ, the data can be lost in transit, due to network problems or anything else.

At this point you can use RabbitMQ's transaction feature: the producer opens a RabbitMQ transaction with channel.txSelect() before sending the message. If RabbitMQ fails to receive the message, the producer gets an exception, rolls the transaction back with channel.txRollback(), and retries the send; if the message was received, it commits the transaction with channel.txCommit().

```java
// open a transaction on the channel
channel.txSelect();
try {
    // send the message here
    // ...
    // commit the transaction once the send succeeds
    channel.txCommit();
} catch (Exception e) {
    // something went wrong: roll back ...
    channel.txRollback();
    // ... and resend the message here
}
```

The problem is that once you use the RabbitMQ transaction mechanism (which is synchronous), throughput basically collapses, because it costs too much performance.

So in general, if you want to make sure messages written to RabbitMQ are not lost, you enable confirm mode on the producer instead. With confirm mode on, every message you write is assigned a unique id. If the message makes it into RabbitMQ, RabbitMQ sends you back an ack telling you the message is OK. If RabbitMQ fails to handle the message, it invokes your nack callback to tell you the message failed, and you can retry. You can also use this mechanism to track each message id's status in memory yourself: if you haven't received a callback for a message after some timeout, you resend it.

The biggest difference between the transaction mechanism and the confirm mechanism is that transactions are synchronous (committing a transaction blocks you there), while confirm is asynchronous: after sending one message you can immediately send the next, and when RabbitMQ has received a message it asynchronously invokes your callback to notify you.

So producers generally avoid data loss by using the confirm mechanism.
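The producer-side bookkeeping that confirm mode implies can be sketched like this. No real broker is involved; `publish`, `on_ack`, and `on_nack` are invented names modeling the callbacks, and the message bodies are examples:

```python
# Sketch of confirm-mode bookkeeping: every message gets a unique id,
# stays in an "unacked" map until the broker acks it, and is resent
# (under a new id) when the broker nacks it.
import uuid

unacked = {}              # id -> message body awaiting broker ack
delivered = []            # what the (simulated) broker accepted

def publish(body):
    msg_id = str(uuid.uuid4())
    unacked[msg_id] = body
    return msg_id

def on_ack(msg_id):       # broker confirmed the message
    delivered.append(unacked.pop(msg_id))

def on_nack(msg_id):      # broker rejected it: resend
    body = unacked.pop(msg_id)
    return publish(body)

a = publish("bill user 42")
b = publish("send sms")
on_ack(a)
retry = on_nack(b)        # first attempt failed, message re-queued
on_ack(retry)
print(delivered)          # ['bill user 42', 'send sms']
print(unacked)            # {}: nothing left in flight
```

A real implementation would also run a timer that resends any id still in `unacked` after a timeout, as the text describes.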

RabbitMQ loses data

To prevent RabbitMQ itself from losing data, you must enable RabbitMQ's persistence: messages are persisted to disk after being written, so that even if RabbitMQ itself crashes, it reads the previously stored data back after recovery, and the data is generally not lost. The exception is the extremely rare case where a message has not yet been persisted when RabbitMQ crashes, which can lose a small amount of data, but that probability is small.

Setting up persistence takes two steps:

  • When creating the queue, declare it durable. This makes RabbitMQ persist the queue's metadata, but not the messages in the queue.
  • When sending a message, set its deliveryMode to 2, marking the message itself as persistent. RabbitMQ will then persist the message to disk.

Both must be set together: only then will RabbitMQ, even after a crash, restart, recover the queue from disk, and recover the messages in that queue.

Note that even with persistence enabled, there is still a window where a message has been written to RabbitMQ but not yet persisted to disk. If RabbitMQ happens to crash right then, the small amount of data still only in memory is lost.

So persistence can be combined with the producer-side confirm mechanism: RabbitMQ sends the producer an ack only after the message has been persisted to disk. That way, even if RabbitMQ crashes before persisting and the data is lost, the producer never receives the ack and simply resends the message.

Consumers lose data

The main way to lose data on the consumer side: you have just received the message and haven't finished processing it when the process dies (a restart, say). RabbitMQ thinks you have already consumed the message, and the data is lost. That's embarrassing.

Here you use the ack mechanism RabbitMQ provides. Simply put: turn off RabbitMQ's automatic ack (this can be done through an API), and in your own code, ack explicitly only after you have finished processing each message. Then, if you die before finishing, there is no ack; RabbitMQ sees the message as unacked, reassigns it to another consumer, and the message is not lost.
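The redelivery behavior can be modeled in a few lines. This is a toy model, not real RabbitMQ: the `Broker` class and its method names are invented for illustration.

```python
# Toy model of manual acks: the broker keeps a message "unacked" until
# the consumer acks it; if the consumer dies first, the broker
# redelivers the message to another consumer.
class Broker:
    def __init__(self, messages):
        self.ready = list(messages)   # deliverable messages
        self.unacked = {}             # tag -> message delivered, not acked
        self._tag = 0

    def deliver(self):
        msg = self.ready.pop(0)
        self._tag += 1
        self.unacked[self._tag] = msg
        return self._tag, msg

    def ack(self, tag):
        del self.unacked[tag]         # processing finished, forget it

    def consumer_died(self):
        # everything unacked goes back to the front of the queue
        self.ready = list(self.unacked.values()) + self.ready
        self.unacked.clear()

broker = Broker(["debit account", "send receipt"])
tag, msg = broker.deliver()
broker.consumer_died()            # crashed before acking "debit account"
tag, msg = broker.deliver()       # same message, redelivered
broker.ack(tag)
print(msg)                        # debit account
```

With auto-ack, `consumer_died` would have nothing to put back, which is exactly how the message gets lost.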

How does Kafka ensure message reliability

  • Consumers lose data

    The only situation where the consumer loses data: you read the message, the consumer auto-commits the offset, and Kafka now thinks you have consumed it, when in reality you were only about to start processing. If you die before processing it, the message is lost.

    Sounds like RabbitMQ, doesn't it? Kafka auto-commits offsets by default, so just turn off auto-commit and commit the offset manually after processing; then data won't be lost. Duplicate consumption is still possible, though: if you have just finished processing but haven't committed the offset before dying, you will certainly consume that message again after restarting. Just keep your own processing idempotent.

    A problem we hit in production: after our Kafka consumers read data, they first wrote it into an in-memory queue as a buffer. Sometimes a message had only made it into the memory queue when the consumer auto-committed its offset. When we then restarted the system, the not-yet-processed data in the memory queue was lost.

  • Kafka loses data

    The more common scenario here: a Kafka broker goes down and a new leader is elected for its partitions. Think about it: if some followers were not yet fully in sync when the leader died, and one of those followers is elected as the new leader, the unsynchronized data is simply missing.

    We hit this in production too: a Kafka leader machine died, and after a follower was switched to leader we found some data had been lost.

    So in this situation you generally need to set at least the following 4 parameters:

    • On the topic, set replication.factor: this must be greater than 1, requiring each partition to have at least 2 replicas.
    • On the Kafka broker, set min.insync.replicas: this must be greater than 1, requiring a leader to perceive at least one follower still in contact and keeping up with it, so there is someone to take over when the leader dies.
    • On the producer, set acks=all: this requires each write to land on all in-sync replicas before it counts as successful.
    • On the producer, set retries=MAX (a very, very large value, meaning unlimited retries): this requires the producer to retry indefinitely once a write fails, sticking there rather than dropping the message.

    Our production environment is configured exactly this way. With this configuration, at least on the Kafka broker side, it is guaranteed that when the broker hosting a leader fails and a leader switch occurs, data will not be lost.

  • Producer loses data

    If you configure acks=all as above, data will not be lost. The requirement is that the leader has received the message and all the in-sync followers have synchronized it before the write counts as successful. If that condition is not met, the producer automatically retries, indefinitely.
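The four settings above can be sketched as configuration. The parameter names follow Kafka's documented settings; the replica counts and the use of plain Python dicts (rather than a specific client library) are illustrative:

```python
# The four reliability settings from the text, as config fragments.
topic_config = {
    "replication.factor": 3,       # set at topic creation: must be > 1
}
broker_config = {
    "min.insync.replicas": 2,      # broker/topic level: must be > 1
}
producer_config = {
    "acks": "all",                 # write must reach all in-sync replicas
    "retries": 2147483647,         # effectively "retry forever"
}
```

Together these mean: every partition has spare copies, the leader refuses writes unless at least one in-sync follower is alive, and the producer neither declares success early nor gives up.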

How to ensure the order of messages?

An example. We once built a MySQL binlog synchronization system, and the pressure was considerable: hundreds of millions of rows synchronized daily, meaning data is copied intact from one MySQL database to another (mysql -> mysql). A common case is a big data team that needs to sync a MySQL database of the company's business data in order to run various complex operations on it.

When you insert, update, and then delete a row in MySQL, that produces 3 corresponding binlog entries. You send those 3 entries into MQ and then consume them; at the very least you have to consume them in order, right? Otherwise what was originally insert, update, delete gets executed as delete, update, insert, and everything is wrong.

The row was supposed to end up deleted after synchronization; because the order was scrambled, it ended up retained instead, and the synchronized data is wrong.

Let's look at two scenarios where ordering gets scrambled:

  • RabbitMQ: one queue, multiple consumers. The producer sends three pieces of data, in order data1/data2/data3, into one RabbitMQ queue. Three consumers each take one of the three from MQ. Consumer 2 happens to finish first and writes data2 to the database, then data1 and data3 follow. The order is obviously scrambled.


  • Kafka: say we create a topic with three partitions. The producer can specify a key when writing; if we use, say, an order id as the key, then all data for that order goes to the same partition, and data within a partition is strictly ordered. The consumer also reads a partition in order, so up to this point the ordering is still fine. But then we may use multiple threads inside the consumer to process messages concurrently, because a single-threaded consumer whose processing is relatively slow, say tens of ms per message, can only handle tens of messages per second, which is far too little throughput. And once multiple threads run concurrently, the order can be scrambled.


RabbitMQ solution

Split into multiple queues, one consumer per queue; that just means more queues, which is admittedly troublesome. Or keep one queue with one consumer, and have that consumer fan messages out into internal in-memory queues, with different workers at the bottom each processing one memory queue.


Kafka solution

  • One topic, one partition, one consumer, single-threaded consumption internally. Single-threaded throughput is too low; this is generally not used.
  • Create N in-memory queues, routing all data with the same key to the same memory queue; then start N threads, each consuming one memory queue. This preserves per-key ordering.
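The second option above can be sketched as follows. This is a minimal single-threaded simulation: plain lists stand in for thread-safe queues, and the order ids are examples; a real consumer would run one worker thread per queue.

```python
# Per-key memory-queue routing: messages with the same key always land
# in the same queue, so one worker per queue preserves per-key order
# while different keys are processed in parallel.
N = 4
queues = [[] for _ in range(N)]   # stand-ins for N thread-safe queues

def route(key, message):
    idx = hash(key) % N           # same key -> same queue, always
    queues[idx].append(message)

for i in range(3):                # interleaved binlog events, two orders
    route("order-1", f"order-1 op{i}")
    route("order-2", f"order-2 op{i}")

# all of order-1's events sit in one queue, still in op0/op1/op2 order
for q in queues:
    ops = [m.split()[1] for m in q if m.startswith("order-1")]
    assert ops == sorted(ops)
```

Note that global order across different keys is given up; only per-key order (which is what the binlog scenario needs) is preserved.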


How to deal with message backlog?

A large backlog of messages has sat in MQ for hours and is still unresolved

Suppose one consumer handles 1,000 messages per second; three consumers handle 3,000 per second, which is 180,000 per minute. So with a backlog of millions to tens of millions of messages, even after the consumers recover it takes on the order of an hour or more just to drain.
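The arithmetic behind that estimate (using the numbers from the text; the ten-million backlog is an example):

```python
# Back-of-the-envelope drain time for a message backlog.
per_consumer_rate = 1000                  # messages/second per consumer
consumers = 3
rate = per_consumer_rate * consumers      # 3,000 msg/s total
print(rate * 60)                          # 180000 messages per minute

backlog = 10_000_000                      # a ten-million message backlog
hours = backlog / rate / 3600
print(round(hours, 2))                    # 0.93 hours just to drain it
print(round(hours / 10, 2))               # 0.09 hours with a 10x scale-out
```

This is also why the emergency plan below scales queues and consumers by 10x: drain time falls roughly linearly with consumer throughput.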

At that point the only option is usually an emergency temporary scale-out. The concrete steps and reasoning:

  • First fix the consumer's problem so that its consumption speed is restored, then stop all the existing consumers.
  • Create a new topic with 10x the original number of partitions, and temporarily create 10x the original number of queues.
  • Write a temporary distributor program that consumes the backlogged data, does no time-consuming processing, and directly round-robins the messages evenly into the 10x temporary queues.
  • Temporarily requisition 10x the machines to deploy consumers, each batch of consumers draining one temporary queue. This temporarily scales queue and consumer resources 10x, consuming the data at 10x normal speed.
  • Once the backlog has been quickly consumed, restore the original architecture and go back to consuming new messages with the original consumer machines.

Messages in MQ have expired

Suppose you are using RabbitMQ. RabbitMQ lets you set an expiration time, a TTL. If messages sit backlogged in the queue longer than that, RabbitMQ clears them out and the data is gone. That's the second pit: the problem is no longer a large backlog sitting in MQ, it's that a large amount of data has been directly lost.

In this case the fix is not to consume the backlog faster, because there is no backlog anymore; a pile of messages is simply gone. The plan we can take is batch re-importing, which we have done before in a similar situation online. When the backlog was huge, we just let the data be dropped at the time; then, after the peak, say after midnight when the users are all asleep (everyone grab a coffee and stay up), we wrote a temporary program to find the lost data piece by piece and refill it into MQ, making up for what was lost during the day. It can only be done that way.

Suppose 10,000 orders sit in MQ unprocessed and 1,000 of them have been lost. You can only write a program to find those 1,000 orders manually and send them back into MQ to be made up.

MQ is almost full

What if the backlog sits in MQ unhandled for so long that MQ is almost full? Is there yet another plan? No. Who told you to execute plan one too slowly? Write a temporary program that consumes the data, consuming one and discarding one, skipping the processing entirely, to drain the messages as fast as possible. Then fall back to plan two and refill the data at night.

Reference materials:

  • Kafka in-depth analysis
  • RabbitMQ source code analysis

Okay, that's it for today. I'm Glacier. If you have any questions, leave a message below, or add me on WeChat: sun_shine_lyz. I'll pull you into the group, and we can share technology, advance, and be awesome together~~


Origin blog.csdn.net/l1028386804/article/details/113706562