Message queues from the interviewer's perspective: a detailed walkthrough (a long read, definitely worth it)


Foreword

Message queues come up constantly in both interviews and day-to-day development. This article simulates the questions an interviewer might ask, covers message queues in as much detail as possible, and uses diagrams to make the material easier to digest.

Note: parts of this article are adapted from interview guides and other materials, and are provided for reference only!


1. What is a message queue?

Before diving in, let's be clear about what a message queue actually is.

A message queue (MQ) is a way for applications to communicate with each other. A sender can return immediately after publishing a message, and the messaging system guarantees that the message is delivered reliably. The publisher just publishes to the MQ without caring who will receive it, and the consumer just takes messages from the MQ without caring who published them. Neither side needs to know the other exists.

Message queue middleware is an important component of distributed systems. It mainly solves problems such as application decoupling, asynchronous processing, and traffic peak shaving, enabling architectures that are high-performance, highly available, scalable, and eventually consistent. Commonly used message queues include ActiveMQ, RabbitMQ, ZeroMQ, Kafka, MetaMQ, and RocketMQ.

2. Why use a message queue?

Message queues bring three common benefits: decoupling, asynchrony, and peak shaving.

1. Decoupling

Decoupling is a basic requirement in software development, and message queues are very good at decoupling system components.

Consider a scenario first:
System A sends data to systems B, C, and D through direct interface calls. What happens when system E also wants this data? And what if system C no longer needs it?
System A ends up tightly coupled to a tangle of other systems. It produces a piece of critical data that many systems need, so it must constantly keep B, C, D, and E in mind: what if one of them fails? Should it resend? Should it buffer the message?

As this scenario shows, once anything changes, this point-to-point wiring becomes very painful: the coupling between systems is strong, and code in multiple systems has to change.

Clearly, this is not ideal.

A message queue solves this problem nicely.

Concretely: with MQ, system A produces a piece of data and sends it to the MQ; any system that needs the data consumes it from the MQ itself. If a new system needs the data, it simply starts consuming from the MQ; if a system no longer needs it, it just cancels its subscription. System A no longer has to think about whom to send data to, maintain that plumbing code, or worry about whether a call succeeded or timed out.
Through a publish/subscribe (Pub/Sub) model on top of MQ, system A is completely decoupled from the other systems.
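The decoupling idea can be sketched with a toy in-memory broker (all names here are illustrative, not any real MQ API): the publisher calls only `publish`, and downstream systems register themselves without the publisher ever changing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// A toy in-memory broker illustrating decoupling:
// system A only publishes to the broker; B/D/E subscribe on their own.
public class ToyBroker {
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    // A downstream system registers itself; the publisher never knows about it.
    public void subscribe(Consumer<String> handler) {
        subscribers.add(handler);
    }

    // System A calls only this method, regardless of who is listening.
    public void publish(String message) {
        for (Consumer<String> s : subscribers) {
            s.accept(message);
        }
    }

    public static void main(String[] args) {
        ToyBroker mq = new ToyBroker();
        List<String> receivedByB = new ArrayList<>();
        List<String> receivedByE = new ArrayList<>();
        mq.subscribe(receivedByB::add); // system B consumes
        mq.subscribe(receivedByE::add); // new system E just subscribes; A is unchanged
        mq.publish("order-created");
        System.out.println(receivedByB); // [order-created]
        System.out.println(receivedByE); // [order-created]
    }
}
```

Adding system E is one `subscribe` call; removing system C is removing one; system A's code never changes.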

2. Asynchrony

Consider another scenario: when system A receives a request, it must write to its own database and also trigger writes in systems B, C, and D. The local write takes 3 ms, while the writes in B, C, and D take 300 ms, 450 ms, and 200 ms respectively. The total request latency is 3 + 300 + 450 + 200 = 953 ms, nearly a full second. To a user clicking a button in the browser, waiting almost 1 s feels painfully slow and is close to unacceptable.

A message queue can solve this problem too.

Concretely: with MQ, system A sends three messages into the MQ queue, which takes, say, 5 ms. The total time for system A to receive the request and return a response to the user is then 3 + 5 = 8 ms. From the user's point of view, they click a button and get a response after just 8 ms. Fast!
Going from 953 ms to 8 ms is a performance improvement of more than 100x. This is the benefit the MQ mechanism brings, and it is also why, when we upload a file or publish a post, our own screen shows success immediately while other users can't see it for a short while and only see it after refreshing a bit later.

This is because data synchronization takes time. The "success" we see locally really means the data has been written to the MQ successfully; it still takes some time (nowadays a very short delay) for the MQ's contents to be written into the database. Other users can't see the content until the data in the MQ has been persisted to the database.
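The hand-off above can be sketched as follows (system names and the background executor standing in for MQ are illustrative): system A answers the user as soon as the slow downstream writes have been handed off, and those writes complete later, invisible to the user.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: system A responds after only the fast local work, while the
// slow B/C/D writes are handed off "to MQ" (here: a background executor).
public class AsyncHandoff {

    // Returns as soon as the hand-off is done; the latch lets callers
    // observe when the downstream writes finish.
    static String handleRequest(CountDownLatch done) {
        ExecutorService background = Executors.newFixedThreadPool(3);
        background.submit(() -> done.countDown()); // "write to system B" (~300ms in the text)
        background.submit(() -> done.countDown()); // "write to system C" (~450ms)
        background.submit(() -> done.countDown()); // "write to system D" (~200ms)
        background.shutdown(); // already-submitted tasks still run
        return "ok"; // the user sees this after ~8ms, not ~953ms
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(3);
        System.out.println("responded: " + handleRequest(done));
        done.await(); // downstream writes finish later, after the response
        System.out.println("downstream writes complete");
    }
}
```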

3. Peak shaving

One more scenario: from 00:00 to 12:00 every day, system A is quiet, with only about 50 concurrent requests per second. But every day from 12:00 to 13:00, concurrency suddenly spikes to 5k+ requests per second. The system sits directly on top of MySQL, so the flood of requests pours straight into MySQL, with roughly 5k SQL statements executed per second.

A typical MySQL instance can handle roughly 2k requests per second. At 5k requests per second, MySQL may simply fall over, crashing the system and locking users out (this always reminds me of Weibo: whenever there's big news, the trending-search page goes down, hhhh).

Once the peak passes, the afternoon becomes an off-peak period: perhaps 10,000 users are on the site at once and requests drop back to around 50 per second, which puts almost no pressure on the system.

A message queue can solve this problem too, to a certain extent.

Why only "to a certain extent"?

Because a message queue doesn't directly increase the system's processing capacity. Instead, it limits how many requests enter the system per unit of time so the system doesn't collapse; the backlog of demand either waits inside the MQ or gets rejected with an error.

Concretely: with MQ, 5k requests per second are written into the MQ, while system A processes at most 2k requests per second, because MySQL can handle at most 2k per second.

System A pulls requests from the MQ at its own pace, 2k per second, never exceeding what it can handle. This way, even during peak hours, system A will never go down.

Of course, while 5k requests per second flow into the MQ and only 2k flow out, the one-hour lunch peak can leave millions of requests backlogged in the MQ.
A backlog during a short peak is fine: once the peak ends, only 50 requests per second enter the MQ while system A keeps draining it at 2k per second, so the backlog clears quickly. (Some requests are also time-sensitive and may no longer need to be served after a while, like most of the rubbernecking crowd, hhhhhh.)
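The numbers above can be sanity-checked with simple arithmetic (using the article's example rates of 5k in / 2k out during a one-hour peak, then 50 in / 2k out afterward):

```java
// Back-of-the-envelope check of the peak-shaving numbers in the text.
public class PeakShaving {

    // During the peak, the backlog grows at (in - out) requests per second.
    static long backlogAfterPeak(long inPerSec, long outPerSec, long peakSeconds) {
        return (inPerSec - outPerSec) * peakSeconds;
    }

    // After the peak, the backlog shrinks at (out - in) requests per second.
    static long drainSeconds(long backlog, long inPerSec, long outPerSec) {
        return backlog / (outPerSec - inPerSec);
    }

    public static void main(String[] args) {
        long backlog = backlogAfterPeak(5000, 2000, 3600);
        System.out.println("backlog after 1h peak: " + backlog); // 10,800,000 requests
        // ~5538s, i.e. roughly an hour and a half to drain
        System.out.println("drain time (s): " + drainSeconds(backlog, 50, 2000));
    }
}
```

So a one-hour peak at these rates leaves about 10.8 million requests queued, which system A works off in roughly an hour and a half.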

3. What are the disadvantages of message queues?

By now you might feel that message queues are all upside: powerful, a big performance boost, and a way to keep the system alive under heavy load.

So what are the downsides?

1. Reduced system availability

The more external dependencies a system has, the more ways it can fail. Originally, system A just called the interfaces of B, C, and D, and as long as those four systems were up, everything was fine. Now you've dropped an MQ into the middle: what happens when the MQ goes down? Once the MQ is down, the whole system goes down with it.

2. Increased system complexity

Once you add an MQ, how do you guarantee messages aren't consumed twice? How do you handle lost messages? How do you preserve message ordering? Each of these spawns a series of further problems.

3. Consistency issues

System A returns success as soon as it has handed the request to the MQ, assuming the job is done. But suppose B and D write their databases successfully while C's write fails: now the data is inconsistent. Data inconsistency is a serious problem and needs careful handling!

So a message queue is actually a fairly complex piece of architecture. It brings many benefits, but you also have to build various additional technical safeguards and designs to offset the drawbacks it introduces. Once you've done all that, you may find the system's complexity has grown by an order of magnitude, perhaps tenfold. Still, at critical moments it is necessary, especially in large systems.

4. How to ensure the high availability of the message queue?

1. High availability of RabbitMQ

RabbitMQ is quite representative here because its high availability is based on master-slave replication (it is not a distributed queue), so we'll use RabbitMQ as the first example of how an MQ can achieve high availability.

RabbitMQ has three modes: stand-alone mode, common cluster mode, and mirror cluster mode.

1) Stand-alone mode

Stand-alone mode is demo-level: you start a single instance locally to play with. Nobody uses stand-alone mode in production.

2) Ordinary cluster mode: no high availability

Ordinary cluster mode means starting one RabbitMQ instance on each of several machines. A queue you create lives on only one of those instances, but every instance synchronizes the queue's metadata (metadata here means the queue's configuration, which lets any instance locate the one that actually holds the queue). When you consume, if you happen to connect to a different instance, that instance pulls the data from the instance that owns the queue.
This approach is clumsy and not very good; it achieves no real distribution, just an ordinary cluster. Consumers either connect to a random instance each time and incur the overhead of data forwarding, or pin themselves to the instance that owns the queue and turn it into a single-instance bottleneck.

Worse, if the instance holding the queue goes down, the other instances have nothing to pull from. If you enabled message persistence so RabbitMQ writes messages to disk, the messages may not be lost, but you still have to wait for that instance to recover before anyone can consume from the queue again.

So this mode is rather awkward: there is no real high availability. Its main purpose is throughput, letting multiple nodes in the cluster serve reads and writes for a given queue.

3) Mirror cluster mode: high availability

This is RabbitMQ's so-called high-availability mode. Unlike ordinary cluster mode, in mirror cluster mode the queue you create, both its metadata and its messages, exists on multiple instances.

In other words, every RabbitMQ node keeps a complete mirror of the queue, meaning all of its data. Every time you write a message to the queue, it is automatically replicated to the queue's copies on the other instances.
The advantage is that if any machine goes down, it's fine: the other nodes hold a complete copy of the queue, and consumers can simply consume from them.
The downsides: first, the performance overhead is high, since every message must be replicated to all machines, which puts heavy pressure on, and consumes a lot of, network bandwidth!

Second, this setup is not distributed, so there is no scalability at all. If a queue is under heavy load and you add machines, each new machine also holds a full copy of the queue; there is no way to scale the queue out linearly. Think about it: what do you do when the queue's data grows so large that no single machine can hold it?

2. High availability of Kafka

A basic architectural picture of Kafka: it consists of multiple brokers, each broker being a node. When you create a topic, it can be divided into multiple partitions; each partition can live on a different broker, and each partition holds a portion of the data.

This is a natively distributed message queue: a topic's data is spread across multiple machines, with each machine holding part of it.

RabbitMQ and its kind, by contrast, are not distributed message queues; they are traditional message queues that merely provide some clustering and HA (high availability) mechanisms. However you arrange it, a RabbitMQ queue's data lives on one node, and under mirror clustering every node holds the queue's complete data.

Before Kafka 0.8, there was no HA mechanism at all: if any broker went down, the partitions on that broker were simply gone, unreadable and unwritable, so there was no high availability to speak of.
For example, suppose we create a topic with 3 partitions, one on each of three machines. If the second machine dies, a third of the topic's data is unavailable. That's clearly not high availability.

From Kafka 0.8 onward, an HA mechanism is provided: the replica mechanism. Each partition's data is replicated to other machines, forming multiple replicas. The replicas elect a leader; producers and consumers talk only to the leader, while the other replicas act as followers. On a write, the leader is responsible for replicating the data to all followers; on a read, you simply read from the leader.

Why restrict reads and writes to the leader?

Simple: if you could read and write any follower at will, you would have to solve data-consistency problems across replicas, pushing system complexity way up and inviting bugs. Kafka also distributes a partition's replicas evenly across different machines to improve fault tolerance.
This gives you genuine high availability: if a broker goes down, its partitions have replicas on other machines. If the dead broker hosted a partition's leader, a new leader is elected from the followers, and everyone continues reading and writing through the new leader. That's high availability.

On a write, the producer sends to the leader, the leader writes the data to its local disk, and the followers actively pull the data from the leader. Once a follower has caught up, it sends an ack to the leader; after the leader has received acks from all followers, it returns a write-success response to the producer. (This is just one of the available modes, and the behavior can be tuned.)

On a read, consumers only read from the leader, and a message only becomes readable once all followers have synchronized it and acked.

5. How to ensure that messages are not consumed repeatedly (idempotency of message consumption)?

1. What is the repeated-consumption problem?

Let's introduce the problem with a scenario.

Kafka has the concept of an offset: every message written gets an offset representing its sequence number. After a consumer consumes data, it periodically commits the offsets of the messages it has consumed, effectively saying: "I've consumed up to here; if I restart, let me continue from the offset I last committed."

But accidents happen. A common one in production: you restart the system, and if you're in a hurry you just kill the process and restart it. The consumer has processed some messages but hasn't had time to commit their offsets. Awkward. After the restart, a small number of messages get consumed again.

Concretely: messages 1, 2, and 3 enter Kafka in turn, and Kafka assigns each an offset, say 152, 153, and 154 in sequence. The consumer consumes them in that order. Suppose the consumer has processed the message at offset 153 and is just about to commit the offset (to ZooKeeper, in older versions) when the consumer process is restarted. The offsets of messages 1 and 2, already consumed, were never committed, so Kafka doesn't know that data up to offset 153 has been consumed. After the restart, the consumer says to Kafka: "Hey buddy, hand me everything after where I left off." Since the previous offset commit never succeeded, messages 1 and 2 are delivered again, and if the consumer doesn't deduplicate, they are consumed twice.
If what the consumer does is insert a row into a database, messages 1 and 2 would each be inserted twice, and the data is now wrong.
Repeated consumption itself isn't the scary part; the scary part is failing to think about how to stay idempotent under repeated consumption.

An example: suppose a system inserts one database row per message consumed. If a message is delivered twice, you insert two rows and the data is wrong. But if on the second delivery you first check whether you've already consumed it and, if so, simply throw it away, only one row is kept and the data stays correct. A message may arrive twice, yet the database ends up with a single row; that is the system being idempotent.

Idempotency, in plain terms: for a given piece of data or a given request, no matter how many times it is repeated, the resulting state must stay the same and must not be corrupted.

2. How to solve the problem of repeated consumption of messages

Here are some approaches:
(1) If you're writing to a database, look the row up by primary key first; if it already exists, don't insert, just update.

(2) If you're writing to Redis, there's no problem at all: every write is a SET, which is naturally idempotent.

(3) If neither of the above applies, it gets a bit more involved: have the producer attach a globally unique id to every message, something like an order id. When a message arrives, first check in Redis whether that id has been consumed. If not, process it and then write the id to Redis; if it has, skip it, ensuring the same message is never processed twice.

(4) Or rely on a database unique-key constraint to prevent duplicate inserts: inserting a duplicate merely raises an error and never leaves dirty data in the database.
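Idea (3) can be sketched as follows; a `HashSet` stands in for Redis here (in production you would use something like Redis `SETNX`), and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of idea (3): deduplicate on a globally unique message id
// before processing, so redelivered messages change nothing.
public class IdempotentConsumer {
    private final Set<String> seenIds = new HashSet<>();     // stands in for Redis
    private final List<String> database = new ArrayList<>(); // stands in for the DB

    // Returns true if the message was actually processed, false if deduplicated.
    public boolean consume(String messageId, String payload) {
        if (!seenIds.add(messageId)) {
            return false; // already consumed: drop the duplicate
        }
        database.add(payload);
        return true;
    }

    public int rowCount() { return database.size(); }

    public static void main(String[] args) {
        IdempotentConsumer c = new IdempotentConsumer();
        c.consume("order-1", "insert order 1");
        c.consume("order-1", "insert order 1"); // redelivered after a restart
        System.out.println("rows: " + c.rowCount()); // still 1
    }
}
```

The message arrived twice, but only one row exists: repeated delivery, idempotent result.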


6. How to ensure the reliable transmission of messages (the problem of message loss)?

A basic principle of using MQ: there must be no extra data and no missing data. "No extra" is the repeated-consumption and idempotency problem discussed above. "No missing" means this data must not get lost.

1. RabbitMQ

If you use MQ to carry truly critical messages, such as billing or payment-deduction messages, you must guarantee those messages are never lost while passing through the MQ.

1) The message is lost on the way in

When a producer sends data to RabbitMQ, it can be lost in transit due to network problems and the like.

One option is RabbitMQ's transaction support: the producer opens a transaction with channel.txSelect() before sending. If RabbitMQ fails to receive the message, the producer gets an exception, rolls back with channel.txRollback(), and retries the send; if the message was received, the producer commits with channel.txCommit().

// Open a transaction on the channel
channel.txSelect();
try {
    // Publish the message here, e.g. channel.basicPublish(...)
    // Commit the transaction once the publish succeeds
    channel.txCommit();
} catch (Exception e) {
    // Roll back the transaction
    channel.txRollback();
    // ... and re-send the message here
}

The problem is that the RabbitMQ transaction mechanism is synchronous and blocking, which costs so much performance that throughput basically collapses.

So in general, to make sure messages written to RabbitMQ aren't lost, you enable confirm mode instead. With confirm mode enabled on the producer, every message you write is assigned a unique id. If the write reaches RabbitMQ, RabbitMQ returns you an ack telling you the message made it. If RabbitMQ fails to handle the message, it invokes your nack callback telling you the write failed so you can retry. You can combine this with keeping each message id's state in memory: if you haven't received a confirm for a message after some time, resend it.

The biggest difference between the transaction mechanism and confirm mode is that transactions are synchronous: after you commit, you block and wait. Confirm mode is asynchronous: after sending one message you can immediately send the next, and once RabbitMQ has received a message it asynchronously invokes your callback to notify you.

So producers generally use the confirm mechanism to avoid losing data.
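The in-memory bookkeeping described above can be sketched like this. Only the tracking state is shown; the actual broker callbacks (RabbitMQ's confirm listener) would feed `onAck`, and a periodic job would call `dueForResend`. All names and the timeout are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of confirm-mode bookkeeping: keep every unconfirmed message id
// (sequence number) in memory with its send time, drop it on ack, and treat
// anything still present after a timeout as a candidate for resending.
public class ConfirmTracker {
    // seqNo -> send timestamp (millis)
    private final ConcurrentSkipListMap<Long, Long> unconfirmed = new ConcurrentSkipListMap<>();

    public void onSend(long seqNo, long nowMillis) {
        unconfirmed.put(seqNo, nowMillis);
    }

    // Broker ack: with multiple=true, everything up to seqNo is confirmed at once.
    public void onAck(long seqNo, boolean multiple) {
        if (multiple) unconfirmed.headMap(seqNo, true).clear();
        else unconfirmed.remove(seqNo);
    }

    // Anything unconfirmed for longer than timeoutMillis should be resent.
    public List<Long> dueForResend(long nowMillis, long timeoutMillis) {
        List<Long> due = new ArrayList<>();
        for (Map.Entry<Long, Long> e : unconfirmed.entrySet()) {
            if (nowMillis - e.getValue() >= timeoutMillis) due.add(e.getKey());
        }
        return due;
    }

    public static void main(String[] args) {
        ConfirmTracker t = new ConfirmTracker();
        t.onSend(1, 0); t.onSend(2, 0); t.onSend(3, 0);
        t.onAck(2, true); // broker confirms 1 and 2 together
        System.out.println("due for resend: " + t.dueForResend(5000, 1000)); // [3]
    }
}
```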

2) RabbitMQ itself loses data

This is RabbitMQ losing data on its own. You must enable RabbitMQ persistence, so that messages are written to disk after being received; even if RabbitMQ itself crashes, it reads the persisted data back after recovery, and messages are generally not lost. The exception is the rare case where RabbitMQ crashes before a message has been persisted, which can lose a small amount of data, but the probability is small.

Enabling persistence takes two steps:
first, mark the queue as durable when creating it, so RabbitMQ persists the queue's metadata (this alone does not persist the messages in the queue);

second, set the message's deliveryMode to 2 when sending it, marking the message itself as persistent, so RabbitMQ writes it to disk.

Both must be set together: only then will RabbitMQ, after crashing and restarting, recover the queue from disk along with the messages in it.

Note that even if you enable the persistence mechanism for RabbitMQ, there is a possibility that this message is written to RabbitMQ, but it has not had time to persist to the disk. Unfortunately, RabbitMQ hangs up at this time, which will cause memory A little bit of data is lost.

So persistence is best combined with the producer-side confirm mechanism: only after a message has been persisted to disk does RabbitMQ ack it to the producer. If RabbitMQ crashes before persisting and the data is lost, the producer never receives the ack and can resend.

3) The consumer loses data

Consumer-side loss mainly happens when you've just received a message and, before processing it, the process dies (a restart, say). Awkward: RabbitMQ believes you've consumed it, and the message is gone.

The fix is RabbitMQ's ack mechanism. In short, turn off RabbitMQ's automatic ack (via the API), and have your own code explicitly ack only after processing is complete. If you die mid-processing, no ack is sent, so RabbitMQ considers the message unprocessed, redelivers it to another consumer, and nothing is lost.
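The effect of manual ack can be shown with a toy model (not the RabbitMQ API; a crash is simulated by simply not calling `ack`): a message is only removed from the queue once the consumer explicitly acks it, so dying before the ack leads to redelivery rather than loss.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of manual ack: a delivered message stays in the queue
// until it is explicitly acked, giving at-least-once delivery.
public class ManualAckQueue {
    private final Deque<String> queue = new ArrayDeque<>();

    public void publish(String msg) { queue.addLast(msg); }

    // Deliver without removing; the message survives a consumer crash.
    public String deliver() { return queue.peekFirst(); }

    // Only an explicit ack removes the message.
    public void ack(String msg) { queue.remove(msg); }

    public static void main(String[] args) {
        ManualAckQueue q = new ManualAckQueue();
        q.publish("debit-account");
        String m = q.deliver();
        // consumer "crashes" here, before acking: the message is still queued
        System.out.println("redeliverable: " + q.deliver());
        q.ack(m); // on the successful retry
        System.out.println("queue empty: " + (q.deliver() == null));
    }
}
```

With auto-ack, the `deliver` call would remove the message immediately, and the crash would lose it.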

2. Kafka

1) The consumer loses data

The only way a consumer can lose data is this: you receive a message, the consumer auto-commits its offset, so Kafka believes you've consumed it, but you were actually still about to process it, and you die first. That message is then lost.

This is similar to RabbitMQ. Kafka auto-commits offsets by default, so if you disable auto-commit and commit offsets manually after processing, data won't be lost. You may still get repeated consumption, though: for example, you finish processing but die before committing the offset, and on restart you consume that message again. Just ensure idempotency yourself.
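As a sketch, the consumer-side setting looks like this in a Kafka consumer's `Properties` (the broker address and group id are placeholders); the application then calls `commitSync()` only after a record is fully processed:

```java
import java.util.Properties;

// The consumer-side configuration the text describes: disable auto-commit,
// then commit offsets manually (commitSync/commitAsync) after processing.
public class ConsumerConfigSketch {

    public static Properties atLeastOnceConsumer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "demo-group");              // placeholder group
        props.put("enable.auto.commit", "false");         // the key setting
        return props;
    }

    public static void main(String[] args) {
        System.out.println("auto commit: "
                + atLeastOnceConsumer().getProperty("enable.auto.commit"));
    }
}
```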

2) Kafka loses data

A relatively common scenario: a Kafka broker goes down, and the leader of a partition it hosted is re-elected. If some followers weren't fully synchronized at that moment, and one of them is elected the new leader, the unsynchronized data is lost.

So it's generally required to set at least the following four parameters:

Set replication.factor on the topic: this value must be greater than 1, so every partition has at least 2 replicas.

Set min.insync.replicas on the Kafka broker: this value must be greater than 1, requiring the leader to see at least one follower still in contact with it, so that a follower exists to take over if the leader dies.

Set acks=all on the producer: every write must land on all in-sync replicas before it counts as successful.

Set retries=MAX on the producer (a very large value, effectively infinite retries): if a write fails, keep retrying indefinitely, blocking there until it succeeds.

With this configuration, at least on the Kafka broker side, you can guarantee that when the broker hosting a leader fails, no data is lost during leader failover.
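The producer side of the four settings above would look roughly like this in a Kafka producer's `Properties` (the broker address is a placeholder; the topic-level settings, replication.factor and min.insync.replicas, are set when creating the topic or on the broker, not here):

```java
import java.util.Properties;

// Producer-side half of the no-loss configuration described above.
public class ReliabilityConfigSketch {

    public static Properties reliableProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("acks", "all"); // wait for all in-sync replicas before success
        // "retries=MAX": retry effectively forever on transient failures
        props.put("retries", Integer.toString(Integer.MAX_VALUE));
        return props;
    }

    public static void main(String[] args) {
        Properties p = reliableProducer();
        System.out.println("acks=" + p.getProperty("acks"));
        System.out.println("retries=" + p.getProperty("retries"));
    }
}
```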

3) Will the producer lose data?

If you set acks=all as above, no: a write only counts as successful once the leader has received the message and all in-sync followers have replicated it. If that condition isn't met, the producer automatically retries, indefinitely.

7. If you are asked to write a message queue, how should you design the architecture?

For a message-queue system, consider it from the following angles:

First, the MQ must be scalable, meaning you can expand capacity quickly when needed to raise throughput and storage. How? Design a distributed system, borrowing Kafka's design: broker -> topic -> partition, with each partition on its own machine storing a slice of the data. If resources run short, add partitions to the topic, migrate data, and add machines, so the system can hold more data and deliver higher throughput.

Second, should the MQ's data be persisted to disk? It must be: persistence guarantees data survives a process crash. And how should it be written? Sequentially, so there's no seek overhead from random disk I/O; sequential disk reads and writes are very fast. This, too, is Kafka's approach.
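The broker -> topic -> partition idea can be boiled down to a minimal in-memory partition (illustrative only; a real implementation would write an append-only file on disk): strictly append-only writes, mirroring sequential disk writes, with reads by offset.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal in-memory partition modeled on the ideas above:
// append-only writes and consumer reads from a given offset.
public class PartitionLog {
    private final List<String> log = new ArrayList<>();

    // Append returns the offset assigned to the message.
    public long append(String message) {
        log.add(message);
        return log.size() - 1;
    }

    // A consumer reads everything from a given offset onward.
    public List<String> readFrom(long offset) {
        return new ArrayList<>(log.subList((int) offset, log.size()));
    }

    public static void main(String[] args) {
        PartitionLog p = new PartitionLog();
        p.append("m0");
        long off = p.append("m1");
        p.append("m2");
        System.out.println("offset of m1: " + off);            // 1
        System.out.println("from offset 1: " + p.readFrom(1)); // [m1, m2]
    }
}
```

A topic would then just be a set of such partitions spread across brokers, which is exactly what makes adding partitions and machines a capacity-scaling move.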

Next, consider availability. Refer to the Kafka high-availability mechanisms explained earlier: multiple replicas -> leader & follower -> when a broker dies, re-elect a leader and keep serving.

Can it support zero data loss? Yes; refer to the Kafka zero-data-loss configuration discussed earlier.


Source: blog.csdn.net/qq_46119575/article/details/129794304