[Message Queuing] Best practices for RocketMQ cluster deployment, learned from a production failure

I. Introduction

Near the end of the year, a memory failure on a physical machine in the production MQ cluster maintained by the author caused the operating system to restart abnormally. Within 10 minutes, many application clients experienced message-sending timeouts. The incident was rated S1, which became the author's "year-end bonus".

1.1 Fault description

The RocketMQ cluster uses a deployment architecture of 2 masters and 2 slaves, as shown in the figure below:

A notable feature of this deployment architecture is that a NameServer process and a Broker process are co-deployed on the same physical machine.

One of the machines (192.168.3.100) suffered a memory failure that caused it to restart. Because the Linux operating system had to run hardware self-checks on boot, the restart took nearly 10 minutes, and client message-sending timeouts lasted for the full 10 minutes. This is obviously unacceptable!

So what about RocketMQ's high-availability design? The analysis process is described in detail below.

1.2 Failure analysis

When I learned that a single machine failure had caused message-sending timeouts lasting 10 minutes, my first reaction was that this should not happen: a RocketMQ cluster is a distributed deployment with built-in fault detection and recovery, and the sending client should take far less than 10 minutes to perceive a Broker failure. So how did this fault happen?

First, let's review the routing registration and discovery mechanism of RocketMQ.

(1) RocketMQ routing registration and removal mechanism

The routing registration and removal mechanism works as follows (a minimal sketch of the NameServer-side expiry logic follows the list):

  • All Brokers in the cluster send a heartbeat packet to every NameServer in the cluster every 30s to register their Topic routing information.
  • When a NameServer receives a heartbeat packet from a Broker, it updates the routing table and records the time at which the heartbeat was received.
  • Each NameServer runs a scheduled task that scans the Broker liveness table every 10s. If a NameServer has not received a heartbeat from a Broker for 120s, it judges the Broker to be offline and removes it from the routing table.
  • If the long-lived connection between a NameServer and a Broker is disconnected, the NameServer immediately detects that the Broker is offline and removes it from the routing table.
  • A message client (producer or consumer) establishes a connection with only one NameServer at any time and queries that NameServer for routing information every 30s. If the query succeeds, the client updates its local routing information; if the query fails, the failure is ignored.
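
To make the 30s/10s/120s timing above concrete, here is a minimal, self-contained sketch of the NameServer-side expiry logic. It is illustrative only, not the actual RocketMQ RouteInfoManager code; the class and method names are assumptions.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Toy model of the NameServer's Broker liveness table: heartbeats refresh a
// timestamp, a scheduled task scans every 10s, and entries older than 120s
// are removed together with their routing information.
public class BrokerLivenessTable {

    private static final long BROKER_EXPIRED_MILLIS = 1000 * 60 * 2; // 120s

    // brokerAddr -> timestamp of the last received heartbeat
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called when a Broker heartbeat (sent every 30s) arrives.
    public void onHeartbeat(String brokerAddr) {
        lastHeartbeat.put(brokerAddr, System.currentTimeMillis());
    }

    // Runs every 10s: expire Brokers whose last heartbeat is older than 120s.
    public void scanNotActiveBroker() {
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (System.currentTimeMillis() - entry.getValue() > BROKER_EXPIRED_MILLIS) {
                it.remove();
                removeBrokerFromRouteTable(entry.getKey());
            }
        }
    }

    private void removeBrokerFromRouteTable(String brokerAddr) {
        // In RocketMQ this is where topic routing entries referencing the Broker
        // would be cleaned up; omitted in this sketch.
    }

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::scanNotActiveBroker, 5, 10, TimeUnit.SECONDS);
    }
}
```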

Judging from the routing registration and removal mechanism above, when a Broker server goes down, how long does it take for a message sender to perceive the routing change?

Consider the following two situations separately:

  • The TCP connection between the NameServer and the Broker is disconnected. In this case the NameServer immediately detects the routing change and removes the Broker from the routing table, so the message sender should perceive the change within 30s. Some sends will fail during that window, but combined with the sending-avoidance mechanism this does not cause a major failure for the sender, which is acceptable.
  • The TCP connection between the NameServer and the Broker is not disconnected, but the Broker can no longer serve requests (it is half-dead). In this case the NameServer needs 120s to detect the Broker's downtime, and the message sender can take up to 150s to perceive the routing change (a short worked timing example follows this list).
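
Putting the two windows together gives the 150s worst case. The short sketch below spells out the arithmetic; the 30s client-side refresh corresponds to pollNameServerInterval in the RocketMQ Java client's ClientConfig, and the setter shown is the standard one, though it should be verified against the client version in use.

```java
import org.apache.rocketmq.client.producer.DefaultMQProducer;

public class RoutePerceptionDelay {
    public static void main(String[] args) {
        // Worst case for a half-dead Broker (TCP connection still established):
        long nameServerExpirySeconds = 120; // NameServer heartbeat-expiry window
        long clientRoutePollSeconds = 30;   // client route-refresh interval (default)
        System.out.println("Worst-case perception delay: "
                + (nameServerExpirySeconds + clientRoutePollSeconds) + "s"); // 150s

        // The client-side interval is configurable (in milliseconds); shrinking it
        // reduces the second term, but the 120s NameServer window still dominates.
        DefaultMQProducer producer = new DefaultMQProducer("demo_producer_group");
        producer.setPollNameServerInterval(10 * 1000); // poll routes every 10s instead of 30s
    }
}
```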

The question, then, is why a Broker restart caused by a memory failure led to 10 minutes of business impact; in other words, why did it take the client 10 minutes to really sense that the Broker was down?

Now that the problem has surfaced, we need to analyze it and provide a solution so the same type of error can be avoided in the production environment.

(2) Troubleshooting process

Querying the client log (/home/{user}/logs/rocketmqlogs/rocketmq_client.log) shows that the client's first message-sending timeout occurred at 14:44. The log output is as follows:

Since the memory failure occurred on the 192.168.3.100 machine, first check the logs of the NameServer on the other, healthy machine in the cluster to see how long it took that NameServer to detect the failure of broker-a. The logs are as follows:

As can be seen, the NameServer on 192.168.3.101 detected the downtime in roughly 2 minutes. That is, although the machine was restarting, the TCP connection was not disconnected (because of the operating system's hardware self-check and other factors), so the NameServer only detected the downtime after 120s and then removed the Broker from the routing table. According to the routing removal mechanism, the client should then have perceived the change within 150s, so why didn't it?

Continuing, examine the client's routing information to find the point at which the client perceived the routing change, as shown in the following figure:

From the client log, the client only perceived the change at 14:53:46. Why is this?

It turns out that the client reported a timeout exception when updating routing information. The screenshot is shown below:

From the time of the failure until recovery, the client kept trying to update routing information from the failed NameServer, but the requests always timed out. As a result, the client could never obtain the latest routing information and therefore never sensed that the Broker was down.

From the log analysis, the picture so far is fairly clear: the client did not perceive the routing change within the expected window because it kept trying to update routing information from the downed NameServer. Since those requests never succeeded, the client's cached routing information was never refreshed, producing the phenomenon described above.

Here lies the problem. According to our understanding of RocketMQ, when a NameServer goes down, the client should automatically switch to the next NameServer in its list. Why did the NameServer switch not happen here until 14:53?

Next, let's focus on the NameServer switching code. The code snippet is shown in the following figure:

The key points of the code in the figure are as follows (a simplified sketch follows this list):

  • The client selects a cached connection for sending an RPC request only if the connection's isActive method returns true, that is, the underlying TCP connection is alive.
  • When an RPC request from the client to the server fails with a non-timeout exception, the closeChannel method is executed; it closes the connection and removes it from the connection cache table. This is very important, because when switching NameServers, if a cached connection exists and is still active, the client will not switch to another NameServer.
  • If the RPC request times out, RocketMQ decides whether to close the connection according to the clientCloseSocketIfTimeout parameter. Unfortunately, this parameter defaults to false, and no configuration entry is provided to change it.
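
The original code screenshot is not reproduced here. As a substitute, the following self-contained toy model illustrates the behavior those three points describe; it is only loosely modeled on the client's NettyRemotingClient, and the names in it are illustrative rather than the actual RocketMQ source.

```java
import java.util.List;

// Toy model: an active cached connection to a (possibly half-dead) NameServer is
// reused forever unless it is closed, and timeouts do not close it by default.
public class NameServerSwitchSketch {

    static class Channel {
        final String addr;
        boolean active = true;      // underlying TCP connection still established
        boolean responsive = true;  // whether requests on it actually get answers
        Channel(String addr) { this.addr = addr; }
        boolean isActive() { return active; }
    }

    static class RpcTimeoutException extends RuntimeException {}
    static class RpcSendException extends RuntimeException {}

    private final List<String> nameServerList;
    private final boolean clientCloseSocketIfTimeout; // false by default in the real client
    private Channel cachedChannel;
    private int index = 0;

    NameServerSwitchSketch(List<String> nameServerList, boolean clientCloseSocketIfTimeout) {
        this.nameServerList = nameServerList;
        this.clientCloseSocketIfTimeout = clientCloseSocketIfTimeout;
    }

    // Point 1: an active cached connection is always reused, so no NameServer
    // switch happens while the TCP connection to the half-dead NameServer is up.
    Channel getAndCreateNameServerChannel() {
        if (cachedChannel != null && cachedChannel.isActive()) {
            return cachedChannel;
        }
        // Only without a usable cached channel does the client move on to the
        // next NameServer address in the list and connect to it.
        String next = nameServerList.get(index++ % nameServerList.size());
        cachedChannel = new Channel(next);
        return cachedChannel;
    }

    String invokeSync(Channel channel) {
        try {
            if (!channel.responsive) {
                throw new RpcTimeoutException();
            }
            return "route data from " + channel.addr;
        } catch (RpcSendException e) {
            // Point 2: a non-timeout failure closes the connection and evicts it
            // from the cache, so the next request switches to another NameServer.
            closeChannel(channel);
            throw e;
        } catch (RpcTimeoutException e) {
            // Point 3: on timeout the connection is closed only when
            // clientCloseSocketIfTimeout is true; since it defaults to false, a
            // half-dead NameServer keeps being reused and routes never refresh.
            if (clientCloseSocketIfTimeout) {
                closeChannel(channel);
            }
            throw e;
        }
    }

    private void closeChannel(Channel channel) {
        channel.active = false;
        if (channel == cachedChannel) {
            cachedChannel = null;
        }
    }
}
```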

At this point the cause of the problem is very clear.

Because the memory failure triggered a restart that required hardware self-checks, the NameServer and Broker on that machine could no longer process requests, yet their underlying TCP connections were not disconnected. The client's requests to that NameServer kept returning timeouts, but since a timeout does not close the TCP connection, no NameServer switch was triggered. Only after the faulty machine finished restarting was the TCP connection finally disconnected; the client then switched NameServers, perceived the routing change, and the fault recovered.

Root cause: the half-dead NameServer prevented the client from updating its routing information.

(3) Best practices

After this failure, I personally believe the NameServer should not be co-deployed with the Broker. If the NameServer and the Broker are deployed on separate machines, the problem above can be effectively avoided. The deployment architecture is shown in the following figure:

Would such a deployment architecture avoid the scenario above, namely a half-dead Broker? The answer is yes.

If the Broker on 192.168.3.100 becomes half-dead, the NameServers on 192.168.3.110 and 192.168.3.111 detect broker-a's downtime within 2 minutes, and the client can obtain the latest routing information from a NameServer. It then stops sending messages to the downed Broker, and the fault recovers.

If a NameServer becomes half-dead and requests to it time out, message sending still works normally as long as the Broker itself is up, because the client's cached routing information remains valid. However, if a NameServer and a Broker become half-dead at the same time, even this architecture cannot avoid the problem described above.

Therefore, the best practices drawn from this incident mainly include the following two measures:
1. Deploy the NameServer and the Broker separately, on isolated machines.
2. Close the connection between the client and the NameServer after a request timeout so that a NameServer switch is triggered; this requires modifying the source code (a hedged client-side sketch follows).
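
A hedged sketch of the client side under the revised deployment is shown below. setNamesrvAddr is a standard setter on the RocketMQ Java producer, the group name is illustrative, and the addresses simply follow the revised architecture above; measure 2 cannot be switched on from application code, as noted in the comment.

```java
import org.apache.rocketmq.client.exception.MQClientException;
import org.apache.rocketmq.client.producer.DefaultMQProducer;

public class ProducerSetup {
    public static void main(String[] args) throws MQClientException {
        DefaultMQProducer producer = new DefaultMQProducer("demo_producer_group");
        // Measure 1: NameServers now live on their own machines, separate from the Brokers.
        producer.setNamesrvAddr("192.168.3.110:9876;192.168.3.111:9876");
        producer.start();

        // Measure 2 (closing the NameServer connection on request timeout) cannot be
        // enabled from here: clientCloseSocketIfTimeout lives in the client's internal
        // NettyClientConfig and defaults to false, so turning it on means patching the
        // client source where that config is created and rebuilding, as described above.

        producer.shutdown();
    }
}
```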

Origin blog.csdn.net/qq_41893274/article/details/112546934