This article is shared from the Huawei Cloud Community article "Sermant's Practice in Remote Multi-Active Scenarios", author: Huawei Cloud Open Source.
The Sermant community released the message queue consumption prohibition plug-in in version 1.3.0 and the database write prohibition plug-in in version 1.4.0. They address, respectively, traffic switchover and data consistency protection in remote multi-active scenarios. This article analyzes Sermant's practice in remote multi-active scenarios.
1. Remote multi-active
1.1 What is remote multi-active?
For a software system, we hope that when the system fails, it can still provide services to the outside world normally. This feature of the software system is called high availability, and the remote multi-active architecture is used to solve the high availability problem.
The earliest system architecture is generally a single-machine architecture. When the database fails, the business may be interrupted for a long time. To solve this problem, the database evolved into a master database and a slave database. The master database handles read and write operations, and the slave database only serves read operations. Data in the master database is synchronized to the slave database in real time to keep the data consistent and complete. When the master database fails, the slave database is promoted to master and continues to work. However, these services are deployed in the same computer room or even the same cabinet, so when the computer room fails, the system still cannot provide normal external services.
At this time, active-active in the same city has become a good solution. Two computer rooms are deployed in one city. The two computer rooms deploy the same software environment and provide services. When one of the computer rooms fails, the traffic can be switched to another computer room to continue execution to ensure high availability of the system. As shown in Figure 1, the database in computer room 1 is the main database. All write operations in the two computer rooms operate on the main database in computer room 1, and read operations can read the database in this computer room. The physical distance between the two computer rooms is relatively close. At the same time, the two computer rooms can use dedicated lines for network connection. Therefore, the network latency of service calls in different computer rooms is low. The latency of writing services from computer room 2 to the database of computer room 1 is within an acceptable range.
Figure 1: Dual-active architecture diagram in the same city
The active-active architecture in the same city solves the problem of high availability of software systems. However, if a natural disaster occurs in a city, such as earthquakes, floods, etc., all computer rooms deployed in the same city will still be damaged and stop providing services. And because these disasters are highly destructive, the system repair cycle will be relatively long, which will seriously affect the normal operation of the company's business. In this case, it is obvious that these computer rooms need to be deployed in different regions. At the same time, the geographical distance of these regions needs to be far enough to resist the risk of natural disasters. This is the origin and value of the remote multi-active architecture.
If computer room 1 and computer room 2 in Figure 1 are deployed in two different cities, the architecture becomes remote active-active. To resist risks better, computer rooms can be deployed in more regions, which upgrades remote active-active to remote multi-active.
The remote multi-active architecture is shown in Figure 2. Client traffic is distributed by the routing layer to computer rooms in different regions. Unlike the same-city active-active architecture, the computer rooms are physically far apart, deploying dedicated network lines between them is prohibitively expensive, and the network latency of cross-room access cannot be ignored. Each service therefore has to operate on the database in its local computer room and cannot operate on databases across computer rooms. Under the remote multi-active architecture, the database in each computer room is a master database; data from each computer room is synchronized to the central computer room and then from the central computer room to the other computer rooms. Because the databases in all computer rooms can be written to, data conflicts are inevitable when different computer rooms modify the same piece of data. To avoid such conflicts, the routing layer can forward a fixed portion of the traffic to a particular computer room according to a sharding policy, which can be based on business type or geographical location. Traffic sharding guarantees that related requests from the same user are routed to the same computer room to complete all business operations, and that traffic stays within the local computer room, which reduces network latency.
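The sharding idea can be illustrated with a minimal sketch. It is not taken from Sermant or from any particular routing layer: the class name ShardRouter, the room names, and the fallback to a hash of the user ID are all assumptions made for illustration. The only property it tries to show is that the same user is always mapped to the same computer room.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical routing-layer helper: maps each request to one fixed computer room. */
final class ShardRouter {
    private final List<String> rooms;               // rooms that accept sharded traffic
    private final Map<String, String> regionToRoom; // geographic sharding policy

    ShardRouter(List<String> rooms, Map<String, String> regionToRoom) {
        if (rooms.isEmpty()) {
            throw new IllegalArgumentException("at least one computer room is required");
        }
        this.rooms = List.copyOf(rooms);
        this.regionToRoom = Map.copyOf(regionToRoom);
    }

    /** Uses the geographic policy first, then a stable hash of the user ID (an assumption). */
    String route(String userId, String region) {
        String byRegion = region == null ? null : regionToRoom.get(region);
        if (byRegion != null) {
            return byRegion;
        }
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return rooms.get(Math.floorMod(userId.hashCode(), rooms.size()));
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(
                List.of("room-1", "room-2", "room-3"), Map.of("south", "room-2"));
        System.out.println(router.route("user-42", "south"));
        System.out.println(router.route("user-42", "unmapped-region"));
    }
}
```

Because the mapping depends only on the region and the user ID, repeated requests from one user land in one room, which is what keeps writes to that user's data from conflicting across rooms.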
Figure 2: Remote multi-active architecture diagram
1.2 Typical scenarios of multiple activities in different places
The remote multi-active architecture deploys computer rooms in different regions to provide external services and to resist the risks caused by natural disasters. It is an effective means of achieving high system availability. However, it also makes the system more complex and introduces new requirements for fault-driven traffic switchover and data consistency:
- In a cloud service scenario, when a fault occurs in an availability zone, consumers in the fault zone need to stop pulling messages for consumption, and at the same time, the allocated message queues are rebalanced to consumers in the normal availability zone for processing, so as to avoid causing business exceptions.
- Remote multi-active can effectively solve the problem of data consistency by sharding traffic. But for global data, such as product quantity, when writing data, only the global database in the central computer room is allowed to be operated. Generally, the traffic for operating global data needs to be routed to the central computer room, and other computer rooms are only allowed to read the database. When traffic is routed incorrectly, it may still be written to the database in a non-central computer room, causing data conflicts. At this time, it is necessary to add protection to the global database and prohibit the execution of write operations in non-central computer rooms.
In response to the above two typical problems, Sermant developed the message queue consumption prohibition plug-in and the database write prohibition plug-in to deal with them, which will be introduced in detail below.
2. Message queue consumption prohibition plug-in
2.1 Introduction to message queue consumption prohibition plug-in
The message queue consumption prohibition plug-in allows microservices to dynamically adjust how consumers consume from message queue middleware at runtime, so that messages are handled properly in abnormal environments or states and unnecessary business impact is avoided. For example, in a remote multi-active system, if a regional failure occurs and traffic needs to be switched, consumption prohibition can be enabled in the availability zone where the failure occurred. Consumers in the normal availability zones then handle the business, the faulty zone no longer consumes messages and causes business anomalies, and the system stays highly available. After the fault is handled, consumption can be resumed.
The message queue consumption prohibition plug-in currently supports two message middlewares: Kafka and RocketMQ. On the Kafka side, the plug-in implements Topic-level consumption prohibition and recovery functions. For RocketMQ, the granularity of controlling consumption is at the consumer instance level. Sermant supports issuing message queue types and specific topics that need to be prohibited from consumption through the configuration center.
For more information about the message queue consumption prohibition plug-in, its configuration, and scenario demonstrations, refer to the official website document "Message Queue Consumption Prohibition".
2.2 Applying the message queue consumption prohibition plug-in to fault-driven traffic switchover
Application scenario: A software system uses Kafka as its message queue, and the producer sends messages to the topic-test topic, which has four partitions. Availability Zone A and Availability Zone B each have two consumers that join the test consumer group and consume topic-test messages, so each consumer is assigned one partition. Availability Zone A and Availability Zone B are located in different regions, that is, they are two computer rooms in a remote multi-active deployment, as shown in the figure below.
In this scenario, once the consumer service runs with Sermant's message queue consumption prohibition plug-in mounted, the topics it consumes can be controlled in real time, which ensures that messages are handled properly in abnormal environments or states.
When Availability Zone A fails, its consumers should stop consuming. Issue a global configuration in Availability Zone A that prohibits Consumer A and Consumer B from consuming the topic-test topic, so that they release the partitions assigned to them.
The configuration of the message queue consumption prohibition plug-in is as follows. enableKafkaProhibition enables the Kafka consumption prohibition capability, and kafkaTopics specifies the subscribed topics whose consumption is prohibited. For how to deliver the configuration, refer to the official website document "Message Queue Consumption Prohibition":
enableKafkaProhibition: true
kafkaTopics:
  - topic-test
After the configuration is delivered, consumers in availability zone A stop consuming, and consumers in availability zone B reallocate the partitions of the topic-test topic, as shown in the following figure.
After Availability Zone A returns to normal, a configuration can be issued through the dynamic configuration center again to let Consumers A and B resume consuming the topic-test topic. Once the configuration that re-enables consumption is delivered, Kafka triggers a rebalance, and the consumers in Availability Zones A and B are assigned partitions again.
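The effect of prohibiting and then re-enabling consumption can be sketched as a small simulation. This is not Kafka's partition assignor and not Sermant code: the class RebalanceSketch, the round-robin strategy, and the names consumer-C and consumer-D for the two consumers in Availability Zone B are assumptions chosen only to show how the four partitions move between consumers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical simulation of partition reassignment when some consumers are prohibited. */
final class RebalanceSketch {
    /** Spreads partitions 0..partitionCount-1 round-robin over consumers that are not prohibited. */
    static Map<String, List<Integer>> assign(
            int partitionCount, List<String> consumers, Set<String> prohibited) {
        List<String> active = new ArrayList<>();
        for (String consumer : consumers) {
            if (!prohibited.contains(consumer)) {
                active.add(consumer);
            }
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : active) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (active.isEmpty()) {
            return assignment; // nobody is allowed to consume
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            assignment.get(active.get(partition % active.size())).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> group = List.of("consumer-A", "consumer-B", "consumer-C", "consumer-D");
        // Normal state: each of the four consumers owns one partition.
        System.out.println(assign(4, group, Set.of()));
        // Zone A prohibited: the two zone B consumers own two partitions each.
        System.out.println(assign(4, group, Set.of("consumer-A", "consumer-B")));
    }
}
```

Calling assign again with an empty prohibited set models the recovery step: all four consumers receive one partition each, as before the failure.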
The message queue consumption prohibition plug-in provides fault-driven traffic switchover for message queues in remote multi-active scenarios, which keeps the system available.
3. Database write prohibition plug-in
3.1 Introduction to the database write prohibition plug-in
After a service is started with the database write prohibition plug-in mounted, write prohibition can be dynamically enabled or disabled for specified databases. In a remote multi-active scenario, users may want to stop write operations on individual or all databases and allow only reads, in order to protect the integrity, consistency, and security of the data. There are two typical cases. First, global data in a business database may only be written in the central computer room; with the plug-in enabled, incorrectly routed traffic fails when it tries to write to the database in a non-central computer room. Second, in a multi-site, multi-write deployment, before traffic is switched manually, the computer room that traffic is being moved away from first prohibits database writes, waits until data synchronization to the other computer rooms is complete, and only then switches the traffic. In both cases the plug-in protects the consistency of database data.
The database write prohibition plug-in currently supports MySQL, MongoDB, PostgreSQL, and OpenGauss. While the microservice is running, the type and name of the write-prohibited database can be issued through the configuration center. For the specific write operations that can be prohibited and for plug-in usage, refer to the official website document "Database Write Prohibition".
3.2 Applying the database write prohibition plug-in to protect data consistency
Application scenario: Under the multi-active remote architecture, a business microservice is used to modify global data such as product inventory. At the same time, the global data is stored in a MySQL database named global. For this global data, write operations are only allowed to operate the global database in the central computer room, and the global databases in other computer rooms can only read data. In order to ensure data consistency, when global data is modified, the traffic is routed to the central computer room for execution at the routing layer, and other read operations can be routed to any computer room, as shown in the figure below.
When the routing layer routes a request that writes global data incorrectly and it is executed in a non-central computer room, a data conflict may occur if the central and non-central computer rooms modify the quantity of the same product at the same time. To prevent this, the business microservice can mount Sermant's database write prohibition plug-in, which prohibits writing to the global database in non-central computer rooms.
To prohibit writing to the global database in non-central computer rooms, issue the following configuration through the dynamic configuration center:
enableMySqlWriteProhibition: true
mySqlDatabases:
  - global
Among them, enableMySqlWriteProhibition means enabling the ability to prohibit writing on the MySQL database, and mySqlDatabases is used to specify the name of the specific write-prohibited database. This example is the global database.
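To make the semantics of these two settings concrete, here is a minimal sketch of a write guard driven by an enable flag and a list of database names. It is not Sermant's implementation: the class WriteGuardSketch, the keyword list, and the first-keyword classification of SQL statements are assumptions for illustration, and the set of operations the real plug-in blocks is described in the official documentation.

```java
import java.sql.SQLException;
import java.util.Locale;
import java.util.Set;

/** Hypothetical guard: rejects write statements against databases listed as write-prohibited. */
final class WriteGuardSketch {
    // Assumed keyword list; not the list used by the real plug-in.
    private static final Set<String> WRITE_KEYWORDS = Set.of(
            "INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP", "TRUNCATE");

    private final boolean enabled;                 // mirrors enableMySqlWriteProhibition
    private final Set<String> prohibitedDatabases; // mirrors mySqlDatabases

    WriteGuardSketch(boolean enabled, Set<String> prohibitedDatabases) {
        this.enabled = enabled;
        this.prohibitedDatabases = Set.copyOf(prohibitedDatabases);
    }

    /** Classifies a statement by its first keyword only; a simplification. */
    static boolean isWrite(String sql) {
        String trimmed = sql.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        String firstWord = trimmed.split("\\s+", 2)[0].toUpperCase(Locale.ROOT);
        return WRITE_KEYWORDS.contains(firstWord);
    }

    /** Throws SQLException for a write to a prohibited database; reads pass through. */
    void check(String database, String sql) throws SQLException {
        if (enabled && prohibitedDatabases.contains(database) && isWrite(sql)) {
            throw new SQLException("Write to database '" + database + "' is prohibited");
        }
    }

    public static void main(String[] args) throws SQLException {
        WriteGuardSketch guard = new WriteGuardSketch(true, Set.of("global"));
        guard.check("global", "SELECT quantity FROM product WHERE id = 1"); // allowed
        try {
            guard.check("global", "UPDATE product SET quantity = 9 WHERE id = 1");
        } catch (SQLException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is the decision rule: with the flag on and the database listed, reads are still served while writes are rejected.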
After the configuration is issued, when traffic with abnormal routing is written to the global database in the non-central computer room, the database write prohibition plug-in throws a java.sql.SQLException exception to the business microservice and prohibits writing to the database. The business system needs to handle this exception, such as adding a retry operation to reroute the traffic to the central computer room for execution to ensure the normal operation of the system. The execution logic is shown in the figure below.
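On the business side, the retry described above might look like the following sketch. The class RerouteOnProhibition, the Write callback, and the room names are hypothetical. The sketch also treats any SQLException from the local attempt as a write prohibition, which is a simplification: a real service would need to tell a prohibition apart from other database errors, and how to do so is not covered here.

```java
import java.sql.SQLException;

/** Hypothetical business-side handling: retry a rejected global write in the central room. */
final class RerouteOnProhibition {
    /** A write that is executed against the database of the given computer room. */
    @FunctionalInterface
    interface Write<T> {
        T execute(String room) throws SQLException;
    }

    static <T> T writeGlobal(String localRoom, String centralRoom, Write<T> write)
            throws SQLException {
        try {
            return write.execute(localRoom);
        } catch (SQLException rejected) {
            // Simplification: every SQLException is treated as "write prohibited here".
            if (localRoom.equals(centralRoom)) {
                throw rejected; // already in the central room, nothing to reroute to
            }
            return write.execute(centralRoom); // single retry in the central room
        }
    }

    public static void main(String[] args) throws SQLException {
        String result = writeGlobal("room-2", "central", room -> {
            if (!room.equals("central")) {
                throw new SQLException("write prohibited in " + room);
            }
            return "written in " + room;
        });
        System.out.println(result);
    }
}
```

Limiting the retry to a single attempt against the central computer room keeps a misrouted request from looping if the central room rejects it as well.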
The database write prohibition plug-in disables writing to the specified database in a remote multi-active scenario, which can prevent abnormal traffic write operations and ensure data consistency in databases in different computer rooms.
4. Summary
In the remote multi-active scenario, Sermant's message queue consumption prohibition plug-in handles message queue traffic switchover when an availability zone fails, allowing consumers in the normal availability zones to consume the data. The database write prohibition plug-in prohibits writing to the specified database without affecting reads, which prevents data conflicts.
Sermant has built rich service governance capabilities for remote multi-active scenarios. In the future, Sermant will continue to work on gradually building a more complete service governance capability system.
As a bytecode enhancement framework focused on service governance, Sermant is committed to providing a high-performance, scalable, easy-to-access, and feature-rich service governance experience, balancing performance, functionality, and experience in each release. Everyone is welcome to join.
- Sermant official website: https://sermant.io
- GitHub repository: https://github.com/huaweicloud/Sermant