With 1.2 million connections on a single machine, how is the Xiaoai gateway architected?

A few words up front

In the reader exchange groups (50+ groups) run by Nien, a 40-year-old architect, many members have landed interview opportunities at first-tier Internet companies such as Alibaba, NetEase, Youzan, Xiyin, Baidu, and Didi.

Recently, while coaching a reader on his resume, Nien wrote up a "Long-Connection Gateway Project: Architecture and Practice". That project helped him land interview invitations from ByteDance, Alibaba, Weibo, and Autohome, so it is a genuinely impressive project.

To help you get more interview opportunities and more offers from big companies,

Nien decided to release a video chapter in September introducing this project's architecture and hands-on practice: "Chapter 33: 10Wqps High Concurrency Netty Gateway Architecture and Practical Operation", expected at the end of the month. We will then provide one-on-one resume guidance to make your resume shine and transform it completely.

The poster for "Chapter 33: 10Wqps High Concurrency Netty Gateway Architecture and Practical Operation" is as follows:

To accompany "Chapter 33: 10Wqps High Concurrency Netty Gateway Architecture and Practical Operation", Nien has compiled several industrial-grade, production-grade gateway cases as architecture and design reference material.

Previously compiled cases

In addition to the five cases above, while compiling the learning material Nien found another excellent production-grade case: "With 1.2 million connections on a single machine, how is the Xiaoai gateway architected?"

Note: this is another first-rate, top-notch industrial-grade and production-grade gateway case.

These cases are not Nien's original work.

They were simply collected from public sources while Nien was preparing the video lesson "Chapter 33: 10Wqps High Concurrency Netty Gateway Architecture and Practical Operation", and are shared here for learning and exchange.

For the PDFs of "Nien Architecture Notes", "Nien High Concurrency Trilogy", and "Nien Java Interview Guide", please visit the official account [Technical Freedom Circle].

With 1.2 million connections on a single machine, how is the Xiaoai gateway architected?

Author: Xiaoai technical team

1. The achievements of the Xiaoai access gateway's evolution

Xiao Ai (also known as "Xiao Ai Classmate") is an artificial intelligence voice interaction engine owned by Xiaomi.

"Xiaoai Classmate" is Xiaomi Group's unified intelligent voice service infrastructure.

The "Xiao Ai Classmate" client is integrated into Xiaomi phones, Xiaomi AI speakers, Xiaomi TVs, and other devices, and is widely used across eight scenarios: personal mobility, smart home, smart wearables, smart office, children's entertainment, smart travel, smart hotels, and smart learning.

The Xiaoai access layer is a key service for Xiaoai cloud device access and one of the core services.

The Xiaomi technical team carried out a series of optimizations and experiments on this service from 2020 to 2021, ultimately raising the number of long connections a single machine can carry from 300,000 to 1.2 million+, saving more than 30 machines.

2. What is the Xiaoai access layer?

The layers of Xiaoai’s overall architecture are as follows :

The access service mainly covers the authentication/authorization layer and the transport layer. It is the first service every Xiaoai device touches when interacting with the Xiaoai Brain.

As the diagram above shows, the key functions of the Xiaoai access service are:

  • 1) Secure transmission and authentication: maintain a secure channel between the device and the brain, ensuring valid identity authentication and secure data transmission;
  • 2) Long-connection maintenance: keep long connections (e.g. WebSocket) alive between the device and the brain, store connection state properly, and perform heartbeat keepalive and related tasks;
  • 3) Request forwarding: forward every request from Xiaoai devices, ensuring the stability of each request.

3. Technical implementation of early access layer

The earliest implementation of the Xiaoai access layer was built on Akka and the Play Framework. Its characteristics were:

  • 1) Based on Akka, we achieved preliminary asynchronization, ensuring the core threads never block and performance stays good;
  • 2) Play natively supports WebSocket, so with limited manpower we could build and ship the first version quickly while keeping the protocol implementation standard-compliant.

Note: the Play Framework ("Play" for short) is a web development framework similar to Spring MVC.

4. Technical issues at the early access layer

As Xiaoai's long-connection count climbed into the tens of millions, several problems with the early access-layer solution surfaced.

The main issues are as follows :

  • 1) As the number of long connections grew, more and more in-memory state had to be maintained, and JVM GC became a performance bottleneck with real GC risk. Incident analysis showed the Akka+Play version of the access layer topped out at about 280,000 long connections per instance.
  • 2) The old implementation was rather ad hoc: Akka Actors carried heavy state dependencies on one another instead of communicating via immutable messages, so actor communication degenerated into function calls. The code was hard to read and maintain, and failed to exploit Akka Actor's strengths for building concurrent programs.
  • 3) As an access-layer service, the old version depended heavily on protocol parsing and therefore needed frequent redeployments for version updates, each of which could force long-connection reconnects and posed an avalanche risk.
  • 4) Because we relied on the Play framework, its long-connection accounting was imprecise (the underlying TCP connection data was unavailable), which hampered our routine capacity assessments; and once the connection count grew, we could not carry out finer-grained optimization.

5. Design goals of the new access layer

In view of the many problems with the early access layer technical solutions, we decided to reconstruct the access layer.

The design goals of the new version of the access layer are as follows :

  • 1) High stability: avoid dropping connections during deployments as much as possible and keep the service stable;
  • 2) High performance: a single machine should carry at least 1 million long connections, with GC impact avoided as far as possible;
  • 3) High controllability: apart from the underlying network I/O system calls, all other code should be written in-house or use internal components, to maximize autonomy.

Therefore, we embarked on a journey of exploration to achieve millions of long connections on a single machine.

6. Optimization ideas for the new access layer

6.1 Access layer dependencies

The relationship between the access layer and external services is clarified as follows :

6.2 Functional division of access layer

The main responsibilities of the access layer can be summarized as follows :

  • 1) WebSocket decoding: the incoming client data stream must be parsed according to the WebSocket protocol;
  • 2) Socket state maintenance: store the basic information of each connection;
  • 3) Encryption and decryption: all data exchanged with the client must be encrypted, while transmission between back-end modules is plain-text JSON;
  • 4) Sequentialization: when two requests A and B arrive on the same physical connection, the back-end service may finish B before A, but responses must still be sent to the client in the order A then B;
  • 5) Back-end message distribution: the access layer interfaces with more than one service and may forward requests to different services depending on the message;
  • 6) Authentication: security-related verification, identity verification, and so on.
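The sequentialization responsibility above can be illustrated with a small reorder buffer. This is a minimal sketch, not the production code: the names (`ResponseSequencer`, `onResponse`) are invented here, and request ids are assumed to be assigned consecutively per connection.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Reorder buffer: responses may complete out of order on the backend, but
// are released to the client strictly in ascending request-id order.
class ResponseSequencer {
public:
    // Called when the backend finishes request `id`; returns the responses
    // that can now be sent to the client, in order (possibly none).
    std::vector<std::string> onResponse(uint64_t id, std::string body) {
        pending_[id] = std::move(body);
        std::vector<std::string> ready;
        // Flush the longest contiguous run starting at the next expected id.
        while (true) {
            auto it = pending_.find(next_);
            if (it == pending_.end()) break;
            ready.push_back(std::move(it->second));
            pending_.erase(it);
            ++next_;
        }
        return ready;
    }

private:
    uint64_t next_ = 1;                        // next id owed to the client
    std::map<uint64_t, std::string> pending_;  // completed but not yet sendable
};
```

If B (id 2) completes before A (id 1), the call for B returns nothing; the later call for A releases both responses in order.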

6.3 The idea behind splitting the access layer

Split the previous single module into two sub-modules according to whether they hold state.

The details are as follows :

  • 1) Front end: stateful, with functions minimized and deployments as rare as possible;
  • 2) Back end: stateless, with functions maximized and deployments invisible to users.

Following these principles, in theory we arrive at a functional split where the front end is small and the back end is large, as the schematic below shows.

7. Technical implementation of the new access layer

7.1 Overview

The module is split into front-end and back-end :

  • 1) The front end is stateful and the back end is stateless;
  • 2) The front-end and back-end are independent processes but deployed on the same machine.

Supplement: the front end maintains the device's long-connection state and is therefore stateful, while the back end handles the actual business requests and is stateless. Deploying a new backend version neither interrupts and reconnects device connections nor triggers authentication calls, avoiding unnecessary churn in long-connection state during version upgrades or logic changes.

The front end is implemented in C++:

  • 1) It parses the WebSocket protocol independently, so we can obtain all information at the socket level and fix any bug ourselves;
  • 2) Higher CPU utilization: no extra JVM overhead and no GC dragging down performance;
  • 3) Higher memory utilization: as connections grow, so does the memory tied to each connection, and managing it ourselves allows extreme optimization.

The backend is temporarily implemented in Scala :

  • 1) Existing functionality can be migrated directly, at far lower cost than rewriting;
  • 2) Some external services (such as authentication) provide Scala (Java) SDK libraries that can be used directly but have no C++ version; rewriting them in C++ would be very costly;
  • 3) All functions have been made stateless and can be restarted at any time without users noticing.

Communication uses ZeroMQ :

  • The most efficient inter-process communication is shared memory. ZeroMQ is implemented on top of shared memory, so speed is not a concern.

7.2 Front-end implementation

Overall architecture :

As shown in the figure above, it consists of four sub-modules :

  • 1) Transport layer: WebSocket protocol parsing and XMD protocol parsing;
  • 2) Distribution layer: shields the differences between transports; whatever interface the transport layer uses, events are converted into a unified form here and delivered to the state machine;
  • 3) State machine layer: to implement a purely asynchronous service, we use XMFSM, a home-grown Akka-like state machine framework based on the Actor model that provides a single-threaded Actor abstraction;
  • 4) ZeroMQ communication layer: since the ZeroMQ interface blocks, this layer uses two dedicated threads for sending and receiving.
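XMFSM itself is an internal framework, but its single-threaded Actor abstraction can be sketched as a mailbox drained one event at a time, so state is only ever mutated inside the handler. Everything below (`Actor`, `tell`) is an invented illustration of the idea, not the real XMFSM API:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>

// Minimal single-threaded Actor: all state mutation happens inside the
// handler, one event at a time, giving the Akka-style "no concurrent
// access to actor state" guarantee without any locking.
class Actor {
public:
    using Handler = std::function<void(const std::string&)>;
    explicit Actor(Handler h) : handler_(std::move(h)) {}

    // Enqueue an event; if no event is currently being processed, drain
    // the mailbox. A tell() issued from inside the handler is simply
    // enqueued and handled after the current event finishes.
    void tell(const std::string& event) {
        mailbox_.push(event);
        if (processing_) return;   // re-entrant tell(): just enqueue
        processing_ = true;
        while (!mailbox_.empty()) {
            std::string e = mailbox_.front();
            mailbox_.pop();
            handler_(e);           // one event at a time
        }
        processing_ = false;
    }

private:
    Handler handler_;
    std::queue<std::string> mailbox_;
    bool processing_ = false;
};
```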

7.2.1 Transport layer:

Since the Xiaoai long connection is based on the WebSocket protocol, we implemented our own WebSocket long-connection library, websocket-lib, in C++ on top of ASIO.

The characteristics of this long connection library are :

  • a. Lock-free design to ensure excellent performance;
  • b. Developed based on BOOST ASIO to ensure underlying network performance.

Stress testing shows this library performs excellently:

  • Long connections: 1,000,000 (100w)
  • Throughput (QPS): 50,000 (5w)
  • P99 latency: 5 ms

Besides the original WebSocket channel, this layer also handles sending and receiving for the other two channels.

Currently, the transport layer supports the following 3 different client interfaces :

  • a. websocket (tcp): referred to as ws;
  • b. SSL-based encrypted websocket (tcp): referred to as wss;
  • c. xmd(udp): referred to as xmd.
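To give a feel for what the transport layer's WebSocket parsing involves, here is a minimal frame-header decoder following RFC 6455 (fin/opcode byte, mask bit, 7/16/64-bit payload lengths). It is a simplified sketch of what a library like websocket-lib must do, not its actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decoded WebSocket frame-header fields (subset of RFC 6455).
struct WsHeader {
    bool fin;
    uint8_t opcode;
    bool masked;
    uint64_t payloadLen;
    size_t headerSize;  // bytes consumed by the header (incl. masking key)
};

// Parse a frame header from `buf`; returns false if more bytes are needed.
bool parseWsHeader(const std::vector<uint8_t>& buf, WsHeader& out) {
    if (buf.size() < 2) return false;
    out.fin = (buf[0] & 0x80) != 0;
    out.opcode = buf[0] & 0x0F;
    out.masked = (buf[1] & 0x80) != 0;
    uint64_t len = buf[1] & 0x7F;
    size_t pos = 2;
    if (len == 126) {                          // 16-bit extended length
        if (buf.size() < pos + 2) return false;
        len = (uint64_t(buf[2]) << 8) | buf[3];
        pos += 2;
    } else if (len == 127) {                   // 64-bit extended length
        if (buf.size() < pos + 8) return false;
        len = 0;
        for (int i = 0; i < 8; ++i) len = (len << 8) | buf[pos + i];
        pos += 8;
    }
    if (out.masked) pos += 4;  // client-to-server frames carry a 4-byte key
    out.payloadLen = len;
    out.headerSize = pos;
    return true;
}
```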

7.2.2 Distribution layer:

This layer converts heterogeneous transport-layer events into unified events and delivers them to the state machine. It acts as an adapter: whichever transport type is used below, by the time an event reaches the distribution layer it has become a consistent event handed to the state machine.

7.2.3 State machine processing layer:

The main processing logic is in this layer, and a very important part here is the encapsulation of the sending channel.

For the Xiaoai application-layer protocol, the processing logic is identical across channels, but each channel differs in the details of transmission and security-related logic.

For example :

  • a. wss sending and receiving needs no application-layer encryption or decryption, since TLS is terminated by the Nginx sitting in front; ws, by contrast, must encrypt what it sends with AES;
  • b. after successful authentication, wss does not need to send the challenge text to the client, precisely because wss performs no encryption or decryption of its own;
  • c. xmd sends different content from the other two: it uses a private protocol encapsulated with protobuf, and xmd must handle send-failure logic, whereas ws/wss need not worry about send failures because the underlying TCP protocol guarantees delivery.

In response, we use C++ polymorphism: we abstract a Channel interface whose methods cover the key channel-specific steps of request handling, such as how to send a message to the client, how to close the connection, and how to handle send failures. Each of the three send channels (ws/wss/xmd) has its own Channel implementation.

As soon as a client connection object is created, the Channel object of the matching type is instantiated. The state machine's main logic then implements only the business logic common to all channels and calls the Channel interface wherever channel-specific behavior is needed. This simple use of polymorphism keeps the differences isolated and the code clean.
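A minimal sketch of this Channel abstraction might look as follows. The method names and the exact per-channel behaviors (for example, which channels require a challenge) are simplified assumptions drawn from the examples above, not the production interface:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Common interface over the three send channels; the state machine calls
// only this interface, and per-channel differences live in the subclasses.
struct Channel {
    virtual ~Channel() = default;
    virtual std::string prepare(const std::string& plain) = 0;  // e.g. encrypt
    virtual bool needsChallenge() const = 0;  // send challenge after auth?
};

// ws: the access layer itself AES-encrypts the payload.
struct WsChannel : Channel {
    std::string prepare(const std::string& p) override {
        return "aes(" + p + ")";   // stand-in for real AES encryption
    }
    bool needsChallenge() const override { return true; }
};

// wss: TLS already encrypts the stream (terminated at Nginx), so the
// payload passes through unchanged and no challenge is needed.
struct WssChannel : Channel {
    std::string prepare(const std::string& p) override { return p; }
    bool needsChallenge() const override { return false; }
};

// xmd: private protobuf-framed protocol over UDP (challenge assumed here).
struct XmdChannel : Channel {
    std::string prepare(const std::string& p) override {
        return "pb(" + p + ")";    // stand-in for protobuf framing
    }
    bool needsChallenge() const override { return true; }
};
```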

7.2.4 ZeroMQ communication layer:

ZeroMQ reads and writes are made asynchronous via two dedicated threads; this layer is also responsible for encapsulating and parsing several private commands.

7.3 Backend implementation

7.3.1 Stateless transformation:

One of the most important changes made to the backend is to remove all information related to the connection status .

The entire service revolves around the Request (one connection can carry N Requests) for all forwarding and processing. Each request is independent of the previous one; multiple requests on a single connection are treated by the backend modules as unrelated requests.

7.3.2 Architecture:

The Scala service uses the Akka-Actor architecture to implement business logic.

After the service receives a message from ZeroMQ, it is delivered directly to the Dispatcher for parsing and request handling. The Dispatcher sends each request to the corresponding RequestActor, which parses the Event protocol and distributes the event to the business Actor responsible for it. Finally, the processed request data is sent onward to the back-end AIMS & XMQ services through XmqActor.

The processing flow of a request in multiple actors on the backend :

7.3.3 Dispatcher requests distribution:

The front end and back end communicate via Protobuf, which avoids the cost of JSON parsing and keeps the protocol well specified.

After receiving a message from ZeroMQ, the backend service parses the PB protocol in the DispatcherActor and processes the data according to its command class (CMD for short). The classes are as follows:

BIND command :

This command is used for device authentication. Because the authentication logic is complex and hard to implement in C++, authentication remains in the Scala business layer. This part parses the HTTP headers of the device request, extracts the token, performs authentication, and returns the result to the front end.

LOGIN command :

This command is used for device login. After a device passes authentication and the connection is established, it executes LOGIN to report the long-connection information to AIMS and record it in the Varys service, for later use by active push and other features. During LOGIN, the service first asks the Account service for the connection's uuid (used for routing and addressing during the connection's lifetime), then sends the device information plus the uuid to AIMS to perform the device login.

LOGOUT command :

This command is used to log out of the device. When the device disconnects from the server, it needs to perform a Logout operation to delete the long connection record from the Varys service.

UPDATE and PING commands :

  • a. UPDATE: device status update, used to refresh the device information saved in the database;
  • b. PING: connection keepalive, used to confirm the device is online.

TEXT_MESSAGE and BINARY_MESSAGE commands:

Text and binary messages. When a text or binary message is received, it is routed by requestId to the RequestActor handling that request.

7.3.4 Request parsing:

The received text and binary messages will be sent by the DispatcherActor to the corresponding RequestActor for processing according to the requestId.

Specifically: text messages are parsed into Event requests and dispatched to the designated business Actor based on namespace and name, while binary messages are dispatched to the corresponding business Actor according to the business scenario of the current request.
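Putting this chapter's command handling together, a stripped-down dispatcher could look like the sketch below. The command names come from the text; the handler bodies and the per-request routing map are placeholders standing in for the real Actors:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Command classes the DispatcherActor distinguishes (names from the text).
enum class Cmd { BIND, LOGIN, LOGOUT, UPDATE, PING,
                 TEXT_MESSAGE, BINARY_MESSAGE };

// Illustrative dispatcher: connection-lifecycle commands are handled
// directly, while TEXT/BINARY messages are routed by requestId to the
// handler ("RequestActor") owning that request.
class Dispatcher {
public:
    std::string dispatch(Cmd cmd, uint64_t requestId, const std::string& body) {
        switch (cmd) {
            case Cmd::BIND:   return "auth:" + body;    // token authentication
            case Cmd::LOGIN:  return "login:" + body;   // record in Varys/AIMS
            case Cmd::LOGOUT: return "logout:" + body;  // remove the record
            case Cmd::UPDATE: return "update:" + body;  // device info refresh
            case Cmd::PING:   return "pong";            // keepalive
            case Cmd::TEXT_MESSAGE:
            case Cmd::BINARY_MESSAGE:
                // Route to per-request state keyed by requestId.
                perRequest_[requestId] += body;
                return "routed:" + std::to_string(requestId);
        }
        return "";
    }

private:
    std::map<uint64_t, std::string> perRequest_;  // stand-in for RequestActors
};
```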

7.4 Other optimizations

While completing the adjustment to the new 1.0 architecture, we also continuously measured connection capacity and identified several factors with a large impact on it.

7.4.1 Protocol optimization:

  • a. Replace JSON with Protobuf: early front-end/back-end communication used a JSON text protocol, but JSON serialization and deserialization turned out to consume a lot of CPU. After switching to Protobuf, CPU usage dropped significantly.
  • b. Partial JSON parsing: since the business-layer protocol is JSON-based and could not simply be replaced, we parse only the small header portion to obtain the namespace and name, forward most messages as-is, and fully deserialize only a small fraction into objects. This optimization reduced CPU usage by 10%.
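The "partial JSON parsing" trick in point b can be sketched as a naive header-field extractor: pull `namespace` and `name` out of the header with plain string scanning and forward the body untouched. This toy version assumes simple, unescaped JSON string values and is only meant to show the idea:

```cpp
#include <cassert>
#include <string>

// Extract one string field from a JSON message without deserializing the
// whole thing. Assumes the value is a plain, unescaped JSON string; this
// is a demonstration, not a general JSON parser.
std::string extractField(const std::string& json, const std::string& key) {
    std::string pat = "\"" + key + "\"";
    size_t k = json.find(pat);
    if (k == std::string::npos) return "";
    size_t colon = json.find(':', k + pat.size());
    if (colon == std::string::npos) return "";
    size_t open = json.find('"', colon + 1);
    if (open == std::string::npos) return "";
    size_t close = json.find('"', open + 1);
    if (close == std::string::npos) return "";
    return json.substr(open + 1, close - open - 1);
}
```

Routing then only needs `extractField(msg, "namespace")` and `extractField(msg, "name")`; the rest of the message is forwarded verbatim.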

7.4.2 Extend heartbeat time:

When we first tested with 200,000 connections, we found that the heartbeat PING messages used to keep users online accounted for 75% of all messages exchanged between the front and back ends, and sending and receiving them consumed a lot of CPU. Extending the heartbeat interval therefore also served to reduce CPU consumption.
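The effect of stretching the heartbeat can be estimated with simple arithmetic: business traffic stays fixed while ping volume scales inversely with the interval. The sketch below is a back-of-envelope model, not a measurement from the team:

```cpp
#include <cassert>

// If pings make up `pingShare` of all messages and the heartbeat interval
// is stretched by `factor`, ping volume drops by that factor while
// business traffic is unchanged; returns the new ping share of traffic.
double pingShareAfter(double pingShare, double factor) {
    double pings = pingShare / factor;   // pings scale down with the interval
    double business = 1.0 - pingShare;   // business traffic is unchanged
    return pings / (pings + business);
}
```

With the 75% figure above, doubling the interval leaves pings at 0.375 of the old total against 0.25 of business traffic: total message volume falls by 37.5%, and the ping share drops to 60%.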

7.4.3 Self-developed intranet communication library:

To improve communication performance with back-end services, we use a self-developed TCP communication library: a purely asynchronous, multi-threaded TCP network library built on Boost ASIO. Its excellent performance helped us push the connection count past 1.2 million.

8. Future planning

The optimization of version 1.0 of the new architecture confirmed that our split was the right direction, because the preset goals were achieved:

  • 1) Connections carried per machine: 280,000 => 1.2 million+ (an ordinary server with 16 GB of memory and 40 cores sustains a peak request QPS of over 10,000); access-layer machines were taken offline, saving 50%+ of machine costs;
  • 2) Backend deployments are now lossless.

Re-examining our ideal goal and taking it as the direction, we arrive at the prototype of version 2.0:

Specifically :

  • 1) Rewrite the back-end module in C++ to further improve performance and stability; the parts that cannot be rewritten in C++ will be operated as independent services that the back-end module calls through the network library;
  • 2) Migrate non-essential functions from the front-end module to the back end, leaving the front end with fewer functions and more stability;
  • 3) If, after this transformation, the front end and back end differ greatly in processing capacity, then, given that ZeroMQ actually has performance headroom, consider replacing ZeroMQ with the network library, so the front end and back end can move from a 1:1 single-machine deployment to a 1:N multi-machine deployment and use machine resources more effectively.

The goal of version 2.0: after the above transformation, a single front-end module is expected to reach a processing capacity of 2 million+ connections.

Finally: if you have any questions, you can ask the veteran architect for advice.

The road to architecture is full of ups and downs

Architecture is different from advanced development: architecture questions are open-ended, and there are no standard answers.

Because of this, many people, despite spending great energy and money, unfortunately never complete the architecture upgrade in their careers.

So if, during an architecture upgrade or transformation, you truly cannot find an effective solution, you can turn to Nien, the 40-year-old architect, for help.

A while ago, a friend who had switched into Java from another major faced exactly this architecture-transition problem. After several rounds of guidance from Nien, he successfully landed offers as a Java architect and a big-data architect. When you hit a difficulty in your career, asking an experienced architect for help makes the road much smoother.

Recommended reading

" Ten billions of visits, how to design a cache architecture "

" Multi-level cache architecture design "

" Message Push Architecture Design "

" Alibaba round 2: How many nodes do you deploy? How do you deploy for 10 million concurrency?"

" Meituan 2 Sides: Five Nines High Availability 99.999%. How to achieve it?"

" NetEase side: Single node 2000Wtps, how does Kafka do it?"

" Byte Side: What is the relationship between transaction compensation and transaction retry?"

" NetEase side: 25Wqps high throughput writing Mysql, 100W data is written in 4 seconds, how to achieve it?"

" How to structure billion-level short videos? "

" Blow up, rely on "bragging" to get through JD.com, monthly salary 40K "

" It's so fierce, I rely on "bragging" to get through SF Express, and my monthly salary is 30K "

" It exploded...Jingdong asked for 40 questions on one side, and after passing it, it was 500,000+ "

" I'm so tired of asking questions... Ali asked 27 questions while asking for his life, and after passing it, it's 600,000+ "

" After 3 hours of crazy asking on Baidu, I got an offer from a big company. This guy is so cruel!"

" Ele.me is too cruel: Face an advanced Java, how hard and cruel work it is "

" After an hour of crazy asking by Byte, the guy got the offer, it's so cruel!"

" Accept Didi Offer: From three experiences as a young man, see what you need to learn?"

"Nien Architecture Notes", "Nien High Concurrency Trilogy", "Nien Java Interview Guide" PDF, please go to the following official account [Technical Freedom Circle] to get ↓↓↓


Origin blog.csdn.net/crazymakercircle/article/details/132941352