Million-level connections: how is the iQiyi WebSocket gateway architected?


This case is not Nien's original work. It was collected from the Internet while preparing the course "Chapter 33: 10Wqps High Concurrency Netty Gateway Architecture and Practical Operation" and is shared here for study and discussion.


Million-level connections: iQiyi WebSocket push gateway architecture

Original author: iQiyi technical team

HTTP is a stateless request/response protocol built on TCP.

In HTTP, only the client can initiate a request; the server can only respond.

In many cases this request/response pull model is sufficient.

However, in scenarios such as real-time notification and message push (offline message push in IM is the most typical example), data must reach the client in real time, which requires the server to be able to push data actively.

How can the server push?

Traditional Web server push techniques, such as short polling and long polling, solve the problem to some extent, but they suffer from poor timeliness and wasted resources.

The WebSocket specification, introduced with the HTML5 standard, has largely changed this situation and has become the mainstream technology for server-side message push.

This article shares iQiyi's practical experience and lessons learned in building a Netty-based WebSocket long-connection real-time push gateway.

1. Technical pain points of the old solution

As a key part of our content ecosystem, iQiyi Account is a front-end system with high requirements for user experience, because that experience directly affects creators' enthusiasm.

Currently, iQiyi Account uses WebSocket real-time push in multiple business scenarios, including:

1) User comments: comment messages are pushed to the browser in real time;

2) Real-name authentication: before signing a contract, the user must complete real-name authentication. The user scans a QR code and enters a third-party authentication page; after authentication completes, the browser is notified of the result asynchronously;

3) Liveness recognition: similar to real-name authentication; when liveness recognition completes, the result is pushed to the browser asynchronously.

In actual business development, we found several problems with how WebSocket real-time push was being used.

These problems are:

1) First: the WebSocket technology stack is not unified. Some implementations are based on Netty and others on Web containers, which makes development and maintenance difficult;

2) Second: WebSocket implementations are scattered across projects and tightly coupled with business systems. Any other business that needs WebSocket has to develop it again, which wastes cost and lowers efficiency;

3) Third: WebSocket is a stateful protocol. A client connects to only one node in the cluster and communicates only with that node for the life of the connection, so a WebSocket cluster must solve the session-sharing problem. Deploying a single node avoids the problem, but it cannot scale horizontally to support higher load and is a single point of failure;

4) Finally: monitoring and alarming are missing. The number of WebSocket long connections can be roughly estimated from the number of Linux socket connections, but the figure is not accurate, and business-level indicators such as the number of users are unavailable. The implementations also cannot be integrated with the existing microservice monitoring system for unified monitoring and alarming.

2. Technical objectives of the new solution

To solve the problems of the old solution described above, we need a unified WebSocket long-connection real-time push gateway.

The new gateway needs to have the following characteristics:

1) Centralized long-connection management and push capability: adopt a unified technology stack and consolidate long connections into a shared base capability, which makes iteration and maintenance easier;

2) Decoupling from business: separate business logic from long-connection communication so that business systems do not need to care about communication details, which avoids repeated development and saves R&D cost;

3) Easy to use: provide an HTTP push channel that any development language can call. A business system pushes data with a simple call, which improves R&D efficiency;

4) Distributed architecture: build a multi-node cluster that scales horizontally to meet business growth; the failure of a single node does not affect overall availability, which ensures high reliability;

5) Multi-terminal message synchronization: a user may be online in several browsers or tabs at the same time, and every message is delivered to all of them;

6) Multi-dimensional monitoring and alarming: expose custom monitoring indicators to the existing microservice monitoring system so that alarms fire promptly when problems occur, which keeps the service stable.

3. Technology selection of new solutions

Among the many WebSocket implementations, we settled on Netty after weighing performance, scalability, community support and other factors.

Netty is a high-performance, event-driven, asynchronous, non-blocking network communication framework that is widely used in well-known open source projects.

WebSocket is stateful, unlike stateless HTTP, so a WebSocket cluster cannot simply load-balance each message to an arbitrary node the way an HTTP cluster can. Once a long connection is established, its session lives on one server node, so in a cluster it is hard to know which node holds a given session.

There are generally two technical solutions to this problem:

1) Use something similar to a microservice registry to maintain a global session mapping;

2) Use event broadcasting and let each node determine whether it holds the session.

A comparison of these two options is shown in the table below.

WebSocket cluster solutions:

| Solution | Advantages | Disadvantages |
| --- | --- | --- |
| Registry | The session mapping is explicit; better suited to large clusters | Complex to implement; depends heavily on the registry; adds operation and maintenance cost |
| Event broadcast | Simple and lightweight to implement | With many nodes, every message is broadcast to all nodes, which wastes resources |

Considering implementation cost and cluster size, we chose the lightweight event broadcast solution.

There are many ways to implement broadcast, such as RocketMQ message broadcast, Redis Publish/Subscribe, and ZooKeeper notification.

The advantages and disadvantages of these options are compared in the table below. After weighing throughput, latency, persistence, and ease of implementation, we finally chose RocketMQ.

Comparison of broadcast implementation solutions:

| Solution | Advantages | Disadvantages |
| --- | --- | --- |
| Based on RocketMQ | High throughput, high availability, guaranteed reliability | Real-time performance is not as good as Redis |
| Based on Redis | High real-time performance, simple to implement | Reliability is not guaranteed |
| Based on ZooKeeper | Simple to implement | Poor write performance; not suitable for frequent writes |

4. Implementation of the new solution

4.1 System architecture

The overall architecture of the gateway is shown in the figure below:


The overall process of the gateway is as follows:

1) When a client establishes a long connection with any node of the gateway, the node adds the connection to the long-connection queue it keeps in memory. The client sends heartbeat messages to the server at regular intervals. If no heartbeat arrives within the configured time, the long connection is considered broken, and the server closes it and clears the session from memory.

2) When a business system needs to push data to a client, it sends the data to the gateway through the HTTP interface that the gateway provides.

3) After receiving the push request, the gateway writes the message to RocketMQ.

4) The gateway nodes also act as consumers and consume messages in broadcast mode, so every node receives every message.

5) After receiving a message, each node checks whether the push target is in the long-connection queue it maintains in memory. If it is, the node pushes the data over the long connection; otherwise it ignores the message.
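Steps 2 to 5 can be sketched in a few lines. The following is only an in-memory illustration, not the gateway's actual code: `Broker` stands in for a RocketMQ topic consumed in broadcast mode, a plain list stands in for a WebSocket channel, and all class and method names are hypothetical.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a broadcast topic: every subscriber gets every message."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)


class GatewayNode:
    """One gateway node; it only knows the connections that terminate on it."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.connections = defaultdict(list)  # uid -> list of per-connection outboxes
        broker.subscribe(self.on_broadcast)   # step 4: consume in broadcast mode

    def connect(self, uid):
        outbox = []  # stands in for a WebSocket channel
        self.connections[uid].append(outbox)
        return outbox

    def http_push(self, uid, data):
        # Steps 2-3: the push request may land on any node; the node only publishes it.
        self.broker.publish({"uid": uid, "data": data})

    def on_broadcast(self, message):
        # Step 5: deliver if the target is connected here, otherwise ignore.
        for outbox in self.connections.get(message["uid"], []):
            outbox.append(message["data"])


broker = Broker()
node_a, node_b = GatewayNode("a", broker), GatewayNode("b", broker)
alice_tab = node_a.connect("alice")
bob_tab = node_b.connect("bob")

# The HTTP request reaches node b, which holds no connection for alice.
node_b.http_push("alice", "new comment")
print(alice_tab, bob_tab)  # ['new comment'] []
```

The node that receives the HTTP request never needs to know where the user is connected, which is the point of the broadcast design; the cost is that every node processes every message.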

The gateway forms a cluster of multiple nodes, and each node holds a share of the long connections, which balances the load. When the number of connections grows, nodes can be added to spread the pressure, which provides horizontal scaling.

When a node fails, its clients try to re-establish long connections with other nodes, which keeps the service available overall.
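The heartbeat timeout from step 1 is what lets a node notice dead connections and free their sessions. Below is a minimal sketch of that bookkeeping, assuming a hypothetical `HeartbeatTracker` class and a 30-second timeout chosen only for illustration; the article does not state the real timeout. In a Netty server this job is commonly delegated to `IdleStateHandler`, although the article does not say whether the gateway uses it.

```python
import time


class HeartbeatTracker:
    """Remembers when each connection last sent a heartbeat."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_seen = {}  # connection id -> time of the last heartbeat

    def heartbeat(self, conn_id):
        self.last_seen[conn_id] = self.clock()

    def evict_expired(self):
        """Drop connections that stayed silent longer than the timeout and return their ids."""
        now = self.clock()
        expired = [c for c, seen in self.last_seen.items() if now - seen > self.timeout]
        for conn_id in expired:
            del self.last_seen[conn_id]  # a real node would also close the channel here
        return expired


# A fake clock keeps the example deterministic.
fake_now = [0.0]
tracker = HeartbeatTracker(timeout_seconds=30, clock=lambda: fake_now[0])
tracker.heartbeat("conn-1")
tracker.heartbeat("conn-2")
fake_now[0] = 20.0
tracker.heartbeat("conn-2")     # conn-2 stays alive, conn-1 goes silent
fake_now[0] = 40.0
print(tracker.evict_expired())  # ['conn-1']
```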

4.2 Session management

After a WebSocket long connection is established, its session information is kept in the memory of the node that holds the connection.

The SessionManager component manages sessions. Internally it uses a hash table that maps each UID to a UserSession.

A UserSession represents a user-level session. A user may have several long connections open at the same time, so UserSession also uses a hash table internally, mapping each Channel to a ChannelSession.

To prevent a user from creating long connections without limit, when the number of ChannelSessions inside a UserSession exceeds a threshold, the earliest established ChannelSession is closed to reduce server resource usage.
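The two-level mapping and the eviction rule can be sketched as follows. The class names follow the article, but the methods, the per-user limit of 2, and the `sent` list that stands in for a network write are assumptions made for illustration.

```python
from collections import OrderedDict


class ChannelSession:
    """One long connection (one browser tab, for example)."""

    def __init__(self, channel_id):
        self.channel_id = channel_id
        self.open = True
        self.sent = []  # stands in for data written to the socket

    def close(self):
        self.open = False


class UserSession:
    """All connections of one user, kept in the order they were established."""

    def __init__(self, uid, max_channels):
        self.uid = uid
        self.max_channels = max_channels
        self.channels = OrderedDict()  # channel id -> ChannelSession, oldest first

    def add(self, channel_id):
        session = ChannelSession(channel_id)
        self.channels[channel_id] = session
        closed = []
        while len(self.channels) > self.max_channels:
            _, oldest = self.channels.popitem(last=False)  # earliest established
            oldest.close()
            closed.append(oldest.channel_id)
        return session, closed


class SessionManager:
    """Maps each UID to its UserSession on this node."""

    def __init__(self, max_channels_per_user):
        self.max_channels_per_user = max_channels_per_user
        self.users = {}  # uid -> UserSession

    def register(self, uid, channel_id):
        if uid not in self.users:
            self.users[uid] = UserSession(uid, self.max_channels_per_user)
        return self.users[uid].add(channel_id)

    def unregister(self, uid, channel_id):
        user = self.users.get(uid)
        if user is None:
            return
        user.channels.pop(channel_id, None)
        if not user.channels:
            del self.users[uid]  # the last connection is gone, so drop the user

    def push(self, uid, data):
        """Write data to every connection of the user; return how many were reached."""
        user = self.users.get(uid)
        if user is None:
            return 0
        for session in user.channels.values():
            session.sent.append(data)
        return len(user.channels)


manager = SessionManager(max_channels_per_user=2)
manager.register("u1", "tab-1")
manager.register("u1", "tab-2")
_, closed = manager.register("u1", "tab-3")
print(closed)                       # ['tab-1']
print(manager.push("u1", "hello"))  # 2
```

Pushing to every ChannelSession of a user is also what gives the multi-terminal synchronization listed among the goals in section 2.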

The relationship between SessionManager, UserSession, and ChannelSession is shown in the figure below.

SessionManager component:


4.3 Monitoring and alarming

To track the number of long connections established in the cluster and the number of users they belong to, the gateway provides basic monitoring and alarm functions.

The gateway integrates Micrometer and exposes the number of connections and the number of users as custom indicators for Prometheus to collect, which ties it into the existing microservice monitoring system.

In Grafana, indicators such as the number of connections, number of users, JVM, CPU, and memory can be viewed easily to understand the gateway's current capacity and load. Alarm rules can also be configured in Grafana to trigger Qixin (the internal alarm platform) alarms when the data is abnormal.
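For illustration, the sketch below derives the two indicators from an in-memory session table and renders them in the Prometheus text exposition format, which is what a scrape of the gateway would return. The metric names and the `sessions` dictionary are hypothetical; in the real gateway Micrometer produces this output.

```python
def render_gauges(gauges):
    """Render gauges as Prometheus text exposition lines."""
    lines = []
    for name, (help_text, value) in gauges.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical session table of one node: uid -> open connections.
sessions = {"u1": ["tab-1", "tab-2"], "u2": ["tab-1"]}

metrics = {
    "gateway_users": ("Users with at least one open connection", len(sessions)),
    "gateway_connections": (
        "Open WebSocket connections",
        sum(len(channels) for channels in sessions.values()),
    ),
}
output = render_gauges(metrics)
print(output, end="")
```

Counting from the session table instead of from Linux socket statistics is what makes the user-level figure available at all, which addresses the fourth pain point in section 1.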

5. Performance stress test of new solution

Stress test preparation:

  • 1) Two virtual machines with 4 cores and 16 GB of memory each serve as the server and the client respectively;
  • 2) During the stress test, the gateway opens 20 ports and 20 clients are started at the same time;
  • 3) Each client uses one server port and establishes 50,000 connections, for a total of one million simultaneous connections.
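The article does not explain why 20 server ports are used, but a likely reason is the TCP port limit: a connection is identified by source IP, source port, destination IP and destination port, so a single client machine can open at most 65,535 connections to one server IP and port. The arithmetic, as a sketch under that assumption:

```python
server_ports = 20
connections_per_port = 50_000
total_connections = server_ports * connections_per_port

# Source ports are 16-bit, so one client IP can hold at most 65,535
# connections to a single server IP:port pair.
max_source_ports = 2**16 - 1
min_server_ports = -(-1_000_000 // max_source_ports)  # ceiling division

print(total_connections)  # 1000000
print(min_server_ports)   # 16
```

With 50,000 connections per port, each client stays below the theoretical limit. In practice the usable ephemeral port range on the client is smaller than 65,535 unless it has been widened, and file descriptor limits on both machines have to be raised as well.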

The number of connections (about one million) and the memory usage are shown below:

[root@sy-dev-1de4f0c2a target]# ss -s ; free -h
Total: 1002168 (kernel 1002250)
TCP: 1002047 (estab 1002015, closed 4, orphaned 0, synrecv 0, timewait 4/0), ports 0
Transport Total   IP      IPv6
*         1002250 -       -
RAW       0       0       0
UDP       4       2       2
TCP       1002043 1002041 2
INET      1002047 1002043 4
FRAG      0       0       0

          total   used    free  shared  buff/cache  available
Mem:      15G     4.5G    4.5G  232K    6.5G        8.2G
Swap:     4.0G    14M     4.0G

Sending one message to all one million long connections from a single thread takes the server about 10 seconds on average, as shown below.

Server push duration:

2021-01-25 20:51:02.614 INFO [mp-tcp-gateway,54d52e7e4240b65a,54d52e7e4240b65a,false]
[600ebeb62@2559f4507adee3b316c571/507adee3b316c571] 89558 --- [nio-8080-exec-6]
c.i.m.t.g.controller.NotifyController: [] [UID:] send message ...
2021-01-25 20:51:11.973 INFO [mp-tcp-gateway,54d52e7e4240b65a,54d52e7e4240b65a,false]
[1600ebeb62@2559f4507adee3b316c571/507adee3b316c571] 89558 --- [nio-8080-exec-6]
c.i.m.t.g.controller.NotifyController: [] [UID:] send message to 1001174 channels
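The throughput of this particular run can be read off the two log timestamps. A small calculation using only the values shown in the log above:

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
start = datetime.strptime("2021-01-25 20:51:02.614", LOG_TIME_FORMAT)
end = datetime.strptime("2021-01-25 20:51:11.973", LOG_TIME_FORMAT)
channels = 1_001_174  # "send message to 1001174 channels"

elapsed = (end - start).total_seconds()
rate = channels / elapsed

print(f"elapsed: {elapsed:.3f} s")
print(f"rate: {rate:,.0f} channels/s")
```

This single run took about 9.4 seconds, which is roughly 107,000 channel writes per second on one thread and is consistent with the quoted average of about 10 seconds.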

In general, the number of long connections that one user holds at the same time is in the single digits.

Taking a user with 10 long connections as an example, with 600 concurrent requests over 120 seconds, the push interface reaches roughly 1,600+ TPS, as shown in the figure below.

Stress test data for 10 long connections, concurrency 600, and duration 120s:
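A few derived figures help put these numbers in context. The sketch below assumes that each push request fans out to all 10 connections of the target user and that the 600 concurrent callers always have a request in flight, so that Little's law applies; neither assumption is stated in the article.

```python
tps = 1600                 # push requests per second (reported as 1600+)
connections_per_user = 10
concurrency = 600
duration_s = 120

channel_writes_per_s = tps * connections_per_user  # each push reaches 10 connections
total_requests = tps * duration_s                  # requests completed in the run
avg_latency_s = concurrency / tps                  # Little's law: W = L / lambda

print(channel_writes_per_s)  # 16000
print(total_requests)        # 192000
print(avg_latency_s)         # 0.375
```

Because the reported TPS is a lower bound, the 0.375 s figure should be read as an approximate upper bound on the average request latency under this load.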

The current performance meets the needs of our actual business scenarios and can support future business growth.

6. Practical application cases of new solutions

To show the effect of the new solution more concretely, we close with an example of iQiyi Account using the new WebSocket gateway: adding filter effects to a video cover image.

When creators publish videos on iQiyi Account, they can add filter effects to the cover image, which encourages higher-quality covers.

When the user selects a cover image, an asynchronous background processing task is submitted.

Once the asynchronous task completes, the images processed with the different filter effects are returned to the browser through WebSocket. The business scenario is shown in the figure below.

From the perspective of R&D efficiency, integrating WebSocket into a business system takes at least 1-2 days of development time.

With the push function of the new WebSocket gateway, data can be pushed with a simple interface call, which cuts development time to minutes and greatly improves R&D efficiency.
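From the business system's side, such an interface call might look like the sketch below. The article does not document the gateway's HTTP API, so the URL, the JSON field names and the payload are all hypothetical; the request is only built here, not sent.

```python
import json
import urllib.request


def build_push_request(gateway_url, uid, payload):
    """Build an HTTP POST that asks the gateway to push payload to a user."""
    body = json.dumps({"uid": uid, "data": payload}).encode("utf-8")
    return urllib.request.Request(
        gateway_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_push_request(
    "http://gateway.example.internal/push",  # hypothetical endpoint
    uid="u1",
    payload={"type": "cover_filter_done", "images": ["filter-a.jpg", "filter-b.jpg"]},
)
print(request.get_method(), request.full_url)

# Sending is a single call once a real endpoint exists:
# urllib.request.urlopen(request, timeout=3)
```

Because the channel is plain HTTP, the same call can be made from any language, which is the "easy to use" goal from section 2.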

From the perspective of operation and maintenance costs, the business system no longer contains communication details unrelated to business logic, the code is more maintainable, the system architecture becomes simpler, and the operation and maintenance costs are significantly reduced.

7. Summary

WebSocket is the mainstream technology for server-side push. Used appropriately, it improves system responsiveness and user experience.

With the WebSocket long-connection gateway, data push can be added to a system quickly, which reduces operation and maintenance cost and improves development efficiency.

The value of the long-connection gateway lies in:

  • 1) It encapsulates WebSocket communication details and decouples it from the business system, allowing the long-connection gateway and business system to independently optimize and iterate, avoiding repeated development and facilitating development and maintenance;
  • 2) The gateway provides a simple and easy-to-use HTTP push channel, supports access to multiple development languages, and facilitates system integration and use;
  • 3) The gateway adopts a distributed architecture, which can achieve horizontal expansion, load balancing and high availability of services;
  • 4) The gateway integrates monitoring and alarming, which can provide timely warnings when the system is abnormal, ensuring the health and stability of services.

Currently, the new WebSocket long connection real-time gateway has been applied in multiple business scenarios such as iQiyi account picture filter result notification and MCN electronic signature.

There are many aspects that need to be explored in the future, such as message retransmission and ACK, WebSocket binary data support, multi-tenant support, etc.


Origin blog.csdn.net/crazymakercircle/article/details/132737300