How to design a million QPS system?

1. Introduction to Relationship Chain Business

From the business perspective of Bilibili's main site, the relationship chain is the set of follow relationships between user A and user B. It is subdivided by follow attributes: the main one is following (subscribing), and it also covers states such as blocking, quietly following, mutual following, and special attention. The main site's relationship chain is already large and keeps growing at a rapid rate. As a platform-level business, the relationship chain service provides basic queries such as one-to-many relationship point checks, full relationship lists, and relationship counts. The combined peak query load is nearly one million QPS, and core businesses such as dynamics (feeds) and comments depend on it.

As data volume and query traffic keep growing, the core goals of the relationship chain architecture's evolution are to keep data real-time and accurate while maintaining high service availability.

(Figure: Business scenario of relationship chain in space page)

(Figure: Relationship Chain State Machine)

2. Transaction bottleneck - the evolution of storage

Relational Database

A follow write event is a pure state transition, which makes a relational database a natural fit. In the early days of the main site's community, the relationship chain was small, and using MySQL directly had obvious advantages: development and maintenance were simple, the logic was clear, and one relationship table plus one count table was enough for online use. To accommodate the community's growth rate, the earlier design sharded the data into 500 relationship tables and 50 count tables to spread the load.

(Figure: Example of the structure of a relational table)

(Figure: Example of the structure of the count table)

Under this storage structure, taking a request where mid follows fid as an example, MySQL must perform the following operations in one transaction (a hedged sketch of this transaction follows the list):

  • Query the count row of mid and lock it; query the count row of fid and lock it;

  • Query the mid->fid relationship and lock it; query the fid->mid relationship and lock it;

  • Compute the new relationship in memory according to the state machine, then update the mid->fid relationship and, if needed, the fid->mid relationship (for example, when a one-way follow becomes a mutual one);

  • Update mid's following count and fid's follower count.
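For illustration, here is a minimal sketch of that transaction in Go with database/sql. The table and column names (user_relation, user_relation_stat, attribute, ...) are hypothetical, the 500/50-way sharding is collapsed into single tables, and the state-machine computation is stubbed out:

```go
package main

import (
	"context"
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

func followTx(ctx context.Context, db *sql.DB, mid, fid int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded

	// 1. Lock both users' count rows.
	var tmp int64
	for _, id := range []int64{mid, fid} {
		if err := tx.QueryRowContext(ctx,
			"SELECT following FROM user_relation_stat WHERE mid=? FOR UPDATE",
			id).Scan(&tmp); err != nil {
			return err
		}
	}

	// 2. Lock both directions of the relationship (ErrNoRows simply
	// means there is no existing relation in that direction).
	var midToFid, fidToMid int
	if err := tx.QueryRowContext(ctx,
		"SELECT attribute FROM user_relation WHERE mid=? AND fid=? FOR UPDATE",
		mid, fid).Scan(&midToFid); err != nil && err != sql.ErrNoRows {
		return err
	}
	if err := tx.QueryRowContext(ctx,
		"SELECT attribute FROM user_relation WHERE mid=? AND fid=? FOR UPDATE",
		fid, mid).Scan(&fidToMid); err != nil && err != sql.ErrNoRows {
		return err
	}

	// 3. Run the state machine in memory, then write the new relation;
	// a one-way follow may become mutual, which also rewrites fid->mid.
	newAttr := nextAttribute(midToFid, fidToMid) // stub
	if _, err := tx.ExecContext(ctx,
		"REPLACE INTO user_relation(mid,fid,attribute) VALUES(?,?,?)",
		mid, fid, newAttr); err != nil {
		return err
	}

	// 4. Update mid's following count and fid's follower count.
	if _, err := tx.ExecContext(ctx,
		"UPDATE user_relation_stat SET following=following+1 WHERE mid=?",
		mid); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		"UPDATE user_relation_stat SET followers=followers+1 WHERE mid=?",
		fid); err != nil {
		return err
	}
	return tx.Commit()
}

func nextAttribute(midToFid, fidToMid int) int { return 1 /* placeholder */ }
```

Every row touched stays locked until commit, which is exactly what makes the transaction "heavy" under high write traffic.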

This architecture survived until 2021. As the community kept growing, its shortcomings became more and more obvious. On the one hand, even with sharding, the overall size of the relationship chain data exceeded the recommended storage capacity (it is now at the TB level). On the other hand, the heavy transaction kept MySQL from sustaining high write traffic: under the original synchronous write architecture this showed up as a rising follow-failure rate, and simply switching to an asynchronous write architecture only turned it into a message backlog. Once the backlog outlasted the validity period of the temporary cache, users complained, so that change treats the symptom rather than the cause.

(Figure: "synchronous" write relationship flow chart using mysql as the core storage)

KV storage

*For an introduction to Bilibili's self-developed distributed KV storage, see: Bilibili Distributed KV Storage Practice

The upgrade plan we finally settled on was to migrate data storage from MySQL to KV storage and, logically, to change "write MySQL synchronously" into "write KV asynchronously". We did not choose "write KV synchronously" because, on the one hand, one relationship corresponds to multiple KV records and the KV store does not support transactions; on the other hand, an asynchronous architecture can absorb large bursts of follow requests. To stay compatible with businesses that subscribe to the MySQL binlog, the job also "writes MySQL asynchronously" after "writing KV asynchronously".

Under the new architecture, a user's follow request is considered successful once it is delivered to Databus. The MySQL binlog now serves only business parties that are insensitive to real-time updates (such as data platforms), so a slight backlog there does not matter; for business parties with strict real-time requirements, we publish a dedicated Databus message after the asynchronous KV write has been processed.

(Figure: asynchronous write-relationship flow with KV as core storage and MySQL as redundant storage)

The biggest advantage of the KV storage is that the underlying layer provides a count method, which replaces the redundant MySQL count table. As a result, we only need to maintain a single KV table that stores the relationships themselves. The storage structure we designed is:

  • The key is {attr|mid}fid, where attr is the relationship zipper type and mid and fid are both user ids. {attr|mid} is the hash prefix formed by splicing attr and mid; the fids under one hash are stored in lexicographic order, so combined with the zipper traversal method (scan) provided by the KV service, all fids under the hash can be retrieved;

  • The value is a structure containing attribute (the relationship attribute) and mtime (the modification time).

attr and attribute are easy to confuse; the difference is as follows:

  • The attr in the key is the relationship zipper type. There are 5 types in total (3 forward and 2 reverse): ATTR_WHISPER means quiet-follow (mid quietly follows fid), ATTR_FOLLOW means follow (mid follows fid), ATTR_BLACK means block (mid blocks fid), ATTR_WHISPERED means quietly-followed (mid is quietly followed by fid), and ATTR_FOLLOWED means followed (mid is followed by fid). From the user's point of view, the lists map onto these relationship chain types as follows:

  • Follow list: depending on product requirements, usually the follow chain (attr=ATTR_FOLLOW), with some scenarios also adding the quiet-follow chain (attr=ATTR_WHISPER);

  • Fan list: the union of the quietly-followed chain (attr=ATTR_WHISPERED) and the followed chain (attr=ATTR_FOLLOWED);

  • Blacklist: the block chain (attr=ATTR_BLACK).

  • The attribute in the value is the current relationship attribute, of which there are 4: WHISPER (quiet follow), FOLLOW (follow), FRIEND (mutual follow), and BLACK (block). It is easy to confuse with the attr above; the complete mapping between them is as follows (see the constants sketch after this list):

  • attr=ATTR_WHISPER or ATTR_WHISPERED can carry attribute=WHISPER;

  • attr=ATTR_FOLLOW or ATTR_FOLLOWED can carry attribute=FOLLOW or FRIEND;

  • attr=ATTR_BLACK can carry attribute=BLACK.
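A minimal Go sketch of this layout, encoding the attrs, attributes, key format, and the mapping above. The numeric values and the zero-padded key encoding are assumptions for illustration; the real byte-level encoding is not described in the text:

```go
package main

import "fmt"

// Attr is the relationship zipper type encoded in the key
// (5 kinds: 3 forward, 2 reverse).
type Attr uint8

const (
	AttrWhisper   Attr = iota + 1 // forward: mid quietly follows fid
	AttrFollow                    // forward: mid follows fid
	AttrBlack                     // forward: mid blocks fid
	AttrWhispered                 // reverse: mid is quietly followed by fid
	AttrFollowed                  // reverse: mid is followed by fid
)

// Attribute is the current relationship attribute stored in the value.
type Attribute uint8

const (
	Whisper Attribute = iota + 1 // quiet follow
	Follow                       // follow
	Friend                       // mutual follow
	Black                        // block
)

// Value is the structure stored under a key: attribute + mtime.
type Value struct {
	Attribute Attribute
	Mtime     int64
}

// Key builds "{attr|mid}fid": "{attr|mid}" is the hash prefix and fid the
// suffix, so all fids under one hash sort lexicographically and can be
// scanned in order (the zero-padding keeps lexicographic order equal to
// numeric order; this is an assumption about the real encoding).
func Key(attr Attr, mid, fid int64) string {
	return fmt.Sprintf("{%d|%d}%020d", attr, mid, fid)
}

// Valid encodes the attr<->attribute mapping listed above.
func Valid(attr Attr, a Attribute) bool {
	switch attr {
	case AttrWhisper, AttrWhispered:
		return a == Whisper
	case AttrFollow, AttrFollowed:
		return a == Follow || a == Friend
	case AttrBlack:
		return a == Black
	}
	return false
}
```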

The five relationship zippers of midA are shown in the figure:

(Figure: the five relationship zippers of midA)

To sum up, after the upgrade to KV storage, read operations are not complicated (a read-path sketch in Go follows this list):

  • To point-check the forward relationship between mid and fid, just probe (get, batch_get) the 3 forward attrs;

  • To query the full follow list or the blacklist, just scan the corresponding attr;

  • To query a user's counts, just run count on the corresponding attr.
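A sketch of those three read paths, reusing the Attr/Key definitions from the previous sketch. The KV interface here is a hypothetical stand-in for the real client, which exposes get/batch_get, scan, and count:

```go
package main

import (
	"context"
	"fmt"
)

// KV is a hypothetical stand-in for the distributed KV client.
type KV interface {
	Get(ctx context.Context, key string) (val []byte, ok bool, err error)
	Scan(ctx context.Context, hashPrefix, start string, limit int) (fids []int64, err error)
	Count(ctx context.Context, hashPrefix string) (int64, error)
}

var forwardAttrs = [3]Attr{AttrWhisper, AttrFollow, AttrBlack}

// Relation point-checks mid->fid by probing the 3 forward attrs with get.
func Relation(ctx context.Context, kv KV, mid, fid int64) (bool, error) {
	for _, attr := range forwardAttrs {
		if _, hit, err := kv.Get(ctx, Key(attr, mid, fid)); err != nil {
			return false, err
		} else if hit {
			return true, nil
		}
	}
	return false, nil // no forward relationship
}

// Followings pulls mid's full follow zipper: each scan call has an upper
// limit, so the zipper is traversed page by page until a short page
// comes back.
func Followings(ctx context.Context, kv KV, mid int64) ([]int64, error) {
	prefix := fmt.Sprintf("{%d|%d}", AttrFollow, mid)
	const page = 1000
	var all []int64
	start := ""
	for {
		fids, err := kv.Scan(ctx, prefix, start, page)
		if err != nil {
			return nil, err
		}
		all = append(all, fids...)
		if len(fids) < page {
			return all, nil
		}
		start = Key(AttrFollow, mid, fids[len(fids)-1]) // resume after last key
	}
}

// FollowingCount reads the zipper length via the KV service's count method.
func FollowingCount(ctx context.Context, kv KV, mid int64) (int64, error) {
	return kv.Count(ctx, fmt.Sprintf("{%d|%d}", AttrFollow, mid))
}
```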

The slightly more complicated logic lies in writing relationships:

  • MySQL had transactions to guarantee atomicity, but the KV storage does not support them. From the user's point of view, delivery to Databus already counts as a successful follow, so the asynchronous handler must guarantee the write eventually succeeds. We therefore added unlimited retry-on-failure logic to the asynchronous message processing.

  • In extreme cases there can also be write conflicts: for example, "user A follows user B" and "user B follows user A" may happen at the same instant, which can produce unexpected data errors, because a one-way follow and a mutual follow are two different attributes and either side's action changes that attribute. To avoid this, we exploit the ordering guarantee a message queue gives data under the same key: by assigning the same key to the same pair of users, we make sure operations on that pair are executed in order (a sketch of the pair key follows).
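A minimal sketch of such a pair key; the exact format is an assumption, but the idea is just to sort the two ids so both directions of the pair land on the same partition:

```go
package main

import "fmt"

// pairKey routes both "A follows B" and "B follows A" to the same message
// queue partition key by sorting the two ids, so operations on the same
// pair of users are consumed strictly in order and cannot race on the
// one-way vs. mutual (FRIEND) attribute.
func pairKey(mid, fid int64) string {
	lo, hi := mid, fid
	if lo > hi {
		lo, hi = hi, lo
	}
	return fmt.Sprintf("%d_%d", lo, hi)
}
```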

Again taking "mid follows fid" as the example, for each follow event (sketched in code after the list):

  • The job first puts the forward follow relationship and then checks the follow limit; if the limit is exceeded, it rolls back and exits;

  • It then batch-puts all the other reverse attrs affected by this follow action, such as fid's followed record (attr=ATTR_FOLLOWED), and updates any existing records whose attribute changes (for example, from a one-way follow to a mutual relationship);

  • If any put above fails, it is retried; only after all of them complete is the follow event considered successful;

  • The job then publishes to Databus to inform subscribers that a follow event has occurred;

  • Finally, it publishes the asynchronous "write MySQL" event, which syncs the follow event into MySQL and produces binlog for the subscribers that consume it.
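A sketch of that consumer job, reusing Key, the attr/attribute constants, and pairKey from the sketches above. KVWriter, Producer, and the helpers are hypothetical stand-ins; the limit check, value encoding, and the "write MySQL" event are stubbed out:

```go
package main

import (
	"context"
	"time"
)

// KVWriter and Producer stand in for the KV client and databus producer.
type KVWriter interface {
	Put(ctx context.Context, key string, val []byte) error
	Delete(ctx context.Context, key string) error
}

type Producer interface {
	Send(ctx context.Context, key string, payload []byte) error
}

// HandleFollowEvent processes one "mid follows fid" event.
func HandleFollowEvent(ctx context.Context, kv KVWriter, bus Producer, mid, fid int64) {
	fwd := Key(AttrFollow, mid, fid)

	// 1. Put the forward relation first, then check the follow limit;
	// on overflow, roll the put back and exit.
	mustDo(ctx, func() error { return kv.Put(ctx, fwd, encodeValue(Follow)) })
	if overLimit(ctx, mid) {
		mustDo(ctx, func() error { return kv.Delete(ctx, fwd) })
		return
	}

	// 2. Batch put the other affected records, e.g. fid's followed-by
	// zipper; if the follow turns mutual, both sides' attribute is
	// upgraded to FRIEND.
	mustDo(ctx, func() error {
		return kv.Put(ctx, Key(AttrFollowed, fid, mid), encodeValue(Follow))
	})

	// 3. Every put above retries without limit: once the event reached
	// databus, the write must eventually succeed. Only now is the follow
	// event considered successful.

	// 4. Publish to databus for real-time subscribers.
	mustDo(ctx, func() error { return bus.Send(ctx, pairKey(mid, fid), []byte("followed")) })

	// 5. Emit the async "write MySQL" event (elided): MySQL stays a
	// redundant copy whose binlog serves non-real-time consumers.
}

// mustDo retries forever with a small backoff.
func mustDo(ctx context.Context, f func() error) {
	for f() != nil {
		time.Sleep(100 * time.Millisecond)
	}
}

func encodeValue(a Attribute) []byte { return []byte{byte(a)} } // placeholder
func overLimit(ctx context.Context, mid int64) bool { return false } // placeholder
```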

3. Rapid growth - iteration of cache

Storage layer cache: memcached

A noticeable share of online queries ask for the full follow list or the full blacklist. As described in the previous section, to avoid redundantly storing relationship counts, the KV storage layout is rather special: a user's forward relations are spread across 3 different attrs (that is, 3 different relationship zippers). Pulling a user's full relationship list from KV therefore requires looping scans over the three forward zippers (each scan call has an upper limit). Because scan performs relatively poorly, a cache layer has to sit on top of the KV storage, keeping scan QPS strictly under control by cutting the back-to-source ratio.

Since memcached handles large keys well, our predecessors added a memcached layer above the KV storage to hold each user's full relationship list. The specific business process is as follows:

(Figure: Query business process of full relationship list)
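A sketch of that flow with the gomemcache client, reusing the KV interface and Followings helper from the read-path sketch above; the cache key format and TTL are assumptions:

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"

	"github.com/bradfitz/gomemcache/memcache"
)

// FullFollowings serves the full-list query: memcached first, and only on
// a miss the loop-scan of the forward zippers.
func FullFollowings(ctx context.Context, mc *memcache.Client, kv KV, mid int64) ([]int64, error) {
	cacheKey := fmt.Sprintf("rel_list_%d", mid)
	if item, err := mc.Get(cacheKey); err == nil {
		var fids []int64
		if json.Unmarshal(item.Value, &fids) == nil {
			return fids, nil
		}
	} else if !errors.Is(err, memcache.ErrCacheMiss) {
		return nil, err // a real memcached error, not a miss
	}

	// Back to source: the expensive scan path whose QPS must stay low
	// (at peak, memcached absorbs 97%-99% of these requests).
	fids, err := Followings(ctx, kv, mid)
	if err != nil {
		return nil, err
	}
	raw, _ := json.Marshal(fids)
	_ = mc.Set(&memcache.Item{Key: cacheKey, Value: raw, Expiration: 600})
	return fids, nil
}
```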

Judging by the back-to-source traffic at peak, memcached absorbs 97%-99% of the requests that would otherwise hit KV storage; fewer than 6K QPS miss the cache. The effect is quite pronounced.

(Figure: memcached QPS and cache back-to-source rate)

Query layer cache: redis hash

Besides follow-list requests, a large share of traffic is one-to-many point checks (querying the relationship between one user and one or more others). Pulling the full relationship list from memcached each time and intersecting it in memory would cost far too much network bandwidth, so this query pattern needs its own cache designed for point lookups.

An active user typically follows tens to hundreds of people. The point-check cache does not need strict ordering, but it must support lookups by a specified hashkey; redis hash and its hget, hmget, hset, and hmset commands fit this scene very well. The query-layer cache is therefore designed as follows: the key is mid, each hashkey is the id of a user who has a relationship with mid, and the value is their relationship data, mirroring midA's KV layout shown earlier:

(Figure: storage structure of concern relationship in redis hash cache)
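A sketch of the one-to-many point check against this hash with go-redis. The "rel_hash_" key prefix and string-encoded values are assumptions, and the check that distinguishes "hash not populated yet" from "genuinely no relation" is elided:

```go
package main

import (
	"context"
	"strconv"

	"github.com/redis/go-redis/v9"
)

// PointCheck answers "what is mid's relation to each of these fids" from
// the redis hash cache in one HMGET round trip.
func PointCheck(ctx context.Context, rdb *redis.Client, mid int64, fids []int64) (map[int64]string, error) {
	fields := make([]string, len(fids))
	for i, fid := range fids {
		fields[i] = strconv.FormatInt(fid, 10)
	}
	vals, err := rdb.HMGet(ctx, "rel_hash_"+strconv.FormatInt(mid, 10), fields...).Result()
	if err != nil {
		return nil, err
	}
	out := make(map[int64]string, len(fids))
	for i, v := range vals {
		if s, ok := v.(string); ok {
			out[fids[i]] = s // e.g. "FOLLOW" or "FRIEND"
		}
		// nil entry: fid is not among mid's forward relations
	}
	return out, nil
}
```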

Since the hash holds all of midA's forward follow relationships, a cache miss that needs the full follow set when going back to the source can be served together with the memcached layer described above. The business process is as follows:

(Figure: One-to-many relationship query business process under the redis hash architecture)

On top of this cache, the one-to-one and one-to-many relationship-check interfaces hold an average latency of about 1ms, and the hash hit rate reaches 70%-75%, so supporting nearly a million QPS is relatively easy; with horizontal scaling of the redis cluster it can absorb even more business traffic.

(Figure: QPS of redis hash cache and cache back-to-source rate)

Query layer cache: redis kv (a seemingly failed attempt)

By the second half of 2022, on the one hand, the product team requested a "followed by xx whom I follow" feature. This kind of second-degree relationship query is awkward under the hash architecture:

  • Since the hash only answers forward relationship lookups, the service must first fetch "my" follow list and then check, member by member, the relationship between everyone on it and ta;

  • Since many users on "my" follow list are inactive, they rarely hit the hash or memcached caches, which means every request fans out batched reads back to the source KV storage. Moreover, the recommendation side leaves the relationship chain service very little computation time: when a request is cancelled on timeout, every back-to-source KV scan belonging to it is cancelled too, the instance fires rpc circuit-breaker alarms, and the result is a lot of alarm noise (even a single timed-out request produces one rpc error per back-to-source scan it spawned).

(Figure: Number of RPC errors in the KV storage scan operation before and after switching architectures)

On the other hand, the product team proposed lifting the follow limit. We expected that once this shipped there would be more and more users with huge follow lists, some filling them up as soon as the feature launched, and the flaws and risks of the hash structure would surface day by day. The risk: when several high-follow users on the same redis instance miss the cache, trigger back-to-source reads, and backfill via hmset at the same time, the sustained write QPS can saturate the instance's CPU (for example, if 2 users per second need backfilling and each has a 5,000-entry relationship list, the actual write QPS is 10,000).

Against this background, after discussion within the team, we first introduced a redis kv (string) cache, hoping to replace the hash outright with something simpler: the key combines user A's and user B's ids, and the value is their relationship. An example:

(Figure: storage structure of concern relationship in redis kv cache)

Under this cache structure, only point reads of the source KV storage are needed, and the KV point-read operations (get, batch_get) perform far better than scan. Also, to reduce the dependency on memcached, when the redis kv cache misses we go straight back to the source KV storage with point reads (get, batch_get) and then backfill the cache. The flow chart is as follows:

(Figure: query business process of one-to-many relationship under redis kv architecture)

We grayscale-launched to 2% of users and found that the kv cache hit rate converged to only about 60%, while its memory usage and key count far exceeded expectations. That means 40% of requests would miss the cache and go back to the source, clearly unacceptable under a million QPS. Analyzing the missed requests showed the main business source was comments, and most answers were "no relationship": the comment scene queries the follow relationships of large numbers of strangers, producing huge numbers of cached empty entries, most of which are never read twice (for a given user, the number of empty entries roughly equals the number of commenters he reads). This explains the poor showing of the bare kv cache.

Query layer cache: bloom filter + kv

For a scene dominated by empty results, putting a Bloom filter in front is a well-recognized, reasonable solution. We decided to maintain one Bloom filter per user: first load every existing relationship chain into the filter, then consume new relationship-write events to keep it updated, making it a resident cache filter. A Bloom filter hit means one of three things:

  1. There is a relationship now;

  2. There used to be a relationship, but there no longer is;

  3. There has never been a relationship, but it hash-collides with one of the two cases above (a false positive).

Only queries that hit the Bloom filter proceed to the lower kv cache, which eliminates most of the empty-entry problem (a minimal filter sketch follows the flow chart). The specific flow chart is as follows:

(Figure: query business process of one-to-many relationship under bloom filter + redis kv architecture)
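A minimal Bloom filter sketch in Go (double hashing over FNV-1a). The sizes, hash counts, and id encoding are illustrative; the real service keeps one resident filter per user, seeded from the stock relationship chain and kept fresh by consuming relationship-write events:

```go
package main

import (
	"encoding/binary"
	"hash/fnv"
)

type bloom struct {
	bits []uint64
	k    int
}

func newBloom(mBits, k int) *bloom {
	return &bloom{bits: make([]uint64, (mBits+63)/64), k: k}
}

func (b *bloom) hashPair(data []byte) (uint64, uint64) {
	h := fnv.New64a()
	h.Write(data)
	h1 := h.Sum64()
	h.Write([]byte{0xff}) // extend the stream to derive a second hash
	return h1, h.Sum64()
}

func (b *bloom) Add(data []byte) {
	h1, h2 := b.hashPair(data)
	m := uint64(len(b.bits)) * 64
	for i := 0; i < b.k; i++ {
		pos := (h1 + uint64(i)*h2) % m
		b.bits[pos/64] |= 1 << (pos % 64)
	}
}

func (b *bloom) Test(data []byte) bool {
	h1, h2 := b.hashPair(data)
	m := uint64(len(b.bits)) * 64
	for i := 0; i < b.k; i++ {
		pos := (h1 + uint64(i)*h2) % m
		if b.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

// MayHaveRelation implements the read path in the flow chart: a Bloom
// miss is a definite "no relationship" (no false negatives), which kills
// the stranger-heavy comment traffic early; only a hit falls through to
// the redis kv cache and, if that misses, to get/batch_get point reads
// on the source KV storage (both elided here).
func MayHaveRelation(f *bloom, fid int64) bool {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], uint64(fid))
	return f.Test(buf[:])
}
```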

At present, 100% of relationship chain traffic has been switched to the new Bloom architecture; the Bloom hit rate has reached 80%+, and the old hash architecture is being taken offline. This rework solves the problems that lifting the follow limit could have caused, the second-degree-relationship alarm noise, and the difficulty of supporting "many-to-one" reverse queries, and it is expected to save some cache resources as well.

4. The risk is coming - hot spot disaster recovery

The main relationship chain scenario is querying the follow relations between "user A" and other users. When the concurrent requests' "user A" values are well scattered, the load on Redis spreads across the dozens of instances in the cluster, and the system's maximum capacity equals the sum of the instances' capacities. In the extreme case where "user A" concentrates on a few users, the load concentrates on a few Redis instances, and the weakest-plank effect of the bucket becomes very obvious.

Recall a hot event last year: traffic converged on the dynamic detail pages and video playback pages of a few popular UP masters, and those pages rely on real-time queries of the relationship between the UP master and each commenter. When large numbers of users loaded the comments at the same time, the UP master became a query hotspot.

At the time, the relationship chain service's handling of hotspots lagged behind: when a hotspot UP master was discovered (or known in advance), it was manually added to a hotspot list. For hotspot users, the local cache (Localcache) is consulted before Redis, and the UP master's relationship list stored there refreshes every ten seconds. Data may be inconsistent within those ten seconds, but in practice it is the big UP masters who attract hotspot traffic, and their relationship lists rarely change, so user experience is essentially unaffected.

The night the hotspot hit, as users poured in, the CPU usage of several Redis instances climbed past 70%, and some instances exceeded 90%.

(Figure: Redis single instance CPU usage alarm at the time of the hotspot event)

Because the service lacked hotspot detection, after seeing the alarms the operations staff had to capture the current hot keys manually (with a Redis instance's CPU nearly saturated, connecting to it to enumerate keys is itself a high-risk operation) and then push the configuration by hand, after which the pressure on Redis plummeted. To head off further risk, other official media accounts were preemptively added to the hotspot user list, and the relationship chain service rode out the traffic peak without incident.

(Figure: Single-instance Redis QPS before and after configuring local cache)

Afterwards, the business architecture team provided a hotspot detection tool that automatically counts hotspots and temporarily serves them from the local cache once a configured threshold is crossed. At the beginning of this year, the hotspot detection tool was integrated with the local cache SDK (*for another local caching example, see: Dynamic outbox local caching optimization at station B), making automatic hotspot detection and automatic degradation much more convenient: the business side only has to change the local cache type to get low-code hotspot protection. Verified through League of Legends S12 and the New Year's Eve gala, the relationship chain service's metrics stayed stable throughout those events. A sketch of the idea follows.
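A minimal sketch of such an automatically degrading hotspot cache: per-key hits are counted in a sliding window, and once a key crosses the threshold it is served from a local copy refreshed every ten seconds instead of hammering one redis instance. The ten-second refresh follows the text; the structure and names are assumptions about the real SDK:

```go
package main

import (
	"sync"
	"time"
)

type hotCache struct {
	mu        sync.Mutex
	counts    map[string]int
	local     map[string]entry
	threshold int
}

type entry struct {
	val      []byte
	loadedAt time.Time
}

func newHotCache(threshold int) *hotCache {
	return &hotCache{
		counts:    make(map[string]int),
		local:     make(map[string]entry),
		threshold: threshold,
	}
}

// Get serves a read: hot keys are answered from the local cache while it
// is fresh; everything else (and each ten-second refresh) goes to redis.
func (h *hotCache) Get(key string, fromRedis func(string) []byte) []byte {
	h.mu.Lock()
	h.counts[key]++
	hot := h.counts[key] >= h.threshold
	e, cached := h.local[key]
	h.mu.Unlock()

	if hot && cached && time.Since(e.loadedAt) < 10*time.Second {
		return e.val // absorb the hotspot locally
	}
	val := fromRedis(key)
	if hot {
		h.mu.Lock()
		h.local[key] = entry{val: val, loadedAt: time.Now()}
		h.mu.Unlock()
	}
	return val
}

// resetLoop clears the counters periodically so "hot" reflects recent QPS.
func (h *hotCache) resetLoop(window time.Duration) {
	for range time.Tick(window) {
		h.mu.Lock()
		h.counts = make(map[string]int)
		h.mu.Unlock()
	}
}
```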

(Figure: monitoring of the number of hot keys automatically detected and cached locally around noon one day)

5. Long-term planning - the extension of the relationship

How to use relationship chain capabilities to empower upper-layer businesses, and how to make the relationship chain's basic service more reliable, are questions we must keep thinking about. In the medium term there are many workable directions; here are just a few:

  • Empowering business: in a multi-tenant manner, the existing relationship chain codebase can offer basic relationship capabilities (follow/subscribe, unfollow/unsubscribe, follow list, fan list) for new business systems to plug into quickly, avoiding duplicated development.

  • Empowering the community: to make the relationship chain platform service more general, we can try generalizing the object of a relationship, for example integrating the user's pan-subscription relationships (UP masters, favorites folders, comics, series, classroom courses, etc.) in the dynamic feed scene.

  • Improving stability: many business parties consume the relationship chain service. Through zero trust and 100% quota configuration, businesses are isolated from one another, in particular preventing a surge in ordinary business traffic from impacting core business.
