The practice of building a high-availability system using Huawei Cloud FunctionGraph

This article is shared from the Huawei Cloud Community article "Practice of Building Highly Available Systems with Huawei Cloud FunctionGraph" by Xiaozhi, Huawei Cloud PaaS Service.

Introduction

Every year there are reports of some system becoming unavailable and causing heavy economic losses to its customers. Cloud services have a large customer base, so when a problem occurs it hits both the customers and the service itself hard. Based on Huawei Cloud FunctionGraph's own practice, this article describes in detail how to build a highly available serverless computing platform that benefits both customers and the platform.

Introduction to high availability

High availability [1] (HA) is an IT term referring to a system's ability to perform its functions without interruption; it represents the system's degree of availability and is one of the criteria considered when designing a system.

The industry generally uses SLA indicators to measure system availability.

A service-level agreement [2] (SLA) is a formal commitment defined between a service provider and a customer, specifying the agreed service indicators such as quality, availability, and responsibilities. For example, if a provider promises an SLA of 99.999%, the maximum allowed service downtime per year is 5.26 minutes (365 × 24 × 60 × 0.001%).

FunctionGraph measures system availability with two golden indicators: SLI, the system's request success rate, and latency, how quickly the system processes requests.

High availability challenges

As a service within Huawei Cloud, FunctionGraph must consider not only the robustness of its own system but also that of the surrounding services it depends on (for example, the identity authentication service becoming unavailable, the gateway service that forwards traffic going down, or access to the object storage service failing). In addition, the hardware resources the system relies on may fail, or the system may suddenly be hit by a traffic surge. Maintaining high business availability in the face of these uncontrollable, abnormal scenarios is a major challenge. Figure 1 shows FunctionGraph's interactions with its surrounding services.

Figure 1 Peripheral interaction of FunctionGraph

The common problems fall into four major categories, as shown in Table 1.

Table 1 Summary of common problems in FunctionGraph

In response to these problems, we have summarized the following general governance methods:

  • Traffic surge governance: overload protection + elastic scaling + circuit breaking + asynchronous peak shaving + monitoring and alarms. Following a defensive design philosophy, overload protection and circuit breaking keep all system resources under control; on that basis, extreme elastic scaling absorbs large traffic; for suitable customer scenarios asynchronous peak shaving is recommended to reduce system pressure; and monitoring and alarms detect overload problems promptly.
  • System service exception governance: disaster recovery architecture + retries + isolation + monitoring and alarms. The disaster recovery architecture prevents the whole system from going down, retries reduce the impact of system exceptions on customer business, isolation quickly removes abnormal points to keep faults from spreading, and monitoring and alarms quickly surface service anomalies.
  • Dependent service exception governance: disaster recovery architecture + cache-based degradation + monitoring and alarms. The disaster recovery architecture reduces single points of failure in dependent services, cache-based degradation keeps the system running after a dependent service fails, and monitoring and alarms quickly detect dependent service exceptions.
  • Change-induced fault governance: grayscale upgrades + process control + monitoring and alarms. Grayscale upgrades prevent an abnormal system upgrade from causing a global failure for production customers, process control minimizes the risk of manual changes, and monitoring and alarms quickly detect failures after a change.

FunctionGraph system design practice

To solve the problems in Table 1, FunctionGraph has made optimizations in many areas, including disaster recovery architecture, flow control, retries, caching, grayscale upgrades, monitoring and alarms, and management processes, and its availability has improved greatly. The following sections focus on FunctionGraph's exception-oriented design practices; elastic capabilities, system features, and so on are not covered here.

Disaster recovery architecture

To implement Huawei Cloud's Disaster Recovery 1.1 architecture (for example, AZ-level fault domains for services, cross-AZ self-healing for clusters, and AZ-level isolation of service dependencies), FunctionGraph deploys multiple sets of management-plane and data-plane clusters, each isolated by AZ, achieving AZ-level disaster recovery within a region. As shown in Figure 2, FunctionGraph deploys multiple data-plane clusters (which run customer functions) and dispatcher scheduling clusters (which schedule FunctionGraph traffic across clusters) to increase both system capacity and disaster tolerance. When one of the YuanRong clusters becomes abnormal, the dispatcher component promptly removes the faulty cluster and distributes its traffic to the remaining clusters.

Figure 2 FunctionGraph simple architecture diagram
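
The sketch below illustrates, in simplified form, how a scheduling component can take a faulty cluster out of rotation and keep routing traffic to the remaining healthy ones. The types and the way unhealthy clusters are detected are hypothetical; this is not FunctionGraph's actual dispatcher code.

```go
package dispatch

import (
	"errors"
	"sync"
)

// Cluster is a hypothetical view of one data-plane cluster as seen by the
// dispatcher; the real FunctionGraph structures are not public.
type Cluster struct {
	Name    string
	Healthy bool
}

// Dispatcher keeps the cluster list and a round-robin cursor.
type Dispatcher struct {
	mu       sync.Mutex
	clusters []*Cluster
	next     int
}

// MarkUnhealthy removes a faulty cluster from rotation; traffic then flows
// only to the remaining clusters, mirroring the failover behaviour described
// above.
func (d *Dispatcher) MarkUnhealthy(name string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for _, c := range d.clusters {
		if c.Name == name {
			c.Healthy = false
		}
	}
}

// Pick returns the next healthy cluster in round-robin order.
func (d *Dispatcher) Pick() (*Cluster, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for i := 0; i < len(d.clusters); i++ {
		c := d.clusters[(d.next+i)%len(d.clusters)]
		if c.Healthy {
			d.next = (d.next + i + 1) % len(d.clusters)
			return c, nil
		}
	}
	return nil, errors.New("no healthy cluster available")
}
```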

Distributed, decentralized architecture supporting flexible horizontal scaling

This strategy is key to designing a logical multi-tenancy service. It must solve the problems of decentralization and of rebalancing after components scale in or out.

Decentralized management of static data: in the early stage, the metadata of a logical multi-tenant service is small enough to be stored in a single set of middleware. As customer volume grows, a data-splitting plan that supports sharding is needed to cope with the subsequent pressure of massive reads and writes and with reliability requirements.

Decentralized traffic scheduling: components are designed to avoid centralized dependencies (common ones include locks, flow-control values, and scheduling tasks). When traffic increases, the number of component replicas can be expanded, and the components use a self-balancing strategy to redistribute the load.

Multi-dimensional flow control strategy

Before customer function traffic on FunctionGraph finally reaches the runtime, it passes through multiple links, and each link may receive traffic that exceeds its carrying threshold. To keep every link stable, FunctionGraph defensively applies a different flow-control strategy at each link. The basic principle is resource isolation at function granularity across compute (CPU), storage (disk, disk I/O), and network (HTTP connections, bandwidth).

Function traffic is triggered from the client side; the flow control applied along the invocation path is shown in Figure 3.

Figure 3 FunctionGraph flow control

Gateway APIG flow control

APIG is FunctionGraph's traffic entrance. It supports total flow control at the region level and can be flexibly expanded according to how busy the region is. APIG also supports customer-level flow control: when abnormal traffic from a customer is detected, that customer's traffic can quickly be restricted on the APIG side, reducing the impact of an individual customer on system stability.

System business flow control

Flow control at the API level

After customer traffic passes through APIG, it reaches the FunctionGraph system side. To cover the case where APIG flow control fails, FunctionGraph builds its own flow-control strategies. Currently it supports node-level flow control, per-customer total API flow control, and function-level flow control. When customer traffic exceeds FunctionGraph's carrying capacity, the system rejects the request directly and returns 429 to the customer.

System resource flow control

FunctionGraph is a logical multi-tenancy service: control-plane and data-plane resources are shared among customers, so a malicious attack by an abusive customer can destabilize the system. FunctionGraph applies per-customer flow control based on the number of concurrent requests against shared resources, strictly limiting the resources each customer can use. In addition, shared resources are pooled so that their total amount stays controllable, which protects system availability. Examples include the HTTP connection pool, memory pool, and coroutine (goroutine) pool.

Concurrency control: flow control at function granularity is built on the number of concurrent requests. FunctionGraph customer functions run for anywhere from milliseconds to seconds, minutes, or hours, so a conventional requests-per-second (QPS) strategy has inherent shortcomings for very long-running requests and cannot limit the shared resources a customer occupies at any given moment. A concurrency-based strategy strictly limits the number of simultaneous requests; anything beyond the limit is rejected outright, protecting the system's shared resources.
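
The following is a minimal sketch of concurrency-based flow control at function granularity: a fixed number of slots per function, with excess requests rejected immediately with 429. The limiter type, limit value, and middleware wiring are illustrative assumptions, not FunctionGraph's internal implementation.

```go
package flowcontrol

import "net/http"

// funcLimiter caps in-flight requests for one function, matching the
// concurrency-based (rather than QPS-based) control described above.
type funcLimiter struct {
	slots chan struct{}
}

func newFuncLimiter(maxConcurrent int) *funcLimiter {
	return &funcLimiter{slots: make(chan struct{}, maxConcurrent)}
}

// acquire returns false immediately when the concurrency limit is reached,
// so long-running invocations cannot hold unbounded shared resources.
func (l *funcLimiter) acquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

func (l *funcLimiter) release() { <-l.slots }

// Middleware rejects excess requests with 429 instead of queueing them.
func Middleware(l *funcLimiter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.acquire() {
			http.Error(w, "concurrency limit exceeded", http.StatusTooManyRequests)
			return
		}
		defer l.release()
		next.ServeHTTP(w, r)
	})
}
```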

HTTP connection pool: when building high-concurrency services, keeping a reasonable number of long-lived HTTP connections minimizes the overhead of establishing connections while keeping the number of connection resources controllable, improving performance without sacrificing safety. HTTP/2 connection reuse and the connection-pool implementation inside fasthttp are useful industry references; the principle in both is to minimize the number of HTTP connections and reuse existing ones.
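
As an illustration of the idea, Go's standard net/http transport already exposes the relevant connection-pool knobs; the limits below are placeholders rather than FunctionGraph's production settings.

```go
package httppool

import (
	"net/http"
	"time"
)

// NewPooledClient returns an http.Client whose transport keeps a bounded,
// reusable set of connections, so connection setup cost is amortised and the
// total number of connections stays controllable.
func NewPooledClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        200,              // total idle connections kept for reuse
		MaxIdleConnsPerHost: 50,               // idle connections kept per backend host
		MaxConnsPerHost:     100,              // hard cap on connections per host
		IdleConnTimeout:     90 * time.Second, // recycle idle connections
	}
	return &http.Client{
		Transport: transport,
		Timeout:   10 * time.Second,
	}
}
```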

Memory pool: when customer request and response payloads are particularly large and concurrency is particularly high, the memory the system occupies per unit time is large; once a threshold is exceeded, the system can run out of memory and restart. For this scenario, FunctionGraph adds unified memory-pool control: at the request entry and the response exit, it checks whether the customer payload exceeds the threshold, keeping system memory protected and under control.
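
A minimal sketch of the entry-side check follows, assuming an illustrative per-request payload ceiling; it rejects oversized bodies before they are buffered in memory.

```go
package memguard

import "net/http"

// maxBodyBytes is an illustrative ceiling; the real threshold is a
// FunctionGraph-internal setting.
const maxBodyBytes = 6 << 20 // 6 MiB

// LimitBody rejects oversized request bodies before they are buffered in
// memory, protecting the shared memory pool described above.
func LimitBody(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.ContentLength > maxBodyBytes {
			http.Error(w, "request payload too large", http.StatusRequestEntityTooLarge)
			return
		}
		// MaxBytesReader also guards chunked requests that omit Content-Length.
		r.Body = http.MaxBytesReader(w, r.Body, maxBodyBytes)
		next.ServeHTTP(w, r)
	})
}
```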

Coroutine pool: FunctionGraph is built on a cloud-native platform and written in Go. If every request used its own goroutines to process logs and metrics, a large burst of concurrent requests would spawn a huge number of goroutines and noticeably degrade overall system performance. FunctionGraph therefore introduces a goroutine pool: log and metric processing is turned into individual jobs that are submitted to the pool and handled uniformly, which greatly alleviates goroutine explosion.
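
The sketch below shows the general shape of such a goroutine pool: a fixed set of workers draining a buffered job queue. Pool sizes and the Job type are illustrative, not the actual FunctionGraph component.

```go
package workerpool

import "sync"

// Job is one unit of log or metric processing work.
type Job func()

// Pool runs jobs on a fixed number of goroutines so that a burst of requests
// cannot spawn an unbounded number of goroutines.
type Pool struct {
	jobs chan Job
	wg   sync.WaitGroup
}

// New starts `workers` goroutines that drain a buffered job queue.
func New(workers, queueSize int) *Pool {
	p := &Pool{jobs: make(chan Job, queueSize)}
	p.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer p.wg.Done()
			for job := range p.jobs {
				job()
			}
		}()
	}
	return p
}

// Submit enqueues a job; it blocks when the queue is full, which applies
// back-pressure instead of letting goroutines explode.
func (p *Pool) Submit(job Job) { p.jobs <- job }

// Close stops accepting jobs and waits for in-flight work to finish.
func (p *Pool) Close() {
	close(p.jobs)
	p.wg.Wait()
}
```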

Asynchronous consumption rate control: when an asynchronous function is invoked, the request is first placed in FunctionGraph's Kafka. By setting a reasonable consumption rate for each customer's Kafka queue, the system ensures that function instances remain sufficient and prevents excessive function calls from quickly exhausting the underlying resources.
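
A minimal sketch of rate-limited consumption follows, assuming a hypothetical fetchNext/invokeFunction pair in place of the real Kafka consumer and invocation path; golang.org/x/time/rate is used purely for illustration.

```go
package asyncconsume

import (
	"context"

	"golang.org/x/time/rate"
)

// Message and fetchNext stand in for the real Kafka consumer; they are
// placeholders for illustration only.
type Message struct{ Body []byte }

func fetchNext(ctx context.Context) (Message, error)      { return Message{}, nil }
func invokeFunction(ctx context.Context, m Message) error { return nil }

// ConsumeAtRate drains asynchronous invocations no faster than `perSecond`,
// so bursts queued in Kafka cannot exhaust the underlying function instances.
func ConsumeAtRate(ctx context.Context, perSecond float64) error {
	limiter := rate.NewLimiter(rate.Limit(perSecond), 1)
	for {
		if err := limiter.Wait(ctx); err != nil {
			return err // context cancelled
		}
		msg, err := fetchNext(ctx)
		if err != nil {
			return err
		}
		if err := invokeFunction(ctx, msg); err != nil {
			// A real consumer would route the failure to the retry topic.
			continue
		}
	}
}
```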

Function instance control

  • Customer instance quota: limiting a customer's total quota prevents a malicious customer from consuming all the underlying resources and protects system stability. When a customer's business genuinely needs more, the quota can be raised quickly by filing a service ticket.
  • Function instance quota: limiting the per-function quota prevents a single function from consuming all of a customer's instances, and prevents a misused customer quota from causing massive resource consumption in a short time. In addition, if the customer's business uses middleware such as databases or Redis, the function instance quota keeps the number of middleware connections the customer opens within a controllable range. A quota-check sketch follows this list.
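
A minimal sketch of how such a quota check might look before scaling out one more instance; the quota structure and error values are assumptions for illustration, not FunctionGraph's actual scaling logic.

```go
package quota

import "errors"

// Quota values are illustrative; the real limits are per-tenant settings that
// can be raised via a service ticket.
type Quota struct {
	CustomerMax int // total instances a customer may hold
	FunctionMax int // instances a single function may hold
}

var (
	ErrCustomerQuota = errors.New("customer instance quota exceeded")
	ErrFunctionQuota = errors.New("function instance quota exceeded")
)

// Allow is checked before scaling out one more instance for a function.
func (q Quota) Allow(customerInstances, functionInstances int) error {
	if customerInstances+1 > q.CustomerMax {
		return ErrCustomerQuota
	}
	if functionInstances+1 > q.FunctionMax {
		return ErrFunctionQuota
	}
	return nil
}
```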

Efficient resource elasticity capabilities

Flow control is a defensive design that reduces the risk of system overload by blocking early. When a customer's legitimate business suddenly grows and needs a large amount of resources, the first problem to solve is resource elasticity: the customer's business must succeed first, with flow control held in reserve to cover system exceptions and contain the blast radius. FunctionGraph supports several elastic capabilities, including rapid scaling of cluster nodes, rapid scaling of customer function instances, and intelligent predictive scaling of function instances, so that FunctionGraph remains usable when customer business surges.

Retry strategy

By designing a retry strategy with appropriate limits, FunctionGraph ensures that a customer's request is ultimately executed successfully even when an exception occurs. As shown in Figure 4, a retry strategy must have termination conditions; otherwise it causes a retry storm and makes it even easier to push the system past its load limit.

Figure 4 Retry strategy

Retries of failed function requests

  • Synchronous requests: when a customer's request hits a system error during execution, FunctionGraph forwards the request to other clusters and retries up to 3 times, so that even an occasional cluster exception does not prevent the request from executing successfully in another cluster.
  • Asynchronous requests: since asynchronous functions have no strict real-time requirement, the system can apply a more refined retry strategy after a customer function fails. FunctionGraph currently supports binary exponential backoff: when a function terminates abnormally because of a system error, retries back off at intervals of 2, 4, 8, and 16 minutes; once the interval reaches 20 minutes, subsequent retries happen every 20 minutes. Retrying a request is supported for up to 6 hours; beyond that, the request is treated as failed and returned to the customer. Binary exponential backoff protects the stability of customer business to the greatest extent. A sketch of this policy follows the list.
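
The sketch below reproduces the described backoff policy (2, 4, 8, 16 minutes, capped at 20 minutes, within a 6-hour window); the invoke callback is a placeholder for the real re-delivery call, and the code is illustrative rather than FunctionGraph's actual scheduler.

```go
package asyncretry

import (
	"context"
	"time"
)

const (
	initialBackoff = 2 * time.Minute
	maxBackoff     = 20 * time.Minute
	retryWindow    = 6 * time.Hour
)

// RetryAsync re-drives a failed asynchronous invocation with binary
// exponential backoff until the call succeeds or the 6-hour window is
// exhausted. `invoke` stands in for the real re-delivery call.
func RetryAsync(ctx context.Context, invoke func(context.Context) error) error {
	deadline := time.Now().Add(retryWindow)
	backoff := initialBackoff
	for {
		err := invoke(ctx)
		if err == nil {
			return nil
		}
		if time.Now().Add(backoff).After(deadline) {
			return err // give up: treated as a failed request and returned to the customer
		}
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff < maxBackoff {
			backoff *= 2
			if backoff > maxBackoff {
				backoff = maxBackoff
			}
		}
	}
}
```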

Retries between dependent services

  • Middleware retry mechanism: taking Redis as an example, when an occasional Redis read or write fails, the system sleeps for a short period and then repeats the operation, retrying at most 3 times.
  • HTTP request retry mechanism: when an HTTP request fails with errors such as EOF or an I/O timeout caused by network jitter, the system sleeps for a short period and resends the request, retrying at most 3 times.

Caching

Caching not only speeds up data access; when a dependent service fails, cached data can still be used to keep the system available. By function, FunctionGraph caches two kinds of components: middleware and dependent cloud services. The system reads cached data first and periodically refreshes the local cache from the middleware and dependent cloud services, as shown in Figure 5.

  • Caching middleware data: through publish/subscribe, FunctionGraph watches for changes in middleware data and updates its local cache promptly. When the middleware is abnormal, the local cache keeps serving and the system stays stable.
  • Caching data of key dependent services: take Huawei Cloud's identity authentication service IAM as an example, on which FunctionGraph depends heavily. When a customer makes their first request, the system caches the token locally with a 24-hour expiration, so an IAM outage does not affect requests that have already been authenticated. Other key cloud-service dependencies follow the same practice of temporarily caching key data in local memory.

Figure 5 FunctionGraph’s caching measures
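
The following is a minimal sketch of a local cache with a 24-hour TTL that degrades to stale data when the dependent service cannot be reached; the fetch callback stands in for the real IAM call, and the structure is illustrative only.

```go
package tokencache

import (
	"sync"
	"time"
)

// entry holds one cached credential and when it was stored.
type entry struct {
	token    string
	cachedAt time.Time
}

// Cache keeps validated tokens locally so that a short outage of the identity
// service does not break requests that were already authenticated.
type Cache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
	fetch   func(key string) (string, error)
}

func New(fetch func(string) (string, error)) *Cache {
	return &Cache{ttl: 24 * time.Hour, entries: map[string]entry{}, fetch: fetch}
}

// Get serves from the local cache first and only refreshes from the dependent
// service when the entry is missing or expired. If the refresh fails, the
// stale entry is used as a degraded fallback so the system keeps running.
func (c *Cache) Get(key string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if ok && time.Since(e.cachedAt) < c.ttl {
		return e.token, nil
	}
	token, err := c.fetch(key)
	if err != nil {
		if ok {
			return e.token, nil // dependent service down: degrade to stale cache
		}
		return "", err
	}
	c.entries[key] = entry{token: token, cachedAt: time.Now()}
	return token, nil
}
```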

Circuit breaking

The measures above keep a customer's normal business running smoothly. However, when a customer's business stays abnormal and cannot recover, or a malicious customer keeps attacking the FunctionGraph platform, system resources are wasted on abnormal traffic, crowding out normal customers, and the system itself may hit unexpected errors after running under sustained high load from that traffic. For this scenario, FunctionGraph built its own circuit-breaking strategy based on a per-function call-volume model. As shown in Figure 6, multi-level circuit breaking is applied according to the failure rate of calls, keeping customer business smooth and the system stable.

Figure 6 Circuit breaker strategy model
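
A minimal sketch of a failure-rate circuit breaker of the kind described: outcomes are counted in a sliding window, and new calls are rejected for a cool-down period once the failure rate crosses a threshold. The window, threshold, and cool-down values are illustrative; FunctionGraph layers several such levels on its call-volume model.

```go
package breaker

import (
	"sync"
	"time"
)

// Breaker trips when the failure rate inside a time window crosses a
// threshold and rejects calls until a cool-down elapses.
type Breaker struct {
	mu          sync.Mutex
	failures    int
	total       int
	windowStart time.Time
	window      time.Duration
	minCalls    int
	failureRate float64
	openUntil   time.Time
	cooldown    time.Duration
}

func New() *Breaker {
	return &Breaker{window: time.Minute, minCalls: 20, failureRate: 0.5, cooldown: 30 * time.Second}
}

// Allow reports whether a new call may proceed.
func (b *Breaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil)
}

// Record feeds the outcome of one call and trips the breaker when the
// failure rate within the current window is too high.
func (b *Breaker) Record(success bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	if now.Sub(b.windowStart) > b.window {
		b.windowStart, b.total, b.failures = now, 0, 0
	}
	b.total++
	if !success {
		b.failures++
	}
	if b.total >= b.minCalls && float64(b.failures)/float64(b.total) >= b.failureRate {
		b.openUntil = now.Add(b.cooldown)
	}
}
```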

Isolation

  • Asynchronous function business isolation: based on the category of asynchronous requests, FunctionGraph divides Kafka consumer groups into timer-trigger consumer groups, exclusive consumer groups, general consumer groups, and asynchronous-message retry consumer groups, and divides topics along the same lines. By subdividing consumer groups and topics, timer-triggered business is isolated from high-traffic business and normal business is isolated from retried requests, so customer service requests are served with the highest priority.
  • Secure container isolation: traditional CCE containers are isolated through cgroups; as the number of customers and the call volume grow, occasional interference between customers appears. Secure containers provide virtual-machine-level isolation so that customers' services do not interfere with each other.

Grayscale upgrade

For a logical multi-tenancy service, a faulty upgrade has an uncontrollable impact. FunctionGraph supports ring upgrades (grouped by the business risk of each region), blue-green releases, and canary releases. An upgrade proceeds in three steps:

  1. Isolate traffic from the cluster to be upgraded: before upgrading, the traffic of the target cluster is isolated first, so that no new traffic enters it;
  2. Migrate traffic and drain the cluster gracefully: traffic is migrated to other clusters, and the upgrade is performed only after in-flight requests on the target cluster have exited gracefully;
  3. Move traffic onto the upgraded cluster customer by customer: after the upgrade completes, traffic of dial-test customers is forwarded to the upgraded cluster first; only after all dial-test cases pass is the traffic of production customers moved in.

Monitoring and alarms

When an error occurs in FunctionGraph that the system cannot avoid, our approach is to build monitoring and alarm capabilities that discover the abnormal point quickly, recover from the failure within minutes, and minimize system interruption time. As the last line of defense for high availability, the ability to detect problems quickly is crucial. FunctionGraph has built multiple alarm points around the critical business path, as shown in Table 2.

Table 2: Monitoring alarms built by FunctionGraph

Process specifications

The measures above address system availability at the technical design level. FunctionGraph has also formed a set of rules and processes, so that when technology cannot solve a problem in the short term, the risk can be removed quickly through human intervention. Specifically, the team operates under the following norms:

  • Internal war-room process: when an urgent problem occurs on the live network, the team quickly assembles its key roles to restore the live-network fault as soon as possible;
  • Internal change review process: after a system version has soaked in the test environment and been verified, a change guideline identifying the changed function points and risk points must be written before the official change goes to the live network, and only after review by the team's key roles may it go live. Standardized process management reduces anomalies caused by manual changes;
  • Regular live-network problem analysis and review: weekly routine live-network risk assessment and alarm analysis and review, using problems to identify deficiencies in system design, drawing inferences from individual cases, and optimizing the system.

Client disaster recovery

Even the most advanced cloud services in the industry cannot promise a 100% SLA. So when neither the system itself nor timely human intervention can restore the system quickly, a disaster recovery plan designed together with the customer becomes crucial. FunctionGraph generally works with customers to design a client-side disaster recovery plan: when the system keeps returning exceptions, the client should retry; when failures reach a certain level, the client should consider triggering its own circuit breaker to limit access to the downstream system and switch to an escape plan in time.

Summary

In its high-availability design, FunctionGraph generally follows the principle of "redundancy + failover": meet the basic needs of the business, keep the system stable, and then improve the architecture step by step.

" Redundancy + Failover " includes the following capabilities:

Disaster recovery architecture : multi-cluster mode, active-standby mode

Overload protection : flow control, asynchronous peak shaving, resource pooling

Fault management : retry, cache, isolation, downgrade, circuit breaker

Grayscale release : grayscale streaming and graceful exit

Client disaster recovery : retry, circuit breaker, escape

In the future, FunctionGraph will continue to improve availability along the dimensions of system design, monitoring, and process. As shown in Figure 7, monitoring capabilities let us discover problems quickly, reliability design lets us solve them quickly, and process norms reduce their number, continuously improving the system's availability and providing customers with a higher SLA.

Figure 7: FunctionGraph high-availability iteration practice

References

[1] High availability definition: https://zh.wikipedia.org/zh-hans/%E9%AB%98%E5%8F%AF%E7%94%A8%E6%80%A7

[2] SLA definition: https://zh.wikipedia.org/zh-hans/%E6%9C%8D%E5%8A%A1%E7%BA%A7%E5%88%AB%E5%8D%8F%E8%AE%AE

Author: An. Proofreaders: Jiulang, Wenruo
