[Ali] Stability Day 5: Automated Stress Testing, Capacity Planning, and Flow Control

 

Double 11 was born in 2009, and in its history so far 2013 is clearly a watershed.

Why? Because 2013 is the year full-link stress testing was introduced.

Every year at 00:00:00 on November 11, Alibaba Group reaches its most intense moment. The enthusiasm that has been building up is released at that instant. To the public it shows up as the numbers of last year's Double 11: a 24-hour transaction volume of 101.2 billion and a peak of 172,000 transactions created per second. In the binary world, it is a tsunami of traffic pouring in within a very short period of time.

The destructive force of peak traffic left the technical team with painful, unforgettable midnight memories in 2011 and 2012. The debut of full-link stress testing changed Alibaba's attitude and approach to peak traffic, and has since helped set one record after another in China's Internet industry.

Why do we need capacity planning?

Alibaba runs a very wide range of businesses. Each business is served by a set of different business systems, and each system is deployed in a distributed fashion across many machines. As the business grows, and especially for large promotions such as Double 11, deciding how many machines to prepare for each business system is a major challenge for Alibaba's technical team. Capacity planning was born to solve this problem. Its goal is to let every business system answer clearly: when should machines be added, and when should they be removed? How many machines should be prepared for a large promotion such as Double 11, so that the system stays stable and cost is kept down?

The four steps of capacity planning

When preparing for a large promotion such as Double 11, capacity planning generally goes through four stages:

  1. Traffic estimation: analyze historical data to estimate how much traffic a business will receive at a future point in time;

  2. System capacity assessment: make a preliminary calculation of the number of machines each system needs;

  3. Capacity fine-tuning: use full-link stress testing to simulate user behavior at peak time, verifying the capacity of the whole site while fine-tuning its capacity level;

  4. Flow control: configure protective measures such as rate-limiting thresholds, so that the system can still serve normally when actual traffic exceeds the estimate.

In the first stage, suitable prediction algorithms and rich historical data usually make it possible to estimate business traffic fairly accurately. Even if the first-stage estimate deviates from reality, the flow control of the fourth stage ensures that the site stays in good service condition. Once business traffic has been estimated, capacity planning enters the second stage, a preliminary assessment of system capacity. The core question at this stage is how to assess capacity accurately, so that the estimated business volume is supported at minimum cost.

To calculate how many machines a system needs, the future call volume is not enough. There is one more important variable: the service capacity of a single machine. Alibaba obtains it through single-machine stress testing. To get an accurate value, the stress tests are carried out directly in the production environment, for two reasons: single-machine stress testing needs both a realistic environment and realistic traffic. Otherwise the measured single-machine capacity carries large errors that undermine the accuracy of the whole capacity plan.

There are four ways to stress-test a single machine in the production environment:

 Simulated requests: a load generator issues simulated requests against a machine in the production environment

Simulating requests is relatively simple, and many open-source or commercial tools can do it, such as apache ab, webbench, httpload, jmeter, and loadrunner. In practice, newly launched systems and systems with little traffic are stress-tested this way. The drawback is that simulated requests differ from real business requests, and the difference affects the test results. Another drawback is that write requests are troublesome to simulate, because they may pollute business data. The pollution must either be accepted or handled specially (for example, by isolating the data generated by the stress test).

 Request replication: requests received by one machine are copied, possibly several times over, and sent to the designated machine under test

To bring test requests closer to real business requests, we take them from real traffic: we record and replay it, stress-testing through request replication. Replication is more accurate than simulation because the requests are closer to the real business. On the downside, replication faces the same dirty-data problem for write requests. In addition, the responses to the copied requests must be intercepted and discarded, so the machine under test has to be set aside and cannot serve normal traffic. Request replication is mainly used for systems with a relatively small call volume.

 Request forwarding: in a distributed environment, requests from multiple machines are forwarded to a single machine

For systems with a large call volume we have a better approach, a traffic-diversion technique we call request forwarding. Alibaba's systems are almost all distributed, so forwarding the requests of several machines to one machine makes that machine carry more traffic, which achieves the stress test. Request forwarding yields very accurate results, produces no dirty data, and is convenient to operate. It is the most widely used single-machine stress-testing method at Alibaba. It does have a precondition: the call volume of the system must be large enough. If the volume is very small, even directing all traffic to one machine will not push it to its bottleneck.

 Load-balancing adjustment: change the weights on the load balancer so that the machine under test is allocated more requests

Like request forwarding, the last method makes one machine in a distributed environment receive more requests. The difference is that it does so by adjusting that machine's weight on the load balancer. The results are very accurate and there is no dirty data. The precondition is again that the call volume of the distributed system is large enough.
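As a worked illustration of the weight adjustment, here is a minimal sketch. It assumes a balancer that splits traffic strictly in proportion to weight and that every other machine keeps weight 1. The function names and the numbers are hypothetical, not taken from Alibaba's tooling.

```python
from fractions import Fraction


def weight_for_target(total_qps: int, machines: int, target_qps: int) -> Fraction:
    """Weight for the machine under test so that it receives target_qps.

    Assumes proportional splitting and weight 1 on the other machines:
    total_qps * w / (w + machines - 1) = target_qps.
    """
    if machines < 2 or not 0 < target_qps < total_qps:
        raise ValueError("need at least 2 machines and 0 < target_qps < total_qps")
    return Fraction(target_qps * (machines - 1), total_qps - target_qps)


def expected_qps(total_qps: int, machines: int, weight: Fraction) -> Fraction:
    """Traffic the weighted machine receives under proportional splitting."""
    return total_qps * weight / (weight + machines - 1)
```

For example, with 10 machines sharing 10,000 requests per second, `weight_for_target(10000, 10, 4000)` gives a weight of 6. This also shows the precondition in the text: the target can never exceed the total call volume of the system.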

Alibaba has a dedicated platform for single-machine stress testing. Built on the four methods described above, it is an automated stress-testing system. You can configure scheduled tasks that stress-test a system periodically, or trigger a test manually at any time. During a test, the platform monitors the load of the machine under test in real time. As soon as the load reaches a preset threshold, the test is stopped immediately and a test report is produced.
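The stop-on-threshold behavior can be sketched as a simple ramp-up loop. This is only an illustration under assumptions: `apply_load` is a hypothetical hook that drives traffic at a given rate and returns the measured system load, and the report format is invented for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_stress_test(
    qps_steps: Iterable[int],
    apply_load: Callable[[int], float],
    load_threshold: float,
) -> Dict[str, object]:
    """Ramp traffic up step by step and stop once the load hits the threshold."""
    history: List[Tuple[int, float]] = []
    stopped_early = False
    for qps in qps_steps:
        load = apply_load(qps)  # hypothetical: drive traffic, return measured load
        history.append((qps, load))
        if load >= load_threshold:
            stopped_early = True  # protect production: stop immediately
            break
    safe = [qps for qps, load in history if load < load_threshold]
    return {
        "max_safe_qps": max(safe) if safe else 0,
        "stopped_early": stopped_early,
        "history": history,
    }
```

The highest rate that stayed below the threshold is a candidate value for the single-machine capacity used in later planning.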

Because the tests run in the production environment, we must be very careful to make sure they do not affect normal business. The platform runs more than 5,000 single-machine stress tests every month. Every system release or major change is checked with a single-machine stress test to see whether performance has changed. The single-machine capacity obtained this way is a very important reference value for capacity planning.

Once business traffic has been estimated and the single-machine capacity is known, a rough calculation of the number of machines needed is straightforward.

Minimum number of machines = estimated business traffic / single-machine capacity.

Under normal circumstances, we reserve a small buffer to guard against estimation errors and unexpected situations.
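The rough calculation, including the reserved buffer, can be sketched as follows. The 10% default buffer and the example numbers are assumptions for illustration only. Integer arithmetic is used so that rounding up never depends on floating-point error.

```python
def machines_needed(estimated_qps: int, single_machine_qps: int,
                    buffer_percent: int = 10) -> int:
    """Minimum machine count, rounded up, plus a reserved buffer."""
    if single_machine_qps <= 0:
        raise ValueError("single_machine_qps must be positive")
    # Ceiling division: estimated traffic / single-machine capacity.
    base = -(-estimated_qps // single_machine_qps)
    # Add the buffer and round up again.
    return -(-base * (100 + buffer_percent) // 100)
```

With a hypothetical estimate of 100,000 requests per second and a single-machine capacity of 500, the bare formula gives 200 machines and the buffered result is 220.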

Why do we need full-link stress testing?

With the steps above we have completed a rough assessment of system capacity. But is that enough? The past has taught us a harsh lesson.

We had done the rough capacity calculation for every system and expected everything to go fairly smoothly. Reality turned out differently: when Double 11 midnight arrived, many systems behaved worse than we had expected. The reason is that in the real business scenario every system is under heavy pressure at the same time, and the systems depend on one another. Single-machine stress testing does not account for dependencies that are themselves under heavy pressure, and this introduces errors of unknown size. It is like building an instrument: every part has been rigorously tested, yet how the assembled instrument will behave is still unknown.

We have paid for this lesson in blood. At midnight of Double 11 in 2012, the network card of one of our database systems was saturated, and some users could not shop normally. We had prepared very thoroughly, yet there were still things we had not taken into account.

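What would it take to solve this problem? For a long time during the preparation for Double 11 in 2013, this was a hard problem for us. In China, students usually have final exams, and to help them do well, teachers often have them work through several sets of mock exams beforehand. For our systems, Double 11 is the annual final exam, so we came up with this idea: "If we could make Double 11 happen in advance and let the systems go through a mock Double 11 beforehand, the problem would be solved." By running a high-fidelity simulation of user behavior at Double 11 midnight, we verify the capacity, performance, and bottlenecks of the whole site, check whether the earlier capacity assessment was reasonable, and fine-tune where it was not.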

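For this we developed a new stress-testing platform, "full-link stress testing". Simulating Double 11 is no simple matter. Hundreds of millions of users pick and buy millions of different kinds of goods on Alibaba's platform, so the scenario is highly complex. Three main difficulties had to be solved: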

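  1. The user request volume is enormous: at Double 11 midnight, user requests exceed 10 million per second;

  2. The simulated scenario has to be as close as possible to Double 11 midnight. If it differs too much it has no practical reference value, and the business scenario at Double 11 midnight is very complex;

  3. We need to simulate Double 11 in the production environment, so how do we keep the simulated user requests from affecting normal business and data?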

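To generate more than 10 million user requests per second, full-link stress testing built a traffic platform capable of issuing user requests at very large scale. The platform consists of one control node and over a thousand worker nodes, each running a stress-test engine we developed ourselves. Besides supporting the request protocols of Alibaba's business, the engine must perform extremely well. Otherwise we could not provide enough worker nodes for 10 million user requests. With over a thousand engines cooperating closely, we can control the whole stress-test cluster as if it were a single machine and issue 1 million or 10 million user requests per second at will.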
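The arithmetic behind the control node can be sketched as below: how many workers a target rate needs for a given engine throughput, and how the target is divided evenly. The per-engine rate of 10,000 requests per second is an assumption chosen to match "over a thousand workers", not a figure from the article, and the function names are hypothetical.

```python
from typing import List


def workers_needed(target_qps: int, engine_qps: int) -> int:
    """Number of worker nodes required, rounded up."""
    if engine_qps <= 0:
        raise ValueError("engine_qps must be positive")
    return -(-target_qps // engine_qps)


def split_qps(target_qps: int, workers: int) -> List[int]:
    """Divide the target rate across workers; the shares sum exactly to the target."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    base, extra = divmod(target_qps, workers)
    return [base + 1 if i < extra else base for i in range(workers)]
```

A faster engine directly shrinks the cluster: halving `engine_qps` doubles the number of workers the platform has to provide.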

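Being able to send more than 10 million user requests per second is not enough. The requests must also resemble Double 11 user behavior as closely as possible, and Double 11 is a very complex business scenario. We did a great deal of work to make the simulation more realistic. First, we extract from the production environment a set of base data of the same order of magnitude as Double 11 (buyers, sellers, shops, items, discounts, and so on), filter it, mask the sensitive fields, and use it as the base data for full-link stress testing. Then, based on this base data and the historical data of previous years, we apply prediction algorithms to obtain this year's Double 11 business model.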

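The Double 11 business model contains more than 100 business factors, such as the number and types of buyers, the number and types of sellers, the number and types of items, the split between PC and mobile, the number of items in shopping carts, and the traffic level of each business type. With the business model in place, we construct the corresponding stress-test requests from it and finally upload them to the stress-test engines.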
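Turning a business model into requests can be pictured as weighted sampling. The sketch below is a deliberately tiny stand-in: a real model has over 100 factors, whereas this one has a single hypothetical factor, the traffic share of each scenario. The scenario names and weights are invented for illustration.

```python
import random
from typing import Dict, List


def build_requests(scenario_weights: Dict[str, float], total: int,
                   seed: int = 0) -> List[str]:
    """Draw `total` request scenarios in proportion to their modeled traffic share."""
    rng = random.Random(seed)  # seeded so a test run can be reproduced
    scenarios = list(scenario_weights)
    weights = [scenario_weights[name] for name in scenarios]
    return rng.choices(scenarios, weights=weights, k=total)


# Hypothetical traffic shares, not real Double 11 figures.
model = {"browse_item": 0.6, "add_to_cart": 0.3, "create_order": 0.1}
requests = build_requests(model, total=1000)
```

Seeding the generator keeps two rounds of testing comparable, which matters when the results are used to fine-tune capacity.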

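Full-link stress testing simulates Double 11 directly in the production environment. As mentioned for single-machine stress testing, simulated requests require a way of handling dirty data. All data of full-link stress testing is isolated within the production environment, covering storage, caches, messages, logs, and other stateful data. Each stress-test request carries a special mark that is passed along with the request's dependency calls. Wherever data is written out, the mark is checked and the data is written to an isolated area, which we call the shadow area. Full-link stress testing fine-tunes the rough capacity assessment and makes the many uncertainties of Double 11 midnight far more certain.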

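During full-link stress testing ahead of Double 11 in 2013 we found more than 700 system problems, and in 2014, 2015, and 2016 we likewise found several hundred. Had these problems not been found during full-link stress testing, they would very likely have surfaced in the real business scenario at Double 11 midnight and caused a serious availability impact.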

How do we control traffic once it exceeds the limit?

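Everything discussed so far is "capacity planning". Capacity planning rests on a precise business model, which is derived from the promotion data of previous years and from complex prediction models. But however robust the model is, it remains a prediction, which means there will be a gap between predicted and actual traffic.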

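This is not merely a worry. It has happened many times. A recent example is Double 11 in 2016: for one important scenario we had prepared for a peak of 162,000 per second, but the actual peak that day reached 200,000 per second, nearly 13% above what we had prepared for. You might think this only affects the peak and that the extra 20,000 requests would be absorbed right away, but that is not what happens.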

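When a machine is overloaded, it takes longer to process each request. Users get a poor experience and try to resubmit, which puts even more request pressure on the system. As requests pile up, performance gradually degrades until the system can no longer respond to new requests.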

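When a machine goes down, the load balancer redirects its requests to other machines. This adds work to machines that are already saturated, so they soon stop responding as well, just like the first one. In a very short time more and more machines stop responding, until the entire cluster is unresponsive. This is what we usually call the "avalanche effect". Once an avalanche starts, it is very hard to stop. We must have an effective mechanism to monitor and control incoming traffic and prevent the disaster.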

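Flow control is not only for traffic peaks. It is useful in many other scenarios. For example, a downstream system on a business link may run into trouble and respond very slowly. The problem is amplified along the link and can even make the whole link unavailable. Flow control therefore also needs to be able to protect system health based on response time: when an application's response time exceeds a threshold, we can consider it out of control and should degrade it quickly.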
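Degrading on response time can be sketched with a small sliding window. This is a minimal illustration, not the framework described in the article: the class name, the window size, and the use of an average are all assumptions.

```python
from collections import deque
from typing import Callable, Deque


class ResponseTimeBreaker:
    """Marks a downstream call as degraded when its recent average RT is too high."""

    def __init__(self, rt_threshold_ms: float, window: int = 5) -> None:
        self.rt_threshold_ms = rt_threshold_ms
        self.samples: Deque[float] = deque(maxlen=window)

    def record(self, rt_ms: float) -> None:
        self.samples.append(rt_ms)  # oldest sample drops out automatically

    def degraded(self) -> bool:
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data to judge yet
        return sum(self.samples) / len(self.samples) > self.rt_threshold_ms


def call_with_fallback(breaker: ResponseTimeBreaker,
                       primary: Callable[[], str],
                       fallback: Callable[[], str]) -> str:
    """Skip the slow downstream and serve the fallback while degraded."""
    return fallback() if breaker.degraded() else primary()
```

A production breaker would also need a way to probe the downstream again after a cool-down, since no new samples arrive while every call is diverted to the fallback.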

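Besides what triggers it, the way flow control acts can also be defined flexibly. Different business scenarios can use different actions. For some applications we can simply discard the request. For others we need to degrade the downstream application, or even add it to a blacklist. For still others, the excess requests should be queued and consumed gradually once the peak has passed and the system is less busy.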

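So our final flow-control framework works along three dimensions: runtime status, call relationships, and flow-control action. Applications can combine them freely according to their own needs.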

Our flow-control architecture works in three steps:

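  • Step 1: we instrument all methods at the program entry points;

  • Step 2: we collect and record the runtime status and call relationships of the instrumented methods;

  • Step 3: we receive rules from a preconfigured rule center and apply control based on the system status collected in step 2.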
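The three steps can be sketched with a decorator. This is a simplified, single-process illustration under assumptions: `RULES` stands in for the rule center, `STATS` only counts calls per window, call relationships are left out, and the only action is rejecting the request. All names are hypothetical.

```python
import functools
from collections import defaultdict
from typing import Callable, Dict

RULES: Dict[str, int] = {}              # stand-in rule center: resource -> max calls per window
STATS: Dict[str, int] = defaultdict(int)  # step 2: calls seen in the current window


class BlockedError(Exception):
    """Raised when a call is rejected by flow control."""


def guarded(resource: str) -> Callable:
    """Step 1: instrument an entry-point method under a resource name."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            limit = RULES.get(resource)
            # Step 3: compare the recorded status against the rule.
            if limit is not None and STATS[resource] >= limit:
                raise BlockedError(resource)
            STATS[resource] += 1
            return func(*args, **kwargs)
        return wrapper
    return decorator


def reset_window() -> None:
    """Start a new statistics window, for example once per second."""
    STATS.clear()
```

Because rules are looked up at call time, they can be changed while the application is running, which is what makes a central rule center practical.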

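However, while flow control is in effect, the system is safe but it is running in a "degraded" state. So once the problem has been resolved, flow control must also be lifted. Take the scenario above as an example. A downstream application on a link runs into trouble, its response time grows, and the system load of the upstream application becomes too high. A while later the downstream application recovers and its response time drops sharply. Yet the load of the upstream application cannot recover immediately, because incoming requests have been piling up for some time.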

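This means that if we take the traditional approach and use system load to decide when to lift flow control, the load is still fairly high even after the problem has been fixed, and the system recovers slowly. We want both fast recovery and a stable system. The approach we finally adopted is to bring rt, load, and the permitted qps into dynamic balance.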
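The article does not spell out the algorithm. One possible reading, sketched here as an assumption rather than as the team's implementation, uses Little's law: the number of requests a system can hold in flight is roughly throughput times latency. When load is high, a request is admitted only if the current in-flight count is below that estimate. All parameter names are hypothetical.

```python
def should_admit(load: float, load_threshold: float, in_flight: int,
                 max_passed_qps: float, min_rt_seconds: float) -> bool:
    """Admission check that balances load, response time, and permitted qps.

    max_passed_qps and min_rt_seconds are assumed to be the best values
    observed over a recent window.
    """
    if load <= load_threshold:
        return True  # system is healthy, no flow control needed
    # Little's law: sustainable concurrency = throughput x latency.
    capacity = max_passed_qps * min_rt_seconds
    return in_flight < capacity
```

Under this reading, recovery is fast because the in-flight count drains as soon as the downstream application responds quickly again, so requests are admitted even while the slower-moving load figure is still above its threshold.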

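Let us look at the result. With the new algorithm, the system stays stable within a certain range, and once the problem machine recovers, traffic also recovers quickly.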

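Judging by business stability at Double 11 midnight over recent years, full-link stress testing is a clear watershed: the stability of the whole site after its introduction is clearly better than before. It has become an indispensable part of Alibaba's promotion preparation. Whether for Double 11, Double 12, or smaller everyday promotions, several rounds of full-link stress testing are run before each event to put the systems through a comprehensive simulated verification and expose problems in every link ahead of time. The birth of full-link stress testing brought a qualitative leap in system stability for Alibaba's promotions, and it has been called the nuclear weapon of promotion preparation.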

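Besides full-link stress testing, which verifies that our capacity planning is correct, the flow-control strategy is also important in our technical planning for promotions. By freely combining runtime status, call links, and rate-limiting actions, the rate-limiting framework covers a wide variety of business scenarios. Through dynamic balance it also recovers quickly and minimizes the impact on user experience. Combined, flow control and stress testing let our systems get through all kinds of extreme business scenarios in a stable and healthy state.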

Final words

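Alibaba researcher Jiang Jiangwei once said: "If someone asks me today how to do Double 11, how to run a large promotion, I will tell them a very simple method: do capacity planning well, and do rate limiting and degradation well." Now, building on Alibaba's years of experience in ensuring high availability for Double 11, the full-link stress testing service is due to launch on Alibaba Cloud in June, as a comprehensive upgrade of the existing cloud product PTS, so everyone will be able to use this nuclear weapon of Alibaba's up close.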

Origin blog.csdn.net/Ture010Love/article/details/104374053