Observability Platform: How Didi Implements Observability

Observability has attracted a lot of attention in recent years. So what exactly is observability? Let's start with a familiar scenario:

You are a front-line developer. On your way to work one day, you receive a phone alert telling you that the error count on one of your interfaces has exceeded the threshold of 30. Thanks to the so-called ChatOps built by the company's monitoring team, after a few twists and turns you finally open the corresponding monitoring chart in IM and see that, yes, the current error count does look higher than before.

As the service's developer, you deployed a new version last night, and a dependent service also seems to have changed. You start to wonder whether this alert is related to your deployment, but you can't recall exactly what changed in the dependent service last night.

Your team lead calls to ask about the alert. Since you don't know yet, all you can answer is "Let me take a look." You open your laptop, connect to a hotspot, log in to the machine, and run tail -f xxx.log | grep -E 'error|timeout|code=9527'. After a burst of frantic digging you find the cause: high latency in another service you depend on. It has nothing to do with your deployment, nor with the dependent service that changed last night.

Many engineers have been through scenarios like this one. Looking back, a few problems stand out:

  • No way to decompose further or drill down: once you have the chart produced by the monitoring system, you cannot break it down any further. You have to jump out of the current context and fall back on tools such as tail, grep, tcpdump, and strace to chase the problem.

  • A complex microservice architecture makes it hard to locate the source of a problem: with so many services involved, it is hard to tell whether the problem lies in the service itself or in one of its dependencies.

  • Reasonable alert rules are hard to define: it is not obvious whether a rule like if len(error) > 30; then alert() makes any sense.

  • Troubleshooting relies on tribal knowledge: a code like 9527, for example, was added during some past incident to mark a particular class of error. That class of error may never occur again, yet for some reason the number sticks in our heads, escapes short-term memory, and gets reused as a handy filter condition.

How do we solve these problems? By relying on observability. There is still no standard definition of observability in computing, but there is some consensus: if you can understand, from the outside, whatever state the system is in, and those states do not need to be predefined (like 9527) and may or may not have occurred before, and when a new state appears you do not have to re-instrument or ship new code to explain it, then we say the system is observable.

Based on this description of observability, let's briefly compare the familiar notion of monitoring with observability from a few angles, so that the differences are more intuitive:

  • Monitoring focuses on aggregates, observability focuses on details: traditional monitoring usually collects and displays aggregated indicator values, such as averages, maximums, and minimums, to judge the overall state of the system. Observability puts more emphasis on collecting and exposing detailed information, such as raw logs and the distribution of indicators, so that developers can dig into every detail to better identify problems and optimize the system.

  • Monitoring works with thresholds, compared manually or automatically, or with guesses about the current state based on historical experience and long-expired runbooks. Observability, because it retains the detailed information, advocates and encourages users to actively explore and understand the system, so that problems can be pinned down more precisely.

  • Monitoring is for experienced engineers, observability is for all engineers.

In general, the goal of observability is to let us understand how the system is running, so that we can find and fix problems faster and improve system stability. It emphasizes comprehensive, detailed data and encourages users to explore and discover actively, rather than passively receive alert notifications whose thresholds we cannot even set with confidence.

Observability implementation

The previous section briefly described the advantages observability has over monitoring. So how do we achieve it? Observability wants to preserve as much request context as possible, so that we can explore the environment and the detailed state that led to a failure (whether it has happened before or is brand new). Observability therefore demands far richer information than monitoring and requires users to expose more of it (for example, by enriching metrics with more labels). But down that path, the user's cost can easily spiral out of control.

Offstage voice: In the end, you just want more money, don't you?

Author: Xiu'er, please sit down for now.

If you follow the practices recommended by some SaaS vendors and report dozens or hundreds of dimensions without limiting the cardinality of each one, you can indeed satisfy the preconditions for observability. But two practical problems follow:

  1. Reporting data this rich makes the user's cost explode.

  2. Existing storage solutions struggle to handle data of such high dimensionality, high cardinality, and large scale (a rough calculation below shows why).
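To get a feel for the scale, here is a back-of-the-envelope calculation (the numbers are purely illustrative, not Didi's real figures): a single metric with 10 label dimensions, each allowed 100 distinct values, has a worst-case series count of 100^10 = 10^20 distinct time series. Real traffic only ever touches a tiny fraction of that space, but even one high-cardinality label such as a TraceID or user ID can push a single metric to millions of series, and every extra series has to be indexed, stored, and paid for.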

Offstage voice: Can't you just build one yourselves?

Author: Is security here? Please show Xiu'er out.

Besides the high-dimensionality, high-cardinality approach promoted by some SaaS vendors, the open-source community tends to take a more indirect route: achieve observability by correlating the main observation signals we already collect. Those signals are Metrics, Traces (distributed tracing), and Logs. The correlation approach expects the high-level abstraction of Metrics, plus the cross-service context carried by Traces, plus Logs as the vehicle for the most detailed human-readable information, to together reach the goal of observability.
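To make the correlation idea concrete, below is a minimal sketch in Go (standard library only; the names, log format, and "exemplar" field are illustrative assumptions, not Didi's actual SDK). The three signals share one TraceID: the metric keeps only low-cardinality labels, the raw log line carries the full detail including the trace ID, and the trace ID also rides along with the metric sample as an exemplar-style reference that links the curve back to the trace.

    package main

    import (
        "crypto/rand"
        "encoding/hex"
        "fmt"
        "log"
    )

    // newTraceID fakes a 16-byte trace ID; in a real system it would come from
    // the tracing SDK and be propagated across services in request headers.
    func newTraceID() string {
        b := make([]byte, 16)
        rand.Read(b)
        return hex.EncodeToString(b)
    }

    // counter is a toy metric: low-cardinality labels only, plus the trace ID
    // of one sampled request kept as an exemplar rather than as a label.
    type counter struct {
        labels   map[string]string
        value    int64
        exemplar string
    }

    func main() {
        traceID := newTraceID()

        // Log signal: full detail, human-readable, carries the trace ID.
        log.Printf("level=error iface=/api/order code=timeout cost=2100ms trace_id=%s", traceID)

        // Metric signal: aggregated and cheap; the trace ID rides along as an exemplar.
        c := counter{
            labels:   map[string]string{"service": "order", "iface": "/api/order", "code": "timeout"},
            value:    1,
            exemplar: traceID,
        }
        fmt.Printf("metric=request_errors labels=%v value=%d exemplar_trace_id=%s\n",
            c.labels, c.value, c.exemplar)
    }

With the same trace_id present in all three places, a chart built from the metric can be joined back to one concrete log line and one concrete trace, which is exactly the drill-down path correlation is meant to enable.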

Didi’s observability implementation

The previous section described two ways to achieve observability. Which one does Didi use?

The answer is to combine the two, balancing cost, efficiency, and the goal of observability. In terms of implementation, we reworked both log collection and Metrics collection: low-cardinality and high-cardinality dimensions are split apart, stored in different backends, and linked by correlation relationships, so that the raw log text, TraceIDs, and other details users care about can be exposed to achieve observability.

Concretely, for log collection, while the collection agent parses logs in real time to generate monitoring curves, it also samples, once per reporting cycle, one raw log line together with its mapping to the monitoring curve, and reports both. For Metrics collection, the user passes an extra argument at the instrumentation call site, telling us which dimension should be sampled and kept. In production this extra dimension is usually the TraceID, and it never appears as a label on the metric curve; instead, in each cycle we sample one value of this extra dimension and report it together with its mapping to the curve.
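The sketch below illustrates that sampling behavior with a hypothetical API (RecordWithSample and the other names are illustrative, not Didi's real instrumentation library): the caller reports the metric with its normal low-cardinality labels and passes the TraceID as an extra sampled dimension; the SDK keeps at most one such value per curve per reporting cycle, so the TraceID never becomes a label and never inflates cardinality.

    package main

    import (
        "fmt"
        "sync"
    )

    // sampledMetrics keeps one counter per curve plus, per reporting cycle, at
    // most one value of an extra high-cardinality dimension (e.g. a TraceID).
    type sampledMetrics struct {
        mu      sync.Mutex
        counts  map[string]int64  // key: serialized low-cardinality labels
        samples map[string]string // key -> sampled extra dimension for this cycle
    }

    func newSampledMetrics() *sampledMetrics {
        return &sampledMetrics{counts: map[string]int64{}, samples: map[string]string{}}
    }

    // RecordWithSample increments the curve identified by labels and, if no
    // sample has been kept for this curve in the current cycle, remembers extra.
    func (m *sampledMetrics) RecordWithSample(labels, extra string) {
        m.mu.Lock()
        defer m.mu.Unlock()
        m.counts[labels]++
        if _, ok := m.samples[labels]; !ok {
            m.samples[labels] = extra // first value wins within a cycle
        }
    }

    // Flush reports every curve together with its sampled TraceID, then starts
    // a new cycle; a real agent would call this on its reporting interval.
    func (m *sampledMetrics) Flush() {
        m.mu.Lock()
        defer m.mu.Unlock()
        for k, v := range m.counts {
            fmt.Printf("curve{%s} value=%d sampled_trace_id=%s\n", k, v, m.samples[k])
        }
        m.counts = map[string]int64{}
        m.samples = map[string]string{}
    }

    func main() {
        m := newSampledMetrics()
        m.RecordWithSample(`service="order",code="timeout"`, "4bf92f3577b34da6a3ce929d0e0e4736")
        m.RecordWithSample(`service="order",code="timeout"`, "00f067aa0ba902b7a3ce929d0e0e4736")
        m.Flush()
    }

The sampled TraceID and the curve are then stored in separate backends suited to their cardinality, with only the mapping between them kept for correlation.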

Didi Observability Product Introduction

Observability is not only a technical matter; the product layer also has to improve to help users upgrade from monitoring to observability. Below we walk through a few common scenarios, building on the mechanisms above, to get a concrete feel for the convenience and explorability that observability brings.

Raw log text linked from an alert

When a configured policy fires an alert notification, you can view the raw log text you are most familiar with directly in IM and make the next decision based on it, without having to log in to a machine and tail or grep anything.

Figure 1: Raw log text linked from an alert

Raw log text and TraceID linked from a metric chart

Having this only at alert time is not enough; you can also drill down to the raw log text while viewing a chart. We lightly clean the extracted raw log text, and when a TraceID is recognized you can jump straight to the trace platform to troubleshoot upstream and downstream services; a small sketch of how that extraction might work follows the figure:

Figure 2: Raw log text and TraceID linked from a metric chart
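For illustration, here is a minimal sketch of that drill-down step in Go. The trace_id field name, the regex, and the platform URL are assumptions for the example, not Didi's actual log schema or trace platform address.

    package main

    import (
        "fmt"
        "regexp"
    )

    // traceIDPattern matches a 32-hex-character trace ID in a log line; the
    // field name and format are assumed for this example.
    var traceIDPattern = regexp.MustCompile(`trace_id=([0-9a-f]{32})`)

    // traceLink turns a sampled raw log line into a link to the trace platform.
    // The URL is a hypothetical placeholder for the internal trace UI.
    func traceLink(rawLog string) (string, bool) {
        m := traceIDPattern.FindStringSubmatch(rawLog)
        if m == nil {
            return "", false
        }
        return "https://trace.example.internal/trace/" + m[1], true
    }

    func main() {
        raw := "level=error iface=/api/order code=timeout cost=2100ms trace_id=4bf92f3577b34da6a3ce929d0e0e4736"
        if link, ok := traceLink(raw); ok {
            fmt.Println("jump to:", link)
        }
    }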

Wrapping up

By correlating observation signals such as Logs, Traces, and Metrics, this architecture and product suite, known internally as MTL, plays an important role in Didi's observability system. It lets development and operations teams better understand how their systems are running and identify and troubleshoot problems more precisely, and it has earned recognition and positive feedback from many users. We hope this article gives you some useful experience and inspiration.



Cloud Native Night Talk

Is your company working on observability? How did you implement it? Feel free to leave a message in the comments. If you'd like to discuss further, you can also send a private message to the official account.

The author will pick the most thoughtful comment and send its writer a Didi-branded multi-functional crossbody bag; the winner will be drawn at 9 p.m. on September 19.


Source: blog.csdn.net/DiDi_Tech/article/details/132843742