TiDB full-stack, full-link observability best practices based on DeepFlow

Abstract: As an excellent open-source distributed database, TiDB is attracting more and more attention and adoption. In day-to-day operations, however, its users face operation and maintenance silos, difficulty in fault demarcation and localization, and the overhead of obtaining observability data. This article summarizes best practices for TiDB users to build full-stack observability based on DeepFlow, including how to use DeepFlow's high-performance, zero-intrusion observability technology to eliminate the blind spots of full-link tracing on the TiDB side, and how to uniformly observe the business panorama, the entire flow of SQL transactions, network performance, system resource performance, file read/write performance, and application function performance, thereby building a unified, multi-layered, all-round observability capability for TiDB and its applications.

01: Operation and maintenance challenges of distributed databases

In daily operation and maintenance, DBAs of distributed databases usually face three challenges:

  • Real-time availability of observation data: because traditional instrumentation brings obvious performance loss, DBAs usually enable TiDB's built-in distributed tracing only after a business exception has occurred. Fault handling therefore relies on after-the-fact analysis, repeated reproduction, and passive localization.
  • Comprehensiveness of observation data: traditional database operation and maintenance relies mainly on the database's own data and lacks real-time data from client applications, system resources, database file reads and writes, server network performance, and other dimensions, leaving the surrounding environment in a diagnostic blind spot.
  • Continuity of observation and diagnosis: traditional monitoring tools keep their data in separate silos; fault diagnosis is often handed off between different operation and maintenance teams, the analysis process requires switching between multiple monitoring tools, and diagnostic continuity is frequently interrupted.

These problems disconnect database operation and maintenance from other operation and maintenance work and form an operation and maintenance island: business operation risks are hard to detect, faults are hard to demarcate, recovery cycles are long, and communication, coordination, and operational friction multiply.

DeepFlow provides TiDB with full-stack, full-link, production-ready observation capabilities through several core features: zero-intrusion, high-performance collection of observation data, open ingestion of observation data, and unified correlation analysis of observation data. It helps DBAs build full-link monitoring, rapid fault demarcation, and root cause analysis capabilities, effectively improving the efficiency of distributed database operation and maintenance and addressing the pain points above.

02: TiDB observability deployment solution

Figure 1 - TiDB observability overall deployment architecture

In this TiDB observability practice, we automatically deploy the DeepFlow Agent on the nodes of the business cluster and the TiDB distributed database cluster to collect multi-dimensional observation data. The structured observation data is transmitted over the network, processed uniformly (unified data labeling, unified data correlation, unified data analysis), and stored centrally on the DeepFlow Server. Through functional design that closely follows operation and maintenance scenarios, these data provide flexible, multi-dimensional, full-stack observation and analysis capabilities from the macro level down to the micro level.

The DeepFlow Agent collects rich observation data, including:

  1. eBPF zero-intrusion data: call chain tracking, SQL performance indicators, SQL call logs, network performance indicators, network flow logs, file read and write events, CPU Perf, and other observation data
  2. OpenTelemetry instrumentation data: call chain tracking data within each component of TiDB
  3. Prometheus Exporter data: K8s system indicators, TiDB performance indicators

Figure 2 - DeepFlow observability data collection and collection diagram
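To illustrate the second data source above, here is a minimal sketch of how a Go application (or any component with OpenTelemetry tracing enabled) could export spans over OTLP/gRPC to an ingest endpoint. The endpoint address, service name, and span name are placeholders for this illustration, not DeepFlow's actual configuration.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans over OTLP/gRPC. The endpoint below is a placeholder for
	// wherever an OTLP-compatible collector is listening.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("deepflow-agent.deepflow:4317"), // hypothetical address
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("create OTLP exporter: %v", err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Emit one example span, standing in for an instrumented code path
	// inside an application (or a TiDB component with tracing enabled).
	tracer := otel.Tracer("svc-order")
	_, span := tracer.Start(ctx, "place-order")
	// ... business logic ...
	span.End()
}
```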

03: Call chain tracking

In DeepFlow's full-stack observation practice for TiDB, we used eBPF technology to achieve full-link call chain tracing covering the front-end application, the intermediate network, TiDB-Proxy, and TiDB. Compared with traditional APM technology, the call chain tracing implemented by DeepFlow has the following advantages:

  • No blind spots: eliminates tracing blind spots in TiDB, gateway middleware, and the intermediate network;
  • High performance: high-performance, zero-intrusion tracing achieved through eBPF builds production-ready, non-sampled tracing and observation capabilities for TiDB;
  • Hot loading: thanks to eBPF's ability to hot-load code, tracing can be brought online in applications and TiDB clusters at any time, so even small performance jitters can be continuously observed at fine granularity to quickly discover and eliminate system risks;
  • Cross-technology-stack: full-stack tracing enables unified tracing and collaboration across multiple operation and maintenance technology stacks such as DBA, database development, system operation and maintenance, and application operation and maintenance, quickly determining fault boundaries and improving the efficiency of technical collaboration.

1) Call chain tracking without blind spots

Why does DeepFlow's eBPF zero-intrusion collection achieve truly blind-spot-free call chain tracing, compared with traditional instrumentation solutions? We can answer this question with a simple schematic diagram:

Figure 3 - Comparison of coverage of three different call chain tracking solutions

Application instrumentation

Traditional APM technology traces the application call chain through code instrumentation, so the observation scope is limited to the applications that can be instrumented, and the tracing capability leaves an observation blind spot on the TiDB side. When business access responds slowly, it is impossible to quickly determine whether TiDB is the source of the problem.

Application instrumentation + TiDB instrumentation

To extend application call chain tracing to TiDB, we submitted a PR to the TiDB community and implemented tracing on the TiDB side. However, we later discovered that instrumentation inside TiDB still has three problems:

  • The network in TiDB's operating environment is in a tracing blind spot, so the network's impact on SQL performance cannot be diagnosed;
  • TiDB-Proxy in front of TiDB is in a tracing blind spot, so TiDB-Proxy's impact on SQL performance cannot be diagnosed;
  • TiDB's response performance dropped significantly after instrumentation (see the data below), so it cannot be used continuously in a production system.

DeepFlow eBPF zero-intrusion collection

The zero-intrusion call chain tracing capability built with DeepFlow's eBPF technology eliminates TiDB's tracing blind spots and provides complete tracing for every position in an application call: 1) the front-end application; 2) TiDB-Proxy; 3) TiDB; 4) the K8s network.

With this in place, we can trace a given slow response in the DeepFlow call chain tracing flame graph, intuitively observe how much each position contributes to the response delay, and quickly determine where the slow response originates:

Figure 4 - DeepFlow call chain tracing flame graph

For example, in the excerpt of a DeepFlow call chain tracing flame graph below, we can clearly see that the MySQL COM_QUERY COMMIT call of a certain business flow introduces a 480ms delay in the processing of the tidb-proxy process:

Figure 5 - DeepFlow call chain tracing flame graph part
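The reasoning behind "contribution proportion" in such a flame graph can be sketched as follows: each position's own contribution is its span duration minus the time already explained by its children. The sketch below uses hypothetical numbers loosely modeled on the 480ms tidb-proxy example, not data exported from DeepFlow.

```go
package main

import "fmt"

// Span is a simplified view of one node in a call-chain flame graph:
// its total duration and the durations of its direct children.
type Span struct {
	Name       string
	DurationMs float64
	Children   []Span
}

// selfTime returns the portion of a span's duration not explained by its
// direct children, i.e. the delay contributed at that position itself.
func selfTime(s Span) float64 {
	child := 0.0
	for _, c := range s.Children {
		child += c.DurationMs
	}
	if child > s.DurationMs {
		return 0 // clock skew or overlapping children; clamp for simplicity
	}
	return s.DurationMs - child
}

func main() {
	// Hypothetical numbers: a client-side COMMIT that spends most of its
	// time inside tidb-proxy rather than in TiDB itself.
	trace := Span{
		Name: "svc-order COM_QUERY COMMIT", DurationMs: 500,
		Children: []Span{{
			Name: "tidb-proxy COM_QUERY COMMIT", DurationMs: 490,
			Children: []Span{{Name: "tidb COM_QUERY COMMIT", DurationMs: 10}},
		}},
	}

	var walk func(Span)
	walk = func(s Span) {
		fmt.Printf("%-35s total=%6.1fms self=%6.1fms\n", s.Name, s.DurationMs, selfTime(s))
		for _, c := range s.Children {
			walk(c)
		}
	}
	walk(trace)
}
```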

2) High-performance call chain tracking

The most critical issue hindering the adoption of TiDB observability in real production systems is the performance overhead of observation data collection.

We found that when call chain data is collected through in-application instrumentation, Java Agent bytecode enhancement, and similar techniques, they may introduce significant and hard-to-bound extra resource consumption and noticeable business performance loss, making such solutions impossible to deploy in systems with high concurrency, high performance, and high reliability requirements. Many enterprises can only enable such tracing in test systems or unimportant production systems; more important production systems have to rely on heavily sampled tracing to reduce the negative impact on the business, and core production systems in the financial industry in particular give up tracing entirely to avoid affecting business performance, which leaves operational assurance weak and operational risk uncontrolled.

DeepFlow's zero-intrusion (Zero Code) observability based on eBPF solves these problems well: the impact on business latency (response time) is usually at the sub-millisecond level, and the Agent's resource consumption is predictable and can be capped through configuration. Thanks to Just-in-Time (JIT) compilation, the eBPF code in the DeepFlow Agent runs at an efficiency comparable to native kernel code and does not interfere with the application's processing flow during collection, achieving zero intrusion into the application's call processing.

In DeepFlow's full-stack observation practice for TiDB, we tested and compared the business performance of the TiDB cluster when using Jaeger instrumentation versus DeepFlow's eBPF zero-intrusion technology for observability.

Figure 6 - SQL performance comparison between Jaeger instrumentation collection and DeepFlow eBPF collection

TiDB SQL response delay with Jaeger instrumentation: with TiDB's OpenTracing feature turned on, we observed the response performance of the application instance (svc-order) accessing tiproxy. The average SQL response delay stays at 300~400ms, and the maximum average delay exceeds 1.5s in some periods.

Figure 7 - (Jaeger instrumentation collection) SQL performance indicators of svc-order -> tidb-proxy

TiDB SQL response delay with DeepFlow eBPF collection: with the DeepFlow Agent collecting unsampled observation data on the TiDB distributed cluster, we observed the response performance of the same application instance (svc-order) accessing tiproxy. The average SQL response delay drops to 3~5ms, and the maximum delay does not exceed 38ms.

Figure 8 - (eBPF collection) SQL performance indicators of svc-order -> tidb-proxy

From this comparison we can see that the zero-intrusion observability achieved by DeepFlow through eBPF eliminates the business performance degradation caused by the instrumentation approach, making it possible to build production-ready, non-sampled tracing and observation capabilities for the TiDB distributed databases of core IT production systems.

3) Call chain tracking with hot loading at any time

Because in-application instrumentation, Java Agent bytecode enhancement, and similar techniques degrade business performance, important core production systems generally keep tracing turned off in daily operation. When a fault or exception occurs and fine-grained tracing is needed, teams discover that starting instrumentation or a Java Agent requires restarting the application instances and TiDB component instances of the core production system. Deciding whether to enable tracing then becomes a difficult and painful choice, which ultimately leads to the following operation and maintenance status quo:

  • Minor faults go unnoticed, and the system keeps running with hidden dangers.
  • Moderate faults consume large amounts of effort: the test environment is used for repeated reproduction, and finding the root cause depends on luck.
  • Major faults are fought with everything available: the entire team works around the clock, regardless of cost, until the business is restored.

Heinrich's Law tells us that behind every serious accident there are 29 minor accidents and 300 near misses, and the same pattern exists in the operation and maintenance of IT systems. Precisely because instrumentation and Java Agents require restarts to start up, the many potential hazards and minor faults in production environments are hard to diagnose and locate, moderate faults consume large amounts of manpower to demarcate, and major faults keep occurring.

DeepFlow's zero-intrusion observability solves this problem. Whenever call chain tracing needs to be enabled for an application system or for TiDB, the DeepFlow Agent can be deployed with one click at any time to start collecting tracing data. Deploying the Agent does not intrude into the application pods or the TiDB component pods, and it requires no restart of the application, TiDB, or the operating system: relying only on the eBPF capability of the Linux kernel, the Agent hot-loads the tracing collection code into the operating system kernel and immediately starts collecting tracing data at each position. Even for heavily loaded business systems, when tracing is no longer needed the Agent's eBPF collection switch can be turned off online or the Agent uninstalled, and the tracing collection code is instantly removed from the kernel. Throughout this process, the upper-layer applications and TiDB component programs are completely unaware.

Because the DeepFlow Agent's call chain tracing can be hot-loaded into the operating system kernel at any time, we can bring tracing online and offline in the application system and the TiDB cluster whenever needed, without worrying about disturbing the application, and observe minor and moderate faults at fine granularity at any time, quickly discovering and eliminating system hazards before they turn into major failures.

Figure 9 - Zero-intrusion deployment of DeepFlow Agent

4) Call chain tracking for unified collaboration across technology stacks

APM tracing based on traditional instrumentation is aimed at application developers and TiDB developers: the call chain mostly reflects function call relationships inside a process, which is hard to understand without a development background, so it offers little practical help to DBAs and other operation and maintenance roles during fault demarcation. The DeepFlow observability platform provides full-stack tracing of any application call from the application down to the operating system and the underlying network, so it can support unified collaboration across the technology stacks of TiDB operation and maintenance, TiDB development, application development, K8s operation and maintenance, and middleware operation and maintenance. Every processing segment in the entire flow of any application call can be clearly observed, and fault boundaries can be determined quickly.

Figure 10 - Unified collaboration and unified observation of multiple operation and maintenance technology stacks

Taking the figure above as an example, on the DeepFlow observability platform:

  • Business application development and operation and maintenance: quickly diagnose hidden dangers or faults in business applications through the call delays of business application microservices;
  • K8s operation and maintenance: quickly diagnose hidden dangers or faults in the K8s platform through the call delays observed at the network interfaces between microservice instances;
  • Middleware operation and maintenance: quickly diagnose hidden dangers or faults in the middleware through the call delays observed at the TiDB-Proxy position;
  • DBA: quickly diagnose hidden dangers or faults on the TiDB side through the TiDB call delays collected by eBPF;
  • Database development: after enabling TiDB's OpenTelemetry instrumentation data on demand, quickly diagnose hidden dangers or faults in key functions inside TiDB.

04: All-round observation

Beyond call chain tracing, daily TiDB operation and maintenance also needs comprehensive observation and analysis of the business panorama, network performance, system resource performance, operating system file read/write performance, and application function performance in order to find root causes quickly. An observability practice must therefore break down the silos between these types of observation data, establish correlations between them, and design a smooth observation workflow, so that full-stack operation and maintenance personnel can efficiently work through all-round data and reach conclusions faster and more easily.

In this TiDB observability practice, we implemented unified collection and analysis of multiple types of data on the DeepFlow platform and, through a smooth workflow, can efficiently access each type of data, achieving all-round observation capabilities that include:

  1. Business panoramic observation
  2. Network performance observation
  3. SQL statement traceback
  4. System resource performance observation
  5. File reading and writing performance observation
  6. Function performance observation (CPU profiling)

1) Business panoramic observation

DeepFlow dynamically obtains resource tags and business tags by interfacing with cloud-native APIs in real time and injects these tags into the collected observation data, automatically building a business panorama that updates dynamically as the business changes.
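As a rough illustration of the idea (not DeepFlow's actual implementation), the sketch below uses the Kubernetes API via client-go to build a PodIP-to-labels index, which is the kind of mapping that lets raw traffic be labeled with namespace, workload, and business tags:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Build a PodIP -> labels map from the Kubernetes API. A collector can use
// such a map to turn raw IP addresses in observed traffic into business
// labels (namespace, workload, app, ...). Illustration only.
func main() {
	cfg, err := rest.InClusterConfig() // assumes we run inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	pods, err := client.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	ipToTags := map[string]map[string]string{}
	for _, p := range pods.Items {
		if p.Status.PodIP == "" {
			continue
		}
		tags := map[string]string{"namespace": p.Namespace, "pod": p.Name}
		for k, v := range p.Labels {
			tags[k] = v
		}
		ipToTags[p.Status.PodIP] = tags
	}
	fmt.Printf("indexed %d pod IPs\n", len(ipToTags))
}
```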

In the TiDB observability practice based on DeepFlow, we built an end-to-end business panorama from cloud native application clusters to TiDB distributed database clusters (as shown below):

Figure 11 - Business Panorama

With a business panoramic topology covering the cloud-native application cluster and the TiDB database cluster, we can observe the overall picture of the IT business system from a macro perspective; when a local business performance anomaly appears in the panorama, we can drill down step by step to the micro level and quickly uncover "unknown unknowns".

2) Network performance observation

Network packet loss and network delay are frequent suspects when diagnosing faults in a TiDB database cluster. Through network performance observation, DeepFlow can quickly determine whether a given distributed database business failure is caused by a network failure.

In network performance observation, we can use the following six indicators to observe the network interaction performance of external access to the TiDB cluster and of mutual access between TiDB's internal components, and to diagnose different types of network transmission problems (a small sketch after the list shows how such indicators can be derived from raw flow counters):

  • Bytes (throughput) - observe network throughput pressure
  • TCP retransmission ratio - observe network packet loss
  • TCP zero-window ratio - observe TCP congestion
  • Connection establishment failure ratio - observe TCP connection establishment anomalies
  • Average TCP connection establishment delay - observe network RTT
  • Average TCP/ICMP system latency - observe the response speed of the operating system
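As referenced above, here is a minimal sketch of how such indicators are typically derived from raw per-interval flow counters; the counter fields and values are hypothetical and do not reflect DeepFlow's internal data model:

```go
package main

import "fmt"

// FlowCounters holds hypothetical per-interval counters for one
// client/server path, similar in spirit to the indicators listed above.
type FlowCounters struct {
	Bytes          uint64
	Packets        uint64
	RetransPackets uint64
	ZeroWindowPkts uint64
	ConnAttempts   uint64
	ConnFailures   uint64
	SumRTTMs       float64 // sum of per-connection establishment RTTs
	RTTSamples     uint64
}

func ratio(part, whole uint64) float64 {
	if whole == 0 {
		return 0
	}
	return 100 * float64(part) / float64(whole)
}

func main() {
	c := FlowCounters{
		Bytes: 42 << 20, Packets: 50000, RetransPackets: 120,
		ZeroWindowPkts: 3, ConnAttempts: 800, ConnFailures: 2,
		SumRTTMs: 960, RTTSamples: 800,
	}
	intervalSec := 60.0

	fmt.Printf("throughput        : %.2f KB/s\n", float64(c.Bytes)/1024/intervalSec)
	fmt.Printf("retransmission    : %.2f %%\n", ratio(c.RetransPackets, c.Packets))
	fmt.Printf("zero window       : %.2f %%\n", ratio(c.ZeroWindowPkts, c.Packets))
	fmt.Printf("conn failure rate : %.2f %%\n", ratio(c.ConnFailures, c.ConnAttempts))
	fmt.Printf("avg establish RTT : %.2f ms\n", c.SumRTTMs/float64(c.RTTSamples))
}
```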

Example 1: Quickly observe the network interaction performance of clients accessing the TiDB entry point

Figure 12 - svc-user -> tidb-proxy network performance indicator observation

Example 2: Quickly observe the network interaction performance between TiDB internal components

Figure 13 - Observation of network performance indicators of tidb -> pd

3) SQL statement traceback

Unreasonable SQL statements occupy large amounts of CPU and memory, increase TiDB's response delay, and reduce TiDB's overall performance, so quickly tracing back inefficient SQL statements is an important part of database operation and maintenance. Queries that miss indexes, full table scans, overly complex queries, and the use of inappropriate functions or data types are all types of "bad SQL" that deserve backtracking and attention.

In this TiDB observability practice, DeepFlow's zero-intrusion, bypass collection records the details of every SQL call, including the SQL statement, response delay, time of occurrence, and source node. During call log retrieval we can search for and trace the frequency of bad SQL in real time and determine where the SQL originated.

Figure 14 - SQL call log traceback
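To make the backtracking idea concrete, the sketch below filters a set of collected SQL call-log records for slow statements and for a crude "full scan" heuristic (a SELECT without a WHERE clause). The record fields, values, and threshold are hypothetical, not DeepFlow's query interface.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// CallLog is a hypothetical view of one collected SQL call record:
// the statement, its response delay, when it happened and from where.
type CallLog struct {
	Time       time.Time
	ClientPod  string
	ServerPod  string
	Statement  string
	DurationMs float64
}

// suspicious flags calls that are slow or look like full scans
// (a SELECT without a WHERE clause, as a crude heuristic).
func suspicious(c CallLog, slowMs float64) bool {
	stmt := strings.ToUpper(c.Statement)
	fullScan := strings.HasPrefix(stmt, "SELECT") && !strings.Contains(stmt, "WHERE")
	return c.DurationMs >= slowMs || fullScan
}

func main() {
	logs := []CallLog{
		{time.Now(), "svc-order-0", "tidb-0", "SELECT * FROM orders", 480},
		{time.Now(), "svc-user-1", "tidb-1", "SELECT name FROM users WHERE id = 7", 3},
	}
	for _, c := range logs {
		if suspicious(c, 100) {
			fmt.Printf("%s %s -> %s %.0fms %q\n",
				c.Time.Format(time.RFC3339), c.ClientPod, c.ServerPod, c.DurationMs, c.Statement)
		}
	}
}
```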

4) System resource performance observation

Excessive CPU or memory utilization on the operating system, or congested network interface throughput during the same period, can all lead to slow SQL. To find the root cause of slow SQL faster and more accurately, quickly examining the CPU, memory, and network interface indicators of a TiDB component instance during fault diagnosis is key to improving the observation capability, operability, and troubleshooting efficiency of the distributed database.

In this TiDB observability practice, we collect system indicators with the Grafana Agent and import the data into the DeepFlow observability platform, where it is uniformly labeled, correlated, and presented, giving us the ability to observe the indicator data of TiDB component PODs. After the source POD of a slow SQL is found in call chain tracing, the POD's CPU usage curve, memory usage curve, and network throughput curve can be viewed with one click, and the indicator values during the problem period make it easy to determine whether the slow SQL event correlates with CPU, memory, or network interface resource performance.

Figure 15 - TiDB database instance system resource performance observation
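The correlation step can be sketched as follows: given the timestamp of a slow SQL found in call chain tracing, pull the metric samples of the source POD within a window around that moment and inspect them side by side. The sample structure and values below are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is one hypothetical CPU-usage data point for a TiDB component POD.
type Sample struct {
	Time time.Time
	CPU  float64 // utilization in percent
}

// around returns the samples within +/- window of a slow-SQL event, so the
// metric curve during the problem period can be inspected side by side.
func around(samples []Sample, event time.Time, window time.Duration) []Sample {
	var out []Sample
	for _, s := range samples {
		if s.Time.After(event.Add(-window)) && s.Time.Before(event.Add(window)) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	base := time.Date(2024, 5, 6, 10, 0, 0, 0, time.UTC)
	samples := []Sample{
		{base.Add(-2 * time.Minute), 35}, {base.Add(-1 * time.Minute), 92},
		{base, 95}, {base.Add(1 * time.Minute), 40},
	}
	slowSQLAt := base // time of the slow SQL seen in the call chain

	for _, s := range around(samples, slowSQLAt, 90*time.Second) {
		fmt.Printf("%s cpu=%.0f%%\n", s.Time.Format("15:04:05"), s.CPU)
	}
}
```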

5) File reading and writing performance observation

File read/write performance has a significant impact on TiDB's SQL response performance. Although high-performance storage can minimize the likelihood of slow file reads and writes, two questions still come up frequently during TiDB operation and maintenance:

  • When sporadic slow SQL occurs, how do we determine whether sporadic slow file reads or writes are the root cause?
  • When unknown large or frequent file reads and writes appear in the system, how do we find the process that is the source of the file IO?

DeepFlow realizes complete collection and observation of operating system file read and write events through the eBPF capability of the Linux kernel, and supports multiple collection modes such as full collection and partial collection.

Figure 16 - Schematic diagram of eBPF collection principle of operating system IO events

When an application responds slowly, DeepFlow can retrieve with one click all file read and write events on the client and server during the problem period and quickly pinpoint the relevant file operations and their source processes in the list. When a slow SQL response occurs, the DBA can determine within seconds whether there is an associated slow IO event; when unknown file read/write behavior appears, system operation and maintenance can identify the source of the file IO within seconds:

Figure 17 - File IO event retrieval
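The two questions above map to two simple operations over the collected file IO events: flag events whose latency exceeds a threshold, and aggregate bytes by process to find the source of heavy IO. The event fields, paths, and threshold below are hypothetical, not DeepFlow's actual schema.

```go
package main

import (
	"fmt"
	"time"
)

// FileIOEvent is a hypothetical view of one collected file read/write event.
type FileIOEvent struct {
	Time      time.Time
	Process   string // e.g. "tikv-server"
	Path      string
	Op        string // "read" or "write"
	Bytes     uint64
	LatencyMs float64
}

func main() {
	events := []FileIOEvent{
		{time.Now(), "tikv-server", "/data/tikv/db/000123.sst", "write", 64 << 20, 850},
		{time.Now(), "tidb-server", "/var/log/tidb/tidb.log", "write", 4096, 1},
	}

	// 1) Flag slow IO events that may explain a sporadic slow SQL.
	const slowMs = 100
	for _, e := range events {
		if e.LatencyMs >= slowMs {
			fmt.Printf("slow %s by %s on %s: %.0fms, %d bytes\n",
				e.Op, e.Process, e.Path, e.LatencyMs, e.Bytes)
		}
	}

	// 2) Aggregate bytes per process to find the source of unknown heavy IO.
	byProcess := map[string]uint64{}
	for _, e := range events {
		byProcess[e.Process] += e.Bytes
	}
	for p, b := range byProcess {
		fmt.Printf("%s read/wrote %d bytes in total\n", p, b)
	}
}
```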

6) Function performance observation (CPU profiling)

While troubleshooting slow SQL, once the CPU is found to be the bottleneck of business performance, the next step is to analyze the CPU scheduling of the TiDB component POD, find the hot functions in the TiDB component program, and thereby find the entry point for program optimization. DeepFlow implements CPU Perf data collection and observation for application processes through the eBPF technology of the Linux kernel; by profiling the CPU performance of TiDB component processes, developers can easily pinpoint the CPU hot spots among kernel functions, library functions, and application functions in each TiDB component.
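As a rough illustration of how sampled call stacks turn into a flame graph and a hot-function ranking, the sketch below folds hypothetical stack samples into the common "folded stack" text format and counts leaf frames; the frame names are made up and do not come from TiDB.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Each sample is one captured call stack, ordered from outermost to
// innermost frame. The frame names below are made up for illustration.
func main() {
	samples := [][]string{
		{"main", "handleQuery", "buildPlan"},
		{"main", "handleQuery", "buildPlan"},
		{"main", "handleQuery", "executePlan", "scanTable"},
		{"main", "reportSpan", "encodeThrift"}, // tracing overhead shows up as its own subtree
	}

	// Fold identical stacks into "frame;frame;frame count" lines, the common
	// input format for flame-graph tooling, and count leaf (on-CPU) frames.
	folded := map[string]int{}
	leafHits := map[string]int{}
	for _, s := range samples {
		folded[strings.Join(s, ";")]++
		leafHits[s[len(s)-1]]++
	}

	for stack, n := range folded {
		fmt.Printf("%s %d\n", stack, n)
	}

	// Rank hot functions by how often they were on-CPU.
	type hot struct {
		fn string
		n  int
	}
	var hots []hot
	for fn, n := range leafHits {
		hots = append(hots, hot{fn, n})
	}
	sort.Slice(hots, func(i, j int) bool { return hots[i].n > hots[j].n })
	fmt.Println("hot functions:", hots)
}
```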

In this TiDB observability practice, we used DeepFlow to profile the CPU performance of the TiDB, PD, and TiKV components. In the CPU profiling flame graph of the TiDB process below, we can clearly observe that Jaeger instrumentation introduces significant CPU overhead for the TiDB process:

Figure 18 - TiDB process CPU performance analysis results

05: Value summary

Good operation and maintenance assurance is a precondition for the stable operation of a TiDB distributed database. In this observability practice, DeepFlow used its leading high-performance eBPF zero-intrusion data collection, zero-instrumentation call chain tracing, and unified observation of multi-source data to build a full-stack, full-link, hot-loadable-at-any-time, production-ready observability solution for TiDB, significantly improving TiDB's all-round operation and maintenance capabilities, effectively supporting the stable operation of TiDB databases, and helping users deliver more reliable and stable database services.

06: What is DeepFlow

DeepFlow is an observability product developed by Yunshan Networks, aiming to provide deep observability for complex cloud infrastructure and cloud-native applications. Based on eBPF, DeepFlow achieves zero-intrusion (Zero Code) collection of observation signals such as application performance metrics, distributed traces, and continuous profiling, and combines this with smart label (SmartEncoding) technology to achieve full-stack (Full Stack) correlation and efficient access of all observation signals. With DeepFlow, cloud-native applications automatically gain deep observability, removing the heavy burden of continuous instrumentation from developers and providing DevOps/SRE teams with monitoring and diagnostic capabilities from code to infrastructure.

GitHub address: https://github.com/deepflowio/deepflow

Visit the DeepFlow Demo to experience zero-instrumentation, full-coverage, fully correlated observability.
