eBPF + LLM: an infrastructure for enabling observability agents

This article is based on the talk "eBPF + LLM: Infrastructure for Realizing Observability Agents" given by Xiang Yang, DeepFlow product lead at Yunshan Networks, at the QCon Global Software Development Conference (Beijing) 2024. A replay and the slides are available through the links in the original post.


I'm glad to share some of the work DeepFlow has done on observability agents. Today's talk covers two aspects: how to use eBPF to solve data quality problems, and how to use LLMs to build efficient agents on top of that data. Together, these two aspects explain why eBPF and LLMs are the key infrastructure for realizing observability agents.

The first question is essentially a data governance problem. There are many ways to obtain high-quality data, such as organizational standards and better R&D engineering practices. What I'm sharing today is mainly the latter: how to use eBPF, an innovative technology, to obtain full-stack observability data with zero intrusion. Once we have high-quality data, we can use an LLM, combined with prompt engineering, RAG, fine-tuning, and other methods, to build an efficient observability agent. The practices shared today come from our observability product DeepFlow, which is also an increasingly popular open source project. Finally, I will share some thoughts on how observability agents may evolve.

01

Build high-quality observability signal sources using eBPF

The essence of the first issue is data governance, with the goal of obtaining high-quality observability data. Let's look at the traditional solutions first, for example collecting data with APM. How do we ensure the integrity and consistency of that data? In a cloud-native environment especially, a client request to a server may traverse complex K8s networking, various gateways and middleware, databases, DNS, and other basic services. These intermediate hops are not covered by APM, and even where APM is present, the location it observes can differ from the process's actual network requests. When we do data governance and analysis on this basis, we usually find problems with data integrity and consistency. It often takes enormous time and effort to push business teams to improve instrumentation coverage, and we still need packet captures, logs, and other methods to stitch together the heterogeneous data of the many basic services. The picture below shows a problem we often encounter when using APM: the client observes a latency of 500ms, while the server observes only 10ms; worse still, the server side may not even have been instrumented yet.

Pain points of using APM to collect observation signals

Back to the main topic: why do we say that eBPF is the infrastructure for high-quality observability signal sources? eBPF is a kernel programmability technology. Each little bee in the picture below marks a function where eBPF can attach a hook. By hooking business functions, system calls, kernel functions, and network and disk driver functions, eBPF can perceive the internal state of any process on a cloud host. More importantly, doing so is completely safe thanks to the eBPF Verifier, and thanks to the JIT compilation mechanism its performance is comparable to native kernel code.

Unique advantages of using eBPF

eBPF has two unique advantages that address the data quality problems APM faces. The first is zero intrusion (Zero Code): running an eBPF program requires no code changes, no recompilation, and no restart of any application process. It is plug-and-play and can be deployed to your production environment at any time. The second is full stack: whether the target is a business process, a gateway, a message queue, a database, or the operating system itself, eBPF can collect observability data from it. When a process runs, its interaction with the entire software stack, from business logic to language runtime, from shared libraries to the kernel to hardware drivers, can all be covered by eBPF. The figure below shows how rich the raw data eBPF can collect is: Process Events, File Events, Perf Events, Socket Events, Kernel Events, and Hardware Events. Since the theme of today's QCon session is business observability, we focus first on the first four types of data.

eBPF Raw Data

However, what you see in the picture above is only raw data. It needs identification, extraction, transformation, correlation, aggregation, and other processing before it becomes the business observability data we use daily. The picture below summarizes part of DeepFlow's processing of eBPF raw data. From Socket Events we can extract the Request/Error/Delay golden metrics for each API; by correlating these metrics we can build a Service Map; by correlating all calls we can form distributed traces; and by aggregating those traces we can construct a more fine-grained API Map. Among these, zero-intrusion distributed tracing based on eBPF is DeepFlow's original capability: distributed tracing without instrumentation or TraceID injection. Interested readers are welcome to download it from our GitHub repo and try it out. In addition, process start/stop logs can be extracted from Process Events, a File Access Log from File Events, and CPU, memory, GPU, video memory, and lock events from Perf Events, from which performance-analysis flame graphs can be drawn.
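To make the metric-extraction step concrete, here is a minimal Python sketch of how Request/Error/Delay golden metrics could be aggregated from per-request records derived from Socket Events. The record fields (`client`, `server`, `endpoint`, `status`, `delay_ms`) and the 5xx error rule are illustrative assumptions, not DeepFlow's actual schema.

```python
from collections import defaultdict

def golden_metrics(request_logs):
    """Aggregate per-(client, server, endpoint) Request/Error/Delay metrics."""
    stats = defaultdict(lambda: {"request": 0, "error": 0, "delay_sum": 0.0})
    for log in request_logs:
        key = (log["client"], log["server"], log["endpoint"])
        s = stats[key]
        s["request"] += 1
        if log["status"] >= 500:          # illustrative error rule
            s["error"] += 1
        s["delay_sum"] += log["delay_ms"]
    return {
        key: {
            "request": s["request"],
            "error_ratio": s["error"] / s["request"],
            "avg_delay_ms": s["delay_sum"] / s["request"],
        }
        for key, s in stats.items()
    }

# toy request log reconstructed from socket events
logs = [
    {"client": "web", "server": "api", "endpoint": "/order", "status": 200, "delay_ms": 12.0},
    {"client": "web", "server": "api", "endpoint": "/order", "status": 502, "delay_ms": 480.0},
]
metrics = golden_metrics(logs)
```

Each aggregated key is an edge between two services, which is exactly what gets rolled up into a Service Map.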

From Raw Data to high-quality observation signals

Besides obtaining the observation signals, another important task is to inject unified tags into the data. For example, injecting resource and business tags obtained from K8s, the cloud platform, and the CMDB helps us horizontally associate full-stack data; injecting system-level tags such as process, thread, and coroutine IDs, and network-level tags such as subnet, IP, and TCP sequence number, helps us vertically associate the distributed call chain.
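The tag-injection idea can be sketched in a few lines of Python. This is a hypothetical enrichment helper, not DeepFlow's actual AutoTagging implementation: `k8s_inventory` maps an IP to pod metadata and `cmdb` maps a namespace to an owner, both invented structures for illustration.

```python
def auto_tag(record, k8s_inventory, cmdb):
    """Enrich one observation record with unified resource and business tags."""
    tagged = dict(record)
    pod = k8s_inventory.get(record["ip"])
    if pod:
        # horizontal association: resource tags shared by all full-stack data
        tagged["k8s.pod"] = pod["pod"]
        tagged["k8s.namespace"] = pod["namespace"]
        owner = cmdb.get(pod["namespace"])
        if owner:
            # business tag later used to find the right on-call owner
            tagged["label.owner"] = owner
    return tagged

record = {"ip": "10.1.0.7", "endpoint": "/order", "delay_ms": 480.0}
inventory = {"10.1.0.7": {"pod": "api-6d5f", "namespace": "payments"}}
cmdb = {"payments": "xiao-li"}
tagged = auto_tag(record, inventory, cmdb)
```

Because every signal carries the same tag set, a metric spike, a trace span, and a log line for the same pod can be joined on identical keys.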

Taking distributed tracing as an example, let's look at what DeepFlow achieves with eBPF. The figure below compares APM and eBPF tracing results: APM can generally cover Java processes, but it usually struggles to cover the API gateway, microservice gateway, service mesh, DNS, Redis, MySQL, and the other components of the whole application, and covering non-Java applications is also costly. With eBPF, thanks to zero-intrusion, full-stack data, the trace can cover all business and infrastructure processes, as well as K8s network transmission, file reads and writes, and other events. This capability is a cutting-edge innovation: we described the work in an academic paper of nearly 20 pages, which was accepted last year by ACM SIGCOMM, the ACM's flagship networking conference.

Distributed Tracing

Returning to the session's theme of business observability: how can the data eBPF obtains from the kernel be associated with the business? eBPF is a kernel technology, and its ecosystem focuses on systems, performance, security, and the like, while applications care about business semantics, and developers want business and efficiency information. How can eBPF-based observability break through this dimensional wall from system to business? Let me describe DeepFlow's approach. Taking socket data as an example: when parsing protocols from data obtained by eBPF, we usually only parse the header fields of standard protocols (such as HTTP and MySQL). DeepFlow currently supports parsing more than 20 protocols, covering the HTTP1/2/S, RPC, MQ, DB, and network categories. DeepFlow's built-in parsers can extract standard fields from the headers, and even the payloads, of these protocols, such as URLs, SQL statements, and error codes. However, information inside the HTTP payload, such as business error codes, transaction serial numbers, order IDs, and vehicle VINs, cannot be extracted with uniform logic. Sometimes the business also serializes with Protobuf, Thrift, or similar formats, so the payload must be parsed together with the corresponding schema.

Here we use another technology, WebAssembly, to solve this problem. If eBPF is kernel-side programmability, then WebAssembly is user-space programmability. DeepFlow uses it to implement a secure, high-performance, hot-loadable plugin mechanism. Users can write plugins in Golang, Rust, C/C++, and other languages to parse business payloads on demand, thereby enhancing the Request Log, File Access Log, and other eBPF observation data. For example, based on the Plugin SDK provided by DeepFlow, you can write a Golang program that parses an HTTP Protobuf payload and extracts the fields the business cares about; you can even use the error code, error message, and other information in the payload to rewrite the corresponding fields in the original HTTP Request Log.
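The plugin logic can be illustrated with a small Python sketch (the real DeepFlow plugins are WebAssembly modules written in Go/Rust/C++; this only mirrors the control flow). The registry, the JSON payload, and field names like `biz_code` and `response_code` are all hypothetical.

```python
import json

PARSERS = {}

def register(protocol):
    """Register a payload parser for one protocol, plugin-style."""
    def deco(fn):
        PARSERS[protocol] = fn
        return fn
    return deco

@register("http")
def parse_business_payload(request_log, payload: bytes):
    """Hypothetical plugin: pull business fields out of a JSON HTTP payload
    and rewrite the generic fields of the request log."""
    try:
        body = json.loads(payload)
    except ValueError:
        return request_log          # not our payload; leave the log untouched
    if "biz_code" in body:
        # rewrite the transport-level status with the business error code
        request_log["response_code"] = body["biz_code"]
        request_log["response_message"] = body.get("biz_msg", "")
    if "order_id" in body:
        request_log["attributes.order_id"] = body["order_id"]
    return request_log

log = {"response_code": 200, "response_message": "OK"}
out = PARSERS["http"](log, b'{"biz_code": 5001, "biz_msg": "insufficient balance", "order_id": "A1"}')
```

Note how an HTTP 200 with a business-level failure inside the payload becomes visible in the request log only after the plugin runs, which is precisely the gap this mechanism closes.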

Using eBPF as infrastructure to sense business

Finally, the title of this section is "Building high-quality observability signal sources using eBPF." Our view is that eBPF is infrastructure, and the first step in any observability build-out. On the solid foundation of zero-intrusion observability, we can bring in traditional intrusive data on demand and inject unified tags to build an even more powerful observability platform. eBPF's capabilities are like entering a cheat code at the start of a StarCraft match to reveal the entire map, while traditional intrusive data is like the business side sending out science vessels to scan local areas on demand.

02

Use LLM to build efficient observability agents

The second topic today is how to use the capabilities of LLMs to build observability agents on top of high-quality data. In the past, the biggest pain point of AIOps was poor data quality (low coverage, messy formats); when a team set out to do AIOps, it usually took half a year or more just to push data governance forward. Now, the high-quality data from eBPF gives us a solid foundation. At the same time, we are entering the AGI era, and LLMs have demonstrated capabilities far beyond earlier small models. That is why we believe eBPF + LLM is the infrastructure for realizing observability agents. Let me share some of DeepFlow's practice in this area.

At this stage, DeepFlow does not build agents for every problem. We want to survey the pain points across the whole development, testing, and operations process, select the two or three most painful ones that observability plus agents can solve, and focus on those first. The first scenario we found is work orders, specifically the chaotic process in the early stage of a work order group chat; the second is changes, specifically quickly demarcating performance degradation after a release; the third is vulnerabilities, which we are still exploring, and today I will share some preliminary thoughts.

Inefficiencies in daily work

Inefficiency of work orders: suppose an alarm in your company triggers the creation of a group chat, for example a Feishu group. Here is a typical scenario we saw at a customer site. The first person pulled into the work order group looks at the call-chain tracing data, does some analysis, and finds it is not his problem; he pulls a second person into the group, who looks at some metrics, analyzes them, and finds it is not her problem either; a third person is pulled in, looks at the event data, and again concludes it is not his problem; then a fourth, a fifth, and so on. Perhaps only when Lao Wang, the most capable person in the department, is pulled in does a deeper and more thorough analysis finally settle the matter, and the work order gets forwarded to the right owner, Xiao Li. We often find that before the work order reaches Xiao Li, the process is chaotic and inefficient and can take more than an hour. Even though the people pulled in early do not participate throughout, the mere existence of the group keeps interrupting their normal work, significantly hurting the efficiency of everyone in it.

Work order inefficiency

How can an observability agent solve this problem? When a work order is created, the AI Agent-driven bot is automatically pulled into the work order group. The AI Agent first calls the DeepFlow API to examine tracing, metrics, events, logs, and other data types, uses a series of statistical algorithms to summarize features of the data (which greatly reduces the token count), and then uses these features as the prompt to call the LLM (currently mainly GPT-4) for analysis. After analyzing one type of data, the AI Agent uses the LLM's Function Calling or JSON Mode capabilities to decide which other type or types of data to analyze next. Finally, the AI Agent asks the LLM to produce a summary based on all the analysis results.
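The analyze-then-decide loop described above can be sketched in Python. Everything here is a stand-in: `llm` is any callable that returns JSON-mode-style decisions (a real deployment would wrap GPT-4 function calling), `fetch` stands in for the DeepFlow API, and `summarize` is a toy feature extractor; the response schema (`analysis`, `next`) is invented for illustration.

```python
def summarize(records):
    """Toy statistical feature summary that keeps the prompt small."""
    delays = [r["delay_ms"] for r in records]
    return {"count": len(records), "max_delay_ms": max(delays) if delays else 0}

def run_ticket_agent(llm, fetch, start="tracing", max_rounds=4):
    """Analyze one data type per round; the LLM decides what to look at next,
    then produces a final summary over all findings."""
    findings = {}
    next_type = start
    for _ in range(max_rounds):
        raw = fetch(next_type)                    # call the observability API
        features = summarize(raw)                 # compress raw data into features
        decision = llm({"analyzed": findings, "current": {next_type: features}})
        findings[next_type] = decision["analysis"]
        if decision.get("next") is None:          # LLM says no more data needed
            break
        next_type = decision["next"]
    return llm({"analyzed": findings, "finalize": True})["analysis"]

def stub_llm(prompt):
    # scripted responses standing in for GPT-4 JSON mode
    if prompt.get("finalize"):
        return {"analysis": "root cause: slow MySQL on /order; owner: payments"}
    if "tracing" in prompt["current"]:
        return {"analysis": "longest span is MySQL", "next": "metrics"}
    return {"analysis": "metrics confirm DB latency spike", "next": None}

def stub_fetch(data_type):
    return [{"delay_ms": 480.0}, {"delay_ms": 12.0}]

summary = run_ticket_agent(stub_llm, stub_fetch)
```

The key design point is that only summarized features ever reach the model, and the model's structured reply, not hard-coded logic, drives which data type is inspected next.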

In this process, DeepFlow's eBPF provides complete, full-stack observability data, and AutoTagging injects unified, semantically rich tags into all of it. Based on the analysis results, the AI Agent can use tags such as label.owner to pull the right owner into the work order group. At the current stage, although the AI Agent's demarcation accuracy cannot yet reach 100%, in most cases it compresses the chaotic first hour-plus of a work order group into about one minute, and it significantly reduces the number of people in the group, markedly improving the whole team's efficiency.

Observability agents improve work order efficiency

Inefficiency of changes: we found that in a cloud-native environment, performance degradation after a service release can have many causes, and because of the complexity of the call chain and the software stack, the on-call engineers sometimes struggle to locate the root cause; it is also easy to misjudge things that fall outside one's own area of expertise. When such problems occur, the only options are to temporarily scale out or roll back to restore the business, and to re-release only after the problem is fixed. But the test environment may not reproduce the production problem, forcing teams to introduce a series of traffic-replay mechanisms to help analyze the root cause. In this scenario, we hope the AI Agent can help developers quickly identify the root cause of the degradation so the release can go back online much sooner.

Inefficiency of change

For scenarios with complex call chains, the work order agent example above already gives a good solution. For scenarios with complex function call stacks, this is eBPF's specialty: without any intrusion, it can capture the business, library, runtime, and kernel function call stacks of a running process. Because eBPF is safe and low-overhead, profiling can stay continuously enabled, so by the time performance degrades to an intolerable level after a change, profiling data has usually already been accumulating for a while, and the AI Agent can use it to quickly demarcate the root cause.

eBPF profiling data spans a very wide technology stack, and no single business developer can understand all of it; this is exactly what an AI Agent is good at. As shown in the figure below, an LLM usually already understands kernel functions, runtime functions, and basic library functions, so it can analyze those directly. Even where the LLM falls short, the software projects these functions belong to change infrequently and count as common knowledge, so we can consider fine-tuning to strengthen what the LLM has not mastered. Next come commonly used application libraries, such as Python's Requests: they are numerous, iterate quickly, and have rich interface documentation, so we can vectorize their docs and use a RAG mechanism to enhance the LLM's analysis. Further up is the enterprise's internal business code, which is not common knowledge, is larger in volume, and changes even faster, so we choose to optimize the prompt and feed the context to the LLM directly. For example, we can inject the corresponding Git commit_id as a label into K8s; DeepFlow's AutoTagging then lets the AI Agent locate recent code changes via the commit_id and pass them to the LLM. Through the combined use of LLM, fine-tuning, RAG, and prompt engineering, all the professional domains involved in eBPF full-stack profiling data can be covered, helping developers quickly identify root causes.
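The tiered strategy above can be expressed as a simple routing function. This is a minimal Python sketch; the tier names (`kernel`, `runtime`, `base-library`, `app-library`, `business`) are illustrative labels for profiling stack frames, not a real classifier.

```python
def enhancement_strategy(frame_kind: str) -> str:
    """Pick the LLM-enhancement technique for one profiling stack frame,
    following the tiers described in the text (labels are illustrative)."""
    if frame_kind in ("kernel", "runtime", "base-library"):
        # slow-moving common knowledge: the base model, fine-tuned if needed
        return "fine-tuning"
    if frame_kind == "app-library":
        # numerous, fast-iterating, well documented: retrieve docs via RAG
        return "rag"
    # in-house business code: feed recent commits directly via the prompt
    return "prompt-engineering"

stack = ["kernel", "runtime", "app-library", "business"]
plan = {kind: enhancement_strategy(kind) for kind in stack}
```

In practice the "prompt-engineering" branch is where the commit_id tag pays off: the agent can look up the recent diff for a business frame and place it directly into the prompt.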

Observability agents improve change efficiency

Inefficiency of vulnerabilities: one report states that "76% of vulnerability remediation may be wasted effort, and only 3% of vulnerabilities deserve priority attention." DeepFlow is still exploring how an AI Agent can improve efficiency here. What is certain is that eBPF is an excellent data collection technology for cloud workload security. Isovalent summarizes the four golden observation signals of security as Process Execution, Network Socket, File Access, and Layer 7 Network Identity. DeepFlow currently covers these four signals partially and will continue to strengthen them. I believe that once this data is complete, combined with an LLM, we will be able to build a very compelling AI agent for security scenarios.

To close this part, a word on how we continuously improve the AI Agent. Taking the work order scenario as an example, we use chaos engineering in the test environment to generate a large amount of anomaly data. Because we know the true root causes of these anomalies, they can be used to evaluate the AI Agent and improve it continuously. In the production environment (note: the right side of the slide should read "production environment"), we added a mechanism for users to rate the Agent's answers, and Agent developers make improvements based on those ratings.
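The chaos-engineering evaluation loop amounts to scoring the agent against cases with known, injected root causes. A minimal Python sketch, where the case format and the toy agent (which just blames the slowest service) are invented for illustration:

```python
def demarcation_accuracy(cases, agent):
    """Fraction of chaos-engineering cases where the agent names the injected root cause."""
    hits = sum(1 for c in cases if agent(c["symptoms"]) == c["root_cause"])
    return hits / len(cases)

def toy_agent(symptoms):
    # stand-in for the real agent: blames the service with the worst latency
    return max(symptoms, key=symptoms.get)

cases = [
    {"symptoms": {"mysql": 480.0, "redis": 3.0}, "root_cause": "mysql"},
    {"symptoms": {"dns": 120.0, "gateway": 45.0}, "root_cause": "gateway"},
]
score = demarcation_accuracy(cases, toy_agent)
```

Tracking this score over time, alongside user ratings from production, gives a regression signal for each change to the agent's prompts or tooling.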

How to continuously improve Agent

03

Observability agent practice examples for DeepFlow users

So what does DeepFlow's AI Agent look like today? A quick tour: in the DeepFlow Enterprise Edition, the AI Agent can be summoned from the topology map, call-chain tracing, and continuous profiling pages. The Agent's API can also be called from a Feishu ChatBot to act as a work order expert. After the AI Agent gives its first round of summary, it also suggests three or four follow-up questions, which you can click to continue the conversation; of course, users can also type their own questions.

DeepFlow Tracing Agent

DeepFlow Profiling Agent

In addition, the AI Agent capability was also released in the DeepFlow Community Edition today; it can currently analyze the data in the current Grafana panel. We support two panels, Topo and Tracing, and have adapted four large models: GPT, Tongyi Qianwen, Wenxin Yiyan, and ChatGLM. You are welcome to download and try it out.

AskGPT - Grafana DeepFlow Topo Plugin

AskGPT - Grafana DeepFlow Tracing Plugin

04

Thoughts on future evolution directions

Finally, let me share the future evolution direction of DeepFlow AI Agent.

Today, eBPF can fully cover cloud applications. Next, we will extend its capabilities to the device side, including autonomous driving and smart-cockpit domain controllers in smart cars, as well as some smartphone scenarios where permissions allow.

On the other hand, we also see plenty of room to optimize RAG. Here we recommend a survey, "Retrieval-Augmented Generation for Large Language Models: A Survey," which we hope you will find helpful.

05

What is DeepFlow

DeepFlow is an observability product developed by Yunshan Networks, aiming to provide deep observability for complex cloud infrastructure and cloud-native applications. Based on eBPF, DeepFlow collects observation signals such as application performance metrics, distributed traces, and continuous profiling with zero intrusion (Zero Code), and combines them with smart label (SmartEncoding) technology to achieve full-stack (Full Stack) correlation and efficient access to all observation signals. With DeepFlow, cloud-native applications automatically gain deep observability, relieving developers of the heavy burden of constant instrumentation and giving DevOps/SRE teams monitoring and diagnostic capabilities from code to infrastructure.

GitHub address: https://github.com/deepflowio/deepflow

Visit the DeepFlow Demo to experience zero-instrumentation, full-coverage, fully correlated observability.


Origin my.oschina.net/u/3681970/blog/11183273