During a continuous stress test, we found that the memory usage of GreptimeDB's Frontend node kept growing even though the request volume was stable, until the process was killed by the OOM killer. We concluded that the Frontend had a memory leak, and set out to track it down.
Heap Profiling
On a large project, it is almost impossible to find a memory leak just by reading the code, so the first step is to gather statistics on the program's memory usage. Fortunately, jemalloc, the allocator used by GreptimeDB, comes with heap profiling, and GreptimeDB supports exporting jemalloc profile dump files. We dumped a memory profile when the Frontend node's memory reached 300MB and again at 800MB, then used jeprof, the tool that ships with jemalloc, to compute the difference between the two dumps (the --base parameter), and finally rendered the result as a flame graph:
The long block in the middle of the graph is the 500MB of memory that kept growing. Looking closely, it contains thread-related stack traces. Could too many threads have been created? I ran the ps -T -p command against the Frontend node's process several times. The number of threads was stable at 84, and all of them were threads we expected to be created, so "too many threads" could be ruled out.
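If the thread count needs to be sampled repeatedly, the same check can be done from code. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/task contains one directory per thread; the function name thread_count is hypothetical and is not part of GreptimeDB.

```rust
use std::fs;
use std::io;
use std::process;

/// Counts the threads of process `pid` by listing `/proc/<pid>/task`,
/// which on Linux holds one directory per thread.
/// Hypothetical helper for illustration only.
fn thread_count(pid: u32) -> io::Result<usize> {
    let entries = fs::read_dir(format!("/proc/{pid}/task"))?;
    Ok(entries.filter_map(Result::ok).count())
}

fn main() {
    // Inspect the current process; pass another PID to inspect a different one.
    match thread_count(process::id()) {
        Ok(n) => println!("threads: {n}"),
        Err(e) => eprintln!("could not read thread list (not Linux?): {e}"),
    }
}
```

A count that stays flat across samples, as it did for us, points away from a thread leak.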
Looking further down, we found many stack traces related to the Tokio runtime, and leaked Tokio tasks are a common cause of memory leaks. This called for another tool: Tokio Console.
Tokio Console
Tokio Console is Tokio's official diagnostic tool. Its output was as follows:
We saw that there were 5559 running tasks, and most of them were in the Idle state! This confirmed that the memory leak was in Tokio tasks. The question then became: where in the GreptimeDB code are so many Tokio tasks spawned that never finish?
From the "Location" column in the output above, we can see where the tasks are spawned:
impl Runtime {
    /// Spawn a future and execute it in this thread pool
    ///
    /// Similar to tokio::runtime::Runtime::spawn()
    pub fn spawn<F>(&self, future: F) -> JoinHandle<F::Output>
    where
        F: Future + Send + 'static,
        F::Output: Send + 'static,
    {
        self.handle.spawn(future)
    }
}
The next task is to find all the code in GreptimeDB that calls this method.
..Default::default()!
After careful inspection of the code, we finally located the leaked Tokio tasks and fixed the leak in PR #1512. In short, the constructor of a frequently created struct spawned a Tokio task that keeps running in the background, and the task was never reclaimed. From a resource management point of view, spawning a task in a constructor is not a problem in itself, as long as the task is reliably terminated in Drop. Our memory leak came from ignoring this convention.
The same constructor was also called from the struct's Default::default() method, which made the root cause harder to find.
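To illustrate the convention of ending constructor-spawned background work in Drop, here is a minimal sketch. It is not the actual fix from PR #1512. It swaps the Tokio task for a std::thread worker so that it runs without external crates, and the names Heartbeat and live_workers_after are hypothetical. With Tokio, an analogous cleanup could be to keep the task's JoinHandle and call abort() on it in Drop.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical struct that starts background work in its constructor.
struct Heartbeat {
    stop: Arc<AtomicBool>,
    worker: Option<JoinHandle<()>>,
}

impl Heartbeat {
    /// `live` counts running workers so the effect of `Drop` can be observed.
    fn new(live: Arc<AtomicUsize>) -> Self {
        let stop = Arc::new(AtomicBool::new(false));
        let stop_flag = Arc::clone(&stop);
        live.fetch_add(1, Ordering::SeqCst);
        let worker = thread::spawn(move || {
            // Stand-in for periodic background work; loops until told to stop.
            while !stop_flag.load(Ordering::SeqCst) {
                thread::sleep(Duration::from_millis(1));
            }
            live.fetch_sub(1, Ordering::SeqCst);
        });
        Heartbeat { stop, worker: Some(worker) }
    }
}

impl Drop for Heartbeat {
    fn drop(&mut self) {
        // Signal the worker to stop, then wait for it to exit.
        self.stop.store(true, Ordering::SeqCst);
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

/// Creates and drops `n` instances, then returns how many workers are still alive.
fn live_workers_after(n: usize) -> usize {
    let live = Arc::new(AtomicUsize::new(0));
    for _ in 0..n {
        // Dropped at the end of each iteration, which stops its worker.
        let _hb = Heartbeat::new(Arc::clone(&live));
    }
    live.load(Ordering::SeqCst)
}

fn main() {
    println!("workers still alive: {}", live_workers_after(20));
}
```

If the Drop implementation is removed, every worker keeps looping after its Heartbeat is gone, which mirrors the pile-up of idle tasks we observed.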
Rust has a convenient way to construct a struct from another instance of the same struct, the "Struct Update Syntax". If the struct implements Default, we can simply write ..Default::default() in the struct expression to fill in the remaining fields. However, if Default::default() has a "side effect" (such as the cause of our memory leak: spawning a Tokio task that runs in the background), special care is needed: once the struct has been constructed, the temporary struct created by Default::default() is dropped, and the resources it acquired must be properly released.
Here is a small example (Rust Playground):
struct A {
    i: i32,
}

impl Default for A {
    fn default() -> Self {
        println!("called A::default()");
        A { i: 42 }
    }
}

#[derive(Default)]
struct B {
    a: A,
    i: i32,
}

impl B {
    fn new(a: A) -> Self {
        B {
            a,
            // A::default() is called in B::default(), even though "a" is provided here.
            ..Default::default()
        }
    }
}

fn main() {
    let a = A { i: 1 };
    let b = B::new(a);
    println!("{}", b.a.i);
}
Running it prints called A::default(), which shows that the default method of struct A is called even though a is supplied explicitly.
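One way to avoid the hidden call is to name every field instead of using ..Default::default(). The sketch below is an illustration built on the example above, not code from GreptimeDB. The counter DEFAULT_CALLS and the functions with_update_syntax, with_explicit_fields and default_calls_during are hypothetical names, and the counter stands in for a real side effect such as spawning a task.

```rust
use std::cell::Cell;

thread_local! {
    /// Counts how often the side effect in `A::default()` fires on this thread.
    static DEFAULT_CALLS: Cell<usize> = Cell::new(0);
}

struct A {
    i: i32,
}

impl Default for A {
    fn default() -> Self {
        // Stand-in for a side effect such as spawning a background task.
        DEFAULT_CALLS.with(|c| c.set(c.get() + 1));
        A { i: 42 }
    }
}

#[derive(Default)]
struct B {
    a: A,
    i: i32,
}

impl B {
    /// Builds a whole temporary `B` (and therefore an `A`) just to take its `i`.
    fn with_update_syntax(a: A) -> Self {
        B { a, ..Default::default() }
    }

    /// Names every field, so no temporary `B` is built.
    fn with_explicit_fields(a: A) -> Self {
        B { a, i: i32::default() }
    }
}

/// Returns how many times `A::default()` ran while `build` constructed a `B`.
fn default_calls_during(build: fn(A) -> B) -> usize {
    let before = DEFAULT_CALLS.with(|c| c.get());
    let b = build(A { i: 1 });
    // Both constructors produce the same value.
    assert_eq!((b.a.i, b.i), (1, 0));
    DEFAULT_CALLS.with(|c| c.get()) - before
}

fn main() {
    println!("update syntax: {}", default_calls_during(B::with_update_syntax));
    println!("explicit fields: {}", default_calls_during(B::with_explicit_fields));
}
```

Both constructors return the same B, but only the update-syntax version triggers the side effect. Naming the fields is more verbose, so it is a trade-off that mainly pays off when a field's Default does real work.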
Summary
- To troubleshoot memory leaks in Rust programs, we can use jemalloc's heap profiling to export dump files, then generate a flame graph to visualize memory usage.
- Tokio Console makes it easy to inspect the state of tasks in the Tokio runtime; pay special attention to a growing number of idle tasks.
- Try not to put code with side effects in the constructors of commonly used structs.
- Default should only be used for value-type structs.
About Greptime
Greptime Technology was founded in 2022 and is currently building and improving two products: the time series database GreptimeDB and GreptimeCloud.
GreptimeDB is a time series database written in Rust. It is distributed, open source, cloud native, and highly compatible. It helps enterprises read, write, process, and analyze time series data in real time while reducing the cost of long-term storage.
Based on the open source GreptimeDB, GreptimeCloud provides users with a fully managed DBaaS, as well as application products for observability, the Internet of Things, and other fields. Delivering software and services through the cloud enables rapid self-service provisioning and delivery, standardized operation and maintenance support, and better resource flexibility. GreptimeCloud has officially opened for internal testing. Follow our official account or website for the latest developments!
Official website: https://greptime.com/
Public account: GreptimeDB
GitHub: https://github.com/GreptimeTeam/greptimedb
Documentation: https://docs.greptime.com/
Twitter: https://twitter.com/Greptime
Slack: https://greptime.com/slack
LinkedIn: https://www.linkedin.com/company/greptime/