Who stole 1/3 of my CPU? The root cause of a weird Go performance problem

Many articles explain how to diagnose a process whose CPU usage is high or even close to 100%. Those cases are actually straightforward: almost always some busy loop is spinning. This post describes a subtler problem I ran into recently: a Go program that top reported at about 30% CPU while completely idle. Performance problems like this are much harder to figure out.

The problem occurred in my own high-performance multi-group Raft library Dragonboat, an Apache-2.0 licensed open-source multi-group Raft library implemented in Go. Because performance is its core selling point, the CPU cost of every function is well understood. Then one day I noticed that, under no load at all, the process was consuming about 30% of one CPU core, as shown in the following figure:

When a server program is idle, the CPU load shown in top should normally be in the low single digits. Deeper analysis revealed a problem buried in the implementation of the Go scheduler.

Observations

Given the top output above, a quick strace -c showed a large number of futex calls. Starting a few more idle processes like this one and grabbing a flame graph produced the following:

The discussion below assumes a basic understanding of scheduling in recent versions of Go (roughly 1.8-1.11), including what M, P, and G mean and how they interact. If you are not yet familiar with them, you can refer to this article or its Chinese translation.

Several points can be observed in the flame graph:

  • Putting the current M to sleep is quite expensive, and it is triggered by tickWorkerMain reading from a channel.
  • The runtime.futex samples are consistent with the futex calls reported by strace above.

Looking at the code, there is a 1kHz ticker in Dragonboat's tickWorkerMain, which amounts to reading the channel ticker.C 1000 times per second. My first reaction was disbelief: common sense says a simple, non-strict 1kHz ticker on a roughly 3GHz server processor should cost perhaps 1-2% CPU, which is very low. All I wanted was for one function to be called 1000 times per second; how could that consume nearly 30% of a core?

Setting the cause aside for a moment, let's reproduce the problem in isolation. To have "a function called 1000 times a second" in Go, you can use a ticker like this:

package main

import (
    "time"
)

func main() {
    ticker := time.NewTicker(time.Millisecond)
    defer ticker.Stop()
    for range ticker.C {
    }
}

Running the above program, top reports a %CPU value of about 25%. Is it a problem with Go's ticker implementation? Let's replace it with a sleep loop:

package main

import (
    "time"
)

func main() {
    for {
        time.Sleep(time.Millisecond)
    }
}

Running the time.Sleep version, top reports about 15% %CPU, compared with roughly 1% for the equivalent sleep loop in C++. Suspicion started to fall on Go's scheduler.

Analysis

Going back to the flame graph above, the chain of operations following park_m() stands out. We already know that M stands for Machine, usually understood as an OS thread. The stopm() that follows park_m() does what its name suggests: it idles the current M and tells the runtime that this M is not needed for the time being.

Everything now starts to make sense. Each time tickWorkerMain waits for the next tick, that is, reads from ticker.C, the current goroutine is parked, because the channel stays empty until the timer fires. Parking itself is very cheap: only a flag needs to change. But because the system is idle and there are no other goroutines to schedule, the Go scheduler must then put the M to sleep, and that operation is heavy: it takes a lock and ultimately issues the futex syscall. More specifically, this also interacts with Go's background timer implementation and system monitor (note runtime.sysmon in the flame graph), which I will not expand on here. Everyone will tell you that goroutine scheduling is a light operation, and that is certainly true. What nobody tells you is the more important point: repeatedly and frequently having no goroutine to schedule is expensive in Go's current implementation.

It must be stressed that this problem occurs only because the system is idle and there is nothing to schedule. When the system is busy, that is, when CPU cycles are actually precious, the 30% overhead described above disappears: there is almost always a runnable goroutine, so the heavy operation of putting an M to sleep is never needed.

The impact of this issue is concrete and objective:

  • Users will repeatedly ask why the process sits at 30% %CPU while idle. It is not a good look for a process to top the top listing when doing nothing.
  • On a slow, battery-powered ARM core, this kind of idle burn is a real problem.

The cgo workaround

We also already know that C/C++ can do the same thing very cheaply. Without changing the business logic, the first idea that comes to mind is to avoid putting the M to sleep, so the "no goroutine to schedule" situation never arises. For example, via cgo, an OS thread can generate this 1kHz tick independently of the Go scheduler, calling the desired Go function 1000 times per second from C. The idea is easy to implement: call a C function from Go, and have that C function wake from sleep 1000 times per second to invoke the 1kHz tick handler in Go. The specific code is omitted here.

Applying this idea to Dragonboat's code, the CPU load during idle time dropped dramatically:

 

Results

The workaround above greatly reduces the impact of this problem on your software. The next steps are to complain on golang-nuts and then report it to golang's issue tracker. The fundamental problem lies in the Go scheduler's implementation: using a 1kHz ticker should not be this hard for users. The real fix is a more efficient implementation in the standard library and the runtime themselves.

 

During the development of Dragonboat, performance regressions like this surfaced almost every week. The evolution from 100,000 operations per second to 10 million was a process of continuously deepening my understanding of the Raft protocol and of the Go runtime's behavior. More hands-on performance optimization write-ups like this are coming, all based on Go, the most popular language for backend development, and all covering scenarios that any application will encounter. You are welcome to try Dragonboat, and please click Star to support its continued development.
