[Translation] Linus politely criticizes a developer over spinlocks

Word has it that Linus recently criticized a developer, patiently and politely. The original post is here: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723

I found it interesting enough to translate today. It is not hard to follow, but I have added quite a few translator's notes along the way; they reflect my own understanding and I cannot guarantee they are all correct, though they should be roughly right. My translation comes first, followed by the original.

Translation:
The whole post is just wrong, and it measures something completely different from what the author thinks and claims to be measuring.

First off, you have to understand that spinlocks should only be used when you know you will not be scheduled away while using them. (Translator's note: that is, the thread that owns a spinlock should stay on its CPU the whole time rather than being taken off it by the scheduler; "scheduled" here means "scheduled off the CPU".) But the author of the blog post seems to be implementing his own spinlocks in user space, with no regard for whether the lock's user (translator's note: the thread holding the lock) gets scheduled away or not. And the code behind the claimed "nobody is holding the lock" timing is complete garbage.

It reads the time just before releasing the lock, reads it again just after acquiring the lock, and claims the difference between the two is the time during which nobody held the lock. Which is simply inane, pointless, and completely wrong.
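
Translator's note: to make the criticism concrete, here is a minimal C++ sketch of the kind of measurement being described. It is my own illustration, not the blog author's actual code; names like naive_spinlock and release_stamp are made up.

```cpp
#include <atomic>
#include <chrono>

// A hand-rolled user-space spinlock of the kind being criticized.
struct naive_spinlock {
    std::atomic<bool> locked{false};
    void lock() {
        // Busy-wait until we flip the flag; this burns CPU even if the
        // current owner has been scheduled off its CPU.
        while (locked.exchange(true, std::memory_order_acquire)) { }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

naive_spinlock lk;
std::chrono::steady_clock::time_point release_stamp;  // only touched while holding lk

void critical_section() {
    lk.lock();
    auto acquired_at = std::chrono::steady_clock::now();
    // The benchmark treats (acquired_at - release_stamp) as "time the lock
    // was free". If the previous owner was preempted between taking its
    // timestamp and calling unlock(), this interval is mostly scheduler
    // delay, not time during which the lock was actually free.
    auto claimed_idle = acquired_at - release_stamp;
    (void)claimed_idle;

    // ... protected work ...

    release_stamp = std::chrono::steady_clock::now();  // timestamp first...
    lk.unlock();                                       // ...preemption can land in between
}
```

The a/b/c list below walks through how preemption between those last two lines breaks the measurement.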

That's pure garbage. What happens is this:

a. since you are spinning, you are using CPU time
b. at some random moment, the scheduler will take the CPU away from you (translator's note: i.e. you get scheduled out)
c. that random moment may land right after you read the time, but before you actually release the spinlock

So you still hold the lock, but you have used up your time slice, so you are no longer on the CPU. The "current time" you just recorded is already stale: it is not the moment at which the lock is actually released.

At this point somebody else wants that "spinlock", and since it has not been released, they will now spin and wait for a while: the spinlock is still held by the previous thread and has not been released, that thread simply is not on a CPU right now. At some point the scheduler runs the lock-holding thread again, and the lock finally gets released. The waiter then acquires the lock, reads the current time, compares it with the earlier timestamp, and says: "oh, the lock went unheld for a really long time." (Translator's note: in reality the lock was held the whole time; its owner just never ran on a CPU after recording the timestamp.)

And note that the above may still be the good scenario. If you have more threads than CPUs (perhaps because of other, unrelated processes), the next thread to be scheduled may not be the one that will release the lock. It may instead be another thread that wants the lock, while the thread that actually holds it is not running at all!

So the code in question is pure garbage. You can't do spinlocks like that. Or rather, you very much can, and when you do you are measuring random latencies and getting meaningless values, because what you are really measuring is "I have a lot of busywork, all the processes are CPU-bound, and I am sampling random points of how long the scheduler kept a process in place."

And then you write a blog post blaming others, without understanding that it is your own buggy code that is garbage and is producing random garbage values.

Then you test different schedulers and get different random values that you find interesting, because you think they show something cool about the schedulers.

But no. You are just getting random values because different schedulers have different heuristics for "do I let CPU-bound processes run long time slices or not". Especially in a load where every thread is spinning in this silly, buggy benchmark, they all look like pure throughput benchmarks that are not actually waiting on each other.

You might even see things like "when I run this as a foreground UI process I get different numbers than when I run it in the background as a batch process". Cool, interesting numbers, right?

No, they are not cool or interesting at all; you have just built a particularly bad random number generator.

So what's the fix?

Use a lock where you tell the system that you are waiting for it, and where the releasing thread notifies you when the lock is released, so that the scheduler can actually work with you instead of (randomly) working against you.

Notice how, when the author uses a real std::mutex, things just work fairly well, regardless of the scheduler. Because now you are doing what you are supposed to do. Yes, the timing values may still be off - bad luck is bad luck - but at least now the scheduler knows you are waiting on a lock.
Translator's note: these two paragraphs roughly mean: use the system's lock, so the scheduler knows who is waiting for what and lets waiters sleep instead of burning CPU; the timing values may still be imprecise, because recording the time and releasing the lock cannot be done atomically.
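
Translator's note: for comparison, a minimal sketch of the suggested fix, reusing the hypothetical critical_section from the earlier sketch. With a sleeping lock such as std::mutex, a contended waiter blocks in the kernel and is woken by the releasing thread, so the scheduler knows exactly who is waiting for whom.

```cpp
#include <mutex>

std::mutex m;

void critical_section() {
    std::lock_guard<std::mutex> guard(m);  // contended waiters sleep instead of spinning
    // ... protected work ...
}   // unlock happens here; the OS wakes one of the sleeping waiters, if any
```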

Or, if you really want to use spinlocks (hint: you don't), make sure that while you hold the lock you stay on the CPU. You need a realtime scheduler for that (or you need to be the kernel: inside the kernel spinlocks are fine, because the kernel itself can say "hey, I'm holding a spinlock, you can't schedule me right now" (translator's note: i.e. you can't take my CPU away, can't schedule me off the CPU)).

But if you use a realtime scheduler, you need to be aware of its other implications. There are many, and some of them are deadly. I would strongly suggest not trying. You will likely get plenty of other things wrong anyway, and now some of those mistakes (such as unfairness or priority inversion) can hang your whole program outright, and things go from "slow because my locking is bad" to "not working at all, because I didn't think through a lot of other things".

Note that even OS kernels can run into this problem - imagine what happens in a virtualized environment where the hypervisor overcommits physical CPUs while handing them out as virtual CPUs. Yes, exactly: don't do that. Or at least be aware of it, and use virtualization-aware, paravirtualized spinlocks so that you can tell the hypervisor "hey, don't schedule me away right now, I'm in a critical region".
Translator's note: this paragraph roughly says that in a virtualized environment where physical CPUs are overcommitted to virtual CPUs, a virtual CPU spinning on a lock can be descheduled by the hypervisor simply because there are more runnable virtual CPUs than physical ones (i.e. over-commit), stalling the threads that are spinning.

Because otherwise you may at some point get scheduled off the CPU while holding the lock - perhaps right after you have done all the work and are just about to release it - and then everyone who wants the lock is blocked for as long as you are off the CPU, all of them spinning on their CPUs and making no progress.
Translator's note: this is really just saying that a thread holding a spinlock must not be scheduled off the CPU. Apart from using a realtime scheduler, the only way to guarantee that is to use spinlocks inside the kernel; in user space there is no way to stop the scheduler from taking the thread off the CPU. The result is that nobody makes progress and CPU time is wasted: every thread that wants the lock spins, while the thread that holds it sits off the CPU.

Really, it's that simple.

This has absolutely nothing to do with cache coherence latencies. It has everything to do with badly implemented locking.

I repeat: do not use spinlocks in user space unless you actually know what you're doing. And be aware that the likelihood that you know what you're doing is basically nil.

There is a very real reason why you need to use sleeping locks (like pthread_mutex etc.).

In fact, I'd go even further: never roll your own locking routines. Whether they are spinlocks or not, you will get it wrong. You'll get the memory ordering wrong, or you'll get the fairness wrong, or you'll run into problems like the one above: busy-looping while the lock holder has been scheduled off the CPU.

And no, sprinkling random "sched_yield()" calls into your spin loop does not really help. It can easily lead to scheduling storms while everyone yields to all the wrong processes.
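
Translator's note: as an illustration (again my own sketch, not code from the post), this is roughly what "adding sched_yield() while spinning" looks like; std::this_thread::yield() typically maps to sched_yield() on Linux, and the yielding thread has no way to hand the CPU specifically to the lock holder.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> locked{false};

void lock_with_yield() {
    while (locked.exchange(true, std::memory_order_acquire)) {
        // Gives up the CPU, but to whatever the scheduler picks next --
        // often another spinner rather than the thread holding the lock.
        std::this_thread::yield();
    }
}
```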

Sadly, even system locking isn't necessarily wonderful. For a lot of benchmarks, for example, you want unfair locking, because it can improve throughput enormously. But that can cause bad latencies. And standard system locking (e.g. pthread_mutex_lock()) has no flag to say "I care about fair locking because latency is more important to me than throughput".
Translator's note: the point here is roughly that system locking is not perfect either; in some situations it will not behave the way the user would like, because it has to pick a compromise rather than adapt to changing requirements - as illustrated by the lack of a flag to favor latency over throughput.

So even if you use locking correctly in the technical sense and avoid the outright bugs, you may still get the wrong kind of locking behavior for your load. Throughput and latency really do have very antagonistic tendencies where locking is concerned. An unfair lock that keeps the lock with a single thread (or on a single CPU) gives much better cache locality and much better throughput numbers.

But an unfair lock that favors the local thread and the local CPU core can directly cause latency spikes when some other core wants the lock, even though keeping it core-local helps cache behavior. A fair lock, by contrast, avoids the latency spikes, but causes a lot of cross-CPU cache coherency traffic, because the locked region now migrates much more aggressively from one CPU to another.
Translator's note: these paragraphs keep making one point: because of how modern CPUs are designed, an unfair lock is very efficient locally (on the current CPU core) but drives up latency elsewhere in the system; a fair lock is the opposite - latency stays bounded system-wide, but overall throughput is lower.
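
Translator's note: to show what a "fair" lock looks like, here is a minimal ticket-lock sketch (my own illustration; ticket locks are one of the fair schemes Linus mentions further down). Acquirers are served strictly in arrival order, which avoids starvation and latency spikes, at the cost of bouncing the lock's cache line, and the data it protects, between CPUs.

```cpp
#include <atomic>

// Minimal ticket lock: take a ticket, then wait until that ticket is served.
struct ticket_lock {
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};

    void lock() {
        unsigned ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        while (now_serving.load(std::memory_order_acquire) != ticket) {
            // Spin in FIFO order: whoever has been waiting longest goes next.
        }
    }
    void unlock() {
        // Only the current holder writes now_serving, so a plain increment is fine.
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};
```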

In general, unfair locking can get so bad latency-wise that it ends up being entirely unacceptable on larger systems. On smaller systems the unfairness may not be as noticeable, while the performance advantage is, so a system vendor will pick the unfair but faster lock-queueing algorithm.

(Pretty much every time we picked an unfair but fast locking model in the kernel, we ended up regretting it and had to add fairness.)

So you may want to look not at the standard library implementation, but at specific locking implementations for your particular needs. Which is admittedly very, very annoying. But don't write your own. Find one that somebody else has written and has spent decades actually tuning and making work.

Because you should never ever think you are clever enough to write your own locking routines. Chances are you aren't (and by "you" I very much include myself: we have been tweaking the in-kernel locking for decades, going from simple test-and-set through ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).

There is a reason why you can find decades' worth of academic papers on locking. Really. It's hard.

Linus

Original:
The whole post seems to be just wrong, and is measuring something completely different than what the author thinks and claims it is measuring.

First off, spinlocks can only be used if you actually know you’re not being scheduled while using them. But the blog post author seems to be implementing his own spinlocks in user space with no regard for whether the lock user might be scheduled or not. And the code used for the claimed “lock not held” timing is complete garbage.

It basically reads the time before releasing the lock, and then it reads it after acquiring the lock again, and claims that the time difference is the time when no lock was held. Which is just inane and pointless and completely wrong.

That’s pure garbage. What happens is that

(a) since you’re spinning, you’re using CPU time

(b) at a random time, the scheduler will schedule you out

(c) that random time might be just after you read the “current time”, but before you actually released the spinlock.

So now you still hold the lock, but you got scheduled away from the CPU, because you had used up your time slice. The “current time” you read is basically now stale, and has nothing to do with the (future) time when you are actually going to release the lock.

Somebody else comes in and wants that “spinlock”, and that somebody will now spin for a long while, since nobody is releasing it - it’s still held by that other thread entirely that was just scheduled out. At some point, the scheduler says “ok, now you’ve used your time slice”, and schedules the original thread, and now the lock is actually released. Then another thread comes in, gets the lock again, and then it looks at the time and says “oh, a long time passed without the lock being held at all”.

And notice how the above is the good scenario. If you have more threads than CPU’s (maybe because of other processes unrelated to your own test load), maybe the next thread that gets scheduled isn’t the one that is going to release the lock. No, that one already got its timeslice, so the next thread scheduled might be another thread that wants that lock that is still being held by the thread that isn’t even running right now!

So the code in question is pure garbage. You can’t do spinlocks like that. Or rather, you very much can do them like that, and when you do that you are measuring random latencies and getting nonsensical values, because what you are measuring is “I have a lot of busywork, where all the processes are CPU-bound, and I’m measuring random points of how long the scheduler kept the process in place”.

And then you write a blog-post blaming others, not understanding that it’s your incorrect code that is garbage, and is giving random garbage values.

And then you test different schedulers, and you get different random values that you think are interesting, because you think they show something cool about the schedulers.

But no. You’re just getting random values because different schedulers have different heuristics for “do I want to let CPU bound processes use long time slices or not”? Particularly in a load where everybody is just spinning on the silly and buggy benchmark, so they all look like they are pure throughput benchmarks and aren’t actually waiting on each other.

You might even see issues like “when I run this as a foreground UI process, I get different numbers than when I run it in the background as a batch process”. Cool interesting numbers, aren’t they?

No, they aren’t cool and interesting at all, you’ve just created a particularly bad random number generator.

So what’s the fix for this?

Use a lock where you tell the system that you’re waiting for the lock, and where the unlocking thread will let you know when it’s done, so that the scheduler can actually work with you, instead of (randomly) working against you.

Notice, how when the author uses an actual std::mutex, things just work fairly well, and regardless of scheduler. Because now you’re doing what you’re supposed to do. Yeah, the timing values might still be off - bad luck is bad luck - but at least now the scheduler is aware that you’re “spinning” on a lock.

Or, if you really want to use spinlocks (hint: you don’t), make sure that while you hold the lock, you’re not getting scheduled away. You need to use a realtime scheduler for that (or be the kernel: inside the kernel spinlocks are fine, because the kernel itself can say “hey, I’m doing a spinlock, you can’t schedule me right now”).

But if you use a realtime scheduler, you need to be aware of the other implications of that. There are many, and some of them are deadly. I would suggest strongly against trying. You’ll likely get all the other issues wrong anyway, and now some of the mistakes (like unfairness or priority inversions) can literally hang your whole thing entirely and things go from “slow because I did bad locking” to “not working at all, because I didn’t think through a lot of other things”.

Note that even OS kernels can have this issue - imagine what happens in virtualized environments with overcommitted physical CPU’s scheduled by a hypervisor as virtual CPU’s? Yeah - exactly. Don’t do that. Or at least be aware of it, and have some virtualization-aware paravirtualized spinlock so that you can tell the hypervisor that “hey, don’t do that to me right now, I’m in a critical region”.

Because otherwise you’re going to at some time be scheduled away while you’re holding the lock (perhaps after you’ve done all the work, and you’re just about to release it), and everybody else will be blocking on your incorrect locking while you’re scheduled away and not making any progress. All spinning on CPU’s.

Really, it’s that simple.

This has absolutely nothing to do with cache coherence latencies or anything like that. It has everything to do with badly implemented locking.

I repeat: do not use spinlocks in user space, unless you actually know what you’re doing. And be aware that the likelihood that you know what you are doing is basically nil.

There’s a very real reason why you need to use sleeping locks (like pthread_mutex etc).

In fact, I’d go even further: don’t ever make up your own locking routines. You will get them wrong, whether they are spinlocks or not. You’ll get memory ordering wrong, or you’ll get fairness wrong, or you’ll get issues like the above “busy-looping while somebody else has been scheduled out”.

And no, adding random “sched_yield()” calls while you’re spinning on the spinlock will not really help. It will easily result in scheduling storms while people are yielding to all the wrong processes.

Sadly, even the system locking isn’t necessarily wonderful. For a lot of benchmarks, for example, you want unfair locking, because it can improve throughput enormously. But that can cause bad latencies. And your standard system locking (eg pthread_mutex_lock()) may not have a flag to say “I care about fair locking because latency is more important than throughput”.

So even if you get locking technically right and are avoiding the outright bugs, you may get the wrong kind of lock behavior for your load. Throughput and latency really do tend to have very antagonistic tendencies wrt locking. An unfair lock that keeps the lock with one single thread (or keeps it to one single CPU) can give much better cache locality behavior, and much better throughput numbers.

But that unfair lock that prefers local threads and cores might thus directly result in latency spikes when some other core would really want to get the lock, but keeping it core-local helps cache behavior. In contrast, a fair lock avoids the latency spikes, but will cause a lot of cross-CPU cache coherency, because now the locked region will be much more aggressively moving from one CPU to another.

In general, unfair locking can get so bad latency-wise that it ends up being entirely unacceptable for larger systems. But for smaller systems the unfairness might not be as noticeable, but the performance advantage is noticeable, so then the system vendor will pick that unfair but faster lock queueing algorithm.

(Pretty much every time we picked an unfair - but fast - locking model in the kernel, we ended up regretting it eventually, and had to add fairness).

So you might want to look into not the standard library implementation, but specific locking implementations for your particular needs. Which is admittedly very very annoying indeed. But don’t write your own. Find somebody else that wrote one, and spent the decades actually tuning it and making it work.

Because you should never ever think that you’re clever enough to write your own locking routines… Because the likelihood is that you aren’t (and by that “you” I very much include myself - we’ve tweaked all the in-kernel locking over decades, and gone through the simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).

There’s a reason why you can find decades of academic papers on locking. Really. It’s hard.

Linus

Reposted from blog.csdn.net/nirendao/article/details/103900311