What is a "cache-friendly" code?

This article is a translation of: What is a "cache-friendly" code?

What is the difference between "cache unfriendly code" and "cache friendly" code?

How can I make sure I write cache-efficient code?


#1

Reference: https://stackoom.com/question/184Eh/什么是-缓存友好-代码


#2

Preliminaries

On modern computers, only the lowest level memory structures (the registers) can move data around in single clock cycles. However, registers are very expensive and most computer cores have fewer than a few dozen registers (a few hundred to maybe a thousand bytes total). At the other end of the memory spectrum (DRAM), the memory is very cheap (i.e. literally millions of times cheaper) but takes hundreds of cycles after a request to receive the data. To bridge this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2, L3 in decreasing speed and cost. The idea is that most of the executing code will be hitting a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in L1 cache, then it looks in L2 cache. If not there, then L3 cache, and if not there, main memory. Each of these "misses" is expensive in time.

(The analogy: cache memory is to system memory as system memory is to hard disk storage. Hard disk storage is super cheap but very slow.)

Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter (cfr. links below): increasing bandwidth is easy, but we can't buy our way out of latency.

Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance, since every cache miss results in fetching data from RAM (or worse ...) which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.

In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or beyond). This will only get worse over time. Increasing the processor frequency is currently no longer relevant for increasing performance. The problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of die area on caches and up to 99% for storing/moving data!

There is quite a lot to be said on the subject. Here are a few great references about caches, memory hierarchies and proper programming:

Main concepts for cache-friendly code

A very important aspect of cache-friendly code is all about the principle of locality, the goal of which is to place related data close in memory to allow efficient caching. In terms of the CPU cache, it's important to be aware of cache lines to understand how this works: How do cache lines work?

The following particular aspects are of high importance to optimize caching:

  1. Temporal locality: when a given memory location is accessed, it is likely that the same location will be accessed again in the near future. Ideally, this information will still be cached at that point.
  2. Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than what was specifically asked for, because very often the program will require that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important. (A small sketch illustrating both kinds of locality follows this list.)
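
A minimal sketch of both kinds of locality (this helper is hypothetical, not part of the original answer): the accumulator total is reused on every iteration (temporal locality), while data is read front-to-back at consecutive addresses (spatial locality), so each fetched cache line is fully used.

#include <cstddef>

double sum_sequential(const double* data, std::size_t n)
{
    double total = 0.0;               // reused every iteration -> temporal locality
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];             // consecutive addresses -> spatial locality
    return total;
}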

Use appropriate containers

A simple example of cache-friendly versus cache-unfriendly is C++'s std::vector versus std::list. Elements of a std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing elements in a std::list, which stores its content all over the place. This is due to spatial locality.
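
For instance, a traversal of each container might look like the following sketch (not code from the original answer). Both functions do the same work, but the vector walks contiguous memory while the list chases pointers to nodes that may be scattered across the heap.

#include <list>
#include <numeric>
#include <vector>

long long sum_vector(const std::vector<int>& v)
{
    return std::accumulate(v.begin(), v.end(), 0LL);   // contiguous reads, cache lines fully used
}

long long sum_list(const std::list<int>& l)
{
    return std::accumulate(l.begin(), l.end(), 0LL);   // one pointer chase per element
}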

A very nice illustration of this is given by Bjarne Stroustrup in this youtube clip (thanks to @Mohammad Ali Baydoun for the link!).

Don't neglect the cache in data structure and algorithm design

Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (cfr. for example ATLAS).

Know and exploit the implicit structure of data

Another simple example, which many people in the field sometimes forget, is column-major (e.g. Fortran, MATLAB) vs. row-major (e.g. C, C++) ordering for storing two dimensional arrays. For example, consider the following matrix:

1 2
3 4

In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering, it would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). @MatteoItalia showed this example in more detail in his answer.

When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).

For simplicity, assume the cache comprises a single cache line which can contain 2 matrix elements and that when a given element is fetched from memory, the next one is too. Say we want to take the sum over all elements in the example 2x2 matrix above (let's call it M):

Exploiting the ordering (e.g. changing column index first in C++):

M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses

Not exploiting the ordering (e.g. changing row index first in C++):

M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses

In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires many more cycles than computing the sums). In practice, the performance difference can be much larger.
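
The same idea in code, as a minimal sketch (assuming a square n-by-n matrix stored row-major in a flat std::vector; not code from the original answer). The friendly version walks memory sequentially; the unfriendly version strides by a whole row per access and, once n is large, touches a new cache line almost every time.

#include <cstddef>
#include <vector>

double sum_cache_friendly(const std::vector<double>& m, std::size_t n)
{
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            total += m[i * n + j];    // consecutive addresses: whole cache lines are used
    return total;
}

double sum_cache_unfriendly(const std::vector<double>& m, std::size_t n)
{
    double total = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            total += m[i * n + j];    // stride of n elements: one element used per line fetched
    return total;
}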

Avoid unpredictable branches

Modern architectures feature pipelines and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.

This is explained very well here (thanks to @0x90 for the link): Why is processing a sorted array faster than processing an unsorted array?
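
As a minimal sketch of the pattern discussed in that question (hypothetical helpers, not code from the linked answer): a data-dependent branch on random data is mispredicted roughly half the time and stalls the pipeline, while a branchless formulation leaves nothing to mispredict (modern compilers may already perform this rewrite, so measure before relying on it).

#include <cstddef>
#include <cstdint>

// Branchy: the outcome of the if depends on the data and is hard to predict on random input.
std::int64_t sum_over_threshold_branchy(const int* data, std::size_t n)
{
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] >= 128)
            sum += data[i];
    return sum;
}

// Branchless: one possible formulation turns the condition into a mask, so there is no branch to mispredict.
std::int64_t sum_over_threshold_branchless(const int* data, std::size_t n)
{
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::int64_t mask = -static_cast<std::int64_t>(data[i] >= 128);  // 0 or all ones
        sum += mask & data[i];
    }
    return sum;
}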

Avoid virtual functions

In the context of C++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during look up, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?
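
To make the trade-off concrete, here is a minimal sketch with hypothetical types (not code from the linked question). The virtual version pays a vtable indirection per call and prevents inlining across the call; when the concrete type is known, the call can be inlined into the loop.

#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;                 // dispatched through the vtable
};

struct Circle final : Shape {
    double r = 1.0;
    double area() const override { return 3.14159265358979 * r * r; }
};

// Virtual dispatch per element: extra indirection, no inlining across the call.
double total_area_virtual(const std::vector<Shape*>& shapes)
{
    double total = 0.0;
    for (const Shape* s : shapes) total += s->area();
    return total;
}

// Concrete type known at compile time: the call is trivially inlined.
double total_area_concrete(const std::vector<Circle>& circles)
{
    double total = 0.0;
    for (const Circle& c : circles) total += c.area();
    return total;
}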

Common problems

A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when each individual processor is attempting to use data in another memory region and attempts to store it in the same cache line. This causes the cache line -- which contains data another processor can use -- to be overwritten again and again. Effectively, different threads make each other wait by inducing cache misses in this situation. See also (thanks to @Matt for the link): How and when to align to cache line size?
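
A minimal sketch of the situation (hypothetical counters, not code from the linked question, and assuming 64-byte cache lines): two threads write to different variables that happen to share a cache line, so the line ping-pongs between their cores; padding each counter onto its own line avoids this.

#include <atomic>
#include <thread>

struct SharedLineCounters {                  // a and b will usually share one cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct PaddedCounters {                      // each counter gets a cache line to itself
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
void hammer(Counters& c, long iterations)
{
    std::thread t1([&] { for (long i = 0; i < iterations; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < iterations; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
}
// hammer() on PaddedCounters typically runs noticeably faster than on SharedLineCounters.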

An extreme symptom of poor caching in RAM memory (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not in the current page) which require disk access.


#3

In addition to @Marc Claesen's answer, I think that an instructive classic example of cache-unfriendly code is code that scans a C bidimensional array (e.g. a bitmap image) column-wise instead of row-wise.

Elements that are adjacent in a row are also adjacent in memory, thus accessing them in sequence means accessing them in ascending memory order; this is cache-friendly, since the cache tends to prefetch contiguous blocks of memory.

Instead, accessing such elements column-wise is cache-unfriendly, since elements on the same column are distant in memory from each other (in particular, their distance is equal to the size of the row), so when you use this access pattern you are jumping around in memory, potentially wasting the effort of the cache of retrieving the elements nearby in memory.

And all that it takes to ruin the performance is to go from

// Cache-friendly version - processes pixels which are adjacent in memory
for(unsigned int y=0; y<height; ++y)
{
    for(unsigned int x=0; x<width; ++x)
    {
        ... image[y][x] ...
    }
}

to

// Cache-unfriendly version - jumps around in memory for no good reason
for(unsigned int x=0; x<width; ++x)
{
    for(unsigned int y=0; y<height; ++y)
    {
        ... image[y][x] ...
    }
}

This effect can be quite dramatic (several orders of magnitude in speed) in systems with small caches and/or working with big arrays (e.g. 10+ megapixel 24 bpp images on current machines); for this reason, if you have to do many vertical scans, often it's better to rotate the image by 90 degrees first and perform the various analyses later, limiting the cache-unfriendly code just to the rotation.


#4

Processors today work with many levels of cascading memory areas. So the CPU will have a bunch of memory that is on the CPU chip itself. It has very fast access to this memory. There are different levels of cache, each one slower to access (and larger) than the next, until you get to system memory, which is not on the CPU and is relatively much slower to access.

Logically, to the CPU's instruction set you just refer to memory addresses in a giant virtual address space. When you access a single memory address the CPU will go fetch it. In the old days it would fetch just that single address. But today the CPU will fetch a bunch of memory around the bit you asked for, and copy it into the cache. It assumes that if you asked for a particular address, it is highly likely that you are going to ask for an address nearby very soon. For example if you were copying a buffer you would read and write from consecutive addresses - one right after the other.

So today when you fetch an address it checks the first level of cache to see if that address has already been read into cache; if it doesn't find it, then this is a cache miss and it has to go out to the next level of cache to find it, until it eventually has to go out into main memory.

Cache friendly code tries to keep accesses close together in memory so that you minimize cache misses.

So an example would be: imagine you wanted to copy a giant 2 dimensional table. It is organized with each row consecutive in memory, one row following right after the other.

If you copied the elements one row at a time from left to right - that would be cache friendly. If you decided to copy the table one column at a time, you would copy the exact same amount of memory - but it would be cache unfriendly.


#5

Welcome to the world of Data Oriented Design. The basic mantra is to Sort, Eliminate Branches, Batch, Eliminate virtual calls - all steps towards better locality.
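
One common data-oriented rearrangement for better locality is the array-of-structs vs. struct-of-arrays choice; the following is a minimal sketch with a hypothetical particle example (not from the answer). If an update only touches positions and velocities, the struct-of-arrays layout streams exactly the bytes it needs, while the array-of-structs layout drags every unused field through the cache as well.

#include <cstddef>
#include <vector>

struct ParticleAoS {                 // array-of-structs: all fields interleaved in memory
    float x, y, z;
    float vx, vy, vz;
    float mass;
};

struct ParticlesSoA {                // struct-of-arrays: each field stored contiguously
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> mass;
};

void advance(ParticlesSoA& p, float dt)
{
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        p.x[i] += p.vx[i] * dt;      // only the arrays actually used are streamed through the cache
        p.y[i] += p.vy[i] * dt;
        p.z[i] += p.vz[i] * dt;
    }
}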

Since you tagged the question with C++, here's the obligatory typical C++ Bullshit. Tony Albrecht's Pitfalls of Object Oriented Programming is also a great introduction to the subject.


#6

Just piling on: the classic example of cache-unfriendly versus cache-friendly code is the "cache blocking" of matrix multiply.

Naive matrix multiply looks like

for(i=0;i<N;i++) {
   for(j=0;j<N;j++) {
      dest[i][j] = 0;
      for( k=0;k<N;k++) {
         dest[i][j] += src1[i][k] * src2[k][j];
      }
   }
}

If N is large, e.g. if N * sizeof(elemType) is greater than the cache size, then every single access to src2[k][j] will be a cache miss.

There are many different ways of optimizing this for a cache. Here's a very simple example: instead of reading one item per cache line in the inner loop, use all of the items:

int itemsPerCacheLine = CacheLineSize / sizeof(elemType);

for(i=0;i<N;i++) {
   for(j=0;j<N;j += itemsPerCacheLine ) {
      for(jj=0;jj<itemsPerCacheLine; jj++) {
         dest[i][j+jj] = 0;
      }
      for( k=0;k<N;k++) {
         for(jj=0;jj<itemsPerCacheLine; jj++) {
            dest[i][j+jj] += src1[i][k] * src2[k][j+jj];
         }
      }
   }
}

If the cache line size is 64 bytes, and we are operating on 32 bit (4 byte) floats, then there are 16 items per cache line. And the number of cache misses via just this simple transformation is reduced approximately 16-fold.

Fancier transformations operate on 2D tiles, optimize for multiple caches (L1, L2, TLB), and so on.

Some results of googling "cache blocking":

http://stumptown.cc.gt.atl.ga.us/cse6230-hpcta-fa11/slides/11a-matmul-goto.pdf

http://software.intel.com/en-us/articles/cache-blocking-techniques

A nice video animation of an optimized cache blocking algorithm:

http://www.youtube.com/watch?v=IFWgwGMMrh0

Loop tiling is very closely related:

http://en.wikipedia.org/wiki/Loop_tiling



Reprinted from blog.csdn.net/asdfgh0077/article/details/105492845