Real-Time Rendering: 18.2 Locating the Bottleneck

The first step in optimizing a pipeline is to locate the largest bottleneck [1679]. One way of finding bottlenecks is to set up several tests, where each test decreases the amount of work a particular stage performs. If one of these tests causes the frames per second to increase, the bottleneck stage has been found. A related way of testing a stage is to reduce the workload on the other stages without reducing the workload on the stage being tested. If performance does not change, the bottleneck is the stage where the workload was not altered. Performance tools can provide detailed information on which API calls are expensive, but do not necessarily pinpoint exactly what stage in the pipeline is slowing down the rest. Even when they do, it is useful to understand the idea behind each test.
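
The first kind of test described above can be sketched as a small helper that compares measured frame times. This is a minimal sketch, not a profiler: the name `find_bottleneck`, the stage labels, and the 5% noise threshold are assumptions made for illustration, and the frame times are expected to come from your own measurements.

```python
def find_bottleneck(baseline_ms, reduced_ms, threshold=0.05):
    """Return the stage whose reduced workload sped up the frame the most.

    baseline_ms: frame time in milliseconds with the full workload.
    reduced_ms:  dict mapping a stage name to the frame time measured
                 when only that stage's workload was reduced.
    threshold:   minimum relative saving treated as significant
                 (an assumed noise margin; tune it for your setup).
    Returns None if no test produced a significant speedup.
    """
    best_stage, best_gain = None, threshold
    for stage, ms in reduced_ms.items():
        gain = (baseline_ms - ms) / baseline_ms  # relative frame-time saving
        if gain > best_gain:
            best_stage, best_gain = stage, gain
    return best_stage


if __name__ == "__main__":
    # Hypothetical measurements: only the pixel test changes the frame time.
    tests = {"application": 15.9, "geometry": 15.8, "pixel": 11.0}
    print(find_bottleneck(16.0, tests))
```

Frame times are used rather than frames per second because savings in milliseconds can be compared directly across tests.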

What follows is a brief discussion of some of the ideas used to test the various stages, to give a flavor of how such testing is done. A perfect example of the importance of understanding the underlying hardware comes with the advent of the unified shader architecture. It forms the basis of many GPUs from the end of 2006 on. The idea is that vertex, pixel, and other shaders all use the same functional units. The GPU takes care of load balancing, changing the proportion of units assigned to vertex versus pixel shading. As an example, if a large quadrilateral is rendered, only a few shader units could be assigned to vertex transformation, while the bulk are given the task of fragment processing. As a result, pinpointing whether the bottleneck is in the vertex or pixel shader stage is less obvious [1961]. Either shader processing as a whole or another stage will still be the bottleneck, however, so we discuss each possibility in turn.

18.2.1 Testing the Application Stage

If the platform being used is supplied with a utility for measuring the workload on the processor(s), that utility can be used to see if your program uses 100% (or near that) of the CPU processing power. If the CPU is in constant use, your program is likely to be CPU-limited. This test is not foolproof, since the application may at times be waiting for the GPU to complete a frame. We talk about a program being CPU- or GPU-limited, but the bottleneck can change over the lifetime of a frame.

A smarter way to test for CPU limits is to send down data that causes the GPU to do little or no work. For some systems this can be accomplished by simply using a null driver (a driver that accepts calls but does nothing) instead of a real driver. This effectively sets an upper limit on how fast you can get the entire program to run, because you do not use the graphics hardware or call the driver, and thus the application on the CPU is always the bottleneck. By doing this test, you get an idea of how much room for improvement there is in the GPU-based stages, which are not run in the application stage. That said, be aware that using a null driver can also hide any bottleneck due to driver processing itself and communication between CPU and GPU. The driver can often be the cause of a CPU-side bottleneck, a topic we discuss in depth later on.
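
The arithmetic behind the null driver test can be written out as follows. This is a sketch under stated assumptions: `gpu_side_headroom` is a hypothetical helper, and both frame times are assumed to be averages measured on the same scene. As noted above, the difference also contains driver and CPU-to-GPU transfer cost, so it is an upper estimate rather than pure GPU time.

```python
def gpu_side_headroom(real_ms, null_driver_ms):
    """Estimate what the null driver test says about a frame.

    real_ms:        frame time in milliseconds with the real driver.
    null_driver_ms: frame time with the null driver, i.e. the cost of
                    the application alone.
    Returns (max_fps, headroom_ms):
      max_fps     - upper limit on frame rate if everything after the
                    application stage were free.
      headroom_ms - frame time attributable to the driver, transfers,
                    and GPU stages combined (clamped at zero to absorb
                    measurement noise).
    """
    if null_driver_ms <= 0:
        raise ValueError("null_driver_ms must be positive")
    max_fps = 1000.0 / null_driver_ms
    headroom_ms = max(0.0, real_ms - null_driver_ms)
    return max_fps, headroom_ms


if __name__ == "__main__":
    # Hypothetical numbers: 20 ms per frame normally, 8 ms with a null driver.
    print(gpu_side_headroom(20.0, 8.0))
```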

Another more direct method is to underclock the CPU, if possible [240]. If performance drops in direct proportion to the CPU rate, the application is CPU-bound to at least some extent. This same underclocking approach can be done for GPUs. If the GPU is slowed down and performance decreases, then at least some of the time the application is GPU-bound. These underclocking methods can help identify a bottleneck, but can sometimes cause a stage that was not a bottleneck before to become one. The other option is to overclock, but you did not read that here.
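
One way to quantify "in direct proportion" is to compare the relative drop in frame rate with the relative drop in clock rate. The helper below is an illustrative sketch; the name `clock_sensitivity` and the interpretation of its result are assumptions, not part of the cited method.

```python
def clock_sensitivity(fps_before, fps_after, clock_before, clock_after):
    """Fraction of a clock reduction that shows up as a frame-rate drop.

    A result near 1.0 means frame rate fell in proportion to the clock,
    which suggests the underclocked processor is the bottleneck.
    A result near 0.0 means the frame rate did not react.
    Clock values may be in any unit as long as both use the same one.
    """
    clock_drop = 1.0 - clock_after / clock_before
    if clock_drop <= 0.0:
        raise ValueError("clock_after must be lower than clock_before")
    fps_drop = 1.0 - fps_after / fps_before
    return fps_drop / clock_drop


if __name__ == "__main__":
    # Hypothetical run: CPU slowed from 4.0 to 3.0 GHz, 60 fps fell to 45 fps.
    print(clock_sensitivity(60.0, 45.0, 4.0, 3.0))
```

Values between the two extremes are consistent with the caveat in the text: the processor is the bottleneck only part of the time, or underclocking has itself turned it into one.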

18.2.2 Testing the Geometry Processing Stage

The geometry stage is the most difficult stage to test. This is because if the workload on this stage is changed, then the workload on one or both of the other stages is often changed as well. To avoid this problem, Cebenoyan [240] gives a series of tests working from the rasterizer stages back up the pipeline.

There are two main areas where a bottleneck can occur in the geometry stage: vertex fetching and processing. To see if the bottleneck is due to object data transfer, increase the size of the vertex format. This can be done by sending several extra texture coordinates per vertex, for example. If performance falls, this area is the bottleneck.
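
To get a feel for how much the vertex format grows in such a test, the sketch below computes the vertex stride for an assumed layout (position, normal, and one texture coordinate set, all 32-bit floats) and the resulting fetch traffic. The layout, the function names, and the simplification that every vertex is fetched exactly once per frame are all assumptions for illustration; real hardware reuses vertices through caches.

```python
import struct


def vertex_stride(extra_uv_sets=0):
    """Size in bytes of one vertex for an assumed, tightly packed layout."""
    fmt = "<"                     # standard sizes, no padding
    fmt += "3f"                   # position: x, y, z
    fmt += "3f"                   # normal: nx, ny, nz
    fmt += "2f"                   # base texture coordinates: u, v
    fmt += "2f" * extra_uv_sets   # extra texture coordinates added for the test
    return struct.calcsize(fmt)


def fetch_bytes_per_frame(vertex_count, stride):
    """Vertex data fetched per frame, assuming each vertex is read once."""
    return vertex_count * stride


if __name__ == "__main__":
    for extra in (0, 4):
        stride = vertex_stride(extra)
        print(extra, stride, fetch_bytes_per_frame(1_000_000, stride))
```

With four extra coordinate sets the stride doubles in this layout, so a fetch-bound frame should slow down noticeably while the vertex shader work stays the same.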

Vertex processing is done by the vertex shader. For the vertex shader bottleneck, testing consists of making the shader program longer. Some care has to be taken to make sure the compiler is not optimizing away these additional instructions.

If your pipeline also uses geometry shaders, their performance is a function of output size and program length. If you are using tessellation shaders, again program length affects performance, as well as the tessellation factor. Varying any of these elements, while avoiding changes in the work other stages perform, can help determine whether any are the bottleneck.

18.2.3 Testing the Rasterization Stage

This stage consists of triangle setup and triangle traversal. Shadow map generation, which uses extremely simple pixel shaders, can bottleneck in the rasterizer or merging stages. Though normally rare [1961], it is possible for triangle setup and rasterization to be the bottleneck for small triangles from tessellation or objects such as grass or foliage. However, small triangles can also increase the use of both vertex and pixel shaders. More vertices in a given area clearly increase vertex shader load. Pixel shader load also increases because each triangle is rasterized by a set of 2 × 2 quads, so the number of pixels outside of each triangle increases [59]. This is sometimes called quad overshading (Section 23.1). To find if rasterization is truly the bottleneck, increase the execution time of both the vertex and pixel shaders by increasing their program sizes. If the render time per frame does not increase, then the bottleneck is in the rasterization stage.
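
Quad overshading can be illustrated with a toy software rasterizer. The sketch below is a simplified model, not how any particular GPU works: it samples pixel centers, treats edges as inclusive instead of applying a real fill rule, and assumes that every 2 × 2 quad with at least one covered pixel shades all four of its pixels. The function names are hypothetical.

```python
def edge(ax, ay, bx, by, px, py):
    """Edge function: its sign tells which side of edge a->b point p is on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def quad_overshading(tri, width, height):
    """Count covered pixels and shaded pixels for one triangle.

    tri: three (x, y) vertices in pixel coordinates, either winding.
    Returns (covered, shaded), where shaded counts four invocations
    for every 2x2 quad touched by the triangle.
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    covered = 0
    touched_quads = set()
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            e0 = edge(x0, y0, x1, y1, px, py)
            e1 = edge(x1, y1, x2, y2, px, py)
            e2 = edge(x2, y2, x0, y0, px, py)
            inside = (e0 >= 0 and e1 >= 0 and e2 >= 0) or \
                     (e0 <= 0 and e1 <= 0 and e2 <= 0)
            if inside:
                covered += 1
                touched_quads.add((x // 2, y // 2))
    return covered, 4 * len(touched_quads)


if __name__ == "__main__":
    big = ((0, 0), (8, 0), (0, 8))
    tiny = ((0.2, 0.2), (1.2, 0.2), (0.2, 1.2))
    for name, tri in (("big", big), ("tiny", tiny)):
        covered, shaded = quad_overshading(tri, 8, 8)
        print(name, covered, shaded, shaded / covered)
```

In this model the ratio of shaded to covered pixels stays close to 1 for the large triangle and rises to 4 for the one-pixel triangle, which is why many small triangles raise pixel shader load as well as rasterizer load.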

18.2.4 Testing the Pixel Processing Stage

The pixel shader program’s effect can be tested by changing the screen resolution. If a lower screen resolution causes the frame rate to rise appreciably, the pixel shader is likely to be the bottleneck, at least some of the time. Care has to be taken if a level-of-detail system is in place. A smaller screen is likely to also simplify the models displayed, lessening the load on the geometry stage.

Lowering the display resolution can also affect costs from triangle traversal, depth testing and blending, and texture access, among other factors. To avoid these factors and isolate the bottleneck, one approach is the same as that taken with vertex shader programs: add more instructions and observe the effect on execution speed. Again, it is important to determine that these additional instructions are not optimized away by the compiler. If frame rendering time increases, the pixel shader is the bottleneck (or at least has become the bottleneck at some point as its execution cost increased). Alternatively, the pixel shader could be simplified to a minimum number of instructions, something often difficult to do in a vertex shader. If overall rendering time decreases, a bottleneck has been found. Texture cache misses can also be costly. If replacing a texture with a 1 × 1 resolution version gives considerably faster performance, then texture memory access is a bottleneck.
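
The resolution test can be turned into a rough number by fitting a two-point linear model, frame time = fixed cost + per-pixel cost × pixel count. The linear model and the name `resolution_dependent_share` are assumptions made for illustration. As the paragraph above notes, the per-pixel part mixes pixel shading with traversal, depth testing, blending, and texture access, so the result bounds the pixel shader cost rather than isolating it.

```python
def resolution_dependent_share(res_a, ms_a, res_b, ms_b):
    """Share of frame time at resolution A that scales with pixel count.

    res_a, res_b: (width, height) tuples for the two test resolutions.
    ms_a, ms_b:   frame times in milliseconds measured at each resolution.
    Returns a value near 0.0 if frame time ignores resolution and
    near 1.0 if frame time is proportional to the number of pixels.
    """
    pixels_a = res_a[0] * res_a[1]
    pixels_b = res_b[0] * res_b[1]
    if pixels_a == pixels_b:
        raise ValueError("the two resolutions must differ in pixel count")
    per_pixel_ms = (ms_a - ms_b) / (pixels_a - pixels_b)  # slope of the fit
    return per_pixel_ms * pixels_a / ms_a


if __name__ == "__main__":
    # Hypothetical measurements: 20 ms at 1920x1080, 8 ms at 960x540.
    print(resolution_dependent_share((1920, 1080), 20.0, (960, 540), 8.0))
```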

Shaders are separate programs that have their own optimization techniques. Persson [1383, 1385] presents several low-level shader optimizations, as well as specifics about how graphics hardware has evolved and how best practices have changed.

18.2.5 Testing the Merging Stage

In this stage, depth and stencil tests are made, blending is done, and surviving results are written to buffers. Changing the output bit depth for these buffers is one way to vary the bandwidth costs for this stage and see if it could be the bottleneck. Turning alpha blending on for opaque objects or using other blending modes also affects the amount of memory access and processing performed by raster operations.
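
A back-of-the-envelope estimate shows how bit depth and blending change the memory traffic of this stage. The model below is deliberately crude and rests on stated assumptions: every fragment reads and writes depth and writes color, blending adds one color read, and there is no compression, caching, or early rejection. The function name and parameters are hypothetical.

```python
def merge_bandwidth_bytes(width, height, color_bits, depth_bits,
                          overdraw=1.0, blending=False):
    """Rough per-frame buffer traffic in bytes for the merging stage.

    overdraw: average number of fragments processed per pixel.
    blending: if True, each fragment also reads the destination color.
    """
    color_bytes = color_bits // 8
    depth_bytes = depth_bits // 8
    per_fragment = depth_bytes      # depth read for the depth test
    per_fragment += depth_bytes     # depth write
    per_fragment += color_bytes     # color write
    if blending:
        per_fragment += color_bytes  # destination color read
    return width * height * overdraw * per_fragment


if __name__ == "__main__":
    for bits, blend in ((32, False), (64, False), (32, True)):
        print(bits, blend, merge_bandwidth_bytes(1920, 1080, bits, 32, blending=blend))
```

If frame time tracks these estimates when the bit depth or blending mode is changed, that is a hint that the merging stage is bandwidth-bound.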
