Alibaba Cloud Zheng Xiao: Talking about GPU Virtualization Technology (Chapter 3)

Portals to the earlier articles in this series:

Alibaba Cloud Zheng Xiao: Talking about GPU Virtualization Technology (Chapter 1): The History of GPU Virtualization. A friend @'d me and said: "Explain it a bit more plainly; without a technical background I don't think anyone can follow it..." That was enlightening. Articles written for the general public are not academic papers and should focus on popularizing basic concepts. So I decided to try writing the next articles so that casual readers can follow them, while professionals will still find plenty of insight and inspiration in them. Technical details are, for the most part, left out.


Chapter 3: GPU SRIOV and vGPU Scheduling

GPU SRIOV principle

When it comes to GPU SRIOV, there are only two products in the world so far: the S7150 and the MI25, both from AMD. Of course, AMD's product roadmap presumably extends several years out, and more GPU SRIOV products will follow. The S7150 targets graphics-rendering customers, while the MI25 targets machine-learning and AI users. This article focuses on the S7150, because S7150 SRIOV instances are already on sale in the major public clouds, while the MI25 does not yet seem to have caught on (limited by the maturity of the AMD ROCm ecosystem).

  • Two terms: PF, VF of SRIOV

(Professionals, please skip this introductory part.)

PF: the main device on the host; the host-side GPU driver is installed on the PF. The PF driver is the manager: it is a complete device driver, and what sets it apart from an ordinary GPU driver is that it manages the lifecycle and scheduling of all the VF devices. For example, 07:00.0 in the figure below is the PF device.

VF: also a PCI device, like 07:02.0 and 07:02.1 in the figure below. During VM startup, QEMU hands the VF to the virtual machine as a PCI pass-through device via the VFIO module, and the guest operating system installs the corresponding driver for the passed-through VF PCI device (07:02.0). A VF device occupies part of the GPU's resources; for example, if a PF in the figure below is split into two VFs, the graphics-rendering performance a VM gets from one VF is most likely about 1/2 that of the PF.

[Figure: PCI device list of the host, showing the S7150 PF and VF devices]

The figure above shows a server with four S7150 cards, each virtualized into two vGPUs via SRIOV. (A small sketch of how such VFs can be enumerated from the host follows below.)
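For the curious, here is a minimal sketch (mine, not from the original article) of how a host administrator might enumerate SR-IOV PFs and their VFs on Linux by walking sysfs; on a box like the one in the figure it would report the 07:00.x PFs and their 07:02.x VFs. It assumes a Linux host with sysfs mounted at the usual location.

import os

SYSFS_PCI = "/sys/bus/pci/devices"

def read_attr(path):
    # Return the stripped content of a sysfs attribute, or None if it is absent.
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

for dev in sorted(os.listdir(SYSFS_PCI)):
    devdir = os.path.join(SYSFS_PCI, dev)
    numvfs = read_attr(os.path.join(devdir, "sriov_numvfs"))
    if not numvfs or int(numvfs) == 0:
        continue  # not an SR-IOV PF, or SR-IOV is not enabled on it
    print(f"PF {dev}: {numvfs} VF(s) enabled")
    # Each enabled VF appears as a virtfnN symlink under its PF.
    for entry in sorted(os.listdir(devdir)):
        if entry.startswith("virtfn"):
            vf_addr = os.path.basename(os.readlink(os.path.join(devdir, entry)))
            print(f"  {entry} -> VF {vf_addr}")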

  • The essence of GPU SRIOV

The essence of SRIOV is to split one PCI card's resources (the PF) into multiple small pieces (the VFs). These VFs are still endpoint devices that conform to the PCI specification. Because each VF has its own Bus/Slot/Function number, the IOMMU/VT-d can, while handling DMA requests from these VFs, find the corresponding second-level IOMMU translation table and perform the GPA-to-HPA address translation. This is fundamentally different from GVT-g and NVIDIA's GRID vGPU: GVT-g and GRID vGPU do not depend on the IOMMU; their sliced (mediated) virtualization schemes implement address translation and security checks on the host side. It is fair to say that SRIOV is better than GVT-g and GRID vGPU in terms of security, because SRIOV adds an extra layer of IOMMU address-access protection. The price of SRIOV is roughly a 5% performance loss (and the cost of the MMIO traps in mdev-style sliced virtualization is even higher). Given SRIOV's advantages and its security, it cannot be ruled out that other GPU vendors will also launch GPU SRIOV solutions in the future.
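As a hedged companion to the point about the IOMMU: under VFIO what QEMU actually claims is the VF's IOMMU group, so one quick way to confirm that a VF is independently addressable and isolatable by the IOMMU is to look up its group in sysfs. The sketch below is my own illustration, not taken from the article; the 0000:07:02.0 address is just the example VF from the figure.

import os

VF = "0000:07:02.0"  # hypothetical example VF address, taken from the figure above
group_link = f"/sys/bus/pci/devices/{VF}/iommu_group"

group = os.path.basename(os.readlink(group_link))            # e.g. "42"
members = os.listdir(f"/sys/kernel/iommu_groups/{group}/devices")
print(f"{VF} belongs to IOMMU group {group}")
print("Devices sharing this group:", ", ".join(sorted(members)))
# For clean pass-through, the VF should ideally be alone in its IOMMU group.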

  • More thoughts on SRIOV

SRIOV also has its downsides, for example it has no advantage when it comes to scalability. With GPU SRIOV in particular, the most we have seen is 16 VMs enabled on one GPU. Imagine a customer who wants several hundred VMs, all with GPU graphics capability (but each with very modest rendering requirements); the SRIOV approach then simply does not fit. It would be perfect if there were a scheme that could subdivide a GPU's resources at an even finer granularity. In fact, the industry has already thought about this and put it into practice.

Internal function modules of GPU SRIOV

(Casual readers can skip this part.)

Since there is no public spec or data sheet for the GPU SRIOV hardware, we can only guess at the internal function modules of GPU SRIOV in the usual, generic way (purely fictional; any resemblance to the real design is coincidental).

[Figure: speculative block diagram of the GPU SRIOV internal function modules]


GPU resource management for vGPUs essentially always involves three things: display, security checks, and resource scheduling.

  • Display management

The GPU PF has to manage the FrameBuffer size allocated to each VF, as well as the display-related parts of virtualization. Display virtualization generally comes in two flavors: local display and remote display. XenClient, for example, uses local display virtualization, a local process in which the display hardware unit is handed over entirely to the current virtual machine. In the cloud computing industry, remote display is used far more often; we will come back to the problems with remote display in the industry later.

  • VF security checks

The GPU PF, or the GPU SRIOV module, has to take on part of the address auditing and security checking for the VFs. The GPU SRIOV hardware logic guarantees that the register list exposed to a VF contains no privileged registers: registers that operate on the GPU microprocessor and its firmware, as well as power-management registers, are not exported to the VF. All MMIO reads and writes a VM performs on its VF are ultimately mapped onto the PF's MMIO address space, and part of the VF device's MMIO is emulated on the PF side, for example by the GPU's microprocessor.

The other part of the security checking is that the PF must ensure that different VFs' accesses to the GPU FrameBuffer are isolated from one another. This most likely requires the PF to build per-VF GPU page tables, or to screen every GPU BatchBuffer submitted by the VFs.

  • VF scheduling

From the hardware's point of view, AMD GPU SRIOV is a process of time-multiplexing the GPU's resources, so the way it runs is similar to sliced (mediated) GPU virtualization. SRIOV's scheduling is covered in detail below.

The GPU SRIOV scheduling system

  • Time multiplexing

VF scheduling is a central part of GPU virtualization: it determines how the VMs are served and how fair sharing of GPU resources is guaranteed.


GPU SRIOV is also a time-multiplexing strategy. Time-multiplexing the GPU is the same concept as time-multiplexing a CPU among processes. A simple scheduler cuts GPU time into slices of a particular length and gives each VM its own slice; during that slice the VM enjoys all of the GPU's hardware resources. Today every GPU virtualization scheme uses time multiplexing, but different schemes cut the slices differently. Some schemes only make a scheduling decision after the current GPU context's BatchBuffer/CMDBuffer has finished executing, and then hand the GPU to the owner of the next time slice. Others strictly switch the moment the time slice expires, forcibly preempting whatever the GPU is executing and handing it to the next owner; this guarantees that GPU resources are shared evenly among the VMs. AMD's GPU SRIOV takes the latter approach. Later on we will see how to peek at these scheduling details from inside a guest VM.
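To make the strict variant concrete, here is a toy sketch (purely illustrative, not AMD's actual PF-driver logic) of a round-robin scheduler that preempts the running vGPU whenever its slice expires, regardless of whether its current BatchBuffer has completed. The 6 ms slice and 0.2 ms switch cost are the values we will arrive at later in this article.

from collections import deque

TIME_SLICE_MS = 6.0    # per-VF slice, matching what we measure later on the S7150
SWITCH_COST_MS = 0.2   # assumed cost of a forced world switch (context save/restore)

def run_strict_round_robin(vf_ids, total_ms):
    # Toy strict time-slice scheduler: every slice ends in a forced preemption.
    queue = deque(vf_ids)
    t = 0.0
    while t < total_ms:
        vf = queue[0]
        queue.rotate(-1)          # the next VF moves to the front for the following turn
        print(f"t = {t:6.1f} ms: VF {vf} owns the entire GPU")
        t += TIME_SLICE_MS        # the VF runs for its full slice...
        t += SWITCH_COST_MS       # ...then is preempted, even mid-BatchBuffer

run_strict_round_robin(["07:02.0", "07:02.1"], total_ms=50)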


  • Scheduling overhead

Where GPU scheduling differs from CPU scheduling, however, is that a GPU context switch is naturally much slower. With hardware assistance, a process switch on a CPU core may complete within microseconds, while a GPU context switch takes hundreds of microseconds (say 0.2 ms to 0.5 ms). The consequence is that the GPU cannot be scheduled as frequently as a CPU. As an example: if the GPU is scheduled with a 1 ms time slice and each switch burns 0.5 ms on the context switch, then out of every 1.5 ms cycle only 1 ms does useful work. GPU resources are badly wasted, and in theory a tenant can only get about 66% of the GPU.
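As a quick sanity check on these numbers, the relationship is simply effective share = slice / (slice + switch cost). The snippet below restates the paragraph above with the same assumed costs (my own illustration, not from the original article).

def gpu_share(slice_ms, switch_ms):
    # Fraction of wall-clock GPU time that actually serves the guest workload.
    return slice_ms / (slice_ms + switch_ms)

# The 1 ms-slice example from the text: 0.5 ms of every cycle is lost to switching.
print(f"1 ms slice, 0.5 ms switch -> {gpu_share(1.0, 0.5):.1%}")   # ~66.7%
# Preview of the S7150 case analyzed later: a 6 ms slice with a ~0.2 ms switch.
print(f"6 ms slice, 0.2 ms switch -> {gpu_share(6.0, 0.2):.1%}")   # ~96.8%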


  • S7150 scheduling details

Next, let us look at how the S7150, the first GPU SRIOV product, actually schedules. Since the S7150 has an interrupt-driven architecture, we can roughly infer the scheduling policy GPU SRIOV applies to a VM by looking at the distribution of GPU interrupts inside that VM.


For a Windows guest, we can install the Windows Performance Toolkit inside the VM and monitor "GPU activity".


For a Linux guest it is even simpler: just look at the GPU driver's trace events. We should thank AMD for not stripping the trace events out of the SRIOV VF driver it contributed to the Linux kernel; this gives us a chance to observe SRIOV scheduling details from inside the VM. (Does that count as peeping?)


We spin up an ordinary GA1 1/2 instance on Alibaba Cloud,

[Figure: creating a GA1 instance on Alibaba Cloud]

and choose Ubuntu (with the AMD driver preinstalled) as the system image.

Listing all GPU-related trace events in the console gives the following:

[Figure: console listing of the GPU-related trace events]

Very nice: we find two events with which the GPU driver dispatches workloads, amd_sched_job and amd_sched_process_job.
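A minimal way to reproduce such a listing from inside the guest (my own sketch, not a step from the original article; it assumes root access and that tracefs is mounted at the usual debugfs path) is to filter the kernel's available_events file for the gpu_sched subsystem:

# Assumes tracefs is reachable via debugfs; on newer kernels it may live at /sys/kernel/tracing.
EVENTS_FILE = "/sys/kernel/debug/tracing/available_events"

with open(EVENTS_FILE) as f:
    gpu_events = [line.strip() for line in f if line.startswith("gpu_sched:")]

for event in gpu_events:
    # Expect entries such as gpu_sched:amd_sched_job and gpu_sched:amd_sched_process_job.
    print(event)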


After starting a GPU workload in the VNC session (for example Glxgears or Glmark; of course x11vnc has to be started first), we collect the GPU data with the following commands:

trace-cmd record -e gpu_sched

… wait a few seconds and terminate the capture with Ctrl+C, then:

trace-cmd report > results.log

Looking through the captured occurrences of these two events, we note down a few interesting moments:

[Figures: excerpts from the trace-cmd report log showing runs of amd_sched_job / amd_sched_process_job events separated by gaps]

Within any stretch of the log the events are continuous, then they stop for a while, and then continuous workload submission resumes.


The small red boxes in the screenshots mark the gaps we care about, extracted into the table below:

Last event (s)     Next event (s)     Gap         Note
1437.803888        1437.810159        6.271 ms    No GPU activity
1437.816378        1437.822720        6.342 ms    No GPU activity
1437.829105        1437.835127        6.022 ms    No GPU activity
1437.841587        1437.847506        5.919 ms    No GPU activity

Clearly, during these windows the current VM's GPU was paused and switched away to serve other VMs, so the current VM's GPU workload piles up at the driver level.
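The gaps in the table above were read off the screenshots by hand; a small script along the following lines (my own sketch, assuming the usual trace-cmd report line format of "task-pid [cpu] timestamp: event: ...") can pull the same gaps straight out of results.log.

import re

# Match the "timestamp: amd_sched_..." portion of a trace-cmd report line.
TS_RE = re.compile(r"\s(\d+\.\d+):\s+amd_sched_")

timestamps = []
with open("results.log") as f:
    for line in f:
        m = TS_RE.search(line)
        if m:
            timestamps.append(float(m.group(1)))

GAP_THRESHOLD_S = 0.003   # treat anything longer than ~3 ms as "no GPU activity"
for prev, cur in zip(timestamps, timestamps[1:]):
    gap = cur - prev
    if gap > GAP_THRESHOLD_S:
        print(f"{prev:.6f} -> {cur:.6f}: {gap * 1000:.3f} ms without GPU activity")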


Once all the events are plotted, it becomes clear that for a VM on a 1/2-GPU instance, the GPU resources it receives are switched in time slices of roughly 6 ms.

Plotted, it looks like this:

[Figure: timeline plot of the scheduling events, showing roughly 6 ms time slices]

  • Estimating vGPU scheduling efficiency

Assume each vGPU scheduling switch takes about 0.2 ms on average, with a 6 ms time slice. The plot above shows that AMD GPU SRIOV uses a strict time-slice policy: the moment the 6 ms is used up, it switches to the next VM (even if there is only one VM, it still gets switched out). The scheduling efficiency of a 1/2 instance of the S7150 can therefore reach 6 / (6 + 0.2) ≈ 96.7%. If two such VMs run at full load simultaneously, their combined graphics-rendering capability can reach 96.7% or more of GPU pass-through.


The measured results are as follows:

[Figure: benchmark results for two concurrent 1/2 vGPU instances versus GPU pass-through]

1/2 vGPU + 1/2 vGPU = 97.4% (vs. GPU pass-through performance)


Each vGPU reaches about 48.x% of the pass-through GPU's performance, for a combined 97.4%, which is very close to our estimate.


More thoughts on GPU virtualization scheduling

I have to say that the AMD S7150 handles vGPU scheduling very well. AMD's GPU hardware design ensures that whatever Batch Buffer is currently executing can be safely preempted (GPU workload preemption) and the context switched to a new workload. Only with such a solid hardware design can the scheduling algorithm in the PF driver stay so calm and orderly at the software level. The forced 6 ms scheduling ensures that when multiple VMs share the GPU, none is starved and none over-occupies it, and the scheduling overhead is minimal (2-3%). Better still, this design leaves room to tune the slice length: when there are few VMs, a larger slice such as 12 ms would push GPU utilization even higher. So why not schedule with 100 ms slices? Because the Windows kernel watches "GPU activity": if any GPU command gets no response within 2 seconds, Windows initiates Timeout Detection and Recovery (TDR) and resets the GPU driver. Imagine 16 VMs on a 100 ms time slice: on average, the minimum interval before a VM gets its next turn on the GPU is 1.6 s. Add the further delays caused by the PF driver itself being scheduled by the host Linux kernel, and it becomes very likely that TDR is triggered inside the Windows guest.
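A quick hedged arithmetic check of that last scenario, using the same assumptions as above (16 VMs, 100 ms slices, roughly 0.2 ms per switch):

NUM_VMS = 16
SLICE_MS = 100.0
SWITCH_MS = 0.2

# Time between two successive turns of the same VM: one full rotation over all VMs.
rotation_ms = NUM_VMS * (SLICE_MS + SWITCH_MS)
print(f"Rotation period: {rotation_ms / 1000:.2f} s")   # ~1.60 s
# Uncomfortably close to the ~2 s Windows TDR limit, before counting any extra
# delay from the PF driver itself being scheduled by the host Linux kernel.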


Without quite meaning to, this chapter has ended up covering GPU virtualization scheduling. Great, that means the chapter that was going to be devoted to GPU scheduling can be dropped.
