Alibaba Cloud's self-developed X-Dragon architecture: how does it solve the cloud computing industry's long-standing problem?

The X-Dragon (Shenlong) architecture is a hardware-software integrated computing architecture developed in-house by Alibaba Cloud. It comprises the "X-Dragon virtualization chip", the "X-Dragon Hypervisor system software", and the "X-Dragon server hardware architecture". It deeply integrates the properties of physical and virtual machines: it offers the elastic resources, minute-level delivery, and automated operations of a virtual machine together with the performance, complete feature set, and hardware-level isolation of a physical machine, giving users a new way of receiving computing resources.

In 2016, Alibaba Cloud launched the X-Dragon architecture project for its next-generation IaaS computing platform. The project uses hardware-software co-design to re-examine, from the perspective of cloud computing IaaS, how chips, hardware, and software are defined and how they can innovate together.

In October 2017, at the Yunqi Conference in Hangzhou, Alibaba Cloud unveiled for the first time a bare metal server based on the X-Dragon architecture.

In September 2019, Alibaba Cloud officially released the third generation of the self-developed X-Dragon architecture. It now runs across the entire elastic computing platform, fully supports ECS virtual machines and cloud-native containers, and delivers a fivefold improvement in IOPS and PPS. Users can obtain computing capacity on the cloud that exceeds 100% of what a traditional physical machine delivers.

Background: The historic problem of cloud computing

From the birth of the computer until the 1990s, computing resources were treated as a "planned" resource. With the arrival of the Internet era, however, a single sudden event could overwhelm the computing resources on hand.

One of the advantages of cloud computing is that computing resources can be scaled up and down freely. This ability comes from virtualization technology, born decades ago, which allows data center computing resources to be managed in a software-defined way.

All along, the foundation of cloud services has remained a combination of general-purpose chips and standard virtualization software: Intel x86 chips plus virtualization software from VMware, Red Hat, Citrix, and other open-source or commercial organizations. This combination can deliver computing power according to an enterprise's needs; even tens of thousands of cores can be provisioned within minutes.

The convenience of the cloud is most vivid in the field of artificial intelligence. Take the ImageNet image recognition contest as an example. ResNet, the 2015 champion model, contains tens of millions of parameters, and training it on an offline server took a full 14 days. Today, the same model completes the same task on the cloud in only a few hours.

Of course, beneath the glamorous surface, cloud computing also has inherent shortcomings.

Virtualization, like a black hole, swallows part of the machine's performance: the cloud's elasticity is bought at the expense of performance. For example, when cloud servers run on a 96-core server, about 8 cores and 32 GB of memory may be needed to cover the cost of virtualization, leaving users only the remaining 88 cores and memory, which is a tremendous waste of computing power. Moreover, resource scheduling among cloud servers on the same host cannot be fully isolated, so contention occurs and performance is unstable.
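The overhead described above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a host with 96 cores in total (the figure implied by 8 reserved cores plus 88 remaining); the function name `virtualization_overhead` is hypothetical and only serves the illustration.

```python
from typing import Tuple


def virtualization_overhead(total_cores: int, reserved_cores: int) -> Tuple[int, float]:
    """Return (cores left for tenants, fraction of the host lost to virtualization)."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    if not 0 <= reserved_cores <= total_cores:
        raise ValueError("reserved_cores must be between 0 and total_cores")
    usable = total_cores - reserved_cores
    return usable, reserved_cores / total_cores


# Assumed figures: 96-core host, 8 cores reserved for the virtualization layer.
usable, lost = virtualization_overhead(total_cores=96, reserved_cores=8)
print(f"{usable} usable cores, {lost:.1%} of the host lost")
```

Under these assumptions, roughly one core in twelve never reaches the customer, and that loss repeats on every host in the fleet.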

At the same time, the entire computing industry chain is undergoing subtle changes. The breakdown of Moore's Law has slowed the development of general-purpose chips; custom servers based on GPUs, FPGAs, ASICs, and other new chips have become the trend; and traditional virtualization technology struggles to keep pace with such "non-mainstream" hardware.

This is regarded as the Achilles' heel of the cloud computing industry.

Cloud vendors, chip makers, and virtualization vendors have all tried new approaches: Intel and other chip makers provide hardware-level virtualization support, and virtualization technology itself has evolved from Xen to KVM. But because software vendors, hardware vendors, and system integrators each work only within their own layer, this fragmented model ultimately failed to solve the problem at its root.

This seemed to be a curse on cloud computing vendors, and innovation in the underlying architecture had become urgent.

Alibaba Cloud develops a next-generation virtualization architecture

Virtualization loss is a drawback that has existed since the birth of cloud computing. Alibaba Cloud has long worked to reduce it, pushing it toward the limit. During the technical review of Double 11 in 2016, Zhang Jianfeng, then CTO of Alibaba Group, set a very demanding requirement: reduce virtualization overhead to zero. This seemed to defy the law of conservation of energy, and even academia had no research on it.

The Alibaba Cloud team finally took a different path and came up with a new solution: eliminating virtualization overhead with a dedicated chip.

From a technical point of view, realizing this idea required the Alibaba R&D team to rebuild a cloud computing architecture: first develop a new chip that provides support and manageability for each node; on that basis, develop a new set of server hardware, software, and supporting systems; and then integrate this architecture into existing product designs.

David Patterson, Turing Award winner and professor of computer science at the University of California, Berkeley, said: "With the end of Moore's Law, in order to obtain a faster computer performance, the only way is to improve the computer's design or 'architecture'."

In the past, because upstream and downstream companies in the industry chain each attended only to their own layer, virtualization loss persisted.

The idea of hardware-software co-design has been making its way into the cloud. With cloud vendors' server deployments reaching the million-unit level, any hardware can be customized, and cloud vendors have begun to re-examine collaborative innovation across chips, hardware, and software. An important precondition for reaping the dividends of hardware-software integration is custom chips and self-developed hardware. This is what Alibaba Cloud did.

X-Dragon architecture features

Alibaba Cloud is one of the world's top three IaaS vendors, and its ECS elastic computing products have grown to a very large scale. Along the way, the development team gained a deep understanding of open-source Xen and KVM and also noticed the chip industry's trend toward hardware-software integration. For the X-Dragon architecture, Alibaba Cloud developed a custom dedicated server, a dedicated virtualization chip, a dedicated MOC card, and a matching set of software: a complex system software stack that spans from the BIOS to the client to the top-level scheduling software.

The MOC card is the soul of the X-Dragon architecture. The card was designed entirely in-house by Alibaba. The core X-Dragon chip is mounted on the MOC card of the elastic bare metal server. This self-developed MOC card has its own independent processing, storage, and I/O units, and it takes over the network, storage, and peripheral virtualization that was originally implemented in software. The X-Dragon server motherboard is also an Alibaba Cloud custom design, specifically optimized for the MOC card so that the X-Dragon Hypervisor can easily manage the whole machine.

In this architecture, each X-Dragon server can be scheduled by the X-Dragon Hypervisor just like a virtual machine: creating and releasing an X-Dragon elastic bare metal server works the same way as creating an ECS instance in the Alibaba Cloud console. However, because this scheduling is implemented in hardware, there is essentially no performance overhead, and the machine runs no differently from a physical machine.

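At the same time, the external resources of an X-Dragon server, such as cloud disk storage and VPC networking, are all supported through the MOC card. Slow peripherals are the main performance bottleneck of modern servers: waiting on storage such as hard disks consumes a large share of computing resources. The X-Dragon architecture separates this functionality out in hardware and offloads it to the MOC card. Because the card uses dedicated chip hardware, it is highly efficient, and it remains fully compatible with Alibaba Cloud's existing cloud computing system. An X-Dragon bare metal server can be initialized by mounting an image, just like a cloud host, and can also be operated through OpenAPI. This removes the pain of manual operations, is highly efficient to use, and gives essentially the same experience as an ordinary ECS instance.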

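In this way, the X-Dragon elastic bare metal server overcomes the public cloud problems mentioned above.
First, the X-Dragon elastic bare metal server has no performance overhead from software virtualization and can make full use of the processor and memory.
Second, its resources are exclusive, so its performance is very stable and does not fluctuate.
Third, it supports nested virtualization, so mainstream virtualization systems can run on top of it.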

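While overcoming the shortcomings of traditional cloud hosts, the X-Dragon cloud server retains their advantages: elastic deployment, API operation, image boot, VPC networking, and the other features mentioned above are all present.

In short, the X-Dragon elastic bare metal server combines the high performance of a physical machine with the elasticity of the cloud.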

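Application scenarios of the X-Dragon architecture

Its speed makes the X-Dragon architecture suitable for almost every cloud computing task, from lightweight computing to high-performance computing. For example, it supports ECS, and through flexible configuration it can also be assembled into powerful supercomputing clusters that drive HPC workloads.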

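Take AI as an example. Training a model may take days or even weeks, which is intolerable when every minute counts. Traditional supercomputing is also at a loss in such scenarios; accelerating training with heterogeneous computing clusters is the most common approach in both industry and academia. X-Dragon brings heterogeneous supercomputing capability to the cloud and can readily meet the demands of such compute-heavy scenarios.

Large-scale computing clusters typically lose around 50% of their performance, whereas a heterogeneous supercomputing cluster based on the X-Dragon architecture can make the most of the chips' computing performance and provide parallel computing resources comparable to those of a supercomputing center.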
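To see what a cluster-level performance loss of around 50% means in practice, the sketch below compares the delivered compute of a cluster at 50% efficiency with the same cluster at full efficiency. The node count and per-node throughput are made-up illustration values, and `effective_compute` is a hypothetical helper, not part of any Alibaba Cloud tooling.

```python
def effective_compute(nodes: int, per_node_tflops: float, efficiency: float) -> float:
    """Delivered cluster throughput in TFLOPS after scaling losses.

    efficiency is the fraction of peak performance that survives
    at cluster scale (1.0 means no loss, 0.5 means a 50% loss).
    """
    if nodes < 0 or per_node_tflops < 0:
        raise ValueError("nodes and per_node_tflops must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return nodes * per_node_tflops * efficiency


# Hypothetical cluster: 64 nodes at 10 TFLOPS each.
typical = effective_compute(nodes=64, per_node_tflops=10.0, efficiency=0.5)
ideal = effective_compute(nodes=64, per_node_tflops=10.0, efficiency=1.0)
print(f"typical: {typical} TFLOPS, ideal: {ideal} TFLOPS")
```

With these assumed numbers, half of the purchased hardware is effectively idle at 50% efficiency, which is the gap the heterogeneous cluster design is meant to close.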

A heterogeneous supercomputing cluster based on the X-Dragon bare metal server SCCGN6, combined with a low-latency RDMA network, the high-performance parallel distributed file system CPFS, and the acceleration framework Ali-Perseus, can improve performance by up to 100%, making the most of the chips' computing performance. Take the ImageNet dataset of 1.28 million images as an example: training a ResNet50 model to 75% accuracy on ordinary computing resources takes several days or even a week, whereas on an X-Dragon heterogeneous supercomputing cluster the training can be shortened to a few minutes.

X-Dragon is also well suited to container technology, currently the most popular technology of all. At present, running containers on X-Dragon bare metal offers a 10%-30% performance advantage over running them on physical servers. Almost every Internet company uses containers to some extent to deploy its services, and the characteristics of X-Dragon bare metal servers fit closely with container technology, delivering performance beyond expectations.

The X-Dragon architecture has been widely applied within Alibaba Group in the core businesses of Taobao, Tmall, and Cainiao, meeting large-scale traffic demands such as Double 11. Under Alibaba's "All in Cloud" strategy, all products also adopt the X-Dragon solution.

The architecture also serves external customers across industries. Sanweijia (三维家), a representative Internet home-furnishing company, has moved fully to the cloud using X-Dragon, with rendering 5%-8% more efficient than on its offline IDC cluster. The well-known automaker SAIC-GM uses X-Dragon SCC supercomputing clusters and has improved automotive simulation efficiency by 25%. PERA provides customers with HPC solutions based on SCC clusters, reducing overall cost by 20%. Geely, by using X-Dragon cloud server clusters, has significantly improved simulation efficiency and shortened vehicle design and time to market by several months.

Origin yq.aliyun.com/articles/743920