Large model serverless inference system

Click to view the complete event review: https://my.oschina.net/u/4489239/blog/11105657

Click to jump to the preview of Shenzhen Yuanchuang Conference on May 18: https://www.oschina.net/event/2332004

On April 20, the 102nd Yuanchuang Conference was successfully held in Wuhan. This edition invited artificial intelligence experts from the Wuhan Artificial Intelligence Research Institute, Huawei, MindSpore, JD Cloud, and Gitee AI to speak on the theme of "Large Model Competition and Performance Optimization". Some model vendors and platforms currently offer individual users a certain amount of free computing power for working with large model technology, and Gitee.AI, as a large model aggregation platform, does so as well. Lin Jiazhen, expert consultant to Gitee AI and the Institute of High Performance Computing at Tsinghua University, delivered a keynote speech titled "Large Model Serverless Inference System".
 
Lin Jiazhen pointed out that Gitee.AI currently aggregates more than 2,000 models, but its free computing resources are limited, so allocating those resources to developers on demand and efficiently is a very challenging problem. For example, when container technology was used to serve external developers in the past, swapping a single container in and out and waking it up were very fast. In the era of large models, however, waking and putting a model to sleep are so expensive that container swap-in/swap-out management can no longer reach the efficiency it achieved in earlier scenarios.
 
Serverless AI has several major advantages: simple deployment, out-of-the-box use, lower computing costs, coverage of mainstream models, and support for a variety of computing hardware. Current model engines, or rather the current way of purchasing and using computing power, have a problem: the user program, the model, and the inference chip are all tied to a single container, which occupies the hardware chip while consuming the computing service. The serverless inference engine instead integrates and optimizes computing resources, reduces the coupling between applications, models, and computing power through multiple levels of disaggregation, allocates computing power on demand, and improves resource utilization.
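The contrast between container-bound and pooled computing can be sketched in a few lines. The class below is a hypothetical toy, not part of the system described in the talk: instead of each tenant pinning a chip inside its own container, many tenants draw chips from one shared pool on demand.

```python
class ComputePool:
    """Toy shared accelerator pool: `chips` devices serve many tenants
    on demand, instead of each tenant pinning one chip in a container."""

    def __init__(self, chips):
        self.free = chips   # chips currently idle
        self.served = 0     # jobs completed

    def run(self, job):
        if self.free == 0:
            return False    # queue or reject instead of idle pinning
        self.free -= 1      # borrow a chip only for the job's duration
        job()               # inference happens here
        self.free += 1      # chip returns to the pool immediately
        self.served += 1
        return True

# Five tenants share two chips; with container binding this
# would have required five dedicated, mostly idle chips.
pool = ComputePool(chips=2)
for _ in range(5):
    pool.run(lambda: None)
print(pool.served)  # 5
```

The point of the sketch is only the ownership model: the chip belongs to the pool between jobs, which is what lets utilization rise as the number of tenants grows.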
 
The serverless system architecture is divided into three layers. The lowest layer is the compiler layer: model loading inside the container is replaced with RPC calls to a remote service. The interface is unchanged, but inference is carried out by the backend, disaggregating the model from the chip. The RPC requests are handed to the inference engine above it, the cluster where computation actually happens; this layer disaggregates data from computing power. For example, consider a scenario where ten cards must satisfy scheduling requests across 3,000 models. A large model can no longer be pinned to a single card; the requested model must be loaded temporarily and dynamically per request. The compute chips and model weights are therefore disaggregated, and the models are placed on TanserGraph, a heterogeneous memory system that supports this disaggregation of compute chips and models. The top layer, the serverless layer, handles application, inference, and aggregation.
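The "ten cards, 3,000 models" scheduling problem above amounts to keeping only the hottest models resident and fetching the rest on demand. The sketch below is a minimal illustration under assumed names (the `fetch_weights` callback and `OnDemandModelCache` class are hypothetical, standing in for pulling weights from a tier such as TanserGraph), using simple LRU eviction:

```python
from collections import OrderedDict

class OnDemandModelCache:
    """LRU cache standing in for the accelerator cards: at most
    `capacity` models are resident at once; any other model is
    fetched from the remote weight store when a request arrives."""

    def __init__(self, capacity, fetch_weights):
        self.capacity = capacity
        self.fetch_weights = fetch_weights   # hypothetical remote fetch
        self.resident = OrderedDict()        # model_id -> weights

    def get(self, model_id):
        if model_id in self.resident:
            self.resident.move_to_end(model_id)  # mark recently used
            return self.resident[model_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)    # evict coldest model
        weights = self.fetch_weights(model_id)   # dynamic, per-request load
        self.resident[model_id] = weights
        return weights

# Toy usage: 10 "cards" serving requests drawn from 3,000 model ids.
cache = OnDemandModelCache(capacity=10,
                           fetch_weights=lambda m: f"weights-of-{m}")
for model_id in [5, 7, 5, 42, 7]:
    cache.get(model_id)
print(len(cache.resident))  # 3 distinct models resident, bounded by 10
```

A real engine would evict by model size and load cost rather than pure recency, but the invariant is the same: residency is a cache decision, not a static assignment.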
 
The core capability of the serverless system architecture is heterogeneous interconnected memory, which addresses the model-weight problem. The monolithic data center architecture has limitations such as low resource utilization and limited hardware scalability. Disaggregation technology physically separates the components of the monolithic architecture and uses an interconnect to link each component's control plane and data plane, enabling on-demand allocation and scaling of each kind of resource. Memory disaggregation also has advantages in cloud scenarios, including improving resource utilization in cloud environments and making it easier to meet the growing demand for memory.
 
However, existing tiered memory systems are not suited to the high hardware flexibility of a disaggregated architecture, and their scalability is limited. Moreover, due to internal structural constraints, existing memory management interfaces offer limited capability. Heterogeneous interconnected memory solves these problems through three mechanisms: hardware access statistics, programmable policies, and page migration. Taking the CPU as an example, PEBS-based access sampling lets the hardware collect the memory access behavior of a running program, recording the instruction, TID, target address, and so on, after which model weights are loaded on demand.
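The three mechanisms compose naturally: hardware sampling produces an access log, a programmable policy ranks pages by heat, and the migration step moves the winners to the fast tier. The function below is a hypothetical simplification of such a policy (the names `plan_migrations`, `hot_threshold`, and `fast_tier_free` are illustrative, not from the talk), consuming a PEBS-style stream of sampled page addresses:

```python
from collections import Counter

def plan_migrations(access_log, hot_threshold, fast_tier_free):
    """Programmable-policy sketch: given sampled page accesses (as a
    PEBS-style hardware counter stream would provide), choose the
    hottest pages to migrate into the fast memory tier."""
    counts = Counter(access_log)                 # accesses per page
    hot = [page for page, n in counts.most_common()
           if n >= hot_threshold]                # policy: simple threshold
    return hot[:fast_tier_free]                  # bounded by free fast pages

# Sampled stream: page 0x1000 is touched three times, 0x2000 twice.
log = [0x1000, 0x2000, 0x1000, 0x3000, 0x1000, 0x2000]
print(plan_migrations(log, hot_threshold=2, fast_tier_free=1))  # [4096]
```

Because the policy is ordinary code operating on the sampled log, it can be swapped out per workload, which is the flexibility the fixed interfaces of existing memory managers lack.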
 
In addition, the serverless system architecture offers other capabilities, such as multi-level neural network compilation optimization based on MLIR and a lightweight system service mechanism based on user-space isolation. The serverless inference engine is built on these two core proprietary technologies and also integrates today's mainstream inference system optimizations.
 
Llama 3 is now live on Gitee AI. Copy the link below into your browser to try it on the platform (invitation code: llama3):
https://ai.gitee.com/hf-models/shenzhi-wang/Llama3-8B-Chinese-Chat
 
Scan the QR code to watch the replay of the lecture "Large Model Serverless Inference System" ⬇️
Origin: my.oschina.net/u/4489239/blog/11105667