One article to understand the large model training, inference, and deployment strategies of multiple vendors

On April 20, the 102nd Yuanchuang Conference was successfully held in Wuhan. This edition invited artificial intelligence experts from the Wuhan Artificial Intelligence Research Institute, Huawei, MindSpore, JD Cloud, and Gitee AI to speak on the theme of "Large Model Competition and Performance Optimization". Let's take a look at the highlights of the event!
Get a group photo ✅
Pizza and gifts are a must!
Next comes the review of the keynote speeches. You can scan the QR code below, follow the "OSC Open Source Community" video account, and go to the "Live Replay" page to watch the complete video replays:

Liu Hao: Large model analysis and trend outlook

Liu Hao, Director of the Venture Capital Transformation Department of the Wuhan Artificial Intelligence Research Institute, delivered a keynote speech titled "Large Model Analysis and Trend Outlook". The Wuhan Artificial Intelligence Research Institute, where Liu Hao works, began researching large model technology as early as 2020. In July 2021, it released the world's first 100-billion-parameter tri-modal large model, covering images, text, and speech.
 
Liu Hao pointed out that early artificial intelligence research faced three major problems. First, generalization ability was poor, and a model could only solve problems similar to those it was trained on. Second, model capabilities were narrow: a single model could not handle rich tasks, and multiple models had to be integrated. Third, for a long period the demand for data annotation was excessive. Large models address all three of these problems, especially since the emergence of ChatGPT. The success of ChatGPT means that many downstream AI tasks and models can move onto a production line, opening an era of AI productization that lets engineers focus on building base models and allows more people to participate in the artificial intelligence industry.
 
In addition, large models have stimulated demand across storage, computing power, network capacity, and other links, connecting many upstream and downstream industries.
 
Technically speaking, many large models at home and abroad still essentially use the earlier MoE architecture, but they have undergone solid engineering and productization. Once model parameters exceeded 66 billion, the unexplainable side of artificial intelligence became more pronounced, including emergent capabilities that seem hard to explain. Liu Hao believes that the method OpenAI used to make ChatGPT so effective remains a black box, but it has opened a path toward unified representation and reasoning of knowledge, world cognition and modeling, and other problems.
 
Large models have changed not only the research model but also the service and development model. For example, many companies began canceling orders for large-model GPUs and halted their own large model development. In the end, the industry may be left with only a few companies building base large models, while most others focus on industry applications. This also means that large models have entered the stage of industrial production, and many tools will be built on top of them.
 
Currently, Zidong Taichu 2.0 has been upgraded to a full-modal large model, adding modalities such as three-dimensional point clouds. At the same time, the Wuhan Artificial Intelligence Research Institute has built a full-stack, domestically developed open AI service platform. It uses large models as the base of a one-stop platform and adopts a new computing power + platform model: on the one hand, the base model can be fine-tuned with data; on the other hand, the platform and computing power can be combined seamlessly. Multiple AICCs have been deployed across the country, completing full-stack localization adaptation, leveraging high-performance, inclusive computing power, deeply integrating with industry scenarios, and accelerating the application of large models to empower thousands of industries.
 
Finally, Liu Hao also gave his four major judgments on the development trends of large models:
  • Trend 1: Information technology applications and innovation ecology have undergone tremendous changes, such as continuously feeding data to complete various intelligent activities, application development entering the natural language programming mode, etc.;
  • Trend 2: Reshaping the paradigm of decision-making intelligence, such as human-machine alignment to assist decision-making;
  • Trend 3: Developing toward miniaturization and domain specialization, moving from general cognitive AI toward professional artificial intelligence;
  • Trend 4: Moving towards more general artificial intelligence, such as large models interacting with humanoid robots.
Scan the QR code to watch the replay of the lecture "Large Model Analysis and Trend Outlook" ⬇️

Li Shuqiao: Application and Implementation of Large Model Optimization Technologies on Ascend

Huawei software engineer Li Shuqiao delivered a keynote speech titled "Application and Implementation of Large Model Optimization Technologies on Ascend", introducing Ascend's large model computing capabilities from three aspects: native Ascend support for open-source acceleration libraries, Ascend's self-developed large model optimization technologies, and cloud-native production deployment.
 
First, support for open-source libraries covers four major areas: third-party models, third-party AI frameworks, third-party acceleration libraries, and third-party inference services. Take PyTorch & Torch NPU support as an example. PyTorch is an AI framework that can be divided into two parts: the upper layer is PyTorch itself, and the lower layer is Torch NPU. At the upper layer, Ascend registers native and custom operators with PyTorch through its registration mechanism, allowing PyTorch to run on Ascend hardware. For the lower-layer Torch NPU, open-source contributions improve the multi-device support of modules such as checkpoint, FSDP, and DataLoader, achieving native NPU support. A minimal usage sketch is shown below.
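The following is a minimal sketch of what running a PyTorch model on an Ascend NPU through the Torch NPU adapter typically looks like; it assumes the torch_npu plugin and the CANN toolkit are installed, and exact device names may vary by version.

```python
# Minimal sketch: running a PyTorch model on an Ascend NPU via torch_npu.
# Assumes the torch_npu plugin is installed; device naming is an assumption.
import torch
import torch_npu  # registers the "npu" device and Ascend operators with PyTorch

device = "npu:0" if torch.npu.is_available() else "cpu"

model = torch.nn.Linear(128, 64).to(device)   # weights placed on the NPU
x = torch.randn(8, 128, device=device)        # input tensor created on the NPU
with torch.no_grad():
    y = model(x)                              # forward pass runs on Ascend kernels
print(y.shape)
```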
 
In addition, Ascend also supports the ONNX Runtime general model framework. Models from different frameworks, including PyTorch, TensorFlow, and MindSpore, can be saved in the ONNX format, and ONNX Runtime can run this unified format. Ascend's native support for the ONNX Runtime library brings great convenience when working with multiple frameworks and improves ease of use.
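As a rough illustration of this unified-format path, the sketch below runs an exported ONNX model with ONNX Runtime; the "CANNExecutionProvider" backend name and its availability depend on how onnxruntime was built, so treat it as an assumption.

```python
# Minimal sketch: running an ONNX model with ONNX Runtime on Ascend.
# "CANNExecutionProvider" availability depends on the onnxruntime build (assumption).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # a model exported from PyTorch/TensorFlow/MindSpore
    providers=["CANNExecutionProvider", "CPUExecutionProvider"],  # fall back to CPU
)

input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example image-shaped input
outputs = session.run(None, {input_name: x})
print(outputs[0].shape)
```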
 
In terms of model compression, DeepSpeed can compress large models so that they can be deployed and run more easily, and native Ascend support for it has also been implemented.
 
For OpenCV, the computer vision library providing image processing, machine learning, and video analysis functionality, Ascend has implemented backend support, offering the Ascend NPU data structure AscendMat and 18 high-frequency interfaces, with the performance of most operators improved by 30%.
 
In terms of code migration, native Ascend support for OpenCLIP has been implemented based on PyTorch and Torch NPU, so a model can be migrated to an Ascend device with three lines of code. A hedged sketch of this idea follows.
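The sketch below shows the general shape of such a migration: load an OpenCLIP model as usual, then move it and its inputs to the Ascend device exposed by torch_npu. The model and pretrained-weight names are illustrative, and the exact native-support API may differ.

```python
# Minimal sketch of the "few lines of code" migration idea (illustrative names).
import torch
import torch_npu      # exposes the "npu" device
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

model = model.to("npu").eval()                      # the migration step: target the NPU
text = tokenizer(["a photo of a cat"]).to("npu")
with torch.no_grad():
    text_features = model.encode_text(text)
print(text_features.shape)
```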
 
Second, regarding Ascend's self-developed large model optimization technologies, Ascend has developed the AscendSpeed large model acceleration library. Large model training is a very complex process involving many techniques and challenges; in particular, the huge amount of device memory it requires poses a considerable challenge for compute cards. To allow computation across multiple cards when a single card's memory is insufficient, third-party large model acceleration libraries such as Megatron and DeepSpeed have emerged in the industry. They partition the model, input data, and so on across different compute cards, and finally aggregate the results through collective communication (see the sketch below). Ascend provides the AscendSpeed acceleration library, enabling customers to quickly migrate large model workloads to Ascend devices; it also supports Ascend-specific algorithms to ensure out-of-the-box usability.
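Here is a conceptual sketch of the partition-and-aggregate idea, not AscendSpeed's actual implementation: a linear layer whose output columns are split across cards, with an all_gather collective reassembling the full result.

```python
# Conceptual sketch of tensor-parallel partitioning with collective communication.
# Process-group setup (torch.distributed.init_process_group) is omitted, and
# gradient handling across the collective is simplified.
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Each rank holds a 1/world_size slice of the output features."""
    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert out_features % world_size == 0
        self.world_size = world_size
        self.local = torch.nn.Linear(in_features, out_features // world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.local(x)                              # partial result on this card
        parts = [torch.empty_like(local_out) for _ in range(self.world_size)]
        dist.all_gather(parts, local_out)                      # collective communication
        return torch.cat(parts, dim=-1)                        # same output as one big card
```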
 
Ascend also provides a fairly complete tool chain, AIT (Ascend Inference Tools), which serves as the unified entry point for inference tooling, offering customers integrated development tools and one-stop debugging and tuning.
 
Finally, regarding cloud-native production deployment, the Kubernetes Volcano scheduler supports affinity scheduling for Ascend devices. In addition, the Ascend Kubernetes Device Plugin reports the number of devices it discovers to the Kubernetes system; when a device becomes unhealthy, this is reported to Kubernetes and the device is removed. After a device failure, a new container is automatically started with a healthy device mounted and the training task is rebuilt. Currently, the Space backend with native Ascend support for Vicuna already uses the Kubernetes Device Plugin. A minimal sketch of a Pod that consumes a device advertised by such a plugin is shown below.
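The following sketch uses the Kubernetes Python client to create a Pod that requests one NPU from a device plugin. The resource name "huawei.com/Ascend910" and the container image are assumptions for illustration; use whatever resource name your plugin actually registers.

```python
# Minimal sketch: requesting an NPU advertised by an Ascend device plugin.
# Resource name and image are assumptions, not values from the talk.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="npu-train-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="trainer",
            image="my-training-image:latest",          # hypothetical training image
            resources=client.V1ResourceRequirements(
                limits={"huawei.com/Ascend910": "1"},  # one healthy NPU mounted by the plugin
            ),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```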
 
Scan the QR code to watch the replay of the speech "Application and Implementation of Large Model Optimization Technologies on Ascend" ⬇️

Yuan Lijiang: Inspiring the Future with Intelligence - the Yanxi Large Model Platform

Yuan Lijiang, product director of JD Cloud, delivered a keynote speech titled "Inspiring the Future with Intelligence - Yanxi Large Model Platform". Yuan Lijiang introduced five major challenges in the enterprise-level implementation of large models: real-time performance, explainability, security and controllability, complex decision-making, and professionalism. The key to implementation is how to make and execute correct decisions in real time in an uncertain, dynamically changing environment.
 
Yuan Lijiang introduced two main ways to implement large models. One is the Copilot mode, where the interaction is human-led and AI only serves as an assistant, completing the work in certain scenarios such as text content generation and processing, text-to-image generation, and so on. In practice, enterprises want to free up manpower as much as possible. The other is the Agent mode, which is better suited to complex enterprise scenarios. In this mode, humans take a higher-dimensional perspective and act as the "mentor" or "coach" of the AI, setting goals and supervising the results, while the large model exercises its reasoning ability, uses appropriate tools and interfaces, and finally gives corresponding feedback.
 
The main technologies relied on for enterprise implementation of large models have also changed. Initial pre-training has the highest cost and requires enormous investment; later, the SFT approach lowered the cost but the results were not good enough; vector-database-based retrieval-augmented generation (RAG) improves results but is largely limited to knowledge Q&A scenarios; in the end, mature technical teams pay more attention to the Agent mode, which can support multiple scenarios.
 
In JD.com's financial business, it is difficult to improve a large model's ability to solve practical problems simply by relying on SFT or LoRA. Instead, Agent technology is used so that the machine can use tools to solve business problems. Specifically, the Agent understands the user's goal, decomposes it into sub-tasks, and selects appropriate tools for each sub-task; these tools are existing interfaces of JD.com's business, and their outputs are finally combined with the large model's capabilities to produce feedback. In this way, answers to users' complex questions become more accurate. An illustrative sketch of this pattern follows.
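The sketch below shows the generic Agent loop described above: decompose the goal, pick a tool per sub-task, call it, and compose the results. The `llm` callable and the tool registry are hypothetical placeholders, not JD Yanxi's actual interfaces.

```python
# Illustrative Agent-pattern sketch; `llm` and `tools` are hypothetical.
def run_agent(user_goal: str, llm, tools: dict) -> str:
    # 1. Task decomposition: ask the model for one sub-task per line.
    plan = llm(f"Break this goal into ordered sub-tasks, one per line: {user_goal}")
    observations = []
    for subtask in plan.splitlines():
        # 2. Tool selection: pick one of the registered business interfaces.
        tool_name = llm(f"Choose one tool from {list(tools)} for: {subtask}").strip()
        tool = tools.get(tool_name, next(iter(tools.values())))  # fall back if unmatched
        # 3. Tool invocation: call the existing business API for this sub-task.
        observations.append(tool(subtask))
    # 4. Final feedback: combine tool outputs with the model's own reasoning.
    return llm(f"Answer '{user_goal}' using these observations: {observations}")
```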
 
At present, the JD Yanxi large model platform has built a multi-layered product matrix. The lowest layer is resource support, including computing resources, storage resources, high-speed networking, and resource scheduling. The model resource layer provides capabilities such as model management and training, dataset processing, and model evaluation and deployment. Above the model resource layer is agent construction, focusing on the integration of various tools. The top layer is the application service layer, which adapts to multiple enterprise scenarios.
 
The JD Yanxi large model platform has six major functions:
  • Resource scheduling collaboration: efficient management and scheduling of computing resources, ensuring performance optimization and cost control for large model development and applications;
  • Data management: support for carrying out pre-training, fine-tuning, reinforcement learning, evaluation, and other steps efficiently;
  • Model training: training and fine-tuning large models so that enterprises can have customized models with better accuracy and relevance;
  • Agent construction: helping enterprises create and deploy agents that work with existing IT systems to perform complex tasks;
  • Security compliance: ensuring that all large model applications comply with security standards and legal and regulatory requirements;
  • Intelligent application market: a series of pre-built large model applications that enterprises can deploy directly or access quickly via plug-ins.
 
Scan the QR code to watch the replay of the speech "Inspiring the Future with Intelligence - Yanxi Large Model Platform" ⬇️

Lin Jiazhen: Large model serverless inference system

Currently, some model providers and platforms offer individual users free computing power to try large model technology. Gitee.AI, as a large model aggregation platform, also provides free computing power to individual users. Lin Jiazhen, expert consultant at Gitee AI and the Institute of High Performance Computing at Tsinghua University, gave a keynote speech titled "Large Model Serverless Inference System".
 
Lin Jiazhen pointed out that Gitee.AI currently aggregates more than 2,000 models, but free computing resources are limited, so these resources must be allocated to developers on demand more efficiently, which is a very challenging problem at the moment. For example, when container technology was used to serve workloads in the past, swapping a single container in and out and waking it up were very fast. This has become difficult in the era of large models: model wake-up and sleep make container swap-in/swap-out management far less efficient than it used to be.
 
Serverless AI has four major advantages: simple deployment and out-of-the-box use, reduced computing power costs, coverage of mainstream models, and support for a variety of computing hardware. The current model engine, or the current way of buying and using computing power, has a problem: user programs, models, and inference chips are all tied to one container, occupying the hardware chip while the computing power service is used. The serverless inference engine integrates and optimizes computing resources, reduces the coupling between applications, models, and computing power through multiple levels of disaggregation, allocates computing power on demand, and improves resource utilization.
 
The serverless system architecture is divided into three layers. The lowest layer is the compiler layer: loading the model inside the container is replaced with RPC calls to a remote service. The interface stays the same, but inference is performed by the backend, disaggregating the model from the chip. The RPC goes to the inference engine above it, which is the cluster where computation actually happens; this level disaggregates data and computing power. For example, consider a scenario in which ten cards must satisfy scheduling requests for 3,000 models. A large model cannot be pinned permanently to one card, so the requested model must be loaded dynamically and temporarily. The compute chips and model weights are therefore disaggregated, and the models are placed on TanserGraph, a heterogeneous memory system that supports the disaggregation of compute chips and models. At the top, the serverless layer handles application, inference, and aggregation. A conceptual sketch of the compiler-layer idea appears below.
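The following is a conceptual sketch of replacing a local model load with an RPC call while keeping the client-side interface unchanged, so weights and chips are decoupled from the application. The endpoint and method names are hypothetical, shown here with Python's standard-library xmlrpc.

```python
# Conceptual sketch: a drop-in "model" whose generate() is forwarded over RPC
# to a remote inference cluster. Endpoint/method names are assumptions.
import xmlrpc.client

class RemoteModel:
    """Keeps the local interface; computation happens on the remote engine."""
    def __init__(self, model_id, endpoint="http://inference-cluster:8000"):
        self.model_id = model_id
        self.proxy = xmlrpc.client.ServerProxy(endpoint)

    def generate(self, prompt, max_tokens=128):
        # The remote engine loads the requested model on demand from the
        # shared memory pool and runs the actual computation.
        return self.proxy.generate(self.model_id, prompt, max_tokens)

model = RemoteModel("shenzhi-wang/Llama3-8B-Chinese-Chat")
print(model.generate("Hello!"))
```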
 
The core capability of the serverless system architecture is heterogeneous interconnected memory, which solves the model weight problem. A monolithic data center architecture has limitations such as low resource utilization and limited hardware scalability. Disaggregation technology physically separates the components of the monolithic architecture and uses an interconnect to link each component's control plane and data plane, enabling on-demand allocation and scaling of resources. Memory disaggregation also has advantages in cloud scenarios, including improving resource utilization in cloud environments and making it easier to meet the growing demand for memory.
 
However, existing hierarchical memory systems are not suited to the high hardware flexibility of a disaggregated architecture, and their scalability is limited. Moreover, due to internal structural limitations, existing memory management interfaces have limited capabilities. Heterogeneous interconnected memory solves these problems through three mechanisms: hardware access statistics, programmable policies, and page migration. Taking the CPU as an example, PEBS-based access statistics let the hardware collect the memory access behavior of a running program, recording instructions, TID, destination addresses, and so on, and model weights are then loaded on demand.
 
In addition, the serverless system architecture has various other capabilities, such as multi-level neural network compilation optimization based on MLIR and a lightweight system service mechanism based on user-space isolation. The serverless inference engine is built on two core proprietary technologies and also integrates various mainstream inference system optimization techniques.
 
Currently, Llama 3 has been launched on Gitee AI. Copy the link below into your browser to enter the platform and try it out (invitation code: llama3):
https://ai.gitee.com/hf-models/shenzhi-wang/Llama3-8B-Chinese-Chat
 
Scan the QR code to watch the replay of the lecture "Large Model Serverless Inference System" ⬇️

Chen Ziheng: Key Technologies and Roadmap of MindSpore Large Models

MindSpore research engineer Chen Ziheng delivered a keynote speech titled "Key Technologies and Roadmap of MindSpore Large Models". Chen Ziheng explained that, within the industry, MindSpore sits between the underlying chip hardware and upper-layer industry applications. In terms of large model technology, MindSpore first builds the foundation layer, covering multiple base large models, and at the upper layer works with industry partners to build industry models. In addition, MindSpore is compatible with many mainstream open-source large models from home and abroad. For all large models, MindSpore unifies the entire workflow of development, fine-tuning, and deployment through three basic suites, MindFormers, MindPET, and MindRLHF, to deliver out-of-the-box usability.
 
For large model training, MindSpore uses a computation-graph-based compiler to implement parallel strategies. Given an input computation graph, MindSpore's graph compilation process partitions the graph according to the parallel strategy and automatically inserts data rearrangement operators, ensuring that the parallel computing logic across multiple machines stays consistent with single-machine execution. In this way, MindSpore achieves multiple levels of optimization, including top-level automatic strategy generation, multi-dimensional hybrid parallelism, and runtime support for multi-dimensional memory optimization and heterogeneity. A minimal configuration sketch follows.
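The sketch below shows what enabling compiler-driven parallelism in MindSpore typically looks like. The specific values (device_num, pipeline_stages) are illustrative, and the exact arguments depend on the MindSpore version and cluster setup.

```python
# Minimal sketch of configuring graph-compiler-driven parallelism in MindSpore.
# Parameter values are illustrative assumptions, not settings from the talk.
import mindspore as ms
from mindspore.communication import init

init()                                             # initialize collective communication
ms.set_context(mode=ms.GRAPH_MODE)                 # graph mode: the compiler sees the whole graph
ms.set_auto_parallel_context(
    parallel_mode="semi_auto_parallel",            # compiler partitions ops per configured strategies
    device_num=8,
    pipeline_stages=2,                             # pipeline parallelism across stage groups
    enable_parallel_optimizer=True,                # optimizer-state parallelism
)
# From here, graph compilation partitions the computation graph and inserts
# data-rearrangement (redistribution) operators automatically.
```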
 
Since last year, the MindSpore team has also been working on parallel training of large models. Typical large model training uses a mixture of five parallel strategies: data parallelism, optimizer parallelism, model parallelism, pipeline parallelism, and recomputation. The MindSpore team analyzed where time is spent under these parallel modes for typical models and found three main sources of overhead: operator-level model-parallel communication, bubbles generated by pipeline parallelism, and the trailing time of data parallelism. As the cluster scale grows toward a 10,000-card cluster, these overheads become more pronounced: limited by the global batch size, the pipeline bubble problem worsens; as the communication domain grows, communication performance deteriorates; and the trailing proportion of data parallelism increases.
 
To address these problems, Chen Ziheng introduced several solutions. For example, the multi-copy parallel mode hides model-parallel communication: the data is split into two copies, each of which can be computed and communicated independently, so the computation and communication of the copies can overlap and hide each other, optimizing operator-level model parallelism. For pipeline parallel optimization, pipeline interleaving reduces the bubble to less than 10%.
 
In addition, MoE training encounters the problem of hot and cold experts. Hot expert migration reduces AllToAll communication volume and improves MoE model training performance. Beyond high-performance training, another challenge for large models is how to determine the parallelization strategy: MindSpore adopts automatic parallelism, reducing the parallel strategy tuning time for large models from months to hours.
 
In terms of deployment, MindSpore acts as the backend of the serverless platform, where the key issue is performance. MindSpore uses distributed parallel inference, KV Cache, dynamic sequence lengths, continuous batching, and high-performance fused inference operators to build a unified inference framework with low latency, high throughput, and support for long sequences in large models. The integrated training-inference architecture enables a seamless transition from training to inference. A small sketch of the KV Cache idea follows.
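As background for the KV Cache technique mentioned above, the framework-agnostic sketch below caches the keys and values of past tokens so that each decode step only computes attention for the newest token; it illustrates the general idea, not MindSpore's implementation.

```python
# Illustrative KV Cache sketch for autoregressive decoding (NumPy, single head).
import numpy as np

def attend(q, K_cache, V_cache):
    scores = q @ K_cache.T / np.sqrt(q.shape[-1])   # (1, t) scores over cached tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                        # weighted sum of cached values

d = 64
K_cache = np.zeros((0, d))
V_cache = np.zeros((0, d))
for step in range(4):                               # autoregressive decode loop
    k_new, v_new, q = np.random.rand(3, d)          # projections for the newest token
    K_cache = np.vstack([K_cache, k_new])           # append instead of recomputing history
    V_cache = np.vstack([V_cache, v_new])
    out = attend(q[None, :], K_cache, V_cache)      # attention uses only the cache + new query
```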
 
Next, MindSpore's plans for large model training cover training performance optimization on 10,000-card clusters, performance optimization for dense large models, performance optimization for sparse MoE large models, and more. For large model inference, MindSpore plans deeper work on the integrated training-inference architecture, inference acceleration for dense large models, inference acceleration for sparse large models, and so on.
 
Scan the QR code to watch the replay of the speech "Key Technologies and Roadmap of MindSpore Large Models" ⬇️

That's it for this event review. Registration for the 103rd Yuanchuang Conference is now open; click to view ⬇️
[Large model technology in the terminal] OSC Source Innovation Conference·Shenzhen Station·Issue 103 https://www.oschina.net/event/2332004


Source: https://my.oschina.net/u/4489239/blog/11105657