昇思 MindSpore: Key Large Model Technologies and Roadmap

Click to view the full event recap: https://my.oschina.net/u/4489239/blog/11105657

Click here for a preview of the Shenzhen Yuanchuang Conference (源创会) on May 18: https://www.oschina.net/event/2332004
On April 20, the 102nd Yuanchuang Conference (源创会) was successfully held in Wuhan. AI experts from the Wuhan Institute of Artificial Intelligence, Huawei, MindSpore, JD Cloud, and Gitee AI were invited to speak on the theme of large model competition and performance optimization. MindSpore Research Engineer Chen Ziheng delivered the talk "昇思 MindSpore: Key Large Model Technologies and Roadmap". Chen Ziheng explained that, in the industry stack, MindSpore sits between the underlying chip hardware and upper-layer industry applications. On the large model side, MindSpore first builds the foundation layer, covering multiple base foundation models, and on top of that works with industry partners to build industry-specific models. MindSpore is also compatible with many mainstream open-source large models from China and abroad. For all of these models, MindSpore provides three foundation suites, MindFormers, MindPET, and MindRLHF, which unify the entire workflow of large model development, fine-tuning, and deployment, so they work out of the box.
 
For large model training, MindSpore uses a computation-graph-based compiler to implement parallel strategies. Given an input computation graph, MindSpore's graph compilation pipeline partitions the graph according to the chosen parallel strategy and automatically inserts data rearrangement (tensor redistribution) operators, so that the parallel computation across multiple machines stays logically equivalent to single-machine execution. On top of this, MindSpore layers multiple levels of optimization, including automatic strategy generation at the top, multi-dimensional hybrid parallelism, and runtime optimizations that support multi-dimensional memory offloading and heterogeneous execution.
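To make the graph-partitioning idea concrete, below is a minimal sketch (mine, not from the talk) of how an operator-level shard strategy might be declared under MindSpore's semi-auto-parallel mode. The exact API names and arguments should be checked against the MindSpore version in use, and a real job would also initialize distributed communication.

```python
# Illustrative sketch only: declare a per-operator shard strategy and let the
# graph compiler insert the data rearrangement (redistribution) operators.
import mindspore as ms
from mindspore import nn, ops

# Assumed setup: semi-auto parallel over 8 devices (a real job would also call
# mindspore.communication.init() and be launched on an actual cluster).
ms.set_auto_parallel_context(parallel_mode="semi_auto_parallel", device_num=8)

class ShardedMatMul(nn.Cell):
    def __init__(self):
        super().__init__()
        self.matmul = ops.MatMul()
        # Split the activation 2-way along its rows and the weight 4-way along
        # its columns; the compiler inserts whatever communication is needed so
        # the multi-device result matches single-device execution.
        self.matmul.shard(((2, 1), (1, 4)))

    def construct(self, x, w):
        return self.matmul(x, w)
```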
 
Since last year, the MindSpore team has also been working on parallel training of large models. Typical large model training mixes five parallel strategies: data parallelism, optimizer parallelism, model parallelism, pipeline parallelism, and recomputation. The team profiled where time goes for typical models under these modes and found that the main overheads fall into three areas: the communication cost of operator-level model parallelism, the bubbles produced by pipeline parallelism, and the tail time of data parallelism. As the cluster grows toward the 10,000-card ("Wanka") scale, these overheads become more pronounced: the limit on global batch size makes the pipeline bubble problem worse, and larger communication domains degrade communication performance, so the data-parallel tail fraction increases.
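As a back-of-envelope illustration of why a capped global batch size makes the pipeline bubble worse at larger scale (the numbers below are mine, not from the talk), the standard estimate for a 1F1B schedule puts the bubble fraction at (p − 1) / (m + p − 1), where p is the number of pipeline stages and m the number of micro-batches:

```python
# Standard 1F1B pipeline bubble estimate (illustrative, not the talk's data).
def pipeline_bubble_ratio(stages: int, micro_batches: int) -> float:
    """Fraction of step time spent idle in pipeline bubbles."""
    return (stages - 1) / (micro_batches + stages - 1)

# With the global batch size fixed, scaling to more stages leaves fewer
# micro-batches per pipeline, so the bubble fraction grows:
print(pipeline_bubble_ratio(stages=8,  micro_batches=64))   # ~0.10
print(pipeline_bubble_ratio(stages=16, micro_batches=32))   # ~0.32
```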
 
For these problems, Chen Ziheng introduced several solutions. One is multi-copy parallelism to hide model-parallel communication: the input data is split into two copies, each of which computes and communicates independently, so the computation of one copy overlaps with and hides the communication of the other, optimizing operator-level model parallelism. For pipeline parallelism, pipeline interleaving reduces the bubble to below 10%.
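A rough way to see how interleaving shrinks the bubble (my approximation, assuming a Megatron-style interleaved schedule with v virtual stages per device; the sub-10% figure above is the talk's claim, not derived here):

```python
# Rough estimate: with an interleave factor v, each device holds v smaller
# model chunks, so the effective number of micro-batch slots grows by v and
# the bubble fraction shrinks roughly to (p - 1) / (v * m + p - 1).
def interleaved_bubble_ratio(stages: int, micro_batches: int, interleave: int) -> float:
    return (stages - 1) / (interleave * micro_batches + stages - 1)

print(interleaved_bubble_ratio(stages=16, micro_batches=32, interleave=1))  # ~0.32
print(interleaved_bubble_ratio(stages=16, micro_batches=32, interleave=4))  # ~0.10
```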
 
In addition, MoE training runs into the problem of hot and cold experts. By migrating hot experts, the AllToAll communication volume is reduced and MoE training performance improves. Beyond high-performance training, another challenge for large models is how to derive the parallelization strategy in the first place. MindSpore adopts automatic parallelism, which can cut the parallel-strategy tuning time for a large model from months down to hours.
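The effect of hosting a hot expert locally can be seen with a toy count of tokens that must cross the AllToAll (the routing numbers below are made up for illustration; they are not measurements from the talk):

```python
# Toy model: tokens routed to an expert hosted on this device need no
# AllToAll transfer; everything else goes over the interconnect.
def alltoall_tokens(route_counts: dict, local_experts: set) -> int:
    return sum(n for expert, n in route_counts.items() if expert not in local_experts)

routes = {0: 700, 1: 100, 2: 100, 3: 100}   # expert 0 is "hot"
print(alltoall_tokens(routes, local_experts={1}))      # 900 tokens over the wire
print(alltoall_tokens(routes, local_experts={0, 1}))   # 200 after migrating the hot expert here
```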
 
On the deployment side, MindSpore effectively serves as the backend of a serverless service, so the key problem to solve is performance. MindSpore combines distributed parallel inference, KV Cache, dynamic sequence lengths, continuous batching, and high-performance fused inference operators to build a unified inference framework with low latency, high throughput, and support for long sequences. Its unified training-inference architecture enables a seamless transition from training to inference.
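For readers unfamiliar with KV Cache, the sketch below shows the basic idea in plain NumPy (illustrative only; it is not MindSpore's implementation and ignores multi-head layout, paging, and batching): during incremental decoding, the keys and values of earlier tokens are cached so each step only computes attention for the newly generated token.

```python
import numpy as np

class KVCache:
    """Single-head, single-sequence KV cache for incremental decoding."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q_t: np.ndarray, k_t: np.ndarray, v_t: np.ndarray) -> np.ndarray:
        # Append the new token's key/value instead of recomputing the prefix.
        self.keys.append(k_t)
        self.values.append(v_t)
        K = np.stack(self.keys)            # (seq_len, d)
        V = np.stack(self.values)          # (seq_len, d)
        scores = K @ q_t / np.sqrt(q_t.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                 # attention output for the new token
```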
 
Going forward, MindSpore's plans for large model training cover training performance optimization for 10,000-card clusters, performance optimization for dense large models, and performance optimization for sparse MoE large models. For large model inference, MindSpore plans deeper work on the unified training-inference architecture, inference acceleration for dense large models, and inference acceleration for sparse large models.
 
Scan the QR code to watch the replay of the talk "昇思 MindSpore: Key Large Model Technologies and Roadmap" ⬇️
