This article is shared from Huawei Cloud Community " Distributed Training of Large Model LLM ", author: Hua Shanghua_Lancer.
With the rapid growth in the amount of language model parameters and required training data, the limited resources on a single machine can no longer meet the requirements for large language model training. A distributed training (Distributed Training) system needs to be designed to solve the problem of massive computing and memory resource requirements.
In a distributed training system environment, it is necessary to split a model training task into multiple subtasks and distribute the subtasks to multiple computing devices to solve resource bottlenecks. But how can we use a cluster including tens of thousands of computing acceleration chips to train a large-scale language model with hundreds of billions or even trillions of model parameters? This involves a series of technologies such as cluster architecture, parallel strategy, model architecture, memory optimization, and computing optimization.
I will introduce in detail the basic concepts of distributed machine learning systems, distributed training cluster architecture, distributed training parallel strategies, and use DeepSpeed as an example to introduce how to train large language models on a cluster.
1. Overview of distributed training
Distributed Training refers to decomposing a machine learning or deep learning model training task into multiple subtasks and training them in parallel on multiple computing devices. Figure 1 gives an example of single-device versus multi-device computing. The computing device here can be a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (Tensor Processing Unit, TPU), or a neural network processor (Neural-network Processing Unit, NPU).
Since memory may not be shared even among the computing devices within a single server, the system architecture falls under the category of distributed systems regardless of whether these devices sit in one server or across multiple servers. A model training task typically takes a large number of training samples as input; it can be completed on a single computing device, or the entire task can be split into subtasks that are distributed to different computing devices for parallel computing.
After that, the output of each computing device needs to be combined to finally obtain the calculation result equivalent to that of a single computing device. Since each computing device only needs to be responsible for subtasks, and multiple computing devices can execute in parallel, it can complete the overall calculation faster and ultimately accelerate the entire computing process.
Figure 1 Examples of single computing device computing and multiple computing devices
One of the most important reasons for designing distributed training systems is that the computing power of a single device is no longer enough to support model training. Figure 2 compares the computing power required by machine learning models with the computing power a single computing device could provide during the same period. As the figure shows, machine learning models have been developing rapidly: from AlexNet in 2012 to the PaLM model with 540 billion parameters in 2022, model scale has grown at a rate of roughly 56× every 18 months. As the number of model parameters increases, the amount of training data required also grows exponentially, which further intensifies the demand for computing power.
However, in recent years the growth of CPU computing power has fallen far behind Moore's Law. Although computing acceleration devices (such as GPUs and TPUs) provide massive computing power for machine learning models, their growth rate has still not exceeded the Moore's Law pace of doubling every 18 months. To keep up with the development of machine learning models, only a distributed training system can match their ever-growing computing power requirements.
Figure 2 Comparison between the growth of machine learning model parameters and the growth of computing power of computing hardware
The overall goal of distributed training is to increase the overall training speed and reduce the overall time of model training. The total training speed can be briefly estimated using the following formula:
Total training speed ∝ Single device computing speed × Total number of computing devices × Multi-device acceleration ratio
Among them, the computing speed of a single device is mainly determined by the computing speed and data I/O capability of a single computing acceleration chip; the main technical means for optimizing single-device training efficiency include mixed precision training, operator fusion, gradient accumulation, etc. The greater the number of computing devices in a distributed training system, the higher its theoretical peak computing speed; however, limited by communication efficiency, increasing the number of devices causes the acceleration ratio to drop rapidly. The multi-device acceleration ratio is determined by computation and communication efficiency and can be improved by jointly optimizing algorithms and network topology; the main goal of distributed training parallel strategies is to improve this multi-device acceleration ratio.
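As an illustration of one of these single-device techniques, gradient accumulation can be sketched as follows (a minimal sketch; the toy model, data, and accumulation step count are placeholders):

import torch
import torch.nn as nn

# Toy setup so the sketch runs end-to-end; the model and data stand in for a real training job.
model = nn.Linear(16, 4)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 4  # effective batch size = accumulation_steps * micro-batch size

for step in range(16):
    inputs = torch.randn(8, 16)                                   # one micro-batch
    labels = torch.randint(0, 4, (8,))
    loss = criterion(model(inputs), labels) / accumulation_steps  # scale so the accumulated gradient is an average
    loss.backward()                                               # gradients accumulate in .grad across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                          # update once per accumulation window
        optimizer.zero_grad()

This trades a larger effective batch size for more iterations per parameter update, without increasing peak memory.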
Because both the parameter count of large language models and the amount of training data are enormous, distributed training architectures are required. Reference [5] only mentions that GPT-3 was trained on NVIDIA V100 GPUs. Reference [31] reports that OPT was trained on 992 NVIDIA A100 80GB GPUs using Fully Sharded Data Parallel [129] together with Megatron-LM tensor parallelism [130], for an overall training time of nearly 2 months.
The researchers of the BLOOM[33] model disclosed more details on the hardware and system architecture used. Training of the model took a total of 3.5 months and used 48 computing nodes. Each node contains 8 NVIDIA A100 80G GPUs (384 GPUs in total), and uses 4*NVLink for communication between GPUs within the node. Nodes communicate with each other using an enhanced 8-dimensional hypercube global topology network built with four Omni-Path 100 Gbps network cards.
Reference [37] does not give the specific cluster configuration and network topology used for LLaMA model training, but it does give the total GPU hours for different parameter scales. LLaMA models were trained on A100-80GB GPUs: LLaMA-7B required 82,432 GPU hours, LLaMA-13B required 135,168 GPU hours, LLaMA-33B took 530,432 GPU hours, and LLaMA-65B cost up to 1,022,362 GPU hours. Since the amount of training data used by LLaMA far exceeds that of the OPT and BLOOM models, even though its parameter counts are much smaller than those two models, the amount of computation required is still staggering.
By using a distributed training system, the training cycle of large language models can be shortened from decades on a single computing device to dozens of days using thousands of computing devices. However, distributed training systems still need to overcome various challenges such as computing walls, video memory walls, and communication walls to ensure that all resources within the cluster are fully utilized, thereby accelerating the training process and shortening the training cycle.
• Computational wall: there is a huge discrepancy between the computing power a single device can provide and the total amount of computation required by a large language model. The NVIDIA H100 SXM released in March 2022 offers only 2,000 TFLOPS of single-card FP16 compute, while GPT-3 requires a total of 314 ZFLOPs of computation; the two differ by 8 orders of magnitude.
• Video memory wall: a single computing device cannot hold all the parameters of a large language model. GPT-3 contains 175 billion parameters; stored in FP32 format they occupy 700GB, and even in FP16 format they still require 350GB of device memory, while an NVIDIA H100 GPU has only 80GB of video memory.
• Communication wall: distributed training requires frequent parameter transmission and synchronization between computing devices. Due to communication latency and bandwidth limitations, this can become a bottleneck of the training process. During GPT-3 training, if there are 128 model replicas in the distributed system, at least 89.6TB of gradient data needs to be transmitted in each iteration, while as of August 2023 a single InfiniBand link provides no more than 800Gb/s of bandwidth.
The computing wall and the video memory wall arise from the conflict between the limited compute and storage capacity of a single device and the huge compute and storage requirements of the model. They can be addressed with distributed training, but distributed training in turn faces the challenge of the communication wall. In multi-machine, multi-card training these problems gradually emerge, and as model parameters and cluster sizes grow they become more prominent. In addition, when large clusters train for a long time, equipment failures may affect or interrupt the training process, which also places high demands on the distributed system.
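A quick back-of-the-envelope check of these numbers (a rough sketch; the figures are taken from the text above, and the 4-bytes-per-gradient assumption behind the 89.6TB figure is mine):

# Computational wall: total GPT-3 compute vs. single-card throughput
total_flops = 314e21                              # 314 ZFLOPs in total
card_flops_per_s = 2000e12                        # ~2000 TFLOPS FP16 on one H100 SXM
print(total_flops / card_flops_per_s / 86400)     # ~1817 days (about 5 years) on a single card, ignoring all overheads

# Video memory wall: 175 billion parameters
print(175e9 * 4 / 1e9, 175e9 * 2 / 1e9)           # 700 GB in FP32, 350 GB in FP16 -- far beyond one 80 GB card

# Communication wall: gradients for 128 model replicas per iteration
print(128 * 175e9 * 4 / 1e12)                     # 89.6 TB, assuming 4 bytes per gradient value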
2. Distributed training parallel strategy
The goal of a distributed training system is to convert single-node model training into equivalent distributed parallel model training. For large language models, the training process is the process of updating the neural network parameters with an optimization algorithm based on data and a loss function. The structure of a single-node training system is shown in Figure 3 and mainly consists of two parts: data and model. The training process is completed over multiple data mini-batches (Mini-batch).
The data in the figure represents a small batch of data. The training system uses small batches of data to generate gradients based on loss functions and optimization algorithms to correct model parameters. The execution process of a multi-layer neural network for a large language model can be represented by a computational graph (Computational Graph). This graph has multiple interconnected operators (Operators), each operator implements a neural network layer (Neural Network Layer), and the parameters represent the weights updated by this layer during training.
Figure 3 Single-device model training system
The execution of the computational graph can be divided into two stages: forward computation and backward computation. Forward computation reads the data into the first operator, computes the corresponding output, and repeats this forward process until the last operator finishes. In the backward pass, based on the loss and the optimization function, each operator computes its gradient in turn and uses the gradient to update the local parameters. After the backward pass finishes and the computation for the current data mini-batch is complete, the system reads the next mini-batch and continues the next round of model parameter updates.
According to the process of the single-device model training system, we can see that if parallel acceleration is performed, it can be considered from two dimensions: data and model. First, the data can be partitioned (Partition), the same model can be copied to multiple devices, and different data shards can be executed in parallel. This method is usually called Data Parallelism (DP). The model can also be divided and the operators in the model can be distributed to multiple devices for completion respectively. This method is usually called Model Parallelism (MP). When training very large-scale language models, it is often necessary to split the data and the model at the same time to achieve a higher degree of parallelism. This method is often called hybrid parallelism (HP).
2.1. Data parallelism
In a data parallel system, each computing device holds a complete copy of the entire neural network model (Model Replica). In each iteration, every computing device is allocated only a subset of one batch of data samples and performs the forward computation of the network model on that subset. Assume the number of training samples in a batch is N and M computing devices are used in parallel; each device is then allocated N/M samples. After the forward computation is completed, each computing device computes the loss error on its local samples to obtain the gradient Gi (where i is the accelerator card index) and broadcasts its local gradient Gi. All computing devices then aggregate the gradients from the other accelerator cards and use the average gradient (Σ_{i=1..M} Gi)/M to update the model, completing that batch's training step. Figure 4 shows an example of a data parallel training system consisting of two computing devices.
Figure 4 Example of two-node data parallel training system
A data parallel training system can effectively improve overall training throughput, i.e., the global batch size processed per second (Global Batch Size Per Second), by adding computing devices. Compared with single-device training, the main difference is that the gradients in the backward pass need to be synchronized across all computing devices to ensure that the final result on each device is the average of the gradients from all processes.
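Under the hood, this synchronization amounts to an all-reduce over the gradients. A minimal sketch of what DistributedDataParallel automates (assuming the default process group has already been initialized with torch.distributed.init_process_group):

import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all data-parallel processes after loss.backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum gradients from every device
            param.grad /= world_size                           # divide to obtain the average gradient

In practice DDP overlaps these all-reduce operations with the backward computation by bucketing gradients, which is why it is preferred over a hand-rolled loop like this one.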
Common neural network frameworks provide specific implementations of data parallelism, including TensorFlow DistributedStrategy, PyTorch Distributed, and Horovod DistributedOptimizer. Since every operator in a Transformer-based large language model depends on individual samples rather than on the whole batch, data parallelism does not affect its computation logic; in general, the forward computation on each training device is independent and does not involve synchronization issues. Data parallel training achieves the highest acceleration ratio, but it requires a full copy of the model on every device and therefore consumes a relatively large amount of video memory.
The code below uses PyTorch DistributedDataParallel to implement multi-accelerator-card training on a single server. First, construct the DistributedSampler class to randomly shuffle the dataset samples and distribute them to different computing devices:
import math
import torch
import torch.distributed as dist
from torch.utils.data import Sampler

class DistributedSampler(Sampler):
    def __init__(self, dataset, num_replicas=None, rank=None, shuffle=True, seed=0):
        if num_replicas is None:
            if not dist.is_available():
                raise RuntimeError("Requires distributed package to be available")
            num_replicas = dist.get_world_size()
        if rank is None:
            if not dist.is_available():
                raise RuntimeError("Requires distributed package to be available")
            rank = dist.get_rank()
        self.dataset = dataset                  # dataset
        self.num_replicas = num_replicas        # number of processes, defaults to world_size (number of GPUs)
        self.rank = rank                        # rank of the current process/GPU
        self.epoch = 0
        self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))  # number of samples per process
        self.total_size = self.num_samples * self.num_replicas  # total number of samples across all processes
        self.shuffle = shuffle                  # whether to shuffle the dataset
        self.seed = seed

    def __iter__(self):
        # 1. Shuffle: permute the order of the dataset
        if self.shuffle:
            # Shuffle based on epoch and seed
            g = torch.Generator()
            # self.seed is fixed; changing self.epoch via set_epoch changes the initialization seed.
            # This makes the shuffling order different in every epoch, so in each epoch
            # every GPU gets different data, which helps training.
            g.manual_seed(self.seed + self.epoch)
            indices = torch.randperm(len(self.dataset), generator=g).tolist()
        else:
            indices = list(range(len(self.dataset)))
        # Pad the index list so it divides evenly among processes
        indices += indices[:(self.total_size - len(indices))]
        assert len(indices) == self.total_size
        # Allocate data to the current rank
        indices = indices[self.rank:self.total_size:self.num_replicas]
        assert len(indices) == self.num_samples
        return iter(indices)

    def __len__(self):
        return self.num_samples

    def set_epoch(self, epoch):
        self.epoch = epoch
Use DistributedSampler to construct a complete training program sample main.py as follows:
import argparse
import os
import shutil
import time
import warnings
import numpy as np
warnings.filterwarnings('ignore')

import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
from torch.utils.data.distributed import DistributedSampler

from models import DeepLab
from dataset import Cityscaples

parser = argparse.ArgumentParser(description='DeepLab')
parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                    help='number of data loading workers (default: 4)')
parser.add_argument('--epochs', default=100, type=int, metavar='N',
                    help='number of total epochs to run')
parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                    help='manual epoch number (useful on restarts)')
parser.add_argument('-b', '--batch-size', default=3, type=int, metavar='N')
# The following three options are used below but were missing from the original listing; the defaults are assumed.
parser.add_argument('--lr', default=0.01, type=float, help='initial learning rate')
parser.add_argument('--momentum', default=0.9, type=float, help='momentum')
parser.add_argument('--weight-decay', default=1e-4, type=float, help='weight decay')
parser.add_argument('--local_rank', default=0, type=int,
                    help='node rank for distributed training')
args = parser.parse_args()

torch.distributed.init_process_group(backend="nccl")   # initialize the process group
print("Use GPU: {} for training".format(args.local_rank))

# create model
model = DeepLab()
torch.cuda.set_device(args.local_rank)                 # current GPU
model = model.cuda()                                   # move the model onto the GPU
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                  output_device=args.local_rank,
                                                  find_unused_parameters=True)  # data parallelism

criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), args.lr, momentum=args.momentum,
                            weight_decay=args.weight_decay)

train_dataset = Cityscaples()
train_sampler = DistributedSampler(train_dataset)      # distribute the data across processes
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size,
                                           shuffle=False, num_workers=args.workers,
                                           pin_memory=True, sampler=train_sampler)
Start the above program via the following command line:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 main.py
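In newer PyTorch releases, torch.distributed.launch is deprecated in favor of torchrun. Assuming the same main.py, a roughly equivalent launch would be the command below; note that torchrun passes the local rank through the LOCAL_RANK environment variable rather than the --local_rank argument, so the script would need to read it with os.environ["LOCAL_RANK"]:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py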
2.2 Model parallelism
Model Parallelism (MP) is often used to solve the problem of insufficient memory on a single node. Take the GPT-3 model with 175 billion parameters as an example: if each parameter is represented by a 32-bit floating point number, the model occupies 700GB (i.e. 175G × 4 bytes) of memory; if represented by 16-bit floating point numbers, each model copy still requires 350GB. The H100 accelerator card released by NVIDIA in March 2022 provides only 80GB of video memory, so the entire model cannot fit on a single card. From the perspective of the computational graph, model parallelism can take the following two forms:
(1) Split into different devices according to the layers of the model, that is, inter-layer parallelism or inter-operator parallelism (Inter-operator Parallelism), also called pipeline parallelism (Pipeline Parallelism, PP);
(2) Split the parameters in the calculation layer into different devices, that is, intra-layer parallelism or intra-operator parallelism (Intra-operator Parallelism), also called tensor parallelism (Tensor Parallelism, TP).
An example of a two-node model parallel training system is shown in Figure 5: on the left is pipeline parallelism, where different layers of the model are split across different devices; on the right is tensor parallelism, where different parameters within the same layer are split across different devices for computation.
Pipeline parallelism
Pipeline Parallelism (PP) is a parallel computing strategy that processes the layers of the model in stages and distributes the stages across different computing devices, so that successive stages can work in a pipelined, batch-by-batch fashion. Pipeline parallelism is usually applied in parallel systems for large-scale models to effectively solve the problem of insufficient memory on a single computing device. Figure 6 shows a pipeline parallel system composed of four computing devices, covering both forward and backward computation. F1, F2, F3, and F4 denote four forward stages located on different devices, while B4, B3, B2, and B1 denote the backward stages, also located on four different devices. As can be seen from the figure, a downstream device (Downstream Device) in the computational graph must remain idle for a long time, waiting for the upstream device (Upstream Device) to finish its computation before it can start its own task.
Figure 5 Example of two-node model parallel training system
This leaves the average device utilization significantly reduced, forming a model parallelism bubble (Model Parallelism Bubble), also known as a pipeline bubble (Pipeline Bubble).
Figure 6 Pipeline parallel example
The parallel bubbles produced by the naive pipeline strategy prevent the system from fully utilizing its computing resources and reduce overall computing efficiency. To reduce parallel bubbles, reference [131] proposed the GPipe method, which further divides the mini-batch into smaller micro-batches and uses the pipeline parallel scheme to process one micro-batch of data at a time.
After the computation of the current stage is completed and the result is obtained, the result of that micro-batch is sent to the downstream device while the data of the next micro-batch begins to be processed, which reduces parallel bubbles to some extent. Figure 7 shows an example of GPipe pipeline parallelism. As shown in the figure, the forward computation F1 is decomposed into F11, F12, F13, and F14. After F11 finishes on computing device 1, F21 starts on computing device 2, while F12 starts on computing device 1 in parallel. Compared with the naive pipeline parallel method, the GPipe pipeline approach can effectively reduce parallel bubbles.
Figure 7 GPipe policy pipeline parallel example
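The effect of micro-batching can be quantified with a commonly used estimate: with p pipeline stages and m micro-batches (and roughly equal per-stage execution times), the bubble occupies about (p − 1)/(m + p − 1) of the total time. A quick calculation under this assumption:

def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    # Fraction of time the pipeline sits idle, assuming equal per-micro-batch stage times.
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

print(bubble_fraction(4, 1))   # naive pipeline: 75% of the time is idle
print(bubble_fraction(4, 8))   # GPipe with 8 micro-batches: ~27%
print(bubble_fraction(4, 32))  # more micro-batches shrink the bubble further: ~8.6%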
Although the GPipe strategy can reduce some parallel bubbles, the backward computation can only start after all forward computations of a mini-batch are finished, so many parallel bubbles are still produced, reducing the parallel efficiency of the system. Megatron-LM [132] proposed the 1F1B pipeline strategy, which alternates one forward pass with one backward pass. 1F1B introduces a task scheduling mechanism that lets downstream devices execute other parallel tasks while waiting for upstream computations, improving device utilization. 1F1B provides two scheduling modes, non-interleaved and interleaved, as shown in Figure 8.
The 1F1B non-interleaved scheduling mode can be divided into three stages. The first is a warm-up phase in which varying numbers of forward computations are performed in the computing device. The next phase is the forward-backward phase, where the computing device sequentially performs a forward calculation and then a backward calculation. The last phase is the backward phase, where the computing device completes the last backward calculation. Compared with the GPipe strategy, the non-interleaved scheduling mode performs better in saving memory. However, it requires the same time as the GPipe strategy to complete one round of calculations.
The 1F1B interleaved scheduling mode requires the number of micro-batches to be an integer multiple of the number of pipeline stages. Instead of being solely responsible for several consecutive layers, each device processes subsets of multiple layers, called model chunks. Specifically, in the previous mode device 1 might be responsible for layers 1-4 and device 2 for layers 5-8, and so on; in the new mode device 1 can handle layers 1, 2, 9, 10, device 2 layers 3, 4, 11, 12, and so on. In this mode each device is assigned to multiple stages in the pipeline; for example, device 1 may participate in some subset of tasks in the warm-up phase, the forward computation phase, and the backward computation phase. Each device can execute computing tasks of different stages in parallel, taking better advantage of pipeline parallelism. This mode not only performs well in memory consumption but also improves computational efficiency, allowing parallel systems for large models to complete computing tasks more efficiently.
Figure 8 1F1B pipeline parallel strategy example
PyTorch also provides the Pipe API to implement pipeline parallelism; see the torch.distributed.pipeline.sync.Pipe class for the concrete implementation. The following example uses this API to build a model containing two linear layers placed on two different computing devices:
import os
import torch
import torch.nn as nn
import torch.distributed.rpc
from torch.distributed.pipeline.sync import Pipe

# Step 0: need to initialize the RPC framework first.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
torch.distributed.rpc.init_rpc('worker', rank=0, world_size=1)

# Step 1: build a model including two linear layers
fc1 = nn.Linear(16, 8).cuda(0)
fc2 = nn.Linear(8, 4).cuda(1)

# Step 2: wrap the two layers with nn.Sequential
model = nn.Sequential(fc1, fc2)

# Step 3: build Pipe (torch.distributed.pipeline.sync.Pipe)
model = Pipe(model, chunks=8)

# do training/inference
input = torch.rand(16, 16).cuda(0)
output_rref = model(input)
Tensor parallelism
Tensor Parallelism (TP) needs to solve two problems: how to split parameters into different devices according to the specific structure and operator type of the model, and how to ensure mathematical consistency after splitting. Large language models are based on the Transformer structure. The Transformer structure is mainly composed of the following three operators: embedded representation (Embedding), matrix multiplication (MatMul) and cross entropy loss (Cross Entropy Loss) calculation.
These three types of operators differ considerably, and corresponding tensor parallel strategies [130] need to be designed to split their parameters across devices. For the Embedding operator, if the vocabulary is very large, the video memory of a single computing device cannot hold the Embedding layer parameters. For example, with a vocabulary of 64,000, an embedding dimension of 5,120, and 32-bit floating point parameters, the whole layer needs approximately 64000 × 5120 × 4 / 1024 / 1024 = 1250MB of video memory, and the corresponding gradients need another 1250MB, so storage alone approaches 2.5GB.
The parameters of the embedding representation layer can be partitioned along the vocabulary dimension: each computing device stores only part of the word vectors, and the complete word vector is then obtained by aggregating the partial word vectors on all devices. Figure 9 shows a schematic diagram of single-node Embedding and two-node Embedding tensor parallelism.
On a single node, the Embedding operation takes bz as the batch size (batch size), the Embedding parameters are of size [word_size, hidden_size], and the output is a [bz, hidden_size] tensor. In the Embedding tensor parallel example of Figure 9, the Embedding parameters are split into two blocks along the word_size dimension; each block has size [word_size/2, hidden_size] and is stored on one of the two devices. When a node looks up an id in its own vocabulary shard and cannot find it, the representation of that word is set to 0. After each device finishes its lookup, it holds a [bz, hidden_size] partial result tensor; finally, an AllReduce_Sum communication sums the partial results across devices to obtain the complete result, which is consistent with the result of executing on a single computing device.
Figure 9 Two-node Embedding operator tensor parallel example
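The following is a minimal single-process simulation of this idea (the shapes, the two-way split, and the in-loop summation standing in for AllReduce_Sum are all illustrative assumptions):

import torch

word_size, hidden_size, bz = 8, 4, 3
full_weight = torch.randn(word_size, hidden_size)     # the full Embedding table, kept only as a reference
ids = torch.randint(0, word_size, (bz,))

shards = torch.chunk(full_weight, 2, dim=0)           # each "device" holds [word_size/2, hidden_size]
partial_sums = torch.zeros(bz, hidden_size)
for rank, shard in enumerate(shards):
    lo = rank * shard.shape[0]
    hi = lo + shard.shape[0]
    mask = (ids >= lo) & (ids < hi)                   # ids outside the local vocabulary shard map to 0
    local_ids = (ids - lo).clamp(0, shard.shape[0] - 1)
    partial = shard[local_ids] * mask.unsqueeze(1).float()  # zero out rows this shard does not own
    partial_sums += partial                           # stands in for AllReduce_Sum across devices

assert torch.allclose(partial_sums, full_weight[ids])  # matches the single-device lookup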
Tensor parallelism for matrix multiplication (MatMul) makes full use of the principle of block matrix multiplication. For example, consider the matrix multiplication Y = X × A, where X is the input matrix with dimension M × N, A is the parameter matrix with dimension N × K, and Y is the result matrix with dimension M × K. If the parameter matrix A is very large, or even exceeds the video memory capacity of a single card, A can be split across multiple cards, and the results gathered through collective communication so that the final result is mathematically equivalent to the result computed on a single device. There are two ways to partition the parameter matrix A:
(1) Column-wise split: the parameter matrix A is partitioned by columns, A = [A1, A2];
(2) Row-wise split: the parameter matrix A is partitioned by rows, A = [A1; A2] (A1 stacked on top of A2).
Figure 10 shows an example of splitting the parameter matrix by columns: A is split into A1 and A2, which are placed on two computing devices respectively. The two devices compute Y1 = X × A1 and Y2 = X × A2 respectively. After the computation, the devices exchange their results and concatenate them to obtain the final result matrix Y, which is mathematically equivalent to the single-device result.
Figure 10 Example of parallel splitting of two-node matrix multiplication operator tensors by columns
Figure 11 shows an example of splitting the parameter matrix by rows. To satisfy the matrix multiplication rule, the input matrix X must be split by columns, X = [X1 | X2], and the blocks are placed on two computing devices. Each device computes Y1 = X1 × A1 and Y2 = X2 × A2 respectively. After the computation, the devices communicate to obtain the partial results from the other cards and reduce (sum) them to get the final result matrix Y. This splitting method likewise guarantees mathematical equivalence while allowing the parameter matrix A to fit within the memory of a single computing device.
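Both splitting schemes can be checked numerically on a single device; a small sketch with arbitrary example shapes:

import torch

M, N, K = 4, 6, 8
X = torch.randn(M, N)
A = torch.randn(N, K)
Y = X @ A                                    # single-device reference result

# Column-wise split: A = [A1, A2]; each device computes X @ Ai, and the results are concatenated (gather).
A1, A2 = torch.chunk(A, 2, dim=1)
Y_col = torch.cat([X @ A1, X @ A2], dim=1)

# Row-wise split: A is split by rows, so X must be split by columns; the partial results are summed (reduce).
A_top, A_bottom = torch.chunk(A, 2, dim=0)
X1, X2 = torch.chunk(X, 2, dim=1)
Y_row = X1 @ A_top + X2 @ A_bottom

assert torch.allclose(Y, Y_col, atol=1e-6) and torch.allclose(Y, Y_row, atol=1e-6)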
The FFN structure in a Transformer contains two fully connected (FC) layers, i.e., two matrix multiplications, and these two multiplications adopt the two splitting methods above, as shown in Figure 12: the parameter matrix of the first FC layer is split by columns and that of the second FC layer is split by rows. In this way, the output of the first FC layer exactly meets the input requirement of the second FC layer (split by columns), so the aggregation communication after the first FC layer can be omitted. Tensor parallelism for the multi-head self-attention mechanism is similar to that for the FFN; because it has multiple independent heads, it is even easier to parallelize than the FFN. Its matrix splitting scheme is shown in Figure 13; see [130] for details.
The last layer of the classification network generally uses Softmax and Cross_entropy operators to calculate cross entropy loss (Cross Entropy Loss). If the number of categories is very large, it will cause the memory of a single computing device to be unable to store and calculate the logit matrix. For this type of operator, it can be divided according to the category dimension, and at the same time, the final global cross-entropy loss can be obtained through intermediate result communication.
Figure 11 Example of parallel splitting of two-node matrix multiplication operator tensor by rows
Figure 12 FFN structure tensor parallelism diagram
The first step is to compute the softmax value across devices; the formula is as follows:
softmax(x_i) = exp(x_i) / Σ_j exp(x_j) = exp(x_i − x_max) / Σ_j exp(x_j − x_max), where x_max = max_p(max_k(x_k^(p))) is obtained by an AllReduce (max) over the tensor-parallel devices, and the denominator is obtained by an AllReduce (sum) over the per-device partial sums.
Here, p denotes the tensor-parallel device index. After the softmax result is obtained, the labels (Target) are also partitioned by category, and each device computes its part of the loss; finally, one more communication aggregates the loss over all categories. The entire process needs only three small communications to complete the cross-entropy loss calculation. PyTorch provides a fine-grained tensor-level parallel API, DistributedTensor, as well as a coarse-grained model-level API that applies tensor parallelism to "nn.Module". A large tensor can be sharded with the following lines of code:
import torch
from torch.distributed._tensor import DTensor, DeviceMesh, Shard, distribute_tensor

# construct a device mesh with available devices (multi-host or single host)
device_mesh = DeviceMesh("cuda", [0, 1, 2, 3])
# if we want to do row-wise sharding
rowwise_placement = [Shard(0)]
# if we want to do col-wise sharding
colwise_placement = [Shard(1)]

big_tensor = torch.randn(888, 12)
# the distributed tensor returned will be sharded across the dimension specified in placements
rowwise_tensor = distribute_tensor(big_tensor, device_mesh=device_mesh, placements=rowwise_placement)
For modules like "nn.Linear" that already have "torch.Tensor" as a parameter, the module-level API "distribute_module" is also provided to perform tensor parallelism at the model level. The reference code is as follows:
import torch
import torch.nn as nn
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor, distribute_module

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 8)
        self.fc2 = nn.Linear(8, 8)
        self.relu = nn.ReLU()

    def forward(self, input):
        return self.relu(self.fc1(input) + self.fc2(input))

mesh = DeviceMesh(device_type="cuda", mesh=[[0, 1], [2, 3]])

def shard_params(mod_name, mod, mesh):
    rowwise_placement = [Shard(0)]
    def to_dist_tensor(t):
        return distribute_tensor(t, mesh, rowwise_placement)
    mod._apply(to_dist_tensor)

sharded_module = distribute_module(MyModule(), mesh, partition_fn=shard_params)

def shard_fc(mod_name, mod, mesh):
    rowwise_placement = [Shard(0)]
    if mod_name == "fc1":
        mod.weight = torch.nn.Parameter(distribute_tensor(mod.weight, mesh, rowwise_placement))

sharded_module = distribute_module(MyModule(), mesh, partition_fn=shard_fc)
2.3 Hybrid parallelism
Hybrid Parallelism (HP) is a mixture of multiple parallel strategies such as data parallelism, pipeline parallelism, and tensor parallelism. By combining different parallel strategies, hybrid parallelism can take full advantage of various parallel strategies to maximize computing performance and efficiency.
For large language models with a scale of hundreds of billions, a tensor parallel strategy is usually used within each server. Since this strategy involves a large amount of network communication, it is necessary to utilize high-speed communication bandwidth between different computing devices within the server. Through pipeline parallelism, different layers of the model are divided into multiple stages, and each stage is calculated by a different machine. In this way, the computing power of multiple machines can be fully utilized, and calculation results and intermediate data can be transferred through high-speed communication between machines to improve the overall computing speed and efficiency.
Finally, the data parallel strategy is superimposed on the outer layer to increase the number of concurrencies and improve the overall training speed. Through data parallelism, training data is distributed to multiple groups of servers for parallel processing, and each group of servers processes different data batches. This can make full use of the computing resources of multiple servers and increase the concurrency of training, thereby speeding up the overall training speed.
BLOOM uses the Megatron-DeepSpeed[104] framework for training, which mainly consists of two parts: Megatron-LM provides tensor parallel capabilities and data loading primitives; DeepSpeed provides the ZeRO optimizer, model pipeline, and conventional distributed training components. In this way, three-dimensional parallelism of data, tensors and pipelines can be achieved. The parallel computing structure used in BLOOM model training is shown in Figure 14.
BLOOM was trained on a cluster of 48 NVIDIA DGX-A100 servers; each DGX-A100 server contains 8 NVIDIA A100 80GB GPUs, for a total of 384 GPUs. The strategy adopted in BLOOM training is to first divide the 384 GPUs into 8 data-parallel groups of 48 GPUs each.
Next, the entire model is divided into 12 stages for pipeline parallelism, and the model within each stage is split across 4 GPUs for tensor parallelism. BLOOM also uses ZeRO (Zero Redundancy Optimizer) [134] to further reduce the model's video memory footprint. Through these four steps, efficient parallel computing across hundreds of GPUs is achieved.
Figure 14 Parallel computing structure used in BLOOM model training
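To make the decomposition concrete, the sketch below maps a global rank to hypothetical (data, pipeline, tensor) parallel coordinates for the configuration described above; the actual group ordering used by Megatron-DeepSpeed may differ:

tp, pp, dp = 4, 12, 8                      # tensor, pipeline, and data parallel degrees used by BLOOM
world_size = tp * pp * dp
assert world_size == 384

def rank_to_coords(rank: int):
    tp_rank = rank % tp                    # tensor-parallel position, fastest-varying (within a node)
    pp_rank = (rank // tp) % pp            # pipeline stage
    dp_rank = rank // (tp * pp)            # data-parallel replica
    return dp_rank, pp_rank, tp_rank

print(rank_to_coords(0))                   # (0, 0, 0)
print(rank_to_coords(383))                 # (7, 11, 3)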
2.4 Computing device memory optimization
Current large language model training typically uses the Adam optimization algorithm, which, in addition to the gradient of each parameter, also requires first-order momentum (Momentum) and second-order momentum (Variance). Although the Adam optimizer is usually more effective and more stable than SGD, it significantly increases the memory footprint on the computing device.
To reduce memory usage, most systems adopt Mixed Precision Training, in which values exist in both FP16 (16-bit floating point) or BF16 (Bfloat16) and FP32 (32-bit floating point) formats. FP32, FP16, and BF16 are represented as shown in Figure 15. In FP32, bit 31 is the sign bit, bits 30 to 23 represent the exponent, and bits 22 to 0 represent the mantissa. In FP16, bit 15 is the sign bit, bits 14 to 10 represent the exponent, and bits 9 to 0 represent the mantissa. In BF16, bit 15 is the sign bit, bits 14 to 7 represent the exponent, and bits 6 to 0 represent the mantissa. Since the value range of FP16 is much smaller than that of FP32, overflow and underflow can easily occur during computation. Compared with FP16, BF16 trades precision for a larger value range. However, because FP16 and BF16 have lower precision than FP32, gradient vanishing and model instability may occur during training.
Therefore, some techniques are needed to address these problems, such as dynamic loss scaling (Dynamic Loss Scaling) and a mixed precision optimizer (Mixed Precision Optimizer). The mixed precision optimization process is shown in Figure 16. The Adam optimizer state includes a model parameter backup kept in FP32, and the first-order and second-order momenta are also stored in FP32. Assuming the model has Φ parameters and that the model parameters and gradients are stored in FP16, a total of 2Φ + 2Φ + (4Φ + 4Φ + 4Φ) = 16Φ bytes of storage is required, of which the Adam state accounts for 75%. With dynamic loss scaling, the loss is multiplied by 2^K before backpropagation so that the small activation gradients obtained during backpropagation do not underflow; after backpropagation, the weight gradients are divided by 2^K to restore their normal values. For example, for a model with 7.5 billion parameters, the FP16 parameters themselves need only 15GB of device memory, but the model states actually consume 120GB during training.
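A small calculation following the 16Φ breakdown above (the function name and the unit handling are mine):

def mixed_precision_training_memory(num_params: float) -> float:
    """Bytes needed for model states: FP16 parameters and gradients plus FP32 Adam state
    (master weights, first-order momentum, second-order momentum)."""
    fp16_params = 2 * num_params
    fp16_grads = 2 * num_params
    fp32_adam_state = (4 + 4 + 4) * num_params
    return fp16_params + fp16_grads + fp32_adam_state

phi = 7.5e9                                             # the 7.5-billion-parameter example in the text
print(phi * 2 / 1e9)                                    # 15 GB for the FP16 parameters alone
print(mixed_precision_training_memory(phi) / 1e9)       # 120 GB of model states during training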
Besides the model states, the memory on a computing card is also occupied by residual states (Residual States), including activation values (Activation), various temporary buffers (Buffers), and unusable video memory fragments (Fragmentation). Since activation checkpointing (Activation Checkpointing) can greatly reduce the memory footprint of activations, reducing the model states, especially the Adam optimizer states, is the key to solving the memory footprint problem.
Figure 16 Mixed precision optimization process
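As a minimal sketch of activation checkpointing with PyTorch's torch.utils.checkpoint utilities (the toy model and segment count are illustrative; recent PyTorch versions may also ask for an explicit use_reentrant argument):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Eight small blocks; only segment-boundary activations are kept, the rest are recomputed during backward.
model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)])
x = torch.randn(16, 1024, requires_grad=True)

out = checkpoint_sequential(model, 4, x)     # split the sequence into 4 checkpointed segments
out.sum().backward()
assert model[0][0].weight.grad is not None   # gradients still flow as usual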
The above is a brief introduction to the basic concepts of distributed machine learning systems, distributed training cluster architectures, and distributed training parallel strategies. How to use DeepSpeed to train a large language model on a cluster will be covered in the next article. Welcome to like, follow, and support; your support is the driving force for my creation.
Reference content:
(1) Collection丨Sharing of 30 large language model training-related data sets - Zhihu. https://zhuanlan.zhihu.com/p/612243919.
(2) Four common processing methods for large language model training data - Zhihu. https://zhuanlan.zhihu.com/p/673045395.
(3) 《大规模语言模型:从理论到实践》 (Large Language Models: From Theory to Practice), Zhang Qi et al., Beijing: Publishing House of Electronics Industry.
(4) Review of large language models - Renmin University of China. http://ai.ruc.edu.cn/research/science/20230605100.html.
Click to follow and learn about Huawei Cloud’s new technologies as soon as possible~