Preface
My configuration:
7 machines and 14 cards, two A800 cards per server
Question: Why does each machine have only two cards?
Answer: That is simply what I was given. I would have liked 8 cards in a single machine, but these servers come from a cloud vendor; they reportedly use PCIe interconnects, and a single machine supports at most four cards.
The servers only allow access to the internal network and cannot reach the external network, so the first task is to figure out how to set up the training environment offline.
Configure training environment offline
For details, please refer to: Anaconda environment cloning and migration
When packaging the environment as described in that article, you may encounter an error that can be solved by adding the --ignore-missing-files flag:
conda pack -n <env_name> -o <env_name>.tar.gz --ignore-missing-files
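On the offline server, the tarball is then unpacked into the envs directory and conda-unpack fixes the hard-coded prefix paths. A minimal sketch of the unpack flow, simulated here with a dummy tarball so it runs anywhere (all paths and names are examples, not from the original setup):

```shell
# Simulate the offline unpack: fabricate a tiny tarball standing in for the
# real `conda pack` output, then unpack it exactly as you would on the server.
set -e
WORK=$(mktemp -d)
cd "$WORK"
mkdir -p packed/bin
echo '# activate script placeholder' > packed/bin/activate
tar -czf myenv.tar.gz -C packed .      # stand-in for the conda pack output
ENV_DIR="$WORK/envs/myenv"             # real server: /root/anaconda3/envs/myenv
mkdir -p "$ENV_DIR"
tar -xzf myenv.tar.gz -C "$ENV_DIR"
ls "$ENV_DIR/bin"
# On the real server you would then run:
#   source /root/anaconda3/envs/myenv/bin/activate && conda-unpack
```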
Shared file system
Normally, a shared file system brings many benefits for multi-machine multi-card training. For example, you only need to keep one copy of the dataset and the model, and, more importantly, checkpoints can all be written to a single location. Without a shared file system, each server has to keep its own copy of the model parameters, and when you want to resume training from a checkpoint you must manually merge the optimizer states from every machine, which is very troublesome.
What if there really is no shared file system?
Solution:
Method 1: In the DeepSpeed configuration, add the use_node_local_storage parameter to the checkpoint section, as follows:
"checkpoint": {
    "use_node_local_storage": true
}
In case you are not sure where to add it, here is a complete DeepSpeed ZeRO stage 2 configuration example:
{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": "auto",
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "checkpoint": {
        "use_node_local_storage": true
    }
}
Parameter explanation
Original documentation: https://www.deepspeed.ai/docs/config-json/
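If you already have a DeepSpeed config file, the checkpoint section can also be merged in programmatically rather than edited by hand. A minimal sketch (the file name ds_config_stage2.json and the abbreviated config are examples):

```python
import json

# Start from an existing (here, abbreviated) stage-2 config; in practice you
# would json.load() your real ds_config_stage2.json instead.
config = {
    "zero_optimization": {"stage": 2},
    "train_batch_size": "auto",
}

# Method 1: enable node-local checkpoint storage when there is no shared FS.
config.setdefault("checkpoint", {})["use_node_local_storage"] = True

with open("ds_config_stage2.json", "w") as f:
    json.dump(config, f, indent=4)

# Round-trip to confirm the file is valid JSON with the flag set.
with open("ds_config_stage2.json") as f:
    loaded = json.load(f)
print(loaded["checkpoint"]["use_node_local_storage"])  # True
```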
Method 2: Add the --save_on_each_node parameter to your TrainingArguments configuration.
In fact, the Hugging Face DeepSpeed integration documentation does explain the case without a shared file system, although it is hard to find: https://huggingface.co/docs/transformers/main/en/main_classes/deepspeed#use-of-nonshared-filesystem
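For context, here is a hedged sketch of where --save_on_each_node sits in a typical DeepSpeed launch command; the script name train.py, the hostfile, and the config file name are placeholders, not from the original post:

```python
# Build the launcher command as a list, the same way you would type it in a
# shell; every path/name below is a placeholder to adapt to your own setup.
cmd = [
    "deepspeed",
    "--hostfile", "hostfile",                # node list for multi-machine runs
    "train.py",
    "--deepspeed", "ds_config_stage2.json",  # config from Method 1 (optional)
    "--output_dir", "./output",
    "--save_on_each_node", "True",           # Method 2: checkpoint on every node
]
print(" ".join(cmd))
```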
Both of the above methods solve the problem of being unable to resume training without a shared file system.
If you use the above configuration, another problem may appear: when you resume training from a checkpoint path, the process can hang while loading the checkpoint. The GPU is occupied and shows utilization, but the code makes no progress. (The original post includes a screenshot of the hang at this point, referenced from: The pitfalls of deepspeed multi-machine multi-card training.) First check whether your device_map is set to "auto"; if it is not, the code will certainly get stuck here. If device_map="auto" and the code is still stuck, possible solutions are discussed in that referenced article.
Configure mutual password-free login between multiple servers
Refer to SSH remote login: password-free login settings between two or more servers
This is a must-do, and it’s best to do it right at the beginning, as it can save a lot of time.
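The usual recipe, hedged (the user and host names below are placeholders): generate a key pair once per server, then push each server's public key to every other server, e.g. with ssh-copy-id. A local sketch of the key-generation step:

```shell
# Generate a passwordless key pair into a temporary directory (a real setup
# writes to ~/.ssh); ssh-copy-id then pushes the public half to other nodes.
set -e
KEYDIR=$(mktemp -d)
ssh-keygen -t ed25519 -N "" -f "$KEYDIR/id_ed25519" -q
ls "$KEYDIR"
# Real follow-up, repeated once per peer server:
#   ssh-copy-id -i "$KEYDIR/id_ed25519.pub" user@other-server
```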
pdsh
Install pdsh on each server. Installation method:
# Download and extract
wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/pdsh/pdsh-2.29.tar.bz2 && tar -xf pdsh-2.29.tar.bz2 -C /root/pdsh
# Build and install
cd pdsh-2.29 && ./configure --with-ssh --enable-static-modules --prefix=/usr/local && make && make install
# Test
pdsh -V
Just change the paths to your own. If your server is offline, first download pdsh on a machine with Internet access, then copy it to the offline server and install it there.
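pdsh is what DeepSpeed's multi-node launcher uses under the hood, driven by a hostfile that lists each node and its GPU slots. A small sketch that generates a hostfile matching the setup above (7 machines, 2 cards each); the node names are made up:

```python
# Write a DeepSpeed hostfile: one "hostname slots=N" line per machine.
# "node1".."node7" are placeholder hostnames; slots=2 matches two A800s per box.
nodes = [f"node{i}" for i in range(1, 8)]
with open("hostfile", "w") as f:
    for host in nodes:
        f.write(f"{host} slots=2\n")

with open("hostfile") as f:
    print(f.read(), end="")
```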
Problems you may encounter during multi-machine multi-card training
Question 1: Ninja is already installed, but DeepSpeed multi-machine multi-card training raises RuntimeError: Ninja is required to load C++ extensions.
Answer 1:
Add the following at the beginning of the training code, where /root/anaconda3/envs/baichuan/bin is the bin directory of the conda virtual environment on the server:
import os

local_env = os.environ.copy()
local_env["PATH"] = "/root/anaconda3/envs/baichuan/bin:" + local_env["PATH"]
os.environ.update(local_env)
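To confirm the fix took effect, you can check which directory PATH resolves first and whether ninja is now discoverable; a small sketch (the env path is the same placeholder as above):

```python
import os
import shutil

# Prepend the (placeholder) conda env bin dir, as in the fix above, then see
# where ninja would be resolved from; None means it is still not on PATH.
env_bin = "/root/anaconda3/envs/baichuan/bin"
os.environ["PATH"] = env_bin + os.pathsep + os.environ["PATH"]
print(os.environ["PATH"].split(os.pathsep)[0])
print(shutil.which("ninja"))
```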
Question 2: libcudart.so.12.2: cannot open shared object file: No such file or directory
Answer 2:
1. Check whether the file libcudart.so.12.2 exists (normally it does); if it does not, you need to reinstall CUDA.
2. Run on the command line: sudo ldconfig /usr/local/cuda-12.2/lib64
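Before reinstalling CUDA, it is worth checking the usual library locations first; a quick diagnostic sketch (the glob patterns cover common default install paths and may need adjusting for your distribution):

```python
import glob

# Look for libcudart in the usual places; an empty result suggests CUDA's
# runtime library is genuinely missing or installed somewhere non-standard.
patterns = [
    "/usr/local/cuda*/lib64/libcudart.so*",
    "/usr/lib/x86_64-linux-gnu/libcudart.so*",
]
found = [p for pattern in patterns for p in glob.glob(pattern)]
print(found if found else "libcudart not found; reinstall CUDA or run ldconfig")
```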
Notice
The training code must be exactly the same on every machine, and the storage paths must be consistent (including software installation paths, etc.); otherwise you will run into strange errors that can really make you go bald.
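One cheap way to enforce the "identical code" rule is to hash the training files and compare the digests across machines (e.g. via pdsh); a sketch using a throwaway file, where all file names are examples:

```python
import hashlib
import pathlib
import tempfile

def file_digest(path):
    """SHA-256 of a file's bytes; identical code => identical digest."""
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

# Demo with throwaway files standing in for train.py on two machines.
with tempfile.TemporaryDirectory() as d:
    a = pathlib.Path(d, "train_machine1.py")
    b = pathlib.Path(d, "train_machine2.py")
    a.write_text("print('train')\n")
    b.write_text("print('train')\n")
    print(file_digest(a) == file_digest(b))  # True when the code matches
```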
Summary
Anyone who has actually done multi-machine multi-card training will appreciate how detailed this article is! It is no exaggeration to say it is packed with practical information. I hope you will like and bookmark it.