DeepSpeed Multi-Node Multi-GPU Fine-Tuning Notes

Enterprise Development 2025-04-09 19:44:54

First, check which network adapters the machine has, using the following command:

lspci | grep -i ethernet

The output is as follows:

31:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
31:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
4b:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
4b:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

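Only Intel Ethernet adapters are listed and the server has no Mellanox adapter, so InfiniBand is not available and communication has to go over TCP/IP. Disable NCCL's IB (InfiniBand) support explicitly so that NCCL uses TCP/IP sockets: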

export NCCL_IB_DISABLE=1

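The goal is to fine-tune BERT with DeepSpeed on three servers, each with multiple GPUs. The model and data are already in place, so the remaining work is to set up the DeepSpeed environment, configure distributed training, and launch the multi-node job. The steps are as follows:

1. Prepare the environment

Install DeepSpeed and its dependencies on all servers: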

pip install deepspeed transformers torch

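Multi-node communication goes through torch.distributed, which is part of PyTorch, so install a torch build that matches the CUDA version on your servers (the cu118 index below is for CUDA 11.8):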

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

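Check that NCCL works by running the all-reduce benchmark from nccl-tests (it has to be built first; setting NCCL_DEBUG=VERSION beforehand also makes NCCL print its version):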

nccl-tests/build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
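The nccl-tests binary uses the system NCCL library, while training uses the NCCL that PyTorch was built with, so it can help to read that version from Python as well. A minimal sketch, assuming torch is installed with CUDA support; the helper name nccl_version_string is made up for illustration, and it returns None when the version cannot be queried:

```python
def nccl_version_string():
    """Return the NCCL version PyTorch reports, or None if it is unavailable."""
    try:
        import torch  # imported lazily so the helper degrades gracefully
        version = torch.cuda.nccl.version()
    except Exception:
        # torch missing, or a build without NCCL support
        return None
    if isinstance(version, tuple):
        # newer PyTorch releases return (major, minor, patch)
        return ".".join(str(part) for part in version)
    # older releases return a single integer
    return str(version)


print("NCCL version seen by PyTorch:", nccl_version_string())
```

Running this on every server is a quick way to spot nodes whose PyTorch builds differ.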

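2. Make sure the servers can reach each other

DeepSpeed uses torch.distributed for multi-node multi-GPU training, and its launcher starts the worker processes on the other nodes over SSH, so the master node needs passwordless SSH access to every server.

(1) Set up passwordless SSH

On the master node (rank 0), run:

ssh-keygen -t rsa  # generate a key pair
ssh-copy-id user@node1  # copy the public key to the other nodes
ssh-copy-id user@node2

Then run ssh node1 and ssh node2 from the master node to confirm that login works without a password.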

(2) Check hostname resolution

Make sure /etc/hosts on every server contains the IP address and hostname of all servers:

192.168.1.1  node0
192.168.1.2  node1
192.168.1.3  node2
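Whether those names actually resolve can be checked from Python with the standard library. This is a small sketch; the function name resolve_hosts is hypothetical and not part of DeepSpeed:

```python
import socket


def resolve_hosts(hostnames):
    """Map each hostname to its IPv4 address, or None if it does not resolve."""
    resolved = {}
    for name in hostnames:
        try:
            resolved[name] = socket.gethostbyname(name)
        except OSError:
            # socket.gaierror (a subclass of OSError) is raised for unknown hosts
            resolved[name] = None
    return resolved
```

Calling resolve_hosts(["node0", "node1", "node2"]) on each server should return the addresses from /etc/hosts; a None value points at a missing or mistyped entry.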

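DeepSpeed's launcher also needs a hostfile that lists each node and the number of GPUs (slots) it provides, in the following format: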

node0 slots=2
node1 slots=2
node2 slots=2
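The total number of slots in the hostfile is the world size of the job, which matters later for the batch size settings. The sketch below parses the format shown above; parse_hostfile is an illustrative helper, not DeepSpeed's own parser:

```python
def parse_hostfile(text):
    """Return {hostname: slots} from lines such as 'node0 slots=2'."""
    slots = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        host, slot_field = line.split()
        key, value = slot_field.split("=")
        if key != "slots":
            raise ValueError(f"unexpected field in hostfile line: {line!r}")
        slots[host] = int(value)
    return slots


HOSTFILE = """\
node0 slots=2
node1 slots=2
node2 slots=2
"""

hosts = parse_hostfile(HOSTFILE)
world_size = sum(hosts.values())  # total number of GPUs across all nodes
print(hosts, "world size:", world_size)
```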

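(3) Check NCCL communication

On all servers, run:

export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0  # make sure this is the right network interface
export NCCL_IB_DISABLE=1  # if there is no InfiniBand device
export NCCL_P2P_DISABLE=1  # disables direct GPU-to-GPU transfers within a node; only needed if P2P causes problems

Then run the following on each of the three nodes: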

python -m torch.distributed.run --nnodes=3 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=node0:29500 test_nccl.py
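The original post does not show test_nccl.py. One possible version is sketched below: every rank contributes its own rank number to an all_reduce and compares the result with the expected sum. It assumes the script is started by torch.distributed.run (which sets LOCAL_RANK and the other rendezvous variables) and that each process has a CUDA device; the helper expected_allreduce_sum is made up for this example.

```python
import os


def expected_allreduce_sum(world_size):
    """Sum of 0 + 1 + ... + (world_size - 1)."""
    return world_size * (world_size - 1) // 2


def main():
    # Imported here so the helper above can be used without a GPU setup.
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank contributes its own rank; after all_reduce every rank holds the sum.
    value = torch.tensor([float(rank)], device="cuda")
    dist.all_reduce(value, op=dist.ReduceOp.SUM)

    ok = int(value.item()) == expected_allreduce_sum(world_size)
    print(f"rank {rank}/{world_size}: sum={value.item()} ok={ok}")

    dist.destroy_process_group()


# Only run when started by the launcher, which sets LOCAL_RANK.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

With 3 nodes and 2 processes per node, all six ranks should report the same sum and ok=True.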

If NCCL communication works, continue to the next step.

3. Configure DeepSpeed

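Create ds_config.json. DeepSpeed requires train_batch_size to equal train_micro_batch_size_per_gpu × gradient_accumulation_steps × the total number of GPUs; with 3 nodes × 2 GPUs that is 8 × 2 × 6 = 96:

{
  "train_batch_size": 96,
  "train_micro_batch_size_per_gpu": 8,
  "zero_optimization": {
    "stage": 1
  },
  "fp16": {
    "enabled": true
  },
  "gradient_accumulation_steps": 2
}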
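DeepSpeed checks at startup that train_batch_size equals train_micro_batch_size_per_gpu × gradient_accumulation_steps × the number of GPUs. The same arithmetic can be verified before launching. This is a sketch with hypothetical helper names, using 6 GPUs (3 nodes × 2 GPUs) as the world size:

```python
import json


def expected_train_batch_size(config, world_size):
    """Global batch size implied by the per-GPU settings."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * world_size
    )


def batch_config_is_consistent(config, world_size):
    """True if train_batch_size matches micro batch x accumulation x GPUs."""
    return config["train_batch_size"] == expected_train_batch_size(config, world_size)


example = json.loads(
    '{"train_batch_size": 96,'
    ' "train_micro_batch_size_per_gpu": 8,'
    ' "gradient_accumulation_steps": 2}'
)
print(expected_train_batch_size(example, world_size=6))
print(batch_config_is_consistent(example, world_size=6))
```

In practice the config would be loaded from ds_config.json with json.load, and the world size taken from the hostfile.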

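If the model is large, try setting zero_optimization.stage to 2 or 3 to reduce GPU memory usage.

4. Launch distributed training

Assuming train.py is the training script, training can be launched with DeepSpeed:

Method 1: specify the nodes manually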

deepspeed --hostfile=hostfile train.py --deepspeed ds_config.json

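The master address, port, node count, and GPU count can also be given explicitly. The launcher still needs the hostfile to find the other nodes (it looks for /job/hostfile when --hostfile is not given):

deepspeed --hostfile=hostfile --master_addr=node0 --master_port=29500 --num_nodes 3 --num_gpus 2 train.py --deepspeed ds_config.json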

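Method 2: PyTorch's `torch.distributed.launch` / torchrun

torch.distributed.launch is deprecated in recent PyTorch releases in favor of torchrun, which accepts the same arguments here. Run this on node0:

torchrun --nproc_per_node=2 --nnodes=3 --node_rank=0 --master_addr="node0" --master_port=29500 train.py

Run the same command on node1 and node2 with node_rank set to 1 and 2.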
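Since only node_rank differs between nodes, the three commands can be generated rather than typed by hand. The helper below is purely illustrative (torchrun_command is not a PyTorch function) and uses torchrun, the current name for the PyTorch launcher, with the node names and port from this post as defaults:

```python
import shlex


def torchrun_command(node_rank, nnodes=3, nproc_per_node=2,
                     master_addr="node0", master_port=29500,
                     script="train.py"):
    """Build the torchrun command line for one node."""
    args = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]
    return shlex.join(args)


for rank in range(3):
    print(f"node{rank}: {torchrun_command(rank)}")
```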

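5. Modify the training script

Add DeepSpeed initialization to train.py. Because ds_config.json does not define an optimizer, one is created in the script and passed to deepspeed.initialize. The learning rate and epoch count are placeholder values, and the loop assumes Hugging Face style batches (dicts of tensors that include labels):

import deepspeed
import torch
import torch.distributed as dist

def setup():
    # The launcher sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
    dist.init_process_group(backend="nccl")

def main():
    setup()

    model = ...  # load the BERT model
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # placeholder learning rate
    # deepspeed.initialize returns an engine that wraps the model
    model, optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        model_parameters=model.parameters(),
        config="ds_config.json"
    )

    train_dataloader = ...
    num_epochs = 3  # placeholder value
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            # Move the batch to this process's GPU
            batch = {k: v.to(model.local_rank) for k, v in batch.items()}
            # Hugging Face models return the loss when labels are in the batch
            loss = model(**batch).loss
            model.backward(loss)
            model.step()

if __name__ == "__main__":
    main()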

6. Monitor training

During training, GPU utilization can be monitored with nvidia-smi:

watch -n 1 nvidia-smi

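Or view training logs in TensorBoard. The command below assumes the training script writes event files under runs/, which is the default location for torch.utils.tensorboard.SummaryWriter: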

tensorboard --logdir runs/

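7. Common problems

(1) NCCL timeouts during training

Try:

export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1

Make sure eth0 is actually the interface that connects the nodes. On many systems the name is different (for example eno1 or ens3), and ip addr lists the available interfaces.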
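The interface names can also be listed from Python to find a value for NCCL_SOCKET_IFNAME. The filter below is only a heuristic sketch: the function names and the list of excluded prefixes (loopback, Docker, and virtual interfaces) are assumptions, and the right interface still has to be confirmed against the IP addresses in /etc/hosts.

```python
import socket


def list_interfaces():
    """Names of all network interfaces known to the OS."""
    return [name for _, name in socket.if_nameindex()]


def candidate_interfaces(names, excluded_prefixes=("lo", "docker", "veth", "br-")):
    """Drop loopback and container/virtual interfaces from a list of names."""
    return [n for n in names if not n.startswith(excluded_prefixes)]


names = list_interfaces()
print("all interfaces:", names)
print("candidates for NCCL_SOCKET_IFNAME:", candidate_interfaces(names))
```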

(2) DeepSpeed fails at startup

If the error is RuntimeError: Address already in use, check whether port 29500 is already taken:

netstat -tulnp | grep 29500

If it is, switch to a different port with --master_port.
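The same check can be done from Python by trying to bind the port on the master node. A minimal sketch using only the standard library; port_is_free is a hypothetical helper name:

```python
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use"
            return False
    return True


print("port 29500 free:", port_is_free(29500))
```

The check only reflects the moment it runs, so another process can still grab the port before training starts.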

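(3) Training is slow

Enable fp16 training
Increase gradient_accumulation_steps so that gradients are synchronized over the network less often
Use ZeRO Stage 2/3 to reduce GPU memory usage, which leaves room for a larger micro-batch

Summary

Make sure passwordless SSH works between the servers and that hostnames resolve correctly
Configure DeepSpeed and check that the settings in ds_config.json are correct
Launch training with deepspeed or the PyTorch launcher (torch.distributed.launch / torchrun)
Watch NCCL communication to catch network or port problems early
Tune ZeRO optimization and fp16 training to improve training efficiency

With these steps in place, multi-node multi-GPU BERT fine-tuning can run on the three servers.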

Reposted from blog.csdn.net/weixin_45056021/article/details/146308045
