Notes on problems encountered when using PyTorch DistributedDataParallel and AMP

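Environment
Problems
- Single node, multiple GPUs:
  - Timeout error
    - Partial error output
    - Solution
  - Parameters that receive no gradient
    - Error output
    - Solution
      - Method 1: find the layers that take no part in the gradient computation **and** serve no purpose, and delete them
      - Method 2: pass find_unused_parameters=True to DistributedDataParallel
  - DDP training: the first batch produces results, the second hangs with GPU utilization at 100%
    - Solution: enable the debugging mode described in the AMP section and analyze the error; part of it is AMP-related, part is a DDP configuration error
      - Error output
      - Solution
- Not enough user threads
  - Error output
  - Solution
- AMP training
  - The loss does not fall as expected, and after a few epochs it is NaN everywhere
    - Unscale before torch.nn.utils.clip_grad_norm_
    - Turn on debugging mode to locate the suspect code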

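Notes on problems encountered in PyTorch distributed training and mixed-precision training

Environment

pytorch==2.3.1

Problems

Single node, multiple GPUs:

Timeout error

Partial error output: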

Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2
[rank1]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 1] [PG 0 Rank 1] ProcessGroupNCCL’s watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives


Solution:

In the code, change:

dist.init_process_group(backend='nccl', init_method='env://', rank=self.opt.local_rank, world_size=self.opt.world_size)

to (this needs `from datetime import timedelta`):

dist.init_process_group(backend='nccl', init_method='env://', rank=self.opt.local_rank, world_size=self.opt.world_size, timeout=timedelta(seconds=3600))
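The same change can be sketched as a small standalone helper. This is only an illustration under assumptions: it supposes the job is launched with torchrun, so that RANK and WORLD_SIZE are present in the environment, and the names build_process_group_kwargs and init_distributed are hypothetical, not part of the original code. Keep in mind that a longer timeout only gives slow steps more room; a rank that is genuinely stuck will still fail, just later.

```python
import os
from datetime import timedelta


def build_process_group_kwargs(timeout_seconds=3600):
    """Collect keyword arguments for torch.distributed.init_process_group.

    RANK and WORLD_SIZE are assumed to be set by the launcher (e.g. torchrun);
    the fallbacks describe a single-process run.
    """
    return dict(
        backend="nccl",
        init_method="env://",
        rank=int(os.environ.get("RANK", 0)),
        world_size=int(os.environ.get("WORLD_SIZE", 1)),
        # Collectives that take longer than this are treated as failed.
        timeout=timedelta(seconds=timeout_seconds),
    )


def init_distributed(timeout_seconds=3600):
    """Initialize the default process group with the extended timeout."""
    # Imported here so the helper above can be inspected without torch installed.
    import torch.distributed as dist

    dist.init_process_group(**build_process_group_kwargs(timeout_seconds))
```

Calling init_distributed() once at the start of each worker process would replace the init_process_group call shown above.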

Parameters that receive no gradient

(Training works without DDP; the error only appears with DDP.)

Error output:

If you already have done the above, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameter indices which did not receive grad for rank 0: 60 6

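Solution:

Method 1: find the layers that take no part in the gradient computation and serve no purpose, and delete them.

Note: nn.MultiheadAttention may be incompatible with DDP. If the model actually needs that layer, method 1 is not an option; use method 2 instead.
Reference: https://github.com/pytorch/pytorch/issues/26698

Method 2: pass find_unused_parameters=True to DistributedDataParallel.

Note that find_unused_parameters=True can hurt performance. PyTorch says so itself in the warning it emits when the flag is set but turns out not to be needed:

find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters.

For example, enable the flag only when the model contains an nn.MultiheadAttention layer:

  DistributedDataParallel(self.model, device_ids=[opt.local_rank], output_device=opt.local_rank, find_unused_parameters=any(isinstance(layer, nn.MultiheadAttention) for layer in self.model.modules()))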
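For method 1 it helps to know the names of the unused parameters, not just their indices. PyTorch can print them when the environment variable TORCH_DISTRIBUTED_DEBUG is set to INFO or DETAIL. As an alternative, here is a minimal sketch that inspects the plain (non-DDP) model after one backward pass; the function name params_without_grad is hypothetical, and it relies only on named_parameters(), requires_grad and grad.

```python
def params_without_grad(model):
    """Return names of trainable parameters whose .grad is still None.

    Intended to be called right after loss.backward() on the plain
    (non-DDP) model, before the gradients are cleared again.
    """
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


# Hypothetical usage inside a single-process debug run:
#     loss = loss_fn(model(inputs), targets)
#     loss.backward()
#     print(params_without_grad(model))
```

Any name in the returned list belongs to a parameter that took no part in computing the loss for that batch, which makes it a candidate for removal or a reason to keep find_unused_parameters=True.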

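DDP training: the first batch produces results, the second hangs with GPU utilization at 100%

Solution: enable the anomaly-detection debugging mode described in the AMP section and analyze the error it reports. Part of what surfaced was AMP-related; the rest was the DDP configuration error shown here.

Error output: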

[rank0]: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

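Solution:

Add broadcast_buffers=False. A likely reason this helps: with the default broadcast_buffers=True, DDP synchronizes module buffers (for example BatchNorm running statistics) in place at the start of each forward pass, which can invalidate tensors that autograd saved from an earlier forward call. The trade-off is that buffers are then no longer synchronized across ranks.

DDP(
    self.models[k],
    device_ids=[opt.local_rank],
    output_device=opt.local_rank,
    broadcast_buffers=False,
    # nn.MultiheadAttention incompatibility with DDP https://github.com/pytorch/pytorch/issues/26698
    find_unused_parameters=any(isinstance(layer, nn.MultiheadAttention) for layer in self.models[k].modules()),
)

Reference: https://blog.csdn.net/qq_39237205/article/details/125728708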

Not enough user threads

Error output:

BlockingIOError: [Errno 11] Resource temporarily unavailable

Solution:

# Show the maximum number of processes/threads allowed for the current user
ulimit -u
# Count the threads the current user already has
ps -xH | wc -l
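The same limit can be read from inside the training script, which is handy for logging it before the data loaders start. A minimal sketch, assuming a Linux host (the resource module is Unix-only); the function name nproc_limits is hypothetical.

```python
import resource


def nproc_limits():
    """Return the (soft, hard) RLIMIT_NPROC values for this process.

    This is the limit that `ulimit -u` reports; on Linux it counts
    threads as well as processes. A value equal to
    resource.RLIM_INFINITY means unlimited.
    """
    return resource.getrlimit(resource.RLIMIT_NPROC)


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print(f"nproc soft limit: {soft}, hard limit: {hard}")
```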

Method 1: raise the thread limit for the user

The account has run out of threads, so raise its limit:

vim /etc/security/limits.d/20-nproc.conf
*          soft    nproc     40960
root       soft    nproc     unlimited

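Method 2: kill threads that are no longer needed

# List the current user's threads
ps -xH
# Kill the unneeded ones, e.g. when Service.py holds too many threads and is no longer needed
ps -xH | grep Service.py | awk '{print $1}' | xargs kill -SIGKILL
# Variant that de-duplicates the PIDs first (ps -xH prints one line per thread):
# ps -xH | grep Service.py | awk '{print $1}' | sort -u | xargs kill -SIGKILL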

AMP training

The training loss does not fall as expected, and after a few epochs the loss is NaN everywhere

Unscale the gradients before calling torch.nn.utils.clip_grad_norm_:

import torch
from torch import autocast
from torch.cuda.amp import GradScaler

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
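Why the order matters can be seen with plain arithmetic, no GPU needed. The sketch below is an illustration only: clip_by_norm is a hypothetical pure-Python stand-in for gradient-norm clipping, the gradient values and max_norm are made up, and 65536 is used as the scale because it is GradScaler's default initial scale.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale a list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    coef = min(1.0, max_norm / (norm + 1e-6))
    return [g * coef for g in grads]


true_grads = [3.0, 4.0]  # L2 norm 5.0
scale = 65536.0          # loss scale applied by the scaler
max_norm = 10.0          # threshold that should leave these gradients untouched

# What backward() produces under loss scaling: gradients multiplied by the scale.
scaled = [g * scale for g in true_grads]

# Wrong order: clip the still-scaled gradients, then unscale.
wrong = [g / scale for g in clip_by_norm(scaled, max_norm)]

# Right order: unscale first (the job of scaler.unscale_), then clip.
right = clip_by_norm([g / scale for g in scaled], max_norm)

print("clip then unscale:", wrong)
print("unscale then clip:", right)
```

In the wrong order the clipping threshold is compared against the scaled norm, so the gradients are shrunk by several orders of magnitude even though their true norm is below max_norm. In the right order they pass through unchanged.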

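Turn on debugging mode to locate the suspect code

# Before the forward pass
torch.autograd.set_detect_anomaly(True)
# Backward pass
with torch.autograd.detect_anomaly():
    # losses["loss"].backward()
    self.scaler.scale(losses["loss"]).backward()

Debugging surfaced the following error:
"BmmBackward0" returned nan values in its 0th output
The offending line was M = torch.matmul(T, R)

The cause: multiplying the camera intrinsic matrices together runs into numerical problems in fp16.

The fix is to wrap the operations that need fp32 precision in:

 with torch.autocast(device_type="cuda", dtype=torch.float32):
     need_fp32_code

Note that the PyTorch AMP documentation describes a different way to get a float32 region: with torch.autocast(device_type="cuda", enabled=False), casting any fp16 inputs with .float() before using them. If the form above emits a warning on your version, switch to that one.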
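One way such a product can go wrong in fp16 is overflow: the largest finite half-precision value is 65504, and an overflow to inf can turn into NaN in later arithmetic. Whether this is exactly what happened in the matmul above is an assumption; the sketch only shows how quickly pixel-unit intrinsics leave the fp16 range. It uses the standard library's half-precision packing, and both fits_in_fp16 and the focal length of 1000 are hypothetical.

```python
import struct


def fits_in_fp16(value):
    """Return True if value can be packed as an IEEE 754 half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        # struct refuses finite values that are too large for binary16.
        return False


focal_length = 1000.0  # assumed pixel-unit focal length, for illustration only

# A single intrinsic entry, then a product of two such entries.
print(fits_in_fp16(focal_length))
print(fits_in_fp16(focal_length * focal_length))
```

Keeping such products in float32 avoids the problem at the cost of running that small region at full precision.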
