1--During distributed parallel training, the following bug appeared:
[E ProcessGroupNCCL.cpp:719] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1721483, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805695 milliseconds before timing out.
RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1721483, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805695 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
Likely cause:
This is a timeout error. One likely cause is that the CPU is overloaded (the server does not have enough CPU resources), so data loading stalls for a long time; the other ranks then sit waiting at the ALLREDUCE collective until the NCCL timeout expires.
2--Possible fixes:
1. Avoid the timeout in the first place:
For example, reduce the number of data-loading worker processes (lower num_workers) so that an over-subscribed CPU does not stall data loading and trigger the timeout; see the first sketch after this list.
2. Extend the timeout:
Raise it from the default of 30 min to a larger value, e.g. torch.distributed.init_process_group(backend='nccl', init_method='env://', timeout=datetime.timedelta(seconds=5400)) (this requires import datetime); see the second sketch after this list.
3. More options: https://github.com/huggingface/accelerate/issues/314
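
A minimal sketch of fix 1, assuming a typical one-process-per-GPU training script; the dataset, batch size, and worker count below are placeholders to adapt to your own setup:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; substitute your real training dataset.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

# Each rank spawns num_workers loader processes, so with N GPUs per node the
# node runs N * num_workers workers in total. A value that looks small per
# rank can still oversubscribe the CPU; lowering it (even to 2 or 0) is often
# enough to keep data loading from stalling the next collective.
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=2,   # reduced from a larger value such as 8
    pin_memory=True,
)

for inputs, labels in loader:
    pass  # training step goes here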
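
A minimal sketch of fix 2, expanding the one-liner from item 2 into a runnable init block; the 5400-second (90 min) value comes from that snippet, and the env:// method plus LOCAL_RANK assume the script is launched with torchrun:

import datetime
import os

import torch
import torch.distributed as dist

def init_distributed(timeout_seconds: int = 5400) -> None:
    # Initialize the NCCL process group with a timeout longer than the
    # default 30 minutes (here 90 minutes).
    dist.init_process_group(
        backend='nccl',
        init_method='env://',  # reads MASTER_ADDR/MASTER_PORT, RANK, WORLD_SIZE
        timeout=datetime.timedelta(seconds=timeout_seconds),
    )
    torch.cuda.set_device(int(os.environ['LOCAL_RANK']))

if __name__ == '__main__':
    # Launch e.g. with: torchrun --nproc_per_node=4 this_script.py
    init_distributed()

Note that for the NCCL backend this timeout argument only takes effect when blocking wait or async error handling is enabled (the NCCL_BLOCKING_WAIT / NCCL_ASYNC_ERROR_HANDLING environment variables); the watchdog message in the log above suggests async error handling is already active in this setup.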