Notes on using DeepSpeed

1. Hard-won advice on reducing GPU memory usage

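Medical images are often very large, which can make training a model difficult, but I have now found quite a few ways to reduce GPU memory usage.
I don't know exactly why, but using the Trainer from transformers really does reduce GPU memory usage, even when DeepSpeed is not enabled.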

Don't reinvent the wheel

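I have used LoRA before and have also written my own version. I strongly recommend not writing LoRA yourself: designing it takes a lot of time, verifying that the LoRA-adapted model actually works takes a lot of time, and merging the weights takes a lot of time. Use an existing, already-written LoRA implementation wherever possible.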

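I recommend using transformers to tie the model and the training data together: you only need to write a dataset and a collate_fn, plus at most an override of the Trainer's compute_loss, and the rest falls into place. In my experience this is the most efficient and effective route.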
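To make the "dataset plus collate_fn" idea concrete, here is a minimal sketch. It uses plain Python lists so that it runs without PyTorch; the class name ToyReportDataset, the toy token ids, and the pad_id default are all hypothetical and not taken from the original project. In real training the collate function would typically return torch tensors.

```python
class ToyReportDataset:
    """Map-style dataset: it only needs __len__ and __getitem__."""

    def __init__(self, samples):
        # samples: list of token-id lists (hypothetical toy data)
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return {"input_ids": self.samples[idx]}


def collate_fn(batch, pad_id=0):
    """Pad every sequence to the longest one in the batch and build a mask."""
    max_len = max(len(item["input_ids"]) for item in batch)
    input_ids, attention_mask = [], []
    for item in batch:
        ids = item["input_ids"]
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    # In real training, wrap both lists in torch.tensor(...) before returning.
    return {"input_ids": input_ids, "attention_mask": attention_mask}


dataset = ToyReportDataset([[5, 6, 7], [8, 9]])
batch = collate_fn([dataset[i] for i in range(len(dataset))])
print(batch)
```

The same two pieces are what get passed to the Trainer as train_dataset and data_collator; a custom loss would go into a compute_loss override on a Trainer subclass.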

2. DeepSpeed is quick and convenient

The workflow with DeepSpeed is the shortest one.

2.1 If you see warnings, load the required modules

module avail
module load compiler/gcc/7.3.1
module load cuda/7/11.8

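DeepSpeed compiles some of its ops (C++/CUDA extensions such as cpu_adam) on your machine at run time. The compiled code may run on the CPU, but it still has to match the CUDA toolchain and interoperate with the GPU, so both the GCC version and the CUDA version must meet the version requirements and be compatible with the installed PyTorch build.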
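A quick way to see which toolchain versions are actually active after loading the modules is sketched below. The helper name tool_version is hypothetical; the script only shells out to whatever gcc and nvcc are on PATH and, if PyTorch is installed, prints the CUDA version it was built for. DeepSpeed also ships a ds_report command that prints a compatibility summary of its ops.

```python
import shutil
import subprocess


def tool_version(cmd, flag="--version"):
    """Return the output of `cmd --version`, or None if cmd is not on PATH."""
    if shutil.which(cmd) is None:
        return None
    result = subprocess.run([cmd, flag], capture_output=True, text=True)
    output = (result.stdout or result.stderr).strip()
    return output or None


for cmd in ("gcc", "nvcc"):
    print(f"--- {cmd} ---")
    print(tool_version(cmd))

try:
    import torch  # optional: only used to report the CUDA version torch was built with

    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
except ImportError:
    print("torch is not installed in this environment")
```

If the CUDA version reported by nvcc differs from the one PyTorch was built for, that mismatch is a likely source of the compile warnings.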

2.2 Write the Trainer Python file

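I recommend using the Trainer class from transformers: many fields in the JSON config can then simply be set to "auto", and pointing to the JSON config file is easy.
Note that the script may also need to accept a --local_rank command-line argument, which tells each process which local device it is responsible for.
Specify the ds_config.json file in TrainingArguments.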

import argparse

from transformers import TrainingArguments

# Tokenizer, R2GenCMN, Dataset, MyTrainer and collate_fn are project-specific;
# import them from your own code.


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1,
                        help="Local rank. Necessary for using the torch.distributed.launch utility.")
    args = parser.parse_args()  # the original returned `args` without ever parsing
    return args


args = parse_args()

training_args = TrainingArguments(
    output_dir='./checkpoint/Eff_R2GenCMN_base',
    num_train_epochs=1000,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./checkpoint/Eff_R2GenCMN_base/output_logs',
    logging_steps=10,
    save_strategy='steps',  # save a checkpoint every fixed number of steps
    save_steps=100,  # save the model every 100 steps
    save_total_limit=5,  # keep at most 5 checkpoints
    report_to="none",
    fp16=True,  # enable mixed-precision training
    deepspeed='./ds_config.json',
)

tokenizer = Tokenizer()
model = R2GenCMN(args, tokenizer)
dataset_train = Dataset(xlsx_file="./dataset/train_dataset.xlsx")
dataset_test = Dataset(xlsx_file="./dataset/test_dataset.xlsx")

trainer = MyTrainer(
    model=model,  # the model to train
    args=training_args,  # training arguments
    train_dataset=dataset_train,  # training dataset
    eval_dataset=dataset_test,  # evaluation dataset
    data_collator=collate_fn,
    # you may also need to define a compute_metrics function to compute evaluation metrics
)
trainer.train()  # start training
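The local_rank handling above can be made a little more robust. As far as I know, the deepspeed launcher both passes --local_rank on the command line and sets a LOCAL_RANK environment variable, so a script can accept either. The sketch below is one way to do that; the function name resolve_local_rank is hypothetical, and parse_known_args is used so that unrelated extra arguments do not cause an error.

```python
import argparse
import os


def resolve_local_rank(argv=None, env=None):
    """Return the local rank from --local_rank, else LOCAL_RANK, else -1."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    # parse_known_args ignores arguments this parser does not define.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank != -1:
        return args.local_rank
    return int(env.get("LOCAL_RANK", -1))


# Explicit argv/env are passed here only to make the demo deterministic.
print(resolve_local_rank(["--local_rank", "0"], {}))
print(resolve_local_rank([], {"LOCAL_RANK": "3"}))
print(resolve_local_rank(["--some_other_flag", "x"], {}))
```

A value of -1 conventionally means "not running under a distributed launcher".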

2.3 Write the ds_config file

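The point of a ds_config file is to keep the Python file concise; it also makes parameters easy to change and saves you from having to remember them all.
A ds_config.json file is usually reusable across projects. With the transformers Trainer integration, fields set to "auto" (including the batch size) are filled in from the corresponding TrainingArguments values, so the batch size comes from per_device_train_batch_size rather than being picked based on the GPU.
The config below uses ZeRO stage 2: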

{
  "bfloat16": {
    "enabled": false
  },
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "steps_per_print": 1e5
}
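It helps to know how the batch-size fields in this config relate to each other. DeepSpeed expects train_batch_size to equal train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs, and with the Trainer integration the "auto" entries are filled in from TrainingArguments. A small sketch of that arithmetic follows; the function name is hypothetical, and the first call assumes the values from the script in 2.2 (per_device_train_batch_size=10, the default gradient accumulation of 1, one GPU).

```python
def effective_train_batch_size(micro_batch_per_gpu, grad_accum_steps, world_size):
    """Global batch size consumed per optimizer step."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


# Values from the training script in section 2.2, on a single GPU.
print(effective_train_batch_size(micro_batch_per_gpu=10, grad_accum_steps=1, world_size=1))

# A hypothetical larger run: 8 GPUs with 4 accumulation steps.
print(effective_train_batch_size(micro_batch_per_gpu=10, grad_accum_steps=4, world_size=8))
```

If you hard-code these fields instead of using "auto", the three values have to satisfy this relation or DeepSpeed will refuse the config.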

Or use stage 3:

{
  "bfloat16": {
    "enabled": false
  },
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_fp16_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 1e5,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
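To get a feel for what the ZeRO stage changes, the sketch below estimates per-GPU memory for model states only (fp16 weights, fp16 gradients, fp32 Adam states), using the usual 2 + 2 + 12 bytes-per-parameter accounting for mixed-precision Adam. It is a rough estimate under stated assumptions: it ignores activations, buffers, and CPU offload; the world size of 4 is hypothetical; and the parameter count is taken from the "sizes[(42770360, False)]" line of the training log in 2.4, which I am assuming corresponds to the number of trainable parameters.

```python
def zero_model_state_bytes(num_params, world_size, stage):
    """Approximate per-GPU bytes for params, grads and Adam states under ZeRO."""
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 weights, momentum, variance
    if stage >= 1:            # stage 1 partitions optimizer states
        optim /= world_size
    if stage >= 2:            # stage 2 also partitions gradients
        grads /= world_size
    if stage >= 3:            # stage 3 also partitions the weights
        params /= world_size
    return params + grads + optim


NUM_PARAMS = 42_770_360  # from the log; assumed to be the parameter count
WORLD_SIZE = 4           # hypothetical number of GPUs

for stage in range(4):
    gib = zero_model_state_bytes(NUM_PARAMS, WORLD_SIZE, stage) / 2**30
    print(f"stage {stage}: {gib:.3f} GiB of model states per GPU")
```

On a single GPU the partitioning itself saves nothing, which is why the offload_optimizer and offload_param settings in the configs above matter more in that case.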

2.4 Run the program

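Finally, just launch the script with deepspeed.
The warnings in the output do not actually affect training; they come from the extension being re-compiled.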

deepspeed train.py
[2024-04-02 12:04:43,112] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-02 12:05:48,493] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-04-02 12:05:48,493] [INFO] [runner.py:555:main] cmd = /public/home/v-yumy/anaconda3/envs/llava2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None transformer_train.py
[2024-04-02 12:05:51,627] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-02 12:05:55,944] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-04-02 12:05:55,944] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-04-02 12:05:55,944] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-04-02 12:05:55,944] [INFO] [launch.py:163:main] dist_world_size=1
[2024-04-02 12:05:55,944] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-02 12:06:29,136] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-02 12:06:31,519] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-02 12:06:31,519] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-04-02 12:06:31,519] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.742 seconds.
Prefix dict has been built successfully.
EfficientNet: replace first conv
The Transformer of EncoderDecoder is base
EncoderDecoder is base
Visual features: no pre-training
[WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /public/home/v-yumy/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /public/home/v-yumy/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7046074867248535 seconds
Rank: 0 partition count [1] and sizes[(42770360, False)] 
{'loss': 6.7285, 'learning_rate': 1.6730270909663467e-05, 'epoch': 0.02}                                                                                                                                   
{'loss': 6.0535, 'learning_rate': 2.3254658315702903e-05, 'epoch': 0.05}                                                                                                                                   
{'loss': 5.598, 'learning_rate': 2.6809450068309278e-05, 'epoch': 0.07}                                                                                                                                    
{'loss': 5.2824, 'learning_rate': 2.9266416338062584e-05, 'epoch': 0.1}                                                                                                                                    
{'loss': 5.0738, 'learning_rate': 3.114597855245884e-05, 'epoch': 0.12}                                                                                                                                    
{'loss': 4.8191, 'learning_rate': 3.266853634404809e-05, 'epoch': 0.15}                                                                                                                                    
{'loss': 4.5336, 'learning_rate': 3.3948300828875964e-05, 'epoch': 0.17}
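The learning rates in this log can be checked against the WarmupLR scheduler from the config. To my understanding, DeepSpeed's WarmupLR uses a logarithmic warmup by default, lr = min_lr + (max_lr - min_lr) * log(steps) / log(warmup_num_steps). The sketch below assumes warmup_num_steps=500 (from warmup_steps in 2.2), min_lr=0, and max_lr=5e-5 (the TrainingArguments default, since the script does not set learning_rate). The function name warmup_lr is hypothetical.

```python
import math


def warmup_lr(scheduler_steps, warmup_num_steps, max_lr, min_lr=0.0):
    """Logarithmic warmup: lr rises with log(steps) until warmup_num_steps."""
    if scheduler_steps >= warmup_num_steps:
        return max_lr
    if scheduler_steps < 1:
        return min_lr
    gamma = math.log(scheduler_steps) / math.log(warmup_num_steps)
    return min_lr + (max_lr - min_lr) * gamma


# Learning rates logged at training steps 10, 20 and 30.
logged = {
    8: 1.6730270909663467e-05,
    18: 2.3254658315702903e-05,
    28: 2.6809450068309278e-05,
}
for steps, lr_from_log in logged.items():
    print(steps, warmup_lr(steps, 500, 5e-5), lr_from_log)
```

The logged values line up with 8, 18 and 28 scheduler steps rather than 10, 20 and 30. That would be consistent with the first two optimizer steps being skipped while the fp16 dynamic loss scale settles, but this is an inference from the numbers, not something the log states.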