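[From Scratch] 11. Fine-Tuning Qwen with LLaMA-Factory (Side Story)

Picking up where the last post left off: with the RAGChecker tests done, one step remains before the RAG application can actually ship, namely instruction fine-tuning on basic information. The model needs a certain degree of "self-awareness", so the company information has to be "embedded" into the model itself. For this job I chose LLaMA-Factory (hereafter "lf").

I picked lf simply because it is easy to use and it works…

LLaMA-Factory deployment

This time we deploy lf from source.

First, clone the code from GitHub, as shown below: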

(base) pai@pai:~/llm/nlp$ git clone https://github.com/hiyouga/LLaMA-Factory.git

Why deploy from source? The reason is explained later.

Creating a virtual environment and installing dependencies

(base) pai@pai:~/llm/nlp$ cd LLaMA-Factory
(base) pai@pai:~/llm/nlp/LLaMA-Factory$ conda create -n lf  python=3.11
Channels:
 - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
 - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
 - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/pai/anaconda3/envs/lf

  added / updated specs:
    - python=3.11
...

After creating the environment with conda, activate it and, from the source directory, install the dependencies with pip install -r requirements.txt, as shown below:

(base) pai@pai:~/llm/nlp/LLaMA-Factory$ conda activate lf
(lf) pai@pai:~/llm/nlp/LLaMA-Factory$ pip install -r requirements.txt 
...

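Installing components

According to the official documentation, lf lets you pick which optional components to install; for a basic setup, torch and metrics are enough. But since we are installing anyway, a little extra does not hurt, so let's just install everything that can be installed, as shown below: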

(lf) pai@pai:~/llm/nlp/LLaMA-Factory$ pip install -e ".[torch,metrics,deepspeed,liger-kernel,bitsandbytes,hqq,gptq,awq,aqlm,vllm,galore,badam,adam-mini,qwen,modelscope,quality]"
...

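During installation you may run into a pile of problems such as "cannot connect" or "missing dependency". In that case, just install offline (an earlier post already covered offline installation, so I won't repeat it here).

Starting the webui

Once the components are installed, you can try to start the webui. The first launch usually comes with some errors and warnings, as shown below: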

(lf) pai@pai:~/llm/nlp/LLaMA-Factory$ CUDA_VISIBLE_DEVICES=0 python src/webui.py
[2024-11-01 04:01:01,361] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /home/pai/.triton/autotune: No such file or directory
[WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
[WARNING]  async_io: please install the libaio-dev package with apt
[WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/home/pai/anaconda3/envs/lf/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @autocast_custom_fwd
/home/pai/anaconda3/envs/lf/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @autocast_custom_bwd
/home/pai/anaconda3/envs/lf/lib/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
Running on local URL:  http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.

The warning above appears because the "vllm._version" module cannot be found. Updating the vllm dependency takes care of it, as shown below:

(lf) pai@pai:~/llm/nlp/LLaMA-Factory$ pip install --upgrade vllm
Looking in indexes: https://mirrors.aliyun.com/pypi/simple
Requirement already satisfied: vllm in /home/pai/anaconda3/envs/lf/lib/python3.11/site-packages (0.6.3)
Collecting vllm
  Downloading https://mirrors.aliyun.com/pypi/packages/4a/4c/ee65ba33467a4c0de350ce29fbae39b9d0e7fcd887cc756fa993654d1228/vllm-0.6.3.post1-cp38-abi3-manylinux1_x86_64.whl (194.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.8/194.8 MB 1.4 MB/s eta 0:00:00
...

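Updating the dependency to the latest version usually solves the problem. Now try starting the webui again.

Oh! A different message shows up this time, as shown below: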

(lf) pai@pai:~/llm/nlp/LLaMA-Factory$ llamafactory-cli webui
[2024-11-01 05:46:26,356] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
[WARNING]  async_io: please install the libaio-dev package with apt
[WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/home/pai/anaconda3/envs/lf/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @autocast_custom_fwd
/home/pai/anaconda3/envs/lf/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @autocast_custom_bwd
Running on local URL:  http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.

This time the messages are deepspeed warnings, so by the same logic let's update deepspeed.

(lf) pai@pai:~/llm/nlp/LLaMA-Factory$ pip install --upgrade deepspeed
...

After the update, startup no longer throws any other exception messages, as shown below:

(lf) pai@pai:~/llm/nlp/LLaMA-Factory$ llamafactory-cli webui
[2024-11-01 06:46:43,215] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Running on local URL:  http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.

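Uploading custom data

With lf confirmed to start normally, we still need to upload our own data before the actual instruction fine-tuning.

First, create your own dataset and save it as a JSON file. Mine is in alpaca format, structured as follows:

[
  {
    "instruction": "<<question goes here>>",
    "input": "",
    "output": "<<answer goes here>>"
  },
  ...
]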
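If the question/answer pairs start out in a spreadsheet or a script, a file in this shape can be generated rather than typed by hand. The following is only a minimal sketch: the helper name write_alpaca_dataset, the sample pairs, and the temporary output location are all hypothetical, and only the three keys come from the format above.

```python
import json
import tempfile
from pathlib import Path


def write_alpaca_dataset(pairs, path):
    """Write (question, answer) pairs as an alpaca-style JSON list."""
    records = [
        {"instruction": question, "input": "", "output": answer}
        for question, answer in pairs
    ]
    # ensure_ascii=False keeps Chinese text readable in the file.
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return records


# Hypothetical sample content; replace with real company Q&A pairs.
sample_pairs = [
    ("Which company built you?", "I was fine-tuned by ExampleCorp."),
    ("What can you help with?", "I answer questions about ExampleCorp products."),
]

# Written to the temp directory here; in practice the file goes into data/.
out_path = Path(tempfile.gettempdir()) / "enterprise_tuning.json"
records = write_alpaca_dataset(sample_pairs, out_path)
print(f"wrote {len(records)} records to {out_path}")
```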

Once the dataset is organized, upload it to the project's data folder, as shown in the figure:

Then edit the dataset_info.json file in the data folder, as shown below:

{
  "enterprise": {
    "file_name": "enterprise_tuning.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output",
      "history": "history"
    }
  },
  ...
}

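This configuration mainly maps the columns and registers the dataset under the name "enterprise", which is the name the training configuration refers to later.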
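Before training, it can be worth checking that every column named in the mapping actually exists in the data file. The helper below is a hypothetical sanity check, not part of LLaMA-Factory; it only compares the mapping's values against the keys of each record. A mismatch of this kind would be consistent with the KeyError: 'history' that shows up in the error-handling section further down, although that is an inference from the traceback rather than something confirmed here.

```python
def missing_columns(records, columns):
    """Return mapped column names that at least one record lacks.

    `columns` is the "columns" mapping of a dataset_info.json entry and
    `records` is the list loaded from the corresponding data file.
    """
    missing = set()
    for record in records:
        for column_name in columns.values():
            if column_name not in record:
                missing.add(column_name)
    return sorted(missing)


# The mapping from the entry above, and one record in the alpaca shape.
columns = {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "history": "history",
}
records = [{"instruction": "question", "input": "", "output": "answer"}]

print(missing_columns(records, columns))  # ['history']
```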

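Instruction fine-tuning

With the fine-tuning dataset ready, the next step is the fine-tuning configuration. First, create a LoRA fine-tuning configuration file named qwen2_5_lora_sft.yaml under examples/train_lora/, as shown below: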

### model
model_name_or_path: /home/pai/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/bb46c15ee4bb56c5b63245ef50fd7637234d6f75

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: identity,enterprise
template: qwen
cutoff_len: 2048
max_samples: 4000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /home/pai/llm/nlp/LLaMA-Factory/saves/Qwen2.5-7B/lora
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
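To get a feel for what the batch settings above imply: the effective batch size is per_device_train_batch_size times gradient_accumulation_steps times the number of GPUs. The sketch below is just that arithmetic. The sample count is a hypothetical round number, a single GPU is assumed, and the exact rounding the trainer applies may differ between versions, so treat the step counts as estimates.

```python
import math


def estimate_schedule(num_train_samples, per_device_batch, grad_accum,
                      num_epochs, num_devices=1):
    """Rough estimate of effective batch size and optimizer steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_train_samples / effective_batch)
    total_steps = math.ceil(steps_per_epoch * num_epochs)
    return effective_batch, steps_per_epoch, total_steps


# 3200 training samples is a hypothetical round number for illustration.
batch, per_epoch, total = estimate_schedule(
    num_train_samples=3200, per_device_batch=8, grad_accum=2, num_epochs=3.0
)
print(batch, per_epoch, total)  # 16 200 600
```

With these settings each optimizer update sees 16 samples; doubling gradient_accumulation_steps would halve the number of updates per epoch.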

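Thanks to CSDN@路人与大师 for the configuration (the article explains what each parameter means in detail; I learned a lot, thanks for sharing). Original article: llama factory lora 微调 qwen2.5 7B Instruct模型_qwen2.5 lora微调-CSDN博客

The configuration for the first run is that article's configuration with a few adjustments. Next, run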

(lf) pai@pai:~/llm/nlp/LLaMA-Factory$ llamafactory-cli train examples/train_lora/qwen2_5_lora_sft.yaml

to start training.

Handling fine-tuning errors

Shortly after starting, while the dataset was being read, the following error was thrown:

Converting format of dataset (num_proc=16):   0%|          | 0/3601 [00:00<?, ? examples/s]
multiprocess.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/pai/anaconda3/envs/lf/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/pai/anaconda3/envs/lf/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/home/pai/anaconda3/envs/lf/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3528, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pai/anaconda3/envs/lf/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3427, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pai/llm/nlp/LLaMA-Factory/src/llamafactory/data/aligner.py", line 84, in convert_alpaca
    if dataset_attr.history and isinstance(example[dataset_attr.history], list):
                                           ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pai/anaconda3/envs/lf/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 277, in __getitem__
    value = self.data[key]
            ~~~~~~~~~^^^^^
KeyError: 'history'

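I verified that this error occurs not only with the source deployment but also in the docker environment (perhaps my configuration was wrong; in theory this should not happen under docker). After searching around, I finally found a solution on GitHub: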

数据集history 缺省值 · Issue #2490 · hiyouga/LLaMA-Factory

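The fix requires modifying the source code, as shown in the figure:

BTW, after modifying the code, delete the corresponding .pyc files in the project, then launch directly with python src/train.py, as shown below: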

(lf) pai@pai:~/llm/nlp/LLaMA-Factory$ CUDA_VISIBLE_DEVICES=0 python src/train.py examples/train_lora/qwen2_5_lora_sft.yaml
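Hunting down stale .pyc files by hand is tedious, so here is a small sketch that removes them recursively. The function name remove_pyc_files is hypothetical, and the demonstration runs in a throwaway directory; in the checkout you would pass whichever directory you edited, for example src (an assumption about where the change was made).

```python
import tempfile
from pathlib import Path


def remove_pyc_files(root):
    """Delete every *.pyc file under `root` and return the removed paths."""
    removed = []
    # Materialize the listing first so files are not deleted mid-iteration.
    for pyc_path in list(Path(root).rglob("*.pyc")):
        pyc_path.unlink()
        removed.append(pyc_path)
    return removed


# Demonstration on a temporary directory with one cached file and one source file.
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "pkg" / "__pycache__"
    cache_dir.mkdir(parents=True)
    (cache_dir / "aligner.cpython-311.pyc").write_bytes(b"")
    (Path(tmp) / "pkg" / "aligner.py").write_text("# source stays\n")
    removed = remove_pyc_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).rglob("*") if p.is_file())

print(len(removed), remaining)  # 1 ['aligner.py']
```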

Once fine-tuning starts, you may also hit the error AttributeError: 'AdamW' object has no attribute 'train'. In that case, try downgrading transformers.

AttributeError: ‘AdamW’ object has no attribute ‘train’ · Issue #33620 · huggingface/transformers

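As suggested in that issue, downgrading transformers to version 4.44.2 does the trick; I tested it and it works.

First fine-tuning run

OK, now let's do the first fine-tuning run. The results are as follows: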

***** train metrics *****
  epoch                    =        3.0
  total_flos               = 60516616GF
  train_loss               =     0.4413
  train_runtime            = 0:17:44.80
  train_samples_per_second =      9.506
  train_steps_per_second   =      0.594
...
***** eval metrics *****
  epoch                   =        3.0
  eval_loss               =      0.368
  eval_runtime            = 0:00:18.79
  eval_samples_per_second =     19.955
  eval_steps_per_second   =     19.955

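Hmm… this result is unremarkable. The validation loss is lower than the training loss, which suggests the model is not overfitting, but there is still room for improvement.

Second fine-tuning run

Based on the first run, I adjusted some parameters in the configuration file: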

### output
save_steps: 200

### train
gradient_accumulation_steps: 4
learning_rate: 3.0e-5
num_train_epochs: 4.0
warmup_ratio: 0.03

### eval
eval_steps: 200

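learning_rate = 1e-4 may be on the high side, so try lowering it to 3.0e-5 first;
use a gentler warmup_ratio = 0.03;
num_train_epochs was 3.0 in the first run, so increase it to 4;
set both eval_steps and save_steps to 200 to monitor training more closely;
try raising gradient_accumulation_steps to 4.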
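To see how learning_rate and warmup_ratio interact under lr_scheduler_type: cosine, here is a minimal sketch of a linear-warmup-then-cosine-decay curve. It is a standalone reimplementation for illustration, not the trainer's own code; the total step count of 400 is hypothetical, and deriving the warmup length as ceil(total_steps * warmup_ratio) is an assumption about how the ratio is applied.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 400  # hypothetical step count, for illustration only
for label, base_lr, ratio in [("run 1", 1.0e-4, 0.1), ("run 2", 3.0e-5, 0.03)]:
    peak_step = math.ceil(total_steps * ratio)
    mid_lr = cosine_lr_with_warmup(total_steps // 2, total_steps, base_lr, ratio)
    print(f"{label}: peak at step {peak_step}, lr at midpoint {mid_lr:.2e}")
```

A smaller warmup_ratio reaches the peak learning rate sooner, but with a lower peak the learning rate stays smaller at every point of the schedule.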

The final result:

***** train metrics *****
  epoch                    =      3.981
  total_flos               = 80078311GF
  train_loss               =       0.68
  train_runtime            = 0:23:35.31
  train_samples_per_second =      9.536
  train_steps_per_second   =      0.297
...
***** eval metrics *****
  epoch                   =      3.981
  eval_loss               =     0.4965
  eval_runtime            = 0:00:19.50
  eval_samples_per_second =     19.227
  eval_steps_per_second   =     19.227

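Er… the loss actually went up in the second run. Both train_loss and eval_loss rose, and train_loss is still above eval_loss, so this looks less like overfitting and more like a poorly chosen learning rate: it was probably too low for the model to converge in the available steps.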
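One way to read these numbers is to estimate how many optimizer updates each run actually performed, by multiplying train_runtime by train_steps_per_second. This is a back-of-the-envelope sketch using only values printed in the two metrics blocks; the helper names are hypothetical.

```python
def runtime_to_seconds(runtime):
    """Convert an 'H:MM:SS.ss' string from the metrics block to seconds."""
    hours, minutes, seconds = runtime.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def estimated_steps(runtime, steps_per_second):
    """Approximate optimizer updates as runtime times train_steps_per_second."""
    return runtime_to_seconds(runtime) * steps_per_second


# Values copied from the train metrics of the first and second runs.
run1_steps = estimated_steps("0:17:44.80", 0.594)
run2_steps = estimated_steps("0:23:35.31", 0.297)
print(round(run1_steps), round(run2_steps))  # 632 420
```

By this estimate the second run made roughly a third fewer updates than the first despite the extra epoch, and each update used a smaller learning rate.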

Third fine-tuning run

Based on the second run, I made the following adjustments:

### train
gradient_accumulation_steps: 2
learning_rate: 7.0e-5
num_train_epochs: 2.0
warmup_ratio: 0.1
weight_decay: 0.1

### lora
lora_rank: 8
lora_alpha: 32
lora_dropout: 0.1

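added the weight_decay parameter for more regularization;
removed the ddp_timeout parameter;
changed gradient_accumulation_steps back to 2, because 4 may be too large;
set warmup_ratio back to 0.1;
reduced training to 2 epochs, because the model may currently be over-trained;
the learning rate in the second run was probably too small, making convergence slow, so it is raised to 7.0e-5 here.

The final result: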

***** train metrics *****
  epoch                    =        2.0
  total_flos               = 40136469GF
  train_loss               =     0.5282
  train_runtime            = 0:12:54.95
  train_samples_per_second =      8.708
  train_steps_per_second   =      0.545
...
***** eval metrics *****
  epoch                   =        2.0
  eval_loss               =     0.3871
  eval_runtime            = 0:00:18.87
  eval_samples_per_second =     19.872
  eval_steps_per_second   =     19.872

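Yo~ judging by the result, these adjustments worked: there is a clear improvement over the second run. The gap between train_loss and eval_loss is reasonable, with no obvious overfitting.

Fourth fine-tuning run

Good, the direction is now clear. From here on it is just a matter of continuing in that direction.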

### dataset
max_samples: 5000

### output
save_steps: 100

### train
learning_rate: 8.0e-5
num_train_epochs: 2.5
warmup_ratio: 0.15

### lora
lora_rank: 16
lora_alpha: 64

### eval
val_size: 0.15
eval_steps: 100

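raise the learning rate further to 8.0e-5;
moderately increase the number of training epochs;
increase the warmup ratio to 0.15;
increase the LoRA rank to 16 and alpha to 64;
lower eval_steps and save_steps to 100.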
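A note on the LoRA change: in the standard LoRA formulation the low-rank update is scaled by lora_alpha / lora_rank, and a rank-r adapter on a linear layer adds r * (in_features + out_features) trainable parameters. The sketch below works through both for the third and fourth runs. It assumes the default scaling rule (no rank-stabilized variant), and the 3584 x 3584 layer shape is a hypothetical example, not read from the model.

```python
def lora_scaling(lora_alpha, lora_rank):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / lora_rank


def lora_extra_params(in_features, out_features, lora_rank):
    """Trainable parameters LoRA adds to one linear layer (matrices A and B)."""
    return lora_rank * (in_features + out_features)


for rank, alpha in [(8, 32), (16, 64)]:
    # 3584 x 3584 is a hypothetical projection shape used for illustration.
    extra = lora_extra_params(3584, 3584, rank)
    print(f"rank={rank} alpha={alpha} "
          f"scaling={lora_scaling(alpha, rank)} extra_params={extra}")
```

Under these assumptions, doubling rank and alpha together keeps the scaling factor at 4.0 while doubling the adapter's capacity, so the change adds parameters without changing how strongly the update is weighted.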

The final result:

***** train metrics *****
  epoch                    =     2.4962
  total_flos               = 47735668GF
  train_loss               =     0.4542
  train_runtime            = 0:16:31.14
  train_samples_per_second =      8.036
  train_steps_per_second   =      0.502
...
***** eval metrics *****
  epoch                   =     2.4962
  eval_loss               =     0.3504
  eval_runtime            = 0:00:28.31
  eval_samples_per_second =     19.884
  eval_steps_per_second   =     19.884

eval_loss dropped further to 0.3504, the best result so far. The gap between train_loss and eval_loss is reasonable (about 0.1), which suggests no overfitting.

Conclusion

After the fourth run the tuning direction was essentially clear. I then went on adjusting other parameters, for eleven runs in total. The results:

Run 1:  train_loss=0.4413, eval_loss=0.368   (gap 0.0733)
Run 2:  train_loss=0.68,   eval_loss=0.4965  (gap 0.1835) ❌ worst
Run 3:  train_loss=0.5282, eval_loss=0.3871  (gap 0.1411)
Run 4:  train_loss=0.4542, eval_loss=0.3504  (gap 0.1038)
Run 5:  train_loss=0.4184, eval_loss=0.3423  (gap 0.0761) ✓ best eval_loss
Run 6:  train_loss=0.4413, eval_loss=0.3624  (gap 0.0789)
Run 7:  train_loss=0.3834, eval_loss=0.3602  (gap 0.0232) ✓ best train_loss
Run 8:  train_loss=0.4377, eval_loss=0.356   (gap 0.0817)
Run 9:  train_loss=0.4069, eval_loss=0.357   (gap 0.0499)
Run 10: train_loss=0.4219, eval_loss=0.3553  (gap 0.0666)
Run 11: train_loss=0.4393, eval_loss=0.3582  (gap 0.0811)

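Looking at the losses, train_loss stays roughly between 0.38 and 0.44 while eval_loss stays around 0.35 to 0.36, and the last five runs are all quite stable. The best train_loss is 0.3834 in run 7, the best eval_loss is 0.3423 in run 5, and the smallest gap between the two is 0.0232, also in run 7.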
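The comparison can be reproduced with a few lines of Python. This is a small sketch over the eleven (train_loss, eval_loss) pairs copied from the list above; nothing here is read from lf's output files.

```python
# (train_loss, eval_loss) for runs 1 to 11, copied from the list above.
results = [
    (0.4413, 0.368), (0.68, 0.4965), (0.5282, 0.3871), (0.4542, 0.3504),
    (0.4184, 0.3423), (0.4413, 0.3624), (0.3834, 0.3602), (0.4377, 0.356),
    (0.4069, 0.357), (0.4219, 0.3553), (0.4393, 0.3582),
]

runs = [
    {"run": index, "train": train, "eval": evaluation,
     "gap": round(train - evaluation, 4)}
    for index, (train, evaluation) in enumerate(results, start=1)
]

best_eval = min(runs, key=lambda r: r["eval"])
best_train = min(runs, key=lambda r: r["train"])
smallest_gap = min(runs, key=lambda r: r["gap"])

print("best eval_loss: run", best_eval["run"], best_eval["eval"])
print("best train_loss: run", best_train["run"], best_train["train"])
print("smallest gap: run", smallest_gap["run"], smallest_gap["gap"])
```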

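In terms of convergence, the gains after run 7 are already very small: the most recent adjustments brought no clear improvement, and the loss fluctuates within a fairly stable range.

So I felt there was no need to keep optimizing and simply went back to the configuration from run 7.

Oh, and finishing the fine-tuning runs does not actually finish the job: a larger part of the work lies in validating the model's output, which you can check manually according to your own business needs.

With that, the LLaMA-Factory fine-tuning is officially done.
(To be continued…)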