I. Bugs
1. OOM during pre-tokenization
Fix: add the `streaming` parameter in the yaml file. With streaming enabled the dataset is consumed lazily instead of being fully tokenized up front; since the dataset length is then unknown, `max_steps` must also be set:
# tokenize
streaming: True
max_steps: 10000
https://github.com/hiyouga/LLaMA-Factory/blob/3a023bca2a502810a436cfba7708df164754ea62/src/llamafactory/hparams/data_args.py#L39-L41
streaming: bool = field(
    default=False,
    metadata={"help": "Enable dataset streaming."},
)
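The difference between eager pre-tokenization and streaming can be sketched in plain Python (no third-party dependencies; `fake_tokenize` is a hypothetical stand-in for a real tokenizer): eager mode materializes every tokenized example at once, which is what blows up memory, while a generator yields one example at a time so peak memory stays bounded.

```python
def fake_tokenize(text):
    # Hypothetical stand-in for a real tokenizer.
    return text.split()

def pretokenize_eager(texts):
    # Eager: the whole tokenized dataset is held in memory at once (OOM risk).
    return [fake_tokenize(t) for t in texts]

def stream_tokenized(texts):
    # Streaming: only one tokenized example lives in memory at a time.
    for t in texts:
        yield fake_tokenize(t)

corpus = (f"example number {i}" for i in range(1000))
stream = stream_tokenized(corpus)
print(next(stream))  # ['example', 'number', '0']
```

This is also why `max_steps` is needed alongside `streaming`: a lazy iterator has no known length, so training cannot be scheduled by epochs and must be bounded by a step count instead.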