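[LLM Fine-Tuning Notes 6] Hands-On with the Hugging Face Transformers Library

Hands-On with the Hugging Face Transformers Library
- 1. What is the Hugging Face Transformers library?
- 2. HF Transformers core module: Pipelines
  - (1) How to download Hugging Face datasets and model weights
  - (2) **Text classification**
  - (3) Question Answering
  - (4) Computer Vision
  - (5) Calling a large language model with a Pipeline
- 3. Getting started with fine-tuning Transformers models
  - (1) Downloading the dataset
  - (2) Data preprocessing
  - (3) Data sampling
  - (4) Fine-tuning configuration
  - (5) Metric evaluation during training (Evaluate)
  - (6) Start training
  - (7) Validation
  - (8) Saving the model and training state
- 4. Transformers quantization techniques
  - (1) Introduction to quantization
  - (2) Model parameters and how to calculate GPU memory usage
  - (3) Transformers quantization with BitsAndBytes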

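1. What is the Hugging Face Transformers library?

Hugging Face Transformers is a Python library that lets users download and train machine learning (ML) models.

It was originally created for developing language models; its scope has since expanded to models for other uses, including multimodal, computer vision, and audio processing.

Installing Transformers:

First, install the torch build that matches your CUDA version: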
```
pip3 install torch torchvision torchaudio
```
Then install Transformers itself:
```
pip install transformers
```

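2. HF Transformers core module: Pipelines

Pipelines are a simple, easy-to-use way to run inference with a model.

These pipelines are objects that abstract away most of the complex code in the Transformers library and offer a simple API dedicated to several tasks, including named entity recognition, masked language modeling, sentiment analysis, feature extraction, and question answering.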

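| Modality | Task | Description | Pipeline API |
| --- | --- | --- | --- |
| Audio | Audio classification | Assign a label to an audio file | pipeline(task="audio-classification") |
| | Automatic speech recognition | Transcribe the speech in an audio file into text | pipeline(task="automatic-speech-recognition") |
| Computer vision | Image classification | Assign a label to an image | pipeline(task="image-classification") |
| | Object detection | Predict the bounding boxes and classes of objects in an image | pipeline(task="object-detection") |
| | Image segmentation | Assign a label to every individual pixel of an image (supports semantic, panoptic, and instance segmentation) | pipeline(task="image-segmentation") |
| Natural language processing | Text classification | Assign a label to a given text sequence | pipeline(task="sentiment-analysis") |
| | Token classification | Assign a label to each token in a sequence (person, organization, location, etc.) | pipeline(task="ner") |
| | Question answering | Extract the answer from the text, given a context and a question | pipeline(task="question-answering") |
| | Summarization | Generate a summary of a text sequence or document | pipeline(task="summarization") |
| | Translation | Translate text from one language into another | pipeline(task="translation") |
| Multimodal | Document question answering | Answer a question about a document, given the document and the question | pipeline(task="document-question-answering") |
| | Visual Question Answering | Answer a question about an image, given the image and the question | pipeline(task="vqa") |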

The full list of tasks supported by Pipelines: https://huggingface.co/docs/transformers/task_summary

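How a Pipeline works: the input is preprocessed (for text, tokenized), passed through the model, and the model outputs are post-processed into a readable result.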

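(1) How to download Hugging Face datasets and model weights

Downloads of datasets or model weights often fail because of network problems; in that case they can be fetched through a Hugging Face mirror.

Install the dependency: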
```
pip install -U huggingface_hub
```

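Set the environment variable:

```
# Edit ~/.bashrc
vim ~/.bashrc
# Append at the end:
export HF_ENDPOINT=https://hf-mirror.com
# Finally, reload the file so the variable takes effect
source ~/.bashrc
```

Download a model:

```
huggingface-cli download --resume-download gpt2 --local-dir gpt2
```

Download a dataset:

```
huggingface-cli download --repo-type dataset --resume-download wikitext --local-dir wikitext
```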
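The same mirror setting can also be applied from Python. The sketch below is only an illustration: it assumes that `huggingface_hub` reads `HF_ENDPOINT` when it is imported, and it uses the library's `snapshot_download` function. The helper name `download_repo` is hypothetical and not part of the original workflow.

```python
import os

# Assumption: the endpoint is read when huggingface_hub is imported,
# so the variable has to be set before that import happens.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"


def download_repo(repo_id, local_dir, repo_type="model"):
    """Hypothetical helper: download a full repository snapshot via the mirror."""
    # Imported lazily so the environment variable above is already in place.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, repo_type=repo_type, local_dir=local_dir)


# Example calls (they need network access), mirroring the CLI commands above:
# download_repo("gpt2", "gpt2")
# download_repo("wikitext", "wikitext", repo_type="dataset")
```

Setting the variable inside the script keeps the configuration next to the code, which can be convenient in notebooks where `~/.bashrc` is not sourced.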

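(2) Text classification

Like classification tasks in any modality, text classification labels a text sequence (which can be a sentence, a paragraph, or a whole document) with one of a predefined set of classes. Text classification has many practical applications, including:

- Sentiment analysis: label text according to some polarity (such as positive or negative), which can support decision-making in fields like politics, finance, and marketing.
- Content classification: label text according to some topic to help organize and filter information in news and social media feeds (weather, sports, finance, etc.).

The sentiment analysis task is used below to show how to use the Pipeline API.

Model page: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english

```
from transformers import pipeline

# Specifying only the task makes the pipeline fall back to a default model (not recommended)
# pipe = pipeline(task="sentiment-analysis")

# If the default model cannot be downloaded automatically, download its weights
# manually and point the pipeline at the local path
pipe = pipeline(task="sentiment-analysis", model="./model/distilbert-base-uncased-finetuned-sst-2-english/")
pipe("Changsha is so cold today!")
# Result: [{'label': 'NEGATIVE', 'score': 0.9988634586334229}]
```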
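The `label` and `score` fields in a result like the one above come from the post-processing step, which turns the raw model outputs (logits) into probabilities. The following is a minimal sketch of that idea in plain Python; the logits and the helper name `postprocess` are made up for illustration and are not taken from the model in this example.

```python
import math


def postprocess(logits, id2label):
    """Turn raw classification logits into a pipeline-style result (softmax + argmax)."""
    # Subtract the largest logit before exponentiating for numerical stability.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": id2label[best], "score": probs[best]}]


# Hypothetical logits for a two-class sentiment model (0 = NEGATIVE, 1 = POSITIVE).
result = postprocess([3.4, -3.4], {0: "NEGATIVE", 1: "POSITIVE"})
print(result)
```

A large gap between the two logits is what produces a score close to 1.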

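(3) Question Answering

Question answering is another token-level task that returns an answer to a question, sometimes with context (open-domain) and sometimes without context (closed-domain). This happens whenever we ask a virtual assistant something, such as whether a restaurant is open. It can also provide customer or technical support and help search engines retrieve the relevant information you are asking for.

There are two common types of question answering:

- Extractive: given a question and some context, the model must extract a span of text from the context as the answer.
- Abstractive (generative): given a question and some context, the answer is generated from the context; this approach is handled by Text2TextGenerationPipeline rather than the QuestionAnsweringPipeline shown below.

Model page: https://huggingface.co/distilbert-base-cased-distilled-squad

```
from transformers import pipeline

question_answerer = pipeline(task="question-answering", model="./model/distilbert-base-cased-distilled-squad/")
preds = question_answerer(
    question="What is the name of the repository?",
    context="The name of the repository is huggingface/transformers",
)
print(
    f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
)
# score: 0.9327, start: 30, end: 54, answer: huggingface/transformers
```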
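In extractive question answering, `start` and `end` are character offsets into the context, so the answer can be recovered by slicing. A small check using the values from the output above:

```python
context = "The name of the repository is huggingface/transformers"
start, end = 30, 54  # offsets reported in the output above

# Slicing the context with the reported offsets yields the answer span.
answer = context[start:end]
print(answer)
```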

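(4) Computer Vision

Image classification labels a whole image with one of a predefined set of classes. Like most classification tasks, image classification has many practical use cases, including:

- Healthcare: label medical images to detect disease or monitor patient health.
- Environment: label satellite images to monitor deforestation, inform wildland management, or detect wildfires.
- Agriculture: label images of crops to monitor plant health, or satellite images for land use monitoring.
- Ecology: label images of animal or plant species to monitor wildlife populations or track endangered species.

Model page: https://huggingface.co/google/vit-base-patch16-224

```
from transformers import pipeline

classifier = pipeline(task="image-classification", model="./model/vit-base-patch16-224")
preds = classifier(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
# Keep only the rounded score and the label of each prediction
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
print(*preds, sep="\n")
```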

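(5) Calling a large language model with a Pipeline

Language Modeling

Language modeling is the task of predicting words in a text sequence. It has become a very popular natural language processing task because a pretrained language model can be fine-tuned for many other downstream tasks. Recently there has been a lot of interest in large language models (LLMs), which demonstrate zero-shot or few-shot learning. This means the model can solve tasks it was not explicitly trained to do. Language models can generate fluent and convincing text, but they need to be used with care, because the text may not always be accurate.

From the theory part of this series, we know there are two typical kinds of language model:

- Autoregressive: the model's objective is to predict the next token in the sequence; the tokens that follow are masked during training. Example: GPT-3.
- Autoencoding: the model's objective is to fill in the missing or masked tokens of a sentence after understanding the context on both sides. Example: BERT.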
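The difference between the two objectives can be illustrated with a toy example. The sketch below is not how any particular model is implemented; the token list and the masked position are arbitrary choices for illustration.

```python
tokens = ["Hugging", "Face", "is", "open"]

# Autoregressive (GPT-style): position i may only attend to positions <= i,
# so the tokens that come later are masked out during training.
causal_mask = [[1 if j <= i else 0 for j in range(len(tokens))] for i in range(len(tokens))]

# Autoencoding (BERT-style): a token is replaced by [MASK] and predicted
# from the context on both sides.
masked_position = 2  # hypothetical choice of which token to hide
masked_input = ["[MASK]" if i == masked_position else t for i, t in enumerate(tokens)]

for row in causal_mask:
    print(row)
print(masked_input)
```

The lower-triangular mask is what lets an autoregressive model be trained on every position of a sequence at once without seeing the answer.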

```
from transformers import pipeline

prompt = "Hugging Face is a community-based open-source platform for machine learning."
generator = pipeline(task="text-generation", model="gpt2")
generator(prompt)
```

Tips: many more models can be used directly; search Hugging Face Models for one that fits your needs.

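3. Getting started with fine-tuning Transformers models

This example walks through the main steps of fine-tuning a model with Transformers:

- Downloading the dataset
- Data preprocessing
- Configuring training hyperparameters
- Setting up evaluation metrics
- A brief introduction to the Trainer
- Running the training
- Saving the model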

(1) Downloading the dataset

Hugging Face dataset: YelpReviewFull

Download it through the mirror:

```
huggingface-cli download --repo-type dataset --resume-download yelp_review_full --local-dir yelp_review_full
```

Data instance

A typical data point consists of a text and its corresponding label.

An example from the YelpReviewFull test set:

```
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

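Data fields

'text': the review text, escaped using double quotes ("); any internal double quote is escaped by two double quotes (""). Newlines are escaped by a backslash followed by an "n" character, that is, "\n".

'label': corresponds to the score of the review (between 1 and 5 stars). In the loaded dataset it is stored as a class index from 0 to 4, which is why the example above shows 'label': 0.

Data splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 test samples for each review star rating from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.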
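The totals follow directly from the per-class counts, as this short arithmetic check shows (the variable names are only for illustration):

```python
num_classes = 5            # star ratings 1 to 5
train_per_class = 130_000
test_per_class = 10_000

train_total = num_classes * train_per_class
test_total = num_classes * test_per_class
print(train_total, test_total)
```

Because every class contributes the same number of samples, the dataset is balanced, so plain accuracy is a reasonable metric for it.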

Loading the dataset

```
from datasets import load_dataset

dataset = load_dataset('./data/yelp_review_full/')
```

Randomly draw n samples to display:

```
import random

import pandas as pd
from datasets import ClassLabel, Sequence
from IPython.display import display, HTML

# n is the number of samples to draw
n = 10


def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct random row indices
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        print(column, typ)
        # Show class names instead of integer ids for label columns
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))


show_random_elements(dataset['train'], num_examples=n)
```

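(2) Data preprocessing

Once the dataset is available locally, use a Tokenizer to process the text. Inputs of varying length can be handled with padding and truncation strategies.

The map method of Datasets applies a preprocessing function over the whole dataset in one go.

Below, the whole dataset is processed with the pad-to-maximum-length strategy:

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./model/bert-base-cased")


def tokenize_function(examples):
    # Pad every example to the model's maximum length and truncate longer ones
    return tokenizer(examples['text'], padding='max_length', truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
```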
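To make the effect of padding and truncation concrete, here is a simplified sketch in plain Python. It is not the tokenizer's implementation: the helper `pad_or_truncate` and the token ids are hypothetical, and a real tokenizer also takes care to keep special tokens such as the final separator when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper: force a sequence to exactly max_length token ids."""
    truncated = token_ids[:max_length]
    padding = max_length - len(truncated)
    return {
        "input_ids": truncated + [pad_id] * padding,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(truncated) + [0] * padding,
    }


short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```

After this step every example has the same length, which is what allows examples to be stacked into fixed-size batches.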

(3) Data sampling

To demonstrate small-scale training (based on the PyTorch Trainer), use 1,000 samples.

The shuffle() function randomly reorders the samples:

```
small_train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets['test'].shuffle(seed=42).select(range(1000))
```
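Conceptually, shuffling with a fixed seed and then selecting the first 1,000 rows is a reproducible random sample without replacement. The sketch below shows the idea with the standard library only; the helper `sample_indices` is hypothetical, and it will not reproduce the exact permutation that the datasets library generates.

```python
import random


def sample_indices(dataset_size, num_samples, seed):
    """Hypothetical helper: pick num_samples distinct row indices reproducibly."""
    indices = list(range(dataset_size))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    return indices[:num_samples]          # analogous to .select(range(num_samples))


first = sample_indices(650_000, 1000, seed=42)
second = sample_indices(650_000, 1000, seed=42)
print(len(first), first == second)
```

Fixing the seed matters for comparing runs: two experiments that use the same seed train on the same subset.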

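(4) Fine-tuning configuration

Loading the BERT model

Loading the model prints a warning that some weights are being discarded (the heads used for pretraining) and that others are randomly initialized (the new classification head, the classifier layer). This is entirely normal when fine-tuning: we are removing the head used for the masked language modeling task of the pretrained model and replacing it with a new head for which there are no pretrained weights. The library therefore warns us that this model should be fine-tuned before it is used for inference, which is exactly what we are about to do.

```
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./model/bert-base-cased", num_labels=5)
```

Training hyperparameters (TrainingArguments)

Full list of configuration parameters and default values: TrainingArguments

Source definition: training_args.py

The most important setting is the path where the model weights are saved (output_dir).

```
from transformers import Trainer, TrainingArguments

model_dir = "models/bert-base-cased-finetune-yelp"

# logging_steps defaults to 500; given our training data size and number of steps, set it to 50
training_args = TrainingArguments(
    output_dir=model_dir,
    # evaluation_strategy="epoch",  # newer Transformers releases name this eval_strategy
    per_device_train_batch_size=16,
    num_train_epochs=5,
    logging_steps=50,
)
print(training_args)
```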
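With these settings, the number of optimizer steps (and therefore how often a log line appears) can be estimated by hand. The sketch below assumes a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept; under other setups the numbers change.

```python
import math

num_samples = 1000   # size of small_train_dataset
batch_size = 16      # per_device_train_batch_size
num_epochs = 5       # num_train_epochs
logging_steps = 50

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
num_log_events = total_steps // logging_steps
print(steps_per_epoch, total_steps, num_log_events)
```

This also shows why the default of 500 logging steps would be unhelpful here: the whole run is shorter than 500 steps, so no intermediate training loss would be logged.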


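(5) Metric evaluation during training (Evaluate)

The Hugging Face Evaluate library gives access, with a single line of code, to dozens of evaluation methods across different domains (natural language processing, computer vision, reinforcement learning, and more). The full list of currently supported metrics: https://huggingface.co/evaluate-metric

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics. The Evaluate library provides a simple accuracy function that can be loaded with evaluate.load.

Loading it may fail because of network problems. In that case, use a local path instead: first clone the official evaluate repository.

https://github.com/huggingface/evaluate.git

```
import numpy as np
import evaluate

metric = evaluate.load("./evaluate/metrics/accuracy/")
```

Next, call the compute function to calculate the accuracy of the predictions. Before passing the predictions to compute, the logits have to be converted into predictions (all Transformers models return logits).

```
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```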
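What this function computes can be checked by hand on a toy batch. The sketch below reimplements the argmax and accuracy steps in plain Python, with made-up logits and labels, purely to illustrate the calculation; it is not a replacement for the Evaluate metric.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews over the five star classes
toy_logits = [
    [0.1, 2.0, 0.3, 0.0, -1.0],  # predicts class 1
    [1.5, 0.2, 0.1, 0.0, 0.3],   # predicts class 0
    [0.0, 0.1, 0.2, 0.3, 4.0],   # predicts class 4
    [0.9, 0.8, 0.1, 0.0, 0.2],   # predicts class 0
]
toy_labels = [1, 0, 4, 2]
print(accuracy_from_logits(toy_logits, toy_labels))
```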

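(6) Start training

```
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    # eval_dataset=small_eval_dataset,  # usually skipped for large models, because evaluation is expensive
    compute_metrics=compute_metrics,
)
trainer.train()
```

Check GPU utilization:

```
watch -n 1 nvidia-smi
```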


(7) Validation

```
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))
trainer.evaluate(small_test_dataset)
```

Because only a small amount of training data was used, the training loss and the validation loss differ considerably.

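(8) Saving the model and training state

```
# Save the model weights and configuration
trainer.save_model(model_dir)
# Save the Trainer state (step counters and logged history)
trainer.save_state()
```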

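4. Transformers quantization techniques

(1) Introduction to quantization

Quantization focuses on representing data with less information while trying not to lose too much accuracy. Concretely, it converts the data type used for the model parameters to one with fewer bits, while preserving as much of the original information as possible.

For example, suppose your model weights are originally stored as 32-bit floating-point numbers (Float32).

- Quantizing them to 16-bit floating point (Float16) halves the model size; in other words, only half the GPU memory is needed to load the quantized model.
- Quantizing the model to 8-bit integers (Int8) requires roughly a quarter of the memory.
- Quantizing the model to the 4-bit data type Normal Float 4 (NF4) requires roughly one eighth of the memory.

Lower precision can also speed up inference, because computing with fewer bits takes less time.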
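To give a feel for what "fewer bits" means in practice, here is a minimal sketch of absmax quantization to 8-bit integers in plain Python. This is a generic textbook scheme chosen for illustration, not the exact algorithm used by any library mentioned in this article, and the weights are made up.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    scale = max(abs(v) for v in values) / 127
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]  # hypothetical weights
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
print(q)
print(restored)
```

Each weight is now stored as a one-byte integer plus one shared scale, and the restored values differ from the originals by at most half a quantization step. This rounding error is the accuracy cost that quantization trades for memory.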

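(2) Model parameters and how to calculate GPU memory usage

Take the facebook OPT-6.7B model as an example.

Step-by-step calculation:

- Estimate the total number of parameters: OPT-6.7B refers to a model with about 6.7 billion parameters.
- Calculate the memory used by a single parameter: OPT-6.7B uses Float16 by default, so each parameter takes 16 bits (2 bytes) of GPU memory.
- Total memory usage = number of parameters × memory per parameter.
- Plug in the numbers: 6.7 billion parameters × 2 bytes/parameter = 13.4 billion bytes = 13.4 × 10^9 bytes.
- Convert units: 1 GB = 2^30 B ≈ 10^9 bytes.

In summary, loading OPT-6.7B onto the GPU in float16 precision takes about 13.4 GB of memory for the weights.

With int8 precision, only about 7 GB is needed.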
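The calculation above can be wrapped in a few lines of Python and repeated for other data types. The helper below is a rough sketch that counts the weights only; activations, the KV cache, and framework overhead are not included, and the names are chosen for illustration.

```python
# Bytes needed to store one parameter in each data type
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "nf4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for the model weights only, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(6.7e9, dtype):.2f} GB")
```

Actual usage reported by nvidia-smi will be higher than these figures, since the runtime also reserves memory beyond the weights.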

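(3) Transformers quantization with BitsAndBytes

bitsandbytes is the simplest option for quantizing a model to 8-bit and 4-bit.

- 8-bit quantization computes with outlier values in fp16 and with non-outlier values in int8, converts the non-outlier results back to fp16, and then adds the two parts together to return the result in fp16. This reduces the degradation that outliers would otherwise cause in model performance.
- 4-bit quantization compresses the model even further and is commonly used together with QLoRA to fine-tune quantized LLMs (large language models).

(An outlier is a hidden state value greater than a certain threshold; these values are computed in fp16. The values usually follow a normal distribution ([-3.5, 3.5]), but for large models the distribution can be very different ([-60, -6] or [6, 60]). 8-bit quantization works well for values with a magnitude of around 5, but beyond that there is a significant performance penalty. A good default threshold is 6, but a lower threshold may be needed for less stable models (small models or fine-tuning).)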
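The outlier handling described above can be sketched on a single dot product. This is a heavily simplified illustration under stated assumptions, not the bitsandbytes implementation: the real method works on whole matrices with vector-wise scales, whereas the hypothetical helper below handles one hidden-state vector and one weight column, with made-up values.

```python
THRESHOLD = 6.0  # default outlier threshold mentioned above


def quantize(values):
    """Absmax int8 quantization of a list of floats; returns (ints, scale)."""
    absmax = max((abs(v) for v in values), default=0.0)
    scale = absmax / 127 if absmax > 0 else 1.0
    return [round(v / scale) for v in values], scale


def mixed_precision_dot(x, w, threshold=THRESHOLD):
    """Dot product of x and w where outlier features of x skip quantization."""
    outlier_idx = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular_idx = [i for i, v in enumerate(x) if abs(v) < threshold]

    # Outlier features: kept in floating point (fp16 in the real method).
    outlier_part = sum(x[i] * w[i] for i in outlier_idx)

    # Regular features: quantize both operands, multiply as integers, rescale.
    xq, xs = quantize([x[i] for i in regular_idx])
    wq, ws = quantize([w[i] for i in regular_idx])
    regular_part = sum(a * b for a, b in zip(xq, wq)) * xs * ws

    return outlier_part + regular_part, outlier_idx


x = [0.5, -1.2, 45.0, 2.0]  # hypothetical hidden state with one outlier (45.0)
w = [0.3, 0.1, -0.2, 0.4]   # hypothetical weight column
approx, outliers = mixed_precision_dot(x, w)
exact = sum(a * b for a, b in zip(x, w))
print(outliers, approx, exact)
```

If the outlier 45.0 were quantized together with the other values, it would dominate the scale and the small values would lose most of their resolution. Separating it out keeps the int8 part accurate.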

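Using parameter quantization in Transformers

Passing the load_in_8bit or load_in_4bit parameter to the model's from_pretrained() method in Transformers quantizes the model. This works for almost any model in any modality, as long as the model supports loading with Accelerate and contains torch.nn.Linear layers.

```
from transformers import AutoModelForCausalLM

model_id = "./model/opt-2.7b"
# Newer Transformers releases prefer quantization_config=BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True)
```

Measured GPU memory usage with Int4 quantization: the memory footprint is very low.

```
# Get the GPU memory used by the model itself
# (the difference from what nvidia-smi reports is memory reserved by PyTorch)
memory_footprint_bytes = model_4bit.get_memory_footprint()
memory_footprint_mib = memory_footprint_bytes / (1024 ** 2)  # convert to MiB
print(f"{memory_footprint_mib:.2f}MiB")
# 1457.52MiB
```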
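As a sanity check, the measured footprint can be compared with a back-of-the-envelope lower bound of half a byte per parameter. The sketch assumes the nominal 2.7 billion parameter count from the model name; the explanation for the gap given below is an assumption, not something measured in this article.

```python
num_params = 2.7e9          # nominal parameter count of OPT-2.7B (assumption)
bytes_per_param_4bit = 0.5  # 4 bits = half a byte

lower_bound_mib = num_params * bytes_per_param_4bit / (1024 ** 2)
measured_mib = 1457.52      # value printed above

print(f"lower bound: {lower_bound_mib:.2f} MiB, measured: {measured_mib:.2f} MiB")
```

The measured value is somewhat higher than the bound. A plausible reason is that not every tensor is stored in 4 bits: embeddings and normalization layers typically stay in higher precision, and the quantization constants take space as well.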

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
text = "Merry Christmas! I'm glad to"
inputs = tokenizer(text, return_tensors="pt").to(0)  # move the inputs to GPU 0

out = model_4bit.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
'''
Merry Christmas! I'm glad to see you're still around.
I'm still around, just not posting as much. I'm still here, just not posting as much. I'm still here, just not posting as much. I'm still here, just not posting as much. I'm still here, just not posting as much. I'm
'''
```

Loading the model in NF4 precision

```
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)
```

Loading with double quantization

```
double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)
model_double_quant = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=double_quant_config)
```
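Double quantization saves memory by also quantizing the per-block scaling constants. The arithmetic below reproduces the estimate given in the QLoRA paper, assuming its block sizes (64 weights per block, 256 constants per second-level block); these settings and the 2.7 billion parameter count are assumptions for illustration and are not read from the configuration above.

```python
block_size = 64          # weights per quantization block (QLoRA paper setting)
second_block_size = 256  # first-level constants per second-level block

# Without double quantization: one 32-bit constant per block of 64 weights.
overhead_plain = 32 / block_size

# With double quantization: 8-bit constants, plus one 32-bit constant per 256 of them.
overhead_double = 8 / block_size + 32 / (block_size * second_block_size)

saving_bits = overhead_plain - overhead_double
saving_mb = saving_bits * 2.7e9 / 8 / 1e6  # for a model with 2.7 billion parameters

print(f"{overhead_plain:.3f} -> {overhead_double:.3f} bits per parameter")
print(f"saving: {saving_bits:.3f} bits per parameter, about {saving_mb:.0f} MB in total")
```

The saving per parameter is small, but it scales with model size, which is why the option matters most for very large models.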

Loading the model with all of the QLoRA quantization techniques

```
import torch

qlora_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_qlora = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=qlora_config)
```