The King of ASR: I'm Back, Smaller and Faster

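Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed by Alec Radford et al. from OpenAI in the paper "Robust Speech Recognition via Large-Scale Weak Supervision". Trained on more than 5 million hours of labeled data, Whisper generalizes well to many datasets and domains in a zero-shot setting.

Whisper large-v3-turbo is a fine-tuned version of a pruned Whisper large-v3. In other words, it is the same model except that the number of decoder layers has been reduced from 32 to 4. As a result, the model is much faster, at the cost of a slight drop in quality. You can find more details in the discussion on GitHub.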
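To get a feel for what dropping 28 decoder layers means, here is a back-of-the-envelope parameter count. It is only a rough sketch: the hyperparameters (hidden size 1280, 32 encoder layers, vocabulary of 51866 tokens) are assumptions taken from the published large-v3 configuration rather than from this post, and the helper `approx_params` is hypothetical. Biases, layer norms, the convolutional stem and positional embeddings are ignored.

```python
def approx_params(d_model: int, enc_layers: int, dec_layers: int, vocab_size: int) -> int:
    """Rough weight count for a Transformer encoder-decoder (matrices only)."""
    attn = 4 * d_model * d_model          # Q, K, V and output projections
    ffn = 2 * d_model * (4 * d_model)     # two linear layers with a 4x hidden width
    enc_layer = attn + ffn                # self-attention + feed-forward
    dec_layer = 2 * attn + ffn            # self-attention + cross-attention + feed-forward
    embed = vocab_size * d_model          # token embedding matrix
    return enc_layers * enc_layer + dec_layers * dec_layer + embed


# Assumed hyperparameters (not stated in this post).
large_v3 = approx_params(d_model=1280, enc_layers=32, dec_layers=32, vocab_size=51866)
turbo = approx_params(d_model=1280, enc_layers=32, dec_layers=4, vocab_size=51866)

print(f"large-v3 ~ {large_v3 / 1e9:.2f}B, turbo ~ {turbo / 1e9:.2f}B")
```

Under these assumptions the estimate comes out at roughly 1.5B versus 0.8B weights, so the decoder accounts for most of the saving while the encoder is untouched.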

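Usage

Whisper large-v3-turbo is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce model loading time: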

pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate

The model can be used with the pipeline class to transcribe audio of arbitrary length:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

To transcribe a local audio file, simply pass the path to the file when calling the pipeline:

result = pipe("audio.mp3")

Multiple audio files can be transcribed in parallel by passing them as a list and setting the batch_size parameter:

result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)

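Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and conditioning on previous tokens. The following example shows how to enable these heuristics: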

generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

result = pipe(sample, generate_kwargs=generate_kwargs)
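For readers wondering what temperature fallback actually does, the idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the Transformers implementation: `decode_fn` is a hypothetical callable standing in for one decoding pass, the compression ratio is measured on UTF-8 bytes rather than in token space (so the 2.4 default here is an illustrative value, not the 1.35 used above), and the no-speech check is left out.

```python
import zlib
from typing import Callable, Sequence, Tuple


def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def decode_with_fallback(
    decode_fn: Callable[[float], Tuple[str, float]],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 2.4,  # illustrative, text-space value
    logprob_threshold: float = -1.0,
) -> Tuple[str, float]:
    """Retry at increasing temperatures until the output passes both checks.

    decode_fn (hypothetical) maps a temperature to (text, average log-probability).
    Returns the accepted text and the temperature that produced it; if nothing
    passes, the result of the last temperature is returned.
    """
    text, temperature = "", temperatures[0]
    for temperature in temperatures:
        text, avg_logprob = decode_fn(temperature)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_unlikely = avg_logprob < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break  # output looks sane, stop retrying
    return text, temperature


# Fake decoder: loops at temperature 0.0, recovers once sampling is enabled.
def fake_decode(temperature: float) -> Tuple[str, float]:
    if temperature == 0.0:
        return "la " * 50, -0.2
    return "the quick brown fox jumps over the lazy dog", -0.3


text, used_temperature = decode_with_fallback(fake_decode)
print(used_temperature, text)
```

Greedy decoding (temperature 0.0) is tried first because it is usually the most accurate; sampling at higher temperatures is only a way out of repetition loops and low-confidence outputs.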

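Whisper predicts the language of the source audio automatically. If the source language is known in advance, it can be passed to the pipeline as an argument: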

result = pipe(sample, generate_kwargs={"language": "english"})

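By default, Whisper performs speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to "translate":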

result = pipe(sample, generate_kwargs={"task": "translate"})

Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the return_timestamps argument:

result = pipe(sample, return_timestamps=True)
print(result["chunks"])

And for word-level timestamps:

result = pipe(sample, return_timestamps="word")
print(result["chunks"])

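The arguments above can be used on their own or in combination. For example, to transcribe French source audio and return sentence-level timestamps, use the following: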

result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french", "task": "transcribe"})
print(result["chunks"])
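One common use of the timestamped chunks is turning them into subtitles. The snippet below is a minimal sketch that assumes each chunk is a dict with a "text" string and a "timestamp" pair of (start, end) seconds, as printed by the examples above; the helper names `srt_time` and `chunks_to_srt` are made up for this post, and a missing end time is handled defensively by reusing the start time.

```python
from typing import Iterable, Mapping


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: Iterable[Mapping]) -> str:
    """Convert pipeline-style chunks into the text of an SRT subtitle file."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # defensive: fall back to the start time
            end = start
        blocks.append(
            f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{chunk['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Hand-written chunks in the same shape as result["chunks"].
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, 3661.25), "text": " Bye."},
]
print(chunks_to_srt(example_chunks))
```

With a real pipeline result, this would be called as chunks_to_srt(result["chunks"]).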

For more control over the generation parameters, use the model + processor API directly:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)

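Additional Speed & Memory Improvements

You can apply further speed and memory improvements to Whisper to reduce inference time and VRAM requirements even more.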

Chunked Long-Form

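Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is required:

Sequential: uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
Chunked: splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries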
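To make the chunked variant more concrete, the splitting step can be sketched as follows. This is only an illustration of the idea: the function `chunk_spans` and its `overlap_s` parameter are hypothetical, the 10-second overlap is an arbitrary value rather than the library's default, and the harder part (stitching the overlapping transcriptions back together) is left out.

```python
from typing import List, Tuple


def chunk_spans(
    total_s: float,
    chunk_length_s: float = 30.0,
    overlap_s: float = 10.0,  # illustrative value, not a library default
) -> List[Tuple[float, float]]:
    """Return (start, end) times of overlapping windows that cover the audio."""
    if not 0 <= overlap_s < chunk_length_s:
        raise ValueError("overlap_s must be in [0, chunk_length_s)")
    if total_s <= 0:
        return []

    step = chunk_length_s - overlap_s  # how far each new window advances
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        spans.append((start, end))
        if end >= total_s:  # the last window reaches the end of the audio
            break
        start += step
    return spans


# A 70-second file becomes three windows that can be transcribed as one batch.
print(chunk_spans(70.0))
```

Because every window is independent, they can all go through the model in a single batch, which is where the speed advantage of the chunked algorithm comes from. The sequential algorithm cannot do this, since each window starts where the previous transcription ended.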

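The sequential long-form algorithm should be used in either of the following scenarios:

Transcription accuracy is the most important factor, and speed is a secondary consideration
You are transcribing batches of long audio files, in which case the latency of sequential transcription is comparable to chunked, while being 0.5% WER more accurate

Conversely, the chunked algorithm should be used when:

Transcription speed is the most important factor
You are transcribing a single long audio file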
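The accuracy difference quoted above is expressed in word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. A minimal sketch of that computation is shown below; the function name `word_error_rate` is made up here, and real evaluations normally normalize text (casing, punctuation) first, which this version does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one row of the table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A gap of 0.5% WER therefore corresponds to about one extra word error per 200 reference words.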

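By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the chunk_length_s parameter to the pipeline. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long audio files, pass the batch_size argument: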

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,  # batch size for inference - set based on your device
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Torch compile

The Whisper forward pass is compatible with torch.compile, giving 4.5x speed-ups.

Note: torch.compile is currently not compatible with the chunked long-form algorithm or Flash Attention 2 ⚠️

import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm

torch.set_float32_matmul_precision("high")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.generation_config.max_new_tokens = 256
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
    with sdpa_kernel(SDPBackend.MATH):
        result = pipe(sample.copy(), generate_kwargs={"min_new_tokens": 256, "max_new_tokens": 256})

# fast run
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(sample.copy())

print(result["text"])

Flash Attention 2

If your GPU supports Flash Attention 2 and you are not using torch.compile, we recommend using Flash Attention 2. To do so, first install Flash Attention:

pip install flash-attn --no-build-isolation

Then pass attn_implementation="flash_attention_2" to from_pretrained:

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")

Torch Scaled Dot-Product Attention (SDPA)

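If your GPU does not support Flash Attention, we recommend using PyTorch scaled dot-product attention (SDPA). In PyTorch 2.1.1 or later, this attention implementation is activated by default. To check whether your PyTorch version is compatible, run the following Python snippet: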

from transformers.utils import is_torch_sdpa_available

print(is_torch_sdpa_available())

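If the above returns True, you have a valid PyTorch version installed and SDPA is activated by default. If it returns False, you need to upgrade PyTorch following the official instructions.

Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying attn_implementation="sdpa", as follows: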

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa")

For more information on how to use SDPA, see the Transformers SDPA documentation.
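The recommendations in the last two sections can be folded into one small helper. This is a sketch under stated assumptions: the function `pick_attn_implementation` is hypothetical, and it only checks whether the flash_attn package is importable, which does not by itself guarantee that your GPU supports Flash Attention 2.

```python
import importlib.util


def pick_attn_implementation(use_torch_compile: bool) -> str:
    """Choose a value for attn_implementation following the guidance above."""
    # Assumption: an importable flash_attn package means Flash Attention 2 is usable.
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if has_flash_attn and not use_torch_compile:
        return "flash_attention_2"
    # torch.compile is not compatible with Flash Attention 2, so fall back to SDPA.
    return "sdpa"


attn_implementation = pick_attn_implementation(use_torch_compile=False)
print(attn_implementation)
```

The returned string can then be passed as attn_implementation to from_pretrained, as in the examples above.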

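Model details

Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model. There are two flavors of Whisper model: English-only and multilingual. The English-only models were trained on the task of English speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech translation. For speech recognition, the model predicts a transcription in the same language as the audio. For speech translation, the model predicts a transcription in a language different from the audio.

Whisper checkpoints come in five configurations of varying model size. The smallest four are available in both English-only and multilingual versions. The largest checkpoints are multilingual only. All ten pre-trained checkpoints are available on the Hugging Face Hub. The following table summarizes the checkpoints, with links to the models on the Hub: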