1. PaddleSpeech ASR: speech-to-text transcription
References:
https://github.com/PaddlePaddle/PaddleSpeech
After installation you may hit numpy-related errors at runtime, likely because the Python/numpy versions are too new; the combination that finally worked for me is Python 3.10 with numpy 1.22.0.
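For example, a minimal way to pin the working versions (a sketch assuming conda is available; any virtual-environment tool works the same way):
conda create -n paddle python=3.10
conda activate paddle
pip install numpy==1.22.0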
pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
pip install paddlespeech
1) Code
Models are downloaded by default to C:\Users\loong\.paddlespeech\models.
from paddlespeech.cli.asr.infer import ASRExecutor
asr = ASRExecutor()
result = asr(audio_file="zh.wav")  # the first run downloads the model automatically
print(result)
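The CLI equivalent, per the PaddleSpeech README, produces the same transcription:
paddlespeech asr --lang zh --input zh.wav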
2) Real-time speech transcription
References: https://www.cnblogs.com/chenkui164/p/16296941.html
https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/streaming_asr_server/README.md
https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/streaming_asr_server/web
paddlespeech_server stats --task asr  # lists the supported models; to change the model, edit the yaml config file accordingly
## First, start the ASR server
# Start the streaming speech-recognition service
cd PaddleSpeech/demos/streaming_asr_server
paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application_faster.yaml
Once the server is running, open \demos\streaming_asr_server\web\index.html from the demo in a browser to test it:
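Besides the web page, the streaming_asr_server demo also documents a command-line client; a minimal sketch of an invocation (assuming the server listens on 127.0.0.1:8090 as set in the yaml config):
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav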
Recording from the microphone with PyAudio and saving to WAV
import pyaudio
import wave

# Instantiate a PyAudio object and set the sound-card parameters
pa = pyaudio.PyAudio()
chunk = 1024                # frames per buffer
Format = pyaudio.paInt16    # sample depth (16-bit)
CHANNELS = 2                # number of channels
RATE = 16000                # sample rate
record_seconds = 5          # recording duration

# RATE / chunk * record_seconds = samples per second divided by the buffer
# length, times the duration, gives the number of buffers to read.
record_list = []  # list to collect the raw audio data

# Open the sound card as an input stream
stream = pa.open(format=Format, rate=RATE, channels=CHANNELS,
                 frames_per_buffer=chunk, input=True)

print('Recording started...')
for i in range(0, int(RATE / chunk * record_seconds)):
    data = stream.read(chunk)  # raw bytes for one buffer of samples
    record_list.append(data)

# Recording finished
stream.stop_stream()  # stop the stream
stream.close()        # close the sound card
pa.terminate()        # release the PyAudio object
print('Recording finished...')

# Save the audio as a WAV file
file = wave.open('voice.wav', 'wb')             # create voice.wav
file.setnchannels(CHANNELS)                     # number of channels
file.setsampwidth(pa.get_sample_size(Format))   # sample width in bytes
file.setframerate(RATE)                         # sample rate
file.writeframes(b''.join(record_list))         # write the raw data into the WAV file
file.close()
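The recorded voice.wav can then be fed to the ASRExecutor from section 1. A minimal sketch, reusing the asr object created above (note: the zh models expect 16 kHz audio, and mono input is safest, so consider recording with CHANNELS = 1 or converting first):
result = asr(audio_file="voice.wav")
print(result)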
2. sherpa: real-time speech transcription
1) ncnn version
References: https://github.com/k2-fsa/sherpa-ncnn
https://www.bilibili.com/video/BV1K44y197Fg
Install:
pip install sherpa-ncnn sounddevice -i https://mirror.baidu.com/pypi/simple
Download:
1) Clone the project: git clone https://github.com/k2-fsa/sherpa-ncnn.git
2) Download the model:
https://huggingface.co/marcoyang/sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23
Download all 7 model files.
Run:
https://k2-fsa.github.io/sherpa/ncnn/python/index.html#start-recording
#!/usr/bin/env python3
# Real-time speech recognition from a microphone with sherpa-ncnn Python API
#
# Please refer to
# https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
# to download pre-trained models

import sys

try:
    import sounddevice as sd
except ImportError:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_ncnn


def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
        encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )
    return recognizer


def main():
    print("Started! Please speak")
    recognizer = create_recognizer()
    sample_rate = recognizer.sample_rate
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            recognizer.accept_waveform(sample_rate, samples)
            result = recognizer.text
            if last_result != result:
                last_result = result
                print("\r{}".format(result), end="", flush=True)


if __name__ == "__main__":
    devices = sd.query_devices()
    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")
** Improved result printing: instead of re-printing everything recognized so far on each update, print only the newly added text:
# assumes i = 0 and last_result = "" are initialized before the loop
if last_result != result:
    if i == 0:
        # first update: print the whole result
        print("{}".format(result), end="")
        last_result = result
        i = i + 1
    else:
        # later updates: print only the new suffix
        last_result_len = len(last_result)
        new_word = result[last_result_len:]
        print("{}".format(new_word), end="", flush=True)
        last_result = result
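The same idea, wrapped in a small helper so the loop body stays clean (print_new_suffix is a hypothetical name; it assumes the recognizer text only grows within an utterance, and it covers the first print too, since len("") == 0):

def print_new_suffix(last_result: str, result: str) -> str:
    # Print only the portion of `result` that has not been printed yet,
    # and return the value to use as last_result next time.
    if result != last_result:
        print(result[len(last_result):], end="", flush=True)
    return result

# inside the loop:
#     last_result = print_new_suffix(last_result, recognizer.text)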
2) onnx version
References: https://k2-fsa.github.io/sherpa/onnx/python/install.html
https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/speech-recognition-from-microphone.py
Install:
pip install sherpa-onnx
Download the model:
https://huggingface.co/csukuangfj/sherpa-onnx-streaming-conformer-zh-2023-05-23/tree/main
Code:
Run: python ./speech-recognition-from-microphone-onnx.py --tokens=./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt --encoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx --decoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx
#!/usr/bin/env python3
# Real-time speech recognition from a microphone with sherpa-onnx Python API
#
# Please refer to
# https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
# to download pre-trained models

import argparse
import sys
from pathlib import Path

try:
    import sounddevice as sd
except ImportError:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_onnx


def assert_file_exists(filename: str):
    assert Path(filename).is_file(), (
        f"{filename} does not exist!\n"
        "Please refer to "
        "https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html to download it"
    )


def get_args():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("--tokens", type=str, help="Path to tokens.txt")
    parser.add_argument("--encoder", type=str, help="Path to the encoder model")
    parser.add_argument("--decoder", type=str, help="Path to the decoder model")
    parser.add_argument("--joiner", type=str, help="Path to the joiner model")
    parser.add_argument(
        "--decoding-method",
        type=str,
        default="greedy_search",
        help="Valid values are greedy_search and modified_beam_search",
    )
    return parser.parse_args()


def create_recognizer():
    args = get_args()
    assert_file_exists(args.encoder)
    assert_file_exists(args.decoder)
    assert_file_exists(args.joiner)
    assert_file_exists(args.tokens)
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_onnx.OnlineRecognizer(
        tokens=args.tokens,
        encoder=args.encoder,
        decoder=args.decoder,
        joiner=args.joiner,
        num_threads=1,
        sample_rate=16000,
        feature_dim=80,
        decoding_method=args.decoding_method,
    )
    return recognizer


def main():
    recognizer = create_recognizer()
    print("Started! Please speak")

    # The model uses 16 kHz; we use 48 kHz here to demonstrate that
    # sherpa-onnx will do resampling inside.
    sample_rate = 48000
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    stream = recognizer.create_stream()
    with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            stream.accept_waveform(sample_rate, samples)
            while recognizer.is_ready(stream):
                recognizer.decode_stream(stream)
            result = recognizer.get_result(stream)
            if last_result != result:
                last_result = result
                print("\r{}".format(result), end="", flush=True)


if __name__ == "__main__":
    devices = sd.query_devices()
    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")
3) Offline transcription of WAV audio files
Note: if the local audio file's bitrate is not 256 kbps, convert it first. Bitrate is the number of bits per second in an audio or video file, commonly used to express a data rate or degree of compression.
For audio files, the bitrate is the data rate of the audio stream, in kbps (kilobits per second). In general, a higher bitrate means better audio quality but a larger file.
For example, 256 kbps means 256 kilobits of audio data per second. For uncompressed PCM, bitrate = sample rate × bit depth × channels, so 16000 Hz × 16 bit × 1 channel = 256 kbps, which is exactly what the sox command below produces:
sox a.wav -r 16k -c 1 b.wav
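If sox is not installed, ffmpeg does the same conversion:
ffmpeg -i a.wav -ar 16000 -ac 1 b.wav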
decode-file.py  # official example code
#!/usr/bin/env python3
"""
This file demonstrates how to use sherpa-ncnn Python API to recognize
a single file.

Please refer to
https://k2-fsa.github.io/sherpa/ncnn/index.html
to install sherpa-ncnn and to download the pre-trained models
used in this file.
"""

import time
import wave

import numpy as np
import sherpa_ncnn


def main():
    # Please refer to https://k2-fsa.github.io/sherpa/ncnn/index.html
    # to download the model files
    # recognizer = sherpa_ncnn.Recognizer(
    #     tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
    #     encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
    #     encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
    #     decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
    #     decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
    #     joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
    #     joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
    #     num_threads=4,
    # )
    base_file = "sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13"
    # base_file = "sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16"
    # base_file = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17"
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./{}/tokens.txt".format(base_file),
        encoder_param="./{}/encoder_jit_trace-pnnx.ncnn.param".format(base_file),
        encoder_bin="./{}/encoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        decoder_param="./{}/decoder_jit_trace-pnnx.ncnn.param".format(base_file),
        decoder_bin="./{}/decoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        joiner_param="./{}/joiner_jit_trace-pnnx.ncnn.param".format(base_file),
        joiner_bin="./{}/joiner_jit_trace-pnnx.ncnn.bin".format(base_file),
        num_threads=4,
    )

    filename = r"D:\sound\loong.wav"
    with wave.open(filename) as f:
        # Note: if wave_file_sample_rate differs from recognizer.sample_rate,
        # resampling is done inside sherpa-ncnn
        wave_file_sample_rate = f.getframerate()
        num_channels = f.getnchannels()
        assert f.getsampwidth() == 2, f.getsampwidth()  # width is in bytes
        num_samples = f.getnframes()
        samples = f.readframes(num_samples)
        samples_int16 = np.frombuffer(samples, dtype=np.int16)
        samples_int16 = samples_int16.reshape(-1, num_channels)[:, 0]
        samples_float32 = samples_int16.astype(np.float32)
        samples_float32 = samples_float32 / 32768

    # simulate streaming
    chunk_size = int(0.1 * wave_file_sample_rate)  # 0.1 seconds
    start = 0
    while start < samples_float32.shape[0]:
        end = min(start + chunk_size, samples_float32.shape[0])
        recognizer.accept_waveform(wave_file_sample_rate, samples_float32[start:end])
        start = end

        text = recognizer.text
        if text:
            print(text)

        # simulate streaming by sleeping
        time.sleep(0.1)

    tail_paddings = np.zeros(int(wave_file_sample_rate * 0.5), dtype=np.float32)
    recognizer.accept_waveform(wave_file_sample_rate, tail_paddings)
    recognizer.input_finished()

    text = recognizer.text
    if text:
        print(text)


if __name__ == "__main__":
    main()
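Note the 0.5 s of zero "tail padding" fed in before input_finished(): a streaming model needs some right context, so the padding flushes the last words through the decoder before the final result is read.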
Alternatively, use ffmpeg to decode local mp4/wav files offline, or to read an RTSP stream over the network. I put this together myself and recommend the following code:
import subprocess

import numpy as np
import sherpa_ncnn


def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    # base_file = "sherpa-ncnn-conv-emformer-transducer-2022-12-06"
    # base_file = "sherpa-ncnn-lstm-transducer-small-2023-02-13"
    base_file = r"D:\llm\sherpa*******mples\sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13"
    # base_file = "sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16"
    # base_file = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17"
    recognizer = sherpa_ncnn.Recognizer(
        tokens="{}\\tokens.txt".format(base_file),
        encoder_param="{}\\encoder_jit_trace-pnnx.ncnn.param".format(base_file),
        encoder_bin="{}\\encoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        decoder_param="{}\\decoder_jit_trace-pnnx.ncnn.param".format(base_file),
        decoder_bin="{}\\decoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        joiner_param="{}\\joiner_jit_trace-pnnx.ncnn.param".format(base_file),
        joiner_bin="{}\\joiner_jit_trace-pnnx.ncnn.bin".format(base_file),
        num_threads=4,
    )
    return recognizer


print("Started! Please speak")
recognizer = create_recognizer()

# URL of the audio source (a local wav/mp4 file or a remote RTSP stream)
# url = "your_rtsp_url"
# url = r'D:\sound\0.wav'
url = r'D:\sound\222.mp4'

# FFmpeg command: decode the input to 16 kHz, mono, 16-bit signed PCM on stdout
ffmpeg_cmd = [
    "ffmpeg",
    "-i", url,
    "-f", "s16le",
    "-acodec", "pcm_s16le",
    "-ar", "16000",
    "-ac", "1",
    "-",
]

# Start the FFmpeg process
process = subprocess.Popen(
    ffmpeg_cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    bufsize=1600,
)

# Sample rate, channel count, and number of samples per read
sample_rate = 16000
channels = 1
frames_per_read = 1600

last_result = ""
i = 0

# Read and process the audio data
while True:
    # Read raw PCM from the FFmpeg process (2 bytes per 16-bit sample)
    data = process.stdout.read(frames_per_read * channels * 2)
    if not data:
        break
    # Convert the bytes to a numpy array and normalize to [-1, 1]
    samples = np.frombuffer(data, dtype=np.int16)
    samples = samples.astype(np.float32)
    samples /= 32768.0

    # Feed the audio into the recognizer and print only the new text
    recognizer.accept_waveform(sample_rate, samples)
    result = recognizer.text
    if last_result != result:
        if i == 0:
            print("{}".format(result), end="")
            last_result = result
            i = i + 1
        else:
            new_word = result[len(last_result):]
            print("{}".format(new_word), end="", flush=True)
            last_result = result

# Close the FFmpeg process
process.stdout.close()
process.terminate()
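Two design notes: frames_per_read = 1600 samples at 16 kHz is 100 ms per read, matching the 0.1 s chunking used in the streaming examples above; and stderr is redirected to DEVNULL so FFmpeg's log output does not interleave with the printed transcript.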
Reading the local microphone in real time with ffmpeg
import subprocess

import numpy as np
import sherpa_ncnn


def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    # base_file = "sherpa-ncnn-conv-emformer-transducer-2022-12-06"
    # base_file = "sherpa-ncnn-lstm-transducer-small-2023-02-13"
    base_file = r"D:\llm\sherpa-ncnn-master\sherpa-ncnn-master\python-api-examples\sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13"
    # base_file = "sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16"
    # base_file = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17"
    recognizer = sherpa_ncnn.Recognizer(
        tokens="{}\\tokens.txt".format(base_file),
        encoder_param="{}\\encoder_jit_trace-pnnx.ncnn.param".format(base_file),
        encoder_bin="{}\\encoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        decoder_param="{}\\decoder_jit_trace-pnnx.ncnn.param".format(base_file),
        decoder_bin="{}\\decoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        joiner_param="{}\\joiner_jit_trace-pnnx.ncnn.param".format(base_file),
        joiner_bin="{}\\joiner_jit_trace-pnnx.ncnn.bin".format(base_file),
        num_threads=4,
    )
    return recognizer


print("Started! Please speak")
recognizer = create_recognizer()

# To read a file or RTSP stream instead, use the "-i url" ffmpeg_cmd from the
# previous script, e.g.:
# url = "rtsp://admin:jc123456@192.168.63.88/Streaming/Channels/2?tcp"

# FFmpeg command: capture the microphone via DirectShow (dshow) on Windows and
# decode it to 16 kHz, mono, 16-bit signed PCM on stdout
ffmpeg_cmd = [
    "ffmpeg",
    "-f", "dshow",  # DirectShow audio capture on Windows
    "-i", "audio=麦克风阵列 (适用于数字麦克风的英特尔® 智音技术)",  # your microphone's device name
    "-f", "s16le",
    "-acodec", "pcm_s16le",
    "-ar", "16000",
    "-ac", "1",
    "-",
]

# Start the FFmpeg process
process = subprocess.Popen(
    ffmpeg_cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    bufsize=1600,
)

# Sample rate, channel count, and number of samples per read
sample_rate = 16000
channels = 1
frames_per_read = 1600

last_result = ""
i = 0

# Read and process the audio data
while True:
    # Read raw PCM from the FFmpeg process (2 bytes per 16-bit sample)
    data = process.stdout.read(frames_per_read * channels * 2)
    if not data:
        break
    # Convert the bytes to a numpy array and normalize to [-1, 1]
    samples = np.frombuffer(data, dtype=np.int16)
    samples = samples.astype(np.float32)
    samples /= 32768.0

    # Feed the audio into the recognizer and print only the new text
    recognizer.accept_waveform(sample_rate, samples)
    result = recognizer.text
    if last_result != result:
        if i == 0:
            print("{}".format(result), end="")
            last_result = result
            i = i + 1
        else:
            new_word = result[len(last_result):]
            print("{}".format(new_word), end="", flush=True)
            last_result = result

# Close the FFmpeg process
process.stdout.close()
process.terminate()
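To find the exact dshow device name to pass after "audio=", list the available capture devices first:
ffmpeg -list_devices true -f dshow -i dummy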