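An earlier post covered the iFlytek input method (讯飞输入法), which supports voice input from the microphone as well as from the computer's internal audio source. For details, see: [Real-Time Speech-to-Text] Real-Time Speech-to-Text on PC (Microphone & System Audio)
However, it is only a tool. If we want to build something fun of our own, such as controlling the computer by voice to automate tasks, we first have to capture speech and convert it to text, then parse that text to drive the platform. That requires programmatic access to the recognized text, which the iFlytek input method cannot give us, so we need an API. After comparing speech APIs from several major vendors in China and abroad, I found Google's API to be the "smartest", the one that best "understands human speech".
Note: because this uses Google's API, you need a network environment that can reach Google.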
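Preparation
Official documentation: Cloud Speech-to-Text > Documentation > Before you begin
Just follow the official documentation step by step. The process in brief:
- Set up a Google Cloud project
- Make sure a billing account is linked to the project
- Enable the Speech-to-Text API
- Create a new service account
- Create a JSON key
- Set the authentication environment variable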
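Before calling the API, it can help to confirm that the last two steps actually took effect. The following is a minimal sketch, not part of the official setup: `check_credentials` is a hypothetical helper name, and it only checks that the key file exists and has the fields a service-account key normally carries. It does not prove that the key is still valid on Google's side.

```python
import json
import os


def check_credentials(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return the service-account email if the key file looks usable.

    Raises ValueError with a readable message otherwise.
    """
    path = os.environ.get(env_var)
    if not path:
        raise ValueError(f"{env_var} is not set")
    if not os.path.isfile(path):
        raise ValueError(f"{env_var} points to a missing file: {path}")

    with open(path, encoding="utf-8") as f:
        key = json.load(f)

    # Sanity check only: these fields appear in service-account key files.
    required = ("type", "project_id", "private_key", "client_email")
    missing = [name for name in required if name not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError(f"unexpected key type: {key['type']}")
    return key["client_email"]
```

Calling `check_credentials()` at the top of a script turns a confusing authentication error later on into an immediate, readable one.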
Python example: audio file to text
Set up a Python environment and install the dependencies:
- google-cloud-speech==2.16.2
- pyaudio==0.2.12
- six==1.16.0
# Imports the Google Cloud client library
import os

from google.cloud import speech

if __name__ == "__main__":
    os.environ["http_proxy"] = "http://127.0.0.1:7890"
    os.environ["https_proxy"] = "http://127.0.0.1:7890"
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "xxxxx.json"

    # Instantiates a client
    client = speech.SpeechClient()

    # The name of the audio file to transcribe
    gcs_uri = "gs://cloud-samples-data/speech/brooklyn_bridge.raw"

    audio = speech.RecognitionAudio(uri=gcs_uri)

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    # Detects speech in the audio file
    response = client.recognize(config=config, audio=audio)

    for result in response.results:
        print("Transcript: {}".format(result.alternatives[0].transcript))
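Console output: the script prints one line per recognized result, in the form "Transcript: <recognized text>".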
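The example transcribes a hosted raw file whose format is already known. If you swap in your own recording, the `encoding` and `sample_rate_hertz` values in the config have to match the file. The sketch below assumes your recording is a PCM WAV file and uses only the standard-library `wave` module. `describe_wav` and `matches_example_config` are hypothetical helper names, not part of the Google client library.

```python
import wave


def describe_wav(path):
    """Read a WAV header and report the fields the recognition config depends on."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hertz": wav.getframerate(),
            "duration_seconds": wav.getnframes() / wav.getframerate(),
        }


def matches_example_config(info, sample_rate_hertz=16000):
    """True if the file is mono, 16-bit, and at the expected sample rate."""
    return (
        info["channels"] == 1
        and info["sample_width_bytes"] == 2  # 16-bit samples, i.e. LINEAR16
        and info["sample_rate_hertz"] == sample_rate_hertz
    )
```

If `matches_example_config(describe_wav("my_recording.wav"))` is false, either resample the audio or change the config values to match what `describe_wav` reports.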
Python example: microphone speech to text
Set up a Python environment and install the dependencies:
- google-cloud-speech==2.16.2
- pyaudio==0.2.12
- six==1.16.0
#!/usr/bin/env python
from __future__ import division

import os
import re
import sys

from google.cloud import speech
import pyaudio
from six.moves import queue

os.environ["http_proxy"] = "http://127.0.0.1:7890"
os.environ["https_proxy"] = "http://127.0.0.1:7890"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "xxxx.json"

# Audio recording parameters
RATE = 16000
CHUNK = int(RATE / 10)  # 100ms


class MicrophoneStream(object):
    """Opens a recording stream as a generator yielding the audio chunks."""

    def __init__(self, rate, chunk):
        self._rate = rate
        self._chunk = chunk

        # Create a thread-safe buffer of audio data
        self._buff = queue.Queue()
        self.closed = True

    def __enter__(self):
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            # The API currently only supports 1-channel (mono) audio
            # https://goo.gl/z757pE
            channels=1,
            rate=self._rate,
            input=True,
            frames_per_buffer=self._chunk,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread makes network requests, etc.
            stream_callback=self._fill_buffer,
        )

        self.closed = False

        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so that the client's
        # streaming_recognize method will not block the process termination.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
        """Continuously collect data from the audio stream, into the buffer."""
        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        while not self.closed:
            # Use a blocking get() to ensure there's at least one chunk of
            # data, and stop iteration if the chunk is None, indicating the
            # end of the audio stream.
            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]

            # Now consume whatever other data's still buffered.
            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break

            yield b"".join(data)


def listen_print_loop(responses):
    """Iterates through server responses and prints them.

    The responses passed is a generator that will block until a response
    is provided by the server.

    Each response may contain multiple results, and each result may contain
    multiple alternatives; for details, see https://goo.gl/tjCPAU. Here we
    print only the transcription for the top alternative of the top result.

    In this case, responses are provided for interim results as well. If the
    response is an interim one, print a carriage return at the end of it, to
    allow the next result to overwrite it, until the response is a final one.
    For the final one, print a newline to preserve the finalized transcription.
    """
    num_chars_printed = 0
    for response in responses:
        if not response.results:
            continue

        # The `results` list is consecutive. For streaming, we only care about
        # the first result being considered, since once it's `is_final`, it
        # moves on to considering the next utterance.
        result = response.results[0]
        if not result.alternatives:
            continue

        # Display the transcription of the top alternative.
        transcript = result.alternatives[0].transcript

        # Display interim results, but with a carriage return at the end of the
        # line, so subsequent lines will overwrite them.
        #
        # If the previous result was longer than this one, we need to print
        # some extra spaces to overwrite the previous result
        overwrite_chars = " " * (num_chars_printed - len(transcript))

        if not result.is_final:
            sys.stdout.write(transcript + overwrite_chars + "\r")
            sys.stdout.flush()

            num_chars_printed = len(transcript)

        else:
            print(transcript + overwrite_chars)

            # Exit recognition if any of the transcribed phrases could be
            # one of our keywords.
            if re.search(r"\b(exit|quit)\b", transcript, re.I):
                print("Exiting..")
                break

            num_chars_printed = 0


def main():
    # See http://g.co/cloud/speech/docs/languages
    # for a list of supported languages.
    language_code = "zh"  # a BCP-47 language tag

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code=language_code,
    )

    streaming_config = speech.StreamingRecognitionConfig(
        config=config, interim_results=True
    )

    with MicrophoneStream(RATE, CHUNK) as stream:
        audio_generator = stream.generator()
        requests = (
            speech.StreamingRecognizeRequest(audio_content=content)
            for content in audio_generator
        )

        responses = client.streaming_recognize(streaming_config, requests)

        # Now, put the transcription responses to use.
        listen_print_loop(responses)


if __name__ == "__main__":
    main()
Speech picked up by the microphone is converted to text and printed in real time. To process the results further, modify the listen_print_loop function.
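As one illustration of such processing, the voice-control idea from the introduction could start with a small phrase-to-action lookup. This is a sketch under stated assumptions: the `COMMANDS` table, its patterns, and the action names are all hypothetical placeholders, and actually performing an action is left out.

```python
import re

# Hypothetical phrase -> action table; patterns and action names are
# placeholders for illustration only.
COMMANDS = [
    (re.compile(r"打开浏览器|\bopen (the )?browser\b", re.I), "open_browser"),
    (re.compile(r"静音|\bmute\b", re.I), "mute"),
    (re.compile(r"截图|\btake a screenshot\b", re.I), "screenshot"),
]


def match_command(transcript):
    """Return the first action whose pattern occurs in the transcript, else None."""
    for pattern, action in COMMANDS:
        if pattern.search(transcript):
            return action
    return None
```

A natural place to call `match_command(transcript)` is the `is_final` branch of listen_print_loop, so that interim guesses, which may still change, do not trigger anything.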
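The code above is adapted from the official sample, with these changes:
- Set a proxy (in mainland China the http_proxy setting is required, otherwise the Google API is unreachable)
- Set the GOOGLE_APPLICATION_CREDENTIALS environment variable. Normally this is configured in the client machine's system settings; for testing it can be set directly in code. Its value is the path to the JSON key file from the preparation steps
- Set language_code to zh for Chinese. For the official list of supported languages, see: Speech-to-Text supported languages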
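Assigning these variables unconditionally in the script overrides whatever is already configured on the machine. A minimal sketch of a gentler variant follows, assuming you want system-level settings to win when present. `configure_environment` is a hypothetical helper, and the `environ` parameter exists only so the function can be tried against a plain dict.

```python
import os


def configure_environment(proxy_url, credentials_path, environ=os.environ):
    """Fill in proxy and credential variables without overriding existing ones."""
    for name in ("http_proxy", "https_proxy"):
        # setdefault leaves a value alone if the variable is already set.
        environ.setdefault(name, proxy_url)
    environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", credentials_path)

    names = ("http_proxy", "https_proxy", "GOOGLE_APPLICATION_CREDENTIALS")
    return {name: environ[name] for name in names}
```

For example, `configure_environment("http://127.0.0.1:7890", "xxxx.json")` reproduces the values used in this post on a machine where nothing is configured yet.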
Other official samples
- Google Cloud official samples
- Speech-to-Text samples
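Computer internal audio
The microphone input can likewise be pointed at the system audio source, so that video and speech playing on the computer are transcribed in real time; it would make a nice live-subtitle tool. For the exact steps, see [Real-Time Speech-to-Text] Real-Time Speech-to-Text on PC (Microphone & System Audio). It only takes a small settings change.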
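The referenced post changes the default recording device in the system settings. A different route, sketched here as an assumption rather than the method from that post, is to pick the capture device by name in code and hand its index to PyAudio. `pick_input_device` and `list_devices` are hypothetical helpers, and the device name to search for (for example "Stereo Mix") depends on the operating system and sound driver.

```python
def pick_input_device(devices, name_hint):
    """Return the index of the first input-capable device whose name contains name_hint."""
    hint = name_hint.lower()
    for info in devices:
        has_input = info.get("maxInputChannels", 0) > 0
        if has_input and hint in info.get("name", "").lower():
            return info["index"]
    return None


def list_devices():
    """Collect PyAudio's device-info dicts (imported lazily so the picker works without it)."""
    import pyaudio

    audio = pyaudio.PyAudio()
    try:
        count = audio.get_device_count()
        return [audio.get_device_info_by_index(i) for i in range(count)]
    finally:
        audio.terminate()
```

The index returned by `pick_input_device(list_devices(), "Stereo Mix")` could then be passed as `input_device_index=` in the `open(...)` call inside `MicrophoneStream.__enter__`. If it returns None, no matching input device exists and the system-settings approach from the referenced post is the fallback.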