Problem Addressed
Current speech language models (SLMs) enhance spoken-dialogue capability, but they are confined to turn-based conversation and lack the ability to interact with users in real-time settings, for example being interrupted when the generated content is unsatisfactory. This paper therefore applies full duplex modeling (FDM) to interactive speech language models (iSLM), aiming to enhance real-time interactivity and, specifically, to explore the essence of the interruption capability.
We introduce a novel model design, namely the listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies (early fusion, middle fusion, and late fusion) are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions.
The proposed LSLM uses a token-based decoder-only TTS to model the ability to speak and a streaming self-supervised learning (SSL) encoder to model the ability to listen.
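To make the two-channel design concrete, here is a minimal sketch, assuming a generic streaming SSL feature extractor (768-dim features) for the listening channel and codec-style discrete tokens for the speaking channel. The module names, dimensions, and the simple additive fusion are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class LSLMSketch(nn.Module):
    """Minimal sketch of a listening-while-speaking decoder (assumed design)."""
    def __init__(self, vocab_size=1024, ssl_dim=768, d_model=512, n_layers=6):
        super().__init__()
        self.speak_emb = nn.Embedding(vocab_size, d_model)  # speaking channel: TTS tokens
        self.listen_proj = nn.Linear(ssl_dim, d_model)      # listening channel: streaming SSL features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)          # next-token logits (turn-taking can be a special token)

    def forward(self, speak_tokens, listen_feats):
        # Fuse the two time-aligned channels step by step (simple additive fusion here;
        # the early/middle/late variants mentioned above are sketched later in these notes).
        h = self.speak_emb(speak_tokens) + self.listen_proj(listen_feats)
        T = h.size(1)
        causal = torch.triu(torch.ones(T, T, device=h.device, dtype=torch.bool), diagonal=1)
        return self.head(self.backbone(h, mask=causal))
```

At inference, the listening features for the current step would come from the streaming encoder in real time, and generation stops or yields the turn when the model emits the turn-taking signal.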
LLMs have facilitated a paradigm shift from simplex models to half-duplex models, also known as turn-based models, as shown in Figure 1(C). Prominent models include SpeechGPT [48], LauraGPT [5], and VioLA [42]. While these half-duplex models can both listen and speak, they are constrained to performing only one action at a given instant and thus fail to address the turn-taking problem.
Simplex and half-duplex:
In simplex and half-duplex modeling, the training loss $\mathcal{L}(\theta)$ is the standard autoregressive objective
$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P(r_t \mid R_{1:t-1}, C; \theta),$$
where $R_{1:t-1} = [r_1, r_2, \ldots, r_{t-1}]$, $C$ is the context, and $T$ is the sequence length. During the inference phase, the model can only predict the next token autoregressively based on the previous output within the current channel, without information from other channels.
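For concreteness, a PyTorch sketch of this loss follows, assuming a decoder-only model over a single channel with the context $C$ prepended; the function and tensor names are mine, not the paper's:

```python
import torch
import torch.nn.functional as F

def halfduplex_nll(model, context, response):
    # context: (B, Tc) token ids for C; response: (B, T) token ids for R.
    # Logits at step t condition only on C and R_{1:t-1} (the model's own channel),
    # matching L(theta) = -sum_t log P(r_t | R_{1:t-1}, C; theta).
    inp = torch.cat([context, response[:, :-1]], dim=1)
    logits = model(inp)                          # (B, Tc + T - 1, vocab) -- assumed output shape
    logits = logits[:, context.size(1) - 1:, :]  # the positions that predict r_1 .. r_T
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), response.reshape(-1))
```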
Full duplex
In modeling a full duplex spoken dialogue system within an autoregressive language model, the model needs to predict the next token $r_t$ in the response $R$ not only based on the context $C$ and the generated response history $R_{1:t-1} = [r_1, r_2, \ldots, r_{t-1}]$ in the current channel, but also by simultaneously utilizing information $S_{1:t-1} = [s_1, s_2, \ldots, s_{t-1}]$ from the other channel. The training loss $\mathcal{L}(\theta)$ is now formulated as
$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P(r_t \mid R_{1:t-1}, S_{1:t-1}, C; \theta).$$
A key point in FDM is that the sequence $S$ is produced in real time and unpredictably.
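Continuing the sketch above, the full duplex loss only changes what each step may condition on: the listening channel $S_{1:t-1}$ is fed alongside the response history. The time alignment of SSL features with response tokens is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

def fullduplex_nll(model, context, response, listen_feats):
    # listen_feats: (B, T, D) streaming SSL features s_1..s_T, time-aligned with r_1..r_T.
    # Shifting by one step means the position predicting r_t sees s_{1:t-1}, never s_t,
    # which also reflects that S arrives in real time and cannot be looked ahead.
    inp = torch.cat([context, response[:, :-1]], dim=1)
    pad = listen_feats.new_zeros(inp.size(0), context.size(1), listen_feats.size(-1))
    s = torch.cat([pad, listen_feats[:, :-1]], dim=1)  # no listening signal over the context span
    logits = model(inp, s)                             # e.g. the two-channel sketch model above
    logits = logits[:, context.size(1) - 1:, :]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), response.reshape(-1))
```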
LSLM's speaking ability, listening ability, and the fusion methods that integrate these two abilities
The core difference between LSLM and previous speech language models lies in its capability to speak and listen simultaneously. We first introduce the speaking capability of LSLM, followed by its listening capability, and finally we discuss various fusion methods that integrate these capabilities, endowing LSLM with full duplex ability.
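On the fusion side, a hedged sketch of where the listening channel can enter the decoder; the names of the blocks and projections are illustrative, and the exact merge operation (simple addition here) is an assumption:

```python
def decode_step(blocks, to_logits, listen_to_logits, speak_emb, listen_emb, mode="middle"):
    # blocks: list of decoder layers; speak_emb / listen_emb: per-step channel embeddings.
    h = speak_emb + listen_emb if mode == "early" else speak_emb  # early: fuse at the input embedding
    for block in blocks:
        if mode == "middle":
            h = h + listen_emb            # middle: re-inject the listening channel at every layer
        h = block(h)
    logits = to_logits(h)
    if mode == "late":
        logits = logits + listen_to_logits(listen_emb)  # late: merge only at the output logits
    return logits
```

Middle fusion gives every layer fresh access to the listening stream while the input embedding still specializes in speech generation, which is consistent with the reported balance between generation quality and real-time interaction.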
Index Terms: Full Duplex Modeling, Interactive Speech Language Model
Related Work
This paradigm involves encoding the speech signal into discrete tokens or continuous embeddings, modeling them with a language model, and decoding the speech tokens or embeddings back to the speech signal (a minimal sketch of this paradigm appears after this passage). Some studies [19, 17, 26] utilize this paradigm for speech continuation, generating expressive speech and natural multi-round dialogue:
[26] Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, et al. Generative spoken dialogue language modeling. Proc. TACL, 2023.
Other research applies this paradigm to task-specific applications, such as decoder-only high-fidelity TTS [40, 3, 31, 13] and decoder-only streaming ASR [33, 38, 4, 8]. Moreover, SpeechGPT [48] and LauraGPT [5] initialize SLMs from LLMs, expanding speech tokens into the LLM vocabulary and continuing training on speech.
[33] Frank Seide, Morrie Doulaty, Yangyang Shi, Yashesh Gaur, Junteng Jia, and Chunyang Wu. Speech ReaLLM–real-time streaming speech recognition with multimodal LLMs by teaching the flow of time. arXiv preprint arXiv:2406.09569, 2024.
[38] Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, and Shinji Watanabe. Decoder-only architecture for streaming end-to-end speech recognition. Proc. Interspeech, 2024.
[4] Peikun Chen, Sining Sun, Changhao Shan, Qing Yang, and Lei Xie. Streaming decoder-only automatic speech recognition with discrete speech units: A pilot study. Proc. Interspeech, 2024.
[8] Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, and Boris Ginsburg. BESTOW: Efficient and streamable speech language model with the best of two worlds in GPT and T5. arXiv preprint arXiv:2406.19954, 2024.
Despite these advances, all these models are limited to turn-based conversations and cannot handle real-time sound or interruptions, limiting their applicability in real-life scenarios.
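As referenced above, a minimal sketch of the encode, model, decode paradigm; `codec.encode` and `codec.decode` are assumed interfaces standing in for any neural audio codec or SSL tokenizer, not a specific library API:

```python
import torch

def speech_lm_continuation(waveform, codec, lm, max_new=200):
    # 1) Encode speech into a sequence of discrete tokens.
    tokens = codec.encode(waveform)                 # (1, T) token ids -- assumed interface
    # 2) Continue the sequence with a decoder-only language model (greedy decoding here).
    for _ in range(max_new):
        nxt = lm(tokens)[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, nxt], dim=1)
    # 3) Decode the extended token sequence back to a waveform.
    return codec.decode(tokens)                     # assumed interface
```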
In this work, we focus on investigating Full Duplex Modeling (FDM) in interactive Speech Language Models (iSLM), a crucial topic affecting user experience.
Lin et al. [22] propose to process real-time audio input with a separate comprehension module. Other works [49, 41] suggest modifying the order in which text tokens are organized in the LLM to tackle the duplex modeling problem. All these models are based on text-centric LLMs that require external ASR and TTS modules for spoken dialogue. As a result, latency remains perceivable, and paralinguistic ability is still lacking. We believe that FDM should be an intrinsic capability of SLMs, enabling simultaneous listening and speaking.