LSLM paper

Problem addressed

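Current speech language models (SLMs) have improved spoken dialogue, but they are limited to turn-based conversation and lack the ability to interact with the user in real time, for example being interrupted when the generated speech is unsatisfactory. This paper therefore explores full duplex modeling (FDM) in interactive speech language models (iSLM), aiming to enhance real-time interaction and, more specifically, to explore the essential ability of interruption.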

We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies (early fusion, middle fusion, and late fusion) are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM’s robustness to noise and sensitivity to diverse instructions.

The proposed LSLM uses a token-based decoder-only TTS to model the ability to speak and a streaming self-supervised learning (SSL) encoder to model the ability to listen.

LLMs have facilitated a paradigm shift from simplex models to half-duplex models, also known as turn-based models, as shown in Figure 1(C). Prominent models include SpeechGPT [48], LauraGPT [5], and VioLA [42]. While these half-duplex models can both listen and speak, they are constrained to performing only one action at any given instant, thus failing to address the turn-taking problem.

Simplex and half duplex:
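The training loss for this case did not survive in these notes. A reconstruction consistent with the surrounding definitions (response tokens r_t, response history R_{1:t-1}, sequence length T, and the context C that the full duplex passage refers to) would be the standard autoregressive negative log-likelihood. It is offered as an assumption, not as a quotation from the paper:

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P\left(r_t \mid R_{1:t-1}, C; \theta\right)
```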

Here $R_{1:t-1} = [r_1, r_2, \ldots, r_{t-1}]$ is the response history and $T$ is the sequence length. During the inference phase, the model can only predict the next token autoregressively based on the previous output within the current channel, without information from other channels.

Full duplex

In modeling a full duplex spoken dialogue system within an autoregressive language model, the model needs to predict the next token $r_t$ in the response $R$ not only based on the context $C$ and the generated response history $R_{1:t-1} = [r_1, r_2, \ldots, r_{t-1}]$ in the current channel, but also by utilizing information $S_{1:t-1} = [s_1, s_2, \ldots, s_{t-1}]$ from another channel simultaneously. The training loss $\mathcal{L}(\theta)$ is now formulated as:
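The equation itself is missing from these notes. A reconstruction that follows directly from the sentence above, conditioning each token on the context C, the response history R_{1:t-1}, and the other channel's history S_{1:t-1}, is given here as an assumption rather than a quotation:

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P\left(r_t \mid R_{1:t-1}, S_{1:t-1}, C; \theta\right)
```

The only change from the single-channel case is the extra conditioning term $S_{1:t-1}$.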

A key point in FDM is that the sequence S is produced in real time and unpredictably. 
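To make the real-time constraint concrete, here is a minimal Python sketch of a full duplex decoding loop. It is a toy under stated assumptions, not the paper's implementation: `duplex_decode`, `toy_model`, the `IRQ` stop token, and the "interrupt sound" value 9 are all hypothetical names introduced for illustration. The point it shows is that at step t the model sees only S_{1:t-1}, because s_t has not been heard yet.

```python
from typing import Callable, Iterator, List

IRQ = -1  # hypothetical "stop speaking" token, not a symbol defined in these notes


def duplex_decode(
    next_token: Callable[[List[int], List[int], List[int]], int],
    context: List[int],
    listen_stream: Iterator[int],
    max_len: int,
) -> List[int]:
    """Generate r_1..r_T while consuming one listening token per step."""
    response: List[int] = []  # R_{1:t-1}, the speaking channel so far
    heard: List[int] = []     # S_{1:t-1}, the listening channel so far
    for _ in range(max_len):
        # Predict r_t from the context and both channel histories.
        r_t = next_token(context, response, heard)
        if r_t == IRQ:
            break  # the model decided to yield the turn
        response.append(r_t)
        # S arrives in real time: s_t becomes available only after step t.
        # A silent frame (0) is assumed once the stream is exhausted.
        heard.append(next(listen_stream, 0))
    return response


def toy_model(context: List[int], response: List[int], heard: List[int]) -> int:
    """Hypothetical stand-in for the model: counts upward, stops on sound 9."""
    if heard and heard[-1] == 9:
        return IRQ
    return len(response) + 1


if __name__ == "__main__":
    print(duplex_decode(toy_model, [0], iter([0, 0, 9, 0]), max_len=10))
```

With the stream `[0, 0, 9, 0]`, the toy model speaks three tokens and stops on the fourth step, once the interrupting sound is in its listening history.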

LSLM's speaking capability, listening capability, and the fusion methods that integrate the two

The core difference between LSLM and previous speech language models lies in its capability to simultaneously speak and listen. We first introduce the speaking capability of LSLM, followed by its listening capability, and finally, we discuss various fusion methods that integrate these capabilities, endowing LSLM with full duplex ability.
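These notes name three fusion strategies (early, middle, late) without spelling out how they differ. The sketch below assumes the usual reading: early fusion adds the listening channel once at the input embeddings, middle fusion adds it to the input of every Transformer block, and late fusion combines the channels at the output just before the softmax. Everything in it is a hypothetical stand-in: the "blocks" are plain functions on small vectors, and the final vector stands in for the logits.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]  # toy stand-in for one Transformer block


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def early_fusion(speak: Vector, listen: Vector, blocks: List[Block]) -> Vector:
    h = add(speak, listen)  # fuse once, at the input embeddings
    for block in blocks:
        h = block(h)
    return h


def middle_fusion(speak: Vector, listen: Vector, blocks: List[Block]) -> Vector:
    h = speak
    for block in blocks:
        h = block(add(h, listen))  # re-inject the listening channel at every block
    return h


def late_fusion(speak: Vector, listen_logits: Vector, blocks: List[Block]) -> Vector:
    h = speak
    for block in blocks:
        h = block(h)
    return add(h, listen_logits)  # fuse only at the output, before the softmax


def double(v: Vector) -> Vector:
    """Toy block that doubles every component."""
    return [2 * x for x in v]


if __name__ == "__main__":
    speak, listen = [1.0, 0.0], [0.0, 1.0]
    for fuse in (early_fusion, middle_fusion, late_fusion):
        print(fuse.__name__, fuse(speak, listen, [double, double]))
```

The three functions differ only in where `add` is called, which is the whole design question: the earlier and more often the listening signal is injected, the more it can shape generation at every layer.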

Index Terms: Full Duplex Modeling, Interactive Speech Language Model

Related work

This paradigm involves encoding the speech signal into discrete tokens or continuous embeddings, modeling them with a language model, and decoding the speech tokens or embeddings back to the speech signal. Some studies [19, 17, 26] utilize this paradigm for speech continuation, generating expressive speech and natural multi-round dialogue.

[26] Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, et al. Generative spoken dialogue language modeling. Proc. TACL, 2023.

Other research applies this paradigm to task-specific applications, such as decoder-only high-fidelity TTS [40, 3, 31, 13] and decoder-only streaming ASR [33, 38, 4, 8]. Moreover, SpeechGPT [48] and LauraGPT [5] initialize SLMs using LLMs, expanding the LLM vocabulary with speech tokens and continuing training on speech.

[33] Frank Seide, Morrie Doulaty, Yangyang Shi, Yashesh Gaur, Junteng Jia, and Chunyang Wu. Speech ReaLLM–real-time streaming speech recognition with multimodal LLMs by teaching the flow of time. arXiv preprint arXiv:2406.09569, 2024.

[38] Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, and Shinji Watanabe. Decoder-only architecture for streaming end-to-end speech recognition. Proc. Interspeech, 2024.

[4] Peikun Chen, Sining Sun, Changhao Shan, Qing Yang, and Lei Xie. Streaming decoder-only automatic speech recognition with discrete speech units: A pilot study. Proc. Interspeech, 2024.

[8] Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C Puvvada, Nithin Rao Koluguri, Piotr Zelasko, Jagadeesh Balam, and Boris Ginsburg. BESTOW: Efficient and streamable speech language model with the best of two worlds in GPT and T5. arXiv preprint arXiv:2406.19954, 2024.

Despite these advances, all these models are limited to turn-based conversations and cannot handle real-time sound or interruptions, limiting their applicability in real-life scenarios.

We focus on investigating Full Duplex Modeling (FDM) in interactive Speech Language Models (iSLM), a crucial topic affecting user experience.

Related work

Lin et al. [22] propose to process real-time audio input with a separate comprehension module. Other works [49, 41] suggest modifying the order in which text tokens are organized in the LLM to tackle the duplex modeling problem. All these models are based on text-centric LLMs that require external ASR and TTS modules for spoken dialogue. As a result, latency remains perceptible and paralinguistic ability is still lacking. We believe FDM should be an intrinsic capability of SLMs, enabling simultaneous listening and speaking.
