
Speech


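⚠️ All summaries below were generated by a large language model. They may contain errors and are for reference only; use them with caution.
🔴 Note: do not use them in serious academic settings. They are only suitable for initial screening before reading a paper.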

Updated 2025-02-21

VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation

Authors:Wei Zhao, Pengxiang Ding, Min Zhang, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang

Vision-language-action models (VLAs) have become increasingly popular in robot manipulation for their end-to-end design and remarkable performance. However, existing VLAs rely heavily on vision-language models (VLMs) that only support text-based instructions, neglecting the more natural speech modality for human-robot interaction. Traditional speech integration methods usually involve a separate speech recognition system, which complicates the model and introduces error propagation. Moreover, the transcription procedure would lose non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. To overcome these challenges, we propose VLAS, a novel end-to-end VLA that integrates speech recognition directly into the robot policy model. VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. We also present two new datasets, SQA and CSI, to support a three-stage tuning process for speech instructions, which empowers VLAS with the ability of multimodal interaction across text, image, speech, and robot actions. Taking a step further, a voice retrieval-augmented generation (RAG) paradigm is designed to enable our model to effectively handle tasks that require individual-specific knowledge. Our extensive experiments show that VLAS can effectively accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.


Paper and project links

PDF Accepted as a conference paper at ICLR 2025

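Summary
VLAS is a novel vision-language-action (VLA) model that integrates speech recognition directly into the robot policy model, so that the robot understands spoken instructions through internal speech-text alignment and carries out the corresponding task. It avoids the added complexity and error propagation of traditional speech integration, which relies on a separate speech recognition system, and it retains non-semantic information in the raw speech, such as voiceprint, which may be crucial for customized tasks. The authors also present two new datasets, SQA and CSI, to support a three-stage tuning process for speech instructions, giving VLAS multimodal interaction across text, image, speech, and robot actions. In addition, a voice retrieval-augmented generation (RAG) paradigm lets the model handle tasks that require individual-specific knowledge. Experiments show that VLAS can accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.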

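Key Takeaways

  1. VLAS integrates speech recognition directly into the robot policy model, giving end-to-end understanding of spoken instructions.
  2. It avoids the added complexity and error propagation of a separate speech recognition system.
  3. It retains non-semantic information in the raw speech, such as voiceprint, which may be crucial for customized tasks.
  4. The new SQA and CSI datasets support a three-stage tuning process for speech instructions and strengthen VLAS's multimodal interaction ability.
  5. A voice RAG paradigm lets VLAS handle tasks that require individual-specific knowledge.
  6. VLAS accomplishes robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.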
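
The voice RAG idea can be pictured with a small retrieval sketch. This is not the authors' implementation: the store layout, the three-dimensional voiceprint vectors, the `threshold` value, and the names `retrieve` and `build_prompt` are all assumptions made for illustration, and a text instruction stands in for the spoken one. The sketch matches a speaker's voiceprint embedding against stored profiles by cosine similarity and prepends that speaker's personal facts to the instruction.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical per-user store: voiceprint embedding plus personal knowledge.
KNOWLEDGE = [
    {"user": "alice", "voiceprint": [0.9, 0.1, 0.0],
     "facts": ["Alice's cup is the red one."]},
    {"user": "bob", "voiceprint": [0.1, 0.9, 0.2],
     "facts": ["Bob's cup is the blue one."]},
]

def retrieve(voiceprint, store=KNOWLEDGE, threshold=0.7):
    """Return the best-matching profile, or None if no profile is close enough."""
    best = max(store, key=lambda e: cosine(voiceprint, e["voiceprint"]))
    if cosine(voiceprint, best["voiceprint"]) < threshold:
        return None
    return best

def build_prompt(instruction, voiceprint):
    """Prepend the retrieved speaker-specific facts to the instruction."""
    entry = retrieve(voiceprint)
    context = " ".join(entry["facts"]) if entry else ""
    return f"{context} Instruction: {instruction}".strip()

print(build_prompt("pick up my cup", [0.8, 0.2, 0.1]))
```

With an unrecognized voiceprint the sketch falls back to the bare instruction, which is one plausible way to avoid applying the wrong person's preferences.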


FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems

Authors:Borui Liao, Yulong Xu, Jiao Ou, Kaiyuan Yang, Weihua Jian, Pengfei Wan, Di Zhang

Full-Duplex Speech Dialogue Systems (Full-Duplex SDS) have significantly enhanced the naturalness of human-machine interaction by enabling real-time bidirectional communication. However, existing approaches face challenges such as difficulties in independent module optimization and contextual noise interference due to highly coupled architectural designs and oversimplified binary state modeling. This paper proposes FlexDuo, a flexible full-duplex control module that decouples duplex control from spoken dialogue systems through a plug-and-play architectural design. Furthermore, inspired by human information-filtering mechanisms in conversations, we introduce an explicit Idle state. On one hand, the Idle state filters redundant noise and irrelevant audio to enhance dialogue quality. On the other hand, it establishes a semantic integrity-based buffering mechanism, reducing the risk of mutual interruptions while ensuring accurate response transitions. Experimental results on the Fisher corpus demonstrate that FlexDuo reduces the false interruption rate by 24.9% and improves response accuracy by 7.6% compared to integrated full-duplex dialogue system baselines. It also outperforms voice activity detection (VAD) controlled baseline systems in both Chinese and English dialogue quality. The proposed modular architecture and state-based dialogue model provide a novel technical pathway for building flexible and efficient duplex dialogue systems.


Paper and project links

PDF

Summary

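FlexDuo is a flexible full-duplex control module that decouples duplex control from the spoken dialogue system through a plug-and-play design. By introducing an explicit Idle state, it filters redundant noise and irrelevant audio and reduces the risk of mutual interruptions while ensuring accurate response transitions. Experiments on the Fisher corpus show that FlexDuo reduces the false interruption rate by 24.9% and improves response accuracy by 7.6% compared to integrated full-duplex dialogue system baselines.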

Key Takeaways

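  1. FlexDuo adds an explicit Idle state to full-duplex control, going beyond binary state modeling.
  2. The Idle state filters redundant noise and irrelevant audio, improving dialogue quality.
  3. The plug-and-play design decouples duplex control from the dialogue system, so modules can be optimized independently.
  4. A buffering mechanism based on semantic integrity reduces the risk of mutual interruptions while ensuring accurate response transitions.
  5. On the Fisher corpus, FlexDuo reduces the false interruption rate by 24.9% and improves response accuracy by 7.6% compared to integrated full-duplex dialogue system baselines.
  6. FlexDuo also outperforms voice activity detection (VAD) controlled baselines in both Chinese and English dialogue quality.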
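
A three-state controller of this kind can be sketched as a small state machine. The abstract names only the Idle state; the `LISTEN` and `SPEAK` state names, the per-chunk signals `relevant`, `complete`, and `system_done`, and the transition rules below are all assumptions for illustration, not FlexDuo's actual policy.

```python
from enum import Enum

class State(Enum):
    IDLE = "idle"      # ignore audio that is noise or not addressed to the system
    LISTEN = "listen"  # buffer the user's turn until it is semantically complete
    SPEAK = "speak"    # the dialogue system is responding

def step(state, relevant, complete, system_done=False):
    """One control decision per audio chunk (hypothetical transition rules)."""
    if state is State.IDLE:
        # Irrelevant audio is filtered: stay idle until relevant speech arrives.
        return State.LISTEN if relevant else State.IDLE
    if state is State.LISTEN:
        # Respond only once the buffered utterance is semantically complete.
        return State.SPEAK if complete else State.LISTEN
    # state is SPEAK: only relevant user speech may interrupt the response.
    if relevant:
        return State.LISTEN
    return State.IDLE if system_done else State.SPEAK

def run(events, state=State.IDLE):
    """Apply a sequence of chunk-level signals and return the visited states."""
    trace = []
    for ev in events:
        state = step(state, **ev)
        trace.append(state)
    return trace

events = [
    {"relevant": False, "complete": False},  # background noise
    {"relevant": True, "complete": False},   # user starts talking
    {"relevant": True, "complete": True},    # user finishes the request
    {"relevant": False, "complete": False},  # noise while the system speaks
    {"relevant": False, "complete": False, "system_done": True},
]
print([s.value for s in run(events)])
```

Because `step` is a pure function of the current state and the chunk signals, a controller like this can sit in front of any dialogue system, which is the spirit of the plug-and-play design described above.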


Adopting Whisper for Confidence Estimation

Authors:Vaibhav Aggarwal, Shabari S Nair, Yash Verma, Yash Jogi

Recent research on word-level confidence estimation for speech recognition systems has primarily focused on lightweight models known as Confidence Estimation Modules (CEMs), which rely on hand-engineered features derived from Automatic Speech Recognition (ASR) outputs. In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidence scores. Specifically, we introduce a method in which the Whisper model is fine-tuned to produce scalar confidence scores given an audio input and its corresponding hypothesis transcript. Our experiments demonstrate that the fine-tuned Whisper-tiny model, comparable in size to a strong CEM baseline, achieves similar performance on the in-domain dataset and surpasses the CEM baseline on eight out-of-domain datasets, whereas the fine-tuned Whisper-large model consistently outperforms the CEM baseline by a substantial margin across all datasets.


Paper and project links

PDF Accepted at IEEE ICASSP 2025

Summary

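Recent work on word-level confidence estimation for speech recognition has focused on lightweight Confidence Estimation Modules (CEMs), which rely on hand-engineered features derived from Automatic Speech Recognition (ASR) outputs. This paper instead proposes an end-to-end approach that uses the ASR model itself (Whisper) to generate word-level confidence scores: Whisper is fine-tuned to produce scalar confidence scores given an audio input and its corresponding hypothesis transcript. The fine-tuned Whisper-tiny model, comparable in size to a strong CEM baseline, performs similarly on the in-domain dataset and surpasses the baseline on eight out-of-domain datasets, while the fine-tuned Whisper-large model outperforms the CEM baseline by a substantial margin across all datasets.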

Key Takeaways

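  1. The paper addresses word-level confidence estimation for speech recognition systems.
  2. Existing methods mainly use lightweight Confidence Estimation Modules (CEMs) built on hand-engineered features.
  3. The paper proposes an end-to-end approach that uses the ASR model itself (Whisper) for word-level confidence estimation.
  4. Whisper is fine-tuned to produce scalar confidence scores given an audio input and its hypothesis transcript.
  5. The fine-tuned Whisper-tiny model matches the CEM baseline on the in-domain dataset and surpasses it on eight out-of-domain datasets.
  6. The fine-tuned Whisper-large model outperforms the CEM baseline by a substantial margin across all datasets.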
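
Fine-tuning a model to emit word-level confidence needs a per-word training target. The abstract does not say how the targets are built; a common setup, assumed here, is to align each hypothesis transcript to its reference by edit distance and label every hypothesis word 1 if it matches and 0 otherwise. The function name `confidence_targets` is hypothetical.

```python
def confidence_targets(hyp, ref):
    """Label each hypothesis word 1 (correct) or 0 (substituted or inserted)."""
    h, r = hyp.split(), ref.split()
    n, m = len(h), len(r)
    # dp[i][j] = word-level edit distance between h[:i] and r[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match or substitution
                           dp[i - 1][j] + 1,         # extra hypothesis word
                           dp[i][j - 1] + 1)         # missing reference word
    # Backtrace one optimal alignment to recover the per-word labels.
    labels = [0] * n
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if h[i - 1] == r[j - 1] else 1
        if dp[i][j] == dp[i - 1][j - 1] + cost:
            labels[i - 1] = 1 - cost
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1  # inserted hypothesis word keeps label 0
        else:
            j -= 1  # reference word absent from the hypothesis
    return list(zip(h, labels))

print(confidence_targets("the bat sat", "the cat sat"))
```

Words missing from the hypothesis get no label in this scheme, since a confidence score is attached only to words the recognizer actually produced.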


A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond

Authors:Shreya Shukla, Jose Torres, Abhijit Mishra, Jacek Gwizdka, Shounak Roychowdhury

Integration of Brain-Computer Interfaces (BCIs) and Generative Artificial Intelligence (GenAI) has opened new frontiers in brain signal decoding, enabling assistive communication, neural representation learning, and multimodal integration. BCIs, particularly those leveraging Electroencephalography (EEG), provide a non-invasive means of translating neural activity into meaningful outputs. Recent advances in deep learning, including Generative Adversarial Networks (GANs) and Transformer-based Large Language Models (LLMs), have significantly improved EEG-based generation of images, text, and speech. This paper provides a literature review of the state-of-the-art in EEG-based multimodal generation, focusing on (i) EEG-to-image generation through GANs, Variational Autoencoders (VAEs), and Diffusion Models, and (ii) EEG-to-text generation leveraging Transformer-based language models and contrastive learning methods. Additionally, we discuss the emerging domain of EEG-to-speech synthesis, an evolving multimodal frontier. We highlight key datasets, use cases, challenges, and EEG feature encoding methods that underpin generative approaches. By providing a structured overview of EEG-based generative AI, this survey aims to equip researchers and practitioners with insights to advance neural decoding, enhance assistive technologies, and expand the frontiers of brain-computer interaction.


Paper and project links

PDF

Summary
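The integration of Brain-Computer Interfaces (BCIs) and Generative AI (GenAI) has opened new ground in brain signal decoding, supporting assistive communication, neural representation learning, and multimodal integration. EEG-based BCIs offer a non-invasive way to translate neural activity into meaningful outputs. Recent advances in deep learning, including Generative Adversarial Networks (GANs) and Transformer-based Large Language Models (LLMs), have significantly improved EEG-based generation of images, text, and speech. This survey reviews the state of the art in EEG-based multimodal generation, focusing on EEG-to-image and EEG-to-text generation, and discusses the emerging area of EEG-to-speech synthesis. It aims to give researchers and practitioners a structured overview of EEG-based generative AI in order to advance neural decoding, enhance assistive technologies, and expand the frontiers of brain-computer interaction.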

Key Takeaways

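  1. Combining BCIs with GenAI opens new frontiers in brain signal decoding.
  2. EEG provides a non-invasive means of translating neural activity into meaningful outputs.
  3. Advances in deep learning have significantly improved EEG-based generation of images, text, and speech.
  4. EEG-to-image generation is reviewed through GANs, VAEs, and Diffusion Models.
  5. EEG-to-text generation leverages Transformer-based language models and contrastive learning methods.
  6. EEG-to-speech synthesis is an emerging, still-evolving multimodal frontier.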



Author: Kedreamix
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.