
Talking Head Generation


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never use this for serious academic work; it is only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

2025-11-18 Update

Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition

Authors:Yiming Rong, Yixin Zhang, Ziyi Wang, Deyang Jiang, Yunlong Zhao, Haoran Wu, Shiyu Zhou, Bo Xu

Automatic speech recognition (ASR) systems have achieved remarkable performance in common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily due to constrained model context windows and the sparsity of relevant information within extensive contextual noise. To solve this, we propose the SAP$^{2}$ method, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Specifically, each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, enabling efficient compression of context embeddings while preserving speech-salient information. Experimental results demonstrate state-of-the-art performance of SAP$^{2}$ on the SlideSpeech and LibriSpeech datasets, achieving word error rates (WER) of 7.71% and 1.12%, respectively. On SlideSpeech, our method notably reduces biased keyword error rates (B-WER) by 41.1% compared to non-contextual baselines. SAP$^{2}$ also exhibits robust scalability, consistently maintaining performance under extensive contextual input conditions on both datasets.
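
To make the pooling idea concrete, below is a minimal sketch of what speech-driven attention-based pooling over context keyword embeddings could look like; the shapes, the single-head dot-product scoring, and the optional top-k pruning step are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's code): attention-based pooling that uses a
# speech representation as the query to compress context keyword embeddings.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def speech_driven_pooling(speech_feat, context_embs, top_k=None):
    """Compress context keyword embeddings into one vector, weighted by speech relevance.

    speech_feat:  (d,)   pooled acoustic representation of the current utterance
    context_embs: (n, d) embeddings of candidate context keywords
    top_k:        optionally keep only the k most speech-salient keywords (pruning)
    """
    d = speech_feat.shape[0]
    scores = context_embs @ speech_feat / np.sqrt(d)   # (n,) speech-keyword affinity
    if top_k is not None:                              # prune to the top-k keywords
        keep = np.argsort(scores)[-top_k:]
        context_embs, scores = context_embs[keep], scores[keep]
    weights = softmax(scores)                          # speech-driven attention weights
    pooled = weights @ context_embs                    # (d,) compressed context vector
    return pooled, weights

# Toy usage: 500 candidate keywords (e.g. mined from slides), 256-dim embeddings.
rng = np.random.default_rng(0)
speech = rng.normal(size=256)
keywords = rng.normal(size=(500, 256))
pooled, w = speech_driven_pooling(speech, keywords, top_k=50)
print(pooled.shape, w.shape)   # (256,) (50,)
```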


Paper and Project Links

PDF

Summary

ASR systems perform remarkably well under common conditions but struggle to exploit long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. To address this, we propose SAP^2^, a novel two-stage framework that dynamically prunes and integrates relevant contextual keywords. Each stage applies the proposed Speech-Driven Attention-based Pooling mechanism, which compresses context embeddings while preserving speech-salient information. Experiments show that SAP^2^ achieves state-of-the-art performance on the SlideSpeech and LibriSpeech datasets, with word error rates of 7.71% and 1.12%, respectively. On SlideSpeech, the method reduces the biased-keyword error rate (B-WER) by 41.1% compared with non-contextual baselines. SAP^2^ also scales robustly, maintaining performance under extensive contextual input on both datasets.

Key Takeaways

  1. ASR systems perform well under common conditions but face challenges in contextualized scenarios that require domain-specific knowledge.
  2. SAP^2^ is an improved framework for ASR systems designed to handle long-context information.
  3. SAP^2^ improves performance through two stages of dynamic pruning and integration of relevant contextual keywords.
  4. The proposed Speech-Driven Attention-based Pooling mechanism compresses context information while preserving key speech features.
  5. SAP^2^ achieves state-of-the-art performance on the SlideSpeech and LibriSpeech datasets.
  6. On SlideSpeech, SAP^2^ substantially reduces the biased-keyword error rate compared with non-contextual baselines.

Cool Papers

Click here to view the paper screenshots

DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition

Authors:HongYu Liu, Junxin Li, Changxi Guo, Hao Chen, Yaqian Huang, Yifu Guo, Huan Yang, Lihua Cai

Recognizing speaker intent in long audio dialogues among speakers has a wide range of applications, but is a non-trivial AI task due to complex inter-dependencies in speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogGraph-LLM, is proposed in the current work. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is designed using LLM with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering using both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM’s superiority over strong audio and text-driven baselines. The framework demonstrates strong performance and efficiency in intent recognition in real world scenario audio dialogues, proving its practical value for audio-rich domains with limited supervision. Our code is available at https://github.com/david188888/DialogGraph-LLM.
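
As a rough illustration of the semi-supervised recipe, the sketch below combines dual-threshold pseudo-label filtering (a global confidence threshold plus per-class thresholds) with entropy-based ranking of the remaining unlabeled samples; the threshold values, function names, and data shapes are assumptions made for this example and are not taken from the released code.

```python
# Illustrative sketch (not the released DialogGraph-LLM code): confidence-aware
# pseudo-labeling with dual-threshold filtering and entropy-based sample selection.
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of each row of a probability matrix."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def select_pseudo_labels(probs, global_thr=0.9, class_thr=None, budget=100):
    """probs: (n, c) predicted intent distributions for unlabeled dialogues."""
    n, c = probs.shape
    class_thr = np.full(c, 0.8) if class_thr is None else class_thr
    conf = probs.max(axis=1)                 # global confidence of the top prediction
    pred = probs.argmax(axis=1)              # candidate pseudo-label
    # Dual-threshold filter: accept a pseudo-label only if it clears both the
    # global threshold and the threshold of its predicted class.
    keep = (conf >= global_thr) & (conf >= class_thr[pred])
    # Entropy-based selection: among the rejected samples, prioritize the
    # highest-entropy (most informative) ones for the next training round.
    pool = np.where(~keep)[0]
    informative = pool[np.argsort(-entropy(probs[pool]))][:budget]
    return np.where(keep)[0], pred[keep], informative

# Toy usage with 1000 unlabeled dialogues and 5 intent classes.
probs = np.random.default_rng(1).dirichlet(np.ones(5), size=1000)
kept_idx, pseudo_labels, hard_idx = select_pseudo_labels(probs)
print(len(kept_idx), len(hard_idx))
```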


Paper and Project Links

PDF 8 pages, 2 figures; Series: Frontiers in Artificial Intelligence and Applications, Volume 413: ECAI 2025

Summary: For speaker-intent recognition in audio dialogues, an end-to-end framework named DialogGraph-LLM is proposed. It combines a Multi-Relational Dialogue Attention Network (MR-DAN) with a multimodal foundation model (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy built around the LLM uses dual-threshold filtering and an entropy-based sample-selection process to prioritize high-information unlabeled instances. Evaluations on the MarketCalls corpus and the public MIntRec 2.0 benchmark show that DialogGraph-LLM excels at intent recognition in audio dialogues, demonstrating its practical value for audio-rich domains with limited supervision.

Key Takeaways

  1. DialogGraph-LLM is an end-to-end framework for speaker-intent recognition in audio dialogues.
  2. The framework combines MR-DAN with a multimodal foundation model to support direct acoustic-to-intent inference.
  3. DialogGraph-LLM adopts an adaptive semi-supervised learning strategy designed around the LLM.
  4. Dual-threshold filtering and an entropy-based sample-selection process prioritize high-information unlabeled instances.
  5. Evaluations on the MarketCalls corpus and the MIntRec 2.0 benchmark demonstrate DialogGraph-LLM's superiority.
  6. The framework shows strong performance and efficiency for intent recognition in audio dialogues.

Cool Papers

Click here to view the paper screenshots

HI-TransPA: Hearing Impairments Translation Personal Assistant

Authors:Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng

Hearing-impaired individuals often face significant barriers in daily communication due to the inherent challenges of producing clear speech. To address this, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with lip dynamics, enabling both translation and dialogue within a single multimodal framework. To address the distinctive pronunciation patterns of hearing-impaired speech and the limited adaptability of existing models, we develop a multimodal preprocessing and curation pipeline that detects facial landmarks, stabilizes the lip region, and quantitatively evaluates sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. Architecturally, we employ a novel unified 3D-Resampler to efficiently encode the lip dynamics, which is critical for accurate interpretation. Experiments on the purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. Our work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
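
To illustrate the quality-score-guided curriculum, here is a minimal sketch that starts training on the cleanest samples and linearly widens the admitted pool over epochs; the linear schedule, the starting fraction, and all names are illustrative assumptions rather than HI-TransPA's actual pipeline.

```python
# Minimal sketch (assumptions, not HI-TransPA's pipeline): per-sample quality
# scores from preprocessing schedule a clean-to-hard training curriculum.
import numpy as np

def curriculum_subset(quality, epoch, total_epochs, start_frac=0.3):
    """Return indices of the samples admitted to training at this epoch.

    quality:    (n,) per-sample quality score from preprocessing (higher = cleaner)
    epoch:      current epoch, 0-based
    start_frac: fraction of the cleanest data used at epoch 0
    """
    n = len(quality)
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    frac = start_frac + (1.0 - start_frac) * progress      # linear schedule
    k = max(1, int(round(frac * n)))
    order = np.argsort(-quality)                           # cleanest samples first
    return order[:k]

# Toy usage: 2000 audio-visual samples, 10 training epochs.
quality = np.random.default_rng(2).uniform(size=2000)
for epoch in (0, 5, 9):
    idx = curriculum_subset(quality, epoch, total_epochs=10)
    print(epoch, len(idx))
```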


Paper and Project Links

PDF

Summary

To address the everyday communication barriers faced by hearing-impaired individuals, this work introduces the Omni-Model paradigm into assistive technology and presents HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with lip dynamics, enabling both translation and dialogue within a single multimodal framework. To handle the distinctive pronunciation patterns of hearing-impaired speech and the limited adaptability of existing models, a multimodal preprocessing and curation pipeline detects facial landmarks, stabilizes the lip region, and quantitatively scores sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen robustness. Experiments on the purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work lays a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.

Key Takeaways

  1. The Omni-Model paradigm is introduced to address the communication barriers hearing-impaired individuals face in daily life.
  2. HI-TransPA, an instruction-driven audio-visual personal assistant, fuses indistinct speech with lip-dynamics information.
  3. A multimodal preprocessing and curation pipeline improves the model's adaptability to hearing-impaired pronunciation patterns.
  4. A quality-score-guided curriculum learning strategy strengthens model robustness.
  5. A unified 3D-Resampler efficiently encodes lip dynamics, which is critical for accurate interpretation.
  6. Experiments on the purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance.

Cool Papers

Click here to view the paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!