⚠️ All summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not use these summaries for serious academic purposes; they are only meant as a first pass before actually reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-18
AV-Dialog: Spoken Dialogue Models with Audio-Visual Input
Authors: Tuochao Chen, Bandhav Veluri, Hongyu Gong, Shyamnath Gollakota
Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.
Paper and project links
Summary
Dialogue models fall short in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. To address this, we present AV-Dialog, the first multimodal dialogue framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on synthetic and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection, and accurate responses, yielding a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction accuracy, and enhancing human-rated dialogue quality. The results highlight the importance of both seeing and hearing for speaker-aware interaction and pave the way for spoken dialogue agents in real-world, noisy environments.
Key Takeaways
- AV-Dialog is a multimodal dialogue framework that uses audio and visual cues to improve the dialogue experience.
- It tracks the target speaker, predicts turn-taking, and generates coherent responses (a minimal interface sketch follows this list).
- Multi-task, multi-stage training combined with synthetic and real audio-visual dialogue datasets yields robust transcription and response generation.
- AV-Dialog reduces transcription errors and improves turn-taking prediction accuracy.
- In noisy, multi-speaker environments, AV-Dialog outperforms audio-only models.
- The results underscore the importance of vision for speaker-aware human-machine interaction.
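The following is a minimal, hypothetical sketch of the kind of streaming, multi-task interface the abstract describes: each step consumes paired audio tokens and a video frame for the target speaker, and jointly emits a transcript token, a turn-boundary probability, and response tokens once a turn ends. All class, method, and field names below are illustrative assumptions, not the authors' code or API.

```python
# Hypothetical sketch of the three-task streaming interface (not the real AV-Dialog).
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class StepOutput:
    transcript_token: Optional[str]  # streaming transcription of the target speaker
    turn_end_prob: float             # semantically grounded turn-boundary score
    response_tokens: List[str]       # non-empty only once a turn boundary fires


class AVDialogSketch:
    """Toy stand-in for an audio-visual dialogue model (illustrative only)."""

    def __init__(self, turn_threshold: float = 0.5):
        self.turn_threshold = turn_threshold

    def step(self, audio_tokens: List[int], video_frame: object) -> StepOutput:
        # A real model would fuse acoustic tokens with visual speaker cues here;
        # the placeholder heads below only illustrate the three-task interface.
        transcript_token = "<tok>" if audio_tokens else None
        turn_end_prob = 0.9 if audio_tokens and audio_tokens[-1] == 0 else 0.1
        response = ["<resp>"] if turn_end_prob >= self.turn_threshold else []
        return StepOutput(transcript_token, turn_end_prob, response)


def run_stream(agent: AVDialogSketch,
               stream: Iterable[Tuple[List[int], object]]) -> List[str]:
    """Consume (audio_tokens, video_frame) pairs and collect the agent's reply."""
    reply: List[str] = []
    for audio_tokens, video_frame in stream:
        out = agent.step(audio_tokens, video_frame)
        reply.extend(out.response_tokens)
    return reply


if __name__ == "__main__":
    agent = AVDialogSketch()
    fake_stream = [([3, 1, 4], None), ([1, 5, 0], None)]  # last chunk ends the turn
    print(run_stream(agent, fake_stream))                  # -> ['<resp>']
```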
DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition
Authors: HongYu Liu, Junxin Li, Changxi Guo, Hao Chen, Yaqian Huang, Yifu Guo, Huan Yang, Lihua Cai
Recognizing speaker intent in long audio dialogues among speakers has a wide range of applications, but is a non-trivial AI task due to complex inter-dependencies in speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogGraph-LLM, is proposed in the current work. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is designed using LLM with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering using both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM’s superiority over strong audio and text-driven baselines. The framework demonstrates strong performance and efficiency in intent recognition in real world scenario audio dialogues, proving its practical value for audio-rich domains with limited supervision. Our code is available at https://github.com/david188888/DialogGraph-LLM.
Paper and project links
PDF 8 pages, 2 figures; Series: Frontiers in Artificial Intelligence and Applications, Volume 413: ECAI 2025
Summary
This paper proposes DialogGraph-LLM, an end-to-end framework for recognizing speaker intent in long audio dialogues. The framework combines a Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. It adopts an LLM-based adaptive semi-supervised learning strategy, using a confidence-aware pseudo-label generation mechanism built on dual-threshold filtering over global and class confidences, together with an entropy-based sample selection process that prioritizes high-information unlabeled samples. Extensive evaluations on the proprietary MarketCalls corpus and the public MIntRec 2.0 benchmark show that DialogGraph-LLM delivers strong performance and efficiency for intent recognition in real-world audio dialogues.
Key Takeaways
- The DialogGraph-LLM framework recognizes speaker intent in long audio dialogues and has a wide range of applications.
- It combines the MR-DAN architecture with multimodal foundation models for direct acoustic-to-intent inference.
- It adopts an adaptive semi-supervised learning strategy designed around an LLM.
- A confidence-aware pseudo-label generation mechanism and an entropy-based sample selection process streamline the learning pipeline (see the sketch after this list).
- DialogGraph-LLM outperforms strong audio- and text-driven baselines on the MarketCalls corpus and the MIntRec 2.0 benchmark.
- The framework shows strong performance and efficiency in recognizing intent in real-world audio dialogues.
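As a concrete illustration of the adaptive semi-supervised step summarized above, here is a minimal sketch of dual-threshold pseudo-label filtering combined with entropy-based ranking of unlabeled samples. The threshold values, function name, and the exact way the entropy ranking is consumed downstream are assumptions for illustration only; they are not taken from the DialogGraph-LLM paper or codebase.

```python
# Hedged sketch: keep a pseudo-label only if its confidence clears BOTH a global
# threshold and the threshold of its predicted class; rank the remaining
# unlabeled samples by prediction entropy so high-information ones come first.
# Thresholds and names are illustrative assumptions, not the paper's values.
from typing import Optional

import numpy as np


def select_pseudo_labels(probs: np.ndarray,
                         global_tau: float = 0.9,
                         class_tau: Optional[np.ndarray] = None):
    """probs: (N, C) softmax outputs of the model on unlabeled audio dialogues."""
    n_samples, n_classes = probs.shape
    if class_tau is None:
        class_tau = np.full(n_classes, 0.8)   # per-class thresholds (assumed values)

    conf = probs.max(axis=1)                  # confidence of the predicted class
    pred = probs.argmax(axis=1)               # candidate pseudo-label

    # Dual-threshold filtering over global and class confidences.
    keep = (conf >= global_tau) & (conf >= class_tau[pred])

    # Entropy-based selection: higher entropy = more informative unlabeled sample.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    uncertain_idx = np.flatnonzero(~keep)
    query_order = uncertain_idx[np.argsort(-entropy[uncertain_idx])]

    return pred[keep], np.flatnonzero(keep), query_order


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(6, 4))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    labels, kept_idx, next_batch = select_pseudo_labels(probs)
    print("pseudo-labels:", labels, "kept:", kept_idx, "query order:", next_batch)
```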