Published: 2025-10-11

Updated: 2025-11-27

Word count: 2.9k

Reading time: 11 min

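⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not rely on them in serious academic settings. They are intended only for preliminary screening before reading a paper.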

Updated 2025-10-11

DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching

Authors: Hanke Xie, Dake Guo, Chengyou Wang, Yue Li, Wenjie Tian, Xinfa Zhu, Xinsheng Wang, Xiulin Li, Guanqiong Miao, Bo Liu, Lei Xie

Recent advances in text-to-speech (TTS) synthesis, particularly those leveraging large language models (LLMs), have significantly improved expressiveness and naturalness. However, generating human-like, interactive dialogue speech remains challenging. Current systems face limitations due to the scarcity of dual-track data and difficulties in achieving naturalness, contextual coherence, and interactional dynamics, such as turn-taking, overlapping speech, and speaker consistency, in multi-turn conversations. To address these challenges, we propose DialoSpeech, a dual-track architecture combining a large language model with Chunked Flow Matching for expressive, human-like dialogue speech synthesis. DialoSpeech generates natural multi-turn conversations with coherent speaker turns and natural overlaps, supporting both Chinese and English and cross-lingual speech synthesis. We introduce a data processing pipeline to construct dual-track dialogue datasets, facilitating scalable training and experimental validation. Experiments show that our model outperforms baselines, offering a solution for generating human-like spoken dialogues. Audio samples are available at https://tiamojames.github.io/DialoSpeech

Paper and project links

PDF

Summary

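Recent advances in text-to-speech (TTS) synthesis, particularly those that use large language models (LLMs), have markedly improved expressiveness and naturalness. Generating human-like, interactive dialogue speech nevertheless remains challenging. To address the difficulty of achieving naturalness, contextual coherence, and interactional dynamics (such as turn-taking, overlapping speech, and speaker consistency) in multi-turn conversations, the paper proposes DialoSpeech, a dual-track architecture that combines a large language model with Chunked Flow Matching for expressive, human-like dialogue speech synthesis. DialoSpeech supports natural multi-turn dialogue generation in Chinese and English, as well as cross-lingual speech synthesis.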

Key Takeaways

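Recent advances in text-to-speech (TTS) synthesis, especially those built on large language models (LLMs), have markedly improved expressiveness and naturalness.
Generating human-like, interactive dialogue speech remains challenging, chiefly in achieving naturalness, contextual coherence, and interactional dynamics.
DialoSpeech is a dual-track architecture that combines a large language model with Chunked Flow Matching for expressive, human-like dialogue speech synthesis.
DialoSpeech supports natural multi-turn dialogue generation in Chinese and English, as well as cross-lingual speech synthesis.
A data processing pipeline is introduced to construct dual-track dialogue datasets, which facilitates scalable training and experimental validation.
Experiments show that DialoSpeech outperforms the baselines, offering a solution for generating human-like spoken dialogue.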
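
To make the idea of dual-track dialogue data more concrete, here is a minimal sketch of one way to represent two speaker tracks on a shared timeline and measure overlapping speech. It is not the DialoSpeech data pipeline: the `Segment` type, the `overlaps` and `turn_order` helpers, and the example utterances are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One utterance on a speaker's track; times are in seconds (assumed unit)."""
    speaker: str
    start: float
    end: float
    text: str


def overlaps(track_a, track_b):
    """Return (a, b, seconds) for every pair of segments that overlap in time."""
    found = []
    for a in track_a:
        for b in track_b:
            shared = min(a.end, b.end) - max(a.start, b.start)
            if shared > 0:
                found.append((a, b, round(shared, 3)))
    return found


def turn_order(track_a, track_b):
    """Interleave both tracks by start time to recover the turn sequence."""
    merged = sorted(track_a + track_b, key=lambda s: s.start)
    return [s.speaker for s in merged]


# Hypothetical two-speaker exchange in which each turn starts slightly early.
track_a = [
    Segment("A", 0.0, 2.5, "How was the trip?"),
    Segment("A", 5.0, 6.0, "Nice."),
]
track_b = [Segment("B", 2.2, 5.2, "Great, we saw the coast.")]

for a, b, seconds in overlaps(track_a, track_b):
    print(f"{a.speaker}/{b.speaker} overlap for {seconds:.1f}s")
print(turn_order(track_a, track_b))
```

Keeping each speaker on a separate track means overlapping speech is stored as it happened rather than flattened into a single alternating transcript, which is the property the abstract associates with dual-track data.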

Prepared mind, fast response: A temporal decoupling framework for adaptive knowledge orchestration in open-domain dialogue

Authors: Jinling Gan, Churong Liang, Runnan Li

The latency-quality tradeoff is a fundamental constraint in open-domain dialogue AI systems, since comprehensive knowledge access necessitates prohibitive response delays. Contemporary approaches offer two inadequate solutions: lightweight instruct models achieve sub-second latency but lack reasoning depth, while tool-augmented ReAct agents enhance factuality through external knowledge at the cost of synchronous execution that blocks interaction during retrieval processes. PMFR is thus proposed, with a temporal decoupling framework that fundamentally resolves the contradiction through asynchronous knowledge orchestration. PMFR employs three coordinated components: (1) a Knowledge Adequacy Evaluator for real-time sufficiency assessment, (2) a Lightweight Response Generator for immediate user interaction, and (3) an Asynchronous Knowledge Refinement Agent for background knowledge enhancement. This architecture maintains continuous conversational flow while progressively enriching knowledge coverage through intelligent triggering mechanisms. Evaluation results on TopiOCQA demonstrate PMFR outperforms brute-force scaling: PMFR achieves 95.3% latency reduction (23.38s -> 1.09s) while preserving response quality comparable to heavyweight synchronous baselines (GEval-C: 0.613 vs. 0.620).
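
The headline figures in the abstract can be checked with a few lines of arithmetic. The sketch below recomputes the relative latency reduction from the two reported latencies and the absolute gap between the two reported GEval-C scores. The function name `latency_reduction` is a hypothetical helper; the numbers are taken from the abstract.

```python
def latency_reduction(baseline_s: float, new_s: float) -> float:
    """Relative latency reduction, as a percentage of the baseline latency."""
    return (baseline_s - new_s) / baseline_s * 100.0


# Latencies reported in the abstract: 23.38 s before, 1.09 s after.
reduction = latency_reduction(23.38, 1.09)

# Absolute gap between the two reported GEval-C scores.
quality_gap = abs(0.620 - 0.613)

print(f"latency reduction: {reduction:.1f}%")
print(f"GEval-C gap: {quality_gap:.3f}")
```

The recomputed reduction rounds to the 95.3% quoted in the abstract, and the two quality scores differ by 0.007.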

Paper and project links

PDF

Summary

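This paper examines the latency-quality tradeoff in open-domain dialogue AI systems and proposes PMFR as a solution. PMFR uses an asynchronous knowledge orchestration architecture whose three coordinated components handle real-time sufficiency assessment, immediate user interaction, and background knowledge enhancement. This maintains continuous conversational flow while progressively broadening knowledge coverage. Evaluation on TopiOCQA shows that PMFR reduces latency while preserving response quality comparable to heavyweight synchronous baselines.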

Key Takeaways

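Open-domain dialogue AI systems face a latency-quality tradeoff: comprehensive knowledge access incurs response delays.
Existing approaches each fall short: lightweight instruct models achieve sub-second latency but lack reasoning depth, while tool-augmented ReAct agents improve factuality but block interaction during synchronous retrieval.
PMFR proposes a temporal decoupling framework that resolves this conflict through asynchronous knowledge orchestration.
Its three coordinated components are a Knowledge Adequacy Evaluator (real-time sufficiency assessment), a Lightweight Response Generator (immediate user interaction), and an Asynchronous Knowledge Refinement Agent (background knowledge enhancement).
PMFR maintains continuous conversational flow while progressively enriching knowledge coverage.
On TopiOCQA, PMFR cuts latency by 95.3% (23.38s -> 1.09s) while keeping response quality comparable to heavyweight synchronous baselines (GEval-C: 0.613 vs. 0.620).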
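
The three-component design can be illustrated with a toy `asyncio` sketch: reply immediately from what is already known, and trigger knowledge refinement in the background without awaiting it. This is only an illustration of the temporal decoupling idea under simplifying assumptions, not the PMFR implementation. `KnowledgeStore`, `knowledge_is_adequate`, `quick_reply`, `refine_in_background`, and `handle_turn` are hypothetical names, and the exact-match adequacy check and the `asyncio.sleep` call merely stand in for a real evaluator and a real retrieval step.

```python
import asyncio


class KnowledgeStore:
    """Shared knowledge that the background agent enriches over time."""

    def __init__(self):
        self.facts = {}


def knowledge_is_adequate(store, query):
    # Hypothetical adequacy check: a fact for this exact query already exists.
    return query in store.facts


def quick_reply(store, query):
    # Lightweight generator: answer right away from whatever is known.
    fact = store.facts.get(query)
    return fact if fact else "Here is what I know so far; let me look into it."


async def refine_in_background(store, query):
    # Stand-in for slow retrieval or tool use; runs off the response path.
    await asyncio.sleep(0.05)
    store.facts[query] = f"retrieved answer for {query!r}"


async def handle_turn(store, query, pending):
    if not knowledge_is_adequate(store, query):
        # Trigger refinement but do not await it, so the reply is not blocked.
        pending.append(asyncio.create_task(refine_in_background(store, query)))
    return quick_reply(store, query)


async def main():
    store, pending = KnowledgeStore(), []
    first = await handle_turn(store, "capital of Peru", pending)
    # In a live conversation the refinement would overlap with the user's next
    # turn; the demo simply waits for it to finish before asking again.
    await asyncio.gather(*pending)
    second = await handle_turn(store, "capital of Peru", pending)
    return first, second


first, second = asyncio.run(main())
print(first)
print(second)
```

The first turn returns before retrieval finishes, and the second turn benefits from the knowledge added in the meantime. A synchronous ReAct-style loop would instead await the retrieval inside `handle_turn`, which is the blocking behaviour the abstract criticises.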

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Authors: Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete tasks. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a long-video question answering benchmark comprising 100 newly recorded robot-perspective videos (M3-Bench-robot) and 920 diverse web-sourced videos (M3-Bench-web). We annotate QA pairs designed to test capabilities essential for agent applications, such as person understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights for their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent.

Paper and project links

PDF

Summary

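M3-Agent is a multimodal agent framework equipped with long-term memory. It processes real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling a deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, the authors introduce the M3-Bench benchmark, which comprises robot-perspective videos and web-sourced videos. Experiments show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline and achieves higher accuracy on M3-Bench.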
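
As a rough illustration of what an entity-centric memory with episodic and semantic parts could look like, here is a small sketch. It does not reproduce M3-Agent's actual memory format or retrieval method, which the abstract does not specify. `EntityMemory`, `MemoryStore`, and the keyword-based `retrieve` method are hypothetical, with substring matching used as a simple stand-in for real retrieval.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EntityMemory:
    # Time-stamped observations: (timestamp, modality, observation).
    episodic: list = field(default_factory=list)
    # Time-independent facts distilled about the entity.
    semantic: set = field(default_factory=set)


class MemoryStore:
    """Memories grouped by the entity they are about."""

    def __init__(self):
        self.entities = defaultdict(EntityMemory)

    def observe(self, entity, timestamp, modality, observation):
        self.entities[entity].episodic.append((timestamp, modality, observation))

    def learn(self, entity, fact):
        self.entities[entity].semantic.add(fact)

    def retrieve(self, keyword):
        """Return (entity, kind, text) for every memory mentioning the keyword."""
        kw = keyword.lower()
        hits = []
        for name, mem in self.entities.items():
            hits += [(name, "episodic", obs)
                     for _, _, obs in mem.episodic if kw in obs.lower()]
            hits += [(name, "semantic", fact)
                     for fact in sorted(mem.semantic) if kw in fact.lower()]
        return hits


# Hypothetical observations from a video stream.
store = MemoryStore()
store.observe("Alice", 12.0, "visual", "Alice pours coffee into a red mug")
store.observe("Alice", 15.5, "audio", "Alice says she dislikes tea")
store.learn("Alice", "Alice prefers coffee")

for hit in store.retrieve("coffee"):
    print(hit)
```

Grouping memories by entity keeps what was seen and what was heard about the same person in one place, which is one plausible reading of the "entity-centric, multimodal" organization described in the abstract.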

Key Takeaways