⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never rely on these summaries in serious academic settings; they are only meant for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-16

Persona-Aware Alignment Framework for Personalized Dialogue Generation

Authors:Guanrong Li, Xinyu Liu, Zhen Wu, Xinyu Dai

Personalized dialogue generation aims to leverage persona profiles and dialogue history to generate persona-relevant and consistent responses. Mainstream models typically rely on token-level language model training with persona dialogue data, such as Next Token Prediction, to implicitly achieve personalization, which makes these methods tend to neglect the given personas and generate generic responses. To address this issue, we propose a novel Persona-Aware Alignment Framework (PAL), which directly treats persona alignment as the training objective of dialogue generation. Specifically, PAL employs a two-stage training method, including Persona-aware Learning and Persona Alignment, equipped with an easy-to-use inference strategy, Select then Generate, to improve persona sensitivity and generate more persona-relevant responses at the semantic level. Through extensive experiments, we demonstrate that our framework outperforms many state-of-the-art personalized dialogue methods and large language models.
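
The abstract does not detail the Select then Generate strategy, but the underlying idea of first picking the persona sentences most relevant to the dialogue context and then conditioning generation only on them can be sketched as below. This is a minimal, hypothetical illustration: the bag-of-words scorer and the `generate_fn` stub stand in for PAL's trained persona-aware model, and `top_k` is an assumed parameter.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def select_then_generate(personas, history, generate_fn, top_k=2):
    """Select the persona sentences most relevant to the context, then generate conditioned on them."""
    context = Counter(" ".join(history).lower().split())
    ranked = sorted(personas, key=lambda p: cosine(Counter(p.lower().split()), context), reverse=True)
    selected = ranked[:top_k]
    prompt = "persona: " + " ".join(selected) + "\n" + "\n".join(history) + "\nresponse:"
    return selected, generate_fn(prompt)

# Toy usage with a stub generator standing in for the dialogue model.
personas = ["I love hiking in the mountains.", "I work as a nurse.", "My favorite food is sushi."]
history = ["A: What do you do on weekends?"]
selected, reply = select_then_generate(personas, history, generate_fn=lambda p: "<model response>")
print(selected, reply)
```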


Paper and Project Links

PDF Pre-MIT Press publication version

Summary

This paper proposes a new framework for personalized dialogue generation, the Persona-Aware Alignment Framework (PAL), which uses persona profiles and dialogue history to generate persona-relevant and consistent responses. By treating persona alignment directly as the training objective of dialogue generation, PAL combines a two-stage training method with a "Select then Generate" inference strategy, improving persona sensitivity and producing more persona-relevant responses at the semantic level. Experiments show that the framework outperforms many state-of-the-art personalized dialogue methods and large language models.

Key Takeaways

  1. Personalized dialogue generation aims to use persona profiles and dialogue history to produce relevant and consistent responses.
  2. Mainstream models rely on token-level language-model training to achieve personalization implicitly, which tends to yield generic responses.
  3. PAL treats persona alignment directly as the training objective of dialogue generation, emphasizing the role of the persona.
  4. PAL combines a two-stage training method with a "Select then Generate" inference strategy to improve personalized response quality at the semantic level.
  5. In experiments, PAL outperforms other state-of-the-art personalized dialogue methods and large language models.
  6. PAL helps address the tendency of mainstream models to neglect persona profiles.

Cool Papers

Click here to view paper screenshots

Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning

Authors:Xiaolong Wei, Yuehu Dong, Xingliang Wang, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin

Existing tool-augmented large language models (LLMs) encounter significant challenges when processing complex queries. Current frameworks such as ReAct are prone to local optimization traps due to their reliance on incremental decision-making processes. To address these limitations, we propose a novel Planner-centric Plan-Execute paradigm that fundamentally resolves local optimization bottlenecks through architectural innovation. Central to our approach is a novel Planner model that performs global Directed Acyclic Graph (DAG) planning for complex queries, enabling optimized execution beyond conventional tool coordination. We also introduce ComplexTool-Plan, a large-scale benchmark dataset featuring complex queries that demand sophisticated multi-tool composition and coordination capabilities. Additionally, we develop a two-stage training methodology that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), systematically enhancing the Planner’s tool selection accuracy and global planning awareness through structured DAG-based planning. When integrated with a capable executor, our framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries, demonstrating superior end-to-end execution capabilities and robust handling of intricate multi-tool workflows.
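
To make the Plan-Execute idea concrete, here is a minimal sketch assuming a Planner has already emitted a DAG of tool calls: each node is executed only after its dependencies have produced their outputs. The node names and the `run_tool` stub are hypothetical; this is not the authors' implementation.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

# A hypothetical DAG plan: node -> list of dependency nodes.
plan_edges: Dict[str, List[str]] = {
    "search_flights": [],
    "search_hotels": [],
    "compare_prices": ["search_flights", "search_hotels"],
    "book_trip": ["compare_prices"],
}

def run_tool(name: str, inputs: Dict[str, object]) -> object:
    """Stub executor; a real system would dispatch to actual tool APIs."""
    return f"<result of {name} given {sorted(inputs)}>"

def execute_plan(edges: Dict[str, List[str]], executor: Callable) -> Dict[str, object]:
    """Run tool nodes in topological order so every call sees its dependencies' outputs."""
    results: Dict[str, object] = {}
    for node in TopologicalSorter(edges).static_order():
        dep_outputs = {dep: results[dep] for dep in edges[node]}
        results[node] = executor(node, dep_outputs)
    return results

print(execute_plan(plan_edges, run_tool))
```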


Paper and Project Links

PDF

Summary

To address the challenges that existing tool-augmented large language models (LLMs) face on complex queries, this paper proposes a novel Planner-centric Plan-Execute paradigm that resolves local-optimization bottlenecks through architectural innovation. At its core is a Planner model that performs global Directed Acyclic Graph (DAG) planning, producing optimized execution plans for complex queries beyond conventional tool coordination. The paper also introduces ComplexTool-Plan, a large-scale benchmark dataset, and a two-stage training method that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to improve the Planner's tool-selection accuracy and global planning awareness. Combined with a capable executor, the framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries, demonstrating strong end-to-end execution and robust handling of intricate multi-tool workflows.

Key Takeaways

  1. Existing tool-augmented large language models (LLMs) struggle with complex queries.
  2. A novel Planner-centric Plan-Execute paradigm is proposed that addresses local-optimization traps through architectural innovation.
  3. The Planner model performs global Directed Acyclic Graph (DAG) planning to produce optimized execution plans for complex queries.
  4. The ComplexTool-Plan large-scale benchmark dataset is introduced to test models on complex queries.
  5. A two-stage training method combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) improves the Planner's performance.
  6. Combined with a capable executor, the framework achieves state-of-the-art results on the StableToolBench benchmark.

Cool Papers

Click here to view paper screenshots

TinyChemVL: Advancing Chemical Vision-Language Models via Efficient Visual Token Reduction and Complex Reaction Tasks

Authors:Xuanle Zhao, Shuxin Zeng, Yinyuan Cai, Xiang Cheng, Duzhen Zhang, Xiuyi Chen, Bo Xu

While Vision Language Models (VLMs) have demonstrated remarkable capabilities in general visual understanding, their application in the chemical domain has been limited, with previous works predominantly focusing on text and thus overlooking critical visual information, such as molecular structures. Current approaches that directly adopt standard VLMs for chemical tasks suffer from two primary issues: (i) computational inefficiency of processing entire chemical images with non-informative backgrounds, and (ii) a narrow scope on molecular-level tasks that restricts progress in chemical reasoning. In this work, we propose TinyChemVL, an efficient and powerful chemical VLM that leverages visual token reduction and reaction-level tasks to improve model efficiency and reasoning capacity. Also, we propose ChemRxn-V, a reaction-level benchmark for assessing vision-based reaction recognition and prediction tasks. Directly predicting reaction products from molecular images poses a non-trivial challenge, as it requires models to integrate both recognition and reasoning capacities. Our results demonstrate that with only 4B parameters, TinyChemVL achieves superior performance on both molecular and reaction tasks while demonstrating faster inference and training speeds compared to existing models. Notably, TinyChemVL outperforms ChemVLM while utilizing only 1/16th of the visual tokens. This work builds efficient yet powerful VLMs for chemical domains by co-designing model architecture and task complexity.
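
The abstract does not specify how visual tokens are reduced; the sketch below conveys the general idea with a hypothetical variance filter that drops near-uniform background patches (chemical drawings are mostly blank canvas) before they would reach the language model. Patch size, threshold, and the filtering criterion are assumptions for illustration only, not TinyChemVL's actual module.

```python
import numpy as np

def reduce_visual_tokens(image: np.ndarray, patch: int = 16, var_threshold: float = 1e-3):
    """Split an image into patches and keep only those that are not near-uniform background.

    Returns the kept patch vectors and their indices.
    """
    h, w, c = image.shape
    patches = (
        image[: h - h % patch, : w - w % patch]
        .reshape(h // patch, patch, w // patch, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, patch * patch * c)
    )
    keep = patches.var(axis=1) > var_threshold   # background patches have near-zero variance
    return patches[keep], np.flatnonzero(keep)

# Toy example: a mostly-white canvas with one textured "molecule" region.
rng = np.random.default_rng(0)
img = np.ones((224, 224, 3), dtype=np.float32)
img[64:128, 64:128] = rng.random((64, 64, 3))
tokens, idx = reduce_visual_tokens(img)
print(f"kept {len(idx)} of {(224 // 16) ** 2} patch tokens")
```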


Paper and Project Links

PDF Accepted by AAAI 2026, Preprint Version

Summary

This paper examines the application of Vision Language Models (VLMs) in the chemical domain and the challenges they face. It proposes TinyChemVL, an efficient chemical VLM that improves efficiency and reasoning capacity through visual token reduction and reaction-level tasks, together with ChemRxn-V, a reaction-level benchmark for vision-based reaction recognition and prediction. Experiments show that TinyChemVL achieves superior performance on both molecular and reaction tasks with faster inference and training than existing models.

Key Takeaways

  1. Applying VLMs to chemistry is challenging: processing whole chemical images is computationally inefficient and existing task scopes are narrow.
  2. TinyChemVL is proposed as an efficient chemical VLM that improves efficiency through visual token reduction.
  3. By incorporating reaction-level tasks, TinyChemVL improves the model's reasoning capacity.
  4. The ChemRxn-V benchmark evaluates vision-based reaction recognition and prediction tasks.
  5. Predicting reaction products directly from molecular images requires both recognition and reasoning capabilities.
  6. TinyChemVL performs strongly on molecular and reaction tasks with faster inference and training than existing models.

Cool Papers

Click here to view paper screenshots

The silence of the weights: an investigation of structural pruning strategies for attention-based audio signal architectures

Authors:Andrea Diecidue, Carlo Alberto Barbano, Piero Fraternali, Mathieu Fontaine, Enzo Tartaglione

Transformer-based models have become the state of the art across multiple domains, from natural language processing to machine listening, thanks to attention mechanisms. However, the attention layers require a large number of parameters and high-end hardware for both training and inference. We propose a novel pruning technique targeted explicitly at the attention mechanism, where we decouple the pruning of the four layers in the attention block, namely the query, key, value, and output projection matrices. We also investigate pruning strategies along the head and channel dimensions, and compare the performance of the Audio Spectrogram Transformer (AST) model under different pruning scenarios. Our results show that even when pruning 50% of the attention parameters, we incur a performance degradation of less than 1%.
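
As a rough illustration of decoupled structured pruning, the PyTorch sketch below zeroes whole head-sized row groups of each attention projection independently, ranked by weight magnitude. The magnitude criterion and the 50% ratio only loosely mirror the setting described above; the paper's actual pruning criterion may differ.

```python
import torch
import torch.nn as nn

def prune_heads_by_norm(linear: nn.Linear, n_heads: int, prune_ratio: float) -> None:
    """Zero out whole head-sized row groups of one projection, ranked by L2 weight norm.

    Applying this separately to the query, key, value, and output projections is the
    "decoupled" aspect; the magnitude criterion is an illustrative stand-in.
    """
    w = linear.weight.data                              # (out_features, in_features)
    head_dim = w.shape[0] // n_heads
    norms = w.view(n_heads, -1).norm(dim=1)             # one norm per head-sized group
    keep = torch.ones(n_heads)
    keep[torch.argsort(norms)[: int(prune_ratio * n_heads)]] = 0.0   # drop lowest-norm groups
    row_mask = keep.repeat_interleave(head_dim)
    w.mul_(row_mask.unsqueeze(1))
    if linear.bias is not None:
        linear.bias.data.mul_(row_mask)

# Toy block: 12 heads x 64 dims (768 total); prune 50% of each projection independently.
# For the output projection, rows are model channels rather than heads, so the same
# grouping there corresponds to channel-dimension pruning.
attn = nn.ModuleDict({name: nn.Linear(768, 768) for name in ("query", "key", "value", "out")})
for name, proj in attn.items():
    prune_heads_by_norm(proj, n_heads=12, prune_ratio=0.5)
    print(name, "zeroed rows:", int((proj.weight.abs().sum(dim=1) == 0).sum()))
```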


Paper and Project Links

PDF

Summary

Transformer-based models have become the state of the art in many domains thanks to attention mechanisms, but attention layers require many parameters and high-end hardware for training and inference. The paper proposes a pruning technique targeted at the attention mechanism that decouples the pruning of the four projections in the attention block (query, key, value, and output). It also studies pruning along the head and channel dimensions and compares the Audio Spectrogram Transformer (AST) under different pruning scenarios. Results show that even with 50% of the attention parameters pruned, performance drops by less than 1%.

Key Takeaways

  1. Transformer models excel across domains thanks to attention mechanisms.
  2. Attention layers demand many parameters and substantial hardware resources.
  3. A pruning technique for the attention mechanism is proposed that prunes the four projections of the attention block independently.
  4. Pruning strategies along the head and channel dimensions are investigated.
  5. The Audio Spectrogram Transformer (AST) is compared under different pruning scenarios.
  6. After pruning 50% of the attention parameters, performance degrades by less than 1%.

Cool Papers

Click here to view paper screenshots

F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model

Authors:Hanbo Bi, Zhiqiang Yuan, Zexi Jia, Jiapei Zhang, Chongyang Li, Peixiang Luo, Ying Deng, Xiaoyue Duan, Jinchao Zhang

Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, these methods often fail to meet users' actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, requiring models to locate query-relevant fragments, comprising both utterances and images, from multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we explore existing generation-based Vision-Language Models (VLMs) on FFR and observe that they often retrieve incoherent utterance-image fragments. While optimized for generating responses from visual-textual inputs, these models lack explicit supervision to ensure semantic coherence within retrieved fragments. To this end, we propose F2RVLM, a generative retrieval model trained in a two-stage paradigm: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards promoting semantic precision, relevance, and contextual coherence. To handle varying intra-fragment complexity, from locally dense to sparsely distributed, we introduce difficulty-aware curriculum sampling that ranks training instances by model-predicted difficulty and gradually exposes the model to harder samples. This boosts reasoning ability in long, multi-turn contexts. F2RVLM outperforms popular VLMs in both in-domain and real-domain settings, demonstrating superior retrieval performance.
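
One component that lends itself to a short sketch is the difficulty-aware curriculum sampling: rank training instances by a model-predicted difficulty score and widen the training pool from easy to hard over stages. The scoring function and staging schedule below are illustrative assumptions, not the paper's exact recipe.

```python
import random
from typing import Callable, Sequence

def curriculum_batches(
    dataset: Sequence[dict],
    difficulty_fn: Callable[[dict], float],
    n_stages: int = 3,
    batch_size: int = 4,
):
    """Yield (stage, batch) pairs, widening the pool from easy to hard examples.

    `difficulty_fn` stands in for a model-predicted difficulty score (e.g. per-instance
    loss or failure rate); later stages expose the model to progressively harder samples.
    """
    ranked = sorted(dataset, key=difficulty_fn)                       # easiest first
    for stage in range(1, n_stages + 1):
        pool = ranked[: max(batch_size, len(ranked) * stage // n_stages)]
        random.shuffle(pool)
        for i in range(0, len(pool), batch_size):
            yield stage, pool[i : i + batch_size]

# Toy data: each instance carries a pre-computed difficulty proxy.
data = [{"id": i, "difficulty": random.random()} for i in range(12)]
for stage, batch in curriculum_batches(data, difficulty_fn=lambda x: x["difficulty"]):
    print(stage, [ex["id"] for ex in batch])
```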


Paper and Project Links

PDF

Summary

The paper points out the limitations of traditional dialogue retrieval and introduces the Fine-grained Fragment Retrieval (FFR) task to address them. To support FFR it constructs MLDR, the multimodal dialogue retrieval dataset with the most turns per dialogue to date, plus a real-world WeChat-based test set. It finds that existing generation-based vision-language models handle FFR poorly, often retrieving incoherent fragments, and proposes F2RVLM, trained in two stages: supervised fine-tuning to inject fragment-level retrieval knowledge, followed by GRPO-based reinforcement learning that promotes semantic precision, relevance, and contextual coherence. A difficulty-aware curriculum sampling strategy further improves reasoning over fragments of varying complexity. F2RVLM outperforms popular VLMs in both in-domain and real-world settings.

Key Takeaways

  1. Traditional dialogue retrieval selects a suitable utterance or image from recent dialogue history, but struggles to surface semantically coherent content scattered across long conversations.
  2. The Fine-grained Fragment Retrieval (FFR) task requires models to locate query-relevant fragments from multimodal long-form dialogues.
  3. MLDR is constructed as the longest-turn multimodal dialogue retrieval dataset to date.
  4. Existing generation-based vision-language models handle FFR poorly, mainly lacking semantic coherence in retrieved fragments.
  5. F2RVLM is trained in two stages: supervised fine-tuning that injects fragment-level retrieval knowledge, then reinforcement learning that improves semantic precision, relevance, and contextual coherence.
  6. Difficulty-aware curriculum sampling is introduced to handle fragments of varying complexity.

Cool Papers

Click here to view paper screenshots

DIFFA: Large Language Diffusion Models Can Listen and Understand

Authors:Jiaming Zhou, Hongjie Chen, Shiwan Zhao, Jian Kang, Jie Li, Enzhi Wang, Yujie Guo, Haoqin Sun, Hui Wang, Aobo Kong, Yong Qin, Xuelong Li

Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce DIFFA, the first diffusion-based large audio-language model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs automatically generated by prompting LLMs. Despite being trained on only 960 hours of ASR and 127 hours of synthetic instruction data, DIFFA demonstrates competitive performance on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. Our results reveal the potential of diffusion-based language models for efficient and scalable audio understanding, opening a new direction for speech-driven AI. Our code will be available at https://github.com/NKU-HLT/DIFFA.git.
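
The abstract describes a lightweight dual-adapter front-end bridging speech features into a frozen diffusion language model, without giving its exact form. Below is a hedged PyTorch sketch of what such a front-end might look like: the adapter shapes, the sequence-wise concatenation, and the module names are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Small bottleneck MLP mapping speech-encoder frames into the LM embedding space."""
    def __init__(self, speech_dim: int, lm_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(speech_dim, hidden), nn.GELU(), nn.Linear(hidden, lm_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class DualAdapterFrontend(nn.Module):
    """Two adapters over shared speech-encoder features: one imagined as tuned with an
    ASR-style alignment objective, the other for instruction following. Their outputs are
    concatenated along the sequence axis before being fed to the frozen diffusion LM
    (stage-specific training is omitted here)."""
    def __init__(self, speech_dim: int = 512, lm_dim: int = 1024):
        super().__init__()
        self.asr_adapter = SpeechAdapter(speech_dim, lm_dim)
        self.instruct_adapter = SpeechAdapter(speech_dim, lm_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.asr_adapter(speech_feats), self.instruct_adapter(speech_feats)], dim=1)

frontend = DualAdapterFrontend()
feats = torch.randn(2, 50, 512)        # (batch, speech frames, encoder dim)
print(frontend(feats).shape)           # -> torch.Size([2, 100, 1024])
```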


Paper and Project Links

PDF Accepted by AAAI 2026

Summary

Large language models have shown remarkable capabilities across textual and multimodal domains, and diffusion-based language models offer a promising alternative for audio understanding. This work introduces DIFFA, the first diffusion-based large audio-language model for spoken language understanding. DIFFA couples a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. Training proceeds in two stages: first aligning semantic representations via an ASR objective, then learning instruction-following from synthetic audio-caption pairs. DIFFA performs competitively on major benchmarks, demonstrating the potential of diffusion models for audio understanding.

Key Takeaways

  1. Large language models show remarkable capabilities across textual and multimodal domains.
  2. Diffusion-based language models show potential for audio-modality understanding.
  3. DIFFA is the first diffusion-based large audio-language model for spoken language understanding.
  4. DIFFA combines a frozen diffusion language model with a lightweight dual-adapter architecture.
  5. DIFFA uses a two-stage training pipeline: semantic-representation alignment followed by instruction-following learning.
  6. DIFFA performs competitively on major benchmarks, outperforming several autoregressive open-source baselines.

Cool Papers

Click here to view paper screenshots

Conversational Intent-Driven GraphRAG: Enhancing Multi-Turn Dialogue Systems through Adaptive Dual-Retrieval of Flow Patterns and Context Semantics

Authors:Ziqi Zhu, Tao Hu, Honglong Zhang, Dan Yang, HanGeng Chen, Mengran Zhang, Xilun Chen

We present CID-GraphRAG (Conversational Intent-Driven Graph Retrieval Augmented Generation), a novel framework that addresses the limitations of existing dialogue systems in maintaining both contextual coherence and goal-oriented progression in multi-turn customer service conversations. Unlike traditional RAG systems that rely solely on semantic similarity (Conversation RAG) or standard knowledge graphs (GraphRAG), CID-GraphRAG constructs dynamic intent transition graphs from historical dialogues that achieved their goals and implements a dual-retrieval mechanism that adaptively balances intent-based graph traversal with semantic search. This approach enables the system to simultaneously leverage both conversational intent flow patterns and contextual semantics, significantly improving retrieval quality and response quality. In extensive experiments on real-world customer service dialogues, we employ both automatic metrics and LLM-as-judge assessments, demonstrating that CID-GraphRAG significantly outperforms both semantic-based Conversation RAG and intent-based GraphRAG baselines across all evaluation criteria. Quantitatively, CID-GraphRAG demonstrates substantial improvements over Conversation RAG across automatic metrics, with relative gains of 11% in BLEU, 5% in ROUGE-L, 6% in METEOR, and most notably, a 58% improvement in response quality according to LLM-as-judge evaluations. These results demonstrate that the integration of intent transition structures with semantic retrieval creates a synergistic effect that neither approach achieves independently, establishing CID-GraphRAG as an effective framework for addressing the challenges of maintaining contextual coherence and goal-oriented progression in knowledge-intensive multi-turn dialogues.
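
The dual-retrieval mechanism can be pictured as mixing two scores per candidate: one from the intent-transition graph and one from semantic similarity. The toy sketch below uses transition frequencies and lexical overlap as stand-ins for the real graph statistics and embedding retriever, with a fixed mixing weight `alpha`, whereas the full framework adapts the balance per query.

```python
from collections import Counter
import math

# Hypothetical intent-transition statistics mined from goal-achieved dialogues:
# current intent -> Counter of observed next intents.
intent_graph = {
    "report_issue": Counter({"ask_order_id": 8, "apologize": 2}),
    "ask_order_id": Counter({"check_status": 9, "escalate": 1}),
}

def graph_score(current_intent: str, candidate_intent: str) -> float:
    nxt = intent_graph.get(current_intent, Counter())
    total = sum(nxt.values())
    return nxt[candidate_intent] / total if total else 0.0

def semantic_score(query: str, candidate_text: str) -> float:
    """Toy lexical-overlap similarity standing in for an embedding-based retriever."""
    q, c = set(query.lower().split()), set(candidate_text.lower().split())
    return len(q & c) / math.sqrt(len(q) * len(c)) if q and c else 0.0

def dual_retrieve(query, current_intent, candidates, alpha=0.5):
    """Rank candidates by a weighted mix of intent-flow and semantic scores."""
    return sorted(
        candidates,
        key=lambda c: alpha * graph_score(current_intent, c["intent"])
        + (1 - alpha) * semantic_score(query, c["text"]),
        reverse=True,
    )

candidates = [
    {"intent": "ask_order_id", "text": "Could you share your order number so I can look into it?"},
    {"intent": "escalate", "text": "I will transfer you to a specialist."},
]
print(dual_retrieve("my package never arrived", "report_issue", candidates)[0]["intent"])
```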


Paper and Project Links

PDF

Summary

CID-GraphRAG (Conversational Intent-Driven Graph Retrieval-Augmented Generation) is a new framework that addresses the limitations of existing dialogue systems in maintaining contextual coherence and goal-oriented progression across multi-turn customer-service conversations. Unlike RAG systems that rely only on semantic similarity or on standard knowledge graphs, CID-GraphRAG builds dynamic intent-transition graphs from dialogues that achieved their goals and uses a dual-retrieval mechanism that adaptively balances intent-based graph traversal with semantic search, allowing the system to exploit both intent-flow patterns and contextual semantics. In extensive experiments on real customer-service dialogues, CID-GraphRAG significantly outperforms both the semantic-based Conversation RAG and the intent-based GraphRAG baselines, with relative gains of 11% in BLEU, 5% in ROUGE-L, 6% in METEOR, and a 58% improvement in response quality under LLM-as-judge evaluation. The results indicate that combining intent-transition structure with semantic retrieval yields a synergy neither approach achieves alone.

Key Takeaways

  1. CID-GraphRAG is a new framework for improving multi-turn dialogue systems.
  2. It builds dynamic intent-transition graphs, combining conversational intent with contextual semantics.
  3. A dual-retrieval mechanism adaptively balances intent-based graph traversal with semantic search.
  4. Compared with traditional RAG systems and knowledge graphs, it better maintains contextual coherence and goal-oriented progression.
  5. Experiments on real customer-service dialogues show strong results on both automatic metrics and LLM-as-judge evaluation.
  6. The clear gains in response quality come from combining intent-transition structure with semantic retrieval.
  7. The framework addresses the challenge of maintaining contextual coherence and goal-oriented progression in knowledge-intensive multi-turn dialogues.

Cool Papers

Click here to view paper screenshots

Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation

Authors:Pengchao Feng, Ziyang Ma, Wenxi Chen, Yao Li, Sheng Wang, Kai Yu, Xie Chen

End-to-end speech-to-speech (S2S) dialogue systems have recently garnered increasing research attention for their lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration of information. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind the SOTA cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. Our code and dataset are released.
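
The key mechanic is retrieving textual passages directly from a speech query, which presupposes a speech encoder and a text encoder that share one retrieval space. The sketch below shows only that mechanic with stub encoders (random projections), so the retrieved passages are not semantically meaningful; in the real system the two encoders would be trained for cross-modal alignment, and the retrieved text would be fed to the speech-to-speech model as context.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_speech(waveform: np.ndarray) -> np.ndarray:
    """Stub speech-query encoder; a trained model would map audio into the shared retrieval space."""
    return rng.normal(size=256)

def encode_text(passage: str) -> np.ndarray:
    """Stub text-passage encoder sharing the retrieval space with the speech encoder."""
    return rng.normal(size=256)

def retrieve(speech_query: np.ndarray, passages: list, k: int = 2) -> list:
    """Return the k passages whose embeddings are most similar to the speech-query embedding."""
    q = encode_speech(speech_query)
    embs = np.stack([encode_text(p) for p in passages])
    sims = embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q))
    return [passages[i] for i in np.argsort(-sims)[:k]]

knowledge = [
    "The Eiffel Tower is 330 m tall.",
    "Tokyo hosted the 2020 Olympics.",
    "Python was created by Guido van Rossum.",
]
print(retrieve(speech_query=np.zeros(16000), passages=knowledge))
```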


Paper and Project Links

PDF Accepted to EMNLP 2025 Findings

Summary

End-to-end speech-to-speech (S2S) dialogue systems have attracted growing attention for their lower latency and more natural handling of nonverbal cues such as emotion and speaker identity. However, they struggle to incorporate external knowledge, largely because of the modality gap between input speech and retrieved textual knowledge. The paper proposes an end-to-end retrieval-augmented generation (RAG) framework that retrieves relevant textual knowledge directly from speech queries. Experiments show clear performance gains for end-to-end S2S dialogue systems along with higher retrieval efficiency. Although overall performance still lags behind state-of-the-art cascaded models, the framework offers a promising direction for knowledge integration in end-to-end S2S systems.

Key Takeaways

  1. End-to-end speech-to-speech (S2S) dialogue systems are attracting research attention for their low latency and natural handling of nonverbal cues.
  2. The main challenge is incorporating external knowledge, particularly across the modality gap between speech and text.
  3. A new end-to-end retrieval-augmented generation (RAG) framework retrieves relevant textual knowledge directly from speech queries.
  4. Experiments show the method improves both the performance and the retrieval efficiency of S2S dialogue systems.
  5. Although overall performance still lags behind state-of-the-art cascaded models, the framework offers a promising direction for knowledge integration.
  6. The code and dataset have been released.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!