
Interactive


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on them in serious academic settings; they are only meant for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-08-02

Text-to-SQL Task-oriented Dialogue Ontology Construction

Authors:Renato Vukovic, Carel van Niekerk, Michael Heck, Benjamin Ruppik, Hsien-Chin Lin, Shutong Feng, Nurul Lubis, Milica Gasic

Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.
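As a rough illustration of the Text-to-SQL idea described in the abstract, the hedged Python sketch below prompts an LLM to emit SQL DDL that grows an ontology schema turn by turn and executes it in SQLite. The prompt wording, the `llm_complete` placeholder, and the example domain are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the Text-to-SQL ontology-construction idea behind TeQoDO.
# The prompt, the placeholder LLM call, and the example domain are assumptions,
# not the authors' implementation.
import sqlite3

def llm_complete(prompt: str) -> str:
    """Placeholder for an LLM call; replace with a real API client.
    Here it returns a canned response so the sketch runs end to end."""
    return (
        "CREATE TABLE IF NOT EXISTS restaurant ("
        "name TEXT, area TEXT, pricerange TEXT, food TEXT);"
    )

def build_ontology(dialogues, db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    for turns in dialogues:
        # show the current schema so the model can extend it incrementally
        schema = "\n".join(
            row[0] for row in conn.execute(
                "SELECT sql FROM sqlite_master WHERE type='table'"
            ) if row[0]
        )
        prompt = (
            "You maintain a task-oriented dialogue ontology as a SQL schema.\n"
            f"Current schema:\n{schema or '(empty)'}\n"
            f"Dialogue:\n{turns}\n"
            "Emit SQL DDL (CREATE TABLE / ALTER TABLE) that adds any new "
            "domains as tables and slots as columns. Output SQL only."
        )
        ddl = llm_complete(prompt)
        conn.executescript(ddl)          # grow the ontology incrementally
    return conn

conn = build_ontology([["I need a cheap italian restaurant in the centre."]])
print(conn.execute("SELECT sql FROM sqlite_master").fetchall())
```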


Paper and Project Links

PDF

Summary

Large language models (LLMs) are widely used as general-purpose knowledge sources, but their reliance on parametric knowledge limits explainability and trustworthiness. Task-oriented dialogue (TOD) systems instead use an external database structured by an explicit ontology to ensure explainability and controllability. This paper introduces TeQoDO, a Text-to-SQL task-oriented dialogue ontology construction method in which an LLM builds a TOD ontology autonomously, without supervision, by combining its inherent SQL programming capabilities with dialogue theory provided in the prompt. Experiments show that TeQoDO outperforms transfer learning approaches and that its constructed ontology is competitive on a downstream dialogue state tracking task. TeQoDO also scales to much larger ontologies, investigated on Wikipedia and ArXiv datasets, and is viewed as a step toward broader use of ontologies to improve LLM explainability.

Key Takeaways

  1. Large language models (LLMs) are widely used as general-purpose knowledge sources but lack explainability and trustworthiness.
  2. Task-oriented dialogue (TOD) systems improve explainability and controllability by using an external database structured by an explicit ontology.
  3. TeQoDO is a new task-oriented dialogue ontology construction method that builds TOD ontologies autonomously, without human supervision.
  4. TeQoDO combines the LLM's SQL programming capabilities with dialogue theory provided in the prompt.
  5. TeQoDO outperforms transfer learning approaches on ontology construction.
  6. The constructed ontology is competitive on downstream dialogue state tracking, and TeQoDO scales to much larger ontology construction.

Cool Papers

Click here to view paper screenshots

Real-time Generation of Various Types of Nodding for Avatar Attentive Listening System

Authors:Kazushi Kato, Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara

In human dialogue, nonverbal information such as nodding and facial expressions is as crucial as verbal information, and spoken dialogue systems are also expected to express such nonverbal behaviors. We focus on nodding, which is critical in an attentive listening system, and propose a model that predicts both its timing and type in real time. The proposed model builds on the voice activity projection (VAP) model, which predicts voice activity from both listener and speaker audio. We extend it to prediction of various types of nodding in a continuous and real-time manner unlike conventional models. In addition, the proposed model incorporates multi-task learning with verbal backchannel prediction and pretraining on general dialogue data. In the timing and type prediction task, the effectiveness of multi-task learning was significantly demonstrated. We confirmed that reducing the processing rate enables real-time operation without a substantial drop in accuracy, and integrated the model into an avatar attentive listening system. Subjective evaluations showed that it outperformed the conventional method, which always does nodding in sync with verbal backchannel. The code and trained models are available at https://github.com/MaAI-Kyoto/MaAI.
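To make the multi-task setup above concrete, here is a minimal, hedged PyTorch sketch of a shared audio encoder with separate heads for nod type and verbal backchannel prediction. The feature dimensions, nod-type inventory, and architecture are placeholder assumptions and differ from the released VAP-based model in the MaAI repository.

```python
# Minimal multi-task sketch: frame-level features from listener and speaker audio
# feed a shared encoder with two heads (nod type, verbal backchannel).
# Dimensions, losses, and the nod-type inventory are assumptions; this is not
# the released VAP-based model (see the MaAI repo for that).
import torch
import torch.nn as nn

class NodBackchannelPredictor(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, n_nod_types=4):
        super().__init__()
        # shared encoder over concatenated listener/speaker frame features
        self.encoder = nn.GRU(feat_dim * 2, hidden, batch_first=True)
        self.nod_head = nn.Linear(hidden, n_nod_types)   # e.g. none/small/deep/repeated
        self.bc_head = nn.Linear(hidden, 1)              # verbal backchannel (binary)

    def forward(self, listener_feats, speaker_feats):
        x = torch.cat([listener_feats, speaker_feats], dim=-1)  # (B, T, 2*feat_dim)
        h, _ = self.encoder(x)
        return self.nod_head(h), self.bc_head(h)

model = NodBackchannelPredictor()
B, T = 2, 50                                # batch of 2 clips, 50 frames each
nod_logits, bc_logits = model(torch.randn(B, T, 80), torch.randn(B, T, 80))
loss = (nn.CrossEntropyLoss()(nod_logits.reshape(-1, 4), torch.randint(0, 4, (B * T,)))
        + nn.BCEWithLogitsLoss()(bc_logits.squeeze(-1), torch.randint(0, 2, (B, T)).float()))
loss.backward()                             # joint multi-task objective
```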


Paper and Project Links

PDF Accepted by 27th ACM International Conference on Multimodal Interaction (ICMI ‘25), Long paper

Summary

This paper presents a real-time nodding prediction model built on the voice activity projection (VAP) model. The model predicts not only the timing but also the type of nodding, which is critical for an attentive listening system. With multi-task learning (joint verbal backchannel prediction) and pretraining on general dialogue data, prediction performance improves significantly. The study also shows that reducing the processing rate enables real-time operation without a substantial loss of accuracy. Integrated into an avatar attentive listening system, the model outperforms the conventional method in subjective evaluations.

Key Takeaways

  1. Nonverbal information such as nodding and facial expressions is crucial in human dialogue.
  2. The real-time nodding prediction model is built on the voice activity projection (VAP) model.
  3. The model predicts both the timing and the type of nodding.
  4. Multi-task learning and pretraining significantly improve prediction performance.
  5. Reducing the processing rate enables real-time operation without compromising accuracy.
  6. Integrated into an avatar attentive listening system, the model outperforms the conventional method in subjective evaluations.

Cool Papers

Click here to view paper screenshots

User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal

Authors:Yuhan Liu, Michael J. Q. Zhang, Eunsol Choi

Once language models (LMs) are deployed, they can interact with users long-term, ideally evolving continuously based on their feedback. Asking for direct user feedback can be disruptive; thus, we study harvesting user feedback from user-LM interaction logs. We study implicit user feedback in two user-LM interaction datasets (WildChat and LMSYS). First, we analyze user feedback in the user-LLM conversation trajectory, providing insights into when and why such feedback occurs. Second, we study harvesting learning signals from such implicit user feedback. We find that the contents of user feedback (e.g., user wanted clarification), not just the polarity (e.g., users were unhappy with the previous model response), can improve model performance in short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). We also find that the usefulness of user feedback is largely tied to the quality of the user’s initial prompt. Together, we provide an in-depth study of implicit user feedback, showing its potential and limitations.
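The following hedged sketch illustrates one simple way implicit feedback could be mined from interaction logs: each user turn that follows an assistant turn is tagged with a coarse feedback type via keyword heuristics. The categories and patterns are illustrative assumptions, not the paper's annotation scheme.

```python
# Hedged sketch of mining implicit feedback from user-LM logs: label each follow-up
# user turn with a coarse feedback type via keyword heuristics. The categories and
# patterns are illustrative assumptions, not the paper's annotation scheme.
import re

FEEDBACK_PATTERNS = {
    "negative":      r"\b(wrong|incorrect|that's not|doesn't work)\b",
    "clarification": r"\b(what do you mean|can you clarify|i meant|to clarify)\b",
    "revision":      r"\b(instead|rewrite|shorter|longer|more detail|try again)\b",
    "positive":      r"\b(thanks|great|perfect|that works)\b",
}

def label_feedback(conversation):
    """conversation: list of (role, text) tuples in order."""
    labels = []
    for i in range(2, len(conversation)):
        role, text = conversation[i]
        if role != "user" or conversation[i - 1][0] != "assistant":
            continue
        kinds = [k for k, pat in FEEDBACK_PATTERNS.items()
                 if re.search(pat, text.lower())]
        labels.append((i, kinds or ["none"]))
    return labels

convo = [("user", "Summarize this article."),
         ("assistant", "Here is a 500-word summary..."),
         ("user", "That's not what I asked, make it shorter instead.")]
print(label_feedback(convo))   # [(2, ['negative', 'revision'])]
```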


Paper and Project Links

PDF Earlier version of this paper was presented at 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA), ICML 2025

Summary

Deployed language models can keep evolving through long-term interaction with users. This study harvests implicit user feedback from user-LM interaction logs, analyzing the WildChat and LMSYS datasets in depth. It finds that the content of user feedback (not just its polarity) can improve model performance on short, human-designed questions (MTBench), but the effect is limited on longer, more complex questions (WildBench). Moreover, the usefulness of user feedback is closely tied to the quality of the user's initial prompt.

Key Takeaways

  1. Language models can continue to evolve through long-term interaction with users.
  2. User feedback can be harvested from logs of user-LM interactions.
  3. The content of user feedback plays an important role in improving model performance, especially on short questions.
  4. For longer, more complex questions, the benefit of user feedback is limited.
  5. The usefulness of user feedback depends on the quality of the user's initial prompt.
  6. User feedback contains detailed signals such as requests for clarification.

Cool Papers

Click here to view paper screenshots

Listening to the Unspoken: Exploring 365 Aspects of Multimodal Interview Performance Assessment

Authors:Jia Li, Yang Wang, Wenhao Qian, Zhenzhen Hu, Richang Hong, Meng Wang

Interview performance assessment is essential for determining candidates' suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores "365" aspects of interview performance by integrating three modalities (video, audio, and text), six responses per candidate, and five key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams, which are subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at https://github.com/MSA-LMC/365Aspects.
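A schematic PyTorch sketch of the fusion-and-ensemble pipeline described above follows: modality embeddings are compressed into a shared latent space, a regression head scores each response, and the scores are mean-pooled into five dimensions. All sizes and the placeholder inputs are assumptions; the actual implementation is in the MSA-LMC/365Aspects repository.

```python
# Schematic sketch of the fusion-and-ensemble idea: modality embeddings are
# compressed into a shared latent space, a regression head scores each of the six
# responses, and scores are mean-pooled into five dimensions. All sizes are
# placeholder assumptions, not the authors' configuration.
import torch
import torch.nn as nn

class InterviewScorer(nn.Module):
    def __init__(self, d_video=512, d_audio=256, d_text=768, d_latent=256, n_dims=5):
        super().__init__()
        # shared compression MLP over concatenated modality embeddings
        self.compress = nn.Sequential(
            nn.Linear(d_video + d_audio + d_text, d_latent), nn.ReLU(),
            nn.Linear(d_latent, d_latent), nn.ReLU(),
        )
        self.head = nn.Linear(d_latent, n_dims)  # per-response regression head

    def forward(self, video, audio, text):
        # each input: (batch, n_responses, d_modality)
        z = self.compress(torch.cat([video, audio, text], dim=-1))
        per_response = self.head(z)              # (batch, n_responses, 5)
        return per_response.mean(dim=1)          # mean-pool over responses

model = InterviewScorer()
scores = model(torch.randn(4, 6, 512), torch.randn(4, 6, 256), torch.randn(4, 6, 768))
print(scores.shape)  # torch.Size([4, 5]) -> five dimension scores per candidate
```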


Paper and Project Links

PDF 8 pages, 4 figures, ACM MM 2025. github:https://github.com/MSA-LMC/365Aspects

Summary

This paper proposes a novel interview performance assessment framework that integrates three modalities (video, audio, and text) and six responses per candidate to evaluate five key dimensions of interview performance. Modality-specific feature extractors encode the heterogeneous data streams, which are fused via a Shared Compression Multilayer Perceptron, and a two-level ensemble learning strategy improves prediction robustness. The framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness for automated, multimodal interview performance assessment.

Key Takeaways

  1. The proposed interview assessment framework fuses three modalities (video, audio, and text) for a holistic evaluation.
  2. Each candidate is evaluated over six responses.
  3. Data from the different modalities are encoded and fused into a unified latent space via a Shared Compression Multilayer Perceptron.
  4. A two-level ensemble learning strategy improves prediction accuracy and robustness.
  5. The framework captures both explicit and implicit cues in interviews, enabling objective assessment.
  6. It won first place in the AVI Challenge 2025, demonstrating its effectiveness.

Cool Papers

Click here to view paper screenshots

Measurement of production branching ratio after muon nuclear capture reaction of Al and Si isotopes

Authors:R. Mizuno, M. Niikura, T. Y. Saito, T. Matsuzaki, S. Abe, H. Fukuda, M. Hashimoto, A. Hillier, K. Ishida, N. Kawamura, S. Kawase, T. Kawata, K. Kitafuji, F. Minato, M. Oishi, A. Sato, K. Shimomura, P. Strasser, S. Takeshita, D. Tomono, Y. Watanabe

Background: Muon nuclear capture is a reaction between a muon and a proton inside a nucleus through weak interactions. This reaction results in the formation of an excited nucleus, which subsequently de-excites by emitting several particles. Examination of the excited state allows for an investigation of the properties of nuclear excitation and particle emission in highly excited nuclei. Purpose: This study investigates muon nuclear capture of 27Al and 28,29,30Si, focusing on determining the absolute production branching ratio (BR) following muon nuclear capture and subsequent particle emissions. By measuring the absolute production BR, we can collect valuable information on the excitation energy distribution of muon nuclear capture. Methods: Measurements were conducted using the in-beam activation method at two pulsed muon facilities: RIKEN-RAL at RAL and MLF at JPARC. Absolute BRs were determined by measuring the irradiated muon number using a plastic scintillator and the β-delayed γ-rays emitted from the produced nuclei using germanium detectors. Results: The absolute production branching ratios of muon nuclear capture on 27Al and 28,29,30Si were obtained with the highest accuracy to date. Predominant neutron emissions, even-odd atomic number dependence on particle emission probabilities, and influence of the neutron excess were observed. These results were compared with previous measurements and theoretical models and discussed regarding the excitation energy distribution, particle emission mechanism, and nuclear properties, such as resonance in the isovector transition. Conclusion: This study emphasizes the importance of considering nuclear structure effects, even-odd effects of proton and neutron numbers, neutron excess, nucleon pairing effect, and particle emission mechanisms, in the context of the muon nuclear capture reaction.
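For intuition, the hedged sketch below shows a schematic branching-ratio estimate in the spirit of the in-beam activation method: the observed β-delayed γ-ray counts are divided by the number of nuclear captures, the γ-emission intensity, and the detector efficiency. All numbers are illustrative, and decay-timing and dead-time corrections are deliberately ignored; none of this reproduces the paper's analysis.

```python
# Schematic branching-ratio estimate: BR ~ N_gamma / (N_captures * I_gamma * eps_det).
# The values, the neglect of decay-timing and dead-time corrections, and the
# capture-probability handling are simplifying assumptions for illustration only.
def production_branching_ratio(n_gamma, n_muons_stopped, capture_prob,
                               gamma_intensity, det_efficiency):
    """Fraction of muon captures that populate the nucleus emitting this gamma line."""
    n_captures = n_muons_stopped * capture_prob          # muons undergoing nuclear capture
    return n_gamma / (n_captures * gamma_intensity * det_efficiency)

br = production_branching_ratio(
    n_gamma=1.2e4,            # observed beta-delayed gamma counts (illustrative)
    n_muons_stopped=5.0e8,    # muons counted by the plastic scintillator
    capture_prob=0.61,        # nuclear-capture probability vs. decay (illustrative)
    gamma_intensity=0.9,      # gamma emission probability per decay
    det_efficiency=0.01,      # germanium detector photopeak efficiency
)
print(f"branching ratio ~ {br:.3f}")
```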


Paper and Project Links

PDF

Summary

This work studies the muon nuclear capture reaction, focusing on the properties of the excited nuclei it produces and the subsequent particle emission. The absolute production branching ratios (BRs) for capture on different nuclei were determined at pulsed muon facilities by counting the irradiated muons and measuring the β-delayed γ-rays from the produced nuclei with germanium detectors. The results reveal predominant neutron emission, an even-odd atomic-number dependence of the particle emission probabilities, and an influence of neutron excess, and they are important for understanding the excitation energy distribution, particle emission mechanisms, and nuclear structure effects.

Key Takeaways

  • Muon nuclear capture is an important probe of the properties and particle emission of highly excited nuclei.
  • Measuring the absolute production branching ratio (BR) provides information on the excitation energy distribution of muon nuclear capture.
  • Neutron emission dominates, particle emission probabilities depend on the even-odd character of the atomic number, and neutron excess also affects the reaction outcome.
  • These findings help clarify core questions about the excitation energy distribution, particle emission mechanisms, and nuclear structure effects.
  • Comparison with previous measurements and theoretical models lays the groundwork for more detailed future studies.

Cool Papers

Click here to view paper screenshots

DIFFA: Large Language Diffusion Models Can Listen and Understand

Authors:Jiaming Zhou, Hongjie Chen, Shiwan Zhao, Jian Kang, Jie Li, Enzhi Wang, Yujie Guo, Haoqin Sun, Hui Wang, Aobo Kong, Yong Qin, Xuelong Li

Recent advances in Large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce \textbf{DIFFA}, the first diffusion-based Large Audio-Language Model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs automatically generated by prompting LLMs. Despite being trained on only 960 hours of ASR and 127 hours of synthetic instruction data, DIFFA demonstrates competitive performance on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. Our results reveal the potential of diffusion-based language models for efficient and scalable audio understanding, opening a new direction for speech-driven AI. Our code will be available at https://github.com/NKU-HLT/DIFFA.git.
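The hedged sketch below illustrates the dual-adapter idea at a very high level: two lightweight modules project speech-encoder features into the embedding space of a frozen language model, and only the adapters would be trained. The shapes, the convolutional downsampling, and the refinement MLP are assumptions, not the DIFFA architecture (see NKU-HLT/DIFFA for the actual code).

```python
# Minimal sketch of a dual-adapter bridge: a frozen language model receives speech
# features projected into its embedding space by two lightweight adapters. The
# adapter shapes and the pooling are placeholder assumptions, not DIFFA's design.
import torch
import torch.nn as nn

class DualAdapter(nn.Module):
    def __init__(self, d_speech=1024, d_model=2048, stride=4):
        super().__init__()
        # adapter 1: downsample speech frames to a coarser token rate
        self.downsample = nn.Conv1d(d_speech, d_model, kernel_size=stride, stride=stride)
        # adapter 2: refine projected tokens before they enter the frozen LM
        self.refine = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                    nn.Linear(d_model, d_model))

    def forward(self, speech_feats):           # (B, T, d_speech)
        x = self.downsample(speech_feats.transpose(1, 2)).transpose(1, 2)
        return self.refine(x)                  # (B, T//stride, d_model)

frozen_lm_embedding_dim = 2048
adapter = DualAdapter(d_model=frozen_lm_embedding_dim)
speech_tokens = adapter(torch.randn(2, 100, 1024))
print(speech_tokens.shape)  # torch.Size([2, 25, 2048]); only the adapter is trained
```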


Paper and Project Links

PDF

Summary

Diffusion-based language models have emerged as a promising alternative to autoregressive LLMs in textual and multimodal domains. This work introduces DIFFA, the first diffusion-based large audio-language model for spoken language understanding. DIFFA combines a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. Training follows a two-stage pipeline: semantic representations are first aligned via an ASR objective, and instruction-following is then learned from synthetic audio-caption pairs generated automatically by prompting LLMs. Despite being trained on only 960 hours of ASR data and 127 hours of synthetic instruction data, DIFFA is competitive on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. The results point to the potential of diffusion-based language models for efficient, scalable audio understanding and open a new direction for speech-driven AI.

Key Takeaways

  1. DIFFA is the first diffusion-based large audio-language model designed for spoken language understanding.
  2. DIFFA combines a frozen diffusion language model with a lightweight dual-adapter architecture.
  3. DIFFA is trained in two stages: semantic alignment via an ASR objective, followed by learning instruction-following abilities.
  4. Instruction-following is learned from synthetic audio-caption pairs generated automatically by prompting LLMs.
  5. DIFFA performs competitively on major benchmarks, including MMSU, MMAU, and VoiceBench.
  6. DIFFA outperforms several autoregressive open-source baselines, showing the potential of diffusion models for audio understanding.

Cool Papers

Click here to view paper screenshots

Watch, Listen, Understand, Mislead: Tri-modal Adversarial Attacks on Short Videos for Content Appropriateness Evaluation

Authors:Sahid Hossain Mustakim, S M Jishanul Islam, Ummay Maria Muna, Montasir Chowdhury, Mohammed Jawwadul Islam, Sadia Ahmmed, Tashfia Sikder, Syed Tasdid Azam Dhrubo, Swakkhar Shatabda

Multimodal Large Language Models (MLLMs) are increasingly used for content moderation, yet their robustness in short-form video contexts remains underexplored. Current safety evaluations often rely on unimodal attacks, failing to address combined attack vulnerabilities. In this paper, we introduce a comprehensive framework for evaluating the tri-modal safety of MLLMs. First, we present the Short-Video Multimodal Adversarial (SVMA) dataset, comprising diverse short-form videos with human-guided synthetic adversarial attacks. Second, we propose ChimeraBreak, a novel tri-modal attack strategy that simultaneously challenges visual, auditory, and semantic reasoning pathways. Extensive experiments on state-of-the-art MLLMs reveal significant vulnerabilities with high Attack Success Rates (ASR). Our findings uncover distinct failure modes, showing model biases toward misclassifying benign or policy-violating content. We assess results using LLM-as-a-judge, demonstrating attack reasoning efficacy. Our dataset and findings provide crucial insights for developing more robust and safe MLLMs.
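On the evaluation side, the hedged sketch below shows how an Attack Success Rate (ASR) could be computed from per-sample moderation verdicts before and after perturbation, counting only samples the model originally judged correctly. The record format and helper names are illustrative assumptions, not the paper's tooling.

```python
# Hedged sketch of computing Attack Success Rate (ASR) from per-sample moderation
# verdicts before and after an attack. The verdict format and helper names are
# illustrative assumptions, not the paper's evaluation code.
from dataclasses import dataclass

@dataclass
class Sample:
    video_id: str
    ground_truth: str        # "benign" or "violating"
    verdict_clean: str       # model verdict on the clean video
    verdict_attacked: str    # model verdict on the adversarially perturbed video

def attack_success_rate(samples):
    """An attack 'succeeds' if a previously correct verdict flips after perturbation."""
    attacked = [s for s in samples if s.verdict_clean == s.ground_truth]
    if not attacked:
        return 0.0
    flipped = sum(s.verdict_attacked != s.ground_truth for s in attacked)
    return flipped / len(attacked)

samples = [
    Sample("v1", "violating", "violating", "benign"),    # attack flipped the verdict
    Sample("v2", "benign", "benign", "benign"),           # attack failed
    Sample("v3", "violating", "benign", "benign"),        # already wrong -> excluded
]
print(f"ASR = {attack_success_rate(samples):.2f}")        # ASR = 0.50
```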


Paper and Project Links

PDF Accepted as long paper, SVU Workshop at ICCV 2025

Summary

This paper examines the use of multimodal large language models (MLLMs) for short-form video content moderation and the challenges they face. It presents a comprehensive framework for evaluating the tri-modal safety of MLLMs, comprising the Short-Video Multimodal Adversarial (SVMA) dataset and the novel ChimeraBreak tri-modal attack strategy. Experiments reveal significant vulnerabilities in MLLMs and highlight the need to strengthen their ability to distinguish benign from policy-violating content. The dataset and findings provide crucial insights for developing more robust and safer MLLMs.

Key Takeaways

  1. MLLMs are increasingly used for short-form video content moderation, but their robustness remains underexplored.
  2. Current safety evaluations rely mainly on unimodal attacks and fail to address combined attack vulnerabilities.
  3. The SVMA dataset contains diverse short videos with human-guided synthetic adversarial attacks.
  4. ChimeraBreak is a tri-modal attack strategy that simultaneously challenges visual, auditory, and semantic reasoning pathways.
  5. Experiments show that state-of-the-art MLLMs exhibit significant vulnerabilities with high Attack Success Rates (ASR).
  6. Models show biases toward misclassifying benign or policy-violating content.

Cool Papers

Click here to view paper screenshots

DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

Authors:Yichun Feng, Jiawei Wang, Lu Zhou, Zhen Lei, Yixue Li

Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges. Single-round consultation systems require patients to describe all symptoms upfront, leading to vague diagnosis with unclear complaints. Traditional multi-turn dialogue models, constrained by static supervised learning, lack flexibility and fail to intelligently extract key clinical information. To address these limitations, we propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy within the RL framework through multi-turn interactions with the patient agent, dynamically adjusting its information-gathering path based on comprehensive rewards from the Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we constructed MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that DoctorAgent-RL outperforms existing models in both multi-turn reasoning capability and final diagnostic performance. This approach shows immense practical value by reducing misdiagnosis risks in time-pressured settings, freeing clinicians for complex cases, and pioneering a strategy to optimize medical resource allocation and alleviate workforce shortages. Code and data are available at https://github.com/JarvisUSTC/DoctorAgent-RL
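As a toy illustration of the interaction loop described above, the hedged sketch below runs a stubbed doctor agent against a stubbed patient agent and scores the resulting consultation with a simple evaluator; the reward is what an RL algorithm would optimize. All three components are placeholders, not the DoctorAgent-RL implementation.

```python
# Toy sketch of the multi-turn loop: a doctor agent asks questions, a patient agent
# answers, and a consultation evaluator returns a reward that would drive RL updates.
# All three components are stubs; the real system fine-tunes an LLM policy with RL.
import random

def doctor_policy(history):
    """Stub policy: pick the next question; an RL-tuned LLM would generate it."""
    questions = ["How long have you had the symptoms?",
                 "Do you have a fever?", "Any relevant medical history?"]
    return questions[len(history) % len(questions)]

def patient_agent(question):
    return random.choice(["Yes.", "No.", "About three days."])

def consultation_evaluator(history, diagnosis):
    # Reward both information coverage and a (stubbed) diagnosis check.
    coverage = min(len(history), 3) / 3.0
    correct = 1.0 if diagnosis == "flu" else 0.0
    return 0.5 * coverage + 0.5 * correct

def run_episode(max_turns=3):
    history = []
    for _ in range(max_turns):
        q = doctor_policy(history)
        history.append((q, patient_agent(q)))
    diagnosis = "flu"                      # stub final diagnosis
    return history, consultation_evaluator(history, diagnosis)

history, reward = run_episode()
print(f"turns={len(history)}, reward={reward:.2f}")   # reward would drive RL updates
```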


Paper and Project Links

PDF

Summary

Large language models perform well on biomedical question answering, but their application to real-world clinical consultations still faces core challenges: single-round consultation systems produce vague diagnoses, and traditional multi-turn dialogue models lack flexibility. To address this, the authors propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy through multi-turn interactions with the patient agent, adjusting its information-gathering path based on comprehensive rewards from a Consultation Evaluator. This enables the LLM to develop interaction strategies aligned with clinical reasoning logic rather than merely imitating patterns in existing dialogue data. Experiments show that DoctorAgent-RL outperforms existing models in both multi-turn reasoning and final diagnostic performance, reducing misdiagnosis risk in time-pressured settings, freeing clinicians for complex cases, and offering a strategy for optimizing medical resource allocation and alleviating workforce shortages. Code and data are available at https://github.com/JarvisUSTC/DoctorAgent-RL.

Key Takeaways

  1. Large language models (LLMs) show excellent capabilities in biomedical question answering but see limited use in real-world clinical consultations.
  2. Single-round consultation systems lead to vague diagnoses, while traditional multi-turn dialogue models, constrained by static supervised learning, lack flexibility.
  3. DoctorAgent-RL is a reinforcement learning (RL)-based multi-agent collaborative framework that treats medical consultations as dynamic decision-making under uncertainty.
  4. The doctor agent optimizes its questioning strategy through multi-turn reinforcement-learning interactions with the patient agent.
  5. DoctorAgent-RL dynamically adjusts its information-gathering path based on comprehensive rewards, reflecting clinical reasoning logic.
  6. Experiments show that DoctorAgent-RL surpasses existing models in both multi-turn reasoning and final diagnostic performance.

Cool Papers

Click here to view paper screenshots

Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues

Authors:Alexander Scarlatos, Naiming Liu, Jaewook Lee, Richard Baraniuk, Andrew Lan

Generative artificial intelligence (AI) has the potential to scale up personalized tutoring through large language models (LLMs). Recent AI tutors are adapted for the tutoring task by training or prompting LLMs to follow effective pedagogical principles, though they are not trained to maximize student learning throughout the course of a dialogue. Therefore, they may engage with students in a suboptimal way. We address this limitation by introducing an approach to train LLMs to generate tutor utterances that maximize the likelihood of student correctness, while still encouraging the model to follow good pedagogical practice. Specifically, we generate a set of candidate tutor utterances and score them using (1) an LLM-based student model to predict the chance of correct student responses and (2) a pedagogical rubric evaluated by GPT-4o. We then use the resulting data to train an open-source LLM, Llama 3.1 8B, using direct preference optimization. We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses while maintaining the pedagogical quality of GPT-4o. We also conduct qualitative analyses and a human evaluation to demonstrate that our model generates high quality tutor utterances.
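The hedged sketch below illustrates one way the scored candidates could be turned into preference pairs for direct preference optimization (DPO): candidates are ranked by a combination of predicted student correctness and a rubric score, and clearly separated pairs become (prompt, chosen, rejected) triples. The scoring stubs and the 50/50 weighting are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of turning scored candidate tutor utterances into DPO preference
# pairs. The score combination and margin are illustrative assumptions.
from itertools import combinations

def combined_score(candidate):
    # candidate: dict with p_correct (student model) and rubric (pedagogy, 0-1)
    return 0.5 * candidate["p_correct"] + 0.5 * candidate["rubric"]

def build_dpo_pairs(prompt, candidates, margin=0.1):
    """Return (prompt, chosen, rejected) triples for clearly separated candidates."""
    pairs = []
    for a, b in combinations(candidates, 2):
        sa, sb = combined_score(a), combined_score(b)
        if abs(sa - sb) < margin:
            continue                      # skip near-ties; weak preference signal
        chosen, rejected = (a, b) if sa > sb else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen["text"],
                      "rejected": rejected["text"]})
    return pairs

candidates = [
    {"text": "What happens to the denominator if x doubles?", "p_correct": 0.7, "rubric": 0.9},
    {"text": "The answer is 42.",                              "p_correct": 0.8, "rubric": 0.2},
]
print(build_dpo_pairs("Student is stuck on a fractions problem.", candidates))
```

Such prompt/chosen/rejected triples match the format commonly expected by off-the-shelf DPO trainers.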


Paper and Project Links

PDF Published in AIED 2025: The 26th International Conference on Artificial Intelligence in Education

Summary
Generative artificial intelligence has great potential to scale personalized tutoring through large language models (LLMs), yet existing AI tutors are not trained to maximize student learning over the course of a dialogue. This paper proposes training LLMs to generate tutor utterances that maximize the likelihood of correct student responses while still encouraging good pedagogical practice. Candidate tutor utterances are generated and scored with an LLM-based student model and a pedagogical rubric evaluated by GPT-4o, and the resulting data is used to fine-tune Llama 3.1 8B with direct preference optimization. The resulting model's tutor utterances lead to significantly higher chances of correct student responses while maintaining the pedagogical quality of GPT-4o.

Key Takeaways

  1. Generative AI has great potential for personalized tutoring and can scale through large language models (LLMs).
  2. Current AI tutors are not trained to maximize student learning throughout a dialogue and may interact with students suboptimally.
  3. The paper scores candidate tutor utterances with an LLM-based student model and a pedagogical rubric evaluated by GPT-4o.
  4. Training the LLM on these scores yields tutor utterances that follow good pedagogy while maximizing the likelihood of correct student responses.
  5. Compared with existing approaches, the proposed model raises student correctness while maintaining the pedagogical quality of GPT-4o.
  6. Qualitative analyses and a human evaluation confirm that the model generates high-quality tutor utterances.

Cool Papers

Click here to view paper screenshots

Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

Authors:Kuofeng Gao, Shu-Tao Xia, Ke Xu, Philip Torr, Jindong Gu

Large Audio-Language Models (LALMs), such as GPT-4o, have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. The potential of LALMs broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, given these advancements, a comprehensive benchmark to evaluate the performance of LALMs in the open-ended audio dialogue understanding remains absent currently. To address this gap, we propose an Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability for LALMs in 3 general scenarios, 12 skills, 9 multilingual languages, and 4 categories of ambiguity handling. Notably, we firstly propose the evaluation of ambiguity handling in audio dialogues that expresses different intentions beyond the same literal meaning of sentences, e.g., “Really!?” with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments on 16 LALMs, our analysis reveals that existing LALMs struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities from different phonetic elements, such as intonations, pause positions, and homophones. The benchmark is available at https://adu-bench.github.io/.
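As a small, hedged illustration of how results on such a benchmark might be sliced, the sketch below averages judge scores along one axis at a time (skill, language, or ambiguity category). The record fields and scores are invented for illustration; the real data format is defined by ADU-Bench itself.

```python
# Hedged sketch of aggregating benchmark results along ADU-Bench-style axes
# (skills, languages, ambiguity categories). Record fields and judge scores are
# illustrative assumptions, not the benchmark's actual schema.
from collections import defaultdict

def aggregate(records, axis):
    """Mean judge score per value of one benchmark axis (e.g. 'skill' or 'language')."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r[axis]].append(r["judge_score"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

records = [
    {"skill": "math", "language": "English", "ambiguity": "intonation", "judge_score": 4.0},
    {"skill": "math", "language": "English", "ambiguity": "none",       "judge_score": 6.5},
    {"skill": "roleplay", "language": "French", "ambiguity": "homophone", "judge_score": 5.0},
]
print(aggregate(records, "skill"))      # {'math': 5.25, 'roleplay': 5.0}
print(aggregate(records, "ambiguity"))  # per-category ambiguity-handling scores
```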


Paper and Project Links

PDF Accepted by ACL 2025

Summary

Large audio-language models (LALMs) such as GPT-4o have unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. However, a comprehensive benchmark for evaluating LALMs on open-ended audio dialogue understanding has been missing. To fill this gap, the authors propose the Audio Dialogue Understanding Benchmark (ADU-Bench), consisting of four benchmark datasets that assess open-ended audio dialogue ability across 3 general scenarios, 12 skills, 9 languages, and 4 categories of ambiguity handling. Notably, it is the first to evaluate ambiguity handling in audio dialogues, where the same literal sentence expresses different intentions, e.g., "Really!?" spoken with different intonations. In total, ADU-Bench includes over 20,000 open-ended audio dialogues. Extensive experiments on 16 LALMs show that existing models struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling ambiguities arising from phonetic elements such as intonation, pause position, and homophones.

Key Takeaways

  1. Large audio-language models (LALMs) now support audio dialogue, enabling direct spoken exchanges with humans.
  2. A comprehensive benchmark for evaluating LALMs on open-ended audio dialogue understanding has been lacking.
  3. The proposed ADU-Bench comprises 4 benchmark datasets covering 3 scenarios, 12 skills, 9 languages, and 4 categories of ambiguity handling.
  4. ADU-Bench is the first to evaluate ambiguity handling in audio dialogues.
  5. LALMs still struggle with mathematical symbols and formulas, human behavior such as roleplay, multiple languages, and ambiguities in audio dialogues.
  6. ADU-Bench provides an important benchmark platform for evaluating LALMs.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!