⚠️ All of the summaries below are generated by a large language model; they may contain errors, are for reference only, and should be used with caution.
🔴 Please note: do not rely on these summaries for serious academic work; they are intended only as an initial screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-11
Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Authors:Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu
Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. Different from them, we propose \textbf{Parallel-R1}, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model’s thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a \textbf{mid-training exploration scaffold}, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at https://github.com/zhengkid/Parallel-R1.
Paper and Project Links
PDF Project website: https://zhengkid.github.io/Parallel_R1.github.io/
Summary
This paper describes how parallel thinking can improve the reasoning ability of large language models and the challenge of activating that ability through training. It proposes Parallel-R1, a reinforcement learning (RL) framework designed to instill parallel thinking for complex real-world reasoning tasks. The framework uses a progressive curriculum to address the cold-start problem of training parallel thinking with RL: supervised fine-tuning on prompt-generated trajectories from easier tasks first instills the ability, followed by RL to explore and generalize it on harder problems. Experiments show that Parallel-R1 successfully instills parallel thinking and improves reasoning accuracy, and that parallel thinking can serve as a temporary mid-training exploration scaffold that raises the performance ceiling of subsequent RL.
Key Takeaways
- Parallel thinking is a new approach to improving LLM reasoning that explores multiple reasoning paths concurrently.
- The Parallel-R1 reinforcement learning framework is the first to instill parallel-thinking behavior for complex real-world reasoning tasks.
- Parallel-R1 uses a progressive curriculum to address the cold-start problem: SFT on easier tasks first instills parallel thinking, followed by RL for exploration and generalization.
- Experiments show Parallel-R1 effectively improves reasoning accuracy, with an 8.4% gain over a sequential-thinking RL baseline.
- The model's behavior shifts during training: parallel thinking serves as an exploration strategy early on and as multi-perspective verification later.
- Used as a temporary mid-training exploration scaffold, parallel thinking unlocks a higher performance ceiling after RL, a 42.9% improvement over the baseline on AIME25.
Click here to view paper screenshots



Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Authors:Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao
Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning – spanning tens of steps – and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
Paper and Project Links
PDF Code, datasets, models are available at https://github.com/Mini-o3/Mini-o3. Project Page: https://mini-o3.github.io/
Summary
This paper addresses recent progress in combining large multimodal models, image-based tools, and reinforcement learning to solve visual problems. To overcome the monotonous reasoning patterns and limited interaction turns of existing open-source methods, the authors propose Mini-o3, a system that performs deep, multi-turn reasoning spanning tens of steps and achieves state-of-the-art performance on challenging visual search tasks. Its key ingredients are the Visual Probe Dataset, an iterative cold-start data collection pipeline, and an over-turn masking strategy. At inference time the model naturally scales to tens of turns, with accuracy improving as the number of turns grows.
Key Takeaways
- Recent multimodal models combine image-based tools with reinforcement learning to tackle visual problems.
- Existing methods suffer from monotonous reasoning patterns and a limited number of interaction turns.
- Mini-o3 solves challenging visual search tasks through deep, multi-turn reasoning.
- Mini-o3 introduces the Visual Probe Dataset, an iterative data collection pipeline, and an over-turn masking strategy.
- The model naturally scales to tens of reasoning turns at inference time.
- Accuracy improves as the number of reasoning turns increases.
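The over-turn masking strategy can be pictured with a short, hedged sketch: rollouts that stop only because they hit the maximum number of turns simply have their policy-gradient contribution masked out, so they are neither rewarded nor penalized. The tensor layout and the REINFORCE-style loss below are illustrative assumptions, not the paper's implementation.

```python
import torch

def masked_policy_loss(logprobs, advantages, hit_turn_cap, response_mask):
    """Schematic policy-gradient loss with over-turn masking.

    logprobs:      (B, T) token log-probabilities of sampled responses
    advantages:    (B,)   per-response advantages (e.g., group-normalized rewards)
    hit_turn_cap:  (B,)   bool, True where the rollout stopped only because it
                          reached the maximum number of interaction turns
    response_mask: (B, T) 1 for response tokens, 0 for padding
    """
    keep = (~hit_turn_cap).float()                       # drop over-turn rollouts from the loss
    per_token = -logprobs * advantages.unsqueeze(1)      # REINFORCE-style surrogate
    per_token = per_token * response_mask * keep.unsqueeze(1)
    denom = (response_mask * keep.unsqueeze(1)).sum().clamp(min=1.0)
    return per_token.sum() / denom
```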
Click here to view paper screenshots





Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
Authors:Boammani Aser Lompo, Marc Haraoui
Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting (‘inspiration’) and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset’s synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.
Paper and Project Links
PDF Work in Progress
Summary:
Visual-TableQA is a large-scale, open-domain multimodal dataset designed to evaluate and enhance visual reasoning over complex tabular data. The dataset contains 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs. Its generation pipeline is modular, scalable, and fully autonomous, with multiple reasoning large language models (LLMs) collaborating in distinct roles (generation, validation, and inspiration). Cross-model prompting and LLM-jury filtering promote diversity and creativity in data generation. Models fine-tuned on Visual-TableQA generalize robustly to external benchmarks and, despite the dataset's synthetic nature, outperform several proprietary models.
Key Takeaways:
- Visual-TableQA is a large-scale, open-domain multimodal dataset built to evaluate and enhance visual reasoning over complex tabular data.
- The dataset comprises richly structured LaTeX-rendered tables and reasoning-intensive QA pairs.
- The generation pipeline is modular, scalable, and autonomous, with multiple reasoning LLMs collaborating in generation, validation, and inspiration roles.
- Cross-model prompting and LLM-jury filtering promote diversity and creativity in data generation.
- Models fine-tuned on Visual-TableQA generalize robustly to external benchmarks.
- Although synthetic, the dataset enables performance that surpasses several proprietary models.
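A minimal sketch of the LLM-jury filtering step mentioned above: several judge models each vote on a candidate QA pair, and only items accepted by a majority are kept. The `Judge` callables and the vote threshold are hypothetical placeholders for illustration; the paper's actual prompts and acceptance criteria are not reproduced here.

```python
from typing import Callable, Dict, List

Judge = Callable[[Dict], bool]  # returns True if the judge accepts the QA item

def jury_filter(candidates: List[Dict], judges: List[Judge], min_votes: int = 2) -> List[Dict]:
    """Keep a candidate table/QA item only if at least `min_votes` judges accept it."""
    kept = []
    for item in candidates:
        votes = sum(1 for judge in judges if judge(item))
        if votes >= min_votes:
            kept.append(item)
    return kept

# Hypothetical usage: each judge wraps a different LLM prompted to check that the
# question is answerable from the rendered table and that the answer is correct.
# filtered = jury_filter(raw_pairs, [judge_a, judge_b, judge_c], min_votes=2)
```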
Click here to view paper screenshots





Graph-Fused Vision-Language-Action for Policy Reasoning in Multi-Arm Robotic Manipulation
Authors:Shunlei Li, Longsen Gao, Jiuwen Cao, Yingbai Hu
Acquiring dexterous robotic skills from human video demonstrations remains a significant challenge, largely due to conventional reliance on low-level trajectory replication, which often fails to generalize across varying objects, spatial layouts, and manipulator configurations. To address this limitation, we introduce Graph-Fused Vision-Language-Action (GF-VLA), a unified framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB-D human demonstrations. GF-VLA employs an information-theoretic approach to extract task-relevant cues, selectively highlighting critical hand-object and object-object interactions. These cues are structured into temporally ordered scene graphs, which are subsequently integrated with a language-conditioned transformer to produce hierarchical behavior trees and interpretable Cartesian motion primitives. To enhance efficiency in bimanual execution, we propose a cross-arm allocation strategy that autonomously determines gripper assignment without requiring explicit geometric modeling. We validate GF-VLA on four dual-arm block assembly benchmarks involving symbolic structure construction and spatial generalization. Empirical results demonstrate that the proposed representation achieves over 95% graph accuracy and 93% subtask segmentation, enabling the language-action planner to generate robust, interpretable task policies. When deployed on a dual-arm robot, these policies attain 94% grasp reliability, 89% placement accuracy, and 90% overall task success across stacking, letter-formation, and geometric reconfiguration tasks, evidencing strong generalization and robustness under diverse spatial and semantic variations.
Paper and Project Links
PDF This paper is submitted to IEEE IROS 2025 Workshop AIR4S
Summary
This paper introduces Graph-Fused Vision-Language-Action (GF-VLA), a unified framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB-D human demonstrations. GF-VLA uses an information-theoretic approach to extract task-relevant cues, structures them into temporally ordered scene graphs, and integrates these with a language-conditioned transformer to produce hierarchical behavior trees and interpretable Cartesian motion primitives. A cross-arm allocation strategy autonomously determines gripper assignment without explicit geometric modeling. GF-VLA is validated on four dual-arm block assembly benchmarks covering symbolic structure construction and spatial generalization, where it achieves high graph accuracy, subtask segmentation, grasp reliability, placement accuracy, and overall task success.
Key Takeaways
- The GF-VLA framework lets dual-arm robots learn and execute tasks at the task level directly from human video demonstrations.
- An information-theoretic approach extracts task-relevant cues, highlighting critical hand-object and object-object interactions.
- The extracted cues are structured into temporally ordered scene graphs and combined with a language-conditioned transformer to produce hierarchical behavior trees and interpretable Cartesian motion primitives.
- A cross-arm allocation strategy autonomously determines gripper assignment without explicit geometric modeling.
- GF-VLA is validated on multiple benchmarks involving symbolic structure construction and spatial generalization.
- Experiments show strong graph accuracy and subtask segmentation.
Click here to view paper screenshots






Guided Reasoning in LLM-Driven Penetration Testing Using Structured Attack Trees
Authors:Katsuaki Nakano, Reza Feyyazi, Shanchieh Jay Yang, Michael Zuzak
Recent advances in Large Language Models (LLMs) have driven interest in automating cybersecurity penetration testing workflows, offering the promise of faster and more consistent vulnerability assessment for enterprise systems. Existing LLM agents for penetration testing primarily rely on self-guided reasoning, which can produce inaccurate or hallucinated procedural steps. As a result, the LLM agent may undertake unproductive actions, such as exploiting unused software libraries or generating cyclical responses that repeat prior tactics. In this work, we propose a guided reasoning pipeline for penetration testing LLM agents that incorporates a deterministic task tree built from the MITRE ATT&CK Matrix, a proven penetration testing kill chain, to constrain the LLM's reasoning process to explicitly defined tactics, techniques, and procedures. This anchors reasoning in proven penetration testing methodologies and filters out ineffective actions by guiding the agent towards more productive attack procedures. To evaluate our approach, we built an automated penetration testing LLM agent using three LLMs (Llama-3-8B, Gemini-1.5, and GPT-4) and applied it to navigate 10 HackTheBox cybersecurity exercises with 103 discrete subtasks representing real-world cyberattack scenarios. Our proposed reasoning pipeline guided the LLM agent through 71.8%, 72.8%, and 78.6% of subtasks using Llama-3-8B, Gemini-1.5, and GPT-4, respectively. Comparatively, the state-of-the-art LLM penetration testing tool using self-guided reasoning completed only 13.5%, 16.5%, and 75.7% of subtasks and required 86.2%, 118.7%, and 205.9% more model queries. This suggests that incorporating a deterministic task tree into LLM reasoning pipelines can enhance the accuracy and efficiency of automated cybersecurity assessments.
Paper and Project Links
Summary
Recent advances in large language models (LLMs) have spurred interest in automating cybersecurity penetration testing workflows. However, existing LLM agents rely mainly on self-guided reasoning, which can produce inaccurate or hallucinated procedural steps. This paper proposes a guided reasoning pipeline for penetration testing LLM agents that uses a deterministic task tree built from the MITRE ATT&CK Matrix to constrain the LLM's reasoning to explicitly defined tactics, techniques, and procedures. By steering the agent toward more productive attack procedures, the approach anchors reasoning in proven penetration testing methodology and filters out ineffective actions. Evaluation shows that the proposed pipeline guides LLM agents through substantially more cybersecurity exercise subtasks than a state-of-the-art self-guided tool, while requiring far fewer model queries.
Key Takeaways
- Large language models (LLMs) show promise for automating cybersecurity penetration testing.
- Existing LLM agents for penetration testing suffer from inaccurate self-guided reasoning.
- Introducing a deterministic task tree markedly improves the accuracy and efficiency of LLM agents in penetration testing.
- The guided reasoning pipeline is built from the MITRE ATT&CK Matrix and follows explicitly defined tactics, techniques, and procedures.
- The pipeline substantially raises the fraction of cybersecurity exercise subtasks that LLM agents complete.
- Compared with self-guided LLM penetration testing tools, agents using the deterministic task tree are more accurate and efficient.
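The deterministic task tree can be sketched as a small data structure that restricts the agent's next action to the children of its current node, so that at each step the LLM only chooses among explicitly defined techniques. The miniature tree and the `ask_llm` helper below are hypothetical placeholders, not the paper's actual tree or prompts.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AttackNode:
    """One tactic/technique in a simplified ATT&CK-style task tree."""
    name: str
    children: List["AttackNode"] = field(default_factory=list)

# Toy tree for illustration; a real tree would be derived from the MITRE ATT&CK Matrix.
root = AttackNode("Reconnaissance", [
    AttackNode("Active Scanning", [AttackNode("Vulnerability Scanning")]),
    AttackNode("Search Open Websites"),
])

def guided_step(node: AttackNode, context: str, ask_llm: Callable[[str], str]) -> AttackNode:
    """Ask the LLM for the next action, but only from the current node's children."""
    options = [child.name for child in node.children]
    prompt = (f"Current tactic: {node.name}\nObservations so far: {context}\n"
              f"Choose exactly one of: {options}")
    choice = ask_llm(prompt)
    for child in node.children:
        if child.name.lower() in choice.lower():
            return child
    return node.children[0]   # fall back if the model answers outside the allowed set
```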
Click here to view paper screenshots



Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems
Authors:Xiaolin Chen, Xuemeng Song, Haokun Wen, Weili Guan, Xiangyu Zhao, Liqiang Nie
Textual response generation is pivotal for multimodal \mbox{task-oriented} dialog systems, which aims to generate proper textual responses based on the multimodal context. While existing efforts have demonstrated remarkable progress, there still exist the following limitations: 1) \textit{neglect of unstructured review knowledge} and 2) \textit{underutilization of large language models (LLMs)}. Inspired by this, we aim to fully utilize dual knowledge (\textit{i.e., } structured attribute and unstructured review knowledge) with LLMs to promote textual response generation in multimodal task-oriented dialog systems. However, this task is non-trivial due to two key challenges: 1) \textit{dynamic knowledge type selection} and 2) \textit{intention-response decoupling}. To address these challenges, we propose a novel dual knowledge-enhanced two-stage reasoner by adapting LLMs for multimodal dialog systems (named DK2R). To be specific, DK2R first extracts both structured attribute and unstructured review knowledge from external knowledge base given the dialog context. Thereafter, DK2R uses an LLM to evaluate each knowledge type’s utility by analyzing LLM-generated provisional probe responses. Moreover, DK2R separately summarizes the intention-oriented key clues via dedicated reasoning, which are further used as auxiliary signals to enhance LLM-based textual response generation. Extensive experiments conducted on a public dataset verify the superiority of DK2R. We have released the codes and parameters.
Paper and Project Links
Summary
This paper examines the importance of textual response generation in multimodal task-oriented dialog systems and its open challenges. Existing methods neglect unstructured review knowledge and underutilize large language models (LLMs). The authors propose DK2R, a dual knowledge-enhanced two-stage reasoner that adapts LLMs to exploit both structured attribute knowledge and unstructured review knowledge. The task raises two key challenges: dynamic knowledge type selection and intention-response decoupling. DK2R first extracts both knowledge types from an external knowledge base given the dialog context, then uses an LLM to assess each knowledge type's utility by analyzing provisional probe responses. It also summarizes intention-oriented key clues via dedicated reasoning and uses them as auxiliary signals to enhance LLM-based response generation. Extensive experiments on a public dataset verify DK2R's superiority.
Key Takeaways
- Textual response generation is pivotal for multimodal task-oriented dialog systems.
- Existing methods underuse unstructured review knowledge and large language models (LLMs).
- DK2R fully exploits dual knowledge (structured attributes and unstructured reviews) and adapts LLMs to improve textual response generation.
- DK2R addresses two key challenges: dynamic knowledge type selection and intention-response decoupling.
- DK2R extracts knowledge from an external knowledge base given the dialog context and uses an LLM to evaluate each knowledge type's utility.
- DK2R summarizes intention-oriented key clues via dedicated reasoning to strengthen LLM-based response generation.
Click here to view paper screenshots



CAViAR: Critic-Augmented Video Agentic Reasoning
Authors:Sachit Menon, Ahmet Iscen, Arsha Nagrani, Tobias Weyand, Carl Vondrick, Cordelia Schmid
Video understanding has seen significant progress in recent years, with models’ performance on perception from short clips continuing to rise. Yet, multiple recent benchmarks, such as LVBench, Neptune, and ActivityNet-RTL, show performance wanes for tasks requiring complex reasoning on videos as queries grow more complex and videos grow longer. In this work, we ask: can existing perception capabilities be leveraged to successfully perform more complex video reasoning? In particular, we develop a large language model agent given access to video modules as subagents or tools. Rather than following a fixed procedure to solve queries as in previous work such as Visual Programming, ViperGPT, and MoReVQA, the agent uses the results of each call to a module to determine subsequent steps. Inspired by work in the textual reasoning domain, we introduce a critic to distinguish between instances of successful and unsuccessful sequences from the agent. We show that the combination of our agent and critic achieve strong performance on the previously-mentioned datasets.
Paper and Project Links
Summary
This paper discusses recent progress and remaining challenges in video understanding. Although models keep improving at perception from short clips, performance degrades on tasks that require complex reasoning as queries become more complex and videos grow longer. The authors ask whether existing perception capabilities can be leveraged for more complex video reasoning and develop a large language model agent that has access to video modules as subagents or tools. Unlike prior work such as Visual Programming, ViperGPT, and MoReVQA, the agent does not follow a fixed procedure: it uses the result of each module call to determine subsequent steps. Inspired by work in textual reasoning, a critic is introduced to distinguish successful from unsuccessful agent sequences. The agent-critic combination achieves strong performance on the aforementioned benchmarks.
Key Takeaways
- Video understanding has progressed markedly, but performance still drops for complex queries and long videos.
- A large language model agent is proposed that leverages existing perception capabilities to perform more complex video reasoning.
- The agent accesses video modules as subagents or tools and dynamically decides its next step based on each module's result.
- Unlike prior work, the agent does not follow a fixed procedure to solve queries.
- A critic distinguishes successful from unsuccessful agent sequences, improving performance.
- The agent-critic combination performs strongly on multiple benchmarks.
- The approach offers a new way to tackle complex video reasoning tasks.
Click here to view paper screenshots



K2-Think: A Parameter-Efficient Reasoning System
Authors:Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, Varad Pimpalkhute, Yonghao Zhuang, Aaryamonvikram Singh, Xuezhi Liang, Anze Xie, Jianshu She, Desai Fan, Chengqian Gao, Liqun Ma, Mikhail Yurochkin, John Maggs, Xuezhe Ma, Guowei He, Zhiting Hu, Zhengzhong Liu, Eric P. Xing
K2-Think is a reasoning system that achieves state-of-the-art performance with a 32B parameter model, matching or surpassing much larger models like GPT-OSS 120B and DeepSeek v3.1. Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets. K2-Think excels in mathematical reasoning, achieving state-of-the-art scores on public benchmarks for open-source models, while also performing strongly in other areas such as Code and Science. Our results confirm that a more parameter-efficient model like K2-Think 32B can compete with state-of-the-art systems through an integrated post-training recipe that includes long chain-of-thought training and strategic inference-time enhancements, making open-source reasoning systems more accessible and affordable. K2-Think is freely available at k2think.ai, offering best-in-class inference speeds of over 2,000 tokens per second per request via the Cerebras Wafer-Scale Engine.
Paper and Project Links
PDF To access the K2-Think reasoning system, please visit https://k2think.ai
Summary
K2-Think is a reasoning system built on the Qwen2.5 base model that combines advanced post-training and test-time computation techniques to achieve state-of-the-art performance with a comparatively small 32B-parameter model. The system rests on six key technical pillars, including long chain-of-thought supervised fine-tuning and reinforcement learning with verifiable rewards (RLVR). K2-Think excels at mathematical reasoning and also performs strongly in code and science. Through an integrated post-training recipe and strategic inference-time enhancements, it competes with much larger state-of-the-art systems at far better parameter efficiency. K2-Think is freely available at k2think.ai and offers best-in-class inference speeds.
Key Takeaways
- K2-Think is an advanced reasoning system built on the Qwen2.5 base model.
- Combining post-training with test-time computation techniques yields high performance from a smaller model.
- The system rests on six key technical pillars, including long chain-of-thought supervised fine-tuning.
- It achieves state-of-the-art mathematical reasoning among open-source models and performs strongly in code and science.
- Compared with much larger models, K2-Think is markedly more parameter-efficient.
- The K2-Think system is freely available at k2think.ai.
Click here to view paper screenshots




$ΔL$ Normalization: Rethink Loss Aggregation in RLVR
Authors:Zhiyuan He, Xufang Luo, Yike Zhang, Yuqing Yang, Lili Qiu
We propose $\Delta L$ Normalization, a simple yet effective loss aggregation method tailored to the characteristic of dynamic generation lengths in Reinforcement Learning with Verifiable Rewards (RLVR). Recently, RLVR has demonstrated strong potential in improving the reasoning capabilities of large language models (LLMs), but a major challenge lies in the large variability of response lengths during training, which leads to high gradient variance and unstable optimization. Although previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address this issue, they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. Our proposed $\Delta L$ Normalization not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance in theory. Extensive experiments show that it consistently achieves superior results across different model sizes, maximum lengths, and tasks. Our code will be made public at https://github.com/zerolllin/Delta-L-Normalization.
Paper and Project Links
Summary
$\Delta L$ Normalization is a simple yet effective loss aggregation method tailored to the dynamic generation lengths that arise in Reinforcement Learning with Verifiable Rewards (RLVR). RLVR improves the reasoning ability of LLMs, but the large variability of response lengths during training causes high gradient variance and unstable optimization. Previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms, yet they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on the policy loss, the authors reformulate the problem as finding a minimum-variance unbiased estimator. $\Delta L$ Normalization provides an unbiased estimate of the true policy loss while minimizing gradient variance in theory, and extensive experiments show consistently superior results across model sizes, maximum lengths, and tasks.
Key Takeaways
- $\Delta L$ Normalization is a loss aggregation method for Reinforcement Learning with Verifiable Rewards (RLVR).
- RLVR shows strong potential for improving the reasoning capabilities of large language models (LLMs).
- Large variability in response length during training is a key RLVR challenge, causing high gradient variance and unstable optimization.
- Existing methods such as GRPO, DAPO, and Dr. GRPO attempt to address this but still suffer from bias or high gradient variance.
- $\Delta L$ Normalization solves this by seeking a minimum-variance unbiased estimator.
- Experiments show $\Delta L$ Normalization performs consistently better across models, maximum lengths, and tasks.
- The code will be released at https://github.com/zerolllin/Delta-L-Normalization.
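To make the aggregation question concrete, the sketch below contrasts the two common baselines referenced above (per-sequence averaging as in GRPO and global per-token averaging as in DAPO) and exposes a generic reweighting hook where a length-dependent scheme such as $\Delta L$ Normalization would plug in. The exact $\Delta L$ weights are defined in the paper and are not reproduced here; the code is an illustrative sketch only.

```python
import torch

def aggregate_losses(token_losses, weights=None):
    """Aggregate per-token policy losses from variable-length responses.

    token_losses: list of 1-D tensors; token_losses[i] holds the token losses
                  of the i-th response (length L_i)
    weights:      optional per-response weights for a custom aggregation scheme
    """
    per_seq_mean = torch.stack([t.mean() for t in token_losses])  # per-response mean loss

    # Baseline 1 (GRPO-style): average the per-response means equally.
    seq_avg = per_seq_mean.mean()
    # Baseline 2 (DAPO-style): one global average over all tokens in the batch.
    tok_avg = torch.cat(token_losses).mean()

    if weights is None:
        return {"per_sequence": seq_avg, "per_token": tok_avg}

    # Generic reweighting hook: any normalized, length-dependent weights can be
    # plugged in here; Delta-L Normalization chooses them to keep the estimate
    # unbiased while minimizing gradient variance (exact form in the paper).
    weights = weights / weights.sum()
    return {"weighted": (weights * per_seq_mean).sum()}
```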
Click here to view paper screenshots



PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions
Authors:Yixuan Tang, Yi Yang, Ahmed Abbasi
Recent advancements in Large Language Models (LLMs) demonstrate remarkable capabilities across various fields. These developments have led to more direct communication between humans and LLMs in various situations, such as social companionship and psychological support. However, LLMs often exhibit limitations in emotional perception and social competence during real-world conversations. These limitations partly originate from their inability to adapt their communication style and emotional expression to different social and task contexts. In this work, we introduce PersonaFuse, a novel LLM post-training framework that enables LLMs to adapt and express different personalities for varying situations. Inspired by Trait Activation Theory and the Big Five personality model, PersonaFuse employs a Mixture-of-Expert architecture that combines persona adapters with a dynamic routing network, enabling contextual trait expression. Experimental results show that PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence. Importantly, these gains are achieved without sacrificing general reasoning ability or model safety, which remain common limitations of direct prompting and supervised fine-tuning approaches. PersonaFuse also delivers consistent improvements in downstream human-centered applications, such as mental health counseling and review-based customer service. Finally, human preference evaluations against leading LLMs, including GPT-4o and DeepSeek, demonstrate that PersonaFuse achieves competitive response quality despite its comparatively smaller model size. These findings demonstrate that PersonaFuse~offers a theoretically grounded and practical approach for developing social-emotional enhanced LLMs, marking a significant advancement toward more human-centric AI systems.
Paper and Project Links
Summary:
Recent advances in large language models (LLMs) demonstrate remarkable capabilities across many fields and have led to more direct human-LLM communication in settings such as social companionship and psychological support. However, LLMs still show limitations in emotional perception and social competence during real-world conversations. This paper introduces PersonaFuse, a novel LLM post-training framework that enables LLMs to adapt and express different personalities for different situations. Inspired by Trait Activation Theory and the Big Five personality model, PersonaFuse employs a Mixture-of-Experts architecture that combines persona adapters with a dynamic routing network for contextual trait expression. Experiments show that PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence without sacrificing general reasoning ability or model safety. It also delivers consistent gains in human-centered applications such as mental health counseling and review-based customer service. In human preference evaluations against leading LLMs, including GPT-4o and DeepSeek, PersonaFuse achieves competitive response quality despite its smaller model size. These findings show that PersonaFuse offers a theoretically grounded and practical approach to developing social-emotionally enhanced LLMs, a significant step toward more human-centric AI systems.
Key Takeaways:
- LLMs demonstrate strong capabilities across many fields, enabling more direct communication with humans.
- LLMs still show limitations in emotional perception and social competence in real-world conversations.
- PersonaFuse is a novel LLM post-training framework that lets LLMs adapt to and express different personalities.
- Drawing on Trait Activation Theory and the Big Five personality model, PersonaFuse uses a Mixture-of-Experts architecture for contextual trait expression.
- PersonaFuse surpasses baseline models across multiple dimensions of social-emotional intelligence.
- These gains come without sacrificing general reasoning ability or model safety.
Click here to view paper screenshots



Systematic Optimization of Open Source Large Language Models for Mathematical Reasoning
Authors:Pranav Pawar, Dhwaj Jain, Varun Gupta, Kaustav Dedhia, Dashrath Kale, Sudhir Dhekane
This paper presents a practical investigation into fine-tuning model parameters for mathematical reasoning tasks through experimenting with various configurations including randomness control, reasoning depth, and sampling strategies; careful tuning demonstrates substantial improvements in efficiency as well as performance. A holistically optimized framework is introduced for five state-of-the-art models on mathematical reasoning tasks, exhibiting significant performance boosts while maintaining solution correctness. Through systematic parameter optimization across Qwen2.5-72B, Llama-3.1-70B, DeepSeek-V3, Mixtral-8x22B, and Yi-Lightning, consistent efficiency gains are demonstrated with 100% optimization success rate. The methodology achieves an average 29.4% reduction in computational cost and 23.9% improvement in inference speed across all tested models. This framework systematically searches parameter spaces including temperature (0.1-0.5), reasoning steps (4-12), planning periods (1-4), and nucleus sampling (0.85-0.98), determining optimal configurations through testing on mathematical reasoning benchmarks. Critical findings show that lower temperature regimes (0.1-0.4) and reduced reasoning steps (4-6) consistently enhance efficiency without compromising accuracy. DeepSeek-V3 achieves the highest accuracy at 98%, while Mixtral-8x22B delivers the most cost-effective performance at 361.5 tokens per accurate response. Key contributions include: (1) the first comprehensive optimization study for five diverse SOTA models in mathematical reasoning, (2) a standardized production-oriented parameter optimization framework, (3) discovery of universal optimization trends applicable across model architectures, and (4) production-ready configurations with extensive performance characterization.
Paper and Project Links
Summary
In this study of parameter optimization for mathematical reasoning, five state-of-the-art models are systematically tuned over temperature, reasoning steps, planning periods, and sampling strategy, yielding substantial gains in both efficiency and performance. The framework systematically searches the parameter space to determine optimal configurations, validated on multiple mathematical reasoning benchmarks. The study finds that lower temperature ranges and fewer reasoning steps markedly improve efficiency without compromising accuracy. DeepSeek-V3 achieves the highest accuracy at 98%, while Mixtral-8x22B delivers the most cost-effective performance. The work provides valuable insights into parameter optimization for mathematical reasoning models.
Key Takeaways
- The work is the first comprehensive optimization study covering five diverse state-of-the-art models for mathematical reasoning.
- A standardized, production-oriented parameter optimization framework is proposed.
- Universal optimization trends applicable across model architectures are discovered.
- Production-ready configurations with extensive performance characterization are provided.
- Experimentally tuning model parameters yields significant gains in both efficiency and performance on mathematical reasoning tasks.
- Lower temperature ranges and fewer reasoning steps are found to improve model efficiency most effectively.
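The search space quoted in the abstract (temperature 0.1-0.5, reasoning steps 4-12, planning periods 1-4, nucleus sampling 0.85-0.98) can be expressed as a simple grid search. The grid step sizes and the `evaluate` callback standing in for a benchmark run are illustrative assumptions, not the paper's actual harness.

```python
from itertools import product
from typing import Callable, Dict

# Parameter ranges taken from the abstract; step sizes are illustrative choices.
GRID = {
    "temperature":     [0.1, 0.2, 0.3, 0.4, 0.5],
    "reasoning_steps": [4, 6, 8, 10, 12],
    "planning_period": [1, 2, 3, 4],
    "top_p":           [0.85, 0.90, 0.95, 0.98],
}

def grid_search(evaluate: Callable[[Dict], float]) -> Dict:
    """Return the configuration with the best benchmark score under `evaluate`."""
    best_cfg, best_score = None, float("-inf")
    for values in product(*GRID.values()):
        cfg = dict(zip(GRID.keys(), values))
        score = evaluate(cfg)   # e.g., accuracy per token cost on a math benchmark
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg
```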
Click here to view paper screenshots






PaVeRL-SQL: Text-to-SQL via Partial-Match Rewards and Verbal Reinforcement Learning
Authors:Heng Hao, Wenjun Hu, Oxana Verkholyak, Davoud Ataee Tarzanagh, Baruch Gutow, Sima Didari, Masoud Faraki, Hankyu Moon, Seungjai Min
Text-to-SQL models allow users to interact with a database more easily by generating executable SQL statements from natural-language questions. Despite recent successes on simpler databases and questions, current Text-to-SQL methods still suffer from low execution accuracy on industry-scale databases and complex questions involving domain-specific business logic. We present \emph{PaVeRL-SQL}, a framework that combines \emph{Partial-Match Rewards} and \emph{Verbal Reinforcement Learning} to drive self-improvement in reasoning language models (RLMs) for Text-to-SQL. To handle practical use cases, we adopt two pipelines: (1) a newly designed in-context learning framework with group self-evaluation (verbal-RL), using capable open- and closed-source large language models (LLMs) as backbones; and (2) a chain-of-thought (CoT) RL pipeline with a small backbone model (OmniSQL-7B) trained with a specially designed reward function and two-stage RL. These pipelines achieve state-of-the-art (SOTA) results on popular Text-to-SQL benchmarks – Spider, Spider 2.0, and BIRD. For the industrial-level Spider2.0-SQLite benchmark, the verbal-RL pipeline achieves an execution accuracy 7.4% higher than SOTA, and the CoT pipeline is 1.4% higher. RL training with mixed SQL dialects yields strong, threefold gains, particularly for dialects with limited training data. Overall, \emph{PaVeRL-SQL} delivers reliable, SOTA Text-to-SQL under realistic industrial constraints. The code is available at https://github.com/PaVeRL-SQL/PaVeRL-SQL.
Paper and Project Links
PDF 10 pages
Summary
This paper introduces PaVeRL-SQL, a framework that combines Partial-Match Rewards and Verbal Reinforcement Learning to drive self-improvement in reasoning language models for Text-to-SQL. Two pipelines address practical use cases: a newly designed in-context learning framework with group self-evaluation (verbal RL) that uses capable open- and closed-source LLMs as backbones, and a chain-of-thought RL pipeline built on the small OmniSQL-7B backbone trained with a specially designed reward function and two-stage RL. These pipelines achieve state-of-the-art results on popular Text-to-SQL benchmarks and higher execution accuracy on the industrial-level Spider2.0-SQLite benchmark. The code is publicly available on GitHub.
Key Takeaways
- PaVeRL-SQL combines Partial-Match Rewards and Verbal Reinforcement Learning to improve the self-improvement ability of Text-to-SQL language models.
- Two pipelines cover practical use cases: an in-context learning framework with group self-evaluation and a chain-of-thought RL pipeline.
- PaVeRL-SQL achieves state-of-the-art results on several popular Text-to-SQL benchmarks.
- On the industrial-level Spider2.0-SQLite benchmark, PaVeRL-SQL's execution accuracy exceeds the previous state of the art.
- RL training with mixed SQL dialects yields strong gains, particularly for dialects with limited training data.
- PaVeRL-SQL delivers reliable Text-to-SQL under realistic industrial constraints.
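A partial-match reward for Text-to-SQL can be sketched as giving fractional credit when the predicted query's execution result overlaps the gold result, instead of the usual all-or-nothing execution accuracy. The row-level F1 below is an illustrative choice and not necessarily the exact scoring rule used in PaVeRL-SQL.

```python
from typing import Iterable, Tuple

Row = Tuple  # one row of an execution result

def partial_match_reward(pred_rows: Iterable[Row], gold_rows: Iterable[Row]) -> float:
    """Fractional reward in [0, 1] based on row overlap between execution results."""
    pred, gold = set(pred_rows), set(gold_rows)
    if not pred and not gold:
        return 1.0                     # both queries return empty results
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)  # row-level F1

# Hypothetical usage inside an RL loop:
# reward = partial_match_reward(execute(pred_sql), execute(gold_sql))
```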
Click here to view paper screenshots






Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models
Authors:Zhiyin Tan, Jennifer D’Souza
This study presents a framework for automated evaluation of dynamically evolving topic models using Large Language Models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex and evolving knowledge domains. However, widely used automated metrics, such as coherence and diversity, often capture only narrow statistical patterns and fail to explain semantic failures in practice. We introduce a purpose-oriented evaluation framework that employs nine LLM-based metrics spanning four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols, and is applied across datasets spanning news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. Our analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift, which are often missed by traditional metrics. These results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets. All code and data supporting this work are accessible at https://github.com/zhiyintan/topic-model-LLMjudgment.
Paper and Project Links
PDF Accepted for publication in International Journal on Digital Libraries (IJDL)
Summary
This study presents a framework for automated evaluation of dynamically evolving topic models using large language models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex, evolving knowledge domains. Traditional automated metrics such as coherence and diversity capture only narrow statistical patterns and cannot explain semantic failures in practice. The proposed purpose-oriented framework employs nine LLM-based metrics spanning four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols and applied across news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. The analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering weaknesses such as redundancy and semantic drift that traditional metrics often miss. The results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets.
Key Takeaways
- The study proposes an LLM-based framework for automated evaluation of dynamically evolving topic models.
- Traditional topic-model metrics capture only narrow statistical patterns and struggle to explain semantic failures in practice.
- The evaluation framework covers four key dimensions: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness.
- Nine LLM-based metrics provide interpretable, robust, and task-relevant assessments.
- Applied across multiple datasets, topic modeling methods, and open-source LLMs, the metrics reveal key weaknesses such as redundancy and semantic drift.
- The study supports the development of fine-grained evaluation tools for maintaining topic relevance in dynamic datasets.
Click here to view paper screenshots



Interleaving Reasoning for Better Text-to-Image Generation
Authors:Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin
Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces a text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation.
Paper and Project Links
Summary
The paper notes that current text-to-image (T2I) generation still lags in instruction following and detail preservation. To address this, it proposes Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis to improve generation quality. To train IRG effectively, the authors propose Interleaving Reasoning Generation Learning (IRGL) and curate the IRGL-300K dataset. The framework achieves strong results across multiple benchmarks.
Key Takeaways
- Current text-to-image generation models fall short in detail preservation and instruction following.
- Interleaving Reasoning Generation (IRG) is a new framework that alternates between text-based thinking and image synthesis to improve image quality.
- Interleaving Reasoning Generation Learning (IRGL) is proposed to train IRG effectively, together with the IRGL-300K dataset.
- IRG strengthens the initial think-and-generate stage to establish core content and base quality.
- IRG enables high-quality textual reflection and faithful implementation of the refinements in a subsequent image.
- The method achieves absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN.
- The code, model weights, and datasets will be released at the linked repository.
Click here to view paper screenshots



SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
Authors:Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, Shafiq Joty
Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented (``thinking’’) models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity’s Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.
Paper and Project Links
PDF Technical Report
Summary
Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus of agentic AI research, especially with recent reasoning-oriented "thinking" models. This work develops native autonomous single-agent models for Deep Research (DR) featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, an autonomous single agent determines its next action dynamically based on context, without manual directives. The work focuses on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. A simple RL recipe with entirely synthetic data is proposed and applied to various open-source LLMs; the best variant, SFR-DR-20B, achieves up to 28.7% on the Humanity's Last Exam benchmark.
Key Takeaways
- Equipping large language models (LLMs) with complex reasoning and tool-use capabilities is a key focus of agentic AI research.
- Deep Research (DR) requires extensive search and reasoning over many sources; autonomous single-agent models are developed for DR with minimal web crawling and Python tool integration.
- Unlike multi-agent systems, an autonomous single agent decides its next action dynamically based on context, without manual directives.
- The work uses continual reinforcement learning (RL) to further improve reasoning-optimized LLMs.
- A simple RL recipe with entirely synthetic data is applied to various open-source LLMs.
- The best model, SFR-DR-20B, achieves up to 28.7% on the Humanity's Last Exam benchmark.
Click here to view paper screenshots


Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL
Authors:Haoyang He, Zihua Rong, Kun Ji, Chenyang Li, Qing Huang, Chong Xia, Lan Yang, Honggang Zhang
Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models (LLMs). Yet the rule-based reward functions commonly used on mathematical or programming benchmarks assess only answer format and correctness, providing no signal as to whether the induced Chain-of-Thought (CoT) actually improves the answer. Furthermore, such task-specific training offers limited control over logical depth and therefore may fail to reveal a model's genuine reasoning capacity. We propose Dynamic Reasoning Efficiency Reward (DRER) – a plug-and-play RL reward framework that reshapes both reward and advantage signals. (i) A Reasoning Quality Reward assigns fine-grained credit to those reasoning chains that demonstrably raise the likelihood of the correct answer, directly incentivising the trajectories with beneficial CoT tokens. (ii) A Dynamic Length Advantage decays the advantage of responses whose length deviates from a validation-derived threshold, stabilising training. To facilitate rigorous assessment, we also release Logictree, a dynamically constructed deductive reasoning dataset that functions both as RL training data and as a comprehensive benchmark. Experiments confirm the effectiveness of DRER: our 7B model attains GPT-o3-mini level performance on Logictree with 400 training steps, while the average confidence of CoT-augmented answers rises by 30%. The model further exhibits generalisation across diverse logical-reasoning datasets, and the mathematical benchmark AIME24. These results illuminate how RL shapes CoT behaviour and chart a practical path toward enhancing formal-reasoning skills in large language models. All code and data are available in repository https://github.com/Henryhe09/DRER.
Paper and Project Links
Summary
Reinforcement learning (RL) has become the dominant paradigm for strengthening the reasoning abilities of large language models (LLMs). However, the rule-based reward functions commonly used on mathematical or programming benchmarks assess only answer format and correctness, giving no signal as to whether the induced chain of thought (CoT) actually improves the answer; such task-specific training also offers limited control over logical depth and may fail to reveal a model's genuine reasoning capacity. This paper proposes Dynamic Reasoning Efficiency Reward (DRER), a plug-and-play RL reward framework that reshapes both reward and advantage signals. (1) A Reasoning Quality Reward assigns fine-grained credit to reasoning chains that demonstrably raise the likelihood of the correct answer, directly incentivizing trajectories with beneficial CoT tokens. (2) A Dynamic Length Advantage decays the advantage of responses whose length deviates from a validation-derived threshold, stabilizing training. The authors also release Logictree, a dynamically constructed deductive reasoning dataset that serves both as RL training data and as a comprehensive benchmark. Experiments confirm DRER's effectiveness: a 7B model reaches GPT-o3-mini-level performance on Logictree after 400 training steps, the average confidence of CoT-augmented answers rises by 30%, and the model generalizes across diverse logical reasoning datasets and the mathematical benchmark AIME24.
Key Takeaways
- Reinforcement learning has become the dominant approach for strengthening the reasoning abilities of large language models.
- Existing reward functions mainly assess answer format and correctness, overlooking improvements in reasoning quality.
- The DRER framework reshapes reward and advantage signals to incentivize beneficial chain-of-thought trajectories.
- The released Logictree dataset provides a benchmark for rigorous evaluation on deductive reasoning tasks.
- Experiments show DRER improves the model's reasoning ability and the confidence of its answers.
- The model generalizes well across diverse logical reasoning datasets and the mathematical benchmark AIME24.
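The two DRER components can be sketched roughly as follows: a reasoning-quality term rewards chains of thought that raise the probability of the correct answer relative to answering without the chain, and a length term decays the advantage as the response length drifts from a validation-derived threshold. The specific functional forms below (a probability difference and an exponential decay) are illustrative assumptions, not the paper's exact definitions.

```python
import math

def reasoning_quality_reward(p_correct_with_cot: float, p_correct_without_cot: float) -> float:
    """Positive when the chain of thought raises the likelihood of the correct answer."""
    return p_correct_with_cot - p_correct_without_cot

def dynamic_length_advantage(advantage: float, length: int,
                             threshold: int, scale: float = 100.0) -> float:
    """Decay the advantage as the response length deviates from the validation threshold."""
    decay = math.exp(-abs(length - threshold) / scale)
    return advantage * decay

# Hypothetical usage inside an RL step:
# r_quality = reasoning_quality_reward(p_with, p_without)
# shaped_adv = dynamic_length_advantage(base_advantage, num_tokens, threshold=512)
```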
Click here to view paper screenshots



Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
Authors:Feng Wang, Zihao Yu
Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods on Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized by Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS
Paper and Project Links
PDF work in progress
Summary
Reinforcement learning (RL) has emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, particularly for output quality and prompt alignment. Applying online RL to Flow Matching requires injecting stochasticity into the deterministic framework, typically via a stochastic differential equation (SDE). The authors find that SDE-based sampling introduces pronounced noise artifacts in the generated images that harm the reward learning process, and a theoretical analysis traces this noise to excess stochasticity injected during inference. Drawing inspiration from Denoising Diffusion Implicit Models (DDIM), they reformulate the sampling process as Coefficients-Preserving Sampling (CPS), which eliminates the noise artifacts, yields more accurate reward modeling, and enables faster, more stable convergence for RL-based optimizers such as Flow-GRPO and Dance-GRPO.
Key Takeaways
- Reinforcement learning offers clear benefits for image and video generation with Diffusion and Flow Matching models, especially for output quality and prompt alignment.
- SDE-based sampling introduces pronounced noise artifacts into the generated images.
- These noise artifacts are detrimental to the reward learning process.
- Inspired by DDIM, the authors propose Coefficients-Preserving Sampling (CPS) to eliminate the noise artifacts.
- CPS improves the accuracy of reward modeling.
- CPS enables faster and more stable convergence for RL optimizers such as Flow-GRPO and Dance-GRPO.
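As background for the deterministic-versus-SDE distinction drawn above, the sketch below contrasts a plain Euler step with an Euler-Maruyama step for integrating a learned flow-matching velocity field; the stochastic variant injects Gaussian noise at every update, which is the kind of stochasticity RL methods add for exploration. This is illustration only, not the paper's CPS update rule; the `velocity` model and `sigma` schedule are assumptions.

```python
import torch

def euler_step(x, t, dt, velocity):
    """Deterministic ODE step: x <- x + v(x, t) * dt."""
    return x + velocity(x, t) * dt

def euler_maruyama_step(x, t, dt, velocity, sigma):
    """Stochastic step: the diffusion term adds Gaussian noise scaled by sqrt(dt)."""
    noise = torch.randn_like(x)
    return x + velocity(x, t) * dt + sigma(t) * (dt ** 0.5) * noise
```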
Click here to view paper screenshots




Chatbot To Help Patients Understand Their Health
Authors:Won Seok Jang, Hieu Tran, Manav Mistry, SaiKiran Gandluri, Yifan Zhang, Sharmin Sultana, Sunjae Kown, Yuan Zhang, Zonghai Yao, Hong Yu
Patients must possess the knowledge necessary to actively participate in their care. We present NoteAid-Chatbot, a conversational AI that promotes patient understanding via a novel ‘learning as conversation’ framework, built on a multi-agent large language model (LLM) and reinforcement learning (RL) setup without human-labeled data. NoteAid-Chatbot was built on a lightweight LLaMA 3.2 3B model trained in two stages: initial supervised fine-tuning on conversational data synthetically generated using medical conversation strategies, followed by RL with rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Our evaluation, which includes comprehensive human-aligned assessments and case studies, demonstrates that NoteAid-Chatbot exhibits key emergent behaviors critical for patient education, such as clarity, relevance, and structured dialogue, even though it received no explicit supervision for these attributes. Our results show that even simple Proximal Policy Optimization (PPO)-based reward modeling can successfully train lightweight, domain-specific chatbots to handle multi-turn interactions, incorporate diverse educational strategies, and meet nuanced communication objectives. Our Turing test demonstrates that NoteAid-Chatbot surpasses non-expert human. Although our current focus is on healthcare, the framework we present illustrates the feasibility and promise of applying low-cost, PPO-based RL to realistic, open-ended conversational domains, broadening the applicability of RL-based alignment methods.
Paper and Project Links
PDF Accepted in EMNLP 2025 Findings
Summary
NoteAid-Chatbot is a conversational AI that promotes patient understanding through a 'learning as conversation' framework, built on a multi-agent large language model and reinforcement learning setup without human-labeled data. After initial supervised fine-tuning on synthetically generated medical conversation data, it is trained with reinforcement learning using rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Evaluation shows that NoteAid-Chatbot exhibits key emergent behaviors critical for patient education, such as clarity, relevance, and structured dialogue, even though it received no explicit supervision for these attributes. The model handles multi-turn interactions, incorporates diverse educational strategies, and meets nuanced communication objectives, and it surpasses non-expert humans in a Turing test. Although the current focus is healthcare, the framework demonstrates the feasibility and promise of applying low-cost, PPO-based RL to realistic, open-ended conversational domains, broadening the applicability of RL-based alignment methods.
Key Takeaways
- NoteAid-Chatbot is a conversational AI that improves patients' understanding of their care through a 'learning as conversation' framework.
- The system is built on a multi-agent large language model and reinforcement learning, without human-labeled data.
- After supervised fine-tuning, NoteAid-Chatbot is trained with RL using patient understanding assessments in simulated hospital discharge scenarios.
- The system exhibits key patient-education behaviors, such as providing clear, relevant information and structured dialogue.
- NoteAid-Chatbot surpasses non-expert humans in a Turing test.
- The system currently targets healthcare, but its framework can be applied to other open-ended conversational domains.
Click here to view paper screenshots







TalkToAgent: A Human-centric Explanation of Reinforcement Learning Agents with Large Language Models
Authors:Haechang Kim, Hao Chen, Can Li, Jong Min Lee
Explainable Reinforcement Learning (XRL) has emerged as a promising approach in improving the transparency of Reinforcement Learning (RL) agents. However, there remains a gap between complex RL policies and domain experts, due to the limited comprehensibility of XRL results and isolated coverage of current XRL approaches that leave users uncertain about which tools to employ. To address these challenges, we introduce TalkToAgent, a multi-agent Large Language Models (LLM) framework that delivers interactive, natural language explanations for RL policies. The architecture with five specialized LLM agents (Coordinator, Explainer, Coder, Evaluator, and Debugger) enables TalkToAgent to automatically map user queries to relevant XRL tools and clarify an agent’s actions in terms of either key state variables, expected outcomes, or counterfactual explanations. Moreover, our approach extends previous counterfactual explanations by deriving alternative scenarios from qualitative behavioral descriptions, or even new rule-based policies. We validated TalkToAgent on quadruple-tank process control problem, a well-known nonlinear control benchmark. Results demonstrated that TalkToAgent successfully mapped user queries into XRL tasks with high accuracy, and coder-debugger interactions minimized failures in counterfactual generation. Furthermore, qualitative evaluation confirmed that TalkToAgent effectively interpreted agent’s actions and contextualized their meaning within the problem domain.
Paper and Project Links
PDF 31 pages total
Summary:
Explainable Reinforcement Learning (XRL) improves the transparency of RL agents, but a gap remains between complex RL policies and domain experts: XRL results have limited comprehensibility, and current XRL approaches are covered in isolation, leaving users unsure which tools to use. To address this, the authors introduce TalkToAgent, a multi-agent large language model (LLM) framework that delivers interactive, natural language explanations for RL policies. TalkToAgent comprises five specialized LLM agents (Coordinator, Explainer, Coder, Evaluator, and Debugger) that automatically map user queries to relevant XRL tools and clarify an agent's actions in terms of key state variables, expected outcomes, or counterfactual explanations. The approach also extends previous counterfactual explanations by deriving alternative scenarios from qualitative behavioral descriptions or even new rule-based policies. Validation on the quadruple-tank process control problem, a well-known nonlinear control benchmark, shows that TalkToAgent maps user queries to XRL tasks with high accuracy and that coder-debugger interactions minimize failures in counterfactual generation; qualitative evaluation confirms that the agent's actions are interpreted effectively and contextualized within the problem domain.
Key Takeaways:
- XRL improves the transparency of reinforcement learning (RL), but a gap with domain experts remains.
- TalkToAgent is a multi-agent large language model framework designed to address this challenge.
- TalkToAgent contains five specialized LLM agents that automatically map user queries to relevant XRL tools.
- TalkToAgent can explain an agent's actions in terms of key state variables, expected outcomes, or counterfactual explanations.
- The method extends previous counterfactual explanations by deriving alternative scenarios from qualitative behavioral descriptions or new rule-based policies.
- Validation on the quadruple-tank process control problem shows that TalkToAgent maps user queries to XRL tasks with high accuracy.
Click here to view paper screenshots




AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs
Authors:Aisha Alansari, Hamzah Luqman
Recently, extensive research on the hallucination of the large language models (LLMs) has mainly focused on the English language. Despite the growing number of multilingual and Arabic-specific LLMs, evaluating LLMs’ hallucination in the Arabic context remains relatively underexplored. The knowledge gap is particularly pressing given Arabic’s widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. This study evaluates a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of LLMs’ outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that represent the varying characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than multilingual models and a comparative performance with reasoning-based models. The code is available at: https://github.com/aishaalansari57/AraHalluEval
Paper and Project Links
Summary: Recent research on hallucination in large language models (LLMs) has focused mainly on English, and despite the growing number of multilingual and Arabic-specific LLMs, evaluating hallucination in the Arabic context remains relatively underexplored. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. A total of 12 LLMs are evaluated, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of the models' outputs, the authors develop a fine-grained hallucination evaluation framework with 12 indicators reflecting the characteristics of each task. Results show that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently shows lower hallucination rates than multilingual models and performs comparably to reasoning-based models.
Key Takeaways:
- Recent hallucination research on large language models has focused mainly on English; evaluation of Arabic and multilingual LLMs has been relatively neglected.
- The paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on GQA and summarization.
- A total of 12 LLMs are evaluated, spanning Arabic pre-trained, multilingual, and reasoning-based models.
- A fine-grained hallucination evaluation framework is developed to assess the factual consistency and faithfulness of LLM outputs.
- Results show that factual hallucinations are more prevalent than faithfulness errors.
- The Arabic pre-trained model Allam shows lower hallucination rates than multilingual models and performs comparably to reasoning-based models.
Click here to view paper screenshots


