⚠️ 以下所有内容总结均由大语言模型生成,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace 免费体验
2025-11-06 更新
Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
Authors:Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang, Ravender Pal Singh
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available. We release an open-source implementation to support continued research on scalable and reliable omni-modal reasoning.
多模态大型语言模型(MLLMs)已经展现出强大的能力,但仍局限于固定的模态组合,并且需要依赖大规模对齐数据集进行代价高昂的微调。构建能够整合文本、图像、音频和视频的全模态模型目前仍不现实,也缺乏稳健的推理支持。在本文中,我们提出了Agent-Omni框架,它通过一个主代理系统协调现有的基础模型,在无需重新训练的情况下实现灵活的多模态推理。主代理解析用户意图,将子任务委派给特定模态的代理,并将它们的输出整合成连贯的回答。在文本、图像、音频、视频和全模态基准上的大量实验表明,Agent-Omni持续取得最先进的性能,尤其是在需要复杂跨模态推理的任务上。其基于代理的设计能够无缝集成专门的基础模型,在保持透明性和可解释性的同时适应多样化的输入。此外,该框架是模块化且易于扩展的,随着更强模型的出现可以持续改进。我们发布开源实现,以支持在可扩展和可靠的全模态推理方面的持续研究。
论文及项目相关链接
PDF 16 pages, 7 figures, 14 tables. Under Review
Summary
本文提出了一种名为Agent-Omni的框架,它通过主代理系统协调现有的基础模型,实现了灵活的多模态推理而无需重新训练。该框架能够解释用户意图,将子任务委派给特定模态的代理,并将它们的输出整合为连贯的响应。在文本、图像、音频、视频和全模态基准测试中,Agent-Omni表现出卓越的性能,特别是在需要复杂跨模态推理的任务上。其基于代理的设计确保了对各种输入的适应性、透明度和可解释性。此外,该框架是模块化的并且易于扩展,随着更强大的模型的出现,允许未来的改进。
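本文档在此补充一个极简的 Python 示意草图(并非论文的官方实现,其中的代理名称、接口与汇总方式均为说明性假设),用来演示“主代理解析意图、按模态分派子任务、再整合回答”的基本协调流程:

```python
from typing import Any, Callable, Dict

# 假设的模态专家:真实系统中它们会分别调用视觉、音频、文本等基础模型
def image_agent(task: str, data: Any) -> str:
    return f"[image-agent] 关于“{task}”的图像分析结果"

def audio_agent(task: str, data: Any) -> str:
    return f"[audio-agent] 关于“{task}”的音频分析结果"

def text_agent(task: str, data: Any) -> str:
    return f"[text-agent] 关于“{task}”的文本分析结果"

MODALITY_AGENTS: Dict[str, Callable[[str, Any], str]] = {
    "image": image_agent,
    "audio": audio_agent,
    "text": text_agent,
}

def master_agent(question: str, inputs: Dict[str, Any]) -> str:
    """主代理:解析用户意图 -> 按输入模态分派子任务 -> 汇总为连贯回答。"""
    subtask = question                       # 真实系统中由主 LLM 将意图改写为子任务
    partial_answers = [
        MODALITY_AGENTS[m](subtask, data)    # 只委派给当前输入中出现的模态
        for m, data in inputs.items()
        if m in MODALITY_AGENTS
    ]
    return "综合回答:\n" + "\n".join(partial_answers)   # 真实系统中由主 LLM 推理式汇总

if __name__ == "__main__":
    print(master_agent("视频里的人在说什么?", {"image": "...", "audio": "..."}))
```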
Key Takeaways
- Agent-Omni框架协调现有基础模型,实现灵活多模态推理,无需重新训练。
- 框架通过主代理系统解释用户意图,将任务委派给特定模态的代理。
- Agent-Omni在多种模态基准测试中表现出卓越性能,尤其在需要复杂跨模态推理的任务上。
- 基于代理的设计确保了对各种输入的适应性、透明度和可解释性。
- Agent-Omni框架是模块化的,易于扩展,允许未来随着更强模型的出现进行改进。
- 该框架在跨模态推理方面具有创新性,能够整合文本、图像、音频和视频。
点此查看论文截图
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
Authors:Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, Xianpei Han
Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user’s question with the memory to generate reasoning traces, perform search actions, and update memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimize reasoning, search strategies, and memory management of MemSearcher Agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks: +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct relative average gains. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher
典型的搜索代理会把整个交互历史拼接进LLM上下文,虽然保持了信息完整性,却会产生冗长且嘈杂的上下文,带来高昂的计算和内存成本。相比之下,仅使用当前回合可以避免这种开销,但会丢弃关键信息。这种权衡限制了搜索代理的可扩展性。为了应对这一挑战,我们提出了MemSearcher,这是一种迭代维护紧凑记忆、并将其与当前回合结合的代理工作流。在每个回合中,MemSearcher将用户问题与记忆融合,生成推理轨迹、执行搜索操作并更新记忆,只保留解决任务所必需的信息。这一设计使多回合交互中的上下文长度保持稳定,在不牺牲准确性的前提下提高效率。为了优化该工作流,我们引入了多上下文GRPO,这是一种端到端强化学习框架,联合优化MemSearcher代理的推理、搜索策略和记忆管理。具体而言,多上下文GRPO在不同上下文下采样轨迹组,并将轨迹级优势传播到其中的所有对话。MemSearcher使用与Search-R1相同的数据集训练,在七个公共基准上相对强基线取得显著改进:基于Qwen2.5-3B-Instruct时相对平均增益为+11%,基于Qwen2.5-7B-Instruct时为+12%。值得注意的是,基于3B的MemSearcher甚至超越了基于7B的基线,表明在信息完整性与效率之间取得平衡能够同时带来更高的准确率和更低的计算开销。相关代码和模型将在https://github.com/icip-cas/MemSearcher公开。
论文及项目相关链接
PDF Project page: https://github.com/icip-cas/MemSearcher
摘要
典型搜索代理会将整个交互历史拼接进大型语言模型的上下文,尽管这种方法能够保留信息完整性,却会产生漫长而嘈杂的上下文,从而增加计算和内存成本。相反,仅使用当前轮次可避免这些开销,但又丢弃了重要信息。这种权衡限制了搜索代理的可扩展性。为解决这一挑战,我们提出了MemSearcher代理工作流,它迭代地维护一段紧凑的记忆,并将其与当前回合相结合。在每个回合中,MemSearcher将用户的问题与记忆相融合,生成推理轨迹、执行搜索操作并更新记忆,仅保留对完成任务至关重要的信息。这种设计在多回合交互中保持上下文长度稳定,在不牺牲准确性的前提下提高了效率。为优化这一工作流,我们引入了多上下文GRPO,这是一种端到端强化学习框架,能够联合优化MemSearcher代理的推理、搜索策略和记忆管理。具体而言,多上下文GRPO在不同上下文下采样轨迹组,并在其中的所有对话间传播轨迹级别的优势。MemSearcher使用与Search-R1相同的数据集训练,在七个公共基准测试上取得显著改进:基于Qwen2.5-3B-Instruct时相对平均增益为+11%,基于Qwen2.5-7B-Instruct时为+12%。值得注意的是,基于3B的MemSearcher甚至超越了基于7B的基线模型,表明兼顾信息完整性与效率能够同时带来更高的准确性和更低的计算开销。相关代码和模型将公开在https://github.com/icip-cas/MemSearcher。
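下面是一个极简的工作流草图(非官方实现;policy、search_engine 等函数以及 500 字符的记忆上限均为说明性假设),演示“每轮以『问题 + 紧凑记忆』为上下文、执行搜索并只保留关键信息”的循环:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    reasoning: str
    action: str           # "search:<查询>" 或 "answer:<最终答案>"
    observation: str = ""

def policy(question: str, memory: str) -> Turn:
    """占位策略:真实系统中由 LLM 依据『问题 + 紧凑记忆』生成推理与动作。"""
    if "巴黎" not in memory:
        return Turn(reasoning="需要先检索事实", action="search:法国的首都")
    return Turn(reasoning="记忆中已有答案", action="answer:巴黎")

def search_engine(query: str) -> str:
    return "法国的首都是巴黎。"              # 占位检索结果

def update_memory(memory: str, turn: Turn) -> str:
    """只保留解题所需的关键信息,使上下文长度在多轮交互中保持稳定。"""
    return (memory + " " + turn.observation.strip()).strip()[-500:]

def memsearcher(question: str, max_turns: int = 4) -> str:
    memory = ""
    for _ in range(max_turns):
        turn = policy(question, memory)      # 上下文 = 问题 + 记忆,而非全部历史
        if turn.action.startswith("answer:"):
            return turn.action.split(":", 1)[1]
        turn.observation = search_engine(turn.action.split(":", 1)[1])
        memory = update_memory(memory, turn)
    return "未能在限定轮数内得到答案"

print(memsearcher("法国的首都是哪里?"))
```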
关键见解
- MemSearcher代理解决了传统搜索代理在处理互动历史时面临的计算与内存成本高昂的问题。
- MemSearcher通过将当前回合与迭代维护的紧凑记忆相结合来优化上下文处理。
- MemSearcher生成推理轨迹、执行搜索操作并更新记忆,确保仅保留完成任务所需的关键信息。
- 引入多上下文GRPO框架,以联合优化MemSearcher的推理、搜索策略和记忆管理。
- MemSearcher能够在不同上下文中采样轨迹组并传播轨迹级别的优势,从而提高性能。
- MemSearcher在多个公共基准测试上取得了显著成果,相对于其他模型有明显的性能提升。
点此查看论文截图
When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning
Authors:Chenyu Zhang, Minsol Kim, Shohreh Ghorbani, Jingyao Wu, Rosalind Picard, Patricia Maes, Paul Pu Liang
Despite rapid growth in multimodal large language models (MLLMs), their reasoning traces remain opaque: it is often unclear which modality drives a prediction, how conflicts are resolved, or when one stream dominates. In this paper, we introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. To analyze such dynamics, we propose a lightweight, model-agnostic evaluation layer that treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead). Applying our diagnostic layer in a case study on multimodal emotion recognition benchmarks with foundation models revealed systematic reliability profiles, providing insight into whether failures may arise from dataset artifacts or model limitations. More broadly, our framework offers a diagnostic scaffold for multimodal reasoning, supporting principled auditing of fusion dynamics and informing possible interventions.
尽管多模态大型语言模型(MLLMs)发展迅速,但其推理轨迹仍然不透明:通常不清楚是哪种模态驱动了预测、冲突如何被解决,或者何时某一数据流占据主导。在本文中,我们提出了模态破坏(modality sabotage)这一诊断性失效模式,即高置信度的单模态错误覆盖其他证据并误导融合结果。为了分析这种动态,我们提出了一种轻量级、与模型无关的评估层,它将每个模态视为一个代理,产生候选标签和用于审计的简短自我评估。一个简单的融合机制聚合这些输出,从而暴露贡献者(支持正确结果的模态)和破坏者(误导结果的模态)。我们在使用基础模型的多模态情感识别基准上开展案例研究并应用该诊断层,揭示了系统性的可靠性特征,有助于判断失败是源于数据集伪影还是模型自身的局限。更广泛地说,我们的框架为多模态推理提供了诊断支架,支持对融合动态进行有原则的审计,并为可能的干预提供依据。
论文及项目相关链接
PDF Accepted at the Multimodal Algorithmic Reasoning (MAR) Workshop, NeurIPS 2025
Summary
本文探讨了多模态大型语言模型(MLLMs)推理透明度的问题,提出了一种名为模态破坏(modality sabotage)的诊断性失效模式:高置信度的单模态错误会覆盖其他证据并误导融合结果。为分析这种现象,文章提出了一种轻量级、与模型无关的评估层,将各模态视为代理,产生候选标签和用于审计的简短自我评估。一个简单的融合机制聚合这些输出,揭示贡献者(支持正确结果的模态)和破坏者(误导结果的模态)。在使用基础模型的多模态情感识别基准案例研究中应用该诊断层,揭示了系统性的可靠性特征,有助于判断失败是源于数据集伪影还是模型自身的局限。总体而言,该框架为多模态推理提供了诊断支架,支持对融合动态进行有原则的审计,并为可能的干预提供依据。
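下面用一个极简草图示意这种“把每个模态视为代理、用简单融合暴露贡献者与破坏者”的诊断层(非论文官方实现;按自评置信度加权投票的融合方式与 0.9 的破坏者判定阈值均为说明性假设):

```python
from collections import defaultdict
from typing import Dict, Tuple

# 每个模态代理给出 (候选标签, 自评置信度);此处用硬编码示例代替真实模型输出
predictions: Dict[str, Tuple[str, float]] = {
    "text":  ("happy", 0.55),
    "audio": ("angry", 0.95),   # 高置信度的单模态错误,可能“破坏”融合结果
    "video": ("happy", 0.60),
}

def fuse(preds: Dict[str, Tuple[str, float]]) -> str:
    """简单融合:按自评置信度加权投票。"""
    scores = defaultdict(float)
    for label, conf in preds.values():
        scores[label] += conf
    return max(scores, key=scores.get)

def audit(preds: Dict[str, Tuple[str, float]], fused: str, gold: str) -> dict:
    """诊断层:给定真值标签,区分贡献者与(高置信度出错的)破坏者。"""
    contributors = [m for m, (lab, _) in preds.items() if lab == gold]
    saboteurs = [m for m, (lab, c) in preds.items() if lab != gold and c >= 0.9]
    return {"fused": fused, "correct": fused == gold,
            "contributors": contributors, "saboteurs": saboteurs}

fused = fuse(predictions)
print(audit(predictions, fused, gold="happy"))
```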
Key Takeaways
- 多模态大型语言模型(MLLMs)的推理过程存在透明度问题,难以确定预测背后的驱动因素以及冲突如何解决或哪个数据流占据主导地位。
- 提出了一种新的诊断失效模式——模态破坏,其中高置信度的单模态错误会覆盖其他证据并误导融合结果。
- 引入了一种轻量级、模型不可知的评估层,该评估层可以分析各模态的贡献和可能的误导。
- 通过案例研究揭示了系统的可靠性特征,有助于区分失败是源于数据集的问题还是模型本身的局限性。
- 该框架为多模态推理提供了诊断工具,支持对融合过程的审计,并有助于识别可能的改进措施。
- 提出的诊断层可以应用于多模态情感识别等场景,有助于深入了解模型的性能特点。
点此查看论文截图
Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning
Authors:Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, Guoli Yin, Jiawei Han, Zirui Wang
Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi-agent LLM systems where specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus lead to substantial and uncontrolled inference costs. In this work, we introduce a centralized multi-LLM framework, where a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi-agent system to have adapted behavior with different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance cost trade-off in a controllable multi-budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high-budget settings, while maintaining strong performance in more economical low-budget modes, highlighting the effectiveness of centralized coordination for scalable and cost-efficient multi-agent LLM systems.
大型语言模型(LLM)在不同领域展现出互补的优势,且推理成本各不相同,这促使人们设计由专业模型高效协作的多智能体LLM系统。现有方法主要依赖去中心化框架,对每个输入都调用多个LLM,从而导致推理成本高昂且不可控。在这项工作中,我们引入了一种集中式的多LLM框架,由一个控制器LLM以成本高效且成本可控的方式选择性地协调专家模型池。我们将这一协调问题形式化为具有双重目标的强化学习:在最大化任务性能的同时最小化总体推理成本。此外,我们希望多智能体系统在推理时能够针对不同的预算条件表现出自适应的行为。为此,我们提出了CoRL,一个在可控的多预算设置下优化性能与成本权衡的强化学习框架。在四个不同基准上的实验表明,CoRL使单一系统在高预算设置下超越最佳的专家LLM,同时在更经济的低预算模式下保持强劲性能,凸显了集中式协调对于构建可扩展、高性价比的多智能体LLM系统的有效性。
论文及项目相关链接
PDF 14 pages
Summary
大型语言模型(LLM)在不同领域具有互补优势,同时推理成本各异,因此设计多智能体LLM系统,让专业模型高效协作至关重要。现有方法主要依赖去中心化框架,为每次输入调用多个LLM,导致推理成本巨大且不可控。本文提出一种集中式多LLM框架,控制器LLM有选择性地协调专家模型池,以高效且可控的方式降低成本。我们将协调问题公式化为具有双重目标的强化学习问题:最大化任务性能,同时最小化总体推理成本。实验表明,在四种不同基准测试中,CoRL(一种优化性能成本权衡的强化学习框架)使单一系统在预算充足时表现超越最佳专家LLM,同时在经济节约型模式下也保持强劲性能,凸显集中式协调对于可扩展和高效的多智能体LLM系统的价值。
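论文未给出奖励的具体形式,下面仅用一个高度简化的奖励函数草图(线性的超预算惩罚与 lam 权重均为本文档的假设)说明“最大化任务性能、最小化推理成本,并随预算条件变化”的双目标思路:

```python
def corl_style_reward(correct: bool, cost: float, budget: float, lam: float = 1.0) -> float:
    """示意性的双目标奖励:任务得分减去超出预算部分的惩罚(具体形状为假设)。"""
    task_score = 1.0 if correct else 0.0
    over_budget = max(0.0, cost - budget) / max(budget, 1e-6)
    return task_score - lam * over_budget

# 同一条回答在不同预算条件下获得不同奖励,促使控制器学会按预算调度专家模型
for budget in (1.0, 4.0):          # 低预算 / 高预算(单位:相对推理成本)
    print(budget, corl_style_reward(correct=True, cost=2.5, budget=budget))
```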
Key Takeaways
- 大型语言模型(LLM)在不同领域具有优势,且推理成本各异,需要设计多智能体LLM系统实现高效协作。
- 现有方法主要依赖去中心化框架,导致推理成本巨大且不可控。
- 集中式多LLM框架提出,通过控制器LLM选择性协调专家模型池以降低推理成本。
- 协调问题被公式化为强化学习问题,旨在最大化任务性能的同时最小化总体推理成本。
- 提出的CoRL框架能在不同预算条件下优化性能与成本的权衡。
- 实验表明,CoRL在多种基准测试中表现优异,能在高预算下超越最佳专家LLM,同时在低预算模式下也保持强劲性能。
点此查看论文截图
Curriculum Design for Trajectory-Constrained Agent: Compressing Chain-of-Thought Tokens in LLMs
Authors:Georgios Tzannetos, Parameswaran Kamalaruban, Adish Singla
Training agents to operate under strict constraints during deployment, such as limited resource budgets or stringent safety requirements, presents significant challenges, especially when these constraints render the task complex. In this work, we propose a curriculum learning strategy that gradually tightens constraints during training, enabling the agent to incrementally master the deployment requirements. Inspired by self-paced learning techniques in unconstrained reinforcement learning (RL), our approach facilitates a smoother transition to challenging environments by initially training on simplified versions of the constraints and progressively introducing the full deployment conditions. We provide a theoretical analysis using an RL agent in a binary-tree Markov Decision Process (MDP) to demonstrate that our curriculum strategy can accelerate training relative to a baseline approach that imposes the trajectory constraints from the outset. Moreover, we empirically validate the effectiveness and generality of our method across both RL and large language model (LLM) agents in diverse settings, including a binary-tree MDP, a multi-task navigation domain, and a math reasoning task with two benchmarks. These results highlight the potential of curriculum design in enhancing the efficiency and performance of agents operating under complex trajectory constraints during deployment. Moreover, when applied to LLMs, our strategy enables compression of output chain-of-thought tokens, achieving a substantial inference speedup on consumer hardware, demonstrating its effectiveness for resource-constrained deployment.
训练智能体在部署期间的严格约束下运行,例如有限的资源预算或严苛的安全要求,是一项重大挑战,尤其当这些约束使任务变得复杂时。在这项工作中,我们提出了一种课程学习策略,在训练过程中逐步收紧约束,使智能体能够循序渐进地掌握部署要求。受无约束强化学习(RL)中自定步速学习(self-paced learning)技术的启发,我们的方法先在约束的简化版本上训练,再逐步引入完整的部署条件,从而更平滑地过渡到具有挑战性的环境。我们以二叉树马尔可夫决策过程(MDP)中的RL智能体为例进行理论分析,证明相对于从一开始就施加轨迹约束的基线方法,我们的课程策略能够加速训练。此外,我们通过多种场景下的实证研究验证了该方法对RL智能体和大型语言模型(LLM)智能体的有效性与通用性,包括二叉树MDP、多任务导航领域,以及含两个基准的数学推理任务。这些结果突显了课程设计在提升智能体应对部署期复杂轨迹约束时的效率和性能方面的潜力。此外,应用于LLM时,我们的策略能够压缩输出的思维链token,在消费级硬件上带来显著的推理加速,证明了其在资源受限部署中的有效性。
论文及项目相关链接
PDF NeurIPS’25 paper
Summary:
本文提出一种课程学习策略,在训练过程中逐步收紧约束条件,使代理能够逐步掌握部署要求。该方法受无约束强化学习中自定步速学习(self-paced learning)技术的启发,先在简化版本的约束下训练,再逐步引入完整的部署条件,实现向复杂环境的更平稳过渡。理论分析和实证研究均表明,该方法对强化学习代理和大型语言模型代理均有效,在处理复杂轨迹约束方面具有显著优势。
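下面用几行代码示意“训练中逐步收紧约束”的课程调度思路,以压缩思维链 token 预算为例(非论文原始实现;线性收紧方式与具体预算数值均为假设,论文中的自定步速调度可以更复杂):

```python
def token_budget(step: int, total_steps: int,
                 start_budget: int = 1024, final_budget: int = 128) -> int:
    """课程式约束调度:训练初期给宽松的思维链 token 预算,随训练线性收紧到部署预算。"""
    frac = min(1.0, step / max(1, total_steps))
    return int(start_budget + frac * (final_budget - start_budget))

for s in (0, 2500, 5000, 10000):
    print(s, token_budget(s, total_steps=10000))   # 1024 -> 800 -> 576 -> 128
```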
Key Takeaways:
- 训练代理在部署时面临严格约束,如资源预算有限或安全要求严格,会带来显著挑战。
- 本文提出的课程学习策略通过逐步加强约束条件,使代理在训练过程中逐步适应部署要求。
- 该策略受无约束强化学习中自定步速学习(self-paced learning)技术的启发,先在简化的约束下训练,再逐步引入完整的部署条件,实现平滑过渡。
- 理论分析证明,在二叉树马尔可夫决策过程(MDP)中,该课程策略相对于从一开始就施加轨迹约束的基线方法能够加速训练。
- 实证研究在强化学习和大型语言模型代理上均验证了该策略的有效性,涵盖二叉树MDP、多任务导航域以及含两个基准的数学推理任务。
- 该策略能够提高代理在部署时处理复杂轨迹约束的效率和性能。
点此查看论文截图
Scalable Evaluation and Neural Models for Compositional Generalization
Authors:Giacomo Camposampiero, Pietro Barbiero, Michael Hersche, Roger Wattenhofer, Abbas Rahimi
Compositional generalization-a key open challenge in modern machine learning-requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack of standardized evaluation protocols and the limitations of current benchmarks, which often favor efficiency over rigor. At the same time, general-purpose vision architectures lack the necessary inductive biases, and existing approaches to endow them compromise scalability. As a remedy, this paper introduces: 1) a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant; 2) an extensive and modern evaluation on the status of compositional generalization in supervised vision backbones, training more than 5000 models; 3) Attribute Invariant Networks, a class of models establishing a new Pareto frontier in compositional generalization, achieving a 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts.
组合泛化是现代机器学习中的一个关键开放挑战,它要求模型能够预测由已知概念构成的未知组合。然而,由于缺乏标准化的评估协议,且当前基准往往偏重效率而牺牲严谨性,评估组合泛化本身仍是一个根本性难题。与此同时,通用视觉架构缺乏必要的归纳偏置,而现有为其注入归纳偏置的方法又会损害可扩展性。为此,本文提出:1)一个严格的评估框架,统一并扩展了以往方法,同时将计算需求从组合级降低到常数级;2)对监督视觉骨干网络中组合泛化现状的大规模现代评估,训练了超过5000个模型;3)属性不变网络(Attribute Invariant Networks),这一类模型在组合泛化上建立了新的帕累托前沿,相比基线准确率提升23.43%,同时将参数开销从完全解耦模型的600%降低到16%。
论文及项目相关链接
PDF Accepted at the Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
Summary
本文探讨了现代机器学习中的关键开放挑战:组合泛化,它要求模型能够预测由已知概念构成的未知组合。针对缺乏标准化评估协议、现有基准偏重效率而牺牲严谨性的问题,文中提出了一个统一且严格的评估框架,统一并扩展了以往方法,并将计算需求从组合级降低到常数级;在此基础上,通过训练超过5000个模型,对监督视觉骨干网络的组合泛化现状进行了全面的现代评估。此外,文中引入了属性不变网络(Attribute Invariant Networks),这一类模型在组合泛化上建立了新的帕累托前沿:相比基线准确率提升23.43%,同时将参数开销从完全解耦模型的600%降低到16%。
Key Takeaways
以下是关于该文本的关键见解:
- 组合泛化是机器学习领域的一个重要挑战,要求模型能够预测已知概念组合形成的未知事物。
- 缺乏标准化的评估协议和现有基准测试的局限性使得评估组合泛化成为一项挑战。这些测试往往过于注重效率而非严谨性。
点此查看论文截图
CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency
Authors:Ehsan Aghazadeh, Ahmad Ghasemi, Hedyeh Beyhaghi, Hossein Pishro-Nik
Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency strategy (arXiv:2203.11171) requires a fixed number of calls and can fail when the correct answer is rare. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers using scalar confidence signals derived from token probabilities or reward models. CGES adaptively halts sampling once the posterior mass of a candidate exceeds a threshold. We provide theoretical guarantees for both perfectly calibrated confidences and realistic noisy confidence signals. Across five reasoning benchmarks, CGES reduces the average number of model calls by about 69 percent (for example, from 16.0 to 4.9) while matching the accuracy of self-consistency within 0.06 percentage points.
大型语言模型(LLM)在测试时经常被多次查询,通过多数投票的方式对预测结果进行汇总。虽然这种方法有效,但这种自洽策略(arXiv:2203.11171)需要固定次数的查询,当正确答案稀少时可能会失效。我们引入了置信引导提前终止(CGES)方法,这是一种贝叶斯框架,利用来自令牌概率或奖励模型的标量置信信号对候选答案形成后验分布。一旦候选答案的后验概率超过阈值,CGES就会自适应地停止采样。我们对完全校准的置信度和现实的嘈杂置信信号都提供了理论保证。在五个推理基准测试中,CGES将模型调用次数平均减少约69%(例如从16.0降至4.9),同时其精度与自洽策略相差不超过0.06个百分点。
论文及项目相关链接
PDF Efficient Reasoning @ NeurIPS2025
Summary
大型语言模型在测试时常常需要进行多次查询,并通过对多数答案的聚合来得到预测。然而,这种做法需要固定次数的查询调用,因此在正确答案罕见的情况下可能失效。为解决这一问题,我们提出了置信度引导早期停止策略(CGES)。这是一种贝叶斯框架,利用来自令牌概率或奖励模型的标量置信信号来构建候选答案的后验分布。一旦某个候选答案的后验概率超过阈值,CGES就会自适应地停止采样。我们对完全校准的置信度和现实的噪声置信信号都提供了理论保证。在五个推理基准测试中,CGES在将模型调用次数平均减少约69%(例如从16.0降至4.9)的同时,准确率与自洽策略相近,差异在0.06个百分点以内。这一新方法在提高效率的同时保持了准确性。
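下面给出一个示意性草图(非论文原始算法;其中“按置信度累加、经 softmax 归一化,并为未见答案保留先验质量”的后验构造方式是本文档的假设),用来说明“某候选答案的后验质量超过阈值即提前停止采样”的核心流程:

```python
import math
import random
from collections import defaultdict

random.seed(0)

def sample_answer():
    """占位采样:返回 (答案, 标量置信度)。真实系统中置信度来自 token 概率或奖励模型。"""
    ans = random.choices(["42", "41"], weights=[0.7, 0.3])[0]
    conf = random.uniform(0.6, 0.95) if ans == "42" else random.uniform(0.3, 0.6)
    return ans, conf

def cges(threshold: float = 0.9, max_calls: int = 16, prior: float = 1.0):
    """置信引导提前终止:一旦某候选答案的后验质量超过阈值就停止采样。"""
    evidence = defaultdict(float)
    best, posterior = None, {}
    for call in range(1, max_calls + 1):
        ans, conf = sample_answer()
        evidence[ans] += conf
        # prior 项为“尚未见过的其他答案”保留质量,避免单次采样后立即停止
        z = math.exp(prior) + sum(math.exp(v) for v in evidence.values())
        posterior = {a: math.exp(v) / z for a, v in evidence.items()}
        best = max(posterior, key=posterior.get)
        if posterior[best] >= threshold:
            return best, call, posterior
    return best, max_calls, posterior

answer, calls, post = cges()
print(f"answer={answer}, calls={calls}")   # 通常只需少量调用即可停止,而非固定的 16 次
```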
Key Takeaways
- 大型语言模型测试时采用多次查询并聚合的策略是有效的,但存在固定调用次数和正确答案罕见情况下的失败风险。
- 置信度引导早期停止策略(CGES)是一种基于贝叶斯框架的新方法,利用标量置信信号构建候选答案的后验分布。
- CGES能自适应地停止采样,当某个候选答案的后验概率超过阈值时。
- 在理论层面上,CGES对完全校准的置信度和现实噪声情境下的置信信号提供保障。
- 在五个基准测试中,CGES在减少模型调用次数的同时,保持了与自洽策略相近的准确率。
- CGES能够显著提高大型语言模型的效率,平均减少模型调用次数约69%。
点此查看论文截图
CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
Authors:Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan
In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language model that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
在人类认知中,存在许多难以言表、超越语言表达的思维过程,使我们能够以多种方式理解世界并与之互动。然而,当前的视觉语言模型(VLMs)仍然只能在离散而僵化的语言符号空间内推理,从而限制了视觉感知丰富的高维特性。为弥补这一差距,我们提出了CoCoVa(连续视觉语言思维链),这是一种利用连续跨模态推理来完成多种视觉语言任务的新型框架。CoCoVa的核心是一个迭代推理循环,其中新型的潜态Q-Former(LQ-Former)作为动态推理引擎,通过跨模态融合迭代地完善一条潜在思维向量链。为了聚焦这一过程,标记选择机制会动态识别显著的视觉区域,模仿注意力的聚焦。为了确保这些潜在思维始终有据可依,我们以结合对比学习与基于扩散重建的多任务目标训练模型,强化潜在表征与视觉、文本两种模态之间的对齐。评估表明,CoCoVa相对于强基线提高了准确率和标记效率。使用1.5B规模的骨干模型时,它在几乎所有基准上可与更大的7B-9B模型竞争甚至超越它们;扩展到7B规模的LLM骨干时,仍可与最先进模型相竞争。定性分析验证了所学的潜在空间捕获了可解释且结构化的推理模式,突显了CoCoVa在弥合离散语言处理与连续视觉理解之间表征差距方面的潜力。
论文及项目相关链接
Summary
本文提出一种新型的跨模态视觉语言模型——CoCoVa(连续视觉语言思维链),它突破了传统视觉语言模型在离散语言符号上的局限,利用连续跨模态推理完成多样化的视觉语言任务。核心在于一个迭代推理循环,其中新颖的潜态Q-Former(LQ-Former)作为动态推理引擎,通过跨模态融合,不断精炼潜态思维向量链。训练过程中结合对比学习和扩散重建的多任务目标,确保潜态思维与视觉和文本模态对齐。评估结果显示,CoCoVa在强基线基础上提高了准确性和符号效率,具有潜力缩小离散语言处理和连续视觉理解之间的代表性差距。
Key Takeaways
- CoCoVa是一种新型的跨模态视觉语言模型,旨在突破传统模型的局限。
- 它利用连续跨模态推理完成多样化的视觉语言任务。
- CoCoVa的核心是一个迭代推理循环,其中LQ-Former作为动态推理引擎发挥关键作用。
- 训练过程结合对比学习和扩散重建的多任务目标,确保潜态思维与视觉和文本模态对齐。
- CoCoVa提高了准确性和符号效率,表现优于强基线。
- 当扩展到大型语言模型时,CoCoVa保持与最新模型的竞争力。
点此查看论文截图
The Sequential Edge: Inverse-Entropy Voting Beats Parallel Self-Consistency at Matched Compute
Authors:Aman Sharma, Paras Chopra
We revisit test-time scaling for language model reasoning and ask a fundamental question: at equal token budget and compute, is it better to run multiple independent chains in parallel, or to run fewer chains that iteratively refine through sequential steps? Through comprehensive evaluation across 5 state-of-the-art open source models and 3 challenging reasoning benchmarks, we find that sequential scaling where chains explicitly build upon previous attempts consistently outperforms the dominant parallel self-consistency paradigm in 95.6% of configurations with gains in accuracy upto 46.7%. Further, we introduce inverse-entropy weighted voting, a novel training-free method to further boost the accuracy of sequential scaling. By weighing answers in proportion to the inverse entropy of their reasoning chains, we increase our success rate over parallel majority and establish it as the optimal test-time scaling strategy. Our findings fundamentally challenge the parallel reasoning orthodoxy that has dominated test-time scaling since Wang et al.’s self-consistency decoding (Wang et al., 2022), positioning sequential refinement as the robust default for modern LLM reasoning and necessitating a paradigm shift in how we approach inference-time optimization.
我们重新探讨了语言模型推理中的测试时扩展,并提出一个基本问题:在相同的令牌预算和计算条件下,是并行运行多个独立链更好,还是运行更少的链、通过顺序步骤迭代细化更好?通过对5个最先进开源模型和3个具有挑战性的推理基准的全面评估,我们发现让各链明确建立在先前尝试之上的顺序扩展,始终优于主流的并行自洽(self-consistency)范式,在95.6%的配置中占优,准确率最高提升46.7%。此外,我们引入了逆熵加权投票法,这是一种无需训练的新方法,可进一步提高顺序扩展的准确性。通过按推理链逆熵的比例为答案加权,我们的成功率超过了并行多数投票,并将其确立为最优的测试时扩展策略。我们的发现从根本上挑战了自Wang等人提出自洽解码以来主导测试时扩展的并行推理正统观念(Wang et al., 2022),将顺序细化定位为现代大型语言模型推理的稳健默认选择,并要求我们在推理时优化的思路上进行范式转变。
论文及项目相关链接
Summary
关于测试时语言模型推理的规模扩展,研究重新探讨了并行链与顺序精细化两种策略。评估多个模型和基准测试后,发现顺序精细化扩展法,即在尝试之间构建明确的依赖性并持续优化,在所有配置中一致地优于并行一致性范式,准确率提升高达46.7%。此外,研究还引入了逆熵加权投票法,这是一种无需训练的方法,可进一步提升顺序精细化的准确性。通过根据推理链的逆熵比例加权答案,该方法提高了成功率,并确立顺序精细化为现代大型语言模型推理的稳健默认选择。这挑战了自Wang等人以来的并行推理正统观念,并改变了我们对推理时间优化的看法。
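逆熵加权投票本身可以写成几行代码。下面是一个示意草图,其中将推理链的“熵”取为链上 token 的平均负对数概率;这一具体定义与示例数据均为本文档的假设,论文可能采用不同的熵估计方式:

```python
import math
from collections import defaultdict
from typing import List, Tuple

def chain_entropy(token_probs: List[float]) -> float:
    """推理链的平均负对数概率,作为不确定性(熵)的一个简单代理。"""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def inverse_entropy_vote(chains: List[Tuple[str, List[float]]]) -> str:
    """逆熵加权投票:每条链的票权与其熵成反比,低熵(更自信)的链权重更大。"""
    scores = defaultdict(float)
    for answer, probs in chains:
        scores[answer] += 1.0 / (chain_entropy(probs) + 1e-8)
    return max(scores, key=scores.get)

chains = [
    ("42", [0.90, 0.80, 0.95]),   # 低熵链
    ("41", [0.40, 0.50, 0.45]),   # 高熵链
    ("41", [0.50, 0.40, 0.50]),
]
print(inverse_entropy_vote(chains))   # 尽管“41”票数更多,低熵的“42”胜出
```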
Key Takeaways
- 测试时语言模型推理的规模扩展存在两种主要策略:并行链和顺序精细化。
- 在大多数配置中,顺序精细化扩展法优于并行一致性范式。此方法在尝试间构建依赖关系并持续优化。
- 通过综合评估五个最新开源模型和三个具有挑战性的推理基准测试得出上述结论。顺序精细化方法准确率提升高达46.7%。
- 引入了一种新的无需训练的优化方法:逆熵加权投票法,可以进一步提高顺序精细化的准确性。
- 该方法按推理链逆熵的比例为答案加权,从而提高成功率,成为测试时扩展的最优策略选择;其表现超过并行多数投票,打破了当前并行推理的正统观念。
- 研究结果确立了顺序精细化为现代大型语言模型推理的稳健默认选择。这对于未来的语言模型设计和应用具有指导意义。
点此查看论文截图
VFocus: Better Verilog Generation from Large Language Model via Focused Reasoning
Authors:Zhuorui Zhao, Bing Li, Grace Li Zhang, Ulf Schlichtmann
Large Language Models (LLMs) have shown impressive potential in generating Verilog codes, but ensuring functional correctness remains a challenge. Existing approaches often rely on self-consistency or simulation feedback to select the best candidate, but they miss opportunities to focus LLM reasoning on the most informative parts of the design. We propose VFocus, a three-stage framework that enhances Verilog generation by sharpening the focus of LLM reasoning onto critical decision points in the code generation process. In the pre-ranking stage, VFocus generates multiple code candidates through LLM prompting, retries for syntactically valid outputs, and introduces a Density-guided Filtering to retain candidates that fall within the “reasoning sweet spot” for functional correctness. In the ranking stage, we simulate each code candidate using an automatically generated testbench and apply self-consistency-based clustering to identify the most consistent outputs. Finally, in the post-ranking refinement stage, VFocus performs inconsistency mining on top-ranked candidates and invokes reasoning-augmented LLM prompts for candidate refinement. Experiments on the VerilogEval-Human benchmark show that VFocus significantly improves the pass@1 correctness across multiple reasoning LLMs, demonstrating its effectiveness in enhancing Verilog generation for complex hardware design tasks.
大型语言模型(LLM)在生成Verilog代码方面展现出令人印象深刻的潜力,但确保功能正确性仍然是一个挑战。现有方法通常依赖自我一致性或仿真反馈来选择最佳候选,却错失了把LLM推理聚焦于设计中信息量最大部分的机会。我们提出了VFocus,这是一个三阶段框架,通过把LLM推理的注意力集中到代码生成过程中的关键决策点来增强Verilog生成。在预排序阶段,VFocus通过LLM提示生成多个代码候选,对输出进行重试以确保语法合法,并引入密度引导过滤(Density-guided Filtering),保留落在功能正确性“推理最佳区间”(reasoning sweet spot)内的候选。在排序阶段,我们使用自动生成的测试平台对每个代码候选进行仿真,并应用基于自我一致性的聚类来识别最一致的输出。最后,在后排序精炼阶段,VFocus对排名靠前的候选进行不一致性挖掘,并调用推理增强的LLM提示对候选进行精炼。在VerilogEval-Human基准上的实验表明,VFocus显著提高了多个推理LLM的pass@1正确率,证明了其在复杂硬件设计任务中增强Verilog生成的有效性。
论文及项目相关链接
PDF accepted by SOCC 2025
Summary
大型语言模型(LLMs)在生成Verilog代码方面展现出巨大潜力,但保证功能正确性仍是挑战。现有方法常依赖自我一致性或仿真反馈来选择最佳候选,却忽略了将LLM推理聚焦于设计中最具信息量部分的机会。本文提出VFocus,这是一个三阶段框架,通过聚焦LLM推理来提升Verilog代码生成质量。预排序阶段通过LLM提示生成多个代码候选,通过重试获得语法合法的输出,并利用密度引导过滤保留落在功能正确性“推理最佳区间”内的候选。排序阶段使用自动生成的测试平台对代码候选进行仿真,并通过基于自我一致性的聚类识别最一致的输出。最后在后排序精炼阶段,VFocus对排名靠前的候选进行不一致性挖掘,并调用推理增强的LLM提示进行候选优化。实验表明,VFocus在VerilogEval-Human基准上显著提高了多个推理LLM的pass@1正确率,证明其在复杂硬件设计任务中增强Verilog代码生成的有效性。
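下面用一个极简草图示意排序阶段“按测试平台仿真输出对候选聚类、一致输出最多的簇排在最前”的自我一致性思想(非论文官方实现;仿真输出在此用假设的字符串签名代替真实的 Verilog 仿真结果):

```python
from collections import defaultdict
from typing import Dict, List

def rank_by_self_consistency(sim_outputs: Dict[str, str]) -> List[List[str]]:
    """按仿真输出签名对候选代码聚类,并按簇大小从大到小排序。"""
    clusters = defaultdict(list)
    for candidate, signature in sim_outputs.items():
        clusters[signature].append(candidate)
    return sorted(clusters.values(), key=len, reverse=True)

# 假设的仿真结果:候选代码 -> 测试平台输出签名(真实流程中来自 Verilog 仿真器)
sim_outputs = {
    "cand_1": "0110_1011",
    "cand_2": "0110_1011",
    "cand_3": "0000_1111",
    "cand_4": "0110_1011",
}
print(rank_by_self_consistency(sim_outputs))   # [['cand_1', 'cand_2', 'cand_4'], ['cand_3']]
```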
Key Takeaways
- 大型语言模型(LLMs)在Verilog代码生成方面展现出潜力,但功能正确性是挑战。
- 现有方法主要依赖自我一致性或模拟反馈,但可能忽略关键决策点。
- VFocus是一个三阶段框架,旨在通过优化LLM推理来提升Verilog代码生成质量。
- 在预排序阶段,VFocus通过LLM提示生成多个候选,进行语法验证和密度引导过滤。
- 排序阶段使用测试平台模拟代码候选,通过自我一致性聚类识别最一致输出。
- 在排名后的改进阶段,VFocus挖掘候选不一致性,并通过增强推理的LLM提示进行优化。
点此查看论文截图
SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
Authors:Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin, Jiacong Wang, Xiao Liang, Shuicheng Yan, Chao Feng
We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.
我们介绍了SAIL-RL,这是一种强化学习(RL)后训练框架,通过教会多模态大型语言模型(MLLMs)何时思考以及如何思考来增强其推理能力。现有方法受限于仅以结果为监督:只奖励正确答案而不保证推理过程的合理性;同时受限于统一的思考策略:往往在简单任务上过度思考、在复杂任务上思考不足。SAIL-RL通过双重奖励系统应对这些挑战:思考奖励从事实依据、逻辑连贯性和答案一致性三方面评估推理质量;判断奖励自适应地判定应进行深度推理还是直接作答。在最先进的SAIL-VL2上的实验表明,SAIL-RL在4B和8B规模上均提升了推理与多模态理解基准的表现,与GPT-4o等商业闭源模型相比具有竞争力,并显著减少了幻觉,使其成为构建更可靠、更自适应的MLLMs的原则性框架。代码将在https://github.com/BytedanceDouyinContent/SAIL-RL上提供。
论文及项目相关链接
Summary
SAIL-RL是一个强化学习(RL)后训练框架,通过教会多模态大型语言模型(MLLMs)何时思考以及如何思考来增强其推理能力。它针对现有方法的两个问题:仅以结果为监督、无法保证推理质量,以及统一的思考策略导致简单任务过度思考、复杂任务思考不足;并用思考奖励与判断奖励组成的双奖励系统加以解决。实验表明,SAIL-RL在4B和8B规模上均提升了最先进的SAIL-VL2在推理和多模态理解基准上的表现,与GPT-4o等商业闭源模型相比具有竞争力,并显著减少了幻觉。
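论文未给出双奖励的具体组合公式。下面是一个纯示意的草图(加权求和的形式、0.5 的权重以及各子奖励的取值方式均为假设),仅用于说明“思考奖励 + 判断奖励”如何共同鼓励“该想才想、该答就答”的行为:

```python
def sail_rl_style_reward(answer_correct: bool,
                         used_deep_reasoning: bool,
                         needs_deep_reasoning: bool,
                         thinking_quality: float,
                         w_think: float = 0.5, w_judge: float = 0.5) -> float:
    """示意性的双奖励组合:
    - 思考奖励:若进行了深度推理,则按推理质量打分;否则退化为答案是否正确;
    - 判断奖励:是否做到“该思考时思考、该直答时直答”。"""
    thinking_reward = thinking_quality if used_deep_reasoning else float(answer_correct)
    judging_reward = 1.0 if used_deep_reasoning == needs_deep_reasoning else 0.0
    return w_think * thinking_reward + w_judge * judging_reward

# 简单任务直接作答(判断正确)vs. 简单任务过度思考(判断错误)
print(sail_rl_style_reward(True, False, False, thinking_quality=0.0))   # 1.0
print(sail_rl_style_reward(True, True,  False, thinking_quality=0.9))   # 0.45
```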
Key Takeaways
- SAIL-RL是一个强化学习后训练框架,用于增强多模态大型语言模型的推理能力。
- 现有方法存在仅通过结果监督的问题,无法确保推理质量。
- SAIL-RL通过双奖励系统解决这些问题:思考奖励用于评估推理质量,判断奖励用于确定适当的推理深度。
- SAIL-RL提高了SAIL-VL2模型的推理和多模态理解基准测试性能。
- 与商业闭源模型如GPT-4o相比,SAIL-RL具有竞争力。
- SAIL-RL显著减少了模型产生的幻觉。
点此查看论文截图
TabDSR: Decompose, Sanitize, and Reason for Complex Numerical Reasoning in Tabular Data
Authors:Changjiang Jiang, Fengchang Yu, Haihua Chen, Wei Lu, Jin Zeng
Complex reasoning over tabular data is crucial in real-world data analysis, yet large language models (LLMs) often underperform due to complex queries, noisy data, and limited numerical capabilities. To address these issues, we propose TabDSR, a framework consisting of: (1) a query decomposer that breaks down complex questions, (2) a table sanitizer that cleans and filters noisy tables, and (3) a program-of-thoughts (PoT)-based reasoner that generates executable code to derive the final answer from the sanitized table. To ensure unbiased evaluation and mitigate data leakage, we introduce a new dataset, CalTab151, specifically designed for complex numerical reasoning over tables. Experimental results demonstrate that TabDSR consistently outperforms existing methods, achieving state-of-the-art (SOTA) performance with 8.79%, 6.08%, and 19.87% accuracy improvement on TAT-QA, TableBench, and CalTab151, respectively. Moreover, our framework integrates seamlessly with mainstream LLMs, providing a robust solution for complex tabular numerical reasoning. These findings highlight the effectiveness of our framework in enhancing LLM performance for complex tabular numerical reasoning. Data and code are available upon request.
现实世界的数据分析中,对表格数据进行复杂推理至关重要。然而,大型语言模型(LLMs)往往因复杂的查询、嘈杂的数据和有限的数值能力而表现不佳。为了解决这些问题,我们提出了TabDSR框架,它包括:(1)查询分解器,用于分解复杂问题;(2)表格净化器,用于清理和过滤嘈杂的表格;(3)基于程序思维(PoT)的推理器,生成可执行代码从净化后的表格中推导出最终答案。为了确保公正的评价并减轻数据泄露问题,我们引入了一个新数据集CalTab151,该数据集专为表格上的复杂数值推理设计。实验结果表明,TabDSR始终优于现有方法,在TAT-QA、TableBench和CalTab151上分别取得8.79%、6.08%和19.87%的准确率提升,达到最先进的性能。此外,我们的框架可以与主流LLMs无缝集成,为复杂表格数值推理提供了稳健的解决方案。这些发现凸显了该框架在增强LLM复杂表格数值推理性能方面的有效性。数据和代码可在请求后提供。
论文及项目相关链接
PDF Accepted to EMNLP 2025 Findings
Summary
复杂表格数据推理在真实数据分析中具有重要地位,而大型语言模型(LLMs)常因复杂查询、噪声数据和有限的数值处理能力而表现不佳。为解决这些问题,我们提出了TabDSR框架,包括查询分解器、表格净化器和基于程序思维(PoT)的推理器。为确保公正评估并避免数据泄露,我们引入专为复杂表格数值推理设计的新数据集CalTab151。实验结果表明,TabDSR在TAT-QA、TableBench和CalTab151上的准确率分别提高了8.79%、6.08%和19.87%,并始终优于现有方法,达到最新技术水平。此外,TabDSR与主流LLM无缝集成,为复杂表格数值推理提供了稳健的解决方案。这些发现突显了该框架在提升LLM处理复杂表格数值推理性能方面的有效性。数据和代码可供请求获取。
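下面用一个极简的端到端草图示意“分解、净化、PoT 推理”三个阶段如何串联(非论文官方实现;三个函数的内部逻辑均为占位假设,真实系统中问题分解与可执行代码均由 LLM 生成):

```python
from typing import Dict, List

def decompose(query: str) -> List[str]:
    """查询分解器占位:把复杂问题拆成子问题(真实系统中由 LLM 完成)。"""
    return ["筛选 2023 年的行", "对 revenue 列求和"]

def sanitize(table: List[Dict]) -> List[Dict]:
    """表格净化器占位:丢弃数值非法或缺失关键字段的噪声行。"""
    clean = []
    for row in table:
        try:
            clean.append({**row, "revenue": float(row["revenue"])})
        except (KeyError, TypeError, ValueError):
            continue
    return clean

def pot_reason(table: List[Dict], subtasks: List[str]) -> float:
    """PoT 推理器占位:真实系统中 LLM 会生成可执行代码,这里直接写出等价计算。"""
    return sum(r["revenue"] for r in table if r.get("year") == 2023)

table = [
    {"year": 2023, "revenue": "120.5"},
    {"year": 2022, "revenue": "98.0"},
    {"year": 2023, "revenue": "n/a"},      # 噪声行,会被净化器过滤
]
print(pot_reason(sanitize(table), decompose("2023 年总收入是多少?")))   # 120.5
```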
Key Takeaways
- 大型语言模型(LLMs)在复杂表格数据推理方面存在挑战,如处理复杂查询、噪声数据和有限的数值处理能力。
- 提出名为TabDSR的框架,包含查询分解器、表格净化器和基于程序思维的推理器,以改进LLMs的性能。
- 引入新数据集CalTab151,专为复杂表格数值推理设计,以确保公正评估并避免数据泄露。
- TabDSR框架在多个数据集上的实验表现优于现有方法,达到最新技术水平。
- TabDSR框架与主流LLM无缝集成,为复杂表格数值推理提供了稳健的解决方案。
- 框架的有效性在于提高LLM处理复杂表格数值推理的性能。
点此查看论文截图
Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration
Authors:Jingbo Wang, Sendong Zhao, Haochun Wang, Yuzheng Fan, Lizhe Zhang, Yan Liu, Ting Liu
The emergence of multi-agent systems powered by large language models (LLMs) has unlocked new frontiers in complex task-solving, enabling diverse agents to integrate unique expertise, collaborate flexibly, and address challenges unattainable for individual models. However, the full potential of such systems is hindered by rigid agent scheduling and inefficient coordination strategies that fail to adapt to evolving task requirements. In this paper, we propose STRMAC, a state-aware routing framework designed for efficient collaboration in multi-agent systems. Our method separately encodes interaction history and agent knowledge to power the router, which adaptively selects the most suitable single agent at each step for efficient and effective collaboration. Furthermore, we introduce a self-evolving data generation approach that accelerates the collection of high-quality execution paths for efficient system training. Experiments on challenging collaborative reasoning benchmarks demonstrate that our method achieves state-of-the-art performance, achieving up to 23.8% improvement over baselines and reducing data collection overhead by up to 90.1% compared to exhaustive search.
多智能体系统正通过大型语言模型(LLM)的力量开拓复杂任务解决的新疆界,它允许各类智能体融入独特专长,灵活协作,并解决单个模型无法应对的挑战。然而,此类系统的全部潜力受限于僵化的智能体调度和无法适应不断变化的任务要求的低效协调策略。在本文中,我们提出了面向多智能体系统高效协作的状态感知路由框架STRMAC。我们的方法单独编码交互历史和智能体知识来为路由器提供动力,路由器自适应地选择每一步中最合适的单一智能体以实现高效且有效的协作。此外,我们还引入了一种自我进化的数据生成方法,可加速高质量执行路径的收集以用于高效的系统训练。在具有挑战性的协作推理基准测试上的实验表明,我们的方法达到了最先进的性能水平,相较于基线方法实现了高达23.8%的提升,相较于穷举搜索减少了高达90.1%的数据收集开销。
论文及项目相关链接
Summary
大型语言模型驱动的多智能体系统为复杂任务求解开启了新的可能,使不同智能体能够整合独特的专业知识并灵活协作,应对单个模型无法解决的挑战。然而,这类系统的潜力受限于僵化的智能体调度和低效的协调策略,难以适应不断变化的任务需求。本文提出STRMAC,一种状态感知的路由框架:它分别编码交互历史与智能体知识来驱动路由器,在每一步自适应地选择最合适的单个智能体,从而实现高效且有效的协作。同时,本文提出一种自我进化的数据生成方法,加速高质量执行路径的收集,以提高系统训练效率。在具有挑战性的协同推理基准上,该方法达到最新水平,相比基线最高提升23.8%,并将数据收集开销相比穷举搜索最多降低90.1%。
Key Takeaways:
- 多智能体系统通过大型语言模型实现复杂任务解决的新突破,允许不同智能体整合专业知识并灵活协作。
- 当前系统受限于调度僵化与协调策略低效的问题,无法适应任务变化多样性。
- 提出STRMAC态势感知路由框架,通过互动历史及代理知识编码驱动路由器选择机制,实现智能体高效协作。
- 引入自我进化的数据生成方式加快高质量执行路径收集,提高系统训练效率。
- 在协同推理基准测试中表现领先,相较于传统方法有明显性能提升。
- 技术创新包括自适应选择最佳智能体及高效数据收集方法。
点此查看论文截图
Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
Authors:Shufan Wang, Xing Hu, Junkai Chen, Zhiyuan Pan, Xin Xia
With the widespread application of large language models (LLMs) in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of mainstream LLMs across different tasks, and further evaluate the effectiveness of techniques such as prompt strategy optimisation and mathematical calibration (e.g., Platt Scaling) in improving confidence reliability. Our results show that DeepSeek-Reasoner achieves the best performance across various tasks, outperforming other models by up to $0.680$, $0.636$, and $13.652$ in terms of ECE, Brier Score, and Performance Score, respectively. The hybrid strategy combining the reassess prompt strategy and Platt Scaling achieves improvements of up to $0.541$, $0.628$, and $15.084$ over the original performance in the aforementioned three metrics. These results indicate that models with reasoning capabilities demonstrate superior confidence reliability, and that the hybrid strategy is the most effective in enhancing the confidence reliability of various models. Meanwhile, we elucidate the impact of different task complexities, model scales, and strategies on confidence performance, and highlight that the confidence of current LLMs in complex reasoning tasks still has considerable room for improvement. This study not only provides a research foundation and technical reference for the application of confidence in LLM-assisted software engineering, but also points the way for future optimisation and engineering deployment of confidence mechanisms.
随着大型语言模型(LLM)在代码智能领域的广泛应用,其输出在代码推理任务中的可靠性和可控性越来越受到关注。信心估计是一种有效且方便的评估这些方面的方法。本文针对代码推理任务提出了一个针对LLM的信心分析和增强框架。我们对不同任务中主流LLM的信心可靠性进行了全面的实证研究,并进一步评估了提示策略优化和数学校准(如Platt Scaling)等技术在提高信心可靠性方面的有效性。结果表明,DeepSeek-Reasoner在各项任务中表现最佳,在ECE、Brier Score和Performance Score方面分别比其他模型高出0.680、0.636和13.652。结合重新评估提示策略和Platt Scaling的混合策略,在上述三个指标中,相较于原始性能最高可提升0.541、0.628和15.084。这些结果表明,具备推理能力的模型表现出更高的信心可靠性,混合策略在提高各种模型的信心可靠性方面最为有效。同时,我们阐述了不同任务复杂性、模型规模和策略对信心表现的影响,并强调当前LLM在复杂推理任务中的信心仍有很大的改进空间。本研究不仅为信心在LLM辅助软件工程中的应用提供了研究基础和技术参考,而且为信心机制的未来优化和工程部署指明了方向。
论文及项目相关链接
PDF 13 pages, 4 figures
摘要
随着大型语言模型(LLMs)在代码智能领域的广泛应用,其输出在代码推理任务中的可靠性和可控性引起了越来越多的关注。信心估计是一种有效和方便的评估这些方面的方法。本文针对代码推理任务,提出了一个针对LLMs的信心分析和增强框架。我们对主流LLMs在不同任务上的信心可靠性进行了全面的实证研究,并进一步评估了优化提示策略和数学校准(如Platt Scaling)等技术提高信心可靠性的有效性。实验结果显示,DeepSeek-Reasoner在各种任务上表现最佳,与其他模型相比,其在ECE、Brier Score和Performance Score上的表现分别提高了0.680、0.636和13.652。结合重新评估提示策略和Platt Scaling的混合策略,在上述三个指标中分别提高了0.541、0.628和15.084。这些结果表明,具有推理能力的模型表现出更高的信心可靠性,混合策略是最有效的增强各种模型信心可靠性的方法。同时,本文阐明了不同任务复杂度、模型规模、策略对信心表现的影响,并强调当前LLMs在复杂推理任务中的信心仍有很大的改进空间。本研究不仅为信心在LLM辅助软件工程中的应用提供了研究基础和技术参考,也为信心机制的未来优化和工程部署指明了方向。
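下面给出 Platt Scaling 与 ECE(期望校准误差)的极简实现草图,用于说明“数学校准如何改善置信度可靠性”这一思路;示例数据与用朴素梯度下降拟合逻辑回归的做法均为本文档的假设,并非论文所用实现:

```python
import math
from typing import Callable, List

def platt_scale(scores: List[float], labels: List[int],
                lr: float = 0.1, iters: int = 5000) -> Callable[[float], float]:
    """Platt Scaling:用逻辑回归 p = sigmoid(a*s + b) 重新校准原始置信度 s。"""
    a, b, n = 1.0, 0.0, len(scores)
    for _ in range(iters):
        ga = gb = 0.0
        for s, y in zip(scores, labels):           # 对数损失对 a、b 的梯度
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n
            gb += (p - y) / n
        a, b = a - lr * ga, b - lr * gb
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))

def ece(confidences: List[float], labels: List[int], n_bins: int = 5) -> float:
    """期望校准误差:各置信度分箱内 |准确率 - 平均置信度| 按样本数加权求和。"""
    total, err = len(confidences), 0.0
    for i in range(n_bins):
        lo, hi = i / n_bins, (i + 1) / n_bins
        idx = [j for j, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        acc = sum(labels[j] for j in idx) / len(idx)
        conf = sum(confidences[j] for j in idx) / len(idx)
        err += len(idx) / total * abs(acc - conf)
    return err

# 模型自报置信度普遍偏高(过度自信);在这个小例子中,校准通常会降低 ECE
raw_conf = [0.95, 0.90, 0.90, 0.85, 0.80, 0.80, 0.75, 0.70]
correct  = [1,    1,    0,    1,    0,    1,    0,    0]
calibrate = platt_scale(raw_conf, correct)
print("ECE before:", round(ece(raw_conf, correct), 3))
print("ECE after :", round(ece([calibrate(c) for c in raw_conf], correct), 3))
```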
关键见解
- 大型语言模型(LLMs)在代码智能领域的信心可靠性问题受到关注。
- 提出了针对代码推理任务的LLMs信心分析和增强框架。
- 主流LLMs的信心可靠性实证研究,揭示了其性能差异。
- 提示策略优化和数学校准(如Platt Scaling)等技术有效提高信心可靠性。
- DeepSeek-Reasoner表现最佳,混合策略对于增强信心可靠性最有效。
- 不同任务复杂度、模型规模、策略对信心表现有影响。
点此查看论文截图
Personalized Decision Modeling: Utility Optimization or Textualized-Symbolic Reasoning
Authors:Yibo Zhao, Yang Zhao, Hongru Du, Hao Frank Yang
Decision-making models for individuals, particularly in high-stakes scenarios like vaccine uptake, often diverge from population optimal predictions. This gap arises from the uniqueness of the individual decision-making process, shaped by numerical attributes (e.g., cost, time) and linguistic influences (e.g., personal preferences and constraints). Developing upon Utility Theory and leveraging the textual-reasoning capabilities of Large Language Models (LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric Reasoning framework (ATHENA) to address the optimal information integration. ATHENA uniquely integrates two stages: First, it discovers robust, group-level symbolic utility functions via LLM-augmented symbolic discovery; Second, it implements individual-level semantic adaptation, creating personalized semantic templates guided by the optimal utility to model personalized choices. Validated on real-world travel mode and vaccine choice tasks, ATHENA consistently outperforms utility-based, machine learning, and other LLM-based models, lifting F1 score by at least 6.5% over the strongest cutting-edge models. Further, ablation studies confirm that both stages of ATHENA are critical and complementary, as removing either clearly degrades overall predictive performance. By organically integrating symbolic utility modeling and semantic adaptation, ATHENA provides a new scheme for modeling human-centric decisions. The project page can be found at https://yibozh.github.io/Athena.
针对个人决策的模型,特别是在疫苗接种等高风险场景中,往往与群体最优预测存在分歧。这一差距源于个人决策过程的独特性,这一过程同时受数值属性(如成本、时间)和语言因素(如个人偏好与约束)的影响。本文基于效用理论,利用大型语言模型的文本推理能力,提出了自适应文本符号人类中心推理框架(ATHENA),以解决最优信息融合问题。ATHENA独特地集成了两个阶段:首先,它通过大型语言模型增强的符号发现来发现稳健的群体层面符号效用函数;其次,它实现个体层面的语义适应,创建由最佳效用引导的个性化语义模板,以模拟个性化选择。在真实世界的出行模式和疫苗选择任务中进行了验证,ATHENA始终优于基于效用、机器学习和其他大型语言模型的方法,比最先进的模型高出至少6.5%的F1分数。此外,消融研究证实ATHENA的两个阶段都是关键且互补的,移除任何一个都会明显降低整体预测性能。通过有机地结合符号效用建模和语义适应,ATHENA为模拟以人类为中心的决策提供了新的方案。项目页面可在https://yibozh.github.io/Athena找到。
论文及项目相关链接
Summary:
个体决策模型在高风险情境下如疫苗接种中的选择往往与群体最优预测存在差距。文章结合效用理论,利用大型语言模型的文本推理能力,提出了一个自适应文本符号人类中心化推理框架(ATHENA)。该框架分为两个阶段:第一阶段通过增强语言模型符号发现能力的群体层面符号效用函数;第二阶段实现个体语义适应,创建个性化语义模板以模拟个性化选择。验证结果表明,ATHENA在真实世界旅行模式和疫苗选择任务上表现优于其他模型,F1分数提高了至少6.5%。此框架将符号效用建模和语义适应有机结合,为模拟以人类为中心的决策提供了新的方案。
Key Takeaways:
- 个体决策模型与群体最优预测在高风险情境下存在差距。
- 差距源于个体决策过程的独特性和个体差异。
- ATHENA框架结合了效用理论和大型语言模型的文本推理能力。
- ATHENA框架包括两个阶段:符号效用函数发现和个体语义适应。
- ATHENA在真实世界任务中表现优异,F1分数提高至少6.5%。
- 框架中的两个阶段都是关键且互补的,缺一不可。
点此查看论文截图
Deep Value Benchmark: Measuring Whether Models Generalize Deep values or Shallow Preferences
Authors:Joshua Ashkinaze, Hua Shen, Sai Avula, Eric Gilbert, Ceren Budak
We introduce the Deep Value Benchmark (DVB), an evaluation framework that directly tests whether large language models (LLMs) learn fundamental human values or merely surface-level preferences. This distinction is critical for AI alignment: Systems that capture deeper values are likely to generalize human intentions robustly, while those that capture only superficial patterns in preference data risk producing misaligned behavior. The DVB uses a novel experimental design with controlled confounding between deep values (e.g., moral principles) and shallow features (e.g., superficial attributes). In the training phase, we expose LLMs to human preference data with deliberately correlated deep and shallow features – for instance, where a user consistently prefers (non-maleficence, formal language) options over (justice, informal language) alternatives. The testing phase then breaks these correlations, presenting choices between (justice, formal language) and (non-maleficence, informal language) options. This design allows us to precisely measure a model’s Deep Value Generalization Rate (DVGR) – the probability of generalizing based on the underlying value rather than the shallow feature. Across 9 different models, the average DVGR is just 0.30. All models generalize deep values less than chance. Larger models have a (slightly) lower DVGR than smaller models. We are releasing our dataset, which was subject to three separate human validation experiments. DVB provides an interpretable measure of a core feature of alignment.
我们介绍了深度价值基准(DVB),这是一个评估框架,能够直接测试大型语言模型(LLM)是学习了基本的人类价值还是仅仅学习了表面层次的偏好。这一区别对于人工智能对齐至关重要:捕获更深价值的系统可能会稳健地泛化人类意图,而仅捕获偏好数据中表面模式的系统则存在产生错位行为的风险。DVB使用了一种新型的实验设计,在深层价值(如道德原则)和浅层特征(如表面属性)之间进行了受控的混淆。在训练阶段,我们让LLM接触人类偏好数据,其中深层和浅层特征是故意相关的——例如,用户一致地更喜欢(无害性、正式语言)选项而不是(正义、非正式语言)选择。测试阶段则打破这些相关性,呈现(正义、正式语言)和(无害性、非正式语言)之间的选择。这种设计允许我们精确地测量模型的深度价值泛化率(DVGR)——基于底层价值而不是浅层特征进行泛化的概率。在9个不同的模型中,平均DVGR仅为0.30。所有模型的深度价值泛化率都低于偶然水平。大型模型的DVGR略低于小型模型。我们已经发布了我们的数据集,该数据集经过了三次独立的人类验证实验。DVB提供了一个可解释的核心特征对齐度量。
论文及项目相关链接
PDF NeurIPS 2025 (Spotlight)
Summary:
介绍了一种名为Deep Value Benchmark(DVB)的评估框架,用于测试大型语言模型(LLM)是否学习人类的基本价值观而非仅表面偏好。该框架采用新颖的实验设计,控制深层价值观(如道德原则)和浅层特征(如表面属性)之间的混淆。通过暴露LLM于故意关联深、浅特征的人类偏好数据,测试模型是否能基于深层价值而非浅层特征进行推广。平均Deep Value Generalization Rate(DVGR)仅为0.30,所有模型的深层价值推广均低于偶然水平。发布的数据集经过了三次独立的人类验证实验。DVB提供了一种可解释的衡量对齐核心特征的方法。
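DVGR 的核心是一个比例统计。下面的小草图(字段名与示例数据均为假设)示意在“深层价值与浅层特征的训练期相关性被打破”的测试项上如何计算该指标:

```python
from typing import Dict, List

def dvgr(test_choices: List[Dict[str, str]]) -> float:
    """Deep Value Generalization Rate:
    统计模型按训练期偏好的深层价值(而非共现的浅层特征)作出选择的比例。"""
    hits = sum(1 for c in test_choices if c["picked"] == c["deep_value_option"])
    return hits / len(test_choices)

# 每个测试项记录:哪个选项承载训练期偏好的深层价值,以及模型实际选了哪个
test_choices = [
    {"deep_value_option": "A", "picked": "A"},
    {"deep_value_option": "B", "picked": "A"},   # 跟随浅层特征而非深层价值
    {"deep_value_option": "A", "picked": "B"},
    {"deep_value_option": "B", "picked": "B"},
]
print(dvgr(test_choices))   # 0.5
```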
Key Takeaways:
- Deep Value Benchmark (DVB) 用于评估大型语言模型是否学习人类基本价值观。
- DVB 通过新颖的实验设计区分深层价值观与浅层特征。
- LLMs 在暴露于故意关联深、浅特征的人类偏好数据后,测试其基于深层价值的推广能力。
- 平均Deep Value Generalization Rate (DVGR) 为 0.30,显示模型在推广深层价值方面的不足。
- 大型模型的DVGR略低于小型模型。
- DVB数据集经过三次独立人类验证实验,确保有效性。
点此查看论文截图
Towards Robust Mathematical Reasoning
Authors:Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, Junehyuk Jung
Finding the right north-star metrics is highly critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or only focus on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists and that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities, which includes both basic and advanced IMO level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of the gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively. We also showed that autograders built with Gemini reasoning correlate well with human evaluations and construct IMO-GradingBench, with 1000 human gradings on proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community towards advancing robust mathematical reasoning and release it at https://imobench.github.io/.
找到正确的北极星指标对于提升基础模型的数学推理能力至关重要,尤其是考虑到现有的评估要么过于简单,要么只关注获得正确的简短答案。为了解决这些问题,我们推出了IMO-Bench,这是一套由顶尖专家团队审核的先进推理基准测试,专门针对国际数学奥林匹克竞赛(IMO)的水平,这是年轻数学家最负盛名的竞技场。IMO-AnswerBench首先在400道附带可验证简短答案的多样化奥赛题上测试模型。IMO-Proof Bench是更高层次的证明书写能力评估,包括基本和高级的IMO级别问题,以及便于自动评分的详细评分标准。这些基准测试在我们使用Gemini Deep Think于IMO 2025取得金牌级表现这一历史性成果中发挥了关键作用(Luong and Lockhart, 2025)。我们的模型在IMO-AnswerBench上达到80.0%,在高级IMO-Proof Bench上达到65.7%,分别以6.9%和42.4%的较大优势超过最佳非Gemini模型。我们还表明,基于Gemini推理构建的自动评分器与人类评估高度相关,并构建了包含1000份人工评分证明的IMO-GradingBench,以推动长篇作答自动评估的进一步发展。我们希望IMO-Bench能帮助社区推动稳健的数学推理的发展,并在https://imobench.github.io/上发布。
论文及项目相关链接
PDF EMNLP 2025 (main conference), https://aclanthology.org/2025.emnlp-main.1794/
Summary
该文针对基础模型数学推理能力评估中存在的问题,提出了IMO-Bench评估套件。该套件包括经过顶级专家审核的、针对国际数学奥林匹克竞赛(IMO)水平的测试,分为IMO-AnswerBench和IMO-Proof Bench两部分。其中,IMO-AnswerBench包含400道可验证简短答案的奥林匹克题目,而IMO-Proof Bench则是对证明写作能力的进阶评估,包含基本和高级的IMO级别问题以及详细的评分标准,以推动自动评分的发展。使用Gemini Deep Think模型在该评估套件中取得了历史性成绩,并在长篇答案的自动评价方面取得了进展。作者希望IMO-Bench能够帮助社区推进稳健的数学推理能力的发展。
Key Takeaways
- IMO-Bench是一个专为推进数学推理能力而设计的评估套件,包括IMO-AnswerBench和IMO-Proof Bench两部分。
- IMO-AnswerBench包含经过顶级专家审核的400道奥林匹克题目,旨在测试模型的短答案能力。
- IMO-Proof Bench是一个针对证明写作能力的进阶评估工具,包含详细的评分标准,以促进自动评分的发展。
- 使用Gemini Deep Think模型在IMO-Bench中取得了显著成绩,尤其在IMO-Proof Bench上的表现令人印象深刻。
- 自动评价系统与人类评价之间的相关性良好,为长篇答案的自动评价奠定了基础。
- IMO-Bench的发布旨在帮助社区推进稳健的数学推理能力的发展。
点此查看论文截图
RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks
Authors:Mian Wu, Gavin Zhang, Sewon Min, Sergey Levine, Aviral Kumar
Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification costs and incomplete assessments of a response, making reinforcement learning (RL) post-training with rubric-based rewards difficult to scale. This problem is exacerbated by the fact that often the best way to combine these rubrics into one single reward is also highly prompt-specific. We propose Reinforcement Learning with Adversarial Critic (RLAC), a post-training approach that addresses these challenges via dynamic rubric verification. Our approach employs a large language model (LLM) as a critic that dynamically identifies only the most likely failure modes (e.g., a factual error or unhandled edge case), which are then verified by an external validator to optimize both generator and critic jointly. By training both the generator and the critic, this game enhances the critic’s error detection and the generator’s output quality while reducing required verifications. Our experiments demonstrate that RLAC improves factual accuracy in text generation and correctness in code generation, while also outperforming exhaustive verification and reward model methods. We show that dynamic critics are more effective than fixed critics, showcasing the potential of RLAC for scaling RL post-training to free-form generation tasks.
开放式生成任务要求输出满足多样化且通常是隐式的任务特定评估标准。相关标准数量众多,导致验证成本高昂、对单条响应的评估不完整,使得基于标准奖励的强化学习(RL)后训练难以扩展。更麻烦的是,把这些标准组合成单一奖励的最佳方式往往也高度依赖具体提示。我们提出带对抗性评论家的强化学习(RLAC),这是一种通过动态标准验证来应对上述挑战的后训练方法。我们的方法采用大型语言模型(LLM)作为评论家,动态地只识别最可能的失败模式(例如事实错误或未处理的边界情况),随后由外部验证器对这些失败点加以核实,从而联合优化生成器和评论家。通过同时训练生成器和评论家,这一博弈增强了评论家的错误检测能力和生成器的输出质量,同时减少了所需的验证次数。实验表明,RLAC提高了文本生成的事实准确性和代码生成的正确性,同时优于穷举验证和奖励模型方法。我们还证明动态评论家比固定评论家更有效,展示了RLAC将RL后训练扩展到自由形式生成任务的潜力。
论文及项目相关链接
PDF Project page: https://mianwu01.github.io/RLAC_website/
Summary
该文提出一种基于对抗性批判的强化学习(RLAC)方法,用于解决开放式生成任务中的多样化评价标准的挑战。RLAC通过动态评估标准来减少验证成本和提高评估的完整性,同时解决奖励组合中面临的难题。它采用大型语言模型作为批评家,动态识别最可能的失败模式,并通过外部验证器进行优化。实验证明,RLAC可以提高文本生成和代码生成的准确性,并优于穷尽验证和奖励模型方法。动态批评家比固定批评家更有效,展示了RLAC在自由形式生成任务中扩展强化学习训练的未来潜力。
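下面是一个单轮交互的示意草图(非论文官方实现;奖励的正负号约定与三个占位函数均为假设),用来说明“评论家只指认最可能的失败点、由外部验证器核实、再据此同时给生成器与评论家打分”的博弈结构:

```python
from typing import Callable

def rlac_round(generator: Callable[[str], str],
               critic: Callable[[str, str], str],
               validator: Callable[[str, str], bool],
               prompt: str):
    """RLAC 单轮交互:评论家指出最可能的失败点,验证器只需核实这一点,而非穷举全部评分细则。"""
    response = generator(prompt)
    claimed_flaw = critic(prompt, response)            # 例如:“人口数字可能有误”
    flaw_is_real = validator(response, claimed_flaw)   # 外部验证(事实核查、测试等)
    generator_reward = -1.0 if flaw_is_real else 1.0
    critic_reward = 1.0 if flaw_is_real else -1.0      # 找到真实错误才得分
    return generator_reward, critic_reward

# 占位的三方:真实系统中 generator / critic 是 LLM,validator 是外部核查器
gen = lambda p: "巴黎是法国的首都,人口约 210 万。"
crt = lambda p, r: "人口数字可能有误"
val = lambda r, flaw: False                            # 假设验证器认定该失败点不成立
print(rlac_round(gen, crt, val, "介绍一下巴黎"))        # (1.0, -1.0)
```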
Key Takeaways
- 开放式的文本生成任务涉及多个难以确定优先级且可能难以准确实现的评价标准。
点此查看论文截图
Cross-Treatment Effect Estimation for Multi-Category, Multi-Valued Causal Inference via Dynamic Neural Masking
Authors:Xiaopeng Ke, Yihan Yu, Ruyue Zhang, Zhishuo Zhou, Fangzhou Shi, Chang Men, Zhengdan Zhu
Counterfactual causal inference faces significant challenges when extended to multi-category, multi-valued treatments, where complex cross-effects between heterogeneous interventions are difficult to model. Existing methodologies remain constrained to binary or single-type treatments and suffer from restrictive assumptions, limited scalability, and inadequate evaluation frameworks for complex intervention scenarios. We present XTNet, a novel network architecture for multi-category, multi-valued treatment effect estimation. Our approach introduces a cross-effect estimation module with dynamic masking mechanisms to capture treatment interactions without restrictive structural assumptions. The architecture employs a decomposition strategy separating basic effects from cross-treatment interactions, enabling efficient modeling of combinatorial treatment spaces. We also propose MCMV-AUCC, a suitable evaluation metric that accounts for treatment costs and interaction effects. Extensive experiments on synthetic and real-world datasets demonstrate that XTNet consistently outperforms state-of-the-art baselines in both ranking accuracy and effect estimation quality. The results of the real-world A/B test further confirm its effectiveness.
在扩展到多类别、多值处理时,反事实因果推断面临重大挑战:异质干预之间复杂的交叉效应难以建模。现有方法仍局限于二元或单一类型的处理,并受制于限制性的假设、有限的可扩展性,以及针对复杂干预场景的评估框架不足。我们提出了XTNet,一种用于多类别、多值处理效应估计的新型网络架构。我们的方法引入了带动态掩码机制的交叉效应估计模块,可以在没有限制性结构假设的情况下捕获处理间的交互。该架构采用分解策略,将基本效应与跨处理交互分离,从而高效地对组合式处理空间建模。我们还提出了MCMV-AUCC,这是一个兼顾处理成本与交互效应的合适评估指标。在合成数据和真实世界数据集上的大量实验表明,XTNet在排序准确性和效应估计质量方面都始终优于最新基线。真实世界A/B测试的结果进一步证实了其有效性。
论文及项目相关链接
Summary
在多类别、多值处理情况下,现有的反事实因果推断方法面临诸多挑战,建模复杂的交叉效应存在困难。本文提出XTNet网络架构,用于多类别、多值处理效应估计。该架构引入交叉效应估计模块和动态掩码机制,捕捉处理交互作用,无需严格的假设。通过分解策略分离基本效应和交叉处理交互作用,有效建模组合处理空间。在合成和真实数据集上的实验表明,XTNet在排名准确性和效果估计质量方面均优于现有方法。真实世界的A/B测试结果进一步证实了其有效性。
Key Takeaways
- 多类别、多值处理的反事实因果推断面临挑战,建模复杂交叉效应困难。
- XTNet网络架构用于多类别、多值处理效应估计。
- XTNet引入交叉效应估计模块和动态掩码机制,无需严格假设。
- XTNet通过分解策略有效建模组合处理空间。
- XTNet在合成和真实数据集上的实验表现优于现有方法。
- XTNet在排名准确性和效果估计质量方面表现出色。
点此查看论文截图
Prompt Injection as an Emerging Threat: Evaluating the Resilience of Large Language Models
Authors:Daniyal Ganiuly, Assel Smaiyl
Large Language Models (LLMs) are increasingly used in intelligent systems that perform reasoning, summarization, and code generation. Their ability to follow natural-language instructions, while powerful, also makes them vulnerable to a new class of attacks known as prompt injection. In these attacks, hidden or malicious instructions are inserted into user inputs or external content, causing the model to ignore its intended task or produce unsafe responses. This study proposes a unified framework for evaluating how resistant Large Language Models (LLMs) are to prompt injection attacks. The framework defines three complementary metrics such as the Resilience Degradation Index (RDI), Safety Compliance Coefficient (SCC), and Instructional Integrity Metric (IIM) to jointly measure robustness, safety, and semantic stability. We evaluated four instruction-tuned models (GPT-4, GPT-4o, LLaMA-3 8B Instruct, and Flan-T5-Large) on five common language tasks: question answering, summarization, translation, reasoning, and code generation. Results show that GPT-4 performs best overall, while open-weight models remain more vulnerable. The findings highlight that strong alignment and safety tuning are more important for resilience than model size alone. Results show that all models remain partially vulnerable, especially to indirect and direct-override attacks. GPT-4 achieved the best overall resilience (RDR = 9.8 %, SCR = 96.4 %), while open-source models exhibited higher performance degradation and lower safety scores. The findings demonstrate that alignment strength and safety tuning play a greater role in resilience than model size alone. The proposed framework offers a structured, reproducible approach for assessing model robustness and provides practical insights for improving LLM safety and reliability.
大型语言模型(LLM)在智能系统中越来越广泛地应用于推理、摘要和代码生成。它们遵循自然语言指令的能力虽然强大,但也使它们容易受到一种名为提示注入的新攻击的影响。在这些攻击中,隐藏或恶意的指令被插入用户输入或外部内容中,导致模型忽略其预定任务或产生不安全的响应。本研究提出了一个统一的框架来评估大型语言模型(LLM)对提示注入攻击的抵抗力。该框架定义了三个互补的度量标准,即韧性降级指数(RDI)、安全合规系数(SCC)和指令完整性度量(IIM),以共同衡量稳健性、安全性和语义稳定性。我们在五个常见的语言任务上评估了四个指令调整模型(GPT-4、GPT-4o、LLaMA-3 8B Instruct和Flan-T5-Large):问答、摘要、翻译、推理和代码生成。结果表明,GPT-4总体上表现最佳,而开放权重模型仍然更容易受到攻击。结果表明,强对齐和安全调整对于韧性比单纯模型大小更重要。结果还表明,所有模型仍然部分脆弱,尤其是间接和直接覆盖攻击。GPT-4总体韧性最佳(RDR=9.8%,SCR=96.4%),而开源模型表现出更高的性能下降和较低的安全分数。结果表明,对齐强度和安全调整在韧性方面比单纯的模型大小更重要。所提出的框架提供了一个结构化、可重复的方法来评估模型的稳健性,并为提高LLM的安全性和可靠性提供了实际见解。
论文及项目相关链接
PDF 10 pages, 6 figures
Summary
大型语言模型(LLMs)在智能系统中的应用越来越广泛,如推理、摘要和代码生成等。然而,它们遵循自然语言指令的能力也使其面临一种新的攻击方式——提示注入。攻击者会在用户输入或外部内容中隐藏或插入恶意指令,导致模型忽略其目标任务或产生不安全响应。本研究提出一个统一的框架,用于评估大型语言模型(LLMs)对提示注入攻击的抵抗力。该框架定义了三个互补指标,即恢复力降低指数(RDI)、安全合规系数(SCC)和指令完整性指标(IIM),以联合测量稳健性、安全性和语义稳定性。在五种常见的语言任务上评估了四个指令调优模型,结果显示GPT-4总体表现最佳,而开放权重模型仍然更容易受到攻击。研究结果表明,对于恢复力而言,强对齐和安全调整比单纯模型大小更重要。所有模型仍然部分易受攻击,尤其是间接和直接覆盖攻击。GPT-4在恢复力和安全性方面表现最佳。研究提出的框架为评估模型稳健性提供了结构化和可重复的方法,并为提高LLM的安全性和可靠性提供了实际见解。
Key Takeaways
- 大型语言模型(LLMs)在智能系统中广泛应用,但面临提示注入攻击的风险。
- 提示注入攻击通过插入隐藏或恶意指令,使模型忽略目标任务或产生不安全响应。
- 研究提出一个统一的框架来评估LLMs对提示注入的抵抗力,包括三个互补指标:RDI、SCC、IIM。
- 在五种语言任务上评估了四个指令调优模型,GPT-4表现最佳。
- 开放权重模型仍然更容易受到攻击,强对齐和安全调整比模型大小更重要。
- 所有模型对间接和直接覆盖攻击仍然部分易受攻击。
点此查看论文截图