⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
2025-11-11 更新
Visual Spatial Tuning
Authors:Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao
Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. It turns out that Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.
从视觉输入捕捉空间关系是类人通用智能的核心。之前的一些研究试图通过添加额外的专业编码器来提高视觉语言模型(VLMs)的空间意识,这带来了额外的开销并通常会损害其通用能力。为了增强通用架构中的空间能力,我们引入了视觉空间调优(VST)这一全面框架,旨在培养具有类人视觉空间能力的VLMs,涵盖从空间感知到推理的各个方面。我们首先尝试通过构建大规模数据集VST-P增强VLMs的空间感知能力,该数据集包含410万样本,涵盖单视图、多图像和视频中的19项技能。接下来,我们展示了VST-R数据集,该数据集包含13.5万样本,用于指导模型进行空间推理。特别是,我们采用了一种渐进式的训练流程:先进行有监督微调以建立基础空间知识,然后采用强化学习进一步提高空间推理能力。在不损害通用能力的情况下,所提出的VST在多个空间基准测试上始终达到了最新水平的结果,包括MMSI-Bench上的34.8%和VSIBench上的61.2%。事实证明,通过提出的空间调优范式,视觉语言动作模型可以显著增强,为更物理化的AI铺平了道路。
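The abstract describes a two-stage recipe (supervised fine-tuning for spatial knowledge, then reinforcement learning for spatial reasoning) but does not spell out the reward used in the RL stage. Purely as an illustration, the sketch below shows one plausible rule-based, verifiable reward for spatial QA: exact match for categorical answers and a relative tolerance for metric answers. The tolerance, the number-parsing rule, and the function name are assumptions, not the paper's design.

```python
import re

def spatial_reward(prediction: str, reference: str, rel_tol: float = 0.10) -> float:
    """Hypothetical verifiable reward for spatial QA: tolerance-based match for
    metric answers (e.g. distances in meters), exact match for categorical ones."""
    pred_nums = re.findall(r"-?\d+\.?\d*", prediction)
    ref_nums = re.findall(r"-?\d+\.?\d*", reference)
    if pred_nums and ref_nums:
        p, r = float(pred_nums[-1]), float(ref_nums[-1])
        return 1.0 if abs(p - r) <= rel_tol * max(abs(r), 1e-6) else 0.0
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

print(spatial_reward("The chair is about 2.1 m away", "2.0"))  # 1.0 (within 10%)
print(spatial_reward("left of the sofa", "right"))             # 0.0
```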
论文及项目相关链接
Summary
视觉空间感知和推理是人类智能的重要组成部分。为提高视觉语言模型的空间能力,研究者提出了一种名为视觉空间调优(VST)的综合框架,通过构建大规模数据集VST-P和VST-R,以及采用监督精细调整和强化学习的渐进训练管道,增强了模型的空间感知和推理能力。该框架在多个空间基准测试中实现了最新结果,证明了其在提高视觉语言模型空间能力方面的有效性。
Key Takeaways
- 视觉空间关系处理是人类智能的核心组成部分。
- 为增强视觉语言模型(VLMs)的空间能力,引入了视觉空间调优(VST)框架。
- VST通过构建大规模数据集VST-P和VST-R来培养VLMs的空间感知和推理能力。
- VST-P数据集包含410万样本,涵盖单视图、多图像和视频的空间感知技能。
- VST-R数据集包含13.5万样本,用于培养模型的空间推理能力。
- 采用监督精细调整和强化学习的渐进训练管道来提高模型的空间能力。
TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
Authors:Junwen Pan, Qizhe Zhang, Rui Zhang, Ming Lu, Xin Wan, Yuan Zhang, Chang Liu, Qi She
Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text-video thinking, seamlessly integrating searching video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods, such as Group Relative Policy Optimization (GRPO), to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers searched video frames from the interleaved reasoning process and utilizes the same policy model to verify the adequacy of searched frames, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to enhance task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves significant improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, as well as long-form video understanding benchmarks like VideoMME and MLVU. Notably, TimeSearch-R establishes a new state-of-the-art on LongVideoBench with 4.1% improvement over the base model Qwen2.5-VL and 2.0% over the advanced video reasoning model Video-R1. Our code is available at https://github.com/Time-Search/TimeSearch-R.
时序搜索旨在从数万帧中识别出最小的相关帧集合,作为准确理解长视频的基础。现有工作试图逐步缩小搜索范围。然而,这些方法通常依赖于手工搜索过程,缺乏端到端的优化来学习最佳搜索策略。在本文中,我们提出了TimeSearch-R,它将时序搜索重新定义为交替文本-视频思考,通过强化学习(RL)无缝集成视频剪辑的搜索过程。然而,将强化学习训练方法(例如群组相对策略优化(GRPO))应用于视频推理可能会产生无人监督的中间搜索决策。这导致视频内容探索不足和逻辑推理不一致。为了解决这些问题,我们引入了带有完整性自我验证的GRPO(GRPO-CSV),它从交替的推理过程中收集已搜索的视频帧,并使用相同的策略模型来验证所搜索帧的充分性,从而提高视频推理的完整性。此外,我们构建了专门用于GRPO-CSV的SFT冷启动和RL训练的数据集,过滤掉具有弱时间依赖性的样本,以增加任务难度并提高时序搜索能力。大量实验表明,TimeSearch-R在时序搜索基准测试(如Haystack-LVBench和Haystack-Ego4D)以及长视频理解基准测试(如VideoMME和MLVU)上取得了显著改进。值得注意的是,TimeSearch-R在LongVideoBench上建立了新的最先进的性能,相较于基础模型Qwen2.5-VL提高了4.1%,相较于高级视频推理模型Video-R1提高了2.0%。我们的代码可在https://github.com/Time-Search/TimeSearch-R获得。
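GRPO, which GRPO-CSV extends, scores a group of rollouts for the same query and normalizes each reward against the group's own statistics, so no separate value model is needed. Below is a minimal sketch of that group-relative advantage computation; the reward values and the small self-verification bonus are illustrative assumptions, not the paper's exact reward design.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: z-score each rollout's reward within its group.

    rewards: shape (num_groups, group_size), one row per query,
             one column per sampled rollout for that query.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 queries, 4 rollouts each.
# Final-answer correctness (0/1) plus a hypothetical self-verification bonus
# (+0.2 when the model judges its searched frames to be sufficient).
correctness = np.array([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0, 1.0]])
verification_bonus = 0.2 * np.array([[1, 0, 1, 1],
                                     [0, 1, 1, 0]])
rewards = correctness + verification_bonus

advantages = group_relative_advantages(rewards)
print(advantages.round(3))
# Rollouts above their group's mean get positive advantages and are reinforced;
# below-mean rollouts are pushed down.
```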
论文及项目相关链接
PDF 22 pages, 17 figures. Official code: https://github.com/Time-Search/TimeSearch-R
Summary
本文介绍了TimeSearch-R模型在长视频理解中的应用。该模型将时序搜索重新表述为交替的文本-视频思考,通过强化学习(RL)把视频片段搜索无缝融入推理过程,从而提高搜索效率并优化对视频内容的探索。通过引入带完整性自我验证的GRPO(GRPO-CSV),提高了视频推理的完整性。实验结果表明,TimeSearch-R在时序搜索和长视频理解任务上均取得了显著成果。
Key Takeaways
- TimeSearch-R模型将时序搜索重新定义为文本与视频的交互思考,通过强化学习(RL)无缝集成视频剪辑的搜索过程。
- 引入GRPO与完整性自我验证(GRPO-CSV)技术,确保视频内容得到充分的探索,并改进视频推理的完整性。
- 建立了专为GRPO-CSV的SFT冷启动和RL训练设计的专用数据集,通过过滤弱时间依赖样本提高任务难度,从而提升时序搜索能力。
- TimeSearch-R模型在多个时序搜索和长视频理解基准测试中实现了显著的性能提升,包括Haystack-LVBench、Haystack-Ego4D、VideoMME和MLVU等。
PreResQ-R1: Towards Fine-Grained Rank-and-Score Reinforcement Learning for Visual Quality Assessment via Preference-Response Disentangled Policy Optimization
Authors:Zehui Feng, Tian Qiu, Tong Wu, Junxuan Li, Huayuan Xu, Ting Han
Visual Quality Assessment (QA) seeks to predict human perceptual judgments of visual fidelity. While recent multimodal large language models (MLLMs) show promise in reasoning about image and video quality, existing approaches mainly rely on supervised fine-tuning or rank-only objectives, resulting in shallow reasoning, poor score calibration, and limited cross-domain generalization. We propose PreResQ-R1, a Preference-Response Disentangled Reinforcement Learning framework that unifies absolute score regression and relative ranking consistency within a single reasoning-driven optimization scheme. Unlike prior QA methods, PreResQ-R1 introduces a dual-branch reward formulation that separately models intra-sample response coherence and inter-sample preference alignment, optimized via Group Relative Policy Optimization (GRPO). This design encourages fine-grained, stable, and interpretable chain-of-thought reasoning about perceptual quality. To extend beyond static imagery, we further design a global-temporal and local-spatial data flow strategy for Video Quality Assessment. Remarkably, with reinforcement fine-tuning on only 6K images and 28K videos, PreResQ-R1 achieves state-of-the-art results across 10 IQA and 5 VQA benchmarks under both SRCC and PLCC metrics, surpassing by margins of 5.30% and 2.15% in IQA task, respectively. Beyond quantitative gains, it produces human-aligned reasoning traces that reveal the perceptual cues underlying quality judgments. Code and model are available.
视觉质量评估(Quality Assurance,QA)旨在预测人类对视觉保真度的感知判断。虽然最近的多模态大型语言模型(Multimodal Large Language Models,MLLMs)在图像和视频质量推理方面显示出潜力,但现有方法主要依赖于有监督的微调或仅排序目标,导致推理浅显、分数校准不佳和跨域泛化能力有限。我们提出了PreResQ-R1,这是一个偏好-响应分解强化学习框架,它在一个单一的推理驱动优化方案中统一了绝对分数回归和相对排名一致性。不同于先前的QA方法,PreResQ-R1引入了一种双分支奖励公式,该公式分别建模样本内响应一致性和样本间偏好对齐,通过群体相对策略优化(Group Relative Policy Optimization,GRPO)进行优化。这种设计鼓励对感知质量进行精细、稳定、可解释的链式思维推理。为了超越静态图像,我们进一步设计了一种全局时间和局部空间的数据流策略,用于视频质量评估。值得注意的是,仅在6K图像和28K视频上进行强化微调后,PreResQ-R1在10个图像质量评估(IQA)和5个视频质量评估(VQA)基准测试中实现了最先进的成果,无论是在SRCC还是PLCC指标下均表现优异,在IQA任务中分别提高了5.30%和2.15%。除了定量增益之外,它产生与人类思维轨迹相符的推理结果,揭示了质量判断背后的感知线索。代码和模型均可用。
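The SRCC and PLCC numbers quoted above are the standard quality-assessment metrics: Spearman rank correlation for ranking consistency and Pearson correlation for linear agreement with human mean opinion scores. A minimal sketch using SciPy (with made-up scores) follows; papers often also fit a logistic mapping before computing PLCC, which is omitted here.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def qa_correlations(pred: np.ndarray, mos: np.ndarray) -> dict:
    """SRCC measures monotonic (ranking) agreement, PLCC linear agreement
    between predicted quality scores and human mean opinion scores (MOS)."""
    srcc, _ = spearmanr(pred, mos)
    plcc, _ = pearsonr(pred, mos)
    return {"SRCC": srcc, "PLCC": plcc}

# Toy scores for 6 images (values are fabricated for illustration).
predicted = np.array([0.62, 0.80, 0.31, 0.90, 0.55, 0.47])
mos       = np.array([3.1,  4.0,  1.9,  4.6,  2.8,  2.5])
print(qa_correlations(predicted, mos))
```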
论文及项目相关链接
PDF 27 pages, 14 figures, under review as a conference paper
Summary
视觉质量评估(QA)旨在预测人类对视觉保真度的感知判断。虽然最近的多模态大型语言模型(MLLMs)在图像和视频质量推理方面显示出潜力,但现有方法主要依赖于监督微调或仅排名目标,导致推理浅显、分数校准不佳和跨域泛化能力有限。我们提出PreResQ-R1,一个偏好反应分解强化学习框架,它在一个单一推理驱动的优化方案中统一了绝对分数回归和相对排名一致性。不同于以往QA方法的是,PreResQ-R1引入了一种双分支奖励公式,该公式分别建模样本内响应一致性和样本间偏好对齐,通过群体相对策略优化(GRPO)进行优化。这种设计鼓励对感知质量进行精细、稳定且可解释的链式思维推理。为了超越静态图像,我们还为视频质量评估设计了一种全局时间和局部空间数据流策略。值得注意的是,仅在6K图像和28K视频上进行强化微调后,PreResQ-R1在10个IQA和5个VQA基准测试下达到了最先进的成果,SRCC和PLCC指标分别提高了5.3%和2.15%。除了定量收益外,它产生了符合人类思维的推理轨迹,揭示了质量判断背后的感知线索。代码和模型可用。
Key Takeaways
- 视觉质量评估(QA)旨在预测人类对视觉保真度的感知判断。
- 当前的多模态语言模型在视觉质量推理方面存在局限性,如依赖监督微调、排名目标导致浅推理等。
- PreResQ-R1框架通过统一绝对分数回归和相对排名一致性来提高视觉质量评估的性能。
- PreResQ-R1采用双分支奖励公式建模样本内响应一致性和样本间偏好对齐。
- 通过强化学习进行微调优化,PreResQ-R1在图像和视频质量评估基准测试中取得显著成果。
- PreResQ-R1不仅提高了定量性能,还提供了揭示感知线索的人类对齐推理轨迹。
- PreResQ-R1的代码和模型可供公众使用。
TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework
Authors:Chao Zhang, Yuhao Wang, Derong Xu, Haoxin Zhang, Yuanjie Lyu, Yuhao Chen, Shuochen Liu, Tong Xu, Xiangyu Zhao, Yan Gao, Yao Hu, Enhong Chen
Retrieval-Augmented Generation (RAG) utilizes external knowledge to augment Large Language Models’ (LLMs) reliability. For flexibility, agentic RAG employs autonomous, multi-round retrieval and reasoning to resolve queries. Although recent agentic RAG has improved via reinforcement learning, they often incur substantial token overhead from search and reasoning processes. This trade-off prioritizes accuracy over efficiency. To address this issue, this work proposes TeaRAG, a token-efficient agentic RAG framework capable of compressing both retrieval content and reasoning steps. 1) First, the retrieved content is compressed by augmenting chunk-based semantic retrieval with a graph retrieval using concise triplets. A knowledge association graph is then built from semantic similarity and co-occurrence. Finally, Personalized PageRank is leveraged to highlight key knowledge within this graph, reducing the number of tokens per retrieval. 2) Besides, to reduce reasoning steps, Iterative Process-aware Direct Preference Optimization (IP-DPO) is proposed. Specifically, our reward function evaluates the knowledge sufficiency by a knowledge matching mechanism, while penalizing excessive reasoning steps. This design can produce high-quality preference-pair datasets, supporting iterative DPO to improve reasoning conciseness. Across six datasets, TeaRAG improves the average Exact Match by 4% and 2% while reducing output tokens by 61% and 59% on Llama3-8B-Instruct and Qwen2.5-14B-Instruct, respectively. Code is available at https://github.com/Applied-Machine-Learning-Lab/TeaRAG.
检索增强生成(RAG)利用外部知识增强大型语言模型(LLM)的可靠性。为了灵活性,自主型RAG采用自主、多轮检索和推理来解决查询。尽管最近的自主型RAG通过强化学习得到了改进,但它们经常因为搜索和推理过程而产生大量的令牌开销,这种权衡优先考虑准确性而不是效率。为了解决这一问题,这项工作提出了TeaRAG,一个高效的代理RAG框架,能够压缩检索内容和推理步骤。首先,通过增强基于分块的语义检索与简洁三元组的知识图谱检索,对检索到的内容进行压缩。然后,建立一个基于语义相似性和共现的知识关联图。最后,利用个性化PageRank算法突出显示该图中的重要知识,减少每次检索的令牌数量。此外,为了减少推理步骤,我们提出了迭代过程感知的直接偏好优化(IP-DPO)。具体来说,我们的奖励函数通过知识匹配机制来评估知识的充足性,同时惩罚过多的推理步骤。这种设计可以产生高质量的偏好对数据集,支持迭代DPO来提高推理简洁性。在六个数据集上,TeaRAG在Llama3-8B-Instruct和Qwen2.5-14B-Instruct上分别将平均精确匹配度提高了4%和2%,同时减少了61%和59%的输出令牌。代码可在https://github.com/Applied-Machine-Learning-Lab/TeaRAG获取。
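One concrete step in the pipeline above is running Personalized PageRank over the knowledge association graph so that only the highest-scoring triplet entities are kept in the prompt, cutting tokens per retrieval. The sketch below shows that step with NetworkX on a toy graph; the edge weights, query entities, and top-k cutoff are illustrative assumptions.

```python
import networkx as nx

# Toy knowledge-association graph over retrieved triplet entities:
# edges come from semantic similarity / co-occurrence (weights are illustrative).
G = nx.Graph()
G.add_weighted_edges_from([
    ("Marie Curie", "radium", 0.9),
    ("Marie Curie", "Nobel Prize", 0.8),
    ("radium", "polonium", 0.7),
    ("Nobel Prize", "physics", 0.5),
    ("polonium", "uranium ore", 0.4),
])

# Bias the random walk toward entities mentioned in the query.
query_entities = {"Marie Curie": 1.0, "radium": 1.0}
personalization = {n: query_entities.get(n, 0.0) for n in G.nodes}

scores = nx.pagerank(G, alpha=0.85, personalization=personalization, weight="weight")

# Keep only the top-k nodes' triplets in the prompt to reduce tokens per retrieval.
top_k = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_k)
```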
论文及项目相关链接
PDF 32 pages
Summary
本文介绍了Retrieval-Augmented Generation(RAG)如何利用外部知识增强大型语言模型(LLM)的可靠性。为增强灵活性,agentic RAG采用自主、多轮检索和推理来解决查询。尽管最近的agentic RAG通过强化学习得到了改进,但它们从搜索和推理过程中产生了大量的令牌开销,这种权衡优先考虑了准确性而不是效率。为解决这一问题,本文提出了TeaRAG,一个高效的agentic RAG框架,能够压缩检索内容和推理步骤。通过基于块的语义检索与图形检索相结合,以及构建知识关联图和使用个性化PageRank来突出关键知识,从而压缩检索内容。此外,还提出了迭代过程感知的直接偏好优化(IP-DPO)来减少推理步骤。TeaRAG在六个数据集上的实验结果显示,它在提高精确匹配率的同时,减少了输出令牌数量。
Key Takeaways
- Retrieval-Augmented Generation (RAG) 利用外部知识增强LLM的可靠性。
- Agentic RAG采用自主、多轮检索和推理解决查询,以增强灵活性。
- 现有agentic RAG在强化学习中虽有所提升,但在搜索和推理过程中产生大量令牌开销。
- TeaRAG框架旨在解决这一效率问题,通过压缩检索内容和推理步骤实现更高效的RAG。
- TeaRAG结合了基于块的语义检索与图形检索,并利用知识关联图和个性化PageRank来突出关键知识。
- 提出迭代过程感知的直接偏好优化(IP-DPO)以减少推理步骤。
- 实验结果显示,TeaRAG在提升精确匹配率的同时减少了输出令牌数量。
Reasoning Is All You Need for Urban Planning AI
Authors:Sijie Yang, Jiatong Li, Filip Biljecki
AI has proven highly successful at urban planning analysis – learning patterns from data to predict future conditions. The next frontier is AI-assisted decision-making: agents that recommend sites, allocate resources, and evaluate trade-offs while reasoning transparently about constraints and stakeholder values. Recent breakthroughs in reasoning AI – CoT prompting, ReAct, and multi-agent collaboration frameworks – now make this vision achievable. This position paper presents the Agentic Urban Planning AI Framework for reasoning-capable planning agents that integrates three cognitive layers (Perception, Foundation, Reasoning) with six logic components (Analysis, Generation, Verification, Evaluation, Collaboration, Decision) through a multi-agents collaboration framework. We demonstrate why planning decisions require explicit reasoning capabilities that are value-based (applying normative principles), rule-grounded (guaranteeing constraint satisfaction), and explainable (generating transparent justifications) – requirements that statistical learning alone cannot fulfill. We compare reasoning agents with statistical learning, present a comprehensive architecture with benchmark evaluation metrics, and outline critical research challenges. This framework shows how AI agents can augment human planners by systematically exploring solution spaces, verifying regulatory compliance, and deliberating over trade-offs transparently – not replacing human judgment but amplifying it with computational reasoning capabilities.
人工智能在城市规划分析方面取得了巨大成功——从数据中学习模式以预测未来情况。下一个前沿领域是人工智能辅助决策:代理可以推荐地点、分配资源,并在对约束和利益相关者价值进行透明推理的同时评估权衡取舍。最近,推理型人工智能的突破——思维链(CoT)提示、ReAct和多代理协作框架——使这一愿景成为可能。这篇立场论文提出了具有推理能力的规划代理的城市规划人工智能框架,该框架集成了三个认知层次(感知、基础、推理)和六个逻辑组件(分析、生成、验证、评估、协作、决策),通过多代理协作框架实现。我们说明了为什么规划决策需要明确的推理能力,这些能力是基于价值的(应用规范原则)、基于规则的(保证约束满足)和可解释的(产生透明的正当理由)——这些要求统计学习是无法单独实现的。我们将推理代理与统计学习进行了比较,提出了全面的架构和基准评估指标,并概述了关键的研究挑战。该框架展示了人工智能代理如何增强人类规划者的能力,通过系统地探索解决方案空间、验证法规合规性以及透明地权衡利弊——并不是取代人类判断,而是用计算推理能力来增强人类判断。
论文及项目相关链接
PDF Submitted to AAAI 2026 Workshop AI4UP
Summary
人工智能在城市规划分析方面取得了巨大成功,通过数据学习模式来预测未来状况。下一阶段是AI辅助决策制定,智能体会推荐地点、分配资源,并评估权衡,同时透明地推理约束和利益相关者价值。最新的突破如CoT提示、ReAct和多智能体协作框架使得这一愿景成为可能。本立场文件提出了具有推理能力的城市规划AI智能体框架,融合了三个认知层(感知、基础、推理)和六个逻辑组件(分析、生成、验证、评估、协作、决策),通过多智能体协作框架实现。我们阐述了规划决策为何需要明确的推理能力,这些能力是基于价值(应用规范原则)、基于规则(保证约束满足)和可解释的(产生透明理由),这些是仅通过统计学习无法实现的。我们比较了推理智能体与统计学习,给出了一个综合架构和基准评估指标,并概述了关键研究挑战。该框架展示了AI智能体如何增强人类规划者的能力,通过系统地探索解决方案空间、验证法规合规性以及透明地权衡得失,而非取代人类判断,而是用计算推理能力来增强。
Key Takeaways
- AI在城市规划分析领域已经展现出巨大成功,并预测未来趋势。
- AI辅助决策制定是下一阶段的重要发展方向,智能体具备推荐地点、分配资源和评估权衡的能力。
- 最近的突破如CoT提示、ReAct和多智能体协作框架为实现这一愿景提供了技术支持。
- 规划决策需要基于价值、规则和可解释的明确推理能力,这些是仅通过统计学习无法实现的。
- 推理智能体与统计学习有重要区别,前者能够更深入地处理复杂的规划决策任务。
- Agentic城市规划AI框架融合了认知层和逻辑组件,提供了多智能体协作的机会。
- AI智能体可增强人类规划者的能力,通过探索解决方案空间、验证法规合规性以及透明地权衡得失来辅助决策。
Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models
Authors:Teqi Hao, Xioayu Tan, Shaojie Shi, Yinghui Xu, Xihe Qiu
The personalization of black-box large language models (LLMs) is a critical yet challenging task. Existing approaches predominantly rely on context injection, where user history is embedded into the prompt to directly guide the generation process. However, this single-step paradigm imposes a dual burden on the model: generating accurate content while simultaneously aligning with user-specific styles. This often results in a trade-off that compromises output quality and limits precise control. To address this fundamental tension, we propose Reflective Personalization Optimization (RPO), a novel framework that redefines the personalization paradigm by decoupling content generation from alignment. RPO operates in two distinct stages: first, a base model generates a high-quality, generic response; then, an external reflection module explicitly rewrites this output to align with the user’s preferences. This reflection module is trained using a two-stage process. Initially, supervised fine-tuning is employed on structured rewriting trajectories to establish a core personalized reasoning policy that models the transformation from generic to user-aligned responses. Subsequently, reinforcement learning is applied to further refine and enhance the quality of the personalized outputs. Comprehensive experiments on the LaMP benchmark demonstrate that RPO, by decoupling content generation from personalization, significantly outperforms state-of-the-art baselines. These findings underscore the superiority of explicit response shaping over implicit context injection. Moreover, RPO introduces an efficient, model-agnostic personalization layer that can be seamlessly integrated with any underlying base model, paving the way for a new and effective direction in user-centric generation scenarios.
大型语言模型(LLM)的个性化是一个至关重要但具有挑战性的任务。现有方法主要依赖于上下文注入,即将用户历史嵌入到提示中以直接引导生成过程。然而,这种单步范式对模型施加了双重负担:既要生成准确的内容,又要同时与用户特定的风格保持一致。这通常会导致在输出质量与精确控制之间做出妥协。为了解决这种基本紧张关系,我们提出了反射个性化优化(RPO),这是一种通过解耦内容生成与对齐来重新定义个性化范式的新框架。RPO分为两个独立阶段运行:首先,基础模型生成高质量、通用的响应;然后,外部反射模块显式地重写输出,以符合用户的偏好。该反射模块采用两阶段过程进行训练:首先在结构化的重写轨迹上进行监督微调,以建立核心个性化推理策略,对从通用响应到用户对齐响应的转换进行建模;随后应用强化学习进一步优化和提高个性化输出的质量。在LaMP基准测试上的综合实验表明,RPO通过解耦内容生成与个性化,显著优于现有基线。这些发现强调了显式响应塑造优于隐式上下文注入。此外,RPO引入了一个高效、与模型无关的个性化层,可以无缝集成到任何基础模型中,为以用户为中心的生成场景开辟了新的有效方向。
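The decoupling described above is easy to picture as a two-call pipeline: the base model drafts a generic answer, then a separate reflection module rewrites it against the user's history. The sketch below is a hedged mock-up of that flow with generic callables standing in for real models; the prompt wording and the history window are assumptions, not the authors' implementation.

```python
from typing import Callable, List

# Hypothetical interface: any chat LLM wrapped as "prompt in, text out".
LLM = Callable[[str], str]

def rpo_respond(query: str, user_history: List[str],
                base_model: LLM, reflection_module: LLM) -> str:
    """Decoupled personalization, as a sketch (not the authors' code):
    1) the base model answers the query generically;
    2) a separate reflection module rewrites that answer to match the user's style,
       conditioned on the generic draft plus a snippet of user history."""
    generic = base_model(f"Answer the question:\n{query}")

    profile = "\n".join(user_history[-5:])  # last few user documents as style evidence
    rewrite_prompt = (
        "Rewrite the draft so it matches this user's preferences and style.\n"
        f"User history:\n{profile}\n\nDraft answer:\n{generic}\n\nRewritten answer:"
    )
    return reflection_module(rewrite_prompt)

# Stand-in models so the sketch runs without any API access.
base = lambda p: "Generic draft: the movie is a well-paced thriller."
reflector = lambda p: "Personalized: a tight, spoiler-free thriller, right up your alley."
print(rpo_respond("Review this movie for me.", ["likes concise reviews"], base, reflector))
```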
论文及项目相关链接
Summary
本文介绍了大型语言模型(LLM)个性化所面临的挑战以及现有方法的局限性。针对现有方法主要依赖语境注入的问题,提出了Reflective Personalization Optimization(RPO)框架,通过解耦内容生成与对齐过程来实现个性化。RPO分为两个阶段:首先由基础模型生成高质量通用响应,然后由外部反射模块显式地改写输出以符合用户偏好。反射模块采用两阶段训练过程,首先使用监督微调建立个性化推理策略,然后使用强化学习进一步优化个性化输出的质量。实验表明,RPO显著优于现有方法,突显了显式响应塑造相较于隐式语境注入的优势,同时为任何基础模型提供了一个高效、模型无关的个人化层。
Key Takeaways
- 大型语言模型(LLM)的个人化是一项关键且具有挑战性的任务。
- 现有方法主要通过语境注入来实现个性化,但这种方法存在输出质量和控制精度的妥协问题。
- RPO框架通过解耦内容生成和对齐过程来实现个性化,分为基础模型生成通用响应和反射模块改写输出两个阶段。
- 反射模块采用两阶段训练过程,包括监督微调和强化学习。
- RPO在LaMP基准测试上表现优异,显著优于现有方法。
- RPO强调了显式响应塑造相较于隐式语境注入的优势。
Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies
Authors:Prasoon Varshney, Makesh Narsimhan Sreedhar, Liwei Jiang, Traian Rebedea, Christopher Parisien
Large language models (LLMs) are typically aligned to a universal set of safety and usage principles intended for broad public acceptability. Yet, real-world applications of LLMs often take place within organizational ecosystems shaped by distinctive corporate policies, regulatory requirements, use cases, brand guidelines, and ethical commitments. This reality highlights the need for rigorous and comprehensive evaluation of LLMs with pluralistic alignment goals, an alignment paradigm that emphasizes adaptability to diverse user values and needs. In this work, we present PLURALISTIC BEHAVIOR SUITE (PBSUITE), a dynamic evaluation suite designed to systematically assess LLMs’ capacity to adhere to pluralistic alignment specifications in multi-turn, interactive conversations. PBSUITE consists of (1) a diverse dataset of 300 realistic LLM behavioral policies, grounded in 30 industries; and (2) a dynamic evaluation framework for stress-testing model compliance with custom behavioral specifications under adversarial conditions. Using PBSUITE, We find that leading open- and closed-source LLMs maintain robust adherence to behavioral policies in single-turn settings (less than 4% failure rates), but their compliance weakens substantially in multi-turn adversarial interactions (up to 84% failure rates). These findings highlight that existing model alignment and safety moderation methods fall short in coherently enforcing pluralistic behavioral policies in real-world LLM interactions. Our work contributes both the dataset and analytical framework to support future research toward robust and context-aware pluralistic alignment techniques.
大型语言模型(LLM)通常与一套普遍的安全性和使用原则相一致,旨在实现广泛的公众接受性。然而,LLM在现实世界中的应用往往发生在由独特的公司政策、监管要求、使用案例、品牌指南和道德承诺所塑造的组织生态系统内。这种现实情况突显了以多元化对齐为目标对LLM进行严谨而全面评估的必要性;多元化对齐是一种强调适应多样化用户价值观和需求的对齐范式。在这项工作中,我们提出了多元化行为套件(PBSUITE),这是一个动态评估套件,旨在系统地评估LLM在多轮交互式对话中遵守多元化对齐规范的能力。PBSUITE包括(1)基于30个行业的300个现实LLM行为政策的多样数据集;(2)一个动态评估框架,用于在对抗条件下对模型遵守特定行为规范进行压力测试。通过使用PBSUITE,我们发现领先的开源和闭源LLM在单轮设置中能够保持对行为政策的稳健遵循(失败率低于4%),但在多轮对抗交互中,其合规性大大减弱(失败率高达84%)。这些发现表明,现有的模型对齐和安全调节方法难以在现实世界的LLM交互中连贯地执行多元化行为政策。我们的工作既提供了数据集也提供了分析框架,以支持未来研究实现稳健和上下文感知的多元化对齐技术。
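The headline numbers above (under 4% single-turn versus up to 84% multi-turn failure rates) are simple aggregations over judged conversations. A minimal sketch of that aggregation is shown below; the record fields and toy policies are assumptions, not the benchmark's actual schema.

```python
from collections import defaultdict

# Toy adversarial-conversation logs: each record says whether the model
# violated its custom behavioral policy at some turn (field names are assumptions).
logs = [
    {"policy_id": "bank-tone", "turns": 1, "violated": False},
    {"policy_id": "bank-tone", "turns": 6, "violated": True},
    {"policy_id": "clinic-scope", "turns": 1, "violated": False},
    {"policy_id": "clinic-scope", "turns": 8, "violated": True},
    {"policy_id": "retail-brand", "turns": 1, "violated": True},
    {"policy_id": "retail-brand", "turns": 7, "violated": False},
]

def failure_rate(records):
    return sum(r["violated"] for r in records) / max(len(records), 1)

buckets = defaultdict(list)
for r in logs:
    buckets["single-turn" if r["turns"] == 1 else "multi-turn"].append(r)

for setting, records in buckets.items():
    print(f"{setting}: {failure_rate(records):.0%} policy-violation rate")
```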
论文及项目相关链接
PDF Accepted at the Multi-Turn Interactions workshop at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Summary
大型语言模型(LLMs)通常遵循一套通用的安全和使用原则,以广泛的公众接受度为设计目标。然而,在现实世界的应用中,LLMs往往需要在组织生态系统中运作,这些系统受到独特的公司政策、监管要求、应用场景、品牌准则和道德承诺的影响。因此,我们需要对LLMs进行多元化对齐目标的严谨和全面评估。本研究提出了PLURALISTIC BEHAVIOR SUITE(PBSUITE),一个动态评估套件,旨在系统地评估LLMs适应多元化用户价值观和需求的对齐规格的能力。PBSUITE包括(1)一个包含来自三十个行业的三百种实际LLM行为政策的多样数据集;(2)一个动态评估框架,用于在敌对条件下测试模型遵守定制的行为规格的能力。我们发现,在单轮设置中,领先的开源和闭源LLMs能够很好地遵守行为政策(失败率不到4%),但在多轮敌对互动中,它们的合规性会大大减弱(失败率高达84%)。这表明现有的模型对齐和安全调节方法在真实世界的LLM互动中无法有效地执行多元化行为政策。我们的工作为支持未来研究稳健和上下文感知的多元化对齐技术提供了数据集和分析框架。
Key Takeaways
- 大型语言模型(LLMs)在现实世界应用中需要在组织生态系统中运作,需要适应不同的公司政策、监管要求等。
- LLMs的评估需要强调适应多元化用户价值观和需求的对齐目标。
- PLURALISTIC BEHAVIOR SUITE(PBSUITE)是一个动态评估套件,用于评估LLMs遵守多元化对齐规格的能力。
- PBSUITE包括一个包含多个行业的实际LLM行为政策数据集和一个用于测试模型合规性的动态评估框架。
- 在单轮设置中,LLMs的合规性较高,但在多轮敌对互动中,其合规性会显著下降。
- 现有的模型对齐和安全调节方法在多轮互动中可能无法有效执行多元化行为政策。
You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models
Authors:Shuvendu Roy, Hossein Hajimirsadeghi, Mengyao Zhai, Golnoosh Samei
Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model’s pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective method for label-free RL that utilizes curriculum learning to progressively introduce harder problems during training and mask no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at https://github.com/BorealisAI/CuMa
最近的大型语言模型进展显示出无监督强化学习(RL)方法在增强推理能力方面的潜力,而无需外部监督。然而,这些无标签RL方法对于具有有限推理能力的小型基础模型的通用性仍然未被探索。在这项工作中,我们系统地研究了无标签RL方法在不同模型大小和推理能力方面的性能,从0.5B到7B参数。我们的实证分析揭示了关键局限性:无标签RL高度依赖于基础模型已有的推理能力,对于较弱的模型,其性能通常会低于基线水平。我们发现,较小的模型无法生成足够长或多样化的思维链推理,以实现有效的自我反思,而且训练数据的难度在决定成功与否方面起着至关重要的作用。为了应对这些挑战,我们提出了一种简单有效的无标签RL方法,该方法利用课程学习在训练过程中逐步引入更难的问题,并屏蔽没有多数一致答案的rollout(推理轨迹)。此外,我们还引入了一个数据整理流程来生成具有预定难度的样本。我们的方法在所有模型大小和推理能力方面都表现出了一致的改进,为更稳健的无监督RL提供了途径,该RL可以在资源受限的模型中引导推理能力。我们的代码可在https://github.com/BorealisAI/CuMa上获得。
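Two of the proposed ingredients are concrete enough to sketch: masking rollout groups that reach no majority answer (so self-consistency pseudo-labels are only trusted when they exist) and ordering training problems from easy to hard. The snippet below illustrates both on toy data; the vote threshold and the solve-rate difficulty proxy are assumptions, not the paper's exact settings.

```python
from collections import Counter
from typing import List, Optional

def majority_pseudo_label(answers: List[str], min_votes: int = 3) -> Optional[str]:
    """Label-free RL relies on self-consistency: return the majority answer among a
    group of rollouts, or None when no answer reaches min_votes. No-majority groups
    are masked out of the policy update."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes >= min_votes else None

# Toy rollout groups for two problems (answers extracted from chains of thought).
groups = [
    ["42", "42", "41", "42", "7"],      # clear majority -> usable pseudo-label
    ["13", "27", "9", "13", "81"],      # no majority    -> masked (skip the update)
]
for g in groups:
    print(majority_pseudo_label(g))

# Curriculum: order training problems from easy to hard using a difficulty proxy,
# e.g. the base model's empirical solve rate (values here are illustrative).
problems = [("p1", 0.9), ("p2", 0.35), ("p3", 0.6)]   # (id, solve_rate)
curriculum = [pid for pid, rate in sorted(problems, key=lambda x: -x[1])]
print(curriculum)  # easiest first: ['p1', 'p3', 'p2']
```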
论文及项目相关链接
PDF 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: MATH-AI
Summary
近期大型语言模型的发展展现了无监督强化学习在提升推理能力方面的潜力,但对于较小基础模型的通用性尚未探索。本研究系统地研究了无监督强化学习在不同规模和推理能力的模型中的应用,发现其严重依赖于基础模型的预存推理能力,对较弱模型的性能常低于基线水平。为解决挑战,研究提出了利用课程学习逐步引入难题和训练时掩盖无多数策略滚动的简单有效方法,并引入数据管道生成预设难度的样本。此方法在所有模型和推理能力上均表现出一致的提升。
Key Takeaways
- 无监督强化学习在提升推理能力方面展现出潜力,但对较小基础模型的通用性仍需探索。
- 无监督强化学习的性能严重依赖于基础模型的预存推理能力。
- 较小模型在生成足够长或多样化的思维链方面存在困难,影响自我反思能力。
- 训练数据的难度对无监督强化学习的成功起到关键作用。
- 研究提出了利用课程学习的方法,逐步引入难题,改进了无监督强化学习。
- 引入数据管道生成预设难度的样本有助于提高模型性能。
Real-Time Reasoning Agents in Evolving Environments
Authors:Yule Wen, Yixin Ye, Yanzhe Zhang, Diyi Yang, Hao Zhu
Agents in the real world must make not only logical but also timely judgments. This requires continuous awareness of the dynamic environment: hazards emerge, opportunities arise, and other agents act, while the agent’s reasoning is still unfolding. Despite advances in language model reasoning, existing approaches fail to account for this dynamic nature. We introduce real-time reasoning as a new problem formulation for agents in evolving environments and build Real-Time Reasoning Gym to demonstrate it. We study two paradigms for deploying language models in agents: (1) reactive agents, which employ language models with bounded reasoning computation for rapid responses, and (2) planning agents, which allow extended reasoning computation for complex problems. Our experiments show that even state-of-the-art models struggle with making logical and timely judgments in either paradigm. To address this limitation, we propose AgileThinker, which simultaneously engages both reasoning paradigms. AgileThinker consistently outperforms agents engaging only one reasoning paradigm as the task difficulty and time pressure rise, effectively balancing reasoning depth and response latency. Our work establishes real-time reasoning as a critical testbed for developing practical agents and provides a foundation for research in temporally constrained AI systems, highlighting a path toward real-time capable agents.
现实世界中的智能体必须做出不仅逻辑上正确而且及时的判断。这需要对动态环境保持持续感知:在智能体的推理尚未完成时,危险会出现、机会会涌现、其他智能体也在行动。尽管语言模型推理取得了进展,但现有方法并未考虑这种动态特性。我们引入实时推理作为不断发展的环境中智能体的新问题表述,并构建了Real-Time Reasoning Gym来加以演示。我们研究了在智能体中部署语言模型的两种范式:(1)反应式智能体,它采用具有有限推理计算的语言模型进行快速响应;(2)规划式智能体,它允许进行扩展的推理计算以解决复杂问题。我们的实验表明,在这两种范式下,即使是最先进的模型也难以同时做出合乎逻辑且及时的判断。为了解决这一局限性,我们提出了AgileThinker,它同时采用这两种推理范式。随着任务难度和时间压力的增加,AgileThinker始终优于仅采用单一推理范式的智能体,在推理深度和响应延迟之间达到了有效的平衡。我们的工作建立了实时推理作为开发实用智能体的关键测试平台,为时间受限的AI系统研究提供了基础,为开发具备实时能力的智能体指明了道路。
论文及项目相关链接
PDF 30 pages
Summary:现实世界中的智能体需要做出既逻辑又及时的判断,要求持续感知动态环境。针对不断发展的环境,我们引入实时推理作为新的智能体问题表述形式,并建立实时推理gym进行展示。研究发现即使最先进的模型在两个范式中也存在逻辑和及时判断困难的问题。为解决这个问题,我们提出AgileThinker,能同时采用两种推理范式,在任务难度和时间压力增大时表现优异,有效平衡推理深度和响应延迟。这项工作为开发实用智能体建立了实时推理的重要测试平台,并为时间约束AI系统的研究奠定了基础。
Key Takeaways:
- 智能体需要适应动态环境并做出逻辑及时的判断。
- 引入实时推理作为新的智能体问题表述形式。
- 建立Real-Time Reasoning Gym以展示实时推理。
- 研究发现现有语言模型在智能体中存在逻辑和及时判断困难的问题。
- 提出AgileThinker能同时采用两种推理范式。
- AgileThinker在任务难度和时间压力增大时表现优异。
A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification
Authors:Sebastian Ojeda, Rafael Velasquez, Nicolás Aparicio, Juanita Puentes, Paula Cárdenas, Nicolás Andrade, Gabriel González, Sergio Rincón, Carolina Muñoz-Camargo, Pablo Arbeláez
Antimicrobial peptides have emerged as promising molecules to combat antimicrobial resistance. However, fragmented datasets, inconsistent annotations, and the lack of standardized benchmarks hinder computational approaches and slow down the discovery of new candidates. To address these challenges, we present the Expanded Standardized Collection for Antimicrobial Peptide Evaluation (ESCAPE), an experimental framework integrating over 80,000 peptides from 27 validated repositories. Our dataset separates antimicrobial peptides from negative sequences and incorporates their functional annotations into a biologically coherent multilabel hierarchy, capturing activities across antibacterial, antifungal, antiviral, and antiparasitic classes. Building on ESCAPE, we propose a transformer-based model that leverages sequence and structural information to predict multiple functional activities of peptides. Our method achieves up to a 2.56% relative average improvement in mean Average Precision over the second-best method adapted for this task, establishing a new state-of-the-art in multilabel peptide classification. ESCAPE provides a comprehensive and reproducible evaluation framework to advance AI-driven antimicrobial peptide research.
抗菌肽作为对抗抗菌素耐药性的有前途的分子已经崭露头角。然而,数据片段化、注释不一致以及缺乏标准化基准测试阻碍了计算方法和新候选物的发现。为了应对这些挑战,我们推出了用于抗菌肽评估的扩展标准化集合(ESCAPE),这是一个实验性框架,整合了来自27个验证存储库的8万多个肽。我们的数据集将抗菌肽与阴性序列区分开,并将其功能注释纳入生物学连贯的多标签层次结构中,涵盖抗菌、抗真菌、抗病毒和抗寄生虫类别的活动。基于ESCAPE,我们提出了一种基于transformer的模型,该模型利用序列和结构信息来预测肽的多种功能活动。我们的方法在平均精度上相对于第二好的方法实现了高达2.56%的相对平均改进,建立了新的最先进的多肽标签分类。ESCAPE提供了一个全面且可重复的评价框架,以推动人工智能驱动的抗菌肽研究发展。
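The headline metric above, mean Average Precision over activity classes, can be computed directly with scikit-learn. Below is a minimal sketch on a toy multilabel matrix (antibacterial, antifungal, antiviral, antiparasitic); the scores are fabricated for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy multilabel setup: 4 peptides x 4 activity classes
# (antibacterial, antifungal, antiviral, antiparasitic). Values are illustrative.
y_true = np.array([[1, 0, 1, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1],
                   [0, 1, 0, 0]])
y_score = np.array([[0.9, 0.2, 0.7, 0.1],
                    [0.8, 0.6, 0.3, 0.2],
                    [0.2, 0.1, 0.8, 0.6],
                    [0.3, 0.7, 0.2, 0.4]])

# Average Precision per class, then the mean over classes (mAP),
# the headline metric reported for ESCAPE.
per_class_ap = average_precision_score(y_true, y_score, average=None)
print({"per-class AP": per_class_ap.round(3), "mAP": per_class_ap.mean().round(3)})
```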
论文及项目相关链接
PDF 39th Conference on Neural Information Processing Systems (NeurIPS 2025). Camera-ready version. Code: https://github.com/BCV-Uniandes/ESCAPE. Dataset DOI: https://doi.org/10.7910/DVN/C69MCD
Summary
抗菌肽在抗击抗微生物耐药性方面展现出巨大潜力。然而,为解决数据碎片化、注释不一致以及缺乏标准化基准等问题,我们推出了扩展的抗菌肽评估标准化集合(ESCAPE)。该实验框架整合了超过8万个来自27个验证库的肽。我们的数据集将抗菌肽与阴性序列区分开,并将其功能注释融入生物学连贯的多标签层次结构中,覆盖抗菌、抗真菌、抗病毒和驱虫类等活性。基于ESCAPE,我们提出了一个利用序列和结构信息的基于transformer的模型来预测肽的多种功能活性。我们的方法相较于第二名方法平均精度提高了最高达2.56%,建立了新的多肽多标签分类的业界标准。ESCAPE为AI驱动的抗菌肽研究提供了全面且可重复的评价框架。
Key Takeaways
- 抗菌肽在抗击抗微生物耐药性方面具有巨大潜力。
- 当前面临数据碎片化、注释不一致和缺乏标准化基准等挑战。
- 推出扩展的抗菌肽评估标准化集合(ESCAPE)以解决这些问题。
- ESCAPE整合了超过8万个来自多个源的肽,并区分了抗菌肽和阴性序列。
- ESCAPE数据集包含功能注释,覆盖抗菌、抗真菌、抗病毒和驱虫等多种活性。
- 基于ESCAPE,提出了利用序列和结构信息的新的肽功能预测模型。
THEval. Evaluation Framework for Talking Head Video Generation
Authors:Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva
Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation primarily relies on limited metrics, evaluating general video quality, lip synchronization, and on conducting user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency, as well as alignment with human preferences. Based on this considerations, we streamline to analyze fine-grained dynamics of head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated based on a novel real dataset, that we have curated, in order to mitigate bias of training data. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset and leaderboards will be publicly released and regularly updated with new methods, in order to reflect progress in the field.
视频生成已经取得了显著的进步,生成的视频越来越逼真。然而,生成的迅速发展超出了充足评估指标的开发速度。目前,对于说话人头部生成的评估主要依赖于有限的指标,包括评价一般视频质量、嘴唇同步和用户研究。为此,我们提出了一种新的评估框架,包含与三个维度相关的8个指标:(i)质量,(ii)自然度,(iii)同步性。在选择指标时,我们强调效率以及与人类偏好的一致性。基于此,我们简化了对头部、嘴巴和眉毛的精细动作分析,以及面部质量。我们对由17种最新模型生成的85000个视频进行的广泛实验表明,虽然许多算法在嘴唇同步方面表现出色,但在生成表达力和无瑕疵的细节方面仍面临挑战。这些视频是基于我们整理的一个新型真实数据集生成的,旨在减轻训练数据的偏见。我们提出的基准框架旨在评估生成方法的改进。原始代码、数据集和排行榜将公开发布并定期更新新的方法,以反映该领域的进展。
论文及项目相关链接
Summary
视频生成技术发展迅速,但评估指标的发展滞后。针对这一问题,我们提出了一个包含8个指标的新评估框架,涉及质量、自然度和同步性三个维度。我们强调评估指标的效率和与人类偏好的一致性。通过对大量视频的分析,我们发现现有模型在嘴唇同步方面表现出色,但在表达力和细节方面存在挑战。我们的数据集旨在减少训练数据的偏见,评估框架旨在评估生成方法的改进。我们将公开原始代码、数据集和排行榜并定期更新新方法以反映该领域的进展。
Key Takeaways
- 视频生成技术取得显著进展,但与真实视频相似度不断提高的同时,评估指标的发展却相对滞后。
- 当前对头部生成视频的评价主要依赖于有限的指标,如视频质量、嘴唇同步和用户研究等。
- 论文提出了一种新的评价框架,包括8个与三个维度(质量、自然度、同步性)相关的指标。
- 在选择评价指标时,强调效率和对人类偏好的一致性。
- 通过分析大量视频发现,现有模型在嘴唇同步方面表现良好,但在表达力和细节方面存在挑战。
- 该研究使用了一个新收集的真实数据集来减少训练数据的偏见问题。
Monitor-Generate-Verify (MGV): Formalising Metacognitive Theory for Language Model Reasoning
Authors:Nick Oh, Fernand Gobet
Test-time reasoning architectures such as those following the Generate-Verify paradigm – where a model iteratively refines or verifies its own generated outputs – prioritise generation and verification but exclude the monitoring processes that determine when and how reasoning should begin. This omission may contribute to the prefix dominance trap, in which models commit early to suboptimal reasoning paths and seldom recover, yielding roughly 20% accuracy loss. We address this architectural gap by formalising Flavell’s and Nelson and Narens’ metacognitive theories into computational specifications, proposing the Monitor-Generate-Verify (MGV) framework. MGV extends the Generate-Verify paradigm by adding explicit monitoring that captures metacognitive experiences (from difficulty assessments to confidence judgements) before generation begins and refines future monitoring through verification feedback. Though we present no empirical validation, this work provides the first systematic computational translation of foundational metacognitive theories, offering a principled vocabulary for understanding reasoning system failures and suggesting specific architectural interventions for future test-time reasoning designs.
在遵循生成-验证范式的测试时间推理架构中,模型会迭代地完善或验证其自身生成的输出,这种架构虽然重视生成和验证,但忽略了确定何时以及如何开始推理的监控过程。这种遗漏可能导致前缀主导陷阱,即模型过早地陷入非最优推理路径,并且很少恢复,导致大约20%的准确率损失。我们通过将弗拉维尔以及纳尔逊和纳伦斯的元认知理论形式化为计算规范来解决这一架构差距,提出监视-生成-验证(MGV)框架。MGV通过添加明确的监控来扩展生成-验证范式,在生成开始之前捕获元认知体验(从难度评估到信心判断),并通过验证反馈来完善未来的监控。尽管我们没有提供实证验证,但这项工作提供了对基础元认知理论的首个系统性计算翻译,为理解推理系统失败提供了有原则的词汇,并为未来的测试时间推理设计提出了具体的架构干预措施。
论文及项目相关链接
PDF To-be presented at the Workshop on the Foundations of Reasoning in Language Models at NeurIPS 2025 (non-archival)
Summary:生成验证范式下的测试时间推理架构通过迭代完善或验证模型的生成输出来实现推理,但它忽视了决定何时如何进行推理的监控过程。本文解决了这一问题,结合弗拉维尔、纳尔逊和纳伦斯的元认知理论提出监控生成验证(MGV)框架,通过在生成之前捕捉元认知体验,并在验证反馈中不断完善未来的监控,解决了在生成验证范式下可能会导致的早期承诺推理路径的错误及大约百分之二十的准确性损失。尽管缺乏实证研究,但本文首次系统地翻译了基础元认知理论,为理解推理系统的失败提供了原则性的词汇,并为未来的测试时间推理设计提出了具体的架构干预措施。
Key Takeaways:
- 测试时间推理架构中的生成验证范式忽略了监控过程,可能导致早期承诺推理路径的错误。
- 这种忽略可能会引发前缀主导陷阱,使模型陷入不佳的推理路径并难以恢复,导致约20%的准确性损失。
- 为了解决这一问题,提出了监控生成验证(MGV)框架。
- MGV框架扩展了生成验证范式,通过明确的监控捕捉元认知体验(如难度评估和信心判断),并在生成之前进行完善。
- MGV框架还通过验证反馈来完善未来的监控。
- 本文将元认知理论转化为计算规范,为理解推理系统的失败提供了原则性的词汇。
Med-Banana-50K: A Cross-modality Large-Scale Dataset for Text-guided Medical Image Editing
Authors:Zhihui Chen, Mengling Feng
Medical image editing has emerged as a pivotal technology with broad applications in data augmentation, model interpretability, medical education, and treatment simulation. However, the lack of large-scale, high-quality, and openly accessible datasets tailored for medical contexts with strict anatomical and clinical constraints has significantly hindered progress in this domain. To bridge this gap, we introduce Med-Banana-50K, a comprehensive dataset of over 50k medically curated image edits spanning chest X-ray, brain MRI, and fundus photography across 23 diseases. Each sample supports bidirectional lesion editing (addition and removal) and is constructed using Gemini-2.5-Flash-Image based on real clinical images. A key differentiator of our dataset is the medically grounded quality control protocol: we employ an LLM-as-Judge evaluation framework with criteria such as instruction compliance, structural plausibility, image realism, and fidelity preservation, alongside iterative refinement over up to five rounds. Additionally, Med-Banana-50K includes around 37,000 failed editing attempts with full evaluation logs to support preference learning and alignment research. By offering a large-scale, medically rigorous, and fully documented resource, Med-Banana-50K establishes a critical foundation for developing and evaluating reliable medical image editing systems. Our dataset and code are publicly available. [https://github.com/richardChenzhihui/med-banana-50k].
医学影像编辑技术已成为一项至关重要的技术,在数据增强、模型解释性、医学教育和治疗模拟等领域有着广泛的应用。然而,由于缺乏针对医学情境的大规模、高质量、公开可访问的、符合严格解剖和临床约束的数据集,该领域的进展受到了极大的阻碍。为了弥补这一差距,我们推出了Med-Banana-50K数据集,它包含超过5万份经过医学审核的影像编辑样本,涵盖了胸部X光、脑部MRI和眼底摄影,跨越了23种疾病。每个样本都支持双向病变编辑(增加和移除),并使用基于真实临床图像的Gemini-2.5-Flash-Image构建。我们数据集的一个关键区别在于其医学基础的质量控制协议:我们采用LLM-as-Judge评估框架,包括指令合规性、结构合理性、图像真实性和保真度保留等标准,并进行最多五轮的迭代改进。此外,Med-Banana-50K还包括约3.7万次失败的编辑尝试及完整的评估日志,以支持偏好学习和对齐研究。通过提供大规模、医学严谨且完整的资源,Med-Banana-50K为开发可靠的医学影像编辑系统奠定了基础。我们的数据集和代码已公开发布,可供访问:[https://github.com/richardChenzhihui/med-banana-50k]。
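The quality-control protocol described above (LLM-as-Judge criteria plus up to five refinement rounds, with failed attempts logged) maps naturally onto a small control loop. The sketch below shows that loop with stub editor and judge callables; the interfaces, the criterion names used as dictionary keys, and the stopping rule are assumptions based on the description, not the released pipeline.

```python
from typing import Callable, Dict

# Hypothetical interfaces standing in for the image editor and the LLM-as-Judge.
Edit = Callable[[str, str], str]               # (image_path, instruction) -> edited_path
Judge = Callable[[str, str], Dict[str, bool]]  # (edited_path, instruction) -> verdicts

CRITERIA = ["instruction_compliance", "structural_plausibility",
            "image_realism", "fidelity_preservation"]

def curate_edit(image: str, instruction: str, edit: Edit, judge: Judge,
                max_rounds: int = 5):
    """Iterative quality control in the spirit described above: re-edit until the
    judge accepts on every criterion, or give up after max_rounds and log a failure."""
    history = []
    current = image
    for round_id in range(1, max_rounds + 1):
        current = edit(current, instruction)
        verdict = judge(current, instruction)
        history.append({"round": round_id, **verdict})
        if all(verdict.get(c, False) for c in CRITERIA):
            return {"status": "accepted", "image": current, "log": history}
    return {"status": "failed", "image": current, "log": history}

# Stub components so the sketch runs end to end.
fake_edit = lambda img, inst: img + "+edited"
fake_judge = lambda img, inst: {c: img.count("+edited") >= 2 for c in CRITERIA}
print(curate_edit("cxr_001.png", "add a right-lobe nodule", fake_edit, fake_judge)["status"])
```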
论文及项目相关链接
Summary:医疗图像编辑技术已逐渐成为一项关键技术,广泛应用于数据增强、模型解释性、医学教育和治疗模拟等领域。然而,缺乏大规模、高质量、开放访问的针对医学情境的数据集,严格符合解剖学和临床约束,极大地阻碍了该领域的进展。为了弥补这一差距,我们推出了Med-Banana-50K数据集,包含超过5万份经过医学审核的图像编辑样本,涉及胸部X光、脑部MRI和眼底摄影等23种疾病。每个样本支持双向病变编辑(增加和删除),并使用基于真实临床图像的Gemini-2.5-Flash-Image构建。该数据集的一个关键区别在于其医学基础的质量控制协议:我们采用LLM-as-Judge评估框架作为标准,包括指令合规性、结构合理性、图像真实性和保真度保持等标准,并进行最多五轮的迭代改进。此外,Med-Banana-50K还包括大约3万7千次的编辑失败尝试记录与完整的评估日志记录支持偏好学习和对齐研究。通过提供大规模且经过医学严谨验证的资源,Med-Banana-50K为开发可靠的医疗图像编辑系统奠定了基础。我们的数据集和代码均公开可用。具体信息可通过GitHub访问:[https://github.com/richardChenzhihui/med-banana-50k]。
Key Takeaways:
- 医疗图像编辑技术在数据增强、模型解释性等方面具有广泛应用前景。
- 当前缺乏大规模的医疗图像数据集阻碍了医疗图像编辑技术的进展。
- Med-Banana-50K是一个全面的医疗图像数据集,包含超过5万份经过医学审核的图像编辑样本,涵盖多种疾病和模态。
- Med-Banana-50K数据集支持病变的双向编辑,并采用真实的临床图像为基础构建。
- 该数据集具有严格的医学质量控制协议和评估标准,包括合规性、结构合理性等。
- Med-Banana-50K数据集还包括大量的编辑失败尝试记录,支持偏好学习和对齐研究。
HugAgent: Benchmarking LLMs for Simulation of Individualized Human Reasoning
Authors:Chance Jiajie Li, Zhenze Mo, Yuhan Tang, Ao Qu, Jiayi Wu, Kaiya Ivy Zhao, Yulu Gan, Jie Fan, Jiangbo Yu, Hang Jiang, Paul Pu Liang, Jinhua Zhao, Luis Alberto Alonso Pastor, Kent Larson
Simulating human reasoning in open-ended tasks has long been a central aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), which rethinks human reasoning simulation along three dimensions: (i) from averaged to individualized reasoning, (ii) from behavioral mimicry to cognitive alignment, and (iii) from vignette-based to open-ended data. The benchmark evaluates whether a model can predict a specific person’s behavioral responses and the underlying reasoning dynamics in out-of-distribution scenarios, given partial evidence of their prior views. HugAgent adopts a dual-track design: a human track that automates and scales the think-aloud method to collect ecologically valid human reasoning data, and a synthetic track for further scalability and systematic stress testing. This architecture enables low-cost, extensible expansion to new tasks and populations. Experiments with state-of-the-art language models reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. The benchmark, along with its complete data collection pipeline and companion chatbot, is open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).
模拟人类在日常开放任务中的推理能力是人工智能和认知科学的中心追求。虽然大型语言模型已经在大规模上近似模拟人类的响应,但它们仍然以群体共识为基础,常常忽略了个人推理风格和信念轨迹的独特性。为了推进机器中更人性化推理的愿景,我们引入了HugAgent(人类基础代理基准测试),它重新思考了人类推理模拟的三个维度:(i)从平均推理到个性化推理,(ii)从行为模仿到认知对齐,(iii)从短篇故事到开放数据。该基准测试评估模型能否预测特定个体在超出分布范围场景中的行为响应和底层推理动态,基于对他们先前观点的有限证据。HugAgent采用双轨设计:人类轨道用于自动化和扩展出声思考方法以收集生态有效的人类推理数据,合成轨道用于进一步的扩展性和系统压力测试。这种架构能够实现低成本、可扩展的新任务和人群扩展。使用最先进的语言模型进行的实验揭示了持续的适应差距,这使得HugAgent成为首个可与人类思维个性对齐的机器推理可扩展基准测试。该基准测试及其完整的数据收集管道和配套聊天机器人已作为HugAgent(https://anonymous.4open.science/r/HugAgent)和TraceYourThinking(https://anonymous.4open.science/r/trace-your-thinking)开源发布。
论文及项目相关链接
PDF To appear in NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models (LAW)
Summary:
模拟人类在开放任务中的推理能力是人工智能和认知科学的中心目标。虽然大型语言模型可以大规模地模拟人类响应,但它们通常针对人群共识进行调整,忽略了个人推理风格和信念轨迹的多样性。为了推进更人性化的机器推理愿景,我们引入了HugAgent(Human-Grounded Agent Benchmark),它重新思考了人类推理模拟的三个方面:(i)从平均推理到个性化推理,(ii)从行为模仿到认知对齐,(iii)从基于插图的到开放的数据。HugAgent采用双轨设计,包括自动化和扩展思维出声方法以收集生态有效的人类推理数据的人类轨道,以及用于进一步提高可扩展性和系统性压力测试的合成轨道。实验表明,最先进的语言模型仍存在持续的适应差距,这定位HugAgent成为第一个能与人类思维个性相匹配的机器推理的可扩展基准测试。HugAgent及其完整的数据收集管道和伴随聊天机器人已作为开源项目发布。
Key Takeaways:
- HugAgent重新定义了人类推理模拟的三个方面,包括个性化推理、认知对齐和开放数据的使用。
- HugAgent采用双轨设计,包括人类轨道和合成轨道,旨在自动化和扩展人类推理数据的收集,并支持新的任务和人群的扩展。
- 当前的语言模型在模拟人类推理时存在适应差距,尤其是在个性化方面。
- HugAgent是一个可扩展的基准测试,能够评估模型在预测特定个人在偏离分布的场景中的行为反应和内在推理动态的能力。
- HugAgent及其数据收集管道和伴随聊天机器人已作为开源项目发布,以促进更广泛的研究和应用。
- 该基准测试不仅在评估模型性能方面有价值,而且为人工智能和认知科学研究提供了一个新的研究工具。
Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
Authors:Guanhua Huang, Tingqiang Xu, Mingze Wang, Qi Yi, Xue Gong, Siheng Li, Ruibin Xiong, Kejiao Li, Yuhao Jiang, Bo Zhou
Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term reasoning sparks. We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy where the probability of reasoning sparks is amplified, which then serves as a soft regularization target to shield these valuable tokens from elimination via KL divergence. Experiments show that Lp-Reg enables stable on-policy RL, sustaining continuous scaling across 3,000 training steps and 81,204 GPU-hours, where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
带可验证奖励的强化学习(RLVR)推动了大型语言模型在复杂推理中的进展,但其可扩展性通常受到训练瓶颈的阻碍:随着策略熵崩溃,性能陷入停滞,这标志着探索能力的丧失。之前的方法通常通过保持高策略熵来解决这个问题,但控制有意义探索的精确机制仍然未被充分研究。我们的分析表明,对熵的无选择性关注可能放大无关标记并破坏训练稳定性。本文研究了RLVR中的探索动态,并发现了一个关键问题:有价值的低概率探索标记(我们称之为“推理火花”)被逐渐消除。我们发现这些火花在预训练模型中虽然丰富,但在RLVR期间由于过度惩罚而被系统性地消除,导致探索退化。为了解决这一问题,我们引入了低概率正则化(Lp-Reg)。其核心机制是将策略正则化朝向一个启发式代理分布。该代理分布通过过滤掉假定为噪声的标记并对剩余候选标记重新归一化来构建。结果是一个噪声较少的代理分布,其中推理火花的概率被放大,随后它作为软正则化目标,通过KL散度保护这些有价值的标记免于被消除。实验表明,Lp-Reg能够实现稳定的在线策略强化学习,在3000个训练步骤和81204个GPU小时中持续扩展,而基线熵控制方法则崩溃了。这种持续的探索带来了最先进的性能:在五个数学基准测试上取得60.17%的平均准确率,较之前的方法提升2.66%。代码已公开:https://github.com/CarlanLark/Lp-Reg。
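To make the mechanism concrete, the sketch below builds the heuristic proxy distribution from the policy itself (drop tokens under a probability floor, renormalize the survivors, which lifts the remaining low-probability "reasoning sparks") and computes the KL divergence used as a soft regularization target. The probability floor, the KL direction, and the toy logits are assumptions, not the paper's exact specification.

```python
import torch
import torch.nn.functional as F

def lp_reg_term(logits: torch.Tensor, noise_floor: float = 1e-3) -> torch.Tensor:
    """A sketch of the low-probability regularization idea described above.

    Build a proxy distribution from the policy itself: drop tokens whose probability
    falls below `noise_floor` (presumed noise), renormalize over the survivors
    (which amplifies the surviving low-probability "reasoning sparks"), and
    penalize the KL divergence from the proxy to the policy.
    """
    log_p = F.log_softmax(logits, dim=-1)            # policy distribution
    p = log_p.exp()
    keep = p >= noise_floor                          # survivors: non-noise tokens
    proxy = torch.where(keep, p, torch.zeros_like(p))
    proxy = proxy / proxy.sum(dim=-1, keepdim=True)  # renormalize: survivors' probs rise
    eps = 1e-12
    kl = proxy * ((proxy + eps).log() - log_p)       # KL(proxy || policy) per token
    return kl.sum(dim=-1).mean()                     # averaged over positions

# Toy next-token logits for 2 positions over a 6-token vocabulary.
logits = torch.tensor([[4.0, 2.5, 0.5, -3.0, -6.0, -7.0],
                       [3.0, 3.0, 1.0, -1.0, -8.0, -9.0]])
loss_reg = lp_reg_term(logits)
print(float(loss_reg))   # added (with a small weight) to the usual RLVR objective
```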
论文及项目相关链接
Summary
该文章介绍了强化学习可验证奖励(RLVR)在大型语言模型复杂推理中的应用,但存在训练瓶颈问题。文章指出,过于关注策略熵可能导致无意义的标记被放大并破坏训练稳定性。文章深入探讨了RLVR中的探索动态,发现一个关键问题:有价值但概率低的“推理火花”(reasoning sparks)逐渐被淘汰。为了解决这个问题,文章引入了一种名为低概率正则化(Lp-Reg)的方法,其核心机制是通过一个启发式代理分布对策略进行正则化。该代理分布通过过滤掉假定为噪声的标记并重新归一化剩余候选标记的分布来构建。Lp-Reg通过放大“推理火花”的概率作为软正则化目标,保护这些有价值的标记免受消除的影响。实验表明,Lp-Reg能够实现稳定的在线策略强化学习,维持持续扩展的训练过程,并在数学基准测试中实现了平均准确率提高至60.17%,相较于之前的方法有显著改善。代码已公开在GitHub上提供。
Key Takeaways
- 强化学习可验证奖励(RLVR)在大型语言模型复杂推理应用中的训练瓶颈问题,主要因为策略熵下降导致的探索减少。
- 过于关注策略熵可能导致无意义的标记被放大并破坏训练稳定性。
- 发现有价值但概率低的“推理火花”(reasoning sparks)在强化学习中逐渐被淘汰的问题。
- 引入低概率正则化(Lp-Reg)方法,通过构建启发式代理分布来正则化策略,保护有价值的标记并维持稳定的探索过程。
- Lp-Reg能够提高强化学习的探索效率,实现了在数学基准测试上的平均准确率显著提升。
- Lp-Reg能够实现持续的在线策略强化学习,并能够在长时间训练过程中维持性能。
Ethics-Aware Safe Reinforcement Learning for Rare-Event Risk Control in Interactive Urban Driving
Authors:Dianzhao Li, Ostap Okhrin
Autonomous vehicles hold great promise for reducing traffic fatalities and improving transportation efficiency, yet their widespread adoption hinges on embedding credible and transparent ethical reasoning into routine and emergency maneuvers, particularly to protect vulnerable road users (VRUs) such as pedestrians and cyclists. Here, we present a hierarchical Safe Reinforcement Learning (Safe RL) framework that augments standard driving objectives with ethics-aware cost signals. At the decision level, a Safe RL agent is trained using a composite ethical risk cost, combining collision probability and harm severity, to generate high-level motion targets. A dynamic, risk-sensitive Prioritized Experience Replay mechanism amplifies learning from rare but critical, high-risk events. At the execution level, polynomial path planning coupled with Proportional-Integral-Derivative (PID) and Stanley controllers translates these targets into smooth, feasible trajectories, ensuring both accuracy and comfort. We train and validate our approach on closed-loop simulation environments derived from large-scale, real-world traffic datasets encompassing diverse vehicles, cyclists, and pedestrians, and demonstrate that it outperforms baseline methods in reducing risk to others while maintaining ego performance and comfort. This work provides a reproducible benchmark for Safe RL with explicitly ethics-aware objectives in human-mixed traffic scenarios. Our results highlight the potential of combining formal control theory and data-driven learning to advance ethically accountable autonomy that explicitly protects those most at risk in urban traffic environments. Across two interactive benchmarks and five random seeds, our policy decreases conflict frequency by 25-45% compared to matched task successes while maintaining comfort metrics within 5%.
自动驾驶汽车对于减少交通事故死亡和提高交通效率具有巨大潜力,然而其广泛采用的关键在于将可信和透明的道德推理嵌入到常规和紧急操作中,特别是在保护行人及骑行者等脆弱道路使用者(VRUs)方面。在此,我们提出了一种分层的Safe Reinforcement Learning(Safe RL)框架,该框架增加了基于伦理的成本信号以辅助标准的驾驶目标。在决策层面,Safe RL代理通过使用组合伦理风险成本(结合碰撞概率和伤害严重程度)进行训练,以产生高级运动目标。动态、风险敏感优先经验回放机制能加强从罕见但关键的高风险事件中的学习。在执行层面,多项式路径规划与比例积分微分(PID)和斯坦利控制器相结合,将这些目标转化为平稳且可行的轨迹,确保准确性和舒适性。我们在封闭循环仿真环境中对方法进行了训练和验证,该环境基于大规模现实世界交通数据集构成,涵盖了多种车辆、骑行者和行人。我们证明了该方法在减少对他人的风险时表现出良好的性能,同时保持了自身性能和舒适性。这项工作为Safe RL提供了一个可重复的基准测试,明确了其在混合交通场景中基于伦理目标的定位。我们的结果强调了结合正式控制理论和数据驱动学习的潜力,旨在推进具有明确道德责任的自主性,在城市交通环境中明确保护最脆弱的人群。在两个交互基准测试和五个随机种子的情况下,我们的策略在保持舒适指标在5%以内的同时,与匹配的任务成功相比减少了冲突频率的25-45%。
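Two components described above are easy to sketch: a composite ethical risk cost that combines collision probability with harm severity, and a risk-sensitive replay priority that makes rare, high-risk transitions (e.g. pedestrian near-misses) sampled more often. The snippet below is an illustrative sketch only; the multiplicative cost form, the weights, and the numbers are assumptions, not the paper's exact formulation.

```python
import numpy as np

def ethical_risk_cost(p_collision: float, harm_severity: float,
                      w_p: float = 1.0, w_h: float = 1.0) -> float:
    """Hypothetical composite ethics-aware cost: combine the probability of a
    collision with the expected harm severity for the road user involved."""
    return w_p * p_collision * w_h * harm_severity

def per_priority(td_error: float, risk_cost: float,
                 risk_weight: float = 2.0, eps: float = 1e-3) -> float:
    """Risk-sensitive Prioritized Experience Replay: rare, high-risk transitions
    get a larger sampling priority than their TD error alone would give them."""
    return abs(td_error) + risk_weight * risk_cost + eps

# Toy transitions: (td_error, p_collision, harm_severity in [0, 1]).
transitions = [(0.10, 0.01, 0.2),   # routine cruising
               (0.05, 0.30, 0.9),   # near-miss with a pedestrian
               (0.40, 0.05, 0.3)]   # mild braking event
priorities = np.array([per_priority(td, ethical_risk_cost(p, h)) for td, p, h in transitions])
probs = priorities / priorities.sum()   # proportional prioritization
print(probs.round(3))                   # the pedestrian near-miss dominates sampling
```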
论文及项目相关链接
Summary
该文本介绍了一种层次化的安全强化学习(Safe RL)框架,该框架结合了标准驾驶目标与伦理意识成本信号,用于在自动驾驶车辆中嵌入可信且透明的道德推理。通过复合伦理风险成本(结合碰撞概率和伤害严重程度),Safe RL代理在决策层面生成高级运动目标。采用动态、风险敏感性的优先经验回放机制,强化了对罕见但关键的高风险事件的学习。在执行层面,通过多项式路径规划与PID和Stanley控制器将这些目标转化为平滑、可行的轨迹,确保准确性和舒适性。在封闭循环仿真环境中对方法进行了训练与验证,该方法在减少对他人的风险同时保持自我性能与舒适性方面表现出优于基准方法的效果。
Key Takeaways
- Safe RL框架结合了标准驾驶目标与伦理意识成本信号,以促进自动驾驶车辆的广泛采纳。
- 决策层面采用复合伦理风险成本来生成高级运动目标。
- 采用动态、风险敏感性的优先经验回放机制来强化高风险事件的学习。
- 通过多项式路径规划与控制器将目标转化为平滑、可行的轨迹。
- 方法在封闭循环仿真环境中进行了训练与验证,展示了对减少对他人的风险、保持自我性能与舒适性的效果。
- 与基准方法相比,该方法在减少冲突频率方面表现出优异性能,任务成功时冲突频率降低25-45%,同时保持舒适性指标在5%以内。
P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication
Authors:Sneha Oram, Pushpak Bhattacharyya
Although explainability and interpretability have received significant attention in artificial intelligence (AI) and natural language processing (NLP) for mental health, reasoning has not been examined in the same depth. Addressing this gap is essential to bridge NLP and mental health through interpretable and reasoning-capable AI systems. To this end, we investigate the pragmatic reasoning capability of large-language models (LLMs) in the mental health domain. We introduce PRiMH dataset, and propose pragmatic reasoning tasks in mental health with pragmatic implicature and presupposition phenomena. In particular, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the tasks presented, we consider four models: Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning abilities in the domain. Subsequently, we study the behavior of MentaLLaMA on the proposed reasoning tasks with the rollout attention mechanism. In addition, we also propose three StiPRompts to study the stigma around mental health with the state-of-the-art LLMs, GPT4o-mini, Deepseek-chat, and Claude-3.5-haiku. Our evaluated findings show that Claude-3.5-haiku deals with stigma more responsibly compared to the other two LLMs.
尽管在人工智能(AI)和自然语言处理(NLP)心理健康领域,解释性和可解释性已受到广泛关注,但推理尚未得到同等深度的研究。为了弥合NLP和心理健康之间的差距,需要通过可解释和具备推理能力的AI系统来构建联系。为此,我们研究心理健康领域中大型语言模型的实用推理能力。我们介绍了PRiMH数据集,并提出了具有实用隐晦和预设现象的心理健康实用推理任务。特别是,我们制定了两个隐晦任务和一个预设任务。为了对提出的数据集和任务进行基准测试,我们考虑了四种模型:Llama3.1、Mistral、MentaLLaMa和Qwen。实验结果表明,Mistral和Qwen在该领域的推理能力显著。随后,我们研究了MentaLLaMA在提出的推理任务上的表现,采用了展开式注意力机制。此外,我们还利用最先进的LLMs(GPT4o-mini、Deepseek-chat和Claude-3.5-haiku)提出了三个关于心理健康耻辱的StiPRompts研究。我们的评估结果表明,与其他两个LLMs相比,Claude-3.5-haiku在处理心理健康耻辱方面更为负责任。
论文及项目相关链接
Summary
在人工智能和自然语言处理领域,解释性和可解释性已备受关注,但对推理的研究尚未深入。为填补这一空白,本研究旨在探索大型语言模型在精神健康领域的实用推理能力。我们引入了PRiMH数据集,并提出了具有实用隐含和预设现象的精神健康实用推理任务。通过四个模型对数据集和任务进行评估,发现Mistral和Qwen在该领域具有较强的推理能力。此外,我们还研究了MentaLLaMA在提出的推理任务上的表现,并提出了三个关于精神健康耻辱的StiPRompts,与当前先进的大型语言模型进行对比评估。
Key Takeaways
- 研究强调了精神健康领域中推理的重要性,并指出需要更多关注大型语言模型在此方面的能力。
- 引入了PRiMH数据集,为精神健康领域的实用推理任务提供了资源。
- 提出了包含实用隐含和预设现象的精神健康实用推理任务。
- 实验结果显示Mistral和Qwen在精神健康领域的推理任务上表现较好。
- MentaLLaMA在提出的推理任务上的表现也得到了研究。
- 通过三个StiPRompts探讨了精神健康耻辱问题,并评估了不同大型语言模型的应对方式。
SAFER: Probing Safety in Reward Models with Sparse Autoencoder
Authors:Sihang Li, Wei Shi, Ziyuan Xie, Tao Liang, Guojun Ma, Xiang Wang
Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present sparse Autoencoder For Enhanced Reward model (SAFER), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features by activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Our approach contributes to interpreting, auditing and refining reward models in high-stakes LLM alignment tasks. Our codes are available at https://github.com/xzy-101/SAFER-code. This paper discusses topics related to large language model safety and may include discussions or examples that highlight potential risks or unsafe outcomes.
强化学习从人类反馈(RLHF)是对齐大型语言模型(LLM)与人类价值观的关键范式,但其核心的奖励模型仍然大部分不透明。在这项工作中,我们提出了稀疏自动编码器增强奖励模型(SAFER),这是一种通过机械分析来解释和改进奖励模型的新型框架。利用稀疏自动编码器(SAE),我们在奖励模型激活中发现人类可解释的特征,从而深入了解与安全相关的决策。我们将SAFER应用于面向安全的偏好数据集,并通过选择响应和拒绝响应之间的激活差异来量化单个特征的重要性。使用这些特征级信号,我们设计有针对性的数据污染和去噪策略。实验表明,SAFER可以在不牺牲一般聊天性能的情况下,通过最小的数据修改精确地降低或提高安全对齐性。我们的方法为解释、审计和改进高风险LLM对齐任务中的奖励模型做出了贡献。我们的代码可在https://github.com/xzy-101/SAFER-code中找到。这篇论文讨论与大型语言模型安全相关的话题,并可能包含突出潜在风险或不安全结果的讨论或示例。
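The core tool referenced above, a sparse autoencoder over reward-model activations, is compact enough to sketch end to end: an overcomplete ReLU dictionary trained with a reconstruction plus L1 sparsity loss, and feature salience read off as the mean activation difference between chosen and rejected responses. The layer sizes, penalty weight, and random data below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """A minimal sparse autoencoder of the kind used to probe reward-model
    activations: an overcomplete ReLU dictionary with an L1 sparsity penalty."""
    def __init__(self, d_model: int = 768, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # sparse, human-inspectable features
        recon = self.decoder(features)
        return recon, features

def sae_loss(x, recon, features, l1_coeff: float = 1e-3):
    return ((recon - x) ** 2).mean() + l1_coeff * features.abs().mean()

# One toy training step on fake reward-model activations.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(32, 768)                      # batch of hidden activations
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward(); opt.step()

# Feature salience in the spirit of SAFER: mean activation difference between
# chosen and rejected responses highlights safety-relevant dictionary features.
chosen, rejected = torch.randn(16, 768), torch.randn(16, 768)
salience = sae(chosen)[1].mean(0) - sae(rejected)[1].mean(0)
print(salience.abs().topk(5).indices.tolist())
```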
论文及项目相关链接
PDF One of the institutions requires additional approval before we can move forward with the publication. Thanks for your understanding, and we hope to resubmit once everything is finalized
Summary
强化学习从人类反馈(RLHF)是对齐大型语言模型(LLM)与人类价值观的关键范式,但其核心奖励模型仍然大部分不透明。本研究提出稀疏自动编码器增强奖励模型(SAFER)框架,通过机械分析来解读和改进奖励模型。利用稀疏自动编码器(SAE),揭示奖励模型激活中的人可解释特征,洞察安全相关的决策过程。将SAFER应用于安全导向的偏好数据集,通过选择响应和拒绝响应之间的激活差异来量化单个特征的重要性。使用这些特征级别的信号,我们设计有针对性的数据中毒和去噪策略。实验表明,SAFER能在不牺牲通用聊天性能的前提下,精确地提高或降低安全对齐度,且只需对少量数据进行修改。我们的方法有助于解释、审计和改进高风险语言模型对齐任务中的奖励模型。代码地址:https://github.com/xzy-101/SAFER-code。
Key Takeaways
- RLHF是对齐LLM与人类价值观的关键方法,但奖励模型透明度低。
- 提出SAFER框架,利用稀疏自动编码器解读和改进奖励模型。
- 通过人可解释的特征揭示奖励模型的激活机制,洞察安全相关的决策过程。
- 应用SAFER到安全导向的偏好数据集,量化个人特征的重要性。
- 利用特征级别信号设计数据中毒和去噪策略。
- SAFER能精确调整安全对齐度,且对通用聊天性能影响小。
Know What You Don’t Know: Uncertainty Calibration of Process Reward Models
Authors:Young-Jin Park, Kristjan Greenewald, Kaveh Alim, Hao Wang, Navid Azizan
Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms for large language models (LLMs). However, we observe that even state-of-the-art PRMs can be poorly calibrated. Specifically, they tend to overestimate the success probability that a partial reasoning step will lead to a correct final answer, particularly when smaller LLMs are used to complete the reasoning trajectory. To address this, we present a calibration approach – performed via quantile regression – that adjusts PRM outputs to better align with true success probabilities. Leveraging these calibrated success estimates and their associated confidence bounds, we introduce an instance-adaptive scaling (IAS) framework that dynamically adjusts the compute budget based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer. Unlike conventional methods that allocate a fixed number of reasoning trajectories per query, this approach adapts to each instance and reasoning step when using our calibrated PRMs. Experiments on mathematical reasoning benchmarks show that (i) our PRM calibration method achieves small calibration error, outperforming the baseline methods, (ii) calibration is crucial for enabling effective IAS, and (iii) the proposed IAS strategy reduces inference costs while maintaining final answer accuracy, utilizing less compute on more confident problems as desired.
过程奖励模型(PRMs)在指导大型语言模型(LLMs)的推理时间缩放算法中起着核心作用。然而,我们观察到,即使是最先进的PRMs也可能存在校准不良的情况。具体来说,它们倾向于高估部分推理步骤会导致正确最终答案的成功概率,特别是在使用较小的LLMs来完成推理轨迹时。为了解决这一问题,我们提出了一种通过分位回归进行校准的方法,该方法可以调整PRM输出,以更好地符合真实的成功概率。利用这些校准后的成功估计及其相关的置信界限,我们引入了实例自适应缩放(IAS)框架,该框架根据估计的局部推理轨迹产生正确答案的可能性来动态调整计算预算。与传统的为每次查询分配固定数量的推理轨迹的方法不同,这种方法在使用我们的校准PRM时能够适应每个实例和推理步骤。在数学推理基准测试上的实验表明,(i)我们的PRM校准方法实现了较小的校准误差,优于基准方法,(ii)校准对于实现有效的IAS至关重要,(iii)提出的IAS策略在保持最终答案准确性的同时降低了推理成本,如预期地在更有信心的问题上使用了较少的计算。
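The calibration idea above can be sketched with off-the-shelf quantile regression: map raw PRM scores to the empirical success rate of completing the partial trajectory, and use a lower quantile as a conservative bound that drives the instance-adaptive compute budget. The synthetic data, the feature choice, and the budget rule below are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy calibration set: raw PRM scores for partial reasoning steps, paired with
# Monte-Carlo success rates (fraction of sampled completions that reached the
# correct final answer). The over-confidence pattern below is synthetic.
raw_prm = rng.uniform(0, 1, size=500)
true_success = np.clip(raw_prm ** 2 + rng.normal(0, 0.05, 500), 0, 1)  # PRM overestimates
X = raw_prm.reshape(-1, 1)

# Quantile regression: median as the calibrated success estimate,
# plus lower/upper quantiles as a confidence band.
quantiles = {}
for name, alpha in [("lo", 0.1), ("med", 0.5), ("hi", 0.9)]:
    model = GradientBoostingRegressor(loss="quantile", alpha=alpha, n_estimators=100)
    quantiles[name] = model.fit(X, true_success)

step_score = np.array([[0.8]])                       # a confident-looking PRM output
lo, med, hi = (quantiles[k].predict(step_score)[0] for k in ("lo", "med", "hi"))
print(f"calibrated success ~ {med:.2f}  (band [{lo:.2f}, {hi:.2f}])")

# Instance-adaptive scaling (sketch): sample fewer extra trajectories when the
# calibrated lower bound is already high, more when it is low.
budget = int(np.interp(lo, [0.0, 1.0], [16, 1]))
print("reasoning trajectories to sample:", budget)
```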
论文及项目相关链接
PDF Accepted at NeurIPS 2025
Summary
本文介绍了针对大型语言模型(LLM)的过程奖励模型(PRM)的校准问题。文章指出,即使是最先进的过程奖励模型也可能存在校准不良的情况,特别是在使用较小的LLM完成推理轨迹时,它们往往会高估部分推理步骤导致正确最终答案的成功概率。为解决这一问题,文章提出了一种通过分位回归进行校准的方法,以调整PRM输出,使其更好地符合真实的成功概率。利用这些校准后的成功估计及其相关的置信界限,文章还介绍了一种名为实例自适应缩放(IAS)的框架,该框架能够根据估计的机率动态调整计算预算,预测部分推理轨迹是否会得出正确的最终答案。实验表明,该文的PRM校准方法具有较小的校准误差,优于基准方法;校准对于实现有效的IAS至关重要;所提出的IAS策略在维持最终答案准确性的同时,降低了推理成本,如预期在更有信心的问题上使用较少的计算资源。
Key Takeaways
- 过程奖励模型(PRM)在大型语言模型(LLM)的推理时间缩放算法中起关键作用,但存在校准问题。
- PRM倾向于高估部分推理步骤的成功概率,特别是在使用较小的LLM时。
- 提出一种通过分位回归进行PRM校准的方法,以更准确地预测成功概率。
- 引入实例自适应缩放(IAS)框架,根据估计的成功概率动态调整计算预算。
- 实证研究表明,PRM校准对于实现有效的IAS至关重要。
- IAS策略在维持最终答案准确性的同时,能够降低推理成本。