⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never rely on them for serious academic work; they are only intended as a first-pass screen before reading a paper.
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-25
Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Authors:Yuhan Liu, Lianhui Qin, Shengjie Wang
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
Paper & Project Links
Summary
Large vision-language models struggle with information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges are precisely localizing key cues in dense layouts and performing multi-hop reasoning to integrate dispersed evidence. To address this, the authors propose Speculative Verdict (SV), a training-free framework that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small vision-language models act as draft experts and generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong vision-language model synthesizes these paths into the final answer. SV further introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict stage, improving both efficiency and accuracy. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV delivers both error correction and cost-efficiency compared with large proprietary models or training pipelines.
Key Takeaways
- Large vision-language models struggle with information-intensive images, which require precise localization and complex multi-hop reasoning.
- The proposed Speculative Verdict (SV) framework combines lightweight draft experts with a large verdict model to tackle these challenges.
- In the draft stage, small vision-language models generate diverse reasoning paths.
- In the verdict stage, a strong vision-language model synthesizes these paths into the final answer.
- SV introduces a consensus expert selection mechanism to improve efficiency and accuracy (illustrated in the sketch below).
- SV performs strongly on multiple visual question answering benchmarks, including information-intensive and high-resolution images.
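To make the draft-then-verdict flow concrete, here is a minimal sketch of the consensus expert selection step. It is not the paper's implementation: it assumes each draft expert returns an (answer, reasoning path) pair, measures agreement simply by counting experts whose normalized answers match, and stubs out the verdict model.

```python
from collections import Counter

def normalize(ans: str) -> str:
    # Crude normalization so superficially different answer strings can agree.
    return ans.strip().lower()

def select_consensus_paths(drafts, k=2):
    """drafts: list of (answer, reasoning_path) pairs from small draft VLMs.
    Returns up to k reasoning paths whose answers have the highest agreement."""
    counts = Counter(normalize(a) for a, _ in drafts)
    ranked = sorted(drafts, key=lambda d: counts[normalize(d[0])], reverse=True)
    return [path for _, path in ranked[:k]]

def verdict_model(question, paths):
    # Placeholder for the strong verdict VLM that synthesizes the final answer.
    prompt = question + "\n\nCandidate reasoning paths:\n" + "\n---\n".join(paths)
    return f"<final answer synthesized from {len(paths)} paths, prompt length {len(prompt)}>"

drafts = [
    ("42%", "Read the third bar of the chart ..."),
    ("42 %", "Located the legend, matched the blue series ..."),
    ("37%", "Misread the adjacent bar ..."),
]
paths = select_consensus_paths(drafts, k=2)
print(verdict_model("What share does segment B hold?", paths))
```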
Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost
Authors:Runzhe Zhan, Zhihong Huang, Xinyi Yang, Lidia S. Chao, Min Yang, Derek F. Wong
Recent advancements in large reasoning models (LRMs) have introduced an intermediate “thinking” process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to “overthink” simpler instances, and have scoring-mechanism issues that lead to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach largely reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.
Paper & Project Links
PDF NeurIPS 2025
Summary
Large reasoning models (LRMs) are drawing attention as evaluators for machine translation (MT). This work presents the first systematic analysis of LRM-as-a-judge for MT evaluation and identifies key challenges: LRMs require tailored evaluation materials, tend to overthink simple instances, and have scoring-mechanism issues that lead to overestimation. To address these problems, the authors propose calibrating LRM thinking by training on synthetic, human-like thinking trajectories. Experiments show that this approach reduces the thinking budget while improving evaluation performance.
Key Takeaways
- Large reasoning models (LRMs), with their improved reasoning on complex downstream tasks, show potential as evaluators for machine translation (MT).
- As MT judges, LRMs face challenges: they need tailored evaluation materials, overthink simple instances, and have scoring-mechanism issues.
- The authors propose calibrating LRM thinking by training on synthetic, human-like thinking trajectories, reducing the thinking budget while improving evaluation performance.
- The method is effective across LRM scales; on the WMT24 Metrics benchmark, R1-Distill-Qwen-7B achieves a +8.7 correlation-point improvement.
Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward
Authors:Jing Bi, Guangyu Sun, Ali Vosoughi, Chen Chen, Chenliang Xu
Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results highlight future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.
Paper & Project Links
PDF 5 pages
Summary
This paper examines the challenges multimodal large language models (MLLMs) face on complex visual tasks, including visual hallucinations and over-reliance on textual priors. Using a three-stage evaluation framework, the authors systematically diagnose state-of-the-art vision-language models and uncover key failure modes. To address them, they propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. The results suggest that future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. The system achieves significant gains (+10.3 on MMMU and +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models.
Key Takeaways
- Multimodal large language models (MLLMs) integrate visual and textual reasoning and use chain-of-thought (CoT) prompting to tackle complex visual tasks.
- MLLMs suffer from visual hallucinations and an over-reliance on textual priors.
- A three-stage evaluation framework reveals the key failure modes of MLLMs.
- The proposed agent-based architecture combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains.
- The architecture significantly outperforms the baseline on MMMU and MathVista.
- Future visual reasoning models should integrate a broader set of specialized tools for analyzing visual content.
Generalizable Reasoning through Compositional Energy Minimization
Authors:Alexandru Oarga, Yilun Du
Generalization is a key challenge in machine learning, specifically in reasoning tasks, where models are expected to solve problems more complex than those encountered during training. Existing approaches typically train reasoning models in an end-to-end fashion, directly mapping input instances to solutions. While this allows models to learn useful heuristics from data, it often results in limited generalization beyond the training distribution. In this work, we propose a novel approach to reasoning generalization by learning energy landscapes over the solution spaces of smaller, more tractable subproblems. At test time, we construct a global energy landscape for a given problem by combining the energy functions of multiple subproblems. This compositional approach enables the incorporation of additional constraints during inference, allowing the construction of energy landscapes for problems of increasing difficulty. To improve the sample quality from this newly constructed energy landscape, we introduce Parallel Energy Minimization (PEM). We evaluate our approach on a wide set of reasoning problems. Our method outperforms existing state-of-the-art methods, demonstrating its ability to generalize to larger and more complex problems. Project website can be found at: https://alexoarga.github.io/compositional_reasoning/
Paper & Project Links
Summary
This paper proposes a new approach to improving the generalization of machine learning models on reasoning tasks by learning energy landscapes over the solution spaces of smaller, more tractable subproblems. By combining the energy functions of multiple subproblems into a global energy landscape, the method can incorporate additional constraints at inference time and handle more complex reasoning problems. To improve the quality of samples drawn from this composed landscape, the authors introduce Parallel Energy Minimization (PEM). Experiments show that the method outperforms existing state-of-the-art approaches on a wide range of reasoning tasks.
Key Takeaways
- Generalization is a key challenge for machine learning on reasoning tasks; existing end-to-end training often generalizes poorly beyond the training distribution.
- This work proposes a new approach to reasoning generalization that learns energy landscapes over the solution spaces of subproblems.
- Combining the energy functions of multiple subproblems into a global energy landscape allows additional constraints to be incorporated at inference time and harder problems to be handled.
- Parallel Energy Minimization (PEM) improves the quality of samples drawn from the composed energy landscape (see the sketch below).
- The method outperforms existing state-of-the-art approaches across a wide range of reasoning tasks.
- The approach offers flexibility and scalability for tackling increasingly complex problems.
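A toy sketch of the compositional idea, under the assumption that each subproblem contributes an energy term over a shared solution vector and that "parallel" minimization is approximated by several gradient-descent restarts on the summed energy; the paper's energy functions are learned, and its PEM procedure is more involved than this.

```python
import numpy as np

def make_subproblem_energy(a, b):
    # Toy stand-in for a learned energy: low when a @ x is close to b.
    return lambda x: float(np.sum((a @ x - b) ** 2))

def global_energy(x, energies):
    # Compositional energy landscape: the sum of subproblem energies.
    return sum(E(x) for E in energies)

def parallel_energy_minimization(energies, dim, chains=8, steps=500, lr=0.01, eps=1e-4):
    """Run several minimization chains (here: sequential random restarts) and
    keep the lowest-energy sample, mimicking PEM's role of improving sample
    quality from the composed landscape."""
    rng = np.random.default_rng(0)
    best_x, best_e = None, float("inf")
    for _ in range(chains):
        x = rng.normal(size=dim)
        for _ in range(steps):
            # Numerical gradient of the composed energy (central differences).
            grad = np.array([
                (global_energy(x + eps * np.eye(dim)[i], energies)
                 - global_energy(x - eps * np.eye(dim)[i], energies)) / (2 * eps)
                for i in range(dim)
            ])
            x = x - lr * grad
        e = global_energy(x, energies)
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

rng = np.random.default_rng(1)
energies = [make_subproblem_energy(rng.normal(size=(2, 3)), rng.normal(size=2))
            for _ in range(3)]
x, e = parallel_energy_minimization(energies, dim=3)
print("composed-energy minimum:", round(e, 4))
```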
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Authors:Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang
Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% on the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
Paper & Project Links
Summary
This paper introduces Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning. To address the challenge of joint temporal tracking and spatial localization in dynamic scenes, the authors carefully collect training data and design training strategies. Alongside its answers, the model highlights key timestamps, objects, and bounding boxes, grounding the reasoning in concrete visual observations. To support this, two high-quality datasets, STGR-CoT-30k and STGR-RL-36k, are built with carefully constructed spatio-temporal annotations. A cold-start reinforcement learning strategy with multiple specially designed rewards jointly encourages answer accuracy, temporal alignment, and spatial precision. Open-o3 Video achieves state-of-the-art performance on the V-STAR benchmark, improving mAM and mLGM, with consistent gains across several video understanding benchmarks. Its reasoning traces also provide valuable signals for test-time scaling, improving answer reliability.
Key Takeaways
- Open-o3 Video is a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning.
- The model addresses the challenge of joint temporal tracking and spatial localization in dynamic scenes.
- Two datasets, STGR-CoT-30k and STGR-RL-36k, provide unified spatio-temporal supervision.
- Training uses a cold-start reinforcement learning strategy whose rewards encourage answer accuracy, temporal alignment, and spatial precision (see the reward sketch below).
- Open-o3 Video achieves state-of-the-art performance on the V-STAR benchmark.
- Consistent improvements are observed across a broad range of video understanding benchmarks.
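The jointly designed rewards can be illustrated with a small sketch. The exact reward forms are not specified here, so the code below assumes answer accuracy is exact match, temporal alignment is interval IoU over predicted versus ground-truth timestamps, spatial precision is box IoU, and the three terms are mixed with hypothetical weights.

```python
def interval_iou(pred, gt):
    # pred, gt: (start, end) spans in seconds.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def box_iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def spatio_temporal_reward(pred, gt, w_ans=0.5, w_t=0.25, w_s=0.25):
    """Composite reward over answer accuracy, temporal alignment, and spatial
    precision; the weights and exact terms are illustrative assumptions."""
    r_ans = 1.0 if pred["answer"].strip() == gt["answer"].strip() else 0.0
    r_t = interval_iou(pred["span"], gt["span"])
    r_s = box_iou(pred["box"], gt["box"])
    return w_ans * r_ans + w_t * r_t + w_s * r_s

pred = {"answer": "the red car", "span": (12.0, 15.5), "box": (40, 60, 120, 160)}
gt = {"answer": "the red car", "span": (11.5, 15.0), "box": (42, 58, 118, 155)}
print(round(spatio_temporal_reward(pred, gt), 3))
```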
EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence
Authors:Ding Zou, Feifan Wang, Mengyu Ge, Siyuan Fan, Zongbing Zhang, Wei Chen, Lingfeng Wang, Zhongyou Hu, Wenrui Yan, Zhengwei Gao, Hao Wang, Weizhao Jin, Yu Zhang, Hainan Zhao, Mingliang Zhang, Xianxian Xi, Yaru Zhang, Wenyuan Li, Zhengguang Gao, Yurui Zhu
The realization of Artificial General Intelligence (AGI) necessitates Embodied AI agents capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. However, current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations, including a significant gap between model design and agent requirements, an unavoidable trade-off between real-time latency and performance, and the use of unauthentic, offline evaluation metrics. To address these challenges, we propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes. Our framework features an agent-aligned data structure and employs a powerful training methodology that integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augmented Group Relative Policy Optimization (Step-GRPO), which boosts long-horizon task success by integrating preceding steps as Guided Precursors. Furthermore, we incorporate a comprehensive reward system, including a Generative Reward Model (GRM) accelerated at the infrastructure level, to improve training efficiency. To enable thorough validation, we establish a three-part evaluation system encompassing General, Planning, and End-to-End Simulation Benchmarks, highlighted by the proposal and open-sourcing of a novel, challenging simulation environment. Experimental results demonstrate that EmbodiedBrain achieves superior performance across all metrics, establishing a new state-of-the-art for embodied foundation models. Towards paving the way for the next generation of generalist embodied agents, we open-source all of our data, model weights, and evaluation methods, which are available at https://zterobot.github.io/EmbodiedBrain.github.io.
Paper & Project Links
Summary
Realizing artificial general intelligence requires embodied AI agents with robust spatial perception, effective task planning, and adaptive execution in physical environments. To address the limitations of current LLMs and multimodal LLMs for embodied tasks, such as the gap between model design and agent requirements, the trade-off between real-time latency and performance, and reliance on unauthentic offline evaluation metrics, the authors propose EmbodiedBrain, a novel vision-language foundation model. The model uses an agent-aligned data structure and a training methodology that combines large-scale supervised fine-tuning with Step-Augmented Group Relative Policy Optimization (Step-GRPO), together with a comprehensive reward system that includes a Generative Reward Model. For thorough validation, a three-part evaluation system covering general, planning, and end-to-end simulation benchmarks is established, including a newly proposed, open-sourced, and challenging simulation environment. Experiments show that EmbodiedBrain achieves superior performance across all metrics, setting a new state of the art for embodied foundation models. All data, model weights, and evaluation methods are open-sourced to support the next generation of generalist embodied agents.
Key Takeaways
- Embodied AI agents are key to artificial general intelligence and require strong spatial perception, task planning, and execution in physical environments.
- Current LLMs and multimodal LLMs for embodied tasks are limited by the gap between model design and agent requirements, the trade-off between real-time latency and performance, and the use of unauthentic offline evaluation metrics.
- EmbodiedBrain is a novel vision-language foundation model designed to address these problems, available in 7B and 32B parameter sizes.
- The model uses an agent-aligned data structure and combines large-scale supervised fine-tuning with Step-Augmented Group Relative Policy Optimization to boost performance.
- A comprehensive reward system, including a Generative Reward Model, improves training efficiency.
- A three-part evaluation system covering general, planning, and end-to-end simulation benchmarks enables thorough validation.
GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning
Authors:Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia, Shuangshuang Tian, Tingcheng Bian, Haiwei Wang, Haohuan Fu, Yan Tao
Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains limited by two fundamental limitations: (i) global planning absence to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
Paper & Project Links
PDF 8 pages, 3 figures, 4 tables
Summary
Reinforcement learning shows promise for improving retrieval-augmented generation (RAG), but its effectiveness in multi-hop question answering (QA) is limited by the absence of global planning and by unfaithful execution. GlobalRAG is a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. It decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. A Planning Quality Reward and a SubGoal Completion Reward encourage coherent planning and reliable subgoal execution, while a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Experiments on in-domain and out-of-domain benchmarks show that GlobalRAG significantly outperforms strong baselines while using only 8k training examples (42% of the data used by strong baselines), with average improvements of 14.2% in both EM and F1.
Key Takeaways
- Reinforcement learning has promise for improving retrieval-augmented generation (RAG).
- In multi-hop question answering (QA), reinforcement learning faces two main challenges: the absence of global planning and unfaithful execution.
- GlobalRAG is proposed to enhance global reasoning in multi-hop QA.
- GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively.
- A Planning Quality Reward and a SubGoal Completion Reward improve the coherence of planning and the reliability of subgoal execution.
- A progressive weight annealing strategy balances process-oriented and outcome-based objectives (see the sketch below).
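As a rough illustration of how the rewards might be combined, the sketch below mixes a process-oriented term (planning quality and subgoal completion) with an outcome term under a simple linear annealing schedule; the actual reward definitions and schedule in GlobalRAG may differ.

```python
def annealed_weight(step, total_steps, w_start=0.7, w_end=0.1):
    # Progressively shift emphasis from process-oriented to outcome rewards.
    frac = min(step / max(total_steps, 1), 1.0)
    return w_start + (w_end - w_start) * frac

def total_reward(planning_quality, subgoal_completion, outcome, step, total_steps):
    """planning_quality, subgoal_completion, outcome are each assumed in [0, 1];
    the even split between the two process terms is an illustrative choice."""
    w = annealed_weight(step, total_steps)
    process = 0.5 * planning_quality + 0.5 * subgoal_completion
    return w * process + (1.0 - w) * outcome

for step in (0, 5000, 10000):
    print(step, round(total_reward(0.8, 0.6, 1.0, step, 10000), 3))
```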
Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Authors:Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun
Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.
Paper & Project Links
Summary
Video reasoning, which requires multi-step deduction across frames, remains challenging. Existing reinforcement learning methods improve reasoning but often rely on text-only chains that yield ungrounded or hallucinated conclusions, while frame-retrieval approaches introduce visual grounding but still localize evidence inaccurately. To address this, the paper proposes Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. The authors construct Conan-91K, a large-scale dataset of automatically generated reasoning traces, and design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Experiments show that Conan surpasses the Qwen2.5-VL-7B-Instruct baseline by an average of over 10% in accuracy across six multi-step reasoning benchmarks, achieving state-of-the-art performance, and generalizes effectively to long-video understanding tasks.
Key Takeaways
- Video reasoning requires multi-step deduction across frames and remains a major challenge.
- Existing reinforcement learning methods can produce ungrounded or hallucinated conclusions.
- The Conan framework performs evidence-grounded multi-step video reasoning by identifying contextual and evidence frames and reasoning over cross-frame clues.
- Conan-91K, a large-scale dataset, is constructed for training and evaluation.
- Conan combines a multi-stage progressive cold-start strategy with an AIR RLVR training framework to enhance multi-step visual reasoning.
- Conan's accuracy on multiple multi-step reasoning benchmarks surpasses existing models.
LM-mixup: Text Data Augmentation via Language Model based Mixup
Authors:Zhijie Deng, Zhouan Shen, Ling Li, Yao Zhou, Zhaowei Zhu, Yanji He, Wei Wang, Jiaheng Wei
Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, by first performing supervised fine-tuning on MIXTURE and then optimizing it with reinforcement learning. This process uses three complementary reward signals: quality, semantic alignment, and format compliance, via Group Relative Policy Optimization (GRPO). We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs.
Paper & Project Links
Summary
This paper formalizes the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality, coherent instruction-output pairs. The authors build a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset that pairs low-quality or semantically redundant instruction clusters with their high-quality distillations. They then introduce LM-Mixup, which first performs supervised fine-tuning on MIXTURE and then optimizes with reinforcement learning using three complementary reward signals, quality, semantic alignment, and format compliance, via Group Relative Policy Optimization (GRPO). Fine-tuning LLMs on the distilled data, which accounts for only about 3% of the full dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. The work shows that low-quality data, when properly distilled and augmented with LM-Mixup, is a valuable resource that significantly improves the efficiency and performance of instruction-tuned LLMs.
Key Takeaways
- Instruction distillation is an effective way to turn low-quality, redundant data into high-quality data.
- The MIXTURE dataset pairs low-quality instruction clusters with high-quality distillations.
- LM-Mixup combines supervised fine-tuning with reinforcement learning for the instruction distillation task.
- LM-Mixup optimizes with three reward signals: quality, semantic alignment, and format compliance (see the sketch below).
- Fine-tuning large language models on roughly 3% distilled data surpasses training on the full dataset.
- LM-Mixup is competitive with state-of-the-art high-quality data selection methods across multiple benchmarks.
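A minimal sketch of the reward side of LM-Mixup's RL stage, assuming the three signals are combined with hypothetical weights and that advantages are computed GRPO-style by standardizing rewards within the group of completions sampled for one prompt; the paper's actual reward models are not reproduced here.

```python
import statistics

def combined_reward(quality, alignment, format_ok, w=(0.4, 0.4, 0.2)):
    # Three complementary signals: quality, semantic alignment, format compliance.
    return w[0] * quality + w[1] * alignment + w[2] * (1.0 if format_ok else 0.0)

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each sample's reward against the
    group of completions drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled distillations scored by (assumed) reward models.
samples = [(0.9, 0.8, True), (0.6, 0.7, True), (0.4, 0.5, False), (0.8, 0.9, True)]
rewards = [combined_reward(*s) for s in samples]
print([round(a, 2) for a in group_relative_advantages(rewards)])
```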
Ask a Strong LLM Judge when Your Reward Model is Uncertain
Authors:Zhenghao Xu, Qin Lu, Qingru Zhang, Liang Qiu, Ilgee Hong, Changlong Yu, Wenlin Yao, Yao Liu, Haoming Jiang, Lihong Li, Hyokun Yun, Tuo Zhao
Reward model (RM) plays a pivotal role in reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs). However, classical RMs trained on human preferences are vulnerable to reward hacking and generalize poorly to out-of-distribution (OOD) inputs. By contrast, strong LLM judges equipped with reasoning capabilities demonstrate superior generalization, even without additional training, but incur significantly higher inference costs, limiting their applicability in online RLHF. In this work, we propose an uncertainty-based routing framework that efficiently complements a fast RM with a strong but costly LLM judge. Our approach formulates advantage estimation in policy gradient (PG) methods as pairwise preference classification, enabling principled uncertainty quantification to guide routing. Uncertain pairs are forwarded to the LLM judge, while confident ones are evaluated by the RM. Experiments on RM benchmarks demonstrate that our uncertainty-based routing strategy significantly outperforms random judge calling at the same cost, and downstream alignment results showcase its effectiveness in improving online RLHF.
Paper & Project Links
PDF NeurIPS 2025, 18 pages
Summary
Reward models (RMs) play a pivotal role in reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs). However, classical RMs trained on human preferences are vulnerable to reward hacking and generalize poorly to out-of-distribution inputs. Strong LLM judges with reasoning capabilities generalize better, even without additional training, but incur much higher inference costs, limiting their use in online RLHF. This work proposes an uncertainty-based routing framework that efficiently complements a fast RM with a strong but costly LLM judge. It formulates advantage estimation in policy-gradient methods as pairwise preference classification, enabling principled uncertainty quantification to guide routing: uncertain pairs are forwarded to the LLM judge, while confident ones are evaluated by the RM. Experiments on RM benchmarks show that the uncertainty-based routing strategy significantly outperforms random judge calling at the same cost, and downstream alignment results demonstrate its effectiveness in improving online RLHF.
Key Takeaways
- Reward models (RMs) play a key role in RLHF, especially for aligning large language models with human feedback.
- Classical RMs are vulnerable to reward hacking and generalize poorly to out-of-distribution inputs.
- Strong LLM judges with reasoning capabilities generalize better, but their high inference cost limits use in online RLHF.
- The proposed uncertainty-based routing framework combines the strengths of a fast RM and a strong LLM judge.
- Formulating advantage estimation as pairwise preference classification enables principled uncertainty quantification.
- Uncertain pairs are judged by the LLM judge, while confident ones are evaluated by the RM (see the routing sketch below).
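The routing idea can be sketched as follows, assuming the RM's reward gap is turned into a Bradley-Terry preference probability and a pair is escalated to the LLM judge when the binary entropy of that probability exceeds a threshold; the threshold value and the judge call below are placeholders.

```python
import math

def preference_prob(reward_a, reward_b):
    # Bradley-Terry style probability that response A is preferred over B.
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def binary_entropy(p):
    p = min(max(p, 1e-6), 1 - 1e-6)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def route_pair(reward_a, reward_b, llm_judge, entropy_threshold=0.6):
    """Use the cheap RM when it is confident; escalate uncertain pairs to the
    strong (and expensive) LLM judge. The threshold is an assumption."""
    p = preference_prob(reward_a, reward_b)
    if binary_entropy(p) > entropy_threshold:
        return llm_judge()          # uncertain: pay for the strong judge
    return "A" if p > 0.5 else "B"  # confident: trust the reward model

fake_judge = lambda: "A"  # stand-in for a reasoning LLM judge call
print(route_pair(2.1, -0.5, fake_judge))  # confident -> RM decides
print(route_pair(0.1, 0.0, fake_judge))   # uncertain -> judge decides
```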
Teaching Language Models to Reason with Tools
Authors:Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
Large reasoning models (LRMs) like OpenAI-o1 have shown impressive capabilities in natural language reasoning. However, these models frequently demonstrate inefficiencies or inaccuracies when tackling complex mathematical operations. While integrating computational tools such as Code Interpreters (CIs) offers a promising solution, it introduces a critical challenge: a conflict between the model’s internal, probabilistic reasoning and the external, deterministic knowledge provided by the CI, which often leads models to unproductive deliberation. To overcome this, we introduce CoRT (Code-Optimized Reasoning Training), a post-training framework designed to teach LRMs to effectively utilize CIs. We propose Hint-Engineering, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. This approach generates high-quality, code-integrated reasoning data specifically tailored to optimize LRM-CI interaction. Using this method, we have synthesized 30 high-quality samples to post-train models ranging from 1.5B to 32B parameters through supervised fine-tuning. CoRT further refines the multi-round interleaving of external CI usage and internal thinking by employing rejection sampling and reinforcement learning. Our experimental evaluations demonstrate CoRT’s effectiveness, yielding absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Moreover, CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model compared to pure natural language reasoning baselines. The models and code are available at: https://github.com/ChengpengLi1003/CoRT.
Paper & Project Links
PDF NIPS2025 Accepted
Summary
Large reasoning models such as OpenAI-o1 are strong at natural language reasoning but can be inefficient or inaccurate on complex mathematical operations. Integrating Code Interpreters (CIs) is a promising remedy, yet it introduces a conflict between the model's internal probabilistic reasoning and the external deterministic knowledge provided by the CI. To address this, the authors propose CoRT (Code-Optimized Reasoning Training), a post-training framework that teaches large reasoning models to use CIs effectively. A new data synthesis strategy, Hint-Engineering, strategically injects diverse hints at optimal points within reasoning paths, producing high-quality, code-integrated reasoning data tailored to LRM-CI interaction. Experimental evaluations show that CoRT is effective on multiple mathematical reasoning datasets and also improves efficiency. The models and code are publicly available on GitHub.
Key Takeaways
- Large reasoning models are strong at natural language reasoning but can be inefficient or inaccurate on mathematical operations.
- Integrating Code Interpreters addresses the math weaknesses of large reasoning models but creates a conflict between internal and external knowledge.
- The CoRT framework is introduced to optimize the interaction between large reasoning models and Code Interpreters.
- The Hint-Engineering data synthesis strategy injects diverse hints to generate data tailored to the model.
- CoRT significantly improves mathematical reasoning performance across several datasets.
- CoRT improves efficiency, reducing token usage during mathematical reasoning.
UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
Authors:Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong, Quyu Kong, Xu Zhang, Chen Liu, Yuqi Liu, Wenxuan Wang, Yue Wang, Qin Jin, Steven Hoi
GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a substantial 76% relative performance improvement. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released in https://github.com/alibaba/UI-Ins.
Paper & Project Links
Summary
This paper studies how instruction diversity affects GUI grounding performance and introduces the Instruction-as-Reasoning paradigm, which treats instructions as dynamic analytical pathways offering distinct perspectives. The approach uses a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. The resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks; UI-Ins-32B reaches 87.3% grounding accuracy on UI-I2E-Bench. The analysis also examines how reasoning can be formulated to enhance rather than hinder grounding performance and how the method mitigates policy collapse in the SFT+RL framework.
Key Takeaways
- Instruction diversity matters for GUI grounding and has a measurable impact on performance.
- Instructions in existing grounding datasets contain flaws, leading to wrong choices at inference time.
- The Instruction-as-Reasoning paradigm treats instructions as dynamic analytical pathways that give the model distinct perspectives.
- A two-stage training framework, supervised fine-tuning (SFT) followed by reinforcement learning (RL), strengthens multi-perspective reasoning and pathway selection.
- UI-Ins-7B and UI-Ins-32B achieve the best results on several benchmarks, with UI-Ins-32B reaching 87.3% grounding accuracy on UI-I2E-Bench.
- The models show strong agentic potential, achieving a 74.1% success rate on AndroidWorld.
SynTSBench: Rethinking Temporal Pattern Learning in Deep Learning Models for Time Series
Authors:Qitai Tan, Yiyun Chen, Mo Li, Ruiwen Gu, Yilin Su, Xiao-Ping Zhang
Recent advances in deep learning have driven rapid progress in time series forecasting, yet many state-of-the-art models continue to struggle with robust performance in real-world applications, even when they achieve strong results on standard benchmark datasets. This persistent gap can be attributed to the black-box nature of deep learning architectures and the inherent limitations of current evaluation frameworks, which frequently lack the capacity to provide clear, quantitative insights into the specific strengths and weaknesses of different models, thereby complicating the selection of appropriate models for particular forecasting scenarios. To address these issues, we propose a synthetic data-driven evaluation paradigm, SynTSBench, that systematically assesses fundamental modeling capabilities of time series forecasting models through programmable feature configuration. Our framework isolates confounding factors and establishes an interpretable evaluation system with three core analytical dimensions: (1) temporal feature decomposition and capability mapping, which enables systematic evaluation of model capacities to learn specific pattern types; (2) robustness analysis under data irregularities, which quantifies noise tolerance thresholds and anomaly recovery capabilities; and (3) theoretical optimum benchmarking, which establishes performance boundaries for each pattern type, enabling direct comparison between model predictions and mathematical optima. Our experiments show that current deep learning models do not universally approach optimal baselines across all types of temporal features. The code is available at https://github.com/TanQitai/SynTSBench
Paper & Project Links
PDF NeurIPS 2025
Summary
Despite recent deep learning advances, many state-of-the-art time series forecasting models still struggle with robust performance in real-world applications. To address this, the authors propose SynTSBench, a synthetic data-driven evaluation paradigm that systematically assesses the fundamental modeling capabilities of forecasting models through programmable feature configuration. The framework covers temporal feature decomposition and capability mapping, robustness analysis under data irregularities, and theoretical optimum benchmarking. Experiments show that current deep learning models do not universally approach the optimal baselines across all types of temporal features.
Key Takeaways
- Deep learning has advanced time series forecasting, but a robustness gap remains in real-world applications.
- Existing evaluation frameworks struggle to provide clear, quantitative insights into model strengths and weaknesses.
- SynTSBench is a synthetic data-driven evaluation paradigm for systematically assessing time series forecasting models.
- SynTSBench has three core analytical dimensions: temporal feature decomposition and capability mapping, robustness analysis under data irregularities, and theoretical optimum benchmarking.
- Experiments show that current deep learning models do not reach optimal performance across all types of temporal features.
- By isolating confounding factors and establishing an interpretable evaluation system, SynTSBench simplifies model selection (a toy data generator is sketched below).
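A toy example of the programmable-feature idea, not the benchmark's actual generator: a series is composed from a configurable trend, seasonality, and noise component, and the noise-free continuation serves as an idealized reference against which a forecaster's error can be compared.

```python
import math
import random

def make_series(n, trend=0.05, season_period=24, season_amp=2.0, noise_std=0.5, seed=0):
    """Programmable synthetic series: linear trend + sinusoidal seasonality + noise.
    Returns (observed, noise_free) so the noise-free part can act as an
    idealized reference for the forecastable signal."""
    rng = random.Random(seed)
    noise_free = [trend * t + season_amp * math.sin(2 * math.pi * t / season_period)
                  for t in range(n)]
    observed = [x + rng.gauss(0.0, noise_std) for x in noise_free]
    return observed, noise_free

def mae(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

observed, signal = make_series(240)
history, future_signal = observed[:216], signal[216:]
naive_forecast = [history[-1]] * 24   # last-value baseline
optimal_forecast = future_signal      # idealized reference: the clean signal itself
print("naive MAE vs signal:", round(mae(naive_forecast, future_signal), 3))
print("optimal MAE vs signal:", round(mae(optimal_forecast, future_signal), 3))
```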
Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs
Authors:Tristan Cinquin, Geoff Pleiss, Agustinus Kristiadi
While chain-of-thought prompting with Best-of-N (BoN) selection has become popular for mathematical reasoning in large language models (LLMs), its linear structure fails to capture the branching and exploratory nature of complex problem-solving. In this work, we propose an adaptive algorithm to maximize process reward model (PRM) scores over the intractable action space, and investigate whether PRM-guided tree search can improve mathematical reasoning by exploring multiple partial solution paths. Across 23 diverse mathematical problems using Qwen2.5-Math-7B-Instruct with its associated PRM as a case study, we find that: (1) PRM-guided tree search shows no statistically significant improvements over BoN despite higher costs, (2) Monte Carlo tree search and beam search outperform other PRM-guided tree search methods, (3) PRMs poorly approximate state values and their reliability degrades with reasoning depth, and (4) PRMs generalize poorly out of distribution. This underperformance stems from tree search’s greater reliance on unreliable PRM scores, suggesting different reward modeling is necessary before tree search can effectively enhance mathematical reasoning in LLMs.
Paper & Project Links
Summary
This paper examines the limits of chain-of-thought prompting with Best-of-N (BoN) selection for mathematical reasoning in large language models and proposes an adaptive algorithm to maximize process reward model (PRM) scores, investigating whether PRM-guided tree search can improve mathematical reasoning by exploring multiple partial solution paths. Across diverse mathematical problems, PRM-guided tree search shows no statistically significant improvement over BoN despite higher costs; PRMs approximate state values poorly, become less reliable with reasoning depth, and generalize poorly out of distribution. Because tree search relies more heavily on unreliable PRM scores, different reward modeling appears necessary before tree search can effectively enhance mathematical reasoning in LLMs.
Key Takeaways
- Chain-of-thought prompting with Best-of-N selection has a linear structure that fails to capture the branching, exploratory nature of complex problem solving.
- An adaptive algorithm based on process reward models (PRMs) is proposed to explore multiple partial solution paths.
- PRM-guided tree search does not significantly improve mathematical reasoning over BoN; among tree-search variants, Monte Carlo tree search and beam search outperform the others.
- PRMs approximate state values poorly, and their reliability degrades with reasoning depth.
- PRMs generalize poorly to out-of-distribution data.
- Tree search relies more heavily on unreliable PRM scores, which likely explains its underperformance (BoN and PRM-guided beam search are contrasted in the sketch below).
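To contrast the two strategies discussed above, here is a minimal sketch of Best-of-N selection versus a PRM-guided beam search over partial reasoning steps; the candidate generator and the PRM scorer are toy stand-ins, not the models used in the paper.

```python
def best_of_n(candidates, prm_score):
    """BoN: score each complete chain of thought with the PRM and keep the best."""
    return max(candidates, key=prm_score)

def beam_search(expand, prm_score, start, beam_width=2, depth=3):
    """PRM-guided beam search over partial reasoning steps.
    expand(state) -> list of next states; prm_score(state) -> float."""
    beam = [start]
    for _ in range(depth):
        frontier = [nxt for state in beam for nxt in expand(state)]
        if not frontier:
            break
        beam = sorted(frontier, key=prm_score, reverse=True)[:beam_width]
    return max(beam, key=prm_score)

# Toy setting: states are strings of reasoning steps; the "PRM" prefers chains
# containing the token "correct" (a stand-in for learned step scores).
expand = lambda s: [s + " step", s + " correct"]
prm_score = lambda s: s.count("correct") - 0.1 * s.count("step")
print(best_of_n(["a correct", "a step step"], prm_score))
print(beam_search(expand, prm_score, start="q:", beam_width=2, depth=3))
```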
Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values
Authors:Dian Yu, Yulai Zhao, Kishan Panaganti, Linfeng Song, Haitao Mi, Dong Yu
We propose Reinforcement Learning with Explicit Human Values (RLEV), a method that aligns Large Language Model (LLM) optimization directly with quantifiable human value signals. While Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains models in objective domains using binary correctness rewards, it overlooks that not all tasks are equally significant. RLEV extends this framework by incorporating human-defined value signals directly into the reward function. Using exam-style data with explicit ground-truth value labels, RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales. Crucially, RLEV policies not only improve value-weighted accuracy but also learn a value-sensitive termination policy: concise for low-value prompts, thorough for high-value ones. We demonstrate this behavior stems from value-weighted gradient amplification on end-of-sequence tokens. Ablation studies confirm the gain is causally linked to value alignment. RLEV remains robust under noisy value signals, such as difficulty-based labels, demonstrating that optimizing for an explicit utility function offers a practical path to aligning LLMs with human priorities.
Paper & Project Links
PDF 15 pages, 4 figures
Summary
Reinforcement Learning with Explicit Human Values (RLEV) aligns large language model (LLM) optimization directly with quantifiable human value signals. Whereas Reinforcement Learning with Verifiable Rewards (RLVR) trains models in objective domains with binary correctness rewards, it overlooks that not all tasks are equally significant; RLEV extends the framework by incorporating human-defined value signals directly into the reward function. Using exam-style data with explicit ground-truth value labels, RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales. RLEV policies not only improve value-weighted accuracy but also learn a value-sensitive termination policy: concise for low-value prompts, thorough for high-value ones, a behavior rooted in value-weighted gradient amplification on end-of-sequence tokens. Ablations confirm the gain is causally linked to value alignment, and RLEV remains robust under noisy value signals such as difficulty-based labels, suggesting that optimizing an explicit utility function is a practical path to aligning LLMs with human priorities.
Key Takeaways
- RLEV aligns large language model optimization with quantifiable human value signals.
- RLEV extends the RLVR framework by incorporating human-defined value signals into the reward function.
- RLEV outperforms correctness-only baselines across multiple RL algorithms and model scales.
- RLEV policies learn a value-sensitive termination policy, adjusting response length to the value of the prompt.
- Value-weighted gradient amplification is most visible on end-of-sequence tokens (see the reward sketch below).
- Ablation studies confirm that the gains stem from value alignment.
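The core change over correctness-only RLVR is simple enough to sketch: the verifiable binary reward is scaled by the question's explicit human value, which in a REINFORCE-style update amplifies the gradient contribution of high-value questions. The policy-gradient term below is a toy illustration, not the paper's training code.

```python
def rlvr_reward(correct: bool) -> float:
    # Correctness-only reward used by RLVR.
    return 1.0 if correct else 0.0

def rlev_reward(correct: bool, value: float) -> float:
    # RLEV: scale the verifiable reward by the question's explicit human value.
    return value * rlvr_reward(correct)

def policy_gradient_term(logprob_sum: float, reward: float, baseline: float = 0.0):
    # REINFORCE-style term; higher-value questions amplify the gradient signal.
    return (reward - baseline) * logprob_sum

examples = [
    {"correct": True, "value": 5.0, "logprob_sum": -12.3},  # high-stakes question
    {"correct": True, "value": 1.0, "logprob_sum": -11.8},  # low-stakes question
]
for ex in examples:
    r = rlev_reward(ex["correct"], ex["value"])
    print(r, round(policy_gradient_term(ex["logprob_sum"], r), 2))
```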
Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning
Authors:Yaochen Zhu, Harald Steck, Dawen Liang, Yinhan He, Jundong Li, Nathan Kallus
Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit instead of token (too fine-grained) or sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://github.com/yaochenzhu/Rank-GRPO.
Paper & Project Links
Summary
Large language models (LLMs) are reshaping the recommender system paradigm by letting users express preferences and receive recommendations through conversation. Aligning LLMs with the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. ConvRec-R1 is a two-stage framework for end-to-end training of LLM-based conversational recommender systems. Stage 1 builds a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, producing high-quality, catalog-grounded demonstrations from powerful black-box LLMs to warm-start RL training. Stage 2 proposes Rank-GRPO, a principled extension of group relative policy optimization tailored to rank-style outputs: it treats each rank in the recommendation list as the unit, rather than a token (too fine-grained) or the whole sequence (too coarse), redefines rewards to remove non-causal credit assignment, and introduces a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. On the public Reddit-v2 dataset, ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines.
Key Takeaways
- Large language models (LLMs) are driving a shift toward conversational recommender systems.
- The main challenge for conversational recommendation is aligning LLMs with the recommendation task.
- ConvRec-R1 is a two-stage framework that addresses this alignment problem.
- Stage 1 warms up RL training with a behavioral-cloning dataset built through a Remap-Reflect-Adjust pipeline that produces high-quality, catalog-grounded demonstrations.
- Stage 2 introduces Rank-GRPO, an optimization strategy tailored to rank-style outputs.
- Rank-GRPO treats each rank in the recommendation list as its own unit and introduces a rank-level importance ratio to stabilize policy updates (see the sketch below).
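The rank-level importance ratio can be sketched directly from its description: for each rank (one recommended item), take the geometric mean of the new-to-old token probability ratios over that rank's tokens, then plug the ratio into a clipped GRPO-style objective. The token probabilities and advantages below are toy values, not outputs of a real policy.

```python
import math

def rank_importance_ratio(new_token_probs, old_token_probs):
    """Geometric mean of token-level ratios new/old for the tokens that make up
    one rank (one item) in the recommendation list."""
    assert len(new_token_probs) == len(old_token_probs) > 0
    log_ratio = sum(math.log(n) - math.log(o)
                    for n, o in zip(new_token_probs, old_token_probs))
    return math.exp(log_ratio / len(new_token_probs))

def rank_grpo_objective(ratios, advantages, clip_eps=0.2):
    # PPO/GRPO-style clipped surrogate, applied per rank instead of per token.
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped = max(min(rho, 1 + clip_eps), 1 - clip_eps)
        terms.append(min(rho * adv, clipped * adv))
    return sum(terms) / len(terms)

# Two ranks, each emitted as a few tokens under the old and new policy.
new_probs = [[0.30, 0.50, 0.40], [0.10, 0.20]]
old_probs = [[0.25, 0.45, 0.35], [0.12, 0.25]]
ratios = [rank_importance_ratio(n, o) for n, o in zip(new_probs, old_probs)]
advantages = [0.8, -0.3]  # rank-level advantages (toy values)
print([round(r, 3) for r in ratios], round(rank_grpo_objective(ratios, advantages), 3))
```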
Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning
Authors:Yajie Li, Albert Galimov, Mitra Datta Ganapaneni, Pujitha Thejaswi, De Meng, Priyanshu Kumar, Saloni Potdar
Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals (both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g. ReFinED) and more expensive targeted LLM-based reasoning respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being twice as efficient in terms of the number of LLM tokens.
Paper & Project Links
Summary
Traditional entity linking (EL) relies on large annotated datasets and extensive fine-tuning, while recent few-shot methods that prompt large language models (LLMs) reduce training requirements but suffer from expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) is a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. It computes a small set of complementary signals (both embedding-based and LLM-based) over retrieved candidates to classify contextual mentions as easy or hard, then handles easy cases with a low-cost entity linker such as ReFinED and hard cases with more expensive targeted LLM-based reasoning. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on five of six datasets, and performs comparably to pipelines that apply LLM reasoning to all mentions while being twice as efficient in terms of LLM tokens. In short, ARTER offers a new, more efficient way to approach entity linking.
Key Takeaways
- ARTER is a new entity linking method that addresses the shortcomings of both traditional and recent few-shot approaches.
- ARTER links entities efficiently by combining candidate generation, context-based scoring, adaptive routing, and selective reasoning.
- ARTER uses a small set of complementary signals to classify mentions as easy or hard, routing each to the appropriate linker and improving computational efficiency (see the routing sketch below).
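A minimal sketch of the adaptive routing step, assuming the complementary signals are already normalized to [0, 1] and combined with hypothetical weights into a confidence score; easy mentions go to a cheap linker and hard ones to targeted LLM reasoning. The threshold and both backends are placeholders, not ARTER's actual components.

```python
def mention_confidence(signals, weights=(0.6, 0.4)):
    # Complementary signals over retrieved candidates, e.g. an embedding
    # similarity margin and an LLM-based plausibility score (both in [0, 1]).
    return sum(w * s for w, s in zip(weights, signals))

def link_entity(mention, candidates, signals, cheap_linker, llm_reasoner, tau=0.55):
    """Route easy mentions to the low-cost linker (a ReFinED-style model)
    and hard ones to targeted LLM reasoning; tau is an illustrative threshold."""
    if mention_confidence(signals) >= tau:
        return cheap_linker(mention, candidates)
    return llm_reasoner(mention, candidates)

cheap_linker = lambda m, cands: cands[0]                      # highest-ranked candidate
llm_reasoner = lambda m, cands: f"<LLM picks among {cands}>"  # expensive path
print(link_entity("Paris", ["Paris (city)", "Paris Hilton"], (0.9, 0.8), cheap_linker, llm_reasoner))
print(link_entity("Jordan", ["Michael Jordan", "Jordan (country)"], (0.4, 0.3), cheap_linker, llm_reasoner))
```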
CreativityPrism: A Holistic Benchmark for Large Language Model Creativity
Authors:Zhaoyi Joey Hou, Bowei Alvin Zhang, Yining Lu, Bhiman Kumar Baghel, Anneliese Brei, Ximing Lu, Meng Jiang, Faeze Brahman, Snigdha Chaturvedi, Haw-Shiuan Chang, Daniel Khashabi, Xiang Lorraine Li
Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks, three domains, i.e., divergent thinking, creative writing, and logical reasoning, and twenty evaluation metrics, which measure each dimension in task-specific, unique ways. We evaluate 17 state-of-the-art (SoTA) proprietary and open-sourced LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. Our results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among evaluation dimensions, diversity and quality metrics show strong correlations - models that perform well on one often excel on the other - whereas novelty exhibits much weaker correlation with either. These findings support our hypothesis that strong performance in one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.
Paper & Project Links
Summary
This paper proposes CreativityPrism, an evaluation analysis framework for assessing the creativity of large language models (LLMs). The framework decomposes creativity into three dimensions, quality, novelty, and diversity, and spans nine tasks across three domains: divergent thinking, creative writing, and logical reasoning. Evaluating 17 state-of-the-art proprietary and open-source LLMs on CreativityPrism reveals performance differences across tasks and domains and underscores the need for a holistic evaluation of LLM creativity.
Key Takeaways
- Existing methods for evaluating LLM creativity are fragmented and lack a unified framework.
- Creativity is not a single fixed idea and requires multi-dimensional evaluation.
- The CreativityPrism framework covers three dimensions: quality, novelty, and diversity.
- The framework spans nine tasks across three domains: divergent thinking, creative writing, and logical reasoning.
- Evaluation of 17 state-of-the-art LLMs shows that performance differs notably across tasks and domains.
- There is a notable gap between proprietary and open-source models.