⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on them for serious academic work; they are only meant for pre-screening papers before reading!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-16
Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling
Authors:Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, Jinguo Zhu
Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal reasoning benchmarks - the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
Paper and Project Links
PDF Accepted to NeurIPS 2025 (The Thirty-Ninth Annual Conference on Neural Information Processing Systems)
Summary
Outcome-reward reinforcement learning (RL) is a common and increasingly important way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting, the paradigm faces a significant but often overlooked problem: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning. To address this, the authors propose Self-Consistency Sampling (SCS). For each question, SCS (i) introduces small visual perturbations and (ii) repeatedly truncates and resamples an initial trajectory; agreement among the resulting trajectories yields a consistency score that down-weights unreliable traces during policy updates. Built on Qwen2.5-VL-7B-Instruct, plugging SCS into the RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
Key Takeaways
- Outcome-reward reinforcement learning (RL) plays an important role in the step-by-step reasoning of multimodal large language models (MLLMs).
- In the multiple-choice setting, RL faces an often overlooked problem: a faulty reasoning path can still earn the reward for guessing the correct answer.
- Self-Consistency Sampling (SCS) is proposed to address this problem.
- SCS introduces visual perturbations and repeated trajectory resampling, computing a consistency score that separates reliable from unreliable trajectories (see the sketch below).
- SCS markedly improves accuracy on several multimodal benchmarks at low extra computational cost.
- SCS transfers well across several MLLMs, providing a general-purpose remedy.
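To make the down-weighting idea concrete, here is a minimal Python sketch of an SCS-style consistency weight. It is an illustration only: `consistency_score`, `weighted_reward`, and the toy resampler are hypothetical names, the paper's actual score is differentiable and also uses visual perturbations, and none of this is the authors' released code.

```python
import random

def consistency_score(initial_answer, resampled_answers):
    # Empirical agreement rate between the initial trajectory's final
    # answer and the answers from truncated-and-resampled trajectories.
    if not resampled_answers:
        return 1.0
    agree = sum(a == initial_answer for a in resampled_answers)
    return agree / len(resampled_answers)

def weighted_reward(outcome_reward, score, floor=0.1):
    # Down-weight the outcome reward of low-consistency traces before
    # the policy update (RLOO/GRPO/REINFORCE++ all consume a scalar).
    return outcome_reward * max(score, floor)

# Toy stand-in for resampling: a 'model' that guesses option B with
# probability `bias` and otherwise guesses another option.
def toy_resample(bias=0.6, n=8):
    return ["B" if random.random() < bias else random.choice("ACD")
            for _ in range(n)]

random.seed(0)
answers = toy_resample()
s = consistency_score("B", answers)
print(f"consistency={s:.2f}, weighted reward={weighted_reward(1.0, s):.2f}")
```

A trajectory that reaches the right option through an unstable chain of thought disagrees with its own resampled continuations, so its reward, and hence its advantage in the policy update, shrinks.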
Flexible Simulation Based Inference for Galaxy Photometric Fitting with Synthesizer
Authors:Thomas Harvey, Christopher C. Lovell, Sophie Newman, Christopher J. Conselice, Duncan Austin, Aswin P. Vijayan, Stephen M. Wilkins, Vadim Rusakov, Qiong Li, Nathan Adams, Kai Magdwick, Matthew Ho
We introduce Synference, a new, flexible Python framework for galaxy SED fitting using simulation-based inference (SBI). Synference leverages the Synthesizer package for flexible forward-modelling of galaxy SEDs and integrates the LtU-ILI package to ensure best practices in model training and validation. In this work we demonstrate Synference by training a neural posterior estimator on $10^6$ simulated galaxies, based on a flexible 8-parameter physical model, to infer galaxy properties from 14-band HST and JWST photometry. We validate this model, demonstrating excellent parameter recovery (e.g. R$^2>$0.99 for M$_\star$) and accurate posterior calibration against nested sampling results. We apply our trained model to 3,088 spectroscopically-confirmed galaxies in the JADES GOODS-South field. The amortized inference is exceptionally fast, having nearly fixed cost per posterior evaluation and processing the entire sample in $\sim$3 minutes on a single CPU (18 galaxies/CPU/sec), a $\sim$1700$\times$ speedup over traditional nested sampling or MCMC techniques. We demonstrate Synference’s ability to simultaneously infer photometric redshifts and physical parameters, and highlight its utility for rapid Bayesian model comparison by demonstrating systematic stellar mass differences between two commonly used stellar population synthesis models. Synference is a powerful, scalable tool poised to maximise the scientific return of next-generation galaxy surveys.
Paper and Project Links
PDF 23 pages, 12 figures. Submitted to MNRAS. The Synference package is available at https://github.com/synthesizer-project/synference/
Summary
Synference is a new, flexible Python framework for galaxy SED fitting with simulation-based inference (SBI). It uses the Synthesizer package for flexible forward modelling of galaxy SEDs and integrates the LtU-ILI package to ensure best practices in model training and validation. Trained and validated on simulated galaxies, the model shows excellent parameter recovery and accurate posterior calibration against nested-sampling results. Because the inference is amortized, posterior evaluation has a nearly fixed cost and is roughly 1700x faster than traditional nested sampling or MCMC, processing large samples in minutes on a single CPU. Synference can also infer photometric redshifts and physical parameters simultaneously and supports rapid Bayesian model comparison, for example exposing systematic stellar-mass differences between two commonly used stellar population synthesis models. It is a powerful, scalable tool for next-generation galaxy surveys.
Key Takeaways
- Synference is a Python framework for flexible galaxy SED fitting based on simulation-based inference (SBI).
- It uses the Synthesizer package for forward modelling and integrates LtU-ILI for best-practice training and validation.
- A neural posterior estimator trained on simulated data validates the model's strong performance (a generic SBI sketch follows this list).
- The model shows excellent parameter recovery (e.g. R² > 0.99 for M⋆).
- Amortized posterior evaluation is highly efficient, with large speedups on big samples.
- It can infer photometric redshifts and physical parameters at the same time.
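For readers unfamiliar with amortized SBI, the sketch below trains a neural posterior estimator with the general-purpose `sbi` package on a toy 14-band "photometry" simulator. This is not Synference or Synthesizer code, just the standard SBI pattern the paper builds on; the simulator, parameter count, and sample sizes are invented, and the `SNPE` class name may differ across `sbi` versions.

```python
import torch
from sbi.inference import SNPE
from sbi.utils import BoxUniform

# Toy stand-in for a Synthesizer-style forward model: maps 2 physical
# parameters to 14 'photometric bands' with a little noise.
def simulator(theta):
    grid = torch.linspace(0.0, 1.0, 14)
    return (theta[:, :1] * torch.sin(6.28 * grid)
            + theta[:, 1:2] * grid
            + 0.01 * torch.randn(theta.shape[0], 14))

prior = BoxUniform(low=torch.zeros(2), high=torch.ones(2))
theta = prior.sample((5000,))
x = simulator(theta)

# Train a neural posterior estimator once...
inference = SNPE(prior=prior)
density_estimator = inference.append_simulations(theta, x).train()
posterior = inference.build_posterior(density_estimator)

# ...then amortized inference on any new observation is nearly free.
x_obs = simulator(torch.tensor([[0.3, 0.7]]))
samples = posterior.sample((1000,), x=x_obs)
print(samples.mean(dim=0))  # posterior mean of the 2 parameters
```

The one-off training cost is amortized over every subsequent galaxy, which is what makes throughput like the abstract's ~18 galaxies/CPU/sec possible.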
Instella: Fully Open Language Models with Stellar Performance
Authors:Jiang Liu, Jialian Wu, Xiaodong Yu, Yusheng Su, Prakamya Mishra, Gowtham Ramesh, Sudhanshu Ranjan, Chaitanya Manem, Ximeng Sun, Ze Wang, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.
Paper and Project Links
Summary
Instella is a family of fully open three-billion-parameter language models trained entirely on openly available data and code. It is developed on AMD Instinct MI300X GPUs through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. Two specialized variants are also released: Instella-Long, which handles context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions make Instella a transparent, performant, and versatile alternative that advances open and reproducible language-modeling research.
Key Takeaways
- Instella is a family of fully open 3B-parameter language models trained entirely on openly available data and code.
- The models are built through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences.
- Despite using fewer pre-training tokens, Instella achieves state-of-the-art results among fully open models.
- Instella is competitive with open-weight models of comparable size.
- Instella-Long is a long-context variant handling up to 128K tokens.
- Instella-Math is a reasoning-focused variant for mathematical tasks.
SSR: Socratic Self-Refine for Large Language Model Reasoning
Authors:Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang, Silvio Savarese, Caiming Xiong, Yingbo Zhou, Semih Yavuz
Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
Paper and Project Links
PDF Preprint; work in progress
Summary
Large language models (LLMs) show remarkable reasoning ability, but existing test-time frameworks rely on coarse self-verification and self-correction, which limits their effectiveness on complex tasks. Socratic Self-Refine (SSR) is a new framework for fine-grained evaluation and precise refinement of LLM reasoning. SSR decomposes a model response into verifiable (sub-question, sub-answer) pairs and estimates step-level confidence through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and more interpretable reasoning chains. Across five reasoning benchmarks and three LLMs, SSR consistently outperforms state-of-the-art iterative self-refinement baselines, and it additionally offers a principled black-box way to evaluate and understand the internal reasoning of LLMs.
Key Takeaways
- LLMs show strong reasoning ability, but existing test-time frameworks are of limited use on complex tasks.
- Socratic Self-Refine (SSR) aims at fine-grained evaluation and refinement of LLM reasoning.
- SSR decomposes a model response into verifiable (sub-question, sub-answer) pairs and estimates step-level confidence (see the sketch below).
- SSR locates and refines unreliable reasoning steps, producing more accurate and interpretable chains.
- SSR outperforms existing iterative self-refinement methods on several reasoning benchmarks.
- SSR provides a black-box way to evaluate and understand LLMs' internal reasoning.
- The SSR code is publicly available.
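The skeleton below sketches the re-solve-and-vote loop in Python. Everything here is illustrative: `resolve_fn` stands in for an LLM call, the confidence rule is plain answer agreement, and the toy arithmetic "model" exists only so the snippet runs; none of it is the SalesforceAIResearch implementation.

```python
import random
from collections import Counter

def step_confidence(resolve_fn, sub_question, sub_answer, k=5):
    # Re-solve a sub-question k times; agreement with the recorded
    # sub-answer serves as a step-level confidence estimate.
    votes = [resolve_fn(sub_question) for _ in range(k)]
    return sum(v == sub_answer for v in votes) / k

def socratic_self_refine(steps, resolve_fn, threshold=0.6, max_iters=3):
    # steps: list of (sub_question, sub_answer) pairs from a decomposed
    # response. Refine the least confident step until all pass.
    for _ in range(max_iters):
        scored = [(step_confidence(resolve_fn, q, a), i)
                  for i, (q, a) in enumerate(steps)]
        conf, worst = min(scored)
        if conf >= threshold:
            break
        q, _ = steps[worst]
        # Replace the unreliable step with the majority re-solved answer.
        majority, _ = Counter(resolve_fn(q) for _ in range(5)).most_common(1)[0]
        steps[worst] = (q, majority)
    return steps

# Toy resolver: a noisy arithmetic 'model' that is right 80% of the time.
def toy_resolve(q):
    return eval(q) + (0 if random.random() < 0.8 else 1)

random.seed(1)
print(socratic_self_refine([("2+2", 5), ("3*3", 9)], toy_resolve))
```

The wrong sub-answer (2+2 = 5) disagrees with its own re-solves, gets the lowest confidence, and is replaced by the majority vote.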
Evaluating Prompting Strategies with MedGemma for Medical Order Extraction
Authors:Abhinand Balachandran, Bavana Durgapraveen, Gowsikkan Sikkan Sudhagar, Vidhya Varshany J S, Sriram Rajkumar
The accurate extraction of medical orders from doctor-patient conversations is a critical task for reducing clinical documentation burdens and ensuring patient safety. This paper details our team submission to the MEDIQA-OE-2025 Shared Task. We investigate the performance of MedGemma, a new domain-specific open-source language model, for structured order extraction. We systematically evaluate three distinct prompting paradigms: a straightforward one-Shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. Our experiments reveal that while more complex frameworks like ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. We posit that on manually annotated transcripts, complex reasoning chains can lead to “overthinking” and introduce noise, making a direct approach more robust and efficient. Our work provides valuable insights into selecting appropriate prompting strategies for clinical information extraction in varied data conditions.
Paper and Project Links
PDF 7 pages, 2 figures
Summary
This paper describes the team's submission to the MEDIQA-OE-2025 shared task. The team investigates MedGemma, a new domain-specific open-source language model, for structured medical order extraction, and systematically evaluates three prompting paradigms: a straightforward one-shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. The experiments show that although more complex frameworks such as ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. The authors argue that on manually annotated transcripts, complex reasoning chains can lead to "overthinking" and introduce noise, making a direct approach more robust and efficient. The work offers useful guidance for choosing prompting strategies for clinical information extraction under varied data conditions.
Key Takeaways
- Accurately extracting medical orders reduces clinical documentation burden and protects patient safety.
- MedGemma, a new domain-specific open-source language model, is used for structured order extraction.
- Three prompting paradigms are compared: one-shot, ReAct, and a multi-step agentic workflow (a one-shot template sketch follows this list).
- The simple one-shot prompting method achieved the best performance on the official validation set.
- Complex reasoning chains can cause "overthinking" and introduce noise, hurting extraction accuracy.
- The direct approach proves more robust and efficient.
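As a concrete picture of the winning paradigm, here is a hypothetical one-shot prompt for structured order extraction. The field names and the worked example are invented for illustration; this is not the team's actual MEDIQA-OE-2025 prompt.

```python
# A hypothetical one-shot prompt template for medical order extraction.
ONE_SHOT_PROMPT = """You extract medical orders from doctor-patient
conversations. Return a JSON list; each order has the fields
"order_type", "description", and "reason".

Example transcript:
  Doctor: I'll order a chest X-ray today to rule out pneumonia.
Example output:
  [{"order_type": "imaging",
    "description": "chest X-ray",
    "reason": "rule out pneumonia"}]

Transcript:
<TRANSCRIPT>
Output:
"""

def build_prompt(transcript: str) -> str:
    # str.replace avoids escaping the JSON braces in the template.
    return ONE_SHOT_PROMPT.replace("<TRANSCRIPT>", transcript)

print(build_prompt("Doctor: Let's start amoxicillin 500 mg twice daily."))
```

The appeal of the one-shot format matches what the paper reports: a single worked example pins down the output schema without inviting the long reasoning chains that caused "overthinking" in the ReAct and agentic variants.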
LOCA-R: Near-Perfect Performance on the Chinese Physics Olympiad 2025
Authors:Dong-Shan Jian, Xiang Li, Chen-Xu Yan, Hui-Wen Zheng, Zhi-Zhang Bian, You-Le Fang, Sheng-Qi Zhang, Bing-Rui Gong, Ren-Xi He, Jing-Tian Zhang, Ce Meng, Yan-Qing Ma
Olympiad-level physics problem-solving presents a significant challenge for both humans and artificial intelligence (AI), as it requires a sophisticated integration of precise calculation, abstract reasoning, and a fundamental grasp of physical principles. The Chinese Physics Olympiad (CPhO), renowned for its complexity and depth, serves as an ideal and rigorous testbed for these advanced capabilities. In this paper, we introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning, and apply it to the CPhO 2025 theory examination. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor and significantly outperforming all baseline methods.
Paper and Project Links
PDF 19 pages, 3 figures
Summary
Olympiad-level physics problem solving challenges both humans and AI, since it demands precise calculation, abstract reasoning, and a firm grasp of physical principles. The Chinese Physics Olympiad (CPhO), known for its complexity and depth, is an ideal and rigorous testbed for these abilities. The paper introduces LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning, and applies it to the CPhO 2025 theory examination, where it achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor and all baseline methods.
Key Takeaways
- Olympiad-level physics problem solving challenges both AI and humans, requiring precise calculation, abstract reasoning, and a deep grasp of physical principles.
- The Chinese Physics Olympiad is an ideal, rigorous testbed for these advanced abilities.
- LOCA-R is an improved version of the LOCA framework adapted for complex reasoning.
- LOCA-R achieves a near-perfect 313 out of 320 points on the CPhO 2025 theory examination.
- LOCA-R solidly surpasses the highest-scoring human competitor.
- LOCA-R significantly outperforms all baseline methods.
Bayesian model comparison and validation with Gaussian Process Regression for interferometric 21-cm signal recovery
Authors:Yuchen Liu, Eloy de Lera Acedo, Peter Sims
The 21-cm signal from neutral hydrogen is anticipated to reveal critical insights into the formation of early cosmic structures during the Cosmic Dawn and the subsequent Epoch of Reionization. However, the intrinsic faintness of the signal, as opposed to astrophysical foregrounds, poses a formidable challenge for its detection. Motivated by the recent success of machine learning based Gaussian Process Regression (GPR) methods in LOFAR and NenuFAR observations, we perform a Bayesian comparison among five GPR models to account for the simulated 4-hour tracking observations with the SKA-Low telescope. The simulated sky is convolved with the instrumental beam response and includes realistic radio sources and thermal noise from 122 to 134 MHz. A Bayesian model evaluation framework is applied to five GPR models to discern the most effective modelling strategy and determine the optimal model parameters. The GPR model with wedge parametrization ($\textit{Wedge}$) and its extension ($\alpha\textit{Noise}$) with noise scaling achieve the highest Bayesian evidence of the observed data and the least biased 21-cm power spectrum recovery. The $\alpha\textit{Noise}$ and $\textit{Wedge}$ models also forecast the best local power-spectrum recovery, demonstrating fractional differences of $-0.14\%$ and $0.47\%$ respectively, compared to the injected 21-cm power at $k = 0.32\ \mathrm{h\ cMpc}^{-1}$. We additionally perform Bayesian null tests to validate the five models, finding that the two optimal models pass, while the remaining three yield spurious detections in data containing no 21-cm signal.
Paper and Project Links
PDF 25 pages, 17 figures
Summary
This work targets the detection of the 21-cm signal from neutral hydrogen, a key probe of early structure formation during the Cosmic Dawn and the Epoch of Reionization; the signal is extremely faint compared with astrophysical foregrounds. Motivated by the success of machine-learning-based Gaussian Process Regression (GPR) in LOFAR and NenuFAR observations, the authors perform a Bayesian comparison of five GPR models on simulated 4-hour SKA-Low tracking observations that include realistic radio sources and thermal noise from 122 to 134 MHz. The GPR model with a wedge parametrization (Wedge) and its noise-scaled extension (αNoise) attain the highest Bayesian evidence and the least biased 21-cm power-spectrum recovery, and both also pass Bayesian null tests that the remaining three models fail.
Key Takeaways
- The 21-cm signal from neutral hydrogen is key to understanding early structure formation during the Cosmic Dawn and the Epoch of Reionization.
- The signal's faintness makes detection a major challenge, especially against astrophysical foregrounds.
- Machine-learning-based Gaussian Process Regression (GPR) is used to analyze simulated observations (a toy GPR comparison follows this list).
- Five GPR models are compared with a Bayesian evaluation framework on simulated SKA-Low observations.
- The wedge-parametrized GPR model and its noise-scaled extension perform best in Bayesian evidence and 21-cm power-spectrum recovery.
- The best models recover the local power spectrum with only small fractional differences from the injected 21-cm power.
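The snippet below compares two Gaussian-process kernel models by log marginal likelihood on toy spectra, using scikit-learn rather than the paper's pipeline. The kernels, frequencies, and amplitudes are invented, and log marginal likelihood is only a cheap stand-in for the full Bayesian evidence the paper computes.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy spectra: smooth foreground + faint small-scale signal + noise.
rng = np.random.default_rng(0)
freq = np.linspace(122, 134, 60)[:, None]  # MHz
data = (1e3 * (freq.ravel() / 122) ** -2.7
        + 0.05 * np.sin(freq.ravel() * 4.0)
        + 0.02 * rng.standard_normal(60))

models = {
    "foreground_only": RBF(length_scale=20.0) + WhiteKernel(1e-3),
    "foreground_plus_signal": RBF(20.0) + RBF(0.5) + WhiteKernel(1e-3),
}
for name, kernel in models.items():
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(freq, data)
    # Log marginal likelihood: a cheap proxy for Bayesian evidence here.
    print(f"{name}: log-ML = {gpr.log_marginal_likelihood_value_:.1f}")
```

The model whose covariance structure matches the data earns the higher marginal likelihood, which is the same Occam's-razor logic behind the paper's evidence-based comparison of the Wedge and αNoise parametrizations.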
SPOT: Sparsification with Attention Dynamics via Token Relevance in Vision Transformers
Authors:Oded Schlesinger, Amirhossein Farzam, J. Matias Di Martino, Guillermo Sapiro
While Vision Transformers (ViT) have demonstrated remarkable performance across diverse tasks, their computational demands are substantial, scaling quadratically with the number of processed tokens. Compact attention representations, reflecting token interaction distributions, can guide early detection and reduction of less salient tokens prior to attention computation. Motivated by this, we present SParsification with attentiOn dynamics via Token relevance (SPOT), a framework for early detection of redundant tokens within ViTs that leverages token embeddings, interactions, and attention dynamics across layers to infer token importance, resulting in a more context-aware and interpretable relevance detection process. SPOT informs token sparsification and facilitates the elimination of such tokens, improving computational efficiency without sacrificing performance. SPOT employs computationally lightweight predictors that can be plugged into various ViT architectures and learn to derive effective input-specific token prioritization across layers. Its versatile design supports a range of performance levels adaptable to varying resource constraints. Empirical evaluations demonstrate significant efficiency gains of up to 40% compared to standard ViTs, while maintaining or even improving accuracy. Code and models are available at https://github.com/odedsc/SPOT .
Paper and Project Links
PDF Project repository: https://github.com/odedsc/SPOT
Summary
SPOT (SParsification with attentiOn dynamics via Token relevance) is a framework for detecting redundant tokens in Vision Transformers early. It infers token importance from token embeddings, interactions, and attention dynamics across layers, yielding a context-aware and interpretable relevance-detection process, and uses the result to sparsify and remove tokens, improving computational efficiency without sacrificing performance. SPOT's predictors are computationally lightweight, plug into various ViT architectures, and learn input-specific token prioritization across layers; the design adapts to a range of resource constraints. Empirically, SPOT achieves efficiency gains of up to 40% over standard ViTs while maintaining or even improving accuracy.
Key Takeaways
- Vision Transformers (ViTs) are computationally demanding, scaling quadratically with the number of processed tokens.
- The SPOT framework improves ViT efficiency by detecting redundant tokens early (see the pruning sketch below).
- SPOT infers token importance from token embeddings, interactions, and cross-layer attention dynamics.
- SPOT plugs into various ViT architectures and yields a context-aware, interpretable relevance-detection process.
- SPOT improves efficiency through input-specific token prioritization.
- The design is flexible and adapts to different resource constraints.
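A minimal PyTorch sketch of predictor-guided token pruning, assuming a ViT-style token sequence with a leading CLS token. The linear scorer and the keep ratio are placeholders; SPOT's actual predictors also exploit token interactions and attention dynamics across layers.

```python
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    """Lightweight per-token relevance predictor (illustrative only)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, tokens):                # (B, N, D) -> (B, N)
        return self.proj(tokens).squeeze(-1)

def prune_tokens(tokens, scorer, keep_ratio=0.6):
    # Keep the CLS token plus the top-scoring patch tokens.
    b, n, d = tokens.shape
    scores = scorer(tokens[:, 1:])            # score patch tokens only
    k = max(1, int((n - 1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices + 1   # shift indices past CLS
    idx = torch.cat([torch.zeros(b, 1, dtype=torch.long), idx], dim=1)
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

tokens = torch.randn(2, 197, 768)             # ViT-B/16-like sequence
pruned = prune_tokens(tokens, TokenScorer(768))
print(pruned.shape)                           # torch.Size([2, 118, 768])
```

Dropping ~40% of tokens before the attention blocks is where the quadratic-cost savings come from, which mirrors the up-to-40% efficiency gains the abstract reports.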
Beyond Elicitation: Provision-based Prompt Optimization for Knowledge-Intensive Tasks
Authors:Yunzhe Xu, Zhuosheng Zhang, Zhe Liu
While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models’ capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within fixed parametric boundaries rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing up to 29% token usage. Extensive evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO’s superiority over elicitation-based methods, with an average performance improvement of ~6% over the strongest baseline while achieving comparable or lower token consumption. Code at: https://github.com/xyz9911/KPPO.
Paper and Project Links
PDF 16 pages, 19 figures
Summary
This paper proposes Knowledge-Provision-based Prompt Optimization (KPPO), which reframes prompt optimization as systematic knowledge integration rather than elicitation of latent capability, addressing the limits of elicitation-based methods on knowledge-intensive tasks. KPPO contributes a knowledge-gap filling mechanism for identifying and remediating knowledge gaps, a batch-wise candidate evaluation that considers both performance improvement and distributional stability, and an adaptive knowledge-pruning strategy that balances performance against token efficiency, cutting token usage by up to 29%. On 15 knowledge-intensive benchmarks from various domains, KPPO outperforms elicitation-based methods by roughly 6% on average over the strongest baseline at comparable or lower token consumption.
Key Takeaways
- Knowledge-Provision-based Prompt Optimization (KPPO) reframes prompt optimization as systematic knowledge integration, moving past the limits of elicitation-based methods.
- KPPO includes a knowledge-gap filling mechanism to identify and remedy knowledge gaps.
- A batch-wise candidate evaluation considers both performance improvement and distributional stability.
- An adaptive knowledge-pruning strategy balances performance against token efficiency (a greedy-pruning sketch follows this list).
- KPPO improves average performance by about 6% on knowledge-intensive benchmarks.
- KPPO achieves comparable or lower token usage than baseline methods.
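To illustrate the performance-versus-tokens trade-off behind knowledge pruning, here is a greedy sketch in Python. It is not KPPO's actual algorithm: `prune_knowledge`, `toy_score`, and the tolerance rule are invented stand-ins for the paper's adaptive strategy.

```python
def prune_knowledge(snippets, score_fn, tolerance=0.01):
    # Greedy token-efficiency pruning: drop each knowledge snippet
    # whose removal costs no more than `tolerance` validation score.
    kept = list(snippets)
    base = score_fn(kept)
    for s in list(kept):
        trial = [t for t in kept if t is not s]
        if base - score_fn(trial) <= tolerance:
            kept, base = trial, score_fn(trial)
    return kept

# Toy scorer: rewards two 'useful' snippets, charges per snippet kept.
def toy_score(snips):
    useful = {"definition of term X", "unit conversion rule"}
    return sum(0.4 for s in snips if s in useful) - 0.01 * len(snips)

snippets = ["definition of term X", "historical note",
            "unit conversion rule", "redundant paraphrase"]
print(prune_knowledge(snippets, toy_score))
```

The per-snippet length penalty plays the role of token cost: snippets that do not pay for themselves in validation score are pruned, which is the intuition behind KPPO's reported 29% token reduction.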
LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning
Authors:Zihan Gao, Yifei Xu, Jacob Thebault-Spieker
Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance, for example, search improves Gemini’s accuracy by +13.6%, but reduces GPT-series performance by -11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems: capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.
Paper and Project Links
Summary
Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, but their handling of hyper-local knowledge is poorly understood, even as civic platforms and community journalism increasingly need AI that reasons about neighborhood dynamics, cultural narratives, and local governance. LocalBench is the first benchmark for systematically evaluating LLMs on county-level local knowledge in the United States. Grounded in the Localness Conceptual Framework, it contains 14,782 validated question-answer pairs across 526 counties in 49 states, drawing on Census statistics, local subreddit discourse, and regional news, and spanning physical, cognitive, and relational dimensions of locality. Evaluating 13 state-of-the-art LLMs in closed-book and web-augmented settings reveals sharp limits: the best models reach only 56.8% accuracy on narrative-style questions and under 15.5% on numerical reasoning, and neither larger model size nor web augmentation guarantees better performance: search raises Gemini's accuracy by 13.6% but lowers GPT-series performance by 11.4%.
Key Takeaways
- LLMs' ability to handle hyper-local knowledge remains poorly understood.
- Existing benchmarks fall short in evaluating county-level local knowledge.
- LocalBench is the first benchmark for county-level U.S. local knowledge, covering physical, cognitive, and relational dimensions.
- LLMs show marked limitations on narrative-style questions and numerical reasoning.
- Larger model size and web augmentation do not always improve performance.
- Language models need to support equitable, place-aware AI systems that engage with the fine-grained realities of local communities.
LongComp: Long-Tail Compositional Zero-Shot Generalization for Robust Trajectory Prediction
Authors:Benjamin Stoler, Jonathan Francis, Jean Oh
Methods for trajectory prediction in Autonomous Driving must contend with rare, safety-critical scenarios that make reliance on real-world data collection alone infeasible. To assess robustness under such conditions, we propose new long-tail evaluation settings that repartition datasets to create challenging out-of-distribution (OOD) test sets. We first introduce a safety-informed scenario factorization framework, which disentangles scenarios into discrete ego and social contexts. Building on analogies to compositional zero-shot image-labeling in Computer Vision, we then hold out novel context combinations to construct challenging closed-world and open-world settings. This process induces OOD performance gaps in future motion prediction of 5.0% and 14.7% in closed-world and open-world settings, respectively, relative to in-distribution performance for a state-of-the-art baseline. To improve generalization, we extend task-modular gating networks to operate within trajectory prediction models, and develop an auxiliary, difficulty-prediction head to refine internal representations. Our strategies jointly reduce the OOD performance gaps to 2.8% and 11.5% in the two settings, respectively, while still improving in-distribution performance.
Paper and Project Links
PDF 8 pages, 3 figures
Summary
Trajectory prediction for autonomous driving must handle rare, safety-critical scenarios, so relying on real-world data collection alone is infeasible. The authors propose new long-tail evaluation settings that repartition datasets into challenging out-of-distribution (OOD) test sets. A safety-informed scenario factorization framework disentangles scenarios into discrete ego and social contexts; by analogy with compositional zero-shot image labeling in computer vision, novel context combinations are held out to build closed-world and open-world settings. This induces OOD gaps in future motion prediction of 5.0% and 14.7% respectively, relative to in-distribution performance for a state-of-the-art baseline. To improve generalization, the authors extend task-modular gating networks to trajectory prediction models and add an auxiliary difficulty-prediction head to refine internal representations, jointly shrinking the OOD gaps to 2.8% and 11.5% while still improving in-distribution performance.
Key Takeaways
- Trajectory prediction for autonomous driving must handle rare, safety-critical scenarios; real-world data collection alone is not enough.
- New long-tail evaluation settings repartition datasets into challenging out-of-distribution (OOD) test sets.
- A safety-informed scenario factorization framework disentangles scenarios into discrete ego and social contexts.
- By analogy with compositional zero-shot image labeling, novel context combinations are held out to form closed-world and open-world settings (see the split sketch below).
- Future motion prediction shows clear OOD performance gaps in both settings.
- Task-modular gating networks extended into trajectory prediction models improve generalization.
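The sketch below shows the core data-repartitioning idea on toy scenarios: hold out novel (ego, social) context combinations so the test set is compositionally OOD. Field names and contexts are invented; a faithful closed-world split would additionally verify that every individual context still appears somewhere in training, which this naive version does not enforce.

```python
import random
from itertools import product

def compositional_split(scenarios, holdout_frac=0.25, seed=0):
    # Hold out (ego, social) *combinations*: each held-out pair is
    # unseen in training even though its parts may appear there.
    pairs = sorted({(s["ego"], s["social"]) for s in scenarios})
    rng = random.Random(seed)
    held = set(rng.sample(pairs, int(len(pairs) * holdout_frac)))
    train = [s for s in scenarios if (s["ego"], s["social"]) not in held]
    test = [s for s in scenarios if (s["ego"], s["social"]) in held]
    return train, test

# Toy scenarios over 3 ego x 3 social discrete contexts.
egos = ["cruise", "turn", "merge"]
socials = ["sparse", "dense", "jaywalker"]
data = [{"ego": e, "social": s, "id": i}
        for i, (e, s) in enumerate(product(egos, socials))]
train, test = compositional_split(data)
print(len(train), len(test))  # 7 2
```

Evaluating a predictor trained on `train` against `test` then measures exactly the compositional OOD gap the paper quantifies.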
GrounDiff: Diffusion-Based Ground Surface Generation from Digital Surface Models
Authors:Oussema Dhaouadi, Johannes Meier, Jacques Kaiser, Daniel Cremers
Digital Terrain Models (DTMs) represent the bare-earth elevation and are important in numerous geospatial applications. Such data models cannot be directly measured by sensors and are typically generated from Digital Surface Models (DSMs) derived from LiDAR or photogrammetry. Traditional filtering approaches rely on manually tuned parameters, while learning-based methods require well-designed architectures, often combined with post-processing. To address these challenges, we introduce Ground Diffusion (GrounDiff), the first diffusion-based framework that iteratively removes non-ground structures by formulating the problem as a denoising task. We incorporate a gated design with confidence-guided generation that enables selective filtering. To increase scalability, we further propose Prior-Guided Stitching (PrioStitch), which employs a downsampled global prior automatically generated using GrounDiff to guide local high-resolution predictions. We evaluate our method on the DSM-to-DTM translation task across diverse datasets, showing that GrounDiff consistently outperforms deep learning-based state-of-the-art methods, reducing RMSE by up to 93% on ALS2DTM and up to 47% on USGS benchmarks. In the task of road reconstruction, which requires both high precision and smoothness, our method achieves up to 81% lower distance error compared to specialized techniques on the GeRoD benchmark, while maintaining competitive surface smoothness using only DSM inputs, without task-specific optimization. Our variant for road reconstruction, GrounDiff+, is specifically designed to produce even smoother surfaces, further surpassing state-of-the-art methods. The project page is available at https://deepscenario.github.io/GrounDiff/.
Paper and Project Links
PDF Accepted at WACV 2026
Summary
Digital Terrain Models (DTMs) represent bare-earth elevation and matter in many geospatial applications. The paper introduces Ground Diffusion (GrounDiff), a diffusion-based method that casts DSM-to-DTM translation as a denoising task, iteratively removing non-ground structures. To improve scalability, Prior-Guided Stitching (PrioStitch) uses a downsampled global prior, generated automatically with GrounDiff, to guide local high-resolution predictions. GrounDiff delivers strong results on DSM-to-DTM translation across diverse datasets, reducing RMSE, and achieves low distance error on the GeRoD road-reconstruction benchmark. The project page is available at the link in the abstract.
Key Takeaways
- Ground Diffusion (GrounDiff) is the first diffusion-based framework for generating Digital Terrain Models (DTMs) from Digital Surface Models (DSMs).
- GrounDiff casts the problem as a denoising task and iteratively removes non-ground structures.
- Prior-Guided Stitching (PrioStitch) improves scalability by using a downsampled global prior to guide local high-resolution predictions.
- GrounDiff performs strongly on DSM-to-DTM translation across diverse datasets, reducing RMSE.
- On road reconstruction, which needs both precision and smoothness, GrounDiff achieves low distance error while keeping surfaces smooth.
- GrounDiff+ is a road-reconstruction variant that produces even smoother surfaces.
MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns
Authors:Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, Zeen Wang, Qiangjun Ji, Fanxi Zhou, Qi Zhang, Yuanrui Hu, Jiahao Liu, Zhang Li, Ziyang Zhang, Qiang Liu, Xiang Bai
Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
Paper and Project Links
Summary
Document parsing is a core task in document intelligence, but real-world documents with multi-level tables, embedded images or formulas, and cross-page structures remain hard for existing OCR systems. MonkeyOCR v1.5 is a unified vision-language framework with a two-stage parsing pipeline: the first stage uses a large multimodal model to jointly predict document layout and reading order, using visual information to keep structure and sequence consistent; the second stage performs localized recognition of text, formulas, and tables within the detected regions, preserving visual fidelity while limiting error propagation. For complex tables, a visual-consistency-based reinforcement learning scheme scores recognition quality by render-and-compare alignment, improving structural accuracy without manual annotation. Two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, handle tables with embedded images and reconstruct tables that cross pages or columns. On OmniDocBench v1.5, MonkeyOCR v1.5 achieves state-of-the-art performance, beating PPOCR-VL and MinerU 2.5 and showing exceptional robustness on visually complex documents.
Key Takeaways
- Document parsing is a core task in document intelligence, supporting information extraction, retrieval-augmented generation, and automated document analysis.
- Real-world documents with multi-level tables, embedded images, formulas, and cross-page structures challenge existing OCR systems.
- MonkeyOCR v1.5 is a unified vision-language framework with a two-stage parsing pipeline for layout understanding and content recognition.
- The first stage predicts document layout and reading order with a multimodal model, keeping structure and sequence consistent.
- The second stage performs localized recognition of text, formulas, and tables, preserving visual fidelity and limiting error propagation.
- A visual-consistency-based reinforcement learning scheme improves table-structure accuracy through render-and-compare alignment.
Rectify Evaluation Preference: Improving LLMs’ Critique on Math Reasoning via Perplexity-aware Reinforcement Learning
Authors:Changyuan Tian, Zhicong Lu, Shuang Qian, Nayu Liu, Peiguang Li, Li Jin, Leiyi Hu, Zhizhao Zeng, Sirui Wang, Ke Zeng, Zhi Guo
To improve Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process of MsMR and rendering a final verdict of the problem-solution. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations for critiquing capability enhancement and pay little attention to delving into the underlying reason for the poor critiquing performance of LLMs. In this paper, we orthogonally quantify and investigate the potential reason – imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by the analysis of the reason, a novel perplexity-aware reinforcement learning algorithm is proposed to rectify the evaluation preference, elevating the critiquing capability. Specifically, to probe into LLMs' critiquing characteristics, a One-to-many Problem-Solution (OPS) benchmark is meticulously constructed to quantify the behavior difference of LLMs when evaluating the problem solutions generated by itself and others. Then, to investigate the behavior difference in depth, we conduct a statistical preference analysis oriented on perplexity and find an intriguing phenomenon – "LLMs incline to judge solutions with lower perplexity as correct", which is dubbed as imbalanced evaluation preference. To rectify this preference, we regard perplexity as the baton in the algorithm of Group Relative Policy Optimization, supporting the LLMs to explore trajectories that judge lower perplexity as wrong and higher perplexity as correct. Extensive experimental results on our built OPS and existing available critic benchmarks demonstrate the validity of our method.
Paper and Project Links
PDF Accepted by AAAI2026
Summary
To improve multi-step mathematical reasoning in LLMs, this paper studies why LLMs critique poorly and identifies an imbalanced evaluation preference. A perplexity-aware reinforcement learning algorithm is proposed to rectify the preference and raise critiquing capability. The authors build a One-to-many Problem-Solution (OPS) benchmark to quantify how differently an LLM judges solutions generated by itself versus others, and a perplexity-oriented statistical analysis reveals that LLMs tend to judge lower-perplexity solutions as correct. To correct this, perplexity serves as the baton in Group Relative Policy Optimization, pushing the model to explore trajectories that judge low-perplexity solutions as wrong and high-perplexity ones as correct. Experiments confirm the method's effectiveness.
Key Takeaways
- 大型语言模型在多步数学推理的评估中存在评价偏好失衡的问题,即倾向于判断低困惑度解决方案为正确。
- 为了研究这一问题,构建了一对一个多问题解决方案基准测试,用于量化语言模型在评价自身和他人解决方案时的行为差异。
- 提出了一种困惑度感知的强化学习算法,以纠正语言模型的评价偏好失衡问题。
- 采用集团相对策略优化算法,以困惑度作为指引,促使语言模型探索不同的判断轨迹。
- 实验结果表明,该方法在自建的OPS和现有批判基准测试上均有效。
- 此研究对于提高大型语言模型的批判能力具有重要意义。
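The snippet below computes solution perplexity with a causal language model, the quantity at the heart of the preference analysis. GPT-2 is only a small stand-in judge, and the two candidate solutions are made up; the paper's analysis additionally correlates such perplexities with an LLM judge's correct/incorrect verdicts, which this sketch does not reproduce.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in judge model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss        # mean per-token NLL
    return float(torch.exp(loss))

# Probe the 'imbalanced evaluation preference': compare perplexities
# of two candidate solutions to the same problem. A biased judge would
# tend to call the lower-perplexity one correct regardless of content.
sol_a = "2 + 2 = 4 because addition combines quantities."
sol_b = "2 + 2 equals 4; combining two pairs yields four items."
print(perplexity(sol_a), perplexity(sol_b))
```

Once verdicts are logged alongside such perplexities, the imbalance is just the correlation between "judged correct" and "lower perplexity", which the paper's GRPO reward is designed to break.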
Music Flamingo: Scaling Music Understanding in Audio Language Models
Authors:Sreyan Ghosh, Arushi Goel, Lasha Koroshinadze, Sang-gil Lee, Zhifeng Kong, Joao Felipe Santos, Ramani Duraiswami, Dinesh Manocha, Wei Ping, Mohammad Shoeybi, Bryan Catanzaro
We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model’s reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.
Paper and Project Links
PDF Project Page: https://research.nvidia.com/labs/adlr/MF/
Summary
Music Flamingo is a large audio-language model aimed at advancing music (including song) understanding in foundational audio models. Music is hard for audio-language research because it is dynamic, layered, and information-dense, and high-quality music data and annotations are scarce, so earlier models produce short, surface-level captions and generalize poorly across musical cultures. The authors curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline with rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context, and fine-tune an enhanced Audio Flamingo 3 backbone on it. A post-training recipe, cold-starting on the music-theory-grounded chain-of-thought dataset MF-Think and then applying GRPO-based reinforcement learning with custom rewards, further improves reasoning. Music Flamingo reaches state-of-the-art results on more than ten music understanding and reasoning benchmarks and moves models from surface-level recognition toward layered, human-like perception of songs.
Key Takeaways
- Music Flamingo is a large audio-language model for advancing music understanding.
- Music is challenging because of its dynamic, layered, information-dense nature and the scarcity of high-quality data and annotations.
- The MF-Skills dataset, labeled through a multi-stage pipeline, provides rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context.
- Music Flamingo achieves state-of-the-art results on more than ten music understanding and reasoning benchmarks.
- The model moves from surface-level recognition toward layered, human-like perception of songs.
- The work provides both a benchmark and a foundation for the next generation of music-capable models.
Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis
Authors:Jiulong Wu, Yucheng Shen, Lingyong Yan, Haixin Sun, Deguo Xia, Jizhou Huang, Min Cao
Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.
Paper and Project Links
PDF This paper has been accepted by AAAI 2026. 16 pages, 3 figures, 10 tables
Summary
Facial Emotion Analysis (FEA) extends traditional facial emotion recognition with explainable, fine-grained reasoning, jointly covering emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning. Recent Vision-Language Model (VLM) approaches face two problems: hallucinated reasoning caused by insufficient emotion-specific knowledge, and misalignment between emotion reasoning and recognition. Facial-R1 is a three-stage alignment framework that addresses both with minimal supervision: instruction fine-tuning establishes basic emotional reasoning; reinforcement training with emotion and AU labels as reward signals explicitly aligns the generated reasoning with the predicted emotion; and a data-synthesis pipeline iteratively expands the training set for scalable self-improvement. The authors also introduce FEA-20K, a benchmark with 17,737 training and 1,688 test samples carrying fine-grained emotion-analysis annotations. Across eight standard benchmarks, Facial-R1 achieves state-of-the-art FEA performance with strong generalization and robust interpretability.
Key Takeaways
- Facial Emotion Analysis (FEA) combines three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning.
- Current Vision-Language Model (VLM) approaches suffer from hallucinated reasoning and misalignment between reasoning and recognition.
- Facial-R1 addresses these with three stages: instruction fine-tuning, reinforcement training with emotion and AU labels as rewards (a toy reward sketch follows this list), and a data-synthesis pipeline.
- The pipeline enables scalable self-improvement by iteratively expanding the training set.
- The FEA-20K dataset carries fine-grained emotion-analysis annotations for training and evaluation.
- Experiments show excellent performance, generalization, and interpretability for Facial-R1.
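To give the reward signal a concrete shape, here is a toy scalar reward combining an emotion-label match with an AU-set F1. The weighting and functional form are assumptions for illustration; the paper states that emotion and AU labels guide the reinforcement stage but does not publish this exact formula.

```python
def au_f1(pred_aus, gold_aus):
    # F1 between predicted and gold Action Unit sets.
    pred, gold = set(pred_aus), set(gold_aus)
    if not pred or not gold:
        return 1.0 if pred == gold else 0.0
    p = len(pred & gold) / len(pred)
    r = len(pred & gold) / len(gold)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def facial_reward(pred_emotion, pred_aus, gold_emotion, gold_aus,
                  w_emotion=0.5, w_au=0.5):
    # Hypothetical scalar reward: exact emotion match plus AU overlap.
    return (w_emotion * float(pred_emotion == gold_emotion)
            + w_au * au_f1(pred_aus, gold_aus))

print(facial_reward("happy", ["AU6", "AU12"],
                    "happy", ["AU6", "AU12", "AU25"]))  # 0.9
```

Because the reward scores both the final emotion and the AUs cited along the way, reasoning that names the wrong facial evidence is penalized even when the emotion label happens to be right, which is the alignment effect the paper targets.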
ProgRAG: Hallucination-Resistant Progressive Retrieval and Reasoning over Knowledge Graphs
Authors:Minbae Park, Hyemin Yang, Jeonghyun Kim, Kunsoo Park, Hyunjoon Kim
Large Language Models (LLMs) demonstrate strong reasoning capabilities but struggle with hallucinations and limited transparency. Recently, KG-enhanced LLMs that integrate knowledge graphs (KGs) have been shown to improve reasoning performance, particularly for complex, knowledge-intensive tasks. However, these methods still face significant challenges, including inaccurate retrieval and reasoning failures, often exacerbated by long input contexts that obscure relevant information or by context constructions that struggle to capture the richer logical directions required by different question types. Furthermore, many of these approaches rely on LLMs to directly retrieve evidence from KGs, and to self-assess the sufficiency of this evidence, which often results in premature or incorrect reasoning. To address the retrieval and reasoning failures, we propose ProgRAG, a multi-hop knowledge graph question answering (KGQA) framework that decomposes complex questions into sub-questions, and progressively extends partial reasoning paths by answering each sub-question. At each step, external retrievers gather candidate evidence, which is then refined through uncertainty-aware pruning by the LLM. Finally, the context for LLM reasoning is optimized by organizing and rearranging the partial reasoning paths obtained from the sub-question answers. Experiments on three well-known datasets demonstrate that ProgRAG outperforms existing baselines in multi-hop KGQA, offering improved reliability and reasoning quality.
Paper and Project Links
Summary
Large Language Models (LLMs) reason well but suffer from hallucinations and limited transparency. KG-enhanced LLMs that integrate knowledge graphs (KGs) improve reasoning on complex, knowledge-intensive tasks, yet still face inaccurate retrieval and reasoning failures. ProgRAG is a multi-hop knowledge graph question answering (KGQA) framework that decomposes complex questions into sub-questions and progressively extends partial reasoning paths by answering each one. Experiments on three well-known datasets show ProgRAG outperforms existing baselines in multi-hop KGQA with better reliability and reasoning quality.
Key Takeaways
- LLMs reason strongly but suffer from hallucinations and limited transparency.
- KG-enhanced LLMs integrate knowledge graphs to improve reasoning on complex, knowledge-intensive tasks.
- These methods still face inaccurate retrieval and reasoning failures.
- ProgRAG decomposes complex questions into sub-questions and progressively extends partial reasoning paths (see the skeleton below).
- External retrievers gather candidate evidence, which the LLM refines through uncertainty-aware pruning.
- The context for LLM reasoning is optimized by organizing and rearranging the partial reasoning paths obtained from sub-question answers.
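The skeleton below sketches the progressive loop over a toy two-hop knowledge graph. The stubs (`decompose`, `retrieve`, `prune`, `answer_sub`) are placeholders for ProgRAG's LLM decomposer, KG retrievers, uncertainty-aware pruning, and sub-question answering; only the control flow is meant to be faithful.

```python
def prograg_answer(question, decompose, retrieve, prune, answer_sub):
    # Answer sub-questions one by one, extending a partial reasoning
    # path; each hop conditions retrieval on the path so far.
    path = []
    for sub_q in decompose(question):
        candidates = retrieve(sub_q, path)     # external KG retriever
        evidence = prune(sub_q, candidates)    # uncertainty-aware pruning
        path.append((sub_q, answer_sub(sub_q, evidence)))
    return path

# Toy stubs over a tiny two-hop knowledge graph.
KG = {("Paris", "capital_of"): "France", ("France", "continent"): "Europe"}
decompose = lambda q: ["capital_of", "continent"]
def retrieve(rel, path):
    head = path[-1][1] if path else "Paris"
    return [(head, rel, KG.get((head, rel)))]
prune = lambda q, cands: [c for c in cands if c[2] is not None]
answer_sub = lambda q, ev: ev[0][2] if ev else None

print(prograg_answer("On which continent is the country whose capital "
                     "is Paris?", decompose, retrieve, prune, answer_sub))
```

Keeping only short, per-hop evidence in context is what lets the final reasoning step work from an organized partial path instead of one long, noisy retrieval dump.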
Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners
Authors:Daniel Herbst, Lea Karbeska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi, Fabrizio Frasca
While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under node reindexing, edge reordering, or formatting changes, raising robustness concerns. We systematically analyze these effects, studying how fine-tuning impacts encoding sensitivity as well generalization on unseen tasks. We propose a principled decomposition of graph serializations into node labeling, edge encoding, and syntax, and evaluate LLM robustness to variations of each of these factors on a comprehensive benchmarking suite. We also contribute a novel set of spectral tasks to further assess generalization abilities of fine-tuned reasoners. Results show that larger (non-fine-tuned) models are more robust. Fine-tuning reduces sensitivity to node relabeling but may increase it to variations in structure and format, while it does not consistently improve performance on unseen tasks.
Paper and Project Links
PDF AAAI 2026 Workshop on Graphs and more Complex Structures For Learning and Reasoning (GCLR)
Summary
Graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations: operating on sequential serializations, they can produce different outputs under node reindexing, edge reordering, or formatting changes, which raises robustness concerns. The authors systematically analyze these effects, studying how fine-tuning affects encoding sensitivity and generalization to unseen tasks. They decompose graph serializations into node labeling, edge encoding, and syntax, evaluate LLM robustness to variations of each factor on a comprehensive benchmark suite, and contribute a novel set of spectral tasks to probe the generalization of fine-tuned reasoners. Larger (non-fine-tuned) models are more robust; fine-tuning reduces sensitivity to node relabeling but can increase sensitivity to structure and format variations, and it does not consistently help on unseen tasks.
Key Takeaways
- LLM-based graph reasoners lack built-in invariance to symmetries in graph representations.
- Node reindexing, edge reordering, and formatting changes can alter LLM outputs (a small invariance probe follows this list).
- Systematic analysis shows fine-tuning affects encoding sensitivity and generalization to unseen tasks.
- Graph serializations are decomposed into node labeling, edge encoding, and syntax to evaluate robustness factor by factor.
- A novel set of spectral tasks assesses the generalization of fine-tuned reasoners.
- Larger (non-fine-tuned) models are more robust.
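A minimal probe for this kind of invariance: relabel nodes and reorder edges, re-serialize, and count how often a model's answer changes. The string-hash "reasoner" below is a deliberately brittle stand-in for an LLM call, and the serialization format is one invented example of the node-labeling/edge-encoding/syntax factors the paper varies.

```python
import random

def serialize(edges):
    return "; ".join(f"{u}->{v}" for u, v in edges)

def relabel(edges, seed=0):
    # Apply a random permutation to the node ids (a graph symmetry).
    nodes = sorted({n for e in edges for n in e})
    perm = nodes[:]
    random.Random(seed).shuffle(perm)
    mapping = dict(zip(nodes, perm))
    return [(mapping[u], mapping[v]) for u, v in edges]

def invariance_gap(model, edges, n_variants=10):
    # Fraction of symmetry-equivalent serializations on which the
    # model's answer differs from the canonical one.
    base = model(serialize(edges))
    diffs = 0
    for i in range(n_variants):
        variant = relabel(edges, seed=i)
        random.Random(i).shuffle(variant)      # also reorder edges
        diffs += model(serialize(variant)) != base
    return diffs / n_variants

# Brittle stand-in 'reasoner' whose output leaks serialization details.
brittle_model = lambda s: hash(s) % 5
print(invariance_gap(brittle_model, [(0, 1), (1, 2), (2, 0)]))
```

A perfectly invariant reasoner would score 0.0 here; the paper's finding is that real LLM graph reasoners sit well above that, with fine-tuning shifting which factor (labels, structure, or format) they are sensitive to.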
Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL
Authors:Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang, Bin Cui
The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow’s high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.
Paper and Project Links
Summary
The data-centric paradigm is pivotal in AI, and Text-to-SQL in particular is limited by scarce, simplistic, low-diversity datasets. Text2SQL-Flow is a SQL-aware data augmentation framework that generates large-scale, semantically valid, structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions with an end-to-end pipeline featuring SQL execution verification, natural-language question generation, chain-of-thought reasoning traces, and data classification, while a modular Database Manager provides cross-database compatibility and scalability. With this framework the authors build SQLFlow, a high-quality dataset of 89,544 annotated examples. Fine-tuning open-source LLMs on SQLFlow consistently improves benchmark performance under the same data budget, and for closed-source LLMs a masked alignment retrieval method treats SQLFlow as both knowledge base and retriever training data, enabling structure-aware example matching; this retrieval strategy outperforms existing methods.
Key Takeaways
- The data-centric paradigm is pivotal in AI, especially for Text-to-SQL.
- Text2SQL-Flow addresses the scarcity and low diversity of Text-to-SQL datasets.
- The framework generates large-scale, semantically valid, structurally diverse Text-to-SQL pairs.
- The pipeline includes SQL execution verification (see the sketch below), natural-language question generation, chain-of-thought traces, and data classification.
- The framework is used to build SQLFlow, a high-quality dataset of 89,544 annotated examples.
- Both open-source fine-tuning and a masked alignment retrieval method for closed-source LLMs benefit from the dataset.
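To illustrate the execution-verification step, here is a minimal sketch that runs a candidate SQL query against a scratch SQLite database and reports whether it executes. The schema, table, and pass/fail criterion are invented for the example; Text2SQL-Flow's actual verifier and Database Manager are more elaborate and cross-database.

```python
import sqlite3

def verify_sql(sql, schema, rows):
    # Execution check as a validity filter: run the candidate SQL on a
    # throwaway in-memory database and report success plus the result.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(schema)
        conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
        result = conn.execute(sql).fetchall()
        return True, result
    except sqlite3.Error as e:
        return False, str(e)
    finally:
        conn.close()

schema = "CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)"
rows = [(1, "Ada", 120.0), (2, "Lin", 95.0)]
print(verify_sql("SELECT name FROM employees WHERE salary > 100", schema, rows))
print(verify_sql("SELECT nam FROM employees", schema, rows))  # bad column
```

Filtering generated pairs through a check like this is what keeps large-scale augmentation from flooding the dataset with SQL that merely looks plausible but cannot execute.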