
R1_Reasoning


⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验

2025-05-21 更新

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Authors:Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, Ramya Namuduri, Bodun Hu, Juan Diego Rodriguez, Puyuan Peng, Greg Durrett

Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks – where frontier models perform similarly and near saturation – our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.

图表理解对于大型视觉语言模型(LVLMs)来说是一个独特的挑战,因为它需要整合复杂的文本和视觉推理能力。然而,当前的LVLMs在这些技能之间表现出明显的失衡,在文本中难以执行的视觉推理方面表现不足。我们通过一个只能通过视觉推理解决的合成数据集进行案例研究,并发现随着视觉复杂性的增加,模型性能显著下降,而人类性能保持稳健。随后,我们介绍了ChartMuseum,这是一个新的图表问答(QA)基准测试,包含1162个专家标注的问题,涵盖多种推理类型,这些问题是从来自184个来源的真实世界图表中精心挑选的,专门用于评估复杂的视觉和文本推理。与之前图表理解基准测试中的模型表现相似且接近饱和的情况不同,我们的基准测试揭示了模型与人类性能之间的巨大差距,同时有效地区分了模型的能力:虽然人类达到93%的准确率,但表现最佳的模型Gemini-2.5-Pro仅达到63.0%,领先的开源LVLM Qwen2.5-VL-72B-Instruct仅达到38.5%。此外,在主要需要视觉推理的问题上,所有模型的性能比侧重于文本推理的问题下降了35%-55%。最后,我们的定性错误分析揭示了对当前LVLMs具有挑战性的特定类别的视觉推理。

论文及项目相关链接

PDF

Summary

本文探讨了大型视觉语言模型(LVLMs)在图表理解上所面临的挑战。研究发现,LVLMs在视觉推理方面存在明显不足,难以应对日益复杂的视觉内容。为此,研究团队创建了一个新的图表问答(QA)基准测试ChartMuseum,包含1,162个真实世界图表的专家标注问题。相较于现有基准测试,ChartMuseum更能凸显模型与人类的性能差距,有效区分模型能力。尽管人类准确率高达93%,但最佳模型仅达到63%,开源LVLM模型Qwen2.5-VL-72B-Instruct仅达到38.5%。在主要依赖视觉推理的问题上,所有模型的性能较文本推理主导的问题下降了35%-55%。此外,定性错误分析揭示了当前LVLMs在特定类别视觉推理上的难点。

Key Takeaways

  1. 大型视觉语言模型(LVLMs)在图表理解方面面临视觉推理的挑战。
  2. 模型在复杂视觉内容下的性能显著下降,而人类性能保持稳定。
  3. ChartMuseum基准测试包含多种推理类型的真实世界图表问题,旨在评估复杂的视觉和文本推理能力。
  4. 模型与人类的性能差距显著,最佳模型性能仍远低于人类。
  5. 在视觉推理主导的问题上,模型性能下降幅度较大。
  6. 当前LVLMs在特定类别的视觉推理上仍存在困难。
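
下面给出一个按推理类型统计基准准确率的极简Python示意(并非论文的官方评测脚本):其中 answer_fn 是假设的LVLM调用接口,且用精确匹配代替了真实评测中更宽松的答案判定,仅用于说明这类基准如何区分视觉推理与文本推理问题上的表现差异。

```python
from collections import defaultdict

def evaluate_by_reasoning_type(samples, answer_fn):
    """samples: [{"image": ..., "question": ..., "answer": ..., "reasoning_type": ...}, ...]
    answer_fn(image, question) -> 字符串答案, 为假设的LVLM接口。"""
    stats = defaultdict(lambda: [0, 0])            # 推理类型 -> [答对数, 总数]
    for s in samples:
        pred = answer_fn(s["image"], s["question"]).strip().lower()
        gold = str(s["answer"]).strip().lower()
        stats[s["reasoning_type"]][0] += int(pred == gold)   # 简化: 精确匹配
        stats[s["reasoning_type"]][1] += 1
    return {k: correct / total for k, (correct, total) in stats.items()}
```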

Cool Papers

点此查看论文截图

Optimizing Anytime Reasoning via Budget Relative Policy Optimization

Authors:Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within sampled token budgets from a prior distribution, compelling the model to summarize the optimal answer for each truncated thinking for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.

扩展测试时计算对于提高大型语言模型(LLM)的推理能力至关重要。现有方法通常采用强化学习(RL)来最大化推理轨迹结束时获得的可验证奖励。然而,这些方法仅在固定且较大的令牌预算下优化最终性能,这妨碍了训练和部署的效率。在这项工作中,我们提出了一种新型框架“AnytimeReasoner”,以优化任意时间的推理性能,旨在提高令牌效率以及在不同令牌预算约束下的推理灵活性。为实现这一目标,我们将完整的思考过程截断以适应从先验分布中采样的令牌预算,迫使模型为每次截断的思考总结当前最优答案以进行验证。这为推理过程引入了可验证的稠密奖励,有助于RL优化中更有效的信用分配。然后,我们以解耦的方式优化思考策略和总结策略,以最大化累积奖励。此外,我们引入了一种新的方差降低技术,即预算相对策略优化(BRPO),以提高强化思考策略时学习过程的稳健性和效率。数学推理任务上的实证结果表明,在各种先验分布下,我们的方法在所有思考预算上均持续优于GRPO,同时提升了训练效率和令牌效率。

论文及项目相关链接

PDF

Summary
大型语言模型的推理能力可通过扩展测试时计算来增强。现有的强化学习方法主要针对固定且较大的令牌预算优化最终性能,限制了训练和部署的效率。本研究提出了一种新的框架AnytimeReasoner,旨在优化任意时间点的推理性能,提高令牌效率和应对不同令牌预算约束的灵活性。该方法通过截断完整的思考过程以适应来自先验分布的采样令牌预算,迫使模型为每个截断的思考过程总结最佳答案以供验证,从而在推理过程中引入可验证的密集奖励,促进强化学习优化中更有效的信用分配。此外,研究还引入了一种新的方差降低技术——预算相对策略优化(BRPO),以提高强化思考策略时学习过程的稳健性和效率。数学推理任务上的实验结果表明,该方法在各种先验分布下的所有思考预算上均优于GRPO,同时提高了训练和令牌效率。

Key Takeaways

  1. 测试时间的计算能力对大型语言模型的推理能力至关重要。
  2. 当前强化学习方法存在固定令牌预算限制,影响训练和部署效率。
  3. AnytimeReasoner框架旨在优化任意时间点的推理性能和提高令牌效率。
  4. 通过截断思考过程并适应采样令牌预算,迫使模型为每个截断的思考过程总结最佳答案。
  5. 引入可验证的密集奖励到推理过程中,促进更有效的强化学习中的信用分配。
  6. 引入预算相对策略优化(BRPO)技术,提高学习过程的稳健性和效率。
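
下面是一段基于摘要理解写出的极简示意(非官方实现),演示“采样令牌预算、截断思考、在每个预算处总结并验证答案”从而获得稠密可验证奖励的思路;think_fn、summarize_fn、verify_fn 以及预算先验均为假设的接口和取值,组内相对优势仅示意GRPO/BRPO风格的基线化处理,并非论文算法的完整细节。

```python
def anytime_rollout(problem, think_fn, summarize_fn, verify_fn,
                    budget_prior=(256, 512, 1024, 2048)):
    """对一个问题做一次"任意时间"推理评估: 先生成完整思考, 再在每个采样预算处
    截断并总结答案, 逐一验证, 返回各预算下的可验证奖励(0/1)列表。"""
    full_thought = think_fn(problem, max_tokens=max(budget_prior))  # 假设返回令牌列表
    rewards = []
    for budget in sorted(budget_prior):
        truncated = full_thought[:budget]                  # 截断思考以满足当前令牌预算
        answer = summarize_fn(problem, truncated)          # 对截断的思考总结出当前最优答案
        rewards.append(float(verify_fn(problem, answer)))  # 可验证的稠密奖励
    return rewards

def group_relative_advantage(rewards):
    """GRPO风格的组内基线化(示意); BRPO在此思路上进一步利用预算内部的奖励信息降方差。"""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```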

Cool Papers

点此查看论文截图

MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

Authors:Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, Wenqi Shao

While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our codes and data at https://github.com/ModalMinds/MM-PRM.

多模态大型语言模型(MLLMs)在视觉语言理解方面取得了令人印象深刻的进展,但在复杂的多步骤推理方面仍面临挑战,经常产生逻辑不一致或部分正确的解决方案。一个关键限制在于中间推理步骤缺乏精细的监督。为了解决这一问题,我们提出了MM-PRM,这是一种在全自动化、可扩展的框架内训练的过程奖励模型。我们首先构建了MM-Policy,这是一个在多样化的数学推理数据上训练的强大多模态模型。然后,我们构建了MM-K12,这是一个由1万道带可验证答案的多模态数学题组成的精选数据集,作为种子数据。利用基于蒙特卡洛树搜索(MCTS)的管道,我们在无需人工标注的情况下生成了超过70万条步骤级标注。所得的PRM用于在Best-of-N推理设置中评分候选推理路径,并在域内(MM-K12测试集)和域外(OlympiadBench、MathVista等)基准测试上实现了显著的改进。进一步的分析证实了软标签、较小学习率和路径多样性在优化PRM性能方面的有效性。MM-PRM证明,过程监督是提高多模态推理系统逻辑稳健性的有力工具。我们在https://github.com/ModalMinds/MM-PRM上发布了所有代码和数据。

论文及项目相关链接

PDF

Summary
多模态大型语言模型在视觉语言理解方面取得了显著进展,但在复杂多步骤推理方面仍存在困难。为解决此问题,该研究提出了一种名为MM-PRM的过程奖励模型,先构建强大的多模态模型MM-Policy,再构建数据集MM-K12作为种子数据来训练模型。该研究使用基于蒙特卡洛树搜索(MCTS)的管道生成超过70万条步骤级标注,无需人工标注。该模型在域内和域外基准测试中均取得了显著改进,并验证了软标签、较小的学习率和路径多样性在优化PRM性能方面的有效性。研究证明了过程监督在增强多模态推理系统逻辑稳健性方面的强大作用。

Key Takeaways

  1. 多模态大型语言模型(MLLMs)在视觉理解方面取得显著进展,但在复杂多步骤推理上仍有困难。
  2. 缺乏中间推理步骤的精细监督是MLLMs面临的关键限制。
  3. 提出MM-PRM过程奖励模型,通过全自动和可扩展的框架进行训练。
  4. 构建强大的多模态模型MM-Policy,并构建MM-K12数据集作为种子数据。
  5. 使用蒙特卡洛树搜索(MCTS)生成大量步骤级别的注释,无需人工标注。
  6. MM-PRM模型在域内和域外基准测试中表现优异。
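
下面用一小段Python示意过程奖励模型在Best-of-N推理中的用法(非官方实现):prm_score 是假设的PRM接口,对每条候选路径逐步打分后取聚合值最高者;聚合方式(这里默认取各步得分的最小值)只是常见做法之一,并非论文的唯一设定。

```python
def best_of_n_with_prm(candidates, prm_score, aggregate=min):
    """candidates: 每个元素是一条候选推理路径(按步骤切分的列表)。
    prm_score(steps_prefix) -> 前缀中最后一步正确的概率, 为假设的PRM接口。"""
    best_path, best_value = None, float("-inf")
    for steps in candidates:
        step_scores = [prm_score(steps[: i + 1]) for i in range(len(steps))]
        value = aggregate(step_scores)        # 也可改为乘积等其他聚合方式
        if value > best_value:
            best_path, best_value = steps, value
    return best_path, best_value
```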

Cool Papers

点此查看论文截图

G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

Authors:Liang Chen, Hongcheng Gao, Tianyu Liu, Zhiqi Huang, Flood Sung, Xinyu Zhou, Yuxin Wu, Baobao Chang

Vision-Language Models (VLMs) excel in many direct multimodal tasks but struggle to translate this prowess into effective decision-making within interactive, visually rich environments like games. This “knowing-doing” gap significantly limits their potential as autonomous agents, as leading VLMs often perform badly in simple games. To address this, we introduce VLM-Gym, a curated reinforcement learning (RL) environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty, specifically designed for scalable multi-game parallel training. Leveraging VLM-Gym, we train G0 models using pure RL-driven self-evolution, which demonstrate emergent perception and reasoning patterns. To further mitigate challenges arising from game diversity, we develop G1 models. G1 incorporates a perception-enhanced cold start prior to RL fine-tuning. Our resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models like Claude-3.7-Sonnet-Thinking. Systematic analysis reveals an intriguing finding: perception and reasoning abilities mutually bootstrap each other throughout the RL training process. Source code including VLM-Gym and RL training is released at https://github.com/chenllliang/G1 to foster future research in advancing VLMs as capable interactive agents.

视觉语言模型(VLMs)在许多直接的多模式任务中表现出色,但在将这一能力转化为游戏等交互性丰富、视觉丰富的环境中的有效决策方面却遇到了困难。这种“知而不能行”的差距极大地限制了它们作为自主智能体的潜力,因为领先的VLMs往往在简单的游戏中表现不佳。为了解决这一问题,我们引入了VLM-Gym,这是一个定制的强化学习(RL)环境,其中包含具有统一接口和可调、组合难度的各种视觉游戏,专门为可扩展的多游戏并行训练而设计。利用VLM-Gym,我们使用纯RL驱动的自我进化训练G0模型,展示了新兴的感知和推理模式。为了进一步缓解由游戏多样性带来的挑战,我们开发了G1模型。G1在RL微调之前加入了增强感知的冷启动阶段。我们的G1模型在所有游戏中都超越了其教师模型,并超越了领先的专有模型,如Claude-3.7-Sonnet-Thinking。系统分析揭示了一个有趣的发现:感知和推理能力在RL训练过程中相互引导、相互促进。VLM-Gym和RL训练的源代码已在https://github.com/chenllliang/G1发布,以促进未来在将VLMs发展为有能力交互的智能体方面的研究工作。

论文及项目相关链接

PDF 21 pages, 14 figures, code released at https://github.com/chenllliang/G1

Summary

本文介绍了针对视觉语言模型(VLMs)在交互式丰富视觉环境中决策能力不足的问题,提出了一种名为VLM-Gym的强化学习环境。通过在该环境中训练模型,实现了对VLMs的感知和推理能力的互相促进提升。作者训练了G0模型并开发了G1模型以应对不同游戏的挑战,其中G1模型在感知能力上进行了增强并实现了对老师的超越。

Key Takeaways

  1. VLMs在直接的多模式任务中表现出色,但在视觉丰富环境下的决策制定中存在挑战。存在所谓的“知道做不到”的差距。
  2. VLM-Gym是一个专为解决此问题设计的强化学习环境,提供多样化的视觉游戏,并具备统一的接口和可调整的组合难度,适用于大规模的多游戏并行训练。
  3. G0模型通过纯RL驱动的自我进化进行训练,展现出新兴的感知和推理模式。
  4. G1模型在RL微调前增加了感知增强的冷启动阶段,性能超越了老师和其他领先模型如Claude-3.7-Sonnet-Thinking。
  5. 系统分析发现感知和推理能力在RL训练过程中相互推动提升。
  6. 作者公开了包括VLM-Gym和RL训练的源代码,以推动未来研究在将VLMs发展为具有交互能力的智能体方面的发展。
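
下面是根据摘要中“统一接口、可组合难度、图像观测加文本动作”的描述拼出的一个极简环境接口示意(并非官方VLM-Gym代码),类名、方法与 vlm_policy 接口均为假设,仅用于说明这类多游戏并行训练环境的基本交互形式。

```python
class VisualGameEnv:
    """统一接口的极简示意: 观测为渲染后的游戏画面, 动作为一段文本, 难度可调。"""

    def __init__(self, game_name: str, difficulty: int = 1):
        self.game_name, self.difficulty = game_name, difficulty

    def reset(self):
        return self._render()                       # 返回初始观测(图像)

    def step(self, action_text: str):
        reward, done = self._apply(action_text)     # 解析文本动作并推进游戏状态
        return self._render(), reward, done, {}

    def _render(self):
        return None        # 占位: 真实环境应返回当前游戏画面

    def _apply(self, action_text: str):
        return 0.0, True   # 占位: 真实环境应返回(奖励, 是否结束)

def rollout(env, vlm_policy, max_steps=32):
    """vlm_policy(observation) -> 文本动作, 为假设的VLM推断接口; 返回单局总回报。"""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done, _ = env.step(vlm_policy(obs))
        total += reward
        if done:
            break
    return total
```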

Cool Papers

点此查看论文截图

FEALLM: Advancing Facial Emotion Analysis in Multimodal Large Language Models with Emotional Synergy and Reasoning

Authors:Zhuozhao Hu, Kaishen Yuan, Xin Liu, Zitong Yu, Yuan Zong, Jingang Shi, Huanjing Yue, Jingyu Yang

Facial Emotion Analysis (FEA) plays a crucial role in visual affective computing, aiming to infer a person’s emotional state based on facial data. Scientifically, facial expressions (FEs) result from the coordinated movement of facial muscles, which can be decomposed into specific action units (AUs) that provide detailed emotional insights. However, traditional methods often struggle with limited interpretability, constrained generalization and reasoning abilities. Recently, Multimodal Large Language Models (MLLMs) have shown exceptional performance in various visual tasks, while they still face significant challenges in FEA due to the lack of specialized datasets and their inability to capture the intricate relationships between FEs and AUs. To address these issues, we introduce a novel FEA Instruction Dataset that provides accurate and aligned FE and AU descriptions and establishes causal reasoning relationships between them, followed by constructing a new benchmark, FEABench. Moreover, we propose FEALLM, a novel MLLM architecture designed to capture more detailed facial information, enhancing its capability in FEA tasks. Our model demonstrates strong performance on FEABench and impressive generalization capability through zero-shot evaluation on various datasets, including RAF-DB, AffectNet, BP4D, and DISFA, showcasing its robustness and effectiveness in FEA tasks. The dataset and code will be available at https://github.com/953206211/FEALLM.

面部表情分析(FEA)在视觉情感计算中扮演着至关重要的角色,其目的在于根据面部数据推断一个人的情感状态。从科学的角度来看,面部表情是面部肌肉协调运动的结果,可以分解为特定的动作单元(AU),从而提供详细的情感洞察。然而,传统的方法在解释性、通用性和推理能力方面存在局限。最近,多模态大型语言模型(MLLM)在各种视觉任务中表现出了卓越的性能,但在FEA中仍面临重大挑战,主要是由于缺乏专业数据集和无法捕捉面部表情和动作单元之间复杂关系的能力。为了解决这些问题,我们引入了一个新的面部表情分析指令数据集,提供准确和对齐的面部表情和动作单元描述,并建立它们之间的因果关系。随后,我们构建了一个新的基准测试平台FEABench。此外,我们提出了FEALLM,这是一种新型的多模态大型语言模型架构,旨在捕捉更详细的面部信息,提高其在面部表情分析任务中的能力。我们的模型在FEABench上表现出强大的性能,在各种数据集上通过零样本评估展现出令人印象深刻的泛化能力,包括RAF-DB、AffectNet、BP4D和DISFA,证明了其在面部表情分析任务中的稳健性和有效性。数据集和代码将在https://github.com/953206211/FEALLM上提供。

论文及项目相关链接

PDF 10 pages, 7 figures

Summary

面部情感分析(FEA)在视觉情感计算中起到关键作用,通过面部数据推断人的情感状态。传统方法存在解释性、通用性和推理能力有限的挑战。近期,多模态大型语言模型(MLLMs)在视觉任务中表现出卓越性能,但在FEA中仍面临特定数据集缺乏和难以捕捉面部表情(FEs)与动作单元(AUs)之间复杂关系的挑战。为解决这些问题,研究团队引入了新型FEA指令数据集,提供准确对齐的FE和AU描述,并建立它们之间的因果关系。此外,他们构建了新基准测试FEABench,并提出了FEALLM,一种旨在捕捉更多细节面部信息的新型MLLM架构。该模型在FEABench上表现出强劲性能,并在多种数据集上展示了其零样本评估的出色泛化能力,证明了其在FEA任务中的稳健性和有效性。

Key Takeaways

  1. 面部情感分析(FEA)在视觉情感计算中扮演关键角色,旨在基于面部数据推断人的情感状态。
  2. 传统方法在面部情感分析方面存在解释性、通用性和推理能力的局限。
  3. 多模态大型语言模型(MLLMs)在视觉任务中展现出强大性能,但在面部情感分析(FEA)中仍面临挑战,主要是由于缺乏专用数据集和捕捉复杂关系的能力不足。
  4. 为改善FEA,引入了新型的FEA指令数据集,提供准确的面部表情(FEs)和动作单元(AUs)描述,并建立它们之间的因果关系。
  5. 建立了新的基准测试FEABench。
  6. 提出了FEALLM,一种新型的多模态大型语言模型架构,旨在提高在面部情感分析任务中的性能。

Cool Papers

点此查看论文截图

AdaptThink: Reasoning Models Can Learn When to Think

Authors:Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, Juanzi Li

Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our codes and models are available at https://github.com/THU-KEG/AdaptThink.

最近,大型推理模型通过采用人类类似的深度思考,在各种任务上取得了令人印象深刻的性能。然而,冗长的思考过程极大地增加了推理开销,使效率成为关键瓶颈。在这项工作中,我们首先证明,对于相对简单的任务,NoThinking(提示推理模型跳过思考并直接生成最终解决方案)在性能和效率方面都是更好的选择。在此基础上,我们提出了AdaptThink,这是一种新型强化学习算法,用于教授推理模型根据问题难度自适应选择最佳思考模式。具体来说,AdaptThink具有两个核心组件:(1)约束优化目标,鼓励模型在保持整体性能的同时选择NoThinking;(2)重要性采样策略,在策略内训练中平衡思考和NoThinking样本,从而实现冷启动并允许模型在整个训练过程中探索和利用两种思考模式。我们的实验表明,AdaptThink显著降低了推理成本,并进一步提高性能。值得注意的是,在三个数学数据集上,AdaptThink将DeepSeek-R1-Distill-Qwen-1.5B的平均响应长度减少了53%,同时提高了其准确性2.4%,这突显了自适应选择思考模式在优化推理质量和效率之间的平衡方面的潜力。我们的代码和模型可在https://github.com/THU-KEG/AdaptThink上获取。

论文及项目相关链接

PDF

Summary

本文提出在简单的任务上采用NoThinking策略,使推理模型跳过思考过程直接生成答案,以提高性能和效率。在此基础上,作者提出了一种名为AdaptThink的新型强化学习算法,该算法使推理模型能够根据问题的难易程度自适应选择最佳思考模式。AdaptThink包含两个核心组件:一是通过约束优化目标鼓励模型在维持总体性能的同时选择NoThinking;二是采用重要性采样策略,在训练过程中平衡思考和NoThinking样本,使模型能够探索和利用两种思考模式。实验表明,AdaptThink在减少推理成本的同时提高了性能,特别是在三个数学数据集上的表现尤为突出。

Key Takeaways

  1. NoThinking策略在简单任务上能提高推理模型的性能和效率,通过跳过思考过程直接生成答案。
  2. AdaptThink是一种新型强化学习算法,根据问题的难易程度使推理模型自适应选择最佳思考模式。
  3. AdaptThink包含约束优化目标和重要性采样策略两个核心组件。
  4. 约束优化目标鼓励模型在维持总体性能的同时选择NoThinking。
  5. 重要性采样策略在训练过程中平衡思考和NoThinking样本,使模型能够探索和利用两种思考模式。
  6. 实验表明,AdaptThink在减少推理成本的同时提高了性能。
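
下面给出两个小函数,示意摘要中“约束优化目标”与“重要性采样混合两种思考模式样本”的直观含义(非官方实现);奖励加成 delta、混合比例 ratio 等均为假设的超参数,真实算法还会在梯度层面做相应的重要性加权。

```python
import random

def adaptthink_style_reward(is_correct: bool, used_nothinking: bool, delta: float = 0.05):
    """在保持正确性的前提下, 对选择NoThinking(跳过思考)给予小额奖励,
    对应"鼓励NoThinking但不牺牲总体性能"的约束优化思想(示意)。"""
    reward = 1.0 if is_correct else 0.0
    return reward + (delta if used_nothinking else 0.0)

def mixed_mode_batch(thinking_samples, nothinking_samples, ratio=0.5, seed=0):
    """按固定比例混合两种思考模式的样本, 使冷启动阶段两种模式都能被探索(示意)。"""
    rng = random.Random(seed)
    n = min(len(thinking_samples), len(nothinking_samples))
    k = int(n * ratio)
    return rng.sample(nothinking_samples, k) + rng.sample(thinking_samples, n - k)
```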

Cool Papers

点此查看论文截图

MR. Judge: Multimodal Reasoner as a Judge

Authors:Renjie Pi, Felix Bai, Qibin Chen, Simon Wang, Jiulong Shan, Kieran Liu, Meng Cao

The paradigm of using Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) as evaluative judges has emerged as an effective approach in RLHF and inference-time scaling. In this work, we propose Multimodal Reasoner as a Judge (MR. Judge), a paradigm for empowering general-purpose MLLMs judges with strong reasoning capabilities. Instead of directly assigning scores for each response, we formulate the judgement process as a reasoning-inspired multiple-choice problem. Specifically, the judge model first conducts deliberate reasoning covering different aspects of the responses and eventually selects the best response from them. This reasoning process not only improves the interpretibility of the judgement, but also greatly enhances the performance of MLLM judges. To cope with the lack of questions with scored responses, we propose the following strategy to achieve automatic annotation: 1) Reverse Response Candidates Synthesis: starting from a supervised fine-tuning (SFT) dataset, we treat the original response as the best candidate and prompt the MLLM to generate plausible but flawed negative candidates. 2) Text-based reasoning extraction: we carefully design a data synthesis pipeline for distilling the reasoning capability from a text-based reasoning model, which is adopted to enable the MLLM judges to regain complex reasoning ability via warm up supervised fine-tuning. Experiments demonstrate that our MR. Judge is effective across a wide range of tasks. Specifically, our MR. Judge-7B surpasses GPT-4o by 9.9% on VL-RewardBench, and improves performance on MM-Vet during inference-time scaling by up to 7.7%.

使用大型语言模型(LLM)和多模态大型语言模型(MLLM)作为评估法官的范式,已经成为RLHF和推理时扩展中的一种有效方法。在这项工作中,我们提出了“多模态推理法官(MR. Judge)”这一范式,旨在赋予通用MLLM法官强大的推理能力。与直接为每个回答分配分数的方式不同,我们将判断过程表述为受推理启发的多项选择问题。具体来说,法官模型首先进行涵盖回答各个方面的深入推理,并最终从中选择最佳回答。这种推理过程不仅提高了判断的可解释性,还大大提升了MLLM法官的性能。为了应对缺乏带分数回答的问题,我们提出以下自动标注策略:1)反向响应候选合成:从监督微调(SFT)数据集出发,将原始响应视为最佳候选,并提示MLLM生成看似合理但存在缺陷的负面候选;2)基于文本的推理提取:我们精心设计了一个数据合成管道,用于从基于文本的推理模型中蒸馏推理能力,并通过预热监督微调使MLLM法官重新获得复杂的推理能力。实验表明,我们的MR. Judge在多种任务上均有效。具体来说,我们的MR. Judge-7B在VL-RewardBench上比GPT-4o高出9.9%,并在推理时扩展中将MM-Vet上的性能最多提高7.7%。

论文及项目相关链接

PDF

Summary

该文本介绍了使用大型语言模型(LLMs)和多模态大型语言模型(MLLMs)作为评价者的范式在强化学习人类反馈(RLHF)和推理时间缩放中的有效性。提出一种名为“多模态判断者”(MR. Judge)的范式,通过将其转化为推理启发式的多选题问题,赋予通用MLLMs评价者强大的推理能力。为了提高评估的透明度并增强MLLM评价者的性能,将判断过程转化为深思熟虑的推理过程。为解决缺乏带分数响应的问题,提出了自动标注策略,并通过实验验证了MR. Judge的有效性。

Key Takeaways

  1. 大型语言模型(LLMs)和多模态大型语言模型(MLLMs)作为评价者已成为有效的评估方法。
  2. 提出了一种名为“多模态判断者”(MR. Judge)的范式,将判断过程转化为推理启发式的多选题问题,以提高通用MLLMs评价者的推理能力。
  3. 通过将判断过程转化为深思熟虑的推理,不仅提高了评价的透明度,还增强了MLLM评价者的性能。
  4. 针对缺乏带分数响应的问题,提出了自动标注策略,包括反向响应候选合成和基于文本的理由提取方法。
  5. MR. Judge-7B在VL-RewardBench上的表现优于GPT-4o达9.9%。
  6. 在推理时间缩放时,MR. Judge能提高MM-Vet的性能,提高幅度高达7.7%。
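
下面是把“先推理、再从候选回答中选出最优者”表述为多选题的一个提示模板示意(并非论文的原始模板),仅用于说明这种判断范式的形式。

```python
def build_judge_prompt(question: str, responses: list) -> str:
    """把候选回答编号为选项, 要求评判模型先逐项分析再输出最优选项的字母。"""
    letters = "ABCDEFGH"
    options = "\n\n".join(f"({letters[i]}) {r}" for i, r in enumerate(responses))
    return (
        "You are a judge. First reason step by step about the strengths and weaknesses "
        "of each candidate response, then output the letter of the best one.\n\n"
        f"Question: {question}\n\nCandidate responses:\n\n{options}\n\n"
        "Reasoning and final choice:"
    )
```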

Cool Papers

点此查看论文截图

Advancing Generalization Across a Variety of Abstract Visual Reasoning Tasks

Authors:Mikołaj Małkiński, Jacek Mańdziuk

The abstract visual reasoning (AVR) domain presents a diverse suite of analogy-based tasks devoted to studying model generalization. Recent years have brought dynamic progress in the field, particularly in i.i.d. scenarios, in which models are trained and evaluated on the same data distributions. Nevertheless, o.o.d. setups that assess model generalization to new test distributions remain challenging even for the most recent models. To advance generalization in AVR tasks, we present the Pathways of Normalized Group Convolution model (PoNG), a novel neural architecture that features group convolution, normalization, and a parallel design. We consider a wide set of AVR benchmarks, including Raven’s Progressive Matrices and visual analogy problems with both synthetic and real-world images. The experiments demonstrate strong generalization capabilities of the proposed model, which in several settings outperforms the existing literature methods.

抽象视觉推理(AVR)领域提供了一系列基于类比的任务,专门用于研究模型的泛化能力。近年来,该领域取得了迅速进展,特别是在独立同分布(i.i.d.)场景中,即模型在相同的数据分布上进行训练和评估。然而,评估模型对新测试分布泛化能力的分布外(o.o.d.)设置,即使对最新的模型而言也仍然具有挑战性。为了推进AVR任务的泛化能力,我们提出了路径归一化组卷积模型(PoNG),这是一种新型神经网络架构,具有组卷积、归一化和并行设计的特点。我们考虑了一系列AVR基准测试,包括Raven渐进矩阵以及包含合成图像和真实世界图像的视觉类比问题。实验表明,该模型具有很强的泛化能力,并且在多种设置下优于现有文献中的方法。

论文及项目相关链接

PDF Accepted to the 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)

Summary

近期在抽象视觉推理(AVR)领域出现了动态进步,特别是在独立同分布(i.i.d.)场景中。然而,对于模型推广到新的测试分布(o.o.d.)设置仍是一大挑战。为推进AVR任务的泛化能力,提出了Pathways of Normalized Group Convolution模型(PoNG),它采用分组卷积、归一化和并行设计的新神经网络架构。经过在广泛的AVR基准测试集(包括Raven的渐进矩阵和视觉类比问题,包括合成图像和真实世界图像)上的实验验证,该模型展现出强大的泛化能力,并在某些情况下优于现有文献方法。

Key Takeaways

  1. AVR领域近年来在模型泛化方面取得动态进步,特别是在i.i.d.场景。
  2. 模型在o.o.d.设置下泛化至新测试分布仍具挑战性。
  3. 提出了一种新的神经网络架构——Pathways of Normalized Group Convolution模型(PoNG),融合了分组卷积、归一化和并行设计。
  4. PoNG模型在多种AVR基准测试集上进行了实验验证。
  5. 实验结果表明PoNG模型展现出强大的泛化能力。
  6. 在某些情况下,PoNG模型的性能优于现有文献方法。
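
摘要只给出了“分组卷积、归一化、并行设计”这几个要素,下面用PyTorch拼出一个体现这些要素的示意模块(非官方PoNG实现),分组数、通道数等均为假设。

```python
import torch
from torch import nn

class GroupConvParallelBlock(nn.Module):
    """并行的多条分组卷积路径, 各自归一化与激活后在通道维拼接, 再用1x1卷积融合。"""

    def __init__(self, channels: int = 32, groups=(1, 2, 4)):
        super().__init__()
        self.paths = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=g),
                nn.GroupNorm(num_groups=g, num_channels=channels),
                nn.ReLU(),
            )
            for g in groups
        )
        self.merge = nn.Conv2d(channels * len(groups), channels, kernel_size=1)

    def forward(self, x):
        return self.merge(torch.cat([path(x) for path in self.paths], dim=1))

# 用法示意: y = GroupConvParallelBlock()(torch.randn(2, 32, 16, 16))
```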

Cool Papers

点此查看论文截图

Thinkless: LLM Learns When to Think

Authors:Gongfan Fang, Xinyin Ma, Xinchao Wang

Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model’s ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, for concise responses and for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50% - 90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless

推理语言模型能够执行扩展的思维链推理,在需要复杂逻辑推断的任务上表现出了显著的性能。然而,对所有查询都进行精细推理往往会导致大量的计算效率低下,特别是当许多问题都有简单直接的解决方案时。这引发了一个开放性的问题:LLM能否学会何时思考?为了回答这个问题,我们提出了Thinkless,这是一个可学习的框架,能够让LLM根据任务复杂度和模型自身能力,自适应地在短形式推理和长形式推理之间进行选择。Thinkless在强化学习范式下进行训练,并采用两个控制令牌,分别用于简洁回应和详细推理。我们方法的核心是解耦组相对策略优化(DeGRPO)算法,它将混合推理的学习目标分解为两个部分:(1)控制令牌损失,用于控制推理模式的选择;(2)响应损失,用于提高生成答案的准确性。这种解耦的形式使我们能够精细控制每个目标的贡献,从而稳定训练,并有效防止原始GRPO中观察到的崩溃现象。实证结果表明,在Minerva Algebra、MATH-500和GSM8K等多个基准测试中,Thinkless能将长链思考的使用减少50%-90%,显著提高了推理语言模型的效率。代码可在https://github.com/VainF/Thinkless找到。

论文及项目相关链接

PDF

Summary

能进行链式思维推理的语言模型已在复杂逻辑推断任务上展现出卓越性能。然而,对所有查询进行精细推理会导致大量计算资源浪费,特别是在许多问题存在简单解决方案时。为解决此问题,提出了Thinkless框架,使语言模型能自适应选择简洁或长篇推理模式,这取决于任务复杂性和模型能力。Thinkless采用强化学习模式训练,使用两个控制符号来分别控制简短回答和详细推理。其核心算法为解耦组相对策略优化(DeGRPO),将混合推理的学习目标分解为两部分:控制符号损失和响应损失。在Minerva Algebra、MATH-500和GSM8K等多个基准测试集上,Thinkless将长篇推理的使用量减少了50%-90%,显著提高了推理语言模型的效率。

Key Takeaways

  1. 语言模型在复杂逻辑推断任务上表现出色,但全量精细推理引发计算资源浪费问题。
  2. Thinkless框架使语言模型能自适应选择简洁或长篇推理模式。
  3. Thinkless采用强化学习模式训练,使用两个控制符号。
  4. 解耦组相对策略优化(DeGRPO)算法将混合推理的学习目标分解为控制符号损失和响应损失两部分。
  5. Thinkless框架能显著减少长篇推理的使用量,提高语言模型的效率。
  6. 该框架在多个基准测试集上进行了实证验证。
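
下面用几行PyTorch示意“把控制令牌的损失与回答令牌的损失解耦加权”的思路(非官方DeGRPO实现):ctrl_logprob 为模式控制令牌的对数概率,resp_logprobs 为其余回答令牌的对数概率,两个权重为假设的超参数。

```python
import torch

def decoupled_policy_loss(ctrl_logprob: torch.Tensor,
                          resp_logprobs: torch.Tensor,
                          advantage: float,
                          ctrl_weight: float = 1.0,
                          resp_weight: float = 1.0) -> torch.Tensor:
    """控制令牌项决定"思考/不思考"的模式选择, 回答项改进答案质量;
    两项分别加权后再按组相对优势做策略梯度更新(示意)。"""
    ctrl_term = -advantage * ctrl_logprob                # 控制令牌损失
    resp_term = -advantage * resp_logprobs.mean()        # 响应(回答令牌)损失
    return ctrl_weight * ctrl_term + resp_weight * resp_term
```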

Cool Papers

点此查看论文截图

Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning

Authors:Adam Štorek, Mukur Gupta, Samira Hajizadeh, Prashast Srivastava, Suman Jana

Although modern Large Language Models (LLMs) support extremely large contexts, their effectiveness in utilizing long context for code reasoning remains unclear. This paper investigates LLM reasoning ability over code snippets within large repositories and how it relates to their recall ability. Specifically, we differentiate between lexical code recall (verbatim retrieval) and semantic code recall (remembering what the code does). To measure semantic recall, we propose SemTrace, a code reasoning technique where the impact of specific statements on output is attributable and unpredictable. We also present a method to quantify semantic recall sensitivity in existing benchmarks. Our evaluation of state-of-the-art LLMs reveals a significant drop in code reasoning accuracy as a code snippet approaches the middle of the input context, particularly with techniques requiring high semantic recall like SemTrace. Moreover, we find that lexical recall varies by granularity, with models excelling at function retrieval but struggling with line-by-line recall. Notably, a disconnect exists between lexical and semantic recall, suggesting different underlying mechanisms. Finally, our findings indicate that current code reasoning benchmarks may exhibit low semantic recall sensitivity, potentially underestimating LLM challenges in leveraging in-context information.

尽管现代的大型语言模型(LLM)支持极大的上下文环境,但它们利用长上下文进行代码推理的有效性仍不明确。本文调查了LLM在大仓库中的代码片段推理能力,以及其与回忆能力的关联。具体来说,我们对词汇代码回忆(逐字检索)和语义代码回忆(记住代码的功能)进行了区分。为了衡量语义回忆,我们提出了SemTrace这一代码推理技术,在该技术中,特定语句对输出的影响是可归因且不可预测的。我们还提出了一种量化现有基准测试中语义回忆敏感度的方法。我们对最新LLM的评估表明,随着代码片段接近输入上下文的中间部分,代码推理的准确性大幅下降,特别是在需要高度语义回忆的技术如SemTrace中。此外,我们发现词汇回忆的表现因粒度而异,模型在函数检索方面表现出色,但在逐行回忆方面遇到困难。值得注意的是,词汇回忆和语义回忆之间存在脱节,表明两者背后存在不同的基础机制。最后,我们的研究结果表明,当前的代码推理基准测试可能表现出较低的语义回忆敏感性,可能会低估LLM在利用上下文信息方面的挑战。

论文及项目相关链接

PDF

Summary

大规模语言模型(LLM)在代码推理方面的能力尚待明确。本文探讨了LLM在处理大型代码库中的代码片段时的推理能力及其与召回能力的关联。文章区分了词汇代码召回(逐字检索)和语义代码召回(记住代码的功能)。为了衡量语义召回,我们提出了SemTrace这一代码推理技术,其中特定语句对输出的影响可归因且不可预测。文章还提出了量化现有基准语义召回敏感性的方法。对先进LLM的评估表明,代码推理准确性会随着代码片段接近输入上下文的中间位置而显著下降,特别是在SemTrace这类需要高语义召回的技术中。此外,还发现词汇召回的表现因粒度而异,模型擅长函数检索,但在逐行召回方面表现不佳。研究还发现词汇召回和语义召回之间存在脱节,提示两者背后有不同的机制。本文的发现表明,当前代码推理基准测试的语义召回敏感性可能较低,可能会低估LLM在利用上下文信息方面的挑战。

Key Takeaways

  1. LLM在代码推理方面的能力尚待明确。
  2. 文章探讨了LLM处理大型代码库中的代码片段时的推理能力,及其与召回能力的关系。
  3. 区分了词汇代码召回(逐字检索)和语义代码召回(理解代码功能)。
  4. 提出SemTrace这一代码推理技术来衡量语义召回。
  5. 先进LLM在代码推理方面存在准确性问题,特别是在需要高语义召回的技术中。
  6. LLM在词汇召回的粒度上表现不同,擅长函数检索但逐行召回困难。
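
SemTrace的关键性质是“每条语句对输出的影响可归因且不可预测”。下面的小函数按这一思路随机生成一段玩具代码片段及其真实输出(仅为示意,并非论文的数据构造流程):由于每一步施加的变换是随机选取的,只有真正记住每行代码做了什么,才能推出最终打印值。

```python
import random

def make_semtrace_style_snippet(seed: int, n_steps: int = 6):
    """返回(代码文本, 真实输出): 真实输出可用来核对模型对该片段的推理结果。"""
    rng = random.Random(seed)
    ops = [("x = x + {}", lambda x, c: x + c),
           ("x = x * {}", lambda x, c: x * c),
           ("x = x ^ {}", lambda x, c: x ^ c)]     # ^ 为整数按位异或
    lines, x = ["x = 1"], 1
    for _ in range(n_steps):
        template, fn = rng.choice(ops)
        c = rng.randint(2, 9)
        lines.append(template.format(c))
        x = fn(x, c)
    lines.append("print(x)")
    return "\n".join(lines), x
```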

Cool Papers

点此查看论文截图

J4R: Learning to Judge with Equivalent Initial State Group Relative Preference Optimization

Authors:Austin Xu, Yilun Zhou, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty

To keep pace with the increasing pace of large language models (LLM) development, model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel in evaluating relatively simple domains, like chat quality, but struggle in reasoning intensive domains where model responses contain more substantive and challenging content. To remedy existing judge shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. (3) We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, matching or exceeding the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.

随着大型语言模型(LLM)发展步伐的加快,模型输出评估已从耗时的人力评估转向自动评估。在自动评估中,LLM本身被赋予评估和批判其他模型输出的任务。LLM作为评判者的模型是一类生成式评估器,擅长评估相对简单的领域(如聊天质量),但在需要大量推理的领域表现不佳,因为这些领域的模型回应包含更多实质性且具有挑战性的内容。为了弥补现有评判者的不足,我们探索使用强化学习(RL)来训练评判者。我们做出了三个关键贡献:(1)我们提出了等效初始状态组相对策略优化(EIS-GRPO)算法,使我们能够训练评判者,使其对更复杂评估环境中出现的位置偏差保持稳健。(2)我们推出了ReasoningJudgeBench,这是一个在先前工作未涵盖的多样化推理场景中评估评判者的基准测试集。(3)我们训练了用于推理的评判者(J4R),这是一个使用EIS-GRPO训练的7B评判者,其性能分别比GPT-4o和次佳的小型评判者高出6.7%和9%,在JudgeBench和ReasoningJudgeBench上的表现与更大的GRPO训练评判者相当甚至更好。

论文及项目相关链接

PDF 25 pages, 4 figures, 6 tables. To be updated with links for code/benchmark

Summary

大型语言模型(LLM)的发展推动了模型输出评估的转变,从耗时的人力评估转向自动评估。本文探讨了使用强化学习(RL)训练LLM作为评价者的新方法,解决了现有评价者在复杂领域中的短板问题。研究的主要贡献包括:提出等效初始状态组相对策略优化(EIS-GRPO)算法,增强评价者对位置偏差的稳健性;推出ReasoningJudgeBench基准测试,涵盖多样化的推理场景;训练出Judge for Reasoning(J4R),该模型以EIS-GRPO算法训练,性能超过GPT-4o和其他小型评价者模型,同时在JudgeBench和ReasoningJudgeBench上表现出匹配或超越更大规模GRPO训练评价者的性能。

Key Takeaways

  • LLM的快速发展推动了模型输出评估从人力评估向自动评估的转变。
  • LLM自身被用来评估和批判其他模型输出,其中LLM-as-judge模型在简单领域如聊天质量评价中表现优异,但在需要大量推理的领域表现较差。
  • 强化学习(RL)被用于改进现有评价者的不足,通过训练增强评价者的稳健性。
  • 提出了Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO)算法,帮助评价者应对复杂环境下的位置偏差问题。
  • 推出了ReasoningJudgeBench基准测试,用于评估在多样化推理场景中的评价者性能。
  • 训练了Judge for Reasoning (J4R) 模型,性能优于其他小型评价者模型,并在两个基准测试中表现出良好的性能。
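
“等效初始状态”的一种直观做法是交换两个候选回答的位置并相应翻转标签,再把同一实例的两种排列放入同一组计算相对优势,从而抑制位置偏差。下面的Python片段按这一理解给出示意(非官方实现),其中假设 resp_a 是更优的回答,提示模板也只是说明用途的假设写法。

```python
def build_pair_prompt(question: str, first: str, second: str) -> str:
    return (f"Question: {question}\n\n(A) {first}\n\n(B) {second}\n\n"
            "Which response is better? Reason step by step, then answer A or B.")

def equivalent_initial_states(question: str, resp_a: str, resp_b: str):
    """假设resp_a为更优回答: 两种排列的gold标签随位置交换而翻转,
    训练时把这两个样本放入同一组做组相对优势计算(示意)。"""
    return [
        {"prompt": build_pair_prompt(question, resp_a, resp_b), "gold": "A"},
        {"prompt": build_pair_prompt(question, resp_b, resp_a), "gold": "B"},
    ]
```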

Cool Papers

点此查看论文截图

Unlabeled Data or Pre-trained Model: Rethinking Semi-Supervised Learning and Pretrain-Finetuning

Authors:Song-Lin Li, Rui Zhu, Yu-Feng Li, Lan-Zhe Guo

Semi-supervised learning (SSL) alleviates the cost of data labeling process by exploiting unlabeled data, and has achieved promising results on various tasks such as image classification. Meanwhile, the Pretrain-Finetuning paradigm has garnered significant attention in recent years, and exploiting pre-trained models could also reduce the requirement of labeled data in downstream tasks. Therefore, a question naturally occurs: \emph{When the labeled data is scarce in the target tasks, should we exploit unlabeled data or pre-trained models?} To answer this question, we select pre-trained Vision-Language Models (VLMs) as representative pretrain-finetuning instances and propose \textit{Few-shot SSL} – a framework that enables fair comparison between these two paradigms by controlling the amount of labeled data used. Extensive experiments across various settings demonstrate that pre-trained VLMs generally outperform SSL methods in nearly all cases, except when the data has low resolution or lacks clear semantic structure. Therefore, we encourage future SSL research to compare with pre-trained models and explore deeper integration, such as using pre-trained knowledge to enhance pseudo-labeling. To support future research, we release our unified reproduction and evaluation framework. Codes are available at https://anonymous.4open.science/r/Rethinking-SSL-and-Pretrain-Finetuning-5566

半监督学习(SSL)通过利用未标记数据减轻了数据标记过程的成本,并在图像分类等各种任务上取得了有前景的结果。同时,近年来预训练微调范式也引起了人们的广泛关注,利用预训练模型也可以减少下游任务中对标记数据的需求。因此,一个自然而然的问题是:当目标任务的标记数据稀缺时,我们应该利用未标记数据还是预训练模型?为了回答这个问题,我们选择预训练的视觉语言模型(VLM)作为代表性的预训练微调实例,并提出了“Few-shot SSL”框架,通过控制使用的标记数据量,使这两种范式之间能够进行公平的比较。在多种设置的大量实验表明,除数据分辨率低或缺乏明确语义结构的情况外,预训练的VLM通常在所有情况下都优于SSL方法。因此,我们鼓励未来的SSL研究与预训练模型进行比较,探索更深入的集成,例如利用预训练知识来提高伪标签的质量。为了支持未来的研究,我们发布了统一的复制和评估框架。代码可在https://anonymous.4open.science/r/Rethinking-SSL-and-Pretrain-Finetuning-5566找到。

论文及项目相关链接

PDF

Summary
在半监督学习(SSL)可以利用无标签数据降低数据标注成本并在图像分类等任务上取得有前景的结果的同时,预训练微调(Pretrain-Finetuning)范式近年来也备受关注,利用预训练模型可以降低下游任务对标注数据的需求。当目标任务的标注数据稀缺时,我们应该如何利用无标签数据和预训练模型?为回答这个问题,我们以预训练视觉语言模型(VLMs)为例子,提出了“小样本SSL”框架,通过控制使用的标注数据量,公平地比较这两种范式。大量实验表明,除了数据分辨率低或语义结构不明确的情况外,预训练的VLMs在几乎所有情况下都优于SSL方法。因此,我们鼓励未来的SSL研究要与预训练模型进行比较,探索更深入的集成,如利用预训练知识增强伪标签。

Key Takeaways

  1. 半监督学习(SSL)能够利用无标签数据降低数据标注成本,并在多个任务上取得良好效果。
  2. 预训练微调(Pretrain-Finetuning)范式备受关注,可降低下游任务对标注数据的需求。
  3. 在目标任务的标注数据稀缺时,应探讨如何平衡利用无标签数据和预训练模型。
  4. 预训练的视觉语言模型(VLMs)在多种设置下的实验表现普遍优于SSL方法,除非数据分辨率低或语义结构不明确。
  5. 未来的SSL研究需要更多地与预训练模型进行比较和集成。
  6. 利用预训练知识增强伪标签是一种潜在的深入集成方法。
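
下面的片段示意“控制每类仅使用k个带标签样本,再在相同标注预算下分别评测SSL方法与预训练模型”的比较框架(非官方实现);train_ssl 与 eval_pretrained 为假设的接口,各自返回在测试集上的准确率。

```python
import random
from collections import defaultdict

def sample_k_per_class(labeled, k, seed=0):
    """labeled: [(x, y), ...]; 每个类别随机保留k个带标签样本, 用于控制标注量。"""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in labeled:
        by_class[y].append((x, y))
    return [ex for group in by_class.values() for ex in rng.sample(group, min(k, len(group)))]

def compare_paradigms(labeled, unlabeled, test, train_ssl, eval_pretrained, k=4):
    """在相同的k-shot标注预算下比较两种范式(示意)。"""
    few_shot = sample_k_per_class(labeled, k)
    return {
        "ssl_accuracy": train_ssl(few_shot, unlabeled, test),          # 少量标签 + 无标签数据
        "pretrain_finetune_accuracy": eval_pretrained(few_shot, test), # 预训练模型 + 少量标签
    }
```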

Cool Papers

点此查看论文截图

Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

Authors:Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, Zilong Zheng

Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model’s latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.

推理能力作为人类智能的核心组成部分,在追求通用人工智能(AGI)的过程中仍然对大型语言模型(LLM)构成重大挑战。尽管模型性能在训练缩放定律下有所提升,但仍面临诸多挑战,特别是在训练算法方面(如灾难性遗忘),以及新颖训练数据的有限可用性。作为一种替代方案,测试时缩放通过在测试时增加计算量而不更新参数来提高推理性能。不同于该范式中以往聚焦于令牌空间的方法,我们提出利用潜在空间进行更有效的推理,并更好地遵循测试时缩放定律。我们介绍了LatentSeek,这是一个通过在模型潜在空间内进行测试时实例级适应(TTIA)来增强LLM推理能力的新型框架。具体来说,LatentSeek利用策略梯度,在自我生成的奖励信号引导下迭代更新潜在表示。LatentSeek在一系列推理基准测试上进行了评估,包括GSM8K、MATH-500和AIME2024,并覆盖多种LLM架构。结果表明,LatentSeek持续优于思维链提示和基于微调的方法等强基线。此外,我们的分析表明,LatentSeek非常高效,对于平均复杂度的问题通常在少数几次迭代内即可收敛,同时也能从额外的迭代中受益,从而凸显了潜在空间中测试时缩放的潜力。这些发现使LatentSeek成为一种轻量、可扩展且有效的提升LLM推理能力的解决方案。

论文及项目相关链接

PDF

Summary

本文探讨了大型语言模型(LLM)在追求人工智能通用智能(AGI)的过程中面临的核心挑战,特别是其在推理能力上的局限性。为了提升LLM的推理性能,本文提出了一种新型的测试时间缩放方法——LatentSeek框架。该框架利用测试时间实例级适应(TTIA)在模型的潜在空间内增强推理能力。实验结果表明,LatentSeek在多个推理基准测试上表现优异,包括GSM8K、MATH-500和AIME2024等多个架构的LLM。其优势在于高效且可缩放,有望成为提升LLM推理能力的轻量级解决方案。

Key Takeaways

  1. 大型语言模型(LLM)在追求人工智能通用智能(AGI)的过程中仍面临推理能力的挑战。
  2. 尽管模型性能已有所提升,但仍存在训练算法的挑战,如灾难性遗忘和新型训练数据的有限性。
  3. 测试时间缩放通过增加测试时间计算而不更新参数来提高推理性能。
  4. LatentSeek框架利用潜在空间内的测试时间实例级适应(TTIA)增强LLM的推理能力。
  5. LatentSeek使用策略梯度来迭代更新潜在表示,由自我生成的奖励信号引导。
  6. LatentSeek在多个推理基准测试中表现优异,如GSM8K、MATH-500和AIME2024等。
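
下面给出一个测试时在潜在空间做REINFORCE式搜索的极简示意(非官方LatentSeek实现):把实例级潜在表示 z 当作可更新的参数,从以 z 为均值的高斯分布采样、用自生成奖励估计策略梯度并迭代更新;decode_fn 与 reward_fn 为假设的接口,步数、学习率、噪声尺度均为假设的超参数。

```python
import torch

def latent_seek_sketch(decode_fn, reward_fn, z_init, steps=8, samples=4, lr=0.05, sigma=0.01):
    """decode_fn(z) -> 文本答案; reward_fn(answer) -> 标量奖励(可由模型自评产生)。"""
    z = z_init.detach().clone().requires_grad_(True)
    for _ in range(steps):
        logps, rewards = [], []
        for _ in range(samples):
            z_sample = (z + sigma * torch.randn_like(z)).detach()        # 以z为均值采样
            rewards.append(float(reward_fn(decode_fn(z_sample))))        # 自生成奖励信号
            logps.append(-0.5 * (((z_sample - z) / sigma) ** 2).sum())   # log N(z_sample; z, sigma^2) + 常数
        reward_t = torch.tensor(rewards)
        advantage = reward_t - reward_t.mean()                           # 简单基线降方差
        loss = -(advantage * torch.stack(logps)).mean()
        grad, = torch.autograd.grad(loss, z)
        with torch.no_grad():
            z -= lr * grad                                               # 实例级的潜在表示更新
    return decode_fn(z.detach())
```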

Cool Papers

点此查看论文截图

RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning

Authors:Qiguang Chen, Libo Qin, Jinhao Liu, Yue Liao, Jiaqi Wang, Jingxuan Zhou, Wanxiang Che

Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models (LLMs) on complex tasks, spurring research into its underlying mechanisms. However, two primary challenges remain for real-world applications: (1) the lack of quantitative metrics and actionable guidelines for evaluating and optimizing measurable boundaries of CoT capability, and (2) the absence of methods to assess boundaries of unmeasurable CoT capability, such as multimodal perception. To address these gaps, we introduce the Reasoning Boundary Framework++ (RBF++). To tackle the first challenge, we define the reasoning boundary (RB) as the maximum limit of CoT performance. We also propose a combination law for RBs, enabling quantitative analysis and offering actionable guidance across various CoT tasks. For the second challenge, particularly in multimodal scenarios, we introduce a constant assumption, which replaces unmeasurable RBs with scenario-specific constants. Additionally, we propose the reasoning boundary division mechanism, which divides unmeasurable RBs into two sub-boundaries, facilitating the quantification and optimization of both unmeasurable domain knowledge and multimodal perception capabilities. Extensive experiments involving 38 models across 13 tasks validate the feasibility of our framework in cross-modal settings. Additionally, we evaluate 10 CoT strategies, offer insights into optimization and decay from two complementary perspectives, and expand evaluation benchmarks for measuring RBs in LLM reasoning. We hope this work advances the understanding of RBs and optimization strategies in LLMs. Code and data are available at https://github.com/LightChen233/reasoning-boundary.

链式思维(Chain-of-Thought,简称CoT)推理在提升大型语言模型(LLM)的复杂任务效率方面表现出了显著的成效,这引发了对其底层机制的研究热潮。然而,对于真实世界的应用来说,还存在两个主要的挑战:一是缺乏定量指标和可操作的指南来评估和优化可衡量的CoT能力边界;二是缺乏评估不可测量的CoT能力边界的方法,如多模态感知。为了解决这些空白,我们引入了Reasoning Boundary Framework++(RBF++)。针对第一个挑战,我们将推理边界(RB)定义为CoT性能的最大极限,并提出了一种RB组合定律,能够在各种CoT任务中进行定量分析并提供可操作的指导。对于第二个挑战,特别是在多模态场景中,我们引入了一个恒定假设,用场景特定的常数来替代不可测量的RBs。此外,我们还提出了推理边界划分机制,将不可测量的RBs分为两个子边界,便于量化和优化不可测量的领域知识和多模态感知能力。我们对38个模型进行了涉及13项任务的广泛实验,验证了我们的框架在跨模态设置中的可行性。此外,我们还评估了10种CoT策略,从两个互补的角度为优化和衰退提供了见解,并扩展了用于测量LLM推理中RB的评价基准。我们希望这项工作能推动对LLMs中的RB和优化策略的理解。相关代码和数据可通过https://github.com/LightChen233/reasoning-boundary访问。

论文及项目相关链接

PDF Manuscript

Summary

本文介绍了Chain-of-Thought(CoT)推理在大型语言模型(LLM)中的有效性及其在复杂任务中的性能增强。针对现实应用中的两个主要挑战——缺乏定量指标和可操作的指南来评估和优化可测量的CoT能力边界,以及无法评估不可测量的CoT能力边界(如多模态感知)的问题,提出了Reasoning Boundary Framework++(RBF++)。通过定义推理边界(RB)和组合定律来解决第一个挑战,为各种CoT任务提供定量分析和可操作指导。对于第二个挑战,特别是在多模态场景中,通过恒定假设替换不可测量的RBs,并提出推理边界划分机制,将不可测量的RBs分为两个子边界,从而量化并优化不可测量的领域知识和多模态感知能力。实验验证了该框架在跨模态设置中的可行性。

Key Takeaways

  1. Chain-of-Thought (CoT) 推理增强了大型语言模型(LLM)在复杂任务上的性能。
  2. 存在两个主要挑战:缺乏评估和优化CoT能力边界的定量指标和可操作方法,以及无法评估不可测量的CoT能力边界(如多模态感知)。
  3. 引入Reasoning Boundary Framework++(RBF++)来解决这些挑战。
  4. 定义推理边界(RB)作为CoT性能的最大限制,并提出针对各种CoT任务的组合定律。
  5. 通过恒定假设处理不可测量的RBs,并在多模态场景中引入推理边界划分机制。
  6. 广泛实验验证了框架在跨模态设置中的可行性。

Cool Papers

点此查看论文截图

I’ll believe it when I see it: Images increase misinformation sharing in Vision-Language Models

Authors:Alice Plebe, Timothy Douglas, Diana Riazi, R. Maria del Rio-Chanona

Large language models are increasingly integrated into news recommendation systems, raising concerns about their role in spreading misinformation. In humans, visual content is known to boost credibility and shareability of information, yet its effect on vision-language models (VLMs) remains unclear. We present the first study examining how images influence VLMs’ propensity to reshare news content, whether this effect varies across model families, and how persona conditioning and content attributes modulate this behavior. To support this analysis, we introduce two methodological contributions: a jailbreaking-inspired prompting strategy that elicits resharing decisions from VLMs while simulating users with antisocial traits and political alignments; and a multimodal dataset of fact-checked political news from PolitiFact, paired with corresponding images and ground-truth veracity labels. Experiments across model families reveal that image presence increases resharing rates by 4.8% for true news and 15.0% for false news. Persona conditioning further modulates this effect: Dark Triad traits amplify resharing of false news, whereas Republican-aligned profiles exhibit reduced veracity sensitivity. Of all the tested models, only Claude-3-Haiku demonstrates robustness to visual misinformation. These findings highlight emerging risks in multimodal model behavior and motivate the development of tailored evaluation frameworks and mitigation strategies for personalized AI systems. Code and dataset are available at: https://github.com/3lis/misinfo_vlm

大型语言模型在新闻推荐系统中的集成日益增多,引发了人们对它们传播错误信息作用的担忧。众所周知,视觉内容可以提升人类对于信息的可信度和分享意愿,但其对视觉语言模型(VLMs)的影响仍不明确。我们首次研究了图像如何影响VLMs重新分享新闻内容的倾向,探讨了这种影响在不同模型家族中的差异,以及人格设定和内容属性如何调节这种行为。为了支持这一分析,我们提出了两项方法论贡献:一种受越狱启发而设计的提示策略,它通过模拟具有反社会特质和政治倾向的用户来引出VLMs的重新分享决策;以及一个来自PolitiFact的事实核查政治新闻多模态数据集,配以相应的图像和真实性标签。跨模型家族的实验表明,图片的存在使真实新闻和虚假新闻的重新分享率分别提高了4.8%和15.0%。人格设定进一步调节了这一效应:黑暗三角(Dark Triad)特质放大了虚假新闻的重新分享,而共和党倾向的人设则表现出较低的真实性敏感度。在测试的所有模型中,只有Claude-3-Haiku对视觉错误信息表现出稳健性。这些发现突显了多模态模型行为的新兴风险,并促使人们为个性化AI系统开发针对性的评估框架和缓解策略。代码和数据集可在https://github.com/3lis/misinfo_vlm找到。

论文及项目相关链接

PDF

Summary

大型语言模型在新闻推荐系统中的应用日益广泛,引发人们对其传播误导信息的担忧。本研究首次探讨了图像对视觉语言模型(VLMs)重传新闻内容倾向性的影响、这种影响在不同模型家族之间的差异,以及人格设定和内容属性如何调节这种行为。研究发现,图像的存在使真实新闻的再分享率提高了4.8%,而误导性新闻的再分享率提高了15%。此外,人格设定会进一步调节这一效应:黑暗三角特征会加剧误导性新闻的再分享,而共和党倾向的人设对信息真实性的敏感度则有所降低。在测试的所有模型中,只有Claude-3-Haiku对视觉误导信息表现出稳健性。本研究揭示了多模态模型行为的新兴风险,并呼吁为个性化AI系统制定针对性的评估框架和缓解策略。

Key Takeaways

  1. 大型语言模型在新闻推荐系统中的集成引发关于传播误导信息的担忧。
  2. 图像对视觉语言模型(VLMs)重传新闻内容的影响尚不清楚。
  3. 图像可以显著提高真实和误导性新闻的再分享率。
  4. 个人特质(如黑暗三角特征)可能加剧误导性新闻的分享。
  5. 不同模型家族在重传新闻内容方面存在差异。
  6. Claude-3-Haiku模型对视觉误导信息具有稳健性。
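
下面的小函数示意如何从实验记录中按“新闻真伪 × 是否配图”分组统计再分享率(并非论文的官方分析代码),可用于复现摘要中“配图使真/假新闻再分享率分别上升”这类对比;记录的字段名均为假设。

```python
from collections import defaultdict

def resharing_rates(records):
    """records: [{"veracity": "true"|"false", "has_image": bool, "reshared": bool}, ...]
    返回 {(真伪, 是否配图): 再分享率}。"""
    grouped = defaultdict(lambda: [0, 0])              # (veracity, has_image) -> [再分享数, 总数]
    for r in records:
        key = (r["veracity"], r["has_image"])
        grouped[key][0] += int(r["reshared"])
        grouped[key][1] += 1
    return {key: shared / total for key, (shared, total) in grouped.items()}
```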

Cool Papers

点此查看论文截图

CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning

Authors:Lei Sheng, Shuai-Shuai Xu

Large language models (LLMs) have demonstrated strong capabilities in translating natural language questions about relational databases into SQL queries. In particular, test-time scaling techniques such as Self-Consistency and Self-Correction can enhance SQL generation accuracy by increasing computational effort during inference. However, these methods have notable limitations: Self-Consistency may select suboptimal outputs despite majority votes, while Self-Correction typically addresses only syntactic errors. To leverage the strengths of both approaches, we propose CSC-SQL, a novel method that integrates Self-Consistency and Self-Correction. CSC-SQL selects the two most frequently occurring outputs from parallel sampling and feeds them into a merge revision model for correction. Additionally, we employ the Group Relative Policy Optimization (GRPO) algorithm to fine-tune both the SQL generation and revision models via reinforcement learning, significantly enhancing output quality. Experimental results confirm the effectiveness and generalizability of CSC-SQL. On the BIRD development set, our 3B model achieves 65.28% execution accuracy, while the 7B model achieves 69.19%. The code will be open sourced at https://github.com/CycloneBoy/csc_sql.

大规模语言模型(LLM)在将关于关系数据库的自然语言问题翻译成SQL查询方面表现出了强大的能力。特别是,测试时缩放技术,如自我一致性(Self-Consistency)和自我修正(Self-Correction),可以通过增加推理过程中的计算量来提高SQL生成准确性。然而,这些方法也有明显的局限性:自我一致性即使经过多数投票也可能选择次优输出,而自我修正通常只解决语法错误。为了利用这两种方法的优点,我们提出了CSC-SQL,一种将自我一致性和自我修正相结合的新方法。CSC-SQL选择并行采样中出现频率最高的两个输出,并将其输入到合并修订模型中进行校正。此外,我们采用群体相对策略优化(GRPO)算法,通过强化学习对SQL生成模型和修订模型进行微调,显著提高了输出质量。实验结果证实了CSC-SQL的有效性和通用性。在BIRD开发集上,我们的3B模型执行准确率达到65.28%,而7B模型达到69.19%。代码将在https://github.com/CycloneBoy/csc_sql开源。

论文及项目相关链接

PDF 11 pages, 5 figures

Summary

大型语言模型在将自然语言问题翻译成SQL查询方面表现出强大的能力。测试时扩展技术,如自我一致性(Self-Consistency)和自我修正(Self-Correction),可以通过增加推理过程中的计算工作量来提高SQL生成准确性。然而,这两种方法都有局限性:自我一致性可能会选择次优输出,而自我修正主要解决语法错误。为了结合这两种方法的优点,我们提出了一种新的方法CSC-SQL,它结合了自我一致性和自我修正。CSC-SQL从并行采样中选择出现最频繁的两个输出,并将其输入到一个修正合并模型中。此外,我们采用群体相对策略优化(GRPO)算法对SQL生成和修正模型进行微调,通过强化学习显著提高输出质量。实验结果表明CSC-SQL的有效性及通用性。在BIRD开发集上,我们的3B模型执行准确性达到65.28%,而7B模型达到69.19%。代码将在CycloneBoy/csc_sql上开源。

Key Takeaways

  1. 大型语言模型在将自然语言转化为SQL查询方面表现出强大的能力。
  2. 测试时扩展技术如自我一致性(Self-Consistency)和自我修正(Self-Correction)能提高SQL生成的准确性。
  3. 单独使用时这两种方法都存在局限性,需要一种集成两种方法的新策略来提高效果,我们提出了CSC-SQL来解决这一问题。
  4. CSC-SQL方法选择两个最常见的输出并对其进行合并和修正以提高准确性。
  5. 采用群体相对策略优化(GRPO)算法对模型进行微调,显著提高输出质量。
  6. 实验结果表明CSC-SQL的有效性及通用性,在BIRD开发集上表现良好。
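
下面按摘要描述给出CSC-SQL推理流程的极简示意(非官方实现):并行采样若干条SQL,取出现频次最高的两条交给修订模型合并纠错;generate_sql 与 revise 为假设的接口,真实系统通常还会按执行结果对候选分组。

```python
from collections import Counter

def csc_sql_infer(question, schema, generate_sql, revise, n_samples=16):
    """并行采样 -> 取频次最高的两条候选 -> 让修订模型在两者之间纠错并输出最终SQL。"""
    samples = [generate_sql(question, schema) for _ in range(n_samples)]
    ranked = Counter(samples).most_common(2)
    if len(ranked) == 1:                      # 所有采样完全一致时直接返回
        return ranked[0][0]
    (sql_1, _), (sql_2, _) = ranked
    prompt = (
        f"-- Schema\n{schema}\n-- Question\n{question}\n"
        f"-- Candidate SQL 1\n{sql_1}\n-- Candidate SQL 2\n{sql_2}\n"
        "-- Compare the two candidates, fix any mistakes, and output the final SQL:"
    )
    return revise(prompt)
```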

Cool Papers

点此查看论文截图

Effective and Transparent RAG: Adaptive-Reward Reinforcement Learning for Decision Traceability

Authors:Jingyi Ren, Yekun Xu, Xiaolong Wang, Weitao Li, Weizhi Ma, Yang Liu

Retrieval-Augmented Generation (RAG) has significantly improved the performance of large language models (LLMs) on knowledge-intensive domains. However, although RAG achieved successes across distinct domains, there are still some unsolved challenges: 1) Effectiveness. Existing research mainly focuses on developing more powerful RAG retrievers, but how to enhance the generator’s (LLM’s) ability to utilize the retrieved information for reasoning and generation? 2) Transparency. Most RAG methods ignore which retrieved content actually contributes to the reasoning process, resulting in a lack of interpretability and visibility. To address this, we propose ARENA (Adaptive-Rewarded Evidence Navigation Agent), a transparent RAG generator framework trained via reinforcement learning (RL) with our proposed rewards. Based on the structured generation and adaptive reward calculation, our RL-based training enables the model to identify key evidence, perform structured reasoning, and generate answers with interpretable decision traces. Applied to Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct, abundant experiments with various RAG baselines demonstrate that our model achieves 10-30% improvements on all multi-hop QA datasets, which is comparable with the SOTA Commercially-developed LLMs (e.g., OpenAI-o1, DeepSeek-R1). Further analyses show that ARENA has strong flexibility to be adopted on new datasets without extra training. Our models and codes are publicly released.

检索增强生成(RAG)在知识密集型领域显著提高了大型语言模型(LLM)的性能。然而,尽管RAG在不同领域取得了成功,但仍存在一些未解决的挑战:1)有效性。现有研究主要集中在开发更强大的RAG检索器上,但如何增强生成器(LLM)利用检索信息进行推理和生成的能力?2)透明度。大多数RAG方法忽略了哪些检索内容实际上对推理过程有所贡献,导致缺乏可解释性和可见性。为了解决这些问题,我们提出了ARENA(自适应奖励证据导航代理),这是一个使用我们提出的奖励、通过强化学习(RL)训练的透明RAG生成器框架。基于结构化生成和自适应奖励计算,我们基于RL的训练使模型能够识别关键证据、进行结构化推理,并生成带有可解释决策轨迹的答案。将其应用于Qwen2.5-7B-Instruct和Llama3.1-8B-Instruct,并与各种RAG基线进行的大量实验表明,我们的模型在所有多跳问答数据集上实现了10-30%的提升,可与最先进的商业LLM(例如OpenAI-o1、DeepSeek-R1)相媲美。进一步的分析表明,ARENA具有很强的灵活性,可以在无需额外训练的情况下应用于新的数据集。我们的模型和代码已公开发布。

论文及项目相关链接

PDF

Summary
RAG技术在知识密集型领域表现出卓越性能,但仍面临有效性和透明度挑战。为解决这些问题,提出ARENA框架,通过强化学习训练,使模型能够识别关键证据,进行结构化推理,并生成具有可解释决策轨迹的答案。在多个多跳问答数据集上,与商业领先的大型语言模型相比,ARENA取得了显著的改进。

Key Takeaways

  1. RAG技术在知识密集型领域表现优异,但仍需解决有效性和透明度问题。
  2. ARENA框架被提出,采用强化学习训练,提高模型的推理和生成能力。
  3. ARENA能够识别关键证据,进行结构化推理,并生成具有可解释决策轨迹的答案。
  4. 与多种RAG基准测试和多跳问答数据集相比,ARENA取得了显著改进。
  5. ARENA的性能与商业领先的大型语言模型(如OpenAI-o1,DeepSeek-R1)相当。
  6. ARENA具有很强的灵活性,可以适应新数据集而无需额外训练。
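
下面的函数示意一种“自适应奖励”的组合方式(非官方ARENA实现):把答案正确性、所引用证据的查准/查全以及输出格式合规组合为标量奖励;各权重与字段均为假设,仅用于说明可解释的决策轨迹如何被奖励信号约束。

```python
def evidence_grounded_reward(pred_answer, gold_answer, cited_ids, relevant_ids,
                             format_ok=True, w_answer=1.0, w_evidence=0.5, w_format=0.1):
    """返回标量奖励: 答案精确匹配 + 引用证据的F1 + 格式合规项(均为示意性设计)。"""
    answer_score = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    cited, relevant = set(cited_ids), set(relevant_ids)
    precision = len(cited & relevant) / max(len(cited), 1)
    recall = len(cited & relevant) / max(len(relevant), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return w_answer * answer_score + w_evidence * f1 + w_format * float(format_ok)
```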

Cool Papers

点此查看论文截图

Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities

Authors:Lili Zhang, Haomiaomiao Wang, Long Cheng, Libao Deng, Tomas Ward

As Large Language Models (LLMs) become increasingly integrated into real-world decision-making systems, understanding their behavioural vulnerabilities remains a critical challenge for AI safety and alignment. While existing evaluation metrics focus primarily on reasoning accuracy or factual correctness, they often overlook whether LLMs are robust to adversarial manipulation or capable of using adaptive strategy in dynamic environments. This paper introduces an adversarial evaluation framework designed to systematically stress-test the decision-making processes of LLMs under interactive and adversarial conditions. Drawing on methodologies from cognitive psychology and game theory, our framework probes how models respond in two canonical tasks: the two-armed bandit task and the Multi-Round Trust Task. These tasks capture key aspects of exploration-exploitation trade-offs, social cooperation, and strategic flexibility. We apply this framework to several state-of-the-art LLMs, including GPT-3.5, GPT-4, Gemini-1.5, and DeepSeek-V3, revealing model-specific susceptibilities to manipulation and rigidity in strategy adaptation. Our findings highlight distinct behavioral patterns across models and emphasize the importance of adaptability and fairness recognition for trustworthy AI deployment. Rather than offering a performance benchmark, this work proposes a methodology for diagnosing decision-making weaknesses in LLM-based agents, providing actionable insights for alignment and safety research.

随着大型语言模型(LLM)越来越多地融入现实世界的决策系统,理解其行为漏洞仍然是人工智能安全和对齐面临的关键挑战。现有的评估指标主要关注推理准确性或事实正确性,却往往忽视LLM是否能够抵御对抗性操纵,或是否能在动态环境中采用自适应策略。本文介绍了一个对抗性评估框架,旨在在交互和对抗条件下系统地对LLM的决策过程进行压力测试。该框架借鉴认知心理学和博弈论的方法,考察模型在两个典型任务中的表现:双臂匪徒(two-armed bandit)任务和多轮信任任务。这些任务捕捉了探索与利用的权衡、社会合作和策略灵活性等关键方面。我们将该框架应用于多个最先进的LLM,包括GPT-3.5、GPT-4、Gemini-1.5和DeepSeek-V3,揭示了各模型在易受操纵和策略适应僵化方面的特定脆弱性。我们的研究结果突显了不同模型之间独特的行为模式,并强调了适应性和公平性识别对于可信人工智能部署的重要性。这项工作并非提供性能基准,而是提出了一种诊断基于LLM的智能体决策弱点的方法,为对齐和安全研究提供可操作的见解。

论文及项目相关链接

PDF

Summary

大型语言模型(LLMs)在现实决策系统中的集成应用日益广泛,对其行为脆弱性的理解对AI安全和对齐至关重要。现有评估指标主要关注推理准确性和事实正确性,却忽视了LLMs对对抗性操纵的稳健性和动态环境中策略适应能力。本文引入了一个对抗性评估框架,旨在系统地对LLMs的决策过程进行压力测试,该框架借鉴认知心理学和博弈论的方法,通过两个典型任务:双臂匪徒任务和多轮信任任务,探究模型如何应对探索与利用之间的权衡、社会合作和策略灵活性。应用于多个先进LLMs的结果揭示了模型特定的易受操纵性和策略适应的僵化性。本文强调适应性和公平性识别对于AI可靠部署的重要性。本研究旨在诊断LLM智能体决策中的弱点,为对齐和安全研究提供可操作性的见解,而非提供性能基准。

Key Takeaways

  1. 大型语言模型(LLMs)在现实决策系统中的应用带来了对其行为脆弱性的挑战。
  2. 当前评估指标主要关注推理准确性和事实正确性,忽略了模型的稳健性和适应动态环境的能力。
  3. 本文引入的对抗性评估框架旨在系统测试LLMs的决策过程,借鉴了认知心理学和博弈论的方法。
  4. 通过两个典型任务:双臂匪徒任务和多轮信任任务,该框架能够考察模型的探索与利用权衡、社会合作和策略灵活性。
  5. 对多个先进的LLMs应用该框架后,发现模型在应对操纵和策略适应方面存在特定弱点。
  6. 适应性和公平性识别对于AI的可靠部署至关重要。
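
下面是双臂匪徒(two-armed bandit)任务对抗性探测的极简示意(并非论文的官方框架):choose_arm 为假设的LLM智能体接口,在第 flip_at 轮悄悄交换两臂的奖励概率,用累计回报和选择序列观察其策略适应能力。

```python
import random

def two_armed_bandit_probe(choose_arm, rounds=40, probs=(0.7, 0.3), flip_at=20, seed=0):
    """choose_arm(history) -> 0或1; history为[(选择的臂, 获得的奖励), ...]。"""
    rng = random.Random(seed)
    arm_probs = list(probs)
    history, total_reward = [], 0
    for t in range(rounds):
        if t == flip_at:
            arm_probs.reverse()                    # 对抗性操纵: 中途交换好臂与坏臂
        arm = choose_arm(history)
        reward = int(rng.random() < arm_probs[arm])
        history.append((arm, reward))
        total_reward += reward
    return total_reward, history
```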

Cool Papers

点此查看论文截图

Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models

Authors:Haolang Lu, Yilian Liu, Jingxin Xu, Guoshun Nan, Yuanlong Yu, Zhican Chen, Kun Wang

The development of Reasoning Large Language Models (RLLMs) has significantly improved multi-step reasoning capabilities, but it has also made hallucination problems more frequent and harder to eliminate. While existing approaches mitigate hallucinations through external knowledge integration, model parameter analysis, or self-verification, they often fail to capture how hallucinations emerge and evolve across the reasoning chain. In this work, we study the causality of hallucinations under constrained knowledge domains by auditing the Chain-of-Thought (CoT) trajectory and assessing the model’s cognitive confidence in potentially erroneous or biased claims. Our analysis reveals that in long-CoT settings, RLLMs can iteratively reinforce biases and errors through flawed reflective reasoning, eventually leading to hallucinated reasoning paths. Surprisingly, even direct interventions at the origin of hallucinations often fail to reverse their effects, as reasoning chains exhibit ‘chain disloyalty’ – a resistance to correction and a tendency to preserve flawed logic. Furthermore, we show that existing hallucination detection methods are less reliable and interpretable than previously assumed in complex reasoning scenarios. Unlike methods such as circuit tracing that require access to model internals, our black-box auditing approach supports interpretable long-chain hallucination attribution, offering better generalizability and practical utility. Code and data are available at: https://anonymous.4open.science/r/repo_for_meta_hallucination

推理大型语言模型(RLLMs)的发展显著提高了多步推理能力,但也使幻觉问题更加频繁且难以消除。虽然现有方法通过外部知识整合、模型参数分析或自我验证来缓解幻觉,但它们往往无法捕捉幻觉如何在推理链中涌现和演变。在这项工作中,我们通过审核思维链(CoT)轨迹并评估模型对可能错误或有偏见主张的认知置信度,来研究受限知识领域下幻觉的因果关系。我们的分析表明,在长思维链(long-CoT)设置中,RLLMs可能通过有缺陷的反思推理迭代地强化偏见和错误,最终导致幻觉性的推理路径。令人惊讶的是,即使在幻觉的源头进行直接干预也常常无法逆转其影响,因为推理链表现出“链不忠”——一种对修正的抵抗力和保持错误逻辑的倾向。此外,我们表明,现有的幻觉检测方法在复杂的推理场景中并不像以前假设的那么可靠和可解释。不同于需要访问模型内部的电路追踪等方法,我们的黑盒审计方法支持可解释的长链幻觉归因,具有更好的通用性和实用性。代码和数据可在https://anonymous.4open.science/r/repo_for_meta_hallucination获取。

论文及项目相关链接

PDF 33 pages (including references and appendix),11 figures, 6 tables

Summary

本文探讨了大型语言模型的推理能力发展所带来的幻觉问题。研究发现,在多步推理过程中,模型可能通过错误的反思推理来强化偏见和错误,最终导致幻觉推理路径的产生。而且,即使在幻觉的源头进行干预,也往往无法消除其影响,因为推理链存在“链不忠”现象,即对抗纠正并保持错误逻辑的倾向。同时,现有的幻觉检测方法在复杂推理场景下可靠性较低。研究提出了一种黑盒审计方法,可以对长链幻觉进行可解释的归因,具有更好的通用性和实用性。

Key Takeaways

  1. 大型语言模型的多步推理能力显著提高,但幻觉问题也更为频繁和难以消除。
  2. 现有方法通过外部知识整合、模型参数分析或自我验证来减轻幻觉,但往往无法捕捉幻觉在推理链中的产生和演变。
  3. 在长推理链中,大型语言模型可能通过错误的反思推理来强化偏见和错误,导致幻觉推理路径。
  4. 推理链存在“链不忠”现象,即使直接干预幻觉源头也难以消除其影响。
  5. 现有的幻觉检测方法在复杂推理场景下可靠性较低,缺乏可解释性。
  6. 研究提出了一种黑盒审计方法,可以对长链幻觉进行可解释的归因,具有更好的通用性和实用性。
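
下面给出黑盒式思维链审计的一个极简示意(并非论文的官方流程):把思维链按步骤逐段回灌给模型并询问其对当前结论的置信度,用置信度轨迹辅助定位幻觉开始被放大的步骤;ask_confidence 为假设的接口。

```python
def audit_cot_trajectory(question, cot_steps, ask_confidence):
    """cot_steps: 思维链按步骤切分的列表; ask_confidence(question, prefix) -> [0, 1]。
    返回每一步的置信度记录, 供人工检查置信度骤变或在错误结论上居高不下的位置。"""
    report = []
    for i in range(1, len(cot_steps) + 1):
        prefix = "\n".join(cot_steps[:i])
        report.append({"step": i, "confidence": float(ask_confidence(question, prefix))})
    return report
```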

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !