⚠️ All of the summaries below are generated by a large language model and may contain errors; they are provided for reference only and should be used with caution.
🔴 Please note: never rely on them in serious academic settings; they are only meant for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated on 2025-11-09
QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation
Authors:Yang Zhang, Rui Zhang, Jiaming Guo, Lei Huang, Di Huang, Yunpu Zhao, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen
The remarkable progress of Large Language Models (LLMs) presents promising opportunities for Verilog code generation which is significantly important for automated circuit design. The lacking of meaningful functional rewards hinders the preference optimization based on Reinforcement Learning (RL) for producing functionally correct Verilog code. In this paper, we propose Signal-Aware Learning for Verilog code generation (QiMeng-SALV) by leveraging code segments of functionally correct output signal to optimize RL training. Considering Verilog code specifies the structural interconnection of hardware gates and wires so that different output signals are independent, the key insight of QiMeng-SALV is to extract verified signal-aware implementations in partially incorrect modules, so as to enhance the extraction of meaningful functional rewards. Roughly, we verify the functional correctness of signals in generated module by comparing with that of reference module in the training data. Then abstract syntax tree (AST) is employed to identify signal-aware code segments which can provide meaningful functional rewards from erroneous modules. Finally, we introduce signal-aware DPO which is optimized on the correct signal-level code segments, thereby preventing noise and interference from incorrect signals. The proposed QiMeng-SALV underscores the paradigm shift from conventional module-level to fine-grained signal-level optimization in Verilog code generation, addressing the issue of insufficient functional rewards. Experiments demonstrate that our method achieves state-of-the-art performance on VerilogEval and RTLLM, with a 7B parameter model matching the performance of the DeepSeek v3 671B model and significantly outperforming the leading open-source model CodeV trained on the same dataset. Our code is available at https://github.com/zy1xxx/SALV.
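The core signal-level idea lends itself to a small sketch. The following Python fragment is not the authors' code; the trace dictionaries and the `slice_by_signal` helper are hypothetical placeholders. It only illustrates how per-signal simulation traces of a generated module could be compared against a reference module, and how verified signal-level segments from partially incorrect modules could be paired with incorrect ones to form DPO preference data.

```python
# Minimal, illustrative sketch of the signal-level idea: compare per-signal
# simulation traces against a reference module, keep the code segments that
# drive functionally correct signals, and turn them into DPO pairs.
from typing import Dict, List, Tuple

Trace = Dict[str, List[int]]  # output signal name -> sampled waveform values


def correct_signals(gen: Trace, ref: Trace) -> List[str]:
    """Signals whose generated waveform matches the reference cycle by cycle."""
    return [s for s in ref if s in gen and gen[s] == ref[s]]


def build_dpo_pairs(candidates: List[Tuple[str, Trace]], ref: Trace,
                    slice_by_signal) -> List[Tuple[str, str]]:
    """For each output signal, pair a verified-correct code segment (chosen)
    with an incorrect segment of the same signal (rejected)."""
    chosen, rejected = {}, {}
    for code, trace in candidates:
        ok = set(correct_signals(trace, ref))
        for sig in ref:
            segment = slice_by_signal(code, sig)  # AST-based slicing, assumed helper
            bucket = chosen if sig in ok else rejected
            bucket.setdefault(sig, segment)
    return [(chosen[s], rejected[s]) for s in ref if s in chosen and s in rejected]


# Toy usage with a trivial "slicer" that just tags the signal name.
ref = {"sum": [0, 1, 1], "carry": [0, 0, 1]}
cands = [("module_a …", {"sum": [0, 1, 1], "carry": [1, 1, 1]}),
         ("module_b …", {"sum": [0, 0, 0], "carry": [0, 0, 1]})]
print(build_dpo_pairs(cands, ref, slice_by_signal=lambda code, sig: f"{code}:{sig}"))
```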
Paper and project links
PDF Accepted to NeurIPS 2025
Summary
The remarkable progress of large language models (LLMs) brings promising opportunities for Verilog code generation. In reinforcement learning (RL), the lack of meaningful functional rewards hinders preference optimization toward functionally correct Verilog code. This paper proposes Signal-Aware Learning for Verilog code generation (QiMeng-SALV), which optimizes RL training with code segments of functionally correct output signals. It verifies the functional correctness of the signals in generated modules and extracts verified signal-aware implementations from erroneous modules, strengthening the extraction of meaningful functional rewards. Finally, a signal-aware DPO is introduced that optimizes only on correct signal-level code segments, preventing noise and interference from incorrect signals. QiMeng-SALV shifts Verilog code generation from conventional module-level optimization to fine-grained signal-level optimization, addressing the problem of insufficient functional rewards. Experiments show that the method achieves state-of-the-art performance on VerilogEval and RTLLM.
Key Takeaways
- Advances in LLMs bring new opportunities for Verilog code generation.
- The lack of meaningful functional rewards is the key challenge for RL-based Verilog code generation.
- QiMeng-SALV optimizes RL training by leveraging code segments of functionally correct output signals.
- QiMeng-SALV verifies the functional correctness of the signals in generated modules.
- Signal-aware DPO prevents noise and interference from incorrect signals.
- QiMeng-SALV shifts optimization from the module level to the signal level.
The Zero-Step Thinking: An Empirical Study of Mode Selection as Harder Early Exit in Reasoning Models
Authors:Yuqiao Tan, Shizhu He, Kang Liu, Jun Zhao
Reasoning models have demonstrated exceptional performance in tasks such as mathematics and logical reasoning, primarily due to their ability to engage in step-by-step thinking during the reasoning process. However, this often leads to overthinking, resulting in unnecessary computational overhead. To address this issue, Mode Selection aims to automatically decide between Long-CoT (Chain-of-Thought) or Short-CoT by utilizing either a Thinking or NoThinking mode. Simultaneously, Early Exit determines the optimal stopping point during the iterative reasoning process. Both methods seek to reduce the computational burden. In this paper, we first identify Mode Selection as a more challenging variant of the Early Exit problem, as they share similar objectives but differ in decision timing. While Early Exit focuses on determining the best stopping point for concise reasoning at inference time, Mode Selection must make this decision at the beginning of the reasoning process, relying on pre-defined fake thoughts without engaging in an explicit reasoning process, referred to as zero-step thinking. Through empirical studies on nine baselines, we observe that prompt-based approaches often fail due to their limited classification capabilities when provided with minimal hand-crafted information. In contrast, approaches that leverage internal information generally perform better across most scenarios but still exhibit issues with stability. Our findings indicate that existing methods relying solely on the information provided by models are insufficient for effectively addressing Mode Selection in scenarios with limited information, highlighting the ongoing challenges of this task. Our code is available at https://github.com/Trae1ounG/Zero_Step_Thinking.
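The contrast between prompt-based and internal-information approaches can be pictured with a generic sketch. The snippet below is only an illustration of the latter family, not any specific baseline from the paper: it assumes a pooled prompt hidden state is available and trains a small logistic-regression probe to pick Thinking vs. NoThinking before any reasoning step is taken.

```python
# Illustrative "zero-step" mode selector: probe internal information (a pooled
# hidden state of the prompt) with a tiny classifier. Features and labels here
# are synthetic stand-ins for real model activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for pooled last-layer hidden states of prompts (n_samples x hidden_dim)
# and labels: 1 = needs Long-CoT (Thinking), 0 = Short-CoT (NoThinking) suffices.
X_train = rng.normal(size=(256, 64))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def select_mode(prompt_hidden: np.ndarray, threshold: float = 0.5) -> str:
    """Decide the reasoning mode before any reasoning step is taken."""
    p_think = probe.predict_proba(prompt_hidden.reshape(1, -1))[0, 1]
    return "Thinking" if p_think >= threshold else "NoThinking"


print(select_mode(rng.normal(size=64)))
```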
Paper and project links
PDF Accepted by NeurIPS’25 Efficient Reasoning Workshop
Summary
This paper examines the strong performance of reasoning models on tasks such as mathematics and logical reasoning, and the overthinking and unnecessary computation that step-by-step reasoning can cause. Two remedies are considered: Mode Selection, which automatically chooses between Long-CoT and Short-CoT at the very start of the reasoning process, and Early Exit, which determines the optimal stopping point during reasoning. The paper identifies Mode Selection as a harder variant of Early Exit, because it must decide without an explicit reasoning process, relying only on pre-defined fake thoughts. An empirical study of nine baselines shows that prompt-based approaches often fail when given only minimal hand-crafted information, while approaches that leverage internal information perform better in most scenarios but remain unstable. Existing methods that rely solely on information provided by the model are therefore insufficient for Mode Selection under limited information, and the task remains challenging.
Key Takeaways
- Reasoning models excel at tasks such as mathematics and logical reasoning but often overthink and waste computation.
- Mode Selection and Early Exit are two approaches for reducing this computational burden.
- Mode Selection automatically chooses the reasoning mode, while Early Exit determines the best stopping point during reasoning.
- Mode Selection is a harder variant of Early Exit because it must decide without an explicit reasoning process.
- Prompt-based approaches perform poorly when given only minimal hand-crafted information.
- Approaches that leverage internal information perform better in most scenarios but still suffer from instability.
Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients
Authors:Omar El Mansouri, Mohamed El Amine Seddik, Salem Lahlou
Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs or building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet, the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We introduce a noise-robust Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) framework that explicitly models reward corruption as Bernoulli noise. Our method applies noise correction after estimating reward flip probabilities to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, we observe consistent improvements across math and code tasks when applying our noise correction to standard reward model usage, with particular gains of up to 6.7 percentage points in accuracy on math tasks and 1.5 on code tasks under realistic reward model conditions. This work bridges label-noise correction from supervised learning with modern RLHF, offering both theoretical insights and a practical algorithm for noisy real-world deployment.
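The Bernoulli-noise view admits a compact illustration. Assuming a binary reward flipped with known probability rho (the paper estimates this flip probability), a standard label-noise correction gives an unbiased estimate of the clean reward, which can then feed GRPO-style group-relative advantages. The sketch below shows only that arithmetic and is not the authors' implementation.

```python
# If a binary reward is flipped with probability rho (Bernoulli noise), then
# E[r_obs] = rho + (1 - 2*rho) * r, so r_hat = (r_obs - rho) / (1 - 2*rho) is an
# unbiased estimate of the clean reward. Group-relative advantages are then
# computed from the debiased rewards.
import numpy as np


def debias_rewards(observed: np.ndarray, flip_prob: float) -> np.ndarray:
    """Unbiased reward estimates under symmetric Bernoulli label noise."""
    assert flip_prob < 0.5, "correction is only identifiable for rho < 0.5"
    return (observed - flip_prob) / (1.0 - 2.0 * flip_prob)


def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: standardize rewards within a group of samples drawn
    for the same prompt (Dr.GRPO drops the std normalization)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Toy example: 8 sampled completions for one prompt, noisy binary rewards.
observed = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
clean_est = debias_rewards(observed, flip_prob=0.1)
print(group_relative_advantages(clean_est))
```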
Paper and project links
Summary
For the standard RLHF / RLVR paradigm, this paper proposes a noise-robust Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) framework. The framework explicitly models reward corruption as Bernoulli noise and, after estimating the reward flip probability, applies a noise correction to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis and empirical results show that group-based methods already mitigate individual-level noise, and the proposed correction strategy amplifies this robustness.
Key Takeaways
- The standard RLHF / RLVR paradigm is highly sensitive to noise from inconsistent or erroneous rewards.
- The proposed Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) framework explicitly models reward corruption as Bernoulli noise.
- Noise correction based on estimated reward flip probabilities debiases the learning signal and yields provably unbiased gradient estimates.
- Group-based methods inherently mitigate individual-level noise.
- The noise-correction strategy further amplifies this robustness.
- Applying the noise correction yields consistent performance gains on math and code tasks.
KAT-Coder Technical Report
Authors:Zizheng Zhan, Ken Deng, Jinghui Wang, Xiaojiang Zhang, Huaixi Tang, Minglei Zhang, Zhiyi Lai, Haoyang Huang, Wen Xiang, Kun Wu, Wenhao Zhuang, Shaojie Wang, Shangpeng Yan, Kepeng Lei, Zongxian Feng, Huiming Wang, Zheng Lin, Mengtong Li, Mengfei Xie, Yinghan Cui, Xuxing Chen, Chao Wang, Weihao Li, Wenqiang Zhu, Jiarong Zhang, Jingxuan Xu, Songwei Yu, Yifan Yao, Xinping Lei, C. Zhang, Han Li, Junqi Xiong, Zuchen Gao, Dailin Li, Haimo Li, Jiaheng Liu, Yuqun Zhang, Junyi Peng, Haotian Zhang, Bin Chen
Recent advances in large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. However, bridging the gap between static text-based training and dynamic real-world agentic execution remains a core challenge. In this technical report, we present KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum encompassing Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances reasoning, planning, and reflection capabilities through a corpus of real software engineering data and synthetic agentic interactions. The SFT stage constructs a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes. The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization. Finally, the Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training. In summary, these stages enable KAT-Coder to achieve robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. Our KAT series 32B model, KAT-Dev, has been open-sourced on https://huggingface.co/Kwaipilot/KAT-Dev.
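As a rough illustration of the Error-Masked SFT idea mentioned for the deployment-adaptation stage, the sketch below zeroes the next-token loss on tokens flagged as erroneous. How KAT-Coder actually constructs this mask is not described in the abstract, so the masking rule here is purely an assumption.

```python
# Hedged sketch of an "error-masked" SFT loss: tokens flagged as coming from
# erroneous spans contribute no gradient, while the rest are trained with the
# usual next-token cross-entropy.
import torch
import torch.nn.functional as F


def error_masked_sft_loss(logits: torch.Tensor,      # (batch, seq, vocab)
                          targets: torch.Tensor,     # (batch, seq) token ids
                          error_mask: torch.Tensor   # (batch, seq), 1 = masked out
                          ) -> torch.Tensor:
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape_as(targets).float()
    keep = 1.0 - error_mask.float()
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)


# Toy usage with random tensors.
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
error_mask = torch.tensor([[0, 0, 1, 1, 0], [0, 0, 0, 0, 0]])
print(error_masked_sft_loss(logits, targets, error_mask))
```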
Paper and project links
Summary
Large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. Bridging the gap between static text-based training and dynamic real-world agentic execution, however, remains a core challenge. This technical report presents KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum of Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. These stages give KAT-Coder robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. The KAT series 32B model, KAT-Dev, has been open-sourced on Hugging Face.
Key Takeaways
- Large language models (LLMs) have advanced agentic coding, autonomously reasoning, planning, and acting within interactive software development workflows.
- KAT-Coder is a large-scale agentic code model trained through a multi-stage curriculum.
- The Mid-Term Training stage strengthens KAT-Coder's reasoning, planning, and reflection capabilities.
- The SFT stage builds a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes.
- The RFT stage introduces a multi-ground-truth reward formulation for stable and sample-efficient policy optimization.
- The Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments via Error-Masked SFT and Tree-Structured Trajectory Training.
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Authors:Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul Röttger
Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
Paper and project links
PDF Project Website: http://simbench.tiancheng.hu/ Data: https://huggingface.co/datasets/pitehu/SimBench
Summary
LLM simulations of human behavior have the potential to revolutionize the social and behavioral sciences, provided they faithfully reflect real human behavior. Current evaluations are fragmented and lack a common standard, making results hard to compare. To address this, the paper introduces SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. SimBench unifies 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool. The results show that even the best current LLMs have limited simulation ability, that performance improves with model size, and that simulation ability correlates most strongly with deep, knowledge-intensive reasoning. By making progress measurable, SimBench aims to accelerate the development of more faithful simulators of human behavior.
Key Takeaways
- LLM simulation of human behavior has great potential, provided it faithfully reflects real human behavior.
- Current evaluations of LLM simulation are fragmented and lack a common standard.
- SimBench provides a large-scale, standardized benchmark for evaluating LLM simulation.
- SimBench unifies diverse datasets covering a large global participant pool.
- Current LLMs still have limited simulation ability, and performance improves with model size.
- Simulation performance correlates most strongly with knowledge-intensive reasoning ability.
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Authors:Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, Li Yuan
Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. \texttt{UniWorld-V2}, trained with this framework, achieves \textbf{state-of-the-art} results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available to support further research.
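Two ingredients of the framework can be sketched generically: turning MLLM output logits for a judgment token into a scalar reward, and filtering out unreliable scores. The token names, the repeated-scoring reading of "low-variance group filtering", and the threshold below are illustrative assumptions rather than the paper's exact design.

```python
# Sketch: (1) convert an MLLM's logits for a yes/no judgment ("does the edit follow
# the instruction?") into a scalar reward, and (2) drop candidates whose repeated
# scores disagree too much, one plausible reading of "low-variance group filtering".
import math
from typing import Dict, List


def logit_reward(judgment_logits: Dict[str, float],
                 pos_token: str = "yes", neg_token: str = "no") -> float:
    """Softmax probability of the positive judgment token, used as a reward."""
    pos, neg = judgment_logits[pos_token], judgment_logits[neg_token]
    m = max(pos, neg)
    exp_pos, exp_neg = math.exp(pos - m), math.exp(neg - m)
    return exp_pos / (exp_pos + exp_neg)


def filter_noisy_scores(repeated_scores: List[List[float]],
                        max_std: float = 0.1) -> List[float]:
    """Score each candidate several times, keep only candidates whose repeated
    scores are consistent, and use the mean of those scores as the reward."""
    rewards = []
    for scores in repeated_scores:
        mean = sum(scores) / len(scores)
        std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
        if std <= max_std:
            rewards.append(mean)
    return rewards


# Toy usage: one judged edit and two candidates scored three times each.
print(logit_reward({"yes": 2.3, "no": -0.7}))
print(filter_noisy_scores([[0.90, 0.91, 0.89], [0.20, 0.80, 0.50]]))
```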
Paper and project links
Summary
Instruction-based image editing has made remarkable progress, but models trained only with supervised fine-tuning tend to overfit annotated patterns, limiting their ability to explore and generalize beyond the training distribution. The paper introduces Edit-R1, a post-training framework for instruction-based image editing based on policy optimization. It adopts Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow-matching forward process, which enables higher-order samplers and more efficient training. To cope with the lack of a universal reward model across diverse editing instructions and tasks, a Multimodal Large Language Model (MLLM) serves as a unified, training-free reward model, with its output logits providing fine-grained feedback. A carefully designed low-variance group filtering mechanism reduces MLLM scoring noise and stabilizes optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. The framework is model-agnostic and delivers substantial gains on diverse base models, demonstrating wide applicability.
Key Takeaways
- Edit-R1 is a policy-optimization-based post-training framework for instruction-based image editing.
- The DiffusionNFT method improves training efficiency and model performance.
- A multimodal large language model (MLLM) acts as a unified reward model and provides fine-grained feedback.
- A low-variance group filtering mechanism reduces reward-model scoring noise.
- UniWorld-V2 achieves outstanding results on the ImgEdit and GEdit-Bench benchmarks.
- The framework is model-agnostic and applies to a wide range of base models.
Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini, and Deepseek
Authors:Enis Oğuz
The developments in Generative AI technologies have paved the way for numerous innovations in different fields. Recently, Generative AI has been proposed as a competitor to AES systems in evaluating student essays automatically. Considering the potential limitations of AI in processing idioms, this study assessed the scoring performances of Generative AI models for essays with and without idioms by incorporating insights from Corpus Linguistics and Computational Linguistics. Two equal essay lists were created from 348 student essays taken from a corpus: one with multiple idioms present in each essay and another with no idioms in essays. Three Generative AI models (ChatGPT, Gemini, and Deepseek) were asked to score all essays in both lists three times, using the same rubric used by human raters in assigning essay scores. The results revealed excellent consistency for all models, but Gemini outperformed its competitors in interrater reliability with human raters. There was also no detectable bias for any demographic group in AI assessment. For essays with multiple idioms, Gemini followed a the most similar pattern to human raters. While the models in the study demonstrated potential for a hybrid approach, Gemini was the best candidate for the task due to its ability to handle figurative language and showed promise for handling essay-scoring tasks alone in the future.
Paper and project links
Summary
Building on advances in generative AI, this study assessed how generative AI models score essays with and without idioms. Three models (ChatGPT, Gemini, and Deepseek) each scored all essays three times using the same rubric as human raters. All models showed excellent scoring consistency, and Gemini followed the scoring pattern of human raters most closely, performing best on essays containing multiple idioms. This suggests that Gemini handles figurative language well and shows promise for handling essay-scoring tasks on its own in the future. Overall, the models demonstrate potential for a hybrid assessment approach.
Key Takeaways
- Generative AI technologies have enabled innovation across many fields.
- Generative AI has been proposed as a competitor to AES systems for automatically evaluating student essays.
- AI may have limitations in processing idioms.
- Drawing on corpus linguistics and computational linguistics, the study evaluated generative AI scoring of essays with and without idioms.
- The three generative AI models showed high scoring consistency.
- Gemini performed best at handling figurative language and at matching the scoring patterns of human raters.
FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models
Authors:Shengming Yuan, Xinyu Lyu, Shuailong Wang, Beitao Chen, Jingkuan Song, Lianli Gao
Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs’ adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping model’s associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight and training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions. Then, it selects high-association instances to construct effective associative steering vectors, whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, enabling models to follow diverse associative directions and better adapt to creative tasks. Notably, our method achieves up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines and demonstrating its effectiveness in enabling flexible control over associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC.
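The steering mechanism described here, adding a direction vector to middle-layer representations with an adjustable strength, can be illustrated with a toy module. The block, the random direction, and the strength value below are placeholders; how FlexAC derives its hallucination-guided vectors and calibrates strengths is only summarized in the abstract.

```python
# Activation steering at a middle layer: add a normalized "associative direction"
# vector to the hidden states, scaled by an adjustable strength.
import torch
import torch.nn as nn


class SteeredBlock(nn.Module):
    """Wraps one transformer block and adds a steering vector to its output."""

    def __init__(self, block: nn.Module, steer_vec: torch.Tensor, strength: float):
        super().__init__()
        self.block = block
        self.register_buffer("steer_vec", steer_vec / steer_vec.norm())
        self.strength = strength

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        out = self.block(hidden)
        return out + self.strength * self.steer_vec  # broadcasts over batch and sequence


# Toy usage: a stand-in "block" and a random associative direction.
hidden_dim = 16
block = nn.Linear(hidden_dim, hidden_dim)   # placeholder for a real transformer layer
direction = torch.randn(hidden_dim)         # e.g. mean(hallucinated) - mean(faithful) states
layer = SteeredBlock(block, direction, strength=4.0)
print(layer(torch.randn(2, 5, hidden_dim)).shape)
```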
Paper and project links
PDF 19 pages, 11 figures. Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Summary
This paper studies the inherent trade-off between faithfulness and creativity in multimodal large language models (MLLMs): different tasks require different degrees of associative reasoning, yet existing methods cannot modulate this reasoning strength, limiting MLLMs' adaptability across factual and creative scenarios. To close this gap, the paper equips MLLMs with mechanisms for flexible control over associative reasoning. An investigation of the internal mechanisms behind associative behavior shows that middle layers play a pivotal role in shaping a model's associative tendencies, that modifying representations in these layers effectively regulates the strength of associative reasoning, and that hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, the paper introduces Flexible Association Control (FlexAC), a lightweight, training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions, then selects high-association instances to construct effective steering vectors whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, letting the model follow diverse associative directions and better adapt to creative tasks. Experiments show substantial gains in creativity.
Key Takeaways
- Multimodal large language models (MLLMs) face a trade-off between faithfulness and creativity.
- MLLMs lack flexibility in modulating associative reasoning, which limits their adaptability across scenarios.
- Middle layers play a key role in shaping the associative behavior of MLLMs.
- Modifying middle-layer representations effectively regulates the strength of associative reasoning.
- Hallucinations can be exploited to derive steering vectors that control the associative behavior of MLLMs.
- FlexAC is a lightweight, training-free framework for flexible control of associative behavior in MLLMs.
FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts
Authors:Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains – general knowledge understanding, scientific question answering, mathematical reasoning, and code generation – demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
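A minimal sketch of the described design, under stated assumptions: the LoRA down-projection is replaced by a frozen sparse random matrix (acting as the implicit router), and only the top-k most strongly activated ranks contribute through the trainable up-projection. Dimensions, sparsity level, and k are illustrative choices, not FlyLoRA's actual configuration.

```python
# Sketch of a FlyLoRA-style adapter: frozen sparse random down-projection plus
# rank-wise (top-k) expert activation before the trainable up-projection.
import torch
import torch.nn as nn


class FlyLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, k: int = 4, sparsity: float = 0.9):
        super().__init__()
        self.base = base
        down = torch.randn(base.in_features, rank)
        down = down * (torch.rand_like(down) > sparsity)      # sparse random projection
        self.register_buffer("down", down)                     # frozen, not trained
        self.up = nn.Parameter(torch.zeros(rank, base.out_features))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = x @ self.down                                       # (..., rank) rank activations
        topk = torch.topk(z.abs(), self.k, dim=-1)
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        return self.base(x) + (z * mask) @ self.up              # rank-wise expert activation


layer = FlyLoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 10, 64)).shape)
```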
Paper and project links
PDF NeurIPS 2025 accepted paper
Summary
This paper presents FlyLoRA, a bio-inspired, parameter-efficient fine-tuning method that addresses the parameter interference problem of Low-Rank Adaptation (LoRA). By introducing rank-wise expert activation and an implicit router, FlyLoRA effectively mitigates inter-task interference in multi-task model merging and improves model performance. Experiments show consistent performance gains over existing methods across multiple domains.
Key Takeaways
- FlyLoRA is an implicit MoE (Mixture-of-Experts) based LoRA variant that addresses parameter interference.
- Rank-wise expert activation and an implicit router improve model performance.
- By eliminating the need for an explicit router, FlyLoRA balances intra-task decorrelation and computational efficiency.
- The orthogonality property of random matrices inherently mitigates inter-task interference.
- FlyLoRA delivers consistent performance gains over existing methods across multiple domains.
- The method draws inspiration from the fly olfactory circuit, showing how biological structures can inspire AI technologies.
Training Large Language Models To Reason In Parallel With Global Forking Tokens
Authors:Sheng Jia, Xiao Wang, Shiva Prasad Kasiviswanathan
Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem, and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using self-supervised bipartite matching between our global forking tokens and unique reasoning traces. We observe that, while naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Experiments on multiple reasoning benchmarks show that our SSFT consistently outperforms SFT under both Pass@1 and Cons@k metrics.
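The bipartite-matching step can be sketched with a standard Hungarian solver: given a loss for every (global forking token, reasoning trace) pair, find the one-to-one assignment with minimum total loss and use it as the set-based objective. The random cost matrix below stands in for the model's actual negative log-likelihoods.

```python
# Set-based matching: assign each unique reasoning trace to one global forking
# token so that the total loss is minimal, then train each trace only under its
# matched forking token.
import numpy as np
from scipy.optimize import linear_sum_assignment

num_forking_tokens, num_traces = 4, 4
# cost[i, j] = loss of generating reasoning trace j when conditioned on forking token i
cost = np.random.rand(num_forking_tokens, num_traces)

token_idx, trace_idx = linear_sum_assignment(cost)   # minimum-cost bipartite matching
set_loss = cost[token_idx, trace_idx].sum()          # set-based global loss to minimize

for i, j in zip(token_idx, trace_idx):
    print(f"forking token {i} <-> reasoning trace {j} (loss {cost[i, j]:.3f})")
print("total set loss:", round(float(set_loss), 3))
```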
Paper and project links
Summary
Large language models (LLMs) can improve performance by scaling parallel test-time compute, but this relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes lie deep in the sampling tree, so common diversity strategies such as temperature scaling face a worsened trade-off between diversity and accuracy. This work treats parallel reasoning as a set-of-next-token-prediction problem and incorporates a set-based global loss into supervised fine-tuning (SFT) via self-supervised bipartite matching between global forking tokens and unique reasoning traces. While naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, the proposed Set Supervised Fine-Tuning (SSFT) preserves them and produces emergent global forking tokens. On multiple reasoning benchmarks, SSFT outperforms SFT under both Pass@1 and Cons@k metrics.
Key Takeaways
- Scaling LLM performance at test time depends on generating reasoning paths that are both diverse and accurate.
- The forking tokens that trigger diverse yet correct reasoning modes lie deep in the sampling tree.
- Common strategies for encouraging diversity face a worsened trade-off between diversity and accuracy.
- This work treats parallel reasoning as a set-of-next-token-prediction problem and incorporates it into supervised fine-tuning.
- The method uses self-supervised bipartite matching between global forking tokens and unique reasoning traces.
- Set Supervised Fine-Tuning (SSFT) preserves the distinct reasoning modes and produces emergent global forking tokens.
ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
Authors:Jincheng Liu, Sijun He, Jingjing Wu, Xiangsen Wang, Yang Chen, Zhaoqi Kuang, Siqi Bao, Yuan Yao
Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine reasoning skills particularly complex strategic reasoning or are they primarily excelling at sophisticated pattern recognition within their training data? To address this question, this paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of LLMs. Chess requires complex strategic reasoning capabilities including long-term planning, strict rule comprehension, and multi-turn conversation memorization. Specifically, ChessArena is a competitive framework where LLMs play against each other, under four different play modes. The testbed is equipped with a ranking algorithm and a leaderboard. The testbed can also evaluate fine-grained capabilities including basic understanding, move selection, and puzzle solving. Over 13 LLMs with different modes are evaluated in ChessArena, playing over 800 games. The results reveal significant shortcomings in current LLMs: no model can beat Maia-1100 (a chess engine at human amateur level), while some even failed to defeat a random player that selects moves arbitrarily. We also present a strong baseline to the testbed: our fine-tuned Qwen3-8B substantially improved performance, approaching much larger state-of-the-art reasoning models.
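The abstract mentions a ranking algorithm and leaderboard without specifying them; purely as an illustration of how pairwise game outcomes between models can be turned into a ranking, here is a generic Elo update (not necessarily what ChessArena uses).

```python
# Generic Elo-style rating update over pairwise game results.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a: 1 = A wins, 0.5 = draw, 0 = A loses."""
    ea = expected_score(rating_a, rating_b)
    return rating_a + k * (score_a - ea), rating_b + k * ((1.0 - score_a) - (1.0 - ea))


ratings = {"model_x": 1000.0, "model_y": 1000.0, "maia_1100": 1100.0}
games = [("model_x", "maia_1100", 0.0), ("model_x", "model_y", 1.0)]
for a, b, score in games:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))   # leaderboard
```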
Paper and project links
Summary
Recent large language models (LLMs) show strong reasoning abilities, but it remains unclear whether they possess genuine complex strategic reasoning skills or mainly excel at pattern recognition within their training data. To investigate this, the paper presents ChessArena, a chess testbed for evaluating the strategic reasoning capabilities of LLMs. In ChessArena, LLMs play against each other under four different play modes; the testbed includes a ranking algorithm and a leaderboard and also evaluates fine-grained capabilities such as basic understanding, move selection, and puzzle solving. Over 13 LLMs were evaluated across more than 800 games. The results reveal significant shortcomings in current LLMs: no model can beat Maia-1100, a chess engine at human amateur level, and some models even fail to defeat a player that selects moves at random. A fine-tuned Qwen3-8B substantially improves performance and approaches much larger state-of-the-art reasoning models, providing a strong baseline.
Key Takeaways
- Whether LLMs possess genuine complex strategic reasoning capabilities remains an open question.
- ChessArena is proposed to evaluate the strategic reasoning of LLMs, covering long-term planning, rule comprehension, and multi-turn conversation memorization.
- The testbed is competitive, with four play modes, a ranking algorithm, and a leaderboard.
- Evaluation reveals significant shortcomings: no current LLM can beat a human-amateur-level chess engine (Maia-1100).
- Some LLMs even fail to defeat a player that selects moves at random.
- A fine-tuned Qwen3-8B stands out and serves as a strong baseline on the testbed.
p-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding
Authors:Runyan Tan, Shuang Wu, Phillip Howard
Obtaining high-quality outputs from Large Language Models (LLMs) often depends upon the choice of a sampling-based decoding strategy to probabilistically choose the next token at each generation step. While a variety of such sampling methods have been proposed, their performance can be sensitive to the selection of hyperparameters which may require different settings depending upon the generation task and temperature configuration. In this work, we introduce $p$-less sampling: an information-theoretic approach to sampling which dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. Unlike existing methods, $p$-less sampling has no hyperparameters and consistently produces high-quality outputs as temperature increases. We provide theoretical perspectives on $p$-less sampling to ground our proposed method and conduct experiments to empirically validate its effectiveness across a range of math, logical reasoning, and creative writing tasks. Our results demonstrate how $p$-less sampling consistently outperforms existing sampling approaches while exhibiting much less degradation in text quality at higher temperature values. We further show how $p$-less achieves greater inference-time efficiency than alternative methods through lower average token sampling times and shorter generation lengths, without sacrificing accuracy. Finally, we provide analyses to highlight the benefits of $p$-less through qualitative examples, case studies, and diversity assessments. The code is available at https://github.com/ryttry/p-less .
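The paper defines its own hyperparameter-free truncation rule; the sketch below only conveys the general idea of deriving a truncation threshold from the whole token distribution (here, keeping tokens whose probability is at least exp(-entropy)), and should not be read as the authors' exact formulation.

```python
# Illustrative entropy-derived truncation: the cutoff depends only on the current
# token distribution, so there is no tunable truncation hyperparameter.
import numpy as np


def entropy_truncated_sample(logits, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    threshold = np.exp(-entropy)              # distribution-dependent cutoff, no knobs
    keep = p >= threshold
    if not keep.any():                        # defensive fallback: greedy pick
        return int(p.argmax())
    q = np.where(keep, p, 0.0)
    q /= q.sum()
    return int(rng.choice(len(q), p=q))


print(entropy_truncated_sample([2.0, 1.5, 0.2, -1.0, -3.0], temperature=1.3))
```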
Paper and project links
Summary
This paper introduces p-less sampling, an information-theoretic decoding method for large language models (LLMs) that dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. The method has no hyperparameters and continues to produce high-quality outputs as temperature increases. Theory and experiments validate its effectiveness across math, logical reasoning, and creative writing tasks. p-less sampling also achieves greater inference-time efficiency, with lower average token sampling times and shorter generation lengths, without sacrificing accuracy. Finally, qualitative examples, case studies, and diversity assessments highlight its benefits.
Key Takeaways
- p-less sampling is an information-theoretic sampling method for large language models.
- The method has no hyperparameters to tune across generation tasks and temperature configurations.
- p-less sampling continues to produce high-quality outputs as temperature increases.
- It performs well across math, logical reasoning, and creative writing tasks.
- It improves inference-time efficiency through lower average token sampling times and shorter generation lengths.
- These efficiency gains come without sacrificing accuracy.
SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
Authors:Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, Jiamin Wu, Qihao Zheng, Yuhao Zhou, Huihui Xu, Chenglong Ma, Yan Lu, Wenlong Zhang, Chunfeng Song, Philip Torr, Shixiang Tang, Xinzhu Ma, Wanli Ouyang, Lei Bai
We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruct tuning datasets and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.
Paper and project links
PDF technical report
Summary
The paper presents a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, with annealed cold-start bootstrapping to elicit long-form chain-of-thought and reinforcement learning with task-specific reward shaping to instill deliberate scientific reasoning. The model supports four capability families covering up to 103 tasks across workflows, including faithful translation between text and scientific formats, text/knowledge extraction, property prediction, property classification, and unconditional and conditional sequence generation and design. Compared with specialist systems, the approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. The paper details data curation and training and shows that cross-discipline learning strengthens transfer and downstream reliability. The model, instruction-tuning datasets, and evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.
Key Takeaways
- The model aligns natural language with heterogeneous scientific representations spanning multiple scientific formats.
- It is pretrained on a large scientific corpus and then instruction-tuned.
- Annealed cold-start bootstrapping elicits long-form chain-of-thought, and reinforcement learning instills scientific reasoning.
- The model supports a wide range of tasks and workflows, including translation, extraction, prediction, classification, and sequence generation and design.
- Compared with specialist systems, it offers broader instruction coverage, better cross-domain generalization, and higher fidelity.
- The paper details data curation and training and demonstrates the benefits of cross-discipline learning.
LLMs as Layout Designers: Enhanced Spatial Reasoning for Content-Aware Layout Generation
Authors:Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen, Naren Ramakrishnan
While Large Language Models (LLMs) have demonstrated impressive reasoning and planning abilities in textual domains and can effectively follow instructions for complex tasks, their ability to understand and manipulate spatial relationships remains limited. Such capabilities are crucial for content-aware graphic layout design, where the goal is to arrange heterogeneous elements onto a canvas so that final design remains visually balanced and structurally feasible. This problem requires precise coordination of placement, alignment, and structural organization of multiple elements within a constrained visual space. To address this limitation, we introduce LaySPA, a reinforcement learning-based framework that augments LLM-based agents with explicit spatial reasoning capabilities for layout design. LaySPA employs hybrid reward signals that jointly capture geometric constraints, structural fidelity, and visual quality, enabling agents to navigate the canvas, model inter-element relationships, and optimize spatial arrangements. Through group-relative policy optimization, the agent generates content-aware layouts that reflect salient regions, respect spatial constraints, and produces an interpretable reasoning trace explaining placement decisions and a structured layout specification. Experimental results show that LaySPA substantially improves the generation of structurally valid and visually appealing layouts, outperforming larger general-purpose LLMs and achieving performance comparable to state-of-the-art specialized layout models.
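A hybrid reward of the kind described (geometric constraints plus structural fidelity plus visual quality) can be sketched with simple box arithmetic. The specific terms, the saliency handling, and the weights below are illustrative assumptions, not LaySPA's actual reward.

```python
# Toy hybrid layout reward: penalize element overlap (geometric), reward aligned
# left edges (structural), and reward covering the salient region (visual).
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), normalized to [0, 1]


def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def layout_reward(boxes: List[Box], salient: Box,
                  w_overlap=1.0, w_align=0.5, w_salient=0.5) -> float:
    overlap = sum(iou(a, b) for i, a in enumerate(boxes) for b in boxes[i + 1:])
    left_edges = [b[0] for b in boxes]
    align = -min(max(left_edges) - min(left_edges), 1.0)   # tighter left edges = better
    coverage = max(iou(b, salient) for b in boxes)          # some element sits on the salient region
    return -w_overlap * overlap + w_align * align + w_salient * coverage


print(layout_reward([(0.1, 0.1, 0.4, 0.3), (0.1, 0.4, 0.4, 0.6)],
                    salient=(0.05, 0.05, 0.5, 0.5)))
```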
Paper and project links
Summary
Large language models (LLMs) show strong reasoning and planning abilities in textual domains and can follow instructions for complex tasks, but their ability to understand and manipulate spatial relationships remains limited. This matters for content-aware graphic layout design, where heterogeneous elements must be arranged on a canvas so that the final design stays visually balanced and structurally feasible. To address this limitation, the paper introduces LaySPA, a reinforcement learning framework that augments LLM-based agents with explicit spatial reasoning for layout design. LaySPA employs hybrid reward signals that jointly capture geometric constraints, structural fidelity, and visual quality, enabling agents to navigate the canvas, model inter-element relationships, and optimize spatial arrangements. Through group-relative policy optimization, the agent generates content-aware layouts that reflect salient regions and respect spatial constraints, together with an interpretable reasoning trace explaining placement decisions and a structured layout specification. Experiments show that LaySPA substantially improves the generation of structurally valid and visually appealing layouts, outperforming larger general-purpose LLMs and performing on par with state-of-the-art specialized layout models.
Key Takeaways
- Large language models (LLMs) have limited ability to understand and manipulate spatial relationships.
- Content-aware graphic layout design requires arranging heterogeneous elements on a canvas while keeping the result visually balanced and structurally feasible.
- LaySPA is a reinforcement learning framework that equips LLMs with spatial reasoning capabilities for layout design.
- LaySPA uses hybrid reward signals that capture geometric constraints, structural fidelity, and visual quality.
- LaySPA generates content-aware layouts through group-relative policy optimization.
- LaySPA substantially improves the generation of structurally valid and visually appealing layouts.
RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning
Authors:Chia-Hsuan Hsu, Jun-En Ding, Hsin-Ling Hsu, Chun-Chieh Liao, Fang-Ming Hung, Feng Liu
Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that uniquely combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO differentiates itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley-Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA and MedQA-USMLE show consistent improvements over strong baselines. Remarkably, our 1.1B parameter model outperforms much larger 7B-13B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement offers a scalable and effective approach to building more reliable, clinically grounded medical LLMs.
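The groupwise ranking objective can be illustrated with a Plackett-Luce-style listwise generalization of the Bradley-Terry model plus a KL penalty to a reference policy. Whether this matches RPRO's exact objective is an assumption; the sketch only shows the "groupwise ranking + KL regularization" shape.

```python
# Listwise ranking loss over a group of candidate reasoning chains plus a simple
# KL penalty toward a reference model.
import torch


def groupwise_ranking_loss(scores: torch.Tensor) -> torch.Tensor:
    """scores: model scores for candidates already ordered from best to worst.
    Negative log-likelihood of that ranking under a Plackett-Luce model."""
    loss = 0.0
    for i in range(scores.numel() - 1):
        loss = loss - (scores[i] - torch.logsumexp(scores[i:], dim=0))
    return loss


def kl_penalty(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Token-level KL estimate between policy and reference on sampled tokens."""
    return (logp_policy - logp_ref).mean()


scores = torch.tensor([2.1, 1.3, 0.2, -0.5], requires_grad=True)   # best -> worst
logp_pi = torch.tensor([-1.2, -0.9, -2.0])
logp_ref = torch.tensor([-1.0, -1.1, -1.8])
total = groupwise_ranking_loss(scores) + 0.1 * kl_penalty(logp_pi, logp_ref)
print(total)
```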
Paper and project links
Summary
Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference, yet existing large language models (LLMs) often produce reasoning chains that lack factual accuracy and clinical reliability. This paper proposes Ranked Preference Reinforcement Optimization (RPRO), a novel framework that combines reinforcement learning with preference-driven reasoning refinement to improve clinical chain-of-thought (CoT) performance. RPRO differs from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces groupwise ranking optimization based on the Bradley-Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA and MedQA-USMLE show consistent improvements over strong baselines; remarkably, a 1.1B-parameter model outperforms much larger 7B-13B models, including medical-specialized variants. Combining preference optimization with quality-driven refinement thus offers a scalable and effective path toward more reliable, clinically grounded medical LLMs.
Key Takeaways
- Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference.
- Existing large language models often produce reasoning chains for medical QA that lack factual accuracy and clinical reliability.
- The RPRO framework combines reinforcement learning with preference-driven reasoning refinement to improve clinical chain-of-thought performance.
- RPRO employs task-adaptive reasoning templates and a probabilistic evaluation mechanism to align outputs with clinical workflows.
- RPRO automatically identifies and corrects low-quality reasoning chains.
- Unlike traditional pairwise preference methods, RPRO introduces groupwise ranking optimization and KL-divergence regularization.
Interpretable Reward Model via Sparse Autoencoder
Authors:Shuyi Zhang, Wei Shi, Sihang Li, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang
Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of LLM-based RM into an interpretable, sparse, and monosemantic feature space, from which a scalar head aggregates feature activations to produce transparent and conceptually meaningful reward scores. Empirical evaluations demonstrate that SARM facilitates direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models. Our code is available at https://github.com/schrieffer-z/sarm.
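The architecture as described (hidden activations encoded by a pretrained, frozen SAE into sparse features, with a scalar head aggregating them into a reward) can be sketched in a few lines. Sizes, the ReLU encoder, and the attribution helper below are illustrative assumptions rather than the released implementation.

```python
# Sketch of a SARM-style head: frozen SAE encoder -> sparse features -> linear
# scalar head, with per-feature contributions available for attribution.
import torch
import torch.nn as nn


class SAERewardHead(nn.Module):
    def __init__(self, hidden_dim: int = 512, feature_dim: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, feature_dim)   # stands in for a pretrained SAE encoder
        self.scalar_head = nn.Linear(feature_dim, 1, bias=False)
        for p in self.encoder.parameters():                  # SAE is pretrained and kept frozen
            p.requires_grad = False

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        features = torch.relu(self.encoder(hidden))          # sparse, (ideally) monosemantic features
        return self.scalar_head(features).squeeze(-1)        # feature-additive reward score

    def attribute(self, hidden: torch.Tensor, top_k: int = 5):
        """Feature-level attribution: per-feature contribution to the reward."""
        features = torch.relu(self.encoder(hidden))
        contrib = features * self.scalar_head.weight.squeeze(0)
        return torch.topk(contrib, top_k, dim=-1)


rm = SAERewardHead()
h = torch.randn(2, 512)            # pooled hidden activations of two responses
print(rm(h), rm.attribute(h).indices)
```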
Paper and project links
PDF Commercial firm need to review this paper before publishing it
Summary
Large language models (LLMs) are widely deployed across many fields. Reinforcement Learning from Human Feedback (RLHF) uses reward models (RMs) as proxies for human preferences to align LLM behavior with human values, so the accuracy, reliability, and interpretability of RMs are critical for effective alignment. Traditional RMs, however, lack interpretability, offer limited insight into why rewards are assigned, and cannot flexibly adapt to shifts in user preferences. To overcome these limitations, the paper introduces the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of an LLM-based RM into an interpretable, sparse, and monosemantic feature space, from which a scalar head aggregates feature activations into transparent and conceptually meaningful reward scores. Empirical evaluations show that SARM enables direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves better alignment performance than conventional reward models.
Key Takeaways
- Large language models (LLMs) are widely used across many fields.
- Reinforcement Learning from Human Feedback (RLHF) uses reward models (RMs) as proxies for human preferences to align LLM behavior with human values.
- Traditional RMs lack interpretability, give little insight into the reasoning behind reward assignments, and cannot flexibly adapt to preference shifts.
- The proposed Sparse Autoencoder-enhanced Reward Model (SARM) integrates a pretrained sparse autoencoder (SAE) into the RM, improving interpretability.
- SARM maps the RM's hidden activations into an interpretable, sparse feature space, enabling direct feature-level attribution of rewards.
- SARM allows dynamic adjustment to preference shifts and achieves superior alignment performance in empirical evaluations.
- The code is publicly available.
Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
Authors:Shu Wu, Chenxing Li, Wenfu Wang, Hao Zhang, Hualei Wang, Meng Yu, Dong Yu
Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, the explicit reasoning process has yet to show significant benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategies based on task complexity dynamically. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.
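One way to picture an "adaptive think accuracy" style reward is to combine answer correctness with a term that encourages thinking on hard tasks and discourages it on easy ones, plus an external consistency score. The exact formulation in Audio-Thinker is not given in the abstract, so every weight in the sketch below is an illustrative assumption.

```python
# Toy reward: correctness, plus a complexity-adaptive thinking term, plus an
# external consistency score from a separate reward model.
def adaptive_think_reward(answer_correct: bool,
                          used_thinking: bool,
                          task_complexity: float,     # in [0, 1], from a difficulty estimator
                          consistency_score: float,   # in [0, 1], from an external reward model
                          w_consistency: float = 0.3) -> float:
    accuracy = 1.0 if answer_correct else 0.0
    # Reward thinking on hard tasks, penalize unnecessary thinking on easy ones.
    if used_thinking:
        think_term = task_complexity - 0.5
    else:
        think_term = 0.5 - task_complexity
    return accuracy + 0.2 * think_term + w_consistency * consistency_score


print(adaptive_think_reward(True, used_thinking=True, task_complexity=0.9, consistency_score=0.8))
print(adaptive_think_reward(True, used_thinking=True, task_complexity=0.1, consistency_score=0.8))
```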
Paper and project links
PDF preprint
Summary
Recent advances in large language models, multimodal large language models, and large audio language models (LALMs) have substantially improved reasoning through reinforcement learning with rule-based rewards. However, explicit reasoning has not yet shown clear benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address this, the paper proposes Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, focusing on adaptability, consistency, and effectiveness. The approach introduces an adaptive think-accuracy reward that lets the model adjust its reasoning strategy to task complexity, incorporates an external reward model to evaluate the overall consistency and quality of the reasoning process, and adds think-based rewards that help the model distinguish valid from flawed reasoning paths during training. Experiments show that Audio-Thinker outperforms existing reasoning-oriented LALMs across benchmark tasks, exhibiting superior reasoning and generalization.
Key Takeaways
- Recent large language models have improved their reasoning capabilities through reinforcement learning.
- Explicit reasoning has not yet shown clear benefits for audio question answering.
- LALMs still fall short of human-level auditory-language reasoning.
- Audio-Thinker is a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs.
- Audio-Thinker introduces an adaptive think-accuracy reward to improve the model's adaptability.
- An external reward model and think-based rewards evaluate reasoning quality and help distinguish valid from flawed reasoning paths.