⚠️ All of the summaries below are generated by large language models. They may contain errors and are for reference only; use them with caution.
🔴 Note: do not rely on these summaries in serious academic settings; they are intended only as a first-pass screen before reading the papers.
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-23
Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
Authors:Chenghao Zhu, Meiling Tao, Tiannan Wang, Dongyi Ding, Yuchen Eleanor Jiang, Wangchunshu Zhou
Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11% win-rate improvement, and personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
Paper and Project Links
PDF work in progress
Summary
The paper addresses the challenge of personalizing large language models (LLMs). It notes that supervised fine-tuning (SFT) and standard reinforcement learning from human feedback (RLHF) both fall short for personalization, and proposes a reinforcement learning framework called Critique-Post-Edit to personalize more faithfully. The framework has two key components: a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and a Critique-Post-Edit mechanism in which the policy model revises its own outputs based on those critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, the method substantially outperforms standard PPO on personalization benchmarks: the personalized Qwen2.5-7B model gains an average 11% in win rate, and the personalized Qwen2.5-14B model surpasses GPT-4.1. These results point to a practical path toward faithful, efficient, and controllable personalization.
Key Takeaways
- Faithfully personalizing large language models (LLMs) is a challenging but critical task.
- Existing approaches such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) have limitations for personalization.
- The proposed Critique-Post-Edit reinforcement learning framework aims at more faithful and controllable personalization.
- The framework has two key components: a Personalized Generative Reward Model (GRM) and a Critique-Post-Edit mechanism.
- The GRM provides multi-dimensional scores and textual critiques to resist reward hacking.
- The Critique-Post-Edit mechanism has the policy model revise its outputs based on the critiques, enabling more efficient and targeted learning.
- On personalization benchmarks, the framework substantially outperforms standard PPO.
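To make the Critique-Post-Edit loop concrete, here is a minimal Python sketch of one rollout step under assumed interfaces: `ToyPolicy`, `ToyGRM`, `score_and_critique`, and `revise` are hypothetical placeholders, not the authors' API, and the toy scoring and revision logic only illustrates the data flow (generate, critique with multi-dimensional scores, post-edit, re-score before the RL update).

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class CritiqueFeedback:
    scores: Dict[str, float]   # multi-dimensional scores, e.g. persona fit / helpfulness / conciseness
    critique: str              # textual critique used for post-editing

class ToyGRM:
    """Placeholder generative reward model (not the paper's model)."""
    def score_and_critique(self, profile: str, prompt: str, response: str) -> CritiqueFeedback:
        fit = 1.0 if profile.split()[0].lower() in response.lower() else 0.2
        return CritiqueFeedback(
            scores={"persona_fit": fit, "helpfulness": 0.5, "conciseness": 1.0 - len(response) / 500},
            critique="Mention the user's stated preference explicitly." if fit < 1.0 else "Looks faithful.",
        )

class ToyPolicy:
    """Placeholder policy; a real system would call an LLM here."""
    def generate(self, profile: str, prompt: str) -> str:
        return f"Generic answer to: {prompt}"
    def revise(self, profile: str, prompt: str, draft: str, critique: str) -> str:
        return f"{draft} (revised for profile '{profile}' per critique: {critique})"

def critique_post_edit_step(policy: ToyPolicy, grm: ToyGRM, profile: str, prompt: str):
    draft = policy.generate(profile, prompt)
    feedback = grm.score_and_critique(profile, prompt, draft)
    edited = policy.revise(profile, prompt, draft, feedback.critique)
    edited_feedback = grm.score_and_critique(profile, prompt, edited)
    # In a real pipeline, the draft, the revision, and the GRM scores would feed
    # a PPO-style update; here we just scalarize the scores and return everything.
    reward = sum(edited_feedback.scores.values()) / len(edited_feedback.scores)
    return draft, edited, reward

if __name__ == "__main__":
    print(critique_post_edit_step(ToyPolicy(), ToyGRM(), "vegan home cook", "Suggest a quick dinner."))
```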

Fine-Tuned Thoughts: Leveraging Chain-of-Thought Reasoning for Industrial Asset Health Monitoring
Authors:Shuxin Lin, Dhaval Patel, Christodoulos Constantinides
Small Language Models (SLMs) are becoming increasingly popular in specialized fields, such as industrial applications, due to their efficiency, lower computational requirements, and ability to be fine-tuned for domain-specific tasks, enabling accurate and cost-effective solutions. However, performing complex reasoning using SLMs in specialized fields such as Industry 4.0 remains challenging. In this paper, we propose a knowledge distillation framework for industrial asset health, which transfers reasoning capabilities via Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) to smaller, more efficient models (SLMs). We discuss the advantages and the process of distilling LLMs using multi-choice question answering (MCQA) prompts to enhance reasoning and refine decision-making. We also perform in-context learning to verify the quality of the generated knowledge and benchmark the performance of fine-tuned SLMs with generated knowledge against widely used LLMs. The results show that the fine-tuned SLMs with CoT reasoning outperform the base models by a significant margin, narrowing the gap to their LLM counterparts. Our code is open-sourced at: https://github.com/IBM/FailureSensorIQ.
Paper and Project Links
PDF Accepted at EMNLP 2025
Summary
To address the difficulty small language models (SLMs) have with complex reasoning in industrial applications, the paper proposes a knowledge-distillation framework. The framework transfers reasoning ability from large language models (LLMs) to smaller, more efficient models via Chain-of-Thought (CoT) distillation, using multiple-choice question answering (MCQA) prompts to strengthen reasoning and refine decision-making. Experiments show that fine-tuned SLMs with CoT reasoning significantly outperform their base models and narrow the gap to their LLM counterparts.
Key Takeaways
- Small language models (SLMs) are attractive for industrial applications because of their efficiency, low computational requirements, and ability to be fine-tuned for domain-specific tasks, enabling accurate and cost-effective solutions.
- SLMs still struggle with complex reasoning, particularly in Industry 4.0 settings.
- Knowledge distillation is an effective way to transfer reasoning capabilities from large language models (LLMs) to smaller models.
- Chain-of-Thought (CoT) distillation, which imitates a step-by-step thinking process, helps improve the model's reasoning ability.
- Multiple-choice question answering (MCQA) prompts are used to enhance the SLMs' reasoning and refine decision-making.
- Fine-tuned SLMs with CoT reasoning significantly outperform their base models and approach the performance of LLMs.
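As an illustration of CoT distillation with MCQA prompts, the sketch below builds one training record by asking a teacher model for a step-by-step rationale and keeping it only if the final answer matches the gold label. The prompt template, the `toy_teacher` stub, and the answer-checking rule are assumptions for the sketch, not the paper's exact pipeline.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical teacher interface: in practice this would call a hosted LLM;
# here it is a stub so the sketch runs end to end.
def toy_teacher(prompt: str) -> str:
    return "Reasoning: bearing wear raises vibration amplitude.\nAnswer: B"

COT_TEMPLATE = (
    "You are an industrial asset-health expert.\n"
    "Question: {question}\nOptions:\n{options}\n"
    "Think step by step, then give the final option letter."
)

def build_cot_record(question: str, options: Dict[str, str], gold: str,
                     teacher: Callable[[str], str]) -> Optional[Dict[str, str]]:
    prompt = COT_TEMPLATE.format(
        question=question,
        options="\n".join(f"{k}. {v}" for k, v in sorted(options.items())),
    )
    completion = teacher(prompt)
    # Keep only rationales whose final answer matches the gold label,
    # so the SLM is fine-tuned on verified chains of thought.
    if completion.strip().splitlines()[-1].endswith(gold):
        return {"prompt": prompt, "completion": completion}
    return None

if __name__ == "__main__":
    record = build_cot_record(
        "Which sensor most directly indicates bearing wear on a pump?",
        {"A": "Flow meter", "B": "Vibration sensor", "C": "Level gauge"},
        gold="B",
        teacher=toy_teacher,
    )
    print(json.dumps(record, indent=2))
```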

Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards
Authors:Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li
We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experiment results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in facilitating the model’s own existing preference (latent knowledge) learned from pretraining, which leads to reasoning ability improvement. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at https://github.com/ElementQi/OnlineSFT.
Paper and Project Links
Summary
The paper presents a simple, self-help online supervised fine-tuning (OSFT) paradigm for LLM reasoning, in which the model generates its own responses and is immediately fine-tuned on this self-generated data. Experiments show that OSFT matches strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO on challenging mathematical reasoning tasks. The main mechanism of OSFT is to reinforce the model's own preferences (latent knowledge) learned during pretraining, thereby improving its reasoning ability. OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms.
Key Takeaways
- OSFT is a simple, self-help online fine-tuning paradigm for LLM reasoning: the model generates its own responses and is immediately fine-tuned on them.
- OSFT is a highly efficient training strategy: it is reward-free and uses just one rollout by default.
- Experiments show that OSFT performs comparably to RLVR methods such as GRPO on challenging mathematical reasoning tasks.
- The main mechanism of OSFT is to reinforce the model's own preferences (latent knowledge) learned during pretraining.
- OSFT improves the model's reasoning ability.
- OSFT is an efficient and promising alternative to more complex, reward-based training paradigms.
- The code is available at https://github.com/ElementQi/OnlineSFT.
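The OSFT loop itself is simple enough to sketch. The code below shows the reward-free structure (one self-generated rollout per prompt, followed immediately by a supervised fine-tuning step on that rollout); the `toy_generate` and `toy_sft_step` stubs stand in for real LLM generation and gradient updates and are not the repository's API.

```python
import random
from typing import Callable, List

# Placeholder generation: a real system would sample from the current policy/LLM.
def toy_generate(model_state: dict, prompt: str) -> str:
    return prompt + " => " + random.choice(["42", "let x=3, then x^2+1=10", "no idea"])

# Placeholder fine-tuning step: stands in for one cross-entropy update on (prompt, response).
def toy_sft_step(model_state: dict, prompt: str, response: str, lr: float = 1e-6) -> dict:
    model_state = dict(model_state)
    model_state["steps"] = model_state.get("steps", 0) + 1
    return model_state

def online_sft(prompts: List[str],
               generate: Callable[[dict, str], str] = toy_generate,
               sft_step: Callable[..., dict] = toy_sft_step) -> dict:
    """Reward-free OSFT-style loop: one rollout per prompt, immediate fine-tuning
    on the model's own output (no verifier, no reward model)."""
    model_state = {"steps": 0}
    for prompt in prompts:
        rollout = generate(model_state, prompt)                 # single self-generated response
        model_state = sft_step(model_state, prompt, rollout)    # train on it right away
    return model_state

if __name__ == "__main__":
    print(online_sft(["Solve: 6*7 =", "Simplify: (x+1)^2 - x^2 ="]))
```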

KAT-Coder Technical Report
Authors:Zizheng Zhan, Ken Deng, Xiaojiang Zhang, Jinghui Wang, Huaixi Tang, Zhiyi Lai, Haoyang Huang, Wen Xiang, Kun Wu, Wenhao Zhuang, Minglei Zhang, Shaojie Wang, Shangpeng Yan, Kepeng Lei, Zongxian Feng, Huiming Wang, Zheng Lin, Mengtong Li, Mengfei Xie, Yinghan Cui, Xuxing Chen, Chao Wang, Weihao Li, Wenqiang Zhu, Jiarong Zhang, Jingxuan Xu, Songwei Yu, Yifan Yao, Xinping Lei, Han Li, Junqi Xiong, Zuchen Gao, Dailin Li, Haimo Li, Jiaheng Liu, Yuqun Zhang, Junyi Peng, Haotian Zhang, Bin Chen
Recent advances in large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. However, bridging the gap between static text-based training and dynamic real-world agentic execution remains a core challenge. In this technical report, we present KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum encompassing Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances reasoning, planning, and reflection capabilities through a corpus of real software engineering data and synthetic agentic interactions. The SFT stage constructs a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes. The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization. Finally, the Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training. In summary, these stages enable KAT-Coder to achieve robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. Our KAT series 32B model, KAT-Dev, has been open-sourced on https://huggingface.co/Kwaipilot/KAT-Dev.
Paper and Project Links
Summary: Recent advances in large language models (LLMs) have driven progress in agentic coding, enabling software development workflows in which models autonomously reason, plan, and act. The report presents KAT-Coder, an agentic code model trained through Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. These stages give the model robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. The KAT series 32B model, KAT-Dev, has been open-sourced at huggingface.co/Kwaipilot/KAT-Dev.
Key Takeaways:
- Advances in large language models are driving progress in agentic coding.
- KAT-Coder is a new agentic code model trained through a multi-stage curriculum.
- The Mid-Term Training stage strengthens the model's reasoning, planning, and reflection capabilities.
- The SFT stage builds a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes.
- The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization.
- The Reinforcement-to-Deployment Adaptation stage adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training.
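The abstract mentions Error-Masked SFT in the Reinforcement-to-Deployment phase but gives no implementation details. A common way to realize loss masking is to set the labels of unwanted spans to the cross-entropy `ignore_index`; the PyTorch sketch below shows that pattern as an assumption about how erroneous trajectory segments (for example, failed tool calls) could be excluded from imitation.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # the default ignore_index for F.cross_entropy

def masked_sft_loss(logits: torch.Tensor, labels: torch.Tensor,
                    error_mask: torch.Tensor) -> torch.Tensor:
    """
    logits:     (batch, seq_len, vocab)
    labels:     (batch, seq_len) token ids
    error_mask: (batch, seq_len) bool, True where the trajectory segment was
                judged erroneous (e.g. a failed tool call) and should not be imitated.
    """
    masked_labels = labels.masked_fill(error_mask, IGNORE_INDEX)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        masked_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )

if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(2, 5, 11)                  # toy batch: 2 sequences, 5 tokens, vocab 11
    labels = torch.randint(0, 11, (2, 5))
    error_mask = torch.zeros(2, 5, dtype=torch.bool)
    error_mask[0, 3:] = True                        # pretend the tail of sample 0 was a failed action
    print(masked_sft_loss(logits, labels, error_mask))
```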

VAR: Visual Attention Reasoning via Structured Search and Backtracking
Authors:Wei Cai, Jian Zhao, Yuchen Yuan, Tianle Zhang, Ming Zhu, Haichuan Tang, Chi Zhang, Xuelong Li
Multimodal Large Language Models (MLLMs), despite their advances, are hindered by their high hallucination tendency and heavy reliance on brittle, linear reasoning processes, leading to failures in complex tasks. To address these limitations, we introduce Visual Attention Reasoning (VAR), a novel framework that recasts grounded reasoning as a structured search over a reasoning trajectory space. VAR decomposes the reasoning process into two key stages: traceable evidence grounding and search-based chain-of-thought (CoT) generation, which incorporates a backtracking mechanism for self-correction. The search is guided by a multi-faceted reward function with semantic and geometric self-verification components, which penalize outputs that are not faithfully grounded in the visual input. We provide a theoretical analysis for our search strategy, validating its capability to find the correct solution with high probability. Experimental results show that our 7B model, VAR-7B, sets a new state-of-the-art on a comprehensive suite of hallucination and safety benchmarks, significantly outperforming existing open-source models and demonstrating competitive performance against leading proprietary systems.
Paper and Project Links
Summary:
Multimodal large language models (MLLMs) have a high tendency to hallucinate and rely on brittle, linear reasoning, which causes failures in complex tasks. To address this, the paper proposes Visual Attention Reasoning (VAR), a framework that recasts grounded reasoning as a structured search over the space of reasoning trajectories. VAR decomposes reasoning into two stages: traceable evidence grounding and search-based chain-of-thought (CoT) generation with a backtracking mechanism for self-correction. The search is guided by a multi-faceted reward with semantic and geometric self-verification components that penalize outputs not faithfully grounded in the visual input. A theoretical analysis shows the search strategy finds the correct solution with high probability. Experiments show that the 7B model VAR-7B sets a new state of the art on hallucination and safety benchmarks, significantly outperforming existing open-source models and remaining competitive with leading proprietary systems.
Key Takeaways:
- Multimodal large language models (MLLMs) suffer from a high hallucination tendency and reliance on linear reasoning.
- Visual Attention Reasoning (VAR) is introduced to address these problems.
- VAR decomposes reasoning into two stages: traceable evidence grounding and search-based chain-of-thought generation.
- VAR performs a structured search over the space of reasoning trajectories.
- A multi-faceted reward with semantic and geometric self-verification components guides the search and penalizes outputs not grounded in the visual input.
- VAR-7B performs strongly across hallucination and safety benchmarks, significantly outperforming existing open-source models and staying competitive with proprietary systems.
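The abstract describes VAR's search-based CoT generation with backtracking only at a high level. The sketch below shows one generic way such a structured search could look: depth-first expansion of reasoning steps, candidates ordered by a verification score, and backtracking whenever a step fails the score threshold. The `expand`, `score`, and `is_terminal` callables and the threshold rule are illustrative assumptions, not the paper's reward function.

```python
from typing import Callable, List, Optional

def search_with_backtracking(
    expand: Callable[[List[str]], List[str]],
    score: Callable[[List[str]], float],
    is_terminal: Callable[[List[str]], bool],
    max_depth: int = 6,
    threshold: float = 0.5,
) -> Optional[List[str]]:
    """Depth-first search over reasoning trajectories; a step whose verification
    score drops below `threshold` is abandoned and the search backtracks."""
    def dfs(trajectory: List[str], depth: int) -> Optional[List[str]]:
        if is_terminal(trajectory):
            return trajectory
        if depth >= max_depth:
            return None
        # Try higher-scoring continuations first.
        candidates = sorted(expand(trajectory), key=lambda s: score(trajectory + [s]), reverse=True)
        for step in candidates:
            if score(trajectory + [step]) < threshold:
                continue                      # self-verification fails -> backtrack
            result = dfs(trajectory + [step], depth + 1)
            if result is not None:
                return result
        return None
    return dfs([], 0)

if __name__ == "__main__":
    # Toy problem: the "correct" trajectory grounds a region, reads a value, then answers.
    target = ["ground region", "read value", "answer: 7"]
    expand = lambda t: ["answer: 7", "read value", "ground region", "hallucinate object"]
    score = lambda t: 0.0 if "hallucinate object" in t else (
        1.0 if t == target[: len(t)] else 0.3)
    is_terminal = lambda t: t == target
    print(search_with_backtracking(expand, score, is_terminal))
```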

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
Authors:Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Fei Huang, Yongbin Li, Ge Li
While Large Language Models (LLMs) excel at code generation by learning from vast code corpora, a fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness, which is governed by formal execution semantics. Reinforcement Learning with Verifiable Rewards (RLVR) approaches attempt to bridge this gap using outcome rewards from executing test cases. However, solely relying on binary pass/fail signals is inefficient for establishing a well-aligned connection between the textual representation of code and its execution semantics, especially for subtle logical errors within the code. In this paper, we propose CodeRL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation. CodeRL+ enables the model to infer variable-level execution trajectory, providing a direct learning signal of execution semantics. CodeRL+ can construct execution semantics alignment directly using existing on-policy rollouts and integrates seamlessly with various RL algorithms. Extensive experiments demonstrate that CodeRL+ outperforms post-training baselines (including RLVR and Distillation), achieving a 4.6% average relative improvement in pass@1. CodeRL+ generalizes effectively to other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks, respectively. CodeRL+ shows strong applicability across diverse RL algorithms and LLMs. Furthermore, probe analyses provide compelling evidence that CodeRL+ strengthens the alignment between code’s textual representations and its underlying execution semantics.
Paper and Project Links
Summary
The paper addresses the semantic gap in code generation: large language models are trained on textual patterns, while functional correctness is governed by formal execution semantics. CodeRL+ integrates execution semantics alignment into the RLVR training pipeline, enabling the model to infer variable-level execution trajectories as a direct learning signal. CodeRL+ constructs this alignment directly from existing on-policy rollouts and integrates seamlessly with various RL algorithms. Experiments show that CodeRL+ outperforms post-training baselines (including RLVR and distillation), with a 4.6% average relative improvement in pass@1. It also generalizes to other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks respectively, and applies broadly across RL algorithms and LLMs.
Key Takeaways
- Large language models face a semantic gap in code generation: training on textual patterns does not directly align with functional correctness.
- CodeRL+ aims to close this gap by integrating execution semantics alignment into the RLVR training pipeline.
- CodeRL+ enables the model to infer variable-level execution trajectories, providing a direct learning signal of execution semantics.
- CodeRL+ constructs execution semantics alignment directly from existing on-policy rollouts and integrates seamlessly with various RL algorithms.
- Experiments show that CodeRL+ outperforms post-training baselines, with a 4.6% average relative improvement in pass@1.
- CodeRL+ generalizes well to other coding tasks, such as code reasoning and test-output generation.
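The abstract does not specify how variable-level execution trajectories are obtained. For Python candidates, one workable approach is to trace execution with `sys.settrace`, snapshot watched variables at each line, and compare a model-predicted trace against the actual one; the sketch below illustrates that idea. The `trace_match_reward` scoring rule is a toy assumption, not CodeRL+'s objective.

```python
import sys
from typing import Any, Dict, List, Tuple

def trace_variables(source: str, watch: List[str]) -> List[Tuple[int, Dict[str, Any]]]:
    """Execute `source` and record the values of watched variables at each line event."""
    snapshots: List[Tuple[int, Dict[str, Any]]] = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_filename == "<candidate>":
            snapshots.append(
                (frame.f_lineno, {v: frame.f_locals[v] for v in watch if v in frame.f_locals})
            )
        return tracer
    code = compile(source, "<candidate>", "exec")
    sys.settrace(tracer)
    try:
        exec(code, {})
    finally:
        sys.settrace(None)
    return snapshots

def trace_match_reward(predicted: List[Tuple[int, Dict[str, Any]]],
                       actual: List[Tuple[int, Dict[str, Any]]]) -> float:
    """Fraction of actual snapshots the model predicted exactly (a toy alignment signal)."""
    hits = sum(1 for snap in actual if snap in predicted)
    return hits / max(len(actual), 1)

if __name__ == "__main__":
    candidate = "total = 0\nfor i in range(3):\n    total += i\n"
    actual = trace_variables(candidate, watch=["total", "i"])
    print(actual)
    print(trace_match_reward(predicted=actual[:2], actual=actual))
```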

KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs
Authors:Donghyeon Ko, Yeguk Jin, Kyubyung Chae, Byungwook Lee, Chansong Jo, Sookyo In, Jaehong Lee, Taesup Kim, Donghyun Kwak
We present $\textbf{Korean SimpleQA (KoSimpleQA)}$, a benchmark for evaluating factuality in large language models (LLMs) with a focus on Korean cultural knowledge. KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers. We conduct a comprehensive evaluation across a diverse set of open-source LLMs of varying sizes that support Korean, and find that even the strongest model generates correct answer only 33.7% of the time, underscoring the challenging nature of KoSimpleQA. Notably, performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the unique value of our dataset. Furthermore, our analysis of reasoning LLMs shows that engaging reasoning capabilities in the factual QA task can both help models better elicit their latent knowledge and improve their ability to abstain when uncertain. KoSimpleQA can be found at https://anonymous.4open.science/r/KoSimpleQA-62EB.
Paper and Project Links
Summary
Korean SimpleQA (KoSimpleQA) is a benchmark for evaluating the factuality of large language models (LLMs), focusing on Korean cultural knowledge. It consists of 1,000 short, fact-seeking questions with unambiguous answers. Evaluation shows that even the strongest model answers correctly only 33.7% of the time, underscoring the benchmark's difficulty. Performance rankings on KoSimpleQA also differ substantially from those on the English SimpleQA, highlighting the dataset's unique value. Analysis further shows that engaging reasoning capabilities in factual QA helps models both elicit latent knowledge and abstain when uncertain.
Key Takeaways
- KoSimpleQA is a factuality benchmark for large language models (LLMs) focused on Korean cultural knowledge.
- The benchmark consists of 1,000 short questions with unambiguous answers, designed to be challenging yet easy to grade.
- Even the strongest evaluated model answers correctly only 33.7% of the time, underscoring the benchmark's difficulty.
- Performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the dataset's unique value.
- Engaging reasoning capabilities helps models better elicit their latent knowledge in factual QA.
- Improving the ability to abstain when uncertain also benefits model behavior.
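A SimpleQA-style evaluation mostly comes down to grading each short answer as correct, incorrect, or not attempted, then aggregating. The sketch below shows that bookkeeping; the string-matching `toy_grade` stands in for the LLM judge such benchmarks typically use, and the example questions are illustrative, not items from KoSimpleQA.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical grader: SimpleQA-style evaluation typically uses an LLM judge to
# label each response CORRECT / INCORRECT / NOT_ATTEMPTED; a string match stands
# in here so the sketch runs on its own.
def toy_grade(question: str, gold: str, response: str) -> str:
    if not response.strip() or "i don't know" in response.lower():
        return "NOT_ATTEMPTED"
    return "CORRECT" if gold.lower() in response.lower() else "INCORRECT"

def evaluate(items: List[Dict[str, str]], answers: List[str]) -> Dict[str, float]:
    labels = Counter(toy_grade(it["question"], it["answer"], resp)
                     for it, resp in zip(items, answers))
    n = len(items)
    attempted = labels["CORRECT"] + labels["INCORRECT"]
    return {
        "accuracy_overall": labels["CORRECT"] / n,
        "accuracy_when_attempted": labels["CORRECT"] / attempted if attempted else 0.0,
        "abstention_rate": labels["NOT_ATTEMPTED"] / n,
    }

if __name__ == "__main__":
    items = [
        {"question": "Which dynasty built Gyeongbokgung Palace?", "answer": "Joseon"},
        {"question": "What is the Korean alphabet called?", "answer": "Hangul"},
    ]
    print(evaluate(items, ["The Joseon dynasty.", "I don't know."]))
```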

Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models
Authors:Lehan Wang, Yi Qin, Honglong Yang, Xiaomeng Li
Incentivizing the reasoning ability of Multimodal Large Language Models (MLLMs) is essential for medical applications to transparently analyze medical scans and provide reliable diagnosis. However, existing medical MLLMs rely solely on internal knowledge during reasoning, leading to hallucinated reasoning and factual inaccuracies when encountering cases beyond their training scope. Although recent Agentic Retrieval-Augmented Generation (RAG) methods elicit the medical model’s proactive retrieval ability during reasoning, they are confined to unimodal LLMs, neglecting the crucial visual information during reasoning and retrieval. Consequently, we propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Specifically, we design a two-stage reinforcement learning strategy with tailored rewards that stimulate the model to leverage both visual diagnostic findings and textual clinical information for effective retrieval. Building on this foundation, we further propose a Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when low prediction confidence is detected. Evaluation on various public medical benchmarks demonstrates Med-RwR’s significant improvements over baseline models, proving the effectiveness of enhancing reasoning capabilities with external knowledge integration. Furthermore, Med-RwR demonstrates remarkable generalizability to unfamiliar domains, evidenced by 8.8% performance gain on our proposed EchoCardiography Benchmark (ECBench), despite the scarcity of echocardiography data in the training corpus. Our data, model, and codes will be made publicly available at https://github.com/xmed-lab/Med-RwR.
Paper and Project Links
PDF Work in progress
Summary
The paper highlights the importance of incentivizing the reasoning ability of multimodal large language models (MLLMs) for medical applications, so that medical scans can be analyzed transparently and diagnoses made reliably. Existing medical MLLMs rely only on internal knowledge during reasoning, which leads to hallucinated reasoning and factual inaccuracies on cases beyond their training scope. Med-RwR, the first multimodal medical Reasoning-with-Retrieval framework, actively retrieves external knowledge during reasoning by querying observed symptoms or domain-specific medical concepts. It uses a two-stage reinforcement learning strategy with tailored rewards to encourage the model to exploit both visual diagnostic findings and textual clinical information for effective retrieval, plus a Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when prediction confidence is low. Evaluations on public medical benchmarks show significant improvements over baselines, and Med-RwR generalizes well to unfamiliar domains, gaining 8.8% on the proposed EchoCardiography Benchmark (ECBench) despite scarce echocardiography data in the training corpus.
Key Takeaways
- Incentivizing the reasoning ability of multimodal large language models (MLLMs) is essential for medical applications, enabling more reliable analysis of medical scans and diagnosis.
- Existing medical MLLMs rely solely on internal knowledge, leading to hallucinated reasoning and factual inaccuracies.
- Med-RwR is the first multimodal medical Reasoning-with-Retrieval framework, combining visual and textual information during reasoning and retrieval.
- Med-RwR uses a two-stage reinforcement learning strategy with tailored rewards that encourage retrieval based on both visual and textual information.
- Med-RwR adds a Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when prediction confidence is low.
- Med-RwR performs strongly on public medical benchmarks, demonstrating the effectiveness of integrating external knowledge to enhance reasoning.
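The Confidence-Driven Image Re-retrieval (CDIR) idea can be sketched as a small test-time loop: answer, check confidence, and if it is low, retrieve reference cases and ask again with the extra context. The `predict` and `retrieve_similar` callables, the threshold, and the round limit below are assumptions for illustration, not Med-RwR's actual components.

```python
from typing import Callable, List, Tuple

def answer_with_cdir(
    query_image: str,
    question: str,
    predict: Callable[[str, str, List[str]], Tuple[str, float]],
    retrieve_similar: Callable[[str], List[str]],
    conf_threshold: float = 0.6,
    max_rounds: int = 2,
) -> Tuple[str, float]:
    """Confidence-driven re-retrieval at test time: if the model's confidence is
    low, pull in reference cases and ask again with the extra context."""
    context: List[str] = []
    answer, confidence = predict(query_image, question, context)
    rounds = 0
    while confidence < conf_threshold and rounds < max_rounds:
        context += retrieve_similar(query_image)     # e.g. similar scans plus their reports
        answer, confidence = predict(query_image, question, context)
        rounds += 1
    return answer, confidence

if __name__ == "__main__":
    # Toy stand-ins for the MLLM and retriever (placeholders, not the paper's components).
    def toy_predict(image, question, context):
        return ("dilated left ventricle", 0.4 + 0.3 * len(context))
    def toy_retrieve(image):
        return [f"reference case similar to {image}"]
    print(answer_with_cdir("echo_042.png", "What abnormality is visible?", toy_predict, toy_retrieve))
```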

From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation
Authors:Ziwei Huang, Ying Shu, Hao Fang, Quanyu Long, Wenya Wang, Qiushi Guo, Tiezheng Ge, Leilei Gan
Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient. (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model’s temporal dynamics by prioritizing prompt-following in the early steps and identity preservation in the later ones. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
Paper and Project Links
Summary
Subject-driven image generation models face a fundamental trade-off between identity preservation and prompt adherence. Online reinforcement learning with GRPO is a promising solution, but a naive application leads to competitive degradation: simple linear aggregation of rewards with static weights produces conflicting gradient signals and misaligns with the temporal dynamics of the diffusion process. The proposed Customized-GRPO framework introduces two innovations: Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that penalizes conflicting reward signals and amplifies synergistic ones to provide a sharper gradient, and Time-Aware Dynamic Weighting (TDW), which aligns optimization pressure with the model's temporal dynamics by prioritizing prompt-following early and identity preservation later. Experiments show the method clearly outperforms naive GRPO baselines, mitigating competitive degradation and achieving a superior balance between preserving key identity features and accurately following complex textual prompts.
Key Takeaways
- Subject-driven image generation models face a trade-off between identity preservation and prompt adherence.
- Naive application of GRPO causes competitive degradation, because linear reward aggregation with static weights produces conflicting gradient signals.
- The Customized-GRPO framework addresses this with two innovations: Synergy-Aware Reward Shaping (SARS) and Time-Aware Dynamic Weighting (TDW).
- SARS explicitly penalizes conflicting reward signals and amplifies synergistic ones, yielding a sharper, more decisive gradient.
- TDW aligns optimization pressure with the model's temporal dynamics, prioritizing prompt-following early and identity preservation later.
- Experiments show that Customized-GRPO significantly outperforms naive GRPO and effectively mitigates competitive degradation.
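The exact SARS and TDW formulas are not given in the abstract; the sketch below only illustrates the two ideas with made-up functional forms: time-dependent weights that shift from prompt-following to identity preservation, plus a non-linear term that penalizes disagreement between the two rewards and boosts cases where both are high.

```python
from typing import Tuple

def time_aware_weights(t: float) -> Tuple[float, float]:
    """t in [0, 1]: 0 = early (noisy) diffusion step, 1 = late step.
    Early steps weight prompt-following; late steps weight identity preservation."""
    w_prompt = 1.0 - t
    w_identity = t
    return w_prompt, w_identity

def synergy_aware_reward(r_prompt: float, r_identity: float, t: float,
                         conflict_penalty: float = 0.5, synergy_bonus: float = 0.5) -> float:
    """Toy non-linear aggregation: a time-weighted sum adjusted by a term that
    rewards the two signals agreeing and penalizes them pulling apart."""
    w_p, w_i = time_aware_weights(t)
    base = w_p * r_prompt + w_i * r_identity
    conflict = abs(r_prompt - r_identity)          # large when one reward is high and the other low
    synergy = min(r_prompt, r_identity)            # high only when both rewards are high
    return base - conflict_penalty * conflict + synergy_bonus * synergy

if __name__ == "__main__":
    # Conflicted sample (good identity, poor prompt adherence) vs. synergistic sample.
    print(round(synergy_aware_reward(r_prompt=0.2, r_identity=0.9, t=0.8), 3))
    print(round(synergy_aware_reward(r_prompt=0.8, r_identity=0.9, t=0.8), 3))
```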

Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models
Authors:Jiajun Fan, Tong Wei, Chaoran Cheng, Yuxin Chen, Ge Liu
Balancing exploration and exploitation during reinforcement learning fine-tuning of generative models presents a critical challenge, as existing approaches rely on fixed divergence regularization that creates an inherent dilemma: strong regularization preserves model capabilities but limits reward optimization, while weak regularization enables greater alignment but risks instability or reward hacking. We introduce Adaptive Divergence Regularized Policy Optimization (ADRPO), which automatically adjusts regularization strength based on advantage estimates-reducing regularization for high-value samples while applying stronger regularization to poor samples, enabling policies to navigate between exploration and aggressive exploitation according to data quality. Our implementation with Wasserstein-2 regularization for flow matching generative models achieves remarkable results on text-to-image generation, achieving better semantic alignment and diversity than offline methods like DPO and online methods with fixed regularization like ORW-CFM-W2. ADRPO enables a 2B parameter SD3 model to surpass much larger models with 4.8B and 12B parameters in attribute binding, semantic consistency, artistic style transfer, and compositional control while maintaining generation diversity. ADRPO generalizes to KL-regularized fine-tuning of both text-only LLMs and multi-modal reasoning models, enhancing existing online RL methods like GRPO. In LLM fine-tuning, ADRPO demonstrates an emergent ability to escape local optima through active exploration, while in multi-modal audio reasoning, it outperforms GRPO through superior step-by-step reasoning, enabling a 7B model to outperform substantially larger commercial models including Gemini 2.5 Pro and GPT-4o Audio, offering an effective plug-and-play solution to the exploration-exploitation challenge across diverse generative architectures and modalities.
Paper and Project Links
PDF 30 pages
Summary
Balancing exploration and exploitation during reinforcement-learning fine-tuning of generative models is a major challenge. Existing approaches rely on fixed divergence regularization, which creates an inherent dilemma: strong regularization preserves model capabilities but limits reward optimization, while weak regularization enables greater alignment but risks instability or reward hacking. Adaptive Divergence Regularized Policy Optimization (ADRPO) automatically adjusts regularization strength based on advantage estimates, reducing regularization for high-value samples and strengthening it for poor ones, so the policy can navigate between exploration and aggressive exploitation according to data quality. With Wasserstein-2 regularization for flow-matching generative models, ADRPO achieves strong results on text-to-image generation: a 2B-parameter SD3 model surpasses much larger 4.8B- and 12B-parameter models in attribute binding, semantic consistency, artistic style transfer, and compositional control while maintaining generation diversity. ADRPO also generalizes to KL-regularized fine-tuning of text-only LLMs and multimodal reasoning models and enhances existing online RL methods such as GRPO. In LLM fine-tuning it shows an emergent ability to escape local optima through active exploration; in multimodal audio reasoning it outperforms GRPO through better step-by-step reasoning, letting a 7B model surpass substantially larger commercial models such as Gemini 2.5 Pro and GPT-4o Audio. It thus offers an effective plug-and-play solution to the exploration-exploitation challenge across diverse generative architectures and modalities.
Key Takeaways
- Balancing exploration and exploitation is critical when fine-tuning generative models with reinforcement learning.
- Existing methods rely on fixed divergence regularization, which creates a dilemma: strong regularization limits reward optimization, while weak regularization risks instability or reward hacking.
- Adaptive Divergence Regularized Policy Optimization (ADRPO) automatically adjusts regularization strength based on advantage estimates, according to data quality.
- On text-to-image generation, ADRPO achieves strong results, letting a smaller model surpass much larger ones while improving semantic alignment and diversity.
- ADRPO also applies to KL-regularized fine-tuning of text-only LLMs and multimodal reasoning models.
- In LLM fine-tuning, ADRPO shows an ability to escape local optima; in multimodal audio reasoning, it outperforms GRPO.
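The core of ADRPO, adjusting regularization strength per sample from its advantage estimate, can be sketched with a simple schedule. The exponential decay below and the form of the per-sample objective are illustrative assumptions (the paper instantiates the divergence as Wasserstein-2 for flow matching and KL for LLMs); the sketch only shows high-advantage samples receiving weak regularization and poor samples receiving strong regularization.

```python
import math
from typing import List

def adaptive_reg_coeff(advantage: float, base_coeff: float = 1.0, sensitivity: float = 2.0,
                       floor: float = 0.05) -> float:
    """Toy schedule: high-advantage samples get weak regularization (freer exploitation),
    low or negative-advantage samples get strong regularization (stay near the reference)."""
    return max(floor, base_coeff * math.exp(-sensitivity * advantage))

def adrpo_style_loss(advantages: List[float], logp_ratios: List[float],
                     divergences: List[float]) -> float:
    """Per-sample objective: -(advantage * log-prob ratio) + coeff(advantage) * divergence,
    averaged over the batch. Both the divergence term and the exponential schedule
    are illustrative assumptions, not the paper's exact formulation."""
    per_sample = []
    for adv, ratio, div in zip(advantages, logp_ratios, divergences):
        coeff = adaptive_reg_coeff(adv)
        per_sample.append(-(adv * ratio) + coeff * div)
    return sum(per_sample) / len(per_sample)

if __name__ == "__main__":
    print(adaptive_reg_coeff(1.5), adaptive_reg_coeff(-1.0))  # weak vs. strong regularization
    print(adrpo_style_loss(advantages=[1.5, -1.0], logp_ratios=[0.1, -0.2], divergences=[0.4, 0.4]))
```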