
R1_Reasoning


⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace 免费体验

2025-06-09 更新

Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

Authors:Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang

Recent advances in slow-thinking language models (e.g., OpenAI-o1 and DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks by emulating human-like reflective cognition. However, extending such capabilities to multi-modal large language models (MLLMs) remains challenging due to the high cost of retraining vision-language alignments when upgrading the underlying reasoner LLMs. A straightforward solution is to decouple perception from reasoning, i.e., converting visual inputs into language representations (e.g., captions) that are then passed to a powerful text-only reasoner. However, this decoupling introduces a critical challenge: the visual extractor must generate descriptions that are both faithful to the image and informative enough to support accurate downstream reasoning. To address this, we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward Optimization (RACRO) - a reasoning-guided reinforcement learning strategy that aligns the extractor’s captioning behavior with the reasoning objective. By closing the perception-reasoning loop via reward-based optimization, RACRO significantly enhances visual grounding and extracts reasoning-optimized representations. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance while enabling superior scalability and plug-and-play adaptation to more advanced reasoning LLMs without the necessity for costly multi-modal re-alignment.

关于慢思考语言模型(如OpenAI-o1和DeepSeek-R1)的最新进展显示,通过模仿人类的反思认知,它们在复杂推理任务中表现出了显著的能力。然而,将这些能力扩展到多模态大型语言模型(MLLMs)仍然具有挑战性,因为升级底层推理LLM时需要重新训练视觉-语言对齐,成本高昂。一种直接的解决方案是将感知与推理解耦,即将视觉输入转换为语言表示(例如图像描述),再传递给强大的纯文本推理器。然而,这种解耦引入了一个关键问题:视觉提取器必须生成既忠实于图像、又足以支持准确下游推理的描述。为了解决这一问题,我们提出通过描述奖励优化实现推理对齐的感知解耦(RACRO)——一种由推理引导的强化学习策略,使提取器的描述行为与推理目标保持一致。通过基于奖励的优化闭合感知-推理回路,RACRO显著增强了视觉定位(grounding)能力,并提取出面向推理优化的表示。在多模态数学和科学基准测试上的实验表明,所提出的RACRO方法达到了最先进的平均性能,同时具备出色的可扩展性,并能即插即用地适配更先进的推理LLM,而无需昂贵的多模态重新对齐。
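
下面是一个极简的示意代码(非论文官方实现;`captioner`、`reasoner`、`verify_answer` 等均为假设的占位接口),用于说明 RACRO 这类"以下游推理结果作为描述奖励"的基本思路:先由提取器生成面向问题的图像描述,再将描述交给纯文本推理器作答,最后以答案是否正确(可验证奖励)作为该描述的奖励信号,供后续策略优化使用。

```python
# 示意:用下游推理结果为图像描述打分(RACRO 思路的极简草图,接口均为假设)
from typing import Callable, List

def caption_rewards(image, question: str, gold_answer: str,
                    captioner: Callable, reasoner: Callable,
                    verify_answer: Callable, n_samples: int = 4) -> List[float]:
    """对同一张图采样多条描述,分别让纯文本推理器作答,答对记 1 分、答错记 0 分。"""
    rewards = []
    for _ in range(n_samples):
        caption = captioner(image, question)                      # 1) 提取器生成面向问题的描述
        answer = reasoner(f"描述:{caption}\n问题:{question}")      # 2) 纯文本推理器基于描述作答
        rewards.append(1.0 if verify_answer(answer, gold_answer) else 0.0)  # 3) 可验证奖励
    return rewards  # 这些奖励随后可交给 GRPO/PPO 等算法来更新提取器的描述策略
```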

论文及项目相关链接

PDF

Summary
近期的慢思考语言模型(如OpenAI-o1和DeepSeek-R1)通过模仿人类的反思认知,展现出惊人的复杂推理能力。然而,将其能力扩展到多模态大型语言模型(MLLMs)时面临挑战:升级底层推理LLM需要重新训练视觉-语言对齐,成本高昂。为解决这一问题,本文提出通过描述奖励优化实现推理对齐的感知解耦方法(RACRO)。该方法通过基于奖励的优化闭合感知与推理的回路,显著提高了视觉定位的准确性,并提取出面向推理优化的表示。实验证明,RACRO方法在多模态数学和科学基准测试中取得了最先进的平均性能,同时在适配更先进的推理LLM时具有出色的可扩展性和即插即用性,无需昂贵的多模态重新对齐。

Key Takeaways

  1. 慢思考语言模型如OpenAI-o1和DeepSeek-R1具备复杂推理任务的能力,能够模拟人类的反思认知。
  2. 将这些模型的能力扩展到多模态大型语言模型(MLLMs)面临挑战,主要由于视觉语言对齐的重训成本高昂。
  3. 提出一种名为RACRO的方法来解决这一问题,它通过奖励优化实现感知与推理的解耦和闭环。
  4. RACRO方法能够显著提高视觉定位的准确性并提取推理优化后的表示形式。
  5. 实验结果显示RACRO在多模态数学和科学基准测试中取得最先进的平均性能。
  6. RACRO方法具有良好的可扩展性和适应性,可以轻松适应更先进的推理LLMs而无需进行复杂的重新训练和调整。

Cool Papers

点此查看论文截图

Leveraging Reward Models for Guiding Code Review Comment Generation

Authors:Oussama Ben Sghaier, Rosalia Tufano, Gabriele Bavota, Houari Sahraoui

Code review is a crucial component of modern software development, involving the evaluation of code quality, providing feedback on potential issues, and refining the code to address identified problems. Despite these benefits, code review can be rather time consuming, and influenced by subjectivity and human factors. For these reasons, techniques to (partially) automate the code review process have been proposed in the literature. Among those, the ones exploiting deep learning (DL) are able to tackle the generative aspect of code review, by commenting on a given code as a human reviewer would do (i.e., comment generation task) or by automatically implementing code changes required to address a reviewer’s comment (i.e., code refinement task). In this paper, we introduce CoRAL, a deep learning framework automating review comment generation by exploiting reinforcement learning with a reward mechanism considering both the semantics of the generated comments as well as their usefulness as input for other models automating the code refinement task. The core idea is that if the DL model generates comments that are semantically similar to the expected ones or can be successfully implemented by a second model specialized in code refinement, these comments are likely to be meaningful and useful, thus deserving a high reward in the reinforcement learning framework. We present both quantitative and qualitative comparisons between the comments generated by CoRAL and those produced by the latest baseline techniques, highlighting the effectiveness and superiority of our approach.

代码审查是现代软件开发的重要组成部分,涉及评估代码质量、对潜在问题提供反馈,以及针对已识别的问题改进代码。尽管好处很多,代码审查可能相当耗时,并受到主观性和人为因素的影响。因此,文献中已提出了(部分)自动化代码审查过程的技术。其中,利用深度学习(DL)的技术能够处理代码审查中的生成类任务:像人类审查者那样对给定代码撰写评论(即评论生成任务),或自动实施为回应审查者评论所需的代码更改(即代码改进任务)。在本文中,我们介绍了CoRAL,这是一个利用强化学习自动生成审查评论的深度学习框架,其奖励机制同时考虑生成评论的语义,以及这些评论作为其他代码改进模型输入时的有用性。核心思想是:如果DL模型生成的评论与预期评论语义相似,或者能够被专门负责代码改进的第二个模型成功落实,那么这些评论很可能是有意义且有用的,因此在强化学习框架中应获得高奖励。我们对CoRAL生成的评论与最新基线技术生成的评论进行了定量和定性比较,突显了我们方法的有效性和优越性。
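
下面给出一个示意性的奖励计算草图(非 CoRAL 官方实现;`semantic_similarity`、`refiner`、`code_matches` 均为假设函数),说明"语义相似度 + 能否被代码改进模型成功落实"这两种信号如何组合成强化学习的标量奖励:

```python
# 示意:评论生成的复合奖励 = 语义相似度 + 下游代码改进是否成功(CoRAL 思路草图)
def review_comment_reward(generated_comment: str, reference_comment: str,
                          code_before: str, code_after_expected: str,
                          semantic_similarity, refiner, code_matches,
                          alpha: float = 0.5) -> float:
    """semantic_similarity 返回 [0,1] 的相似度;refiner 根据评论修改代码;
    code_matches 判断修改结果是否符合期望(三者均为假设接口)。"""
    sem = semantic_similarity(generated_comment, reference_comment)      # 语义层面的奖励
    refined = refiner(code_before, generated_comment)                    # 用评论驱动代码改进模型
    useful = 1.0 if code_matches(refined, code_after_expected) else 0.0  # 可落实性奖励
    return alpha * sem + (1.0 - alpha) * useful                          # 加权组合(alpha 为假设超参)
```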

论文及项目相关链接

PDF

Summary

本文介绍了代码审查在现代软件开发中的重要性,并指出了其耗时、受主观因素影响等问题。为解决这些问题,文章提出了一种利用深度学习技术自动化审查评论生成的框架CoRAL。该框架通过强化学习生成评论,其奖励机制同时考虑评论与预期评论的语义相似度,以及评论作为代码改进模型输入时能否被成功落实。文章通过定量和定性比较,展示了CoRAL相较于最新基线技术在评论生成方面的有效性和优越性。

Key Takeaways

  1. 代码审查在软件开发中的重要性:包括评估代码质量、提供潜在问题反馈和针对已识别问题优化代码。
  2. 代码审查的挑战:包括时间消耗大、受主观性和人为因素影响。
  3. 深度学习在自动化代码审查中的应用:尤其是评论生成和代码优化的任务。
  4. CoRAL框架介绍:利用强化学习自动化评论生成,并考虑生成的评论语义和其作为代码优化输入的有用性。
  5. CoRAL框架的奖励机制:根据评论的语义相似性和对代码优化模型的有效性来确定奖励。
  6. 文章通过比较展示了CoRAL的有效性和优越性:与最新基线技术在评论生成方面的比较。

Cool Papers

点此查看论文截图

RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming

Authors:Xiang Zheng, Xingjun Ma, Wei-Bin Lee, Cong Wang

Red teaming has proven to be an effective method for identifying and mitigating vulnerabilities in Large Language Models (LLMs). Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy among existing red teaming techniques. However, a lack of a unified benchmark hinders current RFT-based red teaming methods. Implementation details, especially in Proximal Policy Optimization (PPO)-based RFT, significantly affect outcome stability and reproducibility. To address this issue, we introduce RedRFT, a lightweight benchmark designed to simplify and standardize the implementation and evaluation of RFT-based red teaming. RedRFT combines the design strengths of both single-file CleanRL and highly modularized Tianshou, offering high-quality single-file red teaming implementations and modular PPO core components, such as the General Advantage Estimator. It supports a variety of token and sentence diversity metrics, featuring modularized intrinsic reward computation that facilitates plug-and-play experimentation. To clarify their influence on RFT performance, we conducted an extensive ablation study on key components, including Low-Rank Adaptation (LoRA), Kullback-Leibler (KL) divergence, and Lagrange Multiplier. We hope this work contributes to 1) gaining a comprehensive understanding of the implementation nuances of RFT-based red teaming algorithms, and 2) enabling rapid prototyping of innovative features for RFT-based red teaming. Code for the benchmark can be accessed at https://github.com/x-zheng16/RedRFT.git.

红队实践已被证明是识别和解决大型语言模型(LLM)漏洞的有效方法。强化微调(RFT)作为现有红队技术中的一种有前途的策略而崭露头角。然而,缺乏统一的基准测试阻碍了当前的基于RFT的红队方法。实现细节,特别是在基于近端策略优化(PPO)的RFT中,会显著影响结果的稳定性和可重复性。为了解决这一问题,我们引入了RedRFT,这是一个轻量级的基准测试,旨在简化和标准化基于RFT的红队的实现和评估。RedRFT结合了单文件CleanRL和高度模块化的Tianshou的设计优势,提供高质量的单文件红队实现和模块化PPO核心组件,如通用优势估计器。它支持多种令牌和句子多样性指标,具有模块化的内在奖励计算,便于即插即用实验。为了明确关键组件对RFT性能的影响,我们对包括低秩适应(LoRA)、Kullback-Leibler(KL)散度和拉格朗日乘数在内的关键组件进行了广泛的消融研究。我们希望这项工作有助于1)全面理解基于RFT的红队算法的实现细节;2)加快基于RFT的红队创新功能的快速原型设计。该基准测试的代码可在https://github.com/x-zheng16/RedRFT.git访问。
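
文中提到 RedRFT 将 PPO 的核心组件(如广义优势估计 GAE)模块化。下面给出 GAE 的一个标准、自包含的 numpy 实现草图(与 RedRFT 的具体接口无关,仅用于理解该组件的计算方式):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """广义优势估计(GAE-Lambda)。
    rewards: 长度 T 的奖励序列;values: 长度 T+1 的价值估计(包含末状态)。"""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD 残差
        last = delta + gamma * lam * last
        adv[t] = last
    returns = adv + values[:-1]   # 可作为价值函数的回归目标
    return adv, returns

# 用法示例(玩具数值)
adv, ret = gae_advantages([0.0, 0.0, 1.0], [0.1, 0.2, 0.5, 0.0])
```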

论文及项目相关链接

PDF

Summary
强化微调(RFT)在大型语言模型(LLM)的红队测试中显示出潜力,但缺乏统一的基准测试影响了其应用。为解决这一问题,我们引入RedRFT基准测试,简化并标准化基于RFT的红队测试的实施与评估。RedRFT结合了CleanRL和Tianshou的设计优势,提供高质量的单文件红队测试实现和模块化的PPO核心组件。它支持多种令牌和句子多样性指标,并将内在奖励计算模块化,便于即插即用式实验。通过消融研究,我们了解了LoRA、KL散度、拉格朗日乘数等关键组件对RFT性能的影响。此工作旨在帮助全面理解基于RFT的红队算法的实现细节,并加速其创新功能的原型设计。

Key Takeaways

  • 红队测试已证明是识别并缓解大型语言模型(LLM)漏洞的有效方法。
  • 强化微调(RFT)是红队测试中的有前途的策略。
  • 缺乏统一基准影响了RFT的应用。
  • RedRFT基准测试旨在简化并标准化RFT红队测试的实施与评估。
  • RedRFT结合了CleanRL和Tianshou的优点,提供高质量的单文件实现和模块化PPO核心组件。
  • RedRFT支持多种令牌和句子多样性指标,并模块化内在奖励计算。
  • 通过消融研究,理解了关键组件(如LoRA、KL散度、拉格朗日乘数)对RFT性能的影响。

Cool Papers

点此查看论文截图

OpenThoughts: Data Recipes for Reasoning Models

Authors:Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah Pratt, Vivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A. Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G. Dimakis, Ludwig Schmidt

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThoughts3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond - improvements of 15.3, 17.2, and 20.5 percentage points compared to the DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on https://openthoughts.ai.

推理模型在数学、代码和科学等多个基准测试上取得了快速进展。然而,关于最佳推理训练方案仍然存在许多未解决的问题,因为最先进的模型通常依赖于专有数据集,而公开信息很少或没有。为了解决这一问题,OpenThoughts项目的目标是创建用于训练推理模型的开源数据集。经过初步探索,我们的OpenThoughts2-1M数据集催生了OpenThinker2-32B,这是首个在公开推理数据上训练的模型,能够在AIME和LiveCodeBench等标准推理基准测试上与DeepSeek-R1-Distill-32B相匹配。然后,我们通过系统地调查数据生成管道的每一步进行了1000多次受控实验,进一步改进了我们的数据集,从而产生了OpenThoughts3。将管道扩展到120万个例子,并使用QwQ-32B作为教师,我们得到了OpenThoughts3-7B模型,取得了最先进的成果:AIME 2025上达到53%,LiveCodeBench 06/24-01/25上达到51%,GPQA Diamond上达到54%——与DeepSeek-R1-Distill-Qwen-7B相比,分别提高了15.3、17.2和20.5个百分点。我们的所有数据集和模型都可以在https://openthoughts.ai上找到。

论文及项目相关链接

PDF https://www.openthoughts.ai/blog/ot3. arXiv admin note: text overlap with arXiv:2505.23754 by other authors

Summary

本文介绍了OpenThoughts项目致力于创建用于训练推理模型公开数据集的目标。从初始探索的OpenThoughts2-1M数据集开始,成功研发出能在公开推理数据上训练的OpenThinker2-32B模型,其性能匹配DeepSeek-R1-Distill-32B在AIME和LiveCodeBench等标准推理基准测试上的表现。后续通过系统地研究数据生成管道的每一步,进行了超过一千次的控制实验,推出了OpenThoughts3数据集和OpenThoughts3-7B模型。该模型在AIME 2025、LiveCodeBench 06/24-01/25以及GPQA Diamond上实现了业界领先的结果。所有数据集和模型均可在https://openthoughts.ai上获取。

Key Takeaways

  1. OpenThoughts项目的目标是创建公开数据集以训练推理模型,解决当前许多推理模型依赖于专有数据集的问题。
  2. OpenThoughts2-1M数据集衍生出OpenThinker2-32B模型,能在公开推理数据上进行训练,性能与DeepSeek-R1-Distill-32B相匹配。
  3. 通过系统地研究数据生成管道的每一步以及进行大量控制实验,OpenThoughts项目进一步优化了数据集并推出了OpenThoughts3。
  4. OpenThoughts3-7B模型通过扩大管道规模至1.2M示例并使用QwQ-32B作为教师,实现了显著的性能提升。
  5. OpenThoughts3-7B模型在AIME 2025、LiveCodeBench 06/24-01/25以及GPQA Diamond上取得了业界最佳结果。
  6. OpenThoughts项目公开了所有数据集和模型,方便公众访问和使用。

Cool Papers

点此查看论文截图

Authors:Tatpong Katanyukul

Logic entailment is essential to reasoning, but entailment checking has the worst-case complexity of an exponential of the variable size. With recent development, quantum computing when mature may allow an effective approach for various combinatorial problems, including entailment checking. Grover algorithm uses Grover operations, selective phase inversion and amplitude amplification to address a search over unstructured data with quadratic improvement from a classical method. Its original form is intended for a single-winner scenario: exactly one match is promised. Its extension to multiple-winner cases employs probabilistic control over a number of applications of Grover operations, while a no-winner case is handled by time-out. Our study explores various schemes of the "eigenmarking" approach. Still relying on Grover operations, but the approach introduces additional qubits to tag the eigenstates. The tagged eigenstates are to facilitate an interpretation of the measured results and enhance identification of a no-winner case (related to no logic violation in entailment context). Our investigation experiments with three variations of eigenmarking on a two-qubit system using an IBM Aer simulator. The results show strong distinguishability in all schemes with the best relative distinguishabilities of 19 and 53 in worst case and in average case, respectively. Our findings reveal a viable quantum mechanism to differentiate a no-winner case from other scenarios, which could play a pivotal role in entailment checking and logic reasoning in general.

逻辑推理中蕴含关系至关重要,但蕴含关系的检验在最坏情况下的复杂度是变量大小的指数级。随着最新发展,当量子计算机成熟时,它可能为各种组合问题提供一种有效方法,包括蕴含关系检验。Grover算法使用Grover操作、选择性相位反转和振幅放大来解决对无结构数据的搜索问题,与经典方法相比有二次改进。其原始形式是为了解决单一胜出场景:保证恰好有一个匹配。其扩展到多个胜出场景的情况采用概率控制多次应用Grover操作,而无胜出情况则通过超时处理。我们的研究探索了“特征标记”方法的多种方案。该方法仍然依赖于Grover操作,但引入了额外的量子比特来标记本征态。标记的本征态有助于解释测量结果,并提高对无胜出情况的识别(与蕴含关系上下文中的无逻辑违规相关)。我们在IBM Aer模拟器上对一个两量子比特系统进行了三种特征标记方法的实验。结果表明,所有方案的区分度都很强,其中最坏情况下最佳相对区分度为19,平均情况下的最佳相对区分度为53。我们的研究揭示了一种可行的量子机制来区分无胜出情况与其他场景,这在蕴含检验和逻辑推理中可能起到关键作用。
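
为帮助理解文中反复出现的 Grover 操作(选择性相位反转 + 振幅放大),下面用 numpy 在两量子比特(N=4)上做一个自包含的态矢量模拟。注意这只是标准 Grover 搜索的演示,并不包含论文提出的"特征标记(eigenmarking)"附加比特方案:

```python
import numpy as np

N = 4                      # 两量子比特共 4 个计算基态
marked = 2                 # 假设被标记(获胜)的基态索引

state = np.ones(N) / np.sqrt(N)                    # 均匀叠加态 |s>

# 一次 Grover 迭代:oracle 相位反转 + 围绕均值的扩散算子(振幅放大)
oracle = np.eye(N)
oracle[marked, marked] = -1.0                      # 选择性相位反转
s = np.ones(N) / np.sqrt(N)
diffusion = 2.0 * np.outer(s, s) - np.eye(N)       # 2|s><s| - I

state = diffusion @ (oracle @ state)

probs = state ** 2
print(probs)               # N=4 时,一次迭代后被标记态的测量概率约为 1
```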

论文及项目相关链接

PDF 8 pages

Summary

本文探讨了逻辑蕴含在推理中的重要性,以及量子计算在解决包含逻辑蕴含检查在内的组合问题上的潜在应用。文章介绍了Grover算法及其在不同场景下的应用,并探索了一种名为“eigenmarking”的方法,该方法使用额外的量子比特标记本征态,以提高无赢家情况的识别能力。在IBM Aer模拟器上的实验结果表明,这种方法的区分能力较强,在最坏和平均情况下,最佳相对区分度分别为19和53。这为量子机制在逻辑蕴含检查和逻辑推理中区分无赢家情况提供了可行的解决方案。

Key Takeaways

  1. 逻辑蕴含在推理中至关重要,但蕴含检查的计算复杂度较高。
  2. 量子计算对于解决组合问题,包括逻辑蕴含检查具有潜力。
  3. Grover算法是一种在无结构数据上进行搜索的有效方法,相比经典方法具有二次加速。
  4. Grover算法可以扩展到多赢家情况,并通过超时处理无赢家情况。
  5. “Eigenmarking”方法使用额外的量子比特标记本征态,提高结果的解释性和无赢家情况的识别。
  6. 在IBM Aer模拟器上的实验表明,eigenmarking方法具有良好的区分能力。

Cool Papers

点此查看论文截图

Seed-Coder: Let the Code Model Curate Data for Itself

Authors:ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, Yonghui Wu

Code data in large language model (LLM) pretraining is recognized crucial not only for code-related tasks but also for enhancing general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.

大型语言模型(LLM)预训练中的代码数据被认为至关重要,这不仅关乎与代码相关的任务,而且对提高LLM的整体智能水平也起着重要作用。目前,开源LLM往往严重依赖人工来生成其代码预训练数据,例如针对特定编程语言定制的手工过滤规则,或使用人工标注数据来训练质量过滤器。然而,这些方法本质上可扩展性有限、容易受主观偏见的影响,并且在跨不同编程语言扩展和维护时成本高昂。为了应对这些挑战,我们推出了Seed-Coder,这是一个开源LLM系列,包括基础、指令和推理模型,参数规模均为8B,并最大限度地减少了数据构建中的人工参与。我们的代码预训练数据由以模型为中心的数据管道生成,该管道主要利用LLM对代码数据进行评分和过滤。指令模型通过监督微调与偏好优化进行进一步训练,而推理模型则利用长链思维(LongCoT)强化学习来提高多步骤代码推理能力。Seed-Coder在同等规模的开源模型中取得了最先进的成果,甚至超越了一些更大的模型,在代码生成、代码补全、代码编辑、代码推理和软件工程任务上表现出卓越的性能。
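
下面是"以模型为中心的数据管道"的一个极简示意(非 Seed-Coder 官方流程;`llm_quality_score` 为假设的打分接口),说明如何用 LLM 打分来替代手工规则过滤代码预训练数据:

```python
# 示意:用 LLM 对代码样本打分并按阈值过滤(模型中心数据管道的草图,接口为假设)
def filter_code_corpus(samples, llm_quality_score, threshold=0.7):
    """samples: 代码字符串列表;llm_quality_score(code) -> [0,1] 的质量分(假设接口)。"""
    kept = []
    for code in samples:
        score = llm_quality_score(code)   # 由 LLM 评估可读性、正确性、教育价值等
        if score >= threshold:            # 仅保留高质量样本进入预训练语料
            kept.append((code, score))
    return kept
```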

论文及项目相关链接

PDF

Summary

大规模语言模型(LLM)预训练中的代码数据对于代码相关任务以及增强LLM的通用智能至关重要。当前开源LLM往往依赖人工来生成代码预训练数据,如针对个别编程语言定制的手动过滤规则或使用人工标注数据来训练质量过滤器。然而,这些方法存在可扩展性低、易引入主观偏见以及跨多种编程语言维护和扩展成本高昂等局限性。为解决这些挑战,我们推出Seed-Coder系列开源LLM,包括基础、指令和推理模型,规模均为8B,最小化数据构建中的人为干预。其代码预训练数据主要由以模型为中心的数据管道生成,主要利用LLM进行评分和过滤。指令模型通过监督微调与偏好优化进行训练,而推理模型则利用长链思维强化学习来提升多步骤代码推理能力。Seed-Coder在同类开源模型中表现领先,甚至在某些任务上超过了部分更大规模的模型,在代码生成、代码补全、代码编辑、代码推理和软件工程任务中展现了卓越性能。

Key Takeaways

  1. 代码数据在LLM预训练中的重要性:不仅对于代码相关任务关键,而且有助于增强LLM的通用智能。
  2. 当前依赖人工生成代码预训练数据的局限性:包括可扩展性低、主观偏见以及高维护成本。
  3. Seed-Coder的特点:开源LLM系列,包括基础、指令和推理模型,最小化数据构建中的人为干预。
  4. 数据生成方法:以模型为中心的数据管道,主要利用LLM进行评分和过滤。
  5. 指令模型的训练方式:通过监督微调与偏好优化进行训练。
  6. 推理模型的强化学习应用:利用长链思维强化学习提升多步骤代码推理能力。

Cool Papers

点此查看论文截图

ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking

Authors:Xianming Li, Aamir Shakir, Rui Huang, Julius Lipp, Jing Li

Reranking is fundamental to information retrieval and retrieval-augmented generation, with recent Large Language Models (LLMs) significantly advancing reranking quality. While recent advances with LLMs have significantly improved document reranking quality, current approaches primarily rely on large-scale LLMs (>7B parameters) through zero-shot prompting, presenting high computational costs. Small Language Models (SLMs) offer a promising alternative because of their efficiency, but our preliminary quantitative analysis reveals they struggle with understanding task prompts without fine-tuning. This limits their effectiveness for document reranking tasks. To address this issue, we introduce a novel two-stage training approach, ProRank, for SLM-based document reranking. First, we propose a prompt warmup stage using reinforcement learning GRPO to steer SLMs to understand task prompts and generate more accurate coarse-grained binary relevance scores for document reranking. Then, we continuously fine-tune the SLMs with a fine-grained score learning stage without introducing additional layers to further improve the reranking quality. Comprehensive experimental results demonstrate that the proposed ProRank consistently outperforms both the most advanced open-source and proprietary reranking models. Notably, our lightweight ProRank-0.5B model even surpasses the powerful 32B LLM reranking model on the BEIR benchmark, establishing that properly trained SLMs can achieve superior document reranking performance while maintaining computational efficiency.

重排序是信息检索和检索增强生成的基础,近期的大型语言模型(LLM)显著提升了重排序质量。尽管LLM的最新进展大幅提高了文档重排序的质量,但当前方法主要依赖大规模LLM(参数量超过70亿)通过零样本提示来实现,计算成本很高。小型语言模型(SLM)因其高效性成为一种有前途的替代方案,但我们的初步定量分析表明,若不经过微调,它们难以理解任务提示,这限制了其在文档重排序任务中的有效性。为了解决这一问题,我们提出了一种面向SLM文档重排序的新型两阶段训练方法ProRank。首先,我们提出一个使用强化学习GRPO的提示预热阶段,引导SLM理解任务提示,并为文档重排序生成更准确的粗粒度二元相关性分数。然后,我们在不引入额外网络层的情况下,通过细粒度分数学习阶段持续微调SLM,以进一步提高重排序质量。全面的实验结果表明,所提出的ProRank持续优于最先进的开源和专有重排序模型。值得注意的是,我们轻量级的ProRank-0.5B模型在BEIR基准测试中甚至超越了强大的32B LLM重排序模型,证明经过适当训练的SLM可以在保持计算效率的同时实现更优的文档重排序性能。

论文及项目相关链接

PDF

Summary

本文探讨了信息检索和检索增强生成中的重排技术。尽管大型语言模型(LLM)在重排质量方面取得了显著进展,但其高计算成本限制了实际应用。为解决小语言模型(SLM)未经微调时难以理解任务提示的问题,提出了一种新型的两阶段训练方法——ProRank。实验表明,ProRank在文档重排任务上表现卓越,尤其是轻量级模型ProRank-0.5B,在BEIR基准测试中甚至超越了强大的32B LLM重排模型,实现了计算效率与性能的双重优势。

Key Takeaways

  1. 信息检索和检索增强生成中的重排技术至关重要,近期大型语言模型(LLMs)在此领域取得显著进展。
  2. 虽然LLMs提高了文档重排质量,但其高计算成本限制了实际应用。
  3. 小语言模型(SLMs)因高效性成为有前景的替代方案,但在任务提示理解方面存在困难。
  4. 提出了一种新型的两阶段训练法——ProRank,以解决SLMs在文档重排任务中的挑战。
  5. ProRank包括一个提示预热阶段,使用强化学习GRPO引导SLM理解任务提示,并生成更准确的粗粒度二元相关性得分。
  6. ProRank通过精细粒度的得分学习阶段持续微调SLMs,进一步提高重排质量,且无需添加额外层。

Cool Papers

点此查看论文截图

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Authors:Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided refinements simultaneously while maintaining exploration. Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that Critique-GRPO consistently outperforms supervised learning-based and RL-based fine-tuning approaches across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%, respectively. Notably, Critique-GRPO surpasses a strong baseline that incorporates expert demonstrations within online RL. Further analysis reveals two critical insights about policy exploration: (1) higher entropy does not always guarantee efficient learning from exploration, and (2) longer responses do not necessarily lead to more effective exploration.

近年来,强化学习(RL)在具有数值反馈方面的进展,如标量奖励,已经显著提高了大型语言模型(LLM)的复杂推理能力。尽管如此,我们确定了仅使用数值反馈的RL所面临的三大挑战:性能高原期、自我反思的局限性以及持续失败。我们随后证明,即使在性能高原期后,通过利用批判形式的自然语言反馈,RL微调模型也可以在持续失败的问题上产生正确的改进。基于此见解,我们提出了Critique-GRPO,这是一个在线RL框架,它结合了自然语言反馈和数值反馈来进行有效的策略优化。Critique-GRPO使LLM能够同时从初始响应和批判指导的改进中学习,同时保持探索。使用Qwen2.5-7B-Base和Qwen3-8B-Base的广泛实验表明,Critique-GRPO在八个具有挑战性的数学、STEM和一般推理任务上始终优于基于监督学习和RL的微调方法,平均pass@1得分分别提高了约4.5%和5%。值得注意的是,Critique-GRPO超越了一个强大的基线,该基线结合了在线RL中的专家演示。进一步的分析揭示了关于策略探索的两个关键见解:(1)高熵并不总是保证从探索中有效地学习,(2)更长的响应并不一定能导致更有效的探索。

论文及项目相关链接

PDF 38 pages

Summary
强化学习(RL)在数值反馈(如标量奖励)方面的最新进展显著提高了大型语言模型(LLM)的复杂推理能力。然而,存在三大挑战。即使性能进入瓶颈期,利用批判性的自然语言反馈,RL微调模型仍然能对持续失败的问题进行正确的改进。基于此,提出了结合自然语言与数值反馈的在线RL框架——Critique-GRPO,用于有效的策略优化。该框架使LLM能够同时从初步响应和批判性指导的改进中学习,并保持探索。实验表明,Critique-GRPO在多个挑战性任务上始终优于基于监督学习和RL的微调方法,平均通过率提高约4.5%和5%。此外,还发现了关于策略探索的两个关键见解。

Key Takeaways

  1. 强化学习(RL)结合数值反馈提升了大型语言模型(LLM)的推理能力。
  2. 单纯依赖数值反馈的RL面临三大挑战:性能瓶颈、自我反思的局限性以及持续失败的问题。
  3. 利用批判性的自然语言反馈,RL微调模型可以在性能瓶颈期后对持续失败的问题进行正确改进。
  4. 提出的Critique-GRPO框架结合了自然语言与数值反馈,实现有效的策略优化,使LLM能从初步响应和批判指导的改进中学习。
  5. Critique-GRPO框架在多个挑战性任务上表现优越,优于监督学习和RL的微调方法。
  6. 高熵不一定能保证从探索中高效学习。

Cool Papers

点此查看论文截图

BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage

Authors:Kalyan Nakka, Nitesh Saxena

The inherent risk of generating harmful and unsafe content by Large Language Models (LLMs), has highlighted the need for their safety alignment. Various techniques like supervised fine-tuning, reinforcement learning from human feedback, and red-teaming were developed for ensuring the safety alignment of LLMs. However, the robustness of these aligned LLMs is always challenged by adversarial attacks that exploit unexplored and underlying vulnerabilities of the safety alignment. In this paper, we develop a novel black-box jailbreak attack, called BitBypass, that leverages hyphen-separated bitstream camouflage for jailbreaking aligned LLMs. This represents a new direction in jailbreaking by exploiting fundamental information representation of data as continuous bits, rather than leveraging prompt engineering or adversarial manipulations. Our evaluation of five state-of-the-art LLMs, namely GPT-4o, Gemini 1.5, Claude 3.5, Llama 3.1, and Mixtral, in adversarial perspective, revealed the capabilities of BitBypass in bypassing their safety alignment and tricking them into generating harmful and unsafe content. Further, we observed that BitBypass outperforms several state-of-the-art jailbreak attacks in terms of stealthiness and attack success. Overall, these results highlights the effectiveness and efficiency of BitBypass in jailbreaking these state-of-the-art LLMs.

大型语言模型(LLM)生成有害和不安全内容的固有风险,凸显了对其进行安全对齐的必要性。为确保LLM的安全对齐,人们开发了多种技术,例如监督微调、基于人类反馈的强化学习(RLHF)和红队测试(red-teaming)。然而,这些已对齐LLM的稳健性始终面临对抗性攻击的挑战,这类攻击会利用安全对齐中尚未被发现的底层漏洞。在本文中,我们开发了一种新型的黑盒越狱攻击,称为BitBypass,它利用以连字符分隔的比特流伪装来越狱已对齐的LLM。这代表了越狱攻击的一个新方向:利用数据作为连续比特的基础信息表示,而不是依靠提示工程或对抗性操纵。我们从对抗视角评估了五个最先进的LLM,即GPT-4o、Gemini 1.5、Claude 3.5、Llama 3.1和Mixtral,结果表明BitBypass能够绕过它们的安全对齐,并诱使其生成有害和不安全的内容。此外,我们还观察到BitBypass在隐蔽性和攻击成功率方面优于多种最先进的越狱攻击。总体而言,这些结果突显了BitBypass在越狱这些最先进LLM方面的有效性和效率。
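
下面用几行 Python 说明"以连字符分隔的比特流"这种信息表示本身是什么样子(仅演示文本与比特流之间的编码/解码,不构成任何攻击提示):

```python
# 示意:把文本编码为连字符分隔的比特流,再解码回来
def to_bitstream(text: str) -> str:
    return "-".join(format(b, "08b") for b in text.encode("utf-8"))

def from_bitstream(bits: str) -> str:
    return bytes(int(b, 2) for b in bits.split("-")).decode("utf-8")

s = "hello"
encoded = to_bitstream(s)        # 例如 '01101000-01100101-...'
assert from_bitstream(encoded) == s
```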

论文及项目相关链接

PDF 24 pages, 24 figures, and 7 tables

Summary

大型语言模型(LLM)生成有害和不安全内容的固有风险,突显了对安全对齐的需求。为确保LLM的安全对齐,人们开发了多种技术,如监督微调、基于人类反馈的强化学习和红队测试。然而,这些已对齐LLM的稳健性总是受到挑战,因为对抗性攻击会利用尚未被发现的底层脆弱性。本文开发了一种新型的黑盒越狱攻击BitBypass,利用连字符分隔的比特流伪装来进行越狱。这代表了一个新方向,即利用数据作为连续比特的基本信息表示进行越狱,而不是利用提示工程或对抗性操作。我们的评估显示,BitBypass能够绕过这些最先进LLM的安全对齐,并诱使它们生成有害和不安全的内容。总的来说,BitBypass在越狱这些最先进LLM方面非常有效和高效。

Key Takeaways

  1. 大型语言模型(LLMs)在生成内容方面存在潜在风险,需要确保它们的安全对齐。
  2. 为确保LLM的安全对齐,已经开发了多种技术,如监督微调、基于人类反馈的强化学习等。
  3. 对齐的LLMs的稳健性会受到对抗性攻击的挑战,这些攻击会利用安全对齐的未知脆弱性。
  4. 提出了一种新型黑盒越狱攻击BitBypass,它利用比特流伪装技术。
  5. BitBypass能够绕过最先进LLMs的安全对齐,并使其生成有害和不安全的内容。
  6. BitBypass相比其他越狱攻击具有更高的隐蔽性和攻击成功率。

Cool Papers

点此查看论文截图

RewardBench 2: Advancing Reward Model Evaluation

Authors:Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, Nathan Lambert

Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and more domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks – simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation – models score about 20 points on average lower on RewardBench 2 compared to the first RewardBench – while being highly correlated with downstream performance. Compared to most other benchmarks, RewardBench 2 sources new human prompts instead of existing prompts from downstream evaluations, facilitating more rigorous evaluation practices. In this paper, we describe our benchmark construction process and report how existing models perform on it, while quantifying how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, like best-of-N sampling, and RLHF training algorithms like proximal policy optimization.

奖励模型被广泛应用于语言模型的后训练阶段,用于捕捉偏好数据中的微妙信号,并为指令遵循、推理、安全等多个领域的优化提供训练目标。社区已经开始建立评估奖励模型的最佳实践,包括测试特定技能领域能力的基准,以及测试与人类偏好一致性的基准。与此同时,评估方面的进展并未体现在奖励模型于下游任务中的有效性上——据报道,在许多情况下,更简单的直接对齐算法效果更好。本文介绍了RewardBench 2,这是一个新的多技能奖励建模基准,旨在为基于准确率的奖励模型评估带来新的、更具挑战性的数据——与第一版RewardBench相比,模型在RewardBench 2上的得分平均低约20分,但该基准与下游性能高度相关。与其他大多数基准不同,RewardBench 2采集全新的人类提示,而非沿用下游评估中的现有提示,从而促进更严格的评估实践。在本文中,我们描述了基准的构建过程,报告了现有模型在其上的表现,并量化了基准成绩与模型在下游用途中表现的相关性,涵盖推理时扩展算法(如best-of-N采样)与RLHF训练算法(如近端策略优化)。
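
文中提到的推理时扩展算法 best-of-N 采样是奖励模型最直接的下游用途之一。下面是一个自包含的示意(`generate` 与 `reward_model` 均为假设接口),展示奖励模型在该算法中的用法:

```python
# 示意:best-of-N 采样——生成 N 个候选回复,用奖励模型挑选得分最高者(接口为假设)
def best_of_n(prompt: str, generate, reward_model, n: int = 8) -> str:
    """generate(prompt) -> 候选回复;reward_model(prompt, response) -> 标量分数。"""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(reward_model(prompt, c), c) for c in candidates]
    return max(scored, key=lambda x: x[0])[1]   # 返回奖励模型评分最高的回复
```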

论文及项目相关链接

PDF Data, models, and leaderboard available at https://huggingface.co/collections/allenai/reward-bench-2-683d2612a4b3e38a3e53bb51

Summary:语言模型后训练广泛应用奖励模型以捕捉偏好数据中的微妙信号,并为其在指令遵循、推理、安全等多领域的优化提供训练目标。社区开始建立评估奖励模型的最佳实践,从开发特定技能领域的基准测试到与人类偏好一致性的测试。尽管评估方面取得进展,奖励模型在下游任务中的有效性尚未得到体现——在许多情况下,更简单的直接对齐算法表现更好。本文介绍了RewardBench 2,这是一个新的多技能奖励建模基准测试,旨在带来更具挑战性的数据以进行基于准确率的奖励模型评估——与第一版RewardBench相比,模型得分平均低约20分,但与下游性能高度相关。与其他大多数基准测试不同,RewardBench 2采集全新的人类提示,而非沿用下游评估中的现有提示,从而促进更严格的评估实践。本文描述了基准测试的构建过程并报告现有模型在其上的表现,同时量化基准成绩与模型下游使用表现之间的相关性,包括推理时扩展算法(如best-of-N采样)和RLHF训练算法(如近端策略优化)等。

Key Takeaways

  1. 奖励模型在捕捉偏好数据微妙信号方面表现出重要作用,并为语言模型的多个领域优化提供目标。
  2. 社区已经建立了评估奖励模型的最佳实践,包括特定技能领域的基准测试和与人类偏好一致的测试。
  3. RewardBench 2是一个新的多技能奖励建模基准测试,旨在提供更具挑战性的数据以准确评估奖励模型的表现。
  4. 与第一版基准相比,模型在RewardBench 2上的平均得分更低(约低20分),但该基准与下游性能高度相关。
  5. RewardBench 2采用新的人类提示,促进更严格的评估实践。
  6. 现有模型在RewardBench 2上的表现报告,为模型性能提供量化指标。

Cool Papers

点此查看论文截图

Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency

Authors:Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, Si Liu

Understanding real-world videos with complex semantics and long temporal dependencies remains a fundamental challenge in computer vision. Recent progress in multimodal large language models (MLLMs) has demonstrated strong capabilities in vision-language tasks, while reinforcement learning tuning (RLT) has further improved their reasoning abilities. In this work, we explore RLT as a post-training strategy to enhance the video-specific reasoning capabilities of MLLMs. Built upon the Group Relative Policy Optimization (GRPO) framework, we propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. To facilitate effective preference-based optimization, we introduce a variance-aware data selection strategy based on repeated inference to identify samples that provide informative learning signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Our method consistently outperforms supervised fine-tuning and existing RLT baselines, achieving superior performance with significantly less training data. These results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs. Notably, The initial code release (two months ago) has now been expanded with updates, including optimized reward mechanisms and additional datasets. The latest version is available at https://github.com/appletea233/Temporal-R1 .

理解和处理真实世界中的视频,这些视频具有复杂的语义和长期时间依赖性,仍然是计算机视觉领域的一个基本挑战。近期多模态大型语言模型(MLLMs)在视觉语言任务中表现出了强大的能力,而强化学习调优(RLT)则进一步提高了它们的推理能力。在这项工作中,我们探索了RLT作为一种后训练策略,以增强MLLMs的视频特定推理能力。基于群体相对策略优化(GRPO)框架,我们提出了一种双奖励公式,通过离散和连续奖励信号来监督语义和时间推理。为了促进有效的基于偏好的优化,我们引入了一种基于重复推理的方差感知数据选择策略,以识别提供信息丰富学习信号的样本。我们在八个具有代表性的视频理解任务上评估了我们的方法,包括视频问答(VideoQA)、时间视频定位(Temporal Video Grounding)和基于场景的视频问答(Grounded VideoQA)。我们的方法在各种任务上均表现出超越监督微调(supervised fine-tuning)和现有RLT基线方法的性能,且在训练数据量较小的情况下也表现出优越的性能。这些结果突显了奖励设计和数据选择在利用MLLMs推动以推理为中心的视频理解中的重要性。值得注意的是,初始代码发布(两个月前)现已更新扩展,包括优化的奖励机制和额外的数据集。最新版本可在https://github.com/appletea233/Temporal-R1找到。
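
下面给出"基于重复推理的方差感知数据选择"的一个极简示意(非论文官方实现;`model_answer` 与 `is_correct` 为假设接口):对每个样本重复推理多次,保留正确率既非全对也非全错(即奖励方差较大、学习信号更丰富)的样本。

```python
# 示意:方差感知的数据选择——保留重复推理结果存在分歧的样本(接口为假设)
def select_informative_samples(dataset, model_answer, is_correct, k: int = 8):
    """dataset: (video, question, gold) 三元组列表;对每个样本推理 k 次。"""
    selected = []
    for video, question, gold in dataset:
        hits = [is_correct(model_answer(video, question), gold) for _ in range(k)]
        acc = sum(hits) / k
        if 0.0 < acc < 1.0:          # 全对或全错的样本组内奖励方差为零,对策略梯度贡献很小
            selected.append((video, question, gold))
    return selected
```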

论文及项目相关链接

PDF

Summary
多模态大型语言模型(MLLMs)在视频理解任务中具有强大的能力,但仍面临处理复杂语义和长期时间依赖性的挑战。强化学习调优(RLT)可以改善其推理能力。本研究探索将RLT作为提高MLLMs视频特定推理能力的后训练策略。基于群体相对策略优化(GRPO)框架,提出了一种双奖励公式,通过离散和连续奖励信号监督语义和时间推理。本研究引入了一种基于重复推理的方差感知数据选择策略,以促进有效的基于偏好的优化,识别提供信息丰富学习信号的样本。在八个代表性视频理解任务上评估了该方法,包括VideoQA、时间视频定位、基于场景的视频问答等。该方法在训练数据较少的情况下仍表现出优越性能。重视奖励机制设计和数据选择对推进以推理为中心的视频理解具有重要意义。该项目初始代码发布于两个月前,现已更新,包括优化的奖励机制和新增数据集等内容:https://github.com/appletea233/Temporal-R1

Key Takeaways

  1. 多模态大型语言模型(MLLMs)在视频理解中展现强大能力,但仍面临处理复杂语义和长期时间依赖性的挑战。
  2. 强化学习调优(RLT)能有效提高MLLMs的推理能力。
  3. 研究采用GRPO框架和双奖励公式,通过离散和连续奖励信号促进语义和时间推理。
  4. 引入方差感知数据选择策略,以提高基于偏好的优化效果。
  5. 方法在多个代表性视频理解任务上表现优越,尤其当训练数据有限时。
  6. 奖励设计和数据选择在提高视频理解中起关键作用。

Cool Papers

点此查看论文截图

Analysis of LLM Bias (Chinese Propaganda & Anti-US Sentiment) in DeepSeek-R1 vs. ChatGPT o3-mini-high

Authors:PeiHsuan Huang, ZihWei Lin, Simon Imbot, WenCheng Fu, Ethan Tu

Large language models (LLMs) increasingly shape public understanding and civic decisions, yet their ideological neutrality is a growing concern. While existing research has explored various forms of LLM bias, a direct, cross-lingual comparison of models with differing geopolitical alignments-specifically a PRC-system model versus a non-PRC counterpart-has been lacking. This study addresses this gap by systematically evaluating DeepSeek-R1 (PRC-aligned) against ChatGPT o3-mini-high (non-PRC) for Chinese-state propaganda and anti-U.S. sentiment. We developed a novel corpus of 1,200 de-contextualized, reasoning-oriented questions derived from Chinese-language news, presented in Simplified Chinese, Traditional Chinese, and English. Answers from both models (7,200 total) were assessed using a hybrid evaluation pipeline combining rubric-guided GPT-4o scoring with human annotation. Our findings reveal significant model-level and language-dependent biases. DeepSeek-R1 consistently exhibited substantially higher proportions of both propaganda and anti-U.S. bias compared to ChatGPT o3-mini-high, which remained largely free of anti-U.S. sentiment and showed lower propaganda levels. For DeepSeek-R1, Simplified Chinese queries elicited the highest bias rates; these diminished in Traditional Chinese and were nearly absent in English. Notably, DeepSeek-R1 occasionally responded in Simplified Chinese to Traditional Chinese queries and amplified existing PRC-aligned terms in its Chinese answers, demonstrating an “invisible loudspeaker” effect. Furthermore, such biases were not confined to overtly political topics but also permeated cultural and lifestyle content, particularly in DeepSeek-R1.

大型语言模型(LLMs)日益影响公众理解和公民决策,但其意识形态中立性日益引发担忧。虽然现有研究已经探索了LLM偏见的各种形式,但一直缺乏对地缘政治立场不同的模型——特别是PRC体系模型与非PRC模型——进行直接的跨语言比较。本研究通过系统评估DeepSeek-R1(PRC对齐)与ChatGPT o3-mini-high(非PRC)在中国官方宣传和反美情绪方面的表现来填补这一空白。我们构建了一个由1200个脱离上下文、以推理为导向的问题组成的新型语料库,这些问题源自中文新闻,并以简体中文、繁体中文和英文三种形式呈现。两个模型的答案(共7200个)通过结合评分准则(rubric)引导的GPT-4o评分与人工标注的混合评估管道进行评估。我们的研究结果表明存在显著的模型层面和语言相关的偏见。与ChatGPT o3-mini-high相比,DeepSeek-R1持续表现出更高比例的宣传和反美偏见,而ChatGPT o3-mini-high基本不存在反美情绪,宣传水平也较低。对于DeepSeek-R1,简体中文查询引发的偏见率最高;在繁体中文中有所降低,在英文中几乎不存在。值得注意的是,DeepSeek-R1有时会用简体中文回应繁体中文查询,并在其中文回答中放大已有的PRC对齐措辞,呈现出一种“隐形扬声器”效应。此外,这种偏见不仅限于明显的政治话题,也渗透到文化和生活方式内容中,在DeepSeek-R1中尤为明显。

论文及项目相关链接

PDF

Summary

大型语言模型(LLMs)对公众理解和公民决策的影响日益增大,但其意识形态中立性引发关注。现有研究已探索了LLM的偏见形式,但缺乏对不同地缘政治对齐模型(特别是PRC系统模型与非PRC模型)的直接跨语言比较。本研究通过系统评估PRC对齐的DeepSeek-R1与非PRC对齐的ChatGPT o3-mini-high,针对中文国家宣传和反美情绪进行比较。研究发现,DeepSeek-R1在宣传和反美偏见上显著高于ChatGPT o3-mini-high,而ChatGPT o3-mini-high在反美情绪和宣传水平上较低。DeepSeek-R1在简体中文查询上引发的偏见率最高,在繁体中文中减少,在英语中几乎不存在。总体而言,这些偏见不仅限于政治话题,也渗透到文化和生活方式内容中。

Key Takeaways

  1. 大型语言模型(LLMs)在公众理解和决策中扮演重要角色,但其意识形态中立性受到关注。
  2. 现有研究已探索LLM偏见,但缺乏对不同地缘政治对齐模型的跨语言比较。
  3. DeepSeek-R1(PRC对齐)在宣传和反美偏见上显著高于ChatGPT o3-mini-high(非PRC对齐)。
  4. DeepSeek-R1在简体中文查询上的偏见率最高,在繁体中文中有所降低,在英文中几乎不存在。
  5. 偏见不仅限于政治话题,也渗透到文化和生活方式内容中。
  6. DeepSeek-R1在回应某些查询时表现出“隐形扬声器”效应,即放大现有PRC对齐术语。

Cool Papers

点此查看论文截图

SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Authors:Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan

Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.

多模态大型语言模型(MLLMs)在推理任务中展现出有前景的能力,但在需要明确的自我反思和自我纠正的复杂问题上仍然面临挑战,尤其是与单模态的纯文本模型相比。现有的反思方法过于简单,难以生成有意义且具有指导性的反馈,因为预训练模型的推理能力和知识上限在初始训练阶段就已基本固定。为了克服这些挑战,我们提出了基于群体相对策略优化的多模态自我反思增强推理(SRPO),这是一种两阶段的、反思感知的强化学习(RL)框架,专为增强多模态LLM的推理能力而设计。在第一阶段,我们在先进MLLM的指导下构建了一个高质量、以反思为重点的数据集,该数据集基于初始响应生成反思,以帮助策略模型同时学习推理和自我反思。在第二阶段,我们在GRPO框架内引入了一种新颖的奖励机制,鼓励简洁且具有认知意义的反思,同时避免冗余。在多个多模态推理基准(包括MathVista、MathVision、MathVerse和MMMU-Pro)上,使用Qwen-2.5-VL-7B和Qwen-2.5-VL-32B进行的大量实验表明,SRPO显著优于最先进的模型,在推理准确性和反思质量方面均取得了显著改进。
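
下面是一个高度简化的奖励示意(并非 SRPO 的官方奖励;`is_correct`、长度阈值与系数均为假设),仅用于体现"答案正确性为主、鼓励简洁且有信息量的反思、惩罚冗余"这一设计意图:

```python
# 示意:在可验证的正确性奖励之外,对反思部分按长度与冗余做温和惩罚(设计意图草图)
def reflection_aware_reward(answer: str, gold: str, reflection: str,
                            is_correct, max_len: int = 200) -> float:
    r = 1.0 if is_correct(answer, gold) else 0.0                          # 基础可验证奖励
    tokens = reflection.split()
    length_penalty = max(0.0, (len(tokens) - max_len) / max_len) * 0.2    # 过长反思被扣分(假设系数)
    redundancy = 1.0 - len(set(tokens)) / max(1, len(tokens))             # 重复词比例作为冗余的代理指标
    return r - length_penalty - 0.2 * redundancy
```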

论文及项目相关链接

PDF Under review

Summary

MLLM在推理任务中展现出巨大的潜力,但在需要明确自我反思和自我纠正的复杂问题上仍有困难。为此,研究者提出了结合多模态自我反思与群体相对策略优化(SRPO)的强化学习框架,旨在提高MLLM的推理能力。第一阶段,利用先进的MLLM构建高质量、以反思为重点的数据集;第二阶段,引入新的奖励机制,鼓励简洁、认知意义明确的反思。实验证明,SRPO在多个多模态推理基准测试中显著优于现有模型,提高了推理准确性和反思质量。

Key Takeaways

  1. MLLMs在复杂的需要自我反思和纠正的推理问题上存在挑战。
  2. 现有反思方法简单,难以生成有意义和指导性的反馈。
  3. 提出结合多模态自我反思与群体相对策略优化(SRPO)的强化学习框架,旨在提高MLLM的推理能力。
  4. 第一阶段:构建高质量的反思数据集,帮助策略模型学习推理和反思。
  5. 第二阶段:引入新的奖励机制,鼓励简洁且认知意义明确的反思。
  6. SRPO在多个多模态推理基准测试中表现优异,提高了推理准确性和反思质量。

Cool Papers

点此查看论文截图

Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model

Authors:Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, when viewed as "loss + regularization," the standard RLHF objective lacks theoretical justification and incentivizes degenerate, deterministic solutions, an issue that variants such as Direct Policy Optimization (DPO) also inherit. In this paper, we rethink alignment by framing it as distribution learning from pairwise preference feedback by explicitly modeling how information about the target language model bleeds through the preference data. This explicit modeling leads us to propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We theoretically show that all three approaches enjoy strong non-asymptotic $O(1/n)$ convergence to the target language model, naturally avoiding degeneracy and reward overfitting. Finally, we empirically demonstrate that our distribution learning framework, especially preference distillation, consistently outperforms or matches the performances of RLHF and DPO across various tasks and models.

通过强化学习从人类反馈(RLHF)进行对齐已经成为控制大型语言模型(LLM)输出质量的主导范式。然而,当被视为“损失+正则化”时,标准的RLHF目标缺乏理论支持,并激励退化、确定性解决方案,这一问题也是Direct Policy Optimization(DPO)等变体所继承的。在本文中,我们通过明确地建模目标语言模型如何透过偏好数据泄漏信息,将其重新构想为从成对偏好反馈进行的“分布学习”。这种明确的建模使我们提出三种原则性的学习目标:偏好最大似然估计、偏好蒸馏和反向KL最小化。我们从理论上证明,所有三种方法都享有对目标语言模型的强非渐近O(1/n)收敛性,自然地避免了退化和奖励过度拟合。最后,我们实证表明,我们的分布学习框架,尤其是偏好蒸馏,在各种任务和模型中始终优于或匹配RLHF和DPO的性能。
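
作为参考,下面给出成对偏好数据上最常见的 Bradley–Terry 形式的偏好最大似然目标(这是通用写法,不一定与论文中三个目标的具体形式一致,仅用于说明"偏好最大似然估计"指的是哪一类优化问题):

```latex
% 标准 Bradley–Terry 成对偏好似然(通用形式,非论文原式)
\max_{\theta}\; \sum_{(x,\,y^{+},\,y^{-})\in\mathcal{D}}
\log \sigma\!\big(s_{\theta}(x, y^{+}) - s_{\theta}(x, y^{-})\big),
\qquad \sigma(z) = \frac{1}{1+e^{-z}}
```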

论文及项目相关链接

PDF 26 pages, 7 tables

Summary

本文重新思考了通过强化学习从人类反馈(RLHF)进行对齐的方法,将其重新定位为分布学习(distribution learning)问题,并提出了三种基于人类偏好反馈的学习目标:偏好最大似然估计、偏好蒸馏和反向KL最小化。理论分析和实验结果表明,这三种方法具有良好的收敛性和避免退化问题的优势,并且在各种任务和模型中表现优异。

Key Takeaways

  1. RLHF作为控制大型语言模型输出的主要方法,存在理论上的不足和激励退化解决方案的问题。
  2. 分布学习框架重新思考了对齐问题,通过明确建模目标语言模型如何透过偏好数据呈现。
  3. 提出了三种基于人类偏好反馈的学习目标:偏好最大似然估计、偏好蒸馏和反向KL最小化。
  4. 这三种方法具有理论上的优势,能够自然避免退化和奖励过度拟合的问题。
  5. 这些方法在各种任务和模型中的表现优异,尤其是偏好蒸馏方法。
  6. 分布学习框架的理论和实证分析为大型语言模型的进一步改进提供了新的方向。

Cool Papers

点此查看论文截图

ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

Authors:Hosu Lee, Junho Kim, Hyunjun Kim, Yong Man Ro

Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet the ability to understand video content remains constrained by suboptimal frame selection strategies. Existing approaches often rely on static heuristics or external retrieval modules to feed frame information into video-LLMs, which may fail to provide the query-relevant information. In this work, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), a novel frame-level policy optimization framework that shifts the optimization target from textual responses to visual input selection. ReFoCUS learns a frame selection policy via reinforcement learning, using reward signals derived from a reference LMM to reflect the model’s intrinsic preferences for frames that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive, conditional selection architecture that ensures temporal coherence while reducing complexity. Our approach does not require explicit supervision at the frame-level and consistently improves reasoning performance across multiple video QA benchmarks, highlighting the benefits of aligning frame selection with model-internal utility.

最近大型多模态模型(LMMs)的进展为实现有效的视觉语言推理提供了可能,但理解视频内容的能力仍然受到次优帧选择策略的限制。现有方法通常依赖于静态启发式或外部检索模块将帧信息输入视频LLM中,这可能无法提供与查询相关的信息。在这项工作中,我们引入了ReFoCUS(用于上下文理解的强化引导帧优化),这是一个新颖的帧级策略优化框架,它将优化目标从文本响应转移到视觉输入选择。ReFoCUS通过强化学习来学习帧选择策略,使用来自参考LMM的奖励信号来反映模型对于最能支持基于时间点的响应的帧的内在偏好。为了有效地探索大型组合帧空间,我们采用了一种自回归条件选择架构,它在确保时间连贯性的同时降低了复杂性。我们的方法不需要在帧级别进行显式监督,并且在多个视频问答基准测试中持续提高了推理性能,这突出了将帧选择与模型内部效用对齐所带来的好处。

论文及项目相关链接

PDF

Summary

文章探讨了大型多模态模型(LMMs)在视频内容理解方面的新进展。针对现有方法中帧选择策略不足的问题,提出了ReFoCUS框架,通过强化学习优化帧选择策略,以支持时间定位响应的帧为奖励信号,提高视频内容的理解。该框架采用条件选择架构,确保时间连贯性,无需显式监督即可实现跨多个视频问答基准测试的性能提升。

Key Takeaways

  1. 大型多模态模型(LMMs)在视觉语言推理方面取得了有效进展,但在理解视频内容方面仍存在局限。
  2. 现有方法依赖静态启发式或外部检索模块将帧信息输入视频-LLMs,可能无法提供与查询相关的信息。
  3. ReFoCUS框架通过强化学习优化帧选择策略,以支持时间定位响应的帧为奖励信号。
  4. ReFoCUS框架采用条件选择架构,确保时间连贯性,并减少复杂性。
  5. ReFoCUS框架不需要显式监督即可实现帧级别的优化。
  6. ReFoCUS框架在多个视频问答基准测试中表现优异,证明了其有效性。

Cool Papers

点此查看论文截图

Uni-LoRA: One Vector is All You Need

Authors:Kaiyang Li, Shaobo Han, Qing Su, Wei Li, Zhipeng Cai, Shihao Ji

Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strategies employed by these LoRA variants can be formulated within a unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a high-dimensional vector space $R^D$, can be reconstructed through a projection from a subspace $R^d$, with $d \ll D$. We demonstrate that the fundamental difference among various LoRA methods lies in the choice of the projection matrix, $P \in R^{D \times d}$. Most existing LoRA variants rely on layer-wise or structure-specific projections that limit cross-layer parameter sharing, thereby compromising parameter efficiency. In light of this, we introduce an efficient and theoretically grounded projection matrix that is isometric, enabling global parameter sharing and reducing computation overhead. Furthermore, under the unified view of Uni-LoRA, this design requires only a single trainable vector to reconstruct LoRA parameters for the entire LLM - making Uni-LoRA both a unified framework and a “one-vector-only” solution. Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter efficiency while outperforming or matching prior approaches in predictive performance.

低秩适配(LoRA)通过将权重更新约束为低秩矩阵,已成为大型语言模型(LLM)事实上的参数高效微调(PEFT)方法。最近的工作,如Tied-LoRA、VeRA和VB-LoRA,通过引入额外的约束来缩减可训练参数空间,进一步提高了效率。在本文中,我们展示了这些LoRA变体所采用的参数空间缩减策略可以在一个统一框架Uni-LoRA内表述:将LoRA参数空间展平为高维向量空间$R^D$后,可以通过来自子空间$R^d$(其中$d \ll D$)的投影进行重建。我们证明,各种LoRA方法之间的根本区别在于投影矩阵$P \in R^{D \times d}$的选择。大多数现有的LoRA变体依赖于逐层或特定结构的投影,这限制了跨层参数共享,从而损害了参数效率。鉴于此,我们引入了一个高效且有理论依据的等距投影矩阵,能够实现全局参数共享并减少计算开销。此外,在Uni-LoRA的统一视角下,这种设计只需要一个可训练向量即可重建整个LLM的LoRA参数,使Uni-LoRA既是一个统一框架,也是一个“仅需一个向量”的解决方案。在GLUE、数学推理和指令微调基准上的大量实验表明,Uni-LoRA在参数效率上达到了最新水平,同时在预测性能上优于或持平于先前的方法。
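
下面用 numpy 给出"单一可训练向量 + 固定投影矩阵重建全部 LoRA 参数"这一思路的极简示意(非 Uni-LoRA 官方实现;这里用随机矩阵的 QR 分解得到列正交的投影来近似等距投影,维度均为玩具设置):

```python
import numpy as np

D, d = 10_000, 64                      # D: 展平后的 LoRA 参数总维度;d: 可训练子空间维度(玩具数值)
rng = np.random.default_rng(0)

# 构造列正交的投影矩阵 P ∈ R^{D×d},满足 P^T P = I(等距,因此保持向量范数)
P, _ = np.linalg.qr(rng.standard_normal((D, d)))

v = rng.standard_normal(d)             # 唯一需要训练的向量 v ∈ R^d
theta = P @ v                          # 重建整个模型的(展平)LoRA 参数 θ ∈ R^D

assert np.allclose(np.linalg.norm(theta), np.linalg.norm(v))   # 等距:范数被保持
# 训练时只对 v 求梯度;P 固定不训练,因此可在所有层之间全局共享
```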

论文及项目相关链接

PDF

摘要

LoRA方法通过将权重更新约束为低秩矩阵,成为大型语言模型参数高效微调的主流方法。本文介绍了一种统一框架Uni-LoRA,该框架可以整合各种LoRA变体所采用的参数空间缩减策略:通过投影矩阵P,将展平后的LoRA参数空间$R^D$由低维子空间$R^d$重建,从而缩减可训练参数。现有LoRA变体通常采用逐层或特定结构的投影方式,限制了跨层参数共享。因此,本文引入了一种高效且有理论依据的等距投影矩阵,实现全局参数共享,降低计算开销。在GLUE、数学推理和指令微调基准测试中,Uni-LoRA展现了出色的参数效率和预测性能,达到或超越了现有方法。

关键见解

  1. LoRA已成为大型语言模型的参数效率微调的主流方法,通过约束权重更新为低秩矩阵。
  2. Uni-LoRA框架整合了各类LoRA变体的参数空间缩减策略。
  3. LoRA参数空间可通过投影矩阵P从低维子空间$R^d$重建,从而大幅缩减可训练参数。
  4. 现有LoRA变体的投影方式限制了跨层参数共享。
  5. 引入的等距投影矩阵实现全局参数共享,提高计算效率。
  6. Uni-LoRA在多种基准测试中展现出卓越的性能和参数效率。

Cool Papers

点此查看论文截图

QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training

Authors:Wei Dai, Peilin Chen, Chanakya Ekbote, Paul Pu Liang

Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Trained on 2.61 million instruction tuning pairs spanning 9 clinical domains, we show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains as compared to other critic-free training methods like GRPO. Furthermore, with QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at https://github.com/DDVD233/QoQ_Med.

临床决策通常需要对异质数据进行推理,但现有的多模态语言模型(MLLMs)仍然主要侧重于视觉,且无法在临床专科之间通用化。为了弥补这一差距,我们引入了QoQ-Med-7B/32B,这是第一个开放的通用临床基础模型,能够联合对医学图像、时间序列信号和文本报告进行推理。QoQ-Med采用领域感知相对策略优化(DRPO)进行训练,这是一种新型强化学习目标,按领域稀有性和模态难度分层缩放标准化奖励,缓解由临床数据分布不均导致性能不平衡的问题。在涵盖9个临床领域的261万个指令调整对上进行训练,我们证明,与GRPO等其他无批评训练方法相比,DRPO训练在提高所有视觉领域的宏观F1分数方面平均提高了43%的诊断性能。此外,QoQ-Med经过密集分割数据训练后,能够突出与诊断相关的关键区域,其IoU值比开放模型高10倍,同时达到OpenAI o4-mini的性能水平。为了促进可重复性和下游研究,我们在https://github.com/DDVD233/QoQ_Med上发布了(i)完整模型权重、(ii)模块化训练管道和(iii)所有中间推理轨迹。
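
下面是"按领域稀有度与模态难度分层缩放组内归一化奖励"这一思想的示意草图(并非 DRPO 的官方公式;缩放方式与系数均为假设):

```python
import numpy as np

def domain_aware_advantages(rewards, domain_freq: float, modality_difficulty: float):
    """rewards: 同一问题的一组 rollout 奖励(GRPO 风格的组内归一化);
    domain_freq: 该领域在训练数据中的占比(越稀有 -> 放大);
    modality_difficulty: 该模态的难度系数,例如按历史错误率估计(越难 -> 放大)。"""
    r = np.asarray(rewards, dtype=np.float64)
    adv = (r - r.mean()) / (r.std() + 1e-6)              # 组内相对优势(GRPO 的常见做法)
    scale = (1.0 / max(domain_freq, 1e-3)) ** 0.5        # 稀有领域获得更大权重(假设的缩放形式)
    scale *= 1.0 + modality_difficulty                   # 难模态进一步放大(假设)
    return adv * scale
```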

论文及项目相关链接

PDF

Summary

临床决策通常需要对异质数据进行推理,而现有的多模态语言模型主要集中在视觉领域,且难以跨临床专业领域进行泛化。为了弥补这一缺陷,我们推出了QoQ-Med-7B/32B,这是首个开放的通用临床基础模型,能够跨医学图像、时间序列信号和文本报告进行联合推理。QoQ-Med采用领域感知的相对策略优化(DRPO)进行训练,这是一种新型强化学习目标,能够按领域稀有性和模态难度分层缩放标准化奖励,从而缓解由临床数据分布不均导致的性能不平衡问题。在涵盖9个临床领域的261万个指令调整对上进行的训练表明,与GRPO等其他无批评(critic-free)训练方法相比,DRPO训练使所有视觉领域的平均宏观F1分数提高了43%。此外,经过密集分割数据训练的QoQ-Med能够突出显示与诊断相关的显著区域,其IoU是开放模型的十倍以上,同时达到了OpenAI o4-mini的性能水平。为了促进复现和下游研究,我们在GitHub上发布了(i)完整模型权重、(ii)模块化训练管道和(iii)所有中间推理痕迹:https://github.com/DDVD233/QoQ_Med

Key Takeaways

  1. QoQ-Med是首个开放的通用临床基础模型,能跨医学图像、时间序列信号和文本报告进行联合推理。
  2. DRPO训练法按领域稀有性和模态难度分层缩放标准化奖励,缓解临床数据分布不均导致的性能问题。
  3. DRPO训练提高了诊断性能,与其他方法相比,宏观F1分数平均提高43%。
  4. QoQ-Med经过密集分割数据训练,能突出显示与诊断相关的显著区域,IoU高于其他开放模型。
  5. QoQ-Med的性能达到OpenAI o4-mini水平。
  6. 研究团队提供了完整的模型权重、模块化训练管道和中间推理痕迹以推动研究。

Cool Papers

点此查看论文截图

ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL

Authors:Yu Zhang, Yunqi Li, Yifan Yang, Rui Wang, Yuqing Yang, Dai Qi, Jianmin Bao, Dongdong Chen, Chong Luo, Lili Qiu

Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based “thinking” skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization. To enable the model to reason through text before generating images, We automatically generate and release a corpus of model crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.

尽管链式思维推理和强化学习(RL)在NLP领域取得了突破,但它们与生成式视觉模型的结合仍然探索不足。我们引入了ReasonGen-R1,这是一个两阶段框架:首先在新生成的、由书面推理依据(rationale)组成的推理数据集上进行监督微调,使自回归图像生成器具备明确的基于文本的“思考”能力;然后使用群体相对策略优化(GRPO)对其输出进行改进。为了让模型在生成图像之前先通过文本进行推理,我们自动生成并发布了一个由模型撰写的推理依据与视觉提示配对的语料库,从而实现对物体布局、风格和场景构图的受控规划。我们的GRPO算法使用来自预训练视觉语言模型的奖励信号来评估整体视觉质量,并在每次更新中优化策略。在GenEval、DPG和T2I基准上的评估表明,ReasonGen-R1持续优于强基线和先前最先进的模型。更多信息请访问aka.ms/reasongen。

论文及项目相关链接

PDF

Summary

本文介绍了一个名为ReasonGen-R1的两阶段框架,它将链式思维推理和强化学习结合到生成式视觉模型中。该框架首先通过在新生成的文字推理依据数据集上进行监督微调,赋予自回归图像生成器基于文本的“思考”技能;然后,使用群体相对策略优化(GRPO)来优化其输出。该模型能够在生成图像之前先通过文字进行推理,并自动生成与视觉提示配对的数据集,以实现对象布局、风格和场景构图的受控规划。评估结果表明,ReasonGen-R1在GenEval、DPG和T2I基准测试上均优于强大的基线模型和先前最先进的模型。

Key Takeaways

  1. ReasonGen-R1框架成功结合了链式思维推理和强化学习,应用于生成式视觉模型。
  2. 该框架通过监督微调赋予图像生成器基于文本的“思考”技能,使用新生成的文字推理数据集。
  3. Group Relative Policy Optimization用于优化模型的输出。
  4. 模型能够通过文字进行推理,再生成图像,增强了受控的规划和布局能力。
  5. 自动生成的数据集包含模型制作的理据和视觉提示配对,有助于场景组成的规划。
  6. 评估结果显示,ReasonGen-R1在多个基准测试上表现出卓越性能。

Cool Papers

点此查看论文截图

MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning

Authors:Yiqing Liang, Jielin Qiu, Wenhao Ding, Zuxin Liu, James Tompkin, Mengdi Xu, Mengzhou Xia, Zhengzhong Tu, Laixi Shi, Jiacheng Zhu

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for post-training large language models (LLMs), achieving state-of-the-art performance on tasks with structured, verifiable answers. Applying RLVR to Multimodal LLMs (MLLMs) presents significant opportunities but is complicated by the broader, heterogeneous nature of vision-language tasks that demand nuanced visual, logical, and spatial capabilities. As such, training MLLMs using RLVR on multiple datasets could be beneficial but creates challenges with conflicting objectives from interaction among diverse datasets, highlighting the need for optimal dataset mixture strategies to improve generalization and reasoning. We introduce a systematic post-training framework for Multimodal LLM RLVR, featuring a rigorous data mixture problem formulation and benchmark implementation. Specifically, (1) We developed a multimodal RLVR framework for multi-dataset post-training by curating a dataset that contains different verifiable vision-language problems and enabling multi-domain online RL learning with different verifiable rewards; (2) We proposed a data mixture strategy that learns to predict the RL fine-tuning outcome from the data mixture distribution, and consequently optimizes the best mixture. Comprehensive experiments showcase that multi-domain RLVR training, when combined with mixture prediction strategies, can significantly boost MLLM general reasoning capacities. Our best mixture improves the post-trained model’s accuracy on out-of-distribution benchmarks by an average of 5.24% compared to the same model post-trained with uniform data mixture, and by a total of 20.74% compared to the pre-finetuning baseline.

可验证奖励强化学习(RLVR)最近已成为大型语言模型(LLM)后训练的强大范式,在具有结构化、可验证答案的任务上达到了最先进的性能。将RLVR应用于多模态LLM(MLLM)带来了巨大的机遇,但由于视觉语言任务范围更广、高度异质,需要细致的视觉、逻辑和空间能力,问题也随之复杂化。因此,使用RLVR在多个数据集上训练MLLM可能有益,但不同数据集之间的交互会带来目标冲突的挑战,这突显出需要最优的数据集混合策略来提升泛化与推理能力。我们为多模态LLM的RLVR引入了一个系统的后训练框架,其中包括严格的数据混合问题表述和基准实现。具体来说,(1) 我们为多数据集后训练开发了一个多模态RLVR框架,整理了包含不同可验证视觉语言问题的数据集,并支持使用不同可验证奖励的多领域在线RL学习;(2) 我们提出了一种数据混合策略,学习根据数据混合分布预测RL微调的结果,进而优化出最佳混合。综合实验表明,多领域RLVR训练与混合预测策略相结合,可以显著提升MLLM的通用推理能力。与采用均匀数据混合后训练的同一模型相比,我们的最佳混合使模型在分布外基准上的准确率平均提高了5.24%;与微调前的基线相比,总计提高了20.74%。
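
下面是"从数据混合比例预测微调结果并据此挑选混合"这一思路的极简示意(非论文官方方法;此处用线性最小二乘作为代理模型,历史得分 y 用随机数占位,实际应来自真实训练记录):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5                                          # 数据集个数(假设)
X = rng.dirichlet(np.ones(k), size=40)         # 已试验过的 40 组混合比例(每行和为 1)
y = rng.normal(size=40)                        # 对应的下游得分(此处为占位数据)

w, *_ = np.linalg.lstsq(np.c_[X, np.ones(40)], y, rcond=None)   # 线性代理模型:得分 ≈ X·w + b

cand = rng.dirichlet(np.ones(k), size=10_000)                   # 随机采样大量候选混合比例
pred = cand @ w[:k] + w[k]
best_mixture = cand[np.argmax(pred)]                            # 预测得分最高的混合比例
print(best_mixture)
```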

论文及项目相关链接

PDF Project Webpage: https://modomodo-rl.github.io/

Summary

可验证奖励强化学习(RLVR)在大型语言模型(LLM)的后训练中表现出强大能力,特别是在具有结构化、可验证答案的任务上表现突出。将RLVR应用于多模态LLM(MLLM)面临挑战:视觉语言任务覆盖面广且高度异质,需要细致的视觉、逻辑和空间能力。在多个数据集上用RLVR训练MLLM可能有益,但不同数据集之间的交互会产生相互冲突的目标,凸显出需要优化数据集混合策略以提高泛化和推理能力。我们为多模态LLM的RLVR引入了一个系统的后训练框架,包括严格的数据混合问题表述和基准实现。我们的最佳混合策略使模型在分布外基准测试上的准确率较均匀数据混合平均提高5.24%,较微调前基线总计提高20.74%。

Key Takeaways

  1. RLVR在大型语言模型的后训练中展现出强大能力,特别是在具有结构化、可验证答案的任务上。
  2. 在多模态LLM中应用RLVR面临挑战:视觉语言任务范围广、高度异质,需要复杂的视觉、逻辑和空间技能。
  3. 多个数据集上训练MLLM使用RLVR可能有益,但存在不同数据集交互产生的冲突目标问题。
  4. 需要优化数据集混合策略以提高泛化和推理能力。
  5. 提出了一种系统的后训练框架和多模态LLM RLVR的基准实现。
  6. 引入了数据混合策略来预测RL精细调整结果,从而优化最佳混合策略。

Cool Papers

点此查看论文截图

Reinforcing Video Reasoning with Focused Thinking

Authors:Jisheng Dang, Jingze Wu, Teng Wang, Xuanhui Lin, Nannan Zhu, Hongbo Chen, Wei-Shi Zheng, Meng Wang, Tat-Seng Chua

Recent advancements in reinforcement learning, particularly through Group Relative Policy Optimization (GRPO), have significantly improved multimodal large language models for complex reasoning tasks. However, two critical limitations persist: 1) they often produce unfocused, verbose reasoning chains that obscure salient spatiotemporal cues and 2) binary rewarding fails to account for partially correct answers, resulting in high reward variance and inefficient learning. In this paper, we propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density (estimated by intra-group variance), suppressing redundant tokens like generic reasoning prefixes. Furthermore, we reformulate RL training by shifting from single-choice to multi-choice QA tasks, where soft rewards enable finer-grained gradient estimation by distinguishing partial correctness. Additionally, we propose question-answer inversion, a data augmentation strategy to generate diverse multi-choice samples from existing benchmarks. Experiments demonstrate state-of-the-art performance on several video reasoning and general understanding benchmarks. Notably, TW-GRPO achieves 50.4% accuracy on CLEVRER (18.8% improvement over Video-R1) and 65.8% on MMVU. Our codes are available at https://github.com/longmalongma/TW-GRPO.

最近强化学习方面的进展,特别是通过群体相对策略优化(GRPO),已经显著改善了多模态大型语言模型在复杂推理任务上的表现。然而,还有两个关键局限性:1)它们经常产生不聚焦、冗长的推理链,掩盖了重要的时空线索;2)二元奖励无法考虑部分正确的答案,导致奖励波动大和学习效率低下。在本文中,我们提出了TW-GRPO,这是一个新的框架,通过聚焦思考和密集的奖励粒度来增强视觉推理。具体来说,我们采用了一种令牌加权机制,优先处理具有高信息密度的令牌(由组内方差估计),同时抑制冗余令牌,如通用推理前缀。此外,我们通过对强化学习训练进行重构,从单选题转向多选题问答任务,其中软奖励能够通过区分部分正确性来实现更精细的梯度估计。我们还提出了问题答案反转,这是一种数据增强策略,可以从现有基准测试中生成多样化的多选样本。实验证明,我们在多个视频推理和通用理解基准测试上达到了最新技术水平。值得注意的是,TW-GRPO在CLEVRER上达到了50.4%的准确率(相对于Video-R1提高了18.8%),在MMVU上达到了65.8%。我们的代码可以在https://github.com/longmalongma/TW-GRPO上找到。
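
下面用 numpy 给出两个核心思想的示意实现(非 TW-GRPO 官方实现,仅作玩具演示):一是按组内方差给各位置加权,抑制各条回答中千篇一律的"通用前缀";二是多选题的软奖励,按部分正确程度给分而非简单的 0/1。其中"逐位置统计量已对齐"是本示例的简化假设。

```python
import numpy as np

def token_weights_by_group_variance(token_stats_matrix):
    """token_stats_matrix: 形状 (G, T) —— 同一问题 G 条回答在 T 个对齐位置上的某种逐 token 统计量
    (对齐是本示例的简化假设)。组内方差大的位置信息密度高,权重更大;方差为零的位置权重趋近 0。"""
    x = np.asarray(token_stats_matrix, dtype=np.float64)
    var = x.var(axis=0)                          # 逐位置的组内方差
    return var / (var.sum() + 1e-8)              # 归一化为权重

def soft_multichoice_reward(pred: set, gold: set) -> float:
    """多选题软奖励:用预测与标准答案的 Jaccard 相似度体现部分正确(假设的奖励形式)。"""
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)

print(soft_multichoice_reward({"A", "C"}, {"A", "B", "C"}))   # ≈0.67:部分正确也能获得奖励
```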

论文及项目相关链接

PDF

Summary

近期强化学习(特别是群体相对策略优化,GRPO)的进展显著提升了多模态大型语言模型在复杂推理任务上的能力。然而,仍存在两个关键局限:一是产生的推理链常常不聚焦且冗长,掩盖了重要的时空线索;二是二元奖励无法应对部分正确答案,导致奖励方差较大且学习效率低下。针对这些问题,本文提出TW-GRPO框架,通过令牌加权机制突出信息密度高的令牌(以组内方差估计),抑制通用推理前缀等冗余令牌,增强视觉推理的聚焦思维。同时,将强化学习训练从单选题转向多选题问答任务,以软奖励区分部分正确性,实现更精细的梯度估计。此外,采用问答反转策略,基于现有基准生成多样化的多选题样本。实验显示,该方法在多个视频推理和通用理解基准测试中表现领先,如在CLEVRER上达到50.4%准确率(较Video-R1提高18.8%),在MMVU上达到65.8%准确率。代码已公开:https://github.com/longmalongma/TW-GRPO。

Key Takeaways

  1. 强化学习通过GRPO方法在多模态大型语言模型中用于复杂推理任务取得进展。
  2. 存在的两个关键问题是推理链不聚焦、冗长,以及二元奖励无法应对部分正确答案。
  3. TW-GRPO框架通过标记权重机制突出信息密度高的标记,抑制冗余标记,增强视觉推理的聚焦思维。
  4. 采用软奖励进行多选题问答任务,以区分部分正确性,实现更精细的梯度估计。
  5. 问答倒置策略用于生成多样化多选题样本。
  6. 实验结果显示TW-GRPO在视频推理和通用理解基准测试中表现领先。

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !