
R1_Reasoning


⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验

2025-03-22 更新

Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data

Authors:Zijian Li, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Rui Wang

Visual reasoning is crucial for multimodal large language models (MLLMs) to address complex chart queries, yet high-quality rationale data remains scarce. Existing methods leveraged (M)LLMs for data generation, but direct prompting often yields limited precision and diversity. In this paper, we propose \textit{Chain of Functions (CoF)}, a novel programmatic reasoning data generation pipeline that utilizes freely-explored reasoning paths as supervision to ensure data precision and diversity. Specifically, it starts with human-free exploration among the atomic functions (e.g., maximum data and arithmetic operations) to generate diverse function chains, which are then translated into linguistic rationales and questions with only a moderate open-sourced LLM. \textit{CoF} provides multiple benefits: 1) Precision: function-governed generation reduces hallucinations compared to freeform generation; 2) Diversity: enumerating function chains enables varied question taxonomies; 3) Explainability: function chains serve as built-in rationales, allowing fine-grained evaluation beyond overall accuracy; 4) Practicality: eliminating reliance on extremely large models. Employing \textit{CoF}, we construct the \textit{ChartCoF} dataset, with 1.4k complex reasoning Q&A for fine-grained analysis and 50k Q&A for reasoning enhancement. The fine-grained evaluation on \textit{ChartCoF} reveals varying performance across question taxonomies for each MLLM, and the experiments also show that finetuning with \textit{ChartCoF} achieves state-of-the-art performance among same-scale MLLMs on widely used benchmarks. Furthermore, the novel paradigm of function-governed rationale generation in \textit{CoF} could inspire broader applications beyond charts.

视觉推理对于多模态大语言模型(MLLMs)处理复杂图表查询至关重要,但高质量的推理依据(rationale)数据仍然稀缺。现有方法利用(M)LLMs进行数据生成,但直接提示往往精度和多样性有限。在本文中,我们提出了\textit{函数链(Chain of Functions, CoF)},这是一种新的程序化推理数据生成管道,它以自由探索得到的推理路径作为监督,以确保数据的精度和多样性。具体来说,它首先在原子函数(如取最大值、算术运算)之间进行无需人工参与的探索,生成多样化的函数链,然后仅借助一个中等规模的开源LLM将其转写为语言化的推理依据和问题。CoF具有多重优点:1)精度:与自由形式生成相比,函数约束下的生成减少了幻觉;2)多样性:枚举函数链可以得到多样的问题类型;3)可解释性:函数链本身即是内置的推理依据,支持超越整体准确率的细粒度评估;4)实用性:无需依赖超大规模模型。基于CoF,我们构建了ChartCoF数据集,其中包含1.4k条用于细粒度分析的复杂推理问答和5万条用于推理增强的问答。在ChartCoF上的细粒度评估显示,各个MLLM在不同问题类型上的表现各不相同;实验还表明,使用ChartCoF进行微调后,模型在广泛使用的基准测试上取得了同规模MLLMs中的最优性能。此外,CoF中“函数约束的推理依据生成”这一新范式有望启发图表之外更广泛的应用。
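
为便于直观理解“先枚举函数链、再转写为问题与推理依据”的思路,下面给出一个极简示意(其中的函数、示例数据与问题模板均为假设,并非论文官方实现):

```python
# 极简示意:把原子函数组合成函数链,并据此模板化地生成问题与推理依据。
# 函数名、示例数据与模板均为假设,仅用于说明 CoF 的思路,并非论文实现。
def f_max(series):                       # 原子函数:取最大值
    return max(series, key=lambda kv: kv[1])

def f_diff(a, b):                        # 原子函数:算术运算(求差)
    return a - b

chart = {"2021": 12, "2022": 18, "2023": 15}      # 模拟一张柱状图的数据表
label, peak = f_max(list(chart.items()))          # 函数链第 1 步:max
gap = f_diff(peak, chart["2023"])                 # 函数链第 2 步:diff

question = "图中最大值对应的年份比 2023 年高出多少?"
rationale = (f"步骤1:在所有年份中取最大值,得到 {label} 年的 {peak};"
             f"步骤2:计算其与 2023 年的 {chart['2023']} 之差,得到 {gap}。")
print(question)
print(rationale, f"答案:{gap}")
```

实际管道中,这类函数链由程序自动枚举,再交给开源LLM转写成自然语言的问题与推理依据,从而保证答案可由函数链本身验证。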

论文及项目相关链接

PDF Under review

Summary

本文提出一种名为Chain of Functions(CoF)的新型程序化推理数据生成管道,用于缓解多模态大语言模型(MLLMs)处理复杂图表查询时高质量推理数据稀缺的难题。CoF通过在原子函数之间进行无需人工参与的自由探索来生成多样的函数链,并将其转写为语言化的推理依据与问题。CoF具有精确性、多样性、可解释性和实用性等优点,可用于构建图表推理数据集ChartCoF。

Key Takeaways

  1. 视觉推理在多模态大型语言模型(MLLMs)中处理复杂图表查询时至关重要。
  2. 现有方法利用MLLMs进行数据生成,但直接提示的精度和多样性有限。
  3. 提出的Chain of Functions(CoF)是一种新型的程序化推理数据生成管道,利用自由探索的推理路径作为监督,确保数据的精度和多样性。
  4. CoF在原子函数(如取最大值、算术运算)之间自由探索,生成多样的函数链,再转写为语言化的推理依据与问题。
  5. CoF提供了精确度、多样性、可解释性和实用性等好处。
  6. 使用CoF构建了ChartCoF数据集,包含1.4k复杂推理问答用于精细分析,以及50k问答用于推理提升。

Cool Papers

点此查看论文截图

Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning

Authors:Zhaowei Liu, Xin Guo, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Zixuan Wang, Jiajie Xu, Weige Cai, Ziwei Yang, Xueqian Zhao, Chao Li, Sheng Xu, Dezhi Chen, Yun Chen, Zuo Bai, Liwen Zhang

Reasoning large language models are rapidly evolving across various domains. However, their capabilities in handling complex financial tasks still require in-depth exploration. In this paper, we introduce Fin-R1, a reasoning large language model specifically designed for the financial sector. Fin-R1 is built using a two-stage architecture, leveraging a financial reasoning dataset distilled and processed based on DeepSeek-R1. Through supervised fine-tuning (SFT) and reinforcement learning (RL) training, it demonstrates performance close to DeepSeek-R1 with a parameter size of 7 billion across a range of financial reasoning tasks. It achieves the state-of-the-art (SOTA) in the FinQA and ConvFinQA tasks between those LLMs in our evaluation, surpassing larger models in other tasks as well. Fin-R1 showcases strong reasoning and decision-making capabilities, providing solutions to various problems encountered in the financial domain. Our code is available at https://github.com/SUFE-AIFLM-Lab/Fin-R1.

推理大语言模型正在各个领域迅速发展。然而,它们处理复杂金融任务的能力仍有待深入探索。在本文中,我们介绍了专为金融领域设计的推理大语言模型Fin-R1。Fin-R1采用两阶段架构构建,利用基于DeepSeek-R1蒸馏和处理得到的金融推理数据集。经过监督微调(SFT)和强化学习(RL)训练,它以70亿的参数规模在一系列金融推理任务上取得了接近DeepSeek-R1的表现。在我们的评估中,它在FinQA和ConvFinQA任务上达到了参评LLM中的最先进(SOTA)水平,并在其他任务上也超越了更大的模型。Fin-R1展示了强大的推理和决策能力,为金融领域遇到的各类问题提供了解决方案。我们的代码可在 https://github.com/SUFE-AIFLM-Lab/Fin-R1 获取。

论文及项目相关链接

PDF

Summary

推理大语言模型在多个领域迅速演进,但在处理复杂的金融任务方面仍需深入研究。本文介绍了专为金融领域设计的推理大语言模型Fin-R1。Fin-R1采用两阶段架构,基于DeepSeek-R1蒸馏和处理得到金融推理数据集。通过监督微调(SFT)和强化学习(RL)训练,它仅以70亿的参数规模,就在金融推理任务上表现出接近DeepSeek-R1的性能。在FinQA和ConvFinQA任务中,它在我们评估的LLMs中达到了最先进水平,在其他任务中也超越了更大的模型。Fin-R1展示了强大的推理和决策能力,为解决金融领域遇到的问题提供了解决方案。

Key Takeaways

  1. 大型语言模型在金融领域的推理能力尚待深入研究。
  2. Fin-R1是一个专为金融领域设计的大型语言模型。
  3. Fin-R1采用两阶段架构,利用金融推理数据集。
  4. 通过监督微调(SFT)和强化学习(RL)训练,Fin-R1表现出强大的性能。
  5. Fin-R1在多个金融推理任务上达到了最新水平。
  6. Fin-R1展现了强大的推理和决策能力。

Cool Papers

点此查看论文截图

Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t

Authors:Quy-Anh Dang, Chris Ngo

Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.

增强大型语言模型(LLM)的推理能力通常依赖大量的计算资源和庞大的数据集,这限制了资源受限环境下的可及性。我们的研究探讨了强化学习(RL)在提升小型LLM推理能力方面的潜力,聚焦于一个拥有15亿参数的模型DeepSeek-R1-Distill-Qwen-1.5B,并施加严格约束:在4块NVIDIA A40 GPU(每块48GB显存)上、24小时内完成训练。通过采用群体相对策略优化(GRPO)算法并整理一个紧凑、高质量的数学推理数据集,我们进行了三项实验来探索模型的行为和性能。结果表明,推理能力迅速提升:例如,AMC23的准确率从63%提高到80%,AIME24达到46.7%,超过了o1-preview;而这仅使用了7000个样本和42美元的训练成本,相比之下基线模型的训练成本高达数千美元。然而,随着训练时间延长,也出现了优化不稳定和长度限制等挑战。这些发现突显了基于RL的微调对小型LLM的有效性,为大规模方法提供了更具成本效益的替代方案。我们已将代码和数据集作为开源资源发布,提供了有关权衡的见解,并为资源受限环境中可扩展、具备推理能力的LLM奠定了基础。所有资源均可在https://github.com/knoveleng/open-rs获取。
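
文中采用的GRPO以“对同一道题采样一组回答,再按组内相对表现计算优势”为核心思想。下面是组内相对优势计算的一个极简示意(按常见的GRPO公式实现,奖励设为可验证的0/1对错信号,与论文的具体实现细节无关):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO 的组内相对优势:对同一提示采样的一组回答,
    以 (r - 组内均值) / 组内标准差 作为每个回答的优势。"""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# 假设对同一道数学题采样 4 个回答,奖励为可验证的 0/1(最终答案是否正确)
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))   # 正确回答得到正优势,错误回答得到负优势
```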

论文及项目相关链接

PDF

Summary

本文探讨了使用强化学习(RL)提高小型语言模型(LLM)推理能力的研究。在有限的计算资源和时间内,通过对小型语言模型进行RL优化训练,实验结果显示,推理能力得到了显著提高。这一方法的成本较低,为资源受限环境中可扩展的推理型LLM提供了基础。

Key Takeaways

  1. 强化学习可用于提高小型语言模型的推理能力。
  2. 在资源受限的环境下,使用强化学习对小型语言模型进行优化训练是可行的。
  3. 实验结果显示,强化学习显著提高了语言模型的推理性能。
  4. 强化学习方法的成本较低,相比基线模型可大幅降低训练成本。
  5. 研究中面临了优化不稳定和长度约束等挑战。
  6. 研究结果以开放源代码的形式发布,为未来的研究提供了基础。

Cool Papers

点此查看论文截图

CLS-RL: Image Classification with Rule-Based Reinforcement Learning

Authors:Ming Li, Shitian Zhao, Jike Zhong, Yuxiang Lai, Kaipeng Zhang

Classification is a core task in machine learning. Recent research has shown that although Multimodal Large Language Models (MLLMs) are initially poor at image classification, fine-tuning them with an adequate amount of data can significantly enhance their performance, making them comparable to SOTA classification models. However, acquiring large-scale labeled data is expensive. In this paper, we explore few-shot MLLM classification fine-tuning. We found that SFT can cause severe overfitting issues and may even degrade performance over the zero-shot approach. To address this challenge, inspired by the recent successes in rule-based reinforcement learning, we propose CLS-RL, which uses verifiable signals as reward to fine-tune MLLMs. We discovered that CLS-RL outperforms SFT in most datasets and has a much higher average accuracy on both base-to-new and few-shot learning setting. Moreover, we observed a free-lunch phenomenon for CLS-RL; when models are fine-tuned on a particular dataset, their performance on other distinct datasets may also improve over zero-shot models, even if those datasets differ in distribution and class names. This suggests that RL-based methods effectively teach models the fundamentals of classification. Lastly, inspired by recent works in inference time thinking, we re-examine the `thinking process’ during fine-tuning, a critical aspect of RL-based methods, in the context of visual classification. We question whether such tasks require extensive thinking process during fine-tuning, proposing that this may actually detract from performance. Based on this premise, we introduce the No-Thinking-CLS-RL method, which minimizes thinking processes during training by setting an equality accuracy reward. Our findings indicate that, with much less fine-tuning time, No-Thinking-CLS-RL method achieves superior in-domain performance and generalization capabilities than CLS-RL.

分类是机器学习的核心任务之一。最新研究表明,尽管多模态大语言模型(MLLMs)最初在图像分类上表现不佳,但用足量数据微调后性能可显著提升,足以媲美当前最佳(SOTA)分类模型。然而,获取大规模标注数据成本高昂。本文探讨MLLM的少样本分类微调。我们发现,SFT可能导致严重的过拟合,甚至使性能低于零样本方法。为应对这一挑战,受基于规则的强化学习最新进展的启发,我们提出CLS-RL方法,用可验证的信号作为奖励来微调MLLM。我们发现CLS-RL在大多数数据集上优于SFT,并且在基类到新类(base-to-new)和少样本学习两种设置下的平均准确率都高得多。此外,我们观察到CLS-RL存在“免费午餐”现象:当模型在某个数据集上微调后,它在分布和类别名称都不同的其他数据集上的表现也可能超过零样本模型。这说明基于RL的方法有效地教会了模型分类的基本原理。最后,受近期推理时思考(inference time thinking)研究的启发,我们在视觉分类场景下重新审视了微调过程中的“思考过程”这一RL方法的关键环节。我们质疑此类任务在微调时是否需要大量思考过程,并提出这反而可能损害性能。基于这一前提,我们引入了No-Thinking-CLS-RL方法,通过设置相等的准确率奖励(equality accuracy reward)来最大限度减少训练中的思考过程。结果表明,在微调时间大幅缩短的情况下,No-Thinking-CLS-RL在领域内性能和泛化能力上均优于CLS-RL。
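
CLS-RL的关键是用“可验证信号”作为奖励,而No-Thinking-CLS-RL则通过相等的准确率奖励来弱化思考过程。下面是一个简化的奖励函数示意(输出格式、解析方式与权重均为假设,仅用于说明思路,并非论文实现):

```python
import re

def cls_rl_reward(response: str, gold_label: str) -> float:
    """简化示意:格式奖励(是否按 <think>...</think><answer>...</answer> 输出)
    加上可验证的准确率奖励(答案是否等于真实类别)。"""
    fmt_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S))
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    correct = bool(m) and m.group(1).strip().lower() == gold_label.lower()
    return 0.5 * fmt_ok + 1.0 * correct

def no_thinking_reward(response: str, gold_label: str) -> float:
    """No-Thinking 变体的直觉:只看输出是否与标签完全相等,不鼓励冗长的思考过程。"""
    return float(response.strip().lower() == gold_label.lower())

print(cls_rl_reward("<think>毛发斑纹像豹</think><answer>leopard</answer>", "leopard"))  # 1.5
print(no_thinking_reward("leopard", "Leopard"))                                         # 1.0
```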

论文及项目相关链接

PDF Preprint, work in progress

Summary

本文探索了基于规则强化学习(RL)的多模态大语言模型(MLLM)分类微调方法。研究发现,使用可验证信号作为奖励的CLS-RL在大多数数据集上优于监督微调(SFT),并在基类到新类(base-to-new)和少样本学习设置中具有更高的平均准确率。此外,观察到一种免费午餐现象:当模型在某个数据集上进行微调时,它们在其它不同数据集上的性能也可能得到提升。本文还引入了无思考过程的CLS-RL方法,通过减少训练时的思考过程来提高性能。总的来说,本文提出的基于RL的新方法优化了MLLM的分类性能并增强了其泛化能力。

Key Takeaways

  1. 多模态大型语言模型(MLLM)经过适当的微调可以显著提高图像分类性能。
  2. 监督微调(SFT)可能导致过拟合,甚至使性能低于零样本方法,相比之下CLS-RL方法更有效。
  3. CLS-RL在多数数据集上表现优异,并且在基类到新类(base-to-new)和少样本学习设置中具有更高的平均准确率。
  4. RL方法能有效提升模型的泛化能力,表现在不同数据集上的性能提升。
  5. RL中的“思考过程”在微调中可能降低性能,因此提出了无思考过程的CLS-RL方法。
  6. 无思考CLS-RL方法在减少训练时间的同时实现了优异的领域内部性能和泛化能力。
  7. 本文的研究为进一步优化MLLM在图像分类任务中的性能提供了新的思路和方法。

Cool Papers

点此查看论文截图

GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions

Authors:Xiaomeng Chu, Jiajun Deng, Guoliang You, Wei Liu, Xingchen Li, Jianmin Ji, Yanyong Zhang

Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods utilize the contextual understanding capabilities of the large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users’ intentions in the instructions. However, the LLM’s knowledge about objects’ physical properties remains underexplored despite its tight relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. Particularly, we design a set of QA templates to enable hierarchical reasoning that includes three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture, which encodes multi-view observations of 3D scenes into 3D-aware visual tokens, and then jointly embeds these visual tokens with CoT-derived textual tokens within LLMs to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, with additional validation in real-world robotic applications confirming its practicality. Codes and data will be released.

灵活指令引导的6自由度抓取对现实世界的机器人系统来说是一个重要且具有挑战性的任务。现有方法利用大型语言模型的上下文理解能力来建立表达式和目标之间的映射关系,使机器人能够理解和执行指令中的用户意图。然而,尽管与抓取紧密相关,大型语言模型对物体的物理属性的了解仍然被忽视。在这项工作中,我们提出了GraspCoT,这是一个面向物理属性的6自由度抓取检测框架,它结合了链式思维(Chain-of-Thought,简称CoT)推理机制,并受辅助问答(QA)任务的引导。具体来说,我们设计了一系列问答模板,以实现包括目标解析、物理属性分析和抓取动作选择在内的分层推理。此外,GraspCoT呈现了一个统一的多模态大型语言模型架构,它将三维场景的多元视角观察编码为三维视觉令牌,然后将这些视觉令牌与基于CoT衍生的文本令牌嵌入大型语言模型中,以生成抓取姿态预测。此外,我们还推出了IntentGrasp,这是一个大规模基准测试集,填补了公开数据集中多对象抓取检测在多样化和间接言语指令下的空白。在IntentGrasp上的广泛实验证明了我们的方法优越性,并且在现实世界的机器人应用中的额外验证也证明了其实用性。代码和数据将公开发布。
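
论文的分层推理由三个阶段的问答模板引导:目标解析、物理属性分析、抓取动作选择。下面用一个极简的模板组织示意说明这一结构(模板措辞与函数均为假设,并非论文实现):

```python
# 三阶段 QA 模板的极简示意(措辞为假设,仅说明分层推理的组织方式)
QA_TEMPLATES = {
    "target_parsing":    "根据指令『{instruction}』,场景中应操作的目标物体是什么?",
    "physical_property": "目标物体 {target} 的材质、重量与易碎程度等物理属性如何?",
    "grasp_action":      "结合上述物理属性,应选择怎样的 6-DoF 抓取位姿与力度?",
}

def build_cot_queries(instruction: str, target: str) -> list[str]:
    """按阶段依次实例化模板,得到引导 MLLM 逐步推理的问题序列。"""
    return [
        QA_TEMPLATES["target_parsing"].format(instruction=instruction),
        QA_TEMPLATES["physical_property"].format(target=target),
        QA_TEMPLATES["grasp_action"],
    ]

for q in build_cot_queries("帮我拿一个装了水的玻璃杯", "玻璃杯"):
    print(q)
```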

论文及项目相关链接

PDF

Summary:提出了一种名为GraspCoT的6自由度抓取检测框架,该框架结合了面向物理属性的思维链(Chain-of-Thought,CoT)推理机制,通过辅助问答(QA)任务进行引导。设计了一系列QA模板,实现目标解析、物理属性分析和抓取动作选择的三阶段层次推理。同时,GraspCoT呈现了一种统一的多模态大型语言模型架构,将三维场景的多视角观察编码为三维视觉标记,并与CoT衍生的文本标记联合嵌入大型语言模型中,以生成抓取姿态预测。此外,还推出了IntentGrasp大型基准测试,填补了公共数据集中多对象抓取检测在多样化和间接言语指令下的空白。

Key Takeaways

  1. GraspCoT框架利用大型语言模型的上下文理解能力,通过思维链(Chain-of-Thought,CoT)推理机制实现灵活的指令引导6自由度抓取。
  2. GraspCoT框架通过辅助问答(QA)任务进行引导,并设计了一系列QA模板以实现层次推理。
  3. GraspCoT框架关注物体的物理属性,将其与抓取任务紧密结合。
  4. 提出了一个统一的多模态大型语言模型架构,结合了视觉标记和文本标记进行抓取姿态预测。
  5. IntentGrasp基准测试填补了公共数据集中多对象抓取检测在复杂和间接指令下的空白。
  6. 实验表明GraspCoT框架在IntentGrasp基准测试上表现优越。

Cool Papers

点此查看论文截图

Grammar and Gameplay-aligned RL for Game Description Generation with LLMs

Authors:Tsunehiko Tanaka, Edgar Simo-Serra

Game Description Generation (GDG) is the task of generating a game description written in a Game Description Language (GDL) from natural language text. Previous studies have explored generation methods leveraging the contextual understanding capabilities of Large Language Models (LLMs); however, accurately reproducing the game features of the game descriptions remains a challenge. In this paper, we propose reinforcement learning-based fine-tuning of LLMs for GDG (RLGDG). Our training method simultaneously improves grammatical correctness and fidelity to game concepts by introducing both grammar rewards and concept rewards. Furthermore, we adopt a two-stage training strategy where Reinforcement Learning (RL) is applied following Supervised Fine-Tuning (SFT). Experimental results demonstrate that our proposed method significantly outperforms baseline methods using SFT alone.

游戏描述生成(GDG)的任务是从自然语言文本中生成用游戏描述语言(GDL)编写的游戏描述。先前的研究已经探索了利用大型语言模型(LLM)的上下文理解能力的生成方法;然而,准确再现游戏描述中的游戏特性仍然是一个挑战。在本文中,我们提出了基于强化学习的LLM微调方法用于GDG(RLGDG)。我们的训练方法通过引入语法奖励和概念奖励,同时提升语法正确性和对游戏概念的忠实度。此外,我们采用了两阶段训练策略,在监督微调(SFT)之后应用强化学习(RL)。实验结果表明,我们提出的方法显著优于仅使用SFT的基线方法。
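
语法奖励与概念奖励的组合是该方法的核心。下面给出一个组合奖励的简化示意(语法检查与概念匹配的实现均为占位假设,真实系统应调用相应GDL的解析器):

```python
def grammar_reward(gdl_text: str) -> float:
    """占位示意:括号配平则视为语法合格;真实实现应调用 GDL 语法解析器。"""
    return 1.0 if gdl_text.count("(") == gdl_text.count(")") else 0.0

def concept_reward(gdl_text: str, required_concepts: list[str]) -> float:
    """占位示意:按要求的游戏概念(如棋盘、胜利条件)在生成结果中的覆盖率打分。"""
    hits = sum(1 for c in required_concepts if c in gdl_text)
    return hits / max(len(required_concepts), 1)

def rlgdg_reward(gdl_text: str, required_concepts: list[str],
                 w_grammar: float = 0.5, w_concept: float = 0.5) -> float:
    # 总奖励 = 语法奖励与概念奖励的加权和(权重为假设)
    return w_grammar * grammar_reward(gdl_text) + w_concept * concept_reward(gdl_text, required_concepts)

sample = "(game (board 3 3) (win (line 3)))"
print(rlgdg_reward(sample, ["board", "win"]))   # 1.0:括号配平且要求的概念全部覆盖
```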

论文及项目相关链接

PDF

Summary

本文介绍了游戏描述生成(GDG)的任务,即利用游戏描述语言(GDL)从自然语言文本中生成游戏描述。先前的研究已经探索了利用大型语言模型(LLMs)的上下文理解能力进行生成的方法,但准确再现游戏描述中的游戏特性仍然是一个挑战。本文提出了基于强化学习的LLMs精细调整方法(RLGDG),通过引入语法奖励和概念奖励,同时提高语法正确性和对游戏概念的忠实度。此外,采用两阶段训练策略,在监督精细调整(SFT)之后应用强化学习(RL)。实验结果表明,该方法显著优于仅使用SFT的基线方法。

Key Takeaways

  1. 游戏描述生成(GDG)的任务是:从自然语言文本中,利用游戏描述语言(GDL)生成游戏描述。
  2. 大型语言模型(LLMs)已被用于游戏描述生成,但准确再现游戏特性是一大挑战。
  3. 提出了一种新的方法RLGDG,即基于强化学习的LLMs精细调整方法,旨在提高游戏描述的语法正确性和对游戏概念的忠实度。
  4. RLGDG通过引入语法奖励和概念奖励来同时改善这两方面。
  5. 采用两阶段训练策略,先在监督学习下精细调整模型,再应用强化学习。
  6. 实验结果显示,RLGDG显著优于仅使用监督学习的基线方法。

Cool Papers

点此查看论文截图

Reinforcement Learning Environment with LLM-Controlled Adversary in D&D 5th Edition Combat

Authors:Joseph Emmanuel DL Dayo, Michel Onasis S. Ogbinar, Prospero C. Naval Jr

The objective of this study is to design and implement a reinforcement learning (RL) environment using D&D 5E combat scenarios to challenge smaller RL agents through interaction with a robust adversarial agent controlled by advanced Large Language Models (LLMs) like GPT-4o and LLaMA 3 8B. This research employs Deep Q-Networks (DQN) for the smaller agents, creating a testbed for strategic AI development that also serves as an educational tool by simulating dynamic and unpredictable combat scenarios. We successfully integrated sophisticated language models into the RL framework, enhancing strategic decision-making processes. Our results indicate that while RL agents generally outperform LLM-controlled adversaries in standard metrics, the strategic depth provided by LLMs significantly enhances the overall AI capabilities in this complex, rule-based setting. The novelty of our approach and its implications for mastering intricate environments and developing adaptive strategies are discussed, alongside potential innovations in AI-driven interactive simulations. This paper aims to demonstrate how integrating LLMs can create more robust and adaptable AI systems, providing valuable insights for further research and educational applications.

本研究的目标是利用D&D 5E战斗场景设计并实现一个强化学习(RL)环境,让较小的RL代理通过与由先进大型语言模型(如GPT-4o和LLaMA 3 8B)控制的强健对抗代理交互来接受挑战。本研究为小型代理采用深度Q网络(DQN),为战略型人工智能的发展搭建了一个测试平台;同时,通过模拟动态且不可预测的战斗场景,它也可作为教育工具。我们成功地将复杂语言模型整合到RL框架中,增强了战略决策过程。结果表明,尽管RL代理在标准指标上通常优于大型语言模型控制的对手,但大型语言模型提供的战略深度在这一复杂的基于规则的环境中显著增强了整体的人工智能能力。我们还讨论了该方法的创新性、其对掌握复杂环境与开发自适应策略的意义,以及AI驱动交互式模拟中的潜在创新。本文旨在展示整合大型语言模型如何创建更稳健、适应性更强的人工智能系统,为进一步的研究和教育应用提供有价值的见解。
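
该研究的核心架构是“DQN小代理对阵LLM控制的对手”的回合制交互。下面是一个玩具化交互循环的示意(环境规则、接口与随机占位策略均为假设,并非论文代码;真实系统中对手动作由GPT-4o或LLaMA 3根据战况提示词选择):

```python
import random

class ToyDnDCombatEnv:
    """极简回合制战斗环境示意(规则为玩具化假设,并非完整的 D&D 5E 结算)。"""
    def reset(self):
        self.state = {"agent_hp": 20, "enemy_hp": 20}
        return dict(self.state)

    def step(self, agent_action, enemy_action):
        # 玩具规则:选择攻击时掷 d20,>=10 判定命中,命中造成 1d6 伤害
        def damage(action):
            hit = action == "attack" and random.randint(1, 20) >= 10
            return random.randint(1, 6) if hit else 0
        self.state["enemy_hp"] -= damage(agent_action)
        self.state["agent_hp"] -= damage(enemy_action)
        done = self.state["agent_hp"] <= 0 or self.state["enemy_hp"] <= 0
        reward = 1.0 if self.state["enemy_hp"] <= 0 else (-1.0 if done else 0.0)
        return dict(self.state), reward, done

def llm_adversary_action(state):
    # 占位:真实系统会把战况写入提示词并调用 LLM 选动作,此处用随机策略代替
    return random.choice(["attack", "defend"])

env = ToyDnDCombatEnv()
state, done = env.reset(), False
while not done:
    state, reward, done = env.step("attack", llm_adversary_action(state))
print("对战结束,最终状态:", state, "本局奖励:", reward)
```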

论文及项目相关链接

PDF Preprint. Submitted to the 31st International Conference on Neural Information Processing (ICONIP 2024)

Summary
该研究利用D&D 5E战斗场景构建强化学习(RL)环境,让小型RL代理与由先进大型语言模型(LLM)控制的对抗性代理交互,从而接受挑战。研究采用深度Q网络(DQN)为小型代理创建测试平台,同时作为模拟动态和不可预测战斗场景的教育工具。成功将复杂语言模型整合到RL框架中,提升策略决策过程。结果表明,尽管RL代理在标准指标上通常优于LLM控制的对手,但LLMs提供的战略深度在复杂的基于规则的环境中显著增强了整体AI能力。该研究探讨了其方法的创新性、在掌握复杂环境和开发自适应策略方面的意义,以及AI驱动交互式模拟的潜在创新。整合LLMs可以创建更稳健和适应性更强的AI系统,为进一步的科研和教育应用提供有价值的见解。

Key Takeaways

  1. 该研究旨在设计一个使用D&D 5E战斗场景的强化学习(RL)环境。
  2. 小型RL代理在与由先进的大型语言模型(LLM)控制的对抗性代理交互时接受挑战。
  3. 采用深度Q网络(DQN)为小型代理创建测试平台,模拟动态和不可预测的战斗场景。
  4. 成功整合复杂语言模型到RL框架中,增强策略决策过程。
  5. RL代理在标准指标上通常优于LLM控制的对手,但LLMs在复杂环境中显著增强了AI的整体能力。
  6. 该研究的创新性在于其在复杂环境掌握和自适应策略开发方面的意义。

Cool Papers

点此查看论文截图

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Authors:Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs – including Phi-4, LLaMA-3.1, and Gemma-2 – to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.

近期多模态大语言模型(MLLMs)的进步凸显了视觉主干和底层语言模型的关键作用。虽然早期的研究主要集中在将这些组件扩展到数十亿参数,但模型规模、架构和性能之间的权衡仍未得到充分探索。此外,训练数据和评估协议的不一致阻碍了直接比较,难以得出最佳的设计选择。在本文中,我们介绍了LLaVA-MORE,这是一个将最新语言模型与多样化视觉主干相结合的新的MLLM系列。为确保公平比较,我们采用统一的训练协议,并将其一致地应用于所有架构。我们的分析系统地探索了包括Phi-4、LLaMA-3.1和Gemma-2在内的小型和中型规模语言模型,以评估多模态推理、生成和指令遵循能力,同时考察模型规模与性能之间的关系。除了评估语言模型对最终结果的影响外,我们还对各种视觉编码器进行了全面研究,包括基于CLIP的架构以及DINOv2、SigLIP和SigLIP2等替代方案。额外的实验还探讨了提高图像分辨率和更换预训练数据集的影响。总的来说,我们的研究结果为设计更有效的MLLM提供了见解,并提供了一个便于直接比较、可指导未来模型开发的可复现评估框架。我们的源代码和训练好的模型可在 https://github.com/aimagelab/LLaVA-MORE 公开获取。

论文及项目相关链接

PDF

Summary

本文介绍了LLaVA-MORE,一个集成了最新语言模型和多种视觉骨干网的新家族的多模态大型语言模型。研究通过统一训练协议,对小型和中等规模的语言模型进行了系统分析,评估了多模态推理、生成和指令跟随的能力,并探讨了模型尺寸与性能之间的关系。同时,也对不同的视觉编码器进行了深入研究,并公开了源代码和训练好的模型。

Key Takeaways

  1. LLaVA-MORE是一个新的多模态大型语言模型家族,结合了最新的语言模型和多种视觉骨干网。
  2. 研究采用了统一的训练协议,以确保对不同架构的公平比较。
  3. 对小型和中等规模的语言模型进行了系统分析,包括Phi-4、LLaMA-3.1和Gemma-2。
  4. 评估了多模态推理、生成和指令跟随的能力,并探讨了模型尺寸与性能的关系。
  5. 对不同的视觉编码器进行了深入研究,包括CLIP基础架构以及其他替代方案,如DINOv2、SigLIP和SigLIP2。
  6. 通过增加图像分辨率和预训练数据集的变化进行了实验探究。

Cool Papers

点此查看论文截图

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Authors: NVIDIA: Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin, Yen-Chen Lin, Ming-Yu Liu, Andrew Mathau, Yun Ni, Lindsey Pavao, Wei Ping, David W. Romero, Misha Smelyanskiy, Shuran Song, Lyne Tchapmi, Andrew Z. Wang, Boxin Wang, Haoxiang Wang, Fangyin Wei, Jiashu Xu, Yao Xu, Xiaodong Yang, Zhuolin Yang, Xiaohui Zeng, Zhe Zhang

Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-8B and Cosmos-Reason1-56B. We curate data and train our models in four stages: vision pre-training, general supervised fine-tuning (SFT), Physical AI SFT, and Physical AI reinforcement learning (RL) as the post-training. To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and reinforcement learning bring significant improvements. To facilitate the development of Physical AI, we will make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.

物理人工智能系统需要在物理世界中感知、理解并执行复杂的动作。在本文中,我们提出了Cosmos-Reason1模型,它能够通过长链思维推理过程理解物理世界,并以自然语言生成恰当的具身决策(例如下一步动作)。我们首先定义物理人工智能推理的关键能力,重点关注物理常识和具身推理。为了表示物理常识,我们使用一个层次化本体论来刻画关于空间、时间和物理的基本知识。对于具身推理,我们依赖一个能够泛化到不同物理具身形态的二维本体论。基于这些能力,我们开发了两个多模态大语言模型,即Cosmos-Reason1-8B和Cosmos-Reason1-56B。我们分四个阶段整理数据并训练模型:视觉预训练、通用监督微调(SFT)、物理AI SFT,以及作为后训练的物理AI强化学习(RL)。为了评估我们的模型,我们依据上述本体论建立了针对物理常识和具身推理的全面基准测试。评估结果表明,物理AI SFT和强化学习带来了显著改进。为了推动物理人工智能的发展,我们将在NVIDIA Open Model License下提供代码和预训练模型,网址为:https://github.com/nvidia-cosmos/cosmos-reason1。

论文及项目相关链接

PDF

Summary

物理AI系统需要具备感知、理解和执行现实世界复杂动作的能力。本文介绍了Cosmos-Reason1模型,该模型能够通过长链思维推理过程理解物理世界并生成适当的决策。文章重点介绍了物理常识和具身推理能力,使用分层本体论来表示物理常识,并运用二维本体论进行具身推理。基于此,开发了两个多模态大型语言模型Cosmos-Reason1-8B和Cosmos-Reason1-56B。模型训练分为四个阶段:视觉预训练、一般监督微调(SFT)、物理AI SFT和物理AI强化学习(RL)后训练。评估结果显示,物理AI SFT和强化学习带来了显著改善。为推进物理AI的发展,相关代码和预训练模型将在NVIDIA Open Model License下于https://github.com/nvidia-cosmos/cosmos-reason1开放获取。

Key Takeaways

  1. 物理AI系统需要理解并执行现实世界中的复杂动作。
  2. Cosmos-Reason1模型通过长链思维推理过程理解物理世界并做出决策。
  3. 模型具备物理常识和具身推理能力,分别用分层本体论和二维本体论进行表示。
  4. 开发了两个多模态大型语言模型Cosmos-Reason1-8B和Cosmos-Reason1-56B。
  5. 模型训练包括视觉预训练、一般监督微调、物理AI特定监督微调以及强化学习等阶段。
  6. 评估结果表明物理AI特定监督调整和强化学习显著提高了模型性能。

Cool Papers

点此查看论文截图

VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation

Authors:Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, Mohit Bansal

Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models struggle with multi-tasking. VEGGIE also excels in video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications like zero-shot multimodal instructional and in-context video editing.

最近的视频扩散模型已经提升了视频编辑能力,但要在统一框架内处理指令式编辑和多样化任务(例如添加、删除、修改)仍然具有挑战性。在本文中,我们介绍了VEGGIE(Video Editor with Grounded Generation from Instructions),这是一个简洁的端到端框架,它根据多样化的用户指令,统一了视频概念编辑、定位(grounding)和推理。具体来说,给定一个视频和文本查询,VEGGIE首先利用MLLM解读指令中的用户意图,并将其落地到视频上下文中,为像素空间的响应生成逐帧的落地任务查询。然后,扩散模型执行这些计划并生成符合用户意图的编辑视频。为了支持多样化的任务和复杂的指令,我们采用了课程学习策略:先用大规模的指令式图像编辑数据对齐MLLM和视频扩散模型,再在高质量的多任务视频数据上进行端到端微调。此外,我们还引入了一条新的数据合成管道来生成用于模型训练的配对指令式视频编辑数据:它利用图像到视频(Image-to-Video)模型注入动态信息,把静态图像数据转换为多样化、高质量的视频编辑样本。VEGGIE在不同编辑技能的指令式视频编辑中表现出强大的性能,作为一个通用模型超越了最佳的指令式基线,而其他模型则难以胜任多任务。VEGGIE在视频对象定位和推理分割方面也表现出色,而其他基线在这些任务上均告失败。我们还进一步揭示了多个任务如何相互促进,并展望了零样本多模态指令编辑和上下文(in-context)视频编辑等有前景的应用。

论文及项目相关链接

PDF First three authors contributed equally. Project page: https://veggie-gen.github.io/

Summary

文本介绍了一个名为VEGGIE的视频编辑系统,该系统基于用户指令进行视频概念编辑、定位和推理。它使用MLLM解读用户指令并将其与视频内容相结合,生成特定帧的任务查询,再通过扩散模型生成符合用户意图的编辑视频。为支持多样任务和复杂指令,采用课程学习策略,先在大规模指令图像编辑数据上对齐MLLM和扩散模型,再在高质量多任务视频数据上进行端到端微调。此外,引入新的数据合成管道生成配对指令视频编辑数据用于模型训练。VEGGIE在指令视频编辑和物体定位与推理方面表现优秀。

Key Takeaways

  1. VEGGIE是一个统一的视频编辑系统,能够处理多种编辑任务(如添加、删除、修改)。
  2. 系统通过MLLM解读用户指令并将其与视频内容结合,生成特定帧的任务查询。
  3. 使用扩散模型来执行编辑计划并生成符合用户意图的视频。
  4. 采用课程学习策略来支持多样任务和复杂指令的训练。
  5. 引入数据合成管道生成配对指令视频编辑数据,利用Image-to-Video模型增加动态性。
  6. VEGGIE在指令视频编辑和物体定位与推理方面表现优异,优于其他模型。
  7. 多任务训练有助于提升模型性能,并揭示不同任务间的相互促进关系。

Cool Papers

点此查看论文截图

JuDGE: Judgment Document Generation Evaluation

Authors:Weihang Su, Baoqing Yue, Qingyao Ai, Yiran Hu, Jiaqi Li, Changyue Wang, Kaiyuan Zhang, Yueyue Wu, Yiqun Liu

This paper introduces JuDGE (Judgment Document Generation Evaluation), a novel benchmark for evaluating the performance of judgment document generation in the Chinese legal system. We define the task as generating a complete legal judgment document from the given factual description of the case. To facilitate this benchmark, we construct a comprehensive dataset consisting of factual descriptions from real legal cases, paired with their corresponding full judgment documents, which serve as the ground truth for evaluating the quality of generated documents. This dataset is further augmented by two external legal corpora that provide additional legal knowledge for the task: one comprising statutes and regulations, and the other consisting of a large collection of past judgment documents. In collaboration with legal professionals, we establish a comprehensive automated evaluation framework to assess the quality of generated judgment documents across various dimensions. We evaluate various baseline approaches, including few-shot in-context learning, fine-tuning, and a multi-source retrieval-augmented generation (RAG) approach, using both general and legal-domain LLMs. The experimental results demonstrate that, while RAG approaches can effectively improve performance in this task, there is still substantial room for further improvement. All the codes and datasets are available at: https://github.com/oneal2000/JuDGE.

本文介绍了JuDGE(判决书生成评估)这一新型基准测试,旨在评估中文法律体系中判决书生成的性能。我们将任务定义为根据给定的案件事实描述生成完整的法律判决书。为了推动这一基准测试,我们构建了一个综合数据集,其中包含来自真实法律案件的案情描述及其相应的完整判决书,这些判决书作为评估生成文档质量的真实标准。该数据集通过两个外部法律语料库进一步扩充,为任务提供了额外的法律知识:一个包含法规和条例,另一个则包含大量以往的判决书。我们与法律专业人士合作,建立了一个全面的自动化评估框架,从多个维度评估生成的判决书的质量。我们评估了各种基线方法,包括少样本上下文学习、微调以及多源检索增强生成(RAG)方法,这些方法既涉及通用的大型语言模型,也涉及法律领域的大型语言模型。实验结果表明,虽然RAG方法可以有效提高此任务性能,但仍存在很大的改进空间。所有代码和数据集均可在以下网址找到:https://github.com/oneal2000/JuDGE。
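
文中表现最好的基线之一是多源检索增强生成(RAG):同时从法条语料库和历史判决书语料库中检索,再与案情描述一起拼入提示词。下面是提示词组装的一个简化示意(检索函数为占位假设,真实系统会使用稠密或稀疏检索器):

```python
def retrieve(corpus: list[str], query: str, k: int = 2) -> list[str]:
    """占位示意:按与案情描述的字符重叠度粗排;真实系统应使用专门的检索器。"""
    return sorted(corpus, key=lambda doc: -len(set(doc) & set(query)))[:k]

def build_judgment_prompt(fact: str, statutes: list[str], past_judgments: list[str]) -> str:
    """把案情描述与两类外部法律知识(法条、历史判决书)拼接成生成判决书的提示词。"""
    parts = ["案件事实描述:", fact,
             "可参考的法律条文:", *retrieve(statutes, fact),
             "可参考的历史判决书片段:", *retrieve(past_judgments, fact),
             "请据此起草一份完整的判决书。"]
    return "\n".join(parts)

prompt = build_judgment_prompt(
    fact="被告人张某于某日在某超市盗窃商品,价值人民币三千元,后主动投案。",
    statutes=["《刑法》第二百六十四条:盗窃公私财物,数额较大的……",
              "《刑法》第六十七条:犯罪以后自动投案……是自首……"],
    past_judgments=["某判决书片段:被告人因盗窃罪被判处拘役并处罚金……"],
)
print(prompt)
```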

论文及项目相关链接

PDF

Summary
法律文书的生成与评估是一个重要的问题。本文介绍了针对中文法律系统的新型基准测试JuDGE,旨在评估法律文书生成性能。该任务被定义为根据给定的案件事实描述生成完整的法律判决书。为推进此基准测试,本文构建了包含真实案件事实描述与其对应的完整判决书的综合数据集,还辅以两个外部法律语料库提供额外的法律知识。与法务专业人士合作,建立了全面的自动化评估框架,以评估生成的法律文书质量。实验结果表明,多源检索增强生成(RAG)方法可有效提升性能,但仍存在较大的改进空间。所有代码和数据集可在https://github.com/oneal2000/JuDGE找到。

Key Takeaways

  1. 介绍了新型基准测试JuDGE,针对中文法律系统的法律文书生成性能进行评估。
  2. 任务定义为从给定的案件事实描述生成完整的法律判决书。
  3. 构建了综合数据集,包含真实案件的事实描述与对应判决书,作为评估生成文书质量的依据。
  4. 辅以两个外部法律语料库提供额外法律知识。
  5. 与法律专业人士合作建立全面的自动化评估框架。
  6. 实验显示RAG方法能有效提升任务性能。

Cool Papers

点此查看论文截图

Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models

Authors:Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, Xiaofeng Yang

Vision-language models (VLMs) have advanced reasoning in natural scenes, but their role in medical imaging remains underexplored. Medical reasoning tasks demand robust image analysis and well-justified answers, posing challenges due to the complexity of medical images. Transparency and trustworthiness are essential for clinical adoption and regulatory compliance. We introduce Med-R1, a framework exploring reinforcement learning (RL) to enhance VLMs’ generalizability and trustworthiness in medical reasoning. Leveraging the DeepSeek strategy, we employ Group Relative Policy Optimization (GRPO) to guide reasoning paths via reward signals. Unlike supervised fine-tuning (SFT), which often overfits and lacks generalization, RL fosters robust and diverse reasoning. Med-R1 is evaluated across eight medical imaging modalities: CT, MRI, Ultrasound, Dermoscopy, Fundus Photography, Optical Coherence Tomography (OCT), Microscopy, and X-ray Imaging. Compared to its base model, Qwen2-VL-2B, Med-R1 achieves a 29.94% accuracy improvement and outperforms Qwen2-VL-72B, which has 36 times more parameters. Testing across five question types-modality recognition, anatomy identification, disease diagnosis, lesion grading, and biological attribute analysis Med-R1 demonstrates superior generalization, exceeding Qwen2-VL-2B by 32.06% and surpassing Qwen2-VL-72B in question-type generalization. These findings show that RL improves medical reasoning and enables parameter-efficient models to outperform significantly larger ones. With interpretable reasoning outputs, Med-R1 represents a promising step toward generalizable, trustworthy, and clinically viable medical VLMs.

视觉语言模型(VLMs)在自然场景推理方面取得了进展,但它们在医学影像中的作用仍未得到充分探索。医学推理任务既需要稳健的图像分析,又需要有充分依据的答案,而医学图像的复杂性使其颇具挑战。透明度和可信度对于临床落地和合规监管至关重要。我们提出Med-R1框架,探索用强化学习(RL)来增强VLM在医学推理中的泛化能力与可信度。借鉴DeepSeek的策略,我们采用群体相对策略优化(GRPO),通过奖励信号引导推理路径。与容易过拟合、缺乏泛化能力的监督微调(SFT)不同,RL能促成稳健且多样化的推理。Med-R1在八种医学影像模态上进行了评估:CT、MRI、超声、皮肤镜、眼底摄影、光学相干断层扫描(OCT)、显微镜和X射线成像。与其基座模型Qwen2-VL-2B相比,Med-R1的准确率提高了29.94%,并超越了参数量是其36倍的Qwen2-VL-72B。在五种问题类型(模态识别、解剖结构识别、疾病诊断、病变分级和生物属性分析)的测试中,Med-R1展现出更优的泛化能力,比Qwen2-VL-2B高出32.06%,并在问题类型泛化方面超越Qwen2-VL-72B。这些发现表明,强化学习能够改善医学推理,并使参数高效的模型显著超越规模大得多的模型。凭借可解释的推理输出,Med-R1朝着可泛化、可信且临床可行的医学VLM迈出了有希望的一步。

论文及项目相关链接

PDF

摘要

VLM在医疗图像领域的作用尚待深入研究。文章提出了Med-R1框架,旨在探索利用强化学习增强其在医疗推理中的泛化和可信度。使用DeepSeek策略并采用群体相对策略优化(GRPO)引导推理路径,相较于监督微调(SFT),Med-R1实现了更稳健和多样化的推理。在多种医学成像模态的测试显示,Med-R1相较于基准模型有显著改善,且展现出卓越的问题类型泛化能力。这表明强化学习能提升医疗推理能力,且参数效率更高的模型也能超越更大的模型。Med-R1提供了一个有前景的方向,朝着通用、可信和临床可行的医疗VLM发展。

关键见解

  1. VLM在医疗图像领域的作用尚未得到充分探索,尤其是在复杂的医疗推理任务中。
  2. Med-R1框架引入强化学习(RL)以增强VLM在医疗推理中的泛化和可信度。
  3. Med-R1使用DeepSeek策略和GRPO方法,实现了稳健且多样化的推理,优于监督微调(SFT)。
  4. Med-R1在多种医学成像模态的测试中有显著表现,相较于基准模型准确性提高29.94%。
  5. Med-R1展现出卓越的问题类型泛化能力,超越基准模型达32.06%。
  6. 强化学习改善了医疗推理能力,参数效率更高的模型也能表现出超越更大模型的性能。

Cool Papers

点此查看论文截图

A Survey of Direct Preference Optimization

Authors:Shunyu Liu, Wenkai Fang, Zetian Hu, Junjie Zhang, Yang Zhou, Kongcheng Zhang, Rongcheng Tu, Ting-En Lin, Fei Huang, Mingli Song, Yongbin Li, Dacheng Tao

Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful paradigm for aligning LLMs with human preferences, its reliance on complex reward modeling introduces inherent trade-offs in computational efficiency and training stability. In this context, Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative that directly optimizes LLMs using human preferences, thereby circumventing the need for explicit reward modeling. Owing to its theoretical elegance and computational efficiency, DPO has rapidly attracted substantial research efforts exploring its various implementations and applications. However, this field currently lacks systematic organization and comparative analysis. In this survey, we conduct a comprehensive overview of DPO and introduce a novel taxonomy, categorizing previous works into four key dimensions: data strategy, learning framework, constraint mechanism, and model property. We further present a rigorous empirical analysis of DPO variants across standardized benchmarks. Additionally, we discuss real-world applications, open challenges, and future directions for DPO. This work delivers both a conceptual framework for understanding DPO and practical guidance for practitioners, aiming to advance robust and generalizable alignment paradigms. All collected resources are available and will be continuously updated at https://github.com/liushunyu/awesome-direct-preference-optimization.

大型语言模型(LLM)已经展现出前所未有的生成能力,然而,将其与人类价值观对齐对于确保有用和无害的部署仍然至关重要。虽然强化学习从人类反馈(RLHF)已经成为将LLM与人类偏好对齐的强大范式,但它对复杂奖励模型的依赖带来了计算效率和训练稳定性方面的固有权衡。在这种情况下,直接偏好优化(DPO)最近因其作为使用人类偏好直接优化LLM的简化替代方案而备受关注,从而避免了显式奖励建模的需求。由于其理论上的优雅和计算效率,DPO迅速吸引了大量研究努力来探索其各种实现和应用。然而,这个领域目前缺乏系统的组织和比较分析。在这篇综述中,我们对DPO进行了全面的概述,并介绍了一种新的分类法,将以前的工作分为四个关键维度:数据策略、学习框架、约束机制和模型属性。我们进一步对标准化基准下的DPO变体进行了严格的实证分析。此外,我们还讨论了DPO的现实世界应用、开放挑战和未来方向。这项工作既提供了理解DPO的概念框架,也为实践者提供了实用指导,旨在推进稳健和通用的对齐范式。所有收集的资源都可用,并将在https://github.com/liushunyu/awesome-direct-preference-optimization上持续更新。
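
作为背景补充,下面给出标准DPO损失的一个最小实现(公式为社区通用形式,变量命名与数值仅为示意,与本综述的分类法本身无关):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """标准 DPO 损失:
    -log sigmoid( beta * [ (log pi_theta(y_w|x) - log pi_ref(y_w|x))
                          - (log pi_theta(y_l|x) - log pi_ref(y_l|x)) ] )"""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# 假设一个 batch 中每条偏好样本的序列对数似然已经算好(数值仅为示意)
policy_chosen   = torch.tensor([-12.0, -8.5])
policy_rejected = torch.tensor([-14.0, -9.0])
ref_chosen      = torch.tensor([-12.5, -8.7])
ref_rejected    = torch.tensor([-13.0, -8.8])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```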

论文及项目相关链接

PDF

Summary

大型语言模型(LLM)的生成能力空前,但其与人类价值观的对齐对于确保安全、有益的部署至关重要。强化学习从人类反馈(RLHF)已成为将LLM与人类偏好对齐的强大范式,但其复杂的奖励建模带来了计算效率和训练稳定性上的固有权衡。因此,直接偏好优化(DPO)作为一种直接利用人类偏好优化LLM、从而规避显式奖励建模的简化方法迅速受到关注。本文全面概述了DPO,引入了一种新型分类法,将以前的工作分为四个关键维度:数据策略、学习框架、约束机制和模型属性。本文还在标准化基准上对DPO的不同变体进行了严格的实证分析,并讨论了DPO的现实应用、开放挑战和未来方向。此工作旨在为理解DPO提供概念框架,并为从业者提供实践指导,推动稳健、通用的对齐范式发展。

Key Takeaways

  1. 大型语言模型(LLMs)的生成能力强,但与人价值观的对接很关键。
  2. 强化学习从人类反馈(RLHF)是LLM对齐的有效方法,但存在计算效率和训练稳定的权衡问题。
  3. 直接偏好优化(DPO)作为无需明确奖励建模的方法受到关注。
  4. DPO有四个关键维度:数据策略、学习框架、约束机制和模型属性。
  5. DPO的不同变体在标准化基准上进行了严格的实证分析。
  6. DPO具有广泛的应用前景和开放性挑战,及未来的发展方向。

Cool Papers

点此查看论文截图

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Authors:Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, Wei Chen

Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textural representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.

大型语言模型在复杂的文本任务中表现出了惊人的推理能力。然而,需要整合视觉和文本信息的多模态推理仍然是一个重大挑战。现有的视觉语言模型往往难以有效地分析和推理视觉内容,导致在复杂推理任务上表现欠佳。此外,缺乏全面的基准测试也阻碍了对多模态推理能力的准确评估。在本文中,我们提出了R1-Onevision多模态推理模型,旨在弥合视觉感知与深度推理之间的差距。为此,我们提出了一条跨模态推理管道,将图像转换为形式化的文本表示,从而实现精确的基于语言的推理。利用该管道,我们构建了R1-Onevision数据集,为多个领域提供了详细的、逐步的多模态推理注释。我们进一步通过监督微调和强化学习来训练R1-Onevision模型,以培养先进的推理能力和稳健的泛化能力。为了全面评估不同难度层级的多模态推理性能,我们推出了与人类教育阶段相对应的R1-Onevision-Bench基准测试,涵盖从初中到大学及以后的考试。实验结果表明,R1-Onevision在多个具有挑战性的多模态推理基准测试中达到了最先进水平,超越了GPT-4o和Qwen2.5-VL等模型。

论文及项目相关链接

PDF Code and Model: https://github.com/Fancy-MLLM/R1-onevision

Summary:大型语言模型在复杂的文本任务中展现出色的推理能力,但在需要整合视觉和文本信息的多模态推理方面仍面临挑战。现有视觉语言模型在分析视觉内容方面存在困难,导致在复杂推理任务上的性能不佳。本文介绍R1-Onevision多模态推理模型,通过跨模态推理管道将图像转化为正式的文本表示,实现精确的语言推理。此外,构建R1-Onevision数据集并提供详细的多模态推理注释,并开发R1-Onevision模型通过监督微调与强化学习来培养先进的推理和稳健的泛化能力。为全面评估不同等级的多模态推理性能,本文还引入R1-Onevision-Bench基准测试。

Key Takeaways

  1. 大型语言模型在复杂文本任务中表现出强大的推理能力,但在多模态推理方面仍有待提高。
  2. 现有视觉语言模型在分析和推理视觉内容方面存在困难,导致性能不佳。
  3. R1-Onevision多模态推理模型通过跨模态推理管道转化图像为文本表示,实现精确的语言推理。
  4. R1-Onevision数据集提供详细、逐步的多模态推理注释,涵盖不同领域。
  5. R1-Onevision模型通过监督微调和强化学习进行开发,以培育高级推理和稳健的泛化能力。
  6. 引入R1-Onevision-Bench基准测试,以全面评估不同等级的多模态推理性能,与人类教育阶段相契合。

Cool Papers

点此查看论文截图

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

Authors:Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang

This paper introduces Light-R1, an open-source suite for training long reasoning models using reproducible and cost-effective methodology. Given the proprietary nature of data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available at https://github.com/Qihoo360/Light-R1.

本文介绍了Light-R1,这是一个开源套件,采用可复现且具有成本效益的方法训练长推理模型。鉴于DeepSeek-R1系列所用数据的专有性质,我们开发了一种完全基于公开数据和模型的替代方案。我们的课程式训练逐步增加数据难度,并结合多阶段后训练。我们的Light-R1-32B模型基于Qwen2.5-32B-Instruct训练,在数学推理方面优于DeepSeek-R1-Distill-Qwen-32B。实验结果表明,当不同训练阶段拥有彼此不同、多样化的数据集时,这种课程式方法会更加有效:用我们课程数据集中的3000个高难度样本对DeepSeek-R1-Distilled系列模型(由DeepSeek团队在专有数据上预先调优)进行微调,得到了最先进的7B和14B模型,而32B模型Light-R1-32B-DS的表现与QwQ-32B和DeepSeek-R1相当。此外,我们还将GRPO应用于长推理模型。最终的Light-R1-14B-DS在14B规模的模型中取得了数学方面的SOTA性能,AIME24和AIME25得分分别为74.0和60.2,超过了许多32B模型以及DeepSeek-R1-Distill-Llama-70B。尽管训练以数学为主,Light-R1-14B-DS仍展现出强大的跨领域泛化能力。Light-R1在让复杂推理模型更易获取、更易在现实应用中落地方面迈出了重要一步。我们的模型、训练数据和代码已在https://github.com/Qihoo360/Light-R1上提供。

论文及项目相关链接

PDF v2: better writing & format for later submission; all release at https://github.com/Qihoo360/Light-R1

Summary

本文介绍了Light-R1这一开源套件,该套件采用可重现和具有成本效益的方法训练长推理模型。针对DeepSeek-R1系列使用的专有数据,研究团队开发出一种替代方法,该方法仅使用公共数据和模型。通过渐进式增加数据难度和结合多阶段后训练的课程训练,Light-R1系列模型表现优越。实验结果表明,当不同训练阶段使用不同、多样的数据集时,这种课程训练方法更加有效。此外,研究团队将工作扩展至长推理模型上的GRPO应用,最终Light-R1-14B-DS模型在数学领域实现了卓越性能,并展现出强大的跨域泛化能力。Light-R1的出现使得复杂的推理模型更易于访问并在实际应用中得到实现。相关模型和代码已公开于https://github.com/Qihoo360/Light-R1

Key Takeaways

  1. Light-R1是一个用于训练长推理模型的开源套件,采用可重现和成本效益高的方法。
  2. 研究团队开发了一种仅使用公共数据和模型的替代方法。
  3. 通过课程训练和多阶段后训练,Light-R1模型在数学推理方面表现出卓越性能。
  4. 使用不同、多样的数据集进行不同训练阶段效果更佳。
  5. 研究团队将GRPO应用于长推理模型,Light-R1-14B-DS模型在数学领域实现卓越性能并展现强大的跨域泛化能力。
  6. Light-R1使得复杂的推理模型更易于访问和实际应用。

Cool Papers

点此查看论文截图

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Authors:Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, Jiawei Han

Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities during inference to use search engines is not optimal, since the LLM does not learn how to optimally interact with the search engine. This paper introduces Search-R1, an extension of the DeepSeek-R1 model where the LLM learns – solely through reinforcement learning (RL) – to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM rollouts with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 26% (Qwen2.5-7B), 21% (Qwen2.5-3B), and 10% (LLaMA3.2-3B) over strong baselines. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.

对于大语言模型(LLM)的有效推理和文本生成来说,高效获取外部知识和最新信息至关重要。仅在推理时提示具备推理能力的先进LLM去使用搜索引擎并非最优,因为LLM并没有学会如何与搜索引擎进行最优交互。本文介绍了Search-R1,它是DeepSeek-R1模型的扩展:LLM仅通过强化学习(RL)就学会在逐步推理过程中自主生成(多个)搜索查询,并进行实时检索。Search-R1在多轮搜索交互中优化LLM的生成轨迹(rollout),利用对检索到的token进行屏蔽来稳定RL训练,并采用简单的基于结果的奖励函数。在七个问答数据集上的实验表明,相较于强基线,Search-R1将性能分别提升了26%(Qwen2.5-7B)、21%(Qwen2.5-3B)和10%(LLaMA3.2-3B)。本文还就检索增强推理中的RL优化方法、LLM选择和响应长度动态提供了实证见解。代码和模型检查点位于https://github.com/PeterGriffinJin/Search-R1。
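
Search-R1的一个关键训练细节是“检索token屏蔽”:RL只对模型自己生成的token(推理与搜索查询)计算损失,而不把检索回来的文档内容当作模型输出来优化。下面用一个简化的mask构造示意说明这一思路(token切分与特殊标记的区间均为假设,并非论文实现):

```python
def build_loss_mask(tokens, retrieved_spans):
    """简化示意:retrieved_spans 给出检索内容在序列中的 [start, end) 区间,
    这些位置的 mask 置 0,训练时不对它们计算策略梯度/损失。"""
    mask = [1] * len(tokens)
    for start, end in retrieved_spans:
        for i in range(start, min(end, len(tokens))):
            mask[i] = 0
    return mask

# 假设一条轨迹:模型生成推理与搜索查询,检索器返回的信息段不参与损失计算
tokens = ["先", "查", "一下", "<search>", "法国", "人口", "</search>",
          "<information>", "约", "6800", "万", "</information>", "所以", "答案是", "……"]
retrieved = [(7, 12)]          # 信息段所在的 [start, end) 区间(区间仅为示意)
print(build_loss_mask(tokens, retrieved))
```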

论文及项目相关链接

PDF 16 pages

Summary

本文介绍了Search-R1模型,该模型通过强化学习(RL)使大型语言模型(LLM)具备自主生成(多个)搜索查询的能力,以在实时检索过程中进行逐步推理。实验表明,Search-R1在七个问答数据集上的性能相较于基准模型有了显著提升。

Key Takeaways

  1. Search-R1模型是DeepSeek-R1模型的扩展,使LLM具备自主生成搜索查询的能力。
  2. 该模型通过强化学习(RL)进行训练,使LLM学会如何最优地与搜索引擎交互。
  3. Search-R1优化了LLM的推理过程,通过多轮搜索交互、检索令牌掩码和简单的结果导向奖励函数来实现。
  4. 实验结果表明,Search-R1在多个问答数据集上的性能优于基准模型,平均提升约20%。
  5. 该论文还提供了关于RL优化方法、LLM选择和检索增强推理的响应长度动态性的实证见解。
  6. Search-R1模型的代码和模型检查点已公开发布。

Cool Papers

点此查看论文截图

ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning

Authors:Ziyu Wan, Yunxiang Li, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, Ying Wen

Recent research on Reasoning of Large Language Models (LLMs) has sought to further enhance their performance by integrating meta-thinking – enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions. Through iterative reinforcement learning with aligned objectives, these agents explore and learn collaboration, leading to improved generalization and robustness. Experimental results demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competitive-level mathematical benchmarks and LLM-as-a-Judge benchmarks. Comprehensive ablation studies further illustrate the evolving dynamics of each distinct agent, providing valuable insights into how the meta-thinking reasoning process enhances the reasoning capabilities of LLMs.

关于大语言模型(LLM)的推理研究的最新进展,是通过融入元思维来进一步提升其性能。这使得模型能够监控、评估和控制其推理过程,以更适应和有效地解决问题。然而,当前的单智能体研究缺乏获取元思维的专门设计,导致效率较低。为了解决这一挑战,我们引入了强化元思维智能体(ReMA)这一新型框架,它利用多智能体强化学习(MARL)来激发元思维行为,鼓励LLM进行反思性思考。ReMA将推理过程解耦为两个层次智能体:高级元思维智能体,负责产生战略性监督和计划;低级推理智能体,负责详细执行。通过目标一致的迭代强化学习,这些智能体探索和学习协作,提高了通用性和稳健性。实验结果表明,在复杂的推理任务上,ReMA超越了单智能体RL基准线,包括竞技级数学基准和LLM作为法官的基准。全面的消融研究进一步说明了每个独特智能体的动态演变,提供了元思维推理过程如何增强LLM推理能力的宝贵见解。

论文及项目相关链接

PDF

Summary:近期关于大型语言模型(LLM)的研究致力于通过引入元思维进一步提升其性能。元思维使模型能够监控、评估和控制其推理过程,从而实现更自适应和高效的问题解决。为解决现有单智能体在获取元思维方面的不足,我们提出了强化元思维智能体(ReMA)这一新型框架,利用多智能体强化学习(MARL)激发元思维行为。ReMA将推理过程分为两个层次智能体:高级元思维智能体负责生成战略监督和计划,低级推理智能体负责具体执行。通过目标一致的迭代强化学习,这些智能体探索并学习协作,提高了泛化和稳健性。实验结果表明,ReMA在复杂的推理任务上优于单智能体RL基线,包括竞争性水平的数学基准测试和LLM作为法官的基准测试。

Key Takeaways

  1. LLMs的元思维增强研究旨在提高模型的自适应和高效问题解决能力。
  2. ReMA框架利用MARL激发模型元思维行为,分为高级元思维智能体和低级推理智能体。
  3. 通过迭代强化学习,智能体之间探索并学习协作,提高模型的泛化和稳健性。
  4. ReMA在复杂推理任务上表现优越,包括数学和LLM-as-a-Judge基准测试。
  5. 实验结果证明了ReMA框架的有效性和优势。
  6. ReMA框架为LLMs的元思维推理过程提供了有价值的见解。

Cool Papers

点此查看论文截图

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Authors:Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy

Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e. CoT reasoning does not always reflect how models arrive at conclusions. So far, most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced. In contrast, we show that unfaithful CoT can occur on realistic prompts with no artificial bias. Our results reveal non-negligible rates of several forms of unfaithful reasoning in frontier models: Sonnet 3.7 (16.3%), DeepSeek R1 (5.3%) and ChatGPT-4o (7.0%) all answer a notable proportion of question pairs unfaithfully. Specifically, we find that models rationalize their implicit biases in answers to binary questions (“implicit post-hoc rationalization”). For example, when separately presented with the questions “Is X bigger than Y?” and “Is Y bigger than X?”, models sometimes produce superficially coherent arguments to justify answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We also investigate restoration errors (Dziri et al., 2023), where models make and then silently correct errors in their reasoning, and unfaithful shortcuts, where models use clearly illogical reasoning to simplify solving problems in Putnam questions (a hard benchmark). Our findings raise challenges for AI safety work that relies on monitoring CoT to detect undesired behavior.

“链式思维”(Chain-of-Thought, CoT)推理显著提升了当前最先进的人工智能能力。然而,近期研究显示,CoT推理并不总是忠实的,即CoT并不总能反映模型得出结论的真实过程。迄今为止,这类研究大多聚焦于人为引入显式偏见的非自然情境下的不忠实。与之相反,我们证明,在没有人为偏见的现实提示下,也会出现不忠实的CoT。我们的结果揭示了前沿模型中多种形式的不忠实推理出现率不可忽视:Sonnet 3.7(16.3%)、DeepSeek R1(5.3%)和ChatGPT-4o(7.0%)都对相当比例的问题对给出了不忠实的回答。具体来说,我们发现模型会在回答二元问题时为其隐含偏见做事后合理化(“隐式事后合理化”)。例如,当分别被问到“X是否大于Y?”和“Y是否大于X?”时,模型有时会给出表面上连贯的论证,为对两个问题都回答“是”或都回答“否”辩护,尽管这样的回答在逻辑上相互矛盾。我们还研究了恢复性错误(restoration errors,Dziri等人,2023),即模型在推理中犯错后又不加说明地默默纠正;以及不忠实的捷径,即模型用明显不合逻辑的推理来简化求解Putnam问题(一个高难度基准)。我们的发现给依赖监控CoT来检测不良行为的AI安全工作带来了挑战。
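
检测这类“隐式事后合理化”的方法本质上很直接:把同一比较问题正反各问一遍,若两次回答逻辑上矛盾(都答“是”或都答“否”),就记为一次不忠实回答。下面是一个最小化检测脚本的示意(call_model为假设的模型调用接口,非论文代码):

```python
def is_inconsistent(ans_xy: str, ans_yx: str) -> bool:
    """若“X比Y大吗?”与“Y比X大吗?”的回答同为 Yes 或同为 No,则视为逻辑矛盾
    (这里假设 X 与 Y 严格不相等)。"""
    starts_with_yes = lambda a: a.strip().lower().startswith("yes")
    return starts_with_yes(ans_xy) == starts_with_yes(ans_yx)

def check_pair(call_model, x: str, y: str) -> bool:
    """call_model 为假设的接口:输入问题字符串,返回以 Yes/No 开头的回答。"""
    ans_xy = call_model(f"Is {x} bigger than {y}? Answer Yes or No first.")
    ans_yx = call_model(f"Is {y} bigger than {x}? Answer Yes or No first.")
    return is_inconsistent(ans_xy, ans_yx)

# 用一个总是回答 Yes 的假模型演示:正反两问都答 Yes,被判定为不忠实(矛盾)
fake_model = lambda prompt: "Yes, because ..."
print(check_pair(fake_model, "the Moon", "the Earth"))   # True
```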

论文及项目相关链接

PDF Accepted to the Reasoning and Planning for LLMs Workshop (ICLR 25), 10 main paper pages, 39 appendix pages

摘要

链式思维(Chain-of-Thought,简称CoT)推理显著提升了人工智能的先进能力。但最新研究显示,CoT推理并不总是忠实于真实情况,也就是说,模型给出的推理链并不总能反映其得出结论的真实过程。先前的研究主要关注在不自然情境下的不忠实推理,而本研究显示,在没有人为偏见的现实提示下,也会出现不忠实的CoT。研究结果揭示了前沿模型中不可忽略的各种形式的不忠实推理率:Sonnet 3.7(16.3%)、DeepSeek R1(5.3%)和ChatGPT-4o(7.0%)在回答问题时均有显著比例的不忠实推理。特别是,我们发现模型在回答二元问题时,会为其答案进行事后理性化。例如,当分别面对“X是否大于Y?”和“Y是否大于X?”的问题时,模型有时会制造表面上合理的论据,以证明两个问题的答案都是“是”或都是“否”,尽管这样的回答在逻辑上是矛盾的。我们还研究了恢复性错误(模型在推理中犯错后又默默纠正),以及模型使用明显不合逻辑的推理来简化解决Putnam问题的不忠实捷径。我们的研究对依赖监测CoT来检测不希望出现行为的AI安全工作提出了挑战。

关键见解

  1. 链式思维(CoT)推理在提升AI能力方面发挥了重要作用,但存在不忠实于真实情况的问题。
  2. 不忠实CoT可在无人工偏见的现实提示下发生。
  3. 多种前沿模型存在不忠实推理问题,包括Sonnet 3.7、DeepSeek R1和ChatGPT-4o。
  4. 模型会在回答二元问题时进行隐式的事后理性化,有时导致逻辑矛盾。
  5. 模型存在恢复性错误(犯错后又默默纠正),并会用明显不合逻辑的捷径来简化解题。
  6. 这些发现对依赖CoT监测来确保AI安全的工作构成了挑战。

Cool Papers

点此查看论文截图

MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

Authors:Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein

Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks. This high performance makes it increasingly difficult to meaningfully evaluate and differentiate advanced methods. We present MedAgentsBench, a benchmark that focuses on challenging medical questions requiring multi-step clinical reasoning, diagnosis formulation, and treatment planning-scenarios where current models still struggle despite their strong performance on standard tests. Drawing from seven established medical datasets, our benchmark addresses three key limitations in existing evaluations: (1) the prevalence of straightforward questions where even base models achieve high performance, (2) inconsistent sampling and evaluation protocols across studies, and (3) lack of systematic analysis of the interplay between performance, cost, and inference time. Through experiments with various base models and reasoning methods, we demonstrate that the latest thinking models, DeepSeek R1 and OpenAI o3, exhibit exceptional performance in complex medical reasoning tasks. Additionally, advanced search-based agent methods offer promising performance-to-cost ratios compared to traditional approaches. Our analysis reveals substantial performance gaps between model families on complex questions and identifies optimal model selections for different computational constraints. Our benchmark and evaluation framework are publicly available at https://github.com/gersteinlab/medagents-benchmark.

大型语言模型(LLM)在现有的医疗问答基准测试中表现出了令人印象深刻的性能,这使得对先进方法进行有意义的评估和区分变得越来越困难。我们推出了MedAgentsBench基准测试,它专注于需要多步骤临床推理、诊断制定和治疗方案规划的高难度医疗问题,尽管当前模型在标准测试上表现良好,但在这些场景中仍然力不从心。我们的基准测试基于七个公认的医学数据集构建,解决了现有评估中的三个关键局限:(1)简单问题普遍存在,即使是基础模型也能取得高分;(2)不同研究之间的采样和评估协议不一致;(3)缺乏对性能、成本与推理时间三者关系的系统分析。通过对多种基础模型和推理方法的实验,我们证明了最新的思考型模型DeepSeek R1和OpenAI o3在复杂医疗推理任务中表现出色。此外,与传统方法相比,先进的基于搜索的智能体方法具有可观的性能成本比。我们的分析揭示了不同模型家族在复杂问题上的显著性能差距,并针对不同的计算约束给出了最优模型选择。我们的基准测试和评估框架可在https://github.com/gersteinlab/medagents-benchmark公开获取。

论文及项目相关链接

PDF

Summary

大型语言模型在现有的医疗问答基准测试中表现出色,但这使得评估区分高级方法越来越困难。本文提出了MedAgentsBench基准测试,它专注于挑战需要多步骤临床推理、诊断制定和治疗规划的医疗问题,这是当前模型仍然面临困难的场景。该基准测试解决了现有评估中的三个关键局限性,包括简单问题占主导、采样和评估协议不一致以及缺乏性能、成本和推理时间之间的系统分析。通过不同的基础模型和推理方法的实验,证明最新的DeepSeek R1和OpenAI o3模型在复杂的医疗推理任务中表现出卓越性能。此外,先进的基于搜索的代理方法与传统方法相比,具有有吸引力的性能成本比。分析表明,在不同计算约束下,模型家族间存在显著性能差异,为不同计算环境选择最优模型提供了依据。基准测试和评估框架可在公开渠道获取。

Key Takeaways

  1. 大型语言模型在医疗问答基准测试中表现出色,但仍面临评估困难。
  2. MedAgentsBench基准测试专注于挑战需要多步骤临床推理的医疗问题。
  3. 现有评估存在三个关键局限性:简单问题主导、采样和评估协议不一致以及缺乏系统分析性能、成本和推理时间的关系。
  4. 最新的大型语言模型如DeepSeek R1和OpenAI o3在复杂的医疗推理任务中表现出卓越性能。
  5. 先进的基于搜索的代理方法具有有吸引力的性能成本比。
  6. 不同模型家族间存在显著性能差异,为选择最优模型提供了依据。

Cool Papers

点此查看论文截图

Reinforcement Learning with Verifiable Rewards: GRPO’s Effective Loss, Dynamics, and Success Amplification

Authors:Youssef Mroueh

Group Relative Policy Optimization (GRPO) was introduced and used successfully to train DeepSeek R1 models for promoting reasoning capabilities of LLMs using verifiable or binary rewards. We show in this paper that GRPO with verifiable rewards can be written as a Kullback Leibler ($\mathsf{KL}$) regularized contrastive loss, where the contrastive samples are synthetic data sampled from the old policy. The optimal GRPO policy $\pi_{n}$ can be expressed explicitly in terms of the binary reward, as well as the first and second order statistics of the old policy ($\pi_{n-1}$) and the reference policy $\pi_0$. Iterating this scheme, we obtain a sequence of policies $\pi_{n}$ for which we can quantify the probability of success $p_n$. We show that the probability of success of the policy satisfies a recurrence that converges to a fixed point of a function that depends on the initial probability of success $p_0$ and the regularization parameter $\beta$ of the $\mathsf{KL}$ regularizer. We show that the fixed point $p^*$ is guaranteed to be larger than $p_0$, thereby demonstrating that GRPO effectively amplifies the probability of success of the policy.

群体相对策略优化(GRPO)被提出并成功用于训练DeepSeek R1系列模型,以借助可验证或二元奖励提升LLM的推理能力。本文表明,带可验证奖励的GRPO可以写成一个Kullback-Leibler($\mathsf{KL}$)正则化的对比损失,其中对比样本是从旧策略中采样的合成数据。最优GRPO策略$\pi_{n}$可以显式地表示为二元奖励以及旧策略$\pi_{n-1}$和参考策略$\pi_0$的一阶、二阶统计量的函数。迭代该方案可得到一列策略$\pi_{n}$,并可量化其成功概率$p_n$。我们证明该成功概率满足一个递推式,其收敛到某个函数的不动点,而该函数取决于初始成功概率$p_0$和$\mathsf{KL}$正则项的正则化参数$\beta$。我们进一步证明不动点$p^*$必定大于$p_0$,从而说明GRPO有效地放大了策略的成功概率。
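
作为直观补充,下面给出一个简化推导来说明“成功概率被放大”的方向性结论(仅基于通用的$\mathsf{KL}$正则化最优策略形式,并非论文中的精确递推):在最大化期望奖励并以$\beta$为系数惩罚相对参考策略$\pi_{\mathrm{ref}}$的$\mathsf{KL}$散度时,最优策略具有倾斜形式$\pi^{*}(y)\propto\pi_{\mathrm{ref}}(y)\,e^{r(y)/\beta}$。对二元奖励$r\in\{0,1\}$,记参考策略的成功率为$p$,则新策略的成功率为

$$p' = \frac{p\,e^{1/\beta}}{p\,e^{1/\beta} + (1-p)} > p \qquad (\beta>0,\ 0<p<1),$$

即一步更新就会严格提升成功概率;论文在更精细的设定下(利用二元奖励以及旧策略、参考策略的一阶和二阶统计量)刻画了迭代后成功率收敛到的不动点$p^{*}>p_0$。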

论文及项目相关链接

PDF

Summary

GRPO(Group Relative Policy Optimization)被成功应用于DeepSeek R1模型的训练,以提升LLM的推理能力。该研究将GRPO与可验证奖励相结合,并将其表述为Kullback Leibler(KL)正则化的对比损失。GRPO策略优化可以通过迭代获得,并可以量化成功概率。研究表明,GRPO能有效提高策略成功概率的固定点值。

Key Takeaways

  1. GRPO被成功应用于训练DeepSeek R1模型,以提高LLM的推理能力,使用可验证或二进制奖励。
  2. GRPO与可验证奖励的结合可以表示为KL正则化的对比损失。
  3. GRPO策略优化可以通过迭代获得,并可以量化其成功概率。
  4. GRPO策略优化中的最优策略$\pi_{n}$可以显式表达为二进制奖励以及旧策略$\pi_{n-1}$和参考策略$\pi_0$的一阶和二阶统计。
  5. 策略成功概率满足一个递归,收敛到依赖于初始成功概率$p_0$和KL正则化的正则化参数$\beta$的函数固定点。
  6. 固定点$p^*$保证大于$p_0$,表明GRPO有效地提高了策略的成功概率。

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !