
R1_Reasoning


⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace 免费体验

2025-09-29 更新

Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching

Authors:Songze Li, Zhiqiang Liu, Zhengke Gui, Huajun Chen, Wen Zhang

Large Language Models (LLMs) exhibit strong reasoning capabilities in complex tasks. However, they still struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA). We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures. Existing methods usually employ resource-intensive, non-scalable workflows reasoning on vanilla KGs, but overlook this gap. To address this challenge, we propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs’ prior knowledge to enrich KGs, bridge the semantic gap between graphs and queries. EoG enables efficient evidence extraction from KGs for precise and robust reasoning, while ensuring low computational costs, scalability, and adaptability across different methods. Furthermore, we propose three graph quality evaluation metrics to analyze query-graph alignment in KGQA task, supported by theoretical validation of our optimization objectives. Extensive experiments on two KGQA benchmark datasets indicate that EoG can effectively generate high-quality KGs and achieve the state-of-the-art performance. Our code and data are available at https://github.com/zjukg/Enrich-on-Graph.

大型语言模型(LLM)在复杂任务中展现出强大的推理能力。然而,在知识密集型场景(如知识图谱问答(KGQA))中,它们仍然难以处理幻觉和事实错误。我们将这归因于结构化的知识图谱(KGs)与非结构化查询之间存在的语义鸿沟,这一鸿沟源于两者在关注点和结构上的固有差异。现有方法通常使用资源密集、不可扩展的工作流对普通知识图谱进行推理,但忽略了这一差距。为了应对这一挑战,我们提出了一种灵活框架,名为“图谱丰富”(Enrich-on-Graph,EoG),它利用LLM的先验知识来丰富知识图谱,弥合图谱和查询之间的语义鸿沟。EoG能够实现从知识图谱中高效提取证据,进行精确而稳健的推理,同时确保低计算成本、可扩展性和跨不同方法的适应性。此外,我们提出了三项图质量评估指标,用于分析KGQA任务中的查询-图对齐情况,并对我们的优化目标进行了理论验证作为支撑。在两个KGQA基准数据集上的大量实验表明,EoG可以有效地生成高质量的知识图谱,达到最先进的性能。我们的代码和数据可在https://github.com/zjukg/Enrich-on-Graph找到。
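下面给出一个基于上述思路的极简示意(Python):用 LLM 的先验知识为与查询相关的子图补充三元组,再与原图合并去重。其中 `llm_complete` 是假设的 LLM 调用接口,提示词与三元组文本格式也是为说明而虚构的,并非论文的官方实现:

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (头实体, 关系, 尾实体)

def build_enrich_prompt(query: str, subgraph: List[Triple]) -> str:
    """构造提示词:请LLM补充与查询相关、但原始子图中缺失的三元组。"""
    lines = [f"({h}, {r}, {t})" for h, r, t in subgraph]
    return (
        "给定问题:" + query + "\n已有知识图谱三元组:\n" + "\n".join(lines)
        + "\n请补充若干与问题直接相关的新三元组,每行一个,格式为 (头实体, 关系, 尾实体)。"
    )

def parse_triples(text: str) -> List[Triple]:
    """解析LLM返回的三元组文本,跳过无法解析的行。"""
    triples = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.strip().strip("()").split(",")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples

def enrich_on_graph(query: str, subgraph: List[Triple], llm_complete) -> List[Triple]:
    """示意流程:调用LLM生成补充三元组,并与原始子图合并去重(保持顺序)。"""
    raw = llm_complete(build_enrich_prompt(query, subgraph))
    return list(dict.fromkeys(subgraph + parse_triples(raw)))

if __name__ == "__main__":
    # 用一个返回固定文本的假LLM演示调用方式
    fake_llm = lambda prompt: "(法国, 首都, 巴黎)\n(巴黎, 位于, 法国)"
    print(enrich_on_graph("法国的首都是哪里?", [("法国", "官方语言", "法语")], fake_llm))
```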

论文及项目相关链接

PDF

Summary

大型语言模型(LLMs)在复杂任务中展现出强大的推理能力,但在知识密集型场景(如知识图谱问答)中仍存在幻觉和事实错误问题。这归因于结构化知识图谱(KGs)与非结构化查询之间的语义鸿沟,源于两者关注点和结构上的差异。现有方法通常依赖于资源密集型的、不可扩展的工作流程来推理普通知识图谱,却忽视了这一鸿沟。为解决此挑战,我们提出了灵活的框架Enrich-on-Graph(EoG),利用LLMs的先验知识来丰富知识图谱,缩小知识图谱与查询之间的语义鸿沟。EoG可实现从知识图谱中高效提取证据,进行精确稳健的推理,同时确保低计算成本、可扩展性和不同方法间的适应性。此外,我们提出了三项图质量评估指标,分析KGQA任务中的查询-图谱对齐情况,并通过优化目标的理论验证予以支持。在KGQA基准数据集上的广泛实验表明,EoG可有效生成高质量知识图谱,达到最新性能水平。

Key Takeaways

  1. LLMs在复杂任务中展现出强大的推理能力,但在知识密集型场景中存在幻觉和事实错误问题。
  2. 知识图谱(KGs)与非结构化查询之间存在语义鸿沟,源于其关注点和结构上的差异。
  3. 现有方法通常使用资源密集型的、不可扩展的工作流程来推理普通知识图谱。
  4. Enrich-on-Graph(EoG)框架利用LLMs的先验知识来丰富知识图谱,缩小知识图谱与查询之间的语义鸿沟。
  5. EoG可实现高效证据提取,进行精确稳健的推理,同时确保低计算成本、可扩展性和适应性。
  6. 提出了三项图质量评估指标,用于分析KGQA任务中的查询-图谱对齐情况。
  7. EoG在KGQA基准数据集上的实验表明其生成高质量知识图谱的能力,并达到最新性能水平。

Cool Papers

点此查看论文截图

LogReasoner: Empowering LLMs with Expert-like Coarse-to-Fine Reasoning for Log Analysis Tasks

Authors:Lipeng Ma, Yixuan Li, Weidong Yang, Mingjie Zhou, Xinyi Liu, Ben Fei, Shuhao Li, Xiaoyan Sun, Sihang Jiang, Yanghua Xiao

Log analysis is crucial for monitoring system health and diagnosing failures in complex systems. Recent advances in large language models (LLMs) offer new opportunities for automated log analysis, leveraging their reasoning capabilities to perform tasks such as anomaly detection and failure prediction. However, general-purpose LLMs struggle to formulate structured reasoning workflows that align with expert cognition and deliver precise details of reasoning steps. To address these challenges, we propose LogReasoner, a coarse-to-fine reasoning enhancement framework designed to enable LLMs to reason log analysis tasks like experts. LogReasoner consists of two stages: (1) coarse-grained enhancement of expert thinking, where high-level expert thoughts are constructed from collected troubleshooting flowcharts and existing tasks to enable LLMs to formulate structured reasoning workflows and (2) fine-grained enhancement of specific steps, where we first fine-tune the LLM with task-specific stepwise solutions to enhance the LLM for instantiated reasoning, then employ the preference learning to calibrate the LLM’s reasoning details from its mistakes, further strengthen the LLM’s analytical granularity and correctness. We evaluate LogReasoner on four distinct log analysis tasks using open-source LLMs such as Qwen-2.5 and Llama-3. Experimental results show that LogReasoner significantly outperforms existing LLMs, achieving state-of-the-art performance and demonstrating its effectiveness in enhancing the reasoning capabilities of LLMs for log analysis.

日志分析对于监控系统健康状况和诊断复杂系统中的故障至关重要。最近的大型语言模型(LLM)的进步为自动化日志分析提供了新的机会,利用其推理能力来执行异常检测和故障预测等任务。然而,通用LLM在形成与专家认知相符的结构化推理工作流程并提供精确的推理步骤细节方面存在困难。为了解决这些挑战,我们提出了LogReasoner,这是一个从粗到细的推理增强框架,旨在使LLM能够像专家一样进行日志分析任务。LogReasoner包含两个阶段:(1)专家思维的粗粒度增强,我们从收集的故障排除流程图和现有任务中构建高级专家思维,使LLM能够形成结构化推理工作流程;(2)特定步骤的精细粒度增强,我们首先对LLM进行与任务相关的分步解决方案的微调,以增强其实例化推理能力,然后利用偏好学习来校正LLM的推理细节中的错误,进一步提高了LLM的分析精细度和正确性。我们在四个不同的日志分析任务上评估了LogReasoner,使用了开源LLM,如Qwen-2.5和Llama-3。实验结果表明,LogReasoner显著优于现有LLM,达到了最先进的性能,证明了其在增强LLM日志分析推理能力方面的有效性。

论文及项目相关链接

PDF under review

Summary

基于日志分析在系统健康监测和复杂系统故障诊断中的重要性,研究利用大型语言模型(LLMs)进行自动化日志分析。LogReasoner框架通过粗到细的推理增强机制,使LLMs能够像专家一样进行日志分析任务。该框架包括两个阶段:构建专家思维的粗粒度增强和特定步骤的细粒度增强。实验结果表明,LogReasoner在日志分析任务上显著优于现有LLMs,达到最新性能水平,有效提升了LLMs的推理能力。

Key Takeaways

  1. 日志分析对于监测系统健康和诊断复杂系统故障至关重要。
  2. 大型语言模型(LLMs)为自动化日志分析提供了新的机会。
  3. LogReasoner框架旨在通过粗到细的推理增强机制,使LLMs能够像专家一样进行日志分析。
  4. LogReasoner包括构建专家思维的粗粒度增强和特定步骤的细粒度增强两个阶段。
  5. 通过与开源LLMs(如Qwen-2.5和Llama-3)的实验评估,LogReasoner显著优于现有LLMs。
  6. LogReasoner达到最新性能水平,在日志分析任务上表现出卓越的效果。

Cool Papers

点此查看论文截图

Meta-Memory: Retrieving and Integrating Semantic-Spatial Memories for Robot Spatial Reasoning

Authors:Yufan Mao, Hanjing Ye, Wenlong Dong, Chengjie Zhang, Hong Zhang

Navigating complex environments requires robots to effectively store observations as memories and leverage them to answer human queries about spatial locations, which is a critical yet underexplored research challenge. While prior work has made progress in constructing robotic memory, few have addressed the principled mechanisms needed for efficient memory retrieval and integration. To bridge this gap, we propose Meta-Memory, a large language model (LLM)-driven agent that constructs a high-density memory representation of the environment. The key innovation of Meta-Memory lies in its capacity to retrieve and integrate relevant memories through joint reasoning over semantic and spatial modalities in response to natural language location queries, thereby empowering robots with robust and accurate spatial reasoning capabilities. To evaluate its performance, we introduce SpaceLocQA, a large-scale dataset encompassing diverse real-world spatial question-answering scenarios. Experimental results show that Meta-Memory significantly outperforms state-of-the-art methods on both the SpaceLocQA and the public NaVQA benchmarks. Furthermore, we successfully deployed Meta-Memory on real-world robotic platforms, demonstrating its practical utility in complex environments. Project page: https://itsbaymax.github.io/meta-memory.github.io/ .

在复杂环境中导航要求机器人有效地将观测存储为记忆,并利用这些记忆回答人类关于空间位置的查询,这是一个至关重要但尚未被充分探索的研究挑战。虽然先前的研究在构建机器人记忆方面取得了一些进展,但很少有人解决高效记忆检索和整合所需的原则性机制。为了弥补这一差距,我们提出了Meta-Memory,这是一个由大型语言模型(LLM)驱动的智能体,它构建了环境的高密度记忆表示。Meta-Memory的关键创新在于它能够通过语义和空间模态的联合推理来检索和整合相关记忆,以响应自然语言位置查询,从而为机器人提供强大而准确的空间推理能力。为了评估其性能,我们引入了SpaceLocQA,这是一个包含各种现实世界空间问答场景的大规模数据集。实验结果表明,Meta-Memory在SpaceLocQA和公共NaVQA基准测试上的表现均优于最新技术。此外,我们成功地将Meta-Memory部署在真实的机器人平台上,证明了其在复杂环境中的实际效用。项目页面:https://itsbaymax.github.io/meta-memory.github.io/。
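下面用一个极简的 Python 片段示意“对语义与空间两种模态联合打分来检索记忆”的思路:每条记忆同时保存描述向量与三维坐标,检索时把语义余弦相似度和空间距离得分加权求和。打分公式、权重 `alpha` 以及向量维度均为本文为说明而假设,并非论文的实际检索机制:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve(query_vec, query_pos, memories, alpha=0.7, top_k=3):
    """memories: [(描述向量, 三维坐标, 原始文本), ...];返回得分最高的top_k条记忆。"""
    scored = []
    for vec, pos, text in memories:
        sem = cosine(query_vec, vec)                          # 语义模态:余弦相似度
        spa = 1.0 / (1.0 + np.linalg.norm(query_pos - pos))   # 空间模态:距离越近得分越高
        scored.append((alpha * sem + (1 - alpha) * spa, text))
    return sorted(scored, reverse=True)[:top_k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mems = [(rng.normal(size=8), rng.uniform(0, 10, size=3), f"观测{i}") for i in range(5)]
    q_vec, q_pos = rng.normal(size=8), np.array([1.0, 2.0, 0.5])
    for score, text in retrieve(q_vec, q_pos, mems):
        print(f"{text}: {score:.3f}")
```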

论文及项目相关链接

PDF

Summary

机器人在复杂环境中运作时,需要有效存储观察作为记忆并利用这些记忆回答关于空间位置的人类查询,这是一个关键但尚未被充分研究的研究挑战。先前的研究虽然构建了机器人记忆系统,但很少有人关注高效记忆检索和整合所需的原理机制。为了弥补这一空白,我们提出了Meta-Memory,这是一个由大型语言模型驱动的代理,能够构建环境的高密度记忆表示。Meta-Memory的关键创新之处在于其通过语义和空间模态的联合推理来检索和整合相关记忆,以回应自然语言位置查询的能力,从而为机器人提供稳健和精确的空间推理能力。我们引入了SpaceLocQA数据集,以评估其在多种现实世界空间问答场景中的性能。实验结果表明,Meta-Memory在SpaceLocQA和公共NaVQA基准测试上的表现均优于当前最新方法。此外,我们还成功将Meta-Memory部署在真实世界的机器人平台上,证明了其在复杂环境中的实际应用价值。项目页面链接为:https://itsbaymax.github.io/meta-memory.github.io/。

Key Takeaways

  1. 机器人需要有效存储和检索记忆以应对复杂环境中的空间位置查询。
  2. Meta-Memory通过大型语言模型驱动,能构建环境的高密度记忆表示。
  3. Meta-Memory能结合语义和空间模态进行联合推理,回应自然语言位置查询。
  4. 引入的SpaceLocQA数据集用于评估机器人在多种空间问答场景中的性能。
  5. Meta-Memory在基准测试上的表现优于现有方法。
  6. Meta-Memory成功部署在真实世界的机器人平台上。

Cool Papers

点此查看论文截图

CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

Authors:Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou

Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose \textbf{C}ontrolling \textbf{E}ntropy via \textbf{G}radient-\textbf{P}reserving \textbf{P}olicy \textbf{O}ptimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.

强化学习(RL)已成为优化大型语言模型(LLM)以处理复杂推理任务的有力范式。这一过程中的核心挑战在于管理策略熵,它反映了训练过程中的探索与利用之间的平衡。现有方法,如近端策略优化(PPO)及其变体,由于裁剪机制而丢弃了来自低概率标记的有价值梯度信号。我们系统地分析了熵动态,并发现这些被裁剪的标记在调节熵演化方面发挥着至关重要的作用,但往往被忽视。我们提出了通过梯度保留策略优化控制熵(CE-GPPO)的新算法,它以温和且有界的方式在原生PPO中重新引入了被裁剪标记的梯度。通过控制来自裁剪区间外标记的梯度幅度,CE-GPPO能够实现探索与利用之间的权衡。我们提供了理论证明和实验证据,表明CE-GPPO有效地缓解了熵不稳定的问题。在数学推理基准测试上的广泛实验表明,CE-GPPO在不同模型规模上始终优于多个强基线方法。
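下面用 PyTorch 给出一个示意性实现,演示“对落在裁剪区间外的 token 以温和且有界的方式重新引入梯度”这一核心思想;其中重引入系数 `beta` 及梯度项的具体形式是本文为说明而假设的,并非论文的精确公式:

```python
import torch

def ce_gppo_objective(logp_new, logp_old, adv, clip_eps=0.2, beta=0.05):
    """逐token计算策略梯度目标(示意)。
    logp_new/logp_old: 新旧策略下的token对数概率;adv: 优势估计。"""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    ppo_obj = torch.minimum(unclipped, clipped)          # 标准PPO/GRPO裁剪目标

    # 位于裁剪区间之外的token在标准裁剪下梯度为零;
    # 这里额外加入一个按beta缩放、幅度截断的项,温和地恢复其梯度信号。
    outside = (ratio < 1 - clip_eps) | (ratio > 1 + clip_eps)
    gentle = beta * torch.clamp(unclipped, min=-1.0, max=1.0)
    obj = torch.where(outside, ppo_obj + gentle, ppo_obj)
    return obj.mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    lp_new = torch.randn(6, requires_grad=True)
    lp_old, adv = torch.randn(6), torch.randn(6)
    loss = -ce_gppo_objective(lp_new, lp_old, adv)
    loss.backward()
    print(lp_new.grad)
```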

论文及项目相关链接

PDF

Summary
强化学习在处理大型语言模型以执行复杂推理任务时表现出强大的优势。核心挑战在于管理策略熵,即训练过程中的探索与利用之间的平衡。现有方法如近端策略优化(PPO)及其变体由于裁剪机制而丢弃了低概率标记的有价值梯度信号。系统分析表明,这些被裁剪的标记在调节熵演化中扮演着至关重要的角色却常被忽视。为此,我们提出一种通过梯度保留策略优化控制熵的新算法(CE-GPPO),它以温和和有限的方式重新引入原生PPO中被裁剪的标记的梯度。通过控制来自剪辑区间外标记的梯度幅度,CE-GPPO能够实现探索与利用的权衡。理论验证和实验证据表明,CE-GPPO有效地缓解了熵的不稳定性。在数学推理基准测试上的广泛实验表明,CE-GPPO在不同模型规模上均显著优于强大基线。

Key Takeaways

  1. 强化学习是优化大型语言模型处理复杂推理任务的重要范式。
  2. 策略熵管理是强化学习中的核心挑战,涉及探索与利用之间的平衡。
  3. 现有方法如PPO因裁剪机制忽视了低概率标记的重要性。
  4. 被裁剪的标记在调节熵演化中起关键作用。
  5. 提出的新算法CE-GPPO旨在通过梯度保留策略优化控制熵。
  6. CE-GPPO以温和和有限的方式重新引入原生PPO中被裁剪标记的梯度。

Cool Papers

点此查看论文截图

PEPS: Quantum-Inspired Reinforcement Learning for Coherent Reasoning Traces in LLMs

Authors:Venkat Margapuri, Garik Kazanjian, Naren Kosaraju

Large Language Models (LLMs) often struggle with maintaining coherent multi-step reasoning traces, particularly in tasks that require a structured logical flow. This work introduces a quantum-inspired approach to address the challenge by incorporating a fidelity-based reward derived from Projected Entangled Pair States (PEPS) into Proximal Policy Optimization. Unlike prior approaches that use direct supervision or contrastive objectives, the proposed method guides learning through structural consistency, offering a novel approach to enforce global coherence in generated reasoning traces. The proposed framework is evaluated using multiple coherence-determining metrics on diverse datasets such as GSM8K, StrategyQA, and EntailmentBank spanning arithmetic, intuitive, and entailment-based reasoning. Results show that the proposed quantum-inspired approach offers significant improvements over supervised, contrastive, and pretrained baseline approaches, highlighting the effectiveness of quantum-inspired fidelity as a foundation to improve reasoning trace coherence in LLMs.

大型语言模型(LLM)在维持连贯的多步骤推理轨迹方面经常遇到困难,特别是在需要结构化逻辑流的任务中。这项工作通过引入一种受量子启发的解决方案来解决这一挑战,将基于投影纠缠对态(PEPS)的保真度奖励纳入近端策略优化中。与之前使用直接监督或对比目标的方法不同,所提出的方法通过结构一致性来指导学习,为强制生成推理轨迹的全局连贯性提供了一种新途径。所提出的框架使用多种连贯性指标,在GSM8K、StrategyQA和EntailmentBank等涵盖算术、直观和基于蕴涵的推理的多个数据集上进行了评估。结果表明,与监督、对比和预训练基线方法相比,所提出的量子启发方法带来了显著提升,突显了量子启发的保真度可以作为改进LLM推理轨迹连贯性的基础。

论文及项目相关链接

PDF

Summary

大型语言模型(LLMs)在多步推理任务中面临保持逻辑连贯性的挑战,特别是在需要结构化逻辑流程的任务中。本研究引入了一种受量子启发的解决方案,通过结合投影纠缠对态(PEPS)衍生的保真度奖励,将其纳入近端策略优化(Proximal Policy Optimization)。与之前使用直接监督或对比目标的方法不同,该方法通过结构一致性来引导学习,为强制生成推理轨迹的全局连贯性提供了新的途径。在GSM8K、StrategyQA和EntailmentBank等涵盖算术、直观和蕴涵推理的多个数据集上,采用多种连贯性指标进行的评估表明,该量子启发方法显著优于监督、对比和预训练基线。这表明量子启发的保真度是提高LLM推理轨迹连贯性的有效基础。

Key Takeaways

  1. 大型语言模型(LLMs)在保持多步推理连贯性上遇到困难,特别是在结构化逻辑流程的任务中。
  2. 本研究提出了一种受量子启发的解决方案,结合投影纠缠对态(PEPS)的保真度奖励,以增强LLMs的推理连贯性。
  3. 该方法通过结构一致性来引导学习,为生成的推理轨迹强制执行全局连贯性。
  4. 相比直接监督或对比目标的方法,该量子启发方法能显著提高推理轨迹的连贯性。
  5. 该方法在多个数据集上进行了评估,包括GSM8K、StrategyQA和EntailmentBank等,涵盖了算术、直观和蕴含推理等多个方面。
  6. 结果显示,量子启发方法显著优于监督、对比和预训练基线方法。

Cool Papers

点此查看论文截图

ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection

Authors:Tai-Ming Huang, Wei-Tung Lin, Kai-Lung Hua, Wen-Huang Cheng, Junichi Yamagishi, Jun-Cheng Chen

The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations, highlighting the urgent need for accurate and interpretable detection methods. While existing approaches have made progress, most rely on binary classification without explanations or depend heavily on supervised fine-tuning, resulting in limited generalization. In this paper, we propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. Our method leverages a Multimodal Large Language Model (MLLM) equipped with a forgery reasoning prompt and is trained using Group Relative Policy Optimization (GRPO) reinforcement learning with carefully designed reward functions. This design enables the model to perform step-by-step reasoning and produce interpretable, structured outputs. We further introduce a structured detection pipeline to enhance reasoning quality and adaptability. Extensive experiments show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark. These results validate our framework’s effectiveness and robustness. Code will be released upon acceptance.

人工智能生成的图像越来越逼真,引发了人们对误导信息和侵犯隐私的严重担忧,这凸显了对准确且可解释的检测方法的迫切需求。尽管现有方法已经取得了一些进展,但大多数方法依赖于没有解释的二元分类或严重依赖于监督微调,导致泛化能力有限。在本文中,我们提出了ThinkFake,这是一个基于推理和可泛化的检测人工智能生成图像的新框架。我们的方法利用配备伪造推理提示的多模态大型语言模型(MLLM),并使用经过精心设计的奖励函数采用群体相对策略优化(GRPO)强化学习进行训练。这种设计使模型能够逐步进行推理,并产生可解释的结构化输出。我们还引入了一个结构化检测管道,以提高推理质量和适应性。大量实验表明,ThinkFake在GenImage基准测试上优于最新方法,并在具有挑战性的LOKI基准测试上表现出强大的零样本泛化能力。这些结果验证了我们的框架的有效性和稳健性。代码将在接受后发布。
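下面用 Python 写一个规则化复合奖励的极简示意,体现此类 GRPO 训练中常见的“格式奖励 + 判别正确性奖励”设计;标签 `<think>`/`<answer>`、各项权重均为本文假设,并非论文公开的奖励定义:

```python
import re

def format_reward(output: str) -> float:
    """要求输出同时包含思考段与结论段(示意)。"""
    ok = bool(re.search(r"<think>.+?</think>", output, re.S)) and \
         bool(re.search(r"<answer>(real|fake)</answer>", output))
    return 1.0 if ok else 0.0

def correctness_reward(output: str, label: str) -> float:
    """结论与真实标签一致则给满分(示意)。"""
    m = re.search(r"<answer>(real|fake)</answer>", output)
    return 1.0 if m and m.group(1) == label else 0.0

def total_reward(output: str, label: str, w_fmt=0.2, w_acc=0.8) -> float:
    return w_fmt * format_reward(output) + w_acc * correctness_reward(output, label)

if __name__ == "__main__":
    out = "<think>纹理过于平滑,光影不一致。</think><answer>fake</answer>"
    print(total_reward(out, "fake"))   # 1.0:格式正确且判别正确
    print(total_reward(out, "real"))   # 0.2:仅格式奖励
```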

论文及项目相关链接

PDF

Summary

随着AI生成图像现实感的增强,关于虚假信息和隐私泄露的担忧日益加剧,凸显出对准确且可解释的检测方法的迫切需求。现有方法虽有所进展,但多数依赖于无解释的二元分类或依赖监督微调,导致泛化能力有限。本文提出ThinkFake,一种基于推理、可泛化的AI生成图像检测新框架。该方法利用配备伪造推理提示的多模态大型语言模型(MLLM),采用组相对策略优化(GRPO)强化学习并配合精心设计的奖励函数进行训练。此设计使模型能够进行逐步推理,产生可解释的结构化输出。再引入结构化检测管道,提高推理质量和适应性。ThinkFake在GenImage基准上优于现有最优方法,并在更具挑战性的LOKI基准上展现出强大的零样本泛化能力,验证了其有效性和稳健性。

Key Takeaways

  1. AI生成图像现实感的提升引发了关于虚假信息和隐私泄露的担忧。
  2. 现有图像检测方法的泛化能力有限,多数依赖于无解释的二元分类或监督微调。
  3. ThinkFake框架是一种基于推理的AI生成图像检测新方法,具有更好的泛化能力。
  4. ThinkFake利用多模态大型语言模型(MLLM)和组相对策略优化(GRPO)强化学习进行训练。
  5. 精心设计奖励函数使模型能够进行逐步推理,产生可解释的结构化输出。
  6. ThinkFake引入结构化检测管道,旨在提高推理质量和适应性。

Cool Papers

点此查看论文截图

bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

Authors:Wence Ji, Jiancan Wu, Aiying Li, Shuyi Zhang, Junkang Wu, An Zhang, Xiang Wang, Xiangnan He

With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers–such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF)–each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.

随着大型语言模型(LLM)的快速发展,它们抵御对抗性操纵(尤其是越狱后门攻击)的稳健性变得至关重要。现有的嵌入越狱触发器的方法,如监督微调(SFT)、模型编辑和基于人类反馈的强化学习(RLHF),都存在着诸如泛化能力差、隐蔽性受损或生成的越狱响应上下文可用性降低等局限性。为了克服这些问题,我们提出了双向组相对策略优化(bi-GRPO),这是一个专门为越狱后门注入定制的新型基于强化学习的框架。通过采用成对rollout和成对奖励,bi-GRPO联合优化模型,使其在出现触发器时可靠地产生有害内容,而在其他情况下保持安全。我们的方法利用基于规则的奖励机制,辅以长度和格式激励,消除了对高质量监督数据集或可能存在缺陷的奖励模型的依赖。大量实验表明,bi-GRPO实现了卓越的有效性(>99%的攻击成功率),在非触发场景中保持了隐蔽性,并产生了高度可用和连贯的越狱响应,从而在越狱后门攻击方面取得了最先进的进展。

论文及项目相关链接

PDF

Summary

大型语言模型(LLM)抵御对抗性操纵(特别是jailbreak后门攻击)的稳健性至关重要。现有嵌入jailbreak触发器的方法存在局限性。为此,我们提出了基于强化学习的双向组相对策略优化(bi-GRPO)框架。它通过采用成对rollout和成对奖励,联合优化模型以在出现触发器时可靠地产生有害内容,并在其他情况下保持安全。实验证明,bi-GRPO在jailbreak后门攻击方面实现了卓越的效果,攻击成功率超过99%。在保证隐蔽性的同时,生成的响应也高度可用和连贯。这显著提升了jailbreak后门攻击的最新技术水平。

Key Takeaways

  • 大型语言模型(LLM)对抗jailbreak后门攻击的稳健性至关重要。
  • 现有嵌入jailbreak触发器的方法(如监督微调(SFT)、模型编辑和基于人类反馈的强化学习(RLHF))各自存在局限性。

Cool Papers

点此查看论文截图

NGRPO: Negative-enhanced Group Relative Policy Optimization

Authors:Gongrui Nan, Siye Chen, Jing Huang, Mengyu Lu, Dexun Wang, Chunmei Xie, Weiqi Xiong, Xianzhou Zeng, Qixuan Zhou, Yadong Li, Xingzhong Xu

RLVR has enhanced the reasoning capabilities of Large Language Models (LLMs) across various tasks. However, GRPO, a representative RLVR algorithm, suffers from a critical limitation: when all responses within a group are either entirely correct or entirely incorrect, the model fails to learn from these homogeneous responses. This is particularly problematic for homogeneously incorrect groups, where GRPO’s advantage function yields a value of zero, leading to null gradients and the loss of valuable learning signals. To overcome this issue, we propose NGRPO (Negative-enhanced Group Relative Policy Optimization), an algorithm designed to convert homogeneous errors into robust learning signals. First, NGRPO introduces Advantage Calibration. This mechanism hypothesizes the existence of a virtual maximum-reward sample during advantage calculation, thereby altering the mean and variance of rewards within a group and ensuring that the advantages for homogeneously incorrect samples are no longer zero. Second, NGRPO employs Asymmetric Clipping, which relaxes the update magnitude for positive samples while imposing stricter constraints on that of negative samples. This serves to stabilize the exploration pressure introduced by the advantage calibration. Our experiments on Qwen2.5-Math-7B demonstrate that NGRPO significantly outperforms baselines such as PPO, GRPO, DAPO, and PSR-NSR on mathematical benchmarks including MATH500, AMC23, and AIME2025. These results validate NGRPO’s ability to learn from homogeneous errors, leading to stable and substantial improvements in mathematical reasoning. Our code is available at https://github.com/nangongrui-ngr/NGRPO.

RLVR已经提高了大型语言模型(LLM)在各种任务中的推理能力。然而,作为RLVR算法的代表,GRPO存在一个严重的局限性:当一组中的所有响应都是完全正确或完全错误时,模型无法从这些同质的响应中学习。这对于全部答错的组来说特别成问题,因为GRPO的优势函数会产生零值,导致梯度为零,丢失了有价值的学习信号。为了克服这个问题,我们提出了NGRPO(负增强组相对策略优化)算法,该算法旨在将同质错误转化为稳健的学习信号。首先,NGRPO引入了优势校准。该机制假设在优势计算过程中存在一个虚拟最大奖励样本,从而改变组内的奖励均值和方差,并确保全部答错的样本组的优势不再为零。其次,NGRPO采用不对称裁剪,对正样本的更新幅度进行放松,同时对负样本的更新幅度施加更严格的约束。这有助于稳定由优势校准引入的探索压力。我们的实验结果表明,在Qwen2.5-Math-7B上,NGRPO在数学基准测试MATH500、AMC23和AIME2025上的表现明显优于PPO、GRPO、DAPO和PSR-NSR等基线算法。这些结果验证了NGRPO从同质错误中学习的能力,使数学推理获得了稳定且显著的提升。我们的代码可在https://github.com/nangongrui-ngr/NGRPO中找到。
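下面用 NumPy 对“优势校准(引入虚拟最大奖励样本)”与“不对称裁剪”两个机制做数值示意;虚拟奖励取值、正负样本的裁剪范围等均为本文假设的超参数,具体公式以论文与官方代码为准:

```python
import numpy as np

def calibrated_advantages(rewards, r_max=1.0):
    """优势校准(示意):在计算组内均值/方差时加入一个虚拟的最大奖励样本,
    使得“全部答错”的组不再得到全零优势。"""
    group = np.append(np.asarray(rewards, dtype=float), r_max)  # 引入虚拟样本
    mean, std = group.mean(), group.std() + 1e-6
    return (np.asarray(rewards, dtype=float) - mean) / std      # 只为真实样本返回优势

def asymmetric_clip(ratio, adv, eps_pos=0.3, eps_neg=0.1):
    """不对称裁剪(示意):正优势样本放宽更新幅度,负优势样本约束更严。"""
    clipped = np.where(adv >= 0,
                       np.clip(ratio, 1 - eps_pos, 1 + eps_pos),
                       np.clip(ratio, 1 - eps_neg, 1 + eps_neg))
    return np.minimum(ratio * adv, clipped * adv)

if __name__ == "__main__":
    print(calibrated_advantages([0.0, 0.0, 0.0, 0.0]))  # 全错组:优势为负而非全零
    print(asymmetric_clip(np.array([1.5, 0.6]), np.array([0.8, -0.5])))
```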

论文及项目相关链接

PDF

Summary

RLVR技术增强了大型语言模型(LLM)的推理能力,但在处理某些任务时存在局限。当一组回应完全正确或完全错误时,代表性算法GRPO无法从中学习。为解决此问题,提出NGRPO算法,通过优势校准和不对称裁剪机制,将同质化错误转化为稳健的学习信号。实验证明,NGRPO在数学推理任务上显著优于基线方法,如PPO、GRPO、DAPO和PSR-NSR。代码已公开。

Key Takeaways

  1. RLVR技术增强了大型语言模型的推理能力。
  2. GRPO算法在处理完全正确或错误的回应组时存在局限。
  3. NGRPO算法旨在解决GRPO的局限性,通过优势校准和不对称裁剪机制将同质化错误转化为学习信号。
  4. NGRPO在数学推理任务上表现出显著优势,优于多种基线方法。
  5. NGRPO的优势在于其能够利用同质化错误进行稳定且实质性的改进。
  6. 公开了NGRPO算法的源代码,便于他人使用和研究。

Cool Papers

点此查看论文截图

MAPO: Mixed Advantage Policy Optimization

Authors:Wenke Huang, Quan Zhang, Yiyang Fang, Jian Liang, Xuankun Rong, Huanjin Yao, Guancheng Wan, Ke Liang, Wenwen He, Mingjun Li, Leszek Rutkowski, Mang Ye, Bo Du, Dacheng Tao

Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking the trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror problems, which hinder the reasonable advantage allocation across different query samples. In this work, we propose an easy but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that the trajectory appears with different certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.

关于基础模型的强化学习最新进展,如组相对策略优化(GRPO),已经显著提高了基础模型在推理任务上的性能。值得注意的是,优势函数在GRPO中作为排名轨迹重要性的核心机制。然而,现有的探索既遇到了优势反转也遇到了优势镜像问题,这两个问题阻碍了不同查询样本之间的合理优势分配。在这项工作中,我们提出了一种简单有效的GRPO策略,即混合优势策略优化(MAPO)。我们发现轨迹表现出不同的确定性,并为高确定性轨迹的样本提出了优势百分比偏差。此外,我们根据样本轨迹的确定性动态调整优势函数的权重,从而自适应地配置优势函数以考虑样本特定的特性。与相关领域最新方法的比较以及对不同优势变体的消融研究验证了我们的方法的有效性。
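下面给出本文对摘要的一种示意性解读(Python):用组内正确率近似“轨迹确定性”,高确定性时更多采用“优势百分比偏差”,并按确定性在两种优势形式之间连续加权;这只是说明性的写法,并非论文中混合优势的精确定义:

```python
import numpy as np

def mixed_advantage(rewards, certainty):
    """rewards: 同一问题下各条轨迹的奖励;certainty: 轨迹确定性(0~1,示意取组内正确率)。"""
    r = np.asarray(rewards, dtype=float)
    z_adv = (r - r.mean()) / (r.std() + 1e-6)            # 标准的组相对优势
    pct_adv = (r - r.mean()) / (abs(r.mean()) + 1e-6)    # 优势百分比偏差(示意形式)
    w = float(np.clip(certainty, 0.0, 1.0))              # 确定性越高,越偏向百分比偏差
    return w * pct_adv + (1.0 - w) * z_adv

if __name__ == "__main__":
    print(mixed_advantage([1, 1, 1, 0], certainty=0.75))   # 高确定性问题
    print(mixed_advantage([1, 0, 0, 0], certainty=0.25))   # 低确定性问题
```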

论文及项目相关链接

PDF

Summary
强化学习在基础模型领域的新进展,如组相对策略优化(GRPO),已显著提高模型在推理任务上的性能。然而,现有探索面临优势反转和优势镜像问题,这阻碍了不同查询样本之间的合理优势分配。本研究提出了一种简单有效的MAPO策略,通过为具有高确定性轨迹的样本引入优势百分比偏差,并动态调整优势函数以考虑样本的特定特征来解决这些问题。

Key Takeaways

  • 强化学习在基础模型上的新进展提高了模型在推理任务上的性能。
  • 组相对策略优化(GRPO)是强化学习的一种重要方法,但存在优势反转和优势镜像问题。
  • MAPO策略被提出以解决GRPO中的问题,通过引入优势百分比偏差和动态调整优势函数来合理分配优势。
  • 高确定性轨迹的样本在MAPO中得到了特别关注。
  • MAPO策略通过自适应配置优势函数来适应样本的特定特征。

Cool Papers

点此查看论文截图

Live-E2T: Real-time Threat Monitoring in Video via Deduplicated Event Reasoning and Chain-of-Thought

Authors:Yuhan Wang, Cheng Liu, Zihan Zhao, Weichao Wu

Real-time threat monitoring identifies threatening behaviors in video streams and provides reasoning and assessment of threat events through explanatory text. However, prevailing methodologies, whether based on supervised learning or generative models, struggle to concurrently satisfy the demanding requirements of real-time performance and decision explainability. To bridge this gap, we introduce Live-E2T, a novel framework that unifies these two objectives through three synergistic mechanisms. First, we deconstruct video frames into structured Human-Object-Interaction-Place semantic tuples. This approach creates a compact, semantically focused representation, circumventing the information degradation common in conventional feature compression. Second, an efficient online event deduplication and updating mechanism is proposed to filter spatio-temporal redundancies, ensuring the system’s real time responsiveness. Finally, we fine-tune a Large Language Model using a Chain-of-Thought strategy, endow it with the capability for transparent and logical reasoning over event sequences to produce coherent threat assessment reports. Extensive experiments on benchmark datasets, including XD-Violence and UCF-Crime, demonstrate that Live-E2T significantly outperforms state-of-the-art methods in terms of threat detection accuracy, real-time efficiency, and the crucial dimension of explainability.

实时威胁监控能够在视频流中识别威胁行为,并通过解释性文本对威胁事件进行推理和评估。然而,无论是基于监督学习还是生成模型的主流方法,都很难同时满足实时性能和决策可解释性的苛刻要求。为了弥合这一差距,我们引入了Live-E2T,这是一个通过三种协同机制统一这两个目标的新型框架。首先,我们将视频帧解构为结构化的人-对象-交互-场所语义元组。这种方法创建了紧凑、语义聚焦的表示形式,避免了传统特征压缩中的信息退化。其次,提出了一种高效的在线事件去重和更新机制,以过滤时空冗余信息,确保系统的实时响应能力。最后,我们使用思维链策略对大型语言模型进行微调,赋予其在事件序列上进行透明和逻辑推理的能力,以生成连贯的威胁评估报告。在包括XD-Violence和UCF-Crime在内的基准数据集上的广泛实验表明,Live-E2T在威胁检测准确性、实时效率和关键的可解释性方面显著优于最新方法。
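下面用 Python 演示“在线事件去重与更新”的一种极简实现思路:以(人、物、交互、地点)语义元组为键,时间窗口内重复出现的事件只刷新时间戳、不重复下发给后续的 LLM 推理;窗口大小与数据结构均为本文假设:

```python
import time
from typing import Dict, Tuple

Event = Tuple[str, str, str, str]  # (人, 物, 交互, 地点)

class EventDeduplicator:
    def __init__(self, window_sec: float = 5.0):
        self.window = window_sec
        self.last_seen: Dict[Event, float] = {}

    def offer(self, event: Event, now=None) -> bool:
        """返回True表示这是窗口内的新事件,应继续交给LLM做推理;
        返回False表示时空冗余,仅刷新时间戳。"""
        now = time.time() if now is None else now
        last = self.last_seen.get(event)
        self.last_seen[event] = now
        return last is None or (now - last) > self.window

if __name__ == "__main__":
    dedup = EventDeduplicator(window_sec=5.0)
    e = ("man_01", "backpack", "carrying", "gate_A")
    print(dedup.offer(e, now=0.0))   # True:首次出现
    print(dedup.offer(e, now=2.0))   # False:窗口内重复
    print(dedup.offer(e, now=9.0))   # True:超出窗口,视为新事件
```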

论文及项目相关链接

PDF

Summary
实时威胁监测能够通过解释性文本对视频流中的威胁行为进行分析和评估。然而,现有的基于监督学习或生成模型的方法难以满足实时性能和决策可解释性的要求。为此,我们提出了Live-E2T框架,通过三种协同机制实现了这两个目标。首先,我们将视频帧解构为结构化的人-对象-交互-场所语义元组,形成紧凑、语义聚焦的表示。其次,提出了高效的在线事件去重和更新机制,保证系统的实时响应能力。最后,我们采用链式思维策略微调大型语言模型,使其具备对事件序列进行透明、逻辑推理的能力,生成连贯的威胁评估报告。在XD-Violence和UCF-Crime等基准数据集上的广泛实验表明,Live-E2T在威胁检测准确性、实时效率和关键的可解释性方面显著优于现有方法。

Key Takeaways

  1. 实时威胁监测能够通过解释性文本对视频中的威胁行为进行分析和评估。
  2. 现有方法难以满足实时性能和决策可解释性的要求。
  3. Live-E2T框架通过解构视频帧、去重更新机制和大型语言模型的微调,实现了威胁检测、实时性能和解释性的提升。
  4. 视频帧被解构为结构化的人-对象-交互-场所语义元组,形成紧凑的语义表示。
  5. 高效的在线事件去重和更新机制确保系统实时响应。
  6. 采用链式思维策略微调大型语言模型,使其具备对事件序列进行透明逻辑推理的能力。

Cool Papers

点此查看论文截图

No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

Authors:Seungyoun Shin, Dongha Ahn, Jiwoo Kim, Sungwook Jeon

Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for \textit{prosody}, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech; adding speaker-similarity further destabilizes training and degrades CER. We address this with an \textit{iterative Direct Preference Optimization (DPO)} scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model. On \textbf{KoCC-TTS}, a curated dataset of authentic Korean call center interactions capturing task-oriented dialogues, our method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, \textit{human preference optimization} offers a practical and data-efficient path to natural and robust TTS. The demo page is available at \href{https://tts.ch.dev}

近期的工作报告表明,在神经文本到语音(TTS)领域,采用组相对策略优化(GRPO)取得了一定的进展。然而,由于缺乏可验证的韵律奖励,基于转录导向信号(CER/NLL)训练的GRPO虽然降低了错误率,却会使韵律坍缩为单调、不自然的语音;增加说话人相似性会进一步使训练不稳定并使CER变差。我们采用了一种迭代式的直接偏好优化(DPO)方案,该方案每轮仅使用数百个人工标注的偏好对,直接优化韵律的自然度,同时向当前模型进行正则化。在KoCC-TTS(一个由真实韩国呼叫中心任务导向对话构建的精选数据集)上,我们的方法达到了最高的人类偏好(ELO),CER具有竞争力,超越了GRPO和强大的商业基线。这些结果表明,当韵律不能自动奖励时,人工偏好优化为自然和稳健的TTS提供了一条实用且数据高效的道路。演示页面可访问:https://tts.ch.dev。
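下面给出标准 DPO 损失的 PyTorch 示意;该公式是 DPO 的通用形式,论文在此基础上按轮迭代并使用韵律偏好数据,系数 `beta` 为示意取值,并非论文的具体配置:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_w / logp_l:当前模型对偏好样本(更自然)与非偏好样本的序列对数似然;
    ref_logp_*:参考模型(上一轮模型)的对应值,起正则化作用。"""
    chosen = beta * (logp_w - ref_logp_w)
    rejected = beta * (logp_l - ref_logp_l)
    return -F.logsigmoid(chosen - rejected).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    lw = torch.randn(4, requires_grad=True)
    ll = torch.randn(4, requires_grad=True)
    rw, rl = torch.randn(4), torch.randn(4)
    loss = dpo_loss(lw, ll, rw, rl)
    loss.backward()
    print(float(loss))
```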

论文及项目相关链接

PDF submitted to ICASSP 2026

Summary
神经文本转语音(TTS)领域近期采用群组相对策略优化(GRPO)取得了进展。然而,在缺乏可验证的韵律奖励的情况下,GRPO在面向转录的信号(CER/NLL)上进行训练,虽然降低了错误率,但生成的语音缺乏韵律,呈现单调不自然的状态。添加说话人相似性会进一步破坏训练并使CER变差。本研究通过迭代直接偏好优化(DPO)方案解决这一问题,该方案每轮仅使用几百个人工标注的偏好对,直接优化韵律的自然度,同时向当前模型进行正则化。在真实的韩国呼叫中心互动数据集KoCC-TTS上,我们的方法达到了最高的人类偏好(ELO),在CER上具有竞争力,优于GRPO和强大的商业基线。这些结果暗示,当韵律无法自动奖励时,人类偏好优化为自然和稳健的TTS提供了一条实用且数据高效的道路。

Key Takeaways

  1. Group Relative Policy Optimization (GRPO) 在神经文本转语音(TTS)中的应用虽然降低了错误率,但导致生成的语音缺乏韵律,呈现单调不自然的状态。
  2. 在缺乏韵律奖励的情况下,添加说话人相似性会进一步破坏训练并降低性能。
  3. 迭代直接偏好优化(DPO)方案被提出来解决这一问题,该方案每轮仅使用少量人工标注的偏好对,直接优化韵律的自然度,并向当前模型进行正则化。
  4. 在真实的韩国呼叫中心互动数据集KoCC-TTS上进行的实验表明,DPO方法达到了最高的人类偏好评分(ELO),且在字符错误率(CER)上具有竞争力。
  5. DPO方法优于GRPO和强大的商业基线,暗示了人类偏好优化对于创建自然和稳健的TTS的重要性。
  6. 研究结果提供了一种数据高效的方法,即当韵律无法自动奖励时,可以通过人类偏好优化来实现。

Cool Papers

点此查看论文截图

APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation

Authors:Yuzhen Zhou, Jiajun Li, Yusheng Su, Gowtham Ramesh, Zilin Zhu, Xiang Long, Chenyang Zhao, Jin Pan, Xiaodong Yu, Ze Wang, Kangrui Du, Jialian Wu, Ximeng Sun, Jiang Liu, Qiaolin Yu, Hao Chen, Zicheng Liu, Emad Barsoum

Reinforcement learning (RL) has become a cornerstone in advancing large-scale pre-trained language models (LLMs). Successive generations, including GPT-o series, DeepSeek-R1, Kimi-K1.5, Grok 4, and GLM-4.5, have relied on large-scale RL training to enhance reasoning and coding capabilities. To meet the community’s growing RL needs, numerous RL frameworks have been proposed. However, RL training remains computationally expensive, with rollout generation accounting for more than 90% of total runtime. In addition, its efficiency is often constrained by the long-tail distribution of rollout response lengths, where a few lengthy responses stall entire batches, leaving GPUs idle and underutilized. As model and rollout sizes continue to grow, this bottleneck increasingly limits scalability. To address this challenge, we propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency. In the rollout phase, APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps. This strategy ensures that no rollouts are discarded while substantially reducing GPU idle time. Experiments show that APRIL improves rollout throughput by at most 44% across commonly used RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and achieves at most 8% higher final accuracy across tasks. Moreover, APRIL is both framework and hardware agnostic, already integrated into the slime RL framework, and deployable on NVIDIA and AMD GPUs alike. Taken together, this work unifies system-level and algorithmic considerations in proposing APRIL, with the aim of advancing RL training efficiency and inspiring further optimizations in RL systems. Our codebase is available at https://github.com/RLsys-Foundation/APRIL

强化学习(RL)已成为推动大规模预训练语言模型(LLM)发展的核心。包括GPT-o系列、DeepSeek-R1、Kimi-K1.5、Grok 4和GLM-4.5等后续几代都依赖于大规模RL训练来提高推理和编码能力。为了满足社区日益增长的RL需求,已经提出了许多RL框架。然而,RL训练的计算成本仍然很高,rollout生成占总运行时间的90%以上。此外,其效率通常受到rollout响应长度长尾分布的限制,其中一些冗长的响应使整个批次陷入停滞,导致GPU空闲和利用率低下。随着模型和rollout规模的持续增长,这个瓶颈越来越限制了可扩展性。为了应对这一挑战,我们提出了强化学习中的主动部分rollout(APRIL),缓解了长尾低效问题。在rollout阶段,APRIL过度提供rollout请求,达到目标响应数后终止,并将未完成的响应回收用于未来步骤的继续。这一策略确保了不会丢弃任何rollout,同时大大减少GPU空闲时间。实验表明,APRIL在常用的RL算法(GRPO、DAPO、GSPO)中提高了最多44%的rollout吞吐量,加速了收敛,并在各项任务中实现了最多8%的更高最终精度。此外,APRIL具有框架和硬件无关性,已经集成到slime RL框架中,可在NVIDIA和AMD的GPU上部署。综上所述,APRIL的工作结合了系统层面和算法层面的考虑,旨在提高RL训练效率,并激发RL系统的进一步优化。我们的代码库可在https://github.com/RLsys-Foundation/APRIL找到。
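下面用 Python 写一个非常简化的调度示意:超额发起 rollout,收齐目标数量的完整响应后立即停止本步,并把未完成的响应放回缓冲区供下一步继续生成;其中 `generate_step` 是假设的“单步续写”接口,仅用于说明思路,并非 slime 或论文代码库的真实 API:

```python
import random

def april_rollout(prompts, target_n, over_provision=1.5, max_chunk=64, carry=None):
    """返回(完整响应列表, 未完成响应缓冲区)。carry为上一步遗留的半成品。"""
    def generate_step(state):  # 假设接口:续写一段token,随机决定是否结束
        state["tokens"] += random.randint(1, max_chunk)
        state["done"] = random.random() < 0.3
        return state

    active = list(carry or [])
    need = int(target_n * over_provision) - len(active)   # 超额发起rollout请求
    active += [{"prompt": p, "tokens": 0, "done": False} for p in prompts[:max(need, 0)]]

    finished = []
    while len(finished) < target_n and active:
        active = [generate_step(s) for s in active]
        finished += [s for s in active if s["done"]]
        active = [s for s in active if not s["done"]]
    # 达到目标数量即提前终止;未完成的响应不丢弃,留给下一步继续
    return finished[:target_n], active

if __name__ == "__main__":
    random.seed(0)
    done, leftover = april_rollout([f"q{i}" for i in range(16)], target_n=4)
    print(len(done), "条完成;", len(leftover), "条未完成将在下一步继续")
```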

论文及项目相关链接

PDF

Summary

强化学习(RL)在大规模预训练语言模型(LLM)的进步中起到关键作用。然而,RL训练计算成本高,面临rollout生成效率低下的问题。本文提出Active Partial Rollouts in Reinforcement Learning(APRIL)方法:在rollout阶段超额发起请求,达到目标响应数量后立即终止,并将未完成的响应回收供后续步骤继续生成,从而减少GPU空闲时间、提高训练效率。实验证明APRIL能提高常见RL算法的吞吐量,加速收敛,提高任务最终精度。APRIL框架已集成到slime RL框架中,并可在NVIDIA和AMD GPU上部署。本研究旨在推动RL训练效率的提升,为RL系统的进一步优化提供灵感。

Key Takeaways

  1. 强化学习在预训练语言模型的发展中起到了关键作用。
  2. 当前强化学习面临的挑战包括高计算成本和rollout生成效率低下的问题。
  3. 提出了一种新的方法APRIL,通过主动部分rollout策略提高了训练效率。
  4. APRIL通过回收未完成的响应并用于后续步骤,减少了GPU的空闲时间。
  5. 实验结果显示APRIL能提高训练吞吐量、加速收敛并提升任务精度。
  6. APRIL框架具有普遍适用性,可集成到不同的RL框架和硬件平台上。

Cool Papers

点此查看论文截图

CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density

Authors:Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT’s core dimensions: intrinsic difficulty ($d$) controls intrinsic load; distractor-to-signal ratio ($\rho$) regulates extraneous load; and task length ($N$) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.

当前大型语言模型(LLM)的长上下文推理基准测试往往混淆了内在任务复杂性、干扰项干扰和任务长度等关键因素。为了进行更精确的失败分析,我们引入了CogniLoad,这是一个基于认知负荷理论(CLT)的新型合成基准测试。CogniLoad生成自然语言逻辑谜题,具有可独立调整的参数,反映了CLT的核心维度:内在难度(d)控制内在负荷;干扰项与信号比例(ρ)调节外在负荷;任务长度(N)则作为对需要相关负荷的条件的操作性代理指标。通过对22个最新推理LLM进行评估,CogniLoad揭示了不同的性能敏感性:任务长度是主要的约束条件,模型对内在复杂度的容忍度各不相同,对干扰项比例则呈U型响应。通过对这些认知负荷维度进行系统的因子化控制,CogniLoad提供了一个可重复、可扩展且诊断丰富的工具,用于剖析LLM的推理局限性并指导未来的模型开发。
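下面是一个按内在难度 d、干扰比 ρ、任务长度 N 三个旋钮生成简化逻辑谜题的示意脚本;谜题模板是本文虚构的玩具版本,仅用于说明“可独立调节认知负荷维度”的思路,并非 CogniLoad 的真实生成器:

```python
import random

def make_puzzle(d: int, rho: float, n: int, seed: int = 0):
    """d: 推理链深度;rho: 干扰句与信号句的比例;n: 目标语句总数(近似)。"""
    rng = random.Random(seed)
    names = [f"人物{i}" for i in range(d + 1)]
    # 信号句:构成一条长度为d的传递链(人物0比人物1高,人物1比人物2高,...)
    signal = [f"{names[i]}比{names[i+1]}高。" for i in range(d)]
    # 干扰句:与问题无关的属性描述,数量由rho控制,并截断到总长约为n
    n_distract = min(int(len(signal) * rho), max(n - len(signal), 0))
    distract = [f"人物{rng.randrange(d + 1)}喜欢颜色{rng.randrange(10)}。"
                for _ in range(n_distract)]
    statements = signal + distract
    rng.shuffle(statements)
    question = f"{names[0]}和{names[d]}谁更高?"
    return statements, question, names[0]  # 语句、问题、正确答案

if __name__ == "__main__":
    stmts, q, ans = make_puzzle(d=3, rho=2.0, n=12)
    print("\n".join(stmts))
    print(q, "->", ans)
```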

论文及项目相关链接

PDF 29 pages (main: 12 + supplemental material: 17), 6 figures, 4 tables, Code: https://github.com/kaiserdan/cogniload, Data: https://huggingface.co/datasets/cogniloadteam/cogniload

Summary

基于认知负荷理论(CLT),我们提出了一种新型合成基准测试CogniLoad,用于更精确地分析大型语言模型(LLM)在长文本推理中的局限性。该基准测试通过独立调节内在难度、干扰物与信号比例和任务长度等核心维度,生成自然语言逻辑谜题。评估结果显示,任务长度是主导约束条件,而对内在复杂度和干扰物比例的耐受性存在差异。CogniLoad为解析LLM推理能力提供了可复制、可扩展且诊断丰富的工具。

Key Takeaways

  1. CogniLoad是一种基于认知负荷理论(CLT)的合成基准测试,用于评估大型语言模型(LLM)的长文本推理能力。
  2. 该基准测试通过调节内在难度、干扰物与信号比例和任务长度等维度,生成自然语言逻辑谜题。
  3. 任务长度是影响LLM推理性能的主要约束条件。
  4. LLM对内在复杂度和干扰物比例的耐受性存在差异性。
  5. CogniLoad能够系统地分析LLM的推理限制,并为未来模型开发提供指导。
  6. CogniLoad具有可复制性、可扩展性和丰富的诊断功能。

Cool Papers

点此查看论文截图

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Authors:Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1

本文介绍了TempSamp-R1,这是一个新的强化微调框架,旨在提高多模态大型语言模型(MLLMs)适应视频时序定位任务的效果。我们发现现有的强化学习方法,如群体相对策略优化(GRPO),依赖在线策略(on-policy)采样来进行策略更新。然而,在具有大时间搜索空间的任务中,此策略既效率低下,性能也有限,因为它往往无法识别出时间上准确的解。为了解决这个问题,TempSamp-R1利用真实标注作为离线策略(off-policy)监督,提供时间上精确的指导,有效弥补了在线策略解中的稀疏性与错位问题。为了进一步稳定训练并减少基于奖励的更新的方差,TempSamp-R1提供了一种非线性软优势计算方法,该方法通过不对称变换动态地重塑奖励反馈。通过采用混合式的思维链(CoT)训练范式,TempSamp-R1优化了一个单一的统一模型,以支持CoT和非CoT推理模式,从而能够高效处理不同推理复杂度的查询。实验结果表明,TempSamp-R1优于基于GRPO的基线方法,在基准数据集上取得了最新性能:Charades-STA(R1@0.7:52.9%,+2.7%)、ActivityNet Captions(R1@0.5:56.0%,+5.3%)和QVHighlights(mAP:30.0%,+3.0%)。此外,TempSamp-R1在有限数据下表现出稳健的少样本泛化能力。代码地址:https://github.com/HVision-NKU/TempSamp-R1
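下面用 PyTorch 示意“非线性软优势计算”的一种可能形态:先做组内归一化,再用非对称的非线性变换重塑奖励反馈;变换形式与系数 `k_pos`、`k_neg` 均为本文假设,并非论文公式:

```python
import torch

def soft_advantage(rewards, k_pos=0.5, k_neg=1.5):
    """对组内归一化后的优势做非对称的非线性整形(示意)。"""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    adv = (r - r.mean()) / (r.std() + 1e-6)
    shaped = torch.where(adv >= 0,
                         torch.tanh(k_pos * adv),   # 正优势:变换较平缓
                         torch.tanh(k_neg * adv))   # 负优势:变换更陡峭
    return shaped

if __name__ == "__main__":
    print(soft_advantage([1.0, 0.0, 0.0, 1.0, 0.0]))
```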

论文及项目相关链接

PDF Accepted at NeurIPS 2025

Summary

本文介绍了一种名为TempSamp-R1的新型强化精细调整框架,旨在提高多模态大型语言模型(MLLMs)在视频时序定位任务中的效率。针对现有强化学习方法在大型时序搜索空间中的局限性,TempSamp-R1利用真实标注作为离线策略监督,提供精确的时间指导,并引入非线性软优势计算方法,动态调整奖励反馈。结合Chain-of-Thought训练范式,TempSamp-R1优化单一模型,支持不同推理复杂度的查询处理。实验结果显示,TempSamp-R1在Charades-STA、ActivityNet Captions和QVHighlights等基准数据集上实现了最新性能。

Key Takeaways

  1. TempSamp-R1是一个强化精细调整框架,用于改进多模态大型语言模型在视频时序定位任务中的性能。
  2. 现有强化学习方法在大型时序搜索空间中存在局限性,TempSamp-R1通过引入真实标注作为离线策略监督来克服这些局限性。
  3. TempSamp-R1提供精确的时间指导,解决了在线策略解决方案中的稀疏性和不匹配问题。
  4. 通过非线性软优势计算方法,TempSamp-R1动态调整奖励反馈,进一步稳定训练并减少基于奖励的更新的方差。
  5. TempSamp-R1采用混合的Chain-of-Thought训练范式,优化单一模型以支持不同推理复杂度的查询。
  6. 实验结果表明,TempSamp-R1在多个基准数据集上实现了最新性能,包括Charades-STA、ActivityNet Captions和QVHighlights。
  7. TempSamp-R1展现出在有限数据下的稳健的少样本泛化能力。

Cool Papers

点此查看论文截图

MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents

Authors:Pan Tang, Shixiang Tang, Huanqi Pu, Zhiqing Miao, Zhixing Wang

This paper presents MicroRCA-Agent, an innovative solution for microservice root cause analysis based on large language model agents, which constructs an intelligent fault root cause localization system with multimodal data fusion. The technical innovations are embodied in three key aspects: First, we combine the pre-trained Drain log parsing algorithm with multi-level data filtering mechanism to efficiently compress massive logs into high-quality fault features. Second, we employ a dual anomaly detection approach that integrates Isolation Forest unsupervised learning algorithms with status code validation to achieve comprehensive trace anomaly identification. Third, we design a statistical symmetry ratio filtering mechanism coupled with a two-stage LLM analysis strategy to enable full-stack phenomenon summarization across node-service-pod hierarchies. The multimodal root cause analysis module leverages carefully designed cross-modal prompts to deeply integrate multimodal anomaly information, fully exploiting the cross-modal understanding and logical reasoning capabilities of large language models to generate structured analysis results encompassing fault components, root cause descriptions, and reasoning trace. Comprehensive ablation studies validate the complementary value of each modal data and the effectiveness of the system architecture. The proposed solution demonstrates superior performance in complex microservice fault scenarios, achieving a final score of 50.71. The code has been released at: https://github.com/tangpan360/MicroRCA-Agent.

本文介绍了MicroRCA-Agent,这是一个基于大型语言模型代理的微服务根本原因分析的创新解决方案,构建了一个具有多模式数据融合的智能故障根本原因定位系统。技术创新体现在三个方面:首先,我们将预训练的Drain日志解析算法与多级数据过滤机制相结合,有效地将大量日志压缩成高质量故障特征。其次,我们采用了一种双重异常检测方法,将隔离森林无监督学习算法与状态码验证相结合,实现全面的跟踪异常识别。第三,我们设计了一种统计对称比率过滤机制与两阶段LLM分析策略,以实现在节点服务pod层次结构中的全栈现象总结。多模式根本原因分析模块利用精心设计的跨模式提示来深入整合多模式异常信息,充分利用大型语言模型的跨模式理解和逻辑推理能力,生成包含故障组件、根本原因描述和推理跟踪的结构化分析结果。全面的消融研究验证了每种模态数据的互补性以及系统架构的有效性。所提出的解决方案在复杂的微服务故障场景中表现出卓越的性能,最终得分为50.71。代码已发布在:https://github.com/tangpan360/MicroRCA-Agent。
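下面给出“Isolation Forest 无监督检测 + 状态码校验”双路 trace 异常识别的简化示意(使用 scikit-learn);特征选取、污染率与错误码阈值均为本文假设,仅演示两路信号取并集的思路:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def detect_trace_anomalies(latencies_ms, status_codes, contamination=0.1):
    """latencies_ms: 各span的耗时;status_codes: 对应的HTTP/RPC状态码。
    返回布尔数组,True表示该span被判为异常。"""
    X = np.asarray(latencies_ms, dtype=float).reshape(-1, 1)
    iso = IsolationForest(contamination=contamination, random_state=0).fit(X)
    latency_anomaly = iso.predict(X) == -1            # 无监督:耗时分布离群
    status_anomaly = np.asarray(status_codes) >= 500  # 规则:服务端错误码
    return latency_anomaly | status_anomaly

if __name__ == "__main__":
    lat = [12, 15, 13, 14, 980, 11, 16, 13]
    codes = [200, 200, 200, 200, 200, 500, 200, 200]
    print(detect_trace_anomalies(lat, codes))
```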

论文及项目相关链接

PDF 18 pages, 22 figures

Summary
微服务根因分析的新解决方案MicroRCA-Agent,基于大型语言模型代理构建智能故障根因定位系统,采用多模态数据融合技术。该方案结合了预训练的Drain日志解析算法与多级数据过滤机制,并采用双异常检测方法与统计对称比率过滤机制来实现全面且高效的故障特征提取和异常识别。同时,利用大型语言模型的跨模态理解和逻辑推理能力,生成包含故障组件、根本原因描述和推理轨迹的结构化分析结果。该解决方案在复杂的微服务故障场景中表现出卓越性能。

Key Takeaways

  1. MicroRCA-Agent是一个基于大型语言模型代理的智能故障根因分析系统,用于微服务根因分析。
  2. 该系统采用多模态数据融合技术,能够处理多种来源的数据。
  3. 结合预训练的Drain日志解析算法与多级数据过滤机制,能高效压缩大量日志数据以获取高质量的故障特征。
  4. 采用双异常检测方法和统计对称比率过滤机制进行异常识别和现象总结。
  5. 利用大型语言模型的跨模态理解和逻辑推理能力,生成包含故障组件、根本原因描述和推理轨迹的结构化分析结果。
  6. MicroRCA-Agent在复杂的微服务故障场景中展现出卓越性能,最终得分为50.71。

Cool Papers

点此查看论文截图

LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA

Authors:Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Joey Tianyi Zhou, Jianshu Li

As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce \textbf{LaV-CoT}, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2$\times$ larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at this link: \href{https://github.com/HJNVR/LaV-CoT}

随着大型视觉语言模型(VLMs)的进步,其在多语言视觉问答(mVQA)方面的能力得到了显著提高。思维链(CoT)推理已经被证明可以提高解释性和复杂推理能力。然而,大多数现有方法主要依赖于文本CoT,对多语言多模态推理的支持有限,限制了它们在现实世界应用中的部署。为了弥补这一差距,我们引入了\textbf{LaV-CoT},这是第一个具有多方面奖励优化的语言感知视觉CoT框架。LaV-CoT采用了一个可解释的多阶段推理管道,包括带有边界框(BBox)的文本摘要、语言识别、空间对象级描述和逐步逻辑推理。遵循这一推理管道,我们设计了一种自动化数据整理方法,通过迭代生成、修正和细化生成多语言CoT注释,实现可扩展和高质量的训练数据。为了提高推理和泛化能力,LaV-CoT采用了一种两阶段训练范式,结合监督微调(SFT)和语言感知组相对策略优化(GRPO),由包括语言一致性、结构准确性和语义对齐在内的可验证多方面奖励指导。在包括MMMB、多语言MMBench和MTVQA等公共数据集上的广泛评估表明,LaV-CoT相较于类似规模的开源基准测试,准确率提高了高达9.5%,甚至超过了规模大两倍的模型约2.6%。此外,LaV-CoT还超越了先进的专有模型,如GPT-4o-0513和Gemini-2.5-flash。我们进一步进行了在线AB测试,以验证我们的方法在真实世界数据上的有效性,突显其在工业部署中的实用性。我们的代码位于以下链接:[https://github.com/HJNVR/LaV-CoT]

论文及项目相关链接

PDF 12 Pages, 12 Figures, 2 Tables

Summary

随着大型视觉语言模型(VLMs)的发展,其在多语言视觉问答(mVQA)方面的能力得到了显著提升。为解决现有方法在多语言多媒体推理方面的局限性,提出了一种名为LaV-CoT的新框架,通过结合可解释的推理管道和多种奖励优化策略,提高了模型的解释性和复杂推理能力。LaV-CoT在公共数据集上的表现优于其他开源模型和先进专有模型,并已经通过了在线AB测试的验证,证明其在工业部署中的有效性。其代码已在此链接提供:https://github.com/HJNVR/LaV-CoT

Key Takeaways

  1. 大型视觉语言模型在多语言视觉问答方面的能力已显著提升。
  2. LaV-CoT是首个结合语言感知的视觉链式思维框架,支持多方面奖励优化。
  3. LaV-CoT采用可解释的推理管道,包括文本摘要、语言识别、空间对象级描述和逐步逻辑推理。
  4. 通过自动化数据整理方法生成多语言链式思维注释,提高训练和推理质量。
  5. LaV-CoT采用两阶段训练范式,结合监督微调与语言感知群体相对策略优化,以多种可验证的奖励为指导。
  6. LaV-CoT在公共数据集上的表现优于其他模型,并通过在线AB测试验证其在工业部署中的有效性。

Cool Papers

点此查看论文截图

The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization

Authors:Talha Tahir

Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions. This study investigates the impact of post-training methodology and explicit reasoning on the ability of a small open-weight large language model (LLM) to deliver ACT. Using synthetic ACT transcripts generated by Mistral-Large, we trained Llama-3.2-3b-Instruct with two distinct approaches, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. Performance was evaluated by comparing these four post-trained variants against the base Instruct model. These models were benchmarked in simulated therapy sessions, with performance quantitatively assessed on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES) by an LLM judge that had been fine-tuned on human evaluations. Our findings demonstrate that the ORPO-trained models significantly outperformed both their SFT and Instruct counterparts on ACT fidelity ($\chi^2(5) = 185.15, p < .001$) and therapeutic empathy ($\chi^2(5) = 140.37, p < .001$). The effect of COT was conditional as it provided a significant benefit to SFT models, improving ACT-FM scores by an average of 2.68 points ($p < .001$), while offering no discernible advantage to the superior ORPO or instruct-tuned variants. We posit that the superiority of ORPO stems from its ability to learn the therapeutic process' over imitating content,’ a key aspect of ACT, while COT acts as a necessary scaffold for models trained only via imitation. This study establishes that preference-aligned policy optimization can effectively instill ACT competencies in small LLMs, and that the utility of explicit reasoning is highly dependent on the underlying training paradigm.

接纳与承诺疗法(ACT)是一种第三波认知行为疗法,在多种精神疾病中有疗效的证据正在显现。本研究探讨了训练后方法与显式推理对小型开放权重大型语言模型(LLM)执行ACT能力的影响。我们使用由Mistral-Large生成的合成ACT对话转录本,采用两种不同的方法训练了Llama-3.2-3b-Instruct模型,即监督微调(SFT)和赔率比策略优化(ORPO),每种方法都分别带有或不带有显式的思维链(COT)推理步骤。通过将这四个训练后的变体与基础Instruct模型进行比较来评估性能。这些模型在模拟治疗会话中进行基准测试,由一个在人类评估数据上微调过的LLM评判器依据ACT保真度度量(ACT-FM)和治疗师同理心量表(TES)进行定量评估。我们的研究结果表明,通过ORPO训练的模型在ACT保真度方面显著优于SFT和Instruct模型(χ²(5)=185.15,p < .001),并且在治疗同理心方面也表现出显著优势(χ²(5)=140.37,p < .001)。COT的影响是有条件的,因为它对SFT模型提供了显著的好处,ACT-FM得分平均提高了2.68分(p < .001),而对更优的ORPO或指令微调模型没有明显优势。我们认为ORPO的优越性源于其学习治疗“过程”而非模仿“内容”的能力,这是ACT的一个关键方面,而COT则作为仅通过模仿训练的模型的必要支架。本研究表明,偏好对齐的策略优化可以有效地在小规模LLM中植入ACT能力,而显式推理的实用性在很大程度上取决于底层的训练范式。
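下面给出 ORPO(赔率比策略优化)目标函数的 PyTorch 示意:在 SFT 负对数似然的基础上,加入“被选响应与被拒响应的对数赔率比”惩罚项;这是 ORPO 通用公式的简化写法,超参数 `lam` 为示意取值,与本论文的具体训练细节无关:

```python
import torch
import torch.nn.functional as F

def orpo_loss(avg_logp_w, avg_logp_l, lam=0.1):
    """avg_logp_w / avg_logp_l:被选/被拒响应的长度归一化对数似然(均为负数)。"""
    nll = -avg_logp_w.mean()                                   # SFT项:提升被选响应的似然
    # odds(y|x) = P / (1 - P),在对数域计算以保证数值稳定
    log_odds_w = avg_logp_w - torch.log1p(-torch.exp(avg_logp_w))
    log_odds_l = avg_logp_l - torch.log1p(-torch.exp(avg_logp_l))
    or_term = -F.logsigmoid(log_odds_w - log_odds_l).mean()    # 拉大被选与被拒响应的赔率比
    return nll + lam * or_term

if __name__ == "__main__":
    torch.manual_seed(0)
    lw = -(torch.rand(4) * 0.9 + 0.05)   # 被选响应的平均log概率(负数)
    ll = -(torch.rand(4) * 0.9 + 0.5)    # 被拒响应的似然更低
    lw.requires_grad_(True)
    ll.requires_grad_(True)
    loss = orpo_loss(lw, ll)
    loss.backward()
    print(float(loss), lw.grad)
```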

论文及项目相关链接

PDF

Summary

本文研究了接受与承诺疗法(ACT)在小型开放权重大型语言模型(LLM)中的应用效果。实验采用合成ACT语音转录本训练LLM模型,并分别使用两种不同的训练方法——监督微调(SFT)和赔率比策略优化(ORPO),同时探讨了显式链式思维(COT)推理的作用。结果显示,ORPO训练模型在ACT忠诚度和治疗同理心方面显著优于其他模型。此外,COT的影响取决于训练范式,对SFT模型有显著改善作用,但对更高级的ORPO或指令调整模型则没有明显优势。研究认为,ORPO能够学习治疗过程而非单纯模仿内容,这是ACT的核心要素;而COT对于仅通过模仿训练的模型则是必要的支撑。本研究表明,偏好对齐策略优化可以有效赋予小型LLM执行ACT的能力,而显式推理的实用性高度依赖于基础训练范式。

Key Takeaways

  1. ACT是一种新兴有效的认知行为疗法,本研究探讨了其在小型LLM模型中的应用效果。
  2. 采用合成ACT语音转录本对LLM模型进行训练,采用两种训练方法:监督微调(SFT)和赔率比策略优化(ORPO)。
  3. 引入显式链式思维(COT)推理步骤来评估其对模型表现的影响。
  4. ORPO训练模型在ACT忠诚度和治疗同理心方面显著优于其他模型。
  5. COT的作用取决于训练范式,有助于改善SFT模型的表现,但对高级模型没有明显优势。
  6. ORPO能够学习治疗过程而非单纯模仿内容,这是ACT的核心要素。

Cool Papers

点此查看论文截图

Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL

Authors:Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu

We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping on the importance-sampling (IS) weight. We study RL methods with sequence-level IS and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs.\ long responses, distorting the optimization direction. FSPO introduces a simple remedy: we clip the sequence log-IS ratio with a band that scales as $\sqrt{L}$. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a cosine directional guarantee between the clipped and true updates. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets on Qwen3-8B-Base model.

我们提出了FSPO(公平序列策略优化),这是一种针对大型语言模型(LLMs)的序列级强化学习方法,它对重要性采样(IS)权重执行长度公平的裁剪。我们研究了带有序列级IS的RL方法,并识别出在将PPO/GRPO风格裁剪移植到序列时的不匹配问题:固定的裁剪范围会系统性地对短响应与长响应重新加权,扭曲了优化方向。FSPO引入了一种简单的补救措施:我们用一个随$\sqrt{L}$缩放的区间对序列级对数IS比率进行裁剪。从理论上讲,我们通过长度重加权误差(LRE)形式化长度公平性,并证明较小的LRE可以在裁剪后的更新和真实更新之间提供余弦方向保证。在经验上,FSPO使不同长度分组的裁剪率趋于平坦,稳定了训练,并在Qwen3-8B-Base模型上的多个评估数据集上超越了所有基线。
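下面用 PyTorch 示意“对序列级对数 IS 比按随 $\sqrt{L}$ 缩放的区间进行裁剪”这一核心操作;基准带宽 `c0` 为示意超参数,目标函数其余部分沿用序列级 PPO 形式,并非论文完整实现:

```python
import torch

def fspo_objective(seq_logp_new, seq_logp_old, seq_adv, seq_len, c0=0.1):
    """seq_logp_*: 整条响应的对数概率之和;seq_len: 响应长度L;seq_adv: 序列级优势。"""
    log_ratio = seq_logp_new - seq_logp_old
    band = c0 * torch.sqrt(seq_len.float())               # 裁剪带宽随sqrt(L)增长,实现长度公平
    clipped_log_ratio = torch.maximum(torch.minimum(log_ratio, band), -band)
    ratio, clipped_ratio = torch.exp(log_ratio), torch.exp(clipped_log_ratio)
    return torch.minimum(ratio * seq_adv, clipped_ratio * seq_adv).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    L = torch.tensor([32, 512, 2048])
    lp_new = torch.randn(3, requires_grad=True)
    lp_old = lp_new.detach() + torch.randn(3) * 0.5
    adv = torch.tensor([1.0, -0.5, 0.8])
    loss = -fspo_objective(lp_new, lp_old, adv, L)
    loss.backward()
    print(lp_new.grad)
```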

论文及项目相关链接

PDF

Summary
提出一种名为FSPO的序列级强化学习方法,用于LLMs。该方法通过对重要性采样(IS)权重实施长度公平裁剪,解决了固定裁剪范围对不同长度序列不公平的问题。理论方面,通过长度重加权误差(LRE)形式化长度公平性,并证明较小的LRE可保证裁剪后更新与真实更新之间的余弦方向一致性。实验表明,FSPO使各长度区间的裁剪率趋于平坦,稳定训练,并在Qwen3-8B-Base模型上优于所有基线。

Key Takeaways

  1. FSPO是一种针对LLMs的序列级强化学习方法。
  2. FSPO解决了固定裁剪范围问题,通过实施长度公平裁剪来优化重要性采样权重。
  3. 理论方面,通过长度重加权误差(LRE)形式化长度公平性。
  4. 较小的LRE可保证裁剪后更新与真实更新之间的余弦方向一致性。
  5. FSPO使各长度区间的裁剪率趋于平坦。
  6. FSPO稳定训练过程。
  7. FSPO在Qwen3-8B-Base模型上的表现优于所有基线。

Cool Papers

点此查看论文截图

WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Authors:Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, Junxian He

The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we successfully develop advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves the state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.

大型语言模型(LLM)的范式越来越转向代理应用,其中网页浏览能力是从各种在线源检索信息的基础。然而,现有的开源网络代理要么在复杂任务上表现出有限的信息搜索能力,要么缺乏透明的实现。在这项工作中,我们确定关键挑战在于信息搜索任务缺乏足够有挑战性的数据。为了解决这一局限性,我们推出了WebExplorer:一种系统化的数据生成方法,采用基于模型的探索和由长到短的迭代查询演化。这种方法可以创建需要多步骤推理和复杂网络导航的挑战性查询-答案对。通过利用我们精心构建的高质量数据集,我们经由监督微调再结合强化学习,成功开发了先进的网络代理WebExplorer-8B。我们的模型支持128K的上下文长度和多达100轮工具调用,能够进行长时程问题求解。在各种信息搜索基准测试中,WebExplorer-8B在其规模上均取得了最先进的性能。值得注意的是,作为规模为8B的模型,WebExplorer-8B在强化学习训练后能够进行平均约16轮的有效搜索,其在BrowseComp-en/zh上的准确性高于WebSailor-72B,并在WebWalkerQA和FRAMES上达到了参数不超过100B的模型中的最好表现。除了这些信息搜索任务之外,我们的模型在HLE基准测试上也实现了强大的泛化能力,尽管它仅针对知识密集型问答数据进行了训练。这些结果表明,我们的方法是通向长时程网络代理的一条实用路径。
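下面用 Python 勾勒“由长到短的迭代查询演化”这一数据构造思路:从包含大量线索的长问题出发,每轮删去部分显式线索、让问题更短更难直接检索,同时保持答案不变;`llm_rewrite` 为假设的改写接口,演化规则为本文的示意性解读:

```python
from typing import Callable, List, Tuple

def evolve_query(long_query: str, answer: str, llm_rewrite: Callable[[str], str],
                 rounds: int = 3) -> List[Tuple[str, str]]:
    """返回[(问题, 答案), ...],问题逐轮变短、线索更少但答案保持不变。"""
    pairs, q = [(long_query, answer)], long_query
    for _ in range(rounds):
        q = llm_rewrite(
            "请改写下面的问题,使其更简短并去掉部分显式线索,"
            "但保证答案仍然是『" + answer + "』:\n" + q
        )
        pairs.append((q, answer))
    return pairs

if __name__ == "__main__":
    # 演示:用一个只做截断的假改写器代替真实LLM调用
    fake_rewrite = lambda prompt: prompt.splitlines()[-1][:-5]  # 仅作演示:截掉结尾5个字符
    long_q = "已知线索1、线索2、线索3和线索4,请问目标实体X是什么?"
    for q, a in evolve_query(long_q, "实体X", fake_rewrite, rounds=2):
        print(q, "->", a)
```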

论文及项目相关链接

PDF

Summary

本文介绍了大型语言模型(LLM)在朝着代理应用转变过程中的一项重要进展。文章指出信息检索中的关键挑战在于缺乏挑战性的数据,于是提出了一种名为WebExplorer的系统性数据生成方法。通过利用精心制作的高质量数据集,经过监督微调及强化学习后,成功开发出WebExplorer-8B高级网络代理模型。该模型能够支持长时间范围的问题解决,并且在各种信息搜索任务中达到了最新的最佳性能。该模型不仅适用于信息检索任务,还能够在其他领域实现强大的泛化性能。这些结果展示了实现长期网络代理的实际途径。

Key Takeaways

  1. 大型语言模型(LLM)正朝着代理应用转变,网页浏览能力在信息检索中变得越来越重要。
  2. 当前开源网络代理在复杂任务上的信息搜索能力有限,缺乏透明实施方法。
  3. 信息检索的关键挑战在于缺乏挑战性的数据。
  4. WebExplorer是一种新的系统性数据生成方法,通过模型探索和查询的迭代长短变化来创建挑战性的查询答案对。
  5. WebExplorer-8B模型通过监督微调和强化学习训练而成,强化学习后平均可进行约16轮搜索。

Cool Papers

点此查看论文截图

CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning

Authors:Zeyu Gan, Hao Yi, Yong Liu

Reinforcement Learning (RL) has become a pivotal approach for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists, as traditional token-level RL frameworks fail to align with the reasoning-level nature of complex, multi-step thought processes like Chain-of-Thought (CoT). To address this challenge, we introduce CoT-Space, a novel theoretical framework that recasts LLM reasoning from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. This shift in perspective serves as a conceptual bridge, revitalizing foundational principles from classical learning theory to analyze the unique dynamics of LLMs. By analyzing this process from both a noise perspective and a risk perspective, we demonstrate that the convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. Furthermore, extensive experiments provide strong empirical validation for our theoretical findings. Our framework not only provides a coherent explanation for empirical phenomena such as overthinking but also offers a solid theoretical foundation to guide the future development of more effective and generalizable reasoning agents. We open-source our code at https://github.com/ZyGan1999/CoT-Space.

强化学习(RL)已成为提高大型语言模型(LLM)推理能力的重要方法。然而,仍存在重大的理论空白,因为传统的基于标记的RL框架无法与复杂的多步骤思维过程(如思维链)的推理级别相对应。为了解决这一挑战,我们引入了CoT空间,这是一个新的理论框架,它将LLM推理从离散标记预测任务重新塑造为连续推理级别语义空间中的优化过程。这种视角的转变作为概念桥梁,使我们从经典学习理论的基本原理重新分析LLM的独特动态。我们从噪声和风险两个角度分析了这一过程,证明了达到最佳思维链长度的收敛是欠拟合和过度拟合之间基本权衡的自然结果。此外,大量实验为我们的理论发现提供了强有力的实证验证。我们的框架不仅为过度思考等经验现象提供了连贯的解释,还为未来开发更有效和可推广的推理代理提供了坚实的理论基础。我们在https://github.com/ZyGan1999/CoT-Space公开源代码。

论文及项目相关链接

PDF Preprint Edition

Summary

强化学习(RL)在提升大型语言模型(LLM)的推理能力方面发挥着关键作用。然而,由于传统基于令牌的RL框架与复杂的多步骤推理过程(如链式思维)在理论层面上的不匹配,仍存在重大理论空白。为解决这一挑战,我们推出了CoT-Space这一新型理论框架,将LLM推理从离散令牌预测任务重新定义为连续推理级语义空间内的优化过程。从噪声和风险两个角度进行分析,我们证明了优化链式思维长度的收敛是避免欠拟合和过度拟合之间的基本权衡的自然结果。此外,我们的理论得到了广泛实验的强大实证验证。我们的框架不仅为过度思考等经验现象提供了连贯的解释,而且为开发更有效和更具通用性的推理智能体提供了坚实的理论基础。我们已经将我们的代码公开开源。

Key Takeaways

  1. 强化学习对于提升大型语言模型的推理能力至关重要。
  2. 传统基于令牌的RL框架在应对复杂多步骤推理过程时存在理论空白。
  3. CoT-Space框架将LLM推理从离散令牌预测任务转变为连续推理级语义空间内的优化过程。
  4. 从噪声和风险角度分析,优化链式思维长度的收敛是避免欠拟合和过度拟合权衡的结果。
  5. 广泛实验验证了CoT-Space框架的理论发现。
  6. 框架为过度思考等经验现象提供了连贯的解释。

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !