
R1_Reasoning


⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验

2025-11-17 更新

From 2D to 3D Without Extra Baggage: Data-Efficient Cancer Detection in Digital Breast Tomosynthesis

Authors:Yen Nhi Truong Vu, Dan Guo, Sripad Joshi, Harshit Kumar, Jason Su, Thomas Paul Matthews

Digital Breast Tomosynthesis (DBT) enhances finding visibility for breast cancer detection by providing volumetric information that reduces the impact of overlapping tissues; however, limited annotated data has constrained the development of deep learning models for DBT. To address data scarcity, existing methods attempt to reuse 2D full-field digital mammography (FFDM) models by either flattening DBT volumes or processing slices individually, thus discarding volumetric information. Alternatively, 3D reasoning approaches introduce complex architectures that require more DBT training data. Tackling these drawbacks, we propose M&M-3D, an architecture that enables learnable 3D reasoning while remaining parameter-free relative to its FFDM counterpart, M&M. M&M-3D constructs malignancy-guided 3D features, and 3D reasoning is learned through repeatedly mixing these 3D features with slice-level information. This is achieved by modifying operations in M&M without adding parameters, thus enabling direct weight transfer from FFDM. Extensive experiments show that M&M-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. Additionally, M&M-3D outperforms complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in high-data regime. On the popular BCS-DBT benchmark, M&M-3D outperforms previous top baseline by 4% for classification and 10% for localization.

数字乳腺断层合成技术(DBT)通过提供体积信息减少了重叠组织的影响,提高了乳腺癌检测中病灶的可见性。然而,标注数据有限,制约了DBT深度学习模型的发展。为了解决数据稀缺的问题,现有方法试图通过重新使用二维全视野数字乳腺摄影(FFDM)模型来利用DBT数据,这些方法要么将DBT体积压平,要么单独处理切片,从而丢弃了体积信息。另一种方法是采用三维推理方法,但这需要引入更复杂的架构和更多的DBT训练数据。为了解决这些缺点,我们提出了M&M-3D架构,它能够进行可学习的三维推理,同时相对于其FFDM对应模型M&M不增加任何参数。M&M-3D构建了以恶性为引导的三维特征,并通过反复混合这些三维特征与切片级别的信息来学习三维推理。这是通过在M&M中修改操作而无需添加参数来实现的,从而实现了从FFDM的直接权重转移。大量实验表明,在定位和分类方面,M&M-3D超越了二维投影和基于切片的三维方法,分别提高了11-54%和3-10%。此外,在低数据情况下,M&M-3D在定位和分类方面分别优于复杂的三维推理变体20-47%和2-10%,而在高数据情况下则与之相匹配。在流行的BCS-DBT基准测试中,M&M-3D在分类方面优于先前的最佳基准模型4%,在定位方面提高了10%。
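
As a rough illustration of the parameter-free idea described above, the sketch below (our own toy example, not the authors' code) builds a malignancy-weighted volume feature from per-slice features and mixes it back into each slice using only softmax weighting and element-wise averaging; the feature shapes, the 0.5 mixing coefficient, and the function name are assumptions made purely for illustration.

```python
import numpy as np

def mix_slices_3d(slice_feats, slice_malig_logits):
    """Toy sketch: malignancy-guided 3D feature built from 2D slice features.

    slice_feats:        (S, C) feature vector of each DBT slice
    slice_malig_logits: (S,)   per-slice malignancy score from a 2D FFDM-style model
    Only softmax weighting and element-wise averaging are used, so no new
    parameters are introduced relative to the 2D model.
    """
    w = np.exp(slice_malig_logits - slice_malig_logits.max())
    w = w / w.sum()                               # (S,) malignancy-guided attention weights
    vol_feat = (w[:, None] * slice_feats).sum(0)  # (C,) malignancy-guided 3D (volume) feature
    mixed = 0.5 * (slice_feats + vol_feat[None])  # mix the 3D feature back into slice-level info
    return mixed, vol_feat

S, C = 40, 8
feats = np.random.randn(S, C).astype(np.float32)
logits = np.random.randn(S).astype(np.float32)
mixed, vol = mix_slices_3d(feats, logits)
print(mixed.shape, vol.shape)  # (40, 8) (8,)
```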

论文及项目相关链接

PDF

摘要
DBT技术提高了乳腺癌检测的可见性,但标注数据有限制约了深度学习模型的发展。为解决数据稀缺问题,有人尝试复用2D全视野数字乳腺摄影(FFDM)模型,但会丢失体积信息。本文提出M&M-3D架构,可在不新增参数的情况下实现可学习的三维推理,并能直接复用其FFDM对应架构M&M的权重。M&M-3D通过构建恶性引导的三维特征,并反复混合这些特征与切片信息来学习三维推理。实验证明,M&M-3D在定位和分类任务上分别超越了传统方法达11-54%和3-10%。在低数据场景下,其性能优于复杂的三维推理变体达20-47%和2-10%,而在高数据场景下与之相当。在BCS-DBT基准测试中,M&M-3D的分类和定位性能均优于现有最佳基线模型。

关键见解

  1. DBT技术通过提供体积信息提高乳腺癌检测的可见性,但标注数据有限限制了深度学习模型的发展。
  2. 现有方法试图通过复用FFDM模型来解决数据稀缺问题,但这样做会丢失体积信息。
  3. 本文提出的M&M-3D架构可在无需新增参数的情况下实现三维推理学习,具有优异性能。
  4. M&M-3D通过构建恶性引导的三维特征,并反复混合这些特征与切片信息来学习三维推理。
  5. M&M-3D在定位和分类任务上的性能均超越了传统方法以及复杂的三维推理变体。
  6. 在BCS-DBT基准测试中,M&M-3D的性能优于现有最佳基线模型。
  7. 该研究为乳腺癌检测提供了一种新的深度学习模型架构,具有潜在的临床应用价值。

Cool Papers

点此查看论文截图

Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation

Authors:Mayank Vatsa, Aparna Bharati, Richa Singh

The architectural blueprint of today’s leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives-negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.

当前领先的文本到图像模型的架构蓝图存在一个基本缺陷:无法处理逻辑组合。这篇综述围绕三个核心基元——否定、计数和空间关系——考察了这一缺陷。我们的分析揭示了性能的大幅崩溃:在单个基元上表现准确的模型,一旦将这些基元组合起来便急剧失败,暴露了严重的干扰。我们将这种失败追查到三个关键因素。首先,训练数据几乎完全缺乏明确的否定。其次,连续注意力架构从根本上不适合离散逻辑。第三,评估指标奖励视觉可信度而不是约束满足。通过分析最近的基准测试和方法,我们表明,当前的解决方案和简单的规模扩展都无法弥合这一差距。我们得出结论,实现真正的组合性将需要表示和推理方面的根本性进步,而不是对现有架构的增量调整。

论文及项目相关链接

PDF Accepted in AAAI 2026

Summary

文章指出当前领先的文本到图像模型存在根本缺陷,即无法处理逻辑组合。调查分析了三个核心基元:否定、计数和空间关系。分析显示模型在单一基元上的准确性在组合时急剧下降,暴露出严重干扰。文章追踪了这种失败的三个关键因素:训练数据缺乏明确否定;连续注意力架构不适合离散逻辑;评估指标更重视视觉可信度而非约束满足。文章表明,实现真正的组合性需要表示和推理方面的根本性进展,而非对现有架构的微调。

Key Takeaways

  1. 当前领先的文本到图像模型存在无法处理逻辑组合的根本缺陷。
  2. 模型在组合否定、计数和空间关系等核心基元时表现急剧下降。
  3. 训练数据几乎完全缺乏明确的否定信息。
  4. 连续注意力架构不适合处理离散逻辑问题。
  5. 评估指标更注重视觉可信度,而非约束满足。
  6. 现有解决方案和简单扩展无法弥补这一差距。
  7. 实现真正的组合性需要表示和推理方面的根本性进展。

Cool Papers

点此查看论文截图

Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning

Authors:Haidong Huang, Haiyue Zhu, Jiayu Song, Xixin Zhao, Yaohua Zhou, Jiayi Zhang, Yuze Zhai, Xiaocong Li

Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves +5.9% absolute improvement over Uni-O4 on locomotion tasks and +12.4% on dexterous manipulation, demonstrating strong generalization and scalability.

离线到在线的强化学习(O2O-RL)已经成为安全高效地部署机器人策略的一种有前景的范式,但它面临两个基本挑战:多模式行为的覆盖有限和在线适应过程中的分布偏移。我们提出了UEPO,这是一个受大型语言模型预训练和微调策略启发的统一生成框架。我们的贡献有三点:(1)一种多种子动力学感知扩散策略,能够高效捕获多种模式,而无需训练多个模型;(2)一种动态散度正则化机制,强制实施具有物理意义的策略多样性;(3)一种基于扩散的数据增强模块,提高了动力学模型的泛化能力。在D4RL基准测试中,UEPO在步态任务上相对于Uni-O4实现了+5.9%的绝对改进,在精细操作任务上实现了+12.4%的改进,显示出强大的泛化能力和可扩展性。
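
The divergence-regularization idea can be illustrated with a minimal sketch. The code below is a generic, assumed formulation (the function name, the hinge form, and the min_gap/weight parameters are ours, not UEPO's): it penalizes a set of policy "seeds" whose proposed actions collapse onto each other, which is the qualitative effect the paper's dynamic divergence regularizer aims for.

```python
import numpy as np

def diversity_regularizer(actions_per_seed, min_gap=0.1, weight=1.0):
    """Illustrative diversity (divergence-style) regularizer, assumed form only.

    actions_per_seed: (K, B, A) actions proposed by K policy "seeds" for the
        same batch of B states, each action A-dimensional.
    Penalizes seeds whose mean pairwise action distance falls below min_gap,
    nudging the seeds toward covering distinct behavior modes.
    """
    K = actions_per_seed.shape[0]
    gaps = []
    for i in range(K):
        for j in range(i + 1, K):
            diff = actions_per_seed[i] - actions_per_seed[j]
            gaps.append(np.sqrt((diff ** 2).sum(-1)).mean())
    mean_gap = float(np.mean(gaps))
    # Hinge-style penalty: only active when the seeds collapse onto each other.
    return weight * max(0.0, min_gap - mean_gap)

acts = np.random.randn(4, 32, 6) * 0.05  # 4 seeds proposing nearly identical actions
print(diversity_regularizer(acts))        # positive penalty -> seeds are too similar
```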

论文及项目相关链接

PDF Accepted by NeurIPS 2025 Workshop on Embodied World Models for Decision Making

Summary:
离线到在线强化学习(O2O-RL)在机器人策略部署中表现出巨大潜力,但仍面临多模态行为覆盖有限和在线适应过程中的分布偏移两大挑战。我们提出UEPO,一个受大型语言模型预训练和微调策略启发的统一生成框架。贡献有三点:一是多种子动态感知扩散策略,能高效捕获多种模式而无需训练多个模型;二是动态散度正则化机制,强制实现具有物理意义的策略多样性;三是基于扩散的数据增强模块,提高动态模型泛化能力。在D4RL基准测试中,UEPO在步态任务上相对于Uni-O4的绝对改进率为+5.9%,在精细操作任务上为+12.4%,显示出强大的泛化能力和可扩展性。

Key Takeaways:

  1. O2O-RL面临多模态行为覆盖有限和在线适应分布偏移的挑战。
  2. UEPO是一个统一生成框架,受到大型语言模型预训练和微调策略的启发。
  3. UEPO采用多种子动态感知扩散策略,能高效捕获多种模式。
  4. UEPO实施动态散度正则化机制,促进策略多样性。
  5. UEPO包含基于扩散的数据增强模块,提高动态模型的泛化能力。
  6. UEPO在D4RL基准测试中的步态任务性能有所提升。

Cool Papers

点此查看论文截图

Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

Authors:Xanh Ho, Yun-Ang Wu, Sunisth Kumar, Florian Boudin, Atsuhiro Takasu, Akiko Aizawa

With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models’ multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.

随着提交的科研论文数量不断增加,对能够帮助审稿人评估研究论点的系统的需求也在增长。实验结果是科研工作的核心组成部分,通常以表格或图表等多种形式呈现。了解当前的多模态大型语言模型(multimodal LLMs)在不同证据格式下验证科学论断的稳健性仍然是一个重要且尚未被充分研究的挑战。在本文中,我们设计并实施了一系列实验,以评估多模态LLMs使用表格和图表作为证据来验证科学论断的能力。为了进行此评估,我们适应了两套科学论文数据集,并融入了进行多模态论断验证任务所需的注解和结构。使用这个适应后的数据集,我们评估了12个多模态LLMs,发现当前模型在处理基于表格的证据时表现更好,而在处理基于图表的证据时则较为困难。我们还进行了人类评估,并观察到人类在这两种格式下的表现都很出色,与模型不同。我们的分析还显示,较小的多模态LLMs(小于8B)在基于表格和基于图表的任务之间性能相关性较弱,表明跨模态泛化能力有限。这些发现突显了当前模型在多模态推理能力方面的关键差距。我们建议未来的多模态LLMs应更加重视改进图表理解,以更好地支持科学论断验证。

论文及项目相关链接

PDF Accepted at AAAI 2026

Summary

本文探讨了科学论文评审中对研究主张的验证需求,并重点研究了多模态大型语言模型(multimodal LLMs)在处理不同格式证据(如表和图表)时的性能。实验结果显示,当前模型在处理基于图表的证据时表现较弱,而在处理基于表格的证据时表现较好。同时,小型多模态LLM在跨模态任务中的表现呈现出弱相关性。未来多模态LLM应更加重视图表理解能力的提升,以更好地支持科学主张的验证。

Key Takeaways

  1. 科学论文评审中对研究主张的验证需求增长,需要系统辅助评审。
  2. 多模态大型语言模型(multimodal LLMs)在处理科学证据时面临挑战。
  3. 当前模型在处理基于图表的证据时表现较弱,处理基于表格的证据时表现较好。
  4. 较小的多模态LLM(8B以下)在基于表格与基于图表的任务之间表现相关性较弱,跨模态泛化能力有限。
  5. 人类在两种格式的证据上都能保持强表现。
  6. 未来多模态LLM需要提升图表理解能力。

Cool Papers

点此查看论文截图

When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?

Authors:Qilang Ye, Wei Zeng, Meng Liu, Jie Zhang, Yupeng Hu, Zitong Yu, Yu Zhou

Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an "Audio-Visual Confusion" scene by modifying the corresponding sound of an object in the video, e.g., mute the sounding object and ask MLLMs "Is there a/an [muted object] sound". Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves the accuracy by 10~30% over the baseline model with limited training data. Follow: https://github.com/rikeilong/AVConfusion.

多模态大型语言模型(MLLMs)能否区分视觉上存在但听觉不存在的混淆对象?为了研究这个问题,我们引入了一个新的基准测试AV-ConfuseBench,它通过修改视频中的对象声音来模拟“视听混淆”场景,例如,将发声对象静音,并询问MLLMs“是否存在某个(已被静音的)对象的声音”。实验结果显示,MLLMs(如Qwen2.5-Omni和Gemini 2.5)由于视觉主导的推理机制而难以区分不存在的音频。受此观察结果的启发,我们引入了基于强化学习的协作多模态大型语言模型RL-CoMM,该模型以Qwen2.5-Omni为基础构建。RL-CoMM包括两个阶段:1)为了减轻视觉主导的歧义性,我们引入了外部模型——大型音频语言模型(LALM)作为参考模型,以生成仅音频的推理。然后,我们设计了一种逐步推理奖励函数,使MLLMs能够借助仅音频的参考进行自我提升的视听推理。2)为了确保准确的答案预测,我们引入了以答案为中心的置信优化,以减少潜在异质推理差异的不确定性。在视听问答和视听幻觉方面的广泛实验表明,在有限训练数据的情况下,RL-CoMM的准确率较基线模型提高了10~30%。更多信息请访问:https://github.com/rikeilong/AVConfusion。

论文及项目相关链接

PDF Accepted by AAAI 2026

Summary

本文研究了多模态大型语言模型(MLLMs)在视觉存在但音频缺失的混淆对象辨识方面的能力。为此,引入了一个新的基准测试AV-ConfuseBench,通过修改视频中的对象声音来模拟“视听混淆”场景。实验结果表明,MLLMs在辨识不存在的音频时存在困难,因为它们倾向于视觉主导的推理。为解决这一问题,提出了基于强化学习的多MLLM协作方法RL-CoMM。该方法包括两个阶段:首先,引入大型音频语言模型(LALM)作为参考模型,生成仅音频的推理,以缓解视觉主导的歧义。然后,设计了一种逐步推理奖励函数,使MLLMs能够通过仅音频的参考进行自我改进的音频-视觉推理。其次,为了确保准确的答案预测,引入了以答案为中心的信心优化,以减少潜在的不同推理方式的不确定性。实验表明,RL-CoMM在音频视觉问答和音频视觉幻觉方面,相较于基准模型,准确率提高了10~30%,并且在有限训练数据下表现更优秀。

Key Takeaways

  1. 多模态大型语言模型(MLLMs)在辨识视觉上存在但音频缺失的混淆对象时遇到困难。
  2. 引入新的基准测试AV-ConfuseBench,模拟“视听混淆”场景。
  3. MLLMs在辨识不存在的音频时受视觉主导的推理影响。
  4. 提出基于强化学习的多MLLM协作方法RL-CoMM。
  5. RL-CoMM包括两个阶段:引入大型音频语言模型(LALM)作为参考,生成仅音频的推理;设计逐步推理奖励函数,提高MLLMs的音频-视觉推理能力。
  6. RL-CoMM通过答案为中心的信心优化,减少不同推理方式的不确定性。

Cool Papers

点此查看论文截图

Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning

Authors:Xiaolong Wei, Yuehu Dong, Xingliang Wang, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin

Existing tool-augmented large language models (LLMs) encounter significant challenges when processing complex queries. Current frameworks such as ReAct are prone to local optimization traps due to their reliance on incremental decision-making processes. To address these limitations, we propose a novel Planner-centric Plan-Execute paradigm that fundamentally resolves local optimization bottlenecks through architectural innovation. Central to our approach is a novel Planner model that performs global Directed Acyclic Graph (DAG) planning for complex queries, enabling optimized execution beyond conventional tool coordination. We also introduce ComplexTool-Plan, a large-scale benchmark dataset featuring complex queries that demand sophisticated multi-tool composition and coordination capabilities. Additionally, we develop a two-stage training methodology that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), systematically enhancing the Planner’s tool selection accuracy and global planning awareness through structured DAG-based planning. When integrated with a capable executor, our framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries, demonstrating superior end-to-end execution capabilities and robust handling of intricate multi-tool workflows.

现有的工具增强型大型语言模型(LLM)在处理复杂查询时面临重大挑战。当前框架(如ReAct)由于依赖于增量决策过程,容易陷入局部优化陷阱。为了解决这些局限性,我们提出了一种新型的以规划者为中心的规划-执行范式,通过架构创新从根本上解决局部优化瓶颈。我们的方法的核心是一个新型规划器模型,该模型对复杂查询进行全局有向无环图(DAG)规划,使优化执行超越传统的工具协调。我们还引入了ComplexTool-Plan,这是一个大规模基准数据集,包含需要精细的多工具组合和协调能力的复杂查询。此外,我们开发了一种两阶段训练方法论,将监督微调(SFT)与群体相对策略优化(GRPO)相结合,通过基于结构的DAG规划,系统地提高了规划器的工具选择精度和全局规划意识。当与功能强大的执行器结合时,我们的框架在StableToolBench基准测试上实现了复杂用户查询的卓越性能,展示了出色的端到端执行能力和对复杂多工具工作流程的稳健处理。
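
To make the global DAG planning concrete, here is a minimal sketch of how a planner-produced tool DAG could be executed in topological order using Python's standard-library graphlib; the plan contents, tool names, and callables are hypothetical, and the actual Planner/executor interface in the paper may differ.

```python
from graphlib import TopologicalSorter

def execute_plan(dag, tools, query):
    """Illustrative executor for a planner-produced tool DAG (not the paper's code).

    dag:   {step: set(of prerequisite steps)}, the format graphlib accepts
    tools: {step: callable(query, parent_results_dict) -> result}
    Each step runs exactly once, after all of its prerequisites, so the global
    plan, rather than greedy step-by-step choices, drives execution.
    """
    results = {}
    for step in TopologicalSorter(dag).static_order():
        parents = {p: results[p] for p in dag.get(step, ())}
        results[step] = tools[step](query, parents)
    return results

# Hypothetical 3-tool plan: two independent lookups feeding a final synthesis step.
dag = {"flights": set(), "hotels": set(), "itinerary": {"flights", "hotels"}}
tools = {
    "flights": lambda q, p: f"flights for {q}",
    "hotels": lambda q, p: f"hotels for {q}",
    "itinerary": lambda q, p: f"combine({sorted(p)})",
}
print(execute_plan(dag, tools, "3-day Tokyo trip")["itinerary"])
```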

论文及项目相关链接

PDF

Summary

该文针对现有工具增强的大型语言模型在处理复杂查询时面临的挑战,提出了一种新颖的Planner-centric Plan-Execute范式。该范式通过架构创新从根本上解决局部优化瓶颈,并引入了一个新颖的Planner模型,该模型对复杂查询进行全局有向无环图(DAG)规划,以实现优于传统工具协调的优化执行。此外,还推出了大型基准测试数据集ComplexTool-Plan,以及两阶段训练方法,包括监督微调(SFT)和群组相对策略优化(GRPO),提高了Planner的工具选择准确性和全局规划意识。与功能强大的执行器结合时,该框架在StableToolBench基准测试上实现了卓越性能,展现出出色的端到端执行能力和处理复杂多工具工作流的稳健性。

Key Takeaways

  1. 现有工具增强的大型语言模型在处理复杂查询时存在挑战。
  2. 提出了一种新颖的Planner-centric Plan-Execute范式,通过架构创新解决局部优化瓶颈。
  3. 引入了全局有向无环图(DAG)规划的新颖Planner模型,优化执行复杂查询。
  4. 推出了ComplexTool-Plan大型基准测试数据集,模拟复杂查询需求。
  5. 采用了两阶段训练方法,包括监督微调(SFT)和群组相对策略优化(GRPO),提高Planner的工具选择准确性和全局规划能力。
  6. 与功能强大的执行器结合时,该框架在StableToolBench基准测试上表现出卓越性能。

Cool Papers

点此查看论文截图

AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

Authors:Xinyi Wang, Xun Yang, Yanlong Xu, Yuchen Wu, Zhen Li, Na Zhao

Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.

在物理环境中实现有效的人机协作,不仅需要理解应当对什么进行操作,还需要理解可操作元素在哪里以及如何与之交互。现有方法通常只在物体层面操作,或以割裂的方式处理细粒度的可供性(affordance)推理,缺乏连贯的、指令驱动的定位与推理。在这项工作中,我们引入了一项新任务:细粒度3D具身推理,要求智能体根据任务指令,为3D场景中每个被指代的可供性元素预测一个结构化三元组,包括其空间位置、运动类型和运动轴。为了解决这一任务,我们提出了AffordBot,一个将多模态大型语言模型(MLLMs)与定制的思维链(CoT)推理范式相结合的新框架。为弥合3D输入与兼容2D的MLLM之间的差距,我们渲染场景的环绕视图图像,并将候选3D元素投影到这些视图中,形成与场景几何对齐的丰富视觉表示。我们的CoT流程始于主动感知阶段,提示MLLM根据指令选择信息量最大的视角,随后逐步推理以定位可供性元素并推断合理的交互运动。在SceneFun3D数据集上的评估表明,AffordBot仅以3D点云和MLLM为输入便达到了最先进的性能,展现出强大的泛化能力和基于物理的推理能力。
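
The 3D-to-2D bridging step (projecting candidate 3D elements into rendered surround views) can be sketched with a standard pinhole projection. The snippet below is a generic illustration under assumed intrinsics/extrinsics conventions, not AffordBot's actual rendering pipeline.

```python
import numpy as np

def project_points(points_w, K, T_wc):
    """Minimal pinhole projection aligning 3D candidates with one rendered view.

    points_w: (N, 3) candidate element centers in world coordinates
    K:        (3, 3) camera intrinsics of the rendered surround view
    T_wc:     (4, 4) world-to-camera extrinsics
    Returns (N, 2) pixel coordinates and a visibility mask (points in front of the camera).
    """
    pts_h = np.concatenate([points_w, np.ones((len(points_w), 1))], axis=1)  # (N, 4)
    pts_c = (T_wc @ pts_h.T).T[:, :3]            # camera-frame coordinates
    visible = pts_c[:, 2] > 1e-6                 # keep only points in front of the camera
    uvw = (K @ pts_c.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
    return uv, visible

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
T = np.eye(4)                                    # camera at the world origin, looking along +z
pts = np.array([[0.1, -0.2, 2.0], [0.0, 0.0, -1.0]])
uv, vis = project_points(pts, K, T)
print(uv.round(1), vis)                          # first point lands at (345, 190); second is behind the camera
```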

论文及项目相关链接

PDF NeurIPS 2025

Summary

本文提出了精细化的三维体现推理任务,要求智能体基于任务指令预测每个参照的可利用元素在三维场景中的空间位置、运动类型和运动轴的结构化三元组。为解决此任务,提出了AffordBot框架,融合了多模态大型语言模型与量身定制的链式思维推理模式。为解决三维输入与二维兼容的大型语言模型之间的差距,通过对场景进行环绕视图渲染,将三维元素候选者投影到这些视图中,形成与场景几何对齐的丰富视觉表征。在SceneFun3D数据集上评估,AffordBot仅通过三维点云输入和大型语言模型就实现了卓越的性能,展现了强大的泛化能力和物理基础推理能力。

Key Takeaways

  1. 提出了精细化三维体现推理任务,要求智能体预测每个参照物的结构化信息(空间位置、运动类型和运动轴)。
  2. AffordBot框架融合了多模态大型语言模型和链式思维推理模式,用于解决此任务。
  3. 解决了三维输入与二维语言模型之间的兼容性问题,通过场景环绕视图渲染和投影技术。
  4. AffordBot在SceneFun3D数据集上实现了卓越性能。
  5. 仅通过三维点云输入和大型语言模型,展现了强大的泛化能力和物理基础推理能力。
  6. 首次结合了自然语言理解和三维场景理解的任务要求,强调了对环境的动态感知和对指令的深入理解。

Cool Papers

点此查看论文截图

Reinforcing Trustworthiness in Multimodal Emotional Support Systems

Authors:Huy M. Le, Dat Tien Nguyen, Ngan T. T. Vo, Tuan D. Q. Nguyen, Nguyen Binh Le, Duy Minh Ho Nguyen, Daniel Sonntag, Lizi Liao, Binh T. Nguyen

In today’s world, emotional support is increasingly essential, yet it remains challenging for both those seeking help and those offering it. Multimodal approaches to emotional support show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses, fostering more effective interactions. However, current methods have notable limitations, often relying solely on text or converting other data types into text, or providing emotion recognition only, thus overlooking the full potential of multimodal inputs. Moreover, many studies prioritize response generation without accurately identifying critical emotional support elements or ensuring the reliability of outputs. To overcome these issues, we introduce MultiMood, a new framework that (i) leverages multimodal embeddings from video, audio, and text to predict emotional components and to produce responses aligned with professional therapeutic standards. To improve trustworthiness, we (ii) incorporate novel psychological criteria and apply Reinforcement Learning (RL) to optimize large language models (LLMs) for consistent adherence to these standards. We also (iii) analyze several advanced LLMs to assess their multimodal emotional support capabilities. Experimental results show that MultiMood achieves state-of-the-art on MESC and DFEW datasets while RL-driven trustworthiness improvements are validated through human and LLM evaluations, demonstrating its superior capability in applying a multimodal framework in this domain.

在如今这个时代,情感支持变得愈发重要,然而对于寻求帮助和提供帮助的人来说都面临挑战。通过整合各种数据源,采用多模式情感支持方法能够展现出巨大潜力,提供富有同情心和语境相关的回应,促进更有效的互动。然而,当前的方法存在明显的局限性,往往仅依赖文本或将其他数据类型转换为文本,或仅提供情绪识别,从而忽略了多模式输入的全面潜力。此外,许多研究优先侧重响应生成,而没有准确识别关键的情感支持要素或确保输出的可靠性。为了克服这些问题,我们推出了MultiMood这一新框架,它(i)利用视频、音频和文本的跨模式嵌入来预测情感成分,并产生符合专业治疗标准的响应。为了提高可信度,我们(ii)融入新颖的心理标准和运用强化学习(RL)来优化大型语言模型(LLM),使其始终符合这些标准。此外,我们还(iii)分析了多个先进的大型语言模型,以评估其多模式情感支持能力。实验结果表明,MultiMood在MESC和DFEW数据集上达到了最新水平,而由强化学习驱动的可信性改进则通过人类和大型语言模型评估得到了验证,证明了其在该领域应用多模式框架的卓越能力。

论文及项目相关链接

PDF

Summary

在如今的世界,情感支持变得日益重要,但对于寻求帮助者和提供帮助者来说仍然是个挑战。多模式情感支持方法通过整合各种数据源来提供体恤、上下文相关的回应,展现出巨大潜力,促进更有效的互动。然而,当前的方法有着明显的局限性,如仅依赖文本或将其他数据类型转换为文本,或只提供情绪识别,从而忽略了多模式输入的全面潜力。为了克服这些问题,我们推出了MultiMood框架,它(一)利用视频、音频和文本的多媒体嵌入来预测情绪成分,并产生与专业的治疗标准对齐的回应。(二)通过融入新的心理标准和应用强化学习(RL)来优化大型语言模型(LLM),以提高其一致性。(三)分析多个先进的LLM以评估其多模式情感支持能力。实验结果显示,MultiMood在MESC和DFEW数据集上达到了最新水平,而由RL驱动的信任度改进则通过人类和LLM评估得到了验证,证明了其在该领域应用多媒体框架的卓越能力。

Key Takeaways

  1. 情感支持在当今世界的重要性及其面临的挑战。
  2. 多模式方法在情感支持中的潜力以及当前方法的局限性。
  3. MultiMood框架通过利用多媒体嵌入来预测情绪成分并产生专业对齐的回应。
  4. MultiMood框架利用新的心理标准和强化学习优化大型语言模型。
  5. MultiMood框架在多个数据集上实现了最新水平的性能。
  6. 通过人类和LLM评估验证了MultiMood框架的信任度改进。

Cool Papers

点此查看论文截图

HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning

Authors:Nikunj Gupta, Bill Guo, Rajgopal Kannan, Viktor K. Prasanna

Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All codes can be found here https://github.com/Nikunj-Gupta/hierouter.

大型语言模型(LLM)在许多任务上实现了最先进的性能,但带来了较高的计算和内存成本,限制了其在资源受限或实时环境中的部署。为了解决这一问题,我们提出了HierRouter,这是一种分层路由方法,它可以从一组专用的、轻型的语言模型中动态地组合推理管道。我们将该方法表述为一个有限视界的马尔可夫决策过程(MDP),并通过基于近端策略优化(PPO)的强化学习代理进行训练,以迭代地选择在多跳推理的每个阶段应调用哪些模型。该代理根据不断变化的上下文和累积的成本来做出上下文感知的路由决策。在问答、代码生成和数学推理等六个基准测试上对三个开源候选大型语言模型进行的实验表明,与单独使用各个模型相比,HierRouter的响应质量提高了高达2.4倍,同时平均带来的额外推理成本微乎其微。这些结果突显了分层路由在高效、高性能的大型语言模型推理中的潜力。所有代码均可在此处找到:https://github.com/Nikunj-Gupta/hierouter
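
A minimal sketch of the routing MDP idea: at each hop a policy chooses which lightweight model to invoke (or to stop), and the episode reward trades answer quality against accumulated inference cost. The model pool, cost numbers, reward shape, and cost_weight below are all illustrative assumptions; the paper's PPO agent and reward design are more involved.

```python
import random

# Hypothetical pool of specialized lightweight models: name -> (cost, skill tags).
MODEL_POOL = {
    "math-1b": {"cost": 1.0, "skills": {"math"}},
    "code-3b": {"cost": 2.0, "skills": {"code"}},
    "qa-1b":   {"cost": 1.0, "skills": {"qa"}},
}

def route_episode(task_skill, policy, max_hops=3, cost_weight=0.1):
    """Sketch of one finite-horizon routing episode (illustrative, not the paper's code).

    At each hop the policy picks a model name or "stop"; the episode reward is
    answer quality minus weighted accumulated inference cost, so the agent is
    pushed toward pipelines that are both accurate and cheap.
    """
    context, total_cost, quality = [task_skill], 0.0, 0.0
    for _ in range(max_hops):
        action = policy(context)
        if action == "stop":
            break
        spec = MODEL_POOL[action]
        total_cost += spec["cost"]
        quality = max(quality, 1.0 if task_skill in spec["skills"] else 0.3)
        context.append(action)                   # the evolving context conditions the next choice
    return quality - cost_weight * total_cost    # reward the PPO agent would maximize

random_policy = lambda ctx: random.choice(list(MODEL_POOL) + ["stop"])
print(route_episode("math", random_policy))
```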

论文及项目相关链接

PDF

Summary

大型语言模型(LLMs)在多个任务上表现出卓越的性能,但其计算与内存成本高昂,限制了其在资源受限或实时场景中的应用。为解决这一问题,我们提出了HierRouter,一种层次化路由方法,该方法能够动态地从专用、轻量级语言模型的池中组装推理管道。我们将此方法表述为一个有限期限的马尔可夫决策过程(MDP),并训练基于近端策略优化(PPO)的强化学习代理,以在多跳推理的每个阶段选择调用哪些模型。代理会根据不断变化的上下文和累积的成本来做出上下文感知的路由决策。实验表明,在问答、代码生成和数学推理等六个基准测试中,使用三个开源候选LLMs的HierRouter,在提高响应质量的同时,平均推理成本仅略有增加。这突显了分层路由在高效、高性能LLM推理中的潜力。

Key Takeaways

  1. 大型语言模型(LLMs)在多个任务上表现卓越,但计算与内存成本高。
  2. HierRouter是一种层次化路由方法,能够动态组装推理管道,从专用轻量级语言模型中选择。
  3. HierRouter方法表述为有限期限的马尔可夫决策过程(MDP)。
  4. 使用基于近端策略优化(PPO)的强化学习代理进行路由决策。
  5. 代理根据上下文和累积成本做出决策。
  6. 实验表明,HierRouter在多个基准测试中提高了响应质量,并降低了推理成本。

Cool Papers

点此查看论文截图

In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback

Authors:Mingye Zhu, Yi Liu, Zheren Fu, Quan Wang, Yongdong Zhang

Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single “golden” rationale hurts generalization as it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors-token-wise importance weights estimated by the information discrepancy between the generative policy and its answer-conditioned counterpart, for informative next token selection. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks that extend beyond the realm of mathematics, demonstrating robust generalization.

训练用于思维链推理的大型语言模型(LLMs)是一个重大挑战:对单一“黄金”理由的监督微调会损害泛化能力,因为它会惩罚同样有效的其他推理路径,而使用可验证奖励的强化学习则面临信用分配和计算成本高昂的问题。为了解决这些局限性,我们引入了InTRO(In-Token Rationality Optimization,内部标记理性优化),这是一个新框架,能够为精确和简洁的推理提供标记级别的探索和自我反馈。InTRO不是直接优化所有有效推理路径上难以处理的目标,而是利用校正因子——基于生成策略与其答案条件对应策略之间的信息差异估计的标记级重要性权重,来进行信息丰富的下一个标记选择。这种方法允许模型在单次前向传递中进行标记级别的探索并接收自我生成的反馈,最终鼓励准确且简洁的推理。在六个数学推理基准测试中,InTRO持续优于其他基线方法,相对于基础模型,解决方案的准确性提高了高达20%。它的思维链也更加简洁,减少了冗余。除此之外,InTRO实现了跨域迁移,成功适应超出数学领域的域外推理任务,展现了稳健的泛化能力。
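
One plausible (assumed) reading of the token-wise correction factors is an importance weight derived from the log-probability gap between the answer-conditioned model and the plain generative policy. The sketch below illustrates that form only; the paper's exact estimator may differ.

```python
import numpy as np

def intro_correction_factors(logp_policy, logp_answer_cond, clip=5.0):
    """Illustrative token-wise correction factors (assumed form, not the paper's exact one).

    logp_policy:      (T,) log-prob of each generated rationale token under the policy
    logp_answer_cond: (T,) log-prob of the same tokens under the answer-conditioned model
    Tokens the answer-conditioned model prefers far more than the plain policy
    get weight > 1 (informative for reaching the answer); the log-ratio is clipped for stability.
    """
    log_ratio = np.clip(logp_answer_cond - logp_policy, -clip, clip)
    return np.exp(log_ratio)   # importance weight per token

lp_policy = np.array([-2.0, -1.0, -3.0, -0.5])
lp_cond   = np.array([-1.0, -1.2, -0.5, -0.5])
print(intro_correction_factors(lp_policy, lp_cond).round(2))  # [ 2.72  0.82 12.18  1.  ]
```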

论文及项目相关链接

PDF AAAI 2026 Oral

Summary

训练大型语言模型(LLM)进行链式思维推理面临挑战:单一“黄金”理由的监督微调会损害泛化能力,因为它会惩罚同样有效的替代方案,而可验证奖励的强化学习则面临信用分配和过高的计算成本问题。为了解决这些局限性,我们推出了InTRO(In-Token Rationality Optimization)新框架,它支持标记级别的探索和自我反馈,以实现准确和简洁的推理。InTRO不是直接优化所有有效推理路径上难以处理的目标,而是利用校正因子——基于生成策略与其答案条件对应策略之间的信息差异估计的标记级重要性权重,用于选择信息丰富的下一个标记。这种方法允许模型在单个前向传递中进行标记级别的探索和自我生成的反馈,从而鼓励准确和简洁的理由。在六个数学推理基准测试中,InTRO始终优于其他基线方法,相对于基础模型提高了高达20%的解决方案准确性。此外,InTRO的推理链更加简洁,减少了冗余。更重要的是,InTRO实现了跨域迁移,成功适应超出数学领域的域外推理任务,展现了强大的泛化能力。

Key Takeaways

  1. 训练大型语言模型(LLM)在链式思维推理上存在挑战,因监督微调易损害泛化能力,而强化学习则面临信用分配和计算成本问题。
  2. InTRO框架通过引入校正因子解决上述问题,利用生成策略与答案条件间的信息差异估计标记级重要性权重。
  3. InTRO允许标记级别的探索和自我生成的反馈,鼓励准确和简洁的理由。
  4. InTRO在多个数学推理基准测试中表现优越,解决方案准确性提高显著。
  5. InTRO推理链更简洁,减少冗余。
  6. InTRO实现跨域迁移,适应超出数学领域的域外推理任务。

Cool Papers

点此查看论文截图

Uncertainty-Guided Checkpoint Selection for Reinforcement Finetuning of Large Language Models

Authors:Manh Nguyen, Dung Nguyen, Dai Do, Svetha Venkatesh, Hung Le

Reinforcement learning (RL) finetuning is crucial to aligning large language models (LLMs), but the process is notoriously unstable and exhibits high variance across model checkpoints. In practice, selecting the best checkpoint is challenging: evaluating checkpoints on the validation set during training is computationally expensive and requires a good validation set, while relying on the final checkpoint provides no guarantee of good performance. We introduce an uncertainty-guided approach for checkpoint selection (UGCS) that avoids these pitfalls. Our method identifies hard question-answer pairs using per-sample uncertainty and ranks checkpoints by how well they handle these challenging cases. By averaging the rewards of the top-uncertain samples over a short training window, our method produces a stable and discriminative signal without additional forward passes or significant computation overhead. Experiments across three datasets and three LLMs demonstrate that it consistently identifies checkpoints with stronger generalization, outperforming traditional strategies such as relying on training or validation performance. These results highlight that models solving their hardest tasks with low uncertainty are the most reliable overall.

强化学习(RL)微调对于对齐大型语言模型(LLM)至关重要,但这一过程非常不稳定,不同模型检查点之间的表现差异很大。在实践中,选择最佳检查点颇具挑战:在训练过程中在验证集上评估检查点计算成本高昂,且需要一个高质量的验证集,而直接使用最终检查点又无法保证良好性能。我们提出了一种基于不确定性的检查点选择方法(UGCS)来避免这些问题。该方法利用逐样本不确定性找出困难的问答对,并按各检查点处理这些困难样本的表现对其排序。通过在一个较短的训练窗口内对不确定性最高的样本的奖励取平均,我们的方法在不增加额外前向传播或显著计算开销的情况下,产生了稳定且具有区分度的信号。在三个数据集和三个LLM上的实验表明,该方法能够持续识别出泛化能力更强的检查点,优于依赖训练或验证性能等传统策略。这些结果表明,能够以低不确定性解决其最困难任务的模型整体上最为可靠。
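
The selection rule can be sketched directly from the description: score each checkpoint by its average reward on the most-uncertain samples over a short training window, then pick the best-scoring checkpoint. The top_frac fraction, window size, and data below are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def ugcs_score(rewards, uncertainties, top_frac=0.2):
    """Sketch of an uncertainty-guided checkpoint score (assumed formulation).

    rewards:       (W, N) per-sample rewards of one checkpoint over a short window of W steps
    uncertainties: (N,)   per-sample uncertainty estimates (higher = harder)
    The checkpoint is scored by its mean reward on the hardest samples, averaged
    over the window, so no extra forward passes are needed at selection time.
    """
    n_top = max(1, int(top_frac * len(uncertainties)))
    hard_idx = np.argsort(uncertainties)[-n_top:]        # indices of the most uncertain samples
    return rewards[:, hard_idx].mean()

def select_checkpoint(ckpt_rewards, uncertainties):
    """Return the checkpoint name with the best score on the hardest samples."""
    return max(ckpt_rewards, key=lambda k: ugcs_score(ckpt_rewards[k], uncertainties))

rng = np.random.default_rng(0)
unc = rng.random(100)
ckpts = {f"step-{s}": rng.random((5, 100)) for s in (1000, 2000, 3000)}
print(select_checkpoint(ckpts, unc))
```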

论文及项目相关链接

PDF

Summary

在强化学习微调大型语言模型时,选择最佳的检查点是一个重要且具有挑战性的问题。传统的评估方式计算量大且无法确保准确性。本文提出了一种基于不确定性的检查点选择方法(UGCS),通过识别难以回答的问题对并利用每样本的不确定性来排名检查点,从而避免这些问题。通过平均短期内顶部不确定样本的奖励,我们的方法提供了稳定且具有区分度的信号,无需额外的前向传递或大量计算开销。实验表明,该方法在三个数据集和三个大型语言模型上均能有效识别具有更强泛化能力的检查点,优于依赖训练或验证性能的传统策略。这表明解决其最困难任务且不确定性低的模型是最可靠的。

Key Takeaways

  1. 强化学习微调大型语言模型时存在选择最佳检查点的挑战。
  2. 传统评估方式计算量大且无法确保准确性。
  3. 提出的UGCS方法通过识别难以回答的问题对并基于每样本的不确定性来排名检查点。
  4. UGCS方法通过平均短期内顶部不确定样本的奖励,提供稳定且具有区分度的信号。
  5. 该方法无需额外的前向传递或大量计算开销。
  6. 实验表明UGCS方法在多个数据集和语言模型上能有效识别具有更强泛化能力的检查点。

Cool Papers

点此查看论文截图

ConstrainedSQL: Training LLMs for Text2SQL via Constrained Reinforcement Learning

Authors:Weiqin Chen, Nhan Huu Pham, Michael Robert Glass, Long Hai Vu, Gaetano Rossiello, Dharmashankar Subramanian, Santiago Paternain

Reinforcement learning (RL) has demonstrated significant promise in enhancing the reasoning capabilities of Text2SQL LLMs, especially with advanced algorithms such as GRPO and DAPO. However, the performance of these methods is highly sensitive to the design of reward functions. Inappropriate rewards can lead to reward hacking, where models exploit loopholes in the reward structure to achieve high scores without genuinely solving the task. This work considers a constrained RL framework for Text2SQL that incorporates natural and interpretable reward and constraint signals, while dynamically balancing trade-offs among them during the training. We establish the theoretical guarantees of our constrained RL framework and our numerical experiments on the well-known Text2SQL datasets substantiate the improvement of our approach over the state-of-the-art RL-trained LLMs.

强化学习(RL)在提升Text2SQL大型语言模型(LLM)的推理能力方面展现出了巨大潜力,尤其是采用GRPO和DAPO等高级算法。然而,这些方法的性能高度依赖于奖励函数的设计。不恰当的奖励会导致奖励滥用,即模型会利用奖励结构中的漏洞来获得高分,而并未真正完成任务。本研究考虑了一个用于Text2SQL的约束性强化学习框架,该框架结合了自然和可解释的奖励和约束信号,并在训练过程中动态平衡它们之间的权衡。我们为我们的约束性强化学习框架建立了理论保证,并在著名的Text2SQL数据集上进行的数值实验证明了我们的方法相较于最新的RL训练LLM有所改进。
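
A common way to balance a reward signal against a constraint signal is a Lagrangian-style update, sketched below as a generic illustration; the paper's exact constrained-RL formulation, constraint definitions, and update rule are not reproduced here and may differ.

```python
def lagrangian_step(reward, constraint_value, lam, budget=0.0, lr_lambda=0.05):
    """Generic Lagrangian-style balancing of a reward and a constraint (assumed form).

    constraint_value: measured violation signal, e.g. the fraction of generated SQL
                      that fails to execute; budget is its allowed level.
    The shaped reward is what the policy maximizes; the multiplier lam grows while
    the constraint is violated and shrinks back toward zero otherwise.
    """
    violation = constraint_value - budget
    shaped_reward = reward - lam * violation
    lam = max(0.0, lam + lr_lambda * violation)
    return shaped_reward, lam

lam = 0.0
for step, (r, c) in enumerate([(0.6, 0.4), (0.7, 0.3), (0.8, 0.1), (0.8, 0.0)]):
    shaped, lam = lagrangian_step(r, c, lam)
    print(f"step {step}: shaped_reward={shaped:.3f} lambda={lam:.3f}")
```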

论文及项目相关链接

PDF

Summary:强化学习在提升Text2SQL大模型的推理能力方面展现出巨大潜力,但性能高度依赖于奖励函数的设计。不恰当的奖励可能导致模型奖励滥用,即模型利用奖励结构的漏洞实现高分而未真正完成任务。本研究采用约束强化学习框架进行Text2SQL任务,结合自然、可解释的奖励和约束信号,在训练过程中动态平衡这些信号之间的权衡。理论保证和数值实验均证明该框架相较于现有强化学习训练的大模型有所提升。

Key Takeaways

  1. 强化学习在Text2SQL大模型的推理能力提升方面表现出显著潜力,特别是使用GRPO和DAPO等高级算法。
  2. 不恰当的奖励函数设计可能导致模型奖励滥用,使模型通过利用奖励结构漏洞来获得高分而未真正完成任务。
  3. 本研究提出一种约束强化学习框架进行Text2SQL任务,该框架结合自然、可解释的奖励和约束信号。
  4. 该框架能够在训练过程中动态平衡各种信号之间的权衡。
  5. 研究建立了该约束强化学习框架的理论保证。
  6. 数值实验证明该框架在知名Text2SQL数据集上的表现优于现有强化学习训练的大模型。

Cool Papers

点此查看论文截图

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Authors:Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li

While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel

虽然思考感知生成旨在提高复杂任务的性能,但我们发现了一种关键的失败模式,即现有的序列、自回归方法会由于误差传播而意外地降低性能。为了系统地分析这一问题,我们提出了ParaBench,一个旨在评估文本和图像输出模式的新基准测试。我们使用ParaBench的分析表明,这种性能下降与生成推理和最终图像之间的对齐不良有着强烈的相关性。为了解决这一问题,我们提出了并行多模态扩散框架MMaDA-Parallel,使文本与图像在整个去噪轨迹中保持持续的双向交互。MMaDA-Parallel先通过监督微调进行训练,再由并行强化学习(ParaRL)进一步优化,该策略沿轨迹施加语义奖励以强制跨模态一致性。实验验证,我们的模型显著改善了跨模态对齐和语义一致性,在ParaBench上的输出对齐(Output Alignment)相比最新模型Bagel提升了6.9%,为思考感知的图像合成建立了更稳健的范式。我们的代码已开源:https://github.com/tyfeld/MMaDA-Parallel

论文及项目相关链接

PDF Project Page: https://tyfeld.github.io/mmadaparellel.github.io/

Summary

文本中提到,思考感知生成旨在提高复杂任务的性能,但现有序、自回归方法存在一种关键失效模式,即错误传播可能导致性能下降。为系统分析这一问题,作者提出了ParaBench基准测试,用于评估文本和图像输出模式。分析显示性能下降与生成推理和最终图像之间的对齐不良有关。为解决此问题,作者提出了并行多模态扩散框架MMaDA-Parallel,在整个去噪轨迹中实现文本和图像的持续双向交互。该模型通过监督微调进行训练,并进一步通过Parallel Reinforcement Learning(ParaRL)优化,沿轨迹应用语义奖励来强制执行跨模态一致性。实验证明该模型显著提高了跨模态对齐和语义一致性,在ParaBench上的输出对齐度相比最新模型Bagel提高了6.9%,为思考感知图像合成建立了更稳健的范式。

Key Takeaways

  1. 现有自回归方法在复杂任务中存在错误传播导致的性能下降问题。
  2. ParaBench基准测试用于评估文本和图像输出模式,揭示了生成推理与图像对齐问题。
  3. MMaDA-Parallel框架实现文本和图像的持续双向交互,提高跨模态对齐。
  4. MMaDA-Parallel通过监督微调进行训练,并采用Parallel Reinforcement Learning(ParaRL)优化。
  5. ParaRL通过沿轨迹应用语义奖励来强化跨模态一致性。
  6. 实验表明MMaDA-Parallel模型在ParaBench上显著提高输出对齐度,相对Bagel模型有6.9%的提升。

Cool Papers

点此查看论文截图

Scaling Environments for LLM Agents in the Era of Learning from Interaction: A Survey

Authors:Yuchen Huang, Sijia Li, Minghao Liu, Wei Liu, Shijue Huang, Zhiyuan Fan, Hou Pong Chan, Yi R. Fung

LLM-based agents can autonomously accomplish complex tasks across various domains. However, to further cultivate capabilities such as adaptive behavior and long-term decision-making, training on static datasets built from human-level knowledge is insufficient. These datasets are costly to construct and lack both dynamism and realism. A growing consensus is that agents should instead interact directly with environments and learn from experience through reinforcement learning. We formalize this iterative process as the Generation-Execution-Feedback (GEF) loop, where environments generate tasks to challenge agents, return observations in response to agents’ actions during task execution, and provide evaluative feedback on rollouts for subsequent learning. Under this paradigm, environments function as indispensable producers of experiential data, highlighting the need to scale them toward greater complexity, realism, and interactivity. In this survey, we systematically review representative methods for environment scaling from a pioneering environment-centric perspective and organize them along the stages of the GEF loop, namely task generation, task execution, and feedback. We further analyze benchmarks, implementation strategies, and applications, consolidating fragmented advances and outlining future research directions for agent intelligence.

基于LLM的代理可以自主完成各种领域的复杂任务。然而,为了进一步培养适应行为和长期决策制定等能力,仅依靠对人类知识水平静态数据集的训练是不足的。这些数据集构建成本高昂,缺乏动态和真实性。越来越多的共识认为,代理应该直接与环境进行交互,并通过强化学习从经验中学习。我们将这一迭代过程正式确定为生成-执行-反馈(GEF)循环,其中环境生成任务以挑战代理,在任务执行期间对代理的行为做出反应并返回观察结果,对执行结果提供评估反馈以供后续学习。在这一模式下,环境作为经验数据的不可或缺的生产者,需要朝着更大的复杂性、真实性和交互性进行扩展。本文系统地回顾了从以环境为中心的角度进行环境规模扩展的代表方法,并按GEF循环的阶段(即任务生成、任务执行和反馈)对其进行整理。我们进一步分析了基准测试、实施策略和应用,整合了碎片化的进展,并概述了智能代理的未来研究方向。
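
The Generation-Execution-Feedback loop can be written down as a small skeleton. The toy environment and agent below are placeholders invented for illustration; they only show where task generation, stepwise observations, and rollout-level feedback plug into an agent's learning update.

```python
import random

class ToyEnv:
    """Minimal stand-in environment exposing the three GEF responsibilities."""

    def generate_task(self):
        return random.randint(1, 5)          # Generation: target count for a toy counting task

    def reset(self, task):
        self.target, self.count = task, 0
        return self.count                    # initial observation

    def step(self, action):
        self.count += action                 # Execution: environment reacts to the agent's action
        return self.count, self.count >= self.target

    def evaluate(self, task, trajectory):
        return 1.0 if trajectory[-1][0] == task else 0.0   # Feedback: rollout-level score

class ToyAgent:
    def act(self, task, obs):
        return 1                             # trivial policy: always count up by one

    def learn(self, trajectory, feedback):
        pass                                 # placeholder for an RL update on the feedback signal

def gef_loop(env, agent, episodes=3):
    """Skeleton of the Generation-Execution-Feedback loop (illustrative only)."""
    for _ in range(episodes):
        task = env.generate_task()
        obs, done, trajectory = env.reset(task), False, []
        while not done:
            action = agent.act(task, obs)
            obs, done = env.step(action)
            trajectory.append((obs, action))
        agent.learn(trajectory, env.evaluate(task, trajectory))

gef_loop(ToyEnv(), ToyAgent())
```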

论文及项目相关链接

PDF 20 pages, 4 figures, SEA Workshop @ NeurIPS 2025

Summary

本文探讨LLM(大型语言模型)代理人在不同领域自主完成复杂任务的能力。文章指出,仅靠基于人类知识的静态数据集进行训练,无法培养代理人的适应行为和长期决策能力。因此,提倡让代理人直接与环境互动,通过强化学习从经验中学习。文章将这一过程形式化为生成-执行-反馈(GEF)循环,并强调环境作为经验数据不可或缺的生产者,需要向更大的复杂性、现实性和互动性扩展。本文系统回顾了环境扩展的代表方法,并分析了基准测试、实施策略和应用程序,同时巩固了分散的进展并概述了未来研究方向。

Key Takeaways

  1. LLM-based agents can accomplish complex tasks across various domains.
  2. Training on static datasets built from human-level knowledge is insufficient for cultivating capabilities like adaptive behavior and long-term decision-making.
  3. Agents should interact directly with environments and learn from experience through reinforcement learning.
  4. The generation-execution-feedback (GEF) loop formalizes the process of agent learning from environments.
  5. Environments are indispensable for producing experiential data, necessitating scaling them for greater complexity, realism, and interactivity.
  6. This survey systematically reviews methods for environment scaling from a pioneering environment-centric perspective.

Cool Papers

点此查看论文截图

Efficient Reasoning via Reward Model

Authors:Yuhao Wang, Xiaopeng Li, Cheng Gong, Ziru Liu, Suiyun Zhang, Rui Liu, Xiangyu Zhao

Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs), enabling the development of large reasoning models (LRMs). However, LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning step-a phenomenon known as overthinking-which substantially increases computational costs. Prior efforts to mitigate this issue commonly incorporate length penalties into the reward function, but we find they frequently suffer from two critical issues: length collapse and training collapse, resulting in sub-optimal performance. To address them, we propose a pipeline for training a Conciseness Reward Model (CRM) that scores the conciseness of reasoning path. Additionally, we introduce a novel reward formulation named Conciseness Reward Function (CRF) with explicit dependency between the outcome reward and conciseness score, thereby fostering both more effective and more efficient reasoning. From a theoretical standpoint, we demonstrate the superiority of the new reward from the perspective of variance reduction and improved convergence properties. Besides, on the practical side, extensive experiments on five mathematical benchmark datasets demonstrate the method’s effectiveness and token efficiency, which achieves an 8.1% accuracy improvement and a 19.9% reduction in response token length on Qwen2.5-7B. Furthermore, the method generalizes well to other LLMs including Llama and Mistral. The implementation code and datasets are publicly available for reproduction: https://anonymous.4open.science/r/CRM.

强化学习与可验证奖励(RLVR)已证明可以提高大型语言模型(LLM)的推理能力,从而实现大型推理模型(LRM)的发展。然而,像DeepSeek-R1和OpenAI o1这样的LRM经常产生冗长且包含多余或无关推理步骤的响应——这种现象被称为过度思考,这极大地增加了计算成本。之前为解决此问题的尝试通常会在奖励函数中加入长度惩罚,但我们发现它们经常面临两个关键问题:长度崩溃和训练崩溃,导致性能不佳。为解决这些问题,我们提出了训练简洁性奖励模型(CRM)的管道,该模型能对推理路径的简洁性进行评分。此外,我们引入了一种名为简洁性奖励函数(CRF)的新型奖励公式,该公式将结果奖励与简洁性分数之间建立了明确的依赖关系,从而促进了更有效、更高效的推理。从理论上看,我们从降低差异性和改善收敛性两个角度证明了新奖励的优越性。此外,在五个数学基准数据集上进行的广泛实验证明了该方法的有效性和标记效率,在Qwen2.5-7B上实现了8.1%的准确率提升和19.9%的响应令牌长度减少。而且,该方法可以很好地推广到其他LLM,包括Llama和Mistral。相关实现代码和数据集已公开以供复制:https://anonymous.4open.science/r/CRM。
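
One simple way to give the outcome reward an explicit dependency on a conciseness score, as described above, is a multiplicative form in which conciseness only pays off when the answer is correct. The function below is an assumed illustration of that coupling, not the paper's exact CRF; the alpha coefficient is hypothetical.

```python
def conciseness_reward(outcome_correct, conciseness_score, alpha=0.5):
    """Illustrative reward with explicit outcome-conciseness coupling (assumed form).

    outcome_correct:   1.0 if the final answer is verified correct, else 0.0
    conciseness_score: in [0, 1], e.g. produced by a trained Conciseness Reward Model (CRM)
    Conciseness only adds reward when the answer is correct, so the policy cannot
    trade correctness for shorter outputs (the failure mode behind naive length penalties).
    """
    return outcome_correct * (1.0 + alpha * conciseness_score)

print(conciseness_reward(1.0, 0.8))  # correct and concise    -> 1.4
print(conciseness_reward(1.0, 0.1))  # correct but verbose    -> 1.05
print(conciseness_reward(0.0, 0.9))  # wrong, however concise -> 0.0
```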

论文及项目相关链接

PDF

Summary

强化学习结合可验证奖励(RLVR)提高了大型语言模型(LLM)的推理能力,推动了大型推理模型(LRM)的发展。然而,LRMs如DeepSeek-R1和OpenAI o1产生的响应冗长,包含大量冗余或无关推理步骤,即“过度思考”,这增加了计算成本。为解决这个问题,我们提出了训练简洁奖励模型(CRM)的管道,该模型对推理路径的简洁性进行评分。同时,我们引入了名为简洁奖励函数(CRF)的新型奖励公式,将结果奖励与简洁性评分之间建立明确依赖关系,从而促进更有效、更高效的推理。实验证明,该方法在五个数学基准数据集上取得了显著效果,响应令牌长度减少了19.9%,准确率提高了8.1%。此方法对其他LLMs如Llama和Mistral具有良好的通用性。

Key Takeaways

  1. RLVR提高了LLM的推理能力,推动了LRM的发展。
  2. LRMs如DeepSeek-R1和OpenAI o1存在生成冗长响应的问题。
  3. 现有方法通过在奖励函数中引入长度惩罚来减轻这一问题,但存在长度崩溃和训练崩溃两个关键问题。
  4. 提出训练简洁奖励模型(CRM)来评分推理路径的简洁性。
  5. 引入简洁奖励函数(CRF),建立结果奖励与简洁性评分之间的明确依赖关系。
  6. 实验证明该方法在五个数学基准数据集上有效,响应令牌长度减少,准确率提高。

Cool Papers

点此查看论文截图

History-Aware Reasoning for GUI Agents

Authors:Ziwei Wang, Leyang Yang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, Yong Li

Advances in Multimodal Large Language Models have significantly enhanced Graphical User Interface (GUI) automation. Equipping GUI agents with reliable episodic reasoning capabilities is essential for bridging the gap between users’ concise task descriptions and the complexities of real-world execution. Current methods integrate Reinforcement Learning (RL) with System-2 Chain-of-Thought, yielding notable gains in reasoning enhancement. For long-horizon GUI tasks, historical interactions connect each screen to the goal-oriented episode chain, and effectively leveraging these clues is crucial for the current decision. However, existing native GUI agents exhibit weak short-term memory in their explicit reasoning, interpreting the chained interactions as discrete screen understanding, i.e., unawareness of the historical interactions within the episode. This history-agnostic reasoning challenges their performance in GUI automation. To alleviate this weakness, we propose a History-Aware Reasoning (HAR) framework, which encourages an agent to reflect on its own errors and acquire episodic reasoning knowledge from them via tailored strategies that enhance short-term memory in long-horizon interaction. The framework mainly comprises constructing a reflective learning scenario, synthesizing tailored correction guidelines, and designing a hybrid RL reward function. Using the HAR framework, we develop a native end-to-end model, HAR-GUI-3B, which alters the inherent reasoning mode from history-agnostic to history-aware, equipping the GUI agent with stable short-term memory and reliable perception of screen details. Comprehensive evaluations across a range of GUI-related benchmarks demonstrate the effectiveness and generalization of our method.

多模态大型语言模型的进步极大地促进了图形用户界面(GUI)的自动化。为GUI代理配备可靠的情景推理能力是缩小用户简洁任务描述与真实世界执行复杂性之间差距的关键。当前的方法将强化学习(RL)与系统2思维链相结合,在推理增强方面取得了显著的成果。对于长周期GUI任务,历史交互将每个屏幕与以目标为导向的情节链连接起来,有效利用这些线索对于当前决策至关重要。然而,现有的本机GUI代理在显性推理方面表现出短期记忆弱的缺陷,他们将连锁互动解释为离散屏幕理解,即不了解情节内的历史交互。这种忽视历史的推理挑战了它们在GUI自动化方面的性能。为了缓解这一弱点,我们提出了历史感知推理(HAR)框架,该框架鼓励代理反思自己的错误,并通过定制的策略从错误中获取情景推理知识,这些策略增强了长周期交互的短期记忆。该框架主要包括构建反思学习场景、综合定制修正指南以及设计混合RL奖励功能。使用HAR框架,我们开发了一种本机端到端模型HAR-GUI-3B,它将固有的推理模式从忽视历史转变为感知历史,为GUI代理提供稳定的短期记忆和可靠的屏幕细节感知。在多种GUI相关基准测试上的综合评估证明了我们方法的有效性和通用性。

论文及项目相关链接

PDF Paper accepted to AAAI 2026

Summary

随着多模态大型语言模型的进步,图形用户界面(GUI)自动化得到了显著提升。为了在用户的简洁任务描述与现实世界的复杂执行之间搭建桥梁,为GUI代理配备可靠的情景推理能力至关重要。当前方法将强化学习(RL)与系统2思维链相结合,在推理增强方面取得了显著成效。对于长期GUI任务,历史互动将每个屏幕与以目标为导向的剧集链相连接,有效利用这些线索对当前的决策至关重要。然而,现有的本地GUI代理在显性推理方面存在短期记忆弱的缺陷,他们将连锁互动解释为离散屏幕理解,即不了解剧集内的历史互动。针对这一历史无知推理的缺陷,我们提出了历史感知推理(HAR)框架,鼓励代理反思自己的错误,并通过定制策略从错误中获取情景推理知识,增强长期互动中的短期记忆。使用HAR框架,我们开发了一个端到端的本地模型HAR-GUI-3B,它将固有的推理模式从缺乏历史感知转变为具备历史感知,为GUI代理配备稳定的短期记忆和可靠的屏幕细节感知。全面评估GUI相关基准测试的效果,证明了我们的方法的有效性和通用性。

Key Takeaways

  1. 多模态大型语言模型的进步显著增强了GUI自动化。
  2. 强化学习(RL)与系统2思维链的结合在增强GUI代理的推理能力方面取得了显著成效。
  3. 对于长期GUI任务,历史互动对当前的决策至关重要。
  4. 现有GUI代理存在短期记忆弱的缺陷,无法充分理解历史互动的重要性。
  5. 提出了历史感知推理(HAR)框架,鼓励代理从错误中学习并增强短期记忆。
  6. 使用HAR框架开发的HAR-GUI-3B模型具备历史感知的推理能力。

Cool Papers

点此查看论文截图

OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement Learning

Authors:Zezhen Ding, Zhen Tan, Jiheng Zhang, Tianlong Chen

Optimization modeling and solving are fundamental to the application of Operations Research (OR) in real-world decision making, yet the process of translating natural language problem descriptions into formal models and solver code remains highly expertise-intensive. While recent advances in large language models (LLMs) have opened new opportunities for automation, the generalization ability and data efficiency of existing LLM-based methods are still limited, as most require vast amounts of annotated or synthetic data, resulting in high costs and scalability barriers. In this work, we present OR-R1, a data-efficient training framework for automated optimization modeling and solving. OR-R1 first employs supervised fine-tuning (SFT) to help the model acquire the essential reasoning patterns for problem formulation and code generation from limited labeled data. In addition, it improves the capability and consistency through Test-Time Group Relative Policy Optimization (TGRPO). This two-stage design enables OR-R1 to leverage both scarce labeled and abundant unlabeled data for effective learning. Experiments show that OR-R1 achieves state-of-the-art performance with an average solving accuracy of 67.7%, using only 1/10 the synthetic data required by prior methods such as ORLM, exceeding ORLM’s solving accuracy by up to 4.2%. Remarkably, OR-R1 outperforms ORLM by over 2.4% with just 100 synthetic samples. Furthermore, TGRPO contributes an additional 3.1%-6.4% improvement in accuracy, significantly narrowing the gap between single-attempt (Pass@1) and multi-attempt (Pass@8) performance from 13% to 7%. Extensive evaluations across diverse real-world benchmarks demonstrate that OR-R1 provides a robust, scalable, and cost-effective solution for automated OR optimization problem modeling and solving, lowering the expertise and data barriers for industrial OR applications.

优化建模与求解是运筹学(OR)应用于现实决策的基础,但将自然语言问题描述转化为形式化模型和求解器代码的过程仍然高度依赖专家经验。尽管大型语言模型(LLM)的最新进展为自动化带来了新的机会,现有基于LLM的方法在泛化能力和数据效率上仍然有限,大多需要大量标注或合成数据,导致成本高、难以扩展。在这项工作中,我们提出了OR-R1,一个数据高效的自动化优化建模与求解训练框架。OR-R1首先通过监督微调(SFT)帮助模型从有限的标注数据中学习问题建模与代码生成所需的基本推理模式,再通过测试时组相对策略优化(TGRPO)进一步提升能力与一致性。这一两阶段设计使OR-R1能够同时利用稀缺的标注数据和丰富的无标注数据进行有效学习。实验表明,OR-R1仅使用ORLM等现有方法十分之一的合成数据,就达到了67.7%的平均求解准确率这一最新水平,求解准确率最高超出ORLM 4.2%;值得注意的是,仅用100个合成样本,OR-R1就超过ORLM 2.4%以上。此外,TGRPO额外带来3.1%-6.4%的准确率提升,将单次尝试(Pass@1)与多次尝试(Pass@8)之间的性能差距从13%显著缩小至7%。在多样的真实世界基准上的广泛评估表明,OR-R1为自动化运筹优化问题的建模和求解提供了稳健、可扩展且具有成本效益的解决方案,降低了工业OR应用的专业知识与数据门槛。
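
TGRPO builds on group-relative policy optimization, whose core quantity is the group-normalized advantage of each sampled solution. The sketch below shows that generic GRPO-style advantage computation only; the test-time adaptation specifics of TGRPO are not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO-style training (generic sketch, not OR-R1's exact TGRPO).

    rewards: (G,) rewards of G candidate solutions sampled for the same problem,
             e.g. 1.0 if the generated model + solver code solves the instance, else 0.0.
    Each sample is scored against its own group, so no learned value network is needed.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]).round(2))  # [ 1. -1. -1.  1.]
```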

论文及项目相关链接

PDF 9 pages, 5 figures, AAAI 2026

Summary

本文介绍了一种名为OR-R1的数据高效训练框架,用于自动化优化建模和求解。该框架采用监督微调(SFT)技术,帮助模型从有限标记数据中获取问题表述和代码生成的基本推理模式。此外,它采用测试时间群体相对策略优化(TGRPO)来提高能力和一致性。这种两阶段设计使OR-R1能够利用稀缺的标记数据和大量的无标记数据进行有效学习。实验表明,OR-R1在仅使用ORLM所需合成数据的十分之一的情况下,达到了最先进的性能,平均求解准确率为67.7%,求解准确率最高超出ORLM 4.2%。尤其令人印象深刻的是,OR-R1仅需100个合成样本就能超出ORLM 2.4%以上。而且,TGRPO进一步提高了准确性,缩小了单次尝试和多次尝试之间的性能差距。广泛的真实世界基准测试表明,OR-R1为自动化运筹优化问题的建模和求解提供了稳健、可扩展和经济的解决方案。

Key Takeaways

  1. OR-R1是一种数据高效训练框架,用于自动化优化建模和求解。
  2. 采用监督微调(SFT)技术帮助模型从有限标记数据中获取基本推理模式。
  3. 测试时间群体相对策略优化(TGRPO)提高了能力和一致性。
  4. OR-R1通过仅使用十分之一的数据就达到了最先进的性能。
  5. 与现有方法相比,OR-R1在合成数据需求更少的情况下实现了更高的求解准确度。
  6. TGRPO对性能有额外贡献,缩小了单次尝试和多次尝试之间的性能差距。

Cool Papers

点此查看论文截图

Advancing Autonomous Emergency Response Systems: A Generative AI Perspective

Authors:Yousef Emami, Radha Reddy, Azadeh Pourkabirian, Miguel Gutierrez Gaitan

Autonomous Vehicles (AVs) are poised to revolutionize emergency services by enabling faster, safer, and more efficient responses. This transformation is driven by advances in Artificial Intelligence (AI), particularly Reinforcement Learning (RL), which allows AVs to navigate complex environments and make critical decisions in real time. However, conventional RL paradigms often suffer from poor sample efficiency and lack adaptability in dynamic emergency scenarios. This paper reviews next-generation AV optimization strategies to address these limitations. We analyze the shift from conventional RL to Diffusion Model (DM)-augmented RL, which enhances policy robustness through synthetic data generation, albeit with increased computational cost. Additionally, we explore the emerging paradigm of Large Language Model (LLM)-assisted In-Context Learning (ICL), which offers a lightweight and interpretable alternative by enabling rapid, on-the-fly adaptation without retraining. By reviewing the state of the art in AV intelligence, DM-augmented RL, and LLM-assisted ICL, this paper provides a critical framework for understanding the next generation of autonomous emergency response systems from a Generative AI perspective.

自动驾驶车辆(AVs)通过实现更快、更安全、更高效的响应,正朝着彻底改变应急服务领域迈出坚实步伐。这一变革是由人工智能(AI)的进步推动的,特别是强化学习(RL)的进步,强化学习使得自动驾驶车辆能够在复杂环境中导航并在实时做出关键决策。然而,传统的强化学习范式通常面临样本效率低和动态应急场景下适应性不足的问题。本文综述了下一代自动驾驶优化策略来解决这些局限性。我们分析了从传统强化学习向扩散模型(DM)增强型强化学习的转变,尽管计算成本增加,但扩散模型增强型强化学习通过合成数据生成提高了策略稳健性。此外,我们还探索了大型语言模型(LLM)辅助上下文学习(ICL)的新兴范式,它通过无需重新训练即可实现快速即时适应,提供了一种轻便且可解释性的替代方案。本文通过回顾自动驾驶智能、扩散模型增强型强化学习和大型语言模型辅助上下文学习的最新进展,提供了一个从生成人工智能的角度了解下一代自主应急响应系统的关键框架。

论文及项目相关链接

PDF 8 pages, 3 figures, 2 tables

Summary

自动驾驶车辆(AVs)通过运用人工智能(AI)尤其是强化学习(RL)技术,能够实现更快、更安全、更高效的应急服务响应,从而引领一场革命。然而,传统的RL范式在应对动态紧急场景时存在样本效率低下和适应性不足的局限性。本文评述了下一代AV优化策略,从传统的RL转向扩散模型(DM)增强RL,通过合成数据生成提高策略稳健性,同时探索大型语言模型(LLM)辅助的上下文学习(ICL)新兴范式。本文从生成式人工智能的角度为理解下一代自主应急响应系统提供了关键框架。

Key Takeaways

  1. 自动驾驶车辆(AVs)能更快、更安全、更高效地进行应急服务响应。
  2. 人工智能(AI)特别是强化学习(RL)技术是自动驾驶车辆实现高效响应的关键。
  3. 传统RL范式在应对动态紧急场景时存在样本效率低下和适应性不足的问题。
  4. 扩散模型(DM)增强RL是提升策略稳健性的新方法,但计算成本较高。
  5. 大型语言模型(LLM)辅助的上下文学习(ICL)提供了一种轻便、可解释性强的替代方案,可快速适应不需再训练。
  6. 本文从生成式人工智能的角度提供了理解下一代自主应急响应系统的关键框架。

Cool Papers

点此查看论文截图

SpiralThinker: Latent Reasoning through an Iterative Process with Text-Latent Interleaving

Authors:Shengmin Piao, Sanghyun Park

Recent advances in large reasoning models have been driven by reinforcement learning and test-time scaling, accompanied by growing interest in latent rather than purely textual reasoning. However, existing latent reasoning methods lack mechanisms to ensure stable evolution of latent representations and a systematic way to interleave implicit and explicit reasoning. We introduce SpiralThinker, a unified framework that performs iterative updates over latent representations, enabling extended implicit reasoning without generating additional tokens. A progressive alignment objective combined with structured annotations maintains coherence between latent and textual reasoning. Across mathematical, logical, and commonsense reasoning tasks, SpiralThinker achieves the best overall performance among latent reasoning approaches, consistently surpassing previous methods across all benchmarks. Detailed analyses reveal that both iteration and alignment are indispensable, the numbers of latent tokens and iterations exhibit dataset-specific optima, and appropriate alignment proves critical for an effective iterative process. Overall, SpiralThinker bridges iterative computation and latent reasoning, demonstrating that aligned iterative updates can reliably steer reasoning in the latent space.

近年来,大型推理模型的进步得益于强化学习和测试时缩放技术的推动,同时对潜在推理而非纯文本推理的兴趣也在增长。然而,现有的潜在推理方法缺乏保证潜在表示稳定进化的机制,以及交替进行隐式和显式推理的系统方法。我们引入了SpiralThinker,这是一个统一框架,对潜在表示进行迭代更新,能够在不生成额外令牌的情况下进行扩展的隐式推理。结合结构化注释的渐进对齐目标,在潜在推理和文本推理之间保持一致性。在涵盖数学、逻辑和常识推理的任务中,SpiralThinker在潜在推理方法中实现了最佳的整体性能,在所有基准测试中均超过了以前的方法。详细分析表明,迭代和对齐都是必不可少的,潜在令牌的数量和迭代次数表现出数据集特定的最优状态,适当的对齐对于有效的迭代过程至关重要。总体上,SpiralThinker在迭代计算与潜在推理之间架起了桥梁,证明对齐的迭代更新可以可靠地引导潜在空间中的推理。

论文及项目相关链接

PDF

Summary

近期大型推理模型的进展得益于强化学习与测试时缩放技术,同时人们对潜在推理的兴趣日益增长,而非纯粹的文本推理。然而,现有潜在推理方法缺乏保证潜在表示稳定演化的机制,以及混合隐式和显式推理的系统方法。为此,我们引入了SpiralThinker框架,它能够在不生成额外令牌的情况下,对潜在表示进行迭代更新,实现扩展的隐式推理。结合结构化注释,通过渐进对齐目标维持潜在推理与文本推理之间的一致性。在数学、逻辑和常识推理任务方面,SpiralThinker在潜在推理方法中表现最佳,在所有基准测试中均超越以前的方法。分析表明,迭代和对齐都是必不可少的,潜在令牌的数量和迭代次数展现出数据集特定的最优状态,适当的对齐对于有效的迭代过程至关重要。总的来说,SpiralThinker在迭代计算与潜在推理之间架起了桥梁,证明对齐的迭代更新能够可靠地引导潜在空间中的推理。

Key Takeaways

  1. 近期大型推理模型的进展得益于强化学习与测试时缩放技术。
  2. 现有潜在推理方法缺乏保证潜在表示稳定演化的机制。
  3. SpiralThinker框架能通过迭代更新潜在表示实现扩展的隐式推理。
  4. SpiralThinker结合了结构化注释,保持潜在推理与文本推理之间的一致性。
  5. SpiralThinker在多种推理任务上表现最佳,超越之前的方法。
  6. 分析和实验表明迭代和对齐在SpiralThinker中起关键作用。

Cool Papers

点此查看论文截图

Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds

Authors:Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, Yujia Qin, Bo An, Libin Liu, Guang Shi

We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary. Trained in Genshin Impact, Lumine successfully completes the entire five-hour Mondstadt main storyline on par with human-level efficiency and follows natural language instructions to perform a broad spectrum of tasks in both 3D open-world exploration and 2D GUI manipulation across collection, combat, puzzle-solving, and NPC interaction. In addition to its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization. Without any fine-tuning, it accomplishes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These promising results highlight Lumine’s effectiveness across distinct worlds and interaction dynamics, marking a concrete step toward generalist agents in open-ended environments.

我们推出Lumine,这是首个开放配方,用于开发能够在具有挑战性的3D开放世界环境中实时完成数小时复杂任务的通用智能体。Lumine采用类似人类的交互范式,以端到端的方式统一感知、推理和行动,由视觉语言模型驱动。它以5Hz的频率处理原始像素,以30Hz的频率产生精确的键盘鼠标动作,并在必要时自适应地触发推理。在《原神》的训练下,Lumine成功完成了长达五个小时的蒙德主线任务,效率堪比人类,并能根据自然语言指令执行广泛的3D开放世界探索和2D图形用户界面操作任务,包括收集、战斗、解谜和NPC交互等。除了域内表现外,Lumine还展现出强大的跨游戏零样本泛化能力。无需微调,它即可在《鸣潮》(Wuthering Waves)中完成长达100分钟的任务,以及在《崩坏:星穹铁道》中完成完整的第一章节五小时任务。这些令人鼓舞的结果凸显了Lumine在不同世界和交互动态中的有效性,朝着开发开放环境中的通用智能体迈出了坚实的一步。

论文及项目相关链接

PDF

Summary

Lumine是首个用于构建通用智能体的开放配方,所得智能体能够在充满挑战的三维开放世界环境中实时完成长达数小时的复杂任务。Lumine采用类人的交互范式,以端到端的方式统一感知、推理与行动,由视觉语言模型驱动,直接从原始像素生成键盘鼠标操作。在《原神》的训练下,Lumine成功完成了长达五小时的蒙德主线任务,效率堪比人类,并能在3D开放世界探索和2D界面操作中执行一系列任务。此外,Lumine还展现出强大的跨游戏零样本泛化能力,在未进行微调的情况下在其他游戏中也表现出色。

Key Takeaways

  1. Lumine是首个开发通用智能体代理的公开配方,能在挑战性的三维开放世界环境中完成长时间任务。
  2. Lumine采用类人交互模式,统一感知、推理和行动。
  3. Lumine通过视觉语言模型驱动,能处理原始像素数据并产生精确操作。
  4. 在《原神》的训练下,Lumine能高效完成长达五小时的主线任务。
  5. Lumine具备在三维开放世界探索和二维界面操作中的广泛任务执行能力。
  6. Lumine展现出强大的零样本泛化能力,在未进行微调的情况下在其他游戏中也表现出色。

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !