⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never use them for serious academic work; they are meant only as a first-pass screen before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it free on HuggingFace
Updated: 2025-11-22
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Authors:Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to $\sim$3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
Paper & Project Links
PDF 9 Pages, 6 Figures, 4 Tables
Summary
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting autonomy and scalability. To improve LMM reasoning, the authors propose EvoLMM, a purely unsupervised training approach that instantiates two cooperative agents from a single backbone model: a Proposer that generates diverse, image-grounded questions, and a Solver that answers them through internal consistency. This dynamic feedback encourages informative queries and refined structured reasoning without relying on ground truth or human judgment. Using the popular Qwen2.5-VL as the base model, EvoLMM yields gains of roughly 3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision.
Key Takeaways
- Large multimodal models (LMMs) show strong reasoning and perception abilities, but conventional training pipelines limit their autonomy and scalability.
- This work improves LMM reasoning in a purely unsupervised fashion.
- The EvoLMM framework consists of two cooperative agents: a Proposer that generates diverse, image-grounded questions, and a Solver that answers them through internal consistency (a minimal sketch of this loop follows the list).
- A dynamic feedback mechanism encourages informative queries and steadily refined structured reasoning, without ground truth or human judgment.
- EvoLMM performs strongly on multimodal math-reasoning benchmarks, improving on the base model by roughly 3%.
- The code and models are publicly available, easing future reference and extension.
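To make the Proposer/Solver loop concrete, below is a minimal sketch of one self-evolution step, assuming the continuous self-reward is instantiated as the agreement ratio among sampled Solver answers; the `proposer`/`solver` stubs, the majority-vote reward, and the sample count are illustrative assumptions rather than the paper's exact implementation (the actual method updates both roles with policy gradients).

```python
import random
from collections import Counter

def solver_consistency_reward(answers):
    """Continuous self-reward: fraction of sampled answers agreeing with
    the majority answer (1.0 = full consensus, ~1/N = no consensus)."""
    majority, freq = Counter(answers).most_common(1)[0]
    return freq / len(answers), majority

def self_evolve_step(proposer, solver, image, n_samples=8):
    """One Proposer/Solver round; both roles come from the same backbone."""
    question = proposer(image)                       # image-grounded question
    answers = [solver(image, question) for _ in range(n_samples)]
    reward, consensus = solver_consistency_reward(answers)
    # A policy-gradient update of both roles from `reward` is omitted here;
    # note that no ground-truth labels are involved anywhere.
    return question, consensus, reward

# Toy usage with stubs that mimic partial agreement.
proposer = lambda img: f"What is the largest bar in {img}?"
solver = lambda img, q: random.choice(["A", "A", "A", "B"])
print(self_evolve_step(proposer, solver, "chart_001.png"))
```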
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Authors:Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, Pheng-Ann Heng
Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generating, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
Paper & Project Links
PDF Project Page: https://think-while-gen.github.io Code: https://github.com/ZiyuGuo99/Thinking-while-Generating
Summary
Recent advances in visual generation increasingly explore the integration of reasoning capabilities. This preliminary study introduces Thinking-while-Generating (TwiG), the first interleaved framework that co-evolves textual reasoning throughout the visual generation process. As visual content is progressively generated, interleaved textual reasoning both guides upcoming local regions and reflects on previously synthesized ones, producing more context-aware and semantically rich visual outputs. The study investigates three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on the curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy. The authors hope this work inspires further research on interleaving textual reasoning for visual generation. Code will be released at https://github.com/ZiyuGuo99/Thinking-while-Generating.
Key Takeaways
- The Thinking-while-Generating (TwiG) framework interleaves textual reasoning throughout the visual generation process.
- Unlike pre-planning or post-refinement approaches, it reasons on the fly while visual content is being generated (see the sketch after this list).
- The dynamic interplay makes visual outputs more context-aware and semantically rich.
- Three strategies are studied: zero-shot prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), each offering distinct insights into interleaved reasoning.
- The work highlights the value of weaving textual reasoning into visual generation and points toward future research.
- The code is to be released publicly, making it easy for other researchers to build on.
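A minimal sketch of the interleaved loop is given below, assuming a hypothetical `think`/`generate_region`/`reflect` interface (these method names are illustrative, not the released API): reasoning is produced before each local region to guide it, and again afterwards to reflect on what has been synthesized.

```python
class StubModel:
    """Stand-in for a multimodal generator with textual-reasoning hooks."""
    def think(self, prompt, canvas):
        return f"plan for region {len(canvas)}"
    def generate_region(self, prompt, canvas, plan):
        return f"<region {len(canvas)} guided by: {plan}>"
    def reflect(self, prompt, canvas):
        return f"critique after {len(canvas)} regions"

def thinking_while_generating(model, prompt, n_regions=3):
    canvas, thoughts = [], []
    for _ in range(n_regions):
        plan = model.think(prompt, canvas)              # guide upcoming region
        canvas.append(model.generate_region(prompt, canvas, plan))
        thoughts.append(model.reflect(prompt, canvas))  # reflect on the result
    return canvas, thoughts

canvas, thoughts = thinking_while_generating(StubModel(), "a red bicycle")
print(canvas[-1], "|", thoughts[-1])
```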
Learning to Think Fast and Slow for Visual Language Models
Authors:Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou
When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
Paper & Project Links
Summary
This work presents a simple reinforcement-learning approach that lets visual language models automatically switch between fast and slow thinking modes depending on task difficulty. The method has two stages: the first labels data as requiring fast or slow thinking based on the model's output length; the second trains the model with GRPO together with the thinking-mode labels to develop dual-mode thinking. Despite its simplicity, the resulting model, DualMindVLM, substantially outperforms the base model, matches state-of-the-art visual reasoning models, and maintains exceptionally high token efficiency.
Key Takeaways
- People think slowly when facing complex problems and quickly when facing simple ones.
- Existing visual language models mainly pursue lengthy, detailed reasoning chains, incurring excessive computational cost.
- A simple reinforcement-learning approach lets visual language models switch automatically between fast and slow thinking depending on task difficulty.
- The method has two stages: labeling data as fast- or slow-thinking by output length, then training with GRPO and the mode labels (a labeling sketch follows this list).
- The approach significantly improves on the base model.
- Its performance is on par with state-of-the-art visual reasoning models.
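Below is a minimal sketch of the stage-1 labeling rule, assuming mode labels are assigned by thresholding the word count of the pre-trained model's answer; the threshold value and field names are illustrative assumptions.

```python
def label_thinking_modes(samples, length_threshold=50):
    """Tag each question 'fast' if the base model's answer is short,
    'slow' otherwise; stage 2 would train with GRPO on these labels."""
    labeled = []
    for question, model_output in samples:
        mode = "fast" if len(model_output.split()) < length_threshold else "slow"
        labeled.append({"question": question, "mode": mode})
    return labeled

data = [("What color is the sky?", "Blue."),
        ("Prove the triangle inequality.", " ".join(["step"] * 400))]
print(label_thinking_modes(data))  # first -> fast, second -> slow
```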
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
Authors:Junhao Cheng, Liang Hou, Xin Tao, Jing Liao
While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video’s inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective output, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Codes are released in https://github.com/KlingTeam/VANS.
Paper & Project Links
PDF Project page: https://video-as-answer.github.io/
Summary
This paper extends video generation beyond entertainment by treating video as an answer modality. Arguing that video can demonstrate physical-world information that language alone struggles to convey, the authors formalize Video-Next-Event Prediction (VNEP), which requires dynamic video responses instead of textual ones. To handle multimodal input understanding, instruction-conditioned reasoning, and visually and semantically consistent generation, they introduce VANS, which uses reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM). Experiments show that VANS achieves state-of-the-art performance in both video event prediction and visualization.
Key Takeaways
- Video is a promising new answer modality for next-event prediction (NEP).
- The VNEP task requires models to understand multimodal input spanning visual and language information.
- VANS aligns a vision-language model with a video diffusion model through reinforcement learning.
- VANS achieves state-of-the-art results in both video event prediction and visualization.
- At its core, Joint-GRPO orchestrates the VLM and VDM to function as a unit via a shared reward (a minimal sketch follows this list).
- To enable this learning, the authors built VANS-Data-100K, a dataset dedicated to the VNEP task.
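The shared-reward idea behind Joint-GRPO can be sketched as below: each rollout pairs a VLM caption with the VDM video generated from it, one blended reward scores the pair, and group-normalized advantages drive both models. The equal blending weight and the plain mean/std normalization are assumptions for illustration.

```python
def joint_grpo_advantages(caption_rewards, video_rewards, weight=0.5):
    """Blend per-rollout caption and video rewards into one shared reward,
    then compute GRPO-style group-normalized advantages."""
    rewards = [weight * c + (1 - weight) * v
               for c, v in zip(caption_rewards, video_rewards)]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Three rollouts for one input: caption accuracy vs. video faithfulness scores.
print(joint_grpo_advantages([0.9, 0.4, 0.7], [0.8, 0.3, 0.5]))
```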
V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
Authors:Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, Yang You
Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
Paper & Project Links
PDF Project Page: https://oahzxl.github.io/VReasonBench
Summary
Systematic evaluation of video reasoning matters as generative video models such as Veo-3 show surprising zero-shot reasoning abilities. V-ReasonBench assesses video reasoning along four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. Built from both synthetic and real-world image sequences, the benchmark provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluating six state-of-the-art video models reveals clear dimension-wise differences in reasoning ability.
Key Takeaways
- V-ReasonBench is a benchmark for evaluating video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics.
- The benchmark covers synthetic and real-world image sequences and offers a diverse set of answer-verifiable tasks.
- Evaluations of six state-of-the-art video models show clear dimension-wise differences in reasoning ability.
- Comparing video models with strong image models reveals relative strengths as well as common hallucination behaviors.
- The effect of video duration on Chain-of-Frames reasoning is studied and analyzed.
- V-ReasonBench offers a unified and reproducible framework for measuring video reasoning.
Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
Authors:Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han
The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively select suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
Paper & Project Links
Summary
Large language models (LLMs) with strong reasoning abilities open new frontiers in complex problem-solving, but training them with reinforcement learning (RL) hits an efficiency bottleneck: response generation exhibits a persistent long-tail distribution in which a few very long responses dominate execution time, wasting resources and inflating costs. TLT addresses this by accelerating reasoning RL training losslessly with adaptive speculative decoding. It consists of two synergistic components: an Adaptive Drafter and an Adaptive Rollout Engine. Evaluations show that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment.
Key Takeaways
- LLMs with strong reasoning capabilities are advancing complex problem-solving.
- Reinforcement learning (RL) training of reasoning models faces efficiency bottlenecks.
- The long-tail distribution of response lengths during generation wastes resources and inflates costs.
- The TLT system integrates adaptive speculative decoding to remove this bottleneck and speed up training losslessly.
- TLT has two key components: the Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation, and the Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDA Graphs and selects a suitable speculative-decoding strategy per input batch (a decoding sketch follows this list).
- TLT achieves over 1.7x end-to-end RL training speedup while preserving model accuracy.
- The resulting draft model is a free byproduct suitable for efficient deployment.
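For intuition, here is a minimal greedy speculative-decoding step of the kind the Adaptive Rollout Engine schedules: the draft model proposes `k` tokens, the target model verifies them, the longest agreeing prefix is kept, and the target supplies the first disagreeing token. The `draft_next`/`target_next` callables are stub interfaces; real systems verify the whole proposal in one batched target pass and use probabilistic acceptance.

```python
def speculative_decode_step(draft_next, target_next, prefix, k=4):
    """One greedy speculative step: propose k draft tokens, keep the
    longest prefix the target agrees with, append a target correction."""
    ctx, proposal = list(prefix), []
    for _ in range(k):                       # cheap draft proposals
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    ctx, accepted = list(prefix), []
    for tok in proposal:                     # target-side verification
        expected = target_next(ctx)
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(expected)        # correction from the target
            break
    else:
        accepted.append(target_next(ctx))    # bonus token: all k matched
    return accepted

draft = lambda ctx: len(ctx) % 10
target = lambda ctx: len(ctx) % 10 if len(ctx) % 6 else 0
print(speculative_decode_step(draft, target, [1, 2, 3]))  # -> [3, 4, 5, 0]
```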
Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
Authors:Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba’s structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model that has constant deployment memory against the number of models in the family.
Paper & Project Links
Summary
Nemotron Elastic is a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embeds multiple nested submodels within a single parent model, each optimized for a different deployment configuration and budget. The submodels share weights with the parent and can be extracted zero-shot at deployment without additional training or fine-tuning, enabled by an end-to-end trained router coupled to a two-stage training curriculum designed for reasoning models. The framework further introduces group-aware SSM elastification, heterogeneous MLP elastification, normalized-MSE-based layer importance for better depth selection, and knowledge distillation for simultaneous multi-budget optimization. Applied to the Nemotron Nano V2 12B model, it produces 9B and 6B models simultaneously using only 110B training tokens, over 360x cheaper than training a model family from scratch and about 7x cheaper than state-of-the-art compression, with each nested model matching or beating the state of the art in accuracy.
Key Takeaways
- The Nemotron Elastic framework builds reasoning-oriented LLMs targeting multiple scales and deployment objectives in a single training run.
- Embedding multiple nested submodels inside one parent model cuts training cost.
- Submodels share weights with the parent and can be extracted zero-shot, with no extra training or fine-tuning (a weight-sharing sketch follows this list).
- The functionality is enabled by an end-to-end trained router and a two-stage training curriculum.
- Techniques such as group-aware SSM elastification, heterogeneous MLP elastification, normalized-MSE layer importance, and knowledge distillation enable simultaneous multi-budget optimization.
- Applied to the Nemotron Nano V2 12B model, the framework slashes cost relative to conventional approaches while preserving or improving accuracy.
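The nesting idea can be sketched with a weight-sharing layer: a smaller child reuses the leading slice of the parent's weight matrix, so no separate child weights exist. This toy layer only illustrates nesting; the real framework adds a learned router, Mamba/MLP-aware elastification, and distillation. A minimal sketch, assuming PyTorch:

```python
import torch
import torch.nn as nn

class ElasticLinear(nn.Module):
    """Nested-width linear layer: child models of fractional width share
    the leading rows of the parent weight (illustrative sketch only)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)

    def forward(self, x, width_frac=1.0):
        d_out = max(1, int(self.weight.shape[0] * width_frac))
        return x @ self.weight[:d_out].T    # slice, don't copy, the parent

layer = ElasticLinear(8, 16)
x = torch.randn(2, 8)
print(layer(x).shape, layer(x, width_frac=0.5).shape)  # parent vs. nested child
```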
PartUV: Part-Based UV Unwrapping of 3D Meshes
Authors:Zhaoning Wang, Xinyue Wei, Ruoxi Shi, Xiaoshuai Zhang, Hao Su, Minghua Liu
UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of a recent learning-based part decomposition method PartField, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart’s distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets, including man-made, CAD, AI-generated, and Common Shapes, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Our project page is at https://www.zhaoningwang.com/PartUV.
Paper & Project Links
PDF project page: https://www.zhaoningwang.com/PartUV
Summary
UV unwrapping flattens 3D surfaces to 2D with minimal distortion. Existing methods often fail on AI-generated meshes, producing highly fragmented charts and suboptimal boundaries. PartUV combines a learning-based part decomposition method with novel geometric heuristics to generate far fewer, part-aligned charts while keeping distortion low. It ensures each chart's distortion stays below a user-specified threshold while minimizing the total number of charts. The method has been evaluated across multiple datasets with strong results; details are available on the project page.
Key Takeaways
- UV unwrapping flattens complex 3D surfaces to 2D with minimal distortion.
- AI-generated meshes are typically noisy, bumpy, and poorly conditioned, and existing UV unwrapping methods often struggle on them.
- PartUV is a part-based UV unwrapping method that produces fewer, part-aligned charts while maintaining low distortion.
- PartUV combines learning-based part decomposition with novel geometric heuristics in a top-down recursive framework, keeping each chart's distortion below a user-specified threshold while minimizing the chart count (a recursion sketch follows this list).
- PartUV integrates and extends parameterization and packing algorithms, with dedicated handling of non-manifold and degenerate meshes.
- PartUV performs strongly across four diverse datasets: man-made, CAD, AI-generated, and Common Shapes.
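The top-down recursion can be sketched as follows, with `distortion_of` and `split` standing in for the parameterization and geometric-splitting routines; the greedy accept-or-split rule below is a simplification of the paper's chart-count minimization.

```python
def decompose_into_charts(part, max_distortion, distortion_of, split):
    """Keep `part` as one chart if it flattens within the user threshold,
    otherwise split it and recurse on the pieces."""
    if len(part) == 1 or distortion_of(part) <= max_distortion:
        return [part]
    left, right = split(part)
    return (decompose_into_charts(left, max_distortion, distortion_of, split) +
            decompose_into_charts(right, max_distortion, distortion_of, split))

# Toy stand-ins: a "part" is a list of surface samples, distortion = spread.
distortion = lambda p: max(p) - min(p)
halve = lambda p: (p[:len(p) // 2], p[len(p) // 2:])
print(decompose_into_charts([0.1, 0.2, 0.9, 1.4, 1.5], 0.5, distortion, halve))
# -> [[0.1, 0.2], [0.9], [1.4, 1.5]]
```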
D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies
Authors:Sen Chen, Tong Zhao, Yi Bin, Fei Ma, Wenqi Shao, Zheng Wang
Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. However, most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on the D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.
Paper & Project Links
PDF Accepted to AAAI 2026
Summary
Developing intelligent agents that can operate a wide range of GUIs with human-level proficiency is a key milestone toward artificial general intelligence. Most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, especially the presence of anomalies. D-GARA is a dynamic benchmarking framework for evaluating the robustness of Android GUI agents under real-world anomalies, introducing interruptions such as permission dialogs, battery warnings, and update prompts. Built on the framework, the authors construct and annotate a benchmark of commonly used Android applications with embedded anomalies to support broader community research. Experiments show substantial performance degradation in state-of-the-art GUI agents in anomaly-rich environments, underscoring the need for robustness-aware learning. D-GARA is modular and extensible, supporting seamless integration of new tasks, anomaly types, and interaction scenarios.
Key Takeaways
- Developing agents that can operate a wide range of GUIs is an important milestone toward general AI.
- Current datasets and benchmarks for evaluating GUI agents are limited and fail to reflect real-world complexity and unpredictability.
- D-GARA is a dynamic benchmarking framework for evaluating Android GUI agent robustness under real-world anomalies.
- D-GARA injects a diverse set of common real-world anomalies to simulate the challenges GUI agents face in practice.
- State-of-the-art GUI agents degrade significantly in anomaly-rich environments, calling for robustness-aware learning.
- The D-GARA framework is modular and extensible and can be customized to specific evaluation goals.
OpenQudit: Extensible and Accelerated Numerical Quantum Compilation via a JIT-Compiled DSL
Authors:Ed Younis
High-performance numerical quantum compilers rely on classical optimization, but are limited by slow numerical evaluations and a design that makes extending them with new instructions a difficult, error-prone task for domain experts. This paper introduces OpenQudit, a compilation framework that solves these problems by allowing users to define quantum operations symbolically in the Qudit Gate Language (QGL), a mathematically natural DSL. OpenQudit’s ahead-of-time compiler uses a tensor network representation and an e-graph-based pass for symbolic simplification before a runtime tensor network virtual machine (TNVM) JIT-compiles the expressions into high-performance native code. The evaluation shows that this symbolic approach is highly effective, accelerating the core instantiation task by up to $\mathtt{\sim}20\times$ on common quantum circuit synthesis problems compared to state-of-the-art tools.
Paper & Project Links
Summary
OpenQudit is a compilation framework that lets users define quantum operations symbolically in the Qudit Gate Language (QGL), a mathematically natural DSL. Its ahead-of-time compiler uses a tensor-network representation and an e-graph-based pass for symbolic simplification, after which a runtime tensor network virtual machine (TNVM) JIT-compiles the expressions into high-performance native code. This symbolic approach accelerates the core instantiation task by up to ~20x over state-of-the-art tools on common quantum circuit synthesis problems, while making the compiler easier and less error-prone for domain experts to extend with new instructions, providing a general platform for further research in quantum compilation.
Key Takeaways
- OpenQudit lets users define quantum operations symbolically in the Qudit Gate Language (QGL), a mathematically natural DSL.
- Its ahead-of-time compiler uses a tensor-network representation and an e-graph-based pass for symbolic simplification.
- A runtime tensor network virtual machine (TNVM) JIT-compiles expressions into high-performance native code.
- The design addresses the slow numerical evaluation and hard-to-extend architecture of existing numerical quantum compilers.
- The symbolic approach accelerates the core instantiation task by up to ~20x on common quantum circuit synthesis problems.
ECPv2: Fast, Efficient, and Scalable Global Optimization of Lipschitz Functions
Authors:Fares Fourati, Mohamed-Slim Alouini, Vaneet Aggarwal
We propose ECPv2, a scalable and theoretically grounded algorithm for global optimization of Lipschitz-continuous functions with unknown Lipschitz constants. Building on the Every Call is Precious (ECP) framework, which ensures that each accepted function evaluation is potentially informative, ECPv2 addresses key limitations of ECP, including high computational cost and overly conservative early behavior. ECPv2 introduces three innovations: (i) an adaptive lower bound to avoid vacuous acceptance regions, (ii) a Worst-m memory mechanism that restricts comparisons to a fixed-size subset of past evaluations, and (iii) a fixed random projection to accelerate distance computations in high dimensions. We theoretically show that ECPv2 retains ECP’s no-regret guarantees with optimal finite-time bounds and expands the acceptance region with high probability. We further empirically validate these findings through extensive experiments and ablation studies. Using principled hyperparameter settings, we evaluate ECPv2 across a wide range of high-dimensional, non-convex optimization problems. Across benchmarks, ECPv2 consistently matches or outperforms state-of-the-art optimizers, while significantly reducing wall-clock time.
Paper & Project Links
PDF Accepted at AAAI 2026 (main technical track), extended version
Summary
ECPv2 is a scalable and theoretically grounded algorithm for global optimization of Lipschitz-continuous functions with unknown Lipschitz constants. Building on the Every Call is Precious (ECP) framework, it addresses ECP's high computational cost and overly conservative early behavior through three innovations: an adaptive lower bound that avoids vacuous acceptance regions, a Worst-m memory mechanism that restricts comparisons to a fixed-size subset of past evaluations, and a fixed random projection that accelerates distance computations in high dimensions. ECPv2 provably retains ECP's no-regret guarantees with optimal finite-time bounds and expands the acceptance region with high probability. Extensive experiments and ablation studies show that it consistently matches or outperforms state-of-the-art optimizers on high-dimensional, non-convex problems while significantly reducing wall-clock time.
Key Takeaways
- ECPv2 is a new algorithm for global optimization of Lipschitz-continuous functions with unknown Lipschitz constants.
- ECPv2 addresses key limitations of the ECP framework: high computational cost, overly conservative early behavior, and vacuous acceptance regions.
- ECPv2 introduces three innovations: an adaptive lower bound, a Worst-m memory mechanism, and a fixed random projection (an acceptance-test sketch follows this list).
- ECPv2 retains ECP's no-regret guarantees with optimal finite-time bounds.
- ECPv2 is validated by extensive experiments and ablation studies on high-dimensional, non-convex optimization problems.
- ECPv2 matches or outperforms state-of-the-art optimizers while reducing wall-clock time.
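A schematic version of the acceptance test, for minimization, is sketched below under several assumptions: the Lipschitz constant `k` is treated as known, "Worst-m" is read as keeping the m largest observed values, and distances are taken after a fixed random projection (which makes the bound approximate). None of this reproduces the paper's exact adaptive-bound or no-regret machinery.

```python
import numpy as np

def make_acceptor(dim, proj_dim=8, k=1.0, m=32, seed=0):
    """Accept a candidate x only if its Lipschitz lower bound against a
    fixed-size memory of past evaluations could still beat the incumbent."""
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(proj_dim, dim)) / np.sqrt(proj_dim)  # fixed projection
    xs, fs = [], []

    def accept(x, best_f):
        if not xs:
            return True                       # nothing to compare against yet
        d = np.linalg.norm(np.stack(xs) - P @ x, axis=1)
        return float(np.max(np.array(fs) - k * d)) < best_f

    def record(x, f):
        xs.append(P @ x)
        fs.append(f)
        if len(xs) > m:                       # Worst-m: evict the smallest value
            drop = int(np.argmin(fs))
            xs.pop(drop)
            fs.pop(drop)

    return accept, record

accept, record = make_acceptor(dim=20)
record(np.zeros(20), 1.0)
print(accept(np.ones(20), best_f=1.0))        # True: the bound can still improve
```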
Toward Valid Generative Clinical Trial Data with Survival Endpoints
Authors:Perrine Chassat, Van Tuan Nguyen, Lucas Ducrot, Emilie Lanoy, Agathe Guilloux
Clinical trials face mounting challenges: fragmented patient populations, slow enrollment, and unsustainable costs, particularly for late phase trials in oncology and rare diseases. While external control arms built from real-world data have been explored, a promising alternative is the generation of synthetic control arms using generative AI. A central challenge is the generation of time-to-event outcomes, which constitute primary endpoints in oncology and rare disease trials, but are difficult to model under censoring and small sample sizes. Existing generative approaches, largely GAN-based, are data-hungry, unstable, and rely on strong assumptions such as independent censoring. We introduce a variational autoencoder (VAE) that jointly generates mixed-type covariates and survival outcomes within a unified latent variable framework, without assuming independent censoring. Across synthetic and real trial datasets, we evaluate our model in two realistic scenarios: (i) data sharing under privacy constraints, where synthetic controls substitute for original data, and (ii) control-arm augmentation, where synthetic patients mitigate imbalances between treated and control groups. Our method outperforms GAN baselines on fidelity, utility, and privacy metrics, while revealing systematic miscalibration of type I error and power. We propose a post-generation selection procedure that improves calibration, highlighting both progress and open challenges for generative survival modeling.
Paper & Project Links
PDF P. Chassat and V.T. Nguyen contributed equally to this work
Summary
Clinical trials face mounting challenges such as fragmented patient populations, slow enrollment, and unsustainable costs, particularly in late-phase oncology and rare-disease trials. The authors propose a variational autoencoder (VAE) that jointly generates mixed-type covariates and survival outcomes within a unified latent-variable framework, without assuming independent censoring. Evaluated on synthetic and real trial datasets, it outperforms GAN-based baselines on fidelity, utility, and privacy metrics, while revealing systematic miscalibration of type I error and power. A post-generation selection procedure improves calibration, highlighting both the progress and the open challenges of generative survival modeling.
Key Takeaways
- Clinical trials face fragmented patient populations, slow enrollment, and high costs, especially in oncology and rare diseases.
- Synthetic control arms are being explored as an alternative, with generative AI drawing particular attention.
- Generating time-to-event outcomes, the primary endpoints of oncology and rare-disease trials, is a central challenge.
- Existing generative approaches, largely GAN-based, are data-hungry, unstable, and rely on strong assumptions such as independent censoring.
- The proposed variational autoencoder (VAE) jointly generates mixed-type covariates and survival outcomes in a unified latent-variable framework, without the independent-censoring assumption (a model sketch follows this list).
- On synthetic and real trial datasets, the VAE outperforms GAN baselines on fidelity, utility, and privacy metrics.
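A minimal PyTorch sketch of the joint generative structure appears below: one latent z decodes into covariates, the parameters of a log-normal event time, and a censoring logit. Decoding censoring from the same z, rather than independently, is what relaxes the independent-censoring assumption; layer sizes, the log-normal choice, and the omitted censored-likelihood ELBO are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SurvivalVAE(nn.Module):
    """Jointly generate covariates x, event time t, and censoring c
    from one latent z (sketch; the training loss is omitted)."""
    def __init__(self, x_dim, z_dim=8, h=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + 2, h), nn.ReLU())
        self.mu = nn.Linear(h, z_dim)
        self.logvar = nn.Linear(h, z_dim)
        self.dec_x = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(),
                                   nn.Linear(h, x_dim))   # mixed-type covariates
        self.dec_t = nn.Linear(z_dim, 2)   # mean / log-std of log event time
        self.dec_c = nn.Linear(z_dim, 1)   # censoring logit shares z with time

    def forward(self, x, t, c):
        hidden = self.enc(torch.cat([x, t.unsqueeze(-1), c.unsqueeze(-1)], dim=-1))
        mu, logvar = self.mu(hidden), self.logvar(hidden)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec_x(z), self.dec_t(z), self.dec_c(z), mu, logvar

model = SurvivalVAE(x_dim=5)
x, t = torch.randn(4, 5), torch.rand(4)
c = torch.randint(0, 2, (4,)).float()
x_hat, t_params, c_logit, mu, logvar = model(x, t, c)
print(x_hat.shape, t_params.shape, c_logit.shape)
```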
The Oracle and The Prism: A Decoupled and Efficient Framework for Generative Recommendation Explanation
Authors:Jiaheng Zhang, Daqiang Zhang
The integration of Large Language Models (LLMs) into explainable recommendation systems often leads to a performance-efficiency trade-off in end-to-end architectures, where joint optimization of ranking and explanation can result in suboptimal compromises. To resolve this, we propose Prism, a novel decoupled framework that rigorously separates the recommendation process into a dedicated ranking stage and an explanation generation stage. Inspired by knowledge distillation, Prism leverages a powerful teacher LLM (e.g., FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge. A compact, fine-tuned student model (e.g., BART-Base), the Prism, then specializes in synthesizing this knowledge into personalized explanations. This decomposition ensures that each component is optimized for its specific objective, eliminating inherent conflicts in coupled models. Extensive experiments on benchmark datasets demonstrate that our 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, while achieving a 24 times speedup and a 10 times reduction in memory consumption during inference. These results validate that decoupling, coupled with targeted distillation, provides an efficient and effective pathway to high-quality explainable recommendation.
Paper & Project Links
PDF 11 pages,3 figures
Summary
Integrating large language models (LLMs) into explainable recommendation systems often creates a performance-efficiency trade-off in end-to-end architectures, where jointly optimizing ranking and explanation leads to suboptimal compromises. Prism is a decoupled framework that rigorously separates recommendation into a dedicated ranking stage and an explanation-generation stage. Inspired by knowledge distillation, it uses a powerful teacher LLM (e.g., FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge, which a compact, fine-tuned student model (e.g., BART-Base) then synthesizes into personalized explanations. This decomposition lets each component be optimized for its own objective, eliminating the inherent conflicts of coupled models. On benchmark datasets, the 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, with a 24x inference speedup and a 10x reduction in memory consumption.
Key Takeaways
- Integrating LLMs into explainable recommendation systems creates a performance-efficiency trade-off.
- Prism is a decoupled framework that splits recommendation into separate ranking and explanation-generation stages.
- Prism uses knowledge distillation, with a powerful teacher LLM producing high-fidelity explanatory knowledge (a pipeline sketch follows this list).
- The compact Prism student model synthesizes this knowledge into personalized explanations.
- The decomposition optimizes each component for its own objective, removing the conflicts of coupled models.
- On benchmark datasets, Prism outperforms its much larger LLM teacher while inferring markedly faster with far less memory.
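The decoupling can be sketched as a simple data-building step, as below: the ranker decides what to recommend, the teacher Oracle writes the explanation, and the (user, item) to explanation pair becomes supervision for fine-tuning the compact student. The function interfaces are stand-ins, not the paper's code.

```python
def build_distillation_pair(ranker, teacher_explain, user, candidates):
    """Stage 1 picks the item; stage 2 has the teacher LLM write the
    target explanation the student will be fine-tuned to reproduce."""
    item = ranker(user, candidates)                    # ranking only
    explanation = teacher_explain(user, item)          # explanation only
    return {"input": (user, item["name"]), "target": explanation}

# Toy stand-ins for the ranker and the teacher Oracle.
ranker = lambda u, cs: max(cs, key=lambda c: c["score"])
teacher = lambda u, i: f"Recommended {i['name']} because {u} liked similar items."
print(build_distillation_pair(ranker, teacher, "user_42",
                              [{"name": "A", "score": 0.3},
                               {"name": "B", "score": 0.9}]))
```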
Large Language Model-Based Reward Design for Deep Reinforcement Learning-Driven Autonomous Cyber Defense
Authors:Sayak Mukherjee, Samrat Chatterjee, Emilie Purvine, Ted Fujimoto, Tegan Emerson
Designing rewards for autonomous cyber attack and defense learning agents in a complex, dynamic environment is a challenging task for subject matter experts. We propose a large language model (LLM)-based reward design approach to generate autonomous cyber defense policies in a deep reinforcement learning (DRL)-driven experimental simulation environment. Multiple attack and defense agent personas were crafted, reflecting heterogeneity in agent actions, to generate LLM-guided reward designs where the LLM was first provided with contextual cyber simulation environment information. These reward structures were then utilized within a DRL-driven attack-defense simulation environment to learn an ensemble of cyber defense policies. Our results suggest that LLM-guided reward designs can lead to effective defense strategies against diverse adversarial behaviors.
Paper & Project Links
PDF Accepted in the AAAI-26 Workshop on Artificial Intelligence for Cyber Security (AICS)
Summary
Designing rewards for autonomous cyber attack and defense learning agents in a complex, dynamic environment is challenging for subject-matter experts. This paper proposes a large language model (LLM)-based reward-design approach for generating autonomous cyber defense policies in a deep reinforcement learning (DRL)-driven simulation environment. Multiple attack and defense agent personas are crafted to reflect heterogeneity in agent actions, and the LLM, given contextual information about the cyber simulation environment, guides the reward designs. The resulting reward structures are then used in a DRL-driven attack-defense simulation to learn an ensemble of cyber defense policies. Results suggest that LLM-guided reward designs yield effective defenses against diverse adversarial behaviors.
Key Takeaways
- Reward design for autonomous cyber attack and defense learning agents is challenging.
- The paper proposes a reward-design approach based on large language models.
- Autonomous cyber defense policies are learned in a deep-reinforcement-learning-driven simulation environment.
- Multiple attack and defense agent personas capture heterogeneity in agent actions.
- The LLM guides the reward design after receiving contextual information about the simulated environment (a prompting sketch follows this list).
- The reward structures are applied in a DRL-driven attack-defense simulation environment.
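One plausible shape for LLM-guided reward design is sketched below: the LLM receives the simulation context plus a persona and returns weights over reward terms that the DRL agent then optimizes. The prompt format, the three reward terms, and the dict-returning stub LLM are illustrative assumptions.

```python
def llm_reward_from_persona(llm, env_description, persona):
    """Ask an LLM to weight reward terms for a given defender persona,
    then return a reward function over environment state counters."""
    prompt = (f"Environment: {env_description}\n"
              f"Defender persona: {persona}\n"
              "Return weights for: host_compromise, service_uptime, patch_cost.")
    weights = llm(prompt)                 # e.g. {"host_compromise": -10.0, ...}

    def reward(state):
        return sum(weights[k] * state[k] for k in weights)
    return reward

stub_llm = lambda p: {"host_compromise": -10.0,
                      "service_uptime": 2.0,
                      "patch_cost": -0.5}
r = llm_reward_from_persona(stub_llm, "10-host enterprise subnet", "cautious defender")
print(r({"host_compromise": 1, "service_uptime": 8, "patch_cost": 3}))  # 4.5
```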
Reasoning Meets Representation: Envisioning Neuro-Symbolic Wireless Foundation Models
Authors:Jaron Fontaine, Mohammad Cheraghinia, John Strassner, Adnan Shahid, Eli De Poorter
Recent advances in Wireless Physical Layer Foundation Models (WPFMs) promise a new paradigm of universal Radio Frequency (RF) representations. However, these models inherit critical limitations found in deep learning such as the lack of explainability, robustness, adaptability, and verifiable compliance with physical and regulatory constraints. In addition, the vision for an AI-native 6G network demands a level of intelligence that is deeply embedded into the systems and is trustworthy. In this vision paper, we argue that the neuro-symbolic paradigm, which integrates data-driven neural networks with rule- and logic-based symbolic reasoning, is essential for bridging this gap. We envision a novel Neuro-Symbolic framework that integrates universal RF embeddings with symbolic knowledge graphs and differentiable logic layers. This hybrid approach enables models to learn from large datasets while reasoning over explicit domain knowledge, enabling trustworthy, generalizable, and efficient wireless AI that can meet the demands of future networks.
Paper & Project Links
PDF Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: AI and ML for Next-Generation Wireless Communications and Networking (AI4NextG)
Summary
Recent advances in Wireless Physical Layer Foundation Models (WPFMs) promise universal radio-frequency (RF) representations, but these models inherit key limitations of deep learning: lack of explainability, robustness, adaptability, and verifiable compliance with physical and regulatory constraints. An AI-native 6G network, moreover, demands intelligence that is deeply embedded into the systems and trustworthy. This vision paper argues that the neuro-symbolic paradigm, which integrates data-driven neural networks with rule- and logic-based symbolic reasoning, is essential for bridging this gap. The envisioned framework combines universal RF embeddings with symbolic knowledge graphs and differentiable logic layers, letting models learn from large datasets while reasoning over explicit domain knowledge, toward trustworthy, generalizable, and efficient wireless AI for future networks.
Key Takeaways
- WPFMs promise a new paradigm for RF representations but still lack explainability, robustness, and related guarantees.
- An AI-native 6G network requires deeply embedded, trustworthy intelligence.
- The neuro-symbolic paradigm, combining data-driven neural networks with rule- and logic-based symbolic reasoning, is key to bridging this gap.
- The proposed neuro-symbolic framework integrates universal RF embeddings, symbolic knowledge graphs, and differentiable logic layers.
- The hybrid approach lets models learn from large datasets while reasoning over explicit domain knowledge.
- This neuro-symbolic combination can deliver trustworthy, generalizable, and efficient wireless AI.
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Authors:Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, Lidong Bing
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our codes, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
Paper & Project Links
Summary
Recent advances in large reasoning models have spurred interest in extending such capabilities to multimodal domains, but the lack of transparent, reproducible data curation and training strategies remains a major barrier to scalable research. OpenMMReasoner is a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). The SFT stage builds an 874K-sample cold-start dataset with rigorous step-by-step validation; the RL stage uses a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities. The recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design, achieving an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks. All code, pipelines, and data are open-sourced.
Key Takeaways
- Extending large reasoning models to multimodal domains is drawing growing attention.
- Opaque, non-reproducible data curation and training strategies are a major obstacle to scalable multimodal research.
- OpenMMReasoner is a transparent two-stage multimodal reasoning recipe: supervised fine-tuning (SFT) followed by reinforcement learning (RL).
- The SFT stage builds a rigorously validated 874K-sample cold-start dataset to ground reasoning capabilities.
- The RL stage, using a 74K-sample cross-domain dataset, further sharpens and stabilizes these abilities, yielding a more robust and efficient learning process.
Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement
Authors:Jiashu Yao, Heyan Huang, Shuang Zeng, Chuwei Luo, WangJie You, Jie Tang, Qingsong Liu, Yuhang Guo, Yangyang Kang
Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce self-rewriting framework, where a model rewrites its own reasoning texts, and subsequently learns from the rewritten reasoning to improve the internal thought process quality. For algorithm design, we propose a selective rewriting approach wherein only “simple” samples, defined by the model’s consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within one single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines. In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
Paper & Project Links
PDF Accepted to AAAI 2026
Summary
With reinforcement learning (RL) driven by outcome-correctness rewards, large reasoning models (LRMs) have achieved substantial success on complex reasoning tasks. But a one-sided reward focused solely on final correctness cannot supervise the internal reasoning process in detail, yielding issues such as over-thinking, under-thinking, redundant thinking, and disordered thinking. Inspired by recent progress in LRM self-rewarding, the authors introduce a self-rewriting framework in which a model rewrites its own reasoning texts and then learns from the rewritten reasoning to improve internal thought quality. A selective scheme rewrites only "simple" samples, defined by the model's consistent correctness, preserving all of GRPO's original reward signals. Extensive experiments show that self-rewriting achieves higher accuracy (+0.6) with substantially shorter reasoning (-46%), even without explicit length-reduction instructions in the rewriting prompts, and scores significantly higher (+7.2) on internal reasoning quality under an LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
Key Takeaways
- RL with outcome-correctness rewards has brought large reasoning models (LRMs) strong results on complex reasoning tasks.
- A one-sided reward mechanism leads to poor internal reasoning quality, with issues such as over-thinking and under-thinking.
- The self-rewriting framework has the model rewrite its own reasoning texts and learn from them, improving the internal thought process.
- Selective rewriting touches only "simple" samples, preserving the original reward signals (a selection sketch follows this list).
- Self-rewriting proves effective across diverse tasks and model sizes.
- It balances accuracy and length, reaching higher accuracy with substantially shorter reasoning.
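The selection rule can be sketched as below, reading "consistent correctness" as all rollouts in a GRPO group being correct; only those groups are routed to the rewriting pass, so every group's original reward signal is untouched. The all-correct threshold and record layout are assumptions.

```python
def select_for_rewriting(groups, threshold=1.0):
    """Split GRPO groups into 'simple' (consistently correct -> rewrite)
    and 'regular' (keep vanilla generation and rewards)."""
    simple, regular = [], []
    for g in groups:
        acc = sum(r["correct"] for r in g["rollouts"]) / len(g["rollouts"])
        (simple if acc >= threshold else regular).append(g)
    return simple, regular

groups = [
    {"q": "2+2?", "rollouts": [{"correct": True}] * 4},
    {"q": "hard integral", "rollouts": [{"correct": True}, {"correct": False}] * 2},
]
easy, rest = select_for_rewriting(groups)
print([g["q"] for g in easy], [g["q"] for g in rest])  # ['2+2?'] ['hard integral']
```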
Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
Authors:Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, Yoichi Sato
Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to `read the room’ and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.
Paper & Project Links
Summary
Despite their advanced reasoning capabilities, state-of-the-art multimodal large language models (MLLMs) lack a core component of human intelligence: the ability to "read the room" and assess deception in complex social interactions. To quantify this failure, the authors introduce the Multimodal Interactive Deception Assessment (MIDA) task, along with a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. A benchmark of 12 state-of-the-art open- and closed-source MLLMs reveals a significant gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Failure analysis shows these models do not effectively ground language in multimodal social cues and cannot model what others know, believe, or intend. To take a step forward, the authors design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module, which improve performance on this challenging task and point to a promising path toward MLLMs with genuinely human-like social reasoning.
Key Takeaways
- State-of-the-art MLLMs lack the core human ability to "read the room" and assess deception in complex social interactions.
- The new Multimodal Interactive Deception Assessment (MIDA) task quantifies this deficiency.
- A multimodal dataset pairs synchronized video and text, with verifiable ground-truth labels for every statement.
- Evaluations of 12 state-of-the-art MLLMs reveal a significant performance gap in separating truth from falsehood.
- Models fail because they do not ground language in multimodal social cues and cannot model others' knowledge, beliefs, or intentions.
- The findings underline the urgent need for more perceptive and trustworthy AI systems.
FlipVQA-Miner: Cross-Page Visual Question-Answer Mining from Textbooks
Authors:Zhen Hao Wong, Jingwen Deng, Hao Liang, Runming He, Chengyu Shen, Wentao Zhang
The development of Large Language Models (LLMs) increasingly depends on high-quality supervised data, yet existing instruction-tuning and RL datasets remain costly to curate and often rely on synthetic samples that introduce hallucination and limited diversity. At the same time, textbooks and exercise materials contain abundant, high-quality human-authored Question-Answer (QA) content that remains underexploited due to the difficulty of transforming raw PDFs into AI-ready supervision. Although modern OCR and vision-language models can accurately parse document structure, their outputs lack the semantic alignment required for training. We propose an automated pipeline that extracts well-formed QA and visual-QA (VQA) pairs from educational documents by combining layout-aware OCR with LLM-based semantic parsing. Experiments across diverse document types show that the method produces accurate, aligned, and low-noise QA/VQA pairs. This approach enables scalable use of real-world educational content and provides a practical alternative to synthetic data generation for improving reasoning-oriented LLM training. All code and data-processing pipelines are open-sourced at https://github.com/OpenDCAI/DataFlow.
Paper & Project Links
Summary
Large language model (LLM) development increasingly depends on high-quality supervised data, yet existing instruction-tuning and RL datasets are costly to curate and often rely on synthetic samples that introduce hallucination and limited diversity. Meanwhile, textbooks and exercise materials contain abundant, high-quality human-authored question-answer (QA) content that remains underexploited because raw PDFs are hard to transform into AI-ready supervision. Although modern OCR and vision-language models parse document structure accurately, their outputs lack the semantic alignment required for training. The paper therefore proposes an automated pipeline that extracts well-formed QA and visual-QA (VQA) pairs from educational documents by combining layout-aware OCR with LLM-based semantic parsing. Experiments across diverse document types show the method produces accurate, aligned, low-noise QA/VQA pairs, enabling scalable use of real-world educational content as a practical alternative to synthetic data for reasoning-oriented LLM training.
Key Takeaways
- LLM development depends on high-quality supervised data.
- Existing instruction-tuning and RL datasets are expensive to curate and often rely on synthetic samples, bringing hallucination and limited diversity.
- Textbooks and exercise materials hold rich, high-quality QA content that goes largely unexploited.
- Modern OCR and vision-language models can parse document structure but lack semantic alignment.
- The proposed automated pipeline combines layout-aware OCR with LLM-based semantic parsing to extract QA and VQA pairs from educational documents (a pipeline sketch follows this list).
- Experiments show the pipeline yields accurate, aligned, low-noise QA/VQA pairs.
- The approach scales the use of real-world educational content and offers a practical alternative to synthetic data for LLM training.
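The two-stage pipeline can be sketched as below, with `ocr_layout` and `llm_parse` as placeholder interfaces for the layout-aware OCR and the LLM-based semantic aligner; the record fields are illustrative.

```python
def mine_qa_pairs(pdf_path, ocr_layout, llm_parse):
    """Layout-aware OCR yields ordered blocks; an LLM aligns questions
    with answers (possibly across pages) into QA/VQA records."""
    blocks = ocr_layout(pdf_path)            # [{"page": 1, "type": "text", ...}]
    records = []
    for pair in llm_parse(blocks):           # semantic alignment step
        if pair.get("question") and pair.get("answer"):
            records.append({"question": pair["question"],
                            "answer": pair["answer"],
                            "image": pair.get("figure")})  # VQA when a figure exists
    return records

# Toy stand-ins for the two stages.
ocr = lambda path: [{"page": 1, "type": "text", "text": "Q1: What is H2O? A: Water."}]
parser = lambda blocks: [{"question": "What is H2O?", "answer": "Water."}]
print(mine_qa_pairs("chem_book.pdf", ocr, parser))
```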
ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025
Authors:Xu Qiang, Shengyuan Bai, Leqing Chen, Zijing Liu, Yu Li
Olympiad-level benchmarks in mathematics and physics are crucial testbeds for advanced AI reasoning, but chemistry, with its unique multimodal symbolic language, has remained an open challenge. We introduce ChemO, a new benchmark built from the International Chemistry Olympiad (IChO) 2025. ChemO features two key innovations for automated assessment: Assessment-Equivalent Reformulation (AER), which converts problems requiring visual outputs (e.g., drawing molecules) into computationally tractable formats, and Structured Visual Enhancement (SVE), a diagnostic mechanism to disentangle a model’s visual perception capabilities from its core chemical reasoning. To tackle this benchmark, we propose ChemLabs, a hierarchical multi-agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. Experiments on state-of-the-art multimodal models demonstrate that combining SVE with our multi-agent system yields dramatic performance gains. Our top configuration achieves a score of 93.6 out of 100, surpassing an estimated human gold medal threshold and establishing a new state-of-the-art in automated chemical problem-solving. ChemO Dataset: https://huggingface.co/datasets/IDEA-AI4SCI/ChemO
Paper & Project Links
PDF 13 pages, 1 figures
Summary
ChemO, built from the International Chemistry Olympiad (IChO) 2025, is a demanding benchmark for automated assessment in chemistry. It features two key innovations: Assessment-Equivalent Reformulation (AER), which converts problems requiring visual outputs into computationally tractable formats, and Structured Visual Enhancement (SVE), a diagnostic mechanism that disentangles a model's visual perception from its core chemical reasoning. To tackle the benchmark, the ChemLabs hierarchical multi-agent framework mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. Combining SVE with the multi-agent system yields dramatic gains: the top configuration scores 93.6 out of 100, surpassing an estimated human gold-medal threshold and setting a new state of the art in automated chemical problem-solving.
Key Takeaways
- ChemO, built from International Chemistry Olympiad data, is a challenging benchmark for automated assessment in chemistry.
- ChemO introduces two innovations: Assessment-Equivalent Reformulation (AER) to convert visual-output problems into tractable formats, and Structured Visual Enhancement (SVE) to diagnose perception separately from reasoning.
- The ChemLabs multi-agent framework mimics human expert collaboration with specialized agents for problem decomposition, perception, reasoning, and auditing.
- Combining SVE with the ChemLabs multi-agent system yields dramatic performance gains.
- The top configuration scores 93.6 out of 100, surpassing an estimated human gold-medal threshold and setting a new state of the art in automated chemical problem-solving.