R1_Reasoning


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never use them for serious academic work; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-25

Native 3D Editing with Full Attention

Authors:Weiwei Cai, Shuangkang Fang, Weicai Ye, Xin Dong, Yunhan Yang, Xuanyang Zhang, Wei Cheng, Yanpei Cao, Gang Yu, Tao Chen

Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we explore two distinct conditioning strategies for our model: a conventional cross-attention mechanism and a novel 3D token concatenation approach. Our results demonstrate that token concatenation is more parameter-efficient and achieves superior performance. Extensive evaluations show that our method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.

Paper and Project Links

PDF

Summary:
Instruction-guided 3D editing is an emerging field with the potential to broaden access to 3D content creation. Existing methods have key limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches that rely on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address this, the authors propose a native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. They build a large-scale, multi-modal dataset for instruction-guided 3D editing covering diverse addition, deletion, and modification tasks, curated so that edited objects faithfully follow the instruction while unedited regions stay consistent with the source object. On top of this dataset, two conditioning strategies are explored: a conventional cross-attention mechanism and a novel 3D token concatenation approach; token concatenation proves more parameter-efficient and performs better. Extensive evaluations show the method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.

Key Takeaways:

  1. Instruction-guided 3D editing is an emerging field that can broaden access to 3D content creation.
  2. Existing methods face key limitations such as slow optimization and inconsistent geometry.
  3. A native 3D editing framework is proposed that operates directly on 3D representations, making editing more efficient.
  4. A large-scale, multi-modal dataset for instruction-guided 3D editing is built, covering diverse editing tasks.
  5. Two conditioning strategies are explored: cross-attention and 3D token concatenation (see the sketch below).
  6. Token concatenation is more parameter-efficient and outperforms cross-attention.
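
As an editor's illustration of the two conditioning strategies discussed above, the toy PyTorch sketch below contrasts cross-attention conditioning with 3D token concatenation on random tensors. The dimensions, the use of `torch.nn.MultiheadAttention`, and the layout of latent and condition tokens are our own assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
latent = torch.randn(1, 512, d_model)   # 3D latent tokens of the source object
cond   = torch.randn(1, 77,  d_model)   # instruction / condition tokens

# Strategy A: cross-attention (extra module; latent queries attend to condition)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
out_a, _ = cross_attn(query=latent, key=cond, value=cond)

# Strategy B: token concatenation (full attention over the joint sequence,
# then keep only the latent positions)
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
joint = torch.cat([cond, latent], dim=1)
out_b, _ = self_attn(joint, joint, joint)
out_b = out_b[:, cond.shape[1]:]         # discard the condition positions

print(out_a.shape, out_b.shape)          # both: torch.Size([1, 512, 256])
```

The concatenation variant reuses the model's existing full self-attention over the joint sequence, which is one intuition for why it can be more parameter-efficient than adding dedicated cross-attention modules.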

Cool Papers

Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Authors:Yolo Yunlong Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu

Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.

Paper and Project Links

PDF

Summary

This paper proposes Video-R4, a reasoning LMM for text-rich videos that mimics how humans pause, zoom, and re-read critical regions: it iteratively selects frames, zooms into informative regions, re-encodes the retrieved pixels, and updates its reasoning state, a process the authors call visual rumination. Two datasets with executable rumination trajectories are constructed: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. A multi-stage rumination learning framework progressively fine-tunes a 7B LMM via SFT and GRPO-based RL to learn atomic and mixed visual operations. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA.

Key Takeaways

  1. Video-R4 introduces a visual rumination mechanism that mimics how humans inspect videos, improving fine-grained video understanding.
  2. Two datasets, Video-R4-CoT-17k and Video-R4-RL-30k, are built for supervised training and reinforcement learning.
  3. A multi-stage rumination learning framework progressively fine-tunes the model to improve performance.
  4. Video-R4 achieves state-of-the-art results on M4-ViteVQA.
  5. The method generalizes to multi-page document QA, slides QA, and generic video QA.
  6. By combining textual and visual information and iteratively processing frame content, the approach strengthens pixel-grounded multimodal reasoning.

Cool Papers

Counterfactual World Models via Digital Twin-conditioned Video Diffusion

Authors:Yiqing Shen, Aiza Maksutova, Chenjia Li, Mathias Unberath

World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as “what would happen if this object was removed?”, is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations where object properties and relationships cannot be selectively modified. This modeling choice prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework to overcome those limitations, turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model with the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that the CWMDT approach achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for video forward simulation-based world models.

Paper and Project Links

PDF

Summary: World models learn to predict how visual observations evolve over time given a control signal, potentially letting agents reason about environments through forward simulation. Because of this focus on forward simulation, current world models predict only from factual observations. For many emerging applications, such as evaluating physical AI behavior under varying conditions, the ability to answer counterfactual queries like "what would happen if this object was removed?" is increasingly important. The paper formalizes counterfactual world models that take interventions as explicit inputs and predict temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate on entangled pixel-space representations, which prevents targeted interventions on specific scene properties. CWMDT overcomes this by (1) constructing digital twins of observed scenes that explicitly encode objects and their relationships as structured text, (2) using large language models to reason over these representations and predict how a counterfactual intervention propagates through time, and (3) conditioning a video diffusion model on the modified representation to generate counterfactual visual sequences. On two benchmarks, CWMDT achieves state-of-the-art performance, suggesting that alternative video representations such as digital twins offer powerful control signals for forward-simulation-based world models.

Key Takeaways:

  1. World models predict the temporal evolution of visual observations under a control signal.
  2. Current world models generate predictions only from factual observations.
  3. Counterfactual query capability matters for emerging applications such as comprehensive evaluation of physical AI behavior.
  4. The CWMDT framework turns standard video diffusion models into counterfactual world models.
  5. CWMDT builds digital twins that represent the objects and relationships of the observed scene (a toy example follows below).
  6. CWMDT uses large language models to reason over these representations and predict how a counterfactual intervention alters the scene.
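
To make the digital-twin idea concrete, here is a minimal, self-contained sketch in which a scene is encoded as structured text (a JSON-like dict) and a counterfactual intervention ("remove the cup") is propagated through it. The schema, field names, and the simple support-propagation rule are invented for illustration; the paper's actual representation, prompts, and LLM-based reasoning are not shown here.

```python
import json

# A toy "digital twin": objects and relations of an observed scene,
# serialized as structured text (the schema here is our own invention).
scene = {
    "objects": [
        {"id": "cup",   "on": "table", "state": "upright"},
        {"id": "table", "on": "floor", "state": "static"},
        {"id": "ball",  "on": "cup",   "state": "resting"},
    ],
    "relations": [["ball", "supported_by", "cup"], ["cup", "supported_by", "table"]],
}

def remove_object(twin, obj_id):
    """Apply the intervention 'what would happen if <obj_id> was removed?' and
    propagate one simple consequence: whatever it supported starts falling."""
    twin = json.loads(json.dumps(twin))          # deep copy
    twin["objects"] = [o for o in twin["objects"] if o["id"] != obj_id]
    for o in twin["objects"]:
        if o.get("on") == obj_id:
            o["on"], o["state"] = "floor", "falling"
    twin["relations"] = [r for r in twin["relations"] if obj_id not in (r[0], r[2])]
    return twin

counterfactual = remove_object(scene, "cup")
# The edited structured text would then condition the video generator.
print(json.dumps(counterfactual, indent=2))
```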

Cool Papers

SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

Authors:Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, Danda Pani Paudel

Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, $~\textbf{SPEAR-1}$: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on $\sim$45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as $π_0$-FAST and $π_{0.5}$, while it uses 20$\times$ fewer robot demonstrations. This carefully-engineered training strategy unlocks new VLM capabilities and as a consequence boosts the reliability of embodied control beyond what is achievable with only robotic data. We make our model weights and 3D-annotated datasets publicly available.

Paper and Project Links

PDF

Summary

Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control, but their ability to generalize across new environments, tasks, and embodiments remains limited. The authors argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs), which are trained on 2D image-language tasks and lack the 3D spatial reasoning needed for embodied control in the 3D world. Rather than closing this gap with costly large-scale robotic data, the paper enriches easy-to-collect non-robotic image data with 3D annotations and equips a pretrained VLM with 3D understanding. The resulting SPEAR-VLM infers object coordinates in 3D space from a single 2D image. Building on it, SPEAR-1 integrates grounded 3D perception with language-instructed embodied control. Trained on roughly 45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as $π_0$-FAST and $π_{0.5}$ while using 20x fewer robot demonstrations. Model weights and the 3D-annotated datasets are publicly available.

Key Takeaways

  1. Robotic Foundation Models (RFMs) are generalist, end-to-end robot control systems with great potential.
  2. RFMs struggle to generalize to new environments, tasks, and embodiments; the main bottleneck lies in their foundations.
  3. Most RFMs are built on pretrained Vision-Language Models (VLMs), which lack 3D spatial reasoning.
  4. The proposed remedy is to enrich non-robotic image data with 3D annotations and enhance the pretrained VLM with 3D understanding.
  5. SPEAR-VLM infers the 3D coordinates of objects from a single 2D image.
  6. The main contribution, SPEAR-1, combines 3D perception with language-instructed control and performs strongly on public benchmarks while using 20x fewer robot demonstrations.

Cool Papers

Accelerating the CLEAN algorithm of radio interferometry with convex optimization

Authors:Hendrik Müller, Mingyu Hsieh, Sanjay Bhatnagar

In radio-interferometry, we recover an image from an incompletely sampled Fourier data. The de-facto standard algorithm, the Cotton-Schwab CLEAN, is iteratively switching between computing a deconvolution (minor loop) and subtracting the model from the visibilities (major loop). The next generation of radio interferometers is expected to deal with much higher data rates, image sizes and sensitivity, making an acceleration of current data processing algorithms necessary. We aim to achieve this by evaluating the potential of various well-known acceleration techniques in convex optimization to the major loop. For the present manuscript, we limit the scope to study these techniques only in the CLEAN framework. To this end, we identify CLEAN with a Newton scheme, and use this chain of arguments backwards to express Nesterov acceleration and conjugate gradient orthogonalization in the major and minor loop framework. The resulting algorithms are simple extensions of the traditional framework, but converge multiple times faster than traditional techniques, and reduce the residual significantly deeper. These improvements achieved by accelerating the major loop are competitive to well-known improvements by replacing the minor loop with more advanced algorithms, but at lower numerical cost. The best performance is achieved by combining these two developments. CLEAN remains among the fastest and most robust algorithms for imaging in radio interferometry, and can be easily extended to almost an order of magnitude faster convergence speed and dynamic range. The procedure outlined in this manuscript is relatively straightforward and could be easily extended.

Paper and Project Links

PDF accepted for publication in A&A

Summary
In radio interferometry, an image is recovered from incompletely sampled Fourier data. The de facto standard algorithm, Cotton-Schwab CLEAN, iterates between computing a deconvolution (minor loop) and subtracting the model from the visibilities (major loop). The next generation of radio interferometers will face much higher data rates, image sizes, and sensitivity, which makes accelerating current data-processing algorithms necessary. This paper evaluates well-known acceleration techniques from convex optimization applied to the major loop, restricting the study to the CLEAN framework. By identifying CLEAN with a Newton scheme and following that chain of arguments in reverse, Nesterov acceleration and conjugate-gradient orthogonalization are expressed within the major/minor-loop framework. The resulting algorithms are simple extensions of the traditional framework, yet converge several times faster and drive the residual significantly deeper. These major-loop accelerations are competitive with the well-known gains from replacing the minor loop with more advanced algorithms, but at lower numerical cost; the best performance comes from combining both. CLEAN remains among the fastest and most robust imaging algorithms in radio interferometry and can easily be extended to nearly an order of magnitude faster convergence and higher dynamic range. The procedure outlined here is relatively straightforward and easily extended.

Key Takeaways

  1. The Cotton-Schwab CLEAN algorithm is the standard for image recovery in radio interferometry, but next-generation data volumes challenge it.
  2. Acceleration techniques from convex optimization are applied to the major loop to speed up the algorithm.
  3. By identifying CLEAN with a Newton scheme, Nesterov acceleration and conjugate-gradient orthogonalization are brought into the major and minor loops (a generic Nesterov sketch follows below).
  4. The new algorithms are simple extensions of the traditional framework, yet converge much faster and reach significantly deeper residuals.
  5. The best performance is achieved by combining the major-loop and minor-loop improvements.
  6. CLEAN remains one of the fastest and most robust imaging algorithms in radio interferometry, with clear potential for extension.
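
For readers unfamiliar with Nesterov acceleration, the sketch below shows the generic scheme on a toy linear least-squares problem (minimizing $\|y - Ax\|^2$), which loosely stands in for fitting a model to visibilities. It is not the authors' CLEAN implementation; the operator, step size, and iteration count are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))      # stand-in for the measurement operator
x_true = rng.normal(size=50)
y = A @ x_true                      # "visibilities"

L = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the gradient
x = np.zeros(50); z = x.copy(); t = 1.0

for k in range(200):
    grad = A.T @ (A @ z - y)        # residual-driven update ("major loop" analogue)
    x_new = z - grad / L            # plain gradient step ("minor loop" stand-in)
    t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
    z = x_new + (t - 1) / t_new * (x_new - x)   # Nesterov extrapolation
    x, t = x_new, t_new

print(np.linalg.norm(A @ x - y))    # residual shrinks much faster than plain GD
```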

Cool Papers

MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment

Authors:Huangbiao Xu, Huanqi Wu, Xiao Ke, Junyi Wu, Rui Xu, Jinglin Xu

Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. However, partial modalities are frequently unavailable at the inference stage in reality. The absence of any modality often renders existing multimodal models inoperable. Furthermore, it triggers catastrophic performance degradation due to interruptions in cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. With a mixture of experts, missing modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at https://github.com/XuHuangbiao/MCMoE.

Paper and Project Links

PDF AAAI 2026

Summary

Multimodal Action Quality Assessment (AQA) leverages complementary information across modalities to discriminate subtle intra-class variations in highly similar action sequences, but in practice some modalities are often unavailable at inference time, which can make existing multimodal models inoperable and cause severe performance degradation. The paper proposes MCMoE (Missing Completion Framework with Mixture of Experts), which unifies unimodal and joint representation learning in single-stage training. An adaptive gated modality generator dynamically fuses the available information to reconstruct missing modalities; modality experts learn unimodal knowledge, and their knowledge is dynamically mixed to extract cross-modal joint representations that further refine and complement the reconstructed modalities. During training, complete multimodal features and unimodal expert knowledge guide modality generation and generation-based joint representation extraction. MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks, and the code is publicly available.

Key Takeaways

  • Multimodal Action Quality Assessment (AQA) improves the ability to distinguish subtle differences in highly similar action sequences.
  • Missing modalities at inference time can render existing multimodal models inoperable and degrade performance.
  • MCMoE addresses this with an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities (a minimal gating sketch follows below).
  • Modality experts learn unimodal knowledge, and a dynamic mixture of all experts extracts cross-modal joint representations that further refine and complement the missing modalities.
  • In the training phase, complete multimodal features and unimodal expert knowledge guide modality generation and generation-based joint representation extraction.
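
A minimal sketch of the gating idea behind an adaptive modality generator is given below: features of the available modalities are fused with softmax gates (absent modalities masked out) to stand in for a missing one. All dimensions, module choices, and the masking scheme are assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

class GatedModalityGenerator(nn.Module):
    """Fuse available modality features with learned gates to reconstruct
    a missing modality's feature (illustrative only)."""
    def __init__(self, n_modalities=3, d=128):
        super().__init__()
        self.gate = nn.Linear(n_modalities * d, n_modalities)
        self.proj = nn.Linear(d, d)

    def forward(self, feats, present_mask):
        # feats: (B, n_modalities, d); present_mask: (B, n_modalities) in {0, 1}
        logits = self.gate(feats.flatten(1))
        logits = logits.masked_fill(present_mask == 0, float("-inf"))
        w = torch.softmax(logits, dim=-1).unsqueeze(-1)   # weight only present modalities
        fused = (w * feats).sum(dim=1)                    # (B, d)
        return self.proj(fused)                           # reconstructed feature

gen = GatedModalityGenerator()
feats = torch.randn(2, 3, 128)                 # missing slots could be zero-filled
mask = torch.tensor([[1, 1, 0], [1, 0, 1]])    # modality 3 / modality 2 missing
print(gen(feats, mask).shape)                  # torch.Size([2, 128])
```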

Cool Papers

IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation

Authors:Yifan Li, Lichi Li, Anh Dao, Xinyu Zhou, Yicheng Qiao, Zheda Mai, Daeun Lee, Zichen Chen, Zhen Tan, Mohit Bansal, Yu Kong

While Visual Large Language Models (VLLMs) show great promise as embodied agents, they continue to face substantial challenges in spatial reasoning. Existing embodied benchmarks largely focus on passive, static household environments and evaluate only isolated capabilities, failing to capture holistic performance in dynamic, real-world complexity. To fill this gap, we present IndustryNav, the first dynamic industrial navigation benchmark for active spatial reasoning. IndustryNav leverages 12 manually created, high-fidelity Unity warehouse scenarios featuring dynamic objects and human movement. Our evaluation employs a PointGoal navigation pipeline that effectively combines egocentric vision with global odometry to assess holistic local-global planning. Crucially, we introduce the “collision rate” and “warning rate” metrics to measure safety-oriented behaviors and distance estimation. A comprehensive study of nine state-of-the-art VLLMs (including models such as GPT-5-mini, Claude-4.5, and Gemini-2.5) reveals that closed-source models maintain a consistent advantage; however, all agents exhibit notable deficiencies in robust path planning, collision avoidance and active exploration. This highlights a critical need for embodied research to move beyond passive perception and toward tasks that demand stable planning, active exploration, and safe behavior in dynamic, real-world environment.

Paper and Project Links

PDF

Summary

Visual Large Language Models (VLLMs) show great promise as embodied agents but still struggle with spatial reasoning. Existing embodied benchmarks largely focus on passive, static household environments and evaluate only isolated capabilities, failing to capture holistic performance under dynamic, real-world complexity. IndustryNav fills this gap as the first dynamic industrial navigation benchmark for active spatial reasoning, built from 12 manually created, high-fidelity Unity warehouse scenarios featuring dynamic objects and human movement. Evaluation uses a PointGoal navigation pipeline that combines egocentric vision with global odometry to assess holistic local-global planning, and introduces "collision rate" and "warning rate" metrics to measure safety-oriented behavior and distance estimation. A study of nine state-of-the-art VLLMs (including GPT-5-mini, Claude-4.5, and Gemini-2.5) shows that closed-source models keep a consistent advantage, yet all agents exhibit notable deficiencies in robust path planning, collision avoidance, and active exploration. This highlights the need for embodied research to move beyond passive perception toward tasks demanding stable planning, active exploration, and safe behavior in dynamic, real-world environments.

Key Takeaways:

  1. VLLMs show promise as embodied agents for spatial reasoning but still face substantial challenges.
  2. Existing embodied benchmarks focus on passive, static environments and isolated capabilities, failing to reflect dynamic real-world complexity.
  3. IndustryNav is the first dynamic industrial navigation benchmark for active spatial reasoning, with diverse dynamic scenarios and evaluation protocols.
  4. Evaluation uses a PointGoal navigation pipeline combining egocentric vision with global odometry to measure holistic planning.
  5. Collision rate and warning rate metrics are introduced to measure safety-oriented behavior (see the sketch below).
  6. A study of nine state-of-the-art VLLMs shows closed-source models have an edge, but all struggle with path planning, collision avoidance, and active exploration.
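
The exact metric definitions used by IndustryNav are not reproduced here; the snippet below just shows one plausible way to compute episode-level collision and warning rates from logged minimum distances to obstacles, with the distance thresholds being our own assumptions.

```python
def safety_metrics(episodes, collision_dist=0.0, warning_dist=0.5):
    """episodes: per-step minimum distances (metres) to the nearest dynamic
    obstacle for each navigation episode. Thresholds are assumptions."""
    collisions = sum(any(d <= collision_dist for d in ep) for ep in episodes)
    warnings   = sum(any(d <= warning_dist  for d in ep) for ep in episodes)
    n = len(episodes)
    return {"collision_rate": collisions / n, "warning_rate": warnings / n}

logs = [[1.2, 0.8, 0.4, 0.9], [2.0, 1.5, 1.1], [0.6, 0.0, 0.3]]
print(safety_metrics(logs))   # {'collision_rate': 0.33..., 'warning_rate': 0.66...}
```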

Cool Papers

Anomaly Pattern-guided Transaction Bug Testing in Relational Databases

Authors:Huicong Xu, Shuang Liu, Xianyu Zhu, Qiyu Zhuang, Wei Lu, Xiaoyong Du

Concurrent transaction processing is a fundamental capability of Relational Database Management Systems (RDBMSs), widely utilized in applications requiring high levels of parallel user interaction, such as banking systems, e-commerce platforms, and telecommunications infrastructure. Isolation levels offer a configurable mechanism to manage the interaction between concurrent transactions, enabling varying degrees of consistency and performance trade-offs. These isolation guarantees are supported by all major RDBMSs. However, testing transaction behavior under different isolation levels remains a significant challenge due to two primary reasons. First, automatically generating test transactions that can effectively expose bugs in transaction handling logic is non-trivial, as such bugs are typically triggered under specific transactional constraints. Second, detecting logic anomalies in transaction outcomes is difficult because the correct execution results are often unknown for randomly generated transactions. To address these challenges, we propose an anomaly pattern-guided testing approach for uncovering transaction bugs in RDBMSs. Our solution tackles the first challenge by introducing a test case generation technique guided by predefined anomaly patterns, which increases the likelihood of exposing transactional bugs. For the second challenge, we present a two-phase detection process, involving explicit error detection and implicit error detection, to identify bugs in transaction execution. We have implemented our approach in a tool, APTrans, and evaluated it on three widely-used RDBMSs: MySQL, MariaDB, and OceanBase. APTrans successfully identified 13 previously unknown transaction-related bugs, 11 of which have been confirmed by the respective development teams.

Paper and Project Links

PDF

Summary

Concurrent transaction processing is a core capability of relational database management systems (RDBMSs), and isolation levels provide a configurable mechanism for managing interaction between concurrent transactions with different consistency-performance trade-offs. Testing transaction behavior under different isolation levels is hard for two reasons: automatically generating test transactions that expose bugs in transaction-handling logic is non-trivial, since such bugs are typically triggered only under specific transactional constraints, and detecting logic anomalies is difficult because the correct results of randomly generated transactions are often unknown. The paper proposes an anomaly pattern-guided testing approach: test cases are generated under the guidance of predefined anomaly patterns to increase the likelihood of exposing transactional bugs, and a two-phase detection process (explicit and implicit error detection) identifies bugs in transaction execution. The approach, implemented in a tool called APTrans and evaluated on MySQL, MariaDB, and OceanBase, found 13 previously unknown transaction-related bugs, 11 of which have been confirmed by the respective development teams.

Key Takeaways

  1. Concurrent transaction processing is a core RDBMS capability, widely used in highly concurrent applications such as banking systems, e-commerce platforms, and telecom infrastructure.
  2. Isolation levels are a configurable mechanism for managing interactions between concurrent transactions, offering different consistency-performance trade-offs.
  3. Testing transaction behavior across isolation levels is challenging: bug-exposing test transactions are hard to generate automatically, and the correct results of randomly generated transactions are unknown.
  4. An anomaly pattern-guided testing approach addresses these challenges by generating test cases under predefined anomaly patterns, increasing the chance of exposing transaction bugs.
  5. A two-phase detection process with explicit and implicit error detection effectively identifies errors in transaction execution.
  6. The APTrans tool found previously unknown transaction-related bugs across several popular RDBMSs, demonstrating the method's effectiveness.

Cool Papers

METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model

Authors:Yankai Fu, Ning Chen, Junkai Zhao, Shaozhe Shan, Guocai Yao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

Building a generalist robot that can perceive, reason, and act across diverse tasks remains an open challenge, especially for dexterous manipulation. A major bottleneck lies in the scarcity of large-scale, action-annotated data for dexterous skills, as teleoperation is difficult and costly. Human data, with its vast scale and diverse manipulation behaviors, provides rich priors for learning robotic actions. While prior works have explored leveraging human demonstrations, they are often constrained by limited scenarios and a large visual gap between human and robots. To eliminate these limitations, we propose METIS, a vision-language-action (VLA) model for dexterous manipulation pretrained on multi-source egocentric datasets. We first construct EgoAtlas, which integrates large-scale human and robotic data from multiple sources, all unified under a consistent action space. We further extract motion-aware dynamics, a compact and discretized motion representation, which provides efficient and expressive supervision for VLA training. Built upon them, METIS integrates reasoning and acting into a unified framework, enabling effective deployment to downstream dexterous manipulation tasks. Our method demonstrates exceptional dexterous manipulation capabilities, achieving highest average success rate in six real-world tasks. Experimental results also highlight the superior generalization and robustness to out-of-distribution scenarios. These findings emphasize METIS as a promising step toward a generalist model for dexterous manipulation.

Paper and Project Links

PDF

Summary
This paper presents METIS, a vision-language-action (VLA) model for dexterous manipulation pretrained on multi-source egocentric datasets. The authors construct EgoAtlas, which integrates large-scale human and robotic data from multiple sources under a consistent action space, and extract motion-aware dynamics, a compact, discretized motion representation that provides efficient and expressive supervision for VLA training. METIS integrates reasoning and acting into a unified framework, achieves the highest average success rate across six real-world dexterous manipulation tasks, and shows strong generalization and robustness to out-of-distribution scenarios.

Key Takeaways

  1. Building a generalist model for dexterous manipulation is hampered by the scarcity of large-scale, action-annotated data.
  2. EgoAtlas integrates large-scale human and robotic data from multiple sources under a consistent action space.
  3. Motion-aware dynamics, a compact and discretized motion representation, provides effective supervision for VLA training.
  4. METIS unifies reasoning and acting in one framework, enabling strong performance on downstream dexterous manipulation tasks.
  5. METIS achieves the highest average success rate on six real-world tasks.
  6. METIS generalizes well and is robust to out-of-distribution scenarios.

Cool Papers

UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification

Authors:Taixi Chen, Jingyun Chen, Nancy Guo

Cell-level radiomics features provide fine-grained insights into tumor phenotypes and have the potential to significantly enhance diagnostic accuracy on hematoxylin and eosin (H&E) images. By capturing micro-level morphological and intensity patterns, these features support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. However, most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. Moreover, there is currently no dedicated backbone specifically designed for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification using radiomics features. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encode capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% ($n$=349,882 cells), and tumor segmentation precision from 75% to 80% ($n$=406 patches). These findings highlight the effectiveness and promise of UAM as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis.

Paper and Project Links

PDF

Summary

This paper introduces the Unified Attention-Mamba (UAM) backbone for cell-level tumor classification from radiomics features. Cell-level radiomics features capture micro-level morphological and intensity patterns, offering fine-grained insight into tumor phenotypes on H&E images and improving AI interpretability by highlighting diagnostically relevant cells for pathologist review. Unlike previous hybrid designs that combine Attention and Mamba modules in fixed proportions, UAM flexibly unifies their capabilities in a single cohesive architecture, removing manual ratio tuning and improving encoding capacity. Built on this backbone, a multimodal UAM framework jointly performs cell-level classification and image segmentation. On public benchmarks, UAM surpasses leading image-based foundation models, raising cell classification accuracy from 74% to 78% (n=349,882 cells) and tumor segmentation precision from 75% to 80% (n=406 patches), highlighting its promise as a unified, extensible multimodal foundation for radiomics-driven cancer diagnosis.

Key Takeaways

  • Cell-level radiomics features provide fine-grained insight into tumor phenotypes and can significantly improve diagnostic accuracy on H&E images.
  • The UAM architecture combines the strengths of attention mechanisms and Mamba modules for fine-grained cell-level classification.
  • UAM offers better encoding capability and flexibility than fixed-ratio hybrid designs, with no manual ratio tuning.
  • On public benchmarks, UAM reaches state-of-the-art performance, significantly improving cell classification and tumor segmentation accuracy.

Cool Papers

SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion

Authors:Jiajie Guo, Qingpeng Zhu, Jin Zeng, Xiaolong Wu, Changyong He, Weida Wang

Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks due to the strong reasoning capability of large language models (LLMs). Nevertheless, most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. In this work, we propose a novel vision encoder based on hierarchical fusion of geometry and semantics features, generating spatial-aware visual embedding and boosting the spatial grounding capability of MLLMs. Specifically, we first unveil that the spatial ambiguity shortcoming stems from the lossy embedding of the vision encoder utilized in most existing MLLMs (e.g., CLIP), restricted to instance-level semantic features. This motivates us to complement CLIP with the geometry features from vision-only self-supervised learning via a hierarchical adapter, enhancing the spatial awareness in the proposed SpatialGeo. The network is efficiently trained using pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions relying solely on the CLIP encoder. Experimental results show that SpatialGeo improves the accuracy in spatial reasoning tasks, enhancing state-of-the-art models by at least 8.0% in SpatialRGPT-Bench with approximately 50% less memory cost during inference. The source code is available via https://ricky-plus.github.io/SpatialGeoPages/.

Paper and Project Links

PDF

Summary

Multimodal large language models (MLLMs) have made significant progress on image and language tasks, but most still have limited ability to interpret and infer spatial arrangements in three-dimensional space. This work proposes a vision encoder based on hierarchical fusion of geometry and semantics features, producing spatial-aware visual embeddings that boost the spatial grounding capability of MLLMs. The authors trace the spatial ambiguity to the lossy embedding of the vision encoder used in most existing MLLMs (e.g., CLIP), which is restricted to instance-level semantic features; they therefore complement CLIP with geometry features from vision-only self-supervised learning via a hierarchical adapter, yielding SpatialGeo. The network is trained efficiently from a pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions that rely solely on the CLIP encoder. SpatialGeo improves accuracy on spatial reasoning tasks, boosting state-of-the-art models by at least 8.0% on SpatialRGPT-Bench with about 50% less memory during inference. The source code is available at https://ricky-plus.github.io/SpatialGeoPages/.

Key Takeaways

  1. MLLMs reason well over images and language but have limited spatial reasoning ability.
  2. A vision encoder based on hierarchical fusion of geometry and semantics features improves the spatial awareness of MLLMs.
  3. SpatialGeo complements CLIP with geometry features from vision-only self-supervised learning (a minimal fusion sketch follows below).
  4. Training starts from a pretrained LLaVA model and uses random feature dropping to avoid trivial solutions that rely only on the CLIP encoder.
  5. SpatialGeo outperforms existing models on spatial reasoning tasks, improving accuracy by at least 8.0% on SpatialRGPT-Bench.
  6. SpatialGeo reduces inference memory cost by about 50%.
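
A minimal sketch of the general idea, geometry-semantics fusion with random feature dropping, is shown below. The module shapes, the concatenation-based fusion, and the drop probability are our assumptions; the paper's hierarchical adapter is more elaborate.

```python
import torch
import torch.nn as nn

class GeometrySemanticsAdapter(nn.Module):
    """Toy fusion of CLIP semantic tokens with geometry-aware tokens.
    During training, the semantic branch is randomly dropped so a downstream
    model cannot rely on the CLIP features alone."""
    def __init__(self, d_clip=1024, d_geo=768, d_out=4096, p_drop=0.3):
        super().__init__()
        self.proj_clip = nn.Linear(d_clip, d_out)
        self.proj_geo = nn.Linear(d_geo, d_out)
        self.p_drop = p_drop

    def forward(self, clip_tokens, geo_tokens):
        c, g = self.proj_clip(clip_tokens), self.proj_geo(geo_tokens)
        if self.training and torch.rand(()) < self.p_drop:
            c = torch.zeros_like(c)        # drop the semantic branch this step
        return torch.cat([c, g], dim=1)    # spatial-aware visual embedding

adapter = GeometrySemanticsAdapter()
emb = adapter(torch.randn(1, 256, 1024), torch.randn(1, 256, 768))
print(emb.shape)                           # torch.Size([1, 512, 4096])
```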

Cool Papers

MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning

Authors:Wenrui Zhang, Xinggang Wang, Bin Feng, Wenyu Liu

Optical Chemical Structure Recognition (OCSR) plays a pivotal role in modern chemical informatics, enabling the automated conversion of chemical structure images from scientific literature, patents, and educational materials into machine-readable molecular representations. This capability is essential for large-scale chemical data mining, drug discovery pipelines, and Large Language Model (LLM) applications in related domains. However, existing OCSR systems face significant challenges in accurately recognizing stereochemical information due to the subtle visual cues that distinguish stereoisomers, such as wedge and dash bonds, ring conformations, and spatial arrangements. To address these challenges, we propose MolSight, a comprehensive learning framework for OCSR that employs a three-stage training paradigm. In the first stage, we conduct pre-training on large-scale but noisy datasets to endow the model with fundamental perception capabilities for chemical structure images. In the second stage, we perform multi-granularity fine-tuning using datasets with richer supervisory signals, systematically exploring how auxiliary tasks-specifically chemical bond classification and atom localization-contribute to molecular formula recognition. Finally, we employ reinforcement learning for post-training optimization and introduce a novel stereochemical structure dataset. Remarkably, we find that even with MolSight’s relatively compact parameter size, the Group Relative Policy Optimization (GRPO) algorithm can further enhance the model’s performance on stereomolecular. Through extensive experiments across diverse datasets, our results demonstrate that MolSight achieves state-of-the-art performance in (stereo)chemical optical structure recognition.

Paper and Project Links

PDF

Summary: Optical Chemical Structure Recognition (OCSR) converts chemical structure images from literature, patents, and educational materials into machine-readable molecular representations, a capability essential for large-scale chemical data mining, drug discovery, and LLM applications in related domains. Existing OCSR systems struggle with stereochemistry because stereoisomers are distinguished only by subtle visual cues such as wedge and dash bonds, ring conformations, and spatial arrangements. MolSight addresses this with a three-stage training paradigm: pre-training on large but noisy datasets for basic perception of structure images, multi-granularity fine-tuning with richer supervision (including auxiliary chemical bond classification and atom localization), and reinforcement-learning post-training together with a new stereochemical structure dataset. Despite MolSight's relatively compact parameter size, the Group Relative Policy Optimization (GRPO) algorithm further improves performance on stereochemical molecules, and experiments across diverse datasets show state-of-the-art (stereo)chemical optical structure recognition.

Key Takeaways

  1. OCSR is central to chemical informatics, enabling automated conversion of chemical structure images into machine-readable representations.
  2. Existing OCSR systems struggle to recognize stereochemical information.
  3. MolSight is a comprehensive OCSR learning framework with a three-stage training paradigm.
  4. Stage one pre-trains on large, noisy datasets to give the model basic perception of chemical structure images.
  5. Stage two performs multi-granularity fine-tuning on datasets with richer supervisory signals, using auxiliary bond classification and atom localization.
  6. Stage three applies reinforcement learning (GRPO) for post-training optimization and introduces a new stereochemical structure dataset (see the group-relative advantage sketch below).
  7. MolSight achieves state-of-the-art performance in (stereo)chemical optical structure recognition.
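
The core of GRPO-style post-training is a group-relative advantage: several candidate outputs are sampled per input and each reward is normalized against its group's mean and standard deviation. The sketch below illustrates that computation with a placeholder exact-match reward on SMILES strings; the reward design is our assumption, not MolSight's.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sampled completion's reward by
    the mean/std of its own group (all samples for the same prompt)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. six candidate SMILES decoded for one structure image, scored 0/1 by a
# placeholder exact-match reward against the reference string
reference = "C1=CC=CC=C1"
candidates = ["C1=CC=CC=C1", "C1CCCCC1", "C1=CC=CC=C1", "CCO", "C1=CC=CC=C1", "CC"]
rewards = [float(c == reference) for c in candidates]
print(group_relative_advantages(rewards))   # positive for matches, negative otherwise
```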

Cool Papers

Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables

Authors:Anshul Singh, Rohan Chaudhary, Gagneet Singh, Abhay Kumary

The impressive performance of VLMs is largely measured on benchmarks that fail to capture the complexities of real-world scenarios. Existing datasets for tabular QA, such as WikiTableQuestions and FinQA, are overwhelmingly monolingual (English) and present tables in a digitally perfect, clean format. This creates a significant gap between research and practice. To address this, we present \textbf{MirageTVQA}, a new benchmark designed to evaluate VLMs on these exact dimensions. Featuring nearly 60,000 QA pairs across 24 languages, MirageTVQA challenges models with tables that are not only multilingual but also visually imperfect, incorporating realistic noise to mimic scanned documents. Our evaluation of the leading VLMs reveals two primary failure points: a severe degradation in performance (over 35% drop for the best models) when faced with visual noise and a consistent English-first bias where reasoning abilities fail to transfer to other languages. MirageTVQA provides a benchmark for measuring and driving progress towards more robust VLM models for table reasoning. The dataset and the code are available at: https://github.com/anshulsc/MirageTVQA.

Paper and Project Links

PDF Accepted as Spotlight Talk at EurIPS 2025 Workshop on AI For Tabular Data

Summary
This paper introduces MirageTVQA, a benchmark for evaluating VLMs on table QA under realistic conditions. Existing tabular QA datasets such as WikiTableQuestions and FinQA are overwhelmingly monolingual (English) and present tables in a digitally perfect, clean format, creating a gap between research and practice. MirageTVQA contains nearly 60,000 QA pairs across 24 languages, with tables that are both multilingual and visually imperfect, incorporating realistic noise to mimic scanned documents. Evaluating leading VLMs reveals two primary failure points: a severe performance drop (over 35% for the best models) under visual noise, and a consistent English-first bias in which reasoning ability fails to transfer to other languages. The dataset and code are available at https://github.com/anshulsc/MirageTVQA.

Key Takeaways

  1. Current benchmarks for VLMs fail to capture the complexity of real-world scenarios.
  2. MirageTVQA is a new benchmark for evaluating VLMs on multilingual and visually imperfect tables.
  3. MirageTVQA contains nearly 60,000 QA pairs across 24 languages, with realistic noise that mimics scanned documents.
  4. Leading VLMs degrade severely under visual noise, with the best models dropping by more than 35%.
  5. Models show an English-first bias: reasoning abilities do not transfer well to other languages.
  6. MirageTVQA provides a benchmark for measuring and driving progress toward more robust VLMs for table reasoning.

Cool Papers

The Belief-Desire-Intention Ontology for modelling mental reality and agency

Authors:Sara Zuppiroli, Carmelo Fabio Longo, Anna Sofia Lippolis, Rocco Paolillo, Lorenzo Giammei, Miguel Ceriani, Francesco Poggi, Antonio Zinilli, Andrea Giovanni Nuzzolese

The Belief-Desire-Intention (BDI) model is a cornerstone for representing rational agency in artificial intelligence and cognitive sciences. Yet, its integration into structured, semantically interoperable knowledge representations remains limited. This paper presents a formal BDI Ontology, conceived as a modular Ontology Design Pattern (ODP) that captures the cognitive architecture of agents through beliefs, desires, intentions, and their dynamic interrelations. The ontology ensures semantic precision and reusability by aligning with foundational ontologies and best practices in modular design. Two complementary lines of experimentation demonstrate its applicability: (i) coupling the ontology with Large Language Models (LLMs) via Logic Augmented Generation (LAG) to assess the contribution of ontological grounding to inferential coherence and consistency; and (ii) integrating the ontology within the Semas reasoning platform, which implements the Triples-to-Beliefs-to-Triples (T2B2T) paradigm, enabling a bidirectional flow between RDF triples and agent mental states. Together, these experiments illustrate how the BDI Ontology acts as both a conceptual and operational bridge between declarative and procedural intelligence, paving the way for cognitively grounded, explainable, and semantically interoperable multi-agent and neuro-symbolic systems operating within the Web of Data.

Paper and Project Links

PDF

Summary

The Belief-Desire-Intention (BDI) model is a cornerstone for representing rational agency in AI and the cognitive sciences, yet its integration into structured, semantically interoperable knowledge representations remains limited. This paper presents a formal BDI Ontology, conceived as a modular Ontology Design Pattern (ODP) that captures an agent's cognitive architecture through beliefs, desires, intentions, and their dynamic interrelations, aligned with foundational ontologies and modular-design best practices for semantic precision and reusability. Two complementary lines of experimentation demonstrate its applicability: coupling the ontology with Large Language Models (LLMs) via Logic Augmented Generation (LAG) to assess how ontological grounding contributes to inferential coherence and consistency, and integrating it into the Semas reasoning platform, which implements the Triples-to-Beliefs-to-Triples (T2B2T) paradigm for a bidirectional flow between RDF triples and agent mental states. Together, these show the BDI Ontology acting as both a conceptual and an operational bridge between declarative and procedural intelligence, paving the way for cognitively grounded, explainable, and semantically interoperable multi-agent and neuro-symbolic systems on the Web of Data.

Key Takeaways

  1. The BDI model is central to representing rational agency in AI and cognitive science.
  2. Its integration into structured, semantically interoperable knowledge representations is still limited.
  3. A formal BDI Ontology is proposed as a modular Ontology Design Pattern capturing an agent's cognitive architecture.
  4. The ontology describes that architecture through beliefs, desires, intentions, and their dynamic interrelations.
  5. Alignment with foundational ontologies and modular-design best practices ensures semantic precision and reusability.
  6. Experiments with LLMs (via Logic Augmented Generation) and the Semas reasoning platform validate the ontology's applicability.

Cool Papers

Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation

Authors:Shuo Wang, Yucheng Wang, Guoxin Lian, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Yutian Zhou, Wanting Li, Deying Li, Zhaoxin Fan

Vision-Language Navigation requires agents to act coherently over long horizons by understanding not only local visual context but also how far they have advanced within a multi-step instruction. However, recent Vision-Language-Action models focus on direct action prediction and earlier progress methods predict numeric achievements; both overlook the monotonic co-progression property of the observation and instruction sequences. Building on this insight, Progress-Think introduces semantic progress reasoning, predicting instruction-style progress from visual observations to enable more accurate navigation. To achieve this without expensive annotations, we propose a three-stage framework. In the initial stage, Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between visual history and instruction prefixes. Then, Progress-Guided Policy Pretraining injects learned progress states into the navigation context, guiding the policy toward consistent actions. Finally, Progress-Policy Co-Finetuning jointly optimizes both modules with tailored progress-aware reinforcement objectives. Experiments on R2R-CE and RxR-CE show state-of-the-art success and efficiency, demonstrating that semantic progress yields a more consistent representation of navigation advancement.

Paper and Project Links

PDF

Summary

Vision-Language Navigation requires agents to act coherently over long horizons, understanding not only the local visual context but also how far they have advanced within a multi-step instruction. Recent Vision-Language-Action models focus on direct action prediction, and earlier progress-estimation methods predict numeric achievements; both overlook the monotonic co-progression of the observation and instruction sequences. Progress-Think introduces semantic progress reasoning, predicting instruction-style progress from visual observations to enable more accurate navigation. To avoid expensive annotations, a three-stage framework is used: Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between visual history and instruction prefixes; Progress-Guided Policy Pretraining injects the learned progress states into the navigation context to guide the policy toward consistent actions; and Progress-Policy Co-Finetuning jointly optimizes both modules with tailored progress-aware reinforcement objectives. Experiments on R2R-CE and RxR-CE show state-of-the-art success and efficiency, demonstrating that semantic progress yields a more consistent representation of navigation advancement.

Key Takeaways

  1. Vision-Language Navigation requires coherent long-horizon behavior, covering both local visual context and progress within a multi-step instruction.
  2. Existing models focus on direct action prediction or numeric progress prediction, ignoring the monotonic co-progression of observations and instructions.
  3. Progress-Think proposes semantic progress reasoning, predicting instruction-style progress from visual observations to improve navigation accuracy.
  4. A three-stage framework implements this: Self-Aligned Progress Pretraining, Progress-Guided Policy Pretraining, and Progress-Policy Co-Finetuning.
  5. The method is optimized with tailored progress-aware reinforcement objectives.
  6. Experiments on R2R-CE and RxR-CE demonstrate state-of-the-art success rates and efficiency.

Cool Papers

Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models

Authors:He Huang, Zixuan Hu, Dongxiao Li, Yao Xiao, Ling-Yu Duan

Video anomaly detection (VAD) plays a vital role in real-world applications such as security surveillance, autonomous driving, and industrial monitoring. Recent advances in large pre-trained models have opened new opportunities for training-free VAD by leveraging rich prior knowledge and general reasoning capabilities. However, existing studies typically rely on dense frame-level inference, incurring high computational costs and latency. This raises a fundamental question: Is dense reasoning truly necessary when using powerful pre-trained models in VAD systems? To answer this, we propose ReCoVAD, a novel framework inspired by the dual reflex and conscious pathways of the human nervous system, enabling selective frame processing to reduce redundant computation. ReCoVAD consists of two core pathways: (i) a Reflex pathway that uses a lightweight CLIP-based module to fuse visual features with prototype prompts and produce decision vectors, which query a dynamic memory of past frames and anomaly scores for fast response; and (ii) a Conscious pathway that employs a medium-scale vision-language model to generate textual event descriptions and refined anomaly scores for novel frames. It continuously updates the memory and prototype prompts, while an integrated large language model periodically reviews accumulated descriptions to identify unseen anomalies, correct errors, and refine prototypes. Extensive experiments show that ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% and 16.04% of the frames used by previous methods on the UCF-Crime and XD-Violence datasets, demonstrating that sparse reasoning is sufficient for effective large-model-based VAD.

Paper and Project Links

PDF

Summary

Video anomaly detection (VAD) is vital in applications such as security surveillance, autonomous driving, and industrial monitoring. Large pre-trained models enable training-free VAD by leveraging rich prior knowledge and general reasoning, but existing work relies on dense frame-level inference with high computational cost and latency. Inspired by the dual reflex and conscious pathways of the human nervous system, ReCoVAD processes frames selectively to cut redundant computation. A Reflex pathway uses a lightweight CLIP-based module to fuse visual features with prototype prompts and produce decision vectors that query a dynamic memory of past frames and anomaly scores for fast responses. A Conscious pathway uses a medium-scale vision-language model to generate textual event descriptions and refined anomaly scores for novel frames, continuously updating the memory and prototype prompts, while an integrated large language model periodically reviews the accumulated descriptions to identify unseen anomalies, correct errors, and refine the prototypes. ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% and 16.04% of the frames used by previous methods on UCF-Crime and XD-Violence, showing that sparse reasoning suffices for effective large-model-based VAD.

Key Takeaways

  1. Video anomaly detection (VAD) is important in real-world applications.
  2. Large pre-trained models open new opportunities for training-free VAD.
  3. Existing VAD work relies on dense frame-level inference, which is computationally expensive and slow.
  4. ReCoVAD combines a reflex pathway and a conscious pathway for selective frame processing.
  5. The reflex pathway gives fast responses by querying a dynamic memory of past frames (a toy lookup follows below), while the conscious pathway performs finer analysis of novel frames.
  6. ReCoVAD improves performance by updating its memory and prototype prompts and by integrating a large language model for periodic review.
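
The reflex pathway's "query a dynamic memory of past frames and anomaly scores" step can be pictured as a similarity-weighted lookup over stored feature vectors. The toy memory below uses cosine similarity and a softmax over the top-k matches; this scheme is our own simplification, not the paper's exact mechanism.

```python
import numpy as np

class FrameMemory:
    """Toy dynamic memory: stores frame feature vectors with their anomaly
    scores and answers new queries by similarity-weighted lookup."""
    def __init__(self):
        self.feats, self.scores = [], []

    def add(self, feat, score):
        self.feats.append(feat / np.linalg.norm(feat))
        self.scores.append(score)

    def query(self, feat, k=3):
        q = feat / np.linalg.norm(feat)
        sims = np.array([f @ q for f in self.feats])   # cosine similarities
        top = np.argsort(sims)[-k:]
        w = np.exp(sims[top]); w /= w.sum()
        return float(w @ np.array(self.scores)[top])   # fast anomaly estimate

rng = np.random.default_rng(0)
mem = FrameMemory()
for _ in range(20):                                    # fill with past frames
    mem.add(rng.normal(size=512), float(rng.random() < 0.1))
print(mem.query(rng.normal(size=512)))
```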

Cool Papers

RL-AD-Net: Reinforcement Learning Guided Adaptive Displacement in Latent Space for Refined Point Cloud Completion

Authors:Bhanu Pratap Paregi, Vaibhav Kumar

Recent point cloud completion models, including transformer-based, denoising-based, and other state-of-the-art approaches, generate globally plausible shapes from partial inputs but often leave local geometric inconsistencies. We propose RL-AD-Net, a reinforcement learning (RL) refinement framework that operates in the latent space of a pretrained point autoencoder. The autoencoder encodes completions into compact global feature vectors (GFVs), which are selectively adjusted by an RL agent to improve geometric fidelity. To ensure robustness, a lightweight non-parametric PointNN selector evaluates the geometric consistency of both the original completion and the RL-refined output, retaining the better reconstruction. When ground truth is available, both Chamfer Distance and geometric consistency metrics guide refinement. Training is performed separately per category, since the unsupervised and dynamic nature of RL makes convergence across highly diverse categories challenging. Nevertheless, the framework can be extended to multi-category refinement in future work. Experiments on ShapeNetCore-2048 demonstrate that while baseline completion networks perform reasonable under their training-style cropping, they struggle in random cropping scenarios. In contrast, RL-AD-Net consistently delivers improvements across both settings, highlighting the effectiveness of RL-guided ensemble refinement. The approach is lightweight, modular, and model-agnostic, making it applicable to a wide range of completion networks without requiring retraining.

Paper and Project Links

PDF

Summary

Recent point cloud completion models produce globally plausible shapes from partial inputs but often leave local geometric inconsistencies. RL-AD-Net is a reinforcement learning (RL) refinement framework that operates in the latent space of a pretrained point autoencoder: completions are encoded into compact global feature vectors (GFVs), which an RL agent selectively adjusts to improve geometric fidelity. For robustness, a lightweight non-parametric PointNN selector evaluates the geometric consistency of the original completion and the RL-refined output and keeps the better reconstruction; when ground truth is available, Chamfer Distance and geometric consistency metrics guide the refinement. Training is performed per category, since the unsupervised, dynamic nature of RL makes convergence across highly diverse categories challenging, though the framework could be extended to multi-category refinement in future work. On ShapeNetCore-2048, baseline completion networks do reasonably under their training-style cropping but struggle under random cropping, whereas RL-AD-Net consistently improves both settings. The approach is lightweight, modular, and model-agnostic, so it can be applied to a wide range of completion networks without retraining them.

Key Takeaways

  1. RL-AD-Net uses reinforcement learning to refine point cloud completions.
  2. The framework operates in the latent space of a pretrained point autoencoder to improve geometric fidelity.
  3. An RL agent selectively adjusts the global feature vectors (GFVs) to improve geometric consistency.
  4. A lightweight non-parametric PointNN selector evaluates the geometric consistency of each completion, and Chamfer Distance guides refinement when ground truth is available (see the sketch below).
  5. RL-AD-Net outperforms baseline completion networks across cropping scenarios.
  6. The approach is lightweight, modular, and model-agnostic.
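
Chamfer Distance, used above to guide refinement when ground truth is available, has a standard symmetric (squared) form; the NumPy sketch below implements that textbook formulation, which may differ in detail from the exact variant used by RL-AD-Net.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric (squared) Chamfer Distance between point sets p: (N, 3), q: (M, 3)."""
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)   # (N, M) pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(0)
completion = rng.normal(size=(1024, 3))
ground_truth = completion + 0.01 * rng.normal(size=(1024, 3))
print(chamfer_distance(completion, ground_truth))   # small for near-identical clouds
```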

Cool Papers

OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding

Authors:Teng Fu, Mengyang Zhao, Ke Niu, Kaixin Peng, Bin Li

LVLMs have been shown to perform excellently in image-level tasks such as VQA and caption. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still show performance gaps compared to previous expert models. Meanwhile, although pedestrian tracking is a classical task, there have been a number of new topics in combining object tracking and natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks emphasize that models should understand the tracked object at an advanced semantic level, which is exactly where LVLMs excel. In this paper, we propose a new unified Pedestrian Tracking framework, namely OmniPT, which can track, track based on reference and generate semantic understanding of tracked objects interactively. We address two issues: how to model the tracking task into a task that foundation models can perform, and how to make the model output formatted answers. To this end, we implement a training phase consisting of RL-Mid Training-SFT-RL. Based on the pre-trained weights of the LVLM, we first perform a simple RL phase to enable the model to output fixed and supervisable bounding box format. Subsequently, we conduct a mid-training phase using a large number of pedestrian-related datasets. Finally, we perform supervised fine-tuning on several pedestrian tracking datasets, and then carry out another RL phase to improve the model’s tracking performance and enhance its ability to follow instructions. We conduct experiments on tracking benchmarks and the experimental results demonstrate that the proposed method can perform better than the previous methods.

Paper and Project Links

PDF AAAI 2026

Summary

LVLMs perform excellently on image-level tasks such as VQA and captioning, but on many instance-level tasks such as visual grounding and object detection they still lag behind earlier expert models. Meanwhile, pedestrian tracking has grown beyond the classical setting into tasks that combine tracking with natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT; these require understanding the tracked object at an advanced semantic level, which is exactly where LVLMs excel. OmniPT is a unified pedestrian tracking framework that can track, track based on a reference, and interactively generate semantic understanding of the tracked objects. It addresses two issues: how to cast tracking as a task a foundation model can perform, and how to make the model output formatted answers. Training follows an RL - mid-training - SFT - RL pipeline: a simple RL phase first teaches the pretrained LVLM to output a fixed, supervisable bounding-box format; a mid-training phase uses large pedestrian-related datasets; supervised fine-tuning on several pedestrian tracking datasets follows; and a final RL phase improves tracking performance and instruction following. Experiments on tracking benchmarks show the method outperforms previous approaches.

Key Takeaways

  1. LVLMs excel at image-level tasks such as VQA and captioning but still trail expert models on instance-level tasks such as visual grounding and object detection.
  2. OmniPT is a unified pedestrian tracking framework that exploits the semantic understanding of LVLMs, emphasizing advanced semantic understanding of the tracked objects.
  3. OmniPT addresses how to cast tracking as a task LVLMs can perform and how to make the model output formatted answers.
  4. Training consists of a simple RL phase on pretrained weights, a mid-training phase, supervised fine-tuning, and a final RL phase that improves tracking and instruction following.
  5. OmniPT outperforms previous methods on tracking benchmarks.
  6. The paper also highlights newer tasks that combine object tracking with natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT.

Cool Papers

ReVul-CoT: Towards Effective Software Vulnerability Assessment with Retrieval-Augmented Generation and Chain-of-Thought Prompting

Authors:Zhijie Chen, Xiang Chen, Ziming Li, Jiacheng Xue, Chaoyang Gao

Context: Software Vulnerability Assessment (SVA) plays a vital role in evaluating and ranking vulnerabilities in software systems to ensure their security and reliability. Objective: Although Large Language Models (LLMs) have recently shown remarkable potential in SVA, they still face two major limitations. First, most LLMs are trained on general-purpose corpora and thus lack domain-specific knowledge essential for effective SVA. Second, they tend to rely on shallow pattern matching instead of deep contextual reasoning, making it challenging to fully comprehend complex code semantics and their security implications. Method: To alleviate these limitations, we propose a novel framework ReVul-CoT that integrates Retrieval-Augmented Generation (RAG) with Chain-of-Thought (COT) prompting. In ReVul-CoT, the RAG module dynamically retrieves contextually relevant information from a constructed local knowledge base that consolidates vulnerability data from authoritative sources (such as NVD and CWE), along with corresponding code snippets and descriptive information. Building on DeepSeek-V3.1, CoT prompting guides the LLM to perform step-by-step reasoning over exploitability, impact scope, and related factors Results: We evaluate ReVul-CoT on a dataset of 12,070 vulnerabilities. Experimental results show that ReVul-CoT outperforms state-of-the-art SVA baselines by 16.50%-42.26% in terms of MCC, and outperforms the best baseline by 10.43%, 15.86%, and 16.50% in Accuracy, F1-score, and MCC, respectively. Our ablation studies further validate the contributions of considering dynamic retrieval, knowledge integration, and CoT-based reasoning. Conclusion: Our results demonstrate that combining RAG with CoT prompting significantly enhances LLM-based SVA and points out promising directions for future research.

Paper and Project Links

PDF

Summary

Software Vulnerability Assessment (SVA) evaluates and ranks vulnerabilities in software systems to ensure their security and reliability, but LLM-based SVA faces two limitations: most LLMs are trained on general-purpose corpora and lack domain-specific knowledge, and they tend to rely on shallow pattern matching rather than deep contextual reasoning over code semantics. ReVul-CoT addresses both by combining Retrieval-Augmented Generation (RAG) with Chain-of-Thought (CoT) prompting: a RAG module dynamically retrieves contextually relevant information from a local knowledge base built from authoritative sources (such as NVD and CWE) together with code snippets and descriptions, and CoT prompting on top of DeepSeek-V3.1 guides step-by-step reasoning over exploitability, impact scope, and related factors. On a dataset of 12,070 vulnerabilities, ReVul-CoT outperforms state-of-the-art SVA baselines by 16.50%-42.26% in MCC and beats the best baseline by 10.43%, 15.86%, and 16.50% in Accuracy, F1-score, and MCC, respectively; ablations confirm the contributions of dynamic retrieval, knowledge integration, and CoT-based reasoning.

Key Takeaways

  • Software Vulnerability Assessment (SVA) plays an important role in evaluating the security of software systems.
  • LLMs face two main challenges in SVA: a lack of domain-specific knowledge and a reliance on shallow pattern matching rather than deep contextual reasoning.
  • ReVul-CoT combines Retrieval-Augmented Generation (RAG) with Chain-of-Thought (CoT) prompting to address these challenges.
  • ReVul-CoT dynamically retrieves relevant knowledge and context from authoritative sources and uses DeepSeek-V3.1 for step-by-step contextual reasoning, improving LLM performance on SVA.
  • Experiments show ReVul-CoT outperforms existing techniques, with clear gains in Accuracy, F1-score, and MCC (an MCC sketch follows below).
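
Since the reported gains are largely in MCC, here is a small self-contained implementation of the multi-class Matthews correlation coefficient (Gorodkin's $R_K$) computed from a confusion matrix. The severity labels in the example are hypothetical, and the paper's exact class setup is not reproduced.

```python
import numpy as np

def mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient (Gorodkin's R_K)."""
    classes = sorted(set(y_true) | set(y_pred))
    idx = {c: i for i, c in enumerate(classes)}
    C = np.zeros((len(classes), len(classes)))
    for t, p in zip(y_true, y_pred):
        C[idx[t], idx[p]] += 1                       # confusion matrix
    s, c = C.sum(), np.trace(C)                      # total samples, correct ones
    t_k, p_k = C.sum(axis=1), C.sum(axis=0)          # true / predicted class counts
    num = c * s - t_k @ p_k
    den = np.sqrt((s**2 - p_k @ p_k) * (s**2 - t_k @ t_k))
    return num / den if den else 0.0

y_true = ["LOW", "HIGH", "MEDIUM", "HIGH", "LOW", "CRITICAL"]   # hypothetical labels
y_pred = ["LOW", "HIGH", "LOW",    "HIGH", "LOW", "CRITICAL"]
print(round(mcc(y_true, y_pred), 3))
```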

Cool Papers

ToC: Tree-of-Claims Search with Multi-Agent Language Models

Authors:Shuyang Yu, Jianan Liang, Hui Hu

Optimizing patent claims is a critical yet challenging task, demanding careful balance between maximizing novelty and preserving legal scope. Manual claim drafting is labor-intensive, costly, and inherently inconsistent, while conventional Large Language Models (LLMs) often lack the structured, iterative reasoning essential for precise claim refinement. To address these challenges, we introduce Tree of Claims (ToC), an innovative framework that redefines claim editing as a guided search problem. ToC synergistically integrates Monte Carlo Tree Search (MCTS) with a collaborative multi-agent system, comprising an LLM-based EditorAgent that proposes contextually grounded edits, and an ExaminerAgent that mimics patent examiner critiques through structured, chain-of-thought analyses of novelty and prior art disclosure. Driven by a carefully designed multi-objective reward function, ToC jointly optimizes novelty, scope retention, and semantic coherence. Experimental evaluation on a benchmark of 1145 claims demonstrates that ToC significantly outperforms standard LLMs in zero-shot and few-shot scenarios, achieving an average composite score improvement of 8%, and up to 9% in certain cases. Extensive experiments, including detailed ablation studies, validate ToC's efficacy in generating superior, legally robust claim revisions. Overall, ToC establishes a transparent, controllable, and interpretable methodology that effectively bridges advanced LLM reasoning capabilities with strategic MCTS planning for structured patent claim optimization. The source code is available at https://github.com/ysy2003/ToC.

Paper and Project Links

PDF Accepted by AAAI 2026 (Oral)

Summary
Optimizing patent claims is critical yet challenging, requiring a careful balance between maximizing novelty and preserving legal scope. Manual claim drafting is labor-intensive, costly, and inconsistent, while conventional LLMs lack the structured, iterative reasoning needed for precise claim refinement. Tree of Claims (ToC) redefines claim editing as a guided search problem: it integrates Monte Carlo Tree Search (MCTS) with a collaborative multi-agent system comprising an LLM-based EditorAgent that proposes contextually grounded edits and an ExaminerAgent that mimics patent examiner critiques through structured chain-of-thought analyses of novelty and prior-art disclosure. Driven by a carefully designed multi-objective reward function, ToC jointly optimizes novelty, scope retention, and semantic coherence. On a benchmark of 1145 claims, ToC significantly outperforms standard LLMs in zero-shot and few-shot scenarios, with an average composite score improvement of 8% and up to 9% in some cases; ablation studies confirm its effectiveness in producing superior, legally robust claim revisions. The source code is available at https://github.com/ysy2003/ToC.

Key Takeaways

  1. Optimizing patent claims requires balancing maximal novelty against preservation of legal scope.
  2. Conventional LLMs lack the structured, iterative reasoning needed for patent claim refinement.
  3. Tree of Claims (ToC) is a framework that redefines claim editing as a guided search problem.
  4. ToC combines Monte Carlo Tree Search with a collaborative multi-agent system (EditorAgent and ExaminerAgent) to improve the efficiency and quality of claim optimization (a minimal UCT sketch follows below).
  5. A carefully designed multi-objective reward jointly optimizes novelty, scope retention, and semantic coherence.
  6. Experiments show ToC significantly outperforms standard LLMs at patent claim optimization.
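
The search side of ToC can be pictured with the standard UCT selection rule plus a scalarized multi-objective reward. The sketch below is generic MCTS bookkeeping, not the paper's implementation: the weights, the exploration constant, and the child-node fields are illustrative assumptions.

```python
import math

def multi_objective_reward(novelty, scope_retention, coherence,
                           weights=(0.4, 0.4, 0.2)):
    """Toy scalarization of the three objectives (weights are assumptions)."""
    return sum(w * v for w, v in zip(weights, (novelty, scope_retention, coherence)))

def select_child(children, c_explore=1.4):
    """UCT selection over candidate claim edits.
    children: list of dicts with cumulative 'value' and 'visits'."""
    total = sum(ch["visits"] for ch in children)
    def uct(ch):
        if ch["visits"] == 0:
            return float("inf")                      # explore unvisited edits first
        exploit = ch["value"] / ch["visits"]
        explore = c_explore * math.sqrt(math.log(total) / ch["visits"])
        return exploit + explore
    return max(children, key=uct)

children = [{"edit": "narrow claim 1",      "value": 2.1, "visits": 4},
            {"edit": "add dependent claim", "value": 0.9, "visits": 1},
            {"edit": "reword preamble",     "value": 0.0, "visits": 0}]
print(select_child(children)["edit"])                # 'reword preamble' (unvisited)
```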

Cool Papers


Author: Kedreamix
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!