
LLM


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Please note: never use these summaries in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-11-25

Counterfactual World Models via Digital Twin-conditioned Video Diffusion

Authors:Yiqing Shen, Aiza Maksutova, Chenjia Li, Mathias Unberath

World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as “what would happen if this object was removed?”, is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations where object properties and relationships cannot be selectively modified. This modeling choice prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework to overcome those limitations, turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model with the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that the CWMDT approach achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for video forward simulation-based world models.
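
The three-stage pipeline above can be pictured with a minimal, heavily simplified sketch. Everything here is an illustrative assumption rather than the authors' implementation: the digital-twin schema, the `apply_intervention` helper, and the commented-out `llm_complete` / `video_diffusion` calls are hypothetical placeholders standing in for the LLM reasoning and diffusion-conditioning stages.

```python
import json

# Hypothetical digital twin of one observed frame: objects and relations as structured text.
twin = {
    "objects": [
        {"id": "cup_1", "class": "cup", "on": "table_1"},
        {"id": "ball_1", "class": "ball", "on": "table_1"},
        {"id": "table_1", "class": "table"},
    ],
    "relations": [["ball_1", "left_of", "cup_1"]],
}

def apply_intervention(twin, remove_id):
    """Counterfactual intervention: delete an object and every relation touching it."""
    return {
        "objects": [o for o in twin["objects"] if o["id"] != remove_id],
        "relations": [r for r in twin["relations"] if remove_id not in r],
    }

def build_propagation_prompt(edited_twin, horizon=8):
    """Ask an LLM to roll the edited scene forward in time; the LLM call itself is a placeholder."""
    return (
        "Given this scene description, predict the object states for the next "
        f"{horizon} frames as JSON:\n{json.dumps(edited_twin)}"
    )

edited = apply_intervention(twin, remove_id="cup_1")
prompt = build_propagation_prompt(edited)
# future_twins = llm_complete(prompt)                 # hypothetical LLM reasoning call
# frames = video_diffusion(condition=future_twins)    # hypothetical text-conditioned diffusion call
print(prompt[:120])
```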


Paper & Project Links

PDF

Summary

World models learn to predict the temporal evolution of visual observations from control signals, which could let agents reason about environments through forward simulation. Because they focus on forward simulation, current world models generate predictions from factual observations. For emerging applications such as comprehensive evaluation of physical AI behavior under varying conditions, the ability to answer counterfactual queries, e.g., "what would happen if this object were removed?", is increasingly important. The authors formalize counterfactual world models that additionally take interventions as explicit inputs and predict temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations in which object properties and relationships cannot be selectively modified, which prevents targeted interventions. The CWMDT framework overcomes these limitations and turns standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes that explicitly encode objects and their relationships as structured text. Second, it applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, it conditions a video diffusion model on the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show state-of-the-art performance, suggesting that alternative video representations, such as the digital twins considered here, offer powerful control signals for forward-simulation-based world models.

Key Takeaways

  1. World models predict the temporal evolution of visual observations from control signals, letting agents reason about environments through forward simulation.
  2. Current world models focus on forward simulation and generate predictions based on factual observations.
  3. Counterfactual world models matter because emerging applications need answers to counterfactual queries, such as how a scene changes after an object is removed.
  4. Traditional world models operate on pixel-space representations, which limits targeted interventions on specific scene properties.
  5. The CWMDT framework overcomes these limitations by constructing digital twins, turning video diffusion models into effective counterfactual world models.
  6. CWMDT applies large language models to reason over the scene's digital twin and predict how a counterfactual intervention alters the observed scene.

Cool Papers

Click here to view paper screenshots

PersonaAgent with GraphRAG: Community-Aware Knowledge Graphs for Personalized LLM

Authors:Siqi Liang, Yudi Zhang, Yue Guo

We propose a novel framework for persona-based language model system, motivated by the need for personalized AI agents that adapt to individual user preferences. In our approach, the agent embodies the user’s “persona” (e.g. user profile or taste) and is powered by a large language model (LLM). To enable the agent to leverage rich contextual information, we introduce a Knowledge-Graph-enhanced Retrieval-Augmented Generation (Graph RAG) mechanism that constructs an LLM-derived graph index of relevant documents and summarizes communities of related information. Our framework generates personalized prompts by combining: (1) a summary of the user’s historical behaviors and preferences extracted from the knowledge graph, and (2) relevant global interaction patterns identified through graph-based community detection. This dynamic prompt engineering approach allows the agent to maintain consistent persona-aligned behaviors while benefiting from collective knowledge. On the LaMP benchmark, our method improves news categorization F1 by 11.1%, movie tagging F1 by 56.1%, and reduces product rating MAE by 10.4% over prior methods. Our code is available at https://anonymous.4open.science/r/PersonaAgentwGraphRAG-DE6F
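
As a minimal sketch of the prompt construction described above, the snippet below combines a user-history summary with community summaries into one personalized prompt. The function name, the example strings, and the assumption that both summaries arrive as plain text are illustrative; in the paper they would come from the LLM-derived graph index and graph-based community detection.

```python
# Minimal sketch of personalized prompt assembly from (1) a knowledge-graph summary
# of the user's history and (2) community summaries found via community detection.

def build_persona_prompt(task, user_kg_summary, community_summaries, k=2):
    """Combine a per-user history summary with the top-k community summaries."""
    context = "\n".join(f"- {s}" for s in community_summaries[:k])
    return (
        "You are an assistant acting on behalf of this user.\n"
        f"User profile (from knowledge graph): {user_kg_summary}\n"
        f"Relevant global patterns:\n{context}\n"
        f"Task: {task}\n"
        "Answer in a way consistent with the user's preferences."
    )

prompt = build_persona_prompt(
    task="Assign a category to this news headline: 'Local team wins championship'",
    user_kg_summary="Prefers concise answers; frequently reads sports and finance news.",
    community_summaries=[
        "Users with similar profiles label championship stories as 'sports'.",
        "Finance-leaning users rarely tag sports content as 'business'.",
    ],
)
print(prompt)
```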


Paper & Project Links

PDF

Summary

Motivated by the need for personalized AI agents that adapt to individual user preferences, the authors propose a novel persona-based language model framework. The agent embodies the user's persona, is powered by a large language model (LLM), and uses a Knowledge-Graph-enhanced Retrieval-Augmented Generation (Graph RAG) mechanism that builds an LLM-derived graph index of relevant documents and summarizes communities of related information. Personalized prompts are generated by combining a summary of the user's historical behaviors and preferences extracted from the knowledge graph with global interaction patterns identified through graph-based community detection. On the LaMP benchmark, the method improves news categorization F1 by 11.1% and movie tagging F1 by 56.1%, and reduces product rating MAE by 10.4%.

Key Takeaways

  1. Proposes a persona-based language model framework that meets the need for personalized AI agents.
  2. Combines the user's persona with a large language model (LLM).
  3. Introduces a Knowledge-Graph-enhanced Retrieval-Augmented Generation (Graph RAG) mechanism.
  4. Generates personalized prompts by combining a summary of the user's historical behaviors and preferences with global interaction patterns.
  5. The framework maintains persona-aligned behavior while benefiting from collective knowledge.
  6. On the LaMP benchmark, the method performs strongly on news categorization, movie tagging, and product rating.
  7. The code is publicly available.

Cool Papers

Click here to view paper screenshots

MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models

Authors:Yuqi Li, Junhao Dong, Chuanguang Yang, Shiping Wen, Piotr Koniusz, Tingwen Huang, Yingli Tian, Yew-Soon Ong

Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our codes are available at https://github.com/itsnotacie/MMT-ARD.
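
The sketch below illustrates the general shape of a dual-teacher robust-distillation loss with a confidence-based weight for hard samples and a sigmoid-based balance between the two knowledge sources. The specific loss terms, the cosine-similarity "confidence", and the `alpha` temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dual_teacher_loss(student_clean, student_adv, t_clean, t_robust, alpha=10.0):
    """student_*: student features on clean/adversarial inputs; t_*: teacher features."""
    # Clean-feature preservation vs. robust-feature enhancement.
    loss_clean = F.mse_loss(student_clean, t_clean.detach())
    loss_robust = F.mse_loss(student_adv, t_robust.detach())

    # Teacher "confidence": higher agreement -> easier sample -> lower extra weight.
    with torch.no_grad():
        conf = F.cosine_similarity(t_robust, student_adv, dim=-1).mean()
        hard_weight = 1.0 - conf.clamp(0.0, 1.0)          # focus on harder samples

    # Sigmoid-based balance between the two knowledge sources.
    gate = torch.sigmoid(alpha * (loss_robust.detach() - loss_clean.detach()))
    return gate * (1.0 + hard_weight) * loss_robust + (1.0 - gate) * loss_clean

# Toy usage with random features standing in for model outputs.
s_clean, s_adv = torch.randn(8, 512), torch.randn(8, 512)
t_clean, t_robust = torch.randn(8, 512), torch.randn(8, 512)
print(dual_teacher_loss(s_clean, s_adv, t_clean, t_robust).item())
```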


Paper & Project Links

PDF 10 pages

Summary

Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making adversarial robustness a key concern. To address the limited knowledge diversity, slow convergence, and difficult robustness-accuracy trade-off of traditional single-teacher adversarial robust distillation, this paper proposes MMT-ARD, a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Its key innovation is a dual-teacher knowledge fusion architecture that jointly optimizes clean feature preservation and robust feature enhancement. A dynamic weight allocation strategy based on teacher confidence handles challenging adversarial examples, and an adaptive sigmoid-based weighting function balances knowledge transfer across modalities to mitigate teacher bias. Experiments on ImageNet and zero-shot benchmarks show that MMT-ARD improves robust and zero-shot accuracy on ViT-B-32 while being 2.3x more training-efficient than single-teacher methods, demonstrating its effectiveness and scalability for enhancing the adversarial robustness of multimodal large models. Code: https://github.com/itsnotacie/MMT-ARD.

Key Takeaways

  1. The growing deployment of VLMs in safety-critical applications makes their adversarial robustness crucial.
  2. Traditional single-teacher adversarial robust distillation suffers from limited knowledge diversity, slow convergence, and a difficult robustness-accuracy balance.
  3. MMT-ARD uses a dual-teacher knowledge fusion architecture that jointly optimizes the preservation of clean features and the enhancement of robust features.
  4. A dynamic weight allocation strategy handles challenging adversarial examples.
  5. An adaptive sigmoid-based weighting function balances cross-modal knowledge transfer and reduces teacher bias.
  6. Experiments on ImageNet and zero-shot benchmarks confirm MMT-ARD's effectiveness, improving both robust and zero-shot accuracy.

Cool Papers

Click here to view paper screenshots

REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing

Authors:Binger Chen, Tacettin Emre Bök, Behnood Rasti, Volker Markl, Begüm Demir

Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. These models include unimodal vision encoders trained on a single data modality and multimodal architectures trained on combinations of SAR, multispectral, hyperspectral, and image-text data. They support diverse RS tasks including semantic segmentation, image classification, change detection, and visual question answering. However, selecting an appropriate remote sensing foundation model (RSFM) remains difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries. REMSA interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications. We also propose a benchmark of 75 expert-verified RS query scenarios, producing 900 configurations under an expert-centered evaluation protocol. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates entirely on publicly available metadata and does not access private or sensitive data.


Paper & Project Links

PDF Code and data available at https://github.com/be-chen/REMSA

Summary

Foundation Models (FMs), including unimodal vision encoders and multimodal architectures, are increasingly used in remote sensing for tasks such as environmental monitoring, disaster assessment, and land-use mapping. To support diverse RS tasks and help select an appropriate remote sensing foundation model (RSFM), the authors introduce the RSFM Database (RS-FMD). Built on RS-FMD, REMSA is an LLM-based agent for automated RSFM selection from natural language queries: it interprets user requirements, resolves missing constraints, ranks candidate models, and provides transparent justifications. On a benchmark of expert-verified RS query scenarios, REMSA outperforms several baselines. It operates entirely on publicly available metadata and does not access private or sensitive data.

Key Takeaways

  1. Foundation models (FMs) are widely used in remote sensing for tasks such as environmental monitoring, disaster assessment, and land-use mapping.
  2. Many remote sensing foundation models (RSFMs) exist, spanning unimodal and multimodal architectures, which makes selecting an appropriate one challenging.
  3. The RSFM Database (RS-FMD) is introduced to support diverse remote sensing tasks.
  4. Built on RS-FMD, the REMSA agent automates RSFM selection, interpreting user requirements and resolving missing constraints.
  5. REMSA performs strongly on an expert-verified benchmark and outperforms baseline methods.
  6. REMSA runs entirely on publicly available metadata and does not involve private or sensitive data.

Cool Papers

Click here to view paper screenshots

RoboCOIN: An Open-Sourced Bimanual Robotic Data COllection for INtegrated Manipulation

Authors:Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, Zhaoye Long, Yue Wang, Chong Liu, Dihan Wang, Ziqiang Ni, Xiang Yang, You Liu, Ruoxuan Feng, Runtian Xu, Lei Zhang, Denghang Huang, Chenghao Jin, Anlan Yin, Xinlong Wang, Zhenguo Sun, Junkai Zhao, Mengfei Du, Mingyu Cao, Xiansheng Chen, Hongyang Cheng, Xiaojie Zhang, Yankai Fu, Ning Chen, Cheng Chi, Sixiang Chen, Huaihai Lyu, Xiaoshuai Hao, Yankai Fu, Yequan Wang, Bo Lei, Dong Liu, Xi Yang, Yance Jiao, Tengfei Pan, Yunyan Zhang, Songjing Wang, Ziqian Zhang, Xu Liu, Ji Zhang, Caowei Meng, Zhizheng Zhang, Jiyang Gao, Song Wang, Xiaokun Leng, Zhiqiang Xie, Zhenzhen Zhou, Peng Huang, Wu Yang, Yandong Guo, Yichao Zhu, Suibing Zheng, Hao Cheng, Xinmin Ding, Yang Yue, Huanqian Wang, Chi Chen, Jingrui Pang, YuXi Qian, Haoran Geng, Lianli Gao, Haiyuan Li, Bin Fang, Gao Huang, Yaodong Yang, Hao Dong, He Wang, Hang Zhao, Yadong Mu, Di Hu, Hao Zhao, Tiejun Huang, Shanghang Zhang, Yonghua Lin, Zhongyuan Wang, Guocai Yao

Bimanual manipulation is essential for achieving human-like dexterity in robots, but the large-scale and diverse bimanual robot datasets remain scarce due to hardware heterogeneity across robotic platforms. To address the challenge, we present RoboCOIN, a comprehensive multi-embodiment bimanual manipulation dataset with over 180,000 demonstrations collected from 15 distinct robotic platforms. The dataset covers 16 scenarios, including residential, commercial, and working environments, with 421 tasks systematically organized by bimanual coordination patterns and object properties. Our key innovation is a hierarchical capability pyramid that provides multi-level annotations, spanning trajectory-level concepts, segment-level subtasks, and frame-level kinematics. We further develop CoRobot, a comprehensive processing framework featuring Robot Trajectory Markup Language (RTML) for quality assessment, automated annotation generation, and unified multi-embodiment management. Extensive experiments demonstrate the reliability and effectiveness of RoboCOIN in multi-embodiment bimanual learning, with significant performance improvements across various model architectures and robotic platforms. The complete dataset and framework are open-sourced and publicly available for further research purposes. Project website: https://FlagOpen.github.io/RoboCOIN/.


Paper & Project Links

PDF

Summary

This paper presents RoboCOIN, a comprehensive multi-embodiment bimanual manipulation dataset with more than 180,000 demonstrations collected from distinct robotic platforms. The dataset covers 16 scenarios, including residential, commercial, and working environments, and systematically organizes 421 tasks. A hierarchical capability pyramid provides multi-level annotations spanning trajectory-level concepts, segment-level subtasks, and frame-level kinematics. The accompanying CoRobot framework uses Robot Trajectory Markup Language (RTML) for quality assessment, automated annotation generation, and unified multi-embodiment management. Experiments show that RoboCOIN is reliable and effective for multi-embodiment bimanual learning, with significant performance gains across model architectures and robotic platforms. The dataset and framework are open-sourced for further research.

Key Takeaways

  1. RoboCOIN is a multi-embodiment bimanual manipulation dataset containing a large number of demonstrations.
  2. The dataset covers diverse environments and 421 systematically organized tasks.
  3. A hierarchical capability pyramid provides multi-level annotations.
  4. The CoRobot processing framework handles quality assessment, automated annotation generation, and multi-embodiment management.
  5. RoboCOIN proves reliable and effective for multi-embodiment bimanual learning.
  6. The dataset and framework are open-sourced for research use.

Cool Papers

Click here to view paper screenshots

SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

Authors:Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, Danda Pani Paudel

Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, SPEAR-1: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on ~45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as π0-FAST and π0.5, while it uses 20x fewer robot demonstrations. This carefully-engineered training strategy unlocks new VLM capabilities and as a consequence boosts the reliability of embodied control beyond what is achievable with only robotic data. We make our model weights and 3D-annotated datasets publicly available.


Paper & Project Links

PDF

Summary

Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control, but their ability to generalize to new environments, tasks, and embodiments remains limited. A major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs), which are trained on 2D image-language tasks and lack the 3D spatial reasoning required for embodied control in the 3D world. Rather than closing this gap with costly, hard-to-scale robot data, the authors enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding. Following this strategy they train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image, and build on it SPEAR-1, a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on about 45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as π0-FAST and π0.5 while using 20x fewer robot demonstrations. This training strategy unlocks new VLM capabilities and improves the reliability of embodied control beyond what robotic data alone can achieve. The model weights and 3D-annotated datasets are publicly released.

Key Takeaways

  1. Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end robot control systems, but their generalization to new environments, tasks, and embodiments is limited.
  2. The main bottleneck is that most RFMs are based on internet-pretrained Vision-Language Models (VLMs), which lack the 3D spatial reasoning needed for embodied control.
  3. The proposed approach enriches easy-to-collect non-robotic image data with 3D annotations to give a pretrained VLM 3D understanding.
  4. SPEAR-VLM infers object coordinates in 3D space from a single 2D image.
  5. SPEAR-1 is a robotic foundation model that combines grounded 3D perception with language-instructed embodied control.
  6. SPEAR-1 outperforms or matches state-of-the-art models while using far fewer robot demonstrations.

Cool Papers

Click here to view paper screenshots

That’s not natural: The Impact of Off-Policy Training Data on Probe Performance

Authors:Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov

Probing has emerged as a promising method for monitoring Large Language Models (LLMs), enabling inference-time detection of concerning behaviours such as deception and sycophancy. However, natural examples of many behaviours are rare, forcing researchers to rely on synthetic or off-policy LLM responses for training probes. We systematically evaluate how the use of synthetic and off-policy data influences probe generalisation across eight distinct LLM behaviours. Testing linear and attention probes across multiple LLMs, we find that the response generation strategy can significantly affect probe performance, though the magnitude of this effect varies by behaviour. We find that successful generalisation from off-policy data, to test sets where the model is incentivised to produce the target behaviour, is predictive of successful on-policy generalisation. Leveraging this result, we predict that Deception and Sandbagging probes may fail to generalise from off-policy to on-policy data when used in real monitoring scenarios. Notably, shifts in the training data domain still cause even larger performance degradation, with different-domain test scores being consistently lower than the same-domain ones. These results indicate that, in the absence of on-policy data, using same-domain off-policy data yields more reliable probes than using on-policy data from a different domain, emphasizing the need for methods that can better handle distribution shifts in LLM monitoring.
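
A minimal sketch of a linear probe of the kind studied here is shown below. In practice the features would be hidden activations extracted from an LLM on off-policy (e.g. synthetic) training responses and on-policy evaluation responses; the random arrays, dimensions, and labels below are placeholders for those activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
d_model = 256

# Hypothetical off-policy training set: activations with labels
# 1 = behaviour present (e.g. deception), 0 = absent.
X_train = rng.normal(size=(1000, d_model))
y_train = rng.integers(0, 2, size=1000)

# Hypothetical on-policy evaluation set from the monitored model.
X_test = rng.normal(size=(200, d_model))
y_test = rng.integers(0, 2, size=200)

probe = LogisticRegression(max_iter=1000)   # linear probe
probe.fit(X_train, y_train)

# A large train/test gap here would signal poor off-policy -> on-policy generalisation.
print("on-policy probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```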


Paper & Project Links

PDF 10 pages, EurIPS 2025 Workshop on Private AI Governance

Summary

Probing has emerged as a promising method for monitoring large language models (LLMs), enabling inference-time detection of concerning behaviours such as deception and sycophancy. Because natural examples of many behaviours are rare, researchers rely on synthetic or off-policy LLM responses to train probes. This paper systematically evaluates how synthetic and off-policy data affect probe generalisation across eight LLM behaviours, testing linear and attention probes on multiple LLMs. The response generation strategy can significantly affect probe performance, with the size of the effect varying by behaviour. Successful generalisation from off-policy data to test sets where the model is incentivised to produce the target behaviour predicts successful on-policy generalisation; on this basis, Deception and Sandbagging probes may fail to generalise from off-policy to on-policy data in real monitoring scenarios. Notably, shifts in the training-data domain cause even larger degradation, with different-domain test scores consistently lower than same-domain ones. In the absence of on-policy data, same-domain off-policy data therefore yields more reliable probes than on-policy data from a different domain, underscoring the need for methods that better handle distribution shift in LLM monitoring.

Key Takeaways

  1. Probing is an effective method for monitoring LLMs and can detect behaviours such as deception and sycophancy.
  2. Researchers often rely on synthetic or off-policy data to train probes because natural examples are rare.
  3. The response generation strategy significantly affects probe performance, and the effect varies across behaviours.
  4. Successful generalisation from off-policy data to test sets that incentivise the target behaviour is predictive of on-policy generalisation.
  5. In real monitoring scenarios, some probes (such as Deception and Sandbagging) may fail to generalise from off-policy to on-policy data.
  6. Shifts in the training-data domain cause large performance drops; same-domain off-policy data is more reliable than on-policy data from a different domain.

Cool Papers

Click here to view paper screenshots

IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation

Authors:Yifan Li, Lichi Li, Anh Dao, Xinyu Zhou, Yicheng Qiao, Zheda Mai, Daeun Lee, Zichen Chen, Zhen Tan, Mohit Bansal, Yu Kong

While Visual Large Language Models (VLLMs) show great promise as embodied agents, they continue to face substantial challenges in spatial reasoning. Existing embodied benchmarks largely focus on passive, static household environments and evaluate only isolated capabilities, failing to capture holistic performance in dynamic, real-world complexity. To fill this gap, we present IndustryNav, the first dynamic industrial navigation benchmark for active spatial reasoning. IndustryNav leverages 12 manually created, high-fidelity Unity warehouse scenarios featuring dynamic objects and human movement. Our evaluation employs a PointGoal navigation pipeline that effectively combines egocentric vision with global odometry to assess holistic local-global planning. Crucially, we introduce the “collision rate” and “warning rate” metrics to measure safety-oriented behaviors and distance estimation. A comprehensive study of nine state-of-the-art VLLMs (including models such as GPT-5-mini, Claude-4.5, and Gemini-2.5) reveals that closed-source models maintain a consistent advantage; however, all agents exhibit notable deficiencies in robust path planning, collision avoidance and active exploration. This highlights a critical need for embodied research to move beyond passive perception and toward tasks that demand stable planning, active exploration, and safe behavior in dynamic, real-world environment.
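
The collision-rate and warning-rate metrics introduced above admit a simple sketch. The thresholds, units, and the per-step minimum-distance input are illustrative assumptions; the benchmark's exact definitions may differ.

```python
def safety_metrics(min_distances, warn_threshold=0.5, collide_threshold=0.0):
    """min_distances: per-step distance (in metres) to the nearest dynamic obstacle."""
    steps = len(min_distances)
    collisions = sum(d <= collide_threshold for d in min_distances)
    warnings = sum(collide_threshold < d <= warn_threshold for d in min_distances)
    return {
        "collision_rate": collisions / steps,  # fraction of steps in contact
        "warning_rate": warnings / steps,      # fraction of steps dangerously close
    }

# Toy trajectory from one navigation episode.
print(safety_metrics([1.2, 0.8, 0.4, 0.0, 0.9, 0.3]))
```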


Paper & Project Links

PDF

Summary

This paper studies the spatial reasoning ability of Visual Large Language Models (VLLMs) in dynamic environments. Existing embodied benchmarks focus on passive, static household environments and evaluate isolated capabilities, so they cannot fully reflect performance in real-world complexity. The paper introduces IndustryNav, a dynamic industrial navigation benchmark that requires active spatial reasoning. The evaluation combines egocentric vision with global odometry to assess holistic local-global planning, and introduces collision-rate and warning-rate metrics to measure safety-oriented behaviour and distance estimation. Results show that closed-source models maintain a consistent advantage, but all models exhibit clear deficiencies in robust path planning, collision avoidance, and active exploration, indicating the need for embodied research that moves toward stable planning, active exploration, and safe behaviour in dynamic real-world environments.

Key Takeaways

  1. VLLMs show great promise as embodied agents but still face challenges in spatial reasoning.
  2. Existing benchmarks focus on static household environments and cannot fully evaluate models in complex real-world scenarios.
  3. IndustryNav is the first dynamic industrial navigation benchmark for active spatial reasoning.
  4. IndustryNav uses manually created Unity warehouse scenarios featuring dynamic objects and human movement.
  5. The evaluation combines egocentric vision with global odometry and introduces collision-rate and warning-rate metrics for safety-oriented behaviour.
  6. Closed-source models perform better in some respects, but all models fall short in path planning, collision avoidance, and active exploration.

Cool Papers

Click here to view paper screenshots

Exploring Scientific Debt: Harnessing AI for SATD Identification in Scientific Software

Authors:Eric L. Melin, Ahmed Musa Awon, Nasir U. Eisty, Neil A. Ernst, Shurui Zhou

Developers often leave behind clues in their code, admitting where it falls short, known as Self-Admitted Technical Debt (SATD). In the world of Scientific Software (SSW), where innovation moves fast and collaboration is key, such debt is not just common but deeply impactful. As research relies on accurate and reproducible results, accumulating SATD can threaten the very foundations of scientific discovery. Yet, despite its significance, the relationship between SATD and SSW remains largely unexplored, leaving a crucial gap in understanding how to manage SATD in this critical domain. This study explores SATD in SSW repositories, comparing SATD in scientific versus general-purpose open-source software and evaluating transformer-based models for SATD identification. We analyzed SATD in 27 scientific and general-purpose repositories across multiple domains and languages. We fine-tuned and compared 10 transformer-based models (100M-7B parameters) on 67,066 labeled code comments. SSW contains 9.25x more Scientific Debt and 4.93x more SATD than general-purpose software due to complex computations, domain constraints, and evolving research needs. Furthermore, our best model outperforms existing ones. This study uncovers how SATD in SSW differs from general software, revealing its impact on quality and scientific validity. By recognizing these challenges, developers and researchers can adopt smarter strategies to manage debt and safeguard the integrity of scientific discovery.
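
A minimal sketch of fine-tuning a transformer to flag SATD in code comments follows. The model name, the two toy comments, and the tiny training setup are placeholders chosen for brevity, not the paper's configuration; a real run would use the labeled comment corpus and proper splits.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilroberta-base"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy labeled comments: 1 = self-admitted technical debt, 0 = ordinary comment.
data = Dataset.from_dict({
    "text": ["# TODO: this numerical hack breaks for large grids", "# compute mean"],
    "label": [1, 0],
})
data = data.map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length",
                                    max_length=64), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="satd-probe", num_train_epochs=1,
                           per_device_train_batch_size=2, logging_steps=1),
    train_dataset=data,
)
trainer.train()
print(trainer.predict(data).predictions.argmax(-1))  # predicted SATD labels
```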


Paper & Project Links

PDF 11 pages, 2 figures, 6 tables

Summary

This paper investigates Self-Admitted Technical Debt (SATD) in scientific software (SSW). Analysing repositories across multiple domains and languages, the study finds that, because of complex computations, domain constraints, and evolving research needs, SSW contains considerably more Scientific Debt and SATD than general-purpose software, with a substantial impact on software quality and scientific validity. The authors also fine-tune and compare transformer-based models for SATD identification and discuss how developers and researchers can adopt smarter strategies to manage this debt and safeguard the integrity of scientific discovery.

Key Takeaways

  1. Self-Admitted Technical Debt (SATD) is common in scientific software (SSW) and has far-reaching impact.
  2. Accumulated SATD can threaten the foundations of scientific discovery, since research depends on accurate and reproducible results.
  3. Compared with general-purpose open-source software, SSW contains much more Scientific Debt and SATD.
  4. The extra debt in SSW stems from complex computations, domain constraints, and evolving research needs.
  5. Through comparative analysis and model evaluation, the study characterises SATD in SSW and its impact.
  6. The best model outperforms existing models at SATD identification.

Cool Papers

Click here to view paper screenshots

METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model

Authors:Yankai Fu, Ning Chen, Junkai Zhao, Shaozhe Shan, Guocai Yao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

Building a generalist robot that can perceive, reason, and act across diverse tasks remains an open challenge, especially for dexterous manipulation. A major bottleneck lies in the scarcity of large-scale, action-annotated data for dexterous skills, as teleoperation is difficult and costly. Human data, with its vast scale and diverse manipulation behaviors, provides rich priors for learning robotic actions. While prior works have explored leveraging human demonstrations, they are often constrained by limited scenarios and a large visual gap between human and robots. To eliminate these limitations, we propose METIS, a vision-language-action (VLA) model for dexterous manipulation pretrained on multi-source egocentric datasets. We first construct EgoAtlas, which integrates large-scale human and robotic data from multiple sources, all unified under a consistent action space. We further extract motion-aware dynamics, a compact and discretized motion representation, which provides efficient and expressive supervision for VLA training. Built upon them, METIS integrates reasoning and acting into a unified framework, enabling effective deployment to downstream dexterous manipulation tasks. Our method demonstrates exceptional dexterous manipulation capabilities, achieving highest average success rate in six real-world tasks. Experimental results also highlight the superior generalization and robustness to out-of-distribution scenarios. These findings emphasize METIS as a promising step toward a generalist model for dexterous manipulation.


Paper & Project Links

PDF

Summary

This paper proposes METIS, a vision-language-action (VLA) model for dexterous manipulation pretrained on multi-source egocentric datasets. The authors build EgoAtlas, which integrates large-scale human and robot data from multiple sources under a unified action space, and extract motion-aware dynamics, a compact and discretized motion representation that provides efficient and expressive supervision for VLA training. METIS integrates reasoning and acting in a unified framework and can be deployed effectively on downstream dexterous manipulation tasks. It achieves the highest average success rate on six real-world tasks and shows strong generalization and robustness in out-of-distribution scenarios.

Key Takeaways

  1. METIS is a vision-language-action (VLA) model for dexterous manipulation that can perceive, reason, and act.
  2. METIS builds EgoAtlas to integrate large-scale human and robot data under a unified action space.
  3. Motion-aware dynamics, a compact and discretized motion representation, provides effective supervision for VLA training.
  4. METIS integrates reasoning and acting in a single framework that deploys effectively to downstream dexterous manipulation tasks.
  5. METIS achieves the highest average success rate on six real-world tasks.
  6. METIS generalizes well and is robust to out-of-distribution scenarios.

Cool Papers

Click here to view paper screenshots

UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification

Authors:Taixi Chen, Jingyun Chen, Nancy Guo

Cell-level radiomics features provide fine-grained insights into tumor phenotypes and have the potential to significantly enhance diagnostic accuracy on hematoxylin and eosin (H&E) images. By capturing micro-level morphological and intensity patterns, these features support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. However, most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. Moreover, there is currently no dedicated backbone specifically designed for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification using radiomics features. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encode capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% ($n$=349,882 cells), and tumor segmentation precision from 75% to 80% ($n$=406 patches). These findings highlight the effectiveness and promise of UAM as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis.


Paper & Project Links

PDF

Summary

Cell-level radiomics features provide fine-grained insight into tumor phenotypes and can significantly improve diagnostic accuracy on hematoxylin and eosin (H&E) images. By capturing micro-level morphological and intensity patterns, they support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. Most existing studies, however, focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored, and no backbone has been designed specifically for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language, the authors introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, the unified design flexibly combines their capabilities within a single architecture, removing the need for manual ratio tuning and improving encoding capability. Building on this backbone, a multimodal UAM framework jointly performs cell-level classification and image segmentation. UAM achieves state-of-the-art performance on public benchmarks, surpassing leading image-based foundation models: cell classification accuracy improves from 74% to 78% (n = 349,882 cells) and tumor segmentation precision from 75% to 80% (n = 406 patches).

Key Takeaways

  1. Cell-level radiomics features give fine-grained insight into tumor phenotypes and can improve diagnostic accuracy.
  2. Existing studies mostly address slide-level or patch-level tumor classification and neglect cell-level radiomics analysis.
  3. The Unified Attention-Mamba (UAM) backbone flexibly combines Attention and Mamba modules, improving encoding capability.
  4. UAM achieves state-of-the-art performance on cell classification and image segmentation, surpassing leading image-based foundation models.
  5. UAM raises cell classification accuracy to 78% (n = 349,882 cells) and tumor segmentation precision to 80% (n = 406 patches).
  6. UAM offers a unified, extensible multimodal foundation for radiomics-driven cancer diagnosis.

Cool Papers

Click here to view paper screenshots

Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM

Authors:Chiori Hori, Yoshiki Masuyama, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

Human-robot collaboration towards a shared goal requires robots to understand human action and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on the robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although actions towards a long-horizon task depend on each other throughout an entire video, the current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-former incorporating left and right context dependency in full videos. Furthermore, this paper proposes a text-conditioning approach to feed text embeddings directly into the LLM decoder to mitigate the high abstraction of the information in text by Q-former. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-former improves the confirmation and action planning by integrating VideoLLaMA3.


Paper & Project Links

PDF Accepted to ASRU 2025

Summary

This paper addresses human-robot collaboration based on human-robot dialogue and proposes a long-context Q-Former that incorporates left and right context across the full video for robot action-confirmation generation and action-step planning in tasks composed of multiple micro steps. A text-conditioning approach feeds text embeddings directly into the LLM decoder to mitigate the high abstraction of textual information produced by the Q-Former. Experiments on the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in action-planning performance, and that the long-context Q-Former, integrated with VideoLLaMA3, improves both confirmation and action planning.

Key Takeaways

  1. Human-robot collaboration requires robots to understand human actions and interaction with the surrounding environment.
  2. The paper focuses on human-robot interaction (HRI) based on human-robot dialogue.
  3. Existing approaches mainly process clip-level input and do not exploit long-context information.
  4. The proposed long-context Q-Former incorporates left and right context dependencies across full videos.
  5. A text-conditioning approach feeds text embeddings directly into the LLM decoder to mitigate the high abstraction of the Q-Former's textual information.
  6. Experiments show that confirmation generation accuracy is a key factor in action-planning performance.

Cool Papers

Click here to view paper screenshots

SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion

Authors:Jiajie Guo, Qingpeng Zhu, Jin Zeng, Xiaolong Wu, Changyong He, Weida Wang

Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks due to the strong reasoning capability of large language models (LLMs). Nevertheless, most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. In this work, we propose a novel vision encoder based on hierarchical fusion of geometry and semantics features, generating spatial-aware visual embedding and boosting the spatial grounding capability of MLLMs. Specifically, we first unveil that the spatial ambiguity shortcoming stems from the lossy embedding of the vision encoder utilized in most existing MLLMs (e.g., CLIP), restricted to instance-level semantic features. This motivates us to complement CLIP with the geometry features from vision-only self-supervised learning via a hierarchical adapter, enhancing the spatial awareness in the proposed SpatialGeo. The network is efficiently trained using pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions relying solely on the CLIP encoder. Experimental results show that SpatialGeo improves the accuracy in spatial reasoning tasks, enhancing state-of-the-art models by at least 8.0% in SpatialRGPT-Bench with approximately 50% less memory cost during inference. The source code is available via https://ricky-plus.github.io/SpatialGeoPages/.
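
The sketch below shows one simple way to fuse instance-level semantic tokens (e.g. from CLIP) with geometry tokens from a vision-only self-supervised encoder through a small adapter, including an optional geometry-branch drop in the spirit of the random feature dropping mentioned above. The dimensions, module layout, and class name are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GeoSemanticAdapter(nn.Module):
    def __init__(self, sem_dim=1024, geo_dim=768, out_dim=4096):
        super().__init__()
        self.proj_sem = nn.Linear(sem_dim, out_dim)
        self.proj_geo = nn.Linear(geo_dim, out_dim)
        self.fuse = nn.Sequential(nn.GELU(), nn.Linear(out_dim, out_dim))

    def forward(self, sem_tokens, geo_tokens, drop_geo=False):
        # Optionally zeroing the geometry branch during training discourages
        # trivial solutions that rely on the semantic (CLIP-like) encoder alone.
        geo = torch.zeros_like(geo_tokens) if drop_geo else geo_tokens
        return self.fuse(self.proj_sem(sem_tokens) + self.proj_geo(geo))

adapter = GeoSemanticAdapter()
sem = torch.randn(1, 256, 1024)   # CLIP-like patch embeddings
geo = torch.randn(1, 256, 768)    # geometry features from a self-supervised encoder
visual_embedding = adapter(sem, geo)   # spatial-aware tokens fed to the LLM
print(visual_embedding.shape)          # torch.Size([1, 256, 4096])
```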


Paper & Project Links

PDF

Summary

Multimodal large language models (MLLMs) have made strong progress on image and language tasks, but their ability to reason about spatial arrangements in three-dimensional space is limited. This work proposes a novel vision encoder based on hierarchical fusion of geometry and semantics features that produces spatial-aware visual embeddings and boosts the spatial grounding capability of MLLMs. By complementing CLIP with geometry features from vision-only self-supervised learning, the proposed SpatialGeo gains enhanced spatial awareness. Experiments show that SpatialGeo improves accuracy on spatial reasoning tasks, surpassing state-of-the-art models by at least 8.0% on SpatialRGPT-Bench while using roughly 50% less memory during inference.

Key Takeaways

  1. Multimodal large language models (MLLMs) have strong reasoning ability on image and language tasks.
  2. Most MLLMs have limited ability to interpret and infer spatial arrangements in 3D space.
  3. The proposed vision encoder hierarchically fuses geometry and semantics features to improve spatial reasoning.
  4. Complementing CLIP with geometry features enhances the model's spatial awareness.
  5. SpatialGeo is trained efficiently from a pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions that rely solely on the CLIP encoder.
  6. SpatialGeo improves spatial reasoning accuracy by at least 8.0% on SpatialRGPT-Bench compared with state-of-the-art models.

Cool Papers

Click here to view paper screenshots

Large Language Models for Sentiment Analysis to Detect Social Challenges: A Use Case with South African Languages

Authors:Koena Ronny Mabokela, Tim Schlippe, Matthias Wölfel

Sentiment analysis can aid in understanding people’s opinions and emotions on social issues. In multilingual communities sentiment analysis systems can be used to quickly identify social challenges in social media posts, enabling government departments to detect and address these issues more precisely and effectively. Recently, large-language models (LLMs) have become available to the wide public and initial analyses have shown that they exhibit magnificent zero-shot sentiment analysis abilities in English. However, there is no work that has investigated to leverage LLMs for sentiment analysis on social media posts in South African languages and detect social challenges. Consequently, in this work, we analyse the zero-shot performance of the state-of-the-art LLMs GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 to investigate the sentiment polarities of the 10 most emerging topics in English, Sepedi and Setswana social media posts that fall within the jurisdictional areas of 10 South African government departments. Our results demonstrate that there are big differences between the various LLMs, topics, and languages. In addition, we show that a fusion of the outcomes of different LLMs provides large gains in sentiment classification performance with sentiment classification errors below 1%. Consequently, it is now feasible to provide systems that generate reliable information about sentiment analysis to detect social challenges and draw conclusions about possible needs for actions on specific topics and within different language groups.
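
Since the abstract reports that fusing the outcomes of different LLMs sharply reduces classification errors, the snippet below sketches the simplest possible fusion, a majority vote over per-model labels. The model names and labels are placeholders, and the paper's actual fusion scheme may differ from this plain vote.

```python
from collections import Counter

def fuse_sentiments(predictions):
    """predictions: {model_name: label}, labels in {'positive', 'neutral', 'negative'}."""
    counts = Counter(predictions.values())
    label, _ = counts.most_common(1)[0]   # majority label wins
    return label

per_model = {
    "gpt-4": "negative",
    "gpt-3.5": "negative",
    "llama-2": "neutral",
    "palm-2": "negative",
    "dolly-2": "positive",
}
print(fuse_sentiments(per_model))   # -> 'negative'
```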


Paper & Project Links

PDF Published in the Proceedings of The Southern African Conference on AI Research (SACAIR 2024), Bloemfontein, South Africa, 2-6 December 2024. ISBN: 978-0-7961-6069-0

Summary

Sentiment analysis helps in understanding people's opinions and emotions on social issues. In multilingual communities, sentiment analysis systems can quickly identify social challenges in social media posts, enabling government departments to detect and address these issues more precisely and effectively. Large language models (LLMs) have recently become widely available and show excellent zero-shot sentiment analysis abilities in English, but no prior work has leveraged LLMs for sentiment analysis of social media posts in South African languages to detect social challenges. This work analyses the zero-shot performance of GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 on the sentiment polarities of the ten most emerging topics in English, Sepedi, and Setswana posts that fall within the jurisdiction of ten South African government departments. The results show large differences between LLMs, topics, and languages, and fusing the outcomes of different LLMs provides large gains in sentiment classification performance, with errors below 1%. It is therefore feasible to build systems that generate reliable sentiment information for detecting social challenges and drawing conclusions about possible actions for specific topics and language groups.

Key Takeaways

  1. Sentiment analysis helps in understanding people's opinions and emotions on social issues.
  2. In multilingual communities, sentiment analysis systems can quickly identify social challenges on social media.
  3. Large language models (LLMs) show excellent sentiment analysis abilities.
  4. No prior work had applied LLMs to sentiment analysis of social media posts in South African languages.
  5. The performance of different LLMs varies across languages and topics.
  6. Fusing the results of different LLMs substantially improves sentiment classification performance.

Cool Papers

Click here to view paper screenshots

Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers

Authors:Cris Claessens, Christiaan Viviers, Giacomo D’Amicantonio, Egor Bondarev, Fons van der Sommen

We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.


Paper & Project Links

PDF

Summary

SPECTRE is a fully transformer-based foundation model for volumetric computed tomography (CT). It uses self-supervised and cross-modal pretraining with scalable 3D Vision Transformer architectures to learn general-purpose CT representations, addressing the unique challenges of volumetric CT such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. SPECTRE achieves strong performance across multiple CT benchmarks, demonstrating its potential as a scalable, open, fully transformer-based foundation model for 3D medical imaging.

Key Takeaways

  1. SPECTRE is a fully transformer-based CT foundation model for volumetric medical imaging.
  2. It combines self-supervised and cross-modal pretraining with scalable 3D Vision Transformer architectures to learn general-purpose CT representations.
  3. It addresses the unique challenges of volumetric CT, including extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision.
  4. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling.
  5. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful.
  6. SPECTRE outperforms prior CT foundation models across multiple CT benchmarks.

Cool Papers

Click here to view paper screenshots

OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding

Authors:Teng Fu, Mengyang Zhao, Ke Niu, Kaixin Peng, Bin Li

LVLMs have been shown to perform excellently in image-level tasks such as VQA and caption. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still show performance gaps compared to previous expert models. Meanwhile, although pedestrian tracking is a classical task, there have been a number of new topics in combining object tracking and natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks emphasize that models should understand the tracked object at an advanced semantic level, which is exactly where LVLMs excel. In this paper, we propose a new unified Pedestrian Tracking framework, namely OmniPT, which can track, track based on reference and generate semantic understanding of tracked objects interactively. We address two issues: how to model the tracking task into a task that foundation models can perform, and how to make the model output formatted answers. To this end, we implement a training phase consisting of RL-Mid Training-SFT-RL. Based on the pre-trained weights of the LVLM, we first perform a simple RL phase to enable the model to output fixed and supervisable bounding box format. Subsequently, we conduct a mid-training phase using a large number of pedestrian-related datasets. Finally, we perform supervised fine-tuning on several pedestrian tracking datasets, and then carry out another RL phase to improve the model’s tracking performance and enhance its ability to follow instructions. We conduct experiments on tracking benchmarks and the experimental results demonstrate that the proposed method can perform better than the previous methods.
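
As a minimal sketch of the first RL stage described above, which trains the model to emit a fixed, supervisable bounding-box format, the snippet below rewards outputs that contain a parseable, in-bounds box. The expected "[x1, y1, x2, y2]" format, the image size, and the binary reward are illustrative assumptions, not the paper's specification.

```python
import re

BBOX_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")

def format_reward(model_output, img_w=1920, img_h=1080):
    """Return 1.0 for a well-formed, in-bounds box, otherwise 0.0."""
    match = BBOX_RE.search(model_output)
    if match is None:
        return 0.0
    x1, y1, x2, y2 = map(int, match.groups())
    in_bounds = 0 <= x1 < x2 <= img_w and 0 <= y1 < y2 <= img_h
    return 1.0 if in_bounds else 0.0

print(format_reward("The pedestrian is at [320, 180, 410, 560]."))  # 1.0
print(format_reward("I see a person near the door."))               # 0.0
```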


Paper & Project Links

PDF AAAI 2026

Summary

This paper examines how LVLMs perform on pedestrian tracking. Although LVLMs excel at image-level tasks, they still show performance gaps on instance-level tasks such as visual grounding and object detection. Building on recent trends that combine object tracking with natural language (Referring MOT, Cross-view Referring MOT, and Semantic MOT), the paper proposes OmniPT, a unified pedestrian tracking framework that can track, track based on references, and interactively generate semantic understanding of tracked objects. To model tracking as a task a foundation model can perform and to make the model output formatted answers, the authors adopt a training pipeline of RL, mid-training, supervised fine-tuning, and a final RL phase, and validate it on multiple pedestrian tracking datasets, where the proposed method outperforms previous approaches.

Key Takeaways

  1. LVLMs perform excellently on image-level tasks but show performance gaps on instance-level tasks such as visual grounding and object detection.
  2. Pedestrian tracking is a classical task now combined with natural language, requiring advanced semantic understanding of the tracked object.
  3. OmniPT supports tracking, reference-based tracking, and interactive semantic understanding of tracked objects.
  4. The paper addresses how to model tracking as a task foundation models can perform and how to make the model output formatted answers.
  5. Training consists of an initial RL phase, a mid-training phase, supervised fine-tuning, and a final RL phase.
  6. Experiments show that OmniPT outperforms previous methods on pedestrian tracking benchmarks.

Cool Papers

Click here to view paper screenshots

Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?

Authors:Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang

Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that LIVE-SWE-AGENT can achieve an impressive solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.


Paper & Project Links

PDF

Summary

Large language models (LLMs) are reshaping almost every industry, including software engineering. Many LLM agents have been proposed to solve real-world software problems; they are typically equipped with a suite of coding tools and autonomously decide their next actions to complete end-to-end software tasks. However, they usually require dedicated design and may still be suboptimal, since exhausting the agent scaffold design space is extremely challenging and costly. Recognizing that software agents are themselves software that can be refined, researchers have proposed self-improving agents such as the Darwin-Gödel Machine (DGM), but these require costly offline training on specific benchmarks and may not generalize across LLMs or benchmarks. This paper proposes Live-SWE-agent, the first live software agent that autonomously and continuously evolves itself on the fly while solving real-world software problems. Starting from the most basic scaffold with access only to bash tools (e.g., mini-SWE-agent), it evolves its own scaffold implementation during problem solving. On the widely studied SWE-bench Verified benchmark, Live-SWE-agent achieves an impressive solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution, and it reaches the best-known solve rate of 45.8% on the recent SWE-bench Pro benchmark, surpassing state-of-the-art manually crafted agents.

Key Takeaways

  1. Large language models (LLMs) are reshaping software engineering by solving real-world software problems.
  2. Software agents are typically equipped with coding tools and decide their actions autonomously, but designing a suitable scaffold is challenging and costly.
  3. Self-improving agents such as the Darwin-Gödel Machine (DGM) still need expensive offline training on specific benchmarks and may not generalize across LLMs or benchmarks.
  4. Live-SWE-agent autonomously and continuously evolves its scaffold while solving real-world software problems.
  5. Live-SWE-agent starts from the most basic scaffold with access only to bash tools.
  6. Live-SWE-agent achieves a 77.4% solve rate on SWE-bench Verified, demonstrating outstanding performance.

Cool Papers

Click here to view paper screenshots

Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

Authors:Siyou Li, Huanan Wu, Juexi Shao, Yinghao Ma, Yujian Gan, Yihao Luo, Yuwei Wang, Dong Nie, Lu Wang, Wengqing Wu, Le Zhang, Massimo Poesio, Juntao Yu

Despite the recent advances in the video understanding ability of multimodal large language models (MLLMs), long video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To solve this challenge, we present Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and LLMs. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence for the input text query by (i) scoring visual tokens via cross-attention, (ii) predicting an instance-specific retention budget based on the complexity of the query, and (iii) selecting Top-n tokens with a differentiable straight-through estimator during training and a hard gate at inference. Furthermore, a small re-encoder preserves temporal order using absolute time information, enabling second-level localization while maintaining global coverage. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos. The evaluation on eight long video understanding benchmarks shows near-parity accuracy overall when compared with the original Qwen models and outperforms the original model by +20.5 and +5.6 points respectively on TempCompass direction and order accuracies. These results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence.
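
The sketch below illustrates the selection mechanism in the spirit of steps (i)-(iii): cross-attention scoring of vision tokens against the pooled text query, a small head that predicts a retention budget, and a hard Top-n gate as used at inference. The module layout, dimensions, and `max_keep_ratio` are illustrative assumptions, not the released QTSplus implementation (which additionally uses a straight-through estimator during training and a temporal re-encoder).

```python
import torch
import torch.nn as nn

class QueryAwareSelector(nn.Module):
    def __init__(self, dim=512, max_keep_ratio=0.5):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.budget_head = nn.Linear(dim, 1)   # predicts a per-query retention ratio
        self.max_keep_ratio = max_keep_ratio

    def forward(self, vision_tokens, text_query):
        # vision_tokens: (B, N, D); text_query: (B, T, D)
        q = self.q_proj(text_query).mean(dim=1, keepdim=True)              # pooled query (B, 1, D)
        k = self.k_proj(vision_tokens)                                      # (B, N, D)
        scores = (q @ k.transpose(1, 2)).squeeze(1) / k.shape[-1] ** 0.5    # cross-attention scores (B, N)

        ratio = torch.sigmoid(self.budget_head(q.squeeze(1)))               # (B, 1), query-dependent budget
        n_keep = int((ratio.mean() * self.max_keep_ratio * k.shape[1]).clamp(min=1))

        top_idx = scores.topk(n_keep, dim=1).indices                        # hard Top-n gate at inference
        idx = top_idx.unsqueeze(-1).expand(-1, -1, vision_tokens.shape[-1])
        return torch.gather(vision_tokens, 1, idx), scores

selector = QueryAwareSelector()
kept, _ = selector(torch.randn(2, 1024, 512), torch.randn(2, 16, 512))
print(kept.shape)   # roughly (2, ~256, 512), depending on the predicted budget
```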


Paper & Project Links

PDF

Summary

Long video understanding remains a challenge for multimodal large language models (MLLMs) because the number of vision tokens grows linearly with video length, causing attention cost, memory, and latency to explode. This paper proposes the Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that acts as an information gate between the vision encoder and the LLM. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence by scoring visual tokens via cross-attention, predicting an instance-specific retention budget based on query complexity, and selecting the Top-n tokens with a differentiable straight-through estimator during training and a hard gate at inference. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos. Across eight long-video understanding benchmarks it reaches near-parity accuracy with the original Qwen models overall while improving TempCompass direction and order accuracies by +20.5 and +5.6 points, showing that QTSplus preserves task-relevant evidence while scaling MLLMs to real-world long videos.

Key Takeaways

  1. Long video understanding remains challenging for multimodal large language models (MLLMs).
  2. The core problem is that vision tokens grow linearly with video length, greatly increasing attention cost, memory, and latency.
  3. The Query-aware Token Selector (QTSplus) is a lightweight yet powerful mechanism that dynamically selects the most important visual evidence.
  4. QTSplus scores visual tokens, predicts a query-dependent retention budget, and selects the Top-n tokens.
  5. Integrated into Qwen2.5-VL, QTSplus markedly compresses the vision stream and reduces end-to-end latency.
  6. On multiple long-video understanding benchmarks, QTSplus reaches comparable or better accuracy, demonstrating its effectiveness and generality.

Cool Papers

Click here to view paper screenshots

Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition

Authors:Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, Jinwoo Choi

Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods based on saliency produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature – intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets – KTH, Penn Action, HAA500, and UCF-101 – demonstrate that DANCE significantly improves explanation clarity with competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.
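
The ante-hoc concept bottleneck idea admits a compact sketch: the video feature is first mapped to disentangled concept scores (motion dynamics, objects, scenes), and the action is predicted only from those scores, which is what makes the prediction explainable in terms of the concepts. Dimensions, the concept inventory sizes, and the class name below are illustrative assumptions, not the DANCE architecture.

```python
import torch
import torch.nn as nn

class ConceptBottleneckActionModel(nn.Module):
    def __init__(self, feat_dim=768, n_motion=50, n_object=80, n_scene=30, n_actions=101):
        super().__init__()
        # Separate heads keep motion, object, and scene concepts disentangled.
        self.motion_head = nn.Linear(feat_dim, n_motion)   # pose-sequence concepts
        self.object_head = nn.Linear(feat_dim, n_object)   # LLM-extracted object concepts
        self.scene_head = nn.Linear(feat_dim, n_scene)     # LLM-extracted scene concepts
        # The action classifier sees only concept scores (the bottleneck).
        self.action_head = nn.Linear(n_motion + n_object + n_scene, n_actions)

    def forward(self, video_feat):
        concepts = torch.cat([
            self.motion_head(video_feat),
            self.object_head(video_feat),
            self.scene_head(video_feat),
        ], dim=-1)
        return self.action_head(torch.sigmoid(concepts)), concepts

model = ConceptBottleneckActionModel()
logits, concepts = model(torch.randn(4, 768))   # pooled features for 4 videos
print(logits.shape, concepts.shape)             # torch.Size([4, 101]) torch.Size([4, 160])
```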


Paper & Project Links

PDF NeurIPS 2025 Spotlight paper. Project page: https://jong980812.github.io/DANCE/

Summary

Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. Existing saliency-based methods produce entangled explanations that make it unclear whether predictions rely on motion or spatial context, and language-based approaches often fail to explain motion because it is tacit. The paper proposes DANCE (Disentangled Action aNd Context concept-based Explainable video action recognition), which predicts actions through disentangled concept types: motion dynamics, objects, and scenes. Motion dynamics concepts are defined as human pose sequences, and a large language model automatically extracts object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on KTH, Penn Action, HAA500, and UCF-101 show that DANCE significantly improves explanation clarity while remaining competitive; a user study confirms its superior interpretability, and the results show DANCE is also useful for model debugging, editing, and failure analysis.

Key Takeaways

  1. DANCE addresses the entanglement of motion and spatial context in explanations of video action recognition models.
  2. Existing saliency-based and language-based methods struggle to explain video action recognition.
  3. DANCE improves explanation clarity by disentangling action and context concepts: motion dynamics, objects, and scenes.
  4. Motion dynamics concepts are defined as human pose sequences, and a large language model extracts object and scene concepts.
  5. DANCE's ante-hoc concept bottleneck design enforces prediction through the disentangled concepts.
  6. Experiments on multiple datasets show improved explanation clarity with competitive performance.

Cool Papers

Click here to view paper screenshots

Fairness Evaluation of Large Language Models in Academic Library Reference Services

Authors:Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen, Yiqiong Zhang, Hengyi Fu, Zuoyu Tian

As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries’ commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We find no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrate nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.


Paper & Project Links

PDF

Summary

As libraries explore large language models (LLMs) for virtual reference services, a key question is whether LLMs can serve all users equitably regardless of demographics or social status. The evaluation prompts six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role, and finds no evidence of differentiation by race or ethnicity and only minor evidence of stereotypical bias against women in one model. LLMs accommodate institutional roles through linguistic choices related to formality, politeness, and domain-specific vocabulary, reflecting professional norms rather than discriminatory treatment. Current LLMs therefore show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.

Key Takeaways

  1. LLMs have the potential to provide equitable, scalable support to all users in library virtual reference services.
  2. LLMs may reproduce societal biases from their training data, which could put libraries' commitment to equitable service at risk.
  3. The evaluation finds no evidence that LLMs differentiate responses by race or ethnicity.
  4. Only one model shows minor stereotypical bias against women.
  5. LLMs adapt to institutional roles through linguistic choices that reflect professional norms rather than discriminatory treatment.
  6. LLMs show a promising degree of readiness to support contextually appropriate communication in academic library reference services.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!