
Few-Shot


⚠️ All of the summaries below are generated by large language models. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never use them in serious academic settings; they are intended only for initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-22

TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing

Authors:Eddie Pokming Sheung, Qihao Liu, Wufei Ma, Prakhar Kaushik, Jianwen Xie, Alan Yuille

With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.


Paper and Project Links

PDF 8 pages, 10 figures, Under review at a conference

Summary

This paper proposes TriDiff-4D, a novel 4D generation pipeline that produces high-quality, temporally coherent 4D avatars from text descriptions. Using diffusion-based triplane re-posing and 3D structure and motion priors learned from large-scale 3D and motion datasets, it enables skeleton-driven 4D generation with strong temporal consistency, motion accuracy, computational efficiency, and visual fidelity.

Key Takeaways

  1. TriDiff-4D is a novel 4D generation pipeline that produces high-quality, temporally coherent 4D avatars.
  2. It uses diffusion-based triplane re-posing to improve generation quality.
  3. TriDiff-4D generates a canonical 3D avatar and a corresponding motion sequence from a text prompt.
  4. Skeleton-driven 4D generation is achieved by learning priors from large-scale 3D and motion datasets.
  5. TriDiff-4D offers strong temporal consistency, motion accuracy, computational efficiency, and visual fidelity.
  6. Compared with existing methods, TriDiff-4D greatly reduces generation time and improves the quality of complex motions.


SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

Authors:Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin

Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing \textbf{SAM2} for \textbf{S}urgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}$&$\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}$&$\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.


Paper and Project Links

PDF 11 pages, 4 figures

Summary

This paper highlights the importance of surgical video segmentation for computer-assisted surgery and the challenges that interactive video object segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) face in surgical scenes. To address them, the authors construct the SA-SV benchmark and propose SAM2S, which enhances SAM2 for surgical iVOS through the DiveMem mechanism, temporal semantic learning, and ambiguity-resilient learning. Experiments show that fine-tuning on SA-SV yields substantial performance gains.

Key Takeaways

  1. Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues.
  2. iVOS models such as SAM2 face challenges in surgical scenes, mainly due to the domain gap and limited long-term tracking.
  3. SA-SV is the largest surgical iVOS benchmark with instance-level spatio-temporal annotations, supporting comprehensive development and evaluation of long-term tracking and zero-shot generalization.
  4. SAM2S enhances SAM2 with the DiveMem mechanism, temporal semantic learning, and ambiguity-resilient learning to tackle surgical iVOS challenges.
  5. Fine-tuning on SA-SV yields substantial gains, improving SAM2 by 12.99 average $\mathcal{J}$&$\mathcal{F}$.
  6. SAM2S further improves performance to 80.42 average $\mathcal{J}$&$\mathcal{F}$, surpassing both vanilla and fine-tuned SAM2.


D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies

Authors:Sen Chen, Tong Zhao, Yi Bin, Fei Ma, Wenqi Shao, Zheng Wang

Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. While most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.


Paper and Project Links

PDF Accepted to AAAI 2026

Summary

The D-GARA framework evaluates the robustness of GUI agents in real-world, anomaly-rich environments, filling the gap left by existing datasets and benchmarks that fail to reflect real-world complexity. D-GARA introduces common anomalies such as permission dialogs, battery warnings, and update prompts, and provides an annotated benchmark built over widely used Android applications. Experiments show substantial performance degradation of state-of-the-art GUI agents in anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting seamless integration of new tasks, anomaly types, and interaction scenarios.

Key Takeaways

  1. D-GARA evaluates GUI agent robustness in real-world anomaly-rich environments.
  2. Existing datasets and benchmarks fail to capture the complexity and unpredictability of real-world environments.
  3. D-GARA introduces real-world anomalies such as permission dialogs, battery warnings, and update prompts.
  4. An annotated benchmark over common Android applications is built on D-GARA to support broader community research.
  5. Experiments show that state-of-the-art GUI agents degrade substantially in anomaly-rich environments.
  6. Robustness-aware learning is needed to improve GUI agent performance in real-world settings.


LLM4EO: Large Language Model for Evolutionary Optimization in Flexible Job Shop Scheduling

Authors:Rongjie Liao, Junhao Qiu, Xin Chen, Xiaoping Li

Customized static operator design has enabled widespread application of Evolutionary Algorithms (EAs), but their search performance is transient during iterations and prone to degradation. Dynamic operators aim to address this but typically rely on predefined designs and localized parameter control during the search process, lacking adaptive optimization throughout evolution. To overcome these limitations, this work leverages Large Language Models (LLMs) to perceive evolutionary dynamics and enable operator-level meta-evolution. The proposed framework, LLMs for Evolutionary Optimization (LLM4EO), comprises three components: knowledge-transfer-based operator design, evolution perception and analysis, and adaptive operator evolution. Firstly, initialization of operators is performed by transferring the strengths of classical operators via LLMs. Then, search preferences and potential limitations of operators are analyzed by integrating fitness performance and evolutionary features, accompanied by corresponding suggestions for improvement. Upon stagnation of population evolution, gene selection priorities of operators are dynamically optimized via improvement prompting strategies. This approach achieves co-evolution of populations and operators in the search, introducing a novel paradigm for enhancing the efficiency and adaptability of EAs. Finally, a series of validations on multiple benchmark datasets of the flexible job shop scheduling problem demonstrate that LLM4EO accelerates population evolution and outperforms both mainstream evolutionary programming and traditional EAs.


Paper and Project Links

PDF

Summary

LLM4EO, an evolutionary optimization framework driven by Large Language Models (LLMs), combines knowledge-transfer-based operator design, evolution perception and analysis, and adaptive operator evolution to achieve operator-level meta-evolution. The framework initializes operators by transferring the strengths of classical operators, analyzes their search preferences and limitations to optimize gene selection priorities, and dynamically improves operators via prompting strategies when population evolution stagnates. These strategies improve the efficiency and adaptability of evolutionary algorithms, and validations on multiple flexible job shop scheduling benchmarks show that LLM4EO accelerates population evolution and outperforms mainstream evolutionary programming and traditional EAs.

Key Takeaways

  1. Large Language Models (LLMs) are used to perceive evolutionary dynamics and enable operator-level meta-evolution, overcoming limitations of existing evolutionary algorithms (EAs).
  2. LLM4EO consists of three components: knowledge-transfer-based operator design, evolution perception and analysis, and adaptive operator evolution.
  3. Operators are initialized by transferring the strengths of classical operators, improving search efficiency.
  4. The search preferences and potential limitations of operators are analyzed, and corresponding improvement suggestions are produced.
  5. When population evolution stagnates, operators are dynamically optimized via improvement prompting strategies.
  6. LLM4EO improves the efficiency and adaptability of evolutionary algorithms.


Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation

Authors:Jin Wang, Bingfeng Zhang, Jian Pang, Mengyu Liu, Honglong Chen, Weifeng Liu

Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples by a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential, the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture to utilize inherent target property language descriptions to build robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, due to text-vision modal shift, attribute text struggles to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) to achieve cross-modal interaction between attribute texts and visual feature. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves the new state-of-the art performance. The code will be released.


Paper and Project Links

PDF

Summary

This paper proposes the Language-Driven Attribute Generalization (LDAG) architecture for few-shot segmentation (FSS). LDAG uses language descriptions of target attributes to build a robust support strategy that provides unbiased guidance for both trained and untrained classes. A Multi-attribute Enhancement (MaE) module and multi-modal matching produce unbiased support representations and refined visual-text prior guidance, while a Multi-modal Attribute Alignment (MaA) module handles the text-vision modality shift through cross-modal interaction between attribute texts and visual features. Experiments show that the method clearly outperforms existing approaches and sets a new state of the art.

Key Takeaways

  1. Few-shot segmentation (FSS) aims to segment novel classes with only a few support samples.
  2. Existing methods mine references from support images as meta guidance, but intra-class variation makes this guidance inaccurate.
  3. The paper argues that references from support images may not be essential; the key is to provide unbiased meta guidance for both trained and untrained classes.
  4. The Language-Driven Attribute Generalization (LDAG) architecture uses language descriptions of target attributes to build a robust support strategy.
  5. The Multi-attribute Enhancement (MaE) module uses Large Language Models (LLMs) to produce multiple detailed attribute descriptions of the target class and builds refined visual-text prior guidance.
  6. The Multi-modal Attribute Alignment (MaA) module addresses the text-vision modality shift via cross-modal interaction between attribute texts and visual features.


Mind the Gap: Bridging Prior Shift in Realistic Few-Shot Crop-Type Classification

Authors:Joana Reuss, Ekaterina Gikalo, Marco Körner

Real-world agricultural distributions often suffer from severe class imbalance, typically following a long-tailed distribution. Labeled datasets for crop-type classification are inherently scarce and remain costly to obtain. When working with such limited data, training sets are frequently constructed to be artificially balanced – in particular in the case of few-shot learning – failing to reflect real-world conditions. This mismatch induces a shift between training and test label distributions, degrading real-world generalization. To address this, we propose Dirichlet Prior Augmentation (DirPA), a novel method that simulates an unknown label distribution skew of the target domain proactively during model training. Specifically, we model the real-world distribution as Dirichlet-distributed random variables, effectively performing a prior augmentation during few-shot learning. Our experiments show that DirPA successfully shifts the decision boundary and stabilizes the training process by acting as a dynamic feature regularizer.


Paper and Project Links

PDF 7 pages, 4 figures

Summary

To address severe class imbalance in real-world agricultural distributions, where labeled crop-type datasets are scarce and costly, this paper proposes Dirichlet Prior Augmentation (DirPA). The method proactively simulates an unknown label-distribution skew of the target domain during model training, mitigating the mismatch between artificially balanced few-shot training sets and real-world test label distributions and improving real-world generalization.

Key Takeaways

  1. Real-world agricultural distributions suffer from severe class imbalance and typically follow a long-tailed distribution.
  2. Labeled datasets for crop-type classification are scarce and costly to obtain.
  3. In few-shot learning, artificially balanced training sets fail to reflect real-world conditions, causing a shift between training and test label distributions.
  4. DirPA addresses this by simulating an unknown label-distribution skew of the target domain, improving real-world generalization.
  5. DirPA models the real-world distribution as Dirichlet-distributed random variables, effectively performing prior augmentation during few-shot training (see the sketch below).
  6. Experiments show that DirPA successfully shifts the decision boundary, stabilizes training, and acts as a dynamic feature regularizer.
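
A minimal sketch of one way the prior-augmentation idea above could look in code: re-weight a few-shot training loss by a class prior sampled from a Dirichlet distribution, simulating an unknown long-tailed test distribution. This is an illustrative interpretation, not the authors' implementation; the concentration value and the loss re-weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def dirpa_style_loss(logits, targets, alpha=0.5):
    """Cross-entropy re-weighted by a Dirichlet-sampled class prior.

    logits: (batch, num_classes), targets: (batch,). `alpha` is an assumed
    symmetric Dirichlet concentration; smaller values give more skewed priors.
    """
    num_classes = logits.size(1)
    concentration = torch.full((num_classes,), alpha)
    prior = torch.distributions.Dirichlet(concentration).sample()  # simulated test-time class prior
    weights = prior[targets] * num_classes                         # per-sample weight from the sampled prior
    loss = F.cross_entropy(logits, targets, reduction="none")
    return (weights * loss).mean()

# toy usage
logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
print(dirpa_style_loss(logits, targets))
```

Smaller concentration values produce more skewed sampled priors, i.e. a stronger simulated long-tail shift between training and test label distributions.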


Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Authors:Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang, Zipeng Xiao, Jiadi Su, You Qiaoben, Pengfei Liu, Zhijie Deng

Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.


Paper and Project Links

PDF

Summary

The paper proposes Mantis, a framework that addresses the capacity dilution and training cost incurred when Vision-Language-Action (VLA) models directly predict high-dimensional visual states. By decoupling visual foresight prediction from the backbone with meta queries and a diffusion Transformer (DiT) head, Mantis improves the learning of explicit actions while preserving comprehension and reasoning through language supervision. Empirically, Mantis reaches a 96.7% success rate on the LIBERO benchmark after fine-tuning, surpassing strong baselines with high convergence speed.

Key Takeaways

  1. Mantis decouples visual foresight prediction from the backbone, addressing the difficulty of predicting high-dimensional visual states in VLA models.
  2. Combining meta queries with a diffusion Transformer (DiT) head improves the learning of explicit actions.
  3. Language supervision allows Mantis to maintain comprehension and reasoning capabilities.
  4. Mantis achieves a 96.7% success rate on the LIBERO benchmark.
  5. Mantis is efficient and exhibits high convergence speed.
  6. In real-world evaluations, Mantis outperforms other models, particularly in instruction following, generalization to unseen instructions, and reasoning ability.


ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models

Authors:Qing Zhang, Bing Xu, Xudong Zhang, Yifan Shi, Yang Li, Chen Zhang, Yik Chung Wu, Ngai Wong, Yijie Chen, Hong Dai, Xiansen Chen, Mian Zhang

The remarkable performance of Large Language Models (LLMs) highly relies on crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which develops rapidly in recent years. Existing APO methods such as those based on evolutionary algorithms or trial-and-error approaches realize an efficient and accurate prompt optimization to some extent. However, those researches focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO conducts voting mechanism and introduces shared generation strategies along with different search methods for searching superior prompts. Moreover, ELPO creatively presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving F1 score by 7.6 on ArSarcasm dataset.


Paper and Project Links

PDF

Summary

The remarkable performance of Large Language Models depends heavily on carefully crafted prompts, yet manual prompt engineering is laborious and has become a core bottleneck in practice. Automatic Prompt Optimization (APO) has emerged to address this, but existing methods based on evolutionary algorithms or trial-and-error rely on a single model or algorithm for generation and optimization, which limits performance on complex tasks. The proposed Ensemble Learning based Prompt Optimization (ELPO) framework uses a voting mechanism, shared generation strategies, and multiple search methods to find superior prompts, together with more efficient algorithms for prompt generation and search. Experiments show that ELPO outperforms state-of-the-art prompt optimization methods across tasks, e.g., improving the F1 score by 7.6 on the ArSarcasm dataset.

Key Takeaways

  1. LLM performance depends heavily on prompt design.
  2. Automatic Prompt Optimization (APO) is a rapidly developing research area.
  3. Existing APO methods are limited when handling complex tasks.
  4. ELPO applies ensemble learning, combining a voting mechanism with multiple search methods to find better prompts (see the sketch below).
  5. ELPO introduces shared generation strategies to improve efficiency.
  6. ELPO outperforms other state-of-the-art methods in experiments.
  7. ELPO improves the F1 score by 7.6 on the ArSarcasm dataset.
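
A minimal, heavily simplified sketch of the ensemble-and-vote idea summarized above. The pool structure, the number of rounds, and `score_prompt` (a stand-in for evaluating a candidate prompt with the target LLM on a small dev set, e.g. by F1) are all assumptions for illustration; ELPO's actual components are more elaborate.

```python
import random
from collections import Counter

def score_prompt(prompt, dev_set):
    # hypothetical stand-in: run the target LLM with `prompt` on dev_set and return a task metric
    return random.random()

def ensemble_vote(candidate_pools, dev_set, rounds=3):
    """Each round, every generation strategy's pool nominates its best-scoring prompt;
    the prompt with the most votes across rounds and pools wins."""
    votes = Counter()
    for _ in range(rounds):
        for pool in candidate_pools:
            best = max(pool, key=lambda p: score_prompt(p, dev_set))
            votes[best] += 1
    return votes.most_common(1)[0][0]

pools = [["Classify the sarcasm:", "Is this tweet sarcastic?"],
         ["Decide whether the tweet is sarcastic.", "Label sarcasm yes/no."]]
print(ensemble_vote(pools, dev_set=[]))
```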


Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models

Authors:Yijun Yang, Lichao Wang, Jianping Zhang, Chi Harold Liu, Lanqing Hong, Qiang Xu

The growing misuse of Vision-Language Models (VLMs) has led providers to deploy multiple safeguards, including alignment tuning, system prompts, and content moderation. However, the real-world robustness of these defenses against adversarial attacks remains underexplored. We introduce Multi-Faceted Attack (MFA), a framework that systematically exposes general safety vulnerabilities in leading defense-equipped VLMs such as GPT-4o, Gemini-Pro, and Llama-4. The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives. We provide a theoretical perspective based on reward hacking to explain why this attack succeeds. To improve cross-model transferability, we further introduce a lightweight transfer-enhancement algorithm combined with a simple repetition strategy that jointly bypasses both input-level and output-level filters without model-specific fine-tuning. Empirically, we show that adversarial images optimized for one vision encoder transfer broadly to unseen VLMs, indicating that shared visual representations create a cross-model safety vulnerability. Overall, MFA achieves a 58.5% success rate and consistently outperforms existing methods. On state-of-the-art commercial models, MFA reaches a 52.8% success rate, surpassing the second-best attack by 34%. These results challenge the perceived robustness of current defense mechanisms and highlight persistent safety weaknesses in modern VLMs. Code: https://github.com/cure-lab/MultiFacetedAttack


Paper and Project Links

PDF AAAI 2026 Oral

Summary

This paper examines the robustness of safety defenses deployed in Vision-Language Models (VLMs). The authors propose Multi-Faceted Attack (MFA), a framework that systematically exposes safety vulnerabilities in defense-equipped VLMs. Its core component, the Attention-Transfer Attack (ATA), hides harmful instructions inside a meta task with competing objectives. The study shows that adversarial images optimized for one vision encoder transfer broadly to unseen VLMs, revealing a cross-model safety vulnerability. Overall, MFA achieves a 58.5% success rate and also performs strongly on commercial models.

Key Takeaways

  1. VLMs are increasingly misused; providers have deployed multiple safeguards, but their real-world robustness remains underexplored.
  2. The Multi-Faceted Attack (MFA) framework systematically exposes safety vulnerabilities in existing defense mechanisms.
  3. Its core component, the Attention-Transfer Attack (ATA), hides harmful instructions inside a meta task with competing objectives.
  4. MFA transfers broadly across models and bypasses both input-level and output-level filters.
  5. MFA achieves a 58.5% success rate, outperforming existing methods.
  6. On state-of-the-art commercial models, MFA reaches a 52.8% success rate, surpassing the second-best attack by 34%.
  7. The results challenge the perceived robustness of current defenses and highlight persistent safety weaknesses in modern VLMs.


T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs

Authors:Shao-Jun Xia, Huixin Zhang, Zhengzhong Tu

In large language models (LLM), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks by unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In the paper, we propose a fully collaborative pipeline, i.e. T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.


Paper and Project Links

PDF

Summary

Visual in-context learning (VICL) with unified vision-language models (VLMs) shows promise for solving downstream tasks. This work asks whether VLMs can still perform VICL when the visual prompt and the target image come from different visual tasks, and proposes a fully collaborative pipeline, T2T-VICL. The authors design a mechanism that generates and selects text prompts that best implicitly describe the differences between two distinct low-level vision tasks, construct the first cross-task VICL dataset, and propose an inference framework that combines perceptual score-based reasoning with traditional evaluation metrics, achieving top-tier cross-task VICL results.

Key Takeaways

  1. VICL holds great potential for solving downstream tasks with unified vision-language models.
  2. The work investigates cross-task VICL, where the visual prompt and the target image come from different visual tasks.
  3. A fully collaborative pipeline, T2T-VICL, generates and selects text prompts that implicitly describe the differences between two low-level vision tasks.
  4. The first cross-task VICL dataset is constructed, providing a foundation for cross-task VICL research.
  5. An inference framework combining perceptual score-based reasoning with traditional evaluation metrics performs cross-task VICL effectively.
  6. The method achieves top-tier results across many scenarios, opening new possibilities for cross-task VICL within VLMs.


Semantic Glitch: Agency and Artistry in an Autonomous Pixel Cloud

Authors:Qing Zhang, Jing Huang, Mingyang Xu, Jun Rekimoto

While mainstream robotics pursues metric precision and flawless performance, this paper explores the creative potential of a deliberately “lo-fi” approach. We present the “Semantic Glitch,” a soft flying robotic art installation whose physical form, a 3D pixel style cloud, is a “physical glitch” derived from digital archaeology. We detail a novel autonomous pipeline that rejects conventional sensors like LiDAR and SLAM, relying solely on the qualitative, semantic understanding of a Multimodal Large Language Model to navigate. By authoring a bio-inspired personality for the robot through a natural language prompt, we create a “narrative mind” that complements the “weak,” historically, loaded body. Our analysis begins with a 13-minute autonomous flight log, and a follow-up study statistically validates the framework’s robustness for authoring quantifiably distinct personas. The combined analysis reveals emergent behaviors, from landmark-based navigation to a compelling “plan to execution” gap, and a character whose unpredictable, plausible behavior stems from a lack of precise proprioception. This demonstrates a lo-fi framework for creating imperfect companions whose success is measured in character over efficiency.


Paper and Project Links

PDF NeurIPS 2025 Creative AI Track, The Thirty-Ninth Annual Conference on Neural Information Processing Systems

Summary

This paper explores the creative potential of a deliberately "lo-fi" approach and presents "Semantic Glitch," a soft flying robotic art installation whose 3D pixel-style cloud body is a "physical glitch" derived from digital archaeology. The robot rejects conventional sensors such as LiDAR and SLAM, navigating solely through the qualitative, semantic understanding of a multimodal large language model, with a bio-inspired personality authored via a natural language prompt. Analysis of a 13-minute autonomous flight log and a follow-up study validate the framework's ability to author quantifiably distinct personas, revealing emergent behaviors such as landmark-based navigation and a "plan to execution" gap, and a character whose unpredictable yet plausible behavior stems from its lack of precise proprioception. Success is measured in character rather than efficiency.

Key Takeaways

  • The paper applies a "lo-fi" approach to a robotic art installation, focusing on creativity and personality rather than metric precision.
  • "Semantic Glitch" is a flying robotic art installation whose physical form is a "physical glitch"; it navigates autonomously using a multimodal large language model instead of conventional sensors.
  • A natural language prompt gives the robot a bio-inspired personality; a 13-minute autonomous flight log and a follow-up statistical study validate the framework's effectiveness.
  • The robot shows emergent behaviors such as landmark-based navigation and a plan-to-execution gap caused by the lack of precise proprioception, shifting the measure of success from efficiency to character.


FxSearcher: gradient-free text-driven audio transformation

Authors:Hojoon Ki, Jongsuk Kim, Minchan Kwon, Junmo Kim

Achieving diverse and high-quality audio transformations from text prompts remains challenging, as existing methods are fundamentally constrained by their reliance on a limited set of differentiable audio effects. This paper proposes FxSearcher, a novel gradient-free framework that discovers the optimal configuration of audio effects (FX) to transform a source signal according to a text prompt. Our method employs Bayesian Optimization and CLAP-based score function to perform this search efficiently. Furthermore, a guiding prompt is introduced to prevent undesirable artifacts and enhance human preference. To objectively evaluate our method, we propose an AI-based evaluation framework. The results demonstrate that the highest scores achieved by our method on these metrics align closely with human preferences. Demos are available at https://hojoonki.github.io/FxSearcher/


Paper and Project Links

PDF

Summary

Achieving diverse, high-quality audio transformations from text prompts is difficult because existing methods rely on a limited set of differentiable audio effects. FxSearcher is a gradient-free framework that discovers the optimal configuration of audio effects (FX) to transform a source signal according to a text prompt, using Bayesian Optimization and a CLAP-based score function, plus a guiding prompt that prevents undesirable artifacts and better matches human preference. An AI-based evaluation framework is proposed, and the highest scores achieved by the method align closely with human preferences. Demos are available at https://hojoonki.github.io/FxSearcher/

Key Takeaways

  • FxSearcher is a novel gradient-free framework for text-driven audio transformation.
  • It searches for the optimal audio-effect configuration with Bayesian Optimization and a CLAP-based score function (see the sketch below).
  • A guiding prompt is introduced to prevent undesirable artifacts and enhance human preference.
  • An AI-based evaluation framework is proposed for objective assessment.
  • The method scores highly under this evaluation, and its results align closely with human preferences.
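
A minimal sketch of gradient-free, prompt-driven FX search in the spirit of the summary above, using scikit-optimize's Gaussian-process Bayesian optimization. `apply_fx_chain` and `clap_similarity` are toy stand-ins for a real FX renderer and a CLAP text-audio scorer, and the parameter ranges are assumptions; the paper's actual search space and scorer differ.

```python
import numpy as np
from skopt import gp_minimize
from skopt.space import Real

source_audio = np.random.randn(16000)          # placeholder 1-second signal

def apply_fx_chain(audio, params):
    # hypothetical stand-in for a real FX renderer (EQ, reverb, gain, ...)
    lowpass_hz, reverb_wet, gain_db = params
    return audio * (10 ** (gain_db / 20.0))

def clap_similarity(audio, prompt):
    # hypothetical stand-in for a CLAP text-audio similarity score
    return -float(np.abs(audio).mean())

space = [Real(20.0, 20000.0),                   # low-pass cutoff (Hz)
         Real(0.0, 0.9),                        # reverb wet mix
         Real(-12.0, 12.0)]                     # gain (dB)

def objective(params):
    audio = apply_fx_chain(source_audio, params)
    return -clap_similarity(audio, "warm vintage radio")  # gp_minimize minimises, so negate

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best FX parameters:", result.x)
```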


CD-DPE: Dual-Prompt Expert Network based on Convolutional Dictionary Feature Decoupling for Multi-Contrast MRI Super-Resolution

Authors:Xianming Gu, Lihui Wang, Ying Cao, Zeyu Deng, Yingfeng Ou, Guodong Hu, Yi Chen

Multi-contrast magnetic resonance imaging (MRI) super-resolution intends to reconstruct high-resolution (HR) images from low-resolution (LR) scans by leveraging structural information present in HR reference images acquired with different contrasts. This technique enhances anatomical detail and soft tissue differentiation, which is vital for early diagnosis and clinical decision-making. However, inherent contrasts disparities between modalities pose fundamental challenges in effectively utilizing reference image textures to guide target image reconstruction, often resulting in suboptimal feature integration. To address this issue, we propose a dual-prompt expert network based on a convolutional dictionary feature decoupling (CD-DPE) strategy for multi-contrast MRI super-resolution. Specifically, we introduce an iterative convolutional dictionary feature decoupling module (CD-FDM) to separate features into cross-contrast and intra-contrast components, thereby reducing redundancy and interference. To fully integrate these features, a novel dual-prompt feature fusion expert module (DP-FFEM) is proposed. This module uses a frequency prompt to guide the selection of relevant reference features for incorporation into the target image, while an adaptive routing prompt determines the optimal method for fusing reference and target features to enhance reconstruction quality. Extensive experiments on public multi-contrast MRI datasets demonstrate that CD-DPE outperforms state-of-the-art methods in reconstructing fine details. Additionally, experiments on unseen datasets demonstrated that CD-DPE exhibits strong generalization capabilities.


Paper and Project Links

PDF This paper has been accepted by AAAI, but due to the final camera-ready version not being finalized, there are still some expression errors. It will be re-published after correction

Summary

Multi-contrast MRI super-resolution reconstructs high-resolution images from low-resolution scans using structural information from high-resolution reference images acquired with different contrasts. This paper proposes a dual-prompt expert network based on a convolutional dictionary feature decoupling (CD-DPE) strategy. An iterative convolutional dictionary feature decoupling module (CD-FDM) separates features into cross-contrast and intra-contrast components to reduce redundancy and interference, while a dual-prompt feature fusion expert module (DP-FFEM) uses a frequency prompt to select relevant reference features and an adaptive routing prompt to choose the best fusion method, improving reconstruction quality. Experiments on public multi-contrast MRI datasets show that CD-DPE outperforms state-of-the-art methods in reconstructing fine details and generalizes well to unseen datasets.

Key Takeaways

  1. Multi-contrast MRI super-resolution combines low-resolution scans with structural information from high-resolution reference images to reconstruct high-resolution images.
  2. The CD-DPE strategy includes a CD-FDM module that separates features into cross-contrast and intra-contrast components.
  3. The DP-FFEM module uses a frequency prompt and an adaptive routing prompt to guide feature selection and fusion, improving reconstruction quality.
  4. Experiments on public multi-contrast MRI datasets show that CD-DPE outperforms existing methods.
  5. CD-DPE exhibits strong generalization on unseen datasets.
  6. The technique is significant for improving diagnostic accuracy and early detection of lesions.


vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs

Authors:Minye Shao, Sihan Guo, Xinrun Li, Xingyu Miao, Haoran Duan, Yang Long

Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work aims to continuously expand to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.


Paper and Project Links

PDF Accepted as an Oral Presentation at AAAI 2026 Main Technical Track (this version is not peer-reviewed; it is the extended version)

Summary

Context optimization (CoOp) guided by LLM-distilled medical semantic priors offers a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). This work proposes vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared hyperspherical manifold and aligns semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors, achieving robust biomedical prompting and superior few-shot classification. The framework shows consistent improvements across 14 medical datasets, 12 imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability.

Key Takeaways

  • vMFCoOp uses LLM-distilled medical semantic priors for context optimization of biomedical CLIP-based vision-language models.
  • Semantic biases are aligned by estimating vMF distributions on a shared hyperspherical manifold (see the sketch below).
  • Unified Semantic Anchors enable robust biomedical prompting and superior few-shot classification.
  • vMFCoOp performs consistently well across many medical datasets and imaging modalities, outperforming existing methods.
  • The work aims to expand to more downstream applications, with resources shared at https://github.com/VinyehShaw/UniEqui.
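
The abstract builds on von Mises-Fisher (vMF) distributions over unit-norm embeddings on a hypersphere. Below is a minimal, generic sketch of fitting a vMF distribution (mean direction plus the standard Banerjee et al. concentration approximation) to a set of embeddings; it illustrates only this building block and is not the paper's vMFCoOp objective.

```python
import numpy as np

def fit_vmf(embeddings):
    """embeddings: (n, d) array; returns (mu, kappa) for a fitted vMF distribution."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # project to the unit sphere
    n, d = x.shape
    s = x.sum(axis=0)
    r_norm = np.linalg.norm(s)
    mu = s / r_norm                                   # mean direction on the hypersphere
    r_bar = r_norm / n
    kappa = r_bar * (d - r_bar**2) / (1 - r_bar**2)   # concentration (Banerjee et al. approximation)
    return mu, kappa

emb = np.random.randn(100, 512)                       # toy stand-in for CLIP text/image embeddings
mu, kappa = fit_vmf(emb)
print(kappa)
```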


Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV

Authors:Wenbo Huang, Jinghui Zhang, Zhenghao Chen, Guang Li, Lei Zhang, Yang Cao, Fang Dong, Takahiro Ogawa, Miki Haseyama

Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of the background distractions. Receptance Weighted Key Value (RWKV), which learns interaction between various dimensions, shows promise for global modeling. While directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relation degraded by frames with similar backgrounds is difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module~(CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing better reconstruct temporal relation. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.


Paper and Project Links

PDF Accepted by AAAI 2026 Oral

Summary

Wide-angle videos in few-shot action recognition (FSAR) express actions within specific scenarios, but background distractions make recognition difficult without a global understanding of both subjects and background. Receptance Weighted Key Value (RWKV) is promising for global modeling, yet applying it directly to wide-angle FSAR may fail to highlight subjects because of excessive background information, and temporal relations degraded by frames with similar backgrounds are hard to reconstruct. The proposed Otter model therefore combines a Compound Segmentation Module (CSM), which segments and emphasizes key patches in each frame to highlight subjects, with a Temporal Reconstruction Module (TRM) that enables bidirectional scanning for temporal-enhanced prototype construction. Experiments on multiple benchmarks show that Otter achieves state-of-the-art performance in wide-angle FSAR.

Key Takeaways

  1. Wide-angle videos effectively express actions in few-shot action recognition (FSAR), but background distractions make recognition challenging.
  2. RWKV shows promise for global modeling, but directly applying it to wide-angle FSAR fails to sufficiently highlight subjects.
  3. The Otter model addresses these issues with a Compound Segmentation Module (CSM) and a Temporal Reconstruction Module (TRM).
  4. CSM segments and emphasizes key patches in each frame, effectively highlighting subjects.
  5. TRM builds temporal-enhanced prototypes with bidirectional scanning, improving the reconstruction of temporal relations.
  6. Otter outperforms other methods on multiple benchmarks, achieving state-of-the-art performance.


Kaggle Chronicles: 15 Years of Competitions, Community and Data Science Innovation

Authors:Kevin Bönisch, Leandro Losaria

Since 2010, Kaggle has been a platform where data scientists from around the world come together to compete, collaborate, and push the boundaries of Data Science. Over these 15 years, it has grown from a purely competition-focused site into a broader ecosystem with forums, notebooks, models, datasets, and more. With the release of the Kaggle Meta Code and Kaggle Meta Datasets, we now have a unique opportunity to explore these competitions, technologies, and real-world applications of Machine Learning and AI. And so in this study, we take a closer look at 15 years of data science on Kaggle - through metadata, shared code, community discussions, and the competitions themselves. We explore Kaggle’s growth, its impact on the data science community, uncover hidden technological trends, analyze competition winners, how Kagglers approach problems in general, and more. We do this by analyzing millions of kernels and discussion threads to perform both longitudinal trend analysis and standard exploratory data analysis. Our findings show that Kaggle is a steadily growing platform with increasingly diverse use cases, and that Kagglers are quick to adapt to new trends and apply them to real-world challenges, while producing - on average - models with solid generalization capabilities. We also offer a snapshot of the platform as a whole, highlighting its history and technological evolution. Finally, this study is accompanied by a video (https://www.youtube.com/watch?v=YVOV9bIUNrM) and a Kaggle write-up (https://kaggle.com/competitions/meta-kaggle-hackathon/writeups/kaggle-chronicles-15-years-of-competitions-communi) for your convenience.


Paper and Project Links

PDF

Summary

This paper reviews the development of the Kaggle platform and its impact on data science. Having grown from a purely competition-focused site into a broader ecosystem with forums, notebooks, models, and datasets, Kaggle is examined through 15 years of metadata, shared code, community discussions, and the competitions themselves. The study finds that Kaggle is a steadily growing platform with increasingly diverse use cases, and that Kagglers quickly adapt to new trends and apply them to real-world challenges.

Key Takeaways

  1. Since 2010, Kaggle has been a platform where data scientists worldwide compete, collaborate, and push the boundaries of data science.
  2. Kaggle has grown from a purely competition-focused site into a broader ecosystem with forums, notebooks, models, and datasets.
  3. Metadata, shared code, community discussions, and competitions on Kaggle offer a unique window into 15 years of data science.
  4. Kaggle is steadily growing with increasingly diverse use cases, and Kagglers quickly adapt new trends to real-world challenges.
  5. Models produced on Kaggle show, on average, solid generalization capabilities.
  6. The study provides a full snapshot of Kaggle's history and technological evolution.


Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Authors:Alina Fastowski, Bardh Prenkaj, Yuxiao Li, Gjergji Kasneci

LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to “victim” LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.


Paper and Project Links

PDF

Summary

Large language models (LLMs) play a major role in information retrieval, but as question-answering chatbots they are vulnerable to adversarial man-in-the-middle (MitM) attacks. This study proposes the first principled evaluation of attacks on LLM factual memory under prompt injection via Xmera, a theory-grounded MitM framework. By perturbing the input given to victim LLMs in closed-book, fact-based QA settings, the attacks undermine answer correctness while the uncertainty of the generation process is measured. Surprisingly, trivial instruction-based attacks achieve the highest success rate (up to about 85.3%) while also producing high uncertainty for incorrectly answered questions. As a simple defense, Random Forest classifiers trained on response-uncertainty levels distinguish attacked from unattacked queries with an average AUC of up to about 96%. Signaling users to be cautious about answers from black-box, potentially corrupted LLMs is a first checkpoint toward user safety.

Key Takeaways

  1. LLMs are central to information retrieval but are vulnerable to man-in-the-middle attacks.
  2. Xmera is a theory-grounded framework for evaluating attacks on LLM factual memory under prompt injection.
  3. Input-perturbation attacks achieve high success rates; trivial instruction-based attacks reach up to about 85.3%.
  4. These attacks also produce high uncertainty for incorrect answers, underscoring the importance of assessing answer reliability.
  5. A Random Forest classifier trained on response-uncertainty levels distinguishes attacked from unattacked queries with an average AUC of up to about 96%, offering a simple defense (see the sketch below).
  6. Answers from black-box, potentially compromised LLMs should be treated with caution, since they can be manipulated to spread false information.
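
A minimal sketch of the defense summarized above: train a Random Forest on per-query uncertainty features to separate attacked from clean queries and report AUC. The data below is synthetic and the feature names are placeholders, not the paper's exact uncertainty signals.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# toy features per query, e.g. token entropy, max-prob margin, perplexity, answer length
X = np.random.rand(1000, 4)
y = np.random.randint(0, 2, 1000)        # 1 = attacked query, 0 = clean query (toy labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

With real uncertainty features extracted from the LLM's generation process, the same pipeline would flag suspicious queries so that users can be warned before trusting the answer.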


TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification

Authors:Pasan Dissanayake, Sanghamitra Dutta

Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.


Paper and Project Links

PDF

Summary

Transformer-based models outperform classical models such as neural networks and Gradient Boosted Decision Trees on tabular data when training data is limited, adapting to new domains with only a few examples (the few-shot regime). However, this performance gain comes at the cost of significantly increased complexity and parameter counts. TabDistill distills the pre-trained knowledge of complex transformer-based models into simpler neural networks for tabular classification, achieving both parameter efficiency and strong performance with limited training data. The distilled networks surpass classical baselines such as regular neural networks, XGBoost, and logistic regression under equal training data, and in some cases even the original transformer-based models they were distilled from.

Key Takeaways

  1. Transformer-based models outperform classical models on tabular data when training data is limited.
  2. Pre-trained knowledge lets these models adapt quickly to new data domains, especially in the few-shot regime.
  3. There is a trade-off between performance and parameter count: complex models perform better but cost more to compute and store.
  4. TabDistill distills the pre-trained knowledge of transformer-based models into simple neural networks for effective classification (see the sketch below).
  5. The distilled networks perform well with limited training data, surpassing classical baselines and in some cases the original transformer-based models.
  6. Distillation yields models that are both parameter-efficient and accurate.
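
A minimal sketch of distilling a transformer teacher into a small MLP student for tabular classification, using the standard soft-label/hard-label distillation loss. The teacher logits are faked with random tensors here, and the temperature and mixing weight are arbitrary choices; the paper's exact objective may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, lam=0.5):
    # soft term: match the teacher's softened class distribution
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # hard term: standard cross-entropy on the true labels
    hard = F.cross_entropy(student_logits, targets)
    return lam * soft + (1 - lam) * hard

student = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))  # toy tabular MLP
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(32, 16)                       # toy batch of tabular features
teacher_logits = torch.randn(32, 2)           # stand-in for the frozen transformer teacher's outputs
targets = torch.randint(0, 2, (32,))

opt.zero_grad()
loss = distillation_loss(student(x), teacher_logits, targets)
loss.backward()
opt.step()
```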


Discovering EV Charging Site Archetypes Through Few Shot Forecasting: The First U.S.-Wide Study

Authors:Kshitij Nikhal, Lucas Ackerknecht, Benjamin S. Riggan, Phillip Stahlfeld

The decarbonization of transportation relies on the widespread adoption of electric vehicles (EVs), which requires an accurate understanding of charging behavior to ensure cost-effective, grid-resilient infrastructure. Existing work is constrained by small-scale datasets, simple proximity-based modeling of temporal dependencies, and weak generalization to sites with limited operational history. To overcome these limitations, this work proposes a framework that integrates clustering with few-shot forecasting to uncover site archetypes using a novel large-scale dataset of charging demand. The results demonstrate that archetype-specific expert models outperform global baselines in forecasting demand at unseen sites. By establishing forecast performance as a basis for infrastructure segmentation, we generate actionable insights that enable operators to lower costs, optimize energy and pricing strategies, and support grid resilience critical to climate goals.


Paper and Project Links

PDF Tackling Climate Change with Machine Learning: Workshop at NeurIPS 2025

Summary

The widespread adoption of electric vehicles is key to decarbonizing transportation, which requires an accurate understanding of charging behavior to build cost-effective, grid-resilient infrastructure. Existing work is limited by small datasets, simple proximity-based modeling of temporal dependencies, and weak generalization to sites with limited operational history. This work proposes a framework that integrates clustering with few-shot forecasting to uncover charging-site archetypes from a novel large-scale charging-demand dataset. Archetype-specific expert models outperform global baselines when forecasting demand at unseen sites, and using forecast performance as a basis for infrastructure segmentation yields actionable insights for lowering costs, optimizing energy and pricing strategies, and supporting grid resilience.

Key Takeaways

  1. Widespread EV adoption is essential to decarbonizing transportation.
  2. An accurate understanding of charging behavior is key to cost-effective, grid-resilient infrastructure.
  3. Existing work is limited by small datasets, simple proximity-based modeling, and weak generalization across sites.
  4. This study combines clustering with few-shot forecasting over a large-scale charging-demand dataset to uncover site archetypes (see the sketch below).
  5. Archetype-specific expert models outperform global baselines when forecasting demand at unseen sites.
  6. Using forecast performance as a basis for infrastructure segmentation provides actionable operational insights.
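
A minimal sketch of the cluster-then-specialize idea summarized above: cluster sites into archetypes from simple demand statistics, fit one lightweight forecasting "expert" per archetype, and reuse the matching expert for an unseen site with little history. The features, the toy Ridge forecaster, and the random data are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
site_features = rng.random((200, 3))     # e.g. mean daily demand, peak hour, weekday/weekend ratio
demand_history = rng.random((200, 24))   # toy hourly demand profiles per site

archetypes = KMeans(n_clusters=4, n_init=10, random_state=0).fit(site_features)

experts = {}
for k in range(4):                        # one lightweight expert per archetype
    idx = archetypes.labels_ == k
    X = demand_history[idx, :-1]          # predict the last hour from the previous 23
    y = demand_history[idx, -1]
    experts[k] = Ridge().fit(X, y)

new_site = rng.random((1, 3))             # unseen site with limited history
k = archetypes.predict(new_site)[0]       # assign it to an archetype
forecast = experts[k].predict(rng.random((1, 23)))
print("archetype", k, "forecast", forecast)
```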


Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning

Authors:Yajie Li, Albert Galimov, Mitra Datta Ganapaneni, Pujitha Thejaswi, De Meng, Priyanshu Kumar, Saloni Potdar

Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals(both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g. ReFinED) and more expensive targeted LLM-based reasoning respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being as twice as efficient in terms of the number of LLM tokens.


Paper and Project Links

PDF Accepted to EMNLP 2025 Industry Track

Summary

Entity Linking (EL) traditionally relies on large annotated datasets and extensive model fine-tuning, and recent few-shot methods that prompt large language models (LLMs) reduce training requirements but suffer from costly LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) is a structured pipeline that achieves high performance without deep fine-tuning by combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. It computes a small set of complementary signals (embedding-based and LLM-based) over the retrieved candidates to split contextual mentions into easy and hard cases, which are handled by a low-cost entity linker (e.g., ReFinED) and more expensive targeted LLM-based reasoning, respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on five of six datasets, performs comparably to pipelines that use LLM-based reasoning for all mentions, and is about twice as efficient in the number of LLM tokens.

Key Takeaways

  1. ARTER combines candidate generation, context-based scoring, adaptive routing, and selective reasoning for efficient entity linking.
  2. A small set of complementary signals categorizes contextual mentions into easy and hard cases (see the sketch below).
  3. Easy cases go to a low-cost entity linker (e.g., ReFinED), while hard cases go to targeted LLM-based reasoning.
  4. On standard benchmarks, ARTER outperforms ReFinED with a notable average gain.
  5. ARTER performs comparably to pipelines that use LLM-based reasoning for all mentions.
  6. ARTER is about twice as efficient as such pipelines in the number of LLM tokens.
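
A minimal sketch of the adaptive-routing idea summarized above: cheap complementary signals produce a confidence score that routes each mention either to an inexpensive linker or to targeted LLM reasoning. All functions below are toy stand-ins (the paper uses ReFinED and LLM-based reasoning), and the threshold and signal combination are assumptions.

```python
def route_mention(mention, candidates, threshold=0.75):
    """Route an 'easy' mention to the cheap linker and a 'hard' one to LLM reasoning."""
    emb_score = embedding_score(mention, candidates[0])   # e.g. embedding similarity to the top candidate
    ctx_score = context_score(mention, candidates[0])     # e.g. an LLM-free contextual score
    confidence = 0.5 * emb_score + 0.5 * ctx_score        # assumed combination of complementary signals
    if confidence >= threshold:
        return cheap_link(mention, candidates)            # low-cost path (ReFinED-style linker)
    return llm_reason_link(mention, candidates)           # expensive targeted LLM-based reasoning

# toy stand-ins so the sketch runs end to end
def embedding_score(m, c): return 0.9
def context_score(m, c): return 0.8
def cheap_link(m, cands): return cands[0]
def llm_reason_link(m, cands): return cands[0]

print(route_mention("Paris", ["Paris (France)", "Paris Hilton"]))
```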



Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!