⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never rely on them for serious academic work; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-26
LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models
Authors:Shuai Wang, Daoan Zhang, Tianyi Bai, Shitong Shao, Jiebo Luo, Jiaheng Wei
Humans can perceive and understand 3D space and long videos from sequential visual observations. But can vision-language models (VLMs) do the same? Recent work demonstrates that even state-of-the-art VLMs still struggle to understand 3D space and long videos, although they are powerful in typical vision-language tasks. Current methods often rely on specialized architectural designs to improve performance on 3D tasks and video understanding tasks separately. In contrast, we propose LAST, short for LeArn to Think in Space and Time, to jointly improve 3D spatial and long video understanding for general VLMs with only a set of 2D images as inputs. LAST makes VLMs think in space and time rather than only with text before giving the final answer, building visual thinking trajectories in 3D space and the temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we directly prompt proprietary models; and 2) fine-tuning general VLMs with data that include thinking trajectories in 3D space and time. We show that LAST brings substantial gains on various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks. Notably, it yields 15.8% gains on EgoSchema with GPT-4o in a zero-shot manner and 8.3% gains on VSI-Bench compared with Qwen2.5-VL-7B.
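To make the zero-shot setting concrete, below is a minimal sketch of prompting a general VLM to lay out a spatio-temporal thinking trajectory over the sampled frames before it answers. The prompt wording and the `build_messages` helper are illustrative assumptions based on the abstract, not the prompts released with the paper.

```python
# A sketch of the zero-shot "think in space and time" prompting described above.
# The instruction text and helper below are assumptions for illustration only.
from typing import Dict, List

THINK_IN_SPACE_AND_TIME = (
    "You are given {n} frames sampled in temporal order from one scene.\n"
    "Before answering, think step by step:\n"
    "1. For each frame, note the visible objects and their rough 3D layout "
    "(left/right, near/far, above/below).\n"
    "2. Track how the objects and the camera move across frames (what appears, "
    "disappears, or changes position over time).\n"
    "3. Only then answer the question, citing the frames that support it.\n"
    "Question: {question}"
)

def build_messages(frame_urls: List[str], question: str) -> List[Dict]:
    """Assemble a chat-style request: all frames first, then the reasoning prompt."""
    content: List[Dict] = [{"type": "image_url", "image_url": {"url": u}} for u in frame_urls]
    content.append({
        "type": "text",
        "text": THINK_IN_SPACE_AND_TIME.format(n=len(frame_urls), question=question),
    })
    return [{"role": "user", "content": content}]

if __name__ == "__main__":
    msgs = build_messages(
        ["https://example.com/frame_000.jpg", "https://example.com/frame_001.jpg"],
        "Which object is closest to the camera at the end of the clip?",
    )
    print(msgs[0]["content"][-1]["text"])
```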
Paper and Project Links
Summary
Humans can perceive and understand 3D space and long videos, but even state-of-the-art vision-language models (VLMs) still struggle with both. This paper proposes LAST, a method that improves 3D spatial and long-video understanding for general VLMs using only a set of 2D images as input and no specialized architectural designs. LAST has the model think in space and time before giving its final answer, building visual thinking trajectories in 3D space and along the temporal dimension. The paper demonstrates the effectiveness of LAST in both zero-shot and fine-tuning settings, with substantial gains across a range of benchmarks.
Key Takeaways
- Humans readily perceive and understand 3D space and long videos, but current vision-language models (VLMs) struggle to do so.
- Existing methods often rely on specialized architectural designs to improve 3D and video understanding separately.
- LAST improves 3D spatial and long-video understanding for general VLMs using only a set of 2D images as input.
- LAST has the VLM think in space and time before answering, building visual thinking trajectories.
- LAST is effective in both zero-shot and fine-tuning settings.
- LAST delivers substantial gains across benchmarks covering spatial, video, and image understanding tasks.
PromptMoE: Generalizable Zero-Shot Anomaly Detection via Visually-Guided Prompt Mixtures
Authors:Yuheng Shao, Lizhang Wang, Changhao Li, Peixian Chen, Qinyuan Liu
Zero-Shot Anomaly Detection (ZSAD) aims to identify and localize anomalous regions in images of unseen object classes. While recent methods based on vision-language models like CLIP show promise, their performance is constrained by existing prompt engineering strategies. Current approaches, whether relying on single fixed, learnable, or dense dynamic prompts, suffer from a representational bottleneck and are prone to overfitting on auxiliary data, failing to generalize to the complexity and diversity of unseen anomalies. To overcome these limitations, we propose $\mathtt{PromptMoE}$. Our core insight is that robust ZSAD requires a compositional approach to prompt learning. Instead of learning monolithic prompts, $\mathtt{PromptMoE}$ learns a pool of expert prompts, which serve as a basis set of composable semantic primitives, and a visually-guided Mixture-of-Experts (MoE) mechanism to dynamically combine them for each instance. Our framework materializes this concept through a Visually-Guided Mixture of Prompt (VGMoP) that employs an image-gated sparse MoE to aggregate diverse normal and abnormal expert state prompts, generating semantically rich textual representations with strong generalization. Extensive experiments across 15 datasets in industrial and medical domains demonstrate the effectiveness and state-of-the-art performance of $\mathtt{PromptMoE}$.
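The core mechanism, an image-gated sparse mixture over a pool of learnable expert prompts, can be sketched in a few lines of PyTorch. The dimensions, the top-k routing, and the `VisuallyGuidedPromptMoE` name are illustrative assumptions; this is not the authors' VGMoP implementation.

```python
# A toy sketch (with assumed sizes) of combining expert prompt embeddings with a
# sparse, visually-guided gate, as described in the abstract above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisuallyGuidedPromptMoE(nn.Module):
    def __init__(self, img_dim=768, prompt_dim=512, prompt_len=8,
                 num_experts=16, top_k=4):
        super().__init__()
        # Pool of expert prompts: each expert is a short sequence of token embeddings.
        self.experts = nn.Parameter(torch.randn(num_experts, prompt_len, prompt_dim) * 0.02)
        # The gate maps a global image feature to one score per expert.
        self.gate = nn.Linear(img_dim, num_experts)
        self.top_k = top_k

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:
        """img_feat: (B, img_dim) global embedding from a frozen vision encoder.
        Returns (B, prompt_len, prompt_dim): one mixed prompt per image."""
        scores = self.gate(img_feat)                     # (B, E)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)          # (B, k), sparse over experts
        chosen = self.experts[top_idx]                   # (B, k, L, D)
        return (weights[..., None, None] * chosen).sum(dim=1)

if __name__ == "__main__":
    moe = VisuallyGuidedPromptMoE()
    mixed_prompt = moe(torch.randn(2, 768))
    print(mixed_prompt.shape)  # torch.Size([2, 8, 512])
```

In the paper's setting, separate pools of normal and abnormal state prompts would feed a CLIP-style text encoder; the sketch keeps only the gating idea.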
Paper and Project Links
PDF 14 pages, 8 figures. Accepted to AAAI 2026
Summary
Zero-Shot Anomaly Detection (ZSAD) aims to identify and localize anomalous regions in images of unseen object classes. Although recent methods built on vision-language models such as CLIP show promise, their performance is limited by existing prompt engineering strategies. PromptMoE proposes a compositional approach to prompt learning: a pool of expert prompts is dynamically combined per instance by a visually-guided Mixture-of-Experts (MoE) mechanism. Experiments show that the method outperforms prior approaches on 15 industrial and medical datasets.
Key Takeaways
- ZSAD aims to identify and localize anomalous regions in images of unseen object classes.
- Current CLIP-based methods are constrained by their prompt engineering strategies, suffering from a representational bottleneck and overfitting.
- PromptMoE addresses these issues with a pool of expert prompts combined by a visually-guided Mixture-of-Experts (MoE) mechanism.
- The expert prompt pool serves as a basis set of composable semantic primitives.
- VGMoP uses an image-gated sparse MoE to aggregate diverse normal and abnormal expert state prompts into semantically rich textual representations.
- Dynamically composing prompts in this way strengthens the model's generalization.
Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning
Authors:Xiaohong Liu, Xiufeng Song, Huayu Zheng, Lei Bai, Xiaoming Liu, Guangtao Zhai
The proliferation of videos generated by diffusion models has raised increasing concerns about information security, highlighting the urgent need for reliable detection of synthetic media. Existing methods primarily focus on image-level forgery detection, leaving generic video-level forgery detection largely underexplored. To advance video forensics, we propose a consolidated multimodal detection algorithm, named MM-Det++, specifically designed for detecting diffusion-generated videos. Our approach consists of two innovative branches and a Unified Multimodal Learning (UML) module. Specifically, the Spatio-Temporal (ST) branch employs a novel Frame-Centric Vision Transformer (FC-ViT) to aggregate spatio-temporal information for detecting diffusion-generated videos, where the FC-tokens enable the capture of holistic forgery traces from each video frame. In parallel, the Multimodal (MM) branch adopts a learnable reasoning paradigm to acquire Multimodal Forgery Representation (MFR) by harnessing the powerful comprehension and reasoning capabilities of Multimodal Large Language Models (MLLMs), which discerns the forgery traces from a flexible semantic perspective. To integrate multimodal representations into a coherent space, a UML module is introduced to consolidate the generalization ability of MM-Det++. In addition, we also establish a large-scale and comprehensive Diffusion Video Forensics (DVF) dataset to advance research in video forgery detection. Extensive experiments demonstrate the superiority of MM-Det++ and highlight the effectiveness of unified multimodal forgery learning in detecting diffusion-generated videos.
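As a rough illustration of the two-branch layout described above, the sketch below fuses per-frame tokens from a spatio-temporal branch with an MLLM-derived forgery representation for a real-versus-generated decision. All dimensions, the additive fusion, and the module names are assumptions for illustration; this is not the released MM-Det++ code.

```python
# A toy two-branch fusion: pooled frame tokens (cf. FC-tokens) plus a multimodal
# forgery representation, projected to a shared space and classified. Assumed sizes.
import torch
import torch.nn as nn

class TwoBranchForgeryDetector(nn.Module):
    def __init__(self, st_dim=768, mm_dim=4096, fused_dim=512):
        super().__init__()
        self.st_proj = nn.Linear(st_dim, fused_dim)   # stand-in for the ST branch output
        self.mm_proj = nn.Linear(mm_dim, fused_dim)   # stand-in for the MLLM-based branch
        self.head = nn.Sequential(nn.LayerNorm(fused_dim), nn.Linear(fused_dim, 2))

    def forward(self, frame_tokens: torch.Tensor, mm_repr: torch.Tensor) -> torch.Tensor:
        """frame_tokens: (B, T, st_dim), one holistic token per video frame.
        mm_repr: (B, mm_dim), multimodal forgery representation from an MLLM."""
        st = self.st_proj(frame_tokens).mean(dim=1)   # pool over time
        mm = self.mm_proj(mm_repr)
        return self.head(st + mm)                     # logits over {real, generated}

if __name__ == "__main__":
    model = TwoBranchForgeryDetector()
    logits = model(torch.randn(2, 16, 768), torch.randn(2, 4096))
    print(logits.shape)  # torch.Size([2, 2])
```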
Paper and Project Links
PDF Code and dataset are available at https://github.com/SparkleXFantasy/MM-Det-Plus
Summary
The proliferation of diffusion-generated videos raises information-security concerns, yet existing forgery detection focuses mainly on images. MM-Det++ is a consolidated multimodal detector for diffusion-generated videos: a Spatio-Temporal branch with a Frame-Centric Vision Transformer (FC-ViT) captures holistic forgery traces from each frame, a Multimodal branch leverages an MLLM to acquire a Multimodal Forgery Representation, and a Unified Multimodal Learning (UML) module consolidates the two in a coherent space. The authors also release the large-scale Diffusion Video Forensics (DVF) dataset, and extensive experiments demonstrate the superiority of MM-Det++.
Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
Authors:Minseok Seo, Mark Hamilton, Changick Kim
We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only ≈0.419 s per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling. Project page: https://seominseok0429.github.io/Upsample-Anything/
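The bridge to Joint Bilateral Upsampling can be illustrated with a small function: low-resolution features are upsampled with weights that combine a spatial Gaussian and a range Gaussian computed from the high-resolution guidance image. The fixed isotropic sigmas, the window size, and the nearest-neighbour correspondence are simplifying assumptions; the paper's per-image optimization of an anisotropic kernel is not reproduced here.

```python
# A simplified joint-bilateral-style upsampler guided by a high-resolution image.
# The sigmas here are fixed; in the paper they would be optimized per image at test time.
import torch
import torch.nn.functional as F

def joint_bilateral_upsample(feat_lr, guide_hr, sigma_spatial=1.0, sigma_range=0.1, radius=1):
    """feat_lr: (1, C, h, w) low-res features; guide_hr: (1, 3, H, W) image in [0, 1].
    Returns (1, C, H, W) upsampled features."""
    _, C, h, w = feat_lr.shape
    _, _, H, W = guide_hr.shape
    guide_lr = F.interpolate(guide_hr, size=(h, w), mode="bilinear", align_corners=False)

    # Nearest low-resolution coordinate for every high-resolution pixel.
    ys = (torch.arange(H) * h / H).long().clamp(0, h - 1)
    xs = (torch.arange(W) * w / W).long().clamp(0, w - 1)

    num = torch.zeros(1, C, H, W)
    den = torch.zeros(1, 1, H, W)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ny = (ys + dy).clamp(0, h - 1)
            nx = (xs + dx).clamp(0, w - 1)
            f = feat_lr[:, :, ny][:, :, :, nx]      # neighbour features, (1, C, H, W)
            g = guide_lr[:, :, ny][:, :, :, nx]     # neighbour guidance colours
            w_sp = torch.exp(torch.tensor(-(dy * dy + dx * dx) / (2 * sigma_spatial ** 2)))
            w_rg = torch.exp(-((guide_hr - g) ** 2).sum(1, keepdim=True) / (2 * sigma_range ** 2))
            weight = w_sp * w_rg                    # spatial x range cue, (1, 1, H, W)
            num += weight * f
            den += weight
    return num / den.clamp_min(1e-8)

if __name__ == "__main__":
    up = joint_bilateral_upsample(torch.randn(1, 8, 14, 14), torch.rand(1, 3, 224, 224))
    print(up.shape)  # torch.Size([1, 8, 224, 224])
```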
Paper and Project Links
PDF 15 pages, 12 figures
Summary
Upsample Anything is a lightweight test-time optimization framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. It avoids the dataset-specific retraining and heavy implicit optimization that existing feature upsampling methods require: a simple per-image optimization learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in roughly 0.419 s per 224x224 image and achieves state-of-the-art results on semantic segmentation, depth estimation, and depth/probability map upsampling. Details and results are available on the project page: https://seominseok0429.github.io/Upsample-Anything/.
Key Takeaways
- Upsample Anything is a test-time optimization framework that restores low-resolution features to high-resolution, pixel-wise outputs without additional training.
- It learns a per-image anisotropic Gaussian kernel combining spatial and range cues, overcoming the limitations of existing feature upsampling methods.
- The learned kernel acts as a universal, edge-aware operator applicable across architectures and modalities.
- The method achieves state-of-the-art performance on semantic segmentation, depth estimation, and depth/probability map upsampling.
- It is lightweight, running in roughly 0.419 s per 224x224 image.
- Further details and results are available on the project page.
Interpretable and Testable Vision Features via Sparse Autoencoders
Authors:Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su
To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. While earlier work offers either rich semantics or direct control, few post-hoc tools supply both in a single, model-agnostic procedure. We use sparse autoencoders (SAEs) to bridge this gap; each sparse feature comes with real-image exemplars that reveal its meaning and a decoding vector that can be manipulated to probe its influence on downstream task behavior. By applying our method to widely-used pre-trained vision models, we reveal meaningful differences in the semantic abstractions learned by different pre-training objectives. We then show that a single SAE trained on frozen ViT activations supports patch-level causal edits across tasks (classification and segmentation) all without retraining the ViT or task heads. These qualitative, falsifiable demonstrations position SAEs as a practical bridge between concept discovery and causal probing of vision models. We provide code, demos and models on our project website: https://osu-nlp-group.github.io/saev.
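A minimal sketch of the kind of sparse autoencoder described above: it is trained on frozen ViT patch activations, and each latent feature owns a decoder column (its "decoding vector") that can be manipulated to probe downstream effects. The width, the L1 penalty, and the edited feature index are illustrative assumptions, not the released saev code.

```python
# A small sparse autoencoder over ViT activations, plus a toy feature-suppression edit.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, n_features=8192):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)   # each weight column is one feature's decoding vector

    def forward(self, x: torch.Tensor):
        z = F.relu(self.enc(x))                     # sparse codes, one scalar per learned feature
        return self.dec(z), z

def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    # Reconstruction plus an L1 penalty that keeps few features active per patch.
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()

if __name__ == "__main__":
    sae = SparseAutoencoder()
    acts = torch.randn(32, 768)                     # stand-in for frozen ViT patch activations
    x_hat, z = sae(acts)
    sae_loss(acts, x_hat, z).backward()

    # Toy "causal edit": suppress one (hypothetical) feature and decode the edited activation,
    # which would then be fed back to the frozen task head to observe its effect.
    k = 123
    z_edit = z.detach().clone()
    z_edit[:, k] = 0.0
    edited_acts = sae.dec(z_edit)
    print(edited_acts.shape)  # torch.Size([32, 768])
```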
Paper and Project Links
PDF Main text is 10 pages with 7 figures
Summary
This paper uses sparse autoencoders (SAEs) to both interpret and experimentally validate the features of vision models. SAEs reveal meaningful differences in the semantic abstractions learned under different pre-training objectives, and a single SAE trained on frozen ViT activations supports patch-level causal edits across tasks without retraining the ViT or the task heads. This positions SAEs as a practical bridge between concept discovery and causal probing of vision models.
Key Takeaways
- Understanding vision models requires both interpreting their learned features and validating those interpretations through controlled experiments.
- Earlier work offers either rich semantics or direct control; few tools provide both.
- Sparse autoencoders (SAEs) bridge this gap: each sparse feature comes with real-image exemplars that reveal its meaning and a decoding vector that can be manipulated to probe its influence on downstream tasks.
- Applying SAEs to widely-used pre-trained vision models reveals meaningful differences in the semantic abstractions learned by different pre-training objectives.
- A single SAE trained on frozen ViT activations supports patch-level causal edits across classification and segmentation tasks without retraining the ViT or the task heads.
- SAEs thus serve as a practical bridge between concept discovery and causal probing of vision models.