Published: 2025-11-19

Updated: 2025-11-27


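⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use them in serious academic settings; they are intended only for initial screening before reading a paper.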

Updated 2025-11-19

MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images

Authors: Doanh C. Bui, Ba Hung Ngo, Hoai Luan Pham, Khang Nguyen, Maï K. Nguyen, Yasuhiko Nakashima

Lifelong learning on Whole Slide Images (WSIs) aims to train or fine-tune a unified model sequentially on cancer-related tasks, reducing the resources and effort required for data transfer and processing, especially given the gigabyte-scale size of WSIs. In this paper, we introduce MergeSlide, a simple yet effective framework that treats lifelong learning as a model merging problem by leveraging a vision-language pathology foundation model. When a new task arrives, it is: 1) defined with class-aware prompts, 2) fine-tuned for a few epochs using an MLP-free backbone, and 3) merged into a unified model using an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For inference under the class-incremental learning (CLASS-IL) setting, where task identity is unknown, we introduce Task-to-Class Prompt-aligned (TCP) inference. Specifically, TCP first identifies the most relevant task using task-level prompts and then applies the corresponding class-aware prompts to generate predictions. To evaluate MergeSlide, we conduct experiments on a stream of six TCGA datasets. The results show that MergeSlide outperforms both rehearsal-based continual learning and vision-language zero-shot baselines. Code and data are available at https://github.com/caodoanh2001/MergeSlide.


Paper and project links

PDF WACV2026 Accepted

Summary

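This paper introduces MergeSlide, a framework that treats lifelong learning as a model merging problem and builds on a vision-language pathology foundation model. Each new task is defined with class-aware prompts, fine-tuned for a few epochs with an MLP-free backbone, and merged into a unified model through an orthogonal continual merging strategy, which preserves performance and mitigates catastrophic forgetting. For inference in the class-incremental setting, where task identity is unknown, it introduces Task-to-Class Prompt-aligned (TCP) inference: task-level prompts identify the most relevant task, and the corresponding class-aware prompts then produce the prediction. Experiments on a stream of six TCGA datasets show that MergeSlide outperforms rehearsal-based continual learning and vision-language zero-shot baselines.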

Key Takeaways

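MergeSlide treats lifelong learning as a model merging problem and builds on a vision-language pathology foundation model.
Each new task is defined with class-aware prompts and fine-tuned with an MLP-free backbone.
An orthogonal continual merging strategy merges each new task into the unified model, preserving performance and mitigating catastrophic forgetting.
Task-to-Class Prompt-aligned (TCP) inference is introduced for the class-incremental setting, where task identity is unknown.
TCP first identifies the most relevant task with task-level prompts, then applies that task's class-aware prompts to produce the prediction.
On a stream of six TCGA datasets, MergeSlide outperforms rehearsal-based continual learning and vision-language zero-shot baselines.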
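
The two-stage idea behind TCP inference can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the slide and every prompt are already embedded in a shared space and that matching is done by cosine similarity. The function name `tcp_predict`, the task and class names, and the 3-dimensional toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tcp_predict(slide_emb, task_prompts, class_prompts):
    """Two-stage prediction: pick the task first, then a class within it.

    task_prompts:  {task_name: embedding}
    class_prompts: {task_name: {class_name: embedding}}
    """
    # Stage 1: choose the task whose task-level prompt is most similar.
    task = max(task_prompts, key=lambda t: cosine(slide_emb, task_prompts[t]))
    # Stage 2: compare only against that task's class-aware prompts.
    candidates = class_prompts[task]
    label = max(candidates, key=lambda c: cosine(slide_emb, candidates[c]))
    return task, label

# Toy embeddings, made up purely for illustration.
task_prompts = {"breast": [1.0, 0.0, 0.0], "lung": [0.0, 1.0, 0.0]}
class_prompts = {
    "breast": {"IDC": [0.9, 0.0, 0.4], "ILC": [0.9, 0.0, -0.4]},
    "lung": {"LUAD": [0.0, 0.9, 0.4], "LUSC": [0.0, 0.9, -0.4]},
}
result = tcp_predict([0.1, 0.8, 0.5], task_prompts, class_prompts)
print(result)
```

Restricting the second stage to one task's classes is what keeps the label space small even as more tasks are merged in; the cost is that a wrong task choice in stage 1 cannot be recovered in stage 2.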


Denoising Vision Transformer Autoencoder with Spectral Self-Regularization

Authors: Xunzhi Xiang, Xingye Tian, Guiyu Zhang, Yabo Chen, Shaofeng Zhang, Xuebo Wang, Xin Tao, Qi Fan

Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers generative performance. Recent methods attempt to address this dilemma by regularizing high-dimensional latent spaces using external vision foundation models (VFMs). However, it remains unclear how high-dimensional VAE latents affect the optimization of generative models. To our knowledge, our analysis is the first to reveal that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models and, consequently, degrade generation quality. To alleviate this problem, we propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving reconstruction quality. The resulting Denoising-VAE, a ViT-based autoencoder that does not rely on VFMs, produces cleaner, lower-noise latents, leading to improved generative quality and faster optimization convergence. We further introduce a spectral alignment strategy to facilitate the optimization of Denoising-VAE-based generative models. Our complete method enables diffusion models to converge approximately 2$\times$ faster than with SD-VAE, while achieving state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on the ImageNet 256$\times$256 benchmark.


Paper and project links

PDF

Summary

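This paper examines the problems variational autoencoders (VAEs) face when encoding images into high-dimensional latent spaces. The analysis finds that redundant high-frequency components in such latents hinder the training convergence of diffusion models and degrade generation quality. To address this, the authors propose a spectral self-regularization strategy that suppresses redundant high-frequency noise while preserving reconstruction quality. The resulting Denoising-VAE, which does not rely on external vision foundation models (VFMs), produces cleaner, lower-noise latents, improving generation quality and speeding up optimization convergence. A spectral alignment strategy is also introduced to facilitate the optimization of Denoising-VAE-based generative models. With the complete method, diffusion models converge approximately 2× faster than with SD-VAE, while achieving state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on the ImageNet 256×256 benchmark.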

Key Takeaways

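VAEs face an optimization dilemma in high-dimensional latent spaces: redundant high-frequency components hinder the training convergence of generative models.
Spectral self-regularization suppresses redundant high-frequency noise in high-dimensional latents while preserving reconstruction quality.
Denoising-VAE produces cleaner, lower-noise latents without relying on external vision foundation models (VFMs).
Denoising-VAE improves generation quality and speeds up optimization convergence; a spectral alignment strategy further facilitates the optimization of generative models built on it.
With the complete method, diffusion models converge approximately 2× faster than with SD-VAE.
On the ImageNet 256×256 benchmark, Denoising-VAE achieves state-of-the-art reconstruction quality and competitive generation performance.
According to the summary, the study is relevant in both theory and practice to the optimization problems VAEs face in high-dimensional latent spaces.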
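
To make "redundant high-frequency components" concrete, the sketch below measures what share of a signal's spectral energy sits above a cutoff frequency. It is a diagnostic under stated assumptions, not the paper's regularizer: the latent is reduced to a 1-D real signal, the transform is a naive DFT, and `high_freq_ratio` and the cutoff value are hypothetical names and choices.

```python
import cmath
import math

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real 1-D signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def high_freq_ratio(signal, cutoff):
    """Share of non-DC spectral energy at frequency indices above `cutoff`."""
    n = len(signal)
    energy = [abs(c) ** 2 for c in dft(signal)]
    # For a real signal, bins k and n - k carry the same frequency.
    total = sum(energy[1:])
    high = sum(e for k, e in enumerate(energy) if min(k, n - k) > cutoff)
    return high / total

n = 16
# One slow sine cycle: all energy sits at frequency index 1.
smooth = [math.sin(2 * math.pi * t / n) for t in range(n)]
# Add an alternating +/-0.5 pattern, i.e. energy at the highest frequency.
noisy = [s + 0.5 * (-1) ** t for t, s in enumerate(smooth)]

print(high_freq_ratio(smooth, cutoff=4))
print(high_freq_ratio(noisy, cutoff=4))
```

A regularizer in this spirit would penalize the ratio during autoencoder training; how the paper actually defines and weights its spectral term is not stated in the abstract.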


LoRA-Enhanced Vision Transformer for Single Image based Morphing Attack Detection via Knowledge Distillation from EfficientNet

Authors: Ria Shekhawat, Sushrut Patwardhan, Raghavendra Ramachandra, Praveen Kumar Chandaliya, Kishor P. Upla

Face Recognition Systems (FRS) are critical for security but remain vulnerable to morphing attacks, where synthetic images blend biometric features from multiple individuals. We propose a novel Single-Image Morphing Attack Detection (S-MAD) approach using a teacher-student framework, where a CNN-based teacher model refines a ViT-based student model. To improve efficiency, we integrate Low-Rank Adaptation (LoRA) for fine-tuning, reducing computational costs while maintaining high detection accuracy. Extensive experiments are conducted on a morphing dataset built from three publicly available face datasets, incorporating ten different morphing generation algorithms to assess robustness. The proposed method is benchmarked against six state-of-the-art S-MAD techniques, demonstrating superior detection performance and computational efficiency.


Paper and project links

PDF

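Summary

The paper proposes a new Single-Image Morphing Attack Detection (S-MAD) method built on a teacher-student framework, in which a CNN-based teacher model refines a ViT-based student model. To improve efficiency, Low-Rank Adaptation (LoRA) is used for fine-tuning, reducing computational cost while maintaining high detection accuracy. Extensive experiments on a morphing dataset built from three publicly available face datasets, covering ten morphing generation algorithms, assess the model's robustness. Benchmarked against six state-of-the-art S-MAD techniques, the method shows superior detection performance and computational efficiency.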

Key Takeaways

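A new Single-Image Morphing Attack Detection (S-MAD) method uses a teacher-student framework to strengthen the security of face recognition systems.
A CNN-based teacher model refines a ViT-based student model.
Low-Rank Adaptation (LoRA) is used for fine-tuning, reducing computational cost while maintaining high detection accuracy.
Experiments use a morphing dataset built from three public face datasets and ten morphing generation algorithms.
Against six state-of-the-art S-MAD techniques, the method shows superior detection performance and computational efficiency.
The method detects face images that blend the biometric features of multiple individuals, which helps secure face recognition systems.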
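
The two ingredients named above can be illustrated with a short sketch. The abstract does not say how the teacher's knowledge is transferred, so the loss below assumes the common logit-based form of knowledge distillation (KL divergence between temperature-softened outputs); the function names, the temperature, and the 768-dimensional layer size are all assumptions for illustration. The second function counts the trainable parameters of a LoRA update, where a weight change is factored as `B @ A` with a small rank.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

# Two-class toy logits (bona fide vs. morph), made up for illustration.
teacher = [2.0, -1.0]
student = [0.0, 0.0]
print(distillation_loss(student, teacher))

# A hypothetical 768 x 768 projection: full fine-tuning vs. rank-8 LoRA.
full = 768 * 768
lora = lora_trainable_params(768, 768, rank=8)
print(full, lora, full // lora)
```

For this hypothetical layer, the rank-8 factors hold 48 times fewer trainable parameters than the full matrix, which is where the reduction in fine-tuning cost comes from.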


Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing

Authors: Hossein Mohebbi, Mohammed Abdulrahman, Yanting Miao, Pascal Poupart, Suraj Kothawade

Recent advances in text-to-image generation have produced strong single-shot models, yet no individual system reliably executes the long, compositional prompts typical of creative workflows. We introduce Image-POSER, a reflective reinforcement learning framework that (i) orchestrates a diverse registry of pretrained text-to-image and image-to-image experts, (ii) handles long-form prompts end-to-end through dynamic task decomposition, and (iii) supervises alignment at each step via structured feedback from a vision-language model critic. By casting image synthesis and editing as a Markov Decision Process, we learn non-trivial expert pipelines that adaptively combine strengths across models. Experiments show that Image-POSER outperforms baselines, including frontier models, across industry-standard and custom benchmarks in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations. These results highlight that reinforcement learning can endow AI systems with the capacity to autonomously decompose, reorder, and combine visual models, moving towards general-purpose visual assistants.


Paper and project links

PDF

Summary

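This paper introduces Image-POSER, a reflective reinforcement learning framework that orchestrates a diverse registry of pretrained text-to-image and image-to-image experts, handles long-form prompts end-to-end through dynamic task decomposition, and supervises alignment at each step via structured feedback from a vision-language model critic. By casting image synthesis and editing as a Markov Decision Process, Image-POSER learns expert pipelines that adaptively combine the strengths of different models. Experiments show that Image-POSER outperforms baselines, including frontier models, in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations. The results highlight that reinforcement learning can give AI systems the capacity to autonomously decompose, reorder, and combine visual models, moving towards general-purpose visual assistants.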

Key Takeaways

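Image-POSER is a reflective reinforcement learning framework that orchestrates multiple pretrained text-to-image and image-to-image experts.
The framework handles long-form prompts end-to-end through dynamic task decomposition.
A vision-language model critic supplies structured feedback that supervises alignment at each step.
Image synthesis and editing are cast as a Markov Decision Process.
Image-POSER adaptively combines the strengths of individual models into non-trivial expert pipelines.
Experiments show that Image-POSER outperforms baselines, including frontier models, in alignment, fidelity, and aesthetics.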


LeakyCLIP: Extracting Training Data from CLIP

Authors: Yunhao Chen, Shujie Wang, Xin Wang, Xingjun Ma

Understanding the memorization and privacy leakage risks in Contrastive Language–Image Pretraining (CLIP) is critical for ensuring the security of multimodal models. Recent studies have demonstrated the feasibility of extracting sensitive training examples from diffusion models, with conditional diffusion models exhibiting a stronger tendency to memorize and leak information. In this work, we investigate data memorization and extraction risks in CLIP through the lens of CLIP inversion, a process that aims to reconstruct training images from text prompts. To this end, we introduce LeakyCLIP, a novel attack framework designed to achieve high-quality, semantically accurate image reconstruction from CLIP embeddings. We identify three key challenges in CLIP inversion: 1) non-robust features, 2) limited visual semantics in text embeddings, and 3) low reconstruction fidelity. To address these challenges, LeakyCLIP employs 1) adversarial fine-tuning to enhance optimization smoothness, 2) linear transformation-based embedding alignment, and 3) Stable Diffusion-based refinement to improve fidelity. Empirical results demonstrate the superiority of LeakyCLIP, achieving over 258% improvement in Structural Similarity Index Measure (SSIM) for ViT-B-16 compared to baseline methods on LAION-2B subset. Furthermore, we uncover a pervasive leakage risk, showing that training data membership can even be successfully inferred from the metrics of low-fidelity reconstructions. Our work introduces a practical method for CLIP inversion while offering novel insights into the nature and scope of privacy risks in multimodal models.


Paper and project links

PDF

Summary

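This paper examines memorization and privacy leakage risks in Contrastive Language–Image Pretraining (CLIP) models. Studying the problem through CLIP inversion, it introduces the LeakyCLIP attack framework, which achieves high-quality, semantically accurate image reconstruction from CLIP embeddings, with a dedicated component for each of the three key challenges in CLIP inversion. LeakyCLIP improves the Structural Similarity Index Measure (SSIM) by over 258% for ViT-B-16 compared to baseline methods on a LAION-2B subset. The paper also shows that training data membership can be inferred even from the metrics of low-fidelity reconstructions, which points to a pervasive leakage risk. The work offers a practical inversion method and new insights into privacy risks in multimodal models.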

Key Takeaways

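The study targets memorization and privacy leakage risks in CLIP, which matter for the security of multimodal models.
It approaches the problem through CLIP inversion and introduces the LeakyCLIP attack framework for image reconstruction.
LeakyCLIP addresses three key challenges in CLIP inversion: non-robust features, limited visual semantics in text embeddings, and low reconstruction fidelity.
It does so with adversarial fine-tuning, linear transformation-based embedding alignment, and Stable Diffusion-based refinement.
LeakyCLIP improves SSIM by over 258% for ViT-B-16 compared to baseline methods on a LAION-2B subset.
Training data membership can be inferred even from the metrics of low-fidelity reconstructions, indicating a pervasive leakage risk.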
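
For readers unfamiliar with the headline metric, the sketch below computes a simplified SSIM and a relative improvement. It is an illustration under assumptions: the standard SSIM is averaged over local windows, while `global_ssim` here uses one window covering the whole (flattened) image; the constants follow the usual choice of K1 = 0.01 and K2 = 0.03; the pixel values and the two scores passed to `relative_improvement` are made up and are not results from the paper.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over two flattened images of equal length."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Small constants that stabilize the ratio when means or variances are near zero.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def relative_improvement(new, old):
    """Percentage improvement of `new` over `old`."""
    return (new - old) / old * 100.0

# Toy 4-pixel "images" in [0, 1], made up for illustration.
original = [0.1, 0.4, 0.6, 0.9]
reconstruction = [0.3, 0.45, 0.55, 0.7]
print(global_ssim(original, original))
print(global_ssim(original, reconstruction))

# Hypothetical scores: a 258% improvement means the new score is 3.58x the old one.
print(relative_improvement(0.358, 0.1))
```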


FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models

Authors: Hongyang Wang, Yichen Shi, Zhuofu Tao, Yuhao Gao, Liepiao Zhang, Xun Lin, Jun Feng, Xiaochen Yuan, Zitong Yu, Xiaochun Cao

Face anti-spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. Previous methods approached this task as a classification problem, lacking interpretability and reasoning behind the predicted results. Recently, multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and decision-making in visual tasks. However, there is currently no universal and comprehensive MLLM and dataset specifically designed for the FAS task. To address this gap, we propose FaceShield, an MLLM for FAS, along with the corresponding pre-training and supervised fine-tuning (SFT) datasets, FaceShield-pre10K and FaceShield-sft45K. FaceShield is capable of determining the authenticity of faces, identifying types of spoofing attacks, providing reasoning for its judgments, and detecting attack areas. Specifically, we employ spoof-aware vision perception (SAVP) that incorporates both the original image and auxiliary information based on prior knowledge. We then use a prompt-guided vision token masking (PVTM) strategy to randomly mask vision tokens, thereby improving the model’s generalization ability. We conducted extensive experiments on three benchmark datasets, demonstrating that FaceShield significantly outperforms previous deep learning models and general MLLMs on four FAS tasks, i.e., coarse-grained classification, fine-grained classification, reasoning, and attack localization. Our instruction datasets, protocols, and codes will be released at https://github.com/Why0912/FaceShield.


Paper and project links

PDF Accepted by AAAI 2025. Hongyang Wang and Yichen Shi contribute equally. Corresponding author: Zitong Yu

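Summary

Face anti-spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. Earlier methods treat the task as plain classification and offer little interpretability, so this work proposes FaceShield, a multimodal large language model (MLLM) for FAS, together with the pre-training and supervised fine-tuning datasets FaceShield-pre10K and FaceShield-sft45K. FaceShield can determine whether a face is authentic, identify the type of spoofing attack, give reasons for its judgment, and localize the attack area. It uses spoof-aware vision perception (SAVP), which combines the original image with auxiliary information based on prior knowledge, and a prompt-guided vision token masking (PVTM) strategy that improves generalization. Experiments on three benchmark datasets show that FaceShield outperforms previous deep learning models and general MLLMs on four FAS tasks. The resources will be released at https://github.com/Why0912/FaceShield.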

Key Takeaways

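Face anti-spoofing (FAS) is a key technique for protecting facial recognition systems from presentation attacks.
Existing methods lack interpretability and reasoning when handling FAS.
FaceShield is an MLLM for FAS that determines face authenticity, identifies the attack type, explains its judgment, and localizes the attack area.
FaceShield uses spoof-aware vision perception (SAVP) and prompt-guided vision token masking (PVTM) to improve generalization.
On three benchmark datasets, FaceShield outperforms previous deep learning models and general MLLMs on four FAS tasks: coarse-grained classification, fine-grained classification, reasoning, and attack localization.
FaceShield comes with the pre-training and supervised fine-tuning datasets FaceShield-pre10K and FaceShield-sft45K.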


RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models

Authors: Yujin Wang, Quanfeng Liu, Jiaqi Fan, Jinlong Hong, Hongqing Chu, Mengjian Tian, Bingzhao Gao, Hong Chen

Understanding and addressing corner cases is essential for ensuring the safety and reliability of autonomous driving systems. Vision-language models (VLMs) play a crucial role in enhancing scenario comprehension, yet they face significant challenges, such as hallucination and insufficient real-world grounding, which compromise their performance in critical driving scenarios. In this work, RAC3, a novel framework designed to enhance the performance of VLMs in corner case comprehension, is proposed. RAC3 integrates a frequency-spatial fusion (FSF) image encoder, a cross-modal alignment training method for embedding models with hard and semi-hard negative mining, and a fast querying and retrieval pipeline based on K-Means clustering and hierarchical navigable small world (HNSW) indexing. A multimodal chain-of-thought (CoT) prompting strategy to guide analogical reasoning and reduce hallucinations during inference is introduced. Moreover, an update mechanism is integrated into RAC3 to ensure continual learning within the framework. Extensive experiments on the CODA and nuScenes datasets demonstrate that RAC3 significantly improves corner case comprehension across multiple downstream tasks. Compared to prior state-of-the-art methods, RAC3 achieves the highest final score of 74.46 on the CODA-LM benchmark and shows consistent performance gains when integrated with end-to-end frameworks like DriveLM. These results demonstrate the effectiveness of retrieval-augmented strategies and cross-modal alignment for safer and more interpretable autonomous driving.


Paper and project links

PDF Accepted by IEEE Transactions on Multimedia

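Summary

This paper stresses the importance of understanding and handling corner cases in autonomous driving systems. Vision-language models (VLMs) play a key role in scenario comprehension but suffer from hallucination and insufficient real-world grounding. The proposed RAC3 framework aims to improve VLM performance on corner case comprehension. It combines a frequency-spatial fusion (FSF) image encoder, a cross-modal alignment training method for embedding models with hard and semi-hard negative mining, and a fast querying and retrieval pipeline based on K-Means clustering and hierarchical navigable small world (HNSW) indexing. A multimodal chain-of-thought (CoT) prompting strategy guides analogical reasoning and reduces hallucinations during inference, and an update mechanism supports continual learning within the framework. Extensive experiments on the CODA and nuScenes datasets show that RAC3 significantly improves corner case comprehension and achieves the highest final score, 74.46, on the CODA-LM benchmark. It also shows consistent gains when integrated with end-to-end frameworks such as DriveLM. The results indicate that retrieval-augmented strategies and cross-modal alignment are effective for safer and more interpretable autonomous driving.

Key Takeaways

Understanding and handling corner cases is essential for the safety and reliability of autonomous driving systems.
Vision-language models (VLMs) play a key role in driving scenario comprehension but still face hallucination and insufficient real-world grounding.
The RAC3 framework improves VLM performance on corner case comprehension through retrieval augmentation.
RAC3 uses a frequency-spatial fusion (FSF) image encoder, cross-modal alignment training with hard and semi-hard negative mining, and a fast querying and retrieval pipeline.
A multimodal chain-of-thought (CoT) prompting strategy guides analogical reasoning and reduces hallucinations during inference.
An update mechanism in RAC3 supports continual learning within the framework.
On the CODA and nuScenes datasets, RAC3 significantly improves corner case comprehension and reaches a final score of 74.46 on the CODA-LM benchmark.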
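
The retrieval pipeline pairs coarse clustering with a fast nearest-neighbor index. The sketch below shows only the cluster-routing idea, under stated assumptions: the centroids are taken as given rather than fitted with K-Means, and the search inside a cluster is an exact scan standing in for an HNSW index, which is too large to reproduce here. The function names, scene names, and 2-dimensional embeddings are hypothetical.

```python
import math

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_centroid(vec, centroids):
    """Index of the centroid closest to `vec`."""
    return min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))

def build_index(vectors, centroids):
    """Group stored vector ids by their nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        buckets[nearest_centroid(vec, centroids)].append(vec_id)
    return buckets

def retrieve(query, vectors, centroids, buckets, k=1):
    """Route the query to one cluster, then rank only that cluster's members."""
    cluster = nearest_centroid(query, centroids)
    ranked = sorted(buckets[cluster], key=lambda vec_id: dist(query, vectors[vec_id]))
    return ranked[:k]

# Toy corner-case embeddings, made up for illustration.
vectors = {
    "fog": [0.1, 0.2],
    "debris": [0.2, 0.1],
    "animal": [0.9, 0.8],
    "cone": [0.8, 0.9],
}
centroids = [[0.15, 0.15], [0.85, 0.85]]  # assumed output of a prior K-Means fit
buckets = build_index(vectors, centroids)
print(retrieve([0.88, 0.8], vectors, centroids, buckets, k=1))
```

Routing through a centroid first means each query is compared against one cluster rather than the whole store; the trade-off is that a neighbor sitting just across a cluster boundary can be missed.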


ComFe: An Interpretable Head for Vision Transformers

Authors: Evelyn J. Mannix, Liam Hodgkinson, Howard Bondell

Interpretable computer vision models explain their classifications through comparing the distances between the local embeddings of an image and a set of prototypes that represent the training data. However, these approaches introduce additional hyper-parameters that need to be tuned to apply to new datasets, scale poorly, and are more computationally intensive to train in comparison to black-box approaches. In this work, we introduce Component Features (ComFe), a highly scalable interpretable-by-design image classification head for pretrained Vision Transformers (ViTs) that can obtain competitive performance in comparison to comparable non-interpretable methods. To our knowledge, ComFe is the first interpretable head and unlike other interpretable approaches can be readily applied to large-scale datasets such as ImageNet-1K. Additionally, ComFe provides improved robustness and outperforms previous interpretable approaches on key benchmark datasets while using a consistent set of hyperparameters and without finetuning the pretrained ViT backbone. With only global image labels and no segmentation or part annotations, ComFe can identify consistent component features within an image and determine which of these features are informative in making a prediction. Code is available at github.com/emannix/comfe-component-features.


Paper and project links

PDF

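Summary

Interpretable classification models explain their decisions by comparing the distances between an image's local embeddings and a set of prototypes that represent the training data. These approaches, however, introduce extra hyperparameters that must be tuned for each new dataset, scale poorly, and cost more to train than black-box methods. This work introduces Component Features (ComFe), a highly scalable, interpretable-by-design image classification head for pretrained Vision Transformers (ViTs) that achieves performance competitive with comparable non-interpretable methods. According to the authors, ComFe is the first interpretable head and, unlike other interpretable approaches, can be readily applied to large-scale datasets such as ImageNet-1K. ComFe also offers improved robustness and outperforms previous interpretable approaches on key benchmark datasets, using a consistent set of hyperparameters and without fine-tuning the pretrained ViT backbone. With only global image labels and no segmentation or part annotations, ComFe identifies consistent component features within an image and determines which of them are informative for the prediction.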
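
The prototype-distance idea that this line of work builds on can be sketched as follows. This shows the general mechanism only, not ComFe's head: each patch embedding is assigned to its nearest prototype, and the prediction is a vote over the class-bearing prototypes, so the per-patch assignments double as the explanation. The prototype names, class labels, and 2-dimensional embeddings are hypothetical.

```python
import math
from collections import Counter

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def explain(patch_embeddings, prototypes):
    """Assign each patch to its nearest prototype and vote on a class.

    prototypes: {prototype_name: (embedding, class_label_or_None)}
    Returns (predicted_class, per-patch prototype names).
    """
    assignments = [
        min(prototypes, key=lambda name: dist(patch, prototypes[name][0]))
        for patch in patch_embeddings
    ]
    # Prototypes with no class label (e.g. background) do not vote.
    votes = Counter(
        prototypes[name][1] for name in assignments if prototypes[name][1] is not None
    )
    predicted = votes.most_common(1)[0][0] if votes else None
    return predicted, assignments

# Toy prototypes and patches, made up for illustration.
prototypes = {
    "wing": ([1.0, 0.0], "bird"),
    "wheel": ([0.0, 1.0], "car"),
    "sky": ([-1.0, 0.0], None),
}
patches = [[0.9, 0.1], [0.8, 0.2], [-0.9, 0.1], [0.1, 0.7]]
predicted, assignments = explain(patches, prototypes)
print(predicted, assignments)
```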

Key Takeaways