Vision Transformer


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only and should be used with caution.
🔴 Note: do not use these summaries in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-10-25

ACS-SegNet: An Attention-Based CNN-SegFormer Segmentation Network for Tissue Segmentation in Histopathology

Authors:Nima Torbati, Anastasia Meshcheryakova, Ramona Woitek, Diana Mechtcheriakova, Amirreza Mahbod

Automated histopathological image analysis plays a vital role in computer-aided diagnosis of various diseases. Among developed algorithms, deep learning-based approaches have demonstrated excellent performance in multiple tasks, including semantic tissue segmentation in histological images. In this study, we propose a novel approach based on attention-driven feature fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) within a unified dual-encoder model to improve semantic segmentation performance. Evaluation on two publicly available datasets showed that our model achieved μIoU/μDice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, outperforming state-of-the-art and baseline benchmarks. The implementation of our method is publicly available in a GitHub repository: https://github.com/NimaTorbati/ACS-SegNet

Paper and project links

PDF 5 pages

Summary
This study proposes a new method that combines convolutional neural networks (CNNs) and Vision Transformers (ViTs) in a unified dual-encoder model with attention-driven feature fusion to improve semantic segmentation performance. Evaluation on public datasets shows μIoU/μDice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, surpassing state-of-the-art and baseline benchmarks. The implementation is available on GitHub.

Key Takeaways

  • This work highlights the importance of automated histopathological image analysis in computer-aided diagnosis.
  • It proposes a new algorithm that combines CNNs and ViTs in a unified dual-encoder model with attention-driven feature fusion for semantic segmentation; a minimal fusion sketch follows this list.
  • Experiments on the two public datasets GCPS and PUMA show strong semantic segmentation performance, with μIoU/μDice scores on GCPS above those of existing methods.
  • The implementation details are publicly shared on GitHub for other researchers to reference and reuse.
  • The study underscores the potential of deep learning for histopathological image analysis, in particular approaches that combine CNNs and ViTs.
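
The abstract names the design (a unified dual encoder whose CNN and ViT features are fused through attention) but gives no implementation details, so the block below is a minimal, hypothetical PyTorch sketch of one way such a fusion block could look. The module name, channel sizes, and the squeeze-and-excitation-style gate are illustrative assumptions, not the authors' code; the actual implementation is in the linked GitHub repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Channel-attention fusion of a CNN feature map and a ViT feature map
    (an illustrative sketch, not the ACS-SegNet implementation)."""

    def __init__(self, cnn_channels: int, vit_channels: int, out_channels: int):
        super().__init__()
        self.proj_cnn = nn.Conv2d(cnn_channels, out_channels, kernel_size=1)
        self.proj_vit = nn.Conv2d(vit_channels, out_channels, kernel_size=1)
        # Squeeze-and-excitation style gate over the concatenated features.
        hidden = max(2 * out_channels // 4, 8)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * out_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2 * out_channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * out_channels, out_channels, kernel_size=1)

    def forward(self, cnn_feat: torch.Tensor, vit_feat: torch.Tensor) -> torch.Tensor:
        # Project both encoder outputs to a common width and spatial size.
        c = self.proj_cnn(cnn_feat)
        v = F.interpolate(self.proj_vit(vit_feat), size=c.shape[-2:],
                          mode="bilinear", align_corners=False)
        x = torch.cat([c, v], dim=1)
        return self.fuse(x * self.gate(x))  # attention-weighted fusion

# Dummy usage: CNN features at 64x64, ViT patch features reshaped to 32x32.
fusion = AttentionFusion(cnn_channels=256, vit_channels=768, out_channels=256)
out = fusion(torch.randn(2, 256, 64, 64), torch.randn(2, 768, 32, 32))
print(out.shape)  # torch.Size([2, 256, 64, 64])
```

In a full dual-encoder segmentation network, a block like this would sit between the two encoders and the decoder at each fused scale.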

Cool Papers

Click here to view paper screenshots

Transformed Multi-view 3D Shape Features with Contrastive Learning

Authors:Márcus Vinícius Lobo Costa, Sherlon Almeida da Silva, Bárbara Caroline Benato, Leo Sampaio Ferraz Ribeiro, Moacir Antonelli Ponti

This paper addresses the challenges in representation learning of 3D shape features by investigating state-of-the-art backbones paired with both contrastive supervised and self-supervised learning objectives. Computer vision methods struggle with recognizing 3D objects from 2D images, often requiring extensive labeled data and relying on Convolutional Neural Networks (CNNs) that may overlook crucial shape relationships. Our work demonstrates that Vision Transformers (ViTs) based architectures, when paired with modern contrastive objectives, achieve promising results in multi-view 3D analysis on our downstream tasks, unifying contrastive and 3D shape understanding pipelines. For example, supervised contrastive losses reached about 90.6% accuracy on ModelNet10. The use of ViTs and contrastive learning, leveraging ViTs’ ability to understand overall shapes and contrastive learning’s effectiveness, overcomes the need for extensive labeled data and the limitations of CNNs in capturing crucial shape relationships. The success stems from capturing global shape semantics via ViTs and refining local discriminative features through contrastive optimization. Importantly, our approach is empirical, as it is grounded on extensive experimental evaluation to validate the effectiveness of combining ViTs with contrastive objectives for 3D representation learning.

Paper and project links

PDF

Summary

This paper examines the challenges of 3D shape feature representation learning by pairing state-of-the-art backbones with contrastive supervised and self-supervised learning objectives. It notes that computer vision methods struggle to recognize 3D objects from 2D images, requiring large amounts of labeled data and relying on convolutional neural networks (CNNs) that may overlook crucial shape relationships. The study shows that Vision Transformer (ViT) based architectures, when paired with modern contrastive objectives, achieve encouraging results on downstream multi-view 3D analysis tasks, unifying the contrastive and 3D shape understanding pipelines; for example, a supervised contrastive loss reaches about 90.6% accuracy on ModelNet10. By exploiting ViTs' grasp of overall shape and the effectiveness of contrastive learning, the approach overcomes the need for extensive labeled data and CNNs' limitations in capturing key shape relationships. Its success stems from capturing global shape semantics with ViTs and refining local discriminative features through contrastive optimization.

Key Takeaways

  1. The paper explores pairing state-of-the-art backbones with contrastive supervised and self-supervised objectives for 3D shape feature representation learning.
  2. Recognizing 3D objects from 2D images remains challenging for computer vision and typically relies on large amounts of labeled data and convolutional neural networks (CNNs).
  3. Vision Transformer (ViT) architectures paired with modern contrastive objectives show strong performance on multi-view 3D analysis.
  4. A supervised contrastive loss reaches about 90.6% accuracy on ModelNet10; a minimal sketch of such a loss follows this list.
  5. Combining ViTs with contrastive learning reduces the need for labeled data and overcomes CNNs' limitations in capturing key shape relationships.
  6. ViTs capture global shape semantics, while contrastive optimization refines local discriminative features.
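
Takeaway 4 cites a supervised contrastive loss; as a concrete reference point, here is a minimal PyTorch sketch of the standard supervised contrastive objective applied to multi-view shape embeddings. The embedding dimension, batch layout, and temperature are illustrative assumptions, and the paper's exact training setup may differ.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss over L2-normalized embeddings.
    embeddings: (N, D) features, e.g. pooled ViT outputs of rendered views.
    labels:     (N,) class labels; views of the same shape share a label."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                          # (N, N) similarities
    n = z.size(0)
    not_self = ~torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
    # Log-softmax over all non-self pairs for each anchor.
    sim = sim.masked_fill(~not_self, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Mean log-probability of each anchor's positives.
    pos_counts = pos_mask.sum(dim=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts.clamp(min=1)
    return loss[pos_counts > 0].mean()

# Toy usage: 8 view embeddings of 4 shapes (2 rendered views per shape).
feats = torch.randn(8, 384, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
supervised_contrastive_loss(feats, labels).backward()
```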

Cool Papers

Click here to view paper screenshots

OmniVIC: A Self-Improving Variable Impedance Controller with Vision-Language In-Context Learning for Safe Robotic Manipulation

Authors:Heng Zhang, Wei-Hsing Huang, Gokhan Solak, Arash Ajoudani

We present OmniVIC, a universal variable impedance controller (VIC) enhanced by a vision-language model (VLM), which improves safety and adaptation in any contact-rich robotic manipulation task to enable safe physical interaction. Traditional VICs have shown advantages when the robot physically interacts with the environment, but they lack generalization to unseen, complex, and unstructured interactions in universal task scenarios involving contact or uncertainty. To this end, the proposed OmniVIC interprets task context by reasoning over images and natural language and generates adaptive impedance parameters for a VIC controller. Specifically, the core of OmniVIC is self-improving Retrieval-Augmented Generation (RAG) and in-context learning (ICL): RAG retrieves relevant prior experiences from a structured memory bank to inform the controller about similar past tasks, and ICL leverages these retrieved examples together with the prompt of the current task to query the VLM for context-aware, adaptive impedance parameters for the current manipulation scenario. Self-improving RAG and ICL therefore allow OmniVIC to operate across universal task scenarios. The impedance parameter regulation is further informed by real-time force/torque feedback to ensure interaction forces remain within safe thresholds. We demonstrate that our method outperforms baselines on a suite of complex contact-rich tasks, both in simulation and on real-world robotic tasks, with improved success rates and reduced force violations. OmniVIC takes a step towards bridging high-level semantic reasoning and low-level compliant control, enabling safer and more generalizable manipulation. Overall, the average success rate increases from 27% (baseline) to 61.4% (OmniVIC).

Paper and project links

PDF Code, video and RAG dataset are available at https://sites.google.com/view/omni-vic

Summary

OmniVIC is a universal variable impedance controller (VIC) combined with a vision-language model (VLM), designed to improve safety and adaptation in contact-rich robotic manipulation and to enable safer physical interaction. OmniVIC interprets the task context from images and natural language to generate adaptive impedance parameters, handling unseen, complex, and unstructured interactions in universal task scenarios. Its core is self-improving retrieval-augmented generation (RAG) and in-context learning (ICL): relevant prior experiences are retrieved from a structured memory bank and combined with the current task prompt to query the VLM for context-aware, adaptive impedance parameters for the current manipulation scenario. In addition, OmniVIC regulates the impedance parameters with real-time force/torque feedback to keep interaction forces within safe thresholds. On complex contact-rich tasks in both simulation and the real world, OmniVIC outperforms the baselines with higher success rates and fewer force violations, taking an important step towards bridging high-level semantic reasoning and low-level compliant control for safer, more generalizable manipulation.

Key Takeaways

  1. OmniVIC combines a vision-language model (VLM) with a variable impedance controller (VIC) to improve safety and adaptation in robotic manipulation tasks.
  2. OmniVIC interprets the task context (images and natural language) to generate adaptive impedance parameters, suited to complex, unstructured interaction scenarios.
  3. Its core is self-improving retrieval-augmented generation (RAG) and in-context learning (ICL), which retrieve relevant prior experiences from a structured memory bank and query the VLM for adaptive impedance parameters; a minimal sketch of this loop follows the list.
  4. OmniVIC regulates the impedance parameters with real-time force/torque feedback to keep interaction forces within a safe range.
  5. On complex contact-rich tasks in simulation and on real robots, OmniVIC outperforms baseline methods with higher success rates and fewer force violations.
  6. OmniVIC improves the safety and generality of robotic manipulation, taking an important step towards bridging high-level semantic reasoning and low-level control.
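
The abstract describes a retrieve-prompt-query-clamp loop but no code, so the following is a heavily simplified, hypothetical Python sketch of that flow. The memory-bank contents, the cosine-similarity retrieval, the `query_vlm` stub, and the force threshold are invented placeholders, not the OmniVIC implementation; the real code and RAG dataset are on the project page linked above.

```python
import numpy as np

# Hypothetical memory bank of prior experiences: (task embedding, in-context example text).
MEMORY_BANK = [
    (np.array([0.9, 0.1, 0.0]), "task: wipe table | stiffness Kp = [300, 300, 150] N/m"),
    (np.array([0.1, 0.8, 0.2]), "task: insert peg  | stiffness Kp = [800, 800, 400] N/m"),
]

def retrieve(task_embedding, k=1):
    """RAG step: return the k most similar prior experiences by cosine similarity."""
    sims = [float(e @ task_embedding /
                  (np.linalg.norm(e) * np.linalg.norm(task_embedding)))
            for e, _ in MEMORY_BANK]
    return [MEMORY_BANK[i][1] for i in np.argsort(sims)[::-1][:k]]

def query_vlm(prompt):
    """Placeholder for the VLM call; a real system would also pass the current camera
    image and parse the impedance parameters out of the model's response."""
    return np.array([500.0, 500.0, 250.0])  # hypothetical stiffness Kp in N/m

def impedance_parameters(task_embedding, task_text, wrench, force_limit=30.0):
    """ICL step: build a prompt from retrieved examples plus the current task,
    query the VLM, then clamp the result using force/torque feedback."""
    examples = "\n".join(retrieve(task_embedding))
    prompt = (f"Prior experiences:\n{examples}\n\n"
              f"Current task: {task_text}\nReturn the stiffness Kp.")
    kp = query_vlm(prompt)
    if np.linalg.norm(wrench[:3]) > force_limit:  # interaction force too high
        kp = 0.5 * kp                             # soften the controller
    return kp

print(impedance_parameters(np.array([0.8, 0.2, 0.1]), "wipe a whiteboard",
                           wrench=np.array([35.0, 0.0, 5.0, 0.0, 0.0, 0.0])))
```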

Cool Papers

Click here to view paper screenshots

VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

Authors:Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin

Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.

Paper and project links

PDF Accepted by NeurIPS 2025

Summary

This work introduces VT-FSL, a new cross-modal learning framework aimed at the mismatch between semantic understanding and visual evidence in few-shot learning. It constructs precise cross-modal prompts by combining a large language model (LLM) with the support images and integrates them seamlessly through geometry-aware alignment. The framework consists of Cross-modal Iterative Prompting and Cross-modal Geometric Alignment, which enrich the semantic understanding of novel classes and synthesize semantically consistent images to compensate for the limited support data. VT-FSL reaches new state-of-the-art performance across diverse benchmarks.

Key Takeaways

  1. The VT-FSL framework targets the mismatch between hallucinated semantics and visual evidence that arises in few-shot learning from the lack of grounding in actual instances.
  2. VT-FSL combines a large language model with the support images to construct precise cross-modal prompts.
  3. Cross-modal Iterative Prompting generates precise class descriptions, enriching the semantic understanding of novel classes and enabling the synthesis of semantically consistent images.
  4. The class descriptions and synthetic images act as complementary textual and visual prompts that compensate for the limited support data.
  5. Cross-modal Geometric Alignment jointly aligns the textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span; a small sketch of this volume follows the list.
  6. This alignment captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration.
  7. VT-FSL achieves state-of-the-art performance on multiple benchmarks, covering standard, cross-domain, and fine-grained few-shot learning scenarios.
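
Takeaway 5 minimizes the kernelized volume of the 3-dimensional parallelotope spanned by the textual, support, and synthetic visual representations. As a small illustration of the underlying quantity, the sketch below computes the plain (non-kernelized) volume as the square root of the Gram determinant of three vectors; the paper's kernelized version would replace the dot products with kernel evaluations, and the embedding dimension here is an arbitrary choice.

```python
import torch

def parallelotope_volume(t: torch.Tensor, s: torch.Tensor, v: torch.Tensor,
                         eps: float = 1e-8) -> torch.Tensor:
    """Volume of the 3-D parallelotope spanned by three representation vectors,
    i.e. sqrt(det(G)) with G the 3x3 Gram matrix of their inner products.
    A smaller volume means the three representations are closer to alignment."""
    V = torch.stack([t, s, v], dim=0)   # (3, D)
    G = V @ V.t()                       # (3, 3) Gram matrix of dot products
    return torch.sqrt(torch.clamp(torch.det(G), min=eps))

# Toy usage: text, support-image, and synthetic-image embeddings of one class.
text_emb = torch.randn(512, requires_grad=True)
support_emb = torch.randn(512, requires_grad=True)
synth_emb = torch.randn(512, requires_grad=True)
vol = parallelotope_volume(text_emb, support_emb, synth_emb)
vol.backward()  # differentiable, so it can be minimized as an alignment objective
print(vol.item())
```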

Cool Papers

Click here to view paper screenshots

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Authors:Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen

Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model’s attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model’s hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.

Paper and project links

PDF NeurIPS 2025

Summary
This paper proposes Vision-Aware Speculative Decoding (ViSpec), a new method for accelerating inference in vision-language models (VLMs). A lightweight vision adaptor module compresses the image tokens into a compact representation that is seamlessly integrated into the draft model's attention mechanism while preserving the original image positional information, and a global feature vector extracted from each input image is added to all subsequent text tokens to improve multimodal coherence. Experiments show that ViSpec achieves a substantial speedup in VLM speculative decoding.

Key Takeaways

  1. Speculative decoding techniques developed for large language models have so far transferred poorly to vision-language models (VLMs), leaving a significant gap.
  2. Large VLMs can filter redundant image information effectively, whereas small draft models struggle to do so.
  3. Vision-Aware Speculative Decoding (ViSpec) is a new framework tailored to VLMs.
  4. ViSpec uses a lightweight vision adaptor module to compress the image tokens and integrates them into the draft model's attention mechanism; a minimal sketch of such an adaptor follows the list.
  5. ViSpec extracts a global image feature vector and adds it to the text tokens to strengthen multimodal coherence.
  6. To overcome the scarcity of multimodal datasets with long assistant responses, the authors curate a specialized training dataset by repurposing existing datasets and generating extended outputs with the target VLM under modified prompts.
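
Takeaway 4 describes compressing the image tokens into a compact representation for the draft model. The block below is a generic, hypothetical sketch of such a compression module, using learnable query tokens that cross-attend to the image tokens; the token counts, dimensions, and layer choices are assumptions, and this is not the ViSpec code (which is available in the linked repository).

```python
import torch
import torch.nn as nn

class VisionAdaptor(nn.Module):
    """Compress a long sequence of image tokens into a few summary tokens via
    cross-attention with learnable queries (a sketch of the idea, not ViSpec's code)."""

    def __init__(self, dim: int = 768, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, N, dim), e.g. N = 576 patch tokens from the vision encoder.
        B = image_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        compressed, _ = self.attn(q, image_tokens, image_tokens)
        return self.norm(compressed)            # (B, num_queries, dim)

adaptor = VisionAdaptor()
compact = adaptor(torch.randn(2, 576, 768))
print(compact.shape)  # torch.Size([2, 16, 768]) -> fed to the draft model's attention
```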

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!