⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never rely on them for serious academic work; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-22
VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models
Authors:Qilin Liao, Anamika Lochab, Ruqi Zhang
Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.
Paper and Project Links
PDF 18 pages, 7 figures
Summary
Vision-Language Models (VLMs) combine large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. To address the shortcomings of existing work, we propose VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning the joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. A lightweight attacker is trained to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insight into vulnerabilities. VERA-V combines three complementary strategies: typography-based text prompts, diffusion-based image synthesis, and structured distractors that fragment VLM attention. Experiments on the HarmBench and HADES benchmarks show that VERA-V outperforms state-of-the-art baselines on both open-source and frontier VLMs, with up to a 53.75% higher attack success rate (ASR) than the best baseline on GPT-4o. These results point to new challenges and directions for improving VLM robustness.
Key Takeaways
- VLMs fuse visual and textual data, which introduces new vulnerability challenges.
- The VERA-V framework uses variational inference to model the joint posterior distribution over paired text-image prompts, overcoming the limitations of existing multimodal red-teaming methods.
- VERA-V can generate stealthy, coupled adversarial inputs that bypass model guardrails.
- A lightweight attacker is trained to approximate the posterior, enabling efficient sampling of jailbreaks and deeper insight into model vulnerabilities.
- VERA-V combines several strategies: typography-based text prompts, diffusion-based image synthesis, and structured distractors.
Automatic Classification of Circulating Blood Cell Clusters based on Multi-channel Flow Cytometry Imaging
Authors:Suqiang Ma, Subhadeep Sengupta, Yao Lee, Beikang Gu, Xianyan Chen, Xianqiao Wang, Yang Liu, Mengjia Xu, Galit H. Frydman, He Li
Circulating blood cell clusters (CCCs) containing red blood cells (RBCs), white blood cells (WBCs), and platelets are significant biomarkers linked to conditions like thrombosis, infection, and inflammation. Flow cytometry, paired with fluorescence staining, is commonly used to analyze these cell clusters, revealing cell morphology and protein profiles. While computational approaches based on machine learning have advanced the automatic analysis of single-cell flow cytometry images, there is a lack of effort to build tools to automatically analyze images containing CCCs. Unlike single cells, cell clusters often exhibit irregular shapes and sizes. In addition, these cell clusters often consist of heterogeneous cell types, which require multi-channel staining to identify the specific cell types within the clusters. This study introduces a new computational framework for analyzing CCC images and identifying cell types within clusters. Our framework uses a two-step analysis strategy. First, it categorizes images into cell cluster and non-cluster groups by fine-tuning the You Only Look Once (YOLOv11) model, which outperforms traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs). Then, it identifies cell types by overlaying cluster contours with regions from multi-channel fluorescence stains, enhancing accuracy despite cell debris and staining artifacts. This approach achieved over 95% accuracy in both cluster classification and phenotype identification. In summary, our automated framework effectively analyzes CCC images from flow cytometry, leveraging both bright-field and fluorescence data. Initially tested on blood cells, it holds potential for broader applications, such as analyzing immune and tumor cell clusters, supporting cellular research across various diseases.
Paper and Project Links
Summary
Circulating blood cell clusters (CCCs), which contain red blood cells (RBCs), white blood cells (WBCs), and platelets, are important biomarkers for conditions such as thrombosis, infection, and inflammation. Flow cytometry combined with fluorescence staining is commonly used to analyze these clusters, revealing cell morphology and protein profiles. Although machine-learning approaches have advanced single-cell flow cytometry image analysis, there are still no tools that automatically analyze images containing CCCs. Unlike single cells, cell clusters often have irregular shapes and sizes and consist of heterogeneous cell types, which requires multi-channel staining to identify the specific cell types within each cluster. This study introduces a new computational framework for analyzing CCC images and identifying the cell types within clusters, using a two-step strategy. First, it classifies images into cluster and non-cluster groups by fine-tuning the You Only Look Once (YOLOv11) model, outperforming traditional CNNs and Vision Transformers (ViTs). Then it identifies cell types by overlaying cluster contours with multi-channel fluorescence staining regions, which improves accuracy in the presence of cell debris and staining artifacts. The method exceeds 95% accuracy in both cluster classification and phenotype identification. In summary, this automated framework effectively analyzes CCC images from flow cytometry using both bright-field and fluorescence data. Initially tested on blood cells, it has potential for broader applications such as analyzing immune and tumor cell clusters, supporting cellular research across many diseases.
Key Takeaways
- Circulating blood cell clusters (CCCs) are important biomarkers for multiple diseases.
- Flow cytometry combined with fluorescence staining is the common way to analyze CCCs.
- Tools for automatically analyzing CCC images are currently lacking.
- The study introduces a new computational framework that analyzes CCC images with a two-step strategy.
- The framework first distinguishes cell clusters from non-clusters with a fine-tuned YOLOv11 model.
- The framework then identifies cell types by overlaying cluster contours with multi-channel fluorescence staining regions, as sketched below.
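To make the second step of the pipeline more concrete, here is a minimal illustrative sketch (not the authors' code) of how a detected cluster contour could be cross-referenced with thresholded multi-channel fluorescence images to assign cell types within the cluster. The channel-to-marker mapping, threshold values, and the `cluster_mask` input are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical mapping from fluorescence channel index to the cell type
# labeled by the corresponding stain (illustrative only).
CHANNEL_TO_CELL_TYPE = {0: "RBC", 1: "WBC", 2: "platelet"}

def identify_cluster_phenotypes(cluster_mask, fluorescence_stack,
                                intensity_threshold=0.2, overlap_threshold=0.05):
    """Assign cell types to a cluster by overlaying its contour mask
    with binarized multi-channel fluorescence images.

    cluster_mask:       (H, W) boolean mask of one detected cell cluster.
    fluorescence_stack: (C, H, W) float array, one normalized image per stain channel.
    Returns a dict mapping cell type -> fraction of the cluster area positive
    for that channel, keeping only fractions above overlap_threshold.
    """
    cluster_area = cluster_mask.sum()
    if cluster_area == 0:
        return {}

    phenotypes = {}
    for channel, cell_type in CHANNEL_TO_CELL_TYPE.items():
        # Binarize the stain channel; weak signal below the threshold is ignored.
        positive = fluorescence_stack[channel] > intensity_threshold
        # Fraction of the cluster covered by this stain.
        overlap_fraction = np.logical_and(cluster_mask, positive).sum() / cluster_area
        if overlap_fraction >= overlap_threshold:
            phenotypes[cell_type] = float(overlap_fraction)
    return phenotypes

# Example with synthetic data: a 64x64 cluster mask and 3 stain channels.
rng = np.random.default_rng(0)
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True
stack = rng.random((3, 64, 64))
print(identify_cluster_phenotypes(mask, stack))
```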
ZACH-ViT: A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification
Authors:Athanasios Angelakis, Amne Mousa, Micah L. A. Heldeweg, Laurens A. Biesheuvel, Mark A. Haaksma, Jasper M. Smit, Pieter R. Tuinman, Paul W. G. Elbers
Differentiating cardiogenic pulmonary oedema (CPE) from non-cardiogenic and structurally normal lungs in lung ultrasound (LUS) videos remains challenging due to the high visual variability of non-cardiogenic inflammatory patterns (NCIP/ARDS-like), interstitial lung disease, and healthy lungs. This heterogeneity complicates automated classification as overlapping B-lines and pleural artefacts are common. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a 0.25 M-parameter Vision Transformer variant that removes both positional embeddings and the [CLS] token, making it fully permutation-invariant and suitable for unordered medical image data. To enhance generalization, we propose ShuffleStrides Data Augmentation (SSDA), which permutes probe-view sequences and frame orders while preserving anatomical validity. ZACH-ViT was evaluated on 380 LUS videos from 95 critically ill patients against nine state-of-the-art baselines. Despite the heterogeneity of the non-cardiogenic group, ZACH-ViT achieved the highest validation and test ROC-AUC (0.80 and 0.79) with balanced sensitivity (0.60) and specificity (0.91), while all competing models collapsed to trivial classification. It trains 1.35x faster than Minimal ViT (0.62M parameters) with 2.5x fewer parameters, supporting real-time clinical deployment. These results show that aligning architectural design with data structure can outperform scale in small-data medical imaging.
Paper and Project Links
PDF 14 pages, 6 figures, 2 tables. Primary subject: cs.LG (Machine Learning) Cross-listed to: cs.CV (Computer Vision and Pattern Recognition), eess.IV (Image and Video Processing). Code available at: https://github.com/Bluesman79/ZACH-ViT Installation: pip install zachvit Paper licensed under CC BY-NC-ND 4.0. Code released under Apache 2.0 License
Summary
The Zero-token Adaptive Compact Hierarchical Vision Transformer (ZACH-ViT) addresses the problem of differentiating cardiogenic pulmonary oedema from non-cardiogenic conditions and structurally normal lungs in lung ultrasound. Removing positional embeddings and the global [CLS] token makes the model fully permutation-invariant, and ShuffleStrides Data Augmentation improves generalization. In the lung ultrasound video evaluation, ZACH-ViT delivered the best results, achieving the highest validation and test ROC-AUC, and it trains 1.35x faster than Minimal ViT with 2.5x fewer parameters, supporting real-time clinical deployment. The results show that, on small medical imaging datasets, matching the architecture to the data structure can outperform scale.
Key Takeaways
- ZACH-ViT tackles the difficult problem of distinguishing cardiogenic pulmonary oedema from non-cardiogenic lung conditions and structurally normal lungs.
- Removing positional embeddings and the global [CLS] token makes the model fully permutation-invariant, which suits unordered medical image data (a minimal sketch follows after this list).
- ShuffleStrides Data Augmentation improves generalization by permuting probe-view sequences and frame orders while preserving anatomical validity.
- In the lung ultrasound video evaluation, ZACH-ViT achieved the best performance, with the highest validation and test ROC-AUC.
- ZACH-ViT offers balanced sensitivity (0.60) and specificity (0.91), whereas the competing models collapsed to trivial classification.
- ZACH-ViT trains quickly with few parameters, making it suitable for real-time clinical deployment.
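To illustrate the two design choices highlighted above, here is a minimal, hypothetical PyTorch sketch (not the authors' implementation) of a transformer classifier that omits positional embeddings and the [CLS] token and mean-pools the tokens instead, which makes the output invariant to the ordering of the input tokens. The layer sizes and the token dimensionality are assumptions.

```python
import torch
import torch.nn as nn

class ZeroTokenViT(nn.Module):
    """Toy permutation-invariant transformer classifier:
    no positional embeddings, no [CLS] token, mean pooling over tokens."""

    def __init__(self, in_dim=768, embed_dim=64, depth=2, num_heads=4, num_classes=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)          # patch/feature embedding
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=2 * embed_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):                            # tokens: (B, N, in_dim)
        x = self.proj(tokens)                             # no positional embedding added
        x = self.encoder(x)                               # self-attention is order-agnostic here
        x = x.mean(dim=1)                                 # mean pool instead of a [CLS] token
        return self.head(x)

# Permutation-invariance check on random "patch" tokens.
model = ZeroTokenViT().eval()
tokens = torch.randn(1, 16, 768)
perm = torch.randperm(16)
with torch.no_grad():
    out_a = model(tokens)
    out_b = model(tokens[:, perm, :])
print(torch.allclose(out_a, out_b, atol=1e-5))  # True: token order does not matter
```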
M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
Authors:U. V. B. L Udugama, George Vosselman, Francesco Nex
Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.
Paper and Project Links
PDF Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures
Summary
This paper proposes Multi-Mono-Hydra (M2H), a novel multi-task learning framework for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. M2H uses a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, it is optimized for real-time deployment and underpins monocular spatial perception systems that support 3D scene graph construction in dynamic environments. Evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while remaining computationally efficient on laptop hardware.
Key Takeaways
- M2H is a multi-task learning framework for real-time spatial perception, suited to deployment on edge devices.
- It uses a Window-Based Cross-Task Attention Module that enables structured feature exchange and improves prediction consistency across tasks (a simplified sketch follows after this list).
- M2H is built on a lightweight ViT-based DINOv2 backbone and optimized for real-time deployment.
- The framework performs semantic segmentation and depth, edge, and surface normal estimation from a single monocular image.
- M2H outperforms state-of-the-art methods on several datasets, including NYUDv2, Hypersim, and Cityscapes.
- M2H has been validated on real-world data, demonstrating its practicality.
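The paper's central component is the Window-Based Cross-Task Attention Module. Below is a highly simplified, hypothetical sketch (not the authors' implementation) of cross-task attention in which one task's features provide the queries and another task's features provide the keys and values, computed within non-overlapping spatial windows. The window size, feature dimensions, and residual/normalization choices are assumptions.

```python
import torch
import torch.nn as nn

class WindowCrossTaskAttention(nn.Module):
    """Toy cross-task attention: task-A features query task-B features
    inside non-overlapping windows of size w x w."""

    def __init__(self, dim=64, num_heads=4, window=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def _to_windows(self, x):
        # x: (B, H, W, C) -> (B * num_windows, w*w, C)
        B, H, W, C = x.shape
        w = self.window
        x = x.view(B, H // w, w, W // w, w, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, H, W, C); H and W assumed divisible by the window size.
        q = self._to_windows(feat_a)
        kv = self._to_windows(feat_b)
        fused, _ = self.attn(q, kv, kv)       # task A attends to task B per window
        out = self.norm(q + fused)            # residual keeps task-specific detail
        B, H, W, C = feat_a.shape
        w = self.window
        out = out.view(B, H // w, W // w, w, w, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# Example: exchange information between segmentation and depth feature maps.
seg_feat = torch.randn(2, 16, 16, 64)
depth_feat = torch.randn(2, 16, 16, 64)
print(WindowCrossTaskAttention()(seg_feat, depth_feat).shape)  # torch.Size([2, 16, 16, 64])
```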
OmniVIC: A Self-Improving Variable Impedance Controller with Vision-Language In-Context Learning for Safe Robotic Manipulation
Authors:Heng Zhang, Wei-Hsing Huang, Gokhan Solak, Arash Ajoudani
We present OmniVIC, a universal variable impedance controller (VIC) enhanced by a vision language model (VLM), which improves safety and adaptation in any contact-rich robotic manipulation task to enhance safe physical interaction. Traditional VIC have shown advantages when the robot physically interacts with the environment, but lack generalization in unseen, complex, and unstructured safe interactions in universal task scenarios involving contact or uncertainty. To this end, the proposed OmniVIC interprets task context derived reasoning from images and natural language and generates adaptive impedance parameters for a VIC controller. Specifically, the core of OmniVIC is a self-improving Retrieval-Augmented Generation(RAG) and in-context learning (ICL), where RAG retrieves relevant prior experiences from a structured memory bank to inform the controller about similar past tasks, and ICL leverages these retrieved examples and the prompt of current task to query the VLM for generating context-aware and adaptive impedance parameters for the current manipulation scenario. Therefore, a self-improved RAG and ICL guarantee OmniVIC works in universal task scenarios. The impedance parameter regulation is further informed by real-time force/torque feedback to ensure interaction forces remain within safe thresholds. We demonstrate that our method outperforms baselines on a suite of complex contact-rich tasks, both in simulation and on real-world robotic tasks, with improved success rates and reduced force violations. OmniVIC takes a step towards bridging high-level semantic reasoning and low-level compliant control, enabling safer and more generalizable manipulation. Overall, the average success rate increases from 27% (baseline) to 61.4% (OmniVIC).
Paper and Project Links
PDF Code, video and RAG dataset are available at https://sites.google.com/view/omni-vic
Summary
OmniVIC is a universal variable impedance controller (VIC) enhanced by a vision-language model (VLM), aimed at improving safety and adaptability in contact-rich robotic manipulation. OmniVIC interprets task context from images and natural language and generates adaptive impedance parameters for the controller. Its core is self-improving Retrieval-Augmented Generation (RAG) combined with in-context learning (ICL): RAG retrieves relevant prior experiences from a structured memory bank to inform the controller about similar past tasks, while ICL uses the retrieved examples and the current task prompt to query the VLM for context-aware, adaptive impedance parameters. Impedance regulation is further informed by real-time force/torque feedback so that interaction forces stay within safe thresholds. Across a suite of complex contact-rich tasks in simulation and on real robots, the method outperforms the baselines, with higher success rates and fewer force violations. OmniVIC is a step toward bridging high-level semantic reasoning and low-level compliant control, making manipulation safer and more generalizable, and it raises the average success rate from 27% (baseline) to 61.4%.
Key Takeaways
- OmniVIC is a universal variable impedance controller (VIC) enhanced by a vision-language model (VLM), aimed at improving safety and adaptability in contact-rich robotic manipulation.
- OmniVIC interprets task context from images and natural language and generates adaptive impedance parameters.
- Its core mechanisms are self-improving Retrieval-Augmented Generation (RAG) and in-context learning (ICL).
- RAG retrieves relevant prior experiences from a structured memory bank to inform the controller about similar past tasks.
- ICL uses the retrieved examples and the current task prompt to query the VLM for impedance parameters adapted to the current manipulation scenario.
- OmniVIC regulates impedance parameters using real-time force/torque feedback to keep interaction forces within safe thresholds, as sketched below.
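As a rough illustration of the retrieval and safety ideas described above, the sketch below (not the authors' code) retrieves the most similar past experiences from a toy memory bank by cosine similarity, assembles them into an in-context prompt, and scales a stiffness parameter down when measured forces exceed a safety limit. The memory-bank format, the prompt wording, and all numeric limits are assumptions; a real system would send the prompt, together with images, to a VLM.

```python
import numpy as np

# Toy structured memory bank: each entry pairs a task embedding with the
# impedance parameters that worked for that task (illustrative values).
MEMORY_BANK = [
    {"embedding": np.array([0.9, 0.1, 0.0]), "task": "wipe table", "stiffness": 300.0, "damping": 30.0},
    {"embedding": np.array([0.1, 0.9, 0.0]), "task": "insert peg", "stiffness": 800.0, "damping": 60.0},
    {"embedding": np.array([0.0, 0.2, 0.9]), "task": "open drawer", "stiffness": 500.0, "damping": 45.0},
]

def retrieve(task_embedding, k=2):
    """Return the k most similar past experiences by cosine similarity."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    ranked = sorted(MEMORY_BANK, key=lambda e: cosine(task_embedding, e["embedding"]), reverse=True)
    return ranked[:k]

def build_icl_prompt(current_task, examples):
    """Assemble retrieved experiences plus the current task into a prompt for a VLM."""
    lines = ["Past experiences:"]
    for e in examples:
        lines.append(f"- task: {e['task']}, stiffness: {e['stiffness']}, damping: {e['damping']}")
    lines.append(f"Current task: {current_task}. Suggest stiffness and damping.")
    return "\n".join(lines)

def clamp_stiffness(stiffness, measured_force, force_limit=20.0, min_stiffness=100.0):
    """Scale stiffness down when the measured contact force exceeds the safety limit."""
    if measured_force <= force_limit:
        return stiffness
    scale = force_limit / measured_force
    return max(min_stiffness, stiffness * scale)

examples = retrieve(np.array([0.8, 0.2, 0.1]))
prompt = build_icl_prompt("wipe a curved surface", examples)
print(prompt)                                              # would be sent to the VLM with images
print(clamp_stiffness(stiffness=600.0, measured_force=35.0))  # reduced for safety
```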
Beyond RGB: Leveraging Vision Transformers for Thermal Weapon Segmentation
Authors:Akhila Kambhatla, Ahmed R Khaled
Thermal weapon segmentation is crucial for surveillance and security applications, enabling robust detection under low-light and visually obscured conditions where RGB-based systems fail. While convolutional neural networks (CNNs) dominate thermal segmentation literature, their ability to capture long-range dependencies and fine structural details is limited. Vision Transformers (ViTs), with their global context modeling capabilities, have achieved state-of-the-art results in RGB segmentation tasks, yet their potential in thermal weapon segmentation remains underexplored. This work adapts and evaluates four transformer-based architectures (SegFormer, DeepLabV3+, SegNeXt, and Swin Transformer) for binary weapon segmentation on a custom thermal dataset comprising 9,711 images collected from real-world surveillance videos and automatically annotated using SAM2. We employ standard augmentation strategies within the MMSegmentation framework to ensure robust model training and fair architectural comparison. Experimental results demonstrate significant improvements in segmentation performance: SegFormer-b5 achieves the highest mIoU (94.15%) and Pixel Accuracy (97.04%), while SegFormer-b0 provides the fastest inference speed (98.32 FPS) with competitive mIoU (90.84%). SegNeXt-mscans offers balanced performance with 85.12 FPS and 92.24% mIoU, and DeepLabV3+ R101-D8 reaches 92.76% mIoU at 29.86 FPS. The transformer architectures demonstrate robust generalization capabilities for weapon detection in low-light and occluded thermal environments, with flexible accuracy-speed trade-offs suitable for diverse real-time security applications.
Paper and Project Links
PDF 9 Images with 1 figure and 3 Tables. This is a preprint submitted to arXiv
Summary
This work studies thermal weapon segmentation for surveillance and security applications, targeting weapon detection under low-light and occluded conditions. Four transformer-based architectures (SegFormer, DeepLabV3+, SegNeXt, and Swin Transformer) are adapted and evaluated on a custom thermal imaging dataset. The experiments show that these architectures deliver significant gains in thermal weapon segmentation, generalize well, and offer flexible accuracy-speed trade-offs suitable for real-time security applications.
Key Takeaways
- Thermal weapon segmentation is crucial for surveillance and security applications, enabling robust detection under low-light and visually obscured conditions where RGB-based systems fail.
- Although CNNs dominate the thermal segmentation literature, their ability to capture long-range dependencies and fine structural details is limited.
- Vision Transformers (ViTs), with their global context modeling capability, have achieved state-of-the-art results on RGB segmentation tasks.
- This work evaluates four transformer-based architectures (SegFormer, DeepLabV3+, SegNeXt, and Swin Transformer) for thermal weapon segmentation on a custom thermal dataset.
- SegFormer-b5 achieves the highest mIoU (94.15%) and pixel accuracy (97.04%), while SegFormer-b0 provides the fastest inference (98.32 FPS) with a competitive mIoU (90.84%).
- SegNeXt-mscans offers balanced performance at 85.12 FPS and 92.24% mIoU, and DeepLabV3+ R101-D8 reaches 92.76% mIoU at 29.86 FPS (a sketch of how these metrics are computed follows below).
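For reference, the metrics reported above (mean IoU and pixel accuracy for binary segmentation) can be computed roughly as in the sketch below. This is a generic illustration on synthetic masks, not the MMSegmentation evaluation code used in the paper.

```python
import numpy as np

def binary_segmentation_metrics(pred, target):
    """Compute pixel accuracy and mean IoU for binary masks (0 = background, 1 = object)."""
    pred = pred.astype(bool)
    target = target.astype(bool)

    pixel_accuracy = (pred == target).mean()

    ious = []
    for pred_cls, gt_cls in [(~pred, ~target), (pred, target)]:   # background, foreground
        intersection = np.logical_and(pred_cls, gt_cls).sum()
        union = np.logical_or(pred_cls, gt_cls).sum()
        if union > 0:
            ious.append(intersection / union)
    mean_iou = float(np.mean(ious))
    return float(pixel_accuracy), mean_iou

# Synthetic example: a ground-truth square and a slightly shifted prediction.
gt = np.zeros((64, 64), dtype=np.uint8)
gt[20:40, 20:40] = 1
pred = np.zeros_like(gt)
pred[22:42, 22:42] = 1
print(binary_segmentation_metrics(pred, gt))
```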
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Authors:Guanqi Zhan, Yuanpei Liu, Kai Han, Weidi Xie, Andrew Zisserman
The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple MLP mapping network, to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited computing resources, we develop a ‘student friendly’ best practice, involving global hard sample mining, and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. The results demonstrate that ELIP significantly boosts CLIP/SigLIP/SigLIP-2 text-to-image retrieval performance and outperforms BLIP-2 on several benchmarks, as well as providing an easy means to adapt to OOD datasets.
Paper and Project Links
PDF Accepted by CBMI 2025 (IEEE International Conference on Content-Based Multimedia Indexing)
Summary:
This paper aims to improve text-to-image retrieval. It introduces ELIP (Enhanced Language-Image Pre-training), a framework that boosts the performance of large-scale pre-trained vision-language models for text-to-image re-ranking. ELIP uses the text query, via a simple MLP mapping network, to predict a set of visual prompts that condition the ViT image encoding, and it can be applied to the CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited compute, the authors develop a 'student friendly' best practice involving global hard sample mining and curation of a large-scale dataset. For evaluation, two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, are set up to assess zero-shot generalization to different domains. The results show that ELIP significantly improves CLIP/SigLIP/SigLIP-2 text-to-image retrieval, outperforms BLIP-2 on several benchmarks, and adapts easily to OOD datasets.
Key Takeaways:
- A new framework, ELIP, is introduced to improve text-to-image retrieval performance.
- ELIP boosts the performance of large-scale pre-trained vision-language models for text-to-image re-ranking.
- ELIP uses the text query, via an MLP mapping network, to predict visual prompts that condition the image encoding (a minimal sketch follows after this list).
- ELIP can easily be applied to the CLIP, SigLIP and BLIP-2 networks.
- The architecture is trained with a 'student friendly' best practice involving global hard sample mining and curation of a large-scale dataset.
- Two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, are set up to assess zero-shot generalization to different domains.
- ELIP significantly improves text-to-image retrieval performance and outperforms BLIP-2 on several benchmarks.
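To illustrate the core idea above (a text query predicting visual prompts that condition the image encoder), the following is a minimal hypothetical PyTorch sketch, not the ELIP implementation. The embedding sizes, the number of prompt tokens, and the stand-in image encoder are assumptions.

```python
import torch
import torch.nn as nn

class TextToVisualPrompts(nn.Module):
    """Map a text-query embedding to a set of visual prompt tokens via a small MLP."""

    def __init__(self, text_dim=512, vis_dim=768, num_prompts=4):
        super().__init__()
        self.num_prompts = num_prompts
        self.vis_dim = vis_dim
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, vis_dim),
            nn.GELU(),
            nn.Linear(vis_dim, num_prompts * vis_dim),
        )

    def forward(self, text_emb):                          # text_emb: (B, text_dim)
        prompts = self.mlp(text_emb)                      # (B, num_prompts * vis_dim)
        return prompts.view(-1, self.num_prompts, self.vis_dim)

# Stand-in for a ViT image encoder that consumes a token sequence.
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
image_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

prompt_net = TextToVisualPrompts()
text_emb = torch.randn(2, 512)                            # embeddings of 2 text queries
patch_tokens = torch.randn(2, 196, 768)                   # 14x14 patch tokens of 2 images

# Prepend the query-conditioned prompts to the patch tokens before encoding,
# so the image representation is conditioned on the text query for re-ranking.
prompts = prompt_net(text_emb)
tokens = torch.cat([prompts, patch_tokens], dim=1)        # (2, 4 + 196, 768)
print(image_encoder(tokens).shape)                        # torch.Size([2, 200, 768])
```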
Free$^2$Guide: Training-Free Text-to-Video Alignment using Image LVLM
Authors:Jaemin Kim, Bryan Sangwoo Kim, Jong Chul Ye
Diffusion models have achieved impressive results in generative tasks for text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependencies across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions trained for videos, hindering their scalability and applicability. In this paper, we propose Free$^2$Guide, a novel gradient-free and training-free framework for aligning generated videos with text prompts. Specifically, leveraging principles from path integral control, Free$^2$Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. To enable image-trained LVLMs to assess text-to-video alignment, we leverage stitching between video frames and use system prompts to capture sequential attributions. Our framework supports the flexible ensembling of multiple reward models to synergistically enhance alignment without significant computational overhead. Experimental results confirm that Free$^2$Guide using image-trained LVLMs significantly improves text-to-video alignment, thereby enhancing the overall video quality. Our results and code are available at https://kjm981995.github.io/free2guide/
Paper and Project Links
PDF ICCV 2025 accepted
Summary
Diffusion models have achieved impressive results in text-to-video (T2V) generation, but accurate text alignment remains a challenge. Existing reinforcement-learning approaches require differentiable reward functions trained for videos, which limits their scalability and applicability. This paper proposes Free$^2$Guide, a gradient-free, training-free framework that draws on path integral control to approximate guidance for diffusion models with non-differentiable reward functions, allowing powerful black-box large vision-language models (LVLMs) to serve as reward models. Text-to-video alignment is assessed by stitching video frames together and using system prompts to capture sequential attributions. The framework supports flexible ensembling of multiple reward models, which synergistically enhances alignment without significant computational overhead. Experiments show that Free$^2$Guide with image-trained LVLMs significantly improves text-to-video alignment and overall video quality.
Key Takeaways
- Diffusion models perform well on text-to-video generation, but accurate text alignment remains difficult.
- Existing reinforcement learning approaches require differentiable reward functions trained on videos, which limits their applicability.
- The Free$^2$Guide framework uses non-differentiable reward functions and principles from path integral control to align generated videos with text (a rough sketch follows after this list).
- Text-to-video alignment is assessed by stitching video frames together and using system prompts.
- Free$^2$Guide supports ensembling multiple reward models, which synergistically enhances alignment at low computational cost.
- Experiments show that Free$^2$Guide with image-trained LVLMs performs strongly on text-to-video alignment.
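As a rough illustration of gradient-free guidance with a non-differentiable reward, the sketch below scores several candidate samples with a black-box reward function and combines them with path-integral-style softmax weights. The reward function, the toy candidates, and the temperature are placeholders; this is not the Free$^2$Guide algorithm itself, only the general weighting idea it builds on.

```python
import numpy as np

def black_box_reward(candidate):
    """Placeholder for a non-differentiable reward, e.g. an LVLM scoring
    text-video alignment on stitched frames. Here: a toy score."""
    return -np.abs(candidate - 0.7).mean()

def reward_weighted_choice(candidates, temperature=0.1):
    """Weight candidates by exp(reward / temperature), as in path-integral-style
    control, and return both the weights and the best-scoring candidate."""
    rewards = np.array([black_box_reward(c) for c in candidates])
    logits = (rewards - rewards.max()) / temperature        # subtract max for stability
    weights = np.exp(logits) / np.exp(logits).sum()
    best = candidates[int(np.argmax(rewards))]
    return weights, best

# Toy "denoising candidates": several random draws standing in for sampled continuations.
rng = np.random.default_rng(0)
candidates = [rng.random(8) for _ in range(4)]
weights, best = reward_weighted_choice(candidates)
print(np.round(weights, 3), np.round(best, 2))
```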