⚠️ All of the summaries below are generated by a large language model; they may contain errors, are for reference only, and should be used with caution.
🔴 Please note: do not use them for serious academic work; they are intended only as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-25
A Unified Detection Pipeline for Robust Object Detection in Fisheye-Based Traffic Surveillance
Authors:Neema Jakisa Owor, Joshua Kofi Asamoah, Tanner Wambui Muturi, Anneliese Jakisa Owor, Blessing Agyei Kyem, Andrews Danyo, Yaw Adu-Gyamfi, Armstrong Aboah
Fisheye cameras offer an efficient solution for wide-area traffic surveillance by capturing large fields of view from a single vantage point. However, the strong radial distortion and nonuniform resolution inherent in fisheye imagery introduce substantial challenges for standard object detectors, particularly near image boundaries where object appearance is severely degraded. In this work, we present a detection framework designed to operate robustly under these conditions. Our approach employs a simple yet effective pre- and post-processing pipeline that enhances detection consistency across the image, especially in regions affected by severe distortion. We train several state-of-the-art detection models on the fisheye traffic imagery and combine their outputs through an ensemble strategy to improve overall detection accuracy. Our method achieves an F1 score of 0.6366 on the 2025 AI City Challenge Track 4, placing 8th overall out of 62 teams. These results demonstrate the effectiveness of our framework in addressing issues inherent to fisheye imagery.
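The abstract does not spell out the rule used to combine the detectors' outputs; one common and simple choice is to pool all boxes and apply class-wise non-maximum suppression. The sketch below illustrates that idea only and is not the authors' implementation; all box values and thresholds are made up.

```python
# Minimal sketch of a cross-model detection ensemble, assuming each detector
# returns boxes as (x1, y1, x2, y2, score, class_id). The paper does not
# publish its exact fusion rule; greedy class-wise NMS over the pooled
# detections is shown here for illustration only.
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def ensemble_nms(per_model_dets, iou_thr=0.55):
    """Pool detections from several models and keep the highest-scoring
    non-overlapping boxes per class (greedy NMS)."""
    dets = np.concatenate(per_model_dets, axis=0)
    keep = []
    for cls in np.unique(dets[:, 5]):
        d = dets[dets[:, 5] == cls]
        d = d[np.argsort(-d[:, 4])]           # sort by score, descending
        while len(d):
            best, d = d[0], d[1:]
            keep.append(best)
            if len(d):
                d = d[iou(best, d[:, :4]) < iou_thr]
    return np.stack(keep) if keep else np.empty((0, 6))

# Hypothetical usage with two detectors' outputs on one fisheye frame.
model_a = np.array([[10, 10, 50, 60, 0.90, 0], [200, 120, 260, 180, 0.40, 1]])
model_b = np.array([[12, 11, 52, 58, 0.85, 0]])
print(ensemble_nms([model_a, model_b]))
```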
Paper and project links
PDF The paper was accepted at ICCV 2025 and published in the CVF database.
Summary: Fisheye cameras are used for wide-area traffic surveillance, but the non-uniform resolution and strong radial distortion of their imagery challenge standard object detectors. This work proposes a detection framework designed for fisheye imagery that improves detection stability through a pre- and post-processing pipeline, especially in heavily distorted regions, and combines the outputs of several state-of-the-art detectors through an ensemble strategy to improve overall accuracy. The method achieves an F1 score of 0.6366 on Track 4 of the 2025 AI City Challenge, ranking 8th, demonstrating the framework's effectiveness on problems inherent to fisheye imagery.
Key Takeaways:
- Fisheye cameras capture a large field of view from a single vantage point, making them well suited to wide-area traffic surveillance.
- The non-uniform resolution and strong radial distortion of fisheye imagery make object detection challenging.
- The work proposes a detection framework designed for fisheye imagery, including pre- and post-processing stages.
- The framework maintains stable detection performance even in heavily distorted regions.
- Combining the outputs of several state-of-the-art detection models through an ensemble improves overall detection accuracy.
- A strong ranking on Track 4 of the AI City Challenge (8th of 62 teams) demonstrates the framework's effectiveness.
Click here to view paper screenshots
Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection
Authors:Ariana Yi, Ce Zhou, Liyang Xiao, Qiben Yan
As object detection models are increasingly deployed in cyber-physical systems such as autonomous vehicles (AVs) and surveillance platforms, ensuring their security against adversarial threats is essential. While prior work has explored adversarial attacks in the image domain, those attacks in the video domain remain largely unexamined, especially in the no-box setting. In this paper, we present α-Cloak, the first no-box adversarial attack on object detectors that operates entirely through the alpha channel of RGBA videos. α-Cloak exploits the alpha channel to fuse a malicious target video with a benign video, resulting in a fused video that appears innocuous to human viewers but consistently fools object detectors. Our attack requires no access to model architecture, parameters, or outputs, and introduces no perceptible artifacts. We systematically study the support for alpha channels across common video formats and playback applications, and design a fusion algorithm that ensures visual stealth and compatibility. We evaluate α-Cloak on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model (Gemini-2.0-Flash), demonstrating a 100% attack success rate across all scenarios. Our findings reveal a previously unexplored vulnerability in video-based perception systems, highlighting the urgent need for defenses that account for the alpha channel in adversarial settings.
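As a rough intuition for why an alpha-channel attack can work, the toy sketch below contrasts what a detector that reads only the RGB planes sees with what a viewer sees after alpha compositing. This illustrates the general principle only and is not the paper's fusion algorithm; the frames, alpha value, and compositing rule are assumptions.

```python
# Toy illustration (not the paper's fusion algorithm): many detectors read
# only the RGB planes of an RGBA frame, while a player alpha-composites the
# frame over other content. Putting the malicious frame in RGB with a low
# alpha therefore gives the two observers different pictures. All values are
# made up for the demo.
import numpy as np

def make_rgba(malicious_rgb, alpha=0.05):
    """Pack the malicious frame into RGB plus a uniform low alpha channel."""
    h, w, _ = malicious_rgb.shape
    a = np.full((h, w, 1), alpha, dtype=np.float32)
    return np.concatenate([malicious_rgb.astype(np.float32), a], axis=2)

def detector_view(rgba):
    """A detector that ignores alpha sees the malicious RGB content directly."""
    return rgba[..., :3]

def viewer_render(rgba, benign_rgb):
    """A player compositing the RGBA frame over the benign video shows the
    human mostly the benign content when alpha is small."""
    a = rgba[..., 3:4]
    return a * rgba[..., :3] + (1.0 - a) * benign_rgb.astype(np.float32)

benign = np.full((4, 4, 3), 200.0)      # stand-in benign frame
malicious = np.zeros((4, 4, 3))         # stand-in malicious frame
fused = make_rgba(malicious, alpha=0.05)
print(detector_view(fused).mean())           # ~0   -> detector sees malicious content
print(viewer_render(fused, benign).mean())   # ~190 -> human sees mostly benign content
```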
Paper and project links
Summary:
As object detection models are deployed in cyber-physical systems such as autonomous vehicles and surveillance platforms, this paper presents α-Cloak, a no-box adversarial attack that operates in the video domain. By using the alpha channel of RGBA video to fuse a malicious target video with a benign one, the attack produces a fused video that looks harmless to human viewers yet consistently fools object detectors. It requires no access to model architecture, parameters, or outputs and introduces no visible artifacts. The authors systematically study alpha-channel support across common video formats and playback applications and design a fusion algorithm that ensures visual stealth and compatibility. Evaluations show a 100% attack success rate in every scenario, revealing a new vulnerability in video-based perception systems and highlighting the urgent need for defenses that account for the alpha channel in adversarial settings.
Key Takeaways:
- The paper proposes α-Cloak, a no-box adversarial attack on object detectors in the video domain.
- α-Cloak uses the alpha channel to fuse a malicious video with a benign one, fooling object detectors.
- The attack requires no knowledge of model architecture, parameters, or outputs and introduces no visible artifacts.
- Alpha-channel support across common video formats and playback applications is studied systematically.
- A fusion algorithm is designed to ensure visual stealth and compatibility.
- Evaluated on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model, the attack achieves a 100% success rate.
Click here to view paper screenshots
HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking
Authors:Yao Deng, Xian Zhong, Wenxuan Liu, Zhaofei Yu, Jingling Yuan, Tiejun Huang
RGB cameras excel at capturing rich texture details with high spatial resolution, whereas event cameras offer exceptional temporal resolution and a high dynamic range (HDR). Leveraging their complementary strengths can substantially enhance object tracking under challenging conditions, such as high-speed motion, HDR environments, and dynamic background interference. However, a significant spatio-temporal asymmetry exists between these two modalities due to their fundamentally different imaging mechanisms, hindering effective multi-modal integration. To address this issue, we propose Hierarchical Asymmetric Distillation (HAD), a multi-modal knowledge distillation framework that explicitly models and mitigates spatio-temporal asymmetries. Specifically, HAD proposes a hierarchical alignment strategy that minimizes information loss while maintaining the student network's computational efficiency and parameter compactness. Extensive experiments demonstrate that HAD consistently outperforms state-of-the-art methods, and comprehensive ablation studies further validate the effectiveness and necessity of each designed component. The code will be released soon.
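HAD's alignment modules are not yet public, so the sketch below shows only a generic hierarchical feature-distillation loss in PyTorch: per-stage 1x1 projections plus an MSE term against detached teacher features. The stage channel widths and loss weights are placeholders, not the paper's configuration.

```python
# Generic sketch of hierarchical feature distillation, assuming a teacher and
# a student that each expose feature maps at several stages. The projections
# and MSE objective are illustrative placeholders, not HAD's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalDistillLoss(nn.Module):
    def __init__(self, student_chs, teacher_chs, weights=None):
        super().__init__()
        # One 1x1 conv per stage lifts student channels to the teacher's width.
        self.proj = nn.ModuleList(
            [nn.Conv2d(s, t, kernel_size=1) for s, t in zip(student_chs, teacher_chs)]
        )
        self.weights = weights or [1.0] * len(student_chs)

    def forward(self, student_feats, teacher_feats):
        loss = 0.0
        for proj, w, fs, ft in zip(self.proj, self.weights, student_feats, teacher_feats):
            fs = proj(fs)
            # Match spatial sizes before comparing (teacher may be higher-resolution).
            if fs.shape[-2:] != ft.shape[-2:]:
                fs = F.interpolate(fs, size=ft.shape[-2:], mode="bilinear", align_corners=False)
            loss = loss + w * F.mse_loss(fs, ft.detach())
        return loss

# Hypothetical usage with three feature stages from each network.
student_feats = [torch.randn(2, c, s, s) for c, s in [(64, 64), (128, 32), (256, 16)]]
teacher_feats = [torch.randn(2, c, s, s) for c, s in [(256, 64), (512, 32), (1024, 16)]]
criterion = HierarchicalDistillLoss([64, 128, 256], [256, 512, 1024])
print(criterion(student_feats, teacher_feats).item())
```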
Paper and project links
Summary: RGB cameras capture rich texture detail at high spatial resolution, while event cameras offer excellent temporal resolution and a high dynamic range (HDR). Combining their complementary strengths can substantially improve object tracking under challenging conditions such as high-speed motion, HDR environments, and dynamic background interference, but the fundamentally different imaging mechanisms of the two modalities create a spatio-temporal asymmetry that hinders effective multi-modal fusion. To address this, the paper proposes Hierarchical Asymmetric Distillation (HAD), a multi-modal knowledge distillation framework that explicitly models and mitigates the asymmetry. Extensive experiments show that HAD outperforms state-of-the-art methods, and comprehensive ablation studies validate the effectiveness and necessity of each component. Code will be released soon.
Key Takeaways:
- RGB and event cameras are complementary: RGB cameras capture rich texture detail at high spatial resolution, while event cameras offer excellent temporal resolution and a high dynamic range (HDR).
- Combining the two improves object tracking, especially under challenging conditions such as high-speed motion, HDR environments, and dynamic background interference.
- A spatio-temporal asymmetry between RGB and event cameras hinders effective multi-modal fusion.
- Hierarchical Asymmetric Distillation (HAD) is proposed to address this spatio-temporal asymmetry.
- By explicitly modeling and mitigating the asymmetry, HAD outperforms state-of-the-art methods.
- Extensive experiments and ablation studies validate HAD's effectiveness and the necessity of each designed component.
Click here to view paper screenshots
MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment
Authors:Bingyu Li, Feiyu Wang, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Most existing underwater instance segmentation approaches are constrained by closed-vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce MARIS (Marine Open-Vocabulary Instance Segmentation), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) segmentation, featuring a limited set of seen categories and diverse unseen categories. Although OV segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by the lack of underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (GPEM) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (SAIM) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines in both the In-Domain and Cross-Domain settings on MARIS, establishing a strong foundation for future underwater perception research.
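To make the SAIM idea more concrete, the sketch below shows an open-vocabulary classification step in which generic class-name embeddings are blended with domain-specific prior embeddings before cosine matching against mask embeddings. The blending rule, temperature, and tensor shapes are assumptions for illustration, not the paper's exact mechanism.

```python
# Minimal sketch of open-vocabulary mask classification, assuming mask
# embeddings from a segmenter and text embeddings from a frozen text encoder.
# The prior injection below is a simple weighted average, used only to
# illustrate the idea of enriching language embeddings with domain priors.
import torch
import torch.nn.functional as F

def inject_priors(class_text_emb, prior_emb, alpha=0.3):
    """Blend generic class-name embeddings with domain-specific prior
    embeddings (e.g., text describing underwater appearance), then renormalize."""
    mixed = (1 - alpha) * class_text_emb + alpha * prior_emb
    return F.normalize(mixed, dim=-1)

def classify_masks(mask_emb, class_emb, temperature=0.07):
    """Assign each mask embedding to the most similar class embedding."""
    logits = F.normalize(mask_emb, dim=-1) @ class_emb.t() / temperature
    return logits.argmax(dim=-1), logits.softmax(dim=-1)

# Hypothetical shapes: 5 predicted masks, 8 open-vocabulary classes, dim 512.
mask_emb = torch.randn(5, 512)
class_text_emb = F.normalize(torch.randn(8, 512), dim=-1)
prior_emb = F.normalize(torch.randn(8, 512), dim=-1)
class_emb = inject_priors(class_text_emb, prior_emb)
labels, probs = classify_masks(mask_emb, class_emb)
print(labels)
```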
Paper and project links
Summary
This paper introduces MARIS, the first large-scale benchmark for underwater open-vocabulary instance segmentation, and analyzes why existing open-vocabulary methods transfer poorly to underwater scenes: severe visual degradation and semantic misalignment. To address these issues, a unified framework is proposed that combines a Geometric Prior Enhancement Module with a Semantic Alignment Injection Mechanism, improving object consistency and the recognition of unseen categories. Experiments on MARIS show that the framework outperforms existing open-vocabulary baselines in both in-domain and cross-domain settings, laying a solid foundation for future underwater perception research.
Key Takeaways
- Underwater instance segmentation is limited by closed-vocabulary prediction, restricting the recognition of novel marine categories.
- MARIS is introduced as the first large-scale benchmark for underwater open-vocabulary segmentation, with a limited set of seen categories and diverse unseen categories.
- Existing methods transfer poorly to underwater scenes because of visual degradation and semantic misalignment.
- A unified framework is proposed, consisting of the Geometric Prior Enhancement Module (GPEM) and the Semantic Alignment Injection Mechanism (SAIM).
- GPEM uses stable part-level and structural cues to maintain object consistency under degraded visual conditions.
- SAIM enriches language embeddings with domain-specific priors, reducing semantic ambiguity and improving the recognition of unseen categories.
Click here to view paper screenshots
Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
Authors:Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.
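For readers reproducing the few-shot regime, the sketch below shows one straightforward way to build a k-shot support split from (image_id, class_name) annotation pairs. The benchmark ships its own splits, which should be preferred; this snippet only illustrates the sampling idea and uses made-up annotations.

```python
# Minimal sketch of building a few-shot split for concept alignment, assuming
# a dataset listed as (image_id, class_name) annotation pairs. Illustrative
# only; use the benchmark's official split files for real experiments.
import random
from collections import defaultdict

def make_k_shot_split(annotations, k=10, seed=0):
    """Pick up to k example images per class for few-shot alignment; the
    remaining images form the evaluation pool."""
    rng = random.Random(seed)
    by_class = defaultdict(set)
    for image_id, class_name in annotations:
        by_class[class_name].add(image_id)
    support, support_ids = {}, set()
    for class_name, image_ids in by_class.items():
        chosen = rng.sample(sorted(image_ids), min(k, len(image_ids)))
        support[class_name] = chosen
        support_ids.update(chosen)
    eval_ids = {img for img, _ in annotations} - support_ids
    return support, sorted(eval_ids)

# Hypothetical toy annotations: (image_id, class_name).
toy = [("img1", "stenosis"), ("img2", "stenosis"), ("img3", "drone"), ("img4", "drone")]
support, eval_ids = make_k_shot_split(toy, k=1)
print(support, eval_ids)
```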
Paper and project links
PDF The first two authors contributed equally. This work has been accepted to the Neural Information Processing Systems (NeurIPS) 2025 Datasets & Benchmark Track. Project Page: https://rf100-vl.org/
Summary: This paper introduces Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets, and uses it to evaluate state-of-the-art vision-language models. Although current models achieve strong zero-shot detection on common objects, they still struggle with out-of-distribution classes, tasks, and imaging modalities not found in their pre-training. Rather than simply re-training on more visual data, the authors argue for aligning VLMs to new concepts with annotation instructions that combine a few visual examples with rich textual descriptions. The paper also discusses the recent CVPR 2025 Foundational FSOD competition and insights from the community.
Key Takeaways:
- VLMs achieve strong zero-shot detection on common objects but still struggle with classes and tasks outside their pre-training distribution.
- For concepts not covered in pre-training, such as those in medical imaging, VLMs need concept alignment; GroundingDINO and Qwen2.5-VL score under 2% zero-shot accuracy on the challenging medical imaging datasets in Roboflow100-VL.
- The paper proposes aligning models to new concepts through annotation instructions that contain a few visual examples and rich textual descriptions.
- Roboflow releases Roboflow100-VL, a large multi-modal object detection benchmark for evaluating model performance.
- Models are evaluated on the benchmark in zero-shot, few-shot, semi-supervised, and fully supervised settings, allowing comparison across data regimes.
- In the CVPR 2025 Foundational FSOD competition, the winning team outperformed the baseline by a large margin (17 mAP).
Click here to view paper screenshots
gen2seg: Generative Models Enable Generalizable Instance Segmentation
Authors:Om Khangaonkar, Hamed Pirsiavash
By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE’s ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.
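The abstract names an instance coloring loss without defining it; one plausible reading is a pull/push objective on per-pixel predicted colors, sketched below. This is an assumption for illustration, not gen2seg's published loss; the margin and weighting are arbitrary.

```python
# Hedged sketch of an "instance coloring" style objective: pixels of the same
# instance are pulled toward their instance's mean predicted color, and
# instance means are pushed apart. Not the paper's published loss.
import torch
import torch.nn.functional as F

def instance_coloring_loss(pred, instance_ids, push_margin=0.5):
    """pred: (C, H, W) predicted per-pixel colors; instance_ids: (H, W) ints."""
    c = pred.shape[0]
    pred_flat = pred.reshape(c, -1).t()            # (H*W, C)
    ids = instance_ids.reshape(-1)
    means, pull = [], 0.0
    for inst in ids.unique():
        pix = pred_flat[ids == inst]
        mean = pix.mean(dim=0)
        means.append(mean)
        pull = pull + ((pix - mean) ** 2).mean()   # variance (pull) term
    means = torch.stack(means)
    # Push term: penalize instance means that sit closer than the margin.
    dists = torch.cdist(means, means)
    off_diag = dists + torch.eye(len(means)) * 1e6  # mask the diagonal
    push = F.relu(push_margin - off_diag).mean()
    return pull / len(means) + push

# Toy example: 3-channel prediction for a 4x4 image with two instances.
pred = torch.randn(3, 4, 4, requires_grad=True)
instance_ids = torch.tensor([[0, 0, 1, 1]] * 4)
loss = instance_coloring_loss(pred, instance_ids)
loss.backward()
print(loss.item())
```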
Paper and project links
PDF Website: https://reachomk.github.io/gen2seg/
Summary: By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn object boundaries and scene composition. The authors finetune Stable Diffusion and MAE (encoder + decoder) for category-agnostic instance segmentation using an instance coloring loss on only a narrow set of object types (indoor furnishings and cars). Surprisingly, the models show strong zero-shot generalization, accurately segmenting object types and styles unseen during finetuning. The best models approach the heavily supervised SAM on unseen object types and styles and outperform it on fine structures and ambiguous boundaries, while existing promptable or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining.
Key Takeaways:
- Generative models pretrained to synthesize coherent images from perturbed inputs naturally learn object boundaries and scene composition.
- Stable Diffusion and MAE are finetuned for instance segmentation with an instance coloring loss on a narrow set of object types.
- The models show strong zero-shot generalization, accurately segmenting object types and styles unseen during finetuning.
- The best models approach the heavily supervised SAM and outperform it on fine structures and ambiguous boundaries.
- Existing promptable segmentation architectures and discriminatively pretrained models fail to generalize in the same way.
- Generative models learn an inherent grouping mechanism that transfers across categories and domains.
Click here to view paper screenshots