
Detection / Segmentation / Tracking


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never rely on them in serious academic settings; they are only meant as a first-pass filter before actually reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-21

FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection

Authors:Jiangyong Yu, Changyong Shu, Sifan Zhou, Zichen Yu, Xing Hu, Yan Chen, Dawei Yang

Camera-based multi-view 3D detection is crucial for autonomous driving. PETR and its variants (PETRs) excel in benchmarks but face deployment challenges due to high computational cost and memory footprint. Quantization is an effective technique for compressing deep neural networks by reducing the bit width of weights and activations. However, directly applying existing quantization methods to PETRs leads to severe accuracy degradation. This issue primarily arises from two key challenges: (1) significant magnitude disparity between multi-modal features, specifically image features and camera-ray positional embeddings (PE), and (2) the inefficiency and approximation error of quantizing non-linear operators, which commonly rely on hardware-unfriendly computations. In this paper, we propose FQ-PETR, a fully quantized framework for PETRs, featuring three key innovations: (1) Quantization-Friendly LiDAR-ray Position Embedding (QFPE): Replacing multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding eliminates problematic non-linearities (e.g., inverse-sigmoid) and aligns PE scale with image features, preserving accuracy. (2) Dual-Lookup Table (DULUT): This algorithm approximates complex non-linear functions using two cascaded linear LUTs, achieving high fidelity with minimal entries and no specialized hardware. (3) Quantization After Numerical Stabilization (QANS): Performing quantization after softmax numerical stabilization mitigates attention distortion from large inputs. On PETRs (e.g., PETR, StreamPETR, PETRv2, MV2D), FQ-PETR under W8A8 achieves near-floating-point accuracy (1% degradation) while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines.

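To make the Dual-Lookup Table (DULUT) idea described in the abstract more concrete, below is a minimal NumPy sketch of approximating a non-linear activation with two cascaded, uniformly indexed linear LUTs. The table sizes, the arc-length warp used to build the first table, and GELU as the target function are all illustrative assumptions; the paper's actual table construction is not reproduced here.

```python
# Illustrative sketch of a dual-lookup-table (DULUT-style) approximation of a
# non-linear function with two cascaded, uniformly indexed linear LUTs.
# Table sizes, the warp construction, and the GELU target are assumptions.
import numpy as np

def gelu(x):
    # Target non-linearity (tanh approximation of GELU).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def build_dulut(f, xmin, xmax, n1=32, n2=32, fine=4096):
    # Dense reference grid used only when the tables are built.
    xs = np.linspace(xmin, xmax, fine)
    ys = f(xs)
    # Monotone warp: normalized arc length of f, so second-LUT entries are
    # spent where the function bends the most.
    arc = np.concatenate([[0.0], np.cumsum(np.hypot(np.diff(xs), np.diff(ys)))])
    warp = arc / arc[-1]                           # monotone, in [0, 1]
    # LUT1: uniform in x, stores the warped coordinate u = warp(x).
    x1 = np.linspace(xmin, xmax, n1)
    lut1 = np.interp(x1, xs, warp)
    # LUT2: uniform in u, stores f at the de-warped positions x(u).
    u2 = np.linspace(0.0, 1.0, n2)
    lut2 = np.interp(u2, warp, ys)
    return (x1, lut1), (u2, lut2)

def dulut_eval(x, tables):
    (x1, lut1), (u2, lut2) = tables
    u = np.interp(x, x1, lut1)      # first linear lookup: x -> u
    return np.interp(u, u2, lut2)   # second linear lookup: u -> f(x)

tables = build_dulut(gelu, -6.0, 6.0)
x = np.linspace(-6.0, 6.0, 1001)
err = np.max(np.abs(dulut_eval(x, tables) - gelu(x)))
print(f"max abs error with two 32-entry LUTs: {err:.4e}")
```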

Paper and project links

PDF This paper is accepted by AAAI 2026

Summary

PETR and its variants perform strongly on multi-view 3D detection for autonomous driving, but suffer from high computational cost and memory footprint. This paper proposes FQ-PETR, a fully quantized framework built on three key innovations: Quantization-Friendly LiDAR-ray Position Embedding, Dual-Lookup Table, and Quantization After Numerical Stabilization. Together they resolve the accuracy degradation that quantization otherwise causes for PETRs, enabling efficient compression while retaining near-floating-point accuracy.

Key Takeaways

  1. PETR and its variants face high computational cost and memory footprint when deployed for autonomous driving.
  2. Directly applying existing quantization methods to PETRs causes severe accuracy degradation.
  3. The FQ-PETR framework introduces three key innovations: Quantization-Friendly LiDAR-ray Position Embedding, Dual-Lookup Table, and Quantization After Numerical Stabilization.
  4. Quantization-Friendly LiDAR-ray Position Embedding replaces multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding, eliminating problematic non-linearities while preserving accuracy.
  5. Dual-Lookup Table approximates complex non-linear functions with two cascaded linear LUTs, achieving high fidelity without specialized hardware.
  6. Quantization After Numerical Stabilization performs quantization only after the softmax input has been numerically stabilized, mitigating attention distortion caused by large inputs (see the sketch below this list).
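A minimal NumPy sketch of the QANS ordering referenced in point 6: the softmax input is first numerically stabilized by subtracting the per-row maximum, and only the resulting probabilities are quantized, so large-magnitude attention inputs no longer distort the integer range. The uint8 quantizer and its scale are illustrative assumptions rather than the paper's exact scheme.

```python
# Minimal sketch of Quantization After Numerical Stabilization (QANS):
# stabilize the softmax input first, then quantize the probabilities.
import numpy as np

def quantize_uint8(x, scale):
    # Round-to-nearest uniform quantizer into [0, 255].
    return np.clip(np.round(x / scale), 0, 255).astype(np.uint8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def softmax_qans(logits, prob_scale=1.0 / 255.0):
    # 1) Numerical stabilization: subtract the per-row max so exp() stays bounded.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # 2) Quantize only the stabilized probabilities, which all lie in [0, 1].
    q = quantize_uint8(p, prob_scale)
    return dequantize(q, prob_scale)

# Attention logits with a few very large entries, as can arise when image
# features and positional embeddings differ strongly in magnitude.
rng = np.random.default_rng(0)
logits = rng.normal(0.0, 1.0, size=(4, 16))
logits[:, 0] += 50.0  # large-magnitude outliers
print(softmax_qans(logits).round(3))
```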

Cool Papers

Click here to view paper screenshots

MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation

Authors:Minhyun Lee, Seungho Lee, Song Park, Dongyoon Han, Byeongho Heo, Hyunjung Shim

Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies focused on aligning visual and language features, exploring training techniques, such as data augmentation, remains underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that the conventional image augmentations fall short of RIS, leading to performance degradation, while simple random masking significantly enhances the performance of RIS. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model’s robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.


Paper and project links

PDF Accepted to TMLR 2025. First two authors contributed equally

Summary

This work investigates effective data augmentation for Referring Image Segmentation (RIS) and proposes a new training framework, Masked Referring Image Segmentation (MaskRIS). The authors find that conventional image augmentations are a poor fit for RIS and can even degrade performance, whereas simple random masking improves it markedly. MaskRIS combines image and text masking with Distortion-aware Contextual Learning (DCL) to fully exploit the masking strategy, improving robustness to occlusion, incomplete information, and linguistic complexity, and yielding significant gains across data regimes. It achieves new state-of-the-art results on RefCOCO, RefCOCO+, and RefCOCOg.

Key Takeaways

  • RIS is an advanced vision-language task: identifying and segmenting the objects in an image described by free-form text.
  • Conventional image augmentations work poorly for RIS, while simple random masking actually improves performance.
  • MaskRIS masks both the image and the text, and uses DCL to fully exploit the masking strategy (a minimal sketch follows this list).
  • MaskRIS improves robustness to occlusion, incomplete information, and linguistic complexity.
  • MaskRIS delivers consistent gains in both fully and weakly supervised settings and achieves state-of-the-art results on RefCOCO, RefCOCO+, and RefCOCOg.
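As referenced above, the masking augmentation at the core of MaskRIS can be sketched in a few lines: randomly zero out image patches and randomly replace words in the referring expression before feeding both to the RIS model. The patch size, masking ratios, and "[MASK]" placeholder below are illustrative assumptions, not the exact MaskRIS configuration (which additionally uses Distortion-aware Contextual Learning).

```python
# Minimal sketch of MaskRIS-style input masking: random patch masking on the
# image and random word masking on the referring expression.
import numpy as np

def mask_image_patches(image, patch=16, ratio=0.3, rng=None):
    """image: (H, W, C) float array; zeroes out a random subset of patches."""
    rng = rng or np.random.default_rng()
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    keep = rng.random((gh, gw)) >= ratio           # True = keep this patch
    mask = np.kron(keep, np.ones((patch, patch)))  # upsample to pixel grid
    out = image.copy()
    out[:gh * patch, :gw * patch] *= mask[..., None]
    return out

def mask_text(words, ratio=0.2, rng=None):
    """words: list of tokens; replaces a random subset with a mask token."""
    rng = rng or np.random.default_rng()
    return [w if rng.random() >= ratio else "[MASK]" for w in words]

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3)).astype(np.float32)
expr = "the man in the red jacket holding an umbrella".split()
aug_img = mask_image_patches(img, rng=rng)
aug_expr = mask_text(expr, rng=rng)
print(aug_expr)
```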

Cool Papers

Click here to view paper screenshots

Detecting Out-of-Distribution Objects through Class-Conditioned Inpainting

Authors:Quang-Huy Nguyen, Jin Peng Zhou, Zhenzhen Liu, Khanh-Huyen Bui, Kilian Q. Weinberger, Wei-Lun Chao, Dung D. Le

Recent object detectors have achieved impressive accuracy in identifying objects seen during training. However, real-world deployment often introduces novel and unexpected objects, referred to as out-of-distribution (OOD) objects, posing significant challenges to model trustworthiness. Modern object detectors are typically overconfident, making it unreliable to use their predictions alone for OOD detection. To address this, we propose leveraging an auxiliary model as a complementary solution. Specifically, we utilize an off-the-shelf text-to-image generative model, such as Stable Diffusion, which is trained with objective functions distinct from those of discriminative object detectors. We hypothesize that this fundamental difference enables the detection of OOD objects by measuring inconsistencies between the models. Concretely, for a given detected object bounding box and its predicted in-distribution class label, we perform class-conditioned inpainting on the image with the object removed. If the object is OOD, the inpainted image is likely to deviate significantly from the original, making the reconstruction error a robust indicator of OOD status. Extensive experiments demonstrate that our approach consistently surpasses existing zero-shot and non-zero-shot OOD detection methods, establishing a robust framework for enhancing object detection systems in dynamic environments.


Paper and project links

PDF Accepted in WACV 2026 (Algorithms track)

Summary

Modern object detectors are highly accurate on objects seen during training, but real-world deployment introduces novel, out-of-distribution (OOD) objects that challenge model trustworthiness. Because detectors are typically overconfident, their predictions alone are unreliable for OOD detection. The authors therefore use an auxiliary model: an off-the-shelf text-to-image generator such as Stable Diffusion, trained with objectives fundamentally different from those of discriminative detectors, so that OOD objects can be detected by measuring inconsistencies between the two models. For each detected bounding box and its predicted in-distribution class, the object is removed and the image is inpainted conditioned on that class; if the object is OOD, the inpainted result deviates strongly from the original, making the reconstruction error a robust OOD indicator. Extensive experiments show the approach consistently outperforms existing zero-shot and non-zero-shot OOD detection methods, providing a robust framework for object detection systems in dynamic environments.

Key Takeaways

  1. Modern object detectors identify objects seen during training with high accuracy, but real-world deployment introduces novel and unexpected (OOD) objects.
  2. Relying on detector predictions alone for OOD detection is unreliable because detectors are typically overconfident.
  3. To address this, the paper leverages an auxiliary model, an off-the-shelf text-to-image generator such as Stable Diffusion, as a complementary signal.
  4. The generative model is trained with objectives fundamentally different from discriminative detectors, and this difference is what enables OOD detection.
  5. OOD objects are detected by measuring inconsistencies between the two models, concretely via class-conditioned inpainting of the image with the object removed (sketched below this list).
  6. The reconstruction error of the inpainted region is a robust indicator of OOD status.
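The test-time loop sketched below assumes the diffusers StableDiffusionInpaintPipeline API: mask out the detected box, inpaint conditioned on the detector's predicted class, and score the box by the reconstruction error inside the mask. The prompt template, the 512x512 resize, and the plain L2 error are illustrative choices and may differ from the paper's exact scoring rule.

```python
# Minimal sketch of class-conditioned inpainting for OOD object detection,
# assuming the diffusers StableDiffusionInpaintPipeline API.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def ood_score(image: Image.Image, box, pred_class: str) -> float:
    """box = (x0, y0, x1, y1) integer pixel coords from the detector;
    higher score = more likely OOD."""
    # Mask out the detected object so the generator must re-create it.
    mask = Image.new("L", image.size, 0)
    mask.paste(255, box)
    init = image.resize((512, 512))
    mask = mask.resize((512, 512))
    # Class-conditioned inpainting with the detector's predicted label.
    out = pipe(prompt=f"a photo of a {pred_class}",
               image=init, mask_image=mask).images[0]
    # Reconstruction error inside the box: a large error suggests the predicted
    # in-distribution class cannot explain the object, i.e. it is likely OOD.
    a = np.asarray(init, dtype=np.float32) / 255.0
    b = np.asarray(out, dtype=np.float32) / 255.0
    m = np.asarray(mask, dtype=np.float32)[..., None] / 255.0
    return float(((a - b) ** 2 * m).sum() / (m.sum() + 1e-8))
```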

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!