⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never use them for serious academic purposes; they are only meant as a first-pass screen before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-18
PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision–Language Models
Authors: Nhat Hoang-Xuan, Minh Vu, My T. Thai, Manish Bhattarai
Large vision-language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (prelim) tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
Paper and project links
Summary
Large vision-language models (LVLMs) are powerful but remain unreliable because of object hallucinations. This work finds that when an LVLM hallucinates, it often ignores the image and instead relies on previously generated output ("prelim") tokens to infer new objects. The behavior is quantified via the mutual information between the image and the predicted object conditioned on the prelim tokens, showing that weak image dependence correlates strongly with hallucination. Building on this finding, the authors introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no extra forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
Key Takeaways
- LVLMs are prone to object hallucinations, which makes their outputs unreliable.
- When hallucinating, LVLMs tend to rely on previously generated prelim tokens to infer new objects rather than on the image.
- Measuring the mutual information between the image and the predicted object, conditioned on the prelim tokens, shows that weak image dependence correlates strongly with hallucination.
- A new signal, the Prelim Attention Score (PAS), is introduced to measure how strongly the prediction relies on prelim tokens.
- PAS is a lightweight, training-free signal obtained on the fly from the attention weights over prelim tokens (see the sketch after this list).
- PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets.
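The abstract describes PAS only as a score computed from the attention weights over prelim (previously generated) tokens; the exact aggregation over heads and layers is not given here. The following is a minimal illustrative sketch, assuming PAS is taken as the fraction of one decoding step's attention mass that falls on prelim-token positions rather than image-token positions, averaged over heads. The function name prelim_attention_score and these aggregation choices are our assumptions, not the authors' implementation.

```python
import numpy as np

def prelim_attention_score(attn, prelim_idx, image_idx):
    """Illustrative PAS-style score for one decoding step.

    attn:       attention weights of the token being generated over all
                context positions, shape (num_heads, seq_len).
    prelim_idx: indices of previously generated (prelim) output tokens.
    image_idx:  indices of image tokens in the context.

    Returns the fraction of attention mass placed on prelim tokens relative
    to the mass on prelim plus image tokens, averaged over heads. A high
    value suggests the model leans on its own prior output, not the image.
    """
    prelim_mass = attn[:, prelim_idx].sum(axis=-1)    # (num_heads,)
    image_mass = attn[:, image_idx].sum(axis=-1)      # (num_heads,)
    ratio = prelim_mass / (prelim_mass + image_mass + 1e-8)
    return float(ratio.mean())

# Toy usage: 4 heads, 10 context positions (0-3 image tokens, 4-9 prelim tokens).
rng = np.random.default_rng(0)
attn = rng.random((4, 10))
attn /= attn.sum(axis=-1, keepdims=True)              # normalize per head
score = prelim_attention_score(attn, np.arange(4, 10), np.arange(0, 4))
print(f"PAS-style score: {score:.3f}")                # flag the prediction if above a threshold
```

A score like this can be thresholded during decoding to flag candidate hallucinated objects without any additional forward pass, which is what makes the signal cheap enough for real-time filtering.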
Click here to view paper screenshots
YOLO-Drone: An Efficient Object Detection Approach Using the GhostHead Network for Drone Images
Authors: Hyun-Ki Jung
Object detection using images or videos captured by drones is a promising technology with significant potential across various industries. However, a major challenge is that drone images are typically taken from high altitudes, making object identification difficult. This paper proposes an effective solution to address this issue. The base model used in the experiments is YOLOv11, the latest object detection model, with a specific implementation based on YOLOv11n. The experimental data were sourced from the widely used and reliable VisDrone dataset, a standard benchmark in drone-based object detection. This paper introduces an enhancement to the Head network of the YOLOv11 algorithm, called the GhostHead Network. The model incorporating this improvement is named YOLO-Drone. Experimental results demonstrate that YOLO-Drone achieves significant improvements in key detection accuracy metrics, including Precision, Recall, F1-Score, and mAP (0.5), compared to the original YOLOv11. Specifically, the proposed model recorded a 0.4% increase in Precision, a 0.6% increase in Recall, a 0.5% increase in F1-Score, and a 0.5% increase in mAP (0.5). Additionally, the Inference Speed metric, which measures image processing speed, also showed a notable improvement. These results indicate that YOLO-Drone is a high-performance model with enhanced accuracy and speed compared to YOLOv11. To further validate its reliability, comparative experiments were conducted against other high-performance object detection models, including YOLOv8, YOLOv9, and YOLOv10. The results confirmed that the proposed model outperformed YOLOv8 by 0.1% in mAP (0.5) and surpassed YOLOv9 and YOLOv10 by 0.3% and 0.6%, respectively.
Paper and project links
PDF Preprint version. Accepted for publication in the Journal of Information Systems Engineering and Management
Summary
Object detection using drone images or videos is a promising technology, but the high altitude at which drone images are captured makes object identification difficult. This paper addresses the problem by improving the Head network of the YOLOv11 algorithm into a GhostHead network; the resulting model, YOLO-Drone, is shown experimentally to improve both detection accuracy and speed. Comparative experiments against other high-performance object detectors also confirm its advantage.
Key Takeaways
- Object detection in drone images and videos is promising, but high-altitude capture makes objects hard to identify.
- YOLOv11 serves as the base model; its Head network is improved into the proposed GhostHead network (a possible Ghost-style building block is sketched after this list).
- YOLO-Drone achieves clear gains in detection accuracy, including Precision, Recall, F1-Score, and mAP (0.5).
- Comparative experiments against other high-performance object detectors confirm YOLO-Drone's advantage.
- YOLO-Drone also improves image-processing (inference) speed.
- On mAP (0.5), YOLO-Drone outperforms YOLOv8 by 0.1% and YOLOv9 and YOLOv10 by 0.3% and 0.6%, respectively.
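The digest does not describe the internals of the GhostHead network; the name suggests Ghost-style convolutions in the spirit of GhostNet, which produce part of the feature maps with cheap depthwise operations. The PyTorch sketch below shows such a Ghost convolution purely as an assumed building block; the class name GhostConv and all hyperparameters are illustrative and are not taken from the paper.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost-style convolution: half of the output channels come from a regular
    convolution, the other half from a cheap depthwise convolution applied to
    the first half (illustrative, following the GhostNet idea)."""

    def __init__(self, in_ch: int, out_ch: int, kernel: int = 1, stride: int = 1):
        super().__init__()
        primary_ch = out_ch // 2
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel, stride, kernel // 2, bias=False),
            nn.BatchNorm2d(primary_ch),
            nn.SiLU(),
        )
        # Cheap operation: a depthwise conv generating the "ghost" feature maps.
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, out_ch - primary_ch, 5, 1, 2,
                      groups=primary_ch, bias=False),
            nn.BatchNorm2d(out_ch - primary_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

# Toy usage on a feature map of shape (batch, channels, H, W).
feat = torch.randn(1, 64, 40, 40)
print(GhostConv(64, 128)(feat).shape)  # torch.Size([1, 128, 40, 40])
```

Replacing standard head convolutions with blocks of this kind is one plausible way to cut computation while keeping channel width, which would be consistent with the reported speed gains.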
Click here to view paper screenshots
EIDSeg: A Pixel-Level Semantic Segmentation Dataset for Post-Earthquake Damage Assessment from Social Media Images
Authors: Huili Huang, Chengeng Liu, Danrong Zhang, Shail Patel, Anastasiya Masalava, Sagar Sadak, Parisa Babolhavaeji, WeiHong Low, Max Mahdi Roozbahani, J. David Frost
Rapid post-earthquake damage assessment is crucial for rescue and resource planning. Still, existing remote sensing methods depend on costly aerial images, expert labeling, and produce only binary damage maps for early-stage evaluation. Although ground-level images from social networks provide a valuable source to fill this gap, a large pixel-level annotated dataset for this task is still unavailable. We introduce EIDSeg, the first large-scale semantic segmentation dataset specifically for post-earthquake social media imagery. The dataset comprises 3,266 images from nine major earthquakes (2008-2023), annotated across five classes of infrastructure damage: Undamaged Building, Damaged Building, Destroyed Building, Undamaged Road, and Damaged Road. We propose a practical three-phase cross-disciplinary annotation protocol with labeling guidelines that enables consistent segmentation by non-expert annotators, achieving over 70% inter-annotator agreement. We benchmark several state-of-the-art segmentation models, identifying Encoder-only Mask Transformer (EoMT) as the top-performing method with a Mean Intersection over Union (mIoU) of 80.8%. By unlocking social networks’ rich ground-level perspective, our work paves the way for a faster, finer-grained damage assessment in the post-earthquake scenario.
Paper and project links
PDF Camera-Ready for AAAI-AISI26
Summary
This work introduces EIDSeg, a large-scale pixel-level semantic segmentation dataset for post-earthquake social media imagery. The dataset contains 3,266 images from nine major earthquakes, annotated across five classes of infrastructure damage. A practical three-phase, cross-disciplinary annotation protocol enables consistent segmentation by non-expert annotators, reaching over 70% inter-annotator agreement. Benchmarking several state-of-the-art segmentation models identifies the Encoder-only Mask Transformer (EoMT) as the best performer, with a Mean Intersection over Union (mIoU) of 80.8%. By unlocking the rich ground-level perspective of social media, the work paves the way for faster, finer-grained post-earthquake damage assessment.
Key Takeaways
- Rapid post-earthquake damage assessment is crucial for rescue and resource planning.
- Existing remote sensing methods rely on costly aerial images and expert labeling, and produce only binary damage maps for early-stage evaluation.
- Ground-level images from social media are a valuable source for filling this gap.
- EIDSeg is introduced as a large-scale semantic segmentation dataset designed for post-earthquake social media imagery.
- The dataset comprises 3,266 images from nine major earthquakes, covering five classes of infrastructure damage.
- A practical three-phase, cross-disciplinary annotation protocol enables consistent segmentation by non-expert annotators, with over 70% inter-annotator agreement.
- The Encoder-only Mask Transformer (EoMT) performs best in the benchmark, with a Mean Intersection over Union (mIoU) of 80.8% (a minimal mIoU computation is sketched after this list).
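mIoU is the headline metric of this benchmark. As a reminder of what the number measures, here is a minimal sketch (our own, not the paper's evaluation code) that computes per-class IoU and its mean from integer label maps; skipping classes absent from both prediction and ground truth is our assumption.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection over Union for dense integer label maps.

    pred, gt: arrays of the same shape with class indices in [0, num_classes).
    Classes appearing in neither prediction nor ground truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:          # class absent from both maps: skip it
            continue
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# Toy usage with five classes, mirroring the five EIDSeg damage categories.
rng = np.random.default_rng(0)
gt = rng.integers(0, 5, size=(64, 64))
pred = gt.copy()
pred[:8] = (pred[:8] + 1) % 5   # corrupt a few rows to simulate prediction errors
print(f"mIoU: {mean_iou(pred, gt, num_classes=5):.3f}")
```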
Click here to view paper screenshots
NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation
Authors: Max Gandyra, Alessandro Santonicola, Michael Beetz
Instance segmentation of novel objects instances in RGB images, given some example images for each object, is a well known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re-) training has proven to be a difficult task. To handle this, we present a new training-free framework, called: Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). NOCTIS integrates two pre-trained models: Grounded-SAM 2 for object proposals with precise bounding boxes and corresponding segmentation masks; and DINOv2 for robust class and patch embeddings, due to its zero-shot capabilities. Internally, the proposal-object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings with a new cyclic thresholding (CT) mechanism that mitigates unstable matches caused by repetitive textures or visually similar patterns. Beyond CT, NOCTIS introduces: (i) an appearance score that is unaffected by object selection bias; (ii) the usage of the average confidence of the proposals’ bounding box and mask as a scoring component; and (iii) an RGB-only pipeline that performs even better than RGB-D ones. We empirically show that NOCTIS, without further training/fine tuning, outperforms the best RGB and RGB-D methods regarding the mean AP score on the seven core datasets of the BOP 2023 challenge for the “Model-based 2D segmentation of unseen objects” task.
Paper and project links
PDF 9 pages, 3 figures, 5 tables, CVPR 2026 preprint
Summary
NOCTIS is a training-free framework for instance segmentation of novel objects that integrates two pre-trained models, Grounded-SAM 2 and DINOv2. Proposal-object matching is based on a matching score with a cyclic thresholding mechanism, together with an appearance score and the average confidence of each proposal's bounding box and mask. On the seven core datasets of the BOP 2023 challenge, for the "Model-based 2D segmentation of unseen objects" task, NOCTIS surpasses the best RGB and RGB-D methods in mean AP without any further training or fine-tuning.
Key Takeaways
- NOCTIS is a training-free framework for instance segmentation of novel objects.
- It integrates two pre-trained models: Grounded-SAM 2 and DINOv2.
- A cyclic thresholding mechanism is used in proposal-object matching to mitigate unstable matches caused by repetitive textures or visually similar patterns (a simplified matching-score sketch follows this list).
- An appearance score that is unaffected by object selection bias is introduced.
- The average confidence of each proposal's bounding box and mask is used as a scoring component.
- NOCTIS performs strongly on the seven core datasets of the BOP 2023 challenge, surpassing the best RGB and RGB-D methods.
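The abstract describes the matching score as combining the similarity of class embeddings with the average maximum similarity of patch embeddings; the cyclic thresholding mechanism itself is not detailed here. The sketch below covers only the similarity part, assuming cosine similarity and an equal-weight combination; the function name and weighting are illustrative, not the NOCTIS implementation.

```python
import numpy as np

def matching_score(prop_cls, obj_cls, prop_patches, obj_patches):
    """Simplified proposal-object matching score.

    prop_cls, obj_cls:         class embeddings, shape (d,).
    prop_patches, obj_patches: patch embeddings, shapes (n, d) and (m, d).

    Combines the cosine similarity of the class embeddings with the average,
    over proposal patches, of each patch's maximum cosine similarity to any
    object patch (equal weighting is an assumption).
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    cls_sim = float(normalize(prop_cls) @ normalize(obj_cls))
    patch_sim = normalize(prop_patches) @ normalize(obj_patches).T   # (n, m)
    avg_max_patch_sim = float(patch_sim.max(axis=1).mean())
    return 0.5 * (cls_sim + avg_max_patch_sim)

# Toy usage with randomly generated embeddings (d = 768 is an assumption here).
rng = np.random.default_rng(1)
score = matching_score(rng.normal(size=768), rng.normal(size=768),
                       rng.normal(size=(49, 768)), rng.normal(size=(49, 768)))
print(f"matching score: {score:.3f}")
```

In the full method, scores of this kind are further filtered by the cyclic thresholding step and combined with the appearance and confidence terms listed above.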
Click here to view paper screenshots
FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection
Authors: Jiangyong Yu, Changyong Shu, Sifan Zhou, Zichen Yu, Xing Hu, Yan Chen, Dawei Yang
Camera-based multi-view 3D detection is crucial for autonomous driving. PETR and its variants (PETRs) excel in benchmarks but face deployment challenges due to high computational cost and memory footprint. Quantization is an effective technique for compressing deep neural networks by reducing the bit width of weights and activations. However, directly applying existing quantization methods to PETRs leads to severe accuracy degradation. This issue primarily arises from two key challenges: (1) significant magnitude disparity between multi-modal features-specifically, image features and camera-ray positional embeddings (PE), and (2) the inefficiency and approximation error of quantizing non-linear operators, which commonly rely on hardware-unfriendly computations. In this paper, we propose FQ-PETR, a fully quantized framework for PETRs, featuring three key innovations: (1) Quantization-Friendly LiDAR-ray Position Embedding (QFPE): Replacing multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding eliminates problematic non-linearities (e.g., inverse-sigmoid) and aligns PE scale with image features, preserving accuracy. (2) Dual-Lookup Table (DULUT): This algorithm approximates complex non-linear functions using two cascaded linear LUTs, achieving high fidelity with minimal entries and no specialized hardware. (3) Quantization After Numerical Stabilization (QANS): Performing quantization after softmax numerical stabilization mitigates attention distortion from large inputs. On PETRs (e.g. PETR, StreamPETR, PETRv2, MV2d), FQ-PETR under W8A8 achieves near-floating-point accuracy (1% degradation) while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines.
Paper and project links
PDF This paper is accepted by AAAI 2026
Summary
For multi-view 3D detection in autonomous driving, PETR and its variants face high computational cost and memory footprint. This paper proposes the FQ-PETR framework, which achieves efficient full quantization through three key innovations, keeping accuracy close to floating point while reducing latency.
Key Takeaways
- PETR and its variants face high computational cost and memory footprint in multi-view 3D detection for autonomous driving.
- Directly applying existing quantization methods to PETRs causes severe accuracy degradation, mainly due to the magnitude disparity between multi-modal features and the difficulty of quantizing non-linear operators.
- The FQ-PETR framework introduces three key innovations: Quantization-Friendly LiDAR-ray Position Embedding (QFPE), Dual-Lookup Table (DULUT), and Quantization After Numerical Stabilization (QANS).
- QFPE replaces multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding, eliminating problematic non-linearities while preserving accuracy.
- DULUT approximates complex non-linear functions with two cascaded linear LUTs, achieving high fidelity with few entries and no specialized hardware.
- QANS performs quantization after softmax numerical stabilization, mitigating attention distortion caused by large inputs (a minimal sketch follows this list).
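QANS is described here only at the level of quantizing after softmax numerical stabilization. Below is a minimal sketch of that idea, assuming the standard max-subtraction stabilization and a simple uniform 8-bit quantizer applied to the bounded post-softmax probabilities; the function names and the quantization scheme are our assumptions, not the FQ-PETR implementation.

```python
import numpy as np

def stabilized_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stabilized softmax: subtract the row max before exp."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def quantize_uint8(x: np.ndarray):
    """Uniform 8-bit quantization of values in [0, 1] (attention probabilities)."""
    scale = 1.0 / 255.0
    q = np.clip(np.round(x / scale), 0, 255).astype(np.uint8)
    return q, scale

# Toy attention logits with large values; stabilization keeps exp() finite, and
# quantization only sees the bounded post-softmax probabilities.
logits = np.array([[1000.0, 999.0, 995.0, 990.0]])
probs = stabilized_softmax(logits)
q, scale = quantize_uint8(probs)
print(probs, q * scale)  # dequantized values stay close to the float probabilities
```

Because the stabilized probabilities lie in [0, 1], the quantizer works over a fixed, well-conditioned range even when the raw logits contain large outliers, which is consistent with the stated goal of reducing attention distortion.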