发布日期: 2025-03-05

更新日期: 2025-05-14

文章字数: 1.9k

阅读时长: 7 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-03-05 更新

Pair-VPR: Place-Aware Pre-training and Contrastive Pair Classification for Visual Place Recognition with Vision Transformers

Authors:Stephen Hausler, Peyman Moghadam

In this work we propose a novel joint training method for Visual Place Recognition (VPR), which simultaneously learns a global descriptor and a pair classifier for re-ranking. The pair classifier can predict whether a given pair of images are from the same place or not. The network only comprises Vision Transformer components for both the encoder and the pair classifier, and both components are trained using their respective class tokens. In existing VPR methods, typically the network is initialized using pre-trained weights from a generic image dataset such as ImageNet. In this work we propose an alternative pre-training strategy, by using Siamese Masked Image Modelling as a pre-training task. We propose a Place-aware image sampling procedure from a collection of large VPR datasets for pre-training our model, to learn visual features tuned specifically for VPR. By re-using the Mask Image Modelling encoder and decoder weights in the second stage of training, Pair-VPR can achieve state-of-the-art VPR performance across five benchmark datasets with a ViT-B encoder, along with further improvements in localization recall with larger encoders. The Pair-VPR website is: https://csiro-robotics.github.io/Pair-VPR.

在这项工作中，我们针对视觉位置识别（VPR）提出了一种新型联合训练方法，该方法同时学习全局描述器和用于重新排序的对分类器。对分类器可以预测给定的两个图像是否来自同一地点。该网络仅包含用于编码器和配对分类器的视觉转换器（Vision Transformer）组件，两个组件均使用各自的类标记进行训练。在现有的VPR方法中，网络通常使用通用图像数据集（例如ImageNet）的预训练权重进行初始化。在这项工作中，我们提出了一种替代的预训练策略，即使用Siamese Masked Image Modelling作为预训练任务。我们提出了一种基于大型VPR数据集集合的位置感知图像采样程序，用于对我们的模型进行预训练，以学习专门用于VPR的视觉特征。通过重用第二阶段训练中的Mask Image Modelling编码器和解码器权重，Pair-VPR可以在使用ViT-B编码器时，在五个基准数据集上实现最先进的VPR性能，并在使用更大的编码器时进一步提高定位召回率。Pair-VPR网站地址为：https://csiro-robotics.github.io/Pair-VPR。

论文及项目相关链接

PDF

Summary
本文提出了一种新型的视觉场所识别（VPR）联合训练方法，该方法同时学习全局描述器和配对分类器以进行排序。配对分类器可预测给定图像对是否来自同一地点。网络仅包含用于编码器和配对分类器的视觉转换器组件，这两个组件均使用各自的类标记进行训练。本文还提出了一种基于Siamese Masked Image Modelling的预训练策略，并提出了一种从大型VPR数据集中采集场所感知图像样本的预训练方法，以学习专门用于VPR的视觉特征。Pair-VPR在五个基准数据集上实现了最先进的VPR性能，并在使用更大的编码器时进一步提高了定位召回率。

Key Takeaways

本文提出了一种新的VPR联合训练方法，结合全局描述器和配对分类器。
网络仅由视觉转换器组件构成，编码器和配对分类器分别使用各自的类标记进行训练。
提出了基于Siamese Masked Image Modelling的预训练策略，针对VPR任务优化了模型的视觉特征学习。
采用了场所感知图像采样方法，从大型VPR数据集中选取样本进行预训练。
Pair-VPR在五个基准数据集上达到了最先进的性能。
使用更大的编码器时，Pair-VPR在定位召回率方面有了进一步的提升。

Cool Papers

点此查看论文截图

Weighted Point Set Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric

Authors:Toshimitsu Uesaka, Taiji Suzuki, Yuhta Takida, Chieh-Hsin Lai, Naoki Murata, Yuki Mitsufuji

In typical multimodal contrastive learning, such as CLIP, encoders produce one point in the latent representation space for each input. However, one-point representation has difficulty in capturing the relationship and the similarity structure of a huge amount of instances in the real world. For richer classes of the similarity, we propose the use of weighted point sets, namely, sets of pairs of weight and vector, as representations of instances. In this work, we theoretically show the benefit of our proposed method through a new understanding of the contrastive loss of CLIP, which we call symmetric InfoNCE. We clarify that the optimal similarity that minimizes symmetric InfoNCE is the pointwise mutual information, and show an upper bound of excess risk on downstream classification tasks of representations that achieve the optimal similarity. In addition, we show that our proposed similarity based on weighted point sets consistently achieves the optimal similarity. To verify the effectiveness of our proposed method, we demonstrate pretraining of text-image representation models and classification tasks on common benchmarks.

在典型的跨模态对比学习（如CLIP）中，编码器为每个输入生成潜在表示空间中的一个点。然而，单点表示很难捕捉现实世界中的大量实例之间的关系和相似性结构。为了表示更丰富的相似性类别，我们建议使用加权点集（即权重和向量的点对集合）作为实例的表示。在这项工作中，我们通过新的CLIP对比损失的理解，展示了我们的方法的好处，我们称之为对称InfoNCE。我们明确指出，最小化对称InfoNCE的最优相似性是点间互信息，并展示了在达到最优相似性的下游分类任务中过度风险的上限。此外，我们证明，基于加权点集提出的相似性始终能够达到最优相似性。为了验证我们方法的有效性，我们在通用基准上展示了文本图像表示模型的预训练和分类任务。

论文及项目相关链接

PDF ICLR 2025 (Spotlight)

Summary：
本文介绍了典型的对比学习如CLIP中编码器为每个输入生成一个点表示的问题。为捕捉现实世界中大量实例的关系和相似性结构，提出使用加权点集作为实例表示的方法。本文通过新的CLIP对比损失理解——对称InfoNCE，理论证明了该方法的好处。此外，本文展示了基于加权点集的相似性始终达到最优相似性的证据，并通过预训练文本-图像表示模型和常见基准分类任务验证了该方法的有效性。

Key Takeaways：