⚠️ All of the summaries below are generated by large language models and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never use these summaries in serious academic settings; they are intended only as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-25
Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge
Authors:Nimrod Berman, Omkar Joglekar, Eitan Kosman, Dotan Di Castro, Omri Azencot
Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.
Paper and project links
Summary
While diffusion models have achieved remarkable success in single-modality domains such as images and audio, translating information across different sensory modalities (Modality Translation) remains challenging. This paper proposes the Latent Denoising Diffusion Bridge Model (LDDBM), a latent-variable extension of denoising diffusion bridge models that learns a bridge between modalities in a shared latent space without requiring aligned dimensions. With a contrastive alignment loss, a domain-agnostic encoder-decoder architecture designed for noise prediction in latent space, a predictive loss, and several training strategies, the model supports arbitrary modality pairs and performs strongly on a variety of modality translation tasks.
Key Takeaways
- Diffusion models have achieved remarkable success in single-modality domains such as images and audio.
- Modality Translation (MT), translating information from one sensory modality to another, remains a challenging open problem.
- Existing methods often rely on restrictive assumptions such as shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding.
- The Latent Denoising Diffusion Bridge Model (LDDBM) is a general-purpose modality translation framework built on a latent-variable extension of denoising diffusion bridge models.
- By operating in a shared latent space, LDDBM learns a bridge between modalities without requiring aligned dimensions.
- LDDBM introduces a contrastive alignment loss to enforce semantic consistency between paired samples, together with a domain-agnostic encoder-decoder architecture for noise prediction in latent space (a minimal sketch of the alignment idea follows below).
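To make the alignment idea concrete, here is a minimal, hypothetical sketch: two modality-specific encoders map paired samples into a shared latent space, and a symmetric InfoNCE-style loss pulls matching pairs together. The encoder architecture, latent dimension, and exact loss form are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a contrastive alignment loss between two modalities that
# share a latent space. Shapes and the loss form are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Toy encoder mapping a flattened modality input to a shared latent space."""
    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.GELU(), nn.Linear(256, latent_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm latents

def contrastive_alignment_loss(z_a, z_b, temperature: float = 0.07):
    """Symmetric InfoNCE: matching (z_a[i], z_b[i]) pairs are the positives."""
    logits = z_a @ z_b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage: modality A could be multi-view images, modality B a 3D shape code (dims are made up).
enc_a, enc_b = ModalityEncoder(in_dim=512), ModalityEncoder(in_dim=2048)
x_a, x_b = torch.randn(8, 512), torch.randn(8, 2048)         # a batch of paired samples
print(contrastive_alignment_loss(enc_a(x_a), enc_b(x_b)))
```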
Click here to view paper screenshots
Transformed Multi-view 3D Shape Features with Contrastive Learning
Authors:Márcus Vinícius Lobo Costa, Sherlon Almeida da Silva, Bárbara Caroline Benato, Leo Sampaio Ferraz Ribeiro, Moacir Antonelli Ponti
This paper addresses the challenges in representation learning of 3D shape features by investigating state-of-the-art backbones paired with both contrastive supervised and self-supervised learning objectives. Computer vision methods struggle with recognizing 3D objects from 2D images, often requiring extensive labeled data and relying on Convolutional Neural Networks (CNNs) that may overlook crucial shape relationships. Our work demonstrates that Vision Transformers (ViTs) based architectures, when paired with modern contrastive objectives, achieve promising results in multi-view 3D analysis on our downstream tasks, unifying contrastive and 3D shape understanding pipelines. For example, supervised contrastive losses reached about 90.6% accuracy on ModelNet10. The use of ViTs and contrastive learning, leveraging ViTs’ ability to understand overall shapes and contrastive learning’s effectiveness, overcomes the need for extensive labeled data and the limitations of CNNs in capturing crucial shape relationships. The success stems from capturing global shape semantics via ViTs and refining local discriminative features through contrastive optimization. Importantly, our approach is empirical, as it is grounded on extensive experimental evaluation to validate the effectiveness of combining ViTs with contrastive objectives for 3D representation learning.
Paper and project links
Summary
This paper studies representation learning for 3D shape features using state-of-the-art backbones paired with contrastive supervised and self-supervised objectives. Computer vision methods struggle to recognize 3D objects from 2D images, typically requiring large amounts of labeled data and relying on CNNs that may overlook important shape relationships. The study shows that Vision Transformer (ViT) based architectures combined with modern contrastive objectives achieve promising results on multi-view 3D analysis; for example, supervised contrastive losses reach about 90.6% accuracy on ModelNet10. Combining ViTs with contrastive learning reduces the reliance on extensive labeled data and overcomes CNNs' limitations in capturing crucial shape relationships: ViTs capture global shape semantics, while contrastive optimization refines local discriminative features. The approach is validated through extensive experimental evaluation.
Key Takeaways
- The paper investigates 3D shape feature representation using state-of-the-art backbones (such as Vision Transformers) paired with contrastive supervised and self-supervised learning objectives.
- It highlights the difficulties computer vision methods face in recognizing 3D objects from 2D images and their reliance on large amounts of labeled data.
- Vision Transformers (ViTs) combined with modern contrastive objectives perform well on multi-view 3D analysis, with supervised contrastive losses reaching about 90.6% accuracy on ModelNet10 (see the sketch after this list).
- ViTs capture global shape semantics, while contrastive learning refines local discriminative features.
- The approach reduces the need for large-scale labeled data and overcomes CNNs' limitations in capturing crucial shape relationships.
- The study is empirical, validating the effectiveness of combining ViTs with contrastive objectives for 3D representation learning through extensive experiments.
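Below is a minimal sketch of a supervised contrastive (SupCon-style) objective of the kind reported above, applied to per-view embeddings. The backbone is stubbed out with random features where the paper would use a ViT; batch sizes and dimensions are illustrative assumptions.

```python
# Minimal supervised contrastive loss over multi-view embeddings: all views that
# share a class label act as positives for each anchor.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature: float = 0.1):
    """features: (N, D) view embeddings; labels: (N,) class ids."""
    features = F.normalize(features, dim=-1)
    sim = features @ features.t() / temperature                   # (N, N)
    n = features.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # log-softmax over all non-self entries for each anchor
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # average log-probability of positives per anchor (skip anchors without positives)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()

# Usage: 4 shapes x 3 rendered views each; features stand in for ViT [CLS] embeddings.
feats = torch.randn(12, 384)
labels = torch.arange(4).repeat_interleave(3)
print(supervised_contrastive_loss(feats, labels))
```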
Click here to view paper screenshots
Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models
Authors:Xiaozhen Qiao, Jingkai Zhao, Yuqiu Jiang, Xianda Guo, Zhe Sun, Hongyuan Zhang, Xuelong Li
Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose Class-Aware Prototype Learning with Negative Contrast (CPL-NC), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a Class-Aware Prototype Cache Module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a Negative Contrastive Learning Mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.
Paper and project links
Summary
This paper proposes Class-Aware Prototype Learning with Negative Contrast (CPL-NC), a lightweight Test-Time Adaptation (TTA) framework designed specifically for Vision-Language Models (VLMs). The framework addresses the performance drop VLMs suffer when the deployment distribution diverges from the training distribution. By introducing a class-aware prototype cache module and a negative contrastive learning mechanism, CPL-NC improves generalization under distribution shifts.
Key Takeaways
- VLM performance can degrade when the deployment distribution diverges from the training distribution, motivating Test-Time Adaptation (TTA) methods that update the model with unlabeled target data.
- Existing TTA methods often overlook prototype degradation in long-tailed distributions and confusion between semantically similar classes.
- CPL-NC introduces a class-aware prototype cache module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism that retains knowledge of rare, inactive classes (a rough sketch of the cache bookkeeping follows below).
- CPL-NC employs a negative contrastive learning mechanism that identifies and constrains hard visual-textual negatives to improve class separability.
- The framework uses asymmetric optimization, refining only the textual prototypes while anchoring on stable visual features.
- Experiments show that CPL-NC outperforms prior TTA methods across 15 benchmarks on both ResNet-50 and ViT-B/16 backbones.
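The following is a loose, hypothetical sketch of the class-aware prototype cache bookkeeping: each class keeps a small queue of test-time features whose capacity grows with observed frequency. The capacity rule, queue sizes, and the mean-based prototype are assumptions for illustration only, not the CPL-NC implementation.

```python
# Hypothetical class-aware prototype cache: per-class feature queues whose capacity
# adapts to test-time frequency, so frequent classes keep more history.
from collections import defaultdict, deque
import torch

class ClassAwarePrototypeCache:
    def __init__(self, base_capacity: int = 4, max_capacity: int = 32):
        self.base, self.max = base_capacity, max_capacity
        self.freq = defaultdict(int)                              # test-time frequency per class
        self.store = defaultdict(lambda: deque(maxlen=base_capacity))

    def add(self, class_id: int, feature: torch.Tensor):
        self.freq[class_id] += 1
        # grow the per-class queue as the class is seen more often (assumed rule)
        cap = min(self.max, self.base + self.freq[class_id] // 8)
        if self.store[class_id].maxlen != cap:
            self.store[class_id] = deque(self.store[class_id], maxlen=cap)
        self.store[class_id].append(feature.detach())

    def prototype(self, class_id: int):
        feats = list(self.store[class_id])
        return torch.stack(feats).mean(dim=0) if feats else None

# Usage with dummy CLIP-like image features (hypothetical 512-dim embeddings).
cache = ClassAwarePrototypeCache()
for _ in range(20):
    cache.add(class_id=3, feature=torch.randn(512))
print(cache.prototype(3).shape)   # torch.Size([512])
```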
Click here to view paper screenshots
Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Authors:Yiqi Lin, Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Mike Zheng Shou
Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning. Code and models are available at: https://github.com/showlab/VC2L.
Paper and project links
PDF Project page: https://linyq17.github.io/VC2L/
Summary
This paper introduces Vision-Centric Contrastive Learning (VC2L), a framework for complex, real-world web documents in which text and images are interleaved, loosely aligned, or embedded in visual form. VC2L models text, images, and their combinations with a single vision transformer operating entirely in pixel space: all inputs, whether textual, visual, or combined, are rendered as images, eliminating the need for OCR, text tokenization, or modality fusion strategies. To capture complex cross-modal relationships, VC2L uses a snippet-level contrastive objective that aligns consecutive multimodal segments, exploiting the inherent coherence of documents without requiring explicitly paired image-text data. Experiments show that VC2L achieves competitive or superior performance on the proposed benchmarks and on established datasets.
Key Takeaways
- Contrastive vision-language models such as CLIP perform strongly on many multimodal tasks but struggle with complex, real-world web documents.
- Vision-Centric Contrastive Learning (VC2L) models text, images, and their combinations with a single vision transformer.
- VC2L operates entirely in pixel space, rendering all inputs as images and thus simplifying multimodal processing (no OCR, text tokenization, or modality fusion).
- VC2L uses snippet-level contrastive learning to capture complex cross-modal relationships in web documents, leveraging their inherent coherence (a toy sketch follows below).
- VC2L requires no explicitly paired image-text data, which increases its flexibility.
- Experiments show VC2L achieves competitive or superior performance on several retrieval benchmarks, including AnyCIR, SeqCIR, CSR, M-BEIR, and MTEB.
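The toy sketch below illustrates the vision-centric recipe: snippets (text or images) are rasterized into pixels, encoded by a single vision encoder, and consecutive snippets from the same document are aligned contrastively. The renderer, the tiny encoder stand-in, and the loss form are assumptions, not the released VC2L code.

```python
# Toy sketch: render text snippets as images so one vision encoder handles everything,
# then align consecutive document snippets with an InfoNCE-style loss.
import torch
import torch.nn.functional as F
from PIL import Image, ImageDraw
import numpy as np

def render_text_as_image(text: str, size=(224, 224)) -> torch.Tensor:
    """Rasterize a text snippet so it can be fed to a vision encoder."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).text((8, 8), text, fill="black")
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return torch.from_numpy(arr).permute(2, 0, 1)                # (3, H, W)

class TinyVisionEncoder(torch.nn.Module):
    """Stand-in for the shared vision transformer."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = torch.nn.Linear(3 * 224 * 224, dim)

    def forward(self, images):                                   # (B, 3, 224, 224)
        return F.normalize(self.proj(images.flatten(1)), dim=-1)

def snippet_contrastive_loss(z_curr, z_next, temperature: float = 0.07):
    """Consecutive snippets of the same document are treated as positive pairs."""
    logits = z_curr @ z_next.t() / temperature
    targets = torch.arange(z_curr.size(0))
    return F.cross_entropy(logits, targets)

# Usage: two consecutive snippets per document; one may be rendered text, the other an image.
enc = TinyVisionEncoder()
snippet_a = torch.stack([render_text_as_image("Figure 2 shows the encoder."), torch.rand(3, 224, 224)])
snippet_b = torch.stack([torch.rand(3, 224, 224), render_text_as_image("We then train on web data.")])
print(snippet_contrastive_loss(enc(snippet_a), enc(snippet_b)))
```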
Click here to view paper screenshots
BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning
Authors:Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su
Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.
Paper and project links
PDF NeurIPS 2025 Spotlight; Project page: https://imageomics.github.io/bioclip-2/
Summary
Foundation models trained at scale exhibit remarkable emergent behaviors, learning capabilities beyond their initial training objectives. This work finds such emergent behaviors in biological vision models through large-scale contrastive vision-language training. The authors first curate TreeOfLife-200M, the largest and most diverse biological organism image dataset to date with 214 million images, and train BioCLIP 2 on it to distinguish species. Despite the narrow training objective, BioCLIP 2 achieves remarkable accuracy on diverse biological visual tasks such as habitat classification and trait prediction. Its embedding space exhibits emergent properties: at the inter-species level, the embedding distributions of different species align closely with functional and ecological meanings, while at the intra-species level, variations such as life stages and sexes are preserved and better separated in subspaces orthogonal to inter-species distinctions. The paper provides formal proof and analyses of why hierarchical supervision and contrastive objectives encourage these properties, and shows that they become increasingly significant with larger-scale training data, yielding a biologically meaningful embedding space.
Key Takeaways
- Emergent behaviors are found in biological vision models through large-scale contrastive vision-language training.
- TreeOfLife-200M, a curated dataset of 214 million organism images, provides the largest and most diverse biological image collection to date for training.
- BioCLIP 2 achieves high accuracy on a range of biological visual tasks, such as habitat classification and trait prediction.
- The learned embedding space of BioCLIP 2 exhibits emergent properties that reflect both inter-species and intra-species structure.
- Hierarchical supervision and contrastive objectives encourage these emergent properties (a simplified illustration follows below).
- These properties become increasingly significant with larger-scale training data, yielding a biologically meaningful embedding space.
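As one simplified reading of "hierarchical supervision", the sketch below applies a contrastive term at several taxonomic ranks, so images sharing coarser ranks are still loosely attracted. BioCLIP 2 itself is trained contrastively against text; this flat, image-only multi-level loss and its rank weights are purely illustrative assumptions.

```python
# Illustrative multi-level contrastive loss over a toy taxonomy (family -> genus -> species).
import torch
import torch.nn.functional as F

def level_contrastive(z, labels, temperature: float = 0.1):
    """Mean InfoNCE-style loss where samples sharing a label are positives."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / temperature
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    pos = (labels[:, None] == labels[None, :]) & ~eye
    log_prob = sim.masked_fill(eye, float("-inf"))
    log_prob = log_prob - torch.logsumexp(log_prob, dim=1, keepdim=True)
    per_anchor = -(log_prob.masked_fill(~pos, 0.0)).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor[pos.any(1)].mean()

def hierarchical_contrastive(z, rank_labels, weights=(0.2, 0.3, 0.5)):
    """rank_labels: coarse-to-fine label tensors, e.g. [family, genus, species]."""
    return sum(w * level_contrastive(z, lab) for w, lab in zip(weights, rank_labels))

# Usage: 8 images = 4 species x 2 images, nested in 2 genera and 1 family (made-up labels).
z = torch.randn(8, 128)
family  = torch.tensor([0, 0, 0, 0, 0, 0, 0, 0])
genus   = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
species = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(hierarchical_contrastive(z, [family, genus, species]))
```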
Click here to view paper screenshots
A Multi-Task Foundation Model for Wireless Channel Representation Using Contrastive and Masked Autoencoder Learning
Authors:Berkay Guler, Giovanni Geraci, Hamid Jafarkhani
Current applications of self-supervised learning to wireless channel representation often borrow paradigms developed for text and image processing, without fully addressing the unique characteristics and constraints of wireless communications. To bridge this gap, we introduce ContraWiMAE, Wireless Contrastive Masked Autoencoder, a transformer-based foundation model that unifies masked reconstruction and masked contrastive learning for wireless channel representation. Our key innovation is a new wireless-inspired contrastive objective that exploits the inherent characteristics of wireless environment, including noise, fading, and partial observability, as natural augmentation. Through extensive evaluation on unseen scenarios and conditions, we demonstrate our method’s effectiveness in multiple downstream tasks, including cross-frequency beam selection, line-of-sight detection, and channel estimation. ContraWiMAE exhibits superior linear separability and adaptability in diverse wireless environments, demonstrating exceptional data efficiency and competitive performance compared with supervised baselines under challenging conditions. Comparative evaluations against a state-of-the-art wireless channel foundation model confirm the superior performance and data efficiency of our approach, highlighting its potential as a powerful baseline for future research in self-supervised wireless channel representation learning. To foster further work in this direction, we release the model weights and training pipeline for ContraWiMAE.
Paper and project links
PDF - 17 pages, 7 figures, 5 tables - Submitted to IEEE JSAC Large AI Models for Future Wireless Communication Systems - Some of the results will appear in NeurIPS 2025, AI4NextG Workshop - This version is an extensive improvement in all aspects over the previous version with the same title - Dataset and implementation: https://github.com/BerkIGuler/WirelessContrastiveMaskedLearning
Summary
Self-supervised learning for wireless channel representation often borrows paradigms developed for text and image processing without fully addressing the unique characteristics and constraints of wireless communications. To bridge this gap, the authors introduce ContraWiMAE (Wireless Contrastive Masked Autoencoder), a transformer-based foundation model that unifies masked reconstruction and masked contrastive learning for wireless channel representation. Its key innovation is a wireless-inspired contrastive objective that exploits inherent properties of the wireless environment, such as noise, fading, and partial observability, as natural augmentation. Extensive evaluation on unseen scenarios demonstrates effectiveness across multiple downstream tasks, including cross-frequency beam selection, line-of-sight detection, and channel estimation. ContraWiMAE exhibits strong linear separability, adaptability, and data efficiency compared with supervised baselines and a state-of-the-art wireless channel foundation model, and the model weights and training pipeline are released to foster further research.
Key Takeaways
- Self-supervised learning for wireless channel representation must account for the unique characteristics and constraints of wireless communications.
- ContraWiMAE is a transformer-based foundation model that unifies masked reconstruction and masked contrastive learning for wireless channel representation.
- Its wireless-inspired contrastive objective exploits inherent properties of the wireless environment, such as noise, fading, and partial observability, as natural augmentation (a toy augmentation sketch follows below).
- The model performs well on downstream tasks such as cross-frequency beam selection, line-of-sight detection, and channel estimation.
- ContraWiMAE shows strong linear separability and adaptability across diverse wireless environments.
- Compared with other wireless channel foundation models, ContraWiMAE offers superior performance and data efficiency.
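Here is a hedged sketch of the "wireless-inspired augmentation" idea: a single channel realization is perturbed with additive noise, random fading, and partial masking to produce views for a contrastive branch alongside masked reconstruction. The SNR, fading model, and mask ratio are illustrative assumptions, not ContraWiMAE's released settings.

```python
# Toy wireless augmentations: noise, random fading, and partial observability
# applied to a complex channel matrix to form contrastive views.
import torch

def wireless_augment(H: torch.Tensor, snr_db: float = 15.0, mask_ratio: float = 0.3):
    """H: complex channel tensor of shape (antennas, subcarriers)."""
    # 1) additive white Gaussian noise at the requested SNR
    signal_power = H.abs().pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = torch.sqrt(noise_power / 2) * (torch.randn_like(H.real) + 1j * torch.randn_like(H.real))
    view = H + noise
    # 2) random small-scale fading: per-subcarrier complex Rayleigh-like scaling
    fade = (torch.randn(1, H.shape[1]) + 1j * torch.randn(1, H.shape[1])) / (2 ** 0.5)
    view = view * fade
    # 3) partial observability: zero out a random subset of subcarriers
    mask = torch.rand(1, H.shape[1]) > mask_ratio
    return view * mask.to(view.dtype)

# Usage: two augmented views of one channel realization feed the contrastive head,
# while the masked view also drives the reconstruction (autoencoder) branch.
H = (torch.randn(4, 64) + 1j * torch.randn(4, 64)) / (2 ** 0.5)   # toy 4x64 channel
view_a, view_b = wireless_augment(H), wireless_augment(H)
print(view_a.shape, view_b.shape)
```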
Click here to view paper screenshots