⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never rely on them for serious academic work; they are intended only as a first-pass screen before actually reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-25
Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required?
Authors:Sukwon Yun, Heming Yao, Burkhard Hoeckendorf, David Richmond, Aviv Regev, Russell Littman
Vision Transformers ($\text{ViTs}$) have become the backbone of vision foundation models, yet their optimization for multi-channel domains - such as cell painting or satellite imagery - remains underexplored. A key challenge in these domains is capturing interactions between channels, as each channel carries different information. While existing works have shown efficacy by treating each channel independently during tokenization, this approach naturally introduces a major computational bottleneck in the attention block - channel-wise comparisons lead to quadratic growth in attention, resulting in excessive $\text{FLOPs}$ and high training cost. In this work, we shift focus from efficacy to the overlooked efficiency challenge in cross-channel attention and ask: “Is it necessary to model all channel interactions?”. Inspired by the philosophy of Sparse Mixture-of-Experts ($\text{MoE}$), we propose MoE-ViT, a Mixture-of-Experts architecture for multi-channel images in $\text{ViTs}$, which treats each channel as an expert and employs a lightweight router to select only the most relevant experts per patch for attention. Proof-of-concept experiments on real-world datasets - JUMP-CP and So2Sat - demonstrate that $\text{MoE-ViT}$ achieves substantial efficiency gains without sacrificing, and in some cases enhancing, performance, making it a practical and attractive backbone for multi-channel imaging.
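The efficiency argument is concrete: with $C$ channels tokenized independently and $P$ spatial patches, full self-attention runs over $C \cdot P$ tokens and scales as $\mathcal{O}(C^2 P^2)$, whereas routing only $k \ll C$ channels per patch reduces this to roughly $\mathcal{O}(k^2 P^2)$. The following is a minimal PyTorch sketch of the routing idea as described in the abstract; the class name `ChannelRouter`, the single-linear gate, and the top-$k$ value are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelRouter(nn.Module):
    """Illustrative per-patch router: keep only the top-k channel tokens for attention."""
    def __init__(self, dim: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, 1)  # lightweight scoring head (the "router")

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, P, C, D) -- one token per channel (C) per spatial patch (P)
        scores = self.gate(tokens).squeeze(-1)                 # (B, P, C) routing scores
        top_idx = scores.topk(self.k, dim=-1).indices          # (B, P, k) selected channels
        top_idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, tokens.size(-1))
        return tokens.gather(2, top_idx)                       # (B, P, k, D) routed tokens

# toy usage: 4 images, 196 patches, 8 channels, 192-dim tokens
router = ChannelRouter(dim=192, k=2)
x = torch.randn(4, 196, 8, 192)
print(router(x).shape)  # torch.Size([4, 196, 2, 192]) -> attention sees far fewer tokens
```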
Paper and project links
PDF This has been accepted at the NeurIPS AI4Science Workshop 2025
Summary
The optimization of ViTs for multi-channel imaging domains (such as cell painting or satellite imagery) remains underexplored. This work targets the efficiency challenge of cross-channel attention and proposes the MoE-ViT architecture, which treats each channel as an expert and uses a lightweight router to select only the most relevant experts for attention. Experiments show that MoE-ViT delivers substantial efficiency gains without sacrificing performance, making it a practical and attractive backbone for multi-channel imaging.
Key Takeaways
- Vision Transformers (ViTs) have become the backbone of vision foundation models, but their optimization for multi-channel domains (such as cell painting and satellite imagery) remains underexplored.
- Handling cross-channel attention is one of the key challenges in multi-channel imaging.
- Existing methods tokenize each channel independently, which creates a computational bottleneck in the attention block and excessive FLOPs.
- Inspired by Mixture-of-Experts (MoE), the MoE-ViT architecture treats each channel of a multi-channel image as an expert.
- MoE-ViT gains efficiency by using a lightweight router to select only the most relevant experts (channels) per patch for attention.
- Experiments on real-world datasets show that MoE-ViT achieves substantial efficiency gains without sacrificing performance.
Click here to view paper screenshots
DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction
Authors:Jonathan Skaza, Parsa Madinei, Ziqi Wen, Miguel Eckstein
Visual complexity prediction is a fundamental problem in computer vision with applications in image compression, retrieval, and classification. Understanding what makes humans perceive an image as complex is also a long-standing question in cognitive science. Recent approaches have leveraged multimodal models that combine visual and linguistic representations, but it remains unclear whether language information is necessary for this task. We propose DReX (DINO-ResNet Fusion), a vision-only model that fuses self-supervised and convolutional representations through a learnable attention mechanism to predict image complexity. Our architecture integrates multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, enabling the model to capture both low-level texture patterns and high-level semantic structure. DReX achieves state-of-the-art performance on the IC9600 benchmark (Pearson r = 0.9581), surpassing previous methods–including those trained on multimodal image-text data–while using approximately 21.5x fewer learnable parameters. Furthermore, DReX generalizes robustly across multiple datasets and metrics, achieving superior results on Pearson and Spearman correlation, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Ablation and attention analyses confirm that DReX leverages complementary cues from both backbones, with the DINOv3 [CLS] token enhancing sensitivity to visual complexity. Our findings suggest that visual features alone can be sufficient for human-aligned complexity prediction and that, when properly fused, self-supervised transformers and supervised deep convolutional neural networks offer complementary and synergistic benefits for this task.
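As a rough illustration of the fusion idea described above, the sketch below weighs a pooled ResNet-50 feature (2048-d) against a DINOv3 ViT-S/16 [CLS] token (384-d) with a learnable attention mechanism before regressing a scalar complexity score. The module name `AttentionFusion`, the projection width, and the simple two-stream softmax weighting are assumptions for illustration; the paper's actual fusion head may differ (for example, it uses multi-scale ResNet features rather than a single pooled vector).

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative fusion head: weighs CNN and ViT features, regresses a complexity score."""
    def __init__(self, cnn_dim: int = 2048, vit_dim: int = 384, dim: int = 256):
        super().__init__()
        self.proj_cnn = nn.Linear(cnn_dim, dim)   # pooled ResNet-50 features
        self.proj_vit = nn.Linear(vit_dim, dim)   # DINOv3 ViT-S/16 [CLS] token
        self.attn = nn.Linear(dim, 1)             # learnable attention over the two streams
        self.head = nn.Linear(dim, 1)             # scalar complexity prediction

    def forward(self, cnn_feat: torch.Tensor, vit_cls: torch.Tensor) -> torch.Tensor:
        streams = torch.stack([self.proj_cnn(cnn_feat), self.proj_vit(vit_cls)], dim=1)  # (B, 2, dim)
        weights = torch.softmax(self.attn(streams), dim=1)                               # (B, 2, 1)
        fused = (weights * streams).sum(dim=1)                                           # (B, dim)
        return self.head(fused).squeeze(-1)                                              # (B,)

# toy usage with random stand-ins for the two backbone outputs
fusion = AttentionFusion()
print(fusion(torch.randn(8, 2048), torch.randn(8, 384)).shape)  # torch.Size([8])
```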
Paper and project links
PDF 8 pages
Summary
This paper proposes DReX, a vision-only model that fuses self-supervised and convolutional representations through a learnable attention mechanism to predict image complexity. The model combines multi-scale features from ResNet-50 with the semantically rich representations of DINOv3 ViT-S/16, capturing both low-level texture and high-level semantic structure, and achieves state-of-the-art performance on the IC9600 benchmark with roughly 21.5x fewer learnable parameters than previous multimodal models. DReX also performs robustly across multiple datasets and metrics, demonstrating strong generalization, and analyses show that the DINOv3 [CLS] token enhances the model's sensitivity to visual complexity. The results suggest that visual features alone can suffice for human-aligned complexity prediction, and that properly fused self-supervised transformers and supervised deep convolutional networks offer complementary, synergistic benefits for this task.
Key Takeaways
- Visual complexity prediction is a fundamental problem in computer vision, with practical applications in image compression, retrieval, and classification.
- DReX is a new vision-only model that predicts image complexity by fusing self-supervised and convolutional representations.
- DReX combines multi-scale features from ResNet-50 with the semantically rich representations of DINOv3, capturing both low-level texture and high-level semantic structure.
- DReX achieves state-of-the-art results on the IC9600 benchmark, surpassing methods trained on multimodal image-text data while using far fewer learnable parameters.
- DReX performs robustly across datasets and metrics, demonstrating strong generalization.
- Ablation and attention analyses confirm that DReX leverages complementary cues from both backbones, with the DINOv3 [CLS] token enhancing sensitivity to visual complexity.
Click here to view paper screenshots
The Finer the Better: Towards Granular-aware Open-set Domain Generalization
Authors:Yunyun Wang, Zheng Duan, Xinyue Liao, Ke-Jia Chen, Songcan Chen
Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models like CLIP, existing methods still fall into the dilemma between the structural risk of known classes and the open-space risk from unknown classes, and easily suffer from over-confidence, especially when distinguishing “hard unknowns” that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module to decompose images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives, that is, repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% in accuracy and 5% in H-score over state-of-the-art methods.
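The duplex contrastive objective can be pictured as two complementary penalties on the unknown prompt's embedding: repulsion keeps it from collapsing onto known-class prompts, while cohesion keeps it from drifting arbitrarily far from them. The sketch below is a minimal reading of that idea; the function name, cosine-similarity formulation, and margin values are illustrative assumptions rather than the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def duplex_contrastive_loss(unknown_emb: torch.Tensor,
                            known_embs: torch.Tensor,
                            repel_margin: float = 0.7,
                            cohere_margin: float = 0.4) -> torch.Tensor:
    """Illustrative duplex objective for positioning an unknown-class prompt.

    unknown_emb: (D,) embedding of the learnable unknown prompt
    known_embs:  (K, D) embeddings of the K known-class prompts
    """
    u = F.normalize(unknown_emb, dim=-1)
    k = F.normalize(known_embs, dim=-1)
    sims = k @ u                                     # (K,) cosine similarity to each known class
    repulsion = F.relu(sims - repel_margin).mean()   # too close to any known class -> penalized
    cohesion = F.relu(cohere_margin - sims.max())    # too far from every known class -> penalized
    return repulsion + cohesion

# toy usage: 10 known classes in a 512-d CLIP text space
print(duplex_contrastive_loss(torch.randn(512), torch.randn(10, 512)))
```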
Paper and project links
PDF 9 pages, 3 figures, AAAI 2026
Summary
In the realistic Open-Set Domain Generalization (OSDG) setting, deployed models face both domain shifts and novel object categories. Despite progress with models such as CLIP, existing methods remain caught between the structural risk of known classes and the open-space risk of unknown classes, and tend to be over-confident when distinguishing “hard unknowns” that are visually similar to known classes. The proposed Semantic-enhanced CLIP (SeeCLIP) framework addresses this through fine-grained semantic enhancement: a semantic-aware prompt enhancement module decomposes images into discriminative semantic tokens for vision-language alignment beyond coarse category labels; duplex contrastive learning positions unknown prompts with complementary repulsion and cohesion objectives, maintaining separability from known classes while preserving semantic proximity; and a semantic-guided diffusion module perturbs the extracted semantic tokens to synthesize challenging pseudo-unknowns that are visually similar to known classes yet differ locally, forcing the model to learn finer decision boundaries. Extensive experiments on five benchmarks show improvements of 3% in accuracy and 5% in H-score over state-of-the-art methods.
Key Takeaways
- Open-Set Domain Generalization (OSDG) addresses the realistic setting in which deployed models encounter both domain shifts and novel object categories.
- Existing methods such as CLIP-based approaches struggle to distinguish “hard unknowns” that share fine-grained visual similarities with known classes and tend to be over-confident.
- The SeeCLIP framework addresses this through fine-grained semantic enhancement, using a semantic-aware prompt enhancement module to decompose images into discriminative semantic tokens.
- SeeCLIP positions unknown prompts with duplex contrastive learning, combining complementary repulsion and cohesion objectives.
- A semantic-guided diffusion module synthesizes pseudo-unknowns as challenging samples, forcing the model to learn finer decision boundaries.
- SeeCLIP outperforms existing methods across five benchmarks.