⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not use this content for serious academic purposes; it is intended only for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-13
COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation
Authors:Umair Hassan
Urdu, spoken by over 250 million people, remains critically under-served in multimodal and vision-language research. The absence of large-scale, high-quality datasets has limited the development of Urdu-capable systems and reinforced biases in multilingual vision-language models trained primarily on high-resource languages. To address this gap, we present COCO-Urdu, a large-scale image-caption dataset derived from MS COCO, containing 59,000 images and 319,000 Urdu captions selected through stratified sampling to preserve the original distribution. Captions were translated using SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency; low-scoring captions were iteratively refined using open-source large language models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting consistently strong results. To the best of our knowledge, COCO-Urdu is the largest publicly available Urdu captioning dataset. By releasing both the dataset and the quality estimation pipeline, we aim to reduce language bias in multimodal research and establish a foundation for inclusive vision-language systems.
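The abstract describes a hybrid quality-estimation pipeline (COMET-Kiwi for reference-free translation quality, CLIP similarity for visual grounding, and BERTScore against a back-translation for semantic consistency) but does not spell out how the three signals are combined. Below is a minimal Python sketch of such a filter; the specific checkpoints, the equal weighting, and the 0.5 acceptance threshold are illustrative assumptions, not the authors' published settings.

```python
# pip install unbabel-comet bert-score sentence-transformers pillow
from comet import download_model, load_from_checkpoint  # COMET-Kiwi (reference-free QE)
from bert_score import score as bertscore
from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Reference-free translation quality: COMET-Kiwi scores (src, mt) pairs directly.
kiwi = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

# CLIP image-text similarity. Assumption: the paper does not name its CLIP variant;
# a multilingual text tower is needed so that Urdu captions land in the image space.
img_encoder = SentenceTransformer("clip-ViT-B-32")
txt_encoder = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")

def hybrid_qe(image_path, src_caption_en, urdu_caption, back_translation_en,
              weights=(1 / 3, 1 / 3, 1 / 3), threshold=0.5):
    """Return (accept, scores) for one caption. Weights and threshold are illustrative."""
    # 1) COMET-Kiwi: quality of the Urdu caption given the English source caption.
    kiwi_out = kiwi.predict(
        [{"src": src_caption_en, "mt": urdu_caption}], batch_size=1, gpus=0)
    kiwi_score = float(kiwi_out.scores[0])

    # 2) CLIP-based visual grounding: cosine similarity between image and Urdu caption.
    img_emb = img_encoder.encode(Image.open(image_path), convert_to_tensor=True)
    txt_emb = txt_encoder.encode(urdu_caption, convert_to_tensor=True)
    clip_sim = float(util.cos_sim(img_emb, txt_emb))

    # 3) Semantic consistency: BERTScore F1 between the English source and the
    #    back-translation of the Urdu caption (produced elsewhere, e.g. SeamlessM4T v2).
    _, _, f1 = bertscore([back_translation_en], [src_caption_en], lang="en")
    bts = float(f1[0])

    # The three signals live on different scales and would need calibration in
    # practice; equal weights and a 0.5 threshold are purely illustrative.
    combined = sum(w * s for w, s in zip(weights, (kiwi_score, clip_sim, bts)))
    return combined >= threshold, {"kiwi": kiwi_score, "clip": clip_sim,
                                   "bertscore": bts, "combined": combined}
```

In the paper's pipeline, captions falling below the acceptance bar would then be handed to an open-source LLM for refinement and re-scored (see the loop sketch further below).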
Paper and project links
PDF 17 pages, 3 figures, 3 tables. Dataset available at https://huggingface.co/datasets/umairhassan02/urdu-translated-coco-captions-subset. Scripts and notebooks to reproduce results available at https://github.com/umair-hassan2/COCO-Urdu
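For readers who want to inspect the released captions, the Hugging Face `datasets` library can usually load the repository above directly. This is a sketch under the assumption that the repo exposes data files the generic loader can parse; the split and column names are not documented in this digest, so the snippet only prints what it finds.

```python
from datasets import load_dataset

# Repository ID taken from the dataset link above. Split and column names are
# not documented here, so we simply inspect what the generic loader returns.
ds = load_dataset("umairhassan02/urdu-translated-coco-captions-subset")
print(ds)                              # available splits and their columns
first_split = next(iter(ds.values()))
print(first_split[0])                  # first record (e.g. an image id and its Urdu caption)
```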
Summary: To address how under-served Urdu is in multimodal and vision-language research, the authors present COCO-Urdu, a dataset of 59,000 images derived from MS COCO with 319,000 corresponding Urdu captions. The captions were translated with SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework. The dataset is released to reduce language bias in multimodal research and to lay a foundation for inclusive vision-language systems.
Key Takeaways:
- Urdu is severely under-served in multimodal and vision-language research; the lack of large-scale, high-quality datasets has limited the development of Urdu-capable systems.
- COCO-Urdu is, to the authors' knowledge, the largest publicly available Urdu image-captioning dataset, containing 59,000 images and 319,000 Urdu captions.
- Captions were translated with SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that combines COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency.
- Low-scoring captions were iteratively refined with open-source large language models (a loop sketch follows this list).
- COCO-Urdu reports consistently strong results on BLEU, SacreBLEU, and chrF.
- The dataset and the quality estimation pipeline are released to reduce language bias in multimodal research and to lay a foundation for inclusive vision-language systems.
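The takeaways note that low-scoring captions are iteratively refined with open-source LLMs, but no prompt or stopping rule is given in this digest. A minimal refinement loop could look like the sketch below, reusing `hybrid_qe` from the earlier sketch; `llm_refine` and `back_translate` are hypothetical callables standing in for whichever open-source LLM and back-translation model (e.g. SeamlessM4T v2) were actually used, and the round limit is an arbitrary assumption.

```python
def iterative_refine(image_path, src_caption_en, urdu_caption, back_translate,
                     llm_refine, max_rounds=2, threshold=0.5):
    """Re-score a caption and ask an LLM to repair it until it passes or we give up.
    `llm_refine(src_en, urdu)` and `back_translate(urdu)` are hypothetical callables."""
    for _ in range(max_rounds + 1):
        accepted, scores = hybrid_qe(image_path, src_caption_en, urdu_caption,
                                     back_translate(urdu_caption), threshold=threshold)
        if accepted:
            return urdu_caption, scores
        urdu_caption = llm_refine(src_caption_en, urdu_caption)  # attempt a repair
    return urdu_caption, scores  # still low-scoring after max_rounds repairs
```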



CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision
Authors:Puskal Khadka, Rodrigue Rizk, Longwei Wang, KC Santosh
Vision Transformers (ViTs) have achieved impressive results in computer vision by leveraging self-attention to model long-range dependencies. However, their emphasis on global context often comes at the expense of local feature extraction in small datasets, particularly due to the lack of key inductive biases such as locality and translation equivariance. To mitigate this, we propose CoSwin, a novel feature-fusion architecture that augments the hierarchical shifted window attention with localized convolutional feature learning. Specifically, CoSwin integrates a learnable local feature enhancement module into each attention block, enabling the model to simultaneously capture fine-grained spatial details and global semantic structure. We evaluate CoSwin on multiple image classification benchmarks including CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Our experimental results show consistent performance gains over state-of-the-art convolutional and transformer-based models. Notably, CoSwin achieves improvements of 2.17% on CIFAR-10, 4.92% on CIFAR-100, 0.10% on MNIST, 0.26% on SVHN, and 4.47% on Tiny ImageNet over the baseline Swin Transformer. These improvements underscore the effectiveness of local-global feature fusion in enhancing the generalization and robustness of transformers for small-scale vision. Code and pretrained weights available at https://github.com/puskal-khadka/coswin
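The abstract says CoSwin inserts a learnable local feature-enhancement module into every shifted-window attention block, but this digest does not describe the module itself. The PyTorch sketch below shows one plausible realization (a depthwise-plus-pointwise convolution branch fused with the attention output by addition); the module design, kernel size, and additive fusion are assumptions for illustration only; the authors' actual architecture is in the linked repository.

```python
import torch.nn as nn

class LocalFeatureEnhancement(nn.Module):
    """Hypothetical local branch: depthwise 3x3 + pointwise 1x1 convolution on the token grid."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x, H, W):
        # x: (B, H*W, C) token sequence from the transformer stage
        B, L, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)        # tokens -> 2D feature map
        feat = self.act(self.norm(self.pw(self.dw(feat))))  # local convolutional features
        return feat.flatten(2).transpose(1, 2)              # back to (B, H*W, C)

class ConvAugmentedWindowBlock(nn.Module):
    """Sketch of fusing a (shifted-)window attention block with the local conv branch."""
    def __init__(self, dim, attn_block):
        super().__init__()
        self.attn_block = attn_block            # assumed to map (B, H*W, C) -> (B, H*W, C)
        self.local = LocalFeatureEnhancement(dim)

    def forward(self, x, H, W):
        global_feat = self.attn_block(x)        # long-range context via window attention
        local_feat = self.local(x, H, W)        # fine-grained spatial detail
        return global_feat + local_feat         # simple additive fusion (assumption)
```

Fusion by concatenation followed by a linear projection would be an equally plausible variant; the point of the convolutional branch is to restore the locality and translation-equivariance biases that pure window attention lacks on small datasets.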
Paper and project links
Summary
This paper notes that ViTs perform impressively in computer vision but are limited on small datasets, where their emphasis on global context comes at the expense of local feature extraction. To address this, the authors propose CoSwin, a feature-fusion architecture that combines hierarchical shifted-window attention with localized convolutional feature learning. CoSwin performs strongly on several image-classification benchmarks, improving on both convolutional and transformer-based models.
Key Takeaways
- ViTs leverage self-attention to model long-range dependencies and achieve strong results in computer vision, but on small datasets their emphasis on global context leads to insufficient local feature extraction.
- CoSwin is a novel feature-fusion architecture designed to address this limitation of ViTs on small datasets.
- CoSwin combines hierarchical shifted-window attention with localized convolutional feature learning, improving the model's ability to capture fine-grained spatial detail and global semantic structure.
- CoSwin performs strongly on multiple image-classification benchmarks, improving over baseline models such as the Swin Transformer.
- The CoSwin code and pretrained weights are publicly available for follow-up research.
- Local-global feature fusion helps improve the generalization and robustness of transformers for small-scale vision.





ABS-Mamba: SAM2-Driven Bidirectional Spiral Mamba Network for Medical Image Translation
Authors:Feng Yuan, Yifan Gao, Wenbin Wu, Keqing Wu, Xiaotong Guo, Jie Jiang, Xin Gao
Accurate multi-modal medical image translation requires harmonizing global anatomical semantics and local structural fidelity, a challenge complicated by intermodality information loss and structural distortion. We propose ABS-Mamba, a novel architecture integrating the Segment Anything Model 2 (SAM2) for organ-aware semantic representation, specialized convolutional neural networks (CNNs) for preserving modality-specific edge and texture details, and Mamba's selective state-space modeling for efficient long- and short-range feature dependencies. Structurally, our dual-resolution framework leverages SAM2's image encoder to capture organ-scale semantics from high-resolution inputs, while a parallel CNN branch extracts fine-grained local features. The Robust Feature Fusion Network (RFFN) integrates these representations, and the Bidirectional Mamba Residual Network (BMRN) models spatial dependencies using spiral scanning and bidirectional state-space dynamics. A three-stage skip fusion decoder enhances edge and texture fidelity. We employ Efficient Low-Rank Adaptation (LoRA+) fine-tuning to enable precise domain specialization while maintaining the foundational capabilities of the pre-trained components. Extensive experimental validation on the SynthRAD2023 and BraTS2019 datasets demonstrates that ABS-Mamba outperforms state-of-the-art methods, delivering high-fidelity cross-modal synthesis that preserves anatomical semantics and structural details to enhance diagnostic accuracy in clinical applications. The code is available at https://github.com/gatina-yone/ABS-Mamba
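The abstract mentions Efficient Low-Rank Adaptation (LoRA+) fine-tuning for domain specialization without giving hyperparameters. The sketch below sets up LoRA adapters with the `peft` library and applies the LoRA+ recipe of giving the B matrices a larger learning rate than the A matrices; the target module names, rank, learning rates, and ratio are illustrative assumptions, not the authors' configuration.

```python
import torch
from peft import LoraConfig, get_peft_model

def build_lora_plus_optimizer(base_model, lr_A=1e-4, lr_ratio=16.0, weight_decay=0.01):
    """Wrap a pretrained model with LoRA adapters and build an optimizer whose
    lora_B parameters use a learning rate lr_ratio times larger than lora_A
    (the LoRA+ recipe). Target modules are placeholders for illustration."""
    config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # assumption: depends on the backbone
    )
    model = get_peft_model(base_model, config)

    lora_A_params, lora_B_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "lora_A" in name:
            lora_A_params.append(param)
        elif "lora_B" in name:
            lora_B_params.append(param)

    optimizer = torch.optim.AdamW(
        [
            {"params": lora_A_params, "lr": lr_A},
            {"params": lora_B_params, "lr": lr_A * lr_ratio},  # LoRA+: faster B updates
        ],
        weight_decay=weight_decay,
    )
    return model, optimizer
```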
Paper and project links
PDF MICCAI 2025 (under review)
Summary
The paper proposes ABS-Mamba, a novel architecture for multi-modal medical image translation that combines several components to improve translation fidelity. It integrates the Segment Anything Model 2 (SAM2) for organ-aware semantic representation, convolutional neural networks (CNNs) to preserve modality-specific edge and texture details, and Mamba's selective state-space modeling to handle long- and short-range feature dependencies efficiently. Experiments show that ABS-Mamba delivers high-fidelity cross-modal synthesis, preserving anatomical semantics and structural details and improving diagnostic accuracy.
Key Takeaways
- ABS-Mamba integrates SAM2 for organ-aware semantic representation, capturing global anatomical semantics.
- CNNs preserve modality-specific edge and texture details, maintaining local structural fidelity.
- Mamba's selective state-space modeling handles long- and short-range feature dependencies, improving the accuracy of multi-modal medical image translation.
- ABS-Mamba uses a dual-resolution framework that pairs SAM2's image encoder with a parallel CNN branch to capture organ-scale semantics and fine-grained local features, respectively.
- The Robust Feature Fusion Network (RFFN) and the Bidirectional Mamba Residual Network (BMRN) integrate these representations and model spatial dependencies via spiral scanning and bidirectional state-space dynamics.
- A three-stage skip fusion decoder enhances edge and texture fidelity.
- Efficient Low-Rank Adaptation (LoRA+) fine-tuning enables precise domain specialization while retaining the foundational capabilities of the pre-trained components.

