Published: 2025-09-24

Updated: 2025-11-27

Word count: 20.2k

Reading time: 83 min

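⚠️ All summaries below were generated by a large language model and may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for initial screening before reading the papers.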
Updated 2025-09-24

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Authors:Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1

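Paper and project links

PDF Accepted at NeurIPS 2025

Summary

This paper presents TempSamp-R1, a reinforcement fine-tuning framework for adapting multimodal large language models (MLLMs) to video temporal grounding. To address the inefficiency and limited performance of on-policy reinforcement learning methods in large temporal search spaces, TempSamp-R1 uses ground-truth annotations as off-policy supervision for temporally precise guidance, and introduces a non-linear soft advantage computation to stabilize training and reduce the variance of reward-based updates. With a hybrid Chain-of-Thought (CoT) training paradigm, it optimizes a single unified model that supports both CoT and non-CoT inference. Experiments show new state-of-the-art results on Charades-STA, ActivityNet Captions, and QVHighlights, along with robust few-shot generalization.

Key Takeaways

TempSamp-R1 is a reinforcement fine-tuning framework for improving multimodal large language models on video temporal grounding.
Existing reinforcement learning methods such as GRPO rely on on-policy sampling, which is inefficient and limits performance in large temporal search spaces.
TempSamp-R1 uses ground-truth annotations as off-policy supervision to provide temporally precise guidance.
A non-linear soft advantage computation stabilizes training and reduces the variance of reward-based updates.
A hybrid Chain-of-Thought (CoT) training paradigm lets a single unified model support both CoT and non-CoT inference.
TempSamp-R1 sets new state-of-the-art results on Charades-STA (R1@0.7: 52.9%), ActivityNet Captions (R1@0.5: 56.0%), and QVHighlights (mAP: 30.0%).
TempSamp-R1 shows robust few-shot generalization under limited data.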
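
The role of the ground-truth annotation inside a sampled group can be illustrated with a small sketch. It assumes a temporal IoU reward and the standard GRPO-style group normalization (reward minus group mean, divided by group standard deviation). The function names and the example intervals are hypothetical, and the paper's non-linear soft advantage transformation is not reproduced here, since the abstract does not specify its form.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: one annotated segment and three sampled predictions.
ground_truth = (12.0, 20.0)
on_policy_samples = [(0.0, 5.0), (10.0, 18.0), (14.0, 30.0)]

rewards = [temporal_iou(p, ground_truth) for p in on_policy_samples]

# Assumption: the annotation is appended to the group as one extra,
# off-policy solution whose reward is IoU(gt, gt) = 1.0.
mixed_rewards = rewards + [temporal_iou(ground_truth, ground_truth)]
advantages = group_relative_advantages(mixed_rewards)

for r, a in zip(mixed_rewards, advantages):
    print(f"reward={r:.3f} advantage={a:+.3f}")
```

In this toy group none of the sampled intervals is temporally accurate, so the appended annotation receives the largest advantage, which is the kind of precise guidance the abstract attributes to off-policy supervision.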

Dual-View Alignment Learning with Hierarchical-Prompt for Class-Imbalance Multi-Label Classification

Authors:Sheng Huang, Jiexuan Yan, Beiyan Liu, Bo Liu, Richang Hong

Real-world datasets often exhibit class imbalance across multiple categories, manifesting as long-tailed distributions and few-shot scenarios. This is especially challenging in Class-Imbalanced Multi-Label Image Classification (CI-MLIC) tasks, where data imbalance and multi-object recognition present significant obstacles. To address these challenges, we propose a novel method termed Dual-View Alignment Learning with Hierarchical Prompt (HP-DVAL), which leverages multi-modal knowledge from vision-language pretrained (VLP) models to mitigate the class-imbalance problem in multi-label settings. Specifically, HP-DVAL employs dual-view alignment learning to transfer the powerful feature representation capabilities from VLP models by extracting complementary features for accurate image-text alignment. To better adapt VLP models for CI-MLIC tasks, we introduce a hierarchical prompt-tuning strategy that utilizes global and local prompts to learn task-specific and context-related prior knowledge. Additionally, we design a semantic consistency loss during prompt tuning to prevent learned prompts from deviating from general knowledge embedded in VLP models. The effectiveness of our approach is validated on two CI-MLIC benchmarks: MS-COCO and VOC2007. Extensive experimental results demonstrate the superiority of our method over SOTA approaches, achieving mAP improvements of 10.0% and 5.2% on the long-tailed multi-label image classification task, and 6.8% and 2.9% on the multi-label few-shot image classification task.

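Paper and project links

PDF accepted by IEEE Transactions on Image Processing

Summary

To address the class imbalance common in real-world datasets, which is especially challenging in multi-label image classification, this paper proposes Dual-View Alignment Learning with Hierarchical Prompt (HP-DVAL). The method draws on the multi-modal knowledge of vision-language pretrained (VLP) models and uses dual-view alignment learning to extract complementary features for accurate image-text alignment, which mitigates class imbalance. A hierarchical prompt-tuning strategy uses global and local prompts to learn task-specific and context-related prior knowledge. On the MS-COCO and VOC2007 benchmarks, the method outperforms prior approaches, improving mAP by 10.0% and 5.2% on long-tailed multi-label image classification and by 6.8% and 2.9% on multi-label few-shot image classification.

Key Takeaways

The paper targets class imbalance in real-world datasets, particularly in multi-label image classification.
It proposes HP-DVAL, which builds on vision-language pretrained models and uses dual-view alignment learning for accurate image-text alignment.
HP-DVAL uses a hierarchical prompt-tuning strategy with global and local prompts to learn task-specific and context-related prior knowledge.
A semantic consistency loss keeps the learned prompts from deviating from the general knowledge embedded in the VLP model.
On long-tailed multi-label image classification, mAP improves by 10.0% and 5.2% on MS-COCO and VOC2007.
On multi-label few-shot image classification, mAP improves by 6.8% and 2.9%.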

A$^2$M$^2$-Net: Adaptively Aligned Multi-Scale Moment for Few-Shot Action Recognition

Authors:Zilin Gao, Qilong Wang, Bingbing Zhang, Qinghua Hu, Peihua Li

Thanks to capability to alleviate the cost of large-scale annotation, few-shot action recognition (FSAR) has attracted increased attention of researchers in recent years. Existing FSAR approaches typically neglect the role of individual motion pattern in comparison, and under-explore the feature statistics for video dynamics. Thereby, they struggle to handle the challenging temporal misalignment in video dynamics, particularly by using 2D backbones. To overcome these limitations, this work proposes an adaptively aligned multi-scale second-order moment network, namely A$^2$M$^2$-Net, to describe the latent video dynamics with a collection of powerful representation candidates and adaptively align them in an instance-guided manner. To this end, our A$^2$M$^2$-Net involves two core components, namely, adaptive alignment (A$^2$ module) for matching, and multi-scale second-order moment (M$^2$ block) for strong representation. Specifically, M$^2$ block develops a collection of semantic second-order descriptors at multiple spatio-temporal scales. Furthermore, A$^2$ module aims to adaptively select informative candidate descriptors while considering the individual motion pattern. By such means, our A$^2$M$^2$-Net is able to handle the challenging temporal misalignment problem by establishing an adaptive alignment protocol for strong representation. Notably, our proposed method generalizes well to various few-shot settings and diverse metrics. The experiments are conducted on five widely used FSAR benchmarks, and the results show our A$^2$M$^2$-Net achieves very competitive performance compared to state-of-the-arts, demonstrating its effectiveness and generalization.

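Paper and project links

PDF 27 pages, 13 figures, 7 tables

Summary

This paper proposes A$^2$M$^2$-Net, an adaptively aligned multi-scale second-order moment network that describes latent video dynamics with a collection of representation candidates and aligns them adaptively in an instance-guided manner. The network has two core components: adaptive alignment (the A$^2$ module) for matching, and the multi-scale second-order moment (M$^2$ block) for strong representation. The M$^2$ block builds semantic second-order descriptors at multiple spatio-temporal scales, and the A$^2$ module adaptively selects informative candidate descriptors while accounting for the individual motion pattern. This lets A$^2$M$^2$-Net handle challenging temporal misalignment. Experiments on five widely used FSAR benchmarks show very competitive performance compared with the state of the art.

Key Takeaways

Few-shot action recognition (FSAR) has attracted growing attention because it alleviates the cost of large-scale annotation.
Existing FSAR methods typically neglect the role of the individual motion pattern and under-explore feature statistics for video dynamics.
A$^2$M$^2$-Net addresses these limitations with two core components: adaptive alignment (A$^2$ module) and the multi-scale second-order moment (M$^2$ block).
The M$^2$ block builds semantic second-order descriptors at multiple spatio-temporal scales to describe video dynamics.
The A$^2$ module adaptively selects informative candidate descriptors while accounting for the individual motion pattern.
A$^2$M$^2$-Net handles temporal misalignment through an adaptive alignment protocol and generalizes across few-shot settings and metrics.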

From Benchmarks to Reality: Advancing Visual Anomaly Detection by the VAND 3.0 Challenge

Authors:Lars Heckler-Kram, Ashwin Vaidya, Jan-Hendrik Neudeck, Ulla Scheler, Dick Ameln, Samet Akcay, Paula Ramos

Visual anomaly detection is a strongly application-driven field of research. Consequently, the connection between academia and industry is of paramount importance. In this regard, we present the VAND 3.0 Challenge to showcase current progress in anomaly detection across different practical settings whilst addressing critical issues in the field. The challenge hosted two tracks, fostering the development of anomaly detection methods robust against real-world distribution shifts (Category 1) and exploring the capabilities of Vision Language Models within the few-shot regime (Category 2), respectively. The participants’ solutions reached significant improvements over previous baselines by combining or adapting existing approaches and fusing them with novel pipelines. While for both tracks the progress in large pre-trained vision (language) backbones played a pivotal role for the performance increase, scaling up anomaly detection methods more efficiently needs to be addressed by future research to meet real-time and computational constraints on-site.

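Paper and project links

PDF

Summary

Visual anomaly detection is a strongly application-driven research field, so the connection between academia and industry is crucial. The VAND 3.0 Challenge showcases current progress in anomaly detection across practical settings while addressing critical issues in the field. It hosted two tracks: one on anomaly detection methods that are robust to real-world distribution shifts (Category 1), and one on the capabilities of Vision Language Models in the few-shot regime (Category 2). Participants improved significantly over previous baselines by combining or adapting existing approaches and fusing them with novel pipelines. Progress in large pre-trained vision (language) backbones was pivotal for both tracks, but future research needs to scale anomaly detection methods more efficiently to meet on-site real-time and computational constraints.

Key Takeaways

Visual anomaly detection is application-driven and depends on close ties between academia and industry.
The VAND 3.0 Challenge showcases current progress in anomaly detection and addresses critical issues in the field.
The challenge has two tracks: robustness to real-world distribution shifts, and Vision Language Models in the few-shot regime.
Participants improved significantly over previous baselines by combining or adapting existing approaches with novel pipelines.
Progress in large pre-trained vision (language) backbones played a pivotal role in the performance gains.
Future research needs to scale anomaly detection methods more efficiently to meet real-time and computational constraints.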

Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

Authors:Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A. Subramanian, Alvin Chan

The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.

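Paper and project links

PDF NIPS 2025

Summary

This paper explores whether representations from non-text foundation models (FMs) can be integrated into text-based large language models (LLMs) without any training. It proposes In-Context Representation Learning (ICRL) as a proof of concept that lets LLMs adaptively use non-text modality representations through few-shot learning. Unlike traditional in-context learning, which uses text-label pairs, ICRL replaces the text inputs with FM representations, so the LLM can perform multi-modal inference without fine-tuning. ICRL is evaluated on a suite of molecular-domain tasks around three questions: how to map FM representations into LLMs in a training-free manner, what factors influence ICRL performance, and what mechanisms underlie its effectiveness. According to the authors, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs.

Key Takeaways

LLM performance can be enhanced with test-time computation that relies on external tools and even other deep learning models.
Existing approaches for integrating non-text modality representations into LLMs usually need costly supervised training, which restricts on-the-fly adaptation to new domains and modalities.
ICRL lets LLMs adaptively use non-text modality representations through few-shot learning, in a training-free manner.
ICRL replaces text inputs with FM representations, enabling multi-modal inference without fine-tuning.
ICRL is evaluated on molecular-domain tasks, studying how to map FM representations into LLMs, what influences performance, and why the approach works.
According to the authors, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs.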

Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models

Authors:María Andrea Cruz Blandón, Zakaria Aldeneh, Jie Chi, Maureen de Seyssel

Self-supervised learning (SSL) has made significant advances in speech representation learning. Models like wav2vec 2.0 and HuBERT have achieved state-of-the-art results in tasks such as speech recognition, particularly in monolingual settings. However, multilingual SSL models tend to underperform their monolingual counterparts on each individual language, especially in multilingual scenarios with few languages such as the bilingual setting. In this work, we investigate a novel approach to reduce this performance gap by introducing limited visual grounding into bilingual speech SSL models. Our results show that visual grounding benefits both monolingual and bilingual models, with especially pronounced gains for the latter, reducing the multilingual performance gap on zero-shot phonetic discrimination from 31.5% for audio-only models to 8.04% with grounding.

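Paper and project links

PDF 5 pages, 2 figures

Summary

This paper examines the gap between multilingual and monolingual self-supervised speech models and tries to reduce it by introducing limited visual grounding into bilingual speech SSL models. Visual grounding benefits both monolingual and bilingual models, with especially pronounced gains for the latter: on zero-shot phonetic discrimination, the multilingual performance gap falls from 31.5% for audio-only models to 8.04% with grounding.

Key Takeaways

Self-supervised learning has made significant advances in speech representation learning; models such as wav2vec 2.0 and HuBERT reach state-of-the-art results on tasks like speech recognition, particularly in monolingual settings.
Multilingual SSL models tend to underperform their monolingual counterparts on each individual language, especially when few languages are involved, as in the bilingual setting.
The paper introduces limited visual grounding into bilingual speech SSL models to reduce this gap.
Visual grounding helps both monolingual and bilingual models, with larger gains for bilingual models.
On zero-shot phonetic discrimination, the multilingual gap drops from 31.5% (audio-only) to 8.04% (with grounding).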

Efficient Sliced Wasserstein Distance Computation via Adaptive Bayesian Optimization

Authors:Manish Acharya, David Hyde

The sliced Wasserstein distance (SW) reduces optimal transport on $\mathbb{R}^d$ to a sum of one-dimensional projections, and thanks to this efficiency, it is widely used in geometry, generative modeling, and registration tasks. Recent work shows that quasi-Monte Carlo constructions for computing SW (QSW) yield direction sets with excellent approximation error. This paper presents an alternate, novel approach: learning directions with Bayesian optimization (BO), particularly in settings where SW appears inside an optimization loop (e.g., gradient flows). We introduce a family of drop-in selectors for projection directions: BOSW, a one-shot BO scheme on the unit sphere; RBOSW, a periodic-refresh variant; ABOSW, an adaptive hybrid that seeds from competitive QSW sets and performs a few lightweight BO refinements; and ARBOSW, a restarted hybrid that periodically relearns directions during optimization. Our BO approaches can be composed with QSW and its variants (demonstrated by ABOSW/ARBOSW) and require no changes to downstream losses or gradients. We provide numerical experiments where our methods achieve state-of-the-art performance, and on the experimental suite of the original QSW paper, we find that ABOSW and ARBOSW can achieve convergence comparable to the best QSW variants with modest runtime overhead.

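Paper and project links

PDF 19 pages, 11 figures

Summary

The sliced Wasserstein distance (SW) reduces optimal transport on $\mathbb{R}^d$ to a sum of one-dimensional projections, and this efficiency has made it widely used in geometry, generative modeling, and registration tasks. Recent work shows that quasi-Monte Carlo constructions for computing SW (QSW) yield direction sets with excellent approximation error. This paper takes a different approach: learning the directions with Bayesian optimization (BO), particularly where SW appears inside an optimization loop such as a gradient flow. It introduces a family of drop-in selectors for projection directions: BOSW, a one-shot BO scheme on the unit sphere; RBOSW, a periodic-refresh variant; ABOSW, an adaptive hybrid that seeds from competitive QSW sets and performs a few lightweight BO refinements; and ARBOSW, a restarted hybrid that periodically relearns directions during optimization. The BO approaches compose with QSW and its variants (as ABOSW and ARBOSW demonstrate) and require no changes to downstream losses or gradients. In numerical experiments the methods reach state-of-the-art performance, and on the experimental suite of the original QSW paper, ABOSW and ARBOSW converge comparably to the best QSW variants with modest runtime overhead.

Key Takeaways

The sliced Wasserstein distance (SW) reduces optimal transport to one-dimensional projections and is widely used in geometry, generative modeling, and registration tasks.
Recent work uses quasi-Monte Carlo constructions (QSW) to compute SW, yielding direction sets with excellent approximation error.
This paper instead learns projection directions with Bayesian optimization (BO), particularly when SW sits inside an optimization loop such as a gradient flow.
It introduces four drop-in direction selectors (BOSW, RBOSW, ABOSW, and ARBOSW) that can be composed with QSW and its variants.
The BO approaches require no changes to downstream losses or gradients.
The methods reach state-of-the-art performance in numerical experiments, and ABOSW and ARBOSW converge comparably to the best QSW variants with modest runtime overhead.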
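
For readers unfamiliar with the quantity being approximated, the following is a minimal sketch of the plain Monte Carlo estimator of SW between two equally sized point sets, using uniformly random directions. It is the baseline that direction selectors such as QSW or the BO variants would replace; it does not implement any of the paper's selectors, and the function names are hypothetical.

```python
import math
import random


def random_direction(dim, rng):
    """Draw a direction uniformly on the unit sphere by normalizing a Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sliced_wasserstein(X, Y, num_dirs=64, p=2, seed=0):
    """Monte Carlo estimate of SW_p between two point sets of equal size."""
    if len(X) != len(Y):
        raise ValueError("this sketch assumes equally sized point sets")
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(num_dirs):
        theta = random_direction(dim, rng)
        # Project both point sets onto the direction and sort the projections.
        proj_x = sorted(sum(a * b for a, b in zip(x, theta)) for x in X)
        proj_y = sorted(sum(a * b for a, b in zip(y, theta)) for y in Y)
        # In 1D, optimal transport matches sorted samples in order.
        total += sum(abs(a - b) ** p for a, b in zip(proj_x, proj_y)) / len(X)
    return (total / num_dirs) ** (1.0 / p)


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[x + 1.0, y] for x, y in X]  # X shifted by one unit along the first axis
print(sliced_wasserstein(X, X))   # identical sets give zero
print(sliced_wasserstein(X, Y))   # positive, and at most the shift length
```

The only part a learned or quasi-Monte Carlo selector would change in this sketch is how `theta` is chosen; the sort-and-match step per direction stays the same.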

Automated Knowledge Graph Construction using Large Language Models and Sentence Complexity Modelling

Authors:Sydney Anuyah, Mehedi Mahmud Kaushik, Krishna Dwarampudi, Rakesh Shiradkar, Arjan Durresi, Sunandan Chakraborty

We introduce CoDe-KG, an open-source, end-to-end pipeline for extracting sentence-level knowledge graphs by combining robust coreference resolution with syntactic sentence decomposition. Using our model, we contribute a dataset of over 150,000 knowledge triples, which is open source. We also contribute a training corpus of 7248 rows for sentence complexity, 190 rows of gold human annotations for co-reference resolution using open source lung-cancer abstracts from PubMed, 900 rows of gold human annotations for sentence conversion policies, and 398 triples of gold human annotations. We systematically select optimal prompt-model pairs across five complexity categories, showing that hybrid chain-of-thought and few-shot prompting yields up to 99.8% exact-match accuracy on sentence simplification. On relation extraction (RE), our pipeline achieves 65.8% macro-F1 on REBEL, an 8-point gain over the prior state of the art, and 75.7% micro-F1 on WebNLG2, while matching or exceeding performance on Wiki-NRE and CaRB. Ablation studies demonstrate that integrating coreference and decomposition increases recall on rare relations by over 20%. Code and dataset are available at https://github.com/KaushikMahmud/CoDe-KG_EMNLP_2025

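Paper and project links

PDF

Summary

This paper introduces CoDe-KG, an open-source, end-to-end pipeline that extracts sentence-level knowledge graphs by combining robust coreference resolution with syntactic sentence decomposition. The authors release an open-source dataset of over 150,000 knowledge triples, along with a training corpus and gold human annotations. Hybrid chain-of-thought and few-shot prompting reaches up to 99.8% exact-match accuracy on sentence simplification. On relation extraction, the pipeline achieves 65.8% macro-F1 on REBEL, an 8-point gain over the prior state of the art, and 75.7% micro-F1 on WebNLG2, while matching or exceeding prior performance on Wiki-NRE and CaRB. Ablation studies show that integrating coreference and decomposition increases recall on rare relations by over 20%. Code and dataset are available at https://github.com/KaushikMahmud/CoDe-KG_EMNLP_2025.

Key Takeaways

CoDe-KG is an open-source, end-to-end knowledge graph extraction pipeline that combines coreference resolution with syntactic sentence decomposition.
The authors release an open-source dataset of over 150,000 knowledge triples.
Hybrid chain-of-thought and few-shot prompting reaches up to 99.8% exact-match accuracy on sentence simplification.
On relation extraction, the pipeline reaches 65.8% macro-F1 on REBEL and 75.7% micro-F1 on WebNLG2, and matches or exceeds prior performance on Wiki-NRE and CaRB.
Ablation studies show that integrating coreference and decomposition increases recall on rare relations by over 20%.
The release also includes a training corpus and gold human annotations.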
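
Because the results above mix macro-F1 (REBEL) and micro-F1 (WebNLG2), a short sketch of how the two averages differ may help. It uses the standard definitions over per-relation true-positive, false-positive, and false-negative counts; the relation names and counts are made up for illustration and are not taken from the paper.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """counts maps each relation label to a (tp, fp, fn) tuple."""
    # Macro: average the per-relation F1 scores, so rare relations weigh equally.
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts first, so frequent relations dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1_from_counts(tp, fp, fn)


# Hypothetical counts: one frequent relation and one rare relation.
example_counts = {
    "frequent_relation": (90, 10, 10),
    "rare_relation": (1, 1, 3),
}
macro_f1, micro_f1 = macro_micro_f1(example_counts)
print(f"macro-F1={macro_f1:.4f} micro-F1={micro_f1:.4f}")
```

In this toy case the rare relation pulls macro-F1 well below micro-F1, which is why recall gains on rare relations show up more clearly in macro-averaged scores.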

CardiacCLIP: Video-based CLIP Adaptation for LVEF Prediction in a Few-shot Manner

Authors:Yao Du, Jiarong Guo, Xiaomeng Li

Echocardiography is a vital non-invasive modality for cardiac assessment, with left ventricular ejection fraction (LVEF) serving as a key indicator of heart function. Existing LVEF estimation methods depend on large-scale annotated video datasets, which are costly and limit adaptability across various clinical settings. Recent vision-language models for echocardiography, such as EchoCLIP, apply image-to-text pretraining but fail to capture crucial temporal dynamics and localized cardiac structures essential for accurate diagnosis. To address these challenges, we propose CardiacCLIP, a video-based framework that enhances LVEF prediction through attention-based frame aggregation and multi-resolution input scaling. Specifically, we introduce MFL (Multi Frame Learning), a novel attention-based mechanism for selectively fusing informative frames, and EchoZoom, a multi-scale feature extraction strategy that refines spatial representations of cardiac structures. As a novel adaptation of CLIP models for few-shot echocardiogram video analysis, our approach significantly improves diagnostic accuracy, reducing MAE by 2.07 on the EchoNet-Dynamic dataset under 1-shot setting. The code is available at https://github.com/xmed-lab/CardiacCLIP.

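Paper and project links

PDF Accepted by MICCAI 2025

Summary

Echocardiography is a vital non-invasive modality for cardiac assessment, with left ventricular ejection fraction (LVEF) as a key indicator of heart function. Existing LVEF estimation methods depend on large-scale annotated video datasets, which are costly and limit adaptability across clinical settings. This paper proposes CardiacCLIP, a video-based framework that improves LVEF prediction through attention-based frame aggregation and multi-resolution input scaling. It introduces MFL (Multi Frame Learning), an attention-based mechanism for selectively fusing informative frames, and EchoZoom, a multi-scale feature extraction strategy that refines spatial representations of cardiac structures. As an adaptation of CLIP models to few-shot echocardiogram video analysis, the approach reduces MAE by 2.07 on the EchoNet-Dynamic dataset under the 1-shot setting. The code is available at https://github.com/xmed-lab/CardiacCLIP.

Key Takeaways

Echocardiography is a vital non-invasive modality for cardiac assessment, and LVEF is a key indicator of heart function.
Existing LVEF estimation methods depend on large-scale annotated video datasets, which are costly and limit adaptability.
CardiacCLIP improves LVEF prediction through attention-based frame aggregation and multi-resolution input scaling.
MFL (Multi Frame Learning) selectively fuses informative frames with an attention-based mechanism.
EchoZoom refines spatial representations of cardiac structures through multi-scale feature extraction.
As a CLIP adaptation for few-shot echocardiogram video analysis, the approach reduces MAE by 2.07 on EchoNet-Dynamic under the 1-shot setting.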
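
As a rough illustration of attention-based frame aggregation, the sketch below fuses per-frame feature vectors with softmax weights derived from a scaled dot product against a query vector. This is a generic attention-pooling pattern, not the MFL module itself, whose details the abstract does not give; the query vector, the feature values, and all function names are assumptions. A mean absolute error helper is included because MAE is the metric reported above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_frames(frames, query):
    """Fuse frame features with attention weights from scaled dot products."""
    scale = math.sqrt(len(query))
    scores = [sum(f * q for f, q in zip(frame, query)) / scale for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    fused = [sum(w * frame[j] for w, frame in zip(weights, frames)) for j in range(dim)]
    return fused, weights


def mean_absolute_error(predictions, targets):
    """MAE between predicted and reference LVEF values."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical 2-D features for three frames and a hypothetical query.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
fused, weights = aggregate_frames(frames, query)
print(weights)  # frames aligned with the query receive more weight
print(mean_absolute_error([60.0, 55.0], [58.0, 50.0]))
```

Frames whose features align with the query dominate the fused representation, which is the general idea behind selectively fusing informative frames.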

CLaC at DISRPT 2025: Hierarchical Adapters for Cross-Framework Multi-lingual Discourse Relation Classification

Authors:Nawar Turk, Daniele Comitogianni, Leila Kosseim

We present our submission to Task 3 (Discourse Relation Classification) of the DISRPT 2025 shared task. Task 3 introduces a unified set of 17 discourse relation labels across 39 corpora in 16 languages and six discourse frameworks, posing significant multilingual and cross-formalism challenges. We first benchmark the task by fine-tuning multilingual BERT-based models (mBERT, XLM-RoBERTa-Base, and XLM-RoBERTa-Large) with two argument-ordering strategies and progressive unfreezing ratios to establish strong baselines. We then evaluate prompt-based large language models (namely Claude Opus 4.0) in zero-shot and few-shot settings to understand how LLMs respond to the newly proposed unified labels. Finally, we introduce HiDAC, a Hierarchical Dual-Adapter Contrastive learning model. Results show that while larger transformer models achieve higher accuracy, the improvements are modest, and that unfreezing the top 75% of encoder layers yields performance comparable to full fine-tuning while training far fewer parameters. Prompt-based models lag significantly behind fine-tuned transformers, and HiDAC achieves the highest overall accuracy (67.5%) while remaining more parameter-efficient than full fine-tuning.

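Paper and project links

PDF

Summary

This is a submission to Task 3 (Discourse Relation Classification) of the DISRPT 2025 shared task. The task introduces a unified set of 17 discourse relation labels across 39 corpora in 16 languages and six discourse frameworks, which poses significant multilingual and cross-formalism challenges. The authors fine-tune multilingual BERT-based models to establish baselines, evaluate a prompt-based large language model in zero-shot and few-shot settings, and introduce HiDAC, a Hierarchical Dual-Adapter Contrastive learning model. Larger transformer models achieve higher accuracy, but the improvements are modest, and unfreezing the top 75% of encoder layers performs comparably to full fine-tuning while training far fewer parameters. Prompt-based models lag significantly behind fine-tuned transformers, and HiDAC achieves the highest overall accuracy (67.5%) while remaining more parameter-efficient than full fine-tuning.

Key Takeaways

Task 3 of the DISRPT 2025 shared task introduces a unified set of 17 discourse relation labels across 39 corpora, 16 languages, and six discourse frameworks.
The task poses significant multilingual and cross-formalism challenges.
Baselines come from fine-tuning mBERT, XLM-RoBERTa-Base, and XLM-RoBERTa-Large with two argument-ordering strategies and progressive unfreezing ratios.
A prompt-based large language model (Claude Opus 4.0) is evaluated in zero-shot and few-shot settings and lags significantly behind fine-tuned transformers.
HiDAC achieves the highest overall accuracy (67.5%) while remaining more parameter-efficient than full fine-tuning.
Larger transformer models are more accurate, but only modestly; unfreezing the top 75% of encoder layers performs comparably to full fine-tuning with far fewer trained parameters.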

DA-Font: Few-Shot Font Generation via Dual-Attention Hybrid Integration

Authors:Weiran Chen, Guiqian Zhu, Ying Li, Yi Ji, Chunping Liu

Few-shot font generation aims to create new fonts with a limited number of glyph references. It can be used to significantly reduce the labor cost of manual font design. However, due to the variety and complexity of font styles, the results generated by existing methods often suffer from visible defects, such as stroke errors, artifacts and blurriness. To address these issues, we propose DA-Font, a novel framework which integrates a Dual-Attention Hybrid Module (DAHM). Specifically, we introduce two synergistic attention blocks: the component attention block that leverages component information from content images to guide the style transfer process, and the relation attention block that further refines spatial relationships through interacting the content feature with both original and stylized component-wise representations. These two blocks collaborate to preserve accurate character shapes and stylistic textures. Moreover, we also design a corner consistency loss and an elastic mesh feature loss to better improve geometric alignment. Extensive experiments show that our DA-Font outperforms the state-of-the-art methods across diverse font styles and characters, demonstrating its effectiveness in enhancing structural integrity and local fidelity. The source code can be found at https://github.com/wrchen2001/DA-Font.

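Paper and project links

PDF Accepted by ACM MM 2025

Summary

Few-shot font generation creates new fonts from a limited number of glyph references, reducing the labor cost of manual font design. To address the visible defects that existing methods produce, such as stroke errors, artifacts, and blurriness, this paper proposes DA-Font, a framework that integrates a Dual-Attention Hybrid Module (DAHM). The module contains two synergistic attention blocks: a component attention block that uses component information from content images to guide style transfer, and a relation attention block that refines spatial relationships by interacting the content feature with both original and stylized component-wise representations. A corner consistency loss and an elastic mesh feature loss further improve geometric alignment. Experiments show that DA-Font outperforms state-of-the-art methods across diverse font styles and characters, improving structural integrity and local fidelity.

Key Takeaways

Few-shot font generation aims to reduce the labor cost of manual font design.
Fonts generated by existing methods often show stroke errors, artifacts, and blurriness.
DA-Font integrates a Dual-Attention Hybrid Module (DAHM) to address these defects.
The module contains two synergistic attention blocks: a component attention block and a relation attention block.
The component attention block uses component information from content images to guide style transfer.
The relation attention block refines spatial relationships by interacting the content feature with original and stylized component-wise representations.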

Toward Medical Deepfake Detection: A Comprehensive Dataset and Novel Method

Authors:Shuaibo Li, Zhaohu Xing, Hongqiu Wang, Pengfei Hao, Xingyu Li, Zekai Liu, Lei Zhu

The rapid advancement of generative AI in medical imaging has introduced both significant opportunities and serious challenges, especially the risk that fake medical images could undermine healthcare systems. These synthetic images pose serious risks, such as diagnostic deception, financial fraud, and misinformation. However, research on medical forensics to counter these threats remains limited, and there is a critical lack of comprehensive datasets specifically tailored for this field. Additionally, existing media forensic methods, which are primarily designed for natural or facial images, are inadequate for capturing the distinct characteristics and subtle artifacts of AI-generated medical images. To tackle these challenges, we introduce MedForensics, a large-scale medical forensics dataset encompassing six medical modalities and twelve state-of-the-art medical generative models. We also propose DSKI, a novel Dual-Stage Knowledge Infusing detector that constructs a vision-language feature space tailored for the detection of AI-generated medical images. DSKI comprises two core components: 1) a cross-domain fine-trace adapter (CDFA) for extracting subtle forgery clues from both spatial and noise domains during training, and 2) a medical forensic retrieval module (MFRM) that boosts detection accuracy through few-shot retrieval during testing. Experimental results demonstrate that DSKI significantly outperforms both existing methods and human experts, achieving superior accuracy across multiple medical modalities.

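Paper and project links

PDF

Summary

The rapid advancement of generative AI in medical imaging brings both significant opportunities and serious challenges, especially the risk that fake medical images could undermine healthcare systems through diagnostic deception, financial fraud, and misinformation. Research on medical forensics remains limited, and comprehensive datasets tailored to the field are critically lacking. Existing media forensic methods are designed mainly for natural or facial images and fail to capture the distinct characteristics and subtle artifacts of AI-generated medical images. To address this, the paper introduces MedForensics, a large-scale medical forensics dataset covering six medical modalities and twelve state-of-the-art medical generative models, and proposes DSKI, a detector that builds a vision-language feature space tailored to detecting AI-generated medical images. Experiments show that DSKI significantly outperforms both existing methods and human experts across multiple medical modalities.

Key Takeaways

Progress in generative AI for medical imaging brings major opportunities and challenges, especially the threat that fake medical images pose to healthcare systems.
Synthetic medical images carry risks such as diagnostic deception, financial fraud, and misinformation.
Medical forensics research is limited and lacks comprehensive, purpose-built datasets.
Existing media forensic methods target natural or facial images and are inadequate for AI-generated medical images.
MedForensics is a large-scale medical forensics dataset covering six medical modalities and twelve state-of-the-art medical generative models.
DSKI, a Dual-Stage Knowledge Infusing detector, combines a cross-domain fine-trace adapter (CDFA) used during training with a medical forensic retrieval module (MFRM) that performs few-shot retrieval during testing.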

SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions

Authors:Cristian Sbrolli, Matteo Matteucci

The whole is greater than the sum of its parts, even in 3D-text contrastive learning. We introduce SceneForge, a novel framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. SceneForge leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairing them with coherent multi-object descriptions refined by a large language model. By augmenting contrastive training with these structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets, significantly enriching data complexity and diversity. We systematically investigate critical design elements, such as the optimal number of objects per scene, the proportion of compositional samples in training batches, and scene construction strategies. Extensive experiments demonstrate that SceneForge delivers substantial performance gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjNN, Objaverse-LVIS, and ScanNet, as well as few-shot part segmentation on ShapeNetPart. SceneForge’s compositional augmentations are model-agnostic, consistently improving performance across multiple encoder architectures. Moreover, SceneForge improves 3D visual question answering on ScanQA, generalizes robustly to retrieval scenarios with increasing scene complexity, and showcases spatial reasoning capabilities by adapting spatial configurations to align precisely with textual instructions.

Paper and project links

PDF to appear in NeurIPS 2025

Summary

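SceneForge strengthens contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. It builds multi-object scenes with explicit spatial relations from individual 3D shapes and pairs them with multi-object descriptions refined by a large language model. Augmenting contrastive training with these structured, compositional samples addresses the scarcity of large-scale 3D-text datasets and greatly enriches data complexity and diversity.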

Key Takeaways

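SceneForge is a framework for strengthening contrastive alignment between 3D point clouds and text.
By building multi-object scenes with explicit spatial relations, SceneForge addresses the scarcity of large-scale 3D-text datasets.
Scenes composed from individual 3D shapes are paired with multi-object descriptions refined by a large language model.
Training on these structured, compositional samples enriches data complexity and diversity.
SceneForge delivers substantial gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjNN, Objaverse-LVIS, and ScanNet, and few-shot part segmentation on ShapeNetPart.
The compositional augmentations are model-agnostic and improve performance across multiple encoder architectures.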
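
The scene-composition idea above can be illustrated with a minimal sketch. This is not SceneForge's implementation: the relation table, the `compose_scene` helper, the `gap` parameter, and the caption template are all hypothetical, and the LLM refinement of the caption is left out. It only shows how two single-object point clouds might be placed in an explicit spatial relation and paired with a matching description.

```python
# Hypothetical spatial relations mapped to unit directions (x, y, z).
RELATION_OFFSETS = {
    "left of": (-1.0, 0.0, 0.0),
    "right of": (1.0, 0.0, 0.0),
    "on top of": (0.0, 0.0, 1.0),
}


def extent(points, axis):
    """Size of a point cloud along one axis."""
    values = [p[axis] for p in points]
    return max(values) - min(values)


def center(points):
    """Shift a point cloud so its bounding-box centre sits at the origin."""
    mids = [
        (max(p[a] for p in points) + min(p[a] for p in points)) / 2
        for a in range(3)
    ]
    return [tuple(p[a] - mids[a] for a in range(3)) for p in points]


def compose_scene(anchor, other, relation, gap=0.1):
    """Place `other` relative to `anchor`; return (scene points, caption)."""
    name_a, pts_a = anchor
    name_b, pts_b = other
    pts_a, pts_b = center(pts_a), center(pts_b)
    direction = RELATION_OFFSETS[relation]
    # Shift `other` along the relation axis far enough that the two
    # bounding boxes do not overlap.
    shift = [
        d * ((extent(pts_a, a) + extent(pts_b, a)) / 2 + gap)
        for a, d in enumerate(direction)
    ]
    moved = [tuple(p[a] + shift[a] for a in range(3)) for p in pts_b]
    caption = f"a {name_b} {relation} a {name_a}"
    return pts_a + moved, caption
```

In a training pipeline, the returned point set and caption would form one extra positive pair in a contrastive batch alongside the usual single-object pairs.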

State-of-the-Art Dysarthric Speech Recognition with MetaICL for on-the-fly Personalization

Authors:Dhruuv Agarwal, Harry Zhang, Yang Yu, Quan Wang

Personalizing Automatic Speech Recognition (ASR) for dysarthric speech is crucial but challenging due to training and storing of individual user adapters. We propose a hybrid meta-training method for a single model, excelling in zero-shot and few-shot on-the-fly personalization via in-context learning (ICL). Measuring Word Error Rate (WER) on state-of-the-art subsets, the model achieves 13.9% WER on Euphonia which surpasses speaker-independent baselines (17.5% WER) and rivals user-specific personalized models. On SAP Test 1, its 5.3% WER significantly bests the 8% from even personalized adapters. We also demonstrate the importance of example curation, where an oracle text-similarity method shows 5 curated examples can achieve performance similar to 19 randomly selected ones, highlighting a key area for future efficiency gains. Finally, we conduct data ablations to measure the data efficiency of this approach. This work presents a practical, scalable, and personalized solution.

Paper and project links

PDF

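Summary

For recognizing dysarthric speech, the paper proposes a hybrid meta-training method. A single model supports zero-shot and few-shot on-the-fly personalization via in-context learning (ICL). Measured by word error rate (WER), the model reaches 13.9% on Euphonia, beating the speaker-independent baseline (17.5%) and rivaling user-specific personalized models; on SAP Test 1 its 5.3% WER is significantly better than the 8% of personalized adapters. The paper also stresses example curation: with an oracle text-similarity method, 5 curated examples perform about as well as 19 randomly selected ones. Data ablations measure the data efficiency of the approach. The work presents a practical, scalable, and personalized solution.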

Key Takeaways

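A hybrid meta-training method is proposed for personalizing recognition of dysarthric speech.
A single model achieves zero-shot and few-shot on-the-fly personalization via in-context learning.
On Euphonia, the model's 13.9% WER beats the speaker-independent baseline (17.5%) and rivals user-specific personalized models.
On SAP Test 1, its 5.3% WER is significantly better than the 8% of personalized adapters.
Example curation matters: 5 examples curated by an oracle text-similarity method perform about as well as 19 random ones.
Data ablations measure the data efficiency of the approach.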
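
The two quantities discussed above, word error rate and similarity-based example curation, can be sketched in a few lines. The WER function is the standard word-level edit distance divided by the reference length. The `select_examples` helper is only an assumption for illustration: it ranks candidate transcripts by Jaccard word overlap, which may differ from the oracle text-similarity method used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def select_examples(pool, target_text, k):
    """Hypothetical curation: keep the k transcripts most similar to the target."""
    target = set(target_text.split())

    def jaccard(text):
        words = set(text.split())
        union = words | target
        return len(words & target) / len(union) if union else 0.0

    # sorted() is stable, so ties keep their original pool order.
    return sorted(pool, key=jaccard, reverse=True)[:k]
```

The selected examples would then be placed in the model's context ahead of the utterance to be transcribed; an oracle variant compares against the true transcript, which is not available at inference time.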

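Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification

Authors:Tuo Xiang, Xuemiao Xu, Bangzhen Liu, Jinyi Li, Yong Li, Shengfeng He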

The rapid growth of 3D digital content necessitates expandable recognition systems for open-world scenarios. However, existing 3D class-incremental learning methods struggle under extreme data scarcity due to geometric misalignment and texture bias. While recent approaches integrate 3D data with 2D foundation models (e.g., CLIP), they suffer from semantic blurring caused by texture-biased projections and indiscriminate fusion of geometric-textural cues, leading to unstable decision prototypes and catastrophic forgetting. To address these issues, we propose Cross-Modal Geometric Rectification (CMGR), a framework that enhances 3D geometric fidelity by leveraging CLIP’s hierarchical spatial semantics. Specifically, we introduce a Structure-Aware Geometric Rectification module that hierarchically aligns 3D part structures with CLIP’s intermediate spatial priors through attention-driven geometric fusion. Additionally, a Texture Amplification Module synthesizes minimal yet discriminative textures to suppress noise and reinforce cross-modal consistency. To further stabilize incremental prototypes, we employ a Base-Novel Discriminator that isolates geometric variations. Extensive experiments demonstrate that our method significantly improves 3D few-shot class-incremental learning, achieving superior geometric coherence and robustness to texture bias across cross-domain and within-domain settings.

Paper and project links

PDF ICCV2025

Summary

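The rapid growth of 3D digital content calls for expandable recognition systems in open-world scenarios. Existing 3D class-incremental learning methods struggle under extreme data scarcity because of geometric misalignment and texture bias. Recent approaches that combine 3D data with 2D foundation models such as CLIP suffer from semantic blurring, which leads to unstable decision prototypes and catastrophic forgetting. The paper proposes Cross-Modal Geometric Rectification (CMGR), which improves 3D geometric fidelity by exploiting CLIP's hierarchical spatial semantics. A Structure-Aware Geometric Rectification module hierarchically aligns 3D part structures with CLIP's intermediate spatial priors, and a Texture Amplification Module synthesizes minimal yet discriminative textures to suppress noise and reinforce cross-modal consistency. A Base-Novel Discriminator isolates geometric variations to stabilize incremental prototypes. Experiments show significant gains in 3D few-shot class-incremental learning, with better geometric coherence and robustness to texture bias in both cross-domain and within-domain settings.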

Key Takeaways

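The growth of 3D digital content drives the need for expandable recognition systems in open-world scenarios.
Existing 3D class-incremental learning methods struggle under extreme data scarcity, mainly because of geometric misalignment and texture bias.
Cross-Modal Geometric Rectification (CMGR) is proposed to improve 3D geometric fidelity.
CMGR uses CLIP's hierarchical spatial semantics for structure-aware geometric rectification, aligning 3D part structures with CLIP's intermediate spatial priors.
A Texture Amplification Module synthesizes minimal yet discriminative textures to suppress noise and reinforce cross-modal consistency.
A Base-Novel Discriminator isolates geometric variations and further stabilizes incremental prototypes.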
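
To make the notion of decision prototypes in few-shot class-incremental learning concrete, here is a minimal nearest-prototype classifier. It is a generic baseline, not CMGR: the `PrototypeClassifier` class and its plain feature vectors are assumptions, and none of the paper's rectification, texture, or discriminator modules are modelled. Each class is represented by the mean of its few support features, and new classes are added without touching the old prototypes.

```python
import math


class PrototypeClassifier:
    """Generic nearest-prototype classifier for incremental few-shot classes."""

    def __init__(self):
        self.prototypes = {}  # class name -> mean feature vector

    def add_class(self, name, features):
        """Register a class from a handful of support feature vectors."""
        dim = len(features[0])
        self.prototypes[name] = [
            sum(f[d] for f in features) / len(features) for d in range(dim)
        ]

    @staticmethod
    def _cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def predict(self, feature):
        """Return the class whose prototype is most cosine-similar."""
        return max(
            self.prototypes,
            key=lambda name: self._cosine(feature, self.prototypes[name]),
        )
```

With so few samples per class, a noisy feature extractor shifts these means easily, which is one way to read the abstract's concern about unstable prototypes.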

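Two Facets of the Same Optimization Coin: Model Degradation and Representation Collapse in Graph Foundation Models

Authors:Xunkai Li, Daohan Su, Sicheng Liu, Ru Zhang, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang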

Inspired by the success of LLMs, GFMs are designed to learn the optimal embedding functions from multi-domain text-attributed graphs for the downstream cross-task generalization capability. Among the diverse architectures, graph VQ-MAE stands out among the increasingly diverse landscape of GFM. This is attributed to its ability to jointly encode topology and textual attributes from multiple domains into discrete embedding spaces with clear semantic boundaries. Despite its potential, domain generalization conflicts cause imperceptible pitfalls. In this paper, we instantiate two of them, and they are just like two sides of the same GFM optimization coin - Side 1 Model Degradation: The encoder and codebook fail to capture the diversity of inputs; Side 2 Representation Collapse: The hidden embedding and codebook vector fail to preserve semantic separability due to constraints from narrow representation subspaces. These two pitfalls (sides) collectively impair the decoder and generate the low-quality reconstructed supervision, causing the GFM optimization dilemma during pre-training (coin). Through empirical investigation, we attribute the above challenges to Information Bottleneck and Regularization Deficit. To address them, we propose MoT - (1) Information Tinker for Two Pitfalls, which utilizes an edge-wise semantic fusion strategy and a mixture-of-codebooks with domain-aware routing to improve information capacity. (2) Regularization Tinker for Optimization Coin, which utilizes two additional regularizations to further improve gradient supervision in our proposed Information Tinker. Notably, as a flexible architecture, MoT adheres to the scaling laws of GFM, offering a controllable model scale. Compared to SOTA baselines, experiments on 22 datasets across 6 domains demonstrate that MoT achieves significant improvements in supervised, few-shot, and zero-shot scenarios.

Paper and project links

PDF

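Summary

Graph VQ-MAE stands out among graph foundation model (GFM) architectures but suffers from model degradation and representation collapse. The paper proposes MoT, which tackles them with an Information Tinker and a Regularization Tinker that improve information capacity and gradient supervision. Experiments on 22 datasets across 6 domains show significant improvements over SOTA baselines in supervised, few-shot, and zero-shot scenarios.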

Key Takeaways

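GFMs aim to learn optimal embedding functions from multi-domain text-attributed graphs for downstream cross-task generalization.
Graph VQ-MAE stands out among GFM architectures because it jointly encodes topology and textual attributes from multiple domains into discrete embedding spaces.
Two pitfalls arise in GFMs, model degradation and representation collapse; together they impair the decoder and cause the GFM optimization dilemma.
Model degradation means the encoder and codebook fail to capture the diversity of inputs; representation collapse means the hidden embedding and codebook vectors fail to preserve semantic separability.
The paper attributes these challenges to an information bottleneck and a regularization deficit.
MoT addresses them with an Information Tinker (edge-wise semantic fusion and a mixture-of-codebooks with domain-aware routing) and a Regularization Tinker (two additional regularizations) to improve information capacity and gradient supervision.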
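
The codebook lookup at the heart of a VQ-style model, and the idea of routing inputs to one of several codebooks, can be sketched as follows. This is a simplification under stated assumptions: routing here is a hard lookup on a known domain label with a hypothetical `"shared"` fallback, whereas the paper's domain-aware routing is presumably learned, and the function names are invented for illustration.

```python
def nearest_code(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`."""

    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


def quantize(vector, domain, codebooks, default="shared"):
    """Route an embedding to its domain's codebook, then quantize it.

    Falls back to the `default` codebook for unseen domains
    (an assumption of this sketch, not a detail from the paper).
    """
    used = domain if domain in codebooks else default
    index, code = nearest_code(vector, codebooks[used])
    return used, index, code
```

If every input maps to the same few entries of a single codebook, many codes go unused, which is the kind of limited capacity that several routed codebooks are meant to relieve.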

HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring

Authors:Xin Wang, Ting Dang, Xinyu Zhang, Vassilis Kostakos, Michael J. Witbrock, Hong Jia

Mobile and wearable healthcare monitoring play a vital role in facilitating timely interventions, managing chronic health conditions, and ultimately improving individuals’ quality of life. Previous studies on large language models (LLMs) have highlighted their impressive generalization abilities and effectiveness in healthcare prediction tasks. However, most LLM-based healthcare solutions are cloud-based, which raises significant privacy concerns and results in increased memory usage and latency. To address these challenges, there is growing interest in compact models, Small Language Models (SLMs), which are lightweight and designed to run locally and efficiently on mobile and wearable devices. Nevertheless, how well these models perform in healthcare prediction remains largely unexplored. We systematically evaluated SLMs on health prediction tasks using zero-shot, few-shot, and instruction fine-tuning approaches, and deployed the best performing fine-tuned SLMs on mobile devices to evaluate their real-world efficiency and predictive performance in practical healthcare scenarios. Our results show that SLMs can achieve performance comparable to LLMs while offering substantial gains in efficiency and privacy. However, challenges remain, particularly in handling class imbalance and few-shot scenarios. These findings highlight SLMs, though imperfect in their current form, as a promising solution for next-generation, privacy-preserving healthcare monitoring.

Paper and project links

PDF 9 pages, 6 tables, 6 figures

Summary

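Mobile and wearable health monitoring plays a vital role in enabling timely interventions, managing chronic conditions, and improving quality of life. Because most LLM-based healthcare solutions are cloud-based, raising privacy concerns and increasing memory usage and latency, interest is growing in Small Language Models (SLMs) that run locally on mobile and wearable devices. The paper systematically evaluates SLMs on health prediction tasks using zero-shot, few-shot, and instruction fine-tuning approaches, and deploys the best fine-tuned SLMs on mobile devices. SLMs achieve performance comparable to LLMs while offering substantial gains in efficiency and privacy, but challenges remain in handling class imbalance and few-shot scenarios. Overall, SLMs are a promising, if still imperfect, solution for next-generation privacy-preserving healthcare monitoring.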

Key Takeaways

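Mobile and wearable health monitoring is vital for timely interventions, managing chronic conditions, and improving quality of life.
Cloud-based LLM healthcare solutions raise privacy concerns and increase memory usage and latency.
Small Language Models (SLMs) are lightweight and designed to run locally on mobile and wearable devices.
SLMs can achieve performance comparable to LLMs on health prediction tasks.
The best fine-tuned SLMs, deployed on mobile devices, offer substantial gains in efficiency and privacy.
Handling class imbalance and few-shot scenarios remains a challenge for SLMs.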

VisText-Mosquito: A Unified Multimodal Benchmark Dataset for Visual Detection, Segmentation, and Textual Reasoning on Mosquito Breeding Sites

Authors:Md. Adnanul Islam, Md. Faiyaz Abdullah Sayeedi, Md. Asaduzzaman Shuvo, Shahanur Rahman Bappy, Md Asiful Islam, Swakkhar Shatabda

Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding site analysis. The dataset includes 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language reasoning texts linked to each image. The YOLOv9s model achieves the highest precision of 0.92926 and mAP@50 of 0.92891 for object detection, while YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795. For reasoning generation, we tested a range of large vision-language models (LVLMs) in both zero-shot and few-shot settings. Our fine-tuned Mosquito-LLaMA3-8B model achieved the best results, with a final loss of 0.0028, a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.85. This dataset and model framework emphasize the theme “Prevention is Better than Cure”, showcasing how AI-based detection can proactively address mosquito-borne disease risks. The dataset and implementation code are publicly available at GitHub: https://github.com/adnanul-islam-jisun/VisText-Mosquito

Paper and project links

PDF

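Summary

The paper presents VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding site analysis. YOLOv9s gives the best object detection results and YOLOv11n-Seg is used for water surface segmentation, while large vision-language models generate the reasoning text, with the fine-tuned Mosquito-LLaMA3-8B performing best. The dataset and models support early detection and proactive control of breeding sites, in line with the theme "Prevention is Better than Cure".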

Key Takeaways

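VisText-Mosquito integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding sites.
The dataset contains 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language reasoning texts linked to each image.
YOLOv9s reaches a precision of 0.92926 and mAP@50 of 0.92891 for object detection; YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795.
For reasoning generation, a range of large vision-language models were tested in zero-shot and few-shot settings.
The fine-tuned Mosquito-LLaMA3-8B model achieved the best reasoning results, with a final loss of 0.0028, BLEU of 54.7, BERTScore of 0.91, and ROUGE-L of 0.85.
The dataset and model framework emphasize the theme "Prevention is Better than Cure", showing how AI-based detection can proactively address mosquito-borne disease risks.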
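
For readers unfamiliar with the "@50" in mAP@50, the sketch below shows the underlying test: a predicted box counts as correct when its intersection over union (IoU) with an unmatched ground-truth box is at least 0.5. The `precision_at_iou` helper is a simplified assumption for illustration; it computes plain precision with greedy matching, not the full mean average precision (which also averages over confidence thresholds and classes), and it is not the evaluation code used in the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(predictions, ground_truth, threshold=0.5):
    """Fraction of predictions matched one-to-one to a ground-truth box.

    `predictions` is assumed to be sorted by descending confidence.
    """
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)  # each ground-truth box is used once
            true_positives += 1
    return true_positives / len(predictions) if predictions else 0.0
```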

Are LLMs Better Formalizers than Solvers on Complex Problems?

Authors:Rikhil Amonkar, May Lai, Ronan Le Bras, Li Zhang

A trending line of recent work advocates for using large language models (LLMs) as formalizers instead of as end-to-end solvers for logical reasoning problems. Instead of generating the solution, the LLM generates a formal program that derives a solution via an external solver. While performance gain of the seemingly scalable LLM-as-formalizer over the seemingly unscalable LLM-as-solver has been widely reported, we show that this superiority does not hold on real-life constraint satisfaction problems. On 4 domains, we systematically evaluate 6 LLMs including 4 large reasoning models with inference-time scaling, paired with 5 pipelines including 2 types of formalism. We show that in few-shot settings, LLM-as-formalizer underperforms LLM-as-solver. While LLM-as-formalizer promises accuracy, robustness, faithfulness, and efficiency, we observe that the present LLMs do not yet deliver any of those, as their limited ability to generate formal programs leads to failure to scale with complexity, hard-coded solutions, and excessive reasoning tokens. We present our detailed analysis and actionable remedies to drive future research that improves LLM-as-formalizer.

Paper and project links

PDF

Summary

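A recent line of work advocates using large language models (LLMs) as formalizers rather than end-to-end solvers for logical reasoning problems. Although the advantage of LLM-as-formalizer over LLM-as-solver has been widely reported, the paper shows that it does not hold on real-life constraint satisfaction problems: in few-shot settings, LLM-as-formalizer underperforms LLM-as-solver. LLM-as-formalizer promises accuracy, robustness, faithfulness, and efficiency, but current LLMs deliver none of these, because their limited ability to generate formal programs leads to failure to scale with complexity, hard-coded solutions, and excessive reasoning tokens. The paper offers a detailed analysis and actionable remedies to guide future work on LLM-as-formalizer.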

Key Takeaways

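An LLM can act as a formalizer: instead of generating the solution, it generates a formal program from which an external solver derives the solution.
On real-life constraint satisfaction problems, the reported superiority of LLM-as-formalizer does not hold.
In few-shot settings, LLM-as-formalizer underperforms LLM-as-solver.
LLM-as-formalizer promises accuracy, robustness, faithfulness, and efficiency, but current LLMs do not yet deliver any of them.
The limited ability of LLMs to generate formal programs leads to failure to scale with problem complexity.
It also leads to hard-coded solutions and excessive reasoning tokens.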
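
The difference between the two roles can be shown with a toy constraint satisfaction problem. In the formalizer setting, the model's output would be a declarative specification (variables, domains, constraints) and a separate solver would search for the answer. The sketch below uses a brute-force search as a stand-in for that external solver; the scheduling problem, the names, and the solver itself are assumptions for illustration and do not reflect the formalisms or pipelines evaluated in the paper.

```python
from itertools import product


def solve(variables, domains, constraints):
    """Stand-in external solver: exhaustive search over all assignments."""
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        if all(check(assignment) for check in constraints):
            return assignment
    return None  # no assignment satisfies every constraint


# What a formalizer would be asked to emit: a declarative description
# of the problem rather than the answer itself.
variables = ["alice", "bob", "carol"]
domains = {name: [9, 10, 11] for name in variables}  # meeting start hours
constraints = [
    lambda a: len(set(a.values())) == len(a),  # everyone gets a distinct slot
    lambda a: a["alice"] < a["bob"],           # Alice meets before Bob
    lambda a: a["carol"] != 9,                 # Carol is unavailable at 9
]

schedule = solve(variables, domains, constraints)
print(schedule)
```

A "hard-coded solution" in this picture would be a program that simply returns a fixed schedule, so the solver does no real work; a solver-style model would instead produce the schedule directly in text.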

DescriptorMedSAM: Language-Image Fusion with Multi-Aspect Text Guidance for Medical Image Segmentation

Authors:Wenjie Zhang, Liming Luo, Mengnan He, Jiarui Hai, Jiancheng Ye

Accurate organ segmentation is essential for clinical tasks such as radiotherapy planning and disease monitoring. Recent foundation models like MedSAM achieve strong results using point or bounding-box prompts but still require manual interaction. We propose DescriptorMedSAM, a lightweight extension of MedSAM that incorporates structured text prompts, ranging from simple organ names to combined shape and location descriptors to enable click-free segmentation. DescriptorMedSAM employs a CLIP text encoder to convert radiology-style descriptors into dense embeddings, which are fused with visual tokens via a cross-attention block and a multi-scale feature extractor. We designed four descriptor types: Name (N), Name + Shape (NS), Name + Location (NL), and Name + Shape + Location (NSL), and evaluated them on the FLARE 2022 dataset under zero-shot and few-shot settings, where organs unseen during training must be segmented with minimal additional data. NSL prompts achieved the highest performance, with a Dice score of 0.9405 under full supervision, a 76.31% zero-shot retention ratio, and a 97.02% retention ratio after fine-tuning with only 50 labeled slices per unseen organ. Adding shape and location cues consistently improved segmentation accuracy, especially for small or morphologically complex structures. We demonstrate that structured language prompts can effectively replace spatial interactions, delivering strong zero-shot performance and rapid few-shot adaptation. By quantifying the role of descriptor, this work lays the groundwork for scalable, prompt-aware segmentation models that generalize across diverse anatomical targets with minimal annotation effort.

Paper and project links

PDF

Summary

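The paper proposes DescriptorMedSAM, a lightweight extension of MedSAM for medical image segmentation that uses structured text prompts, from simple organ names to combined shape and location descriptors, to enable click-free segmentation. A CLIP text encoder converts radiology-style descriptors into dense embeddings, which are fused with visual tokens via a cross-attention block and a multi-scale feature extractor. On FLARE 2022, prompts combining name, shape, and location (NSL) perform best, with a Dice score of 0.9405 under full supervision, a 76.31% zero-shot retention ratio, and a 97.02% retention ratio after fine-tuning with only 50 labeled slices per unseen organ. Structured language prompts can thus replace spatial interactions, giving strong zero-shot performance and rapid few-shot adaptation, and lay the groundwork for scalable, prompt-aware segmentation models that generalize across anatomical targets with minimal annotation effort.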

Key Takeaways

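DescriptorMedSAM is a lightweight extension of MedSAM that uses structured text prompts for click-free segmentation.
A CLIP text encoder converts radiology-style descriptors into dense embeddings.
The embeddings are fused with visual tokens via a cross-attention block and a multi-scale feature extractor.
The descriptor types are Name (N), Name + Shape (NS), Name + Location (NL), and Name + Shape + Location (NSL).
On FLARE 2022, NSL prompts achieve the highest performance, with a Dice score of 0.9405 under full supervision.
Structured language prompts can replace spatial interactions and perform well in zero-shot and few-shot settings.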
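
The four descriptor types and the reported metrics can be made concrete with a short sketch. The Dice formula is standard. The prompt format produced by `build_prompt` and the example descriptor wording are hypothetical, and `retention_ratio` assumes the ratio is the score in the transfer setting divided by the fully supervised score, which the abstract does not spell out.

```python
def dice_score(pred_mask, true_mask):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for flat binary (0/1) masks."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    return 2 * intersection / total if total else 1.0


def build_prompt(name, shape=None, location=None):
    """Compose an N, NS, NL, or NSL text prompt (format is hypothetical)."""
    parts = [name]
    if shape:
        parts.append(shape)
    if location:
        parts.append(location)
    return ", ".join(parts)


def retention_ratio(score, supervised_score):
    """Assumed definition: share of the fully supervised score retained."""
    return score / supervised_score
```

For example, `build_prompt("liver")` gives an N prompt, while passing both a shape and a location gives an NSL prompt that would be fed to the text encoder.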

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-09-24/Few-Shot/

Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.


Few-Shot

Updated 2025-09-24

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Dual-View Alignment Learning with Hierarchical-Prompt for Class-Imbalance Multi-Label Classification

A$^2$M$^2$-Net: Adaptively Aligned Multi-Scale Moment for Few-Shot Action Recognition

From Benchmarks to Reality: Advancing Visual Anomaly Detection by the VAND 3.0 Challenge

Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models

Efficient Sliced Wasserstein Distance Computation via Adaptive Bayesian Optimization

Automated Knowledge Graph Construction using Large Language Models and Sentence Complexity Modelling

CardiacCLIP: Video-based CLIP Adaptation for LVEF Prediction in a Few-shot Manner

CLaC at DISRPT 2025: Hierarchical Adapters for Cross-Framework Multi-lingual Discourse Relation Classification

DA-Font: Few-Shot Font Generation via Dual-Attention Hybrid Integration

Toward Medical Deepfake Detection: A Comprehensive Dataset and Novel Method

SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions

State-of-the-Art Dysarthric Speech Recognition with MetaICL for on-the-fly Personalization

Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification

Two Facets of the Same Optimization Coin: Model Degradation and Representation Collapse in Graph Foundation Models

HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring

VisText-Mosquito: A Unified Multimodal Benchmark Dataset for Visual Detection, Segmentation, and Textual Reasoning on Mosquito Breeding Sites

Are LLMs Better Formalizers than Solvers on Complex Problems?

DescriptorMedSAM: Language-Image Fusion with Multi-Aspect Text Guidance for Medical Image Segmentation