⚠️ All of the content summaries below are produced by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never use these for serious academic work; they are only meant for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-18
GraphPilot: Grounded Scene Graph Conditioning for Language-Based Autonomous Driving
Authors:Fabian Schmidt, Markus Enzweiler, Abhinav Valada
Vision-language models have recently emerged as promising planners for autonomous driving, where success hinges on topology-aware reasoning over spatial structure and dynamic interactions from multimodal input. However, existing models are typically trained without supervision that explicitly encodes these relational dependencies, limiting their ability to infer how agents and other traffic entities influence one another from raw sensor data. In this work, we bridge this gap with a novel model-agnostic method that conditions language-based driving models on structured relational context in the form of traffic scene graphs. We serialize scene graphs at various abstraction levels and formats, and incorporate them into the models via structured prompt templates, enabling a systematic analysis of when and how relational supervision is most beneficial. Extensive evaluations on the public LangAuto benchmark show that scene graph conditioning of state-of-the-art approaches yields large and persistent improvement in driving performance. Notably, we observe up to a 15.6% increase in driving score for LMDrive and 17.5% for BEVDriver, indicating that models can better internalize and ground relational priors through scene graph-conditioned training, even without requiring scene graph input at test-time. Code, fine-tuned models, and our scene graph dataset are publicly available at https://github.com/iis-esslingen/GraphPilot.
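As a rough, hypothetical sketch of what conditioning a driving model on a serialized scene graph could look like (the graph format, relation vocabulary, and template wording below are our assumptions, not GraphPilot's):

```python
# Hypothetical sketch: serializing a traffic scene graph into a prompt.
# Relation names, graph format, and template wording are illustrative only.
scene_graph = [
    ("ego", "follows", "car_1"),
    ("car_1", "brakes_for", "pedestrian_1"),
    ("pedestrian_1", "crosses", "crosswalk_1"),
]

def serialize_scene_graph(triples):
    """Flatten (subject, relation, object) triples, one per line."""
    return "\n".join(f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triples)

def build_prompt(instruction, triples):
    """Condition a language-based driving model on relational context."""
    return (
        "Scene graph:\n"
        f"{serialize_scene_graph(triples)}\n\n"
        f"Instruction: {instruction}\n"
        "Predict the next driving action."
    )

print(build_prompt("Turn left at the next intersection.", scene_graph))
```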
Paper and Project Links
Summary
Vision-language models hold broad promise as planners for autonomous driving, where success hinges on topology-aware reasoning over spatial structure and dynamic interactions from multimodal input. However, existing models are typically trained without supervision that explicitly encodes these relational dependencies, limiting their ability to infer from raw sensor data how agents and other traffic entities influence one another. This work closes that gap with a novel model-agnostic method that conditions language-based driving models on structured relational context in the form of traffic scene graphs. By serializing scene graphs at various abstraction levels and formats and incorporating them via structured prompt templates, the authors systematically analyze when and how relational supervision is most beneficial. Evaluations on the public LangAuto benchmark show that scene-graph conditioning of state-of-the-art approaches yields large and persistent improvements in driving performance. Notably, driving scores improve by up to 15.6% for LMDrive and 17.5% for BEVDriver, indicating that models can better internalize and ground relational priors through scene-graph-conditioned training, even without scene graph input at test time. Code, fine-tuned models, and the scene graph dataset are publicly available.
Key Takeaways
- Vision-language models show promise for autonomous driving, but success requires topology-aware reasoning over spatial structure and dynamic interactions.
- Existing models lack training that explicitly encodes relational dependencies, limiting their ability to infer how entities influence one another from sensor data.
- A novel model-agnostic method conditions language-based driving models on structured relational context from traffic scene graphs.
- Serializing scene graphs at different abstraction levels and formats enables a systematic analysis of the benefits of relational supervision.
- Evaluations on the LangAuto benchmark show that scene-graph conditioning significantly improves driving performance.
- Models internalize relational priors through scene-graph-conditioned training, without requiring scene graph input at test time.
Parameter-Efficient MoE LoRA for Few-Shot Multi-Style Editing
Authors:Cong Cao, Yujie Xu, Xiaodong Xu
In recent years, image editing has garnered growing attention. However, general image editing models often fail to produce satisfactory results when confronted with new styles. The challenge lies in how to effectively fine-tune general image editing models to new styles using only a limited amount of paired data. To address this issue, this paper proposes a novel few-shot style editing framework. For this task, we construct a benchmark dataset that encompasses five distinct styles. Correspondingly, we propose a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. The style-specific routing ensures that different styles do not interfere with one another, while the style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. Our MoE LoRA can automatically determine the optimal ranks for each layer through a novel metric-guided approach that estimates the importance score of each single-rank component. Additionally, we explore the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrate adversarial learning and flow matching to guide the diffusion training process. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters.
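To make the routing idea concrete, here is a minimal PyTorch sketch of a linear layer with style-specific LoRA pairs plus router-weighted shared experts; the ranks, expert counts, and routing scheme are illustrative assumptions, and the paper's metric-guided rank allocation is not reproduced:

```python
# Illustrative style-specific + style-shared MoE LoRA layer (PyTorch).
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    def __init__(self, dim, n_styles=5, n_shared=4, rank=8):
        super().__init__()
        self.base = nn.Linear(dim, dim)                 # frozen pretrained layer
        for p in self.base.parameters():
            p.requires_grad_(False)
        # one private LoRA pair per style, so styles do not interfere
        self.priv_a = nn.Parameter(torch.randn(n_styles, dim, rank) * 0.01)
        self.priv_b = nn.Parameter(torch.zeros(n_styles, rank, dim))
        # shared experts capture patterns common to all styles
        self.shared_a = nn.Parameter(torch.randn(n_shared, dim, rank) * 0.01)
        self.shared_b = nn.Parameter(torch.zeros(n_shared, rank, dim))
        self.router = nn.Linear(dim, n_shared)          # adaptive shared routing

    def forward(self, x, style_id):
        out = self.base(x)
        # style-specific path, selected by the style label
        out = out + (x @ self.priv_a[style_id]) @ self.priv_b[style_id]
        # style-shared path, weighted by a learned router
        gate = self.router(x).softmax(dim=-1)                   # (..., E)
        shared = torch.einsum("...d,edr,erk->...ek",
                              x, self.shared_a, self.shared_b)  # (..., E, D)
        return out + (gate.unsqueeze(-1) * shared).sum(dim=-2)

layer = MoELoRALinear(dim=64)
print(layer(torch.randn(2, 64), style_id=3).shape)  # torch.Size([2, 64])
```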
Paper and Project Links
Summary
Image editing has drawn growing attention, but general editing models often fail on new styles. To address fine-tuning to new styles with limited paired data, this paper proposes a novel few-shot style editing framework. It constructs a benchmark dataset covering five distinct styles and proposes a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. Style-specific routing keeps different styles from interfering with one another, while style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. MoE LoRA automatically determines the optimal rank for each layer through a novel metric-guided approach that estimates the importance score of each single-rank component. The paper also explores the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrates adversarial learning and flow matching to guide diffusion training. Experiments show the method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters.
Key Takeaways
- General image editing models face challenges when confronted with new styles.
- The paper proposes a novel few-shot style editing framework to address this problem.
- A benchmark dataset covering five distinct styles is constructed.
- A parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) method is proposed.
- MoE LoRA uses style-specific and style-shared routing mechanisms to jointly fine-tune multiple styles.
- MoE LoRA automatically determines the optimal rank for each layer.
RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting
Authors:Ruocheng Wu, Haolan He, Yufei Wang, Zhihao Li, Bihan Wen
3D Gaussian Splatting (3DGS) has recently gained great attention in the 3D scene representation for its high-quality real-time rendering capabilities. However, when the input comprises sparse training views, 3DGS is prone to overfitting, primarily due to the lack of intermediate-view supervision. Inspired by the recent success of Video Diffusion Models (VDM), we propose a framework called Guidance Score Distillation (GSD) to extract the rich multi-view consistency priors from pretrained VDMs. Building on the insights from Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views, guiding the Gaussian splatting representation towards the generative direction of VDM. However, the generative direction often involves object motion and random camera trajectories, making it challenging for direct supervision in the optimization process. To address this problem, we introduce a unified guidance form to correct the noise prediction result of VDM. Specifically, we incorporate both a depth warp guidance based on real depth maps and a guidance based on semantic image features, ensuring that the score update direction from VDM aligns with the correct camera pose and accurate geometry. Experimental results show that our method outperforms existing approaches across multiple datasets.
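For readers unfamiliar with score distillation, the following is a schematic SDS-style update with a pluggable guidance correction, assuming a DDPM-style noise predictor; all callables and the weighting are stand-ins, not the paper's implementation:

```python
# Schematic SDS-style loss with a pluggable guidance correction (PyTorch).
import torch

def guided_score_loss(render, eps_model, guidance_fn, alphas_bar, t):
    noise = torch.randn_like(render)
    a = alphas_bar[t]
    noisy = a.sqrt() * render + (1.0 - a).sqrt() * noise  # forward diffusion
    eps_pred = eps_model(noisy, t)                        # VDM noise prediction
    eps_pred = guidance_fn(eps_pred)   # e.g. depth-warp / semantic correction
    grad = (1.0 - a) * (eps_pred - noise)                 # common SDS weighting
    # surrogate loss whose gradient w.r.t. `render` equals `grad`
    return (grad.detach() * render).sum()

eps_model = lambda x, t: torch.zeros_like(x)   # stand-in noise predictor
alphas_bar = torch.linspace(0.999, 0.01, 1000)
render = torch.randn(1, 3, 8, 8, requires_grad=True)
guided_score_loss(render, eps_model, lambda e: e, alphas_bar, t=500).backward()
print(render.grad.shape)
```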
Paper and Project Links
Summary
This paper introduces Guidance Score Distillation (GSD), a framework built on 3D Gaussian Splatting (3DGS) and pretrained Video Diffusion Models (VDMs). It addresses the overfitting that arises when inputs contain only sparse training views and intermediate-view supervision is missing. Because the generative direction of a VDM often involves object motion and random camera trajectories, direct supervision is challenging; the paper therefore introduces a unified guidance form to correct the VDM's noise predictions. By combining depth-warp guidance based on real depth maps with guidance from semantic image features, the score update direction stays aligned with the correct camera pose and accurate geometry. Experiments show the method outperforms existing approaches on multiple datasets.
Key Takeaways
- The Guidance Score Distillation (GSD) framework is introduced to address overfitting of 3D Gaussian Splatting (3DGS) under sparse training views.
- Rich multi-view consistency priors are extracted from pretrained Video Diffusion Models (VDMs).
- Building on Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views, guiding the Gaussian splatting representation toward the VDM's generative direction.
- To handle object motion and random camera trajectories, a unified guidance form corrects the VDM's noise prediction results.
Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA
Authors:Ayush Pandey, Jai Bardhan, Ishita Jain, Ramya S Hebbalaguppe, Rohan Raju Dhanakshirur, Lovekesh Vig
In the context of Visual Question Answering (VQA) and Agentic AI, calibration refers to how closely an AI system’s confidence in its answers reflects their actual correctness. This aspect becomes especially important when such systems operate autonomously and must make decisions under visual uncertainty. While modern VQA systems, powered by advanced vision-language models (VLMs), are increasingly used in high-stakes domains like medical diagnostics and autonomous navigation due to their improved accuracy, the reliability of their confidence estimates remains under-examined. Particularly, these systems often produce overconfident responses. To address this, we introduce AlignVQA, a debate-based multi-agent framework, in which diverse specialized VLMs, each following distinct prompting strategies, generate candidate answers and then engage in two-stage interaction: generalist agents critique, refine, and aggregate these proposals. This debate process yields confidence estimates that more accurately reflect the model’s true predictive performance. We find that more calibrated specialized agents produce better aligned confidences. Furthermore, we introduce aligncal, a novel differentiable calibration-aware loss function designed to fine-tune the specialized agents by minimizing an upper bound on the calibration error, explicitly improving the fidelity of each agent’s confidence estimates. Empirical results across multiple benchmark VQA datasets substantiate the efficacy of our approach, demonstrating substantial reductions in calibration discrepancies.
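A quick way to ground the notion of calibration error: the standard equal-width-binning expected calibration error (ECE), sketched below. This is the kind of discrepancy a calibration-aware objective targets, not the paper's aligncal loss itself:

```python
# Standard equal-width-bin expected calibration error (ECE).
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight the gap by bin population
    return ece

print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 0, 1, 1]))
```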
Paper and Project Links
PDF 17 pages, 6 figures, 5 tables. Accepted to Special Track on AI Alignment, AAAI 2026. Project Page- https://refine-align.github.io/
Summary
This work introduces calibration in Visual Question Answering (VQA) and agentic AI: how closely an AI system's confidence matches the actual correctness of its answers. Modern VLM-powered VQA systems are increasingly accurate and widely used in high-stakes domains such as medical diagnostics and autonomous navigation, yet the reliability of their confidence estimates remains under-examined. The paper proposes AlignVQA, a debate-based multi-agent framework in which diverse specialized VLMs generate candidate answers and engage in two-stage interaction, improving how well confidence reflects true predictive performance. It also introduces aligncal, a novel differentiable calibration-aware loss for fine-tuning the specialized agents to sharpen their confidence estimates. Experiments show the approach substantially reduces calibration discrepancies.
Key Takeaways
- Confidence calibration of AI systems is especially important for autonomous decision-making under visual uncertainty.
- Modern VQA systems achieve high accuracy in domains such as medical diagnostics and autonomous navigation, but the reliability of their confidence estimates is under-examined.
- The proposed AlignVQA framework uses a multi-agent debate process to make confidence estimates more accurate.
- In AlignVQA, diverse specialized VLMs generate candidate answers and engage in two-stage interaction: critique, refinement, and aggregation.
- A novel differentiable calibration-aware loss, aligncal, fine-tunes the specialized agents and improves their confidence estimates.
- The objective minimizes an upper bound on the calibration error, improving the fidelity of each agent's confidence estimates.
Utilizing LLMs for Industrial Process Automation: A Case Study on Modifying RAPID Programs
Authors:Salim Fares, Steffen Herbold
How to best use Large Language Models (LLMs) for software engineering is covered in many publications in recent years. However, most of this work focuses on widely-used general purpose programming languages. The utility of LLMs for software within the industrial process automation domain, with highly-specialized languages that are typically only used in proprietary contexts, is still underexplored. Within this paper, we study what enterprises can achieve on their own without investing large amounts of effort into the training of models specific to the domain-specific languages that are used. We show that few-shot prompting approaches are sufficient to solve simple problems in a language that is otherwise not well-supported by an LLM, and that this is possible on-premise, thereby ensuring the protection of sensitive company data.
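As a toy illustration of the few-shot prompting setup (the example pair and wording are invented, not the paper's prompts):

```python
# Invented few-shot prompt assembly for editing ABB RAPID code.
FEW_SHOT_EXAMPLES = [
    {
        "task": "Increase the speed in the MoveL instruction to v500.",
        "before": "MoveL p10, v200, fine, tool0;",
        "after": "MoveL p10, v500, fine, tool0;",
    },
]

def build_prompt(task, program):
    parts = ["You modify ABB RAPID programs. Follow the examples.\n"]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Task: {ex['task']}\nBefore:\n{ex['before']}\n"
                     f"After:\n{ex['after']}\n")
    parts.append(f"Task: {task}\nBefore:\n{program}\nAfter:\n")
    return "\n".join(parts)

print(build_prompt("Change the zone data to z10.",
                   "MoveJ p20, v1000, fine, tool0;"))
```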
Paper and Project Links
PDF Submitted to the International Conference on Software Engineering (ICSE) track Software Engineering in Practice (SEIP) 2026
Summary
The use of large language models (LLMs) in software engineering has drawn much attention in recent years, but most research focuses on general-purpose programming languages. For the highly specialized, typically proprietary languages of industrial process automation, LLM utility remains underexplored. This paper studies what enterprises can achieve on their own without investing heavily in training models for their domain-specific languages, showing that few-shot prompting suffices to solve simple problems in a language otherwise poorly supported by LLMs, and that this can be done on-premise, protecting sensitive company data.
Key Takeaways
- The use of large language models (LLMs) in software engineering is receiving growing attention.
- Most existing research targets general-purpose programming languages; highly specialized languages in industrial process automation remain underexplored.
- Without training models for the domain-specific language, enterprises can apply LLMs in proprietary contexts through few-shot prompting.
- This approach solves simple problems and enables software automation.
- Running on-premise protects sensitive company data and avoids the leakage risks of large-scale training.
- LLMs can provide valuable support even in niche language settings.
Detection of Bark Beetle Attacks using Hyperspectral PRISMA Data and Few-Shot Learning
Authors:Mattia Ferrari, Giancarlo Papitto, Giorgio Deligios, Lorenzo Bruzzone
Bark beetle infestations represent a serious challenge for maintaining the health of coniferous forests. This paper proposes a few-shot learning approach leveraging contrastive learning to detect bark beetle infestations using satellite PRISMA hyperspectral data. The methodology is based on a contrastive learning framework to pre-train a one-dimensional CNN encoder, enabling the extraction of robust feature representations from hyperspectral data. These extracted features are subsequently utilized as input to support vector regression estimators, one for each class, trained on few labeled samples to estimate the proportions of healthy, attacked by bark beetle, and dead trees for each pixel. Experiments on the area of study in the Dolomites show that our method outperforms the use of original PRISMA spectral bands and of Sentinel-2 data. The results indicate that PRISMA hyperspectral data combined with few-shot learning offers significant advantages for forest health monitoring.
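The downstream stage can be sketched as per-class support vector regressors on frozen encoder features; the encoder is stubbed with random features here purely for illustration:

```python
# Per-class SVRs on (stubbed) pretrained-encoder features, predicting
# per-pixel proportions of healthy / attacked / dead trees.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 128))           # encoder features, few labels
y_train = rng.dirichlet(np.ones(3), size=40)   # healthy / attacked / dead

svrs = [SVR(kernel="rbf").fit(X_train, y_train[:, c]) for c in range(3)]

X_test = rng.normal(size=(5, 128))
props = np.stack([svr.predict(X_test) for svr in svrs], axis=1)
props = np.clip(props, 0.0, None)
props /= props.sum(axis=1, keepdims=True)      # renormalize to proportions
print(props.round(2))
```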
Paper and Project Links
PDF 5 pages, 3 figures, accepted at IGARSS conference 3-8 August 2025 Brisbane, Australia
Summary
This paper proposes a few-shot learning approach that leverages contrastive learning to detect bark beetle infestations from satellite PRISMA hyperspectral data. A contrastive learning framework pre-trains a one-dimensional CNN encoder to extract robust feature representations from hyperspectral data. These features then feed support vector regression estimators, one per class, trained on few labeled samples to estimate the per-pixel proportions of healthy, attacked, and dead trees. Experiments on the study area in the Dolomites show the method outperforms using raw PRISMA spectral bands or Sentinel-2 data, demonstrating the advantage of combining PRISMA hyperspectral data with few-shot learning for forest health monitoring.
Key Takeaways
- Bark beetle infestations pose a serious challenge to the health of coniferous forests.
- A contrastive-learning-based few-shot approach is proposed for detecting bark beetle infestations.
- Satellite PRISMA hyperspectral data is used for forest health monitoring.
- A one-dimensional CNN encoder is pre-trained within a contrastive learning framework to extract robust hyperspectral features.
- Support vector regression estimators use the extracted features to estimate per-pixel tree condition.
- Experiments in the Dolomites show the method outperforms approaches based on other data sources.
Scalable Population Training for Zero-Shot Coordination
Authors:Bingyu Hui, Lebin Yu, Quanming Yao, Yunpeng Qu, Xudong Zhang, Jian Wang
Zero-shot coordination (ZSC) has become a hot topic in reinforcement learning research recently. It focuses on the generalization ability of agents, requiring them to coordinate well with collaborators that are not seen before without any fine-tuning. Population-based training has been proven to provide good zero-shot coordination performance; nevertheless, existing methods are limited by computational resources, mainly focusing on optimizing diversity in small populations while neglecting the potential performance gains from scaling population size. To address this issue, this paper proposes Scalable Population Training (ScaPT), an efficient training framework comprising two key components: a meta-agent that efficiently realizes a population by selectively sharing parameters across agents, and a mutual information regularizer that guarantees population diversity. To empirically validate the effectiveness of ScaPT, this paper evaluates it along with representative frameworks in Hanabi and confirms its superiority.
Paper and Project Links
Summary
Zero-shot coordination (ZSC) is a hot topic in reinforcement learning. It concerns agents' generalization ability: coordinating well with previously unseen partners without fine-tuning. Population-based training provides strong zero-shot coordination performance, but existing methods are limited by computational resources, focusing on diversity in small populations while neglecting the gains from scaling population size. This paper proposes Scalable Population Training (ScaPT), an efficient framework with two key components: a meta-agent that efficiently realizes a population by selectively sharing parameters across agents, and a mutual information regularizer that guarantees population diversity. Empirical evaluation against representative frameworks in Hanabi confirms ScaPT's superiority.
Key Takeaways
- Zero-shot coordination (ZSC) is an important research direction in reinforcement learning, focusing on agents' generalization ability.
- Population-based training performs well for zero-shot coordination, but existing methods are constrained by computational resources.
- Scalable Population Training (ScaPT) is proposed to address this computational limitation.
- The ScaPT framework has two key components: a meta-agent and a mutual information regularizer.
- The meta-agent efficiently realizes a population by selectively sharing parameters.
- The mutual information regularizer guarantees population diversity.
GraphMASAL: A Graph-based Multi-Agent System for Adaptive Learning
Authors:Biqing Zeng, Mengquan Liu, Zongwei Zhen
The advent of Intelligent Tutoring Systems (ITSs) has marked a paradigm shift in education, enabling highly personalized learning pathways. However, true personalization requires adapting to learners’ complex knowledge states (multi-source) and diverse goals (multi-sink); existing ITSs often lack the necessary structural-reasoning capability and knowledge dynamism to generate genuinely effective learning paths, and they lack scientifically rigorous validation paradigms. In this paper we propose GraphMASAL (A Graph-based Multi-Agent System for Adaptive Learning), which integrates (i) a dynamic knowledge graph for persistent, stateful learner modeling; (ii) a LangGraph-orchestrated trio of agents (Diagnostician, Planner, Tutor); (iii) a knowledge-graph-grounded two-stage neural IR component (dual-encoder dense retrieval with cross-encoder listwise re-ranking and calibrated score fusion); and (iv) a multi-source multi-sink (MSMS) planning engine with a cognitively grounded cost and an approximation guarantee via greedy set cover. Under blinded automated evaluations with matched inputs and inference settings across diverse student profiles, GraphMASAL consistently outperforms LLM prompting and structured ablations in planning, achieving stronger structural/sequence alignment of learning paths, higher coverage of weak concepts, and lower learning cost, while also surpassing prompt-based baselines in cognitive diagnosis. Agreement with expert/LLM-proxy ratings further supports the validity of our evaluation protocol. These findings indicate that grounding LLM agents in a dynamic knowledge graph, coupled with optimization under educational constraints, yields reliable, interpretable, and pedagogically plausible learning plans, advancing personalized and goal-oriented education.
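The planning guarantee alludes to the classic greedy set-cover approximation; below is a minimal sketch of greedily covering weak concepts with candidate learning activities (costs and names are invented for illustration):

```python
# Greedy set cover of weak concepts by candidate learning activities,
# the classic approximation the planning guarantee alludes to.
def greedy_cover(weak_concepts, activities):
    """activities: {name: (cost, covered_concepts)} -> ordered plan."""
    uncovered, plan = set(weak_concepts), []
    while uncovered:
        # pick the activity with the best cost per newly covered concept
        name, (cost, covers) = min(
            ((n, a) for n, a in activities.items() if a[1] & uncovered),
            key=lambda item: item[1][0] / len(item[1][1] & uncovered),
        )
        plan.append(name)
        uncovered -= covers
    return plan

acts = {
    "video_fractions": (2.0, {"fractions"}),
    "quiz_mixed":      (3.0, {"fractions", "decimals"}),
    "drill_decimals":  (1.0, {"decimals"}),
}
print(greedy_cover({"fractions", "decimals"}, acts))
```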
Paper and Project Links
PDF 9 pages, 3 figures,submitted to AAMAS 2026
Summary
The advent of Intelligent Tutoring Systems (ITSs) marks a paradigm shift in education, enabling highly personalized learning pathways. However, true personalization must adapt to learners' complex knowledge states and diverse goals, and existing ITSs lack the structural reasoning capability and knowledge dynamism to generate genuinely effective learning paths. This paper proposes GraphMASAL (a Graph-based Multi-Agent System for Adaptive Learning), which integrates a dynamic knowledge graph, a LangGraph-orchestrated trio of agents, a knowledge-graph-grounded two-stage neural IR component, and a multi-source multi-sink planning engine, achieving a new breakthrough in personalized education.
Key Takeaways
- Intelligent Tutoring Systems (ITSs) make personalized learning pathways possible in education.
- Existing ITSs struggle to adapt to learners' complex knowledge states and diverse goals.
- GraphMASAL combines a dynamic knowledge graph, agents, and a planning engine to advance personalized education.
- GraphMASAL's persistent, stateful learner modeling on a knowledge graph improves the structural and sequence alignment of learning paths.
- GraphMASAL shows advantages in weak-concept coverage and learning cost, surpassing prompt-based baselines.
- Agreement with expert and LLM-proxy ratings supports the validity of the evaluation protocol.
- GraphMASAL generates reliable, interpretable, and pedagogically plausible learning plans.
SP-Guard: Selective Prompt-adaptive Guidance for Safe Text-to-Image Generation
Authors:Sumin Yu, Taesup Moon
While diffusion-based T2I models have achieved remarkable image generation quality, they also enable easy creation of harmful content, raising social concerns and highlighting the need for safer generation. Existing inference-time guiding methods lack both adaptivity (adjusting guidance strength based on the prompt) and selectivity (targeting only unsafe regions of the image). Our method, SP-Guard, addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask to guide only unsafe areas. Experiments show that SP-Guard generates safer images than existing methods while minimizing unintended content alteration. Beyond improving safety, our findings highlight the importance of transparency and controllability in image generation.
Paper and Project Links
PDF Accepted for presentation at TRUST-AI Workshop, ECAI 2025. Proceedings to appear in CEUR-WS
Summary
Diffusion-based T2I models generate high-quality images but also make harmful content easy to create, raising social concerns and highlighting the need for safer generation. Existing inference-time guiding methods lack adaptivity and selectivity: they cannot adjust guidance strength per prompt or target only the unsafe regions of the image. SP-Guard addresses this by estimating prompt harmfulness and applying a selective guidance mask to guide only unsafe areas. Experiments show SP-Guard generates safer images than existing methods while minimizing unintended content alteration. Beyond safety, the findings highlight the importance of transparency and controllability in image generation.
Key Takeaways
- Diffusion-based T2I models generate high-quality images but make harmful content easy to create, raising social concerns.
- Existing inference-time guiding methods lack the adaptivity and selectivity needed to counter harmful generation.
- SP-Guard estimates prompt harmfulness and applies a selective guidance mask, producing safer images with fewer unintended alterations.
- By guiding only the unsafe regions of an image, SP-Guard improves the controllability of generation.
- Experiments validate SP-Guard's advantage in image safety.
- Beyond safety, transparency and controllability are important considerations in image generation.
Preserving Cross-Modal Consistency for CLIP-based Class-Incremental Learning
Authors:Haoran Chen, Houze Xu, Micah Goldblum, Daoguo Dong, Zuxuan Wu
Class-incremental learning (CIL) enables models to continuously learn new categories from sequential tasks without forgetting previously acquired knowledge. While recent advances in vision-language models such as CLIP have demonstrated strong generalization across domains, extending them to continual settings remains challenging. In particular, learning task-specific soft prompts for newly introduced classes often leads to severe classifier bias, as the text prototypes overfit to recent categories when prior data are unavailable. In this paper, we propose DMC, a simple yet effective two-stage framework for CLIP-based CIL that decouples the adaptation of the vision encoder and the optimization of textual soft prompts. Each stage is trained with the other frozen, allowing one modality to act as a stable semantic anchor for the other to preserve cross-modal alignment. Furthermore, current CLIP-based CIL approaches typically store class-wise Gaussian statistics for generative replay, yet they overlook the distributional drift that arises when the vision encoder is updated over time. To address this issue, we introduce DMC-OT, an enhanced version of DMC that incorporates an optimal-transport guided calibration strategy to align memory statistics across evolving encoders, along with a task-specific prompting design that enhances inter-task separability. Extensive experiments on CIFAR-100, Imagenet-R, CUB-200, and UCF-101 demonstrate that both DMC and DMC-OT achieve state-of-the-art performance, with DMC-OT further improving accuracy by an average of 1.80%.
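The two-stage decoupling can be sketched with a toy CLIP-like model in PyTorch, training one modality while the other is frozen as a semantic anchor; module names and sizes are generic placeholders, not the paper's code:

```python
# Toy two-stage decoupled training: one modality trains, the other anchors.
import torch
import torch.nn as nn

class ToyCLIP(nn.Module):
    def __init__(self, dim=32, n_classes=10):
        super().__init__()
        self.vision = nn.Linear(64, dim)                       # stand-in encoder
        self.prompts = nn.Parameter(torch.randn(n_classes, dim) * 0.02)

    def forward(self, x):
        img = nn.functional.normalize(self.vision(x), dim=-1)
        txt = nn.functional.normalize(self.prompts, dim=-1)
        return img @ txt.t()                                   # cosine logits

def run_stage(model, trainable, x, y, steps=5):
    for p in model.parameters():        # freeze everything ...
        p.requires_grad_(False)
    for p in trainable:                 # ... except the current stage's params
        p.requires_grad_(True)
    opt = torch.optim.AdamW(trainable, lr=1e-2)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

model = ToyCLIP()
x, y = torch.randn(16, 64), torch.randint(0, 10, (16,))
run_stage(model, list(model.vision.parameters()), x, y)  # stage 1: text frozen
run_stage(model, [model.prompts], x, y)                  # stage 2: vision frozen
```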
Paper and Project Links
Summary
This work studies class-incremental learning (CIL) with CLIP and proposes DMC, a two-stage CLIP-based framework that decouples adaptation of the vision encoder from optimization of textual soft prompts. Each stage trains with the other frozen, letting one modality serve as a stable semantic anchor that preserves cross-modal alignment. The work also addresses a flaw in current CLIP-based CIL methods: they store class-wise Gaussian statistics for generative replay but ignore the distributional drift caused by updating the vision encoder over time. The enhanced DMC-OT adds an optimal-transport guided calibration strategy to align memory statistics across evolving encoders, plus a task-specific prompting design that improves inter-task separability. Experiments show that both DMC and DMC-OT achieve state-of-the-art performance across datasets, with DMC-OT further improving accuracy by an average of 1.80%.
Key Takeaways
Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard
Authors:Yudong Yang, Xuezhen Zhang, Zhifeng Han, Siyin Wang, Jimin Zhuang, Zengrui Jin, Jing Shao, Guangzhi Sun, Chao Zhang
Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but exposes new safety risks emerging from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embeds harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which implies unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that even Gemini 2.5 Pro, the state-of-the-art proprietary LLM, still exhibits a 66% attack success rate on the SACRED-Bench test set, exposing vulnerabilities under cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing attack success down to 20%. Our results highlight the need for audio-aware defenses for the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: this paper includes examples that may be offensive or harmful.
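A toy version of the speech-audio composition mechanism: overlaying a second signal beneath a carrier at a chosen signal-to-noise ratio. Synthetic sine waves stand in for speech here; the benchmark's real attacks are built from recorded audio:

```python
# Overlay a second signal beneath a carrier at a chosen SNR.
import numpy as np

def mix_at_snr(carrier, overlay, snr_db):
    overlay = overlay[: len(carrier)]
    p_c = np.mean(carrier ** 2)
    p_o = np.mean(overlay ** 2) + 1e-12
    scale = np.sqrt(p_c / (p_o * 10 ** (snr_db / 10)))
    return carrier + scale * overlay

sr = 16000
t = np.arange(sr) / sr
benign = np.sin(2 * np.pi * 220 * t)         # stands in for benign speech
hidden = 0.5 * np.sin(2 * np.pi * 660 * t)   # stands in for an embedded prompt
mixed = mix_at_snr(benign, hidden, snr_db=10.0)
print(mixed.shape, float(np.abs(mixed).max()))
```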
Paper and Project Links
Summary
Large language models (LLMs) can now understand both speech and non-speech audio, but complex audio inputs expose new safety risks that current safeguards handle inadequately. SACRED-Bench (Speech-Audio Composition for RED-teaming) is introduced to evaluate LLM robustness under complex audio-based attacks. Unlike perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition through three mechanisms: (a) speech overlap and multi-speaker dialogue that embed harmful prompts beneath or alongside benign speech; (b) speech-audio mixtures that imply unsafe intent via non-speech audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Even Gemini 2.5 Pro, the state-of-the-art proprietary LLM, shows a 66% attack success rate on the SACRED-Bench test set, exposing vulnerabilities to cross-modal, speech-audio composition attacks. To bridge this gap, SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reduces the attack success rate to 20%. The results highlight the need for audio-aware defenses for multimodal LLMs. The benchmark and SALMONN-Guard checkpoints are available at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Note: the paper includes examples that may be offensive or harmful.
Key Takeaways
- LLMs can process complex audio inputs but carry safety vulnerabilities; current safeguards handle the resulting risks inadequately.
- SACRED-Bench evaluates LLM robustness by simulating realistic speech-audio composition attacks via three mechanisms: speech overlap and multi-speaker dialogue, speech-audio mixtures, and diverse spoken instruction formats.
- Even the state-of-the-art Gemini 2.5 Pro is vulnerable on the SACRED-Bench test set, with a 66% attack success rate, showing significant fragility under cross-modal, speech-audio composition attacks.
- SALMONN-Guard, which jointly inspects speech, audio, and text for safety judgments, reduces the attack success rate to 20%.
Toward Automated Cognitive Assessment in Parkinson’s Disease Using Pretrained Language Models
Authors:Varada Khanna, Nilay Bhatt, Ikgyu Shin, Sule Tinaz, Yang Ren, Hua Xu, Vipina K. Keloth
Understanding how individuals with Parkinson’s disease (PD) describe cognitive experiences in their daily lives can offer valuable insights into disease-related cognitive and emotional changes. However, extracting such information from unstructured patient narratives is challenging due to the subtle, overlapping nature of cognitive constructs. This study developed and evaluated natural language processing (NLP) models to automatically identify categories that reflect various cognitive processes from de-identified first-person narratives. Three model families, a Bio_ClinicalBERT-based span categorization model for nested entity recognition, a fine-tuned Meta-Llama-3-8B-Instruct model using QLoRA for instruction following, and GPT-4o mini evaluated under zero- and few-shot settings, were compared on their performance on extracting seven categories. Our findings indicated that model performance varied substantially across categories and model families. The fine-tuned Meta-Llama-3-8B-Instruct achieved the highest overall F1-scores (0.74 micro-average and 0.59 macro-average), particularly excelling in context-dependent categories such as thought and social interaction. Bio_ClinicalBERT exhibited high precision but low recall and performed comparable to Llama for some category types such as location and time but failed on other categories such as thought, emotion and social interaction. Compared to conventional information extraction tasks, this task presents a greater challenge due to the abstract and overlapping nature of narrative accounts of complex cognitive processes. Nonetheless, with continued refinement, these NLP systems hold promise for enabling low-burden, longitudinal monitoring of cognitive function and serving as a valuable complement to formal neuropsychological assessments in PD.
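The micro- and macro-averaged F1 scores quoted above aggregate per-category counts differently; a minimal sketch with invented counts:

```python
# Micro vs. macro F1 from per-category TP/FP/FN counts (invented numbers).
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

counts = {
    "thought":  (30, 10, 12),
    "location": (50, 5, 6),
    "emotion":  (12, 9, 15),
}
macro = sum(f1(*c) for c in counts.values()) / len(counts)  # mean of per-class F1
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro = f1(tp, fp, fn)                                      # F1 of pooled counts
print(f"micro={micro:.2f} macro={macro:.2f}")
```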
Paper and Project Links
PDF 15 pages, 4 figures, 1 table. Varada Khanna and Nilay Bhatt are co-first authors. Sule Tinaz and Hua Xu are co-senior authors. Corresponding author: Vipina K. Keloth (vipina.kuttichikeloth@yale.edu)
Summary
This study uses natural language processing (NLP) models to automatically identify categories reflecting different cognitive processes from unstructured, de-identified narratives of individuals with Parkinson's disease (PD). Three model families were compared under zero- and few-shot settings, with performance varying substantially across categories and model families. The fine-tuned Meta-Llama-3-8B-Instruct achieved the best overall F1-scores, particularly on context-dependent categories. The task is more challenging than conventional information extraction because narrative accounts of complex cognitive processes are abstract and overlapping. With continued refinement, these NLP systems promise low-burden, longitudinal monitoring of cognitive function in PD and a valuable complement to formal neuropsychological assessments.
Key Takeaways
- NLP models can automatically identify cognitive-process categories from unstructured narratives of individuals with Parkinson's disease (PD).
- The model families include a Bio_ClinicalBERT-based span categorization model, a Meta-Llama-3-8B-Instruct model fine-tuned with QLoRA for instruction following, and GPT-4o mini.
- Model performance differs substantially across categories and model families.
- Meta-Llama-3-8B-Instruct achieved the best overall F1-scores, excelling on context-dependent categories such as thought and social interaction.
- Bio_ClinicalBERT showed high precision but low recall, matching Llama on categories such as location and time but failing on thought, emotion, and social interaction.
- Extracting cognitive processes from narratives is harder than conventional information extraction because the accounts are abstract and overlapping.
Divide-and-Conquer Decoupled Network for Cross-Domain Few-Shot Segmentation
Authors:Runmin Cong, Anpeng Wang, Bin Wan, Cong Zhang, Xiaofei Zhou, Wei Zhang
Cross-domain few-shot segmentation (CD-FSS) aims to tackle the dual challenge of recognizing novel classes and adapting to unseen domains with limited annotations. However, encoder features often entangle domain-relevant and category-relevant information, limiting both generalization and rapid adaptation to new domains. To address this issue, we propose a Divide-and-Conquer Decoupled Network (DCDNet). In the training stage, to tackle feature entanglement that impedes cross-domain generalization and rapid adaptation, we propose the Adversarial-Contrastive Feature Decomposition (ACFD) module. It decouples backbone features into category-relevant private and domain-relevant shared representations via contrastive learning and adversarial learning. Then, to mitigate the potential degradation caused by the disentanglement, the Matrix-Guided Dynamic Fusion (MGDF) module adaptively integrates base, shared, and private features under spatial guidance, maintaining structural coherence. In addition, in the fine-tuning stage, to enhance model generalization, the Cross-Adaptive Modulation (CAM) module is placed before the MGDF, where shared features guide private features via modulation, ensuring effective integration of domain-relevant information. Extensive experiments on four challenging datasets show that DCDNet outperforms existing CD-FSS methods, setting a new state-of-the-art for cross-domain generalization and few-shot adaptation.
Paper and Project Links
Summary
To meet the challenges of cross-domain few-shot segmentation (CD-FSS), this paper proposes DCDNet, a network designed for rapid adaptation and generalization to new domains. Through an Adversarial-Contrastive Feature Decomposition (ACFD) module and a Matrix-Guided Dynamic Fusion (MGDF) module in the training stage, plus a Cross-Adaptive Modulation (CAM) module in the fine-tuning stage, DCDNet decouples category recognition from domain adaptation under limited annotations. Experiments on four challenging datasets show that DCDNet sets a new state of the art for cross-domain generalization and few-shot adaptation.
Key Takeaways
- DCDNet is designed for the challenges of cross-domain few-shot segmentation, addressing rapid adaptation and generalization to new domains.
- The ACFD module decouples features into category-relevant private and domain-relevant shared representations via contrastive and adversarial learning.
- The MGDF module adaptively fuses base, shared, and private features under spatial guidance, maintaining structural coherence.
- The CAM module strengthens generalization in the fine-tuning stage by modulating private features with shared features.
- Feature decomposition and fusion during training, plus adaptive modulation during fine-tuning, yield effective cross-domain generalization and few-shot adaptation.
- Experiments on four challenging datasets show that DCDNet outperforms existing CD-FSS methods.
HCFSLN: Adaptive Hyperbolic Few-Shot Learning for Multimodal Anxiety Detection
Authors:Aditya Sneh, Nilesh Kumar Sahu, Anushka Sanjay Shelke, Arya Adyasha, Haroon R. Lone
Anxiety disorders impact millions globally, yet traditional diagnosis relies on clinical interviews, while machine learning models struggle with overfitting due to limited data. Large-scale data collection remains costly and time-consuming, restricting accessibility. To address this, we introduce the Hyperbolic Curvature Few-Shot Learning Network (HCFSLN), a novel Few-Shot Learning (FSL) framework for multimodal anxiety detection, integrating speech, physiological signals, and video data. HCFSLN enhances feature separability through hyperbolic embeddings, cross-modal attention, and an adaptive gating network, enabling robust classification with minimal data. We collected a multimodal anxiety dataset from 108 participants and benchmarked HCFSLN against six FSL baselines, achieving 88% accuracy, outperforming the best baseline by 14%. These results highlight the effectiveness of hyperbolic space for modeling anxiety-related speech patterns and demonstrate FSL’s potential for anxiety classification.
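The hyperbolic ingredient can be illustrated with the geodesic distance on the Poincaré ball (curvature fixed to -1 for simplicity; the paper's adaptive-curvature network is not reproduced here):

```python
# Geodesic distance on the Poincaré ball (curvature -1).
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    uu = np.sum(u * u, axis=-1)
    vv = np.sum(v * v, axis=-1)
    duv = np.sum((u - v) ** 2, axis=-1)
    x = 1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv) + eps)
    return np.arccosh(np.clip(x, 1.0, None))

a = np.array([0.1, 0.2])
b = np.array([0.5, -0.3])
print(poincare_distance(a, b))
# points near the boundary are far from everything: distances blow up
print(poincare_distance(0.99 * a / np.linalg.norm(a), b))
```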
Paper and Project Links
Summary
This paper introduces the Hyperbolic Curvature Few-Shot Learning Network (HCFSLN), a novel Few-Shot Learning (FSL) framework for multimodal anxiety detection that integrates speech, physiological signals, and video data. By enhancing feature separability, HCFSLN achieves robust classification with minimal data. Experiments show HCFSLN performs strongly on anxiety classification, reaching 88% accuracy.
Key Takeaways
- The Hyperbolic Curvature Few-Shot Learning Network (HCFSLN) is a new FSL framework for multimodal anxiety detection.
- HCFSLN integrates speech, physiological signals, and video data, improving feature separability.
- Hyperbolic embeddings, cross-modal attention, and an adaptive gating network enable robust classification with minimal data.
- A multimodal anxiety dataset was collected from 108 participants.
- Benchmarked against six FSL baselines, HCFSLN reached 88% accuracy.
- HCFSLN outperforms the best baseline by 14%, highlighting the effectiveness of hyperbolic space for modeling anxiety-related speech patterns.
Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV
Authors:Wenbo Huang, Jinghui Zhang, Zhenghao Chen, Guang Li, Lei Zhang, Yang Cao, Fang Dong, Takahiro Ogawa, Miki Haseyama
Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of the background distractions. Receptance Weighted Key Value (RWKV), which learns interaction between various dimensions, shows promise for global modeling. However, directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relation degraded by frames with similar backgrounds is difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module (CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing better reconstruction of temporal relation. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.
Paper and Project Links
PDF Accepted by AAAI 2026 Oral
Summary
This work studies few-shot action recognition (FSAR) on wide-angle videos. Although RWKV can learn interactions across dimensions for global modeling, applying it directly to FSAR is limited by the lack of a global understanding of subjects and background. The paper proposes Otter, comprising a Compound Segmentation Module (CSM) and a Temporal Reconstruction Module (TRM). CSM segments and emphasizes key patches in each frame, effectively highlighting subjects; TRM enables bidirectional scanning in temporal-enhanced prototype construction to better reconstruct temporal relations. Combining a regular prototype with the temporal-enhanced prototype strengthens both subject emphasis and temporal modeling, improving wide-angle FSAR performance. Experiments show Otter achieves state-of-the-art results on multiple benchmark datasets.
Key Takeaways
- Wide-angle videos effectively express actions in specific scenarios, but background distractions make few-shot action recognition challenging.
- RWKV shows promise for global modeling, but applied directly to wide-angle FSAR it may fail to highlight subjects.
- Otter addresses these issues with CSM, which highlights key regions, and TRM, which reconstructs temporal relations.
- Otter combines a regular prototype with a temporal-enhanced prototype, strengthening subject emphasis and temporal modeling.
- Otter achieves state-of-the-art performance on multiple benchmarks (SSv2, Kinetics, UCF101, HMDB51), with extra validation on VideoBadminton.
- Key difficulties of wide-angle action recognition include separating subjects from background and reconstructing temporal relations.
FreqGRL: Suppressing Low-Frequency Bias and Mining High-Frequency Knowledge for Cross-Domain Few-Shot Learning
Authors:Siqi Hui, Sanping Zhou, Ye deng, Wenli Huang, Jinjun Wang
Cross-domain few-shot learning (CD-FSL) aims to recognize novel classes with only a few labeled examples under significant domain shifts. While recent approaches leverage a limited amount of labeled target-domain data to improve performance, the severe imbalance between abundant source data and scarce target data remains a critical challenge for effective representation learning. We present the first frequency-space perspective to analyze this issue and identify two key challenges: (1) models are easily biased toward source-specific knowledge encoded in the low-frequency components of source data, and (2) the sparsity of target data hinders the learning of high-frequency, domain-generalizable features. To address these challenges, we propose FreqGRL, a novel CD-FSL framework that mitigates the impact of data imbalance in the frequency space. Specifically, we introduce a Low-Frequency Replacement (LFR) module that substitutes the low-frequency components of source tasks with those from the target domain to create new source tasks that better align with target characteristics, thus reducing source-specific biases and promoting generalizable representation learning. We further design a High-Frequency Enhancement (HFE) module that filters out low-frequency components and performs learning directly on high-frequency features in the frequency space to improve cross-domain generalization. Additionally, a Global Frequency Filter (GFF) is incorporated to suppress noisy or irrelevant frequencies and emphasize informative ones, mitigating overfitting risks under limited target supervision. Extensive experiments on five standard CD-FSL benchmarks demonstrate that our frequency-guided framework achieves state-of-the-art performance.
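The low-frequency replacement idea resembles FDA-style amplitude swapping in Fourier space; a minimal sketch follows (the band ratio and amplitude-only swap are illustrative assumptions, not the paper's exact LFR):

```python
# Swap the centered low-frequency amplitude band of a source image with the
# target domain's, keeping the source phase (FDA-style transfer).
import numpy as np

def low_freq_replace(src, tgt, beta=0.1):
    fs, ft = np.fft.fft2(src), np.fft.fft2(tgt)
    amp_s, pha_s = np.abs(fs), np.angle(fs)
    amp_s = np.fft.fftshift(amp_s)
    amp_t = np.fft.fftshift(np.abs(ft))
    h, w = src.shape
    bh, bw = int(h * beta), int(w * beta)
    cy, cx = h // 2, w // 2
    # replace the centered low-frequency amplitude block with the target's
    amp_s[cy - bh:cy + bh, cx - bw:cx + bw] = \
        amp_t[cy - bh:cy + bh, cx - bw:cx + bw]
    out = np.fft.ifft2(np.fft.ifftshift(amp_s) * np.exp(1j * pha_s))
    return np.real(out)

src, tgt = np.random.rand(64, 64), np.random.rand(64, 64)
print(low_freq_replace(src, tgt).shape)
```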
Paper and Project Links
Summary
This paper addresses a key challenge in cross-domain few-shot learning (CD-FSL): recognizing novel classes from only a few labeled examples under significant domain shifts. To handle the severe imbalance between abundant source data and scarce target data, the paper offers the first frequency-space analysis of the problem and proposes FreqGRL, a framework whose Low-Frequency Replacement (LFR), High-Frequency Enhancement (HFE), and Global Frequency Filter (GFF) modules improve cross-domain generalization for few-shot recognition. Extensive experiments show this frequency-guided framework achieves state-of-the-art performance on five standard CD-FSL benchmarks.
Key Takeaways
- CD-FSL must recognize novel classes from few labeled samples under significant domain shifts.
- The paper offers the first frequency-space analysis of the imbalance between abundant source data and scarce target data.
- Two key challenges are identified: models bias toward source-specific knowledge in low-frequency components, and target-data sparsity hinders learning of high-frequency, domain-generalizable features.
- The proposed FreqGRL framework comprises Low-Frequency Replacement (LFR), High-Frequency Enhancement (HFE), and Global Frequency Filter (GFF) modules.
- LFR replaces the low-frequency components of source tasks with the target domain's, creating source tasks better aligned with target characteristics and reducing source-specific bias.
- HFE learns directly on high-frequency features in frequency space, improving cross-domain generalization.
MT-HuBERT: Self-Supervised Mix-Training for Few-Shot Keyword Spotting in Mixed Speech
Authors:Junming Yuan, Ying Shi, Dong Wang, Lantian Li, Askar Hamdulla
Few-shot keyword spotting aims to detect previously unseen keywords with very limited labeled samples. A pre-training and adaptation paradigm is typically adopted for this task. While effective in clean conditions, most existing approaches struggle with mixed keyword spotting–detecting multiple overlapping keywords within a single utterance–a capability essential for real-world applications. We have previously proposed a pre-training approach based on Mix-Training (MT) to tackle the mixed keyword detection problem and demonstrated its efficiency. However, this approach is fully supervised, unable to utilize vast unlabeled data. To this end, we propose Mix-Training HuBERT (MT-HuBERT), a self-supervised learning (SSL) pre-training framework that implements the MT criterion during pre-training. MT-HuBERT predicts, in a self-supervised manner, the clean acoustic units of each constituent signal from contextual cues, in contrast to predicting compositional patterns of mixed speech. Experiments conducted on the Google Speech Commands (GSC v2) corpus demonstrate that our proposed MT-HuBERT consistently outperforms several state-of-the-art baselines in few-shot KWS tasks under both mixed and clean conditions.
Paper and Project Links
Summary
MT-HuBERT, a self-supervised learning (SSL) pre-training framework that implements the Mix-Training (MT) criterion during pre-training, is proposed for few-shot keyword spotting in mixed speech. Instead of predicting compositional patterns of mixed speech, it predicts, in a self-supervised manner, the clean acoustic units of each constituent signal from contextual cues. Experiments on the Google Speech Commands (GSC v2) corpus show MT-HuBERT consistently outperforms several state-of-the-art baselines in few-shot KWS tasks under both mixed and clean conditions.
Key Takeaways
- Mixed keyword spotting, detecting multiple overlapping keywords within a single utterance, is essential for real-world applications, and existing methods struggle with it.
- A Mix-Training-based pre-training strategy previously addressed mixed keyword detection, but it was fully supervised; exploiting vast unlabeled data motivates this work.
- Mix-Training HuBERT (MT-HuBERT) is a self-supervised learning (SSL) framework that implements the MT criterion during pre-training.
- MT-HuBERT predicts the clean acoustic units of each constituent signal from contextual cues rather than the compositional patterns of mixed speech, improving robustness in complex conditions.
Commonality in Few: Few-Shot Multimodal Anomaly Detection via Hypergraph-Enhanced Memory
Authors:Yuxuan Lin, Hanjing Yan, Xuan Tong, Yang Chang, Huanzhen Wang, Ziheng Zhou, Shuyong Gao, Yan Wang, Wenqiang Zhang
Few-shot multimodal industrial anomaly detection is a critical yet underexplored task, offering the ability to quickly adapt to complex industrial scenarios. In few-shot settings, insufficient training samples often fail to cover the diverse patterns present in test samples. This challenge can be mitigated by extracting structural commonality from a small number of training samples. In this paper, we propose a novel few-shot unsupervised multimodal industrial anomaly detection method based on structural commonality, CIF (Commonality In Few). To extract intra-class structural information, we employ hypergraphs, which are capable of modeling higher-order correlations, to capture the structural commonality within training samples, and use a memory bank to store this intra-class structural prior. Firstly, we design a semantic-aware hypergraph construction module tailored for single-semantic industrial images, from which we extract common structures to guide the construction of the memory bank. Secondly, we use a training-free hypergraph message passing module to update the visual features of test samples, reducing the distribution gap between test features and features in the memory bank. We further propose a hyperedge-guided memory search module, which utilizes structural information to assist the memory search process and reduce the false positive rate. Experimental results on the MVTec 3D-AD dataset and the Eyecandies dataset show that our method outperforms the state-of-the-art (SOTA) methods in few-shot settings. Code is available at https://github.com/Sunny5250/CIF.
Paper and Project Links
PDF Accepted by AAAI 2026
Summary
This paper proposes CIF (Commonality In Few), a few-shot unsupervised multimodal industrial anomaly detection method based on structural commonality. Hypergraphs, which can model higher-order correlations, capture the structural commonality within training samples, and a memory bank stores this intra-class structural prior. A training-free hypergraph message passing module updates the visual features of test samples, narrowing the distribution gap between test features and memory-bank features, while a hyperedge-guided memory search module uses structural information to reduce the false positive rate. Experiments on the MVTec 3D-AD and Eyecandies datasets show the method outperforms the state of the art in few-shot settings.
Key Takeaways
- A few-shot unsupervised multimodal industrial anomaly detection method based on structural commonality is proposed.
- Hypergraphs capture the structural commonality within training samples, which is stored in a memory bank.
- A training-free hypergraph message passing module updates the visual features of test samples.
- A hyperedge-guided memory search module lowers the false positive rate.
- Experiments on MVTec 3D-AD and Eyecandies demonstrate the method's superiority.
- CIF adapts effectively to the few-shot challenges of complex industrial scenarios.
TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification
Authors:Pasan Dissanayake, Sanghamitra Dutta
Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.
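As background, a standard temperature-scaled distillation objective (Hinton-style) shows the generic mechanism for transferring a transformer teacher into a small student; TabDistill's exact objective may differ:

```python
# Temperature-scaled knowledge-distillation loss (Hinton-style).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale so gradients match across T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 2, requires_grad=True)   # student logits
t = torch.randn(8, 2)                       # frozen teacher logits
y = torch.randint(0, 2, (8,))
print(distillation_loss(s, t, y).item())
```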
Paper and Project Links
Summary
Transformer-based models outperform classical approaches such as neural networks and Gradient Boosted Decision Trees (GBDTs) on tabular data when training data is limited, using pre-trained knowledge to adapt to new domains in the few-shot regime. That performance, however, comes at the cost of much greater complexity and parameter count. TabDistill distills the pre-trained knowledge of complex transformer-based models into simpler neural networks for tabular classification, achieving the best of both worlds: parameter efficiency with strong limited-data performance. The distilled networks surpass classical baselines such as regular neural networks, XGBoost, and logistic regression under equal training data, and in some cases even the original transformer-based models they were distilled from.
Key Takeaways
- Transformer-based models outperform classical models on tabular data when training data is limited.
- Transformer models use pre-trained knowledge to adapt to new domains, performing well even in the few-shot regime.
- Their high performance comes at the cost of greater complexity and parameter count.
- TabDistill distills knowledge from complex transformer-based models into simpler neural networks.
- TabDistill balances parameter efficiency with strong few-shot performance.
- The distilled networks can surpass classical baselines and, in some cases, the original transformer-based models.
In-Context Adaptation of VLMs for Few-Shot Cell Detection in Optical Microscopy
Authors:Shreyan Ganguly, Angona Biswas, Jaydeep Rade, Md Hasibul Hasan Hasib, Nabila Masud, Nitish Singla, Abhipsa Dash, Ushashi Bhattacharjee, Aditya Balu, Anwesha Sarkar, Adarsh Krishnamurthy, Soumik Sarkar
Foundation vision-language models (VLMs) excel on natural images, but their utility for biomedical microscopy remains underexplored. In this paper, we investigate how in-context learning enables state-of-the-art VLMs to perform few-shot object detection when large annotated datasets are unavailable, as is often the case with microscopic images. We introduce the Micro-OD benchmark, a curated collection of 252 images specifically curated for in-context learning, with bounding-box annotations spanning 11 cell types across four sources, including two in-lab expert-annotated sets. We systematically evaluate eight VLMs under few-shot conditions and compare variants with and without implicit test-time reasoning tokens. We further implement a hybrid Few-Shot Object Detection (FSOD) pipeline that combines a detection head with a VLM-based few-shot classifier, which enhances the few-shot performance of recent VLMs on our benchmark. Across datasets, we observe that zero-shot performance is weak due to the domain gap; however, few-shot support consistently improves detection, with marginal gains achieved after six shots. We observe that models with reasoning tokens are more effective for end-to-end localization, whereas simpler variants are more suitable for classifying pre-localized crops. Our results highlight in-context adaptation as a practical path for microscopy, and our benchmark provides a reproducible testbed for advancing open-vocabulary detection in biomedical imaging.
Paper and Project Links
Summary
This paper investigates how in-context learning enables state-of-the-art vision-language models (VLMs) to perform few-shot object detection on biomedical microscopy images, where large annotated datasets are unavailable. It introduces the Micro-OD benchmark, 252 images curated for in-context learning with bounding-box annotations spanning 11 cell types across four sources. Eight VLMs are systematically evaluated under few-shot conditions, comparing variants with and without implicit test-time reasoning tokens, and a hybrid Few-Shot Object Detection (FSOD) pipeline combining a detection head with a VLM-based few-shot classifier improves few-shot performance on the benchmark. Zero-shot performance is weak due to the domain gap, but few-shot support consistently improves detection, with marginal gains after six shots. Models with reasoning tokens are more effective for end-to-end localization, while simpler variants suit classifying pre-localized crops. Overall, in-context adaptation is a practical path for microscopy, and the benchmark provides a reproducible testbed for advancing open-vocabulary detection in biomedical imaging.
Key Takeaways
- VLMs excel on natural images, but their application to biomedical microscopy remains underexplored.
- In-context learning enables VLMs to perform few-shot object detection without large annotated datasets.
- The Micro-OD benchmark provides a testbed for object detection in biomedical microscopy images.
- Among eight systematically evaluated VLMs, models with reasoning tokens are more effective for end-to-end localization, while simpler variants suit classifying pre-localized crops.
- A hybrid few-shot object detection (FSOD) pipeline improves the few-shot performance of recent VLMs.
- Few-shot support consistently improves detection, with marginal gains after six shots.