⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ 可在 HuggingFace 免费体验
2025-11-17 更新
Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback
Authors:Mohammadsina Almasi, Hadis Anahideh
Equitably allocating limited resources in high-stakes domains-such as education, employment, and healthcare-requires balancing short-term utility with long-term impact, while accounting for delayed outcomes, hidden heterogeneity, and ethical constraints. However, most learning-based allocation frameworks either assume immediate feedback or ignore the complex interplay between individual characteristics and intervention dynamics. We propose a novel bi-level contextual bandit framework for individualized resource allocation under delayed feedback, designed to operate in real-world settings with dynamic populations, capacity constraints, and time-sensitive impact. At the meta level, the model optimizes subgroup-level budget allocations to satisfy fairness and operational constraints. At the base level, it identifies the most responsive individuals within each group using a neural network trained on observational data, while respecting cooldown windows and delayed treatment effects modeled via resource-specific delay kernels. By explicitly modeling temporal dynamics and feedback delays, the algorithm continually refines its policy as new data arrive, enabling more responsive and adaptive decision-making. We validate our approach on two real-world datasets from education and workforce development, showing that it achieves higher cumulative outcomes, better adapts to delay structures, and ensures equitable distribution across subgroups. Our results highlight the potential of delay-aware, data-driven decision-making systems to improve institutional policy and social welfare.
在诸如教育、就业和医疗保健等高风险领域中公平分配有限资源,需要在权衡短期效用与长期影响的同时,考虑结果延迟、隐性异质性和伦理约束。然而,大多数基于学习的分配框架要么假设即时反馈,要么忽视个体特征与干预动态之间的复杂相互作用。我们针对延迟反馈环境下的个性化资源分配问题,提出了一个新型的两级上下文强盗框架,旨在具有动态人群、容量约束和时间敏感影响的现实环境中运行。在元级别上,该模型优化子群体层面的预算分配,以满足公平性和运营约束。在基础级别上,它利用基于观测数据训练的神经网络识别每组中响应性最强的个体,同时遵守冷却窗口,并通过资源特定的延迟核对延迟的干预效应进行建模。通过显式建模时间动态和反馈延迟,该算法随着新数据的到来不断完善其策略,使决策更具响应性和适应性。我们在教育和劳动力发展领域的两个真实数据集上验证了我们的方法,结果表明它实现了更高的累积成果,更好地适应了延迟结构,并确保了在各子群体之间的公平分配。我们的结果突出了延迟感知、数据驱动的决策系统在改进机构政策与提升社会福利方面的潜力。
论文及项目相关链接
PDF Accepted at AAAI-26 (AISI Track). Final version to appear in the Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-26), 2026
Summary
本文提出一种两级上下文强盗框架,用于延迟反馈下的个性化资源分配。该框架在满足公平性和运营约束的同时,应对现实环境中动态人群与容量受限的资源分配问题。它利用基于观测数据训练的神经网络识别最具响应性的个体,并通过资源特定的延迟核对延迟干预效应建模。算法随着新数据的到来不断优化其策略,实现更快速、灵活的决策。通过对教育和劳动力发展两个真实数据集进行验证,证明该框架可实现更高的累积成果、更好的延迟适应性以及跨子群体的公平分配,凸显了延迟感知、数据驱动的决策系统在改善机构政策和社会福利方面的潜力。
Key Takeaways
- 本文强调资源分配的复杂性,涉及短期效用与长期影响的平衡,需要考虑延迟结果、隐性异质性及伦理限制等因素。
- 提出新型两级上下文强盗框架处理个性化资源分配问题,适应现实世界的动态环境及资源限制。
- 通过神经网络识别对干预措施反应最强烈的个体,同时考虑资源分配的公平性和操作约束。
- 模型建立延迟处理效果的机制,以反映真实世界的反馈延迟和临时动态变化。
- 通过教育及劳动力发展领域的真实数据集验证,证明该框架在累积成果、延迟适应性及跨子群公平分配方面的优势。
- 延迟感知数据驱动决策系统可改善机构政策和社会福利。
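以下给出一段极简的 Python 示意代码,用来说明摘要中的两个要点:用"资源特定延迟核"把一次干预的总效果摊分到后续时间步,以及元层面在保证每个子群体最低配额的前提下按估计收益分配预算。其中指数衰减核、最低配额 min_share 以及各函数名均为演示所作的假设,并非论文的官方实现。

```python
import numpy as np

def exponential_delay_kernel(horizon: int, rate: float) -> np.ndarray:
    """资源特定延迟核的一个简化示例:指数衰减并归一化,
    表示干预效果在 horizon 个时间步内逐渐显现的权重。"""
    w = np.exp(-rate * np.arange(horizon))
    return w / w.sum()

def attribute_delayed_reward(total_effect: float, kernel: np.ndarray) -> np.ndarray:
    """把一次干预的总效果按延迟核摊分到各时间步(延迟反馈建模的简化)。"""
    return total_effect * kernel

def meta_budget_allocation(group_values: np.ndarray, budget: int, min_share: float) -> np.ndarray:
    """元层面预算分配的简化:先保证每个子群体的最低配额(公平约束的一种简化),
    剩余预算再按估计的子群体收益成比例分配。"""
    k = len(group_values)
    base = np.floor(budget * min_share * np.ones(k)).astype(int)
    remaining = budget - base.sum()
    weights = np.maximum(group_values, 0)
    weights = weights / weights.sum() if weights.sum() > 0 else np.ones(k) / k
    extra = np.floor(remaining * weights).astype(int)
    alloc = base + extra
    alloc[np.argmax(weights)] += budget - alloc.sum()   # 把取整余数补给权重最大的组
    return alloc

if __name__ == "__main__":
    kernel = exponential_delay_kernel(horizon=6, rate=0.5)
    print("延迟反馈摊分:", attribute_delayed_reward(1.0, kernel).round(3))
    print("子群体预算分配:", meta_budget_allocation(np.array([0.8, 0.3, 0.5]), budget=100, min_share=0.1))
```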
点此查看论文截图
Beyond Elicitation: Provision-based Prompt Optimization for Knowledge-Intensive Tasks
Authors:Yunzhe Xu, Zhuosheng Zhang, Zhe Liu
While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models’ capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within fixed parametric boundaries rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing up to 29% token usage. Extensive evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO’s superiority over elicitation-based methods, with an average performance improvement of ~6% over the strongest baseline while achieving comparable or lower token consumption. Code at: https://github.com/xyz9911/KPPO.
虽然提示优化已成为增强语言模型性能的关键技术,但现有方法主要集中在基于激发的策略上,即搜索最优提示以激活模型自身的能力。这些方法在应对知识密集型任务时存在根本性局限,因为它们只能在固定的参数边界内运作,而无法提供特定领域所需的事实知识、术语精度和推理模式。为了解决这些局限,我们提出了基于知识提供的提示优化(KPPO)框架,将提示优化重新表述为系统性的知识整合,而非潜能激发。KPPO引入了三个关键创新点:1) 知识缺口填充机制,用于识别知识缺口并进行针对性补救;2) 批量候选评估方法,同时考虑性能提升和分布稳定性;3) 自适应知识修剪策略,平衡性能和令牌效率,最多可减少29%的令牌使用。在15个来自不同领域的知识密集型基准测试上的广泛评估表明,KPPO优于基于激发的方法,相比最强基线平均性能提高约6%,同时实现相当或更低的令牌消耗。代码地址:https://github.com/xyz9911/KPPO。
论文及项目相关链接
PDF 16 pages, 19 figures
Summary
本文提出一种基于知识供给的提示优化(KPPO)框架,以解决现有基于激发的提示优化方法在知识密集型任务中的局限性。该框架将提示优化重新表述为系统性的知识整合,而非单纯激发模型潜能,并引入三项关键技术创新:知识缺口填充机制、批量候选评估方法和自适应知识修剪策略。在多个知识密集型基准测试上的广泛评估表明,KPPO显著优于基于激发的方法,相比最强基线平均性能提高约6%,同时实现相当或更低的令牌消耗。
Key Takeaways
- KPPO框架解决了现有语言模型在处理知识密集型任务时的局限性。
- KPPO将提示优化重新表述为系统性的知识整合,而非单纯激发模型潜能。
- KPPO引入三项关键技术创新:知识差距填补机制、批量候选评估方法和自适应知识修剪策略。
- 知识差距填补机制可识别并针对性解决知识差距。
- 批量候选评估方法同时考虑性能提升和分布稳定性。
- 自适应知识修剪策略平衡了性能和令牌效率,最多可减少29%的令牌使用。
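下面是对"批量候选评估(同时考虑性能提升与分布稳定性)"这一思路的极简示意:用候选提示相对基线在同一批样本上的平均得分差衡量提升,用得分差的标准差作为稳定性惩罚。惩罚形式与权重 stability_weight 均为假设,仅供理解,并非 KPPO 的官方实现。

```python
import numpy as np

def batch_candidate_score(base_scores, cand_scores, stability_weight=0.5):
    """批量候选评估的简化示意:
    性能提升 = 候选提示相对基线在同一批样本上的平均得分差;
    分布稳定性 = 对逐样本得分差的标准差施加惩罚,避免"平均变好但波动剧烈"的候选。"""
    base = np.asarray(base_scores, dtype=float)
    cand = np.asarray(cand_scores, dtype=float)
    delta = cand - base
    return delta.mean() - stability_weight * delta.std()

# 用法示意:在两个候选中选择综合得分更高者
base = [0.60, 0.70, 0.55, 0.65]
cand_a = [0.70, 0.72, 0.60, 0.70]   # 提升均匀
cand_b = [0.90, 0.55, 0.50, 0.75]   # 平均提升相近但波动更大
print(batch_candidate_score(base, cand_a), batch_candidate_score(base, cand_b))
```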
点此查看论文截图
BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages
Authors:Guduru Manoj, Neel Prabhanjan Rachamalla, Ashish Kulkarni, Gautam Rajeev, Jay Piplodiya, Arul Menezes, Shaharukh Khan, Souvik Rana, Manya Sah, Chandra Khatri, Shubham Agarwal
In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
在大型语言模型(LLM)的预训练背景下,合成数据已成为大规模生成高质量预训练数据的一种替代方案。这在低资源语言环境中尤其有益,因为近期大型语言模型带来的收益在不同语言之间分布不均。在这项工作中,我们对印度语言合成多语言预训练数据的生成与评估进行了系统研究,使用5种不同技术为10种语言构建了包含5400亿(540B)个令牌的大规模合成数据集BhashaKritika。我们探讨了以文档、人设和主题为依据进行生成的影响,分析了提示指令和文档依据中的语言选择如何影响数据质量,并比较了将英语内容翻译成印度语言与直接以印度语言生成两种方式。为了支持可扩展且对语言敏感的评估,我们引入了一个模块化质量评估管道,该管道集成了文字体系与语言检测、元数据一致性检查、n-gram重复分析以及使用KenLM模型的基于困惑度的过滤。我们的框架能够在多种文字体系和语言环境中实现稳健的质量控制。通过模型训练得到的实证结果揭示了生成策略的关键权衡,并强调了构建有效多语言语料库的最佳实践。
论文及项目相关链接
Summary
该文介绍了合成数据在预训练大型语言模型(LLM)中的应用,特别是在低资源语言环境下的优势。研究团队针对印度语言生成并评估了大规模合成预训练数据集BhashaKritika,包含540B标记,使用5种技术为10种语言生成。该研究探讨了基于文档、人物和主题的生成影响,分析了指令语言和文档基础的语言选择对数据质量的影响,并比较了英语内容的翻译与印度语言的本地生成。为支持可扩展和语言敏感的评价,研究团队引入模块化质量评估管道,整合脚本和语言检测、元数据一致性检查、n-gram重复分析和基于KenLM模型的困惑度过滤。此框架能够在不同脚本和语言环境下实现稳健的质量控制。实证结果揭示了生成策略的关键权衡,并强调了构建有效多语言语料库的最佳实践。
Key Takeaways
- 合成数据作为大规模预训练语言模型的替代方案,特别是在低资源语言环境中具有优势。
- 研究团队创建了针对印度语言的合成多语言预训练数据集BhashaKritika,包含540B标记,使用5种技术为10种语言生成。
- 探讨了基于文档、人物和主题的生成方法的影响,并分析了语言选择对数据质量的影响。
- 研究比较了英语内容的翻译与印度语言的本地生成。
- 引入模块化质量评估管道,包括脚本和语言检测、元数据一致性检查等,实现稳健的质量控制。
- 实证结果揭示了生成策略的关键权衡。
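下面给出质量评估管道中"n-gram 重复分析"这一环节的极简示意(纯 Python,不依赖外部库):统计重复 n-gram 的占比并按阈值过滤。阈值 0.2 与 n=3 均为示意取值,实际管道还需结合文字体系/语言检测与 KenLM 困惑度过滤等其他组件,这里未展示。

```python
from collections import Counter

def ngram_repetition_ratio(text: str, n: int = 3) -> float:
    """统计重复 n-gram 占比:重复出现的 n-gram 次数 / n-gram 总数。
    比例越高,文本越可能是退化的重复生成。"""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

def passes_repetition_filter(text: str, n: int = 3, max_ratio: float = 0.2) -> bool:
    """阈值 0.2 仅为示意,实际应按语言与文字体系分别调参。"""
    return ngram_repetition_ratio(text, n) <= max_ratio

print(passes_repetition_filter("the cat sat on the mat and the dog sat on the mat"))
```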
点此查看论文截图
Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard
Authors:Yudong Yang, Xuezhen Zhang, Zhifeng Han, Siyin Wang, Jimin Zhuang, Zengrui Jin, Jing Shao, Guangzhi Sun, Chao Zhang
Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but exposing new safety risks emerging from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embeds harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which imply unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that, even Gemini 2.5 Pro, the state-of-the-art proprietary LLM, still exhibits 66% attack success rate in SACRED-Bench test set, exposing vulnerabilities under cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing attack success down to 20%. Our results highlight the need for audio-aware defenses for the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: this paper includes examples that may be offensive or harmful.
最近大型语言模型(LLM)的进展使其能够理解语音和非语音音频,但也暴露出当前防护措施难以应对复杂音频输入所带来的新安全风险。我们推出SACRED-Bench(用于红队测试的语音-音频组合基准),以评估LLM在复杂音频攻击下的稳健性。与依赖噪声优化或白盒访问的现有扰动式方法不同,SACRED-Bench利用语音-音频组合机制,具体包括三种方式:(a)语音重叠与多说话人对话,将有害提示嵌入良性语音之下或与之并置;(b)语音-音频混合,在良性语音或音频之外借助非语音音频暗示不安全意图;(c)多样的口语指令格式(开放式问答、是非题),以规避仅针对文本的过滤器。实验表明,即使是最先进的专有LLM Gemini 2.5 Pro,在SACRED-Bench测试集上的攻击成功率仍高达66%,暴露出其在跨模态、语音-音频组合攻击下的脆弱性。为弥补这一差距,我们提出SALMONN-Guard,一个联合检查语音、音频和文本以做出安全判断的防护LLM,可将攻击成功率降至20%。我们的结果强调了为多模态LLM构建音频感知防御的必要性。该基准与SALMONN-Guard检查点可在https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench获取。警告:本论文包含可能具有冒犯性或有害的示例。
论文及项目相关链接
Summary
本文主要介绍了大型语言模型(LLMs)在理解语音和非语音音频方面的最新进展所引发的新安全风险,并引入了SACRED-Bench基准测试平台来评估LLMs在复杂音频攻击下的稳健性。文章还提出了SALMONN-Guard安全保护方案,该方案联合检查语音、音频和文本以进行安全判断。
Key Takeaways
- 大型语言模型(LLMs)能够理解语音和非语音音频,但存在新的安全风险。
- SACRED-Bench用于评估LLMs在复杂音频攻击下的稳健性,采用三种机制:语音重叠和多说话者对话、语音音频混合以及多样的口语指令格式。
- 实验显示,即使是最先进的LLM(如Gemini 2.5 Pro)在SACRED-Bench测试集上的攻击成功率也达到66%,暴露出跨模态和语音音频组合攻击下的漏洞。
- SALMONN-Guard是一种安全保护方案,能够联合检查语音、音频和文本以进行安全判断,降低了攻击成功率至20%。
- 文章强调了为多模态LLMs构建音频感知防御的必要性。
- SACRED-Bench和SALMONN-Guard的检查点可以在Hugging Face上找到。
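下面用 numpy 给出"语音重叠"这一合成机制的极简示意:把第二段波形以较低增益、指定偏移叠加到良性语音之上。增益、偏移等参数均为假设,仅用于说明组合式攻击的构造思路,并非 SACRED-Bench 的实际生成流程。

```python
import numpy as np

def overlay_speech(benign: np.ndarray, injected: np.ndarray,
                   gain_db: float = -12.0, offset: int = 0) -> np.ndarray:
    """语音重叠合成的简化示意:把第二段波形以较低增益、指定偏移叠加到良性语音上。"""
    gain = 10.0 ** (gain_db / 20.0)
    mixed = benign.astype(np.float32)
    end = min(len(mixed), offset + len(injected))
    mixed[offset:end] += gain * injected[: end - offset].astype(np.float32)
    peak = np.max(np.abs(mixed))          # 防止削波
    return mixed / peak if peak > 1.0 else mixed

# 用法示意:两段 1 秒、16 kHz 的正弦波
sr = 16000
t = np.arange(sr) / sr
benign = 0.5 * np.sin(2 * np.pi * 220 * t)
injected = 0.5 * np.sin(2 * np.pi * 440 * t)
print(overlay_speech(benign, injected, gain_db=-12.0, offset=8000).shape)
```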
点此查看论文截图
GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph Prompt
Authors:Zhenhe Li, Can Lin, Ling Zheng, Wen-Da Wei, Junli Liang, Qi Song
Multi-turn instruction following is essential for building intelligent conversational systems that can consistently adhere to instructions across dialogue turns. However, existing approaches to enhancing multi-turn instruction following primarily rely on collecting or generating large-scale multi-turn dialogue datasets to fine-tune large language models (LLMs), which treat each response generation as an isolated task and fail to explicitly incorporate multi-turn instruction following into the optimization objectives. As a result, instruction-tuned LLMs often struggle with complex long-distance constraints. In multi-turn dialogues, relational constraints across turns can be naturally modeled as labeled directed edges, making graph structures particularly suitable for modeling multi-turn instruction following. Despite this potential, leveraging graph structures to enhance the multi-turn instruction following capabilities of LLMs remains unexplored. To bridge this gap, we propose GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs. GraphIF comprises three key components: (1) an agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) a relation graph prompt generation module that converts structured graph information into natural language prompts; and (3) a response rewriting module that refines initial LLM outputs using the generated graph prompts. Extensive experiments on two long multi-turn dialogue datasets demonstrate that GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.
多轮指令跟随对于构建能够在多个对话回合中始终遵循指令的智能对话系统至关重要。然而,现有增强多轮指令跟随的方法主要依赖收集或生成大规模多轮对话数据集来微调大型语言模型(LLMs),这些方法将每次响应生成视为孤立任务,并未显式地将多轮指令跟随纳入优化目标。因此,经过指令微调的LLMs通常难以处理复杂的长距离约束。在多轮对话中,跨回合的关系约束可以自然地建模为带标签的有向边,使图结构特别适合建模多轮指令跟随。尽管存在这种潜力,利用图结构增强LLMs多轮指令跟随能力的研究仍属空白。为了弥补这一空白,我们提出了GraphIF,这是一个即插即用的框架,它将多轮对话建模为有向关系图,并利用图提示增强LLMs的指令跟随能力。GraphIF包含三个关键组件:(1)基于智能体的关系提取模块,通过动作触发机制捕获回合间的语义关系,以构建结构化图;(2)关系图提示生成模块,将结构化图信息转换为自然语言提示;(3)响应重写模块,使用生成的图提示对LLM的初始输出进行润色。在两个长多轮对话数据集上的大量实验表明,GraphIF可以无缝集成到经过指令微调的LLMs中,并在全部四个多轮指令跟随评估指标上取得显著改进。
论文及项目相关链接
Summary
本文强调在构建智能对话系统时多轮指令跟随的重要性。现有方法主要通过收集或生成大规模多轮对话数据集来微调大型语言模型(LLMs),但未能显式地将多轮指令跟随纳入优化目标。为此,本文提出GraphIF框架,将多轮对话建模为有向关系图,并利用图提示增强LLMs的指令跟随能力。该框架包括三个关键组件,实验证明其可无缝集成到经过指令微调的LLMs中,并在所有四项多轮指令跟随评估指标上实现显著改进。
Key Takeaways
- 多轮指令跟随对于构建能够始终遵循指令的对话系统至关重要。
- 现有方法主要依赖大规模数据集对语言模型进行微调,但未充分考虑多轮指令跟随。
- GraphIF框架将多轮对话建模为关系图,以更好地处理跨轮的关系约束。
- GraphIF包含三个关键组件:关系提取模块、关系图提示生成模块和响应重写模块。
- 通过在两个长多轮对话数据集上的实验,证明了GraphIF可以无缝集成到指令调整过的LLMs中。
- GraphIF在四项多轮指令跟随评估指标上实现显著改进。
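下面给出"把跨轮关系图转换为自然语言提示"这一步的极简示意:关系以带标签的有向边表示,再逐条序列化为提示文本。边的标签与提示措辞均为假设,并非 GraphIF 的官方实现。

```python
def graph_to_prompt(edges):
    """把跨轮关系图(带标签的有向边)序列化为自然语言提示的简化示意。
    edges: [(源轮次, 关系标签, 目标轮次), ...]"""
    lines = ["在生成本轮回复前,请遵守以下跨轮约束:"]
    for i, (src, rel, dst) in enumerate(edges, 1):
        lines.append(f"{i}. 第{dst}轮的回复必须与第{src}轮保持“{rel}”关系。")
    return "\n".join(lines)

# 用法示意
edges = [(1, "沿用相同的输出格式", 4), (2, "不得与其中的禁止事项冲突", 4)]
print(graph_to_prompt(edges))
```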
点此查看论文截图
MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples
Authors:Xurui Li, Feng Xue, Yu Zhou
Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that is overlooked by existing methods: normal image patches across industrial products typically find many other similar patches, not only in 2D appearance but also in 3D shapes, while anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS, which flexibly supports single 2D/3D or multimodality. Specifically, our method begins by improving 3D representation through Iterative Point Grouping (IPG), which reduces false positives from discontinuous surfaces. Then we use Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core comprises a Mutual Scoring Mechanism (MSM) that lets samples within each modality assign scores to each other, and Cross-modal Anomaly Enhancement (CAE) that fuses 2D and 3D scores to recover modality-specific missing anomalies. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classification based on similarity to more representative samples. Our framework flexibly works on both the full dataset and smaller subsets with consistently robust performance, ensuring seamless adaptability across diverse product lines. Aided by the novel framework, MuSc-V2 achieves significant performance improvements: a +23.7% AP gain on the MVTec 3D-AD dataset and a +19.3% boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at https://github.com/HUST-SLOW/MuSc-V2.
零样本异常分类(AC)和分割(AS)方法旨在不使用任何标注样本的情况下识别并勾勒缺陷。本文揭示了现有方法忽视的一个关键性质:工业产品中的正常图像块通常能找到许多其他相似图像块,不仅在二维外观上,也在三维形状上,而异常区域则保持多样且孤立。为了显式利用这一判别性质,我们提出了用于零样本AC/AS的互评分框架(MuSc-V2),它灵活支持单一的二维/三维模态或多模态。具体来说,我们的方法首先通过迭代点分组(IPG)改进三维表示,以减少不连续表面带来的误报;然后使用多度相似性邻域聚合(SNAMD)融合二维/三维邻域线索,得到更具判别力的多尺度图像块特征用于互评分。核心部分包括互评分机制(MSM),它允许同一模态内的样本互相评分,以及跨模态异常增强(CAE),它融合二维和三维得分以找回特定模态遗漏的异常。最后,约束邻域重评分(RsCon)依据与更具代表性样本的相似度抑制误分类。我们的框架在完整数据集和较小子集上都能灵活工作且性能稳健,可无缝适配不同产品线。借助该新框架,MuSc-V2实现了显著的性能提升:在MVTec 3D-AD数据集上AP提升23.7%,在Eyecandies数据集上提升19.3%,超越了此前的零样本基准,甚至优于大多数少样本方法。代码将在https://github.com/HUST-SLOW/MuSc-V2提供。
论文及项目相关链接
Summary
本文提出了一种基于互评分框架(MuSc-V2)的零样本异常分类与分割方法。该方法利用一个被现有方法忽视的判别性质:工业产品中的正常图像块在二维外观与三维形状上都能找到大量相似图像块,而异常区域多样且孤立。方法通过迭代点分组和多度相似性邻域聚合提取更具判别力的多尺度特征,并以互评分机制结合跨模态异常增强与约束邻域重评分来检测与分割异常。此外,该框架可灵活支持单一模态或多模态数据,在完整数据集和较小子集上均表现稳健。实验结果表明,MuSc-V2在MVTec 3D-AD和Eyecandies数据集上的性能显著提升,超过了零样本基准,甚至优于大多数少样本方法。
Key Takeaways
- 提出了一种基于Mutual Scoring框架(MuSc-V2)的零样本异常分类与分割方法。
- 指出了一个被现有方法忽略的性质:正常图像块在二维外观和三维形状上都能找到大量相似图像块,而异常多样且孤立;并利用这一性质提高了异常检测的准确性。
- 通过迭代点分组和改进的多尺度相似性邻域聚合,提高了模型的性能。
- 框架支持单模态(2D/3D)和多模态数据,具有广泛的应用潜力。
- MuSc-V2框架在MVTec 3D-AD和Eyecandies数据集上实现了显著的性能提升。
- MuSc-V2的性能超过了零样本基准测试,甚至优于大多数少样本方法。
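下面用 numpy 给出互评分机制(MSM)核心思想的极简示意:每个样本的图像块特征由其他样本的图像块来打分,正常图像块能在其他样本中找到高度相似的图像块因而得分低,异常图像块孤立因而得分高。特征假设已由某个 2D/3D 编码器提取并做了 L2 归一化,并非论文的官方实现。

```python
import numpy as np

def mutual_patch_scores(patch_feats):
    """互评分机制的极简示意。
    patch_feats: 长度为 N 的列表,每个元素是 (P_i, D) 的图像块特征矩阵(已 L2 归一化)。
    每个图像块的异常得分 = 1 - 它与"其他样本"所有图像块的最大余弦相似度。"""
    scores = []
    for i, feats_i in enumerate(patch_feats):
        others = np.concatenate([f for j, f in enumerate(patch_feats) if j != i], axis=0)
        sim = feats_i @ others.T                 # 余弦相似度(特征已归一化)
        scores.append(1.0 - sim.max(axis=1))     # 每个图像块的异常得分
    return scores

# 用法示意:3 个样本、每个 4 个图像块、8 维特征
rng = np.random.default_rng(0)
feats = [rng.normal(size=(4, 8)) for _ in range(3)]
feats = [f / np.linalg.norm(f, axis=1, keepdims=True) for f in feats]
print([s.round(2) for s in mutual_patch_scores(feats)])
```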
点此查看论文截图
Multi-agent In-context Coordination via Decentralized Memory Retrieval
Authors:Tao Jiang, Zichuan Lin, Lihe Li, Yi-Chen Li, Cong Guan, Lei Yuan, Zongzhang Zhang, Yang Yu, Deheng Ye
Large transformer models, trained on diverse datasets, have demonstrated impressive few-shot performance on previously unseen tasks without requiring parameter updates. This capability has also been explored in Reinforcement Learning (RL), where agents interact with the environment to retrieve context and maximize cumulative rewards, showcasing strong adaptability in complex settings. However, in cooperative Multi-Agent Reinforcement Learning (MARL), where agents must coordinate toward a shared goal, decentralized policy deployment can lead to mismatches in task alignment and reward assignment, limiting the efficiency of policy adaptation. To address this challenge, we introduce Multi-agent In-context Coordination via Decentralized Memory Retrieval (MAICC), a novel approach designed to enhance coordination by fast adaptation. Our method involves training a centralized embedding model to capture fine-grained trajectory representations, followed by decentralized models that approximate the centralized one to obtain team-level task information. Based on the learned embeddings, relevant trajectories are retrieved as context, which, combined with the agents’ current sub-trajectories, inform decision-making. During decentralized execution, we introduce a novel memory mechanism that effectively balances test-time online data with offline memory. Based on the constructed memory, we propose a hybrid utility score that incorporates both individual- and team-level returns, ensuring credit assignment across agents. Extensive experiments on cooperative MARL benchmarks, including Level-Based Foraging (LBF) and SMAC (v1/v2), show that MAICC enables faster adaptation to unseen tasks compared to existing methods. Code is available at https://github.com/LAMDA-RL/MAICC.
在多样化数据集上训练的大型Transformer模型,无需更新参数即可在此前未见的任务上展现出令人印象深刻的少样本性能。这一能力也在强化学习(RL)中得到探索:智能体与环境交互以获取上下文并最大化累积奖励,在复杂环境中展现出很强的适应性。然而,在合作式多智能体强化学习(MARL)中,智能体必须朝共同目标协调行动,去中心化的策略部署可能导致任务对齐和奖励分配不匹配,限制策略适应的效率。为应对这一挑战,我们提出了基于去中心化记忆检索的多智能体上下文协调方法(MAICC),旨在通过快速适应增强协调。我们的方法先训练一个集中式嵌入模型来捕捉细粒度的轨迹表示,再用去中心化模型近似该集中式模型以获得团队级任务信息。基于学习到的嵌入,检索相关轨迹作为上下文,并与智能体当前的子轨迹结合以辅助决策。在去中心化执行过程中,我们引入一种新的记忆机制,有效平衡测试时的在线数据与离线记忆;并基于所构建的记忆提出一种混合效用评分,同时纳入个体与团队层面的回报,以确保智能体之间的信用分配。在合作式MARL基准(包括Level-Based Foraging(LBF)和SMAC(v1/v2))上的大量实验表明,MAICC相比现有方法能够更快地适应未见任务。代码可在https://github.com/LAMDA-RL/MAICC获取。
论文及项目相关链接
Summary
本文介绍了在多样化数据集上训练的大型Transformer模型在未见任务上表现出的少样本能力,并针对合作式多智能体强化学习中的协同问题,提出了一种新方法MAICC(基于去中心化记忆检索的多智能体上下文协调),以增强协调并加速适应。该方法先训练集中式嵌入模型来捕捉细粒度轨迹表示,再用去中心化模型近似集中式模型以获取团队级任务信息;基于学习到的嵌入检索相关轨迹作为上下文,并结合智能体当前子轨迹做出决策。在执行过程中,引入了一种新的记忆机制,有效平衡在线数据和离线记忆。实验表明,在合作式MARL基准测试中,MAICC相比现有方法能够更快地适应新任务。
Key Takeaways
- 大型Transformer模型在未见任务上表现出强大的少样本性能。
- 强化学习中的多智能体协同面临挑战,需要解决任务对齐和奖励分配的问题。
- MAICC方法通过训练中央嵌入模型来捕捉精细轨迹表示,提高协同能力。
- MAICC通过分散模型近似中央模型获取团队级任务信息。
- 相关轨迹被检索作为上下文,结合智能体的当前子轨迹进行决策。
- MAICC引入新的记忆机制,有效平衡在线数据和离线记忆。
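下面给出"按嵌入相似度检索记忆轨迹,并用混合效用分数(个体回报与团队回报加权)重排序"的极简示意。alpha、top_k 以及两阶段的检索方式均为演示用的假设,并非 MAICC 的官方实现。

```python
import numpy as np

def retrieve_with_hybrid_utility(query_emb, memory_embs, indiv_returns, team_returns,
                                 alpha=0.5, top_k=3):
    """记忆检索与混合效用评分的简化示意:
    1) 按余弦相似度从记忆库中召回与当前子轨迹最相近的候选轨迹;
    2) 对候选按 混合效用 = alpha*个体回报 + (1-alpha)*团队回报 重新排序,
       以兼顾个体与团队层面的信用分配。"""
    indiv_returns = np.asarray(indiv_returns, dtype=float)
    team_returns = np.asarray(team_returns, dtype=float)
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    sim = m @ q
    candidates = np.argsort(-sim)[: top_k * 2]          # 先取 2*top_k 个相似候选
    utility = alpha * indiv_returns[candidates] + (1 - alpha) * team_returns[candidates]
    return candidates[np.argsort(-utility)][:top_k]

rng = np.random.default_rng(1)
mem = rng.normal(size=(10, 16))
print(retrieve_with_hybrid_utility(rng.normal(size=16), mem,
                                   rng.uniform(size=10), rng.uniform(size=10)))
```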
点此查看论文截图
AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models
Authors:Xinyi Wang, Xun Yang, Yanlong Xu, Yuchen Wu, Zhen Li, Na Zhao
Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.
在物理环境中实现有效的人机协作,不仅需要理解对什么采取行动,还需要知道可操作元素在哪里以及如何与之交互。现有方法往往只在物体层面操作,或割裂地处理细粒度可供性(affordance)推理,缺乏连贯的、由指令驱动的定位与推理。在这项工作中,我们提出了一个新任务:细粒度三维具身推理,要求智能体根据任务指令,为三维场景中每个被提及的可供性元素预测一个结构化三元组,包括其空间位置、运动类型和运动轴。为了解决这一任务,我们提出了AffordBot,一个将多模态大型语言模型(MLLMs)与定制的思维链(CoT)推理范式相结合的新框架。为了弥合三维输入与兼容二维输入的MLLMs之间的差距,我们渲染场景的环绕视图,并将三维候选元素投影到这些视图中,形成与场景几何对齐的丰富视觉表示。我们的CoT流程始于主动感知阶段,提示MLLM根据指令选择信息量最大的视角,然后逐步推理以定位可供性元素并推断合理的交互运动。在SceneFun3D数据集上的评估表明,AffordBot仅使用三维点云输入和MLLMs就取得了最先进的性能,展现出很强的泛化能力和符合物理约束的推理能力。
论文及项目相关链接
PDF NeurIPS 2025
Summary
本文介绍了Fine-grained 3D Embodied Reasoning任务,要求智能体在3D场景中,基于任务指令预测每个参考的可操作元素的结构化三元组,包括其空间位置、运动类型和运动轴。为解决此任务,提出了AffordBot框架,结合多模态大型语言模型(MLLMs)和定制化的思维链(CoT)推理范式。通过渲染场景周边视图并将3D元素候选者投影到这些视图中,形成与场景几何结构对齐的丰富视觉表征,以缩小3D输入和2D兼容的MLLMs之间的差距。AffordBot在SceneFun3D数据集上取得了最佳性能,展示了强大的泛化和物理推理能力,仅使用3D点云输入和MLLMs。
Key Takeaways
- Fine-grained 3D Embodied Reasoning任务要求智能体预测3D场景中每个参考的可操作元素的结构化信息。
- AffordBot框架结合多模态大型语言模型和思维链推理范式。
- 为了适应2D兼容的MLLMs,通过渲染场景周边视图并投影3D元素。
- AffordBot具备主动感知阶段,根据指令选择最具有信息量的视角。
- AffordBot通过逐步推理来定位可操作元素并推断可能的交互动作。
- 在SceneFun3D数据集上,AffordBot实现了最佳性能。
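论文流程中需要把三维候选元素对齐到渲染出的二维视图;下面用针孔相机模型给出这一投影步骤的常见做法示意。内参、外参取值均为假设,并非论文的具体实现。

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """把三维候选元素中心投影到某个渲染视角的像素坐标(针孔相机模型示意)。
    points_3d: (N, 3) 世界坐标;K: (3, 3) 内参;R, t: 世界到相机的外参。"""
    cam = (R @ points_3d.T + t.reshape(3, 1)).T        # 世界系 -> 相机系
    in_front = cam[:, 2] > 1e-6                        # 只保留相机前方的点
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    return uv, in_front

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.zeros(3)
pts = np.array([[0.0, 0.0, 2.0], [0.5, -0.2, 3.0]])
print(project_points(pts, K, R, t)[0].round(1))
```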
点此查看论文截图
Beyond Cosine Similarity: Magnitude-Aware CLIP for No-Reference Image Quality Assessment
Authors:Zhicheng Liao, Dongxu Wu, Zhenshan Shi, Sijie Mai, Hanwei Zhu, Lingyu Zhu, Yuncheng Jiang, Baoliang Chen
Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as “a good photo” or “a bad photo.” However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
近期的一些工作将对比语言-图像预训练(CLIP)模型改用于无参考图像质量评估(NR-IQA),通过测量图像嵌入与"好照片(a good photo)"或"坏照片(a bad photo)"等文本提示之间的余弦相似度来打分。然而,这种语义相似度忽略了一条关键但鲜被探索的线索:CLIP图像特征的幅度(magnitude),我们通过实证发现它与感知质量存在很强的相关性。在这项工作中,我们引入了一种新颖的自适应融合框架,用幅度感知的质量线索来补充余弦相似度。具体来说,我们首先提取CLIP图像特征的幅度,并应用Box-Cox变换对特征分布进行统计归一化、降低语义敏感性;得到的标量摘要作为经语义归一化的辅助线索,补充基于余弦相似度的提示匹配。为了有效整合这两种线索,我们进一步设计了一种置信度引导的融合方案,根据各自的相对强度自适应地加权。在多个基准IQA数据集上的大量实验表明,我们的方法在无需任何特定任务训练的情况下,始终优于标准的基于CLIP的IQA方法和最新的基线。
论文及项目相关链接
Summary
该文本描述了一种利用CLIP模型进行无参考图像质量评估(NR-IQA)的新方法。它引入了图像特征的幅度作为一个重要的质量线索,与基于余弦相似性的语义相似性相结合。通过应用Box-Cox变换来标准化特征分布,并设计了一种基于置信度的融合方案,将两种线索有效地结合在一起。实验表明,该方法在多个IQA数据集上表现优异,无需特定任务训练即可超越标准CLIP-based IQA和现有基线。
Key Takeaways
- 利用CLIP模型进行NR-IQA的新方法结合了图像特征的幅度作为质量线索。
- 通过Box-Cox变换标准化CLIP图像特征分布,减少语义敏感性。
- 设计了一种基于置信度的融合方案,自适应地结合余弦相似性和幅度线索。
- 新方法在不同IQA数据集上的表现均优于标准CLIP-based IQA和现有基线。
- 所提出的方法无需特定任务训练。
- 此方法结合了语义相似性和幅度线索,更全面地评估图像质量。
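下面给出"余弦相似度语义线索 + Box-Cox 归一化的幅度线索 + 置信度引导融合"整体思路的极简示意。图像与文本特征假设已由某个 CLIP 编码器提取;此处按两种线索 z 分数的相对强度加权,只是对"置信度引导融合"的一种简化替代,并非论文的官方实现。

```python
import numpy as np
from scipy.stats import boxcox

def magnitude_aware_quality(img_feats, good_text_feat, bad_text_feat):
    """幅度感知质量评分的简化示意:
    img_feats: (N, D) 未归一化的 CLIP 图像特征(假设已由某个 CLIP 编码器提取);
    语义线索 = 与正/负向提示的余弦相似度之差;
    幅度线索 = 特征 L2 范数经 Box-Cox 统计归一化后的标量;
    融合    = 按两种线索 |z 分数| 的相对强度自适应加权(示意性的置信度加权)。"""
    norms = np.linalg.norm(img_feats, axis=1)
    unit = img_feats / norms[:, None]
    good = good_text_feat / np.linalg.norm(good_text_feat)
    bad = bad_text_feat / np.linalg.norm(bad_text_feat)
    semantic = unit @ good - unit @ bad          # 余弦相似度线索
    magnitude, _ = boxcox(norms)                 # Box-Cox 归一化的幅度线索

    def z(x):
        return (x - x.mean()) / (x.std() + 1e-8)

    zs, zm = z(semantic), z(magnitude)
    w_sem = np.abs(zs) / (np.abs(zs) + np.abs(zm) + 1e-8)
    return w_sem * zs + (1 - w_sem) * zm         # 每张图的综合质量分

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 512)) * rng.uniform(0.5, 2.0, size=(5, 1))
print(magnitude_aware_quality(feats, rng.normal(size=512), rng.normal(size=512)).round(2))
```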
点此查看论文截图
Leveraging Large Language Models for Identifying Knowledge Components
Authors:Canwen Wang, Jionghao Lin, Kenneth R. Koedinger
Knowledge Components (KCs) are foundational to adaptive learning systems, but their manual identification by domain experts is a significant bottleneck. While Large Language Models (LLMs) offer a promising avenue for automating this process, prior research has been limited to small datasets and has been shown to produce superfluous, redundant KC labels. This study addresses these limitations by first scaling a “simulated textbook” LLM prompting strategy (using GPT-4o-mini) to a larger dataset of 646 multiple-choice questions. We found that this initial automated approach performed significantly worse than an expert-designed KC model (RMSE 0.4285 vs. 0.4206) and generated an excessive number of KCs (569 vs. 101). To address the issue of redundancy, we proposed and evaluated a novel method for merging semantically similar KC labels based on their cosine similarity. This merging strategy significantly improved the model’s performance; a model using a cosine similarity threshold of 0.8 achieved the best result, reducing the KC count to 428 and improving the RMSE to 0.4259. This demonstrates that while scaled LLM generation alone is insufficient, combining it with a semantic merging technique offers a viable path toward automating and refining KC identification.
知识组件(KCs)是自适应学习系统的基础,但由领域专家手动识别KC是一个重大瓶颈。虽然大型语言模型(LLM)为自动化这一过程提供了有前景的途径,但先前的研究仅限于小型数据集,并被证明会产生多余、冗余的KC标签。本研究首先将一种"模拟教科书"式的LLM提示策略(使用GPT-4o-mini)扩展到包含646道选择题的更大数据集,以应对上述局限。我们发现,这种初步的自动化方法表现明显差于专家设计的KC模型(RMSE 0.4285 对 0.4206),并且生成了过多的KC(569个对101个)。为了解决冗余问题,我们提出并评估了一种基于余弦相似度合并语义相近KC标签的新方法。该合并策略显著提升了模型性能;采用余弦相似度阈值0.8的模型取得了最佳结果,将KC数量减少到428个,并将RMSE改善至0.4259。这表明,仅靠规模化的LLM生成并不足够,但将其与语义合并技术相结合,为自动化并细化KC识别提供了一条可行路径。
论文及项目相关链接
PDF Accepted as an extended abstract in The International Conference on Learning Analytics & Knowledge (LAK’25) Workshop: LLMs for Qualitative Analysis in Education
Summary
本文探讨了知识组件(KCs)在自适应学习系统中的重要地位,以及通过大型语言模型(LLMs)自动化识别KCs的潜力。研究通过GPT-4o-mini模拟教科书提示策略,在646道选择题的大规模数据集上进行尝试,发现初始自动化方法表现不如专家设计的KC模型,并产生过多的KC。为解决冗余问题,研究提出了基于余弦相似性的合并语义相似KC标签的新方法。结合语义合并技术,展现了自动化和细化KC识别的可行路径。
Key Takeaways
- 知识组件(KCs)是自适应学习系统的基石,但其由领域专家手动识别存在瓶颈。
- 大型语言模型(LLMs)为自动化识别KCs提供希望,但之前的研究受限于小数据集,且产生的KC标签多余且冗余。
- 初步尝试使用GPT-4o-mini模拟教科书提示策略在大规模数据集上表现不佳,与专家设计的KC模型相比存在差距。
- 初始自动化方法产生过多的KC标签。
- 为解决冗余问题,提出了基于余弦相似性的合并语义相似KC标签的方法。
- 结合语义合并技术显著提高模型性能,使用余弦相似性阈值为0.8的模型表现最佳。
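下面给出"按余弦相似度阈值 0.8 合并语义相近 KC 标签"的一个贪心实现示意。标签句向量假设来自某个文本编码器,示例中的标签与向量为人为构造;论文的具体合并流程可能与此不同。

```python
import numpy as np

def merge_similar_labels(labels, embeddings, threshold=0.8):
    """贪心合并语义相近的 KC 标签:
    依次扫描,若某标签与已保留的任一代表标签余弦相似度 >= threshold,则并入该代表。"""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    representatives, assignment = [], {}
    for i, label in enumerate(labels):
        merged = False
        for rep_idx in representatives:
            if float(emb[i] @ emb[rep_idx]) >= threshold:
                assignment[labels[rep_idx]].append(label)
                merged = True
                break
        if not merged:
            representatives.append(i)
            assignment[label] = [label]
    return assignment

# 用法示意:前两个标签向量人为构造得非常接近,应被合并
labels = ["identify slope", "find slope of a line", "compute area"]
emb = np.array([[1.00, 0.00, 0.0],
                [0.95, 0.10, 0.0],
                [0.00, 1.00, 0.0]])
print(merge_similar_labels(labels, emb))
```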
点此查看论文截图
Debiased Dual-Invariant Defense for Adversarially Robust Person Re-Identification
Authors:Yuhang Zhou, Yanxiang Zhao, Zhongyun Hua, Zhipu Liu, Zhaoquan Gu, Qing Liao, Leo Yu Zhang
Person re-identification (ReID) is a fundamental task in many real-world applications such as pedestrian trajectory tracking. However, advanced deep learning-based ReID models are highly susceptible to adversarial attacks, where imperceptible perturbations to pedestrian images can cause entirely incorrect predictions, posing significant security threats. Although numerous adversarial defense strategies have been proposed for classification tasks, their extension to metric learning tasks such as person ReID remains relatively unexplored. Moreover, the several existing defenses for person ReID fail to address the inherent unique challenges of adversarially robust ReID. In this paper, we systematically identify the challenges of adversarial defense in person ReID into two key issues: model bias and composite generalization requirements. To address them, we propose a debiased dual-invariant defense framework composed of two main phases. In the data balancing phase, we mitigate model bias using a diffusion-model-based data resampling strategy that promotes fairness and diversity in training data. In the bi-adversarial self-meta defense phase, we introduce a novel metric adversarial training approach incorporating farthest negative extension softening to overcome the robustness degradation caused by the absence of classifier. Additionally, we introduce an adversarially-enhanced self-meta mechanism to achieve dual-generalization for both unseen identities and unseen attack types. Experiments demonstrate that our method significantly outperforms existing state-of-the-art defenses.
行人再识别(ReID)是许多现实世界应用(如行人轨迹跟踪)中的基本任务。然而,先进的基于深度学习的ReID模型很容易受到对抗性攻击的影响,其中行人图像的微小扰动可能导致完全错误的预测,构成重大安全威胁。虽然针对分类任务已经提出了许多对抗性防御策略,但它们扩展到度量学习任务(如行人ReID)仍然相对未被探索。而且,现有的针对行人ReID的几种防御方法未能解决对抗鲁棒ReID的固有独特挑战。在本文中,我们将对抗性防御在行人ReID中的挑战系统地归为两个关键问题:模型偏见和复合泛化要求。为了解决这些问题,我们提出了一个去偏双不变防御框架,包含两个阶段。在数据平衡阶段,我们使用基于扩散模型的数据重采样策略来缓解模型偏见,促进训练数据的公平性和多样性。在双向对抗性自我元防御阶段,我们引入了一种新的度量对抗训练方法来克服由于缺乏分类器而导致的鲁棒性下降,该方法结合了最远的负扩展软化策略。此外,我们还引入了对抗增强的自我元机制,以实现未见身份和未见攻击类型的双重泛化。实验表明,我们的方法显著优于现有的最先进的防御方法。
论文及项目相关链接
PDF Accepted by AAAI 2026
Summary
这篇论文针对人物再识别(ReID)任务中模型易受对抗攻击的问题,提出了一个去偏双不变防御框架。该框架包括数据平衡和双向对抗自元防御两个阶段,旨在解决模型偏差和复合泛化要求等挑战。通过扩散模型基础上的数据重采样策略,减轻模型偏差,促进训练数据的公平性和多样性。同时,引入度量对抗训练方法和对抗增强自元机制,实现对未见身份和未见攻击类型的双重泛化。实验表明,该方法显著优于现有最先进的防御手段。
Key Takeaways
- 人物再识别(ReID)在现实世界应用中非常重要,但深度模型易受对抗攻击。
- 对抗防御在ReID中的挑战包括模型偏差和复合泛化要求。
- 提出的去偏双不变防御框架包括数据平衡和双向对抗自元防御两个阶段。
- 数据平衡阶段通过扩散模型基础上的数据重采样策略,减轻模型偏差。
- 双向对抗自元防御阶段引入度量对抗训练方法和对抗增强自元机制。
- 该方法实现了对未见身份和未见攻击类型的双重泛化。
点此查看论文截图
Boosting In-Silicon Directed Evolution with Fine-Tuned Protein Language Model and Tree Search
Authors:Yaodong Yang, Yang Wang, Jinpeng Li, Pei Guo, Da Han, Guangyong Chen, Pheng-Ann Heng
Protein evolution through amino acid sequence mutations is a cornerstone of life sciences. While current in-silicon directed evolution algorithms focus on designing search strategies, they overlook how to utilize the transformative protein language models, which encode rich evolutionary patterns, to guide search. To bridge this gap, we propose AlphaDE, a novel framework to evolve protein sequences by harnessing the innovative paradigms of large language models. First, AlphaDE fine-tunes pretrained protein language models using masked language modeling on homologous protein sequences to activate the evolutionary plausibility for the interested protein class. Second, AlphaDE introduces test-time inference based on Monte Carlo tree search, which effectively evolves proteins with evolutionary guidance from the fine-tuned protein language model. Extensive benchmark experiments show that AlphaDE remarkably outperforms previous state-of-the-art methods even with few-shot fine-tuning. An interesting case study further shows that AlphaDE supports condensing the protein sequence space through computational evolution.
蛋白质通过氨基酸序列突变进行进化是生命科学的基石。然而,当前的计算机模拟(in silico)定向进化算法主要关注搜索策略的设计,而忽视了如何利用蕴含丰富进化模式的蛋白质语言模型来指导搜索。为弥补这一空白,我们提出了AlphaDE,一个借助大型语言模型的创新范式来进化蛋白质序列的新框架。首先,AlphaDE在同源蛋白质序列上使用掩码语言建模对预训练的蛋白质语言模型进行微调,以激活目标蛋白质类别的进化合理性。其次,AlphaDE引入基于蒙特卡洛树搜索的测试时推理,在微调后的蛋白质语言模型的进化指导下有效地进化蛋白质。大量基准实验表明,即使只进行少样本微调,AlphaDE也显著优于此前的最先进方法。一个有趣的案例研究进一步表明,AlphaDE支持通过计算进化来凝聚蛋白质序列空间。
论文及项目相关链接
PDF working in progress, 23 pages, 6 figures, 15 tables
Summary
本文介绍了一种新的蛋白质进化框架AlphaDE,该框架利用蛋白质语言模型的创新范式,通过微调预训练的蛋白质语言模型并利用蒙特卡罗树搜索进行推理,实现蛋白质序列的进化。AlphaDE显著提高了现有方法的性能,并支持通过计算进化来凝聚蛋白质序列空间。
Key Takeaways
- AlphaDE框架利用蛋白质语言模型来指导蛋白质序列进化的搜索策略。
- AlphaDE通过在同源蛋白质序列上应用掩码语言建模来微调预训练的蛋白质语言模型,从而激活对目标蛋白质类的进化可能性。
- AlphaDE引入基于蒙特卡罗树搜索的测试时推理,实现蛋白质的有效进化。
- AlphaDE在少数情况下的微调也能显著优于现有的最先进的方法。
- AlphaDE支持通过计算进化来凝聚蛋白质序列空间,展示了一个有趣的案例研究。
- 该框架对于蛋白质进化研究具有重要的推动作用。
点此查看论文截图
How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation
Authors:Muskaan Chopra, Lorenz Sparrenberg, Sarthak Khanna, Rafet Sifa
Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
大型语言模型(LLMs)擅长评估机器翻译(MT),但其规模和成本阻碍了在边缘设备和隐私敏感工作流程中的部署。我们提出的问题是:模型可以小到什么程度,同时仍能检测出改变语义的翻译错误?我们聚焦英语到德语的关键错误检测(CED),在WMT21、WMT22和SynCED-EnDe-2025上对参数量小于20亿的模型(LFM2-350M、Qwen-3-0.6B/1.7B、Llama-3.2-1B-Instruct、Gemma-3-1B)进行了基准测试。我们的框架统一了提示,应用了轻量级的logit偏置校准与多数投票,并同时报告语义质量指标(MCC、F1-ERR/F1-NOT)和计算指标(显存、延迟、吞吐量)。结果揭示了约十亿参数附近的一个明显"甜蜜点":Gemma-3-1B提供了最佳的质量-效率权衡,在合并权重微调后于SynCED-EnDe-2025上达到MCC=0.77、F1-ERR=0.98,同时在MacBook Pro M4 Pro(24 GB)上保持400毫秒的单样本延迟。在更大规模上,Qwen-3-1.7B取得了最高的绝对MCC(比Gemma高0.11),但计算成本更高。相比之下,超小模型(0.6B)在少样本校准下仍然可用,但对实体和数字错误的检测不足。总体而言,经过指令微调的紧凑LLMs,辅以轻量级校准和小样本监督,能够为机器翻译提供可信的设备端CED,从而在现实翻译流程中实现私密、低成本的错误筛查。所有数据集、提示和脚本均可在我们的GitHub仓库公开获取。
论文及项目相关链接
PDF Accepted in IEEE BigData 2025
Summary
该文本探讨了小型语言模型在机器翻译关键错误检测方面的能力。通过对比多个参数量小于20亿的模型在多种数据集上的表现,发现约十亿参数规模的模型具有最佳的质量-效率平衡点:Gemma-3-1B在SynCED-EnDe-2025数据集上达到了较高的语义质量指标,同时在MacBook Pro上的单样本延迟较低。随着模型规模的扩大,虽然性能有所提高,但计算成本也相应增加;而超小型模型虽然可以通过少样本校准使用,但在检测实体和数字错误方面仍有不足。总体而言,通过轻量级校准和小样本监督增强的、经过指令微调的紧凑语言模型,可为机器翻译提供可靠的设备端错误检测,适用于现实翻译流程中私密、低成本的错误筛查。
Key Takeaways
- 小型语言模型在机器翻译错误检测中展现潜力。
- 在特定数据集上,围绕十亿参数规模的模型表现出最佳的质量效率平衡点。
- 通过校准和多数投票等轻量级技术,可提高模型的性能。
- 高性能模型虽能提高语义质量指标,但计算成本也随之增加。
- 超小型模型在检测实体和数字错误方面存在不足。
- 紧凑、指令调优的小型语言模型结合轻量级校准和小样本监督,可实现可靠的设备端机器翻译错误检测。
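下面给出"轻量级 logit 偏置校准 + 多数投票"这一推理流程的极简示意:假设每个提示变体都能给出 [ERR, NOT] 两个标签词的 logit,先对 ERR 类加一个在少量开发集上调出的偏置,再对多个提示的判定做多数投票。偏置取值与输入格式均为假设,并非论文的官方实现。

```python
import numpy as np

def calibrated_vote(logits_per_prompt, err_bias=0.0):
    """轻量级 logit 偏置校准 + 多数投票的简化示意:
    logits_per_prompt: (K, 2),K 个提示变体下 [ERR, NOT] 两个标签词的 logit;
    err_bias: 在少量开发集上调出的加性偏置,用于抵消标签先验的偏斜(取值仅为示意)。"""
    logits = np.asarray(logits_per_prompt, dtype=float).copy()
    logits[:, 0] += err_bias                              # 只对 ERR 类施加偏置
    votes = (logits[:, 0] > logits[:, 1]).astype(int)     # 1 表示判为关键错误
    return int(votes.sum() * 2 > len(votes)), votes

# 用法示意:3 个提示变体的两类 logit
print(calibrated_vote([[1.2, 1.5], [2.0, 0.8], [1.4, 1.3]], err_bias=0.2))
```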
点此查看论文截图
SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation
Authors:Hao Shi, Bin Xie, Yingfei Liu, Yang Yue, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, Gao Huang
Robotic manipulation requires precise spatial understanding to interact with objects in the real world. Point-based methods suffer from sparse sampling, leading to the loss of fine-grained semantics. Image-based methods typically feed RGB and depth into 2D backbones pre-trained on 3D auxiliary tasks, but their entangled semantics and geometry are sensitive to inherent depth noise in real-world that disrupts semantic understanding. Moreover, these methods focus on high-level geometry while overlooking low-level spatial cues essential for precise interaction. We propose SpatialActor, a disentangled framework for robust robotic manipulation that explicitly decouples semantics and geometry. The Semantic-guided Geometric Module adaptively fuses two complementary geometry from noisy depth and semantic-guided expert priors. Also, a Spatial Transformer leverages low-level spatial cues for accurate 2D-3D mapping and enables interaction among spatial features. We evaluate SpatialActor on multiple simulation and real-world scenarios across 50+ tasks. It achieves state-of-the-art performance with 87.4% on RLBench and improves by 13.9% to 19.4% under varying noisy conditions, showing strong robustness. Moreover, it significantly enhances few-shot generalization to new tasks and maintains robustness under various spatial perturbations. Project Page: https://shihao1895.github.io/SpatialActor
机器人操作需要对现实世界中的物体进行精确的空间理解。基于点的方法受到稀疏采样的困扰,导致精细语义的丢失。基于图像的方法通常将RGB和深度信息输入到预训练于3D辅助任务的2D骨干网络上,但它们的语义和几何纠缠对现实世界中的固有深度噪声敏感,这破坏了语义理解。此外,这些方法关注高级几何,却忽视了对精确交互至关重要的低级空间线索。我们提出了SpatialActor,这是一个用于稳健机器人操作的解耦框架,它明确地分离了语义和几何。语义引导几何模块自适应地融合来自嘈杂深度和语义引导专家先验的两种互补几何。此外,空间转换器利用低级空间线索进行准确的2D-3D映射,并实现空间特征之间的交互。我们在50多个任务的多个仿真和真实场景中对SpatialActor进行了评估。它在RLBench上达到了最先进的性能,达到87.4%,并在不同的噪声条件下提高了13.9%至19.4%,显示出强大的稳健性。此外,它大大增强了对新任务的少量样本泛化能力,并在各种空间扰动下保持稳健性。项目页面:https://shihao1895.github.io/SpatialActor
论文及项目相关链接
PDF AAAI 2026 Oral | Project Page: https://shihao1895.github.io/SpatialActor
Summary:
本文提出了一种名为SpatialActor的框架,用于解决机器人在真实世界中进行操作时的语义和几何问题。该框架旨在解决现有方法在处理语义和几何信息时的不足,包括稀疏采样导致的语义丢失、真实世界深度噪声引起的语义和几何纠缠问题,以及忽视低级空间线索的问题。SpatialActor通过自适应融合噪声深度和语义引导先验信息,实现了语义指导的几何模块。此外,它利用低级空间线索进行精确的二维至三维映射,并通过空间变换实现空间特征间的交互。评估结果显示,SpatialActor在模拟和真实世界的多个场景中实现了卓越性能,特别是在不同噪声条件下的任务中表现出强大的鲁棒性。此外,它还能显著提高对新任务的快速适应能力和在各种空间扰动下的稳健性。
Key Takeaways:
- SpatialActor解决了机器人操作中存在的语义和几何问题,提高了操作精度。
- 通过自适应融合噪声深度和语义引导先验信息,实现了语义指导的几何模块。
- 利用低级空间线索进行精确的二维至三维映射,提高空间特征的交互能力。
- 在模拟和真实世界的多个场景中实现了卓越性能。
- 在不同噪声条件下的任务中表现出强大的鲁棒性。
- 能显著提高对新任务的快速适应能力。
点此查看论文截图
vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs
Authors:Minye Shao, Sihan Guo, Xinrun Li, Xingyu Miao, Haoran Duan, Yang Long
Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work aims to continuously expand to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.
最近,以大型语言模型(LLM)提炼的医疗语义先验知识为指导的上下文优化(CoOp)的进步,为基于CLIP的生物医学视觉语言模型(VLMs)的适应提供了手动提示工程和完全微调的可扩展替代方案。然而,在这种情况下,提示学习面临来自LLM和CLIP变体之间语义不对齐的挑战,这是由于不同的训练语料库和模型架构导致的;它缺乏在不断发展的基础模型家族中的可扩展性。更关键的是,通过传统的欧几里得空间优化进行的成对多模式对齐缺乏建模统一表示或应用局部几何约束的能力,这往往会放大复杂生物医学成像中的模式差距,并破坏少量数据的适应性。在这项工作中,我们提出了vMFCoOp框架,它通过逆向估计共享超球流形上的von Mises-Fisher(vMF)分布,通过统一语义锚对齐任意的LLM和CLIP骨干网,实现稳健的生物医学提示和出色的少量数据分类。基于三个互补的约束,vMFCoOp在14个医疗数据集、12种医疗成像方式和13个解剖区域上显示出持续的一致性改进,在准确性、通用性和临床适用性方面超越了最先进的方法。本工作的目标是不断扩大,涵盖更多的下游应用,相关资源将通过https://github.com/VinyehShaw/UniEqui进行共享。
论文及项目相关链接
PDF Accepted as an Oral Presentation at AAAI 2026 Main Technical Track (this version is not peer-reviewed; it is the extended version)
Summary
基于大型语言模型(LLM)提炼的医学语义先验所引导的上下文优化(CoOp)进展,为适配基于CLIP的生物医学视觉语言模型(VLM)提供了替代手动提示工程和完全微调的可扩展方案。然而,在此背景下的提示学习面临LLM与CLIP变体之间语义不对齐的挑战,因为二者的训练语料库和模型架构不同,且该方案难以在不断演进的基础模型家族之间扩展。更重要的是,通过传统欧几里得空间优化进行的成对多模态对齐,既无法建模统一表示,也无法施加局部几何约束,这在复杂的生物医学成像中会放大模态差距并破坏少样本适应的稳定性。针对这些问题,本研究提出了vMFCoOp框架,它在共享的超球面流形上逆向估计von Mises-Fisher(vMF)分布,并利用统一语义锚对齐任意LLM与CLIP骨干网之间的语义偏差,实现了稳健的生物医学提示和优越的少样本分类。vMFCoOp在三个互补约束的支持下,在14个医疗数据集、12种医疗成像方式和13个解剖区域上均取得一致改进,在准确性、泛化性和临床适用性方面均优于现有方法。此工作旨在不断扩展以涵盖更多的下游应用,相关资源将通过https://github.com/VinyehShaw/UniEqui共享。
Key Takeaways
- LLM-distilled medical semantic priors guide recent advances in context optimization (CoOp) for biomedical CLIP-based vision-language models (VLMs).
- Prompt learning in this context faces challenges due to semantic misalignment between LLMs and CLIP variants.
- 传统的欧氏空间优化在复杂生物医学成像中存在模态差距和稳定性问题。
- vMFCoOp框架通过共享超球面流形上的逆向估计von Mises-Fisher(vMF)分布来对齐语义偏差,实现了稳健的生物医学提示和优越的小样本分类。
- vMFCoOp在多个医疗数据集、成像方式和解剖区域上表现优越,优于现有方法。
- 此工作旨在通过不断扩展以涵盖更多下游应用。
点此查看论文截图
A cross-modal pre-training framework with video data for improving performance and generalization of distributed acoustic sensing
Authors:Junyi Duan, Jiageng Chen, Zuyuan He
Fiber-optic distributed acoustic sensing (DAS) has emerged as a critical Internet-of-Things (IoT) sensing technology with broad industrial applications. However, the two-dimensional spatial-temporal morphology of DAS signals presents analytical challenges where conventional methods prove suboptimal, while being well-suited for deep learning approaches. Although our previous work, DAS Masked Autoencoder (DAS-MAE), established state-of-the-art performance and generalization without labels, it is not satisfactory in frequency analysis in temporal-dominated DAS data. Moreover, the limitation of effective training data fails to address the substantial data requirements inherent to Transformer architectures in DAS-MAE. To overcome these limitations, we present an enhanced framework incorporating short-time Fourier transform (STFT) for explicit temporal-frequency feature extraction and pioneering video-to-DAS cross-modal pre-training to mitigate data constraints. This approach learns high-level representations (e.g., event classification) through label-free reconstruction tasks. Experimental results demonstrate transformative improvements: 0.1% error rate in few-shot classification (90.9% relative improvement over DAS-MAE) and 4.7% recognition error in external damage prevention applications (75.4% improvement over from-scratch training). As the first work to pioneer video-to-DAS cross-modal pre-training, available training resources are expanded by bridging computer vision and distributed sensing areas. The enhanced performance and generalization facilitate DAS deployment across diverse industrial scenarios while advancing cross-modal representation learning for industrial IoT sensing.
光纤分布式声学传感(DAS)已成为一种关键的物联网(IoT)传感技术,在工业领域有着广泛应用。然而,DAS信号的二维时空形态给分析带来了挑战:传统方法效果欠佳,而深度学习方法则更为适用。尽管我们之前的工作DAS掩码自编码器(DAS-MAE)在无标签情况下取得了最先进的性能和泛化能力,但它在以时间为主导的DAS数据的频率分析上仍不理想;此外,有效训练数据的不足也无法满足DAS-MAE中Transformer架构固有的大量数据需求。为了克服这些局限,我们提出了一个增强框架,引入短时傅里叶变换(STFT)进行显式的时频特征提取,并开创性地采用视频到DAS的跨模态预训练来缓解数据约束。该方法通过无标签重建任务学习高级表示(例如事件分类)。实验结果显示出变革性的改进:少样本分类错误率为0.1%(相对DAS-MAE有90.9%的相对改进),外部损伤预防应用中的识别错误率为4.7%(相对从头训练改进75.4%)。作为首个开创视频到DAS跨模态预训练的工作,它通过连接计算机视觉与分布式传感领域扩充了可用的训练资源。增强的性能和泛化能力有助于DAS在多种工业场景中的部署,同时推动了面向工业物联网感知的跨模态表示学习。
论文及项目相关链接
摘要
光纤分布式声学感知(DAS)是物联网(IoT)的关键传感技术,广泛应用于工业领域。然而,DAS信号的二维时空形态分析具有挑战性,传统方法效果有限,而深度学习则展现出优势。虽然先前工作DAS Masked Autoencoder(DAS-MAE)在无标签情况下实现了卓越的性能和泛化能力,但在以时间为主的DAS数据的频率分析上仍有不足。此外,受限于有效的训练数据无法满足Transformer架构的数据需求。为克服这些局限,我们提出一个结合短时傅里叶变换(STFT)进行显式时空频率特征提取的增强框架,并开创视频到DAS的跨模态预训练以缓解数据约束。该方法通过无标签重建任务学习高级表示(如事件分类)。实验结果显示出突破性改进:在少样本分类中的错误率为0.1%(相对于DAS-MAE有90.9%的相对改进),在外部损伤预防应用中的识别错误率为4.7%(相对于从头开始训练有75.4%的改进)。作为首次尝试视频到DAS跨模态预训练的工作,我们扩大了可用的训练资源,通过连接计算机视觉和分布式感知领域。增强性能和泛化能力有助于在不同工业场景中部署DAS,同时推动工业物联网感知的跨模态表示学习。
关键见解
- 光纤分布式声学感知(DAS)是IoT的重要传感技术,具有广泛的工业应用。
- DAS信号的二维时空形态分析具有挑战性,传统方法表现不佳,深度学习方法更优。
- 先前的DAS-MAE方法在频率分析上仍有不足,特别是在以时间为主的DAS数据中。
- 提出结合短时傅里叶变换(STFT)的增强框架,进行显式的时空频率特征提取。
- 开创视频到DAS的跨模态预训练,以缓解数据约束并扩大训练资源。
- 方法通过无标签重建任务学习高级表示,如事件分类。
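下面用 scipy 给出"对 DAS 数据做短时傅里叶变换(STFT)提取显式时频特征"这一环节的极简示意。采样率、窗长等参数均为假设,特征形式(对数幅度谱)也只是常见选择之一,并非论文的官方实现。

```python
import numpy as np
from scipy.signal import stft

def das_stft_features(das_block, fs=1000, nperseg=128):
    """对 DAS 信号做 STFT 提取时频特征的简化示意:
    das_block: (通道数, 时间采样点) 的二维 DAS 数据块。
    返回每个通道的对数幅度谱 (通道数, 频点数, 帧数),可作为后续网络的输入。"""
    feats = []
    for channel in das_block:
        _, _, zxx = stft(channel, fs=fs, nperseg=nperseg)
        feats.append(np.log1p(np.abs(zxx)))
    return np.stack(feats)

rng = np.random.default_rng(0)
block = rng.normal(size=(4, 4000))     # 4 个通道、4 秒 @ 1 kHz 的模拟数据
print(das_stft_features(block).shape)
```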
点此查看论文截图
Zero-Order Sharpness-Aware Minimization
Authors:Yao Fu, Yihang Jin, Chunxia Zhang, Junmin Liu, Haishan Ye
Prompt learning has become a key method for adapting large language models to specific tasks with limited data. However, traditional gradient-based optimization methods for tuning prompts are computationally intensive, posing challenges for efficiency. We introduce ZOSA (Zero-Order Sharpness-Aware Minimization), a novel optimization framework that integrates zero-order optimization with sharpness-aware minimization to enhance prompt tuning. ZOSA employs Rademacher perturbation vectors to estimate gradients without requiring backpropagation. By incorporating sharpness-aware principles, it targets flat minima in the loss landscape, improving generalization. An adaptive learning rate, guided by loss variability, further ensures stable convergence. Experiments on few-shot learning tasks, such as text classification and natural language inference, show that ZOSA significantly outperforms existing methods. With its theoretical foundation and computational efficiency, ZOSA offers a practical solution for prompt-based learning in resource-limited settings.
提示学习已成为适应大型语言模型进行特定任务的关键方法,尤其当数据有限时。然而,传统的基于梯度的优化方法在调整提示时计算量大,对效率构成挑战。我们引入了ZOSA(零阶尖锐度感知最小化),这是一种新型优化框架,它将零阶优化与尖锐度感知最小化相结合,以提高提示调整能力。ZOSA使用Rademacher扰动向量来估计梯度,无需反向传播。通过结合尖锐度感知原理,它在损失景观中寻找平坦的最小值,从而提高泛化能力。由损失变化引导的自适应学习率进一步确保稳定的收敛。在文本分类和自然语言推理等小样本学习任务上的实验表明,ZOSA显著优于现有方法。凭借其理论基础和计算效率,ZOSA为资源受限环境下的基于提示的学习提供了实用解决方案。
论文及项目相关链接
Summary:
提出一种名为ZOSA的新型优化框架,结合零阶优化和尖锐度感知最小化技术,用于提高提示调整的效率。ZOSA采用Rademacher扰动向量进行梯度估计,不需要反向传播。它通过在损失景观中寻找平坦的最小值来提高泛化能力。在文本分类和自然语言推理等小样本学习任务上的实验表明,ZOSA显著优于现有方法,为资源有限环境下的提示学习提供了实用解决方案。
Key Takeaways:
- ZOSA结合零阶优化和尖锐度感知最小化,旨在提高提示调整的效率。
- ZOSA采用Rademacher扰动向量估计梯度,无需反向传播。
- 通过寻找损失景观中的平坦最小值,ZOSA提高了模型的泛化能力。
- ZOSA具有自适应学习率,根据损失变化进行动态调整,确保稳定收敛。
- 实验表明,ZOSA在文本分类和自然语言推理等小样本学习任务上表现优异。
- ZOSA具有理论支撑和计算效率,适用于资源有限的环境下的提示学习。
- ZOSA为改进大型语言模型的提示调参提供了新的思路和方法。
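下面给出 ZOSA 核心思想的概念性示意:先用 Rademacher 扰动做零阶梯度估计(无需反向传播),再按尖锐度感知的方式上升到邻域内损失较高处、在该点估计梯度后更新参数。学习率、rho 等取值均为假设,仅在一个简单二次函数上演示,并非论文的官方实现。

```python
import numpy as np

def zo_grad(loss_fn, theta, mu=1e-3, n_samples=8, rng=None):
    """用 Rademacher 扰动做零阶梯度估计(对称两点差分)的简化示意。"""
    rng = rng if rng is not None else np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Rademacher 扰动向量
        diff = loss_fn(theta + mu * delta) - loss_fn(theta - mu * delta)
        grad += (diff / (2.0 * mu)) * delta
    return grad / n_samples

def zosa_step(loss_fn, theta, lr=0.2, rho=0.01, **kw):
    """一次"尖锐度感知"的零阶更新(概念性示意):
    先沿零阶梯度方向上升 rho 得到扰动参数,再在扰动点估计梯度并用它更新原参数,
    以偏向损失面中较平坦的极小值。"""
    g = zo_grad(loss_fn, theta, **kw)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_sharp = zo_grad(loss_fn, theta + eps, **kw)
    return theta - lr * g_sharp

# 用法示意:最小化一个简单二次函数,最优解为全 1 向量
loss = lambda x: float(np.sum((x - 1.0) ** 2))
rng = np.random.default_rng(0)
x = np.zeros(3)
for _ in range(60):
    x = zosa_step(loss, x, rng=rng)
print(x.round(2))   # 应接近 [1, 1, 1]
```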
点此查看论文截图
VietMEAgent: Culturally-Aware Few-Shot Multimodal Explanation for Vietnamese Visual Question Answering
Authors:Hai-Dang Nguyen, Minh-Anh Dang, Minh-Tan Le, Minh-Tuan Le
Contemporary Visual Question Answering (VQA) systems remain constrained when confronted with culturally specific content, largely because cultural knowledge is under-represented in training corpora and the reasoning process is not rendered interpretable to end users. This paper introduces VietMEAgent, a multimodal explainable framework engineered for Vietnamese cultural understanding. The method integrates a cultural object detection backbone with a structured program generation layer, yielding a pipeline in which answer prediction and explanation are tightly coupled. A curated knowledge base of Vietnamese cultural entities serves as an explicit source of background information, while a dual-modality explanation module combines attention-based visual evidence with structured, human-readable textual rationales. We further construct a Vietnamese Cultural VQA dataset sourced from public repositories and use it to demonstrate the practicality of programming-based methodologies for cultural AI. The resulting system provides transparent explanations that disclose both the computational rationale and the underlying cultural context, supporting education and cultural preservation with an emphasis on interpretability and cultural sensitivity.
现代视觉问答(VQA)系统在面对文化特定内容时仍受到诸多限制,这主要是因为训练语料库中的文化知识储备不足,且推理过程无法向最终用户解释。本文介绍了VietMEAgent,这是一种针对越南文化理解的多模式可解释框架。该方法将文化对象检测骨干与结构化程序生成层相结合,形成了一个紧密耦合的答案预测和解释管道。越南文化实体的精选知识库作为背景信息的明确来源,而双模式解释模块则结合了基于注意力的视觉证据与结构化、可读的文本理由。我们还从公共存储库中构建了越南文化VQA数据集,并用它来证明基于编程的文化人工智能方法的实用性。该系统提供透明的解释,既揭示计算推理又揭示潜在的文化背景,强调可解释性和文化敏感性,支持教育和文化保护。
论文及项目相关链接
PDF 7 pages, 3 figures, 3 tables, FAIR 2025 conference
Summary
本文介绍了VietMEAgent,这是一个为越南文化理解设计的多模态可解释框架。它通过整合文化对象检测主干与结构化程序生成层,建立一个紧密耦合的答案预测和解释管道。使用越南文化实体的精选知识库作为背景信息的明确来源,同时结合基于注意力的视觉证据和结构化、可读的文本解释,构成双模态解释模块。此外,从公共仓库构建越南文化VQA数据集,演示基于编程的文化AI的实用性。该系统提供透明的解释,既揭示计算原理又展现文化背景,支持教育与文化保育,并强调可解释性和文化敏感性。
Key Takeaways
1.VietMEAgent框架旨在提高VQA系统对越南文化内容的理解。
2.该框架结合文化对象检测与结构化程序生成,实现答案预测与解释的紧密结合。
3.使用越南文化实体的知识库作为背景信息的来源。
4.双模态解释模块结合视觉证据和文本解释,提高系统的可解释性。
5.研究构建了越南文化VQA数据集,展示编程方法在文化AI中的实用性。
6.该系统提供的解释既涉及计算原理又包含文化背景,强调教育及文化保育功能。
点此查看论文截图
Fairness-Aware Few-Shot Learning for Audio-Visual Stress Detection
Authors:Anushka Sanjay Shelke, Aditya Sneh, Arya Adyasha, Haroon R. Lone
Fairness in AI-driven stress detection is critical for equitable mental healthcare, yet existing models frequently exhibit gender bias, particularly in data-scarce scenarios. To address this, we propose FairM2S, a fairness-aware meta-learning framework for stress detection leveraging audio-visual data. FairM2S integrates Equalized Odds constraints during both meta-training and adaptation phases, employing adversarial gradient masking and fairness-constrained meta-updates to effectively mitigate bias. Evaluated against five state-of-the-art baselines, FairM2S achieves 78.1% accuracy while reducing the Equal Opportunity to 0.06, demonstrating substantial fairness gains. We also release SAVSD, a smartphone-captured dataset with gender annotations, designed to support fairness research in low-resource, real-world contexts. Together, these contributions position FairM2S as a state-of-the-art approach for equitable and scalable few-shot stress detection in mental health AI. We release our dataset and FairM2S publicly with this paper.
人工智能驱动的压力检测中的公平性对于公平的心理健康护理至关重要,然而现有模型经常表现出性别偏见,特别是在数据稀缺的情况下。为了解决这个问题,我们提出了FairM2S,一个利用视听数据进行压力检测的公平性感知元学习框架。FairM2S在元训练和适应阶段都整合了均衡几率(Equalized Odds)约束,采用对抗性梯度掩蔽和带公平约束的元更新来有效减轻偏见。与五种最先进的基线相比,FairM2S达到了78.1%的准确率,同时将机会均等(Equal Opportunity)差距降低到0.06,显示出显著的公平性提升。我们还发布了SAVSD,一个带有性别标注、由智能手机采集的数据集,旨在支持低资源、真实场景下的公平性研究。这些贡献共同使FairM2S成为心理健康AI中公平且可扩展的少样本压力检测的最先进方法。我们随本文公开发布数据集和FairM2S。
论文及项目相关链接
Summary
本文关注人工智能驱动的压力检测中的公平性问题,特别是在数据稀缺情境下的性别偏见。为此提出了FairM2S,一个利用视听数据的公平性感知元学习框架。FairM2S在元训练和适应阶段都整合了均衡几率约束,并采用对抗性梯度掩蔽和带公平约束的元更新来有效减轻偏见。与五种最先进基线相比,FairM2S达到了78.1%的准确率,同时将机会均等差距降至0.06,公平性显著提高。此外,还发布了SAVSD数据集,包含性别标注,用于支持低资源、现实环境下的公平性研究。总体而言,FairM2S是一种实现心理健康AI中公平且可扩展的少样本压力检测的先进方法。
Key Takeaways
- 人工智能在压力检测中面临公平性问题,特别是在数据稀缺情境下的性别偏见问题。
- FairM2S是一个利用视听数据的公平意识元学习框架,旨在解决这一问题。
- FairM2S在元训练和适应阶段都整合了均衡几率(Equalized Odds)约束,以减轻偏见。
- FairM2S达到了78.1%的准确率,同时将机会均等差距降至0.06,公平性显著提高。
- 发布了SAVSD数据集,包含性别注释,用于支持低资源、现实环境下的公平性研究和数据训练。
- FairM2S是实现心理健康AI中公平且可扩展的少样本压力检测的一种先进方法。
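下面给出评估阶段计算均衡几率相关公平性差距的极简示意:按性别分组分别计算真阳性率与假阳性率,并报告组间差距(前者即 Equal Opportunity 差距)。论文是把该约束并入元训练目标,此处仅演示度量本身,数据为人为构造。

```python
import numpy as np

def equalized_odds_gaps(y_true, y_pred, group):
    """返回两组之间的真阳性率之差(Equal Opportunity 差距)与假阳性率之差。"""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tprs, fprs = [], []
    for g in (0, 1):
        mask = group == g
        pos = (y_true == 1) & mask
        neg = (y_true == 0) & mask
        tprs.append(y_pred[pos].mean() if pos.any() else 0.0)
        fprs.append(y_pred[neg].mean() if neg.any() else 0.0)
    return float(abs(tprs[0] - tprs[1])), float(abs(fprs[0] - fprs[1]))

# 用法示意:group 为性别标注(0/1)
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
print(equalized_odds_gaps(y_true, y_pred, group))   # (EO 差距, FPR 差距)
```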
点此查看论文截图
Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding
Authors:Jingtian Ma, Jingyuan Wang, Wayne Xin Zhao, Guoping Liu, Xiang Wen
Nowadays, navigation and ride-sharing apps have collected numerous images with spatio-temporal data. A core technology for analyzing such images, associated with spatiotemporal information, is Traffic Scene Understanding (TSU), which aims to provide a comprehensive description of the traffic scene. Unlike traditional spatio-temporal data analysis tasks, the dependence on both spatio-temporal and visual-textual data introduces distinct challenges to TSU task. However, recent research often treats TSU as a common image understanding task, ignoring the spatio-temporal information and overlooking the interrelations between different aspects of the traffic scene. To address these issues, we propose a novel SpatioTemporal Enhanced Model based on CILP (ST-CLIP) for TSU. Our model uses the classic vision-language model, CLIP, as the backbone, and designs a Spatio-temporal Context Aware Multiaspect Prompt (SCAMP) learning method to incorporate spatiotemporal information into TSU. The prompt learning method consists of two components: A dynamic spatio-temporal context representation module that extracts representation vectors of spatio-temporal data for each traffic scene image, and a bi-level ST-aware multi-aspect prompt learning module that integrates the ST-context representation vectors into word embeddings of prompts for the CLIP model. The second module also extracts low-level visual features and image-wise high-level semantic features to exploit interactive relations among different aspects of traffic scenes. To the best of our knowledge, this is the first attempt to integrate spatio-temporal information into visionlanguage models to facilitate TSU task. Experiments on two realworld datasets demonstrate superior performance in the complex scene understanding scenarios with a few-shot learning strategy.
如今,导航和拼车应用程序已经收集了大量带有时空数据的图像。分析这些与时空信息相关的图像的核心技术是交通场景理解(TSU),旨在提供对交通场景的综合描述。与传统的时空数据分析任务不同,TSU任务既依赖于时空数据,又依赖于视觉文本数据,这带来了独特的挑战。然而,最近的研究往往将TSU视为常见的图像理解任务,忽略了时空信息,并忽视了交通场景不同方面的相互关系。为了解决这些问题,我们提出了一种基于CILP的新型时空增强模型(ST-CLIP)用于TSU。我们的模型以经典的视觉语言模型CLIP作为骨干网,并设计了一种时空上下文感知多方面提示(SCAMP)学习方法,将时空信息融入TSU中。提示学习方法由两个组件组成:一个动态时空上下文表示模块,用于提取每个交通场景图像的时空数据表示向量;一个两级ST感知多方面提示学习模块,将ST上下文表示向量集成到CLIP模型的提示词嵌入中。第二个模块还提取低级视觉特征和图像级高级语义特征,以利用交通场景不同方面的交互关系。据我们所知,这是首次尝试将时空信息整合到视觉语言模型中,以促进TSU任务。在两个真实世界数据集上的实验表明,在复杂场景理解场景中采用小样本学习策略时,其性能优越。
论文及项目相关链接
Summary
基于时空增强模型的交通场景理解研究旨在通过结合时空信息和视觉语言模型(CLIP)来理解交通场景。该研究通过设计时空上下文感知多方面提示学习方法(SCAMP),将时空数据融入交通场景理解(TSU)。实验证明,在复杂场景理解场景中,采用少量学习策略的该模型表现出卓越性能。
Key Takeaways
- 交通场景理解(TSU)是分析带有时空信息的图像的核心技术,旨在全面描述交通场景。
- 现有研究常将TSU视为普通图像理解任务,忽略了时空信息以及交通场景中不同方面的相互关系。
- 提出了一种基于CLIP的时空增强模型(ST-CLIP)来解决这一问题。
- ST-CLIP模型使用经典的视觉语言模型CLIP作为骨干,并设计了SCAMP学习方法来融入时空信息。
- SCAMP学习方法包括两个模块:动态时空上下文表示模块和双级ST感知多方面提示学习模块。
- 该方法首次尝试将时空信息融入视觉语言模型以促进TSU任务。
- 在两个真实世界数据集上的实验表明,在少量学习策略的复杂场景理解中,该方法具有卓越性能。