⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ HuggingFace 免费体验
2025-11-20 更新
Mitigating Label Length Bias in Large Language Models
Authors:Mario Sanz-Guerrero, Katharina von der Wense
Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
大型语言模型(LLM)是强大的零样本和少样本学习者。然而,在预测一组候选选项时,LLM会受到标签偏见的影响,而现有的校准方法忽视了由多令牌类标签产生的偏见。我们解决了一个我们称为标签长度偏见的问题,即不同长度的标签即使经过标准长度归一化后仍被不一致地对待。为了缓解这一问题,我们提出了归一化上下文校准(NCC),这是一种有效的方法,可以在全标签级别对预测进行归一化和校准。NCC在多个数据集和模型上实现了对先前方法的统计学上的显著改进,F1得分提高了高达10%。此外,NCC将偏见缓解扩展到更广泛的任务,如多项选择题回答。我们的分析表明,结合上下文学习,NCC对少样本示例选择不太敏感,需要更少的示例即可实现有竞争力的性能,并产生更可靠的置信度估计。这些发现强调了减轻全标签偏见的重要性,以提高基于LLM的方法的性能和稳健性,特别是在类标签自然包含多个令牌的实际应用中。
论文及项目相关链接
PDF Accepted to AACL 2025 (Main)
Summary
大型语言模型(LLM)在零样本和少样本学习中表现出强大的能力,但在预测候选选项集时,易受标签偏差影响。现有校准方法忽略了由多令牌类别标签产生的偏差。本文关注标签长度偏差问题:即使在进行标准长度归一化后,不同长度的标签处理仍不一致。为解决此问题,提出了归一化上下文校准(NCC)方法,该方法在全标签级别对预测进行归一化和校准。NCC在多个数据集和模型上实现了对先前方法的统计显著改进,F1得分提高高达10%。此外,NCC将偏差缓解扩展到更广泛的任务,如多项选择题回答。分析表明,结合上下文学习,NCC对少样本示例选择不那么敏感,达到有竞争力的性能所需的示例更少,并产生更可靠的置信度估计。这些发现强调减轻全标签偏差的重要性,有助于提高LLM方法性能和稳健性,特别是在类别标签自然包含多个令牌的实际应用中。
Key Takeaways
- 大型语言模型(LLMs)在预测候选选项时易受标签偏差影响。
- 现有校准方法忽略由多令牌类别标签产生的偏差。
- 提出了归一化上下文校准(NCC)方法以解决标签长度偏差问题。
- NCC在多个数据集和模型上实现对先前方法的改进,F1得分提高高达10%。
- NCC将偏差缓解扩展到多项选择题回答等更广泛的任务。
- 结合上下文学习,NCC对少样本示例选择更加稳健,并产生更可靠信心估计。
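下面给出一个极简的示意代码(非论文官方实现;其中的对数概率来源与以"N/A"作为无内容输入的做法均为假设),用于说明"全标签级别的归一化与上下文校准"的基本思路:先对每个候选标签计算长度归一化的对数概率,再用无内容输入下的全标签概率做校准。

```python
import numpy as np

def label_logprob(token_logprobs):
    """对一个候选标签的各 token 对数概率求和,并按 token 数做长度归一化(假设的简化做法)。"""
    return sum(token_logprobs) / len(token_logprobs)

def ncc_predict(cond_logprobs, free_logprobs):
    """
    cond_logprobs: {label: [该标签各 token 在真实输入下的对数概率]}
    free_logprobs: {label: [该标签各 token 在无内容输入(如 "N/A")下的对数概率]}
    返回校准后的预测标签:用无内容输入下的全标签概率对真实输入下的概率做除法校准。
    """
    scores = {}
    for label in cond_logprobs:
        p_cond = np.exp(label_logprob(cond_logprobs[label]))   # 长度归一化后的条件概率
        p_free = np.exp(label_logprob(free_logprobs[label]))   # 内容无关先验,刻画标签偏差
        scores[label] = p_cond / max(p_free, 1e-12)             # 上下文校准
    total = sum(scores.values())                                 # 归一化为概率分布,便于得到置信度估计
    probs = {k: v / total for k, v in scores.items()}
    return max(probs, key=probs.get), probs

# 玩具示例:两个长度不同的多 token 标签
cond = {"very negative": [-0.8, -0.5], "positive": [-1.2]}
free = {"very negative": [-1.5, -1.0], "positive": [-0.6]}
print(ncc_predict(cond, free))
```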
点此查看论文截图
Free Lunch to Meet the Gap: Intermediate Domain Reconstruction for Cross-Domain Few-Shot Learning
Authors:Tong Zhang, Yifan Zhao, Liangyu Wang, Jia Li
Cross-Domain Few-Shot Learning (CDFSL) endeavors to transfer generalized knowledge from the source domain to target domains using only a minimal amount of training data, which faces a triplet of learning challenges in the meantime, i.e., semantic disjoint, large domain discrepancy, and data scarcity. Different from predominant CDFSL works focused on generalized representations, we make novel attempts to construct Intermediate Domain Proxies (IDP) with source feature embeddings as the codebook and reconstruct the target domain feature with this learned codebook. We then conduct an empirical study to explore the intrinsic attributes from perspectives of visual styles and semantic contents in intermediate domain proxies. Reaping benefits from these attributes of intermediate domains, we develop a fast domain alignment method to use these proxies as learning guidance for target domain feature transformation. With the collaborative learning of intermediate domain reconstruction and target feature transformation, our proposed model is able to surpass the state-of-the-art models by a margin on 8 cross-domain few-shot learning benchmarks.
跨域小样本学习(CDFSL)致力于仅使用少量训练数据,从源域转移通用知识到目标域,与此同时,它面临着三个学习挑战,即语义不相关、域差异大和数据稀缺。与主流的CDFSL工作集中在通用表示上不同,我们新颖地尝试构建以源特征嵌入作为代码本的中介域代理(IDP),并用这个学习到的代码本重建目标域特征。然后,我们从中介域代理的视觉风格和语义内容等角度进行实证研究,探索其内在属性。受益于中介域的这些属性,我们开发了一种快速域对齐方法,利用这些代理作为目标域特征转换的学习指南。通过中介域重建和目标特征转换的协同学习,我们提出的模型能够在8个跨域小样本学习基准测试上超越最新模型。
论文及项目相关链接
PDF Accepted to IJCV 2025
Summary
跨域小样本学习(CDFSL)旨在利用少量训练数据,从源域转移通用知识到目标域。面对语义分离、领域差异大和缺少数据等学习挑战,我们创新地构建了中间域代理(IDP),使用源特征嵌入作为代码本,重建目标域特征。通过实证研究,我们探索了中间域代理的内在属性,如视觉风格和语义内容。利用这些中间域的属性优势,我们开发了一种快速域对齐方法,将这些代理作为目标域特征转换的学习指南。通过中间域重建和目标特征转换的协同学习,我们提出的模型在8个跨域小样本学习基准测试中超过了最先进的模型。
Key Takeaways
- 跨域小样本学习(CDFSL)专注于从源域转移知识到目标域,使用极少的训练数据。
- 面临语义分离、领域差异大和缺少数据的挑战。
- 创新构建中间域代理(IDP),利用源特征嵌入作为代码本。
- 通过实证研究,探索了中间域代理的内在属性,如视觉风格和语义内容。
- 利用中间域的属性优势,开发了一种快速域对齐方法。
- 将中间域代理作为目标域特征转换的学习指南。
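下面用一个简化示意说明"以源域特征嵌入为码本、重建目标域特征"的核心想法(非论文原实现;基于余弦相似度做软重建的方式为假设):

```python
import torch
import torch.nn.functional as F

def reconstruct_with_codebook(target_feats, codebook, temperature=0.1):
    """
    target_feats: [N, D] 目标域样本特征
    codebook:     [K, D] 源域特征嵌入构成的码本(中间域代理)
    返回用码本加权组合得到的重建特征,可视作落在"中间域"上的表示。
    """
    t = F.normalize(target_feats, dim=-1)
    c = F.normalize(codebook, dim=-1)
    sim = t @ c.t() / temperature            # [N, K] 相似度
    weights = sim.softmax(dim=-1)            # 对码本条目的软分配
    recon = weights @ codebook               # [N, D] 重建后的中间域特征
    return recon, weights

# 玩具示例
target = torch.randn(4, 64)
codebook = torch.randn(256, 64)
recon, w = reconstruct_with_codebook(target, codebook)
print(recon.shape, w.shape)
```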
点此查看论文截图
Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation
Authors:Weimin Bai, Yubo Li, Weijian Luo, Zeqiang Lai, Yequan Wang, Wenzheng Chen, He Sun
Text-to-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they struggle with coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM’s Yes or No log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) As a reward objective for optimization-based pipelines, VLM3D significantly outperforms existing methods on standard benchmarks. (2) As a test-time guidance module for feed-forward pipelines, it actively steers the iterative sampling process of SOTA native 3D models to correct severe spatial errors. VLM3D establishes a principled and generalizable path to inject the VLM’s rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.
文本到3D生成技术发展迅速,然而,包括基于优化和前馈架构在内的最先进模型仍面临两个基本局限。首先,它们的语义对齐较为粗糙,往往无法捕捉提示中的细粒度细节。其次,它们缺乏稳健的3D空间理解能力,导致几何不一致以及在部件组装和空间关系方面的灾难性失败。为了解决这些挑战,我们提出了VLM3D,这是一个通用框架,它将大型视觉语言模型(VLM)重新用作强大且可微分的语义与空间批判者。我们的核心贡献是源于VLM的"是"或"否"对数几率的双查询批判信号,它同时评估语义保真度和几何一致性。我们证明了这种指导信号在两种不同范式下的通用性:(1)作为基于优化的管道的奖励目标时,VLM3D在标准基准测试上的表现显著优于现有方法;(2)作为前馈管道的测试时引导模块时,它主动引导最先进的原生3D模型的迭代采样过程以纠正严重的空间错误。VLM3D建立了一条有原则且可推广的途径,将VLM对语义和空间的丰富、基于语言的理解注入到多样化的3D生成管道中。
论文及项目相关链接
Summary
本文指出当前文本到三维生成技术的两大局限:语义对齐粗糙和空间理解不足。为此,提出了VLM3D框架,利用大型视觉语言模型作为强大的可微分语义和空间批判者。其核心贡献是通过VLM的Yes或No对数几率(log-odds)派生出双查询批判信号,评估语义保真度和几何一致性。该指导信号在基于优化的管道和前馈管道两个不同范式中均表现出通用性。在基于优化的管道中作为奖励目标,VLM3D在标准基准测试中显著优于现有方法。在前馈管道中作为测试时引导模块,它主动引导最先进的原生三维模型的迭代采样过程,纠正严重的空间错误。VLM3D为将丰富的、基于语言的语义和空间理解注入多种三维生成管道提供了有原则且可推广的途径。
Key Takeaways
- 当前文本到三维生成技术面临两大挑战:语义对齐粗糙和空间理解不足。
- VLM3D框架利用大型视觉语言模型作为强大的可微分语义和空间批判者来解决这些挑战。
- VLM3D通过VLM的Yes或No对数几率(log-odds)派生出双查询批判信号,评估语义保真度和几何一致性。
- VLM3D在基于优化的管道和前馈管道两个不同范式中都表现出通用性。
- 在优化管道中,VLM3D作为奖励目标显著优于现有方法。
- 在前馈管道中,VLM3D能主动引导最先进的原生三维模型的迭代采样过程,纠正空间错误。
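下面给出"由VLM的Yes/No对数几率构造双查询批判信号"的一个极简示意(非官方实现;Yes/No两个token的logit来源为假设,实际应来自视觉语言模型对渲染图与问题的输出):

```python
import torch

def log_odds(logit_yes, logit_no):
    """Yes/No 两个 token 的对数几率:log P(Yes) - log P(No)。"""
    pair = torch.stack([logit_yes, logit_no])
    logp = pair.log_softmax(dim=0)
    return logp[0] - logp[1]

def dual_query_reward(yes_no_semantic, yes_no_spatial, w_sem=1.0, w_spa=1.0):
    """
    yes_no_semantic / yes_no_spatial: (logit_yes, logit_no),
    分别对应"渲染结果是否符合文本描述"与"部件几何/空间关系是否合理"两类问题(假设)。
    返回可作为优化奖励或采样引导的标量分数。
    """
    r_sem = log_odds(*yes_no_semantic)
    r_spa = log_odds(*yes_no_spatial)
    return w_sem * r_sem + w_spa * r_spa

# 玩具示例
print(dual_query_reward((torch.tensor(2.1), torch.tensor(0.3)),
                        (torch.tensor(-0.5), torch.tensor(1.2))))
```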
点此查看论文截图
Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning
Authors:Rui Liu, Yuan Zhao, Zhenqi Jia
The automatic movie dubbing model generates vivid speech from given scripts, replicating a speaker’s timbre from a brief timbre prompt while ensuring lip-sync with the silent video. Existing approaches simulate a simplified workflow where actors dub directly without preparation, overlooking the critical director-actor interaction. In contrast, authentic workflows involve a dynamic collaboration: directors actively engage with actors, guiding them to internalize the context cues, specifically emotion, before performance. To address this issue, we propose a new Retrieve-Augmented Director-Actor Interaction Learning scheme to achieve authentic movie dubbing, termed Authentic-Dubber, which contains three novel mechanisms: (1) We construct a multimodal Reference Footage library to simulate the learning footage provided by directors. Note that we integrate Large Language Models (LLMs) to achieve deep comprehension of emotional representations across multimodal signals. (2) To emulate how actors efficiently and comprehensively internalize director-provided footage during dubbing, we propose an Emotion-Similarity-based Retrieval-Augmentation strategy. This strategy retrieves the most relevant multimodal information that aligns with the target silent video. (3) We develop a Progressive Graph-based speech generation approach that incrementally incorporates the retrieved multimodal emotional knowledge, thereby simulating the actor’s final dubbing process. The above mechanisms enable the Authentic-Dubber to faithfully replicate the authentic dubbing workflow, achieving comprehensive improvements in emotional expressiveness. Both subjective and objective evaluations on the V2C Animation benchmark dataset validate the effectiveness. The code and demos are available at https://github.com/AI-S2-Lab/Authentic-Dubber.
自动电影配音模型能够根据给定的脚本生成生动逼真的语音,从简短的音色提示中复制演讲者的音色,同时确保与无声视频的唇同步。现有方法模拟了一个简化的工作流程,即演员直接进行配音而无需准备,从而忽视了关键的导演与演员之间的互动。相比之下,真实的工作流程涉及动态的协作:导演积极与演员合作,指导他们理解上下文线索,特别是在表演前的情绪。为了解决这一问题,我们提出了一种新的基于检索的导演-演员交互学习方案,以实现真实的电影配音,称为“真实配音器”,其中包含三种新颖机制:(1)我们构建了一个多模态参考素材库,以模拟导演提供的学习素材。值得注意的是,我们整合了大型语言模型(LLM),以实现跨多模态信号的深层情感表示理解。(2)为了模拟演员在配音过程中如何有效且全面地内化导演提供的素材,我们提出了一种基于情感相似性的检索增强策略。该策略检索与目标无声视频最相关的多模态信息。(3)我们开发了一种基于渐进图谱的语音生成方法,该方法会逐步融入检索到的多模态情感知识,从而模拟演员的最终配音过程。上述机制使真实配音器能够忠实地复制真实的配音工作流程,在情感表达方面取得了全面的改进。在V2C Animation基准数据集上的主观和客观评估都验证了其有效性。代码和演示可在https://github.com/AI-S2-Lab/Authentic-Dubber上找到。
论文及项目相关链接
PDF Accepted by AAAI 2026
Summary
本文介绍了一种名为Authentic-Dubber的自动电影配音模型,该模型通过构建多模态参考素材库、采用基于情感相似度的检索增强策略,以及渐进图基语音生成方法,模拟真实电影配音工作流程,提高情感表达的真实性。
Key Takeaways
- Authentic-Dubber模型生成生动语音,从给定剧本复制演讲者音色,并确保与无声视频同步。
- 现有方法忽略了导演与演员之间的交互作用,而真实的配音工作流程是动态的,导演与演员之间有深入的协作。
- Authentic-Dubber提出了一个Retrieve-Augmented Director-Actor Interaction Learning方案来模拟真实的电影配音。
- 该模型包含三个新机制:构建多模态参考素材库、采用基于情感相似度的检索增强策略,以及渐进图基语音生成方法。
- 多模态参考素材库通过整合大型语言模型实现跨模态情感表示的深入理解。
- 基于情感相似度的检索增强策略可有效地模仿演员如何全面内化导演提供的素材进行配音。
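下面是"基于情感相似度的检索增强"思路的一个简化示意(非论文实现;用余弦相似度在参考素材库中检索与目标片段情感最接近的多模态条目,情感嵌入的来源为假设):

```python
import numpy as np

def retrieve_by_emotion(query_emb, library_embs, library_items, top_k=3):
    """
    query_emb:     目标无声视频片段的情感嵌入,形状 [D]
    library_embs:  参考素材库中各条目的情感嵌入,形状 [N, D]
    library_items: 与嵌入一一对应的多模态素材(文本/音频/视频片段等)
    返回情感最相似的 top_k 条素材,供后续语音生成逐步融入。
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-12)
    lib = library_embs / (np.linalg.norm(library_embs, axis=1, keepdims=True) + 1e-12)
    sims = lib @ q                       # 余弦相似度
    idx = np.argsort(-sims)[:top_k]      # 取最相似的 top_k 条
    return [(library_items[i], float(sims[i])) for i in idx]

# 玩具示例
lib = np.random.randn(10, 16)
items = [f"footage_{i}" for i in range(10)]
print(retrieve_by_emotion(np.random.randn(16), lib, items, top_k=2))
```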
点此查看论文截图
EBind: a practical approach to space binding
Authors:Jim Broadbent, Felix Cohen, Frederik Hvilshøj, Eric Landau, Eren Sasoglu
We simplify space binding by focusing on two core components, a single encoder per modality and high-quality data; enabling training state-of-the-art models on a single GPU in a few hours as opposed to multiple days. We present EBind, an Easy, data-centric, and parameter-efficient method to Bind the embedding spaces of multiple contrastive models. We demonstrate that a simple 1.8B-parameter image-text-video-audio-3D model can outperform models 4 to 17x the size. The key to achieving this is a carefully curated dataset of three complementary data sources: i) 6.7M fully-automated multimodal quintuples sourced via SOTA retrieval models, ii) 1M diverse, semi-automated triples annotated by humans as negative, partial, or positive matches, and iii) 3.4M pre-existing captioned data items. We use 13 different evaluations to demonstrate the value of each data source. Due to limitations with existing benchmarks, we further introduce the first high-quality, consensus-annotated zero-shot classification benchmark between audio and PCs. In contrast to related work, we will open-source our code, model weights, and datasets.
我们通过聚焦两个核心组件来简化空间绑定:每个模态只用一个编码器,以及高质量数据;这使得在单个GPU上训练最先进的模型只需几个小时,而不是数天。我们提出EBind,一种简单(Easy)、以数据为中心且参数高效的方法,用于绑定(Bind)多个对比模型的嵌入空间。我们证明,一个简单的1.8B参数的图像-文本-视频-音频-3D模型可以超越规模为其4至17倍的模型。实现这一点的关键是精心构建的三类互补数据源:(i)670万条通过最先进检索模型全自动获取的多模态五元组;(ii)100万条多样化、半自动构建并由人工标注为负匹配、部分匹配或正匹配的三元组;(iii)340万条已有的带描述文本的数据条目。我们使用13种不同的评估来展示每类数据源的价值。鉴于现有基准测试的局限性,我们进一步推出了首个高质量、共识标注的音频与点云(PC)之间的零样本分类基准。与相关工作不同,我们将开源代码、模型权重和数据集。
论文及项目相关链接
Summary
本文介绍了EBind方法,这是一种简单、以数据为中心和参数高效的绑定多个对比模型嵌入空间的方法。通过专注于两个核心组件:每模态的单编码器和高质量数据,使得在单个GPU上训练最先进的模型的时间从数天缩短到数小时。实验表明,一个简单的1.8B参数的图像-文本-视频-音频-3D模型可以超越规模为其4到17倍的模型。关键的成功因素在于使用了三种互补数据源:全自动的多模态五元组、人工标注为负面、部分或正面匹配的三元组,以及预存的带说明的数据项。每个数据源的价值都通过多项评估来证明。同时,本文还推出了首个高质量、共识标注的零样本分类基准测试,用于音频与点云(PC)之间的对比。作者将开源代码、模型权重和数据集。
Key Takeaways
- EBind方法聚焦于每模态的单编码器和高质量数据,简化了多模态空间绑定的复杂性。
- 通过EBind,训练先进的模型可以在数小时内完成,而不是数天,大大提高了效率。
- 一个相对较小的模型(1.8B参数)能够超越规模更大的模型(4至17倍)。
- 成功的关键在于使用三种互补数据源,包括全自动的多模态数据、人类标注的三元组和预存的带说明的数据项。
- 多种评估方法证明了每个数据源的价值。
- 推出了首个高质量、共识标注的零样本分类基准测试,用于音频与点云(PC)之间的对比。
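EBind"绑定多个对比模型嵌入空间"的思路可以用一个对称InfoNCE对齐损失粗略示意(非官方实现;以文本嵌入为锚、为其他模态训练轻量投影头的做法为假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BindProjector(nn.Module):
    """为某一模态学习一个轻量投影头,把冻结编码器的输出映射到共享空间(参数量很小)。"""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

def symmetric_infonce(anchor, other, temperature=0.07):
    """anchor/other: [B, D] 成对的跨模态嵌入;标准的双向对比损失。"""
    logits = anchor @ other.t() / temperature
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# 玩具示例:把音频嵌入绑定到文本嵌入空间
text_emb = F.normalize(torch.randn(8, 512), dim=-1)   # 来自冻结的文本编码器(假设)
audio_raw = torch.randn(8, 768)                        # 来自冻结的音频编码器(假设)
projector = BindProjector(768, 512)
loss = symmetric_infonce(text_emb, projector(audio_raw))
loss.backward()
print(float(loss))
```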
点此查看论文截图
KTester: Leveraging Domain and Testing Knowledge for More Effective LLM-based Test Generation
Authors:Anji Li, Mingwei Liu, Zhenxi Chen, Zheng Pei, Zike Li, Dekun Dai, Yanlin Wang, Zibin Zheng
Automated unit test generation using large language models (LLMs) holds great promise but often struggles with generating tests that are both correct and maintainable in real-world projects. This paper presents KTester, a novel framework that integrates project-specific knowledge and testing domain knowledge to enhance LLM-based test generation. Our approach first extracts project structure and usage knowledge through static analysis, which provides rich context for the model. It then employs a testing-domain-knowledge-guided separation of test case design and test method generation, combined with a multi-perspective prompting strategy that guides the LLM to consider diverse testing heuristics. The generated tests follow structured templates, improving clarity and maintainability. We evaluate KTester on multiple open-source projects, comparing it against state-of-the-art LLM-based baselines using automatic correctness and coverage metrics, as well as a human study assessing readability and maintainability. Results demonstrate that KTester significantly outperforms existing methods across six key metrics, improving execution pass rate by 5.69% and line coverage by 8.83% over the strongest baseline, while requiring less time and generating fewer test cases. Human evaluators also rate the tests produced by KTester significantly higher in terms of correctness, readability, and maintainability, confirming the practical advantages of our knowledge-driven framework.
使用大型语言模型(LLM)进行自动化单元测试生成具有巨大的潜力,但在生成既正确又可在实际项目中维护的测试时常常遇到困难。本文提出了KTester,一个结合特定项目知识和测试领域知识的新型框架,以增强基于LLM的测试生成。我们的方法首先通过静态分析提取项目结构和使用知识,为模型提供丰富的上下文。然后,它采用测试领域知识指导的测试案例设计和测试方法生成的分离,结合多角度提示策略,引导LLM考虑各种测试启发式方法。生成的测试遵循结构化模板,提高了清晰度和可维护性。我们在多个开源项目上评估了KTester,使用自动化的正确性与覆盖率指标将其与最新的基于LLM的基线进行比较,并通过人类研究评估可读性和可维护性。结果表明,KTester在六个关键指标上显著优于现有方法,相较最强基线将执行通过率提高了5.69%、行覆盖率提高了8.83%,同时需要的时间更少、生成的测试用例更少。人类评估者也对KTester产生的测试在正确性、可读性和可维护性方面给予了更高的评价,这证实了我们的知识驱动框架的实际优势。
论文及项目相关链接
PDF 13 pages, 11 figures
Summary
基于大型语言模型(LLM)的自动化单元测试生成具有巨大的潜力,但在现实世界项目中生成既正确又易于维护的测试方面存在挑战。本文提出KTester框架,该框架结合项目特定知识和测试领域知识以增强LLM的测试生成能力。KTester首先通过静态分析提取项目结构和使用知识,为模型提供丰富的上下文。然后采用测试领域知识指导的测试案例设计和测试方法生成分离,结合多角度提示策略,引导LLM考虑多种测试启发式方法。生成的测试遵循结构化模板,提高清晰度和可维护性。在多个开源项目上的评估结果表明,KTester在六个关键指标上显著优于现有方法,执行通过率提高5.69%,行覆盖率提高8.83%,同时所需时间和生成的测试用例更少。人类评估者也对KTester生成的测试在正确性、可读性和可维护性方面给予了更高的评价,证实了我们的知识驱动框架的实际优势。
Key Takeaways
- KTester是一个结合项目特定知识和测试领域知识的框架,用于增强LLM在测试生成方面的能力。
- KTester通过静态分析提取项目结构和知识,为LLM提供丰富的上下文。
- 该框架采用测试领域知识指导的测试案例设计和测试方法生成分离。
- 多角度提示策略引导LLM考虑多种测试启发式方法。
- 生成的测试遵循结构化模板,提高清晰度和可维护性。
- 在多个开源项目上的评估表明,KTester在多个关键指标上优于现有方法。
点此查看论文截图
LLM-Aligned Geographic Item Tokenization for Local-Life Recommendation
Authors:Hao Jiang, Guoquan Wang, Donglin Zhou, Sheng Yu, Yang Zeng, Wencong Zeng, Kun Gai, Guorui Zhou
Recent advances in Large Language Models (LLMs) have enhanced text-based recommendation by enriching traditional ID-based methods with semantic generalization capabilities. Text-based methods typically encode item textual information via prompt design and generate discrete semantic IDs through item tokenization. However, in domain-specific tasks such as local-life services, simply injecting location information into prompts fails to capture fine-grained spatial characteristics and real-world distance awareness among items. To address this, we propose LGSID, an LLM-Aligned Geographic Item Tokenization Framework for Local-life Recommendation. This framework consists of two key components: (1) RL-based Geographic LLM Alignment, and (2) Hierarchical Geographic Item Tokenization. In the RL-based alignment module, we initially train a list-wise reward model to capture real-world spatial relationships among items. We then introduce a novel G-DPO algorithm that uses pre-trained reward model to inject generalized spatial knowledge and collaborative signals into LLMs while preserving their semantic understanding. Furthermore, we propose a hierarchical geographic item tokenization strategy, where primary tokens are derived from discrete spatial and content attributes, and residual tokens are refined using the aligned LLM’s geographic representation vectors. Extensive experiments on real-world Kuaishou industry datasets show that LGSID consistently outperforms state-of-the-art discriminative and generative recommendation models. Ablation studies, visualizations, and case studies further validate its effectiveness.
近期大型语言模型(LLM)的进步通过为传统基于ID的方法赋予语义泛化能力,从而增强了基于文本的推荐。基于文本的方法通常通过提示设计对物品文本信息进行编码,并通过物品标记化生成离散语义ID。然而,在特定领域任务(如本地生活服务)中,仅通过提示注入位置信息无法捕获精细的空间特征和物品之间的现实世界距离感知。为解决此问题,我们提出了LGSID,这是一个针对本地生活推荐的LLM对齐地理项目标记化框架。该框架包含两个关键组件:(1)基于强化学习的地理LLM对齐,(2)分层地理项目标记化。在基于强化学习的对齐模块中,我们首先训练一个列表级奖励模型来捕获物品之间现实世界的空间关系。然后,我们引入了一种新型的G-DPO算法,该算法使用预训练的奖励模型将泛化的空间知识和协同信号注入LLM中,同时保持其语义理解。此外,我们提出了一种分层地理项目标记化策略,其中主要标记来自离散的空间和内容属性,剩余标记则使用对齐的LLM地理表示向量进行细化。在快手行业真实数据集上的大量实验表明,LGSID始终优于最新的判别和生成推荐模型。消融研究、可视化和案例研究进一步验证了其有效性。
论文及项目相关链接
Summary
大型语言模型(LLM)的最新进展通过语义泛化能力丰富了传统的基于ID的方法,从而提高了基于文本的推荐。针对本地生活服务等领域特定任务,提出LGSID框架,包括基于强化学习的地理LLM对齐和分层地理项目令牌化。实验表明,LGSID在快手行业数据集上始终优于最新的判别和生成推荐模型。
Key Takeaways
- LLM的进展通过语义泛化能力增强了文本推荐。
- 文本推荐方法通过编码项目文本信息和生成离散语义ID进行。
- 在特定领域任务中,仅注入位置信息到提示中无法捕捉精细的空间特性和现实世界中的距离感知。
- LGSID框架包括两个关键组件:基于强化学习的地理LLM对齐和分层地理项目令牌化。
- 基于强化学习的对齐模块通过训练列表级奖励模型来捕获项目之间的现实空间关系,并引入G-DPO算法将空间知识和协同信号注入LLM中。
- 分层地理项目令牌化策略包括主要令牌和剩余令牌,前者来源于离散的空间和内容属性,后者通过LLM地理表示向量进行精炼。
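"分层地理项目标记化"可以借助下面的简化示意理解(非论文实现;主标记取自离散化的空间/内容属性,剩余标记用对LLM地理表示向量的多级残差K-means量化来近似,聚类方式与层级数均为假设):

```python
import numpy as np
from sklearn.cluster import KMeans

def primary_token(lat, lng, category, grid=0.01):
    """主标记:由离散化的经纬度网格与内容类别拼接而成(简化示意)。"""
    return f"geo_{int(lat / grid)}_{int(lng / grid)}|cat_{category}"

def residual_tokens(vectors, num_levels=2, codebook_size=8, seed=0):
    """剩余标记:对(已对齐LLM的)地理表示向量做多级残差 K-means 量化。"""
    residual = vectors.copy()
    tokens = []
    for level in range(num_levels):
        km = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed + level).fit(residual)
        tokens.append(km.labels_)                                # 本级离散 token
        residual = residual - km.cluster_centers_[km.labels_]    # 进入下一级的残差
    return np.stack(tokens, axis=1)                              # [N, num_levels]

# 玩具示例
vecs = np.random.randn(100, 32)
print(primary_token(39.9042, 116.4074, "restaurant"))
print(residual_tokens(vecs)[:3])
```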
点此查看论文截图
Few-Shot Precise Event Spotting via Unified Multi-Entity Graph and Distillation
Authors:Zhaoyu Liu, Kan Jiang, Murong Ma, Zhe Hou, Yun Lin, Jin Song Dong
Precise event spotting (PES) aims to recognize fine-grained events at exact moments and has become a key component of sports analytics. This task is particularly challenging due to rapid succession, motion blur, and subtle visual differences. Consequently, most existing methods rely on domain-specific, end-to-end training with large labeled datasets and often struggle in few-shot conditions due to their dependence on pixel- or pose-based inputs alone. However, obtaining large labeled datasets is practically hard. We propose a Unified Multi-Entity Graph Network (UMEG-Net) for few-shot PES. UMEG-Net integrates human skeletons and sport-specific object keypoints into a unified graph and features an efficient spatio-temporal extraction module based on advanced GCN and multi-scale temporal shift. To further enhance performance, we employ multimodal distillation to transfer knowledge from keypoint-based graphs to visual representations. Our approach achieves robust performance with limited labeled data and significantly outperforms baseline models in few-shot settings, providing a scalable and effective solution for few-shot PES. Code is publicly available at https://github.com/LZYAndy/UMEG-Net.
精确事件定位(PES)旨在精确时刻识别细粒度的事件,已成为体育分析的关键组成部分。由于事件快速连续发生、运动模糊和细微的视觉差异,此任务颇具挑战性。因此,大多数现有方法都依赖于特定领域的端到端训练以及大量标记数据集,并且由于它们仅依赖于像素或基于姿势的输入,在小样本条件下往往表现不佳。然而,获取大量标记数据集在实际中是很困难的。我们提出了一种用于小样本PES的统一多实体图网络(UMEG-Net)。UMEG-Net将人体骨骼和运动特定对象关键点集成到一个统一图中,并包含一个基于先进GCN与多尺度时间移位的高效时空特征提取模块。为了进一步提高性能,我们采用多模态蒸馏技术,将基于关键点的图的知识转移到视觉表示中。我们的方法在有限标记数据上表现稳健,在小样本设置中显著优于基准模型,为小样本PES提供了可扩展且有效的解决方案。代码公开在:https://github.com/LZYAndy/UMEG-Net。
论文及项目相关链接
PDF The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)
Summary
本文提出了一种面向少样本精确事件定位(PES)的方法,即统一多实体图网络(UMEG-Net)。该方法结合了人体骨架和运动特定对象关键点,采用基于图卷积网络(GCN)和多尺度时间移位的高效时空提取模块。此外,还采用多模态蒸馏技术从关键点图向视觉表示转移知识。该方法在有限标记数据下表现出稳健性能,并在少样本设置下显著优于基线模型。
Key Takeaways
- PES(精确事件定位)是体育分析中的关键组成部分,旨在在精确时刻识别细粒度事件。
- 由于快速连续动作、运动模糊和细微的视觉差异,此任务具有挑战性。
- 当前方法大多依赖于特定领域的端到端训练和大量标记数据集,但在少样本条件下表现不佳。
- UMEG-Net结合了人体骨架和运动特定对象关键点,采用统一图表示。
- UMEG-Net使用基于GCN和多尺度时间移位的时空提取模块。
- 多模态蒸馏技术用于从关键点图向视觉表示转移知识。
- 方法在有限标记数据下表现稳健,少样本设置下显著优于基线模型。
点此查看论文截图
SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM
Authors:An Yu, Weiheng Lu, Jian Li, Zhenfei Zhang, Yunhang Shen, Felix X. -F. Ye, Ming-Ching Chang
Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLM), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address this, we introduce \textit{S}hot-aware \textit{M}ultimodal \textit{A}udio-enhanced \textit{R}etrieval of \textit{T}emporal \textit{S}egments (SMART), an MLLM-based framework that integrates audio cues and leverages shot-level temporal structure. SMART enriches multimodal representations by combining audio and visual features while applying \textbf{Shot-aware Token Compression}, which selectively retains high-information tokens within each shot to reduce redundancy and preserve fine-grained temporal details. We also refine prompt design to better utilize audio-visual cues. Evaluations on Charades-STA and QVHighlights show that SMART achieves significant improvements over state-of-the-art methods, including a 1.61% increase in R1@0.5 and 2.59% gain in R1@0.7 on Charades-STA.
视频时刻检索是视频理解任务中的一种,旨在根据自然语言查询在未修剪的视频中定位特定的时间片段。尽管最近在传统技术和多模态大型语言模型(MLLM)方面的视频时刻检索取得了进展,但大多数现有方法仍然依赖于粗略的时间理解和单一视觉模式,这在处理复杂视频时限制了性能。为了解决这一问题,我们引入了SMART(基于镜头感知的多模态音频增强时间片段检索),这是一个基于MLLM的框架,它结合了音频线索并利用了镜头级的时间结构。SMART通过结合音频和视觉特征来丰富多模态表示,同时应用“镜头感知令牌压缩”,有选择地保留每个镜头中的高信息令牌,以减少冗余并保留精细的时间细节。我们还对提示设计进行了改进,以更好地利用视听线索。在Charades-STA和QVHighlights上的评估表明,SMART相较于最先进的方法实现了显著改进,包括在Charades-STA上R1@0.5提高了1.61%,R1@0.7提高了2.59%。
论文及项目相关链接
Summary
视频时刻检索是视频理解领域中的一项任务,旨在根据自然语言查询在未剪辑的视频中定位特定的时间片段。尽管近期采用传统技术和多模态大型语言模型(MLLM)的方法有所进展,但大多数现有方法仍依赖于粗略的时间理解和单一视觉模式,在复杂视频上的性能受到限制。为此,我们引入SMART框架,它基于MLLM,结合了音频线索,并利用镜头级别的时间结构。SMART通过结合音频和视觉特征来丰富多模态表示,同时应用镜头感知标记压缩,选择性保留每个镜头中的高信息标记,以减少冗余并保留精细的时间细节。在Charades-STA和QVHighlights上的评估表明,SMART相较于最新技术实现了显著改进,其中在Charades-STA上R1@0.5提高了1.61%,R1@0.7提高了2.59%。
Key Takeaways
- 视频时刻检索旨在定位未剪辑视频中的特定时间片段,基于自然语言查询。
- 现有方法主要依赖粗略的时间理解和单一视觉模式,性能受限。
- SMART框架结合音频线索和镜头级别的时间结构。
- SMART利用多模态表示,结合音频和视觉特征。
- 镜头感知标记压缩技术被用于SMART中,以保留高信息标记并减少冗余。
- SMART在Charades-STA和QVHighlights上的评估结果优于最新技术。
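"镜头感知标记压缩"的基本操作可以用如下示意理解(非官方实现;标记信息量得分的具体来源为假设,这里用特征范数代替):

```python
import torch

def shot_aware_compress(tokens, shot_ids, keep_ratio=0.5):
    """
    tokens:   [T, D] 视频帧/片段标记
    shot_ids: [T]    每个标记所属的镜头编号
    在每个镜头内部按信息量得分(此处以特征范数近似)保留前 keep_ratio 的标记,
    既减少冗余,又不破坏镜头级的时间结构。
    """
    kept = []
    for sid in shot_ids.unique(sorted=True):
        idx = (shot_ids == sid).nonzero(as_tuple=True)[0]
        scores = tokens[idx].norm(dim=-1)                    # 假设的信息量打分
        k = max(1, int(len(idx) * keep_ratio))
        top = idx[scores.topk(k).indices.sort().values]      # 保留高分标记并维持时间顺序
        kept.append(top)
    keep_idx = torch.cat(kept)
    return tokens[keep_idx], keep_idx

# 玩具示例:20 个标记、3 个镜头
toks = torch.randn(20, 8)
shots = torch.tensor([0] * 7 + [1] * 6 + [2] * 7)
out, idx = shot_aware_compress(toks, shots, keep_ratio=0.5)
print(out.shape, idx.tolist())
```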
点此查看论文截图
FxSearcher: gradient-free text-driven audio transformation
Authors:Hojoon Ki, Jongsuk Kim, Minchan Kwon, Junmo Kim
Achieving diverse and high-quality audio transformations from text prompts remains challenging, as existing methods are fundamentally constrained by their reliance on a limited set of differentiable audio effects. This paper proposes \textbf{FxSearcher}, a novel gradient-free framework that discovers the optimal configuration of audio effects (FX) to transform a source signal according to a text prompt. Our method employs Bayesian Optimization and CLAP-based score function to perform this search efficiently. Furthermore, a guiding prompt is introduced to prevent undesirable artifacts and enhance human preference. To objectively evaluate our method, we propose an AI-based evaluation framework. The results demonstrate that the highest scores achieved by our method on these metrics align closely with human preferences. Demos are available at https://hojoonki.github.io/FxSearcher/
基于文本提示实现多样化且高质量的音频转换仍然具有挑战性,因为现有方法从根本上受到其依赖于有限的可微音频效果的限制。本文提出了一个新型无梯度框架——FxSearcher,它可以根据文本提示发现音频效果(FX)的最佳配置,以转换源信号。我们的方法采用贝叶斯优化和基于CLAP的评分函数,以高效执行搜索。此外,引入指导提示,以防止不希望的伪迹并更符合人类偏好。为了客观地评估我们的方法,我们提出了一个基于AI的评估框架。结果表明,我们的方法在这些指标上获得的最高分数与人类偏好紧密对齐。演示内容可通过https://hojoonki.github.io/FxSearcher/访问。
论文及项目相关链接
Summary
本文提出了一个名为FxSearcher的创新框架,旨在以无梯度的方式实现基于文本提示的高质量和多样化的音频转换。通过引入贝叶斯优化和基于CLAP的评分函数,该框架能够高效搜索最佳的音频效果配置来转换源信号。此外,引入了指导提示来防止不希望的伪迹并增强人类偏好。采用基于AI的评估框架进行客观评估,结果证明了该方法与人类偏好高度一致。
Key Takeaways
- FxSearcher是一个无梯度依赖的创新框架,用于音频转换任务。它通过优化音频效果配置来转换源信号以匹配文本提示。
- FxSearcher通过贝叶斯优化和基于CLAP的评分函数实现高效搜索音频效果配置。这种方法超越了传统方法的限制,为音频转换提供了更广泛的可能性。
- 引入了指导提示,有效防止转换过程中出现不希望的伪迹,同时增强了人类偏好。这提高了音频转换的质量和用户体验。
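FxSearcher的无梯度搜索可以用一个简化循环示意(非官方实现;这里用随机搜索代替论文中的贝叶斯优化,音效链与CLAP打分函数均为假设的占位实现):

```python
import random

def apply_fx(audio, params):
    """占位的音效链:按参数对信号做简单缩放/偏移,实际应为一组音频效果器。"""
    return [x * params["gain"] + params["offset"] for x in audio]

def clap_score(audio, prompt):
    """占位的打分函数:实际应由 CLAP 计算音频与文本提示的匹配度。"""
    return -abs(sum(audio) / len(audio) - len(prompt) * 0.01)

def gradient_free_search(audio, prompt, n_trials=50, seed=0):
    """无梯度搜索最优 FX 参数:随机采样参数并保留得分最高的配置。"""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {"gain": rng.uniform(0.5, 2.0), "offset": rng.uniform(-0.1, 0.1)}
        score = clap_score(apply_fx(audio, params), prompt)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

print(gradient_free_search([0.1, -0.2, 0.05, 0.3], "warm vintage tone"))
```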
点此查看论文截图
Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation
Authors:Yu Zhong, Zihao Zhang, Rui Zhang, Lingdong Huang, Haihan Gao, Shuo Wang, Da Li, Ruijian Han, Jiaming Guo, Shaohui Peng, Di Huang, Yunji Chen
Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. Despite their strengths, a substantial gap in task completion performance persists between LLM-based approaches and domain experts, as LLMs inherently struggle to comprehend real-world spatial correlations precisely. Additionally, introducing LLMs is accompanied with substantial computational cost and inference latency. To address these issues, we propose a novel dual-process thinking framework dubbed R3, integrating LLMs’ generalization capabilities with VLN-specific expertise in a zero-shot manner. The framework comprises three core modules: Runner, Ruminator, and Regulator. The Runner is a lightweight transformer-based expert model that ensures efficient and accurate navigation under regular circumstances. The Ruminator employs a powerful multimodal LLM as the backbone and adopts chain-of-thought (CoT) prompting to elicit structured reasoning. The Regulator monitors the navigation progress and controls the appropriate thinking mode according to three criteria, integrating Runner and Ruminator harmoniously. Experimental results illustrate that R3 significantly outperforms other state-of-the-art methods, exceeding 3.28% and 3.30% in SPL and RGSPL respectively on the REVERIE benchmark. This pronounced enhancement highlights the effectiveness of our method in handling challenging VLN tasks.
视觉与语言导航(VLN)要求智能体根据人类指令动态探索复杂的3D环境。最近的研究强调了在视觉与语言导航中利用大型语言模型(LLM)的潜力,因为它们具备常识知识和通用推理能力。尽管LLM具有优势,但基于LLM的方法和领域专家之间在任务完成性能上仍存在较大差距,因为LLM在本质上难以精确地理解现实世界的空间关系。此外,引入LLM还伴随着巨大的计算成本和推理延迟。为了解决这些问题,我们提出了一种名为R3的新型双过程思维框架,以零样本的方式将LLM的通用能力与VLN特定专业知识相结合。该框架包括三个核心模块:Runner、Ruminator和Regulator。Runner是一个基于轻量级Transformer的专家模型,可确保在常规情况下实现高效且准确的导航。Ruminator则采用强大的多模态LLM作为骨干,并采用思维链(CoT)提示来激发结构化推理。Regulator监控导航进度,并根据三个标准控制适当的思维模式,使Runner和Ruminator能够和谐集成。实验结果显示,在REVERIE基准测试上,R3的SPL和RGSPL分别比其他先进方法高出3.28%和3.30%。这一显著的提升突显了我们的方法在应对具有挑战性的VLN任务时的有效性。
论文及项目相关链接
Summary
针对视觉与语言导航(VLN)任务,本文提出一个以零样本方式整合大型语言模型(LLM)通用能力与VLN特定专长的双过程思考框架R3。该框架包含三个核心模块:Runner、Ruminator和Regulator。Runner确保高效准确导航,Ruminator利用强大的多模态LLM进行结构化推理,Regulator根据三个标准监控导航进度并控制适当的思考模式。在REVERIE基准测试中,R3显著优于其他先进方法,SPL和RGSPL分别提升3.28%和3.30%。
Key Takeaways
- VLN任务需要动态探索复杂3D环境并遵循人类指令。
- 大型语言模型(LLM)具有常识知识和通用推理能力,在VLN中有潜力。
- LLM在理解现实世界空间关联方面存在精确性不足的问题。
- 提出了一种名为R3的新型双思维过程框架,结合了LLM的通用能力与VLN特定专长。
- R3框架包含三个核心模块:Runner、Ruminator和Regulator,各有其功能和作用。
- 在REVERIE基准测试中,R3显著提高了VLN任务完成性能。
点此查看论文截图
PRISM: Prompt-Refined In-Context System Modelling for Financial Retrieval
Authors:Chun Chet Ng, Jia Yu Lim, Wei Zeng Low
With the rapid progress of large language models (LLMs), financial information retrieval has become a critical industrial application. Extracting task-relevant information from lengthy financial filings is essential for both operational and analytical decision-making. The FinAgentBench dataset formalizes this problem through two tasks: document ranking and chunk ranking. We present PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and a lightweight multi-agent system. Each component is examined extensively to reveal their synergies: prompt engineering provides precise task instructions, ICL supplies semantically relevant few-shot examples, and the multi-agent system models coordinated scoring behaviour. Our best configuration achieves an NDCG@5 of 0.71818 on the restricted validation split. We further demonstrate that PRISM is feasible and robust for production-scale financial retrieval. Its modular, inference-only design makes it practical for real-world use cases. The source code is released at https://bit.ly/prism-ailens.
随着大型语言模型(LLM)的快速发展,金融信息检索已成为一项重要的工业应用。从冗长的财务文件中提取与任务相关的信息是运营和分析决策的关键。FinAgentBench数据集将该问题形式化为两个任务:文档排序和文本块排序。我们提出了PRISM,这是一个无需训练的框架,它集成了精细的系统提示、上下文学习(ICL)和轻量级的多智能体系统。我们对每个组件进行了广泛的研究,以揭示它们的协同作用:提示工程提供精确的任务指令,上下文学习提供语义相关的少样本示例,多智能体系统对协调评分行为进行建模。我们的最佳配置在受限的验证集上实现了NDCG@5为0.71818的成绩。我们进一步证明PRISM对于生产规模的金融检索是可行且稳健的。其模块化、仅推理的设计使其适用于真实世界的用例。源代码已发布在https://bit.ly/prism-ailens。
论文及项目相关链接
PDF 3rd-place solution for the ACM ICAIF 2025 Agentic Retrieval Grand Challenge
Summary
随着大型语言模型(LLM)的快速发展,金融信息检索已成为关键的工业应用。PRISM是一个无需训练、整合精细化系统提示、上下文学习和轻量级多智能体系统的框架,用于解决金融文件检索中的文档排序和片段排序任务。该框架表现出色,并适用于生产规模的实际应用场景。
Key Takeaways
- 金融信息检索已成为重要工业应用,得益于大型语言模型的快速发展。
- PRISM框架用于解决金融文件检索中的文档排序和片段排序任务。
- PRISM集成了精细化系统提示、上下文学习和轻量级多智能体系统。
- 精细化系统提示提供精确任务指令,上下文学习提供语义上相关的少量样本。
- 多智能体系统模拟协同评分行为。
- PRISM在限定验证集上取得NDCG@5为0.71818的成绩。
点此查看论文截图
APD-Agents: A Large Language Model-Driven Multi-Agents Collaborative Framework for Automated Page Design
Authors:Xinpeng Chen, Xiaofeng Han, Kaihao Zhang, Guochao Ren, Yujie Wang, Wenhao Cao, Yang Zhou, Jianfeng Lu, Zhenbo Song
Layout design is a crucial step in developing mobile app pages. However, crafting satisfactory designs is time-intensive for designers: they need to consider which controls and content to present on the page, and then repeatedly adjust their size, position, and style for better aesthetics and structure. Although many design software can now help to perform these repetitive tasks, extensive training is needed to use them effectively. Moreover, collaborative design across app pages demands extra time to align standards and ensure consistent styling. In this work, we propose APD-agents, a large language model (LLM) driven multi-agent framework for automated page design in mobile applications. Our framework contains OrchestratorAgent, SemanticParserAgent, PrimaryLayoutAgent, TemplateRetrievalAgent, and RecursiveComponentAgent. Upon receiving the user’s description of the page, the OrchestratorAgent can dynamically can direct other agents to accomplish users’ design task. To be specific, the SemanticParserAgent is responsible for converting users’ descriptions of page content into structured data. The PrimaryLayoutAgent can generate an initial coarse-grained layout of this page. The TemplateRetrievalAgent can fetch semantically relevant few-shot examples and enhance the quality of layout generation. Besides, a RecursiveComponentAgent can be used to decide how to recursively generate all the fine-grained sub-elements it contains for each element in the layout. Our work fully leverages the automatic collaboration capabilities of large-model-driven multi-agent systems. Experimental results on the RICO dataset show that our APD-agents achieve state-of-the-art performance.
布局设计是开发移动应用页面的关键步骤。然而,对设计师来说,打造令人满意的设计十分耗时:他们需要考虑在页面上呈现哪些控件和内容,并反复调整其大小、位置和样式以获得更好的美观性和结构。尽管现在很多设计软件都能帮助完成这些重复的任务,但要有效地使用它们仍然需要进行大量培训。此外,跨应用页面的协作设计需要额外的时间来对齐标准和确保样式的一致性。在本研究中,我们提出了APD-agents,这是一个基于大型语言模型(LLM)的用于移动应用自动化页面设计的多智能体框架。我们的框架包含OrchestratorAgent、SemanticParserAgent、PrimaryLayoutAgent、TemplateRetrievalAgent和RecursiveComponentAgent。在接收到用户对页面的描述后,OrchestratorAgent可以动态地指导其他智能体完成用户的设计任务。具体来说,SemanticParserAgent负责将用户对页面内容的描述转换为结构化数据。PrimaryLayoutAgent可以生成该页面的初始粗粒度布局。TemplateRetrievalAgent可以检索语义相关的少样本示例并提升布局生成的质量。此外,RecursiveComponentAgent可以用来决定如何递归生成布局中每个元素所包含的细粒度子元素。我们的工作充分利用了大模型驱动的多智能体系统的自动协作能力。在RICO数据集上的实验结果表明,我们的APD-agents达到了最先进的性能。
论文及项目相关链接
Summary
本文介绍了一种基于大型语言模型的多代理框架APD-agents,用于移动应用程序的自动化页面设计。该框架包括多个代理,能够协作完成页面设计任务,如内容结构化、初始布局生成、模板检索和递归组件生成等。实验结果表明,该框架在RICO数据集上取得了最先进的性能。
Key Takeaways
- APD-agents是一个多代理框架,用于移动应用程序的自动化页面设计。
- 该框架包括多个代理,如OrchestratorAgent、SemanticParserAgent、PrimaryLayoutAgent、TemplateRetrievalAgent和RecursiveComponentAgent。
- APD-agents能够自动完成页面设计任务,如内容结构化、初始布局生成等。
- 框架利用大型语言模型的自动协作能力,提高了设计效率。
- APD-agents通过少量示例进行语义相关的模板检索,提高了布局生成的质量。
- 实验结果表明,APD-agents在RICO数据集上的性能达到最新水平。
点此查看论文截图
Zero-Training Task-Specific Model Synthesis for Few-Shot Medical Image Classification
Authors:Yao Qin, Yangyang Yan, YuanChao Yang, Jinhua Pang, Huanyong Bi, Yuan Liu, HaiHua Wang
Deep learning models have achieved remarkable success in medical image analysis but are fundamentally constrained by the requirement for large-scale, meticulously annotated datasets. This dependency on “big data” is a critical bottleneck in the medical domain, where patient data is inherently difficult to acquire and expert annotation is expensive, particularly for rare diseases where samples are scarce by definition. To overcome this fundamental challenge, we propose a novel paradigm: Zero-Training Task-Specific Model Synthesis (ZS-TMS). Instead of adapting a pre-existing model or training a new one, our approach leverages a large-scale, pre-trained generative engine to directly synthesize the entire set of parameters for a task-specific classifier. Our framework, the Semantic-Guided Parameter Synthesizer (SGPS), takes as input minimal, multi-modal task information as little as a single example image (1-shot) and a corresponding clinical text description to directly synthesize the entire set of parameters for a task-specific classifier. The generative engine interprets these inputs to generate the weights for a lightweight, efficient classifier (e.g., an EfficientNet-V2), which can be deployed for inference immediately without any task-specific training or fine-tuning. We conduct extensive evaluations on challenging few-shot classification benchmarks derived from the ISIC 2018 skin lesion dataset and a custom rare disease dataset. Our results demonstrate that SGPS establishes a new state-of-the-art, significantly outperforming advanced few-shot and zero-shot learning methods, especially in the ultra-low data regimes of 1-shot and 5-shot classification. This work paves the way for the rapid development and deployment of AI-powered diagnostic tools, particularly for the long tail of rare diseases where data is critically limited.
深度学习模型在医学图像分析方面取得了显著的成功,但从根本上受到大规模精细标注数据集需求的制约。在医学领域,对"大数据"的依赖是一个关键的瓶颈,因为患者数据本质上难以获取,专家标注成本高昂,尤其是对于定义上样本稀缺的罕见疾病。为了克服这一基本挑战,我们提出了一种新型范式:零训练任务特定模型合成(ZS-TMS)。我们的方法不是适配现有模型或训练一个新模型,而是利用大规模预训练生成引擎直接合成任务特定分类器的整套参数。我们的框架,语义引导参数合成器(SGPS),以极少的多模态任务信息作为输入,仅需一张示例图像(1-shot)和相应的临床文本描述,即可直接合成任务特定分类器的整套参数。生成引擎解析这些输入以生成轻量级高效分类器(例如EfficientNet-V2)的权重,该分类器可以立即部署进行推理,无需任何针对任务的训练或微调。我们在源自ISIC 2018皮肤病变数据集和自定义罕见疾病数据集的具有挑战性的少样本分类基准上进行了广泛评估。结果表明,SGPS确立了新的最先进水平,显著优于先进的少样本和零样本学习方法,尤其在1-shot和5-shot分类的超低数据场景下。这项工作为AI驱动的诊断工具的快速开发和部署铺平了道路,特别是在数据极为有限的罕见疾病长尾领域。
论文及项目相关链接
Summary
深度学习方法在医学图像分析上取得显著成效,但对大规模精细标注数据集的依赖成为关键瓶颈。针对医学领域患者数据获取困难、专家标注成本高昂的问题,特别是罕见疾病样本稀缺的问题,提出一种新型解决方案:零训练任务特定模型合成(ZS-TMS)。该方法利用大规模预训练生成引擎,直接合成针对特定任务的参数集,只需少量多模态任务信息和示例图像,即可快速部署进行推理。在ISIC 2018皮肤病变数据集和自定义罕见疾病数据集上进行评估,结果显示该方法确立了新的最先进水平,尤其在超低数据环境下的1-shot和5-shot分类中表现优异。这为AI诊断工具尤其是针对数据严重受限的罕见疾病的快速开发和部署铺平了道路。
Key Takeaways
- 深度学习方法在医学图像分析中的应用及其对数据集的依赖。
- 医学领域患者数据获取和专家标注的困难,特别是罕见疾病样本的稀缺性。
- 提出零训练任务特定模型合成(ZS-TMS)方法以解决数据瓶颈问题。
- 利用预训练生成引擎直接合成任务特定分类器的参数集。
- 语义引导参数合成器(SGPS)只需少量模态任务信息和示例图像即可工作。
- 方法在ISIC 2018皮肤病变数据集和自定义罕见疾病数据集上表现优异。
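"直接合成任务特定分类器参数"的思路可以用一个极简的超网络示意(非论文实现;条件编码与生成引擎的结构均为假设,这里只展示"由任务嵌入一次性产出分类头权重并立即推理"的流程):

```python
import torch
import torch.nn as nn

class WeightSynthesizer(nn.Module):
    """由任务条件嵌入(图像+文本融合后的向量)生成分类头的权重与偏置。"""
    def __init__(self, cond_dim, feat_dim, num_classes):
        super().__init__()
        self.feat_dim, self.num_classes = feat_dim, num_classes
        self.gen = nn.Sequential(
            nn.Linear(cond_dim, 512), nn.ReLU(),
            nn.Linear(512, feat_dim * num_classes + num_classes),
        )

    def forward(self, cond):
        out = self.gen(cond)
        W = out[: self.feat_dim * self.num_classes].view(self.num_classes, self.feat_dim)
        b = out[self.feat_dim * self.num_classes:]
        return W, b

# 玩具示例:用合成的权重直接对图像特征做零训练推理
cond = torch.randn(256)            # 假设:1-shot 图像与临床文本描述融合后的任务嵌入
feats = torch.randn(4, 1280)       # 假设:轻量骨干(如 EfficientNet-V2)提取的图像特征
synth = WeightSynthesizer(256, 1280, num_classes=3)
W, b = synth(cond)
logits = feats @ W.t() + b          # 无需任何任务特定训练即可得到预测
print(logits.argmax(dim=-1))
```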
点此查看论文截图
Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding
Authors:Qingyang Yan, Guangyao Chen, Yixiong Zou
Chain-of-Thought (CoT) prompting has recently shown significant promise across various NLP and computer vision tasks by explicitly generating intermediate reasoning steps. However, we find that reinforcement learning (RL)-based fine-tuned CoT reasoning can paradoxically degrade performance in Visual Grounding tasks, particularly as CoT outputs become lengthy or complex. Additionally, our analysis reveals that increased dataset size does not always enhance performance due to varying data complexities. Motivated by these findings, we propose Curriculum-based Relative Policy Optimization (CuRPO), a novel training strategy that leverages CoT length and generalized Intersection over Union (gIoU) rewards as complexity indicators to progressively structure training data from simpler to more challenging examples. Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and LISA datasets demonstrate the effectiveness of our approach. CuRPO consistently outperforms existing methods, including Visual-RFT, with notable improvements of up to +12.52 mAP on RefCOCO. Moreover, CuRPO exhibits exceptional efficiency and robustness, delivering strong localization performance even in few-shot learning scenarios, particularly benefiting tasks characterized by ambiguous and intricate textual descriptions.The code is released on https://github.com/qyoung-yan/CuRPO.
链式思维(Chain-of-Thought,简称CoT)提示通过显式生成中间推理步骤,在多种NLP和计算机视觉任务中展现出巨大的潜力。然而,我们发现基于强化学习(RL)微调的CoT推理在视觉定位任务中反而可能导致性能下降,特别是在CoT输出变得冗长或复杂时。此外,我们的分析表明,由于数据复杂性各不相同,增加数据集规模并不总能提高性能。基于这些发现,我们提出了基于课程的相对策略优化(CuRPO)这一新颖的训练策略,它利用CoT长度和广义交并比(gIoU)奖励作为复杂度指标,将训练数据由简单到困难逐步组织。在RefCOCO、RefCOCO+、RefCOCOg和LISA数据集上的大量实验证明了我们方法的有效性。CuRPO持续超越包括Visual-RFT在内的现有方法,在RefCOCO上的改进达到了+12.52 mAP。此外,CuRPO表现出出色的效率和稳健性,即使在少样本学习场景中也能提供出色的定位性能,尤其有利于涉及模糊和复杂文本描述的任务。代码已发布在https://github.com/qyoung-yan/CuRPO。
论文及项目相关链接
PDF AAAI 2026 (Oral)
Summary
链式思维(Chain-of-Thought,简称CoT)提示法在各NLP和计算机视觉任务中展现出巨大潜力,通过明确生成中间推理步骤,能有效促进推理过程。然而,我们发现经强化学习(RL)微调的CoT推理在视觉定位任务中会出现性能下降,尤其在CoT输出复杂且冗长时更为明显。鉴于此,我们提出了基于课程的相对策略优化(Curriculum-based Relative Policy Optimization,简称CuRPO)方法,利用CoT长度和广义交并比(Generalized Intersection over Union,简称gIoU)奖励作为复杂度指标,来结构化从简单到复杂的渐进式训练数据。在RefCOCO、RefCOCO+、RefCOCOg和LISA数据集上的实验证明了我们方法的有效性。CuRPO方法相较于现有方法如Visual-RFT有明显优势,在RefCOCO数据集上提升最高可达+12.52 mAP。此外,CuRPO展现出卓越的效率与稳健性,即使在少样本学习场景下也能实现出色的定位性能,尤其适用于文本描述模糊且复杂的任务。
Key Takeaways
- 链式思维(CoT)提示法对于NLP和计算机视觉任务展现出显著潜力。
- 强化学习(RL)fine-tuning的CoT推理在视觉定位任务中可能导致性能下降。
- CuRPO是一种新的训练策略,通过利用CoT长度和gIoU奖励作为复杂性指标来渐进式结构化训练数据。
- CuRPO在多个数据集上的实验表现优于现有方法。
- CuRPO在RefCOCO数据集上的改进最高可达+12.52 mAP。
- CuRPO在少样本学习场景中表现出卓越的效率与稳健性。
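"以CoT长度与gIoU作为复杂度指标、由易到难组织训练数据"的课程构建可以用如下示意(非官方实现;两项指标的加权方式为假设):

```python
def giou(box_a, box_b):
    """广义交并比 gIoU,框格式为 (x1, y1, x2, y2)。"""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    enclose = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (enclose - union) / enclose

def curriculum_order(samples, w_len=0.5, w_giou=0.5):
    """
    samples: [{"cot_len": int, "giou": float, ...}, ...]
    复杂度 = 归一化的 CoT 长度 + (1 - gIoU):CoT 越长、定位越差的样本越难,排在越后。
    """
    max_len = max(s["cot_len"] for s in samples) or 1
    def difficulty(s):
        return w_len * s["cot_len"] / max_len + w_giou * (1.0 - s["giou"])
    return sorted(samples, key=difficulty)

# 玩具示例
data = [{"id": 1, "cot_len": 30, "giou": 0.8},
        {"id": 2, "cot_len": 120, "giou": 0.2},
        {"id": 3, "cot_len": 60, "giou": 0.6}]
print([s["id"] for s in curriculum_order(data)])
print(round(giou((0, 0, 2, 2), (1, 1, 3, 3)), 3))
```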
点此查看论文截图
SQL-to-Text Generation with Weighted-AST Few-Shot Prompting
Authors:Sriom Chakrabarti, Chuangtao Ma, Arijit Khan, Sebastian Link
SQL-to-Text generation aims at translating structured SQL queries into natural language descriptions, thereby facilitating comprehension of complex database operations for non-technical users. Although large language models (LLMs) have recently demonstrated promising results, current methods often fail to maintain the exact semantics of SQL queries, particularly when there are multiple possible correct phrasings. To address this problem, our work proposes Weighted-AST retrieval with prompting, an architecture that integrates structural query representations and LLM prompting. This method retrieves semantically relevant examples as few-shot prompts using a similarity metric based on an Abstract Syntax Tree (AST) with learned feature weights. Our structure-aware prompting technique ensures that generated descriptions are both fluent and faithful to the original query logic. Numerous experiments on three benchmark datasets - Spider, SParC, and CoSQL show that our method outperforms the current baselines by up to +17.24% in execution Accuracy (EX), performs superior in Exact Match (EM) and provides more consistent semantic fidelity when evaluated by humans, all while preserving competitive runtime performance. These results demonstrate that Weighted-AST prompting is a scalable and effective method for deriving natural language explanations from structured database queries.
SQL-to-Text生成旨在将结构化SQL查询翻译为自然语言描述,从而帮助非技术用户理解复杂的数据库操作。尽管大型语言模型(LLM)最近表现出了令人鼓舞的结果,但当前的方法往往无法保持SQL查询的确切语义,特别是在存在多种可能的正确表述时。为了解决这个问题,我们提出了带提示的加权AST检索(Weighted-AST retrieval with prompting),这是一种结合了结构化查询表示和LLM提示的架构。该方法使用基于抽象语法树(AST)的相似度度量以及学习到的特征权重,检索语义相关的示例作为少样本提示。我们的结构感知提示技术确保生成的描述既流畅又忠于原始查询逻辑。在Spider、SParC和CoSQL三个基准数据集上的大量实验表明,我们的方法在执行准确性(EX)方面比当前基线最多高出+17.24%,在精确匹配(EM)方面表现更优,同时在人类评估中提供更一致的语义保真度,并保持具有竞争力的运行时性能。这些结果表明,加权AST提示是一种可扩展且有效的、从结构化数据库查询中导出自然语言解释的方法。
论文及项目相关链接
Summary
SQL-to-Text生成旨在将结构化SQL查询翻译为自然语言描述,帮助非技术用户理解复杂的数据库操作。尽管大型语言模型(LLM)已有较好表现,但当前方法往往无法保持SQL查询的确切语义,特别是在有多种可能正确表述时。为解决此问题,本研究提出Weighted-AST检索与提示架构,结合结构化查询表示与LLM提示。该方法使用基于抽象语法树(AST)的相似度度量和学习的特征权重,检索少量相关示例作为提示。结构感知提示技术确保生成的描述既流畅又忠实于原始查询逻辑。在Spider、SParC和CoSQL三个基准数据集上的实验表明,该方法在执行准确度(EX)上较当前基线高出+17.24%,在精确匹配(EM)上表现更优,人类评估的语义保真度也更高,同时保持竞争力运行时间性能。结果表明,Weighted-AST提示是一种从结构化数据库查询中派生出自然语言解释的有效且可扩展方法。
Key Takeaways
- SQL-to-Text生成旨在将SQL查询转化为自然语言描述,便于非技术用户理解。
- 当前方法难以保持SQL查询的确切语义,特别是在有多种表达方式时。
- Weighted-AST检索与提示架构结合了结构化查询表示和LLM提示技术。
- 该方法使用基于AST的相似度度量及学习的特征权重进行语义相关示例检索。
- 结构感知提示技术确保描述既流畅又忠实于原始查询逻辑。
- 在多个基准数据集上的实验显示,该方法在执行准确度、精确匹配和语义保真度方面表现优异。
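"基于带权AST特征的相似度检索少样本示例"可以用如下简化示意(非论文实现;AST特征的提取粒度与权重学习方式均为假设;Python标准库没有SQL解析器,这里借用Python代码的AST演示同一思路,论文针对的是SQL查询的语法树):

```python
import ast
from collections import Counter

def ast_features(code):
    """把代码解析为 AST,并统计各节点类型出现次数作为结构特征。"""
    tree = ast.parse(code)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def weighted_similarity(f1, f2, weights):
    """带权特征相似度:对每种节点类型按(学习到的)权重加权后计算归一化重叠度。"""
    keys = set(f1) | set(f2)
    num = sum(weights.get(k, 1.0) * min(f1[k], f2[k]) for k in keys)
    den = sum(weights.get(k, 1.0) * max(f1[k], f2[k]) for k in keys)
    return num / den if den else 0.0

def retrieve_examples(query, pool, weights, top_k=2):
    """从示例池中检索结构上最相似的 top_k 条,作为少样本提示。"""
    qf = ast_features(query)
    scored = [(ex, weighted_similarity(qf, ast_features(ex), weights)) for ex in pool]
    return sorted(scored, key=lambda x: -x[1])[:top_k]

pool = ["x = [i for i in range(10)]",
        "def f(a, b):\n    return a + b",
        "y = sum(v * v for v in data)"]
weights = {"Call": 2.0, "BinOp": 1.5}   # 假设:学习得到的特征权重
print(retrieve_examples("z = sum(i for i in items)", pool, weights))
```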
点此查看论文截图
Temporal Object-Aware Vision Transformer for Few-Shot Video Object Detection
Authors:Yogesh Kumar, Anand Mishra
Few-shot Video Object Detection (FSVOD) addresses the challenge of detecting novel objects in videos with limited labeled examples, overcoming the constraints of traditional detection methods that require extensive training data. This task presents key challenges, including maintaining temporal consistency across frames affected by occlusion and appearance variations, and achieving novel object generalization without relying on complex region proposals, which are often computationally expensive and require task-specific training. Our novel object-aware temporal modeling approach addresses these challenges by incorporating a filtering mechanism that selectively propagates high-confidence object features across frames. This enables efficient feature progression, reduces noise accumulation, and enhances detection accuracy in a few-shot setting. By utilizing few-shot trained detection and classification heads with focused feature propagation, we achieve robust temporal consistency without depending on explicit object tube proposals. Our approach achieves performance gains, with AP improvements of 3.7% (FSVOD-500), 5.3% (FSYTV-40), 4.3% (VidOR), and 4.5 (VidVRD) in the 5-shot setting. Further results demonstrate improvements in 1-shot, 3-shot, and 10-shot configurations. We make the code public at: https://github.com/yogesh-iitj/fs-video-vit
少样本视频目标检测(FSVOD)解决了在有限标记样本下检测视频中新目标的挑战,克服了传统检测方法需要大量训练数据的限制。此任务面临关键挑战,包括在受遮挡和外观变化影响的帧之间维持时间一致性,以及在不依赖复杂区域提案的情况下实现新对象泛化;复杂的区域提案通常计算量大,且需要针对任务的训练。我们的新型目标感知时间建模方法通过引入过滤机制来应对这些挑战,该机制可选择性地在帧之间传播高置信度的目标特征。这实现了有效的特征传递,减少了噪声累积,并在少样本设置中提高了检测准确性。通过利用少样本训练的检测和分类头以及有针对性的特征传播,我们在不依赖显式对象管(object tube)提案的情况下实现了稳健的时间一致性。在5-shot设置下,我们的方法在FSVOD-500、FSYTV-40、VidOR和VidVRD上分别将AP提升了3.7%、5.3%、4.3%和4.5%。进一步的结果展示了在1-shot、3-shot和10-shot配置中的改进。我们在以下网址公开了代码:https://github.com/yogesh-iitj/fs-video-vit。
论文及项目相关链接
PDF Accepted at AAAI 2026 Main Track
Summary
本文介绍了Few-shot Video Object Detection(FSVOD)技术,该技术解决了在视频中对新型对象进行有限标注样本检测的挑战。通过采用对象感知的时间建模方法,结合过滤机制选择性传播高置信度的对象特征,实现了高效特征进展、减少噪声累积,并在小样本设置下提高了检测准确性。通过利用小样本训练的检测和分类头以及有针对性的特征传播,实现了稳健的时间一致性,无需依赖显式对象管道提案。该方法在多个数据集上的性能有所提升。
Key Takeaways
- Few-shot Video Object Detection (FSVOD) 技术解决了在有限标注样本下检测视频中的新型对象的挑战。
- FSVOD面临的关键挑战包括:维持跨帧的暂时一致性、处理遮挡和外观变化以及实现新型对象的泛化。
- 提出了一种对象感知的时间建模方法,通过选择性传播高置信度的对象特征来解决这些挑战。
- 高效特征进展、减少噪声累积在小样本设置下提高了检测准确性。
- 利用小样本训练的检测和分类头以及有针对性的特征传播,实现了稳健的时间一致性,无需依赖复杂的区域提案。
- 在多个数据集上的实验结果表明,该方法在5-shot设置及其他设置中均取得了性能提升。
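"只传播高置信度目标特征"的过滤机制可以用如下示意(非官方实现;置信度来源与特征融合方式均为假设):

```python
import torch

def propagate_high_confidence(prev_feats, prev_conf, curr_feats, threshold=0.7, momentum=0.5):
    """
    prev_feats: [N, D] 上一帧的目标特征;prev_conf: [N] 对应置信度
    curr_feats: [N, D] 当前帧的目标特征(假设已按目标对齐)
    仅当上一帧置信度高于阈值时,才把其特征以动量方式融合进当前帧,
    从而在遮挡/外观变化下保持时间一致性并抑制噪声累积。
    """
    mask = (prev_conf > threshold).float().unsqueeze(-1)       # [N, 1]
    fused = mask * (momentum * prev_feats + (1 - momentum) * curr_feats) \
            + (1 - mask) * curr_feats
    return fused

# 玩具示例
prev = torch.randn(3, 16)
conf = torch.tensor([0.9, 0.4, 0.8])
curr = torch.randn(3, 16)
print(propagate_high_confidence(prev, conf, curr).shape)
```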
点此查看论文截图
PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos
Authors:Dianbing Xi, Guoyuan An, Jingsen Zhu, Zhijian Liu, Yuan Liu, Ruiyuan Zhang, Jiayuan Lu, Yuchi Huo, Rui Wang
We propose PFAvatar (Pose-Fusion Avatar), a new method that reconstructs high-quality 3D avatars from Outfit of the Day(OOTD) photos, which exhibit diverse poses, occlusions, and complex backgrounds. Our method consists of two stages: (1) fine-tuning a pose-aware diffusion model from few-shot OOTD examples and (2) distilling a 3D avatar represented by a neural radiance field (NeRF). In the first stage, unlike previous methods that segment images into assets (e.g., garments, accessories) for 3D assembly, which is prone to inconsistency, we avoid decomposition and directly model the full-body appearance. By integrating a pre-trained ControlNet for pose estimation and a novel Condition Prior Preservation Loss (CPPL), our method enables end-to-end learning of fine details while mitigating language drift in few-shot training. Our method completes personalization in just 5 minutes, achieving a 48x speed-up compared to previous approaches. In the second stage, we introduce a NeRF-based avatar representation optimized by canonical SMPL-X space sampling and Multi-Resolution 3D-SDS. Compared to mesh-based representations that suffer from resolution-dependent discretization and erroneous occluded geometry, our continuous radiance field can preserve high-frequency textures (e.g., hair) and handle occlusions correctly through transmittance. Experiments demonstrate that PFAvatar outperforms state-of-the-art methods in terms of reconstruction fidelity, detail preservation, and robustness to occlusions/truncations, advancing practical 3D avatar generation from real-world OOTD albums. In addition, the reconstructed 3D avatar supports downstream applications such as virtual try-on, animation, and human video reenactment, further demonstrating the versatility and practical value of our approach.
我们提出了PFAvatar(姿态融合化身),这是一种新的方法,可以从日常穿搭(OOTD)照片重建高质量的三维化身。这些照片展示了多种姿态、遮挡和复杂背景。我们的方法分为两个阶段:(1)通过少量OOTD示例对姿态感知扩散模型进行微调;(2)通过神经辐射场(NeRF)表示3D化身并进行蒸馏。在第一阶段,不同于以往将图像分割成资产(如服装、配饰)进行3D组装的方法,这种方法容易导致不一致性。我们避免分解,直接对全身外观进行建模。通过集成预训练的ControlNet进行姿态估计和新颖的条件先验保留损失(CPPL),我们的方法能够在端到端学习中精细细节,同时减轻少量训练中的语言漂移。我们的方法在仅5分钟内完成个性化设置,与以前的方法相比,速度提高了48倍。在第二阶段,我们引入了一种基于NeRF的化身表示,通过标准SMPL-X空间采样和多分辨率3D-SDS进行优化。与基于网格的表示相比,后者受到分辨率依赖的离散化和遮挡几何错误的影响,我们的连续辐射场能够保留高频纹理(例如头发)并通过透光率正确处理遮挡。实验表明,PFAvatar在重建保真度、细节保留和抗遮挡/截断方面优于最先进的方法,推动了从现实世界OOTD相册中生成实用三维化身的发展。此外,重建的3D化身支持下游应用,如虚拟试穿、动画和人类视频复现,进一步证明了我们方法的通用性和实用价值。
论文及项目相关链接
PDF Accepted by AAAI 2026
Summary
提出一种名为PFAvatar的新方法,从日常穿搭照片重建高质量3D头像。方法分为两个阶段:第一阶段通过微调姿态感知扩散模型进行少数日常穿搭示例学习;第二阶段通过神经辐射场表示3D头像。该方法避免了资产分解的不一致性问题,直接建模全身外观,实现个性化仅需5分钟,速度提升48倍。第二阶段采用基于NeRF的头像表示法,通过规范SMPL-X空间采样和多分辨率3D-SDS进行优化,能保留高频纹理并正确处理遮挡。PFAvatar方法提高了重建的保真度、细节保留以及对遮挡的鲁棒性,推动了实用3D头像从现实世界的日常穿搭相册生成的技术进步。
Key Takeaways
- PFAvatar是一种从日常穿搭照片重建3D头像的新方法。
- 方法分为两个阶段:微调姿态感知扩散模型和采用神经辐射场表示3D头像。
- 避免资产分解的不一致性问题,直接建模全身外观。
- 个性化完成仅需5分钟,速度提升显著。
- 采用NeRF-based头像表示法,能保留高频纹理并正确处理遮挡。
- PFAvatar在重建的保真度、细节保留以及对遮挡的鲁棒性方面表现优越。
点此查看论文截图
VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization
Authors:Youpeng Li, Fuxun Yu, Xinda Wang
The widespread reliance on open-source software dramatically increases the risk of vulnerability exploitation, underscoring the need for effective and scalable vulnerability detection (VD). Existing VD techniques, whether traditional machine learning-based or LLM-based approaches like prompt engineering, supervised fine-tuning, or off-policy preference optimization, remain fundamentally limited in their ability to perform context-aware analysis: They depend on fixed inputs or static preference datasets, cannot adaptively explore repository-level dependencies, and are constrained by function-level benchmarks that overlook critical vulnerability context. This paper introduces Vulnerability-Adaptive Policy Optimization (VULPO), an on-policy LLM reinforcement learning framework for context-aware VD. To support training and evaluation, we first construct ContextVul, a new dataset that augments high-quality function-level samples with lightweight method to extract repository-level context information. We then design multi-dimensional reward structuring that jointly captures prediction correctness, vulnerability localization accuracy, and the semantic relevance of vulnerability analysis, thereby guiding the model toward comprehensive contextual reasoning. To address the asymmetric difficulty of different vulnerability cases and mitigate reward hacking, VULPO incorporates label-level and sample-level difficulty-adaptive reward scaling, encouraging the model to explore challenging cases while maintaining balanced reward distribution. Extensive experiments demonstrate the superiority of our VULPO framework in context-aware VD: Our VULPO-4B substantially outperforms existing VD baselines based on prompt engineering and off-policy optimization, improving F1 by 85% over Qwen3-4B and achieving performance comparable to a 150x larger-scale model, DeepSeek-R1-0528.
对开源软件的广泛依赖显著增加了漏洞利用的风险,突显了对有效且可扩展的漏洞检测(VD)的需求。现有的VD技术,无论是基于传统机器学习的方法,还是基于大型语言模型(LLM)的方法(如提示工程、监督微调或离策略偏好优化),在执行上下文感知分析方面的能力都存在根本局限:它们依赖于固定输入或静态偏好数据集,无法自适应地探索仓库级别的依赖关系,并受限于忽略关键漏洞上下文的函数级基准。本文提出漏洞自适应策略优化(VULPO),一个面向上下文感知VD的同策略(on-policy)LLM强化学习框架。为支持训练与评估,我们首先构建了ContextVul,一个通过轻量方法为高质量函数级样本补充仓库级上下文信息的新数据集。随后我们设计了多维奖励结构,联合刻画预测正确性、漏洞定位准确性以及漏洞分析的语义相关性,从而引导模型进行全面的上下文推理。为应对不同漏洞案例难度不对称的问题并缓解奖励作弊,VULPO引入了标签级与样本级的难度自适应奖励缩放,鼓励模型探索困难案例,同时保持奖励分布的均衡。大量实验证明了VULPO框架在上下文感知VD中的优越性:我们的VULPO-4B大幅超越基于提示工程和离策略优化的现有VD基线,F1较Qwen3-4B提升85%,并取得与规模大150倍的模型DeepSeek-R1-0528相当的性能。
论文及项目相关链接
Summary
开源软件的广泛应用增加了漏洞利用的风险,凸显了有效且可扩展的漏洞检测(VD)的重要性。现有VD技术存在局限性,无法执行上下文感知分析。本文提出一种基于策略的LLM强化学习框架VULPO,用于上下文感知的VD。通过构建ContextVul数据集和支持多维奖励结构设计,VULPO实现了预测正确性、漏洞定位准确性和语义相关性的综合考量,并设计了难度自适应的奖励缩放机制。实验表明,VULPO框架在上下文感知VD方面的表现卓越。
Key Takeaways
- 开源软件的广泛应用增加了漏洞利用的风险,凸显了改进漏洞检测(VD)技术的必要性。
- 现有VD技术存在局限性,无法执行上下文感知分析,需要新的方法来解决这个问题。
- VULPO是一种基于策略的LLM强化学习框架,用于上下文感知的VD。
- ContextVul数据集的构建支持VULPO的训练和评估。
- VULPO通过多维奖励结构设计,实现了预测正确性、漏洞定位准确性和语义相关性的综合考量。
- VULPO采用了难度自适应的奖励缩放机制,以处理不同的漏洞案例的不对称难度并防止奖励作弊。
- 实验表明,VULPO框架在上下文感知VD方面的表现卓越,显著优于现有方法。
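"多维奖励 + 难度自适应缩放"的打分方式可以用如下示意(非论文实现;各分项奖励的定义与缩放系数均为假设):

```python
def vulnerability_reward(pred_label, gold_label, loc_overlap, analysis_sim,
                         sample_difficulty, w=(1.0, 0.5, 0.5)):
    """
    pred_label / gold_label: 是否存在漏洞的预测与标注
    loc_overlap:   预测漏洞行与标注行的重叠度(0~1),刻画定位准确性
    analysis_sim:  漏洞分析文本与参考分析的语义相似度(0~1)
    sample_difficulty: 0~1,越大表示样本越难;难样本的正确回答获得更高奖励,
                       以鼓励模型探索困难案例并缓解奖励作弊。
    """
    correct = 1.0 if pred_label == gold_label else 0.0
    base = w[0] * correct + w[1] * loc_overlap + w[2] * analysis_sim
    scale = 1.0 + sample_difficulty if correct else 1.0   # 难度自适应缩放(假设的简化形式)
    return base * scale

print(vulnerability_reward(1, 1, loc_overlap=0.6, analysis_sim=0.7, sample_difficulty=0.8))
print(vulnerability_reward(0, 1, loc_overlap=0.0, analysis_sim=0.3, sample_difficulty=0.8))
```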
点此查看论文截图
Efficient Reinforcement Learning for Zero-Shot Coordination in Evolving Games
Authors:Bingyu Hui, Lebin Yu, Quanming Yao, Yunpeng Qu, Xudong Zhang, Jian Wang
Zero-shot coordination(ZSC), a key challenge in multi-agent game theory, has become a hot topic in reinforcement learning (RL) research recently, especially in complex evolving games. It focuses on the generalization ability of agents, requiring them to coordinate well with collaborators from a diverse, potentially evolving, pool of partners that are not seen before without any fine-tuning. Population-based training, which approximates such an evolving partner pool, has been proven to provide good zero-shot coordination performance; nevertheless, existing methods are limited by computational resources, mainly focusing on optimizing diversity in small populations while neglecting the potential performance gains from scaling population size. To address this issue, this paper proposes the Scalable Population Training (ScaPT), an efficient RL training framework comprising two key components: a meta-agent that efficiently realizes a population by selectively sharing parameters across agents, and a mutual information regularizer that guarantees population diversity. To empirically validate the effectiveness of ScaPT, this paper evaluates it along with representational frameworks in Hanabi cooperative game and confirms its superiority.
零样本协调(ZSC)是多智能体博弈理论中的一项关键挑战,最近已成为强化学习(RL)研究中的热点话题,特别是在复杂演化博弈中。它关注智能体的泛化能力,要求它们无需任何微调即可与来自多样化的、可能不断演化的、此前未见过的伙伴池中的合作者进行良好协调。基于种群的训练近似于这种不断演化的伙伴池,已被证明可以提供良好的零样本协调性能;然而,现有方法受限于计算资源,主要集中在优化小种群中的多样性,而忽略了扩大种群规模可能带来的潜在性能提升。针对这一问题,本文提出了可扩展种群训练(ScaPT),这是一种高效的RL训练框架,包含两个关键组成部分:一种通过在智能体之间选择性共享参数来高效实现种群的元智能体,以及一个保证种群多样性的互信息调节器。为了实证验证ScaPT的有效性,本文在Hanabi合作游戏中将其与具有代表性的框架一并进行了评估,并证实了其优越性。
论文及项目相关链接
Summary
在强化学习研究中,零样本协调(ZSC)是一个关键挑战,特别是在复杂多变的博弈环境中。本文主要介绍了可扩展种群训练(ScaPT)框架,包括元代理和相互信息调节器两个关键组件,旨在解决现有方法计算资源有限的问题,通过选择性共享参数和保证种群多样性来提高零样本协调性能。通过Hanabi合作游戏中的实验验证,证明了ScaPT的优越性。
Key Takeaways
- 零样本协调(ZSC)是强化学习中的关键挑战,尤其在复杂多变的博弈环境中。
- 种群训练可模拟零样本协调的伙伴池,现有的方法主要关注小种群多样性的优化。
- 本文提出可扩展种群训练(ScaPT)框架以解决计算资源问题,包括选择性共享参数的元代理和保证种群多样性的相互信息调节器。
- 元代理能高效实现种群构建。通过参数共享降低计算复杂性。
- 相互信息调节器确保种群多样性,有助于提高零样本协调性能。
- 在Hanabi合作游戏中验证了ScaPT框架的优越性,实验结果表明其在提高零样本协调性能方面具有潜力。