⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only and should be used with caution.
🔴 Please note: do not rely on these summaries in serious academic settings; they are only meant for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-26
LumiTex: Towards High-Fidelity PBR Texture Generation with Illumination Context
Authors:Jingzhi Bao, Hongze Chen, Lingting Zhu, Chenyu Liu, Runze Zhang, Keyang Luo, Zeyu Hu, Weikai Chen, Yingda Yin, Xin Wang, Zehong Lin, Jun Zhang, Xiaoguang Han
Physically-based rendering (PBR) provides a principled standard for realistic material-lighting interactions in computer graphics. Despite recent advances in generating PBR textures, existing methods fail to address two fundamental challenges: 1) materials decomposition from image prompts under limited illumination cues, and 2) seamless and view-consistent texture completion. To this end, we propose LumiTex, an end-to-end framework that comprises three key components: (1) a multi-branch generation scheme that disentangles albedo and metallic-roughness under shared illumination priors for robust material understanding, (2) a lighting-aware material attention mechanism that injects illumination context into the decoding process for physically grounded generation of albedo, metallic, and roughness maps, and (3) a geometry-guided inpainting module based on a large view synthesis model that enriches texture coverage and ensures seamless, view-consistent UV completion. Extensive experiments demonstrate that LumiTex achieves state-of-the-art performance in texture quality, surpassing both existing open-source and commercial methods.
Paper and project links
PDF Project page: https://lumitex.vercel.app
Summary
Physically based rendering (PBR) provides a principled standard for realistic material-lighting interactions in computer graphics. Despite recent progress in PBR texture generation, existing methods still face two major challenges: decomposing materials from image prompts under limited illumination cues, and achieving seamless, view-consistent texture completion. LumiTex is an end-to-end framework with three key components: a multi-branch generation scheme, a lighting-aware material attention mechanism, and a geometry-guided inpainting module. It injects illumination context into the decoding process to produce physically grounded albedo, metallic, and roughness maps, and uses a large view synthesis model to enrich texture coverage and ensure seamless, view-consistent UV completion. Experiments show that LumiTex achieves state-of-the-art texture quality, surpassing existing open-source and commercial methods.
Key Takeaways
- Physically based rendering (PBR) provides a principled standard for realistic material-lighting interactions in computer graphics.
- Existing PBR texture generation methods face two major challenges: material decomposition under limited illumination cues, and seamless, view-consistent texture completion.
- LumiTex comprises three key components: a multi-branch generation scheme, a lighting-aware material attention mechanism, and a geometry-guided inpainting module.
- LumiTex injects illumination context into the decoding process to generate physically grounded albedo, metallic, and roughness maps.
- LumiTex uses a large view synthesis model to enrich texture coverage.
- LumiTex achieves seamless, view-consistent UV completion.
VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection
Authors:Qiang Wang, Xinyuan Gao, SongLin Dong, Jizhou Han, Jiangyang Li, Yuhang He, Yihong Gong
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
Paper and project links
Summary
VDC-Agent is a self-evolving framework for video detailed captioning that requires neither human annotations nor larger teacher models. It forms a closed loop of caption generation, principle-guided scoring, and prompt refinement, and uses the automatically constructed VDC-Agent-19K dataset with an easy-to-hard curriculum direct preference optimization to fine-tune the base MLLM, reaching state-of-the-art performance on the VDC benchmark.
Key Takeaways
- VDC-Agent is a self-evolving video detailed captioning framework that needs neither human annotations nor larger teacher models.
- The framework forms a closed loop of caption generation, principle-guided scoring, and prompt refinement.
- When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update.
- Running this process on unlabeled videos produces trajectories of (caption, score) pairs.
- Converting the trajectories into preference tuples and filtering out samples with JSON parsing errors yields the VDC-Agent-19K dataset (the preference-pair construction is sketched below).
- Fine-tuning the base MLLM with an easy-to-hard curriculum direct preference optimization yields the VDC-Agent-7B model.
- VDC-Agent-7B reaches state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and a 2.50 score, improving over specialized video captioners and over the base model at similar inference cost.
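As a rough illustration of the data-construction step described above, the sketch below turns one (caption, score) trajectory into preference tuples and orders them easy-to-hard, using the score gap as a difficulty proxy. The field names, the pairing rule, and the gap-based curriculum criterion are assumptions for illustration, not the paper's released pipeline.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Step:
    caption: str
    score: float  # principle-guided score assigned by the agent

def to_preference_tuples(trajectory, min_gap=0.0):
    """Turn one (caption, score) trajectory into preference pairs:
    the higher-scored caption is 'chosen', the lower-scored one 'rejected'."""
    pairs = []
    for a, b in combinations(trajectory, 2):
        if abs(a.score - b.score) <= min_gap:
            continue  # skip ties / near-ties
        chosen, rejected = (a, b) if a.score > b.score else (b, a)
        pairs.append({"chosen": chosen.caption, "rejected": rejected.caption,
                      "gap": abs(a.score - b.score)})
    return pairs

def easy_to_hard(pairs):
    """Curriculum ordering: assume a larger score gap means an easier pair,
    so large-gap pairs come first and small-gap pairs later."""
    return sorted(pairs, key=lambda p: -p["gap"])

if __name__ == "__main__":
    traj = [Step("A person walks.", 1.2),
            Step("A person in a red coat walks a dog.", 2.4),
            Step("A person in a red coat walks a small dog down a rainy street.", 2.9)]
    for p in easy_to_hard(to_preference_tuples(traj)):
        print(round(p["gap"], 2), "|", p["chosen"][:40], ">", p["rejected"][:40])
```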
In-Video Instructions: Visual Signals as Generative Control
Authors:Gongfan Fang, Xinyin Ma, Xinchao Wang
Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
Paper and project links
Summary
Large video generative models exhibit strong visual capabilities and can predict future frames that follow the logical and physical cues in the current observation. This work investigates whether those capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded in the frames as instructions, a paradigm termed In-Video Instruction. Unlike prompt-based control, whose textual descriptions are inherently global and coarse, In-Video Instruction encodes user guidance directly in the visual domain through overlaid text, arrows, or trajectories, establishing explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Experiments on several state-of-the-art generators show that video models can reliably interpret and execute such visually embedded instructions, especially in complex multi-object scenes.
Key Takeaways
- Large video generative models can predict future frames that follow the logical and physical cues in the current observation.
- The work proposes In-Video Instruction: visual signals embedded in the frames act as instructions for controlling image-to-video generation.
- In-Video Instruction encodes user guidance directly in the visual domain.
- Instructions are expressed through overlaid text, arrows, or trajectories, establishing explicit correspondences between visual subjects and their intended actions (a minimal drawing sketch follows below).
- Video models can reliably interpret and execute these visually embedded instructions, especially in complex multi-object scenarios.
- Compared with prompt-based control, In-Video Instruction is more spatially aware and less ambiguous.
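Because the paradigm places instructions directly into the conditioning frame, preparing such a frame is essentially a drawing step. The minimal Pillow sketch below overlays an arrow and a per-object text label on an image before it would be passed to an image-to-video model; the arrow geometry, colors, and label format are illustrative assumptions rather than the paper's protocol.

```python
from PIL import Image, ImageDraw

def add_in_video_instruction(frame, start, end, label):
    """Overlay an arrow from `start` to `end` plus a short per-object text label on the
    conditioning frame, so the instruction lives in the visual domain rather than the prompt."""
    img = frame.copy()
    draw = ImageDraw.Draw(img)
    draw.line([start, end], fill=(255, 0, 0), width=6)                 # arrow shaft
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    n = max((dx * dx + dy * dy) ** 0.5, 1e-6)
    ux, uy = dx / n, dy / n                                            # unit direction
    head = [(x1, y1),
            (x1 - 25 * ux - 12 * uy, y1 - 25 * uy + 12 * ux),
            (x1 - 25 * ux + 12 * uy, y1 - 25 * uy - 12 * ux)]
    draw.polygon(head, fill=(255, 0, 0))                               # arrow head
    draw.text((x0, max(y0 - 24, 0)), label, fill=(255, 255, 0))        # per-object instruction text
    return img

if __name__ == "__main__":
    frame = Image.new("RGB", (512, 512), "gray")                       # stand-in for a real first frame
    edited = add_in_video_instruction(frame, (100, 400), (400, 150), "dog: run to the tree")
    edited.save("instruction_frame.png")                               # this frame conditions the video model
```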
Rethinking Intermediate Representation for VLM-based Robot Manipulation
Authors:Weiliang Tang, Jialin Gao, Jia-Hui Pan, Gang Wang, Li Erran Li, Yunhui Liu, Mingyu Ding, Pheng-Ann Heng, Chi-Wing Fu
Vision-Language Model (VLM) is an important component to enable robust robot manipulation. Yet, using it to translate human instructions into an action-resolvable intermediate representation often needs a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design the Semantic Assembly representation named SEAM, by decomposing the intermediate representation into vocabulary and grammar. Doing so leads us to a concise vocabulary of semantically-rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, effectively with the shortest inference time over all state-of-the-art parallel works. Also, we formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects. Extensive real-world experiments further manifest its SOTA performance under varying settings and tasks.
Paper and project links
Summary
Inspired by context-free grammar, the authors design a Semantic Assembly representation named SEAM that decomposes the intermediate representation into vocabulary and grammar, yielding a concise vocabulary of semantically rich operations and a VLM-friendly grammar for handling diverse unseen tasks. A new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy localizes fine-grained object parts for manipulation with the shortest inference time among state-of-the-art parallel works. New metrics for action generalizability and VLM comprehensibility are also formulated, and SEAM shows strong performance across varying settings and tasks.
Key Takeaways
- SEAM is a Semantic Assembly representation inspired by context-free grammar.
- Decomposing the intermediate representation into vocabulary and grammar gives the VLM a concise set of semantically rich operations and a friendly grammar.
- A new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy localizes fine-grained object parts for manipulation.
- New metrics for action generalizability and VLM comprehensibility are formulated.
- SEAM shows strong performance across settings and tasks, and extensive real-world experiments confirm its state-of-the-art results.
- SEAM is designed to balance VLM comprehensibility and generalizability.
BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
Authors:Dewei Zhou, Mingwei Li, Zongxin Yang, Yu Lu, Yunqiu Xu, Zhizhong Wang, Zeyi Huang, Yi Yang
Conditional image generation enhances text-to-image synthesis with structural, spatial, or stylistic priors, but current methods face challenges in handling conflicts between sources. These include 1) input-level conflicts, where the conditioning image contradicts the text prompt, and 2) model-bias conflicts, where generative biases disrupt alignment even when conditions match the text. Addressing these conflicts requires nuanced solutions, which standard supervised fine-tuning struggles to provide. Preference-based optimization techniques like Direct Preference Optimization (DPO) show promise but are limited by gradient entanglement between text and condition signals and lack disentangled training data for multi-constraint tasks. To overcome this, we propose a bidirectionally decoupled DPO framework (BideDPO). Our method creates two disentangled preference pairs-one for the condition and one for the text-to reduce gradient entanglement. The influence of pairs is managed using an Adaptive Loss Balancing strategy for balanced optimization. We introduce an automated data pipeline to sample model outputs and generate conflict-aware data. This process is embedded in an iterative optimization strategy that refines both the model and the data. We construct a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments show BideDPO significantly improves text success rates (e.g., +35%) and condition adherence. We also validate our approach using the COCO dataset. Project Pages: https://limuloo.github.io/BideDPO/.
Paper and project links
PDF 29 pages
Summary
To handle input-level conflicts and model-bias conflicts in conditional image generation, the paper proposes a bidirectionally decoupled DPO framework (BideDPO). It creates two disentangled preference pairs, one for the condition and one for the text, to reduce gradient entanglement, and manages their influence with an Adaptive Loss Balancing strategy. An automated data pipeline samples model outputs and generates conflict-aware data, embedded in an iterative optimization strategy that refines both the model and the data. On the DualAlign benchmark, BideDPO significantly improves text success rates (e.g., +35%) and condition adherence, and the approach is also validated on the COCO dataset.
Key Takeaways
- Conditional image generation faces both input-level conflicts and model-bias conflicts.
- Existing preference-based methods such as Direct Preference Optimization (DPO) suffer from gradient entanglement between text and condition signals in multi-constraint tasks.
- BideDPO reduces gradient entanglement by creating two disentangled preference pairs, one for the condition and one for the text (a loss sketch follows below).
- An Adaptive Loss Balancing strategy manages the influence of the two pairs for balanced optimization.
- An automated data pipeline generates conflict-aware data and iteratively refines both the model and the data.
- BideDPO significantly improves text success rates and condition adherence.
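A minimal sketch of the decoupled-preference idea, assuming the generic DPO formulation on per-sample log-likelihoods (for diffusion models the actual objective is typically a Diffusion-DPO variant) and a simple inverse-magnitude rule standing in for Adaptive Loss Balancing. None of this reproduces BideDPO's exact losses; it only shows how two disentangled preference pairs can be combined with adaptive weights.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Generic DPO objective on per-sample log-likelihoods from the policy and a frozen reference."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

def decoupled_preference_loss(text_pair, cond_pair, eps=1e-8):
    """Two disentangled preference pairs, one for text alignment and one for condition
    adherence, combined with a simple adaptive weighting (assumed: inverse to each loss's
    current magnitude so neither term dominates)."""
    l_text = dpo_loss(*text_pair)
    l_cond = dpo_loss(*cond_pair)
    with torch.no_grad():
        w_text, w_cond = 1.0 / (l_text + eps), 1.0 / (l_cond + eps)
        z = w_text + w_cond
    return (w_text / z) * l_text + (w_cond / z) * l_cond

if __name__ == "__main__":
    g = torch.Generator().manual_seed(0)
    rnd = lambda: torch.randn(4, generator=g)   # stand-in log-likelihoods for a batch of 4
    text_pair = (rnd(), rnd(), rnd(), rnd())    # (chosen, rejected, ref_chosen, ref_rejected) for text
    cond_pair = (rnd(), rnd(), rnd(), rnd())    # (chosen, rejected, ref_chosen, ref_rejected) for condition
    print(decoupled_preference_loss(text_pair, cond_pair).item())
```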
DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
Authors:Hai Ci, Ziheng Peng, Pei Yang, Yingxin Xuan, Mike Zheng Shou
Diffusion-based editing enables realistic modification of local image regions, making AI-generated content harder to detect. Existing AIGC detection benchmarks focus on classifying entire images, overlooking the localization of diffusion-based edits. We introduce DiffSeg30k, a publicly available dataset of 30k diffusion-edited images with pixel-level annotations, designed to support fine-grained detection. DiffSeg30k features: 1) In-the-wild images–we collect images or image prompts from COCO to reflect real-world content diversity; 2) Diverse diffusion models–local edits using eight SOTA diffusion models; 3) Multi-turn editing–each image undergoes up to three sequential edits to mimic real-world sequential editing; and 4) Realistic editing scenarios–a vision-language model (VLM)-based pipeline automatically identifies meaningful regions and generates context-aware prompts covering additions, removals, and attribute changes. DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. We benchmark three baseline segmentation approaches, revealing significant challenges in semantic segmentation tasks, particularly concerning robustness to image distortions. Experiments also reveal that segmentation models, despite being trained for pixel-level localization, emerge as highly reliable whole-image classifiers of diffusion edits, outperforming established forgery classifiers while showing great potential in cross-generator generalization. We believe DiffSeg30k will advance research in fine-grained localization of AI-generated content by demonstrating the promise and limitations of segmentation-based methods. DiffSeg30k is released at: https://huggingface.co/datasets/Chaos2629/Diffseg30k
Paper and project links
PDF 16 pages, 10 figures
Summary
Diffusion-based editing enables realistic modification of local image regions, making AI-generated content harder to detect, while existing AIGC detection benchmarks classify whole images and overlook the localization of diffusion-based edits. DiffSeg30k is a dataset of 30k diffusion-edited images with pixel-level annotations designed to support fine-grained detection. It features in-the-wild images and prompts collected from COCO, local edits made with eight state-of-the-art diffusion models, up to three sequential edits per image, and a VLM-based pipeline that automatically identifies meaningful regions and generates context-aware prompts covering additions, removals, and attribute changes. DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. Benchmarking three baseline segmentation approaches reveals significant challenges, particularly robustness to image distortions. Experiments also show that segmentation models, despite being trained for pixel-level localization, emerge as highly reliable whole-image classifiers of diffusion edits (illustrated by the small sketch below), outperforming established forgery classifiers and showing strong cross-generator generalization. The dataset is expected to advance research on fine-grained localization of AI-generated content.
Key Takeaways
- Diffusion-based editing realistically modifies local image regions, making AI-generated content harder to detect.
- Existing AIGC detection focuses on classifying whole images and overlooks the localization of diffusion-based edits.
- DiffSeg30k provides 30k diffusion-edited images with pixel-level annotations to support fine-grained detection.
- The dataset features COCO-sourced images, edits from eight diffusion models, multi-turn editing, and VLM-based automatic region identification and prompt generation.
- DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling both edit localization and identification of the editing model.
- Benchmarking three baseline segmentation approaches reveals significant challenges, particularly robustness to image distortions.
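The finding that a segmentation model doubles as a whole-image classifier can be illustrated by thresholding the predicted fraction of edited pixels; the mask convention (1 = edited) and the threshold value below are assumptions.

```python
import numpy as np

def edited_fraction(pred_mask: np.ndarray) -> float:
    """Fraction of pixels predicted as diffusion-edited (mask convention: 1 = edited, 0 = pristine)."""
    return float(pred_mask.astype(bool).mean())

def is_edited(pred_mask: np.ndarray, threshold: float = 0.01) -> bool:
    """Flag the whole image as AI-edited if enough of its pixels are predicted as edited."""
    return edited_fraction(pred_mask) >= threshold

if __name__ == "__main__":
    mask = np.zeros((256, 256), dtype=np.uint8)
    mask[100:140, 80:160] = 1                      # a localized diffusion edit
    print(edited_fraction(mask), is_edited(mask))
```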
A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval, Disambiguation and Reflective Analysis
Authors:Wenxuan Mu, Jinzhong Ning, Di Zhao, Yijia Zhang
In-context learning (ICL) with large language models (LLMs) has emerged as a promising paradigm for named entity recognition (NER) in low-resource scenarios. However, existing ICL-based NER methods suffer from three key limitations: (1) reliance on dynamic retrieval of annotated examples, which is problematic when annotated data is scarce; (2) limited generalization to unseen domains due to the LLM’s insufficient internal domain knowledge; and (3) failure to incorporate external knowledge or resolve entity ambiguities. To address these challenges, we propose KDR-Agent, a novel multi-agent framework for multi-domain low-resource in-context NER that integrates Knowledge retrieval, Disambiguation, and Reflective analysis. KDR-Agent leverages natural-language type definitions and a static set of entity-level contrastive demonstrations to reduce dependency on large annotated corpora. A central planner coordinates specialized agents to (i) retrieve factual knowledge from Wikipedia for domain-specific mentions, (ii) resolve ambiguous entities via contextualized reasoning, and (iii) reflect on and correct model predictions through structured self-assessment. Experiments across ten datasets from five domains demonstrate that KDR-Agent significantly outperforms existing zero-shot and few-shot ICL baselines across multiple LLM backbones. The code and data can be found at https://github.com/MWXGOD/KDR-Agent.
Paper and project links
PDF This paper has been accepted by AAAI 2026 (Main Technical Track)
Summary
In-context learning (ICL) with large language models is promising for named entity recognition (NER) in low-resource scenarios, but existing methods rely on dynamically retrieved annotated examples, generalize poorly to unseen domains, and fail to resolve entity ambiguities. KDR-Agent is a multi-agent framework for multi-domain low-resource in-context NER that integrates knowledge retrieval, disambiguation, and reflective analysis, reducing dependence on large annotated corpora. Experiments on ten datasets from five domains show that KDR-Agent significantly outperforms existing zero-shot and few-shot ICL baselines across multiple LLM backbones.
Key Takeaways
- ICL with LLMs shows promise for NER under limited resources.
- Current ICL-based NER methods have three key limitations: reliance on dynamically retrieved annotated examples, limited cross-domain generalization, and failure to resolve entity ambiguities.
- KDR-Agent is a multi-agent framework for multi-domain low-resource in-context NER.
- KDR-Agent integrates knowledge retrieval, disambiguation, and reflective analysis.
- KDR-Agent uses natural-language type definitions and a static set of entity-level contrastive demonstrations, reducing dependence on large annotated corpora.
- A central planner coordinates specialized agents that retrieve knowledge, resolve ambiguous entities, and reflect on and correct predictions.
- Experiments show that KDR-Agent significantly outperforms existing methods across multiple datasets.
DEAP-3DSAM: Decoder Enhanced and Auto Prompt SAM for 3D Medical Image Segmentation
Authors:Fangda Chen, Jintao Tang, Pancheng Wang, Ting Wang, Shasha Li, Ting Deng
The Segment Anything Model (SAM) has recently demonstrated significant potential in medical image segmentation. Although SAM is primarily trained on 2D images, attempts have been made to apply it to 3D medical image segmentation. However, the pseudo 3D processing used to adapt SAM results in spatial feature loss, limiting its performance. Additionally, most SAM-based methods still rely on manual prompts, which are challenging to implement in real-world scenarios and require extensive external expert knowledge. To address these limitations, we introduce the Decoder Enhanced and Auto Prompt SAM (DEAP-3DSAM) to tackle these limitations. Specifically, we propose a Feature Enhanced Decoder that fuses the original image features with rich and detailed spatial information to enhance spatial features. We also design a Dual Attention Prompter to automatically obtain prompt information through Spatial Attention and Channel Attention. We conduct comprehensive experiments on four public abdominal tumor segmentation datasets. The results indicate that our DEAP-3DSAM achieves state-of-the-art performance in 3D image segmentation, outperforming or matching existing manual prompt methods. Furthermore, both quantitative and qualitative ablation studies confirm the effectiveness of our proposed modules.
Paper and project links
PDF Accepted by BIBM 2024
Summary
SAM shows great potential for medical image segmentation, but the pseudo-3D processing used to adapt it to 3D data loses spatial features, and most SAM-based methods still depend on manual prompts that are hard to provide in practice. DEAP-3DSAM addresses this with a Feature Enhanced Decoder that fuses the original image features with rich, detailed spatial information, and a Dual Attention Prompter that obtains prompt information automatically through spatial and channel attention. Experiments on four public abdominal tumor segmentation datasets show that DEAP-3DSAM achieves state-of-the-art 3D segmentation performance.
Key Takeaways
- SAM has significant potential for medical image segmentation, but adapting it from 2D to 3D causes spatial feature loss.
- Most existing SAM-based methods still rely on manual prompts, which hinders automation and practical deployment.
- DEAP-3DSAM is introduced to address these limitations of SAM for 3D image segmentation.
- A Feature Enhanced Decoder fuses the original image features with rich spatial information to enhance spatial features.
- A Dual Attention Prompter automatically obtains prompt information through spatial attention and channel attention (a toy module is sketched below).
- On four public abdominal tumor segmentation datasets, DEAP-3DSAM outperforms or matches existing manual-prompt methods.
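A compact PyTorch sketch of a dual-attention block in the spirit of the Dual Attention Prompter: channel attention followed by spatial attention over a feature map, CBAM-style. The pooling choices, reduction ratio, and how the attended features would be turned into SAM prompt embeddings are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Channel attention followed by spatial attention over a feature map (B, C, H, W)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: squeeze spatial dims (avg and max), excite channels.
        ca = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3))) +
                           self.channel_mlp(x.amax(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ca
        # Spatial attention: pool over channels, predict a per-pixel gate.
        sa_in = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.spatial_conv(sa_in))
        return x  # attended features; a prompter head could pool these into prompt tokens

if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)
    print(DualAttention(64)(feats).shape)   # torch.Size([2, 64, 32, 32])
```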
A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation
Authors:Wentao Qu, Guofeng Mei, Yang Wu, Yongshun Gong, Xiaoshui Huang, Liang Xiao
Text-to-LiDAR generation can customize 3D data with rich structures and diverse scenes for downstream tasks. However, the scarcity of Text-LiDAR pairs often causes insufficient training priors, generating overly smooth 3D scenes. Moreover, low-quality text descriptions may degrade generation quality and controllability. In this paper, we propose a Text-to-LiDAR Diffusion Model for scene generation, named T2LDM, with a Self-Conditioned Representation Guidance (SCRG). Specifically, SCRG, by aligning to the real representations, provides the soft supervision with reconstruction details for the Denoising Network (DN) in training, while decoupled in inference. In this way, T2LDM can perceive rich geometric structures from data distribution, generating detailed objects in scenes. Meanwhile, we construct a content-composable Text-LiDAR benchmark, T2nuScenes, along with a controllability metric. Based on this, we analyze the effects of different text prompts for LiDAR generation quality and controllability, providing practical prompt paradigms and insights. Furthermore, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, by learning a conditional encoder via frozen DN, T2LDM can support multiple conditional tasks, including Sparse-to-Dense, Dense-to-Sparse, and Semantic-to-LiDAR generation. Extensive experiments in unconditional and conditional generation demonstrate that T2LDM outperforms existing methods, achieving state-of-the-art scene generation.
Paper and project links
Summary
The paper proposes T2LDM, a text-to-LiDAR diffusion model for scene generation with Self-Conditioned Representation Guidance (SCRG), which lets the model perceive rich geometric structures from the data distribution and generate detailed objects in scenes. The authors also build a content-composable text-LiDAR benchmark, T2nuScenes, together with a controllability metric, analyze how different text prompts affect generation quality and controllability, and design a directional position prior to mitigate street distortion. By learning a conditional encoder on a frozen denoising network, T2LDM further supports sparse-to-dense, dense-to-sparse, and semantic-to-LiDAR generation. Experiments show that T2LDM outperforms existing methods and achieves state-of-the-art scene generation.
Key Takeaways
- T2LDM uses Self-Conditioned Representation Guidance (SCRG) to perceive rich geometric structures and generate detailed objects in scenes.
- A content-composable text-LiDAR benchmark, T2nuScenes, and a controllability metric are built to analyze how different text prompts affect LiDAR generation quality and controllability.
- A directional position prior is designed to mitigate street distortion and improve scene fidelity.
- T2LDM supports multiple conditional tasks, including sparse-to-dense, dense-to-sparse, and semantic-to-LiDAR generation.
- Experiments on unconditional and conditional generation show that T2LDM outperforms existing methods and achieves state-of-the-art scene generation.
- SCRG provides soft supervision with reconstruction details during training while remaining decoupled at inference.
Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations
Authors:Ryan Wong, Hosea David Yu Fei Ng, Dhananjai Sharma, Glenn Jun Jie Ng, Kavishvaran Srinivasan
Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, followed by three proposed defense strategies. First, a Prompt-Level Defense Framework detects and neutralizes adversarial inputs through sanitization, paraphrasing, and adaptive system guarding. Second, a Logit-Based Steering Defense reinforces refusal behavior through inference-time vector steering in safety-sensitive layers. Third, a Domain-Specific Agent Defense employs the MetaGPT framework to enforce structured, role-based collaboration and domain adherence. Experiments on benchmark datasets show substantial reductions in attack success rate, achieving full mitigation under the agent-based defense. Overall, this study highlights how jailbreaks pose a significant security threat to LLMs and identifies key intervention points for prevention, while noting that defense strategies often involve trade-offs between safety, performance, and scalability. Code is available at: https://github.com/Kuro0911/CS5446-Project
Paper and project links
PDF 20 pages including appendix; technical report; NeurIPS 2024 style
Summary
Large language models remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. The paper presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, and proposes three defense strategies: a prompt-level framework that detects and neutralizes adversarial inputs through sanitization, paraphrasing, and adaptive system guarding; a logit-based steering defense that reinforces refusal behavior via inference-time vector steering in safety-sensitive layers; and a domain-specific agent defense built on the MetaGPT framework that enforces structured, role-based collaboration and domain adherence. Experiments on benchmark datasets show substantial reductions in attack success rate, with full mitigation under the agent-based defense, while noting that defenses involve trade-offs between safety, performance, and scalability. Code is available at https://github.com/Kuro0911/CS5446-Project.
Key Takeaways
- Large language models are vulnerable to jailbreak attacks that can induce harmful or unethical behavior.
- The paper provides a taxonomy of jailbreak defenses covering prompt-level, model-level, and training-time interventions.
- Three defenses are proposed: detecting and neutralizing adversarial inputs through sanitization and paraphrasing; reinforcing refusal behavior via inference-time steering in safety-sensitive layers (a steering sketch follows below); and a domain-specific agent defense with structured, role-based collaboration.
- Experiments show the defenses substantially reduce attack success rates, achieving full mitigation under the agent-based defense.
- Defense strategies involve trade-offs between safety, performance, and scalability.
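A minimal sketch of inference-time steering toward refusal: a forward hook shifts a layer's hidden states along a precomputed "refusal" direction. The way the direction is obtained, which layers are safety-sensitive, and the steering strength alpha are assumptions; the toy block below stands in for a real transformer layer.

```python
import torch
import torch.nn as nn

def add_steering_hook(layer: nn.Module, direction: torch.Tensor, alpha: float = 4.0):
    """Register a forward hook that shifts the layer's output along a unit 'refusal' direction."""
    direction = direction / direction.norm()

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)

if __name__ == "__main__":
    block = nn.Linear(16, 16)                 # toy stand-in for a safety-sensitive transformer block
    refusal_dir = torch.randn(16)             # in practice: mean(refusal activations) - mean(comply activations)
    handle = add_steering_hook(block, refusal_dir, alpha=2.0)
    x = torch.randn(1, 5, 16)                 # (batch, tokens, hidden)
    print(block(x).shape)                     # steering is applied inside the forward pass
    handle.remove()
```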
Reproducibility Study of Large Language Model Bayesian Optimization
Authors:Adam Rychert, Gasper Spagnolo, Evgenii Posashkov
In this reproducibility study, we revisit the LLAMBO framework of Daxberger et al. (2024), a prompting-based Bayesian optimization (BO) method that uses large language models as discriminative surrogates and acquisition optimizers via text-only interactions. We replicate the core Bayesmark and HPOBench experiments under the original evaluation protocol, but replace GPT-3.5 with the open-weight Llama 3.1 70B model used for all text encoding components. Our results broadly confirm the main claims of LLAMBO. Contextual warm starting via textual problem and hyperparameter descriptions substantially improves early regret behaviour and reduces variance across runs. LLAMBO’s discriminative surrogate is weaker than GP or SMAC as a pure single task regressor, yet benefits from cross task semantic priors induced by the language model. Ablations that remove textual context markedly degrade predictive accuracy and calibration, while the LLAMBO candidate sampler consistently generates higher quality and more diverse proposals than TPE or random sampling. Experiments with smaller backbones (Gemma 27B, Llama 3.1 8B) yield unstable or invalid predictions, suggesting insufficient capacity for reliable surrogate behaviour. Overall, our study shows that the LLAMBO architecture is robust to changing the language model backbone and remains effective when instantiated with Llama 3.1 70B.
Paper and project links
PDF 7 pages, 8 figures. Reproducibility study of the LLAMBO framework (ICLR 2024). Code: https://github.com/spagnoloG/llambo-reproducibility
Summary
This reproducibility study revisits LLAMBO, a prompting-based Bayesian optimization method that uses large language models as discriminative surrogates and acquisition optimizers through text-only interactions, replacing GPT-3.5 with the open-weight Llama 3.1 70B for all text encoding components. The results broadly confirm the original claims: contextual warm starting via textual problem and hyperparameter descriptions improves early regret behaviour and reduces variance across runs; the discriminative surrogate is weaker than GP or SMAC as a pure single-task regressor but benefits from cross-task semantic priors induced by the language model; removing textual context markedly degrades predictive accuracy and calibration; and the candidate sampler consistently generates higher-quality, more diverse proposals than TPE or random sampling.
Key Takeaways
- LLAMBO uses large language models as discriminative surrogates and acquisition optimizers via text-only interactions (the overall loop is sketched below).
- Contextual warm starting from textual problem and hyperparameter descriptions improves early regret behaviour and reduces variance across runs.
- LLAMBO's discriminative surrogate is relatively weak as a pure single-task regressor.
- The language model provides LLAMBO with cross-task semantic priors.
- Removing textual context degrades predictive accuracy and calibration.
- The LLAMBO candidate sampler generates higher-quality and more diverse proposals than TPE or random sampling.
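The overall prompting-based BO loop (text-described problem, LLM-proposed candidates, LLM-scored surrogate acting as the acquisition, evaluation of the best candidate) can be outlined as below. `query_llm` is a hypothetical stub standing in for whichever backbone is used (GPT-3.5 originally, Llama 3.1 70B in this reproduction), and the prompts and parsing are illustrative only.

```python
import random

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real client for the chosen backbone."""
    return str(round(random.random(), 3))           # stub so the sketch runs end-to-end

def propose_candidates(history, space, k=5):
    prompt = (f"Search space: {space}. Observations so far: {history}. "
              f"Propose {k} promising configurations, one per line.")
    _ = query_llm(prompt)                           # a real run would parse the reply into dicts
    return [{name: random.uniform(*bounds) for name, bounds in space.items()} for _ in range(k)]

def surrogate_score(history, candidate):
    prompt = (f"Observations so far: {history}. Predict the validation score of {candidate} "
              "as a single number between 0 and 1.")
    return float(query_llm(prompt))

def llambo_style_bo(objective, space, budget=10):
    history = []                                    # a textual problem description would warm-start this
    for _ in range(budget):
        candidates = propose_candidates(history, space)
        best = max(candidates, key=lambda c: surrogate_score(history, c))   # LLM surrogate as acquisition
        history.append((best, objective(best)))
    return max(history, key=lambda h: h[1])

if __name__ == "__main__":
    space = {"lr": (1e-4, 1e-1), "dropout": (0.0, 0.5)}
    objective = lambda c: 1.0 - abs(c["lr"] - 0.01) - 0.1 * c["dropout"]    # toy objective
    print(llambo_style_bo(objective, space, budget=5))
```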
Modality-Collaborative Low-Rank Decomposers for Few-Shot Video Domain Adaptation
Authors:Yuyang Wanyan, Xiaoshan Yang, Weiming Dong, Changsheng Xu
In this paper, we study the challenging task of Few-Shot Video Domain Adaptation (FSVDA). The multimodal nature of videos introduces unique challenges, necessitating the simultaneous consideration of both domain alignment and modality collaboration in a few-shot scenario, which is ignored in previous literature. We observe that, under the influence of domain shift, the generalization performance on the target domain of each individual modality, as well as that of fused multimodal features, is constrained. Because each modality is comprised of coupled features with multiple components that exhibit different domain shifts. This variability increases the complexity of domain adaptation, thereby reducing the effectiveness of multimodal feature integration. To address these challenges, we introduce a novel framework of Modality-Collaborative LowRank Decomposers (MC-LRD) to decompose modality-unique and modality-shared features with different domain shift levels from each modality that are more friendly for domain alignment. The MC-LRD comprises multiple decomposers for each modality and Multimodal Decomposition Routers (MDR). Each decomposer has progressively shared parameters across different modalities. The MDR is leveraged to selectively activate the decomposers to produce modality-unique and modality-shared features. To ensure efficient decomposition, we apply orthogonal decorrelation constraints separately to decomposers and subrouters, enhancing their diversity. Furthermore, we propose a cross-domain activation consistency loss to guarantee that target and source samples of the same category exhibit consistent activation preferences of the decomposers, thereby facilitating domain alignment. Extensive experimental results on three public benchmarks demonstrate that our model achieves significant improvements over existing methods.
Paper and project links
Summary
The paper studies Few-Shot Video Domain Adaptation (FSVDA), where the multimodal nature of videos requires handling domain alignment and modality collaboration at the same time, something previous work ignores. Under domain shift, the generalization of each individual modality and of the fused multimodal features is constrained, because each modality consists of coupled feature components with different levels of domain shift. The proposed Modality-Collaborative Low-Rank Decomposers (MC-LRD) framework decomposes modality-unique and modality-shared features with different domain-shift levels that are friendlier to domain alignment, and a cross-domain activation consistency loss ensures that target and source samples of the same category activate the decomposers consistently. Experiments on three public benchmarks show significant improvements over existing methods.
Key Takeaways
- The work tackles Few-Shot Video Domain Adaptation, where domain alignment and modality collaboration must be considered jointly.
- Each modality contains coupled feature components with different degrees of domain shift, which complicates adaptation and weakens multimodal feature integration.
- MC-LRD decomposes modality-unique and modality-shared features that are more amenable to domain alignment; it comprises multiple decomposers per modality and Multimodal Decomposition Routers (MDR).
- Decomposers progressively share parameters across modalities, and the routers selectively activate them to produce modality-unique and modality-shared features.
- Orthogonal decorrelation constraints are applied separately to decomposers and sub-routers to enhance their diversity (a soft-orthogonality sketch follows below).
- A cross-domain activation consistency loss keeps activation preferences of same-category source and target samples consistent; experiments on three public benchmarks show significant improvements.
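The orthogonal decorrelation constraint mentioned above is commonly realized as a soft penalty pushing the Gram matrix of weight vectors toward the identity; since MC-LRD's exact formulation is not given here, the following is an assumed soft-orthogonality sketch.

```python
import torch
import torch.nn.functional as F

def soft_orthogonality_penalty(weights: torch.Tensor) -> torch.Tensor:
    """weights: (num_decomposers, dim). Penalize off-diagonal cosine similarity so that
    different decomposers (or sub-routers) capture decorrelated directions."""
    w = F.normalize(weights, dim=1)
    gram = w @ w.t()
    off_diag = gram - torch.eye(w.size(0), device=w.device)
    return (off_diag ** 2).sum() / (w.size(0) * (w.size(0) - 1))

if __name__ == "__main__":
    w = torch.randn(4, 128, requires_grad=True)
    loss = soft_orthogonality_penalty(w)
    loss.backward()
    print(round(loss.item(), 4), w.grad.shape)
```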
A Theory-Inspired Framework for Few-Shot Cross-Modal Sketch Person Re-Identification
Authors:Yunpeng Gong, Yongjie Hou, Jiangming Shi, Kim Long Diep, Min Jiang
Sketch based person re-identification aims to match hand-drawn sketches with RGB surveillance images, but remains challenging due to significant modality gaps and limited annotated data. To address this, we introduce KTCAA, a theoretically grounded framework for few-shot cross-modal generalization. Motivated by generalization theory, we identify two key factors influencing target domain risk: (1) domain discrepancy, which quantifies the alignment difficulty between source and target distributions; and (2) perturbation invariance, which evaluates the model’s robustness to modality shifts. Based on these insights, we propose two components: (1) Alignment Augmentation (AA), which applies localized sketch-style transformations to simulate target distributions and facilitate progressive alignment; and (2) Knowledge Transfer Catalyst (KTC), which enhances invariance by introducing worst-case perturbations and enforcing consistency. These modules are jointly optimized under a meta-learning paradigm that transfers alignment knowledge from data-rich RGB domains to sketch-based scenarios. Experiments on multiple benchmarks demonstrate that KTCAA achieves state-of-the-art performance, particularly in data-scarce conditions.
Paper and project links
PDF Accepted by AAAI2026
Summary
Sketch-based person re-identification matches hand-drawn sketches with RGB surveillance images, but remains challenging due to large modality gaps and limited annotated data. KTCAA is a theoretically grounded framework for few-shot cross-modal generalization. Motivated by generalization theory, it identifies two key factors influencing target-domain risk, domain discrepancy and perturbation invariance, and proposes two components: Alignment Augmentation (AA) and a Knowledge Transfer Catalyst (KTC). These are jointly optimized under a meta-learning paradigm that transfers alignment knowledge from data-rich RGB domains to sketch-based scenarios. Experiments on multiple benchmarks show state-of-the-art performance, especially under data-scarce conditions.
Key Takeaways
- KTCAA targets sketch-based person re-identification, matching hand-drawn sketches with RGB surveillance images.
- The main challenges are the significant modality gap and the limited amount of annotated data.
- Two key factors influencing target-domain risk are identified: domain discrepancy and perturbation invariance.
- Alignment Augmentation (AA) applies localized sketch-style transformations to simulate target distributions and facilitate progressive alignment.
- The Knowledge Transfer Catalyst (KTC) improves invariance by introducing worst-case perturbations and enforcing consistency (a generic sketch follows below).
- KTCAA is optimized under a meta-learning paradigm that transfers alignment knowledge from data-rich RGB domains to sketch-based scenarios.
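The Knowledge Transfer Catalyst is described as introducing worst-case perturbations and enforcing consistency. A generic sketch of that recipe is shown below: an FGSM-style perturbation of the features on the model's own prediction, followed by a KL consistency term. Both the perturbation form and the divergence choice are assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def worst_case_consistency(classifier, feats, eps=1e-2):
    """Assumed KTC-style term: perturb the features in the direction that most increases the
    loss on the model's own prediction, then penalize the divergence between clean and
    perturbed outputs."""
    feats = feats.detach().requires_grad_(True)
    clean_logits = classifier(feats)
    pseudo = clean_logits.argmax(dim=-1)
    grad = torch.autograd.grad(F.cross_entropy(clean_logits, pseudo), feats)[0]
    adv_logits = classifier(feats + eps * grad.sign())          # FGSM-style worst-case perturbation
    return F.kl_div(F.log_softmax(adv_logits, dim=-1),
                    F.softmax(clean_logits, dim=-1).detach(),
                    reduction="batchmean")

if __name__ == "__main__":
    clf = torch.nn.Linear(64, 10)                               # toy stand-in for the re-id head
    print(worst_case_consistency(clf, torch.randn(8, 64)).item())
```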
Breaking Forgetting: Training-Free Few-Shot Class-Incremental Learning via Conditional Diffusion
Authors:Haidong Kang, Ketong Qian, Yi Lu
Efforts to overcome catastrophic forgetting in Few-Shot Class-Incremental Learning (FSCIL) have primarily focused on developing more effective gradient-based optimization strategies. In contrast, little attention has been paid to the training cost explosion that inevitably arises as the number of novel classes increases, a consequence of relying on gradient learning even under extreme data scarcity. More critically, since FSCIL typically provides only a few samples for each new class, gradient-based updates not only induce severe catastrophic forgetting on base classes but also hinder adaptation to novel ones. This paper seeks to break this long-standing limitation by asking: Can we design a training-free FSCIL paradigm that entirely removes gradient optimization? We provide an affirmative answer by uncovering an intriguing connection between gradient-based optimization and the Conditional Diffusion process. Building on this observation, we propose a Conditional Diffusion-driven FSCIL (CD-FSCIL) framework that substitutes the conventional gradient update process with a diffusion-based generative transition, enabling training-free incremental adaptation while effectively mitigating forgetting. Furthermore, to enhance representation under few-shot constraints, we introduce a multimodal learning strategy that integrates visual features with natural language descriptions automatically generated by Large Language Models (LLMs). This synergy substantially alleviates the sample scarcity issue and improves generalization across novel classes. Extensive experiments on mainstream FSCIL benchmarks demonstrate that our method not only achieves state-of-the-art performance but also drastically reduces computational and memory overhead, marking a paradigm shift toward training-free continual adaptation.
Paper and project links
Summary
The paper tackles catastrophic forgetting in Few-Shot Class-Incremental Learning (FSCIL) and proposes a Conditional Diffusion-driven FSCIL framework (CD-FSCIL) that replaces conventional gradient updates with a diffusion-based generative transition, enabling training-free incremental adaptation while effectively mitigating forgetting. A multimodal learning strategy that integrates visual features with natural-language descriptions automatically generated by large language models further alleviates sample scarcity and improves generalization to novel classes.
Key Takeaways
- CD-FSCIL replaces gradient updates with a diffusion-based generative transition, enabling training-free incremental adaptation.
- The approach builds on a connection between gradient-based optimization and the conditional diffusion process.
- A multimodal learning strategy integrates visual features with natural-language descriptions to strengthen representations under few-shot constraints.
- Descriptions automatically generated by large language models (LLMs) help alleviate sample scarcity.
- CD-FSCIL achieves state-of-the-art performance on mainstream FSCIL benchmarks.
- The method drastically reduces computational and memory overhead.
DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition
Authors:Raja Kumar, Arka Sadhu, Ram Nevatia
Large Vision Language Models (LVLMs) possess extensive text knowledge but struggles to utilize this knowledge for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods using Reinforcement Learning (RL) with exact-match reward signals are often brittle, encourage memorization of training categories, and fail to elicit differential reasoning needed for generalization to unseen classes. To address this, we propose $\textbf{DiVE-k}$, $\textbf{Di}$fferential $\textbf{V}$isual r$\textbf{E}$asoning using top-$\textbf{k}$ generations, framework that leverages model’s own top-k predictions as a training signal. For each training image, DiVE-k creates a multiple-choice question from the model’s top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization. Experiments on five standard fine-grained datasets show that our method significantly outperforms existing approaches. In the standard base-to-novel generalization setting, DiVE-k surpasses the QWEN2.5-VL-7B and ViRFT by 10.04% and 6.16% on the Harmonic Mean metric, respectively. Further experiments show similar gains in mixed-domain and few-shot scenarios.
Paper and project links
Summary
Large vision-language models (LVLMs) possess extensive textual knowledge but struggle to use it for fine-grained image recognition, often failing to distinguish visually similar categories. Existing RL fine-tuning with exact-match rewards is brittle, encourages memorization of training categories, and fails to elicit the differential reasoning needed to generalize to unseen classes. DiVE-k instead uses the model's own top-k predictions as a training signal: for each training image it builds a multiple-choice question from the top-k outputs and uses RL to train the model to select the correct answer. This requires fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization. Experiments on five standard fine-grained datasets show that DiVE-k significantly outperforms existing approaches.
Key Takeaways
- LVLMs struggle with fine-grained image recognition and often confuse visually similar categories.
- Existing RL fine-tuning with exact-match rewards is brittle and encourages memorization rather than differential reasoning.
- DiVE-k uses the model's own top-k predictions as a training signal for differential visual reasoning.
- DiVE-k builds a multiple-choice question from the model's top-k outputs and trains the model with RL to select the correct answer (sketched below).
- This demands fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal.
- Experiments on standard fine-grained datasets show that DiVE-k significantly outperforms existing methods.
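The core DiVE-k step, building a multiple-choice question from the model's own top-k category predictions and rewarding selection of the ground truth, is easy to illustrate. The prompt template, option handling, and binary reward below are assumptions; in the paper the selection policy is trained with RL on this verifiable signal.

```python
import random
import string

def build_mcq(topk_preds, ground_truth, seed=0):
    """Build a multiple-choice question from the model's top-k category predictions;
    the ground-truth label replaces the last option if the model missed it entirely."""
    options = list(dict.fromkeys(topk_preds))            # de-duplicate, keep order
    if ground_truth not in options:
        options[-1] = ground_truth
    random.Random(seed).shuffle(options)
    letters = string.ascii_uppercase[:len(options)]
    prompt = "Which fine-grained category matches the image?\n" + "\n".join(
        f"{letter}. {opt}" for letter, opt in zip(letters, options))
    return prompt, letters[options.index(ground_truth)]

def exact_match_reward(model_choice: str, answer: str) -> float:
    """Verifiable reward: 1 if the selected letter is correct, else 0."""
    return 1.0 if model_choice.strip().upper().startswith(answer) else 0.0

if __name__ == "__main__":
    preds = ["song sparrow", "chipping sparrow", "house sparrow", "field sparrow", "tree sparrow"]
    prompt, answer = build_mcq(preds, ground_truth="field sparrow")
    print(prompt)
    print("gold:", answer, "| reward for choosing 'C':", exact_match_reward("C", answer))
```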
Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation
Authors:Yara Bahram, Melodie Desbos, Mohammadhadi Shateri, Eric Granger
Diffusion models (DMs) produce high-quality images, yet their sampling remains costly when adapted to new domains. Distilled DMs are faster but typically remain confined within their teacher’s domain. Thus, fast and high-quality generation for novel domains relies on two-stage training pipelines: Adapt-then-Distill or Distill-then-Adapt. However, both add design complexity and suffer from degraded quality or diversity. We introduce Uni-DAD, a single-stage pipeline that unifies distillation and adaptation of DMs. It couples two signals during training: (i) a dual-domain distribution-matching distillation objective that guides the student toward the distributions of the source teacher and a target teacher, and (ii) a multi-head generative adversarial network (GAN) loss that encourages target realism across multiple feature scales. The source domain distillation preserves diverse source knowledge, while the multi-head GAN stabilizes training and reduces overfitting, especially in few-shot regimes. The inclusion of a target teacher facilitates adaptation to more structurally distant domains. We perform evaluations on a variety of datasets for few-shot image generation (FSIG) and subject-driven personalization (SDP). Uni-DAD delivers higher quality than state-of-the-art (SoTA) adaptation methods even with less than 4 sampling steps, and outperforms two-stage training pipelines in both quality and diversity.
Paper and project links
PDF Under review paper at CVPR 2026
Summary
Diffusion models produce high-quality images, but their sampling remains costly when adapted to new domains, and distilled models typically stay confined to their teacher's domain, so fast, high-quality generation for novel domains has relied on two-stage pipelines (Adapt-then-Distill or Distill-then-Adapt) that add design complexity and lose quality or diversity. Uni-DAD is a single-stage pipeline that unifies distillation and adaptation: it couples a dual-domain distribution-matching distillation objective, which guides the student toward the distributions of a source teacher and a target teacher, with a multi-head GAN loss that encourages target realism across multiple feature scales. Source-domain distillation preserves diverse source knowledge, while the multi-head GAN stabilizes training and reduces overfitting, especially in few-shot regimes. On few-shot image generation and subject-driven personalization benchmarks, Uni-DAD delivers higher quality than state-of-the-art adaptation methods even with fewer than 4 sampling steps and outperforms two-stage pipelines in both quality and diversity.
Key Takeaways
- Diffusion models face costly sampling when adapted to new domains, and distilled models stay confined to their teacher's domain.
- Existing two-stage training pipelines add design complexity and suffer degraded quality or diversity.
- Uni-DAD is a single-stage pipeline that unifies distillation and adaptation of diffusion models.
- Uni-DAD combines a dual-domain distribution-matching distillation objective with a multi-head GAN loss that encourages target realism across multiple feature scales.
- Source-domain distillation preserves diverse source knowledge, while the multi-head GAN stabilizes training and reduces overfitting.
- Including a target teacher helps adaptation to structurally more distant domains.
Matching-Based Few-Shot Semantic Segmentation Models Are Interpretable by Design
Authors:Pasquale De Marinis, Uzay Kaymak, Rogier Brussee, Gennaro Vessio, Giovanna Castellano
Few-Shot Semantic Segmentation (FSS) models achieve strong performance in segmenting novel classes with minimal labeled examples, yet their decision-making processes remain largely opaque. While explainable AI has advanced significantly in standard computer vision tasks, interpretability in FSS remains virtually unexplored despite its critical importance for understanding model behavior and guiding support set selection in data-scarce scenarios. This paper introduces the first dedicated method for interpreting matching-based FSS models by leveraging their inherent structural properties. Our Affinity Explainer approach extracts attribution maps that highlight which pixels in support images contribute most to query segmentation predictions, using matching scores computed between support and query features at multiple feature levels. We extend standard interpretability evaluation metrics to the FSS domain and propose additional metrics to better capture the practical utility of explanations in few-shot scenarios. Comprehensive experiments on FSS benchmark datasets, using different models, demonstrate that our Affinity Explainer significantly outperforms adapted standard attribution methods. Qualitative analysis reveals that our explanations provide structured, coherent attention patterns that align with model architectures and enable effective model diagnosis. This work establishes the foundation for interpretable FSS research, enabling better model understanding and diagnostics for more reliable few-shot segmentation systems. The source code is publicly available at https://github.com/pasqualedem/AffinityExplainer.
Paper and project links
Summary
The paper introduces the first dedicated method for interpreting matching-based few-shot semantic segmentation (FSS) models by leveraging their inherent structural properties. The Affinity Explainer extracts attribution maps that highlight which pixels in the support images contribute most to the query segmentation prediction, using matching scores computed between support and query features at multiple feature levels. Standard interpretability evaluation metrics are extended to the FSS domain, and additional metrics are proposed to capture the practical utility of explanations in few-shot scenarios. Experiments on FSS benchmarks with different models show that the Affinity Explainer significantly outperforms adapted standard attribution methods, and qualitative analysis shows structured, coherent attention patterns that align with the model architectures and enable effective model diagnosis. This lays a foundation for interpretable FSS research and more reliable few-shot segmentation systems.
Key Takeaways
- FSS models segment novel classes well from few labeled samples, but their decision-making remains largely opaque.
- The paper introduces the first dedicated interpretation method for matching-based FSS models, leveraging their structural properties.
- The Affinity Explainer extracts attribution maps that highlight which support-image pixels contribute most to the query segmentation prediction (a single-level sketch follows below).
- Standard interpretability evaluation metrics are extended to the FSS domain, and new metrics are proposed.
- The Affinity Explainer outperforms adapted standard attribution methods on FSS benchmark datasets.
- The explanations form structured, coherent attention patterns consistent with the model architectures and enable effective model diagnosis.
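The attribution idea, scoring each support-image location by how strongly its features match the query features inside the predicted segment, can be sketched with plain cosine affinities at a single feature level; the multi-level aggregation and normalization used by Affinity Explainer are not reproduced, so treat this as a rough approximation.

```python
import torch
import torch.nn.functional as F

def support_attribution(support_feats, query_feats, query_mask):
    """support_feats, query_feats: (C, H, W) feature maps from the same encoder;
    query_mask: (H, W) binary mask of the predicted query segment.
    Returns an (H, W) map of how much each support location contributed to the match."""
    c, h, w = support_feats.shape
    s = F.normalize(support_feats.reshape(c, -1), dim=0)       # (C, Hs*Ws)
    q = F.normalize(query_feats.reshape(c, -1), dim=0)         # (C, Hq*Wq)
    affinity = s.t() @ q                                        # cosine matching scores
    inside = query_mask.reshape(-1).bool()
    attr = affinity[:, inside].clamp(min=0).mean(dim=1)         # mean match to masked query pixels
    attr = (attr - attr.min()) / (attr.max() - attr.min() + 1e-8)
    return attr.reshape(h, w)

if __name__ == "__main__":
    sup, qry = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
    mask = torch.zeros(32, 32)
    mask[10:20, 12:22] = 1
    print(support_attribution(sup, qry, mask).shape)            # torch.Size([32, 32])
```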
Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
Authors:Dachuan Zhao, Weiyue Li, Zhenda Shen, Yushu Qiu, Bowen Xu, Haoyu Chen, Yongchao Chen
Vision-Language Models (VLMs) have become indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases, resulting in biased associations and misaligned predictions in downstream tasks. Such behavior undermines fairness and distorts the intended alignment between vision and language. Recent post-hoc approaches attempt to mitigate bias by replacing the most attribute-correlated embedding coordinates with neutral values. However, our systematic analysis reveals three critical failures of this coordinate-wise approach: feature entanglement, poor cross-dataset generalization, and incomplete bias removal. We find that bias is not localized to a few coordinates but is instead distributed across a few linear subspaces. To address these limitations, we propose $\textbf{S}$ubspace $\textbf{P}$rojection $\textbf{D}$ebiasing ($\textbf{SPD}$), a geometrically principled framework that identifies and removes the entire subspace of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity. Extensive experiments across zero-shot classification, text-to-image retrieval, and image generation validate the effectiveness of SPD: our method achieves more robust debiasing with an average improvement of $18.5%$ across four fairness metrics, while maintaining minimal loss in task performance compared to the best debiasing baseline.
Paper and project links
Summary
Vision-language models are indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases, leading to biased associations and misaligned predictions in downstream tasks. Recent post-hoc approaches replace the most attribute-correlated embedding coordinates with neutral values, but a systematic analysis reveals three failures of this coordinate-wise strategy: feature entanglement, poor cross-dataset generalization, and incomplete bias removal; bias is not localized to a few coordinates but distributed across a few linear subspaces. Subspace Projection Debiasing (SPD) identifies and removes the entire subspace of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity. Across zero-shot classification, text-to-image retrieval, and image generation, SPD achieves more robust debiasing, with an average improvement of 18.5% across four fairness metrics, while keeping the loss in task performance minimal compared with the best debiasing baseline.
Key Takeaways
- VLMs are central to multimodal reasoning but can encode and amplify demographic biases.
- Post-hoc methods that replace the most attribute-correlated embedding coordinates suffer from feature entanglement, poor cross-dataset generalization, and incomplete bias removal.
- Bias is not confined to a few coordinates; it is distributed across a few linear subspaces.
- Subspace Projection Debiasing (SPD) removes the entire subspace of linearly decodable bias while reinserting a neutral mean component (a sketch of this projection follows below).
- SPD achieves robust debiasing across zero-shot classification, text-to-image retrieval, and image generation.
- SPD incurs minimal task-performance loss compared with the best debiasing baseline.
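A minimal sketch of subspace-style debiasing: estimate a low-rank bias subspace from attribute-contrast embeddings via SVD, project it out of every embedding, and reinsert a neutral mean component. The rank, the contrast construction, and the neutral-mean choice are assumptions; the actual SPD procedure may differ in detail.

```python
import torch

def fit_bias_subspace(emb_group_a, emb_group_b, rank=2):
    """Estimate a bias subspace from grouped embeddings that differ mainly in a protected
    attribute (e.g. text/image embeddings of otherwise-matched 'man'/'woman' prompts)."""
    diffs = emb_group_a - emb_group_b                     # (N, D) attribute-contrast vectors
    diffs = diffs - diffs.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    return vh[:rank]                                      # (rank, D) orthonormal bias directions

def project_out(emb, bias_basis, neutral_mean):
    """Remove the whole bias subspace, then reinsert the neutral mean's component so the
    embeddings stay on-distribution for downstream similarity scoring."""
    proj = (emb @ bias_basis.t()) @ bias_basis            # component inside the bias subspace
    neutral = (neutral_mean @ bias_basis.t()) @ bias_basis
    return emb - proj + neutral

if __name__ == "__main__":
    torch.manual_seed(0)
    bias_dir = torch.randn(1, 64)
    group_a = torch.randn(100, 64) + 2 * bias_dir         # toy 'biased' embedding groups
    group_b = torch.randn(100, 64) - 2 * bias_dir
    basis = fit_bias_subspace(group_a, group_b)
    neutral_mean = torch.cat([group_a, group_b]).mean(dim=0)
    debiased = project_out(torch.cat([group_a, group_b]), basis, neutral_mean)
    # The two groups become nearly indistinguishable along the estimated bias directions.
    print((group_a.mean(0) @ basis.t()).norm().item(),
          (debiased[:100].mean(0) @ basis.t()).norm().item())
```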
Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion
Authors:Yan Xu, Yixing Wang, Stella X. Yu
Given just a few glimpses of a scene, can you imagine the movie playing out as the camera glides through it? That’s the lens we take on \emph{sparse-input novel view synthesis}, not only as filling spatial gaps between widely spaced views, but also as \emph{completing a natural video} unfolding through space. We recast the task as \emph{test-time natural video completion}, using powerful priors from \emph{pretrained video diffusion models} to hallucinate plausible in-between views. Our \emph{zero-shot, generation-guided} framework produces pseudo views at novel camera poses, modulated by an \emph{uncertainty-aware mechanism} for spatial coherence. These synthesized frames densify supervision for \emph{3D Gaussian Splatting} (3D-GS) for scene reconstruction, especially in under-observed regions. An iterative feedback loop lets 3D geometry and 2D view synthesis inform each other, improving both the scene reconstruction and the generated views. The result is coherent, high-fidelity renderings from sparse inputs \emph{without any scene-specific training or fine-tuning}. On LLFF, DTU, DL3DV, and MipNeRF-360, our method significantly outperforms strong 3D-GS baselines under extreme sparsity.
Paper and project links
PDF Accepted to NeurIPS 2025
Summary
Given only a few glimpses of a scene, the work recasts sparse-input novel view synthesis as test-time natural video completion, using priors from pretrained video diffusion models to hallucinate plausible in-between views. The zero-shot, generation-guided framework produces pseudo views at novel camera poses, modulated by an uncertainty-aware mechanism for spatial coherence. These synthesized frames densify supervision for 3D Gaussian Splatting (3D-GS) scene reconstruction, especially in under-observed regions, and an iterative feedback loop lets 3D geometry and 2D view synthesis inform each other. The result is coherent, high-fidelity renderings from sparse inputs without any scene-specific training or fine-tuning, significantly outperforming strong 3D-GS baselines under extreme sparsity on LLFF, DTU, DL3DV, and MipNeRF-360.
Key Takeaways
- Sparse-input novel view synthesis is recast as test-time natural video completion.
- Pretrained video diffusion models provide priors for hallucinating plausible in-between views, with an uncertainty-aware mechanism ensuring spatial coherence.
- The synthesized frames densify supervision for 3D Gaussian Splatting reconstruction, especially in under-observed regions.
- An iterative feedback loop lets scene reconstruction and view generation improve each other.
- No scene-specific training or fine-tuning is required, and the method generalizes across multiple datasets.
Frequency-Adaptive Sharpness Regularization for Improving 3D Gaussian Splatting Generalization
Authors:Youngsik Yun, Dongjun Gu, Youngjung Uh
Despite 3D Gaussian Splatting (3DGS) excelling in most configurations, it lacks generalization across novel viewpoints in a few-shot scenario because it overfits to the sparse observations. We revisit 3DGS optimization from a machine learning perspective, framing novel view synthesis as a generalization problem to unseen viewpoints-an underexplored direction. We propose Frequency-Adaptive Sharpness Regularization (FASR), which reformulates the 3DGS training objective, thereby guiding 3DGS to converge toward a better generalization solution. Although Sharpness-Aware Minimization (SAM) similarly reduces the sharpness of the loss landscape to improve generalization of classification models, directly employing it to 3DGS is suboptimal due to the discrepancy between the tasks. Specifically, it hinders reconstructing high-frequency details due to excessive regularization, while reducing its strength leads to under-penalizing sharpness. To address this, we reflect the local frequency of images to set the regularization weight and the neighborhood radius when estimating the local sharpness. It prevents floater artifacts in novel viewpoints and reconstructs fine details that SAM tends to oversmooth. Across datasets with various configurations, our method consistently improves a wide range of baselines. Code will be available at https://bbangsik13.github.io/FASR.
Paper and project links
PDF Project page: https://bbangsik13.github.io/FASR
Summary
3D Gaussian Splatting (3DGS) excels in most configurations but generalizes poorly to novel viewpoints in few-shot settings because it overfits the sparse observations. The paper revisits 3DGS optimization from a machine learning perspective, framing novel view synthesis as a generalization problem to unseen viewpoints, and proposes Frequency-Adaptive Sharpness Regularization (FASR), which reformulates the 3DGS training objective so that optimization converges to a better-generalizing solution. FASR sets the regularization weight and the neighborhood radius from the local frequency of the images, preventing floater artifacts in novel viewpoints while still reconstructing the fine details that plain sharpness-aware minimization tends to oversmooth. The method consistently improves a wide range of baselines across datasets and configurations.
Key Takeaways
- 3DGS performs well in most configurations but lacks generalization to novel viewpoints in few-shot scenarios.
- The work reframes novel view synthesis as a generalization problem to unseen viewpoints and revisits 3DGS optimization from a machine learning perspective.
- Frequency-Adaptive Sharpness Regularization (FASR) reformulates the training objective to guide 3DGS toward a better-generalizing solution.
- FASR uses the local frequency of images to set the regularization weight and the neighborhood radius when estimating local sharpness (a sketch follows below).
- FASR prevents floater artifacts while reconstructing fine details that Sharpness-Aware Minimization (SAM) tends to oversmooth.
- Across datasets and configurations, FASR consistently improves a wide range of baselines.
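A sketch of the general recipe under stated assumptions: estimate per-pixel local frequency with a Laplacian magnitude, map it to a per-pixel regularization weight, and evaluate a SAM-style objective (the weighted loss at a nearby worst-case parameter perturbation). The Laplacian proxy, the weight mapping, and the single fixed-radius ascent step are simplifications; the paper's frequency-adaptive neighborhood radius is not reproduced.

```python
import torch
import torch.nn.functional as F

def local_frequency_weight(image, lo=0.2, hi=1.0):
    """Per-pixel regularization weight from local image frequency (Laplacian magnitude as a
    cheap proxy): smooth regions keep the full sharpness penalty, detailed regions get less."""
    gray = image.mean(dim=0, keepdim=True)[None]                      # (1, 1, H, W)
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
    freq = F.conv2d(gray, lap, padding=1).abs().squeeze()
    freq = freq / (freq.max() + 1e-8)
    return hi - (hi - lo) * freq                                      # high frequency -> lower weight

def fasr_style_loss(render_fn, param, target, weight, rho=0.05):
    """SAM-style objective with a per-pixel weight: the weighted L1 loss evaluated after a
    normalized gradient-ascent perturbation of the parameters, i.e. at a nearby worst case."""
    def loss_at(p):
        return (weight * (render_fn(p) - target).abs().mean(dim=0)).mean()
    base = loss_at(param)
    grad = torch.autograd.grad(base, param)[0]
    eps = rho * grad / (grad.norm() + 1e-12)       # worst-case direction within radius rho
    return loss_at(param + eps.detach())           # ~= original loss + sharpness gap

if __name__ == "__main__":
    target = torch.rand(3, 64, 64)
    param = torch.rand(3, 64, 64, requires_grad=True)   # toy stand-in for 3DGS parameters
    render_fn = lambda p: p                             # toy "renderer": identity
    loss = fasr_style_loss(render_fn, param, target, local_frequency_weight(target))
    loss.backward()
    print(round(loss.item(), 4), param.grad.shape)
```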