
Few-Shot


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never rely on them for serious academic work; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-19

Part-X-MLLM: Part-aware 3D Multimodal Large Language Model

Authors:Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo

We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/


Paper and project links

PDF

Summary

Part-X-MLLM is a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, the model autoregressively generates a single, coherent token sequence that encodes part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface that drives downstream geometry-aware modules for part-based generation and editing. By decoupling symbolic planning from geometric synthesis, the approach lets any compatible geometry engine be controlled through a single, language-native frontend.

Key Takeaways

  1. Part-X-MLLM is a native 3D multimodal large language model that unifies diverse 3D tasks.
  2. It formulates 3D tasks as programs in a structured, executable grammar (a toy parsing sketch follows this list).
  3. Given an RGB point cloud and a natural language prompt, it autoregressively generates a coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands.
  4. The structured output serves as a versatile interface for part-based generation and editing.
  5. Decoupling symbolic planning from geometric synthesis allows any compatible geometry engine to be driven by the model.
  6. A dual-encoder architecture is pre-trained to disentangle structure from semantics.
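The abstract frames the model's output as a program in a structured, executable grammar whose part-level boxes and edit commands a separate geometry engine then executes. The grammar itself is not published in this summary, so the snippet below is only a toy illustration of that symbolic-plan / geometry-engine split, using an invented bracketed format and hypothetical field names rather than the paper's actual token vocabulary.

```python
import re
from dataclasses import dataclass

# Hypothetical plan format (NOT the paper's actual grammar), e.g.:
#   <part name="seat" bbox="0.1,0.2,0.3,0.9,0.8,0.5"> <edit op="scale" factor="1.2">
PART_RE = re.compile(r'<part name="(?P<name>[^"]+)" bbox="(?P<bbox>[^"]+)">')
EDIT_RE = re.compile(r'<edit op="(?P<op>[^"]+)"(?P<args>[^>]*)>')

@dataclass
class PartPlan:
    name: str
    bbox: tuple   # (x_min, y_min, z_min, x_max, y_max, z_max)
    edits: list   # list of (op, {arg: value}) pairs

def parse_plan(sequence: str) -> list:
    """Split a generated plan string into per-part symbolic commands."""
    parts = []
    for chunk in sequence.split("<part")[1:]:       # one chunk per part block
        chunk = "<part" + chunk
        m = PART_RE.search(chunk)
        if m is None:
            continue
        bbox = tuple(float(v) for v in m.group("bbox").split(","))
        edits = []
        for e in EDIT_RE.finditer(chunk):
            args = dict(re.findall(r'(\w+)="([^"]+)"', e.group("args")))
            edits.append((e.group("op"), args))
        parts.append(PartPlan(m.group("name"), bbox, edits))
    return parts

# The symbolic plan can then be handed to any geometry-engine frontend:
plan = parse_plan('<part name="seat" bbox="0,0,0,1,1,0.2"> <edit op="scale" factor="1.2">')
for p in plan:
    print(p.name, p.bbox, p.edits)   # seat, a 6-tuple box, [('scale', {'factor': '1.2'})]
```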

Cool Papers

Click here to view paper screenshots

CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product

Authors:Kaiwen Xue, Chenglong Li, Zhonghong Ou, Guoxin Zhang, Kaoyan Lu, Shuai Lyu, Yifan Zhu, Ping Zong, Junpeng Ding, Xinyu Liu, Qunlin Chen, Weiwei Qin, Yiran Shen, Jiayi Cen

Human-defined creativity is highly abstract, posing a challenge for multimodal large language models (MLLMs) to comprehend and assess creativity that aligns with human judgments. The absence of an existing benchmark further exacerbates this dilemma. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering the multiple dimensions from creative idea to process to products; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset, consisting of 2.2K diverse-sourced multimodal data, 79.2K human feedbacks and 4.7M multi-typed instructions. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine these human feedbacks to activate stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Based on the CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation compared to state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.


Paper and project links

PDF 13 pages, 3 figures, The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026), paper has been accepted for a poster presentation

Summary

Human-defined creativity is highly abstract, which makes it hard for multimodal large language models (MLLMs) to assess creativity in a way that aligns with human judgment, and no benchmark existed for this. The paper proposes CreBench, an evaluation benchmark covering the dimensions from creative idea to process to product, together with CreMIT, a multimodal creativity-evaluation dataset. GPT is prompted to refine the collected human feedback so that MLLMs can handle diverse creativity-related queries, and fine-tuning open-source general MLLMs on this resource yields CreExpert, a multimodal creativity-evaluation expert model that aligns substantially better with human creativity judgments.

Key Takeaways

  1. Human-defined creativity is highly abstract, which makes it challenging for multimodal large language models (MLLMs) to understand and evaluate creativity.
  2. The lack of an existing benchmark exacerbates this difficulty.
  3. CreBench is proposed as an evaluation benchmark spanning the dimensions from creative idea to process to product.
  4. CreMIT, a multimodal creativity-evaluation dataset, contains 2.2K diverse-sourced multimodal samples, 79.2K human feedbacks, and 4.7M multi-typed instructions.
  5. GPT is prompted to refine the human feedback so that MLLMs can handle diverse creativity-related queries.
  6. CreExpert, obtained by fine-tuning open-source MLLMs on CreBench, aligns significantly better with human creativity evaluation.

Cool Papers

Click here to view paper screenshots

ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models

Authors:Siyang Cheng, Gaotian Liu, Rui Mei, Yilin Wang, Kejia Zhang, Kaishuo Wei, Yuqi Yu, Weiping Wen, Xiaojie Wu, Junhua Liu

The rapid adoption of large language models (LLMs) has brought both transformative applications and new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generation approaches e.g. AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. To address these limitations, we propose ForgeDAN, a novel evolutionary framework for generating semantically coherent and highly effective adversarial prompts against aligned LLMs. First, ForgeDAN introduces multi-strategy textual perturbations across \textit{character, word, and sentence-level} operations to enhance attack diversity; then we employ interpretable semantic fitness evaluation based on a text similarity model to guide the evolutionary process toward semantically relevant and harmful outputs; finally, ForgeDAN integrates dual-dimensional jailbreak judgment, leveraging an LLM-based classifier to jointly assess model compliance and output harmfulness, thereby reducing false positives and improving detection effectiveness. Our evaluation demonstrates ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.


Paper and project links

PDF

Summary

The rapid adoption of large language models (LLMs) brings transformative applications but also new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generators such as AutoDAN suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. ForgeDAN is an evolutionary framework that introduces multi-strategy text perturbations at the character, word, and sentence level to increase attack diversity, uses an interpretable semantic fitness evaluation based on a text-similarity model to steer evolution toward semantically relevant harmful outputs, and adds a dual-dimensional jailbreak judgment with an LLM-based classifier that jointly assesses model compliance and output harmfulness, reducing false positives and improving detection. Evaluations show that ForgeDAN achieves high jailbreak success rates while remaining natural and stealthy, outperforming state-of-the-art solutions.

Key Takeaways

  1. The widespread adoption of large language models (LLMs) raises security challenges, in particular jailbreak attacks.
  2. Existing solutions such as AutoDAN suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection.
  3. ForgeDAN increases attack diversity through multi-strategy text perturbations.
  4. ForgeDAN uses an interpretable semantic fitness evaluation based on a text-similarity model so the evolutionary process yields semantically relevant and harmful outputs.
  5. ForgeDAN integrates an LLM-based classifier for dual-dimensional jailbreak judgment, improving detection effectiveness and accuracy.
  6. ForgeDAN achieves high jailbreak success rates while maintaining naturalness and stealth.

Cool Papers

Click here to view paper screenshots

Multi-Agent Multimodal Large Language Model Framework for Automated Interpretation of Fuel Efficiency Analytics in Public Transportation

Authors:Zhipeng Ma, Ali Rida Bahja, Andreas Burgdorf, André Pomp, Tobias Meisen, Bo Nørregaard Jørgensen, Zheng Grace Ma

Enhancing fuel efficiency in public transportation requires the integration of complex multimodal data into interpretable, decision-relevant insights. However, traditional analytics and visualization methods often yield fragmented outputs that demand extensive human interpretation, limiting scalability and consistency. This study presents a multi-agent framework that leverages multimodal large language models (LLMs) to automate data narration and energy insight generation. The framework coordinates three specialized agents, including a data narration agent, an LLM-as-a-judge agent, and an optional human-in-the-loop evaluator, to iteratively transform analytical artifacts into coherent, stakeholder-oriented reports. The system is validated through a real-world case study on public bus transportation in Northern Jutland, Denmark, where fuel efficiency data from 4006 trips are analyzed using Gaussian Mixture Model clustering. Comparative experiments across five state-of-the-art LLMs and three prompting paradigms identify GPT-4.1 mini with Chain-of-Thought prompting as the optimal configuration, achieving 97.3% narrative accuracy while balancing interpretability and computational cost. The findings demonstrate that multi-agent orchestration significantly enhances factual precision, coherence, and scalability in LLM-based reporting. The proposed framework establishes a replicable and domain-adaptive methodology for AI-driven narrative generation and decision support in energy informatics.


Paper and project links

PDF

Summary

The study proposes a multi-agent framework that uses multimodal large language models to automate data narration and energy-insight generation. The framework coordinates a data-narration agent, an LLM-as-a-judge agent, and an optional human-in-the-loop evaluator to turn analytical artifacts into coherent, stakeholder-oriented reports. It is validated in a real-world case study of public bus transportation in Northern Jutland, Denmark, where fuel-efficiency data from 4006 trips are analyzed with Gaussian Mixture Model clustering. Comparative experiments identify GPT-4.1 mini with Chain-of-Thought prompting as the optimal configuration, reaching 97.3% narrative accuracy while balancing interpretability and computational cost. The work offers a replicable, domain-adaptive methodology for AI-driven narrative generation and decision support in energy informatics.

Key Takeaways

  1. The framework coordinates a data-narration agent, an LLM-as-a-judge agent, and a human-in-the-loop evaluator to process multimodal data and automatically generate energy insights.
  2. It is validated on a real-world public bus transportation case study analyzing fuel-efficiency data (a minimal clustering sketch follows this list).
  3. Comparative experiments identify GPT-4.1 mini with Chain-of-Thought prompting as the optimal configuration, with high narrative accuracy.
  4. The framework improves factual precision, coherence, and scalability, with broad applicability in energy informatics.
  5. The study provides a replicable methodology for AI-driven narrative generation and decision support that can transfer to other domains.
  6. The framework highlights the value of human-machine collaboration in complex data processing and analysis.
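The case study clusters per-trip fuel-efficiency records with a Gaussian Mixture Model before the agents narrate the results. Below is a minimal, generic sketch of that clustering step with scikit-learn; the synthetic data, feature names, and BIC-based model selection are assumptions for illustration, not the study's actual schema or procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical per-trip features: fuel use (l/100km), mean speed (km/h), stop count.
trips = np.column_stack([
    rng.normal(32, 5, 4006),
    rng.normal(27, 6, 4006),
    rng.poisson(18, 4006).astype(float),
])

X = StandardScaler().fit_transform(trips)

# Pick the number of mixture components by BIC, then assign each trip to a cluster.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X) for k in range(2, 7)}
best_k = min(bics, key=bics.get)
labels = GaussianMixture(n_components=best_k, random_state=0).fit_predict(X)

for k in range(best_k):
    mean_fuel = trips[labels == k, 0].mean()
    print(f"cluster {k}: {np.sum(labels == k)} trips, mean fuel {mean_fuel:.1f} l/100km")
```

The resulting cluster labels and statistics are the kind of analytical artifact the narration agent would then turn into a stakeholder-oriented report.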

Cool Papers

Click here to view paper screenshots

Referring Camouflaged Object Detection With Multi-Context Overlapped Windows Cross-Attention

Authors:Yu Wen, Shuyong Gao, Shuping Zhang, Miao Huang, Lili Tao, Han Yang, Haozhe Xing, Lihe Zhang, Boxue Hou

Referring camouflaged object detection (Ref-COD) aims to identify hidden objects by incorporating reference information such as images and text descriptions. Previous research has transformed reference images with salient objects into one-dimensional prompts, yielding significant results. We explore ways to enhance performance through multi-context fusion of rich salient image features and camouflaged object features. Therefore, we propose RFMNet, which utilizes features from multiple encoding stages of the reference salient images and performs interactive fusion with the camouflage features at the corresponding encoding stages. Given that the features in salient object images contain abundant object-related detail information, performing feature fusion within local areas is more beneficial for detecting camouflaged objects. Therefore, we propose an Overlapped Windows Cross-attention mechanism to enable the model to focus more attention on the local information matching based on reference features. Besides, we propose the Referring Feature Aggregation (RFA) module to decode and segment the camouflaged objects progressively. Extensive experiments on the Ref-COD benchmark demonstrate that our method achieves state-of-the-art performance.


Paper and project links

PDF 12 pages, 7 figures, This work is supported by the National Natural Science Foundation of China (Grant No. 62203291)

Summary

The paper addresses referring camouflaged object detection (Ref-COD), which identifies hidden objects with the help of reference information such as images and text descriptions, and proposes RFMNet. RFMNet uses features from multiple encoding stages of the reference salient images and fuses them interactively with the camouflage features at the corresponding stages. An Overlapped Windows Cross-attention mechanism focuses the model on local information matching, and a Referring Feature Aggregation module progressively decodes and segments the camouflaged objects. Experiments on the Ref-COD benchmark show state-of-the-art performance.

Key Takeaways

  1. RFMNet interactively fuses multi-stage encoded features of reference salient images with camouflaged-object features to better identify hidden objects.
  2. Fusing features within local regions better exploits the rich object-related detail in salient images and improves camouflaged object detection accuracy.
  3. A new Overlapped Windows Cross-attention mechanism lets the model attend to local information matching guided by the reference features.
  4. The Referring Feature Aggregation (RFA) module in RFMNet progressively decodes and segments camouflaged objects.
  5. Experiments show that RFMNet achieves state-of-the-art performance on the Ref-COD benchmark.
  6. By incorporating multi-context information, RFMNet improves generalization and robustness in complex real-world scenes.

Cool Papers

Click here to view paper screenshots

Video Spatial Reasoning with Object-Centric 3D Rollout

Authors:Haoran Tang, Meng Cao, Ruyang Liu, Xiaoxi Liang, Linglong Li, Ge Li, Xiaodan Liang

Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning-the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes-remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR’s superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).


Paper and project links

PDF

Summary

While multimodal large language models have advanced vision-language understanding, robust video spatial reasoning remains unsolved. Existing approaches based on spatially grounded supervised fine-tuning or reinforcement learning tend to exhibit query-locked reasoning and ignore key contextual cues. The authors propose Object-Centric 3D Rollout (OCR), which introduces structured perturbations to the 3D geometry of selected objects during training; by degrading object-specific visual cues and projecting the altered geometry into 2D, OCR forces the model to reason over the whole scene. A rollout-based training pipeline jointly uses vanilla and region-noisy videos to optimize spatial reasoning trajectories. The 3B-parameter model reaches 47.5% accuracy on VSI-Bench, outperforming several baselines.

Key Takeaways

  1. Multimodal large language models have made notable progress in vision-language understanding, but video spatial reasoning remains an open challenge.
  2. Existing methods rely on supervised fine-tuning or reinforcement learning for spatial grounding and suffer from query-locked reasoning.
  3. Object-Centric 3D Rollout (OCR) improves spatial reasoning by applying structured perturbations to the 3D geometry of selected objects during training.
  4. By degrading object-specific visual cues and projecting the altered geometry into 2D, OCR forces the model to reason over the entire scene (a toy sketch of this perturb-and-project step follows this list).
  5. A rollout-based training pipeline combines vanilla and region-noisy videos to optimize spatial reasoning trajectories.
  6. Experiments show that the model reaches 47.5% accuracy on VSI-Bench, outperforming several baseline models.
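OCR is described as perturbing the 3D geometry of selected objects and projecting the altered geometry into 2D so the model cannot rely on the queried object alone. The exact rollout procedure is not given in this summary; the toy numpy sketch below only illustrates the two mechanical pieces that are described, Gaussian jitter on one object's 3D points plus a pinhole projection, with camera intrinsics and noise scale chosen arbitrarily.

```python
import numpy as np

def perturb_object(points: np.ndarray, object_mask: np.ndarray,
                   sigma: float = 0.05, rng=None) -> np.ndarray:
    """Add Gaussian jitter only to the 3D points of one selected object."""
    rng = rng or np.random.default_rng()
    noisy = points.copy()
    noisy[object_mask] += rng.normal(0.0, sigma, size=noisy[object_mask].shape)
    return noisy

def project_pinhole(points: np.ndarray, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Project Nx3 camera-frame points to pixel coordinates (simple pinhole model)."""
    z = np.clip(points[:, 2], 1e-6, None)
    u = fx * points[:, 0] / z + cx
    v = fy * points[:, 1] / z + cy
    return np.stack([u, v], axis=1)

rng = np.random.default_rng(0)
scene = rng.uniform([-1, -1, 2], [1, 1, 5], size=(1000, 3))   # toy scene points
mask = rng.random(1000) < 0.1                                  # "selected object" points
pixels_clean = project_pinhole(scene)
pixels_noisy = project_pinhole(perturb_object(scene, mask, sigma=0.1, rng=rng))
print(np.abs(pixels_noisy - pixels_clean)[mask].mean())        # only masked points move
```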

Cool Papers

Click here to view paper screenshots

Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis

Authors:Zaara Zabeen Arpa, Sadnam Sakib Apurbo, Nazia Karim Khan Oishee, Ajwad Abrar

Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model achieving the highest accuracy of 84.78% and an F1 score of 0.677. This establishes a strong, linguistically-informed baseline and provides essential data for developing sophisticated, semantic-preserving text normalization systems for Bangla.


Paper and project links

PDF

Summary

ASR transcripts in low-resource languages such as Bangla contain an ambiguity between repetition disfluency (an unintentional ASR error or hesitation) and morphological reduplication (a deliberate grammatical construct), and standard disfluency correction wrongly deletes valid linguistic information. The paper introduces the first publicly available Bangla corpus, with 20,000 manually annotated rows, that explicitly distinguishes the two phenomena in noisy ASR transcripts, and benchmarks it with state-of-the-art multilingual LLMs and task-specific fine-tuning of encoder models. The LLMs reach competitive accuracy with few-shot prompting, but fine-tuning wins: the language-specific BanglaBERT model achieves the best accuracy and F1 score, establishing a strong baseline and providing essential data for semantics-preserving text normalization in Bangla.

Key Takeaways

The key points are as follows:

  1. ASR transcripts in low-resource languages contain a critical ambiguity: a word-word repetition can be either an unintentional disfluency or a deliberate grammatical reduplication.
  2. To address this, the first publicly available Bangla corpus with 20,000 manually annotated rows is introduced.
  3. The new resource is benchmarked with two paradigms: state-of-the-art multilingual large language models and task-specific fine-tuning of encoder models.
  4. Multilingual LLMs are competitive, reaching up to 82.68% accuracy with few-shot prompting.
  5. Task-specific fine-tuning performs better, with the language-specific BanglaBERT model achieving 84.78% accuracy and an F1 score of 0.677 (a fine-tuning sketch follows this list).
  6. The study provides essential data for building sophisticated, semantics-preserving text normalization systems for Bangla.
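The strongest reported result comes from fine-tuning a language-specific encoder as a binary classifier over ASR transcripts (repetition disfluency vs. morphological reduplication). Below is a compressed, generic sketch of that setup with Hugging Face Transformers; the checkpoint name and the toy example rows are assumptions, and the released 20K-row corpus would replace them in practice.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint name is an assumption; substitute whichever BanglaBERT checkpoint you use.
MODEL = "csebuetnlp/banglabert"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Toy placeholder rows; the real corpus pairs each transcript with a 0/1 label
# (0 = repetition disfluency, 1 = morphological reduplication).
texts = ["... transcript with word word repetition ...",
         "... transcript with grammatical reduplication ..."]
labels = [0, 1]

enc = tok(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()   # cross-entropy loss computed internally from `labels`
        optim.step()
        optim.zero_grad()
```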

Cool Papers

Click here to view paper screenshots

Soft Conflict-Resolution Decision Transformer for Offline Multi-Task Reinforcement Learning

Authors:Shudong Wang, Xinfei Wang, Chenhao Zhang, Shanchen Pang, Haiyuan Gui, Wenhao Ji, Xiaojian Liao

Multi-task reinforcement learning (MTRL) seeks to learn a unified policy for diverse tasks, but often suffers from gradient conflicts across tasks. Existing masking-based methods attempt to mitigate such conflicts by assigning task-specific parameter masks. However, our empirical study shows that coarse-grained binary masks have the problem of over-suppressing key conflicting parameters, hindering knowledge sharing across tasks. Moreover, different tasks exhibit varying conflict levels, yet existing methods use a one-size-fits-all fixed sparsity strategy to keep training stability and performance, which proves inadequate. These limitations hinder the model’s generalization and learning efficiency. To address these issues, we propose SoCo-DT, a Soft Conflict-resolution method based by parameter importance. By leveraging Fisher information, mask values are dynamically adjusted to retain important parameters while suppressing conflicting ones. In addition, we introduce a dynamic sparsity adjustment strategy based on the Interquartile Range (IQR), which constructs task-specific thresholding schemes using the distribution of conflict and harmony scores during training. To enable adaptive sparsity evolution throughout training, we further incorporate an asymmetric cosine annealing schedule to continuously update the threshold. Experimental results on the Meta-World benchmark show that SoCo-DT outperforms the state-of-the-art method by 7.6% on MT50 and by 10.5% on the suboptimal dataset, demonstrating its effectiveness in mitigating gradient conflicts and improving overall multi-task performance.


Paper and project links

PDF

Summary

The paper targets gradient conflicts in multi-task reinforcement learning (MTRL). Existing masking-based methods assign task-specific parameter masks, but coarse binary masks over-suppress key conflicting parameters and block knowledge sharing across tasks. SoCo-DT is a soft conflict-resolution method based on parameter importance: Fisher information is used to adjust mask values dynamically so that important parameters are retained while conflicting ones are suppressed; a dynamic sparsity strategy based on the Interquartile Range (IQR) builds task-specific thresholds; and an asymmetric cosine annealing schedule lets the sparsity evolve adaptively during training. On the Meta-World benchmark, SoCo-DT outperforms the state-of-the-art method, demonstrating its effectiveness at mitigating gradient conflicts and improving overall multi-task performance.

Key Takeaways

  1. Multi-task reinforcement learning (MTRL) suffers from gradient conflicts across tasks.
  2. Existing masking-based methods try to mitigate these conflicts, but coarse-grained binary masks over-suppress key parameters.
  3. SoCo-DT resolves conflicts softly based on parameter importance, using Fisher information to adjust mask values dynamically (see the sketch after this list).
  4. A dynamic sparsity strategy based on the Interquartile Range (IQR) builds task-specific thresholds from the distribution of conflict and harmony scores.
  5. An asymmetric cosine annealing schedule lets the sparsity threshold evolve adaptively during training.
  6. On the Meta-World benchmark, SoCo-DT outperforms the previous best method, showing its effectiveness in mitigating gradient conflicts and improving multi-task performance.
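Two ingredients described above are (i) a diagonal Fisher-information estimate of parameter importance and (ii) a task-specific threshold derived from the interquartile range of score distributions. The PyTorch sketch below shows only those two generic ingredients; the actual conflict/harmony scoring, the asymmetric cosine annealing, and the integration into the Decision Transformer are not reproduced, and the sigmoid sharpness is an arbitrary choice.

```python
import torch

def diagonal_fisher(model, data_loader, loss_fn, n_batches=10):
    """Estimate per-parameter importance as the mean squared gradient (diagonal Fisher)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    seen = 0
    for x, y in data_loader:
        if seen >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        seen += 1
    return {n: f / max(seen, 1) for n, f in fisher.items()}

def iqr_soft_mask(scores: torch.Tensor, sharpness: float = 10.0) -> torch.Tensor:
    """Soft (0,1) mask keyed to an IQR-based cutoff instead of a fixed sparsity level."""
    q1, q3 = torch.quantile(scores, 0.25), torch.quantile(scores, 0.75)
    threshold = q3 + 1.5 * (q3 - q1)        # task-specific cutoff from the score distribution
    # Values near 1 flag parameters whose score exceeds the cutoff; the mask stays soft,
    # so no parameter is hard-zeroed the way a binary mask would do.
    return torch.sigmoid(sharpness * (scores - threshold))
```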

Cool Papers

Click here to view paper screenshots

MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images

Authors:Doanh C. Bui, Ba Hung Ngo, Hoai Luan Pham, Khang Nguyen, Maï K. Nguyen, Yasuhiko Nakashima

Lifelong learning on Whole Slide Images (WSIs) aims to train or fine-tune a unified model sequentially on cancer-related tasks, reducing the resources and effort required for data transfer and processing, especially given the gigabyte-scale size of WSIs. In this paper, we introduce MergeSlide, a simple yet effective framework that treats lifelong learning as a model merging problem by leveraging a vision-language pathology foundation model. When a new task arrives, it is: 1) defined with class-aware prompts, 2) fine-tuned for a few epochs using an MLP-free backbone, and 3) merged into a unified model using an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For inference under the class-incremental learning (CLASS-IL) setting, where task identity is unknown, we introduce Task-to-Class Prompt-aligned (TCP) inference. Specifically, TCP first identifies the most relevant task using task-level prompts and then applies the corresponding class-aware prompts to generate predictions. To evaluate MergeSlide, we conduct experiments on a stream of six TCGA datasets. The results show that MergeSlide outperforms both rehearsal-based continual learning and vision-language zero-shot baselines. Code and data are available at https://github.com/caodoanh2001/MergeSlide.


Paper and project links

PDF WACV2026 Accepted

Summary

Lifelong learning on Whole Slide Images (WSIs) trains or fine-tunes a unified model sequentially on cancer-related tasks, reducing the resources needed to transfer and process gigabyte-scale slides. MergeSlide is a simple, effective framework that treats lifelong learning as a model-merging problem on top of a vision-language pathology foundation model: each new task is defined with class-aware prompts, fine-tuned for a few epochs with an MLP-free backbone, and merged into a unified model with an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For class-incremental inference, where task identity is unknown, Task-to-Class Prompt-aligned (TCP) inference is introduced. On a stream of six TCGA datasets, MergeSlide outperforms rehearsal-based continual learning and vision-language zero-shot baselines.

Key Takeaways

  1. MergeSlide treats lifelong learning on WSIs as a model-merging problem, built on a vision-language pathology foundation model.
  2. Each new task is defined with class-aware prompts and fine-tuned for a few epochs using an MLP-free backbone.
  3. An orthogonal continual merging strategy folds the new task into a unified model, preserving performance and mitigating catastrophic forgetting (a generic merging baseline is sketched after this list).
  4. Task-to-Class Prompt-aligned (TCP) inference handles the class-incremental setting, where task identity is unknown.
  5. TCP first identifies the most relevant task via task-level prompts and then applies the corresponding class-aware prompts to generate predictions.
  6. On a stream of six TCGA datasets, MergeSlide outperforms rehearsal-based continual learning and vision-language zero-shot baselines.
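The abstract does not spell out the orthogonal continual merging step, so the snippet below is only the generic "task vector" merging baseline the idea builds on conceptually: fine-tuned weights are expressed as deltas from a shared frozen backbone and accumulated into one model. This is a stand-in baseline for illustration, not the paper's orthogonal projection step.

```python
import copy
import torch

@torch.no_grad()
def merge_task_vectors(base_model, finetuned_models, alpha=1.0):
    """Plain task-arithmetic merge: theta = theta_base + alpha * mean_t(theta_t - theta_base).

    MergeSlide additionally orthogonalizes per-task updates before merging them
    continually; that step is not reproduced here.
    """
    merged = copy.deepcopy(base_model)
    base = dict(base_model.named_parameters())
    deltas = {n: torch.zeros_like(p) for n, p in base.items()}
    for ft in finetuned_models:
        for n, p in ft.named_parameters():
            deltas[n] += p.detach() - base[n].detach()
    for n, p in merged.named_parameters():
        p.add_(alpha * deltas[n] / len(finetuned_models))
    return merged
```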

Cool Papers

Click here to view paper screenshots

GEM: Generative Entropy-Guided Preference Modeling for Few-shot Alignment of LLMs

Authors:Yiyang Zhao, Huiyu Bai, Xuejiao Zhao

Alignment of large language models (LLMs) with human preferences typically relies on supervised reward models or external judges that demand abundant annotations. However, in fields that rely on professional knowledge, such as medicine and law, such large-scale preference labels are often unachievable. In this paper, we propose a generative entropy-guided preference modeling approach named GEM for LLMs aligment at low-resource and domain-specific scenarios. Instead of training a discriminative reward model on preference data, we directly train the LLM to internalize a closed-loop optimization architecture that can extract and exploit the multi-dimensional, fine-grained cognitive signals implicit in human preferences. Specifically, our Cognitive Filtering module, based on entropy theory in decision making, first leverages Chain-of-Thought (CoT) prompting to generate diverse candidate reasoning chains (CoTs) from preference data. Subsequently, it introduces a token scoring mechanism to rank and weight the sampled CoTs, boosting the importance of high-confidence answers and strategically high-entropy tokens. Building on these filtered preferences, we fine-tune the LLM using a novel self-evaluated group advantage algorithm, SEGA, which effectively aggregates group-level cognitive signals and transforms the entropy-based scores into implicit rewards for policy optimization. In these ways, GEM empowers the LLM to rely on its own judgments and establishes an entropy-guided closed-loop cognitive optimization framework, enabling highly efficient few-shot alignment of LLMs. Experiments on general benchmarks and domain-specific tasks (such as mathematical reasoning and medical dialogues) demonstrate that our GEM achieves significant improvements with few-shot preference data.


Paper and project links

PDF This paper has been accepted by AAAI 2026-AIA and designated as an oral presentation paper

Summary

The paper proposes GEM, a generative entropy-guided preference modeling approach for aligning large language models (LLMs) in low-resource, domain-specific settings. Instead of training a discriminative reward model on large amounts of preference labels, GEM trains the LLM itself to internalize a closed-loop optimization architecture that extracts and exploits the multi-dimensional, fine-grained cognitive signals implicit in human preferences. Experiments on general benchmarks and domain-specific tasks show that GEM achieves significant improvements with only few-shot preference data.

Key Takeaways

  1. GEM is a new approach for aligning large language models (LLMs), especially in low-resource and domain-specific scenarios.
  2. GEM does not rely on large amounts of preference labels; it directly trains the LLM to internalize a closed-loop optimization architecture.
  3. A Cognitive Filtering module, grounded in entropy theory, uses Chain-of-Thought (CoT) prompting to generate diverse candidate reasoning chains and a token scoring mechanism to rank and weight the sampled chains (a toy scoring sketch follows this list).
  4. The SEGA algorithm fine-tunes the LLM on the filtered preferences, aggregating group-level cognitive signals and turning entropy-based scores into implicit rewards for policy optimization.
  5. GEM lets the LLM rely on its own judgments, establishing an entropy-guided closed-loop cognitive optimization framework.
  6. Experiments on general benchmarks and domain-specific tasks, such as mathematical reasoning and medical dialogue, show significant improvements with few-shot preference data.
  7. GEM offers a promising route for aligning LLMs with human preferences when resources are limited and the domain is specialized.
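The cognitive-filtering step is described as scoring sampled chain-of-thought completions with token-level signals: confidence on the answer plus extra weight on strategically high-entropy tokens. The PyTorch sketch below shows one plausible way to compute such entropy/confidence scores from per-token logits; the weighting scheme, the top-k "strategic" token selection, and the 0.1 coefficient are assumptions, since the paper's exact formula is not given in this summary.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token predictive entropy from logits of shape (seq_len, vocab)."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)              # (seq_len,)

def score_cot(logits: torch.Tensor, token_ids: torch.Tensor,
              answer_span: slice, top_k_entropy: int = 8) -> torch.Tensor:
    """Score one sampled chain of thought: average answer-token confidence,
    plus a bonus keyed to the reasoning's highest-entropy (decision-point) tokens."""
    logp = F.log_softmax(logits, dim=-1)
    chosen_logp = logp.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)   # (seq_len,)
    answer_conf = chosen_logp[answer_span].mean().exp()                  # avg answer token prob
    ent = token_entropy(logits)
    k = min(top_k_entropy, ent.numel())
    entropy_bonus = ent.topk(k).values.mean()
    return answer_conf + 0.1 * entropy_bonus   # 0.1 is an arbitrary illustrative weight

# Usage idea: rank several sampled CoTs for the same question and keep the top ones, e.g.
# scores = [score_cot(l, t, slice(-5, None)) for l, t in zip(all_logits, all_token_ids)]
```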

Cool Papers

Click here to view paper screenshots

SAGE: Spuriousness-Aware Guided Prompt Exploration for Mitigating Multimodal Bias

Authors:Wenqian Ye, Di Wang, Guangtao Zheng, Bohan Liu, Aidong Zhang

Large vision-language models, such as CLIP, have shown strong zero-shot classification performance by aligning images and text in a shared embedding space. However, CLIP models often develop multimodal spurious biases, which is the undesirable tendency to rely on spurious features. For example, CLIP may infer object types in images based on frequently co-occurring backgrounds rather than the object’s core features. This bias significantly impairs the robustness of pre-trained CLIP models on out-of-distribution data, where such cross-modal associations no longer hold. Existing methods for mitigating multimodal spurious bias typically require fine-tuning on downstream data or prior knowledge of the bias, which undermines the out-of-the-box usability of CLIP. In this paper, we first theoretically analyze the impact of multimodal spurious bias in zero-shot classification. Based on this insight, we propose Spuriousness-Aware Guided Exploration (SAGE), a simple and effective method that mitigates spurious bias through guided prompt selection. SAGE requires no training, fine-tuning, or external annotations. It explores a space of prompt templates and selects the prompts that induce the largest semantic separation between classes, thereby improving worst-group robustness. Extensive experiments on four real-world benchmark datasets and five popular backbone models demonstrate that SAGE consistently improves zero-shot performance and generalization, outperforming previous zero-shot approaches without any external knowledge or model updates.


Paper and project links

PDF Accepted at AAAI 2026

Summary

Large vision-language models such as CLIP deliver strong zero-shot classification but develop multimodal spurious biases, for example inferring object types from frequently co-occurring backgrounds. The paper proposes SAGE (Spuriousness-Aware Guided Exploration), which mitigates this bias through guided prompt selection. SAGE requires no training, fine-tuning, or external annotations: it explores a space of prompt templates and selects the prompts that induce the largest semantic separation between classes, improving worst-group robustness. Experiments on four real-world benchmarks and five backbone models show consistent improvements in zero-shot performance and generalization over previous approaches.

Key Takeaways

  1. Large vision-language models such as CLIP achieve strong zero-shot classification but suffer from multimodal spurious biases.
  2. Existing mitigation methods need fine-tuning on downstream data or prior knowledge of the bias, which undermines CLIP's out-of-the-box usability.
  3. The paper proposes Spuriousness-Aware Guided Exploration (SAGE) to address this problem.
  4. SAGE explores a space of prompt templates and selects the prompts that maximize semantic separation between classes (see the selection sketch after this list).
  5. SAGE is simple and effective, requiring no training, fine-tuning, or external annotations.
  6. Experiments show SAGE outperforms previous zero-shot approaches on four real-world benchmark datasets and five popular backbone models.
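SAGE's core operation, as summarized above, is test-time selection of the prompt templates whose class text embeddings are most separated in the shared embedding space; nothing is trained. The sketch below shows only a selection criterion of that kind on top of precomputed CLIP text embeddings. `encode_text` is a placeholder for whichever frozen CLIP text encoder you use, and the mean pairwise cosine-distance score is an obvious heuristic stand-in, not necessarily the paper's exact objective.

```python
import numpy as np

def class_separation(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance between class text embeddings (higher = more separated)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T
    off_diag = sim[~np.eye(len(e), dtype=bool)]
    return float(1.0 - off_diag.mean())

def select_prompts(templates, class_names, encode_text, top_k=3):
    """Rank prompt templates by how well they separate the classes; keep the best few."""
    scored = []
    for t in templates:
        texts = [t.format(c) for c in class_names]      # e.g. "a photo of a {}." per class
        scored.append((class_separation(encode_text(texts)), t))
    scored.sort(reverse=True)
    return [t for _, t in scored[:top_k]]

# encode_text(list_of_strings) -> np.ndarray of shape (n_classes, dim) is assumed to wrap a
# frozen CLIP text encoder; templates could be "a photo of a {}.", "a sketch of a {}.", etc.
```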

Cool Papers

Click here to view paper screenshots

Infinite-Story: A Training-Free Consistent Text-to-Image Generation

Authors:Jihun Park, Kyoungmin Lee, Jongmin Gim, Hyeonseo Jo, Minseok Oh, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Minwoo Choi, Sunghoon Im

We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts; and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance, while offering over 6X faster inference (1.72 seconds per image) than the existing fastest consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.


Paper and project links

PDF 18 pages, 13 figures, AAAI 2026 Oral

Summary

Infinite-Story is a training-free framework for consistent text-to-image generation in multi-prompt storytelling scenarios. Built on a scale-wise autoregressive model, it addresses identity inconsistency and style inconsistency with three complementary techniques: Identity Prompt Replacement, Adaptive Style Injection, and Synchronized Guidance Adaptation, which together enforce global style and identity consistency while preserving prompt fidelity. The method runs entirely at test time and achieves state-of-the-art generation quality while offering over 6X faster inference (1.72 seconds per image) than the fastest existing consistent T2I models, making it practical for real-world visual storytelling.

Key Takeaways

  1. Infinite-Story is a training-free text-to-image generation framework tailored to multi-prompt storytelling scenarios.
  2. It addresses the key challenges of identity inconsistency and style inconsistency.
  3. Identity Prompt Replacement mitigates context bias in the text encoder so that identity attributes stay aligned across prompts.
  4. A unified attention guidance mechanism, combining Adaptive Style Injection and Synchronized Guidance Adaptation, enforces global style and identity consistency while preserving prompt fidelity; unlike diffusion-based approaches that need fine-tuning or suffer from slow inference, Infinite-Story runs entirely at test time and is over 6X faster (1.72 seconds per image) than the fastest existing consistent T2I models.

Cool Papers

Click here to view paper screenshots

Medal S: Spatio-Textual Prompt Model for Medical Segmentation

Authors:Pengcheng Shi, Jiawei Chen, Jiaqi Liu, Xinglin Zhang, Tao Chen, Lei Li

We introduce Medal S, a medical segmentation foundation model that supports native-resolution spatial and textual prompts within an end-to-end trainable framework. Unlike text-only methods lacking spatial awareness, Medal S achieves channel-wise alignment between volumetric prompts and text embeddings, mitigating inaccuracies from resolution mismatches. By preserving full 3D context, it efficiently processes multiple native-resolution masks in parallel, enhancing multi-class segmentation performance. A lightweight 3D convolutional module enables precise voxel-space refinement guided by both prompt types, supporting up to 243 classes across CT, MRI, PET, ultrasound, and microscopy modalities in the BiomedSegFM dataset. Medal S offers two prompting modes: a text-only mode, where model predictions serve as spatial prompts for self-refinement without human input, and a hybrid mode, incorporating manual annotations for enhanced flexibility. For 24-class segmentation, parallel spatial prompting reduces inference time by more than 90% compared to sequential prompting. We propose dynamic resampling to address target-patch ratio imbalance, extending SAT and nnU-Net for data augmentation. Furthermore, we develop optimized text preprocessing, a two-stage inference strategy, and post-processing techniques to improve memory efficiency, precision, and inference speed. On the five-modality average on the validation set, Medal S outperforms SAT with a DSC of 75.44 (vs. 69.83), NSD of 77.34 (vs. 71.06), F1 of 38.24 (vs. 24.88), and DSC TP of 65.46 (vs. 46.97). Medal S achieves excellent performance by harmonizing spatial precision with semantic textual guidance, demonstrating superior efficiency and accuracy in multi-class medical segmentation tasks compared to sequential prompt-based approaches. Medal S will be publicly available at https://github.com/yinghemedical/Medal-S.


Paper and project links

PDF Accepted by CVPR 2025 Workshop MedSegFM

Summary
Medal S is a medical segmentation foundation model that supports native-resolution spatial and textual prompts within an end-to-end trainable framework. Unlike text-only methods, it aligns volumetric prompts and text embeddings channel-wise, reducing errors from resolution mismatch, and it processes multiple native-resolution masks in parallel while preserving full 3D context, which improves multi-class segmentation. Medal S offers a text-only prompting mode, in which model predictions serve as spatial prompts for self-refinement without human input, and a hybrid mode that incorporates manual annotations. Averaged over five modalities on the validation set, Medal S outperforms SAT. Medal S will be publicly available on GitHub (link above). In short, Medal S combines spatial precision with semantic textual guidance to deliver efficient and accurate multi-class medical segmentation.

Key Takeaways

  1. Medal S is a medical segmentation foundation model that supports native-resolution spatial and textual prompts.
  2. Channel-wise alignment between volumetric prompts and text embeddings improves multi-class segmentation performance.
  3. The model processes multiple native-resolution masks in parallel while preserving full 3D context.
  4. Medal S offers two prompting modes, text-only and hybrid, to suit different needs.
  5. Averaged over five modalities on the validation set, Medal S outperforms SAT in segmentation performance.
  6. Dynamic resampling addresses the target-patch ratio imbalance and improves data augmentation.

Cool Papers

Click here to view paper screenshots

ArtiWorld: LLM-Driven Articulation of 3D Objects in Scenes

Authors:Yixuan Yang, Luyang Xie, Zhen Luo, Zixiang Zhao, Mingqi Gao, Feng Zheng

Building interactive simulators and scalable robot-learning environments requires a large number of articulated assets. However, most existing 3D assets in simulation are rigid, and manually converting them into articulated objects is extremely labor- and cost-intensive. This raises a natural question: can we automatically identify articulable objects in a scene and convert them into articulated assets directly? In this paper, we present ArtiWorld, a scene-aware pipeline that localizes candidate articulable objects from textual scene descriptions and reconstructs executable URDF models that preserve the original geometry. At the core of this pipeline is Arti4URDF, which leverages 3D point cloud, prior knowledge of a large language model (LLM), and a URDF-oriented prompt design to rapidly convert rigid objects into interactive URDF-based articulated objects while maintaining their 3D shape. We evaluate ArtiWorld at three levels: 3D simulated objects, full 3D simulated scenes, and real-world scan scenes. Across all three settings, our method consistently outperforms existing approaches and achieves state-of-the-art performance, while preserving object geometry and correctly capturing object interactivity to produce usable URDF-based articulated models. This provides a practical path toward building interactive, robot-ready simulation environments directly from existing 3D assets. Code and data will be released.


Paper and project links

PDF

Summary

The paper presents ArtiWorld, a scene-aware pipeline that localizes candidate articulable objects from textual scene descriptions and reconstructs executable URDF models while preserving the original geometry, enabling interactive simulators. Its core component, Arti4URDF, combines 3D point clouds, the prior knowledge of a large language model, and a URDF-oriented prompt design to rapidly convert rigid objects into interactive URDF-based articulated objects that keep their 3D shape. Evaluations on 3D simulated objects, full simulated scenes, and real-world scans show that the method consistently outperforms existing approaches and reaches state-of-the-art performance, offering a practical path toward building interactive, robot-ready simulation environments directly from existing 3D assets.

Key Takeaways

  1. ArtiWorld automatically identifies articulable objects from textual scene descriptions and converts them into articulated assets for interactive simulation.
  2. Arti4URDF, the core component of the pipeline, combines several techniques to turn rigid objects into interactive URDF-based articulated objects.
  3. The method preserves object geometry while correctly capturing object interactivity, producing usable URDF-based articulated models.
  4. ArtiWorld performs well on simulated objects, full simulated scenes, and real-world scanned scenes.
  5. The method clearly outperforms existing approaches and achieves state-of-the-art results.
  6. The work provides a practical route to building robot-ready simulation environments directly from existing 3D assets.

Cool Papers

Click here to view paper screenshots

PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos

Authors:Dianbing Xi, Guoyuan An, Jingsen Zhu, Zhijian Liu, Yuan Liu, Ruiyuan Zhang, Jiayuan Lu, Rui Wang, Yuchi Huo

We propose PFAvatar (Pose-Fusion Avatar), a new method that reconstructs high-quality 3D avatars from ``Outfit of the Day’’ (OOTD) photos, which exhibit diverse poses, occlusions, and complex backgrounds. Our method consists of two stages: (1) fine-tuning a pose-aware diffusion model from few-shot OOTD examples and (2) distilling a 3D avatar represented by a neural radiance field (NeRF). In the first stage, unlike previous methods that segment images into assets (e.g., garments, accessories) for 3D assembly, which is prone to inconsistency, we avoid decomposition and directly model the full-body appearance. By integrating a pre-trained ControlNet for pose estimation and a novel Condition Prior Preservation Loss (CPPL), our method enables end-to-end learning of fine details while mitigating language drift in few-shot training. Our method completes personalization in just 5 minutes, achieving a 48$\times$ speed-up compared to previous approaches. In the second stage, we introduce a NeRF-based avatar representation optimized by canonical SMPL-X space sampling and Multi-Resolution 3D-SDS. Compared to mesh-based representations that suffer from resolution-dependent discretization and erroneous occluded geometry, our continuous radiance field can preserve high-frequency textures (e.g., hair) and handle occlusions correctly through transmittance. Experiments demonstrate that PFAvatar outperforms state-of-the-art methods in terms of reconstruction fidelity, detail preservation, and robustness to occlusions/truncations, advancing practical 3D avatar generation from real-world OOTD albums. In addition, the reconstructed 3D avatar supports downstream applications such as virtual try-on, animation, and human video reenactment, further demonstrating the versatility and practical value of our approach.


Paper and project links

PDF Accepted by AAAI 2026

Summary
PFAvatar reconstructs high-quality 3D avatars from "Outfit of the Day" (OOTD) photos that exhibit diverse poses, occlusions, and complex backgrounds. The method has two stages: fine-tuning a pose-aware diffusion model from a few OOTD examples, and distilling a 3D avatar represented as a neural radiance field (NeRF). Instead of decomposing images into assets, it models full-body appearance directly, enabling end-to-end learning of fine details while mitigating language drift in few-shot training. The NeRF-based avatar representation, optimized with canonical SMPL-X space sampling and multi-resolution 3D-SDS, preserves high-frequency textures and handles occlusions correctly. Experiments show that PFAvatar outperforms state-of-the-art methods in reconstruction fidelity, detail preservation, and robustness to occlusions and truncations, advancing practical 3D avatar generation from real-world OOTD albums.

Key Takeaways

  1. PFAvatar reconstructs high-quality 3D avatars from OOTD photos with diverse poses, occlusions, and complex backgrounds.
  2. The method has two stages: fine-tuning a pose-aware diffusion model and distilling a NeRF-represented 3D avatar.
  3. It avoids decomposition and models full-body appearance directly, enabling end-to-end learning of fine details.
  4. A Condition Prior Preservation Loss (CPPL) mitigates language drift in few-shot training.
  5. The NeRF-based avatar representation preserves high-frequency textures and handles occlusions correctly.
  6. PFAvatar outperforms existing methods in reconstruction fidelity, detail preservation, and robustness to occlusions.

Cool Papers

Click here to view paper screenshots

Prompt-Driven Domain Adaptation for End-to-End Autonomous Driving via In-Context RL

Authors:Aleesha Khurram, Amir Moeini, Shangtong Zhang, Rohan Chandra

Despite significant progress and advances in autonomous driving, many end-to-end systems still struggle with domain adaptation (DA), such as transferring a policy trained under clear weather to adverse weather conditions. Typical DA strategies in the literature include collecting additional data in the target domain or re-training the model, or both. Both these strategies quickly become impractical as we increase scale and complexity of driving. These limitations have encouraged investigation into few-shot and zero-shot prompt-driven DA at inference time involving LLMs and VLMs. These methods work by adding a few state-action trajectories during inference to the prompt (similar to in-context learning). However, there are two limitations of such an approach: $(i)$ prompt-driven DA methods are currently restricted to perception tasks such as detection and segmentation and $(ii)$ they require expert few-shot data. In this work, we present a new approach to inference-time few-shot prompt-driven DA for closed-loop autonomous driving in adverse weather condition using in-context reinforcement learning (ICRL). Similar to other prompt-driven DA methods, our approach does not require any updates to the model parameters nor does it require additional data collection in adversarial weather regime. Furthermore, our approach advances the state-of-the-art in prompt-driven DA by extending to closed driving using general trajectories observed during inference. Our experiments using the CARLA simulator show that ICRL results in safer, more efficient, and more comfortable driving policies in the target domain compared to state-of-the-art prompt-driven DA baselines.


Paper and project links

PDF

Summary

Despite progress in autonomous driving, end-to-end systems still struggle with domain adaptation (DA), such as transferring a policy trained in clear weather to adverse conditions, and collecting target-domain data or re-training quickly becomes impractical at scale. The paper presents a new inference-time, few-shot, prompt-driven DA approach for closed-loop autonomous driving in adverse weather based on in-context reinforcement learning (ICRL). It requires no updates to the model parameters and no additional data collection in the adverse-weather regime, and it extends prompt-driven DA from perception tasks to closed-loop driving using general trajectories observed at inference. Experiments in the CARLA simulator show safer, more efficient, and more comfortable driving policies in the target domain than state-of-the-art prompt-driven DA baselines.

Key Takeaways

The key points are as follows:

  • Autonomous driving still faces domain adaptation (DA) challenges, in particular transferring driving policies to adverse weather conditions.
  • Typical DA strategies, collecting additional data in the target domain or re-training the model, become impractical as the scale and complexity of driving grow.
  • Few-shot prompt-driven DA at inference time (adding a few state-action trajectories to the prompt, as in in-context learning) is promising, but existing methods are limited to perception tasks and require expert few-shot data.

Cool Papers

Click here to view paper screenshots

Self-Supervised Visual Prompting for Cross-Domain Road Damage Detection

Authors:Xi Xiao, Zhuxuanzi Wang, Mingqiao Mo, Chen Liu, Chenrui Ma, Yanshu Li, Smita Krishnaswamy, Xiao Wang, Tianyang Wang

The deployment of automated pavement defect detection is often hindered by poor cross-domain generalization. Supervised detectors achieve strong in-domain accuracy but require costly re-annotation for new environments, while standard self-supervised methods capture generic features and remain vulnerable to domain shift. We propose PROBE, a self-supervised framework that visually probes target domains without labels. PROBE introduces a Self-supervised Prompt Enhancement Module (SPEM), which derives defect-aware prompts from unlabeled target data to guide a frozen ViT backbone, and a Domain-Aware Prompt Alignment (DAPA) objective, which aligns prompt-conditioned source and target representations. Experiments on four challenging benchmarks show that PROBE consistently outperforms strong supervised, self-supervised, and adaptation baselines, achieving robust zero-shot transfer, improved resilience to domain variations, and high data efficiency in few-shot adaptation. These results highlight self-supervised prompting as a practical direction for building scalable and adaptive visual inspection systems. Source code is publicly available: https://github.com/xixiaouab/PROBE/tree/main


Paper and project links

PDF Accepted by WACV 2026

Summary

The paper proposes a self-supervised framework that visually probes target domains without labels to enable cross-domain automated pavement defect detection. It introduces a Self-supervised Prompt Enhancement Module (SPEM) that derives defect-aware prompts from unlabeled target data to guide a frozen ViT backbone, and a Domain-Aware Prompt Alignment (DAPA) objective that aligns prompt-conditioned source and target representations. On four challenging benchmarks the framework outperforms strong supervised, self-supervised, and adaptation baselines, with robust zero-shot transfer, better resilience to domain variation, and high data efficiency in few-shot adaptation, showing that self-supervised prompting is a practical direction for scalable, adaptive visual inspection systems.

Key Takeaways

  1. A self-supervised framework for road damage detection is proposed that probes the target domain without any labels.
  2. The Self-supervised Prompt Enhancement Module (SPEM) derives defect-aware prompts from unlabeled target data.
  3. The Domain-Aware Prompt Alignment (DAPA) objective aligns prompt-conditioned source and target representations.
  4. The method outperforms strong supervised, self-supervised, and adaptation baselines on four benchmarks.
  5. It achieves robust zero-shot transfer, improved resilience to domain variations, and high data efficiency in few-shot adaptation.
  6. The results show that self-supervised prompting is a practical direction for building scalable and adaptive visual inspection systems.

Cool Papers

Click here to view paper screenshots

Bridging Granularity Gaps: Hierarchical Semantic Learning for Cross-domain Few-shot Segmentation

Authors:Sujun Sun, Haowen Gu, Cheng Xie, Yanxu Ren, Mingwu Ren, Haofeng Zhang

Cross-domain Few-shot Segmentation (CD-FSS) aims to segment novel classes from target domains that are not involved in training and have significantly different data distributions from the source domain, using only a few annotated samples, and recent years have witnessed significant progress on this task. However, existing CD-FSS methods primarily focus on style gaps between source and target domains while ignoring segmentation granularity gaps, resulting in insufficient semantic discriminability for novel classes in target domains. Therefore, we propose a Hierarchical Semantic Learning (HSL) framework to tackle this problem. Specifically, we introduce a Dual Style Randomization (DSR) module and a Hierarchical Semantic Mining (HSM) module to learn hierarchical semantic features, thereby enhancing the model’s ability to recognize semantics at varying granularities. DSR simulates target domain data with diverse foreground-background style differences and overall style variations through foreground and global style randomization respectively, while HSM leverages multi-scale superpixels to guide the model to mine intra-class consistency and inter-class distinction at different granularities. Additionally, we also propose a Prototype Confidence-modulated Thresholding (PCMT) module to mitigate segmentation ambiguity when foreground and background are excessively similar. Extensive experiments are conducted on four popular target domain datasets, and the results demonstrate that our method achieves state-of-the-art performance.


Paper and project links

PDF Accepted by AAAI 2026

Summary

Cross-domain few-shot segmentation (CD-FSS) must segment novel classes in target domains that were not seen in training and whose data distribution differs markedly from the source domain, using only a few annotated samples. Existing methods focus on the style gap and neglect the segmentation-granularity gap, so semantic discriminability for novel target-domain classes is insufficient. The proposed Hierarchical Semantic Learning (HSL) framework addresses this with a Dual Style Randomization (DSR) module, a Hierarchical Semantic Mining (HSM) module, and a Prototype Confidence-modulated Thresholding (PCMT) module. DSR simulates target-domain data with diverse foreground-background and overall style variations, HSM uses multi-scale superpixels to mine intra-class consistency and inter-class distinction at different granularities, and PCMT mitigates segmentation ambiguity when foreground and background are overly similar. Experiments on four popular target-domain datasets show state-of-the-art performance.

Key Takeaways

  1. CD-FSS segments novel classes that are absent from training and whose target-domain distribution differs significantly from the source domain.
  2. Existing methods focus on style gaps and ignore segmentation-granularity gaps, giving insufficient semantic discriminability for novel classes in target domains.
  3. The proposed HSL framework tackles this with three modules: DSR, HSM, and PCMT.
  4. The DSR module simulates target-domain data with diverse foreground-background and overall style variations (a simplified global style randomization sketch follows this list).
  5. The HSM module uses multi-scale superpixels to mine intra-class consistency and inter-class distinction at different granularities.
  6. The PCMT module reduces segmentation ambiguity when foreground and background are overly similar.
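The DSR module is described as simulating target-domain appearance via foreground and global style randomization. A common way to realize "global style randomization" on feature maps is to perturb channel-wise mean/std statistics (an AdaIN-style trick); the PyTorch sketch below shows that generic operation only, as an illustration rather than the paper's exact DSR formulation, and foreground-only randomization would additionally require the support mask.

```python
import torch

def global_style_randomization(feat: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Perturb channel-wise statistics of a (B, C, H, W) feature map.

    Mixes each sample's per-channel mean/std with statistics resampled from other
    samples plus random jitter, changing "style" while keeping content.
    """
    b = feat.shape[0]
    mu = feat.mean(dim=(2, 3), keepdim=True)                 # (B, C, 1, 1)
    sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-6
    normalized = (feat - mu) / sigma

    # Random target statistics: shuffle statistics across the batch and jitter them.
    perm = torch.randperm(b, device=feat.device)
    mu_new = (1 - alpha) * mu + alpha * mu[perm]
    sigma_new = (1 - alpha) * sigma + alpha * sigma[perm]
    mu_new = mu_new * (1 + 0.1 * torch.randn_like(mu_new))
    sigma_new = sigma_new * (1 + 0.1 * torch.randn_like(sigma_new))

    return normalized * sigma_new + mu_new
```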

Cool Papers

Click here to view paper screenshots

OAD-Promoter: Enhancing Zero-shot VQA using Large Language Models with Object Attribute Description

Authors:Quanxing Xu, Ling Zhou, Feifei Zhang, Jinyu Tian, Rubing Huang

Large Language Models (LLMs) have become a crucial tool in Visual Question Answering (VQA) for handling knowledge-intensive questions in few-shot or zero-shot scenarios. However, their reliance on massive training datasets often causes them to inherit language biases during the acquisition of knowledge. This limitation imposes two key constraints on existing methods: (1) LLM predictions become less reliable due to bias exploitation, and (2) despite strong knowledge reasoning capabilities, LLMs still struggle with out-of-distribution (OOD) generalization. To address these issues, we propose Object Attribute Description Promoter (OAD-Promoter), a novel approach for enhancing LLM-based VQA by mitigating language bias and improving domain-shift robustness. OAD-Promoter comprises three components: the Object-concentrated Example Generation (OEG) module, the Memory Knowledge Assistance (MKA) module, and the OAD Prompt. The OEG module generates global captions and object-concentrated samples, jointly enhancing visual information input to the LLM and mitigating bias through complementary global and regional visual cues. The MKA module assists the LLM in handling OOD samples by retrieving relevant knowledge from stored examples to support questions from unseen domains. Finally, the OAD Prompt integrates the outputs of the preceding modules to optimize LLM inference. Experiments demonstrate that OAD-Promoter significantly improves the performance of LLM-based VQA methods in few-shot or zero-shot settings, achieving new state-of-the-art results.


Paper and project links

PDF Accepted by AAAI 2026

Summary

Large language models are a crucial tool for knowledge-intensive visual question answering (VQA) in few-shot or zero-shot settings, but their reliance on massive training data lets them inherit language biases, which reduces prediction reliability and limits out-of-distribution (OOD) generalization. OAD-Promoter enhances LLM-based VQA by mitigating language bias and improving domain-shift robustness through three components: an Object-concentrated Example Generation (OEG) module that produces global captions and object-concentrated samples, a Memory Knowledge Assistance (MKA) module that retrieves relevant knowledge from stored examples to support questions from unseen domains, and an OAD Prompt that integrates the outputs of the preceding modules to optimize LLM inference. Experiments show new state-of-the-art results for LLM-based VQA in few-shot and zero-shot settings.

Key Takeaways

  1. When LLMs handle knowledge-intensive VQA, their dependence on massive training data can introduce language bias.
  2. Language bias reduces the reliability of LLM predictions and limits out-of-distribution generalization.
  3. OAD-Promoter enhances LLM-based VQA with three modules: OEG, MKA, and the OAD Prompt.
  4. The OEG module generates global captions and object-concentrated samples, enriching the visual input and mitigating bias through complementary global and regional cues.
  5. The MKA module retrieves relevant knowledge from stored examples to help the LLM answer questions from unseen domains.
  6. The OAD Prompt integrates the outputs of the preceding modules to optimize LLM inference.

Cool Papers

Click here to view paper screenshots

Towards Mitigating Systematics in Large-Scale Surveys via Few-Shot Optimal Transport-Based Feature Alignment

Authors:Sultan Hassan, Sambatra Andrianomena, Benjamin D. Wandelt

Systematics contaminate observables, leading to distribution shifts relative to theoretically simulated signals-posing a major challenge for using pre-trained models to label such observables. Since systematics are often poorly understood and difficult to model, removing them directly and entirely may not be feasible. To address this challenge, we propose a novel method that aligns learned features between in-distribution (ID) and out-of-distribution (OOD) samples by optimizing a feature-alignment loss on the representations extracted from a pre-trained ID model. We first experimentally validate the method on the MNIST dataset using possible alignment losses, including mean squared error and optimal transport, and subsequently apply it to large-scale maps of neutral hydrogen. Our results show that optimal transport is particularly effective at aligning OOD features when parity between ID and OOD samples is unknown, even with limited data-mimicking real-world conditions in extracting information from large-scale surveys. Our code is available at https://github.com/sultan-hassan/feature-alignment-for-OOD-generalization.


Paper and project links

PDF 5 pages, 3 figures, accepted to NeurIPS Workshop on Unifying Representations in Neural Models (UniReps 2025)

Summary

Systematics contaminate observables and shift their distribution relative to theoretically simulated signals, which makes it hard to label them with pre-trained models, and removing systematics directly and entirely is often infeasible. The paper proposes aligning learned features between in-distribution (ID) and out-of-distribution (OOD) samples by optimizing a feature-alignment loss on representations extracted from a pre-trained ID model. The method is first validated on MNIST with candidate alignment losses, including mean squared error and optimal transport, and then applied to large-scale maps of neutral hydrogen. Optimal transport proves especially effective when parity between ID and OOD samples is unknown, even with limited data, mimicking real-world conditions for extracting information from large-scale surveys. The code is publicly available.

Key Takeaways

  • Systematics shift the distribution of observables relative to theoretically simulated signals, a major challenge for labeling such observables with pre-trained models.
  • Removing systematics directly may be infeasible because they are poorly understood and hard to model, so a feature-alignment approach is proposed instead.
  • The method optimizes a feature-alignment loss on representations extracted from a pre-trained ID model, aligning OOD features with ID features.
  • Experiments on MNIST show good performance, with optimal transport giving the best alignment when ID/OOD parity is unknown, even with limited data, which mimics real-world conditions for extracting information from large-scale surveys (a minimal OT-loss sketch follows this list).
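The alignment loss compares feature batches from ID and OOD samples, and the authors report that an optimal-transport loss works best when class parity between the two sets is unknown. Below is a compact entropic-regularized Sinkhorn distance in PyTorch that could serve as such a feature-alignment loss (uniform marginals, squared-Euclidean cost); it is a generic sketch under those assumptions, not the authors' released implementation, which is linked above.

```python
import math
import torch

def sinkhorn_loss(x: torch.Tensor, y: torch.Tensor,
                  eps: float = 0.05, n_iters: int = 200) -> torch.Tensor:
    """Entropic OT cost between feature batches x: (n, d) and y: (m, d), uniform weights."""
    cost = torch.cdist(x, y, p=2) ** 2                      # (n, m) squared Euclidean cost
    n, m = cost.shape
    log_a = torch.full((n,), -math.log(n), device=x.device, dtype=x.dtype)
    log_b = torch.full((m,), -math.log(m), device=x.device, dtype=x.dtype)
    f = torch.zeros(n, device=x.device, dtype=x.dtype)      # dual potentials
    g = torch.zeros(m, device=x.device, dtype=x.dtype)
    for _ in range(n_iters):                                # log-domain Sinkhorn updates
        f = eps * (log_a - torch.logsumexp((g[None, :] - cost) / eps, dim=1))
        g = eps * (log_b - torch.logsumexp((f[:, None] - cost) / eps, dim=0))
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps)   # approximate transport plan
    return (plan * cost).sum()

# Hypothetical usage as an alignment loss during adaptation of an encoder:
# loss = sinkhorn_loss(features_id.detach(), features_ood)
```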

Cool Papers

Click here to view paper screenshots

