⚠️ All of the summaries below are generated by large language models and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are only meant for initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-27
Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning
Authors:Panayiotis Danassis, Naman Goel
The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (produced by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and ~40k matches demonstrate (i) a clear superiority of human (graduate student)-coded agents: the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon it, the best performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.
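As a quick sanity check on the reported scale, here is a minimal sketch, assuming "double all-play-all" means every ordered pair of the 57 agents (40 LLM-coded plus 17 human-coded) plays once per tournament:

```python
from itertools import permutations

# 40 LLM-coded + 17 human-coded agents, per the abstract.
agents = range(40 + 17)

# Assumption: a "double all-play-all" pairs every ordered couple of
# distinct agents, so each pair meets twice with roles swapped.
matches_per_tournament = sum(1 for _ in permutations(agents, 2))

print(matches_per_tournament)        # 3192
print(12 * matches_per_tournament)   # 38304, i.e. the ~40k matches reported
```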
Paper and Project Links
Summary
The rapid progress of large language models (LLMs) in AI-assisted code generation demands more rigorous evaluation. Existing benchmarks overlook the complexity of real-world problems, so the authors introduce a multi-agent, reasoning-driven benchmark built on a real-world logistics optimization problem. The results show that human-coded agents outperform the majority of LLM-coded agents, and that when given the best human solution as input, the best-performing LLM makes it significantly worse rather than improving it, exposing a gap in LLMs' ability to produce code that can compete in the real world.
Key Takeaways
- LLMs are advancing rapidly in AI-assisted code generation, but existing benchmarks are inadequate for measuring their performance comprehensively.
- Prevailing benchmarks focus on unit-test pass rates and syntactic correctness, overlooking the complexity of real-world problems.
- A multi-agent, reasoning-driven benchmark based on a real-world logistics optimization problem is introduced to better reflect realistic conditions.
- The evaluation shows that human (graduate student)-coded agents outperform the majority of LLM-coded agents.
- Given the best human solution as input, the best-performing LLM makes the solution worse instead of improving it.
- There is a clear gap in LLMs' ability to produce code that works competitively in the real world.
Click here to view paper screenshots
Universe of Thoughts: Enabling Creative Reasoning with Large Language Models
Authors:Yuto Suzuki, Farnoush Banaei-Kashani
Reasoning based on Large Language Models (LLMs) has garnered increasing attention due to outstanding performance of these models in mathematical and complex logical tasks. Beginning with the Chain-of-Thought (CoT) prompting technique, numerous reasoning methods have emerged that decompose problems into smaller, sequential steps (or thoughts). However, existing reasoning models focus on conventional problem-solving and do not necessarily generate creative solutions by "creative reasoning". In domains where the solution space is expansive and conventional solutions are suboptimal, such as drug discovery or business strategization, creative reasoning to discover innovative solutions is crucial. To address this gap, first we introduce a computational framework for creative reasoning inspired by established cognitive science principles. With this framework, we propose three core creative reasoning paradigms, namely, combinational, exploratory, and transformative reasoning, where each offers specific directions for systematic exploration of the universe of thoughts to generate creative solutions. Next, to materialize this framework using LLMs, we introduce the Universe of Thoughts (or UoT, for short), a novel set of methods to implement the aforementioned three creative processes. Finally, we introduce three novel tasks that necessitate creative problem-solving, along with an evaluation benchmark to assess creativity from three orthogonal perspectives: feasibility as constraint, and utility and novelty as metrics. With a comparative analysis against the state-of-the-art (SOTA) reasoning techniques as well as representative commercial models with reasoning capability, we show that UoT demonstrates superior performance in creative reasoning.
Paper and Project Links
Summary
Reasoning with large language models (LLMs) has attracted growing attention thanks to their strong performance on mathematical and complex logical tasks. Popular reasoning methods such as Chain-of-Thought (CoT) decompose problems into smaller sequential steps, but they focus on conventional problem-solving and do not necessarily produce innovative solutions through "creative reasoning". In domains with expansive solution spaces where conventional solutions are suboptimal, such as drug discovery or business strategization, creative reasoning is crucial. Drawing on cognitive science principles, this work builds a computational framework and proposes three core creative reasoning paradigms: combinational, exploratory, and transformative reasoning. To materialize the framework with LLMs, the authors introduce the Universe of Thoughts (UoT), a novel set of methods. Evaluations show that UoT outperforms state-of-the-art reasoning techniques and representative commercial models with reasoning capability on creative reasoning.
Key Takeaways
- Large language models (LLMs) excel at mathematical and complex logical tasks, drawing wide attention.
- Popular reasoning methods such as Chain-of-Thought (CoT) focus on conventional problem-solving.
- Creative reasoning is crucial in domains with expansive solution spaces, such as drug discovery or business strategization.
- The study draws on cognitive science principles to build a computational framework for creative reasoning.
- Three core creative reasoning paradigms are proposed: combinational, exploratory, and transformative reasoning.
- The Universe of Thoughts (UoT), a novel set of methods, materializes this framework with LLMs.
Click here to view paper screenshots
A Task-Oriented Evaluation Framework for Text Normalization in Modern NLP Pipelines
Authors:Md Abdullah Al Kafi, Raka Moni, Sumit Kumar Banshal
Text normalization is an essential preprocessing step in many natural language processing (NLP) tasks, and stemming is one such normalization technique that reduces words to their base or root form. However, evaluating stemming methods is challenging because current evaluation approaches are limited and do not capture the potential harm caused by excessive stemming; therefore, it is essential to develop new approaches to evaluate stemming methods. To address this issue, this study proposes a novel, task-oriented approach to evaluate stemming methods, which considers three aspects: (1) the utility of stemming using Stemming Effectiveness Score (SES), (2) the impact of stemming on downstream tasks using Model Performance Delta (MPD), and (3) the semantic similarity between stemmed and original words using Average Normalized Levenshtein Distance (ANLD), thus providing a comprehensive evaluation framework. We apply our evaluation framework to compare two stemmers for Bangla (BNLTK) and English (Snowball), and our results reveal a significant issue, prompting us to analyze their performance in detail. While the Bangla stemmer achieves the highest SES (1.67) due to effective word reduction (CR = 1.90), SES alone is insufficient because our proposed safety measure, ANLD, reveals that this high SES is due to harmful over-stemming (ANLD = 0.26), which correlates with the observed decrease in downstream performance. In contrast, the English stemmer achieves a moderate SES (1.31) with a safe meaning distance (ANLD = 0.14), allowing its word reduction to contribute positively to downstream performance; therefore, it is a more reliable stemmer. Our study provides a valuable tool for distinguishing between potential efficiency gains (high SES) and meaning preservation (low ANLD).
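A minimal sketch of the ANLD safety measure is shown below, assuming it averages the Levenshtein distance between each original word and its stem, normalized by the longer word's length; the abstract does not spell out the exact normalization, so treat this as illustrative:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anld(originals, stems):
    # Assumed normalization: distance / length of the longer word,
    # averaged over all pairs; the paper's exact definition may differ.
    pairs = list(zip(originals, stems))
    return sum(levenshtein(o, s) / max(len(o), len(s))
               for o, s in pairs) / len(pairs)

print(round(anld(["running", "flies"], ["run", "fli"]), 3))  # 0.486
```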
Paper and Project Links
Summary
This paper proposes a task-oriented approach to evaluating stemming methods that considers stemming effectiveness, the impact on downstream tasks, and the semantic similarity between stemmed and original words. Comparing two stemmers (for Bangla and English) reveals that relying on the effectiveness score alone is insufficient, since a high score can stem from over-stemming that destroys meaning. A reliable stemmer must therefore balance effectiveness against meaning preservation.
Key Takeaways
- A novel task-oriented approach is proposed for comprehensively evaluating stemming methods.
- The approach considers stemming effectiveness, the impact on downstream tasks, and the semantic similarity between stemmed and original words.
- Relying on the stemming effectiveness score alone is insufficient, because a high score can result from semantic loss caused by over-stemming.
- The Bangla stemmer reduces words effectively but suffers from over-stemming, which harms meaning preservation.
- The English stemmer balances effectiveness and meaning preservation, making it more reliable.
- The framework distinguishes potential efficiency gains (high SES) from meaning preservation (low ANLD).
- The findings offer useful guidance for developing better stemming methods.
Click here to view paper screenshots
IrisNet: Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection
Authors:Xuelin Qian, Jiaming Lu, Zixuan Wang, Wenxuan Wang, Zhongling Huang, Dingwen Zhang, Junwei Han
Infrared Small Target Detection (IRSTD) faces significant challenges due to low signal-to-noise ratios, complex backgrounds, and the absence of discernible target features. While deep learning-based encoder-decoder frameworks have advanced the field, their static pattern learning suffers from pattern drift across diverse scenarios (e.g., day/night variations, sky/maritime/ground domains), limiting robustness. To address this, we propose IrisNet, a novel meta-learned framework that dynamically adapts detection strategies to the input infrared image status. Our approach establishes a dynamic mapping between infrared image features and entire decoder parameters via an image-to-decoder transformer. More concretely, we represent the parameterized decoder as a structured 2D tensor preserving hierarchical layer correlations and enable the transformer to model inter-layer dependencies through self-attention while generating adaptive decoding patterns via cross-attention. To further enhance the perception ability of infrared images, we integrate high-frequency components to supplement target-position and scene-edge information. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet, achieving state-of-the-art performance.
Paper and Project Links
PDF: 10 pages, 5 figures
Summary:
Infrared Small Target Detection (IRSTD) is challenged by low signal-to-noise ratios, complex backgrounds, and the absence of discernible target features. Deep learning encoder-decoder frameworks have advanced the field, but their static pattern learning suffers from pattern drift across scenarios (e.g., day/night variations, sky/maritime/ground domains), limiting robustness. IrisNet is a meta-learned framework that dynamically adapts its detection strategy to the status of the input infrared image, establishing a dynamic mapping between image features and the entire decoder's parameters via an image-to-decoder transformer. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K show that IrisNet achieves state-of-the-art performance.
Key Takeaways:
- Infrared small target detection faces low signal-to-noise ratios, complex backgrounds, and a lack of target features.
- Existing deep encoder-decoder frameworks suffer from pattern drift, which limits robustness.
- IrisNet is a meta-learned framework that dynamically adapts detection strategies to the input infrared image status.
- IrisNet improves detection by establishing a dynamic mapping between image features and decoder parameters.
- The parameterized decoder is represented as a structured 2D tensor that preserves hierarchical layer correlations.
- The transformer models inter-layer dependencies via self-attention and generates adaptive decoding patterns via cross-attention.
Click here to view paper screenshots
Prompting Lipschitz-constrained network for multiple-in-one sparse-view CT reconstruction
Authors:Baoshun Shi, Ke Jiang, Qiusheng Lian, Xinran Yu, Huazhu Fu
Despite significant advancements in deep learning-based sparse-view computed tomography (SVCT) reconstruction algorithms, these methods still encounter two primary limitations: (i) It is challenging to explicitly prove that the prior networks of deep unfolding algorithms satisfy Lipschitz constraints due to their empirically designed nature. (ii) The substantial storage costs of training a separate model for each setting in the case of multiple views hinder practical clinical applications. To address these issues, we elaborate an explicitly provable Lipschitz-constrained network, dubbed LipNet, and integrate an explicit prompt module to provide discriminative knowledge of different sparse sampling settings, enabling the treatment of multiple sparse view configurations within a single model. Furthermore, we develop a storage-saving deep unfolding framework for multiple-in-one SVCT reconstruction, termed PromptCT, which embeds LipNet as its prior network to ensure the convergence of its corresponding iterative algorithm. In simulated and real data experiments, PromptCT outperforms benchmark reconstruction algorithms in multiple-in-one SVCT reconstruction, achieving higher-quality reconstructions with lower storage costs. On the theoretical side, we explicitly demonstrate that LipNet satisfies boundary property, further proving its Lipschitz continuity and subsequently analyzing the convergence of the proposed iterative algorithms. The data and code are publicly available at https://github.com/shibaoshun/PromptCT.
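The paper proves Lipschitz continuity for its purpose-built LipNet; as a generic, hedged illustration of the underlying idea (not the paper's construction), spectral normalization approximately caps each layer's operator norm at 1, and composing norm-bounded layers with 1-Lipschitz activations yields a network with an explicit Lipschitz bound:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

# Generic Lipschitz-bounded block (illustrative only; LipNet's actual
# architecture is defined in the paper). Spectral norm approximately caps
# each conv's operator norm at 1 via power iteration, and ReLU is
# 1-Lipschitz, so the composition is approximately 1-Lipschitz.
block = nn.Sequential(
    spectral_norm(nn.Conv2d(1, 16, kernel_size=3, padding=1)),
    nn.ReLU(),
    spectral_norm(nn.Conv2d(16, 1, kernel_size=3, padding=1)),
)

x = torch.randn(1, 1, 32, 32)
print(block(x).shape)  # torch.Size([1, 1, 32, 32])
```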
Paper and Project Links
Summary
This paper addresses two limitations of deep learning-based sparse-view CT (SVCT) reconstruction: the difficulty of explicitly proving that prior networks satisfy Lipschitz constraints, and the storage cost of training a separate model for each view setting. It proposes LipNet, an explicitly provable Lipschitz-constrained network with an explicit prompt module that provides discriminative knowledge of different sparse sampling settings, allowing a single model to handle multiple sparse-view configurations. It further develops PromptCT, a storage-saving deep unfolding framework for multiple-in-one SVCT reconstruction that embeds LipNet as its prior network to guarantee convergence of the corresponding iterative algorithm. In simulated and real-data experiments, PromptCT outperforms benchmark reconstruction algorithms, achieving higher-quality reconstructions at lower storage cost.
Key Takeaways
- Deep learning-based sparse-view CT (SVCT) reconstruction faces two key challenges: proving that prior networks satisfy Lipschitz constraints, and the high storage cost of handling multiple view settings.
- LipNet, an explicitly provable Lipschitz-constrained network, addresses both challenges.
- An explicit prompt module lets a single model handle multiple sparse-view configurations, improving practicality.
- PromptCT, a storage-saving deep unfolding framework for multiple-in-one SVCT reconstruction, ensures convergence of the iterative algorithm.
- In simulated and real-data experiments, PromptCT achieves higher-quality reconstruction at lower storage cost.
- The data and code are publicly available on GitHub.
Click here to view paper screenshots
Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement
Authors:Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, Qingming Huang
Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
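A minimal sketch of the training-free refinement loop described above; `generate_video`, `vlm_physics_critique`, and `llm_refine_prompt` are hypothetical stand-ins for whatever video model, VLM, and LLM are plugged in:

```python
def generate_video(prompt: str) -> str:
    return f"<video for: {prompt}>"               # stub video generator

def vlm_physics_critique(video: str) -> str:
    return "the ball falls without gravity"       # stub VLM feedback

def llm_refine_prompt(prompt: str, feedback: str) -> str:
    return f"{prompt} (physics fix: {feedback})"  # stub LLM rewrite

prompt = "a ball rolls off a table"
for _ in range(3):                                # fixed refinement budget
    video = generate_video(prompt)
    feedback = vlm_physics_critique(video)
    if not feedback:                              # no inconsistencies left
        break
    prompt = llm_refine_prompt(prompt, feedback)

print(prompt)
```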
Paper and Project Links
PDF: ICCV 2025 Physics-IQ Challenge third-place solution
Summary
This work proposes an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. A multimodal chain-of-thought (MM-CoT) process refines prompts based on feedback about physical inconsistencies, progressively improving generation quality. Experiments show the method raises the Physics-IQ score on the PhyIQ benchmark from 56.31 to 62.38.
Key Takeaways
- An iterative self-refinement framework combines large language models and vision-language models for video generation.
- A multimodal chain-of-thought (MM-CoT) process refines prompts based on feedback about physical inconsistencies.
- The method is training-free and plug-and-play, readily applicable to a wide range of video generation models.
- The Physics-IQ score improves markedly on the PhyIQ benchmark (from 56.31 to 62.38).
- The work is a preliminary exploration of physics-consistent video generation.
- The approach helps improve both the visual quality and the physical consistency of generated videos.
Click here to view paper screenshots
ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis
Authors:Advik Sinha, Saurabh Atreya, Aashutosh A, Sk Aziz Ali, Abhijit Das
Until recently, the general corpus of CLIP-type fundamental models has widely explored either the retrieval of short descriptions or the classification of objects in the scene as SINGLE-object image classification task. The same holds for retrieving the image embedding (image retrieval task) given a text prompt. However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. The latest methods in the CLIP-based literature improve class-level discrimination by mining harder negative image-text pairs and by refining permanent text prompts, often using LLMs. However, these improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. Hence, to further leverage this aspect for scene analysis, the proposed ScenarioCLIP model accepts input texts, grounded relations, and input images, along with focused regions highlighting relations. The proposed model is pretrained on curated scenario data, and finetuned for specialized downstream tasks, such as cross-modal retrieval and fine-grained visual understanding tasks. To address the lack of domain-specific datasets, we generate a novel dataset by extending image-text pairs from existing diverse indoor and outdoor scenario datasets that are publicly available. We used a pipeline of existing language models to ground action, object, and relations, filled by manual and automatic curation. We established a comprehensive benchmark for several scenario-based tasks and compared it with many baseline methods. ScenarioCLIP demonstrates robust zero-shot and finetune performance on various domain-specific tasks. Our code and dataset are available at https://github.com/scenario-clip/ScenarioCLIP
Paper and Project Links
Summary
ScenarioCLIP targets the analysis of multiple objects, actions, and relations in real-world scene images. The model takes input text, grounded relations, and input images, together with focused regions that highlight relations, to explicitly model inter-object relationships. It is pretrained on curated scenario data and finetuned for downstream tasks such as cross-modal retrieval and fine-grained visual understanding. To address the lack of domain-specific datasets, the authors generate a new dataset by extending image-text pairs from existing public indoor and outdoor scenario datasets. They also establish a comprehensive benchmark against many baseline methods; ScenarioCLIP shows robust zero-shot and finetuned performance across domain-specific tasks.
Key Takeaways
- ScenarioCLIP is designed for analyzing multiple objects, actions, and relations in real-world scene images.
- The model accepts input text, grounded relations, and images, together with focused regions that highlight relations.
- It is pretrained on curated scenario data and finetuned for downstream tasks such as cross-modal retrieval and fine-grained visual understanding.
- A new dataset was generated by extending image-text pairs from existing public indoor and outdoor scenario datasets.
- A pipeline of existing language models grounds actions, objects, and relations, complemented by manual and automatic curation.
- ScenarioCLIP demonstrates robust zero-shot and finetuned performance across domain-specific tasks.
Click here to view paper screenshots
FINE: Factorized multimodal sentiment analysis via mutual INformation Estimation
Authors:Yadong Liu, Shangfei Wang
Multimodal sentiment analysis remains a challenging task due to the inherent heterogeneity across modalities. Such heterogeneity often manifests as asynchronous signals, imbalanced information between modalities, and interference from task-irrelevant noise, hindering the learning of robust and accurate sentiment representations. To address these issues, we propose a factorized multimodal fusion framework that first disentangles each modality into shared and unique representations, and then suppresses task-irrelevant noise within both to retain only sentiment-critical representations. This fine-grained decomposition improves representation quality by reducing redundancy, prompting cross-modal complementarity, and isolating task-relevant sentiment cues. Rather than manipulating the feature space directly, we adopt a mutual information-based optimization strategy to guide the factorization process in a more stable and principled manner. To further support feature extraction and long-term temporal modeling, we introduce two auxiliary modules: a Mixture of Q-Formers, placed before factorization, which uses learnable queries to extract fine-grained affective features from multiple modalities, and a Dynamic Contrastive Queue, placed after factorization, which stores latest high-level representations for contrastive learning, enabling the model to capture long-range discriminative patterns and improve class-level separability. Extensive experiments on multiple public datasets demonstrate that our method consistently outperforms existing approaches, validating the effectiveness and robustness of the proposed framework.
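The abstract only says the factorization is guided by a mutual information-based objective; one standard way to make such an objective differentiable is a MINE-style Donsker-Varadhan lower bound, sketched below as a generic illustration rather than the paper's actual estimator:

```python
import math
import torch
import torch.nn as nn

class MINE(nn.Module):
    # Donsker-Varadhan lower bound on I(X;Y): E_p[T] - log E_q[exp(T)],
    # with q formed by shuffling y to break the pairing.
    def __init__(self, dx: int, dy: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dx + dy, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x, y):
        t_joint = self.net(torch.cat([x, y], dim=1)).mean()
        y_perm = y[torch.randperm(y.size(0))]
        t_marg = self.net(torch.cat([x, y_perm], dim=1)).squeeze(1)
        return t_joint - (t_marg.logsumexp(0) - math.log(t_marg.numel()))

x = torch.randn(256, 8)
y = x + 0.1 * torch.randn(256, 8)   # strongly dependent toy data
est = MINE(8, 8)
opt = torch.optim.Adam(est.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    (-est(x, y)).backward()         # maximize the lower bound
    opt.step()
print(float(est(x, y)))             # clearly positive, since x and y depend
```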
Paper and Project Links
PDF: 15 pages, 9 figures, conference
Summary
This work tackles the challenges of multimodal sentiment analysis caused by the inherent heterogeneity across modalities. It disentangles each modality into shared and unique representations, then suppresses task-irrelevant noise in both to retain only sentiment-critical representations, improving representation quality. A mutual information-based optimization strategy guides the factorization, and two auxiliary modules support feature extraction and long-term temporal modeling.
Key Takeaways
- Multimodal sentiment analysis is challenging mainly because of the inherent heterogeneity across modalities.
- The proposed framework disentangles each modality into shared and unique representations to handle this heterogeneity.
- It suppresses task-irrelevant noise and retains sentiment-critical representations, improving representation quality.
- A mutual information-based optimization strategy makes the factorization more stable and principled.
- Two auxiliary modules are introduced: a Mixture of Q-Formers for feature extraction and a Dynamic Contrastive Queue for long-term temporal modeling and contrastive learning.
- The framework outperforms existing methods on multiple public datasets, validating its effectiveness and robustness.
Click here to view paper screenshots
“When Data is Scarce, Prompt Smarter”… Approaches to Grammatical Error Correction in Low-Resource Settings
Authors:Somsubhra De, Harsh Kumar, Arun Prakash A
Grammatical error correction (GEC) is an important task in Natural Language Processing that aims to automatically detect and correct grammatical mistakes in text. While recent advances in transformer-based models and large annotated datasets have greatly improved GEC performance for high-resource languages such as English, the progress has not extended equally. For most Indic languages, GEC remains a challenging task due to limited resources, linguistic diversity and complex morphology. In this work, we explore prompting-based approaches using state-of-the-art large language models (LLMs), such as GPT-4.1, Gemini-2.5 and LLaMA-4, combined with few-shot strategy to adapt them to low-resource settings. We observe that even basic prompting strategies, such as zero-shot and few-shot approaches, enable these LLMs to substantially outperform fine-tuned Indic-language models like Sarvam-22B, thereby illustrating the exceptional multilingual generalization capabilities of contemporary LLMs for GEC. Our experiments show that carefully designed prompts and lightweight adaptation significantly enhance correction quality across multiple Indic languages. We achieved leading results in the shared task: ranking 1st in Tamil (GLEU: 91.57) and Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97). These findings highlight the effectiveness of prompt-driven NLP techniques and underscore the potential of large-scale LLMs to bridge resource gaps in multilingual GEC.
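A minimal sketch of the few-shot prompting setup; the exemplars below are invented English stand-ins, since the shared task itself used Indic-language data:

```python
# Invented English exemplars for illustration; the actual shared task
# used Tamil, Hindi, Telugu, Bangla, and Malayalam data.
FEW_SHOT = [
    ("She go to school every day.", "She goes to school every day."),
    ("I has two brother.", "I have two brothers."),
]

def build_gec_prompt(sentence: str) -> str:
    lines = ["Correct the grammatical errors. Return only the corrected sentence."]
    for src, tgt in FEW_SHOT:
        lines += [f"Input: {src}", f"Output: {tgt}"]
    lines += [f"Input: {sentence}", "Output:"]
    return "\n".join(lines)

print(build_gec_prompt("He do not like apples."))
```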
Paper and Project Links
PDF: 10 pages, 5 figures, 5 tables; accepted (demonstration) at the BHASHA Workshop, IJCNLP-AACL 2025
Summary
This paper addresses grammatical error correction (GEC) in NLP. Transformer-based models and large annotated datasets have greatly improved GEC for high-resource languages such as English, but for most Indic languages GEC remains challenging due to limited resources, linguistic diversity, and complex morphology. The study explores prompting-based approaches with state-of-the-art LLMs such as GPT-4.1, Gemini-2.5, and LLaMA-4, combined with few-shot strategies for low-resource settings. Even basic prompting strategies such as zero-shot and few-shot enable these LLMs to substantially outperform fine-tuned Indic-language models such as Sarvam-22B, demonstrating the strong multilingual generalization of contemporary LLMs for GEC. Carefully designed prompts and lightweight adaptation significantly improve correction quality across multiple Indic languages, yielding leading shared-task results: 1st in Tamil (GLEU: 91.57) and Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97). The findings highlight the effectiveness of prompt-driven NLP techniques and the potential of large-scale LLMs to bridge resource gaps in multilingual GEC.
Key Takeaways
- Grammatical error correction (GEC) is an important NLP task that has advanced significantly for high-resource languages such as English.
- For most Indic languages, GEC remains challenging due to limited resources, linguistic diversity, and complex morphology.
- Prompting-based approaches with state-of-the-art LLMs perform strongly in low-resource settings.
- Basic prompting strategies such as zero-shot and few-shot let LLMs substantially outperform fine-tuned models on GEC.
- Carefully designed prompts and lightweight adaptation significantly improve correction quality across multiple Indic languages.
- The LLM-based approach achieved leading results in the multilingual Indic GEC shared task.
Click here to view paper screenshots
MTA: A Merge-then-Adapt Framework for Personalized Large Language Model
Authors:Xiaopeng Li, Yuanjin Zheng, Wanyu Wang, Wenlin Zhang, Pengyue Jia, Yiqi Wang, Maolin Wang, Xuetao Wei, Xiangyu Zhao
Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.
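A simplified numpy sketch of the Adaptive LoRA Fusion idea: retrieve anchor meta-LoRAs by similarity to the user, then synthesize a user-specific low-rank update as their weighted sum. The shapes and the softmax retrieval rule here are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, k = 64, 8, 4                 # hidden size, LoRA rank, number of anchors

# Hypothetical anchor meta-LoRAs (A, B factors) and profile embeddings.
anchors = [(rng.normal(size=(r, d)), rng.normal(size=(d, r))) for _ in range(k)]
anchor_emb = rng.normal(size=(k, 16))
user_emb = rng.normal(size=16)

# Retrieve-and-merge: softmax over anchor/user similarity, then a weighted
# sum of the LoRA factors (a simplification of Adaptive LoRA Fusion).
sims = anchor_emb @ user_emb
w = np.exp(sims - sims.max())
w /= w.sum()
A = sum(wi * Ai for wi, (Ai, _) in zip(w, anchors))
B = sum(wi * Bi for wi, (_, Bi) in zip(w, anchors))

delta_W = B @ A                    # user-specific low-rank weight update
print(delta_W.shape)               # (64, 64); stage 3 would stack another
                                   # ultra-low-rank LoRA on top and fine-tune it
```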
Paper and Project Links
Summary
Personalized large language models (PLLMs) face two problems under the prevailing approach of fine-tuning a separate module per user: storage costs that grow linearly with the number of users, and suboptimal performance for users with sparse data. MTA, a Merge-then-Adapt framework, addresses both through three stages: building a shared Meta-LoRA Bank, Adaptive LoRA Fusion, and LoRA Stacking for few-shot personalization. This enables dynamic personalization combination and more flexible personalization while cutting storage costs, and it outperforms state-of-the-art methods across multiple tasks on the LaMP benchmark.
Key Takeaways
- PLLMs align model outputs with individual user preferences, which is crucial for user-centric applications.
- Existing methods have two limitations: storage costs that scale linearly with the number of users, and poor performance for sparse-data users.
- The MTA framework has three stages: a shared Meta-LoRA Bank, Adaptive LoRA Fusion, and LoRA Stacking for few-shot personalization.
- The shared Meta-LoRA Bank is built by selecting anchor users and pre-training meta-personalization traits in meta-LoRA modules.
- Adaptive LoRA Fusion retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, removing the need for user-specific storage.
- The LoRA Stacking stage fine-tunes an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA for effective few-shot personalization.
Click here to view paper screenshots
iRadioDiff: Physics-Informed Diffusion Model for Indoor Radio Map Construction and Localization
Authors:Xiucheng Wang, Tingwei Yuan, Yang Cao, Nan Cheng, Ruijin Sun, Weihua Zhuang
Radio maps (RMs) serve as environment-aware electromagnetic (EM) representations that connect scenario geometry and material properties to the spatial distribution of signal strength, enabling localization without costly in-situ measurements. However, constructing high-fidelity indoor RMs remains challenging due to the prohibitive latency of EM solvers and the limitations of learning-based methods, which often rely on sparse measurements or assumptions of homogeneous material, which are misaligned with the heterogeneous and multipath-rich nature of indoor environments. To overcome these challenges, we propose iRadioDiff, a sampling-free diffusion-based framework for indoor RM construction. iRadioDiff is conditioned on access point (AP) positions, and physics-informed prompt encoded by material reflection and transmission coefficients. It further incorporates multipath-critical priors, including diffraction points, strong transmission boundaries, and line-of-sight (LoS) contours, to guide the generative process via conditional channels and boundary-weighted objectives. This design enables accurate modeling of nonstationary field discontinuities and efficient construction of physically consistent RMs. Experiments demonstrate that iRadioDiff achieves state-of-the-art performance in indoor RM construction and received signal strength based indoor localization, which offers effective generalization across layouts and material configurations. Code is available at https://github.com/UNIC-Lab/iRadioDiff.
Paper and Project Links
Summary
This paper presents iRadioDiff, a sampling-free, diffusion-based framework for constructing indoor radio maps (RMs). iRadioDiff is conditioned on access point (AP) positions and on physics-informed prompts encoded by material reflection and transmission coefficients, and it incorporates multipath-critical priors such as diffraction points, strong transmission boundaries, and line-of-sight (LoS) contours to guide generation through conditional channels and boundary-weighted objectives. The design accurately models nonstationary field discontinuities and efficiently builds physically consistent RMs, achieving state-of-the-art performance in indoor RM construction and received-signal-strength-based indoor localization.
Key Takeaways
- Radio maps (RMs) provide environment-aware electromagnetic representations that connect scene geometry and material properties to the spatial distribution of signal strength, enabling localization without costly in-situ measurements.
- Constructing indoor RMs is hard because of the prohibitive latency of EM solvers and the limitations of learning-based methods.
- iRadioDiff is a sampling-free diffusion framework that addresses these challenges.
- It is conditioned on AP positions and physics-informed prompts encoded by material reflection and transmission coefficients.
- It incorporates multipath-critical priors (diffraction points, strong transmission boundaries, LoS contours) to guide the generative process.
- iRadioDiff accurately models nonstationary field discontinuities and efficiently constructs physically consistent RMs.
Click here to view paper screenshots
HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning
Authors:Hongji Yang, Yucheng Zhou, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen
Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
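The abstract argues that sample diversity is maximized by concentrating stochasticity early in generation; one plausible (assumed) functional form for such a Decaying Stochasticity Schedule is an exponential decay of the sampler's noise scale:

```python
import numpy as np

# One assumed form of a decaying-stochasticity schedule: full sampler
# noise at the first step, decaying toward (near-)deterministic updates.
# The paper's exact schedule is not given in the abstract.
T = 50
t = np.arange(T)
eta = np.exp(-5.0 * t / (T - 1))   # noise-scale multiplier per step

print(eta[0], round(float(eta[-1]), 4))  # 1.0 0.0067
# e.g. a DDIM-like sampler would inject noise ~ eta[t] * sigma_t at step t,
# so exploration happens early while late steps refine deterministically.
```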
Paper and Project Links
PDF: 9 pages
Summary
To address the limitations of diffusion models on complex prompts, this paper proposes HiCoGen, a Hierarchical Compositional Generative framework built on a Chain of Synthesis (CoS) paradigm. HiCoGen uses a large language model (LLM) to decompose complex prompts into minimal semantic units and synthesizes them iteratively, with each step's image providing visual context for the next. A reinforcement learning (RL) framework further optimizes the process, using a novel Decaying Stochasticity Schedule to improve exploration. Experiments show the approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
Key Takeaways
- Diffusion models struggle with prompts involving multiple objects and hierarchical structure, causing concept omission, confusion, and poor compositionality.
- HiCoGen uses an LLM to decompose complex prompts into minimal semantic units.
- The units are synthesized iteratively, so every textual concept is faithfully constructed into the final scene.
- A reinforcement learning (RL) framework further optimizes the generation process.
- The Decaying Stochasticity Schedule concentrates randomness in early generation stages to maximize sample diversity and exploration.
- The RL algorithm is guided by a hierarchical reward that jointly evaluates images at the global, subject, and relationship levels.
Click here to view paper screenshots
AppSelectBench: Application-Level Tool Selection Benchmark
Authors:Tianyi Chen, Michael Solodko, Sen Wang, Jongwoo Ko, Junheng Hao, Colby Banbury, Sara Abdali, Saeed Amizadeh, Qing Xiao, Yinheng Li, Tianyu Ding, Kamran Ghasedi Dizaji, Suzhen Zheng, Hao Fan, Justin Wagle, Pashmina Cameron, Kazuhito Koishida
Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented-settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application level reasoning, an essential yet underexplored capability of intelligent CUAs. The source is available at https://github.com/microsoft/appselectbench.
Paper and Project Links
Summary
Computer Using Agents (CUAs) are increasingly equipped with external tools for complex, realistic tasks. Application selection, deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability, yet existing benchmarks mostly evaluate fine-grained API selection. AppSelectBench is a comprehensive benchmark for evaluating application selection in CUAs: it includes a novel user-task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, plus unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. It spans one hundred widely used desktop applications and more than one hundred thousand user tasks. Extensive experiments on closed- and open-source LLMs reveal systematic strengths and weaknesses in inter-application reasoning and show that even the most capable models still struggle to make consistent application choices.
Key Takeaways
- Computer Using Agents (CUAs) increasingly rely on external tools to perform complex tasks.
- Application selection is a fundamental capability for effective CUA operation.
- Existing benchmarks focus on fine-grained API selection and offer little insight into inter-application reasoning.
- AppSelectBench is a comprehensive benchmark for evaluating application selection in CUAs.
- It includes a novel user-task generation pipeline and unified evaluation protocols.
- It covers one hundred desktop applications and more than one hundred thousand realistic, diverse, semantically grounded user tasks.
Click here to view paper screenshots
DLADiff: A Dual-Layer Defense Framework against Fine-Tuning and Zero-Shot Customization of Diffusion Models
Authors:Jun Jia, Hongyi Miao, Yingjie Zhou, Linhan Cao, Yanwei Jiang, Wangqiu Zhou, Dandan Zhu, Hua Yang, Wei Sun, Xiongkuo Min, Guangtao Zhai
With the rapid advancement of diffusion models, a variety of fine-tuning methods have been developed, enabling high-fidelity image generation with high similarity to the target content using only 3 to 5 training images. More recently, zero-shot generation methods have emerged, capable of producing highly realistic outputs from a single reference image without altering model weights. However, technological advancements have also introduced significant risks to facial privacy. Malicious actors can exploit diffusion model customization with just a few or even one image of a person to create synthetic identities nearly identical to the original identity. Although research has begun to focus on defending against diffusion model customization, most existing defense methods target fine-tuning approaches and neglect zero-shot generation defenses. To address this issue, this paper proposes Dual-Layer Anti-Diffusion (DLADiff) to defense both fine-tuning methods and zero-shot methods. DLADiff contains a dual-layer protective mechanism. The first layer provides effective protection against unauthorized fine-tuning by leveraging the proposed Dual-Surrogate Models (DSUR) mechanism and Alternating Dynamic Fine-Tuning (ADFT), which integrates adversarial training with the prior knowledge derived from pre-fine-tuned models. The second layer, though simple in design, demonstrates strong effectiveness in preventing image generation through zero-shot methods. Extensive experimental results demonstrate that our method significantly outperforms existing approaches in defending against fine-tuning of diffusion models and achieves unprecedented performance in protecting against zero-shot generation.
Paper and Project Links
Summary
As diffusion models advance, fine-tuning can reproduce a target identity from only 3 to 5 images, and zero-shot generation can do so from a single reference image without altering model weights, posing serious risks to facial privacy. Most existing defenses target fine-tuning and neglect zero-shot generation. This paper proposes Dual-Layer Anti-Diffusion (DLADiff), which defends against both. The first layer counters unauthorized fine-tuning via the proposed Dual-Surrogate Models (DSUR) mechanism and Alternating Dynamic Fine-Tuning (ADFT); the second layer, though simple in design, is highly effective at preventing zero-shot image generation. Experiments show DLADiff significantly outperforms existing methods against fine-tuning and achieves unprecedented protection against zero-shot generation.
Key Takeaways
- Diffusion models can generate highly realistic, identity-matching images from only a few training images.
- Diffusion model customization creates privacy risks: malicious actors can build near-identical synthetic identities from a few, or even one, photo of a person.
- Existing defenses mostly target fine-tuning and neglect zero-shot generation.
- DLADiff is a dual-layer defense covering both fine-tuning and zero-shot methods.
- The first layer uses the DSUR mechanism and ADFT to resist unauthorized fine-tuning.
- The second layer is simple in design yet effectively prevents zero-shot image generation.
Click here to view paper screenshots
CodeFuse-CommitEval: Towards Benchmarking LLM’s Power on Commit Message and Code Change Inconsistency Detection
Authors:Qingyu Zhang, Puzhuo Liu, Peng Di, Chenxiong Qian
Version control relies on commit messages to convey the rationale for code changes, but these messages are often low quality and, more critically, inconsistent with their diffs, known as message-code inconsistency (MCI). MCIs mislead reviewers, hinder maintenance, contaminate research datasets, and may obscure security patches. Yet, no dedicated benchmark exists to evaluate models for MCI detection. We introduce CODEFUSE-COMMITEVAL, the first benchmark designed for MCI detection using large language models (LLMs). Built on the ApacheCM dataset for diversity and quality, we generate seven types of inconsistent messages through rule-guided mutations of originally consistent commits and apply two-fold validation to verify both positive and negative samples. Using this labeled dataset of message-diff pairs, we evaluate six state-of-the-art open-source LLMs under a vanilla setting and with three augmentation strategies: few-shot prompting, chain-of-thought, and extended context. Results show models detect inconsistent commits more reliably than consistent ones (average Recall 85.95%, Precision 80.28%, Specificity 63.8%); gpt-oss-20B performs best overall but uses over twice the tokens of others. Augmentation effects vary: adjacent context helps larger models but adds noise for smaller ones; few-shot improves accuracy and reduces token use, yet increases universally incorrect predictions; chain-of-thought boosts precision and specificity at the cost of recall and higher token consumption. Type-wise analysis reveals higher detectability for component, file-path, and operation inconsistencies, but lower accuracy and higher token cost for intent-level “purpose” inconsistencies. CODEFUSE-COMMITEVAL provides a rigorous foundation for measuring, comparing, and advancing MCI detection, highlighting the need for richer context and balanced data to capture high-level semantic gaps.
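A toy sketch of one of the seven rule-guided mutations (a file-path inconsistency): rewrite the path named in an originally consistent message so it no longer matches the diff, producing a labeled positive sample. The benchmark's real mutation and two-fold validation pipeline is richer than this single rule:

```python
import re

def mutate_file_path(message: str, wrong_path: str) -> str:
    # Replace the first file path mentioned in the commit message with an
    # unrelated one, creating a message-code inconsistency on purpose.
    return re.sub(r"\S+\.(?:py|java|c|go|md)\b", wrong_path, message, count=1)

msg = "Fix null check in src/auth/session.py"
print(mutate_file_path(msg, "docs/README.md"))
# -> "Fix null check in docs/README.md"  (now inconsistent with the diff)
```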
Paper and Project Links
Summary
This paper introduces CODEFUSE-COMMITEVAL, the first benchmark for detecting message-code inconsistency (MCI) with large language models (LLMs). Commit messages are often low quality and inconsistent with their diffs, which misleads reviewers, hinders maintenance, contaminates research datasets, and can obscure security patches. Built on the ApacheCM dataset, the benchmark generates seven types of inconsistent messages through rule-guided mutations of originally consistent commits and applies two-fold validation of positive and negative samples. Six state-of-the-art open-source LLMs are evaluated in a vanilla setting and with three augmentation strategies (few-shot prompting, chain-of-thought, and extended context). Models detect inconsistent commits more reliably than consistent ones (average Recall 85.95%, Precision 80.28%, Specificity 63.8%); gpt-oss-20B performs best overall but uses over twice the tokens of the others. Augmentation effects vary: adjacent context helps larger models but adds noise for smaller ones; few-shot prompting improves accuracy and cuts token use but increases universally incorrect predictions; chain-of-thought boosts precision and specificity at the cost of recall and token consumption. Component, file-path, and operation inconsistencies are easier to detect than intent-level "purpose" inconsistencies.
Key Takeaways
- CODEFUSE-COMMITEVAL is the first benchmark for message-code inconsistency (MCI) detection.
- Low-quality commit messages that are inconsistent with their diffs cause real problems for review, maintenance, and research datasets.
- Seven inconsistent-message types are generated via rule-guided mutation, with two-fold validation of sample labels.
- Six state-of-the-art open-source LLMs are evaluated, along with three augmentation strategies.
- Models detect inconsistent commits more reliably than consistent ones, with clear performance differences across models and mixed augmentation effects.
- Richer context and balanced data are needed to capture high-level semantic gaps in MCI detection.
Click here to view paper screenshots
Vision–Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation
Authors:Jiaqi Guo, Mingzhen Li, Hanyu Su, Santiago López, Lexiaozi Fan, Daniel Kim, Aggelos Katsaggelos
Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (VLMs) have demonstrated strong generalization and few-shot capabilities across diverse visual domains. In this work, we integrate VLM-based segmentation into semi-supervised medical image segmentation by introducing a Vision-Language Enhanced Semi-supervised Segmentation Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. Our approach consists of two stages. In Stage 1, the VLM-enhanced segmentation foundation model VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars, simulating learning from limited labeled data. Given an input-template pair, VESSA performs visual feature matching to extract representative semantic and spatial cues from exemplar segmentations, generating structured prompts for a SAM2-inspired mask decoder to produce segmentation masks. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model: as student predictions become more refined, they are fed back to VESSA as prompts, allowing it to generate higher-quality pseudo-labels and stronger guidance. Extensive experiments across multiple segmentation datasets and domains show that VESSA-augmented SSL significantly enhances segmentation accuracy, outperforming state-of-the-art baselines under extremely limited annotation conditions.
Paper and Project Links
Summary
Semi-supervised learning (SSL) reduces reliance on expert annotations for medical image segmentation, while vision-language models (VLMs) offer strong generalization and few-shot ability. This work introduces VESSA (Vision-Language Enhanced Semi-supervised Segmentation Assistant), which brings foundation-level visual-semantic understanding into SSL frameworks in two stages. In Stage 1, VESSA is trained as a reference-guided segmentation assistant using a template bank of gold-standard exemplars: given an input-template pair, it matches visual features to extract semantic and spatial cues from exemplar segmentations and generates structured prompts for a SAM2-inspired mask decoder. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework with dynamic student-model interaction: refined student predictions are fed back to VESSA as prompts, yielding higher-quality pseudo-labels and stronger guidance. Across multiple datasets and domains, VESSA-augmented SSL significantly improves segmentation accuracy and outperforms state-of-the-art baselines under extremely limited annotation.
Key Takeaways
- VESSA combines vision-language models (VLMs) with semi-supervised learning (SSL) for medical image segmentation.
- The method has two stages: training VESSA as a reference-guided segmentation assistant, then integrating it into a state-of-the-art SSL framework.
- VESSA extracts semantic and spatial cues from exemplar segmentations and generates structured prompts for a SAM2-inspired mask decoder.
- Student predictions are fed back to VESSA as prompts to produce higher-quality pseudo-labels and stronger guidance.
- Experiments across multiple segmentation datasets and domains show significantly higher accuracy under extremely limited annotation.
Click here to view paper screenshots
Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV
Authors:Wenbo Huang, Jinghui Zhang, Zhenghao Chen, Guang Li, Lei Zhang, Yang Cao, Fang Dong, Takahiro Ogawa, Miki Haseyama
Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of the background distractions. Receptance Weighted Key Value (RWKV), which learns interaction between various dimensions, shows promise for global modeling. While directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relation degraded by frames with similar backgrounds is difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module~(CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing better reconstruct temporal relation. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.
Paper and Project Links
PDF: accepted by AAAI 2026 (oral)
Summary: Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios, but background distractions make recognition difficult without a global understanding of both subjects and background. RWKV shows promise for global modeling, yet applying it directly to wide-angle FSAR can fail to highlight subjects because of excessive background information. The authors design Otter, with a Compound Segmentation Module (CSM) that segments and emphasizes key patches in each frame to highlight subjects, and a Temporal Reconstruction Module (TRM) that enables bidirectional scanning to better reconstruct temporal relations. Experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 show that Otter achieves state-of-the-art wide-angle FSAR performance, and extra evaluation on the VideoBadminton dataset further validates its superiority.
Key Takeaways:
- Wide-angle videos effectively express actions within specific scenarios in few-shot action recognition (FSAR).
- RWKV is promising for global modeling, but applying it directly to wide-angle FSAR is challenging.
- Otter comprises a Compound Segmentation Module (CSM) and a Temporal Reconstruction Module (TRM) to address these challenges.
- CSM segments and emphasizes key patches in each frame, effectively highlighting subjects against background information.
- TRM reconstructs temporal relations via bidirectional scanning, improving performance.
Click here to view paper screenshots
CORE – A Cell-Level Coarse-to-Fine Image Registration Engine for Multi-stain Image Alignment
Authors:Esha Sadia Nasir, Behnaz Elhaminia, Mark Eastwood, Catherine King, Owen Cain, Lorraine Harper, Paul Moss, Dimitrios Chanouzas, David Snead, Nasir Rajpoot, Adam Shephard, Shan E Ahmed Raza
Accurate and efficient registration of whole slide images (WSIs) is essential for high-resolution, nuclei-level analysis in multi-stained tissue slides. We propose a novel coarse-to-fine framework CORE for accurate nuclei-level registration across diverse multimodal whole-slide image (WSI) datasets. The coarse registration stage leverages prompt-based tissue mask extraction to effectively filter out artefacts and non-tissue regions, followed by global alignment using tissue morphology and accelerated dense feature matching with a pre-trained feature extractor. From the coarsely aligned slides, nuclei centroids are detected and subjected to fine-grained rigid registration using a custom, shape-aware point-set registration model. Finally, non-rigid alignment at the cellular level is achieved by estimating a non-linear displacement field using Coherent Point Drift (CPD). Our approach benefits from automatically generated nuclei that enhance the accuracy of deformable registration and ensure precise nuclei-level correspondence across modalities. The proposed model is evaluated on three publicly available WSI registration datasets, and two private datasets. We show that CORE outperforms current state-of-the-art methods in terms of generalisability, precision, and robustness in bright-field and immunofluorescence microscopy WSIs.
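The paper's shape-aware point-set model and CPD step are not reproduced here; as a hedged stand-in for the rigid part, a least-squares (Kabsch/Procrustes) alignment of matched nuclei centroids looks like this:

```python
import numpy as np

def rigid_align(src: np.ndarray, dst: np.ndarray):
    # Kabsch/Procrustes: best-fit rotation R and translation t with
    # dst ~= src @ R.T + t, for matched centroid pairs. A stand-in for the
    # paper's shape-aware registration; CPD would handle the non-rigid part.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

rng = np.random.default_rng(1)
src = rng.normal(size=(100, 2))                 # detected nuclei centroids
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
dst = src @ R_true.T + np.array([5.0, -2.0])    # same nuclei, moved rigidly
R, t = rigid_align(src, dst)
print(np.allclose(src @ R.T + t, dst))          # True
```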
Paper and Project Links
Summary
This work presents CORE, a coarse-to-fine framework for accurate nuclei-level registration across diverse multimodal whole-slide image (WSI) datasets. Coarse registration uses prompt-based tissue mask extraction to filter out artefacts and non-tissue regions, followed by global alignment using tissue morphology and accelerated dense feature matching with a pre-trained feature extractor. Nuclei centroids detected from the coarsely aligned slides undergo fine-grained rigid registration with a custom shape-aware point-set registration model, and cell-level non-rigid alignment is achieved by estimating a non-linear displacement field with Coherent Point Drift (CPD). Evaluated on three public and two private WSI registration datasets, CORE outperforms state-of-the-art methods in generalisability, precision, and robustness.
Key Takeaways
- CORE is a novel coarse-to-fine framework for nuclei-level registration of whole slide images (WSIs).
- Prompt-based tissue mask extraction in the coarse stage filters out artefacts and non-tissue regions.
- Global alignment combines tissue morphology with accelerated dense feature matching using a pre-trained feature extractor.
- Detected nuclei centroids undergo fine-grained rigid registration with a shape-aware point-set model.
- Coherent Point Drift (CPD) provides cell-level non-rigid alignment via a non-linear displacement field.
- CORE outperforms current state-of-the-art methods on both public and private datasets, in bright-field and immunofluorescence microscopy WSIs.
Click here to view paper screenshots
Target-aware Image Editing via Cycle-consistent Constraints
Authors:Yanghao Wang, Zhen Wang, Long Chen
Recent advances in pre-trained text-to-image flow models have enabled remarkable progress in text-based image editing. Mainstream approaches always adopt a corruption-then-restoration paradigm, where the source image is first corrupted into an "intermediate state" and then restored to the target image under the prompt guidance. However, current methods construct this intermediate state in a target-agnostic manner, i.e., they primarily focus on realizing source image reconstruction while neglecting the semantic gaps towards the specific editing target. This design inherently results in limited editability or inconsistency when the desired modifications substantially deviate from the source. In this paper, we argue that the intermediate state should be target-aware, i.e., selectively corrupting editing-relevant contents while preserving editing-irrelevant ones. To this end, we propose FlowCycle, a novel inversion-free and flow-based editing framework that parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process. By iteratively editing the source to the target and recovering back to the source with dual consistency constraints, FlowCycle learns to produce a target-aware intermediate state, enabling faithful modifications while preserving source consistency. Extensive ablations have demonstrated that FlowCycle achieves superior editing quality and consistency over state-of-the-art methods.
Paper and Project Links
Summary
This paper proposes FlowCycle, a novel inversion-free, flow-based editing framework that addresses the limitations of existing methods when desired modifications deviate substantially from the source image. Arguing that the intermediate state should be target-aware (selectively corrupting editing-relevant content while preserving editing-irrelevant content), FlowCycle parameterizes the corruption with learnable noises and optimizes them through a cycle-consistent process: iteratively editing the source to the target and recovering back to the source under dual consistency constraints. This yields faithful modifications while preserving source consistency.
Key Takeaways
- Pre-trained text-to-image flow models have enabled notable progress in text-based image editing.
- Mainstream methods follow a corruption-then-restoration paradigm but build the intermediate state in a target-agnostic way, which limits editability and consistency.
- The intermediate state should be target-aware: selectively corrupt editing-relevant content while preserving editing-irrelevant content.
- FlowCycle is an inversion-free, flow-based editing framework.
- It parameterizes corruption with learnable noises optimized under dual (cycle) consistency constraints.
- FlowCycle achieves superior editing quality and consistency over state-of-the-art methods.
Click here to view paper screenshots
WeatherDiffusion: Controllable Weather Editing in Intrinsic Space
Authors:Yixin Zhu, Zuoliang Zhu, Jian Yang, Miloš Hašan, Jin Xie, Beibei Wang
We present WeatherDiffusion, a diffusion-based framework for controllable weather editing in intrinsic space. Our framework includes two components based on diffusion priors: an inverse renderer that estimates material properties, scene geometry, and lighting as intrinsic maps from an input image, and a forward renderer that utilizes these geometry and material maps along with a text prompt that describes specific weather conditions to generate a final image. The intrinsic maps enhance controllability compared to traditional pixel-space editing approaches. We propose an intrinsic map-aware attention mechanism that improves spatial correspondence and decomposition quality in large outdoor scenes. For forward rendering, we leverage CLIP-space interpolation of weather prompts to achieve fine-grained weather control. We also introduce a synthetic and a real-world dataset, containing 38k and 18k images under various weather conditions, each with intrinsic map annotations. WeatherDiffusion outperforms state-of-the-art pixel-space editing approaches, weather restoration methods, and rendering-based methods, showing promise for downstream tasks such as autonomous driving, enhancing the robustness of detection and segmentation in challenging weather scenarios.
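A sketch of the CLIP-space weather-prompt interpolation, assuming a standard Hugging Face CLIP text encoder; how the blended embedding conditions the paper's forward renderer is not shown:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Any CLIP text encoder works for the sketch; this checkpoint is an
# assumption, not necessarily the one used in the paper.
name = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(name)
encoder = CLIPTextModelWithProjection.from_pretrained(name)

with torch.no_grad():
    batch = tokenizer(["a street on a clear sunny day",
                       "a street in heavy rain"],
                      padding=True, return_tensors="pt")
    e = encoder(**batch).text_embeds            # (2, 512) text embeddings

alpha = 0.3                                     # 0 = sunny, 1 = rainy
e_mix = (1 - alpha) * e[0] + alpha * e[1]
e_mix = e_mix / e_mix.norm()                    # renormalize after the lerp
print(e_mix.shape)                              # torch.Size([512])
```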
Paper and Project Links
Summary
WeatherDiffusion is a diffusion-based framework for controllable weather editing in intrinsic space. It has two components built on diffusion priors: an inverse renderer that estimates material properties, scene geometry, and lighting as intrinsic maps from an input image, and a forward renderer that uses those geometry and material maps, together with a text prompt describing specific weather conditions, to generate the final image. The intrinsic maps give finer control than traditional pixel-space editing. An intrinsic map-aware attention mechanism improves spatial correspondence and decomposition quality in large outdoor scenes, and CLIP-space interpolation of weather prompts enables fine-grained weather control. WeatherDiffusion outperforms state-of-the-art pixel-space editing, weather restoration, and rendering-based methods, and shows promise for downstream tasks such as autonomous driving by making detection and segmentation more robust in challenging weather.
Key Takeaways
- WeatherDiffusion is a diffusion-based framework for controllable weather editing in intrinsic space.
- It comprises an inverse renderer that produces intrinsic maps and a forward renderer that combines those maps with a text prompt to generate images.
- Intrinsic maps make weather editing more controllable than traditional pixel-space approaches.
- An intrinsic map-aware attention mechanism improves spatial correspondence and decomposition quality in large outdoor scenes.
- CLIP-space interpolation of weather prompts enables fine-grained weather control.
- WeatherDiffusion outperforms state-of-the-art pixel-space editing, weather restoration, and rendering-based methods.
Click here to view paper screenshots