⚠️ All of the summaries below are produced by a large language model and may contain errors; they are for reference only and should be used with caution.
🔴 Please note: never rely on these summaries for serious academic work; they are intended only as a first-pass screen before reading the papers themselves!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
2025-10-19 Update
Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models
Authors: Maria-Teresa De Rosa Palmini, Eva Cetinic
As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, with a reproducible evaluation protocol. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By providing a reproducible benchmark for historical representation in generated imagery, this work provides an initial step toward building more historically accurate TTI models.
Paper and Project Links
Summary
This paper examines the societal and cultural impact of text-to-image (TTI) diffusion models, focusing on how accurately they depict historical contexts. It proposes a benchmark that evaluates generated imagery along three dimensions: default visual styles associated with specific eras, historical consistency, and demographic representation. The findings show that these models produce systematic inaccuracies in historically themed imagery; by introducing a reproducible benchmark for historical representation, the work takes an initial step toward building more historically accurate TTI models.
Key Takeaways
- Text-to-image (TTI) diffusion models are increasingly influential, and their societal and cultural impact is drawing growing attention.
- The paper highlights that evaluation of how these models depict historical contexts has been largely unexplored.
- The proposed benchmark covers three aspects: default visual styles associated with specific eras, historical consistency, and demographic representation.
- TTI models show systematic inaccuracies when generating historically themed imagery.
- The models introduce unstated stylistic cues, produce anachronisms, and fail to reflect plausible demographic patterns.
- The paper provides a reproducible benchmark for historical representation, pointing toward more historically accurate TTI models (a minimal sketch of such an evaluation loop follows below).
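The abstract describes the prompt design (universal human activities crossed with historical periods) and the evaluation aspects but not their implementation, so the following is only a minimal sketch of such an evaluation loop under assumptions: the activity and period lists are toy examples, and `detect_anachronism` is a hypothetical hook standing in for whatever classifier the protocol actually uses.

```python
from itertools import product

# Toy examples; the real HistVis prompt set covers many more activities and periods.
ACTIVITIES = ["people cooking a meal", "people playing music", "people farming"]
PERIODS = ["in the 17th century", "in the 19th century", "in the 1950s"]

def build_prompts():
    """Cross universal activities with historical periods, mirroring the benchmark's prompt design."""
    return [f"A depiction of {activity} {period}" for activity, period in product(ACTIVITIES, PERIODS)]

def anachronism_rate(images, detect_anachronism):
    """Aggregate per-image boolean anachronism flags into a single rate.

    `detect_anachronism` is a placeholder for the detector used by the evaluation
    protocol (not specified in the abstract); it should return True when a modern
    artifact appears in a pre-modern scene.
    """
    flags = [bool(detect_anachronism(img)) for img in images]
    return sum(flags) / max(len(flags), 1)

if __name__ == "__main__":
    prompts = build_prompts()
    print(f"{len(prompts)} prompts, e.g. {prompts[0]!r}")
```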
OmnimatteZero: Fast Training-free Omnimatte with Pre-trained Video Diffusion Models
Authors: Dvir Samuel, Matan Levy, Nir Darshan, Gal Chechik, Rami Ben-Ari
In Omnimatte, one aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. These are accomplished by adapting zero-shot image inpainting techniques for video object removal, a task they fail to handle effectively out-of-the-box. To overcome this, we introduce temporal and spatial attention guidance modules that steer the diffusion process for accurate object removal and temporally consistent background reconstruction. We further show that self-attention maps capture information about the object and its footprints and use them to inpaint the object’s effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior performance in terms of background reconstruction but also sets a new record for the fastest Omnimatte approach, achieving real-time performance with minimal frame runtime.
Paper and Project Links
PDF Accepted to SIGGRAPH ASIA 2025. Project Page: https://dvirsamuel.github.io/omnimattezero.github.io/
Summary
This paper presents OmnimatteZero, a training-free approach that uses off-the-shelf pre-trained video diffusion models to decompose a video into semantically meaningful layers. It can remove objects from a video, extract individual object layers along with their effects, and composite those objects onto new videos. Temporal and spatial attention guidance modules overcome the limitations of zero-shot image inpainting for video object removal, enabling accurate object removal and temporally consistent background reconstruction. Self-attention maps, which capture the object and its footprints, are used to inpaint the object's effects and leave a clean background. Through simple latent arithmetic, object layers can be isolated and seamlessly recombined with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior background reconstruction but also sets a new record as the fastest Omnimatte approach, reaching real-time performance with minimal per-frame runtime.
Key Takeaways
- OmnimatteZero is a training-free video decomposition (omnimatte) method.
- It leverages off-the-shelf pre-trained video diffusion models to decompose videos into semantic layers.
- It can remove objects from videos and extract individual object layers along with their effects.
- Temporal and spatial attention guidance modules improve the accuracy of video object removal and keep the reconstructed background temporally consistent.
- Self-attention maps are used to inpaint object effects, leaving a clean background.
- Simple latent arithmetic allows object layers to be isolated and recombined with new video layers to produce new videos (a minimal sketch of this idea follows below).
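The abstract only states that object layers are isolated and recombined "through simple latent arithmetic" without giving the exact operations, so the snippet below is a rough sketch of one plausible reading of that idea on video latents treated as plain tensors; the shapes and the add/subtract compositing rule are assumptions for illustration, not the authors' implementation.

```python
import torch

# Assumed video-latent layout: (frames, channels, height, width), e.g. from a video VAE.
frames, channels, height, width = 16, 4, 64, 64
scene_latent = torch.randn(frames, channels, height, width)           # original video: object + background
background_latent = torch.randn(frames, channels, height, width)      # reconstructed clean background
new_background_latent = torch.randn(frames, channels, height, width)  # latent of a different target video

# Isolate the object layer (the object together with its shadows/reflections) by subtraction.
object_layer = scene_latent - background_latent

# Composite the isolated layer onto the new background latent before decoding back to pixels.
composited_latent = new_background_latent + object_layer
print(composited_latent.shape)  # torch.Size([16, 4, 64, 64])
```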
TRACE: Your Diffusion Model is Secretly an Instance Edge Detector
Authors: Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Jonghyun Choi, Jaesik Park, Kyungsu Kim
High-quality instance and panoptic segmentation has traditionally relied on dense instance-level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly-supervised approaches reduce this burden but remain constrained by semantic backbone constraints and human bias, often producing merged or fragmented outputs. We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text-to-image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP) where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This design removes the need for per-image diffusion inversion, achieving 81x faster inference while producing sharper and more connected boundaries. On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation. Code is available at https://github.com/shjo-april/DiffEGG.
Paper and Project Links
Summary
This paper proposes TRACE, showing that text-to-image diffusion models can secretly act as instance edge annotators and thereby support unsupervised instance segmentation and tag-supervised panoptic segmentation. TRACE identifies the Instance Emergence Point (IEP), where object boundaries first appear in self-attention maps, extracts boundaries via Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder, simplifying the pipeline and improving efficiency. On the COCO benchmark, TRACE improves unsupervised instance segmentation and outperforms point-supervised baselines in tag-supervised panoptic segmentation, indicating that diffusion models encode hidden instance boundary priors.
Key Takeaways
- Text-to-image diffusion models can serve as instance edge annotators.
- TRACE identifies the Instance Emergence Point (IEP), where object boundaries first appear in self-attention maps.
- Boundaries are extracted via Attention Boundary Divergence (ABDiv); a rough sketch of this idea appears after the list.
- Distilling the boundaries into a lightweight one-step edge decoder removes per-image diffusion inversion and makes inference 81x faster.
- On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP.
- In tag-supervised panoptic segmentation, TRACE outperforms point-supervised baselines by +1.7 PQ without any instance-level labels.
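The abstract names Attention Boundary Divergence (ABDiv) but does not define it, so the snippet below is only a speculative illustration of turning self-attention maps into an edge signal: each pixel's attention distribution is compared with its right and bottom neighbours via Jensen-Shannon divergence, on the intuition that attention changes sharply across instance boundaries. The divergence choice, tensor layout, and normalization are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two batches of distributions (last dim)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (torch.log(a + eps) - torch.log(b + eps))).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def attention_edge_map(attn):
    """attn: (H, W, T) self-attention distributions per spatial location (sums to 1 over T).

    Returns an (H, W) map where high values mark locations whose attention differs
    strongly from their right/bottom neighbours -- an assumed stand-in for the
    boundary signal that ABDiv extracts.
    """
    right = js_divergence(attn[:, :-1], attn[:, 1:])   # (H, W-1)
    down = js_divergence(attn[:-1, :], attn[1:, :])    # (H-1, W)
    edge = torch.zeros(attn.shape[:2])
    edge[:, :-1] += right
    edge[:-1, :] += down
    return edge / edge.max().clamp(min=1e-8)

if __name__ == "__main__":
    H, W, T = 32, 32, 64
    attn = F.softmax(torch.randn(H, W, T), dim=-1)  # fake self-attention maps for the demo
    print(attention_edge_map(attn).shape)  # torch.Size([32, 32])
```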
SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models
Authors: Harsh Goel, Sai Shankar Narasimhan, Oguzhan Akcin, Sandeep Chinchali
In recent years, significant progress has been made in collecting large-scale datasets to improve segmentation and autonomous driving models. These large-scale datasets are often dominated by common environmental conditions such as “Clear and Day” weather, leading to decreased performance in under-represented conditions like “Rainy and Night”. To address this issue, we introduce SynDiff-AD, a novel data augmentation pipeline that leverages diffusion models (DMs) to generate realistic images for such subgroups. SynDiff-AD uses ControlNet-a DM that guides data generation conditioned on semantic maps-along with a novel prompting scheme that generates subgroup-specific, semantically dense prompts. By augmenting datasets with SynDiff-AD, we improve the performance of segmentation models like Mask2Former and SegFormer by up to 1.2% and 2.3% on the Waymo dataset, and up to 1.4% and 0.7% on the DeepDrive dataset, respectively. Additionally, we demonstrate that our SynDiff-AD pipeline enhances the driving performance of end-to-end autonomous driving models, like AIM-2D and AIM-BEV, by up to 20% across diverse environmental conditions in the CARLA autonomous driving simulator, providing a more robust model. We release our code and pipeline at https://github.com/UTAustin-SwarmLab/SynDiff-AD.
Paper and Project Links
PDF 15 pages, 10 figures
Summary
This paper introduces SynDiff-AD, a novel data augmentation pipeline that uses diffusion models to generate realistic images for under-represented subgroups, improving segmentation and autonomous-driving models across diverse environmental conditions. The pipeline conditions data generation on semantic maps via ControlNet and uses a novel prompting scheme to produce subgroup-specific, semantically dense prompts. Experiments show that augmenting datasets with SynDiff-AD noticeably improves segmentation models and enhances the driving performance of end-to-end autonomous driving models in the CARLA simulator.
Key Takeaways
- SynDiff-AD is a diffusion-model-based data augmentation pipeline designed to counter the performance drop on under-represented environmental conditions in large-scale datasets.
- Data generation is conditioned on semantic maps using ControlNet (see the sketch after this list).
- SynDiff-AD introduces a novel prompting scheme that generates subgroup-specific, semantically dense prompts.
- Augmenting datasets with SynDiff-AD improves segmentation models such as Mask2Former and SegFormer.
- Segmentation performance improves by up to 1.2% (Mask2Former) and 2.3% (SegFormer) on the Waymo dataset, and by up to 1.4% and 0.7% on the DeepDrive dataset.
- In the CARLA autonomous driving simulator, end-to-end autonomous driving models improve by up to 20% across diverse environmental conditions.
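The abstract describes conditioning generation on semantic maps with ControlNet plus subgroup-specific prompts, but gives no code. The snippet below is a minimal sketch of that recipe using the Hugging Face `diffusers` library; the checkpoint names, the toy "Rainy and Night" prompt, and the segmentation-map file path are illustrative assumptions rather than the authors' pipeline.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Illustrative checkpoints; SynDiff-AD's own backbone and conditioning model may differ.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# A color-coded semantic map of the driving scene (placeholder path).
seg_map = Image.open("semantic_map.png").convert("RGB")

# Toy subgroup-specific prompt for an under-represented condition ("Rainy and Night");
# the paper's prompting scheme generates denser, subgroup-specific descriptions.
prompt = ("a driving scene at night in heavy rain, wet road reflections, "
          "headlights, photorealistic")

image = pipe(prompt, image=seg_map, num_inference_steps=30).images[0]
image.save("augmented_rainy_night.png")
```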
AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization
Authors: Junjie Shentu, Matthew Watson, Noura Al Moubayed
Text-to-image (T2I) customization empowers users to adapt the T2I diffusion model to new concepts absent in the pre-training dataset. On this basis, capturing multiple new concepts from a single image has emerged as a new task, allowing the model to learn multiple concepts simultaneously or discard unwanted concepts. However, multiple-concept disentanglement remains a key challenge. Existing disentanglement models often exhibit two main issues: feature fusion and asynchronous learning across different concepts. To address these issues, we propose AttenCraft, an attention-based method for multiple-concept disentanglement. Our method uses attention maps to generate accurate masks for each concept in a single initialization step, aiding in concept disentanglement without requiring mask preparation from humans or specialized models. Moreover, we introduce an adaptive algorithm based on attention scores to estimate sampling ratios for different concepts, promoting balanced feature acquisition and synchronized learning. AttenCraft also introduces a feature-retaining training framework that employs various loss functions to enhance feature recognition and prevent fusion. Extensive experiments show that our model effectively mitigates these two issues, achieving state-of-the-art image fidelity and comparable prompt fidelity to baseline models.
Paper and Project Links
Summary
Text-to-image (T2I) customization lets users adapt a T2I diffusion model to new concepts absent from the pre-training dataset. Building on this, capturing multiple new concepts from a single image has emerged as a new task, but multi-concept disentanglement remains a key challenge: existing methods suffer from feature fusion and asynchronous learning across concepts. The paper proposes AttenCraft, an attention-based disentanglement method that uses attention maps to generate accurate masks for each concept in a single initialization step, without masks prepared by humans or specialized models. An adaptive algorithm based on attention scores estimates sampling ratios for the different concepts, promoting balanced feature acquisition and synchronized learning, and a feature-retaining training framework with multiple loss functions enhances feature recognition and prevents fusion. Extensive experiments show that the model effectively mitigates both issues, achieving state-of-the-art image fidelity and prompt fidelity comparable to baseline models.
Key Takeaways
- T2I customization adapts diffusion models to new concepts, enabling multiple new concepts to be captured from a single image.
- Existing disentanglement models suffer from feature fusion and asynchronous learning across concepts.
- AttenCraft uses attention maps to automatically generate per-concept masks without human intervention (a rough sketch follows after the list).
- An adaptive algorithm based on attention scores estimates per-concept sampling ratios, promoting balanced feature acquisition and synchronized learning.
- A feature-retaining training framework improves feature recognition and prevents fusion.
- Experiments show that AttenCraft effectively mitigates both feature fusion and asynchronous learning.
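The abstract says the per-concept masks come from attention maps in a single initialization step and that sampling ratios come from attention scores, but it does not give the exact rules. The sketch below illustrates one plausible version of that idea: per-concept cross-attention maps are normalized and compared per pixel to produce hard masks, and per-concept mean attention is inverted so weakly-attended concepts are sampled more often. The normalization, argmax assignment, and ratio formula are assumptions for illustration, not AttenCraft's algorithm.

```python
import torch

def masks_and_ratios(attn_maps):
    """attn_maps: (K, H, W) cross-attention maps, one per concept token.

    Returns:
      masks:  (K, H, W) boolean masks assigning each pixel to its strongest concept.
      ratios: (K,) sampling ratios, larger for weakly-attended concepts so that all
              concepts are learned at a similar pace (an assumed heuristic).
    """
    k = attn_maps.shape[0]
    # Normalize each concept's map to [0, 1] so concepts are comparable.
    flat = attn_maps.reshape(k, -1)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    norm = ((flat - lo) / (hi - lo + 1e-8)).reshape_as(attn_maps)

    # Hard assignment: each pixel belongs to the concept with the highest normalized attention.
    assignment = norm.argmax(dim=0)                           # (H, W)
    masks = torch.stack([assignment == i for i in range(k)])  # (K, H, W) bool

    # Concepts with low mean attention get sampled more often.
    mean_scores = norm.mean(dim=(1, 2))                       # (K,)
    inv = 1.0 / (mean_scores + 1e-8)
    ratios = inv / inv.sum()
    return masks, ratios

if __name__ == "__main__":
    maps = torch.rand(3, 16, 16)  # fake attention maps for 3 concepts
    masks, ratios = masks_and_ratios(maps)
    print(masks.shape, ratios)
```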