⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: do not rely on these summaries in serious academic settings; use them only for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-19
Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine
Authors:Xincheng Shuai, Zhenyuan Qin, Henghui Ding, Dacheng Tao
Recent advances in text-to-image (T2I) diffusion models have significantly improved semantic image editing, yet most methods fall short in performing 3D-aware object manipulation. In this work, we present FFSE, a 3D-aware autoregressive framework designed to enable intuitive, physically-consistent object editing directly on real-world images. Unlike previous approaches that either operate in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations, allowing users to perform arbitrary manipulations, such as translation, scaling, and rotation, while preserving realistic background effects (e.g., shadows, reflections) and maintaining global scene consistency across multiple editing rounds. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor, a hybrid dataset constructed from simulated editing sequences across diverse objects and scenes, enabling effective training under multi-round and dynamic conditions. Extensive experiments show that the proposed FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios.
Paper and Project Links
PDF: AAAI 2026, Project Page: https://henghuiding.com/FFSE/
Summary:
Recent advances in text-to-image (T2I) diffusion models have greatly improved semantic image editing, but most methods still fall short in 3D-aware object manipulation. This work proposes FFSE, a 3D-aware autoregressive framework for intuitive, physically consistent object editing directly on real-world images. Unlike prior approaches that operate in image space or rely on slow, error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations, letting users perform arbitrary manipulations such as translation, scaling, and rotation while keeping background effects (e.g., shadows, reflections) realistic and preserving global scene consistency across editing rounds. To support learning of multi-round 3D-aware object manipulation, the authors introduce 3DObjectEditor, a hybrid dataset built from simulated editing sequences over diverse objects and scenes, enabling effective training under multi-round and dynamic conditions. Experiments show FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing.
Key Takeaways:
- Recent advances in T2I diffusion models have substantially improved semantic image editing.
- Most current methods struggle with 3D-aware object manipulation.
- FFSE is a new 3D-aware autoregressive framework for intuitive, physically consistent editing of real-world images.
- FFSE models editing as a sequence of learned 3D transformations (sketched below).
- Users can manipulate objects arbitrarily while background effects remain realistic.
- The 3DObjectEditor hybrid dataset supports learning of multi-round 3D-aware object manipulation.
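As a rough illustration of the kind of 3D transformations FFSE is described as learning, the sketch below composes per-round scale/rotation/translation edits as 4x4 homogeneous matrices. The function and parameter names are hypothetical; FFSE applies such edits inside a generative model rather than explicitly.

```python
import numpy as np

def srt_matrix(scale=1.0, rot_xyz=(0.0, 0.0, 0.0), trans=(0.0, 0.0, 0.0)) -> np.ndarray:
    """4x4 homogeneous scale-rotate-translate matrix (illustrative only)."""
    rx, ry, rz = rot_xyz
    Rx = np.array([[1, 0, 0], [0, np.cos(rx), -np.sin(rx)], [0, np.sin(rx), np.cos(rx)]])
    Ry = np.array([[np.cos(ry), 0, np.sin(ry)], [0, 1, 0], [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0], [np.sin(rz), np.cos(rz), 0], [0, 0, 1]])
    M = np.eye(4)
    M[:3, :3] = scale * (Rz @ Ry @ Rx)
    M[:3, 3] = trans
    return M

# Multi-round editing: each round's edit composes with the previous ones, so the
# object's cumulative pose stays consistent across rounds.
round1 = srt_matrix(trans=(0.3, 0.0, 0.5))                     # round 1: move the object
round2 = srt_matrix(scale=1.2, rot_xyz=(0.0, np.pi / 6, 0.0))  # round 2: enlarge and rotate it
cumulative = round2 @ round1
```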
Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation
Authors:Sofia Jamil, Kotla Sai Charan, Sriparna Saha, Koustava Goswami, Joseph K J
Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader’s experience.
Paper and Project Links
Summary
This paper proposes the Translation and Image Generation (TAI) framework, which combines large language models (LLMs) and latent diffusion models with appropriate prompt tuning, supporting the UN Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10). The framework consists of a translation module and an image generation module: the former uses an Odds Ratio Preference Alignment Algorithm to translate morphologically rich poetry accurately into English, while the latter builds a semantic graph over tokens, dependencies, and the semantic relationships between metaphors and their meanings to create visual representations of Indian poems. Experimental evaluation shows the framework outperforms strong baselines on poem image generation. To address the scarcity of resources for Indian-language poetry, the authors also introduce MorphoVerse, a dataset of morphologically rich Indian-language poems.
Key Takeaways
- Indian poetry has a rich and varied heritage, but its linguistic complexity and cultural context make it hard to understand for non-native readers or those unfamiliar with its background.
- Existing work has paid little attention to Indian-language poetry.
- The proposed Translation and Image Generation (TAI) framework aims to make Indian-language poetry accessible to a global audience.
- TAI consists of a translation module and an image generation module: the former uses an Odds Ratio Preference Alignment Algorithm for accurate translation (an odds-ratio sketch follows below), and the latter uses a semantic graph to capture the poem's visual elements.
- Experimental evaluation demonstrates the superiority of the TAI framework on poem image generation.
- The MorphoVerse dataset, covering poems in many low-resource Indian languages, is introduced to address resource scarcity.
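For readers unfamiliar with odds-ratio preference alignment, the snippet below sketches the odds-ratio term popularized by ORPO, which the TAI translation module's "Odds Ratio Preference Alignment Algorithm" plausibly resembles; that correspondence, the function name, and the length-normalized log-probability inputs are assumptions rather than details from the paper.

```python
import math

def odds_ratio_preference_loss(logp_chosen: float, logp_rejected: float) -> float:
    """Odds-ratio preference term (ORPO-style sketch): push the model to assign
    higher odds to the preferred translation than to the rejected one.
    Inputs are (length-normalized) sequence log-probabilities under the model."""
    def log_odds(logp: float) -> float:
        # log odds = log p - log(1 - p), with p = exp(logp) in (0, 1)
        return logp - math.log1p(-math.exp(logp))

    z = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(z)); in ORPO this term, scaled by a weight, is added to the SFT loss
    return math.log1p(math.exp(-z))

# Example: a chosen translation that is far more likely than the rejected one gives a small loss.
print(odds_ratio_preference_loss(-40.0, -55.0))
```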
Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space
Authors:Kaiwen Wang, Kaili Zheng, Yiming Shi, Chenyi Guo, Ji Wu
Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a novel relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code and data will be released publicly.
Paper and Project Links
Summary
This paper tackles multi-person human mesh recovery from a single image and proposes Depth-conditioned Translation Optimization (DTO), an optimization-based method that jointly refines the camera-space translations of all individuals in a crowd, combining anthropometric height priors with monocular depth cues in a Maximum a posteriori (MAP) framework to obtain scene-consistent placements. Applying DTO to 4D-Humans yields DTO-Humans, a new large-scale pseudo-ground-truth dataset, and the authors further propose Metric-Aware HMR, an end-to-end network that estimates human meshes and camera parameters in metric scale. The method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery.
Key Takeaways
- Multi-person human mesh recovery from a single image is challenging, and in-the-wild training data is scarce.
- Existing pseudo-ground-truth generation pipelines are single-person-centric and lack scene-level consistency.
- Depth-conditioned Translation Optimization (DTO) jointly refines the camera-space translations of all individuals in a scene.
- Anthropometric height priors and monocular depth cues yield scene-consistent placement within a MAP framework (a toy sketch follows below).
- DTO-Humans, a new large-scale pseudo-ground-truth dataset of high-quality multi-person images, is constructed.
- Metric-Aware HMR directly estimates human meshes and camera parameters in metric scale.
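To make the MAP idea concrete, here is a toy sketch of a joint objective over per-person depths that balances a monocular-depth data term against an anthropometric height prior. The variable names, the depth-only parameterization, and the Gaussian noise scales are illustrative assumptions, not the paper's formulation (DTO optimizes full camera-space translations).

```python
import numpy as np
from scipy.optimize import minimize

def dto_style_objective(tz, obs_depth, pix_height, focal,
                        mean_height=1.7, sigma_h=0.1, sigma_d=0.5):
    """Negative log-posterior over per-person depths tz (toy MAP sketch).
    Data term: stay close to monocular depth estimates obs_depth.
    Prior term: implied metric height (pix_height * tz / focal) should be close
    to an anthropometric prior N(mean_height, sigma_h^2)."""
    implied_height = pix_height * tz / focal
    data = np.sum((tz - obs_depth) ** 2) / (2.0 * sigma_d ** 2)
    prior = np.sum((implied_height - mean_height) ** 2) / (2.0 * sigma_h ** 2)
    return data + prior

# Hypothetical crowd of three people in one image.
obs_depth = np.array([3.2, 4.1, 2.5])         # meters, from a monocular depth model
pix_height = np.array([520.0, 400.0, 660.0])  # person heights in pixels
focal = 1000.0                                # focal length in pixels
res = minimize(dto_style_objective, x0=obs_depth, args=(obs_depth, pix_height, focal))
print(res.x)  # jointly refined, scene-consistent depths
```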
MRIQT: Physics-Aware Diffusion Model for Image Quality Transfer in Neonatal Ultra-Low-Field MRI
Authors:Malek Al Abed, Sebiha Demir, Anne Groteklaes, Elodie Germani, Shahrooz Faghihroohi, Hemmen Sabir, Shadi Albarqouni
Portable ultra-low-field MRI (uLF-MRI, 0.064 T) offers accessible neuroimaging for neonatal care but suffers from low signal-to-noise ratio and poor diagnostic quality compared to high-field (HF) MRI. We propose MRIQT, a 3D conditional diffusion framework for image quality transfer (IQT) from uLF to HF MRI. MRIQT combines realistic K-space degradation for physics-consistent uLF simulation, v-prediction with classifier-free guidance for stable image-to-image generation, and an SNR-weighted 3D perceptual loss for anatomical fidelity. The model denoises from a noised uLF input conditioned on the same scan, leveraging a volumetric attention-UNet architecture for structure-preserving translation. Trained on a neonatal cohort with diverse pathologies, MRIQT surpasses recent GAN and CNN baselines in PSNR by 15.3%, and the state of the art by 1.78%, while physicians rated 85% of its outputs as good quality with clear pathology present. MRIQT enables high-fidelity, diffusion-based enhancement of portable ultra-low-field (uLF) MRI for reliable neonatal brain assessment.
Paper and Project Links
PDF: 5 pages, 4 figures
Summary
Portable ultra-low-field MRI (uLF-MRI, 0.064 T) provides accessible neuroimaging for neonatal care but has a low signal-to-noise ratio and poorer diagnostic quality than high-field (HF) MRI. This paper proposes MRIQT, a 3D conditional diffusion framework for image quality transfer (IQT) from uLF to HF MRI. MRIQT combines realistic K-space degradation for physics-consistent uLF simulation, v-prediction with classifier-free guidance for stable image-to-image generation, and an SNR-weighted 3D perceptual loss for anatomical fidelity. The model denoises a noised uLF input conditioned on the same scan, using a volumetric attention U-Net for structure-preserving translation. Trained on a neonatal cohort with diverse pathologies, MRIQT surpasses recent GAN and CNN baselines in PSNR by 15.3% and the state of the art by 1.78%, and physicians rated 85% of its outputs as good quality with clear pathology visible. MRIQT enables high-fidelity, diffusion-based enhancement of portable uLF MRI for reliable neonatal brain assessment.
Key Takeaways
- Portable ultra-low-field MRI (uLF-MRI) provides accessible neonatal neuroimaging but with limited diagnostic quality.
- The MRIQT framework is proposed to transfer image quality from uLF to HF MRI.
- MRIQT combines physics-consistent uLF simulation, stable image-to-image generation via v-prediction with classifier-free guidance (sketched below), and a perceptual loss for anatomical fidelity.
- A volumetric attention U-Net performs structure-preserving translation, denoising from a noised uLF input.
- Trained on a neonatal cohort with diverse pathologies, MRIQT outperforms prior methods in PSNR.
- Physicians rated 85% of MRIQT's outputs as good quality with clear pathology.
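Since the summary leans on v-prediction and classifier-free guidance, the sketch below shows how those two generic pieces typically combine in one denoising step. The `model` interface, the conditioning argument, and the schedule variables are placeholders; MRIQT's exact formulation (including the SNR-weighted perceptual loss used at training time) is not reproduced here.

```python
import torch

def v_prediction_cfg_step(model, x_t: torch.Tensor, t: torch.Tensor,
                          cond: torch.Tensor, alpha_t: float, sigma_t: float,
                          guidance: float = 3.0):
    """One ingredient of a sampler: classifier-free guidance applied to a
    v-prediction network. `model(x, t, cond)` is assumed to return v; passing
    cond=None selects the unconditional branch. Assumes alpha_t**2 + sigma_t**2 == 1."""
    v_cond = model(x_t, t, cond)
    v_uncond = model(x_t, t, None)
    v = v_uncond + guidance * (v_cond - v_uncond)   # CFG blend of the two branches

    # Standard v-parameterization identities to recover the clean image and the noise.
    x0_hat = alpha_t * x_t - sigma_t * v
    eps_hat = sigma_t * x_t + alpha_t * v
    return x0_hat, eps_hat
```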
Towards Rotation-only Imaging Geometry: Rotation Estimation
Authors:Xinrui Li, Qi Cai, Yuanxin Wu
Structure from Motion (SfM) is a critical task in computer vision, aiming to recover the 3D scene structure and camera motion from a sequence of 2D images. The recent pose-only imaging geometry decouples 3D coordinates from camera poses and demonstrates significantly better SfM performance through pose adjustment. Continuing the pose-only perspective, this paper explores the critical relationship between the scene structures, rotation and translation. Notably, the translation can be expressed in terms of rotation, allowing us to condense the imaging geometry representation onto the rotation manifold. A rotation-only optimization framework based on reprojection error is proposed for both two-view and multi-view scenarios. The experiment results demonstrate superior accuracy and robustness performance over the current state-of-the-art rotation estimation methods, even comparable to multiple bundle adjustment iteration results. Hopefully, this work contributes to even more accurate, efficient and reliable 3D visual computing.
Paper and Project Links
Summary
Within the structure-from-motion (SfM) task, this paper explores the relationship between scene structure, rotation, and translation from the pose-only perspective, showing that translation can be expressed in terms of rotation so that the imaging geometry representation can be condensed onto the rotation manifold. A rotation-only optimization framework based on reprojection error is proposed for both two-view and multi-view settings. Compared with existing state-of-the-art methods, the approach achieves higher rotation-estimation accuracy and robustness, even comparable to the results of multiple bundle adjustment iterations, contributing to more accurate, efficient, and reliable 3D visual computing.
Key Takeaways
- The paper studies the relationship between scene structure, rotation, and translation in SfM from the pose-only perspective.
- A rotation-only optimization framework is proposed for recovering camera geometry from 2D image sequences.
- The optimization is driven by reprojection error.
- The method achieves higher rotation-estimation accuracy and robustness than existing state-of-the-art approaches.
- Translation can be expressed in terms of rotation, condensing the imaging geometry onto the rotation manifold (illustrated below).
- The work contributes to more accurate, efficient, and reliable 3D visual computing.
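The claim that translation can be expressed in terms of rotation has a familiar two-view analogue: given R and point correspondences, the epipolar constraint is linear in t, so t (up to scale) is determined by R and the cost becomes a function of rotation alone. The sketch below implements that classical construction with an algebraic epipolar residual; it only illustrates the idea and is not the paper's reprojection-error formulation or its multi-view parameterization.

```python
import numpy as np

def skew(t: np.ndarray) -> np.ndarray:
    """3x3 cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def translation_from_rotation(R: np.ndarray, x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """Given rotation R and normalized homogeneous correspondences x1, x2 (N x 3),
    the epipolar constraint x2_i^T [t]_x R x1_i = 0 is linear in t, so the best
    translation direction is the smallest right singular vector of the stacked
    rows (R x1_i) x x2_i."""
    c = np.cross(x1 @ R.T, x2)        # row i: (R x1_i) x x2_i
    _, _, Vt = np.linalg.svd(c)
    t = Vt[-1]
    return t / np.linalg.norm(t)

def rotation_only_cost(R: np.ndarray, x1: np.ndarray, x2: np.ndarray) -> float:
    """Algebraic epipolar cost that depends on R alone, since t = t(R)."""
    E = skew(translation_from_rotation(R, x1, x2)) @ R
    residuals = np.einsum('ni,ij,nj->n', x2, E, x1)
    return float(np.sum(residuals ** 2))
```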
DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT
Authors:Xianhao Zhou, Jianghao Wu, Ku Zhao, Jinlong He, Huangxuan Zhao, Lei Chen, Shaoting Zhang, Guotai Wang
Generating synthetic CT images from CBCT or MRI has a potential for efficient radiation dose planning and adaptive radiotherapy. However, existing CNN-based models lack global semantic understanding, while Transformers often overfit small medical datasets due to high model capacity and weak inductive bias. To address these limitations, we propose a DINOv3-Guided Cross Fusion (DGCF) framework that integrates a frozen self-supervised DINOv3 Transformer with a trainable CNN encoder-decoder. It hierarchically fuses global representation of Transformer and local features of CNN via a learnable cross fusion module, achieving balanced local appearance and contextual representation. Furthermore, we introduce a Multi-Level DINOv3 Perceptual (MLDP) loss that encourages semantic similarity between synthetic CT and the ground truth in DINOv3’s feature space. Experiments on the SynthRAD2023 pelvic dataset demonstrate that DGCF achieved state-of-the-art performance in terms of MS-SSIM, PSNR and segmentation-based metrics on both MRI$\rightarrow$CT and CBCT$\rightarrow$CT translation tasks. To the best of our knowledge, this is the first work to employ DINOv3 representations for medical image translation, highlighting the potential of self-supervised Transformer guidance for semantic-aware CT synthesis. The code is available at https://github.com/HiLab-git/DGCF.
Paper and Project Links
Summary
This paper proposes the DINOv3-Guided Cross Fusion (DGCF) framework for synthetic CT generation. It integrates a frozen self-supervised DINOv3 Transformer with a trainable CNN encoder-decoder, hierarchically fusing the Transformer's global representations with the CNN's local features via a learnable cross fusion module to balance local appearance and contextual representation. Experiments on the SynthRAD2023 pelvic dataset show that DGCF achieves state-of-the-art performance on both MRI→CT and CBCT→CT translation. This is the first work to use DINOv3 representations for medical image translation, highlighting the potential of self-supervised Transformer guidance for semantic-aware CT synthesis.
Key Takeaways
1. The DGCF framework integrates a frozen self-supervised DINOv3 Transformer with a CNN encoder-decoder, offering a new approach to synthetic CT generation.
2. DGCF hierarchically fuses the Transformer's global representations with the CNN's local features, balancing local appearance and contextual representation.
3. Experiments show that DGCF achieves state-of-the-art performance on MRI→CT and CBCT→CT translation, demonstrating its effectiveness for medical image translation.
4. This is the first use of DINOv3 representations for medical image translation, highlighting their potential for semantic-aware CT synthesis.
5. The proposed MLDP loss encourages semantic similarity between synthetic and ground-truth CT in DINOv3's feature space, improving synthesis quality (a generic sketch follows below).
6. The DGCF code is publicly available, facilitating further research and application.
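Takeaway 5 above mentions the MLDP loss; the sketch below shows the generic shape of such a multi-level perceptual loss in a frozen backbone's feature space. The `feat_extractor(x, layers)` interface returning a list of per-layer features is an assumed stand-in, not the actual DINOv3 API, and the layer choice and L1 distance are illustrative.

```python
import torch
import torch.nn.functional as F

def multi_level_perceptual_loss(feat_extractor, synth_ct: torch.Tensor,
                                real_ct: torch.Tensor, layers=(3, 6, 9, 12)) -> torch.Tensor:
    """Encourage semantic similarity between synthetic and ground-truth CT by
    matching features at several depths of a frozen backbone (MLDP-style sketch)."""
    with torch.no_grad():
        target_feats = feat_extractor(real_ct, layers)   # ground-truth features, no gradients
    synth_feats = feat_extractor(synth_ct, layers)       # gradients flow back to the generator

    loss = synth_ct.new_zeros(())
    for fs, ft in zip(synth_feats, target_feats):
        loss = loss + F.l1_loss(fs, ft)                  # per-level feature distance
    return loss / len(layers)
```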
FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation
Authors:Xiang Gao, Jiaying Liu
Large-scale text-to-image diffusion models have been a revolutionary milestone in the evolution of generative AI and multimodal technology, allowing wonderful image generation with natural-language text prompt. However, the issue of lacking controllability of such models restricts their practical applicability for real-life content creation. Thus, attention has been focused on leveraging a reference image to control text-to-image synthesis, which is also regarded as manipulating (or editing) a reference image as per a text prompt, namely, text-driven image-to-image translation. This paper contributes a novel, concise, and efficient approach that adapts pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner, realizing high-quality and versatile text-driven I2I translation without any model training, model fine-tuning, or online optimization process. To guide T2I generation with a reference image, we propose to decompose diverse guiding factors with different frequency bands of diffusion features in the DCT spectral space, and accordingly devise a novel frequency band substitution layer which realizes dynamic control of the reference image to the T2I generation result in a plug-and-play manner. We demonstrate that our method allows flexible control over both guiding factor and guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band, respectively. Extensive qualitative and quantitative experiments verify superiority of our approach over related methods in I2I translation visual quality, versatility, and controllability. The code is publicly available at: https://github.com/XiangGao1102/FBSDiff.
Paper and Project Links
PDF: Accepted conference paper of ACM MM 2024
Summary
This paper presents a novel, concise, and efficient approach that adapts a pre-trained text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner, achieving high-quality, versatile text-driven I2I translation without any model training, fine-tuning, or online optimization. By decomposing guiding factors into different frequency bands of diffusion features in the DCT spectral space, it devises a novel frequency band substitution layer that dynamically controls how the reference image steers T2I generation. Adjusting the type and bandwidth of the substituted frequency band flexibly controls the guiding factor and guiding intensity of the reference image, respectively. Experiments show the method outperforms related approaches in I2I translation visual quality, versatility, and controllability.
Key Takeaways
- Large-scale text-to-image diffusion models are a major milestone in generative AI and multimodal technology.
- Their lack of controllability limits practical applicability.
- Using a reference image to control text-to-image synthesis has become a research focus.
- The paper proposes a novel approach that adapts a pre-trained text-to-image diffusion model to the image-to-image paradigm.
- By decomposing diffusion features into different frequency bands, a frequency band substitution layer enables dynamic control from the reference image.
- The guiding factor and guiding intensity can be flexibly controlled by tuning the type and bandwidth of the substituted frequency band (sketched below).
- Experiments show the method outperforms other approaches in I2I translation visual quality, versatility, and controllability.
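To illustrate the frequency band substitution idea on a single 2D feature map, the sketch below swaps a chosen band of DCT coefficients from a reference map into a generated one. FBSDiff performs this on diffusion features inside the sampling loop with its own band definitions, so the radial band mask and the `low`/`high` parameters here are simplifying assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

def frequency_band_substitute(gen_feat: np.ndarray, ref_feat: np.ndarray,
                              low: float = 0.0, high: float = 0.25) -> np.ndarray:
    """Replace one DCT frequency band of the generated feature map with the
    reference's coefficients. Which band is swapped and how wide it is control
    the guiding factor and the guiding intensity, respectively."""
    G = dctn(gen_feat, norm='ortho')
    Rf = dctn(ref_feat, norm='ortho')
    h, w = gen_feat.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2) / np.sqrt(2.0)  # 0 at DC, ~1 at the highest band
    band = (radius >= low) & (radius < high)
    G[band] = Rf[band]               # substitute the reference's coefficients in the band
    return idctn(G, norm='ortho')

# Substituting a low band transfers the reference's coarse layout; moving or
# widening the band changes what is borrowed and how strongly it guides generation.
```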