
Diffusion Models


⚠️ All of the summaries below are generated by a large language model; they may contain errors, are for reference only, and should be used with caution.
🔴 Please note: do not use these summaries in serious academic settings; they are only meant as a quick first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-09-12

ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion

Authors:Ao Li, Jinpeng Liu, Yixuan Zhu, Yansong Tang

Joint reconstruction of human-object interaction marks a significant milestone in comprehending the intricate interrelations between humans and their surrounding environment. Nevertheless, previous optimization methods often struggle to achieve physically plausible reconstruction results due to the lack of prior knowledge about human-object interactions. In this paper, we introduce ScoreHOI, an effective diffusion-based optimizer that introduces diffusion priors for the precise recovery of human-object interactions. By harnessing the controllability within score-guided sampling, the diffusion model can reconstruct a conditional distribution of human and object pose given the image observation and object feature. During inference, the ScoreHOI effectively improves the reconstruction results by guiding the denoising process with specific physical constraints. Furthermore, we propose a contact-driven iterative refinement approach to enhance the contact plausibility and improve the reconstruction accuracy. Extensive evaluations on standard benchmarks demonstrate ScoreHOI’s superior performance over state-of-the-art methods, highlighting its ability to achieve a precise and robust improvement in joint human-object interaction reconstruction.
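The abstract does not spell out how the physical constraints enter the sampler, but the general shape of score-guided denoising can be illustrated in a few lines. Below is a minimal, hypothetical PyTorch sketch of a DDIM-style reverse step in which the gradient of a differentiable penalty (standing in for ScoreHOI's physical constraints, e.g. contact or penetration terms) nudges the predicted clean sample; `eps_model`, `penalty_fn`, and all shapes are placeholders rather than the authors' implementation.

```python
import torch

def guided_ddpm_step(x_t, t, eps_model, alphas_cumprod, penalty_fn, guidance_scale=1.0):
    """One reverse-diffusion step with gradient guidance from a physical penalty.

    x_t            : current noisy sample, shape (B, D)
    eps_model      : callable (x_t, t) -> predicted noise, same shape as x_t
    alphas_cumprod : 1-D tensor of cumulative alpha-bar values, indexed by t
    penalty_fn     : differentiable scalar constraint (e.g. contact/penetration)
    """
    a_bar = alphas_cumprod[t]
    a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)

    eps = eps_model(x_t, t)
    # Predicted clean sample x0 from the current noise estimate.
    x0_hat = (x_t - torch.sqrt(1 - a_bar) * eps) / torch.sqrt(a_bar)

    # Score guidance: push the prediction down the gradient of the penalty.
    x0_hat = x0_hat.detach().requires_grad_(True)
    grad = torch.autograd.grad(penalty_fn(x0_hat), x0_hat)[0]
    x0_guided = (x0_hat - guidance_scale * grad).detach()

    # Deterministic DDIM-style update back to step t-1 with the guided x0.
    return torch.sqrt(a_bar_prev) * x0_guided + torch.sqrt(1 - a_bar_prev) * eps


# Toy usage with placeholder components (purely illustrative).
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
eps_model = lambda x, t: torch.zeros_like(x)            # stand-in denoiser
penalty = lambda x: torch.relu(-x[:, 1]).pow(2).sum()   # e.g. "stay above the floor"
x = torch.randn(4, 3)
for t in reversed(range(1, 1000, 100)):
    x = guided_ddpm_step(x, t, eps_model, alphas_cumprod, penalty)
```

Per the abstract, ScoreHOI additionally wraps such guided denoising in a contact-driven iterative refinement loop.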


Paper & Project Links

PDF Accepted by ICCV 2025

Summary

This paper presents ScoreHOI, a diffusion-based optimization method that introduces diffusion priors to precisely reconstruct human-object interactions. Leveraging the controllability of score-guided sampling, it reconstructs the conditional distribution of human and object poses given the image observation and object features. During inference, ScoreHOI improves reconstruction by guiding the denoising process with specific physical constraints. A contact-driven iterative refinement approach further improves contact plausibility and reconstruction accuracy. Extensive evaluations on standard benchmarks show that ScoreHOI outperforms state-of-the-art methods in joint human-object interaction reconstruction, delivering precise and robust improvements.

Key Takeaways

  1. ScoreHOI is a diffusion-based optimization method for reconstructing human-object interactions.
  2. It introduces diffusion priors to improve human-object interaction reconstruction.
  3. The controllability of score-guided sampling is used to reconstruct the conditional pose distribution from image observations and object features.
  4. ScoreHOI improves reconstruction at inference time through specific physical constraints.
  5. A contact-driven iterative refinement approach improves contact plausibility and reconstruction accuracy.
  6. ScoreHOI outperforms current state-of-the-art methods on standard benchmarks.

Cool Papers

Click here to view paper screenshots

Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity

Authors:Sung Ju Lee, Nam Ik Cho

Semantic watermarking techniques for latent diffusion models (LDMs) are robust against regeneration attacks, but often suffer from detection performance degradation due to the loss of frequency integrity. To tackle this problem, we propose a novel embedding method called Hermitian Symmetric Fourier Watermarking (SFW), which maintains frequency integrity by enforcing Hermitian symmetry. Additionally, we introduce a center-aware embedding strategy that reduces the vulnerability of semantic watermarking due to cropping attacks by ensuring robust information retention. To validate our approach, we apply these techniques to existing semantic watermarking schemes, enhancing their frequency-domain structures for better robustness and retrieval accuracy. Extensive experiments demonstrate that our methods achieve state-of-the-art verification and identification performance, surpassing previous approaches across various attack scenarios. Ablation studies confirm the impact of SFW on detection capabilities, the effectiveness of the center-aware embedding against cropping, and how message capacity influences identification accuracy. Notably, our method achieves the highest detection accuracy while maintaining superior image fidelity, as evidenced by FID and CLIP scores. Conclusively, our proposed SFW is shown to be an effective framework for balancing robustness and image fidelity, addressing the inherent trade-offs in semantic watermarking. Code available at https://github.com/thomas11809/SFWMark
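To make the frequency-integrity idea concrete, here is a minimal NumPy sketch (not the authors' code) that adds a watermark block to the centered Fourier spectrum of a latent channel and then symmetrizes the spectrum so that F[-u, -v] = conj(F[u, v]). The Hermitian symmetry keeps the inverse transform real-valued, and writing the block around the DC component is a crude stand-in for the paper's center-aware embedding.

```python
import numpy as np

def embed_hermitian_watermark(latent, pattern, radius=8, strength=4.0):
    """Embed a real `pattern` in the FFT of `latent` while enforcing
    Hermitian symmetry, so the watermarked latent stays real-valued.

    latent  : real 2-D array (H, W), e.g. one channel of an LDM latent
    pattern : real 2-D array (2*radius, 2*radius) encoding the message/key
    """
    H, W = latent.shape
    spec = np.fft.fftshift(np.fft.fft2(latent))

    # Center-aware placement: a low-frequency block around DC survives
    # cropping far better than content near the spectrum borders.
    cy, cx = H // 2, W // 2
    spec[cy - radius:cy + radius, cx - radius:cx + radius] += strength * pattern

    # Enforce Hermitian symmetry F[-u, -v] = conj(F[u, v]) on the full spectrum.
    spec = np.fft.ifftshift(spec)
    conj_mirror = np.conj(np.roll(spec[::-1, ::-1], shift=(1, 1), axis=(0, 1)))
    spec = 0.5 * (spec + conj_mirror)

    return np.fft.ifft2(spec).real  # imaginary part is numerical noise only


# Tiny usage example with random data (hypothetical shapes).
latent = np.random.randn(64, 64)
pattern = np.random.choice([-1.0, 1.0], size=(16, 16))
wm_latent = embed_hermitian_watermark(latent, pattern)
```

Detection would then correlate the corresponding spectral block of a recovered latent against the known pattern; the paper's actual schemes and keys differ.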


Paper & Project Links

PDF Accepted to the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Project page: https://thomas11809.github.io/SFWMark/ Code: https://github.com/thomas11809/SFWMark

Summary

This paper proposes a semantic watermarking technique for latent diffusion models (LDMs) that addresses the detection degradation caused by regeneration attacks and the loss of frequency integrity. The new embedding method, Hermitian Symmetric Fourier Watermarking (SFW), preserves frequency integrity by enforcing Hermitian symmetry. A center-aware embedding strategy is also introduced to reduce the vulnerability of semantic watermarks to cropping attacks by ensuring robust information retention. Experiments show that the method achieves state-of-the-art verification and identification performance across a variety of attack scenarios, with strong detection capability and image fidelity.

Key Takeaways

  1. Semantic watermarking for latent diffusion models is robust against regeneration attacks but suffers from detection performance degradation.
  2. A new embedding method, Hermitian Symmetric Fourier Watermarking (SFW), maintains frequency integrity by enforcing Hermitian symmetry.
  3. A center-aware embedding strategy reduces the vulnerability of semantic watermarks to cropping attacks and ensures robust information retention.
  4. The method achieves state-of-the-art verification and identification performance across various attack scenarios, with strong detection capability and image fidelity.
  5. Extensive experiments, including comparisons with other methods, validate its effectiveness.
  6. The code is publicly available for further research and application.

Cool Papers

Click here to view paper screenshots

Universal Few-Shot Spatial Control for Diffusion Models

Authors:Kiet T. Nguyen, Chanhuyk Lee, Donggyun Kim, Dong Hoon Lee, Seunghoon Hong

Spatial conditioning in pretrained text-to-image diffusion models has significantly improved fine-grained control over the structure of generated images. However, existing control adapters exhibit limited adaptability and incur high training costs when encountering novel spatial control conditions that differ substantially from the training tasks. To address this limitation, we propose Universal Few-Shot Control (UFC), a versatile few-shot control adapter capable of generalizing to novel spatial conditions. Given a few image-condition pairs of an unseen task and a query condition, UFC leverages the analogy between query and support conditions to construct task-specific control features, instantiated by a matching mechanism and an update on a small set of task-specific parameters. Experiments on six novel spatial control tasks show that UFC, fine-tuned with only 30 annotated examples of novel tasks, achieves fine-grained control consistent with the spatial conditions. Notably, when fine-tuned with 0.1% of the full training data, UFC achieves competitive performance with the fully supervised baselines in various control tasks. We also show that UFC is applicable agnostically to various diffusion backbones and demonstrate its effectiveness on both UNet and DiT architectures. Code is available at https://github.com/kietngt00/UFC.
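The abstract builds task-specific control features from an analogy between the query condition and a few support conditions via a matching mechanism. The toy PyTorch sketch below shows one way such matching could look: similarity-weighted pooling of support control features onto query condition tokens. The shapes, the temperature, and the function name are illustrative assumptions, not the actual UFC architecture.

```python
import torch
import torch.nn.functional as F

def few_shot_control_features(query_cond, support_cond, support_ctrl, temperature=0.1):
    """Build control features for a query condition by matching it against
    a handful of support condition/control pairs.

    query_cond   : (N_q, C)  features of the query spatial condition
    support_cond : (N_s, C)  features of the few-shot support conditions
    support_ctrl : (N_s, D)  control features associated with the supports
    returns      : (N_q, D)  task-specific control features for the query
    """
    q = F.normalize(query_cond, dim=-1)
    s = F.normalize(support_cond, dim=-1)
    attn = torch.softmax(q @ s.t() / temperature, dim=-1)  # analogy weights
    return attn @ support_ctrl


# Hypothetical sizes: 30 support tokens, 100 query tokens, 64/128-dim features.
ctrl = few_shot_control_features(torch.randn(100, 64),
                                 torch.randn(30, 64),
                                 torch.randn(30, 128))
```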


Paper & Project Links

PDF

Summary

Spatial conditioning in pretrained text-to-image diffusion models enables fine-grained control over the structure of generated images, but existing control adapters adapt poorly, and at high training cost, to novel spatial control conditions that differ substantially from the training tasks. To address this, the authors propose Universal Few-Shot Control (UFC), a versatile few-shot control adapter that generalizes to novel spatial conditions. Given only a few image-condition pairs of an unseen task and a query condition, UFC builds task-specific control features by drawing an analogy between the query and support conditions, instantiated through a matching mechanism and an update of a small set of task-specific parameters. Experiments on six novel spatial control tasks show that UFC, fine-tuned with only 30 annotated examples of the unseen task, achieves fine-grained control consistent with the spatial conditions. Notably, when fine-tuned with 0.1% of the full training data, UFC performs on par with fully supervised baselines across various control tasks. UFC also applies broadly across diffusion backbones and works well on both UNet and DiT architectures. Code: https://github.com/kietngt00/UFC

Key Takeaways

  1. Spatial conditions in text-to-image diffusion models provide fine-grained control over the structure of generated images.
  2. Existing control adapters are limited when facing novel spatial control conditions.
  3. The Universal Few-Shot Control (UFC) adapter is proposed to overcome this limitation and adapt to new spatial conditions.
  4. UFC achieves fine-grained control over new tasks from only a few samples and generalizes well.
  5. Fine-tuned with a small fraction of the training data, UFC performs on par with fully supervised methods.
  6. UFC applies to multiple diffusion architectures and is validated on both UNet and DiT.

Cool Papers

Click here to view paper screenshots

LINR Bridge: Vector Graphic Animation via Neural Implicits and Video Diffusion Priors

Authors:Wenshuo Gao, Xicheng Lan, Luyao Zhang, Shuai Yang

Vector graphics, known for their scalability and user-friendliness, provide a unique approach to visual content compared to traditional pixel-based images. Animation of these graphics, driven by the motion of their elements, offers enhanced comprehensibility and controllability but often requires substantial manual effort. To automate this process, we propose a novel method that integrates implicit neural representations with text-to-video diffusion models for vector graphic animation. Our approach employs layered implicit neural representations to reconstruct vector graphics, preserving their inherent properties such as infinite resolution and precise color and shape constraints, which effectively bridges the large domain gap between vector graphics and diffusion models. The neural representations are then optimized using video score distillation sampling, which leverages motion priors from pretrained text-to-video diffusion models. Finally, the vector graphics are warped to match the representations resulting in smooth animation. Experimental results validate the effectiveness of our method in generating vivid and natural vector graphic animations, demonstrating significant improvement over existing techniques that suffer from limitations in flexibility and animation quality.
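The "video score distillation sampling" in the abstract builds on the standard score distillation sampling (SDS) objective introduced in DreamFusion. For reference, the image-level SDS gradient takes the form below, where x(θ) is a frame rendered from the layered implicit representation, ε̂_φ is the frozen diffusion model's noise prediction conditioned on the prompt y, and w(t) is a timestep weighting; the video variant replaces the single rendered image with a clip of rendered frames and a text-to-video denoiser.

```latex
\nabla_{\theta}\mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\big)\,
      \frac{\partial x(\theta)}{\partial \theta} \right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x(\theta) + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(0, I).
```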


Paper & Project Links

PDF 5 pages, ICIPW 2025, Website: https://gaowenshuo.github.io/LINR-bridge/

Summary

This paper proposes a method that combines implicit neural representations with text-to-video diffusion models for automatic vector graphic animation. Layered implicit neural representations reconstruct vector graphics while preserving their inherent properties, such as infinite resolution and precise color and shape constraints, effectively bridging the domain gap between vector graphics and diffusion models. The neural representations are then optimized with video score distillation sampling, which leverages motion priors from pretrained text-to-video diffusion models. Finally, the vector graphics are warped to match the optimized representations, yielding smooth animation. Experiments show that the method generates vivid and natural vector graphic animations and significantly outperforms existing techniques in flexibility and animation quality.

Key Takeaways

  1. Vector graphics are scalable and user-friendly, offering a distinct alternative to traditional pixel-based images.
  2. Animating vector graphics improves comprehensibility and controllability but usually requires substantial manual effort.
  3. A new method combining implicit neural representations with text-to-video diffusion models automates vector graphic animation.
  4. Layered implicit neural representations reconstruct vector graphics while preserving their inherent properties, such as infinite resolution and precise color and shape constraints.
  5. Motion priors from pretrained text-to-video diffusion models are exploited via video score distillation sampling to optimize the neural representations.
  6. The vector graphics are warped to match the representations, producing smooth animation.

Cool Papers

Click here to view paper screenshots

ANYPORTAL: Zero-Shot Consistent Video Background Replacement

Authors:Wenshuo Gao, Xicheng Lan, Shuai Yang

Despite the rapid advancements in video generation technology, creating high-quality videos that precisely align with user intentions remains a significant challenge. Existing methods often fail to achieve fine-grained control over video details, limiting their practical applicability. We introduce ANYPORTAL, a novel zero-shot framework for video background replacement that leverages pre-trained diffusion models. Our framework collaboratively integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. To address the critical challenge of foreground consistency, we propose a Refinement Projection Algorithm, which enables pixel-level detail manipulation to ensure precise foreground preservation. ANYPORTAL is training-free and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Experimental results demonstrate that ANYPORTAL achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.
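The Refinement Projection Algorithm itself is not detailed in the abstract; as a heavily simplified illustration of pixel-level foreground preservation, the NumPy sketch below composites the original foreground back over the generated background with a soft matte. All names and shapes are hypothetical, and the real method must also reconcile relighting across frames, which a plain composite cannot.

```python
import numpy as np

def project_foreground(original, generated, fg_matte):
    """Keep original foreground pixels and take the generated background elsewhere.

    original, generated : float arrays (H, W, 3) in [0, 1]
    fg_matte            : float array (H, W) in [0, 1], 1 = foreground
                          (in practice a soft / feathered matte)
    """
    alpha = fg_matte[..., None]
    return alpha * original + (1.0 - alpha) * generated


# Hypothetical frame-by-frame usage over a short clip.
frames = np.random.rand(4, 64, 64, 3)   # original frames
new_bg = np.random.rand(4, 64, 64, 3)   # relit frames from the diffusion stage
mattes = np.random.rand(4, 64, 64)      # per-frame foreground mattes
out = np.stack([project_foreground(f, g, m) for f, g, m in zip(frames, new_bg, mattes)])
```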


Paper & Project Links

PDF 8 pages, ICCV 2025, Website: https://gaowenshuo.github.io/AnyPortal/

Summary

Built on pretrained diffusion models, ANYPORTAL is a novel zero-shot framework for video background replacement. It combines the temporal prior of video diffusion models with the relighting capability of image diffusion models. A Refinement Projection Algorithm ensures precise foreground preservation, addressing the key challenge of foreground consistency. ANYPORTAL is training-free and overcomes the challenges of foreground consistency and temporally coherent relighting. Experiments show that ANYPORTAL achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.

Key Takeaways

  • ANYPORTAL is a novel zero-shot framework for video background replacement.
  • It combines the temporal prior of video diffusion models with the relighting capability of image diffusion models.
  • A Refinement Projection Algorithm ensures precise foreground preservation, solving the key challenge of foreground consistency.
  • ANYPORTAL is training-free and overcomes the challenges of foreground consistency and temporally coherent relighting.
  • The framework runs on consumer-grade GPUs, making it practical and efficient.
  • ANYPORTAL is well suited to video content creation and editing.

Cool Papers

Click here to view paper screenshots

DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation

Authors:Ze-Xin Yin, Jiaxiong Qiu, Liu Liu, Xinjie Wang, Wei Sui, Zhizhong Su, Jian Yang, Jin Xie

The labor- and experience-intensive creation of 3D assets with physically based rendering (PBR) materials demands an autonomous 3D asset creation pipeline. However, most existing 3D generation methods focus on geometry modeling, either baking textures into simple vertex colors or leaving texture synthesis to post-processing with image diffusion models. To achieve end-to-end PBR-ready 3D asset generation, we present Lightweight Gaussian Asset Adapter (LGAA), a novel framework that unifies the modeling of geometry and PBR materials by exploiting multi-view (MV) diffusion priors from a novel perspective. The LGAA features a modular design with three components. Specifically, the LGAA Wrapper reuses and adapts network layers from MV diffusion models, which encapsulate knowledge acquired from billions of images, enabling better convergence in a data-efficient manner. To incorporate multiple diffusion priors for geometry and PBR synthesis, the LGAA Switcher aligns multiple LGAA Wrapper layers encapsulating different knowledge. Then, a tamed variational autoencoder (VAE), termed LGAA Decoder, is designed to predict 2D Gaussian Splatting (2DGS) with PBR channels. Finally, we introduce a dedicated post-processing procedure to effectively extract high-quality, relightable mesh assets from the resulting 2DGS. Extensive quantitative and qualitative experiments demonstrate the superior performance of LGAA with both text-and image-conditioned MV diffusion models. Additionally, the modular design enables flexible incorporation of multiple diffusion priors, and the knowledge-preserving scheme leads to efficient convergence trained on merely 69k multi-view instances. Our code, pre-trained weights, and the dataset used will be publicly available via our project page: https://zx-yin.github.io/dreamlifting/.
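As a small illustration of the kind of primitive the LGAA Decoder predicts, the sketch below defines one 2D Gaussian splat carrying PBR channels. The field list is only a plausible guess from the abstract (2DGS geometry plus PBR material channels), not the paper's actual parameterization.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PBRGaussian2D:
    """One 2D Gaussian splat with PBR material channels (illustrative fields)."""
    position: np.ndarray   # (3,) center in world space
    rotation: np.ndarray   # (4,) quaternion orienting the planar disk
    scale: np.ndarray      # (2,) in-plane radii of the 2D Gaussian
    opacity: float
    albedo: np.ndarray     # (3,) base color
    roughness: float
    metallic: float

splat = PBRGaussian2D(position=np.zeros(3), rotation=np.array([1.0, 0.0, 0.0, 0.0]),
                      scale=np.array([0.05, 0.05]), opacity=0.9,
                      albedo=np.array([0.8, 0.2, 0.2]), roughness=0.4, metallic=0.0)
```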


Paper & Project Links

PDF 14 pages, 7 figures, project page: https://zx-yin.github.io/dreamlifting/

Summary
This paper proposes the Lightweight Gaussian Asset Adapter (LGAA), a new framework for end-to-end generation of 3D assets with physically based rendering (PBR) materials. The framework unifies geometry and PBR material modeling by exploiting multi-view (MV) diffusion priors. By reusing and adapting network layers from MV diffusion models, LGAA converges well in a data-efficient manner. It consists of three components, the LGAA Wrapper, the LGAA Switcher, and the LGAA Decoder, and predicts 2D Gaussian Splatting (2DGS) with PBR channels. A dedicated post-processing procedure then extracts high-quality, relightable mesh assets from the results. The framework performs strongly with both text- and image-conditioned MV diffusion models, and its modular design allows multiple diffusion priors to be incorporated flexibly.

Key Takeaways

  1. A new framework, LGAA, enables end-to-end generation of 3D assets with physically based rendering (PBR) materials.
  2. Multi-view (MV) diffusion priors are exploited to unify geometry and PBR material modeling.
  3. Reusing and adapting network layers from MV diffusion models yields good convergence in a data-efficient manner.
  4. LGAA consists of three components, the LGAA Wrapper, the LGAA Switcher, and the LGAA Decoder, and predicts 2D Gaussian Splatting (2DGS) with PBR channels.
  5. A dedicated post-processing procedure extracts high-quality, relightable mesh assets from the results.
  6. The framework performs strongly with MV diffusion models.

Cool Papers

Click here to view paper screenshots

Knowledge Distillation Driven Semantic NOMA for Image Transmission with Diffusion Model

Authors:Qifei Wang, Zhen Gao, Zhijin Qin, Xiaodong Xu, Meixia Tao

As a promising 6G enabler beyond conventional bit-level transmission, semantic communication can considerably reduce required bandwidth resources, while its combination with multiple access requires further exploration. This paper proposes a knowledge distillation-driven and diffusion-enhanced (KDD) semantic non-orthogonal multiple access (NOMA), named KDD-SemNOMA, for multi-user uplink wireless image transmission. Specifically, to ensure robust feature transmission across diverse transmission conditions, we firstly develop a ConvNeXt-based deep joint source and channel coding architecture with enhanced adaptive feature module. This module incorporates signal-to-noise ratio and channel state information to dynamically adapt to additive white Gaussian noise and Rayleigh fading channels. Furthermore, to improve image restoration quality without inference overhead, we introduce a two-stage knowledge distillation strategy, i.e., a teacher model, trained on interference-free orthogonal transmission, guides a student model via feature affinity distillation and cross-head prediction distillation. Moreover, a diffusion model-based refinement stage leverages generative priors to transform initial SemNOMA outputs into high-fidelity images with enhanced perceptual quality. Extensive experiments on CIFAR-10 and FFHQ-256 datasets demonstrate superior performance over state-of-the-art methods, delivering satisfactory reconstruction performance even at extremely poor channel conditions. These results highlight the advantages in both pixel-level accuracy and perceptual metrics, effectively mitigating interference and enabling high-quality image recovery.
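The two distillation terms named in the abstract, feature affinity distillation and cross-head prediction distillation, can be illustrated with generic losses of the kind commonly used for knowledge distillation. The PyTorch sketch below matches pairwise feature affinities and softened predictions between the interference-free teacher and the NOMA student; the paper's exact formulations may differ.

```python
import torch
import torch.nn.functional as F

def kd_losses(student_feat, teacher_feat, student_logits, teacher_logits, tau=2.0):
    """Generic two-part distillation loss.

    *_feat   : (B, N, C) token/patch features from student and teacher
    *_logits : (B, K)    outputs of the corresponding prediction heads
    """
    # Feature-affinity distillation: match pairwise cosine-similarity structure.
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    affinity_loss = F.mse_loss(s @ s.transpose(1, 2), t @ t.transpose(1, 2))

    # Prediction distillation: KL between softened teacher and student outputs.
    pred_loss = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                         F.softmax(teacher_logits / tau, dim=-1),
                         reduction="batchmean") * tau * tau
    return affinity_loss, pred_loss


# Hypothetical shapes: batch 8, 49 feature tokens, 256 channels, 10-way head.
fa, pd = kd_losses(torch.randn(8, 49, 256), torch.randn(8, 49, 256),
                   torch.randn(8, 10), torch.randn(8, 10))
```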


Paper & Project Links

PDF 13 pages, submitted to IEEE for possible publication

Summary
Semantic communication is a promising 6G technology that can substantially reduce bandwidth requirements. This paper proposes KDD-SemNOMA, a knowledge distillation-driven and diffusion-enhanced semantic non-orthogonal multiple access (NOMA) scheme for multi-user uplink wireless image transmission. The work designs a joint source-channel coding architecture, introduces a two-stage knowledge distillation strategy to improve image restoration quality, and uses a diffusion model for final refinement. Experiments demonstrate excellent performance under harsh channel conditions.

Key Takeaways

  1. Semantic communication, a promising 6G technology, can substantially reduce bandwidth requirements.
  2. KDD-SemNOMA combines knowledge distillation and a diffusion model for multi-user uplink wireless image transmission.
  3. A deep joint source-channel coding architecture incorporates signal-to-noise ratio and channel state information to adapt to different transmission conditions.
  4. A two-stage knowledge distillation strategy improves image restoration quality without any inference overhead.
  5. A diffusion model refines the outputs into high-fidelity images.
  6. Experiments show superior performance under harsh channel conditions, in both pixel-level accuracy and perceptual metrics.

Cool Papers

Click here to view paper screenshots

Missing Fine Details in Images: Last Seen in High Frequencies

Authors:Tejaswini Medi, Hsien-Yi Wang, Arianna Rampini, Margret Keuper

Latent generative models have shown remarkable progress in high-fidelity image synthesis, typically using a two-stage training process that involves compressing images into latent embeddings via learned tokenizers in the first stage. The quality of generation strongly depends on how expressive and well-optimized these latent embeddings are. While various methods have been proposed to learn effective latent representations, generated images often lack realism, particularly in textured regions with sharp transitions, due to loss of fine details governed by high frequencies. We conduct a detailed frequency decomposition of existing state-of-the-art (SOTA) latent tokenizers and show that conventional objectives inherently prioritize low-frequency reconstruction, often at the expense of high-frequency fidelity. Our analysis reveals these latent tokenizers exhibit a bias toward low-frequency information during optimization, leading to over-smoothed outputs and visual artifacts that diminish perceptual quality. To address this, we propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. This decoupling enables improved reconstruction of fine textures while preserving global structure. Moreover, we integrate our frequency-preserving latent embeddings into a SOTA latent diffusion model, resulting in sharper and more realistic image generation. Our approach bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware optimization for realistic image synthesis, with broader implications for applications in content creation, neural rendering, and medical imaging.
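The core idea, decoupling low- and high-frequency reconstruction, can be sketched with a simple frequency-split loss. The toy PyTorch function below separates the error spectrum with an FFT radial mask and up-weights the high-frequency part; it is only a stand-in, since the paper uses a wavelet decomposition inside a full VAE training objective.

```python
import torch

def frequency_split_loss(recon, target, cutoff=0.25, high_weight=2.0):
    """Penalize low- and high-frequency reconstruction errors separately.

    recon, target : (B, C, H, W) image batches
    cutoff        : normalized radius separating "low" from "high" frequencies
    high_weight   : extra weight that keeps fine details from being drowned out
    """
    B, C, H, W = target.shape
    fy = torch.fft.fftfreq(H).view(H, 1)
    fx = torch.fft.fftfreq(W).view(1, W)
    low_mask = ((fy ** 2 + fx ** 2).sqrt() <= cutoff).to(target.dtype)  # (H, W)

    err = torch.fft.fft2(recon) - torch.fft.fft2(target)
    low_err = (err * low_mask).abs().pow(2).mean()
    high_err = (err * (1.0 - low_mask)).abs().pow(2).mean()
    return low_err + high_weight * high_err


loss = frequency_split_loss(torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32))
```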


Paper & Project Links

PDF

Summary

This paper examines recent progress of latent generative models in high-fidelity image synthesis. Existing methods rely on a two-stage training process in which images are first compressed into latent embeddings by learned tokenizers, so generation quality depends strongly on how expressive and well-optimized these embeddings are. However, current approaches tend to lose high-frequency information, leaving textured regions with sharp transitions looking unrealistic. A frequency decomposition of state-of-the-art latent tokenizers shows that conventional objectives prioritize low-frequency reconstruction at the expense of high-frequency fidelity. To address this, the authors propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. Integrating these frequency-preserving latent embeddings into a state-of-the-art latent diffusion model yields sharper and more realistic image generation. The work bridges the fidelity gap in current latent tokenizers and highlights the importance of frequency-aware optimization for realistic image synthesis, with broader implications for content creation, neural rendering, and medical imaging.

Key Takeaways

  1. Latent generative models have made notable progress in high-fidelity image synthesis, but generated images still lack realism in textured regions.
  2. Existing state-of-the-art latent tokenizers are biased during optimization toward low-frequency reconstruction, sacrificing high-frequency fidelity.
  3. This frequency bias in conventional objectives leads to over-smoothed outputs and visual artifacts that reduce perceptual quality.
  4. The proposed wavelet-based, frequency-aware variational autoencoder (FA-VAE) decouples the optimization of low- and high-frequency components, improving texture reconstruction while preserving global structure.
  5. Integrating the frequency-preserving latent embeddings into a latent diffusion model yields sharper and more realistic image generation.
  6. The work emphasizes the importance of frequency-aware optimization for realistic image synthesis.

Cool Papers

Click here to view paper screenshots

SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion

Authors:Xiaoyang Zhang, Jinjiang Li, Guodong Fan, Yakun Ju, Linwei Fan, Jun Liu, Alex C. Kot

Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model’s coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.


Paper & Project Links

PDF Submitted to Information Fusion

Summary

Infrared and visible image fusion (IVIF) combines the thermal radiation information of infrared images with the rich texture details of visible images to enhance perception for downstream visual tasks. Existing methods, lacking deep semantic understanding of the scene, often fail to preserve key targets, and the fusion process itself can introduce artifacts and detail loss that degrade both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), for high-fidelity, semantically aware image fusion. SGDFuse uses high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process. It first performs a preliminary fusion of multi-modal features, then uses the SAM masks together with the preliminary fused image as conditions to drive the diffusion model's coarse-to-fine denoising generation, ensuring both explicit semantic directionality and high fidelity of the final result. Experiments show that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations and adapts well to downstream tasks. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.

Key Takeaways

  1. Infrared and visible image fusion combines the strengths of both modalities to improve perception for visual tasks.
  2. Existing methods lack deep semantic understanding, failing to preserve key targets and introducing artifacts.
  3. SGDFuse uses a conditional diffusion model guided by semantic masks from SAM to achieve high-fidelity, semantically aware image fusion.
  4. The SAM masks guide the optimization of the fusion process, ensuring explicit semantic directionality and high fidelity of the results.
  5. SGDFuse operates in two stages: preliminary feature fusion followed by mask-conditioned diffusion generation.
  6. Experiments show that SGDFuse excels in both subjective and objective evaluations and adapts well to downstream tasks.

Cool Papers

Click here to view paper screenshots

DIP: Unsupervised Dense In-Context Post-training of Visual Representations

Authors:Sophia Sirko-Galouchenko, Spyros Gidaris, Antonin Vobecky, Andrei Bursuc, Nicolas Thome

We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: https://github.com/sirkosophia/DIP
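For intuition, the automatic in-context task generation can be approximated by clustering dense patch features into pseudo-labels and splitting the images into support and query sets, as in the rough sketch below. The actual mechanism also involves a pretrained diffusion model, which is omitted here; the function name, shapes, and the use of k-means are assumptions.

```python
import torch
from sklearn.cluster import KMeans

def make_pseudo_incontext_task(feats, n_classes=5):
    """Turn unlabeled dense features into a pseudo in-context task.

    feats : (N_imgs, N_patches, C) dense features from the vision encoder
    """
    N, P, C = feats.shape
    flat = feats.reshape(N * P, C).cpu().numpy()
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(flat)
    labels = torch.from_numpy(labels).view(N, P)  # pseudo dense labels

    # First half of the images act as the labeled support set, the rest as queries.
    half = N // 2
    support = (feats[:half], labels[:half])
    query = (feats[half:], labels[half:])
    return support, query


support, query = make_pseudo_incontext_task(torch.randn(8, 196, 64))
```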


Paper & Project Links

PDF Accepted to ICCV 2025

Summary
This paper introduces DIP, a novel unsupervised post-training method that enhances the dense image representations of large-scale pretrained vision encoders for in-context scene understanding. Inspired by meta-learning, the method trains the vision encoder with pseudo-tasks that explicitly simulate downstream in-context scenarios. An automatic in-context task generation mechanism, combining a pretrained diffusion model with the vision encoder itself, enables unsupervised post-training on unlabeled data. DIP is simple and efficient, completing in under 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it performs strongly on a wide variety of downstream real-world in-context scene understanding tasks, outperforming both the initial vision encoder and prior methods.

Key Takeaways

  1. DIP is a novel unsupervised post-training method that improves dense representations of large-scale pretrained vision encoders for in-context scene understanding.
  2. Unlike prior approaches that rely on complex self-distillation architectures, it trains with pseudo-tasks that simulate downstream in-context scenarios.
  3. Inspired by meta-learning, DIP generates in-context tasks automatically by combining a pretrained diffusion model with the vision encoder itself.
  4. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU.
  5. By learning dense representations through pseudo in-context tasks, DIP performs strongly across a variety of downstream real-world scene understanding tasks.
  6. DIP outperforms both the initial vision encoder and prior methods.

Cool Papers

Click here to view paper screenshots

Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

Authors:Hyeonho Jeong, Suhyeon Lee, Jong Chul Ye

We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors. In essence, Reangle-A-Video operates in two stages. (1) Multi-View Motion Learning: An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos. (2) Multi-View Consistent Image-to-Images Translation: The first frame of the input video is warped and inpainted into various camera perspectives under an inference-time cross-view consistency guidance using DUSt3R, generating multi-view consistent starting images. Extensive experiments on static view transport and dynamic camera control show that Reangle-A-Video surpasses existing methods, establishing a new solution for multi-view video generation. We will publicly release our code and data. Project page: https://hyeonho99.github.io/reangle-a-video/


Paper & Project Links

PDF ICCV 2025, Project page: https://hyeonho99.github.io/reangle-a-video/

Summary

This paper introduces Reangle-A-Video, a framework that generates synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, Reangle-A-Video reframes multi-view video generation as video-to-videos translation, leveraging publicly available image and video diffusion priors. It operates in two stages: (1) multi-view motion learning, where an image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos; and (2) multi-view consistent image-to-images translation, where the first frame of the input video is warped and inpainted into various camera perspectives under inference-time cross-view consistency guidance, producing multi-view consistent starting images. Experiments show that Reangle-A-Video surpasses existing methods, offering a new solution for multi-view video generation.

Key Takeaways

  1. Reangle-A-Video is a unified framework that generates synchronized multi-view videos from a single input video.
  2. Unlike mainstream approaches, it does not train multi-view video diffusion models on large-scale 4D datasets.
  3. The method reframes multi-view video generation as a video-to-videos translation task.
  4. It leverages publicly available image and video diffusion priors.
  5. Reangle-A-Video operates in two stages: multi-view motion learning and multi-view consistent image-to-images translation.
  6. An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion.

Cool Papers

Click here to view paper screenshots

Closed-Loop Unsupervised Representation Disentanglement with β-VAE Distillation and Diffusion Probabilistic Feedback

Authors:Xin Jin, Bohan Li, BAAO Xie, Wenyao Zhang, Jinming Liu, Ziqiang Li, Tao Yang, Wenjun Zeng

Representation disentanglement may help AI fundamentally understand the real world and thus benefit both discrimination and generation tasks. It currently has at least three unresolved core issues: (i) heavy reliance on label annotation and synthetic data – causing poor generalization on natural scenarios; (ii) heuristic/hand-craft disentangling constraints make it hard to adaptively achieve an optimal training trade-off; (iii) lacking reasonable evaluation metric, especially for the real label-free data. To address these challenges, we propose a Closed-Loop unsupervised representation Disentanglement approach dubbed CL-Dis. Specifically, we use diffusion-based autoencoder (Diff-AE) as a backbone while resorting to β-VAE as a co-pilot to extract semantically disentangled representations. The strong generation ability of diffusion model and the good disentanglement ability of VAE model are complementary. To strengthen disentangling, VAE-latent distillation and diffusion-wise feedback are interconnected in a closed-loop system for a further mutual promotion. Then, a self-supervised Navigation strategy is introduced to identify interpretable semantic directions in the disentangled latent space. Finally, a new metric based on content tracking is designed to evaluate the disentanglement effect. Experiments demonstrate the superiority of CL-Dis on applications like real image manipulation and visual analysis.
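For reference, the β-VAE co-pilot mentioned in the abstract optimizes the standard β-weighted ELBO, sketched below in PyTorch. Setting β > 1 trades some reconstruction quality for more factorized, i.e. disentangled, latent dimensions; the reconstruction term and β value are placeholders.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Standard beta-VAE objective: reconstruction + beta-weighted KL(q(z|x) || N(0, I))."""
    recon = F.mse_loss(x_recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl


# Hypothetical shapes: batch of 8 images, 32-dim latent.
loss = beta_vae_loss(torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64),
                     torch.randn(8, 32), torch.randn(8, 32))
```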


Paper & Project Links

PDF ECCV 2024

Summary

This paper discusses the importance of representation disentanglement for AI understanding of the real world and identifies three unresolved core issues. To address them, a closed-loop unsupervised representation disentanglement method named CL-Dis is proposed, which uses a diffusion-based autoencoder (Diff-AE) as the backbone and a β-VAE as a co-pilot to extract semantically disentangled representations. Combining the strong generative ability of diffusion models with the good disentangling ability of VAEs strengthens disentanglement, and VAE-latent distillation and diffusion-wise feedback are connected in a closed-loop system for further mutual promotion. Finally, a new content-tracking-based metric is introduced to evaluate disentanglement, and experiments on applications such as real image manipulation and visual analysis demonstrate the advantages of CL-Dis.

Key Takeaways

  1. Representation disentanglement helps AI understand the real world more deeply, benefiting both discrimination and generation tasks.
  2. Current representation disentanglement faces three core issues: reliance on label annotation and synthetic data, heuristic hand-crafted disentangling constraints, and the lack of reasonable evaluation metrics.
  3. CL-Dis, a closed-loop unsupervised representation disentanglement method, combines the strengths of a diffusion autoencoder and a β-VAE.
  4. VAE-latent distillation and diffusion-wise feedback form a closed-loop system that strengthens disentanglement.
  5. A self-supervised Navigation strategy identifies interpretable semantic directions in the disentangled latent space.
  6. A new content-tracking-based metric is designed to evaluate the disentanglement effect.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when republishing!