⚠️ 以下所有内容总结均由大语言模型生成,如有错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-11-20 更新
NERD: Network-Regularized Diffusion Sampling For 3D Computed Tomography
Authors:Shijun Liang, Ismail Alkhouri, Qing Qu, Rongrong Wang, Saiprasad Ravishankar
Numerous diffusion model (DM)-based methods have been proposed for solving inverse imaging problems. Among these, a recent line of work has demonstrated strong performance by formulating sampling as an optimization procedure that enforces measurement consistency, forward diffusion consistency, and both step-wise and backward diffusion consistency. However, these methods have only considered 2D reconstruction tasks and do not directly extend to 3D image reconstruction problems, such as in Computed Tomography (CT). To bridge this gap, we propose NEtwork-Regularized diffusion sampling for 3D CT (NERD) by incorporating an L1 regularization into the optimization objective. This regularizer encourages spatial continuity across adjacent slices, reducing inter-slice artifacts and promoting coherent volumetric reconstructions. Additionally, we introduce two efficient optimization strategies to solve the resulting objective: one based on the Alternating Direction Method of Multipliers (ADMM) and another based on the Primal-Dual Hybrid Gradient (PDHG) method. Experiments on medical 3D CT data demonstrate that our approach achieves either state-of-the-art or highly competitive results.
针对成像逆问题,已经提出了许多基于扩散模型(DM)的方法。其中,近期的一类工作将采样表述为一个优化过程,强制满足测量一致性、前向扩散一致性以及逐步与反向扩散一致性,展现了强大的性能。然而,这些方法仅考虑了2D重建任务,并不能直接扩展到3D图像重建问题,例如计算机断层扫描(CT)。为了弥补这一差距,我们将L1正则化项融入优化目标,提出了用于3D CT的网络正则化扩散采样(NERD)。这一正则化项鼓励相邻切片之间的空间连续性,减少切片间伪影,促进连贯的体积重建。此外,我们还引入了两种高效的优化策略来求解所得目标:一种基于交替方向乘子法(ADMM),另一种基于原始-对偶混合梯度(PDHG)方法。在医学3D CT数据上的实验表明,我们的方法达到了最先进或极具竞争力的结果。
论文及项目相关链接
Summary:针对三维计算机断层扫描(CT)图像重建问题,提出一种基于网络正则化的扩散采样方法(NERD)。该方法在优化目标中引入L1正则化项,增强相邻切片间的空间连续性,减少切片间伪影,实现更连贯的三维重建。同时,采用交替方向乘子法(ADMM)和原始-对偶混合梯度法(PDHG)两种优化策略求解目标函数。实验证明,该方法在医学三维CT数据上取得了最先进或极具竞争力的结果。
Key Takeaways:
- 已有多种基于扩散模型(DM)的方法用于求解成像逆问题,其中一类方法将采样表述为优化过程,强制满足测量一致性、前向扩散一致性以及逐步与反向扩散一致性,表现出良好性能。
- 现有方法主要集中在二维重建任务上,无法直接应用于三维图像重建问题,如计算机断层扫描(CT)。
- 提出一种名为NERD的新方法,通过引入L1正则化项来弥补这一差距,优化目标函数以鼓励空间连续性,减少切片间伪影。
- NERD采用两种优化策略:基于交替方向乘子法(ADMM)和原始-对偶混合梯度法(PDHG)。
- NERD方法在医学三维CT数据上的实验结果表现出先进或高度竞争的性能。
- L1正则化项有助于实现更连贯的三维重建,增强图像的空间连续性。
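下面给出一个极简的示意代码(非论文实现),说明摘要中"对相邻切片差分施加L1正则"在ADMM式变量分裂中的常见处理方式:对切片方向的一阶差分执行软阈值,即L1范数的近端算子。其中函数名以及 `lam`、`rho` 等参数均为示意性假设。

```python
import numpy as np

def soft_threshold(x, tau):
    """L1范数的近端算子(软阈值),对应ADMM中z子问题的闭式解。"""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def interslice_l1_prox(volume, lam, rho):
    """对三维体数据沿切片方向的一阶差分做软阈值,鼓励相邻切片的空间连续性。

    volume: 形状 (S, H, W) 的当前重建体数据(S 为切片数)。
    lam, rho: L1 正则权重与 ADMM 罚参数(示意取值)。
    """
    d = np.diff(volume, axis=0)        # 相邻切片差分,形状 (S-1, H, W)
    z = soft_threshold(d, lam / rho)   # L1 近端步:抑制切片间的小幅跳变
    return z

# 用法示意:在每次迭代的测量/扩散一致性更新之后调用
vol = np.random.rand(16, 64, 64)
z = interslice_l1_prox(vol, lam=0.1, rho=1.0)
print(z.shape)  # (15, 64, 64)
```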
点此查看论文截图
Segmentation-Aware Latent Diffusion for Satellite Image Super-Resolution: Enabling Smallholder Farm Boundary Delineation
Authors:Aditi Agarwal, Anjali Jain, Nikita Saxena, Ishan Deshpande, Michal Kazmierski, Abigail Annkah, Nadav Sherman, Karthikeyan Shanmugam, Alok Talekar, Vaibhav Rajan
Delineating farm boundaries through segmentation of satellite images is a fundamental step in many agricultural applications. The task is particularly challenging for smallholder farms, where accurate delineation requires the use of high resolution (HR) imagery which are available only at low revisit frequencies (e.g., annually). To support more frequent (sub-) seasonal monitoring, HR images could be combined as references (ref) with low resolution (LR) images – having higher revisit frequency (e.g., weekly) – using reference-based super-resolution (Ref-SR) methods. However, current Ref-SR methods optimize perceptual quality and smooth over crucial features needed for downstream tasks, and are unable to meet the large scale-factor requirements for this task. Further, previous two-step approaches of SR followed by segmentation do not effectively utilize diverse satellite sources as inputs. We address these problems through a new approach, $\textbf{SEED-SR}$, which uses a combination of conditional latent diffusion models and large-scale multi-spectral, multi-source geo-spatial foundation models. Our key innovation is to bypass the explicit SR task in the pixel space and instead perform SR in a segmentation-aware latent space. This unique approach enables us to generate segmentation maps at an unprecedented 20$\times$ scale factor, and rigorous experiments on two large, real datasets demonstrate up to $\textbf{25.5}$ and $\textbf{12.9}$ relative improvement in instance and semantic segmentation metrics respectively over approaches based on state-of-the-art Ref-SR methods.
通过卫星图像分割来划定农田边界是许多农业应用中的基本步骤。对于小农户农场,这项任务尤其具有挑战性,因为准确划定边界需要使用高分辨率(HR)图像,而此类图像的重访频率很低(例如每年一次)。为了支持更频繁的(亚)季节性监测,可以借助基于参考的超分辨率(Ref-SR)方法,将高分辨率图像作为参考(ref),与重访频率更高(例如每周一次)的低分辨率(LR)图像相结合。然而,当前的Ref-SR方法以感知质量为优化目标,会把下游任务所需的关键特征平滑掉,且无法满足该任务所需的大尺度因子。此外,先执行超分辨率再进行分割的两步方法也未能有效利用多样的卫星数据源作为输入。我们通过一种新方法SEED-SR来解决这些问题,该方法结合了条件潜在扩散模型以及大规模多光谱、多源地理空间基础模型。我们的关键创新之处在于绕过像素空间中的显式超分辨率任务,转而在分割感知的潜在空间中执行超分辨率。这种独特的方法使我们能够以前所未有的20倍尺度因子生成分割图,并且在两个大型真实数据集上的严格实验表明,相对于基于最新Ref-SR方法的方案,实例分割与语义分割指标分别取得了高达25.5与12.9的相对提升。
论文及项目相关链接
摘要
基于卫星图像分割来划定农场边界是许多农业应用中的基础步骤,对小农户农场尤其具有挑战性。为支持更频繁的(亚)季节性监测,可以使用基于参考的超分辨率(Ref-SR)方法,将高分辨率图像作为参考,与重访频率更高的低分辨率图像相结合。然而,现有Ref-SR方法在优化感知质量时会平滑掉下游任务所需的关键特征,且无法满足大尺度因子要求。我们提出了一种新方法SEED-SR,结合条件潜在扩散模型和多光谱、多源地理空间基础模型,通过绕过显式超分辨率任务、直接在分割感知的潜在空间进行超分辨率来解决这些问题。该方法能在前所未有的20倍尺度因子下生成分割图,在两个大型真实数据集上的实验表明,相较于基于最新Ref-SR的方法,其在实例和语义分割指标上分别有高达25.5和12.9的相对提升。
关键见解
- 卫星图像中的农场边界分割在农业应用中至关重要,特别是在小型农场中。
- 结合高分辨率和低分辨率图像以支持更频繁的监测是一个有效的策略。
- 当前Ref-SR方法在优化感知质量时忽视了下游任务所需的关键特征,难以满足大规模应用的需求。
- SEED-SR方法通过条件潜在扩散模型和多源地理空间模型结合,在分割感知潜在空间进行超分辨率处理,是一种创新解决方案。
- SEED-SR方法能够在高尺度因子(20倍)下生成分割图,为这一领域带来了前所未有的进步。
- 在大型真实数据集上的实验表明,SEED-SR方法在实例和语义分割指标上相对于现有方法有显著改进。
点此查看论文截图
Iterative Diffusion-Refined Neural Attenuation Fields for Multi-Source Stationary CT Reconstruction: NAF Meets Diffusion Model
Authors:Jiancheng Fang, Shaoyu Wang, Junlin Wang, Weiwen Wu, Yikun Zhang, Qiegen Liu
Multi-source stationary computed tomography (CT) has recently attracted attention for its ability to achieve rapid image reconstruction, making it suitable for time-sensitive clinical and industrial applications. However, practical systems are often constrained by ultra-sparse-view sampling, which significantly degrades reconstruction quality. Traditional methods struggle under ultra-sparse-view settings, where interpolation becomes inaccurate and the resulting reconstructions are unsatisfactory. To address this challenge, this study proposes Diffusion-Refined Neural Attenuation Fields (Diff-NAF), an iterative framework tailored for multi-source stationary CT under ultra-sparse-view conditions. Diff-NAF combines a Neural Attenuation Field representation with a dual-branch conditional diffusion model. The process begins by training an initial NAF using ultra-sparse-view projections. New projections are then generated through an Angle-Prior Guided Projection Synthesis strategy that exploits inter view priors, and are subsequently refined by a Diffusion-driven Reuse Projection Refinement Module. The refined projections are incorporated as pseudo-labels into the training set for the next iteration. Through iterative refinement, Diff-NAF progressively enhances projection completeness and reconstruction fidelity under ultra-sparse-view conditions, ultimately yielding high-quality CT reconstructions. Experimental results on multiple simulated 3D CT volumes and real projection data demonstrate that Diff-NAF achieves the best performance under ultra-sparse-view conditions.
多源静态计算机断层扫描(CT)因其能够实现快速图像重建而近期备受关注,适用于时间敏感的临床和工业应用。然而,实际系统通常受到超稀疏视角采样的限制,这会显著降低重建质量。传统方法在超稀疏视角设置下表现不佳:插值变得不准确,重建结果难以令人满意。为了应对这一挑战,本研究提出了Diffusion-Refined Neural Attenuation Fields(Diff-NAF),一个针对超稀疏视角条件下多源静态CT量身定制的迭代框架。Diff-NAF结合了神经衰减场表示与双分支条件扩散模型。流程始于使用超稀疏视角投影训练初始NAF;然后通过利用视角间先验的角度先验引导投影合成(Angle-Prior Guided Projection Synthesis)策略生成新的投影,并由扩散驱动的重用投影细化模块(Diffusion-driven Reuse Projection Refinement Module)进行细化;细化后的投影作为伪标签纳入训练集,用于下一次迭代。通过迭代细化,Diff-NAF在超稀疏视角条件下逐步提高投影的完整性和重建的保真度,最终产生高质量的CT重建结果。在多个模拟3D CT体数据和真实投影数据上的实验结果表明,Diff-NAF在超稀疏视角条件下取得了最佳性能。
论文及项目相关链接
Summary
多源静态计算机断层扫描(CT)因能快速实现图像重建,在时间敏感的临床和工业应用中受到关注。然而,实际系统中常受到超稀疏视角采样的限制,导致重建质量下降。为解决此挑战,本研究提出Diffusion-Refined Neural Attenuation Fields(Diff-NAF)方法,适用于超稀疏视角条件下的多源静态CT。Diff-NAF结合神经衰减场表示和双分支条件扩散模型,通过迭代优化逐步提高投影的完整性和重建的保真度,实现高质量的CT重建。
Key Takeaways
- 多源静态CT具有快速图像重建能力,适用于时间敏感的临床和工业应用。
- 实际系统中常面临超稀疏视图采样问题,导致重建质量下降。
- Diff-NAF方法结合神经衰减场表示和双分支条件扩散模型,适用于超稀疏视图条件下的多源静态CT。
- Diff-NAF通过迭代优化,逐步提高投影的完整性和重建的保真度。
- Angle-Prior Guided Projection Synthesis策略用于生成新投影,利用不同视图间的先验信息。
- Diffusion-driven Reuse Projection Refinement Module用于细化投影,并作为伪标签融入训练集进行下一次迭代。
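按照摘要描述的流程,下面给出一个高度简化的迭代骨架(非官方实现),用以说明"训练NAF→角度先验引导合成投影→扩散细化→伪标签回填"的循环结构;`train_naf`、`new_angles_per_round`、`synthesize_projections`、`diffusion_refine` 均为假设的占位回调。

```python
def diff_naf_style_refinement(sparse_projections, sparse_angles, num_rounds,
                              train_naf, new_angles_per_round,
                              synthesize_projections, diffusion_refine):
    """Diff-NAF式迭代细化骨架;所有函数参数均为假设的占位回调。"""
    projections = list(sparse_projections)   # 真实投影 + 后续加入的伪标签
    angles = list(sparse_angles)
    naf = None
    for _ in range(num_rounds):
        # 1) 用当前投影集合训练 / 继续训练 NAF
        naf = train_naf(projections, angles, init=naf)
        # 2) 角度先验引导:在已有视角之间选取新角度并由 NAF 合成投影
        new_angles = new_angles_per_round(angles)
        synthesized = synthesize_projections(naf, new_angles)
        # 3) 扩散驱动的投影细化:用条件扩散模型修复合成投影中的伪影
        refined = [diffusion_refine(p, a) for p, a in zip(synthesized, new_angles)]
        # 4) 细化结果作为伪标签并入训练集,进入下一轮
        projections.extend(refined)
        angles.extend(new_angles)
    return naf
```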
点此查看论文截图
InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
Authors:Weimin Bai, Suzhe Xu, Yiwei Ren, Jinhua Hao, Ming Sun, Wenzheng Chen, He Sun
Video inverse problems are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers - leading to temporal artifacts - or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher’s strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the video-diffusion backbone VAE with a high-efficiency LeanVAE via an innovative teacher-space regularized distillation scheme, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100 times speedups over iterative video diffusion solvers. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.
视频逆问题是流媒体、远程呈现以及AR/VR的基础,这些场景要求高感知质量与严格的延迟约束共存。基于扩散的先验目前能提供最先进的重建效果,但现有方法要么给图像扩散模型临时加上时序正则项(导致时序伪影),要么依赖原生视频扩散模型,而其迭代后验采样对实时应用而言过于缓慢。我们提出了InstantViR,一个由预训练视频扩散先验驱动的超快速视频重建摊销推理框架。我们将强大的双向视频扩散模型(教师)蒸馏为因果自回归的学生模型,后者能够在单次前向传播中把退化视频直接映射到其恢复版本,在继承教师强大时序建模能力的同时,完全去除了迭代的测试时优化。这一蒸馏是先验驱动的:它只需要教师扩散模型和已知的退化算子,不依赖外部成对的干净/退化视频数据。为进一步提升吞吐量,我们通过一种创新的教师空间正则化蒸馏方案,用高效的LeanVAE替换视频扩散主干中的VAE,实现低延迟的潜在空间处理。在流式随机修复、高斯去模糊和超分辨率任务上,InstantViR的重建质量与基于扩散的基线相当或更优,同时在NVIDIA A100 GPU上以超过35 FPS运行,相对迭代式视频扩散求解器最高可获得100倍加速。这些结果表明,基于扩散的视频重建能够兼容实时、交互、可编辑的流式场景,使高质量视频恢复成为现代视觉系统中的实用组件。
论文及项目相关链接
摘要
视频逆问题在流媒体、远程呈现以及AR/VR等领域中至关重要,其中高感知质量必须与严格的延迟限制共存。虽然扩散先验目前提供了最先进的重建效果,但现有方法要么给图像扩散模型临时添加时序正则项而产生时序伪影,要么依赖原生视频扩散模型,其迭代后验采样对于实时使用来说过于缓慢。我们引入了InstantViR,这是一个由预训练视频扩散先验驱动的超快速视频重建摊销推理框架。我们将强大的双向视频扩散模型(教师)蒸馏为因果自回归学生,单次前向传播即可将退化视频直接映射到其恢复版本,继承了教师强大的时序建模能力,同时完全消除了迭代的测试时优化。蒸馏由先验驱动:它只需要教师扩散模型和已知的退化算子,不依赖外部成对的干净/退化视频数据。为进一步提高吞吐量,我们通过创新的教师空间正则化蒸馏方案,用高效率的LeanVAE替换视频扩散主干VAE,实现低延迟的潜在空间处理。在流式随机修复、高斯去模糊和超分辨率等任务中,InstantViR的重建质量与扩散基线相当或更优,同时在NVIDIA A100 GPU上以超过35 FPS运行,相对迭代视频扩散求解器实现了高达100倍的加速。这些结果表明,基于扩散的视频重建与实时、交互、可编辑的流媒体场景兼容,使高质量视频恢复成为现代视觉系统的实用组件。
关键见解
- 视频逆问题在多个领域的重要性及其面临的挑战。
- 现有方法的问题:给图像扩散模型临时添加时序正则项会产生时序伪影,而原生视频扩散模型的迭代采样难以满足实时要求。
- InstantViR框架的引入,结合预训练的视频扩散先验进行超快速视频重建。
- 通过蒸馏技术将强大的双向视频扩散模型(教师)转化为因果自回归模型(学生),实现单次前向传递恢复视频。
- 框架能够继承教师的优秀时间建模能力,同时消除迭代测试时间优化。
- 框架的先验驱动蒸馏过程仅需要教师扩散模型和已知的退化运算符,无需配对清洁/噪声视频数据。
- 通过高效LeanVAE的使用和创新教师空间正则化蒸馏方案,实现低延迟的潜在空间处理和高性能的视频重建。
点此查看论文截图
Coffee: Controllable Diffusion Fine-tuning
Authors:Ziyao Zeng, Jingcheng Ni, Ruyi Liu, Alex Wong
Text-to-image diffusion models can generate diverse content with flexible prompts, which makes them well-suited for customization through fine-tuning with a small amount of user-provided data. However, controllable fine-tuning that prevents models from learning undesired concepts present in the fine-tuning data, and from entangling those concepts with user prompts, remains an open challenge. It is crucial for downstream tasks like bias mitigation, preventing malicious adaptation, attribute disentanglement, and generalizable fine-tuning of diffusion policy. We propose Coffee that allows using language to specify undesired concepts to regularize the adaptation process. The crux of our method lies in keeping the embeddings of the user prompt from aligning with undesired concepts. Crucially, Coffee requires no additional training and enables flexible modification of undesired concepts by modifying textual descriptions. We evaluate Coffee by fine-tuning on images associated with user prompts paired with undesired concepts. Experimental results demonstrate that Coffee can prevent text-to-image models from learning specified undesired concepts during fine-tuning and outperforms existing methods. Code will be released upon acceptance.
文本到图像的扩散模型能够依据灵活的提示生成多样化内容,这使其非常适合利用少量用户提供的数据进行微调来实现定制。然而,可控微调仍然是一个开放性挑战:既要防止模型学习微调数据中存在的不良概念,又要防止这些概念与用户提示相互纠缠。这对于偏差缓解、防止恶意适配、属性解耦以及扩散策略(diffusion policy)的可泛化微调等下游任务至关重要。我们提出了Coffee,它允许使用语言来指定不良概念,以规范适配过程。我们方法的关键在于防止用户提示的嵌入与不良概念对齐。重要的是,Coffee不需要额外训练,并且可以通过修改文字描述来灵活调整不良概念。我们在与用户提示及不良概念配对的图像上进行微调来评估Coffee。实验结果表明,Coffee能够防止文本到图像模型在微调过程中学习指定的不良概念,并优于现有方法。代码将在论文被接收后发布。
论文及项目相关链接
Summary
文本到图像扩散模型能依据灵活的提示生成多样化内容,适合通过少量用户提供的数据进行微调以实现定制。然而,如何可控地微调模型,既避免学习微调数据中的不良概念,又防止这些概念与用户提示纠缠在一起,仍然是一个挑战。这对于偏差缓解、防止恶意适配、属性解耦和扩散策略(diffusion policy)的可泛化微调等下游任务至关重要。本文提出一种名为Coffee的方法,允许使用语言来指定不良概念以规范适配过程。Coffee的核心在于防止用户提示的嵌入与不良概念对齐。关键的是,Coffee无需额外训练,通过修改文字描述即可灵活修改不良概念。实验在与用户提示及不良概念配对的图像上进行微调,结果表明Coffee能防止文本到图像模型在微调过程中学习指定的不良概念,并优于现有方法。
Key Takeaways
- 文本到图像扩散模型具备生成多样化内容的能力,且能通过微调适应个性化需求。
- 可控的微调是一个关键挑战,需要防止模型学习不良概念和与用户提示混淆。
- 可控微调对偏差缓解、防止恶意适配、属性解耦等下游任务至关重要。
- 提出了名为Coffee的方法,允许使用语言来指定不良概念,从而规范模型的适应过程。
- Coffee通过防止用户提示与不良概念的嵌入对齐来实现其核心功能。
- Coffee具有灵活性,可以通过修改文本描述来轻松修改不良概念,无需额外训练。
- 实验结果表明,Coffee能有效防止文本到图像模型在微调过程中学习指定的不良概念,且其性能优于现有方法。
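下面用PyTorch给出一个示意性的对齐惩罚项(非论文原始实现),体现"阻止用户提示嵌入与不良概念嵌入对齐"的思路:惩罚二者的余弦相似度,并叠加在常规微调损失上。文本嵌入维度、权重 `lam` 等均为假设。

```python
import torch
import torch.nn.functional as F

def undesired_alignment_penalty(prompt_emb, undesired_embs, lam=0.5):
    """惩罚用户提示嵌入与不良概念嵌入之间的余弦相似度。

    prompt_emb: 形状 (d,) 的用户提示嵌入。
    undesired_embs: 形状 (k, d) 的不良概念文本嵌入(由文字描述编码得到)。
    """
    p = F.normalize(prompt_emb, dim=-1)
    u = F.normalize(undesired_embs, dim=-1)
    cos = u @ p                               # 每个不良概念与提示的余弦相似度
    return lam * torch.clamp(cos, min=0).mean()

# 用法示意:总损失 = 常规扩散微调损失 + 对齐惩罚
prompt_emb = torch.randn(768, requires_grad=True)
undesired = torch.randn(3, 768)               # 例如若干条不良概念描述的嵌入
penalty = undesired_alignment_penalty(prompt_emb, undesired)
penalty.backward()
```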
点此查看论文截图
FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
Authors:Jingren Liu, Shuning Xu, Qirui Yang, Yun Wang, Xiangyu Chen, Zhong Ji
All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.
一体化图像恢复(All-in-One Image Restoration, AIO-IR)的目标是开发一个能够在复杂条件下处理多种退化的统一模型。然而,现有方法通常依赖于特定任务的设计或潜在路由策略,难以适应包含各种退化的真实场景。我们提出了FAPE-IR,一个用于图像恢复的频率感知规划与执行框架。它使用一个冻结的多模态大型语言模型(MLLM)作为规划器,分析退化图像并生成简洁的频率感知恢复计划。这些计划引导扩散执行器中基于LoRA的专家混合(LoRA-MoE)模块动态选择高频或低频专家,并辅以输入图像的频率特征。为了进一步提高恢复质量并减少伪影,我们引入了对抗训练和频率正则化损失。通过将语义规划与基于频率的恢复相结合,FAPE-IR为一体化图像恢复提供了统一且可解释的解决方案。大量实验表明,FAPE-IR在七个恢复任务上达到了最先进的性能,并在混合退化情况下表现出强大的零样本泛化能力。
论文及项目相关链接
Summary
FAPE-IR提出一个频率感知的规划-执行框架,用于一体化图像修复。它以冻结的多模态大型语言模型(MLLM)作为规划器,生成简洁的频率感知修复计划,并引导基于LoRA的专家混合模块(LoRA-MoE)在扩散执行器中动态选择高频或低频专家,辅以输入图像的频率特征。通过语义规划与基于频率的修复相结合,FAPE-IR为一体化图像修复提供了统一且可解释的解决方案。
Key Takeaways
- AIO-IR旨在开发能在复杂条件下处理多种退化的统一模型。
- 现有方法常依赖特定任务设计或潜在路由策略,难以适应多种退化场景。
- FAPE-IR采用频率感知的规划执行框架进行图像修复。
- 利用冻结的MLLM作为规划器,生成简洁的频率感知修复计划。
- LoRA-MoE模块在扩散执行器中动态选择高频或低频专家。
- 对抗训练和频率正则化损失提高了修复质量和减少了伪影。
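下面的简化示意(非官方实现)展示"按频率把输入分给高/低频专家"的一种最小做法:用FFT低通滤波把图像分解为低频与高频分量,再分别交给两个专家分支处理;截止半径 `radius` 与专家结构均为假设,这里用恒等映射占位。

```python
import torch
import torch.nn as nn

def split_low_high_frequency(img, radius=0.1):
    """用FFT把图像分解为低频与高频分量,radius为归一化截止半径。"""
    H, W = img.shape[-2:]
    fy = torch.fft.fftfreq(H).view(-1, 1)
    fx = torch.fft.fftfreq(W).view(1, -1)
    lowpass = ((fy ** 2 + fx ** 2).sqrt() <= radius).to(img.dtype)
    low = torch.fft.ifft2(torch.fft.fft2(img) * lowpass).real
    return low, img - low

# 用法示意:低频分量交给"低频专家",高频分量交给"高频专家"(此处用恒等映射占位)
low_expert, high_expert = nn.Identity(), nn.Identity()
x = torch.randn(1, 3, 64, 64)
low, high = split_low_high_frequency(x)
restored = low_expert(low) + high_expert(high)
```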
点此查看论文截图
FashionMAC: Deformation-Free Fashion Image Generation with Fine-Grained Model Appearance Customization
Authors:Rong Zhang, Jinxiao Li, Jingnan Wang, Zhiwen Zuo, Jianfeng Dong, Wei Li, Chi Wang, Weiwei Xu, Xun Wang
Garment-centric fashion image generation aims to synthesize realistic and controllable human models dressing a given garment, which has attracted growing interest due to its practical applications in e-commerce. The key challenges of the task lie in two aspects: (1) faithfully preserving the garment details, and (2) gaining fine-grained controllability over the model’s appearance. Existing methods typically require performing garment deformation in the generation process, which often leads to garment texture distortions. Also, they fail to control the fine-grained attributes of the generated models, due to the lack of specifically designed mechanisms. To address these issues, we propose FashionMAC, a novel diffusion-based deformation-free framework that achieves high-quality and controllable fashion showcase image generation. The core idea of our framework is to eliminate the need for performing garment deformation and directly outpaint the garment segmented from a dressed person, which enables faithful preservation of the intricate garment details. Moreover, we propose a novel region-adaptive decoupled attention (RADA) mechanism along with a chained mask injection strategy to achieve fine-grained appearance controllability over the synthesized human models. Specifically, RADA adaptively predicts the generated regions for each fine-grained text attribute and enforces the text attribute to focus on the predicted regions by a chained mask injection strategy, significantly enhancing the visual fidelity and the controllability. Extensive experiments validate the superior performance of our framework compared to existing state-of-the-art methods.
以服装为中心的时尚图像生成旨在合成穿着给定服装、逼真且可控的人体模特,因其在电子商务中的实际应用而日益受到关注。该任务的关键挑战在于两个方面:(1)忠实保留服装细节;(2)实现对模特外观的细粒度控制。现有方法通常需要在生成过程中对服装进行形变,这往往会导致服装纹理失真;同时,由于缺乏专门设计的机制,它们无法控制所生成模特的细粒度属性。为了解决这些问题,我们提出了FashionMAC,一个基于扩散、无需形变的框架,可实现高质量且可控的时尚展示图像生成。我们框架的核心思想是免去服装形变这一步骤,直接对从着装人物中分割出的服装进行外扩补全(outpainting),从而忠实保留复杂的服装细节。此外,我们提出了一种新颖的区域自适应解耦注意力(RADA)机制,并结合链式掩膜注入策略,以实现对合成人体模特外观的细粒度控制。具体来说,RADA自适应地预测每个细粒度文本属性对应的生成区域,并通过链式掩膜注入策略使文本属性聚焦于预测区域,显著提高了视觉逼真度和可控性。大量实验验证了我们的框架在性能上优于现有最先进的方法。
论文及项目相关链接
Summary
本文提出一种基于扩散模型、无需形变的时装展示图像生成框架FashionMAC,可高质量且可控地生成时装展示图像。其核心思想是不执行衣物形变,而是直接对从穿着者身上分割出来的衣物进行外扩补全(outpainting),实现衣物细节的忠实保留。同时,结合区域自适应解耦注意力(RADA)机制和链式掩膜注入策略,实现对合成人物模特的细粒度外观控制。
Key Takeaways
- 服装细节的真实保留:通过免去衣物形变,直接对分割出的衣物进行外扩补全,忠实保留衣物细节。
- 高质量且可控的时装展示图像生成:提出的FashionMAC框架能够实现高质量的时装展示图像生成,并且具有可控性。
- 引入区域自适应解耦注意力(RADA)机制:RADA机制能够自适应预测每个细粒度文本属性的生成区域。
- 链式掩膜注入策略:通过该策略,文本属性能够专注于预测的生成区域,提高了视觉保真度和可控性。
- 框架超越了现有先进技术:经过广泛实验验证,FashionMAC框架在性能上超越了现有的先进技术。
- 核心挑战包括细节保留和模型外观的精细控制:时装图像生成的关键挑战在于如何真实地保留服装细节,以及实现对模型外观的精细控制。
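下面给出"把每个文本属性的注意力限制在其预测区域内"的一个极简示意(非RADA原始实现):在softmax之前,把区域掩膜之外的位置置为负无穷。区域掩膜如何自适应预测、以及链式注入的具体方式均不在此示意范围内。

```python
import torch
import torch.nn.functional as F

def region_masked_cross_attention(img_feats, text_feats, region_masks):
    """img_feats: (N, d) 图像token;text_feats: (T, d) 细粒度属性token;
    region_masks: (T, N) 布尔掩膜,True 表示该属性允许作用的图像位置。"""
    d = img_feats.shape[-1]
    scores = text_feats @ img_feats.T / d ** 0.5          # (T, N)
    scores = scores.masked_fill(~region_masks, float("-inf"))
    attn = F.softmax(scores, dim=-1)                      # 每个属性只关注其预测区域
    return attn @ img_feats                               # (T, d) 区域聚合后的属性特征

img = torch.randn(16, 64)       # 16 个图像位置
txt = torch.randn(4, 64)        # 4 个细粒度文本属性
masks = torch.rand(4, 16) > 0.5
masks[:, 0] = True              # 保证每个属性至少有一个可关注位置
out = region_masked_cross_attention(img, txt, masks)
```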
点此查看论文截图
Scene Graph-Guided Generative AI Framework for Synthesizing and Evaluating Industrial Hazard Scenarios
Authors:Sanjay Acharjee, Abir Khan Ratul, Diego Patino, Md Nazmus Sakib
Training vision models to detect workplace hazards accurately requires realistic images of unsafe conditions that could lead to accidents. However, acquiring such datasets is difficult because capturing accident-triggering scenarios as they occur is nearly impossible. To overcome this limitation, this study presents a novel scene graph-guided generative AI framework that synthesizes photorealistic images of hazardous scenarios grounded in historical Occupational Safety and Health Administration (OSHA) accident reports. OSHA narratives are analyzed using GPT-4o to extract structured hazard reasoning, which is converted into object-level scene graphs capturing spatial and contextual relationships essential for understanding risk. These graphs guide a text-to-image diffusion model to generate compositionally accurate hazard scenes. To evaluate the realism and semantic fidelity of the generated data, a visual question answering (VQA) framework is introduced. Across four state-of-the-art generative models, the proposed VQA Graph Score outperforms CLIP and BLIP metrics based on entropy-based validation, confirming its higher discriminative sensitivity.
要训练视觉模型准确检测工作场所危险,需要能够反映可能引发事故的不安全状况的真实图像。然而,获取此类数据集十分困难,因为几乎不可能在事故触发场景发生的当下将其拍摄下来。为克服这一局限,本研究提出了一种新颖的、以场景图为引导的生成式人工智能框架,依据职业安全与健康管理局(OSHA)的历史事故报告合成逼真的危险场景图像。研究使用GPT-4o分析OSHA事故叙述,提取结构化的危险推理,并将其转换为对象级场景图,以捕获理解风险所必需的空间与上下文关系。这些场景图引导文本到图像扩散模型生成构图准确的危险场景。为了评估生成数据的真实性和语义保真度,研究引入了视觉问答(VQA)框架。在四种最先进的生成模型上,基于熵的验证表明,所提出的VQA Graph Score优于CLIP和BLIP指标,证明其具有更高的判别敏感性。
论文及项目相关链接
Summary
本研究提出了一种基于场景图引导的新型生成式AI框架,用于合成基于历史职业安全与卫生管理局(OSHA)事故报告的逼真危险场景图像。该研究使用GPT-4o分析OSHA报告,提取结构化危险推理,并将其转化为捕捉空间和上下文关系的对象级场景图,以理解风险。这些场景图指导文本到图像的扩散模型生成组合准确的危险场景。评估生成数据真实性和语义保真度时,引入了一种视觉问答(VQA)框架。相较于CLIP和BLIP指标,该研究的VQA Graph Score评价指标表现出更高的鉴别敏感性。
Key Takeaways
- 训练视觉模型检测工作场所危险需要真实的危险图像数据集,但获取这些数据集十分困难。
- 提出了一种基于场景图的生成式AI框架,合成逼真的危险场景图像。
- 使用GPT-4o从OSHA事故报告中提取结构化危险推理。
- 将危险推理转化为对象级场景图,捕捉空间和上下文关系以理解风险。
- 场景图指导文本到图像的扩散模型生成危险场景。
- 引入VQA框架评估生成图像的真实性和语义保真度。
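下面用一个极简的数据结构示意(非论文实现)说明"对象级场景图→文生图提示"的基本思路:节点表示对象、带标签的边表示空间/上下文关系,再序列化为一条英文提示。其中的对象与关系内容均为虚构示例。

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneGraph:
    objects: List[str]
    relations: List[Tuple[str, str, str]] = field(default_factory=list)  # (主体, 关系, 客体)

    def to_prompt(self) -> str:
        """把场景图序列化为文生图提示(简单英文模板,仅作示意)。"""
        rel_text = "; ".join(f"{s} {r} {o}" for s, r, o in self.relations)
        return ("A photorealistic industrial scene with "
                + ", ".join(self.objects) + "; " + rel_text + ".")

# 虚构示例:工人站在不稳固的梯子顶部,梯子靠在脚手架上
g = SceneGraph(
    objects=["a worker", "an unsecured ladder", "scaffolding"],
    relations=[("the ladder", "leans unstably against", "the scaffolding"),
               ("the worker", "stands on top of", "the ladder")],
)
print(g.to_prompt())
```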
点此查看论文截图
Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation
Authors:Sofia Jamil, Kotla Sai Charan, Sriparna Saha, Koustava Goswami, Joseph K J
Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader’s experience.
印度诗歌以其语言复杂性和深厚的文化共鸣而闻名,拥有绵延数千年的丰富多样的传统。然而,其层层叠叠的含义、文化典故和复杂的语法结构常常给理解带来挑战,尤其是对非母语者或不熟悉其语境和语言的读者而言。尽管具有重要的文化意义,现有的诗歌研究大多忽略了印度语言诗歌。在本文中,我们提出了翻译与图像生成(TAI)框架,通过适当的提示调优,利用大型语言模型(LLM)和潜在扩散模型。我们的框架通过让具有丰富文化内涵的印度语言诗歌更容易为全球受众所接触,支持联合国可持续发展目标中的优质教育(SDG 4)和减少不平等(SDG 10)。它包括:(1)一个翻译模块,使用比值比偏好对齐算法(Odds Ratio Preference Alignment Algorithm)将形态丰富的诗歌准确翻译成英语;(2)一个图像生成模块,采用语义图来捕获词元、依存关系以及隐喻与其含义之间的语义关系,为印度诗歌创建具有视觉意义的表示。我们的综合实验评估(包括人工评估和定量评估)证明了TAI Diffusion在诗歌图像生成任务中的优越性,超越了强有力的基线。为进一步缓解印度语言诗歌资源匮乏的问题,我们推出了形态丰富的印度语言诗歌数据集MorphoVerse,包含跨越21种低资源印度语言的1570首诗歌。通过填补诗歌翻译与视觉理解方面的空白,这项工作旨在提升可及性并丰富读者的体验。
论文及项目相关链接
Summary
本文提出了一个结合翻译与图像生成的框架(TAI),利用大型语言模型(LLM)和潜在扩散模型,通过适当的提示调整,支持印度语言诗歌的全球化。该框架包括翻译模块和图像生成模块,前者使用Odds Ratio Preference Alignment Algorithm将形态丰富的诗歌翻译成英语,后者利用语义图捕捉诗歌中的符号、依赖关系和隐喻之间的语义关系,为印度诗歌创建视觉上有意义的表示。实验评估表明,TAI在诗歌图像生成任务中表现出卓越性能。此外,为解决印度语言诗歌资源匮乏的问题,还推出了MorphoVerse数据集,包含1570首跨越21种低资源印度语言的诗歌。
Key Takeaways
- 印度诗歌具有悠久历史和丰富多样的遗产,但其复杂的语言和文化背景对非母语者或不了解其文化和语言背景的读者来说,理解起来具有挑战性。
- 现有关于诗歌的研究大多忽略了印度语言诗歌。
- 提出的Translation and Image Generation (TAI)框架,借助大型语言模型和潜在扩散模型,旨在增强印度语言诗歌的全球性。
- TAI框架包括翻译模块和图像生成模块,前者采用Odds Ratio Preference Alignment Algorithm进行翻译,后者利用语义图捕捉诗歌的语义关系,生成有意义的图像表示。
- 综合实验评估表明,TAI在诗歌图像生成任务中具有卓越性能。
- 针对印度语言诗歌资源的匮乏,推出了Morphologically Rich Indian Language Poems MorphoVerse数据集。
点此查看论文截图
Generalized Denoising Diffusion Codebook Models (gDDCM): Tokenizing images using a pre-trained diffusion model
Authors:Fei Kong
Recently, the Denoising Diffusion Codebook Models (DDCM) was proposed. DDCM leverages the Denoising Diffusion Probabilistic Model (DDPM) and replaces the random noise in the backward process with noise sampled from specific sets according to a predefined rule, thereby enabling image compression. However, DDCM cannot be applied to methods other than DDPM. In this paper, we propose the generalized Denoising Diffusion Compression Model (gDDCM), which extends DDCM to mainstream diffusion models and their variants, including DDPM, Score-Based Models, Consistency Models, and Rectified Flow. We evaluate our method on CIFAR-10 and LSUN Bedroom datasets. Experimental results demonstrate that our approach successfully generalizes DDCM to the aforementioned models and achieves improved performance.
最近,提出了去噪扩散编码本模型(DDCM)。DDCM利用去噪扩散概率模型(DDPM),根据预设规则,将反向过程中的随机噪声替换为从特定集合中采样的噪声,从而实现图像压缩。然而,DDCM无法应用于除DDPM之外的其他方法。在本文中,我们提出了广义去噪扩散压缩模型(gDDCM),它将DDCM扩展到主流扩散模型及其变体,包括DDPM、基于分数的模型、一致性模型和修正流。我们在CIFAR-10和LSUN卧室数据集上评估了我们的方法。实验结果表明,我们的方法成功地将DDCM推广到了上述模型,并实现了性能提升。
论文及项目相关链接
PDF in Chinese language
Summary
近期提出了去噪扩散编码本模型(DDCM),它利用去噪扩散概率模型(DDPM)并将反向过程中的随机噪声替换为特定集合中的噪声样本,从而实现图像压缩。然而,DDCM仅适用于DDPM方法。本文提出了广义去噪扩散压缩模型(gDDCM),将DDCM扩展到主流扩散模型及其变体,包括DDPM、基于分数的模型、一致性模型和校正流。我们在CIFAR-10和LSUN卧室数据集上评估了该方法,实验结果表明,该方法成功地将DDCM推广到了上述模型,并实现了性能的提升。
Key Takeaways
- DDCM利用DDPM并将反向过程中的随机噪声替换为特定集合采样的噪声,实现图像压缩。
- DDCM仅适用于DDPM,缺乏通用性。
- 本文提出了广义去噪扩散压缩模型(gDDCM),以扩展DDCM的应用范围。
- gDDCM可以应用于主流扩散模型及其变体,包括DDPM、基于分数的模型、一致性模型和校正流。
- gDDCM在CIFAR-10和LSUN卧室数据集上进行了评估。
- 实验结果表明,gDDCM成功推广了DDCM的应用,并提高了性能。
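下面是"用预定义规则从候选噪声集合中选噪声"这一DDCM核心思想的简化示意(非论文实现):在反向过程的每一步,从K个固定种子生成的候选高斯噪声中,选取与目标残差方向内积最大的一个,其索引序列即可作为压缩码。选择规则与目标方向的构造在此做了简化假设。

```python
import numpy as np

def select_codebook_noise(candidates, target_residual):
    """candidates: (K, D) 候选噪声;target_residual: (D,) 希望逼近的残差方向。
    返回 (被选中的噪声, 其索引);逐步记录的索引序列即为压缩表示。"""
    scores = candidates @ target_residual        # 与目标方向的内积
    idx = int(np.argmax(scores))
    return candidates[idx], idx

# 用法示意:每个反向步用固定随机种子生成同一候选集,解码端据索引即可复现
rng = np.random.default_rng(seed=0)
K, D = 64, 1024
candidates = rng.standard_normal((K, D))
target = rng.standard_normal(D)                  # 实际中可由原图与当前估计之差给出
noise, idx = select_codebook_noise(candidates, target)
print(idx)
```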
点此查看论文截图
Towards 3D Object-Centric Feature Learning for Semantic Scene Completion
Authors:Weihua Wang, Yubo Cui, Xiangru Lin, Zhiheng Li, Zheng Fang
Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. While most existing approaches follow an ego-centric paradigm by aggregating and diffusing features over the entire scene, they often overlook fine-grained object-level details, leading to semantic and geometric ambiguities, especially in complex environments. To address this limitation, we propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances to enable more accurate semantic occupancy prediction. Specifically, we first employ a lightweight segmentation model, MobileSAM, to extract instance masks from the input image. Then, we introduce a 3D Semantic Group Attention module that leverages linear attention to aggregate object-centric features in 3D space. To handle segmentation errors and missing instances, we further design a Global Similarity-Guided Attention module that leverages segmentation features for global interaction. Finally, we propose an Instance-aware Local Diffusion module that improves instance features through a generative process and subsequently refines the scene representation in the BEV space. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that Ocean achieves state-of-the-art performance, with mIoU scores of 17.40 and 20.28, respectively.
基于视觉的3D语义场景补全(SSC)因其在自动驾驶领域的应用潜力而受到越来越多的关注。虽然大多数现有方法采用以自我为中心(ego-centric)的方法,通过在整个场景上聚合和扩散特征,但它们常常忽略精细的物体级细节,导致语义和几何上的模糊性,特别是在复杂环境中。为了解决这一局限性,我们提出了Ocean,一个以物体为中心的预测框架,它将场景分解为单独的物体实例,以实现更精确的语义占用预测。具体来说,我们首先采用轻量级分割模型MobileSAM从输入图像中提取实例掩膜。然后,我们引入了一个3D语义组注意力模块,它利用线性注意力在3D空间中聚合物体中心特征。为了处理分割错误和缺失的实例,我们进一步设计了一个全局相似性引导注意力模块,它利用分割特征进行全局交互。最后,我们提出了一个实例感知局部扩散模块,它通过生成过程改进实例特征,随后在BEV空间细化场景表示。在SemanticKITTI和SSCBench-KITTI360基准测试上的广泛实验表明,Ocean达到了最先进的性能,mIoU得分分别为17.40和20.28。
论文及项目相关链接
PDF Accepted to AAAI-2026
Summary
本文关注基于视觉的3D语义场景补全(SSC)在自动驾驶领域的应用。针对现有方法忽略对象级别的细节导致的语义和几何模糊问题,提出了Ocean框架。它采用对象中心预测,将场景分解为独立对象实例,实现更准确的语义占用预测。通过模块设计,包括MobileSAM分割模型、3D语义组注意力模块、全局相似性引导注意力和实例感知局部扩散模块,以提高性能和准确性。在SemanticKITTI和SSCBench-KITTI360基准测试上,Ocean实现了最佳性能,mIoU得分分别为17.40和20.28。
Key Takeaways
- 自动驾驶中,基于视觉的3D语义场景补全(SSC)受到关注。
- 现有方法多采用以自我为中心的预测范式,导致语义和几何模糊问题。
- Ocean框架采用对象中心预测,分解场景为独立对象实例,提高语义占用预测准确性。
- Ocean包括MobileSAM分割模型、3D语义组注意力模块等关键模块。
- Ocean通过全局相似性引导注意力处理分割错误和缺失实例。
- Ocean通过实例感知局部扩散模块提高实例特征并优化场景表示。
点此查看论文截图
PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos
Authors:Dianbing Xi, Guoyuan An, Jingsen Zhu, Zhijian Liu, Yuan Liu, Ruiyuan Zhang, Jiayuan Lu, Yuchi Huo, Rui Wang
We propose PFAvatar (Pose-Fusion Avatar), a new method that reconstructs high-quality 3D avatars from Outfit of the Day(OOTD) photos, which exhibit diverse poses, occlusions, and complex backgrounds. Our method consists of two stages: (1) fine-tuning a pose-aware diffusion model from few-shot OOTD examples and (2) distilling a 3D avatar represented by a neural radiance field (NeRF). In the first stage, unlike previous methods that segment images into assets (e.g., garments, accessories) for 3D assembly, which is prone to inconsistency, we avoid decomposition and directly model the full-body appearance. By integrating a pre-trained ControlNet for pose estimation and a novel Condition Prior Preservation Loss (CPPL), our method enables end-to-end learning of fine details while mitigating language drift in few-shot training. Our method completes personalization in just 5 minutes, achieving a 48x speed-up compared to previous approaches. In the second stage, we introduce a NeRF-based avatar representation optimized by canonical SMPL-X space sampling and Multi-Resolution 3D-SDS. Compared to mesh-based representations that suffer from resolution-dependent discretization and erroneous occluded geometry, our continuous radiance field can preserve high-frequency textures (e.g., hair) and handle occlusions correctly through transmittance. Experiments demonstrate that PFAvatar outperforms state-of-the-art methods in terms of reconstruction fidelity, detail preservation, and robustness to occlusions/truncations, advancing practical 3D avatar generation from real-world OOTD albums. In addition, the reconstructed 3D avatar supports downstream applications such as virtual try-on, animation, and human video reenactment, further demonstrating the versatility and practical value of our approach.
我们提出了PFAvatar(Pose-Fusion Avatar),一种从日常穿搭(OOTD)照片重建高质量3D化身的新方法,此类照片包含多样的姿态、遮挡和复杂背景。我们的方法分为两个阶段:(1)利用少量OOTD样例微调一个姿态感知的扩散模型;(2)蒸馏出以神经辐射场(NeRF)表示的3D化身。在第一阶段,以往方法会把图像分割为资产(如服装、配饰)再进行3D组装,容易产生不一致;与之不同,我们避免这种分解,直接对全身外观进行建模。通过集成预训练的ControlNet进行姿态估计,并引入新颖的条件先验保留损失(CPPL),我们的方法能够端到端地学习精细细节,同时缓解少样本训练中的语言漂移。我们的方法仅需5分钟即可完成个性化,相比以往方法实现了48倍加速。在第二阶段,我们引入一种基于NeRF的化身表示,通过规范SMPL-X空间采样和多分辨率3D-SDS进行优化。与受分辨率依赖的离散化影响、且遮挡区域几何容易出错的网格表示相比,我们的连续辐射场能够保留高频纹理(如头发),并通过透射率正确处理遮挡。实验表明,PFAvatar在重建保真度、细节保留以及对遮挡/截断的鲁棒性方面优于现有最先进方法,推动了从真实世界OOTD相册生成实用3D化身的发展。此外,重建出的3D化身还支持虚拟试穿、动画和人体视频重演等下游应用,进一步证明了我们方法的通用性和实用价值。
论文及项目相关链接
PDF Accepted by AAAI 2026
Summary
我们提出了一种名为PFAvatar(姿态融合化身)的新方法,能够从日常穿搭(OOTD)照片中重建高质量的三维化身。该方法分为两个阶段:一是用少量OOTD示例微调姿态感知扩散模型,二是通过神经辐射场(NeRF)表示化身并进行优化。此方法避免了分解图像资产(如服装、配饰)进行三维装配的不一致性,能够端到端学习精细细节并减轻少量训练中的语言漂移。第二阶段的NeRF化身表示法优化,通过标准SMPL-X空间采样和多分辨率3D-SDS技术,能保存高频纹理,正确处理遮挡问题。实验表明,PFAvatar在重建保真度、细节保存以及对遮挡和截断情况的稳健性方面超越了现有方法,推动了实际应用的OOTD专辑的三维化身生成。重建的化身还支持下游应用,如虚拟试穿、动画和人体视频重新演绎。
Key Takeaways
- PFAvatar能够从OOTD照片中重建高质量的三维化身。
- 方法分为两个阶段:微调姿态感知扩散模型和通过NeRF表示优化化身。
- 避免分解图像资产进行三维装配的不一致性,能够端到端学习精细细节。
- 通过CPPL和ControlNet辅助姿态估计,减少语言漂移。
- NeRF化身表示法能够保存高频纹理,正确处理遮挡问题。
- PFAvatar在重建保真度、细节保存和对遮挡/截断情况的稳健性方面超越现有方法。
点此查看论文截图
cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold
Authors:Zain Shabeeb, Daniel Saeedi, Darin Tsui, Vida Jamali, Amirali Aghazadeh
Cryo-electron microscopy (cryo-EM) enables the atomic-resolution visualization of biomolecules; however, modern direct detectors generate data volumes that far exceed the available storage and transfer bandwidth, thereby constraining practical throughput. We introduce cryoSENSE, the computational realization of a hardware-software co-designed framework for compressive cryo-EM sensing and acquisition. We show that cryo-EM images of proteins lie on low-dimensional manifolds that can be independently represented using sparse priors in predefined bases and generative priors captured by a denoising diffusion model. cryoSENSE leverages these low-dimensional manifolds to enable faithful image reconstruction from spatial and Fourier-domain undersampled measurements while preserving downstream structural resolution. In experiments, cryoSENSE increases acquisition throughput by up to 2.5$\times$ while retaining the original 3D resolution, offering controllable trade-offs between the number of masked measurements and the level of downsampling. Sparse priors favor faithful reconstruction from Fourier-domain measurements and moderate compression, whereas generative diffusion priors achieve accurate recovery from pixel-domain measurements and more severe undersampling. Project website: https://cryosense.github.io.
冷冻电子显微镜(cryo-EM)能够实现生物分子的原子分辨率可视化;然而,现代直接电子探测器产生的数据量远远超过可用的存储和传输带宽,从而限制了实际吞吐量。我们提出了cryoSENSE,这是一个软硬件协同设计框架的计算实现,用于压缩感知式的冷冻电镜感知与采集。我们证明,蛋白质的冷冻电镜图像位于低维流形上,既可以用预定义基下的稀疏先验独立表示,也可以用去噪扩散模型捕获的生成先验表示。cryoSENSE利用这些低维流形,能够从空间域和傅里叶域的欠采样测量中忠实地重建图像,同时保留下游的结构分辨率。实验表明,cryoSENSE可将采集吞吐量提高至最多2.5倍,同时保持原有的三维分辨率,并可在掩码测量数量与降采样程度之间进行可控权衡。稀疏先验更适合从傅里叶域测量和中等压缩中进行忠实重建,而生成式扩散先验则能在像素域测量和更严重的欠采样下实现准确恢复。项目网站:https://cryosense.github.io。
论文及项目相关链接
Summary
本文介绍了cryoSENSE的计算实现,这是一种硬件和软件协同设计的框架,用于压缩冷冻电子显微镜(cryo-EM)感知和采集。研究表明,蛋白质冷冻电子显微镜图像位于低维流形上,可以使用预定义的稀疏先验和由去噪扩散模型捕获的生成先验进行独立表示。cryoSENSE利用这些低维流形实现了从空间和傅立叶域欠采样测量值的忠实图像重建,同时保持下游结构分辨率。实验表明,cryoSENSE在提高采集速度的同时保留了原始3D分辨率,可以在掩膜测量数量和欠采样程度之间实现可控的权衡。
Key Takeaways
- cryo-EM允许原子分辨率的蛋白质可视化,但现代直接检测器生成的数据量超过了可用的存储和传输带宽,限制了实际吞吐量。
- cryoSENSE是硬件和软件协同设计的框架的计算实现,用于压缩cryo-EM感知和采集。
- 蛋白质冷冻电子显微镜图像位于低维流形上,可独立使用稀疏先验和生成先验表示。
- cryoSENSE利用这些低维流形实现从空间和傅立叶域欠采样测量的忠实图像重建。
- 在实验中,cryoSENSE通过提高采集速度同时保留原始3D分辨率,在掩膜测量数量和欠采样程度之间提供了可控的权衡。
- 稀疏先验有利于从傅立叶域测量值和适度压缩中实现忠实重建。
- 生成扩散先验在像素域测量和更严重的欠采样中实现准确恢复。
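下面用一个通用的压缩感知小例子(非cryoSENSE实现)说明"欠采样测量 + 稀疏先验"的重建思路:对傅里叶域随机欠采样的一维稀疏信号,用ISTA(迭代软阈值)进行恢复;采样比例、步长、正则权重均为示意取值。

```python
import numpy as np

def soft(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista_fourier_recovery(y, mask, lam=0.02, step=1.0, iters=300):
    """从傅里叶域欠采样测量 y = mask * FFT(x) 中恢复稀疏信号 x(ISTA 示意)。"""
    x = np.zeros(mask.size)
    for _ in range(iters):
        resid = mask * np.fft.fft(x) - y
        grad = np.real(np.fft.ifft(mask * resid))   # 归一化后的数据项梯度,谱范数不超过1
        x = soft(x - step * grad, lam * step)       # 梯度步 + L1 软阈值(近端步)
    return x

rng = np.random.default_rng(0)
n = 256
x_true = np.zeros(n)
x_true[rng.choice(n, size=8, replace=False)] = rng.standard_normal(8)
mask = (rng.random(n) < 0.4).astype(float)          # 仅保留约40%的傅里叶系数
y = mask * np.fft.fft(x_true)
x_hat = ista_fourier_recovery(y, mask)
print(round(float(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)), 3))
```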
点此查看论文截图
GeoMVD: Geometry-Enhanced Multi-View Generation Model Based on Geometric Information Extraction
Authors:Jiaqi Wu, Yaosen Chen, Shuyuan Zhu
Multi-view image generation holds significant application value in computer vision, particularly in domains like 3D reconstruction, virtual reality, and augmented reality. Most existing methods, which rely on extending single images, face notable computational challenges in maintaining cross-view consistency and generating high-resolution outputs. To address these issues, we propose the Geometry-guided Multi-View Diffusion Model, which incorporates mechanisms for extracting multi-view geometric information and adjusting the intensity of geometric features to generate images that are both consistent across views and rich in detail. Specifically, we design a multi-view geometry information extraction module that leverages depth maps, normal maps, and foreground segmentation masks to construct a shared geometric structure, ensuring shape and structural consistency across different views. To enhance consistency and detail restoration during generation, we develop a decoupled geometry-enhanced attention mechanism that strengthens feature focus on key geometric details, thereby improving overall image quality and detail preservation. Furthermore, we apply an adaptive learning strategy that fine-tunes the model to better capture spatial relationships and visual coherence between the generated views, ensuring realistic results. Our model also incorporates an iterative refinement process that progressively improves the output quality through multiple stages of image generation. Finally, a dynamic geometry information intensity adjustment mechanism is proposed to adaptively regulate the influence of geometric data, optimizing overall quality while ensuring the naturalness of generated images. More details can be found on the project page: https://sobeymil.github.io/GeoMVD.com.
多视角图像生成在计算机视觉领域具有重要的应用价值,特别是在3D重建、虚拟现实和增强现实等领域。大多数现有方法依赖于单张图像的扩展,但在保持跨视图一致性和生成高分辨率输出方面面临重大的计算挑战。为了解决这些问题,我们提出了几何引导的多视图扩散模型,该模型结合了提取多视图几何信息和调整几何特征强度的机制,以生成既跨视图一致又细节丰富的图像。具体来说,我们设计了一个多视图几何信息提取模块,该模块利用深度图、法线图和前景分割掩膜来构建共享几何结构,确保不同视图之间的形状和结构一致性。为了增强生成过程中的一致性和细节恢复,我们开发了一种解耦的几何增强注意力机制,该机制加强了关键几何细节的特征关注,从而提高了整体图像质量和细节保留。此外,我们还采用了一种自适应学习策略,对模型进行微调,以更好地捕捉生成视图之间的空间关系和视觉连贯性,确保结果的真实性。我们的模型还结合了一种迭代细化过程,通过多个阶段的图像生成逐步改进输出质量。最后,提出了一种动态几何信息强度调整机制,自适应地调节几何数据的影响,优化整体质量,同时确保生成图像的自然性。更多细节请参见项目页面:https://sobeymil.github.io/GeoMVD.com。
论文及项目相关链接
Summary
本文提出一种基于几何引导的多视角扩散模型(Geometry-guided Multi-View Diffusion Model),用于生成多视角图像。该模型通过提取和利用多视角几何信息,解决了现有方法在计算量大、视角一致性保持差、高分辨率输出生成困难等问题。通过深度图、法线图、前景分割掩膜等构建共享几何结构,确保不同视角的形状和结构一致性。采用去耦的几何增强注意力机制提高图像质量及细节保留。此外,模型还具备自适应学习能力,能够捕捉空间关系和视觉连贯性,生成更真实的图像。通过迭代优化过程,逐步改善输出质量。模型还提出了动态调整几何信息强度机制,以优化整体质量和保证图像的自然性。
Key Takeaways
- 提出基于几何引导的多视角扩散模型(Geometry-guided Multi-View Diffusion Model),解决多视角图像生成中的计算量大、视角一致性差等问题。
- 通过深度图、法线图和前景分割掩膜构建共享几何结构,确保不同视角的形状和结构一致性。
- 采用去耦的几何增强注意力机制,提高图像质量和细节保留。
- 模型具备自适应学习能力,能够捕捉空间关系和视觉连贯性,生成更真实的图像。
- 通过迭代优化过程逐步改善输出质量。
- 模型具备动态调整几何信息强度机制,以优化整体图像质量并保障自然性。
点此查看论文截图
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
Authors:Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li
While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel
思考感知生成(thinking-aware generation)旨在提升复杂任务上的性能,但我们发现了一种关键的失败模式:现有的串行自回归方法会因误差传播而适得其反地降低性能。为了系统地分析这一问题,我们提出了ParaBench,一个旨在同时评估文本和图像输出模态的新基准。基于ParaBench的分析表明,这种性能下降与生成的推理和最终图像之间对齐不佳密切相关。为解决该问题,我们提出了并行多模态扩散框架MMaDA-Parallel,使文本和图像在整个去噪轨迹中保持连续的双向交互。MMaDA-Parallel先经过监督微调训练,再通过并行强化学习(ParaRL)进一步优化;ParaRL是一种沿轨迹施加语义奖励以强化跨模态一致性的新策略。实验证明,我们的模型显著改善了跨模态对齐和语义一致性,在ParaBench上的输出对齐(Output Alignment)指标相比最先进模型Bagel提升了6.9%,为思考感知的图像合成建立了更稳健的范式。我们的代码已开源:https://github.com/tyfeld/MMaDA-Parallel
论文及项目相关链接
PDF Project Page: https://tyfeld.github.io/mmadaparellel.github.io/
摘要
该文指出了现有序列自回归方法在处理复杂任务时存在的一个关键失败模式,即误差传播导致的性能下降。为解决这一问题,作者提出了ParaBench这一新基准测试平台,该平台既可以评估文本输出也可以评估图像输出。分析显示性能下降与生成推理和最终图像之间的对齐不佳密切相关。为解决此问题,作者提出了MMaDA-Parallel这一并行多模态扩散框架,它能在整个去噪轨迹中实现文本和图像之间的持续双向交互。此外,结合监督微调训练,采用平行强化学习(ParaRL)策略进一步调优,通过在轨迹上应用语义奖励来加强跨模态一致性。实验证明,该模型显著提高了跨模态对齐和语义一致性,在ParaBench上的输出对齐度相比目前最先进的Bagel模型提高了6.9%,为思考感知图像合成建立了更稳健的范式。
关键见解
- 现有自回归方法在处理复杂任务时存在误差传播导致的性能下降问题。
- ParaBench基准测试平台被设计用来评估文本和图像输出,揭示生成推理与最终图像之间的对齐问题。
- MMaDA-Parallel框架实现了文本和图像之间的持续双向交互,在整个去噪轨迹中优化跨模态对齐。
- MMaDA-Parallel结合监督微调训练并采用平行强化学习(ParaRL)策略进一步调优模型。
- ParaRL通过轨迹上的语义奖励来加强跨模态一致性。
- 实验结果表明,MMaDA-Parallel在ParaBench上的输出对齐度相比目前最先进的模型有显著提高。
- 开源代码为社区提供了一个更稳健的思考感知图像合成范式。
点此查看论文截图
Learning few-step posterior samplers by unfolding and distillation of diffusion models
Authors:Charlesquin Kemajou Mbakam, Jonathan Spence, Marcelo Pereyra
Diffusion models (DMs) have emerged as powerful image priors in Bayesian computational imaging. Two primary strategies have been proposed for leveraging DMs in this context: Plug-and-Play methods, which are zero-shot and highly flexible but rely on approximations; and specialized conditional DMs, which achieve higher accuracy and faster inference for specific tasks through supervised training. In this work, we introduce a novel framework that integrates deep unfolding and model distillation to transform a DM image prior into a few-step conditional model for posterior sampling. A central innovation of our approach is the unfolding of a Markov chain Monte Carlo (MCMC) algorithm - specifically, the recently proposed LATINO Langevin sampler (Spagnoletti et al., 2025) - representing the first known instance of deep unfolding applied to a Monte Carlo sampling scheme. We demonstrate our proposed unfolded and distilled samplers through extensive experiments and comparisons with the state of the art, where they achieve excellent accuracy and computational efficiency, while retaining the flexibility to adapt to variations in the forward model at inference time.
扩散模型(DM)已成为贝叶斯计算成像中强大的图像先验。在这一背景下,利用扩散模型主要有两类策略:即插即用(Plug-and-Play)方法,零样本、高度灵活,但依赖近似;以及针对特定任务的专用条件扩散模型,通过监督训练获得更高的精度和更快的推理速度。在这项工作中,我们提出了一个新颖的框架,融合深度展开与模型蒸馏,将扩散模型图像先验转化为一个用于后验采样的几步条件模型。我们方法的一个核心创新在于对马尔可夫链蒙特卡罗(MCMC)算法进行展开——具体而言,是最近提出的LATINO朗之万采样器(Spagnoletti et al., 2025)——这是已知首个将深度展开应用于蒙特卡罗采样方案的实例。我们通过大量实验以及与最新技术的比较来验证所提出的展开与蒸馏采样器:它们在实现出色的准确性和计算效率的同时,保留了在推理时适应前向模型变化的灵活性。
论文及项目相关链接
PDF 34 pages, 18 figures, 11 tables
摘要
扩散模型(DMs)已成为贝叶斯计算成像中的强大图像先验。本文介绍了一种新型框架,它融合了深度展开和模型蒸馏技术,将扩散模型图像先验转化为用于后验采样的几步条件模型。该框架的创新之处在于展开马尔可夫链蒙特卡洛算法(MCMC),特别是最近提出的LATINO Langevin采样器,代表了深度展开首次应用于蒙特卡洛采样方案。实验证明,所提出的展开和蒸馏采样器在保持对前向模型推理时变异的适应性的同时,达到了很高的准确性和计算效率,并与现有技术相比具有优势。
要点
- 扩散模型(DMs)已成为贝叶斯计算成像中的强大图像先验。
- 提出了结合深度展开和模型蒸馏的新型框架,将DM图像先验转化为条件模型。
- 创新地展开马尔可夫链蒙特卡洛(MCMC)算法,特别是LATINO Langevin采样器的应用。
- 框架首次将深度展开应用于蒙特卡洛采样方案。
- 通过广泛实验和与最新技术的比较,展示所提出的采样器的高准确性和计算效率。
- 所提出的采样器保留了在推理时适应前向模型变化的能力。
- 与现有技术相比,所提出的框架具有优势。
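下面给出"对朗之万(Langevin)型采样器做深度展开"的一个通用示意(非LATINO或论文实现):把固定步数的朗之万更新写成带可学习步长的网络层,每层由一个(假设给定的)去噪器提供先验得分近似,并叠加数据一致性梯度。

```python
import torch
import torch.nn as nn

class UnfoldedLangevinSampler(nn.Module):
    """把 K 步朗之万更新展开为网络层,每层步长可学习(通用示意,非LATINO实现)。"""

    def __init__(self, denoiser, num_steps=8, sigma=0.1):
        super().__init__()
        self.denoiser = denoiser                                  # 假设:预训练去噪器,用于近似先验得分
        self.log_steps = nn.Parameter(torch.full((num_steps,), -4.0))
        self.sigma = sigma

    def forward(self, y, forward_op, adjoint_op, x_init):
        x = x_init
        for log_eta in self.log_steps:
            eta = log_eta.exp()
            prior_score = (self.denoiser(x) - x) / self.sigma ** 2   # Tweedie 式得分近似
            data_grad = adjoint_op(forward_op(x) - y)                # 数据一致性梯度
            x = x + eta * (prior_score - data_grad) + (2 * eta).sqrt() * torch.randn_like(x)
        return x

# 用法示意:去噪器与测量算子均为占位(恒等映射),仅演示接口
sampler = UnfoldedLangevinSampler(denoiser=nn.Identity(), num_steps=4)
y = torch.randn(1, 3, 32, 32)
x_hat = sampler(y, forward_op=lambda v: v, adjoint_op=lambda v: v, x_init=torch.randn_like(y))
```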
点此查看论文截图
OG-VLA: Orthographic Image Generation for 3D-Aware Vision-Language Action Model
Authors:Ishika Singh, Ankit Goyal, Stan Birchfield, Dieter Fox, Animesh Garg, Valts Blukis
We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. We address the challenge of mapping natural language instructions and one or more RGBD observations to quasi-static robot actions. 3D-aware robot policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. On the other hand, VLAs excel at generalizing across instructions and scenes, but can be sensitive to camera and robot pose variations. We leverage prior knowledge embedded in language and vision foundation models to improve generalization of 3D-aware keyframe policies. OG-VLA unprojects input observations from diverse views into a point cloud which is then rendered from canonical orthographic views, ensuring input view invariance and consistency between input and output spaces. These canonical views are processed with a vision backbone, a Large Language Model (LLM), and an image diffusion model to generate images that encode the next position and orientation of the end-effector on the input scene. Evaluations on the Arnold and Colosseum benchmarks demonstrate state-of-the-art generalization to unseen environments, with over 40% relative improvements while maintaining robust performance in seen settings. We also show real-world adaption in 3 to 5 demonstrations along with strong generalization. Videos and resources at https://og-vla.github.io/
我们提出了OG-VLA,一种将视觉语言动作模型(VLA)的泛化优势与3D感知策略的鲁棒性相结合的新型架构与学习框架。我们要解决的挑战是:将自然语言指令和一张或多张RGBD观测映射为准静态的机器人动作。3D感知的机器人策略在精确操控任务上达到了最先进的性能,但在泛化到未见过的指令、场景和物体时表现不佳;另一方面,VLA擅长在指令和场景之间泛化,却可能对相机和机器人位姿的变化敏感。我们利用语言与视觉基础模型中蕴含的先验知识,来提升3D感知关键帧策略的泛化能力。OG-VLA将来自不同视角的输入观测反投影为点云,再从规范的正交视角进行渲染,从而保证输入视角不变性以及输入与输出空间的一致性。这些规范视图经由视觉主干、大型语言模型(LLM)和图像扩散模型处理,生成编码末端执行器在输入场景中下一位置与朝向的图像。在Arnold和Colosseum基准上的评估表明,OG-VLA在未见环境上的泛化达到了最先进水平,相对提升超过40%,同时在已见设置中保持稳健性能。我们还展示了仅用3到5个演示即可完成真实世界适配,并具有很强的泛化能力。视频与资源见:https://og-vla.github.io/
论文及项目相关链接
PDF 13 pages
Summary:
我们提出了一种新的架构和学习框架OG-VLA,它将视觉语言动作模型(VLA)的泛化能力与3D感知策略的稳健性相结合。OG-VLA解决了将自然语言指令和一个或多个RGBD观测映射到准静态机器人动作的挑战。我们在Arnold和Colosseum基准测试上的评估显示,OG-VLA在未见过的环境中具有最先进的泛化能力,同时在已见场景中保持稳健性能。此外,我们还展示了仅需少量真实世界演示即可完成适配,并具有强大的泛化能力。
Key Takeaways:
- OG-VLA是一个结合了Vision Language Action模型(VLA)泛化能力与3D感知策略稳健性的新型架构和学习框架。
- OG-VLA解决了将自然语言指令和RGBD观测映射到准静态机器人动作的挑战。
- OG-VLA通过利用语言和视觉基础模型中的先验知识,提高了3D感知关键帧策略的泛化能力。
- OG-VLA通过将输入观察从不同视角投影到点云,然后从其规范的正交视角进行渲染,确保了输入视角的不变性和输入输出空间的一致性。
- OG-VLA使用视觉主干、大型语言模型(LLM)和图像扩散模型来处理这些规范视图,生成编码末端执行器在输入场景中下一位置与朝向的图像。
- 在Arnold和Colosseum基准测试上的评估显示,OG-VLA在未见过的环境中有最先进的泛化能力,相对改进超过40%。
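下面是"把RGBD观测反投影成点云,再渲染规范正交视图"的一个简化示意(非OG-VLA实现):用针孔相机内参把深度图反投影到相机坐标系,然后做俯视方向的正交投影,得到与输入视角无关的高度图。相机内参、网格范围与分辨率均为示意取值。

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """把深度图 (H, W) 按针孔模型反投影为相机坐标系点云 (N, 3)。"""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def orthographic_topdown(points, xy_range=2.0, res=64):
    """沿相机Y轴(此处假设近似"上下"方向)做俯视正交投影,输出每个格子的最高点高度。"""
    grid = np.full((res, res), -np.inf)
    ix = np.floor((points[:, 0] + xy_range) / (2 * xy_range) * (res - 1)).astype(int)
    iz = np.floor(points[:, 2] / (2 * xy_range) * (res - 1)).astype(int)
    valid = (ix >= 0) & (ix < res) & (iz >= 0) & (iz < res)
    for x_i, z_i, h in zip(ix[valid], iz[valid], -points[valid, 1]):
        grid[z_i, x_i] = max(grid[z_i, x_i], h)      # 只保留最高点,近似可见表面
    return grid

depth = np.random.uniform(0.5, 3.0, size=(48, 64))          # 假设的深度图
pts = unproject_depth(depth, fx=60.0, fy=60.0, cx=32.0, cy=24.0)
topdown = orthographic_topdown(pts)
print(topdown.shape)
```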
点此查看论文截图
Is Noise Conditioning Necessary for Denoising Generative Models?
Authors:Qiao Sun, Zhicheng Jiang, Hanhong Zhao, Kaiming He
It is widely believed that noise conditioning is indispensable for denoising diffusion models to work successfully. This work challenges this belief. Motivated by research on blind image denoising, we investigate a variety of denoising-based generative models in the absence of noise conditioning. To our surprise, most models exhibit graceful degradation, and in some cases, they even perform better without noise conditioning. We provide a theoretical analysis of the error caused by removing noise conditioning and demonstrate that our analysis aligns with empirical observations. We further introduce a noise-unconditional model that achieves a competitive FID of 2.23 on CIFAR-10, significantly narrowing the gap to leading noise-conditional models. We hope our findings will inspire the community to revisit the foundations and formulations of denoising generative models.
普遍认为,噪声调节(noise conditioning)对于去噪扩散模型的成功运行是必不可少的。本研究对这一观点提出了质疑。受盲图像去噪研究的启发,我们考察了多种去掉噪声调节的基于去噪的生成模型。令人惊讶的是,大多数模型的性能仅出现平缓的下降(graceful degradation),在某些情况下甚至在无噪声调节时表现更佳。我们对去除噪声调节所引入的误差进行了理论分析,并证明该分析与实证观察相吻合。我们进一步提出了一种无噪声调节的模型,在CIFAR-10上取得了具有竞争力的FID(2.23),显著缩小了与领先的噪声条件模型之间的差距。我们希望这些发现能促使社区重新审视去噪生成模型的基础与建模表述。
论文及项目相关链接
PDF Update ImageNet experiments (SiT with CFG). Update Appendix
Summary
本文挑战了"噪声调节对去噪扩散模型的成功不可或缺"这一普遍观点。受盲图像去噪研究启发,本文研究了在无噪声调节的情况下多种基于去噪的生成模型。结果表明,大多数模型的性能仅平缓下降,甚至在某些情况下表现更佳。本文提供了去除噪声调节所带来误差的理论分析,并展示了该分析与实证观察的一致性。此外,本文引入了一种无需噪声调节的模型,在CIFAR-10上实现了具有竞争力的FID分数(2.23),显著缩小了与领先的噪声条件模型的差距。本文的发现有望激励社区重新审视去噪生成模型的基础与建模表述。
Key Takeaways
- 本文挑战了广泛认为噪声调节对于降噪扩散模型成功的必要性。
- 在无噪声调节的情况下,研究了多种基于去噪的生成模型。
- 大多数模型在无噪声调节时性能仅平缓下降,某些情况下甚至表现更好。
- 提供了去除噪声调节的理论分析,并与实证观察相一致。
- 引入了一种无需噪声调节的模型,在CIFAR-10上实现了具有竞争力的FID分数。
- 该模型显著缩小了与领先的噪声条件模型的差距。
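下面用一个最小的PyTorch片段(非论文实现)说明"去掉噪声调节"在训练目标层面的含义:网络只接收带噪输入,而不接收噪声水平σ或时间步t,其余与标准去噪训练一致;网络结构仅为演示用的小卷积网络。

```python
import torch
import torch.nn as nn

# 示意:去噪网络只接收带噪图像,不接收噪声水平(对比常见的 model(x_t, t) 形式)
denoiser = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

def noise_unconditional_loss(x0, sigma_min=0.01, sigma_max=1.0):
    """对随机噪声水平加噪,但不把该水平告诉网络,直接回归干净图像。"""
    b = x0.shape[0]
    sigma = torch.empty(b, 1, 1, 1).uniform_(sigma_min, sigma_max)
    x_noisy = x0 + sigma * torch.randn_like(x0)
    pred = denoiser(x_noisy)          # 注意:这里没有传入 sigma / t
    return ((pred - x0) ** 2).mean()

x0 = torch.rand(4, 3, 32, 32)
loss = noise_unconditional_loss(x0)
loss.backward()
```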