Published: 2025-11-22

Updated: 2025-11-27


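⚠️ All summaries below were generated by a large language model. They may contain errors and are provided for reference only; use them with caution.
🔴 Note: do not rely on them in serious academic settings. They are meant only for screening papers before reading them in full.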

2025-11-22 Update

Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Authors: Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang, Zipeng Xiao, Jiadi Su, You Qiaoben, Pengfei Liu, Zhijie Deng

Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.


Paper and project links

PDF

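Summary

This paper introduces Mantis, a Vision-Language-Action (VLA) framework that addresses the limitations of existing VLA models by decoupling visual foresight prediction from the backbone. Mantis combines meta queries with a diffusion Transformer (DiT) head and feeds the current visual state to the DiT through a residual connection, so that a simple next-state prediction objective lets the meta queries automatically capture the latent actions that delineate the visual trajectory, which in turn boosts the learning of explicit actions. The decoupling lightens the load on the VLA backbone, allowing it to retain comprehension and reasoning capabilities through language supervision. After pretraining on human manipulation videos, robot demonstrations, and image-text pairs, Mantis reaches a 96.7% success rate on the LIBERO benchmark after fine-tuning, surpassing strong baselines while converging quickly. In real-world evaluations, Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction following, generalization to unseen instructions, and reasoning.

Key Takeaways

Mantis targets two costs of having a VLA model predict high-dimensional visual states directly: diluted model capacity and prohibitive training cost.
It decouples visual foresight prediction from the backbone by combining meta queries with a diffusion Transformer (DiT) head.
The current visual state is passed to the DiT through a residual connection, so the meta queries capture latent actions and improve the learning of explicit actions.
The decoupling lets the VLA backbone retain comprehension and reasoning capabilities through language supervision.
Mantis achieves a 96.7% success rate on the LIBERO benchmark after fine-tuning and converges quickly.
In real-world evaluations, Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction following, generalization to unseen instructions, and reasoning.
Code and weights are released to support the open-source community.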


Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

Authors: Jian Ma, Qirong Peng, Xujie Zhu, Peixing Xie, Chen Chen, Haonan Lu

Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code, checkpoints for PPCL can be found at the following link: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning.


Summary

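This paper proposes Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework for Diffusion Transformers (DiTs) that targets the high computational cost of image generation. Redundant layer intervals are identified through a linear probing mechanism combined with first-order differential trend analysis of similarity metrics, and a plug-and-play teacher-student alternating distillation scheme integrates depth-wise and width-wise pruning within a single training phase. Experiments on multiple Multi-Modal Diffusion Transformer models show a 50% reduction in parameter count relative to the full model with less than 3% degradation in key objective metrics, while maintaining high-quality image generation, which makes the method well suited to resource-constrained environments.

Key Takeaways

Diffusion Transformers (DiTs) incur high computational costs, which impedes deployment in resource-constrained settings.
Pluggable Pruning with Contiguous Layer Distillation (PPCL) is a flexible structured pruning framework designed specifically for DiT architectures.
Redundant layer intervals are identified through linear probing combined with first-order differential trend analysis of similarity metrics.
A teacher-student alternating distillation scheme integrates depth-wise and width-wise pruning within a single training phase.
Knowledge transfers flexibly across pruning ratios, with no per-configuration retraining.
Relative to the full model, PPCL cuts the parameter count by 50% with less than 3% degradation in key objective metrics, while maintaining high-quality image generation.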
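
The abstract does not spell out how the first-order differential trend analysis selects layer intervals. As a rough illustration only, the sketch below assumes a per-layer similarity score is already available (for example, the similarity between a block's input and output features) and flags contiguous runs of layers where that score is high and its first-order difference is nearly flat. The function name, the thresholds, and the selection rule are all hypothetical and are not taken from the PPCL code.

```python
import numpy as np

def redundant_intervals(sim, sim_min=0.95, flat_tol=0.01, min_len=2):
    """Return inclusive (start, end) layer ranges that look redundant.

    A layer is a candidate when its similarity score is at least `sim_min`
    and the first-order difference to the previous layer is within `flat_tol`.
    Only runs of at least `min_len` consecutive candidates are reported.
    All thresholds are illustrative assumptions.
    """
    sim = np.asarray(sim, dtype=float)
    diff = np.diff(sim, prepend=sim[0])      # first-order difference, diff[0] == 0
    candidate = (sim >= sim_min) & (np.abs(diff) <= flat_tol)

    intervals, start = [], None
    for i, flag in enumerate(candidate):
        if flag and start is None:
            start = i                        # a new run begins here
        elif not flag and start is not None:
            if i - start >= min_len:
                intervals.append((start, i - 1))
            start = None
    if start is not None and len(sim) - start >= min_len:
        intervals.append((start, len(sim) - 1))
    return intervals
```

With `sim = [0.60, 0.80, 0.96, 0.965, 0.97, 0.90, 0.97, 0.972]`, the call `redundant_intervals(sim)` returns `[(3, 4)]`: layer 2 is excluded because the score is still rising sharply there, and layer 7 is excluded because a single flat layer does not form an interval.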


Decoupling Complexity from Scale in Latent Diffusion Model

Authors: Tianxiong Zhong, Xingye Tian, Xuebo Wang, Boyuan Jiang, Xin Tao, Pengfei Wan

Existing latent diffusion models typically couple scale with content complexity, using more latent tokens to represent higher-resolution images or higher-frame rate videos. However, the latent capacity required to represent visual data primarily depends on content complexity, with scale serving only as an upper bound. Motivated by this observation, we propose DCS-LDM, a novel paradigm for visual generation that decouples information complexity from scale. DCS-LDM constructs a hierarchical, scale-independent latent space that models sample complexity through multi-level tokens and supports decoding to arbitrary resolutions and frame rates within a fixed latent representation. This latent space enables DCS-LDM to achieve a flexible computation-quality tradeoff. Furthermore, by decomposing structural and detailed information across levels, DCS-LDM supports a progressive coarse-to-fine generation paradigm. Experimental results show that DCS-LDM delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.


Paper and project links

PDF 15 pages, 16 figures

Summary

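This paper proposes DCS-LDM, a visual generation paradigm that decouples information complexity from scale by constructing a hierarchical, scale-independent latent space. Sample complexity is modeled through multi-level tokens, and a fixed latent representation can be decoded to arbitrary resolutions and frame rates, enabling a flexible computation-quality tradeoff. Because structural and detailed information is decomposed across levels, the model supports progressive coarse-to-fine generation and flexible generation across diverse scales and visual qualities, with performance comparable to state-of-the-art methods.

Key Takeaways

DCS-LDM is a visual generation paradigm that addresses the coupling of content complexity with scale in existing latent diffusion models.
It constructs a hierarchical, scale-independent latent space that models sample complexity and supports decoding to arbitrary resolutions and frame rates.
It achieves a flexible computation-quality tradeoff within a fixed latent representation.
By decomposing structural and detailed information across levels, it supports progressive coarse-to-fine generation.
It delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.
The latent space covers both images and videos, since a single fixed representation can be decoded at different resolutions and frame rates.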


Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution

Authors: Xiao He, Zhijun Tu, Kun Cheng, Mingrui Zhu, Jie Hu, Nannan Wang, Xinbo Gao

The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through Low-Rank Adaptation (LoRA) module to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation. To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework’s effectiveness and state-of-the-art performance.


Paper and project links

PDF 16 pages, Accepted by AAAI 2026

Summary

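This work investigates integrating sparse Mixture-of-Experts (MoE) into real-world image super-resolution (Real-ISR). Existing approaches mainly fine-tune pre-trained diffusion models through Low-Rank Adaptation (LoRA) modules to reconstruct high-resolution (HR) images; to overcome the limits of these dense models, the authors propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. A fine-grained expert partitioning strategy treats each rank in LoRA as an independent expert, enabling flexible knowledge recombination, while fixed-position ranks are isolated as shared experts to preserve common-sense features and minimize routing redundancy. A degradation estimation module uses CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores that dynamically guide expert activation. To accommodate varying sample complexity, the authors add zero-expert slots and a degradation-aware load-balancing loss that adjusts the number of active experts according to degradation severity. Experiments validate the framework's effectiveness and state-of-the-art performance.

Key Takeaways

Sparse Mixture-of-Experts architectures are being adapted to real-world image super-resolution.
Existing Real-ISR methods mainly fine-tune pre-trained diffusion models through LoRA modules.
Mixture-of-Ranks treats each LoRA rank as an independent expert and keeps fixed-position ranks as shared experts, enabling knowledge recombination and sharing.
CLIP embeddings and predefined positive-negative text pairs yield relative degradation scores that dynamically guide expert activation.
Zero-expert slots accommodate varying sample complexity.
A degradation-aware load-balancing loss adjusts the number of active experts according to degradation severity.
Experiments validate the framework's effectiveness and state-of-the-art performance.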
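
To make the idea of treating each LoRA rank as an expert concrete, here is a minimal NumPy sketch. It assumes the usual LoRA factorization (a down-projection `A` and an up-projection `B`), keeps the first `n_shared` ranks always active, and routes the remaining ranks with a top-k gate in which extra zero-expert slots contribute nothing when selected. The function name, the gate design, and every parameter are assumptions made for illustration; the paper's router and its degradation-aware scoring are not reproduced here.

```python
import numpy as np

def mor_delta(x, A, B, router_logits, n_shared=1, top_k=2):
    """Gated LoRA update where every rank acts as one expert (hypothetical sketch).

    x: (d_in,) input, A: (r, d_in) down-projection, B: (d_out, r) up-projection.
    router_logits: scores for the r - n_shared routed ranks, followed by the
    scores of any zero-expert slots.
    Returns the update B @ (gate * (A @ x)) and the per-rank gate vector.
    """
    x, A, B = (np.asarray(v, dtype=float) for v in (x, A, B))
    logits = np.asarray(router_logits, dtype=float)
    r = A.shape[0]
    n_routed = r - n_shared
    if logits.shape[0] < n_routed:
        raise ValueError("need one router logit per routed rank")

    chosen = np.argsort(logits)[-top_k:]             # indices of the top-k slots
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights /= weights.sum()                         # softmax over the chosen slots

    gate = np.zeros(r)
    gate[:n_shared] = 1.0                            # shared ranks are always active
    for slot, w in zip(chosen, weights):
        if slot < n_routed:                          # a zero-expert slot adds nothing
            gate[n_shared + slot] = w
    return B @ (gate * (A @ x)), gate
```

When a zero-expert slot lands in the top-k, fewer real ranks are activated for that input, which is one simple way to let the amount of computation vary from sample to sample.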


Phased One-Step Adversarial Equilibrium for Video Diffusion Models

Authors: Jiaxiang Cheng, Bing Ma, Xuhua Ren, Hongyi Henry Jin, Kai Yu, Peng Zhang, Wenyue Li, Yuan Zhou, Tianxiang Zheng, Qinglin Lu

Video diffusion generation suffers from critical sampling efficiency bottlenecks, particularly for large-scale models and long contexts. Existing video acceleration methods, adapted from image-based techniques, lack a single-step distillation ability for large-scale video models and task generalization for conditional downstream tasks. To bridge this gap, we propose the Video Phased Adversarial Equilibrium (V-PAE), a distillation framework that enables high-quality, single-step video generation from large-scale video models. Our approach employs a two-phase process. (i) Stability priming is a warm-up process to align the distributions of real and generated videos. It improves the stability of single-step adversarial distillation in the following process. (ii) Unified adversarial equilibrium is a flexible self-adversarial process that reuses generator parameters for the discriminator backbone. It achieves a co-evolutionary adversarial equilibrium in the Gaussian noise space. For the conditional tasks, we primarily preserve video-image subject consistency, which is caused by semantic degradation and conditional frame collapse during the distillation training in image-to-video (I2V) generation. Comprehensive experiments on VBench-I2V demonstrate that V-PAE outperforms existing acceleration methods by an average of 5.8% in the overall quality score, including semantic alignment, temporal coherence, and frame quality. In addition, our approach reduces the diffusion latency of the large-scale video model (e.g., Wan2.1-I2V-14B) by 100 times, while preserving competitive performance.


Paper and project links

PDF Accepted in AAAI 2026. Renamed from POSE to V-PAE to avoid ambiguity. Project Page: https://v-pae.github.io/

Summary

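This paper proposes Video Phased Adversarial Equilibrium (V-PAE), a distillation framework that addresses the sampling efficiency bottleneck of video diffusion generation and enables single-step generation from large-scale video models. It uses a two-phase process: stability priming followed by unified adversarial equilibrium. For conditional tasks, it focuses on preserving video-image subject consistency. Experiments on VBench-I2V show that V-PAE outperforms existing acceleration methods by an average of 5.8% in overall quality score and reduces the diffusion latency of a large-scale video model (e.g., Wan2.1-I2V-14B) by 100 times.

Key Takeaways

Video diffusion generation suffers from critical sampling efficiency bottlenecks, particularly for large-scale models and long contexts.
Existing video acceleration methods lack single-step distillation ability for large-scale video models and task generalization for conditional downstream tasks.
V-PAE is a distillation framework that enables high-quality, single-step video generation from large-scale video models.
It uses a two-phase process: stability priming aligns the distributions of real and generated videos, and unified adversarial equilibrium reuses generator parameters for the discriminator backbone.
For conditional tasks, V-PAE focuses on preserving video-image subject consistency.
On VBench-I2V, V-PAE outperforms existing acceleration methods by an average of 5.8% in overall quality score.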


One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion

Authors: Jinxi Liu, Zijian He, Guangrun Wang, Guanbin Li, Liang Lin

Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios - for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce OMFA (One Model For All), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. OMFA is inspired by language modeling, where generation is guided by conditioning prompts. However, our framework differs fundamentally from LLMs in two key aspects. First, it employs a bidirectional modeling paradigm that symmetrically allows prompting either from the garment to generate try-on results or from the dressed person to recover the try-off garment. Second, it strictly adheres to Tweedie’s formula, enabling faithful estimation of the underlying data distribution during the denoising process. Instead of imposing lower body constraints, OMFA is an entirely mask-free framework that requires only a single portrait and a target garment as input, and is designed to support flexible outfit combinations and cross-person garment transfer, making it better aligned with practical usage scenarios. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical solution for virtual garment synthesis.


Paper and project links

PDF

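Summary

Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic, end-to-end garment synthesis. Most existing methods, however, still rely on exhibition garments and segmentation masks and handle flexible pose variations poorly, which limits their practicality: users cannot easily transfer a garment worn by one person onto another, and the generated results are typically restricted to the pose of the reference image. This paper introduces OMFA (One Model For All), a unified diffusion framework for virtual try-on and try-off that needs no exhibition garments and supports arbitrary poses. OMFA is inspired by language modeling, where generation is guided by conditioning prompts, but differs from LLMs in two respects. First, its bidirectional modeling paradigm allows prompting either from the garment to generate a try-on result or from the dressed person to recover the try-off garment. Second, it strictly adheres to Tweedie's formula, enabling faithful estimation of the underlying data distribution during denoising. OMFA is entirely mask-free, requires only a single portrait and a target garment as input, and is designed to support flexible outfit combinations and cross-person garment transfer. With SMPL-X-based pose conditioning, it supports multi-view and arbitrary-pose try-on from a single image. Experiments show state-of-the-art results on both try-on and try-off tasks.

Key Takeaways

Existing image-based virtual try-on methods are constrained by their reliance on exhibition garments and segmentation masks and by their limited handling of flexible pose variations.
OMFA is a unified diffusion framework for virtual try-on and try-off that needs no exhibition garments and supports arbitrary poses.
Its bidirectional modeling paradigm allows prompting from the garment to generate a try-on result, or from the dressed person to recover the try-off garment.
It strictly adheres to Tweedie's formula, enabling faithful estimation of the underlying data distribution during denoising.
It is entirely mask-free and requires only a single portrait and a target garment as input.
It supports flexible outfit combinations and cross-person garment transfer.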
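
For readers unfamiliar with Tweedie's formula, the sketch below shows its generic form for a Gaussian-noised sample `x_t = alpha_t * x_0 + sigma_t * eps`: the posterior mean of the clean sample is `(x_t + sigma_t**2 * score) / alpha_t`, where `score` is the gradient of the log density of `x_t`. This is the textbook identity only. How OMFA applies it inside its bidirectional framework is not described in the abstract, and the function names here are illustrative.

```python
import numpy as np

def tweedie_x0(x_t, score, alpha_t, sigma_t):
    """Posterior-mean estimate of x_0 given x_t and the score of p(x_t)."""
    x_t = np.asarray(x_t, dtype=float)
    score = np.asarray(score, dtype=float)
    return (x_t + sigma_t ** 2 * score) / alpha_t

def tweedie_x0_from_eps(x_t, eps_hat, alpha_t, sigma_t):
    """Same estimate when the network predicts noise instead of the score.

    For Gaussian noise the score is approximated by -eps_hat / sigma_t,
    which reduces the formula to (x_t - sigma_t * eps_hat) / alpha_t.
    """
    score = -np.asarray(eps_hat, dtype=float) / sigma_t
    return tweedie_x0(x_t, score, alpha_t, sigma_t)
```

If the noise prediction is exact, the estimate recovers the clean sample; in practice `eps_hat` comes from a denoising network and the result is only the conditional mean.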


FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing

Authors: Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Dongyang Jin, Ryan Xu, Dong Nie, Lei Sun, Xiangxiang Chu

Scene text editing aims to modify or add texts on images while ensuring text fidelity and overall visual quality consistent with the background. Recent methods are primarily built on UNet-based diffusion models, which have improved scene text editing results, but still struggle with complex glyph structures, especially for non-Latin ones (e.g., Chinese, Korean, Japanese). To address these issues, we present FLUX-Text, a simple and advanced multilingual scene text editing DiT method. Specifically, our FLUX-Text enhances glyph understanding and generation through lightweight Visual and Text Embedding Modules, while preserving the original generative capability of FLUX. We further propose a Regional Text Perceptual Loss tailored for text regions, along with a matching two-stage training strategy to better balance text editing and overall image quality. Benefiting from the DiT-based architecture and lightweight feature injection modules, FLUX-Text can be trained with only 0.1M training examples, a 97% reduction compared to 2.9M required by popular methods. Extensive experiments on multiple public datasets, including English and Chinese benchmarks, demonstrate that our method surpasses other methods in visual quality and text fidelity. All the code is available at https://github.com/AMAP-ML/FluxText.


Paper and project links

PDF 10 pages, 5 figures

Summary

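UNet-based diffusion models have improved scene text editing but still struggle with complex glyph structures, especially in non-Latin scripts (e.g., Chinese, Korean, Japanese). To address this, the paper presents FLUX-Text, a simple and advanced multilingual scene text editing method built on a DiT. It enhances glyph understanding and generation through lightweight Visual and Text Embedding Modules while preserving the original generative capability of FLUX. A Regional Text Perceptual Loss tailored to text regions, together with a matching two-stage training strategy, better balances text editing and overall image quality. FLUX-Text can be trained with only 0.1M training examples, a 97% reduction compared to the 2.9M required by popular methods. Experiments on multiple public datasets show that it surpasses other methods in visual quality and text fidelity.

Key Takeaways

FLUX-Text is a multilingual scene text editing method that targets complex glyph structures, especially in non-Latin scripts.
Lightweight Visual and Text Embedding Modules enhance glyph understanding and generation.
A Regional Text Perceptual Loss is tailored to text regions.
A matching two-stage training strategy balances text editing and overall image quality.
FLUX-Text can be trained with only 0.1M training examples, a 97% reduction compared to the 2.9M required by popular methods.
Experiments on multiple public datasets, including English and Chinese benchmarks, show that FLUX-Text surpasses other methods in visual quality and text fidelity.
All the code is publicly available.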


VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models

Authors: Mohammadreza Teymoorianfard, Siddarth Sitaraman, Shiqing Ma, Amir Houmansadr

Video diffusion models can generate realistic and temporally consistent videos. This raises concerns about provenance, ownership, and integrity. Watermarking can help address these issues by embedding metadata directly into the content. To work well, a watermark needs enough capacity for meaningful metadata. It must also stay imperceptible and remain robust to common video manipulations. Existing methods struggle with limited capacity, extra inference cost, or reduced visual quality. We introduce VidStamp, a watermarking framework that embeds frame-level messages through the decoder of a latent video diffusion model. The decoder is fine-tuned in two stages. The first stage uses static image datasets to encourage spatial message separation. The second stage uses synthesized video sequences to restore temporal consistency. This approach enables high-capacity watermarks with minimal perceptual impact. VidStamp also supports dynamic watermarking through a control signal that selects message templates during inference. This adds flexibility and creates a second channel for communication. We evaluate VidStamp on Stable Video Diffusion (I2V), OpenSora, and Wan (T2V). The system embeds 48 bits per frame while preserving visual quality and staying robust to common distortions. Compared with VideoSeal, VideoShield, and RivaGAN, it achieves lower log P-values and stronger detectability. Its frame-wise watermarking design also enables precise temporal tamper localization, with an accuracy of 0.96, which exceeds the VideoShield baseline. Code: https://github.com/SPIN-UMass/VidStamp


Paper and project links

PDF

Summary

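Video diffusion models can generate realistic and temporally consistent videos, which raises concerns about provenance, ownership, and integrity. Watermarking addresses these concerns by embedding metadata directly into the content; an effective watermark needs enough capacity, must stay imperceptible, and must remain robust to common video manipulations. VidStamp is a watermarking framework that embeds frame-level messages through the decoder of a latent video diffusion model. The decoder is fine-tuned in two stages: the first uses static image datasets to encourage spatial message separation, and the second uses synthesized video sequences to restore temporal consistency. This enables high-capacity watermarks with minimal perceptual impact. VidStamp also supports dynamic watermarking through a control signal that selects message templates during inference, which adds flexibility and creates a second channel for communication. Evaluated on Stable Video Diffusion (I2V), OpenSora, and Wan (T2V), the system embeds 48 bits per frame while preserving visual quality, stays robust to common distortions, and localizes temporal tampering with an accuracy of 0.96.

Key Takeaways

The realism and temporal consistency of videos generated by video diffusion models raise concerns about provenance, ownership, and integrity.
Watermarking helps address these concerns by embedding metadata directly into the content.
VidStamp embeds frame-level messages through the decoder of a latent video diffusion model, enabling high-capacity watermarks with minimal perceptual impact.
The decoder is fine-tuned in two stages: the first uses static image datasets to encourage spatial message separation, and the second uses synthesized video sequences to restore temporal consistency.
VidStamp supports dynamic watermarking through a control signal that selects message templates during inference, adding flexibility and a second channel for communication.
Evaluated on Stable Video Diffusion (I2V), OpenSora, and Wan (T2V), VidStamp embeds 48 bits per frame while staying robust to common distortions.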
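
Because every frame carries its own message, temporal tampering can in principle be localized by checking each frame's decoded bits against the bits expected for that position. The sketch below illustrates that idea under stated assumptions: the bits have already been extracted by some watermark decoder (not shown), and a frame is flagged when its bit accuracy falls below a hypothetical threshold. It is not the VidStamp detector, and the threshold value is invented for the example.

```python
import numpy as np

def localize_tampering(decoded_bits, expected_bits, min_bit_acc=0.9):
    """Flag frames whose decoded message disagrees with the expected one.

    decoded_bits, expected_bits: (n_frames, n_bits) arrays of 0/1 values,
    e.g. 48 bits per frame. `min_bit_acc` is an illustrative threshold.
    Returns the per-frame bit accuracy and the list of flagged frame indices.
    """
    decoded = np.asarray(decoded_bits)
    expected = np.asarray(expected_bits)
    bit_acc = (decoded == expected).mean(axis=1)    # fraction of matching bits
    flagged = np.flatnonzero(bit_acc < min_bit_acc).tolist()
    return bit_acc, flagged
```

A frame that has been replaced or inserted would typically decode to near-random bits, so its accuracy drops toward 0.5 while mildly distorted frames stay close to 1.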


Human Motion Unlearning

Authors: Edoardo De Matteis, Matteo Migliarini, Alessio Sampieri, Indro Spinelli, Fabio Galasso

We introduce the task of human motion unlearning to prevent the synthesis of toxic animations while preserving the general text-to-motion generative performance. Unlearning toxic motions is challenging as those can be generated from explicit text prompts and from implicit toxic combinations of safe motions (e.g., “kicking” is “loading and swinging a leg”). We propose the first motion unlearning benchmark by filtering toxic motions from the large and recent text-to-motion datasets of HumanML3D and Motion-X. We propose baselines, by adapting state-of-the-art image unlearning techniques to process spatio-temporal signals. Finally, we propose a novel motion unlearning model based on Latent Code Replacement, which we dub LCR. LCR is training-free and suitable to the discrete latent spaces of state-of-the-art text-to-motion diffusion models. LCR is simple and consistently outperforms baselines qualitatively and quantitatively. Project page: https://www.pinlab.org/hmu.


Paper and project links

PDF

Summary

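This paper introduces the task of human motion unlearning, which aims to prevent the synthesis of toxic animations while preserving general text-to-motion generative performance. Unlearning toxic motions is challenging because they can be generated from explicit text prompts and from implicit toxic combinations of safe motions. The authors propose the first motion unlearning benchmark by filtering toxic motions from the large, recent text-to-motion datasets HumanML3D and Motion-X, and they build baselines by adapting state-of-the-art image unlearning techniques to process spatio-temporal signals. Finally, they propose LCR (Latent Code Replacement), a training-free motion unlearning model suited to the discrete latent spaces of state-of-the-art text-to-motion diffusion models. LCR is simple and consistently outperforms the baselines qualitatively and quantitatively.

Key Takeaways

The paper introduces the task of human motion unlearning to prevent the synthesis of toxic animations.
Unlearning toxic motions is challenging because they can arise from explicit text prompts and from implicit toxic combinations of safe motions.
It establishes the first motion unlearning benchmark, built from the HumanML3D and Motion-X text-to-motion datasets.
Baselines adapt state-of-the-art image unlearning techniques to process spatio-temporal signals.
LCR, a motion unlearning model based on Latent Code Replacement, is suited to the discrete latent spaces of text-to-motion diffusion models.
LCR is training-free and consistently outperforms the baselines qualitatively and quantitatively.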
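
The abstract says only that LCR replaces latent codes in a discrete latent space without training. To show what code replacement can look like in such a space, the sketch below assumes a vector-quantized codebook and a known set of code indices to suppress, and swaps each one for its nearest remaining codebook entry. The nearest-neighbor rule, the function names, and the way toxic codes are identified are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def build_replacement(codebook, toxic_ids):
    """Map each toxic code index to the closest non-toxic codebook entry.

    codebook: (n_codes, dim) embedding table; toxic_ids: indices to suppress.
    Choosing the nearest safe code is an illustrative assumption.
    """
    codebook = np.asarray(codebook, dtype=float)
    toxic = set(toxic_ids)
    safe = [i for i in range(len(codebook)) if i not in toxic]
    mapping = {}
    for t in sorted(toxic):
        dists = np.linalg.norm(codebook[safe] - codebook[t], axis=1)
        mapping[t] = safe[int(np.argmin(dists))]
    return mapping

def replace_latent_codes(code_ids, mapping):
    """Rewrite a sequence of code indices; no model weights are changed."""
    return [mapping.get(c, c) for c in code_ids]
```

Since only the index sequence is edited at inference time, a scheme of this kind needs no gradient updates, which matches the training-free property the paper highlights.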


Zero-Shot Video Translation via Token Warping

Authors: Haiming Zhu, Yangyang Xu, Jun Yu, Shengfeng He

With the revolution of generative AI, video-related tasks have been widely studied. However, current state-of-the-art video models still lag behind image models in visual quality and user control over generated content. In this paper, we introduce TokenWarping, a novel framework for temporally coherent video translation. Existing diffusion-based video editing approaches rely solely on key and value patches in self-attention to ensure temporal consistency, often sacrificing the preservation of local and structural regions. Critically, these methods overlook the significance of the query patches in achieving accurate feature aggregation and temporal coherence. In contrast, TokenWarping leverages complementary token priors by constructing temporal correlations across different frames. Our method begins by extracting optical flows from source videos. During the denoising process of the diffusion model, these optical flows are used to warp the previous frame’s query, key, and value patches, aligning them with the current frame’s patches. By directly warping the query patches, we enhance feature aggregation in self-attention, while warping the key and value patches ensures temporal consistency across frames. This token warping imposes explicit constraints on the self-attention layer outputs, effectively ensuring temporally coherent translation. Our framework does not require any additional training or fine-tuning and can be seamlessly integrated with existing text-to-image editing methods. We conduct extensive experiments on various video translation tasks, demonstrating that TokenWarping surpasses state-of-the-art methods both qualitatively and quantitatively. Video demonstrations are available in supplementary materials.


Paper and project links

PDF

Summary

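This paper introduces TokenWarping, a framework for temporally coherent video translation. It leverages complementary token priors by constructing temporal correlations across frames. Optical flows are extracted from the source video and, during the denoising process of the diffusion model, used to warp the previous frame's query, key, and value patches so that they align with the current frame's patches. Warping the query patches enhances feature aggregation in self-attention, while warping the key and value patches ensures temporal consistency across frames. The framework requires no additional training or fine-tuning, integrates seamlessly with existing text-to-image editing methods, and surpasses state-of-the-art methods on various video translation tasks.

Key Takeaways

With the revolution of generative AI, video-related tasks have been widely studied.
Current state-of-the-art video models still lag behind image models in visual quality and user control over generated content.
Existing diffusion-based video editing approaches rely solely on key and value patches in self-attention to ensure temporal consistency, often at the expense of local and structural regions.
TokenWarping leverages complementary token priors by constructing temporal correlations across frames.
It uses optical flows extracted from the source video to warp the previous frame's query, key, and value patches: warping queries enhances feature aggregation in self-attention, and warping keys and values ensures temporal consistency.
The framework requires no additional training or fine-tuning and integrates seamlessly with existing text-to-image editing methods.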
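
The warping step can be pictured as resampling the previous frame's token grid along the optical flow. The sketch below is a simplified nearest-neighbor version under assumed conventions: tokens are laid out on an `(H, W)` grid, and `flow[y, x] = (dx, dy)` points from a position in the current frame to its source in the previous frame, measured in token units. The paper's actual interpolation, flow resolution, and occlusion handling are not given in the abstract.

```python
import numpy as np

def warp_tokens(prev_tokens, flow):
    """Align the previous frame's tokens with the current frame.

    prev_tokens: (H, W, C) grid of query, key, or value patches.
    flow: (H, W, 2) backward flow (dx, dy) in token units (assumed convention).
    Source positions are rounded to the nearest token and clipped to the grid.
    """
    prev_tokens = np.asarray(prev_tokens)
    flow = np.asarray(flow, dtype=float)
    h, w = prev_tokens.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_tokens[src_y, src_x]                # gather along the flow
```

The same function would be applied separately to the query, key, and value grids of the previous frame before they are used in the current frame's self-attention.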


Spatial-and-Frequency-aware Restoration method for Images based on Diffusion Models

Authors: Kyungsung Lee, Donggyu Lee, Myungjoo Kang

Diffusion models have recently emerged as a promising framework for Image Restoration (IR), owing to their ability to produce high-quality reconstructions and their compatibility with established methods. Existing methods for solving noisy inverse problems in IR consider pixel-wise data fidelity. In this paper, we propose SaFaRI, a spatial-and-frequency-aware diffusion model for IR with Gaussian noise. Our model encourages images to preserve data fidelity in both the spatial and frequency domains, resulting in enhanced reconstruction quality. We comprehensively evaluate the performance of our model on a variety of noisy inverse problems, including inpainting, denoising, and super-resolution. Our thorough evaluation demonstrates that SaFaRI achieves state-of-the-art performance on both the ImageNet and FFHQ datasets, outperforming existing zero-shot IR methods in terms of LPIPS and FID metrics.


Paper and project links

PDF

Summary

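Diffusion models are a promising framework for Image Restoration (IR) because they produce high-quality reconstructions and are compatible with established methods. For noisy inverse problems, this paper proposes SaFaRI, a spatial-and-frequency-aware diffusion model for IR with Gaussian noise. The model encourages images to preserve data fidelity in both the spatial and frequency domains, which improves reconstruction quality. Evaluated on a variety of noisy inverse problems, including inpainting, denoising, and super-resolution, SaFaRI achieves state-of-the-art performance on the ImageNet and FFHQ datasets, outperforming existing zero-shot IR methods in terms of LPIPS and FID.

Key Takeaways

Diffusion models are a promising framework for image restoration because they produce high-quality reconstructions.
Existing methods for noisy inverse problems consider pixel-wise data fidelity.
SaFaRI encourages data fidelity in both the spatial and frequency domains.
It is evaluated on noisy inverse problems including inpainting, denoising, and super-resolution.
It achieves state-of-the-art performance on the ImageNet and FFHQ datasets.
It outperforms existing zero-shot IR methods in terms of LPIPS and FID.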
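
As a loose illustration of what fidelity in both domains can mean, the sketch below scores a predicted measurement against the observed one with a pixel-wise mean squared error plus a penalty on the difference of their Fourier amplitude spectra. The amplitude-based frequency term, the weighting parameter, and the function name are assumptions chosen for the example; the abstract does not give SaFaRI's actual fidelity terms or how they guide sampling.

```python
import numpy as np

def spatial_frequency_fidelity(pred, obs, freq_weight=1.0):
    """Data-fidelity score combining pixel and Fourier-amplitude errors.

    pred, obs: 2-D arrays of the same shape (e.g. a degraded estimate and
    the noisy observation). `freq_weight` balances the two terms and is an
    illustrative parameter.
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    spatial = np.mean((pred - obs) ** 2)             # pixel-wise fidelity
    amp_pred = np.abs(np.fft.fft2(pred, norm="ortho"))
    amp_obs = np.abs(np.fft.fft2(obs, norm="ortho"))
    frequency = np.mean((amp_pred - amp_obs) ** 2)   # amplitude-spectrum fidelity
    return spatial + freq_weight * frequency
```

Comparing amplitudes rather than complex coefficients is a deliberate choice here: with an orthonormal transform, the squared error of the complex spectra equals the pixel-wise squared error, so it would add no new information.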


Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-11-22/Diffusion%20Models/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.

Diffusion Models
