
Diffusion Models


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never rely on them in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-27

PixelDiT: Pixel Diffusion Transformers for Image Generation

Authors:Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
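
To make the dual-level idea concrete, below is a toy sketch (not the paper's implementation; the module sizes, the pooling used to form patch tokens, and the way global context is broadcast back to pixels are all assumptions) of a patch-level transformer for global semantics feeding a pixel-level transformer that refines pixels within each patch.

```python
import torch
import torch.nn as nn

class ToyDualLevelBlock(nn.Module):
    """Toy stand-in for the dual-level design: patch-level then pixel-level attention."""
    def __init__(self, patch=8, dim=64):
        super().__init__()
        self.patch, self.dim = patch, dim
        self.pix_embed = nn.Linear(3, dim)                         # per-pixel embedding
        self.patch_level = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.pixel_level = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.to_rgb = nn.Linear(dim, 3)

    def forward(self, x):                                          # x: (B, 3, H, W) noisy image
        B, _, H, W = x.shape
        p = self.patch
        t = x.unfold(2, p, p).unfold(3, p, p)                      # (B, 3, H/p, W/p, p, p)
        t = t.permute(0, 2, 3, 4, 5, 1).reshape(B, -1, p * p, 3)   # (B, N, p*p, 3) pixel groups
        tok = self.pix_embed(t)                                    # pixel tokens
        patch_tok = self.patch_level(tok.mean(dim=2))              # global semantics over patches
        tok = tok + patch_tok.unsqueeze(2)                         # broadcast global context
        tok = self.pixel_level(tok.flatten(0, 1))                  # refine pixels within each patch
        return self.to_rgb(tok.reshape(B, -1, p * p, self.dim))    # per-pixel prediction

out = ToyDualLevelBlock()(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 16, 64, 3])
```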

Paper and Project Links

PDF

Summary

This paper addresses the latent-space modeling problem of Diffusion Transformers (DiTs) by proposing PixelDiT, a single-stage, end-to-end model. The model learns the diffusion process directly in pixel space without an autoencoder, avoiding the lossy reconstruction and error accumulation of the traditional two-stage pipeline. PixelDiT adopts a dual-level, fully transformer-based design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details. Experiments show that effective pixel-level token modeling is key to the success of pixel diffusion. PixelDiT achieves an FID of 1.61 on ImageNet 256x256, far surpassing existing pixel generative models. It further extends to text-to-image generation, pretrained in pixel space at 1024x1024 resolution, reaching 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.

Key Takeaways

  1. PixelDiT removes the problems the autoencoder introduces in latent-space modeling for Diffusion Transformers (DiTs), including lossy reconstruction, error accumulation, and the obstacle to joint optimization.
  2. PixelDiT is a single-stage, end-to-end model that learns the diffusion process directly in pixel space.
  3. PixelDiT uses a dual-level, fully transformer-based architecture: a patch-level DiT for global semantics and a pixel-level DiT for texture details.
  4. Effective pixel-level token modeling is the key to PixelDiT's success.
  5. PixelDiT reaches an FID of 1.61 on ImageNet 256x256, clearly outperforming other pixel generative models.
  6. PixelDiT extends to text-to-image generation and is pretrained in pixel space at high resolution.

Cool Papers

Click here to view paper screenshots

MotionV2V: Editing Motion in a Video

Authors:Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa, Andrey Voynov, Nataniel Ruiz

While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a “motion edit” and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating “motion counterfactuals”, video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V
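
As an illustration of the trajectory-based representation, the sketch below encodes a "motion edit" as the per-point deviation between input and edited sparse tracks; the array layout and the way the start frame is derived are assumptions, not the paper's data format.

```python
import numpy as np

def motion_edit(input_tracks, edited_tracks, eps=1e-3):
    """input_tracks, edited_tracks: (T, N, 2) pixel coordinates of N sparse tracks over T frames."""
    delta = edited_tracks - input_tracks                 # per-frame, per-track deviation
    edited = np.linalg.norm(delta, axis=-1) > eps        # (T, N) where the motion was changed
    start_frame = int(np.argmax(edited.any(axis=1))) if edited.any() else -1
    return {"delta": delta, "edited_mask": edited, "start_frame": start_frame}

T, N = 16, 8
tracks = np.cumsum(np.random.randn(T, N, 2), axis=0)              # toy input trajectories
edited_tracks = tracks.copy()
edited_tracks[8:, :2] += np.linspace(1, 5, T - 8)[:, None, None]  # edit two tracks from frame 8 on
edit = motion_edit(tracks, edited_tracks)
print(edit["start_frame"])  # 8 -> the edit can start at any timestamp and propagate from there
```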

Paper and Project Links

PDF

Summary

This paper examines the challenge of applying generative video models to video editing and proposes modifying video motion by directly editing sparse trajectories extracted from the input. It introduces the notion of a "motion edit", the deviation between input and output trajectories, which, coupled with a generative backbone, enables powerful video editing. The authors build a pipeline for generating "motion counterfactuals", video pairs with identical content but distinct motion, and fine-tune a motion-conditioned video diffusion architecture on this data, allowing edits that start at any timestamp and propagate naturally.

Key Takeaways

  1. Applying generative video models to video editing remains challenging; precise motion control is a promising way to address it.
  2. Video motion is modified by directly editing sparse trajectories extracted from the input.
  3. A "motion edit" is defined as the deviation between the input and output trajectories.
  4. Coupled with a generative backbone, this representation enables powerful video editing.
  5. A "motion counterfactual" dataset of video pairs is built to train the video diffusion model.
  6. Fine-tuning a motion-conditioned video diffusion architecture lets the model handle a range of video editing tasks.

Cool Papers

Click here to view paper screenshots

MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models

Authors:Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, Humphrey Shi

Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
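
A minimal sketch of the map-then-reduce pattern described above, under assumptions: each preference-specific LoRA expert contributes a low-rank update, and the updates are merged back into a shared base weight. The plain-averaging merge rule is illustrative only.

```python
import torch

def reduce_lora_experts(base_weight, experts, scale=1.0):
    """experts: list of (A, B) LoRA factors with A: (r, d_in), B: (d_out, r)."""
    delta = torch.zeros_like(base_weight)
    for A, B in experts:                      # each expert was trained on one reward/preference
        delta += B @ A                        # that expert's low-rank update
    return base_weight + scale * delta / len(experts)   # merged into the shared base ("reduce")

d_out, d_in, r = 16, 32, 4
base = torch.randn(d_out, d_in)
experts = [(0.01 * torch.randn(r, d_in), 0.01 * torch.randn(d_out, r)) for _ in range(3)]
merged = reduce_lora_experts(base, experts)   # would serve as the base model for the next round
print(merged.shape)                           # torch.Size([16, 32])
```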

Paper and Project Links

PDF

Summary

Reinforcement learning from human feedback with reward models improves how well generative models align with human aesthetic and perceptual preferences, but jointly optimizing multiple rewards often incurs an alignment tax. To address this, the paper introduces two complementary methods, MapReduce LoRA and Reward-aware Token Embedding (RaTE). Experiments show clear gains across a range of tasks, and the framework sets a new state of the art for multi-preference alignment across modalities.

Key Takeaways

  1. Reinforcement learning with reward models improves the alignment of generative models with human aesthetic and perceptual preferences.
  2. Jointly optimizing multiple rewards causes an alignment tax that needs to be addressed.
  3. MapReduce LoRA tackles this by training preference-specific LoRA experts in parallel and iteratively merging them to refine a shared base model.
  4. RaTE learns reward-specific token embeddings that compose at inference time for flexible preference control.
  5. In text-to-image experiments, both methods clearly improve metrics such as GenEval, PickScore, and OCR.
  6. In text-to-video generation, visual and motion quality also improve markedly.

Cool Papers

Click here to view paper screenshots

Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

Authors:Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu, Peng Chen, Yu Cheng, Yifu Sun

Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only 2.1% of its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Code is coming soon.
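
The joint training scheme can be pictured as a single objective in which a reward-driven term is regularized by the ongoing distillation loss. The sketch below uses stand-in losses (an MSE distillation term and a REINFORCE-style reward term); the actual DMD-style loss and RL estimator in Flash-DMD are not reproduced.

```python
import torch

def joint_loss(student_out, teacher_out, reward, log_prob, lam=1.0):
    distill_loss = torch.mean((student_out - teacher_out) ** 2)   # stand-in for the distillation term
    rl_loss = -(reward.detach() * log_prob).mean()                # REINFORCE-style stand-in
    return rl_loss + lam * distill_loss                           # distillation regularizes the RL term

loss = joint_loss(student_out=torch.randn(4, 8, requires_grad=True),
                  teacher_out=torch.randn(4, 8),
                  reward=torch.rand(4),
                  log_prob=torch.randn(4, requires_grad=True))
loss.backward()
print(float(loss))
```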

Paper and Project Links

PDF

Summary

Diffusion models are a leading class of generative models, but their iterative sampling is computationally expensive. Timestep distillation can accelerate generation, yet it typically needs extensive training and may degrade image quality, and fine-tuning distilled models with reinforcement learning (RL) for objectives such as aesthetics or user preference is unstable and prone to reward hacking. This work introduces Flash-DMD, a framework that combines distillation with joint RL-based refinement for fast convergence. It first proposes an efficient timestep-aware distillation strategy that improves realism at far lower training cost, outperforming DMD2 with only 2.1% of its training cost. It then introduces a joint training scheme in which the model is fine-tuned with an RL objective while timestep distillation continues, showing that the stable, well-defined distillation loss acts as a strong regularizer that stabilizes RL training and prevents policy collapse. Extensive experiments on score-based and flow matching models show that Flash-DMD converges faster and achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment. Code is coming soon.

Key Takeaways

  1. Diffusion models are a leading class of generative models, but their iterative sampling is computationally expensive.
  2. Timestep distillation can accelerate generation but requires extensive training and may degrade image quality.
  3. Fine-tuning distilled models with reinforcement learning suffers from instability and reward hacking.
  4. Flash-DMD combines an efficient timestep-aware distillation strategy with a joint RL training scheme for fast convergence and high-quality generation.
  5. The stable distillation loss acts as a regularizer that prevents RL instability and policy collapse.
  6. Experiments on multiple models show that Flash-DMD achieves state-of-the-art quality in the few-step sampling regime, outperforming existing methods.

Cool Papers

Click here to view paper screenshots

STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

Authors:Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Angel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai

Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
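
The video-aware Jacobi idea can be illustrated with a generic fixed-point iteration: every frame is updated in parallel from the previous iterate until nothing changes, instead of sweeping through frames sequentially. The update rule below is a toy causal map, not the model's.

```python
import numpy as np

def causal_update(z):
    """Toy causal refinement: frame t is mixed with the running mean of frames <= t."""
    prefix_mean = np.cumsum(z, axis=0) / np.arange(1, len(z) + 1)[:, None]
    return 0.5 * z + 0.5 * prefix_mean

def jacobi_solve(z0, n_iters=100, tol=1e-6):
    z = z0
    for _ in range(n_iters):
        z_next = causal_update(z)                 # every frame updated in parallel
        if np.max(np.abs(z_next - z)) < tol:      # stop once a fixed point is reached
            break
        z = z_next
    return z

frames = np.random.randn(12, 16)                  # 12 frames, 16-dim toy latents
print(jacobi_solve(frames).shape)                 # (12, 16)
```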

Paper and Project Links

PDF 21 pages

Summary

Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data that have recently made encouraging progress on image generation, yet state-of-the-art video generation systems rely almost exclusively on diffusion models because of the far higher spatiotemporal complexity and computational cost. This work revisits that design space with STARFlow-V, a normalizing-flow-based video generator offering end-to-end learning, robust causal prediction, and native likelihood estimation. Building on STARFlow, it operates in a spatiotemporal latent space with a global-local architecture that restricts causal dependencies to a global latent space while preserving rich local within-frame interactions, easing the error accumulation common in autoregressive diffusion generation. It further proposes flow-score matching, which equips the model with a lightweight causal denoiser to improve autoregressive consistency, and a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to its invertible structure, the same model natively supports text-to-video, image-to-video, and video-to-video generation. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion baselines, providing the first evidence that NFs can perform high-quality autoregressive video generation and making them a promising direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.

Key Takeaways

  1. Normalizing flows (NFs) are likelihood-based generative models for continuous data.
  2. STARFlow-V brings their advantages to video generation, including end-to-end learning, causal prediction, and likelihood estimation.
  3. STARFlow-V operates in a spatiotemporal latent space with a global-local architecture to handle spatiotemporal complexity.
  4. The proposed flow-score matching improves the consistency of video generation.
  5. A video-aware Jacobi iteration scheme improves sampling efficiency while preserving causality.
  6. STARFlow-V supports multiple generation tasks, including text-to-video, image-to-video, and video-to-video.
  7. Empirical results show strong visual fidelity, temporal consistency, and sampling efficiency, establishing normalizing flows as a promising research direction for video generation.

Cool Papers

Click here to view paper screenshots

TReFT: Taming Rectified Flow Models For One-Step Image Translation

Authors:Shengqian Li, Ming Gao, Yi Liu, Zuzeng Lin, Feng Wang, Feng Dai

Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly multi-step denoising, hindering real-time applications. Although the recent adversarial training paradigm, CycleGAN-Turbo, works in pretrained diffusion models for one-step image translation, we find that directly applying it to RF models leads to severe convergence issues. In this paper, we analyze these challenges and propose TReFT, a novel method to Tame Rectified Flow models for one-step image Translation. Unlike previous works, TReFT directly uses the velocity predicted by pretrained DiT or UNet as output-a simple yet effective design that tackles the convergence issues under adversarial training with one-step inference. This design is mainly motivated by a novel observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from origin to the final clean image, a property we further justify through theoretical analysis. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, we introduce memory-efficient latent cycle-consistency and identity losses during training, as well as lightweight architectural simplifications for faster inference. Pretrained RF models finetuned with TReFT achieve performance comparable to sota methods across multiple image translation datasets while enabling real-time inference.
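
A minimal sketch of the one-step inference rule stated in the abstract, assuming a velocity-prediction interface for the finetuned DiT/UNet: the velocity predicted near the end of the trajectory is taken directly as the translated image.

```python
import torch

@torch.no_grad()
def one_step_translate(velocity_model, x_src, t_end=1.0):
    t = torch.full((x_src.shape[0],), t_end, device=x_src.device)
    v = velocity_model(x_src, t)      # velocity predicted near the end of the trajectory
    return v                          # taken directly as the translated image

toy_model = lambda x, t: torch.tanh(x)            # placeholder for the finetuned DiT/UNet
print(one_step_translate(toy_model, torch.randn(2, 3, 64, 64)).shape)
```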

Paper and Project Links

PDF

Summary

This paper analyzes the challenges Rectified Flow (RF) models face in image-to-image translation and proposes TReFT, a method that uses the velocity predicted by the pretrained model directly as the output, resolving the convergence issues that arise under adversarial training with one-step inference. The approach enables real-time inference while remaining competitive with state-of-the-art methods on multiple image translation datasets.

Key Takeaways

  1. Rectified Flow (RF) models achieve high-quality image and video synthesis via optimal transport theory, but image-to-image translation still relies on costly multi-step denoising, which hinders real-time use.
  2. The CycleGAN-Turbo adversarial training paradigm enables one-step image translation with pretrained diffusion models, but applying it directly to RF models causes severe convergence problems.
  3. TReFT addresses these convergence issues by using the velocity predicted by the pretrained DiT or UNet directly as the output, enabling one-step inference.
  4. The design rests on the observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from the origin to the final clean image.
  5. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, memory-efficient latent cycle-consistency and identity losses are introduced during training, along with lightweight architectural simplifications for faster inference.
  6. Pretrained RF models finetuned with TReFT match state-of-the-art methods on multiple image translation datasets while enabling real-time inference.

Cool Papers

Click here to view paper screenshots

HVAdam: A Full-Dimension Adaptive Optimizer

Authors:Yiheng Zhang, Shaowu Wu, Yuanzhuo Xu, Jiajun Wu, Shang Xu, Steve Drew, Xiaoguang Niu

Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer’s ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad’s hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.
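
For intuition only, the sketch below shows one generic way to make adaptivity continuously tunable, by raising the second-moment preconditioner to a power between 0 (SGD-like) and 0.5 (Adam-like). This is not the Anon update rule and omits the IDU mechanism entirely.

```python
import torch

def tunable_step(p, grad, m, v, a=0.25, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m.mul_(b1).add_(grad, alpha=1 - b1)            # first moment (momentum)
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)  # second moment
    precond = (v + eps) ** a                       # a=0 -> constant (SGD-like); a=0.5 -> Adam-like
    p.sub_(lr * m / precond)
    return p, m, v

p, m, v = torch.zeros(5), torch.zeros(5), torch.zeros(5)
for _ in range(100):
    grad = 2 * (p - torch.ones(5))                 # gradient of ||p - 1||^2
    p, m, v = tunable_step(p, grad, m, v)
print(p)                                           # moves toward the minimizer at 1
```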

Paper and Project Links

PDF

Summary

Adaptive optimizers such as Adam have been very successful for training large-scale models like diffusion models and large language models, but they generalize worse on classical architectures such as CNNs. The paper identifies the adaptivity restriction in the pre-conditioner as a key cause of this gap and proposes Anon, a new optimizer with continuously tunable adaptivity that can interpolate between SGD-like and Adam-like behavior and even extrapolate beyond both. The incremental delay update (IDU) mechanism ensures convergence across the entire adaptivity range. Theoretical analysis and experiments show that Anon outperforms existing optimizers on image classification, diffusion, and language modeling tasks, indicating that adaptivity is a valuable tunable design principle and that Anon provides the first unified, reliable framework that bridges classical and modern optimizers while surpassing their respective strengths.

Key Takeaways

  1. Adaptive optimizers such as Adam excel at training large-scale models but generalize worse on some tasks.
  2. A key cause of the performance gap is the adaptivity restriction in the pre-conditioner.
  3. Anon is a new optimizer with continuously tunable adaptivity that interpolates between SGD and Adam.
  4. Anon ensures convergence through the incremental delay update (IDU) mechanism.
  5. Both theoretical analysis and experiments show that Anon performs strongly across a variety of tasks.
  6. Adaptivity can serve as a valuable tunable design principle.

Cool Papers

Click here to view paper screenshots

Advancing Image Classification with Discrete Diffusion Classification Modeling

Authors:Omer Belhasin, Shelly Golan, Ran El-Yaniv, Michael Elad

Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging. We release our code at https://github.com/omerb01/didicm .

Paper and Project Links

PDF

Summary

Discrete Diffusion Classification Modeling (DiDiCM) is a new framework for image classification under high uncertainty. It models the posterior distribution of class labels conditioned on the input image through a diffusion-based procedure, achieving strong performance when training data are limited or input images are corrupted. DiDiCM supports diffusion-based predictions on either class probabilities or discrete class labels, offering flexible compute-memory trade-offs. Experiments on ImageNet show that a few diffusion iterations yield higher classification accuracy than baseline classifiers, with the gains growing as the task becomes more challenging.

Key Takeaways

  1. DiDiCM is a new framework for image classification under high-uncertainty conditions in computer vision.
  2. Conventional classifiers predict class labels directly, which can be suboptimal under high uncertainty.
  3. DiDiCM uses a diffusion procedure to model the posterior distribution of class labels, improving performance.
  4. DiDiCM supports diffusion-based prediction on either class probabilities or discrete class labels, providing flexibility in computation and memory.
  5. Experiments on ImageNet show that DiDiCM outperforms standard classifiers, with accuracy gains increasing as the task becomes more challenging.
  6. The code is publicly available in the referenced GitHub repository.

Cool Papers

Click here to view paper screenshots

Text-guided Controllable Diffusion for Realistic Camouflage Images Generation

Authors:Yuhang Qian, Haiyan Chen, Wentong Li, Ningzhong Liu, Jie Qin

Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended and exhibit high visual consistency with their surroundings. Existing methods perform CIG by either fusing objects into specific backgrounds or outpainting the surroundings via foreground object-guided diffusion. However, they often fail to obtain natural results because they overlook the logical relationship between camouflaged objects and background environments. To address this issue, we propose CT-CIG, a Controllable Text-guided Camouflage Images Generation method that produces realistic and logically plausible camouflage images. Leveraging Large Visual Language Models (VLM), we design a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts. Subsequently, the constructed image-prompt pairs are utilized to finetune Stable Diffusion, incorporating a lightweight controller to guide the location and shape of camouflaged objects for enhanced camouflage scene fitness. Moreover, we design a Frequency Interaction Refinement Module (FIRM) to capture high-frequency texture features, facilitating the learning of complex camouflage patterns. Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, demonstrate the semantic alignment of our generated text prompts and CT-CIG’s ability to produce photorealistic camouflage images.

Paper and Project Links

PDF Accepted by AAAI 2026

Summary

CT-CIG is a controllable, text-guided method for generating realistic and logically plausible camouflage images. Leveraging large vision-language models (VLMs), it uses a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts and finetunes Stable Diffusion on the resulting image-prompt pairs, with a lightweight controller guiding the location and shape of camouflaged objects. A Frequency Interaction Refinement Module (FIRM) further captures high-frequency texture features to help learn complex camouflage patterns.

Key Takeaways

  1. CT-CIG aims to generate camouflage images that are both realistic and logically plausible.
  2. It leverages large vision-language models (VLMs) for image generation.
  3. The Camouflage-Revealing Dialogue Mechanism (CRDM) annotates existing camouflage datasets with text prompts.
  4. Stable Diffusion is finetuned to enable controllable, text-guided camouflage image generation.
  5. The added Frequency Interaction Refinement Module (FIRM) captures high-frequency texture features.
  6. The method is validated through extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment.
  7. It produces semantically aligned text prompts and highly photorealistic camouflage images.

Cool Papers

Click here to view paper screenshots

OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation

Authors:Hao Yu, Jiabo Zhan, Zile Wang, Jinglin Wang, Huaisong Zhang, Hongyu Li, Xinrui Chen, Yongxian Wei, Chun Yuan

Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha but lack versatility, while unified multi-task frameworks are confined to the RGB domain. To bridge this critical gap, we propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Its architecture features MSRoPE-BiL, a novel RoPE method with a bi-directionally extendable layer axis for its Diffusion Transformer (DiT) backbone, enabling the concurrent processing of multiple input and target RGBA layers. To power this framework, we introduce AlphaLayers, a new dataset of 1,000 high-quality, multi-layer triplets, built via a novel automated synthesis and filter pipeline. Jointly training OmniAlpha on this dataset across a comprehensive suite of 21 diverse tasks, extensive experiments demonstrate that our unified approach consistently outperforms strong, specialized baselines. Most notably, OmniAlpha achieves a dramatic 84.8% relative reduction in SAD for mask-free matting on AIM-500 and wins over 90% of human preferences in layer-conditioned completion. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.

Paper and Project Links

PDF

Summary

The paper presents OmniAlpha, a unified multi-task generative framework for sequence-to-sequence RGBA image generation and editing. To overcome the current fragmentation between RGB synthesis and alpha handling, OmniAlpha introduces the AlphaLayers dataset and a Diffusion Transformer architecture with a novel RoPE design for multi-layer RGBA processing. Experiments show that OmniAlpha outperforms specialized baselines across many tasks, especially mask-free matting and layer-conditioned completion, paving the way for more powerful, layer-aware generative systems.

Key Takeaways

  1. OmniAlpha is the first unified multi-task generative framework for sequence-to-sequence RGBA image generation and editing.
  2. It introduces AlphaLayers, a new dataset of high-quality multi-layer image triplets.
  3. It adopts a novel Diffusion Transformer architecture that supports multi-layer RGBA processing.
  4. OmniAlpha outperforms specialized baselines across multiple tasks.
  5. It is particularly effective for mask-free matting and layer-conditioned completion.
  6. The work shows that a unified multi-task model can learn a superior shared representation for RGBA.

Cool Papers

Click here to view paper screenshots

Restora-Flow: Mask-Guided Image Restoration with Flow Matching

Authors:Arnela Hadzic, Franz Thaler, Lea Bogensperger, Simon Johannes Joham, Martin Urschler

Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods.
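
One common way to realize mask-guided sampling, shown below as an assumption rather than the paper's exact trajectory-correction rule: after each flow step, pixels outside the degradation mask are reset to the known content brought to the current noise level.

```python
import torch

def guided_step(x_t, velocity_model, t, dt, y_known, mask):
    """mask==1 marks degraded pixels to generate; mask==0 marks observed pixels to keep."""
    x_next = x_t + dt * velocity_model(x_t, t)              # plain Euler flow-matching step
    t_next = t + dt
    # correction: re-impose the known content at the matching point of the path,
    # assuming a linear interpolation x_t = (1 - t) * noise + t * x_clean
    y_at_t = (1 - t_next) * torch.randn_like(y_known) + t_next * y_known
    return mask * x_next + (1 - mask) * y_at_t

toy_model = lambda x, t: -x                                  # placeholder velocity field
x = torch.randn(1, 3, 32, 32)
y = torch.zeros(1, 3, 32, 32)                                # observed content
m = (torch.rand(1, 3, 32, 32) > 0.5).float()                 # degradation (e.g. inpainting) mask
print(guided_step(x, toy_model, t=0.5, dt=0.1, y_known=y, mask=m).shape)
```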

Paper and Project Links

PDF Accepted for WACV 2026

Summary

Flow matching is a promising generative approach that addresses the long sampling times of state-of-the-art diffusion models while maintaining high-quality image generation, making it well suited as a prior for image restoration. To address the limitations of existing flow-based restoration methods, the paper proposes Restora-Flow, a training-free method that guides flow matching sampling with a degradation mask and adds a trajectory correction mechanism to enforce consistency with the degraded input. Evaluated on natural and medical datasets across mask-based degradation tasks such as inpainting, super-resolution, and denoising, it achieves better perceptual quality and processing time than diffusion and flow matching reference methods.

Key Takeaways

  1. Flow matching addresses the long sampling times of state-of-the-art diffusion models while maintaining high-quality image generation.
  2. Flow matching is flexible and applicable to a variety of image restoration tasks.
  3. Restora-Flow is a training-free flow matching method that guides sampling with a degradation mask.
  4. Restora-Flow adds a trajectory correction mechanism to enforce consistency with the degraded input.
  5. Experiments on natural and medical datasets show that Restora-Flow performs strongly across multiple restoration tasks.
  6. Compared with other diffusion and flow matching methods, Restora-Flow improves perceptual quality and processing time.

Cool Papers

Click here to view paper screenshots

UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

Authors:Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, Jun Zhu

Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This leads to quality degradation and repetition emerges as a special case when this dispersion becomes structured into periodic attention patterns, induced by harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines largely across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, it improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
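
The stated mechanism, a constant decay on attention to tokens beyond the training window, can be sketched directly on the attention logits; where exactly UltraViCo applies it inside the DiT is an assumption here.

```python
import math
import torch

def decayed_attention(q, k, v, train_len, decay=0.5):
    """q, k, v: (B, T, D); key positions >= train_len are the extrapolated tokens."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])   # (B, T, T)
    bias = torch.zeros(k.shape[1], device=q.device)
    bias[train_len:] = math.log(decay)                           # constant multiplicative decay
    attn = torch.softmax(scores + bias, dim=-1)                  # dispersion onto far tokens is damped
    return attn @ v

B, T, D, train_len = 1, 96, 32, 64                               # 1.5x length extrapolation
q, k, v = (torch.randn(B, T, D) for _ in range(3))
print(decayed_attention(q, k, v, train_len).shape)               # torch.Size([1, 96, 32])
```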

Paper and Project Links

PDF Project page: https://thu-ml.github.io/UltraViCo.github.io/

Summary

This paper studies video length extrapolation, the difficulty video diffusion transformers have in generalizing beyond their training length. It identifies two failure modes, model-specific periodic content repetition and universal quality degradation, and revisits the problem through attention maps, showing that both arise from a single cause: attention dispersion, where tokens beyond the training window dilute the learned attention patterns. Based on this insight, the paper proposes UltraViCo, a training-free, plug-and-play method that suppresses attention to tokens beyond the training window with a constant decay factor, effectively addressing both failure modes and improving performance.

Key Takeaways

  1. Video diffusion transformers struggle with video length extrapolation, performing poorly when generating beyond their training length.
  2. Existing methods focus on periodic content repetition but overlook quality degradation.
  3. Analyzing attention maps reveals that both failure modes stem from attention dispersion.
  4. UltraViCo suppresses attention for tokens beyond the training window to counter this dispersion.
  5. It addresses both failure modes at once, improving performance and pushing the extrapolation limit from 2x to 4x.
  6. UltraViCo outperforms a broad set of baselines across models and extrapolation ratios.

Cool Papers

Click here to view paper screenshots

HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

Authors:Hongji Yang, Yucheng Zhou, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen

Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.

Paper and Project Links

PDF 9 pages

Summary

This paper proposes HiCoGen, a Hierarchical Compositional Generative framework built on a Chain of Synthesis (CoS) paradigm, to address the difficulties diffusion models have with complex prompts. HiCoGen uses a large language model (LLM) to decompose complex prompts into minimal semantic units and synthesizes them iteratively, with the image generated at each step providing visual context for the next. A reinforcement learning (RL) framework further optimizes the process, and a decaying stochasticity schedule is proposed to enhance exploration. Experiments show that the approach clearly outperforms existing methods in concept coverage and compositional accuracy.

Key Takeaways

  1. Diffusion models struggle with complex prompts, suffering from concept omission, confusion, and poor compositionality.
  2. The Hierarchical Compositional Generative framework (HiCoGen) addresses these issues through a Chain of Synthesis (CoS).
  3. HiCoGen uses a large language model (LLM) to decompose complex prompts into minimal semantic units.
  4. A reinforcement learning (RL) framework is introduced to optimize the generation process.
  5. A decaying stochasticity schedule maximizes sample diversity and enhances exploration.
  6. The RL algorithm is guided by a hierarchical reward mechanism that evaluates images at the global, subject, and relationship levels.
  7. Experiments show HiCoGen outperforms existing methods in concept coverage and compositional accuracy.

Cool Papers

Click here to view paper screenshots

Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos

Authors:Youngseo Kim, Dohyun Kim, Geonhee Han, Paul Hongsuck Seo

Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we investigate how their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos leveraging a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.
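
The label-propagation view can be sketched in a few lines: an attention map between a labeled source frame and a target frame is row-normalized and used to carry segmentation masks forward. How DRIFT extracts and aggregates the maps from the diffusion model is not reproduced.

```python
import torch

def propagate_labels(attn, source_labels, temperature=0.1):
    """attn: (N_target, N_source) affinities; source_labels: (N_source, C) one-hot masks."""
    kernel = torch.softmax(attn / temperature, dim=-1)   # each target pixel = convex comb. of sources
    return kernel @ source_labels                        # (N_target, C) soft masks for the next frame

n_src, n_tgt, n_cls = 256, 256, 2
attn = torch.randn(n_tgt, n_src)                         # stand-in for a cross-frame attention map
labels = torch.zeros(n_src, n_cls)
labels[:128, 0], labels[128:, 1] = 1.0, 1.0              # known segmentation of the source frame
print(propagate_labels(attn, labels).shape)              # torch.Size([256, 2])
```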

Paper and Project Links

PDF

Summary

Image diffusion models, although developed for image generation, implicitly capture rich semantic structure that supports recognition and localization tasks beyond synthesis. This work shows that their self-attention maps can be reinterpreted as semantic label propagation kernels providing robust pixel-level correspondences between relevant image regions, and that extending the mechanism across frames yields a temporal propagation kernel enabling zero-shot object tracking via segmentation in videos. Test-time optimization strategies, including DDIM inversion, textual inversion, and adaptive head weighting, adapt diffusion features for robust and consistent label propagation. Building on these findings, the authors introduce DRIFT, a framework for video object tracking that uses a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.

Key Takeaways

  1. Image diffusion models have the potential to perform tasks beyond image generation, such as recognition and localization.
  2. Their self-attention maps can be interpreted as semantic label propagation kernels that provide pixel-level correspondences between image regions.
  3. Extending this mechanism across video frames enables zero-shot object tracking.
  4. Test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) help adapt diffusion features for robust label propagation.
  5. The DRIFT framework, using a pretrained image diffusion model with SAM-guided mask refinement, excels at video object tracking.
  6. Image diffusion models show potential for handling complex video data.

Cool Papers

Click here to view paper screenshots

Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models

Authors:Qin Ren, Yufei Wang, Lanqing Guo, Wen Zhang, Zhiwen Fan, Chenyu You

Diffusion models have become the dominant paradigm in text-to-image generation, and test-time scaling (TTS) further improves quality by allocating more computation during inference. However, existing TTS methods operate at the full-image level, overlooking the fact that image quality is often spatially heterogeneous. This leads to unnecessary computation on already satisfactory regions and insufficient correction of localized defects. In this paper, we explore a new direction - Localized TTS - that adaptively resamples defective regions while preserving high-quality regions, thereby substantially reducing the search space. This paradigm poses two central challenges: accurately localizing defects and maintaining global consistency. We propose LoTTS, the first fully training-free framework for localized TTS. For defect localization, LoTTS contrasts cross- and self-attention signals under quality-aware prompts (e.g., high-quality vs. low-quality) to identify defective regions, and then refines them into coherent masks. For consistency, LoTTS perturbs only defective regions and denoises them locally, ensuring that corrections remain confined while the rest of the image remains undisturbed. Extensive experiments on SD2.1, SDXL, and FLUX demonstrate that LoTTS achieves state-of-the-art performance: it consistently improves both local quality and global fidelity, while reducing GPU cost by 2-4x compared to Best-of-N sampling. These findings establish localized TTS as a promising new direction for scaling diffusion models at inference time.
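
A rough sketch of the defect-localization step as described above: attention responses under a low-quality prompt are contrasted with those under a high-quality prompt and refined into a binary mask. The contrast formula, threshold, and refinement are assumptions.

```python
import torch
import torch.nn.functional as F

def defect_mask(attn_low_q, attn_high_q, thresh=0.6):
    """attn_*: (H, W) aggregated attention under the low-/high-quality prompts."""
    contrast = attn_low_q - attn_high_q                        # regions that respond to "low quality"
    contrast = (contrast - contrast.min()) / (contrast.max() - contrast.min() + 1e-8)
    mask = (contrast > thresh).float()
    # crude refinement into a coherent region: box blur + re-threshold
    mask = F.avg_pool2d(mask[None, None], 5, stride=1, padding=2)[0, 0]
    return (mask > 0.5).float()                                # only these pixels get resampled

a_low, a_high = torch.rand(64, 64), torch.rand(64, 64)
print(defect_mask(a_low, a_high).shape)                        # torch.Size([64, 64])
```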

Paper and Project Links

PDF

Summary

Diffusion models have become the dominant paradigm in text-to-image generation, and test-time scaling (TTS) improves quality by allocating more computation during inference. Existing TTS methods, however, operate on the whole image and ignore the spatial heterogeneity of image quality, wasting computation on already satisfactory regions while under-correcting localized defects. This paper explores a new direction, localized TTS, which adaptively resamples defective regions while preserving high-quality ones, greatly reducing the search space. The paradigm raises two core challenges: accurately localizing defects and maintaining global consistency. LoTTS, the first fully training-free localized TTS framework, localizes defects by contrasting cross- and self-attention signals under quality-aware prompts (e.g., high-quality vs. low-quality) and refining them into coherent masks, and it preserves consistency by perturbing and locally denoising only the defective regions. Extensive experiments on SD2.1, SDXL, and FLUX show that LoTTS achieves state-of-the-art performance, improving both local quality and global fidelity while reducing GPU cost by 2-4x compared to Best-of-N sampling.

Key Takeaways

  • Diffusion models dominate text-to-image generation, and test-time scaling (TTS) improves image quality.
  • Existing TTS methods ignore the spatial heterogeneity of image quality, leading to inefficient computation.
  • LoTTS is the first fully training-free localized TTS framework, adaptively identifying and correcting defective regions.
  • LoTTS localizes defects by contrasting cross- and self-attention signals and refines them into coherent masks.
  • LoTTS perturbs and locally denoises only the defective regions, preserving global consistency.
  • LoTTS achieves state-of-the-art results on multiple models, improving local quality and global fidelity while cutting GPU cost by 2-4x.
  • These findings establish localized TTS as a promising new direction for scaling diffusion models at inference time.

Cool Papers

Click here to view paper screenshots

FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration

Authors:Jingren Liu, Shuning Xu, Qirui Yang, Yun Wang, Xiangyu Chen, Zhong Ji

All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.
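
As a small illustration of the frequency-aware signal, the sketch below splits an image into low- and high-frequency components with an FFT low-pass; the cutoff and the way FAPE-IR actually feeds such features to its LoRA-MoE experts are assumptions.

```python
import torch

def frequency_split(img, cutoff=0.1):
    """img: (B, C, H, W); returns (low, high) with low + high == img."""
    spec = torch.fft.fft2(img)
    fy = torch.fft.fftfreq(img.shape[-2], device=img.device).abs()[:, None]   # (H, 1)
    fx = torch.fft.fftfreq(img.shape[-1], device=img.device).abs()[None, :]   # (1, W)
    lowpass = ((fy <= cutoff) & (fx <= cutoff)).to(spec.dtype)                # square low-pass window
    low = torch.fft.ifft2(spec * lowpass).real
    return low, img - low

low, high = frequency_split(torch.randn(2, 3, 64, 64))
print(low.shape, high.shape)   # low component for the low-frequency expert, residual for the high-frequency one
```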

Paper and Project Links

PDF

Summary

All-in-One Image Restoration (AIO-IR) aims to build a unified model that handles multiple degradations under complex conditions, but existing methods rely on task-specific designs or latent routing strategies and adapt poorly to real-world scenarios with diverse degradations. The paper proposes FAPE-IR, a Frequency-Aware Planning and Execution framework that uses a frozen multimodal large language model (MLLM) as a planner to analyze degraded images and produce concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module inside a diffusion-based executor, which dynamically selects high- or low-frequency experts complemented by frequency features of the input. Adversarial training and a frequency regularization loss further improve restoration quality and reduce artifacts. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration.

Key Takeaways

  1. The goal of AIO-IR: a unified image restoration model that handles multiple degradations under complex conditions.
  2. The problem with existing methods: reliance on task-specific designs or latent routing strategies makes them hard to adapt to real-world scenarios with diverse degradations.
  3. FAPE-IR is proposed: a frequency-aware planning and execution framework for image restoration.
  4. A frozen multimodal large language model serves as the planner, generating frequency-aware restoration plans.
  5. High- or low-frequency experts are selected dynamically, complemented by frequency features of the input image.
  6. Adversarial training and a frequency regularization loss improve restoration quality and reduce artifacts.

Cool Papers

Click here to view paper screenshots

Orientation Matters: Making 3D Generative Models Orientation-Aligned

Authors:Yichong Lu, Yuzhuo Tian, Zijin Jiang, Yikun Zhao, Yuanbo Yang, Hao Ouyang, Haoji Hu, Huimin Yu, Yujun Shen, Yiyi Liao

Humans intuitively perceive object shape and orientation from a single image, guided by strong priors about canonical poses. However, existing 3D generative models often produce misaligned results due to inconsistent training data, limiting their usability in downstream tasks. To address this gap, we introduce the task of orientation-aligned 3D object generation: producing 3D objects from single images with consistent orientations across categories. To facilitate this, we construct Objaverse-OA, a dataset of 14,832 orientation-aligned 3D models spanning 1,008 categories. Leveraging Objaverse-OA, we fine-tune two representative 3D generative models based on multi-view diffusion and 3D variational autoencoder frameworks to produce aligned objects that generalize well to unseen objects across various categories. Experimental results demonstrate the superiority of our method over post-hoc alignment approaches. Furthermore, we showcase downstream applications enabled by our aligned object generation, including zero-shot object orientation estimation via analysis-by-synthesis and efficient arrow-based object rotation manipulation.

Paper and Project Links

PDF Accepted by NeurIPS 2025. Project Page: https://xdimlab.github.io/Orientation_Matters

Summary

This paper introduces the task of orientation-aligned 3D object generation and constructs Objaverse-OA, a dataset of orientation-aligned 3D models, to fine-tune two representative 3D generative models based on multi-view diffusion and 3D variational autoencoder frameworks so that they produce aligned objects that generalize well to unseen objects across categories. Experiments show the method outperforms post-hoc alignment approaches, and the aligned generation enables downstream applications such as zero-shot object orientation estimation via analysis-by-synthesis and efficient arrow-based object rotation manipulation.

Key Takeaways

  1. Humans can intuitively perceive object shape and orientation from a single image thanks to strong priors about canonical poses.
  2. Existing 3D generative models produce misaligned results due to inconsistent training data, limiting their use in downstream tasks.
  3. The task of orientation-aligned 3D object generation is introduced to address this inconsistency.
  4. Objaverse-OA is constructed, a dataset of 14,832 orientation-aligned 3D models spanning 1,008 categories.
  5. Two representative 3D generative models are fine-tuned on Objaverse-OA to generate aligned objects with good generalization.
  6. Experiments show the method outperforms post-hoc alignment approaches.

Cool Papers

Click here to view paper screenshots

M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion

Authors:Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, Federico Tombari

We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner without iterative diffusion steps by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, being ranked best 2.6x more often than the second-place method in a user study, while being 6x faster.

Paper and Project Links

PDF To be published at 3DV 2026, project webpage https://m2svid.github.io/

Summary

This paper tackles monocular-to-stereo video conversion and proposes a novel architecture for inpainting and refining the warped right view obtained by depth-based reprojection of the input left view. It extends the Stable Video Diffusion (SVD) model to take the input left video, the warped right video, and disocclusion masks as conditioning to generate a high-quality right camera view, and it modifies SVD's attention layers to compute full attention for disoccluded pixels so that neighboring frames are exploited effectively for inpainting. The model is trained end to end without iterative diffusion steps by minimizing image-space losses, and it outperforms prior state-of-the-art methods: in a user study it is ranked best 2.6x more often than the second-place method while being 6x faster.

Key Takeaways

  1. Addresses monocular-to-stereo video conversion.
  2. Proposes a novel architecture for inpainting and refining the warped right view.
  3. Extends the Stable Video Diffusion (SVD) model to generate a high-quality right camera view.
  4. Uses the input left video, the warped right video, and disocclusion masks as conditioning inputs.
  5. Modifies SVD's attention layers to compute full attention for disoccluded pixels.
  6. Trains the model end to end by minimizing image-space losses.

Cool Papers

Click here to view paper screenshots

Personalized Generative Low-light Image Denoising and Enhancement

Authors:Xijun Wang, Prateek Chennuri, Dilshan Godaliyadda, Yu Yuan, Bole Ma, Xingguang Zhang, Hamid R. Sheikh, Stanley Chan

Modern cameras’ performance in low-light conditions remains suboptimal due to fundamental limitations in photon shot noise and sensor read noise. Generative image restoration methods have shown promising results compared to traditional approaches, but they suffer from hallucinatory content generation when the signal-to-noise ratio (SNR) is low. Leveraging the availability of personalized photo galleries of the users, we introduce Diffusion-based Personalized Generative Denoising (DiffPGD), a new approach that builds a customized diffusion model for individual users. Our key innovation lies in the development of an identity-consistent physical buffer that extracts the physical attributes of the person from the gallery. This ID-consistent physical buffer serves as a robust prior that can be seamlessly integrated into the diffusion model to restore degraded images without the need for fine-tuning. Over a wide range of low-light testing scenarios, we show that DiffPGD achieves superior image denoising and enhancement performance compared to existing diffusion-based denoising approaches. Our project page can be found at https://genai-restore.github.io/DiffPGD/.

Paper and Project Links

PDF

Summary

To address the poor performance of modern cameras in low light, this paper proposes Diffusion-based Personalized Generative Denoising (DiffPGD), which leverages a user's personalized photo gallery to build a customized diffusion model for that individual. The key innovation is an identity-consistent physical buffer that extracts the person's physical attributes from the gallery and serves as a robust prior that integrates seamlessly into the diffusion model to restore degraded images without fine-tuning. Across a wide range of low-light test scenarios, DiffPGD achieves better image denoising and enhancement than existing diffusion-based approaches.

Key Takeaways

  1. Modern cameras perform poorly in low light, mainly due to fundamental limits from photon shot noise and sensor read noise.
  2. Generative image restoration methods are more promising than traditional approaches but hallucinate content at low signal-to-noise ratios.
  3. DiffPGD builds a customized diffusion model for each user from their personalized photo gallery.
  4. The identity-consistent physical buffer is DiffPGD's key innovation, extracting the person's physical attributes from the gallery.
  5. This physical buffer serves as a robust prior that integrates seamlessly into the diffusion model to restore degraded images without fine-tuning.
  6. DiffPGD achieves superior image denoising and enhancement across a wide range of low-light test scenarios.

Cool Papers

Click here to view paper screenshots

Categorical Flow Matching on Statistical Manifolds

Authors:Chaoran Cheng, Jiahan Li, Jian Peng, Ge Liu

We introduce Statistical Flow Matching (SFM), a novel and mathematically rigorous flow-matching framework on the manifold of parameterized probability measures inspired by the results from information geometry. We demonstrate the effectiveness of our method on the discrete generation problem by instantiating SFM on the manifold of categorical distributions whose geometric properties remain unexplored in previous discrete generative models. Utilizing the Fisher information metric, we equip the manifold with a Riemannian structure whose intrinsic geometries are effectively leveraged by following the shortest paths of geodesics. We develop an efficient training and sampling algorithm that overcomes numerical stability issues with a diffeomorphism between manifolds. Our distinctive geometric perspective of statistical manifolds allows us to apply optimal transport during training and interpret SFM as following the steepest direction of the natural gradient. Unlike previous models that rely on variational bounds for likelihood estimation, SFM enjoys the exact likelihood calculation for arbitrary probability measures. We manifest that SFM can learn more complex patterns on the statistical manifold where existing models often fail due to strong prior assumptions. Comprehensive experiments on real-world generative tasks ranging from image, text to biological domains further demonstrate that SFM achieves higher sampling quality and likelihood than other discrete diffusion or flow-based models.
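
The Fisher-Rao geometry that SFM builds on for categorical distributions has a convenient closed form: the square-root map sends the simplex onto the positive orthant of a sphere, where geodesics are great-circle arcs. The sketch below computes that geodesic; SFM's training and sampling algorithms are not shown.

```python
import numpy as np

def fisher_rao_geodesic(p0, p1, t):
    """Point at time t in [0, 1] on the Fisher-Rao geodesic between categoricals p0 and p1."""
    s0, s1 = np.sqrt(p0), np.sqrt(p1)                   # square-root map onto the unit sphere
    angle = np.arccos(np.clip(np.dot(s0, s1), -1.0, 1.0))
    if angle < 1e-8:
        return p0
    st = (np.sin((1 - t) * angle) * s0 + np.sin(t * angle) * s1) / np.sin(angle)  # great-circle arc
    return st ** 2                                      # map back to the simplex

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.1, 0.8])
mid = fisher_rao_geodesic(p0, p1, 0.5)
print(mid, mid.sum())                                   # a valid categorical that sums to 1
```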

Paper and Project Links

PDF Accepted to NeurIPS 2024 as a conference paper

Summary

Statistical Flow Matching (SFM) is a novel, mathematically rigorous flow-matching framework on the manifold of parameterized probability measures, inspired by information geometry. Using the Fisher information metric, it equips the manifold of categorical distributions with a Riemannian structure and follows the shortest geodesic paths for training and sampling, with a diffeomorphism between manifolds resolving numerical stability issues. The framework allows optimal transport during training and can be interpreted as following the steepest direction of the natural gradient. Unlike prior models that rely on variational bounds for likelihood estimation, SFM computes exact likelihoods for arbitrary probability measures. Experiments show that it learns more complex patterns on the statistical manifold and achieves higher sampling quality and likelihood than other discrete diffusion or flow-based models on image, text, and biological generation tasks.

Key Takeaways

  1. Statistical Flow Matching (SFM) is a flow-matching framework grounded in results from information geometry.
  2. SFM operates on the manifold of parameterized probability measures, using the Fisher information metric to equip it with a Riemannian structure.
  3. Training and sampling follow the shortest geodesic paths, with a diffeomorphism between manifolds improving efficiency and numerical stability.
  4. SFM applies optimal transport during training and follows the steepest direction of the natural gradient.
  5. Unlike models that rely on variational bounds, SFM computes exact likelihoods for arbitrary probability measures.
  6. SFM learns more complex patterns on the statistical manifold, going beyond the limitations of existing models.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!