
I2I Translation


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are only meant as a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-27

TReFT: Taming Rectified Flow Models For One-Step Image Translation

Authors:Shengqian Li, Ming Gao, Yi Liu, Zuzeng Lin, Feng Wang, Feng Dai

Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly multi-step denoising, hindering real-time applications. Although the recent adversarial training paradigm, CycleGAN-Turbo, works with pretrained diffusion models for one-step image translation, we find that directly applying it to RF models leads to severe convergence issues. In this paper, we analyze these challenges and propose TReFT, a novel method to Tame Rectified Flow models for one-step image Translation. Unlike previous works, TReFT directly uses the velocity predicted by the pretrained DiT or UNet as output, a simple yet effective design that tackles the convergence issues under adversarial training with one-step inference. This design is mainly motivated by a novel observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from the origin to the final clean image, a property we further justify through theoretical analysis. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, we introduce memory-efficient latent cycle-consistency and identity losses during training, as well as lightweight architectural simplifications for faster inference. Pretrained RF models finetuned with TReFT achieve performance comparable to state-of-the-art methods across multiple image translation datasets while enabling real-time inference.
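The core design can be illustrated with a minimal sketch, assuming a latent encoder/decoder and a velocity-prediction backbone exposed through a hypothetical `velocity_model(latent, t)` interface (these names are ours, not the paper's code): the velocity predicted in a single forward pass is taken directly as the translated latent and then decoded.

```python
import torch

@torch.no_grad()
def one_step_translate(encoder, velocity_model, decoder, image, t_end=1.0):
    """Illustrative one-step translation in the spirit of TReFT:
    the velocity predicted by the pretrained RF backbone is used
    directly as the output latent (hypothetical interfaces)."""
    z_src = encoder(image)                                   # source-domain latent
    t = torch.full((z_src.shape[0],), t_end, device=z_src.device)
    v = velocity_model(z_src, t)                             # predicted velocity
    z_tgt = v                                                # velocity used directly as translated latent
    return decoder(z_tgt)                                    # decode to the target-domain image
```

During fine-tuning, the paper additionally applies adversarial training together with memory-efficient latent cycle-consistency and identity losses; the sketch only shows the one-step inference path.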


Paper and project links

PDF

Summary

This paper highlights the strengths of Rectified Flow (RF) models for image and video synthesis, but notes that applying them to image-to-image translation remains challenging. To improve efficiency, the authors propose TReFT, which directly uses the velocity predicted by a pretrained DiT or UNet as the output, effectively resolving the convergence issues under adversarial training and enabling one-step inference. For large pretrained RF models such as SD3.5 and FLUX, they further introduce memory-efficient latent cycle-consistency and identity losses as well as lightweight architectural simplifications to speed up inference. Pretrained RF models fine-tuned with TReFT match state-of-the-art methods across multiple image translation datasets while achieving real-time inference.

Key Takeaways

  1. Rectified Flow (RF) models achieve high-quality image and video synthesis via optimal transport theory.
  2. For image-to-image translation, RF models still rely on costly multi-step denoising, which hinders real-time applications.
  3. The adversarial training paradigm of CycleGAN-Turbo works for one-step image translation with pretrained diffusion models, but applying it directly to RF models causes severe convergence issues.
  4. TReFT resolves the convergence issues of RF models under adversarial training by using the velocity predicted by the pretrained DiT or UNet as the output.
  5. TReFT builds on the observation that the velocity predicted by a pretrained RF model converges, near the end of denoising, to the vector from the origin to the final clean image; the paper justifies this property theoretically (a short derivation sketch follows this list).
  6. Applying TReFT to large pretrained RF models such as SD3.5 and FLUX involves memory-efficient latent cycle-consistency and identity losses, together with architectural simplifications for faster inference.
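Item 5 can be made concrete with a short derivation sketch under one common rectified-flow convention (t = 1 corresponds to clean data, z is standard Gaussian noise); the notation below is ours, not the paper's:

```latex
% Rectified-flow interpolation between noise z ~ N(0, I) and the clean image x:
\begin{align*}
  x_t &= (1 - t)\,z + t\,x,
  \qquad
  v_\theta(x_t, t) \approx \mathbb{E}\!\left[\,x - z \mid x_t\,\right]. \\
  % As t -> 1 the noise term (1 - t) z vanishes, so x_t carries almost no
  % information about z, the posterior over z reverts to its prior, and
  % E[z | x_t] -> E[z] = 0. Hence
  v_\theta(x_t, t) &\;\longrightarrow\; x
  \qquad \text{as } t \to 1,
\end{align*}
% i.e. the predicted velocity approaches the vector from the origin to the clean image.
```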

Cool Papers

Click here to view paper screenshots

Blind Adaptive Local Denoising for CEST Imaging

Authors:Chu Chen, Aitor Artola, Yang Liu, Se Weon Park, Raymond H. Chan, Jean-Michel Morel, Kannie W. Y. Chan

Chemical Exchange Saturation Transfer (CEST) MRI enables molecular-level visualization of low-concentration metabolites by leveraging proton exchange dynamics. However, its clinical translation is hindered by inherent challenges: spatially varying noise arising from hardware limitations and complex imaging protocols introduces heteroscedasticity in CEST data, perturbing the accuracy of quantitative contrast mapping such as amide proton transfer (APT) imaging. Traditional denoising methods are not designed for this complex noise and often alter the underlying information that is critical for biomedical analysis. To overcome these limitations, we propose a new Blind Adaptive Local Denoising (BALD) method. BALD exploits the self-similar nature of CEST data to derive an adaptive variance-stabilizing transform that equalizes the noise distributions across CEST pixels without prior knowledge of noise characteristics. Then, BALD performs two-stage denoising on a linear transformation of the data to disentangle molecular signals from noise. A local SVD decomposition is used as the linear transform to prevent spatial and spectral denoising artifacts. We conducted extensive validation experiments on multiple phantoms and in vivo CEST scans. In these experiments, BALD consistently outperformed state-of-the-art CEST denoisers in both denoising metrics and downstream tasks such as molecular concentration map estimation and cancer detection.
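To illustrate the idea of denoising in a local SVD basis, here is a generic Python sketch. The function name, fixed patch size, and the simple singular-value truncation are our assumptions for illustration; BALD's actual adaptive variance-stabilizing transform and two-stage procedure are more involved.

```python
import numpy as np

def local_svd_denoise(cest, patch=8, keep=4):
    """Generic local-SVD denoising of a CEST stack (H, W, F):
    for each spatial patch, flatten to a (pixels x frequency-offsets) matrix,
    keep the leading singular components, and reconstruct.
    Illustrative sketch only, not the BALD algorithm itself."""
    H, W, F = cest.shape
    out = np.zeros_like(cest)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            block = cest[i:i + patch, j:j + patch, :]
            h, w, _ = block.shape
            mat = block.reshape(h * w, F)                  # local Casorati matrix
            U, s, Vt = np.linalg.svd(mat, full_matrices=False)
            s[keep:] = 0.0                                 # truncate small singular values (noise)
            out[i:i + patch, j:j + patch, :] = ((U * s) @ Vt).reshape(h, w, F)
    return out
```

In BALD, the data would first pass through the adaptive variance-stabilizing transform so that the noise is approximately homoscedastic before any such local decomposition is applied.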


Paper and project links

PDF

Summary
Chemical Exchange Saturation Transfer (CEST) MRI uses proton exchange dynamics to visualize low-concentration metabolites at the molecular level. Its clinical application, however, is hampered by spatially varying noise caused by hardware limitations and by the heteroscedastic noise introduced by complex imaging protocols. To overcome these limitations, the authors propose a Blind Adaptive Local Denoising (BALD) method. BALD exploits the self-similarity of CEST data to derive an adaptive variance-stabilizing transform that equalizes the noise distribution without prior knowledge of the noise characteristics. It then performs two-stage denoising on a linear transformation of the data to disentangle molecular signals from noise, using a local singular value decomposition (SVD) as the linear transform to prevent spatial and spectral denoising artifacts. BALD performs strongly in extensive validation experiments on multiple phantoms and in vivo CEST scans.

Key Takeaways

  1. CEST MRI enables molecular-level visualization of low-concentration metabolites.
  2. Spatially varying and heteroscedastic noise hinders the clinical application of CEST MRI.
  3. A new Blind Adaptive Local Denoising (BALD) method is proposed to overcome these limitations.
  4. BALD exploits the self-similarity of CEST data to derive an adaptive variance-stabilizing transform.
  5. BALD uses two-stage denoising to separate molecular signals from noise.
  6. A local singular value decomposition (SVD) serves as the linear transform, reducing denoising artifacts.

Cool Papers

Click here to view paper screenshots

HunyuanOCR Technical Report

Authors: Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Houwen Peng, Hongming Yang, Senhao Xie, Binghong Wu, Mana Yang, Sergey Wang, Raccoon Liu, Dick Zhu, Jie Jiang, Linus, Han Hu, Chengquan Zhang

This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow “OCR expert models” and inefficient “General VLMs”. 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.
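A minimal torch sketch of the generic wiring the abstract describes (ViT patch features projected by an MLP adapter into a lightweight LLM). All module names, dimensions, and the `inputs_embeds` calling convention are illustrative assumptions, not HunyuanOCR's released code.

```python
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Projects ViT patch features into the LLM embedding space (illustrative)."""
    def __init__(self, vit_dim=1024, llm_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_tokens):            # (B, N_patches, vit_dim)
        return self.proj(vit_tokens)          # (B, N_patches, llm_dim)

def ocr_forward(vit, adapter, llm, image, prompt_embeds):
    """Hypothetical end-to-end pass: visual tokens are prepended to the text prompt
    and the LLM decodes the OCR answer directly, with no layout-analysis pre-processing."""
    vision_tokens = adapter(vit(image))                     # visual tokens in LLM space
    inputs = torch.cat([vision_tokens, prompt_embeds], dim=1)
    return llm(inputs_embeds=inputs)
```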


Paper and project links

PDF

Summary

This paper introduces HunyuanOCR, a commercial-grade, open-source, lightweight (1B-parameter) vision-language model for OCR tasks. It couples a Native Vision Transformer with a lightweight LLM through an MLP adapter and delivers excellent performance, outperforming commercial APIs, traditional pipelines, and larger models. HunyuanOCR excels at both perception and semantic tasks, took first place in the ICDAR 2025 DIMT Challenge (Small Model Track), and achieves state-of-the-art results on OCRBench among VLMs with fewer than 3B parameters.

Key Takeaways

  1. HunyuanOCR is a commercial-grade, open-source, lightweight vision-language model dedicated to OCR tasks.
  2. The model consists of a Native Vision Transformer and a lightweight LLM connected via an MLP adapter.
  3. HunyuanOCR performs strongly across OCR tasks, including text spotting, parsing, information extraction, visual question answering, and text image translation.
  4. It won first place in the Small Model Track of the ICDAR 2025 DIMT Challenge.
  5. On OCRBench, HunyuanOCR achieves state-of-the-art results among VLMs with fewer than 3B parameters.
  6. The model adopts a pure end-to-end pipeline, simplifying deployment and eliminating the error propagation common in traditional pipelines.

Cool Papers

Click here to view paper screenshots

Multi-Modal Data Exploration via Language Agents

Authors:Farhad Nooralahzadeh, Yi Zhang, Jonathan Furst, Kurt Stockinger

International enterprises, organizations, and hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progress in the separate fields of multi-modal data exploration as well as in database systems that automatically translate natural language questions to database query languages, the research challenge of querying both structured databases and unstructured modalities (e.g., texts, images) in natural language remains largely unexplored. In this paper, we propose M$^2$EX -a system that enables multi-modal data exploration via language agents. Our approach is based on the following research contributions: (1) Our system is inspired by a real-world use case that enables users to explore multi-modal information systems. (2) M$^2$EX leverages an LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis and to orchestrate modality-specific experts in an efficient query plan. (3) Experimental results on multi-modal datasets, encompassing relational data, text, and images, demonstrate that our system outperforms state-of-the-art multi-modal exploration systems, excelling in both accuracy and various performance metrics, including query latency, API costs, and planning efficiency, thanks to the more effective utilization of the reasoning capabilities of LLMs.
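A skeletal Python sketch of the decomposition-and-orchestration loop the abstract describes: an LLM planner splits the question into modality-specific subtasks, and each subtask is routed to an expert such as a text-to-SQL generator or an image analyzer. The `SubTask` dataclass, expert registry, and aggregation step are hypothetical names for illustration, not the M$^2$EX implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SubTask:
    modality: str       # e.g. "sql", "image", "text"
    instruction: str    # natural-language instruction for the expert

def execute(question: str,
            llm_plan: Callable[[str], List[SubTask]],
            experts: Dict[str, Callable[[str], str]]) -> str:
    """Illustrative orchestration: plan subtasks with an LLM, run each with its
    modality-specific expert, and aggregate the intermediate results."""
    results = []
    for task in llm_plan(question):            # LLM-based query planning
        expert = experts[task.modality]        # e.g. text-to-SQL or image analysis
        results.append(expert(task.instruction))
    return " | ".join(results)                 # placeholder aggregation step
```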


Paper and project links

PDF Accepted to the IJCNLP AACL 2025 Findings

Summary

Multi-modal data exploration has become an active research topic. This paper proposes M$^2$EX, a multi-modal data exploration system that supports natural language queries over structured data in databases as well as unstructured data such as text and images. The system uses an LLM-based agentic framework to decompose a natural language question into subtasks, such as text-to-SQL generation and image analysis, and orchestrates modality-specific experts into an efficient query plan. Experiments show that the system outperforms other multi-modal exploration systems, excelling in accuracy, query latency, API cost, and planning efficiency.

Key Takeaways

  1. International enterprises and organizations collect large amounts of multi-modal data stored in databases, text documents, images, and videos.
  2. Multi-modal data exploration has seen recent progress, but querying structured databases and unstructured data (such as text and images) in natural language remains a challenge.
  3. The paper proposes M$^2$EX, a multi-modal data exploration system that supports natural language queries over multi-modal data.
  4. M$^2$EX uses an LLM-based agentic framework to decompose a natural language question into subtasks.
  5. M$^2$EX orchestrates modality-specific experts into an efficient query plan.
  6. Experiments show that M$^2$EX outperforms other multi-modal exploration systems.

Cool Papers

Click here to view paper screenshots

Zero-Shot Video Translation via Token Warping

Authors:Haiming Zhu, Yangyang Xu, Jun Yu, Shengfeng He

With the revolution of generative AI, video-related tasks have been widely studied. However, current state-of-the-art video models still lag behind image models in visual quality and user control over generated content. In this paper, we introduce TokenWarping, a novel framework for temporally coherent video translation. Existing diffusion-based video editing approaches rely solely on key and value patches in self-attention to ensure temporal consistency, often sacrificing the preservation of local and structural regions. Critically, these methods overlook the significance of the query patches in achieving accurate feature aggregation and temporal coherence. In contrast, TokenWarping leverages complementary token priors by constructing temporal correlations across different frames. Our method begins by extracting optical flows from source videos. During the denoising process of the diffusion model, these optical flows are used to warp the previous frame’s query, key, and value patches, aligning them with the current frame’s patches. By directly warping the query patches, we enhance feature aggregation in self-attention, while warping the key and value patches ensures temporal consistency across frames. This token warping imposes explicit constraints on the self-attention layer outputs, effectively ensuring temporally coherent translation. Our framework does not require any additional training or fine-tuning and can be seamlessly integrated with existing text-to-image editing methods. We conduct extensive experiments on various video translation tasks, demonstrating that TokenWarping surpasses state-of-the-art methods both qualitatively and quantitatively. Video demonstrations can be found on our project webpage: https://alex-zhu1.github.io/TokenWarping/. Code is available at: https://github.com/Alex-Zhu1/TokenWarping.
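A small torch sketch of the flow-based token warping the abstract describes: the previous frame's feature map is resampled with optical flow so that its query, key, and value patches align with the current frame before self-attention. The function name and the pixel-unit flow convention are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def warp_tokens(prev_feat, flow):
    """Warp the previous frame's feature map (B, C, H, W) to the current frame
    using optical flow (B, 2, H, W) given in pixels; illustrative only."""
    B, _, H, W = prev_feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=prev_feat.device, dtype=prev_feat.dtype),
        torch.arange(W, device=prev_feat.device, dtype=prev_feat.dtype),
        indexing="ij",
    )
    grid_x = (xs + flow[:, 0]) / (W - 1) * 2 - 1   # normalize to [-1, 1]
    grid_y = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)   # (B, H, W, 2)
    return F.grid_sample(prev_feat, grid, align_corners=True)

# Conceptually, the warped features supply the previous frame's query/key/value
# patches aligned to the current frame before self-attention, e.g.:
# q_warp, k_warp, v_warp = (warp_tokens(t, flow) for t in (q_prev, k_prev, v_prev))
```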


Paper and project links

PDF Code is available at: https://github.com/Alex-Zhu1/TokenWarping

Summary

This paper proposes TokenWarping, a framework for temporally coherent video translation. It improves on existing diffusion-based video editing methods by using complementary token priors to build temporal correlations across frames. Warping the query tokens directly strengthens feature aggregation in self-attention, while warping the key and value tokens keeps frames temporally consistent. The framework requires no additional training or fine-tuning and integrates seamlessly with existing text-to-image editing methods. Experiments show that TokenWarping surpasses existing methods on video translation tasks.

Key Takeaways

  1. TokenWarping is a new framework for video translation that uses complementary token priors for temporally coherent video editing.
  2. The framework warps the query tokens to strengthen feature aggregation in self-attention, while warping the key and value tokens to keep frames temporally consistent.
  3. TokenWarping extracts optical flow from the source video and uses it to warp tokens during the denoising process of the diffusion model.
  4. The method improves translation quality and user control over the generated content, surpassing existing approaches.
  5. TokenWarping integrates seamlessly with existing text-to-image editing methods without any additional training or fine-tuning.
  6. Extensive experiments show that TokenWarping performs strongly across a variety of video translation tasks.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!