Published: 2025-09-24

Updated: 2025-11-27

Word count: 6.2k

Reading time: 24 min

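⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use these summaries in serious academic settings; they are suitable only for preliminary screening before reading the papers.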

2025-09-24 Update

Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers

Authors: Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, Seungryong Kim

Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over concatenated image and text tokens, enabling richer and more scalable cross-modal alignment. However, a detailed understanding of how and where these attention maps contribute to image generation remains limited. In this paper, we introduce Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. Our findings demonstrate that semantic grouping is an emergent property of diffusion transformers and can be selectively amplified to advance both segmentation and generation performance, paving the way for unified models that bridge visual perception and generation.

Paper and project links

PDF NeurIPS 2025. Project page: https://cvlab-kaist.github.io/Seg4Diff/

Summary

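Text-to-image diffusion models implicitly ground textual concepts in images through cross-modal attention, which is why they excel at turning language prompts into photorealistic images. Recent multi-modal diffusion transformers (MM-DiT) go further by applying joint self-attention over concatenated image and text tokens, giving richer and more scalable cross-modal alignment. How and where these attention maps contribute to image generation is still poorly understood. This paper proposes Seg4Diff (Segmentation for Diffusion), a framework for analyzing the attention structure of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. The analysis identifies a semantic grounding expert layer: a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions and naturally yields high-quality semantic segmentation masks. The authors also show that lightweight fine-tuning on mask-annotated image data strengthens the semantic grouping ability of these layers, improving both segmentation performance and generated image fidelity. The findings indicate that semantic grouping is an emergent property of diffusion transformers that can be selectively amplified, pointing toward unified models that bridge visual perception and generation.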

Key Takeaways

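Text-to-image diffusion models turn language prompts into images by aligning textual concepts with image content.
Multi-modal diffusion transformers introduce joint self-attention, making cross-modal alignment richer and more scalable.
The Seg4Diff framework analyzes the attention structure of multi-modal diffusion transformers and identifies a semantic grounding expert layer.
The semantic grounding expert layer aligns text tokens with spatially coherent image regions, producing high-quality semantic segmentation masks.
Lightweight fine-tuning on mask-annotated image data improves the model's semantic grouping ability.
Semantic grouping is an emergent property of multi-modal diffusion transformers; amplifying it improves both segmentation performance and image generation quality.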
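
The idea of reading a segmentation mask out of joint self-attention can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single attention head, uses the raw token features as both queries and keys (a real MM-DiT block applies learned projections), and the helper names `joint_attention` and `masks_from_attention` are hypothetical. Each image token is simply labeled with the text token it attends to most.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(queries, keys):
    """Attention weights over the concatenated [image; text] sequence."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def masks_from_attention(attn, num_image_tokens):
    """Label each image token with its most-attended text token (assumed rule)."""
    labels = []
    for row in attn[:num_image_tokens]:
        text_part = row[num_image_tokens:]  # image-query -> text-key block
        labels.append(max(range(len(text_part)), key=text_part.__getitem__))
    return labels


# Toy input: 4 image tokens and 2 text tokens with 2-D features.
image_tokens = [[4.0, 0.0], [3.5, 0.2], [0.1, 4.0], [0.0, 3.0]]
text_tokens = [[5.0, 0.0], [0.0, 5.0]]
tokens = image_tokens + text_tokens

attn = joint_attention(tokens, tokens)
labels = masks_from_attention(attn, len(image_tokens))
print(labels)  # one text-token index per image token
```

Reshaping `labels` to the image token grid would give a coarse label map; which block and which normalization work best is exactly what the paper's analysis addresses, so the argmax rule here is only a placeholder.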

PhysHDR: When Lighting Meets Materials and Scene Geometry in HDR Reconstruction

Authors: Hrishav Bakul Barua, Kalin Stefanov, Ganesh Krishnasamy, KokSheik Wong, Abhinav Dhall

Low Dynamic Range (LDR) to High Dynamic Range (HDR) image translation is a fundamental task in many computational vision problems. Numerous data-driven methods have been proposed to address this problem; however, they lack explicit modeling of illumination, lighting, and scene geometry in images. This limits the quality of the reconstructed HDR images. Since lighting and shadows interact differently with different materials (e.g., specular surfaces such as glass and metal, and Lambertian or diffuse surfaces such as wood and stone), modeling material-specific properties (e.g., specular and diffuse reflectance) has the potential to improve the quality of HDR image reconstruction. This paper presents PhysHDR, a simple yet powerful latent diffusion-based generative model for HDR image reconstruction. The denoising process is conditioned on lighting and depth information and guided by a novel loss to incorporate material properties of surfaces in the scene. The experimental results establish the efficacy of PhysHDR in comparison to a number of recent state-of-the-art methods.

Paper and project links

PDF Submitted to IEEE

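Summary

This paper presents PhysHDR, a latent diffusion-based generative model for HDR image reconstruction. The denoising process is conditioned on lighting and depth information and guided by a novel loss that incorporates the material properties of surfaces in the scene. Experiments show that PhysHDR is effective compared with a number of recent state-of-the-art methods.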

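Key Takeaways

Low dynamic range (LDR) to high dynamic range (HDR) image translation is a fundamental task in computational vision.
Existing data-driven methods lack explicit modeling of illumination, lighting, and scene geometry, which limits the quality of the reconstructed HDR images.
Modeling material-specific properties (e.g., specular and diffuse reflectance) has the potential to improve HDR reconstruction quality.
PhysHDR is a simple yet powerful latent diffusion-based generative model for HDR image reconstruction.
Its denoising process is conditioned on lighting and depth information.
A novel loss guides the model to incorporate the material properties of surfaces in the scene.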

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Authors: Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen

Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.

Paper and project links

PDF

Summary

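Unified multimodal large language models (LLMs) that can both understand and generate visual content hold great potential, but existing open-source models often face a performance trade-off between the two capabilities. Manzano is a simple and scalable unified framework that substantially reduces this trade-off by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters, which produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics as text and image tokens, and an auxiliary diffusion decoder then translates the image tokens into pixels. Together with a unified training recipe over understanding and generation data, this architecture enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models and is competitive with specialist models, particularly on text-rich evaluation. The studies show minimal task conflicts and consistent gains from scaling model size, supporting the hybrid tokenizer design.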

Key Takeaways

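Unified multimodal LLMs that can both understand and generate visual content hold great potential.
Existing open-source models often face a performance trade-off between understanding and generation.
Manzano reduces this trade-off by combining a hybrid image tokenizer with a well-curated training recipe.
A single shared vision encoder feeds two lightweight adapters: one produces continuous embeddings for image-to-text understanding, the other discrete tokens for text-to-image generation.
A unified autoregressive LLM predicts high-level semantics as text and image tokens.
Manzano achieves state-of-the-art results among unified models and is competitive with specialist models, particularly on text-rich evaluation.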
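
The two-branch tokenizer described above can be pictured with a small sketch. This is an illustration under stated assumptions, not Manzano's architecture: the abstract does not say how the discrete tokens are produced, so nearest-neighbor lookup in a codebook is assumed here, the continuous adapter is reduced to a single linear map, and all names (`hybrid_tokenize`, `cont_weight`, `codebook`) are hypothetical.

```python
def linear(x, weight):
    """Apply a weight matrix (list of rows) to one feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def nearest_code(x, codebook):
    """Index of the codebook entry closest to x (assumed quantization rule)."""
    def sqdist(code):
        return sum((a - b) ** 2 for a, b in zip(x, code))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def hybrid_tokenize(patch_features, cont_weight, codebook):
    """Turn shared encoder features into both token types.

    continuous: embeddings for the understanding path (image-to-text)
    discrete:   token ids for the generation path (text-to-image)
    """
    continuous = [linear(f, cont_weight) for f in patch_features]
    discrete = [nearest_code(f, codebook) for f in patch_features]
    return continuous, discrete


# Stand-in for the output of the shared vision encoder: 3 patches, 2-D features.
features = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]]
cont_weight = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # maps 2-D features to 3-D
codebook = [[1.0, 0.0], [0.0, 1.0]]                  # 2 discrete codes

continuous, discrete = hybrid_tokenize(features, cont_weight, codebook)
print(continuous[0], discrete)
```

Both outputs derive from the same encoder features, which is the property the abstract credits for keeping the two tasks in a common semantic space.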

DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching

Authors: Meng Yang, Fan Fan, Zizhuo Li, Songchu Deng, Yong Ma, Jiayi Ma

Multimodal image matching seeks pixel-level correspondences between images of different modalities, crucial for cross-modal perception, fusion and analysis. However, the significant appearance differences between modalities make this task challenging. Due to the scarcity of high-quality annotated datasets, existing deep learning methods that extract modality-common features for matching perform poorly and lack adaptability to diverse scenarios. Vision Foundation Model (VFM), trained on large-scale data, yields generalizable and robust feature representations adapted to data and tasks of various modalities, including multimodal matching. Thus, we propose DistillMatch, a multimodal image matching method using knowledge distillation from VFM. DistillMatch employs knowledge distillation to build a lightweight student model that extracts high-level semantic features from VFM (including DINOv2 and DINOv3) to assist matching across modalities. To retain modality-specific information, it extracts and injects modality category information into the other modality’s features, which enhances the model’s understanding of cross-modal correlations. Furthermore, we design V2I-GAN to boost the model’s generalization by translating visible to pseudo-infrared images for data augmentation. Experiments show that DistillMatch outperforms existing algorithms on public datasets.

Paper and project links

PDF 10 pages, 4 figures, 3 tables

Summary

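Multimodal image matching is challenging because of large appearance differences between modalities, and existing deep learning methods perform poorly given the scarcity of high-quality annotated datasets. To address this, the paper proposes DistillMatch, a multimodal image matching method based on knowledge distillation from a Vision Foundation Model (VFM) trained on large-scale data. A lightweight student model is distilled to extract high-level semantic features from the VFM for matching across modalities. The method also extracts modality category information and injects it into the other modality's features, which strengthens the model's understanding of cross-modal correlations. In addition, V2I-GAN improves generalization by translating visible images into pseudo-infrared images for data augmentation. Experiments show that DistillMatch outperforms existing algorithms on public datasets.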

Key Takeaways

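Multimodal image matching seeks pixel-level correspondences across modalities, which large appearance differences make challenging.
Existing deep learning methods that extract modality-common features perform poorly and adapt badly to diverse scenarios.
A Vision Foundation Model (VFM) yields generalizable and robust feature representations suited to data and tasks of various modalities.
DistillMatch uses knowledge distillation to build a lightweight student model that extracts high-level semantic features from the VFM for cross-modal matching.
DistillMatch injects modality category information into the other modality's features, strengthening the model's understanding of cross-modal correlations.
V2I-GAN improves generalization by translating visible images into pseudo-infrared images for data augmentation.
Experiments show that DistillMatch outperforms existing algorithms on public datasets.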
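
Feature-level knowledge distillation, in its generic form, trains a student so that its features point in the same direction as the teacher's. The sketch below shows one common choice of objective, a mean cosine-distance loss; the abstract does not specify which loss DistillMatch uses, so this is an assumption, and `distillation_loss` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity of two feature vectors; eps guards against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def distillation_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) over spatial locations (assumed objective)."""
    losses = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_feats, teacher_feats)
    ]
    return sum(losses) / len(losses)


# Toy features at two spatial locations; the teacher stands in for frozen VFM output.
teacher = [[1.0, 0.0], [0.0, 2.0]]
aligned_student = [[3.0, 0.0], [0.0, 0.5]]      # same directions, different scale
misaligned_student = [[0.0, 1.0], [1.0, 0.0]]   # orthogonal to the teacher

loss_aligned = distillation_loss(aligned_student, teacher)
loss_misaligned = distillation_loss(misaligned_student, teacher)
print(loss_aligned, loss_misaligned)
```

A cosine objective ignores feature magnitude, which is convenient when the student is much smaller than the teacher; an L2 loss on normalized features would be a close alternative.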

Deep Feedback Models

Authors: David Calhas, Arlindo L. Oliveira

Deep Feedback Models (DFMs) are a new class of stateful neural networks that combine bottom up input with high level representations over time. This feedback mechanism introduces dynamics into otherwise static architectures, enabling DFMs to iteratively refine their internal state and mimic aspects of biological decision making. We model this process as a differential equation solved through a recurrent neural network, stabilized via exponential decay to ensure convergence. To evaluate their effectiveness, we measure DFMs under two key conditions: robustness to noise and generalization with limited data. In both object recognition and segmentation tasks, DFMs consistently outperform their feedforward counterparts, particularly in low data or high noise regimes. In addition, DFMs translate to medical imaging settings, while being robust against various types of noise corruption. These findings highlight the importance of feedback in achieving stable, robust, and generalizable learning. Code is available at https://github.com/DCalhas/deep_feedback_models.

Paper and project links

PDF

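Summary

Deep Feedback Models (DFMs) are a new class of stateful neural networks that combine bottom-up input with high-level representations over time, introducing dynamics into otherwise static architectures. This feedback mechanism lets DFMs iteratively refine their internal state and mimic aspects of biological decision making. The process is modeled as a differential equation solved through a recurrent neural network and stabilized via exponential decay to ensure convergence. In object recognition and segmentation tasks, DFMs outperform their feedforward counterparts, particularly in low-data or high-noise regimes. DFMs also transfer to medical imaging settings while remaining robust to various types of noise corruption. These findings highlight the importance of feedback for stable, robust, and generalizable learning.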

Key Takeaways

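DFMs are a new class of stateful neural networks that combine bottom-up input with high-level representations over time.
The feedback mechanism lets DFMs iteratively refine their internal state and mimic aspects of biological decision making.
The process is modeled as a differential equation, solved through a recurrent neural network and stabilized via exponential decay to ensure convergence.
In object recognition and segmentation tasks, DFMs outperform feedforward counterparts, particularly in low-data or high-noise regimes.
DFMs transfer to medical imaging settings and remain robust to various types of noise corruption.
The results highlight the importance of feedback for stable, robust, and generalizable learning.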
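
A differential equation "solved through a recurrent neural network" can be read as unrolling an explicit Euler integrator, where each recurrent step advances the hidden state by a small time increment. The sketch below assumes the dynamics dh/dt = -decay * h + tanh(w_in * x + w_fb * h) with scalar weights; this particular form, and the names `feedback_step` and `run_feedback`, are illustrative assumptions rather than the equation used in the paper. The decay term pulls the state back toward zero, which keeps the iteration stable.

```python
import math


def feedback_step(h, x, w_in, w_fb, decay, dt):
    """One Euler step of dh/dt = -decay * h + tanh(w_in * x + w_fb * h)."""
    drive = [math.tanh(w_in * xi + w_fb * hi) for xi, hi in zip(x, h)]
    return [hi + dt * (-decay * hi + di) for hi, di in zip(h, drive)]


def run_feedback(x, steps=200, w_in=1.0, w_fb=0.5, decay=1.0, dt=0.1):
    """Unroll the recurrence from a zero state and return every visited state."""
    h = [0.0] * len(x)
    trajectory = [h]
    for _ in range(steps):
        h = feedback_step(h, x, w_in, w_fb, decay, dt)
        trajectory.append(h)
    return trajectory


# The bottom-up input x stays fixed; only the internal state h is refined.
trajectory = run_feedback([0.5, -1.0])
final_state = trajectory[-1]
last_change = max(abs(a - b) for a, b in zip(trajectory[-1], trajectory[-2]))
print(final_state, last_change)
```

With these settings each step is a contraction (the update's slope stays between 0.9 and 0.95), so `last_change` shrinks geometrically and the state settles at a fixed point. Weakening the decay relative to the feedback weight removes that guarantee.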

RaceGAN: A Framework for Preserving Individuality while Converting Racial Information for Image-to-Image Translation

Authors: Mst Tasnim Pervin, George Bebis, Fang Jiang, Alireza Tavakkoli

Generative adversarial networks (GANs) have demonstrated significant progress in unpaired image-to-image translation in recent years for several applications. CycleGAN was the first to lead the way, although it was restricted to a pair of domains. StarGAN overcame this constraint by tackling image-to-image translation across various domains, although it was not able to map in-depth low-level style changes for these domains. Style mapping via reference-guided image synthesis has been made possible by the innovations of StarGANv2 and StyleGAN. However, these models do not maintain individuality and need an extra reference image in addition to the input. Our study aims to translate racial traits by means of multi-domain image-to-image translation. We present RaceGAN, a novel framework capable of mapping style codes over several domains during racial attribute translation while maintaining individuality and high-level semantics without relying on a reference image. RaceGAN outperforms other models in translating racial features (i.e., Asian, White, and Black) when tested on the Chicago Face Dataset. We also give quantitative findings utilizing InceptionResNetV2-based classification to demonstrate the effectiveness of our racial translation. Moreover, we investigate how well the model partitions the latent space into distinct clusters of faces for each ethnic group.

Paper and project links

PDF

Summary

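Generative adversarial networks (GANs) have made significant progress in unpaired image-to-image translation in recent years. CycleGAN led the way but was restricted to a pair of domains. StarGAN removed this constraint by translating across multiple domains, but it could not map in-depth low-level style changes for them. StarGANv2 and StyleGAN enabled style mapping via reference-guided image synthesis, yet these models do not maintain individuality and need a reference image in addition to the input. This study translates racial traits through multi-domain image-to-image translation. It presents RaceGAN, a framework that maps style codes over several domains during racial attribute translation while maintaining individuality and high-level semantics, without relying on a reference image. On the Chicago Face Dataset, RaceGAN outperforms other models in translating racial features (i.e., Asian, White, and Black). Quantitative results from an InceptionResNetV2-based classifier support the effectiveness of the translation. The authors also investigate how well the model partitions the latent space into distinct clusters of faces for each ethnic group.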

Key Takeaways