⚠️ All of the summaries below are generated by a large language model and may contain errors; they are provided for reference only, so use them with caution.
🔴 Please note: never rely on these summaries in serious academic settings; use them only for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-18
Intrinsic Dimension Estimation for Radio Galaxy Zoo using Diffusion Models
Authors:Joan Font-Quer Roset, Devina Mohan, Anna Scaife
In this work, we estimate the intrinsic dimension (iD) of the Radio Galaxy Zoo (RGZ) dataset using a score-based diffusion model. We examine how the iD estimates vary as a function of Bayesian neural network (BNN) energy scores, which measure how similar the radio sources are to the MiraBest subset of the RGZ dataset. We find that out-of-distribution sources exhibit higher iD values, and that the overall iD for RGZ exceeds those typically reported for natural image datasets. Furthermore, we analyse how iD varies across Fanaroff-Riley (FR) morphological classes and as a function of the signal-to-noise ratio (SNR). While no relationship is found between FR I and FR II classes, a weak trend toward higher SNR at lower iD is observed. Future work using the RGZ dataset could make use of the relationship between iD and energy scores to quantitatively study and improve the representations learned by various self-supervised learning algorithms.
Paper & Project Links
PDF 9 pages, 5 figures, 2 tables, submitted to NeurIPS 2025 ML4PS Workshop
Summary
This paper estimates the intrinsic dimension (iD) of the Radio Galaxy Zoo (RGZ) dataset using a score-based diffusion model. Bayesian neural network (BNN) energy scores are used to measure how similar the radio sources are to the MiraBest subset of the RGZ dataset. The study finds that out-of-distribution sources exhibit higher iD values and that the overall iD of RGZ exceeds the values typically reported for natural image datasets. The paper also analyses how iD varies across Fanaroff-Riley (FR) morphological classes and with the signal-to-noise ratio (SNR): no relationship is found between the FR I and FR II classes, but there is a weak trend toward higher SNR at lower iD. Future work on RGZ could exploit the relationship between iD and energy scores to quantitatively study and improve the representations learned by various self-supervised learning algorithms.
Key Takeaways
- Estimates the intrinsic dimension (iD) of the Radio Galaxy Zoo (RGZ) dataset with a diffusion model.
- Uses Bayesian neural network (BNN) energy scores to measure how similar radio sources are to the MiraBest subset.
- Out-of-distribution sources are found to exhibit higher iD values.
- The overall iD of the RGZ dataset exceeds that of natural image datasets.
- No clear relationship is found between iD and the Fanaroff-Riley (FR) morphological classes.
- There is a weak trend between SNR and iD, with higher SNR at lower iD.
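As a concrete illustration of the score-based iD idea, the minimal sketch below probes a score function around a data point and reads the dimension off the singular-value spectrum of the collected score vectors, which approximately span the normal space of the data manifold. This is a generic estimator of that family, not necessarily the paper's exact procedure; the toy score function and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def estimate_local_id(score_fn, x0, sigma=0.05, n_probes=512):
    """Estimate the local intrinsic dimension at x0 from a score function.

    score_fn(x, sigma) returns the (learned or analytic) score of the
    sigma-smoothed data distribution at x. Near the manifold these vectors
    concentrate in the normal space, whose rank is D - iD.
    """
    D = x0.size
    probes = x0[None, :] + sigma * np.random.randn(n_probes, D)
    scores = np.stack([score_fn(p, sigma) for p in probes])      # (n_probes, D)
    svals = np.linalg.svd(scores - scores.mean(0), compute_uv=False)
    k = np.argmax(svals[:-1] - svals[1:]) + 1                    # rank of the normal space
    return D - k

# Toy check: data confined to the first d of D coordinates -> estimated iD should be d.
rng = np.random.default_rng(0)
D, d = 32, 5
toy_score = lambda x, sigma: -np.concatenate([np.zeros(d), x[d:]]) / sigma**2
x0 = np.concatenate([rng.standard_normal(d), np.zeros(D - d)])
print("estimated iD:", estimate_local_id(toy_score, x0))         # -> 5
```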
Hi-DREAM: Brain Inspired Hierarchical Diffusion for fMRI Reconstruction via ROI Encoder and visuAl Mapping
Authors:Guowei Zhang, Yun Zhao, Moein Khajehnejad, Adeel Razi, Levin Kuhlmann
Mapping human brain activity to natural images offers a new window into vision and cognition, yet current diffusion-based decoders face a core difficulty: most condition directly on fMRI features without analyzing how visual information is organized across the cortex. This overlooks the brain’s hierarchical processing and blurs the roles of early, middle, and late visual areas. We propose Hi-DREAM, a brain-inspired conditional diffusion framework that makes the cortical organization explicit. A region-of-interest (ROI) adapter groups fMRI into early/mid/late streams and converts them into a multi-scale cortical pyramid aligned with the U-Net depth (shallow scales preserve layout and edges; deeper scales emphasize objects and semantics). A lightweight, depth-matched ControlNet injects these scale-specific hints during denoising. The result is an efficient and interpretable decoder in which each signal plays a brain-like role, allowing the model not only to reconstruct images but also to illuminate functional contributions of different visual areas. Experiments on the Natural Scenes Dataset (NSD) show that Hi-DREAM attains state-of-the-art performance on high-level semantic metrics while maintaining competitive low-level fidelity. These findings suggest that structuring conditioning by cortical hierarchy is a powerful alternative to purely data-driven embeddings and provides a useful lens for studying the visual cortex.
Paper & Project Links
Summary
Mapping brain activity to natural images offers a new window into vision and cognition, but existing diffusion-based decoders face a core difficulty: most condition directly on fMRI features and ignore how visual information is organized across the cortex. The authors propose Hi-DREAM, a brain-inspired conditional diffusion framework that makes the cortical organization explicit. A region-of-interest (ROI) adapter groups fMRI signals into early, mid, and late streams and converts them into a multi-scale cortical pyramid aligned with the U-Net depth. Experiments show that Hi-DREAM reaches state-of-the-art performance on high-level semantic metrics while remaining competitive on low-level fidelity, indicating that structuring conditioning by cortical hierarchy is a powerful alternative to purely data-driven embeddings and a useful lens for studying the visual cortex.
Key Takeaways
- Mapping brain activity to natural images offers a new window into vision and cognition.
- Current diffusion-based decoders mostly condition directly on fMRI features and overlook how visual information is organized across the cortex.
- Hi-DREAM is a brain-inspired conditional diffusion framework that makes the cortical organization explicit.
- An ROI adapter groups fMRI signals into early, mid, and late streams.
- The fMRI streams are converted into a multi-scale cortical pyramid aligned with the U-Net depth.
- Experiments show Hi-DREAM attains state-of-the-art performance, especially on high-level semantic metrics.
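A minimal sketch of how an ROI-grouped cortical pyramid could be wired up for ControlNet-style injection into a U-Net is shown below; the stream dimensions, channel counts, resolutions, and zero-initialised 1x1 convolutions are assumptions for illustration, not Hi-DREAM's actual architecture.

```python
import torch
import torch.nn as nn

class ROIPyramidAdapter(nn.Module):
    """Project early/mid/late fMRI streams to spatial hint maps at three
    U-Net depths (shallow = layout/edges, deep = objects/semantics)."""

    def __init__(self, roi_dims=(1024, 768, 512),
                 chans=(32, 64, 128), sizes=(16, 8, 4)):
        super().__init__()
        self.chans, self.sizes = chans, sizes
        self.proj = nn.ModuleList(
            nn.Linear(d, c * s * s) for d, c, s in zip(roi_dims, chans, sizes))
        # Zero-initialised 1x1 convs so the hints start as a no-op (ControlNet-style).
        self.zero_convs = nn.ModuleList(nn.Conv2d(c, c, 1) for c in chans)
        for conv in self.zero_convs:
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, rois):                     # rois: list of three (B, roi_dim) tensors
        hints = []
        for x, proj, zc, c, s in zip(rois, self.proj, self.zero_convs,
                                     self.chans, self.sizes):
            hints.append(zc(proj(x).view(-1, c, s, s)))
        return hints                             # added to U-Net features of matching depth

# Usage sketch: hints = adapter([early, mid, late]); unet_feat[k] = unet_feat[k] + hints[k]
adapter = ROIPyramidAdapter()
hints = adapter([torch.randn(2, 1024), torch.randn(2, 768), torch.randn(2, 512)])
print([tuple(h.shape) for h in hints])   # [(2, 32, 16, 16), (2, 64, 8, 8), (2, 128, 4, 4)]
```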
CountSteer: Steering Attention for Object Counting in Diffusion Models
Authors:Hyemin Boo, Hyoryung Kim, Myungjin Lee, Seunghyeon Lee, Jiyoung Lee, Jang-Hwan Choi, Hyunsoo Cho
Text-to-image diffusion models generate realistic and coherent images but often fail to follow numerical instructions in text, revealing a gap between language and visual representation. Interestingly, we found that these models are not entirely blind to numbers: they are implicitly aware of their own counting accuracy, as their internal signals shift in consistent ways depending on whether the output meets the specified count. This observation suggests that the model already encodes a latent notion of numerical correctness, which can be harnessed to guide generation more precisely. Building on this intuition, we introduce CountSteer, a training-free method that improves generation of specified object counts by steering the model’s cross-attention hidden states during inference. In our experiments, CountSteer improved object-count accuracy by about 4% without compromising visual quality, demonstrating a simple yet effective step toward more controllable and semantically reliable text-to-image generation.
Paper & Project Links
PDF Accepted to AAAI 2026 Workshop on Shaping Responsible Synthetic Data in the Era of Foundation Models (RSD)
Summary
Text-to-image diffusion models generate realistic and coherent images but often fail to follow numerical instructions in text, revealing a gap between language and visual representation. The authors find that these models are not entirely blind to numbers: their internal signals shift in consistent ways depending on whether the output matches the specified count, indicating an implicit awareness of counting accuracy. Building on this observation, they introduce CountSteer, a training-free method that steers the model's cross-attention hidden states during inference to improve generation of specified object counts. Experiments show CountSteer improves object-count accuracy by about 4% without compromising visual quality, a simple yet effective step toward more controllable and semantically reliable text-to-image generation.
Key Takeaways
- Text-to-image diffusion models follow textual descriptions well but struggle with instructions that specify object counts.
- The models are not entirely blind to numbers; they are implicitly aware of their own counting accuracy.
- Their internal signals shift consistently depending on whether the output meets the specified count.
- CountSteer steers the model's cross-attention hidden states to improve object-count accuracy without degrading visual quality.
- CountSteer is effective and requires no additional training.
- CountSteer improves the controllability and semantic reliability of text-to-image generation.
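Below is a minimal sketch of inference-time activation steering with a PyTorch forward hook, the general mechanism such methods rely on; how CountSteer actually chooses layers and computes its steering direction is not reproduced here, and the stand-in module, direction, and scale are assumptions.

```python
import torch
import torch.nn as nn

def register_steering_hook(module: nn.Module, direction: torch.Tensor, scale: float = 1.0):
    """Add `scale * direction` to a layer's output at inference time.

    `direction` could be, for example, the mean difference between hidden states
    of count-correct and count-incorrect generations (an assumption here)."""
    def hook(_mod, _inputs, output):
        return output + scale * direction.to(output.device, output.dtype)
    return module.register_forward_hook(hook)

# Toy usage with a linear layer standing in for a cross-attention output projection.
layer = nn.Linear(8, 8)
direction = torch.randn(8)
handle = register_steering_hook(layer, direction, scale=0.5)
out = layer(torch.randn(2, 8))     # steered output
handle.remove()                    # restore the original behaviour
```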
Parameter-Efficient MoE LoRA for Few-Shot Multi-Style Editing
Authors:Cong Cao, Yujie Xu, Xiaodong Xu
In recent years, image editing has garnered growing attention. However, general image editing models often fail to produce satisfactory results when confronted with new styles. The challenge lies in how to effectively fine-tune general image editing models to new styles using only a limited amount of paired data. To address this issue, this paper proposes a novel few-shot style editing framework. For this task, we construct a benchmark dataset that encompasses five distinct styles. Correspondingly, we propose a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. The style-specific routing ensures that different styles do not interfere with one another, while the style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. Our MoE LoRA can automatically determine the optimal ranks for each layer through a novel metric-guided approach that estimates the importance score of each single-rank component. Additionally, we explore the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrate adversarial learning and flow matching to guide the diffusion training process. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters.
Paper & Project Links
Summary
This paper proposes a few-shot style editing framework to address the poor results of general image editing models on new styles. A benchmark dataset covering five distinct styles is constructed, and a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing is proposed for jointly fine-tuning multiple styles. MoE LoRA automatically determines the optimal rank for each layer through a metric-guided approach that estimates the importance score of each single-rank component. The paper also explores the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrates adversarial learning and flow matching to guide diffusion training. Experiments show the method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters.
Key Takeaways
- Proposes a new few-shot style editing framework to handle the new styles that challenge general image editing models.
- Constructs a benchmark dataset of five distinct styles for evaluating and testing the framework.
- Proposes a parameter-efficient Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) for fine-tuning multiple styles jointly.
- MoE LoRA uses style-specific and style-shared routing so that different styles do not interfere with one another.
- MoE LoRA automatically determines the optimal rank per layer via a metric-guided estimate of each single-rank component's importance.
- Explores the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model to improve performance.
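A minimal sketch of a multi-style MoE LoRA linear layer is given below, with one low-rank expert per style selected by a style id and a small pool of shared experts mixed by a learned router; the ranks, expert counts, and routing form are assumptions, and the paper's metric-guided per-layer rank allocation is not reproduced.

```python
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    """Frozen base linear + style-specific LoRA expert + shared, routed LoRA experts."""

    def __init__(self, base: nn.Linear, n_styles=5, n_shared=2, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # only the adapters train
        d_in, d_out = base.in_features, base.out_features
        self.scale = alpha / rank

        def lora():
            return nn.ParameterDict({
                "A": nn.Parameter(0.01 * torch.randn(rank, d_in)),
                "B": nn.Parameter(torch.zeros(d_out, rank)),   # zero-init: starts as a no-op
            })

        self.style_experts = nn.ModuleList(lora() for _ in range(n_styles))
        self.shared_experts = nn.ModuleList(lora() for _ in range(n_shared))
        self.router = nn.Linear(d_in, n_shared)               # style-shared routing

    def forward(self, x, style_id: int):
        out = self.base(x)
        e = self.style_experts[style_id]                       # style-specific path
        out = out + self.scale * (x @ e["A"].T @ e["B"].T)
        w = torch.softmax(self.router(x), dim=-1)              # (B, n_shared) routing weights
        for k, e in enumerate(self.shared_experts):            # shared, adaptively weighted
            out = out + w[..., k:k + 1] * self.scale * (x @ e["A"].T @ e["B"].T)
        return out

layer = MoELoRALinear(nn.Linear(64, 64))
y = layer(torch.randn(4, 64), style_id=2)
```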
3D Gaussian and Diffusion-Based Gaze Redirection
Authors:Abiram Panchalingam, Indu Bodala, Stuart Middleton
High-fidelity gaze redirection is critical for generating augmented data to improve the generalization of gaze estimators. 3D Gaussian Splatting (3DGS) models like GazeGaussian represent the state-of-the-art but can struggle with rendering subtle, continuous gaze shifts. In this paper, we propose DiT-Gaze, a framework that enhances 3D gaze redirection models using a novel combination of Diffusion Transformer (DiT), weak supervision across gaze angles, and an orthogonality constraint loss. DiT allows higher-fidelity image synthesis, while our weak supervision strategy using synthetically generated intermediate gaze angles provides a smooth manifold of gaze directions during training. The orthogonality constraint loss mathematically enforces the disentanglement of internal representations for gaze, head pose, and expression. Comprehensive experiments show that DiT-Gaze sets a new state-of-the-art in both perceptual quality and redirection accuracy, reducing the state-of-the-art gaze error by 4.1% to 6.353 degrees, providing a superior method for creating synthetic training data. Our code and models will be made available for the research community to benchmark against.
Paper & Project Links
Summary
DiT-Gaze is a framework that enhances 3D gaze redirection by combining a Diffusion Transformer (DiT), weak supervision across gaze angles, and an orthogonality constraint loss, producing more realistic synthetic data for training gaze estimators. It improves both perceptual quality and redirection accuracy, reducing the state-of-the-art gaze error by 4.1% to 6.353 degrees. The code and models will be made publicly available to the research community.
Key Takeaways
- DiT-Gaze combines a Diffusion Transformer (DiT) with 3D gaze redirection models to enable higher-fidelity synthesis.
- A weak supervision strategy based on synthetically generated intermediate gaze angles provides a smooth manifold of gaze directions during training.
- An orthogonality constraint loss mathematically enforces the disentanglement of the internal representations for gaze, head pose, and expression.
- DiT-Gaze sets a new state of the art in both perceptual quality and redirection accuracy.
- DiT-Gaze reduces the state-of-the-art gaze error by 4.1%, to 6.353 degrees.
- The code and models will be made available for the research community to benchmark against.
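One simple way such an orthogonality constraint can be implemented is sketched below as a pairwise cosine penalty between the factor embeddings; the exact loss used by DiT-Gaze may differ, so treat this as an illustrative assumption.

```python
import torch

def orthogonality_loss(gaze: torch.Tensor, head: torch.Tensor, expr: torch.Tensor,
                       eps: float = 1e-8) -> torch.Tensor:
    """Penalise |cosine similarity| between each pair of factor embeddings
    so that gaze, head pose, and expression stay disentangled."""
    def abs_cos(a, b):
        a = a / (a.norm(dim=-1, keepdim=True) + eps)
        b = b / (b.norm(dim=-1, keepdim=True) + eps)
        return (a * b).sum(dim=-1).abs().mean()
    return abs_cos(gaze, head) + abs_cos(gaze, expr) + abs_cos(head, expr)

# Typical use (sketch): loss = task_loss + lambda_orth * orthogonality_loss(z_gaze, z_head, z_expr)
z = [torch.randn(4, 32) for _ in range(3)]
print(orthogonality_loss(*z))
```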
RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting
Authors:Ruocheng Wu, Haolan He, Yufei Wang, Zhihao Li, Bihan Wen
3D Gaussian Splatting (3DGS) has recently gained great attention in the 3D scene representation for its high-quality real-time rendering capabilities. However, when the input comprises sparse training views, 3DGS is prone to overfitting, primarily due to the lack of intermediate-view supervision. Inspired by the recent success of Video Diffusion Models (VDM), we propose a framework called Guidance Score Distillation (GSD) to extract the rich multi-view consistency priors from pretrained VDMs. Building on the insights from Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views, guiding the Gaussian splatting representation towards the generative direction of VDM. However, the generative direction often involves object motion and random camera trajectories, making it challenging for direct supervision in the optimization process. To address this problem, we introduce a unified guidance form to correct the noise prediction result of VDM. Specifically, we incorporate both a depth warp guidance based on real depth maps and a guidance based on semantic image features, ensuring that the score update direction from VDM aligns with the correct camera pose and accurate geometry. Experimental results show that our method outperforms existing approaches across multiple datasets.
Paper & Project Links
Summary
3D Gaussian Splatting (3DGS) has attracted wide attention for its high-quality real-time rendering, but with sparse training views it is prone to overfitting because intermediate-view supervision is missing. Inspired by the success of Video Diffusion Models (VDM), the authors propose Guidance Score Distillation (GSD) to extract rich multi-view consistency priors from pretrained VDMs. Building on Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views and guides the Gaussian splatting representation toward the generative direction of the VDM. Because that direction often involves object motion and random camera trajectories, direct supervision during optimization is challenging, so a unified guidance form is introduced to correct the VDM's noise prediction, combining depth-warp guidance based on real depth maps with guidance based on semantic image features so that the score update direction aligns with the correct camera pose and accurate geometry. Experiments show the method outperforms existing approaches across multiple datasets.
Key Takeaways
- 3DGS offers high-quality real-time rendering for 3D scene representation.
- With sparse training views, 3DGS is prone to overfitting.
- Inspired by the success of VDMs, GSD extracts multi-view consistency priors from pretrained VDMs.
- Building on SDS, GSD supervises rendered images from multiple neighboring views and guides the Gaussian splatting representation toward the VDM's generative direction.
- The generative direction often involves object motion and random camera trajectories, which makes direct supervision during optimization challenging.
- A unified guidance form corrects the VDM's noise prediction so that the score update direction aligns with the correct camera pose and geometry.
- Experiments show GSD outperforms existing methods on multiple datasets.
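For reference, the minimal sketch below shows the plain SDS update that GSD builds on, assuming a frozen epsilon-prediction diffusion backbone; GSD's depth-warp and semantic-feature corrections to the predicted noise are not shown, and the weighting and timestep range are common but assumed choices.

```python
import torch

@torch.no_grad()
def sds_direction(rendered, eps_model, alphas_cumprod, t_min=50, t_max=950):
    """rendered: (B, C, H, W) views from the 3DGS renderer.
    eps_model(x_t, t) -> predicted noise of a frozen diffusion / video-diffusion model.
    alphas_cumprod: (T,) tensor of cumulative noise-schedule products."""
    B = rendered.shape[0]
    t = torch.randint(t_min, t_max, (B,), device=rendered.device)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    eps = torch.randn_like(rendered)
    x_t = a.sqrt() * rendered + (1.0 - a).sqrt() * eps      # forward-diffuse the render
    eps_pred = eps_model(x_t, t)                            # GSD's guidance corrections go here
    w = 1.0 - a                                             # common SDS weighting choice
    return w * (eps_pred - eps)                             # pseudo-gradient on the pixels

# Optimisation step (sketch): render a view with gradients enabled, then call
# rendered.backward(gradient=sds_direction(...)) and step the Gaussian optimiser.
```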
Evaluating Latent Generative Paradigms for High-Fidelity 3D Shape Completion from a Single Depth Image
Authors:Matthias Humt, Ulrich Hillenbrand, Rudolph Triebel
While generative models have seen significant adoption across a wide range of data modalities, including 3D data, a consensus on which model is best suited for which task has yet to be reached. Further, conditional information such as text and images to steer the generation process are frequently employed, whereas others, like partial 3D data, have not been thoroughly evaluated. In this work, we compare two of the most promising generative models–Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers–which we adapt for the tasks of generative shape modeling and completion. We conduct a thorough quantitative evaluation and comparison of both tasks, including a baseline discriminative model and an extensive ablation study. Our results show that (1) the diffusion model with continuous latents outperforms both the discriminative model and the autoregressive approach and delivers state-of-the-art performance on multi-modal shape completion from a single, noisy depth image under realistic conditions and (2) when compared on the same discrete latent space, the autoregressive model can match or exceed diffusion performance on these tasks.
Paper & Project Links
PDF 17 pages, 4 figures, 19 tables
Summary
This work compares two of the most promising generative models, Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers, adapted for generative shape modeling and completion. The results show that the diffusion model with continuous latents outperforms both the discriminative baseline and the autoregressive approach, reaching state-of-the-art multi-modal shape completion from a single noisy depth image, while on the same discrete latent space the autoregressive model can match or exceed diffusion performance.
Key Takeaways
- Generative models are widely used across data modalities, but there is no consensus on which model best suits which task.
- Diffusion models and autoregressive models are two of the most promising generative approaches.
- The diffusion model with continuous latents outperforms the discriminative model and the autoregressive approach, excelling at multi-modal shape completion from a single noisy depth image.
- When compared on the same discrete latent space, the autoregressive model can match or exceed diffusion performance.
- Conditional information such as text and images is commonly used to steer generation, whereas partial 3D data has not been thoroughly evaluated.
- The thorough quantitative evaluation and comparison, including an extensive ablation study, provide insight into generative shape modeling and completion.
Algorithms Trained on Normal Chest X-rays Can Predict Health Insurance Types
Authors:Chi-Yu Chen, Rawan Abulibdeh, Arash Asgari, Leo Anthony Celi, Deirdre Goode, Hassan Hamidi, Laleh Seyyed-Kalantari, Po-Chih Kuo, Ned McCague, Thomas Sounack
Artificial intelligence is revealing what medicine never intended to encode. Deep vision models, trained on chest X-rays, can now detect not only disease but also invisible traces of social inequality. In this study, we show that state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) can predict a patient’s health insurance type, a strong proxy for socioeconomic status, from normal chest X-rays with significant accuracy (AUC around 0.67 on MIMIC-CXR-JPG, 0.68 on CheXpert). The signal persists even when age, race, and sex are controlled for, and remains detectable when the model is trained exclusively on a single racial group. Patch-based occlusion reveals that the signal is diffuse rather than localized, embedded in the upper and mid-thoracic regions. This suggests that deep networks may be internalizing subtle traces of clinical environments, equipment differences, or care pathways; learning socioeconomic segregation itself. These findings challenge the assumption that medical images are neutral biological data. By uncovering how models perceive and exploit these hidden social signatures, this work reframes fairness in medical AI: the goal is no longer only to balance datasets or adjust thresholds, but to interrogate and disentangle the social fingerprints embedded in clinical data itself.
Paper & Project Links
PDF Submitting to MIDL 2026
Summary
Artificial intelligence can detect not only disease but also invisible traces of social inequality in chest X-rays. State-of-the-art architectures can predict a patient's health insurance type, a strong proxy for socioeconomic status, from normal chest X-rays, with AUCs of about 0.67 on MIMIC-CXR-JPG and 0.68 on CheXpert. The signal persists after controlling for age, race, and sex, and remains detectable when the model is trained exclusively on a single racial group. This suggests that deep networks may be internalizing subtle traces of clinical environments, equipment differences, or care pathways, in effect learning socioeconomic segregation itself. The findings challenge the assumption that medical images are neutral biological data and reframe fairness in medical AI.
Key Takeaways
- AI can identify traces of social inequality in chest X-rays.
- State-of-the-art models can predict a patient's health insurance type, a strong proxy for socioeconomic status.
- The predictive signal remains significant after controlling for age, race, and sex.
- The signal is pervasive: it appears on multiple datasets and remains detectable when training on a single racial group.
- Patch-based occlusion shows the signal is diffuse rather than localized, embedded in the upper and mid-thoracic regions.
- The study challenges the assumption that medical images are neutral data and reveals how models perceive and exploit hidden social signatures.
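The patch-based occlusion analysis mentioned above can be sketched as follows, assuming a binary classifier that maps a single-image batch to one logit; the patch size, fill value, and stand-in model are assumptions.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def occlusion_map(model: nn.Module, image: torch.Tensor, patch: int = 16, fill: float = 0.0):
    """Slide a constant patch over `image` (C, H, W) and record how much the
    predicted probability drops; larger drops mark regions carrying the signal."""
    model.eval()
    base = torch.sigmoid(model(image[None]))[0, 0].item()
    _, H, W = image.shape
    heat = torch.zeros(H // patch, W // patch)
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            x = image.clone()
            x[:, i:i + patch, j:j + patch] = fill
            heat[i // patch, j // patch] = base - torch.sigmoid(model(x[None]))[0, 0].item()
    return heat

# Toy usage with a stand-in classifier (a real study would use e.g. a DenseNet121).
model = nn.Sequential(nn.Flatten(), nn.Linear(1 * 64 * 64, 1))
print(occlusion_map(model, torch.randn(1, 64, 64), patch=16).shape)   # torch.Size([4, 4])
```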
SP-Guard: Selective Prompt-adaptive Guidance for Safe Text-to-Image Generation
Authors:Sumin Yu, Taesup Moon
While diffusion-based T2I models have achieved remarkable image generation quality, they also enable easy creation of harmful content, raising social concerns and highlighting the need for safer generation. Existing inference-time guiding methods lack both adaptivity–adjusting guidance strength based on the prompt–and selectivity–targeting only unsafe regions of the image. Our method, SP-Guard, addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask to guide only unsafe areas. Experiments show that SP-Guard generates safer images than existing methods while minimizing unintended content alteration. Beyond improving safety, our findings highlight the importance of transparency and controllability in image generation.
Paper & Project Links
PDF Accepted for presentation at TRUST-AI Workshop, ECAI 2025. Proceedings to appear in CEUR-WS
Summary
Diffusion-based T2I models achieve remarkable image generation quality but also make it easy to create harmful content, raising social concerns and highlighting the need for safer generation. Existing inference-time guiding methods lack adaptivity (adjusting guidance strength based on the prompt) and selectivity (targeting only the unsafe regions of the image). SP-Guard addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask so that only unsafe areas are guided. Experiments show that SP-Guard generates safer images than existing methods while minimizing unintended content alteration, and the findings also highlight the importance of transparency and controllability in image generation.
Key Takeaways
- Diffusion models deliver excellent image generation quality but can also produce harmful content.
- Existing inference-time guiding methods lack the adaptivity and selectivity required for safe generation.
- SP-Guard addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask.
- SP-Guard generates safer images while minimizing unintended content alteration.
- Beyond improving safety, transparency and controllability matter in image generation.
- The experimental results demonstrate SP-Guard's effectiveness and advantages over existing methods.
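A minimal sketch of how selective, prompt-adaptive guidance could be applied to a noise prediction at each denoising step is shown below; this is not SP-Guard's exact formulation, and the harmfulness score, mask source, and guidance scale are assumptions.

```python
import torch

def selective_safe_eps(eps_cond: torch.Tensor, eps_safe: torch.Tensor,
                       harmfulness: float, unsafe_mask: torch.Tensor,
                       max_scale: float = 7.5) -> torch.Tensor:
    """Blend the conditional noise prediction toward a 'safe' prediction only
    inside the unsafe-region mask, with strength scaled by prompt harmfulness.

    eps_cond, eps_safe: (B, C, H, W) noise predictions; unsafe_mask: (B, 1, H, W)
    values in [0, 1]; harmfulness: scalar in [0, 1] estimated from the prompt."""
    scale = max_scale * harmfulness                                 # adaptivity
    return eps_cond + unsafe_mask * scale * (eps_safe - eps_cond)   # selectivity

# Usage sketch inside a sampling loop:
# eps = selective_safe_eps(eps_cond, eps_safe, harmfulness=0.8, unsafe_mask=mask)
```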
Frequency-Aware Vision-Language Multimodality Generalization Network for Remote Sensing Image Classification
Authors:Junjie Zhang, Feng Zhao, Hanqiang Liu, Jun Yu
The booming remote sensing (RS) technology is giving rise to a novel multimodality generalization task, which requires the model to overcome data heterogeneity while possessing powerful cross-scene generalization ability. Moreover, most vision-language models (VLMs) usually describe surface materials in RS images using universal texts, lacking proprietary linguistic prior knowledge specific to different RS vision modalities. In this work, we formalize RS multimodality generalization (RSMG) as a learning paradigm, and propose a frequency-aware vision-language multimodality generalization network (FVMGN) for RS image classification. Specifically, a diffusion-based training-test-time augmentation (DTAug) strategy is designed to reconstruct multimodal land-cover distributions, enriching input information for FVMGN. Following that, to overcome multimodal heterogeneity, a multimodal wavelet disentanglement (MWDis) module is developed to learn cross-domain invariant features by resampling low and high frequency components in the frequency domain. Considering the characteristics of RS vision modalities, shared and proprietary class texts are designed as linguistic inputs for the transformer-based text encoder to extract diverse text features. For multimodal vision inputs, a spatial-frequency-aware image encoder (SFIE) is constructed to realize local-global feature reconstruction and representation. Finally, a multiscale spatial-frequency feature alignment (MSFFA) module is suggested to construct a unified semantic space, ensuring refined multiscale alignment of different text and vision features in spatial and frequency domains. Extensive experiments show that FVMGN has the excellent multimodality generalization ability compared with state-of-the-art (SOTA) methods.
Paper & Project Links
Summary
The booming remote sensing (RS) technology gives rise to a multimodality generalization task that requires models to overcome data heterogeneity while possessing strong cross-scene generalization ability. This paper formalizes RS multimodality generalization and proposes a frequency-aware vision-language multimodality generalization network (FVMGN) for RS image classification. A diffusion-based training-test-time augmentation (DTAug) strategy reconstructs multimodal land-cover distributions to enrich FVMGN's inputs. To overcome multimodal heterogeneity, a multimodal wavelet disentanglement (MWDis) module learns cross-domain invariant features by resampling low- and high-frequency components in the frequency domain. Shared and proprietary class texts serve as linguistic inputs to a transformer-based text encoder, while a spatial-frequency-aware image encoder (SFIE) performs local-global feature reconstruction and representation for the multimodal vision inputs. Finally, a multiscale spatial-frequency feature alignment (MSFFA) module builds a unified semantic space that ensures refined multiscale alignment of text and vision features in the spatial and frequency domains. Experiments show FVMGN has excellent multimodality generalization ability compared with state-of-the-art methods.
Key Takeaways
- The new RS multimodality generalization task requires models to overcome data heterogeneity while having strong cross-scene generalization ability.
- A frequency-aware vision-language multimodality generalization network (FVMGN) is proposed for RS image classification.
- A diffusion-based training-test-time augmentation strategy reconstructs multimodal land-cover distributions.
- A multimodal wavelet disentanglement module overcomes multimodal heterogeneity by learning cross-domain invariant features.
- Shared and proprietary class texts are designed as linguistic inputs to reflect the characteristics of RS vision modalities.
- A spatial-frequency-aware image encoder realizes local-global feature reconstruction and representation.
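The frequency-domain idea can be illustrated with the minimal sketch below, which splits an image into low- and high-frequency parts and perturbs only the low-frequency component; FVMGN's MWDis module learns its resampling rather than randomising it, so the pooling-based split and the random scale here are purely illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def resample_low_frequency(x: torch.Tensor, scale_range=(0.5, 1.5)) -> torch.Tensor:
    """x: (B, C, H, W) with even H, W. One-level low/high split via average
    pooling (low-pass) and a residual (high-pass); the low-frequency part,
    which carries much of the style/illumination variation, is randomly rescaled."""
    low = F.interpolate(F.avg_pool2d(x, 2), scale_factor=2, mode="nearest")
    high = x - low                                            # edges / structure
    s = torch.empty(x.size(0), 1, 1, 1, device=x.device).uniform_(*scale_range)
    return s * low + high

x = torch.randn(2, 3, 32, 32)
print(resample_low_frequency(x).shape)    # torch.Size([2, 3, 32, 32])
```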
Fast Data Attribution for Text-to-Image Models
Authors:Sheng-Yu Wang, Aaron Hertzmann, Alexei A Efros, Richard Zhang, Jun-Yan Zhu
Data attribution for text-to-image models aims to identify the training images that most significantly influenced a generated output. Existing attribution methods involve considerable computational resources for each query, making them impractical for real-world applications. We propose a novel approach for scalable and efficient data attribution. Our key idea is to distill a slow, unlearning-based attribution method to a feature embedding space for efficient retrieval of highly influential training images. During deployment, combined with efficient indexing and search methods, our method successfully finds highly influential images without running expensive attribution algorithms. We show extensive results on both medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, demonstrating that our method can achieve better or competitive performance in a few seconds, faster than existing methods by 2,500x - 400,000x. Our work represents a meaningful step towards the large-scale application of data attribution methods on real-world models such as Stable Diffusion.
Paper & Project Links
PDF NeurIPS 2025 camera ready. Project page: https://peterwang512.github.io/FastGDA
Summary
This paper proposes an efficient and scalable data attribution method for identifying the training images that most influence the output of a text-to-image model. The key idea is to distill a slow, unlearning-based attribution method into a feature embedding space so that highly influential training images can be retrieved efficiently. Combined with efficient indexing and search at deployment time, the method finds highly influential images without running expensive attribution algorithms. On medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, it achieves better or competitive performance within a few seconds, 2,500x to 400,000x faster than existing methods, a meaningful step toward applying data attribution at scale to real-world models such as Stable Diffusion.
Key Takeaways
- Proposes a new data attribution method for text-to-image models.
- Distills attribution into a feature embedding space for efficient retrieval of the training images that most influence a generated output.
- Combined with efficient indexing and search, influential images are found without running expensive attribution algorithms.
- Achieves better or competitive performance on both medium-scale and large-scale models while greatly improving efficiency.
- Delivers speedups of roughly 2,500x to 400,000x over existing methods.
- Lays the groundwork for large-scale application of data attribution to real-world models.
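The deployment-time step implied above reduces to a nearest-neighbour search once influence has been distilled into an embedding space (the training-time distillation is not shown); the sketch below assumes cosine similarity and precomputed training-image embeddings. In practice the dense similarity would typically be replaced by an approximate nearest-neighbour index for very large training sets.

```python
import torch
import torch.nn.functional as F

def topk_influential(query_feat: torch.Tensor, train_feats: torch.Tensor, k: int = 10):
    """query_feat: (D,) embedding of the generated image; train_feats: (N, D)
    precomputed embeddings of the training set. Returns indices and scores of
    the k most influential training images under the distilled embedding."""
    q = F.normalize(query_feat, dim=-1)
    T = F.normalize(train_feats, dim=-1)
    scores = T @ q                           # cosine similarity to every training image
    vals, idx = scores.topk(k)
    return idx, vals

idx, vals = topk_influential(torch.randn(128), torch.randn(10_000, 128), k=5)
```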
STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data
Authors:Yongdeuk Seo, Hyun-seok Min, Sungchul Choi
Scene Text Editing (STE) is the task of modifying text content in an image while preserving its visual style, such as font, color, and background. While recent diffusion-based approaches have shown improvements in visual quality, key limitations remain: lack of support for low-resource languages, domain gap between synthetic and real data, and the absence of appropriate metrics for evaluating text style preservation. To address these challenges, we propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. We also construct a new dataset, STIPLAR(Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation. Furthermore, we propose Text Appearance Similarity (TAS), a novel metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experimental results demonstrate that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving an average TAS improvement of 2.2% across languages over the baselines.
Paper & Project Links
PDF Accepted to AAAI 2026 Workshop (Artificial Intelligence with Biased or Scarce Data)
Summary
Scene Text Editing (STE) modifies the text content in an image while preserving its visual style, such as font, color, and background. To address the lack of support for low-resource languages, the domain gap between synthetic and real data, and the absence of suitable metrics for evaluating text style preservation, the authors propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. A new dataset, STIPLAR (Scene Text Image Pairs of Low-resource lAnguages and Real-world data), is constructed for training and evaluation. The authors also propose Text Appearance Similarity (TAS), a metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experiments show that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, with an average TAS improvement of 2.2% across languages over the baselines.
Key Takeaways
- STELLAR addresses the challenges of scene text editing, supporting low-resource languages and real-world data.
- A language-adaptive glyph encoder enables reliable multilingual editing.
- A multi-stage training strategy first pre-trains on synthetic data and then fine-tunes on real images.
- The STIPLAR dataset is constructed for training and evaluating scene text editing models.
- Text Appearance Similarity (TAS) is a new metric for evaluating text style preservation.
- Experiments show STELLAR clearly improves visual consistency and recognition accuracy over existing models.
Self-Diffusion Driven Blind Imaging
Authors:Yanlong Yang, Guanxiong Luo
Optical imaging systems are inherently imperfect due to diffraction limits, lens manufacturing tolerances, assembly misalignment, and other physical constraints. In addition, unavoidable camera shake and object motion further introduce non-ideal degradations during acquisition. These aberrations and motion-induced variations are typically unknown, difficult to measure, and costly to model or calibrate in practice. Blind inverse problems offer a promising direction by jointly estimating both the latent image and the unknown degradation kernel. However, existing approaches often suffer from convergence instability, limited prior expressiveness, and sensitivity to hyperparameters. Inspired by recent advances in self-diffusion, we propose DeblurSDI, a zero-shot, self-supervised blind imaging framework that requires no pre-training. DeblurSDI formulates blind image recovery as an iterative reverse self-diffusion process that begins from pure noise and progressively refines both the sharp image and the blur kernel. Extensive experiments on combined optical aberrations and motion blur demonstrate that DeblurSDI consistently outperforms other methods by a substantial margin.
Paper & Project Links
Summary
Optical imaging systems are inherently imperfect due to diffraction limits, lens manufacturing tolerances, assembly misalignment, and other physical constraints, and unavoidable camera shake and object motion introduce further non-ideal degradations during acquisition. These aberrations and motion-induced variations are usually unknown, hard to measure, and costly to model or calibrate. Blind inverse problems offer a promising direction by jointly estimating the latent image and the unknown degradation kernel, but existing approaches suffer from convergence instability, limited prior expressiveness, and sensitivity to hyperparameters. Inspired by recent advances in self-diffusion, the authors propose DeblurSDI, a zero-shot, self-supervised blind imaging framework that requires no pre-training. DeblurSDI formulates blind image recovery as an iterative reverse self-diffusion process that starts from pure noise and progressively refines both the sharp image and the blur kernel. Experiments on combined optical aberrations and motion blur show that DeblurSDI consistently outperforms other methods by a substantial margin.
Key Takeaways
- Optical imaging systems have many inherent imperfections, including physical constraints and aberrations.
- Blind inverse problems are promising for jointly estimating the latent image and the unknown degradation kernel.
- Existing approaches to blind inverse problems suffer from issues such as convergence instability.
- DeblurSDI is a self-supervised blind imaging framework that recovers images through an iterative reverse self-diffusion process.
- DeblurSDI requires no pre-training and handles combined optical aberrations and motion blur effectively.
- Experiments show DeblurSDI outperforms other methods by a substantial margin.
- DeblurSDI builds on recent advances in self-diffusion, progressively refining both the sharp image and the blur kernel to recover the image.
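The data-consistency core of blind deblurring, jointly nudging a latent sharp image and a blur kernel so that their convolution matches the observation, can be sketched as below; DeblurSDI wraps such updates inside an iterative reverse self-diffusion loop starting from noise, which is not reproduced here, and the single-channel shapes, kernel parameterisation, and learning rate are assumptions.

```python
import torch
import torch.nn.functional as F

def blind_deblur_step(x: torch.Tensor, k: torch.Tensor, y: torch.Tensor, lr: float = 1e-2):
    """One gradient step on both unknowns of y ~ x (*) k.

    x: (1, 1, H, W) latent sharp image, k: (1, 1, ks, ks) kernel logits with ks odd,
    y: (1, 1, H, W) observed blurry image."""
    x = x.detach().requires_grad_(True)
    k = k.detach().requires_grad_(True)
    psf = torch.softmax(k.flatten(), dim=0).view_as(k)        # non-negative, sums to 1
    y_hat = F.conv2d(x, psf, padding=k.shape[-1] // 2)
    loss = F.mse_loss(y_hat, y)
    gx, gk = torch.autograd.grad(loss, (x, k))
    return x.detach() - lr * gx, k.detach() - lr * gk, loss.item()

# Toy usage: start from a noisy image and a flat kernel and iterate.
y = torch.randn(1, 1, 32, 32)
x, k = torch.randn_like(y), torch.zeros(1, 1, 5, 5)
for _ in range(10):
    x, k, err = blind_deblur_step(x, k, y)
```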
Generative AI in Map-Making: A Technical Exploration and Its Implications for Cartographers
Authors:Claudio Affolter, Sidi Wu, Yizi Chen, Lorenz Hurni
Traditional map-making relies heavily on Geographic Information Systems (GIS), requiring domain expertise and being time-consuming, especially for repetitive tasks. Recent advances in generative AI (GenAI), particularly image diffusion models, offer new opportunities for automating and democratizing the map-making process. However, these models struggle with accurate map creation due to limited control over spatial composition and semantic layout. To address this, we integrate vector data to guide map generation in different styles, specified by the textual prompts. Our model is the first to generate accurate maps in controlled styles, and we have integrated it into a web application to improve its usability and accessibility. We conducted a user study with professional cartographers to assess the fidelity of generated maps, the usability of the web application, and the implications of ever-emerging GenAI in map-making. The findings have suggested the potential of our developed application and, more generally, the GenAI models in helping both non-expert users and professionals in creating maps more efficiently. We have also outlined further technical improvements and emphasized the new role of cartographers to advance the paradigm of AI-assisted map-making. The code and pre-trained models are available at https://github.com/claudaff/generative-ai-mapmaking/.
Paper & Project Links
Summary
This paper integrates vector data with image diffusion models, guided by textual prompts, to generate maps in different styles. The resulting model generates accurate maps in controlled styles and is integrated into a web application to improve usability and accessibility. A user study with professional cartographers suggests the potential of the application and, more broadly, of GenAI models to help both non-expert users and professionals create maps more efficiently.
Key Takeaways
- Traditional map-making relies on Geographic Information Systems (GIS), demands domain expertise, and is time-consuming.
- Generative AI (GenAI) offers new opportunities for automating and democratizing the map-making process.
- Image diffusion models alone struggle to create accurate maps because they offer limited control over spatial composition and semantic layout.
- Guiding map generation with vector data and textual prompts yields accurate maps in controlled styles.
- Integrating the model into a web application improves its usability and accessibility.
- A user study assessed the fidelity of generated maps, the usability of the web application, and the implications of emerging GenAI for map-making, indicating the application's potential.
MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing
Authors:Jinghan Yu, Junhao Xiao, Zhiyuan Ma, Yue Ma, Kaiqi Liu, Yuhan Wang, Daizong Liu, Xianghao Meng, Jianjun Li
Recent years have witnessed the success of diffusion models in image customization tasks. However, existing mask-guided human erasing methods still struggle in complex scenarios such as human-human occlusion, human-object entanglement, and human-background interference, mainly due to the lack of large-scale multi-instance datasets and effective spatial decoupling to separate foreground from background. To bridge these gaps, we curate the MILD dataset capturing diverse poses, occlusions, and complex multi-instance interactions. We then define the Cross-Domain Attention Gap (CAG), an attention-gap metric to quantify semantic leakage. On top of these, we propose Multi-Layer Diffusion (MILD), which decomposes the generation process into independent denoising pathways, enabling separate reconstruction of each foreground instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, a plug-and-play module that incorporates pose, parsing, and spatial relationships into the diffusion process to improve structural awareness and restoration quality. Additionally, we present Spatially-Modulated Attention, an adaptive mechanism that leverages spatial mask priors to modulate attention across semantic regions, further widening the CAG to effectively minimize boundary artifacts and mitigate semantic leakage. Experiments show that MILD significantly outperforms existing methods. Datasets and code are publicly available at: https://mild-multi-layer-diffusion.github.io/.
Paper & Project Links
Summary
Diffusion models have succeeded at image customization, but mask-guided human erasing still struggles in complex scenarios such as human-human occlusion, human-object entanglement, and human-background interference. To address this, the authors curate the MILD dataset and define the Cross-Domain Attention Gap (CAG), an attention-gap metric that quantifies semantic leakage. They propose Multi-Layer Diffusion (MILD), which decomposes the generation process into independent denoising pathways so that each foreground instance and the background are reconstructed separately. Human Morphology Guidance incorporates pose, parsing, and spatial relationships into the diffusion process to improve structural awareness and restoration quality, and Spatially-Modulated Attention leverages spatial mask priors to modulate attention across semantic regions, widening the CAG to reduce boundary artifacts and semantic leakage.
Key Takeaways
- Diffusion models have achieved notable success in image customization tasks.
- Existing human erasing methods still struggle in complex scenarios such as human-human occlusion.
- The MILD dataset is curated to capture diverse poses, occlusions, and complex multi-instance interactions.
- The Cross-Domain Attention Gap (CAG) is defined to quantify semantic leakage.
- Multi-Layer Diffusion (MILD) reconstructs each foreground instance and the background separately.
- Human Morphology Guidance incorporates pose, parsing, and spatial relationships to improve structural awareness and restoration quality.
Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach
Authors:Hangyu Liu, Bo Peng, Pengxiang Ding, Donglin Wang
Compared to single-target adversarial attacks, multi-target attacks have garnered significant attention due to their ability to generate adversarial images for multiple target classes simultaneously. However, existing generative approaches for multi-target attacks primarily encode target labels into one-dimensional tensors, leading to a loss of fine-grained visual information and overfitting to model-specific features during noise generation. To address this gap, we first identify and validate that the semantic feature quality and quantity are critical factors affecting the transferability of targeted attacks: 1) Feature quality refers to the structural and detailed completeness of the implanted target features, as deficiencies may result in the loss of key discriminative information; 2) Feature quantity refers to the spatial sufficiency of the implanted target features, as inadequacy limits the victim model’s attention to this feature. Based on these findings, we propose the 2D Tensor-Guided Adversarial Fusion (TGAF) framework, which leverages the powerful generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors for guiding adversarial noise generation. Additionally, we design a novel masking strategy tailored for the training process, ensuring that parts of the generated noise retain complete semantic information about the target class. Extensive experiments demonstrate that TGAF consistently surpasses state-of-the-art methods across various settings.
Paper & Project Links
PDF AAAI-26 (Oral)
Summary
Multi-target adversarial attacks can generate adversarial images for multiple target classes simultaneously, but existing generative approaches encode target labels as one-dimensional tensors, which loses fine-grained visual information and overfits to model-specific features during noise generation. Identifying semantic feature quality and quantity as the critical factors for the transferability of targeted attacks, the authors propose the 2D Tensor-Guided Adversarial Fusion (TGAF) framework, which leverages the generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors that guide adversarial noise generation. A masking strategy tailored to the training process ensures that parts of the generated noise retain complete semantic information about the target class. Experiments show TGAF consistently surpasses state-of-the-art methods across various settings.
Key Takeaways
- Multi-target attacks can generate adversarial images for multiple target classes simultaneously and have attracted wide attention.
- Existing generative methods encode target labels as one-dimensional tensors, losing fine-grained visual information and overfitting to model-specific features.
- Semantic feature quality and quantity are critical factors affecting the transferability of targeted attacks.
- Feature quality refers to the structural and detailed completeness of the implanted target features; deficiencies can lose key discriminative information.
- Feature quantity refers to the spatial sufficiency of the implanted target features; inadequacy limits the victim model's attention to the feature.
- The TGAF framework leverages the generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors that guide adversarial noise generation.
Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction
Authors:Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Xieyuanli Chen, Hesheng Wang
Predicting hand motion is critical for understanding human intentions and bridging the action space between human movements and robot manipulations. Existing hand trajectory prediction (HTP) methods forecast the future hand waypoints in 3D space conditioned on past egocentric observations. However, such models are only designed to accommodate 2D egocentric video inputs. There is a lack of awareness of multimodal environmental information from both 2D and 3D observations, hindering the further improvement of 3D HTP performance. In addition, these models overlook the synergy between hand movements and headset camera egomotion, either predicting hand trajectories in isolation or encoding egomotion only from past frames. To address these limitations, we propose novel diffusion models (MMTwin) for multimodal 3D hand trajectory prediction. MMTwin is designed to absorb multimodal information as input encompassing 2D RGB images, 3D point clouds, past hand waypoints, and text prompt. Besides, two latent diffusion models, the egomotion diffusion and the HTP diffusion as twins, are integrated into MMTwin to predict camera egomotion and future hand trajectories concurrently. We propose a novel hybrid Mamba-Transformer module as the denoising model of the HTP diffusion to better fuse multimodal features. The experimental results on three publicly available datasets and our self-recorded data demonstrate that our proposed MMTwin can predict plausible future 3D hand trajectories compared to the state-of-the-art baselines, and generalizes well to unseen environments. The code and pretrained models have been released at https://github.com/IRMVLab/MMTwin.
Paper & Project Links
PDF Accepted to IROS 2025
Summary
MMTwin is a novel diffusion model for multimodal 3D hand trajectory prediction. Because existing hand trajectory prediction models only accommodate 2D egocentric video inputs, MMTwin absorbs multimodal inputs including 2D RGB images, 3D point clouds, past hand waypoints, and a text prompt. Two latent diffusion models, an egomotion diffusion and an HTP diffusion, are integrated as twins to predict camera egomotion and future hand trajectories concurrently, and a hybrid Mamba-Transformer module serves as the denoising model of the HTP diffusion to better fuse multimodal features. Experiments on three public datasets and self-recorded data show that MMTwin predicts plausible future 3D hand trajectories compared with state-of-the-art baselines and generalizes well to unseen environments.
Key Takeaways
- Hand trajectory prediction is key to understanding human intentions and bridging the action space between human movements and robot manipulations.
- Existing methods forecast future hand trajectories only from past 2D egocentric observations and lack awareness of multimodal environmental information.
- MMTwin is a novel diffusion model that fuses multimodal inputs such as 2D RGB images and 3D point clouds.
- MMTwin integrates an egomotion diffusion and an HTP diffusion to handle camera egomotion and hand trajectory prediction concurrently.
- A hybrid Mamba-Transformer module serves as the denoising model to better fuse multimodal features.
- Experiments show MMTwin predicts 3D hand trajectories more accurately than existing methods and generalizes well.
Efficient Image Restoration via Latent Consistency Flow Matching
Authors:Elad Cohen, Idan Achituve, Idit Diamant, Arnon Netzer, Hai Victor Habi
Recent advances in generative image restoration (IR) have demonstrated impressive results. However, these methods are hindered by their substantial size and computational demands, rendering them unsuitable for deployment on edge devices. This work introduces ELIR, an Efficient Latent Image Restoration method. ELIR addresses the distortion-perception trade-off within the latent space and produces high-quality images using a latent consistency flow-based model. In addition, ELIR introduces an efficient and lightweight architecture. Consequently, ELIR is 4× smaller and faster than state-of-the-art diffusion and flow-based approaches for blind face restoration, enabling a deployment on resource-constrained devices. Comprehensive evaluations of various image restoration tasks and datasets show that ELIR achieves competitive performance compared to state-of-the-art methods, effectively balancing distortion and perceptual quality metrics while significantly reducing model size and computational cost. The code is available at: https://github.com/eladc-git/ELIR
Paper & Project Links
PDF 21 pages, 11 figures
Summary
This paper introduces ELIR, an efficient latent image restoration method. ELIR addresses the distortion-perception trade-off within the latent space and produces high-quality images with a latent consistency flow-based model. Thanks to an efficient, lightweight architecture, ELIR is about 4x smaller and faster than state-of-the-art diffusion and flow-based approaches for blind face restoration, enabling deployment on resource-constrained devices, while achieving performance competitive with state-of-the-art methods and balancing distortion and perceptual quality metrics.
Key Takeaways
- ELIR addresses the distortion-perception trade-off in generative image restoration.
- ELIR produces high-quality images with a latent consistency flow-based model.
- ELIR's efficient, lightweight architecture enables deployment on resource-constrained devices.
- ELIR achieves performance competitive with state-of-the-art methods on image restoration tasks.
- ELIR is about 4x smaller and faster than current state-of-the-art approaches.
- The code is publicly available at https://github.com/eladc-git/ELIR.
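For context, the sketch below shows a generic flow-matching training objective of the kind latent flow models build on: sample a time, linearly interpolate between a source latent and a clean target latent, and regress the predicted velocity onto the constant displacement. ELIR's actual latent-consistency objective and architecture are not reproduced, and the toy velocity network is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_matching_loss(model: nn.Module, z0: torch.Tensor, z1: torch.Tensor) -> torch.Tensor:
    """z0: source latents (e.g. encoded degraded image or noise), z1: clean target
    latents. Sample t ~ U(0, 1), interpolate, and regress the predicted velocity
    onto the constant target z1 - z0."""
    B = z0.shape[0]
    t = torch.rand(B, device=z0.device)
    t_ = t.view(B, *([1] * (z0.dim() - 1)))
    z_t = (1.0 - t_) * z0 + t_ * z1
    v_pred = model(z_t, t)
    return F.mse_loss(v_pred, z1 - z0)

# Toy velocity model over flat latents (illustrative only).
class ToyVelocity(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
    def forward(self, z, t):
        return self.net(torch.cat([z, t[:, None]], dim=-1))

model = ToyVelocity()
loss = flow_matching_loss(model, torch.randn(8, 64), torch.randn(8, 64))
```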
Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos
Authors:Junyi Ma, Jingyi Xu, Xieyuanli Chen, Hesheng Wang
Understanding how humans would behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To achieve this, some recent works have been proposed to simultaneously forecast hand trajectories and object affordances on human egocentric videos. The joint prediction serves as a comprehensive representation of future hand-object interactions in 2D space, indicating potential human motion and motivation. However, the existing approaches mostly adopt the autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence, and accumulates errors along the time axis. Meanwhile, these works basically overlook the effect of camera egomotion on first-person view predictions. To address these limitations, we propose a novel diffusion-based interaction prediction method, namely Diff-IP2D, to forecast future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner. We transform the sequential 2D images into latent feature space and design a denoising diffusion model to predict future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process to enable Diff-IP2D aware of the camera wearer’s dynamics for more accurate interaction prediction. Extensive experiments demonstrate that our method significantly outperforms the state-of-the-art baselines on both the off-the-shelf metrics and our newly proposed evaluation protocol. This highlights the efficacy of leveraging a generative paradigm for 2D hand-object interaction prediction. The code of Diff-IP2D is released as open source at https://github.com/IRMVLab/Diff-IP2D.
Paper & Project Links
PDF Accepted to IROS 2025
Summary
To predict hand-object interactions for service robot manipulation and extended reality, this paper proposes Diff-IP2D, a diffusion-based method that forecasts future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner. Sequential 2D images are transformed into a latent feature space, and a denoising diffusion model predicts future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process so that the model is aware of the camera wearer's dynamics, yielding more accurate interaction prediction. Experiments show the method significantly outperforms state-of-the-art baselines on both off-the-shelf metrics and a newly proposed evaluation protocol, highlighting the efficacy of a generative paradigm for 2D hand-object interaction prediction.
Key Takeaways
- Understanding hand-object interactions is vital for service robot manipulation and extended reality.
- Most existing methods adopt a unidirectional autoregressive paradigm, which lacks mutual constraints within the holistic future sequence and accumulates errors along the time axis.
- Diff-IP2D forecasts future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner.
- Diff-IP2D transforms sequential 2D images into a latent feature space and predicts future latent interaction features with a denoising diffusion model.
- Motion features are integrated into the conditional denoising process to account for the camera wearer's dynamics.
- Experiments show Diff-IP2D significantly outperforms existing methods on multiple evaluation metrics.
- The code is released as open source at https://github.com/IRMVLab/Diff-IP2D.