
Diffusion Models


⚠️ All of the summaries below are generated by large language models and may contain errors; they are for reference only, so use them with caution
🔴 Note: never use these summaries for serious academic work; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-11-21

Taming Generative Synthetic Data for X-ray Prohibited Item Detection

Authors:Jialong Sun, Hongguang Zhu, Weizhe Liu, Yunda Sun, Renshuai Tao, Yunchao Wei

Training prohibited item detection models requires a large amount of X-ray security images, but collecting and annotating these images is time-consuming and laborious. To address data insufficiency, X-ray security image synthesis methods composite images to scale up datasets. However, previous methods primarily follow a two-stage pipeline, where they implement labor-intensive foreground extraction in the first stage and then composite images in the second stage. Such a pipeline introduces inevitable extra labor cost and is not efficient. In this paper, we propose a one-stage X-ray security image synthesis pipeline (Xsyn) based on text-to-image generation, which incorporates two effective strategies to improve the usability of synthetic images. The Cross-Attention Refinement (CAR) strategy leverages the cross-attention map from the diffusion model to refine the bounding box annotation. The Background Occlusion Modeling (BOM) strategy explicitly models background occlusion in the latent space to enhance imaging complexity. To the best of our knowledge, compared with previous methods, Xsyn is the first to achieve high-quality X-ray security image synthesis without extra labor cost. Experiments demonstrate that our method outperforms all previous methods with 1.2% mAP improvement, and the synthetic images generated by our method are beneficial to improve prohibited item detection performance across various X-ray security datasets and detectors. Code is available at https://github.com/pILLOW-1/Xsyn/.
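As a rough illustration of the CAR idea, the sketch below derives a bounding box by thresholding a (suitably upsampled) cross-attention map for the prohibited-item token and taking the tight box around the surviving pixels. The normalization and threshold value are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def refine_bbox_from_attention(attn_map: np.ndarray, thresh: float = 0.5):
    """Derive a bounding box from a cross-attention map (H, W) for one token.

    A toy version of the CAR idea: min-max normalize the map, keep pixels
    above a threshold, and return the tight box around them. The threshold
    and normalization are illustrative assumptions, not the paper's settings.
    """
    a = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min() + 1e-8)
    ys, xs = np.nonzero(a > thresh)
    if len(xs) == 0:  # attention too diffuse: no refinement possible
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())  # x0, y0, x1, y1

# Usage: attention maps from a diffusion UNet are typically low-resolution
# (e.g. 16x16) and would be upsampled to image size before thresholding.
attn = np.random.rand(64, 64)
print(refine_bbox_from_attention(attn))
```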


Paper and Project Links

PDF

Summary

This paper proposes a one-stage X-ray security image synthesis pipeline (Xsyn) based on text-to-image generation. Xsyn adopts two strategies to improve the usability of synthetic images: Cross-Attention Refinement (CAR), which uses the diffusion model's cross-attention maps to refine bounding-box annotations, and Background Occlusion Modeling (BOM), which explicitly models background occlusion in the latent space. Compared with previous approaches, Xsyn is the first to achieve high-quality X-ray security image synthesis without extra labor cost, and its synthetic images improve prohibited item detection performance across multiple datasets and detectors.

Key Takeaways

  1. Training prohibited item detection models requires large amounts of X-ray security images, but collecting and annotating them is time-consuming and laborious.
  2. X-ray security image synthesis methods address data insufficiency by compositing images to scale up datasets.
  3. Existing methods mostly follow a two-stage pipeline, with labor-intensive foreground extraction in the first stage and image compositing in the second, which is costly and inefficient.
  4. This paper proposes a one-stage X-ray security image synthesis pipeline (Xsyn) based on text-to-image generation.
  5. Xsyn employs two strategies to improve synthetic image quality: Cross-Attention Refinement (CAR) and Background Occlusion Modeling (BOM).
  6. Xsyn achieves high-quality X-ray security image synthesis without extra labor cost.

Cool Papers

Click here to view paper screenshots

Jointly Conditioned Diffusion Model for Multi-View Pose-Guided Person Image Synthesis

Authors:Chengyu Xie, Zhi Gong, Junchi Ren, Linkun Yu, Si Shen, Fei Shen, Xiaoyu Du

Pose-guided human image generation is limited by incomplete textures from single reference views and the absence of explicit cross-view interaction. We present jointly conditioned diffusion model (JCDM), a jointly conditioned diffusion framework that exploits multi-view priors. The appearance prior module (APM) infers a holistic identity preserving prior from incomplete references, and the joint conditional injection (JCI) mechanism fuses multi-view cues and injects shared conditioning into the denoising backbone to align identity, color, and texture across poses. JCDM supports a variable number of reference views and integrates with standard diffusion backbones with minimal and targeted architectural modifications. Experiments demonstrate state of the art fidelity and cross-view consistency.
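A minimal sketch of the joint-conditioning idea, assuming mean-pooled fusion of a variable number of reference-view features injected into the denoiser via cross-attention; the fusion rule, dimensions, and residual injection are assumptions for illustration, not JCDM's actual design.

```python
import torch
import torch.nn as nn

class JointConditionInjection(nn.Module):
    """Toy sketch of fusing multi-view cues into one shared conditioning.

    Mean-pools over a variable number of reference views, then cross-attends
    from the denoiser's latent tokens into the fused condition. All choices
    here are illustrative assumptions, not JCDM's actual architecture.
    """
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, latents: torch.Tensor, view_feats: torch.Tensor):
        # latents:    (B, N_tokens, dim)   noisy image latents
        # view_feats: (B, V, N_ref, dim)   features from V reference views
        cond = view_feats.mean(dim=1)            # fuse views -> (B, N_ref, dim)
        out, _ = self.attn(latents, cond, cond)  # inject shared conditioning
        return latents + out                     # residual injection

jci = JointConditionInjection()
x = torch.randn(2, 64, 256)
refs = torch.randn(2, 3, 77, 256)  # 3 reference views; V can vary freely
print(jci(x, refs).shape)          # torch.Size([2, 64, 256])
```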


Paper and Project Links

PDF

Summary

The jointly conditioned diffusion model (JCDM) is a framework that exploits multi-view priors to address the incomplete textures and missing cross-view interaction that limit pose-guided human image generation. Its appearance prior module (APM) infers a holistic identity-preserving prior from incomplete references, while the joint conditional injection (JCI) mechanism fuses multi-view cues and injects shared conditioning into the denoising backbone to align identity, color, and texture across poses. JCDM supports a variable number of reference views, integrates with standard diffusion backbones with minimal targeted architectural modifications, and achieves state-of-the-art fidelity and cross-view consistency.

Key Takeaways

  1. JCDM addresses the incomplete-texture problem in pose-guided human image generation.
  2. JCDM exploits multi-view priors to improve the cross-view consistency of generated images.
  3. The appearance prior module (APM) infers an identity-preserving prior from incomplete references.
  4. The joint conditional injection (JCI) mechanism aligns identity, color, and texture across poses.
  5. JCDM supports a variable number of reference views.
  6. JCDM integrates with existing diffusion backbones with only minor architectural modifications.

Cool Papers

Click here to view paper screenshots

GeoMVD: Geometry-Enhanced Multi-View Generation Model Based on Geometric Information Extraction

Authors:Jiaqi Wu, Yaosen Chen, Shuyuan Zhu

Multi-view image generation holds significant application value in computer vision, particularly in domains like 3D reconstruction, virtual reality, and augmented reality. Most existing methods, which rely on extending single images, face notable computational challenges in maintaining cross-view consistency and generating high-resolution outputs. To address these issues, we propose the Geometry-guided Multi-View Diffusion Model, which incorporates mechanisms for extracting multi-view geometric information and adjusting the intensity of geometric features to generate images that are both consistent across views and rich in detail. Specifically, we design a multi-view geometry information extraction module that leverages depth maps, normal maps, and foreground segmentation masks to construct a shared geometric structure, ensuring shape and structural consistency across different views. To enhance consistency and detail restoration during generation, we develop a decoupled geometry-enhanced attention mechanism that strengthens feature focus on key geometric details, thereby improving overall image quality and detail preservation. Furthermore, we apply an adaptive learning strategy that fine-tunes the model to better capture spatial relationships and visual coherence between the generated views, ensuring realistic results. Our model also incorporates an iterative refinement process that progressively improves the output quality through multiple stages of image generation. Finally, a dynamic geometry information intensity adjustment mechanism is proposed to adaptively regulate the influence of geometric data, optimizing overall quality while ensuring the naturalness of generated images. More details can be found on the project page: https://sobeymil.github.io/GeoMVD.com.
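The dynamic geometry information intensity adjustment can be pictured as a tunable blend between image features and geometry-derived features, with the strength varied over denoising steps. The linear blend and decaying schedule below are toy assumptions, not the paper's mechanism.

```python
import torch

def geometry_intensity(step: int, total: int, lo: float = 0.2, hi: float = 1.0):
    """Toy dynamic schedule: strong geometric guidance early in denoising
    (when global shape is decided), weaker later (when fine texture forms).
    The linear decay is an illustrative assumption, not GeoMVD's mechanism."""
    t = step / max(total - 1, 1)
    return hi + (lo - hi) * t

def inject_geometry(img_feat: torch.Tensor, geo_feat: torch.Tensor, w: float):
    """Blend geometry-derived features (from depth/normal/mask encoders,
    assumed to share img_feat's shape) into the image features."""
    return img_feat + w * geo_feat

f, g = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
for step in (0, 25, 49):
    out = inject_geometry(f, g, geometry_intensity(step, 50))
    print(step, round(out.norm().item(), 2))
```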


Paper and Project Links

PDF

Summary

This paper proposes the Geometry-guided Multi-View Diffusion Model for multi-view image generation. By extracting multi-view geometric information and adjusting the intensity of geometric features, the model generates images that are consistent across views and rich in detail. It comprises a multi-view geometry information extraction module, a geometry-enhanced attention mechanism, and an adaptive learning strategy, and applies to 3D reconstruction, virtual reality, and augmented reality. More details on the project page: https://sobeymil.github.io/GeoMVD.com.

Key Takeaways

  1. Proposes a geometry-guided multi-view diffusion model for multi-view image generation.
  2. Designs a multi-view geometry information extraction module that uses depth maps, normal maps, and foreground segmentation masks to build a shared geometric structure, ensuring shape and structural consistency across views.
  3. Develops a decoupled geometry-enhanced attention mechanism that strengthens attention on key geometric details, improving image quality and detail preservation.
  4. Applies an adaptive learning strategy so the model better captures spatial relationships and visual coherence between views, keeping results realistic.
  5. Includes an iterative refinement process that progressively improves output quality across multiple generation stages.
  6. Proposes a dynamic geometry information intensity adjustment mechanism that adaptively regulates the influence of geometric data, optimizing overall quality while keeping generated images natural.

Cool Papers

Click here to view paper screenshots

Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image

Authors:Yuxiao Yang, Xiao-Xiao Long, Zhiyang Dou, Cheng Lin, Yuan Liu, Qingsong Yan, Yuexin Ma, Haoqian Wang, Zhiqiang Wu, Wei Yin

In this work, we introduce \textbf{Wonder3D++}, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of single-view reconstruction tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure the consistency of generation, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a cascaded 3D mesh extraction algorithm that derives high-quality surfaces from the multi-view 2D representations in only about $3$ minutes in a coarse-to-fine manner. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and good efficiency compared to prior works. Code available at https://github.com/xxlong0/Wonder3D/tree/Wonder3D_Plus.
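A toy sketch of multi-view cross-domain attention, where normal-map and color tokens from all views attend to one another so geometry and appearance stay consistent; the concat-then-self-attend scheme and dimensions are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossDomainAttention(nn.Module):
    """Toy multi-view cross-domain attention: normal and color tokens from
    all views attend jointly so geometry and appearance stay aligned.
    The concat-then-attend scheme and sizes are illustrative assumptions."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, normal_tokens: torch.Tensor, color_tokens: torch.Tensor):
        # normal_tokens, color_tokens: (B, V*N, dim), tokens from all V views
        joint = torch.cat([normal_tokens, color_tokens], dim=1)
        out, _ = self.attn(joint, joint, joint)  # full cross-view/domain mixing
        n, c = out.split([normal_tokens.shape[1], color_tokens.shape[1]], dim=1)
        return normal_tokens + n, color_tokens + c

attn = CrossDomainAttention()
n = torch.randn(1, 6 * 64, 256)  # 6 views of normal-map latents
c = torch.randn(1, 6 * 64, 256)  # 6 views of color latents
n2, c2 = attn(n, c)
print(n2.shape, c2.shape)
```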


Paper and Project Links

PDF 21 pages, 19 figures, accepted by TPAMI

Summary

This work introduces Wonder3D++, a method for efficiently generating high-fidelity textured meshes from single-view images. In contrast to SDS-based approaches, which suffer from time-consuming per-shape optimization and inconsistent geometry, a cross-domain diffusion model generates multi-view normal maps and corresponding color images, and a multi-view cross-domain attention mechanism ensures consistency of generation. A cascaded 3D mesh extraction algorithm then recovers high-quality surfaces from the multi-view 2D representations in a coarse-to-fine manner in about 3 minutes, yielding efficient, high-quality reconstruction.

Key Takeaways

  1. Wonder3D++ is a new method for efficiently generating high-fidelity textured meshes from single-view images.
  2. Unlike SDS-based methods, it avoids time-consuming per-shape optimization while recovering consistent geometry.
  3. A cross-domain diffusion model generates multi-view normal maps and corresponding color images.
  4. A multi-view cross-domain attention mechanism enables information exchange across views and modalities, ensuring consistent generation.
  5. A cascaded 3D mesh extraction algorithm recovers high-quality surfaces in a coarse-to-fine manner.
  6. The reconstruction results are high quality and generalize robustly.

Cool Papers

Click here to view paper screenshots

Denoising weak lensing mass maps with diffusion model: systematic comparison with generative adversarial network

Authors:Shohei D. Aoyama, Ken Osato, Masato Shirasaki

Removing the shape noise from the observed weak lensing field, i.e., denoising, enhances the potential of WL by accessing information at small scales where the shape noise dominates without denoising. We utilise two machine learning (ML) models for denoising: generative adversarial network (GAN) and diffusion model (DM). We evaluate the performance of denoising with GAN and DM utilising the large suite of mock WL observations, which serve as the training and test data sets. We apply denoising to 1,000 noisy mass maps with GAN and DM models trained with 39,000 mock observations. Both models can fairly well reproduce the true convergence map on large scales. Then, we measure cosmological statistics: power spectrum, bispectrum, one-point probability distribution function, peak and minima counts, and scattering transform coefficients. We find that DM outperforms GAN in almost all considered statistics and recovers the correct statistics down to small scales. For example, the angular power spectrum can be recovered with DM up to multipoles $\ell \lesssim 6000$ while the noise power spectrum dominates from $\ell \simeq 2000$. We also conduct stress tests on the trained model; denoising the maps with different characteristics, e.g., different source redshifts, from the training data. The performance degrades at small scales, but the statistics can still be recovered at large scales. Though the training of DM is more computationally demanding compared with GAN, there are several advantages: numerically stable training, higher performance in the reconstruction of cosmological statistics, and sampling multiple realisations once the model is trained. It has been known that DM can generate higher-quality images in real-world problems than GAN; this superiority is confirmed in the WL denoising problem as well.
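To make the comparison concrete, here is a flat-sky toy of the kind of summary statistic involved: an azimuthally averaged power spectrum that can be measured on true, GAN-denoised, and DM-denoised convergence maps. The binning and normalization are simplified assumptions, not the paper's measurement pipeline.

```python
import numpy as np

def power_spectrum_2d(kappa: np.ndarray, nbins: int = 30):
    """Azimuthally averaged power spectrum of a square map (flat-sky toy).

    Bin the 2D Fourier power by radial wavenumber; binning and units are
    simplified assumptions, not the paper's measurement pipeline.
    """
    n = kappa.shape[0]
    fk = np.fft.fftshift(np.fft.fft2(kappa))
    p2d = np.abs(fk) ** 2 / n**2
    ky, kx = np.indices((n, n)) - n // 2
    k = np.hypot(kx, ky).ravel()
    bins = np.linspace(1, k.max(), nbins + 1)
    idx = np.digitize(k, bins)
    pk = np.array([p2d.ravel()[idx == i].mean() for i in range(1, nbins + 1)])
    return 0.5 * (bins[1:] + bins[:-1]), pk

# Stand-ins for a true convergence map and a denoised reconstruction:
true_map = np.random.randn(128, 128)
denoised = true_map + 0.1 * np.random.randn(128, 128)
k, p_true = power_spectrum_2d(true_map)
_, p_den = power_spectrum_2d(denoised)
print((p_den / p_true)[:5])  # ratio close to 1 on well-recovered scales
```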


Paper and Project Links

PDF Submitted to PASJ, 18 pages, 19 figures, 5 tables

Summary

Denoising weak lensing observations with a generative adversarial network (GAN) and a diffusion model (DM) enhances the potential of WL. In large-scale tests on mock WL observations, the DM outperforms the GAN in almost all considered statistics and recovers the correct statistics down to small scales. Although DM training is more computationally demanding, it offers numerically stable training, better reconstruction of cosmological statistics, and the ability to sample multiple realisations once the model is trained. DM's known advantage over GAN in image quality on real-world problems is confirmed for the WL denoising problem as well.

Key Takeaways

  1. Removing the shape noise from weak lensing observations (denoising) gives access to small-scale information, enhancing the potential of WL.
  2. Two machine learning models are used for denoising: a generative adversarial network (GAN) and a diffusion model (DM).
  3. The denoising performance of GAN and DM is evaluated on a large suite of mock WL observations serving as training and test data.
  4. DM outperforms GAN in almost all considered statistics; for example, the angular power spectrum is recovered up to multipoles $\ell \lesssim 6000$, while the noise power spectrum dominates from $\ell \simeq 2000$.
  5. In stress tests on maps with characteristics different from the training data (e.g., different source redshifts), performance degrades at small scales, but statistics can still be recovered at large scales.
  6. Although DM training is more computationally demanding than GAN, it offers numerically stable training, better reconstruction of cosmological statistics, and multiple-realisation sampling once trained.

Cool Papers

Click here to view paper screenshots

Adaptive Diffusion Models for Sparse-View Motion-Corrected Head Cone-beam CT

Authors:Antoine De Paepe, Alexandre Bousse, Clémentine Phung-Ngoc, Youness Mellak, Dimitris Visvikis

Cone-beam computed tomography (CBCT) is an imaging modality widely used in head and neck diagnostics due to its accessibility and lower radiation dose. However, its relatively long acquisition times make it susceptible to patient motion, especially under sparse-view settings used to reduce dose, which can result in severe image artifacts. In this work, we propose a novel framework, joint reconstruction and motion estimation (JRM) with adaptive diffusion model (ADM), that simultaneously addresses motion compensation and sparse-view reconstruction in head CBCT. Leveraging recent advances in diffusion-based generative models, our method integrates a wavelet-domain diffusion prior into an iterative reconstruction pipeline to guide the solution toward anatomically plausible volumes while estimating rigid motion parameters in a blind fashion. We evaluate our method on simulated motion-affected CBCT data derived from real clinical computed tomography (CT) volumes. Experimental results demonstrate that JRM-ADM achieves consistent quantitative improvements over both traditional and learning-based baselines. In highly undersampled cases, JRM-ADM improves peak signal-to-noise ratio (PSNR) by more than 4 dB and structural similarity index measure (SSIM) by 0.10 compared to the baseline motion-corrected (MC) reconstruction method. These results highlight the potential of our approach to enable motion-robust and low-dose CBCT imaging, paving the way for improved clinical viability. The project page is available at https://antoinedepaepe.github.io/jrm-adm-io/.
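A 1D toy of the alternating structure behind joint reconstruction and motion estimation: an integer shift stands in for rigid motion, sparse row sampling stands in for sparse-view projections, and a crude smoothness step stands in for the wavelet-domain diffusion prior. Everything here is an illustrative assumption, not the JRM-ADM algorithm.

```python
import numpy as np

def jrm_toy(y, A, n, iters=200, lr=0.5, shifts=range(-3, 4)):
    """Alternate blind motion estimation, a data-consistency gradient step,
    and a toy prior step. A learned diffusion prior would replace the
    smoothness update; all choices here are illustrative assumptions."""
    x, s = np.zeros(n), 0
    for _ in range(iters):
        # blind motion step: refit the shift by exhaustive search
        s = min(shifts, key=lambda t: np.sum((A @ np.roll(x, t) - y) ** 2))
        # data-consistency step: gradient of ||A roll(x, s) - y||^2 w.r.t. x
        g = np.roll(A.T @ (A @ np.roll(x, s) - y), -s)
        x = x - lr * g
        # crude smoothness step standing in for the learned diffusion prior
        x = x + 0.1 * (np.roll(x, 1) + np.roll(x, -1) - 2 * x)
    return x, s

rng = np.random.default_rng(0)
n = 64
x_true = np.sin(np.linspace(0, 4 * np.pi, n))
A = np.eye(n)[rng.choice(n, 24, replace=False)]  # sparse sampling, CBCT-style
y = A @ np.roll(x_true, 2)                       # unknown "motion": shift by 2
x_hat, s_hat = jrm_toy(y, A, n)
print("residual:", round(float(np.linalg.norm(A @ np.roll(x_hat, s_hat) - y)), 4))
```

Note that in this toy the shift is only identifiable jointly with the signal (a global shift of x can absorb it), so the data residual, not the recovered shift itself, is the meaningful check; in real CBCT, view-dependent motion and the anatomy prior break that ambiguity.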


Paper and Project Links

PDF 12 pages, 10 figures, 2 tables

Summary

This paper proposes joint reconstruction and motion estimation (JRM) with an adaptive diffusion model (ADM) to address motion compensation and sparse-view reconstruction in head cone-beam CT (CBCT). Leveraging advances in diffusion-based generative models, the framework integrates a wavelet-domain diffusion prior into an iterative reconstruction pipeline, guiding the solution toward anatomically plausible volumes while estimating rigid motion parameters in a blind fashion. On simulated motion-affected CBCT data, JRM-ADM achieves consistent quantitative improvements over traditional and learning-based baselines; in highly undersampled cases, it improves PSNR by more than 4 dB and SSIM by 0.10 over the baseline motion-corrected (MC) reconstruction, pointing toward motion-robust, low-dose CBCT imaging with improved clinical viability.

Key Takeaways

  1. CBCT is widely used in head and neck diagnostics, but its long acquisition times make it susceptible to patient motion, causing severe image artifacts.
  2. A new framework, JRM-ADM, simultaneously addresses motion compensation and sparse-view reconstruction.
  3. A wavelet-domain diffusion prior is integrated into the iterative reconstruction pipeline.
  4. The method guides solutions toward anatomically plausible volumes while blindly estimating rigid motion parameters.
  5. Experiments show JRM-ADM outperforms traditional and learning-based methods on simulated motion-affected CBCT data.
  6. In highly undersampled cases, JRM-ADM clearly improves PSNR and SSIM over the motion-corrected baseline.

Cool Papers

Click here to view paper screenshots

Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space

Authors:Jian Zhu, Zhengyu Jia, Tian Gao, Jiaxin Deng, Shidi Li, Lang Zhang, Fu Liu, Peng Jia, Xianpeng Lang

Advanced end-to-end autonomous driving systems predict other vehicles’ motions and plan ego vehicle’s trajectory. The world model that can foresee the outcome of the trajectory has been used to evaluate the autonomous driving system. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In this paper, we propose a driving World Model named EOT-WM, unifying Ego-Other vehicle Trajectories in videos for driving simulation. Specifically, it remains a challenge to match multiple trajectories in the BEV space with each vehicle in the video to control the video generation. We first project ego-other vehicle trajectories in the BEV space into the image coordinate for vehicle-trajectory match via pixel positions. Then, trajectory videos are encoded by the Spatial-Temporal Variational Auto Encoder to align with driving video latents spatially and temporally in the unified visual space. A trajectory-injected diffusion Transformer is further designed to denoise the noisy video latents for video generation with the guidance of ego-other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments are conducted on the nuScenes dataset, and the proposed model outperforms the state-of-the-art method by 30% in FID and 55% in FVD. The model can also predict unseen driving scenes with self-produced trajectories.
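The BEV-to-image matching step rests on standard pinhole projection; the sketch below shows it with assumed intrinsics and extrinsics (illustrative values, not from the paper).

```python
import numpy as np

def bev_to_image(traj_bev: np.ndarray, K: np.ndarray, T_cam_from_ego: np.ndarray):
    """Project BEV trajectory waypoints into pixel coordinates.

    traj_bev: (N, 3) waypoints in the ego/BEV frame, in meters.
    K: (3, 3) camera intrinsics; T_cam_from_ego: (4, 4) extrinsics.
    Standard pinhole projection used to match each trajectory to a vehicle
    in the video by pixel position; the matrices here are illustrative.
    """
    pts = np.hstack([traj_bev, np.ones((len(traj_bev), 1))])  # homogeneous
    cam = (T_cam_from_ego @ pts.T)[:3]                        # ego -> camera
    uv = (K @ cam)[:2] / cam[2]                               # perspective divide
    return uv.T                                               # (N, 2) pixels

K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
T = np.eye(4)  # identity extrinsics for the sketch
traj = np.array([[0.0, 0.0, 10.0], [1.0, 0.0, 20.0]])  # points ahead of camera
print(bev_to_image(traj, K, T))
```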


Paper and Project Links

PDF 8 pages, 7 figures

Summary

This paper proposes a driving world model, EOT-WM, that unifies ego and other vehicle trajectories in video for driving simulation. Trajectories in BEV space are projected into image coordinates for vehicle-trajectory matching via pixel positions, and trajectory videos are encoded by a Spatial-Temporal Variational Auto-Encoder to align with driving-video latents. A trajectory-injected diffusion Transformer denoises the video latents under the guidance of ego-other vehicle trajectories, and a metric based on control-latent similarity is proposed to evaluate trajectory controllability. On nuScenes, the model outperforms the state of the art by 30% in FID and 55% in FVD, and it can predict unseen driving scenes with self-produced trajectories.

Key Takeaways

  1. The EOT-WM model unifies ego and other vehicle trajectories, making driving simulation more realistic.
  2. Vehicle trajectories in BEV space are projected into image coordinates for vehicle-trajectory matching.
  3. A Spatial-Temporal Variational Auto-Encoder encodes trajectory videos and aligns them with driving-video latents.
  4. A trajectory-injected diffusion Transformer denoises video latents to generate trajectory-guided videos.
  5. A metric based on control-latent similarity evaluates the controllability of trajectories.
  6. Experiments on nuScenes show the model clearly outperforms the state of the art in FID and FVD.

Cool Papers

Click here to view paper screenshots

RN-SDEs: Limited-Angle CT Reconstruction with Residual Null-Space Diffusion Stochastic Differential Equations

Authors:Jiaqi Guo, Santiago Lopez-Tapia, Wing Shun Li, Yunnan Wu, Marcelo Carignano, Martin Kröger, Vinayak P. Dravid, Igal Szleifer, Vadim Backman, Aggelos K. Katsaggelos

Computed tomography is a widely used imaging modality with applications ranging from medical imaging to material analysis. One major challenge arises from the lack of scanning information at certain angles, leading to distorted CT images with artifacts. This results in an ill-posed problem known as the Limited Angle Computed Tomography (LACT) reconstruction problem. To address this problem, we propose Residual Null-Space Diffusion Stochastic Differential Equations (RN-SDEs), which are a variant of diffusion models that characterize the diffusion process with mean-reverting (MR) stochastic differential equations. To demonstrate the generalizability of RN-SDEs, our experiments are conducted on two different LACT datasets, i.e., ChromSTEM and C4KC-KiTS. Through extensive experiments, we show that by leveraging learned Mean-Reverting SDEs as a prior and emphasizing data consistency using Range-Null Space Decomposition (RNSD) based rectification, RN-SDEs can restore high-quality images from severe degradation and achieve state-of-the-art performance in most LACT tasks. Additionally, we present a quantitative comparison of computational complexity and runtime efficiency, highlighting the superior effectiveness of our proposed approach.
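The RNSD rectification step admits a compact sketch: any image splits into a range component $A^{+}Ax$, determined by the measurements, and a null-space component $(I - A^{+}A)x$, invisible to the operator. Replacing the range part of a diffusion sample with $A^{+}y$ gives $\hat{x} = A^{+}y + (I - A^{+}A)x$, which matches the measurements exactly while keeping the prior's null-space content. A dense toy operator stands in for the limited-angle Radon transform below.

```python
import numpy as np

def rnsd_rectify(x: np.ndarray, A: np.ndarray, y: np.ndarray):
    """Range-Null Space Decomposition (RNSD) data-consistency step.

    Returns A^+ y + (I - A^+ A) x: the measurement-determined range component
    plus the generative sample's null-space component. A is a toy dense
    matrix here; real LACT uses a limited-angle Radon operator instead.
    """
    A_pinv = np.linalg.pinv(A)
    return A_pinv @ y + (np.eye(A.shape[1]) - A_pinv @ A) @ x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))  # underdetermined: 20 measurements, 50 unknowns
x_true = rng.standard_normal(50)
y = A @ x_true
x_gen = rng.standard_normal(50)    # stand-in for a diffusion model sample
x_hat = rnsd_rectify(x_gen, A, y)
print(np.allclose(A @ x_hat, y))   # True: measurements are matched exactly
```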


Paper and Project Links

PDF

Summary

Computed tomography (CT) is widely used in medical imaging and material analysis, but missing scan angles lead to distorted images with artifacts, the ill-posed limited-angle CT (LACT) reconstruction problem. This paper proposes Residual Null-Space Diffusion Stochastic Differential Equations (RN-SDEs), a variant of diffusion models that characterizes the diffusion process with mean-reverting (MR) SDEs. Using learned MR-SDEs as a prior and enforcing data consistency through Range-Null Space Decomposition (RNSD) based rectification, RN-SDEs restore high-quality images from severe degradation and achieve state-of-the-art performance in most LACT tasks, with favorable computational complexity and runtime efficiency.

Key Takeaways

  1. Computed tomography suffers from the limited-angle problem, causing image distortion and artifacts.
  2. Residual Null-Space Diffusion SDEs (RN-SDEs) are proposed to address this problem.
  3. RN-SDEs characterize the diffusion process with mean-reverting (MR) stochastic differential equations.
  4. With Range-Null Space Decomposition (RNSD) based rectification, RN-SDEs restore high-quality images from severe degradation.
  5. RN-SDEs achieve state-of-the-art performance in most LACT tasks.
  6. Compared with other methods, the approach is favorable in computational complexity and runtime efficiency.

Cool Papers

Click here to view paper screenshots

Detecting Out-of-Distribution Objects through Class-Conditioned Inpainting

Authors:Quang-Huy Nguyen, Jin Peng Zhou, Zhenzhen Liu, Khanh-Huyen Bui, Kilian Q. Weinberger, Wei-Lun Chao, Dung D. Le

Recent object detectors have achieved impressive accuracy in identifying objects seen during training. However, real-world deployment often introduces novel and unexpected objects, referred to as out-of-distribution (OOD) objects, posing significant challenges to model trustworthiness. Modern object detectors are typically overconfident, making it unreliable to use their predictions alone for OOD detection. To address this, we propose leveraging an auxiliary model as a complementary solution. Specifically, we utilize an off-the-shelf text-to-image generative model, such as Stable Diffusion, which is trained with objective functions distinct from those of discriminative object detectors. We hypothesize that this fundamental difference enables the detection of OOD objects by measuring inconsistencies between the models. Concretely, for a given detected object bounding box and its predicted in-distribution class label, we perform class-conditioned inpainting on the image with the object removed. If the object is OOD, the inpainted image is likely to deviate significantly from the original, making the reconstruction error a robust indicator of OOD status. Extensive experiments demonstrate that our approach consistently surpasses existing zero-shot and non-zero-shot OOD detection methods, establishing a robust framework for enhancing object detection systems in dynamic environments.
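A hedged sketch of the scoring loop using the diffusers inpainting pipeline: the checkpoint, prompt template, and plain masked-L2 reconstruction error are assumptions for illustration (the paper's exact pipeline and score may differ), and a GPU is assumed.

```python
import numpy as np
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Checkpoint choice is an assumption; any SD inpainting model would do.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

def ood_score(image: Image.Image, mask: Image.Image, label: str) -> float:
    """Inpaint the masked object conditioned on its predicted class label and
    score OOD-ness by reconstruction error inside the mask. The prompt
    template and L2 error are illustrative assumptions."""
    inpainted = pipe(prompt=f"a photo of a {label}", image=image, mask_image=mask).images[0]
    a = np.asarray(image, dtype=np.float32) / 255.0
    b = np.asarray(inpainted.resize(image.size), dtype=np.float32) / 255.0
    m = (np.asarray(mask.convert("L"), dtype=np.float32) / 255.0)[..., None]
    return float(((a - b) ** 2 * m).sum() / (m.sum() + 1e-8))  # masked MSE

# Usage (hypothetical files; white mask pixels mark the detected box):
# img = Image.open("det.png").convert("RGB").resize((512, 512))
# m = Image.open("box_mask.png").resize((512, 512))
# print(ood_score(img, m, "backpack"))  # higher score suggests OOD
```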


Paper and Project Links

PDF Accepted in WACV 2026 (Algorithms track)

Summary

This paper proposes using an auxiliary model to make object detection systems more reliable against novel, out-of-distribution (OOD) objects in dynamic environments. An off-the-shelf text-to-image generative model such as Stable Diffusion, trained with objectives distinct from those of discriminative detectors, performs class-conditioned inpainting on the region of a detected object; if the object is OOD, the inpainted image deviates significantly from the original, so the reconstruction error serves as a robust OOD indicator. The approach consistently surpasses existing zero-shot and non-zero-shot OOD detection methods.

Key Takeaways

  1. Modern object detectors struggle with novel objects unseen during training and are typically overconfident.
  2. An auxiliary text-to-image generative model (e.g., Stable Diffusion) can complement the detector for OOD detection.
  3. Measuring inconsistencies between the generative and detection models reveals OOD objects.
  4. Class-conditioned inpainting reconstructs the detected object region under its predicted in-distribution label.
  5. For OOD objects, the inpainted image deviates significantly from the original, making the reconstruction error an effective OOD indicator.
  6. The approach consistently surpasses existing zero-shot and non-zero-shot OOD detection methods.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!