⚠️ All of the summaries below are generated by a large language model; they may contain errors, are for reference only, and should be used with caution.
🔴 Note: never rely on these in serious academic settings; use them only for pre-reading triage of papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-16
Mip-NeWRF: Enhanced Wireless Radiance Field with Hybrid Encoding for Channel Prediction
Authors:Yulin Fu, Jiancun Fan, Shiyu Zhai, Zhibo Duan, Jie Luo
Recent work on wireless radiance fields represents a promising deep learning approach for channel prediction; however, in complex environments these methods still exhibit limited robustness, slow convergence, and modest accuracy due to insufficiently refined modeling. To address this issue, we propose Mip-NeWRF, a physics-informed neural framework for accurate indoor channel prediction based on sparse channel measurements. The framework operates in a ray-based pipeline with coarse-to-fine importance sampling: frustum samples are encoded, processed by a shared multilayer perceptron (MLP), and the outputs are synthesized into the channel frequency response (CFR). Prior to MLP input, Mip-NeWRF performs conical-frustum sampling and applies a scale-consistent hybrid positional encoding to each frustum. The scale-consistent normalization aligns positional encodings across scene scales, while the hybrid encoding supplies both scale-robust, low-frequency stability to accelerate convergence and fine spatial detail to improve accuracy. During training, a curriculum learning schedule is applied to stabilize and accelerate convergence of the shared MLP. During channel synthesis, the MLP outputs, including predicted virtual transmitter presence probabilities and amplitudes, are combined with modeled pathloss and surface interaction attenuation to enhance physical fidelity and further improve accuracy. Simulation results demonstrate the effectiveness of the proposed approach: in typical scenarios, the normalized mean square error (NMSE) is reduced by 14.3 dB versus state-of-the-art baselines.
Paper and project links
PDF 13 pages, 12 figures
Summary
This paper proposes Mip-NeWRF, a physics-informed neural framework for accurate indoor channel prediction from sparse channel measurements. The framework runs a ray-based pipeline with coarse-to-fine importance sampling: frustum samples are encoded, processed by a shared multilayer perceptron (MLP), and synthesized into the channel frequency response (CFR). Its key components, conical-frustum sampling, a scale-consistent hybrid positional encoding, a curriculum learning schedule, and physics-aware channel synthesis, aim to improve prediction accuracy, convergence speed, and robustness in complex environments. Simulations show a 14.3 dB reduction in normalized mean square error (NMSE) over state-of-the-art baselines in typical scenarios.
Key Takeaways
- Mip-NeWRF is a physics-informed neural framework for indoor channel prediction.
- It operates in a ray-based pipeline with coarse-to-fine importance sampling to improve prediction accuracy.
- Frustum samples are encoded, processed by a shared multilayer perceptron (MLP), and synthesized into the channel frequency response (CFR).
- Conical-frustum sampling and a scale-consistent hybrid positional encoding strengthen the framework's robustness and accuracy (see the encoding sketch after this list).
- A curriculum learning schedule stabilizes and accelerates convergence of the shared MLP.
- Combining the MLP outputs with modeled pathloss and surface-interaction attenuation improves physical fidelity and prediction accuracy.
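To make the hybrid encoding idea concrete, below is a minimal sketch of a scale-consistent hybrid positional encoding, assuming one plausible form: coordinates are first normalized by the scene bounds (the scale-consistent part), the normalized coordinates themselves serve as the scale-robust low-frequency term, and NeRF-style sin/cos frequency bands add fine spatial detail. The function name, band count, and normalization scheme are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def hybrid_positional_encoding(x, scene_min, scene_max, n_freqs=8):
    """Sketch of a scale-consistent hybrid positional encoding.

    x:                   (N, 3) sample positions (e.g., frustum centers)
    scene_min/scene_max: per-axis scene bounds used for normalization,
                         so encodings are comparable across scene scales
    Returns (N, 3 + 3*2*n_freqs) features: the normalized coordinates
    (low-frequency, scale-robust part) plus sin/cos bands (fine detail).
    """
    # Scale-consistent normalization: map positions into [-1, 1]
    x_norm = 2.0 * (x - scene_min) / (scene_max - scene_min) - 1.0
    feats = [x_norm]  # low-frequency identity term stabilizes early training
    for k in range(n_freqs):  # NeRF-style frequency bands for fine detail
        feats.append(np.sin((2.0 ** k) * np.pi * x_norm))
        feats.append(np.cos((2.0 ** k) * np.pi * x_norm))
    return np.concatenate(feats, axis=-1)

# Example: encode 4 random points in a 10 m x 6 m x 3 m room
pts = np.random.rand(4, 3) * np.array([10.0, 6.0, 3.0])
enc = hybrid_positional_encoding(pts, scene_min=np.zeros(3),
                                 scene_max=np.array([10.0, 6.0, 3.0]))
print(enc.shape)  # (4, 51)
```

Keeping the raw normalized coordinates alongside the frequency bands gives the MLP a stable low-frequency signal early in training, before the high-frequency terms become useful.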
WarpGAN: Warping-Guided 3D GAN Inversion with Style-Based Novel View Inpainting
Authors:Kaitao Huang, Yan Yan, Jing-Hao Xue, Hanzi Wang
3D GAN inversion projects a single image into the latent space of a pre-trained 3D GAN to achieve single-shot novel view synthesis, which requires visible regions with high fidelity and occluded regions with realism and multi-view consistency. However, existing methods focus on the reconstruction of visible regions, while the generation of occluded regions relies only on the generative prior of 3D GAN. As a result, the generated occluded regions often exhibit poor quality due to the information loss caused by the low bit-rate latent code. To address this, we introduce the warping-and-inpainting strategy to incorporate image inpainting into 3D GAN inversion and propose a novel 3D GAN inversion method, WarpGAN. Specifically, we first employ a 3D GAN inversion encoder to project the single-view image into a latent code that serves as the input to 3D GAN. Then, we perform warping to a novel view using the depth map generated by 3D GAN. Finally, we develop a novel SVINet, which leverages the symmetry prior and multi-view image correspondence w.r.t. the same latent code to perform inpainting of occluded regions in the warped image. Quantitative and qualitative experiments demonstrate that our method consistently outperforms several state-of-the-art methods.
Paper and project links
Summary
This work addresses 3D GAN inversion, which projects a single image into the latent space of a pre-trained 3D GAN for single-shot novel view synthesis. Existing methods focus on reconstructing visible regions and leave occluded regions to the generative prior of the 3D GAN, so the occluded regions come out poorly. To fix this, the paper introduces a warping-and-inpainting strategy and proposes WarpGAN: a 3D GAN inversion encoder first projects the image into a latent code, the depth map produced by the 3D GAN is used to warp to a novel view, and a new SVINet inpaints the occluded regions of the warped image using a symmetry prior and multi-view image correspondences under the same latent code. Experiments show the method consistently outperforms state-of-the-art approaches.
Key Takeaways
- Existing 3D GAN inversion methods focus on reconstructing visible regions and neglect the generation of occluded regions.
- Occluded regions are generated poorly because the low bit-rate latent code loses information.
- A warping-and-inpainting strategy brings image inpainting into 3D GAN inversion.
- The proposed WarpGAN first uses a 3D GAN inversion encoder to project the image into the latent space.
- The depth map generated by the 3D GAN is used to warp the image to a novel view (see the warping sketch after this list).
- A symmetry prior and multi-view image correspondences under the same latent code drive the inpainting of occluded regions.
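For intuition about the warping step, here is a minimal sketch of depth-based warping into a novel view under a pinhole model, assuming shared intrinsics and a known source-to-target rigid transform. The paper's pipeline warps using the depth map rendered by the 3D GAN and must additionally handle resampling, occlusions, and holes, which this sketch omits.

```python
import numpy as np

def warp_to_novel_view(depth, K, T_src2tgt):
    """Backproject source pixels with depth, reproject into a target view.

    depth:      (H, W) depth map for the source camera
    K:          (3, 3) shared camera intrinsics
    T_src2tgt:  (4, 4) rigid transform from source to target camera frame
    Returns (H, W, 2) target-pixel coordinates for each source pixel.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    # Backproject to 3D points in the source camera frame
    pts_src = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts_h = np.vstack([pts_src, np.ones((1, pts_src.shape[1]))])       # (4, H*W)
    pts_tgt = (T_src2tgt @ pts_h)[:3]
    # Project into the target image plane
    proj = K @ pts_tgt
    uv_tgt = (proj[:2] / np.clip(proj[2:], 1e-6, None)).T.reshape(H, W, 2)
    return uv_tgt

# Sanity check: identity transform maps each pixel to itself
uv = warp_to_novel_view(np.ones((4, 4)), np.eye(3), np.eye(4))
print(uv.shape)  # (4, 4, 2)
```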
PrAda-GAN: A Private Adaptive Generative Adversarial Network with Bayes Network Structure
Authors:Ke Jia, Yuheng Ma, Yang Li, Feifei Wang
We revisit the problem of generating synthetic data under differential privacy. To address the core limitations of marginal-based methods, we propose the Private Adaptive Generative Adversarial Network with Bayes Network Structure (PrAda-GAN), which integrates the strengths of both GAN-based and marginal-based approaches. Our method adopts a sequential generator architecture to capture complex dependencies among variables, while adaptively regularizing the learned structure to promote sparsity in the underlying Bayes network. Theoretically, we establish diminishing bounds on the parameter distance, variable selection error, and Wasserstein distance. Our analysis shows that leveraging dependency sparsity leads to significant improvements in convergence rates. Empirically, experiments on both synthetic and real-world datasets demonstrate that PrAda-GAN outperforms existing tabular data synthesis methods in terms of the privacy-utility trade-off.
Paper and project links
Summary
This paper revisits synthetic data generation under differential privacy and proposes the Private Adaptive Generative Adversarial Network with Bayes Network Structure (PrAda-GAN) to address the core limitations of marginal-based methods. The method combines the strengths of GAN-based and marginal-based approaches: a sequential generator architecture captures complex dependencies among variables, while adaptive regularization of the learned structure promotes sparsity in the underlying Bayes network. Theory and experiments show that exploiting dependency sparsity improves convergence rates, and that PrAda-GAN outperforms existing tabular data synthesis methods on the privacy-utility trade-off.
Key Takeaways
- PrAda-GAN targets synthetic data generation under differential privacy, combining the strengths of GAN-based and marginal-based methods.
- It adopts a sequential generator architecture to capture complex dependencies among variables (see the sketch after this list).
- Adaptive regularization of the learned structure promotes sparsity in the underlying Bayes network.
- Theoretical analysis establishes diminishing bounds on parameter distance, variable selection error, and Wasserstein distance, and shows that exploiting dependency sparsity improves convergence rates.
- Experiments show PrAda-GAN outperforms existing methods on the privacy-utility trade-off.
- The method performs well on both synthetic and real-world datasets.
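As a rough illustration of the sequential-generator idea, the sketch below generates tabular variables one at a time, each conditioned on noise plus the previously generated variables, with an L1 penalty on the dependency weights to encourage a sparse Bayes-network structure. All names and layer sizes are illustrative assumptions, and the differential privacy mechanism (e.g., gradient noising) is omitted entirely.

```python
import torch
import torch.nn as nn

class SequentialSparseGenerator(nn.Module):
    """Illustrative sequential generator with a sparsified dependency
    structure; not the paper's exact architecture.

    Variable j is generated from noise plus the already-generated
    variables 1..j-1; an L1 penalty on the dependency weights
    encourages a sparse underlying Bayes network.
    """
    def __init__(self, n_vars, noise_dim=4):
        super().__init__()
        self.noise_dim = noise_dim
        self.blocks = nn.ModuleList(
            nn.Linear(noise_dim + j, 1) for j in range(n_vars)
        )

    def forward(self, n_samples):
        out = []
        for block in self.blocks:
            z = torch.randn(n_samples, self.noise_dim)
            prev = torch.cat(out, dim=1) if out else torch.empty(n_samples, 0)
            out.append(block(torch.cat([z, prev], dim=1)))
        return torch.cat(out, dim=1)  # (n_samples, n_vars)

    def dependency_l1(self):
        # L1 only on weights connecting variable j to its predecessors
        return sum(b.weight[:, self.noise_dim:].abs().sum()
                   for b in self.blocks)

gen = SequentialSparseGenerator(n_vars=5)
x = gen(8)                              # 8 synthetic rows, 5 variables
loss_sparsity = gen.dependency_l1()     # add to the GAN loss, weighted
print(x.shape, float(loss_sparsity))
```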
Is It Truly Necessary to Process and Fit Minutes-Long Reference Videos for Personalized Talking Face Generation?
Authors:Rui-Qing Sun, Ang Li, Zhijing Wu, Tian Lan, Qianyu Lu, Xingshan Yao, Chen Xu, Xian-Ling Mao
Talking Face Generation (TFG) aims to produce realistic and dynamic talking portraits, with broad applications in fields such as digital education, film and television production, e-commerce live streaming, and other related areas. Currently, TFG methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) have received widespread attention. They learn and store personalized features from reference videos of each target individual to generate realistic speaking videos. To ensure models can capture sufficient 3D information and successfully learn the lip-audio mapping, previous studies usually require meticulous processing and fitting of several minutes of reference video, which often takes hours. The computational burden of processing and fitting long reference videos severely limits the practical application value of these methods. However, is it really necessary to fit minutes of reference video? Our exploratory case studies show that using informative reference video segments of just a few seconds can achieve performance comparable to or even better than the full reference video. This indicates that a video's informative quality is much more important than its length. Inspired by this observation, we propose ISExplore (short for Informative Segment Explore), a simple-yet-effective segment selection strategy that automatically identifies the informative 5-second reference video segment based on three key data quality dimensions: audio feature diversity, lip movement amplitude, and number of camera views. Extensive experiments demonstrate that our approach increases data processing and training speed by more than 5x for NeRF and 3DGS methods, while maintaining high-fidelity output. Project resources are available at xx.
Paper and project links
Summary
Talking Face Generation (TFG) aims to produce realistic, dynamic talking portraits, with broad applications in digital education, film and television production, e-commerce live streaming, and more. TFG methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) learn personalized features from reference videos to generate realistic talking videos. To capture enough 3D information and learn the lip-audio mapping, prior work typically processes and fits several minutes of reference video, which can take hours. But is that really necessary? The authors' exploratory case studies show that a few seconds of informative reference video can match or even beat the full video, indicating that informative quality matters far more than length. They therefore propose ISExplore (Informative Segment Explore), a strategy that automatically selects the most informative 5-second reference segment based on three data quality dimensions: audio feature diversity, lip movement amplitude, and number of camera views. Extensive experiments show the approach speeds up data processing and training for NeRF and 3DGS methods by more than 5x while maintaining high-fidelity output.
Key Takeaways
- Talking Face Generation (TFG) produces realistic, dynamic talking portraits and is widely used across many domains.
- Current NeRF- and 3DGS-based TFG methods generate talking videos by learning from reference videos, typically requiring several minutes of footage and long processing times.
- The informative quality of a reference video matters far more for TFG performance than its length.
- ISExplore automatically selects the most informative 5-second reference video segment.
- ISExplore scores segments on three dimensions: audio feature diversity, lip movement amplitude, and number of camera views (a toy scorer follows this list).
- ISExplore speeds up data processing and training for NeRF and 3DGS methods by more than 5x.
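A toy version of the segment selection might look like the following sliding-window scorer over the three stated dimensions. The concrete feature definitions, weights, and normalization are assumptions; the paper's actual scoring is likely more careful.

```python
import numpy as np

def select_informative_segment(audio_feats, lip_amplitude, view_ids,
                               fps=25, seg_seconds=5):
    """Illustrative ISExplore-style segment selection.

    audio_feats:   (T, D) per-frame audio features
    lip_amplitude: (T,)   per-frame lip-opening amplitude
    view_ids:      (T,)   per-frame camera/head-pose cluster id
    Returns the start frame of the best seg_seconds window.
    """
    T = len(lip_amplitude)
    win = fps * seg_seconds
    best_start, best_score = 0, -np.inf
    for s in range(0, T - win + 1, fps):  # stride of one second
        a = audio_feats[s:s + win]
        # (i) audio feature diversity: mean per-dimension std
        audio_div = a.std(axis=0).mean()
        # (ii) lip movement amplitude: mean frame-to-frame change
        lip_amp = np.abs(np.diff(lip_amplitude[s:s + win])).mean()
        # (iii) number of distinct camera views in the window
        n_views = len(np.unique(view_ids[s:s + win]))
        score = audio_div + lip_amp + n_views  # naive unweighted sum
        if score > best_score:
            best_start, best_score = s, score
    return best_start

# Toy usage on 60 s of 25 fps data
T = 60 * 25
start = select_informative_segment(np.random.randn(T, 16),
                                   np.random.rand(T),
                                   np.random.randint(0, 3, T))
print(f"best 5 s segment starts at frame {start}")
```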
Sparse4DGS: 4D Gaussian Splatting for Sparse-Frame Dynamic Scene Reconstruction
Authors:Changyue Shi, Chuxiao Yang, Xinyuan Hu, Minghao Chen, Wenwen Pan, Yan Yang, Jiajun Ding, Zhou Yu, Jun Yu
Dynamic Gaussian Splatting approaches have achieved remarkable performance for 4D scene reconstruction. However, these approaches rely on dense-frame video sequences for photorealistic reconstruction. In real-world scenarios, due to equipment constraints, sometimes only sparse frames are accessible. In this paper, we propose Sparse4DGS, the first method for sparse-frame dynamic scene reconstruction. We observe that dynamic reconstruction methods fail in both canonical and deformed spaces under sparse-frame settings, especially in areas with high texture richness. Sparse4DGS tackles this challenge by focusing on texture-rich areas. For the deformation network, we propose Texture-Aware Deformation Regularization, which introduces a texture-based depth alignment loss to regulate Gaussian deformation. For the canonical Gaussian field, we introduce Texture-Aware Canonical Optimization, which incorporates texture-based noise into the gradient descent process of canonical Gaussians. Extensive experiments show that when taking sparse frames as inputs, our method outperforms existing dynamic or few-shot techniques on NeRF-Synthetic, HyperNeRF, NeRF-DS, and our iPhone-4D datasets.
Paper and project links
PDF AAAI 2026
Summary
This paper proposes Sparse4DGS, the first method for sparse-frame dynamic scene reconstruction. Observing that dynamic reconstruction methods fail in both the canonical and deformed spaces under sparse-frame settings, especially in texture-rich areas, Sparse4DGS focuses on texture-rich regions and introduces Texture-Aware Deformation Regularization and Texture-Aware Canonical Optimization. With sparse frames as input, Sparse4DGS outperforms existing dynamic and few-shot techniques on the NeRF-Synthetic, HyperNeRF, NeRF-DS, and iPhone-4D datasets.
Key Takeaways
- Sparse4DGS is the first method for sparse-frame dynamic scene reconstruction.
- Under sparse-frame settings, dynamic reconstruction methods fail in both the canonical and deformed spaces.
- Sparse4DGS focuses on reconstruction in texture-rich areas.
- Texture-Aware Deformation Regularization introduces a texture-based depth alignment loss to regulate Gaussian deformation (see the sketch after this list).
- Texture-Aware Canonical Optimization injects texture-based noise into the gradient descent of the canonical Gaussians.
- With sparse-frame inputs, Sparse4DGS outperforms existing techniques on multiple datasets.
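To illustrate what a texture-based depth alignment loss could look like, the sketch below weights a per-pixel depth error by local image-gradient magnitude as a proxy for texture richness, so texture-rich regions are penalized more. The weighting scheme is an assumption, not the paper's exact formulation.

```python
import torch

def texture_aware_depth_loss(pred_depth, ref_depth, image):
    """Illustrative texture-weighted depth alignment loss.

    pred_depth, ref_depth: (H, W) rendered and reference depths
    image:                 (3, H, W) training frame used to measure texture
    Texture richness is approximated by local image gradient magnitude,
    so depth errors in texture-rich regions are penalized more heavily.
    """
    gray = image.mean(dim=0)
    gx = (gray[:, 1:] - gray[:, :-1]).abs()   # horizontal gradients
    gy = (gray[1:, :] - gray[:-1, :]).abs()   # vertical gradients
    tex = torch.zeros_like(gray)
    tex[:, 1:] += gx
    tex[1:, :] += gy
    w = tex / (tex.max() + 1e-8)              # texture weight in [0, 1]
    return (w * (pred_depth - ref_depth).abs()).mean()

H, W = 64, 64
loss = texture_aware_depth_loss(torch.rand(H, W), torch.rand(H, W),
                                torch.rand(3, H, W))
print(float(loss))
```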
Inpaint360GS: Efficient Object-Aware 3D Inpainting via Gaussian Splatting for 360° Scenes
Authors:Shaoxiang Wang, Shihong Zhang, Christen Millerdurai, Rüdiger Westermann, Didier Stricker, Alain Pagani
Despite recent advances in single-object front-facing inpainting using NeRF and 3D Gaussian Splatting (3DGS), inpainting in complex 360° scenes remains largely underexplored. This is primarily due to three key challenges: (i) identifying target objects in the 3D field of 360° environments, (ii) dealing with severe occlusions in multi-object scenes, which makes it hard to define regions to inpaint, and (iii) maintaining consistent and high-quality appearance across views effectively. To tackle these challenges, we propose Inpaint360GS, a flexible 360° editing framework based on 3DGS that supports multi-object removal and high-fidelity inpainting in 3D space. By distilling 2D segmentation into 3D and leveraging virtual camera views for contextual guidance, our method enables accurate object-level editing and consistent scene completion. We further introduce a new dataset tailored for 360° inpainting, addressing the lack of ground truth object-free scenes. Experiments demonstrate that Inpaint360GS outperforms existing baselines and achieves state-of-the-art performance. Project page: https://dfki-av.github.io/inpaint360gs/
Paper and project links
PDF WACV 2026, project page: https://dfki-av.github.io/inpaint360gs/
Summary
This work proposes Inpaint360GS, a flexible 3DGS-based framework for multi-object removal and high-fidelity inpainting in complex 360° scenes. By distilling 2D segmentation into 3D and leveraging virtual camera views for contextual guidance, it enables accurate object-level editing and consistent scene completion. A new dataset tailored to 360° inpainting addresses the lack of ground-truth object-free scenes. Experiments show Inpaint360GS outperforms existing baselines and achieves state-of-the-art performance.
Key Takeaways
- Inpaint360GS addresses object removal and inpainting in complex 360° scenes.
- Built on 3D Gaussian Splatting (3DGS), it supports multi-object removal and high-fidelity inpainting in 3D space.
- Distilling 2D segmentation into 3D and using virtual camera views for contextual guidance enable accurate object-level editing and consistent scene completion (a voting-based lifting sketch follows this list).
- A new dataset tailored to 360° inpainting addresses the lack of ground-truth object-free scenes.
- Inpaint360GS outperforms existing baselines and achieves state-of-the-art performance.
- The method is applicable to virtual editing, game scene production, virtual reality, and similar scenarios.
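One simple way to distill 2D segmentation into 3D, sketched below, is to project each Gaussian center into every labeled view and take a majority vote over the mask labels it lands on. This voting scheme (with no visibility reasoning) is an illustrative simplification, not the paper's actual distillation.

```python
import numpy as np

def label_gaussians_by_mask_voting(centers, cams, masks):
    """Lift 2D instance masks onto 3D Gaussians by projection voting.

    centers: (N, 3) Gaussian centers
    cams:    list of (K, T) pairs, K (3, 3) intrinsics, T (4, 4) world->cam
    masks:   list of (H, W) integer masks aligned with cams (0 = background)
    Returns (N,) majority-vote object id per Gaussian (0 if never seen).
    """
    N = centers.shape[0]
    votes = [dict() for _ in range(N)]
    pts_h = np.hstack([centers, np.ones((N, 1))]).T          # (4, N)
    for (K, T), mask in zip(cams, masks):
        cam_pts = (T @ pts_h)[:3]
        in_front = cam_pts[2] > 1e-6
        proj = K @ cam_pts
        uv = np.round(proj[:2] / np.clip(proj[2:], 1e-6, None)).astype(int)
        H, W = mask.shape
        valid = (in_front & (uv[0] >= 0) & (uv[0] < W)
                 & (uv[1] >= 0) & (uv[1] < H))
        for i in np.flatnonzero(valid):
            obj = int(mask[uv[1, i], uv[0, i]])
            votes[i][obj] = votes[i].get(obj, 0) + 1
    return np.array([max(v, key=v.get) if v else 0 for v in votes])
```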
VDNeRF: Vision-only Dynamic Neural Radiance Field for Urban Scenes
Authors:Zhengyu Zou, Jingfeng Li, Hao Li, Xiaolei Hou, Jinwen Hu, Jingkun Chen, Lechao Cheng, Dingwen Zhang
Neural Radiance Fields (NeRFs) implicitly model continuous three-dimensional scenes using a set of images with known camera poses, enabling the rendering of photorealistic novel views. However, existing NeRF-based methods encounter challenges in applications such as autonomous driving and robotic perception, primarily due to the difficulty of capturing accurate camera poses and limitations in handling large-scale dynamic environments. To address these issues, we propose Vision-only Dynamic NeRF (VDNeRF), a method that accurately recovers camera trajectories and learns spatiotemporal representations for dynamic urban scenes without requiring additional camera pose information or expensive sensor data. VDNeRF employs two separate NeRF models to jointly reconstruct the scene. The static NeRF model optimizes camera poses and static background, while the dynamic NeRF model incorporates the 3D scene flow to ensure accurate and consistent reconstruction of dynamic objects. To address the ambiguity between camera motion and independent object motion, we design an effective and powerful training framework to achieve robust camera pose estimation and self-supervised decomposition of static and dynamic elements in a scene. Extensive evaluations on mainstream urban driving datasets demonstrate that VDNeRF surpasses state-of-the-art NeRF-based pose-free methods in both camera pose estimation and dynamic novel view synthesis.
Paper and project links
Summary
VDNeRF (Vision-only Dynamic NeRF) accurately recovers camera trajectories and learns spatiotemporal representations of dynamic urban scenes without extra camera pose information or expensive sensor data. It reconstructs the scene with two separate NeRF models: a static model that optimizes camera poses and the static background, and a dynamic model that incorporates 3D scene flow for accurate, consistent reconstruction of dynamic objects. To resolve the ambiguity between camera motion and independent object motion, the authors design a training framework for robust pose estimation and self-supervised decomposition of static and dynamic elements. On mainstream urban driving datasets, VDNeRF surpasses state-of-the-art pose-free NeRF methods in both camera pose estimation and dynamic novel view synthesis.
Key Takeaways
- VDNeRF accurately recovers camera trajectories and learns spatiotemporal representations of dynamic urban scenes without extra pose information or expensive sensor data.
- The scene is jointly reconstructed by two separate NeRF models, one static and one dynamic (a compositing sketch follows this list).
- The static NeRF optimizes camera poses and the static background, while the dynamic NeRF incorporates 3D scene flow to reconstruct dynamic objects accurately.
- An effective training framework resolves the ambiguity between camera motion and independent object motion, enabling robust pose estimation and self-supervised static/dynamic decomposition.
- Evaluations on mainstream urban driving datasets demonstrate superiority in both camera pose estimation and dynamic novel view synthesis.
- VDNeRF holds significant potential value for applications such as autonomous driving and robotic perception.
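The two-field idea can be pictured with standard volume rendering in which the static and dynamic densities add, and each field contributes color in proportion to its density share, as in the sketch below. This is a generic composition scheme assumed for illustration; VDNeRF's exact formulation may differ.

```python
import torch

def composite_static_dynamic(sigma_s, rgb_s, sigma_d, rgb_d, deltas):
    """Composite a static and a dynamic radiance field along one ray.

    sigma_s, sigma_d: (S,)   static / dynamic densities at S samples
    rgb_s, rgb_d:     (S, 3) static / dynamic colors
    deltas:           (S,)   distances between adjacent samples
    Densities add; each field contributes color in proportion to its
    share of the total density at a sample.
    """
    sigma = sigma_s + sigma_d
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    w_total = trans * alpha                                    # (S,)
    share_s = sigma_s / (sigma + 1e-10)
    rgb_mix = share_s[:, None] * rgb_s + (1 - share_s)[:, None] * rgb_d
    return (w_total[:, None] * rgb_mix).sum(dim=0)             # (3,)

S = 64
c = composite_static_dynamic(torch.rand(S), torch.rand(S, 3),
                             torch.rand(S), torch.rand(S, 3),
                             torch.full((S,), 0.05))
print(c)  # composited pixel color
```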
The Evolving Nature of Latent Spaces: From GANs to Diffusion
Authors:Ludovica Schaerf
This paper examines the evolving nature of internal representations in generative visual models, focusing on the conceptual and technical shift from GANs and VAEs to diffusion-based architectures. Drawing on Beatrice Fazi’s account of synthesis as the amalgamation of distributed representations, we propose a distinction between “synthesis in a strict sense”, where a compact latent space wholly determines the generative process, and “synthesis in a broad sense,” which characterizes models whose representational labor is distributed across layers. Through close readings of model architectures and a targeted experimental setup that intervenes in layerwise representations, we show how diffusion models fragment the burden of representation and thereby challenge assumptions of unified internal space. By situating these findings within media theoretical frameworks and critically engaging with metaphors such as the latent space and the Platonic Representation Hypothesis, we argue for a reorientation of how generative AI is understood: not as a direct synthesis of content, but as an emergent configuration of specialized processes.
Paper and project links
PDF Presented and published at Ethics and Aesthetics of Artificial Intelligence Conference (EA-AI’25)
Summary
This paper examines the evolving nature of internal representations in generative visual models, tracing the conceptual and technical shift from GANs and VAEs to diffusion-based architectures. Drawing on Beatrice Fazi's account of synthesis as the amalgamation of distributed representations, it distinguishes "synthesis in a strict sense", where a compact latent space wholly determines the generative process, from "synthesis in a broad sense", where representational labor is distributed across layers. Through close readings of model architectures and targeted experiments that intervene in layerwise representations, the paper shows how diffusion models fragment the burden of representation, challenging assumptions of a unified internal space. Situating these findings within media-theoretical frameworks and engaging critically with metaphors such as the latent space and the Platonic Representation Hypothesis, it argues for reorienting how generative AI is understood: not as a direct synthesis of content, but as an emergent configuration of specialized processes.
Key Takeaways
- The paper analyzes the evolution of internal representations in generative visual models, from GANs and VAEs to diffusion architectures.
- Drawing on Beatrice Fazi, it distinguishes "synthesis in a strict sense" from "synthesis in a broad sense".
- Diffusion models distribute the burden of representation across layers, challenging the assumption of a unified internal space.
- Targeted experiments that intervene in layerwise representations support this characterization of diffusion models.
- The findings are situated within media-theoretical frameworks to rethink how generative AI is understood.
- Generative AI is framed not as a direct synthesis of content but as an emergent configuration of specialized processes.
GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering
Authors:Alexander Valverde, Brian Xu, Yuyin Zhou, Meng Xu, Hongyun Wang
Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.
Paper and project links
Summary
Neural Radiance Fields (NeRF) and Gaussian Splatting have driven remarkable progress in scene reconstruction, but Gaussian Splatting, while strong on large-scale datasets, struggles to capture fine details and maintain realism in sparsely covered regions. GauSSmart is a hybrid method that bridges 2D foundation models and 3D Gaussian Splatting reconstruction: it integrates established 2D techniques such as convex filtering and semantic feature supervision from foundation models like DINO. Leveraging 2D segmentation priors and high-dimensional feature embeddings, it guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas while preserving intricate structural detail. Across three datasets, GauSSmart outperforms existing Gaussian Splatting in the majority of evaluated scenes, demonstrating the potential of hybrid 2D-3D approaches.
Key Takeaways
- Neural Radiance Fields (NeRF) and Gaussian Splatting have driven remarkable progress in scene reconstruction.
- Gaussian Splatting performs well on large-scale datasets but struggles to capture detail and maintain realism in sparsely covered regions.
- GauSSmart is a hybrid method that bridges 2D foundation models and 3D Gaussian Splatting reconstruction to improve scene reconstruction.
- It uses 2D computer vision techniques such as convex filtering and semantic feature supervision to strengthen Gaussian-based reconstruction (a feature-loss sketch follows this list).
- 2D segmentation priors and high-dimensional feature embeddings improve coverage of sparse regions while preserving fine structural detail.
- Validation on three datasets shows GauSSmart outperforms existing Gaussian Splatting in most evaluated scenes.
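The semantic feature supervision can be illustrated with a simple cosine-similarity loss between a feature image splatted from the Gaussians and a DINO feature map, as sketched below. That per-Gaussian features are splatted to an image is an assumption made for this sketch, not a confirmed detail of GauSSmart's pipeline.

```python
import torch
import torch.nn.functional as F

def semantic_feature_loss(rendered_feats, dino_feats):
    """Illustrative semantic feature supervision.

    rendered_feats: (C, H, W) feature image splatted from the Gaussians
    dino_feats:     (C, H, W) target features from a frozen DINO encoder,
                    upsampled to the render resolution
    Penalizes per-pixel angular disagreement via cosine similarity.
    """
    cos = F.cosine_similarity(rendered_feats, dino_feats, dim=0)  # (H, W)
    return (1.0 - cos).mean()

C, H, W = 32, 64, 64
loss = semantic_feature_loss(torch.randn(C, H, W), torch.randn(C, H, W))
print(float(loss))
```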
SPHERE: Semantic-PHysical Engaged REpresentation for 3D Semantic Scene Completion
Authors:Zhiwen Yang, Yuxin Peng
Camera-based 3D Semantic Scene Completion (SSC) is a critical task in autonomous driving systems, assessing voxel-level geometry and semantics for holistic scene perception. While existing voxel-based and plane-based SSC methods have achieved considerable progress, they struggle to capture physical regularities for realistic geometric details. On the other hand, neural reconstruction methods like NeRF and 3DGS demonstrate superior physical awareness, but suffer from high computational cost and slow convergence when handling large-scale, complex autonomous driving scenes, leading to inferior semantic accuracy. To address these issues, we propose the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC, which integrates voxel and Gaussian representations for joint exploitation of semantic and physical information. First, the Semantic-guided Gaussian Initialization (SGI) module leverages dual-branch 3D scene representations to locate focal voxels as anchors to guide efficient Gaussian initialization. Then, the Physical-aware Harmonics Enhancement (PHE) module incorporates semantic spherical harmonics to model physical-aware contextual details and promote semantic-geometry consistency through focal distribution alignment, generating SSC results with realistic details. Extensive experiments and analyses on the popular SemanticKITTI and SSCBench-KITTI-360 benchmarks validate the effectiveness of SPHERE. The code is available at https://github.com/PKU-ICST-MIPL/SPHERE_ACMMM2025.
Paper and project links
PDF 10 pages, 6 figures, accepted by ACM MM 2025
Summary
This paper proposes the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based 3D Semantic Scene Completion (SSC), integrating voxel and Gaussian representations to jointly exploit semantic and physical information. A Semantic-guided Gaussian Initialization (SGI) module uses dual-branch 3D scene representations to locate focal voxels as anchors for efficient Gaussian initialization, and a Physical-aware Harmonics Enhancement (PHE) module incorporates semantic spherical harmonics to model physical-aware contextual details and promote semantic-geometry consistency, yielding SSC results with realistic detail. Experiments on the SemanticKITTI and SSCBench-KITTI-360 benchmarks validate the effectiveness of SPHERE.
Key Takeaways
- Camera-based 3D Semantic Scene Completion (SSC) is a critical task in autonomous driving, assessing voxel-level geometry and semantics for holistic scene perception.
- Existing voxel-based and plane-based SSC methods have made progress but struggle to capture physical regularities for realistic geometric detail.
- Neural reconstruction methods such as NeRF and 3DGS are physically aware but suffer high computational cost and slow convergence on large, complex driving scenes, hurting semantic accuracy.
- The proposed Semantic-PHysical Engaged REpresentation (SPHERE) integrates voxel and Gaussian representations to address these issues.
- The SGI module uses dual-branch 3D scene representations to locate focal voxels as anchors for efficient Gaussian initialization (an anchor-selection sketch follows this list).
- The PHE module uses semantic spherical harmonics to model physical-aware contextual details and improve semantic-geometry consistency, yielding more realistic results.
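A minimal sketch of focal-voxel anchor selection follows: voxels are scored by the product of occupancy probability and semantic confidence, and the top-k voxel centers seed the Gaussians. The scoring rule, tensor shapes, and parameter names are illustrative assumptions, not SPHERE's actual SGI rule.

```python
import numpy as np

def init_gaussians_from_focal_voxels(occ_prob, sem_conf, voxel_size,
                                     origin, top_k=2048):
    """Pick focal voxels as anchors for Gaussian initialization.

    occ_prob: (X, Y, Z) predicted occupancy probabilities
    sem_conf: (X, Y, Z) max semantic class confidence per voxel
    Returns (top_k, 3) world-space centers, chosen where the scene
    representation is confident on both counts.
    """
    score = (occ_prob * sem_conf).ravel()
    idx = np.argpartition(score, -top_k)[-top_k:]          # top-k voxels
    coords = np.stack(np.unravel_index(idx, occ_prob.shape), axis=-1)
    return origin + (coords + 0.5) * voxel_size            # voxel centers

centers = init_gaussians_from_focal_voxels(
    np.random.rand(64, 64, 8), np.random.rand(64, 64, 8),
    voxel_size=0.2, origin=np.array([0.0, -25.6, -2.0]), top_k=1024)
print(centers.shape)  # (1024, 3)
```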
Multispectral-NeRF:a multispectral modeling approach based on neural radiance fields
Authors:Hong Zhang, Fei Guo, Zihan Xie, Dizhao Yao
3D reconstruction technology generates three-dimensional representations of real-world objects, scenes, or environments using sensor data such as 2D images, with extensive applications in robotics, autonomous vehicles, and virtual reality systems. Traditional 3D reconstruction techniques based on 2D images typically rely on RGB spectral information. With advances in sensor technology, additional spectral bands beyond RGB have been increasingly incorporated into 3D reconstruction workflows. Existing methods that integrate these expanded spectral data often suffer from expensive acquisition schemes, low accuracy, and poor geometric features. Three-dimensional reconstruction based on NeRF can effectively address the various issues in current multispectral 3D reconstruction methods, producing high-precision, high-quality reconstruction results. However, NeRF and some improved models such as NeRFacto are currently trained on three-band data and cannot take multi-band information into account. To address this problem, we propose Multispectral-NeRF, an enhanced neural architecture derived from NeRF that effectively integrates multispectral information. Our technical contributions comprise threefold modifications: expanding hidden layer dimensionality to accommodate 6-band spectral inputs; redesigning residual functions to optimize spectral discrepancy calculations between reconstructed and reference images; and adapting data compression modules to address the increased bit-depth requirements of multispectral imagery. Experimental results confirm that Multispectral-NeRF successfully processes multi-band spectral features while accurately preserving the original scenes' spectral characteristics.
Paper and project links
Summary
Multispectral-NeRF, an enhanced neural architecture derived from NeRF, effectively integrates multispectral information, addressing the cost, accuracy, and geometric-quality problems of existing multispectral 3D reconstruction methods. The work makes three modifications: expanding hidden layer dimensionality to accommodate 6-band spectral inputs, redesigning the residual functions to optimize spectral discrepancy calculations between reconstructed and reference images, and adapting the data compression modules to the increased bit depth of multispectral imagery. Experiments confirm that Multispectral-NeRF processes multi-band spectral features while accurately preserving the original scenes' spectral characteristics.
Key Takeaways
- NeRF-based 3D reconstruction can effectively address the problems of current multispectral 3D reconstruction methods.
- Multispectral-NeRF is an enhanced NeRF-derived neural architecture that integrates multispectral information.
- It expands the hidden layer dimensionality to accommodate 6-band spectral inputs (see the sketch after this list).
- The residual functions are redesigned to optimize spectral discrepancy calculations between reconstructed and reference images.
- The data compression modules are adapted to the increased bit-depth requirements of multispectral imagery.
- Experiments show Multispectral-NeRF successfully processes multi-band spectral features while preserving the original scenes' spectral characteristics.
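The first two modifications can be pictured with a widened color head and a per-band residual, as in the sketch below; the layer sizes, sigmoid output, and optional band weights are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultispectralHead(nn.Module):
    """Illustrative NeRF color head widened from 3 (RGB) to 6 bands."""
    def __init__(self, feat_dim=256, hidden=128, n_bands=6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bands), nn.Sigmoid(),  # radiance in [0, 1]
        )

    def forward(self, feats):
        return self.mlp(feats)  # (..., n_bands)

def spectral_residual(pred, target, band_weights=None):
    # Per-band squared error; optional weights let some bands count more
    err = (pred - target) ** 2
    if band_weights is not None:
        err = err * band_weights
    return err.mean()

head = MultispectralHead()
pred = head(torch.randn(4096, 256))                # 4096 samples, 6 bands
loss = spectral_residual(pred, torch.rand(4096, 6))
print(pred.shape, float(loss))
```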