⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:请勿用于严肃的学术场景,仅适合作为论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ HuggingFace 免费体验
2025-11-19 更新
Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting
Authors:Jiangnan Ye, Jiedong Zhuang, Lianrui Mu, Wenjie Zheng, Jiaqi Hu, Xingze Zou, Jing Wang, Haoji Hu
We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the init latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.
我们提出了GS-Light,这是一个高效、具备文本位置感知能力的管道,用于对以高斯拼贴(3DGS)表示的3D场景进行文本引导的重照明。GS-Light以免训练的方式将单输入扩散模型扩展到多视图输入。给定可能指定照明方向、颜色、强度或参考物体的用户提示,我们采用大型视觉语言模型(LVLM)将提示解析为照明先验。借助现成的几何与语义估计器(深度、表面法线和语义分割),我们将这些照明先验与视图几何约束融合,计算照明图并为每个视图生成初始潜码。这些精心构造的初始潜码引导扩散模型生成更准确反映用户期望的重照明结果,尤其是在照明方向方面。将多视图渲染图像连同初始潜码一起输入我们的多视图重照明模型,即可得到高保真、具有艺术感的重照明图像。最后,我们用重照明后的外观微调3DGS场景,获得完全重照明的3D场景。我们在室内和室外场景上评估了GS-Light,并与包括逐视图重照明、视频重照明和场景编辑方法在内的最新基线进行比较。基于定量指标(多视图一致性、成像质量、美学分数、语义相似度等)和定性评估(用户研究),GS-Light相比各基线均表现出一致的提升。代码和资源将在论文发表后公开。
论文及项目相关链接
PDF Submitting for Neurocomputing
Summary
本文介绍了GS-Light,这是一种基于高斯拼贴(3DGS)表示的文本引导式三维场景重照明的高效、文本位置感知管道。GS-Light通过无训练的单输入扩散模型处理多视图输入,实现用户提示解析为照明先验。结合现成的几何和语义估算器(深度、表面法线和语义分割),将这些照明先验与视图几何约束融合以计算照明图并为每个视图生成初始潜在代码。这些精心得出的初始潜在代码指导扩散模型生成更准确地反映用户期望的照明输出,特别是在照明方向方面。通过向多视图重照明模型输入多视图渲染图像以及初始潜在代码,生成高质量、艺术性重照明图像。最后,使用重照明外观微调3DGS场景,获得完全重照明的三维场景。GS-Light在室内外场景上的表现优于现有技术基线,包括视图重照明、视频重照明和场景编辑方法。
Key Takeaways
- GS-Light是一种用于文本引导的三维场景重照明的管道,基于高斯拼贴(3DGS)表示。
- GS-Light通过无训练的单输入扩散模型处理多视图输入。
- 用户提示被解析为照明先验,并结合几何和语义估算器进行计算。
- 初始潜在代码指导扩散模型生成反映用户期望的照明输出,特别是照明方向。
- 多视图渲染图像和初始潜在代码用于生成高质量、艺术性重照明图像。
- 通过微调获得完全重照明的三维场景。
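下面给出一个极简的概念性示意(并非论文官方实现,仅基于上述摘要的理解):假设 LVLM 已把提示解析为照明方向、颜色与强度,用朗伯着色 max(0, n·l) 从法线图构造照明图,再将其与渲染图混合并加噪,作为扩散模型初始潜码的代理;其中的函数名、混合系数 alpha 与噪声强度均为示意性假设。

```python
import numpy as np

def illumination_map(normals, light_dir, color, intensity):
    """由法线图与解析出的照明先验构造逐像素照明图(朗伯着色,示意)。
    normals: (H, W, 3) 单位法线;light_dir: (3,) 指向光源的方向。"""
    l = np.asarray(light_dir, dtype=np.float32)
    l /= (np.linalg.norm(l) + 1e-8)
    shade = np.clip(normals @ l, 0.0, 1.0)                     # n·l,负值截断
    return intensity * shade[..., None] * np.asarray(color, dtype=np.float32)

def init_latent(render, illum, alpha=0.6, noise_std=0.3, seed=0):
    """将照明图与渲染图混合后加噪,作为扩散采样初始潜码的代理(示意)。"""
    rng = np.random.default_rng(seed)
    guided = (1 - alpha) * render + alpha * np.clip(render * illum, 0.0, 1.0)
    return guided + noise_std * rng.standard_normal(render.shape).astype(np.float32)

# 玩具示例:一张 64x64 的"渲染图"与全部朝上的法线
H = W = 64
render = np.full((H, W, 3), 0.5, dtype=np.float32)
normals = np.zeros((H, W, 3), dtype=np.float32)
normals[..., 2] = 1.0
illum = illumination_map(normals, light_dir=(0.5, 0.2, 1.0),
                         color=(1.0, 0.9, 0.8), intensity=1.2)
z0 = init_latent(render, illum)
print(illum.shape, z0.shape)
```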
点此查看论文截图
Opt3DGS: Optimizing 3D Gaussian Splatting with Adaptive Exploration and Curvature-Aware Exploitation
Authors:Ziyang Huang, Jiagang Chen, Jin Liu, Shunping Ji
3D Gaussian Splatting (3DGS) has emerged as a leading framework for novel view synthesis, yet its core optimization challenges remain underexplored. We identify two key issues in 3DGS optimization: entrapment in suboptimal local optima and insufficient convergence quality. To address these, we propose Opt3DGS, a robust framework that enhances 3DGS through a two-stage optimization process of adaptive exploration and curvature-guided exploitation. In the exploration phase, an Adaptive Weighted Stochastic Gradient Langevin Dynamics (SGLD) method enhances global search to escape local optima. In the exploitation phase, a Local Quasi-Newton Direction-guided Adam optimizer leverages curvature information for precise and efficient convergence. Extensive experiments on diverse benchmark datasets demonstrate that Opt3DGS achieves state-of-the-art rendering quality by refining the 3DGS optimization process without modifying its underlying representation.
3D高斯泼溅(3DGS)已成为新视角合成的领先框架,但其核心优化挑战尚未得到充分研究。我们发现了3DGS优化中的两个关键问题:陷入次优的局部最优以及收敛质量不足。为了解决这些问题,我们提出了Opt3DGS,这是一个通过"自适应探索+曲率引导利用"两阶段优化过程来增强3DGS的稳健框架。在探索阶段,自适应加权随机梯度朗之万动力学(SGLD)方法增强全局搜索,以逃离局部最优;在利用阶段,局部拟牛顿方向引导的Adam优化器利用曲率信息实现精确而高效的收敛。在多种基准数据集上的大量实验表明,Opt3DGS在不修改底层表示的前提下,仅通过改进3DGS的优化过程即实现了最先进的渲染质量。
论文及项目相关链接
PDF Accepted at AAAI 2026 as a Conference Paper
Summary
本文探讨了基于3D高斯泼溅(3DGS)进行新颖视图合成时面临的核心优化挑战,并针对这些问题提出了Opt3DGS框架。该框架通过自适应探索和曲率引导利用的两阶段优化过程来增强3DGS。在探索阶段,采用自适应加权随机梯度朗之万动力学(SGLD)方法进行全局搜索,以逃离局部最优解;在利用阶段,利用局部拟牛顿方向引导的Adam优化器进行精确高效的收敛。实验证明,Opt3DGS在不改变其底层表示的情况下,实现了最先进的渲染质量。
Key Takeaways
- 3DGS作为新颖视图合成的领先框架,但其核心优化挑战尚未得到充分探索。
- 提出了Opt3DGS框架来解决这些问题,包括自适应探索与曲率引导利用的两阶段优化过程。
- 探索阶段采用自适应加权随机梯度朗之万动力学(SGLD)进行全局搜索以逃离局部最优解。
- 利用阶段利用局部拟牛顿方向引导的Adam优化器进行精确高效的收敛。
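为了直观理解"先自适应探索、后曲率感知利用"的两阶段思路,下面在一个玩具多峰函数上给出示意:第一阶段用带噪声的 SGLD 更新帮助跳出局部极小,第二阶段切换为 Adam 式更新做精细收敛。温度 T、步长等超参数均为示意性假设,并不等同于论文中的自适应加权 SGLD 与拟牛顿方向引导。

```python
import numpy as np

def loss(x):            # 玩具多峰目标:全局较优解在 x=3 附近
    return np.sin(3 * x) + 0.1 * (x - 3) ** 2

def grad(x, h=1e-4):    # 数值梯度
    return (loss(x + h) - loss(x - h)) / (2 * h)

rng = np.random.default_rng(0)
x = -2.0                                   # 故意从一个局部极小附近出发

# 阶段一:SGLD 探索 —— 梯度下降 + 与步长匹配的高斯噪声
lr, T = 0.01, 0.5
for _ in range(2000):
    x += -lr * grad(x) + np.sqrt(2 * lr * T) * rng.standard_normal()

# 阶段二:Adam 利用 —— 去掉噪声,做精细收敛
m = v = 0.0
lr, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(x)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    x -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)

print(f"x = {x:.3f}, loss = {loss(x):.3f}")
```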
点此查看论文截图
SF-Recon: Simplification-Free Lightweight Building Reconstruction via 3D Gaussian Splatting
Authors:Zihan Li, Tengfei Wang, Wentian Gan, Hao Zhan, Xin Wang, Zongqian Zhan
Lightweight building surface models are crucial for digital city, navigation, and fast geospatial analytics, yet conventional multi-view geometry pipelines remain cumbersome and quality-sensitive due to their reliance on dense reconstruction, meshing, and subsequent simplification. This work presents SF-Recon, a method that directly reconstructs lightweight building surfaces from multi-view images without post-hoc mesh simplification. We first train an initial 3D Gaussian Splatting (3DGS) field to obtain a view-consistent representation. Building structure is then distilled by a normal-gradient-guided Gaussian optimization that selects primitives aligned with roof and wall boundaries, followed by multi-view edge-consistency pruning to enhance structural sharpness and suppress non-structural artifacts without external supervision. Finally, a multi-view depth-constrained Delaunay triangulation converts the structured Gaussian field into a lightweight, structurally faithful building mesh. Based on a proposed SF dataset, the experimental results demonstrate that our SF-Recon can directly reconstruct lightweight building models from multi-view imagery, achieving substantially fewer faces and vertices while maintaining computational efficiency. Website:https://lzh282140127-cell.github.io/SF-Recon-project/
轻量级建筑表面模型对于数字城市、导航和快速地理空间分析至关重要。然而,传统的多视角几何流程仍然很繁琐且对质量敏感,因为它们依赖于密集重建、网格生成和随后的简化。本研究提出了SF-Recon方法,该方法直接从多视角图像重建轻量级建筑表面,无需后续网格简化。我们首先训练初始的3D高斯喷溅(3DGS)场以获得一致的视图表示。然后,通过法线梯度引导的高斯优化提取建筑结构,该优化选择与屋顶和墙壁边界对齐的基本形状,随后进行多视角边缘一致性修剪,以提高结构清晰度并抑制非结构伪影,无需外部监督。最后,多视角深度约束的Delaunay三角剖分将结构化高斯场转化为轻量级、结构忠实的建筑网格。基于提出的SF数据集,实验结果表明,我们的SF-Recon可以直接从多视角图像重建轻量级建筑模型,在保持计算效率的同时,实现更少的面和顶点。网站:https://lzh282140127-cell.github.io/SF-Recon-project/
论文及项目相关链接
Summary
该文介绍了一种名为SF-Recon的方法,该方法可从多视角图像直接重建轻量级建筑表面模型,无需后续网格简化。先训练初始的3D高斯喷溅(3DGS)场获得视角一致表示,随后通过法线梯度引导的高斯优化和基于多视角边缘一致性的修剪来增强结构锐度、抑制非结构伪影,最后通过多视角深度约束的Delaunay三角剖分将结构化高斯场转换为轻量级、结构忠实的建筑网格。
Key Takeaways
- SF-Recon方法可直接从多视角图像重建轻量级建筑表面模型,无需后续网格简化。
- 初始阶段通过3D高斯喷溅(3DGS)场获得视角一致表示。
- 通过法线梯度引导的高斯优化,选择与屋顶和墙壁边界对齐的高斯基元。
- 多视角边缘一致性修剪增强结构锐度,抑制非结构伪影。
- 提出了一种名为SF的数据集,用于实验评估。
- SF-Recon方法实现了建筑模型的轻量化,减少了面数和顶点数。
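下面用 scipy 给出"多视角深度约束的 Delaunay 三角剖分"这一步的极简示意(非官方实现):先对图像平面上的投影点做 2D Delaunay 三角化,再用一个假设的深度一致性阈值剔除跨越深度断裂的三角形,只保留深度平滑的面片;点、深度与阈值均为人工构造。

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
# 假设的稀疏结构点:图像平面坐标 (u, v) 及其深度 d
uv = rng.uniform(0, 100, size=(200, 2))
depth = np.where(uv[:, 0] < 50, 5.0, 9.0) + 0.05 * rng.standard_normal(200)

tri = Delaunay(uv)                      # 2D Delaunay 三角化
simplices = tri.simplices               # (M, 3) 顶点索引

# 深度约束:三角形内最大深度差超过阈值则视为跨越深度断裂,予以剔除
d_tri = depth[simplices]                # (M, 3)
keep = (d_tri.max(axis=1) - d_tri.min(axis=1)) < 1.0
mesh_faces = simplices[keep]

print(f"共 {len(simplices)} 个三角形,保留 {len(mesh_faces)} 个深度一致的面片")
```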
点此查看论文截图
SymGS : Leveraging Local Symmetries for 3D Gaussian Splatting Compression
Authors:Keshav Gupta, Akshat Sanghvi, Shreyas Reddy Palley, Astitva Srivastava, Charu Sharma, Avinash Sharma
3D Gaussian Splatting has emerged as a transformative technique in novel view synthesis, primarily due to its high rendering speed and photorealistic fidelity. However, its memory footprint scales rapidly with scene complexity, often reaching several gigabytes. Existing methods address this issue by introducing compression strategies that exploit primitive-level redundancy through similarity detection and quantization. We aim to surpass the compression limits of such methods by incorporating symmetry-aware techniques, specifically targeting mirror symmetries to eliminate redundant primitives. We propose a novel compression framework, SymGS, introducing learnable mirrors into the scene, thereby eliminating local and global reflective redundancies for compression. Our framework functions as a plug-and-play enhancement to state-of-the-art compression methods (e.g. HAC) to achieve further compression. Compared to HAC, we achieve 1.66× compression across benchmark datasets (up to 3× on large-scale scenes). On average, SymGS enables 108× compression of a 3DGS scene, while preserving rendering quality. The project page and supplementary can be found at symgs.github.io
3D高斯泼溅(3DGS)已成为新视角合成中的变革性技术,这主要得益于其高渲染速度和逼真的保真度。然而,其内存占用随场景复杂度迅速增长,常常达到数GB。现有方法通过相似性检测和量化来利用基元级别的冗余,引入压缩策略以缓解这一问题。我们的目标是通过引入对称感知技术(特别是针对镜面对称)来消除冗余基元,从而突破此类方法的压缩极限。我们提出了一种新型压缩框架SymGS,在场景中引入可学习的镜面,从而消除局部和全局的反射冗余以实现压缩。我们的框架可以作为最新压缩方法(例如HAC)的即插即用增强,以实现进一步压缩。与HAC相比,我们在基准数据集上实现了1.66倍的压缩(大规模场景上可达3倍)。平均而言,SymGS可将3DGS场景压缩108倍,同时保持渲染质量。项目页面和补充材料见 symgs.github.io。
论文及项目相关链接
PDF Project Page: https://symgs.github.io/
Summary
本文介绍了三维高斯展平技术中的对称感知压缩框架SymGS。它通过引入学习镜像来消除场景中的局部和全局反射冗余,以提高压缩效率并降低内存占用。SymGS可作为一个插件增强现有压缩方法(如HAC),在基准数据集上实现更高的压缩率,平均压缩率达到108倍,同时保持渲染质量。
Key Takeaways
- 3D高斯展平技术在新型视图合成中具有变革性影响,具有高渲染速度和逼真的保真度。
- 现有方法的内存占用随场景复杂度而快速增加,达到数GB。
- 对称感知技术旨在超越现有方法的压缩限制,通过消除冗余基本元素来提高压缩效率。
- SymGS框架引入学习镜像来消除局部和全局反射冗余,实现高效压缩。
- SymGS可作为插件增强现有压缩方法(如HAC),实现更高的压缩率。
- 与HAC相比,SymGS在基准数据集上实现了高达1.66倍的压缩率(在大规模场景上可达3倍)。
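下面是利用镜面对称消除冗余基元这一思路的极简示意(非论文实现):给定由单位法向量 n 与偏移 d 定义的镜面 n·x + d = 0,反射公式为 x' = x − 2(n·x + d)n;若某个高斯中心反射后能在场景中找到近邻,则这对基元理论上只需存一半并由镜面参数恢复。数据与阈值均为假设。

```python
import numpy as np

def reflect(points, n, d):
    """将点集关于平面 n·x + d = 0 做镜像反射(n 会被归一化)。"""
    n = np.asarray(n, dtype=np.float64)
    n /= np.linalg.norm(n)
    dist = points @ n + d                 # 到平面的有符号距离
    return points - 2.0 * dist[:, None] * n

rng = np.random.default_rng(0)
half = rng.uniform(0.1, 1.0, size=(500, 3))                    # 镜面一侧的"高斯中心"
scene = np.vstack([half, reflect(half, n=(1, 0, 0), d=0.0)])   # 人工构造的对称场景

# 检测冗余:反射后能在原点集中找到近邻的点,即可由镜面参数重建
mirrored = reflect(scene, n=(1, 0, 0), d=0.0)
dists = np.linalg.norm(scene[None, :, :] - mirrored[:, None, :], axis=-1)
redundant = (dists.min(axis=1) < 1e-6).sum()
print(f"{redundant}/{len(scene)} 个基元可由镜面对称恢复")
```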
点此查看论文截图
Monocular 3D Lane Detection via Structure Uncertainty-Aware Network with Curve-Point Queries
Authors:Ruixin Liu, Zejian Yuan
Monocular 3D lane detection is challenged by aleatoric uncertainty arising from inherent observation noise. Existing methods rely on simplified geometric assumptions, such as independent point predictions or global planar modeling, failing to capture structural variations and aleatoric uncertainty in real-world scenarios. In this paper, we propose MonoUnc, a bird’s-eye view (BEV)-free 3D lane detector that explicitly models aleatoric uncertainty informed by local lane structures. Specifically, 3D lanes are projected onto the front-view (FV) space and approximated by parametric curves. Guided by curve predictions, curve-point query embeddings are dynamically generated for lane point predictions in 3D space. Each segment formed by two adjacent points is modeled as a 3D Gaussian, parameterized by the local structure and uncertainty estimations. Accordingly, a novel 3D Gaussian matching loss is designed to constrain these parameters jointly. Experiments on the ONCE-3DLanes and OpenLane datasets demonstrate that MonoUnc outperforms previous state-of-the-art (SoTA) methods across all benchmarks under stricter evaluation criteria. Additionally, we propose two comprehensive evaluation metrics for ONCE-3DLanes, calculating the average and maximum bidirectional Chamfer distances to quantify global and local errors. Codes are released at https://github.com/lrx02/MonoUnc.
单目3D车道检测面临由固有观测噪声引起的偶然不确定性带来的挑战。现有方法依赖于简化的几何假设,如独立点预测或全局平面建模,无法捕捉真实场景中的结构变化和偶然不确定性。在本文中,我们提出了MonoUnc,这是一种无需鸟瞰图(BEV)的3D车道检测器,能够显式地通过局部车道结构对偶然不确定性进行建模。具体而言,3D车道被投影到前视(FV)空间,并由参数曲线近似表示。在曲线预测的引导下,为3D空间中的车道点预测动态生成曲线点查询嵌入。由两个相邻点形成的每个线段都被建模为3D高斯分布,通过局部结构和不确定性估计进行参数化。因此,设计了一种新型的3D高斯匹配损失,以联合约束这些参数。在ONCE-3DLanes和OpenLane数据集上的实验表明,MonoUnc在更严格的评估标准下,在所有基准测试中均优于先前最先进的(SoTA)方法。此外,我们还为ONCE-3DLanes提出了两项综合评价指标,通过计算平均和最大双向Chamfer距离来量化全局和局部误差。代码已发布在https://github.com/lrx02/MonoUnc。
论文及项目相关链接
Summary
本文提出一种名为MonoUnc的、无需鸟瞰图(BEV)的3D车道检测器,它基于局部车道结构显式建模偶然不确定性。3D车道被投影到前视(FV)空间并用参数曲线近似表示,在曲线预测的引导下动态生成曲线点查询嵌入,用于3D空间中的车道点预测。每个由相邻两点构成的线段被建模为3D高斯分布,由局部结构和不确定性估计参数化。在ONCE-3DLanes和OpenLane数据集上的实验表明,MonoUnc在更严格的评估标准下全面超越现有先进方法。
Key Takeaways
- MonoUnc是一种新型的3D车道检测方法,能够显式建模偶然不确定性。
- 该方法采用前视空间投影和参数曲线近似表示3D车道。
- 通过动态生成曲线点查询嵌入进行3D空间中的车道点预测。
- 每个线段被视为3D高斯分布,考虑了局部结构和不确定性估计。
- 设计了新型的3D高斯匹配损失以联合约束参数。
- 在多个数据集上的实验表明,MonoUnc性能优于现有先进方法。
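文中为 ONCE-3DLanes 提出的两个评测指标基于双向 Chamfer 距离(分别取平均值与最大值来刻画全局与局部误差)。下面给出该距离的一个通用 numpy 示意实现;采样方式与论文的具体评测协议可能不同,仅帮助理解指标含义。

```python
import numpy as np

def bidirectional_chamfer(pred, gt):
    """pred, gt: (N, 3) / (M, 3) 的 3D 车道采样点。
    返回 (平均双向距离, 最大双向距离),分别刻画全局与局部误差。"""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M) 两两距离
    pred_to_gt = d.min(axis=1)      # 每个预测点到真值的最近距离
    gt_to_pred = d.min(axis=0)      # 每个真值点到预测的最近距离
    both = np.concatenate([pred_to_gt, gt_to_pred])
    return both.mean(), both.max()

# 玩具示例:一条缓升的直线车道与带噪声的预测
t = np.linspace(0, 20, 50)
gt = np.stack([t, np.zeros_like(t), 0.02 * t], axis=1)
pred = gt + np.random.default_rng(0).normal(0, 0.1, gt.shape)
avg_cd, max_cd = bidirectional_chamfer(pred, gt)
print(f"平均双向Chamfer距离 {avg_cd:.3f} m, 最大 {max_cd:.3f} m")
```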
点此查看论文截图
Beyond Darkness: Thermal-Supervised 3D Gaussian Splatting for Low-Light Novel View Synthesis
Authors:Qingsen Ma, Chen Zou, Dianyun Wang, Jia Wang, Liuyu Xiang, Zhaofeng He
Under extremely low-light conditions, novel view synthesis (NVS) faces severe degradation in terms of geometry, color consistency, and radiometric stability. Standard 3D Gaussian Splatting (3DGS) pipelines fail when applied directly to underexposed inputs, as independent enhancement across views causes illumination inconsistencies and geometric distortion. To address this, we present DTGS, a unified framework that tightly couples Retinex-inspired illumination decomposition with thermal-guided 3D Gaussian Splatting for illumination-invariant reconstruction. Unlike prior approaches that treat enhancement as a pre-processing step, DTGS performs joint optimization across enhancement, geometry, and thermal supervision through a cyclic enhancement-reconstruction mechanism. A thermal supervisory branch stabilizes both color restoration and geometry learning by dynamically balancing enhancement, structural, and thermal losses. Moreover, a Retinex-based decomposition module embedded within the 3DGS loop provides physically interpretable reflectance-illumination separation, ensuring consistent color and texture across viewpoints. To evaluate our method, we construct RGBT-LOW, a new multi-view low-light thermal dataset capturing severe illumination degradation. Extensive experiments show that DTGS significantly outperforms existing low-light enhancement and 3D reconstruction baselines, achieving superior radiometric consistency, geometric fidelity, and color stability under extreme illumination.
在极低光条件下,新型视图合成(NVS)在几何、色彩一致性和辐射稳定性方面面临严重退化。当直接应用于曝光不足的输入时,标准的三维高斯喷绘(3DGS)流水线会失效,因为独立的视图增强会导致照明不一致和几何失真。为了解决这个问题,我们提出了DTGS,这是一个统一框架,它通过Retinex启发式的照明分解与热引导的三维高斯喷绘紧密耦合,实现照明不变重建。不同于将增强作为预处理步骤的先前方法,DTGS通过循环增强-重建机制,在增强、几何和热监督之间进行联合优化。热监督分支通过动态平衡增强、结构和热损失,稳定颜色恢复和几何学习。此外,嵌入在3DGS循环内的基于Retinex的分解模块提供物理可解释的反射-照明分离,确保跨视点的颜色和纹理一致性。为了评估我们的方法,我们构建了RGBT-LOW,这是一个新的多视角低光热数据集,捕捉严重的照明退化。大量实验表明,DTGS在极端照明条件下显著优于现有的低光增强和3D重建基线,实现了卓越的辐射一致性、几何保真度和颜色稳定性。
论文及项目相关链接
Summary
本文指出,在极低光照条件下,直接将标准三维高斯散斑(3DGS)用于曝光不足的输入会因各视图独立增强而产生光照不一致和几何失真。为此提出了DTGS统一框架,将受Retinex启发的光照分解与热引导的三维高斯散斑紧密耦合,实现光照不变的重建。此外,新构建的RGBT-LOW多视角低光热数据集用于评估方法在低光照条件下的表现。实验证明,DTGS在辐射一致性、几何保真度和颜色稳定性方面显著优于现有方法。
Key Takeaways
- 在极低光照条件下,标准三维高斯散斑技术面临几何、颜色一致性和辐射稳定性方面的严重退化问题。
- DTGS框架解决了这一问题,它通过Retinex灵感照明分解与热引导三维高斯散斑技术实现照明不变重建。
- DTGS采用循环增强重建机制,联合优化增强、几何和热监督。
- 热监督分支通过动态平衡增强、结构和热损失来稳定颜色恢复和几何学习。
- 嵌入在3DGS循环中的基于Retinex的分解模块提供了物理可解释的反射率-光照分离,确保跨视点的颜色和纹理一致性。
- 新构建的RGBT-LOW多视角低光热数据集用于评估方法在低光照条件下的性能。
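为帮助理解"基于Retinex的反射率-光照分解",下面给出其最基础的一种示意做法(并非 DTGS 中嵌入 3DGS 循环的可学习模块):假设 I = R · L,用大尺度高斯模糊近似平滑的光照分量 L,再由 R = I / L 得到反射率;模糊尺度等参数均为假设。

```python
import numpy as np

def gaussian_blur(img, sigma=15.0):
    """用可分离高斯核做模糊,近似估计平滑的光照分量(示意)。"""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2)); k /= k.sum()
    out = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 1, out)

def retinex_decompose(img, eps=1e-4):
    """按 I = R * L 分解:L 取大尺度模糊估计,R = I / L。"""
    L = np.clip(gaussian_blur(img), eps, None)
    R = np.clip(img / L, 0.0, 4.0)
    return R, L

# 玩具示例:左暗右亮的低光灰度图
rng = np.random.default_rng(0)
img = np.tile(np.linspace(0.05, 0.4, 128), (128, 1)) * (0.8 + 0.4 * rng.random((128, 128)))
R, L = retinex_decompose(img)
print("原图均值:", img.mean(), " 光照均值:", L.mean(), " 反射率均值:", R.mean())
```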
点此查看论文截图
TR-Gaussians: High-fidelity Real-time Rendering of Planar Transmission and Reflection with 3D Gaussian Splatting
Authors:Yong Liu, Keyang Ye, Tianjia Shao, Kun Zhou
We propose Transmission-Reflection Gaussians (TR-Gaussians), a novel 3D-Gaussian-based representation for high-fidelity rendering of planar transmission and reflection, which are ubiquitous in indoor scenes. Our method combines 3D Gaussians with learnable reflection planes that explicitly model the glass planes with view-dependent reflectance strengths. Real scenes and transmission components are modeled by 3D Gaussians and the reflection components are modeled by the mirrored Gaussians with respect to the reflection plane. The transmission and reflection components are blended according to a Fresnel-based, view-dependent weighting scheme, allowing for faithful synthesis of complex appearance effects under varying viewpoints. To effectively optimize TR-Gaussians, we develop a multi-stage optimization framework incorporating color and geometry constraints and an opacity perturbation mechanism. Experiments on different datasets demonstrate that TR-Gaussians achieve real-time, high-fidelity novel view synthesis in scenes with planar transmission and reflection, and outperform state-of-the-art approaches both quantitatively and qualitatively.
我们提出了透射-反射高斯(TR-Gaussians),这是一种新颖的基于三维高斯的表示方法,用于对室内场景中普遍存在的平面透射与反射进行高保真渲染。我们的方法将三维高斯与可学习的反射平面相结合,后者以随视角变化的反射强度显式建模玻璃平面。真实场景和透射分量由三维高斯建模,反射分量则由关于反射平面镜像的高斯建模。透射与反射分量按基于菲涅尔(Fresnel)的视角相关加权方案进行混合,从而能够在不同视角下忠实合成复杂的外观效果。为了有效优化TR-Gaussians,我们开发了一个多阶段优化框架,融合了颜色与几何约束以及不透明度扰动机制。在不同数据集上的实验表明,TR-Gaussians在含平面透射与反射的场景中实现了实时、高保真的新视角合成,并在定量和定性上均超越了最先进的方法。
论文及项目相关链接
PDF 15 pages, 12 figures
Summary
该文介绍了一种基于3D高斯的新型表示方法——透射-反射高斯(TR-Gaussians),用于室内场景中平面透射与反射的高保真渲染。该方法结合3D高斯与可学习的反射平面,通过随视角变化的反射强度显式建模玻璃平面,并用基于菲涅尔的视角相关加权方案混合透射与反射分量,实现不同视点下复杂外观效果的忠实合成。所开发的多阶段优化框架纳入颜色和几何约束以及不透明度扰动机制,有效优化TR-Gaussians。实验证明,TR-Gaussians在含平面透射与反射的场景中实现了实时高保真的新视角合成,并在定量和定性上均优于现有方法。
Key Takeaways
- 提出了一种新的基于3D高斯的方法——透射-反射高斯(TR-Gaussians),用于室内场景平面透射与反射的高保真渲染。
- 结合了3D高斯和可学习的反射平面,明确建模玻璃平面的视图相关反射强度。
- 通过基于菲涅尔的视角相关加权方案混合透射与反射分量,实现复杂外观效果的忠实合成。
- 开发了一个多阶段优化框架,纳入颜色和几何约束以及不透明度扰动机制,以优化TR-Gaussians。
- TR-Gaussians能够实现实时高保真的新视角合成,在含平面透射与反射的场景中表现出色。
- TR-Gaussians在定量和定性评估上均优于现有方法。
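透射与反射分量按"基于菲涅尔的视角相关权重"混合,一个常见且易于实现的近似是 Schlick 公式 F(θ) = F0 + (1 − F0)(1 − cosθ)^5:视角越接近掠射,反射越强。下面的示意代码用它对透射图与反射图做混合;F0 的取值与"整幅图共用一个权重"的简化均为假设,与论文的具体加权方案不一定相同。

```python
import numpy as np

def schlick_fresnel(cos_theta, f0=0.04):
    """Schlick 近似:视线与玻璃面法线夹角越大(掠射),反射越强。"""
    cos_theta = np.clip(cos_theta, 0.0, 1.0)
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

def blend_transmission_reflection(trans_img, refl_img, view_dir, plane_normal):
    """按菲涅尔权重混合透射图与反射图(示意)。view_dir 指向相机。"""
    n = np.asarray(plane_normal, float); n /= np.linalg.norm(n)
    v = np.asarray(view_dir, float); v /= np.linalg.norm(v)
    F = schlick_fresnel(abs(v @ n))              # 本例简化为整幅图共用一个权重
    return (1.0 - F) * trans_img + F * refl_img, F

# 玩具示例:正视与掠射两种视角下的混合权重
trans = np.full((4, 4, 3), 0.2); refl = np.full((4, 4, 3), 0.9)
for v in [(0, 0, 1), (0.95, 0, 0.31)]:
    _, F = blend_transmission_reflection(trans, refl, v, (0, 0, 1))
    print(f"view={v}, 反射权重 F={F:.3f}")
```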
点此查看论文截图
SplatSearch: Instance Image Goal Navigation for Mobile Robots using 3D Gaussian Splatting and Diffusion Models
Authors:Siddarth Narasimhan, Matthew Lisondra, Haitong Wang, Goldie Nejat
The Instance Image Goal Navigation (IIN) problem requires mobile robots deployed in unknown environments to search for specific objects or people of interest using only a single reference goal image of the target. This problem can be especially challenging when: 1) the reference image is captured from an arbitrary viewpoint, and 2) the robot must operate with sparse-view scene reconstructions. In this paper, we address the IIN problem, by introducing SplatSearch, a novel architecture that leverages sparse-view 3D Gaussian Splatting (3DGS) reconstructions. SplatSearch renders multiple viewpoints around candidate objects using a sparse online 3DGS map, and uses a multi-view diffusion model to complete missing regions of the rendered images, enabling robust feature matching against the goal image. A novel frontier exploration policy is introduced which uses visual context from the synthesized viewpoints with semantic context from the goal image to evaluate frontier locations, allowing the robot to prioritize frontiers that are semantically and visually relevant to the goal image. Extensive experiments in photorealistic home and real-world environments validate the higher performance of SplatSearch against current state-of-the-art methods in terms of Success Rate and Success Path Length. An ablation study confirms the design choices of SplatSearch.
实例图像目标导航(IIN)问题要求部署在未知环境中的移动机器人仅使用单个目标参考图像来搜索特定对象或感兴趣的人。当面临以下情况时,这个问题可能尤其具有挑战性:1)参考图像是从任意视角捕获的;2)机器人必须在稀疏视图场景重建中进行操作。在本文中,我们通过引入SplatSearch来解决IIN问题,这是一种新型架构,它利用稀疏视图三维高斯摊铺重建技术(Sparse-view 3D Gaussian Splatting, 简称3DGS)。SplatSearch使用稀疏在线3DGS地图渲染候选对象周围的多视角图像,并使用多视角扩散模型来补充渲染图像的缺失区域,从而实现与目标图像之间的稳健特征匹配。我们引入了一种新型前沿探索策略,该策略使用合成视角的视觉上下文和目标图像中的语义上下文来评估前沿位置,从而使机器人能够优先探索在语义和视觉上与目标图像相关的前沿。在逼真的家庭和真实世界环境中的大量实验验证了SplatSearch相对于当前最先进的方法具有更高的成功率和成功路径长度方面的性能。消融研究证实了SplatSearch的设计选择。
论文及项目相关链接
PDF Project Page: https://splat-search.github.io/
Summary
本论文针对实例图像导航(IIN)问题提出了一种新的解决方案:SplatSearch。它利用稀疏视角的3D高斯Splatting(3DGS)重建技术,通过渲染目标对象的多视角图像来解决问题。SplatSearch利用多视角扩散模型完成渲染图像的缺失区域,使其与目标图像进行稳健的特征匹配。此外,还引入了一种新的边界探索策略,结合合成视角的视觉上下文和目标图像的语义上下文来评估边界位置,使机器人能够优先探索语义上和视觉上与目标图像相关的边界。在逼真的家庭和真实环境中的实验验证了SplatSearch在成功率和成功路径长度方面的高性能。
Key Takeaways
- 实例图像导航(IIN)问题要求机器人在未知环境中仅使用单个目标参考图像搜索特定对象或人物。
- SplatSearch是一种解决IIN问题的新架构,利用稀疏视角的3D高斯Splatting(3DGS)重建。
- SplatSearch通过渲染目标对象的多视角图像,并利用多视角扩散模型完成图像缺失区域的填充,实现与目标图像的稳定特征匹配。
- 引入了一种新的边界探索策略,结合视觉和语义上下文来评估边界位置,指导机器人的探索方向。
- 在真实和仿真环境下的实验验证了SplatSearch在成功率和路径规划上的优越性。
- Ablation研究确认了SplatSearch的设计选择的有效性。
- SplatSearch能够为移动机器人提供更精确的导航,尤其在环境信息不完全已知的情况下。
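前沿探索策略的核心是把"合成视角的视觉相似度"与"目标图像的语义相似度"融合成一个打分,用于决定优先探索哪个前沿。下面用余弦相似度给出一个通用的打分与排序示意;特征向量、权重 lam 与打分形式均为假设,仅说明这一思路。

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def score_frontiers(frontier_vis_feats, frontier_sem_feats,
                    goal_vis_feat, goal_sem_feat, lam=0.5):
    """对每个前沿位置计算视觉相似度与语义相似度的加权和,返回从高到低的探索顺序。"""
    scores = []
    for fv, fs in zip(frontier_vis_feats, frontier_sem_feats):
        s = lam * cosine(fv, goal_vis_feat) + (1 - lam) * cosine(fs, goal_sem_feat)
        scores.append(s)
    order = np.argsort(scores)[::-1]
    return order, np.asarray(scores)

# 玩具示例:3 个前沿、随机特征,其中第 1 个被构造得与目标最相似
rng = np.random.default_rng(0)
goal_v, goal_s = rng.standard_normal(64), rng.standard_normal(64)
vis = [rng.standard_normal(64), goal_v + 0.1 * rng.standard_normal(64), rng.standard_normal(64)]
sem = [rng.standard_normal(64), goal_s + 0.1 * rng.standard_normal(64), rng.standard_normal(64)]
order, scores = score_frontiers(vis, sem, goal_v, goal_s)
print("探索优先级:", order, " 打分:", np.round(scores, 3))
```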
点此查看论文截图
Neo: Real-Time On-Device 3D Gaussian Splatting with Reuse-and-Update Sorting Acceleration
Authors:Changhun Oh, Seongryong Oh, Jinwoo Hwang, Yoonsung Kim, Hardik Sharma, Jongse Park
3D Gaussian Splatting (3DGS) rendering in real-time on resource-constrained devices is essential for delivering immersive augmented and virtual reality (AR/VR) experiences. However, existing solutions struggle to achieve high frame rates, especially for high-resolution rendering. Our analysis identifies the sorting stage in the 3DGS rendering pipeline as the major bottleneck due to its high memory bandwidth demand. This paper presents Neo, which introduces a reuse-and-update sorting algorithm that exploits temporal redundancy in Gaussian ordering across consecutive frames, and devises a hardware accelerator optimized for this algorithm. By efficiently tracking and updating Gaussian depth ordering instead of re-sorting from scratch, Neo significantly reduces redundant computations and memory bandwidth pressure. Experimental results show that Neo achieves up to 10.0x and 5.6x higher throughput than state-of-the-art edge GPU and ASIC solution, respectively, while reducing DRAM traffic by 94.5% and 81.3%. These improvements make high-quality and low-latency on-device 3D rendering more practical.
在资源受限设备上实时进行三维高斯泼溅(3DGS)渲染,对于提供沉浸式增强现实和虚拟现实(AR/VR)体验至关重要。然而,现有解决方案难以实现高帧率,尤其是在高分辨率渲染时。我们的分析表明,3DGS渲染流水线中的排序阶段因其高内存带宽需求而成为主要瓶颈。本文介绍了Neo,它提出了一种"重用-更新"排序算法,利用连续帧之间高斯排序的时间冗余性,并为该算法设计了专门的硬件加速器。通过有效地跟踪和更新高斯深度次序,而不是每帧从零开始重新排序,Neo显著减少了冗余计算和内存带宽压力。实验结果表明,Neo的吞吐量比最先进的边缘GPU和ASIC方案分别高出最多10.0倍和5.6倍,同时将DRAM流量分别减少94.5%和81.3%。这些改进使高质量、低延迟的设备端三维渲染更加实用。
论文及项目相关链接
Summary
这篇论文介绍了名为Neo的实时三维高斯渲染技术,该技术针对现有解决方案在高分辨率渲染时帧率低的问题进行了优化。主要改进在于识别并解决了渲染管线中的排序阶段瓶颈,通过利用高斯排序的时间冗余信息,提出了一种重用和更新排序算法,并为此算法设计了一种硬件加速器。通过有效追踪和更新高斯深度排序,而不是从头开始重新排序,Neo显著减少了冗余计算和内存带宽压力。实验结果表明,Neo较最新的边缘GPU和ASIC解决方案分别提高了高达10倍和5.6倍的吞吐量,同时DRAM流量减少了94.5%和81.3%。这些改进使得高质量、低延迟的本地设备三维渲染更加实用。
Key Takeaways
- 实时三维高斯渲染技术对于增强现实和虚拟现实体验至关重要。
- 现有解决方案在高分辨率渲染时面临帧率问题。
- 排序阶段是三维高斯渲染的主要瓶颈,其高内存带宽需求限制了性能。
- Neo技术利用高斯排序的时间冗余信息来优化渲染过程。
- Neo提出了一种重用和更新排序算法,并通过硬件加速器实现了优化。
- Neo通过有效追踪和更新高斯深度排序,显著减少了冗余计算和内存带宽压力。
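"重用-更新排序"的直觉是:相邻帧之间高斯的深度次序变化很小,因此无需每帧从头全量排序,只需在上一帧的次序上做少量修正。下面用"对近乎有序的序列做插入排序并统计移动次数"给出概念演示;这只是思路示意,并非论文中的硬件加速算法。

```python
import random

def update_order(order, depth):
    """在上一帧的深度次序 order 上按新的 depth 做插入排序,并统计元素移动次数。"""
    moves = 0
    for i in range(1, len(order)):
        idx, j = order[i], i - 1
        while j >= 0 and depth[order[j]] > depth[idx]:
            order[j + 1] = order[j]
            j -= 1
            moves += 1
        order[j + 1] = idx
    return order, moves

random.seed(0)
n = 20000
depth = [random.random() * 10 for _ in range(n)]
order = sorted(range(n), key=depth.__getitem__)            # 第 0 帧:全量排序

depth = [d + random.uniform(-1e-3, 1e-3) for d in depth]   # 第 1 帧:深度仅有微小变化
order, moves = update_order(order, depth)

assert all(depth[order[i]] <= depth[order[i + 1]] for i in range(n - 1))
print(f"{n} 个高斯,仅需 {moves} 次元素移动即可恢复正确的深度次序")
```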
点此查看论文截图
Reconstructing 3D Scenes in Native High Dynamic Range
Authors:Kaixuan Zhang, Minxian Li, Mingwu Ren, Jiankang Deng, Xiatian Zhu
High Dynamic Range (HDR) imaging is essential for professional digital media creation, e.g., filmmaking, virtual production, and photorealistic rendering. However, 3D scene reconstruction has primarily focused on Low Dynamic Range (LDR) data, limiting its applicability to professional workflows. Existing approaches that reconstruct HDR scenes from LDR observations rely on multi-exposure fusion or inverse tone-mapping, which increase capture complexity and depend on synthetic supervision. With the recent emergence of cameras that directly capture native HDR data in a single exposure, we present the first method for 3D scene reconstruction that directly models native HDR observations. We propose {\bf Native High dynamic range 3D Gaussian Splatting (NH-3DGS)}, which preserves the full dynamic range throughout the reconstruction pipeline. Our key technical contribution is a novel luminance-chromaticity decomposition of the color representation that enables direct optimization from native HDR camera data. We demonstrate on both synthetic and real multi-view HDR datasets that NH-3DGS significantly outperforms existing methods in reconstruction quality and dynamic range preservation, enabling professional-grade 3D reconstruction directly from native HDR captures. Code and datasets will be made available.
高动态范围(HDR)成像对于专业数字媒体创作,例如电影制作、虚拟制作和真实感渲染,至关重要。然而,目前的3D场景重建主要集中在低动态范围(LDR)数据上,限制了其在专业工作流程中的应用。现有的从LDR观测重建HDR场景的方法依赖于多曝光融合或反向色调映射,这增加了捕获的复杂性并依赖于合成监督。随着能直接捕获单次曝光下的原生HDR数据的相机最近的出现,我们首次提出了直接对原生HDR观测进行建模的3D场景重建方法。我们提出了“Native High dynamic range 3D Gaussian Splatting(NH-3DGS)”,在重建过程中保留了全动态范围。我们的主要技术贡献是对颜色表示进行新型亮度-色度分解,能够从原生HDR相机数据中直接进行优化。我们在合成和真实的多视图HDR数据集上的演示表明,NH-3DGS在重建质量和动态范围保留方面显著优于现有方法,能够实现直接从原生HDR捕获进行专业级别的3D重建。代码和数据集将提供。
论文及项目相关链接
Summary
HDR成像在专业数字媒体创建中至关重要,如电影制作、虚拟制作和逼真渲染等。然而,3D场景重建主要集中在低动态范围(LDR)数据上,限制了其在专业工作流程中的应用。现有从LDR观测重建HDR场景的方法依赖于多曝光融合或逆色调映射,这增加了捕获的复杂性并依赖于合成监督。随着直接捕获原生HDR数据的相机近期出现,我们首次提出了直接建模原生HDR观测的3D场景重建方法。我们提出Native High dynamic range 3D Gaussian Splatting(NH-3DGS),在重建过程中保留了全动态范围。我们的关键技术贡献是颜色表示的新颖亮度-色度分解,能够直接从原生HDR相机数据进行优化。我们在合成和真实的多视角HDR数据集上的演示表明,NH-3DGS在重建质量和动态范围保留方面显著优于现有方法,能够实现直接从原生HDR捕获的专业级3D重建。
Key Takeaways
- HDR成像对专业数字媒体创建至关重要。
- 现有3D场景重建方法主要关注LDR数据,存在局限性。
- 提出Native High dynamic range 3D Gaussian Splatting(NH-3DGS)方法,直接建模原生HDR观测。
- NH-3DGS保留了全动态范围,并在重建过程中实现了质量提升。
- 核心技术贡献在于颜色表示的新颖亮度-色度分解。
- NH-3DGS在合成和真实的多视角HDR数据集上的表现优于现有方法。
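"亮度-色度分解"的一个常见且易于验证的形式是:亮度取 RGB 的加权和(下例用 Rec.709 系数),色度取 RGB 除以亮度,二者相乘即可还原;HDR 数据的大动态范围主要集中在亮度通道。下面的示意并非论文的具体参数化,仅说明这种分解为何便于直接处理原生 HDR 观测。

```python
import numpy as np

REC709 = np.array([0.2126, 0.7152, 0.0722])

def decompose(hdr_rgb, eps=1e-8):
    """hdr_rgb: (..., 3) 线性 HDR 颜色。返回 (亮度, 色度),满足 rgb ≈ 亮度 * 色度。"""
    lum = hdr_rgb @ REC709                      # 标量亮度,可跨多个数量级
    chroma = hdr_rgb / (lum[..., None] + eps)   # 色度,数值范围相对有限
    return lum, chroma

def recompose(lum, chroma):
    return lum[..., None] * chroma

# 玩具示例:亮度跨越约 5 个数量级的 HDR 像素
rng = np.random.default_rng(0)
hdr = rng.random((4, 3)) * np.array([[1e-2], [1.0], [1e2], [1e3]])
lum, chroma = decompose(hdr)
print("重建误差:", float(np.abs(recompose(lum, chroma) - hdr).max()))
print("亮度范围:", lum.min(), "→", lum.max(), " 色度范围:", chroma.min(), "→", chroma.max())
```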
点此查看论文截图
Changes in Real Time: Online Scene Change Detection with Multi-View Fusion
Authors:Chamuditha Jayanga Galappaththige, Jason Lai, Lloyd Windrim, Donald Dansereau, Niko Sünderhauf, Dimity Miller
Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and ensures multi-view consistency, while operating at over 10 FPS and achieving new state-of-the-art performance, surpassing even the best offline approaches. Our method introduces a new self-supervised fusion loss to infer scene changes from multiple cues and observations, PnP-based fast pose estimation against the reference scene, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offline baselines.
在线场景变化检测(SCD)是一个极具挑战性的问题,它要求智能体在从非约束视角观察场景的同时实时检测相关变化。现有在线SCD方法的准确性远低于离线方法。我们提出了第一个姿态无关、无需标签且保证多视角一致性的在线SCD方法,它以超过10 FPS的速度运行,并达到了新的最先进性能,甚至超越了最好的离线方法。我们的方法引入了一种新的自监督融合损失,用于从多种线索和观测中推断场景变化;引入了基于PnP的、相对于参考场景的快速姿态估计;以及针对3D高斯平铺场景表示的变化引导快速更新策略。在复杂的真实世界数据集上的大量实验表明,我们的方法优于在线和离线基线方法。
论文及项目相关链接
Summary
在线场景变化检测(SCD)是一个极具挑战性的问题,要求智能体在从不受约束的视角观察场景时检测相关变化。现有的在线SCD方法在准确性上远远落后于离线方法。本文提出了首个姿态无关、无需标签并确保多视角一致性的在线SCD方法,其以超过10 FPS的速度运行,实现了超越现有离线方法的最新性能。该方法引入新的自监督融合损失以从多种线索和观测中推断场景变化,采用基于PnP的快速姿态估计与参考场景对齐,并使用变化引导的快速更新策略来更新3D高斯平铺场景表示。大量实验证明该方法优于在线和离线基线。
Key Takeaways
- 在线场景变化检测(SCD)是一个需要实时检测场景变化的挑战性问题。
- 当前在线SCD方法的准确性低于离线方法。
- 本文首次提出了一种在线SCD方法,该方法具有姿态无关性、无需标签和多视角一致性特点。
- 该方法引入了自监督融合损失来推断场景变化。
- 采用基于PnP的快速姿态估计技术匹配参考场景。
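摘要中"基于PnP的快速姿态估计"指的是由参考场景中的 3D 点与当前帧的 2D 对应点求解相机位姿。下面用 OpenCV 的 solvePnPRansac 在一组合成对应点上演示这一步;相机内参、噪声水平与点的构造均为假设,仅展示调用方式。

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])          # 假设的相机内参
dist = np.zeros(5)                        # 假设无畸变

# 合成一组参考场景中的 3D 点,并用"真实"位姿投影得到当前帧的 2D 观测
pts3d = rng.uniform(-1.0, 1.0, size=(80, 3)) + np.array([0.0, 0.0, 4.0])
rvec_gt = np.array([[0.05], [-0.10], [0.02]])
tvec_gt = np.array([[0.20], [-0.10], [0.30]])
pts2d, _ = cv2.projectPoints(pts3d, rvec_gt, tvec_gt, K, dist)
pts2d = pts2d.reshape(-1, 2) + rng.normal(0.0, 0.5, (80, 2))   # 加入少量像素噪声

# PnP + RANSAC:由 3D-2D 对应恢复当前相机位姿
ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, dist)
print("求解成功:", ok, " 内点数:", 0 if inliers is None else len(inliers))
print("平移误差:", float(np.linalg.norm(tvec.ravel() - tvec_gt.ravel())))
```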
点此查看论文截图
LiDAR-GS++:Improving LiDAR Gaussian Reconstruction via Diffusion Priors
Authors:Qifeng Chen, Jiarun Liu, Rengan Xie, Tao Tang, Sicong Du, Yiru Zhao, Yuchi Huo, Sheng Yang
Recent GS-based rendering has made significant progress for LiDAR, surpassing Neural Radiance Fields (NeRF) in both quality and speed. However, these methods exhibit artifacts in extrapolated novel view synthesis due to the incomplete reconstruction from single traversal scans. To address this limitation, we present LiDAR-GS++, a LiDAR Gaussian Splatting reconstruction method enhanced by diffusion priors for real-time and high-fidelity re-simulation on public urban roads. Specifically, we introduce a controllable LiDAR generation model conditioned on coarsely extrapolated rendering to produce extra geometry-consistent scans and employ an effective distillation mechanism for expansive reconstruction. By extending reconstruction to under-fitted regions, our approach ensures global geometric consistency for extrapolative novel views while preserving detailed scene surfaces captured by sensors. Experiments on multiple public datasets demonstrate that LiDAR-GS++ achieves state-of-the-art performance for both interpolated and extrapolated viewpoints, surpassing existing GS and NeRF-based methods.
最近,基于GS的渲染在激光雷达领域取得了显著进展,在质量和速度上都超越了神经辐射场(NeRF)。然而,由于单次遍历扫描的重建不完整,这些方法在外推新视角合成时会出现伪影。为了解决这一局限,我们提出了LiDAR-GS++,这是一种由扩散先验增强的激光雷达高斯喷涂重建方法,可在公共城市道路上进行实时、高保真的重仿真。具体来说,我们引入了一个可控的激光雷达生成模型,该模型以粗略外推的渲染结果为条件,生成额外的几何一致扫描,并采用有效的蒸馏机制进行扩展重建。通过将重建扩展到欠拟合区域,我们的方法在保留传感器捕捉到的精细场景表面的同时,确保了外推新视角的全局几何一致性。在多个公共数据集上的实验表明,LiDAR-GS++在内插和外推视角上均达到了最先进的性能,超越了现有的GS和NeRF方法。
论文及项目相关链接
PDF Accepted by AAAI-26
Summary
基于GS的渲染方法在LiDAR领域已在质量和速度上超越神经辐射场(NeRF)。然而,由于单次遍历扫描的重建不完整,这些方法在外推新视角合成时会出现伪影。为解决此限制,本文提出LiDAR-GS++,这是一种由扩散先验增强的LiDAR高斯平铺重建方法,适用于公共城市道路的实时、高保真重仿真。
Key Takeaways
- GS-based rendering for LiDAR has advanced significantly, surpassing NeRF in quality and speed.
- 单次遍历扫描的重建不完整,导致外推新视角合成时出现伪影。
- LiDAR-GS++采用可控的LiDAR生成模型,以粗略外推的渲染为条件,产生额外的几何一致扫描。
- LiDAR-GS++采用有效的蒸馏机制进行扩展重建,将重建扩展到欠拟合区域。
- LiDAR-GS++确保外推新视角的全局几何一致性,同时保留传感器捕捉到的精细场景表面。
- 在多个公共数据集上的实验表明,LiDAR-GS++在内插和外推视角上均达到最先进的性能。
点此查看论文截图
SRSplat: Feed-Forward Super-Resolution Gaussian Splatting from Sparse Multi-View Images
Authors:Xinyuan Hu, Changyue Shi, Chuxiao Yang, Minghao Chen, Jiajun Ding, Tao Wei, Chen Wei, Zhou Yu, Min Tan
Feed-forward 3D reconstruction from sparse, low-resolution (LR) images is a crucial capability for real-world applications, such as autonomous driving and embodied AI. However, existing methods often fail to recover fine texture details. This limitation stems from the inherent lack of high-frequency information in LR inputs. To address this, we propose SRSplat, a feed-forward framework that reconstructs high-resolution 3D scenes from only a few LR views. Our main insight is to compensate for the deficiency of texture information by jointly leveraging external high-quality reference images and internal texture cues. We first construct a scene-specific reference gallery, generated for each scene using Multimodal Large Language Models (MLLMs) and diffusion models. To integrate this external information, we introduce the Reference-Guided Feature Enhancement (RGFE) module, which aligns and fuses features from the LR input images and their reference twin image. Subsequently, we train a decoder to predict the Gaussian primitives using the multi-view fused feature obtained from RGFE. To further refine predicted Gaussian primitives, we introduce Texture-Aware Density Control (TADC), which adaptively adjusts Gaussian density based on the internal texture richness of the LR inputs. Extensive experiments demonstrate that our SRSplat outperforms existing methods on various datasets, including RealEstate10K, ACID, and DTU, and exhibits strong cross-dataset and cross-resolution generalization capabilities.
从稀疏、低分辨率(LR)图像进行前馈3D重建是自动驾驶和嵌入式AI等现实世界应用中的关键能力。然而,现有方法往往无法恢复精细的纹理细节。这种限制源于LR输入中缺乏高频信息。为了解决这一问题,我们提出了SRSplat,这是一个前馈框架,仅从几个LR视角重建高分辨率的3D场景。我们的主要见解是通过联合利用外部高质量参考图像和内部纹理线索来弥补纹理信息的不足。我们首先针对每个场景,利用多模态大型语言模型(MLLMs)和扩散模型构建场景特定参考画廊。为了整合这些外部信息,我们引入了参考引导特征增强(RGFE)模块,该模块对齐并融合来自LR输入图像和其参考孪生图像的特征。随后,我们训练一个解码器,使用从RGFE获得的多视图融合特征来预测高斯基元。为了进一步细化预测的高斯基元,我们引入了纹理感知密度控制(TADC),它可以根据LR输入的内部纹理丰富程度自适应地调整高斯密度。大量实验表明,我们的SRSplat在包括RealEstate10K、ACID和DTU等各种数据集上的性能优于现有方法,并表现出强大的跨数据集和跨分辨率泛化能力。
论文及项目相关链接
PDF AAAI2026-Oral. Project Page: https://xinyuanhu66.github.io/SRSplat/
Summary
本文提出一种名为SRSplat的前馈3D重建框架,能够仅从少量低分辨率视图重建出高分辨率的3D场景。该方法通过结合外部高质量参考图像和内部纹理线索来弥补纹理信息的缺失。创新点包括场景特定参考画廊的构建、参考引导特征增强(RGFE)模块以及纹理感知密度控制(TADC)的引入。该方法在多个数据集上表现优异,具有跨数据集和跨分辨率的泛化能力。
Key Takeaways
- SRSplat是一个能够从少量低分辨率图像重建高分辨率3D场景的前馈框架。
- 通过结合外部高质量参考图像和内部纹理线索来弥补纹理信息的缺失。
- 构建了场景特定参考画廊,利用多模态大型语言模型和扩散模型为每个场景生成参考图像。
- 引入参考引导特征增强(RGFE)模块,整合LR输入图像和参考图像的特征。
- 使用多视角融合特征来预测高斯基本体,通过纹理感知密度控制(TADC)进一步优化预测结果。
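"纹理感知密度控制(TADC)"的直觉是:低分辨率输入中纹理越丰富的区域,需要越多的高斯来表达细节。下面用区块内梯度幅值的方差来近似"纹理丰富度",并据此分配每个区块的增密数量;度量方式、区块大小与数量映射均为示意性假设,并非论文的实际策略。

```python
import numpy as np

def texture_richness(gray, patch=16):
    """把灰度图划分为 patch×patch 区块,用区块内梯度幅值的方差度量纹理丰富度。"""
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy)
    H, W = gray.shape
    rich = np.zeros((H // patch, W // patch))
    for i in range(rich.shape[0]):
        for j in range(rich.shape[1]):
            rich[i, j] = mag[i*patch:(i+1)*patch, j*patch:(j+1)*patch].var()
    return rich

def densify_budget(rich, base=8, max_extra=64):
    """纹理越丰富的区块分配越多新增高斯(线性归一化,示意)。"""
    norm = (rich - rich.min()) / (np.ptp(rich) + 1e-8)
    return (base + norm * max_extra).astype(int)

# 玩具示例:左半平坦、右半高频纹理的 128x128 图像
rng = np.random.default_rng(0)
img = np.full((128, 128), 0.5)
img[:, 64:] += 0.3 * rng.standard_normal((128, 64))
budget = densify_budget(texture_richness(img))
print("左半平均增密:", budget[:, :4].mean(), " 右半平均增密:", budget[:, 4:].mean())
```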
点此查看论文截图
REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
Authors:Changyue Shi, Minghao Chen, Yiping Mao, Chuxiao Yang, Xinyuan Hu, Jiajun Ding, Zhou Yu
Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are first fed into the MLLM agent in parallel for coarse-level localization, aggregating responses to robustly identify the target object. Then, several close-up novel views of the object are synthesized to perform fine-grained local segmentation, yielding accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and our newly introduced REALM3D benchmarks. Furthermore, our agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practical utility and versatility. Project page: https://ChangyueShi.github.io/REALM.
在视觉和机器人领域,弥合复杂人类指令与精确三维物体定位之间的鸿沟仍然是一个重大挑战。现有的三维分割方法往往难以理解模糊的、基于推理的指令,而擅长此类推理的二维视觉语言模型又缺乏内在的三维空间理解能力。在本文中,我们介绍了REALM,这是一个创新的MLLM-agent框架,无需大量针对三维的专门后训练即可实现开放世界中基于推理的分割。我们直接在三维高斯平铺表示上进行分割,充分利用其渲染高度逼真新视角的能力,这非常适合MLLM理解。由于将一个或多个渲染视图直接输入MLLM可能导致结果对视点选择高度敏感,我们提出了一种新颖的"全局到局部"空间定位策略。具体来说,首先将多个全局视图并行输入MLLM智能体进行粗略定位,并汇总响应以稳健地识别目标对象;然后合成该对象的若干特写新视角以执行精细的局部分割,生成准确且一致的3D掩码。大量实验表明,REALM在LERF、3D-OVS以及我们新引入的REALM3D基准上,对显式和隐式指令的理解均取得了显著成绩。此外,我们的智能体框架无缝支持一系列三维交互任务,包括对象移除、替换和风格迁移,展示了其实用性与通用性。项目页面:https://ChangyueShi.github.io/REALM。
论文及项目相关链接
Summary
本文提出了一种名为REALM的创新MLLM代理框架,无需大量针对3D的专门后训练即可实现基于推理的分割。该框架直接在3D高斯平铺表示上进行分割,利用其渲染逼真新视角的能力,非常适合MLLM理解。为解决直接输入一个或多个渲染视图可能导致的视角选择敏感性,本文提出了一种全局到局部的空间定位策略:首先将多个全局视图并行输入MLLM代理进行粗略定位,并通过聚合响应稳健地识别目标对象;然后合成目标对象的近距离新视角以进行精细的局部分割,生成准确且一致的3D掩膜。REALM在解释显式和隐式指令方面表现出卓越性能,在LERF、3D-OVS和新引入的REALM3D基准测试中取得了显著成果。此外,该代理框架无缝支持一系列3D交互任务,包括对象移除、替换和风格迁移,展示了其实用性和通用性。
Key Takeaways
- REALM是一个MLLM-agent框架,能够在无需大量针对3D的后期训练情况下实现基于推理的分割。
- REALM利用3D高斯平铺表示进行分割,并借助其渲染逼真新视角的能力促进MLLM理解。
- 全局到局部的空间定位策略解决了视角选择敏感性问题。
- REALM通过结合全局和局部视图,实现了目标对象的粗略和精细定位。
- REALM在多个基准测试中表现出卓越性能,包括LERF、3D-OVS和最新引入的REALM3D。
- REALM框架支持一系列3D交互任务,包括对象移除、替换和风格转换。
点此查看论文截图
Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars
Authors:Marcel C. Bühler, Ye Yuan, Xueting Li, Yangyi Huang, Koki Nagano, Umar Iqbal
We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Given an image, we first dream plausible multi-views using a video diffusion model, capturing rich geometric and appearance details. These views are then lifted into unstructured 3D Gaussians. To enable animation, we propose a transformer-based encoder that models global spatial relationships and projects these Gaussians into a structured latent representation aligned with the UV space of a parametric body model. This latent code is decoded into UV-space Gaussians that can be animated via body-driven deformation and rendered conditioned on pose and viewpoint. By anchoring Gaussians to the UV manifold, our method ensures consistency during animation while preserving fine visual details. DLA enables real-time rendering and intuitive editing without requiring post-processing. Our method outperforms state-of-the-art approaches on the ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy. By combining the generative strengths of video diffusion models with a pose-aware UV-space Gaussian mapping, DLA bridges the gap between unstructured 3D representations and high-fidelity, animation-ready avatars.
我们介绍了Dream、Lift、Animate(DLA)这一全新框架,它能从单张图片重建可动画的3D人类角色。这是通过利用多视角生成、3D高斯提升和姿态感知的UV空间高斯映射来实现的。给定一张图片,我们首先使用视频扩散模型"构想"(dream)出合理的多视角图像,捕捉丰富的几何和外观细节。然后,这些视角被提升为无结构的3D高斯。为了实现动画,我们提出了一种基于Transformer的编码器,它建模全局空间关系,并将这些高斯投影到与参数化人体模型UV空间对齐的结构化潜在表示中。该潜在编码被解码为UV空间高斯,可以通过身体驱动的变形进行动画,并根据姿态和视点进行渲染。通过将高斯锚定到UV流形,我们的方法在动画过程中确保了一致性,同时保留了精细的视觉细节。DLA无需后期处理即可实现实时渲染和直观编辑。我们的方法在ActorsHQ和4D-Dress数据集上的感知质量和光度准确性方面都优于现有最先进方法。通过将视频扩散模型的生成能力与姿态感知的UV空间高斯映射相结合,DLA弥合了无结构3D表示与高保真、可直接动画的化身之间的差距。
论文及项目相关链接
PDF Accepted to 3DV 2026
Summary
本文介绍了Dream、Lift、Animate(DLA)这一新型框架,该框架能够从单一图像重建可动画的3D人类角色,通过多视角生成、3D高斯提升和姿态感知UV空间映射等技术实现。给定图像,首先通过视频扩散模型生成多个视角的图像,再将这些视角提升为无结构的3D高斯数据。为实现动画效果,提出基于Transformer的编码器,该编码器能够建模全局空间关系,并将这些高斯数据投影到与参数化身体模型UV空间对齐的结构化潜在表示中。该潜在编码被解码为UV空间高斯数据,可以通过身体驱动变形进行动画处理,并根据姿态和视角进行渲染。DLA方法确保了动画过程中的一致性,同时保留了精细的视觉细节,实现了实时渲染和直观编辑,无需后期处理。
Key Takeaways
- DLA框架能够从单一图像重建3D人类角色。
- 利用多视角生成、3D高斯提升和姿态感知UV空间映射等技术实现重建。
- 通过视频扩散模型生成多个视角的图像。
- 基于Transformer的编码器实现动画效果,建模全局空间关系。
- 将高斯数据投影到参数化身体模型的UV空间,实现潜在代码的解码。
- DLA方法支持实时渲染和直观编辑,无需后期处理。
点此查看论文截图
ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS
Authors:Weijie Wang, Donny Y. Chen, Zeyu Zhang, Duochao Shi, Akide Liu, Bohan Zhuang
Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their models, leading to degraded performance or excessive memory consumption as the number of input views increases. In this work, we analyze feed-forward 3DGS frameworks through the lens of the Information Bottleneck principle and introduce ZPressor, a lightweight architecture-agnostic module that enables efficient compression of multi-view inputs into a compact latent state $Z$ that retains essential scene information while discarding redundancy. Concretely, ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU, by partitioning the views into anchor and support sets and using cross attention to compress the information from the support views into anchor views, forming the compressed latent state $Z$. We show that integrating ZPressor into several state-of-the-art feed-forward 3DGS models consistently improves performance under moderate input views and enhances robustness under dense view settings on two large-scale benchmarks DL3DV-10K and RealEstate10K. The video results, code and trained models are available on our project page: https://lhmd.top/zpressor.
前馈三维高斯摊铺(3DGS)模型最近成为新视角合成中一种有前景的解决方案,无需针对每个场景进行3DGS优化即可单次前向推理。然而,其可扩展性从根本上受限于模型容量:随着输入视图数量的增加,性能会下降或内存消耗过大。在这项工作中,我们通过信息瓶颈原理分析前馈3DGS框架,并引入了ZPressor,这是一个轻量级、与架构无关的模块,能够将多视图输入高效压缩成紧凑的潜在状态Z,在保留关键场景信息的同时丢弃冗余。具体来说,ZPressor将视图划分为锚视图集和支持视图集,并使用交叉注意力把支持视图的信息压缩进锚视图,形成压缩的潜在状态Z,从而使现有前馈3DGS模型能够在一块80GB GPU上以480P分辨率扩展到超过100个输入视图。我们在DL3DV-10K和RealEstate10K两个大规模基准上表明,将ZPressor集成到多个最先进的前馈3DGS模型中,可以在中等数量输入视图下持续提升性能,并在密集视图设置下增强鲁棒性。视频结果、代码和训练好的模型见项目页面:https://lhmd.top/zpressor。
论文及项目相关链接
PDF NeurIPS 2025, Project Page: https://lhmd.top/zpressor, Code: https://github.com/ziplab/ZPressor
Summary
该文探讨了在视图合成领域中,前馈3D高斯扩展模型在处理大量视图时的局限性问题。为了提升模型处理性能并解决内存消耗问题,该文通过信息瓶颈原理对前馈3DGS框架进行分析,并引入了名为ZPressor的轻量级架构无关模块。该模块可将多视图输入压缩成紧凑的潜在状态Z,保留关键场景信息并去除冗余信息。通过整合ZPressor模块,现有前馈3DGS模型能在有限的GPU内存下处理超过百个视图。在大型数据集上测试显示,该方案性能表现优越且鲁棒性强。详情可见相关项目页面:https://lhmd.top/zpressor。
Key Takeaways
- 前馈3D高斯扩展(3DGS)模型可用于快速视图合成,无需为每个场景进行优化。
- 随着输入视图数量的增加,现有前馈3DGS模型的性能受限,并可能出现内存消耗过大的问题。
- ZPressor模块通过压缩多视图输入到紧凑的潜在状态Z,提高了模型的效率和性能。
- ZPressor模块将视图分为锚点和支持集,并使用交叉注意力机制将支持视图的信息压缩到锚视图中。
- 集成ZPressor模块的前馈3DGS模型能在有限的GPU内存下处理更多视图,并且在大型数据集上的性能表现卓越。
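下面用 PyTorch 的 MultiheadAttention 给出"以锚视图 token 作 query、支持视图 token 作 key/value,把支持视图信息经交叉注意力压缩进锚视图"这一思想的极简示意:输出的紧凑状态 Z 只保留锚视图数量的 token。视图划分、token 数与特征维度均为假设,与 ZPressor 的实际结构无关。

```python
import torch
import torch.nn as nn

class CrossViewCompressor(nn.Module):
    """锚视图 token 通过交叉注意力吸收支持视图信息,得到紧凑状态 Z(示意)。"""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, anchor_tokens, support_tokens):
        # query: 锚视图;key/value: 支持视图
        fused, _ = self.attn(anchor_tokens, support_tokens, support_tokens)
        return self.norm(anchor_tokens + fused)        # 残差 + 归一化

# 玩具示例:100 个输入视图,每视图 16 个 token,划分为 8 个锚视图 + 92 个支持视图
B, tokens_per_view, dim = 1, 16, 256
anchor = torch.randn(B, 8 * tokens_per_view, dim)
support = torch.randn(B, 92 * tokens_per_view, dim)
Z = CrossViewCompressor(dim)(anchor, support)
print("压缩后的潜在状态 Z:", tuple(Z.shape))   # token 数只与锚视图数量有关
```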
点此查看论文截图
GaussianFocus: Constrained Attention Focus for 3D Gaussian Splatting
Authors:Zexu Huang, Min Xu, Stuart Perry
Recent developments in 3D reconstruction and neural rendering have significantly propelled the capabilities of photo-realistic 3D scene rendering across various academic and industrial fields. The 3D Gaussian Splatting technique, alongside its derivatives, integrates the advantages of primitive-based and volumetric representations to deliver top-tier rendering quality and efficiency. Despite these advancements, the method tends to generate excessive redundant noisy Gaussians overfitted to every training view, which degrades the rendering quality. Additionally, while 3D Gaussian Splatting excels in small-scale and object-centric scenes, its application to larger scenes is hindered by constraints such as limited video memory, excessive optimization duration, and variable appearance across views. To address these challenges, we introduce GaussianFocus, an innovative approach that incorporates a patch attention algorithm to refine rendering quality and implements a Gaussian constraints strategy to minimize redundancy. Moreover, we propose a subdivision reconstruction strategy for large-scale scenes, dividing them into smaller, manageable blocks for individual training. Our results indicate that GaussianFocus significantly reduces unnecessary Gaussians and enhances rendering quality, surpassing existing State-of-The-Art (SoTA) methods. Furthermore, we demonstrate the capability of our approach to effectively manage and render large scenes, such as urban environments, whilst maintaining high fidelity in the visual output.
近期三维重建和神经渲染的发展,极大地提升了各学术和产业领域中逼真三维场景渲染的能力。三维高斯喷绘(3DGS)技术及其衍生方法结合了基于基元表示与体积表示的优势,实现了顶尖的渲染质量和效率。然而,该技术往往会产生大量冗余的噪声高斯,过度拟合每个训练视图,导致渲染质量下降。此外,虽然三维高斯喷绘在小规模、以对象为中心的场景中表现出色,但应用于更大场景时会受到显存有限、优化时间过长以及视图间外观变化等约束的阻碍。为了应对这些挑战,我们引入了GaussianFocus这一创新方法,它结合补丁注意力算法来优化渲染质量,并实施高斯约束策略来减少冗余。此外,我们针对大规模场景提出了一种细分重建策略,将其划分为较小的、可管理的块进行单独训练。结果表明,GaussianFocus显著减少了不必要的高斯并提升了渲染质量,超越了现有的最先进方法。此外,我们还展示了该方法在管理和渲染大型场景(如城市环境)方面的能力,同时保持视觉输出的高保真度。
论文及项目相关链接
Summary
该文介绍了近期三维重建和神经渲染技术的发展对逼真三维场景渲染的推动。3D高斯喷绘技术及其衍生方法融合了基于基元表示与体积表示的优点,实现了高质量、高效率的渲染。然而,该技术仍面临一些问题,例如生成过多过拟合于每个训练视图的冗余噪声高斯,导致渲染质量下降。为解决这些问题,文章提出了GaussianFocus方法,采用补丁注意力算法改进渲染质量,并实施高斯约束策略来减少冗余;此外,还针对大规模场景提出了细分重建策略,将场景划分为较小的可管理块进行单独训练。总的来说,GaussianFocus在减少不必要冗余高斯的同时提高了渲染质量,超越了现有技术,并展示了管理和渲染大型场景的能力。
Key Takeaways
- 3D重建和神经渲染技术的最新进展推动了光栅化三维场景渲染的进步。
- 3D高斯喷绘技术融合了基于基元表示与体积表示方法的优点。
- 存在生成冗余噪声高斯的问题,影响渲染质量。
- GaussianFocus方法通过补丁注意力算法改进渲染质量,减少冗余。
- GaussianFocus采用高斯约束策略来最小化冗余。
- 针对大规模场景提出了细分重建策略,将场景分为较小的块进行单独训练。
- GaussianFocus提高了渲染质量并超越了现有技术。
点此查看论文截图
DehazeGS: Seeing Through Fog with 3D Gaussian Splatting
Authors:Jinze Yu, Yiqun Wang, Aiheng Jiang, Zhengda Lu, Jianwei Guo, Yong Li, Hongxing Qin, Xiaopeng Zhang
Current novel view synthesis methods are typically designed for high-quality and clean input images. However, in foggy scenes, scattering and attenuation can significantly degrade the quality of rendering. Although NeRF-based dehazing approaches have been developed, their reliance on deep fully connected neural networks and per-ray sampling strategies leads to high computational costs. Furthermore, NeRF’s implicit representation limits its ability to recover fine-grained details from hazy scenes. To overcome these limitations, we propose learning an explicit Gaussian representation to explain the formation mechanism of foggy images through a physically forward rendering process. Our method, DehazeGS, reconstructs and renders fog-free scenes using only multi-view foggy images as input. Specifically, based on the atmospheric scattering model, we simulate the formation of fog by establishing the transmission function directly onto Gaussian primitives via depth-to-transmission mapping. During training, we jointly learn the atmospheric light and scattering coefficients while optimizing the Gaussian representation of foggy scenes. At inference time, we remove the effects of scattering and attenuation in Gaussian distributions and directly render the scene to obtain dehazed views. Experiments on both real-world and synthetic foggy datasets demonstrate that DehazeGS achieves state-of-the-art performance. visualizations are available at https://dehazegs.github.io/
当前的新视角合成方法通常针对高质量、清晰的输入图像而设计。然而,在雾天场景中,散射和衰减会显著降低渲染质量。尽管已经出现了基于NeRF的去雾方法,但它们依赖深度全连接神经网络和逐射线采样策略,导致计算成本高昂;此外,NeRF的隐式表示也限制了其从雾天场景中恢复精细细节的能力。为了克服这些局限,我们提出学习一种显式的高斯表示,通过物理前向渲染过程来解释雾天图像的形成机制。我们的方法DehazeGS仅使用多视角雾天图像作为输入,即可重建并渲染无雾场景。具体来说,基于大气散射模型,我们通过深度到透射率的映射直接在高斯基元上建立透射函数,以模拟雾的形成。在训练过程中,我们在优化雾天场景高斯表示的同时,联合学习大气光和散射系数。在推理阶段,我们消除高斯分布中散射和衰减的影响,直接渲染场景以获得去雾视图。在真实和合成雾天数据集上的实验表明,DehazeGS达到了最先进的性能。可视化结果见:https://dehazegs.github.io/
论文及项目相关链接
PDF 9 pages,5 figures. Accepted by AAAI2026. visualizations are available at https://dehazegs.github.io/
Summary
本文介绍了一种针对雾天场景的新视角合成方法。基于NeRF的去雾方法计算成本高且难以恢复细节。文章提出学习显式的高斯表示来解释雾天图像的形成机制:基于大气散射模型,通过深度到透射率的映射直接在高斯基元上建立透射函数,从而渲染出去雾的清晰场景。新方法DehazeGS仅使用多视角雾天图像作为输入,训练时联合学习大气光和散射系数并优化高斯表示。实验证明其在真实和合成雾天数据集上表现优秀。
Key Takeaways
- 当前视图合成方法主要针对高质量清晰图像,雾天场景中的散射和衰减会降低渲染质量。
- NeRF-based去雾方法计算成本高且难以恢复细节。
- 提出学习显式的高斯表示来解释雾天图像的形成机制。
- 基于大气散射模型,通过深度到透射率的映射直接在高斯基元上建立透射函数。
- DehazeGS方法仅使用多视角雾图像作为输入,进行去雾渲染。
- 训练过程中联合学习大气光和散射系数,优化Gaussian表达。
- 通过实验证明DehazeGS在真实和合成雾天数据集上表现优秀。
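摘要中的"深度到透射率映射"对应经典大气散射模型 I = J·t + A·(1 − t),其中透射率 t = exp(−β·d) 由深度 d 与散射系数 β 决定。下面的示意代码先按该模型"加雾",再在已知 A、β 与深度时反解去雾,用来说明为什么掌握深度后可以直接消除散射与衰减;参数均为假设值。

```python
import numpy as np

def transmission(depth, beta):
    """深度到透射率的映射:t = exp(-beta * d)。"""
    return np.exp(-beta * depth)

def add_fog(J, depth, A, beta):
    """大气散射模型正向成像:I = J * t + A * (1 - t)。"""
    t = transmission(depth, beta)[..., None]
    return J * t + A * (1.0 - t)

def dehaze(I, depth, A, beta, t_min=0.05):
    """已知 A、beta 与深度时的反解:J = (I - A * (1 - t)) / t。"""
    t = np.clip(transmission(depth, beta), t_min, 1.0)[..., None]
    return (I - A * (1.0 - t)) / t

# 玩具示例:随机清晰图 J 与从近到远线性增大的深度图
rng = np.random.default_rng(0)
J = rng.random((64, 64, 3))
depth = np.tile(np.linspace(1.0, 30.0, 64), (64, 1))
A, beta = np.array([0.80, 0.80, 0.85]), 0.08
I = add_fog(J, depth, A, beta)
J_rec = dehaze(I, depth, A, beta)
print("去雾重建误差:", float(np.abs(J_rec - J).max()))
```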
点此查看论文截图
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
Authors:Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Yue Liao, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren
We introduce EnerVerse, a generative robotics foundation model that constructs and interprets embodied spaces. EnerVerse employs a chunk-wise autoregressive video diffusion framework to predict future embodied spaces from instructions, enhanced by a sparse context memory for long-term reasoning. To model the 3D robotics world, we adopt a multi-view video representation, providing rich perspectives to address challenges like motion ambiguity and 3D grounding. Additionally, EnerVerse-D, a data engine pipeline combining generative modeling with 4D Gaussian Splatting, forms a self-reinforcing data loop to reduce the sim-to-real gap. Leveraging these innovations, EnerVerse translates 4D world representations into physical actions via a policy head (EnerVerse-A), achieving state-of-the-art performance in both simulation and real-world tasks. For efficiency, EnerVerse-A reuses features from the first denoising step and predicts action chunks, achieving about 280 ms per 8-step action chunk on a single RTX 4090. Further video demos, dataset samples could be found in our project page.
我们介绍了EnerVerse,这是一个构建并理解具身空间的生成式机器人基础模型。EnerVerse采用分块自回归视频扩散框架,根据指令预测未来的具身空间,并通过稀疏上下文记忆增强长期推理能力。为了建模三维机器人世界,我们采用多视角视频表示,提供丰富的视角来应对运动歧义和三维定位等挑战。此外,EnerVerse-D是一个将生成建模与4D高斯泼溅相结合的数据引擎管道,形成自我强化的数据循环,以缩小仿真与现实之间的差距。利用这些创新,EnerVerse通过策略头(EnerVerse-A)将4D世界表示转化为物理动作,在仿真和真实世界任务中均实现了最先进的性能。为提高效率,EnerVerse-A重用第一次去噪步骤的特征并预测动作块,在单张RTX 4090上每个8步动作块约需280毫秒。更多视频演示和数据集样本可在我们的项目页面找到。
论文及项目相关链接
PDF Accepted by NeurIPS 2025. Website: https://sites.google.com/view/enerverse
Summary
EnerVerse是一个生成式机器人基础模型,能够构建并理解具身空间。它采用分块自回归视频扩散框架,从指令预测未来的具身空间,并通过稀疏上下文记忆进行长期推理。通过采用多视角视频表示来建模3D机器人世界,应对运动歧义和3D定位等挑战。EnerVerse-D数据引擎管道结合生成建模与4D高斯泼溅技术,形成自我强化的数据循环,缩小仿真与真实之间的差距。EnerVerse通过策略头(EnerVerse-A)将4D世界表示转化为物理动作,在仿真和真实任务中均实现卓越性能。
Key Takeaways
- EnerVerse是一个生成式机器人基础模型,能构建并理解具身空间。
- 采用分块自回归视频扩散框架进行预测。
- 通过稀疏上下文记忆进行长期推理。
- 采用多视角视频表示建模3D机器人世界,应对运动歧义和3D定位挑战。
- EnerVerse-D数据引擎管道结合生成建模与4D高斯泼溅技术。
- 自我加强的数据循环有助于缩小模拟与真实之间的差距。