⚠️ All of the summaries below are generated by a large language model and may contain errors; they are provided for reference only and should be used with caution.
🔴 Please note: never rely on these summaries in serious academic settings; they are only meant for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-23
Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
Authors:Jinfeng Liu, Lingtong Kong, Mi Zhou, Jinwen Chen, Dan Xu
We introduce Mono4DGS-HDR, the first system for reconstructing renderable 4D high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR) videos captured with alternating exposures. To tackle such a challenging problem, we present a unified framework with two-stage optimization approach based on Gaussian Splatting. The first stage learns a video HDR Gaussian representation in orthographic camera coordinate space, eliminating the need for camera poses and enabling robust initial HDR video reconstruction. The second stage transforms video Gaussians into world space and jointly refines the world Gaussians with camera poses. Furthermore, we propose a temporal luminance regularization strategy to enhance the temporal consistency of the HDR appearance. Since our task has not been studied before, we construct a new evaluation benchmark using publicly available datasets for HDR video reconstruction. Extensive experiments demonstrate that Mono4DGS-HDR significantly outperforms alternative solutions adapted from state-of-the-art methods in both rendering quality and speed.
Paper and project links
PDF Project page is available at https://liujf1226.github.io/Mono4DGS-HDR/
Summary
Mono4DGS-HDR reconstructs renderable 4D high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR) videos captured with alternating exposures. It uses a unified, Gaussian Splatting based framework with a two-stage optimization: the first stage learns a video HDR Gaussian representation in an orthographic camera coordinate space, and the second stage transforms the video Gaussians into world space and jointly refines them together with the camera poses. A temporal luminance regularization strategy further improves the temporal consistency of the HDR appearance. Because the task has not been studied before, the authors build a new evaluation benchmark from public HDR video datasets, on which Mono4DGS-HDR clearly outperforms alternatives adapted from state-of-the-art methods in both rendering quality and speed.
Key Takeaways
- Mono4DGS-HDR is the first system to reconstruct renderable 4D HDR scenes from unposed monocular LDR videos captured with alternating exposures.
- It uses a unified, Gaussian Splatting based framework with a two-stage optimization.
- The first stage learns a video HDR Gaussian representation in an orthographic camera coordinate space, requiring no camera poses and enabling robust initial HDR video reconstruction.
- The second stage transforms the video Gaussians into world space and jointly refines them together with the camera poses.
- A temporal luminance regularization strategy improves the temporal consistency of the HDR appearance (see the sketch below).
- A new evaluation benchmark built from public datasets is introduced for HDR video reconstruction.
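The abstract does not spell out the regularizer, so the following is only a minimal sketch of what a temporal luminance regularization term could look like for two consecutive rendered HDR frames. The function name, the Rec.709 luminance proxy, and the log-space L1 penalty are assumptions for illustration, not the paper's formulation.

```python
import torch

def temporal_luminance_reg(hdr_prev: torch.Tensor, hdr_curr: torch.Tensor,
                           eps: float = 1e-6) -> torch.Tensor:
    """Penalize frame-to-frame changes in log luminance of rendered HDR frames.

    hdr_prev, hdr_curr: (3, H, W) linear HDR renderings of consecutive frames.
    Illustrative only: the paper's actual regularizer may differ.
    """
    # Rec.709 luma weights as a simple luminance proxy (assumption).
    w = torch.tensor([0.2126, 0.7152, 0.0722], device=hdr_curr.device).view(3, 1, 1)
    lum_prev = (hdr_prev * w).sum(dim=0)
    lum_curr = (hdr_curr * w).sum(dim=0)
    # Compare in log space so the penalty is balanced across the dynamic range.
    return (torch.log(lum_curr + eps) - torch.log(lum_prev + eps)).abs().mean()

# Hypothetical usage inside a training step:
# loss = photometric_loss + lambda_t * temporal_luminance_reg(render_t0, render_t1)
```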
OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion
Authors:Tianyu Huang, Runnan Chen, Dongting Hu, Fengming Huang, Mingming Gong, Tongliang Liu
Understanding 3D scenes is pivotal for autonomous driving, robotics, and augmented reality. Recent semantic Gaussian Splatting approaches leverage large-scale 2D vision models to project 2D semantic features onto 3D scenes. However, they suffer from two major limitations: (1) insufficient contextual cues for individual masks during preprocessing and (2) inconsistencies and missing details when fusing multi-view features from these 2D models. In this paper, we introduce \textbf{OpenInsGaussian}, an \textbf{Open}-vocabulary \textbf{Ins}tance \textbf{Gaussian} segmentation framework with Context-aware Cross-view Fusion. Our method consists of two modules: Context-Aware Feature Extraction, which augments each mask with rich semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to mitigate alignment errors and incompleteness. Through extensive experiments on benchmark datasets, OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin. These findings underscore the robustness and generality of our proposed approach, marking a significant step forward in 3D scene understanding and its practical deployment across diverse real-world scenarios.
Paper and project links
Summary
OpenInsGaussian is an open-vocabulary instance Gaussian segmentation framework for 3D scene understanding. Its two modules, Context-Aware Feature Extraction and Attention-Driven Feature Aggregation, address the two main weaknesses of existing semantic Gaussian Splatting pipelines: insufficient contextual cues for individual masks during preprocessing, and inconsistencies and missing details when fusing multi-view features from 2D vision models. Experiments on benchmark datasets show that OpenInsGaussian achieves state-of-the-art open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin and marking a significant step for 3D scene understanding and its real-world deployment.
Key Takeaways
- OpenInsGaussian is an open-vocabulary instance Gaussian segmentation framework for 3D scene understanding.
- Its Context-Aware Feature Extraction module augments each mask with rich semantic context.
- Its Attention-Driven Feature Aggregation module selectively fuses multi-view features to mitigate alignment errors and incompleteness (see the sketch below).
- Together, the two modules address the insufficient per-mask context and inconsistent multi-view feature fusion of existing semantic Gaussian Splatting methods.
- Experiments on benchmark datasets show state-of-the-art results in open-vocabulary 3D Gaussian segmentation, well ahead of existing baselines.
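As a rough illustration of the attention-driven fusion idea, the sketch below weights per-view features of one instance by their agreement with a reference feature before averaging them. The function, the cosine-similarity scores, and the temperature value are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def aggregate_multiview_features(view_feats: torch.Tensor,
                                 query: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Fuse per-view semantic features of one instance with attention weights.

    view_feats: (V, D) features of the same instance extracted from V views.
    query:      (D,)   reference feature (e.g., a running instance prototype).
    Views that agree with the query get larger weights, which down-weights
    misaligned or incomplete observations.
    """
    view_feats = F.normalize(view_feats, dim=-1)
    query = F.normalize(query, dim=-1)
    scores = view_feats @ query / temperature          # (V,) similarity scores
    weights = scores.softmax(dim=0)                    # attention over views
    fused = (weights.unsqueeze(-1) * view_feats).sum(dim=0)
    return F.normalize(fused, dim=-1)

# Example: fuse five views of 512-d features.
fused = aggregate_multiview_features(torch.randn(5, 512), torch.randn(512))
```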
HouseTour: A Virtual Real Estate A(I)gent
Authors:Ata Çelen, Marc Pollefeys, Daniel Barath, Iro Armeni
We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and touristic applications without requiring specialized expertise or equipment.
Paper and project links
PDF Published at ICCV 2025
Summary
HouseTour generates a spatially-aware 3D camera trajectory and a natural language summary from a collection of images of an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, it produces smooth video trajectories through a diffusion process constrained by known camera poses and feeds this information into the VLM for 3D-grounded descriptions. The final video is synthesized with 3D Gaussian splatting, rendering novel views along the trajectory. The accompanying HouseTour dataset contains over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments show that incorporating 3D camera trajectories into text generation outperforms handling each task independently, and a new joint metric is introduced to evaluate end-to-end performance. The approach enables automated, professional-quality video creation for real estate and touristic applications without specialized expertise or equipment.
Key Takeaways
- HouseTour generates a spatially-aware 3D camera trajectory and a natural language summary from a collection of images of an existing 3D space.
- Smooth video trajectories are produced by a diffusion process constrained by known camera poses, and this information is integrated into the VLM for 3D-grounded descriptions, helping with the geometric reasoning that existing VLMs struggle with (see the toy sketch below).
- The final video is rendered along the trajectory with 3D Gaussian splatting.
- The HouseTour dataset provides over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions.
- A new joint metric evaluates trajectory generation and description generation end to end.
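The paper constrains a diffusion process with known camera poses; the toy sketch below only illustrates the much weaker idea of smoothing a camera path while keeping known poses fixed. The function name, the Laplacian-style update, and the anchor indices are assumptions for illustration and are not the paper's diffusion model.

```python
import numpy as np

def smooth_trajectory_with_anchors(positions: np.ndarray,
                                   anchor_idx: list[int],
                                   iters: int = 200,
                                   step: float = 0.5) -> np.ndarray:
    """Toy constrained smoothing of a camera path.

    positions:  (N, 3) initial camera positions (e.g., noisy guesses).
    anchor_idx: indices whose positions are known camera poses and stay fixed.
    Repeatedly pulls every free point toward the midpoint of its neighbours,
    yielding a smooth path that still passes through the known poses.
    """
    traj = positions.copy()
    anchors = set(anchor_idx)
    for _ in range(iters):
        mid = 0.5 * (traj[:-2] + traj[2:])          # midpoints of neighbours
        update = traj.copy()
        update[1:-1] += step * (mid - traj[1:-1])   # Laplacian smoothing step
        for i in anchors:                           # keep known poses fixed
            update[i] = positions[i]
        traj = update
    return traj

# Example: 50 waypoints, poses known at frames 0, 24, and 49 (hypothetical).
path = smooth_trajectory_with_anchors(np.random.rand(50, 3), [0, 24, 49])
```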
Every Camera Effect, Every Time, All at Once: 4D Gaussian Ray Tracing for Physics-based Camera Effect Data Generation
Authors:Yi-Ruei Liu, You-Zhe Xie, Yu-Hsiang Hsu, I-Sheng Fang, Yu-Lun Liu, Jun-Cheng Chen
Common computer vision systems typically assume ideal pinhole cameras but fail when facing real-world camera effects such as fisheye distortion and rolling shutter, mainly due to the lack of learning from training data with camera effects. Existing data generation approaches suffer from either high costs, sim-to-real gaps or fail to accurately model camera effects. To address this bottleneck, we propose 4D Gaussian Ray Tracing (4D-GRT), a novel two-stage pipeline that combines 4D Gaussian Splatting with physically-based ray tracing for camera effect simulation. Given multi-view videos, 4D-GRT first reconstructs dynamic scenes, then applies ray tracing to generate videos with controllable, physically accurate camera effects. 4D-GRT achieves the fastest rendering speed while performing better or comparable rendering quality compared to existing baselines. Additionally, we construct eight synthetic dynamic scenes in indoor environments across four camera effects as a benchmark to evaluate generated videos with camera effects.
Paper and project links
PDF Paper accepted to NeurIPS 2025 Workshop SpaVLE. Project page: https://shigon255.github.io/4DGRT-project-page/
Summary
4D Gaussian Ray Tracing (4D-GRT) is a two-stage pipeline that combines 4D Gaussian Splatting with physically-based ray tracing to simulate camera effects, addressing the failure of common computer vision systems on real-world effects such as fisheye distortion and rolling shutter. Given multi-view videos, 4D-GRT first reconstructs the dynamic scene and then applies ray tracing to generate videos with controllable, physically accurate camera effects. It achieves the fastest rendering speed with rendering quality better than or comparable to existing baselines. The authors also construct eight synthetic indoor dynamic scenes covering four camera effects as a benchmark for evaluating generated videos with camera effects.
Key Takeaways
- Common computer vision systems assume ideal pinhole cameras and fail under real-world camera effects.
- Existing data generation approaches suffer from high cost, sim-to-real gaps, or inaccurate modeling of camera effects.
- 4D Gaussian Ray Tracing (4D-GRT) is proposed as a novel two-stage pipeline to address this bottleneck.
- 4D-GRT combines 4D Gaussian Splatting with physically-based ray tracing for camera effect simulation (see the sketch below).
- Given multi-view videos, 4D-GRT first reconstructs the dynamic scene.
- It then generates videos with controllable, physically accurate camera effects.
- 4D-GRT achieves the fastest rendering speed with rendering quality better than or comparable to existing baselines.
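To make the two camera effects named in the abstract concrete, here is a minimal sketch of standard models for them: per-row capture times for a rolling shutter and ray directions under an equidistant fisheye projection. These are textbook formulations chosen for illustration, not necessarily the exact models or code used in 4D-GRT.

```python
import numpy as np

def rolling_shutter_row_times(height: int, frame_time: float, readout: float) -> np.ndarray:
    """Capture time of each image row under a rolling shutter.

    Rows are read out sequentially, so rays traced for row r should query the
    dynamic scene at t = frame_time + readout * r / (height - 1) instead of a
    single per-frame timestamp.
    """
    return frame_time + readout * np.arange(height) / max(height - 1, 1)

def fisheye_ray_direction(u: float, v: float, cx: float, cy: float, f: float) -> np.ndarray:
    """Camera-space ray direction for an equidistant fisheye pixel (u, v).

    Under the equidistant model the radial image distance is r = f * theta,
    where theta is the angle from the optical axis.
    """
    dx, dy = u - cx, v - cy
    r = np.hypot(dx, dy)
    if r < 1e-9:
        return np.array([0.0, 0.0, 1.0])       # pixel on the optical axis
    theta = r / f
    s = np.sin(theta)
    return np.array([s * dx / r, s * dy / r, np.cos(theta)])

# Example: per-row times for a 720-row frame with a 10 ms readout (hypothetical values).
times = rolling_shutter_row_times(720, frame_time=0.0, readout=0.01)
ray = fisheye_ray_direction(700.0, 400.0, cx=640.0, cy=360.0, f=300.0)
```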
RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Authors:Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang, Xinggang Wang
Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and an open-loop gap. In this work, we propose RAD, a 3DGS-based closed-loop Reinforcement Learning (RL) framework for end-to-end Autonomous Driving. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards to guide the policy in effectively responding to safety-critical events and understanding real-world causal relationships. To better align with human driving behavior, we incorporate IL into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, particularly exhibiting a 3x lower collision rate. Abundant closed-loop results are presented in the supplementary material. Code is available at https://github.com/hustvl/RAD for facilitating future research.
Paper and project links
PDF Code: https://github.com/hustvl/RAD
Summary
RAD is a 3DGS-based closed-loop Reinforcement Learning (RL) framework for end-to-end autonomous driving (AD), addressing the causal confusion and open-loop gap of Imitation Learning (IL) pipelines. Using 3DGS, it builds a photorealistic digital replica of the real physical world, letting the AD policy explore the state space extensively and learn to handle out-of-distribution scenarios through large-scale trial and error. Specialized rewards guide the policy to respond effectively to safety-critical events and to capture real-world causal relationships, and IL is incorporated into RL training as a regularization term to better align with human driving behavior. On a closed-loop benchmark of diverse, previously unseen 3DGS environments, RAD outperforms IL-based methods on most closed-loop metrics and in particular achieves a 3x lower collision rate.
Key Takeaways
- RAD is a 3DGS-based closed-loop reinforcement learning framework for end-to-end autonomous driving.
- 3DGS is used to build a photorealistic digital replica of the real world so the AD policy can explore the state space extensively.
- Large-scale trial and error lets the policy learn to handle out-of-distribution scenarios.
- Specialized rewards improve safety by guiding the policy to respond effectively to safety-critical events.
- Imitation learning is incorporated into RL training as a regularization term to better align the policy with human driving behavior (see the sketch below).
- On the closed-loop evaluation benchmark, RAD outperforms IL-based methods on most metrics, with a 3x lower collision rate.
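The abstract only states that IL enters RL training as a regularization term; the sketch below shows one generic way to write such a combined objective, pairing a REINFORCE-style policy-gradient term with a behavior-cloning penalty toward expert actions. The function name, the specific loss forms, and the weight are assumptions for illustration, not RAD's actual objective.

```python
import torch
import torch.nn.functional as F

def il_regularized_rl_loss(log_probs: torch.Tensor,
                           advantages: torch.Tensor,
                           pred_actions: torch.Tensor,
                           expert_actions: torch.Tensor,
                           il_weight: float = 0.1) -> torch.Tensor:
    """Combine a policy-gradient RL term with an imitation regularizer.

    log_probs:      (B,) log pi(a_t | s_t) of the actions taken in rollouts.
    advantages:     (B,) advantage estimates from the specialized rewards.
    pred_actions:   (B, A) policy actions on states that have demonstrations.
    expert_actions: (B, A) corresponding human driving actions.
    """
    rl_loss = -(log_probs * advantages.detach()).mean()     # REINFORCE-style term
    il_loss = F.mse_loss(pred_actions, expert_actions)      # behavior-cloning regularizer
    return rl_loss + il_weight * il_loss
```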
H3D-DGS: Exploring Heterogeneous 3D Motion Representation for Deformable 3D Gaussian Splatting
Authors:Bing He, Yunuo Chen, Guo Lu, Qi Wang, Qunshan Gu, Rong Xie, Li Song, Wenjun Zhang
Dynamic scene reconstruction poses a persistent challenge in 3D vision. Deformable 3D Gaussian Splatting has emerged as an effective method for this task, offering real-time rendering and high visual fidelity. This approach decomposes a dynamic scene into a static representation in a canonical space and time-varying scene motion. Scene motion is defined as the collective movement of all Gaussian points, and for compactness, existing approaches commonly adopt implicit neural fields or sparse control points. However, these methods predominantly rely on gradient-based optimization for all motion information. Due to the high degree of freedom, they struggle to converge on real-world datasets exhibiting complex motion. To preserve the compactness of motion representation and address convergence challenges, this paper proposes heterogeneous 3D control points, termed \textbf{H3D control points}, whose attributes are obtained using a hybrid strategy combining optical flow back-projection and gradient-based methods. This design decouples directly observable motion components from those that are geometrically occluded. Specifically, components of 3D motion that project onto the image plane are directly acquired via optical flow back projection, while unobservable portions are refined through gradient-based optimization. Experiments on the Neu3DV and CMU-Panoptic datasets demonstrate that our method achieves superior performance over state-of-the-art deformable 3D Gaussian splatting techniques. Remarkably, our method converges within just 100 iterations and achieves a per-frame processing speed of 2 seconds on a single NVIDIA RTX 4070 GPU.
Paper and project links
Summary
H3D-DGS tackles dynamic scene reconstruction with deformable 3D Gaussian Splatting by introducing heterogeneous 3D control points (H3D control points), whose attributes are obtained with a hybrid strategy that combines optical flow back-projection and gradient-based optimization. Motion components that project onto the image plane are acquired directly via optical flow back-projection, while the geometrically occluded components are refined by gradient-based optimization. On the Neu3DV and CMU-Panoptic datasets the method outperforms state-of-the-art deformable 3D Gaussian Splatting techniques, converging within 100 iterations at about 2 seconds per frame on a single NVIDIA RTX 4070 GPU.
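As a rough illustration of the observable half of this hybrid strategy, the sketch below back-projects a control point's optical flow into a 3D displacement parallel to the image plane under a fixed-depth assumption, leaving the along-ray component to gradient-based refinement. The function, the fixed-depth assumption, and the example intrinsics are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def backproject_flow_motion(uv: np.ndarray, depth: float, flow: np.ndarray,
                            K: np.ndarray) -> np.ndarray:
    """Observable part of a control point's 3D motion from 2D optical flow.

    uv:    (2,) pixel location of the control point at frame t.
    depth: its depth along the camera z-axis at frame t.
    flow:  (2,) optical flow (du, dv) from frame t to t+1.
    K:     (3, 3) camera intrinsics.
    Assuming the depth is unchanged over one frame, the flow determines the
    motion parallel to the image plane; the along-ray component cannot be
    observed this way and would be left to gradient-based optimization.
    """
    def unproject(px: np.ndarray, z: float) -> np.ndarray:
        homo = np.array([px[0], px[1], 1.0])
        return z * (np.linalg.inv(K) @ homo)   # camera-space 3D point

    p_t = unproject(uv, depth)
    p_t1 = unproject(uv + flow, depth)         # same depth: in-plane motion only
    return p_t1 - p_t                          # observable 3D displacement

# Example with a simple pinhole intrinsic matrix (hypothetical values).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
delta = backproject_flow_motion(np.array([350.0, 260.0]), 2.0, np.array([3.0, -1.0]), K)
```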