⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution
🔴 Please note: never rely on these summaries in serious academic settings; use them only for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-18
3D Gaussian and Diffusion-Based Gaze Redirection
Authors:Abiram Panchalingam, Indu Bodala, Stuart Middleton
High-fidelity gaze redirection is critical for generating augmented data to improve the generalization of gaze estimators. 3D Gaussian Splatting (3DGS) models like GazeGaussian represent the state-of-the-art but can struggle with rendering subtle, continuous gaze shifts. In this paper, we propose DiT-Gaze, a framework that enhances 3D gaze redirection models using a novel combination of Diffusion Transformer (DiT), weak supervision across gaze angles, and an orthogonality constraint loss. DiT allows higher-fidelity image synthesis, while our weak supervision strategy using synthetically generated intermediate gaze angles provides a smooth manifold of gaze directions during training. The orthogonality constraint loss mathematically enforces the disentanglement of internal representations for gaze, head pose, and expression. Comprehensive experiments show that DiT-Gaze sets a new state-of-the-art in both perceptual quality and redirection accuracy, reducing the state-of-the-art gaze error by 4.1% to 6.353 degrees, providing a superior method for creating synthetic training data. Our code and models will be made available for the research community to benchmark against.
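To make the orthogonality constraint concrete, here is a minimal PyTorch sketch (not the authors' released code) of one common way to penalize overlap between gaze, head-pose, and expression embeddings; the tensor names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(gaze_feat, head_feat, expr_feat, eps=1e-8):
    """Penalize pairwise cosine similarity between factor embeddings.

    Each input has shape (batch, dim); a loss of 0 means the three
    representations are mutually orthogonal on average over the batch.
    """
    feats = [F.normalize(f, dim=-1, eps=eps) for f in (gaze_feat, head_feat, expr_feat)]
    loss = 0.0
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            # squared cosine similarity between factors i and j
            loss = loss + (feats[i] * feats[j]).sum(dim=-1).pow(2).mean()
    return loss

# toy usage with hypothetical 128-d embeddings
g, h, e = (torch.randn(4, 128) for _ in range(3))
print(orthogonality_loss(g, h, e))
```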
Paper and Project Links
Summary
This paper proposes DiT-Gaze, a 3D gaze-redirection framework built around a Diffusion Transformer (DiT). Combined with a weak-supervision strategy and an orthogonality constraint loss, DiT-Gaze improves high-fidelity gaze redirection and addresses the difficulty that 3D Gaussian Splatting (3DGS) models have in rendering subtle, continuous gaze shifts. Experiments show that DiT-Gaze reaches a new state of the art in perceptual quality and redirection accuracy, lowers the gaze error of prior methods, and offers a superior way to create synthetic training data.
Key Takeaways
- High-fidelity gaze redirection is critical for generating augmented data that improves the generalization of gaze estimators.
- Existing 3DGS models such as GazeGaussian are state of the art but can struggle to render subtle, continuous gaze shifts.
- The DiT-Gaze framework enhances 3D gaze-redirection models by combining a Diffusion Transformer (DiT), weak supervision across gaze angles, and an orthogonality constraint loss.
- DiT enables higher-fidelity image synthesis.
- The weak-supervision strategy uses synthetically generated intermediate gaze angles to provide a smooth manifold of gaze directions during training.
- The orthogonality constraint loss mathematically enforces disentangled internal representations for gaze, head pose, and expression.
RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting
Authors:Ruocheng Wu, Haolan He, Yufei Wang, Zhihao Li, Bihan Wen
3D Gaussian Splatting (3DGS) has recently gained great attention in 3D scene representation for its high-quality real-time rendering capabilities. However, when the input comprises sparse training views, 3DGS is prone to overfitting, primarily due to the lack of intermediate-view supervision. Inspired by the recent success of Video Diffusion Models (VDM), we propose a framework called Guidance Score Distillation (GSD) to extract the rich multi-view consistency priors from pretrained VDMs. Building on the insights from Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views, guiding the Gaussian splatting representation towards the generative direction of VDM. However, the generative direction often involves object motion and random camera trajectories, making direct supervision challenging during optimization. To address this problem, we introduce a unified guidance form to correct the noise prediction result of VDM. Specifically, we incorporate both a depth warp guidance based on real depth maps and a guidance based on semantic image features, ensuring that the score update direction from VDM aligns with the correct camera pose and accurate geometry. Experimental results show that our method outperforms existing approaches across multiple datasets.
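As a rough illustration of the score-distillation machinery GSD builds on, the sketch below shows one schematic optimization step in which a guidance term corrects the noise prediction of a video diffusion model before the usual SDS-style gradient is formed. `vdm_noise_pred`, `guidance_correction`, and `alphas_cumprod` are placeholder assumptions, not the paper's API.

```python
import torch

def gsd_step(rendered, vdm_noise_pred, guidance_correction, alphas_cumprod, t):
    """One schematic score-distillation step on rendered neighboring views.

    rendered: (B, C, H, W) images rendered from the 3DGS scene (requires grad).
    """
    noise = torch.randn_like(rendered)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * rendered + (1 - a_t).sqrt() * noise      # forward diffusion
    eps_hat = vdm_noise_pred(noisy, t)                            # VDM noise prediction
    eps_hat = eps_hat + guidance_correction(noisy, t)             # unified guidance correction
    w = 1 - a_t                                                   # common SDS weighting
    grad = w * (eps_hat - noise)
    # Surrogate loss whose gradient w.r.t. `rendered` equals `grad`,
    # so backprop flows into the Gaussian parameters through the renderer.
    return (grad.detach() * rendered).sum()
```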
Paper and Project Links
Summary
3D Gaussian Splatting (3DGS) has drawn wide attention for its real-time rendering in 3D scene representation, but it is prone to overfitting when the training views are sparse. Inspired by the recent success of Video Diffusion Models (VDM), this work proposes Guidance Score Distillation (GSD), a framework that extracts rich multi-view-consistency priors from a pretrained VDM. Building on Score Distillation Sampling (SDS), GSD supervises rendered images from several neighboring views and steers the Gaussian representation toward the VDM's generative direction. Because that direction often involves object motion and random camera trajectories, direct supervision is difficult, so a unified guidance form is introduced to correct the VDM's noise prediction. Guidance based on real depth maps and on semantic image features keeps the score-update direction consistent with the correct camera pose and accurate geometry. Experiments show the method outperforms existing approaches on multiple datasets.
Key Takeaways
- 3DGS offers high-quality real-time rendering for 3D scene representation but overfits under sparse training views.
- The GSD framework addresses this by exploiting multi-view-consistency priors from a pretrained VDM.
- Building on SDS, GSD supervises rendered images from neighboring views and guides the Gaussian representation toward the VDM's generative direction.
- To cope with object motion and random camera trajectories in that generative direction, a unified guidance form corrects the VDM's noise prediction.
- Guidance from real depth maps and semantic image features aligns the score-update direction with the correct camera pose and geometry.
- Experiments show the method outperforms existing approaches on multiple datasets.
Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos
Authors:Zhixin Xu, Hengyu Zhou, Yuan Liu, Wenhan Xue, Hao Pan, Wenping Wang, Bin Wang
Multi-view video reconstruction plays a vital role in computer vision, enabling applications in film production, virtual reality, and motion analysis. While recent advances such as 4D Gaussian Splatting (4DGS) have demonstrated impressive capabilities in dynamic scene reconstruction, they typically rely on the assumption that input video streams are temporally synchronized. However, in real-world scenarios, this assumption often fails due to factors like camera trigger delays or independent recording setups, leading to temporal misalignment across views and reduced reconstruction quality. To address this challenge, a novel temporal alignment strategy is proposed for high-quality 4DGS reconstruction from unsynchronized multi-view videos. Our method features a coarse-to-fine alignment module that estimates and compensates for each camera’s time shift. The method first determines a coarse, frame-level offset and then refines it to achieve sub-frame accuracy. This strategy is designed as a readily integrable module for existing 4DGS frameworks, enhancing their robustness when handling asynchronous data. Experiments show that our approach effectively processes temporally misaligned videos and significantly enhances baseline methods.
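One simple way to realize such a coarse-to-fine time-offset estimate, consistent with the description above but not taken from the paper, is to cross-correlate a per-frame scalar signal from each camera (e.g., mean optical-flow magnitude, an illustrative assumption) and then refine the best integer shift with parabolic interpolation for sub-frame accuracy:

```python
import numpy as np

def estimate_offset(sig_ref, sig_cam, max_shift=30):
    """Estimate the (possibly fractional) frame shift aligning sig_cam to sig_ref."""
    sig_ref = (sig_ref - sig_ref.mean()) / (sig_ref.std() + 1e-8)
    sig_cam = (sig_cam - sig_cam.mean()) / (sig_cam.std() + 1e-8)
    shifts = np.arange(-max_shift, max_shift + 1)
    scores = np.array([np.mean(sig_ref[max_shift:-max_shift] *
                               np.roll(sig_cam, s)[max_shift:-max_shift])
                       for s in shifts])
    k = int(np.argmax(scores))                      # coarse, frame-level offset
    if 0 < k < len(scores) - 1:                     # parabolic sub-frame refinement
        y0, y1, y2 = scores[k - 1], scores[k], scores[k + 1]
        delta = 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2 + 1e-12)
    else:
        delta = 0.0
    return shifts[k] + delta
```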
Paper and Project Links
PDF AAAI 2026
Summary
To address the lack of temporal synchronization that is common in multi-view video capture, a novel temporal alignment strategy is proposed for 4DGS reconstruction. A coarse-to-fine alignment module estimates and compensates each camera's time shift, first determining a coarse frame-level offset and then refining it to sub-frame accuracy. The method can be integrated into existing 4DGS frameworks as a plug-in module, improving their robustness to asynchronous data; it handles temporally misaligned videos effectively and significantly improves baseline methods.
Key Takeaways
- Multi-view video reconstruction is central to computer vision, with applications in film production, virtual reality, and motion analysis.
- Existing methods such as 4D Gaussian Splatting (4DGS) excel at dynamic scene reconstruction but assume temporally synchronized video streams.
- In practice this assumption often fails because of camera trigger delays or independent recording setups, leading to temporal misalignment and degraded reconstruction quality.
- A novel temporal alignment strategy is proposed for high-quality 4DGS reconstruction from unsynchronized multi-view videos.
- The strategy uses a coarse-to-fine alignment module that estimates and compensates each camera's time shift, reaching sub-frame accuracy.
- The method can be integrated into existing 4DGS frameworks, improving their robustness to asynchronous data.
- Experiments show the approach handles temporally misaligned videos effectively and significantly improves baseline methods.
PINGS-X: Physics-Informed Normalized Gaussian Splatting with Axes Alignment for Efficient Super-Resolution of 4D Flow MRI
Authors:Sun Jo, Seok Young Hong, JinHyun Kim, Seungmin Kang, Ahjin Choi, Don-Gwan An, Simon Song, Je Hyeong Hong
4D flow magnetic resonance imaging (MRI) is a reliable, non-invasive approach for estimating blood flow velocities, vital for cardiovascular diagnostics. Unlike conventional MRI focused on anatomical structures, 4D flow MRI requires high spatiotemporal resolution for early detection of critical conditions such as stenosis or aneurysms. However, achieving such resolution typically results in prolonged scan times, creating a trade-off between acquisition speed and prediction accuracy. Recent studies have leveraged physics-informed neural networks (PINNs) for super-resolution of MRI data, but their practical applicability is limited as the prohibitively slow training process must be performed for each patient. To overcome this limitation, we propose PINGS-X, a novel framework modeling high-resolution flow velocities using axes-aligned spatiotemporal Gaussian representations. Inspired by the effectiveness of 3D Gaussian splatting (3DGS) in novel view synthesis, PINGS-X extends this concept through several non-trivial novel innovations: (i) normalized Gaussian splatting with a formal convergence guarantee, (ii) axes-aligned Gaussians that simplify training for high-dimensional data while preserving accuracy and the convergence guarantee, and (iii) a Gaussian merging procedure to prevent degenerate solutions and boost computational efficiency. Experimental results on computational fluid dynamics (CFD) and real 4D flow MRI datasets demonstrate that PINGS-X substantially reduces training time while achieving superior super-resolution accuracy. Our code and datasets are available at https://github.com/SpatialAILab/PINGS-X.
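The basic building block, an axes-aligned (diagonal-covariance) spatiotemporal Gaussian with normalized weights, can be sketched as below. This is a minimal illustration under assumed parameter names, not the released PINGS-X implementation.

```python
import torch

def query_velocity(q, centers, log_scales, values, eps=1e-8):
    """Evaluate a velocity field as a normalized sum of axes-aligned Gaussians.

    q: (N, 4) query points (x, y, z, t); centers, log_scales: (G, 4);
    values: (G, 3) per-Gaussian velocity; returns (N, 3) velocities.
    """
    d = q[:, None, :] - centers[None, :, :]              # (N, G, 4) offsets
    inv_var = torch.exp(-2.0 * log_scales)[None]         # diagonal precisions
    w = torch.exp(-0.5 * (d * d * inv_var).sum(-1))      # (N, G) Gaussian responses
    w = w / (w.sum(-1, keepdim=True) + eps)              # the "normalized" step
    return w @ values                                    # (N, 3) blended velocity
```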
Paper and Project Links
PDF Accepted at AAAI 2026. Supplementary material included after references. 27 pages, 21 figures, 11 tables
Summary
4D flow MRI estimates blood flow velocities and is vital for cardiovascular diagnostics. Unlike conventional MRI, which focuses on anatomical structures, 4D flow MRI needs high spatiotemporal resolution to detect conditions such as stenosis or aneurysms early. Physics-informed neural networks (PINNs) have been used to super-resolve MRI data, but their prohibitively slow training must be repeated for every patient. To overcome this, PINGS-X models high-resolution flow velocities with axes-aligned spatiotemporal Gaussian representations. Inspired by the effectiveness of 3D Gaussian Splatting (3DGS) in novel view synthesis, PINGS-X contributes normalized Gaussian splatting with a formal convergence guarantee, axes-aligned Gaussians that simplify training on high-dimensional data while preserving accuracy and the guarantee, and a Gaussian merging procedure that prevents degenerate solutions and improves computational efficiency. Experiments show that PINGS-X substantially reduces training time while achieving superior super-resolution accuracy.
Key Takeaways
- 4D flow MRI is vital for cardiovascular diagnostics because it estimates blood flow velocities.
- Conventional MRI focuses on anatomical structures, whereas 4D flow MRI needs high spatiotemporal resolution to detect early signs of disease.
- Physics-informed neural networks (PINNs) have been used to super-resolve MRI data, but training is prohibitively slow and must be repeated for every patient.
- PINGS-X models high-resolution flow velocities with axes-aligned spatiotemporal Gaussian representations, extending the idea behind 3D Gaussian Splatting (3DGS).
- Its main innovations, normalized Gaussian splatting, axes-aligned Gaussians, and a Gaussian merging procedure, improve both computational efficiency and accuracy.
BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading
Authors:Jonathan Schmidt, Simon Giebenhain, Matthias Niessner
We introduce BecomingLit, a novel method for reconstructing relightable, high-resolution head avatars that can be rendered from novel viewpoints at interactive rates. To this end, we propose a new low-cost light stage capture setup, tailored specifically towards capturing faces. Using this setup, we collect a novel dataset consisting of diverse multi-view sequences of numerous subjects under varying illumination conditions and facial expressions. By leveraging our new dataset, we introduce a new relightable avatar representation based on 3D Gaussian primitives that we animate with a parametric head model and an expression-dependent dynamics module. We propose a new hybrid neural shading approach, combining a neural diffuse BRDF with an analytical specular term. Our method reconstructs disentangled materials from our dynamic light stage recordings and enables all-frequency relighting of our avatars with both point lights and environment maps. In addition, our avatars can easily be animated and controlled from monocular videos. We validate our approach in extensive experiments on our dataset, where we consistently outperform existing state-of-the-art methods in relighting and reenactment by a significant margin.
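To make the hybrid shading idea concrete, here is a minimal PyTorch sketch: a small MLP predicts the diffuse response while an analytical Blinn-Phong-style term supplies the specular lobe. The network size, its inputs, and the specular model are illustrative assumptions rather than the paper's exact BRDF.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridShader(nn.Module):
    """Neural diffuse term + analytical specular term per shading point."""

    def __init__(self, feat_dim=32):
        super().__init__()
        self.diffuse_mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 64), nn.ReLU(),
            nn.Linear(64, 3), nn.Sigmoid())

    def forward(self, feat, normal, light_dir, view_dir, spec_albedo, shininess=32.0):
        n = F.normalize(normal, dim=-1)
        l = F.normalize(light_dir, dim=-1)
        v = F.normalize(view_dir, dim=-1)
        cos = (n * l).sum(-1, keepdim=True).clamp(min=0.0)
        diffuse = self.diffuse_mlp(torch.cat([feat, l], dim=-1)) * cos   # learned diffuse term
        h = F.normalize(l + v, dim=-1)                                   # half vector
        spec = spec_albedo * (n * h).sum(-1, keepdim=True).clamp(min=0.0) ** shininess
        return diffuse + spec                                            # outgoing radiance
```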
Paper and Project Links
PDF NeurIPS 2025, Project Page: see https://jonathsch.github.io/becominglit/ , YouTube Video: see https://youtu.be/xPyeIqKdszA
Summary
BecomingLit is a new method for reconstructing relightable, high-resolution head avatars that can be rendered from novel viewpoints at interactive rates. The authors build a low-cost light stage capture setup tailored to faces and use it to collect a dataset of diverse multi-view sequences of many subjects under varying illumination and facial expressions. On this data they introduce a relightable avatar representation based on 3D Gaussian primitives, animated with a parametric head model and an expression-dependent dynamics module, together with a hybrid neural shading approach that combines a neural diffuse BRDF with an analytical specular term. The method reconstructs disentangled materials from the dynamic light stage recordings, enables all-frequency relighting with both point lights and environment maps, and lets the avatars be animated and controlled from monocular videos. Extensive experiments show it outperforms existing state-of-the-art methods in relighting and reenactment by a significant margin.
Key Takeaways
- BecomingLit reconstructs high-resolution head avatars that can be rendered from novel viewpoints at interactive rates.
- A low-cost light stage capture setup tailored to faces is proposed.
- A new dataset of multi-view sequences under varying illumination and facial expressions is collected.
- A relightable avatar representation based on 3D Gaussian primitives is introduced.
- A hybrid neural shading approach combines a neural diffuse BRDF with an analytical specular term.
- The method reconstructs disentangled materials from dynamic light stage recordings and supports all-frequency relighting with both point lights and environment maps.
Motion Matters: Compact Gaussian Streaming for Free-Viewpoint Video Reconstruction
Authors:Jiacong Chen, Qingyu Mao, Youneng Bao, Xiandong Meng, Fanyang Meng, Ronggang Wang, Yongsheng Liang
3D Gaussian Splatting (3DGS) has emerged as a high-fidelity and efficient paradigm for online free-viewpoint video (FVV) reconstruction, offering viewers rapid responsiveness and immersive experiences. However, existing online methods face prohibitive storage requirements, primarily because their point-wise modeling fails to exploit motion properties. To address this limitation, we propose a novel Compact Gaussian Streaming (ComGS) framework, leveraging the locality and consistency of motion in dynamic scenes, that models object-consistent Gaussian point motion through a keypoint-driven motion representation. By transmitting only the keypoint attributes, this framework provides a more storage-efficient solution. Specifically, we first identify a sparse set of motion-sensitive keypoints localized within motion regions using a viewspace gradient difference strategy. Equipped with these keypoints, we propose an adaptive motion-driven mechanism that predicts a spatial influence field for propagating keypoint motion to neighboring Gaussian points with similar motion. Moreover, ComGS adopts an error-aware correction strategy for key frame reconstruction that selectively refines erroneous regions and mitigates error accumulation without unnecessary overhead. Overall, ComGS achieves a remarkable storage reduction of over 159X compared to 3DGStream and 14X compared to the SOTA method QUEEN, while maintaining competitive visual fidelity and rendering speed.
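The keypoint-driven motion propagation can be pictured with a short sketch: each Gaussian's displacement is a distance-weighted blend of nearby keypoint displacements. The fixed radial kernel here is an illustrative stand-in for the paper's adaptive, learned influence field.

```python
import torch

def propagate_motion(gauss_xyz, key_xyz, key_delta, radius=0.1):
    """gauss_xyz: (N, 3) Gaussian centers; key_xyz: (K, 3) keypoint positions;
    key_delta: (K, 3) keypoint displacements; returns (N, 3) displacements."""
    d2 = torch.cdist(gauss_xyz, key_xyz).pow(2)           # (N, K) squared distances
    w = torch.exp(-d2 / (2.0 * radius ** 2))              # spatial influence weights
    w = w / (w.sum(-1, keepdim=True) + 1e-8)
    return w @ key_delta                                  # blended per-Gaussian motion
```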
Paper and Project Links
PDF Accepted by NeurIPS 2025
Summary
3DGS is a high-fidelity, efficient paradigm for online free-viewpoint video reconstruction, but existing online methods require prohibitive storage because point-wise modeling fails to exploit motion properties. The proposed Compact Gaussian Streaming (ComGS) framework exploits the locality and consistency of motion in dynamic scenes and models object-consistent Gaussian point motion through a keypoint-driven representation; transmitting only keypoint attributes yields a far more storage-efficient solution. ComGS further applies an error-aware correction strategy for key-frame reconstruction that selectively refines erroneous regions and mitigates error accumulation, while maintaining competitive visual fidelity and rendering speed.
Key Takeaways
- 3DGS enables high-fidelity, efficient online free-viewpoint video reconstruction.
- Existing methods face prohibitive storage requirements because point-wise modeling fails to exploit motion properties.
- The ComGS framework addresses this by leveraging the locality and consistency of motion in dynamic scenes.
- ComGS models object-consistent Gaussian point motion through a keypoint-driven motion representation.
- Only keypoint attributes are transmitted, which greatly improves storage efficiency.
- An error-aware correction strategy selectively refines erroneous regions during key-frame reconstruction and mitigates error accumulation.
MS-Occ: Multi-Stage LiDAR-Camera Fusion for 3D Semantic Occupancy Prediction
Authors:Zhiqiang Wei, Lianqing Zheng, Jianan Liu, Tao Huang, Qing-Long Han, Wenwen Zhang, Fengdeng Zhang
Accurate 3D semantic occupancy perception is essential for autonomous driving in complex environments with diverse and irregular objects. While vision-centric methods suffer from geometric inaccuracies, LiDAR-based approaches often lack rich semantic information. To address these limitations, MS-Occ, a novel multi-stage LiDAR-camera fusion framework which includes middle-stage fusion and late-stage fusion, is proposed, integrating LiDAR’s geometric fidelity with camera-based semantic richness via hierarchical cross-modal fusion. The framework introduces innovations at two critical stages: (1) In the middle-stage feature fusion, the Gaussian-Geo module leverages Gaussian kernel rendering on sparse LiDAR depth maps to enhance 2D image features with dense geometric priors, and the Semantic-Aware module enriches LiDAR voxels with semantic context via deformable cross-attention; (2) In the late-stage voxel fusion, the Adaptive Fusion (AF) module dynamically balances voxel features across modalities, while the High Classification Confidence Voxel Fusion (HCCVF) module resolves semantic inconsistencies using self-attention-based refinement. Experiments on two large-scale benchmarks demonstrate state-of-the-art performance. On nuScenes-OpenOccupancy, MS-Occ achieves an Intersection over Union (IoU) of 32.1% and a mean IoU (mIoU) of 25.3%, surpassing the state-of-the-art by +0.7% IoU and +2.4% mIoU. Furthermore, on the SemanticKITTI benchmark, our method achieves a new state-of-the-art mIoU of 24.08%, robustly validating its generalization capabilities. Ablation studies further confirm the effectiveness of each individual module, highlighting substantial improvements in the perception of small objects and reinforcing the practical value of MS-Occ for safety-critical autonomous driving scenarios.
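A minimal sketch of the idea behind the Gaussian-Geo step, densifying a sparse LiDAR depth map with Gaussian-kernel splatting, is shown below; the kernel size and sigma are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def densify_depth(sparse_depth, kernel_size=7, sigma=2.0):
    """sparse_depth: (1, 1, H, W), zero where no LiDAR return; returns a dense prior."""
    coords = torch.arange(kernel_size).float() - kernel_size // 2
    g1d = torch.exp(-coords ** 2 / (2.0 * sigma ** 2))
    kernel = (g1d[:, None] * g1d[None, :]).view(1, 1, kernel_size, kernel_size)
    valid = (sparse_depth > 0).float()
    num = F.conv2d(sparse_depth, kernel, padding=kernel_size // 2)   # splat sparse depths
    den = F.conv2d(valid, kernel, padding=kernel_size // 2)          # splat validity weights
    return num / den.clamp(min=1e-6)                                 # normalized dense depth prior
```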
Paper and Project Links
PDF 8 pages, 5 figures
Summary
MS-Occ is a novel multi-stage LiDAR-camera fusion framework aimed at autonomous driving in complex environments. By hierarchically fusing the geometric fidelity of LiDAR with the semantic richness of cameras across modalities, it improves accurate 3D semantic occupancy perception of diverse and irregular objects, and it achieves state-of-the-art performance on two large-scale benchmarks.
Key Takeaways
- Autonomous driving in complex environments requires accurate 3D semantic occupancy perception.
- Vision-centric methods suffer from geometric inaccuracies, while LiDAR-only methods lack rich semantic information.
- The MS-Occ framework resolves both limitations with multi-stage fusion, comprising middle-stage and late-stage fusion.
- MS-Occ combines LiDAR's geometric fidelity with camera-based semantic richness.
- The key innovations lie in middle-stage feature fusion and late-stage voxel fusion.
- Experiments show MS-Occ achieves state-of-the-art performance on the nuScenes-OpenOccupancy and SemanticKITTI benchmarks.