⚠️ All of the summaries below are generated by a large language model and may contain errors. They are provided for reference only; use them with caution.
🔴 Please note: never rely on them in serious academic settings. They are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-28
Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering
Authors:Zhifei Li, Feng Qiu, Yiran Wang, Yujing Xia, Kui Xiao, Miao Zhang, Yan Zhang
Visual Question Answering (VQA) presents a unique challenge by requiring models to understand and reason about visual content to answer questions accurately. Existing VQA models often struggle with biases introduced by the training data, leading to over-reliance on superficial patterns and inadequate generalization to diverse questions and images. This paper presents a novel model, IOG-VQA, which integrates Object Interaction Self-Attention and GAN-Based Debiasing to enhance VQA model performance. The self-attention mechanism allows our model to capture complex interactions between objects within an image, providing a more comprehensive understanding of the visual context. Meanwhile, the GAN-based debiasing framework generates unbiased data distributions, helping the model to learn more robust and generalizable features. By leveraging these two components, IOG-VQA effectively combines visual and textual information to address the inherent biases in VQA datasets. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate that our model shows excellent performance compared with the existing methods, particularly in handling biased and imbalanced data distributions, highlighting the importance of addressing both object interactions and dataset biases in advancing VQA tasks. Our code is available at https://github.com/HubuKG/IOG-VQA.
Paper and Project Links
PDF 14 pages, 6 figures. ACCEPTED for publication as a REGULAR paper in the IEEE Transactions on Multimedia 2025
Summary
Visual Question Answering (VQA) requires models to understand and reason about visual content to answer questions accurately, which poses a unique challenge. Existing VQA models often over-rely on superficial patterns because of biases introduced by the training data and generalize poorly to diverse questions and images. This paper proposes a new model, IOG-VQA, which integrates an object-interaction self-attention mechanism and a GAN-based debiasing framework to improve VQA performance. The self-attention mechanism lets the model capture complex interactions between objects in an image, providing a more comprehensive understanding of the visual context. The GAN-based debiasing framework generates unbiased data distributions, helping the model learn more robust and generalizable features. Leveraging these two components, IOG-VQA effectively combines visual and textual information to address the biases inherent in VQA datasets. Experiments on the VQA-CP v1 and VQA-CP v2 datasets show that the model outperforms existing methods, especially on biased and imbalanced data distributions, underscoring the importance of addressing both object interactions and dataset biases when advancing VQA. The code is available at https://github.com/HubuKG/IOG-VQA.
Key Takeaways
- VQA models suffer from biases introduced by the training data, leading to over-reliance on superficial patterns and limited generalization.
- IOG-VQA integrates an object-interaction self-attention mechanism to better capture complex interactions between objects within an image (a minimal sketch of this idea follows after this list).
- A GAN-based debiasing framework generates unbiased data distributions, encouraging the model to learn more robust and generalizable features.
- IOG-VQA combines visual and textual information to address the biases inherent in VQA datasets.
- Experiments on VQA-CP v1 and VQA-CP v2 show that IOG-VQA performs strongly on biased and imbalanced data distributions.
- The IOG-VQA code is publicly available, offering a new solution for VQA research.
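As a rough illustration of the object-interaction self-attention idea only (not the authors' implementation; the module names, feature dimensions, mean pooling, and fusion scheme below are all assumptions), object region features could attend to one another before being fused with a question embedding, e.g. in PyTorch:

```python
import torch
import torch.nn as nn

class ObjectInteractionSelfAttention(nn.Module):
    """Illustrative sketch: detected object features attend to each other,
    and the attended visual context is fused with a question embedding.
    Dimensions and the fusion scheme are assumptions, not the paper's design."""

    def __init__(self, obj_dim=2048, q_dim=1024, hidden=512, num_heads=8):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, hidden)
        self.self_attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.fuse = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())

    def forward(self, obj_feats, q_feat):
        # obj_feats: (B, num_objects, obj_dim) region features, q_feat: (B, q_dim)
        x = self.obj_proj(obj_feats)
        attended, _ = self.self_attn(x, x, x)   # object-object interactions
        visual_ctx = attended.mean(dim=1)       # pool over objects
        joint = torch.cat([visual_ctx, self.q_proj(q_feat)], dim=-1)
        return self.fuse(joint)                 # would feed an answer classifier

feats = torch.randn(4, 36, 2048)   # e.g. 36 detected regions per image
question = torch.randn(4, 1024)
print(ObjectInteractionSelfAttention()(feats, question).shape)  # torch.Size([4, 512])
```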




Adversarially-Refined VQ-GAN with Dense Motion Tokenization for Spatio-Temporal Heatmaps
Authors:Gabriel Maldonado, Narges Rashvand, Armin Danesh Pazho, Ghazal Alinezhad Noghre, Vinit Katariya, Hamed Tabkhi
Continuous human motion understanding remains a core challenge in computer vision due to its high dimensionality and inherent redundancy. Efficient compression and representation are crucial for analyzing complex motion dynamics. In this work, we introduce an adversarially-refined VQ-GAN framework with dense motion tokenization for compressing spatio-temporal heatmaps while preserving the fine-grained traces of human motion. Our approach combines dense motion tokenization with adversarial refinement, which eliminates reconstruction artifacts like motion smearing and temporal misalignment observed in non-adversarial baselines. Our experiments on the CMU Panoptic dataset provide conclusive evidence of our method’s superiority, outperforming the dVAE baseline by 9.31% SSIM and reducing temporal instability by 37.1%. Furthermore, our dense tokenization strategy enables a novel analysis of motion complexity, revealing that 2D motion can be optimally represented with a compact 128-token vocabulary, while 3D motion’s complexity demands a much larger 1024-token codebook for faithful reconstruction. These results establish practical deployment feasibility across diverse motion analysis applications. The code base for this work is available at https://github.com/TeCSAR-UNCC/Pose-Quantization.
Paper and Project Links
Summary
This work introduces an adversarially refined VQ-GAN framework with dense motion tokenization for compressing spatio-temporal heatmaps while preserving the fine-grained traces of human motion. Experiments on the CMU Panoptic dataset show that the method outperforms the dVAE baseline, improving SSIM by 9.31% and reducing temporal instability by 37.1%. The study also reveals the optimal token vocabularies for motion of different complexity, supporting practical deployment across motion analysis applications.
Key Takeaways
- Introduces a VQ-GAN framework that combines dense motion tokenization with adversarial refinement.
- The method compresses spatio-temporal heatmaps while preserving the fine-grained traces of human motion.
- Experiments on the CMU Panoptic dataset outperform the dVAE baseline.
- The method improves SSIM by 9.31% and reduces temporal instability by 37.1%.
- The dense tokenization strategy reveals the optimal token representation for motion of different complexity.
- 2D motion is optimally represented with a compact 128-token vocabulary, while the complexity of 3D motion demands a larger 1024-token codebook for faithful reconstruction (the quantization step is sketched after this list).
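The abstract does not include code, but the vector-quantization bottleneck at the heart of a VQ-GAN-style tokenizer can be sketched as follows. This is a toy illustration under assumed shapes: the heatmap encoder/decoder and the adversarial refinement stage are omitted, and only the 128- vs. 1024-entry codebook sizes come from the paper.

```python
import torch
import torch.nn as nn

class MotionTokenizer(nn.Module):
    """Toy vector-quantization bottleneck in the spirit of a VQ-GAN tokenizer.
    The spatio-temporal heatmap encoder/decoder and the adversarial refinement
    stage are omitted; shapes are illustrative."""

    def __init__(self, codebook_size=128, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):
        # z: (B, T, dim) continuous encoder features over T time steps
        flat = z.reshape(-1, z.size(-1))                           # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)            # distance to every code
        tokens = dists.argmin(dim=-1).view(z.size(0), z.size(1))   # discrete motion tokens
        z_q = self.codebook(tokens)                                # quantized features
        z_q = z + (z_q - z).detach()                               # straight-through estimator
        return z_q, tokens

tok_2d = MotionTokenizer(codebook_size=128)    # compact vocabulary for 2D motion
tok_3d = MotionTokenizer(codebook_size=1024)   # larger codebook for 3D motion
z_q, tokens = tok_2d(torch.randn(2, 100, 64))
print(z_q.shape, tokens.shape)  # torch.Size([2, 100, 64]) torch.Size([2, 100])
```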





Seeing Through Reflections: Advancing 3D Scene Reconstruction in Mirror-Containing Environments with Gaussian Splatting
Authors:Zijing Guo, Yunyang Zhao, Lin Wang
Mirror-containing environments pose unique challenges for 3D reconstruction and novel view synthesis (NVS), as reflective surfaces introduce view-dependent distortions and inconsistencies. While cutting-edge methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) excel in typical scenes, their performance deteriorates in the presence of mirrors. Existing solutions mainly focus on handling mirror surfaces through symmetry mapping but often overlook the rich information carried by mirror reflections. These reflections offer complementary perspectives that can fill in absent details and significantly enhance reconstruction quality. To advance 3D reconstruction in mirror-rich environments, we present MirrorScene3D, a comprehensive dataset featuring diverse indoor scenes, 1256 high-quality images, and annotated mirror masks, providing a benchmark for evaluating reconstruction methods in reflective settings. Building on this, we propose ReflectiveGS, an extension of 3D Gaussian Splatting that utilizes mirror reflections as complementary viewpoints rather than simple symmetry artifacts, enhancing scene geometry and recovering absent details. Experiments on MirrorScene3D show that ReflectiveGaussian outperforms existing methods in SSIM, PSNR, LPIPS, and training speed, setting a new benchmark for 3D reconstruction in mirror-rich environments.
Paper and Project Links
Summary
This paper addresses the challenges that mirrors pose for 3D reconstruction and novel view synthesis. Existing methods such as Neural Radiance Fields and 3D Gaussian Splatting perform well in typical scenes but degrade when mirrors are present. The authors introduce the MirrorScene3D dataset and the ReflectiveGS method, which uses mirror reflections as complementary viewpoints to improve scene geometry and recover missing details, setting a new benchmark for 3D reconstruction in mirror-rich environments.
Key Takeaways
- Mirrors challenge 3D reconstruction and novel view synthesis because reflective surfaces introduce view-dependent distortions and inconsistencies.
- Existing methods mainly handle mirror surfaces through symmetry mapping and overlook the rich information carried by mirror reflections.
- The MirrorScene3D dataset provides diverse indoor-scene images with annotated mirror masks, offering a benchmark for evaluating reconstruction methods in reflective settings.
- ReflectiveGS uses mirror reflections as complementary viewpoints rather than simple symmetry artifacts (see the geometric sketch after this list).
- ReflectiveGS improves scene geometry and the recovery of missing details.
- Experiments on MirrorScene3D show that ReflectiveGS outperforms existing methods in SSIM, PSNR, LPIPS, and training speed.
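One way to read "mirror reflections as complementary viewpoints" is that the mirror plane defines a virtual camera obtained by reflecting the real camera across that plane. The snippet below is only a geometric illustration of that reflection; it is not the ReflectiveGS implementation, and the plane parameters are made up for the example.

```python
import numpy as np

def reflect_point(p, n, d):
    """Reflect a 3D point p across the plane {x : n.x + d = 0} with unit normal n."""
    n = n / np.linalg.norm(n)
    return p - 2.0 * (np.dot(n, p) + d) * n

def reflect_rotation(R, n):
    """Reflect a camera orientation across the same plane (Householder reflection).
    Note the result has determinant -1, so the virtual view is mirror-flipped."""
    n = n / np.linalg.norm(n)
    H = np.eye(3) - 2.0 * np.outer(n, n)
    return H @ R

# Example: a mirror on the plane x = 1 (normal along +x), real camera at the origin.
n, d = np.array([1.0, 0.0, 0.0]), -1.0
virtual_center = reflect_point(np.array([0.0, 0.0, 0.0]), n, d)
print(virtual_center)  # [2. 0. 0.] -> the "virtual" camera behind the mirror
```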





HyRF: Hybrid Radiance Fields for Memory-efficient and High-quality Novel View Synthesis
Authors:Zipeng Wang, Dan Xu
Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful alternative to NeRF-based approaches, enabling real-time, high-quality novel view synthesis through explicit, optimizable 3D Gaussians. However, 3DGS suffers from significant memory overhead due to its reliance on per-Gaussian parameters to model view-dependent effects and anisotropic shapes. While recent works propose compressing 3DGS with neural fields, these methods struggle to capture high-frequency spatial variations in Gaussian properties, leading to degraded reconstruction of fine details. We present Hybrid Radiance Fields (HyRF), a novel scene representation that combines the strengths of explicit Gaussians and neural fields. HyRF decomposes the scene into (1) a compact set of explicit Gaussians storing only critical high-frequency parameters and (2) grid-based neural fields that predict remaining properties. To enhance representational capacity, we introduce a decoupled neural field architecture, separately modeling geometry (scale, opacity, rotation) and view-dependent color. Additionally, we propose a hybrid rendering scheme that composites Gaussian splatting with a neural field-predicted background, addressing limitations in distant scene representation. Experiments demonstrate that HyRF achieves state-of-the-art rendering quality while reducing model size by over 20 times compared to 3DGS and maintaining real-time performance. Our project page is available at https://wzpscott.github.io/hyrf/.
Paper and Project Links
PDF Accepted at NeurIPS 2025
Summary
This paper presents Hybrid Radiance Fields (HyRF), a novel scene representation that combines the strengths of explicit Gaussians and neural fields. HyRF decomposes a scene into a compact set of explicit Gaussians that store only critical high-frequency parameters, and grid-based neural fields that predict the remaining properties. A hybrid rendering scheme composites Gaussian splatting with a neural-field-predicted background, addressing limitations in distant-scene representation. HyRF achieves high-quality rendering while reducing model size by more than 20 times compared to 3DGS and maintaining real-time performance.
Key Takeaways
- 3D Gaussian Splatting (3DGS) is an alternative to NeRF-based approaches that enables real-time, high-quality novel view synthesis, but it suffers from significant memory overhead.
- HyRF combines the strengths of explicit Gaussians and neural fields: critical high-frequency parameters are stored in explicit Gaussians, while the remaining properties are predicted by grid-based neural fields (see the sketch after this list).
- A decoupled neural field architecture separately models geometry (scale, opacity, rotation) and view-dependent color, enhancing representational capacity.
- A hybrid rendering scheme composites Gaussian splatting with a neural-field-predicted background, addressing limitations in distant-scene representation.
- Experiments show that HyRF achieves state-of-the-art rendering quality, reduces model size by more than 20 times compared to 3DGS, and maintains real-time performance.
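A minimal sketch of the hybrid idea, assuming a single dense feature grid and a small per-Gaussian feature vector (the grid layout, dimensions, and heads below are assumptions, and the splatting/rendering pipeline is omitted entirely):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridGaussianField(nn.Module):
    """Sketch of a hybrid representation: explicit per-Gaussian positions plus a
    small per-Gaussian feature, with a dense feature grid whose interpolated
    features feed two decoupled heads, one for geometry (scale, opacity,
    rotation) and one for view-dependent color. Splatting itself is omitted."""

    def __init__(self, num_gaussians=10000, grid_res=64, feat_dim=16):
        super().__init__()
        self.positions = nn.Parameter(torch.rand(num_gaussians, 3) * 2 - 1)  # in [-1, 1]^3
        self.explicit_feat = nn.Parameter(torch.zeros(num_gaussians, feat_dim))
        self.grid = nn.Parameter(torch.zeros(1, feat_dim, grid_res, grid_res, grid_res))
        self.geometry_head = nn.Linear(2 * feat_dim, 3 + 1 + 4)   # scale, opacity, quaternion
        self.color_head = nn.Linear(2 * feat_dim + 3, 3)          # (+ view direction) -> RGB

    def forward(self, view_dir):
        # Trilinearly interpolate the grid at each Gaussian position.
        sample_pts = self.positions.view(1, -1, 1, 1, 3)
        grid_feat = F.grid_sample(self.grid, sample_pts, align_corners=True)
        grid_feat = grid_feat.view(self.grid.shape[1], -1).t()    # (num_gaussians, feat_dim)
        feat = torch.cat([self.explicit_feat, grid_feat], dim=-1)
        geometry = self.geometry_head(feat)
        dirs = view_dir.expand(feat.shape[0], 3)
        color = torch.sigmoid(self.color_head(torch.cat([feat, dirs], dim=-1)))
        return geometry, color

geometry, color = HybridGaussianField()(torch.tensor([[0.0, 0.0, 1.0]]))
print(geometry.shape, color.shape)  # torch.Size([10000, 8]) torch.Size([10000, 3])
```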




GeMix: Conditional GAN-Based Mixup for Improved Medical Image Augmentation
Authors:Hugo Carlesso, Maria Eliza Patulea, Moncef Garouani, Radu Tudor Ionescu, Josiane Mothe
Mixup has become a popular augmentation strategy for image classification, yet its naive pixel-wise interpolation often produces unrealistic images that can hinder learning, particularly in high-stakes medical applications. We propose GeMix, a two-stage framework that replaces heuristic blending with a learned, label-aware interpolation powered by class-conditional GANs. First, a StyleGAN2-ADA generator is trained on the target dataset. During augmentation, we sample two label vectors from Dirichlet priors biased toward different classes and blend them via a Beta-distributed coefficient. Then, we condition the generator on this soft label to synthesize visually coherent images that lie along a continuous class manifold. We benchmark GeMix on the large-scale COVIDx-CT-3 dataset using three backbones (ResNet-50, ResNet-101, EfficientNet-B0). When combined with real data, our method increases macro-F1 over traditional mixup for all backbones, reducing the false negative rate for COVID-19 detection. GeMix is thus a drop-in replacement for pixel-space mixup, delivering stronger regularization and greater semantic fidelity, without disrupting existing training pipelines. We publicly release our code at https://github.com/hugocarlesso/GeMix to foster reproducibility and further research.
Paper and Project Links
PDF Accepted at CBMI 2025
Summary
GeMix is a two-stage augmentation framework built on a StyleGAN2-ADA generator that replaces heuristic pixel-wise mixup with learned, label-aware interpolation. Conditioning the generator on blended soft labels yields visually coherent images lying along a continuous class manifold, delivering stronger regularization and greater semantic fidelity. Experiments on the COVIDx-CT-3 dataset show that, compared with traditional mixup, GeMix improves macro-F1 for all backbones and reduces the false negative rate for COVID-19 detection. GeMix integrates easily into existing training pipelines, and the code is publicly available at https://github.com/hugocarlesso/GeMix.
Key Takeaways
- GeMix is a two-stage framework based on a StyleGAN2-ADA generator for augmentation in image classification.
- It uses label-aware interpolation to synthesize visually coherent images that lie along a continuous class manifold, improving regularization and semantic fidelity (the label-blending step is sketched after this list).
- Experiments on the COVIDx-CT-3 dataset show that GeMix improves macro-F1 and reduces the false negative rate, which is of practical value for COVID-19 detection.
- GeMix serves as a drop-in replacement for pixel-space mixup and integrates easily into existing training pipelines.
- By synthesizing more realistic images, GeMix avoids the unrealistic outputs produced by naive pixel-wise mixup.
- The GeMix source code is publicly released to foster reproducibility and further research.
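The label-blending step described in the abstract can be sketched directly: sample two Dirichlet label vectors biased toward different classes, draw a Beta coefficient, and mix. The concentration, bias, and Beta parameters below are assumptions (the paper does not state them here), and conditioning the StyleGAN2-ADA generator on the resulting soft label is left out.

```python
import torch
from torch.distributions import Beta, Dirichlet

def sample_soft_label(num_classes, class_a, class_b,
                      concentration=0.5, bias=10.0, beta_params=(2.0, 2.0)):
    """GeMix-style soft-label construction (concentration, bias, and Beta
    parameters here are illustrative, not the paper's values)."""
    alpha_a = torch.full((num_classes,), concentration)
    alpha_a[class_a] += bias                  # Dirichlet prior biased toward class_a
    alpha_b = torch.full((num_classes,), concentration)
    alpha_b[class_b] += bias                  # Dirichlet prior biased toward class_b

    y_a = Dirichlet(alpha_a).sample()
    y_b = Dirichlet(alpha_b).sample()
    lam = Beta(*beta_params).sample()         # blending coefficient in (0, 1)
    return lam * y_a + (1.0 - lam) * y_b      # soft label on the probability simplex

soft_label = sample_soft_label(num_classes=3, class_a=0, class_b=2)
print(soft_label, soft_label.sum())           # components sum to 1
```

In the full method, this soft label conditions the class-conditional generator so that the synthesized image lies between the two classes rather than being a pixel-wise blend.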



