Published: 2025-09-28

Updated: 2025-11-27

Word count: 2.1k

Reading time: 8 min

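⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic settings. They are suitable only for screening papers before reading them.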

2025-09-28 Update

Audio-Driven Universal Gaussian Head Avatars

Authors:Kartik Teotia, Helge Rhodin, Mohit Mendiratta, Hyeongwoo Kim, Marc Habermann, Christian Theobalt

We introduce the first method for audio-driven universal photorealistic avatar synthesis, combining a person-agnostic speech model with our novel Universal Head Avatar Prior (UHAP). UHAP is trained on cross-identity multi-view videos. In particular, our UHAP is supervised with neutral scan data, enabling it to capture the identity-specific details at high fidelity. In contrast to previous approaches, which predominantly map audio features to geometric deformations only while ignoring audio-dependent appearance variations, our universal speech model directly maps raw audio inputs into the UHAP latent expression space. This expression space inherently encodes, both, geometric and appearance variations. For efficient personalization to new subjects, we employ a monocular encoder, which enables lightweight regression of dynamic expression variations across video frames. By accounting for these expression-dependent changes, it enables the subsequent model fine-tuning stage to focus exclusively on capturing the subject’s global appearance and geometry. Decoding these audio-driven expression codes via UHAP generates highly realistic avatars with precise lip synchronization and nuanced expressive details, such as eyebrow movement, gaze shifts, and realistic mouth interior appearance as well as motion. Extensive evaluations demonstrate that our method is not only the first generalizable audio-driven avatar model that can account for detailed appearance modeling and rendering, but it also outperforms competing (geometry-only) methods across metrics measuring lip-sync accuracy, quantitative image quality, and perceptual realism.

Paper and project links

PDF (SIGGRAPH Asia 2025) Project page: https://kartik-teotia.github.io/UniGAHA/

Summary

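This paper presents the first audio-driven method for universal photorealistic head avatar synthesis, combining a person-agnostic speech model with a new Universal Head Avatar Prior (UHAP). UHAP is trained on cross-identity multi-view videos and supervised with neutral scan data, which lets it capture identity-specific details at high fidelity. Unlike earlier methods, which mostly map audio features to geometric deformations and ignore audio-dependent appearance changes, the universal speech model maps raw audio directly into UHAP's latent expression space, which encodes both geometric and appearance variations. To personalize efficiently to a new subject, a monocular encoder performs lightweight regression of the dynamic expression variations across video frames. With those expression-dependent changes accounted for, the subsequent fine-tuning stage can focus on the subject's global appearance and geometry. Decoding the audio-driven expression codes through UHAP yields highly realistic avatars with precise lip synchronization and nuanced expressive details such as eyebrow movement, gaze shifts, and a realistic mouth interior in both appearance and motion.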

Key Takeaways

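Introduces an audio-driven method for universal photorealistic head avatar synthesis.
Combines a person-agnostic speech model with a new Universal Head Avatar Prior (UHAP).
UHAP is trained on cross-identity multi-view videos and supervised with neutral scan data.
The universal speech model maps audio directly into the UHAP latent expression space, which covers both geometric and appearance variations.
A monocular encoder provides lightweight regression of dynamic expression variations, which simplifies personalization.
The model generates highly realistic avatars with precise lip synchronization and nuanced expressive details.
The method outperforms geometry-only methods on lip-sync accuracy, image quality, and perceptual realism.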
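
The point that a single expression code drives both geometry and appearance can be illustrated with a toy sketch. This is not the authors' architecture: `ToyExpressionDecoder`, its two random linear heads, the latent size, and the number of Gaussians are all hypothetical, and the expression code is assumed to have been produced already by some speech model.

```python
import random
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def matvec(matrix: List[List[float]], vec: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def to_triples(flat: List[float]) -> List[Vec3]:
    """Group a flat list into (x, y, z) or (r, g, b) triples."""
    return [(flat[i], flat[i + 1], flat[i + 2]) for i in range(0, len(flat), 3)]


def random_head(rows: int, cols: int, rng: random.Random) -> List[List[float]]:
    """Random linear layer standing in for a trained decoder head."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyExpressionDecoder:
    """Hypothetical stand-in for an avatar prior: two linear heads on one latent code."""

    def __init__(self, latent_dim: int, num_gaussians: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.geometry_head = random_head(num_gaussians * 3, latent_dim, rng)
        self.appearance_head = random_head(num_gaussians * 3, latent_dim, rng)

    def decode(self, expression_code: List[float]) -> Tuple[List[Vec3], List[Vec3]]:
        # The same code feeds both heads, so geometry and color change together.
        position_offsets = to_triples(matvec(self.geometry_head, expression_code))
        color_offsets = to_triples(matvec(self.appearance_head, expression_code))
        return position_offsets, color_offsets


decoder = ToyExpressionDecoder(latent_dim=4, num_gaussians=2)
neutral = decoder.decode([0.0, 0.0, 0.0, 0.0])
speaking = decoder.decode([1.0, 0.0, -0.5, 0.2])
```

A geometry-only pipeline would correspond to dropping `appearance_head`, so the color offsets would stay fixed no matter what the audio-derived code says.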

MetaFed: Advancing Privacy, Performance, and Sustainability in Federated Metaverse Systems

Authors:Muhammet Anil Yagiz, Zeynep Sude Cengiz, Polat Goktas

The rapid expansion of immersive Metaverse applications introduces complex challenges at the intersection of performance, privacy, and environmental sustainability. Centralized architectures fall short in addressing these demands, often resulting in elevated energy consumption, latency, and privacy concerns. This paper proposes MetaFed, a decentralized federated learning (FL) framework that enables sustainable and intelligent resource orchestration for Metaverse environments. MetaFed integrates (i) multi-agent reinforcement learning for dynamic client selection, (ii) privacy-preserving FL using homomorphic encryption, and (iii) carbon-aware scheduling aligned with renewable energy availability. Evaluations on MNIST and CIFAR-10 using lightweight ResNet architectures demonstrate that MetaFed achieves up to 25% reduction in carbon emissions compared to conventional approaches, while maintaining high accuracy and minimal communication overhead. These results highlight MetaFed as a scalable solution for building environmentally responsible and privacy-compliant Metaverse infrastructures.

Paper and project links

PDF 2025 IEEE International Symposium on Emerging Metaverse (ISEMV), co-located with the 2025 IEEE/CVF International Conference on Computer Vision (ICCV)

Summary

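The paper examines the performance, privacy, and environmental-sustainability challenges of Metaverse applications and proposes the MetaFed framework. MetaFed is a decentralized federated learning framework that combines multi-agent reinforcement learning for dynamic client selection, homomorphic encryption for privacy-preserving federated learning, and carbon-aware scheduling aligned with renewable energy availability, enabling sustainable and intelligent resource orchestration for Metaverse environments. In the evaluation, MetaFed reduces carbon emissions by up to 25% compared with conventional approaches while maintaining high accuracy and low communication overhead.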
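
A minimal sketch of how carbon-aware client selection could feed a federated aggregation step. It is not MetaFed's implementation: the paper's learned multi-agent selection policy is replaced here by a simple greedy rule, homomorphic encryption is left out entirely, and the `Client` fields (including `renewable_fraction`) are hypothetical. The aggregation is standard sample-weighted federated averaging (FedAvg).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Client:
    name: str
    renewable_fraction: float  # hypothetical: share of power from renewables, in [0, 1]
    num_samples: int           # local training set size, used as the FedAvg weight
    weights: List[float]       # locally trained model parameters (flattened)


def select_clients(clients: List[Client], k: int) -> List[Client]:
    """Greedy stand-in for a learned selection policy: keep the k greenest clients."""
    ranked = sorted(clients, key=lambda c: c.renewable_fraction, reverse=True)
    return ranked[:k]


def fed_avg(selected: List[Client]) -> List[float]:
    """Sample-weighted average of client parameters (FedAvg aggregation)."""
    total = sum(c.num_samples for c in selected)
    dim = len(selected[0].weights)
    return [
        sum(c.weights[i] * c.num_samples for c in selected) / total
        for i in range(dim)
    ]


clients = [
    Client("a", 0.9, 100, [1.0, 2.0]),
    Client("b", 0.2, 300, [5.0, 5.0]),
    Client("c", 0.7, 300, [3.0, 4.0]),
]
chosen = select_clients(clients, k=2)
global_weights = fed_avg(chosen)
print([c.name for c in chosen], global_weights)  # ['a', 'c'] [2.5, 3.5]
```

Client "b" holds as much data as "c" but is skipped in this round because its energy mix is the least renewable. A real scheduler would need to balance that saving against the accuracy cost of ignoring its data.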

Key Takeaways