Published: 2025-10-18

Updated: 2025-11-27

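⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use them in serious academic settings. They are only suitable for initial screening before reading a paper.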

Update of 2025-10-18

Instant Skinned Gaussian Avatars for Web, Mobile and VR Applications

Authors: Naruya Kondo, Yuto Asano, Yoichi Ochiai

We present Instant Skinned Gaussian Avatars, a real-time and cross-platform 3D avatar system. Many approaches have been proposed to animate Gaussian Splatting, but they often require camera arrays, long preprocessing times, or high-end GPUs. Some methods attempt to convert Gaussian Splatting into mesh-based representations, achieving lightweight performance but sacrificing visual fidelity. In contrast, our system efficiently animates Gaussian Splatting by leveraging parallel splat-wise processing to dynamically follow the underlying skinned mesh in real time while preserving high visual fidelity. From smartphone-based 3D scanning to on-device preprocessing, the entire process takes just around five minutes, with the avatar generation step itself completed in only about 30 seconds. Our system enables users to instantly transform their real-world appearance into a 3D avatar, making it ideal for seamless integration with social media and metaverse applications. Website: https://sites.google.com/view/gaussian-vrm

Paper and project links

PDF Accepted to SUI 2025 Demo Track

Summary
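Instant Skinned Gaussian Avatars is a real-time, cross-platform 3D avatar system. It animates Gaussian splats efficiently through parallel splat-wise processing, so the splats follow the underlying skinned mesh in real time while preserving high visual fidelity. The whole pipeline, from smartphone 3D scanning to on-device preprocessing, takes about five minutes, of which avatar generation takes only about 30 seconds.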

Key Takeaways

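Presents Instant Skinned Gaussian Avatars, a real-time, cross-platform 3D avatar system.
Animates Gaussian splats efficiently with parallel splat-wise processing.
The splats follow the skinned mesh in real time while preserving high visual fidelity.
The pipeline is short: about five minutes from smartphone 3D scanning through on-device preprocessing.
Avatar generation itself takes only about 30 seconds.
The system is suited to integration with social media and metaverse applications.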
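
The idea of splats following a skinned mesh can be illustrated with a small sketch. The abstract does not say how splats are attached to the mesh, so the binding used here (one triangle per splat, barycentric weights plus an offset along the triangle normal) is an assumption for illustration, and `SplatBinding`, `splat_center`, and `update_splats` are hypothetical names. Only splat centers are updated; the orientation and scale of each Gaussian are left out.

```python
from dataclasses import dataclass


def add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def scale(a, s):
    return (a[0] * s, a[1] * s, a[2] * s)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a):
    # Assumes a non-zero vector, i.e. a non-degenerate triangle.
    length = (a[0] ** 2 + a[1] ** 2 + a[2] ** 2) ** 0.5
    return scale(a, 1.0 / length)


@dataclass(frozen=True)
class SplatBinding:
    """Hypothetical attachment of one splat to one mesh triangle."""
    triangle: int        # index into the triangle list
    bary: tuple          # barycentric weights (w0, w1, w2), summing to 1
    normal_offset: float  # signed distance from the surface along the normal


def splat_center(binding, vertices, triangles):
    """Recompute one splat center from the current (posed) mesh vertices."""
    i0, i1, i2 = triangles[binding.triangle]
    p0, p1, p2 = vertices[i0], vertices[i1], vertices[i2]
    w0, w1, w2 = binding.bary
    on_surface = add(add(scale(p0, w0), scale(p1, w1)), scale(p2, w2))
    normal = normalize(cross(sub(p1, p0), sub(p2, p0)))
    return add(on_surface, scale(normal, binding.normal_offset))


def update_splats(bindings, vertices, triangles):
    """Each splat depends only on its own triangle, so this loop is
    trivially parallel (for example, one GPU thread per splat)."""
    return [splat_center(b, vertices, triangles) for b in bindings]
```

Calling `update_splats` once per frame with the skinned vertex positions moves every splat with the surface it is bound to, with no dependency between splats.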

HRM^2Avatar: High-Fidelity Real-Time Mobile Avatars from Monocular Phone Scans

Authors: Chao Shi, Shenghao Jia, Jinhui Liu, Yong Zhang, Liangchao Zhu, Zhonglei Yang, Jinze Ma, Chaoyue Niu, Chengfei Lv

We present HRM$^2$Avatar, a framework for creating high-fidelity avatars from monocular phone scans, which can be rendered and animated in real time on mobile devices. Monocular capture with smartphones provides a low-cost alternative to studio-grade multi-camera rigs, making avatar digitization accessible to non-expert users. Reconstructing high-fidelity avatars from single-view video sequences poses challenges due to limited visual and geometric data. To address these limitations, at the data level, our method leverages two types of data captured with smartphones: static pose sequences for texture reconstruction and dynamic motion sequences for learning pose-dependent deformations and lighting changes. At the representation level, we employ a lightweight yet expressive representation to reconstruct high-fidelity digital humans from sparse monocular data. We extract garment meshes from monocular data to model clothing deformations effectively, and attach illumination-aware Gaussians to the mesh surface, enabling high-fidelity rendering and capturing pose-dependent lighting. This representation efficiently learns high-resolution and dynamic information from monocular data, enabling the creation of detailed avatars. At the rendering level, real-time performance is critical for animating high-fidelity avatars in AR/VR, social gaming, and on-device creation. Our GPU-driven rendering pipeline delivers 120 FPS on mobile devices and 90 FPS on standalone VR devices at 2K resolution, over $2.7\times$ faster than representative mobile-engine baselines. Experiments show that HRM$^2$Avatar delivers superior visual realism and real-time interactivity, outperforming state-of-the-art monocular methods.

Paper and project links

PDF SIGGRAPH Asia 2025, Project Page: https://acennr-engine.github.io/HRM2Avatar

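Summary

HRM$^2$Avatar creates high-fidelity avatars from monocular phone scans that can be rendered and animated in real time on mobile devices. Smartphone monocular capture is a low-cost alternative to studio-grade multi-camera rigs and makes avatar digitization accessible to non-expert users, but reconstructing a high-fidelity avatar from single-view video is hard because visual and geometric data are limited. At the data level, the method uses two kinds of smartphone captures: static pose sequences for texture reconstruction, and dynamic motion sequences for learning pose-dependent deformations and lighting changes. At the representation level, it uses a lightweight yet expressive representation: garment meshes extracted from the monocular data model clothing deformation, and illumination-aware Gaussians attached to the mesh surface enable high-fidelity rendering and capture pose-dependent lighting. At the rendering level, where real-time performance matters for AR/VR, social gaming, and on-device creation, the GPU-driven pipeline reaches 120 FPS on mobile devices and 90 FPS on standalone VR devices at 2K resolution, over $2.7\times$ faster than representative mobile-engine baselines. Experiments show better visual realism and real-time interactivity than state-of-the-art monocular methods.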

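Key Takeaways

HRM$^2$Avatar creates high-fidelity avatars from low-cost monocular phone scans.
It combines static and dynamic smartphone captures to compensate for the limited visual and geometric data in monocular video.
A GPU-driven rendering pipeline delivers real-time rendering and animation.
Extracted garment meshes and illumination-aware Gaussians attached to the mesh surface improve realism.
HRM$^2$Avatar outperforms state-of-the-art monocular methods in visual realism and real-time interactivity.
It reaches 120 FPS on mobile devices and 90 FPS on standalone VR devices at 2K resolution.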
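
The quoted frame rates translate into per-frame time budgets with simple arithmetic. The sketch below assumes that "over 2.7 times faster" is a ratio of frame rates at the same resolution; the abstract does not state which device or metric the comparison uses, and the function names are illustrative only.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to render one frame, in milliseconds."""
    return 1000.0 / fps


def implied_baseline_fps(fps: float, speedup: float) -> float:
    """Frame rate of a baseline that is `speedup` times slower.

    Assumed reading of the speedup claim, not a figure from the paper.
    """
    return fps / speedup


for label, fps in (("mobile", 120.0), ("standalone VR", 90.0)):
    print(f"{label}: {frame_budget_ms(fps):.2f} ms per frame")
```

At 120 FPS the renderer has roughly 8.3 ms per frame, and at 90 FPS roughly 11.1 ms. Under the stated assumption, a baseline 2.7 times slower than 120 FPS would run at about 44 FPS.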

MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars

Authors: Felix Taubner, Ruihang Zhang, Mathieu Tuli, Sherwin Bahmani, David B. Lindell

Digital human avatars aim to simulate the dynamic appearance of humans in virtual environments, enabling immersive experiences across gaming, film, virtual reality, and more. However, the conventional process for creating and animating photorealistic human avatars is expensive and time-consuming, requiring large camera capture rigs and significant manual effort from professional 3D artists. With the advent of capable image and video generation models, recent methods enable automatic rendering of realistic animated avatars from a single casually captured reference image of a target subject. While these techniques significantly lower barriers to avatar creation and offer compelling realism, they lack constraints provided by multi-view information or an explicit 3D representation. So, image quality and realism degrade when rendered from viewpoints that deviate strongly from the reference image. Here, we build a video model that generates animatable multi-view videos of digital humans based on a single reference image and target expressions. Our model, MVP4D, is based on a state-of-the-art pre-trained video diffusion model and generates hundreds of frames simultaneously from viewpoints varying by up to 360 degrees around a target subject. We show how to distill the outputs of this model into a 4D avatar that can be rendered in real-time. Our approach significantly improves the realism, temporal consistency, and 3D consistency of generated avatars compared to previous methods.

Paper and project links

PDF 18 pages, 12 figures

Summary
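Digital human avatars simulate the dynamic appearance of people in virtual environments. The conventional way of creating them is expensive and slow, relying on large camera capture rigs and professional 3D artists. Recent methods lower the barrier by automatically rendering realistic animated avatars from a single casually captured reference image, but they lack the constraints of multi-view information or an explicit 3D representation, so image quality and realism degrade at viewpoints far from the reference image. This paper builds MVP4D, a video model that generates animatable multi-view videos of a digital human from a single reference image and target expressions, producing hundreds of frames simultaneously from viewpoints spanning up to 360 degrees around the subject. Its outputs are distilled into a 4D avatar that can be rendered in real time. Compared with previous methods, the approach significantly improves realism, temporal consistency, and 3D consistency.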

Key Takeaways

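Digital human avatars simulate the dynamic appearance of people in virtual environments, with applications in gaming, film, and virtual reality.
Conventionally, creating photorealistic avatars requires large camera capture rigs and significant manual effort from professional 3D artists, which is expensive and time-consuming.
Recent methods render realistic animated avatars automatically from a single reference image, greatly lowering the barrier to creation.
These methods lack multi-view information or an explicit 3D representation, so image quality degrades away from the reference viewpoint.
MVP4D addresses this by generating multi-view videos from a single reference image and target expressions.
Avatars generated with MVP4D improve on previous methods in realism, temporal consistency, and 3D consistency.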

Towards Efficient 3D Gaussian Human Avatar Compression: A Prior-Guided Framework

Authors: Shanzhi Yin, Bolin Chen, Xinju Wu, Ru-Ling Liao, Jie Chen, Shiqi Wang, Yan Ye

This paper proposes an efficient 3D avatar coding framework that leverages compact human priors and canonical-to-target transformation to enable high-quality 3D human avatar video compression at ultra-low bit rates. The framework begins by training a canonical Gaussian avatar using articulated splatting in a network-free manner, which serves as the foundation for avatar appearance modeling. Simultaneously, a human-prior template is employed to capture temporal body movements through compact parametric representations. This decomposition of appearance and temporal evolution minimizes redundancy, enabling efficient compression: the canonical avatar is shared across the sequence, requiring compression only once, while the temporal parameters, consisting of just 94 parameters per frame, are transmitted with minimal bit-rate. For each frame, the target human avatar is generated by deforming canonical avatar via Linear Blend Skinning transformation, facilitating temporal coherent video reconstruction and novel view synthesis. Experimental results demonstrate that the proposed method significantly outperforms conventional 2D/3D codecs and existing learnable dynamic 3D Gaussian splatting compression method in terms of rate-distortion performance on mainstream multi-view human video datasets, paving the way for seamless immersive multimedia experiences in meta-verse applications.

Paper and project links

PDF 10 pages, 4 figures

Summary
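The paper proposes an efficient 3D avatar coding framework that uses compact human priors and a canonical-to-target transformation to compress high-quality 3D human avatar video at ultra-low bit rates. The framework trains a canonical Gaussian avatar with articulated splatting in a network-free manner as the basis for appearance modeling, while a human-prior template captures body motion over time through compact parametric representations. Decomposing appearance from temporal evolution minimizes redundancy: the canonical avatar is shared across the sequence and compressed only once, while the temporal parameters, just 94 per frame, are transmitted at minimal bit rate. Each frame's target avatar is obtained by deforming the canonical avatar with a Linear Blend Skinning transformation, which supports temporally coherent video reconstruction and novel view synthesis. On mainstream multi-view human video datasets, the method significantly outperforms conventional 2D/3D codecs and an existing learnable dynamic 3D Gaussian splatting compression method in rate-distortion performance, paving the way for seamless immersive multimedia experiences in metaverse applications.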

Key Takeaways

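Proposes an efficient 3D avatar coding framework for high-quality video compression at ultra-low bit rates.
A canonical Gaussian avatar, trained with articulated splatting in a network-free manner, serves as the basis for appearance modeling.
A human-prior template captures body motion over time through compact parametric representations.
Appearance and temporal evolution are decomposed to minimize redundancy; the canonical avatar is shared across the sequence and compressed only once.
The temporal parameters comprise only 94 values per frame and are transmitted at minimal bit rate.
Each target avatar is generated with a Linear Blend Skinning transformation, supporting temporally coherent video reconstruction and novel view synthesis.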
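
As a rough illustration of the canonical-to-target step, the sketch below applies Linear Blend Skinning to a single canonical point (for example, a Gaussian center) and estimates the raw size of the per-frame parameters. Only the figure of 94 parameters per frame comes from the abstract. The 3x4 transform layout, the 32-bit parameter width, the 30 FPS frame rate, and all function names are assumptions, and the paper's actual coding of the parameters is not modeled.

```python
def transform_point(matrix, point):
    """Apply a 3x4 rigid transform [R | t] (three rows of four) to a 3D point."""
    x, y, z = point
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3] for row in matrix)


def linear_blend_skinning(point, weights, bone_transforms):
    """Blend the per-bone transformed positions with the skinning weights.

    `weights` are assumed to be non-negative and to sum to 1.
    """
    blended = [0.0, 0.0, 0.0]
    for weight, matrix in zip(weights, bone_transforms):
        moved = transform_point(matrix, point)
        for axis in range(3):
            blended[axis] += weight * moved[axis]
    return tuple(blended)


def raw_pose_bitrate_kbps(params_per_frame=94, bits_per_param=32, fps=30.0):
    """Uncompressed bit rate of the per-frame parameters, in kilobits per second.

    Only params_per_frame=94 is taken from the abstract; the other
    defaults are illustrative assumptions.
    """
    return params_per_frame * bits_per_param * fps / 1000.0
```

Under these assumptions the raw parameter stream is 94 x 32 x 30 = 90,240 bits per second, about 90 kbps before any compression, which gives a sense of why sending pose parameters instead of per-frame Gaussians keeps the rate low.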

PanoLAM: Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image

Authors: Peng Li, Yisheng He, Yingdong Hu, Yuan Dong, Weihao Yuan, Yuan Liu, Siyu Zhu, Gang Cheng, Zilong Dong, Yike Guo

We present a feed-forward framework for Gaussian full-head synthesis from a single unposed image. Unlike previous work that relies on time-consuming GAN inversion and test-time optimization, our framework can reconstruct the Gaussian full-head model given a single unposed image in a single forward pass. This enables fast reconstruction and rendering during inference. To mitigate the lack of large-scale 3D head assets, we propose a large-scale synthetic dataset from trained 3D GANs and train our framework using only synthetic data. For efficient high-fidelity generation, we introduce a coarse-to-fine Gaussian head generation pipeline, where sparse points from the FLAME model interact with the image features by transformer blocks for feature extraction and coarse shape reconstruction, which are then densified for high-fidelity reconstruction. To fully leverage the prior knowledge residing in pretrained 3D GANs for effective reconstruction, we propose a dual-branch framework that effectively aggregates the structured spherical triplane feature and unstructured point-based features for more effective Gaussian head reconstruction. Experimental results show the effectiveness of our framework towards existing work. Project page at: https://panolam.github.io/.

Paper and project links

PDF

Summary
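The paper presents a feed-forward framework for Gaussian full-head synthesis from a single unposed image. It reconstructs the Gaussian full-head model in a single forward pass, enabling fast reconstruction and rendering. To mitigate the lack of large-scale 3D head assets, the authors build a large-scale synthetic dataset from trained 3D GANs and train the framework on synthetic data only. For efficient high-fidelity generation, a coarse-to-fine Gaussian head generation pipeline lets sparse points from the FLAME model interact with image features through transformer blocks for feature extraction and coarse shape reconstruction; the points are then densified for high-fidelity reconstruction. To exploit the prior knowledge in pretrained 3D GANs, a dual-branch framework aggregates structured spherical triplane features and unstructured point-based features for more effective Gaussian head reconstruction.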

Key Takeaways

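Proposes a feed-forward framework for Gaussian full-head synthesis from a single unposed image, enabling fast reconstruction and rendering.
Builds a large-scale synthetic dataset from trained 3D GANs to compensate for the lack of large-scale 3D head assets.
Introduces a coarse-to-fine Gaussian head generation pipeline: feature extraction and coarse shape reconstruction, followed by densification for high-fidelity generation.
Sparse points from the FLAME model interact with image features through transformer blocks.
A dual-branch framework aggregates structured spherical triplane features and unstructured point-based features for more effective Gaussian head reconstruction.
Experiments show the framework is effective compared with existing work.
Project page: https://panolam.github.io/.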

Generative Head-Mounted Camera Captures for Photorealistic Avatars

Authors: Shaojie Bai, Seunghyeon Seo, Yida Wang, Chenghui Li, Owen Wang, Te-Li Wang, Tianyang Ma, Jason Saragih, Shih-En Wei, Nojun Kwak, Hyung Jun Kim

Enabling photorealistic avatar animations in virtual and augmented reality (VR/AR) has been challenging because of the difficulty of obtaining ground truth state of faces. It is physically impossible to obtain synchronized images from head-mounted cameras (HMC) sensing input, which has partial observations in infrared (IR), and an array of outside-in dome cameras, which have full observations that match avatars’ appearance. Prior works relying on analysis-by-synthesis methods could generate accurate ground truth, but suffer from imperfect disentanglement between expression and style in their personalized training. The reliance of extensive paired captures (HMC and dome) for the same subject makes it operationally expensive to collect large-scale datasets, which cannot be reused for different HMC viewpoints and lighting. In this work, we propose a novel generative approach, Generative HMC (GenHMC), that leverages large unpaired HMC captures, which are much easier to collect, to directly generate high-quality synthetic HMC images given any conditioning avatar state from dome captures. We show that our method is able to properly disentangle the input conditioning signal that specifies facial expression and viewpoint, from facial appearance, leading to more accurate ground truth. Furthermore, our method can generalize to unseen identities, removing the reliance on the paired captures. We demonstrate these breakthroughs by both evaluating synthetic HMC images and universal face encoders trained from these new HMC-avatar correspondences, which achieve better data efficiency and state-of-the-art accuracy.

Paper and project links

PDF SIGGRAPH Asia 2025 (ACM Transactions on Graphics (TOG)). Project page: https://shawn615.github.io/genhmc/

Summary

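This paper proposes Generative HMC (GenHMC), a generative approach that uses large unpaired head-mounted camera (HMC) captures to directly generate high-quality synthetic HMC images given any conditioning avatar state from dome captures. The method disentangles the conditioning signal, which specifies facial expression and viewpoint, from facial appearance, yielding more accurate ground truth. It also generalizes to unseen identities, removing the reliance on paired captures. Evaluations of the synthetic HMC images, and of universal face encoders trained on the new HMC-avatar correspondences, show better data efficiency and state-of-the-art accuracy.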

Key Takeaways

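GenHMC uses large unpaired HMC captures, which are much easier to collect than paired HMC and dome captures.
It directly generates high-quality synthetic HMC images from any conditioning avatar state.
It disentangles the conditioning signal for facial expression and viewpoint from facial appearance, improving ground-truth accuracy.
It generalizes to unseen identities without paired captures.
Evaluation of the synthetic HMC images supports the method's effectiveness.
Universal face encoders trained on the new HMC-avatar correspondences achieve better data efficiency and state-of-the-art accuracy.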

AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars

Authors: Tianbao Zhang, Jian Zhao, Yuer Li, Zheng Zhu, Ping Hu, Zhaoxin Fan, Wenjun Wu, Xuelong Li

Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans and enhancing the capabilities of interactive virtual agents, with wide-ranging applications in virtual reality, digital entertainment, and remote communication. Existing approaches often generate audio-driven facial expressions and gestures independently, which introduces a significant limitation: the lack of seamless coordination between facial and gestural elements, resulting in less natural and cohesive animations. To address this limitation, we propose AsynFusion, a novel framework that leverages diffusion transformers to achieve harmonious expression and gesture synthesis. The proposed method is built upon a dual-branch DiT architecture, which enables the parallel generation of facial expressions and gestures. Within the model, we introduce a Cooperative Synchronization Module to facilitate bidirectional feature interaction between the two modalities, and an Asynchronous LCM Sampling strategy to reduce computational overhead while maintaining high-quality outputs. Extensive experiments demonstrate that AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations, consistently outperforming existing methods in both quantitative and qualitative evaluations.

Paper and project links

PDF 15 pages, conference

Summary
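Whole-body audio-driven avatar pose and expression generation is key to creating lifelike digital humans and more capable interactive virtual agents, with applications in virtual reality, digital entertainment, and remote communication. Existing approaches often generate audio-driven facial expressions and gestures independently, so the two are poorly coordinated and the animation looks less natural and cohesive. AsynFusion uses diffusion transformers to synthesize expressions and gestures harmoniously. It is built on a dual-branch DiT architecture that generates facial expressions and gestures in parallel. A Cooperative Synchronization Module provides bidirectional feature interaction between the two modalities, and an Asynchronous LCM Sampling strategy reduces computational overhead while maintaining high-quality outputs. Experiments show state-of-the-art performance in generating real-time, synchronized whole-body animations, outperforming existing methods in both quantitative and qualitative evaluations.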

Key Takeaways