
Talking Head Generation


⚠️ All of the summaries below are generated by a large language model. They may contain errors; use them for reference only, and with caution.
🔴 Note: never rely on them in serious academic settings; they are only for a first-pass screening before actually reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-01-14

Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset

Authors:Yuhong Zhang, Jing Lin, Ailing Zeng, Guanlin Wu, Shunlin Lu, Yurong Fu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang

In this paper, we introduce Motion-X++, a large-scale multimodal 3D expressive whole-body human motion dataset. Existing motion datasets predominantly capture body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions, and are typically limited to lab settings with manually labeled text descriptions, thereby restricting their scalability. To address this issue, we develop a scalable annotation pipeline that can automatically capture 3D whole-body human motion and comprehensive textual labels from RGB videos, and build the Motion-X dataset comprising 81.1K text-motion pairs. Furthermore, we extend Motion-X into Motion-X++ by improving the annotation pipeline, introducing more data modalities, and scaling up the data quantities. Motion-X++ provides 19.5M 3D whole-body pose annotations covering 120.5K motion sequences from massive scenes, 80.8K RGB videos, 45.3K audios, 19.5M frame-level whole-body pose descriptions, and 120.5K sequence-level semantic labels. Comprehensive experiments validate the accuracy of our annotation pipeline and highlight Motion-X++'s significant benefits for generating expressive, precise, and natural motion, with paired multimodal labels supporting several downstream tasks, including text-driven whole-body motion generation, audio-driven motion generation, 3D whole-body human mesh recovery, and 2D whole-body keypoint estimation.
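The Motion-X line of work expresses whole-body pose annotations in the SMPL-X parameterization (body, hands, jaw pose, and facial expression). Purely as a minimal sketch of what one text-motion pair might look like in code, assuming the standard SMPL-X dimensions, here is a hypothetical container; the field names are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TextMotionPair:
    """Hypothetical container for one Motion-X++-style sample.

    Array shapes follow the standard SMPL-X parameterization over
    T frames; the schema itself is illustrative, not official.
    """
    seq_text: str               # sequence-level semantic label
    frame_texts: list[str]      # frame-level whole-body pose descriptions
    global_orient: np.ndarray   # (T, 3)  root orientation, axis-angle
    body_pose: np.ndarray       # (T, 63) 21 body joints x 3
    left_hand_pose: np.ndarray  # (T, 45) 15 hand joints x 3
    right_hand_pose: np.ndarray # (T, 45)
    jaw_pose: np.ndarray        # (T, 3)
    expression: np.ndarray      # (T, 10) facial expression coefficients
    transl: np.ndarray          # (T, 3)  global root translation

    @property
    def num_frames(self) -> int:
        return self.global_orient.shape[0]
```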


Paper and Project Links

PDF 17 pages, 14 figures. This work extends and enhances the research published in the NeurIPS 2023 paper “Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset”. arXiv admin note: substantial text overlap with arXiv:2307.00818

Summary

This paper introduces Motion-X++, a large-scale multimodal 3D expressive whole-body human motion dataset. Existing motion datasets mainly capture body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions, and are usually confined to lab settings with manually labeled text descriptions, which limits their scalability. To address this, the authors develop a scalable annotation pipeline that automatically captures 3D whole-body human motion and detailed textual labels from RGB videos, and build the Motion-X dataset of 81.1K text-motion pairs. They then extend Motion-X into Motion-X++ by improving the annotation pipeline, introducing more data modalities, and scaling up the data volume. Motion-X++ provides comprehensive annotations and supports multiple tasks, such as text-driven and audio-driven whole-body motion generation.

Key Takeaways

  1. Introduces Motion-X++, a new large-scale multimodal dataset.
  2. Existing motion datasets mainly capture body-only poses and lack multimodal information such as facial expressions and hand gestures.
  3. Motion-X++ addresses these limitations, including the problem of automatically capturing multimodal information from RGB videos.
  4. Motion-X++ contains extensive annotations and rich modalities (video, audio, etc.).
  5. The dataset is scalable and supports several downstream tasks, such as text-driven and audio-driven whole-body motion generation.
  6. Experiments validate the accuracy of the annotation pipeline.

Cool Papers

Click here to view paper screenshots

Identity-Preserving Video Dubbing Using Motion Warping

Authors:Runzhen Liu, Qinjie Lin, Yunfei Liu, Lijian Lin, Ye Zhu, Yu Li, Chuhua Xian, Fa-Ting Hong

Video dubbing aims to synthesize realistic, lip-synced videos from a reference video and a driving audio signal. Although existing methods can accurately generate mouth shapes driven by audio, they often fail to preserve identity-specific features, largely because they do not effectively capture the nuanced interplay between audio cues and the visual attributes of the reference identity. As a result, the generated outputs frequently lack fidelity in reproducing the unique textural and structural details of the reference identity. To address these limitations, we propose IPTalker, a novel and robust framework for video dubbing that achieves seamless alignment between the driving audio and the reference identity while ensuring both lip-sync accuracy and high-fidelity identity preservation. At the core of IPTalker is a transformer-based alignment mechanism designed to dynamically capture and model the correspondence between audio features and reference images, thereby enabling precise, identity-aware audio-visual integration. Building on this alignment, a motion warping strategy further refines the results by spatially deforming reference images to match the target audio-driven configuration. A dedicated refinement process then mitigates occlusion artifacts and enhances the preservation of fine-grained textures, such as mouth details and skin features. Extensive qualitative and quantitative evaluations demonstrate that IPTalker consistently outperforms existing approaches in terms of realism, lip synchronization, and identity retention, establishing a new state of the art for high-quality, identity-consistent video dubbing.
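The abstract does not spell out the alignment module, but a transformer-based audio-to-reference alignment is most naturally read as cross-attention with audio features as queries over reference-image features. The PyTorch sketch below illustrates that general idea only; the module name, dimensions, and layer layout are assumptions, not IPTalker's actual architecture:

```python
import torch
import torch.nn as nn

class AudioVisualAligner(nn.Module):
    """Generic cross-attention alignment sketch: audio tokens query
    reference-image tokens. Hyperparameters are placeholders, not
    IPTalker's actual settings."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, audio_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T_audio, dim); ref_feats: (B, N_patches, dim)
        attended, _ = self.cross_attn(audio_feats, ref_feats, ref_feats)
        x = self.norm1(audio_feats + attended)  # residual + norm
        return self.norm2(x + self.ffn(x))      # identity-aware audio tokens
```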


Paper and Project Links

PDF v2, Under Review

Summary

This paper presents a new advance in video dubbing. To address the failure of existing methods to capture the nuanced interplay between audio cues and the visual attributes of the reference identity, it proposes IPTalker, a novel and robust framework. IPTalker achieves seamless alignment between the driving audio and the reference identity while ensuring both accurate lip sync and high-fidelity identity preservation. Its core is a transformer-based alignment mechanism that dynamically captures and models the correspondence between audio features and reference images, enabling precise, identity-aware audio-visual integration. On top of this alignment, a motion warping strategy refines the results by spatially deforming reference images to match the target audio-driven configuration. A dedicated refinement stage then mitigates occlusion artifacts and enhances the preservation of fine-grained textures, such as mouth details and skin features.
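In generic form, the motion-warping step described above is a dense-flow resampling of the reference image. Here is a minimal PyTorch sketch of that operation, assuming the flow is predicted elsewhere and given in pixel offsets; it shows the warping mechanics only, not IPTalker's exact formulation:

```python
import torch
import torch.nn.functional as F

def warp_reference(ref_img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp ref_img (B, C, H, W) with a dense flow field (B, 2, H, W)
    given in pixel offsets. Generic sketch; the flow predictor is omitted."""
    B, _, H, W = ref_img.shape
    # Base sampling grid in grid_sample's normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, H, device=ref_img.device),
        torch.linspace(-1.0, 1.0, W, device=ref_img.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    # Convert pixel offsets to normalized offsets and displace the grid.
    norm_flow = torch.stack(
        (flow[:, 0] / ((W - 1) / 2.0), flow[:, 1] / ((H - 1) / 2.0)), dim=-1
    )
    return F.grid_sample(
        ref_img, base + norm_flow,
        mode="bilinear", padding_mode="border", align_corners=True,
    )
```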

Key Takeaways

  1. Video dubbing aims to synthesize realistic, lip-synced videos from a reference video and a driving audio signal.
  2. Existing methods can accurately generate audio-driven mouth shapes but often fail to preserve identity-specific features.
  3. IPTalker addresses this problem by achieving seamless alignment between the audio and the reference identity.
  4. Its core is a transformer-based alignment mechanism that dynamically captures and models the subtle correspondence between audio and visual features.
  5. A motion warping strategy further refines the results by spatially deforming reference images to match the target audio-driven configuration.
  6. A dedicated refinement stage improves realism and strengthens the preservation of identity features.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!