Published: 2025-11-11

Updated: 2025-11-27


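⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use these summaries in serious academic settings. They are suitable only for initial screening before reading a paper.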

2025-11-11 Update

Dense Motion Captioning

Authors: Shiyao Xu, Benedetta Liberatori, Gül Varol, Paolo Rota

Recent advances in 3D human motion and language integration have primarily focused on text-to-motion generation, leaving the task of motion understanding relatively unexplored. We introduce Dense Motion Captioning, a novel task that aims to temporally localize and caption actions within 3D human motion sequences. Current datasets fall short in providing detailed temporal annotations and predominantly consist of short sequences featuring few actions. To overcome these limitations, we present the Complex Motion Dataset (CompMo), the first large-scale dataset featuring richly annotated, complex motion sequences with precise temporal boundaries. Built through a carefully designed data generation pipeline, CompMo includes 60,000 motion sequences, each composed of multiple actions ranging from at least two to ten, accurately annotated with their temporal extents. We further present DEMO, a model that integrates a large language model with a simple motion adapter, trained to generate dense, temporally grounded captions. Our experiments show that DEMO substantially outperforms existing methods on CompMo as well as on adapted benchmarks, establishing a robust baseline for future research in 3D motion understanding and captioning.

Paper and project links

PDF 12 pages, 5 figures, accepted to 3DV 2026

Summary
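This work introduces Dense Motion Captioning, a new task that temporally localizes and captions the actions within 3D human motion sequences. To address the lack of detailed temporal annotations in existing datasets, the authors present the Complex Motion Dataset (CompMo), which contains richly annotated, complex motion sequences with precise temporal boundaries. They also propose DEMO, a model that combines a large language model with a simple motion adapter to generate dense, temporally grounded captions. Experiments show that DEMO outperforms existing methods on CompMo and on adapted benchmarks, establishing a robust baseline for future research in 3D motion understanding and captioning.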
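
To make the task output concrete, here is a minimal Python sketch of what a dense, temporally grounded caption might look like as data, along with a temporal intersection-over-union score for comparing a predicted segment against a reference one. The `CaptionedSegment` structure, the field names, the use of seconds as the time unit, and the example values are all assumptions made for illustration; the summary above does not specify the annotation format or the evaluation metrics used in the paper.

```python
from dataclasses import dataclass


@dataclass
class CaptionedSegment:
    """One action in a motion sequence (hypothetical structure, not from the paper)."""
    start: float  # assumed unit: seconds
    end: float    # assumed unit: seconds
    caption: str


def temporal_iou(a: CaptionedSegment, b: CaptionedSegment) -> float:
    """Intersection over union of the two segments' time intervals."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


# Made-up values for illustration only.
reference = [
    CaptionedSegment(0.0, 2.0, "a person walks forward"),
    CaptionedSegment(2.0, 5.0, "a person sits down"),
]
predicted = [
    CaptionedSegment(0.0, 2.5, "someone walks ahead"),
    CaptionedSegment(2.5, 5.0, "someone sits on a chair"),
]

# Score each predicted segment against the reference segment at the same index.
ious = [temporal_iou(p, r) for p, r in zip(predicted, reference)]
```

Pairing segments by index is the simplest possible matching and is used here only to keep the sketch short; a real evaluation would need some rule for matching predictions to references when their counts differ.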

Key Takeaways