Published: 2025-11-06

Updated: 2025-11-27


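⚠️ All summaries below were generated by a large language model and may contain errors. Treat them as a reference only and use them with caution.
🔴 Note: do not rely on them in serious academic settings; they are intended only as a first-pass filter before reading the papers.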

2025-11-06 Update

DenseMarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks

Authors: Dmitrii Pozdeev, Alexey Artemov, Ananta R. Bhattarai, Artem Sevastopolsky

We propose DenseMarks - a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking heads videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.

Paper and project links

Project page: https://diddone.github.io/densemarks/ (video: https://youtu.be/o8DOOYFW0gI). 21 pages, 13 figures, 2 tables.

Summary

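This paper introduces DenseMarks, a learned representation for human heads that yields high-quality dense correspondences between head images. A Vision Transformer predicts a 3D embedding for every pixel of a 2D head image, and each embedding corresponds to a location in a canonical 3D unit cube. For training, the authors collect pairwise point matches, estimated by a point tracker over diverse in-the-wild talking-head videos, and guide the mapping with a contrastive loss that pulls the embeddings of matched points together. Multi-task learning with face landmark and segmentation constraints, plus spatial continuity imposed through latent cube features, produces an interpretable and queryable canonical space. The representation can be used to find common semantic parts, to track faces and heads, and for stereo reconstruction. Thanks to this strong supervision, the method is robust to pose variation and covers the whole head, including hair, and the canonical-space bottleneck keeps the representations consistent across poses and individuals. The authors report state-of-the-art results in geometry-aware point matching and monocular head tracking.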
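The contrastive objective described above can be illustrated with a minimal sketch. The source only says "a contrastive loss"; the InfoNCE form, the negative squared distance used as similarity, the `temperature` value, and the function name `info_nce_loss` are all assumptions made for illustration, not the authors' formulation.

```python
import math

def info_nce_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss over N matched points (an assumed loss form).

    emb_a[i] and emb_b[i] are 3D embeddings of the same tracked point in two
    frames; every emb_b[j] with j != i serves as a negative for emb_a[i].
    """
    def neg_sq_dist(p, q):
        # Similarity: closer embeddings get a higher (less negative) score.
        return -sum((x - y) ** 2 for x, y in zip(p, q))

    total = 0.0
    for i, anchor in enumerate(emb_a):
        logits = [neg_sq_dist(anchor, other) / temperature for other in emb_b]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matched point as the correct class.
        total += log_denominator - logits[i]
    return total / len(emb_a)

# Hypothetical embeddings inside the unit cube.
frame_1 = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (0.1, 0.9, 0.5)]
well_matched = [(0.12, 0.1, 0.1), (0.9, 0.88, 0.9), (0.1, 0.9, 0.52)]
mismatched = [well_matched[1], well_matched[2], well_matched[0]]
loss_good = info_nce_loss(frame_1, well_matched)
loss_bad = info_nce_loss(frame_1, mismatched)
```

In this toy setup the loss is lower when matched points already have nearby embeddings, which is the behaviour the training signal is meant to encourage.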

Key Takeaways

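Introduces DenseMarks, a new learned representation for human heads that enables high-quality dense correspondences.
A Vision Transformer predicts a 3D embedding for each pixel, corresponding to a location in a canonical 3D unit cube.
The network is trained on a collected dataset of pairwise point matches with a contrastive loss.
Multi-task learning with face landmark and segmentation constraints is added to improve the model.
Latent cube features impose spatial continuity on the embeddings, giving an interpretable and queryable canonical space.
The representation supports tasks such as face/head tracking and stereo reconstruction, and is robust to pose variation.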
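As a rough illustration of what a "queryable" canonical space allows, the sketch below matches points between two images by nearest neighbour in embedding space. The brute-force Euclidean search, the function name `match_points`, and the sample embeddings are assumptions for illustration; the source does not describe the authors' matching procedure.

```python
def match_points(query_embs, target_embs):
    """Return, for each query embedding, the index of its nearest target embedding.

    Embeddings are (x, y, z) tuples in the canonical unit cube. Brute-force
    search is an illustrative choice, not the authors' procedure.
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [
        min(range(len(target_embs)), key=lambda j: sq_dist(q, target_embs[j]))
        for q in query_embs
    ]

# Hypothetical per-pixel embeddings sampled from two head images.
image_a = [(0.20, 0.50, 0.90), (0.80, 0.10, 0.30)]
image_b = [(0.79, 0.12, 0.31), (0.50, 0.50, 0.50), (0.22, 0.48, 0.88)]
matches = match_points(image_a, image_b)
```

Because both images are mapped into the same cube, the lookup needs no image-specific features; for dense images a spatial index would replace the quadratic search.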


Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark

Authors: Rajmund Nagy, Hendric Voss, Thanh Hoang-Minh, Mihail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, M. Hamza Mughal, Rishabh Dabral, Kiran Chhatre, Christian Theobalt, Libin Liu, Stefan Kopp, Rachel McDonnell, Michael Neff, Taras Kucherenko, Youngwoo Yoon, Gustav Eje Henter

We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models – each trained by its original authors – across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies – enabling new evaluations without model reimplementation required – alongside our open-source rendering script, and the 16,000 pairwise human preference votes collected for our benchmark.

Paper and project links

23 pages, 10 figures. The last two authors made equal contributions.

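Summary

This paper reviews human evaluation practices in automated, speech-driven 3D gesture generation and finds a lack of standardisation and frequent use of flawed experimental setups. To address common shortcomings in evaluation design and to standardise future user studies, the authors introduce a detailed human evaluation protocol for the widely used BEAT2 motion-capture dataset. Using it, they run a large-scale crowdsourced evaluation that ranks six recent gesture-generation models along two dimensions: motion realism and speech-gesture alignment. The results indicate that newer models do not consistently outperform earlier approaches, that published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation, and that accurate benchmarking requires assessing motion quality and multimodal alignment separately. To drive standardisation and enable new evaluation research, the authors will release five hours of synthetic motion, more than 750 rendered video stimuli from the user studies, and the pairwise human preference votes collected for the benchmark.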
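One common way to turn pairwise preference votes into a ranking is the Bradley-Terry model, sketched below. The summary does not say how the authors aggregate their votes, so this is a stand-in technique rather than their method; the function name `bradley_terry`, the model labels, and the vote counts are hypothetical.

```python
from collections import defaultdict

def bradley_terry(votes, iterations=200):
    """Estimate a normalised strength per model from (winner, loser) pairs.

    Uses the classic iterative update
        p_i = wins_i / sum_j n_ij / (p_i + p_j)
    followed by renormalisation. A model with zero wins ends up at 0.
    """
    wins = defaultdict(int)         # total wins per model
    pair_counts = defaultdict(int)  # comparisons per unordered pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0)
                if n:
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength

# Hypothetical votes: each tuple is (preferred model, other model).
votes = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 9 + [("C", "A")] * 1
)
scores = bradley_terry(votes)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Running the same procedure separately on motion-realism votes and on speech-gesture alignment votes would give one ranking per dimension, in line with the disentangled assessment the paper argues for.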

Key Takeaways: