Published: 2025-05-14

Updated: 2025-05-26

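⚠️ All summaries below were generated by a large language model and may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for initial screening before reading the papers.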

Updated 2025-05-14

VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback

Authors: Eason Chen, Chenyu Lin, Yu-Kai Huang, Xinyi Tang, Aprille Xi, Jionghao Lin, Kenneth Koedinger

Pedagogical Agents (PAs) show significant potential for boosting student engagement and learning outcomes by providing adaptive, on-demand support in educational contexts. However, existing PA solutions are often hampered by pre-scripted dialogue, unnatural animations, uncanny visual realism, and high development costs. To address these gaps, we introduce VTutor, an open-source SDK leveraging lightweight WebGL, Unity, and JavaScript frameworks. VTutor receives text outputs from a large language model (LLM), converts them into audio via text-to-speech, and then renders a real-time, lip-synced pedagogical agent (PA) for immediate, large-scale deployment on web-based learning platforms. By providing on-demand, personalized feedback, VTutor strengthens students’ motivation and deepens their engagement with instructional material. Using an anime-like aesthetic, VTutor alleviates the uncanny valley effect, allowing learners to engage with expressive yet comfortably stylized characters. Our evaluation with 50 participants revealed that VTutor significantly outperforms the existing talking-head approaches (e.g., SadTalker) on perceived synchronization accuracy, naturalness, emotional expressiveness, and overall preference. As an open-source project, VTutor welcomes community-driven contributions - from novel character designs to specialized showcases of pedagogical agent applications - that fuel ongoing innovation in AI-enhanced education. By providing an accessible, customizable, and learner-centered PA solution, VTutor aims to elevate human-AI interaction experience in education fields, ultimately broadening the impact of AI in learning contexts. The demo link to VTutor is at https://vtutor-aied25.vercel.app.

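Summary

Pedagogical agents (PAs) show great potential in education: by offering adaptive, on-demand support they can raise student engagement and improve learning outcomes. Existing PA solutions, however, are often limited by pre-scripted dialogue, unnatural animation, uncanny visual realism, and high development costs. To address these problems, the authors introduce VTutor, an open-source SDK built on lightweight WebGL, Unity, and JavaScript frameworks. VTutor takes the text output of a large language model, converts it to audio via text-to-speech, and renders a real-time, lip-synced PA that can be deployed immediately and at scale on web-based learning platforms. By providing personalized, on-demand feedback, VTutor strengthens students' motivation and deepens their engagement with instructional material. Its anime-like aesthetic mitigates the uncanny valley effect, letting learners interact with expressive yet comfortably stylized characters. An evaluation with 50 participants found that VTutor significantly outperforms existing talking-head approaches (e.g., SadTalker) on perceived synchronization accuracy, naturalness, emotional expressiveness, and overall preference. As an open-source project, VTutor welcomes community contributions, from new character designs to specialized showcases of PA applications, to drive continued innovation in AI-enhanced education. By offering an accessible, customizable, learner-centered PA solution, VTutor aims to improve human-AI interaction in education and ultimately broaden the impact of AI in learning contexts. A demo is available at https://vtutor-aied25.vercel.app.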

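Key Takeaways

PAs have the potential to raise student engagement and improve learning outcomes.
Current PA solutions are limited by pre-scripted dialogue, unnatural animation, and similar issues.
VTutor is a new open-source SDK designed to address the problems of existing PAs.
VTutor takes text output from a large language model and turns it into audio and a real-time, lip-synced animated pedagogical agent.
VTutor provides personalized, on-demand feedback that can increase students' motivation and engagement.
VTutor's anime-style design reduces the uncanny valley effect and helps learners stay engaged.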
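
The summary above does not say how VTutor derives mouth motion from the synthesized speech. As a rough illustration of what driving a lip-synced character from audio can involve, the sketch below uses a simple amplitude-envelope approach: one mouth-openness value per video frame, taken from the RMS loudness of that frame's audio window. This is a swapped-in, generic technique and not VTutor's actual method; the function name `mouth_openness` and its parameters are hypothetical.

```python
import math
from typing import List, Sequence


def mouth_openness(samples: Sequence[float], sample_rate: int, fps: int = 30) -> List[float]:
    """Return one mouth-openness value in [0, 1] per video frame.

    ASSUMPTION (illustration only, not VTutor's method): openness is the RMS
    amplitude of the audio inside each frame's window, normalised by the
    loudest frame in the clip.
    """
    window = max(1, sample_rate // fps)  # audio samples per video frame
    rms: List[float] = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        rms.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))

    peak = max(rms, default=0.0)
    if peak == 0.0:
        # Silence (or an empty clip): keep the mouth closed on every frame.
        return [0.0 for _ in rms]
    return [r / peak for r in rms]
```

A renderer could feed each returned value into a character's mouth blend shape once per frame. Production lip sync usually maps phonemes to visemes instead, which this sketch does not attempt.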

PAHA: Parts-Aware Audio-Driven Human Animation with Diffusion Model

Authors: S. Z. Zhou, Y. B. Wang, J. F. Wu, T. Hu, J. N. Zhang, Z. J. Li, Y. Liu

Audio-driven human animation technology is widely used in human-computer interaction, and the emergence of diffusion models has further advanced its development. Currently, most methods rely on multi-stage generation and intermediate representations, resulting in long inference time and issues with generation quality in specific foreground regions and audio-motion consistency. These shortcomings are primarily due to the lack of localized fine-grained supervised guidance. To address above challenges, we propose PAHA, an end-to-end audio-driven upper-body human animation framework with diffusion model. We introduce two key methods: Parts-Aware Re-weighting (PAR) and Parts Consistency Enhancement (PCE). PAR dynamically adjusts regional training loss weights based on pose confidence scores, effectively improving visual quality. PCE constructs and trains diffusion-based regional audio-visual classifiers to improve the consistency of motion and co-speech audio. Afterwards, we design two novel inference guidance methods for the foregoing classifiers, Sequential Guidance (SG) and Differential Guidance (DG), to balance efficiency and quality respectively. Additionally, we build CNAS, the first public Chinese News Anchor Speech dataset, to advance research and validation in this field. Extensive experimental results and user studies demonstrate that PAHA significantly outperforms existing methods in audio-motion alignment and video-related evaluations. The codes and CNAS dataset will be released upon acceptance.

Summary

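This paper reviews the current state and challenges of audio-driven human animation and proposes PAHA, an end-to-end audio-driven upper-body human animation framework. It introduces two key methods: Parts-Aware Re-weighting (PAR) and Parts Consistency Enhancement (PCE). By dynamically adjusting regional training loss weights and improving the consistency between motion and audio, PAHA improves both visual quality and audio-motion consistency. The authors also design two new inference guidance methods and build CNAS, the first public Chinese News Anchor Speech dataset.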

Key Takeaways

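Audio-driven human animation is widely used in human-computer interaction, and the emergence of diffusion models has further advanced it.
Current methods suffer from long inference times and from quality problems in specific foreground regions and in audio-motion consistency, mainly because they lack localized, fine-grained supervision.
PAHA is proposed as an end-to-end, audio-driven upper-body human animation framework based on a diffusion model.
Two key methods are introduced: PAR, which dynamically adjusts regional training loss weights based on pose confidence scores, and PCE, which trains diffusion-based regional audio-visual classifiers to improve the consistency between motion and co-speech audio.
Two inference guidance methods, Sequential Guidance (SG) and Differential Guidance (DG), are designed to balance efficiency and quality.
CNAS, the first public Chinese News Anchor Speech dataset, is built to support research and validation in this field.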
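
To make the Parts-Aware Re-weighting idea in the takeaways above more concrete, here is a minimal sketch of a loss whose per-region weights depend on pose confidence scores. The abstract does not give the actual formula, so everything here is an assumption for illustration: the linear rule `base + gain * confidence`, the direction of the weighting (higher confidence gets more weight), the region names, and the function names are all hypothetical.

```python
from typing import Dict


def region_weights(confidence: Dict[str, float],
                   base: float = 1.0,
                   gain: float = 1.0) -> Dict[str, float]:
    """Map per-region pose confidence in [0, 1] to a loss weight.

    ASSUMPTION: weight = base + gain * confidence. The paper's real
    weighting rule is not stated in the abstract and may differ.
    """
    return {name: base + gain * c for name, c in confidence.items()}


def reweighted_loss(region_losses: Dict[str, float],
                    confidence: Dict[str, float],
                    base: float = 1.0,
                    gain: float = 1.0) -> float:
    """Weighted mean of per-region losses, normalised by the total weight."""
    weights = region_weights(confidence, base, gain)
    total_weight = sum(weights[r] for r in region_losses)
    weighted_sum = sum(weights[r] * region_losses[r] for r in region_losses)
    return weighted_sum / total_weight


# Hypothetical per-region losses and pose confidences for one training sample.
example_losses = {"face": 0.2, "hands": 0.6, "body": 0.1}
example_confidence = {"face": 1.0, "hands": 0.0, "body": 0.5}
example_value = reweighted_loss(example_losses, example_confidence)
```

With equal confidence in every region the function reduces to a plain mean of the regional losses, so the re-weighting only changes training when confidence differs across body parts.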

OT-Talk: Animating 3D Talking Head with Optimal Transportation

Authors: Xinmu Wang, Xiang Gao, Xiyun Song, Heather Yu, Zongfang Lin, Liang Peng, Xianfeng Gu

Animating 3D head meshes using audio inputs has significant applications in AR/VR, gaming, and entertainment through 3D avatars. However, bridging the modality gap between speech signals and facial dynamics remains a challenge, often resulting in incorrect lip syncing and unnatural facial movements. To address this, we propose OT-Talk, the first approach to leverage optimal transportation to optimize the learning model in talking head animation. Building on existing learning frameworks, we utilize a pre-trained Hubert model to extract audio features and a transformer model to process temporal sequences. Unlike previous methods that focus solely on vertex coordinates or displacements, we introduce Chebyshev Graph Convolution to extract geometric features from triangulated meshes. To measure mesh dissimilarities, we go beyond traditional mesh reconstruction errors and velocity differences between adjacent frames. Instead, we represent meshes as probability measures and approximate their surfaces. This allows us to leverage the sliced Wasserstein distance for modeling mesh variations. This approach facilitates the learning of smooth and accurate facial motions, resulting in coherent and natural facial animations. Our experiments on two public audio-mesh datasets demonstrate that our method outperforms state-of-the-art techniques both quantitatively and qualitatively in terms of mesh reconstruction accuracy and temporal alignment. In addition, we conducted a user perception study with 20 volunteers to further assess the effectiveness of our approach.

Summary

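Audio-driven 3D head animation has broad applications in AR/VR, gaming, and entertainment. Bridging the modality gap between speech signals and facial dynamics remains a challenge, however, and often leads to incorrect lip syncing and unnatural facial movements. To address this, the paper proposes OT-Talk, the first approach to use optimal transportation to optimize the learning model in talking-head animation. Building on existing learning frameworks, it uses a pre-trained Hubert model to extract audio features and a transformer model to process temporal sequences. Unlike earlier methods that focus only on vertex coordinates or displacements, it introduces Chebyshev Graph Convolution to extract geometric features from triangulated meshes. To measure mesh dissimilarity, it goes beyond traditional mesh reconstruction error and the velocity difference between adjacent frames: meshes are represented as probability measures that approximate their surfaces, which makes it possible to use the sliced Wasserstein distance to model mesh variation. This helps the model learn smooth, accurate facial motion and produce coherent, natural facial animation. Experiments on two public audio-mesh datasets show that the method outperforms state-of-the-art techniques in mesh reconstruction accuracy and temporal alignment. A user perception study with 20 volunteers further assesses its effectiveness.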
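
For readers unfamiliar with the sliced Wasserstein distance mentioned above, the following is a minimal sketch of a Monte Carlo estimate between two point sets. It assumes, for simplicity, that each mesh is reduced to its vertices with uniform weights and that both sets have the same number of points. The paper represents meshes as probability measures approximating their surfaces, which this sketch does not reproduce, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Draw a direction uniformly on the unit sphere via normalised Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sliced_wasserstein(xs: Sequence[Point], ys: Sequence[Point],
                       n_projections: int = 64, seed: int = 0) -> float:
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    ASSUMPTION: both point sets have equal size and uniform weights, so each
    1D optimal transport problem is solved by sorting the projections.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("point sets must be non-empty and of equal size")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        d = random_unit_vector(dim, rng)
        # Project both sets onto the direction and sort: in 1D the optimal
        # coupling matches the i-th smallest value of one set to the other's.
        px = sorted(sum(a * b for a, b in zip(p, d)) for p in xs)
        py = sorted(sum(a * b for a, b in zip(p, d)) for p in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return math.sqrt(total / n_projections)
```

Because each slice only needs a sort, the cost is O(n log n) per projection. This low cost is the usual reason to prefer the sliced variant over solving a full optimal transport problem between two large point sets.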

Key Takeaways