TTS

Published: 2025-03-05

Updated: 2025-05-14

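⚠️ All summaries below were generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not rely on them in serious academic settings. They are intended only for initial screening before reading a paper.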

2025-03-05 update

CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR

Authors:Kadir Burak Buldu, Süleyman Özdel, Ka Hei Carrie Lau, Mengdi Wang, Daniel Saad, Sofie Schönborn, Auxane Boch, Enkelejda Kasneci, Efe Bozkir

Recent developments in computer graphics, machine learning, and sensor technologies enable numerous opportunities for extended reality (XR) setups for everyday life, from skills training to entertainment. With large corporations offering affordable consumer-grade head-mounted displays (HMDs), XR will likely become pervasive, and HMDs will develop as personal devices like smartphones and tablets. However, having intelligent spaces and naturalistic interactions in XR is as important as technological advances so that users grow their engagement in virtual and augmented spaces. To this end, large language model (LLM)–powered non-player characters (NPCs) with speech-to-text (STT) and text-to-speech (TTS) models bring significant advantages over conventional or pre-scripted NPCs for facilitating more natural conversational user interfaces (CUIs) in XR. This paper provides the community with an open-source, customizable, extendable, and privacy-aware Unity package, CUIfy, that facilitates speech-based NPC-user interaction with widely used LLMs, STT, and TTS models. Our package also supports multiple LLM-powered NPCs per environment and minimizes latency between different computational models through streaming to achieve usable interactions between users and NPCs. We publish our source code in the following repository: https://gitlab.lrz.de/hctl/cuify

Paper and project links

PDF 7th IEEE International Conference on Artificial Intelligence & eXtended and Virtual Reality (IEEE AIxVR 2025)

Summary

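Recent advances in computer graphics, machine learning, and sensor technologies open up numerous opportunities for extended reality (XR) in everyday life, from skills training to entertainment. As large companies offer affordable consumer-grade head-mounted displays (HMDs), XR is likely to become pervasive, with HMDs turning into personal devices like smartphones and tablets. To bring intelligent spaces and naturalistic interaction to XR, NPCs powered by large language models (LLMs) and combined with speech-to-text (STT) and text-to-speech (TTS) models offer more natural conversational user interfaces than conventional, pre-scripted NPCs. The paper presents CUIfy, an open-source, customizable, extendable, and privacy-aware Unity package that facilitates speech-based NPC-user interaction with widely used LLM, STT, and TTS models. The package also supports multiple LLM-powered NPCs per environment and uses streaming to minimize latency between the different computational models, so that interactions between users and NPCs remain usable. The source code is available at https://gitlab.lrz.de/hctl/cuify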

Key Takeaways

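The paper's seven key points are:

Recent advances in computer graphics, machine learning, and sensor technologies create opportunities for extended reality (XR) in everyday life.
Head-mounted displays (HMDs) may become personal devices like smartphones and tablets.
NPCs powered by large language models (LLMs) and combined with speech-to-text (STT) and text-to-speech (TTS) models make XR interaction more natural.
Compared with conventional or pre-scripted NPCs, LLM-powered NPCs are more flexible.
The paper provides CUIfy, an open-source package that facilitates speech-based interaction between NPCs and users.
CUIfy supports multiple LLM-powered NPCs running in the same environment.
The package uses streaming to reduce latency between the different computational models, which improves the user experience.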
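
The streaming idea in the last takeaway can be illustrated with a minimal Python sketch. It is not CUIfy's implementation (CUIfy is a Unity package, and its internals are not described here). It only shows one common way to cut latency in an STT, LLM, TTS chain: hand each finished sentence to the TTS stage while the LLM is still generating. All names below (`sentences_from_tokens`, `fake_llm_stream`, `fake_tts`, `speak_streaming`) are hypothetical, and the sentence splitting is deliberately naive.

```python
from typing import Iterable, Iterator

# Characters treated as sentence boundaries (a simplifying assumption).
SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into sentences, yielding each one as soon as it ends."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends without final punctuation.
    if buffer.strip():
        yield buffer.strip()


def fake_llm_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    yield from ["Hello", " there", ".", " How", " can", " I", " help", "?", " Bye"]


def fake_tts(sentence: str) -> str:
    """Hypothetical stand-in for a TTS call; returns a label instead of audio."""
    return f"<audio:{sentence}>"


def speak_streaming(tokens: Iterable[str]) -> Iterator[str]:
    """Synthesize sentence by sentence instead of waiting for the full reply."""
    for sentence in sentences_from_tokens(tokens):
        yield fake_tts(sentence)


if __name__ == "__main__":
    for clip in speak_streaming(fake_llm_stream()):
        print(clip)
```

Because every stage is a generator, the first audio clip can be produced after the first sentence rather than after the whole reply. A real system would need a more careful boundary rule, for example to handle abbreviations and decimal numbers.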

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

Authors:Siyin Wang, Wenyi Yu, Yudong Yang, Changli Tang, Yixuan Li, Jimin Zhuang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Chao Zhang

Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM) etc., which can be challenging to cover using one small model designed for a single task. In this paper, we propose leveraging recently introduced auditory large language models (LLMs) for automatic speech quality assessment. By employing task-specific prompts, auditory LLMs are finetuned to predict MOS, SIM and A/B testing results, which are commonly used for evaluating text-to-speech systems. Additionally, the finetuned auditory LLM is able to generate natural language descriptions assessing aspects like noisiness, distortion, discontinuity, and overall quality, providing more interpretable outputs. Extensive experiments have been performed on the NISQA, BVCC, SOMOS and VoxSim speech quality datasets, using open-source auditory LLMs such as SALMONN, Qwen-Audio, and Qwen2-Audio. For the natural language descriptions task, a commercial model Google Gemini 1.5 Pro is also evaluated. The results demonstrate that auditory LLMs achieve competitive performance compared to state-of-the-art task-specific small models in predicting MOS and SIM, while also delivering promising results in A/B testing and natural language descriptions. Our data processing scripts and finetuned model checkpoints can be found at https://github.com/bytedance/SALMONN.

Paper and project links

PDF Accepted by ICASSP 2025

Summary
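This paper proposes using recently introduced auditory large language models (LLMs) for automatic speech quality assessment. With task-specific prompts, auditory LLMs are finetuned to predict mean opinion score (MOS), speaker similarity (SIM), and A/B testing results, which are commonly used to evaluate text-to-speech systems. The finetuned auditory LLM can also generate natural language descriptions that assess aspects such as noisiness, distortion, and discontinuity, giving more interpretable outputs. Experiments show that auditory LLMs perform competitively with state-of-the-art task-specific small models in predicting MOS and SIM, and deliver promising results in A/B testing and natural language descriptions.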
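
As a rough illustration of what "task-specific prompts" could look like in practice, the Python sketch below pairs one prompt per task with a small parser for the model's text reply. The prompt wording, the score ranges passed to the parser, and all function names are assumptions made for this example; they are not taken from the paper or from the SALMONN repository. The only standard fact used is that a MOS is the arithmetic mean of listener ratings.

```python
import re
from typing import Optional, Sequence

# Hypothetical prompt wording, one template per task (not the paper's prompts).
PROMPTS = {
    "mos": "Rate the overall quality of this speech clip on a scale from 1 to 5.",
    "sim": "Rate how similar the speakers of the two clips sound.",
    "ab": "Which clip sounds better? Answer with A or B.",
}


def mean_opinion_score(ratings: Sequence[float]) -> float:
    """MOS is the arithmetic mean of the individual listener ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)


def parse_score(reply: str, low: float, high: float) -> Optional[float]:
    """Take the first number in the model's reply and clamp it to [low, high]."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    return min(max(float(match.group()), low), high)


def parse_ab(reply: str) -> Optional[str]:
    """Accept only a bare 'A' or 'B' answer; anything else counts as unparsed."""
    answer = reply.strip().upper()
    return answer if answer in {"A", "B"} else None


if __name__ == "__main__":
    print(PROMPTS["mos"])
    print(parse_score("I would rate it 3.7 out of 5", 1.0, 5.0))
    print(parse_ab(" b "))
    print(mean_opinion_score([4, 5, 3, 4]))
```

Predicted scores parsed this way could then be compared with the human MOS labels of a dataset. The strict `parse_ab` rule trades recall for safety, since free-form replies are easy to misread.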

Key Takeaways