TTS

Published: 2025-08-19

Updated: 2025-09-08

Word count: 7.7k

Reading time: 31 min

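⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use with caution.
🔴 Note: do not use them in serious academic settings. They are only suitable for a first-pass screening before reading the papers.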

Updated 2025-08-19

MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts

Authors:Heyang Xue, Xuchen Song, Yu Tang, Jianyu Chen, Yanru Chen, Yang Li, Yahui Zhou

Description-based text-to-speech (TTS) models exhibit strong performance on in-domain text descriptions, i.e., those encountered during training. However, in real-world applications, the diverse range of user-generated descriptions inevitably introduces numerous out-of-domain inputs that challenge the text understanding capabilities of these systems. To address this issue, we propose MoE-TTS, a description-based TTS model designed to enhance the understanding of out-of-domain text descriptions. MoE-TTS employs a modality-based mixture-of-experts (MoE) approach to augment a pre-trained textual large language model (LLM) with a set of specialized weights adapted to the speech modality while maintaining the original LLM frozen during training. This approach allows MoE-TTS to effectively leverage the pre-trained knowledge and text understanding abilities of textual LLMs. Our experimental results indicate that: first, even the most advanced closed-source commercial products can be challenged by carefully designed out-of-domain description test sets; second, MoE-TTS achieves superior performance in generating speech that more accurately reflects the descriptions. We encourage readers to listen to the demos at https://welkinyang.github.io/MoE-TTS/.

Paper and project links

PDF

Summary

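Description-based text-to-speech (TTS) models perform well on in-domain text descriptions, i.e., those seen during training. In real-world use, however, the diversity of user-written descriptions inevitably introduces many out-of-domain inputs that strain these systems' text understanding. To address this, the authors propose MoE-TTS, which uses a modality-based mixture-of-experts (MoE) approach to improve understanding of out-of-domain descriptions. It augments a pre-trained textual large language model (LLM) with specialized weights adapted to the speech modality while keeping the original LLM frozen during training. The experiments indicate that, first, even the most advanced closed-source commercial products can be challenged by carefully designed out-of-domain description test sets, and second, MoE-TTS generates speech that reflects the descriptions more accurately. Demos are available at https://welkinyang.github.io/MoE-TTS/.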

Key Takeaways

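Description-based TTS models perform well on in-domain text descriptions but are challenged by the diversity of user-written descriptions in real applications.
MoE-TTS uses a modality-based mixture-of-experts (MoE) approach to improve understanding of out-of-domain text descriptions.
MoE-TTS adds speech-specific weights to a pre-trained textual large language model (LLM), which stays frozen during training.
Even the most advanced closed-source commercial TTS products can be challenged by carefully designed out-of-domain description test sets.
MoE-TTS generates speech that reflects the descriptions more accurately.
The authors encourage readers to listen to the demos at https://welkinyang.github.io/MoE-TTS/.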
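
The idea of routing by modality while keeping the text LLM frozen can be sketched roughly as follows. This is a toy illustration only: the class name `ModalityMoELayer`, the tiny 2x2 weight matrices, and the `sgd_step` helper are all hypothetical, and the abstract does not describe the actual layer design.

```python
# Hypothetical sketch: route each token to an expert by its modality.
# Names, shapes, and values are illustrative assumptions, not the paper's code.

def linear(weights, x):
    """Apply a dense layer (given as a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

class ModalityMoELayer:
    def __init__(self, text_weights, speech_weights):
        self.text_weights = text_weights      # stands in for frozen pre-trained LLM weights
        self.speech_weights = speech_weights  # stands in for new speech-modality weights

    def forward(self, tokens, modalities):
        """tokens: list of vectors; modalities: 'text' or 'speech' per token."""
        out = []
        for x, modality in zip(tokens, modalities):
            expert = self.text_weights if modality == "text" else self.speech_weights
            out.append(linear(expert, x))
        return out

    def sgd_step(self, speech_grad, lr=0.1):
        """Update only the speech expert; the text expert is never modified."""
        self.speech_weights = [
            [w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(self.speech_weights, speech_grad)
        ]

layer = ModalityMoELayer(
    text_weights=[[1.0, 0.0], [0.0, 1.0]],
    speech_weights=[[0.5, 0.5], [0.5, -0.5]],
)
# The same input vector is processed differently depending on its modality tag.
outputs = layer.forward([[2.0, 4.0], [2.0, 4.0]], ["text", "speech"])
```

Because only `speech_weights` is ever updated in this sketch, whatever the text expert learned in pre-training is left untouched, which is the property the summary attributes to the frozen LLM.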

EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens

Authors:Joonyong Park, Kenichi Nakamura

This paper introduces EmoSSLSphere, a novel framework for multilingual emotional text-to-speech (TTS) synthesis that combines spherical emotion vectors with discrete token features derived from self-supervised learning (SSL). By encoding emotions in a continuous spherical coordinate space and leveraging SSL-based representations for semantic and acoustic modeling, EmoSSLSphere enables fine-grained emotional control, effective cross-lingual emotion transfer, and robust preservation of speaker identity. We evaluate EmoSSLSphere on English and Japanese corpora, demonstrating significant improvements in speech intelligibility, spectral fidelity, prosodic consistency, and overall synthesis quality. Subjective evaluations further confirm that our method outperforms baseline models in terms of naturalness and emotional expressiveness, underscoring its potential as a scalable solution for multilingual emotional TTS.

Paper and project links

PDF In Proceedings of the 13th ISCA Speech Synthesis Workshop

Summary

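This paper introduces EmoSSLSphere, a new framework for multilingual emotional text-to-speech (TTS) synthesis that combines spherical emotion vectors with discrete token features derived from self-supervised learning (SSL). By encoding emotions in a continuous spherical coordinate space and using SSL-based representations for semantic and acoustic modeling, EmoSSLSphere enables fine-grained emotional control, effective cross-lingual emotion transfer, and robust preservation of speaker identity. Evaluations on English and Japanese corpora show significant improvements in speech intelligibility, spectral fidelity, prosodic consistency, and overall synthesis quality. Subjective evaluations further confirm that the method outperforms baseline models in naturalness and emotional expressiveness, underscoring its potential as a scalable solution for multilingual emotional TTS.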

Key Takeaways

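EmoSSLSphere is a new framework for multilingual emotional text-to-speech (TTS) synthesis.
It combines spherical emotion vectors with discrete token features derived from self-supervised learning (SSL).
It encodes emotions in a continuous spherical coordinate space, enabling fine-grained emotional control.
The framework supports effective cross-lingual emotion transfer and robust preservation of speaker identity.
Evaluations on English and Japanese corpora show significant improvements in speech intelligibility, spectral fidelity, prosodic consistency, and overall synthesis quality.
Subjective evaluations confirm that EmoSSLSphere outperforms baseline models in naturalness and emotional expressiveness.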
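
To make "encoding emotions in a spherical coordinate space" concrete, here is a minimal sketch. It assumes, purely for illustration, that the radius stands for emotion intensity and the two angles select the emotion direction; the abstract does not specify what each coordinate means, and the function names are hypothetical.

```python
import math

def emotion_to_vector(intensity, theta, phi):
    """Convert spherical coordinates to a 3-D Cartesian vector.

    Assumed meaning (not from the paper): intensity is the radius,
    theta is the polar angle from +z, phi is the azimuth in the x-y plane.
    """
    return (
        intensity * math.sin(theta) * math.cos(phi),
        intensity * math.sin(theta) * math.sin(phi),
        intensity * math.cos(theta),
    )

def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

# Same direction, different radius: same emotion at two intensities (under the assumption above).
mild = emotion_to_vector(0.3, math.pi / 2, 0.0)
strong = emotion_to_vector(1.0, math.pi / 2, 0.0)
# Different azimuth: a different emotion direction.
different = emotion_to_vector(1.0, math.pi / 2, math.pi / 2)
```

In such a space, intensity and emotion type can be varied independently, which is one plausible reading of the "fine-grained emotional control" claim.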

Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions

Authors:Tina Raissi, Nick Rossenbach, Ralf Schlüter

We analyze automatic speech recognition (ASR) modeling choices under domain mismatch, comparing classic modular and novel sequence-to-sequence (seq2seq) architectures. Across the different ASR architectures, we examine a spectrum of modeling choices, including label units, context length, and topology. To isolate language domain effects from acoustic variation, we synthesize target domain audio using a text-to-speech system trained on LibriSpeech. We incorporate target domain n-gram and neural language models for domain adaptation without retraining the acoustic model. To our knowledge, this is the first controlled comparison of optimized ASR systems across state-of-the-art architectures under domain shift, offering insights into their generalization. The results show that, under domain shift, rather than the decoder architecture choice or the distinction between classic modular and novel seq2seq models, it is specific modeling choices that influence performance.

Paper and project links

PDF Accepted for presentation at IEEE ASRU 2025

Summary

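This paper analyzes automatic speech recognition (ASR) modeling choices under domain mismatch. It compares classic modular and novel sequence-to-sequence (seq2seq) architectures and examines modeling choices such as label units, context length, and topology. To isolate language-domain effects from acoustic variation, the authors synthesize target-domain audio with a text-to-speech system trained on LibriSpeech, and they incorporate target-domain n-gram and neural language models for domain adaptation without retraining the acoustic model. To the authors' knowledge, this is the first controlled comparison of optimized ASR systems across state-of-the-art architectures under domain shift, offering insights into their generalization. The results show that, under domain shift, performance is influenced by specific modeling choices rather than by the decoder architecture or the distinction between classic modular and novel seq2seq models.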

Key Takeaways

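The paper analyzes ASR modeling choices under domain mismatch across different architectures.
It compares classic modular and novel sequence-to-sequence (seq2seq) architectures.
It examines modeling choices including label units, context length, and topology.
Target-domain audio is synthesized with a TTS system trained on LibriSpeech to isolate language-domain effects from acoustic variation.
Target-domain n-gram and neural language models are used for domain adaptation without retraining the acoustic model.
To the authors' knowledge, this is the first controlled comparison of optimized ASR systems across state-of-the-art architectures under domain shift.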
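
One common way to use a target-domain language model without retraining the acoustic model is to rescore hypotheses with a weighted sum of log-scores. The sketch below assumes that log-linear combination; the abstract does not say how the paper combines the models, and the toy unigram LM, the n-best list, and all scores are made up for illustration.

```python
def make_unigram_lm(word_logprobs, unknown_logprob=-10.0):
    """Build a toy unigram LM that sums per-word log-probabilities."""
    def lm_logprob(sentence):
        return sum(word_logprobs.get(w, unknown_logprob) for w in sentence.split())
    return lm_logprob

def rescore(hypotheses, lm_logprob, lm_weight):
    """Return the text maximizing acoustic_score + lm_weight * lm_score.

    hypotheses: list of (text, acoustic_log_score) pairs from a fixed acoustic model.
    """
    return max(hypotheses, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))[0]

# Illustrative target-domain LM in which "fever" is far more likely than "fewer".
target_lm = make_unigram_lm(
    {"the": -1.0, "patient": -1.0, "has": -1.0, "a": -1.0, "fever": -2.0, "fewer": -8.0}
)
nbest = [("the patient has a fewer", -11.5), ("the patient has a fever", -12.0)]

without_lm = rescore(nbest, target_lm, lm_weight=0.0)  # acoustic score alone
with_lm = rescore(nbest, target_lm, lm_weight=0.5)     # domain LM changes the ranking
```

The acoustic scores are never recomputed here; only the language model term changes, which mirrors the "no acoustic retraining" setting described above.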

ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs

Authors:Eray Eren, Qingju Liu, Hyeongwoo Kim, Pablo Garrido, Abeer Alwan

Prosody conveys rich emotional and semantic information of the speech signal as well as individual idiosyncrasies. We propose a stand-alone model that maps text-to-prosodic features such as F0 and energy and can be used in downstream tasks such as TTS. The ProMode encoder takes as input acoustic features and time-aligned textual content, both are partially masked, and obtains a fixed-length latent prosodic embedding. The decoder predicts acoustics in the masked region using both the encoded prosody input and unmasked textual content. Trained on the GigaSpeech dataset, we compare our method with state-of-the-art style encoders. For F0 and energy predictions, we show consistent improvements for our model at different levels of granularity. We also integrate these predicted prosodic features into a TTS system and conduct perceptual tests, which show higher prosody preference compared to the baselines, demonstrating the model’s potential in tasks where prosody modeling is important.

Paper and project links

PDF Interspeech 2025; demo page at https://promode8272.github.io/promode/index.html

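Summary

Prosody conveys rich emotional and semantic information in the speech signal, as well as individual idiosyncrasies. The authors propose a stand-alone model that maps text to prosodic features such as F0 and energy and can be used in downstream tasks such as TTS. The ProMode encoder takes acoustic features and time-aligned textual content as input, both partially masked, and produces a fixed-length latent prosodic embedding. The decoder predicts the acoustics in the masked region from the encoded prosody input and the unmasked textual content. The model is trained on the GigaSpeech dataset and compared with state-of-the-art style encoders. For F0 and energy prediction, it shows consistent improvements at different levels of granularity. The predicted prosodic features are also integrated into a TTS system, where perceptual tests show a higher prosody preference than the baselines, demonstrating the model's potential in tasks where prosody modeling is important.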

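Key Takeaways

Prosody conveys rich emotional and semantic information, as well as individual idiosyncrasies.
The authors propose a stand-alone model that maps text to prosodic features such as F0 and energy.
The ProMode encoder takes partially masked acoustic features and time-aligned text and produces a fixed-length latent prosodic embedding.
The decoder predicts the acoustics in the masked region from the encoded prosody input and the unmasked text.
Trained on GigaSpeech, the model shows consistent improvements in F0 and energy prediction over state-of-the-art style encoders.
The predicted prosodic features are integrated into a TTS system, and perceptual tests show a higher prosody preference than the baselines.
The model has potential in tasks where prosody modeling is important.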
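
The masked-prediction setup can be illustrated with a small sketch: hide a span of an F0 contour, then score a prediction only on the hidden frames. The contiguous span, the zero mask value, the RMSE metric, and the numbers are assumptions chosen for illustration; the abstract does not state the masking scheme or the error metric used.

```python
import math

def mask_span(values, start, end, mask_value=0.0):
    """Return a masked copy of `values` and a boolean mask for frames [start, end)."""
    mask = [start <= i < end for i in range(len(values))]
    masked = [mask_value if m else v for v, m in zip(values, mask)]
    return masked, mask

def masked_rmse(pred, target, mask):
    """Root-mean-square error computed only over the masked frames."""
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return math.sqrt(sum(errors) / len(errors))

f0 = [100.0, 110.0, 120.0, 130.0, 140.0]          # made-up F0 contour in Hz
masked_f0, mask = mask_span(f0, start=1, end=3)    # hide frames 1 and 2
predicted = [100.0, 112.0, 118.0, 130.0, 140.0]    # hypothetical decoder output
error = masked_rmse(predicted, f0, mask)
```

Scoring only the masked frames keeps the metric focused on what the model had to infer rather than on frames it could copy from its input.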

Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention’s Alternative

Authors:Xi Xuan, Zimo Zhu, Wenxin Zhang, Yi-Cheng Lin, Tomi Kinnunen

Advances in speech synthesis intensify security threats, motivating real-time deepfake detection research. We investigate whether bidirectional Mamba can serve as a competitive alternative to Self-Attention in detecting synthetic speech. Our solution, Fake-Mamba, integrates an XLSR front-end with bidirectional Mamba to capture both local and global artifacts. Our core innovation introduces three efficient encoders: TransBiMamba, ConBiMamba, and PN-BiMamba. Leveraging XLSR’s rich linguistic representations, PN-BiMamba can effectively capture the subtle cues of synthetic speech. Evaluated on ASVspoof 21 LA, 21 DF, and In-The-Wild benchmarks, Fake-Mamba achieves 0.97%, 1.74%, and 5.85% EER, respectively, representing substantial relative gains over SOTA models XLSR-Conformer and XLSR-Mamba. The framework maintains real-time inference across utterance lengths, demonstrating strong generalization and practical viability. The code is available at https://github.com/xuanxixi/Fake-Mamba.

Paper and project links

PDF Accepted at IEEE ASRU 2025

Summary

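This paper addresses the security threats intensified by advances in speech synthesis, which motivate research on real-time deepfake detection. The authors investigate whether bidirectional Mamba can serve as a competitive alternative to self-attention for detecting synthetic speech, and propose Fake-Mamba, which integrates an XLSR front-end with bidirectional Mamba to capture both local and global artifacts. Its core innovation is three efficient encoders: TransBiMamba, ConBiMamba, and PN-BiMamba. On the ASVspoof 21 LA, 21 DF, and In-The-Wild benchmarks, Fake-Mamba achieves 0.97%, 1.74%, and 5.85% EER, respectively, a substantial relative gain over the SOTA models XLSR-Conformer and XLSR-Mamba. The framework maintains real-time inference across utterance lengths, demonstrating strong generalization and practical viability.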

Key Takeaways

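Advances in speech synthesis intensify security threats and motivate research on real-time speech deepfake detection.
Fake-Mamba integrates an XLSR front-end with bidirectional Mamba to capture both local and global artifacts.
Its core innovation is three efficient encoders: TransBiMamba, ConBiMamba, and PN-BiMamba.
Leveraging XLSR's rich linguistic representations, PN-BiMamba can effectively capture the subtle cues of synthetic speech.
On the ASVspoof 21 LA, 21 DF, and In-The-Wild benchmarks, Fake-Mamba achieves 0.97%, 1.74%, and 5.85% EER, respectively, improving on XLSR-Conformer and XLSR-Mamba.
The framework maintains real-time inference across utterance lengths, showing strong generalization and practical viability.
The code is available at https://github.com/xuanxixi/Fake-Mamba.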
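
The results above are reported as equal error rate (EER), the operating point where the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how EER can be approximated from detector scores. It is not the official ASVspoof scoring tool, the scores are invented, and it assumes that a higher score means "more likely bona fide".

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER in [0, 1] by sweeping thresholds over the observed scores.

    A trial is accepted as bona fide when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide speech scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed speech scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Made-up scores: one bona fide trial scores low and one spoof trial scores high.
eer = compute_eer([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
```

On this toy data the two error rates meet at a threshold of 0.6, where one of four bona fide trials is rejected and one of four spoof trials is accepted, giving an EER of 25%.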

Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS

Authors:M Anuprabha, Krishna Gurugubelli, Anil Kumar Vuppala

Dysarthric speech poses significant challenges in developing assistive technologies, primarily due to the limited availability of data. Recent advances in neural speech synthesis, especially zero-shot voice cloning, facilitate synthetic speech generation for data augmentation; however, they may introduce biases towards dysarthric speech. In this paper, we investigate the effectiveness of state-of-the-art F5-TTS in cloning dysarthric speech using TORGO dataset, focusing on intelligibility, speaker similarity, and prosody preservation. We also analyze potential biases using fairness metrics like Disparate Impact and Parity Difference to assess disparities across dysarthric severity levels. Results show that F5-TTS exhibits a strong bias toward speech intelligibility over speaker and prosody preservation in dysarthric speech synthesis. Insights from this study can help integrate fairness-aware dysarthric speech synthesis, fostering the advancement of more inclusive speech technologies.

Paper and project links

PDF Accepted at Interspeech 2025

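Summary

This study examines how well the state-of-the-art F5-TTS clones dysarthric speech on the TORGO dataset, focusing on intelligibility, speaker similarity, and prosody preservation. It also uses fairness metrics such as Disparate Impact and Parity Difference to analyze potential biases and assess disparities across dysarthric severity levels. The results show that, in dysarthric speech synthesis, F5-TTS exhibits a strong bias toward speech intelligibility over speaker and prosody preservation. These insights can help integrate fairness-aware dysarthric speech synthesis and foster more inclusive speech technologies. The summary adds that, although recent neural speech synthesis methods show promise, assistive technology for dysarthric speakers still faces major challenges; it suggests that future work should reduce bias while preserving synthesis quality across severity levels, and that more real and diverse data should be collected.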

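Key Takeaways

1. Neural speech synthesis, especially zero-shot voice cloning, facilitates synthetic speech generation for data augmentation when dysarthric speech data is scarce, but it may introduce biases toward dysarthric speech.
2. F5-TTS is evaluated on cloning dysarthric speech in terms of intelligibility, speaker similarity, and prosody preservation; the study finds a strong bias toward intelligibility over speaker and prosody preservation.
3. Fairness metrics such as Disparate Impact and Parity Difference are used to analyze potential biases across dysarthric severity levels, which informs the development of fairness-aware, more inclusive speech technologies.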
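
The two fairness metrics named above have standard definitions in terms of "favorable outcome" rates for two groups. The sketch below uses those standard definitions; what counts as a favorable outcome (for example, a cloned utterance exceeding some quality threshold), which severity group serves as the reference, and the 0/1 data are all assumptions for illustration, not values from the paper.

```python
def favorable_rate(outcomes):
    """Fraction of favorable outcomes, where each outcome is 1 (favorable) or 0."""
    return sum(outcomes) / len(outcomes)

def disparate_impact(group, reference):
    """Ratio of favorable rates; 1.0 means parity, values below 1.0 disfavor `group`."""
    return favorable_rate(group) / favorable_rate(reference)

def parity_difference(group, reference):
    """Difference of favorable rates; 0.0 means parity, negative values disfavor `group`."""
    return favorable_rate(group) - favorable_rate(reference)

# Hypothetical outcomes: 1 = cloned utterance judged acceptable, 0 = not acceptable.
severe_outcomes = [1, 0, 0, 0]   # assumed high-severity group
mild_outcomes = [1, 1, 1, 0]     # assumed low-severity reference group

di = disparate_impact(severe_outcomes, mild_outcomes)
spd = parity_difference(severe_outcomes, mild_outcomes)
```

Computing both is useful because the ratio highlights relative disadvantage while the difference shows the absolute gap between groups.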

Marco-Voice Technical Report

Authors:Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang

This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion control speech synthesis within a unified framework. The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning, enabling independent manipulation of speaker identity and emotional style, as well as a rotational emotional embedding integration method for smooth emotion control. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories. Extensive experiments demonstrate that our system, Marco-Voice, achieves substantial improvements in both objective and subjective metrics. Comprehensive evaluations and analysis were conducted, and the results show that Marco-Voice delivers competitive performance in terms of speech clarity and emotional richness, representing a substantial advance in the field of expressive neural speech synthesis. Our code and dataset are publicly available at https://github.com/AIDC-AI/Marco-Voice and https://huggingface.co/datasets/AIDC-AI/CSEMOTIONS respectively.

Paper and project links

PDF Technical Report. Our code and dataset are publicly available at https://github.com/AIDC-AI/Marco-Voice and https://huggingface.co/datasets/AIDC-AI/CSEMOTIONS respectively

Summary

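This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion-controlled speech synthesis within a unified framework. It aims to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts. A speaker-emotion disentanglement mechanism with in-batch contrastive learning enables independent manipulation of speaker identity and emotional style, and a rotational emotional embedding integration method provides smooth emotion control. To support comprehensive training and evaluation, the authors construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories. Experiments show that the system, Marco-Voice, achieves substantial improvements in both objective and subjective metrics and delivers competitive speech clarity and emotional richness, which the authors present as a substantial advance in expressive neural speech synthesis.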

Key Takeaways

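The paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion-controlled speech synthesis.
The system aims for highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity.
A speaker-emotion disentanglement mechanism with in-batch contrastive learning and a rotational emotional embedding integration method enable independent manipulation of speaker identity and emotional style.
To support comprehensive training and evaluation, the authors construct CSEMOTIONS, a high-quality emotional speech dataset.
Marco-Voice achieves substantial improvements in both objective and subjective metrics.
Comprehensive evaluations show that Marco-Voice delivers competitive performance in speech clarity and emotional richness.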
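
"In-batch contrastive learning" usually means that, for each item in a batch, its paired embedding is the positive and the other items' embeddings serve as negatives. The sketch below uses an InfoNCE-style loss as one common instance; the abstract does not give the exact loss, the pairing scheme, or the temperature, so all of these are assumptions, and the 2-D embeddings are toy values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: positives[i] pairs with anchors[i]; other rows act as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy embeddings: row i of each list is assumed to come from the same speaker.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_contrastive_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = in_batch_contrastive_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each anchor is closest to its own positive and large when the pairs are mismatched, which is the pressure that pushes matching embeddings together and others apart.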

EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

Authors:Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen

Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques. Besides, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Dataset, code, checkpoints, and demo samples are available at https://github.com/yanghaha0908/EmoVoice.

Paper and project links

PDF Accepted at ACMMM 2025

Summary

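This paper presents EmoVoice, a novel emotion-controllable text-to-speech (TTS) model. It exploits large language models (LLMs) to enable fine-grained, freestyle natural-language emotion control, and proposes a phoneme boost variant that outputs phoneme tokens and audio tokens in parallel to enhance content consistency. It also introduces EmoVoice-DB, a high-quality 40-hour English emotion dataset, and investigates the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences.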

Key Takeaways