TTS

Published: 2025-09-19

Last updated: 2025-10-07

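⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not rely on them in serious academic settings; they are suitable only for initial screening before reading a paper.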

2025-09-19 Update

CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset

Authors: Brian Yan, Injy Hamed, Shuichiro Shimizu, Vasista Lodagala, William Chen, Olga Iakovenko, Bashar Talafha, Amir Hussein, Alexander Polok, Kalvin Chang, Dominik Klement, Sara Althubaiti, Puyuan Peng, Matthew Wiesner, Thamar Solorio, Ahmed Ali, Sanjeev Khudanpur, Shinji Watanabe, Chih-Chen Chen, Zhen Wu, Karim Benharrak, Anuj Diwan, Samuele Cornell, Eunjung Yeo, Kwanghee Choi, Carlos Carvalho, Karen Rosero

We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with the generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research. Dataset link: https://huggingface.co/datasets/byan/cs-fleurs.

Paper and project links

PDF

Summary

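CS-FLEURS is a new dataset for developing and evaluating code-switched speech recognition and translation systems. It contains four test sets covering 113 unique code-switched language pairs across 52 languages, plus a training set with 128 hours of generative text-to-speech data. The authors hope it will broaden the scope of future code-switched speech research.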

Key Takeaways

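CS-FLEURS is a new dataset designed for developing and evaluating code-switched speech recognition and translation systems.
It contains four test sets that together cover 113 unique code-switched language pairs across 52 languages.
It also provides a training set of 128 hours of generative text-to-speech data across 16 X-English language pairs.
The test sets combine real voices reading synthetically generated code-switched sentences, generative text-to-speech, and concatenative text-to-speech.
The dataset pays particular attention to lower-resourced language pairs, which helps broaden the scope of research.
CS-FLEURS can be used to evaluate speech recognition and translation systems across many languages.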
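
The per-set pair counts in the abstract add up to more than the 113 unique pairs it reports, so some pairs must appear in more than one test set. The following quick check uses only the numbers stated in the abstract; the dictionary keys and variable names are our own labels, and the abstract does not say which pairs overlap.

```python
# Language-pair counts per test set, as stated in the abstract.
# The keys are informal labels chosen for this sketch.
test_set_pairs = {
    "read speech, X-English": 14,
    "generative TTS, X-English": 16,
    "generative TTS, {Arabic, Mandarin, Hindi, Spanish}-X": 60,
    "concatenative TTS, lower-resourced X-English": 45,
}
UNIQUE_PAIRS = 113  # unique code-switched pairs reported in the abstract

# Sum of the per-set counts, counting a pair once for every set it is in.
total_listed = sum(test_set_pairs.values())

# Entries that repeat a pair already counted in another set.
repeated_entries = total_listed - UNIQUE_PAIRS

print(f"pairs summed over test sets: {total_listed}")
print(f"pairs reported as unique:    {UNIQUE_PAIRS}")
print(f"repeated entries:            {repeated_entries}")
```

The difference tells us only how many entries are repeats, not how they are distributed across the four sets.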

Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems

Authors: Yi-Cheng Lin, Huang-Cheng Chou, Tzu-Chieh Wei, Kuan-Yu Chen, Hung-yi Lee

Instruction-guided text-to-speech (ITTS) enables users to control speech generation through natural language prompts, offering a more intuitive interface than traditional TTS. However, the alignment between user style instructions and listener perception remains largely unexplored. This work first presents a perceptual analysis of ITTS controllability across two expressive dimensions (adverbs of degree and graded emotion intensity) and collects human ratings on speaker age and word-level emphasis attributes. To comprehensively reveal the instruction-perception gap, we provide a data collection with large-scale human evaluations, named Expressive VOice Control (E-VOC) corpus. Furthermore, we reveal that (1) gpt-4o-mini-tts is the most reliable ITTS model with great alignment between instruction and generated utterances across acoustic dimensions. (2) The 5 analyzed ITTS systems tend to generate Adult voices even when the instructions ask to use child or Elderly voices. (3) Fine-grained control remains a major challenge, indicating that most ITTS systems have substantial room for improvement in interpreting slightly different attribute instructions.

Paper and project links

PDF Submission to ICASSP 2026

Summary

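Instruction-guided text-to-speech (ITTS) lets users control speech generation through natural language prompts, a more intuitive interface than traditional TTS. However, the alignment between users' style instructions and listener perception remains largely unexplored. This study presents a perceptual analysis of ITTS controllability along two expressive dimensions (adverbs of degree and graded emotion intensity) and collects human ratings of speaker age and word-level emphasis. To expose the instruction-perception gap, the authors release a collection of large-scale human evaluations, the Expressive VOice Control (E-VOC) corpus. They find that gpt-4o-mini-tts is the most reliable ITTS model, with strong alignment between instructions and generated utterances across acoustic dimensions; that the five analyzed ITTS systems tend to generate adult voices even when instructed to use child or elderly voices; and that fine-grained control remains a major challenge, leaving most ITTS systems substantial room for improvement in interpreting slightly different attribute instructions.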

Key Takeaways

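ITTS controls speech generation through natural language prompts, offering a more intuitive interface than traditional TTS.
The alignment between users' style instructions and listener perception in ITTS has not been studied in depth.
The study analyzes ITTS controllability along two expressive dimensions: adverbs of degree and graded emotion intensity.
The E-VOC corpus, a collection of large-scale human evaluations, is released to expose the instruction-perception gap.
gpt-4o-mini-tts is the most reliable ITTS model, with strong alignment between instructions and generated speech across acoustic dimensions.
The five analyzed ITTS systems tend to produce adult voices even when asked for child or elderly voices.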
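
One simple way to quantify how well graded instructions line up with perception is a rank correlation between the instructed intensity level and the mean listener rating. The sketch below is an illustration of that idea only: the abstract does not state which statistic the authors use, the function names are ours, and the ratings are made-up toy values, not E-VOC data.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Instructed intensity levels (e.g. "slightly" ... "extremely"), coded 1 to 5.
instructed_levels = [1, 2, 3, 4, 5]

# Hypothetical mean listener ratings for two imaginary systems.
ratings_system_a = [1.2, 2.1, 2.9, 4.0, 4.8]  # follows the instruction closely
ratings_system_b = [2.9, 3.1, 3.0, 3.2, 3.0]  # barely reacts to the instruction

print("system A:", spearman(instructed_levels, ratings_system_a))
print("system B:", spearman(instructed_levels, ratings_system_b))
```

A value near 1 means listeners perceive the intensity ordering the instruction asked for; a value near 0 means the instruction has little perceptible effect.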

ClonEval: An Open Voice Cloning Benchmark

Authors: Iwona Christop, Tomasz Kuczyński, Marek Kubis

We present a novel benchmark for voice cloning text-to-speech models. The benchmark consists of an evaluation protocol, an open-source library for assessing the performance of voice cloning models, and an accompanying leaderboard. The paper discusses design considerations and presents a detailed description of the evaluation procedure. The usage of the software library is explained, along with the organization of results on the leaderboard.

Paper and project links

PDF Under review at ICASSP

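Summary

This paper introduces a new benchmark for voice cloning text-to-speech models. The benchmark comprises an evaluation protocol, an open-source library for assessing voice cloning models, and an accompanying leaderboard. The paper discusses design considerations and describes the evaluation procedure, the use of the software library, and the organization of results on the leaderboard.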

Key Takeaways

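A new benchmark is proposed for evaluating voice cloning text-to-speech models.
The benchmark consists of an evaluation protocol, an open-source library, and a leaderboard.
The paper discusses the design considerations behind the benchmark.
It describes the evaluation procedure in detail and explains how to use the software library.
It also explains how results are organized on the leaderboard.
The benchmark helps evaluate and compare different voice cloning models.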
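
The abstract does not spell out which metrics the protocol uses. A common way to score voice cloning, assumed here purely for illustration, is the cosine similarity between a speaker embedding of the reference recording and one of the cloned sample. The function name and the toy vectors below are hypothetical; in practice the embeddings would come from a speaker encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 4-dimensional "speaker embeddings" (made-up values for illustration).
reference_embedding = [0.9, 0.1, 0.3, 0.2]
cloned_embedding = [0.8, 0.2, 0.35, 0.1]
other_speaker_embedding = [0.1, 0.9, 0.0, 0.4]

same = cosine_similarity(reference_embedding, cloned_embedding)
different = cosine_similarity(reference_embedding, other_speaker_embedding)
print(f"reference vs cloned sample: {same:.3f}")
print(f"reference vs other speaker: {different:.3f}")
```

A higher score for the cloned sample than for an unrelated speaker is the behaviour a similarity-based benchmark would reward.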

Text-to-Speech for Unseen Speakers via Low-Complexity Discrete Unit-Based Frame Selection

Authors: Ismail Rasim Ulgen, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman

Synthesizing the voices of unseen speakers remains a persisting challenge in multi-speaker text-to-speech (TTS). Existing methods model speaker characteristics through speaker conditioning during training, leading to increased model complexity and limiting reproducibility and accessibility. A low-complexity alternative would broaden the reach of speech synthesis research, particularly in settings with limited computational and data resources. To this end, we propose SelectTTS, a simple and effective alternative. SelectTTS selects appropriate frames from the target speaker and decodes them using frame-level self-supervised learning (SSL) features. We demonstrate that this approach can effectively capture speaker characteristics for unseen speakers and achieves performance comparable to state-of-the-art multi-speaker TTS frameworks on both objective and subjective metrics. By directly selecting frames from the target speaker’s speech, SelectTTS enables generalization to unseen speakers with significantly lower model complexity. Experimental results show that the proposed approach achieves performance comparable to state-of-the-art systems such as XTTS-v2 and VALL-E, while requiring over 8x fewer parameters and 270x less training data. Moreover, it demonstrates that frame selection with SSL features offers an efficient path to low-complexity, high-quality multi-speaker TTS.

Paper and project links

PDF Under review for IEEE OJSP

Summary

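This paper proposes SelectTTS, a simple and efficient multi-speaker text-to-speech (TTS) method. It selects appropriate frames from the target speaker's speech and decodes them using frame-level self-supervised learning (SSL) features, capturing speaker characteristics well enough to synthesize high-quality speech for unseen speakers. SelectTTS achieves performance comparable to state-of-the-art systems such as XTTS-v2 and VALL-E while requiring over 8x fewer parameters and 270x less training data.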
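
The frame-selection idea can be sketched in a few lines. This is a minimal illustration, not the authors' algorithm: it assumes the text has already been mapped to a sequence of discrete units and that each frame of the target speaker's reference speech carries a unit label. The function name, the round-robin choice among matching frames, and the fallback for missing units are all our own assumptions, and the decoding step that turns the selected frames into audio is omitted.

```python
from collections import defaultdict


def select_frames(predicted_units, ref_units, ref_frames):
    """For each predicted unit, pick a reference frame labelled with that unit.

    predicted_units: unit ids predicted from text (assumed given).
    ref_units:       unit id of each reference frame of the target speaker.
    ref_frames:      feature vector of each reference frame.
    """
    # Group the target speaker's frames by their discrete unit.
    pool = defaultdict(list)
    for unit, frame in zip(ref_units, ref_frames):
        pool[unit].append(frame)

    selected = []
    times_used = defaultdict(int)
    for unit in predicted_units:
        candidates = pool.get(unit)
        if not candidates:
            # Hypothetical fallback: the unit never occurs in the reference,
            # so repeat the previously selected frame.
            if not selected:
                raise ValueError(f"unit {unit} has no reference frame")
            selected.append(selected[-1])
            continue
        # Cycle through matching frames so repeats do not reuse one frame.
        selected.append(candidates[times_used[unit] % len(candidates)])
        times_used[unit] += 1
    return selected


# Toy reference speech: four frames with 2-dimensional features.
ref_units = [3, 3, 7, 1]
ref_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

# Toy unit sequence predicted from text; unit 9 is absent from the reference.
predicted_units = [7, 3, 3, 3, 9, 1]

chosen = select_frames(predicted_units, ref_units, ref_frames)
for unit, frame in zip(predicted_units, chosen):
    print(unit, frame)
```

Because every output frame is copied from the target speaker's own speech, no speaker-conditioned model has to be trained, which is the source of the low complexity the abstract emphasizes.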

Key Takeaways