⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never use these summaries for serious academic purposes; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-13
DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech
Authors:Ngoc-Son Nguyen, Hieu-Nghia Huynh-Nguyen, Thanh V. T. Tran, Truong-Son Hy, Van Nguyen
Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample, requiring not only speaker adaptation but also accurate modeling of prosodic attributes. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. Discrete codec representations have been widely adopted for speech synthesis, and recent works have begun to explore diffusion models in purely discrete settings, suggesting the potential of discrete generative modeling for speech synthesis. However, existing flow-matching methods typically embed these discrete tokens into a continuous space and apply continuous flow matching, which may not fully leverage the advantages of discrete representations. To address these challenges, we introduce DiFlow-TTS, which, to the best of our knowledge, is the first model to explore purely Discrete Flow Matching for speech synthesis. DiFlow-TTS explicitly models factorized speech attributes within a compact and unified architecture. It leverages in-context learning by conditioning on textual content, along with prosodic and acoustic attributes extracted from a reference speech, enabling effective attribute cloning in a zero-shot setting. In addition, the model employs a factorized flow prediction mechanism with distinct heads for prosody and acoustic details, allowing it to learn aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS achieves promising performance in several key metrics, including naturalness, prosody, preservation of speaker style, and energy control. It also maintains a compact model size and achieves low-latency inference, generating speech up to 25.8 times faster than the latest existing baselines.
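The two mechanisms highlighted in the abstract, purely discrete flow matching over codec tokens and a factorized prediction head that treats prosody and acoustic detail separately, can be illustrated with a small sketch. This is only a minimal illustration in PyTorch assuming a mask-based corruption schedule; the module names, vocabulary sizes, and noising scheme below are assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes -- not taken from the paper.
PROSODY_VOCAB, ACOUSTIC_VOCAB, MASK_ID = 1024, 1024, 1024  # index 1024 = [MASK]
D_MODEL = 256

class FactorizedDFMHead(nn.Module):
    """Shared backbone with separate prediction heads for prosody and
    acoustic tokens, in the spirit of a factorized flow prediction head."""
    def __init__(self):
        super().__init__()
        self.embed_p = nn.Embedding(PROSODY_VOCAB + 1, D_MODEL)   # +1 for [MASK]
        self.embed_a = nn.Embedding(ACOUSTIC_VOCAB + 1, D_MODEL)
        enc_layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head_p = nn.Linear(D_MODEL, PROSODY_VOCAB)   # prosody logits
        self.head_a = nn.Linear(D_MODEL, ACOUSTIC_VOCAB)  # acoustic logits

    def forward(self, x_p, x_a):
        h = self.backbone(self.embed_p(x_p) + self.embed_a(x_a))
        return self.head_p(h), self.head_a(h)

def dfm_training_step(model, tok_p, tok_a):
    """One mask-based discrete flow matching step: corrupt tokens toward
    [MASK] with probability (1 - t), then train the model to predict the
    clean tokens with cross-entropy on both factorized heads."""
    b, n = tok_p.shape
    t = torch.rand(b, 1)                 # time in [0, 1]
    keep = torch.rand(b, n) < t          # keep probability grows with t
    x_p = torch.where(keep, tok_p, torch.full_like(tok_p, MASK_ID))
    x_a = torch.where(keep, tok_a, torch.full_like(tok_a, MASK_ID))
    logits_p, logits_a = model(x_p, x_a)
    loss = (F.cross_entropy(logits_p.transpose(1, 2), tok_p)
            + F.cross_entropy(logits_a.transpose(1, 2), tok_a))
    return loss

model = FactorizedDFMHead()
tok_p = torch.randint(0, PROSODY_VOCAB, (2, 50))   # toy prosody token batch
tok_a = torch.randint(0, ACOUSTIC_VOCAB, (2, 50))  # toy acoustic token batch
print(dfm_training_step(model, tok_p, tok_a).item())
```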
Paper and project links
Summary
Zero-shot text-to-speech (TTS) aims to synthesize high-quality speech that mimics an unseen speaker's voice from only a short reference sample, which requires not only speaker adaptation but also accurate modeling of prosodic attributes. Approaches based on language models, diffusion, and flow matching have shown promising results for zero-shot TTS, but they still suffer from slow inference and repetition artifacts. This paper introduces DiFlow-TTS, the first model to explore purely discrete flow matching for speech synthesis. DiFlow-TTS explicitly models factorized speech attributes within a compact, unified architecture and performs in-context learning conditioned on textual content together with prosodic and acoustic attributes extracted from a reference utterance, enabling effective attribute cloning in the zero-shot setting. In addition, the model employs a factorized flow prediction mechanism with separate heads for prosody and acoustic detail, allowing it to learn aspect-specific distributions. Experiments show that DiFlow-TTS performs strongly on several key metrics, including naturalness, prosody, speaker-style preservation, and energy control, while keeping a compact model size and achieving low-latency inference, generating speech up to 25.8 times faster than the latest existing baselines.
Key Takeaways
- Zero-shot TTS synthesizes high-quality speech from a short reference sample and requires both speaker adaptation and accurate modeling of prosodic attributes.
- Existing flow-matching methods embed discrete tokens into a continuous space and apply continuous flow matching, which may not fully exploit the advantages of discrete representations.
- DiFlow-TTS is the first speech synthesis model to explore purely discrete flow matching, explicitly modeling factorized speech attributes.
- DiFlow-TTS performs in-context learning conditioned on textual content and on prosodic and acoustic attributes from the reference speech, enabling effective attribute cloning.
- DiFlow-TTS uses a factorized flow prediction mechanism with separate heads for prosody and acoustic detail, learning aspect-specific distributions.
- Experiments show strong performance in naturalness, prosody, speaker-style preservation, and energy control.
- DiFlow-TTS achieves low-latency inference, generating speech markedly faster than existing baselines.
DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners
Authors:Xiaoxue Luo, Jinwei Huang, Runyan Yang, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang
Universal audio codecs learn entangled representations across audio types, whereas some specific codecs offer decoupled representations but are limited to speech. Real-world audio, however, often contains mixed speech and background sounds, and downstream tasks require selective access to these components. Therefore, we rethink the audio codec as a universal disentangled representation learner to enable controllable feature selection across different audio tasks. To this end, we introduce DeCodec, a novel neural codec that learns to decouple audio representations into orthogonal subspaces dedicated to speech and background sound, and within speech, representations are further decomposed into semantic and paralinguistic components. This hierarchical disentanglement allows flexible feature selection, making DeCodec a universal front-end for multiple audio applications. Technically, built upon a codec framework, DeCodec incorporates two key innovations: a subspace orthogonal projection module that factorizes the input into two decoupled orthogonal subspaces, and a representation swap training procedure that ensures these two subspaces correspond to speech and background sound, respectively. This allows parallel RVQs to quantize the speech and background sound components independently. Furthermore, we apply semantic guidance to the speech RVQ to achieve semantic and paralinguistic decomposition. Experimental results show that DeCodec maintains advanced signal reconstruction while enabling new capabilities: superior speech enhancement and effective one-shot voice conversion on noisy speech via representation recombination, improved ASR robustness through clean semantic features, and controllable background sound preservation/suppression in TTS. Demo Page: https://luo404.github.io/DeCodecV2/
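The abstract names a subspace orthogonal projection module and a representation swap procedure but does not give their exact form. Below is a minimal sketch of one plausible reading in PyTorch; the two-branch linear projection, the cosine-based orthogonality penalty, and the naive feature recombination are illustrative assumptions, not DeCodec's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 128  # hypothetical feature dimension

class SubspaceProjector(nn.Module):
    """Toy two-branch projection: one branch for speech, one for background,
    with a penalty that pushes the two projected subspaces toward orthogonality."""
    def __init__(self, dim=D):
        super().__init__()
        self.to_speech = nn.Linear(dim, dim, bias=False)
        self.to_background = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                  # x: (batch, frames, dim)
        s = self.to_speech(x)
        b = self.to_background(x)
        # Orthogonality penalty: frame-wise cosine similarity between the
        # speech and background projections should be close to zero.
        ortho = F.cosine_similarity(s, b, dim=-1).pow(2).mean()
        return s, b, ortho

proj = SubspaceProjector()
x1, x2 = torch.randn(1, 100, D), torch.randn(1, 100, D)
s1, b1, pen1 = proj(x1)
s2, b2, pen2 = proj(x2)
# "Representation swap" in spirit: recombine the speech component of clip 1
# with the background component of clip 2 before decoding.
recombined = s1 + b2
print(recombined.shape, pen1.item())
```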
Paper and project links
Summary
DeCodec is a neural codec that decouples audio representations to support selective, controllable access for multiple downstream tasks. It hierarchically disentangles audio into speech and background sound, and further splits speech into semantic and paralinguistic components, enabling flexible feature selection. A subspace orthogonal projection module and a representation swap training procedure ensure that the two subspaces correspond to speech and background sound, so that the components can be quantized independently. Experiments show that DeCodec preserves strong signal reconstruction while enabling new capabilities such as speech enhancement, one-shot voice conversion on noisy speech, improved ASR robustness, and controllable background sound handling in TTS. See the demo page linked above for details.
Key Takeaways
- Universal audio codecs learn entangled representations across audio types, while codecs that offer decoupled representations are limited to speech.
HISPASpoof: A New Dataset For Spanish Speech Forensics
Authors:Maria Risques, Kratika Bhagtani, Amit Kumar Singh Yadav, Edward J. Delp
Zero-shot Voice Cloning (VC) and Text-to-Speech (TTS) methods have advanced rapidly, enabling the generation of highly realistic synthetic speech and raising serious concerns about their misuse. While numerous detectors have been developed for English and Chinese, Spanish, spoken by over 600 million people worldwide, remains underrepresented in speech forensics. To address this gap, we introduce HISPASpoof, the first large-scale Spanish dataset designed for synthetic speech detection and attribution. It includes real speech from public corpora across six accents and synthetic speech generated with six zero-shot TTS systems. We evaluate five representative methods, showing that detectors trained on English fail to generalize to Spanish, while training on HISPASpoof substantially improves detection. We also evaluate synthetic speech attribution performance on HISPASpoof, i.e., identifying the generation method of synthetic speech. HISPASpoof thus provides a critical benchmark for advancing reliable and inclusive speech forensics in Spanish.
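The abstract does not state which metrics are used, but synthetic speech detection is commonly reported with the equal error rate (EER). The following is a generic sketch of computing EER from detector scores, not the paper's evaluation protocol; the labels and scores are toy values.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """Equal Error Rate: the operating point where the false-accept and
    false-reject rates coincide (a standard spoof-detection metric)."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy scores: label 1 = synthetic (spoof), 0 = bona fide; higher = more synthetic.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9])
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```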
Paper and project links
PDF 8 pages, 1 figure, 10 tables, being submitted to ICASSP 2026 (IEEE International Conference on Acoustics, Speech, and Signal Processing 2026)
Summary
This paper presents HISPASpoof, the first large-scale Spanish dataset for synthetic speech detection and attribution, covering real speech in six accents and synthetic speech generated by six zero-shot TTS systems. The study shows that detectors trained on English do not generalize to Spanish, whereas training on HISPASpoof substantially improves detection performance. HISPASpoof thus provides a key benchmark for advancing reliable and inclusive Spanish speech forensics.
Key Takeaways
- HISPASpoof is the first large-scale dataset designed for Spanish synthetic speech detection and attribution.
- The dataset contains real speech in six accents and synthetic speech generated by multiple TTS systems.
- Detectors trained on English perform poorly on Spanish.
- Detectors trained on HISPASpoof achieve substantially better synthetic speech detection.
- HISPASpoof provides an important benchmark for reliable and inclusive Spanish speech forensics.
- The dataset was created to address the current underrepresentation of Spanish in speech forensics.
A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions
Authors:Chung-Chun Wang, Jhen-Ke Lin, Hao-Chien Lu, Hong-Yun Lin, Berlin Chen
Automated speaking assessment (ASA) on opinion expressions is often hampered by the scarcity of labeled recordings, which restricts prompt diversity and undermines scoring reliability. To address this challenge, we propose a novel training paradigm that leverages a large language model (LLM) to generate diverse responses of a given proficiency level, converts responses into synthesized speech via speaker-aware text-to-speech synthesis, and employs a dynamic importance loss to adaptively reweight training instances based on feature distribution differences between synthesized and real speech. Subsequently, a multimodal large language model integrates aligned textual features with speech signals to predict proficiency scores directly. Experiments conducted on the LTTC dataset show that our approach outperforms methods relying on real data or conventional augmentation, effectively mitigating low-resource constraints and enabling ASA on opinion expressions with cross-modal information.
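The abstract describes a dynamic importance loss that reweights training instances by how far synthesized speech features drift from real speech features, without giving a formula. A minimal sketch of one plausible reading follows; the distance-to-mean heuristic, the softmax weighting, and the function names are illustrative assumptions rather than the authors' method.

```python
import torch
import torch.nn.functional as F

def dynamic_importance_weights(synth_feats, real_feats, temperature=1.0):
    """Toy reweighting: instances whose features lie far from the real-speech
    feature distribution (summarized here by its mean) are down-weighted."""
    real_mean = real_feats.mean(dim=0, keepdim=True)      # (1, dim)
    dist = (synth_feats - real_mean).norm(dim=-1)          # (batch,)
    weights = torch.softmax(-dist / temperature, dim=0) * len(dist)
    return weights                                         # mean weight ~ 1.0

def weighted_scoring_loss(pred_scores, true_scores, weights):
    """Per-instance MSE on proficiency scores, reweighted by importance."""
    per_item = F.mse_loss(pred_scores, true_scores, reduction="none")
    return (weights * per_item).mean()

# Toy batch: 4 synthetic instances, 16 real reference instances, 32-dim features.
synth = torch.randn(4, 32)
real = torch.randn(16, 32)
w = dynamic_importance_weights(synth, real)
loss = weighted_scoring_loss(torch.rand(4), torch.rand(4), w)
print(w, loss.item())
```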
Paper and project links
PDF submitted to the ISCA SLaTE-2025 Workshop
Summary
Automated speaking assessment (ASA) on opinion expressions is hampered by the scarcity of labeled recordings, which limits prompt diversity and undermines scoring reliability. To address this challenge, this work proposes a new training paradigm that uses a large language model (LLM) to generate diverse responses at a given proficiency level, converts the responses into synthesized speech via speaker-aware text-to-speech synthesis, and applies a dynamic importance loss that adaptively reweights training instances according to feature distribution differences between synthesized and real speech. A multimodal large language model then combines aligned textual features with the speech signal to predict proficiency scores directly. Experiments on the LTTC dataset show that the approach outperforms methods relying on real data or conventional augmentation, effectively mitigating low-resource constraints and enabling ASA on opinion expressions with cross-modal information.
Key Takeaways
- The scarcity of labeled recordings limits ASA on opinion expressions, reducing prompt diversity and scoring reliability.
- A new training paradigm uses a large language model to generate diverse responses at a given proficiency level, addressing data scarcity.
- Responses are converted into synthesized speech via speaker-aware text-to-speech synthesis.
- A dynamic importance loss reweights training instances according to feature distribution differences between synthesized and real speech.
- A multimodal large language model combines textual features with speech signals for a more comprehensive assessment of speaking proficiency.
- Experiments on the LTTC dataset show that the method effectively mitigates low-resource constraints.
- The approach enables ASA on opinion expressions with cross-modal information, improving the accuracy and coverage of the assessment.