TTS

Published: 2025-09-17

Updated: 2025-10-07

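⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for initial screening before reading a paper.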

2025-09-17 update

FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

Authors:Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, A K M Mahbubur Rahman

Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology’s applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance in LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.

Paper and project links

PDF

Summary

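This paper highlights the importance of speech tokenization and the limitations of existing neural codecs. To address them, it proposes FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. It introduces three complementary techniques: latent representation fusion, global semantic-contextual supervision, and temporally aligned contextual supervision. FuseCodec-TTS demonstrates that the approach applies to zero-shot speech synthesis. On LibriSpeech, FuseCodec surpasses EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity, achieving state-of-the-art performance.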

Key Takeaways

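Speech tokenization matters for speech language modeling, but existing neural codecs capture low-level acoustic features and overlook semantic and contextual cues.
FuseCodec unifies acoustic, semantic, and contextual representations in a single learned representation.
FuseCodec proposes three complementary techniques: latent representation fusion, global semantic-contextual supervision, and temporally aligned contextual supervision.
FuseCodec-TTS demonstrates that the approach applies to zero-shot speech synthesis.
On LibriSpeech, FuseCodec surpasses EnCodec, SpeechTokenizer, and DAC, reaching state-of-the-art transcription accuracy, perceptual quality, intelligibility, and speaker similarity.
The results highlight the value of contextually and semantically guided tokenization for speech tokenization and downstream tasks.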
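The "globally pooled and broadcasted" supervision described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the choice of mean pooling, the cosine-based loss, and the names `mean_pool`, `cosine`, and `global_supervision_loss` are all assumptions made for illustration. The point it shows is that pooling removes the length mismatch between a guidance sequence (for example, text-side features) and the speech token sequence, so one global vector can supervise every timestep.

```python
import math


def mean_pool(frames):
    """Average a list of equal-length vectors over the time axis."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def global_supervision_loss(token_embeddings, guidance_frames):
    """Hypothetical loss: 1 - mean cosine similarity between each token
    embedding and the pooled guidance vector broadcast to every timestep."""
    pooled = mean_pool(guidance_frames)            # one global vector
    broadcast = [pooled] * len(token_embeddings)   # repeat it per timestep
    sims = [cosine(t, g) for t, g in zip(token_embeddings, broadcast)]
    return 1.0 - sum(sims) / len(sims)


# Two guidance frames supervise three speech tokens of a different length.
guidance = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
print(global_supervision_loss(tokens, guidance))
```

In a real codec the embeddings would be tensors and the loss would be backpropagated through the quantizer; the sketch only shows the pool-then-broadcast shape handling.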


Length-Aware Rotary Position Embedding for Text-Speech Alignment

Authors:Hyeongju Kim, Juheon Lee, Jinhyeok Yang, Jacob Morton

Many recent text-to-speech (TTS) systems are built on transformer architectures and employ cross-attention mechanisms for text-speech alignment. Within these systems, rotary position embedding (RoPE) is commonly used to encode positional information in text and speech representations. In this work, we introduce length-aware RoPE (LARoPE), a simple yet effective extension of RoPE that improves text-speech alignment. Unlike RoPE, which relies on absolute indices, LARoPE computes relative distances between query and key positions using length-normalized indices. Experimental results show that LARoPE consistently outperforms RoPE, offering faster loss convergence, more accurate text-speech alignment, and higher overall TTS quality. Furthermore, LARoPE demonstrates greater resilience to variations in utterance duration and maintains stable performance in extended speech generation up to 30 seconds, whereas RoPE suffers from notable degradation. Notably, our method achieves a state-of-the-art word error rate on a standard zero-shot TTS benchmark.

Paper and project links

PDF 5 pages, 3 figures, preprint

Summary

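Transformer-based text-to-speech (TTS) systems commonly use cross-attention for text-speech alignment, with rotary position embedding (RoPE) encoding positional information in the text and speech representations. This paper introduces length-aware RoPE (LARoPE), a simple yet effective extension of RoPE. Unlike RoPE, which relies on absolute indices, LARoPE computes relative distances between query and key positions using length-normalized indices, which improves text-speech alignment. Experiments show that LARoPE consistently outperforms RoPE, with faster loss convergence, more accurate text-speech alignment, and higher overall TTS quality. LARoPE is also more resilient to variations in utterance duration and remains stable in extended speech generation of up to 30 seconds, whereas RoPE degrades notably. The method achieves a state-of-the-art word error rate on a standard zero-shot TTS benchmark.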

Key Takeaways

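Transformer-based TTS systems use cross-attention for text-speech alignment.
Rotary position embedding (RoPE) encodes positional information in the text and speech representations.
LARoPE extends RoPE by computing relative query-key distances from length-normalized indices rather than absolute indices, which improves text-speech alignment.
Because the indices are length-normalized, LARoPE is more resilient to variations in utterance duration.
Experiments show that LARoPE outperforms RoPE, with faster loss convergence, more accurate alignment, and higher overall TTS quality.
LARoPE remains stable in extended speech generation of up to 30 seconds, whereas RoPE degrades notably.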
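The difference between absolute and length-normalized indices can be sketched in a few lines. The code below is an illustration under assumptions, not the paper's implementation: the normalization formula `scale * index / length`, the `scale` constant, and the function names are hypothetical. It shows that with normalized positions, a speech frame halfway through the utterance and a text token halfway through the sentence receive the same rotation, so their attention score is unaffected by the two sequences having different lengths.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Standard rotary embedding: rotate each (even, odd) pair of
    dimensions by position * frequency."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        freq = base ** (-i / dim)
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def normalized_position(index, length, scale=10.0):
    """Hypothetical length-aware index: map position to [0, scale)."""
    return scale * index / length


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 0.0, 0.5, 0.5]  # query at speech frame 150 of 300
k = [1.0, 0.0, 0.5, 0.5]  # key at text token 10 of 20

# Absolute indices: the relative offset is 140, although both sit at 50%.
score_abs = dot(rope_rotate(q, 150), rope_rotate(k, 10))

# Length-normalized indices: both map to the same position, offset 0.
score_norm = dot(
    rope_rotate(q, normalized_position(150, 300)),
    rope_rotate(k, normalized_position(10, 20)),
)
print(score_abs, score_norm, dot(q, k))
```

With normalized indices the score equals the unrotated dot product because both vectors are rotated by the same angle; with absolute indices the large offset between frame and token counts changes the score even though the two positions correspond.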


Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes

Authors:Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Qingming Li, Tianyu Du, Shouling Ji

The rapid advancement of voice deepfake technologies has raised serious concerns about user audio privacy, as attackers increasingly exploit publicly available voice data to generate convincing fake audio for malicious purposes such as identity theft, financial fraud, and misinformation campaigns. While existing defense methods offer partial protection, they face critical limitations, including weak adaptability to unseen user data, poor scalability to long audio, rigid reliance on white-box knowledge, and high computational and temporal costs during the encryption process. To address these challenges and defend against personalized voice deepfake threats, we propose Enkidu, a novel user-oriented privacy-preserving framework that leverages universal frequential perturbations generated through black-box knowledge and few-shot training on a small amount of user data. These highly malleable frequency-domain noise patches enable real-time, lightweight protection with strong generalization across variable-length audio and robust resistance to voice deepfake attacks, all while preserving perceptual quality and speech intelligibility. Notably, Enkidu achieves over 50 to 200 times processing memory efficiency (as low as 0.004 gigabytes) and 3 to 7000 times runtime efficiency (real-time coefficient as low as 0.004) compared to six state-of-the-art countermeasures. Extensive experiments across six mainstream text-to-speech models and five cutting-edge automated speaker verification models demonstrate the effectiveness, transferability, and practicality of Enkidu in defending against both vanilla and adaptive voice deepfake attacks. Our code is currently available.

Paper and project links

PDF Accepted by ACM MM 2025, Open-sourced

Summary
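Rapid advances in voice deepfake technology have raised serious concerns about user audio privacy. Attackers increasingly exploit publicly available voice data to generate convincing fake audio for malicious purposes such as identity theft, financial fraud, and misinformation campaigns. To address these challenges and defend against personalized voice deepfake threats, the authors propose Enkidu, a user-oriented privacy-preserving framework. It uses universal frequency-domain perturbations generated with black-box knowledge and few-shot training on a small amount of user data. These highly malleable frequency-domain noise patches enable real-time, lightweight protection that generalizes across variable-length audio and robustly resists voice deepfake attacks while preserving perceptual quality and speech intelligibility. Compared with six state-of-the-art countermeasures, Enkidu is 50 to 200 times more memory-efficient and 3 to 7000 times more runtime-efficient. Extensive experiments demonstrate its effectiveness, transferability, and practicality against both vanilla and adaptive voice deepfake attacks. The code is publicly available.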
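One way a fixed-size noise patch could cover audio of any length is to repeat it cyclically along the time axis of a spectrogram. The sketch below illustrates that idea only. It is an assumption about how such a patch might be applied, not Enkidu's actual procedure, and `apply_universal_patch`, `strength`, and the list-of-frames spectrogram layout are hypothetical.

```python
def apply_universal_patch(spec_frames, patch, strength=1.0):
    """Add a fixed patch (a short list of per-bin noise frames) cyclically
    along the time axis of a spectrogram of arbitrary length."""
    period = len(patch)
    protected = []
    for t, frame in enumerate(spec_frames):
        noise = patch[t % period]  # reuse the same patch every `period` frames
        protected.append([x + strength * n for x, n in zip(frame, noise)])
    return protected


# A 2-frame patch applied to a 5-frame spectrogram with 2 frequency bins.
patch = [[0.1, 0.0], [0.0, 0.1]]
spectrogram = [[1.0, 1.0] for _ in range(5)]
for frame in apply_universal_patch(spectrogram, patch):
    print(frame)
```

Because the patch is precomputed and applied with one addition per bin, the per-frame cost is constant, which is consistent with the real-time, low-memory behavior the abstract reports, although the abstract does not specify this mechanism.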

Key Takeaways