Published: 2025-09-17

Updated: 2025-10-07

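⚠️ All summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for initial screening before reading a paper.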

2025-09-17 Update

FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

Authors: Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, A K M Mahbubur Rahman

Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology’s applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance in LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.
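
The third technique in the abstract, matching contextual and speech tokens within a local window, can be pictured with a toy sketch. Everything here is an assumption made for illustration: the function names, the proportional mapping from frame index to token index, and the use of cosine similarity as the matching score are not taken from the paper, whose actual procedure is not described in this digest.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def windowed_match(speech_frames, context_tokens, window=1):
    """Hypothetical helper: for each speech frame, return the index of the most
    similar context token among those near the frame's proportional position."""
    n_s, n_c = len(speech_frames), len(context_tokens)
    matches = []
    for i, frame in enumerate(speech_frames):
        # Map frame i onto the (usually shorter) context-token axis.
        center = (i * n_c) // n_s
        lo = max(0, center - window)
        hi = min(n_c, center + window + 1)
        # Search only inside the local window around that position.
        best = max(range(lo, hi), key=lambda j: cosine(frame, context_tokens[j]))
        matches.append(best)
    return matches

frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
print(windowed_match(frames, tokens, window=1))
```

Restricting the search to a window keeps the alignment roughly monotonic in time while still letting each frame pick its best nearby token. The matched token could then serve as a per-frame supervision target.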

Paper and project links

PDF

Summary

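This paper addresses speech tokenization and its role in speech language modeling. Existing neural codecs focus mainly on low-level acoustic features and overlook the semantic and contextual information in speech. In response, the paper proposes FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. It introduces three complementary techniques: Latent Representation Fusion, Global Semantic-Contextual Supervision, and Temporally Aligned Contextual Supervision. Experiments show that FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC.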

Key Takeaways

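Speech tokenization is central to speech language modeling, but existing methods focus mainly on low-level acoustic features and overlook semantic and contextual information.
FuseCodec unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision.
FuseCodec proposes three complementary techniques: Latent Representation Fusion, Global Semantic-Contextual Supervision, and Temporally Aligned Contextual Supervision.
FuseCodec achieves state-of-the-art performance on LibriSpeech in transcription accuracy, perceptual quality, intelligibility, and speaker similarity.
FuseCodec-TTS demonstrates that the approach also applies to zero-shot speech synthesis.
FuseCodec's code and pretrained models are available on GitHub.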
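
The "globally pooled and broadcasted" supervision named in the abstract can be illustrated with a small sketch. The mean pooling, the cosine-distance loss, and all function names below are assumptions chosen for illustration; this digest does not state which pooling or loss the paper actually uses.

```python
import math

def mean_pool(frames):
    """Average a list of equal-length vectors over the time axis."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def global_supervision_loss(codec_frames, guidance_frames):
    """Hypothetical loss: 1 minus the mean cosine similarity between each codec
    frame and one pooled guidance vector broadcast to every frame."""
    pooled = mean_pool(guidance_frames)
    # Broadcasting: the same pooled vector is the target at every time step,
    # so the two sequences do not need to have the same length.
    broadcast = [pooled] * len(codec_frames)
    sims = [cosine(c, g) for c, g in zip(codec_frames, broadcast)]
    return 1.0 - sum(sims) / len(sims)

codec = [[1.0, 0.0], [1.0, 0.0]]
guidance = [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]]
print(global_supervision_loss(codec, guidance))
```

Because the guidance sequence is collapsed to a single vector before comparison, this form of supervision sidesteps the length mismatch between speech frames and text or semantic tokens, at the cost of discarding local timing information.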

Emoanti: audio anti-deepfake with refined emotion-guided representations

Authors: Xiaokang Li, Yicheng Gong, Dinghao Zou, Xin Cao, Sunbowen Lee

Audio deepfake is so sophisticated that the lack of effective detection methods is fatal. While most detection systems primarily rely on low-level acoustic features or pretrained speech representations, they frequently neglect high-level emotional cues, which can offer complementary and potentially anti-deepfake information to enhance generalization. In this work, we propose a novel audio anti-deepfake system that utilizes emotional features (EmoAnti) by exploiting a pretrained Wav2Vec2 (W2V2) model fine-tuned on emotion recognition tasks, which derives emotion-guided representations, then designing a dedicated feature extractor based on convolutional layers with residual connections to effectively capture and refine emotional characteristics from the transformer layers outputs. Experimental results show that our proposed architecture achieves state-of-the-art performance on both the ASVspoof2019LA and ASVspoof2021LA benchmarks, and demonstrates strong generalization on the ASVspoof2021DF dataset. Our proposed approach’s code is available at Anonymous GitHub.

Paper and project links

PDF

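Summary

To counter audio deepfakes, this paper proposes EmoAnti, a new audio anti-deepfake system that exploits emotional features. Emotion-guided representations are obtained from a pretrained Wav2Vec2 model fine-tuned on emotion recognition tasks. A dedicated feature extractor built from convolutional layers with residual connections then captures and refines emotional characteristics from the Transformer layer outputs. Experiments show state-of-the-art performance on the ASVspoof2019LA and ASVspoof2021LA benchmarks and strong generalization on the ASVspoof2021DF dataset. The code is available in an anonymous GitHub repository.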
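
For readers unfamiliar with the idea of a convolutional layer with a residual connection, the toy sketch below shows the pattern on a single-channel sequence. It is not EmoAnti's extractor: the function names, the ReLU activation, the odd-length kernel, and the kernel values are all hypothetical, and a real model would operate on multi-channel Wav2Vec2 hidden states with learned weights.

```python
def conv1d_same(x, kernel):
    """1D cross-correlation with zero padding so the output length equals the
    input length. Assumes an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[t + i] for i, k in enumerate(kernel))
        for t in range(len(x))
    ]

def residual_conv_block(x, kernel):
    """y = x + relu(conv(x)). The skip connection passes the input through
    unchanged, so the convolution only has to learn a refinement on top of it."""
    conv = conv1d_same(x, kernel)
    return [xi + max(0.0, ci) for xi, ci in zip(x, conv)]

features = [1.0, 2.0, 3.0]          # stand-in for one feature channel over time
smoothing = [0.25, 0.5, 0.25]       # hypothetical kernel weights
print(residual_conv_block(features, smoothing))
```

With an all-zero kernel the block reduces to the identity, which is the property that makes residual blocks easy to stack: each block can only add to what the previous layers already captured.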

Key Takeaways