⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are only meant for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-12
End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs
Authors:Nam Luu, Ondřej Bojar
Speech Translation (ST) is a machine translation task that involves converting speech signals from one language to the corresponding text in another language; the task has two different approaches, namely the traditional cascade approach and the more recent end-to-end approach. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. Experiments with the English-to-German language pair show that our best model not only achieves better translation results than SeamlessM4T, a large foundational end-to-end, multi-modal translation model, but also matches the performance of a cascaded system of Whisper and NLLB, with a score gain of up to 8% on the $\text{COMET}^{\text{DA}}_{22}$ metric.
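For context on the reported metric: $\text{COMET}^{\text{DA}}_{22}$ appears to correspond to the public wmt22-comet-da checkpoint. Below is a hedged scoring example with the open-source unbabel-comet package; the mapping to the paper's exact evaluation setup is our assumption, and the sentences are toy data.

```python
# Scoring one hypothesis with COMET-22 (wmt22-comet-da), assuming this public
# checkpoint matches the COMET^DA_22 metric cited above.
# Requires `pip install unbabel-comet`.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{
    "src": "The cat sat on the mat.",        # English source
    "mt":  "Die Katze saß auf der Matte.",   # system output (German)
    "ref": "Die Katze saß auf der Matte.",   # human reference
}]
out = model.predict(data, batch_size=1, gpus=0)
print(out.system_score)  # close to 1.0 here, since hypothesis and reference match
```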
Paper and Project Links
Summary
This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) that performs Automatic Speech Recognition (ASR) and Speech Translation (ST) simultaneously. Experiments show that on the English-to-German language pair the model achieves better translation results than SeamlessM4T and matches a cascaded system of Whisper and NLLB, with a score gain of up to 8% on the $\text{COMET}^{\text{DA}}_{22}$ metric.
Key Takeaways
- Speech Translation (ST) is a machine translation task that converts speech signals in one language into text in another language.
- There are two main approaches: the traditional cascade approach and the more recent end-to-end approach.
- This paper proposes an end-to-end architecture that combines a pre-trained speech encoder with a Large Language Model (LLM); a minimal sketch of this coupling follows this list.
- The model performs Automatic Speech Recognition (ASR) and Speech Translation simultaneously.
- Experiments show that the model outperforms SeamlessM4T in translation quality on the English-to-German language pair.
- The model's performance matches that of the cascaded system, and it even does better on some metrics.
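The digest does not spell out the exact coupling, so the following is a minimal sketch of one common recipe for attaching a speech encoder to an LLM: encoder frames are stacked to shorten the sequence and linearly projected into the LLM's embedding space, where they act as a soft prefix before a task prompt ("transcribe" for ASR vs. "translate" for ST). All names and dimensions (SpeechLLMBridge, speech_dim, llm_dim, downsample) are hypothetical, not the authors'.

```python
# Hypothetical sketch: project pooled speech-encoder features into the LLM's
# embedding space so they can be prepended as a soft prefix.
import torch
import torch.nn as nn

class SpeechLLMBridge(nn.Module):
    def __init__(self, speech_dim=1024, llm_dim=4096, downsample=4):
        super().__init__()
        # Stack adjacent frames to shorten the speech sequence, then project.
        self.downsample = downsample
        self.proj = nn.Linear(speech_dim * downsample, llm_dim)

    def forward(self, speech_feats):  # (batch, frames, speech_dim)
        b, t, d = speech_feats.shape
        t = t - t % self.downsample   # drop ragged tail frames
        x = speech_feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(x)           # (batch, t // downsample, llm_dim)

# Toy usage: 2 utterances, 100 encoder frames each.
bridge = SpeechLLMBridge()
prefix = bridge(torch.randn(2, 100, 1024))
print(prefix.shape)  # torch.Size([2, 25, 4096]) -- fed to the LLM as a soft prefix
```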
Latent Speech-Text Transformer
Authors:Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le
Auto-regressive speech-text models are typically pre-trained on a large number of interleaved sequences of text tokens and raw speech encoded as speech tokens using vector quantization. These models have demonstrated state-of-the-art performance in speech-to-speech understanding and generation benchmarks, together with promising scaling laws, primarily enabled by the representational alignment between text and speech. Nevertheless, they suffer from shortcomings, partly owing to the disproportionately longer sequences of speech tokens in contrast to textual tokens. This results in a large compute imbalance between modalities during pre-training as well as during inference, and a potential hindrance to effectively aligning speech and text, ultimately translating to several orders of magnitude slower scaling laws. We introduce the Latent Speech-Text Transformer (LST), which makes pre-training speech-text models more data-efficient by dynamically and inexpensively aggregating speech tokens into latent speech patches. These patches serve as higher-level units that can either align with corresponding textual units to aid capability transfer or even encapsulate common speech sequences like silences to be more compute-efficient. We show that LST outperforms vanilla approaches on speech-to-speech as well as text-to-text benchmarks in both data- and compute-controlled settings, the former indicating more effective representational alignment and the latter indicating steeper scaling laws for speech-text models. On HellaSwag story completion, LST achieves 6.5% absolute gain in speech accuracy under compute-controlled training and 5.3% under data-controlled training, while also improving text performance. We will release our models, code, and the evaluation data to facilitate further research.
Paper and Project Links
PDF 16 pages, 13 figures
Summary
Auto-regressive speech-text models are pre-trained on interleaved text tokens and vector-quantized speech tokens, but speech sequences are disproportionately long, causing a compute imbalance across modalities and hindering speech-text alignment. The proposed Latent Speech-Text Transformer (LST) makes pre-training more data-efficient by dynamically and inexpensively aggregating speech tokens into higher-level latent speech patches, improving representational alignment and compute efficiency. LST outperforms vanilla approaches on speech-to-speech and text-to-text benchmarks in both data- and compute-controlled settings and achieves notable gains on HellaSwag story completion. The models, code, and evaluation data will be released to facilitate further research.
Key Takeaways
- Auto-regressive speech-text models are pre-trained on interleaved text and speech to align the two modalities.
- Raw speech is encoded into speech tokens via vector quantization.
- Compared with existing models, the proposed Latent Speech-Text Transformer (LST) improves data and compute efficiency by aggregating speech tokens into latent speech patches; a toy illustration of this aggregation follows this list.
- LST achieves more effective alignment between speech and text representations.
- LST performs strongly in both data- and compute-controlled settings, yielding steeper scaling for speech-text models.
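The abstract does not describe the patching algorithm in detail, so the toy function below only illustrates the compression idea under two loud assumptions: runs of repeated speech tokens (e.g., a long silence) collapse to a single unit, and the remainder is grouped into fixed-size windows. LST's actual aggregation is dynamic and learned; patchify and window are hypothetical names.

```python
# Toy illustration (not LST's actual algorithm) of shrinking a speech-token
# sequence into "patches": duplicate runs collapse, the rest is windowed.
import itertools

def patchify(tokens, window=4):
    # 1) collapse runs of identical tokens, e.g. a long silence -> one token
    deduped = [tok for tok, _run in itertools.groupby(tokens)]
    # 2) group the shortened sequence into fixed-size patches
    return [deduped[i:i + window] for i in range(0, len(deduped), window)]

speech_tokens = [7, 7, 7, 7, 12, 5, 5, 9, 3, 3, 3, 8, 1, 4]  # 14 raw tokens
print(patchify(speech_tokens))  # [[7, 12, 5, 9], [3, 8, 1, 4]] -- 2 patches
```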
MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance
Authors:Xingjian Zhao, Zhe Xu, Qinyuan Cheng, Zhaoye Fei, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Qinghui Gao, Ke Chen, Ruixiao Li, Mingshu Chen, Ruiming Wang, Wenbo Zhang, Yiyang Zhang, Donghua Yu, Yang Gao, Xiaogui Yang, Yitian Gong, Yuanfan Xu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance relative to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
Paper and Project Links
Summary
MOSS-Speech is a true speech-to-speech large language model that understands and generates speech directly, without relying on text guidance. The work combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of the pretrained text LLM while adding native speech capabilities. Experiments show state-of-the-art results in spoken question answering and speech-to-speech performance comparable to existing text-guided systems, while maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, the work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
Key Takeaways
- MOSS-Speech is a speech-to-speech language model that understands and generates speech directly, without relying on text guidance.
- MOSS-Speech combines a modality-based layer-splitting architecture with a frozen pre-training strategy; a sketch of this splitting follows this list.
- The model preserves the reasoning and knowledge of the pretrained text LLM while adding native speech capabilities.
- MOSS-Speech achieves state-of-the-art results in spoken question answering.
- Its speech-to-speech performance is comparable to that of existing text-guided systems.
- The model maintains competitive text performance while enabling expressive, efficient end-to-end speech interaction.
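The exact layer layout is not given in this digest, so the sketch below shows one plausible reading of "modality-based layer splitting with frozen pre-training": modality-specific entry and exit layers wrap a shared trunk that keeps the frozen weights of a pretrained text LLM. Layer counts and dimensions are illustrative, not the paper's.

```python
# Hedged sketch of modality-based layer splitting: a shared frozen trunk
# (standing in for pretrained text-LLM layers) with trainable per-modality
# entry/exit layers. All sizes are illustrative.
import torch
import torch.nn as nn

def block(dim):  # stand-in for one transformer layer
    return nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

dim = 512
speech_in, text_in = block(dim), block(dim)             # modality-specific entries
trunk = nn.Sequential(*[block(dim) for _ in range(4)])  # shared trunk, kept frozen
speech_out, text_out = block(dim), block(dim)           # modality-specific exits

for p in trunk.parameters():
    p.requires_grad = False  # "frozen pre-training": only modality layers train

x = torch.randn(2, 50, dim)          # dummy speech embeddings
y = speech_out(trunk(speech_in(x)))  # speech path through the shared trunk
print(y.shape)                       # torch.Size([2, 50, 512])
```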