
Speech


⚠️ All summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: do not use these summaries in serious academic settings; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-10-23

MLMA: Towards Multilingual ASR with Mamba-Based Architectures

Authors:Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti

Multilingual automatic speech recognition (ASR) remains a challenging task, especially when balancing performance across high- and low-resource languages. Recent advances in sequence modeling suggest that architectures beyond Transformers may offer better scalability and efficiency. In this work, we introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new approach that leverages the Mamba architecture – an efficient state-space model optimized for long-context sequence processing – for multilingual ASR. Using Mamba, MLMA implicitly incorporates language-aware conditioning and shared representations to support robust recognition across diverse languages. Experiments on standard multilingual benchmarks show that MLMA achieves competitive performance compared to Transformer-based architectures. These results highlight Mamba’s potential as a strong backbone for scalable, efficient, and accurate multilingual speech recognition.
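The abstract presents Mamba as an efficient state-space model for long-context sequence processing. As background, here is a minimal sketch of the selective state-space recurrence that Mamba-style layers build on; the shapes, parameter names, and the sequential Python loop are illustrative assumptions (real Mamba layers use a hardware-aware parallel scan), and this is not the MLMA implementation.

```python
# Minimal sketch of a Mamba-style selective state-space recurrence.
# Illustrative only: the sequential loop stands in for the parallel scan
# used in practice, and all shapes/names are assumptions.
import torch
import torch.nn.functional as F

def selective_ssm_scan(x, A, W_delta, W_B, W_C):
    """x: (T, D) input sequence; A: (D, N) state matrix (negative for stability).
    Returns y: (T, D). Each of the D channels carries a size-N hidden state."""
    T, D = x.shape
    N = A.shape[1]
    h = torch.zeros(D, N)
    ys = []
    for t in range(T):
        xt = x[t]                                 # (D,)
        delta = F.softplus(xt @ W_delta)          # (D,) input-dependent step size ("selective")
        B = xt @ W_B                              # (N,) input-dependent input projection
        C = xt @ W_C                              # (N,) input-dependent output projection
        A_bar = torch.exp(delta[:, None] * A)     # (D, N) zero-order-hold discretization of A
        # State update with the common Euler simplification delta * B for the input path.
        h = A_bar * h + (delta[:, None] * B[None, :]) * xt[:, None]
        ys.append(h @ C)                          # (D,) readout for this frame
    return torch.stack(ys)

# Toy usage: 100 frames of 16-dim features, state size 8 per channel.
T, D, N = 100, 16, 8
x = torch.randn(T, D)
A = -torch.rand(D, N)                             # negative entries keep exp(delta * A) < 1
y = selective_ssm_scan(x, A, torch.randn(D, D), torch.randn(D, N), torch.randn(D, N))
print(y.shape)  # torch.Size([100, 16])
```

The point of the sketch is the linear-time scan: each frame updates a fixed-size state, so cost grows linearly with sequence length rather than quadratically as in self-attention, which is the scalability argument the abstract makes.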


Paper and Project Links

PDF The paper is under review at ICASSP 2026

Summary

Building on recent advances in sequence modeling, this paper proposes using the Mamba architecture, an efficient state-space model optimized for long-context sequence processing, for multilingual automatic speech recognition (ASR). Experiments show that MLMA achieves competitive performance against Transformer-based architectures on standard multilingual benchmarks, highlighting Mamba's potential as a strong backbone for scalable, efficient, and accurate multilingual speech recognition.

Key Takeaways

  1. Multilingual automatic speech recognition (ASR) remains challenging, especially when balancing performance between high- and low-resource languages.
  2. Recent progress in sequence modeling suggests that architectures beyond Transformers may offer better scalability and efficiency.
  3. MLMA is introduced, which applies the Mamba architecture, well suited to long-context sequence processing, to multilingual ASR.
  4. Through language-aware conditioning and shared representations, the Mamba-based architecture supports robust recognition across diverse languages.
  5. Experiments show that MLMA is competitive on standard multilingual benchmarks.
  6. Mamba has the potential to serve as a strong backbone for scalable, efficient, and accurate multilingual speech recognition.

Cool Papers

Click here to view paper screenshots

ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation

Authors:Haowei Lou, Hye-Young Paik, Wen Hu, Lina Yao

Controlling speaking style in text-to-speech (TTS) systems has become a growing focus in both academia and industry. While many existing approaches rely on reference audio to guide style generation, such methods are often impractical due to privacy concerns and limited accessibility. More recently, large language models (LLMs) have been used to control speaking style through natural language prompts; however, their high computational cost, lack of interpretability, and sensitivity to prompt phrasing limit their applicability in real-time and resource-constrained environments. In this work, we propose ParaStyleTTS, a lightweight and interpretable TTS framework that enables expressive style control from text prompts alone. ParaStyleTTS features a novel two-level style adaptation architecture that separates prosodic and paralinguistic speech style modeling. It allows fine-grained and robust control over factors such as emotion, gender, and age. Unlike LLM-based methods, ParaStyleTTS maintains consistent style realization across varied prompt formulations and is well-suited for real-world applications, including on-device and low-resource deployment. Experimental results show that ParaStyleTTS generates high-quality speech with performance comparable to state-of-the-art LLM-based systems while being 30x faster, using 8x fewer parameters, and requiring 2.5x less CUDA memory. Moreover, ParaStyleTTS exhibits superior robustness and controllability over paralinguistic speaking styles, providing a practical and efficient solution for style-controllable text-to-speech generation. Demo can be found at https://parastyletts.github.io/ParaStyleTTS_Demo/. Code can be found at https://github.com/haoweilou/ParaStyleTTS.
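The core design claim is a two-level style adaptation architecture that separates prosodic from paralinguistic style modeling. Below is a minimal sketch of what such two-branch conditioning could look like; all module names and dimensions are hypothetical, and this is not the ParaStyleTTS implementation.

```python
# A minimal sketch of two-level style conditioning: one branch for prosodic
# style, one for paralinguistic attributes (emotion, gender, age). Names and
# sizes are hypothetical, not the ParaStyleTTS implementation.
import torch
import torch.nn as nn

class TwoLevelStyleAdapter(nn.Module):
    def __init__(self, prompt_dim=256, style_dim=128):
        super().__init__()
        # Level 1: prosodic style (e.g. rhythm and pitch-contour tendencies).
        self.prosody_head = nn.Sequential(nn.Linear(prompt_dim, style_dim), nn.Tanh())
        # Level 2: paralinguistic style (emotion / gender / age factors).
        self.paraling_head = nn.Sequential(nn.Linear(prompt_dim, style_dim), nn.Tanh())
        # Fuse both levels into one conditioning vector for the TTS decoder.
        self.fuse = nn.Linear(2 * style_dim, style_dim)

    def forward(self, prompt_emb):
        # prompt_emb: (B, prompt_dim) embedding of the text style prompt.
        prosody = self.prosody_head(prompt_emb)
        paraling = self.paraling_head(prompt_emb)
        return self.fuse(torch.cat([prosody, paraling], dim=-1))  # (B, style_dim)

# Toy usage: a batch of 4 style-prompt embeddings.
adapter = TwoLevelStyleAdapter()
style = adapter(torch.randn(4, 256))
print(style.shape)  # torch.Size([4, 128])
```

Separating the two levels means the paralinguistic branch can be supervised or inspected independently of prosody, which is one plausible source of the interpretability and robustness the abstract claims.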


Paper and Project Links

PDF

Summary

Controlling speaking style in text-to-speech (TTS) systems has become a focus of both academia and industry. Most existing approaches rely on reference audio to guide style generation, which is often impractical due to privacy concerns and limited accessibility. More recently, large language models (LLMs) have been used to control speaking style via natural-language prompts, but their high computational cost, lack of interpretability, and sensitivity to prompt phrasing limit their applicability in real-time and resource-constrained environments. This work proposes ParaStyleTTS, a lightweight and interpretable TTS framework that enables expressive style control from text prompts alone. ParaStyleTTS adopts a novel two-level style adaptation architecture that separates prosodic and paralinguistic style modeling, allowing fine-grained and robust control over factors such as emotion, gender, and age. Unlike LLM-based methods, ParaStyleTTS maintains consistent style realization across varied prompt formulations and is well suited to real-world applications, including on-device and low-resource deployment. Experimental results show that ParaStyleTTS generates high-quality speech with performance comparable to state-of-the-art LLM-based systems while being 30x faster, using 8x fewer parameters, and requiring 2.5x less CUDA memory. It also exhibits superior robustness and controllability over paralinguistic speaking styles, providing a practical and efficient solution for style-controllable text-to-speech generation. Demo: https://parastyletts.github.io/ParaStyleTTS_Demo/. Code: https://github.com/haoweilou/ParaStyleTTS.

Key Takeaways

  1. Speaking-style control in text-to-speech (TTS) systems has become a research focus.
  2. Current approaches rely on reference audio or large language models (LLMs) for style control.
  3. Reference-audio methods are impractical due to privacy and accessibility concerns.
  4. LLM-based methods can control style effectively but are computationally expensive, lack interpretability, and are limited in certain environments.
  5. ParaStyleTTS is a lightweight, interpretable TTS framework that controls style from text prompts alone.
  6. ParaStyleTTS uses a two-level style adaptation architecture that separates prosodic and paralinguistic style modeling, enabling fine-grained and robust style control.

Cool Papers

Click here to view paper screenshots

VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Authors:Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, Xing Sun

With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3~5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
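The efficiency claim rests on the MCTP module producing multiple audio tokens from a single backbone forward pass. The sketch below shows one plausible reading, with k lightweight heads over the backbone's last hidden state; the head count, sizes, and wiring are assumptions, not the VITA-Audio implementation.

```python
# A minimal sketch of the multi-token idea behind MCTP: several lightweight
# heads predict the next k audio tokens from one backbone hidden state, so
# k tokens cost a single backbone forward pass. Illustrative assumptions only.
import torch
import torch.nn as nn

class MultiTokenAudioHead(nn.Module):
    def __init__(self, hidden_dim=512, audio_vocab=1024, k=4):
        super().__init__()
        # One small projection head per lookahead position.
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, audio_vocab) for _ in range(k))

    def forward(self, h_last):
        # h_last: (B, hidden_dim) final backbone hidden state at the current step.
        # Returns logits for the next k audio tokens: (B, k, audio_vocab).
        return torch.stack([head(h_last) for head in self.heads], dim=1)

# Toy usage: a stand-in for the LLM's last hidden state, batch of 2.
backbone_state = torch.randn(2, 512)
mctp = MultiTokenAudioHead()
logits = mctp(backbone_state)
next_tokens = logits.argmax(dim=-1)    # greedy decode of 4 audio tokens at once
print(logits.shape, next_tokens.shape)  # torch.Size([2, 4, 1024]) torch.Size([2, 4])
```

If the backbone dominates compute, emitting k tokens per backbone pass cuts the number of expensive forward passes by roughly a factor of k, which is the mechanism behind both the reported speedup and the reduced time-to-first-audio.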


Paper and Project Links

PDF Training and Inference Codes: https://github.com/VITA-MLLM/VITA-Audio

Summary

With the growing demand for natural human-computer interaction, speech-based systems have drawn increasing attention, as speech is one of the most common forms of daily communication. To address the high latency that existing speech models incur when generating the first audio token during streaming, this paper proposes VITA-Audio, an end-to-end large speech model with fast audio-text token generation. By introducing a lightweight cross-modal token prediction module and a four-stage progressive training strategy, the model not only accelerates inference but also significantly reduces the latency to the first audio token in streaming scenarios. VITA-Audio is the first multi-modal large language model capable of producing audio output during its first forward pass, enabling real-time conversation with minimal latency.

Key Takeaways

  1. Speech models are drawing attention as the demand for natural human-computer interaction grows.
  2. Existing speech models suffer high latency when generating the first audio token.
  3. VITA-Audio is an end-to-end large speech model designed to address this latency.
  4. VITA-Audio achieves fast audio-text token generation through its MCTP module and a progressive training strategy.
  5. VITA-Audio is the first multi-modal large language model capable of producing audio output during its first forward pass.
  6. VITA-Audio enables real-time conversation with minimal latency.

Cool Papers

Click here to view paper screenshots

