TTS

Published: 2025-11-20

Updated: 2025-11-27


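⚠️ All summaries below were generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not use them in serious academic settings; they are meant only for initial screening before reading the papers.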

Updated 2025-11-20

TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation

Authors: Wei Liu, Jiahong Li, Yiwen Shao, Dong Yu

Speech-LLM models have demonstrated great performance in multi-modal and multi-task speech understanding. A typical speech-LLM paradigm is integrating speech modality with a large language model (LLM). While the Whisper encoder was frequently adopted in previous studies for speech input, it shows limitations regarding input format, model scale, and semantic performance. To this end, we propose a lightweight TTA model specialized in speech semantics for more effective LLM integration. With large-scale training of 358k hours of speech data on multilingual speech recognition (ASR), speech translation (ST) and speech-text alignment tasks, TTA is capable of producing robust cross-lingual speech representations. Extensive evaluations across diverse benchmarks, including ASR/ST, speech retrieval, and ASR-LLM performance assessments, demonstrate TTA’s superiority over Whisper. Furthermore, we rigorously validate the interplay between cross-lingual capabilities and ASR/ST performance. The model weights and training recipes of TTA will be released as part of an audio understanding toolkit Auden.


Paper and project links

PDF Submitted to ICASSP2026

Summary

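This paper builds on the strong performance of speech-LLM models in multi-modal, multi-task speech understanding. To address the limitations of the widely used Whisper encoder in input format, model scale, and semantic performance, it proposes TTA, a lightweight model that is trained at large scale on multilingual speech recognition (ASR), speech translation (ST), and speech-text alignment, and that produces robust cross-lingual speech representations. In extensive evaluations covering ASR/ST, speech retrieval, and ASR-LLM assessments, TTA outperforms Whisper. The paper also validates the interplay between cross-lingual capability and ASR/ST performance. The TTA model weights and training recipes will be released as part of the audio understanding toolkit Auden.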

Key Takeaways

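Speech-LLM models perform strongly in multi-modal, multi-task speech understanding.
The Whisper encoder has limitations as a speech input front end, specifically in input format, model scale, and semantic performance.
TTA is a lightweight model specialized in speech semantics, designed for more effective LLM integration.
TTA is trained on 358k hours of speech data across multilingual speech recognition (ASR), speech translation (ST), and speech-text alignment tasks, and produces robust cross-lingual speech representations.
TTA outperforms Whisper across several benchmarks, including ASR/ST, speech retrieval, and ASR-LLM performance assessments.
The paper also validates the interplay between cross-lingual capability and ASR/ST performance.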
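
As a rough illustration of what a speech retrieval evaluation over aligned speech and text embeddings can look like, the sketch below ranks candidate texts by cosine similarity and reports recall@1. This is not the paper's evaluation code: the metric choice, the function names (`cosine`, `retrieve`, `recall_at_1`), and the hand-made toy vectors are all assumptions for illustration, and the vectors are not produced by TTA or any real model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(speech_vec, text_vecs):
    """Return the index of the text embedding most similar to the speech embedding."""
    scores = [cosine(speech_vec, t) for t in text_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

def recall_at_1(speech_vecs, text_vecs):
    """Fraction of utterances whose paired text (same index) is ranked first."""
    hits = sum(retrieve(s, text_vecs) == i for i, s in enumerate(speech_vecs))
    return hits / len(speech_vecs)

# Toy, hand-made embeddings (hypothetical; not model outputs).
speech = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(recall_at_1(speech, text))
```

In a cross-lingual setting, the same ranking would presumably be applied with speech in one language and text in another, which is where well-aligned representations would matter most.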


StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

Authors: Yifan Yang, Zhi Cen, Sida Peng, Xiangwei Chen, Yifu Deng, Xinyu Zhu, Fan Jia, Xiaowei Zhou, Hujun Bao

This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the whole audio sequences in a single pass, which poses two major challenges: they tend to perform poorly when handling audio sequences that exceed the training horizon and will suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implemented a real-time interactive demo, highlighting the effectiveness and efficiency of our approach. We will release the code at https://zju3dv.github.io/StreamingTalker/.


Paper and project links

PDF

Summary

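This paper proposes a streaming, audio-driven facial animation method built on an autoregressive diffusion model. The method stays flexible across audio inputs of different lengths and generates facial animation with low latency. It combines historical motion context with the audio input to form a dynamic condition that guides the diffusion process as it generates facial motion frames, enabling high-quality real-time synthesis.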

Key Takeaways

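The paper addresses speech-driven 3D facial animation, which aims to generate realistic, synchronized facial motion from speech input.
Recent methods use audio-conditioned diffusion models for 3D facial animation and produce expressive, natural animations.
Existing methods perform poorly on audio sequences that exceed the training horizon and incur significant latency on long audio inputs.
The paper proposes an autoregressive diffusion model that processes input audio in a streaming manner, handling varying audio lengths with low latency that is independent of audio duration.
The method selects a limited number of past frames as historical motion context and combines them with the audio input into a dynamic condition that guides the generation of facial motion frames.
The method achieves high-quality real-time synthesis, and the authors implemented a real-time interactive demo.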
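
The streaming idea described above can be sketched as a loop that keeps only a bounded window of past frames, builds a condition from that window plus the current audio feature, and runs a fixed number of denoising steps per frame. This is a minimal toy sketch, not the authors' implementation: frames are single floats rather than 3D face parameters, `toy_denoiser` stands in for a learned network, and `HISTORY`, `DENOISE_STEPS`, and the averaging in `make_condition` are assumed values and choices made only for illustration.

```python
from collections import deque
import random

HISTORY = 4        # number of past motion frames kept as context (assumed value)
DENOISE_STEPS = 8  # denoising steps per generated frame (assumed value)

def toy_denoiser(x, condition, step):
    """Stand-in for a learned denoiser: moves x toward the condition."""
    return x + (condition - x) / (step + 1)

def make_condition(history, audio_feat):
    """Combine the recent motion history with the current audio feature."""
    past = sum(history) / len(history) if history else 0.0
    return 0.5 * past + 0.5 * audio_feat

def stream_frames(audio_feats, seed=0):
    """Yield one motion value per audio feature, using only a bounded history."""
    rng = random.Random(seed)
    history = deque(maxlen=HISTORY)  # oldest frames are dropped automatically
    for audio_feat in audio_feats:
        condition = make_condition(history, audio_feat)
        x = rng.gauss(0.0, 1.0)  # each frame starts from noise
        for step in reversed(range(DENOISE_STEPS)):
            x = toy_denoiser(x, condition, step)
        history.append(x)  # the new frame becomes context for the next one
        yield x

# Works the same way for short or long inputs; per-frame work is constant.
frames = list(stream_frames([1.0] * 10))
print(len(frames))
```

Because the history window and the number of denoising steps are both fixed, the work per frame in this sketch does not grow with the length of the audio, which mirrors the low-latency property the abstract claims for the streaming design.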


Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-11-20/TTS/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.
