TTS

Published: 2025-09-12

Updated: 2025-10-07

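⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are intended only for first-pass screening before reading the papers!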

Updated 2025-09-12

Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

Authors:Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Pérez, Laurent Mazaré, Alexandre Défossez

We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence models rely on learning a policy for choosing when to advance on the input stream, or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step, and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, being even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling

Paper and project links

PDF

Summary

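This paper introduces Delayed Streams Modeling (DSM), a flexible framework for streaming, multimodal sequence-to-sequence learning. Conventional sequence-to-sequence generation runs offline: the model consumes the complete input sequence before generating the first output timestep. DSM instead moves the time alignment between streams to a pre-processing step and introduces appropriate delays between them, which enables streaming inference of arbitrary output sequences and makes it applicable to many sequence-to-sequence problems. On the two major tasks studied, automatic speech recognition (ASR) and text-to-speech (TTS), DSM achieves state-of-the-art performance and latency, supports arbitrarily long sequences, and is competitive with offline baselines.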

Key Takeaways

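Delayed Streams Modeling (DSM) is a flexible framework for streaming, multimodal sequence-to-sequence learning.
Conventional sequence-to-sequence generation is offline; DSM performs streaming inference and can produce arbitrary output sequences from any combination of inputs.
DSM moves time alignment to a pre-processing step and models the already-aligned streams with a decoder-only language model.
The delay between streams determines the task: delaying the text stream gives ASR, while delaying the audio stream gives TTS.
On ASR and TTS, DSM achieves state-of-the-art performance and latency.
DSM supports arbitrarily long sequences and is competitive with offline baselines.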
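
The delay idea can be illustrated with a toy sketch. This is not the authors' implementation: the `PAD` symbol, the helper names `delay_stream` and `align_with_delay`, and the string "frames" are all assumptions made for illustration. It only shows how shifting one of two time-aligned streams changes which stream the model sees first at each frame.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical filler for frames where a stream has nothing to emit


def delay_stream(stream: List[str], delay: int) -> List[str]:
    """Shift a frame-aligned stream to the right by `delay` frames."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    return [PAD] * delay + list(stream)


def align_with_delay(
    audio: List[str], text: List[str], delayed: str, delay: int
) -> List[Tuple[str, str]]:
    """Return per-frame (audio, text) pairs with one stream delayed.

    delayed="text"  -> text lags audio (ASR-like setup)
    delayed="audio" -> audio lags text (TTS-like setup)
    """
    if len(audio) != len(text):
        raise ValueError("streams must already be time-aligned (same length)")
    if delayed == "text":
        text = delay_stream(text, delay)
        audio = list(audio) + [PAD] * delay  # pad the tail so lengths match
    elif delayed == "audio":
        audio = delay_stream(audio, delay)
        text = list(text) + [PAD] * delay
    else:
        raise ValueError("delayed must be 'text' or 'audio'")
    return list(zip(audio, text))


# Toy, already-aligned streams: one entry per frame.
audio_frames = ["a0", "a1", "a2", "a3"]
text_frames = ["hi", PAD, "there", PAD]

asr_like = align_with_delay(audio_frames, text_frames, delayed="text", delay=2)
tts_like = align_with_delay(audio_frames, text_frames, delayed="audio", delay=2)
print(asr_like)
print(tts_like)
```

In the ASR-like layout each text token appears two frames after the audio it belongs to, so a model predicting it has already seen that audio; the TTS-like layout is the mirror image.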

Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching

Authors:Siratish Sakpiboonchit

This paper presents a method to accelerate the inference process of diffusion transformer (DiT)-based text-to-speech (TTS) models by applying a selective caching mechanism to transformer layers. Specifically, I integrate SmoothCache into the F5-TTS architecture, focusing on caching outputs of self-attention and feed-forward network layers to reduce redundant computations during the denoising process. A calibration phase is introduced to analyze L1 relative errors between timesteps, guiding the selection of cache schedules that minimize quality degradation. To address the problem of inter-layer dependency, a unified caching schedule is adopted, applying the cache pattern derived from self-attention layers to both layer types. Experiments on LibriSpeech-PC and Seed-TTS datasets evaluate various cache thresholds and denoising step configurations. Results show that caching at higher denoising steps reduces inference time without compromising output quality, whereas caching at lower steps can negatively impact synthesis quality similarly to reducing the total number of denoising steps. Objective and subjective metrics confirm the effectiveness of SmoothCache in maintaining performance while improving computational efficiency. Comparisons between cached inference and reduced-step inference further highlight the benefits of selective caching, especially under high-step configurations. This work demonstrates that transformer layer caching is a practical solution for optimizing diffusion transformer-based TTS models without requiring architectural changes or retraining. Example inference results can be heard at https://siratish.github.io/F5-TTS_SmoothCache/ .

Paper and project links

PDF 9 pages, 2 tables, 5 figures

Summary

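This paper presents a method for accelerating inference in diffusion transformer (DiT)-based text-to-speech (TTS) models by integrating SmoothCache into the F5-TTS architecture. A selective caching mechanism reduces redundant computation, a calibration phase analyzes L1 relative errors between timesteps, and the resulting cache schedule is chosen to balance computational efficiency against output quality. Experiments show that caching at higher denoising steps shortens inference time without compromising output quality, whereas caching at lower steps can degrade synthesis quality. The work shows that selective caching is a practical way to optimize diffusion transformer-based TTS models without architectural changes or retraining.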

Key Takeaways

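The paper integrates SmoothCache into the F5-TTS architecture to accelerate inference in diffusion transformer-based TTS models.
Selectively caching the outputs of self-attention and feed-forward network layers reduces redundant computation during denoising.
A calibration phase analyzes L1 relative errors between timesteps to guide the choice of cache schedule.
Caching at higher denoising steps shortens inference time without compromising output quality.
Caching at lower denoising steps can degrade synthesis quality, much like reducing the total number of denoising steps.
Objective and subjective metrics confirm that SmoothCache maintains performance while improving computational efficiency.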
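
The calibration step can be sketched roughly as follows. This is a simplified illustration, not the SmoothCache or F5-TTS code: the function names, the choice to normalize by the current output's L1 norm, and the rule of comparing each step against the last recomputed output are assumptions. Layer outputs are plain lists of floats so the sketch needs no dependencies.

```python
from typing import List, Sequence


def l1_relative_error(current: Sequence[float], reference: Sequence[float]) -> float:
    """L1 distance between two layer outputs, relative to the current output's L1 norm."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(c) for c in current)
    return num / den if den > 0 else float("inf")


def cache_schedule(outputs: List[Sequence[float]], threshold: float) -> List[bool]:
    """Decide, per denoising step, whether to reuse the cached layer output.

    True at step t means "reuse the output of the last recomputed step";
    False means "recompute the layer at this step".
    """
    if not outputs:
        return []
    schedule = [False]  # the first step has nothing cached, so always compute
    last_computed = outputs[0]
    for current in outputs[1:]:
        reuse = l1_relative_error(current, last_computed) < threshold
        schedule.append(reuse)
        if not reuse:
            last_computed = current  # this step refreshes the cache
    return schedule


# Toy calibration data: one self-attention output per denoising step.
calibration_outputs = [[1.0, 2.0], [1.05, 2.0], [1.5, 3.0], [1.5, 3.1]]
schedule = cache_schedule(calibration_outputs, threshold=0.1)
print(schedule)
```

Under the unified schedule described in the abstract, the pattern derived from the self-attention outputs would then be applied to the feed-forward layers as well; a lower threshold caches fewer steps and trades speed for fidelity.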

TextlessRAG: End-to-End Visual Document RAG by Speech Without Text

Authors:Peijin Xie, Shun Qian, Bingquan Liu, Dexin Wang, Lin Sun, Xiangzheng Zhang

Document images encapsulate a wealth of knowledge, while the portability of spoken queries enables broader and more flexible application scenarios. Yet, no prior work has explored knowledge base question answering over visual document images with queries provided directly in speech. We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. To further boost performance, we integrate a layout-aware reranking mechanism to refine retrieval. Experiments demonstrate substantial improvements in both efficiency and accuracy. To advance research in this direction, we also release the first bilingual speech–document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content. Both the dataset and our pipeline will be made available at repository: https://github.com/xiepeijinhit-hue/textlessrag

Paper and project links

PDF 5 pages, 4 figures

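Summary

Document images hold a wealth of knowledge, and the portability of spoken queries enables broader and more flexible application scenarios. Yet no prior work has explored knowledge base question answering over visual document images with queries given directly in speech. The paper proposes TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS, and OCR: it interprets speech directly, retrieves relevant visual knowledge, and generates answers in a fully textless pipeline. A layout-aware reranking mechanism refines retrieval and further boosts performance. Experiments demonstrate substantial improvements in both efficiency and accuracy. The authors also release the first bilingual speech-document RAG dataset, which pairs Chinese and English voice queries with multimodal document content. The dataset and pipeline will be made available at https://github.com/xiepeijinhit-hue/textlessrag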

Key Takeaways

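Document images hold a wealth of knowledge, and spoken queries make application scenarios broader and more flexible.
No prior work has addressed knowledge base question answering over visual document images with queries given directly in speech.
TextlessRAG is the first end-to-end framework for speech-based question answering over large-scale document images.
Unlike prior methods, TextlessRAG needs no ASR, TTS, or OCR; it interprets speech and retrieves visual knowledge directly.
A layout-aware reranking mechanism refines retrieval.
Experiments show substantial improvements in both efficiency and accuracy.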

Trust but Verify! A Survey on Verification Design for Test-time Scaling

Authors:V Venktesh, Mandeep Rathee, Avishek Anand

Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. In test-time scaling, by using more computational resources during inference, LLMs can improve their reasoning process and task performance. Several approaches have emerged for TTS such as distilling reasoning traces from another model or exploring the vast decoding search space by employing a verifier. The verifiers serve as reward models that help score the candidate outputs from the decoding process to diligently explore the vast solution space and select the best outcome. This paradigm commonly termed has emerged as a superior approach owing to parameter free scaling at inference time and high performance gains. The verifiers could be prompt-based, fine-tuned as a discriminative or generative model to verify process paths, outcomes or both. Despite their widespread adoption, there is no detailed collection, clear categorization and discussion of diverse verification approaches and their training mechanisms. In this survey, we cover the diverse approaches in the literature and present a unified view of verifier training, types and their utility in test-time scaling. Our repository can be found at https://github.com/elixir-research-group/Verifierstesttimescaling.github.io.

Paper and project links

PDF 18 pages

Summary

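Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of large language models (LLMs). By spending more computational resources during inference, LLMs can improve their reasoning process and task performance. Several TTS approaches have emerged, such as distilling reasoning traces from another model or exploring the vast decoding search space with a verifier. Verifiers serve as reward models that score candidate outputs from the decoding process, so that the solution space can be explored carefully and the best outcome selected. Despite their widespread adoption, a detailed collection, clear categorization, and discussion of the diverse verification approaches and their training mechanisms has been lacking. This survey covers the approaches in the literature and presents a unified view of verifier training, verifier types, and their utility in test-time scaling.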

Key Takeaways

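Test-time scaling (TTS) is a new way to scale LLM performance: using more compute at inference time improves reasoning and task performance.
TTS approaches include distilling reasoning traces from another model and exploring the decoding search space with a verifier.
Verifiers act as reward models that score candidate outputs from the decoding process so that the best one can be selected.
Verifiers can be prompt-based, or fine-tuned as discriminative or generative models, and can verify process paths, outcomes, or both.
Despite the wide adoption of verifiers, a detailed collection, categorization, and discussion of verification approaches and their training mechanisms has been lacking.
The survey presents a unified view of verifier training, verifier types, and their utility in test-time scaling.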
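
Verifier-guided selection can be sketched in its simplest form as best-of-N: score each candidate and keep the highest-scoring one. This is a generic illustration and not a method taken from the survey; `best_of_n` and `toy_verifier` are hypothetical names, and the verifier here is a hand-written outcome check standing in for a trained reward model.

```python
from typing import Callable, List, Tuple


def best_of_n(
    candidates: List[str], verifier: Callable[[str], float]
) -> Tuple[str, float]:
    """Score every candidate with the verifier and return the top one with its score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(verifier(c), c) for c in candidates]
    # max() keeps the first candidate on ties, so ordering acts as a tie-breaker.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_verifier(answer: str) -> float:
    """Hypothetical outcome verifier: rewards answers that end with the digit 4."""
    return 1.0 if answer.strip().endswith("4") else 0.0


# In practice these would be N samples decoded from an LLM for the same prompt.
candidates = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
best, score = best_of_n(candidates, toy_verifier)
print(best, score)
```

A process verifier would instead score intermediate reasoning steps, which allows weak partial paths to be pruned during search rather than only ranking finished outputs.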

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-09-12/TTS/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!
