TTS

Published: 2025-10-25

Updated: 2025-11-27

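⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in any serious academic setting; they are meant only for screening papers before reading them.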

2025-10-25 update

AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks

Authors: Fali Wang, Hui Liu, Zhenwei Dai, Jingying Zeng, Zhiwei Zhang, Zongyu Wu, Chen Luo, Zhen Li, Xianfeng Tang, Qi He, Suhang Wang

Test-time scaling (TTS) enhances the performance of large language models (LLMs) by allocating additional compute resources during inference. However, existing research primarily investigates TTS in single-stage tasks, while many real-world problems are multi-stage complex tasks composed of a sequence of heterogeneous subtasks, each of which requires an LLM with a specific capability. Therefore, we study a novel problem: the test-time compute-optimal scaling in multi-stage complex tasks, aiming to select suitable models and allocate budgets per subtask to maximize overall performance. TTS in multi-stage tasks introduces two fundamental challenges: (i) The combinatorial search space of model and budget allocations, combined with the high cost of inference, makes brute-force search impractical. (ii) The optimal model and budget allocations across subtasks are interdependent, increasing the complexity of the compute-optimal search. To address this gap, we conduct extensive pilot experiments on four tasks across six datasets, deriving three empirical insights characterizing the behavior of LLMs in multi-stage complex tasks. Informed by these insights, we propose AgentTTS, an LLM-agent-based framework that autonomously searches for compute-optimal allocations through iterative feedback-driven interactions with the execution environment. Experimental results demonstrate that AgentTTS significantly outperforms traditional and other LLM-based baselines in search efficiency, and shows improved robustness to varying training set sizes and enhanced interpretability.

Paper and project links

PDF Accepted by NeurIPS 2025

Summary

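This paper studies test-time scaling (TTS) in multi-stage complex tasks as a way to improve the performance of large language models (LLMs). Existing work focuses on TTS in single-stage tasks, whereas many real-world problems are multi-stage tasks made up of a sequence of heterogeneous subtasks, each requiring an LLM with a specific capability. Compute-optimal TTS in this setting faces two challenges: the combinatorial search space of model and budget allocations, together with the high cost of inference, makes brute-force search impractical; and the optimal model and budget allocations across subtasks are interdependent. Drawing on pilot experiments on four tasks across six datasets, the authors propose AgentTTS, an LLM-agent-based framework that autonomously searches for compute-optimal allocations through iterative, feedback-driven interaction with the execution environment. In the reported experiments, AgentTTS is significantly more search-efficient than traditional and other LLM-based baselines, is more robust to varying training set sizes, and is more interpretable.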

Key Takeaways

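Key takeaways from the paper:

Test-time scaling (TTS) can improve the performance of large language models (LLMs) on multi-stage complex tasks.
TTS in multi-stage tasks faces two challenges: the combinatorial search space of model and budget allocations, and the interdependence of the optimal allocations across subtasks.
Pilot experiments on four tasks across six datasets yield three empirical insights into how LLMs behave in multi-stage complex tasks.
The authors propose AgentTTS, an LLM-agent-based framework that searches for compute-optimal allocations through iterative, feedback-driven interaction with the execution environment.
AgentTTS is significantly more search-efficient than traditional and other LLM-based baselines.
AgentTTS is more robust to varying training set sizes.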
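
To make the search problem concrete, the following is a minimal sketch of the brute-force baseline that the abstract calls impractical. It is not the AgentTTS method. The model names, costs, subtask names, and the `toy_score` function are all hypothetical stand-ins; in a real setting each evaluation would be an expensive inference run.

```python
from itertools import product

# Hypothetical toy setup: names, costs, and scores are illustrative only.
MODELS = {"small": 1.0, "large": 4.0}   # assumed relative cost per sample
BUDGETS = [1, 2, 4, 8]                  # samples drawn per subtask
SUBTASKS = ["retrieval", "generation"]  # assumed two-stage task


def toy_score(subtask, model, samples):
    """Stand-in for an expensive evaluation run; not a real benchmark."""
    base = {"small": 0.5, "large": 0.7}[model]
    bonus = 0.05 if (subtask == "generation" and model == "large") else 0.0
    # Diminishing returns as more samples are drawn.
    return base + bonus + 0.2 * (1 - 1 / samples)


def brute_force(total_budget):
    """Enumerate every (model, samples) choice per subtask under a cost cap."""
    options = list(product(MODELS, BUDGETS))
    best, best_cfg, evaluated = -1.0, None, 0
    for cfg in product(options, repeat=len(SUBTASKS)):
        cost = sum(MODELS[m] * n for m, n in cfg)
        if cost > total_budget:
            continue  # the shared budget couples the subtasks
        evaluated += 1
        # Overall score modeled as a product of per-subtask scores (assumption).
        score = 1.0
        for subtask, (m, n) in zip(SUBTASKS, cfg):
            score *= toy_score(subtask, m, n)
        if score > best:
            best, best_cfg = score, cfg
    return best_cfg, best, evaluated


# The space grows as (models x budgets) ** subtasks.
search_space = (len(MODELS) * len(BUDGETS)) ** len(SUBTASKS)
best_cfg, best_score, evaluated = brute_force(total_budget=16)
print(search_space, evaluated, best_cfg, round(best_score, 3))
```

Even this toy case has 64 configurations for two subtasks; with more models, finer budget grids, and more stages the count grows exponentially, which is the motivation the paper gives for a guided search.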

Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis

Authors: Dong Yang, Yiyi Cai, Yuki Saito, Lixu Wang, Hiroshi Saruwatari

We propose Shallow Flow Matching (SFM), a novel mechanism that enhances flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. Unlike conventional FM modules, which use the coarse representations from the weak generator as conditions, SFM constructs intermediate states along the FM paths from these representations. During training, we introduce an orthogonal projection method to adaptively determine the temporal position of these states, and apply a principled construction strategy based on a single-segment piecewise flow. The SFM inference starts from the intermediate state rather than pure noise, thereby focusing computation on the latter stages of the FM paths. We integrate SFM into multiple TTS models with a lightweight SFM head. Experiments demonstrate that SFM yields consistent gains in speech naturalness across both objective and subjective evaluations, and significantly accelerates inference when using adaptive-step ODE solvers. Demo and codes are available at https://ydqmkkx.github.io/SFMDemo/.

Paper and project links

PDF Accepted by NeurIPS 2025

Summary

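This paper proposes Shallow Flow Matching (SFM), a mechanism that enhances flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. Conventional FM modules use the weak generator's coarse representations as conditions; SFM instead constructs intermediate states along the FM paths from those representations. During training, an orthogonal projection method adaptively determines the temporal position of these states, and a construction strategy based on a single-segment piecewise flow is applied. SFM inference starts from the intermediate state rather than from pure noise, so computation is concentrated on the later stages of the FM paths. SFM is integrated into multiple TTS models through a lightweight SFM head. Experiments show consistent gains in speech naturalness in both objective and subjective evaluations, and significantly faster inference when adaptive-step ODE solvers are used.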

Key Takeaways

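SFM is a new mechanism for improving flow matching (FM)-based text-to-speech models.
SFM constructs intermediate states along the FM paths from the weak generator's coarse representations.
During training, SFM uses an orthogonal projection method to adaptively determine the temporal position of the intermediate states.
SFM applies a construction strategy based on a single-segment piecewise flow.
SFM inference starts from the intermediate state rather than from pure noise, focusing computation on the later stages of the FM paths.
SFM can be integrated into multiple text-to-speech models through a lightweight SFM head.
Experiments show that SFM improves speech naturalness and significantly accelerates inference when adaptive-step ODE solvers are used.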
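
The idea of starting inference from an intermediate state can be illustrated with a toy ODE integration. This is only a sketch under strong assumptions, not the authors' implementation: the `velocity` function uses the known target as a stand-in for a learned network, the intermediate state and its position `T_MID` are chosen by hand instead of via the paper's orthogonal projection, and a fixed-step Euler solver replaces the adaptive-step solvers used in the paper.

```python
def velocity(x, t, target):
    """Toy velocity field that drives x to `target` at t = 1.

    In a real FM model this would be a learned network; using the target
    directly is an assumption made purely for illustration.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_integrate(x, t_start, target, step=0.05):
    """Integrate from t_start to 1 with fixed-step Euler; return state and step count."""
    t, steps = t_start, 0
    while t < 1.0 - 1e-9:
        h = min(step, 1.0 - t)  # never step past t = 1
        v = velocity(x, t, target)
        x = [xi + h * vi for xi, vi in zip(x, v)]
        t += h
        steps += 1
    return x, steps


# Hypothetical values, chosen only to make the example concrete.
NOISE = [0.9, -1.3, 0.4]   # stand-in for a pure-noise sample at t = 0
TARGET = [1.0, 2.0, 3.0]   # stand-in for the fine output at t = 1
COARSE = [0.8, 1.7, 2.9]   # stand-in for a state built from the coarse output
T_MID = 0.6                # assumed temporal position of that state

out_noise, steps_noise = euler_integrate(NOISE, 0.0, TARGET)
out_mid, steps_mid = euler_integrate(COARSE, T_MID, TARGET)
print(steps_noise, steps_mid)
```

Because the second run covers only the interval from `T_MID` to 1, it needs fewer solver steps to reach the same endpoint. The speedup reported in the paper is for adaptive-step solvers, which this fixed-step sketch does not model.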

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-10-25/TTS/

Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.
