
TTS


⚠️ All summaries below are generated by large language models and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on them in serious academic settings — they are only for pre-screening papers before reading!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-06

Augmenting Open-Vocabulary Dysarthric Speech Assessment with Human Perceptual Supervision

Authors:Kaimeng Jia, Minzhu Tu, Zengrui Jin, Siyin Wang, Chao Zhang

Dysarthria is a speech disorder characterized by impaired intelligibility and reduced communicative effectiveness. Automatic dysarthria assessment provides a scalable, cost-effective approach for supporting the diagnosis and treatment of neurological conditions such as Parkinson’s disease, Alzheimer’s disease, and stroke. This study investigates leveraging human perceptual annotations from speech synthesis assessment as reliable out-of-domain knowledge for dysarthric speech assessment. Experimental results suggest that such supervision can yield consistent and substantial performance improvements in self-supervised learning pre-trained models. These findings suggest that perceptual ratings aligned with human judgments from speech synthesis evaluations represent valuable resources for dysarthric speech modeling, enabling effective cross-domain knowledge transfer.


Paper and project links

PDF Submitted to IEEE ICASSP 2026

Summary

This paper investigates using human perceptual annotations from speech synthesis assessment as reliable out-of-domain knowledge for automatic dysarthric speech assessment. The results show that this form of supervision yields consistent and substantial performance improvements for self-supervised pre-trained models, indicating that perceptual ratings aligned with human judgments are a valuable resource for dysarthric speech modeling and enable effective cross-domain knowledge transfer.

Key Takeaways

  1. Dysarthria is a speech disorder that impairs intelligibility and reduces communicative effectiveness.
  2. Automatic dysarthria assessment can support the diagnosis and treatment of neurological conditions such as Parkinson's disease, Alzheimer's disease, and stroke.
  3. Human perceptual annotations from speech synthesis assessment can serve as reliable out-of-domain knowledge for dysarthric speech assessment.
  4. Experiments show that this supervision brings substantial performance gains to self-supervised pre-trained models.
  5. Perceptual ratings aligned with human judgments from speech synthesis evaluations are valuable for dysarthric speech modeling.
  6. The approach enables effective cross-domain knowledge transfer.
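The abstract reports "consistent and substantial performance improvements" without naming the metric; assessment models of this kind are typically scored by their correlation with human ratings. A minimal sketch of such an evaluation — the function and sample values below are illustrative, not from the paper:

```python
import math

def pearson(preds, human_ratings):
    """Pearson correlation between model scores and human perceptual ratings."""
    n = len(preds)
    mp = sum(preds) / n
    mh = sum(human_ratings) / n
    cov = sum((p - mp) * (h - mh) for p, h in zip(preds, human_ratings))
    sp = math.sqrt(sum((p - mp) ** 2 for p in preds))
    sh = math.sqrt(sum((h - mh) ** 2 for h in human_ratings))
    return cov / (sp * sh)

# Perfectly linearly related scores give a correlation of 1.0.
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

A higher correlation against clinician-assigned severity ratings is what "performance improvement" typically means in this setting.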

Cool Papers

Click here to view paper screenshots

Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

Authors:Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang

Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.


Paper and project links

PDF Under review

Summary

This paper examines how Test-Time Scaling (TTS) improves large language model (LLM) performance by allocating additional computation during inference. Prior work assumed fixed collaboration architectures and single-model usage; this paper instead studies the new problem of searching for compute-optimal model combinations and architectures under a fixed budget. The problem is formalized as a multi-LLM collaboration graph and tackled with the Agent-REINFORCE framework, which maps sampling-gradient-update to sampling-feedback-update, treating feedback as a textual gradient that updates the probabilistic graph and efficiently searches for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.

Key Takeaways

  1. TTS improves LLM performance by allocating additional computation during inference.
  2. Prior studies overlooked that the optimal architecture and model combination vary across tasks.
  3. This paper formalizes the problem as a multi-LLM collaboration graph.
  4. Collaboration-graph search is reformulated as probabilistic graph optimization.
  5. The Agent-REINFORCE framework uses a sampling-feedback-update loop to update the probabilistic graph and search for optimal collaboration graphs.
  6. Agent-REINFORCE outperforms traditional and LLM-based baselines in sample efficiency and search performance.
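The sampling-feedback-update loop can be illustrated with a plain numeric REINFORCE analogue over a toy collaboration graph, where each node (role) holds a categorical distribution over candidate model assignments. Everything here — the role names, model names, and the scalar reward standing in for the paper's textual feedback — is a hypothetical sketch, not the authors' implementation:

```python
import math
import random

random.seed(0)

ROLES = ["planner", "solver", "verifier"]   # graph nodes (illustrative)
MODELS = ["model_a", "model_b"]             # candidate LLM assignments

# One categorical distribution (stored as logits) per node of the graph.
logits = {r: {m: 0.0 for m in MODELS} for r in ROLES}

def probs(role):
    z = {m: math.exp(v) for m, v in logits[role].items()}
    total = sum(z.values())
    return {m: v / total for m, v in z.items()}

def sample_graph():
    # Sampling step: draw one model assignment per role.
    return {r: random.choices(MODELS, weights=[probs(r)[m] for m in MODELS])[0]
            for r in ROLES}

def reward(graph):
    # Feedback step: a toy scalar reward pretending model_b suits every role.
    return sum(1.0 for r in ROLES if graph[r] == "model_b") / len(ROLES)

LR, baseline = 0.5, 0.0
for _ in range(500):
    g = sample_graph()
    adv = reward(g) - baseline              # advantage over a running baseline
    baseline += 0.1 * (reward(g) - baseline)
    for r in ROLES:                         # update step: REINFORCE gradient
        p = probs(r)
        for m in MODELS:
            chosen = 1.0 if g[r] == m else 0.0
            logits[r][m] += LR * adv * (chosen - p[m])
```

After training, the distribution at every role concentrates on the higher-reward model. The paper replaces the numeric gradient with textual feedback from an LLM agent, but the sample-score-update structure is the same.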

Cool Papers

Click here to view paper screenshots

DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

Authors:Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Xiangang Li, Jieping Ye

Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM’s autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods utilize mainly 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz, significantly reducing computational cost and alleviating the frequency discrepancy between speech and text tokens and in turn better exploiting LLMs’ capabilities. Experimental results demonstrate that DRVOICE-7B establishes new state-of-the-art (SOTA) on OpenAudioBench and Big Bench Audio benchmarks, while achieving performance comparable to the SOTA on VoiceBench and UltraEval-Audio benchmarks, making it a leading open-source speech foundation model in ~7B models.


Paper and project links

PDF Work in progress

Summary

This paper surveys end-to-end (E2E) speech generation with large language models (LLMs), describes the two main families of E2E methods, and proposes DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling with dual-resolution speech representations. By reducing the LLM's input audio representation to 5 Hz, DrVoice lowers computational cost and alleviates the frequency discrepancy between speech and text tokens, better exploiting the LLM's capabilities. Experiments show that DRVOICE-7B sets a new state of the art on the OpenAudioBench and Big Bench Audio benchmarks.

Key Takeaways

  1. End-to-end (E2E) speech generation with large language models (LLMs) is an active research area.
  2. Existing E2E methods fall into two categories: generating discrete speech tokens independently, or jointly autoregressive modeling of interleaved or parallel speech-text tokens.
  3. DrVoice is based on joint autoregressive modeling with dual-resolution speech representations.
  4. DrVoice reduces the LLM's input audio representation to 5 Hz, significantly cutting computational cost and narrowing the frequency gap between speech and text tokens.
  5. DRVOICE-7B establishes new state-of-the-art results on OpenAudioBench and Big Bench Audio.
  6. DRVOICE-7B matches state-of-the-art performance on VoiceBench and UltraEval-Audio, making it a leading open-source speech foundation model among ~7B models.
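The 12.5 Hz → 5 Hz reduction means every 5 input frames must become 2 output frames (a non-integer stride of 2.5). The abstract does not say how DrVoice merges frames, so the following is only an assumed illustration using uneven mean-pooling groups of 3 and 2 frames:

```python
def pool_to_5hz(frames):
    """Mean-pool a 12.5 Hz frame sequence down to 5 Hz.

    Each 0.4 s window of 5 input frames yields 2 outputs via groups
    of 3 and 2 frames. Illustrative only; the paper's actual
    dual-resolution mechanism may differ.
    """
    out = []
    i = 0
    while i + 5 <= len(frames):
        for size in (3, 2):
            group = frames[i:i + size]
            out.append(sum(group) / size)
            i += size
    return out

# 10 frames at 12.5 Hz (0.8 s of audio) become 4 frames at 5 Hz.
print(pool_to_5hz([float(k) for k in range(10)]))
```

Whatever the exact merging scheme, the point of the lower rate is the same: fewer audio positions per second of speech for the LLM to attend over, and a frame rate closer to the natural rate of text tokens.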

Cool Papers

Click here to view paper screenshots

Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

Authors:Zhengrui Ma, Yang Feng, Chenze Shao, Fandong Meng, Jie Zhou, Min Zhang

We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying continuous autoregressive distribution. By bypassing reliance on residual vector quantization, SLED avoids discretization errors and eliminates the need for the complicated hierarchical architectures common in existing speech language models. It simplifies the overall modeling pipeline while preserving the richness of speech information and maintaining inference efficiency. Empirical results demonstrate that SLED achieves strong performance in both zero-shot and streaming speech synthesis, showing its potential for broader applications in general-purpose speech language models.


Paper and project links

PDF NeurIPS 2025; Demos and code are available at https://github.com/ictnlp/SLED-TTS

Summary

This paper introduces SLED, an alternative approach to speech language modeling that encodes speech waveforms into sequences of continuous latent representations and models them autoregressively with an energy distance objective. The energy distance contrasts simulated and target samples to give an analytical measure of the distributional gap, enabling efficient training that captures the underlying continuous autoregressive distribution. By bypassing residual vector quantization, SLED avoids discretization errors and eliminates the complicated hierarchical architectures common in existing speech language models, simplifying the overall pipeline while preserving the richness of speech information and maintaining inference efficiency. Empirically, SLED performs strongly in both zero-shot and streaming speech synthesis, showing potential for broader use in general-purpose speech language models.

Key Takeaways

  1. SLED is a speech language modeling approach that encodes speech waveforms into continuous latent representations.
  2. SLED models these continuous latent sequences autoregressively using an energy distance objective.
  3. The energy distance contrasts simulated and target samples, giving an analytical measure of the distributional gap and making training efficient.
  4. Bypassing residual vector quantization avoids discretization errors and removes the need for complex hierarchical architectures, simplifying the pipeline.
  5. SLED preserves the richness of speech information while maintaining inference efficiency.
  6. Empirically, SLED performs strongly in zero-shot and streaming speech synthesis.
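The energy distance the summary refers to has a standard closed form: ED(X, Y) = 2·E‖X−Y‖ − E‖X−X′‖ − E‖Y−Y′‖, which is non-negative and zero only when the two distributions coincide. A minimal Monte-Carlo sketch on point samples — this is the generic statistic, not the paper's training code:

```python
import math

def mean_pairwise_dist(a, b):
    # Average Euclidean distance over all sample pairs from a and b.
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))

def energy_distance(samples_x, samples_y):
    # 2*E||X-Y|| - E||X-X'|| - E||Y-Y'||: non-negative, and exactly
    # zero when the two sample sets are identical.
    return (2 * mean_pairwise_dist(samples_x, samples_y)
            - mean_pairwise_dist(samples_x, samples_x)
            - mean_pairwise_dist(samples_y, samples_y))
```

Training with this kind of objective contrasts model-simulated latents against target latents, which is what lets SLED fit a continuous distribution directly instead of quantizing it into discrete tokens.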

Cool Papers

Click here to view paper screenshots

LEP3: A High-Luminosity e+e- Higgs and ElectroweakFactory in the LHC Tunnel

Authors:C. Anastopoulos, R. Assmann, A. Ball, O. Bruning, O. Buchmueller, T. Camporesi, P. Collier, J Dainton, G. Davies, J. R. Ellis, B. Goddard, L. Gouskos, M. Klute, M. Koratzinos, G. Landsberg, K. Long, L. Malgeri, F. Maltoni, F. Moortgat, C. Mariotti, S. Myers, J. A. Osborne, M. Pierini, D. R. Tovey, D. Treille, T. S. Virdee, N. Wardle, M. Zanetti

As stated in the 2019 European Strategy for Particle Physics (ESPP), it is of the utmost importance that the HL-LHC upgrade of the accelerator and the experiments be successfully completed in a timely manner. All necessary efforts should be devoted to achieving this goal. We also recall two of the principal recommendations of the 2019 ESPP for future accelerator initiatives, namely that 1) An electron-positron Higgs factory is the highest priority for the next collider (Rec. c). 2) Europe, together with its international partners, should investigate the technical and financial feasibility of a future hadron collider at CERN with a centre-of-mass energy of at least 100 TeV and with an electron-positron Higgs and electroweak factory as a possible first stage (Rec. e). A major objective in particle physics is always to operate an accelerator that allows a leap of an order of magnitude in the constituent centre-of-mass energy with respect to the previous one. We support FCC-ee and FCC-hh as the preferred option for CERN future, as it addresses both of the above recommendations. The guidance for the 2025 ESPP requests, in addition to the preferred option, the inclusion of ``prioritised alternatives to be pursued if the chosen preferred option turns out not to be feasible or competitive’’. Proposed alternatives to the preferred FCC option include linear, muon colliders and LHeC accelerators. In response to this request we propose reusing the existing LHC tunnel for an electron-positron collider, called LEP3, as a back-up alternative if the FCC cannot proceed. LEP3 leverages much of the R&D conducted for FCC-ee, offers high-precision studies of Z, W, and Higgs bosons below the tt threshold, and offers potential physics performance comparable or superior to other fallback options at a lower cost while supporting continued R&D towards a next-generation energy frontier machine.


Paper and project links

PDF 11 pages, 3 tables

Summary
The European Strategy for Particle Physics stresses that the HL-LHC upgrade of the accelerator and the experiments must be completed in a timely manner. Its principal recommendations for future accelerators are an electron-positron Higgs factory as the highest priority for the next collider, and a study of the technical and financial feasibility of a future hadron collider at CERN with a centre-of-mass energy of at least 100 TeV. The guidance for the 2025 ESPP additionally requests prioritised alternatives in case the preferred option proves infeasible or uncompetitive; proposed alternatives to the preferred FCC option include linear colliders, muon colliders, and the LHeC. The authors propose LEP3, an electron-positron collider reusing the existing LHC tunnel, as a back-up if the FCC cannot proceed. LEP3 leverages much of the R&D conducted for FCC-ee, enables high-precision studies of the Z, W, and Higgs bosons, and offers physics performance comparable or superior to other fallback options at lower cost, while supporting continued R&D towards a next-generation energy-frontier machine.

Key Takeaways

  1. Completing the HL-LHC upgrade of the accelerator and the experiments is of the utmost importance.
  2. An electron-positron Higgs factory is the highest priority for the next collider.
  3. Europe, together with its international partners, should investigate the technical and financial feasibility of a future hadron collider at CERN with a centre-of-mass energy of at least 100 TeV.
  4. Proposed alternatives to the preferred FCC option include linear colliders, muon colliders, and the LHeC.
  5. If the preferred option proves infeasible or uncompetitive, LEP3, an electron-positron collider reusing the existing LHC tunnel, is proposed as a back-up.
  6. LEP3 leverages FCC-ee R&D, enables high-precision studies of the Z, W, and Higgs bosons, and may offer physics performance comparable or superior to other fallback options at lower cost.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!