TTS

Published: 2025-02-28

Updated: 2025-05-14


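⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not rely on these summaries in serious academic work; they are intended only as a first-pass filter before reading the papers.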

Updated 2025-02-28

Sparse Brains are Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs

Authors: Yiheng Yang, Yujie Wang, Chi Ma, Lei Yu, Emmanuele Chersoni, Chu-Ren Huang

Dense large language models (LLMs) face critical efficiency bottlenecks as they rigidly activate all parameters regardless of input complexity. While existing sparsity methods (static pruning or dynamic activation) address this partially, they either lack adaptivity to contextual or model structural demands or incur prohibitive computational overhead. Inspired by human brain’s dual-process mechanisms - predictive coding (N400) for backbone sparsity and structural reanalysis (P600) for complex context - we propose CLADA, a Cognitive-Load-Aware Dynamic Activation framework that synergizes statistical sparsity with semantic adaptability. Our key insight is that LLM activations exhibit two complementary patterns: 1) global statistical sparsity driven by sequence-level prefix information, and 2) local semantic adaptability modulated by cognitive load metrics (e.g., surprisal and entropy). CLADA employs a hierarchical thresholding strategy: a baseline from offline error-controlled optimization ensures 40%+ sparsity, dynamically adjusted by real-time cognitive signals. Evaluations across six mainstream LLMs and nine benchmarks demonstrate that CLADA achieves ~20% average speedup with <2% accuracy drop, outperforming Griffin (5%+ degradation) and TT (negligible speedup). Crucially, we establish the first formal connection between neurolinguistic event-related potential (ERP) components and LLM efficiency mechanisms through multi-level regression analysis ($R^2=0.17$ for sparsity-adaptation synergy). Requiring no retraining or architectural changes, CLADA offers a deployable solution for resource-aware LLM inference while advancing biologically-inspired AI design. Our code is available at https://github.com/Oldify/CLADA.



Summary

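This paper proposes CLADA, a cognitive-load-aware dynamic activation framework that addresses the efficiency bottleneck of large language models (LLMs). CLADA combines statistical sparsity with semantic adaptability: sequence-level prefix information drives global statistical sparsity, while cognitive load metrics such as surprisal and entropy modulate local semantic adaptability. A baseline threshold obtained from offline error-controlled optimization ensures more than 40% sparsity and is then adjusted dynamically by real-time cognitive signals. Evaluations across six mainstream LLMs and nine benchmarks show that CLADA achieves roughly 20% average speedup with less than 2% accuracy loss, outperforming Griffin (more than 5% degradation) and TT (negligible speedup). The paper also establishes the first formal connection between neurolinguistic event-related potential (ERP) components and LLM efficiency mechanisms. CLADA requires no retraining or architectural changes, offers a deployable solution for resource-aware LLM inference, and advances biologically inspired AI design.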

Key Takeaways

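Large language models (LLMs) face efficiency bottlenecks; existing sparsity methods such as static pruning or dynamic activation help only partially, because they either lack adaptivity or incur excessive computational overhead.
CLADA combines statistical sparsity with semantic adaptability, using sequence-level prefix information and cognitive load metrics to make LLM inference more efficient.
CLADA achieves more than 40% sparsity, with thresholds adjusted dynamically by real-time cognitive signals.
Across six mainstream LLMs and nine benchmarks, CLADA achieves roughly 20% average speedup with less than 2% accuracy loss.
CLADA outperforms Griffin (more than 5% degradation) and TT (negligible speedup) and requires no retraining or architectural changes.
CLADA offers a deployable solution for resource-aware LLM inference.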
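
To make the hierarchical thresholding idea concrete, here is a minimal sketch of how a fixed baseline threshold might be relaxed when cognitive load is high. It is not the authors' implementation: the load score (an average of normalised entropy and clipped surprisal), the `alpha` scaling factor, and all function names are assumptions made for illustration, and the baseline threshold is simply passed in rather than obtained from offline error-controlled optimization.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def surprisal(probs, token_id):
    """Surprisal (in nats) of the token that was actually observed."""
    return -math.log(probs[token_id])


def cognitive_load(probs, token_id):
    """Hypothetical load score in [0, 1]: mean of normalised entropy and clipped surprisal."""
    max_entropy = math.log(len(probs))  # entropy of the uniform distribution
    if max_entropy == 0.0:
        return 0.0
    h = min(entropy(probs) / max_entropy, 1.0)
    s = min(surprisal(probs, token_id) / max_entropy, 1.0)
    return 0.5 * (h + s)


def adjusted_threshold(base_threshold, probs, token_id, alpha=0.5):
    """Assumed rule: lower the threshold (keep more neurons) as load rises."""
    return base_threshold * (1.0 - alpha * cognitive_load(probs, token_id))


def sparsify(activations, threshold):
    """Zero out activations whose magnitude falls below the threshold."""
    return [a if abs(a) >= threshold else 0.0 for a in activations]


if __name__ == "__main__":
    activations = [0.5, -0.1, 0.3, -0.6]
    cases = {
        "uniform (high load)": [0.25, 0.25, 0.25, 0.25],
        "peaked (low load)": [0.97, 0.01, 0.01, 0.01],
    }
    for name, probs in cases.items():
        t = adjusted_threshold(0.4, probs, token_id=0)
        print(name, round(t, 3), sparsify(activations, t))
```

In this sketch an uncertain, surprising context lowers the threshold so fewer activations are dropped, while a predictable context keeps the threshold near the baseline and preserves sparsity.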


Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Authors: Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou Zhao

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces S-DiT, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to S-DiT to reduce the difficulty of alignment learning without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that S-DiT achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.



Summary
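S-DiT is a TTS system built on an innovative sparse alignment algorithm that guides a latent diffusion transformer (DiT). Providing sparse alignment boundaries reduces the difficulty of alignment learning without limiting the search space, which yields highly natural speech. The system also supports flexible control over accent intensity and can generate high-quality one-minute speech with only 8 sampling steps.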

Key Takeaways

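S-DiT is a TTS system based on a sparse alignment algorithm, designed to address the speech-text alignment problems of mainstream systems.
S-DiT uses a latent diffusion transformer (DiT) and provides it with sparse alignment boundaries to reduce the difficulty of alignment learning.
Because the search space is not restricted, the system achieves highly natural speech.
A multi-condition classifier-free guidance strategy is used to adjust accent intensity.
The piecewise rectified flow technique accelerates generation.
S-DiT achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity.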
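
As a rough illustration of multi-condition classifier-free guidance, the sketch below combines three model outputs with two independent guidance scales. This is one commonly used formulation, not necessarily the one in S-DiT: the choice of conditions (text only versus text plus a speaker prompt), the function name, and the idea that the second scale is the one governing accent intensity are all assumptions.

```python
def multi_condition_cfg(uncond, text_only, text_and_speaker, w_text, w_speaker):
    """Combine three model predictions element-wise.

    out = uncond
          + w_text    * (text_only        - uncond)
          + w_speaker * (text_and_speaker - text_only)

    Each argument is a flat list of floats standing in for a model output.
    """
    if not (len(uncond) == len(text_only) == len(text_and_speaker)):
        raise ValueError("all predictions must have the same length")
    return [
        u + w_text * (t - u) + w_speaker * (ts - t)
        for u, t, ts in zip(uncond, text_only, text_and_speaker)
    ]


if __name__ == "__main__":
    # Toy predictions; a real system would obtain these from three forward passes.
    uncond = [0.0, 1.0]
    text_only = [1.0, 2.0]
    text_and_speaker = [2.0, 4.0]
    for w_speaker in (0.0, 1.0, 2.0):
        print(w_speaker, multi_condition_cfg(uncond, text_only, text_and_speaker, 1.0, w_speaker))
```

With both scales set to 1 the result equals the fully conditioned prediction; raising or lowering one scale strengthens or weakens the influence of that condition alone, which is what makes a single attribute adjustable at inference time.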


Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding

Authors: Tianyun Liu

Traditional text-to-speech (TTS) methods primarily focus on establishing a mapping between phonemes and mel-spectrograms. However, during the phoneme encoding stage, there is often a lack of real mel-spectrogram auxiliary information, which results in the encoding process lacking true semantic understanding. At the same time, traditional TTS systems often struggle to balance the inference speed of the model with the quality of the synthesized speech. Methods that generate high-quality synthesized speech tend to have slower inference speeds, while faster inference methods often sacrifice speech quality. In this paper, I propose Clip-TTS, a TTS method based on the Clip architecture. This method uses the Clip framework to establish a connection between text content and real mel-spectrograms during the text encoding stage, enabling the text encoder to directly learn the true semantics of the global context, thereby ensuring the quality of the synthesized speech. In terms of model architecture, I adopt the basic structure of Transformer, which allows Clip-TTS to achieve fast inference speeds. Experimental results show that on the LJSpeech and Baker datasets, the speech generated by Clip-TTS achieves state-of-the-art MOS scores, and it also performs excellently on multi-emotion datasets. Audio samples are available at: https://ltydd1314.github.io/.



Summary

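This paper proposes Clip-TTS, a TTS method based on the Clip architecture. During the text encoding stage, it uses the Clip framework to connect text content with real mel-spectrograms, so that the text encoder directly learns the true semantics of the global context, which helps ensure the quality of the synthesized speech. The model adopts the basic Transformer structure to achieve fast inference. Experiments show that on the LJSpeech and Baker datasets, speech generated by Clip-TTS achieves state-of-the-art MOS scores, and that it also performs well on multi-emotion datasets.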

Key Takeaways

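Traditional TTS methods mainly learn a mapping between phonemes and mel-spectrograms, but the phoneme encoding stage lacks auxiliary information from real mel-spectrograms, so the encoding lacks true semantic understanding.
Clip-TTS uses the Clip architecture to connect text content with real mel-spectrograms, improving synthesized speech quality.
Clip-TTS adopts the basic Transformer structure, achieving fast inference while maintaining high speech quality.
On the LJSpeech and Baker datasets, speech generated by Clip-TTS achieves state-of-the-art MOS scores.
Clip-TTS also performs well on multi-emotion datasets.
Audio samples are available at https://ltydd1314.github.io/.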
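
The connection between text content and mel-spectrograms can be illustrated with a CLIP-style symmetric contrastive loss over paired embeddings. The sketch below is a generic version of that objective in plain Python, not the Clip-TTS training code: the function names, the temperature value, and the use of ready-made embedding vectors in place of real text and mel encoders are all assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(text_embs, mel_embs, temperature=0.07):
    """Symmetric cross-entropy over a text-by-mel cosine similarity matrix.

    Row i of text_embs is assumed to be paired with row i of mel_embs.
    """
    text = [l2_normalize(v) for v in text_embs]
    mel = [l2_normalize(v) for v in mel_embs]
    n = len(text)
    logits = [
        [sum(a * b for a, b in zip(t, m)) / temperature for m in mel]
        for t in text
    ]
    # Text-to-mel direction: each row should pick out its own column.
    text_to_mel = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Mel-to-text direction: same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    mel_to_text = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (text_to_mel + mel_to_text)


if __name__ == "__main__":
    text_embs = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned pairs:", clip_style_loss(text_embs, [[1.0, 0.0], [0.0, 1.0]]))
    print("swapped pairs:", clip_style_loss(text_embs, [[0.0, 1.0], [1.0, 0.0]]))
```

The loss is small when each text embedding is closest to its own mel embedding and large when the pairs are swapped, which is the pressure that pulls the text encoder toward representations consistent with real mel-spectrograms.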


Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-02-28/TTS/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.
