⚠️ All summaries below are generated by a large language model and may contain errors; they are provided for reference only, so use them with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-02
Go with Your Gut: Scaling Confidence for Autoregressive Image Generation
Authors:Harold Haodong Chen, Xianfeng Wu, Wen-Jie Shu, Rongjin Guo, Disen Lan, Harry Yang, Ying-Cong Chen
Test-time scaling (TTS) has demonstrated remarkable success in enhancing large language models, yet its application to next-token prediction (NTP) autoregressive (AR) image generation remains largely uncharted. Existing TTS approaches for visual AR (VAR), which rely on frequent partial decoding and external reward models, are ill-suited for NTP-based image generation due to the inherent incompleteness of intermediate decoding results. To bridge this gap, we introduce ScalingAR, the first TTS framework specifically designed for NTP-based AR image generation that eliminates the need for early decoding or auxiliary rewards. ScalingAR leverages token entropy as a novel signal in visual token generation and operates at two complementary scaling levels: (i) Profile Level, which streams a calibrated confidence state by fusing intrinsic and conditional signals; and (ii) Policy Level, which utilizes this state to adaptively terminate low-confidence trajectories and dynamically schedule guidance for phase-appropriate conditioning strength. Experiments on both general and compositional benchmarks show that ScalingAR (1) improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, (2) efficiently reduces visual token consumption by 62.0% while outperforming baselines, and (3) successfully enhances robustness, mitigating performance drops by 26.0% in challenging scenarios.
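The abstract only outlines the mechanism, but its core idea at the Profile/Policy levels, turning per-token entropy into a running confidence signal and terminating low-confidence decoding trajectories, can be illustrated with a small, self-contained sketch. Everything below (the `sample_next_logits` callback, the EMA smoothing, the stochastic sampling, and the confidence floor) is a hypothetical stand-in chosen for the example, not the ScalingAR implementation.

```python
import numpy as np

def softmax(logits):
    """Convert logits to a probability distribution."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def normalized_entropy(probs):
    """Shannon entropy of the distribution, normalized to [0, 1]."""
    h = -(probs * np.log(probs + 1e-12)).sum()
    return float(h / np.log(len(probs)))

def scale_trajectories(sample_next_logits, num_trajectories=8, max_tokens=256,
                       confidence_floor=0.35, ema=0.9, rng=None):
    """Keep several sampled decoding trajectories; prune those whose smoothed
    confidence (1 - normalized token entropy) drops below a floor."""
    rng = rng or np.random.default_rng()
    trajectories = [{"tokens": [], "confidence": 1.0} for _ in range(num_trajectories)]
    for _ in range(max_tokens):
        survivors = []
        for traj in trajectories:
            logits = np.asarray(sample_next_logits(traj["tokens"]))  # hypothetical model call
            probs = softmax(logits)
            conf_now = 1.0 - normalized_entropy(probs)        # low entropy -> high confidence
            traj["confidence"] = ema * traj["confidence"] + (1 - ema) * conf_now
            traj["tokens"].append(int(rng.choice(len(probs), p=probs)))  # stochastic decoding
            if traj["confidence"] >= confidence_floor:
                survivors.append(traj)
        trajectories = survivors or trajectories[:1]          # never terminate every trajectory
    return max(trajectories, key=lambda t: t["confidence"])
```

The paper's Policy Level additionally uses the same confidence state to schedule guidance strength across generation phases; that part is omitted from this sketch.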
Paper and project links
PDF Code: https://github.com/EnVision-Research/ScalingAR
Summary
This paper introduces ScalingAR, a test-time scaling framework for next-token-prediction (NTP) autoregressive (AR) image generation. The framework removes the need for early decoding or auxiliary rewards, uses token entropy as a new signal in visual token generation, and scales at two complementary levels, the Profile Level and the Policy Level. Experiments show that ScalingAR improves base-model performance and successfully enhances robustness.
Key Takeaways
- ScalingAR is the first test-time scaling framework designed for next-token-prediction (NTP) autoregressive (AR) image generation.
- ScalingAR addresses the limitations of existing test-time scaling methods for visual AR (VAR): it no longer relies on frequent partial decoding or external reward models.
- ScalingAR uses token entropy as a new signal for visual token generation.
- ScalingAR operates at two complementary scaling levels, the Profile Level and the Policy Level.
- Experiments show that ScalingAR improves base-model performance on GenEval and TIIF-Bench.
- ScalingAR efficiently reduces visual token consumption while outperforming the baselines.



HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis
Authors:Ziyu Zhang, Hanzhao Li, Jingbin Hu, Wenhao Li, Lei Xie
Controllable speech synthesis refers to the precise control of speaking style by manipulating specific prosodic and paralinguistic attributes, such as gender, volume, speech rate, pitch, and pitch fluctuation. With the integration of advanced generative models, particularly large language models (LLMs) and diffusion models, controllable text-to-speech (TTS) systems have increasingly transitioned from label-based control to natural language description-based control, which is typically implemented by predicting global style embeddings from textual prompts. However, this straightforward prediction overlooks the underlying distribution of the style embeddings, which may hinder the full potential of controllable TTS systems. In this study, we use t-SNE analysis to visualize and analyze the global style embedding distribution of various mainstream TTS systems, revealing a clear hierarchical clustering pattern: embeddings first cluster by timbre and subsequently subdivide into finer clusters based on style attributes. Based on this observation, we propose HiStyle, a two-stage style embedding predictor that hierarchically predicts style embeddings conditioned on textual prompts, and further incorporate contrastive learning to help align the text and audio embedding spaces. Additionally, we propose a style annotation strategy that leverages the complementary strengths of statistical methodologies and human auditory preferences to generate more accurate and perceptually consistent textual prompts for style control. Comprehensive experiments demonstrate that when applied to the base TTS model, HiStyle achieves significantly better style controllability than alternative style embedding predicting approaches while preserving high speech quality in terms of naturalness and intelligibility. Audio samples are available at https://anonymous.4open.science/w/HiStyle-2517/.
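As a rough illustration of the two components named in the abstract, a hierarchical coarse-to-fine style embedding predictor and contrastive alignment of the text and audio embedding spaces, a minimal sketch might look like the following. The layer sizes, the residual refinement, and the symmetric InfoNCE loss are assumptions made for this example, not HiStyle's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageStylePredictor(nn.Module):
    """Stage 1 predicts a coarse (timbre-level) embedding from the prompt;
    stage 2 refines it into a fine-grained style embedding (hypothetical layout)."""
    def __init__(self, text_dim=512, style_dim=256):
        super().__init__()
        self.coarse = nn.Sequential(nn.Linear(text_dim, style_dim), nn.GELU(),
                                    nn.Linear(style_dim, style_dim))
        self.fine = nn.Sequential(nn.Linear(text_dim + style_dim, style_dim), nn.GELU(),
                                  nn.Linear(style_dim, style_dim))

    def forward(self, prompt_emb):
        coarse = self.coarse(prompt_emb)                         # timbre-level prediction
        fine = self.fine(torch.cat([prompt_emb, coarse], dim=-1))
        return coarse, coarse + fine                             # coarse and refined style embeddings

def contrastive_alignment_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE pulling paired text/audio style embeddings together."""
    text = F.normalize(text_emb, dim=-1)
    audio = F.normalize(audio_emb, dim=-1)
    logits = text @ audio.t() / temperature
    targets = torch.arange(text.size(0), device=text.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Stage 1 here stands in for the timbre-level clusters that the t-SNE analysis reveals; stage 2 refines within a cluster according to style attributes.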
Paper and project links
Summary
This paper studies controllable speech synthesis, in which speaking style is precisely controlled by manipulating specific prosodic and paralinguistic attributes. With advances in generative models, controllable text-to-speech (TTS) systems have shifted from label-based control to control based on natural-language descriptions. Using t-SNE analysis, the authors find that the global style embeddings of mainstream TTS systems follow a hierarchical clustering pattern, and they propose HiStyle, a two-stage style embedding predictor that predicts embeddings hierarchically from textual prompts and uses contrastive learning to align the text and audio embedding spaces. They also propose a style annotation strategy that combines statistical methods with human auditory preferences to produce more accurate and perceptually consistent prompts for style control. Experiments show that, applied to the base TTS model, HiStyle achieves significantly better style controllability than alternative style-embedding prediction approaches while preserving naturalness and intelligibility.
Key Takeaways
- Controllable speech synthesis achieves precise control of speaking style by manipulating prosodic and paralinguistic attributes.
- Text-to-speech (TTS) systems have moved from label-based control to control based on natural-language descriptions.
- The global style embeddings of mainstream TTS systems exhibit a hierarchical clustering pattern.
- HiStyle is a two-stage style embedding predictor that predicts embeddings hierarchically from textual prompts.
- HiStyle uses contrastive learning to improve the alignment between the text and audio embedding spaces.
- The proposed style annotation strategy combines statistical methods with human auditory preferences to make the textual prompts for style control more accurate.
- Experiments show that HiStyle significantly outperforms other methods in style controllability while preserving the naturalness and intelligibility of the speech.



LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning
Authors:Kang Yang, Yifan Liang, Fangkun Liu, Zhenping Xie, Chengshi Zheng
Lip-to-speech (L2S) synthesis for Mandarin is a significant challenge, hindered by complex viseme-to-phoneme mappings and the critical role of lexical tones in intelligibility. To address this issue, we propose Lexical Tone-Aware Lip-to-Speech (LTA-L2S). To tackle viseme-to-phoneme complexity, our model adapts an English pre-trained audio-visual self-supervised learning (SSL) model via a cross-lingual transfer learning strategy. This strategy not only transfers universal knowledge learned from extensive English data to the Mandarin domain but also circumvents the prohibitive cost of training such a model from scratch. To specifically model lexical tones and enhance intelligibility, we further employ a flow-matching model to generate the F0 contour. This generation process is guided by ASR-fine-tuned SSL speech units, which contain crucial suprasegmental information. The overall speech quality is then elevated through a two-stage training paradigm, where a flow-matching postnet refines the coarse spectrogram from the first stage. Extensive experiments demonstrate that LTA-L2S significantly outperforms existing methods in both speech intelligibility and tonal accuracy.
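The flow-matching component that generates the F0 contour can be sketched with the standard conditional flow-matching objective (a linear interpolation path and a velocity-matching loss). Only the objective follows the usual recipe; the network shape, conditioning interface, and dimensions below are assumptions for illustration rather than the paper's model.

```python
import torch
import torch.nn as nn

class F0FlowMatcher(nn.Module):
    """Predicts a velocity field for the F0 contour, conditioned on embeddings of
    (hypothetical) ASR-fine-tuned SSL speech units that carry tone information."""
    def __init__(self, cond_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1 + 1 + cond_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, T, 1) noisy F0; t: (B, 1, 1) flow time; cond: (B, T, cond_dim)
        t = t.expand(-1, x_t.size(1), -1)
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(model, f0, cond):
    """Conditional flow matching: regress the velocity along a straight path from noise to data."""
    x1 = f0                                       # clean F0 contour, shape (B, T, 1)
    x0 = torch.randn_like(x1)                     # sample from the Gaussian prior
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    x_t = (1 - t) * x0 + t * x1                   # point on the interpolation path
    target_v = x1 - x0                            # constant target velocity for this path
    return ((model(x_t, t, cond) - target_v) ** 2).mean()
```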
Paper and project links
PDF Submitted to ICASSP 2026
Summary
Lip-to-speech (L2S) synthesis for Mandarin faces major challenges, including complex viseme-to-phoneme mappings and the critical role of lexical tones in intelligibility. To address them, the authors propose Lexical Tone-Aware Lip-to-Speech (LTA-L2S). To handle the viseme-to-phoneme complexity, the model adapts an English pre-trained audio-visual self-supervised learning (SSL) model via a cross-lingual transfer learning strategy, which transfers universal knowledge learned from extensive English data to the Mandarin domain while avoiding the prohibitive cost of training such a model from scratch. To model lexical tones and enhance intelligibility, a flow-matching model generates the F0 contour, guided by ASR-fine-tuned SSL speech units that carry crucial suprasegmental information. Overall speech quality is further improved through a two-stage training paradigm in which a flow-matching postnet refines the coarse spectrogram produced in the first stage. Extensive experiments show that LTA-L2S significantly outperforms existing methods in both speech intelligibility and tonal accuracy.
Key Takeaways
- LTA-L2S addresses the challenges of Mandarin lip-to-speech synthesis.
- A cross-lingual transfer learning strategy transfers universal knowledge from an English pre-trained self-supervised model to the Mandarin domain.
- A flow-matching model generates the F0 contour to improve tonal accuracy and intelligibility.
- ASR-fine-tuned SSL speech units carry suprasegmental information and guide the generation process.
- A two-stage training paradigm improves overall speech quality.
- LTA-L2S significantly outperforms existing methods in speech intelligibility and tonal accuracy.
- This approach is a meaningful step toward more natural and accurate speech synthesis.



Non-linear infusion of intrinsic alignment and source clustering: impact on non-Gaussian cosmic shear statistics
Authors:J. Harnois-Déraps, N. Šarčević, L. Medina Varela, J. Armijo, C. T. Davies, N. van Alfen, J. Blazek, L. Castiblanco, A. Halder, K. Heitmann, P. Larsen, L. Linke, J. Liu, C. MacMahon-Gellér, L. Porth, S. Rangel, C. Uhlemann, the LSST Dark Energy Science Collaboration
Intrinsic alignments (IA) of galaxies is one of the key secondary signals to cosmic shear measurements, and must be modeled to interpret weak lensing data and infer the correct cosmology. There are large uncertainties in the physical description of IA, and analytical calculations are often out of reach for weak lensing statistics beyond two-point functions. We present here a set of six flexible IA models infused directly into weak lensing simulations, constructed from the mass shells, the projected tidal fields and, optionally, dark matter halo catalogues. We start with the non-linear linear alignment (NLA) and progressively sophisticate the galaxy bias and the tidal coupling models, including the commonly-used extended NLA (also known as the e-NLA or $\delta$-NLA) and the tidal torque (TT) models. We validate our methods with MCMC analyses from two-point shear statistics, then compute the impact on non-Gaussian cosmic shear probes from these catalogues as well as from reconstructed convergence maps. We find that the $\delta$-NLA model has by far the largest impact on most probes, at times more than twice the strength of the NLA. We also observe large differences between the IA models in under-dense regions, which makes minima, void profiles and lensing PDF the best probes for model rejection. Furthermore, our bias models allow us to separately study the source-clustering term for each of these probes, finding good agreement with the existing literature, and extending the results to these new probes. The third-order aperture mass statistics ($M^3_{ap}$) and the integrated three-point functions are particularly sensitive to this when including low-redshift data, often exceeding a 20% impact on the data vector. Our IA models are straightforward to implement and rescale from a single simulated IA-infused galaxy catalogue, allowing for fast model exploration.
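For orientation, NLA-type infusion schemes in the weak-lensing literature typically map the projected tidal tensor $s_{ij}$ at a galaxy's position to an intrinsic ellipticity of roughly the form

$$\left(\epsilon^{\rm IA}_1,\ \epsilon^{\rm IA}_2\right) = -\frac{A_{\rm IA}\, C_1\, \bar{\rho}(z)}{D(z)} \left(s_{xx} - s_{yy},\ 2\, s_{xy}\right),$$

with the $\delta$-NLA (e-NLA) variant additionally weighting this signal by the local density, $\propto (1 + b_{\rm TA}\,\delta)$, and the tidal torque (TT) model instead built from terms quadratic in $s_{ij}$. This is the generic form from the IA literature, quoted here for context; the paper's exact conventions, normalisations, and smoothing scales may differ.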
Paper and project links
PDF 21 pages, 20 figures
Summary
This work studies the impact of galaxy intrinsic alignments (IA) on cosmic shear measurements and presents a set of six flexible IA models infused directly into weak lensing simulations, constructed from mass shells, projected tidal fields and, optionally, dark matter halo catalogues. The methods are validated with MCMC analyses of two-point shear statistics, and the impact on non-Gaussian cosmic shear probes is then computed from these catalogues as well as from reconstructed convergence maps. The $\delta$-NLA model has by far the largest impact on most probes, at times more than twice the strength of the NLA. Large differences between the IA models appear in under-dense regions, which makes minima, void profiles and the lensing PDF the best probes for model rejection. The bias models also allow the source-clustering term to be studied separately for each probe, in good agreement with the existing literature, and the results are extended to these new probes. The third-order aperture mass statistics and the integrated three-point functions are particularly sensitive to this when low-redshift data are included, often exceeding a 20% impact on the data vector. The IA models are straightforward to implement and can be rescaled from a single simulated IA-infused galaxy catalogue, enabling fast model exploration.
Key Takeaways
- Galaxy intrinsic alignments (IA) are a key secondary signal in cosmic shear measurements and must be modeled to interpret weak lensing data and infer the correct cosmology.
- There are large uncertainties in the physical description of IA, and analytical calculations are often out of reach for weak lensing statistics beyond two-point functions.
- Six flexible IA models are infused directly into weak lensing simulations, built from mass shells, projected tidal fields and, optionally, dark matter halo catalogues.
- The $\delta$-NLA model has by far the largest impact on most probes, at times more than twice the strength of the NLA model.
- The IA models differ strongly in under-dense regions, making minima, void profiles and the lensing PDF the best probes for model rejection.
- The third-order aperture mass statistics and the integrated three-point functions are particularly sensitive to IA, especially when low-redshift data are included.


VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
Authors:Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song
Video-conditioned sound and speech generation, encompassing video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, has conventionally been addressed as two separate problems, with limited exploration of unifying them within a single framework. Recent attempts to unify V2S and VisualTTS face challenges in handling distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we present VSSFlow, which seamlessly integrates both V2S and VisualTTS tasks into a unified flow-matching framework. VSSFlow uses a novel condition aggregation mechanism to handle distinct input signals. We find that cross-attention and self-attention layers exhibit different inductive biases in the process of introducing conditions. Therefore, VSSFlow leverages these inductive biases to effectively handle different representations: cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. Furthermore, contrary to the prevailing belief that joint training on the two tasks requires complex training strategies and may degrade performance, we find that VSSFlow benefits from the end-to-end joint learning process for sound and speech generation without extra designs on training stages. Detailed analysis attributes it to the learned general audio prior shared between tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments demonstrate that VSSFlow surpasses the state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, underscoring the critical potential of unified generative models.
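As a rough sketch of what the condition aggregation described above could look like in code, the block below injects video features through cross-attention and transcript tokens through self-attention over a concatenated sequence. The layer layout, dimensions, and the way audio positions are sliced back out are assumptions for illustration, not VSSFlow's actual architecture.

```python
import torch
import torch.nn as nn

class ConditionAggregationBlock(nn.Module):
    """Injects an ambiguous video condition via cross-attention and a more
    deterministic transcript condition via self-attention (hypothetical layout)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio_tokens, video_feats, transcript_tokens):
        # Video condition: queried through cross-attention from the audio stream.
        q = self.norm1(audio_tokens)
        x = audio_tokens + self.cross_attn(q, video_feats, video_feats)[0]
        # Transcript condition: concatenated to the sequence and self-attended,
        # then only the audio positions are kept.
        seq = torch.cat([x, transcript_tokens], dim=1)
        h = self.norm2(seq)
        seq = seq + self.self_attn(h, h, h)[0]
        return seq[:, : audio_tokens.size(1)]
```

Here `audio_tokens`, `video_feats`, and `transcript_tokens` are assumed to have already been projected to a shared feature dimension.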
Paper and project links
PDF Paper Under Review
Summary
This paper presents VSSFlow, a unified framework that seamlessly integrates video-to-sound (V2S) and visual text-to-speech (VisualTTS) generation. A novel condition aggregation mechanism handles the distinct input signals: cross-attention and self-attention layers exhibit different inductive biases when conditions are introduced, so VSSFlow uses cross-attention for ambiguous video conditions and self-attention for the more deterministic speech transcripts. Contrary to the common belief that joint training on the two tasks requires complex training strategies or degrades performance, VSSFlow benefits from end-to-end joint learning of sound and speech generation without extra training-stage designs; the analysis attributes this to a learned general audio prior shared between the tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments show that VSSFlow surpasses state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, highlighting the potential of unified generative models.
Key Takeaways
- VSSFlow unifies video-to-sound (V2S) and visual text-to-speech (VisualTTS) generation in a single framework.
- A novel condition aggregation mechanism handles the distinct input signals.
- Cross-attention and self-attention layers show different inductive biases when conditions are introduced.
- VSSFlow uses cross-attention for ambiguous video conditions and self-attention for the more deterministic speech transcripts.
- Joint training of V2S and VisualTTS does not require complex training strategies, thanks to a learned general audio prior shared between the tasks.
- The shared audio prior accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process.


