
Talking Head Generation


⚠️ All of the summaries below are produced by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on them in serious academic settings; they are only meant for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-09-30

Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling

Authors:Junjie Cao, Yichen Han, Ruonan Zhang, Xiaoyang Hao, Hongxiang Li, Shuaijiang Zhao, Yue Liu, Xiao-Ping Zhang

Existing Large Language Model (LLM) based autoregressive (AR) text-to-speech (TTS) systems, while achieving state-of-the-art quality, still face critical challenges. The foundation of this LLM-based paradigm is the discretization of the continuous speech waveform into a sequence of discrete tokens by a neural audio codec. However, while single-codebook modeling is well suited to text LLMs, it suffers from significant information loss; hierarchical acoustic tokens, typically generated via Residual Vector Quantization (RVQ), often lack explicit semantic structure, placing a heavy learning burden on the model. Furthermore, the autoregressive process is inherently susceptible to error accumulation, which can degrade generation stability. To address these limitations, we propose CaT-TTS, a novel framework for robust and semantically grounded zero-shot synthesis. First, we introduce S3Codec, a split RVQ codec that injects explicit linguistic features into its primary codebook via semantic distillation from a state-of-the-art ASR model, providing a structured representation that simplifies the learning task. Second, we propose an "Understand-then-Generate" dual-Transformer architecture that decouples comprehension from rendering. An initial "Understanding" Transformer models the cross-modal relationship between text and the audio's semantic tokens to form a high-level utterance plan. A subsequent "Generation" Transformer then executes this plan, autoregressively synthesizing hierarchical acoustic tokens. Finally, to enhance generation stability, we introduce Masked Audio Parallel Inference (MAPI), a nearly parameter-free inference strategy that dynamically guides the decoding process to mitigate local errors.
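
To make the token hierarchy concrete, here is a minimal numpy sketch of residual vector quantization, the mechanism S3Codec builds on: each codebook level quantizes the residual left by the previous one, so the first (primary) codebook carries the coarsest content and later levels add acoustic detail. All sizes and names are illustrative assumptions, and the semantic distillation into the primary codebook is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(frame, codebooks):
    """Quantize one feature frame with a stack of codebooks."""
    residual, codes = frame, []
    for cb in codebooks:                      # cb: (num_entries, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)                     # nearest codeword at this level
        residual = residual - cb[idx]         # pass the residual down
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct the frame by summing the selected codewords."""
    return sum(cb[i] for i, cb in zip(codes, codebooks))

dim, levels, entries = 8, 4, 256              # toy sizes (assumptions)
codebooks = [rng.normal(size=(entries, dim)) for _ in range(levels)]
frame = rng.normal(size=dim)
codes = rvq_encode(frame, codebooks)
print(codes, np.linalg.norm(frame - rvq_decode(codes, codebooks)))
```

With random codebooks the reconstruction error stays large; in a trained codec each level shrinks the residual, which is what makes the hierarchy worth modeling.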


Paper and Project Links

PDF conference paper about TTS

Summary

This paper addresses the challenges facing Large Language Model (LLM) based autoregressive (AR) text-to-speech (TTS) systems and proposes the CaT-TTS framework to overcome the information loss, heavy learning burden, and generation-stability issues of existing methods. The framework introduces the S3Codec codec and an "Understand-then-Generate" dual-Transformer architecture, and adopts a Masked Audio Parallel Inference (MAPI) strategy to improve generation stability.

Key Takeaways

  1. LLM-based AR TTS systems achieve state-of-the-art quality but still suffer from information loss, a heavy model learning burden, and generation-stability issues.
  2. Single-codebook modeling suits text LLMs but incurs severe information loss.
  3. Hierarchical acoustic tokens, typically produced via Residual Vector Quantization (RVQ), lack explicit semantic structure.
  4. S3Codec injects explicit linguistic features distilled from a state-of-the-art ASR model into its primary codebook, providing a structured representation that simplifies learning.
  5. An "Understand-then-Generate" dual-Transformer architecture decouples comprehension from rendering (a structural sketch follows this list).
  6. The "Understanding" Transformer models the cross-modal relationship between text and the audio's semantic tokens to form a high-level utterance plan.
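
The two-stage split in takeaways 5 and 6 can be pictured as two ordinary decoder-only Transformers chained at inference time: the first autoregressively emits semantic tokens from text (the utterance plan), and the second conditions on text plus that plan to emit acoustic tokens. The PyTorch sketch below is structural only; vocabulary size, dimensions, and greedy decoding are assumptions, and the untrained modules naturally emit arbitrary tokens.

```python
import torch
import torch.nn as nn

class CausalLM(nn.Module):
    """Plain decoder-only Transformer; stands in for both stages."""
    def __init__(self, vocab, d=256, layers=4, heads=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        block = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.body = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(d, vocab)

    def forward(self, ids):                      # ids: (B, T)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.body(self.emb(ids), mask=mask))

@torch.no_grad()
def sample(lm, prefix, steps):
    """Greedy continuation; returns only the newly generated tokens."""
    ids = prefix
    for _ in range(steps):
        nxt = lm(ids)[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], dim=1)
    return ids[:, prefix.size(1):]

vocab = 4096
understander, generator = CausalLM(vocab), CausalLM(vocab)
text = torch.randint(0, vocab, (1, 12))
semantic = sample(understander, text, steps=20)            # the "plan"
acoustic = sample(generator, torch.cat([text, semantic], dim=1), steps=40)
```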

Cool Papers

Click here to view paper screenshots

StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing

Authors:Liyang Chen, Tianze Zhou, Xu He, Boshi Tang, Zhiyong Wu, Yang Huang, Yang Wu, Zhongqian Sun, Wei Yang, Helen Meng

The visual dubbing task aims to generate mouth movements synchronized with the driving audio, and has seen significant progress in recent years. However, two critical deficiencies hinder the wide application of existing methods: (1) audio-only driving paradigms inadequately capture speaker-specific lip habits, failing to generate lip movements that resemble the target avatar; (2) conventional blind-inpainting approaches frequently produce visual artifacts when handling obstructions (e.g., microphones, hands), limiting practical deployment. In this paper, we propose StableDub, a novel and concise framework integrating lip-habit-aware modeling with occlusion-robust synthesis. Specifically, building upon the Stable-Diffusion backbone, we develop a lip-habit-modulated mechanism that jointly models phonemic audio-visual synchronization and speaker-specific orofacial dynamics. To achieve plausible lip geometries and object appearances under occlusion, we introduce an occlusion-aware training strategy that explicitly exposes the occluding objects to the inpainting process. By incorporating the proposed designs, the model eliminates the need for the cost-intensive priors of previous methods, thereby exhibiting superior training efficiency on the computationally intensive diffusion-based backbone. To further optimize training efficiency from the perspective of model architecture, we introduce a hybrid Mamba-Transformer architecture, which demonstrates enhanced applicability in low-resource research scenarios. Extensive experimental results demonstrate that StableDub achieves superior performance in lip habit resemblance and occlusion robustness. Our method also surpasses other methods in audio-lip sync, video quality, and resolution consistency. We expand the applicability of visual dubbing methods from comprehensive aspects, and demo videos can be found at https://stabledub.github.io.
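
The occlusion-aware idea fits in a few lines: instead of blindly erasing the whole mouth region (which also erases microphones or hands and forces the model to hallucinate them), pixels belonging to a detected occluder stay visible and only the remaining mouth area is regenerated. Below is a minimal numpy sketch; the mask conventions are our assumption, not the paper's exact recipe.

```python
import numpy as np

def inpainting_input(frame, mouth_mask, occluder_mask):
    """Build the masked frame fed to the inpainting model.

    frame:         (H, W, 3) float image
    mouth_mask:    (H, W) bool, region whose lips should be re-synthesized
    occluder_mask: (H, W) bool, detected obstructions (mic, hand, ...)
    """
    erase = mouth_mask & ~occluder_mask     # regenerate mouth, keep occluders
    masked = frame.copy()
    masked[erase] = 0.0                     # zero out pixels to synthesize
    return masked, erase

# Toy usage with random image and hand-placed masks.
H, W = 64, 64
frame = np.random.rand(H, W, 3)
mouth = np.zeros((H, W), bool); mouth[40:, 16:48] = True
mic = np.zeros((H, W), bool);   mic[50:, 28:36] = True
masked, erase = inpainting_input(frame, mouth, mic)
print(erase.sum(), "pixels to regenerate;", (mouth & mic).sum(), "kept visible")
```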


Paper and Project Links

PDF

Summary

This paper presents StableDub, a framework that combines lip-habit-aware modeling with occlusion-robust synthesis to address two major problems in visual dubbing: audio-driven paradigms that fail to capture speaker-specific lip habits, and conventional blind-inpainting methods that produce visual artifacts around occluders. Built on a Stable-Diffusion backbone, StableDub introduces a lip-habit-modulated mechanism that jointly models phonemic audio-visual synchronization and speaker-specific orofacial dynamics. To achieve plausible lip geometry and object appearance under occlusion, an occlusion-aware training strategy explicitly exposes occluding objects to the inpainting process. These designs remove the need for the cost-intensive priors of earlier methods, yielding superior training efficiency on the computationally intensive diffusion backbone. A hybrid Mamba-Transformer architecture further optimizes training efficiency at the architectural level and shows stronger applicability in low-resource research scenarios. Experiments show that StableDub achieves excellent performance in lip-habit resemblance, occlusion robustness, and audio-lip sync.

Key Takeaways

  1. StableDub combines lip-habit-aware modeling with occlusion-robust synthesis to tackle two key challenges in visual dubbing.
  2. A lip-habit-modulated mechanism jointly models phonemic audio-visual synchronization and speaker-specific orofacial dynamics.
  3. An occlusion-aware training strategy improves synthesis quality when occluders are present.
  4. The design removes the need for cost-intensive priors, improving training efficiency.
  5. A hybrid Mamba-Transformer architecture strengthens applicability in low-resource scenarios (see the sketch after this list).
  6. Experiments show StableDub is superior on multiple metrics, including lip-habit resemblance, occlusion robustness, and audio-lip sync.
  7. Online demo videos showcase the results.
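
The hybrid architecture in takeaway 5 interleaves a linear-time sequence mixer with global attention. In the sketch below a GRU stands in for the Mamba selective-state-space layer purely for illustration (the real layer is different and more involved); all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """One mixer + attention block of a hybrid Mamba-Transformer stack."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.mixer = nn.GRU(d, d, batch_first=True)        # Mamba stand-in
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):                                  # x: (B, T, d)
        x = x + self.mixer(self.norm1(x))[0]               # O(T) local mixing
        h = self.norm2(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # global attention
        return x

stack = nn.Sequential(*[HybridBlock() for _ in range(4)])
print(stack(torch.randn(2, 128, 256)).shape)               # (2, 128, 256)
```

The appeal for low-resource settings is that the linear-time mixer carries most of the sequence modeling, so fewer (or cheaper) attention layers are needed.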

Cool Papers

Click here to view paper screenshots

Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data

Authors:George Yakushev, Alina Shutova, Ivan Rubachev, Renat Sergazinov, Artem Babenko

Tabular foundation models are becoming increasingly popular for low-resource tabular problems. These models make up for small training datasets by pretraining on large volumes of synthetic data. The prior knowledge obtained via pretraining provides exceptional performance, but the resulting model becomes a black box that is difficult to interpret and costly to run at inference time. In this work, we explore an alternative strategy: using reasoning-capable LLMs to induce decision trees for small tabular datasets in an agentic setup. We design a minimal set of tools for constructing, analyzing and manipulating decision trees. By using these tools, LLMs combine their prior knowledge with learning from data to create a lightweight decision tree that outperforms traditional CART on low-resource tabular problems. While a single decision tree does not outperform state-of-the-art black box models, it comes with a human-readable reasoning trace that can be checked for biases and data leaks. Furthermore, the LLM's reasoning-based construction process allows for additional human input: correcting biases or incorporating domain-specific intuition that is not captured in the data.
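
As a concrete picture of the "minimal set of tools" mentioned above, here is an illustrative sketch of tree-editing operations an LLM agent could call between reasoning steps. The tree encoding and function names are our assumptions, not the paper's released interface (see the repository linked below for the real implementation).

```python
import numpy as np

# A tree as a flat dict: node i has children 2*i+1 (left) and 2*i+2 (right).
tree = {}

def add_split(node_id, feature, threshold):
    """Make node_id an internal node testing x[feature] <= threshold."""
    tree[node_id] = ("split", feature, threshold)

def add_leaf(node_id, label):
    """Make node_id a leaf predicting a fixed label."""
    tree[node_id] = ("leaf", label)

def predict(x, node_id=0):
    kind, *args = tree[node_id]
    if kind == "leaf":
        return args[0]
    feature, threshold = args
    child = 2 * node_id + (1 if x[feature] <= threshold else 2)
    return predict(x, child)

def evaluate(X, y):
    """Held-out accuracy; the agent reads this after each edit."""
    return float(np.mean([predict(x) == t for x, t in zip(X, y)]))

add_split(0, feature=0, threshold=0.5)      # root: x[0] <= 0.5 ?
add_leaf(1, label=0)                        # left child
add_leaf(2, label=1)                        # right child
print(evaluate(np.array([[0.2], [0.9]]), np.array([0, 1])))   # 1.0
```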


Paper and Project Links

PDF Preprint, code at https://github.com/yandex-research/TalkingTrees

Summary

This paper explores a new strategy: using reasoning-capable LLMs to induce decision trees for small tabular datasets. By combining pretrained prior knowledge with learning from data, the LLMs build lightweight decision trees that outperform traditional CART on low-resource tabular problems. Although a single decision tree does not match the best black-box models, it comes with a human-readable reasoning trace that can be checked for biases and data leaks, and the construction process accepts human input to correct biases or incorporate domain-specific intuition.

Key Takeaways

  1. Tabular foundation models compensate for small training datasets with prior knowledge obtained via pretraining.
  2. Using reasoning-capable LLMs to induce decision trees is an alternative strategy for small tabular datasets.
  3. LLMs combine prior knowledge with learning from data to create lightweight decision trees that outperform traditional CART.
  4. Although a decision tree underperforms state-of-the-art black-box models, it offers a human-readable reasoning trace for auditing biases and data leaks.
  5. The reasoning-based construction process allows human input to correct biases or incorporate domain-specific intuition.
  6. The method provides a tool set for constructing, analyzing, and manipulating decision trees (sketched after the abstract above).

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!