⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never rely on these summaries in serious academic settings; they are intended only as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-11
DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
Authors:Hanke Xie, Dake Guo, Chengyou Wang, Yue Li, Wenjie Tian, Xinfa Zhu, Xinsheng Wang, Xiulin Li, Guanqiong Miao, Bo Liu, Lei Xie
Recent advances in text-to-speech (TTS) synthesis, particularly those leveraging large language models (LLMs), have significantly improved expressiveness and naturalness. However, generating human-like, interactive dialogue speech remains challenging. Current systems face limitations due to the scarcity of dual-track data and difficulties in achieving naturalness, contextual coherence, and interactional dynamics, such as turn-taking, overlapping speech, and speaker consistency, in multi-turn conversations. To address these challenges, we propose DialoSpeech, a dual-track architecture combining a large language model with Chunked Flow Matching for expressive, human-like dialogue speech synthesis. DialoSpeech generates natural multi-turn conversations with coherent speaker turns and natural overlaps, supporting both Chinese and English and cross-lingual speech synthesis. We introduce a data processing pipeline to construct dual-track dialogue datasets, facilitating scalable training and experimental validation. Experiments show that our model outperforms baselines, offering a solution for generating human-like spoken dialogues. Audio samples are available at https://tiamojames.github.io/DialoSpeech
Paper and Project Links
Summary
Recent advances in text-to-speech (TTS) synthesis, especially those leveraging large language models (LLMs), have markedly improved expressiveness and naturalness, yet generating human-like, interactive dialogue speech remains challenging. Current systems are limited by the scarcity of dual-track dialogue data and by the difficulty of achieving naturalness, contextual coherence, and interactional dynamics such as turn-taking, overlapping speech, and speaker consistency in multi-turn conversations. To address this, the authors propose DialoSpeech, a dual-track architecture combining a large language model with Chunked Flow Matching for expressive, human-like dialogue speech synthesis. DialoSpeech generates natural multi-turn conversations, supports Chinese, English, and cross-lingual synthesis, and is trained on dual-track dialogue datasets built with a dedicated data processing pipeline that enables scalable training and experimental validation. Experiments show the model outperforms baseline systems.
Key Takeaways
- TTS synthesis built on large language models (LLMs) has markedly improved naturalness and expressiveness.
- Generating human-like interactive dialogue speech remains challenging, especially with respect to naturalness, coherence, and interactional dynamics in multi-turn conversations.
- DialoSpeech is a dual-track architecture that combines a large language model with Chunked Flow Matching to address these challenges (a rough sketch of chunk-wise flow-matching decoding follows this list).
- DialoSpeech supports Chinese, English, and cross-lingual speech synthesis.
- The proposed data processing pipeline builds dual-track dialogue datasets, enabling scalable training and experimental validation.
- Experiments show that DialoSpeech outperforms baseline models.
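The abstract does not spell out how Chunked Flow Matching is implemented, so the following is only a minimal, hypothetical sketch of the general idea: decode the audio-latent sequence chunk by chunk, integrating a conditional flow-matching velocity field with a few Euler steps per chunk and conditioning each chunk on the previous one. The toy network, chunk size, and conditioning scheme are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ToyVelocityNet(nn.Module):
    """Stand-in conditional velocity field v(x_t, t, cond) for one audio-latent chunk."""
    def __init__(self, d=80, chunk=50):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * chunk * d + 1, 256), nn.GELU(),
                                 nn.Linear(256, chunk * d))
        self.d, self.chunk = d, chunk

    def forward(self, x, t, cond):
        inp = torch.cat([x.flatten(1), cond.flatten(1), t.view(-1, 1)], dim=-1)
        return self.net(inp).view(-1, self.chunk, self.d)

def decode_chunked(vnet, n_chunks=3, n_steps=8, batch=1):
    """Hypothetical chunk-wise flow-matching decoding: each chunk starts from noise
    and is integrated with a few Euler steps, conditioned on the previous chunk so
    that consecutive chunks stay coherent across speaker turns."""
    prev = torch.zeros(batch, vnet.chunk, vnet.d)
    out = []
    with torch.no_grad():
        for _ in range(n_chunks):
            x = torch.randn(batch, vnet.chunk, vnet.d)
            for s in range(n_steps):
                t = torch.full((batch,), s / n_steps)
                x = x + (1.0 / n_steps) * vnet(x, t, prev)   # Euler step along the flow
            out.append(x)
            prev = x                                         # condition the next chunk on this one
    return torch.cat(out, dim=1)                             # (batch, n_chunks * chunk, d)

if __name__ == "__main__":
    mel = decode_chunked(ToyVelocityNet())
    print(mel.shape)   # torch.Size([1, 150, 80])
```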
Click here to view paper screenshots

IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation
Authors:Wei Wang, Rong Cao, Yi Guo, Zhengyang Chen, Kuan Chen, Yuanyuan Huo
Flow-based generative models have greatly improved text-to-speech (TTS) synthesis quality, but inference speed remains limited by the iterative sampling process and multiple function evaluations (NFE). The recent MeanFlow model accelerates generation by modeling average velocity instead of instantaneous velocity. However, its direct application to TTS encounters challenges, including GPU memory overhead from Jacobian-vector products (JVP) and training instability due to self-bootstrap processes. To address these issues, we introduce IntMeanFlow, a framework for few-step speech generation with integral velocity distillation. By approximating average velocity with the teacher’s instantaneous velocity over a temporal interval, IntMeanFlow eliminates the need for JVPs and self-bootstrap, improving stability and reducing GPU memory usage. We also propose the Optimal Step Sampling Search (O3S) algorithm, which identifies the model-specific optimal sampling steps, improving speech synthesis without additional inference overhead. Experiments show that IntMeanFlow achieves 1-NFE inference for token-to-spectrogram and 3-NFE for text-to-spectrogram tasks while maintaining high-quality synthesis. Demo samples are available at https://vvwangvv.github.io/intmeanflow.
Paper and Project Links
Summary
Flow-based generative models have greatly improved text-to-speech (TTS) synthesis quality, but inference speed is limited by iterative sampling and multiple function evaluations (NFE). The recent MeanFlow model accelerates generation by modeling average velocity, yet applying it directly to TTS suffers from the GPU memory overhead of Jacobian-vector products (JVP) and training instability caused by self-bootstrapping. IntMeanFlow addresses these issues with integral velocity distillation: the average velocity is approximated from the teacher's instantaneous velocity over a temporal interval, removing the need for JVPs and self-bootstrapping while improving stability and reducing GPU memory usage. An Optimal Step Sampling Search (O3S) algorithm further identifies model-specific optimal sampling steps, improving synthesis quality without additional inference overhead.
Key Takeaways
- Flow-based generative models improve TTS quality, but their inference speed is limited.
- MeanFlow accelerates generation by modeling average velocity, but applying it directly to TTS is problematic.
- IntMeanFlow performs few-step speech generation by approximating the average velocity with the teacher's instantaneous velocity over a temporal interval, addressing GPU memory overhead and training instability (a rough sketch of this distillation loss follows this list).
- IntMeanFlow eliminates the need for Jacobian-vector products and self-bootstrapping.
- The Optimal Step Sampling Search (O3S) algorithm identifies model-specific optimal sampling steps, improving synthesis without additional inference overhead.
- IntMeanFlow achieves 1-NFE inference for token-to-spectrogram and 3-NFE for text-to-spectrogram tasks while maintaining high-quality synthesis.
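As a concrete illustration of integral velocity distillation, here is a minimal, hypothetical PyTorch sketch: the student predicts an average velocity over an interval [t, r], and the target is obtained by integrating the frozen teacher's instantaneous velocity over that interval with a few Euler sub-steps, so no JVPs or self-bootstrapping are needed. The toy networks, sub-step count, and time-sampling scheme are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def integral_velocity_distillation_loss(student, teacher, x_t, t, r, n_sub=4):
    """Sketch of integral velocity distillation: the student predicts the *average*
    velocity over [t, r]; the target is the teacher's displacement over the same
    interval, obtained by integrating its *instantaneous* velocity with Euler sub-steps.
    The target depends only on the frozen teacher (no JVPs, no self-bootstrap)."""
    u_student = student(x_t, t, r)                        # (B, D) predicted average velocity

    with torch.no_grad():
        x = x_t
        dt = (r - t) / n_sub                              # (B,) per-sample sub-step size
        for k in range(n_sub):
            tau = t + dt * k
            x = x + dt.unsqueeze(-1) * teacher(x, tau)    # Euler step with teacher velocity
        u_target = (x - x_t) / (r - t).unsqueeze(-1)      # average velocity over [t, r]

    return F.mse_loss(u_student, u_target)

if __name__ == "__main__":
    D = 16
    class ToyVel(torch.nn.Module):                        # stand-in for a velocity network
        def __init__(self, cond_dims):
            super().__init__()
            self.net = torch.nn.Linear(D + cond_dims, D)
        def forward(self, x, *ts):
            cond = torch.stack(ts, dim=-1)                # time conditioning
            return self.net(torch.cat([x, cond], dim=-1))

    teacher, student = ToyVel(1), ToyVel(2)
    x_t = torch.randn(8, D)
    t = torch.rand(8) * 0.5
    r = t + torch.rand(8) * 0.5
    loss = integral_velocity_distillation_loss(student, teacher, x_t, t, r)
    loss.backward()
    print(float(loss))
```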
Click here to view paper screenshots

Parallel Test-Time Scaling for Latent Reasoning Models
Authors:Runyang You, Yongqi Li, Meng Liu, Wenjie Wang, Liqiang Nie, Wenjie Li
Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code released at https://github.com/YRYangang/LatentTTS.
Paper and Project Links
Summary
This work brings parallel test-time scaling (TTS) to latent reasoning models. For sampling, it introduces two uncertainty-inspired stochastic strategies (Monte Carlo Dropout and Additive Gaussian Noise); for aggregation, it trains a Latent Reward Model (LatentRM) with a step-wise contrastive objective to score and guide latent trajectories. Experiments and visualization analyses show that both sampling strategies scale effectively with compute, while LatentRM enables effective trajectory selection, opening a new direction for scalable inference in continuous spaces.
Key Takeaways
- Parallel test-time scaling (TTS) is used to enhance large language models (LLMs).
- It samples multiple token-based chains-of-thought in parallel and aggregates the outcomes through voting or search.
- Latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offers a more efficient alternative to explicit chain-of-thought.
- Two uncertainty-inspired stochastic sampling strategies are introduced: Monte Carlo Dropout and Additive Gaussian Noise (a rough sketch of both appears after this list).
- A Latent Reward Model (LatentRM), trained with a step-wise contrastive objective, scores and guides latent reasoning trajectories.
- Experiments and visualization analyses show that the sampling strategies scale effectively with compute and exhibit distinct exploration dynamics.
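To make the two sampling strategies named in the abstract concrete, the hypothetical sketch below perturbs a latent reasoning state either with Monte Carlo Dropout (dropout kept active at inference) or with additive Gaussian noise, rolls out several trajectories in parallel, and picks one with a stand-in scorer in place of the paper's LatentRM. All module names, shapes, and the toy reasoner are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ToyLatentReasoner(nn.Module):
    """Stand-in for a latent reasoning model: refines a hidden state step by step."""
    def __init__(self, d=32, p_drop=0.1):
        super().__init__()
        self.step = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Dropout(p_drop), nn.Linear(d, d))

    def forward(self, h):
        return h + self.step(h)            # one latent reasoning step

def sample_trajectories(model, h0, n_samples=8, n_steps=4, mode="gaussian", sigma=0.05):
    """Parallel TTS in latent space: duplicate the state, inject stochasticity, roll out."""
    h = h0.expand(n_samples, -1).clone()
    if mode == "mc_dropout":
        model.train()                      # keep dropout active at inference (MC Dropout)
    else:
        model.eval()
    with torch.no_grad():
        for _ in range(n_steps):
            if mode == "gaussian":
                h = h + sigma * torch.randn_like(h)   # additive Gaussian noise
            h = model(h)
    return h                               # (n_samples, d) candidate final latents

if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyLatentReasoner()
    scorer = nn.Linear(32, 1)              # stand-in for LatentRM trajectory scoring
    h0 = torch.randn(1, 32)
    candidates = sample_trajectories(model, h0, mode="mc_dropout")
    scores = scorer(candidates).squeeze(-1)
    best = candidates[scores.argmax()]     # aggregate by keeping the best-scored trajectory
    print(best.shape)
```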
Click here to view paper screenshots

AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding
Authors:Shuqing Luo, Yilin Guan, Pingzhi Li, Hanrui Wang, Tianlong Chen
Test-time scaling (TTS) boosts LLM reasoning via long chain-of-thought (CoT), but the linear KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. Query-aware page-level sparse decoding can achieve state-of-the-art performance under constrained FLOPs budgets, but is limited by both sequential-dependent page filtering and coarse-grained token selection, hampering serving efficiency and model performance on TTS tasks under high concurrency and long CoT scenarios (consuming even higher runtime than the forward pipeline itself). In this paper, we first find that the current-step query state can be accurately approximated in a unified manner from a short window of recent queries, enabling training-free query-aware sparsity without waiting in the decoding loop. We propose AsyncSpade, an asynchronous framework for efficient TTS built on two core components: (1) a novel light-weight temporal-regressive module that predicts the next-token query state; (2) an asynchronous and disaggregated framework that decouples the KV cache filtering from the auto-regressive decoding loop, overlapping the token-level KV selection with the forward inference computation through asynchronism. To our knowledge, AsyncSpade is the first to eliminate the sequential dependence without sacrificing model performance. We validate the effectiveness of AsyncSpade on common LLM serving setups with an A100 node, where AsyncSpade fully overlaps KV-cache operations with the inference pipeline, achieving theoretical optimal time-per-output-token (TPOT). Specifically, AsyncSpade delivers over 20% reduction on TPOT compared to SoTA baseline (i.e. Quest) and at least 50% TPOT reduction compared to full attention on Qwen3-8B and Qwen3-32B models, while matching or surpassing their accuracy on various TTS benchmarks (AIME-24/25, GPQA-Diamond, MATH-500).
Paper and Project Links
PDF 14 pages, 17 figures
Summary
Test-time scaling (TTS) improves LLM reasoning via long chain-of-thought (CoT), but the linear growth of the KV cache amplifies the memory-bound bottleneck of decoding. AsyncSpade is an asynchronous framework for efficient TTS built on two core components: a lightweight temporal-regressive module that predicts the next-token query state, and an asynchronous, disaggregated framework that decouples KV-cache filtering from the auto-regressive decoding loop, overlapping token-level KV selection with the forward computation. AsyncSpade removes the sequential dependence of query-aware sparse decoding without sacrificing model performance; on common LLM serving setups it fully overlaps KV-cache operations with the inference pipeline and achieves the theoretically optimal time-per-output-token (TPOT).
Key Takeaways
- Test-time scaling (TTS) enhances LLM reasoning through long chain-of-thought (CoT).
- Linear KV-cache growth amplifies the memory-bound bottleneck of LLM decoding.
- Query-aware page-level sparse decoding achieves strong performance under constrained FLOPs budgets, but is limited by sequential-dependent page filtering and coarse-grained token selection.
- The current-step query state can be accurately approximated from a short window of recent queries, enabling training-free query-aware sparsity without waiting in the decoding loop (a rough sketch of this prediction-and-filtering step follows this list).
- AsyncSpade is an asynchronous framework for efficient TTS built on two core components: a lightweight temporal-regressive module and an asynchronous, disaggregated KV-filtering framework.
- AsyncSpade eliminates the sequential dependence, achieves the theoretically optimal time-per-output-token (TPOT), and delivers over 20% TPOT reduction versus the SoTA baseline (Quest) and at least 50% versus full attention on Qwen3-8B and Qwen3-32B.
- AsyncSpade matches or surpasses the accuracy of these models on various TTS benchmarks (AIME-24/25, GPQA-Diamond, MATH-500).
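The key step described in the abstract is approximating the current query state from a short window of recent queries and using that prediction to filter the KV cache ahead of time. Below is a minimal, hypothetical sketch of that step: a simple linear predictor stands in for the paper's temporal-regressive module, and the selection is shown synchronously, whereas AsyncSpade would run it asynchronously, overlapped with the forward pass. Names, shapes, and the top-k budget are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryPredictor(nn.Module):
    """Hypothetical temporal-regressive module: predicts the next-token query
    state from a short window of recent query states."""
    def __init__(self, d=64, window=4):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(window * d, d)

    def forward(self, recent_q):               # recent_q: (window, d)
        return self.proj(recent_q.reshape(-1))

def preselect_kv(pred_q, keys, top_k=64):
    """Score cached keys with the *predicted* query and keep only the top-k token
    indices; in AsyncSpade this filtering would be overlapped with the forward
    computation of the current decoding step rather than run in-line."""
    scores = keys @ pred_q                      # (seq_len,) approximate attention scores
    return scores.topk(min(top_k, keys.shape[0])).indices

if __name__ == "__main__":
    torch.manual_seed(0)
    d, seq_len = 64, 4096
    predictor = QueryPredictor(d=d, window=4)
    recent_q = torch.randn(4, d)                # last few query states from prior steps
    keys = torch.randn(seq_len, d)              # cached keys for one attention head
    pred_q = predictor(recent_q)
    kept = preselect_kv(pred_q, keys, top_k=64)
    print(kept.shape)                           # torch.Size([64]) selected token indices
```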
Click here to view paper screenshots

Paper2Video: Automatic Video Generation from Scientific Papers
Authors:Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou
Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minutes video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics–Meta Similarity, PresentArena, PresentQuiz, and IP Memory–to measure how videos convey the paper’s information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement by a novel effective tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.
Paper and Project Links
PDF Project Page: https://showlab.github.io/Paper2Video/
Summary
Academic presentation videos are an essential medium for research communication, yet producing them is highly labor-intensive. The authors introduce Paper2Video, a benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata, together with four tailored evaluation metrics (Meta Similarity, PresentArena, PresentQuiz, and IP Memory). Building on this, they propose PaperTalker, a multi-agent framework that integrates slide generation with layout refinement via tree-search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments show the generated presentation videos are more faithful and informative than existing baselines.
Key Takeaways
- Academic presentation videos have become an essential medium for research communication, but producing them remains highly labor-intensive.
- The Paper2Video benchmark pairs research papers with author-created presentation videos, slides, and speaker metadata, providing a foundation for tackling these challenges.
- PaperTalker is a multi-agent framework that integrates slide generation, layout refinement, subtitling, speech synthesis, and talking-head rendering.
- PaperTalker refines slide layouts with an effective tree-search visual choice and grounds the cursor on the slides.
- Experiments on Paper2Video show that the videos produced by PaperTalker are more faithful and informative than existing baselines.
- PaperTalker parallelizes slide-wise generation to improve efficiency (a small sketch of this parallelization follows this list).
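Because the per-slide assets are generated independently for each slide, slide-wise generation can run in parallel, as the abstract notes. The snippet below is a hypothetical toy sketch of that idea using a thread pool; the worker function and data format are placeholders, not the actual PaperTalker agents.

```python
from concurrent.futures import ThreadPoolExecutor

def make_slide_assets(section):
    """Hypothetical per-slide worker: in PaperTalker this would cover slide layout
    refinement, subtitling, speech synthesis, and talking-head rendering."""
    slide = f"[slide for: {section['title']}]"
    subtitles = section["text"][:80]
    return {"slide": slide, "subtitles": subtitles}

def build_presentation(sections, max_workers=4):
    # Slide-wise generation is independent across sections, so it can run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        assets = list(pool.map(make_slide_assets, sections))
    return assets                          # ordered list, one entry per slide

if __name__ == "__main__":
    sections = [
        {"title": "Introduction", "text": "Why presentation videos are costly to make."},
        {"title": "Method", "text": "Multi-agent pipeline for slides, speech, and talker."},
        {"title": "Results", "text": "Comparison against baselines on Paper2Video."},
    ]
    for asset in build_presentation(sections):
        print(asset["slide"], "|", asset["subtitles"])
```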
Click here to view paper screenshots

Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Authors:Yolo Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
Paper and Project Links
PDF The 1st version
Summary
Video understanding is among the most challenging frontiers in computer vision, requiring models to reason over complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. Video-Large Multimodal Models (Video-LMMs), which couple visual encoders with powerful decoder-based language models, have shown remarkable capabilities, but the post-training phase that turns them from perception systems into reasoning engines remains fragmented across the literature. This survey provides the first comprehensive examination of post-training for Video-LMMs across three pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. It presents a structured taxonomy covering video-specific challenges such as temporal localization, spatiotemporal grounding, long-video efficiency, and multimodal evidence integration, synthesizes key design principles and evaluation protocols, identifies open challenges in reward design, scalability, and cost-performance optimization, and curates essential benchmarks, datasets, and metrics.
Key Takeaways
- Video understanding is one of the most challenging tasks in computer vision, requiring models to handle complex spatiotemporal relationships, long-term dependencies, and multimodal evidence.
- Video-LMMs, which combine visual encoders with language-model decoders, have demonstrated strong video understanding capabilities.
- Post-training is the key phase that transforms Video-LMMs from basic perception systems into sophisticated reasoning engines.
- The survey covers three main post-training strategies: supervised fine-tuning (SFT), reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS); a toy example of a verifiable reward follows this list.
- These strategies face video-specific challenges such as temporal localization, spatiotemporal grounding, long-video efficiency, and multimodal evidence integration.
- The survey synthesizes key design principles, insights, and evaluation protocols, and identifies open challenges in reward design, scalability, and cost-performance optimization.
- It also curates essential benchmarks, datasets, and metrics to enable rigorous assessment of post-training effectiveness.
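"RL from verifiable objectives" generally means the reward can be computed programmatically from the model's output rather than by a learned judge. The toy function below illustrates one possible verifiable reward for video question answering with temporal grounding: an exact answer match plus a temporal-IoU check on the cited segment. It is a generic, hypothetical example, not the reward of any specific surveyed method.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between predicted and ground-truth segments (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def verifiable_reward(response, gt_answer, gt_segment, iou_threshold=0.5):
    """Hypothetical verifiable reward for grounded video QA:
    +1 if the final answer matches exactly (case-insensitive),
    +1 if the cited segment overlaps the annotated one above a threshold."""
    answer_ok = response["answer"].strip().lower() == gt_answer.strip().lower()
    ground_ok = temporal_iou(response["segment"], gt_segment) >= iou_threshold
    return float(answer_ok) + float(ground_ok)

if __name__ == "__main__":
    response = {"answer": "the dog jumps over the fence", "segment": (12.0, 18.5)}
    print(verifiable_reward(response, "The dog jumps over the fence", (11.0, 19.0)))  # 2.0
```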
Click here to view paper screenshots

SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement
Authors:Kuan-Yu Chen, Jeng-Lin Li, Jian-Jiun Ding
With the fast development of zero-shot text-to-speech technologies, it is possible to generate high-quality speech signals that are indistinguishable from the real ones. Speech editing, including speech insertion and replacement, appeals to researchers due to its potential applications. However, existing studies only considered clean speech scenarios. In real-world applications, the existence of environmental noise could significantly degrade the quality of generation. In this study, we propose a noise-resilient speech editing framework, SeamlessEdit, for noisy speech editing. SeamlessEdit adopts a frequency-band-aware noise suppression module and an in-content refinement strategy. It can well address the scenario where the frequency bands of voice and background noise are not separated. The proposed SeamlessEdit framework outperforms state-of-the-art approaches in multiple quantitative and qualitative evaluations.
Paper and Project Links
PDF 5 pages, 3 figures
Summary
With the rapid progress of zero-shot text-to-speech, it is now possible to generate speech that is indistinguishable from real recordings, and speech editing (insertion and replacement) has attracted research interest for its potential applications. Existing studies, however, consider only clean speech, while in real-world use environmental noise can significantly degrade generation quality. SeamlessEdit is a noise-resilient speech editing framework that combines a frequency-band-aware noise suppression module with an in-content refinement strategy, handling the case where the frequency bands of voice and background noise are not separated. It outperforms state-of-the-art approaches in multiple quantitative and qualitative evaluations.
Key Takeaways
- Zero-shot text-to-speech has advanced rapidly and can generate speech that is hard to distinguish from real recordings.
- Speech editing has promising applications, but existing studies focus on clean speech scenarios.
- In real-world applications, environmental noise significantly degrades the quality of generated speech.
- SeamlessEdit is a noise-resilient framework for editing noisy speech.
- SeamlessEdit uses a frequency-band-aware noise suppression module and handles cases where the frequency bands of voice and background noise are not separated (a rough band-wise gating sketch follows this list).
- SeamlessEdit outperforms state-of-the-art approaches in multiple quantitative and qualitative evaluations.
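The abstract does not specify the suppression module, so the following is only a minimal, hypothetical sketch of what "frequency-band-aware" suppression could look like: estimate a per-band noise floor from a background-only reference and apply band-wise spectral gating, leaving bands dominated by speech largely untouched. The band count, gain rule, and STFT settings are illustrative assumptions, not the paper's design.

```python
import torch

def band_aware_suppress(noisy, noise_ref, n_fft=512, hop=128, n_bands=8, floor=0.1):
    """Hypothetical frequency-band-aware noise suppression: estimate a per-band noise
    level from a reference segment (e.g. background-only context around the edit
    region) and apply band-wise spectral gating in the STFT domain."""
    window = torch.hann_window(n_fft)
    S = torch.stft(noisy, n_fft, hop, window=window, return_complex=True)      # (F, T)
    N = torch.stft(noise_ref, n_fft, hop, window=window, return_complex=True)

    mag = S.abs()
    noise_mag = N.abs().mean(dim=-1, keepdim=True)                             # per-bin noise floor
    gain = torch.ones_like(mag)
    edges = torch.linspace(0, S.shape[0], n_bands + 1).long()
    for b in range(n_bands):
        lo, hi = edges[b].item(), edges[b + 1].item()
        band_snr = mag[lo:hi] / (noise_mag[lo:hi] + 1e-8)
        # Attenuate only time-frequency cells whose band-wise SNR is low.
        gain[lo:hi] = torch.clamp(1.0 - 1.0 / (band_snr + 1e-8), min=floor)
    return torch.istft(S * gain, n_fft, hop, window=window, length=noisy.shape[-1])

if __name__ == "__main__":
    sr = 16000
    t = torch.arange(sr) / sr
    speech = 0.5 * torch.sin(2 * torch.pi * 220 * t)     # toy "voice"
    noise = 0.1 * torch.randn(sr)                        # toy background noise
    out = band_aware_suppress(speech + noise, noise)
    print(out.shape)                                     # torch.Size([16000])
```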
Click here to view paper screenshots