TTS

Published: 2025-10-10

Updated: 2025-11-27


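⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are suitable only for initial screening before reading a paper.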

2025-10-10 Update

Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis

Authors: Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler

Sarcasm is a subtle form of non-literal language that poses significant challenges for speech synthesis due to its reliance on nuanced semantic, contextual, and prosodic cues. While existing speech synthesis research has focused primarily on broad emotional categories, sarcasm remains largely unexplored. In this paper, we propose a Large Language Model (LLM)-enhanced Retrieval-Augmented framework for sarcasm-aware speech synthesis. Our approach combines (1) semantic embeddings from a LoRA-fine-tuned LLaMA 3, which capture pragmatic incongruity and discourse-level cues of sarcasm, and (2) prosodic exemplars retrieved via a Retrieval Augmented Generation (RAG) module, which provide expressive reference patterns of sarcastic delivery. Integrated within a VITS backbone, this dual conditioning enables more natural and contextually appropriate sarcastic speech. Experiments demonstrate that our method outperforms baselines in both objective measures and subjective evaluations, yielding improvements in speech naturalness, sarcastic expressivity, and downstream sarcasm detection.


Paper and project links

PDF

Summary
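This paper proposes an LLM-enhanced, retrieval-augmented framework for sarcastic speech synthesis. Combining semantic embeddings with retrieved prosodic exemplars makes the synthesized sarcasm more natural and more appropriate to its context.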

Key Takeaways

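Sarcasm is a non-literal form of language that poses a significant challenge for speech synthesis because it relies on nuanced semantic, contextual, and prosodic cues.
Existing speech synthesis research focuses mainly on broad emotional categories, while sarcasm remains largely unexplored.
The paper proposes an LLM-enhanced, retrieval-augmented framework for sarcasm-aware speech synthesis.
Semantic embeddings from a LoRA-fine-tuned LLaMA 3 capture the pragmatic incongruity and discourse-level cues of sarcasm, while prosodic exemplars retrieved by a Retrieval Augmented Generation (RAG) module provide expressive reference patterns of sarcastic delivery.
Both conditioning signals are integrated into a VITS backbone, which makes the sarcastic speech more natural and contextually appropriate.
Experiments show that the method outperforms baselines on both objective measures and subjective evaluations, improving speech naturalness, sarcastic expressivity, and downstream sarcasm detection.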
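
The retrieval step described above can be pictured with a small sketch. This is not the paper's implementation: the abstract does not say how exemplars are indexed or scored, so the cosine-similarity ranking, the toy embedding vectors, and the names `Exemplar`, `cosine`, and `retrieve_exemplars` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Exemplar:
    """A stored sarcastic utterance (hypothetical schema, not from the paper)."""
    utterance_id: str
    embedding: Sequence[float]  # toy semantic embedding of the utterance text


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_exemplars(query, bank, k=2):
    """Return the k exemplars whose embeddings are most similar to the query."""
    ranked = sorted(bank, key=lambda ex: cosine(query, ex.embedding), reverse=True)
    return ranked[:k]


# Toy exemplar bank; a real system would store prosodic references alongside each entry.
bank = [
    Exemplar("oh_great", [0.9, 0.1, 0.0]),
    Exemplar("nice_job", [0.7, 0.3, 0.1]),
    Exemplar("weather", [0.0, 0.2, 0.9]),
]
top = retrieve_exemplars([1.0, 0.0, 0.0], bank, k=2)
print([ex.utterance_id for ex in top])
```

In a pipeline like the one the abstract outlines, the retrieved exemplars would then supply prosodic reference patterns to the synthesizer; how that conditioning is wired into VITS is not specified here and is left out of the sketch.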


Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Authors: Yolo Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu

Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training


Paper and project links

PDF The 1st version

Summary
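Video understanding is the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with decoder-based language models, have shown remarkable capabilities on video understanding tasks. This survey provides the first comprehensive examination of post-training for Video-LMMs, organized around three pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. It clarifies the roles, interconnections, and video-specific adaptations of these techniques, and addresses challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. From an analysis of representative methods, it synthesizes key design principles, insights, and evaluation protocols, and identifies open challenges in reward design, scalability, and cost-performance optimization.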
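
To make "RL from verifiable objectives" concrete for temporal localization, one commonly used check is the temporal intersection-over-union (IoU) between a predicted segment and the annotated one. The sketch below illustrates that idea under stated assumptions and is not a reward taken from the survey: `temporal_iou` and `localization_reward` are hypothetical names, and the 0.5 threshold is arbitrary.

```python
def temporal_iou(pred, gold):
    """IoU of two (start, end) intervals given in seconds."""
    pred_start, pred_end = pred
    gold_start, gold_end = gold
    # Degenerate (empty or reversed) intervals cannot overlap anything.
    if pred_end <= pred_start or gold_end <= gold_start:
        return 0.0
    inter = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - inter
    return inter / union


def localization_reward(pred, gold, threshold=0.5):
    """Binary reward: 1.0 if the predicted segment overlaps enough, else 0.0.

    The threshold is an illustrative assumption, not a value from the survey.
    """
    return 1.0 if temporal_iou(pred, gold) >= threshold else 0.0


# A prediction of 2s-8s against an annotation of 4s-10s overlaps for 4s out of 8s.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))         # 0.5
print(localization_reward((0.0, 1.0), (5.0, 9.0)))   # 0.0 (no overlap)
```

Because the score is computed directly from the annotation, no learned reward model is needed, which is the sense in which such an objective is "verifiable".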

Key Takeaways

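Video understanding is a challenging frontier in computer vision that requires models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence.
Video-LMMs, which integrate visual encoders with decoder-based language models, show remarkable capabilities on video understanding tasks.
The survey examines post-training for Video-LMMs along three pillars: supervised fine-tuning (SFT), reinforcement learning (RL), and test-time scaling (TTS).
SFT uses chain-of-thought, RL learns from verifiable objectives, and TTS relies on enhanced inference computation.
Open challenges for these post-training techniques include reward design, scalability, and cost-performance optimization.
The survey also curates benchmarks, datasets, and metrics, together with evaluation protocols, for rigorous assessment of post-training effectiveness.
It aims to give researchers and practitioners a unified framework for advancing Video-LMM capabilities.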
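
One simple way to spend extra inference compute, which is what test-time scaling refers to here, is to sample several answers to the same question and keep the most frequent one. The sketch below illustrates that idea only; it is not a method described in the survey. `majority_vote` is a hypothetical helper, and the hard-coded `samples` list stands in for answers a Video-LMM would generate.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer after normalization.

    Ties go to the answer that was sampled first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    best = max(counts.values())
    for answer in normalized:
        if counts[answer] == best:
            return answer


# Placeholder for N answers sampled from a model on one multiple-choice question.
samples = ["B", "b ", "A", "B", "C"]
print(majority_vote(samples))
```

Sampling more answers raises inference cost roughly linearly, which ties this technique to the cost-performance trade-off listed among the open challenges.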


Good practices for evaluation of synthesized speech

Authors: Erica Cooper, Sébastien Le Maguer, Esther Klabbers, Junichi Yamagishi

This document is provided as a guideline for reviewers of papers about speech synthesis. We outline some best practices and common pitfalls for papers about speech synthesis, with a particular focus on evaluation. We also recommend that reviewers check the guidelines for authors written in the paper kit and consider those as reviewing criteria as well. This is intended to be a living document, and it will be updated as we receive comments and feedback from readers. We note that this document is meant to provide guidance only, and that reviewers should ultimately use their own discretion when evaluating papers.


Paper and project links

PDF

Summary
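This document is a guideline for reviewers of speech synthesis papers. It outlines best practices and common pitfalls for such papers, with a particular focus on evaluation. Reviewers are advised to also check the author guidelines in the paper kit and to treat them as reviewing criteria. It is a living document that will be updated in response to reader feedback. It offers guidance only; reviewers should ultimately use their own discretion.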

Key Takeaways

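This document provides guidance for reviewers of speech synthesis papers.
It covers best practices and common pitfalls, with a particular focus on evaluation.
Reviewers are encouraged to treat the author guidelines in the paper kit as reviewing criteria while ultimately using their own discretion.
It is a living document that will be updated as readers send comments and feedback.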


Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-10-10/TTS/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!
