
TTS


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never rely on them in serious academic settings; they are only meant as a first-pass screen before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-18

CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation

Authors: Crystal Min Hui Poon, Pai Chet Ng, Xiaoxiao Miao, Immanuel Jun Kai Loh, Bowen Zhang, Haoyu Song, Ian Mcloughlin

Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist: accent bias, where models default to dominant phonetic patterns, and linguistic bias, where dialect-specific lexical and cultural cues are ignored. These biases are interdependent, as authentic accent generation requires both accent fidelity and localized text. We present Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis (CLARITY), a backbone-agnostic framework that addresses these biases through dual-signal optimization: (i) contextual linguistic adaptation that localizes input text to the target dialect, and (ii) retrieval-augmented accent prompting (RAAP) that supplies accent-consistent speech prompts. Across twelve English accents, CLARITY improves accent accuracy and fairness while maintaining strong perceptual quality.

Paper and Project Links

PDF Submitted to ICASSP 2026

Summary

This entry reviews the state of instruction-guided text-to-speech (TTS): although speech generation quality is now very high, two coupled biases remain, accent bias and linguistic bias. To address them, the authors propose CLARITY, a contextual linguistic adaptation and retrieval framework that balances accent fidelity and text localization through dual-signal optimization. The framework has two main components: contextual linguistic adaptation and retrieval-augmented accent prompting (RAAP). It improves accent accuracy, fairness, and perceptual quality, and performs well across different English accents.

Key Takeaways

  1. Text-to-speech (TTS) research suffers from both accent bias and linguistic bias: accent bias means models default to dominant phonetic patterns, while linguistic bias means dialect-specific lexical and cultural cues are ignored. The two are interdependent.

  2. Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis (CLARITY) is a framework that addresses both biases, balancing accent fidelity and text localization through dual-signal optimization.

  3. CLARITY has two components: contextual linguistic adaptation, which localizes the input text to the target dialect, and retrieval-augmented accent prompting (RAAP), which supplies accent-consistent speech prompts to improve naturalness and accuracy (see the retrieval sketch after this list).

  4. The framework improves accent accuracy, fairness, and perceptual quality, as well as the generality and performance of the model.

  5. It adapts well across different English accents, suggesting broad applicability.

  6. Applying CLARITY can make TTS systems fairer and more inclusive.
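
The abstract describes RAAP only at a high level, so here is a minimal sketch of what retrieval-augmented accent prompting could look like: given an embedding of the requested accent, pick the closest reference clips from a prompt bank by cosine similarity. The embedding dimension, file names, and the `retrieve_accent_prompt` helper are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of accent-consistent prompt retrieval (assumed interface).
import numpy as np

def cosine_sim(query: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of a bank."""
    query = query / np.linalg.norm(query)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return bank @ query

def retrieve_accent_prompt(target_emb: np.ndarray,
                           bank_embs: np.ndarray,
                           bank_paths: list[str],
                           top_k: int = 3) -> list[str]:
    """Return paths of the top-k reference clips closest to the target accent."""
    scores = cosine_sim(target_emb, bank_embs)
    best = np.argsort(-scores)[:top_k]
    return [bank_paths[i] for i in best]

# Hypothetical usage: a 12-accent prompt bank with precomputed accent embeddings.
rng = np.random.default_rng(0)
bank = rng.normal(size=(1200, 256))            # stand-in accent embeddings
paths = [f"clip_{i:04d}.wav" for i in range(1200)]
query = rng.normal(size=256)                   # embedding of the requested accent
print(retrieve_accent_prompt(query, bank, paths))
```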

Cool Papers

Click here to view paper screenshots

Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces

Authors: Farhan Sheth, Girish, Mohd Mujtaba Akhtar, Muskaan Singh

In this work, we address the challenge of generalizable audio deepfake detection (ADD) across diverse speech synthesis paradigms-including conventional text-to-speech (TTS) systems and modern diffusion or flow-matching (FM) based generators. Prior work has mostly targeted individual synthesis families and often fails to generalize across paradigms due to overfitting to generation-specific artifacts. We hypothesize that synthetic speech, irrespective of its generative origin, leaves behind shared structural distortions in the embedding space that can be aligned through geometry-aware modeling. To this end, we propose RHYME, a unified detection framework that fuses utterance-level embeddings from diverse pretrained speech encoders using non-Euclidean projections. RHYME maps representations into hyperbolic and spherical manifolds-where hyperbolic geometry excels at modeling hierarchical generator families, and spherical projections capture angular, energy-invariant cues such as periodic vocoder artifacts. The fused representation is obtained via Riemannian barycentric averaging, enabling synthesis-invariant alignment. RHYME outperforms individual PTMs and homogeneous fusion baselines, achieving top performance and setting new state-of-the-art in cross-paradigm ADD.

Paper and Project Links

PDF Accepted to IJCNLP-AACL 2025

Summary

This paper proposes RHYME, a unified detection framework for audio deepfake detection (ADD) across diverse speech synthesis paradigms. It fuses utterance-level embeddings from different pretrained speech encoders via non-Euclidean projections, mapping representations into hyperbolic and spherical manifolds: hyperbolic geometry excels at modeling hierarchical generator families, while spherical projections capture angular, energy-invariant cues such as periodic vocoder artifacts. The fused representation is obtained by Riemannian barycentric averaging, enabling synthesis-invariant alignment. RHYME performs strongly on cross-paradigm ADD, outperforming individual pretrained models and homogeneous fusion baselines and setting a new state of the art.

Key Takeaways

  1. The paper proposes RHYME, a new audio deepfake detection framework designed to generalize across different speech synthesis paradigms.
  2. RHYME fuses utterance-level embeddings from multiple pretrained speech encoders and models them with non-Euclidean projections.
  3. Hyperbolic geometry excels at modeling hierarchical generator families, while spherical projections help capture angular, energy-invariant cues (see the projection sketch after this list).
  4. The fused representation is obtained via Riemannian barycentric averaging, enabling synthesis-invariant alignment.
  5. RHYME performs strongly on cross-paradigm audio deepfake detection, outperforming individual pretrained models and other fusion methods.
  6. The framework achieves top performance and sets a new state of the art.
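
As a rough illustration of the non-Euclidean projections mentioned above, the sketch below maps encoder embeddings onto a Poincaré ball (via the exponential map at the origin) and onto the unit hypersphere, then combines the views with a plain average. The curvature value and the naive averaging step are assumptions; the paper instead fuses via Riemannian barycentric averaging, which this sketch does not implement.

```python
# Sketch of hyperbolic + spherical views of utterance embeddings (simplified).
import numpy as np

def poincare_exp0(v: np.ndarray, c: float = 1.0) -> np.ndarray:
    """Exponential map at the origin of the Poincare ball with curvature c."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def sphere_proj(v: np.ndarray) -> np.ndarray:
    """Project onto the unit hypersphere (angular, energy-invariant view)."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)

def fuse(encoder_embs: list[np.ndarray], c: float = 1.0) -> np.ndarray:
    """Concatenate both views per encoder, then average across encoders.

    Plain averaging is a stand-in for the paper's Riemannian barycentric mean.
    """
    views = [np.concatenate([poincare_exp0(e, c), sphere_proj(e)], axis=-1)
             for e in encoder_embs]
    return np.mean(views, axis=0)

rng = np.random.default_rng(0)
encoders = [rng.normal(size=(4, 128)) for _ in range(3)]  # 3 hypothetical PTMs
print(fuse(encoders).shape)  # (4, 256)
```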

Cool Papers

Click here to view paper screenshots

Do AI Voices Learn Social Nuances? A Case of Politeness and Speech Rate

Authors: Eyal Rabin, Zohar Elyoseph, Rotem Israel-Fishelson, Adi Dali, Ravit Nussinson

Voice-based artificial intelligence is increasingly expected to adhere to human social conventions, but can it learn implicit cues that are not explicitly programmed? This study investigates whether state-of-the-art text-to-speech systems have internalized the human tendency to reduce speech rate to convey politeness - a non-obvious prosodic marker. We prompted 22 synthetic voices from two leading AI platforms (AI Studio and OpenAI) to read a fixed script under both “polite and formal” and “casual and informal” conditions and measured the resulting speech duration. Across both AI platforms, the polite prompt produced slower speech than the casual prompt with very large effect sizes, an effect that was statistically significant for all of AI Studio’s voices and for a large majority of OpenAI’s voices. These results demonstrate that AI can implicitly learn and replicate psychological nuances of human communication, highlighting its emerging role as a social actor capable of reinforcing human social norms.

Paper and Project Links

PDF

Summary
This study finds that voice AI can learn and reproduce subtle psychological nuances of human communication while following human social conventions. The authors tested state-of-the-art text-to-speech systems under "polite and formal" versus "casual and informal" prompts and found that the systems spoke more slowly in the polite condition. The results suggest that AI can internalize implicit features of human communication and act as a social actor that reinforces social norms.

Key Takeaways

  1. Voice AI is increasingly expected to follow human social conventions, and it can learn implicit cues that were never explicitly programmed.
  2. The study examines whether state-of-the-art text-to-speech systems have internalized the human tendency to slow down speech to convey politeness.
  3. AI voices were tested under different prompt conditions and spoke more slowly in the polite condition.
  4. The AI platforms show an implicit ability to learn and imitate human politeness patterns.
  5. The slowdown under the polite prompt was statistically significant, with very large effect sizes, for all of AI Studio's voices and most of OpenAI's voices (see the measurement sketch after this list).
  6. The study points to AI's broader potential in social communication, reflected in its ability to internalize social norms and subtle psychological differences.
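
A minimal sketch of the kind of duration comparison the study reports, assuming the polite and casual clips for each voice have already been synthesized to disk; the file-naming scheme and the use of `soundfile` and `scipy` are illustrative, not the authors' pipeline.

```python
# Compare speech durations under polite vs. casual prompts (paired design).
import numpy as np
import soundfile as sf
from scipy import stats

def clip_duration(path: str) -> float:
    """Duration of an audio file in seconds."""
    info = sf.info(path)
    return info.frames / info.samplerate

voices = [f"voice_{i:02d}" for i in range(22)]                 # 22 synthetic voices
polite = np.array([clip_duration(f"{v}_polite.wav") for v in voices])
casual = np.array([clip_duration(f"{v}_casual.wav") for v in voices])

diff = polite - casual
cohens_d = diff.mean() / diff.std(ddof=1)           # paired-samples effect size
t_stat, p_val = stats.ttest_rel(polite, casual)     # paired t-test
print(f"d={cohens_d:.2f}, t={t_stat:.2f}, p={p_val:.4f}")
```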

Cool Papers

Click here to view paper screenshots

StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak

Authors: Hongyi Li, Chengxuan Zhou, Chu Wang, Sicheng Liang, Yanting Chen, Qinlin Xie, Jiawei Ye, Jie Wu

Large Audio-language Models (LAMs) have recently enabled powerful speech-based interactions by coupling audio encoders with Large Language Models (LLMs). However, the security of LAMs under adversarial attacks remains underexplored, especially through audio jailbreaks that craft malicious audio prompts to bypass alignment. Existing efforts primarily rely on converting text-based attacks into speech or applying shallow signal-level perturbations, overlooking the impact of human speech’s expressive variations on LAM alignment robustness. To address this gap, we propose StyleBreak, a novel style-aware audio jailbreak framework that systematically investigates how diverse human speech attributes affect LAM alignment robustness. Specifically, StyleBreak employs a two-stage style-aware transformation pipeline that perturbs both textual content and audio to control linguistic, paralinguistic, and extralinguistic attributes. Furthermore, we develop a query-adaptive policy network that automatically searches for adversarial styles to enhance the efficiency of LAM jailbreak exploration. Extensive evaluations demonstrate that LAMs exhibit critical vulnerabilities when exposed to diverse human speech attributes. Moreover, StyleBreak achieves substantial improvements in attack effectiveness and efficiency across multiple attack paradigms, highlighting the urgent need for more robust alignment in LAMs.

Paper and Project Links

PDF Accepted by AAAI 2026

Summary
Large audio-language models (LAMs) couple audio encoders with large language models (LLMs) to enable powerful speech-based interaction, but their security under malicious audio prompts has not been studied in depth, and existing attacks overlook how the expressive variation of human speech affects LAM alignment robustness. To address this, the authors propose StyleBreak, which uses a two-stage style-aware transformation pipeline that perturbs both textual content and audio to control linguistic, paralinguistic, and extralinguistic attributes. In addition, a query-adaptive policy network automatically searches for adversarial styles, improving the efficiency of jailbreak exploration. Evaluations show that LAMs exhibit critical vulnerabilities when exposed to diverse speech attributes, and that StyleBreak substantially improves attack effectiveness and efficiency across multiple attack paradigms.

Key Takeaways

  1. LAMs couple audio encoders with LLMs to enable powerful speech-based interaction.
  2. The security of LAMs under malicious audio prompt attacks has not been sufficiently studied.
  3. Existing attacks overlook the impact of expressive variation in human speech on LAM alignment robustness.
  4. The StyleBreak framework perturbs LAMs through a style-aware transformation pipeline that manipulates linguistic, paralinguistic, and extralinguistic attributes.
  5. StyleBreak uses a two-stage pipeline that perturbs both textual content and audio to strengthen the attack (see the perturbation sketch after this list).
  6. A query-adaptive policy network automatically searches for adversarial styles, improving the efficiency of jailbreak exploration.
  7. Evaluations show that LAMs are critically vulnerable to diverse speech attributes, and StyleBreak markedly improves attack effectiveness and efficiency.
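
To make the style-aware perturbation idea concrete, here is a minimal sketch that varies only two paralinguistic attributes (speaking rate and pitch) with `librosa` and picks the most adversarial variant by brute force. The style grid, the `score_fn` callback, and the brute-force search stand in for the paper's text rewriting and query-adaptive policy network, so this is an assumption-laden simplification rather than StyleBreak itself.

```python
# Sketch: grid search over simple paralinguistic styles of a spoken prompt.
import itertools
import numpy as np
import librosa

def apply_style(y: np.ndarray, sr: int, rate: float, n_steps: float) -> np.ndarray:
    y = librosa.effects.time_stretch(y, rate=rate)               # speaking-rate change
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)   # pitch change
    return y

def search_styles(y, sr, score_fn, rates=(0.8, 1.0, 1.25), steps=(-2, 0, 2)):
    """Return (rate, n_steps, audio) maximizing score_fn (an attack-success proxy)."""
    best, best_score = None, -np.inf
    for rate, n_steps in itertools.product(rates, steps):
        candidate = apply_style(y, sr, rate, n_steps)
        score = score_fn(candidate, sr)
        if score > best_score:
            best, best_score = (rate, n_steps, candidate), score
    return best

# Hypothetical usage:
#   y, sr = librosa.load("prompt.wav", sr=16000)
#   rate, n_steps, adv_audio = search_styles(y, sr, my_scoring_function)
```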

Cool Papers

Click here to view paper screenshots

Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard

Authors: Yudong Yang, Xuezhen Zhang, Zhifeng Han, Siyin Wang, Jimin Zhuang, Zengrui Jin, Jing Shao, Guangzhi Sun, Chao Zhang

Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but exposing new safety risks emerging from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embeds harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which imply unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that, even Gemini 2.5 Pro, the state-of-the-art proprietary LLM, still exhibits 66% attack success rate in SACRED-Bench test set, exposing vulnerabilities under cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing attack success down to 20%. Our results highlight the need for audio-aware defenses for the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: this paper includes examples that may be offensive or harmful.

Paper and Project Links

PDF

Summary

Recent progress in large language models (LLMs) enables understanding of both speech and non-speech audio, but complex audio inputs also introduce new safety risks that current safeguards handle poorly. The authors introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate LLM robustness under complex audio attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition through three mechanisms: (1) speech overlap and multi-speaker dialogue that embed harmful prompts beneath or alongside benign speech; (2) speech-audio mixtures that convey unsafe intent via non-speech audio alongside benign speech or audio; and (3) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that even Gemini 2.5 Pro, a state-of-the-art proprietary LLM, has a 66% attack success rate on the SACRED-Bench test set, exposing its vulnerability to cross-modal, speech-audio composition attacks. To close this gap, the authors propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing the attack success rate to 20%. The results highlight the need for audio-aware defenses in multimodal LLMs. The benchmark and SALMONN-Guard checkpoints are available at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: the paper includes examples that may be offensive or harmful.

Key Takeaways

  1. LLMs can now understand both speech and non-speech audio, but complex audio inputs bring new safety threats.
  2. SACRED-Bench is introduced to evaluate LLM robustness under complex audio-based attacks.
  3. SACRED-Bench attacks through mechanisms such as speech overlap and multi-speaker dialogue (see the mixing sketch after this list).
  4. Even the most advanced LLMs still show high attack success rates; SACRED-Bench demonstrates the threat of cross-modal attacks.
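
As an illustration of the speech-overlap mechanism, the sketch below overlays a hidden utterance onto a benign carrier at reduced gain. The sine-wave stand-ins for speech and the -12 dB mixing level are assumptions for demonstration; the benchmark itself uses real speech and audio.

```python
# Sketch: mix a hidden utterance beneath a benign carrier waveform.
import numpy as np

def mix_overlap(benign: np.ndarray, hidden: np.ndarray, gain_db: float = -12.0) -> np.ndarray:
    """Overlay `hidden` (attenuated by gain_db) onto `benign`, padding to equal length."""
    n = max(len(benign), len(hidden))
    out = np.zeros(n, dtype=np.float32)
    out[:len(benign)] += benign
    out[:len(hidden)] += (10.0 ** (gain_db / 20.0)) * hidden
    peak = np.max(np.abs(out)) + 1e-9
    return out / peak if peak > 1.0 else out      # normalize only if clipping

sr = 16000
t = np.arange(sr) / sr
benign = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)  # stand-in "benign speech"
hidden = 0.5 * np.sin(2 * np.pi * 330 * t).astype(np.float32)  # stand-in "hidden prompt"
mixed = mix_overlap(benign, hidden)
print(mixed.shape, mixed.dtype)
```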

Cool Papers

Click here to view paper screenshots

Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

Authors: Shubhashis Roy Dipta, Francis Ferraro

Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.

Paper and Project Links

PDF Accepted in IJCNLP-AACL 2025 (also presented in MAGMAR 2025 at ACL 2025)

Summary

The paper presents Q2E, a query-to-event decomposition method for zero-shot multilingual text-to-video retrieval. It automatically extracts latent parametric knowledge about complex real-world events from LLMs and VLMs to improve retrieval accuracy, and combines knowledge from multiple modalities via entropy-based fusion scoring. Q2E outperforms existing methods across two datasets and multiple retrieval metrics, and integrating audio information further improves text-to-video retrieval.

Key Takeaways

  1. Q2E uses parametric knowledge from LLMs and VLMs to improve identification and retrieval of videos related to complex real-world events.
  2. Q2E strengthens the understanding of otherwise over-simplified queries by decomposing them with the knowledge embedded in LLMs and VLMs.
  3. The method adapts to different datasets, domains, LLMs, and VLMs, giving it broad applicability.
  4. Entropy-based fusion scoring combines knowledge from multiple modalities for zero-shot fusion (see the fusion sketch after this list).
  5. Evaluations on two diverse datasets show that Q2E outperforms existing methods on multiple retrieval metrics.
  6. Integrating audio information significantly improves text-to-video retrieval.
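
The abstract names entropy-based fusion scoring without details, so here is one plausible, minimal reading: weight each modality's score vector by the inverse entropy of its softmax distribution, so that more confident modalities dominate the fusion. The exact weighting rule and score values are assumptions, not the paper's formula.

```python
# Sketch: entropy-weighted zero-shot fusion of per-modality retrieval scores.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def entropy_fused_scores(modality_scores: list[np.ndarray]) -> np.ndarray:
    """Fuse score vectors from several modalities, weighted by inverse entropy."""
    weights = []
    for s in modality_scores:
        p = softmax(s)
        h = -np.sum(p * np.log(p + 1e-12))        # Shannon entropy (nats)
        weights.append(1.0 / (h + 1e-12))         # low entropy -> high weight
    w = np.array(weights)
    w = w / w.sum()
    return sum(wi * softmax(si) for wi, si in zip(w, modality_scores))

visual = np.array([0.2, 2.5, 0.1, 0.3])   # hypothetical visual-based scores
speech = np.array([0.5, 0.4, 0.6, 0.5])   # hypothetical speech-based scores
print(entropy_fused_scores([visual, speech]))  # the confident modality dominates
```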

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!