
TTS


⚠️ All of the summaries below are produced by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never rely on them in serious academic settings; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

2025-11-22 Update

SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise

Authors:Rui Sang, Yuxuan Liu

Voice cloning technology poses significant privacy threats by enabling unauthorized speech synthesis from limited audio samples. Existing defenses based on imperceptible adversarial perturbations are vulnerable to common audio preprocessing such as denoising and compression. We propose SceneGuard, a training-time voice protection method that applies scene-consistent audible background noise to speech recordings. Unlike imperceptible perturbations, SceneGuard leverages naturally occurring acoustic scenes (e.g., airport, street, park) to create protective noise that is contextually appropriate and robust to countermeasures. We evaluate SceneGuard on text-to-speech training attacks, demonstrating 5.5% speaker similarity degradation with extremely high statistical significance (p < 10^{-15}, Cohen’s d = 2.18) while preserving 98.6% speech intelligibility (STOI = 0.986). Robustness evaluation shows that SceneGuard maintains or enhances protection under five common countermeasures including MP3 compression, spectral subtraction, lowpass filtering, and downsampling. Our results suggest that audible, scene-consistent noise provides a more robust alternative to imperceptible perturbations for training-time voice protection. The source code is available at: https://github.com/richael-sang/SceneGuard.
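
To make the quantities above concrete, here is a minimal, self-contained sketch (not the authors' implementation) of the two basic operations involved: mixing a scene recording into speech at a chosen signal-to-noise ratio, and computing Cohen's d between speaker-similarity scores measured with and without protection. The 10 dB SNR and all score values are illustrative assumptions, not numbers from the paper.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, scene_noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix an audible background scene into speech at a target signal-to-noise ratio (dB)."""
    noise = np.resize(scene_noise, speech.shape)              # loop/trim noise to speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Effect size between two samples of scores, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
    return float((a.mean() - b.mean()) / pooled)

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # stand-in for a 1 s, 16 kHz utterance
scene = rng.standard_normal(16000)         # stand-in for an airport/street/park recording
protected = mix_at_snr(speech, scene, snr_db=10.0)    # assumed SNR, not taken from the paper

# Hypothetical speaker-similarity scores of voices cloned from clean vs. protected recordings.
sim_clean = rng.normal(0.80, 0.02, 100)
sim_protected = rng.normal(0.756, 0.03, 100)          # roughly a 5.5% relative drop
print(f"Cohen's d = {cohens_d(sim_clean, sim_protected):.2f}")
```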


Paper and Project Links

PDF

Summary

Voice cloning technology threatens privacy by enabling unauthorized speech synthesis from limited audio samples, and existing defenses based on imperceptible adversarial perturbations are vulnerable to common audio preprocessing such as denoising and compression. SceneGuard is a training-time voice protection method that applies scene-consistent, audible background noise to speech recordings. Unlike imperceptible perturbations, it draws on naturally occurring acoustic scenes (e.g., airports, streets, parks) to create protective noise that is contextually appropriate and robust to countermeasures. Evaluated on text-to-speech training attacks, SceneGuard degrades speaker similarity by 5.5% with extremely high statistical significance (p < 10^-15, Cohen's d = 2.18) while preserving 98.6% speech intelligibility (STOI = 0.986). Robustness evaluation shows that protection is maintained or enhanced under five common countermeasures, including MP3 compression, spectral subtraction, lowpass filtering, and downsampling. The results suggest that audible, scene-consistent noise offers a more robust alternative to imperceptible perturbations for training-time voice protection.

Key Takeaways

  1. Voice cloning threatens privacy by enabling unauthorized speech synthesis from limited audio samples.
  2. Existing defenses rely on imperceptible adversarial perturbations and are vulnerable to common audio preprocessing such as denoising and compression.
  3. SceneGuard provides training-time voice protection by applying scene-consistent, audible background noise.
  4. SceneGuard builds protective noise from natural acoustic scenes, making it contextually appropriate and robust to countermeasures.
  5. Against text-to-speech training attacks, SceneGuard markedly degrades speaker similarity with high statistical significance while preserving high speech intelligibility.
  6. SceneGuard maintains or enhances protection under several common countermeasures.

Cool Papers

Click here to view paper screenshots

Step-Audio-R1 Technical Report

Authors:Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu

Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
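
The abstract names the MGRD framework but does not spell out its mechanics. Purely as a hypothetical illustration of what "modality-grounded" filtering of distillation data could look like (this is not Step-Audio-R1's published procedure), the sketch below keeps a teacher-generated reasoning chain only if it mentions at least one acoustic attribute known to be present in the clip; the Sample fields and tag-matching rule are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    audio_id: str
    acoustic_tags: set[str]     # attributes known for the clip, e.g. {"applause", "rain"}
    reasoning_chain: str        # teacher-generated chain-of-thought
    answer: str

def is_grounded(sample: Sample) -> bool:
    """Toy grounding check: the chain must mention at least one known acoustic attribute."""
    text = sample.reasoning_chain.lower()
    return any(tag.lower() in text for tag in sample.acoustic_tags)

def build_distillation_set(samples: list[Sample]) -> list[Sample]:
    """Keep only reasoning chains that appear anchored in the audio before using them for SFT."""
    return [s for s in samples if is_grounded(s)]

# Toy usage with two hypothetical teacher outputs.
samples = [
    Sample("a1", {"applause", "crowd"}, "The applause and crowd noise suggest a live venue...", "concert"),
    Sample("a2", {"rain", "thunder"}, "The speaker sounds sarcastic, so the answer is...", "storm"),
]
print([s.audio_id for s in build_distillation_set(samples)])   # only 'a1' is kept
```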


Paper and Project Links

PDF 15 pages, 5 figures. Technical Report

Summary

Recent reasoning models have achieved remarkable success in text and vision, yet audio language models puzzlingly perform better with minimal or no reasoning. Step-Audio-R1 is introduced as the first audio reasoning model to successfully unlock reasoning capabilities in the audio domain. Through the proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate reasoning chains grounded in acoustic features rather than hallucinating disconnected deliberations. The model shows strong audio reasoning capability, surpassing Gemini 2.5 Pro and performing comparably to the state-of-the-art Gemini 3 Pro on comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. The results show that reasoning is a transferable capability across modalities when appropriately anchored, turning extended deliberation from a liability into a powerful asset for audio intelligence, and Step-Audio-R1 opens new paths toward multimodal reasoning systems that think deeply across sensory modalities.

Key Takeaways

  1. Audio language models perform better with minimal or no reasoning, raising the fundamental question of whether audio intelligence can benefit from deliberate thinking.
  2. Step-Audio-R1 is introduced as the first audio reasoning model with genuine audio reasoning capability.
  3. Through the proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 generates reasoning chains grounded in acoustic features.
  4. Step-Audio-R1 excels on audio understanding and reasoning benchmarks, surpassing Gemini 2.5 Pro and matching Gemini 3 Pro.
  5. Reasoning is a transferable capability across modalities; when appropriately anchored, extended deliberation becomes an asset for audio intelligence.
  6. Step-Audio-R1 paves the way toward multimodal reasoning systems that can think deeply across all sensory modalities.

Cool Papers

Click here to view paper screenshots

UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

Authors:Wenhao Guan, Zhikang Niu, Ziyue Jiang, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li, Xie Chen

Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method can achieve or exceed current single-task modeling methods in both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation. Code is available at https://github.com/gwh22/UniVoice.
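
The dual attention mechanism described above reduces to switching the attention mask by task. Below is a minimal PyTorch sketch of that idea on a toy 8-position sequence; the sequence layout, function name, and mode strings are illustrative assumptions, not UniVoice's actual code.

```python
import torch

def dual_attention_mask(seq_len: int, mode: str) -> torch.Tensor:
    """Boolean mask where True marks positions that may be attended to.

    'asr' -> causal mask for autoregressive recognition;
    'tts' -> bidirectional mask for flow-matching synthesis.
    """
    if mode == "asr":
        return torch.tril(torch.ones(seq_len, seq_len)).bool()
    if mode == "tts":
        return torch.ones(seq_len, seq_len).bool()
    raise ValueError(f"unknown mode: {mode}")

scores = torch.randn(8, 8)                 # toy unnormalized attention scores
for mode in ("asr", "tts"):
    mask = dual_attention_mask(8, mode)
    attn = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
    print(mode, "positions attended per query:", (attn > 0).sum(dim=-1).tolist())
```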


Paper and Project Links

PDF

Summary

LLM-based automatic speech recognition (ASR) and text-to-speech (TTS) systems perform impressively and are becoming the mainstream approach, yet most current methods handle the two tasks separately rather than within a unified framework. This work presents UniVoice, a unified LLM framework built on continuous representations that seamlessly integrates speech recognition and synthesis in a single model, combining autoregressive modeling for recognition with flow matching for high-quality generation. Experiments show the method matches or exceeds current single-task approaches on both ASR and zero-shot TTS. The code has been released on GitHub.

Key Takeaways

  1. Large language models perform strongly in ASR and TTS systems and are becoming the mainstream approach.
  2. Most current methods handle ASR and TTS separately and lack a unified framework.
  3. UniVoice is a unified large-language-model framework that seamlessly integrates speech recognition and speech synthesis.
  4. UniVoice combines autoregressive modeling for speech recognition with flow matching for high-quality speech generation.
  5. UniVoice uses a dual attention mechanism that switches flexibly between recognition and synthesis.
  6. The proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning.

Cool Papers

Click here to view paper screenshots

JWST spectroscopic confirmation of the Cosmic Gems arc at z=9.625 – Insights into the small scale structure of a post-burst system

Authors:M. Messa, E. Vanzella, F. Loiacono, A. Adamo, M. Oguri, K. Sharon, L. D. Bradley, L. Christensen, A. Claeyssens, J. Richard, Abdurro’uf, F. E. Bauer, P. Bergamini, A. Bolamperti, M. Bradač, F. Calura, D. Coe, J. M. Diego, C. Grillo, T. Y-Y. Hsiao, A. K. Inoue, S. Fujimoto, M. Lombardi, M. Meneghetti, T. Resseguier, M. Ricotti, P. Rosati, B. Welch, R. A. Windhorst, X. Xu, E. Zackrisson, A. Zanella, A. Zitrin

We present JWST/NIRSpec integral field spectroscopy of the Cosmic Gems arc, strongly magnified by the galaxy cluster SPT-CL J0615$-$5746. Six-hour integration using NIRSpec prism spectroscopy (resolution $\rm R\simeq 30-300$), covering the spectral range $0.8-5.3μm$, reveals a pronounced $\rm Lyα$-continuum break at $λ\simeq 1.3μm$, as well as weak optical $\rm Hβ$ and $\rm [OIII]\lambda4959$ emission lines at $z=9.625\pm0.002$, located in the reddest part of the spectrum ($λ> 5.1μm$). No additional ultraviolet or optical emission lines are reliably detected. A weak Balmer break is measured alongside a very blue ultraviolet slope ($β\leq-2.5$, $\rm F_λ \sim λ^β$). Spectral fitting with $\tt Bagpipes$ suggests that the Cosmic Gems galaxy is in a post-starburst phase, making it the highest-redshift system currently observed in a mini-quenched state. Spatially resolved spectroscopy at tens of parsecs shows relatively uniform features across subcomponents of the arc. These findings align well with the physical properties previously derived from JWST/NIRCam photometry of the stellar clusters, now corroborated by spectroscopic evidence. In particular, five observed star clusters exhibit ages of $\rm 7-30Myr$. An updated lens model constrains the intrinsic sizes and masses of these clusters, confirming they are extremely compact and denser than typical star clusters in local star-forming galaxies. Additionally, four compact stellar systems consistent with star clusters ($\lesssim10$ pc) are identified along the extended tail of the arc. A sub-parsec line-emitting HII region straddling the critical line, lacking a NIRCam counterpart, is also serendipitously detected.
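
As a small worked example of one quantity quoted above, the ultraviolet slope β defined by F_λ ∝ λ^β can be estimated as a straight-line fit in log-log space. The wavelengths and flux densities below are made-up illustrative values, not measurements from the paper.

```python
import numpy as np

# Hypothetical rest-frame UV photometry: wavelength (Angstrom) and flux density F_lambda.
wavelength = np.array([1500.0, 1800.0, 2200.0, 2600.0])
flux_lambda = np.array([4.1e-20, 2.6e-20, 1.55e-20, 1.0e-20])   # arbitrary units

# F_lambda ∝ lambda^beta  =>  log F_lambda = beta * log lambda + const.
beta, const = np.polyfit(np.log10(wavelength), np.log10(flux_lambda), deg=1)
print(f"UV slope beta ≈ {beta:.2f}")   # values <= -2.5 indicate a very blue continuum
```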


Paper and Project Links

PDF 22 pages (15 figures, 4 tables). Accepted for publication in A&A; see also the companion work Vanzella et al. 2025

Summary

JWST/NIRSpec integral field spectroscopy of the Cosmic Gems arc, using a six-hour integration with NIRSpec prism spectroscopy (resolution R ≈ 30-300) covering 0.8-5.3 μm, reveals a pronounced Lyα continuum break together with weak optical Hβ and [OIII]λ4959 emission lines at the red end of the spectrum. The Cosmic Gems galaxy is in a post-starburst phase, making it the highest-redshift system yet observed in a mini-quenched state. Spatially resolved spectroscopy shows relatively uniform features across the sub-components of the arc. These findings agree with the physical properties previously derived from JWST/NIRCam photometry of the star clusters and are now corroborated by spectroscopic evidence.

Key Takeaways

  1. JWST/NIRSpec obtained deep spectroscopy of the Cosmic Gems arc with a six-hour integration.
  2. The observations reveal a pronounced Lyα continuum break as well as Hβ and [OIII]λ4959 emission lines at z = 9.625.
  3. The Cosmic Gems galaxy is in a post-starburst phase, the highest-redshift system yet observed in a mini-quenched state.
  4. Spatially resolved spectroscopy shows relatively uniform features across the sub-components of the arc.
  5. The findings agree with the physical properties previously derived from JWST/NIRCam photometry of the star clusters, now corroborated by spectroscopic evidence.
  6. The observed star clusters have ages of 7-30 Myr, and an updated lens model constrains their intrinsic sizes and masses.

Cool Papers

Click here to view paper screenshots

Efficient Environmental Claim Detection with Hyperbolic Graph Neural Networks

Authors:Darpan Aswal, Manjira Sinha

Transformer-based models, especially large language models (LLMs), dominate the field of NLP with their mass adoption in tasks such as text generation, summarization and fake news detection. These models offer ease of deployment and reliability for most applications; however, they require significant amounts of computational power for training as well as inference. This poses challenges in their adoption in resource-constrained applications, especially in the open-source community where compute availability is usually scarce. This work proposes a graph-based approach for Environmental Claim Detection, exploring Graph Neural Networks (GNNs) and Hyperbolic Graph Neural Networks (HGNNs) as lightweight yet effective alternatives to transformer-based models. Re-framing the task as a graph classification problem, we transform claim sentences into dependency parsing graphs, utilizing a combination of word2vec & learnable part-of-speech (POS) tag embeddings for the node features and encoding syntactic dependencies in the edge relations. Our results show that our graph-based models, particularly HGNNs in the Poincaré space (P-HGNNs), achieve performance superior to the state-of-the-art on environmental claim detection while using up to 30x fewer parameters. We also demonstrate that HGNNs benefit vastly from explicitly modeling data in hierarchical (tree-like) structures, enabling them to significantly improve over their Euclidean counterparts.
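
For readers unfamiliar with hyperbolic embeddings, the sketch below shows two standard Poincaré-ball operations (curvature c = 1) that hyperbolic GNNs typically build on: the exponential map at the origin, which lifts Euclidean feature vectors into the unit ball, and the hyperbolic distance between points. This is generic background math rather than the authors' P-HGNN code, and the toy vectors are arbitrary.

```python
import numpy as np

def expmap0(v: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Exponential map at the origin of the Poincaré ball (curvature c = 1):
    maps a Euclidean (tangent) vector into the open unit ball."""
    norm = np.linalg.norm(v)
    return np.tanh(norm) * v / (norm + eps)

def poincare_distance(x: np.ndarray, y: np.ndarray, eps: float = 1e-9) -> float:
    """Hyperbolic distance between two points inside the unit ball."""
    sq_diff = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / (denom + eps)))

# Toy node features (e.g., word2vec + POS embeddings after a linear layer) lifted into the ball.
u = expmap0(np.array([0.5, -0.2, 0.1]))
w = expmap0(np.array([1.5, 0.4, -0.3]))
print(f"hyperbolic distance: {poincare_distance(u, w):.3f}")
# Points near the ball's boundary become exponentially far apart, which is why tree-like
# structures such as dependency parses embed with low distortion in this space.
```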


Paper and Project Links

PDF

Summary

This paper proposes a graph-based approach to environmental claim detection, exploring Graph Neural Networks (GNNs) and Hyperbolic Graph Neural Networks (HGNNs) as lightweight yet effective alternatives to transformer-based models. By re-framing the task as graph classification, claim sentences are converted into dependency parsing graphs, with word2vec and learnable part-of-speech (POS) tag embeddings as node features and syntactic dependencies encoded in the edge relations. Results show that the graph-based models, particularly HGNNs in the Poincaré space, outperform the state of the art on environmental claim detection while using up to 30x fewer parameters. HGNNs also benefit greatly from explicitly modeling hierarchical (tree-like) structure, significantly improving over their Euclidean counterparts.

Key Takeaways

  1. Transformer-based models dominate NLP but require substantial compute for training and inference, which is a challenge in resource-constrained settings, especially in the open-source community where compute is scarce.
  2. This work proposes a graph-based approach to environmental claim detection, using Graph Neural Networks (GNNs) and Hyperbolic Graph Neural Networks (HGNNs) as lightweight alternatives.
  3. The task is re-framed as graph classification: claim sentences become dependency parsing graphs, with word2vec plus POS-tag embeddings as node features and syntactic dependencies encoded as edge relations.
  4. Experiments show that the graph-based models, especially HGNNs in the Poincaré space, outperform current mainstream methods on environmental claim detection with markedly better parameter efficiency.
  5. HGNNs benefit greatly from explicitly modeling hierarchical data and significantly improve over their Euclidean counterparts.
  6. The study points to combining graph structure with deep learning as a useful direction for NLP problems.
  7. The approach is particularly promising for resource-constrained environments and may generalize to other NLP tasks.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!