TTS

Published: 2025-10-21

Updated: 2025-11-27

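⚠️ All summaries below were generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not use these summaries in serious academic work. They are suitable only for initial screening before reading a paper.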

2025-10-21 Update

MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

Authors: Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Xintong Hu, Yu Zhang, Li Tang, Rui Yang, Han Wang, Zongbao Zhang, Yuhan Wang, Yixuan Chen, Hankun Xu, Ke Xu, Pengfei Fan, Zhetao Chen, Yanhao Yu, Qiange Huang, Fei Wu, Zhou Zhao

Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address this limitation, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts. To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, spatial text-to-speech, spatial singing voice synthesis, spatial music generation, and sound event localization and detection. Results show that MRSAudio enables high-quality spatial modeling and supports a broad range of spatial audio research. Demos and dataset access are available at https://mrsaudio.github.io.

Paper and project links

PDF 24 pages

Summary

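MRSAudio is a large-scale multimodal spatial audio dataset intended to advance research in spatial audio understanding and generation. It comprises four components (MRSLife, MRSSpeech, MRSMusic, and MRSSing) that cover diverse real-world scenarios, and it provides synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations. The authors use the dataset to establish five foundational tasks: audio spatialization, spatial text-to-speech, spatial singing voice synthesis, spatial music generation, and sound event localization and detection.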

Key Takeaways