Published: 2025-11-26

Updated: 2025-11-27

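⚠️ All summaries below were generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not rely on them in serious academic work. They are only suitable for initial screening before reading a paper.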

2025-11-26 Update

AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation

Authors: Omar Garib, Jayaprakash D. Kambhampaty, Olivia J. Pinon Fischer, Dimitri N. Mavris

We introduce AIRHILT (Aviation Integrated Reasoning, Human-in-the-Loop Testbed), a modular and lightweight simulation environment designed to evaluate multimodal pilot and air traffic control (ATC) assistance systems for aviation conflict detection. Built on the open-source Godot engine, AIRHILT synchronizes pilot and ATC radio communications, visual scene understanding from camera streams, and ADS-B surveillance data within a unified, scalable platform. The environment supports pilot- and controller-in-the-loop interactions, providing a comprehensive scenario suite covering both terminal area and en route operational conflicts, including communication errors and procedural mistakes. AIRHILT offers standardized JSON-based interfaces that enable researchers to easily integrate, swap, and evaluate automatic speech recognition (ASR), visual detection, decision-making, and text-to-speech (TTS) models. We demonstrate AIRHILT through a reference pipeline incorporating fine-tuned Whisper ASR, YOLO-based visual detection, ADS-B-based conflict logic, and GPT-OSS-20B structured reasoning, and present preliminary results from representative runway-overlap scenarios, where the assistant achieves an average time-to-first-warning of approximately 7.7 s, with average ASR and vision latencies of approximately 5.9 s and 0.4 s, respectively. The AIRHILT environment and scenario suite are openly available, supporting reproducible research on multimodal situational awareness and conflict detection in aviation; code and scenarios are available at https://github.com/ogarib3/airhilt.

Paper and project links

PDF 9 pages, 4 figures, 1 table, 1 algorithm

Summary

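AIRHILT is a modular, lightweight simulation environment for evaluating multimodal pilot and air traffic control (ATC) assistance systems for aviation conflict detection. It synchronizes pilot and ATC radio communications, visual scene understanding from camera streams, and ADS-B surveillance data within a unified, scalable platform. The environment supports pilot- and controller-in-the-loop interaction and provides a comprehensive scenario suite covering terminal-area and en route conflicts, including communication errors and procedural mistakes.

Key Takeaways

AIRHILT is a simulation environment for evaluating assistance systems for aviation conflict detection.
It synchronizes pilot and ATC radio communications, visual scene understanding, and ADS-B data.
AIRHILT supports pilot- and controller-in-the-loop interaction.
The environment provides a comprehensive scenario suite covering terminal-area and en route conflicts.
AIRHILT is built on the open-source Godot engine.
It provides standardized JSON-based interfaces that let researchers integrate, swap, and evaluate automatic speech recognition (ASR), visual detection, decision-making, and text-to-speech (TTS) models.
The AIRHILT environment and scenario suite are openly available, supporting reproducible research on multimodal situational awareness and conflict detection in aviation.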
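
To make the idea of JSON-based module interfaces and the time-to-first-warning metric concrete, here is a minimal sketch. The message envelope, its field names (`source`, `sim_time_s`, `payload`), and the definition of time-to-first-warning as "seconds from conflict onset to the first warning" are all assumptions made for illustration. They are not taken from the AIRHILT repository.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class ModuleMessage:
    """Hypothetical envelope exchanged between modules (not AIRHILT's real schema)."""
    source: str        # which module emitted it, e.g. "asr", "vision", "reasoner"
    sim_time_s: float  # simulation time of emission, in seconds
    payload: dict      # module-specific content


def encode(msg: ModuleMessage) -> str:
    """Serialize a message to a JSON string."""
    return json.dumps(asdict(msg), sort_keys=True)


def decode(raw: str) -> ModuleMessage:
    """Parse a JSON string back into a message."""
    return ModuleMessage(**json.loads(raw))


def time_to_first_warning(messages: List[ModuleMessage],
                          conflict_onset_s: float) -> Optional[float]:
    """Seconds from conflict onset to the first reasoner warning, or None if none was issued."""
    times = [
        m.sim_time_s
        for m in messages
        if m.source == "reasoner"
        and m.payload.get("warning")
        and m.sim_time_s >= conflict_onset_s
    ]
    return min(times) - conflict_onset_s if times else None


log = [
    ModuleMessage("asr", 12.0, {"text": "cleared to land"}),
    ModuleMessage("reasoner", 17.5, {"warning": True}),
]
print(time_to_first_warning(log, conflict_onset_s=10.0))
```

Because every module speaks the same envelope in this sketch, replacing one ASR model with another only requires that the replacement emit messages of the same shape.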

A Multimodal Conversational Agent for Tabular Data Analysis

Authors: Mohammad Nour Al Awad, Sergey Ivanov, Olga Tikhonova, Ivan Khodnenko

Large language models (LLMs) can reshape information processing by handling data analysis, visualization, and interpretation in an interactive, context-aware dialogue with users, including voice interaction, while maintaining high performance. In this article, we present Talk2Data, a multimodal LLM-driven conversational agent for intuitive data exploration. The system lets users query datasets with voice or text instructions and receive answers as plots, tables, statistics, or spoken explanations. Built on LLMs, the suggested design combines OpenAI Whisper automatic speech recognition (ASR) system, Qwen-coder code generation LLM/model, custom sandboxed execution tools, and Coqui library for text-to-speech (TTS) within an agentic orchestration loop. Unlike text-only analysis tools, it adapts responses across modalities and supports multi-turn dialogues grounded in dataset context. In an evaluation of 48 tasks on three datasets, our prototype achieved 95.8% accuracy with model-only generation time under 1.7 seconds (excluding ASR and execution time). A comparison across five LLM sizes (1.5B-32B) revealed accuracy-latency-cost trade-offs, with a 7B model providing the best balance for interactive use. By routing between conversation with user and code execution, constrained to a transparent sandbox, with simultaneously grounding prompts in schema-level context, the Talk2Data agent reliably retrieves actionable insights from tables while making computations verifiable. In the article, except for the Talk2Data agent itself, we discuss implications for human-data interaction, trust in LLM-driven analytics, and future extensions toward large-scale multimodal assistants.

Paper and project links

PDF © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

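Summary

Large language models (LLMs) can reshape information processing by handling data analysis, visualization, and interpretation in an interactive, context-aware dialogue with users, including voice interaction, while maintaining high performance. This article presents Talk2Data, a multimodal LLM-driven conversational agent for intuitive data exploration. Users query datasets with voice or text instructions and receive answers as plots, tables, statistics, or spoken explanations. The proposed design combines the OpenAI Whisper automatic speech recognition (ASR) system, a Qwen-coder code-generation LLM, custom sandboxed execution tools, and the Coqui text-to-speech (TTS) library within an agentic orchestration loop. Unlike text-only analysis tools, it adapts responses across modalities and supports multi-turn dialogues grounded in dataset context. In an evaluation of 48 tasks on three datasets, the prototype achieved 95.8% accuracy with model-only generation time under 1.7 seconds (excluding ASR and execution time). A comparison across five LLM sizes (1.5B to 32B) revealed accuracy-latency-cost trade-offs, with a 7B model providing the best balance for interactive use. By routing between conversation with the user and code execution, and by grounding prompts in schema-level context, the Talk2Data agent reliably retrieves actionable insights from tables while keeping computations verifiable. Beyond the agent itself, the article discusses implications for human-data interaction, trust in LLM-driven analytics, and future extensions toward large-scale multimodal assistants.

Key Takeaways

LLMs can handle data analysis, visualization, and interpretation through interactive, context-aware dialogue, including voice interaction.
Talk2Data is an LLM-driven conversational agent for intuitive data exploration that accepts voice and text queries over datasets.
Talk2Data combines Whisper ASR, a Qwen-coder code-generation LLM, sandboxed execution tools, and the Coqui TTS library in an agentic orchestration loop.
The prototype reached 95.8% accuracy on 48 tasks across three datasets, with model-only generation time under 1.7 seconds.
LLMs of different sizes (1.5B to 32B) trade off accuracy, latency, and cost; a 7B model gave the best balance for interactive use.
The agent reliably retrieves actionable insights from tables, and running generated code in a transparent sandbox keeps computations verifiable.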
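
The routing between conversation and sandboxed code execution can be sketched as follows. This is an illustration under stated assumptions, not the Talk2Data implementation: the reply format (`{"type": ..., "content": ...}`), the convention that generated code stores its answer in `result`, and the helper names are all hypothetical. Restricting `__builtins__` as done here limits accidents but is not a real security boundary; a production sandbox would isolate execution at the process or container level.

```python
import statistics

# Names visible to generated code. Anything else (open, __import__, ...) is unavailable.
SAFE_BUILTINS = {"len": len, "sum": sum, "min": min, "max": max,
                 "sorted": sorted, "round": round}


def schema_context(rows):
    """Schema-level context for the prompt: column names and value types, not the data."""
    return ", ".join(f"{name}: {type(value).__name__}" for name, value in rows[0].items())


def run_generated_code(code, rows):
    """Run generated code in a restricted namespace; the code must assign `result`."""
    env = {"__builtins__": SAFE_BUILTINS, "rows": rows, "mean": statistics.mean}
    exec(code, env)
    return env["result"]


def handle_reply(reply, rows):
    """Route a (hypothetical) model reply either to code execution or back to the user."""
    if reply["type"] == "code":
        return {"kind": "result", "value": run_generated_code(reply["content"], rows)}
    return {"kind": "chat", "value": reply["content"]}


rows = [{"city": "A", "price": 10.0}, {"city": "B", "price": 20.0}]
print(schema_context(rows))
print(handle_reply({"type": "code",
                    "content": "result = mean([r['price'] for r in rows])"}, rows))
```

Keeping the executed code as a plain string that can be shown to the user is what makes the computation inspectable in this sketch.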

Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

Authors: Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang

We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee’s Uni-MoE series in language-centric multimodal understanding, reasoning, and generating. Based on the dense LLM, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodality understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.

Paper and project links

PDF 47 pages, 10 figures. Project Website: https://idealistxy.github.io/Uni-MoE-v2.github.io/ Codes: https://github.com/HITsz-TMG/Uni-MoE

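Summary

Uni-MoE 2.0 from the Lychee family is a fully open-source omnimodal large model (OLM) that substantially advances the Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Starting from a dense LLM, Uni-MoE-2.0-Omni is built through three core contributions: a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with iterative reinforcement, and a carefully curated multimodal data matching technique. It supports omnimodal understanding and generates images, text, and speech. Architecturally, the new MoE framework uses shared, routed, and null experts to balance computational efficiency and capability across 10 cross-modal inputs, while Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, cross-modal pretraining is followed by progressive supervised fine-tuning that activates modality-specific experts, combined with balanced data composition and an iterative GSPO-DPO method to stabilize RL training and improve reasoning. The base model is trained on approximately 75B tokens of open-source multimodal data and is equipped with special speech and image generation tokens, so it learns these generative tasks by conditioning its outputs on linguistic cues. Across 85 benchmarks, the model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% on average across 8 benchmarks), omnimodality understanding (+7% on average across 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.

Key Takeaways

Uni-MoE 2.0 is a fully open-source omnimodal large model (OLM) that substantially advances language-centric multimodal understanding, reasoning, and generation.
Uni-MoE-2.0-Omni is built through a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy, and a multimodal data matching technique.
The model supports omnimodal understanding and generates images, text, and speech.
The new MoE framework balances computational efficiency and capability across 10 cross-modal inputs, while Omni-Modality 3D RoPE provides spatio-temporal cross-modality alignment in the self-attention layer.
Progressive supervised fine-tuning, combined with balanced data composition and an iterative GSPO-DPO method, stabilizes RL training and improves reasoning.
The base model is trained on approximately 75B tokens of open-source multimodal data and uses special speech and image generation tokens to condition generation on linguistic cues.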
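
One way to picture how shared, routed, and null experts could give a layer dynamic capacity is the toy forward pass below. It is a sketch under assumptions, not the Uni-MoE 2.0 architecture: the function `moe_forward`, the top-k softmax router, and the reading of a null expert as "a routing slot that contributes nothing and costs no compute" are illustrative choices. The summary above does not specify these details.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, routed_experts, num_null, shared_expert, top_k=2):
    """Toy MoE layer for one token.

    router_logits holds one logit per routed expert followed by one per null expert.
    Returns the output vector and the number of routed experts actually evaluated.
    """
    assert len(router_logits) == len(routed_experts) + num_null
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    out = list(shared_expert(x))  # the shared expert always runs
    evaluated = 0
    for i in chosen:
        if i >= len(routed_experts):
            continue  # null expert selected: no computation, no contribution
        y = routed_experts[i](x)
        out = [o + probs[i] * v for o, v in zip(out, y)]
        evaluated += 1
    return out, evaluated


double = lambda v: [2.0 * a for a in v]
negate = lambda v: [-a for a in v]
identity = lambda v: list(v)

# The router strongly prefers the null slot, so only the shared expert runs.
print(moe_forward([1.0, 2.0], [0.0, 0.0, 5.0], [double, negate], 1, identity, top_k=1))
```

In this toy version, tokens whose router mass lands on null slots evaluate fewer routed experts, which is how per-token compute can vary while `top_k` stays fixed.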

SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech

Authors: Lu Gan, Xi Li

The development of high-performance, on-device keyword spotting (KWS) systems for ultra-low-power hardware is critically constrained by the scarcity of specialized, multi-command training datasets. Traditional data collection through human recording is costly, slow, and lacks scalability. This paper introduces SYNTTS-COMMANDS, a novel, multilingual voice command dataset entirely generated using state-of-the-art Text-to-Speech (TTS) synthesis. By leveraging the CosyVoice 2 model and speaker embeddings from public corpora, we created a scalable collection of English and Chinese commands. Extensive benchmarking across a range of efficient acoustic models demonstrates that our synthetic dataset enables exceptional accuracy, achieving up to 99.5% on English and 98% on Chinese command recognition. These results robustly validate that synthetic speech can effectively replace human-recorded audio for training KWS classifiers. Our work directly addresses the data bottleneck in TinyML, providing a practical, scalable foundation for building private, low-latency, and energy-efficient voice interfaces on resource-constrained edge devices. The dataset and source code are publicly available at https://github.com/lugan113/SynTTS-Commands-Official.

Paper and project links

PDF

Summary

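The paper introduces SYNTTS-COMMANDS, a multilingual voice command dataset generated entirely with state-of-the-art text-to-speech (TTS) synthesis. Using the CosyVoice 2 model and speaker embeddings from public corpora, the authors created a scalable collection of English and Chinese commands. Benchmarks across a range of efficient acoustic models show that the synthetic dataset enables high accuracy, reaching up to 99.5% on English and 98% on Chinese command recognition, which indicates that synthetic speech can effectively replace human-recorded audio for training keyword spotting (KWS) classifiers. The work addresses the data bottleneck in TinyML and provides a practical, scalable foundation for private, low-latency, energy-efficient voice interfaces.

Key Takeaways

SYNTTS-COMMANDS is a multilingual voice command dataset generated with text-to-speech (TTS) synthesis.
The dataset was created with the CosyVoice 2 model and speaker embeddings from public corpora, which makes it scalable.
Models trained on SYNTTS-COMMANDS reach up to 99.5% accuracy on English and 98% on Chinese command recognition.
Synthetic speech can effectively replace human-recorded audio for training KWS classifiers.
The work addresses the data bottleneck in TinyML and provides a foundation for voice interfaces on resource-constrained edge devices.
The SYNTTS-COMMANDS dataset and source code are publicly available on GitHub.
The work opens new possibilities for high-performance KWS systems running on ultra-low-power hardware.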
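
The scalability of a TTS-synthesized dataset comes from the fact that every command can be paired with every available speaker embedding. The sketch below builds such a synthesis manifest. The command lists, speaker identifiers, and output path layout are made up for illustration, and no TTS model is called; in the paper the audio itself is produced by CosyVoice 2.

```python
def build_manifest(commands_by_lang, speakers_by_lang):
    """List one synthesis job per (command, speaker) pair for each language.

    The label is the command's index within its language, which is what a
    keyword spotting classifier would be trained to predict.
    """
    jobs = []
    for lang in sorted(commands_by_lang):
        for label, text in enumerate(commands_by_lang[lang]):
            for speaker in speakers_by_lang[lang]:
                jobs.append({
                    "lang": lang,
                    "label": label,
                    "text": text,
                    "speaker": speaker,
                    # Hypothetical layout: <lang>/<label>/<speaker>.wav
                    "path": f"{lang}/{label:03d}/{speaker}.wav",
                })
    return jobs


# Hypothetical commands and speaker identifiers.
commands = {"en": ["turn on", "turn off"], "zh": ["打开"]}
speakers = {"en": ["spk_a", "spk_b"], "zh": ["spk_c"]}

manifest = build_manifest(commands, speakers)
print(len(manifest))
```

Adding a command or a speaker grows the manifest multiplicatively, with no additional recording sessions.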

Speech Foundation Models Generalize to Time Series Tasks from Wearable Sensor Data

Authors: Jaya Narain, Zakaria Aldeneh, Shirley Ren

Both speech and sensor time series data encode information in both the time- and frequency- domains, like spectral powers and waveform shapelets. We show that speech foundation models learn representations that generalize beyond the speech domain and achieve state-of-the-art performance on diverse time-series tasks from wearable sensors. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform those extracted from self-supervised models trained directly on modality-specific datasets for mood classification, arrhythmia detection, and activity classification tasks. We find that the convolutional feature encoders of speech models are particularly relevant for wearable sensor applications. The proposed approach enhances performance on data-scarce time-series tasks using simple probing methods. This work takes a step toward developing generalized time-series models that unify speech and sensor modalities.

Paper and project links

PDF Preprint, under review

Summary

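Speech and sensor time series both encode information in the time and frequency domains, such as spectral powers and waveform shapelets. The paper shows that representations learned by speech foundation models generalize beyond speech and reach state-of-the-art performance on diverse time-series tasks from wearable sensors. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform probes trained on features from self-supervised models trained directly on modality-specific datasets, across mood classification, arrhythmia detection, and activity classification. The convolutional feature encoders of speech models turn out to be particularly relevant for wearable sensor applications. The approach improves performance on data-scarce time-series tasks using only simple probing methods, and it is a step toward generalized time-series models that unify speech and sensor modalities.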
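
Probing means keeping the encoder frozen and training only a small classifier on its features. The sketch below shows that pattern on synthetic signals. It makes several assumptions: `frozen_features` is a hand-written stand-in (signal energy and zero-crossing rate) rather than HuBERT or wav2vec 2.0, the probe is a plain logistic regression trained with SGD, and the toy tones have nothing to do with the paper's datasets.

```python
import math


def tone(freq, n=100, phase=0.3):
    """Synthetic signal: `freq` full cycles over `n` samples."""
    return [math.sin(2 * math.pi * freq * t / n + phase) for t in range(n)]


def frozen_features(signal):
    """Stand-in for a frozen encoder: it has no trainable parameters."""
    n = len(signal)
    energy = sum(v * v for v in signal) / n
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0) / (n - 1)
    return [energy, crossings]


def train_probe(features, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression probe with SGD; only w and b are learned."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            grad = 1.0 / (1.0 + math.exp(-z)) - y  # d(loss)/dz for log loss
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    """Return class 1 if the probe's logit is positive, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Toy task: slow tones are class 0, fast tones are class 1.
train = [(1, 0), (3, 0), (20, 1), (33, 1)]
feats = [frozen_features(tone(f)) for f, _ in train]
labels = [y for _, y in train]
w, b = train_probe(feats, labels)
print([predict(w, b, x) for x in feats])
```

Swapping `frozen_features` for embeddings from a pretrained speech model would leave `train_probe` and `predict` unchanged, which is what makes probing cheap on data-scarce tasks.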

Key Takeaways