Published: 2025-11-26

Updated: 2025-11-27

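⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are suitable only for initial screening before reading a paper.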

2025-11-26 Update

A Multimodal Conversational Agent for Tabular Data Analysis

Authors: Mohammad Nour Al Awad, Sergey Ivanov, Olga Tikhonova, Ivan Khodnenko

Large language models (LLMs) can reshape information processing by handling data analysis, visualization, and interpretation in an interactive, context-aware dialogue with users, including voice interaction, while maintaining high performance. In this article, we present Talk2Data, a multimodal LLM-driven conversational agent for intuitive data exploration. The system lets users query datasets with voice or text instructions and receive answers as plots, tables, statistics, or spoken explanations. Built on LLMs, the suggested design combines OpenAI Whisper automatic speech recognition (ASR) system, Qwen-coder code generation LLM/model, custom sandboxed execution tools, and Coqui library for text-to-speech (TTS) within an agentic orchestration loop. Unlike text-only analysis tools, it adapts responses across modalities and supports multi-turn dialogues grounded in dataset context. In an evaluation of 48 tasks on three datasets, our prototype achieved 95.8% accuracy with model-only generation time under 1.7 seconds (excluding ASR and execution time). A comparison across five LLM sizes (1.5B-32B) revealed accuracy-latency-cost trade-offs, with a 7B model providing the best balance for interactive use. By routing between conversation with user and code execution, constrained to a transparent sandbox, with simultaneously grounding prompts in schema-level context, the Talk2Data agent reliably retrieves actionable insights from tables while making computations verifiable. In the article, except for the Talk2Data agent itself, we discuss implications for human-data interaction, trust in LLM-driven analytics, and future extensions toward large-scale multimodal assistants.

Paper and project links

PDF © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses.

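Summary

Large language models (LLMs) can reshape information processing by handling data analysis, visualization, and interpretation through interactive, context-aware dialogue with users while maintaining high performance. This paper presents Talk2Data, a multimodal LLM-based conversational agent for intuitive data exploration. Users query datasets by voice or text and receive answers as plots, tables, statistics, or spoken explanations. Talk2Data combines the OpenAI Whisper automatic speech recognition (ASR) system, the Qwen-coder code generation model, custom sandboxed execution tools, and the Coqui text-to-speech (TTS) library in an agentic orchestration loop, which lets it adapt responses across modalities and support multi-turn dialogue grounded in dataset context. In an evaluation of 48 tasks on three datasets, the prototype achieved 95.8% accuracy with model-only generation time under 1.7 seconds (excluding ASR and execution time). A comparison of five LLM sizes (1.5B-32B) revealed accuracy-latency-cost trade-offs, with a 7B model giving the best balance for interactive use. By routing between conversation and code execution inside a transparent sandbox, and by grounding prompts in schema-level context, the agent reliably extracts actionable insights from tables while keeping computations verifiable. The paper also discusses implications for human-data interaction, trust in LLM-driven analytics, and future extensions toward large-scale multimodal assistants.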

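Key Takeaways

LLMs can handle data analysis, visualization, and interpretation through interactive, context-aware dialogue.
Talk2Data is a multimodal LLM-based conversational agent for intuitive data exploration that accepts voice and text queries over datasets.
It combines automatic speech recognition (ASR), a code generation LLM, sandboxed execution tools, and a text-to-speech (TTS) library.
The prototype achieved 95.8% accuracy on 48 tasks, with model-only generation time under 1.7 seconds.
LLMs of different sizes trade off accuracy, latency, and cost; a 7B model gave the best balance for interactive use.
Within a transparent sandbox, the agent reliably extracts actionable insights from tables while keeping computations verifiable.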
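
The routing between conversation and sandboxed code execution, with prompts grounded in schema-level context, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `build_schema_context`, `route`, and `run_in_sandbox` are hypothetical names, the keyword router stands in for the routing decision that the paper delegates to an LLM, and the restricted `exec` namespace is only a toy stand-in for a real sandbox.

```python
from typing import Any

# Hypothetical keyword list; in the paper the routing decision is made by an LLM.
ANALYSIS_KEYWORDS = ("mean", "average", "sum", "plot", "count", "max", "min")


def build_schema_context(rows: list[dict[str, Any]]) -> str:
    """Describe column names and types so a prompt can be grounded in the schema."""
    if not rows:
        return "columns: (empty table)"
    first = rows[0]
    cols = ", ".join(f"{name} ({type(value).__name__})" for name, value in first.items())
    return f"columns: {cols}"


def route(query: str) -> str:
    """Return 'code' for analysis requests and 'chat' for everything else."""
    lowered = query.lower()
    return "code" if any(word in lowered for word in ANALYSIS_KEYWORDS) else "chat"


def run_in_sandbox(code: str, rows: list[dict[str, Any]]) -> Any:
    """Toy sandbox: run code with a few whitelisted builtins and the table only.

    Restricting builtins is NOT a real security boundary; it only illustrates
    that generated code runs in a constrained, inspectable namespace.
    """
    allowed = {"sum": sum, "len": len, "max": max, "min": min}
    namespace: dict[str, Any] = {"__builtins__": allowed, "rows": rows}
    exec(code, namespace)
    # By convention in this sketch, generated code stores its answer in `result`.
    return namespace.get("result")
```

In this sketch a query such as "What is the average price?" is routed to code execution, while a greeting stays in the conversational branch. Keeping the executed snippet as plain text is what makes the computation inspectable after the fact.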

MultiDiffNet: A Multi-Objective Diffusion Framework for Generalizable Brain Decoding

Authors: Mengchun Zhang, Kateryna Shapovalenko, Yucheng Shao, Eddie Guo, Parusha Pradhan

Neural decoding from electroencephalography (EEG) remains fundamentally limited by poor generalization to unseen subjects, driven by high inter-subject variability and the lack of large-scale datasets to model it effectively. Existing methods often rely on synthetic subject generation or simplistic data augmentation, but these strategies fail to scale or generalize reliably. We introduce \textit{MultiDiffNet}, a diffusion-based framework that bypasses generative augmentation entirely by learning a compact latent space optimized for multiple objectives. We decode directly from this space and achieve state-of-the-art generalization across various neural decoding tasks using subject and session disjoint evaluation. We also curate and release a unified benchmark suite spanning four EEG decoding tasks of increasing complexity (SSVEP, Motor Imagery, P300, and Imagined Speech) and an evaluation protocol that addresses inconsistent split practices in prior EEG research. Finally, we develop a statistical reporting framework tailored for low-trial EEG settings. Our work provides a reproducible and open-source foundation for subject-agnostic EEG decoding in real-world BCI systems.

Paper and project links

PDF

Summary

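This paper introduces MultiDiffNet, a framework that builds an optimized compact latent space, bypasses generative augmentation, and decodes directly from that space, achieving strong cross-subject generalization across a range of neural decoding tasks. It also consolidates a benchmark suite of EEG decoding tasks (SSVEP, Motor Imagery, P300, and Imagined Speech) and provides an evaluation protocol that addresses inconsistent split practices in prior EEG research. Finally, it develops a statistical reporting framework for low-trial EEG settings. Overall, the work offers a reproducible, open-source foundation for subject-agnostic EEG decoding in real-world BCI systems.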

Key Takeaways

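MultiDiffNet improves the generalization of EEG neural decoding by learning a compact latent space optimized for multiple objectives.
The framework bypasses generative augmentation and decodes directly from the latent space.
The authors curate and release a unified benchmark suite covering SSVEP, Motor Imagery, P300, and Imagined Speech.
An evaluation protocol addresses the inconsistent split practices of prior EEG research.
A statistical reporting framework is developed for low-trial EEG settings.
The work provides a foundation for subject-agnostic EEG decoding in real-world BCI systems.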
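
A subject-disjoint split, the core of the evaluation protocol mentioned above, can be illustrated with a short sketch. The abstract does not spell out the protocol, so the details here are assumptions: `subject_disjoint_split` is a hypothetical helper, trials are assumed to be `(subject_id, features, label)` tuples, and only the subject axis is handled (the paper's protocol also separates sessions).

```python
import random


def subject_disjoint_split(trials, test_fraction=0.25, seed=0):
    """Split trials so that no subject appears in both train and test.

    `trials` is a list of (subject_id, features, label) tuples (assumed layout).
    """
    # Split at the subject level, not the trial level, to avoid identity leakage.
    subjects = sorted({subject for subject, _, _ in trials})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [t for t in trials if t[0] not in test_subjects]
    test = [t for t in trials if t[0] in test_subjects]
    return train, test
```

Splitting by subject rather than by trial is what prevents a model from scoring well simply by recognizing individuals it has already seen, which is the failure mode that inconsistent split practices can hide.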

A superpersuasive autonomous policy debating system

Authors: Allen Roush, Devin Gonier, John Hines, Judah Goldfeder, Philippe Martin Wyder, Sanjay Basu, Ravid Shwartz Ziv

The capacity for highly complex, evidence-based, and strategically adaptive persuasion remains a formidable great challenge for artificial intelligence. Previous work, like IBM Project Debater, focused on generating persuasive speeches in simplified and shortened debate formats intended for relatively lay audiences. We introduce DeepDebater, a novel autonomous system capable of participating in and winning a full, unmodified, two-team competitive policy debate. Our system employs a hierarchical architecture of specialized multi-agent workflows, where teams of LLM-powered agents collaborate and critique one another to perform discrete argumentative tasks. Each workflow utilizes iterative retrieval, synthesis, and self-correction using a massive corpus of policy debate evidence (OpenDebateEvidence) and produces complete speech transcripts, cross-examinations, and rebuttals. We introduce a live, interactive end-to-end presentation pipeline that renders debates with AI speech and animation: transcripts are surface-realized and synthesized to audio with OpenAI TTS, and then displayed as talking-head portrait videos with EchoMimic V1. Beyond fully autonomous matches (AI vs AI), DeepDebater supports hybrid human-AI operation: human debaters can intervene at any stage, and humans can optionally serve as opponents against AI in any speech, allowing AI-human and AI-AI rounds. In preliminary evaluations against human-authored cases, DeepDebater produces qualitatively superior argumentative components and consistently wins simulated rounds as adjudicated by an independent autonomous judge. Expert human debate coaches also prefer the arguments, evidence, and cases constructed by DeepDebater. We open source all code, generated speech transcripts, audio and talking head video here: https://github.com/Hellisotherpeople/DeepDebater/tree/main

Paper and project links

PDF Accepted to CLIP workshop at AAAI 2026

Summary

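DeepDebater is an autonomous AI system that can participate in and win a full, unmodified, two-team competitive policy debate. It uses a hierarchical architecture of multi-agent workflows in which teams of agents collaborate and critique one another to complete discrete argumentative tasks. The system performs iterative retrieval, synthesis, and self-correction over a large corpus of policy debate evidence, producing complete speech transcripts, cross-examinations, and rebuttals. DeepDebater supports both fully autonomous matches and hybrid human-AI operation. In preliminary evaluations its argumentative components were judged superior to human-authored cases, and expert debate coaches also preferred the arguments, evidence, and cases it constructed.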

Key Takeaways

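DeepDebater is a new autonomous debating system that can take part in, and win, a full policy debate.
It uses a hierarchical multi-agent architecture in which teams of agents jointly complete argumentative tasks.
It performs iterative retrieval, synthesis, and self-correction over a large corpus of policy debate evidence (OpenDebateEvidence).
It supports fully autonomous matches as well as hybrid human-AI operation, and humans can intervene at any stage.
In preliminary evaluations, DeepDebater consistently won simulated rounds against human-authored cases, as adjudicated by an independent autonomous judge.
Expert human debate coaches preferred the arguments, evidence, and cases that DeepDebater constructed.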
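
The collaborate-and-critique pattern described above can be sketched as a simple draft, critique, revise loop. This is an illustrative assumption about the control flow and not DeepDebater's actual code: `refine_argument` and the three stub agents are hypothetical, and in a real system each callable would wrap an LLM call with retrieval over the evidence corpus.

```python
from typing import Callable


def refine_argument(
    draft: Callable[[str], str],
    critique: Callable[[str], list[str]],
    revise: Callable[[str, list[str]], str],
    topic: str,
    max_rounds: int = 3,
) -> str:
    """Draft an argument, then alternate critique and revision until no objections remain."""
    text = draft(topic)
    for _ in range(max_rounds):
        objections = critique(text)
        if not objections:
            break  # The critic is satisfied; stop early.
        text = revise(text, objections)
    return text


# Stub agents standing in for LLM-backed agents (hypothetical behavior).
def stub_draft(topic: str) -> str:
    return f"Claim: {topic}."


def stub_critique(text: str) -> list[str]:
    return [] if "Evidence:" in text else ["no evidence cited"]


def stub_revise(text: str, objections: list[str]) -> str:
    return text + " Evidence: [card]."
```

The `max_rounds` cap matters in practice: without it, a critic that always finds something to object to would keep the loop running indefinitely.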

Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding

Authors: Yassir Benhammou, Suman Kalyan, Sujay Kumar

Broadcast and media organizations increasingly rely on artificial intelligence to automate the labor-intensive processes of content indexing, tagging, and metadata generation. However, existing AI systems typically operate on a single modality, such as video, audio, or text, limiting their understanding of complex, cross-modal relationships in broadcast material. In this work, we propose a Multimodal Autoencoder (MMAE) that learns unified representations across text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering. The model is trained on the recently introduced LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content. By minimizing joint reconstruction losses across modalities, the MMAE discovers modality-invariant semantic structures without relying on large paired or contrastive datasets. We demonstrate significant improvements in clustering and alignment metrics (Silhouette, ARI, NMI) compared to linear baselines, indicating that reconstruction-based multimodal embeddings can serve as a foundation for scalable metadata generation and cross-modal retrieval in broadcast archives. These results highlight the potential of reconstruction-driven multimodal learning to enhance automation, searchability, and content management efficiency in modern broadcast workflows.

Paper and project links

PDF 8 pages, 5 figures, 4 tables

Summary

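Broadcast and media organizations increasingly rely on AI to automate the labor-intensive work of content indexing, tagging, and metadata generation. Existing systems, however, typically operate on a single modality (video, audio, or text) and so cannot capture the complex cross-modal relationships in broadcast material. This work proposes a Multimodal Autoencoder (MMAE) that learns unified representations across text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering. The model is trained on the recently introduced LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content. By minimizing joint reconstruction losses across modalities, the MMAE discovers modality-invariant semantic structure without relying on large paired or contrastive datasets. It shows significant improvements over linear baselines on clustering and alignment metrics (Silhouette, ARI, NMI), suggesting that reconstruction-based multimodal embeddings can underpin scalable metadata generation and cross-modal retrieval in broadcast archives. These results highlight the potential of reconstruction-driven multimodal learning to improve automation, searchability, and content management efficiency.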

Key Takeaways

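Broadcast and media organizations increasingly rely on AI to automate content indexing and metadata generation.
Existing AI systems mostly handle a single modality and miss complex cross-modal relationships.
The proposed Multimodal Autoencoder (MMAE) learns unified representations across text, audio, and visual data.
MMAE is trained on the LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content.
By minimizing joint reconstruction losses across modalities, MMAE discovers modality-invariant semantic structure.
MMAE outperforms linear baselines on clustering and alignment metrics (Silhouette, ARI, NMI).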
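
A joint reconstruction loss across modalities can be written down in a few lines. The abstract does not give the exact form of the loss, so this sketch assumes a weighted sum of per-modality mean squared errors; `mse`, `joint_reconstruction_loss`, and the dictionary layout are hypothetical choices made for illustration.

```python
def mse(original, reconstructed):
    """Mean squared error between two equal-length vectors."""
    if len(original) != len(reconstructed):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def joint_reconstruction_loss(inputs, reconstructions, weights=None):
    """Weighted sum of per-modality MSE terms (assumed form of the joint loss).

    `inputs` and `reconstructions` map a modality name to a feature vector.
    """
    # Default to equal weighting when no per-modality weights are given.
    weights = weights or {name: 1.0 for name in inputs}
    return sum(weights[name] * mse(inputs[name], reconstructions[name]) for name in inputs)
```

Because every modality is reconstructed from the same shared representation, lowering this single scalar pushes the encoder toward features that are useful for all modalities at once, which is the intuition behind modality-invariant structure.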

AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch

Authors: Weichuang Shao, Iman Yi Liao, Tomas Henrique Bode Maul, Tissa Chandesa

Recent foundational models, SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo, achieve top-tier results across standard audio benchmarks but are limited by fixed input rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context representation, and (4) test-time adaptation/augmentation (TTA^2) to improve inference reliability. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene, show that AMAuT achieves accuracies up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. Thus, AMAuT presents a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.

Paper and project links

PDF Updating note: 1. CLS+TAL is the distill token from DeiT rather than the alternative class token. Adjust the content to clarify it. 2. Figure 4 presents an error sequence of figures (a) and (b). 3. Remove an unrelated citation about the VS set. 4. A missing citation in section 4.4 (SSAST [19] here is not a correct citation)

Summary

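This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that supports arbitrary sample rates and audio lengths without relying on pre-trained weights. It combines four components: augmentation-driven multiview learning, a one-dimensional CNN bottleneck, dual CLS + TAL tokens for bidirectional context representation, and test-time adaptation/augmentation. On five public benchmarks AMAuT reaches accuracies of up to 99.8% while using less than 3% of the GPU hours required by comparable pre-trained models, making it an efficient and flexible option for state-of-the-art audio classification in computationally constrained settings.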
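
The augmentation half of test-time adaptation/augmentation can be illustrated by averaging predictions over several views of the same input. This sketch shows only that generic idea and makes no claim about how AMAuT's TTA^2 is implemented: `predict_with_tta` is a hypothetical helper, `model` is assumed to be any callable returning class probabilities, and the adaptation step is not covered.

```python
def predict_with_tta(model, waveform, augmentations):
    """Average class probabilities over the original input and its augmented views."""
    # Always include the unmodified input as one of the views.
    views = [waveform] + [augment(waveform) for augment in augmentations]
    outputs = [model(view) for view in views]
    n_classes = len(outputs[0])
    averaged = [sum(out[c] for out in outputs) / len(outputs) for c in range(n_classes)]
    # Pick the class with the highest averaged probability.
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged
```

Averaging over views smooths out predictions that flip under small perturbations, at the cost of one extra forward pass per augmentation.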

Key Takeaways