
MMT


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never rely on them in serious academic settings; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-17

HI-TransPA: Hearing Impairments Translation Personal Assistant

Authors:Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng

To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
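
The quality-score-guided curriculum described in the abstract can be approximated with a simple scheduler: rank samples by their quality score and widen the admitted training pool as training progresses. A minimal sketch in Python, assuming per-sample scores are already available (the `quality_score` field, starting fraction, and file names below are illustrative assumptions, not the paper's actual values):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    video_path: str       # stabilized lip-region clip
    audio_path: str       # paired speech segment
    quality_score: float  # multimodal quality estimate in [0, 1]

def curriculum_pool(samples, epoch, total_epochs, start_frac=0.3):
    """Return the training subset for this epoch.

    Starts with the cleanest `start_frac` of samples and linearly
    expands to the full dataset by the final epoch.
    """
    ranked = sorted(samples, key=lambda s: s.quality_score, reverse=True)
    frac = start_frac + (1.0 - start_frac) * epoch / max(total_epochs - 1, 1)
    k = max(1, int(len(ranked) * frac))
    return ranked[:k]

# toy usage
data = [Sample(f"clip_{i}.mp4", f"clip_{i}.wav", q / 10) for i, q in enumerate(range(10))]
for epoch in range(3):
    pool = curriculum_pool(data, epoch, total_epochs=3)
    print(epoch, len(pool))  # 3, 6, then all 10 samples
```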


Paper & Project Links

PDF

Summary

This paper introduces the Omni-Model paradigm into assistive technology to give hearing-impaired users a unified, flexible solution for daily communication. The proposed HI-TransPA, an instruction-driven audio-visual personal assistant, fuses indistinct speech with high-frame-rate lip dynamics to support both translation and dialogue within a single multimodal framework. To cope with noisy, heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, the authors build a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively scores multimodal sample quality. These quality scores drive a curriculum learning strategy that trains on clean, high-confidence samples first and gradually adds harder cases to improve robustness. Experiments show that HI-TransPA achieves state-of-the-art literal accuracy and semantic fidelity, providing a foundation, an end-to-end modeling framework, and essential processing tools for applying Omni-Models to assistive communication.

Key Takeaways

  1. The Omni-Model paradigm is introduced into assistive technology, offering hearing-impaired users a unified, flexible solution for daily communication.
  2. HI-TransPA is an instruction-driven audio-visual personal assistant that fuses indistinct speech with high-frame-rate lip dynamics.
  3. A comprehensive preprocessing and curation pipeline handles noisy, heterogeneous raw data and improves model adaptability.
  4. Facial landmarks are detected, the lip region is isolated and stabilized, and multimodal sample quality is scored quantitatively.
  5. A quality-score-guided curriculum strategy trains on clean, high-confidence samples first to strengthen robustness.
  6. HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity.

Cool Papers


ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking

Authors:Lequan Lin, Dai Shi, Andi Han, Feng Chen, Qiuzheng Chen, Jiawen Li, Zhaoyang Li, Jiyuan Li, Zhenbang Sun, Junbin Gao

Supervised learning relies on high-quality labeled data, but obtaining such data through human annotation is both expensive and time-consuming. Recent work explores using large language models (LLMs) for annotation, but LLM-generated labels still fall short of human-level quality. To address this problem, we propose the Annotation with Critical Thinking (ACT) data pipeline, where LLMs serve not only as annotators but also as judges to critically identify potential errors. Human effort is then directed towards reviewing only the most “suspicious” cases, significantly improving the human annotation efficiency. Our major contributions are as follows: (1) ACT is applicable to a wide range of domains, including natural language processing (NLP), computer vision (CV), and multimodal understanding, by leveraging multimodal-LLMs (MLLMs). (2) Through empirical studies, we derive 7 insights on how to enhance annotation quality while efficiently reducing the human cost, and then translate these findings into user-friendly guidelines. (3) We theoretically analyze how to modify the loss function so that models trained on ACT data achieve similar performance to those trained on fully human-annotated data. Our experiments show that the performance gap can be reduced to less than 2% on most benchmark datasets while saving up to 90% of human costs.
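
The core routing idea, LLM labels plus an LLM judge score with only the most suspicious items sent to humans, can be sketched as below. The `suspicion` field and the review budget are illustrative assumptions, not values from the paper:

```python
def split_for_review(records, review_budget=0.10):
    """records: list of dicts with 'item', 'llm_label', 'suspicion' in [0, 1].

    Returns (to_human, auto_accepted): the highest-suspicion items up to the
    budget go to human review; the rest keep their LLM-generated label.
    """
    ranked = sorted(records, key=lambda r: r["suspicion"], reverse=True)
    k = int(len(ranked) * review_budget)
    return ranked[:k], ranked[k:]

records = [
    {"item": "img_001", "llm_label": "cat", "suspicion": 0.92},
    {"item": "img_002", "llm_label": "dog", "suspicion": 0.08},
    {"item": "img_003", "llm_label": "cat", "suspicion": 0.55},
]
to_human, auto_ok = split_for_review(records, review_budget=0.34)
print([r["item"] for r in to_human])  # ['img_001']
```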


Paper & Project Links

PDF NeurIPS 2025

Summary
Large language models can be used to annotate data, but the labels they produce still fall short of human quality. The Annotation with Critical Thinking (ACT) data pipeline therefore uses LLMs not only as annotators but also as judges that flag potential errors, so human reviewers focus only on the most "suspicious" cases, greatly improving annotation efficiency. ACT applies to NLP, computer vision, and multimodal understanding, and the authors distill their empirical findings into user-friendly guidelines for raising label quality while cutting human cost. Experiments show that models trained on ACT data come within 2% of models trained on fully human-annotated data on most benchmarks while saving up to 90% of human effort.

Key Takeaways

  • Proposes the Annotation with Critical Thinking (ACT) data pipeline to address the quality gap in LLM-generated labels.
  • Uses the LLM's own critical judgment to improve label quality and reduce the human review workload.
  • ACT applies across NLP, computer vision, and multimodal understanding.
  • Empirical studies yield 7 insights on improving annotation quality while lowering human cost.
  • A modified loss function narrows the performance gap of models trained on ACT data while keeping costs low.

Cool Papers


CrochetBench: Can Vision-Language Models Move from Describing to Doing in Crochet Domain?

Authors:Peiyu Li, Xiaobao Huang, Nitesh V. Chawla

We present CrochetBench, a benchmark for evaluating the ability of multimodal large language models to perform fine-grained, low-level procedural reasoning in the domain of crochet. Unlike prior benchmarks that focus on high-level description or visual question answering, CrochetBench shifts the emphasis from describing to doing: models are required to recognize stitches, select structurally appropriate instructions, and generate compilable crochet procedures. We adopt the CrochetPARADE DSL as our intermediate representation, enabling structural validation and functional evaluation via execution. The benchmark covers tasks including stitch classification, instruction grounding, and both natural language and image-to-DSL translation. Across all tasks, performance sharply declines as the evaluation shifts from surface-level similarity to executable correctness, exposing limitations in long-range symbolic reasoning and 3D-aware procedural synthesis. CrochetBench offers a new lens for assessing procedural competence in multimodal models and highlights the gap between surface-level understanding and executable precision in real-world creative domains. Code is available at https://github.com/Peiyu-Georgia-Li/crochetBench.
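
The shift from surface similarity to executable correctness can be illustrated with a small scoring harness. `compile_check` below is a hypothetical stand-in for the benchmark's actual CrochetPARADE checker; only the string-similarity part uses the standard library:

```python
import difflib

def surface_similarity(pred: str, reference: str) -> float:
    """Character-level similarity: the 'describing' style of metric."""
    return difflib.SequenceMatcher(None, pred, reference).ratio()

def executable_correctness(pred: str, compile_check) -> bool:
    """The 'doing' metric: does the predicted program actually compile?"""
    try:
        compile_check(pred)  # hypothetical DSL compiler hook
        return True
    except Exception:
        return False

def toy_compile_check(program: str):
    # placeholder rule: accept only programs that declare at least one round
    if "round" not in program.lower():
        raise ValueError("no rounds defined")

pred = "Rnd 1: 6 sc in magic ring"
ref = "Round 1: 6 sc in magic ring."
print(round(surface_similarity(pred, ref), 2))        # high surface score
print(executable_correctness(pred, toy_compile_check))  # False: fails the executable check
```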


Paper & Project Links

PDF code available at https://github.com/Peiyu-Georgia-Li/crochetBench

Summary

CrochetBench is a benchmark for evaluating the ability of multimodal large language models to perform fine-grained, low-level procedural reasoning in the crochet domain. Unlike earlier benchmarks centered on high-level description or visual question answering, it shifts the emphasis from describing to doing: models must recognize stitches, select structurally appropriate instructions, and generate compilable crochet procedures. The CrochetPARADE DSL serves as the intermediate representation, enabling structural validation and execution-based functional evaluation. Tasks include stitch classification, instruction grounding, and both natural-language and image-to-DSL translation. Across all tasks, performance drops sharply as evaluation moves from surface-level similarity to executable correctness, exposing limitations in long-range symbolic reasoning and 3D-aware procedural synthesis and highlighting the gap between surface-level understanding and executable precision in real-world creative domains.

Key Takeaways

  1. CrochetBench is a benchmark for multimodal large language models focused on fine-grained, low-level procedural reasoning in the crochet domain.
  2. Unlike prior benchmarks, it emphasizes doing over describing: models must recognize stitches, select appropriate instructions, and generate compilable procedures.
  3. The CrochetPARADE DSL is adopted as the intermediate representation, enabling structural validation and functional evaluation.
  4. Tasks cover stitch classification, instruction grounding, and natural-language and image-to-DSL translation.
  5. Performance drops sharply when evaluation shifts from surface similarity to executable correctness, revealing limits in long-range symbolic reasoning and 3D-aware procedural synthesis.
  6. CrochetBench offers a new lens for assessing procedural competence in multimodal models.
  7. It highlights the gap between surface-level understanding and executable precision in real-world creative domains.

Cool Papers


WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation

Authors:Mingde Xu, Zhen Yang, Wenyi Hong, Lihang Pan, Xinyue Fan, Yan Wang, Xiaotao Gu, Bin Xu, Jie Tang

User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision-Language Models (VLMs) automate UI-to-Code generation, they generate only static HTML/CSS/JavaScript layouts lacking interactivity. To address this, we propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity. Experiments demonstrate that WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro). In addition, our fine-tuned WebVIA-UI2Code models exhibit substantial improvements in generating executable and interactive HTML/CSS/JavaScript code, outperforming their base counterparts across both interactive and static UI2Code benchmarks. Our code and models are available at \href{https://zheny2751-dotcom.github.io/webvia.github.io/}{\texttt{https://webvia.github.io}}.
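
The three-component pipeline (explore → generate → validate) can be sketched as a simple orchestration loop. The agent functions below are hypothetical stubs standing in for the paper's exploration agent, UI2Code model, and validation module:

```python
from typing import Callable, List

def run_ui2code_pipeline(
    url: str,
    explore: Callable[[str], List[bytes]],   # multi-state screenshots
    ui2code: Callable[[List[bytes]], str],    # interactive HTML/CSS/JS
    validate: Callable[[str], bool],          # interactivity check
    max_rounds: int = 3,
) -> str:
    """Generate code, retrying with fresh exploration until validation passes."""
    for round_idx in range(max_rounds):
        screenshots = explore(url)
        code = ui2code(screenshots)
        if validate(code):
            return code
        print(f"round {round_idx}: validation failed, re-exploring")
    raise RuntimeError("could not produce validated interactive code")

# toy stubs to make the sketch runnable
demo = run_ui2code_pipeline(
    "https://example.com",
    explore=lambda u: [b"state0", b"state1"],
    ui2code=lambda shots: "<button onclick=\"alert('hi')\">Click</button>",
    validate=lambda code: "onclick" in code,
)
print(demo)
```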


Paper & Project Links

PDF 36 pages, 30 figures

Summary

Turning design mockups into functional code remains a repetitive, labor-intensive part of UI development. Existing vision-language models (VLMs) can automate UI-to-Code generation, but they produce only static HTML/CSS/JavaScript layouts without interactivity. WebVIA is the first agentic framework for interactive UI-to-Code generation and validation, built from three components: an exploration agent that captures multi-state UI screenshots, a UI2Code model that generates executable interactive code, and a validation module that verifies the interactivity. Experiments show that WebVIA-Agent explores UIs more stably and accurately than general-purpose agents such as Gemini-2.5-Pro, and the fine-tuned WebVIA-UI2Code models substantially outperform their base counterparts on both interactive and static UI2Code benchmarks. Code and models are available at https://webvia.github.io.

Key Takeaways

  • Turning designs into code remains highly repetitive and labor-intensive in UI development.
  • Current VLMs can automate UI-to-Code generation, but the generated code lacks interactivity.
  • WebVIA is the first agentic framework for interactive UI-to-Code generation and validation.
  • The framework consists of three components: an exploration agent, a UI2Code model, and a validation module.
  • WebVIA-Agent explores UIs more stably and accurately than general-purpose agents.
  • The WebVIA-UI2Code models show clear gains in generating interactive HTML/CSS/JavaScript code.

Cool Papers


TRACE: Textual Reasoning for Affordance Coordinate Extraction

Authors:Sangyun Park, Jin Kim, Yuchen Cui, Matthew S. Brown

Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation. While visual Chain-of-Thought (CoT) methods exist, they are often computationally intensive. In this work, we introduce TRACE (Textual Reasoning for Affordance Coordinate Extraction), a novel methodology that integrates a textual Chain of Reasoning (CoR) into the affordance prediction process. We use this methodology to create the TRACE dataset, a large-scale collection created via an autonomous pipeline that pairs instructions with explicit textual rationales. By fine-tuning a VLM on this data, our model learns to externalize its spatial reasoning before acting. Our experiments show that our TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the primary Where2Place (W2P) benchmark (a 9.6% relative improvement) and 55.0% on the more challenging W2P(h) subset. Crucially, an ablation study demonstrates that performance scales directly with the amount of reasoning data used, confirming the CoR’s effectiveness. Furthermore, analysis of the model’s attention maps reveals an interpretable reasoning process where focus shifts dynamically across reasoning steps. This work shows that training VLMs to generate a textual CoR is an effective and robust strategy for enhancing the precision, reliability, and interpretability of VLM-based robot control. Our dataset and code are available at https://github.com/jink-ucla/TRACE
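
A TRACE-style pipeline ends with the model emitting its reasoning as text followed by a target coordinate. A minimal sketch of the parsing side, assuming the model is prompted to finish with a line like `ANSWER: (x, y)` (this output convention is an assumption for illustration, not the paper's exact format):

```python
import re
from typing import Optional, Tuple

COORD_PATTERN = re.compile(r"ANSWER:\s*\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)")

def extract_affordance_point(model_output: str) -> Optional[Tuple[float, float]]:
    """Pull the final (x, y) affordance coordinate out of a chain-of-reasoning reply."""
    match = COORD_PATTERN.search(model_output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

reply = (
    "The instruction asks for free space left of the mug.\n"
    "The mug occupies the right half of the table, so I pick a point on the left.\n"
    "ANSWER: (124.5, 301.0)"
)
print(extract_affordance_point(reply))  # (124.5, 301.0)
```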


Paper & Project Links

PDF ICCV 2025. *Equal contribution. †Corresponding author

Summary

This paper introduces TRACE (Textual Reasoning for Affordance Coordinate Extraction), a methodology that integrates a textual Chain of Reasoning (CoR) into affordance prediction. The authors build the TRACE dataset, a large-scale collection generated by an autonomous pipeline that pairs instructions with explicit textual rationales, and fine-tune a vision-language model (VLM) on it so the model externalizes its spatial reasoning before acting. The TRACE-tuned model reaches state-of-the-art results: 48.1% accuracy on the Where2Place (W2P) benchmark (a 9.6% relative improvement) and 55.0% on the harder W2P(h) subset. Attention-map analysis reveals an interpretable reasoning process in which focus shifts dynamically across reasoning steps, showing that training VLMs to generate a textual CoR is an effective and robust way to improve the precision, reliability, and interpretability of VLM-based robot control. The dataset and code are publicly available on GitHub.

Key Takeaways

  1. TRACE integrates a textual Chain of Reasoning (CoR) to strengthen a VLM's precise spatial reasoning for robotic manipulation.
  2. The TRACE dataset is generated by an autonomous pipeline that pairs instructions with explicit textual rationales.
  3. Fine-tuning a VLM on TRACE teaches it to externalize its spatial reasoning before acting, improving performance.
  4. The TRACE-tuned model achieves the best results on the Where2Place (W2P) benchmark, a 9.6% relative improvement in accuracy.
  5. An ablation study shows performance scales directly with the amount of reasoning data, confirming the CoR's effectiveness.
  6. Attention-map analysis reveals an interpretable reasoning process whose focus shifts dynamically across reasoning steps.
  7. Overall, TRACE is an effective and robust strategy for improving the precision, reliability, and interpretability of VLM-based robot control.

Cool Papers


Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence

Authors:Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Hanzhe Shan, Zhenwei Niu, Zhaoyang Liu, Yue Zhao, Junbo Qi, Qinfan Zhang, Dengjie Li, Yidong Wang, Jiachen Luo, Yong Dai, Jian Tang, Xiaozhu Ju

This report presents Pelican-VL 1.0, a new family of open-source embodied brain models with parameter scales ranging from 7 billion to 72 billion. Our explicit mission is clearly stated as: To embed powerful intelligence into various embodiments. Pelican-VL 1.0 is currently the largest-scale open-source embodied multimodal brain model. Its core advantage lies in the in-depth integration of data power and intelligent adaptive learning mechanisms. Specifically, metaloop distilled a high-quality dataset from a raw dataset containing 4+ billion tokens. Pelican-VL 1.0 is trained on a large-scale cluster of 1000+ A800 GPUs, consuming over 50k+ A800 GPU-hours per checkpoint. This translates to a 20.3% performance uplift from its base model and outperforms 100B-level open-source counterparts by 10.6%, placing it on par with leading proprietary systems on well-known embodied benchmarks. We establish a novel framework, DPPO (Deliberate Practice Policy Optimization), inspired by human metacognition to train Pelican-VL 1.0. We operationalize this as a metaloop that teaches the AI to practice deliberately, which is a RL-Refine-Diagnose-SFT loop.
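
The DPPO "metaloop" is described as an RL-Refine-Diagnose-SFT cycle. A schematic skeleton of such a loop is sketched below; the four stage callbacks are hypothetical placeholders, not the released training code:

```python
def dppo_metaloop(model, data, rounds=3,
                  rl_step=None, refine=None, diagnose=None, sft_step=None):
    """Deliberate-practice style training loop: RL -> Refine -> Diagnose -> SFT.

    Each stage callback takes and returns (model, data); they default to
    identity functions so the skeleton runs end to end.
    """
    identity = lambda m, d: (m, d)
    rl_step, refine = rl_step or identity, refine or identity
    diagnose, sft_step = diagnose or identity, sft_step or identity

    for r in range(rounds):
        model, data = rl_step(model, data)   # explore with reinforcement learning
        model, data = refine(model, data)    # clean up / filter rollouts
        model, data = diagnose(model, data)  # find weaknesses to practice next
        model, data = sft_step(model, data)  # supervised fine-tune on targeted data
        print(f"metaloop round {r} complete")
    return model

dppo_metaloop(model="pelican-vl-stub", data=["sample"])
```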


Paper & Project Links

PDF

Summary

Pelican-VL 1.0 is a newly released family of open-source embodied brain models with parameter scales ranging from 7 billion to 72 billion. Its core advantage lies in the deep integration of data power with intelligent, adaptive learning mechanisms. Trained on a large GPU cluster, it delivers a substantial performance uplift over its base model and is on par with leading proprietary systems on well-known embodied benchmarks.

Key Takeaways

  1. Pelican-VL 1.0 is a new family of open-source embodied brain models with 7B to 72B parameters.
  2. Its core advantage lies in deeply integrating data power with intelligent, adaptive learning mechanisms.
  3. The model is trained on a large GPU cluster and shows a substantial performance uplift.
  4. Training uses a new framework called DPPO (Deliberate Practice Policy Optimization).
  5. DPPO is inspired by human metacognition and teaches the AI to practice deliberately.
  6. The model performs strongly on well-known embodied benchmarks, on par with leading proprietary systems.

Cool Papers


World Simulation with Video Foundation Models for Physical AI

Authors: NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jinwei Gu, Aryaman Gupta, Siddharth Gururani, Imad El Hanafi, Ali Hassani, Zekun Hao, Jacob Huffman, Joel Jang, Pooya Jannaty, Jan Kautz, Grace Lam, Xuan Li, Zhaoshuo Li, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Yen-Chen Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Seungjun Nah, Yashraj Narang, Abhijeet Panaskar, Lindsey Pavao, Trung Pham, Morteza Ramezanali, Fitsum Reda, Scott Reed, Xuanchi Ren, Haonan Shao, Yue Shen, Stella Shi, Shuran Song, Bartosz Stefaniak, Shangkun Sun, Shitao Tang, Sameena Tasmeen, Lyne Tchapmi, Wei-Cheng Tseng, Jibin Varghese, Andrew Z. Wang, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Jiashu Xu, Dinghao Yang, Xiaodong Yang, Haotian Ye, Seonghyeon Ye, Xiaohui Zeng, Jing Zhang, Qinsheng Zhang, Kaiwen Zheng, Andrew Zhu, Yuke Zhu

We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, [Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, [Cosmos-Predict2.5] achieves substantial improvements over [Cosmos-Predict1] in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with [Cosmos-Transfer2.5], a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5$\times$ smaller than [Cosmos-Transfer1], it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.


Paper & Project Links

PDF

Summary

[Cosmos-Predict2.5] is the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, it unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, for richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement-learning-based post-training, it substantially improves video quality and instruction alignment over [Cosmos-Predict1], with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. [Cosmos-Transfer2.5] extends the family with a control-net style framework for Sim2Real and Real2Real world translation; despite being 3.5x smaller than [Cosmos-Transfer1], it delivers higher fidelity and robust long-horizon video generation. Source code, pretrained checkpoints, and curated benchmarks are released under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5, lowering the barrier to adoption and fostering innovation in embodied intelligence.

Key Takeaways

  1. [Cosmos-Predict2.5] is the latest generation of the Cosmos World Foundation Models for Physical AI.
  2. Built on a flow-based architecture, it unifies Text2World, Image2World, and Video2World generation.
  3. It leverages [Cosmos-Reason1] for richer text grounding and finer control of world simulation.
  4. Training on large-scale curated video clips plus reinforcement-learning-based post-training markedly improves video quality and instruction alignment.
  5. [Cosmos-Transfer2.5] is a control-net style framework for Sim2Real and Real2Real world translation.
  6. Despite its smaller size, [Cosmos-Transfer2.5] achieves higher fidelity and robust long-horizon video generation.

Cool Papers


SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs

Authors:Weijia Zhang, Zijia Liu, Haoru Li, Haoqi Chen, Jiaxuan You

Recent advances in text-only large language models (LLMs), such as DeepSeek-R1, demonstrate remarkable reasoning ability. However, these models remain fragile or entirely incapable when extended to multi-modal tasks. Existing approaches largely rely on single-form captions, which lack diversity and often fail to adapt across different types of Visual Question Answering (VQA) benchmarks. As a result, they provide no principled or efficient channel for transmitting fine-grained visual information. We introduce Seeing Eye, a modular framework that unlocks multimodal reasoning in text-only LLMs through an agent-based small VLM translator. This translator acts as a perception agent: it can invoke specialized tools (e.g., OCR and crop) and iteratively distill multimodal inputs into structured intermediate representations (SIRs) tailored to the question. These SIRs are then passed to the text-only LLM, which serves as a reasoning agent. Crucially, the translator and reasoner engage in multi-round feedback and interaction, enabling the extraction of targeted visual details and yielding more confident answers. Experiments on knowledge-intensive VQA benchmarks, including MMMU and MIA-Bench, demonstrate that Seeing Eye not only reduces inference cost but also surpasses much larger end-to-end VLMs. For example, an instantiation combining a 3B-parameter vision translator with an 8B-parameter language reasoner outperforms a monolithic 32B VLM on challenging knowledge-based questions. Our results highlight that decoupling perception from reasoning via agent information flow offers a scalable and plug-and-play pathway to multimodal reasoning, allowing strong text-only LLMs to fully leverage their reasoning capabilities. Code is available at: https://github.com/ulab-uiuc/SeeingEye
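
The perception/reasoning split can be sketched as two cooperating agents: a small vision model that emits a structured intermediate representation (SIR) and a text-only reasoner that can request more detail. The agent callables and the SIR schema below are illustrative assumptions, not the released implementation:

```python
def answer_with_translator(image, question, translator, reasoner, max_rounds=2):
    """translator(image, question, request) -> dict of SIR fields
    reasoner(question, sir) -> (answer or None, follow_up_request)"""
    request = "initial description"
    sir = {}
    for _ in range(max_rounds):
        sir.update(translator(image, question, request))
        answer, request = reasoner(question, sir)
        if answer is not None:
            return answer
    return "uncertain"

# toy stubs demonstrating the multi-round feedback
translator = lambda img, q, req: (
    {"ocr": "EXIT 12"} if "ocr" in req else {"caption": "a highway sign"}
)
def reasoner(q, sir):
    if "ocr" not in sir:
        return None, "please run ocr on the sign"
    return f"The sign reads {sir['ocr']}.", ""

print(answer_with_translator(b"...", "What does the sign say?", translator, reasoner))
```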


Paper & Project Links

PDF

Summary

Text-only large language models (LLMs) remain weak on multimodal tasks. Seeing Eye addresses this with an agent-based small vision-language model (VLM) translator that unlocks multimodal reasoning in text-only LLMs. The translator can invoke specialized tools to process visual inputs and distills them into structured intermediate representations (SIRs) tailored to the question, which are then passed to the text-only LLM acting as the reasoning agent. Multi-round feedback between translator and reasoner yields more targeted visual detail and more confident answers. Experiments on knowledge-intensive VQA benchmarks show that Seeing Eye not only lowers inference cost but also surpasses much larger end-to-end VLMs. The code is publicly available on GitHub.

Key Takeaways

  1. Text-only large language models (LLMs) fall short on multimodal tasks.
  2. The Seeing Eye framework unlocks multimodal reasoning in text-only LLMs via an agent-based small vision-language model (VLM) translator.
  3. The translator can invoke specialized tools to process visual information and convert it into structured intermediate representations (SIRs).
  4. Multi-round feedback and interaction between translator and reasoner improve answer accuracy.
  5. Experiments show Seeing Eye excels on knowledge-intensive VQA benchmarks, outperforming some much larger end-to-end VLMs.
  6. Seeing Eye reduces inference cost.
  7. The code is publicly available on GitHub.

Cool Papers


PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models

Authors:Patrick Haller, Fabio Barth, Jonas Golde, Georg Rehm, Alan Akbik

Vision-language models (VLMs) have demonstrated remarkable progress in multimodal reasoning. However, existing benchmarks remain limited in terms of high-quality, human-verified examples. Many current datasets rely on synthetically generated content by large language models (LLMs). Furthermore, most datasets are limited to English, as manual quality assurance of translated samples is time-consuming and costly. To fill this gap, we introduce PISA-Bench, a multilingual benchmark derived from English examples of the expert-created PISA tests, a unified framework for the assessment of student competencies in over eighty countries. Each example consists of human-extracted instructions, questions, answer options, and images, enriched with question type categories, and has been translated from English into five additional languages (Spanish, German, Chinese, French, and Italian), resulting in a fully parallel corpus covering six languages. We evaluate state-of-the-art vision-language models on PISA-Bench and find that especially small models (<20B parameters) fail to achieve high test scores. We further find substantial performance degradation on non-English splits as well as high error-rates when models are tasked with spatial and geometric reasoning. By releasing the dataset and evaluation framework, we provide a resource for advancing research on multilingual multimodal reasoning.
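
Since the benchmark is fully parallel across six languages, the headline analysis is per-language (and per-question-type) accuracy. A minimal aggregation sketch, assuming each prediction record carries `language`, `question_type`, and a correctness flag (the field names are illustrative):

```python
from collections import defaultdict

def accuracy_by(records, key):
    """records: iterable of dicts with a boolean 'correct' plus grouping fields."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {k: hits[k] / totals[k] for k in totals}

records = [
    {"language": "en", "question_type": "geometry", "correct": True},
    {"language": "de", "question_type": "geometry", "correct": False},
    {"language": "de", "question_type": "reading", "correct": True},
]
print(accuracy_by(records, "language"))       # {'en': 1.0, 'de': 0.5}
print(accuracy_by(records, "question_type"))  # {'geometry': 0.5, 'reading': 1.0}
```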


Paper & Project Links

PDF 8 pages, 11 tables and figures

Summary

PISA-Bench, a multilingual benchmark derived from the expert-created PISA tests, fills a gap in high-quality, human-verified evaluation of multimodal reasoning in vision-language models. Beyond the English examples, every item has been translated into five additional languages, yielding a fully parallel corpus across six languages. Evaluation shows that small models (<20B parameters) fail to reach high scores, performance degrades substantially on non-English splits, and error rates are high on spatial and geometric reasoning. The released dataset and evaluation framework provide a resource for advancing research on multilingual multimodal reasoning.

Key Takeaways

  1. PISA-Bench is a multilingual benchmark derived from the PISA tests for evaluating the multimodal reasoning of vision-language models.
  2. The dataset is a parallel corpus of English plus five other languages, enabling study of model behavior across languages.
  3. Existing vision-language models perform poorly on PISA-Bench, especially small models.
  4. Performance drops markedly on non-English splits, indicating that cross-lingual adaptability still needs improvement.
  5. Error rates are high on spatial and geometric reasoning, which calls for further research.
  6. The release of PISA-Bench provides a valuable resource for advancing multilingual multimodal reasoning research.

Cool Papers


Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs

Authors:Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei

While Multimodal Large Language Models (MLLMs) excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, we introduce Latent Sketchpad, a framework that equips MLLMs with an internal visual scratchpad. The internal visual representations of MLLMs have traditionally been confined to perceptual understanding. We repurpose them to support generative visual thought without compromising reasoning ability. Building on frontier MLLMs, our approach integrates visual generation directly into their native autoregressive reasoning process. It allows the model to interleave textual reasoning with the generation of visual latents. These latents guide the internal thought process and can be translated into sketch images for interpretability. To realize this, we introduce two components: a Context-Aware Vision Head autoregressively produces visual representations, and a pretrained Sketch Decoder renders these into human-interpretable images. We evaluate the framework on our new dataset MazePlanning. Experiments across various MLLMs show that Latent Sketchpad delivers comparable or even superior reasoning performance to their backbone. It further generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending model’s textual reasoning to visual thinking, our framework opens new opportunities for richer human-computer interaction and broader applications. More details and resources are available on our project page: https://latent-sketchpad.github.io/.


Paper & Project Links

PDF

Summary

Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, Latent Sketchpad equips multimodal large language models (MLLMs) with an internal visual scratchpad. Instead of confining the model's internal visual representations to perceptual understanding, the framework repurposes them to support generative visual thought without compromising reasoning ability. Built on frontier MLLMs, it integrates visual generation directly into the native autoregressive reasoning process, letting the model interleave textual reasoning with the generation of visual latents that guide its internal thought and can be rendered as sketch images for interpretability. Two components realize this: a Context-Aware Vision Head that autoregressively produces visual representations, and a pretrained Sketch Decoder that renders them into human-interpretable images. On the new MazePlanning dataset, Latent Sketchpad matches or exceeds the reasoning performance of its backbones and generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending textual reasoning to visual thinking, the framework opens new opportunities for richer human-computer interaction and broader applications.

Key Takeaways

  1. Multimodal large language models (MLLMs) often struggle in complex scenarios that require visual planning and imagination.
  2. Human use of sketching as a form of visual thinking inspired the Latent Sketchpad framework.
  3. Latent Sketchpad gives MLLMs an internal visual scratchpad that supports generative visual thought without compromising reasoning ability.
  4. The framework builds on frontier MLLMs and integrates visual generation into their autoregressive reasoning process.
  5. The model interleaves textual reasoning with visual latents that can be rendered as sketch images for interpretability.
  6. The framework has two key components: a Context-Aware Vision Head and a Sketch Decoder.
  7. Experiments on the MazePlanning dataset show strong performance and generalization across multiple MLLMs, enabling richer human-computer interaction and broader applications.
Cool Papers


Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

Authors:Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, Rushuai Yang, Arctanx An, Leqi Zheng, Weijie Wang, Shawn Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.


Paper & Project Links

PDF The project and benchmark are publicly available at https://github.com/microsoft/MV-RoboBench

Summary

Vision-language models (VLMs) are central to Embodied AI, yet most evaluations focus on single-view settings and leave their ability to integrate multi-view information underexplored. MV-RoboBench is introduced to evaluate the multi-view spatial reasoning of VLMs in robotic manipulation. Results show that existing models fall far below human performance on the benchmark, underscoring the challenges VLMs face in multi-view robotic perception. The analysis also yields two key findings: spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios, and strong results on general-purpose single-view spatial understanding benchmarks do not reliably translate into success on the robotic spatial tasks assessed here. The benchmark is released as an open resource, providing both data and a standardized evaluation protocol to foster progress in spatially grounded VLMs and VLAs.

Key Takeaways

  1. VLMs play a key role in Embodied AI, enabling robots to perceive, reason, and act in complex environments.
  2. Current VLM evaluations focus mostly on single-view settings and underexplore multi-view information integration.
  3. MV-RoboBench is introduced to evaluate the multi-view spatial reasoning capabilities of VLMs.
  4. Existing models perform far below human level on MV-RoboBench, highlighting the challenge of multi-view robotic perception.
  5. Spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios.
  6. Strong performance on single-view spatial understanding benchmarks does not guarantee success on robotic spatial tasks.
Cool Papers


Multimodal Safety Is Asymmetric: Cross-Modal Exploits Unlock Black-Box MLLMs Jailbreaks

Authors:Xinkai Wang, Beibei Li, Zerui Shao, Ao Liu, Shouling Ji

Multimodal large language models (MLLMs) have demonstrated significant utility across diverse real-world applications. But MLLMs remain vulnerable to jailbreaks, where adversarial inputs can collapse their safety constraints and trigger unethical responses. In this work, we investigate jailbreaks in the text-vision multimodal setting and pioneer the observation that visual alignment imposes uneven safety constraints across modalities in MLLMs, thereby giving rise to multimodal safety asymmetry. We then develop PolyJailbreak, a black-box jailbreak method grounded in reinforcement learning. Initially, we probe the model’s attention dynamics and latent representation space, assessing how visual inputs reshape cross-modal information flow and diminish the model’s ability to separate harmful from benign inputs, thereby exposing exploitable vulnerabilities. On this basis, we systematize them into generalizable and reusable operational rules that constitute a structured library of Atomic Strategy Primitives, which translate harmful intents into jailbreak inputs through step-wise transformations. Guided by the primitives, PolyJailbreak employs a multi-agent optimization process that automatically adapts inputs against the target models. We conduct comprehensive evaluations on a variety of open-source and closed-source MLLMs, demonstrating that PolyJailbreak outperforms state-of-the-art baselines.


Paper & Project Links

PDF

Summary

This work studies jailbreak vulnerabilities in multimodal large language models (MLLMs). The authors observe that visual alignment imposes uneven safety constraints across modalities, giving rise to multimodal safety asymmetry. They then propose PolyJailbreak, a black-box jailbreak method grounded in reinforcement learning that probes the model's attention dynamics and latent representation space, translates harmful intents into jailbreak inputs, and automatically adapts them to the target model. Comprehensive evaluations on a range of open-source and closed-source MLLMs show that PolyJailbreak outperforms state-of-the-art baselines.

Key Takeaways

  1. MLLMs are highly useful in real-world applications but remain vulnerable to jailbreaks, where adversarial inputs collapse safety constraints and trigger unethical responses.
  2. Visual alignment imposes uneven safety constraints across modalities, producing multimodal safety asymmetry.
  3. PolyJailbreak assesses model vulnerabilities by probing attention dynamics and the latent representation space, translating harmful intents into jailbreak inputs.
  4. PolyJailbreak is a reinforcement-learning-based black-box method that uses a multi-agent optimization process to automatically adapt inputs to the target model.
  5. Comprehensive evaluations across open-source and closed-source MLLMs show PolyJailbreak outperforms existing baselines.
  6. The study offers a new perspective and methodology for strengthening the safety of multimodal language models.
Cool Papers


From Watch to Imagine: Steering Long-horizon Manipulation via Human Demonstration and Future Envisionment

Authors:Ke Ye, Jiaming Zhou, Yuanfeng Qiu, Jiayi Liu, Shihui Zhou, Kun-Yu Lin, Junwei Liang

Generalizing to long-horizon manipulation tasks in a zero-shot setting remains a central challenge in robotics. Current multimodal foundation based approaches, despite their capabilities, typically fail to decompose high-level commands into executable action sequences from static visual input alone. To address this challenge, we introduce Super-Mimic, a hierarchical framework that enables zero-shot robotic imitation by directly inferring procedural intent from unscripted human demonstration videos. Our framework is composed of two sequential modules. First, a Human Intent Translator (HIT) parses the input video using multimodal reasoning to produce a sequence of language-grounded subtasks. These subtasks then condition a Future Dynamics Predictor (FDP), which employs a generative model that synthesizes a physically plausible video rollout for each step. The resulting visual trajectories are dynamics-aware, explicitly modeling crucial object interactions and contact points to guide the low-level controller. We validate this approach through extensive experiments on a suite of long-horizon manipulation tasks, where Super-Mimic significantly outperforms state-of-the-art zero-shot methods by over 20%. These results establish that coupling video-driven intent parsing with prospective dynamics modeling is a highly effective strategy for developing general-purpose robotic systems.


Paper & Project Links

PDF More details and videos can be found at: https://yipko.com/super-mimic

Summary

Super-Mimic is a hierarchical framework that tackles long-horizon manipulation in the zero-shot setting by inferring procedural intent directly from unscripted human demonstration videos. It consists of two sequential modules: a Human Intent Translator that parses the input video with multimodal reasoning into a sequence of language-grounded subtasks, and a Future Dynamics Predictor that conditions on these subtasks and uses a generative model to synthesize a physically plausible video rollout for each step. Extensive experiments on long-horizon manipulation tasks show Super-Mimic outperforms state-of-the-art zero-shot methods by over 20%, indicating that coupling video-driven intent parsing with prospective dynamics modeling is an effective strategy for general-purpose robotic systems.

Key Takeaways

  1. Current multimodal foundation approaches struggle to decompose high-level commands into executable action sequences from static visual input alone.
  2. Super-Mimic addresses this by inferring procedural intent directly from unscripted human demonstration videos.
  3. Super-Mimic comprises two modules: a Human Intent Translator and a Future Dynamics Predictor.
  4. The Human Intent Translator parses the input video into a sequence of language-grounded subtasks.
  5. The Future Dynamics Predictor uses a generative model to synthesize a physically plausible video rollout for each step.
  6. Super-Mimic outperforms state-of-the-art zero-shot methods by over 20% on long-horizon manipulation tasks.
Cool Papers


ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Authors:Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high-quality image-code pairs. We use this data to fine-tune open-source MLLM via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, demonstrating substantial gains in its UI generation capabilities. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.
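
The grounding → planning → generation decomposition can be sketched as three specialized calls whose outputs feed each other. The stage functions and data shapes below are hypothetical stand-ins for the paper's agents:

```python
def generate_frontend_code(screenshot, ground, plan, generate):
    """ground(screenshot) -> list of UI elements with bounding boxes
    plan(elements) -> nested layout tree
    generate(layout) -> HTML/CSS string"""
    elements = ground(screenshot)   # perception: what is on the screen, and where
    layout = plan(elements)         # planning: how the elements nest and align
    return generate(layout)         # synthesis: emit front-end code

# toy stubs so the sketch runs end to end
ground = lambda img: [{"type": "button", "text": "Sign in", "box": (10, 10, 120, 40)}]
plan = lambda els: {"tag": "div", "children": els}
generate = lambda tree: "<div>" + "".join(
    f"<button>{c['text']}</button>" for c in tree["children"]) + "</div>"

print(generate_frontend_code(b"png-bytes", ground, plan, generate))
# <div><button>Sign in</button></div>
```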


Paper & Project Links

PDF ScreenCoder-v2

Summary

Automating the transformation of UI designs into front-end code promises to accelerate software development and democratize design workflows, but monolithic multimodal LLMs often fail on complex UIs because they must unify visual perception, layout planning, and code synthesis in a single model, leading to frequent perception and planning errors. ScreenCoder is a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. Assigning these responsibilities to specialized agents yields significantly higher robustness and fidelity than end-to-end approaches. ScreenCoder also acts as a scalable data engine that produces high-quality image-code pairs, which are used to fine-tune an open-source MLLM via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, substantially improving its UI generation capabilities. Experiments show state-of-the-art layout accuracy, structural coherence, and code correctness. The code is publicly available at https://github.com/leigest519/ScreenCoder.

Key Takeaways

  1. Automating UI-design-to-code generation helps accelerate software development and democratize design workflows.
  2. Monolithic multimodal LLMs suffer from perception and planning errors on complex UIs.
  3. The ScreenCoder modular multi-agent framework improves robustness and fidelity by decomposing the task into three stages: grounding, planning, and generation.
  4. ScreenCoder doubles as a scalable data engine that generates high-quality image-code pairs.
  5. Fine-tuning an open-source MLLM on ScreenCoder data via supervised fine-tuning plus reinforcement learning improves its UI generation capabilities.
  6. ScreenCoder achieves state-of-the-art layout accuracy, structural coherence, and code correctness.
Cool Papers


Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback

Authors:Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm

Paucity of medical data severely limits the generalizability of diagnostic ML models, as the full spectrum of disease variability can not be represented by a small clinical dataset. To address this, diffusion models (DMs) have been considered as a promising avenue for synthetic image generation and augmentation. However, they frequently produce medically inaccurate images, deteriorating the model performance. Expert domain knowledge is critical for synthesizing images that correctly encode clinical information, especially when data is scarce and quality outweighs quantity. Existing approaches for incorporating human feedback, such as reinforcement learning (RL) and Direct Preference Optimization (DPO), rely on robust reward functions or demand labor-intensive expert evaluations. Recent progress in Multimodal Large Language Models (MLLMs) reveals their strong visual reasoning capabilities, making them adept candidates as evaluators. In this work, we propose a novel framework, coined MAGIC (Medically Accurate Generation of Images through AI-Expert Collaboration), that synthesizes clinically accurate skin disease images for data augmentation. Our method creatively translates expert-defined criteria into actionable feedback for image synthesis of DMs, significantly improving clinical accuracy while reducing the direct human workload. Experiments demonstrate that our method greatly improves the clinical quality of synthesized skin disease images, with outputs aligning with dermatologist assessments. Additionally, augmenting training data with these synthesized images improves diagnostic accuracy by +9.02% on a challenging 20-condition skin disease classification task, and by +13.89% in the few-shot setting.


Paper & Project Links

PDF NeurIPS 2025

Summary

The scarcity of medical data severely limits the generalizability of diagnostic machine-learning models. Diffusion models can generate and augment images, but they frequently produce medically inaccurate ones that hurt model performance. MAGIC (Medically Accurate Generation of Images through AI-Expert Collaboration) is a framework that synthesizes clinically accurate skin-disease images for data augmentation by translating expert-defined criteria into actionable feedback for the diffusion model, substantially improving clinical accuracy while reducing the direct human workload. Experiments show the synthesized images align with dermatologist assessments, and augmenting training data with them improves diagnostic accuracy by 9.02% on a challenging 20-condition skin-disease classification task and by 13.89% in the few-shot setting.

Key Takeaways

  1. The scarcity of medical data limits the generalizability of diagnostic ML models.
  2. Diffusion models are promising for medical image generation and augmentation but often produce medically inaccurate images.
  3. Expert domain knowledge is critical for synthesizing images that correctly encode clinical information, especially when data is scarce.
  4. Existing ways of incorporating human feedback, such as reinforcement learning and Direct Preference Optimization, require robust reward functions or labor-intensive expert evaluation.
  5. Multimodal large language models have strong visual reasoning abilities and can serve as evaluators.
  6. The proposed MAGIC framework combines AI with expert knowledge to synthesize clinically accurate skin-disease images for data augmentation.
Cool Papers


Caption This, Reason That: VLMs Caught in the Middle

Authors:Zihan Weng, Lucas Gomez, Taylor Whittington Webb, Pouya Bashivan

Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the underlying limitations, we adopt methodologies from cognitive science, analyzing VLM performance along core cognitive axes: Perception, Attention, and Memory. Using a suite of tasks targeting these abilities, we evaluate state-of-the-art VLMs, including GPT-4o. Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g. category identification), a significant gap persists, particularly in tasks requiring spatial understanding or selective attention. Investigating the source of these failures and potential methods for improvement, we employ a vision-text decoupling analysis, finding that models struggling with direct visual reasoning show marked improvement when reasoning over their own generated text captions. These experiments reveal a strong need for improved VLM Chain-of-Thought (CoT) abilities, even in models that consistently exceed human performance. Furthermore, we demonstrate the potential of targeted fine-tuning on composite visual reasoning tasks and show that fine-tuning smaller VLMs substantially improves core cognitive abilities. While this improvement does not translate to large enhancements on challenging, out-of-distribution benchmarks, we show broadly that VLM performance on our datasets strongly correlates with performance on these other benchmarks. Our work provides a detailed analysis of VLM cognitive strengths and weaknesses and identifies key bottlenecks in simultaneous perception and reasoning while also providing an effective and simple solution.
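
The vision-text decoupling analysis boils down to comparing direct visual answering against a two-stage caption-then-reason pipeline. A minimal harness is sketched below; `vlm_answer`, `vlm_caption`, and `llm_reason` are hypothetical callables standing in for whichever models are being probed:

```python
def direct_vs_decoupled(image, question, vlm_answer, vlm_caption, llm_reason):
    """Return both answers so their accuracies can be compared downstream."""
    direct = vlm_answer(image, question)        # one-shot visual reasoning
    caption = vlm_caption(image)                # stage 1: describe the scene
    decoupled = llm_reason(caption, question)   # stage 2: reason over the caption
    return {"direct": direct, "decoupled": decoupled, "caption": caption}

# toy stubs illustrating the protocol
out = direct_vs_decoupled(
    image=b"png-bytes",
    question="How many red cubes are left of the sphere?",
    vlm_answer=lambda img, q: "2",
    vlm_caption=lambda img: "Three red cubes sit left of a metal sphere.",
    llm_reason=lambda cap, q: "3",
)
print(out["direct"], out["decoupled"])  # 2 3
```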


Paper & Project Links

PDF Paper accepted by nips 2025

Summary

This paper examines the progress and limitations of Vision-Language Models (VLMs) in visual understanding. While VLMs approach ceiling performance on some tasks, such as category identification, a significant gap remains on tasks requiring spatial understanding or selective attention. Adopting methodologies from cognitive science, the authors analyze VLM performance along the core cognitive axes of perception, attention, and memory and find weaknesses in direct visual reasoning. A vision-text decoupling analysis shows that models struggling with direct visual reasoning improve markedly when reasoning over their own generated text captions. The study stresses the need for stronger VLM Chain-of-Thought (CoT) abilities and shows that targeted fine-tuning on composite visual reasoning tasks substantially improves the core cognitive abilities of smaller VLMs. Overall, it provides a detailed analysis of VLM cognitive strengths and weaknesses and offers an effective, simple solution.

Key Takeaways

  1. VLMs have made remarkable progress in visual understanding but still lag behind humans on specific tasks such as counting and relational reasoning.
  2. Analysis grounded in cognitive science reveals limitations along the core axes of perception, attention, and memory.
  3. A significant gap persists on tasks requiring spatial understanding or selective attention.
  4. VLMs are weak at direct visual reasoning, but the vision-text decoupling analysis shows they improve when reasoning over their own generated captions.
  5. Chain-of-Thought (CoT) ability affects performance and needs strengthening even in models that exceed human performance on some tasks.
  6. Targeted fine-tuning on composite visual reasoning tasks can substantially improve a VLM's core cognitive abilities.
Cool Papers


MELLM: Exploring LLM-Powered Micro-Expression Understanding Enhanced by Subtle Motion Perception

Authors:Sirui Zhao, Zhengye Zhang, Shifeng Liu, Xinglong Mao, Shukang Yin, Chaoyou Fu, Tong Xu, Enhong Chen

Micro-expressions (MEs), brief and low-intensity facial movements revealing concealed emotions, are crucial for affective computing. Despite notable progress in ME recognition, existing methods are largely confined to discrete emotion classification, lacking the capacity for comprehensive ME Understanding (MEU), particularly in interpreting subtle facial dynamics and underlying emotional cues. While Multimodal Large Language Models (MLLMs) offer potential for MEU with their advanced reasoning abilities, they still struggle to perceive such subtle facial affective behaviors. To bridge this gap, we propose a ME Large Language Model (MELLM) that integrates optical flow-based sensitivity to subtle facial motions with the powerful inference ability of LLMs. Specifically, an iterative, warping-based optical-flow estimator, named MEFlowNet, is introduced to precisely capture facial micro-movements. For its training and evaluation, we construct MEFlowDataset, a large-scale optical-flow dataset with 54,611 onset-apex image pairs spanning diverse identities and subtle facial motions. Subsequently, we design a Flow-Guided Micro-Expression Understanding paradigm. Under this framework, the optical flow signals extracted by MEFlowNet are leveraged to build MEU-Instruct, an instruction-tuning dataset for MEU. MELLM is then fine-tuned on MEU-Instruct, enabling it to translate subtle motion patterns into human-readable descriptions and generate corresponding emotional inferences. Experiments demonstrate that MEFlowNet significantly outperforms existing optical flow methods in facial and ME-flow estimation, while MELLM achieves state-of-the-art accuracy and generalization across multiple ME benchmarks. To the best of our knowledge, this work presents two key contributions: MEFlowNet as the first dedicated ME flow estimator, and MELLM as the first LLM tailored for MEU.
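
The paper's MEFlowNet is a learned, warping-based estimator; as a rough illustration of the onset-to-apex flow signal it works with, a classical dense optical flow between the two frames can be computed with OpenCV. The file names and Farneback parameters below are placeholders, and this is not the paper's estimator:

```python
import cv2
import numpy as np

def onset_apex_flow(onset_path: str, apex_path: str) -> np.ndarray:
    """Dense (H, W, 2) flow field from the onset frame to the apex frame."""
    onset = cv2.imread(onset_path, cv2.IMREAD_GRAYSCALE)
    apex = cv2.imread(apex_path, cv2.IMREAD_GRAYSCALE)
    if onset is None or apex is None:
        raise FileNotFoundError("could not read onset/apex frames")
    # args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    return cv2.calcOpticalFlowFarneback(onset, apex, None, 0.5, 3, 15, 3, 5, 1.2, 0)

flow = onset_apex_flow("onset.jpg", "apex.jpg")
magnitude = np.linalg.norm(flow, axis=-1)
print("mean motion magnitude:", magnitude.mean())  # tiny values for micro-expressions
```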


Paper & Project Links

PDF

Summary

Micro-expressions (MEs), brief and low-intensity facial movements that reveal concealed emotions, are crucial for affective computing, yet existing methods are largely confined to discrete emotion classification and lack comprehensive ME understanding (MEU). The paper proposes MELLM, an ME large language model that couples optical-flow-based sensitivity to subtle facial motions with the strong inference ability of LLMs. An iterative, warping-based optical-flow estimator named MEFlowNet is introduced to precisely capture facial micro-movements, and the MEFlowDataset is constructed for its training and evaluation. On top of this, a Flow-Guided Micro-Expression Understanding paradigm uses the extracted flow signals to build MEU-Instruct, an instruction-tuning dataset on which MELLM is fine-tuned so that it can translate subtle motion patterns into human-readable descriptions and corresponding emotional inferences. The work opens a new direction for understanding facial micro-expressions.

Key Takeaways

  1. Facial micro-expressions (MEs) are key to affective computing, but existing methods are mostly limited to discrete emotion classification and lack comprehensive understanding (MEU).
  2. MEFlowNet is introduced as the first dedicated ME optical-flow estimator, precisely capturing facial micro-movements.
  3. MEFlowDataset, a large-scale optical-flow dataset, is built for training and evaluating MEFlowNet.
  4. A Flow-Guided Micro-Expression Understanding paradigm uses the optical-flow signals to construct the MEU-Instruct dataset.
  5. MELLM is the first LLM tailored for MEU, combining optical-flow sensitivity with LLM reasoning.
  6. Fine-tuned on MEU-Instruct, MELLM translates subtle motion patterns into human-readable descriptions and the corresponding emotional inferences.

Cool Papers


Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

Authors:Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Yu Lu, Shilin Zhou, Ziliang Gan, Ziao Wang, Haipang Wu, Ji Liu, André Freitas, Qifan Wang, Zenglin Xu, Rongjuncheng Zhang, Yong Dai

This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignments. Our pipeline consists of three main components: First, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL, thus avoiding the costly pre-training of vision-specific modalities. Third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as Automatic Speech Recognition and Speech-to-Speech chat. To this end, we introduce an industry-level omni-modal LLM, Nexus. Extensive experiments validate the efficacy of our pipeline, yielding the following key findings:(1) In the visual understanding task, Nexus exhibits superior performance compared with its backbone model - Qwen2.5-VL-7B, validating the efficiency of our training strategy. (2) Within the English Spoken Question-Answering task, the model achieves better accuracy than the same-period competitor (i.e, MiniCPM-o2.6-7B) in the LLaMA Q. benchmark. (3) In our real-world ASR testset, Nexus achieves outstanding performance, indicating its robustness in real scenarios. (4) In the Speech-to-Text Translation task, our model outperforms Qwen2-Audio-Instruct-7B. (5) In the Text-to-Speech task, based on pretrained vocoder (e.g., Fishspeech1.4 or CosyVoice2.0), Nexus is comparable to its backbone vocoder on Seed-TTS benchmark. (6) An in-depth analysis of tri-modal alignment reveals that incorporating the audio modality enhances representational alignment between vision and language.
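
The first component, a modular framework for configuring encoder-LLM-decoder combinations, can be pictured as a small config object that wires named components together. The component names and registry below are illustrative, not the project's actual API:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class OmniPipelineConfig:
    encoders: Dict[str, str]        # modality -> encoder name
    llm: str                        # backbone language model
    decoders: Dict[str, str] = field(default_factory=dict)  # modality -> decoder/vocoder

    def describe(self) -> List[str]:
        lines = [f"{m} --{e}--> {self.llm}" for m, e in self.encoders.items()]
        lines += [f"{self.llm} --{d}--> {m}" for m, d in self.decoders.items()]
        return lines

# one plausible wiring, loosely mirroring the paper's description
cfg = OmniPipelineConfig(
    encoders={"vision": "qwen2.5-vl-vision-tower", "audio": "audio-encoder"},
    llm="qwen2.5-vl-llm",
    decoders={"speech": "cosyvoice2.0"},
)
print("\n".join(cfg.describe()))
```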


Paper & Project Links

PDF Project: https://github.com/HiThink-Research/NEXUS-O

Summary

This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational cost, and complex feature alignment. The pipeline has three components: a modular framework for flexibly configuring encoder-LLM-decoder architectures; a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL, avoiding costly vision-specific pre-training; and an audio synthesis pipeline that produces high-quality audio-text data from diverse real-world scenarios for applications such as automatic speech recognition and speech-to-speech chat. The resulting industry-level omni-modal LLM, Nexus, is validated by extensive experiments and performs strongly across multiple tasks.

Key Takeaways

  1. Proposes an omni-modal large language model (LLM) pipeline integrating auditory, visual, and linguistic modalities.
  2. The pipeline has three main components: a modular framework, a lightweight training strategy, and an audio synthesis pipeline.
  3. Audio-language alignment is pre-trained on an advanced vision-language model, avoiding costly pre-training of vision-specific modalities.
  4. On visual understanding, Nexus outperforms its backbone model Qwen2.5-VL-7B.
  5. On English spoken question answering, Nexus is more accurate than its same-period competitor.
  6. Nexus performs strongly on a real-world automatic speech recognition test set.
Cool Papers


Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models

Authors:Vahid Balazadeh, Mohammadmehdi Ataei, Hyunmin Cheong, Amir Hosein Khasahmadi, Rahul G. Krishnan

Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). This limitation arises from an inability to translate learned knowledge into predictions about physical behavior. Although continual fine-tuning can mitigate this issue, it is expensive for large models and impractical to perform repeatedly for every task. This necessitates the creation of modular and scalable ways to teach VLMs about physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a modular framework where specialized smaller VLMs are fine-tuned to generate detailed physical scene descriptions. These can be used as physical contexts to enhance the reasoning capabilities of larger VLMs. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. We perform experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, to demonstrate that PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs also show strong Sim2Real transfer, successfully generalizing from simulated training data to real-world scenes.
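
The PCB idea, a small fine-tuned describer whose output is prepended as physical context for a larger reasoner, reduces to a two-call prompting pattern. The function names and prompt wording below are illustrative assumptions:

```python
def answer_with_physics_context(image, question, pcb_describe, large_vlm):
    """pcb_describe(image) -> detailed physical scene description (str)
    large_vlm(prompt, image) -> answer (str)"""
    context = pcb_describe(image)  # e.g. objects, contacts, support relations
    prompt = (
        "Physical scene description:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer using the description above."
    )
    return large_vlm(prompt, image)

# toy stubs demonstrating the call pattern
answer = answer_with_physics_context(
    image=b"png-bytes",
    question="Will the tower fall?",
    pcb_describe=lambda img: "Three blocks stacked; top block overhangs the base by 60%.",
    large_vlm=lambda prompt, img: "Yes, the overhang makes the stack unstable.",
)
print(answer)
```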


Paper & Project Links

PDF

Summary
Physical reasoning remains a major challenge for Vision-Language Models (VLMs) because they struggle to translate learned knowledge into predictions about physical behavior. Continual fine-tuning mitigates this but is expensive for large models and impractical to repeat for every task, motivating modular, scalable ways to teach VLMs physical reasoning. Physics Context Builders (PCBs) are specialized smaller VLMs fine-tuned to generate detailed physical scene descriptions, which serve as physical contexts that enhance the reasoning of larger VLMs. PCBs separate visual perception from reasoning, making it possible to analyze their relative contributions to physical understanding. Experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, show that PCBs raise average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs also show strong Sim2Real transfer, generalizing from simulated training data to real-world scenes.

Key Takeaways

  1. Physical reasoning is a key challenge for VLMs because they struggle to turn learned knowledge into predictions about physical behavior.
  2. Continual fine-tuning helps but is costly for large models and hard to repeat for every task.
  3. Physics Context Builders (PCBs) form a modular framework in which specialized smaller VLMs generate detailed physical scene descriptions that boost the reasoning of larger VLMs.
  4. PCBs separate visual perception from reasoning, making their relative contributions easier to analyze.
  5. Experiments on CLEVRER and Falling Tower show PCBs significantly improve physical reasoning, raising average accuracy by up to 13.8%.
  6. PCBs exhibit strong Sim2Real transfer, generalizing effectively from simulated training data to real-world scenes.
Cool Papers



Author: Kedreamix
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!