MMT

Published: 2025-11-26

Updated: 2025-11-27

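⚠️ All of the summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use these summaries in serious academic settings; they are only suitable for initial screening before reading the papers.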

2025-11-26 Update

Rethinking Intermediate Representation for VLM-based Robot Manipulation

Authors: Weiliang Tang, Jialin Gao, Jia-Hui Pan, Gang Wang, Li Erran Li, Yunhui Liu, Mingyu Ding, Pheng-Ann Heng, Chi-Wing Fu

A Vision-Language Model (VLM) is an important component to enable robust robot manipulation. Yet, using it to translate human instructions into an action-resolvable intermediate representation often needs a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design the Semantic Assembly representation named SEAM, by decomposing the intermediate representation into vocabulary and grammar. Doing so leads us to a concise vocabulary of semantically-rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, with the shortest inference time among all state-of-the-art parallel works. Also, we formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects. Extensive real-world experiments further demonstrate its SOTA performance under varying settings and tasks.

Summary

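Inspired by context-free grammar, the paper designs SEAM, a Semantic Assembly representation that decomposes the intermediate representation into a vocabulary and a grammar. This yields a concise vocabulary of semantically rich operations and a VLM-friendly grammar that can handle diverse unseen tasks. The paper also proposes a new open-vocabulary segmentation paradigm that uses a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, with the shortest inference time among state-of-the-art parallel works. New metrics for action-generalizability and VLM-comprehensibility show that SEAM outperforms mainstream representations on both aspects, and extensive real-world experiments further confirm its performance across varying settings and tasks.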

Key Takeaways

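SEAM is inspired by context-free grammar and decomposes the intermediate representation into a vocabulary and a grammar.
SEAM provides a concise vocabulary of semantically rich operations and a VLM-friendly grammar for diverse unseen tasks.
A new open-vocabulary segmentation paradigm localizes fine-grained object parts for robot manipulation.
The segmentation uses a retrieval-augmented few-shot learning strategy and has the shortest inference time among state-of-the-art parallel works.
New metrics are formulated for action-generalizability and VLM-comprehensibility.
SEAM outperforms mainstream representations on both metrics.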
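
The vocabulary-plus-grammar decomposition can be pictured with a toy example. The sketch below is not SEAM's actual representation: the operation names (`grasp`, `move_to`, `rotate`, `release`), their argument counts, and the `op(args); op(args)` surface syntax are all hypothetical, chosen only to show how a small fixed vocabulary and a simple grammar together let a checker accept or reject a VLM-generated plan.

```python
import re

# Hypothetical vocabulary: operation name -> number of arguments it takes.
VOCABULARY = {"grasp": 1, "move_to": 1, "rotate": 2, "release": 0}

# Illustrative grammar:
#   program := step (";" step)*
#   step    := OP "(" [arg ("," arg)*] ")"
STEP = re.compile(r"^\s*([a-z_]+)\s*\(\s*(.*?)\s*\)\s*$")


def parse_program(text):
    """Parse a plan string into a list of (operation, arguments) tuples.

    Raises ValueError if a step is malformed, uses an operation outside
    the vocabulary, or passes the wrong number of arguments.
    """
    steps = []
    for raw in text.split(";"):
        if not raw.strip():
            continue  # tolerate a trailing ";"
        match = STEP.match(raw)
        if match is None:
            raise ValueError(f"not a well-formed step: {raw!r}")
        op, arg_text = match.groups()
        args = [a.strip() for a in arg_text.split(",")] if arg_text else []
        if op not in VOCABULARY:
            raise ValueError(f"unknown operation: {op!r}")
        if len(args) != VOCABULARY[op]:
            raise ValueError(f"{op} expects {VOCABULARY[op]} argument(s), got {len(args)}")
        steps.append((op, args))
    return steps


plan = parse_program("grasp(mug.handle); move_to(shelf.top); release()")
print(plan)
```

Because the vocabulary is closed and the grammar is small, an out-of-vocabulary step such as `fly(mug)` is rejected before anything reaches the robot, which is one way to read the "VLM-friendly grammar" idea.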

DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation

Authors: Yongkun Du, Pinxuan Chen, Xuye Ying, Zhineng Chen

The advent of Multimodal Large Language Models (MLLMs) has unlocked the potential for end-to-end document parsing and translation. However, prevailing benchmarks such as OmniDocBench and DITrans are dominated by pristine scanned or digital-born documents, and thus fail to adequately represent the intricate challenges of real-world capture conditions, such as geometric distortions and photometric variations. To fill this gap, we introduce DocPTBench, a comprehensive benchmark specifically designed for Photographed Document Parsing and Translation. DocPTBench comprises over 1,300 high-resolution photographed documents from multiple domains, includes eight translation scenarios, and provides meticulously human-verified annotations for both parsing and translation. Our experiments demonstrate that transitioning from digital-born to photographed documents results in a substantial performance decline: popular MLLMs exhibit an average accuracy drop of 18% in end-to-end parsing and 12% in translation, while specialized document parsing models show a significant average decrease of 25%. This substantial performance gap underscores the unique challenges posed by documents captured in real-world conditions and reveals the limited robustness of existing models. Dataset and code are available at https://github.com/Topdu/DocPTBench.

Summary

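MLLMs show potential for end-to-end document parsing and translation, but existing benchmarks do not adequately represent the challenges of real-world capture conditions. DocPTBench is a benchmark designed specifically for photographed document parsing and translation. It contains over 1,300 high-resolution photographed documents from multiple domains, covers eight translation scenarios, and provides human-verified annotations for both parsing and translation. Experiments show that moving from digital-born to photographed documents causes a substantial performance decline: popular MLLMs drop by 18% on average in end-to-end parsing accuracy and by 12% in translation, while specialized document parsing models drop by 25% on average.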

Key Takeaways

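MLLMs show potential for document parsing and translation, but they must cope with real-world capture conditions.
Existing benchmarks such as OmniDocBench and DITrans are dominated by scanned or digital-born documents and do not adequately reflect real-world complexity.
DocPTBench is a comprehensive benchmark designed for photographed document parsing and translation.
DocPTBench contains high-resolution photographed documents from multiple domains with human-verified annotations.
Moving from digital-born to photographed documents causes a substantial decline: popular MLLMs lose 18% average accuracy in end-to-end parsing and 12% in translation.
Specialized document parsing models degrade even more on real-world captures, with an average decrease of 25%.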
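
The abstract reports average drops but does not say here whether they are absolute percentage points or relative to the digital-born score, and the two can differ noticeably. The sketch below computes both for comparison. The model names and scores are made-up placeholders, not DocPTBench results, and `average_drops` is a hypothetical helper.

```python
def average_drops(scores):
    """Return (mean absolute drop, mean relative drop) across models.

    `scores` maps a model name to (digital_born_score, photographed_score),
    both in percent. The absolute drop is in percentage points; the
    relative drop is a percentage of the digital-born score.
    """
    absolute = [digital - photo for digital, photo in scores.values()]
    relative = [(digital - photo) * 100 / digital for digital, photo in scores.values()]
    count = len(scores)
    return sum(absolute) / count, sum(relative) / count


# Placeholder numbers for illustration only.
toy_scores = {"model_a": (80.0, 60.0), "model_b": (50.0, 40.0)}
abs_drop, rel_drop = average_drops(toy_scores)
print(abs_drop, rel_drop)  # 15.0 22.5
```

With these toy inputs the same pair of results reads as a 15-point drop or a 22.5% relative drop, so it is worth checking the paper for which convention its headline figures use.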

MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

Authors: Xiyang Wu, Zongxia Li, Jihui Jin, Guangyao Shi, Gouthaman KV, Vishnu Raj, Nilotpal Sinha, Jingxi Chen, Fan Du, Dinesh Manocha

Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-driven reasoning involving motion dynamics and spatial interactions. This limitation reduces their ability to interpret real or AI-generated content (AIGC) videos and to generate physically consistent content. We present an approach that addresses this gap by translating physical-world context cues into interpretable representations aligned with VLMs’ perception, comprehension, and reasoning. We introduce MASS-Bench, a comprehensive benchmark consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections, sub-segment grounding, and full-sequence 3D motion tracking of entities. We further present MASS, a model-agnostic method that injects spatial-temporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning. Experiments and ablations show that our refined VLMs outperform comparable and larger baselines, as well as prior state-of-the-art models, by 8.7% and 6.0%, achieving performance comparable to closed-source SoTA VLMs such as Gemini-2.5-Flash on physics reasoning and comprehension. These results validate the effectiveness of our approach.

Summary

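Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-driven reasoning about motion dynamics and spatial interactions. To address this, the paper translates physical-world context cues into interpretable representations aligned with VLMs' perception, comprehension, and reasoning. It introduces MASS-Bench, a benchmark of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks. It also proposes MASS, a model-agnostic method that injects spatial-temporal signals into the VLM language space via depth-based 3D encoding and visual grounding, together with a motion tracker for object dynamics. With reinforcement fine-tuning, the refined VLMs outperform comparable and larger baselines and prior state-of-the-art models by 8.7% and 6.0%.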

Key Takeaways

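VLMs perform well on standard video tasks but are limited on tasks involving physical motion dynamics and spatial interactions.
To address this, the paper translates physical-world context cues into interpretable representations aligned with VLMs.
MASS-Bench contains 4,350 real-world and AIGC videos and 8,361 free-form question-answering pairs on physics-related comprehension tasks.
MASS is a model-agnostic method that injects spatial-temporal signals into the VLM language space via depth-based 3D encoding and visual grounding.
MASS couples this with a motion tracker for object dynamics, and reinforcement fine-tuning is applied to strengthen cross-modal alignment and reasoning.
Experiments show that the refined VLMs outperform comparable and larger baselines, as well as prior state-of-the-art models, by 8.7% and 6.0%, reaching performance comparable to closed-source VLMs such as Gemini-2.5-Flash.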
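
One way to picture "injecting spatial-temporal signals into the language space" is to serialize tracked 3D positions as text that can be prepended to a prompt. The sketch below is only an illustration of that general idea and should not be read as MASS's encoding, which the summary does not describe in detail. The function name `describe_tracks`, the `(t, x, y, z)` sample layout, and the output wording are all assumptions.

```python
import math


def _fmt(point):
    """Format a 3D point with one decimal per coordinate."""
    return ", ".join(f"{value:.1f}" for value in point)


def describe_tracks(tracks):
    """Turn {name: [(t, x, y, z), ...]} into one text line per object.

    Times are in seconds and positions in metres (assumed units).
    """
    lines = []
    for name in sorted(tracks):
        samples = sorted(tracks[name])  # order samples by timestamp
        t0, *start = samples[0]
        t1, *end = samples[-1]
        displacement = math.dist(start, end)  # straight-line distance
        duration = t1 - t0
        rate = displacement / duration if duration > 0 else 0.0
        lines.append(
            f"{name}: ({_fmt(start)}) at t={t0:.1f}s -> ({_fmt(end)}) at t={t1:.1f}s, "
            f"net displacement {displacement:.2f} m ({rate:.2f} m/s on average)"
        )
    return "\n".join(lines)


toy_tracks = {"ball": [(0.0, 0.0, 0.0, 0.0), (2.0, 3.0, 4.0, 0.0)]}
print(describe_tracks(toy_tracks))
```

The text uses only the first and last sample of each track, so it reports net displacement rather than path length; a fuller serialization would keep intermediate samples as well.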

InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity

Authors: Haoming Wang, Qiyao Xue, Wei Gao

Modern vision-language models (VLMs) are expected to have abilities of spatial reasoning with diverse scene complexities, but evaluating such abilities is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited customizability over the scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control on scene complexity. InfiniBench uniquely translates scene descriptions in natural language into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) an LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcase the usefulness of InfiniBench by generating benchmarks for representative spatial reasoning tasks including measurement, perspective-taking and spatiotemporal tracking.

Summary

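The paper presents InfiniBench, a fully automated and customizable benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control over scene complexity. InfiniBench translates natural-language scene descriptions into photo-realistic videos with complex and physically plausible 3D layouts, relying on three key innovations: an LLM-based agentic framework, a cluster-based layout optimizer, and a task-aware camera trajectory optimization method. Experiments show that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. The paper also demonstrates its usefulness by generating benchmarks for representative spatial reasoning tasks, including measurement, perspective-taking, and spatiotemporal tracking.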

Key Takeaways

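Modern vision-language models (VLMs) are expected to reason spatially across scenes of varying complexity, but this is hard to evaluate because benchmarks that are diverse, scalable, and fully customizable are lacking.
Existing benchmarks offer limited control over scene complexity and cannot isolate specific VLM failure modes under distinct spatial conditions.
InfiniBench is a fully automated, customizable benchmark generator that synthesizes a theoretically infinite variety of 3D scenes with parameterized control over scene complexity.
InfiniBench translates natural-language scene descriptions into photo-realistic videos with complex and physically plausible 3D layouts.
It relies on three innovations: an LLM-based agentic framework that iteratively refines procedural scene constraints, a flexible cluster-based layout optimizer for dense and cluttered scenes, and a task-aware camera trajectory optimization method.
Experiments show that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios.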
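
To make "parameterized control over scene complexity" concrete, the sketch below exposes a few complexity knobs and produces a physically plausible (non-overlapping) 2D layout by simple rejection sampling. This is a deliberately naive stand-in, not InfiniBench's cluster-based layout optimizer; `SceneSpec`, `place_objects`, and the disc approximation of objects are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class SceneSpec:
    num_objects: int      # complexity knob: how many objects to place
    room_size: float      # side length of a square room, in metres
    object_radius: float  # each object is approximated by a disc
    seed: int = 0         # fixed seed so a spec always yields the same scene


def place_objects(spec, max_attempts=10_000):
    """Place non-overlapping discs inside the room by rejection sampling."""
    rng = random.Random(spec.seed)
    low = spec.object_radius
    high = spec.room_size - spec.object_radius
    placed = []
    for _ in range(max_attempts):
        if len(placed) == spec.num_objects:
            break
        candidate = (rng.uniform(low, high), rng.uniform(low, high))
        # Accept only if the disc does not intersect any disc placed so far.
        if all(math.dist(candidate, other) >= 2 * spec.object_radius for other in placed):
            placed.append(candidate)
    if len(placed) < spec.num_objects:
        raise RuntimeError("could not fit all objects; lower the scene complexity")
    return placed


layout = place_objects(SceneSpec(num_objects=5, room_size=10.0, object_radius=0.5))
print(len(layout))
```

Rejection sampling degrades quickly as the room fills up, which gives some intuition for why the abstract describes dense, cluttered scenes as previously intractable for procedural methods and why a dedicated optimizer is proposed.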

A Viable Paradigm of Software Automation: Iterative End-to-End Automated Software Development

Authors: Jia Li, Zhi Jin, Huangzhao Zhang, Kechi Zhang, Jiaru Qian, Tiankuo Zhao

Software development automation is a long-term goal in software engineering. With the development of artificial intelligence (AI), more and more researchers are exploring approaches to software automation. They view AI systems as tools or assistants in software development, still requiring significant human involvement. Another initiative is "vibe coding", where AI systems write and repeatedly revise most (or even all) of the code. We foresee these two development paths will converge towards the same destination: AI systems participate throughout the software development lifecycle, expanding boundaries of full-stack software development. In this paper, we present a vision of an iterative end-to-end automated software development paradigm AutoSW. It operates in an analyze-plan-implement-deliver loop, where AI systems as human partners become first-class actors, translating human intentions expressed in natural language into executable software. We explore a lightweight prototype across the paradigm and initially execute various representative cases. The results indicate that AutoSW can successfully deliver executable software, providing a feasible direction for truly end-to-end automated software development.

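Summary

Software development automation is a long-term goal in software engineering. As AI advances, more researchers are exploring software automation, typically treating AI systems as tools or assistants that still require significant human involvement. Another initiative is "vibe coding", where AI systems write and repeatedly revise most or even all of the code. The authors expect these two paths to converge on the same destination: AI systems participating throughout the software development lifecycle and expanding the boundaries of full-stack development. The paper presents a vision of AutoSW, an iterative end-to-end automated software development paradigm. It operates in an analyze-plan-implement-deliver loop in which AI systems, as human partners, become first-class actors that translate human intentions expressed in natural language into executable software. The authors explore a lightweight prototype of the paradigm and run an initial set of representative cases. The results indicate that AutoSW can successfully deliver executable software, suggesting a feasible direction for truly end-to-end automated software development.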
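
The analyze-plan-implement-deliver loop can be sketched as a small driver that repeats the four stages until delivery is accepted. This is a minimal illustration of the control flow only, not the AutoSW prototype: `ProjectState`, `run_loop`, the stage signatures, and the stub stages (which stand in for AI-backed components) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectState:
    intent: str                                   # natural-language request
    requirements: list = field(default_factory=list)
    plan: list = field(default_factory=list)
    artifacts: dict = field(default_factory=dict)
    feedback: list = field(default_factory=list)  # notes from rejected deliveries


def run_loop(intent, analyze, plan, implement, deliver, max_iterations=3):
    """Iterate the four stages; return (state, rounds) or (state, None) if never accepted."""
    state = ProjectState(intent=intent)
    for iteration in range(1, max_iterations + 1):
        state.requirements = analyze(state)
        state.plan = plan(state)
        state.artifacts = implement(state)
        accepted, note = deliver(state)
        if accepted:
            return state, iteration
        state.feedback.append(note)  # feed the rejection into the next round
    return state, None


# Stub stages for illustration only.
def analyze(state):
    return [state.intent] + state.feedback


def plan(state):
    return [f"implement: {item}" for item in state.requirements]


def implement(state):
    return {f"module_{i}.py": step for i, step in enumerate(state.plan)}


def deliver(state):
    # Toy acceptance rule: accept once one round of feedback has been incorporated.
    if state.feedback:
        return True, "accepted"
    return False, "add input validation"


final_state, rounds = run_loop("build a CLI todo app", analyze, plan, implement, deliver)
print(rounds, sorted(final_state.artifacts))
```

Passing the stages in as callables keeps the loop itself independent of how each stage is realized, so a stub can be swapped for a model-backed component one stage at a time.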

Key Takeaways