⚠️ All of the summaries below were generated by a large language model; they may contain errors, are for reference only, and should be used with caution.
🔴 Note: never use them for serious academic work — they are only an initial screen before actually reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-21
A Viable Paradigm of Software Automation: Iterative End-to-End Automated Software Development
Authors:Jia Li, Zhi Jin, Kechi Zhang, Huangzhao Zhang, Jiaru Qian, Tiankuo Zhao
Software development automation is a long-term goal in software engineering. With the development of artificial intelligence (AI), more and more researchers are exploring approaches to software automation. They view AI systems as tools or assistants in software development, still requiring significant human involvement. Another initiative is "vibe coding", where AI systems write and repeatedly revise most (or even all) of the code. We foresee that these two development paths will converge towards the same destination: AI systems participate throughout the software development lifecycle, expanding the boundaries of full-stack software development. In this paper, we present a vision of an iterative end-to-end automated software development paradigm, AutoSW. It operates in an analyze-plan-implement-deliver loop, where AI systems as human partners become first-class actors, translating human intentions expressed in natural language into executable software. We explore a lightweight prototype of the paradigm and initially run it on a set of representative cases. The results indicate that AutoSW can successfully deliver executable software, providing a feasible direction for truly end-to-end automated software development.
Paper and project links
Summary
Automating software development is a long-term goal of software engineering. With the development of artificial intelligence (AI), more and more researchers are exploring approaches to software automation. They view AI systems as tools or assistants in software development that still require substantial human involvement. Another initiative is "vibe coding", in which AI systems write and repeatedly revise most (or even all) of the code. The authors expect these two development paths to converge on the same destination: AI systems participating throughout the software development lifecycle and expanding the boundaries of full-stack software development. The paper presents the vision of AutoSW, an iterative end-to-end automated software development paradigm. It operates in an analyze-plan-implement-deliver loop in which AI systems, as human partners, become first-class actors that translate human intentions expressed in natural language into executable software. The authors explore a lightweight prototype of the paradigm and initially run a set of representative cases. The results indicate that AutoSW can successfully deliver executable software, providing a feasible direction for truly end-to-end automated software development.
Key Takeaways
- Software development automation is a long-term goal of software engineering.
- AI systems currently act as tools or assistants in software development, but still require substantial human involvement.
- In the "vibe coding" initiative, AI systems write and repeatedly revise most or all of the code.
- The two main development paths (AI as a tool and "vibe coding") will converge on the same destination: AI participating throughout the entire software development lifecycle.
- The paper proposes the vision of AutoSW, an iterative end-to-end automated software development paradigm.
- AutoSW operates in an analyze-plan-implement-deliver loop in which AI systems translate human natural-language intentions into executable software.
HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples
Authors:Rishikant Chigrupaatii, Ponnada Sai Tulasi Kanishka, Lalit Chandra Routhu, Martin Patel Sama Supratheek Reddy, Divyam Gupta, Dasari Srikar, Krishna Teja Kuchimanchi, Rajiv Misra, Rohun Tripathi
With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare it with performance in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find a regression in performance for tasks in English versus in Indian languages for 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points for Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.
Paper and project links
Summary
This paper argues for the importance of evaluating multilingual Vision-Language Models (VLMs) in Indian languages and points out the limitations of existing evaluation methods. To address them, the authors propose a scalable framework for evaluating VLMs in Indian languages and comparing their performance against English. Using this framework, they build HinTel-AlignBench, a benchmark dataset for Hindi and Telugu. The work makes three main contributions: a semi-automated dataset creation framework, a comprehensive multilingual vision-language benchmark, and a detailed performance analysis of several state-of-the-art VLMs. The study finds that most models lose performance on most tasks in the Indian languages.
Key Takeaways
- India has nearly 1.5 billion people and more than 120 major languages, making the evaluation of multilingual Vision-Language Models (VLMs) essential. However, current multilingual VLM evaluations suffer from four major limitations.
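The paper's semi-automated dataset creation combines back-translation, filtering, and human verification. The filtering step might look like the hypothetical sketch below, where `translate` is an identity stub standing in for a real MT system and the 0.85 threshold is an assumption, not a value from the paper:

```python
# Hypothetical back-translation filter: a sample survives only if its
# back-translation stays close to the English source; survivors would still
# go to human verification, per the paper's pipeline description.

from difflib import SequenceMatcher

def translate(text: str, src: str, tgt: str) -> str:
    # Stub: a real pipeline would call a machine-translation model here.
    return text

def back_translation_filter(english_samples, tgt_lang="hi", threshold=0.85):
    """Return (source, translation, similarity) triples that pass the filter."""
    kept = []
    for sample in english_samples:
        forward = translate(sample, "en", tgt_lang)
        back = translate(forward, tgt_lang, "en")
        score = SequenceMatcher(None, sample.lower(), back.lower()).ratio()
        if score >= threshold:
            kept.append((sample, forward, score))
    return kept
```

A string-similarity ratio is only a crude stand-in for translation-quality scoring; it is used here solely to keep the sketch self-contained.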
IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?
Authors:Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Yunfei Zhao, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, Botian Shi
The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models’ ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available at https://github.com/SIGMME/IWR-Bench.
Paper and project links
Summary
This paper introduces IWR-Bench, a new benchmark for evaluating the ability of Large Vision-Language Models (LVLMs) to reconstruct interactive webpages from user-interaction videos. IWR-Bench contains 113 carefully curated tasks from real-world websites, each providing the user-interaction video together with all crawled static assets (such as images and videos). The benchmark tests two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate that logic into functional code. The results expose the limitations of current models and establish IWR-Bench as a challenging frontier for vision-language research.
Key Takeaways
- IWR-Bench is a new benchmark focused on evaluating LVLMs on interactive webpage reconstruction.
- It contains diverse tasks from real-world websites, covering complex interactions, visual styles, and domains.
- The benchmark evaluates two core challenges: multi-modal reasoning and code generation.
- Existing models perform poorly on this benchmark, highlighting their weaknesses in handling temporal dynamics and event-driven logic.
- IWR-Bench provides a challenging frontier for vision-language research.
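The abstract reports a functional-correctness score (IFS) and a visual-fidelity score (VFS) alongside an overall score. How IWR-Bench actually aggregates them is not specified in this digest, so the sketch below assumes a simple per-task weighted average purely for illustration:

```python
# Illustrative aggregation only: the equal IFS/VFS weighting is an assumption,
# not the benchmark's actual formula (the reported 36.35% overall is not a
# plain average of the reported IFS and VFS).

def overall_score(task_results, w_ifs=0.5, w_vfs=0.5):
    """task_results: list of (ifs, vfs) pairs, each score in [0, 100]."""
    if not task_results:
        return 0.0
    per_task = [w_ifs * ifs + w_vfs * vfs for ifs, vfs in task_results]
    return sum(per_task) / len(per_task)
```

Whatever the true weighting, the reported gap (24.39% IFS vs 64.25% VFS) means pages tend to look right while behaving wrong, which is the failure mode the benchmark is designed to surface.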
ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction
Authors:Adeela Islam, Stefano Fiorini, Stuart James, Pietro Morerio, Alessio Del Bue
The task of reassembly is a significant challenge across multiple domains, including archaeology, genomics, and molecular docking, requiring the precise placement and orientation of elements to reconstruct an original structure. In this work, we address key limitations in state-of-the-art Deep Learning methods for reassembly, namely i) scalability; ii) multimodality; and iii) real-world applicability: handling pieces beyond square or simple geometric shapes, with realistic and complex erosion, and other real-world complications. We propose ReassembleNet, a method that reduces complexity by representing each input piece as a set of contour keypoints and learning to select the most informative ones with techniques inspired by Graph Neural Network pooling. ReassembleNet effectively lowers computational complexity while enabling the integration of features from multiple modalities, including both geometric and texture data, and is further enhanced through pretraining on a semi-synthetic dataset. We then apply diffusion-based pose estimation to recover the original structure. We improve on prior methods by 57% and 87% in RMSE for rotation and translation, respectively.
Paper and project links
Summary
This paper discusses the reassembly task, a challenge across multiple domains including archaeology, genomics, and molecular docking. It points out the limitations of current deep-learning methods for reassembly and proposes ReassembleNet to address them. ReassembleNet lowers computational complexity and fuses multimodal data to handle complex reassembly tasks, and is pretrained on a semi-synthetic dataset to boost performance. Diffusion-based pose estimation is then applied to recover the original structure, improving rotation and translation RMSE over prior methods by 57% and 87%, respectively.
Key Takeaways
- Reassembly is a major challenge in many domains, requiring the precise placement and orientation of elements to reconstruct an original structure.
- Current deep-learning reassembly methods are limited in scalability, multimodality, and real-world applicability.
- ReassembleNet addresses these issues by representing each input piece as a set of contour keypoints.
- ReassembleNet selects the most informative keypoints using techniques inspired by Graph Neural Network pooling, lowering computational complexity while fusing multimodal data.
- ReassembleNet is pretrained on a semi-synthetic dataset to boost performance.
- Diffusion-based pose estimation is applied to recover the original structure.
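The rotation and translation RMSE metrics mentioned in the takeaways can be sketched as below. This is illustrative only: each piece's pose is simplified to a single in-plane angle plus a 2D offset, which may not match the paper's exact parameterization.

```python
# Illustrative pose-error metrics in pure Python: each pose is
# (angle_deg, x, y), with orientation reduced to one scalar angle.

import math

def rmse(pred, gt):
    """Root-mean-square error over paired scalar values."""
    assert len(pred) == len(gt) and pred
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))

def pose_errors(pred_poses, gt_poses):
    """Return (rotation RMSE in degrees, translation RMSE in distance units)."""
    rot = rmse([p[0] for p in pred_poses], [g[0] for g in gt_poses])
    trans_sq = [
        (p[1] - g[1]) ** 2 + (p[2] - g[2]) ** 2
        for p, g in zip(pred_poses, gt_poses)
    ]
    return rot, math.sqrt(sum(trans_sq) / len(trans_sq))
```

The reported 57%/87% improvements are relative reductions in these two errors versus prior methods, so rotation and translation quality are assessed separately rather than folded into one number.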