⚠️ All of the summaries below are generated by a large language model; they may contain errors, are provided for reference only, and should be used with caution
🔴 Note: never rely on these summaries in serious academic settings; they are intended only as an initial screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-22
Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning
Authors:Min Cao, Xinyu Zhou, Ding Jiang, Bo Du, Mang Ye, Min Zhang
Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, facing the challenge of modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, and a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. Data and code are available at https://github.com/Flame-Chasers/Bi-IRRA.
Paper and Project Links
PDF Final version published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Xplore link: https://ieeexplore.ieee.org/document/11199360
Summary
Text-to-image person retrieval (TIPR) identifies a target person from a textual description and must cope with cross-modal heterogeneity, which prior work addresses through global or local cross-modal alignment strategies. Global methods, however, tend to overlook fine-grained cross-modal differences, while local methods need additional prior information to find explicit part-level alignments. To address this, the study pioneers a multilingual TIPR task and builds a multilingual TIPR benchmark, using large language models for initial translations that are then refined with domain expertise. It also proposes Bi-IRRA, a framework for bidirectional implicit relation reasoning and alignment across languages and modalities: a bidirectional implicit relation reasoning module predicts masked images and text in both directions, implicitly strengthening the modeling of local cross-lingual and cross-modal relations, while an integrated multi-dimensional global alignment module narrows the modality gap. The method achieves new state-of-the-art results on all multilingual TIPR datasets; the data and code are available at the linked repository.
Key Takeaways
- TIPR aims to identify a target person from a textual description and must address the challenge of cross-modal heterogeneity.
- Current methods mostly rely on global or local cross-modal alignment strategies, each with its own shortcomings.
- The proposed multilingual TIPR task is new and includes the construction of a multilingual TIPR benchmark.
- Large language models provide initial translations, which are then refined with domain knowledge.
- The Bi-IRRA framework performs bidirectional implicit relation reasoning and alignment across languages and modalities.
- Bi-IRRA combines a module for bidirectional prediction of masked images and text with a multi-dimensional global alignment module.
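To make the global-alignment idea above more concrete, here is a minimal sketch of a symmetric image-to-text and text-to-image contrastive objective over L2-normalized embeddings, written in plain NumPy. It is an editorial illustration, not the authors' Bi-IRRA code: the embedding size, the temperature, and the `image_emb` / `text_emb` arrays are assumptions, and the paper's bidirectional masked-prediction module and multi-dimensional alignment design are not reproduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Normalize each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Toy bidirectional (image->text and text->image) InfoNCE-style loss.

    Matched image/text pairs are assumed to share the same row index.
    This is NOT the Bi-IRRA objective, only a generic global-alignment sketch.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])          # diagonal entries are positives

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # Average the two directions: image-to-text and text-to-image.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(8, 256))                    # 8 person images, 256-d features
    text_emb = image_emb + 0.1 * rng.normal(size=(8, 256))   # noisy "captions" for the same people
    print("alignment loss:", symmetric_contrastive_loss(image_emb, text_emb))
```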
Click to view paper screenshots
Click, Predict, Trust: Clinician-in-the-Loop AI Segmentation for Lung Cancer CT-Based Prognosis within the Knowledge-to-Action Framework
Authors:Mohammad R. Salmanpour, Sonya Falahati, Amir Hossein Pouria, Amin Mousavi, Somayeh Sadat Mehrnia, Morteza Alizadeh, Arman Gorji, Zeinab Farsangi, Alireza Safarian, Mehdi Maghsudi, Carlos Uribe, Arman Rahmim, Ren Yuan
Lung cancer remains the leading cause of cancer mortality, with CT imaging central to screening, prognosis, and treatment. Manual segmentation is variable and time-intensive, while deep learning (DL) offers automation but faces barriers to clinical adoption. Guided by the Knowledge-to-Action framework, this study develops a clinician-in-the-loop DL pipeline to enhance reproducibility, prognostic accuracy, and clinical trust. Multi-center CT data from 999 patients across 12 public datasets were analyzed using five DL models (3D Attention U-Net, ResUNet, VNet, ReconNet, SAM-Med3D), benchmarked against expert contours on whole and click-point cropped images. Segmentation reproducibility was assessed using 497 PySERA-extracted radiomic features via Spearman correlation, ICC, Wilcoxon tests, and MANOVA, while prognostic modeling compared supervised (SL) and semi-supervised learning (SSL) across 38 dimensionality reduction strategies and 24 classifiers. Six physicians qualitatively evaluated masks across seven domains, including clinical meaningfulness, boundary quality, prognostic value, trust, and workflow integration. VNet achieved the best performance (Dice = 0.83, IoU = 0.71), radiomic stability (mean correlation = 0.76, ICC = 0.65), and predictive accuracy under SSL (accuracy = 0.88, F1 = 0.83). SSL consistently outperformed SL across models. Radiologists favored VNet for peritumoral representation and smoother boundaries, preferring AI-generated initial masks for refinement rather than replacement. These results demonstrate that integrating VNet with SSL yields accurate, reproducible, and clinically trusted CT-based lung cancer prognosis, highlighting a feasible path toward physician-centered AI translation.
Paper and Project Links
PDF 13 pages, 2 figures, and 2 tables
Summary
Guided by the Knowledge-to-Action framework, this study builds a clinician-in-the-loop deep learning pipeline to improve the reproducibility, prognostic accuracy, and clinical trustworthiness of CT-based lung cancer analysis. Using multi-center CT data, five deep learning models are compared, with quantitative evaluation of the results and qualitative assessment by physicians. VNet combined with semi-supervised learning delivers the best performance and is favored by the physicians, offering an accurate, reproducible, and clinically trusted path toward translation for CT-based lung cancer prognosis.
Key Takeaways
- Lung cancer remains the leading cause of cancer mortality, and CT imaging plays a key role in screening, prognosis, and treatment.
- Deep learning shows promise for automated segmentation, but barriers to clinical adoption remain.
- Guided by the Knowledge-to-Action framework, the study builds a clinician-in-the-loop deep learning pipeline aimed at improving reproducibility, prognostic accuracy, and clinical trust.
- VNet performs best in segmentation quality, radiomic stability, and predictive accuracy.
- Semi-supervised learning consistently outperforms supervised learning across models.
- Radiologists prefer to refine VNet-generated initial masks rather than replace them entirely.
- This strategy offers an accurate, reproducible, and clinically trusted path toward translation for CT-based lung cancer prognosis.
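As a concrete reference for the Dice and IoU figures quoted above, the sketch below computes both metrics between a predicted and a reference binary mask. It is a generic editorial example under assumed inputs (random toy 3D masks), not the study's evaluation code, and the PySERA radiomics, ICC, and statistical-testing steps are not shown.

```python
import numpy as np

def dice_coefficient(pred, ref, eps=1e-8):
    """Dice = 2|A and B| / (|A| + |B|) for binary masks of the same shape."""
    pred = pred.astype(bool)
    ref = ref.astype(bool)
    inter = np.logical_and(pred, ref).sum()
    return (2.0 * inter + eps) / (pred.sum() + ref.sum() + eps)

def iou(pred, ref, eps=1e-8):
    """IoU (Jaccard) = |A and B| / |A or B| for binary masks of the same shape."""
    pred = pred.astype(bool)
    ref = ref.astype(bool)
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return (inter + eps) / (union + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Toy 3D masks: a reference tumour mask and a slightly perturbed prediction.
    ref = rng.random((32, 64, 64)) > 0.7
    noise = rng.random((32, 64, 64)) > 0.95
    pred = np.logical_xor(ref, noise)            # flip roughly 5% of the voxels
    print(f"Dice = {dice_coefficient(pred, ref):.3f}, IoU = {iou(pred, ref):.3f}")
```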
Click to view paper screenshots
Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling
Authors:Erik Riise, Mehmet Onurcan Kaya, Dim P. Papadopoulos
While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.
Paper and Project Links
Summary
While inference-time scaling through search has transformed large language models, carrying these gains over to image generation has proven difficult. Attempts to apply search strategies to continuous diffusion models bring limited benefit, and simple random sampling often works best. This study shows that the discrete, sequential nature of visual autoregressive models enables effective search for image generation: beam search substantially improves text-to-image generation, allowing a 2B-parameter autoregressive model to outperform a 12B-parameter diffusion model on benchmarks. Systematic ablations attribute the advantage to the discrete token space, which permits early pruning and computational reuse, and the verifier analysis highlights a trade-off between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation, which can guide future architectural improvements.
Key Takeaways
- Inference-time scaling through search has driven major progress in large language models, but applying it to image generation remains challenging.
- Applying search strategies to continuous diffusion models yields limited benefit; simple random sampling is often the best choice.
- The discrete, sequential nature of visual autoregressive models enables effective search for image generation.
- Beam search markedly improves text-to-image generation, allowing a smaller 2B-parameter autoregressive model to outperform a larger 12B-parameter diffusion model on benchmarks.
- The advantage stems from the discrete token space, which allows early pruning and computational reuse, rather than from model scale alone.
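The central claim, that a discrete token space makes search practical, can be illustrated with a generic beam search over a toy autoregressive next-token scorer. The sketch below is an editorial stand-in: `toy_next_token_logprobs`, the vocabulary size, and the beam width are invented assumptions, not the paper's visual autoregressive model or its verifier.

```python
import numpy as np

VOCAB_SIZE = 16   # stand-in for a visual token codebook
SEQ_LEN = 8       # stand-in for the number of image tokens to generate

def toy_next_token_logprobs(prefix, rng):
    """Fake autoregressive scorer returning log-probs over the next visual token.

    A real system would query the autoregressive image model here.
    """
    logits = rng.normal(size=VOCAB_SIZE) - 0.01 * len(prefix)
    logits -= np.log(np.exp(logits).sum())       # normalize to log-probabilities
    return logits

def beam_search(beam_width=4, seed=0):
    rng = np.random.default_rng(seed)
    beams = [((), 0.0)]                          # (token prefix, cumulative log-prob)
    for _ in range(SEQ_LEN):
        candidates = []
        for prefix, score in beams:
            logprobs = toy_next_token_logprobs(prefix, rng)
            for tok in range(VOCAB_SIZE):
                candidates.append((prefix + (tok,), score + logprobs[tok]))
        # Early pruning: keep only the top-`beam_width` partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

if __name__ == "__main__":
    for tokens, score in beam_search():
        print(f"score={score:.2f} tokens={tokens}")
```

Because partial sequences are ranked and pruned at every step, compute is spent only on promising candidates, which is the property the paper credits to discrete visual tokens.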
Click to view paper screenshots
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Authors:Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue
Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high-quality image-code pairs. We use this data to fine-tune an open-source MLLM via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, demonstrating substantial gains in its UI generation capabilities. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.
Paper and Project Links
PDF ScreenCoder-v2
Summary
This paper presents ScreenCoder, a modular multi-agent framework for automatically converting user interface (UI) designs into front-end code. By decomposing the task into three interpretable stages, grounding, planning, and generation, the framework achieves higher robustness and fidelity than end-to-end approaches. ScreenCoder also acts as a scalable data engine that produces high-quality image-code pairs, which are used to fine-tune an open-source multimodal large language model. Experiments show state-of-the-art performance in layout accuracy, structural coherence, and code correctness.
Key Takeaways
- The ScreenCoder framework uses a modular multi-agent approach that splits UI-to-code generation into three stages: grounding, planning, and generation.
- The framework achieves high robustness and fidelity, outperforming conventional end-to-end approaches.
- ScreenCoder also serves as a scalable data engine that generates high-quality image-code pairs.
- The generated data are used to fine-tune an open-source multimodal large language model via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, substantially boosting its UI generation ability.
- Experiments show that ScreenCoder reaches state-of-the-art performance in layout accuracy, structural coherence, and code correctness.
- The code has been released publicly on GitHub.
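The grounding, planning, and generation decomposition can be pictured as three small stages passing structured data to one another. The sketch below is an editorial illustration with hard-coded placeholder logic; the element dictionaries, the layout tree, and the emitted HTML are assumptions and do not reflect ScreenCoder's actual agents or prompts.

```python
from typing import Dict, List

def grounding_agent(screenshot_path: str) -> List[Dict]:
    """Stage 1: detect UI elements and their bounding boxes (stubbed)."""
    # A real grounding agent would run a multimodal model on the screenshot.
    return [
        {"type": "header", "text": "My App", "box": (0, 0, 800, 80)},
        {"type": "button", "text": "Sign in", "box": (650, 20, 780, 60)},
    ]

def planning_agent(elements: List[Dict]) -> Dict:
    """Stage 2: organize detected elements into a coarse layout tree (stubbed)."""
    return {"tag": "body", "children": [
        {"tag": "header", "children": [e for e in elements if e["type"] == "header"]},
        {"tag": "nav", "children": [e for e in elements if e["type"] == "button"]},
    ]}

def generation_agent(layout: Dict) -> str:
    """Stage 3: render the layout tree as HTML (stubbed code synthesis)."""
    def render(node) -> str:
        if "tag" not in node:                    # leaf UI element
            return f"<span>{node['text']}</span>"
        inner = "".join(render(c) for c in node["children"])
        return f"<{node['tag']}>{inner}</{node['tag']}>"
    return render(layout)

if __name__ == "__main__":
    elements = grounding_agent("screenshot.png")   # hypothetical input path
    layout = planning_agent(elements)
    print(generation_agent(layout))
```

Keeping the stages behind simple function boundaries is what allows each agent to be specialized, inspected, or swapped out independently, which is the modularity argument the abstract makes.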
Click to view paper screenshots
Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models
Authors:Chenrui Tie, Shengxiang Sun, Jinxuan Zhu, Yiwei Liu, Jingxiang Guo, Yue Hu, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao
Humans possess an extraordinary ability to understand and execute complex manipulation tasks by interpreting abstract instruction manuals. For robots, however, this capability remains a substantial challenge, as they cannot interpret abstract instructions and translate them into executable actions. In this paper, we present Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions. Our approach leverages a Vision-Language Model (VLM) to extract structured information from instructional images and then uses this information to construct hierarchical assembly graphs. These graphs represent parts, subassemblies, and the relationships between them. To facilitate task execution, a pose estimation model predicts the relative 6D poses of components at each assembly step. At the same time, a motion planning module generates actionable sequences for real-world robotic implementation. We demonstrate the effectiveness of Manual2Skill by successfully assembling several real-world IKEA furniture items. This application highlights its ability to manage long-horizon manipulation tasks with both efficiency and precision, significantly enhancing the practicality of robot learning from instruction manuals. This work marks a step forward in advancing robotic systems capable of understanding and executing complex manipulation tasks in a manner akin to human capabilities. Project Page: https://owensun2004.github.io/Furniture-Assembly-Web/
Paper and Project Links
Summary
Robots still struggle to execute complex manipulation tasks because they cannot interpret abstract instructions and turn them into executable actions. This paper presents Manual2Skill, a framework that enables robots to perform complex assembly tasks guided by high-level manual instructions. It uses a vision-language model to extract structured information from instructional images, builds hierarchical assembly graphs, and predicts the relative 6D poses of components at each assembly step, while a motion planning module generates action sequences for real-world robot execution. Successfully assembling real IKEA furniture demonstrates the framework's efficiency and precision on long-horizon manipulation tasks and substantially improves the practicality of robots learning from instruction manuals, marking progress toward robots that understand and execute complex manipulation tasks in a human-like way.
Key Takeaways
- Robots struggle with complex manipulation tasks because they cannot interpret abstract instructions and convert them into executable actions.
- The Manual2Skill framework enables robots to perform complex assembly tasks guided by high-level manual instructions.
- Manual2Skill uses a vision-language model to extract structured information from instructional images.
- Hierarchical assembly graphs represent parts, subassemblies, and the relationships between them.
- A pose estimation model predicts the relative 6D poses of components at each assembly step.
- A motion planning module generates action sequences executable by real robots.
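To show what a hierarchical assembly graph might look like as a data structure, the sketch below encodes parts and subassemblies as nodes with child lists and derives a bottom-up assembly order. It is a hypothetical editorial example: the part names, the graph encoding, and the ordering routine are assumptions, and the VLM extraction, 6D pose estimation, and motion planning stages are not modeled.

```python
# Hypothetical hierarchical assembly graph for a small table:
# leaves are atomic parts, internal nodes are subassemblies.
ASSEMBLY_GRAPH = {
    "table":     ["tabletop", "leg_frame"],
    "leg_frame": ["leg_1", "leg_2", "leg_3", "leg_4", "cross_brace"],
    # Atomic parts have no children.
    "tabletop": [], "leg_1": [], "leg_2": [], "leg_3": [], "leg_4": [], "cross_brace": [],
}

def assembly_order(graph, root):
    """Post-order traversal: every subassembly is scheduled after its children.

    Returns only the internal nodes (actual assembly steps); leaves are skipped.
    """
    order = []
    def visit(node):
        for child in graph[node]:
            visit(child)
        if graph[node]:                          # internal node, i.e. an assembly step
            order.append(node)
    visit(root)
    return order

if __name__ == "__main__":
    for step, subassembly in enumerate(assembly_order(ASSEMBLY_GRAPH, "table"), 1):
        parts = ", ".join(ASSEMBLY_GRAPH[subassembly])
        print(f"step {step}: assemble '{subassembly}' from [{parts}]")
```

The post-order traversal guarantees that the leg frame is completed before the final table step, mirroring how subassemblies precede the finished item in a typical manual.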
Click to view paper screenshots
A Synthetic Data-Driven Radiology Foundation Model for Pan-tumor Clinical Diagnosis
Authors:Wenhui Lei, Hanyu Chen, Zitian Zhang, Luyang Luo, Qiong Xiao, Yannian Gu, Peng Gao, Yankai Jiang, Ci Wang, Guangtao Wu, Tongjia Xu, Yingjie Zhang, Pranav Rajpurkar, Xiaofan Zhang, Shaoting Zhang, Zhenning Wang
AI-assisted imaging made substantial advances in tumor diagnosis and management. However, a major barrier to developing robust oncology foundation models is the scarcity of large-scale, high-quality annotated datasets, which are limited by privacy restrictions and the high cost of manual labeling. To address this gap, we present PASTA, a pan-tumor radiology foundation model built on PASTA-Gen, a synthetic data framework that generated 30,000 3D CT scans with pixel-level lesion masks and structured reports of tumors across ten organ systems. Leveraging this resource, PASTA achieves state-of-the-art performance on 45 of 46 oncology tasks, including non-contrast CT tumor screening, lesion segmentation, structured reporting, tumor staging, survival prediction, and MRI-modality transfer. To assess clinical applicability, we developed PASTA-AID, a clinical decision support system, and ran a retrospective simulated clinical trial across two scenarios. For pan-tumor screening on plain CT with fixed reading time, PASTA-AID increased radiologists’ throughput by 11.1-25.1% and improved sensitivity by 17.0-31.4% and precision by 10.5-24.9%; additionally, in a diagnosis-aid workflow, it reduced segmentation time by up to 78.2% and reporting time by up to 36.5%. Beyond gains in accuracy and efficiency, PASTA-AID narrowed the expertise gap, enabling less-experienced radiologists to approach expert-level performance. Together, this work establishes an end-to-end, synthetic data-driven pipeline spanning data generation, model development, and clinical validation, thereby demonstrating substantial potential for pan-tumor research and clinical translation.
Paper and Project Links
PDF 63 pages, 7 figures
Summary
AI-assisted imaging has made notable progress in tumor diagnosis and management, but the lack of large-scale, high-quality annotated datasets remains the main obstacle to building robust oncology foundation models. To address this, the team presents PASTA, a pan-tumor radiology foundation model built on the synthetic data framework PASTA-Gen, which generated 30,000 3D CT scans with pixel-level lesion masks and structured reports spanning ten organ systems. PASTA achieves the best performance on 45 of 46 oncology tasks, including non-contrast CT tumor screening, lesion segmentation, structured reporting, tumor staging, survival prediction, and MRI-modality transfer. To assess clinical applicability, the team developed the clinical decision support system PASTA-AID and ran a retrospective simulated clinical trial in two scenarios. For pan-tumor screening on plain CT with fixed reading time, PASTA-AID raised radiologists' throughput by 11.1-25.1%, sensitivity by 17.0-31.4%, and precision by 10.5-24.9%; in a diagnosis-aid workflow it cut segmentation time by up to 78.2% and reporting time by up to 36.5%. PASTA-AID also narrowed the expertise gap, bringing less-experienced radiologists close to expert-level performance. The work establishes an end-to-end, synthetic-data-driven pipeline covering data generation, model development, and clinical validation, showing strong potential for pan-tumor research and clinical translation.
Key Takeaways
- AI has made progress in tumor diagnosis and management.
- The lack of large-scale, high-quality annotated datasets is the main obstacle to developing oncology foundation models.
- The PASTA model uses the synthetic data framework PASTA-Gen to generate 3D CT scans with pixel-level lesion masks and structured reports.
- PASTA delivers excellent performance across a wide range of oncology tasks.
- The clinical decision support system PASTA-AID improves the accuracy and efficiency of tumor diagnosis.
- PASTA-AID narrows the expertise gap among radiologists, enabling less-experienced readers to approach expert-level performance.
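To ground the reader-study percentages quoted above, the snippet below shows how per-reader sensitivity, precision, and throughput, together with their relative changes between unassisted and AI-assisted reading, could be computed from simple counts. The counts and the fixed reading time are invented for illustration; this is not the study's analysis code, and the printed numbers do not correspond to the reported results.

```python
from dataclasses import dataclass

@dataclass
class ReadingSession:
    """Counts from one reader over a fixed reading-time screening session."""
    true_positives: int
    false_positives: int
    false_negatives: int
    cases_read: int
    minutes: float

    def sensitivity(self) -> float:
        return self.true_positives / (self.true_positives + self.false_negatives)

    def precision(self) -> float:
        return self.true_positives / (self.true_positives + self.false_positives)

    def throughput(self) -> float:               # cases read per hour
        return self.cases_read / (self.minutes / 60.0)

def relative_change(before: float, after: float) -> float:
    return 100.0 * (after - before) / before

if __name__ == "__main__":
    # Hypothetical counts for one radiologist, without and with AI assistance.
    unassisted = ReadingSession(true_positives=40, false_positives=20,
                                false_negatives=25, cases_read=120, minutes=120)
    assisted   = ReadingSession(true_positives=52, false_positives=18,
                                false_negatives=13, cases_read=140, minutes=120)
    for name in ("sensitivity", "precision", "throughput"):
        b, a = getattr(unassisted, name)(), getattr(assisted, name)()
        print(f"{name:12s}: {b:.3f} -> {a:.3f} ({relative_change(b, a):+.1f}%)")
```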
Click to view paper screenshots