⚠️ All of the content summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are only meant as an initial screen before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-18
Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs
Authors:Francisco Nogueira, Alexandre Bernardino, Bruno Martins
Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research on the area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages, by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, which are subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, e.g. achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation, compared to an English-only result of 91.3%. Multilingual evaluation shows consistent capabilities across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at $\href{https://multilingual.franreno.com}{multilingual.franreno.com}$.
Paper and project links
Summary
This paper studies multilingual Referring Expression Comprehension (REC). The authors build a large-scale multilingual REC dataset covering 10 languages by expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. They also propose an attention-based neural architecture built on multilingual SigLIP2 encoders. Experimental evaluation shows competitive performance on standard benchmarks, e.g. 86.9% accuracy at IoU@50 on the aggregate multilingual RefCOCO evaluation, close to (though below) the English-only result of 91.3%, demonstrating the practical feasibility of multilingual visual grounding systems.
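For context on the IoU@50 numbers cited above, here is a minimal sketch of how accuracy at an IoU threshold is typically computed for box-level REC evaluation; the box format and function names are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box], thresh: float = 0.5) -> float:
    """Fraction of predicted boxes whose IoU with the ground truth reaches the threshold (IoU@50 when thresh=0.5)."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(preds)
```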
Key Takeaways
- The work addresses the task of multilingual Referring Expression Comprehension (REC) and highlights its importance given growing global deployment needs.
- A unified multilingual dataset covering 10 languages is constructed by systematically expanding 12 existing English REC benchmarks.
- The dataset contains approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects.
- An attention-based neural architecture with multilingual SigLIP2 encoders is proposed for multilingual REC; coarse spatial anchors are generated from attention distributions and then refined with learned residuals (see the sketch after this list).
- Experimental evaluation shows competitive accuracy on standard benchmarks, with multilingual results approaching, though not surpassing, the English-only counterpart.
- Multilingual evaluation shows consistent performance across languages.
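A minimal PyTorch sketch of the attention-anchor-plus-residual idea summarized above, assuming a text-to-patch attention map over a square patch grid and a small MLP for the residual; all module names, shapes, and the way SigLIP2 features enter are hypothetical, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class AttentionAnchoredBoxHead(nn.Module):
    """Turn a text-to-patch attention map into a coarse box anchor, then refine it with a learned residual."""

    def __init__(self, hidden_dim: int = 512, grid_size: int = 16):
        super().__init__()
        self.grid_size = grid_size
        # Small MLP that predicts a residual (d_cx, d_cy, d_w, d_h) from a pooled joint feature.
        self.residual_mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, 4)
        )

    def forward(self, attn: torch.Tensor, pooled: torch.Tensor) -> torch.Tensor:
        # attn: (B, grid_size**2) softmaxed attention of the text query over image patches.
        # pooled: (B, hidden_dim) joint text-image feature used to predict the refinement.
        g = self.grid_size
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, g), torch.linspace(0, 1, g), indexing="ij"
        )
        centers = torch.stack([xs.reshape(-1), ys.reshape(-1)], dim=-1).to(attn)  # (g*g, 2)
        # Coarse anchor: attention-weighted mean patch center, with the attention spread as width/height.
        cxcy = attn @ centers                                        # (B, 2)
        var = (attn @ centers.pow(2) - cxcy.pow(2)).clamp(min=1e-6)  # (B, 2)
        anchor = torch.cat([cxcy, 4.0 * var.sqrt()], dim=-1)         # (B, 4) as (cx, cy, w, h) in [0, 1]
        # Learned residual refines the coarse anchor.
        return anchor + self.residual_mlp(pooled)


# Toy usage with random tensors standing in for real encoder outputs.
head = AttentionAnchoredBoxHead()
attn = torch.softmax(torch.randn(2, 16 * 16), dim=-1)
print(head(attn, torch.randn(2, 512)).shape)  # torch.Size([2, 4])
```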
MeCaMIL: Causality-Aware Multiple Instance Learning for Fair and Interpretable Whole Slide Image Diagnosis
Authors:Yiran Song, Yikai Zhang, Shuang Zhou, Guojun Xiong, Xiaofeng Yang, Nian Wang, Fenglong Ma, Rui Zhang, Mingquan Lin
Multiple instance learning (MIL) has emerged as the dominant paradigm for whole slide image (WSI) analysis in computational pathology, achieving strong diagnostic performance through patch-level feature aggregation. However, existing MIL methods face critical limitations: (1) they rely on attention mechanisms that lack causal interpretability, and (2) they fail to integrate patient demographics (age, gender, race), leading to fairness concerns across diverse populations. These shortcomings hinder clinical translation, where algorithmic bias can exacerbate health disparities. We introduce \textbf{MeCaMIL}, a causality-aware MIL framework that explicitly models demographic confounders through structured causal graphs. Unlike prior approaches treating demographics as auxiliary features, MeCaMIL employs principled causal inference – leveraging do-calculus and collider structures – to disentangle disease-relevant signals from spurious demographic correlations. Extensive evaluation on three benchmarks demonstrates state-of-the-art performance across CAMELYON16 (ACC/AUC/F1: 0.939/0.983/0.946), TCGA-Lung (0.935/0.979/0.931), and TCGA-Multi (0.977/0.993/0.970, five cancer types). Critically, MeCaMIL achieves superior fairness – demographic disparity variance drops by over 65% (relative reduction) on average across attributes, with notable improvements for underserved populations. The framework generalizes to survival prediction (mean C-index: 0.653, +0.017 over best baseline across five cancer types). Ablation studies confirm causal graph structure is essential – alternative designs yield 0.048 lower accuracy and 4.2x worse fairness. These results establish MeCaMIL as a principled framework for fair, interpretable, and clinically actionable AI in digital pathology. Code will be released upon acceptance.
Paper and project links
PDF: 15 pages, 5 figures, 8 tables
Summary
This paper introduces MeCaMIL, a causality-aware multiple instance learning (MIL) framework for whole slide image (WSI) analysis in computational pathology. By explicitly modeling patient demographics (age, gender, race) through a structured causal graph, it addresses the lack of causal interpretability and the algorithmic bias issues of existing MIL methods. MeCaMIL delivers strong performance together with improved fairness, promising a fair, interpretable, and clinically actionable AI solution for digital pathology.
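For background, here is a minimal sketch of the gated attention-based MIL pooling that frameworks like MeCaMIL build on and critique; it is not MeCaMIL's causal-graph component, and all dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AttentionMILPooling(nn.Module):
    """Gated attention pooling over patch embeddings: the standard MIL aggregation step for WSI classification."""

    def __init__(self, feat_dim: int = 1024, attn_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.attn_v = nn.Linear(feat_dim, attn_dim)
        self.attn_u = nn.Linear(feat_dim, attn_dim)
        self.attn_w = nn.Linear(attn_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (num_patches, feat_dim) embeddings extracted from one whole slide image.
        scores = self.attn_w(torch.tanh(self.attn_v(patches)) * torch.sigmoid(self.attn_u(patches)))
        weights = torch.softmax(scores, dim=0)            # (num_patches, 1) attention over patches
        slide_embedding = (weights * patches).sum(dim=0)  # (feat_dim,) slide-level feature
        return self.classifier(slide_embedding)           # slide-level logits


# Toy usage: 500 random patch embeddings stand in for real WSI features.
print(AttentionMILPooling()(torch.randn(500, 1024)).shape)  # torch.Size([2])
```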
Key Takeaways
- MeCaMIL addresses two key limitations of existing MIL methods in computational pathology: the lack of causal interpretability and the failure to integrate patient demographics.
- MeCaMIL models demographic confounders with a structured causal graph to improve fairness and interpretability.
- Compared with existing methods, MeCaMIL achieves strong performance on the CAMELYON16, TCGA-Lung, and TCGA-Multi benchmarks.
- MeCaMIL reduces demographic disparity variance across groups by over 65% on average (relative reduction), with notable improvements for underserved populations; a sketch of one way such a disparity metric could be computed follows this list.
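One plausible reading of the "demographic disparity variance" reported above is the variance of per-group accuracy across demographic groups; the sketch below is an assumption about the metric's general form, not the paper's exact definition.

```python
from collections import defaultdict
from statistics import pvariance
from typing import Sequence


def disparity_variance(correct: Sequence[bool], groups: Sequence[str]) -> float:
    """Variance of per-group accuracy across demographic groups.

    Lower values mean predictions are similarly accurate for every group
    (an assumed, simplified stand-in for the paper's fairness metric).
    """
    per_group = defaultdict(list)
    for ok, group in zip(correct, groups):
        per_group[group].append(ok)
    group_acc = [sum(v) / len(v) for v in per_group.values()]
    return pvariance(group_acc)


# Example: accuracy gaps across three hypothetical demographic groups.
print(disparity_variance(
    correct=[True, True, False, True, False, True],
    groups=["A", "A", "B", "B", "C", "C"],
))
```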
NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning
Authors:Sahil Shah, S P Sharan, Harsh Goel, Minkyu Choi, Mustafa Munir, Manvik Pasula, Radu Marculescu, Sandeep Chinchali
While vision-language models (VLMs) excel at tasks involving single images or short videos, they still struggle with Long Video Question Answering (LVQA) due to its demand for complex multi-step temporal reasoning. Vanilla approaches, which simply sample frames uniformly and feed them to a VLM along with the question, incur significant token overhead. This forces aggressive downsampling of long videos, causing models to miss fine-grained visual structure, subtle event transitions, and key temporal cues. Recent works attempt to overcome these limitations through heuristic approaches; however, they lack explicit mechanisms for encoding temporal relationships and fail to provide any formal guarantees that the sampled context actually encodes the compositional or causal logic required by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA first translates a natural language question into a logic specification that models the temporal relationship between frame-level events. Next, we construct a video automaton to model the video’s frame-by-frame event progression, and finally employ model checking to compare the automaton against the specification to identify all video segments that satisfy the question’s logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on the LongVideoBench and CinePile LVQA benchmarks show that NeuS-QA significantly improves performance by over 10%, particularly on questions involving event ordering, causality, and multi-step reasoning. We open-source our code at https://utaustin-swarmlab.github.io/NeuS-QA/.
Paper and project links
Summary
Vision-language models perform well on tasks involving single images or short videos but still struggle with Long Video Question Answering (LVQA), which demands complex multi-step temporal reasoning. Existing methods simply sample frames uniformly and feed them to the VLM, so models miss fine-grained visual structure, subtle event transitions, and key temporal cues. NeuS-QA addresses this as a training-free neuro-symbolic pipeline for LVQA: it translates the natural language question into a logic specification of the temporal relations between frame-level events, builds a video automaton modeling the frame-by-frame event progression, and uses model checking to find segments that satisfy the specification. Only logic-verified segments are passed to the VLM, improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. On the LongVideoBench and CinePile LVQA benchmarks, NeuS-QA improves performance by over 10%, particularly on questions involving event ordering, causality, and multi-step reasoning.
Key Takeaways
- Long video question answering requires complex multi-step temporal reasoning, which remains challenging for current vision-language models.
- Existing methods handle long videos by uniformly sampling frames, which discards important visual information and temporal cues.
- NeuS-QA tackles LVQA with a neuro-symbolic pipeline: it converts the natural language question into a logic specification, builds a video automaton, and applies model checking (see the toy sketch after this list).
- Only logic-verified video segments are submitted to the vision-language model, improving efficiency and accuracy.
- NeuS-QA improves interpretability and reduces hallucinated answers.
- On the benchmarks, NeuS-QA achieves significant gains, particularly on questions involving event ordering, causality, and multi-step reasoning.
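To make the neuro-symbolic idea concrete, here is a toy sketch in which per-frame event detections stand in for the video automaton's states and an "event A occurs before event B" pattern stands in for the temporal-logic specification; the function, window size, and event names are illustrative assumptions, not the NeuS-QA implementation.

```python
from typing import Dict, List, Sequence, Tuple


def segments_where_a_before_b(
    frame_events: Sequence[Dict[str, bool]],
    event_a: str,
    event_b: str,
    window: int = 32,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame windows in which event_a occurs and event_b occurs later in the same window.

    A miniature stand-in for the pipeline: per-frame event detections play the role of automaton
    states, and the "a then b" pattern plays the role of the logic specification checked against them.
    Only the returned, logic-verified windows would be handed to the VLM.
    """
    matches = []
    for start in range(0, len(frame_events) - window + 1, window):
        segment = frame_events[start:start + window]
        a_frames = [i for i, ev in enumerate(segment) if ev.get(event_a)]
        b_frames = [i for i, ev in enumerate(segment) if ev.get(event_b)]
        if a_frames and b_frames and min(a_frames) < max(b_frames):
            matches.append((start, start + window))
    return matches


# Toy usage with synthetic per-frame detections for two hypothetical events.
frames = [{"door_opens": i == 3, "person_exits": i == 10} for i in range(64)]
print(segments_where_a_before_b(frames, "door_opens", "person_exits", window=32))  # [(0, 32)]
```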