Published: 2025-10-19

Updated: 2025-11-27

Word count: 6.3k

Reading time: 25 min

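⚠️ All summaries below are generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are suitable only for screening papers before reading them.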

Updated 2025-10-19

Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval

Authors: Siting Li, Xiang Gao, Simon Shaolei Du

While an image is worth more than a thousand words, only a few provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize specific visual attributes relevant to queries. To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent Multimodal Large Language Model (MLLM)-based, stronger retrievers with a larger output dimension struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for performing such queries. As a solution, we propose to use promptable image embeddings enabled by these multimodal retrievers, which boost performance by highlighting required attributes. Our pipeline for deriving such embeddings generalizes across query types, image pools, and base retriever architectures. To enhance real-world applicability, we offer two acceleration strategies: Pre-processing promptable embeddings and using linear approximations. We show that the former yields a 15% improvement in Recall@5 when prompts are predefined, while the latter achieves an 8% improvement when prompts are only available during inference.

Paper and project links

PDF NeurIPS 2025; 27 pages, 6 figures

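Summary: This paper examines how text-to-image (T2I) retrieval systems handle queries that focus on specific visual attributes. The authors build COCO-Facet, a COCO-based benchmark, to evaluate current retrievers on attribute-focused queries. They find that CLIP-like retrievers perform poorly and unevenly, possibly because their image embeddings concentrate on global semantics and subjects while leaving out other details. Stronger retrievers based on multimodal large language models (MLLMs) share the same limitation. As a solution, the paper proposes promptable image embeddings that highlight the required attributes, and the pipeline generalizes across query types, image pools, and base retriever architectures. For practical use it offers two acceleration strategies: pre-processing promptable embeddings and using linear approximations. The former yields a 15% improvement in Recall@5 when prompts are predefined, and the latter an 8% improvement when prompts are only available during inference.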

Key Takeaways:

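Ideal text-to-image retrievers should prioritize the specific visual attributes relevant to a query.
COCO-Facet is a COCO-based benchmark with 9,112 queries for evaluating retrievers on attribute-focused queries.
CLIP-like retrievers show poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects and leave out other details.
Stronger MLLM-based retrievers share this limitation.
Promptable image embeddings improve retrieval by highlighting the required attributes.
The pipeline generalizes across query types, image pools, and base retriever architectures.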
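
The Recall@5 figures quoted above can be made concrete with a small sketch. It assumes that retrieval ranks images by cosine similarity to the query embedding and that a query counts as a hit when at least one relevant image appears in the top k; the paper's exact protocol may differ. The function names and the toy vectors are made up for illustration. Under this view, switching from general to promptable image embeddings only changes the contents of `image_embs`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(query_embs, image_embs, relevant, k=5):
    """Fraction of queries with at least one relevant image in the top k.

    relevant[i] is the set of image indices that answer query i
    (an assumed definition of a hit, not taken from the paper).
    """
    hits = 0
    for qi, q in enumerate(query_embs):
        ranked = sorted(range(len(image_embs)),
                        key=lambda j: cosine(q, image_embs[j]),
                        reverse=True)
        if relevant[qi] & set(ranked[:k]):
            hits += 1
    return hits / len(query_embs)

# Toy data (made up): 2 queries, 3 images in a 2-D embedding space.
queries = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
relevant = [{0}, {2}]

print(recall_at_k(queries, images, relevant, k=1))  # 0.5
print(recall_at_k(queries, images, relevant, k=2))  # 1.0
```

In the toy data the second query's relevant image is only the second-best match, so it is missed at k=1 and found at k=2.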

MAFA: A multi-agent framework for annotation

Authors: Mahmood Hegazy, Aaron Rodrigues, Azzam Naeem

Modern consumer banking applications require accurate and efficient retrieval of information in response to user queries. Mapping user utterances to the most relevant Frequently Asked Questions (FAQs) is a crucial component of these systems. Traditional approaches often rely on a single model or technique, which may not capture the nuances of diverse user inquiries. In this paper, we introduce a multi-agent framework for FAQ annotation that combines multiple specialized agents with different approaches and a judge agent that reranks candidates to produce optimal results. Our agents utilize a structured reasoning approach inspired by Attentive Reasoning Queries (ARQs), which guides them through systematic reasoning steps using targeted, task-specific JSON queries. Our framework features a few-shot example strategy, where each agent receives different few-shots, enhancing ensemble diversity and coverage of the query space. We evaluate our framework on a real-world major bank dataset as well as public benchmark datasets (LCQMC and FiQA), demonstrating significant improvements over single-agent approaches across multiple metrics, including a 14% increase in Top-1 accuracy, an 18% increase in Top-5 accuracy, and a 12% improvement in Mean Reciprocal Rank on our dataset, and similar gains on public benchmarks when compared with traditional and single-agent annotation techniques. Our framework is particularly effective at handling ambiguous queries, making it well-suited for deployment in production banking applications while showing strong generalization capabilities across different domains and languages.

Paper and project links

PDF

Summary

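Accurate responses to user queries are essential for information retrieval in modern consumer banking. This paper proposes a multi-agent framework for FAQ annotation that combines several specialized agents with a judge agent that reranks their candidates. The agents follow a structured reasoning approach inspired by Attentive Reasoning Queries (ARQs), which guides them through reasoning steps using targeted, task-specific JSON queries. Each agent receives different few-shot examples, which increases ensemble diversity and coverage of the query space. Evaluated on a real-world bank dataset and on public benchmarks (LCQMC and FiQA), the framework improves on single-agent approaches across multiple metrics, including a 14% increase in Top-1 accuracy, an 18% increase in Top-5 accuracy, and a 12% improvement in Mean Reciprocal Rank on the bank dataset. It is particularly effective on ambiguous queries, which makes it suitable for production banking applications, and it generalizes across domains and languages.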

Key Takeaways

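The key points are:

The paper proposes a multi-agent framework for FAQ annotation that combines several specialized agents with a judge agent, which reranks the candidates and selects the best ones.
The framework uses structured reasoning: task-specific JSON queries guide each agent through systematic reasoning steps for a user query.
Each agent receives different few-shot examples, which increases ensemble diversity and coverage of the query space and helps the framework adapt to different domains and datasets.
On a real-world bank dataset and on public benchmarks, the framework outperforms single-agent approaches on multiple metrics, including a 14% gain in Top-1 accuracy and an 18% gain in Top-5 accuracy on the bank dataset.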
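
The metrics named above, Top-k accuracy and Mean Reciprocal Rank (MRR), have standard definitions that a short sketch can pin down. The FAQ identifiers and rankings below are invented for illustration; they are not data from the paper, and the ranked lists simply stand in for whatever the judge agent outputs.

```python
def top_k_accuracy(rankings, gold, k):
    """Share of queries whose gold FAQ appears among the first k candidates."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)

def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the gold FAQ; a missing gold FAQ contributes 0."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)

# Hypothetical reranked candidate lists for three user utterances.
rankings = [
    ["faq_3", "faq_1", "faq_7"],   # gold at rank 1
    ["faq_2", "faq_5", "faq_9"],   # gold at rank 2
    ["faq_4", "faq_8", "faq_6"],   # gold not retrieved
]
gold = ["faq_3", "faq_5", "faq_0"]

print(top_k_accuracy(rankings, gold, k=1))
print(top_k_accuracy(rankings, gold, k=3))
print(mean_reciprocal_rank(rankings, gold))  # 0.5
```

MRR rewards placing the correct FAQ near the top even when it is not first, which is why it is usually reported alongside Top-1 and Top-5 accuracy.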

Mind the (Data) Gap: Evaluating Vision Systems in Small Data Applications

Authors: Samuel Stevens, S M Rayeed, Jenna Kline

The practical application of AI tools for specific computer vision tasks relies on the “small-data regime” of hundreds to thousands of labeled samples. This small-data regime is vital for applications requiring expensive expert annotations, such as ecological monitoring, medical diagnostics or industrial quality control. We find, however, that computer vision research has ignored the small data regime as evaluations increasingly focus on zero- and few-shot learning. We use the Natural World Tasks (NeWT) benchmark to compare multi-modal large language models (MLLMs) and vision-only methods across varying training set sizes. MLLMs exhibit early performance plateaus, while vision-only methods improve throughout the small-data regime, with performance gaps widening beyond 10 training examples. We provide the first comprehensive comparison between these approaches in small-data contexts and advocate for explicit small-data evaluations in AI research to better bridge theoretical advances with practical deployments.

Paper and project links

PDF 5 pages (main text), 3 figures. Accepted at the Imageomics Workshop at NeurIPS 2025

Summary
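Practical applications of AI tools for specific computer vision tasks rely on the “small-data regime” of hundreds to thousands of labeled samples. This regime is vital for applications that require expensive expert annotations, such as ecological monitoring, medical diagnostics, or industrial quality control. Computer vision research has nonetheless neglected it, as evaluations increasingly focus on zero- and few-shot learning. The study uses the Natural World Tasks (NeWT) benchmark to compare multimodal large language models (MLLMs) and vision-only methods across training set sizes. MLLMs plateau early, whereas vision-only methods keep improving throughout the small-data regime, and the performance gap widens beyond 10 training examples. The paper presents the first comprehensive comparison of these approaches in small-data settings and advocates explicit small-data evaluations to better connect theoretical advances with practical deployments.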

Key Takeaways

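Practical use of AI tools for computer vision tasks relies on the “small-data regime” of hundreds to thousands of labeled samples.
The small-data regime is vital for applications that require expensive expert annotations, such as ecological monitoring, medical diagnostics, or industrial quality control.
Computer vision research has neglected small-data evaluation and focuses increasingly on zero- and few-shot learning.
MLLMs and vision-only methods behave differently as the training set grows.
Vision-only methods keep improving in the small-data regime, and their lead over MLLMs widens beyond 10 training examples.
The comparison is carried out on the Natural World Tasks (NeWT) benchmark.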
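
Comparing methods "across varying training set sizes" amounts to tracing a learning curve. The sketch below shows one plain way to do that; it is not the paper's protocol. It assumes a nearest-centroid classifier over precomputed feature vectors as a stand-in for a vision-only method, and the synthetic two-class data, function names, and size grid are all made up.

```python
import random

def fit_centroids(xs, ys):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def learning_curve(train, test, sizes, seed=0):
    """Test accuracy after fitting on the first n shuffled training examples."""
    pool = train[:]
    random.Random(seed).shuffle(pool)
    curve = {}
    for n in sizes:
        xs, ys = zip(*pool[:n])
        centroids = fit_centroids(xs, ys)
        correct = sum(predict(centroids, x) == y for x, y in test)
        curve[n] = correct / len(test)
    return curve

# Synthetic two-class data (hypothetical): Gaussian blobs around 0 and 2.
rng = random.Random(1)

def sample(label, center, n):
    return [([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label)
            for _ in range(n)]

train = sample("a", 0.0, 50) + sample("b", 2.0, 50)
test = sample("a", 0.0, 50) + sample("b", 2.0, 50)

curve = learning_curve(train, test, sizes=[1, 3, 10, 30, 100])
for n, acc in curve.items():
    print(n, acc)
```

Plotting such a curve for each method on the same size grid is what makes an early plateau or a widening gap visible.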

Motion Tracks: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning

Authors: Juntao Ren, Priya Sundaresan, Dorsa Sadigh, Sanjiban Choudhury, Jeannette Bohg

Teaching robots to autonomously complete everyday tasks remains a challenge. Imitation Learning (IL) is a powerful approach that imbues robots with skills via demonstrations, but is limited by the labor-intensive process of collecting teleoperated robot data. Human videos offer a scalable alternative, but it remains difficult to directly train IL policies from them due to the lack of robot action labels. To address this, we propose to represent actions as short-horizon 2D trajectories on an image. These actions, or motion tracks, capture the predicted direction of motion for either human hands or robot end-effectors. We instantiate an IL policy called Motion Track Policy (MT-pi) which receives image observations and outputs motion tracks as actions. By leveraging this unified, cross-embodiment action space, MT-pi completes tasks with high success given just minutes of human video and limited additional robot demonstrations. At test time, we predict motion tracks from two camera views, recovering 6DoF trajectories via multi-view synthesis. MT-pi achieves an average success rate of 86.5% across 4 real-world tasks, outperforming state-of-the-art IL baselines which do not leverage human data or our action space by 40%, and generalizes to scenarios seen only in human videos. Code and videos are available on our website https://portal-cornell.github.io/motion_track_policy/.

Paper and project links

PDF

Summary

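This paper addresses the challenge of teaching robots to complete everyday tasks autonomously. Imitation Learning (IL) is limited by the labor-intensive collection of teleoperated robot data, so the authors propose representing actions as motion tracks, which are short-horizon 2D trajectories on an image, and instantiate an IL policy called Motion Track Policy (MT-pi) that takes image observations and outputs motion tracks. Because this action space is shared by human hands and robot end-effectors, MT-pi can learn from human video and completes tasks given just minutes of human video and a limited number of additional robot demonstrations. MT-pi reaches an average success rate of 86.5% across 4 real-world tasks, outperforms state-of-the-art IL baselines that do not leverage human data or this action space by 40%, and generalizes to scenarios seen only in human videos.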

Key Takeaways

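Teaching robots to complete everyday tasks autonomously remains a challenge.
Imitation Learning (IL) is a powerful way to give robots skills through demonstrations, but collecting teleoperated robot data is labor-intensive.
Human videos offer a scalable alternative, but training IL policies directly from them is difficult because they lack robot action labels.
The paper proposes Motion Track Policy (MT-pi), which represents actions as short-horizon 2D trajectories on an image.
MT-pi completes tasks given just minutes of human video and limited additional robot demonstrations.
MT-pi achieves an average success rate of 86.5% across 4 real-world tasks.
MT-pi outperforms IL baselines that do not leverage human data or this action space by 40% and generalizes to scenarios seen only in human videos.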
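
The abstract says motion tracks predicted in two camera views are combined to recover 6DoF trajectories, without giving the details here. As background only, the sketch below shows midpoint triangulation, a textbook way to lift one matched 2D point from two calibrated cameras to a 3D position. It is not presented as the paper's method: it assumes each pixel has already been back-projected to a ray (camera origin plus direction), it recovers position only and not orientation, and every name and number is hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def triangulate_midpoint(origin1, dir1, origin2, dir2):
    """Midpoint of the shortest segment between two viewing rays.

    Each ray is origin + s * direction. With noisy 2D predictions the
    rays rarely intersect, so the midpoint serves as the 3D estimate.
    """
    w = [p - q for p, q in zip(origin1, origin2)]
    a, b, c = dot(dir1, dir1), dot(dir1, dir2), dot(dir2, dir2)
    d, e = dot(dir1, w), dot(dir2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth cannot be recovered")
    s = (b * e - c * d) / denom   # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom   # parameter of the closest point on ray 2
    point1 = [p + s * u for p, u in zip(origin1, dir1)]
    point2 = [q + t * v for q, v in zip(origin2, dir2)]
    return [(x + y) / 2 for x, y in zip(point1, point2)]

# Two hypothetical cameras one unit apart, both looking at (0.5, 0, 2).
cam1, cam2 = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]
ray1, ray2 = [0.5, 0.0, 2.0], [-0.5, 0.0, 2.0]
print(triangulate_midpoint(cam1, ray1, cam2, ray2))
```

Applying this to every point along a pair of 2D tracks gives a 3D path; recovering the rotational part of a 6DoF pose needs additional information, such as several tracked points on the gripper.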

AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark

Authors: Abhay Gupta, Philip Meng, Ece Yurtseven, Sean O’Brien, Kevin Zhu

Detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to developing inclusive natural language processing (NLP) systems. To address dialect-induced performance discrepancies, we introduce AAVENUE (AAVE Natural Language Understanding Evaluation), a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds upon and extends existing benchmarks like VALUE, replacing deterministic syntactic and morphological transformations with a more flexible methodology leveraging LLM-based translation with few-shot prompting, improving performance across our evaluation metrics when translating key tasks from the GLUE and SuperGLUE benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including fluency, BARTScore, quality, coherence, and understandability. Additionally, we recruit fluent AAVE speakers to validate our translations for authenticity. Our evaluations reveal that LLMs consistently perform better on SAE tasks than AAVE-translated versions, underscoring inherent biases and highlighting the need for more inclusive NLP models. We have open-sourced our source code on GitHub and created a website to showcase our work at https://aavenuee.github.io.

Paper and project links

PDF Published at NLP4PI @ EMNLP 2024

Summary

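This paper argues that detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to building inclusive natural language processing (NLP) systems. It introduces AAVENUE, a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds on and extends existing benchmarks such as VALUE, replacing deterministic syntactic and morphological transformations with a more flexible approach based on LLM translation with few-shot prompting. This improves the evaluation metrics when key tasks from the GLUE and SuperGLUE benchmarks are translated. AAVENUE and VALUE translations are compared using five popular LLMs and a set of metrics that includes fluency, BARTScore, quality, coherence, and understandability. Fluent AAVE speakers were also recruited to validate the translations for authenticity. The evaluations show that LLMs consistently perform better on SAE tasks than on the AAVE-translated versions, which underscores inherent biases and the need for more inclusive NLP models. The source code is open-sourced on GitHub, and a website showcases the work.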

Key Takeaways

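Detecting bias in language understanding for AAVE is crucial to building inclusive NLP systems.
AAVENUE is a benchmark for evaluating LLM performance on NLU tasks in AAVE and SAE.
AAVENUE extends VALUE by replacing deterministic transformations with a more flexible approach based on LLM translation with few-shot prompting.
Compared with VALUE translations, AAVENUE translations score better across the evaluation metrics.
Fluent AAVE speakers validated the AAVENUE translations for authenticity.
LLMs perform better on SAE tasks than on their AAVE-translated versions, which underscores inherent biases.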
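
The headline finding, that models do better on SAE inputs than on their AAVE translations, can be expressed as a per-task accuracy gap. The sketch below assumes a classification task where the same items are scored in both dialects against shared labels; the predictions and labels are invented and do not come from the benchmark.

```python
def accuracy(preds, labels):
    """Share of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def dialect_gap(sae_preds, aave_preds, labels):
    """SAE accuracy minus AAVE accuracy on the same items.

    A positive value means the model does worse on the AAVE versions.
    """
    return accuracy(sae_preds, labels) - accuracy(aave_preds, labels)

# Hypothetical predictions for four items of one binary NLU task.
labels = [1, 0, 1, 1]
sae_preds = [1, 0, 1, 0]    # 3 of 4 correct
aave_preds = [1, 1, 0, 0]   # 1 of 4 correct

print(dialect_gap(sae_preds, aave_preds, labels))  # 0.5
```

Because both versions share the same labels, any gap reflects how the model handles the dialect rather than a difference in task difficulty, provided the translations preserve meaning.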

Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning

Authors: Congying Liu, Gaosheng Wang, Peipei Liu, Xingyuan Wei, Hongsong Zhu

Few-shot named entity recognition can identify new types of named entities based on a few labeled examples. Previous methods employing token-level or span-level metric learning suffer from the computational burden and a large number of negative sample spans. In this paper, we propose the Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning (MsFNER), which splits the general NER into two stages: entity-span detection and entity classification. There are 3 processes for introducing MsFNER: training, finetuning, and inference. In the training process, we train and get the best entity-span detection model and the entity classification model separately on the source domain using meta-learning, where we create a contrastive learning module to enhance entity representations for entity classification. During finetuning, we finetune the both models on the support dataset of target domain. In the inference process, for the unlabeled data, we first detect the entity-spans, then the entity-spans are jointly determined by the entity classification model and the KNN. We conduct experiments on the open FewNERD dataset and the results demonstrate the advance of MsFNER.

Paper and project links

PDF

Summary

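This paper proposes Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning (MsFNER). The method splits general NER into two stages: entity-span detection and entity classification. MsFNER involves three processes: training, finetuning, and inference. During training, meta-learning is used on the source domain to train the best entity-span detection model and entity classification model separately, and a contrastive learning module is added to enhance the entity representations used for classification. During finetuning, both models are finetuned on the support dataset of the target domain. During inference on unlabeled data, entity spans are detected first, and the detected spans are then jointly resolved by the entity classification model and KNN. Experiments on the open FewNERD dataset show that MsFNER is effective.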
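
The inference step, in which a classifier and KNN jointly decide on each detected span, is not spelled out in the abstract. One plausible reading is sketched below: interpolate the classifier's probabilities with the vote shares of the nearest support-set spans. The interpolation weight `lam`, the distance measure, and all vectors and labels are assumptions for illustration, not the authors' formulation.

```python
from collections import Counter

def knn_distribution(query, support, k, labels):
    """Vote shares of the k nearest support spans (squared Euclidean distance)."""
    nearest = sorted(
        support,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(query, item[0])),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return {label: votes.get(label, 0) / len(nearest) for label in labels}

def combine(classifier_probs, knn_probs, lam=0.5):
    """Interpolate the two distributions and return (best label, scores)."""
    scores = {
        label: lam * classifier_probs[label] + (1 - lam) * knn_probs[label]
        for label in classifier_probs
    }
    return max(scores, key=scores.get), scores

# Hypothetical support-set span representations and one detected span.
labels = ["person", "location"]
support = [
    ([0.0, 0.0], "person"),
    ([0.1, 0.0], "person"),
    ([1.0, 1.0], "location"),
]
span_repr = [0.2, 0.1]
classifier_probs = {"person": 0.4, "location": 0.6}

knn_probs = knn_distribution(span_repr, support, k=2, labels=labels)
best, scores = combine(classifier_probs, knn_probs, lam=0.5)
print(best, scores)
```

In this toy case the classifier alone would pick "location", but the two nearest support spans are both "person", so the combined score flips the decision. That is the kind of correction a KNN term can contribute when only a few labeled examples of a new type exist.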

Key Takeaways