⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ HuggingFace免费体验
2025-10-10 更新
Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models
Authors:Karim El Khoury, Maxime Zanella, Christophe De Vleeschouwer, Benoit Macq
Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability of performance and the absence of a clear winner among existing methods highlight the need for the development of more robust methods for few-shot adaptation tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on Github: https://github.com/elkhouryk/fewshot_RSVLMs
遥感视觉语言模型(RSVLMs)得益于大规模预训练而显示出显著潜力,在各种任务上实现了强大的零样本性能。然而,它们在低数据环境(如小样本学习)下的泛化能力仍未得到充分探索。在这项工作中,我们首次为评估RSVLMs的小样本适应方法建立了结构化基准测试。我们在十个遥感场景分类数据集上进行全面实验,将五种广泛使用的小样本适应策略应用于三种具有不同骨干网络(backbone)的最先进RSVLMs。我们的研究发现,具有相似零样本性能的模型在小样本适应下可能会表现出截然不同的行为,有些RSVLMs天生更适合这种适应。性能的可变性以及现有方法中没有明显胜出者,凸显了开发更稳健、更贴合遥感场景的小样本适应方法的必要性。为了促进未来的研究,我们提供了一个可复现的基准测试框架和开源代码,以在小样本条件下系统地评估RSVLMs。源代码可在GitHub上公开获取:https://github.com/elkhouryk/fewshot_RSVLMs
论文及项目相关链接
Summary
遥感视觉语言模型(RSVLMs)凭借大规模预训练已展现出强大的零样本性能。然而,它们在低数据环境下的泛化能力,尤其是小样本学习能力,仍待充分探索。本研究首次构建了评估RSVLMs小样本适应方法的结构化基准,在十个遥感场景分类数据集上进行全面实验,将五种常用的小样本适应策略应用于三种先进的RSVLMs。研究发现,具有相似零样本性能的模型在小样本适应下表现差异显著,且现有方法中没有明显胜出者,凸显出开发更适合遥感领域的小样本适应方法的必要性。
Key Takeaways
- RSVLMs在零样本学习方面已表现出显著潜力。
- 小样本学习在RSVLMs中的研究仍待深入。
- 不同RSVLMs在小样本适应下的表现差异显著。
- 现有小样本适应策略在RSVLMs中无明确优胜者。
- 需要开发更适用于遥感领域的小样本适应方法。
- 研究提供了一个可重现的基准测试框架,用于评估RSVLMs的小样本学习能力。
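下面给出线性探针(linear probe)式小样本适应的一个最小示意:在冻结的 RSVLM 图像编码器特征上只训练一个线性分类头。原文并未给出五种适应策略的具体实现,此处仅以常见的线性探针为例,特征提取部分为假设的外部步骤。

```python
# 最小示意:线性探针式小样本适应(只训练线性分类头,骨干网络保持冻结)。
# 假设 support_feats / query_feats 已由某个 RSVLM 的图像编码器离线提取。
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_few_shot(support_feats, support_labels, query_feats):
    """support_feats: (N, D) 支持集特征;query_feats: (M, D) 查询集特征。"""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(support_feats, support_labels)   # 只拟合一个线性分类头
    return clf.predict(query_feats)

# 用随机特征演示接口(实际应替换为真实特征):10 类、每类 5 个样本(5-shot)
rng = np.random.default_rng(0)
support = rng.normal(size=(50, 512))
labels = np.repeat(np.arange(10), 5)
query = rng.normal(size=(20, 512))
print(linear_probe_few_shot(support, labels, query).shape)   # (20,)
```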
点此查看论文截图






Crossing Domains without Labels: Distant Supervision for Term Extraction
Authors:Elena Senger, Yuri Campbell, Rob van der Goot, Barbara Plank
Automatic Term Extraction (ATE) is a critical component in downstream NLP tasks such as document tagging, ontology construction and patent analysis. Current state-of-the-art methods require expensive human annotation and struggle with domain transfer, limiting their practical deployment. This highlights the need for more robust, scalable solutions and realistic evaluation settings. To address this, we introduce a comprehensive benchmark spanning seven diverse domains, enabling performance evaluation at both the document- and corpus-levels. Furthermore, we propose a robust LLM-based model that outperforms both supervised cross-domain encoder models and few-shot learning baselines and performs competitively with its GPT-4o teacher on this benchmark. The first step of our approach is generating pseudo-labels with this black-box LLM on general and scientific domains to ensure generalizability. Building on this data, we fine-tune the first LLMs for ATE. To further enhance document-level consistency, oftentimes needed for downstream tasks, we introduce lightweight post-hoc heuristics. Our approach exceeds previous approaches on 5/7 domains with an average improvement of 10 percentage points. We release our dataset and fine-tuned models to support future research in this area.
自动术语抽取(ATE)是文档标注、本体构建和专利分析等下游NLP任务中的关键组成部分。当前最前沿的方法需要昂贵的人工标注,并且在领域迁移方面存在困难,限制了其实际部署。这凸显了对更稳健、可扩展的解决方案和更贴近现实的评估环境的需求。为了解决这一问题,我们引入了一个涵盖七个不同领域的综合基准测试,能够在文档和语料库两个层面进行性能评估。此外,我们提出了一个稳健的基于大型语言模型(LLM)的模型,该模型不仅优于有监督的跨域编码器模型和少样本学习基线,而且在该基准测试中与其GPT-4o教师模型表现相当。我们方法的第一步是利用这个黑盒LLM在通用领域和科学领域生成伪标签,以确保泛化能力。在此数据基础上,我们微调了首批用于ATE的大型语言模型。为了进一步增强下游任务通常需要的文档级一致性,我们引入了轻量级的事后启发式方法。我们的方法在七个领域中的五个上超越了以往方法,平均提高了10个百分点。我们公开了数据集和微调模型,以支持该领域的未来研究。
论文及项目相关链接
PDF Accepted at EMNLP Industry Track 2025
Summary
本文介绍了自动术语抽取(ATE)在文档标注、本体构建和专利分析等NLP下游任务中的重要性。针对当前先进方法依赖昂贵人工标注且领域迁移困难的问题,文章提出了一个覆盖七个不同领域的综合基准,用于在文档和语料库两个层面评估性能。文章还提出了一种稳健的基于大型语言模型(LLM)的模型,其在该基准上超越了有监督的跨域编码器模型和少样本学习基线,并与其GPT-4o教师模型表现相当。该方法首先利用黑盒LLM在通用和科学领域生成伪标签以确保泛化能力,并在此基础上微调了首批用于ATE的LLM。为了进一步提高下游任务所需的文档级一致性,文章引入了轻量级事后启发式方法。该方法在七个领域中的五个上超过了以往方法,平均提高10个百分点。文章还公开了数据集和微调模型,以支持未来在该领域的研究。
Key Takeaways
- ATE在NLP下游任务中起关键作用,如文档标注、本体构建和专利分析。
- 当前先进方法存在人力标注成本高和领域迁移困难的问题。
- 引入了一个全面的跨七个不同领域的基准测试,用于评估模型在文档和语料库级别的性能。
- 提出了一种基于LLM的稳健模型,该模型在基准测试上表现出优异性能。
- 通过生成伪标签确保模型的泛化能力,并微调了首个ATE领域的LLM。
- 引入了轻量级事后启发式方法来提高文档级别的一致性。
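下面是一个用黑盒 LLM 生成术语抽取伪标签的最小示意,对应原文"先生成伪标签、再微调"的第一步;其中 llm_complete 为假设的占位接口,提示词也仅作演示,并非原文所用提示。

```python
# 最小示意:用黑盒 LLM 为无标注文本生成术语伪标签(llm_complete 为假设的占位接口)。
import json

def llm_complete(prompt: str) -> str:
    """占位:调用任意黑盒 LLM 并返回其文本输出。"""
    raise NotImplementedError

def pseudo_label_terms(document: str) -> list:
    prompt = ("Extract all domain-specific terms from the text below. "
              "Return a JSON list of strings.\n\n" + document)
    raw = llm_complete(prompt)
    try:
        terms = json.loads(raw)        # 期望模型按 JSON 列表格式返回
    except json.JSONDecodeError:
        terms = []                     # 解析失败时留空,交给后续启发式处理
    return [t.strip() for t in terms if isinstance(t, str)]
```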
点此查看论文截图





AISysRev – LLM-based Tool for Title-abstract Screening
Authors:Aleksi Huotala, Miikka Kuutila, Olli-Pekka Turtio, Mika Mäntylä
Systematic reviews are a standard practice for summarizing the state of evidence in software engineering. Conducting systematic reviews is laborious, especially during the screening or study selection phase, where the number of papers can be overwhelming. During this phase, papers are assessed against inclusion and exclusion criteria based on their titles and abstracts. Recent research has demonstrated that large language models (LLMs) can perform title-abstract screening at a level comparable to that of a master’s student. While LLMs cannot be fully trusted, they can help, for example, in Rapid Reviews, which try to expedite the review process. Building on recent research, we developed AiSysRev, an LLM-based screening tool implemented as a web application running in a Docker container. The tool accepts a CSV file containing paper titles and abstracts. Users specify inclusion and exclusion criteria. One can use multiple LLMs for screening via OpenRouter. AiSysRev supports both zero-shot and few-shot screening, and also allows for manual screening through interfaces that display LLM results as guidance for human reviewers. We conducted a trial study with 137 papers using the tool. Our findings indicate that papers can be classified into four categories: Easy Includes, Easy Excludes, Boundary Includes, and Boundary Excludes. The Boundary cases, where LLMs are prone to errors, highlight the need for human intervention. While LLMs do not replace human judgment in systematic reviews, they can significantly reduce the burden of assessing large volumes of scientific literature. Video: https://www.youtube.com/watch?v=jVbEj4Y4tQI Tool: https://github.com/EvoTestOps/AISysRev
系统综述是软件工程中总结证据状况的标准实践。开展系统综述非常耗费人力,尤其是在筛选(研究选择)阶段,论文数量可能非常庞大。在这一阶段,需要根据论文的标题和摘要来评估其是否符合纳入和排除标准。最近的研究表明,大型语言模型(LLMs)进行标题-摘要筛选的水平可以与硕士生相当。虽然不能完全信任LLMs,但它们可以提供帮助,例如用于旨在加快审查过程的快速综述(Rapid Reviews)。基于近期研究,我们开发了AiSysRev,这是一个基于LLM的筛选工具,以运行在Docker容器中的Web应用程序形式实现。该工具接受包含论文标题和摘要的CSV文件,用户可指定纳入和排除标准,并可通过OpenRouter使用多个LLM进行筛选。AiSysRev支持零样本和少样本筛选,也允许通过将LLM结果显示为人工评审指导的界面进行手动筛选。我们使用该工具对137篇论文开展了试验性研究。结果表明,论文可分为四类:易于纳入、易于排除、边界纳入、边界排除。LLMs容易出错的边界案例凸显了人工干预的必要性。虽然LLMs不能取代系统综述中的人工判断,但它们可以显著减少评估大量科学文献的负担。视频:https://www.youtube.com/watch?v=jVbEj4Y4tQI 工具:https://github.com/EvoTestOps/AISysRev
论文及项目相关链接
PDF 4 pages
Summary
该文本介绍了系统审查在软件工程中的标准实践,并指出其中的筛选阶段工作量巨大。近期研究表明,大型语言模型(LLMs)可以执行标题摘要筛选,与硕士生的水平相当。在此基础上,开发了AiSysRev工具,支持零样本和少样本筛选,并允许通过显示LLM结果来指导人工审查者进行手动筛选。研究发现,论文可分为四类,其中边界案例需要人工干预。虽然LLMs不能完全取代系统审查中的人的判断力,但它们可以显著减少评估大量科学文献的负担。
Key Takeaways
- 系统审查是软件工程证据总结的标准实践,其中筛选阶段尤为耗时。
- 大型语言模型(LLMs)可以执行标题和摘要的筛选,水平与硕士生相当。
- AiSysRev工具基于LLM,支持零样本和少样本筛选,并可手动筛选。
- 在使用AiSysRev进行的试验研究中,论文被分为四类:Easy Includes, Easy Excludes, Boundary Includes, 和 Boundary Excludes。
- 边界案例表明LLMs存在错误倾向,需要人工干预。
- LLMs不能替代系统审查中的人的判断力,但可显著减少评估大量科学文献的工作量。
点此查看论文截图


Cross-Embodiment Dexterous Hand Articulation Generation via Morphology-Aware Learning
Authors:Heng Zhang, Kevin Yuchen Ma, Mike Zheng Shou, Weisi Lin, Yan Wu
Dexterous grasping with multi-fingered hands remains challenging due to high-dimensional articulations and the cost of optimization-based pipelines. Existing end-to-end methods require training on large-scale datasets for specific hands, limiting their ability to generalize across different embodiments. We propose an eigengrasp-based, end-to-end framework for cross-embodiment grasp generation. From a hand’s morphology description, we derive a morphology embedding and an eigengrasp set. Conditioned on these, together with the object point cloud and wrist pose, an amplitude predictor regresses articulation coefficients in a low-dimensional space, which are decoded into full joint articulations. Articulation learning is supervised with a Kinematic-Aware Articulation Loss (KAL) that emphasizes fingertip-relevant motions and injects morphology-specific structure. In simulation on unseen objects across three dexterous hands, our model attains a 91.9% average grasp success rate with less than 0.4 seconds inference per grasp. With few-shot adaptation to an unseen hand, it achieves 85.6% success on unseen objects in simulation, and real-world experiments on this few-shot generalized hand achieve an 87% success rate. The code and additional materials will be made available upon publication on our project website https://connor-zh.github.io/cross_embodiment_dexterous_grasping.
由于关节自由度维数高且基于优化的流程成本昂贵,多指灵巧手抓取仍然是一个挑战。现有的端到端方法需要在特定手型的大规模数据集上进行训练,这限制了其在不同形态(embodiment)之间的泛化能力。我们提出了一种基于eigengrasp的端到端框架,用于跨形态的抓取生成。根据手部形态描述,我们得到形态嵌入和eigengrasp集合。以这些信息连同物体点云和手腕姿态为条件,幅值预测器在低维空间中回归关节系数,再将其解码为完整的关节运动。关节运动的学习由运动学感知关节损失(KAL)监督,该损失强调与指尖相关的运动并注入特定于形态的结构。在三种灵巧手上对未见物体进行的仿真实验中,我们的模型达到91.9%的平均抓取成功率,每次抓取的推理时间不到0.4秒。经过对未见手型的少样本适应后,它在仿真中对未见物体达到85.6%的成功率,而在这只经少样本泛化的手上进行的真实世界实验达到87%的成功率。相关代码和补充材料将在论文发表后发布于项目网站:https://connor-zh.github.io/cross_embodiment_dexterous_grasping
论文及项目相关链接
Summary
本文提出一种基于特征抓握(eigengrasp)的端到端框架,用于跨形态的抓取生成。该框架根据手的形态描述生成形态嵌入和特征抓握集,再结合物体点云和手腕姿态,通过幅值预测器在低维空间中回归关节系数,最后解码为全关节运动。关节运动的学习由运动学感知关节损失(KAL)监督,该损失强调指尖相关运动并注入形态特定结构。模型在仿真中对未见物体、跨三种灵巧手的平均抓取成功率达到91.9%,单次抓取推理时间不到0.4秒。对未见的手进行少样本适应后,在仿真中对未见物体的成功率达到85.6%,在真实世界实验中的成功率达到87%。
Key Takeaways
- 提出一种基于特征抓握的端到端框架,用于跨不同形态的抓握生成。
- 通过手的形态描述生成形态嵌入和特征抓握集。
- 结合物体点云和手腕姿态,通过振幅预测器回归关节活动系数。
- 采用低维空间学习关节活动,并解码为全关节活动。
- 关节运动的学习由运动学感知关节损失(KAL)监督,强调指尖相关运动和形态特定结构。
- 在仿真环境中,模型对未见物体和未见手的抓握表现出高成功率和快速推理时间。
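下面用一个最小示意说明"低维幅值系数经 eigengrasp 基解码为全关节角"的思路;基向量、均值姿态与维度均为随机假设值,解码细节也可能与原文不同,实际应由特定手型的形态描述得到。

```python
# 最小示意:用 eigengrasp 基将低维幅值系数线性解码为全关节角。
import numpy as np

def decode_articulation(amplitudes, eigengrasp_basis, mean_pose):
    """amplitudes: (K,) 低维系数;eigengrasp_basis: (D, K);mean_pose: (D,)。"""
    return mean_pose + eigengrasp_basis @ amplitudes   # 线性组合得到 D 维关节角

rng = np.random.default_rng(0)
E = rng.normal(size=(22, 5))       # 假设 22 个关节、5 个 eigengrasp 分量
mean_q = np.zeros(22)
a = rng.normal(size=5)             # 幅值预测器回归得到的低维系数(此处随机代替)
print(decode_articulation(a, E, mean_q).shape)   # -> (22,)
```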
点此查看论文截图





GLVD: Guided Learned Vertex Descent
Authors:Pol Caselles Rico, Francesc Moreno Noguer
Existing 3D face modeling methods usually depend on 3D Morphable Models, which inherently constrain the representation capacity to fixed shape priors. Optimization-based approaches offer high-quality reconstructions but tend to be computationally expensive. In this work, we introduce GLVD, a hybrid method for 3D face reconstruction from few-shot images that extends Learned Vertex Descent (LVD) by integrating per-vertex neural field optimization with global structural guidance from dynamically predicted 3D keypoints. By incorporating relative spatial encoding, GLVD iteratively refines mesh vertices without requiring dense 3D supervision. This enables expressive and adaptable geometry reconstruction while maintaining computational efficiency. GLVD achieves state-of-the-art performance in single-view settings and remains highly competitive in multi-view scenarios, all while substantially reducing inference time.
现有的3D人脸建模方法通常依赖3D可变形模型(3D Morphable Models),这从根本上将表示能力限制在固定的形状先验中。基于优化的方法虽然能提供高质量的重建,但计算成本往往很高。在这项工作中,我们提出了GLVD,一种用于从少量图像进行3D人脸重建的混合方法。GLVD扩展了学习顶点下降法(LVD),将逐顶点的神经场优化与来自动态预测的3D关键点的全局结构引导相结合。通过引入相对空间编码,GLVD能够在无需密集3D监督的情况下迭代优化网格顶点,在保持计算效率的同时实现表达能力强且适应性好的几何重建。GLVD在单视图设置中达到了最先进的性能,在多视图场景中也保持高度竞争力,同时大大降低了推理时间。
论文及项目相关链接
Summary
本文介绍了一种基于少量图像的3D人脸重建的混合方法GLVD,该方法结合了顶点神经场优化和动态预测的全局结构指导,实现了高效、灵活的人脸重建。GLVD通过引入相对空间编码,能够在不需要密集3D监督的情况下迭代优化网格顶点,达到表情丰富、适应性强且计算效率高的几何重建效果。
Key Takeaways
- GLVD是一种用于从少量图像进行3D人脸重建的混合方法。
- GLVD结合了顶点神经场优化和全局结构指导,提高了重建的质量和效率。
- 相对空间编码的引入使得GLVD能够在不需要密集3D监督的情况下迭代优化网格顶点。
- GLVD在单视图和多视图场景下均表现出卓越性能。
- GLVD大幅降低了推理时间,提高了计算效率。
- GLVD在人脸重建中实现了表情丰富、适应性强的人脸几何重建。
点此查看论文截图





Training-Free Time Series Classification via In-Context Reasoning with LLM Agents
Authors:Songyuan Sui, Zihang Xu, Yu-Neng Chuang, Kwei-Herng Lai, Xia Hu
Time series classification (TSC) spans diverse application scenarios, yet labeled data are often scarce, making task-specific training costly and inflexible. Recent reasoning-oriented large language models (LLMs) show promise in understanding temporal patterns, but purely zero-shot usage remains suboptimal. We propose FETA, a multi-agent framework for training-free TSC via exemplar-based in-context reasoning. FETA decomposes a multivariate series into channel-wise subproblems, retrieves a few structurally similar labeled examples for each channel, and leverages a reasoning LLM to compare the query against these exemplars, producing channel-level labels with self-assessed confidences; a confidence-weighted aggregator then fuses all channel decisions. This design eliminates the need for pretraining or fine-tuning, improves efficiency by pruning irrelevant channels and controlling input length, and enhances interpretability through exemplar grounding and confidence estimation. On nine challenging UEA datasets, FETA achieves strong accuracy under a fully training-free setting, surpassing multiple trained baselines. These results demonstrate that a multi-agent in-context reasoning framework can transform LLMs into competitive, plug-and-play TSC solvers without any parameter training. The code is available at https://github.com/SongyuanSui/FETATSC.
时间序列分类(TSC)涵盖了多种应用场景,但标注数据通常稀缺,使得针对特定任务的训练成本高昂且不够灵活。最近以推理为导向的大型语言模型(LLM)在理解时间序列模式方面显示出潜力,但纯零样本使用仍然不够理想。我们提出了FETA,这是一个基于范例的上下文推理的完全无训练TSC多智能体框架。FETA将多元序列分解成通道级别的子问题,为每个通道检索几个结构相似的标注范例,并利用推理LLM来比较查询与这些范例,以自我评估的信心产生通道级别的标签;一个信心加权聚合器随后融合所有通道决策。这种设计消除了对预训练或微调的需求,通过删除无关通道和控制输入长度来提高效率,并通过范例接地和信心估计增强可解释性。在九个具有挑战性的UEA数据集上,FETA在完全无训练的设置下实现了较高的准确性,超越了多个训练过的基线。这些结果表明,多智能体上下文推理框架可以将LLM转变为无需任何参数训练的竞争型、即插即用TSC求解器。代码可在https://github.com/SongyuanSui/FETATSC 获得。
论文及项目相关链接
PDF 8 pages main content, 12 pages total including appendix, 1 figure
Summary:
针对时间序列分类(TSC)中标签数据稀缺的问题,提出了一种基于范例的上下文推理的多智能体框架FETA,用于无训练TSC。FETA将多元序列分解为通道级子问题,通过范例检索和推理LLM进行比较,产生具有自我评估置信度的通道级标签。一个基于置信度加权的聚合器融合所有通道决策,无需预训练或微调,提高了效率和可解释性。在UEA数据集上的实验表明,FETA在完全无训练设置下取得了强大的准确性。
Key Takeaways:
- 时间序列分类(TSC)中标签数据稀缺,任务特定训练成本高昂且不够灵活。
- 提出了一种基于范例的上下文推理的多智能体框架FETA,用于无训练TSC。
- FETA通过分解多元序列为通道级子问题,检索结构相似的标记范例,利用推理LLM进行比较。
- 产生通道级标签具有自我评估的置信度。
- 置信度加权聚合器融合所有通道决策。
- FETA无需预训练或微调,提高了效率,通过范例接地和置信度估计增强了可解释性。
- 在UEA数据集上的实验表明,FETA在无训练设置下取得了强大的准确性。
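下面给出置信度加权聚合的一个最小示意:把各通道 LLM 给出的标签与自评置信度融合为最终类别。具体聚合规则为常见做法的示意,未必与原文实现一致。

```python
# 最小示意:置信度加权投票,融合各通道(channel)的标签与自评置信度。
from collections import defaultdict

def aggregate(channel_preds):
    """channel_preds: [(label, confidence), ...],confidence ∈ [0, 1]。"""
    scores = defaultdict(float)
    for label, conf in channel_preds:
        scores[label] += conf                 # 同一标签的置信度累加
    return max(scores, key=scores.get)        # 取加权票数最高的标签

print(aggregate([("walking", 0.9), ("running", 0.4), ("walking", 0.6)]))  # walking
```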
点此查看论文截图


Syn-Diag: An LLM-based Synergistic Framework for Generalizable Few-shot Fault Diagnosis on the Edge
Authors:Zijun Jia, Shuang Liang, Jinsong Yu
Industrial fault diagnosis faces the dual challenges of data scarcity and the difficulty of deploying large AI models in resource-constrained environments. This paper introduces Syn-Diag, a novel cloud-edge synergistic framework that leverages Large Language Models to overcome these limitations in few-shot fault diagnosis. Syn-Diag is built on a three-tiered mechanism: 1) Visual-Semantic Synergy, which aligns signal features with the LLM’s semantic space through cross-modal pre-training; 2) Content-Aware Reasoning, which dynamically constructs contextual prompts to enhance diagnostic accuracy with limited samples; and 3) Cloud-Edge Synergy, which uses knowledge distillation to create a lightweight, efficient edge model capable of online updates via a shared decision space. Extensive experiments on six datasets covering different CWRU and SEU working conditions show that Syn-Diag significantly outperforms existing methods, especially in 1-shot and cross-condition scenarios. The edge model achieves performance comparable to the cloud version while reducing model size by 83% and latency by 50%, offering a practical, robust, and deployable paradigm for modern intelligent diagnostics.
工业故障诊断面临数据稀缺和在资源受限环境中难以部署大型AI模型的双重挑战。本文介绍了一种新型的云边协同框架Syn-Diag,它利用大型语言模型来克服小样本故障诊断中的这些限制。Syn-Diag建立在三层机制上:1)视觉-语义协同,通过跨模态预训练将信号特征与LLM的语义空间对齐;2)内容感知推理,动态构建上下文提示,以在样本有限的情况下提高诊断准确性;3)云边协同,利用知识蒸馏创建轻量级、高效的边缘模型,并通过共享决策空间进行在线更新。在涵盖CWRU和SEU不同工况的六个数据集上的大量实验表明,Syn-Diag显著优于现有方法,特别是在1-shot和跨工况场景下。边缘模型的性能与云端版本相当,同时模型大小减少了83%,延迟降低了50%,为现代智能诊断提供了实用、稳健和可部署的范式。
论文及项目相关链接
Summary
本文提出一种名为Syn-Diag的云边协同诊断框架,用于解决工业故障诊断中数据稀缺和大模型部署困难的问题。该框架利用大型语言模型,通过视觉-语义协同、内容感知推理和云边协同机制,实现了在少量样本下的高效故障诊断。实验证明,Syn-Diag在多种数据集上显著优于现有方法,特别是在单样本和跨工况场景下表现更为出色。边缘模型的性能与云端版本相当,同时减小了模型体积和延迟,为现代智能诊断提供了实用、稳健和可部署的范式。
Key Takeaways
- Syn-Diag框架解决了工业故障数据稀缺和大型AI模型部署困难的问题。
- 通过视觉语义协同,将信号特征与LLM语义空间对齐。
- 内容感知推理机制通过动态构建上下文提示,在样本有限的情况下提高诊断准确性。
- 云边协同利用知识蒸馏创建轻量级边缘模型,并通过共享决策空间支持在线更新。
- 实验证明Syn-Diag在多种数据集上表现优异,特别是在单样本和跨条件场景下。
- 边缘模型性能与云端版本相当,同时显著减小了模型体积和延迟。
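下面是云边协同中常用的软标签蒸馏损失的一个最小示意,对应"用知识蒸馏把云端大模型压缩为轻量边缘模型"的思路;温度等超参数为示意取值,并非原文设置。

```python
# 最小示意:带温度的软标签蒸馏损失(教师软标签下的交叉熵),用于云端->边缘蒸馏。
import numpy as np

def softmax(x, T=1.0):
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / T)
    return z / z.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    p_t = softmax(teacher_logits, T)                      # 教师软标签
    log_p_s = np.log(softmax(student_logits, T) + 1e-12)  # 学生对数概率
    return float(-(p_t * log_p_s).sum(axis=-1).mean() * T * T)

rng = np.random.default_rng(0)
print(distillation_loss(rng.normal(size=(8, 5)), rng.normal(size=(8, 5))))
```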
点此查看论文截图


nnSAM2: nnUNet-Enhanced One-Prompt SAM2 for Few-shot Multi-Modality Segmentation and Composition Analysis of Lumbar Paraspinal Muscles
Authors:Zhongyi Zhang, Julie A. Hides, Enrico De Martino, Abdul Joseph Fofanah, Gervase Tuxworth
Purpose: To develop and validate No-New SAM2 (nnsam2) for few-shot segmentation of lumbar paraspinal muscles using only a single annotated slice per dataset, and to assess its statistical comparability with expert measurements across multi-sequence MRI and multi-protocol CT. Methods: We retrospectively analyzed 1,219 scans (19,439 slices) from 762 participants across six datasets. Six slices (one per dataset) served as labeled examples, while the remaining 19,433 slices were used for testing. In this minimal-supervision setting, nnsam2 used single-slice SAM2 prompts to generate pseudo-labels, which were pooled across datasets and refined through three sequential, independent nnU-Net models. Segmentation performance was evaluated using the Dice similarity coefficient (DSC), and automated measurements-including muscle volume, fat ratio, and CT attenuation-were assessed with two one-sided tests (TOST) and intraclass correlation coefficients (ICC). Results: nnsam2 outperformed vanilla SAM2, its medical variants, TotalSegmentator, and the leading few-shot method, achieving DSCs of 0.94-0.96 on MR images and 0.92-0.93 on CT. Automated and expert measurements were statistically equivalent for muscle volume (MRI/CT), CT attenuation, and Dixon fat ratio (TOST, P < 0.05), with consistently high ICCs (0.86-1.00). Conclusion: We developed nnsam2, a state-of-the-art few-shot framework for multi-modality LPM segmentation, producing muscle volume (MRI/CT), attenuation (CT), and fat ratio (Dixon MRI) measurements that were statistically comparable to expert references. Validated across multimodal, multicenter, and multinational cohorts, and released with open code and data, nnsam2 demonstrated high annotation efficiency, robust generalizability, and reproducibility.
目的:开发并验证No-New SAM2(nnsam2),使其仅利用每个数据集一个标注切片即可对腰椎旁肌肉进行少样本分割,并评估其在多序列MRI和多协议CT上与专家测量结果的统计可比性。方法:我们回顾性分析了来自六个数据集、762名参与者的1,219次扫描(共19,439个切片)。其中六个切片(每个数据集一个)作为标注示例,其余19,433个切片用于测试。在这种最小监督设置下,nnsam2使用单切片SAM2提示生成伪标签,这些伪标签在数据集间汇总,并经由三个顺序独立的nnU-Net模型逐步精炼。分割性能使用Dice相似系数(DSC)评估;自动化测量(包括肌肉体积、脂肪比和CT衰减)使用双单侧检验(TOST)和组内相关系数(ICC)评估。结果:nnsam2的表现优于原始SAM2、其医学变体、TotalSegmentator以及领先的少样本方法,在MR图像上DSC达到0.94-0.96,在CT上达到0.92-0.93。自动化测量与专家测量在肌肉体积(MRI/CT)、CT衰减和Dixon脂肪比上具有统计等效性(TOST,P < 0.05),且ICC始终较高(0.86-1.00)。结论:我们开发了最先进的少样本框架nnsam2,用于多模态腰椎旁肌肉(LPM)分割,其给出的肌肉体积(MRI/CT)、衰减(CT)和脂肪比(Dixon MRI)测量值与专家参考值具有统计可比性。nnsam2经过多模态、多中心和跨国队列的验证,并以开放代码和数据发布,展现出高标注效率、稳健的泛化能力和可重复性。
论文及项目相关链接
Summary
nnsam2方法用于基于少量标注数据的腰椎旁肌肉分割,仅使用每个数据集的一个标注切片。该方法在多种序列MRI和多协议CT上的表现与专家测量具有统计可比性。通过对大量扫描数据的回顾性分析和独立模型验证,结果显示nnsam2的分割性能优秀,自动化测量结果与专家测量结果相当。
Key Takeaways
- nnsam2是一种用于腰椎旁肌肉少样本分割的方法,仅需要每个数据集的一个标注切片。
- nnsam2在多种序列MRI和多协议CT上具有优秀的分割性能。
- nnsam2使用单切片SAM2提示生成伪标签,并跨数据集整合和精炼。
- 通过Dice similarity coefficient (DSC)评估分割性能。
- 自动化测量结果与专家测量结果具有统计等效性,包括肌肉体积、脂肪比和CT衰减。
- nnsam2具有高的标注效率、稳健的通用性和可重复性。
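下面给出原文用于评估分割性能的 Dice 相似系数(DSC)的最小计算示意,输入为二值分割掩码。

```python
# 最小示意:Dice 相似系数(DSC)= 2*|交集| / (|预测| + |金标准|)。
import numpy as np

def dice(pred, gt, eps=1e-7):
    """pred, gt: 同形状的二值掩码(0/1)。"""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

a = np.zeros((4, 4), int); a[1:3, 1:3] = 1
b = np.zeros((4, 4), int); b[1:4, 1:3] = 1
print(round(dice(a, b), 3))   # 0.8
```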
点此查看论文截图

Attention-Enhanced Prototypical Learning for Few-Shot Infrastructure Defect Segmentation
Authors:Christina Thrainer, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Christian Guetl, Steven Sloan, Kendall N. Niles, Ken Pathak
Few-shot semantic segmentation is vital for deep learning-based infrastructure inspection applications, where labeled training examples are scarce and expensive. Although existing deep learning frameworks perform well, the need for extensive labeled datasets and the inability to learn new defect categories with little data are problematic. We present our Enhanced Feature Pyramid Network (E-FPN) framework for few-shot semantic segmentation of culvert and sewer defect categories using a prototypical learning framework. Our approach has three main contributions: (1) adaptive E-FPN encoder using InceptionSepConv blocks and depth-wise separable convolutions for efficient multi-scale feature extraction; (2) prototypical learning with masked average pooling for powerful prototype generation from small support examples; and (3) attention-based feature representation through global self-attention, local self-attention and cross-attention. Comprehensive experimentation on challenging infrastructure inspection datasets illustrates that the method achieves excellent few-shot performance, with the best configuration being 8-way 5-shot training configuration at 82.55% F1-score and 72.26% mIoU in 2-way classification testing. The self-attention method had the most significant performance improvements, providing 2.57% F1-score and 2.9% mIoU gain over baselines. Our framework addresses the critical need to rapidly respond to new defect types in infrastructure inspection systems with limited new training data that lead to more efficient and economical maintenance plans for critical infrastructure systems.
小样本语义分割对于基于深度学习的基础设施巡检应用至关重要,因为这类应用中带标注的训练样本既稀缺又昂贵。尽管现有深度学习框架表现良好,但其对大规模标注数据集的依赖以及难以用少量数据学习新缺陷类别的问题仍然存在。我们提出了增强型特征金字塔网络(E-FPN)框架,利用原型学习对涵洞和下水道缺陷类别进行小样本语义分割。我们的方法有三个主要贡献:(1)使用InceptionSepConv块和深度可分离卷积的自适应E-FPN编码器,进行高效的多尺度特征提取;(2)利用掩码平均池化的原型学习,从少量支持样本中生成强大的原型;(3)通过全局自注意力、局部自注意力和交叉注意力实现基于注意力的特征表示。在具有挑战性的基础设施巡检数据集上的综合实验表明,该方法取得了出色的小样本性能,最佳配置为8-way 5-shot训练配置,在2-way分类测试中达到82.55%的F1分数和72.26%的mIoU。自注意力方法带来的性能提升最为显著,相比基线提高了2.57%的F1分数和2.9%的mIoU。我们的框架满足了基础设施巡检系统在新训练数据有限的情况下快速应对新缺陷类型的关键需求,有助于为关键基础设施系统制定更高效、更经济的维护计划。
论文及项目相关链接
Summary
本文针对深度学习在基础设施巡检应用中的小样本语义分割问题,提出一种增强型特征金字塔网络(E-FPN)框架。该框架使用原型学习对涵洞和下水道缺陷类别进行小样本语义分割,主要贡献包括自适应E-FPN编码器、原型学习和注意力机制。实验表明,该方法在具有挑战性的基础设施巡检数据集上取得了良好的小样本性能,其中自注意力方法表现最佳,显著提高了基线性能。该框架满足了在新训练数据有限的情况下对基础设施巡检系统中新缺陷类型快速响应的需求,有助于提高关键基础设施系统的维护效率和经济效益。
Key Takeaways
- Few-shot语义分割在深度学习为基础设施检测应用中至关重要,特别是在缺乏标注训练样本的情况下。
- 现有深度学习框架存在对新缺陷类别学习不足的问题。
- 提出的E-FPN框架包括自适应E-FPN编码器、原型学习和注意力机制三个主要贡献。
- 自适应E-FPN编码器通过InceptionSepConv块和深度可分离卷积实现高效多尺度特征提取。
- 原型学习采用掩码平均池化生成强大原型,基于小规模支持样本。
- 注意力机制包括全局自注意力、局部自注意力和交叉注意力,有助于提升特征表示能力。
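下面用一个最小示意说明"掩码平均池化生成类别原型",以及用余弦相似度对查询特征逐像素打分的常见做法;特征维度为假设值,且这只是原型学习分支的示意,并非完整的 E-FPN 实现。

```python
# 最小示意:掩码平均池化(masked average pooling)生成原型 + 余弦相似度打分。
import numpy as np

def masked_average_pooling(feat, mask, eps=1e-7):
    """feat: (C, H, W) 支持图像特征;mask: (H, W) 该类别的二值掩码。"""
    masked = feat * mask[None]                             # 只保留掩码内的特征
    return masked.sum(axis=(1, 2)) / (mask.sum() + eps)    # (C,) 类别原型

def cosine_scores(query_feat, prototype, eps=1e-7):
    """query_feat: (C, H, W);返回查询特征图与原型的逐像素余弦相似度 (H, W)。"""
    q = query_feat / (np.linalg.norm(query_feat, axis=0, keepdims=True) + eps)
    p = prototype / (np.linalg.norm(prototype) + eps)
    return np.einsum("chw,c->hw", q, p)

rng = np.random.default_rng(0)
feat = rng.normal(size=(64, 32, 32))
mask = (rng.random((32, 32)) > 0.5).astype(float)
proto = masked_average_pooling(feat, mask)
print(cosine_scores(rng.normal(size=(64, 32, 32)), proto).shape)   # (32, 32)
```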
点此查看论文截图


HyperVLA: Efficient Inference in Vision-Language-Action Models via Hypernetworks
Authors:Zheng Xiong, Kang Li, Zilin Wang, Matthew Jackson, Jakob Foerster, Shimon Whiteson
Built upon language and vision foundation models with strong generalization ability and trained on large-scale robotic data, Vision-Language-Action (VLA) models have recently emerged as a promising approach to learning generalist robotic policies. However, a key drawback of existing VLAs is their extremely high inference costs. In this paper, we propose HyperVLA to address this problem. Unlike existing monolithic VLAs that activate the whole model during both training and inference, HyperVLA uses a novel hypernetwork (HN)-based architecture that activates only a small task-specific policy during inference, while still retaining the high model capacity needed to accommodate diverse multi-task behaviors during training. Successfully training an HN-based VLA is nontrivial so HyperVLA contains several key algorithm design features that improve its performance, including properly utilizing the prior knowledge from existing vision foundation models, HN normalization, and an action generation strategy. Compared to monolithic VLAs, HyperVLA achieves a similar or even higher success rate for both zero-shot generalization and few-shot adaptation, while significantly reducing inference costs. Compared to OpenVLA, a state-of-the-art VLA model, HyperVLA reduces the number of activated parameters at test time by $90\times$, and accelerates inference speed by $120\times$. Code is publicly available at https://github.com/MasterXiong/HyperVLA
基于具有强大泛化能力的语言和视觉基础模型,并在大规模机器人数据上进行训练,视觉-语言-动作(VLA)模型最近成为学习通用机器人策略的一种有前途的方法。然而,现有VLA的一个关键缺点是推理成本极高。针对这一问题,本文提出了HyperVLA。与现有在训练和推理过程中都会激活整个模型的整体式(monolithic)VLA不同,HyperVLA采用一种基于超网络(HN)的新型架构,在推理时只激活一个小型的任务特定策略,同时在训练时仍保留容纳多样化多任务行为所需的高模型容量。成功训练基于HN的VLA并不容易,因此HyperVLA包含若干关键的算法设计来提高其性能,包括恰当利用现有视觉基础模型的先验知识、HN归一化以及动作生成策略。与整体式VLA相比,HyperVLA在零样本泛化和少样本适应方面达到了相似甚至更高的成功率,同时大大降低了推理成本。与最先进的VLA模型OpenVLA相比,HyperVLA在测试时将激活参数的数量减少了90倍,并将推理速度提高了120倍。代码公开在https://github.com/MasterXiong/HyperVLA。
论文及项目相关链接
Summary
本文介绍了基于语言和视觉基础模型的通用机器人策略学习新方法——HyperVLA。该方法使用超网络(HN)架构,在推理时仅激活小型的特定任务策略,降低了推理成本,同时保留了在训练期间适应多样多任务行为所需的高模型容量。HyperVLA还包括一些关键算法设计特点,如利用现有视觉基础模型的先验知识、HN归一化和动作生成策略。相较于传统的单一体VLAs,HyperVLA在零样本泛化和少量适应方面取得了相似甚至更高的成功率,同时显著降低了推理成本。与现有的顶尖VLA模型OpenVLA相比,HyperVLA在测试时激活的参数数量减少了90倍,推理速度提高了120倍。
Key Takeaways
- HyperVLA是一种基于语言和视觉基础模型的机器人策略学习方法,旨在解决现有VLAs高推理成本的问题。
- HyperVLA采用超网络(HN)架构,推理时仅激活小型的任务特定策略。
- HyperVLA在训练过程中保留高模型容量以适应多种任务行为。
- HyperVLA利用现有视觉基础模型的先验知识,并包括HN归一化和动作生成策略等关键算法设计特点。
- 与传统单一体VLAs相比,HyperVLA在泛化和适应方面表现优越。
- HyperVLA较现有的顶尖VLA模型OpenVLA,显著减少了测试时激活的参数数量和推理时间。
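下面用一个最小示意说明超网络"根据任务嵌入生成小型策略网络权重、推理时只前向这个小网络"的思路;网络结构与维度均为假设,与 HyperVLA 的真实架构没有直接对应关系。

```python
# 最小示意:超网络(一个线性映射)把任务嵌入映射为小型 MLP 策略的全部参数。
import numpy as np

rng = np.random.default_rng(0)
EMB, OBS, HID, ACT = 32, 16, 8, 4
n_params = OBS * HID + HID + HID * ACT + ACT
W_hyper = rng.normal(scale=0.1, size=(n_params, EMB))      # 超网络权重

def generate_policy(task_emb):
    theta = W_hyper @ task_emb                              # 生成扁平化的策略参数
    i = 0
    W1 = theta[i:i + OBS * HID].reshape(OBS, HID); i += OBS * HID
    b1 = theta[i:i + HID]; i += HID
    W2 = theta[i:i + HID * ACT].reshape(HID, ACT); i += HID * ACT
    b2 = theta[i:i + ACT]
    def policy(obs):
        h = np.tanh(obs @ W1 + b1)                          # 推理时只前向这个小网络
        return h @ W2 + b2
    return policy

pi = generate_policy(rng.normal(size=EMB))
print(pi(rng.normal(size=OBS)).shape)                       # (4,)
```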
点此查看论文截图



SAFA-SNN: Sparsity-Aware On-Device Few-Shot Class-Incremental Learning with Fast-Adaptive Structure of Spiking Neural Network
Authors:Huijing Zhang, Muyang Cao, Linshan Jiang, Xin Du, Di Yu, Changze Lv, Shuiguang Deng
Continuous learning of novel classes is crucial for edge devices to preserve data privacy and maintain reliable performance in dynamic environments. However, the scenario becomes particularly challenging when data samples are insufficient, requiring on-device few-shot class-incremental learning (FSCIL) to maintain consistent model performance. Although existing work has explored parameter-efficient FSCIL frameworks based on artificial neural networks (ANNs), their deployment is still fundamentally constrained by limited device resources. Inspired by neural mechanisms, Spiking neural networks (SNNs) process spatiotemporal information efficiently, offering lower energy consumption, greater biological plausibility, and compatibility with neuromorphic hardware than ANNs. In this work, we present an SNN-based method for On-Device FSCIL, i.e., Sparsity-Aware and Fast Adaptive SNN (SAFA-SNN). We first propose sparsity-conditioned neuronal dynamics, in which most neurons remain stable while a subset stays active, thereby mitigating catastrophic forgetting. To further cope with spike non-differentiability in gradient estimation, we employ zeroth-order optimization. Moreover, during incremental learning sessions, we enhance the discriminability of new classes through subspace projection, which alleviates overfitting to novel classes. Extensive experiments conducted on two standard benchmark datasets (CIFAR100 and Mini-ImageNet) and three neuromorphic datasets (CIFAR-10-DVS, DVS128gesture, and N-Caltech101) demonstrate that SAFA-SNN outperforms baseline methods, specifically achieving at least 4.01% improvement at the last incremental session on Mini-ImageNet and 20% lower energy cost over baseline methods with practical implementation.
持续学习新类别对于边缘设备在动态环境中保护数据隐私并维持可靠性能至关重要。然而,当数据样本不足时,情况变得尤为具有挑战性,需要设备端的少样本类增量学习(FSCIL)来维持模型性能的一致性。虽然已有工作探索了基于人工神经网络(ANN)的参数高效FSCIL框架,但其部署仍然从根本上受限于设备资源。受神经机制的启发,脉冲神经网络(SNN)能够高效处理时空信息,与ANN相比具有更低的能耗、更强的生物合理性以及对神经形态硬件的兼容性。在这项工作中,我们提出了一种基于SNN的设备端FSCIL方法,即稀疏感知快速自适应SNN(SAFA-SNN)。我们首先提出稀疏条件化的神经元动态,其中大多数神经元保持稳定,仅一小部分神经元保持活跃,从而缓解灾难性遗忘。为了进一步应对梯度估计中脉冲的不可微性,我们采用零阶优化。此外,在增量学习阶段,我们通过子空间投影增强新类别的可分辨性,从而缓解对新类别的过拟合。在两个标准基准数据集(CIFAR100和Mini-ImageNet)以及三个神经形态数据集(CIFAR-10-DVS、DVS128gesture和N-Caltech101)上进行的大量实验表明,SAFA-SNN优于基线方法,特别是在Mini-ImageNet的最后一个增量阶段至少提高了4.01%,并且在实际实现中能耗比基线方法低20%。
论文及项目相关链接
Summary
基于脉冲神经网络的高效在线少样本类增量学习。为解决边缘设备在动态环境中保护数据隐私并保持可靠性能的问题,研究提出一种基于脉冲神经网络(SNN)的在线少样本类增量学习方法——稀疏感知快速自适应SNN(SAFA-SNN)。通过稀疏性条件神经元动态、零阶优化和子空间投影等技术,实现了高效的增量学习和良好的性能表现。实验结果表明,SAFA-SNN在多个数据集上优于基线方法。
Key Takeaways
- 边缘设备在动态环境中进行持续的新型类别学习至关重要。
- 数据样本不足时,需要设备端的少样本类增量学习(FSCIL)来维持模型性能。
- SAFA-SNN方法通过稀疏性条件神经元动态减轻灾难性遗忘问题。
- 采用零阶优化应对脉冲的非可微性,在梯度估计中进行优化。
- 子空间投影技术用于提高新类别的可分辨性,减轻对新类别的过度拟合。
- 实验结果表明SAFA-SNN在多个数据集上的性能优于基线方法。
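下面给出零阶(zeroth-order)梯度估计的最小示意:用随机方向上的有限差分近似梯度,从而绕开脉冲发放不可微的问题;目标函数、步长与采样数均为演示用的假设,并非原文设置。

```python
# 最小示意:零阶梯度估计(有限差分 + 随机探测方向),目标函数仅为演示。
import numpy as np

def zeroth_order_grad(f, theta, rng, mu=1e-2, n_samples=16):
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.normal(size=theta.shape)                      # 随机探测方向
        grad += (f(theta + mu * u) - f(theta - mu * u)) / (2 * mu) * u
    return grad / n_samples

f = lambda w: float(np.sum((w - 1.0) ** 2))   # 演示目标:最小值在 w = 1
rng = np.random.default_rng(0)
w = np.zeros(5)
for _ in range(200):
    w -= 0.05 * zeroth_order_grad(f, w, rng)
print(np.round(w, 2))                          # 应接近全 1 向量
```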
点此查看论文截图



VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning
Authors:Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin
Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
小样本学习(FSL)旨在仅凭少量带标签的支持样本识别新概念。近期的研究通过引入额外的语义信息或设计复杂的语义融合模块来增强支持特征。然而,由于缺乏对真实实例的锚定,它们仍会产生与视觉证据相矛盾的语义幻觉,导致引导信息含噪且修正代价高昂。为了解决这些问题,我们提出了一个新框架——利用大型语言模型(LLM)连接视觉与文本的小样本学习(VT-FSL)。该框架以LLM和支持图像为条件构建精确的跨模态提示,并通过几何感知对齐将其无缝整合。它主要由跨模态迭代提示(CIP)和跨模态几何对齐(CGA)组成。具体而言,CIP以类名和支持图像为条件,在一次结构化推理过程中迭代生成精确的类别描述。这些描述不仅丰富了对新类别的语义理解,还支持零样本合成语义一致的图像。描述和合成图像分别作为互补的文本与视觉提示,提供高层类别语义和低层类内多样性,以弥补支持数据的不足。此外,CGA通过最小化融合后的文本表示、支持图像表示和合成图像表示所张成的三维平行六面体的核化体积来共同对齐三者,捕捉所有表示之间的全局和非线性关系,实现结构化且一致的多模态整合。所提出的VT-FSL方法在包括标准、跨域和细粒度小样本学习场景在内的十个不同基准上创造了新的最先进性能。代码可在 https://github.com/peacelwh/VT-FSL 获取。
论文及项目相关链接
PDF Accepted by NeurIPS 2025
Summary
本文介绍了一种面向小样本学习(FSL)的新方法——利用大型语言模型(LLM)连接视觉与文本的小样本学习框架(VT-FSL)。该框架以LLM和支持图像为条件构建精确的跨模态提示,并通过几何感知对齐将其无缝整合,主要包括跨模态迭代提示(CIP)和跨模态几何对齐(CGA)。CIP以类名和支持图像为条件生成精确的类别描述,这些描述不仅丰富了对新类别的语义理解,还实现了语义一致图像的零样本合成;描述与合成图像作为互补的文本和视觉提示,为有限的支持数据补充高层类别语义和低层类内多样性。CGA则通过最小化融合后的文本、支持和合成视觉表示所张成的三维平行六面体的核化体积来共同对齐三者,捕捉所有表示之间的全局和非线性关系,实现结构化且一致的多模态整合。VT-FSL方法在十个不同基准上建立了新的最先进性能,涵盖标准、跨域和细粒度小样本学习场景。
Key Takeaways
- Few-shot learning (FSL)旨在从少量标记样本中识别新概念。
- 最近的研究通过增加额外的语义信息或设计复杂的语义融合模块来增强支持特征。
- 由于缺乏对真实实例的锚定,模型会产生与视觉证据相矛盾的语义幻觉,导致引导信息含噪且修正代价高昂。
- 提出的VT-FSL框架结合了视觉和文本与大型语言模型(LLM),构建了精确的跨模态提示和基于支持图像的迭代提示。它主要由跨模态迭代提示(CIP)和跨模态几何对齐(CGA)组成。
- CIP通过生成精确的类别描述来丰富对新类别语义的理解,并实现了语义一致图像的零样本合成。合成图像作为补充的视觉提示,提供了高层类别语义和低层类内多样性,以弥补支持数据的不足。
- CGA通过最小化文本、支持和合成视觉表示所张成的三维平行六面体的核化体积来联合对齐三者,实现了结构化且一致的多模态集成。
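下面用一个最小示意说明"三组表示张成的三维平行六面体的核化体积"可以如何用 Gram 行列式计算;核函数此处假设为 RBF,具体核与归一化方式以原文为准。

```python
# 最小示意:三个表示在 RKHS 中张成的平行六面体体积 = sqrt(det(Gram 矩阵))。
import numpy as np

def rbf(x, y, gamma=1.0):
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def kernelized_volume(vecs, gamma=1.0):
    """vecs: 长度为 3 的向量列表;返回其张成的平行六面体的核化体积。"""
    G = np.array([[rbf(a, b, gamma) for b in vecs] for a in vecs])   # 3x3 Gram 矩阵
    return float(np.sqrt(max(np.linalg.det(G), 0.0)))

rng = np.random.default_rng(0)
t, s, v = rng.normal(size=(3, 16))          # 文本、支持、合成三种表示(随机演示)
print(round(kernelized_volume([t, s, v]), 4))   # 体积越小表示三者对齐得越好
```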
点此查看论文截图






Scaled Signed Averaging Improves In-Context and Early Learning Benchmark Performance in Small Transformers
Authors:Omar Naim, Swarnadeep Bhar, Jérôme Bolte, Nicholas Asher
While Large Language models’ abilities for in-context learning (ICL) have drawn much attention, we examine some of its limitations on semantic tasks involving quantifiers like “all” and “some”, as well as on tasks with linear functions. We identify Softmax, the scoring function in attention mechanism, as a contributing factor to these limitations. We propose scaled signed averaging (SSA), a novel alternative to Softmax to mitigate these problems. We show that SSA significantly improves performance on our ICL tasks. In addition, SSA outperforms transformer models with Softmax on several early learning NLP benchmarks and linguistic probing tasks on zero and few-shot settings.
虽然大型语言模型在上下文学习(ICL)方面的能力已经引起了广泛关注,但我们研究了一些其在涉及量词(如“所有”和“一些”)的语义任务以及线性函数任务上的局限性。我们确定了注意力机制中的评分函数Softmax是这些局限性的一个因素。我们提出了Scaled Signed Averaging(SSA)这一新型的Softmax替代方案来缓解这些问题。我们证明SSA在我们的ICL任务上显著提高性能。此外,在零样本和少样本设置下,SSA在多个早期学习NLP基准测试和语言学探测任务上的表现优于使用Softmax的Transformer模型。
论文及项目相关链接
Summary
大型语言模型的上下文学习能力(ICL)虽然备受关注,但在涉及量词(如“所有”和“一些”)的语义任务以及线性函数任务上存在一些局限性。本文发现注意力机制中的评分函数Softmax是这些局限性的原因之一。为此,本文提出了使用一种名为Scaled Signed Averaging(SSA)的新型评分函数来替代Softmax,以缓解这些问题。实验表明,SSA在ICL任务上的性能显著提高,并且在零样本和少样本环境下的早期学习NLP基准测试和语言学探测任务中优于使用Softmax的transformer模型。
Key Takeaways
- 大型语言模型的上下文学习能力(ICL)在涉及量词和线性函数的任务中存在局限性。
- Softmax作为注意力机制中的评分函数,是这些局限性的原因之一。
- 提出了Scaled Signed Averaging(SSA)新型评分函数,以替代Softmax,改善模型性能。
- SSA在ICL任务上的性能显著提高。
- SSA在零样本和少样本环境下的早期学习NLP基准测试中表现优异。
- SSA优于使用Softmax的transformer模型。
点此查看论文截图



Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models
Authors:Haotian Wu, Bo Xu, Yao Shu, Menglin Yang, Chengwei Qin
Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in-context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt with two different answers. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few-shot chain-of-thought (CoT), thinking twice and majority voting. Moreover, it achieves comparable in-distribution performance to training-based SOTA reasoning method, while substantially outperforming on out-of-distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing the importance of structural thinking diversity and the benefits of consistency check. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs.
推理型大型语言模型(RLLMs)最近通过结构化的多步推理展现了卓越的能力。尽管先前的研究主要聚焦于改进其训练和推理策略,但它们在上下文学习(ICL)方面的潜力仍鲜有探索。为了填补这一空白,我们提出了“思考与无思考校准”(JointThinking)这一全新的ICL范式:提示模型并行生成两个答案,一个来自Thinking模式,另一个来自Nothinking模式。只有当两个初始回答不一致时,才会用包含这两个不同答案的单一提示触发第二轮Thinking。在多个推理基准上的大量实验表明,JointThinking显著优于少样本思维链(CoT)、二次思考和多数投票。此外,它在分布内任务上的表现与基于训练的SOTA推理方法相当,而在分布外任务上则显著更优。我们进一步对校准机制进行了系统分析,展示了结构化思维多样性的重要性以及一致性检查的好处。此外,我们观察到,随着模型规模的增大,第二轮思考中实际推理与理想推理之间的性能差距不断缩小,表明我们的方法具有很强的可扩展性。最后,我们讨论了当前的局限性,并为RLLMs中未来的ICL研究勾勒了有前景的方向。
论文及项目相关链接
Summary
大型语言模型的推理能力近期备受瞩目,通过结构化的多步推理展现出卓越性能。尽管已有大量关于改进其训练和推理策略的研究,但其上下文学习(ICL)的潜力尚未得到充分探索。本文提出一种名为“思考与无思考校准”(JointThinking)的新ICL范式,它让模型并行生成两种答案:一种为Thinking模式,另一种为Nothinking模式,仅在两种初始回答不一致时才触发第二轮Thinking。在多个推理基准测试上的广泛实验表明,JointThinking显著优于少样本思维链(CoT)、二次思考和多数投票。此外,它在分布内任务上的表现与基于训练的最新推理方法相当,在分布外任务上则表现出显著优势。本文对校准机制进行了系统分析,强调了结构化思维多样性和一致性检查的重要性。同时观察到,随着模型规模的增大,实际与理想推理之间的性能差距缩小,表明该方法具有很强的可扩展性。
Key Takeaways
- 大型语言模型在结构和多步骤推理方面表现出卓越性能。
- 上下文学习(ICL)在大型语言模型中的潜力尚未充分探索。
- JointThinking是一种新的ICL范式,通过生成两种答案(思考模式和无思考模式)来触发第二轮思考。
- JointThinking在多个推理基准测试上表现优异,优于少样本思维链、二次思考和多数投票。
- JointThinking在分布内与分布外任务上均表现出强大性能。
- 系统分析强调了结构性思维多样性和一致性检查在校准机制中的重要性。
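下面给出 JointThinking 控制流的最小示意:先分别得到 Thinking 与 Nothinking 两个答案(实际可并行调用),仅在不一致时用包含两个候选答案的提示触发第二轮 Thinking;generate 为假设的占位接口,提示词仅作演示,并非原文所用提示。

```python
# 最小示意:Thinking / Nothinking 双答案 + 不一致时触发第二轮 Thinking。
def generate(prompt: str, thinking: bool) -> str:
    raise NotImplementedError   # 占位:thinking=True 时走多步推理模式

def joint_thinking(question: str) -> str:
    a_think = generate(question, thinking=True)
    a_nothink = generate(question, thinking=False)
    if a_think == a_nothink:                       # 一致性检查通过,直接返回
        return a_think
    recheck = (f"{question}\n候选答案一:{a_think}\n候选答案二:{a_nothink}\n"
               "请再次思考并给出最终答案。")
    return generate(recheck, thinking=True)        # 不一致时触发第二轮 Thinking
```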
点此查看论文截图






GLiDRE: Generalist Lightweight model for Document-level Relation Extraction
Authors:Robin Armingaud, Romaric Besançon
Relation Extraction (RE) is a fundamental task in Natural Language Processing, and its document-level variant poses significant challenges, due to complex interactions between entities across sentences. While supervised models have achieved strong results in fully resourced settings, their behavior with limited training data remains insufficiently studied. We introduce GLiDRE, a new compact model for document-level relation extraction, designed to work efficiently in both supervised and few-shot settings. Experiments in both low-resource supervised training and few-shot meta-learning benchmarks show that our approach outperforms existing methods in data-constrained scenarios, establishing a new state-of-the-art in few-shot document-level relation extraction. Our code will be publicly available.
关系抽取(RE)是自然语言处理中的一项基本任务,其文档级变体由于跨句子实体间的复杂交互而带来重大挑战。虽然监督模型在资源充足的环境中取得了很好的效果,但其在训练数据有限时的表现仍缺乏充分研究。我们提出了GLiDRE,一种新的紧凑型文档级关系抽取模型,旨在在有监督和少样本设置下都能高效工作。在低资源监督训练和少样本元学习基准上的实验表明,我们的方法在数据受限的场景中优于现有方法,在少样本文档级关系抽取上创造了新的最先进水平。我们的代码将公开发布。
论文及项目相关链接
PDF Submitted to ARR October
Summary
本文介绍了GLiDRE模型,这是一种针对文档级关系抽取的新颖紧凑模型。该模型旨在实现在有监督学习和少样本场景下的高效运作。实验表明,在数据受限的监督训练和少样本元学习基准测试中,该方法在数据受限的场景中优于现有方法,为文档级关系抽取领域树立了新的里程碑。
Key Takeaways
- GLiDRE是一个用于文档级关系抽取的紧凑模型。
- 该模型旨在在有监督学习和少样本场景下高效运作。
- 在数据受限的监督训练中,GLiDRE表现优于现有方法。
- 在少样本元学习基准测试中,GLiDRE为文档级关系抽取领域树立了新的里程碑。
- GLiDRE模型能够处理复杂的实体间交互。
- 该模型能够应对文档中的跨句子实体关系抽取挑战。
点此查看论文截图



Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
Authors:Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.
在互联网规模数据上训练的视觉语言模型(VLMs)在汽车、卡车和行人等常见物体上实现了出色的零样本检测性能。然而,最先进的模型在泛化到其预训练中不常见的分布外类别、任务和成像模态时仍然面临困难。我们主张,与其简单地用更多视觉数据重新训练VLM,不如使用包含少量视觉示例和丰富文本描述的标注说明来将VLM与新概念对齐。为此,我们推出了Roboflow100-VL,这是一个包含100个多模态目标检测数据集的大规模集合,涵盖了VLM预训练中不常见的多样化概念。我们在该基准上以零样本、少样本、半监督和全监督设置评估了最先进的模型,从而可以在不同数据规模下进行比较。值得注意的是,我们发现GroundingDINO和Qwen2.5-VL等VLM在Roboflow100-VL中具有挑战性的医学影像数据集上的零样本准确率低于2%,这表明需要少样本概念对齐。最后,我们讨论了近期的CVPR 2025基础FSOD竞赛,并分享了来自社区的见解。值得注意的是,获胜团队的表现超出我们的基线达17 mAP!我们的代码和数据集可在https://github.com/roboflow/rf100-vl 和 https://universe.roboflow.com/rf100-vl/ 获取。
论文及项目相关链接
PDF The first two authors contributed equally. This work has been accepted to the Neural Information Processing Systems (NeurIPS) 2025 Datasets & Benchmark Track. Project Page: https://rf100-vl.org/
Summary
本文指出,基于互联网规模数据训练的视觉语言模型(VLMs)在常见物体上具有出色的零样本检测性能,但现有模型在泛化到分布外类别、任务和成像模态方面仍存在挑战。为此,作者引入了Roboflow100-VL,这是一个包含100个多模态目标检测数据集的大规模集合,其中包含各种不常见于VLM预训练的概念。作者在不同的数据设置下评估了现有模型,发现一些模型在具有挑战性的医学影像数据集上的零样本准确率低于2%,突显了少样本概念对齐的必要性。此外,本文还讨论了CVPR 2025基础FSOD竞赛的见解,并分享了社区洞察。
Key Takeaways
- 视觉语言模型(VLMs)在常见物体上实现了出色的零样本检测性能。
- 现有模型在泛化到非分布类别、任务和成像模式时仍面临挑战。
- 引入Roboflow100-VL,一个包含多样化概念的大规模多模式对象检测数据集。
- 一些模型在挑战性医疗图像数据集上的零样本准确率较低,突显少样本概念对齐的必要性。
- CVPR 2025基础FSOD竞赛提供了有关模型性能的深入见解。
- 获胜团队在比赛中显著超越了基线17 mAP。
点此查看论文截图






ReLI: A Language-Agnostic Approach to Human-Robot Interaction
Authors:Linus Nwankwo, Bjoern Ellensohn, Ozan Özdenizci, Elmar Rueckert
Adapting autonomous agents for real-world industrial, domestic, and other daily tasks is currently gaining momentum. However, in global or cross-lingual application contexts, ensuring effective interaction with the environment and executing unrestricted human-specified tasks regardless of the language remains an unsolved problem. To address this, we propose ReLI, a language-agnostic approach that enables autonomous agents to converse naturally, semantically reason about their environment, and perform downstream tasks, regardless of the task instruction’s modality or linguistic origin. First, we ground large-scale pre-trained foundation models and transform them into language-to-action models that can directly provide common-sense reasoning and high-level robot control through natural, free-flow conversational interactions. Further, we perform cross-lingual adaptation of the models to ensure that ReLI generalises across the global languages. To demonstrate ReLI’s robustness, we conducted extensive experiments on various short- and long-horizon tasks, including zero- and few-shot spatial navigation, scene information retrieval, and query-oriented tasks. We benchmarked the performance on $140$ languages involving $70K+$ multi-turn conversations. On average, ReLI achieved over $90\%\pm0.2$ accuracy in cross-lingual instruction parsing and task execution success. These results demonstrate its potential to advance natural human-agent interaction in the real world while championing inclusive and linguistic diversity. Demos and resources will be public at: https://linusnep.github.io/ReLI/.
让自主智能体适应现实世界中的工业、家庭和其他日常任务正日益受到重视。然而,在全球化或跨语言的应用环境中,确保与环境的有效交互并执行不受限制的人类指定任务(无论使用何种语言)仍是一个尚未解决的问题。为此,我们提出了ReLI,一种与语言无关的方法,使自主智能体能够自然对话、对其环境进行语义推理并执行下游任务,而不受任务指令的模态或语言来源的限制。首先,我们对大规模预训练基础模型进行落地(grounding),将其转化为语言到动作模型,从而能够通过自然、流畅的对话交互直接提供常识推理和高层机器人控制。此外,我们对模型进行了跨语言适配,以确保ReLI能够泛化到全球各种语言。为了验证ReLI的稳健性,我们在各种短时程和长时程任务上进行了广泛实验,包括零样本和少样本空间导航、场景信息检索和面向查询的任务。我们在140种语言、超过7万次多轮对话上对性能进行了基准测试。平均而言,ReLI在跨语言指令解析和任务执行上取得了超过90%±0.2的成功率。这些结果证明了其在推动现实世界中自然人机交互方面的潜力,同时倡导包容性和语言多样性。演示和资源将公开于:https://linusnep.github.io/ReLI/。
论文及项目相关链接
Summary
该文介绍了针对自主代理人在真实世界中的应用问题,提出一种语言无关的方法ReLI。该方法能够使自主代理人进行自然对话,理解环境语义,执行下游任务,无论任务指令的方式或语言如何。实验证明,ReLI在不同语言和任务上具有良好的表现,能够实现跨语言适应,并具备高度的自然语言交互能力。ReLI模型具有广泛的应用前景,包括工业、家庭和其他日常任务中的实际应用。资源可通过网址进行访问:https://linusnep.github.io/ReLI/。
Key Takeaways
以下是文本中关键的见解概述:
- ReLI是一种语言无关的方法,旨在解决自主代理人在全球或跨语言环境中执行任务时的语言障碍问题。
- ReLI通过将大规模预训练基础模型转化为语言到行动模型,实现自然对话和机器人控制。
- ReLI进行了跨语言适应,确保其在全球范围内的多种语言中的通用性。
- ReLI在多种短期和长期任务上进行了广泛实验验证,包括零样本和少样本空间导航、场景信息检索和查询导向任务等。
- 实验结果显示,ReLI在跨语言指令解析和任务执行成功率方面达到了超过90%的平均准确率。
- ReLI展示了在自然人机交互中的潜力,并强调包容性和语言多样性。
点此查看论文截图




AutoPDL: Automatic Prompt Optimization for LLM Agents
Authors:Claudio Spiess, Mandana Vaziri, Louis Mandel, Martin Hirzel
The performance of large language models (LLMs) depends on how they are prompted, with choices spanning both the high-level prompting pattern (e.g., Zero-Shot, CoT, ReAct, ReWOO) and the specific prompt content (instructions and few-shot demonstrations). Manually tuning this combination is tedious, error-prone, and specific to a given LLM and task. Therefore, this paper proposes AutoPDL, an automated approach to discovering good LLM agent configurations. Our approach frames this as a structured AutoML problem over a combinatorial space of agentic and non-agentic prompting patterns and demonstrations, using successive halving to efficiently navigate this space. We introduce a library implementing common prompting patterns using the PDL prompt programming language. AutoPDL solutions are human-readable, editable, and executable PDL programs that use this library. This approach also enables source-to-source optimization, allowing human-in-the-loop refinement and reuse. Evaluations across three tasks and seven LLMs (ranging from 3B to 70B parameters) show consistent accuracy gains ($9.21\pm15.46$ percentage points), up to 67.5pp, and reveal that selected prompting strategies vary across models and tasks.
大型语言模型的性能取决于如何对其进行提示,提示的选择范围包括高级提示模式(例如Zero-Shot、CoT、ReAct、ReWOO)和特定提示内容(指令和少量示例)。手动调整这种结合方式既繁琐又容易出现错误,而且针对特定的LLM和任务。因此,本文提出了AutoPDL,一种发现良好LLM代理配置自动化方法。我们将这个问题构建为一个结构化AutoML问题,在代理和非代理提示模式及示例的组合空间上进行解决,使用连续减半法有效地遍历这个空间。我们引入了一个使用PDL提示编程语言的库来实现常见的提示模式。AutoPDL解决方案是可读、可编辑、可执行的PDL程序,使用此库。这种方法还实现了源到源的优化,允许人类参与循环优化和重用。在三个任务和七个LLM(从3B到70B参数)上的评估显示,准确率持续提高(9.21±15.46个百分点),最高达到67.5个百分点,并且显示所选的提示策略在不同模型和任务之间有所不同。
论文及项目相关链接
PDF Presented at AutoML 2025 (Methods Track); to be published in proceedings
Summary
大型语言模型的性能取决于提示方式,涉及高级提示模式(如Zero-Shot、CoT、ReAct、ReWOO)和具体提示内容(指令和少样本示例)。手动调整这种组合既繁琐又容易出错,且只适用于特定的大型语言模型和任务。因此,本文提出了AutoPDL,一种用于发现良好大型语言模型智能体配置的自动化方法。该方法将其构建为一个结构化AutoML问题,在智能体与非智能体提示模式及示例的组合空间中,用连续减半法(successive halving)高效搜索。本文引入了一个用PDL提示编程语言实现常见提示模式的库;AutoPDL给出的解决方案是基于该库的、可读、可编辑且可执行的PDL程序。该方法还支持源到源的优化,允许人在回路的改进与复用。在三个任务和七个大型语言模型(参数规模从3B到70B)上的评估显示,准确率持续提高(9.21±15.46个百分点),最高达67.5个百分点,并且所选择的提示策略在不同模型和任务之间有所不同。
Key Takeaways
- 大型语言模型的性能受提示方式影响,包括高级提示模式和具体提示内容。
- 手动调整提示组合既繁琐又易出错。
- AutoPDL是一种自动化方法,用于发现良好的大型语言模型代理配置。
- AutoPDL将问题构建为结构化自动机器学习问题,在代理和非代理提示模式及示例的组合空间中高效导航。
- 引入使用PDL提示编程语言实现常见提示模式的库,使AutoPDL的解决方案可读、可编辑且可执行。
- AutoPDL支持源到源的优化,允许人为参与优化和改进。
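下面是连续减半(successive halving)在候选提示配置上分配评测预算的最小示意;evaluate 为假设的占位函数,候选配置的表示方式也仅作演示,并非 PDL 程序本身。

```python
# 最小示意:连续减半——每轮用更多评测样本、淘汰得分靠后的一半候选配置。
def evaluate(config, n_examples: int) -> float:
    """占位:返回该提示配置在 n_examples 个任务样本上的准确率。"""
    raise NotImplementedError

def successive_halving(configs, min_examples=8, rounds=3):
    survivors = list(configs)
    budget = min_examples
    for _ in range(rounds):
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = scored[: max(1, len(scored) // 2)]   # 保留得分前一半
        budget *= 2                                       # 下一轮分配更多评测样本
        if len(survivors) == 1:
            break
    return survivors[0]

# 候选空间示意:提示模式 × 少样本示例数(均为假设的表示方式)
configs = [{"pattern": p, "fewshot": k}
           for p in ["zero-shot", "cot", "react"] for k in [0, 3, 5]]
```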
点此查看论文截图



CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
Authors:Cristiano Patrício, Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira, João C. Neves
The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the model output on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: https://cristianopatricio.github.io/CBVLM/.
在医疗工作流中采用基于深度学习的解决方案的主要挑战,是标注数据的稀缺以及此类系统缺乏可解释性。概念瓶颈模型(CBMs)通过将模型输出约束在一组预定义且人类可理解的概念上来解决后者的问题。然而,这种基于概念的解释所带来的可解释性提升,也意味着更高的标注负担;而且如果需要添加新概念,整个系统都需要重新训练。受大型视觉语言模型(LVLMs)在少样本设置中的出色表现的启发,我们提出了一种简单而有效的方法CBVLM,同时应对上述两个挑战。首先,对于每个概念,我们提示LVLM回答该概念是否出现在输入图像中;然后,我们让LVLM基于之前的概念预测对图像进行分类。此外,在这两个阶段中,我们都引入了一个检索模块,负责选择最佳示例用于上下文学习。通过将最终诊断建立在预测出的概念之上,我们确保了可解释性;通过利用LVLMs的少样本能力,我们大幅降低了标注成本。我们在四个医疗数据集和十二个LVLM(包括通用和医疗模型)上进行了广泛实验来验证该方法,结果表明CBVLM在不需要任何训练、仅使用少量标注样本的情况下,始终优于CBMs和针对特定任务的有监督方法。更多信息请见项目页面:https://cristianopatricio.github.io/CBVLM/
论文及项目相关链接
PDF Accepted for publication in Computers in Biology and Medicine
Summary
基于深度学习的医疗解决方案面临标注数据可用性和系统解释性两大挑战。概念瓶颈模型(CBMs)通过约束模型输出在预定义和可解释的概念集上来解决解释性问题,但这也增加了标注负担,且添加新概念需重新训练整个系统。受大型视觉语言模型(LVLMs)在少样本环境下的出色表现的启发,我们提出一种简单有效的CBVLM方法,同时应对上述两大挑战。我们引导LVLM判断输入图像中是否存在特定概念,并基于预测的概念对图像进行分类。同时,我们引入检索模块,负责选择最佳样本进行上下文学习。通过将最终诊断结果建立在预测概念上,我们确保了可解释性,并利用LVLMs的少样本能力大幅降低标注成本。实验证明,CBVLM在四个医疗数据集和十二个(通用和医疗)LVLM上的表现均优于CBMs和任务特定的监督方法,且无需任何训练,仅使用少量标注样本即可。
Key Takeaways
- 深度学习和医疗工作流中的挑战包括标注数据的可用性和系统解释性不足。
- 概念瓶颈模型(CBMs)通过约束模型输出在预定义和可解释的概念集上解决解释性问题,但增加了标注负担。
- CBVLM方法结合大型视觉语言模型(LVLMs)解决上述问题,通过引导LVLM判断特定概念的存在并对图像进行分类。
- CBVLM方法引入检索模块选择最佳样本进行上下文学习。
- CBVLM通过预测概念确保解释性,并利用LVLMs的少样本能力降低标注成本。
- 实验证明CBVLM在多个医疗数据集上的表现优于CBMs和任务特定监督方法。
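下面给出 CBVLM 两阶段提示流程的最小示意:先逐概念询问 LVLM,再把概念预测写进提示进行分类;ask_lvlm 为假设的占位接口,提示词仅作演示,检索上下文示例的模块在此省略。

```python
# 最小示意:两阶段提示——阶段一判断概念是否存在,阶段二基于概念预测做分类。
def ask_lvlm(image, prompt: str) -> str:
    raise NotImplementedError   # 占位:调用具体的大型视觉语言模型

def cbvlm_predict(image, concepts, classes):
    # 阶段一:逐概念询问是否出现在图像中
    present = {c: ask_lvlm(image, f"Is the concept '{c}' present? Answer yes or no.")
                   .strip().lower().startswith("yes")
               for c in concepts}
    # 阶段二:把概念预测写进提示,请 LVLM 给出最终类别,从而保证诊断可解释
    concept_text = ", ".join(f"{c}: {'yes' if v else 'no'}" for c, v in present.items())
    label = ask_lvlm(image, f"Concepts observed -> {concept_text}. "
                            f"Classify the image as one of {classes}.")
    return present, label
```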
点此查看论文截图




VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction
Authors:Yadi Cao, Yuxuan Liu, Liu Yang, Rose Yu, Hayden Schaeffer, Stanley Osher
In-Context Operator Networks (ICONs) have demonstrated the ability to learn operators across diverse partial differential equations using few-shot, in-context learning. However, existing ICONs process each spatial point as an individual token, severely limiting computational efficiency when handling dense data in higher spatial dimensions. We propose Vision In-Context Operator Networks (VICON), which integrates vision transformer architectures to efficiently process 2D data through patch-wise operations while preserving ICON’s adaptability to multiphysics systems and varying timesteps. Evaluated across three fluid dynamics benchmarks, VICON significantly outperforms state-of-the-art baselines: DPOT and MPP, reducing the averaged last-step rollout error by 37.9% compared to DPOT and 44.7% compared to MPP, while requiring only 72.5% and 34.8% of their respective inference times. VICON naturally supports flexible rollout strategies with varying timestep strides, enabling immediate deployment in imperfect measurement systems where sampling frequencies may differ or frames might be dropped - common challenges in real-world settings - without requiring retraining or interpolation. In these realistic scenarios, VICON exhibits remarkable robustness, experiencing only 24.41% relative performance degradation compared to 71.37%-74.49% degradation in baseline methods, demonstrating its versatility for deploying in realistic applications. Our scripts for processing datasets and code are publicly available at https://github.com/Eydcao/VICON.
上下文算子网络(In-Context Operator Networks, ICONs)已经展示了通过少样本上下文学习来学习各种偏微分方程算子的能力。然而,现有的ICON将每个空间点视为一个单独的token,在处理高维空间的密集数据时计算效率严重受限。我们提出了视觉上下文算子网络(VICON),它结合视觉Transformer架构,通过基于图像块(patch)的操作高效处理二维数据,同时保留ICON对多物理系统和不同时间步长的适应性。在三个流体动力学基准上的评估表明,VICON显著优于最新基线DPOT和MPP:与DPOT相比,平均最后一步滚动预测误差降低了37.9%,与MPP相比降低了44.7%,同时推理时间仅为二者的72.5%和34.8%。VICON天然支持不同时间步长步幅的灵活滚动预测策略,使其能够直接部署在采样频率可能不同或可能丢帧的不完美测量系统中(这是现实环境中的常见挑战),而无需重新训练或插值。在这些现实场景中,VICON表现出出色的稳健性,相对性能仅下降24.41%,而基线方法下降71.37%-74.49%,证明了其在实际应用中的通用性。我们用于处理数据集的脚本和代码公开于 https://github.com/Eydcao/VICON。
论文及项目相关链接
PDF update after NIPS suggestions
Summary
本文指出In-Context Operator Networks(ICONs)能够学习偏微分方程的算子,但在处理高维空间的密集数据时计算效率受限。为此,提出了Vision In-Context Operator Networks(VICON),结合视觉Transformer架构,通过基于图像块(patch)的操作高效处理二维数据,同时保留ICON在多物理系统和不同时间步长下的适应性。在三个流体动力学基准测试中,VICON显著优于现有基线,减少了平均最后一步滚动预测误差,并提高了推理效率。VICON支持灵活的时间步长策略,可在真实世界的不完美测量系统中部署,面对采样频率差异或帧丢失等挑战时表现出强大的稳健性。代码和数据集处理脚本已公开。
Key Takeaways
- ICONs能够通过少样本上下文学习在不同的偏微分方程上学习算子,但在处理高维空间密集数据时计算效率低下。
- VISION IN-CONTEXT OPERATOR NETWORKS(VICON)通过结合视觉转换器架构解决了这个问题,能够高效处理二维数据。
- VICON在流体动力学基准测试中显著优于现有技术基准,减少了平均最后一步滚动误差,并提高了推理效率。
- VICON支持灵活的时间步长策略,适应真实世界的不完美测量系统。
- VICON在处理采样频率差异或帧丢失等挑战时表现出强大的稳健性。
- VICON具有多物理系统和多变时间步长的适应性。
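下面用一个最小示意说明"把二维场数据切成 patch 序列"的做法,使每个 token 对应一个图像块而非单个空间点;patch 大小与通道数均为假设取值,仅演示 ViT 风格的分块方式。

```python
# 最小示意:把 (H, W, C) 的二维物理场切成 (N_patch, patch*patch*C) 的 token 序列。
import numpy as np

def patchify(field, patch=8):
    H, W, C = field.shape
    assert H % patch == 0 and W % patch == 0
    x = field.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                  # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)

u = np.random.default_rng(0).normal(size=(64, 64, 2))
print(patchify(u).shape)    # (64, 128):64 个 patch,每个展平为 128 维
```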
点此查看论文截图

