⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are only for initial screening before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-25
A Scalable, Causal, and Energy Efficient Framework for Neural Decoding with Spiking Neural Networks
Authors:Georgios Mentzelopoulos, Ioannis Asmanis, Konrad P. Kording, Eva L. Dyer, Kostas Daniilidis, Flavia Vitale
Brain-computer interfaces (BCIs) promise to enable vital functions, such as speech and prosthetic control, for individuals with neuromotor impairments. Central to their success are neural decoders, models that map neural activity to intended behavior. Current learning-based decoding approaches fall into two classes: simple, causal models that lack generalization, or complex, non-causal models that generalize and scale offline but struggle in real-time settings. Both face a common challenge: their reliance on power-hungry artificial neural network backbones, which makes integration into real-world, resource-limited systems difficult. Spiking neural networks (SNNs) offer a promising alternative. Because they operate causally, these models are suitable for real-time use, and their low energy demands make them ideal for battery-constrained environments. To this end, we introduce Spikachu: a scalable, causal, and energy-efficient neural decoding framework based on SNNs. Our approach processes binned spikes directly by projecting them into a shared latent space, where spiking modules, adapted to the timing of the input, extract relevant features; these latent representations are then integrated and decoded to generate behavioral predictions. We evaluate our approach on 113 recording sessions from 6 non-human primates, totaling 43 hours of recordings. Our method outperforms causal baselines when trained on single sessions using between 2.26 and 418.81 times less energy. Furthermore, we demonstrate that scaling up training to multiple sessions and subjects improves performance and enables few-shot transfer to unseen sessions, subjects, and tasks. Overall, Spikachu introduces a scalable, online-compatible neural decoding framework based on SNNs, whose performance is competitive relative to state-of-the-art models while consuming orders of magnitude less energy.
Paper and project links
Summary
This paper introduces Spikachu, a neural decoding framework built on spiking neural networks (SNNs). The framework is scalable, causal, and energy-efficient, and maps neural activity to behavioral predictions. Evaluations on non-human primate recordings show that Spikachu performs well at a fraction of the energy cost and supports few-shot transfer across sessions, subjects, and tasks.
Key Takeaways
- Spikachu is an SNN-based neural decoding framework that maps neural activity to behavioral predictions.
- SNNs operate causally, making them suitable for real-time use, and their low power draw suits resource-limited systems.
- Spikachu projects binned spikes into a shared latent space, then integrates and decodes the latent representations into behavioral predictions.
- Evaluated on non-human primate recordings, Spikachu outperforms causal baselines when trained on single sessions while consuming far less energy.
- Spikachu supports few-shot transfer across multiple sessions, subjects, and tasks.
- Spikachu provides a scalable, online-compatible neural decoding framework that is competitive with state-of-the-art models.
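A minimal sketch of the decoding pattern summarized above: binned spikes are projected into a shared latent space, a leaky integrate-and-fire (LIF) layer extracts spiking features, and a linear readout decodes behavior. The LIF dynamics, layer sizes, and random weights are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def lif_layer(inputs, tau=0.9, v_th=1.0):
    """Leaky integrate-and-fire layer over time: (T, d) currents -> (T, d) spikes."""
    v = np.zeros(inputs.shape[1])
    spikes = np.zeros_like(inputs)
    for t in range(len(inputs)):
        v = tau * v + inputs[t]                # leaky membrane integration
        spikes[t] = (v >= v_th).astype(float)  # emit a spike at threshold
        v = np.where(spikes[t] > 0, 0.0, v)    # reset membrane after a spike
    return spikes

rng = np.random.default_rng(0)
T, n_units, d_latent, d_out = 100, 96, 32, 2   # hypothetical sizes
binned = rng.poisson(0.5, size=(T, n_units)).astype(float)  # binned spike counts
W_in = rng.normal(0, 0.1, size=(n_units, d_latent))
W_out = rng.normal(0, 0.1, size=(d_latent, d_out))

latent = binned @ W_in        # project binned spikes into the shared latent space
features = lif_layer(latent)  # spiking module extracts temporal features causally
behavior = features @ W_out   # decode a behavioral prediction per time bin
print(behavior.shape)         # (100, 2)
```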
Click here to view paper screenshots
BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation
Authors:Haoyuan Li, Zhengyuan Shen, Sullam Jeoung, Yueyan Chen, Jiayu Li, Qi Zhu, Shuai Wang, Vassilis Ioannidis, Huzefa Rangwala
As structured texts become increasingly complex across diverse domains – from technical reports to generative AI prompts – the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL’s effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
Paper and project links
Summary
This paper proposes BoundRL, a new method for joint text segmentation and label prediction on long structured texts. By generating only the sequence of starting tokens for each segment and reconstructing the full contents by locating those tokens in the original text, it cuts inference costs by orders of magnitude and minimizes hallucination. The model is adapted to the output format with reinforcement learning with verifiable rewards (RLVR), using a reward designed to jointly optimize document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it constructs intermediate candidates by perturbing a fraction of the generated segment sequences, creating stepping stones toward higher-quality solutions. Evaluated on complex prompts used in LLM applications, BoundRL handles challenging structured texts effectively and lifts small language models above much larger ones.
Key Takeaways
- As structured texts grow more complex, segmentation into semantically meaningful components becomes critical.
- Conventional sentence- or paragraph-level segmentation cannot handle structured texts containing tables, code snippets, and placeholders.
- BoundRL jointly performs token-level segmentation and label prediction for long structured texts.
- BoundRL generates only starting-token sequences and reconstructs full contents, reducing inference costs and hallucination.
- RLVR with a specifically designed reward jointly optimizes reconstruction fidelity and semantic alignment.
- Constructing intermediate candidates mitigates entropy collapse and further improves solution quality.
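A sketch of the reconstruction step described above, assuming the model emits (label, starting-string) pairs: each prefix is located in the original text in order, and segments are sliced between consecutive matches, so segment contents are never generated and cannot be hallucinated. The function and toy document are hypothetical.

```python
def reconstruct_segments(text, boundaries):
    """Rebuild labeled segments from (label, starting_string) pairs."""
    starts, cursor = [], 0
    for label, prefix in boundaries:
        idx = text.find(prefix, cursor)   # locate the prefix in the source text
        if idx < 0:
            continue                      # unlocatable prefix: skip rather than invent
        starts.append((label, idx))
        cursor = idx + len(prefix)
    next_starts = [i for _, i in starts[1:]] + [len(text)]
    return [(label, text[i:j]) for (label, i), j in zip(starts, next_starts)]

doc = "# Report\nSome intro text.\nTABLE: a | b\nClosing remarks."
print(reconstruct_segments(doc, [
    ("heading", "# Report"),
    ("prose", "Some intro"),
    ("table", "TABLE:"),
    ("prose", "Closing"),
]))
```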
Click here to view paper screenshots
Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning
Authors:Yajie Li, Albert Galimov, Mitra Datta Ganapaneni, Pujitha Thejaswi, De Meng, Priyanshu Kumar, Saloni Potdar
Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals (both embedding- and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. These cases are then handled by a low-cost entity linker (e.g., ReFinED) and by more expensive targeted LLM-based reasoning, respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being twice as efficient in terms of the number of LLM tokens.
Paper and project links
Summary
Entity linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. Recent few-shot methods prompt large language models (LLMs) to reduce training requirements, but they are often inefficient because of expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) is a structured pipeline that reaches high performance without deep fine-tuning by combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. It computes a small set of complementary signals (both embedding- and LLM-based) over retrieved candidates to split contextual mentions into easy and hard cases, which are handled by a low-cost entity linker (e.g., ReFinED) and by more expensive targeted LLM-based reasoning, respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 of 6 datasets, and matches pipelines that apply LLM reasoning to every mention while using half as many LLM tokens.
Key Takeaways
- ARTER is a new entity-linking pipeline combining candidate generation, context-based scoring, adaptive routing, and selective reasoning.
- By using LLMs strategically, ARTER enables few-shot entity linking and reduces the need for model fine-tuning.
- ARTER computes complementary signals to separate easy from hard mentions and routes each kind to a different handler.
- ARTER outperforms ReFinED on standard benchmarks, with an average gain of +2.53% and up to +4.47% on some datasets.
- ARTER performs comparably to fully LLM-based reasoning pipelines while being far more efficient in LLM token usage.
- Combining a low-cost entity linker with targeted LLM reasoning yields efficient and accurate entity linking.
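A schematic of the adaptive-routing idea, with stub components standing in for the real candidate scorer, the fast linker (e.g., ReFinED), and the LLM call; the confidence threshold is a made-up hyperparameter.

```python
def route_mentions(mentions, score_fn, cheap_linker, llm_reasoner, threshold=0.8):
    """Send easy mentions to a cheap linker and hard ones to LLM reasoning."""
    results = {}
    for mention in mentions:
        if score_fn(mention) >= threshold:            # cheap complementary signals
            results[mention] = cheap_linker(mention)  # easy case: fast linker
        else:
            results[mention] = llm_reasoner(mention)  # hard case: targeted reasoning
    return results

# Toy usage with stub scoring and stub handlers.
confidence = {"Paris": 0.95, "Jordan": 0.30}
print(route_mentions(
    ["Paris", "Jordan"],
    score_fn=lambda m: confidence.get(m, 0.5),
    cheap_linker=lambda m: f"{m} -> top retrieved candidate (fast linker)",
    llm_reasoner=lambda m: f"{m} -> disambiguated with targeted LLM reasoning",
))
```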
Click here to view paper screenshots
SEMPO: Lightweight Foundation Models for Time Series Forecasting
Authors:Hui He, Kun Yi, Yuanchi Ma, Qi Zhang, Zhendong Niu, Guansong Pang
The recent boom of large pre-trained models has witnessed remarkable success in developing foundation models (FMs) for time series forecasting. Despite impressive performance across diverse downstream forecasting tasks, existing time series FMs possess massive network architectures and require substantial pre-training on large-scale datasets, which significantly hinders their deployment in resource-constrained environments. In response to this growing tension between versatility and affordability, we propose SEMPO, a novel lightweight foundation model that requires pretraining on relatively small-scale data, yet exhibits strong general time series forecasting performance. Concretely, SEMPO comprises two key modules: 1) an energy-aware SpEctral decomposition module, which substantially improves the utilization of pre-training data by modeling not only the high-energy frequency signals but also the low-energy yet informative frequency signals that are ignored in current methods; and 2) a Mixture-of-PrOmpts enabled Transformer, which learns heterogeneous temporal patterns through small dataset-specific prompts and adaptively routes time series tokens to prompt-based experts for parameter-efficient model adaptation across different datasets and domains. Equipped with these modules, SEMPO significantly reduces both pre-training data scale and model size, while achieving strong generalization. Extensive experiments on two large-scale benchmarks covering 16 datasets demonstrate the superior performance of SEMPO in both zero-shot and few-shot forecasting scenarios compared with state-of-the-art methods. Code and data are available at https://github.com/mala-lab/SEMPO.
Paper and project links
PDF Accepted by NeurIPS 2025
Summary
The boom in large pre-trained models has brought remarkable success to foundation models (FMs) for time series forecasting, but existing FMs have massive architectures and require substantial pre-training on large-scale data, hindering deployment in resource-constrained environments. To ease this tension between versatility and affordability, the authors propose SEMPO, a lightweight foundation model that needs pre-training on only relatively small-scale data yet forecasts strongly across general time series. SEMPO has two key modules: an energy-aware spectral decomposition module, which improves the utilization of pre-training data by modeling not only high-energy frequency signals but also the low-energy yet informative ones that current methods ignore; and a mixture-of-prompts Transformer, which learns heterogeneous temporal patterns through small dataset-specific prompts and adaptively routes time series tokens to prompt-based experts for parameter-efficient adaptation across datasets and domains. With these modules, SEMPO significantly reduces both pre-training data scale and model size while generalizing well. Experiments on two large benchmarks covering 16 datasets show SEMPO outperforming state-of-the-art methods in both zero-shot and few-shot forecasting. Code and data: https://github.com/mala-lab/SEMPO.
Key Takeaways
- SEMPO is a lightweight foundation model for time series forecasting that can be deployed in resource-constrained environments.
- Its energy-aware spectral decomposition module improves the utilization of pre-training data.
- Its mixture-of-prompts Transformer enables parameter-efficient adaptation across datasets and domains.
- SEMPO significantly reduces both pre-training data scale and model size.
- SEMPO performs strongly on large benchmarks covering many datasets.
- Its advantage is most pronounced in zero-shot and few-shot forecasting scenarios.
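A toy illustration of the energy-aware idea: split a series' spectrum into its high-energy bins and the low-energy remainder, so the weak but informative frequencies can be modeled explicitly instead of discarded. The FFT split and the top_frac parameter are illustrative, not SEMPO's actual module.

```python
import numpy as np

def energy_split(x, top_frac=0.1):
    """Split a series into high-energy and low-energy frequency components."""
    spec = np.fft.rfft(x)
    energy = np.abs(spec) ** 2
    k = max(1, int(len(spec) * top_frac))
    top = np.argsort(energy)[-k:]          # indices of the high-energy bins
    hi, lo = np.zeros_like(spec), spec.copy()
    hi[top], lo[top] = spec[top], 0        # partition the spectrum into two parts
    return np.fft.irfft(hi, n=len(x)), np.fft.irfft(lo, n=len(x))

t = np.arange(256)
x = np.sin(2 * np.pi * t / 32) + 0.2 * np.sin(2 * np.pi * t / 7)  # strong + weak periodicity
high_part, low_part = energy_split(x)
print(np.allclose(x, high_part + low_part))  # True: the decomposition is lossless
```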
Click here to view paper screenshots
Neural Variational Dropout Processes
Authors:Insu Jeon, Youngjin Park, Gunhee Kim
Learning to infer the conditional posterior model is a key step for robust meta-learning. This paper presents a new Bayesian meta-learning approach called Neural Variational Dropout Processes (NVDPs). NVDPs model the conditional posterior distribution based on a task-specific dropout; a low-rank product of Bernoulli experts meta-model is utilized for a memory-efficient mapping of dropout rates from a few observed contexts. It allows for a quick reconfiguration of a globally learned and shared neural network for new tasks in multi-task few-shot learning. In addition, NVDPs utilize a novel prior conditioned on the whole task data to optimize the conditional dropout posterior in the amortized variational inference. Surprisingly, this enables the robust approximation of task-specific dropout rates that can deal with a wide range of functional ambiguities and uncertainties. We compared the proposed method with other meta-learning approaches in the few-shot learning tasks such as 1D stochastic regression, image inpainting, and classification. The results show the excellent performance of NVDPs.
Paper and project links
PDF Accepted as a Poster at International Conference on Learning Representations (ICLR) 2022 (Apr 25-29, 2022)
Summary
Neural Variational Dropout Processes (NVDPs) are a new Bayesian meta-learning approach for inferring conditional posterior models in few-shot settings. NVDPs model the conditional posterior distribution with a task-specific dropout, using a low-rank product-of-Bernoulli-experts meta-model to map dropout rates from a few observed contexts in a memory-efficient way. This allows a globally learned, shared neural network to be quickly reconfigured for new tasks in multi-task few-shot learning. NVDPs also use a novel prior conditioned on the whole task data to optimize the conditional dropout posterior in amortized variational inference, enabling robust approximation of task-specific dropout rates that can handle a wide range of functional ambiguities and uncertainties. In few-shot tasks such as 1D stochastic regression, image inpainting, and classification, NVDPs perform excellently.
Key Takeaways
- NVDPs are a new Bayesian meta-learning approach for conditional posterior inference in few-shot learning.
- NVDPs model the conditional posterior with task-specific dropout and a memory-efficient mapping of dropout rates.
- NVDPs allow quick reconfiguration of a globally learned, shared network for new tasks in multi-task few-shot learning.
- A low-rank product of Bernoulli experts handles functional ambiguity and uncertainty.
- A novel prior conditioned on the whole task data optimizes the conditional dropout posterior in amortized variational inference.
- NVDPs excel across few-shot tasks, including 1D stochastic regression, image inpainting, and classification.
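A rough sketch of the memory-efficient dropout-rate mapping: a product of Bernoulli experts corresponds to summing expert logits through a sigmoid, and a low-rank factorization keeps the context-to-rates map small. The shapes, the mean-pooled context summary, and the sampling step are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropout_rates_from_context(context, U, V):
    """Map a few observed context points to per-unit dropout rates.

    The context summary is projected through rank-r factors U (d_ctx x r)
    and V (r x n_units); each unit's rate combines expert logits through a
    sigmoid, i.e. a (normalized) product of Bernoulli experts.
    """
    summary = context.mean(axis=0)   # aggregate the observed context set
    logits = summary @ U @ V         # low-rank combination of expert logits
    return sigmoid(logits)           # task-specific dropout rates

rng = np.random.default_rng(1)
d_ctx, r, n_units = 16, 4, 128
context = rng.normal(size=(5, d_ctx))   # 5 encoded context observations
U, V = rng.normal(size=(d_ctx, r)), rng.normal(size=(r, n_units))
rates = dropout_rates_from_context(context, U, V)
mask = rng.random(n_units) < rates      # sample a task-specific dropout mask
print(rates.shape, mask.mean())
```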
Click here to view paper screenshots
Learning Noise-Resilient and Transferable Graph-Text Alignment via Dynamic Quality Assessment
Authors:Yuhang Liu, Minglai Shao, Zengyi Wo, Yunlong Chu, Bing Hao, Shengzhong Liu, Ruijie Wang, Jianxin Li
Pre-training Graph Foundation Models (GFMs) on text-attributed graphs (TAGs) is central to web-scale applications such as search, recommendation, and knowledge discovery. However, existing CLIP-style graph-text aligners face two key limitations: they assume strict one-to-one correspondences between nodes and texts, overlooking the inherent many-to-many relations in real-world graphs; and they rely on static alignment objectives that cannot adapt to varying data quality, making them brittle under noisy supervision. Together, these limitations expose a core dilemma: embracing expressive many-to-many alignment amplifies noise, while reverting to strict one-to-one strategies sacrifices semantic diversity and fails to handle inherently mismatched pairs. To address these challenges, we propose ADAligner, a dynamic, quality-aware graph-text alignment framework that dynamically adjusts between expressive many-to-many and conservative one-to-one objectives according to supervision quality. ADAligner estimates batch-level alignment reliability in real time and adapts its optimization accordingly, promoting soft, subgraph-level many-to-many alignment when supervision is clean, while emphasizing reliable one-to-one alignment by dynamically filtering low-confidence pairs under noise. Theoretically, we prove that this dynamic mechanism forms a stable negative feedback process, ensuring convergence and robustness. Comprehensive experiments on nine diverse TAG datasets demonstrate that ADAligner consistently outperforms prior graph-text aligners on zero-/few-shot node classification, link prediction and cross-modal retrieval tasks. It maintains strong robustness under noisy supervision and accelerates pre-training by approximately 2 to 3 times compared to multimodal baselines, establishing a scalable and reliable foundation for graph-text representation learning in real-world web environments.
Paper and project links
Summary
This paper addresses pre-training Graph Foundation Models (GFMs) on text-attributed graphs (TAGs) and the key limitations of existing CLIP-style graph-text aligners. It proposes ADAligner, a dynamic, quality-aware graph-text alignment framework that adjusts between expressive many-to-many and conservative one-to-one objectives according to supervision quality: when supervision is clean it promotes soft, subgraph-level many-to-many alignment, and under noise it emphasizes reliable one-to-one alignment by dynamically filtering low-confidence pairs. Experiments show that ADAligner outperforms prior graph-text aligners on zero-/few-shot node classification, link prediction, and cross-modal retrieval, remains robust under noisy supervision, and accelerates pre-training by roughly 2 to 3 times.
Key Takeaways
- Pre-training GFMs on text-attributed graphs (TAGs) is central to web-scale applications.
- Existing CLIP-style graph-text aligners assume strict one-to-one node-text correspondence, overlooking the many-to-many relations in real-world graphs, and cannot adapt to varying data quality.
- ADAligner is a dynamic, quality-aware graph-text alignment framework that adapts its alignment strategy to supervision quality.
- Under clean supervision ADAligner uses soft, subgraph-level many-to-many alignment; under noise it emphasizes reliable one-to-one alignment.
- ADAligner improves robustness by dynamically filtering low-confidence pairs.
- Experiments across diverse TAG datasets show ADAligner outperforming other graph-text aligners, especially on zero-/few-shot node classification, link prediction, and cross-modal retrieval.
- ADAligner stays robust under noisy supervision and accelerates pre-training.
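An illustrative loss that mixes the two regimes by an estimated batch reliability: clean batches (diagonal-dominant similarities) weight a soft many-to-many term, while noisy batches weight a one-to-one InfoNCE term with low-confidence pairs filtered out. The reliability estimator, soft targets, and hyperparameters are stand-ins, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def quality_aware_loss(z_g, z_t, tau=0.07, conf_cut=0.1):
    """z_g, z_t: L2-normalized graph/text embeddings of shape (B, d)."""
    logits = z_g @ z_t.T / tau                   # (B, B) cross-modal similarities
    p_diag = logits.softmax(dim=1).diag()
    reliability = p_diag.mean().detach()         # clean batch -> diagonal dominates

    # Conservative one-to-one term: InfoNCE with low-confidence pairs dropped.
    targets = torch.arange(len(logits), device=logits.device)
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    keep = (p_diag > conf_cut).float().detach()
    loss_one = (per_pair * keep).sum() / keep.sum().clamp_min(1.0)

    # Expressive many-to-many term: soft targets from intra-text similarity.
    soft = (z_t @ z_t.T / tau).softmax(dim=1).detach()
    loss_many = -(soft * logits.log_softmax(dim=1)).sum(dim=1).mean()

    return reliability * loss_many + (1 - reliability) * loss_one

z_g = F.normalize(torch.randn(8, 32), dim=1)
z_t = F.normalize(torch.randn(8, 32), dim=1)
print(quality_aware_loss(z_g, z_t).item())
```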
Click here to view paper screenshots
Enhancing Early Alzheimer Disease Detection through Big Data and Ensemble Few-Shot Learning
Authors:Safa Ben Atitallah, Maha Driss, Wadii Boulila, Anis Koubaa
Alzheimer disease is a severe brain disorder that causes harm in various brain areas and leads to memory damage. The limited availability of labeled medical data poses a significant challenge for accurate Alzheimer disease detection. There is a critical need for effective methods to improve the accuracy of Alzheimer disease detection, considering the scarcity of labeled data, the complexity of the disease, and the constraints related to data privacy. To address this challenge, our study leverages the power of big data in the form of pre-trained Convolutional Neural Networks (CNNs) within the framework of Few-Shot Learning (FSL) and ensemble learning. We propose an ensemble approach based on a Prototypical Network (ProtoNet), a powerful method in FSL, integrating various pre-trained CNNs as encoders. This integration enhances the richness of features extracted from medical images. Our approach also includes a combination of class-aware loss and entropy loss to ensure a more precise classification of Alzheimer disease progression levels. The effectiveness of our method was evaluated using two datasets, the Kaggle Alzheimer dataset and the ADNI dataset, achieving an accuracy of 99.72% and 99.86%, respectively. The comparison of our results with relevant state-of-the-art studies demonstrated that our approach achieved superior accuracy and highlighted its validity and potential for real-world applications in early Alzheimer disease detection.
Paper and project links
Summary
This study leverages big data in the form of pre-trained convolutional neural networks (CNNs) within a few-shot learning (FSL) and ensemble-learning framework to tackle Alzheimer disease detection. It proposes an ensemble approach based on a Prototypical Network (ProtoNet) that integrates multiple pre-trained CNNs as encoders, enriching the features extracted from medical images, and combines a class-aware loss with an entropy loss for more precise classification of disease progression levels. On the Kaggle Alzheimer dataset and the ADNI dataset, the method reaches 99.72% and 99.86% accuracy, respectively, surpassing relevant state-of-the-art results.
Key Takeaways
- Alzheimer disease is a severe brain disorder that harms multiple brain regions and damages memory.
- The limited availability of labeled medical data is a major challenge for accurate Alzheimer disease detection.
- The study combines pre-trained CNNs with a few-shot learning (FSL) framework to address this challenge.
- An ensemble ProtoNet integrates multiple pre-trained CNN encoders, enriching the features extracted from medical images.
- Combining a class-aware loss and an entropy loss enables more precise classification of disease progression levels.
- Experiments on the Kaggle Alzheimer and ADNI datasets show very high accuracy.
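A compact sketch of prototypical-network classification with an ensemble of encoders, as in the summary above: each encoder's normalized embedding is concatenated, class prototypes are the mean support embeddings, and queries are classified by distance to prototypes. The stub encoders and sizes are hypothetical, not the paper's pretrained CNN backbones.

```python
import torch
import torch.nn.functional as F

def prototype_classify(support, support_y, query, encoders, n_classes):
    """Few-shot classification by nearest class prototype, ensemble features."""
    def embed(x):
        return torch.cat([F.normalize(e(x), dim=1) for e in encoders], dim=1)

    z_s, z_q = embed(support), embed(query)
    protos = torch.stack([z_s[support_y == c].mean(dim=0) for c in range(n_classes)])
    dists = torch.cdist(z_q, protos)   # Euclidean distance to each class prototype
    return (-dists).softmax(dim=1)     # per-query class probabilities

# Toy usage: two stub "encoders" over flattened 8x8 grayscale images.
enc1 = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 16))
enc2 = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 16))
support = torch.randn(10, 1, 8, 8)
support_y = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
query = torch.randn(4, 1, 8, 8)
print(prototype_classify(support, support_y, query, [enc1, enc2], n_classes=2).shape)
```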
Click here to view paper screenshots
Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts
Authors:Seungjun Yu, Junsung Park, Youngsun Lim, Hyunjung Shim
We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.
Paper and project links
Summary
This paper presents a two-phase vision-language QA system for perception, prediction, and planning in autonomous driving. Phase 1 conditions a large multimodal LLM on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars, and a self-consistency ensemble (multiple sampled reasoning chains) improves answer reliability. Phase 2 augments the prompt with scene metadata (object annotations, ego-vehicle state, etc.) and task-specific instructions. Experiments show the approach clearly outperforms the baseline models and remains accurate under severe visual corruption.
Key Takeaways
- A two-phase vision-language QA system is proposed for perception, prediction, and planning in autonomous driving.
- Phase 1 conditions a large multimodal LLM on multi-camera input and reasons via chain-of-thought prompting with few-shot exemplars.
- A self-consistency ensemble improves answer reliability.
- Phase 2 strengthens the prompt with scene metadata and task-specific question instructions.
- The system performs strongly on a driving QA benchmark, clearly beating the baseline models.
- The system maintains high accuracy under severe visual corruption.
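A minimal version of the self-consistency ensemble used in Phase 1: sample several reasoning chains and majority-vote the final answer, using agreement as a rough confidence signal. The stub model below stands in for a temperature-sampled VLM call.

```python
import random
from collections import Counter

def self_consistent_answer(ask_fn, question, n_samples=5):
    """Self-consistency: sample several reasoning chains, majority-vote the answer."""
    votes = Counter(ask_fn(question) for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples   # answer plus agreement as a confidence proxy

random.seed(0)
stub_model = lambda q: random.choice(["B", "B", "B", "C"])  # noisy but biased to "B"
print(self_consistent_answer(stub_model, "Which lane should the ego vehicle take?"))
```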
Click here to view paper screenshots
Prompting the Priorities: A First Look at Evaluating LLMs for Vulnerability Triage and Prioritization
Authors:Osama Al Haddad, Muhammad Ikram, Ejaz Ahmed, Young Lee
Security analysts face increasing pressure to triage large and complex vulnerability backlogs. Large Language Models (LLMs) offer a potential aid by automating parts of the interpretation process. We evaluate four models (ChatGPT, Claude, Gemini, and DeepSeek) across twelve prompting techniques to interpret semi-structured and unstructured vulnerability information. As a concrete use case, we test each model’s ability to predict decision points in the Stakeholder-Specific Vulnerability Categorization (SSVC) framework: Exploitation, Automatable, Technical Impact, and Mission and Wellbeing. Using 384 real-world vulnerabilities from the VulZoo dataset, we issued more than 165,000 queries to assess performance under prompting styles including one-shot, few-shot, and chain-of-thought. We report F1 scores for each SSVC decision point and Cohen’s kappa (weighted and unweighted) for the final SSVC decision outcomes. Gemini consistently ranked highest, leading on three of four decision points and yielding the most correct recommendations. Prompting with exemplars generally improved accuracy, although all models struggled on some decision points. Only DeepSeek achieved fair agreement under weighted metrics, and all models tended to over-predict risk. Overall, current LLMs do not replace expert judgment. However, specific LLM and prompt combinations show moderate effectiveness for targeted SSVC decisions. When applied with care, LLMs can support vulnerability prioritization workflows and help security teams respond more efficiently to emerging threats.
Paper and project links
PDF 19 pages, 8 figures
Summary
Security analysts face growing pressure to triage large, complex vulnerability backlogs, and large language models (LLMs) can help by automating parts of the interpretation process. The authors evaluate four models (ChatGPT, Claude, Gemini, and DeepSeek) across twelve prompting techniques on semi-structured and unstructured vulnerability information, testing each model's ability to predict the decision points of the Stakeholder-Specific Vulnerability Categorization (SSVC) framework: Exploitation, Automatable, Technical Impact, and Mission and Wellbeing. Using 384 real-world vulnerabilities from the VulZoo dataset, they issued more than 165,000 queries under prompting styles including one-shot, few-shot, and chain-of-thought, reporting F1 scores per decision point and Cohen's kappa (weighted and unweighted) for the final SSVC outcomes. Gemini ranked highest overall, leading on three of four decision points and producing the most correct recommendations. Prompting with exemplars generally improved accuracy, yet all models struggled on some decision points; only DeepSeek achieved fair agreement under weighted metrics, and all models tended to over-predict risk. Overall, current LLMs do not replace expert judgment, but specific LLM-prompt combinations are moderately effective for targeted SSVC decisions and can support vulnerability-prioritization workflows.
Key Takeaways
- LLMs show potential for automating the interpretation of vulnerability information, especially for large, complex backlogs.
- Four LLMs were evaluated on predicting decision points of the SSVC framework; Gemini performed best on most metrics.
- Exemplar-based prompting improves accuracy, but all models still struggle on some decision points.
- Current LLMs cannot fully replace expert judgment, but can serve as effective support tools in specific settings.
- Applied carefully, LLMs can streamline vulnerability-prioritization workflows and help security teams respond to emerging threats more efficiently.
- All models showed a tendency to over-predict risk.
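For reference, the agreement metrics reported above can be computed with scikit-learn; the labels below are made-up SSVC-style decision outcomes, and linear weighting makes distant confusions (e.g., Track vs. Act) cost more than adjacent ones.

```python
from sklearn.metrics import cohen_kappa_score, f1_score

# Toy illustration of the reported metrics with fabricated decision outcomes.
truth = ["Act", "Track", "Track", "Attend", "Act", "Track"]
model = ["Act", "Attend", "Track", "Act", "Act", "Track"]

labels = ["Track", "Attend", "Act"]  # ordinal severity order matters for weighting
print("macro F1:", f1_score(truth, model, labels=labels, average="macro"))
print("unweighted kappa:", cohen_kappa_score(truth, model, labels=labels))
# Linear weighting penalizes Track<->Act confusions more than adjacent ones.
print("weighted kappa:", cohen_kappa_score(truth, model, labels=labels, weights="linear"))
```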
Click here to view paper screenshots
TabR1: Taming GRPO for tabular reasoning LLMs
Authors:Pengxiang Cai, Zihao Gao, Jintai Chen
Tabular prediction has traditionally relied on gradient-boosted decision trees and specialized deep learning models, which excel within tasks but provide limited interpretability and weak transfer across tables. Reasoning large language models (LLMs) promise cross-task adaptability with transparent reasoning traces, yet their potential has not been fully realized for tabular data. This paper presents TabR1, the first reasoning LLM for tabular prediction with multi-step reasoning. At its core is Permutation Relative Policy Optimization (PRPO), a simple yet efficient reinforcement learning method that encodes column-permutation invariance as a structural prior. By constructing multiple label-preserving permutations per sample and estimating advantages both within and across permutations, PRPO transforms sparse rewards into dense learning signals and improves generalization. With limited supervision, PRPO activates the reasoning ability of LLMs for tabular prediction, enhancing few-shot and zero-shot performance as well as interpretability. Comprehensive experiments demonstrate that TabR1 achieves performance comparable to strong baselines under full-supervision fine-tuning. In the zero-shot setting, TabR1 approaches the performance of strong baselines under the 32-shot setting. Moreover, TabR1 (8B) substantially outperforms much larger LLMs across various tasks, achieving up to 53.17% improvement over DeepSeek-R1 (685B).
Paper and project links
Summary
This paper presents TabR1, a reasoning LLM for tabular prediction with multi-step reasoning. Its core is Permutation Relative Policy Optimization (PRPO), a simple yet efficient reinforcement learning method that encodes column-permutation invariance as a structural prior; with limited supervision, PRPO activates the reasoning ability of LLMs for tabular prediction and improves few-shot and zero-shot performance as well as interpretability. Experiments show that TabR1 matches strong baselines under full-supervision fine-tuning, approaches their 32-shot performance in the zero-shot setting, and that TabR1 (8B) substantially outperforms much larger LLMs, with up to a 53.17% improvement over DeepSeek-R1 (685B).
Key Takeaways
- TabR1 is the first reasoning LLM designed for tabular prediction with multi-step reasoning.
- PRPO, the core of the model, is a simple and efficient RL method that encodes column-permutation invariance as a structural prior.
- PRPO constructs multiple label-preserving permutations per sample and estimates advantages both within and across permutations, turning sparse rewards into dense learning signals and improving generalization.
- With limited supervision, PRPO activates LLM reasoning for tabular prediction, improving few-shot and zero-shot performance and interpretability.
- Under full-supervision fine-tuning TabR1 is comparable to strong baselines, and in the zero-shot setting it approaches their 32-shot performance.
- TabR1 (8B) significantly outperforms much larger LLMs across tasks, with gains of up to 53.17% over DeepSeek-R1 (685B).
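A simplified reading of the PRPO recipe: serialize each sample under several label-preserving column orders, then compute group-relative advantages both within each permutation and across permutation means. The serialization format, the alpha/beta weights, and the estimator itself are illustrative, not the paper's exact formulation.

```python
import random
import statistics

def column_permutations(row, n_perms, seed=0):
    """Serialize one tabular sample under several column orders (label unchanged)."""
    rng = random.Random(seed)
    cols = list(row.items())
    views = []
    for _ in range(n_perms):
        rng.shuffle(cols)
        views.append("; ".join(f"{k}={v}" for k, v in cols))
    return views

def prpo_advantages(rewards, alpha=1.0, beta=1.0):
    """Center rewards within each permutation and across permutation means.

    rewards[i][j] is the reward of rollout j under permutation i; alpha/beta
    weight the within- and cross-permutation components (illustrative).
    """
    perm_means = [statistics.mean(r) for r in rewards]
    grand = statistics.mean(perm_means)
    return [[alpha * (r - perm_means[i]) + beta * (perm_means[i] - grand)
             for r in rewards[i]] for i in range(len(rewards))]

print(column_permutations({"age": 52, "bmi": 31.4, "smoker": "yes"}, n_perms=2))
print(prpo_advantages([[1, 0, 1], [0, 0, 1]]))
```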
Click here to view paper screenshots
Graph Few-Shot Learning via Adaptive Spectrum Experts and Cross-Set Distribution Calibration
Authors:Yonghao Liu, Yajun Wang, Chunli Guo, Wei Pang, Ximing Li, Fausto Giunchiglia, Xiaoyue Feng, Renchu Guan
Graph few-shot learning has attracted increasing attention due to its ability to rapidly adapt models to new tasks with only limited labeled nodes. Despite the remarkable progress made by existing graph few-shot learning methods, several key limitations remain. First, most current approaches rely on predefined and unified graph filters (e.g., low-pass or high-pass filters) to globally enhance or suppress node frequency signals. Such fixed spectral operations fail to account for the heterogeneity of local topological structures inherent in real-world graphs. Moreover, these methods often assume that the support and query sets are drawn from the same distribution. However, under few-shot conditions, the limited labeled data in the support set may not sufficiently capture the complex distribution of the query set, leading to suboptimal generalization. To address these challenges, we propose GRACE, a novel Graph few-shot leaRning framework that integrates Adaptive spectrum experts with Cross-sEt distribution calibration techniques. Theoretically, the proposed approach enhances model generalization by adapting to both local structural variations and cross-set distribution calibration. Empirically, GRACE consistently outperforms state-of-the-art baselines across a wide range of experimental settings. Our code can be found here.
Paper and project links
PDF NeurIPS25
Summary
This paper introduces GRACE, a new framework for graph few-shot learning. GRACE targets the limitations of existing methods on real-world graphs: fixed, predefined spectral filters cannot adapt to the heterogeneity of local topological structures, and the common assumption that support and query sets share a distribution often fails under few-shot conditions. By integrating adaptive spectrum experts with cross-set distribution calibration, GRACE improves model generalization.
Key Takeaways
- Graph few-shot learning rapidly adapts models to new tasks from only a few labeled nodes.
- Existing methods rely on predefined, unified graph filters that globally enhance or suppress node frequency signals, which is limiting.
- Existing methods also assume the support and query sets come from the same distribution, which may not hold under few-shot conditions.
- GRACE is a new graph few-shot learning framework that addresses both problems via adaptive spectrum experts and cross-set distribution calibration.
- GRACE adapts to local structural variations and to distribution differences between support and query sets.
- GRACE consistently outperforms state-of-the-art baselines across a wide range of experimental settings.
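A toy version of the adaptive-spectrum idea: instead of one fixed global filter, each node gates between a low-pass and a high-pass filtered view of its features. Two experts and the sigmoid gate are illustrative simplifications of the paper's spectrum experts.

```python
import numpy as np

def spectrum_expert_layer(A, X, W_gate):
    """Mix low-pass and high-pass graph filters with per-node gates."""
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-8)))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt          # symmetrically normalized adjacency
    low = A_hat @ X                              # low-pass: smooth over neighbors
    high = X - A_hat @ X                         # high-pass: emphasize local contrast
    gate = 1.0 / (1.0 + np.exp(-(X @ W_gate)))   # per-node mixing weight in (0, 1)
    return gate * low + (1.0 - gate) * high

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                   # symmetric graph, no self-loops
X = rng.normal(size=(6, 8))
print(spectrum_expert_layer(A, X, rng.normal(size=(8, 1))).shape)  # (6, 8)
```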
Click here to view paper screenshots
mmWalk: Towards Multi-modal Multi-view Walking Assistance
Authors:Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) using zero- and few-shot settings and found they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.
Paper and project links
PDF Accepted by NeurIPS 2025 Datasets and Benchmarks Track. Data and Code: https://github.com/KediYing/mmWalk
Summary
This paper presents mmWalk, a simulated multi-modal dataset for walking assistance for people with blindness or low vision (BLV) in extreme or complex environments. The dataset integrates multi-view sensors and accessibility-oriented features for safe outdoor navigation, comprising scenario-categorized walking trajectories and panoramic imagery across RGB, depth, and semantic modalities, with outdoor corner cases and accessibility-specific landmarks in each trajectory. The authors also build mmWalkVQA, a VQA benchmark for safe and informed walking assistance, evaluate state-of-the-art vision-language models, which struggle on the risk-assessment and navigation tasks, and validate an mmWalk-finetuned model on real-world data.
Key Takeaways
- mmWalk is a simulated multi-modal dataset addressing walking assistance for BLV users in extreme or complex environments.
- mmWalk integrates multi-view sensors and accessibility-oriented features for safe outdoor navigation.
- The dataset covers diverse scenario trajectories and image modalities, emphasizing outdoor corner cases and accessibility-specific landmarks.
- State-of-the-art vision-language models struggle on the associated risk-assessment and navigation tasks.
- Validation on real-world datasets shows the effectiveness of an mmWalk-finetuned model.
- The dataset can advance multi-modal walking-assistance technology.
Click here to view paper screenshots
VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning
Authors:Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin
Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
Paper and project links
PDF Accepted by NeurIPS 2025
Summary
Few-shot learning (FSL) aims to recognize novel concepts from a few labeled support samples, but existing methods that add semantic information often hallucinate semantics that contradict the visual evidence, producing noisy guidance and costly corrections. This paper proposes VT-FSL, a framework that bridges vision and text with LLMs: it builds precise cross-modal prompts conditioned on LLMs and support images and integrates them through a geometry-aware alignment. Cross-modal Iterative Prompting (CIP) conditions an LLM on class names and support images to generate precise class descriptions in a single structured reasoning pass; these descriptions enrich the semantic understanding of novel classes and enable zero-shot synthesis of semantically consistent images. Descriptions and synthetic images serve as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Cross-modal Geometric Alignment (CGA) then jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span, capturing global and nonlinear relationships among all representations. VT-FSL sets new state-of-the-art results across ten benchmarks spanning standard, cross-domain, and fine-grained few-shot learning.
Key Takeaways
- VT-FSL bridges vision and text with LLMs for few-shot learning, addressing the semantic-hallucination problem of existing models.
- Cross-modal Iterative Prompting (CIP) combines class names and support images to generate precise class descriptions, enriching semantic understanding and enabling zero-shot image synthesis.
- Descriptions and synthetic images act as complementary textual and visual prompts, supplying high-level semantics and low-level intra-class diversity to compensate for limited support data.
- Cross-modal Geometric Alignment (CGA) jointly aligns textual, support, and synthetic visual representations, capturing global and nonlinear relationships for consistent multimodal integration.
- VT-FSL achieves state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained scenarios.
- Combining visual and textual information improves generalization, offering a new solution for few-shot learning tasks.
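The kernelized parallelotope volume mentioned above has a compact closed form: for three embeddings, the squared volume in the kernel feature space is the determinant of their 3x3 Gram matrix. A minimal sketch with an RBF kernel, where gamma is an assumed hyperparameter:

```python
import torch

def parallelotope_volume_loss(z_text, z_support, z_synth, gamma=1.0):
    """Squared kernelized volume of the parallelotope spanned by three embeddings."""
    Z = torch.stack([z_text, z_support, z_synth])   # (3, d)
    sq = torch.cdist(Z, Z) ** 2                     # pairwise squared distances
    K = torch.exp(-gamma * sq)                      # RBF Gram matrix, shape (3, 3)
    return torch.det(K).clamp_min(0.0)              # det(Gram) = squared volume >= 0

z_t, z_s, z_g = torch.randn(64), torch.randn(64), torch.randn(64)
print(parallelotope_volume_loss(z_t, z_s, z_g).item())        # near 1 when unaligned
print(parallelotope_volume_loss(z_t, z_t + 1e-3, z_t).item()) # near 0 when aligned
```

Driving this quantity toward zero pulls the three representations together in the kernel feature space, which is one way to read "capturing global and nonlinear relationships" in the abstract.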
Click here to view paper screenshots
PlantSegNeRF: A few-shot, cross-species method for plant 3D instance point cloud reconstruction via joint-channel NeRF with multi-view image instance matching
Authors:Xin Yang, Ruiming Du, Hanyang Huang, Jiayang Xie, Pengyao Xie, Leisen Fang, Ziyue Guo, Nanjun Jiang, Yu Jiang, Haiyan Cen
Organ segmentation of plant point clouds is a prerequisite for the high-resolution and accurate extraction of organ-level phenotypic traits. Although the fast development of deep learning has boosted much research on segmentation of plant point clouds, the existing techniques for organ segmentation still face limitations in resolution, segmentation accuracy, and generalizability across various plant species. In this study, we proposed a novel approach called plant segmentation neural radiance fields (PlantSegNeRF), aiming to directly generate high-precision instance point clouds from multi-view RGB image sequences for a wide range of plant species. PlantSegNeRF performed 2D instance segmentation on the multi-view images to generate instance masks for each organ with a corresponding ID. The multi-view instance IDs corresponding to the same plant organ were then matched and refined using a specially designed instance matching module. The instance NeRF was developed to render an implicit scene, containing color, density, semantic and instance information. The implicit scene was ultimately converted into high-precision plant instance point clouds based on the volume density. The results proved that in semantic segmentation of point clouds, PlantSegNeRF outperformed the commonly used methods, demonstrating an average improvement of 16.1%, 18.3%, 17.8%, and 24.2% in precision, recall, F1-score, and IoU compared to the second-best results on structurally complex species. More importantly, PlantSegNeRF exhibited significant advantages in plant point cloud instance segmentation tasks. Across all plant species, it achieved average improvements of 11.7%, 38.2%, 32.2% and 25.3% in mPrec, mRec, mCov, mWCov, respectively. This study extends the organ-level plant phenotyping and provides a high-throughput way to supply high-quality 3D data for the development of large-scale models in plant science.
Paper and project links
Summary
This paper proposes PlantSegNeRF, a method that generates high-precision plant instance point clouds directly from multi-view RGB image sequences. It performs 2D instance segmentation on the multi-view images to produce per-organ instance masks with corresponding IDs, matches and refines the multi-view instance IDs of the same organ with a purpose-built instance matching module, and develops an instance NeRF that renders an implicit scene containing color, density, semantic, and instance information, which is finally converted into high-precision instance point clouds based on volume density. PlantSegNeRF shows clear advantages in both semantic and instance segmentation of plant point clouds, with large average gains in precision, recall, F1-score, and IoU.
Key Takeaways
- PlantSegNeRF is a new method for high-precision instance segmentation of plant point clouds.
- It generates high-precision organ-level point clouds from multi-view RGB image sequences.
- It produces instance masks and IDs via 2D instance segmentation, then matches and refines them across views with an instance matching module.
- An instance NeRF renders an implicit scene containing color, density, semantic, and instance information.
- PlantSegNeRF outperforms commonly used methods on both semantic and instance segmentation, improving multiple evaluation metrics.
- The method works across plant species and supplies high-quality 3D data for large-scale models in plant science.
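A generic stand-in for the cross-view instance ID matching step: score candidate pairs by mask overlap (IoU), assuming the masks have already been brought into a common reference frame, and solve the one-to-one assignment with the Hungarian algorithm. This illustrates the matching pattern, not the paper's exact module.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instance_ids(masks_a, masks_b):
    """Match instance IDs between two views by maximum mask IoU."""
    iou = np.zeros((len(masks_a), len(masks_b)))
    for i, ma in enumerate(masks_a):
        for j, mb in enumerate(masks_b):
            inter = np.logical_and(ma, mb).sum()
            union = np.logical_or(ma, mb).sum()
            iou[i, j] = inter / union if union else 0.0
    rows, cols = linear_sum_assignment(-iou)   # maximize total IoU
    return [(i, j, iou[i, j]) for i, j in zip(rows, cols)]

# Two views with swapped instance IDs for the same two organs.
a = np.zeros((2, 4, 4), bool); a[0, :2], a[1, 2:] = True, True
b = np.zeros((2, 4, 4), bool); b[0, 2:], b[1, :2] = True, True
print(match_instance_ids(a, b))   # ID 0 in view A matches ID 1 in view B, and vice versa
```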
Click here to view paper screenshots
With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You
Authors:Fabian Gröger, Shuo Wen, Huyen Le, Maria Brbić
Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment, including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples, less than 1% of the data typically used in the field. To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the latent space of unimodal encoders. Additionally, we show that aligning last layers is often suboptimal and demonstrate the benefits of aligning the layers with the highest representational similarity across modalities. These two components can be readily incorporated into existing alignment methods, yielding substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvement of 51.6% in classification and 91.8% in retrieval tasks. Our results highlight the effectiveness and broad applicability of our framework for limited-sample multimodal learning and offer a promising path forward for resource-constrained domains.
Paper and project links
PDF NeurIPS 2025 camera-ready
Summary
This work explores building multimodal models from a limited amount of paired data by aligning pretrained unimodal foundation models. High-quality alignment turns out to be achievable with as few as tens of thousands of paired samples, under 1% of the data typically used in the field, thanks to STRUCTURE, a regularization technique that preserves the neighborhood geometry of the unimodal encoders' latent spaces. The paper also shows that aligning the last layers is often suboptimal and that aligning the layers with the highest representational similarity across modalities works better. Incorporated into existing alignment methods, these two components yield substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvements of 51.6% in classification and 91.8% in retrieval.
Key Takeaways
- Existing multimodal models typically require millions of paired samples; this work shows that multimodal models can be built from limited paired data.
- The STRUCTURE regularization technique enables high-quality multimodal alignment with only tens of thousands of paired samples.
- Simply aligning the models' last layers is often suboptimal; aligning the layers with the highest cross-modal representational similarity matters.
- The approach yields significant gains on zero-shot image classification and retrieval tasks.
- Both components integrate readily into existing alignment methods, giving the framework broad applicability.
- The framework is especially valuable for resource-constrained domains.
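One simple way to "preserve neighborhood geometry", sketched here as a KL term that keeps each sample's similarity distribution over the batch unchanged before versus after the alignment head; the KL form and temperature are illustrative assumptions, not the paper's exact STRUCTURE objective.

```python
import torch
import torch.nn.functional as F

def structure_regularizer(z_pre, z_post, tau=0.1):
    """Penalize distortion of the pretrained latent neighborhood geometry.

    z_pre: frozen unimodal embeddings; z_post: embeddings after the alignment
    head; both of shape (B, d). Matching the row-wise similarity distributions
    keeps each sample's neighborhood intact during alignment.
    """
    def rowwise_neighbors(z):
        zn = F.normalize(z, dim=1)
        sim = zn @ zn.T / tau
        eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
        return sim.masked_fill(eye, float("-inf")).log_softmax(dim=1)

    p = rowwise_neighbors(z_pre)    # frozen pretrained geometry (target)
    q = rowwise_neighbors(z_post)   # geometry after the alignment head
    return F.kl_div(q, p, log_target=True, reduction="batchmean")

z_pre = torch.randn(16, 32)
head = torch.nn.Linear(32, 32)
print(structure_regularizer(z_pre, head(z_pre)).item())
```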
Click here to view paper screenshots
Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback
Authors:Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm
Paucity of medical data severely limits the generalizability of diagnostic ML models, as the full spectrum of disease variability can not be represented by a small clinical dataset. To address this, diffusion models (DMs) have been considered as a promising avenue for synthetic image generation and augmentation. However, they frequently produce medically inaccurate images, deteriorating the model performance. Expert domain knowledge is critical for synthesizing images that correctly encode clinical information, especially when data is scarce and quality outweighs quantity. Existing approaches for incorporating human feedback, such as reinforcement learning (RL) and Direct Preference Optimization (DPO), rely on robust reward functions or demand labor-intensive expert evaluations. Recent progress in Multimodal Large Language Models (MLLMs) reveals their strong visual reasoning capabilities, making them adept candidates as evaluators. In this work, we propose a novel framework, coined MAGIC (Medically Accurate Generation of Images through AI-Expert Collaboration), that synthesizes clinically accurate skin disease images for data augmentation. Our method creatively translates expert-defined criteria into actionable feedback for image synthesis of DMs, significantly improving clinical accuracy while reducing the direct human workload. Experiments demonstrate that our method greatly improves the clinical quality of synthesized skin disease images, with outputs aligning with dermatologist assessments. Additionally, augmenting training data with these synthesized images improves diagnostic accuracy by +9.02% on a challenging 20-condition skin disease classification task, and by +13.89% in the few-shot setting.
Paper and project links
PDF NeurIPS 2025
Summary
This paper proposes MAGIC (Medically Accurate Generation of Images through AI-Expert Collaboration), a framework that synthesizes clinically accurate skin disease images for data augmentation. The method translates expert-defined criteria into actionable feedback for the image synthesis of diffusion models, improving clinical accuracy while reducing the direct human workload. Experiments show the method greatly improves the clinical quality of synthesized skin disease images, with outputs aligning with dermatologist assessments; augmenting training data with these images improves diagnostic accuracy by +9.02% on a challenging 20-condition classification task and by +13.89% in the few-shot setting.
Key Takeaways
- The paucity of medical data limits the generalizability of diagnostic ML models.
- Diffusion models (DMs) show promise for medical image generation and augmentation but frequently produce medically inaccurate images.
- Expert domain knowledge is critical for synthesizing images that correctly encode clinical information, especially when data is scarce.
- Existing human-feedback methods such as reinforcement learning and Direct Preference Optimization rely on robust reward functions or demand labor-intensive expert evaluation.
- Multimodal large language models show strong visual reasoning, making them apt candidates as evaluators.
- MAGIC combines AI-expert collaboration with diffusion models to synthesize clinically accurate skin disease images.
Click here to view paper screenshots
Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
Authors:Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.
Paper and project links
PDF The first two authors contributed equally. This work has been accepted to the Neural Information Processing Systems (NeurIPS) 2025 Datasets & Benchmark Track. Project Page: https://rf100-vl.org/
Summary
Vision-language models (VLMs) trained on internet-scale data achieve strong zero-shot detection on common objects, but still struggle to generalize to out-of-distribution classes, tasks, and imaging modalities. The authors argue that rather than simply re-training VLMs on more visual data, one should align them to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. They introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts uncommon in VLM pre-training, and evaluate state-of-the-art models in zero-shot, few-shot, semi-supervised, and fully-supervised settings. Notably, models such as GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on the challenging medical imaging datasets, demonstrating the need for few-shot concept alignment.
Key Takeaways
- VLMs trained on internet-scale data achieve excellent zero-shot detection on common objects.
- Current models still struggle to generalize to out-of-distribution classes, tasks, and imaging modalities.
- Roboflow100-VL is a large-scale multi-modal object detection benchmark with diverse concepts uncommon in VLM pre-training.
- Some models achieve under 2% zero-shot accuracy on its challenging medical imaging datasets, showing the need for few-shot concept alignment.
- In the CVPR 2025 Foundational FSOD competition, the winning team beat the baseline by 17 mAP.
- The code and dataset are publicly available for further research.
Click here to view paper screenshots
CLEVER: A Curated Benchmark for Formally Verified Code Generation
Authors:Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri
We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean’s type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub (https://github.com/trishullab/clever) as well as HuggingFace (https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online (https://github.com/trishullab/clever-prover).
Paper and project links
Summary
CLEVER is a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem asks for (1) a specification that matches a held-out ground-truth specification and (2) a Lean implementation that provably satisfies it. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or admit vacuous solutions; all outputs are verified post-hoc with Lean's type checker to ensure machine-checkable correctness. Several few-shot and agentic approaches based on state-of-the-art language models all struggle to achieve full verification, making CLEVER a challenging frontier benchmark for program synthesis and formal reasoning. The benchmark is available on GitHub and HuggingFace, and the evaluation code is also online.
Key Takeaways
- CLEVER is a high-quality, 161-problem benchmark for end-to-end verified code generation in Lean, covering both specification generation and provably correct implementation; state-of-the-art few-shot and agentic methods all struggle to achieve full verification, making it a challenging benchmark for program synthesis and formal reasoning.
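A toy Lean 4 instance of the two tasks, invented for illustration and not drawn from the benchmark: a specification, an implementation, and a machine-checked proof that the implementation satisfies the specification.

```lean
-- Hypothetical toy problem (not from CLEVER): sum a list of naturals.
-- Task 1: state a specification relating input and result.
def spec (xs : List Nat) (r : Nat) : Prop :=
  r = xs.foldl (· + ·) 0

-- Task 2: implement the function ...
def impl (xs : List Nat) : Nat :=
  xs.foldl (· + ·) 0

-- ... and prove it satisfies the spec; Lean's type checker verifies the proof.
-- `rfl` closes it because `impl` is definitionally the function the spec names.
theorem impl_correct (xs : List Nat) : spec xs (impl xs) := rfl
```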
Click here to view paper screenshots
MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning?
Authors:Kai Yan, Zhan Ling, Kang Liu, Yifan Yang, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen
The ability to recognize patterns from examples and apply them to new ones is a primal ability for general intelligence, and is widely studied by psychology and AI researchers. Many benchmarks have been proposed to measure such ability for Large Language Models (LLMs); however, they focus on few-shot (usually <10) settings and lack evaluation for aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs has brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations often focus on classification, and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context reasoning benchmark for pattern recognition that asks an LLM to predict outputs via input-output examples from underlying functions with diverse data formats. Based on MIR-Bench, we study many novel problems for many-shot in-context reasoning, and acquired many insightful findings including scaling effect, robustness, inductive vs. transductive reasoning, Retrieval-Augmented Generation (RAG), coding for inductive reasoning, cross-domain generalizability, etc.
Paper and project links
PDF 39 pages, 11 figures. The paper is accepted at NeurIPS 2025 Datasets & Benchmarks Track, and the latest version adds modifications in camera-ready
Summary
Recognizing patterns from examples and applying them to new ones is a core ability for general intelligence, widely studied in psychology and AI. This paper proposes MIR-Bench, the first many-shot in-context reasoning benchmark for pattern recognition, which asks an LLM to predict outputs from input-output examples of underlying functions with diverse data formats. Building on MIR-Bench, the authors study many novel questions about many-shot in-context reasoning and obtain insightful findings on scaling effects, robustness, inductive vs. transductive reasoning, retrieval-augmented generation (RAG), coding for inductive reasoning, cross-domain generalizability, and more.
Key Takeaways
- Recognizing patterns and applying them to new instances is central to intelligence and widely studied in psychology and AI.
- Existing LLM benchmarks focus on few-shot (usually <10 examples) settings and rarely evaluate aggregating many pieces of information from long contexts; the emerging many-shot in-context learning paradigm offers a new lens on pattern recognition.
- MIR-Bench is a many-shot in-context reasoning benchmark that requires models to predict outputs from input-output examples in diverse data formats.
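The task format reduces to building one long prompt from many input-output pairs of a hidden function and asking the model to continue the pattern. A toy illustration, where the template wording and the hidden rule are invented:

```python
def many_shot_prompt(pairs, query):
    """Build a many-shot in-context pattern-recognition prompt.

    pairs: (input, output) examples drawn from one hidden underlying function;
    the model must induce the pattern and apply it to the query.
    """
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in pairs)
    return f"{shots}\nInput: {query}\nOutput:"

# Hidden rule for this toy task: reverse the string and upper-case it.
hidden_fn = lambda s: s[::-1].upper()
examples = [(w, hidden_fn(w)) for w in ["cat", "house", "river", "stone"]]
print(many_shot_prompt(examples, "cloud"))  # a real many-shot prompt would use 100s of pairs
```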
Click here to view paper screenshots
Pre-training Epidemic Time Series Forecasters with Compartmental Prototypes
Authors:Zewen Liu, Juntong Ni, Max S. Y. Lau, Wei Jin
Accurate epidemic forecasting is crucial for outbreak preparedness, but existing data-driven models are often brittle. Typically trained on a single pathogen, they struggle with data scarcity during new outbreaks and fail under distribution shifts caused by viral evolution or interventions. However, decades of surveillance data from diverse diseases offer an untapped source of transferable knowledge. To leverage the collective lessons from history, we propose CAPE, the first open-source pre-trained model for epidemic forecasting. Unlike existing time series foundation models that overlook epidemiological challenges, CAPE models epidemic dynamics as mixtures of latent population states, termed compartmental prototypes. It discovers a flexible dictionary of compartment prototypes directly from surveillance data, enabling each outbreak to be expressed as a time-varying mixture that links observed infections to latent population states. To promote robust generalization, CAPE combines self-supervised pre-training objectives with lightweight epidemic-aware regularizers that align the learned prototypes with epidemiological semantics. On a comprehensive benchmark spanning 17 diseases and 50+ regions, CAPE significantly outperforms strong baselines in zero-shot, few-shot, and full-shot forecasting. This work represents a principled step toward pre-trained epidemic models that are both transferable and epidemiologically grounded.
Paper and project links
PDF version 2.0_fixed
Summary
This paper proposes CAPE, the first open-source pre-trained model for epidemic forecasting, built to exploit decades of surveillance data from diverse diseases. Existing data-driven models are brittle when new outbreaks leave data scarce and fail under distribution shifts; CAPE instead models epidemic dynamics as mixtures of latent population states, called compartmental prototypes, discovering a flexible dictionary of prototypes directly from surveillance data so that each outbreak is expressed as a time-varying mixture linking observed infections to latent population states. To promote robust generalization, CAPE combines self-supervised pre-training objectives with lightweight epidemic-aware regularizers that align the learned prototypes with epidemiological semantics. On a benchmark spanning 17 diseases and 50+ regions, CAPE significantly outperforms strong baselines in zero-shot, few-shot, and full-shot forecasting, a principled step toward pre-trained epidemic models that are both transferable and epidemiologically grounded.
Key Takeaways
- CAPE is an open-source pre-trained model for epidemic forecasting.
- Existing data-driven models are brittle under the data scarcity of new outbreaks.
- CAPE mines transferable knowledge from long-term surveillance data, modeling epidemic dynamics as mixtures of latent population states.
- CAPE pairs self-supervised pre-training with epidemic-aware regularizers to improve robustness.
- By discovering a dictionary of compartmental prototypes, CAPE expresses each outbreak as a time-varying mixture.
- Across benchmarks, CAPE significantly outperforms other models in zero-shot, few-shot, and full-shot forecasting.
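A toy rendering of the compartmental-prototype idea: a small dictionary of canonical incidence curves, and a per-time-step mixture over them that reproduces an observed outbreak. The prototypes and softmax weights below are fabricated purely for illustration.

```python
import numpy as np

def mixture_forecast(prototypes, weights):
    """Express an outbreak as a time-varying mixture of compartment prototypes.

    prototypes: (K, T) canonical incidence dynamics learned across diseases;
    weights: (T, K) per-step mixture over prototypes (rows sum to 1), linking
    observed infections to latent population states.
    """
    return (weights * prototypes.T).sum(axis=1)   # (T,) predicted incidence

T = np.arange(60.0)
protos = np.stack([
    np.exp(-((T - 15) ** 2) / 50),    # "early-wave" prototype
    np.exp(-((T - 40) ** 2) / 200),   # "slow second wave" prototype
])
logits = np.stack([-0.1 * (T - 25), 0.1 * (T - 25)], axis=1)
w = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax over prototypes
series = mixture_forecast(protos, w)
print(series.shape, float(series.max()))
```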
Click here to view paper screenshots