⚠️ All summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never rely on these summaries in serious academic settings; use them only as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-21
RescueLens: LLM-Powered Triage and Action on Volunteer Feedback for Food Rescue
Authors:Naveen Raman, Jingwu Tang, Zhiyu Chen, Zheyuan Ryan Shi, Sean Hudson, Ameesh Kapoor, Fei Fang
Food rescue organizations simultaneously tackle food insecurity and waste by working with volunteers to redistribute food from donors who have excess to recipients who need it. Volunteer feedback allows food rescue organizations to identify issues early and ensure volunteer satisfaction. However, food rescue organizations monitor feedback manually, which can be cumbersome and labor-intensive, making it difficult to prioritize which issues are most important. In this work, we investigate how large language models (LLMs) assist food rescue organizers in understanding and taking action based on volunteer experiences. We work with 412 Food Rescue, a large food rescue organization based in Pittsburgh, Pennsylvania, to design RescueLens, an LLM-powered tool that automatically categorizes volunteer feedback, suggests donors and recipients to follow up with, and updates volunteer directions based on feedback. We evaluate the performance of RescueLens on an annotated dataset, and show that it can recover 96% of volunteer issues at 71% precision. Moreover, by ranking donors and recipients according to their rates of volunteer issues, RescueLens allows organizers to focus on 0.5% of donors responsible for more than 30% of volunteer issues. RescueLens is now deployed at 412 Food Rescue and through semi-structured interviews with organizers, we find that RescueLens streamlines the feedback process so organizers better allocate their time.
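The ranking idea in the abstract is simple enough to sketch. Below is a minimal, hypothetical illustration (not the authors' code) of ranking donors by their volunteer-issue rate so organizers can focus on the worst few; the record format and the `min_rescues` cutoff are assumptions.

```python
from collections import Counter

def rank_donors_by_issue_rate(rescues, min_rescues=20):
    """Rank donors by the fraction of their rescues whose volunteer
    feedback was flagged as an issue (hypothetical record format)."""
    totals, issues = Counter(), Counter()
    for r in rescues:  # r = {"donor": str, "has_issue": bool}
        totals[r["donor"]] += 1
        issues[r["donor"]] += int(r["has_issue"])
    rates = {
        d: issues[d] / totals[d]
        for d in totals if totals[d] >= min_rescues  # skip tiny samples
    }
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

demo = [{"donor": "A", "has_issue": i % 3 == 0} for i in range(30)] + \
       [{"donor": "B", "has_issue": False} for _ in range(25)]
print(rank_donors_by_issue_rate(demo)[:1])  # donor "A" surfaces first
```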
Paper & Project Links
PDF Accepted at IAAI’26
Summary
Food rescue organizations work with volunteers to redistribute surplus food from donors to recipients who need it, tackling food insecurity and waste at the same time. Volunteer feedback lets these organizations catch issues early and keep volunteers satisfied, but feedback is currently monitored manually, which is cumbersome and labor-intensive and makes it hard to prioritize the most important issues. This study examines how large language models (LLMs) can help food rescue organizers understand and act on volunteer experiences. Working with 412 Food Rescue, a large food rescue organization in Pittsburgh, Pennsylvania, the authors designed RescueLens, an LLM-powered tool that automatically categorizes volunteer feedback, suggests donors and recipients to follow up with, and updates volunteer directions based on feedback. Evaluation shows RescueLens recovers 96% of volunteer issues at 71% precision. Moreover, by ranking donors and recipients by their rates of volunteer issues, RescueLens lets organizers focus on the 0.5% of donors responsible for more than 30% of volunteer issues. RescueLens is now deployed at 412 Food Rescue, and semi-structured interviews with organizers show it streamlines the feedback process so organizers can allocate their time better.
Key Takeaways
- Food rescue organizations tackle food waste and food insecurity together.
- Volunteer feedback plays a key role in food rescue, enabling early issue detection and volunteer satisfaction.
- Feedback is currently handled manually, which is inefficient and makes it hard to prioritize key issues.
- Large language models (LLMs) can help food rescue organizations process and analyze volunteer feedback.
- RescueLens automatically categorizes volunteer feedback, suggests donors and recipients to follow up with, and updates volunteer directions.
- RescueLens performs strongly, recovering most volunteer issues at high precision.
Click here to view paper screenshots
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Authors:Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong, Jinyang Guo, Xianglong Liu, Jun Zhang
Mixture-of-Experts (MoE) Multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods-originally designed for unimodal large language models (LLMs)-to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16$\times$ and the decoding time by 1.26$\times$.
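A rough sketch of the two mechanisms the abstract names, under stated assumptions: `layer_importance` stands in for the global layer-wise weight (here a scalar), and the per-modality thresholds are placeholders for what the paper's frontier search would actually find.

```python
import numpy as np

def modes_skip_mask(router_logits, layer_importance, thresholds, modality):
    """One MoE layer's skip decision, sketching MoDES's two ideas:
    (1) GMLG: scale each token's local routing probabilities by a global,
        layer-wise importance weight to estimate per-token expert importance;
    (2) DMT: compare that score against a modality-specific threshold, so
        text and vision tokens are thresholded separately.
    Shapes: router_logits [T, E]; modality is a length-T list."""
    probs = np.exp(router_logits - router_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)          # local gating, per token
    importance = probs * layer_importance          # GMLG modulation
    tau = np.array([thresholds[m] for m in modality])[:, None]
    return importance >= tau                       # False => expert skipped

rng = np.random.default_rng(0)
keep = modes_skip_mask(rng.normal(size=(4, 8)), layer_importance=0.7,
                       thresholds={"text": 0.05, "vision": 0.12},
                       modality=["text", "text", "vision", "vision"])
print(keep.mean())  # fraction of (token, expert) pairs kept
```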
Paper & Project Links
PDF Code will be released upon acceptance
Summary
Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks but are computationally inefficient. Expert skipping methods reduce inference overhead by deactivating redundant experts based on the current input tokens, but applying methods designed for unimodal LLMs to MLLMs degrades performance considerably, mainly because they ignore the heterogeneous contributions of experts across MoE layers and the modality-specific behavior of tokens within those layers. MoDES is a training-free framework that adaptively skips experts for efficient, accurate MoE MLLM inference. A globally-modulated local gating (GMLG) mechanism folds global layer-wise importance into local routing probabilities to estimate per-token expert importance, and a dual-modality thresholding (DMT) method processes tokens from each modality separately to derive the skipping schedule. A frontier search algorithm exploiting monotonicity sets the optimal thresholds, cutting convergence time from days to hours. Extensive experiments show MoDES far outperforms prior methods: when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). MoDES also improves prefilling time by 2.16x and decoding time by 1.26x.
Key Takeaways
- MoE MLLMs excel at vision-language tasks but are computationally inefficient.
- Directly applying expert skipping methods designed for unimodal LLMs to MLLMs degrades performance.
- MoDES is a training-free framework that adaptively skips experts for efficient, accurate MoE MLLM inference.
- Its GMLG mechanism estimates per-token expert importance by incorporating global layer-wise importance.
- Its DMT method thresholds tokens from each modality separately to decide which experts to skip.
- A frontier search algorithm sets the optimal thresholds quickly, greatly reducing convergence time.
Click here to view paper screenshots
VisPlay: Self-Evolving Vision-Language Models from Images
Authors:Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
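To make the Questioner/Reasoner setup concrete, here is a toy illustration of how diversity and difficulty rewards could be combined for the Questioner. `sim`, the 0.5/0.5 weights, and the disagreement-as-difficulty proxy are all assumptions for illustration, not the paper's definitions.

```python
def questioner_reward(question, answers, sim):
    """Toy reward for the Image-Conditioned Questioner, combining the two
    signals the abstract names: difficulty (the Reasoner's sampled silver
    answers should disagree somewhat, so the question is challenging yet
    answerable) and diversity (new questions should not repeat old ones).
    sim(a, b) is any text-similarity function in [0, 1]."""
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    difficulty = 1.0 - sum(sim(a, b) for a, b in pairs) / max(len(pairs), 1)
    diversity = 1.0 - max((sim(question, q) for q in SEEN_QUESTIONS), default=0.0)
    SEEN_QUESTIONS.append(question)
    return 0.5 * difficulty + 0.5 * diversity

SEEN_QUESTIONS = []
jaccard = lambda a, b: len(set(a.split()) & set(b.split())) / len(set(a.split()) | set(b.split()))
print(questioner_reward("what color is the car", ["red", "dark red", "maroon"], jaccard))
```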
Paper & Project Links
Summary
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks, but existing RL approaches rely on human-annotated labels or task-specific heuristics to define verifiable rewards, which are costly and hard to scale. VisPlay is a self-evolving RL framework that lets VLMs autonomously improve their reasoning using large amounts of unlabeled image data. Starting from a single base VLM, it assigns the model two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. The roles are trained jointly with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance question complexity against silver-answer quality. VisPlay scales efficiently across two model families: trained on Qwen2.5-VL and MiMo-VL, it achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU. The project page is at https://bruno686.github.io/VisPlay/.
Key Takeaways
- Reinforcement learning can improve VLM performance on complex reasoning tasks.
- Existing RL methods often depend on human-annotated labels or task-specific heuristics, which are costly and hard to scale.
- VisPlay is a self-evolving RL framework that uses large amounts of unlabeled image data to let VLMs improve their own reasoning.
- VisPlay trains two interacting roles: an Image-Conditioned Questioner and a Multimodal Reasoner.
- VisPlay uses Group Relative Policy Optimization (GRPO) with diversity and difficulty rewards to balance question complexity against silver-answer quality.
- VisPlay scales efficiently across two model families and improves visual reasoning across multiple benchmarks.
Click here to view paper screenshots
The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
Authors:Dante Francisco Wasmuht, Otto Brookes, Maximillian Schall, Pablo Palencia, Chris Beirne, Tilo Burghardt, Majid Mirmehdi, Hjalmar Kühl, Mimi Arandjelovic, Sam Pottie, Peter Bermant, Brandon Asheim, Yi Jin Toh, Adam Elzinga, Jason Holmberg, Andrew Whitworth, Eleanor Flatt, Laura Gustafson, Chaitanya Ryali, Yuan-Ting Hu, Baishan Guo, Andrew Westbury, Kate Saenko, Didac Suris
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multianimal tracking in the wild. The dataset is available at $\href{https://www.conservationxlabs.com/sa-fari}{\text{conservationxlabs.com/SA-FARI}}$.
Paper & Project Links
Summary
Automated video analysis is critical for wildlife conservation, and multi-animal tracking (MAT) is a foundational task underpinning applications such as individual re-identification and behavior recognition. Existing datasets are limited in scale, restricted to a few species, or lack temporal and geographic diversity, leaving no suitable benchmark for training general-purpose MAT models for wild animal populations. SA-FARI addresses this as the largest open-source MAT dataset for wild animals: 11,609 camera trap videos collected over roughly 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, yielding about 46 hours of densely annotated footage with 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels, plus anonymized camera trap locations for each video. The authors present comprehensive benchmarks using state-of-the-art vision-language models for detection and tracking, including SAM 3 with both species-specific and generic animal prompts, and compare against vision-only methods developed for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for generalizable multi-animal tracking in the wild. It is available at conservationxlabs.com/SA-FARI.
Key Takeaways
- SA-FARI is the largest open-source multi-animal tracking dataset for wild animals to date.
- It contains 11,609 camera trap videos spanning 99 species categories from 741 locations across 4 continents, collected over about a decade for strong temporal and geographic diversity.
- The footage is densely annotated with masklet identities, bounding boxes, segmentation masks, and species labels.
Click here to view paper screenshots
When to Think and When to Look: Uncertainty-Guided Lookback
Authors:Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang, Yunlong Tang, Luchuan Song, Susan Liang, Zhongfei Zhang, Jason J. Corso, Chenliang Xu
Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty guided lookback, a training free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math focused visual reasoning datasets.
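A minimal sketch of what an uncertainty-guided lookback decoder could look like: token entropy serves as the uncertainty signal, and a lookback phrase is injected when it spikes. The threshold, the phrase, and the single-branch control flow (the paper also uses breadth search) are assumptions.

```python
import math
import random

LOOKBACK = " Let me look at the image again:"  # illustrative phrase

def decode_with_lookback(step, prompt, max_tokens=64, tau=2.5):
    """step(text) is any callable returning (next_token, token_probs) for
    the current text. When the entropy of token_probs spikes above tau
    nats, inject a lookback prompt that refers back to the image instead
    of committing to the uncertain token."""
    text = prompt
    for _ in range(max_tokens):
        token, probs = step(text)
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        if entropy > tau and LOOKBACK not in text[-200:]:
            text += LOOKBACK          # steer the model back to the image
            continue
        text += token
        if token == "<eos>":
            break
    return text

stub = lambda t: (random.choice([" a", " b", "<eos>"]), [0.4, 0.3, 0.3])
print(decode_with_lookback(stub, "Q: what is shown?"))
```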
Paper & Project Links
Summary
Test-time thinking (generating explicit intermediate reasoning chains) is known to boost large language models and has recently shown strong gains for large vision-language models (LVLMs), yet there has been no systematic analysis of how thinking actually affects visual reasoning. This paper provides the first such analysis: a large-scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi-pass decoding. More thinking is not always better: long chains often yield long, wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis shows that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, the authors propose uncertainty-guided lookback, a training-free decoding strategy combining an uncertainty signal with adaptive lookback prompts and breadth search. The method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, outperforms several strong decoding baselines, and sets a new state of the art under fixed model families and token budgets. The strategy also generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.
Key Takeaways
- Test-time thinking can boost the performance of large language models.
- For LVLMs, explicit intermediate reasoning chains can help, but gains are not automatic.
- More thinking is not always better: long chains can ignore the image and produce wrong trajectories.
- Short lookback phrases appear frequently in successful visual reasoning trajectories and correlate with better visual grounding.
- Uncertainty-guided lookback is an effective training-free decoding strategy combining an uncertainty signal, adaptive lookback prompts, and breadth search.
- The method improves overall MMMU performance, with the largest gains in categories where standard thinking is weak.
Click here to view paper screenshots
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
Authors:Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar
With the increasing prevalence of video content, effectively understanding and answering questions about long form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both a comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR’s effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.
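The feedback loop is the part worth sketching. Below, `retrieve`, `answer`, and `rethink` are stand-ins for the LLM-backed Pre-Retrieval Thinking Agent, answerer, and Rethink Module; the interfaces are assumptions, not AVATAAR's actual API.

```python
def avataar_answer(question, retrieve, answer, rethink, max_rounds=3):
    """Sketch of AVATAAR's loop: a retrieval plan fetches local context,
    a partial answer is produced, and the Rethink Module either accepts
    it or refines the retrieval plan for another pass."""
    plan = f"initial retrieval plan for: {question}"
    result = None
    for _ in range(max_rounds):
        clips = retrieve(plan)                  # local video context
        result = answer(question, clips)        # partial answer
        verdict, refined = rethink(question, result)
        if verdict == "accept":
            break
        plan = refined                          # iterate with a better plan
    return result

print(avataar_answer(
    "what happens after the heist?",
    retrieve=lambda plan: [f"clip for ({plan})"],
    answer=lambda q, clips: f"answer from {len(clips)} clip(s)",
    rethink=lambda q, a: ("accept", None),
))
```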
Paper & Project Links
PDF Accepted in the 5th IEEE Big Data Workshop on Multimodal AI (MMAI 2025), Dec 8-11, Macau, China, 2025 (Preprint Copy)
Summary
With video content increasingly prevalent, effectively understanding and answering questions about long-form videos is essential across many applications. Although large vision-language models (LVLMs) have improved performance, they still struggle with nuanced queries that demand both comprehensive understanding and detailed analysis. AVATAAR is a modular, interpretable framework that combines global and local video context with a Pre-Retrieval Thinking Agent and a Rethink Module. It builds a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre-Retrieval Thinking Agent, letting the system refine its retrieval strategy based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR improves over a baseline with relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Experiments confirm each module contributes positively, with the feedback loop crucial for adaptability. AVATAAR offers a scalable solution for long-form video question answering, combining accuracy, interpretability, and extensibility.
Key Takeaways
- With video content everywhere, answering questions about long-form videos has become essential.
- Large vision-language models still struggle with nuanced queries.
- AVATAAR is a modular, interpretable framework that combines global and local video context.
- Through its feedback loop, AVATAAR replicates human-like iterative reasoning.
- AVATAAR shows significant gains over the baseline on the CinePile benchmark.
- Every AVATAAR module contributes positively to overall performance, with the feedback loop being the key factor.
Click here to view paper screenshots
HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning
Authors:Qihao Yang, Xuelin Wang, Jiale Chen, Xuelian Dong, Yuxin Hao, Tianyong Hao
Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners’ language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. In this paper, we present HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. It covers HSK levels 3 to 6 and includes authentic textbooks with 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human learning trajectories, we introduce a curriculum-tuning framework that trains models from beginner to advanced levels. An evaluation system is created to examine level-based grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. We also build HSKAgent, fine-tuned on 10K learner compositions. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs have writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLMs interpretability. Code and data are publicly available at: https://github.com/CharlesYang030/HSKB.
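Curriculum tuning as described reduces to a staged fine-tuning loop that carries weights forward from level to level. A minimal sketch with assumed interfaces (the paper's actual training uses its own data and evaluation system):

```python
def curriculum_tune(model, stages, train_one_stage, evaluate):
    """Train sequentially from beginner to advanced levels, carrying the
    weights forward so the model traces a human-like learning trajectory;
    record an evaluation snapshot after each stage."""
    snapshots = {}
    for level in stages:                        # e.g. HSK 3 -> 6
        model = train_one_stage(model, level)   # fine-tune on this level
        snapshots[level] = evaluate(model, level)  # grammar coverage, errors...
    return model, snapshots

model, snaps = curriculum_tune(
    model="base",
    stages=["HSK3", "HSK4", "HSK5", "HSK6"],
    train_one_stage=lambda m, lvl: f"{m}+{lvl}",       # stub trainer
    evaluate=lambda m, lvl: {"model": m, "level": lvl},  # stub evaluator
)
print(model)  # base+HSK3+HSK4+HSK5+HSK6
```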
Paper & Project Links
PDF Accepted by AAAI-2026
Summary
Language acquisition is vital for revealing the nature of human language intelligence, yet modeling Chinese second language acquisition (SLA) with large language models (LLMs) remains challenging. To address this, the paper presents HSKBenchmark, a staged benchmark for Chinese SLA modeling and writing assessment of LLMs. HSKBenchmark covers HSK levels 3 to 6 and includes authentic textbooks, synthetic instruction samples, test topics, and a linguistically grounded evaluation system. A curriculum-tuning framework is introduced to simulate human learning trajectories, and HSKAgent is built to assess model performance. Overall, HSKBenchmark both models Chinese SLA effectively and provides a reliable benchmark for dynamic writing assessment.
Key Takeaways
- Language acquisition is vital to revealing human language intelligence; LLMs face challenges in modeling Chinese SLA.
- HSKBenchmark is the first staged benchmark for modeling and writing assessment of LLMs in Chinese SLA.
- HSKBenchmark includes authentic textbooks, synthetic instruction samples, test topics, and a linguistically grounded evaluation system.
- A curriculum-tuning framework is introduced to simulate human learning trajectories.
- HSKAgent is used to assess model performance; the fine-tuned models write on par with advanced human learners.
- HSKBenchmark both models Chinese SLA effectively and provides a reliable benchmark for dynamic writing assessment.
Click here to view paper screenshots
Know Your Intent: An Autonomous Multi-Perspective LLM Agent Framework for DeFi User Transaction Intent Mining
Authors:Qian’ang Mao, Yuxuan Zhang, Jiaman Chen, Wenjun Zhou, Jiaqi Yan
As Decentralized Finance (DeFi) develops, understanding user intent behind DeFi transactions is crucial yet challenging due to complex smart contract interactions, multifaceted on-/off-chain factors, and opaque hex logs. Existing methods lack deep semantic insight. To address this, we propose the Transaction Intent Mining (TIM) framework. TIM leverages a DeFi intent taxonomy built on grounded theory and a multi-agent Large Language Model (LLM) system to robustly infer user intents. A Meta-Level Planner dynamically coordinates domain experts to decompose multiple perspective-specific intent analyses into solvable subtasks. Question Solvers handle the tasks with multi-modal on/off-chain data. While a Cognitive Evaluator mitigates LLM hallucinations and ensures verifiability. Experiments show that TIM significantly outperforms machine learning models, single LLMs, and single Agent baselines. We also analyze core challenges in intent inference. This work helps provide a more reliable understanding of user motivations in DeFi, offering context-aware explanations for complex blockchain activity.
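A control-flow sketch of the planner/solver/evaluator loop described above; all three callables are assumed interfaces standing in for the LLM agents, and the data handling is elided.

```python
def mine_intent(tx, planner, solvers, evaluator, max_rounds=2):
    """Sketch of TIM's agent loop: the Meta-Level Planner decomposes the
    transaction into perspective-specific subtasks, Question Solvers
    answer them over on-/off-chain data, and a Cognitive Evaluator checks
    the aggregated intent for hallucination before accepting it."""
    for _ in range(max_rounds):
        subtasks = planner(tx)                        # perspective -> question
        findings = {p: solvers[p](q, tx) for p, q in subtasks.items()}
        intent, verified = evaluator(tx, findings)    # verifiability gate
        if verified:
            return intent
    return "uncertain"

print(mine_intent(
    tx={"hash": "0xabc"},
    planner=lambda tx: {"market": "why swap now?", "behavior": "habitual pattern?"},
    solvers={"market": lambda q, tx: "price spike",
             "behavior": lambda q, tx: "new wallet"},
    evaluator=lambda tx, findings: ("arbitrage", True),
))
```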
Paper & Project Links
PDF Written in 2025 Q1
Summary
The paper proposes Transaction Intent Mining (TIM), a framework for deeply parsing user intent behind decentralized finance (DeFi) transactions. TIM combines a DeFi intent taxonomy built on grounded theory with a multi-agent large language model (LLM) system: a Meta-Level Planner coordinates domain experts, Question Solvers handle subtasks, and a Cognitive Evaluator mitigates hallucinations. Experiments show TIM significantly outperforms machine learning models, single LLMs, and single-agent baselines. It enables a more reliable understanding of user motivations in DeFi and provides context-aware explanations for complex blockchain activity.
Key Takeaways
- TIM addresses the challenge of understanding user intent behind DeFi transactions, which stems from complex smart contract interactions, multifaceted on-/off-chain factors, and opaque hex logs.
- TIM infers user intent with a DeFi intent taxonomy built on grounded theory and a multi-agent LLM system.
- A Meta-Level Planner coordinates domain experts to decompose perspective-specific intent analyses into solvable subtasks.
- Question Solvers handle these subtasks using multi-modal on-/off-chain data.
- A Cognitive Evaluator mitigates LLM hallucinations and ensures verifiability.
- Experiments show TIM significantly outperforms machine learning models, single LLMs, and single-agent baselines.
Click here to view paper screenshots
CroPS: Improving Dense Retrieval with Cross-Perspective Positive Samples in Short-Video Search
Authors:Ao Xie, Jiahui Chen, Quanzhi Zhu, Xiaoze Jiang, Zhiheng Qin, Enyun Yu, Han Li
Dense retrieval has become a foundational paradigm in modern search systems, especially on short-video platforms. However, most industrial systems adopt a self-reinforcing training pipeline that relies on historically exposed user interactions for supervision. This paradigm inevitably leads to a filter bubble effect, where potentially relevant but previously unseen content is excluded from the training signal, biasing the model toward narrow and conservative retrieval. In this paper, we present CroPS (Cross-Perspective Positive Samples), a novel retrieval data engine designed to alleviate this problem by introducing diverse and semantically meaningful positive examples from multiple perspectives. CroPS enhances training with positive signals derived from user query reformulation behavior (query-level), engagement data in recommendation streams (system-level), and world knowledge synthesized by large language models (knowledge-level). To effectively utilize these heterogeneous signals, we introduce a Hierarchical Label Assignment (HLA) strategy and a corresponding H-InfoNCE loss that together enable fine-grained, relevance-aware optimization. Extensive experiments conducted on Kuaishou Search, a large-scale commercial short-video search platform, demonstrate that CroPS significantly outperforms strong baselines both offline and in live A/B tests, achieving superior retrieval performance and reducing query reformulation rates. CroPS is now fully deployed in Kuaishou Search, serving hundreds of millions of users daily.
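The H-InfoNCE idea (multiple positives, each weighted by its HLA tier) can be sketched in a few lines of numpy. The tier scheme and weights below are illustrative assumptions, not the paper's actual assignment.

```python
import numpy as np

def h_infonce(q, docs, labels, weights, tau=0.05):
    """Sketch of an H-InfoNCE-style loss. Positives come from several
    perspectives (illustratively: tier 3 clicks, tier 2 query
    reformulations, tier 1 recommendation/knowledge signals) and each
    tier contributes with its own weight instead of treating all
    positives equally. q: [D]; docs: [N, D]; labels: [N], 0 = negative."""
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = docs @ q / tau
    log_denom = np.log(np.exp(logits).sum())
    loss, norm = 0.0, 0.0
    for i, tier in enumerate(labels):
        if tier > 0:                      # every positive, tier-weighted
            w = weights[tier]
            loss += w * (log_denom - logits[i])
            norm += w
    return loss / max(norm, 1e-9)

rng = np.random.default_rng(1)
print(h_infonce(rng.normal(size=8), rng.normal(size=(5, 8)),
                labels=[3, 2, 0, 0, 1], weights={1: 0.5, 2: 1.0, 3: 2.0}))
```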
Paper & Project Links
PDF AAAI-2026, Oral
Summary
Dense retrieval is a foundational paradigm in modern search systems, especially on short-video platforms, but most industrial systems use a self-reinforcing training pipeline supervised by historically exposed user interactions, producing a filter-bubble effect that biases models toward narrow, conservative retrieval. CroPS (Cross-Perspective Positive Samples) is a retrieval data engine that alleviates this by introducing diverse, semantically meaningful positive examples from multiple perspectives: user query-reformulation behavior (query level), engagement data in recommendation streams (system level), and world knowledge synthesized by large language models (knowledge level). A Hierarchical Label Assignment (HLA) strategy with a corresponding H-InfoNCE loss enables fine-grained, relevance-aware optimization of these heterogeneous signals. Experiments on Kuaishou Search, a large commercial short-video search platform, show CroPS significantly outperforms strong baselines both offline and in live A/B tests, improving retrieval performance and reducing query-reformulation rates. CroPS is now fully deployed in Kuaishou Search, serving hundreds of millions of users daily.
Key Takeaways
- Dense retrieval is a core paradigm in modern search systems, especially on short-video platforms.
- Self-reinforcing training pipelines supervised by historical user interactions can create a filter-bubble effect.
- CroPS addresses this by introducing cross-perspective positive samples into training.
- CroPS draws heterogeneous signals from user query-reformulation behavior, recommendation-stream engagement data, and LLM-synthesized world knowledge.
- A Hierarchical Label Assignment (HLA) strategy and the corresponding H-InfoNCE loss enable fine-grained, relevance-aware optimization.
- Experiments on Kuaishou Search show CroPS significantly improves retrieval performance and reduces query-reformulation rates.
Click here to view paper screenshots
Small Language Models for Phishing Website Detection: Cost, Performance, and Privacy Trade-Offs
Authors:Georg Goldenits, Philip Koenig, Sebastian Raubitzek, Andreas Ekelhart
Phishing websites pose a major cybersecurity threat, exploiting unsuspecting users and causing significant financial and organisational harm. Traditional machine learning approaches for phishing detection often require extensive feature engineering, continuous retraining, and costly infrastructure maintenance. At the same time, proprietary large language models (LLMs) have demonstrated strong performance in phishing-related classification tasks, but their operational costs and reliance on external providers limit their practical adoption in many business environments. This paper investigates the feasibility of small language models (SLMs) for detecting phishing websites using only their raw HTML code. A key advantage of these models is that they can be deployed on local infrastructure, providing organisations with greater control over data and operations. We systematically evaluate 15 commonly used Small Language Models (SLMs), ranging from 1 billion to 70 billion parameters, benchmarking their classification accuracy, computational requirements, and cost-efficiency. Our results highlight the trade-offs between detection performance and resource consumption, demonstrating that while SLMs underperform compared to state-of-the-art proprietary LLMs, they can still provide a viable and scalable alternative to external LLM services. By presenting a comparative analysis of costs and benefits, this work lays the foundation for future research on the adaptation, fine-tuning, and deployment of SLMs in phishing detection systems, aiming to balance security effectiveness and economic practicality.
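The core pipeline (raw HTML in, one-word verdict out of a locally hosted SLM) might look like the sketch below. The prompt wording, the truncation limit, and the `generate` interface are assumptions, not the paper's exact setup.

```python
PROMPT = (
    "You are a security analyst. Given the raw HTML of a web page, answer "
    "with exactly one word, PHISHING or LEGITIMATE.\n\nHTML:\n{html}\n\nAnswer:"
)

def classify_html(html, generate, max_chars=20000):
    """Zero-shot phishing classification from raw HTML with a locally
    hosted small language model; generate(prompt) -> str abstracts
    whichever runtime actually serves the model."""
    text = generate(PROMPT.format(html=html[:max_chars]))  # fit the context window
    return "phishing" if "PHISHING" in text.upper() else "legitimate"

print(classify_html("<html><form action='http://paypa1-login.example'>...",
                    generate=lambda p: "PHISHING"))  # stub runtime
```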
Paper & Project Links
Summary
Small language models (SLMs) are a feasible option for detecting phishing websites using only their raw HTML. Unlike proprietary large language models (LLMs), SLMs can be deployed on local infrastructure, giving organizations greater control over data and operations. The paper systematically evaluates 15 commonly used SLMs for classification accuracy, computational requirements, and cost-efficiency, highlighting the trade-off between detection performance and resource consumption. Although SLMs underperform state-of-the-art proprietary LLMs, they still provide a viable and scalable alternative for phishing website detection.
Key Takeaways:
- Phishing websites are a major cybersecurity threat, causing significant financial and organizational harm.
- Traditional machine learning approaches to phishing detection require extensive feature engineering, continuous retraining, and costly infrastructure maintenance.
- Proprietary LLMs perform strongly on phishing-related classification, but operational costs and reliance on external providers limit their practical adoption in business environments.
- Small language models (SLMs) can detect phishing websites from raw HTML alone.
- SLMs can be deployed on local infrastructure, giving organizations control over data and operations.
- A systematic evaluation of 15 SLMs (1B to 70B parameters) shows a trade-off between detection performance and resource consumption.
Click here to view paper screenshots
Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training
Authors:Yunjiao Zhou, Xinyan Chen, Junlang Qian, Lihua Xie, Jianfei Yang
Understanding complex human activities demands the ability to decompose motion into fine-grained, semantic-aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI and virtual reality. Yet, most existing methods rely on dense supervision with predefined action classes, which are infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art effectiveness and efficiency of motion grounding performance, outperforming prior methods by +8.7% mAP on HumanML3D benchmark. Meanwhile, significant improvements also exist in downstream retrieval, establishing a new paradigm for annotation-free motion understanding.
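A rough numpy sketch of what a soft-masking objective with the three named ingredients (alignment of masked segments to sub-action text embeddings, intra-segment continuity, inter-segment separation) could look like. The exact terms and weights are assumptions; the paper's formulation may differ.

```python
import numpy as np

def mask_objective(masks, feats, text_emb, lam_cont=0.1, lam_sep=0.1):
    """masks [K, T]: one soft temporal mask per sub-action (values in [0,1]);
    feats [T, D]: frozen per-frame encoder features;
    text_emb [K, D]: embeddings of the LLM-derived sub-action units."""
    seg = (masks @ feats) / (masks.sum(1, keepdims=True) + 1e-9)   # masked means
    seg = seg / (np.linalg.norm(seg, axis=1, keepdims=True) + 1e-9)
    txt = text_emb / (np.linalg.norm(text_emb, axis=1, keepdims=True) + 1e-9)
    align = -np.mean(np.sum(seg * txt, axis=1))        # match sub-action text
    continuity = np.mean((masks[:, 1:] - masks[:, :-1]) ** 2)  # smooth in time
    overlap = masks @ masks.T                           # sub-actions should not
    separation = (overlap.sum() - np.trace(overlap)) / masks.shape[0] ** 2
    return align + lam_cont * continuity + lam_sep * separation

rng = np.random.default_rng(0)
masks = rng.uniform(size=(3, 40))                       # 3 sub-actions, 40 frames
print(mask_objective(masks, rng.normal(size=(40, 16)), rng.normal(size=(3, 16))))
```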
Paper & Project Links
Summary
ZOMG is a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without any annotations or fine-tuning. It combines language semantic partition, which uses large language models to decompose instructions into ordered sub-action units, with soft masking optimization, which learns instance-specific temporal masks that focus on the frames critical to each sub-action while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. On three motion-language datasets, ZOMG achieves state-of-the-art effectiveness and efficiency in motion grounding, outperforming prior methods by +8.7% mAP on the HumanML3D benchmark, with significant improvements in downstream retrieval as well, establishing a new paradigm for annotation-free motion understanding.
Key Takeaways
- ZOMG segments motion sequences into semantically meaningful sub-actions without any annotations or fine-tuning.
- ZOMG integrates language semantic partition and soft masking optimization.
- Language semantic partition uses large language models to decompose instructions into ordered sub-action units.
- Soft masking optimization learns instance-specific temporal masks that focus on the frames critical to each sub-action.
- ZOMG maintains intra-segment continuity while enforcing inter-segment separation.
- Experiments on three motion-language datasets demonstrate state-of-the-art motion grounding performance.
Click here to view paper screenshots
Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Authors:Songze Li, Mingyu Gao, Tonghua Su, Xu-Yao Zhang, Zhongjie Wang
Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.
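The gradient approximation is easy to sketch: the direction from the current parameters toward the previous task's optimum stands in for the missing old-task gradient, mixed with real replay gradients and gated by a Bernoulli draw. The coefficients below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def guided_gradient(theta, theta_old_opt, grad_new, grad_replay,
                    p_bernoulli=0.5, beta=0.5):
    """Approximate the missing old-task gradient by the (normalized)
    direction from current parameters toward the previously optimal
    parameters, blend it with real replay gradients, and gate the old-task
    term with a Bernoulli draw that trades stability against plasticity."""
    direction = theta - theta_old_opt
    n = np.linalg.norm(direction)
    approx_old = direction / n if n > 0 else direction   # guidance vector
    old_term = beta * approx_old + (1 - beta) * grad_replay
    use_old = np.random.rand() < p_bernoulli             # stability gate
    return grad_new + (old_term if use_old else 0.0)

theta = np.array([1.0, 2.0]); theta_old = np.array([0.5, 2.5])
print(guided_gradient(theta, theta_old, grad_new=np.array([0.1, -0.2]),
                      grad_replay=np.array([0.0, 0.1])))
```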
Paper & Project Links
Summary
This paper tackles catastrophic forgetting in multimodal continual instruction tuning with a new insight: forgetting is framed as the problem of missing gradients from old tasks during new-task learning. The method approximates these missing gradients using the geometric properties of the parameter space, taking the directional vector from the current parameters toward the previously optimal parameters as gradient guidance. The approximated gradient is further combined with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Experiments show the method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while keeping the architecture compact.
Key Takeaways
- Multimodal continual instruction tuning faces the challenge of catastrophic forgetting.
- Catastrophic forgetting is reframed as missing gradients from old tasks during new-task learning.
- The missing gradients are approximated using the geometric properties of the parameter space.
- The approximated gradient is combined with real gradients from a limited replay buffer.
- A Bernoulli sampling strategy balances model stability and plasticity.
- Experiments show the method effectively mitigates catastrophic forgetting.
Click here to view paper screenshots
Do Large Language Models (LLMs) Understand Chronology?
Authors:Pattaraphon Kenny Wongchamcharoen, Paul Glasserman
Large language models (LLMs) are increasingly used in finance and economics, where prompt-based attempts against look-ahead bias implicitly assume that models understand chronology. We test this fundamental question with a series of chronological ordering tasks with increasing complexities over facts the model already knows from pre-training. Our tasks cover (1) chronological ordering, (2) conditional sorting (filter, then order), and (3) anachronism detection. We evaluate GPT-4.1, Claude-3.7 Sonnet, with and without Extended Thinking (ET), and GPT-5 across multiple reasoning-effort settings. Across models, Exact match rate drops sharply as sequences lengthen even while rank correlations stay high as LLMs largely preserve local order but struggle to maintain a single globally consistent timeline. In conditional sorting, most failures stem from the filtering step rather than the ordering step, but GPT-5 and Claude-3.7 Sonnet with Extended Thinking outshine normal models significantly. Lastly, anachronism detection is found to be the easiest task for the LLMs but performance still declines with increasingly overlapping timelines or entities. Overall, our main contribution is showing that allocating explicit reasoning budget helps with chronological ordering with GPT-5 at medium/high reasoning effort achieving flawless ordering at all lengths and perfect conditional sorting (both self-filtered and given-subset), whereas low/minimal effort degrades with longer lists, mirroring earlier models. Our findings delineate limits of current LLMs on chronological tasks, providing insights into task complexity, and demonstrate scenarios in which reasoning helps. These patterns are important for the real-time application of LLMs in finance. We release all code and evaluation templates to support full reproducibility.
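The gap between exact match and rank correlation is the paper's key measurement: a model can fail exact ordering while still preserving most local order. A small self-contained illustration (the event lists are made up):

```python
from itertools import combinations

def exact_match(pred, gold):
    return float(pred == gold)

def kendall_tau(pred, gold):
    """Rank correlation between a predicted ordering and the true
    chronological one; both are permutations of the same event ids."""
    pos = {e: i for i, e in enumerate(pred)}
    concordant = discordant = 0
    for a, b in combinations(gold, 2):   # gold is in true time order
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

gold = ["printing press", "steam engine", "telegraph", "telephone", "www"]
pred = ["printing press", "telegraph", "steam engine", "telephone", "www"]
print(exact_match(pred, gold), round(kendall_tau(pred, gold), 2))
# 0.0 vs 0.8: local order is mostly preserved even when exact match fails
```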
Paper & Project Links
PDF Version 2: corrected footnote and added code repository link. Extended version of our work presented at the AAAI-26 AI4TS Workshop (poster) and AAAI-26 Student Abstract Program (oral)
Summary
Large language models (LLMs) are increasingly used in finance and economics, where prompt-based attempts to avoid look-ahead bias implicitly assume that models understand chronology. This study tests that assumption with chronological ordering tasks of increasing complexity over facts the models already know from pre-training: (1) chronological ordering, (2) conditional sorting (filter, then order), and (3) anachronism detection. Evaluating GPT-4.1, Claude-3.7 Sonnet with and without Extended Thinking, and GPT-5 across multiple reasoning-effort settings, the authors find that exact-match rates drop sharply as sequences lengthen even while rank correlations stay high: LLMs largely preserve local order but struggle to maintain a single globally consistent timeline. In conditional sorting, most failures stem from the filtering step rather than the ordering step, though GPT-5 and Claude-3.7 Sonnet with Extended Thinking clearly outperform the other models. Anachronism detection is the easiest task, but performance still declines as timelines or entities increasingly overlap. The main contribution is showing that allocating explicit reasoning budget helps: GPT-5 at medium/high reasoning effort achieves flawless ordering at all lengths and perfect conditional sorting, whereas low/minimal effort degrades with longer lists, mirroring earlier models. These findings delineate the limits of current LLMs on chronological tasks and matter for real-time applications of LLMs in finance.
Key Takeaways
- Understanding chronology is crucial when LLMs are applied in finance and economics.
- Across tasks, LLMs largely preserve local order but struggle to maintain a single globally consistent timeline.
- In conditional sorting, most failures stem from the filtering step rather than the ordering step.
- GPT-5 and models with Extended Thinking (such as Claude-3.7 Sonnet) perform markedly better on complex tasks.
- Anachronism detection is the easiest task for LLMs, but performance declines as entities and timelines overlap.
- Allocating explicit reasoning budget improves performance on chronological ordering tasks.
Click here to view paper screenshots
Knowledge-Grounded Agentic Large Language Models for Multi-Hazard Understanding from Reconnaissance Reports
Authors:Chenchen Kuai, Zihao Li, Braden Rosen, Stephanie Paal, Navid Jafari, Jean-Louis Briaud, Yunlong Zhang, Youssef M. A. Hashash, Yang Zhou
Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.
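A control-flow sketch of the routing-plus-verification loop, with assumed interfaces for the router, retrievers, and verifier (agentic chunking is elided):

```python
def mora_rag(query, route, search, answer, sufficient, refine, max_hops=3):
    """Sketch of MoRA-RAG: a router picks hazard-specific databases for
    the query, retrieval gathers evidence, and a verification loop checks
    evidence sufficiency, refining the query and searching again while
    information remains incomplete."""
    evidence = []
    for _ in range(max_hops):
        for db in route(query):              # mixture-of-retrieval routing
            evidence += search(db, query)
        if sufficient(query, evidence):      # verification loop
            break
        query = refine(query, evidence)      # targeted follow-up search
    return answer(query, evidence)

print(mora_rag(
    "liquefaction damage in the 2023 event?",
    route=lambda q: ["earthquake"],
    search=lambda db, q: [f"{db} passage about: {q}"],
    answer=lambda q, ev: f"grounded answer from {len(ev)} passage(s)",
    sufficient=lambda q, ev: len(ev) >= 1,
    refine=lambda q, ev: q,
))
```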
Paper & Project Links
PDF 17 pages, 5 figures
Summary
The paper introduces Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that turns post-disaster reconnaissance reports into a structured foundation for multi-hazard reasoning. A Mixture-of-Retrieval mechanism dynamically routes queries across hazard-specific databases while agentic chunking preserves contextual coherence during retrieval, and a verification loop assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. MoRA-RAG achieves up to 94.5% accuracy, outperforming zero-shot LLMs by 30% and state-of-the-art RAG systems by 10%, while reducing hallucinations across diverse LLM architectures. It also enables open-weight LLMs to match proprietary models, establishing a new paradigm for turning post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.
Key Takeaways
- LLMs show potential for analyzing post-disaster reconnaissance reports, but without domain grounding they produce unreliable or hallucinated outputs.
- MoRA-RAG is a knowledge-grounded LLM framework that converts reconnaissance reports into structured data for multi-hazard reasoning.
- MoRA-RAG combines a Mixture-of-Retrieval mechanism with agentic chunking to preserve contextual coherence and sharpen queries.
- A verification loop assesses evidence sufficiency and initiates targeted searches.
- MoRA-RAG achieves high accuracy, clearly outperforming prior methods while reducing hallucinations.
- The framework works across diverse LLM architectures and lets open-weight LLMs match proprietary models.
Click here to view paper screenshots
AdamX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate
Authors:Meng Zhu, Quan Xiao, Weidong Min
Since the 21st century, artificial intelligence has been leading a new round of industrial revolution. Under the training framework, the optimization algorithm aims to stably converge high-dimensional optimization to local and even global minima. Entering the era of large language models, although the scale of model parameters and data has increased, Adam remains the mainstream optimization algorithm. However, compared with stochastic gradient descent (SGD) based optimization algorithms, Adam is more likely to converge to non-flat minima. To address this issue, the AdamX algorithm is proposed. Its core innovation lies in the proposition of a novel type of second-order moment estimation exponential decay rate, which gradually weakens the learning step correction strength as training progresses, and degrades to SGD in the stable training period, thereby improving the stability of training in the stable period and possibly enhancing generalization ability. Experimental results show that our second-order moment estimation exponential decay rate is better than the current second-order moment estimation exponential decay rate, and AdamX can stably outperform Adam and its variants in terms of performance. Our code is open-sourced at https://github.com/mengzhu0308/AdamX.
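The abstract does not give the new decay rule, so the following is only a shape-of-the-idea sketch: an Adam-style update whose second-moment preconditioner is annealed away so that late, stable training degrades toward momentum SGD. The linear anneal is an assumption; see the repository for the actual algorithm.

```python
import numpy as np

class AdamXSketch:
    """Illustrative Adam variant in the spirit of AdamX: early steps
    behave like Adam, while an anneal gradually removes the second-moment
    correction so late steps reduce to momentum SGD."""
    def __init__(self, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t_anneal=10000):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.t_anneal = t_anneal
        self.m = self.v = 0.0
        self.t = 0

    def step(self, theta, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        alpha = min(1.0, self.t / self.t_anneal)   # 0 -> 1 over training
        denom = (1 - alpha) * (np.sqrt(v_hat) + self.eps) + alpha * 1.0
        return theta - self.lr * m_hat / denom     # alpha=1 => momentum SGD

opt, theta = AdamXSketch(), 5.0
for _ in range(3):
    theta = opt.step(theta, grad=2 * theta)        # minimise theta^2
print(theta)
```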
Paper & Project Links
PDF 25 pages, 6 figures, 12 tables. Version 2: (1) Clarified i.i.d. assumption on gradient and noise components (implicitly used in v1). See Hypothesis 1 for details. (2) Refined abstract terminology: explicitly states degradation to momentum SGD. The theoretical results and conclusions remain unchanged
Summary
Artificial intelligence is leading a new industrial revolution, and optimization algorithms must stably converge high-dimensional training to local or even global minima. In the era of large language models, Adam remains the mainstream optimizer despite growing parameter and data scale, but compared with SGD-based methods it is more likely to converge to non-flat minima. AdamX addresses this with a novel exponential decay rate for the second-order moment estimate that gradually weakens the learning-step correction as training progresses and degrades to SGD in the stable training period, improving late-stage training stability and potentially enhancing generalization. Experiments show the new decay rate outperforms the current one, and AdamX stably outperforms Adam and its variants.
Key Takeaways
- AI is leading a new industrial revolution, and stable convergence of optimization algorithms is crucial.
- Adam remains the mainstream optimizer despite growing model and data scale.
- Adam tends to converge to non-flat minima, which calls for improvement.
- AdamX introduces a novel exponential decay rate for the second-order moment estimate that gradually weakens the learning-step correction.
- AdamX degrades to SGD in the stable training period, improving training stability.
- AdamX may also enhance generalization.
Click here to view paper screenshots
Model Merging Improves Zero-Shot Generalization in Bioacoustic Foundation Models
Authors:Davide Marincione, Donato Crisostomi, Roberto Dessi, Emanuele Rodolà, Emanuele Rossi
Foundation models capable of generalizing across species and tasks represent a promising new frontier in bioacoustics, with NatureLM being one of the most prominent examples. While its domain-specific fine-tuning yields strong performance on bioacoustic benchmarks, we observe that it also introduces trade-offs in instruction-following flexibility. For instance, NatureLM achieves high accuracy when prompted for either the common or scientific name individually, but its accuracy drops significantly when both are requested in a single prompt. We address this by applying a simple model merging strategy that interpolates NatureLM with its base language model, recovering instruction-following capabilities with minimal loss of domain expertise. Finally, we show that the merged model exhibits markedly stronger zero-shot generalization, achieving over a 200% relative improvement and setting a new state-of-the-art in closed-set zero-shot classification of unseen species.
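The merging strategy is plain weight interpolation between the fine-tuned model and its base, which is one-liner territory. `alpha` below is a hypothetical mixing weight; the paper chooses its own.

```python
def merge_state_dicts(base, finetuned, alpha=0.6):
    """Linear interpolation merge between a domain fine-tuned model
    (e.g. NatureLM-style weights) and its base language model."""
    return {k: (1 - alpha) * base[k] + alpha * finetuned[k] for k in base}

base = {"w": 1.0, "b": 0.0}
ft = {"w": 3.0, "b": 1.0}
print(merge_state_dicts(base, ft))  # {'w': 2.2, 'b': 0.6}
```

The appeal of this design is that it needs no training: a single interpolation recovers the base model's instruction following while retaining most of the fine-tuned domain expertise.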
Paper & Project Links
Summary
Foundation models that generalize across species and tasks are a promising new frontier in bioacoustics, with NatureLM a prominent example. Its domain-specific fine-tuning yields strong performance on bioacoustic benchmarks but introduces trade-offs in instruction-following flexibility: NatureLM is accurate when prompted for either the common or the scientific name individually, yet accuracy drops significantly when both are requested in a single prompt. The authors address this with a simple model merging strategy that interpolates NatureLM with its base language model, recovering instruction-following capabilities with minimal loss of domain expertise. The merged model also exhibits markedly stronger zero-shot generalization, achieving over a 200% relative improvement and setting a new state of the art in closed-set zero-shot classification of unseen species.
Key Takeaways
- Bioacoustic foundation models such as NatureLM show potential to generalize across species and tasks.
- Domain-specific fine-tuning yields strong performance on bioacoustic benchmarks.
- Accuracy drops when the common and scientific names are requested in a single prompt.
- Interpolating the fine-tuned model with its base language model restores instruction-following capability.
- The merging strategy costs only a minimal loss of domain expertise.
- The merged model exhibits markedly stronger zero-shot generalization.
Click here to view paper screenshots
Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools
Authors:Ha Min Son, Huan Ren, Xin Liu, Zhe Zhao
Android is the largest mobile platform, yet automatically building applications remains a practical challenge. While Large Language Models (LLMs) show promise for code repair, their use for fixing Android build errors remains underexplored. To address this gap, we first introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects. Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible. Second, we propose GradleFixer, an LLM agent with domain-specific tools for inspecting and manipulating the Gradle build environment. GradleFixer achieves a resolve rate of 81.4% (pass@1), significantly outperforming a state-of-the-art coding agent that relies on a general-purpose shell. GradleFixer’s success suggests that while LLMs possess the high-level knowledge to solve these failures, they struggle to translate this knowledge into effective low-level actions using a general-purpose shell. We demonstrate the effectiveness of a strategy we term Tool Bridging, which replaces general-purpose shell commands with domain-aware abstractions. We hypothesize this approach works through two mechanisms: 1) it provides tools in an API-like format that LLMs use more reliably, and 2) it constrains the action space to relevant operations. This approach bridges the gap between the model’s high-level reasoning and effective low-level execution.
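Tool Bridging in miniature: instead of a general-purpose shell, the agent sees a small set of Gradle-aware functions in API-like form. The tool names and implementations below are illustrative, not GradleFixer's actual tool set.

```python
import subprocess

def run_gradle(task: str) -> str:
    """Run a single Gradle task and return combined output (truncated)."""
    out = subprocess.run(["./gradlew", task, "--stacktrace"],
                         capture_output=True, text=True)
    return (out.stdout + out.stderr)[-4000:]

def read_dependency_block(build_file: str) -> str:
    """Return just the dependencies { ... } block of a build file
    (naive scan; assumes no nested braces inside the block)."""
    text = open(build_file).read()
    start = text.find("dependencies")
    return text[start:text.find("}", start) + 1] if start != -1 else ""

TOOLS = {"run_gradle": run_gradle,
         "read_dependency_block": read_dependency_block}
# The agent now emits {"tool": "run_gradle", "args": {"task": "assembleDebug"}}
# instead of free-form shell, constraining actions to relevant operations.
```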
Paper & Project Links
Summary
Automatically building Android applications remains a practical challenge, and using LLMs to fix Android build errors is underexplored. The authors first introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects, each paired with a verified solution from a subsequent commit. They then propose GradleFixer, an LLM agent with domain-specific tools for inspecting and manipulating the Gradle build environment. GradleFixer achieves a resolve rate of 81.4% (pass@1), significantly outperforming a state-of-the-art coding agent that relies on a general-purpose shell. The result suggests LLMs possess the high-level knowledge to solve these failures but struggle to translate it into effective low-level shell actions; the proposed Tool Bridging strategy, which replaces general-purpose shell commands with domain-aware abstractions, closes that gap by providing tools in an API-like format and constraining the action space to relevant operations.
Key Takeaways
- Android is the largest mobile platform, yet automatically building applications remains a practical challenge.
- Large language models (LLMs) show promise for code repair, but their use for fixing Android build errors is underexplored.
- AndroidBuildBench is a benchmark of real build failures with verified fixes for evaluating repair approaches.
- GradleFixer, an LLM agent with domain-specific tools for the Gradle build environment, achieves a high resolve rate.
- LLMs struggle to translate high-level knowledge into effective low-level actions through a general-purpose shell, so a bridging strategy is needed.
- Tool Bridging replaces general-purpose shell commands with domain-aware abstractions, improving the agent's efficiency and performance.
Click here to view paper screenshots
Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Authors:Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen
Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it still remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. Furthermore, to enable the model to learn and apply Euclidean principles from these geometry problems, we fine-tuned seven model variants (spanning 3–72B parameters) from the Qwen2.5VL, Qwen3VL, and RoboBrain2.0 families using Group Relative Policy Optimization (GRPO), inspiring the models to identify shapes, count, and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on the Euclid30K, the mean VSI-Bench accuracy rose from 36.6% to 41.8% (+5.2%), and the mean MindCube accuracy rose from 31.4% to 38.1% (+6.7%). To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills. Code and Euclid30K dataset can be found in \href{https://zgca-ai4edu.github.io/Euclids_Gift}{this}.
Paper & Project Links
Summary
Spatial intelligence spans abilities such as visualizing and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity, yet it remains a critical unresolved challenge for multimodal large language models (MLLMs). The authors treat Euclidean geometry problem-solving as a surrogate task, constructing Euclid30K, a curated multimodal dataset of roughly 30K plane and solid geometry problems. They fine-tune seven model variants (3B to 72B parameters) from the Qwen2.5VL, Qwen3VL, and RoboBrain2.0 families with Group Relative Policy Optimization (GRPO), inspiring the models to identify shapes, count, relate entities, and perform multi-step deductive reasoning using Euclidean principles. The resulting models achieve substantial zero-shot gains on four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptation: after training on Euclid30K, mean VSI-Bench accuracy rises from 36.6% to 41.8% (+5.2%) and mean MindCube accuracy from 31.4% to 38.1% (+6.7%). To the authors' knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer broadly transferable spatial skills on vision-language models.
Key Takeaways
- Spatial intelligence covers abilities such as shape visualization, mental rotation, relational-position judgment, and numerosity estimation.
- Multimodal large language models (MLLMs) still struggle with spatial intelligence.
- Euclidean geometry problem-solving is proposed as a surrogate task for improving spatial intelligence.
- Euclid30K is a curated multimodal dataset of about 30K plane and solid geometry problems.
- Models are fine-tuned with Group Relative Policy Optimization (GRPO) to apply Euclidean principles.
- The resulting models achieve substantial zero-shot gains on multiple spatial reasoning benchmarks.
Click here to view paper screenshots
Privacy Preserving In-Context-Learning Framework for Large Language Models
Authors:Bishnu Bhusal, Manoj Acharya, Ramneet Kaur, Colin Samplawski, Anirban Roy, Adam D. Cobb, Rohit Chadha, Susmit Jha
Large language models (LLMs) have significantly transformed natural language understanding and generation, but they raise privacy concerns due to potential exposure of sensitive information. Studies have highlighted the risk of information leakage, where adversaries can extract sensitive information embedded in the prompts. In this work, we introduce a novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees. Our approach leverages the Differential Privacy (DP) framework to ensure worst-case theoretical bounds on information leakage without requiring any fine-tuning of the underlying models. The proposed method performs inference on private records and aggregates the resulting per-token output distributions. This enables the generation of longer and coherent synthetic text while maintaining privacy guarantees. Additionally, we propose a simple blending operation that combines private and public inference to further enhance utility. Empirical evaluations demonstrate that our approach outperforms previous state-of-the-art methods on in-context-learning (ICL) tasks, making it a promising direction for privacy-preserving text generation while maintaining high utility. Our code is available at https://github.com/bhusalb/privacy-preserving-icl.
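A toy numpy sketch of the private-prediction step: per-record output distributions are aggregated, noised (a Gaussian mechanism; calibrating sigma to a target epsilon/delta is elided), and blended with a public distribution. The `lam` and `sigma` values are assumptions for illustration.

```python
import numpy as np

def private_next_token_dist(per_record_dists, public_dist, sigma=0.05, lam=0.7):
    """Aggregate per-token output distributions from inference on private
    records, add calibrated Gaussian noise for differential privacy, and
    blend with a public, non-private distribution to recover utility."""
    agg = np.mean(per_record_dists, axis=0)         # each record contributes 1/n
    noisy = agg + np.random.normal(0.0, sigma, size=agg.shape)
    noisy = np.clip(noisy, 1e-9, None)
    noisy /= noisy.sum()                            # back to a distribution
    blended = lam * noisy + (1 - lam) * public_dist # private/public blending
    return blended / blended.sum()

dists = np.random.dirichlet(np.ones(5), size=8)     # 8 private records, 5 tokens
print(private_next_token_dist(dists, public_dist=np.full(5, 0.2)).round(3))
```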
Paper & Project Links
PDF Git repo: https://github.com/bhusalb/privacy-preserving-icl
Summary
Large language models (LLMs) have significantly transformed natural language understanding and generation, but they raise privacy concerns: adversaries can extract sensitive information embedded in prompts. This work introduces a private prediction framework for generating high-quality synthetic text with strong privacy guarantees. It leverages the Differential Privacy (DP) framework to ensure worst-case theoretical bounds on information leakage without fine-tuning the underlying models: inference is performed on private records and the resulting per-token output distributions are aggregated, enabling longer, coherent synthetic text while maintaining privacy guarantees. A simple blending operation that combines private and public inference further enhances utility. Empirical evaluations show the approach outperforms previous state-of-the-art methods on in-context learning (ICL) tasks, making it a promising direction for privacy-preserving text generation with high utility.
Key Takeaways
- LLMs have transformed natural language processing, but they carry privacy-leakage risks.
- Adversaries may extract sensitive information embedded in prompts.
- The proposed private prediction framework generates synthetic text with strong privacy guarantees.
- The Differential Privacy (DP) framework bounds information leakage without fine-tuning the model.
- Inference runs on private records, and per-token output distributions are aggregated to generate coherent text.
- A blending operation combines private and public inference to improve utility.
- The approach outperforms prior methods on in-context learning (ICL) tasks, a promising direction for privacy-preserving text generation with high utility.
Click here to view paper screenshots
On the Alignment of Large Language Models with Global Human Opinion
Authors:Yang Liu, Masahiro Kaneko, Chenhui Chu
Today’s large language models (LLMs) are capable of supporting multilingual scenarios, allowing users to interact with LLMs in their native languages. When LLMs respond to subjective questions posed by users, they are expected to align with the views of specific demographic groups or historical periods, shaped by the language in which the user interacts with the model. Existing studies mainly focus on researching the opinions represented by LLMs among demographic groups in the United States or a few countries, lacking worldwide country samples and studies on human opinions in different historical periods, as well as lacking discussion on using language to steer LLMs. Moreover, they also overlook the potential influence of prompt language on the alignment of LLMs’ opinions. In this study, our goal is to fill these gaps. To this end, we create an evaluation framework based on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across different countries, languages, and historical periods around the world. We find that LLMs appropriately or over-align the opinions with only a few countries while under-aligning the opinions with most countries. Furthermore, changing the language of the prompt to match the language used in the questionnaire can effectively steer LLMs to align with the opinions of the corresponding country more effectively than existing steering methods. At the same time, LLMs are more aligned with the opinions of the contemporary population. To our knowledge, our study is the first comprehensive investigation of the topic of opinion alignment in LLMs across global, language, and temporal dimensions. Our code and data are publicly available at https://github.com/ku-nlp/global-opinion-alignment and https://github.com/nlply/global-opinion-alignment.
Paper & Project Links
PDF 28 pages, 26 figures
Summary
Today's LLMs support multilingual interaction, and when responding to subjective questions they are expected to align with the views of specific demographic groups or historical periods shaped by the language the user interacts in. Existing studies mainly examine the opinions LLMs represent for demographic groups in the United States or a few countries, lacking worldwide country samples, studies of human opinions across historical periods, and discussion of steering LLMs via language; they also overlook the potential influence of prompt language on opinion alignment. This study builds an evaluation framework on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across countries, languages, and historical periods. LLMs appropriately or over-align with the opinions of only a few countries while under-aligning with most; changing the prompt language to match the language of the questionnaire steers LLMs toward the corresponding country's opinions more effectively than existing steering methods; and LLMs are more aligned with the opinions of the contemporary population. This is the first comprehensive investigation of opinion alignment in LLMs across global, language, and temporal dimensions. Code and data are publicly available.
Key Takeaways
- LLMs support multilingual interaction; existing studies mainly examine opinion representativeness for a few countries' populations.
- There has been a lack of worldwide country samples, studies across historical periods, and discussion of using language to steer LLMs.
- An evaluation framework built on the World Values Survey shows LLMs align appropriately or excessively with only a few countries and under-align with most.
- Matching the prompt language to the questionnaire's language effectively steers LLMs toward the corresponding country's opinions, and LLMs align more closely with contemporary opinions.