⚠️ All summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never use these summaries in serious academic settings; they are intended only as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Free trial on HuggingFace
Updated 2025-11-21
The Impact of Quantization on Large Reasoning Model Reinforcement Learning
Authors:Medha Kumar, Zifei Xu, Xin Wang, Tristan Webb
Strong reasoning capabilities can now be achieved by large-scale reinforcement learning (RL) without any supervised fine-tuning. Although post-training quantization (PTQ) and quantization-aware training (QAT) are well studied in the context of fine-tuning, how quantization impacts RL in large reasoning models (LRMs) remains an open question. To answer this question, we conducted systematic experiments and discovered a significant gap in reasoning performance on mathematical benchmarks between post-RL quantized models and their quantization-aware RL optimized counterparts. Our findings suggest that quantization-aware RL training negatively impacted the learning process, whereas PTQ and QLoRA led to greater performance.
Paper and project links
PDF Accepted to the NeurIPS 2025 Efficient Reasoning Workshop
Summary
Strong reasoning can now be achieved through large-scale reinforcement learning (RL) without any supervised fine-tuning. While post-training quantization (PTQ) and quantization-aware training (QAT) are well studied for fine-tuning, how quantization affects RL in large reasoning models (LRMs) remains an open question. Systematic experiments reveal a significant gap in reasoning performance on mathematical benchmarks between post-RL quantized models and their quantization-aware RL-optimized counterparts: quantization-aware RL training hurt the learning process, whereas PTQ and QLoRA led to better performance.
Key Takeaways
- Reinforcement learning (RL) can now produce strong reasoning capabilities without any supervised fine-tuning.
- How quantization affects RL in large reasoning models (LRMs) is still largely an open question.
- The paper compares models quantized after RL with quantization-aware RL-optimized counterparts on mathematical reasoning benchmarks.
- Quantization-aware RL training was found to negatively impact the learning process.
- PTQ and QLoRA delivered better performance.
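The contrast the abstract draws between PTQ and quantization-aware training comes down to where the quantize-dequantize step sits relative to the optimization loop. The sketch below is not taken from the paper; it is a minimal PyTorch illustration assuming a hypothetical 8-bit symmetric per-tensor scheme: PTQ rounds the weights once after training, while a QAT-style layer fake-quantizes the weights in every forward pass and routes gradients to the full-precision copy through a straight-through estimator.

```python
import torch
import torch.nn as nn

def symmetric_quantize(w: torch.Tensor, num_bits: int = 8):
    """Per-tensor symmetric quantization: map weights to signed integers with one scale."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q, scale

def ptq_dequantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """PTQ view: quantize once after training and keep the dequantized weights."""
    q, scale = symmetric_quantize(w, num_bits)
    return q * scale

class FakeQuantLinear(nn.Linear):
    """QAT-style layer: the forward pass sees quantized weights, while gradients
    reach the full-precision weights through a straight-through estimator."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_dq = ptq_dequantize(self.weight)
        w_ste = self.weight + (w_dq - self.weight).detach()  # straight-through estimator
        return nn.functional.linear(x, w_ste, self.bias)

# toy usage: quantized forward pass, gradients still flow to the float weights
layer = FakeQuantLinear(16, 4)
layer(torch.randn(2, 16)).sum().backward()
```

In a quantization-aware RL setup the policy would be wrapped in such fake-quantized layers during the RL updates, whereas in the post-RL PTQ setting the rounding happens only once after training finishes.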
MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking, Facial, and Acoustic Features
Authors:Sejuti Rahman, Swakshar Deb, MD. Sameer Iqbal Chowdhury, MD. Jubair Ahmed Sourov, Mohammad Shamsuddin
Eye tracking data quantifies the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomotor retardation characteristic of depression. Statistical validation confirmed their significant discriminative power in distinguishing depressed from non depressed groups. We address a critical limitation of existing graph-based models that focus on low-frequency information and propose a Multi-Frequency Graph Convolutional Network (MF-GCN). This framework consists of a novel Multi-Frequency Filter Bank Module (MFFBM), which can leverage both low and high frequency signals. Extensive evaluation against traditional machine learning algorithms and deep learning frameworks demonstrates that MF-GCN consistently outperforms baselines. In binary (depressed and non depressed) classification, the model achieved a sensitivity of 0.96 and F2 score of 0.94. For the 3 class (no depression, mild to moderate depression and severe depression) classification task, the proposed method achieved a sensitivity of 0.79 and specificity of 0.87 and significantly surpassed other models. To validate generalizability, the model was also evaluated on the Chinese Multimodal Depression Corpus (CMDC) dataset and achieved a sensitivity of 0.95 and F2 score of 0.96. These results confirm that our trimodal, multi frequency framework effectively captures cross modal interaction for accurate depression detection.
Paper and project links
Summary: Eye-tracking, audio, and video data are statistically validated for their power to distinguish depressed from non-depressed groups. To address the limitation of existing graph-based models that focus on low-frequency information, the paper proposes a Multi-Frequency Graph Convolutional Network (MF-GCN) built around a novel Multi-Frequency Filter Bank Module (MFFBM) that exploits both low- and high-frequency signals. MF-GCN outperforms traditional machine learning and deep learning baselines on binary and three-class classification, and its generalizability is validated on the Chinese Multimodal Depression Corpus (CMDC).
Key Takeaways:
- Eye-tracking, audio, and video data are applied to depression detection.
- Statistical validation confirms their power to discriminate depressed from non-depressed groups.
- Existing graph-based models are limited by their focus on low-frequency information.
- A new Multi-Frequency Graph Convolutional Network (MF-GCN) is proposed.
- MF-GCN includes a Multi-Frequency Filter Bank Module (MFFBM) that processes low- and high-frequency signals jointly.
- MF-GCN outperforms traditional machine learning and deep learning baselines on binary and three-class classification.
- The model also validates well on the Chinese Multimodal Depression Corpus (CMDC).
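The low- versus high-frequency distinction the abstract refers to can be made concrete with the standard graph-signal view: low-pass filters smooth node features across edges, while high-pass filters emphasize differences between neighbors. The sketch below is a generic two-branch filter bank on the symmetric normalized Laplacian, not the paper's MFFBM; the toy ring graph and feature sizes are assumptions for illustration.

```python
import torch

def normalized_laplacian(adj: torch.Tensor) -> torch.Tensor:
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(dim=1)
    d_inv_sqrt = torch.diag(deg.clamp(min=1e-8).pow(-0.5))
    return torch.eye(adj.size(0)) - d_inv_sqrt @ adj @ d_inv_sqrt

def filter_bank(adj: torch.Tensor, x: torch.Tensor):
    """Two-branch filter bank: (I - L) acts as a low-pass filter that smooths
    features over neighbors, L acts as a high-pass filter that keeps differences."""
    lap = normalized_laplacian(adj)
    low = (torch.eye(adj.size(0)) - lap) @ x   # low-frequency branch
    high = lap @ x                             # high-frequency branch
    return low, high

# toy 4-node ring graph with 3-dimensional node features
adj = torch.tensor([[0., 1., 0., 1.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [1., 0., 1., 0.]])
low, high = filter_bank(adj, torch.randn(4, 3))
```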
VisPlay: Self-Evolving Vision-Language Models from Images
Authors:Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
Paper and project links
Summary
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks, but existing RL methods rely on human-annotated labels or task-specific heuristics to define verifiable rewards, which are costly and hard to scale. VisPlay is a self-evolving RL framework that lets VLMs improve their reasoning autonomously from large amounts of unlabeled image data. It splits a single base model into two interacting roles, an Image-Conditioned Questioner and a Multimodal Reasoner, trained jointly with Group Relative Policy Optimization (GRPO) using diversity and difficulty rewards to balance question complexity against the quality of the silver answers. Trained on Qwen2.5-VL and MiMo-VL, VisPlay yields consistent gains in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a viable path toward self-evolving multimodal intelligence.
Key Takeaways
- Reinforcement learning provides a framework for improving VLMs on complex reasoning tasks.
- Existing RL methods depend on human annotation or heuristics, which are costly and hard to scale.
- VisPlay is a self-evolving RL framework that improves VLM reasoning using large amounts of unlabeled image data.
- VisPlay splits the model into two interacting roles: an Image-Conditioned Questioner and a Multimodal Reasoner.
- The two roles are trained jointly with Group Relative Policy Optimization, incorporating diversity and difficulty rewards.
- VisPlay scales across two model families and improves visual reasoning, compositional generalization, and hallucination reduction.
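GRPO, which the abstract names as the training algorithm, scores each response in a group of rollouts for the same prompt by its reward relative to the group. A minimal sketch of that group-relative advantage is shown below; the composite reward mixing a difficulty term for the Questioner with a quality term for the Reasoner, and its 0.5/0.5 weights, are illustrative assumptions rather than the paper's actual reward design.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantage: reward of each response relative to its group,
    normalized by the group's standard deviation."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# hypothetical composite reward for one group of four question/answer rollouts:
# a difficulty term for the Questioner plus a quality term for the Reasoner
difficulty = torch.tensor([0.2, 0.8, 0.5, 0.9])
quality = torch.tensor([0.9, 0.4, 0.7, 0.3])
rewards = 0.5 * difficulty + 0.5 * quality   # illustrative weights, not the paper's
print(group_relative_advantages(rewards))
```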
When to Think and When to Look: Uncertainty-Guided Lookback
Authors:Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang, Yunlong Tang, Luchuan Song, Susan Liang, Zhongfei Zhang, Jason J. Corso, Chenliang Xu
Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty guided lookback, a training free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math focused visual reasoning datasets.
Paper and project links
Summary
This paper analyzes how test-time thinking (generating explicit intermediate reasoning chains) affects large vision-language models (LVLMs). A large-scale, controlled comparison shows that more thinking is not always better: long chains often produce long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. Successful trajectories are enriched with short lookback phrases that explicitly refer back to the image and correlate with better visual grounding. Building on this, the paper proposes uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. It improves overall MMMU performance, delivers the largest gains where standard thinking is weak, outperforms several strong decoding baselines under fixed model families and token budgets, and generalizes to five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.
Key Takeaways
- Test-time thinking (generating intermediate reasoning chains) can boost LVLM performance.
- More thinking is not always better; long chains can produce wrong trajectories that ignore the image.
- Successful trajectories contain short lookback phrases that explicitly refer back to the image.
- The paper proposes uncertainty-guided lookback, a training-free decoding strategy combining an uncertainty signal with adaptive lookback prompts and breadth search.
- The method improves overall performance, with the largest gains in categories where standard thinking is weak.
- Under fixed model families and token budgets, it sets a new state of the art and outperforms several strong decoding baselines.
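One simple way to realize the "uncertainty signal plus adaptive lookback prompt" idea is to measure the entropy of the next-token distribution and inject a lookback instruction when it crosses a threshold. The sketch below covers only that trigger; the threshold value, the prompt wording, and the omission of the paper's breadth search are all assumptions made for illustration.

```python
import torch

def token_entropy(logits: torch.Tensor) -> float:
    """Predictive entropy (in nats) of the next-token distribution."""
    probs = torch.softmax(logits, dim=-1)
    return float(-(probs * torch.log(probs + 1e-12)).sum())

def maybe_inject_lookback(logits: torch.Tensor, threshold: float = 2.5) -> str | None:
    """If the model looks uncertain about the next token, return a lookback prompt
    asking it to re-examine the image before continuing; otherwise return None."""
    if token_entropy(logits) > threshold:
        return "Before continuing, look back at the image and verify the relevant details."
    return None

# toy check with random logits over a 32k-token vocabulary
print(maybe_inject_lookback(torch.randn(32_000)))
```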
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Authors:Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, Xipeng Qiu
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model’s own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model’s latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO’s efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
Paper and project links
Summary
Vision-Language-Action (VLA) models perform well in robotic manipulation but rely heavily on expert demonstrations, which introduces demonstration bias and limits performance. Reinforcement learning is a key post-training remedy, yet existing VLA-RL methods, including group-based optimization, suffer from severe reward sparsity: binary success signals waste the information in failed trajectories. SRPO (Self-Referential Policy Optimization) removes the need for external demonstrations or manual reward engineering by using the model's own successful trajectories from the current training batch as a self-reference and assigning progress-wise rewards to failed attempts. Behavioral progress is measured robustly with latent world-model representations rather than raw pixels or domain-specific fine-tuning. On LIBERO, SRPO lifts a 48.9% supervised baseline to a state-of-the-art 99.2% success rate in only 200 RL steps (a 103% relative improvement) without extra supervision, and improves performance by 167% on LIBERO-Plus.
Key Takeaways
- VLA models for robotic manipulation depend on expert demonstrations, leading to demonstration bias and limited performance.
- Reinforcement learning is a key strategy to overcome these limits, but existing methods suffer from reward sparsity.
- SRPO addresses reward sparsity by using the model's own successful trajectories as a self-reference.
- SRPO assigns progress-wise rewards to failed attempts, improving training efficiency.
- Behavioral progress is measured with latent world-model representations, avoiding raw pixels and domain-specific fine-tuning.
- On the LIBERO benchmark, SRPO reaches a 99.2% success rate from a 48.9% baseline in just 200 RL steps without extra supervision.
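A plausible reading of "progress-wise reward from latent world representations" is to embed both a failed trajectory and a successful reference trajectory from the same batch into a shared latent space and reward each failed step by how far along the reference it got. The sketch below implements that reading with cosine similarity; it is an assumption-laden illustration, not the paper's actual reward definition.

```python
import torch
import torch.nn.functional as F

def progress_reward(failed_traj: torch.Tensor, success_traj: torch.Tensor) -> torch.Tensor:
    """Per-step reward for a failed rollout: how far along a successful reference
    trajectory each failed step gets, measured in a shared latent space.
    Both inputs are (steps, dim) sequences of latent world-model states."""
    sims = F.cosine_similarity(
        failed_traj.unsqueeze(1),   # (T_f, 1, dim)
        success_traj.unsqueeze(0),  # (1, T_s, dim)
        dim=-1,
    )                               # (T_f, T_s) similarity matrix
    best_match = sims.argmax(dim=1).float()          # closest reference step per failed step
    return best_match / (success_traj.size(0) - 1)   # progress as a fraction of the reference

# toy latents: 8-step failed rollout vs. 10-step successful rollout, 16-dim states
print(progress_reward(torch.randn(8, 16), torch.randn(10, 16)))
```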
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
Authors:Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar
With the increasing prevalence of video content, effectively understanding and answering questions about long form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both a comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR’s effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.
Paper and project links
PDF Accepted in the 5th IEEE Big Data Workshop on Multimodal AI (MMAI 2025), Dec 8-11, Macau, China, 2025 (Preprint Copy)
Summary
The AVATAAR framework combines global and local video context with a Pre-Retrieval Thinking Agent and a Rethink Module to improve long-form video understanding and question answering. It achieves significant gains over a baseline on the CinePile benchmark, each module contributes positively to overall performance, and the feedback loop is crucial for adaptability.
Key Takeaways
- AVATAAR targets long-form video understanding and question answering across many applications.
- Large vision-language models struggle with nuanced queries that require both comprehensive understanding and detailed analysis.
- AVATAAR combines global and local video context in a modular, interpretable framework.
- A Pre-Retrieval Thinking Agent and a Rethink Module form a feedback loop that refines retrieval strategies based on partial answers.
- On the CinePile benchmark, it achieves notable gains in temporal reasoning, technical queries, theme-based questions, and narrative comprehension.
- The feedback loop is crucial for adaptability, and every module contributes positively to overall performance.
HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning
Authors:Qihao Yang, Xuelin Wang, Jiale Chen, Xuelian Dong, Yuxin Hao, Tianyong Hao
Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners’ language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. In this paper, we present HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. It covers HSK levels 3 to 6 and includes authentic textbooks with 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human learning trajectories, we introduce a curriculum-tuning framework that trains models from beginner to advanced levels. An evaluation system is created to examine level-based grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. We also build HSKAgent, fine-tuned on 10K learner compositions. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs have writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLMs interpretability. Code and data are publicly available at: https://github.com/CharlesYang030/HSKB.
Paper and project links
PDF Accepted by AAAI-2026
Summary
The paper presents HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese second language acquisition (SLA). It covers HSK levels 3 to 6 and includes authentic textbook data, synthetic instruction samples, test topics, and a linguistically grounded evaluation system. A curriculum-tuning framework trains models from beginner to advanced levels to simulate human learning trajectories, and HSKAgent is fine-tuned on 10K learner compositions. Experiments show that HSKBenchmark models Chinese SLA effectively and serves as a reliable benchmark for dynamic writing assessment in LLMs, paving the way for research on language acquisition modeling and LLM interpretability.
Key Takeaways
- HSKBenchmark is the first benchmark for staged modeling and writing assessment of LLMs in Chinese second language acquisition (SLA).
- It covers HSK levels 3 to 6 with authentic textbook data, synthetic instruction samples, test topics, and a linguistically grounded evaluation system.
- A curriculum-tuning framework trains models level by level to simulate human learning trajectories.
- HSKAgent is built by fine-tuning on 10K learner compositions.
- HSKBenchmark models Chinese SLA effectively and provides a reliable benchmark for LLM writing assessment.
- The released models and resources lay groundwork for future research on language acquisition modeling and LLM interpretability.
Computer-Use Agents as Judges for Generative User Interface
Authors:Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou
Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUI remain designed primarily for humans–prioritizing aesthetics and usability–forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA as judges to assist Coder for automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
Paper and project links
PDF Project: https://showlab.github.io/AUI Github: https://github.com/showlab/AUI
Summary
The paper explores using Computer-Use Agents (CUA) in GUI design. Existing GUIs are designed for humans and are inefficient for agents to operate, while rapid progress in coding-oriented language models (Coder) has made automatic GUI design feasible. The authors introduce AUI-Gym, a benchmark for automatic GUI development, and a Coder-CUA collaboration framework in which the Coder designs and revises websites while the CUA acts as a judge, evaluating functionality and refining designs. Success is measured by task solvability and CUA navigation success rate rather than visual appearance. A CUA Dashboard compresses multi-step navigation histories into concise visual summaries that guide iterative redesign, shifting agents from passive use toward active participation in digital environments.
Key Takeaways
- Computer-Use Agents (CUA) are increasingly able to operate digital environments autonomously through GUIs.
- Existing GUIs are designed primarily for humans and are inefficient for agents.
- Automatic GUI design has become feasible thanks to rapid advances in coding-oriented language models.
- AUI-Gym is introduced as a benchmark and evaluation platform for automatic GUI development.
- A Coder-CUA collaboration framework is proposed: the Coder designs websites and the CUA judges their functionality.
- Success is measured by task solvability and CUA navigation success rate rather than visual appearance.
Meta-Black-Box Optimization with Bi-Space Landscape Analysis and Dual-Control Mechanism for SAEA
Authors:Yukun Du, Haiyue Yu, Xiaotong Xie, Yan Zheng, Lixin Zhan, Yudong Du, Chongshuang Hu, Boxuan Wang, Jiang Jiang
Surrogate-Assisted Evolutionary Algorithms (SAEAs) are widely used for expensive Black-Box Optimization. However, their reliance on rigid, manually designed components such as infill criteria and evolutionary strategies during the search process limits their flexibility across tasks. To address these limitations, we propose Dual-Control Bi-Space Surrogate-Assisted Evolutionary Algorithm (DB-SAEA), a Meta-Black-Box Optimization (MetaBBO) framework tailored for multi-objective problems. DB-SAEA learns a meta-policy that jointly regulates candidate generation and infill criterion selection, enabling dual control. The bi-space Exploratory Landscape Analysis (ELA) module in DB-SAEA adopts an attention-based architecture to capture optimization states from both true and surrogate evaluation spaces, while ensuring scalability across problem dimensions, population sizes, and objectives. Additionally, we integrate TabPFN as the surrogate model for accurate and efficient prediction with uncertainty estimation. The framework is trained via reinforcement learning, leveraging parallel sampling and centralized training to enhance efficiency and transferability across tasks. Experimental results demonstrate that DB-SAEA not only outperforms state-of-the-art baselines across diverse benchmarks, but also exhibits strong zero-shot transfer to unseen tasks with higher-dimensional settings. This work introduces the first MetaBBO framework with dual-level control over SAEAs and a bi-space ELA that captures surrogate model information.
Paper and project links
Summary
To address the limitations of Surrogate-Assisted Evolutionary Algorithms (SAEAs) on expensive black-box optimization, the paper proposes DB-SAEA, a Meta-Black-Box Optimization (MetaBBO) framework for multi-objective problems that exerts dual control by jointly regulating candidate generation and infill criterion selection. A bi-space Exploratory Landscape Analysis module with an attention-based architecture captures optimization states from both the true and surrogate evaluation spaces, and TabPFN serves as the surrogate model for accurate, efficient prediction with uncertainty estimation. Experiments show that DB-SAEA outperforms state-of-the-art baselines on diverse benchmarks and exhibits strong zero-shot transfer to unseen, higher-dimensional tasks.
Key Takeaways
- DB-SAEA is the first MetaBBO framework with dual-level control over Surrogate-Assisted Evolutionary Algorithms (SAEAs).
- It addresses the limited task flexibility of traditional SAEAs and is tailored to multi-objective optimization problems.
- A bi-space Exploratory Landscape Analysis module captures optimization states from both the true and surrogate evaluation spaces.
- An attention-based architecture keeps the module scalable across problem dimensions, population sizes, and objectives.
- TabPFN is integrated as the surrogate model for accurate and efficient prediction with uncertainty estimation.
- DB-SAEA is trained with reinforcement learning, using parallel sampling and centralized training to improve efficiency and cross-task transfer.
LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering
Authors:Yuanjie Zhu, Liangwei Yang, Ke Xu, Weizhi Zhang, Zihe Song, Jindong Wang, Philip S. Yu
Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering based on their deep semantic understanding. However, their direct application is fundamentally limited by a lack of stateful memory for iterative refinement and the difficulty of managing cluster granularity. As a result, existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task. It leverages a Dynamic Memory to instill state awareness and a Dual-Prompt Strategy to enable the model to reason about and determine the number of clusters. Evaluated on several benchmark datasets, our tuning-free framework significantly and consistently outperforms strong baselines. LLM-MemCluster presents an effective, interpretable, and truly end-to-end paradigm for LLM-based text clustering.
Paper and project links
Summary
Large language models (LLMs) are reshaping unsupervised learning by enabling text clustering based on deep semantic understanding, but their direct use is limited by the lack of stateful memory for iterative refinement and by the difficulty of controlling cluster granularity. LLM-MemCluster reframes clustering as a fully LLM-native task: a Dynamic Memory instills state awareness, and a Dual-Prompt Strategy lets the model reason about and determine the number of clusters. The tuning-free framework significantly and consistently outperforms strong baselines on several benchmark datasets, offering an effective, interpretable, and truly end-to-end paradigm for LLM-based text clustering.
Key Takeaways
- LLMs are reshaping text clustering in unsupervised learning through deep semantic understanding.
- Direct application of LLMs to clustering is limited by the lack of stateful memory and the difficulty of managing cluster granularity.
- Existing methods rely on complex pipelines with external modules, sacrificing a truly end-to-end approach.
- LLM-MemCluster reframes clustering as a fully LLM-native task.
- It uses a Dynamic Memory for state awareness and a Dual-Prompt Strategy to reason about and determine the number of clusters.
- The tuning-free framework significantly and consistently outperforms strong baselines on several benchmark datasets.
DEPO: Dual-Efficiency Preference Optimization for LLM Agents
Authors:Sirui Chen, Mengshi Zhao, Lei Xu, Yuying Zhao, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu
Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Nevertheless, there still lacks systematic definition of LLM agent efficiency, hindering targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at https://opencausalab.github.io/DEPO.
Paper and project links
PDF Accepted to AAAI 2026
Summary
LLMs deployed as agents have much stronger reasoning and decision-making abilities, but richer reasoning often means longer chains of thought (CoT), which hurts interaction efficiency in real-world scenarios. The paper defines dual-efficiency, covering step-level efficiency (fewer tokens per step) and trajectory-level efficiency (fewer steps per task), and proposes DEPO, a preference-optimization method that jointly rewards succinct responses and fewer action steps. On WebShop and BabyAI, DEPO cuts token usage by up to 60.9% and steps by up to 26.9% while improving performance by up to 29.3%. It also generalizes to three out-of-domain math benchmarks and keeps its efficiency gains when trained on only 25% of the data.
Key Takeaways
- LLM agents reason and decide better than before, but longer chains of thought hurt interaction efficiency in real-world scenarios.
- Dual-efficiency is introduced, combining step-level and trajectory-level efficiency to measure how economically an agent completes tasks.
- DEPO jointly rewards succinct responses and fewer action steps to improve both efficiency and performance.
- On WebShop and BabyAI, DEPO substantially reduces token usage and step counts while improving performance.
- DEPO generalizes well, performing strongly on three out-of-domain math benchmarks.
- DEPO retains its efficiency gains when trained on only 25% of the data, showing strong adaptability and practicality.
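The two efficiency notions in the summary are easy to state as code: step-level efficiency counts tokens per step, trajectory-level efficiency counts steps per task. The sketch below defines both on a toy trajectory type; the tie-breaking preference rule at the end is an illustrative assumption, not DEPO's actual preference-optimization objective.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One agent rollout: a list of generated token ids per action step."""
    steps: list[list[int]]
    success: bool

def step_efficiency(traj: Trajectory) -> float:
    """Step-level efficiency: average tokens emitted per step (lower is better)."""
    return sum(len(s) for s in traj.steps) / max(len(traj.steps), 1)

def trajectory_efficiency(traj: Trajectory) -> int:
    """Trajectory-level efficiency: number of steps used to finish the task (lower is better)."""
    return len(traj.steps)

def prefer(a: Trajectory, b: Trajectory) -> Trajectory:
    """Toy preference: success first, then fewer steps, then fewer tokens per step."""
    rank = lambda t: (not t.success, trajectory_efficiency(t), step_efficiency(t))
    return min((a, b), key=rank)

concise = Trajectory(steps=[[1] * 20, [2] * 15], success=True)
verbose = Trajectory(steps=[[1] * 80, [2] * 60, [3] * 50], success=True)
assert prefer(concise, verbose) is concise
```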
Breaking Expert Knowledge Limits: Self-Pruning for Large Language Models
Authors:Haidong Kang, Lihong Lin, Enneng Yang, Hongning Dai, Hao Wang
Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, but their massive size hinders real-world deployment. Existing pruning methods (e.g., Wanda) tailored for LLMs rely heavily on manually designed pruning algorithms, thereby leading to huge labor costs and requiring expert knowledge. Furthermore, we are the first to identify the serious outlier value issue behind dramatic performance degradation under high pruning ratios that are caused by uniform sparsity, raising an additional concern about how to design adaptive pruning sparsity ideal for LLMs. Can LLMs prune by themselves? In this work, we introduce an affirmative answer by proposing a novel pruning method called AutoPrune, which first overcomes expert knowledge limits by leveraging LLMs to design optimal pruning algorithms for themselves automatically without any expert knowledge. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights into the outlier value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to overcome the outlier value issue, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLMs benchmarks, demonstrating the superiority of AutoPrune, which consistently outperforms state-of-the-art competitors. The code is available at: https://anonymous.4open.science/r/AutoPrune.
Paper and project links
Summary
LLMs perform well across many tasks, but their massive size hinders deployment. Existing pruning methods rely on manually designed algorithms, which are labor-intensive and require expert knowledge. The paper proposes AutoPrune, which lets LLMs design optimal pruning algorithms for themselves without expert knowledge. A Graph-driven Chain-of-Thought (GCoT) optimizes prompts, strengthening the reasoning used to learn the pruning algorithm and yielding next-generation pruning algorithms with better performance and interpretability. To address the outlier value issue behind performance collapse at high pruning ratios, Skew-aware Dynamic Sparsity Allocation (SDSA) is introduced. Experiments on mainstream LLM benchmarks show that AutoPrune consistently beats state-of-the-art competitors.
Key Takeaways
- LLMs perform well on many tasks, but their scale makes deployment difficult.
- Current pruning methods depend on manually designed algorithms, which are costly and require expert knowledge.
- The paper is the first to identify the outlier value issue behind the sharp performance drop at high pruning ratios.
- AutoPrune is a new self-pruning method that designs optimal pruning algorithms without expert knowledge.
- A Graph-driven Chain-of-Thought (GCoT) optimizes prompts and strengthens the reasoning used to learn pruning algorithms.
- Skew-aware Dynamic Sparsity Allocation (SDSA) addresses the outlier value issue and mitigates degradation at high pruning ratios.
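For context on what a manually designed pruning algorithm like the Wanda baseline mentioned in the abstract looks like, the sketch below scores each weight by its magnitude times the norm of its input activations and prunes the lowest-scoring fraction per output row. This is the generic Wanda-style idea, not AutoPrune's generated algorithm; SDSA would further replace the uniform sparsity ratio with a per-layer, skew-dependent one, which is omitted here.

```python
import torch

def wanda_scores(weight: torch.Tensor, act_norm: torch.Tensor) -> torch.Tensor:
    """Wanda-style score: |W_ij| * ||x_j||, with act_norm holding per-input-channel
    activation norms collected on a small calibration set."""
    return weight.abs() * act_norm.unsqueeze(0)

def prune_by_score(weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the lowest-scoring fraction of weights within each output row."""
    scores = wanda_scores(weight, act_norm)
    k = int(weight.size(1) * sparsity)
    if k == 0:
        return weight
    threshold = scores.kthvalue(k, dim=1, keepdim=True).values
    return weight * (scores > threshold)

# toy layer: 8 output features, 32 input features, uniform 50% sparsity
pruned = prune_by_score(torch.randn(8, 32), torch.rand(32), sparsity=0.5)
```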
Parameter Importance-Driven Continual Learning for Foundation Models
Authors:Lingxiang Wang, Hainan Zhang, Zhiming Zheng
Domain-specific post-training often causes catastrophic forgetting, making foundation models lose their general reasoning ability and limiting their adaptability to dynamic real-world environments. Preserving general capabilities while acquiring downstream domain knowledge is a central challenge for large language and multimodal models. Traditional continual learning methods, such as regularization, replay and architectural isolation, suffer from poor downstream performance, reliance on inaccessible historical data, or additional parameter overhead. While recent parameter-efficient tuning (PET) methods can alleviate forgetting, their effectiveness strongly depends on the choice of parameters and update strategies. In this paper, we introduce PIECE, a Parameter Importance Estimation-based Continual Enhancement method that preserves general ability while efficiently learning domain knowledge without accessing prior training data or increasing model parameters. PIECE selectively updates only 0.1% of core parameters most relevant to new tasks, guided by two importance estimators: PIECE-F based on Fisher Information, and PIECE-S based on a second-order normalization that combines gradient and curvature information. Experiments across three language models and two multimodal models show that PIECE maintains general capabilities and achieves state-of-the-art continual learning performance across diverse downstream tasks. Our results highlight a practical path to scalable, domain-adaptive foundation models without catastrophic forgetting.
Paper and project links
Summary
Domain-specific post-training often causes catastrophic forgetting in large language and multimodal models. PIECE, a Parameter Importance Estimation-based Continual Enhancement method, preserves general ability while efficiently learning domain knowledge, without access to prior training data or extra model parameters. It selectively updates only the 0.1% of core parameters most relevant to new tasks, guided by two importance estimators. Experiments show that PIECE maintains general capabilities while achieving state-of-the-art continual learning performance across diverse downstream tasks.
Key Takeaways
- Domain-specific post-training causes catastrophic forgetting in large language and multimodal models, eroding general reasoning ability and adaptability to dynamic real-world environments.
- Traditional continual learning methods such as regularization, replay, and architectural isolation suffer from poor downstream performance, reliance on inaccessible historical data, or extra parameter overhead.
- Recent parameter-efficient tuning (PET) methods can alleviate forgetting, but their effectiveness depends on the choice of parameters and update strategies.
- PIECE performs continual enhancement based on parameter importance estimation, preserving general ability while learning domain knowledge without prior training data or extra parameters.
- PIECE selectively updates the core parameters most relevant to new tasks using two importance estimators: Fisher-information-based PIECE-F and PIECE-S, which combines gradient and curvature information.
- Experiments across language and multimodal models show that PIECE maintains general capabilities and achieves state-of-the-art continual learning performance on diverse downstream tasks.
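The Fisher-information estimator behind PIECE-F can be sketched as the average squared gradient of the task loss with respect to each parameter, followed by selecting the top 0.1% of parameters for updating. The code below is a minimal, assumption-heavy illustration of that recipe in PyTorch, not the paper's implementation; the second-order PIECE-S variant is not shown.

```python
import torch
import torch.nn as nn

def fisher_importance(model: nn.Module, data_loader, loss_fn) -> dict[str, torch.Tensor]:
    """Diagonal Fisher-style importance: average squared gradient of the task loss
    with respect to each parameter over a set of batches."""
    importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for inputs, targets in data_loader:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                importance[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: v / max(n_batches, 1) for n, v in importance.items()}

def top_fraction_mask(importance: dict[str, torch.Tensor], fraction: float = 0.001):
    """Boolean masks selecting the most important `fraction` of all parameters for updates."""
    flat = torch.cat([v.flatten() for v in importance.values()])
    k = max(int(flat.numel() * fraction), 1)
    threshold = flat.topk(k).values.min()
    return {n: (v >= threshold) for n, v in importance.items()}
```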
Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration
Authors:Yifu Guo, Zishan Xu, Zhiyuan Yao, Yuquan Lu, Jiaye Lin, Sen Hu, Zhenheng Tang, Yingchao Li, Huacan Wang, Ronghao Chen
Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways-whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus is capable of autonomously exploring during reasoning and dynamically selecting the most appropriate capability based on the current state. Experimental results show that Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.
Paper and project links
Summary
Existing multimodal reasoning models and frameworks lack the human-like ability to autonomously explore diverse reasoning pathways and struggle to adapt to dynamically changing capability requirements. The paper proposes Octopus, agentic multimodal reasoning with six-capability orchestration, defines six core capabilities, and builds the corresponding evaluation benchmark Octopus-Bench. Octopus explores autonomously during reasoning and dynamically selects the most appropriate capability given the current state. Experiments show that Octopus achieves the best performance on the vast majority of Octopus-Bench tasks, highlighting the key role of capability coordination in agentic multimodal reasoning.
Key Takeaways
- Existing multimodal reasoning models and frameworks lack the ability to autonomously explore diverse reasoning pathways.
- Humans draw on a complementary set of thinking abilities when solving such tasks, while existing methods usually cover only a subset of these dimensions.
- Octopus introduces agentic multimodal reasoning with six-capability orchestration to fill this gap.
- Six core capabilities are defined, and the Octopus-Bench evaluation benchmark is built around them.
- Octopus explores autonomously during reasoning and dynamically selects the most appropriate capability.
- Experiments show that Octopus achieves the best performance on the vast majority of tasks.
C2F-Space: Coarse-to-Fine Space Grounding for Spatial Instructions using Vision-Language Models
Authors:Nayoung Oh, Dohyun Kim, Junhyeong Bang, Rohan Paul, Daehyung Park
Space grounding refers to localizing a set of spatial references described in natural language instructions. Traditional methods often fail to account for complex reasoning – such as distance, geometry, and inter-object relationships – while vision-language models (VLMs), despite strong reasoning abilities, struggle to produce a fine-grained region of outputs. To overcome these limitations, we propose C2F-Space, a novel coarse-to-fine space-grounding framework that (i) estimates an approximated yet spatially consistent region using a VLM, then (ii) refines the region to align with the local environment through superpixelization. For the coarse estimation, we design a grid-based visual-grounding prompt with a propose-validate strategy, maximizing VLM’s spatial understanding and yielding physically and semantically valid canonical region (i.e., ellipses). For the refinement, we locally adapt the region to surrounding environment without over-relaxed to free space. We construct a new space-grounding benchmark and compare C2F-Space with five state-of-the-art baselines using success rate and intersection-over-union. Our C2F-Space significantly outperforms all baselines. Our ablation study confirms the effectiveness of each module in the two-step process and their synergistic effect of the combined framework. We finally demonstrate the applicability of C2F-Space to simulated robotic pick-and-place tasks.
Paper and project links
PDF 16 pages, 12 figures
Summary
The paper presents C2F-Space, a coarse-to-fine space-grounding framework that overcomes the limitations of traditional methods and vision-language models in grounding spatial references. C2F-Space first estimates an approximate but spatially consistent region with a VLM and then refines it to align with the local environment via superpixelization. On a newly built space-grounding benchmark it significantly outperforms other state-of-the-art methods, and it transfers to simulated robotic pick-and-place tasks.
Key Takeaways
- C2F-Space is a new coarse-to-fine space-grounding framework that addresses the limitations of traditional methods and VLMs.
- The coarse stage uses a grid-based visual-grounding prompt with a propose-validate strategy.
- The refinement stage uses superpixelization to align the estimated region with the local environment.
- C2F-Space produces physically and semantically valid canonical regions (ellipses).
- On a newly built space-grounding benchmark, C2F-Space significantly outperforms other state-of-the-art methods.
- Ablation studies confirm the effectiveness of each module and the synergy of the combined framework.
Efficiency Will Not Lead to Sustainable Reasoning AI
Authors:Philipp Wiesner, Daniel W. O’Neill, Francesca Larosa, Odej Kao
AI research is increasingly moving toward complex problem solving, where models are optimized not only for pattern recognition but for multi-step reasoning. Historically, computing’s global energy footprint has been stabilized by sustained efficiency gains and natural saturation thresholds in demand. But as efficiency improvements are approaching physical limits, emerging reasoning AI lacks comparable saturation points: performance is no longer limited by the amount of available training data but continues to scale with exponential compute investments in both training and inference. This paper argues that efficiency alone will not lead to sustainable reasoning AI and discusses research and policy directions to embed explicit limits into the optimization and governance of such systems.
Paper and project links
PDF Presented at the Rethinking AI Workshop @ EurIPS’25
Summary: AI research is moving toward complex problem solving, with models optimized not only for pattern recognition but for multi-step reasoning. As efficiency improvements approach physical limits, emerging reasoning AI lacks a comparable saturation point: performance is no longer bounded by available training data and keeps scaling with exponential compute investment in both training and inference. The paper argues that efficiency alone will not lead to sustainable reasoning AI and discusses research and policy directions for embedding explicit limits into the optimization and governance of such systems.
Key Takeaways:
- AI research is shifting toward complex problem solving, including multi-step reasoning.
- Efficiency gains are approaching physical limits, and emerging reasoning AI lacks a saturation point.
- Reasoning AI performance is no longer limited by the amount of available training data.
- Performance continues to scale with exponential compute investment in training and inference.
- Efficiency alone will not lead to sustainable reasoning AI.
- Research and policy directions are needed to embed explicit limits into the optimization and governance of reasoning AI systems.
EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control
Authors:Kai Yang, Xin Xu, Yangkun Chen, Weijie Liu, Jiafei Lyu, Zichuan Lin, Deheng Ye, Saiyong Yang
Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context, as it controls exploration and helps avoid premature convergence to sub-optimal solutions. However, existing reinforcement learning methods struggle to maintain an appropriate level of entropy, as the training process involves a mix of positive and negative samples, each affecting entropy in different ways across steps. To address this, we propose Entropy stabilization via Proportional-Integral Control (EntroPIC), a novel method that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients. This approach stabilizes entropy throughout training, ensuring efficient exploration and steady progress. We provide a comprehensive theoretical analysis for both on-policy and off-policy learning settings, demonstrating that EntroPIC is effective at controlling entropy in large-scale LLM training. Experimental results show that our method successfully maintains desired entropy levels, enabling stable and optimal RL training for LLMs.
Paper and project links
Summary
Long-term training of large language models requires stable exploration to keep the model from collapsing into sub-optimal behaviors, and entropy is key to controlling exploration and avoiding premature convergence. Existing RL methods struggle to keep entropy at an appropriate level because training mixes positive and negative samples, each affecting entropy differently across steps. EntroPIC stabilizes entropy via proportional-integral control, adaptively adjusting the influence of positive and negative samples by dynamically tuning their loss coefficients, which ensures efficient exploration and steady progress. Theoretical analysis for both on-policy and off-policy settings and experiments show that EntroPIC effectively controls entropy in large-scale LLM training, maintaining the desired entropy levels and enabling stable, optimal RL training.
Key Takeaways
- Long-term LLM training needs stable exploration to avoid collapsing into sub-optimal behaviors.
- Entropy controls exploration and helps the model avoid premature convergence to sub-optimal solutions.
- Existing RL methods struggle to maintain an appropriate entropy level during training.
- EntroPIC stabilizes entropy by dynamically adjusting the influence of positive and negative samples.
- The method ensures efficient exploration and steady training progress.
- Theory and experiments show that EntroPIC effectively controls entropy in large-scale LLM training.
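The proportional-integral idea in EntroPIC can be pictured as a classic PI controller acting on the gap between a target entropy and the measured policy entropy, with the controller output nudging the loss coefficients. The sketch below collapses the paper's separate positive/negative-sample coefficients into a single coefficient and uses made-up gains and entropy values purely for illustration.

```python
class EntropyPIController:
    """Proportional-integral controller that nudges a loss coefficient so the
    measured policy entropy tracks a target value."""
    def __init__(self, target_entropy: float, kp: float = 0.1, ki: float = 0.01):
        self.target = target_entropy
        self.kp, self.ki = kp, ki
        self.integral = 0.0

    def update(self, measured_entropy: float) -> float:
        """Return an additive adjustment for the sample-loss coefficient."""
        error = self.target - measured_entropy
        self.integral += error
        return self.kp * error + self.ki * self.integral

# toy loop: as entropy drifts below target, the controller raises the coefficient
controller = EntropyPIController(target_entropy=2.0)
coeff = 1.0
for step, entropy in enumerate([2.2, 2.0, 1.8, 1.5, 1.3]):
    coeff += controller.update(entropy)
    print(f"step {step}: entropy={entropy:.2f} coefficient={coeff:.3f}")
```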
BBox DocVQA: A Large Scale Bounding Box Grounded Dataset for Enhancing Reasoning in Document Visual Question Answer
Authors:Wenhan Yu, Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Lei Sha, Deguo Xia, Jizhou Huang
Document Visual Question Answering (DocVQA) is a fundamental task for multimodal document understanding and a key testbed for vision language reasoning. However, most existing DocVQA datasets are limited to the page level and lack fine grained spatial grounding, constraining the interpretability and reasoning capability of Vision Language Models (VLMs). To address this gap, we introduce BBox DocVQA a large scale, bounding box grounded dataset designed to enhance spatial reasoning and evidence localization in visual documents. We further present an automated construction pipeline, Segment Judge and Generate, which integrates a segment model for region segmentation, a VLM for semantic judgment, and another advanced VLM for question answer generation, followed by human verification for quality assurance. The resulting dataset contains 3.6 K diverse documents and 32 K QA pairs, encompassing single and multi region as well as single and multi page scenarios. Each QA instance is grounded on explicit bounding boxes, enabling fine grained evaluation of spatial semantic alignment. Benchmarking multiple state of the art VLMs (e.g., GPT 5, Qwen2.5 VL, and InternVL) on BBox DocVQA reveals persistent challenges in spatial grounding and reasoning accuracy. Furthermore, fine tuning on BBox DocVQA substantially improves both bounding box localization and answer generation, validating its effectiveness for enhancing the reasoning ability of VLMs. Our dataset and code will be publicly released to advance research on interpretable and spatially grounded vision language reasoning.
Paper and project links
PDF 22 pages, 4 figures
Summary
Most existing DocVQA datasets stop at the page level and lack fine-grained spatial grounding, limiting the interpretability and reasoning ability of vision-language models. BBox DocVQA is a large-scale, bounding-box-grounded dataset designed to strengthen spatial reasoning and evidence localization, built with an automated Segment Judge and Generate pipeline (region segmentation, semantic judgment, QA generation) followed by human verification. The dataset contains 3.6K documents and 32K QA pairs covering single- and multi-region as well as single- and multi-page scenarios, each grounded on explicit bounding boxes. Benchmarking state-of-the-art VLMs (e.g., GPT 5, Qwen2.5 VL, InternVL) reveals persistent challenges in spatial grounding and reasoning accuracy, while fine-tuning on BBox DocVQA substantially improves both bounding-box localization and answer generation. The dataset and code will be released publicly.
Key Takeaways
- Existing DocVQA datasets lack fine-grained spatial grounding, limiting the interpretability and reasoning ability of VLMs.
- BBox DocVQA strengthens spatial reasoning and evidence localization with large-scale bounding-box annotations for fine-grained evaluation of spatial-semantic alignment.
- The automated Segment Judge and Generate pipeline combines region segmentation, semantic judgment, and QA generation, with human verification for quality assurance.
- The dataset spans diverse documents and QA pairs covering single- and multi-region as well as single- and multi-page scenarios.
- Current VLMs still struggle on BBox DocVQA, especially in spatial grounding and reasoning accuracy.
- Fine-tuning on BBox DocVQA substantially improves bounding-box localization and answer generation.
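The intersection-over-union metric used to evaluate bounding-box grounding in the benchmark is worth spelling out, since it drives the fine-grained evaluation mentioned above. The sketch below uses the standard (x1, y1, x2, y2) box convention; the coordinates in the example are made up.

```python
def iou(box_a: tuple[float, float, float, float],
        box_b: tuple[float, float, float, float]) -> float:
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# predicted vs. ground-truth evidence box on a document page (pixel coordinates, made up)
print(iou((100, 200, 400, 300), (120, 210, 420, 320)))
```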
Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization
Authors:Yifeng Ding, Hung Le, Songyang Han, Kangrui Ruan, Zhenghui Jin, Varun Kumar, Zijian Wang, Anoop Deoras
Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks, establishing its effectiveness for advancing complex mathematical reasoning in the real world.
Paper and project links
Summary
Training LLMs for multi-turn Tool-Integrated Reasoning (TIR) remains hard for existing RL methods such as Group Relative Policy Optimization (GRPO), whose coarse trajectory-level rewards give too little learning signal for complex multi-turn interactions and lead to training stagnation. Group Turn Policy Optimization (GTPO) is designed for multi-turn TIR with three key innovations: turn-level reward assignment for fine-grained per-turn feedback, return-based advantage estimation that uses normalized discounted returns as advantages, and self-supervised reward shaping that exploits signals from generated code to densify sparse binary outcome rewards. GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks, demonstrating its effectiveness for advancing complex mathematical reasoning in the real world.
Key Takeaways
- Training LLMs for multi-turn Tool-Integrated Reasoning (TIR) is challenging.
- Existing RL methods such as GRPO use coarse trajectory-level rewards, which can stall training.
- GTPO addresses this with turn-level reward assignment, return-based advantage estimation, and self-supervised reward shaping.
- GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks.
- GTPO advances complex mathematical reasoning in real-world settings.
- Turn-level reward assignment provides finer-grained feedback that helps the model learn from complex multi-turn interactions.
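The return-based advantage estimation described above, where normalized discounted returns serve directly as advantages, can be sketched in a few lines. In the code below the per-turn rewards, the discount factor, and the choice to normalize within a single trajectory are all illustrative assumptions; GTPO presumably normalizes across a group of rollouts, as in GRPO.

```python
import torch

def discounted_returns(turn_rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Return-to-go per turn: G_t = r_t + gamma * G_{t+1}."""
    returns = torch.zeros_like(turn_rewards)
    running = 0.0
    for t in reversed(range(turn_rewards.numel())):
        running = float(turn_rewards[t]) + gamma * running
        returns[t] = running
    return returns

def turn_level_advantages(turn_rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Normalized discounted returns used directly as per-turn advantages."""
    returns = discounted_returns(turn_rewards, gamma)
    return (returns - returns.mean()) / (returns.std() + 1e-8)

# toy 4-turn rollout: small shaped rewards per turn plus a final outcome reward
print(turn_level_advantages(torch.tensor([0.1, 0.0, 0.2, 1.0])))
```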
LLM-Aligned Geographic Item Tokenization for Local-Life Recommendation
Authors:Hao Jiang, Guoquan Wang, Donglin Zhou, Sheng Yu, Yang Zeng, Wencong Zeng, Kun Gai, Guorui Zhou
Recent advances in Large Language Models (LLMs) have enhanced text-based recommendation by enriching traditional ID-based methods with semantic generalization capabilities. Text-based methods typically encode item textual information via prompt design and generate discrete semantic IDs through item tokenization. However, in domain-specific tasks such as local-life services, simply injecting location information into prompts fails to capture fine-grained spatial characteristics and real-world distance awareness among items. To address this, we propose LGSID, an LLM-Aligned Geographic Item Tokenization Framework for Local-life Recommendation. This framework consists of two key components: (1) RL-based Geographic LLM Alignment, and (2) Hierarchical Geographic Item Tokenization. In the RL-based alignment module, we initially train a list-wise reward model to capture real-world spatial relationships among items. We then introduce a novel G-DPO algorithm that uses pre-trained reward model to inject generalized spatial knowledge and collaborative signals into LLMs while preserving their semantic understanding. Furthermore, we propose a hierarchical geographic item tokenization strategy, where primary tokens are derived from discrete spatial and content attributes, and residual tokens are refined using the aligned LLM’s geographic representation vectors. Extensive experiments on real-world Kuaishou industry datasets show that LGSID consistently outperforms state-of-the-art discriminative and generative recommendation models. Ablation studies, visualizations, and case studies further validate its effectiveness.
Paper and project links
Summary
Recent LLM advances have improved text-based recommendation with semantic generalization. Traditional text-based methods encode item text via prompt design and generate discrete semantic IDs through item tokenization, but they struggle with domain-specific tasks such as local-life services, where prompts alone cannot capture fine-grained spatial characteristics or real-world distances between items. LGSID addresses this with two components: RL-based geographic LLM alignment and hierarchical geographic item tokenization. Experiments on real-world Kuaishou industry datasets show that LGSID consistently outperforms state-of-the-art discriminative and generative recommendation models.
Key Takeaways
- Recent LLM advances add semantic generalization capability to text-based recommendation.
- Traditional text-based recommendation methods encode item text via prompt design and generate discrete semantic IDs through item tokenization.
- In domain-specific tasks such as local-life services, simply injecting location information into prompts cannot capture fine-grained spatial characteristics or real-world distances between items.
- The LGSID framework has two key components: RL-based geographic LLM alignment and hierarchical geographic item tokenization.
- The RL-based alignment module trains a list-wise reward model and introduces a G-DPO algorithm to inject generalized spatial knowledge and collaborative signals into LLMs while preserving semantic understanding.
- In the hierarchical tokenization strategy, primary tokens are derived from discrete spatial and content attributes, and residual tokens are refined with the aligned LLM's geographic representation vectors.