
LLM


⚠️ All of the summaries below are produced by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never use them in serious academic settings; they are only meant for an initial screening before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-11

A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?

Authors:Md. Abdul Awal, Mrigank Rochan, Chanchal K. Roy

Transformer-based language models of code have achieved state-of-the-art performance across a wide range of software analytics tasks, but their practical deployment remains limited due to high computational costs, slow inference speeds, and significant environmental impact. To address these challenges, recent research has increasingly explored knowledge distillation as a method for compressing a large language model of code (the teacher) into a smaller model (the student) while maintaining performance. However, the degree to which a student model deeply mimics the predictive behavior and internal representations of its teacher remains largely unexplored, as current accuracy-based evaluation provides only a surface-level view of model quality and often fails to capture more profound discrepancies in behavioral fidelity between the teacher and student models. To address this gap, we empirically show that the student model often fails to deeply mimic the teacher model, resulting in up to 285% greater performance drop under adversarial attacks, which is not captured by traditional accuracy-based evaluation. Therefore, we propose MetaCompress, a metamorphic testing framework that systematically evaluates behavioral fidelity by comparing the outputs of teacher and student models under a set of behavior-preserving metamorphic relations. We evaluate MetaCompress on two widely studied tasks, using compressed versions of popular language models of code, obtained via three different knowledge distillation techniques: Compressor, AVATAR, and MORPH. The results show that MetaCompress identifies up to 62% behavioral discrepancies in student models, underscoring the need for behavioral fidelity evaluation within the knowledge distillation pipeline and establishing MetaCompress as a practical framework for testing compressed language models of code derived through knowledge distillation.


Paper and Project Links

PDF The paper is currently under review at a peer-reviewed journal

Summary: Transformer-based language models of code achieve state-of-the-art performance across software analytics tasks, but high computational cost, slow inference, and significant environmental impact limit their practical deployment. Recent work therefore uses knowledge distillation to compress a large teacher model into a smaller student while preserving performance. However, current accuracy-based evaluation cannot tell whether the student deeply mimics the teacher: it offers only a surface-level view of model quality and misses deeper discrepancies in behavioral fidelity. The paper shows empirically that student models often fail to mimic their teachers deeply, suffering up to 285% greater performance drops under adversarial attack, beyond what accuracy metrics capture. It therefore proposes MetaCompress, a metamorphic testing framework that evaluates behavioral fidelity by comparing teacher and student outputs under behavior-preserving metamorphic relations. Evaluated on two widely studied tasks with models compressed by three different knowledge distillation techniques, MetaCompress uncovers up to 62% behavioral discrepancies in student models, establishing it as a practical framework for testing compressed language models of code.

Key Takeaways

  1. Transformer-based language models of code excel at software analytics tasks but face high computational cost, slow inference, and environmental impact.
  2. Knowledge distillation is an effective way to compress large language models while maintaining performance.
  3. Existing accuracy-based evaluation cannot fully reveal whether a student model deeply mimics its teacher.
  4. Student models degrade sharply under adversarial attack, showing the need for deeper behavioral evaluation.
  5. MetaCompress systematically evaluates behavioral fidelity by comparing teacher and student outputs under behavior-preserving metamorphic relations (a minimal sketch follows this list).
  6. MetaCompress uncovers up to 62% behavioral discrepancies in student models, underscoring the need for fidelity evaluation in the knowledge distillation pipeline.
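A minimal sketch of the metamorphic-testing idea, assuming hypothetical `teacher` and `student` prediction functions and a toy identifier-renaming relation (the paper's actual metamorphic relations and models may differ):

```python
# Toy metamorphic relation: renaming an identifier must preserve program
# behavior, so a faithful model's prediction should not flip under it.
def rename_variables(code: str) -> str:
    return code.replace("count", "num_items")  # illustrative rename only

def behavioral_discrepancy_rate(teacher, student, programs) -> float:
    """Fraction of inputs on which teacher and student react differently
    to the same behavior-preserving transformation."""
    discrepancies = 0
    for code in programs:
        mutated = rename_variables(code)
        teacher_stable = teacher(code) == teacher(mutated)
        student_stable = student(code) == student(mutated)
        if teacher_stable != student_stable:  # fidelity violation
            discrepancies += 1
    return discrepancies / len(programs)
```

Accuracy on the original inputs can be identical for both models while this rate is high, which is exactly the gap that accuracy-based evaluation misses.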

Cool Papers

Click here to view paper screenshots

SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Authors:Jingxuan Xu, Ken Deng, Weihao Li, Songwei Yu, Huaixi Tang, Haoyang Huang, Zhiyi Lai, Zizheng Zhan, Yanan Wu, Chenchen Zhang, Kepeng Lei, Yifan Yao, Xinping Lei, Wenqiang Zhu, Zongxian Feng, Han Li, Junqi Xiong, Dailin Li, Zuchen Gao, Kun Wu, Wen Xiang, Ziqi Zhan, Yuanxing Zhang, Wuxuan Gong, Ziyuan Gao, Guanxiang Wang, Yirong Xue, Xiaojiang Zhang, Jinghui Wang, Huiming Wang, Wenhao Zhuang, Zhaoxiang Zhang, Yuqun Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu

Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.


Paper and Project Links

PDF

Summary
Evaluation of large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and weak alignment with real-world developer workflows; existing benchmarks focus mainly on algorithmic problems or Python-centric bug fixing, leaving key dimensions of software engineering underexplored. SWE-Compass addresses this with a unified, production-aligned benchmark spanning 8 task types, 8 programming scenarios, and 10 programming languages, built from 2,000 high-quality instances curated from authentic GitHub pull requests through systematic filtering and validation. Benchmarking ten state-of-the-art LLMs under the SWE-Agent and Claude Code agentic frameworks reveals a clear hierarchy of difficulty across task types, languages, and scenarios, and the benchmark provides a rigorous, reproducible foundation for diagnosing and advancing agentic coding capabilities.

Key Takeaways

  1. SWE-Compass is a comprehensive LLM evaluation benchmark covering diverse task types, programming scenarios, and programming languages.
  2. It moves beyond the limits of existing benchmarks by reflecting real-world developer workflows and the diversity of programming environments.
  3. Careful curation of high-quality instances ensures the validity and authenticity of the evaluation.
  4. Benchmarking reveals differences and a clear hierarchy of difficulty across task types, languages, and scenarios.
  5. SWE-Compass provides an evaluation framework for agentic coding abilities, helping diagnose and advance LLM progress.
  6. The benchmark offers a reproducible foundation for evaluating LLMs in software engineering.

Cool Papers

Click here to view paper screenshots

Large Language Models for Explainable Threat Intelligence

Authors:Tiago Dinis, Miguel Correia, Roger Tavares

As cyber threats continue to grow in complexity, traditional security mechanisms struggle to keep up. Large language models (LLMs) offer significant potential in cybersecurity due to their advanced capabilities in text processing and generation. This paper explores the use of LLMs with retrieval-augmented generation (RAG) to obtain threat intelligence by combining real-time information retrieval with domain-specific data. The proposed system, RAGRecon, uses an LLM with RAG to answer questions about cybersecurity threats. Moreover, it makes this form of Artificial Intelligence (AI) explainable by generating and visually presenting to the user a knowledge graph for every reply. This increases the transparency and interpretability of the reasoning of the model, allowing analysts to better understand the connections made by the system based on the context recovered by the RAG system. We evaluated RAGRecon experimentally with two datasets and seven different LLMs and the responses matched the reference responses more than 91% of the time for the best combinations.


Paper and Project Links

PDF

Summary
Retrieval-augmented generation (RAG) with large language models (LLMs) has great potential in cybersecurity. By combining real-time information retrieval with domain-specific data, the proposed RAGRecon system answers questions about cybersecurity threats and generates a knowledge graph for every reply, making the AI's reasoning more transparent and interpretable. In experiments, RAGRecon's responses matched reference responses more than 91% of the time for the best combinations.

Key Takeaways

  • Large language models (LLMs) show significant potential in cybersecurity.
  • LLMs are combined with retrieval-augmented generation (RAG) to obtain threat intelligence.
  • The RAGRecon system uses an LLM with RAG to answer questions about cybersecurity threats.
  • RAGRecon generates and visually presents a knowledge graph for each reply, improving the AI's transparency and interpretability (a minimal sketch follows this list).
  • Experimental evaluation shows high agreement between RAGRecon's responses and the reference answers.
  • LLMs' advanced text processing and generation capabilities give them broad application prospects in cybersecurity.
  • RAGRecon retrieves information in real time and grounds its judgments in domain-specific data.
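A minimal sketch of the answer-plus-knowledge-graph flow, assuming hypothetical `retrieve` and `llm` callables and a simple "subject|relation|object" output format (not the paper's actual implementation):

```python
import networkx as nx

def answer_with_graph(question: str, retrieve, llm):
    # RAG step: ground the answer in retrieved threat intelligence.
    context = "\n".join(retrieve(question, k=5))
    answer = llm(f"Context:\n{context}\n\nQuestion: {question}")
    # Explainability step: ask for triples that justify the answer and
    # render them as a graph an analyst can inspect.
    triples = llm("List subject|relation|object triples supporting this "
                  f"answer, one per line:\n{answer}")
    graph = nx.DiGraph()
    for line in triples.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            subj, rel, obj = parts
            graph.add_edge(subj, obj, label=rel)
    return answer, graph
```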

Cool Papers

Click here to view paper screenshots

PreResQ-R1: Towards Fine-Grained Rank-and-Score Reinforcement Learning for Visual Quality Assessment via Preference-Response Disentangled Policy Optimization

Authors:Zehui Feng, Tian Qiu, Tong Wu, Junxuan Li, Huayuan Xu, Ting Han

Visual Quality Assessment (QA) seeks to predict human perceptual judgments of visual fidelity. While recent multimodal large language models (MLLMs) show promise in reasoning about image and video quality, existing approaches mainly rely on supervised fine-tuning or rank-only objectives, resulting in shallow reasoning, poor score calibration, and limited cross-domain generalization. We propose PreResQ-R1, a Preference-Response Disentangled Reinforcement Learning framework that unifies absolute score regression and relative ranking consistency within a single reasoning-driven optimization scheme. Unlike prior QA methods, PreResQ-R1 introduces a dual-branch reward formulation that separately models intra-sample response coherence and inter-sample preference alignment, optimized via Group Relative Policy Optimization (GRPO). This design encourages fine-grained, stable, and interpretable chain-of-thought reasoning about perceptual quality. To extend beyond static imagery, we further design a global-temporal and local-spatial data flow strategy for Video Quality Assessment. Remarkably, with reinforcement fine-tuning on only 6K images and 28K videos, PreResQ-R1 achieves state-of-the-art results across 10 IQA and 5 VQA benchmarks under both SRCC and PLCC metrics, surpassing prior methods by margins of 5.30% and 2.15% on the IQA task, respectively. Beyond quantitative gains, it produces human-aligned reasoning traces that reveal the perceptual cues underlying quality judgments. Code and model are available.


Paper and Project Links

PDF 27 pages, 14 figures, under review as a conference paper

Summary

The paper proposes PreResQ-R1, a visual quality assessment (QA) framework that unifies absolute score regression and relative ranking consistency and optimizes them with reinforcement learning. PreResQ-R1 uses a dual-branch reward that separately models intra-sample response coherence and inter-sample preference alignment, optimized via Group Relative Policy Optimization (GRPO). This design encourages fine-grained, stable, and interpretable chain-of-thought reasoning about perceptual quality, and the method achieves state-of-the-art results on both image and video quality assessment.

Key Takeaways

  1. PreResQ-R1 is a reinforcement-learning-based visual quality assessment framework that unifies absolute score regression and relative ranking consistency.
  2. The framework uses a dual-branch reward that separately models intra-sample response coherence and inter-sample preference alignment (a minimal sketch follows this list).
  3. PreResQ-R1 is optimized via Group Relative Policy Optimization (GRPO), encouraging fine-grained, stable, and interpretable chain-of-thought reasoning about perceptual quality.
  4. PreResQ-R1 performs strongly on both image and video quality assessment, matching or surpassing existing techniques.
  5. The framework yields a deeper understanding of perceptual quality judgments, revealing the underlying perceptual cues through human-aligned reasoning traces.
  6. PreResQ-R1 needs only a small amount of image and video data for reinforcement fine-tuning to achieve substantial performance gains.
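A minimal sketch of a dual-branch reward in this spirit, for a pair of samples with predicted and ground-truth quality scores; the weights and functional forms are assumptions, not the paper's exact formulation:

```python
def dual_branch_reward(pred_a, pred_b, gt_a, gt_b,
                       w_score=0.5, w_rank=0.5):
    # Intra-sample branch: reward calibrated absolute scores.
    score_reward = -(abs(pred_a - gt_a) + abs(pred_b - gt_b)) / 2.0
    # Inter-sample branch: reward a correctly ordered preference pair.
    agree = (pred_a - pred_b) * (gt_a - gt_b)
    rank_reward = 1.0 if agree > 0 else -1.0
    return w_score * score_reward + w_rank * rank_reward
```

In GRPO-style training, a reward like this would be computed per sampled response group and normalized to form the policy advantage.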

Cool Papers

Click here to view paper screenshots

TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework

Authors:Chao Zhang, Yuhao Wang, Derong Xu, Haoxin Zhang, Yuanjie Lyu, Yuhao Chen, Shuochen Liu, Tong Xu, Xiangyu Zhao, Yan Gao, Yao Hu, Enhong Chen

Retrieval-Augmented Generation (RAG) utilizes external knowledge to augment Large Language Models’ (LLMs) reliability. For flexibility, agentic RAG employs autonomous, multi-round retrieval and reasoning to resolve queries. Although recent agentic RAG has improved via reinforcement learning, they often incur substantial token overhead from search and reasoning processes. This trade-off prioritizes accuracy over efficiency. To address this issue, this work proposes TeaRAG, a token-efficient agentic RAG framework capable of compressing both retrieval content and reasoning steps. 1) First, the retrieved content is compressed by augmenting chunk-based semantic retrieval with a graph retrieval using concise triplets. A knowledge association graph is then built from semantic similarity and co-occurrence. Finally, Personalized PageRank is leveraged to highlight key knowledge within this graph, reducing the number of tokens per retrieval. 2) Besides, to reduce reasoning steps, Iterative Process-aware Direct Preference Optimization (IP-DPO) is proposed. Specifically, our reward function evaluates the knowledge sufficiency by a knowledge matching mechanism, while penalizing excessive reasoning steps. This design can produce high-quality preference-pair datasets, supporting iterative DPO to improve reasoning conciseness. Across six datasets, TeaRAG improves the average Exact Match by 4% and 2% while reducing output tokens by 61% and 59% on Llama3-8B-Instruct and Qwen2.5-14B-Instruct, respectively. Code is available at https://github.com/Applied-Machine-Learning-Lab/TeaRAG.


Paper and Project Links

PDF 32 pages

Summary

RAG augments LLM reliability with external knowledge, and agentic RAG resolves queries through autonomous, multi-round retrieval and reasoning. Although reinforcement learning has improved agentic RAG, its search and reasoning processes incur substantial token overhead, trading efficiency for accuracy. The proposed Token-Efficient agentic RAG (TeaRAG) framework compresses retrieved content by augmenting chunk-based semantic retrieval with graph retrieval over a knowledge association graph and uses Personalized PageRank to highlight key knowledge, reducing the tokens per retrieval. It also proposes Iterative Process-aware Direct Preference Optimization (IP-DPO), whose reward evaluates knowledge sufficiency through a knowledge-matching mechanism while penalizing excessive reasoning steps, improving reasoning conciseness. Experiments on six datasets show that TeaRAG improves Exact Match while markedly reducing output tokens.

Key Takeaways

  1. Retrieval-Augmented Generation (RAG) incorporates external knowledge to improve LLM reliability.
  2. Agentic RAG gains flexibility through autonomous, multi-round retrieval and reasoning.
  3. Reinforcement learning improves agentic RAG performance but introduces substantial token overhead.
  4. The TeaRAG framework is proposed to address this issue, achieving token efficiency by compressing both retrieved content and reasoning steps.
  5. TeaRAG combines chunk-based semantic retrieval, a knowledge association graph, and Personalized PageRank to reduce retrieval tokens (a minimal sketch follows this list).
  6. IP-DPO improves reasoning conciseness by assessing knowledge sufficiency via a knowledge-matching mechanism and penalizing excessive reasoning.
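A minimal sketch of the Personalized PageRank selection step, assuming triplets are already extracted and query entities serve as the restart distribution (graph construction and entity linking are simplified stand-ins):

```python
import networkx as nx

def select_key_triplets(triplets, seed_entities, top_k=10):
    graph = nx.Graph()
    for subj, rel, obj in triplets:
        graph.add_edge(subj, obj, rel=rel)
    # Restart distribution concentrated on entities mentioned in the query.
    personalization = {n: 1.0 if n in seed_entities else 0.0
                       for n in graph.nodes}
    if not any(personalization.values()):
        personalization = None  # fall back to uniform PageRank
    scores = nx.pagerank(graph, alpha=0.85, personalization=personalization)
    # Keep only the triplets whose endpoints score highest, cutting the
    # number of tokens passed to the generator per retrieval round.
    ranked = sorted(triplets,
                    key=lambda t: scores[t[0]] + scores[t[2]],
                    reverse=True)
    return ranked[:top_k]
```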

Cool Papers

Click here to view paper screenshots

Dense Motion Captioning

Authors:Shiyao Xu, Benedetta Liberatori, Gül Varol, Paolo Rota

Recent advances in 3D human motion and language integration have primarily focused on text-to-motion generation, leaving the task of motion understanding relatively unexplored. We introduce Dense Motion Captioning, a novel task that aims to temporally localize and caption actions within 3D human motion sequences. Current datasets fall short in providing detailed temporal annotations and predominantly consist of short sequences featuring few actions. To overcome these limitations, we present the Complex Motion Dataset (CompMo), the first large-scale dataset featuring richly annotated, complex motion sequences with precise temporal boundaries. Built through a carefully designed data generation pipeline, CompMo includes 60,000 motion sequences, each composed of multiple actions ranging from at least two to ten, accurately annotated with their temporal extents. We further present DEMO, a model that integrates a large language model with a simple motion adapter, trained to generate dense, temporally grounded captions. Our experiments show that DEMO substantially outperforms existing methods on CompMo as well as on adapted benchmarks, establishing a robust baseline for future research in 3D motion understanding and captioning.


Paper and Project Links

PDF 12 pages, 5 figures, accepted to 3DV 2026

Summary

Recent progress in integrating 3D human motion and language has focused on text-to-motion generation, leaving motion understanding relatively unexplored. The paper introduces Dense Motion Captioning, a new task that temporally localizes and captions actions within 3D human motion sequences. Because current datasets lack detailed temporal annotations and consist mostly of short sequences with few actions, the authors present the Complex Motion Dataset (CompMo), the first large-scale dataset of richly annotated complex motion sequences with precise temporal boundaries: built through a carefully designed data generation pipeline, it contains 60,000 sequences, each comprising two to ten actions accurately annotated with their temporal extents. They also present DEMO, a model that couples a large language model with a simple motion adapter and is trained to generate dense, temporally grounded captions. Experiments show DEMO substantially outperforms existing methods on CompMo and on adapted benchmarks, establishing a robust baseline for future research in 3D motion understanding and captioning.

Key Takeaways

  1. Recent work has focused on text-to-motion generation, while 3D human motion understanding remains relatively unexplored.
  2. The new Dense Motion Captioning task aims to temporally localize and caption actions within 3D human motion sequences.
  3. Current datasets fall short in providing detailed temporal annotations.
  4. The Complex Motion Dataset (CompMo) is a large-scale, richly annotated dataset with precise temporal boundaries.
  5. CompMo is built through a carefully designed data generation pipeline and contains accurately annotated complex motion sequences.
  6. The DEMO model integrates a large language model with a simple motion adapter to generate dense, temporally grounded captions.

Cool Papers

Click here to view paper screenshots

What Are the Facts? Automated Extraction of Court-Established Facts from Criminal-Court Opinions

Authors:Klára Bendová, Tomáš Knap, Jan Černý, Vojtěch Pour, Jaromir Savelka, Ivana Kvapilíková, Jakub Drápal

Criminal justice administrative data contain only a limited amount of information about the committed offense. However, there is an unused source of extensive information in continental European courts’ decisions: descriptions of criminal behaviors in verdicts by which offenders are found guilty. In this paper, we study the feasibility of extracting these descriptions from publicly available court decisions from Slovakia. We use two different approaches for retrieval: regular expressions and large language models (LLMs). Our baseline was a simple method employing regular expressions to identify typical words occurring before and after the description. The advanced regular expression approach further focused on “sparing” and its normalization (insertion of spaces between individual letters), typical for delineating the description. The LLM approach involved prompting the Gemini Flash 2.0 model to extract the descriptions using predefined instructions. Although the baseline identified descriptions in only 40.5% of verdicts, both methods significantly outperformed it, achieving 97% with advanced regular expressions and 98.75% with LLMs, and 99.5% when combined. Evaluation by law students showed that both advanced methods matched human annotations in about 90% of cases, compared to just 34.5% for the baseline. LLMs fully matched human-labeled descriptions in 91.75% of instances, and a combination of advanced regular expressions with LLMs reached 92%.


Paper and Project Links

PDF Paper accepted to the proceedings of ASAIL 2025 Workshop under ICAIL conference for publication. Paper contains 6 pages (references included) and 2 appendices. It contains 8 tables, no figures

Summary

The paper studies the feasibility of extracting descriptions of criminal behavior from publicly available Slovak court decisions using two retrieval approaches: regular expressions and large language models (LLMs). A baseline regex method identifies descriptions in only 40.5% of verdicts, while advanced regular expressions reach 97%, LLMs 98.75%, and their combination 99.5%. In an evaluation by law students, both advanced methods matched human annotations in about 90% of cases versus 34.5% for the baseline; LLMs fully matched human-labeled descriptions in 91.75% of instances, and combining advanced regular expressions with LLMs reached 92%.

Key Takeaways

  1. Criminal justice administrative data contain only limited information about committed offenses.
  2. Publicly available court decisions, such as Slovak verdicts, contain rich descriptions of criminal behavior.
  3. Two approaches are used to extract the descriptions: regular expressions and large language models (LLMs) (a minimal regex sketch follows this list).
  4. Advanced regular expressions and LLMs raise extraction accuracy substantially.
  5. LLMs perform best at matching human-annotated descriptions.
  6. Combining advanced regular expressions with LLMs improves accuracy further.
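A minimal sketch of the two regex ideas described above: normalizing "spared" text (letters separated by spaces) and cutting the description out between boundary phrases. The English marker phrases are illustrative placeholders for the actual Slovak patterns:

```python
import re

def normalize_sparing(text: str) -> str:
    # Collapse runs of single letters separated by spaces into one word,
    # e.g. "g u i l t y" -> "guilty".
    return re.sub(r"\b(?:\w ){2,}\w\b",
                  lambda m: m.group(0).replace(" ", ""), text)

def extract_description(verdict: str):
    text = normalize_sparing(verdict)
    # Hypothetical delimiters standing in for the typical words that occur
    # before and after the description in real verdicts.
    match = re.search(r"is guilty because(.*?)thereby committed",
                      text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None
```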

Cool Papers

Click here to view paper screenshots

Code Review Automation using Retrieval Augmented Generation

Authors:Qianru Meng, Xiao Zhang, Zhaochen Ren, Joost Visser

Code review is essential for maintaining software quality but is labor-intensive. Automated code review generation offers a promising solution to this challenge. Both deep learning-based generative techniques and retrieval-based methods have demonstrated strong performance in this task. However, despite these advancements, there are still some limitations where generated reviews can be either off-point or overly general. To address these issues, we introduce Retrieval-Augmented Reviewer (RARe), which leverages Retrieval-Augmented Generation (RAG) to combine retrieval-based and generative methods, explicitly incorporating external domain knowledge into the code review process. RARe uses a dense retriever to select the most relevant reviews from the codebase, which then enrich the input for a neural generator, utilizing the contextual learning capacity of large language models (LLMs), to produce the final review. RARe outperforms state-of-the-art methods on two benchmark datasets, achieving BLEU-4 scores of 12.32 and 12.96, respectively. Its effectiveness is further validated through a detailed human evaluation and a case study using an interpretability tool, demonstrating its practical utility and reliability.


Paper and Project Links

PDF

Summary

This paper addresses code review automation. Code review is essential for software quality but labor-intensive, and automatically generated reviews can be off-point or overly general. The proposed Retrieval-Augmented Reviewer (RARe) combines retrieval-based and generative methods: a dense retriever selects the most relevant reviews from the codebase to enrich the input, and a neural generator exploits the contextual learning capacity of large language models to produce the final review. RARe outperforms state-of-the-art methods on two benchmark datasets, and its effectiveness is further validated through a human evaluation and an interpretability case study.

Key Takeaways

  1. Code review is essential for maintaining software quality, but manual review is labor-intensive and calls for automated solutions.
  2. Both deep learning-based generative techniques and retrieval-based methods perform strongly on automated code review generation.
  3. Generated reviews can still be off-point or overly general, leaving room for improvement.
  4. The Retrieval-Augmented Reviewer (RARe) model is introduced, combining retrieval-based and generative methods.
  5. RARe uses a dense retriever to select the most relevant reviews from the codebase and enrich the input (a minimal sketch follows this list).
  6. RARe leverages the contextual learning capacity of large language models to generate the final review.
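A minimal sketch of the retrieve-then-generate flow, assuming precomputed embeddings and a hypothetical `llm` callable rather than RARe's actual retriever and generator:

```python
import numpy as np

def retrieve_reviews(query_vec, review_vecs, reviews, k=3):
    # Dense retrieval: cosine similarity between the change and past reviews.
    sims = review_vecs @ query_vec / (
        np.linalg.norm(review_vecs, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(-sims)[:k]
    return [reviews[i] for i in top]

def generate_review(diff, query_vec, review_vecs, reviews, llm):
    examples = retrieve_reviews(query_vec, review_vecs, reviews)
    # Enrich the generator's input with the retrieved exemplars.
    prompt = ("Relevant past reviews:\n"
              + "\n".join(f"- {r}" for r in examples)
              + f"\n\nWrite a review comment for this change:\n{diff}")
    return llm(prompt)
```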

Cool Papers

Click here to view paper screenshots

LiveStar: Live Streaming Assistant for Real-World Online Video Understanding

Authors:Zhenyu Yang, Kairui Zhang, Yuhang Hu, Bing Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Weiming Dong, Changsheng Xu

Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via a single forward pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with streaming key-value cache to achieve 1.53x faster inference. We also construct an OmniStar dataset, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar’s state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness with 18.1% reduced timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar.


Paper and Project Links

PDF NeurIPS 2025 Accepted

Summary

To overcome online Video-LLMs' difficulty in simultaneously processing continuous frame-by-frame input and deciding when to respond, the paper introduces LiveStar, a live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. LiveStar combines a training strategy for incremental video-language alignment on variable-length streams, a response-silence decoding framework that determines optimal proactive response timing via a single forward-pass verification, and memory-aware acceleration (peak-end memory compression plus a streaming key-value cache) for online inference on 10+ minute videos with 1.53x faster inference. The authors also construct the OmniStar dataset for training and benchmarking, covering 15 real-world scenarios and 5 evaluation tasks. LiveStar achieves state-of-the-art performance, improving semantic correctness by an average of 19.5%, reducing timing difference by 18.1%, and raising FPS by 12.0% across all five OmniStar tasks.

Key Takeaways

  1. LiveStar tackles online Video-LLMs' challenge of processing continuous frame inputs while determining optimal response timing, achieving real-time responsiveness and narrative coherence.
  2. Its training strategy enables incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences.
  3. The response-silence decoding framework determines optimal proactive response timing via a single forward-pass verification (a minimal sketch follows this list).
  4. Memory-aware acceleration, combining peak-end memory compression with a streaming key-value cache, delivers faster inference.
  5. The OmniStar dataset provides comprehensive training and benchmarking resources, covering 15 diverse real-world scenarios and 5 evaluation tasks.
  6. Experiments show LiveStar clearly outperforms existing online Video-LLMs, improving semantic correctness by 19.5% on average with an 18.1% reduction in timing difference.
  7. LiveStar improves FPS by an average of 12.0% across all five OmniStar tasks.
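A minimal sketch of response-silence gating, assuming an HF-style causal LM and a dedicated "stay silent" token; the token and threshold are placeholders, not LiveStar's actual mechanism:

```python
import torch

@torch.no_grad()
def should_respond(model, input_ids, silence_token_id, threshold=0.5):
    # Single forward pass over the current stream prefix.
    logits = model(input_ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    # Decode a reply only when "stay silent" is sufficiently unlikely;
    # otherwise wait for more frames.
    return probs[silence_token_id].item() < threshold
```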

Cool Papers

Click here to view paper screenshots

Language Generation and Identification From Partial Enumeration: Tight Density Bounds and Topological Characterizations

Authors:Jon Kleinberg, Fan Wei

The success of large language models (LLMs) has motivated formal theories of language generation and learning. We study the framework of \emph{language generation in the limit}, where an adversary enumerates strings from an unknown language $K$ drawn from a countable class, and an algorithm must generate unseen strings from $K$. Prior work showed that generation is always possible, and that some algorithms achieve positive lower density, revealing a \emph{validity–breadth} trade-off between correctness and coverage. We resolve a main open question in this line, proving a tight bound of $1/2$ on the best achievable lower density. We then strengthen the model to allow \emph{partial enumeration}, where the adversary reveals only an infinite subset $C \subseteq K$. We show that generation in the limit remains achievable, and if $C$ has lower density $\alpha$ in $K$, the algorithm’s output achieves density at least $\alpha/2$, matching the upper bound. This generalizes the $1/2$ bound to the partial-information setting, where the generator must recover within a factor $1/2$ of the revealed subset’s density. We further revisit the classical Gold–Angluin model of \emph{language identification} under partial enumeration. We characterize when identification in the limit is possible – when hypotheses $M_t$ eventually satisfy $C \subseteq M \subseteq K$ – and in the process give a new topological formulation of Angluin’s characterization, showing that her condition is precisely equivalent to an appropriate topological space having the $T_D$ separation property.


Paper and Project Links

PDF

Summary

The success of large language models (LLMs) has motivated formal theories of language generation and learning. In the framework of language generation in the limit, an adversary enumerates strings from an unknown language K drawn from a countable class, and an algorithm must generate unseen strings from K. Prior work showed that generation is always possible and that some algorithms achieve positive lower density, revealing a validity-breadth trade-off between correctness and coverage. This paper resolves a main open question in this line by proving a tight bound of 1/2 on the best achievable lower density. The model is then strengthened to allow partial enumeration, where the adversary reveals only an infinite subset C ⊆ K: generation in the limit remains achievable, and if C has lower density α in K, the algorithm's output achieves density at least α/2, matching the upper bound and generalizing the 1/2 bound to the partial-information setting. The paper also revisits the classical Gold-Angluin model of language identification under partial enumeration, characterizing when identification in the limit is possible (hypotheses M_t eventually satisfying C ⊆ M ⊆ K) and giving a new topological formulation of Angluin's characterization: her condition is precisely equivalent to an appropriate topological space having the T_D separation property.

Key Takeaways

  1. The success of large language models (LLMs) has driven formal theories of language generation and learning.
  2. The "language generation in the limit" framework studies an algorithm's ability to generate unseen strings while an adversary enumerates an unknown language.
  3. A main open question is resolved: the best achievable lower density has a tight bound of 1/2 (one standard formalization is sketched below).
  4. The model is extended to partial enumeration: when the revealed subset C has lower density α in K, the algorithm's output achieves density at least α/2, generalizing the 1/2 bound to the partial-information setting where the generator must recover within a factor 1/2 of the revealed subset's density.
  5. The classical Gold-Angluin model of language identification is revisited under partial enumeration, characterizing when identification in the limit is possible.
  6. A new topological formulation of Angluin's characterization shows her condition is exactly equivalent to an appropriate topological space having the T_D separation property, deepening the understanding of the mechanics and conditions of language identification.
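For concreteness, one standard formalization of the lower density behind the 1/2 bound (an assumption of this note, not quoted from the paper): for an output set $S \subseteq K$ and an enumeration $k_1, k_2, \dots$ of $K$,

```latex
\[
  \underline{d}(S; K) \;=\; \liminf_{n \to \infty}
      \frac{\lvert S \cap \{k_1, \dots, k_n\} \rvert}{n},
\]
```

so the result says the best generator guarantees $\underline{d}(S;K) \ge 1/2$, and no algorithm can guarantee more in general.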

Cool Papers

Click here to view paper screenshots

Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

Authors:Teqi Hao, Xioayu Tan, Shaojie Shi, Yinghui Xu, Xihe Qiu

The personalization of black-box large language models (LLMs) is a critical yet challenging task. Existing approaches predominantly rely on context injection, where user history is embedded into the prompt to directly guide the generation process. However, this single-step paradigm imposes a dual burden on the model: generating accurate content while simultaneously aligning with user-specific styles. This often results in a trade-off that compromises output quality and limits precise control. To address this fundamental tension, we propose Reflective Personalization Optimization (RPO), a novel framework that redefines the personalization paradigm by decoupling content generation from alignment. RPO operates in two distinct stages: first, a base model generates a high-quality, generic response; then, an external reflection module explicitly rewrites this output to align with the user’s preferences. This reflection module is trained using a two-stage process. Initially, supervised fine-tuning is employed on structured rewriting trajectories to establish a core personalized reasoning policy that models the transformation from generic to user-aligned responses. Subsequently, reinforcement learning is applied to further refine and enhance the quality of the personalized outputs. Comprehensive experiments on the LaMP benchmark demonstrate that RPO, by decoupling content generation from personalization, significantly outperforms state-of-the-art baselines. These findings underscore the superiority of explicit response shaping over implicit context injection. Moreover, RPO introduces an efficient, model-agnostic personalization layer that can be seamlessly integrated with any underlying base model, paving the way for a new and effective direction in user-centric generation scenarios.


Paper and Project Links

PDF

Summary

The paper proposes Reflective Personalization Optimization (RPO), a novel framework for personalizing black-box large language models (LLMs). Traditional approaches rely on context injection, which asks the model to generate accurate content while simultaneously matching a user-specific style, forcing a trade-off that compromises output quality and precise control. RPO redefines the personalization paradigm by decoupling content generation from alignment: a base model first generates a high-quality generic response, and an external reflection module then explicitly rewrites it to align with the user's preferences. The reflection module is trained in two stages: supervised fine-tuning on structured rewriting trajectories to establish a core personalized reasoning policy, followed by reinforcement learning to further refine personalized outputs. Comprehensive experiments on the LaMP benchmark show RPO significantly outperforms state-of-the-art baselines, underscoring the superiority of explicit response shaping over implicit context injection. RPO also introduces an efficient, model-agnostic personalization layer that can be seamlessly integrated with any underlying base model, opening a new and effective direction for user-centric generation.

Key Takeaways

  1. Personalizing black-box large language models is challenging; existing methods rely on context injection, which trades off output quality against precise control.
  2. Reflective Personalization Optimization (RPO) is proposed to solve this by decoupling content generation from style alignment.
  3. RPO has two main stages: a base model generates a generic response, and an external reflection module explicitly rewrites it to match user preferences (a minimal sketch follows this list).
  4. The reflection module is trained in two stages, supervised fine-tuning followed by reinforcement learning, to raise the quality of personalized outputs.
  5. RPO significantly outperforms state-of-the-art baselines on the LaMP benchmark, highlighting the superiority of explicit response shaping.
  6. RPO introduces a model-agnostic personalization layer that can be seamlessly integrated with any base model.
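A minimal sketch of the decoupled two-stage inference, assuming hypothetical `base_llm` and `reflection_llm` callables:

```python
def rpo_generate(query, user_profile, base_llm, reflection_llm):
    # Stage 1: content only; the base model carries no personalization burden.
    draft = base_llm(query)
    # Stage 2: explicit post-hoc rewriting conditioned on user preferences.
    prompt = (f"User preferences:\n{user_profile}\n\n"
              "Rewrite the answer below to match these preferences while "
              f"preserving its facts:\n{draft}")
    return reflection_llm(prompt)
```

Because the reflection step only sees text in and text out, the same module can sit on top of any black-box base model.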

Cool Papers

Click here to view paper screenshots

GEMMA-SQL: A Novel Text-to-SQL Model Based on Large Language Models

Authors:Hari Mohan Pandey, Anshul Gupta, Subham Sarkar, Minakshi Tomer, Schneider Johannes, Yan Gong

Text-to-SQL systems enable users to interact with structured databases using natural language, eliminating the need for specialized programming knowledge. In this work, we introduce GEMMA-SQL, a lightweight and efficient text-to-SQL model built upon the open-source Gemma 2B architecture. Unlike many large language models (LLMs), GEMMA-SQL is fine-tuned in a resource-efficient, iterative manner and can be deployed on low-cost hardware. Leveraging the SPIDER benchmark for training and evaluation, GEMMA-SQL combines multiple prompting strategies, including few-shot learning, to enhance SQL query generation accuracy. The instruction-tuned variant, GEMMA-SQL Instruct, achieves 66.8% Test-Suite accuracy and 63.3% Exact Set Match accuracy, outperforming several state-of-the-art baselines such as IRNet, RYANSQL, and CodeXDavinci. The proposed approach demonstrates that effective prompt design and targeted instruction tuning can significantly boost performance while maintaining high scalability and adaptability. These results position GEMMA-SQL as a practical, open-source alternative for robust and accessible text-to-SQL systems.


Paper and Project Links

PDF

Summary
Text-to-SQL systems let users interact with structured databases in natural language, without specialized programming knowledge. This work introduces GEMMA-SQL, a lightweight, efficient text-to-SQL model built on the open-source Gemma 2B architecture. Unlike many large language models, GEMMA-SQL is fine-tuned in a resource-efficient, iterative manner and can be deployed on low-cost hardware. Trained and evaluated on the SPIDER benchmark, it combines multiple prompting strategies, including few-shot learning, to improve SQL generation accuracy. The instruction-tuned GEMMA-SQL Instruct achieves 66.8% Test-Suite accuracy and 63.3% Exact Set Match accuracy, outperforming baselines such as IRNet, RYANSQL, and CodeXDavinci. The results show that effective prompt design and targeted instruction tuning can significantly boost performance while preserving scalability and adaptability, making GEMMA-SQL a practical, open-source alternative for robust, accessible text-to-SQL systems.

Key Takeaways

  1. GEMMA-SQL is a lightweight text-to-SQL model built on the open-source Gemma 2B architecture.
  2. Unlike many large language models, GEMMA-SQL is fine-tuned resource-efficiently and suits low-cost hardware deployment.
  3. GEMMA-SQL combines multiple prompting strategies, including few-shot learning, to improve SQL generation accuracy (a minimal sketch follows this list).
  4. The GEMMA-SQL Instruct variant achieves high accuracy, outperforming several state-of-the-art baselines.
  5. Effective prompt design and instruction tuning significantly improve performance, scalability, and adaptability.
  6. GEMMA-SQL is a practical, open-source alternative with the potential to replace other text-to-SQL systems.
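A minimal sketch of few-shot prompt construction for text-to-SQL, one of the strategies combined here; the exemplars and schema format are illustrative assumptions, not the paper's exact template:

```python
# Illustrative SPIDER-style exemplars (hypothetical, for this sketch only).
FEW_SHOT = [
    ("List all singers from France.",
     "SELECT name FROM singer WHERE country = 'France';"),
    ("How many concerts were held in 2014?",
     "SELECT COUNT(*) FROM concert WHERE year = 2014;"),
]

def build_prompt(schema: str, question: str) -> str:
    shots = "\n\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in FEW_SHOT)
    return f"Schema:\n{schema}\n\n{shots}\n\nQ: {question}\nSQL:"
```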

Cool Papers

Click here to view paper screenshots

MACO: A Multi-Agent LLM-Based Hardware/Software Co-Design Framework for CGRAs

Authors:Zesong Jiang, Yuqi Sun, Qing Zhong, Mahathi Krishna, Deepak Patil, Cheng Tan, Sriram Krishnamoorthy, Jeff Zhang

Coarse-grained Reconfigurable Arrays (CGRAs) are a promising computing architecture that can deliver high-performance, energy-efficient acceleration across diverse domains. By supporting reconfiguration at the functional unit level, CGRAs efficiently adapt to varying computational patterns and optimize resource utilization. However, designing CGRAs is highly challenging due to the vast design space, independent architectural parameters, and the time-consuming nature of manual design. Fortunately, the rapid advancement of large language models (LLMs) presents new opportunities to automate this process. In this work, we propose MACO, an open-source multi-agent LLM-based framework for Hardware/Software (HW/SW) co-design of CGRAs. The framework employs LLM reasoning to generate CGRAs across four stages: HW/SW co-design, Design error correction, Best design selection, and Evaluation & Feedback. Furthermore, MACO iteratively optimizes the generated CGRAs, leveraging agent reasoning and feedback to achieve higher PPA (that is, power, performance, and area) design points for a given domain. In addition, we introduce an LLM self-learning mechanism that employs LLM-driven decision making to select the optimal CGRA to accelerate the design process. We evaluate the framework with state-of-the-art LLM-based methods and manual CGRA design, in terms of performance, power consumption, and area. Experimental results show that MACO efficiently generates high-quality CGRA architectures, significantly reducing manual design effort and demonstrating the potential of our framework for real-world CGRA design.


Paper and Project Links

PDF

Summary

Hardware/software (HW/SW) co-design of coarse-grained reconfigurable arrays (CGRAs) promises high-performance, energy-efficient acceleration across domains, but designing CGRAs is challenging and complex. This work proposes MACO, an open-source multi-agent framework that uses large language models (LLMs) for HW/SW co-design of CGRAs. MACO employs LLM reasoning to generate CGRAs across four stages and iteratively optimizes them toward better power, performance, and area. Experiments show MACO efficiently generates high-quality CGRA architectures and significantly reduces manual design effort.

Key Takeaways

  1. CGRAs are a promising computing architecture that delivers high-performance, energy-efficient acceleration across domains.
  2. Designing CGRAs is highly challenging due to the vast design space, independent architectural parameters, and time-consuming manual design.
  3. MACO is an open-source multi-agent LLM-based framework for HW/SW co-design of CGRAs.
  4. MACO uses LLM reasoning to generate CGRAs and optimize them across four stages (a minimal sketch follows this list).
  5. MACO includes an LLM self-learning mechanism that selects the optimal CGRA through LLM-driven decision making to accelerate the design process.
  6. Experimental results show MACO efficiently generates high-quality CGRA architectures.
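A minimal sketch of the four-stage iterative loop, with each stage as a hypothetical agent callable and a simplified scalar PPA cost (not MACO's actual interfaces):

```python
def maco_loop(spec, codesign, fix_errors, select_best, evaluate, rounds=5):
    feedback = ""
    best, best_ppa = None, float("inf")
    for _ in range(rounds):
        candidates = codesign(spec, feedback)             # 1. HW/SW co-design
        candidates = [fix_errors(c) for c in candidates]  # 2. error correction
        design = select_best(candidates)                  # 3. best-design selection
        ppa, feedback = evaluate(design)                  # 4. evaluation & feedback
        if ppa < best_ppa:  # lower combined power/perf/area cost is better
            best, best_ppa = design, ppa
    return best
```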

Cool Papers

Click here to view paper screenshots

P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication

Authors:Sneha Oram, Pushpak Bhattacharyya

Although explainability and interpretability have received significant attention in artificial intelligence (AI) and natural language processing (NLP) for mental health, reasoning has not been examined in the same depth. Addressing this gap is essential to bridge NLP and mental health through interpretable and reasoning-capable AI systems. To this end, we investigate the pragmatic reasoning capability of large-language models (LLMs) in the mental health domain. We introduce PRiMH dataset, and propose pragmatic reasoning tasks in mental health with pragmatic implicature and presupposition phenomena. In particular, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the tasks presented, we consider four models: Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning abilities in the domain. Subsequently, we study the behavior of MentaLLaMA on the proposed reasoning tasks with the rollout attention mechanism. In addition, we also propose three StiPRompts to study the stigma around mental health with the state-of-the-art LLMs, GPT4o-mini, Deepseek-chat, and Claude-3.5-haiku. Our evaluated findings show that Claude-3.5-haiku deals with stigma more responsibly compared to the other two LLMs.


Paper and Project Links

PDF

Summary

The paper examines the pragmatic reasoning capability of large language models (LLMs) in the mental health domain, a dimension that has received less attention than explainability and interpretability. It introduces the PRiMH dataset and proposes pragmatic reasoning tasks in mental health built on implicature and presupposition phenomena: two implicature tasks and one presupposition task. Benchmarking Llama3.1, Mistral, MentaLLaMa, and Qwen shows that Mistral and Qwen have substantial reasoning abilities in the domain, and MentaLLaMA's behavior on the tasks is further studied with a rollout attention mechanism. The paper also proposes three StiPRompts to study mental-health stigma with the state-of-the-art LLMs GPT4o-mini, Deepseek-chat, and Claude-3.5-haiku; the evaluation shows Claude-3.5-haiku deals with stigma more responsibly than the other two.

Key Takeaways

  1. The study focuses on the pragmatic reasoning capability of large language models (LLMs) in the mental health domain.
  2. The PRiMH dataset is introduced for pragmatic reasoning tasks with implicature and presupposition phenomena.
  3. In the experiments, Mistral and Qwen show substantial reasoning abilities.
  4. MentaLLaMA's behavior on the proposed reasoning tasks is studied with a rollout attention mechanism.
  5. Three StiPRompts are proposed to probe the stigma around mental health.
  6. Claude-3.5-haiku handles mental-health stigma more responsibly than GPT4o-mini and Deepseek-chat.

Cool Papers

Click here to view paper screenshots

Know What You Don’t Know: Uncertainty Calibration of Process Reward Models

Authors:Young-Jin Park, Kristjan Greenewald, Kaveh Alim, Hao Wang, Navid Azizan

Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms for large language models (LLMs). However, we observe that even state-of-the-art PRMs can be poorly calibrated. Specifically, they tend to overestimate the success probability that a partial reasoning step will lead to a correct final answer, particularly when smaller LLMs are used to complete the reasoning trajectory. To address this, we present a calibration approach – performed via quantile regression – that adjusts PRM outputs to better align with true success probabilities. Leveraging these calibrated success estimates and their associated confidence bounds, we introduce an \emph{instance-adaptive scaling} (IAS) framework that dynamically adjusts the compute budget based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer. Unlike conventional methods that allocate a fixed number of reasoning trajectories per query, this approach adapts to each instance and reasoning step when using our calibrated PRMs. Experiments on mathematical reasoning benchmarks show that (i) our PRM calibration method achieves small calibration error, outperforming the baseline methods, (ii) calibration is crucial for enabling effective IAS, and (iii) the proposed IAS strategy reduces inference costs while maintaining final answer accuracy, utilizing less compute on more confident problems as desired.


Paper and Project Links

PDF Accepted at NeurIPS 2025

Summary

Process reward models (PRMs) play a central role in inference-time scaling algorithms for large language models (LLMs), yet even state-of-the-art PRMs can be poorly calibrated, overestimating the probability that a partial reasoning step will lead to a correct final answer. The paper presents a calibration approach based on quantile regression that adjusts PRM outputs to better align with true success probabilities. Using the calibrated success estimates and their confidence bounds, it introduces an instance-adaptive scaling (IAS) strategy that dynamically adjusts the compute budget according to the estimated likelihood that a partial reasoning trajectory will yield a correct answer, instead of allocating a fixed number of reasoning trajectories per query. Experiments on mathematical reasoning benchmarks show the strategy delivers solid gains.

Key Takeaways

  • PRMs play a central role in inference-time scaling algorithms for large language models but can be poorly calibrated.
  • Quantile regression is used to calibrate PRM outputs so they better align with true success probabilities (a minimal sketch follows this list).
  • Calibrated success estimates and their confidence bounds drive a strategy that adaptively adjusts the compute budget.
  • Experiments show the proposed PRM calibration method achieves small calibration error, outperforming baseline methods.
  • Calibration is crucial for enabling effective instance-adaptive scaling, which reduces inference cost while maintaining final-answer accuracy.
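A minimal sketch of both steps, assuming a scikit-learn quantile regressor as the calibrator and a simple linear budget rule (the paper's exact model and allocation policy may differ):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_calibrator(prm_scores, outcomes, q=0.5):
    # Map raw PRM scores to the q-th quantile of the 0/1 success outcome.
    model = GradientBoostingRegressor(loss="quantile", alpha=q)
    model.fit(np.asarray(prm_scores).reshape(-1, 1), outcomes)
    return model

def adaptive_budget(calibrator, prm_score, max_paths=16, min_paths=1):
    # Instance-adaptive scaling: confident partial trajectories get few
    # sampled continuations, uncertain ones get many.
    p = float(np.clip(calibrator.predict([[prm_score]])[0], 0.0, 1.0))
    return max(min_paths, round((1.0 - p) * max_paths))
```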

Cool Papers

Click here to view paper screenshots

Towards Explainable Fake Image Detection with Multi-Modal Large Language Models

Authors:Yikun Ji, Yan Hong, Jiahui Zhan, Haoxing Chen, jun lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang

Progress in image generation raises significant public security concerns. We argue that fake image detection should not operate as a “black box”. Instead, an ideal approach must ensure both strong generalization and transparency. Recent progress in Multi-modal Large Language Models (MLLMs) offers new opportunities for reasoning-based AI-generated image detection. In this work, we evaluate the capabilities of MLLMs in comparison to traditional detection methods and human evaluators, highlighting their strengths and limitations. Furthermore, we design six distinct prompts and propose a framework that integrates these prompts to develop a more robust, explainable, and reasoning-driven detection system. The code is available at https://github.com/Gennadiyev/mllm-defake.


Paper and Project Links

PDF Accepted to ACM MM 2025; 14 pages including Appendix

Summary
Progress in image generation raises public security concerns, and fake image detection should not operate as a black box: an ideal approach must ensure both strong generalization and transparency. Multi-modal large language models (MLLMs) open new opportunities for reasoning-based detection of AI-generated images. The work evaluates MLLMs against traditional detection methods and human evaluators, highlighting their strengths and limitations, then designs six distinct prompts and proposes a framework that integrates them into a more robust, explainable, reasoning-driven detection system.

Key Takeaways

  1. Progress in image generation raises public security concerns, especially around fake image detection.
  2. An ideal fake-image detection method must combine strong generalization with transparency.
  3. Multi-modal large language models (MLLMs) show potential for detecting AI-generated images.
  4. MLLMs are evaluated against traditional methods and human evaluators, revealing both strengths and limitations.
  5. Six distinct prompts are designed to strengthen the robustness and explainability of MLLM-based image detection.
  6. The proposed framework integrates these prompts to build a more robust, explainable, reasoning-driven detection system.

Cool Papers

Click here to view paper screenshots

MorphTok: Morphologically Grounded Tokenization for Indian Languages

Authors:Maharaj Brahma, N J Karthika, Atul Singh, Devaraj Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar

Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm for subword tokenization that greedily merges frequent character bigrams, often leading to segmentation that does not align with linguistically meaningful units. To address this, we propose morphology-aware segmentation as a pre-tokenization step before applying BPE. To facilitate morphology-aware segmentation, we create a novel dataset for Hindi and Marathi, incorporating sandhi splitting to enhance the subword tokenization. Experiments on downstream tasks show that morphologically grounded tokenization improves machine translation and language modeling performance. Additionally, to handle the dependent vowels common in syllable-based writing systems used by Indic languages, we propose Constrained BPE (CBPE), an extension to the standard BPE algorithm incorporating script-specific constraints. In particular, CBPE handles dependent vowels to form a cohesive unit with other characters instead of occurring as a single unit. Our results show that CBPE achieves a 1.68% reduction in fertility scores while maintaining comparable or improved downstream performance in machine translation and language modeling, offering a computationally efficient alternative to standard BPE. Moreover, to evaluate segmentation across different tokenization algorithms, we introduce a new human evaluation metric, \textit{EvalTok}, enabling more human-grounded assessment.


Paper and Project Links

PDF Accepted at Tokenization Workshop (TokShop), ICML 2025

Summary

Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), affecting downstream performance, computational cost, and efficiency. The classical Byte-Pair Encoding (BPE) algorithm greedily merges frequent character bigrams and often produces segmentations that do not align with linguistically meaningful units. The paper proposes morphology-aware segmentation as a pre-tokenization step before BPE, supported by a new dataset for Hindi and Marathi that incorporates sandhi splitting. To handle the dependent vowels common in Indic syllable-based scripts, it further proposes Constrained BPE (CBPE), an extension of standard BPE with script-specific constraints so dependent vowels form a cohesive unit with other characters. CBPE achieves a 1.68% reduction in fertility scores while maintaining comparable or improved downstream performance in machine translation and language modeling. A new human evaluation metric, EvalTok, enables more human-grounded assessment of segmentation across tokenization algorithms.

Key Takeaways

  1. Tokenization is a crucial step in NLP, particularly in the context of large language models.
  2. The classical Byte-Pair Encoding (BPE) algorithm has limitations: its segmentations often misalign with linguistically meaningful units.
  3. Morphology-aware segmentation is proposed as a pre-tokenization step to address these shortcomings.
  4. Constrained BPE (CBPE) is introduced to handle the dependent vowels of Indic languages via script-specific constraints.
  5. CBPE lowers the fertility score while maintaining downstream task performance (the metric is sketched below).
  6. The new human evaluation metric EvalTok assesses segmentation quality across tokenization algorithms.
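A minimal sketch of the fertility metric referenced above (average subword tokens per word, lower is better); the whitespace word split is a simplification for illustration:

```python
def fertility(tokenize, corpus: list[str]) -> float:
    words = [w for line in corpus for w in line.split()]
    n_tokens = sum(len(tokenize(w)) for w in words)
    return n_tokens / len(words)
```

Under this metric, CBPE's constraint that dependent vowels merge with neighboring characters lowers the token count per word while keeping downstream accuracy comparable.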

Cool Papers

Click here to view paper screenshots

LEME: Open Large Language Models for Ophthalmology with Advanced Reasoning and Clinical Validation

Authors:Hyunjae Kim, Xuguang Ai, Sahana Srinivasan, Aidan Gilson, Maxwell B. Singer, Krithi Pushpanathan, Qianqian Xie, Jungwoo Park, Serina Applebaum, Gabriel Dawei Yang, Minjie Zou, David Ziyou Chen, Ke Zou, Soshian Sarrafpour, Ji Liu, Yu Yin, Jimin Huang, Quang Ngoc Nguyen, Erping Long, Peixing Wan, Dianbo Liu, Richard Hintz, W. Jim Zheng, Sophia Y. Wang, Lucila Ohno-Machado, Hua Xu, Ron A. Adelman, Luciano V. Del Priore, Yih-Chung Tham, Qingyu Chen

The rising prevalence of eye diseases poses a growing public health burden. Large language models (LLMs) offer a promising path to reduce documentation workload and support clinical decision-making. However, few have been tailored for ophthalmology, and most evaluations focus mainly on knowledge-based QA without clinically relevant benchmarks or real-world validation. Here, we present LEME, a suite of open-weight LLMs developed through a two-stage process: (1) instruction tuning on 200,000 samples from clinical guidelines, textbooks, and case reports to enhance reasoning and task-following, and (2) reinforcement learning with ~30,000 preference labels to enhance accuracy and informativeness. LEME was evaluated on five curated zero-shot benchmarks spanning tasks such as patient QA, consultation, and treatment planning. It outperformed all seven baselines (all p < 0.004), exceeding GPT-4o by 3.32% (absolute ROUGE-L gain). It was further evaluated on three downstream tasks using deidentified patient data, reviewed by clinicians. In patient QA, LEME received the highest ratings from attending clinicians in 3 out of 4 criteria, with scores of 4.67 for factuality, 4.77 for specificity, 4.79 for completeness, and 4.88 for safety (1-5 scale). Its completeness score surpassed that of expert-written answers (4.79 vs. 4.56; p = 0.015). In visual acuity extraction, LEME achieved the highest F1, outperforming LLaMA-3 by 14.1% and Eye-LLaMA by 59.0%. In a pilot evaluation on assessment and treatment planning for diabetic retinopathy, AMD, and glaucoma, LEME received scores of 4.36 for factuality, 4.55 for specificity, 4.42 for completeness, and 4.36 for safety, approaching attending-level performance. All models, data, and code will be released to support further development and clinical translation, laying the groundwork for improved efficiency and patient care


Paper and Project Links

PDF

Summary

Large language models (LLMs) hold great potential in ophthalmology, where they can reduce documentation workload and support clinical decision-making. The LEME suite of open-weight LLMs, developed through two-stage training, performs strongly on ophthalmic question answering, consultation, and treatment planning, approaching and in places exceeding expert-level performance.

Key Takeaways

  1. The rising prevalence of eye diseases poses a public health burden; LLMs can reduce documentation workload and support clinical decision-making.
  2. LEME is trained in two stages: instruction tuning on clinical guidelines, textbooks, and case reports, followed by reinforcement learning with ~30,000 preference labels.
  3. LEME outperforms all baselines, including GPT-4o, on benchmarks spanning patient QA, treatment planning, and visual acuity extraction.
  4. In patient QA, LEME received high clinician ratings for factuality, specificity, completeness, and safety.
  5. LEME approaches attending-level performance in assessment and treatment planning for diabetic retinopathy, AMD, and glaucoma.
  6. All models, data, and code will be released to support further development and clinical translation.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!