LLM

发布日期: 2025-05-15

更新日期: 2025-05-26

文章字数: 20k

阅读时长: 82 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-05-15 更新

CodePDE: An Inference Framework for LLM-driven PDE Solver Generation

Authors:Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, Ameet Talwalkar

Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers rely on expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). Leveraging advanced inference-time algorithms and scaling strategies, CodePDE unlocks critical capacities of LLM for PDE solving: reasoning, debugging, selfrefinement, and test-time scaling – all without task-specific tuning. CodePDE achieves superhuman performance across a range of representative PDE problems. We also present a systematic empirical analysis of LLM generated solvers, analyzing their accuracy, efficiency, and numerical scheme choices. Our findings highlight the promise and the current limitations of LLMs in PDE solving, offering a new perspective on solver design and opportunities for future model development. Our code is available at https://github.com/LithiumDA/CodePDE.

偏微分方程（PDE）是物理系统建模的基础，但其求解仍然是一项复杂的挑战。传统的数值求解器依赖于专家知识进行实施，计算成本高昂，而基于神经网络的求解器则需要大量的训练数据集，并且往往缺乏可解释性。在这项工作中，我们将PDE求解看作是一种代码生成任务，并引入了CodePDE，这是使用大型语言模型（LLM）生成PDE求解器的第一个推理框架。利用先进的推理时算法和扩展策略，CodePDE解锁了LLM用于PDE求解的关键能力：推理、调试、自我完善和测试时间扩展——所有这一切都不需要进行特定任务的调整。CodePDE在一系列具有代表性的PDE问题上实现了超人类的性能。我们还对LLM生成的求解器进行了系统的实证分析，分析了它们的准确性、效率和数值方案的选择。我们的研究结果突出了LLM在PDE求解中的潜力及当前局限，为求解器设计和未来模型开发提供了新视角和机会。我们的代码可在https://github.com/LithiumDA/CodePDE找到。

论文及项目相关链接

PDF

Summary
PDE求解是物理系统建模的基础，但求解仍然是一项复杂的挑战。传统数值求解器依赖于专业知识实施且计算成本高，而基于神经网络的求解器需要大量训练数据集且缺乏可解释性。本研究将PDE求解框架化为代码生成任务，并引入CodePDE，这是使用大型语言模型（LLM）生成PDE求解器的首个推理框架。CodePDE利用先进的推理时间算法和扩展策略，解锁了LLM用于PDE求解的关键能力：推理、调试、自我完善和测试时间扩展，无需特定任务调整。CodePDE在代表性PDE问题上实现了超人类性能。我们还对LLM生成的求解器进行了系统的实证分析，分析其准确性、效率和数值方案选择。我们的研究突出了LLM在PDE求解中的潜力与当前局限性，为求解器设计和未来模型开发提供了新的视角和机会。

Key Takeaways

PDE求解是物理系统建模的基础，但求解过程复杂且传统方法存在局限性。
本研究将PDE求解框架化为代码生成任务，引入CodePDE，这是首个使用大型语言模型（LLM）的推理框架。
CodePDE能够利用LLM进行推理、调试、自我完善和测试时间扩展，无需特定任务调整。
CodePDE在代表性PDE问题上实现了超人类性能。
LLM生成的求解器在准确性、效率和数值方案选择方面进行了系统的实证分析。
研究突出了LLM在PDE求解中的潜力与当前局限性。

Cool Papers

点此查看论文截图

HealthBench: Evaluating Large Language Models Towards Improved Human Health

Authors:Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, Karan Singhal

We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo’s 16% to GPT-4o’s 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.

我们推出HealthBench，这是一个衡量大型语言模型在医疗保健领域性能和安全的开源基准测试。HealthBench包含5000次多轮对话，对话内容是在模型和个人用户或医疗专业人士之间进行的。响应的评价使用由262名医生创建的特定对话评分标准。与健康护理之前的多项选择或简短答案基准测试不同，HealthBench能够通过涵盖多个健康背景（例如紧急情况、转换临床数据、全球健康）和行为维度（例如准确性、遵循指令、沟通）的48562个独特评分标准，进行现实、开放的评估。过去两年的HealthBench性能反映了稳定的初步进展（比较GPT-3.5 Turbo的16%到GPT-4o的32%），以及更快的近期改进（o3得分为60%）。小型模型尤其取得了进步：GPT-4.1 nano优于GPT-4o，并且成本降低了25倍。我们还发布了两个HealthBench变体：HealthBench Consensus，包括通过医生共识验证的模型行为的34个特别重要的维度；以及HealthBench Hard，当前最高得分为32%。我们希望HealthBench能为模型发展和有益于人类健康的应用奠定基础。

Summary

健康基准（HealthBench）是一个衡量大型语言模型在医疗领域性能和安全的开源基准测试。它包含5000次模型与用户或医疗专业人士之间的多轮对话。响应评价采用由262名医师创建的特定对话标准的评分系统。不同于之前的多项选择或简短答案基准测试，健康基准测试通过涵盖多个健康背景和行为维度的48562个独特标准，实现了现实、开放式的评估。在过去的两年里，随着模型的发展，健康基准测试的得分呈现出稳步上升的趋势。尤其是小型模型的进步显著：GPT-4.1 nano的表现超过了GPT-4o，并且成本仅为后者的二十五分之一。此外，还推出了健康基准共识和健康基准难题两个版本。

Key Takeaways

健康基准（HealthBench）是一个用于评估大型语言模型在医疗领域的性能和安全的开源基准测试。
它包含5000次多轮对话，对话内容模拟了模型与用户或医疗专业人士的互动。
响应评价采用特定对话标准的评分系统，由262名医师共同制定。
健康基准测试实现了现实、开放式的评估，涵盖多个健康背景和行为维度，包括紧急情况、临床数据转换、全球健康等。
过去两年里，语言模型在健康基准测试上的表现逐渐提高，尤其是小型模型的进步显著。
GPT-4.1 nano模型表现优秀，不仅超越了GPT-4o，而且成本更为低廉。

Cool Papers

点此查看论文截图

Towards Autonomous UAV Visual Object Search in City Space: Benchmark and Agentic Methodology

Authors:Yatai Ji, Zhengqiu Zhu, Yong Zhao, Beidan Liu, Chen Gao, Yihao Zhao, Sihang Qiu, Yue Hu, Quanjun Yin, Yong Li

Aerial Visual Object Search (AVOS) tasks in urban environments require Unmanned Aerial Vehicles (UAVs) to autonomously search for and identify target objects using visual and textual cues without external guidance. Existing approaches struggle in complex urban environments due to redundant semantic processing, similar object distinction, and the exploration-exploitation dilemma. To bridge this gap and support the AVOS task, we introduce CityAVOS, the first benchmark dataset for autonomous search of common urban objects. This dataset comprises 2,420 tasks across six object categories with varying difficulty levels, enabling comprehensive evaluation of UAV agents’ search capabilities. To solve the AVOS tasks, we also propose PRPSearcher (Perception-Reasoning-Planning Searcher), a novel agentic method powered by multi-modal large language models (MLLMs) that mimics human three-tier cognition. Specifically, PRPSearcher constructs three specialized maps: an object-centric dynamic semantic map enhancing spatial perception, a 3D cognitive map based on semantic attraction values for target reasoning, and a 3D uncertainty map for balanced exploration-exploitation search. Also, our approach incorporates a denoising mechanism to mitigate interference from similar objects and utilizes an Inspiration Promote Thought (IPT) prompting mechanism for adaptive action planning. Experimental results on CityAVOS demonstrate that PRPSearcher surpasses existing baselines in both success rate and search efficiency (on average: +37.69% SR, +28.96% SPL, -30.69% MSS, and -46.40% NE). While promising, the performance gap compared to humans highlights the need for better semantic reasoning and spatial exploration capabilities in AVOS tasks. This work establishes a foundation for future advances in embodied target search. Dataset and source code are available at https://anonymous.4open.science/r/CityAVOS-3DF8.

无人机在城市环境中的空中视觉目标搜索（AVOS）任务要求无人机利用视觉和文本线索自主搜索并识别目标对象，而无需外部指导。现有方法在复杂的城市环境中面临冗余语义处理、相似对象区分以及探索与利用困境等挑战。为了弥补这一差距并支持AVOS任务，我们引入了CityAVOS，这是用于城市常见目标自主搜索的第一个基准数据集。该数据集包含六个目标类别的2,420个任务，难度级别各不相同，能够全面评估无人机代理的搜索能力。为了解决AVOS任务，我们还提出了PRPSearcher（感知-推理-规划搜索器），这是一种由多模态大型语言模型（MLLMs）驱动的新型代理方法，可模仿人类的三层认知。具体来说，PRPSearcher构建了三张专门地图：一张增强空间感知的对象中心动态语义地图，一张基于目标推理的语义吸引力值的3D认知地图，以及一张平衡探索与利用的3D不确定性地图。此外，我们的方法还融入了一种降噪机制，以减轻相似对象的干扰，并采用了灵感促进思考（IPT）的提示机制来进行自适应行动规划。在CityAVOS上的实验结果表明，PRPSearcher在成功率和搜索效率上均超越了现有基线（平均成功率+37.69%，SPL+28.96%，MSS-30.69%，NE-46.40%）。尽管很有希望，但与人类的性能差距表明，AVOS任务需要更好的语义推理和空间探索能力。这项工作为未来的嵌入式目标搜索进展奠定了基础。数据集和源代码可在https://anonymous.4open.science/r/CityAVOS-3DF8上找到。

论文及项目相关链接

PDF

Summary

在城市环境中的空中视觉目标搜索（AVOS）任务中，无人机需要自主通过视觉和文本线索搜索并识别目标物体，而无需外部指导。现有方法在处理复杂城市环境时面临冗余语义处理、相似物体区分以及探索与利用困境等问题。为了弥补这一差距并支持AVOS任务，我们引入了CityAVOS数据集，这是首个针对城市常见目标自主搜索的基准数据集。该数据集包含六个不同难度级别的对象类别的2420个任务，能够全面评估无人机搜索能力。为解决AVOS任务，我们还提出了PRPSearcher（感知-推理-规划搜索器），这是一种由多模态大型语言模型驱动的新型代理方法，可模仿人类的三层认知。实验结果表明，PRPSearcher在成功率和搜索效率上均超越了现有基线方法。尽管结果具有前景，但与人类相比的性能差距仍突显了AVOS任务中更好语义推理和空间探索能力的必要性。本研究为未来的进步奠定了坚实的基础。更多信息请访问：链接地址。

Key Takeaways

城市环境中的空中视觉目标搜索（AVOS）任务需要无人机自主完成目标物体的视觉和文本线索搜索和识别。
当前方法在处理复杂城市环境时存在挑战，如冗余语义处理、相似物体区分和探索与利用的矛盾。
为支持AVOS任务，引入了CityAVOS数据集，包含六个对象类别的不同难度级别的任务，用于全面评估无人机的搜索能力。
提出了一种新的代理方法PRPSearcher，通过多模态大型语言模型模仿人类的三层认知结构来解决AVOS任务。
PRPSearcher通过构建三种特殊地图（对象为中心动态语义图、基于语义吸引力的三维认知图和三维不确定性图）以及去噪机制和自适应行动规划机制来提高搜索效率。
实验结果表明，PRPSearcher在成功率和搜索效率上超越了现有基线方法。

Cool Papers

点此查看论文截图

DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models

Authors:Xiaoyang Chen, Xinan Dai, Yu Du, Qian Feng, Naixu Guo, Tingshuo Gu, Yuting Gao, Yingyi Gao, Xudong Han, Xiang Jiang, Yilin Jin, Hongyi Lin, Shisheng Lin, Xiangnan Li, Yuante Li, Yixing Li, Zhentao Lai, Zilu Ma, Yingrong Peng, Jiacheng Qian, Hao-Yu Sun, Jianbo Sun, Zirui Wang, Siwei Wu, Zian Wang, Bin Xu, Jianghao Xu, Yiyang Yu, Zichuan Yang, Hongji Zha, Ruichong Zhang

To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper represents the initial contribution of this initiative. While recent developments in mathematical LLMs have predominantly emphasized reasoning skills, as evidenced by benchmarks on elementary to undergraduate-level mathematical tasks, the creative capabilities of these models have received comparatively little attention, and evaluation datasets remain scarce. To address this gap, we propose an evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a novel, high-quality benchmark comprising constructive problems across algebra, geometry, analysis, and other domains. We conduct a systematic evaluation of mainstream LLMs’ creative problem-solving abilities using this dataset. Experimental results show that even under lenient scoring criteria – emphasizing core solution components and disregarding minor inaccuracies, such as small logical gaps, incomplete justifications, or redundant explanations – the best-performing model, O3 Mini, achieves merely 70% accuracy, primarily on basic undergraduate-level constructive tasks. Performance declines sharply on more complex problems, with models failing to provide substantive strategies for open problems. These findings suggest that, although current LLMs display a degree of constructive proficiency on familiar and lower-difficulty problems, such performance is likely attributable to the recombination of memorized patterns rather than authentic creative insight or novel synthesis.

为了提升大型语言模型（LLM）的数学能力，DeepMath团队发起了一项开源倡议，旨在开发一个开源的数学LLM，并系统地评估其数学创造力。本文代表了这个倡议的初步贡献。虽然最近的数学LLM发展主要强调推理能力，正如在基础到大学水平的数学任务基准测试中所证明的那样，但这些模型的创造性能力相比之下受到的关注较少，评估数据集仍然稀缺。为了解决这一差距，我们提出了数学创造力的评估标准，并介绍了DeepMath-Creative，这是一个新的高质量基准测试，包含代数、几何、分析和其他领域的构造性问题。我们使用此数据集系统地评估主流LLM的创造性问题解决能力。实验结果表明，即使在宽松的评分标准下——强调核心解决方案成分，并忽略小的不准确之处，如小的逻辑差距、不完整理由或冗余解释——表现最佳的模型O3 Mini也仅达到70%的准确率，主要是在基本的本科水平构造性任务上。在更复杂的问题上，性能急剧下降，模型无法为开放问题提供实质性的策略。这些发现表明，尽管当前LLM在熟悉和较低难度的问题上显示出一定程度的构造性能力，但这种表现很可能归因于记忆模式的重组，而不是真正的创造力或新颖的综合能力。

论文及项目相关链接

PDF 14 pages, 4 figures

Summary

DeepMath团队推出了一项开源倡议，旨在开发一个开放的数学大型语言模型（LLM）并系统地评估其数学创造力。为解决当前数学LLM在创造力评价方面的不足，团队提出了数学创造力的评价标准和新型高质量基准测试DeepMath-Creative。使用此数据集对主流LLM的创造性解决问题的能力进行了系统评估。结果显示，即使在宽松评分标准下，现有模型的性能仍有限，面临复杂问题时无法提供实质性解决方案。这表明当前LLM的表现更多依赖于对记忆模式的重组，而非真正的创造性洞察或新颖的综合能力。

Key Takeaways

DeepMath团队提出了一个旨在开发数学LLM并评估其数学创造力的开源倡议。
当前数学LLM主要侧重于推理能力，而对其创造性能力关注较少。
团队引入了数学创造力的评价标准和新型基准测试DeepMath-Creative。
使用DeepMath-Creative进行的系统评估显示，主流LLM在创造性解决问题方面的能力有限。
在宽松评分标准下，最佳模型O3 Mini在基本本科生水平的构造任务上仅达到70%的准确率。
在复杂问题上，LLM的实质性策略缺失，难以提供有效的解决方案。

Cool Papers

点此查看论文截图

Securing RAG: A Risk Assessment and Mitigation Framework

Authors:Lukas Ammann, Sara Ott, Christoph R. Landolt, Marco P. Lehmann

Retrieval Augmented Generation (RAG) has emerged as the de facto industry standard for user-facing NLP applications, offering the ability to integrate data without re-training or fine-tuning Large Language Models (LLMs). This capability enhances the quality and accuracy of responses but also introduces novel security and privacy challenges, particularly when sensitive data is integrated. With the rapid adoption of RAG, securing data and services has become a critical priority. This paper first reviews the vulnerabilities of RAG pipelines, and outlines the attack surface from data pre-processing and data storage management to integration with LLMs. The identified risks are then paired with corresponding mitigations in a structured overview. In a second step, the paper develops a framework that combines RAG-specific security considerations, with existing general security guidelines, industry standards, and best practices. The proposed framework aims to guide the implementation of robust, compliant, secure, and trustworthy RAG systems.

检索增强生成（RAG）已成为面向用户的NLP应用的行业实际标准，它提供了在无需重新训练或微调大型语言模型（LLM）的情况下整合数据的能力。这一功能提高了响应的质量和准确性，但也带来了新的安全和隐私挑战，尤其是在整合敏感数据时。随着RAG的快速采用，保障数据和服务的安全已成为至关重要的优先事项。本文首先回顾了RAG管道的安全漏洞，并概述了从数据预处理和数据存储管理到与LLM集成的攻击面。然后，将确定的风险与相应的缓解措施在结构化概述中配对。在第二步中，本文开发了一个框架，该框架结合了RAG特定的安全考虑因素、现有的通用安全准则、行业标准和最佳实践。所提出的框架旨在指导实施稳健、合规、安全和可信赖的RAG系统。

论文及项目相关链接

PDF 8 pages, 3 figures, Sara Ott and Lukas Ammann contributed equally

Summary
大規模語言模型（LLM）集成中的检索增强生成（RAG）已成为面向用户的NLP应用程序的行业标准，可在不重新训练或微调模型的情况下整合数据。这项功能增强了响应的质量和准确性，但也带来了新的安全和隐私挑战，尤其是在集成敏感数据时。本综述介绍了RAG管道的安全漏洞，并概述了从数据预处理和存储管理到与LLM集成的攻击面。所提出的风险与相应的缓解措施被结构化概述。本文接着开发了一个结合了RAG特有的安全考虑因素以及现有的一般安全准则、行业标准和最佳实践的框架。该框架旨在指导实施稳健、合规、安全和可靠的RAG系统。

Key Takeaways

RAG已成为NLP应用的行业标准，可以集成数据而无需重新训练或微调LLM，提高了响应质量和准确性。
RAG引入新的安全和隐私挑战，特别是在处理敏感数据时。
本文详细回顾了RAG的安全漏洞，并描述了从数据预处理到LLM集成的攻击面。
针对RAG的安全风险，提出了相应的缓解措施。
结合RAG特有的安全考虑因素，本综述开发了综合性的安全框架。
该框架结合了现有的一般安全准则、行业标准和最佳实践。

Cool Papers

点此查看论文截图

LLM-based Prompt Ensemble for Reliable Medical Entity Recognition from EHRs

Authors:K M Sajjadul Islam, Ayesha Siddika Nipu, Jiawei Wu, Praveen Madiraju

Electronic Health Records (EHRs) are digital records of patient information, often containing unstructured clinical text. Named Entity Recognition (NER) is essential in EHRs for extracting key medical entities like problems, tests, and treatments to support downstream clinical applications. This paper explores prompt-based medical entity recognition using large language models (LLMs), specifically GPT-4o and DeepSeek-R1, guided by various prompt engineering techniques, including zero-shot, few-shot, and an ensemble approach. Among all strategies, GPT-4o with prompt ensemble achieved the highest classification performance with an F1-score of 0.95 and recall of 0.98, outperforming DeepSeek-R1 on the task. The ensemble method improved reliability by aggregating outputs through embedding-based similarity and majority voting.

电子健康记录（EHRs）是患者信息的数字记录，通常包含非结构的临床文本。在电子健康记录中，命名实体识别（NER）对于提取关键医疗实体至关重要，如问题、测试和治疗方法，以支持下游的临床应用。本文探讨了基于大型语言模型的医疗实体识别（LLM），特别是使用GPT-4o和DeepSeek-R1模型进行探索，这些模型由各种提示工程技术引导，包括零样本、少样本和集成方法。在所有策略中，使用提示集成的GPT-4o在F1分数上达到了最高的分类性能为0.95和召回率为0.98。在此任务上超过了DeepSeek-R1模型的表现。集成方法通过基于嵌入的相似性和多数投票来聚合输出，提高了可靠性。

论文及项目相关链接

PDF IEEE 26th International Conference on Information Reuse and Integration for Data Science (IRI 2025), San Jose, CA, USA

Summary

本文主要探讨了基于大型语言模型（LLMs）的命名实体识别（NER）在电子健康记录（EHRs）中的应用。研究采用了GPT-4o和DeepSeek-R1模型，并运用了多种提示工程技术，如零样本、少样本和集成方法。其中，GPT-4o与提示集成策略取得了最高的分类性能，F1分数达到0.95，召回率达到0.98，在任务上优于DeepSeek-R1。集成方法通过基于嵌入的相似性和多数投票来聚合输出，提高了可靠性。

Key Takeaways

电子健康记录（EHRs）包含患者信息的数字记录，其中涉及大量的非结构化临床文本。
命名实体识别（NER）在EHRs中非常重要，可提取关键医疗实体如疾病、测试和治疗方法，以支持下游临床应用。
研究采用了大型语言模型（LLMs），如GPT-4o和DeepSeek-R1，进行基于提示的医学实体识别。
提示工程技术包括零样本、少样本方法被运用于模型训练中。
GPT-4o与提示集成策略表现出最佳性能，F1分数高达0.95，召回率为0.98。
集成方法通过基于嵌入的相似性和多数投票机制提高了性能可靠性。

Cool Papers

点此查看论文截图

Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities

Authors:George Saon, Avihu Dekel, Alexander Brooks, Tohru Nagano, Abraham Daniels, Aharon Satt, Ashish Mittal, Brian Kingsbury, David Haws, Edmilson Morais, Gakuto Kurata, Hagai Aronowitz, Ibrahim Ibrahim, Jeff Kuo, Kate Soule, Luis Lastras, Masayuki Suzuki, Ron Hoory, Samuel Thomas, Sashi Novitasari, Takashi Fukuda, Vishal Sunder, Xiaodong Cui, Zvi Kons

Granite-speech LLMs are compact and efficient speech language models specifically designed for English ASR and automatic speech translation (AST). The models were trained by modality aligning the 2B and 8B parameter variants of granite-3.3-instruct to speech on publicly available open-source corpora containing audio inputs and text targets consisting of either human transcripts for ASR or automatically generated translations for AST. Comprehensive benchmarking shows that on English ASR, which was our primary focus, they outperform several competitors’ models that were trained on orders of magnitude more proprietary data, and they keep pace on English-to-X AST for major European languages, Japanese, and Chinese. The speech-specific components are: a conformer acoustic encoder using block attention and self-conditioning trained with connectionist temporal classification, a windowed query-transformer speech modality adapter used to do temporal downsampling of the acoustic embeddings and map them to the LLM text embedding space, and LoRA adapters to further fine-tune the text LLM. Granite-speech-3.3 operates in two modes: in speech mode, it performs ASR and AST by activating the encoder, projector, and LoRA adapters; in text mode, it calls the underlying granite-3.3-instruct model directly (without LoRA), essentially preserving all the text LLM capabilities and safety. Both models are freely available on HuggingFace (https://huggingface.co/ibm-granite/granite-speech-3.3-2b and https://huggingface.co/ibm-granite/granite-speech-3.3-8b) and can be used for both research and commercial purposes under a permissive Apache 2.0 license.

花岗岩语音LLM是为英语自动语音识别（ASR）和自动语音翻译（AST）专门设计的小型、高效的语音语言模型。这些模型通过模态对齐对花岗岩3.3-instruct的2B和8B参数变体进行训练，在公众可访问的开源语料库上进行语音训练，这些语料库包含由人类转录的ASR音频输入或自动生成的翻译AST文本目标。全面的基准测试显示，我们的主要关注点英语ASR上，它们的性能超过了用大量专有数据训练的竞争对手的模型，并且在主要的欧洲语言、日语和中文的英语到X的AST上也能与之竞争。语音特定组件包括：使用块注意力和自调节训练的连接主义者时间分类法的conformer声音编码器，用于对声音嵌入进行时间下采样并将其映射到LLM文本嵌入空间的窗口查询变压器语音模态适配器，以及LoRA适配器来进一步微调文本LLM。花岗岩语音3.3有两种模式：在语音模式下，它通过激活编码器、投影仪和LoRA适配器来执行ASR和AST；在文本模式下，它直接调用底层的花岗岩3.3-instruct模型（不使用LoRA），基本上保留了所有文本LLM的功能和安全。这两个模型都可以在HuggingFace上免费获得（https://huggingface.co/ibm-granite/granite-speech-3.3-2b 和 https://huggingface.co/ibm-granite/granite-speech-3.3-8b），并且可以在研究和商业用途下根据Apache 2.0许可证使用。

论文及项目相关链接

PDF 7 pages, 9 figures

Summary

这是一篇关于Granite-speech LLM模型的介绍。该模型是为英语语音识别（ASR）和自动语音翻译（AST）设计的紧凑高效的语音语言模型。模型通过模态对齐训练，可在公开可用的开源语料库上进行英语ASR和主要欧洲语言、日语和中文的英语至X的AST任务。它包含语音特定的组件，如使用块注意力和自调节训练的卷积神经网络声学编码器，用于执行时序分类的窗口查询转换器语音模态适配器和LoRA适配器。Granite-speech模型有两种模式：语音模式下执行ASR和AST，文本模式下则直接调用底层的granite-3.3-instruct模型，保留所有文本LLM功能和安全性。这些模型可在Hugging Face上免费使用，并可用于研究和商业目的。

Key Takeaways

Granite-speech LLMs是专为英语ASR和AST设计的紧凑高效的语言模型。
模型通过模态对齐训练，可在公开可用的开源语料库上进行英语ASR任务。
它在英语ASR方面表现出卓越性能，甚至超越了使用大量专有数据训练的竞争对手模型。
模型对于主要欧洲语言、日语和中文的英语至X的AST任务也有良好表现。
Granite-speech模型包含语音特定的组件，如卷积神经网络声学编码器和语音模态适配器。
模型具有两种模式：语音模式和文本模式，可以在不同场景下灵活使用。

Cool Papers

点此查看论文截图

Revealing economic facts: LLMs know more than they say

Authors:Marcus Buckmann, Quynh Anh Nguyen, Edward Hill

We investigate whether the hidden states of large language models (LLMs) can be used to estimate and impute economic and financial statistics. Focusing on county-level (e.g. unemployment) and firm-level (e.g. total assets) variables, we show that a simple linear model trained on the hidden states of open-source LLMs outperforms the models’ text outputs. This suggests that hidden states capture richer economic information than the responses of the LLMs reveal directly. A learning curve analysis indicates that only a few dozen labelled examples are sufficient for training. We also propose a transfer learning method that improves estimation accuracy without requiring any labelled data for the target variable. Finally, we demonstrate the practical utility of hidden-state representations in super-resolution and data imputation tasks.

我们研究大型语言模型（LLM）的隐藏状态是否可以用于估计和填补经济和金融统计数据。我们专注于县级（如失业率）和企业级（如总资产）的变量，发现基于开源LLM隐藏状态训练的简单线性模型表现优于LLM的文本输出。这表明隐藏状态捕捉到的经济信息比LLM的直接回应更丰富。学习曲线分析表明，只需几十个标记样本就足以进行训练。我们还提出了一种迁移学习方法，可以在不需要目标变量的标记数据的情况下提高估计精度。最后，我们展示了隐藏状态表示在超分辨率和数据填补任务中的实际效用。

论文及项目相关链接

PDF 34 pages, 17 figures

Summary

大型语言模型的隐藏状态可用于估计和填补经济与财务统计数据。研究关注县级（如失业率）和企业级（如总资产）变量，显示基于开源大型语言模型隐藏状态训练的简单线性模型优于模型文本输出。这表明隐藏状态捕捉的经济信息比模型直接揭示的更为丰富。学习曲线分析显示只需几十个标签样本即可进行有效训练。此外，提出了一种无需目标变量标签数据的迁移学习方法来提高估计精度。最后，展示了隐藏状态表示在超分辨率和数据填补任务中的实用性。

Key Takeaways

大型语言模型的隐藏状态可用于经济金融统计数据的估计和填补。
隐藏状态包含比模型直接输出更丰富经济信息的可能性。
只需少量标签样本即可训练模型。
提出一种无需目标变量标签数据的迁移学习方法，提高估计准确性。
隐藏状态表示在超分辨率任务中具有实用性。
隐藏状态在数据填补任务中具有实用价值。

Cool Papers

点此查看论文截图

Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models

Authors:Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim

Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are not so effective due to the limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.

文本到图像的生成模型，如DALL-E和Stable Diffusion，已经彻底改变了包括广告、个性化媒体和设计原型在内的各种应用中的视觉内容创作。然而，为引导这些模型而撰写有效的文本提示仍具有挑战性，通常需要大量的试错。现有的提示反转方法，如软提示和硬提示技术，由于解释性和连贯性提示生成能力有限，效果并不理想。为了解决这个问题，我们提出了“视觉引导解码”（VGD）方法，这是一种无需梯度的方法，它利用大型语言模型（LLM）和基于CLIP的指导来生成连贯且语义对齐的提示。本质上，VGD利用LLM强大的文本生成能力来产生人类可读的提示。此外，通过使用CLIP评分来确保与用户指定的视觉概念对齐，VGD提高了提示生成的解释性、通用性和灵活性，而无需额外的训练。我们的实验表明，在生成易于理解和上下文相关的提示方面，VGD优于现有的提示反转技术，更能促进与文本到图像模型的直观和可控交互。

论文及项目相关链接

PDF ICLR 2025

Summary
文本转图像生成模型如DALL-E和Stable Diffusion已广泛应用于广告、个性化媒体和设计原型等各个领域。然而，如何为这些模型提供有效的文本提示仍是一大挑战，通常需要大量试错。现有提示反转方法如软提示和硬提示技术因缺乏可解释性和不连贯的提示生成而效果不佳。为解决这些问题，我们提出一种无需梯度的视觉引导解码（VGD）方法，结合大型语言模型（LLM）和CLIP指导生成连贯且语义对齐的提示。VGD利用LLM的健壮文本生成能力产生人类可读的提示，并通过CLIP评分确保与用户指定的视觉概念对齐，提高了提示生成的可解释性、通用性和灵活性，且无需额外训练。实验证明，VGD在生成可理解和上下文相关的提示方面优于现有提示反转技术，使文本转图像模型的交互更加直观和可控。

Key Takeaways

文本转图像生成模型在广告、个性化媒体和设计原型等领域有广泛应用。
当前模型面临如何提供有效文本提示的挑战，需要更多可解释和连贯的提示生成方法。
现有提示反转方法存在局限性，如缺乏可解释性和不连贯的提示生成。
VGD方法结合大型语言模型和CLIP指导，无需梯度生成连贯且语义对齐的提示。
VGD利用LLM的健壮文本生成能力，产生人类可读的提示。
CLIP评分确保用户指定的视觉概念与提示对齐，提高提示生成的可解释性、通用性和灵活性。

Cool Papers

点此查看论文截图

Small but Significant: On the Promise of Small Language Models for Accessible AIED

Authors:Yumou Wei, Paulo Carvalho, John Stamper

GPT has become nearly synonymous with large language models (LLMs), an increasingly popular term in AIED proceedings. A simple keyword-based search reveals that 61% of the 76 long and short papers presented at AIED 2024 describe novel solutions using LLMs to address some of the long-standing challenges in education, and 43% specifically mention GPT. Although LLMs pioneered by GPT create exciting opportunities to strengthen the impact of AI on education, we argue that the field’s predominant focus on GPT and other resource-intensive LLMs (with more than 10B parameters) risks neglecting the potential impact that small language models (SLMs) can make in providing resource-constrained institutions with equitable and affordable access to high-quality AI tools. Supported by positive results on knowledge component (KC) discovery, a critical challenge in AIED, we demonstrate that SLMs such as Phi-2 can produce an effective solution without elaborate prompting strategies. Hence, we call for more attention to developing SLM-based AIED approaches.

GPT已经成为大型语言模型（LLM）的代名词，这是人工智能教育程序中的一个日益流行的术语。一个简单的基于关键词的搜索结果显示，在AIED 2024会议上展示的76篇长短论文中，有61%的论文描述了使用LLM解决教育中一些长期存在挑战的新解决方案，其中43%特别提到了GPT。虽然由GPT开创的大型语言模型为人工智能对教育的影响创造了激动人心的机会，我们认为，该领域对GPT和其他资源密集型大型语言模型（具有超过10亿个参数）的过度关注，可能忽视了小型语言模型（SLM）在资源受限机构中实现公平和负担得起的优质人工智能工具访问的潜在影响。通过对人工智能教育中的关键挑战——知识组件（KC）发现的积极结果的支持，我们证明了Phi-2等小型语言模型可以在不需要复杂提示策略的情况下产生有效解决方案。因此，我们呼吁更多地关注基于小型语言模型的人工智能教育方法的发展。

论文及项目相关链接

PDF This vision paper advocates using small language models (e.g., Phi-2) in AI for education (AIED)

Summary

大型语言模型（LLM）已成为AI教育研讨会上的热门词汇，GPT尤为突出。尽管GPT引领的LLM为AI在教育领域带来了机会，但过分关注可能会忽略小型语言模型（SLM）在资源受限机构中的潜力。研究表明，SLM如Phi-2在知识组件（KC）发现等关键挑战上表现出积极结果，无需复杂提示策略即可提供有效解决方案。因此，呼吁更多关注基于SLM的AI教育方法的开发。

Key Takeaways

GPT和大型语言模型（LLM）在AI教育领域受到广泛关注。
61%的AIED 2024论文描述了使用LLM解决教育长期挑战的新方法。
43%的论文特别提到了GPT。
对GPT和其他资源密集型LLM的过度关注可能忽视了小型语言模型（SLM）的潜力。
SLM在知识组件（KC）发现等关键挑战上展现出积极结果。
SLM如Phi-2无需复杂的提示策略即可提供有效解决方案。

Cool Papers

点此查看论文截图

FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

Authors:Zhehao Zhang, Weijie Xu, Fanyou Wu, Chandan K. Reddy

Safety alignment approaches in large language models (LLMs) often lead to the over-refusal of benign queries, significantly diminishing their utility in sensitive scenarios. To address this challenge, we introduce FalseReject, a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories. We propose a graph-informed adversarial multi-agent interaction framework to generate diverse and complex prompts, while structuring responses with explicit reasoning to aid models in accurately distinguishing safe from unsafe contexts. FalseReject includes training datasets tailored for both standard instruction-tuned models and reasoning-oriented models, as well as a human-annotated benchmark test set. Our extensive benchmarking on 29 state-of-the-art (SOTA) LLMs reveals persistent over-refusal challenges. Empirical results demonstrate that supervised finetuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.

大型语言模型（LLM）中的安全对齐方法通常会导致良性查询被拒绝，这在敏感场景中大大降低了其效用。为了应对这一挑战，我们引入了FalseReject，这是一个包含16,000个看似有毒查询的综合资源，并提供了跨44个安全相关类别的结构化响应。我们提出了基于图信息的对抗性多智能体交互框架，以生成多样化和复杂的提示，同时通过结构化响应提供明确推理来帮助模型准确区分安全和不安全的上下文。FalseReject既包含针对标准指令调整模型和推理导向模型的训练数据集，也包含一个人工注释的基准测试集。我们对29个最新的大型语言模型进行了广泛的基准测试，揭示了持续的过度拒绝挑战。经验结果表明，使用FalseReject进行监督微调可以大大减少不必要的拒绝，同时不会损害整体安全性或通用语言功能。

论文及项目相关链接

PDF

Summary：大型语言模型（LLM）的安全对齐方法常常导致良性查询被拒绝，严重影响了其在敏感场景中的实用性。为解决这一问题，我们推出FalseReject资源，包含16万个看似有毒的查询以及结构化的回复跨越44个安全类别。我们提出一个图驱动的对抗多智能体交互框架来生成多样且复杂的提示，同时结构化回复以明确推理帮助模型准确区分安全和不安全上下文。FalseReject包括针对标准指令调优模型和推理导向模型的训练数据集，以及一个人工注释的基准测试集。我们对29个最新的大型语言模型进行了广泛的基准测试，发现持续存在的过度拒绝挑战。实证结果表明，使用FalseReject进行有监督微调能显著减少不必要的拒绝，同时不损害整体安全性或通用语言能力。

Key Takeaways：

大型语言模型（LLM）在安全性方面存在过度拒绝良性查询的问题。
FalseReject资源包含看似有毒的查询和结构化回复，旨在解决这一问题。
推出图驱动的对抗多智能体交互框架，用于生成复杂多样的提示。
FalseReject资源包含针对不同类型的语言模型的训练数据集。
广泛的基准测试显示，现有大型语言模型在区分安全和不安全上下文方面存在挑战。
使用FalseReject进行有监督微调能减少不必要的拒绝。

Cool Papers

点此查看论文截图

Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model

Authors:Wei Li, Ming Hu, Guoan Wang, Lihao Liu, Kaijin Zhou, Junzhi Ning, Xin Guo, Zongyuan Ge, Lixu Gu, Junjun He

In ophthalmic surgery, developing an AI system capable of interpreting surgical videos and predicting subsequent operations requires numerous ophthalmic surgical videos with high-quality annotations, which are difficult to collect due to privacy concerns and labor consumption. Text-guided video generation (T2V) emerges as a promising solution to overcome this issue by generating ophthalmic surgical videos based on surgeon instructions. In this paper, we present Ophora, a pioneering model that can generate ophthalmic surgical videos following natural language instructions. To construct Ophora, we first propose a Comprehensive Data Curation pipeline to convert narrative ophthalmic surgical videos into a large-scale, high-quality dataset comprising over 160K video-instruction pairs, Ophora-160K. Then, we propose a Progressive Video-Instruction Tuning scheme to transfer rich spatial-temporal knowledge from a T2V model pre-trained on natural video-text datasets for privacy-preserved ophthalmic surgical video generation based on Ophora-160K. Experiments on video quality evaluation via quantitative analysis and ophthalmologist feedback demonstrate that Ophora can generate realistic and reliable ophthalmic surgical videos based on surgeon instructions. We also validate the capability of Ophora for empowering downstream tasks of ophthalmic surgical workflow understanding. Code is available at https://github.com/mar-cry/Ophora.

在眼科手术中，开发一个能够解读手术视频并预测后续操作的AI系统需要大量的带有高质量注释的眼科手术视频。由于隐私问题和劳动消耗，这些视频的收集非常困难。文本引导的视频生成（T2V）作为一种有前途的解决方案，通过基于外科医生的指令生成眼科手术视频来解决这个问题。在本文中，我们提出了一种先进的模型——奥芙拉（Ophora），它可以根据自然语言指令生成眼科手术视频。为了构建奥芙拉，我们首先提出了一个全面的数据整理管道，将叙述性眼科手术视频转化为大规模的高质量数据集，包含超过16万对视频指令对，即奥芙拉-16万数据集。然后，我们提出了一种渐进的视频指令调整方案，从一个在天然视频文本数据集上预训练的T2V模型中转移丰富的时空知识，用于基于奥芙拉-16万数据集的隐私保护眼科手术视频生成。通过对视频质量的定量分析和眼科医生的反馈进行的实验表明，奥芙拉可以根据外科医生的指令生成现实和可靠的眼科手术视频。我们还验证了奥芙拉在执行眼科手术工作流程理解等下游任务方面的能力。代码可在https://github.com/mar-cry/Ophora中找到。

论文及项目相关链接

PDF Early accepted in MICCAI25

Summary

在眼科手术领域，开发能够解读手术视频并预测后续操作的AI系统需要大量带有高质量注释的眼科手术视频。由于隐私问题和劳动消耗，这些视频的收集很困难。文本引导的视频生成（T2V）技术作为一种解决方案，能够根据外科医生的指令生成眼科手术视频。本文介绍了一种名为Ophora的开创性模型，它能根据自然语言指令生成眼科手术视频。为构建Ophora，我们首先提出了一种全面的数据整理管道，将叙事眼科手术视频转化为大规模高质量数据集Ophora-160K，包含超过16万对视频指令。接着，我们提出了一种渐进式视频指令调整方案，从一个在天然视频文本数据集上预训练的T2V模型转移丰富的时空知识，用于基于Ophora-160K的隐私保护眼科手术视频生成。通过定量分析和眼科医生反馈进行的视频质量评估实验表明，Ophora可以根据外科医生的指令生成真实可靠的眼科手术视频。我们还验证了Ophora在眼科手术工作流程理解下游任务中的能力。相关代码可通过链接获取。

Key Takeaways

眼科手术AI系统的开发面临收集高质量标注视频数据的挑战，因为这些数据涉及隐私问题和劳动密集度高的难题。
文本引导的视频生成（T2V）技术为解决此问题提供了希望，能够根据外科医生的指令生成眼科手术视频。
介绍了一种名为Ophora的模型，能够基于自然语言指令生成眼科手术视频，并具有强大的能力支持下游任务，如眼科手术工作流程的理解。
Ophora的构建包括两个关键步骤：一是使用综合数据整理管道创建大规模高质量数据集Ophora-160K；二是采用渐进式视频指令调整方案，从预训练的T2V模型转移知识。
实验证明，Ophora生成的眼科手术视频既真实又可靠，能够根据外科医生的指令进行生成。
代码已公开发布，便于其他人使用和改进。

Cool Papers

点此查看论文截图

FMNV: A Dataset of Media-Published News Videos for Fake News Detection

Authors:Yihao Wang, Zhong Qian, Peifeng Li

News media, particularly video-based platforms, have become deeply embed-ded in daily life, concurrently amplifying the risks of misinformation dissem-ination. Consequently, multimodal fake news detection has garnered signifi-cant research attention. However, existing datasets predominantly comprise user-generated videos characterized by crude editing and limited public en-gagement, whereas professionally crafted fake news videos disseminated by media outlets-often politically or virally motivated-pose substantially greater societal harm. To address this gap, we construct FMNV, a novel da-taset exclusively composed of news videos published by media organizations. Through empirical analysis of existing datasets and our curated collection, we categorize fake news videos into four distinct types. Building upon this taxonomy, we employ Large Language Models (LLMs) to automatically generate deceptive content by manipulating authentic media-published news videos. Furthermore, we propose FMNVD, a baseline model featuring a dual-stream architecture that integrates spatio-temporal motion features from a 3D ResNeXt-101 backbone and static visual semantics from CLIP. The two streams are fused via an attention-based mechanism, while co-attention modules refine the visual, textual, and audio features for effective multi-modal aggregation. Comparative experiments demonstrate both the generali-zation capability of FMNV across multiple baselines and the superior detec-tion efficacy of FMNVD. This work establishes critical benchmarks for de-tecting high-impact fake news in media ecosystems while advancing meth-odologies for cross-modal inconsistency analysis. Our dataset is available in https://github.com/DennisIW/FMNV.

新闻媒体的普及，特别是视频平台，已经深入到日常生活中，同时也加剧了假信息传播的潜在风险。因此，多模态假新闻检测引起了显著的研究关注。然而，现有数据集主要由用户生成的视频组成，这些视频编辑粗糙且公众参与度有限，而由媒体发布的专业制作的假新闻视频—通常带有政治或病毒式营销动机—对社会造成了更大的危害。为了弥补这一空白，我们构建了FMNV数据集，该数据集仅包含由媒体组织发布的新闻视频。通过对现有数据集和我们的精选集合进行实证分析，我们将假新闻视频分为四种类型。基于这一分类，我们利用大型语言模型（LLM）通过操纵真实的媒体发布新闻视频自动生成欺骗性内容。此外，我们提出FMNVD基线模型，采用双流架构，融合了来自3D ResNeXt-101骨干的时空运动特征和CLIP的静态视觉语义。两个流通过基于注意力的机制融合，同时共注意力模块对视觉、文本和音频特征进行细化，以实现有效的多模态聚合。对比实验表明，FMNV在多个基线上的通用化能力和FMNVD的检测效能均表现出色。这项工作为检测媒体生态系统中高影响力的假新闻建立了关键基准，并推动了跨模态不一致性分析的方法。我们的数据集可在https://github.com/DennisIW/FMNV上获取。

论文及项目相关链接

PDF

Summary

新闻媒体的普及带来了假新闻传播的风险，特别是在视频平台上。现有的假新闻数据集主要集中在用户生成的视频上，缺乏媒体发布的经过专业制作的假新闻视频的研究。本文构建了一个由媒体组织发布的新闻视频组成的新数据集FMNV，并对假新闻视频进行分类。利用大型语言模型（LLM）自动生成欺骗内容，并提出基于双流架构的FMNVD基线模型，实现跨模态假新闻检测。该模型融合时空运动特征和静态视觉语义，通过注意力机制进行多模态聚合，提高检测效果。

Key Takeaways

新闻媒体的普及增加了假新闻传播的风险，特别是视频平台上的假新闻。
现有假新闻数据集主要集中在用户生成的视频上，缺乏媒体发布的经过专业制作的假新闻视频研究。
构建了一个由媒体组织发布的新闻视频组成的新数据集FMNV。
假新闻视频被分类为四种类型。
利用大型语言模型（LLM）自动生成欺骗内容。
提出了基于双流架构的FMNVD基线模型，实现跨模态假新闻检测。

Cool Papers

点此查看论文截图

AI Hiring with LLMs: A Context-Aware and Explainable Multi-Agent Framework for Resume Screening

Authors:Frank P. -W. Lo, Jianing Qiu, Zeyu Wang, Haibao Yu, Yeming Chen, Gao Zhang, Benny Lo

Resume screening is a critical yet time-intensive process in talent acquisition, requiring recruiters to analyze vast volume of job applications while remaining objective, accurate, and fair. With the advancements in Large Language Models (LLMs), their reasoning capabilities and extensive knowledge bases demonstrate new opportunities to streamline and automate recruitment workflows. In this work, we propose a multi-agent framework for resume screening using LLMs to systematically process and evaluate resumes. The framework consists of four core agents, including a resume extractor, an evaluator, a summarizer, and a score formatter. To enhance the contextual relevance of candidate assessments, we integrate Retrieval-Augmented Generation (RAG) within the resume evaluator, allowing incorporation of external knowledge sources, such as industry-specific expertise, professional certifications, university rankings, and company-specific hiring criteria. This dynamic adaptation enables personalized recruitment, bridging the gap between AI automation and talent acquisition. We assess the effectiveness of our approach by comparing AI-generated scores with ratings provided by HR professionals on a dataset of anonymized online resumes. The findings highlight the potential of multi-agent RAG-LLM systems in automating resume screening, enabling more efficient and scalable hiring workflows.

简历筛选是人才招聘中一个至关重要但又耗费时间的环节，招聘人员需要分析大量的求职申请，同时保持客观、准确和公正。随着大型语言模型（LLM）的进步，其推理能力和广泛的知识库为招聘流程的简化和自动化提供了新的机会。在此工作中，我们提出了一个使用LLM进行简历筛选的多代理框架，系统地处理和评估简历。该框架包含四个核心代理，包括简历提取器、评估器、摘要器和计分格式化器。为了提高候选人评估的上下文相关性，我们在简历评估器中集成了检索增强生成（RAG），允许融入外部知识源，如行业专业知识、专业认证、大学排名和公司特定的招聘标准。这种动态适应使个性化招聘成为可能，缩小了人工智能自动化和人才招聘之间的差距。我们通过比较AI生成的分数和人力资源专业人士在匿名在线简历数据集上提供的评分来评估我们方法的有效性。研究结果表明，多代理RAG-LLM系统在自动化简历筛选方面具有潜力，能够实现更高效、可扩展的招聘流程。

论文及项目相关链接

PDF Accepted by CVPR 2025 Workshop

Summary

基于大型语言模型（LLM）的简历筛选多智能体框架，旨在系统化处理和评估简历，提升招聘流程效率和个性化程度。通过简历提取器、评估器、摘要器和分数格式化器等四个核心智能体的协同工作，结合检索增强生成（RAG）技术，融入外部知识源，实现个性化招聘。研究结果表明，该框架在自动化简历筛选方面具有潜力，有助于提高招聘流程的效率和规模性。

Key Takeaways

LLMs在招聘流程中展现自动化潜力。
提出一种基于LLM的多智能体框架用于简历筛选。
框架包括简历提取器、评估器、摘要器和分数格式化器。
整合RAG技术以提升评估的上下文相关性。
能融入外部知识源，如行业专长、专业认证、大学排名和公司招聘标准。
实现个性化招聘，缩小AI自动化与人才招聘之间的差距。

Cool Papers

点此查看论文截图

SMI: An Information-Theoretic Metric for Predicting Model Knowledge Solely from Pre-Training Signals

Authors:Changhao Jiang, Ming Zhang, Junjie Ye, Xiaoran Fan, Yifei Cao, Jiajun Sun, Zhiheng Xi, Shihan Dou, Yi Dong, Yujiong Shen, Jingqi Tong, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang

The GPT-4 technical report highlights the possibility of predicting model performance on downstream tasks using only pre-training signals, though detailed methodologies are absent. Such predictive capabilities are essential for resource-efficient pre-training and the construction of task-aligned datasets. In this paper, we aim to predict performance in closed-book question answering (QA), a vital downstream task indicative of a model’s internal knowledge. We address three primary challenges: (1) limited access to and understanding of pre-training corpora, (2) limitations of current evaluation methods for pre-trained models, and (3) limitations of frequency-based metrics in predicting model performance. In response to these challenges, we conduct large-scale retrieval and semantic analysis across the pre-training corpora of 21 publicly available and 3 custom-trained large language models. Subsequently, we develop a multi-template QA evaluation framework incorporating paraphrased question variants. Building on these foundations, we propose Size-dependent Mutual Information (SMI), an information-theoretic metric that linearly correlates pre-training data characteristics, model size, and QA accuracy, without requiring any additional training. The experimental results demonstrate that SMI outperforms co-occurrence-based baselines, achieving $R^2$ > 0.75 on models with over one billion parameters. Theoretical analysis further reveals the marginal benefits of scaling model size and optimizing data, indicating that the upper limit of specific QA task accuracy is approximately 80%. Our project is available at https://github.com/yuhui1038/SMI.

GPT-4技术报告强调了仅使用预训练信号预测模型在下游任务上的性能的可能性，尽管缺乏详细的方法论。这种预测能力对于资源高效的预训练和任务对齐数据集的建设至关重要。本文旨在预测封闭式问答（QA）中的性能，这是指示模型内部知识的重要下游任务。我们面临三个主要挑战：（1）对预训练语料库的有限访问和了解，（2）当前预训练模型评估方法的局限性，以及（3）基于频率的指标在预测模型性能方面的局限性。为了应对这些挑战，我们对21个公开可用和3个自定义训练的大型语言模型的预训练语料库进行了大规模检索和语义分析。随后，我们开发了一个多模板QA评估框架，该框架结合了改述问题变体。在此基础上，我们提出了Size-dependent Mutual Information（SMI），这是一种信息理论度量标准，它线性地关联预训练数据特性、模型大小与问答准确性，而无需任何额外的训练。实验结果表明，SMI优于基于共现的基线，在具有超过亿级参数的模型上实现R²> 0.75。理论分析进一步揭示了扩大模型规模和优化数据的边际效益，表明特定QA任务准确度的上限大约为80%。我们的项目可在https://github.com/yuhui1038/SMI上找到。

论文及项目相关链接

PDF

Summary

本文旨在预测大型语言模型在封闭问答任务中的性能，解决现有方法面临的挑战。通过对多个预训练大型语言模型的语料库进行大规模检索和语义分析，提出一种新的信息理论度量标准——Size-dependent Mutual Information (SMI)。实验结果表明，SMI能有效预测模型在问答任务中的性能，优于基于共现的基线方法。

Key Takeaways

研究旨在预测大型语言模型在封闭问答任务中的性能，解决预训练语料库的有限访问和理解、当前预训练模型评估方法的局限性以及频率基础指标在预测模型性能方面的局限性等三大挑战。
研究通过对多个预训练大型语言模型的语料库进行大规模检索和语义分析，提出了一种新的信息理论度量标准——Size-dependent Mutual Information (SMI)。
SMI线性相关预训练数据特性、模型大小与问答准确性，且无需任何额外的训练。
实验结果表明，SMI在预测模型在问答任务中的性能时表现优异，优于基于共现的基线方法，具有高达R²> 0.75的预测准确性。
理论分析揭示了扩大模型规模和优化数据的边际效益，并指出特定问答任务准确度的上限约为80%。
研究成果已公开在GitHub上，可供公众查阅和使用。

Cool Papers

点此查看论文截图

CursorCore: Assist Programming through Aligning Anything

Authors:Hao Jiang, Qi Liu, Rui Li, Shengyu Ye, Shijin Wang

Large language models have been successfully applied to programming assistance tasks, such as code completion, code insertion, and instructional code editing. However, these applications remain insufficiently automated and struggle to effectively integrate various types of information during the programming process, including coding history, current code, and user instructions. In this work, we propose a new conversational framework that comprehensively integrates these information sources, collect data to train our models and evaluate their performance. Firstly, to thoroughly evaluate how well models align with different types of information and the quality of their outputs, we introduce a new benchmark, APEval (Assist Programming Eval), to comprehensively assess the performance of models in programming assistance tasks. Then, for data collection, we develop a data generation pipeline, Programming-Instruct, which synthesizes training data from diverse sources, such as GitHub and online judge platforms. This pipeline can automatically generate various types of messages throughout the programming process. Finally, using this pipeline, we generate 219K samples, fine-tune multiple models, and develop the CursorCore series. We show that CursorCore outperforms other models of comparable size. This framework unifies applications such as inline chat and automated editing, contributes to the advancement of coding assistants. Code, models and data are freely available at https://github.com/TechxGenus/CursorCore.

大型语言模型已成功应用于编程辅助任务，如代码补全、代码插入和指令代码编辑。然而，这些应用自动化程度仍然不足，在编程过程中难以有效地整合各种类型的信息，包括编码历史、当前代码和用户指令。在这项工作中，我们提出了一种新的对话框架，该框架全面整合了这些信息源，并收集数据以训练我们的模型并评估其性能。首先，为了彻底评估模型对不同类型信息的契合程度及其输出的质量，我们引入了一个新的基准测试APEval（辅助编程评估），以全面评估模型在编程辅助任务中的性能。其次，对于数据收集，我们开发了一个数据生成管道Programming-Instruct，该管道可以从GitHub和在线裁判平台等不同的来源合成训练数据，能够自动生成编程过程中的各种类型的信息。最后，使用这个管道，我们生成了21.9万个样本，对多个模型进行了微调，并开发了CursorCore系列。我们证明CursorCore在同类模型中表现优越。这一框架统一了如即时聊天和自动编辑等应用，为编码助手的进步做出了贡献。相关代码、模型和数据可在https://github.com/TechxGenus/CursorCore上免费获取。

论文及项目相关链接

PDF

Summary：大型语言模型在编程辅助任务中表现出良好的应用前景，但仍面临自动化程度不足和难以整合多种信息的问题。本研究提出了一种新的对话框架，旨在全面整合编程过程中的各种信息，并引入新的评估标准和数据生成管道来训练和评估模型。最终开发的CursorCore系列模型在编程辅助任务上表现出优异性能，统一了即时聊天和自动编辑等应用。

Key Takeaways：

大型语言模型已成功应用于编程辅助任务，如代码补全、代码插入和指令代码编辑。
现有应用自动化程度不足，难以整合编程过程中的多种信息，如编码历史、当前代码和用户指令。
研究提出了一种新的对话框架，旨在全面整合这些信息源。
引入了新的评估标准APEval，以全面评估模型在编程辅助任务中的性能。
开发了数据生成管道Programming-Instruct，可从GitHub和在线裁判平台等来源合成训练数据。
使用该管道生成了219K个样本，对多个模型进行了微调，并开发了CursorCore系列模型。

Cool Papers

点此查看论文截图

Round and Round We Go! What makes Rotary Positional Encodings useful?

Authors:Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković

Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most popular types of encoding used today in LLMs are Rotary Positional Encodings (RoPE), that rotate the queries and keys based on their relative distance. A common belief is that RoPE is useful because it helps to decay token dependency as relative distance increases. In this work, we argue that this is unlikely to be the core reason. We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust “positional” attention patterns by exploiting the highest frequencies. We also find that, in general, Gemma greatly prefers to use the lowest frequencies of RoPE, which we suspect are used to carry semantic information. We mathematically prove interesting behaviours of RoPE and conduct experiments to verify our findings, proposing a modification of RoPE that fixes some highlighted issues and improves performance. We believe that this work represents an interesting step in better understanding PEs in LLMs, which we believe holds crucial value for scaling LLMs to large sizes and context lengths.

位置编码（PEs）是基于Transformer的大型语言模型（LLM）的重要组成部分，为注意力机制提供重要的序列位置信息。目前LLM中最流行的编码类型之一是旋转位置编码（RoPE），它根据查询和键之间的相对距离进行旋转。人们普遍认为RoPE是有用的，因为它有助于随着相对距离的增加而衰减令牌依赖性。在这项工作中，我们认为这不太可能成为核心原因。我们研究了一个训练过的Gemma 7B模型的内部，以了解RoPE在机械级别是如何被使用的。我们发现Gemma学会了利用最高频率使用RoPE来构建稳健的“位置”注意力模式。我们还发现，总的来说，Gemma更喜欢使用RoPE的最低频率，我们怀疑这些频率用于传递语义信息。我们数学地证明了RoPE的有趣行为，并通过实验验证了我们的发现，提出了对RoPE的修改，解决了某些突出的问题并提高了性能。我们认为这项工作在更好地理解LLM中的PEs方面代表了有趣的一步，我们认为这对于将LLM扩展到大规模和上下文长度具有关键价值。

论文及项目相关链接

PDF

Summary：

该研究探讨了位置编码（PE）在基于Transformer的大型语言模型（LLM）中的关键作用，特别是旋转位置编码（RoPE）。研究发现，RoPE并非主要用于衰减令牌依赖关系，而是用于构建稳健的“位置”注意力模式，并利用最高频率进行利用。此外，模型更倾向于使用RoPE的最低频率部分，可能用于传递语义信息。研究对RoPE的数学行为进行了验证并提出了改进方案，为解决一些关键问题并提高性能奠定了基础。该工作对于理解LLM中的位置编码具有重大意义，对于扩大LLM规模和上下文长度至关重要。

Key Takeaways：

位置编码（PE）是大型语言模型（LLM）中的关键组件，为注意力机制提供重要的序列位置信息。
旋转位置编码（RoPE）被用于构建稳健的“位置”注意力模式。
RoPE并非主要用于衰减令牌依赖关系，而是通过利用最高频率来工作。
模型倾向于使用RoPE的最低频率部分，这部分可能涉及传递语义信息。
研究对RoPE的数学行为进行了验证，提出了一些有趣的发现。
对RoPE进行了改进，解决了某些关键问题，提高了模型性能。

Cool Papers

点此查看论文截图

Exploring the Trade-Offs: Quantization Methods, Task Difficulty, and Model Size in Large Language Models From Edge to Giant

Authors:Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon

Quantization has gained attention as a promising solution for the cost-effective deployment of large and small language models. However, most prior work has been limited to perplexity or basic knowledge tasks and lacks a comprehensive evaluation of recent models like Llama-3.3. In this paper, we conduct a comprehensive evaluation of instruction-tuned models spanning 1B to 405B parameters, applying four quantization methods across 13 datasets. Our findings reveal that (1) quantized models generally surpass smaller FP16 baselines, yet they often struggle with instruction-following and hallucination detection; (2) FP8 consistently emerges as the most robust option across tasks, and AWQ tends to outperform GPTQ in weight-only quantization; (3) smaller models can suffer severe accuracy drops at 4-bit quantization, while 70B-scale models maintain stable performance; (4) notably, \textit{hard} tasks do not always experience the largest accuracy losses, indicating that quantization magnifies a model’s inherent weaknesses rather than simply correlating with task difficulty; and (5) an LLM-based judge (MT-Bench) highlights significant performance declines in Coding and STEM tasks, though it occasionally reports improvements in reasoning.

量化技术作为部署大小语言模型的一种具有成本效益的有前途的解决方案而备受关注。然而，大多数早期的工作仅限于困惑度或基础知识任务，缺乏对诸如Llama-3.3等最新模型的全面评估。在本文中，我们对指令调优模型进行了全面评估，这些模型跨越了1B到405B的参数范围，在13个数据集上应用了四种量化方法。我们的研究发现：（1）量化模型总体上超过了较小的FP16基准模型，但在遵循指令和幻觉检测方面往往存在困难；（2）FP8在各项任务中始终表现出最稳健的选择，AWQ在仅权重量化方面往往优于GPTQ；（3）较小模型在4位量化时可能会遭受严重的精度下降，而70B规模的模型则能保持稳定的性能；（4）值得注意的是，难度大的任务并不总是经历最大的精度损失，这表明量化放大了一个模型的固有弱点，而不是简单地与任务难度相关；（5）基于大型语言模型的判断（MT-Bench）突出了编码和STEM任务的性能显著下降，尽管它有时会报告推理方面的改进。

论文及项目相关链接

PDF Accepted in IJCAI 2025, 21 pages, 2 figure

Summary

本文研究了指令微调模型的量化效果，涉及从1B到405B参数的模型范围。实验采用了四种量化方法，并在多个数据集上进行测试。研究结果表明，量化模型在某些方面超越了较小模型的基线水平，但在指令遵循和幻觉检测方面表现欠佳。其中，FP8在所有任务中最具稳健性，AWQ在重量级量化方面优于GPTQ。较小的模型在四比特量化时会遇到准确度的严重下降，而大型模型则能保持稳定的性能。此外，量化会放大模型的固有弱点，而非单纯与任务难度相关。LLM判定（MT-Bench）显示编码和STEM任务的性能明显下降，但在推理任务中有时会有改进。

Key Takeaways

量化作为部署大小语言模型的经济解决方案已引起关注。
本研究全面评估了指令微调模型的量化效果，涉及多种参数规模的模型。
量化模型在某些方面超越较小模型的基线水平，但在指令遵循和幻觉检测方面存在挑战。
FP8在所有任务中最具稳健性，AWQ在重量级量化上表现较好。
较小模型在四比特量化时准确度下降严重，而大型模型性能稳定。
量化会放大模型的固有弱点，并非仅与任务难度相关。

Cool Papers

点此查看论文截图

Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer

Authors:Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Hari Subramoni, Dhabaleswar K. Panda

Large Language Models (LLMs) with long context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long context capabilities via downstream finetuning or adaptations impose significant design limitations. In this paper, we propose Fully Pipelined Distributed Transformer (FPDT) for efficiently training long-context LLMs with extreme hardware efficiency. For GPT and Llama models, we achieve a 16x increase in sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence chunk pipeline design, we can now train 8B LLM with 2 million sequence length on only 4 GPUs, while also maintaining over 55% of MFU. Our proposed FPDT is agnostic to existing training techniques and is proven to work efficiently across different LLM models.

大型语言模型（LLM）具有长文本处理能力，对于自然语言处理和计算生物学中的复杂任务（如文本生成和蛋白质序列分析）至关重要。然而，直接在极长的文本上训练LLM需要大量的GPU资源和增加的内存，导致成本和复杂性增加。通过下游微调或适应引入长文本处理能力的替代方法带来了显著的设计限制。在本文中，我们提出了全管道分布式转换器（FPDT），以极高的硬件效率有效地训练具有长文本处理能力的大型语言模型。对于GPT和骆马模型，我们在相同的硬件上实现了序列长度的16倍增加，这超过了当前最先进的解决方案。通过我们专门的序列块管道设计，我们现在可以在仅使用4个GPU的情况下训练具有2百万序列长度的8B大型语言模型，同时保持超过55%的最大浮点单元使用率。我们提出的FPDT与现有的训练技术无关，并已证明在不同的大型语言模型中都能有效工作。

论文及项目相关链接

PDF The Eighth Annual Conference on Machine Learning and Systems (MLSys’25)

摘要

本文提出了Fully Pipelined Distributed Transformer（FPDT）方法，旨在高效训练具有长上下文能力的大型语言模型（LLMs），同时保持极高的硬件效率。与现有解决方案相比，FPDT可实现序列长度达当前先进水平的16倍增长，同时在相同硬件上进行训练。通过专用的序列块流水线设计，现在只需使用4个GPU就可以训练拥有超过2百万序列长度的模型。研究结果表明，FPDT适用于不同的大型语言模型。通过结合先前的技术和高效架构创新方法实现强大且性能出众的语言模型提供了可行方案。我们只需几个服务器节点即可实现大规模模型的训练，这极大地降低了训练成本并提高了效率。我们相信FPDT将极大地推动大型语言模型的发展，并促进其在自然语言处理和计算生物学等领域的应用。

关键见解

大型语言模型（LLMs）对于处理复杂的自然语言处理和计算生物学任务至关重要，如文本生成和蛋白质序列分析。然而，训练具有长上下文能力的LLMs需要大量的GPU资源和内存，导致成本增加和复杂性提高。
当前研究中提出了一种名为Fully Pipelined Distributed Transformer（FPDT）的新方法来解决这个问题，通过专门的序列块流水线设计提高了训练大型语言模型的效率。这一方法在保持高硬件效率的同时实现了对超长序列的训练能力增长。此外，该方法对现有的训练技术具有中立性，能够在不同的LLM模型中有效运行。

Cool Papers

点此查看论文截图

TrackFormers: In Search of Transformer-Based Particle Tracking for the High-Luminosity LHC Era

Authors:Sascha Caron, Nadezhda Dobreva, Antonio Ferrer Sánchez, José D. Martín-Guerrero, Uraz Odyurt, Roberto Ruiz de Austri Bazan, Zef Wolffs, Yue Zhao

High-Energy Physics experiments are facing a multi-fold data increase with every new iteration. This is certainly the case for the upcoming High-Luminosity LHC upgrade. Such increased data processing requirements forces revisions to almost every step of the data processing pipeline. One such step in need of an overhaul is the task of particle track reconstruction, a.k.a., tracking. A Machine Learning-assisted solution is expected to provide significant improvements, since the most time-consuming step in tracking is the assignment of hits to particles or track candidates. This is the topic of this paper. We take inspiration from large language models. As such, we consider two approaches: the prediction of the next word in a sentence (next hit point in a track), as well as the one-shot prediction of all hits within an event. In an extensive design effort, we have experimented with three models based on the Transformer architecture and one model based on the U-Net architecture, performing track association predictions for collision event hit points. In our evaluation, we consider a spectrum of simple to complex representations of the problem, eliminating designs with lower metrics early on. We report extensive results, covering both prediction accuracy (score) and computational performance. We have made use of the REDVID simulation framework, as well as reductions applied to the TrackML data set, to compose five data sets from simple to complex, for our experiments. The results highlight distinct advantages among different designs in terms of prediction accuracy and computational performance, demonstrating the efficiency of our methodology. Most importantly, the results show the viability of a one-shot encoder-classifier based Transformer solution as a practical approach for the task of tracking.

高能物理实验面临着每一次迭代都呈现多重数据增长的情况。即将到来的高亮度LHC升级也是如此。这种增加的数据处理要求几乎迫使数据处理流程的每一步都进行修改。需要全面检修的步骤之一是粒子轨迹重建的任务，也称为跟踪。由于跟踪中最耗时的步骤是将命中点分配给粒子或轨迹候选者，因此预计机器学习辅助解决方案将带来重大改进。这是本文的主题。我们从大型语言模型中汲取灵感。因此，我们考虑两种方法：预测句子中的下一个词（轨迹中的下一个命中点），以及事件内所有命中点的一次性预测。在广泛的设计工作中，我们尝试了三基于Transformer架构的模型和一种基于U-Net架构的模型，对碰撞事件命中点进行轨迹关联预测。在我们的评估中，我们考虑了问题的简单到复杂表示形式，并尽早淘汰了指标较低的方案。我们报告了广泛的结果，涵盖了预测精度（得分）和计算性能。我们利用REDVID模拟框架以及对TrackML数据集的简化，为我们的实验从简单到复杂地创建了五个数据集。结果突出了不同设计在预测精度和计算性能方面的优势，证明了我们的方法的有效性。最重要的是，结果表明基于一次性编码器分类器解决方案的Transformer是一种用于跟踪任务的实用方法。

论文及项目相关链接

PDF

Summary
高能量物理实验面临数据量大增的问题，特别是即将升级的高亮度LHC。数据处理的每个环节都需要重新审视和优化。论文中探讨了利用机器学习辅助的解决方案来处理粒子轨迹重建问题，提出了一种基于Transformer架构和U-Net架构的模型，实现了碰撞事件中的轨迹关联预测。实验结果表明，不同设计在预测精度和计算性能上各有优势，其中基于编码器分类器的一站式Transformer解决方案表现出较高的效率和可行性。

Key Takeaways