Published: 2025-09-13

Updated: 2025-10-07

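⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are suitable only for initial screening before reading a paper.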

2025-09-13 Update

Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis

Authors: Jing Hao, Yuxuan Fan, Yanpeng Sun, Kaixin Guo, Lizhuo Lin, Jinrong Yang, Qi Yong H. Ai, Lun M. Wong, Hao Tang, Kuo Feng Hung

Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.

Paper and Project Links

PDF 40 pages, 26 figures, 9 tables

Summary

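Large vision-language models (LVLMs) perform strongly on general-purpose medical tasks, but their use in specialized domains such as dentistry remains understudied. Panoramic X-rays, a widely used imaging modality in oral radiology, are challenging to interpret because of their complex anatomical structures and subtle pathological cues. To address this, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark for panoramic X-ray interpretation. MMOral contains 20,563 annotated images and 1.3 million instruction-following instances spanning task types such as attribute extraction, report generation, visual question answering, and image-grounded dialogue. We also introduce MMOral-Bench, an evaluation suite covering five key diagnostic dimensions in dentistry. The evaluation shows that existing models have significant limitations in this domain: even the best-performing model, GPT-4o, reaches only 41.45% accuracy. To advance the field, we propose OralGPT, which applies supervised fine-tuning (SFT) to Qwen2.5-VL-7B using our MMOral instruction dataset. Remarkably, a single epoch of SFT substantially improves LVLM performance, with OralGPT showing a 24.73% improvement. MMOral and OralGPT lay a solid foundation for intelligent dentistry and can enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at the linked repository.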

Key Insights