
I2I Translation


⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ HuggingFace 免费体验

2025-11-05 更新

Enabling Fast and Accurate Neutral Atom Readout through Image Denoising

Authors:Chaithanya Naik Mude, Linipun Phuttitarn, Satvik Maurya, Kunal Sinha, Mark Saffman, Swamit Tannu

Neutral atom quantum computers hold promise for scaling up to hundreds of thousands of qubits, but their progress is constrained by slow qubit readout. Measuring qubits currently takes milliseconds-much longer than the underlying quantum gate operations-making readout the primary bottleneck in deploying quantum error correction. Because each round of QEC depends on measurement, long readout times increase cycle duration and slow down program execution. Reducing the readout duration speeds up cycles and reduces decoherence errors that accumulate while qubits idle, but it also lowers the number of collected photons, making measurements noisier and more error-prone. This tradeoff leaves neutral atom systems stuck between slow but accurate readout and fast but unreliable readout. We show that image denoising can resolve this tension. Our framework, GANDALF, uses explicit denoising using image translation to reconstruct clear signals from short, low-photon measurements, enabling reliable classification at up to 1.6x shorter readout times. Combined with lightweight classifiers and a pipelined readout design, our approach both reduces logical error rate by up to 35x and overall QEC cycle time up to 1.77x compared to state-of-the-art CNN-based readout for Cesium (Cs) Neutral Atom arrays.

中性原子量子计算机有望扩展到数十万个量子比特(qubit),但其进展受限于缓慢的量子比特读出。目前测量量子比特需要毫秒量级的时间,远长于底层量子门操作,因此读出成为部署量子纠错(QEC)的主要瓶颈。由于每一轮QEC都依赖测量,读出时间过长会增加周期时长并拖慢程序执行。缩短读出时间可以加快周期并减少量子比特空闲时累积的退相干误差,但也会降低收集到的光子数量,使测量更嘈杂、更容易出错。这种权衡使中性原子系统在“慢而准确”与“快而不可靠”的读出之间左右为难。我们证明图像去噪可以化解这一矛盾。我们的框架GANDALF利用基于图像转换的显式去噪,从短时、低光子的测量中重建清晰信号,在读出时间缩短至多1.6倍的情况下仍能实现可靠分类。结合轻量级分类器和流水线化的读出设计,与最先进的基于卷积神经网络(CNN)的铯(Cs)中性原子阵列读出相比,我们的方法将逻辑错误率降低多达35倍,并将整体QEC周期时间缩短多达1.77倍。

论文及项目相关链接

PDF 12 pages, 15 figures

Summary

中性原子量子计算机有潜力扩展到数十万量子比特,但其进展受到量子比特读出速度较慢的限制。目前量子比特的测量需要毫秒时间,远长于底层量子门操作的时间,成为部署量子纠错的主要瓶颈。减少读出时间可以加快循环速度并减少空闲时量子比特的退相干错误,但同时也减少了收集到的光子数量,使测量结果更加嘈杂且容易出错。本文展示了图像去噪技术可以解决这种矛盾。使用图像转换进行显式去噪的GANDALF框架可以从短暂的低光子测量中重建清晰的信号,在缩短至多1.6倍的读出时间内实现可靠分类。结合轻量级分类器和流水线读出设计,该方法将逻辑错误率降低了高达35倍,并且与最先进的基于卷积神经网络(CNN)的铯(Cs)中性原子阵列读出相比,总体QEC循环时间缩短了高达1.77倍。

Key Takeaways

  1. 中性原子量子计算机在扩展量子位方面有很大潜力,但受到慢量子位读出的限制。
  2. 量子位测量时间较长是部署量子错误校正的主要瓶颈。
  3. 读出时间的缩短可以减少循环时间和减少空闲时的失相干错误。
  4. 然而,缩短读出时间会导致收集到的光子数量减少,增加测量结果的噪声和错误率。
  5. GANDALF框架使用图像去噪技术解决这一矛盾,能够从短暂的低光子测量中重建清晰的信号。
  6. GANDALF框架可以在较短的读出时间内实现可靠的分类。
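
下面给出一个示意性的Python草图(并非论文GANDALF的官方实现,网络结构、尺寸与训练细节均为假设),用来说明“先用图像转换式去噪重建短曝光低光子图像、再用轻量分类器判读原子态”的基本流程。

```python
# 示意性代码:去噪 + 轻量分类的读出流程草图(结构与超参数均为假设)。
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):              # 假设的简化去噪生成器
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class TinyClassifier(nn.Module):            # 假设的轻量二分类器(亮态/暗态)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, 2))
    def forward(self, x):
        return self.net(x)

denoiser, classifier = TinyDenoiser(), TinyClassifier()
opt = torch.optim.Adam(list(denoiser.parameters()) + list(classifier.parameters()), lr=1e-3)

# 假设的训练数据:短曝光低光子图像、对应的长曝光"干净"图像、原子态标签
short_exp = torch.randn(8, 1, 16, 16)
long_exp = torch.randn(8, 1, 16, 16)
labels = torch.randint(0, 2, (8,))

denoised = denoiser(short_exp)              # 由短曝光图重建清晰信号
loss = nn.functional.l1_loss(denoised, long_exp) \
     + nn.functional.cross_entropy(classifier(denoised), labels)
loss.backward()
opt.step()
```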

Cool Papers

点此查看论文截图

Towards Real-Time Inference of Thin Liquid Film Thickness Profiles from Interference Patterns Using Vision Transformers

Authors:Gautam A. Viruthagiri, Arnuv Tandon, Gerald G. Fuller, Vinny Chandran Suja

Thin film interferometry is a powerful technique for non-invasively measuring liquid film thickness with applications in ophthalmology, but its clinical translation is hindered by the challenges in reconstructing thickness profiles from interference patterns - an ill-posed inverse problem complicated by phase periodicity, imaging noise and ambient artifacts. Traditional reconstruction methods are either computationally intensive, sensitive to noise, or require manual expert analysis, which is impractical for real-time diagnostics. To address this challenge, here we present a vision transformer-based approach for real-time inference of thin liquid film thickness profiles directly from isolated interferograms. Trained on a hybrid dataset combining physiologically-relevant synthetic and experimental tear film data, our model leverages long-range spatial correlations to resolve phase ambiguities and reconstruct temporally coherent thickness profiles in a single forward pass from dynamic interferograms acquired in vivo and ex vivo. The network demonstrates state-of-the-art performance on noisy, rapidly-evolving films with motion artifacts, overcoming limitations of conventional phase-unwrapping and iterative fitting methods. Our data-driven approach enables automated, consistent thickness reconstruction at real-time speeds on consumer hardware, opening new possibilities for continuous monitoring of pre-lens ocular tear films and non-invasive diagnosis of conditions such as the dry eye disease.

薄膜干涉测量术是一种强大的无创测量液膜厚度的技术,在眼科有广泛的应用,但其临床转化受到从干涉图样重建厚度分布这一挑战的阻碍——这是一个不适定的逆问题,并因相位周期性、成像噪声和环境伪影而更加复杂。传统的重建方法要么计算量大、对噪声敏感,要么需要专家手动分析,对于实时诊断来说并不现实。为了应对这一挑战,我们在此提出了一种基于视觉Transformer的方法,可直接从单幅干涉图实时推断薄液膜的厚度分布。我们的模型在一个结合了具有生理相关性的合成数据与实验泪膜数据的混合数据集上训练,利用长程空间相关性解决相位歧义,并在单次前向传递中从体内和离体采集的动态干涉图重建时间上连贯的厚度分布。该网络在带有运动伪影的嘈杂、快速变化的薄膜上达到了最先进的性能,克服了传统相位解包裹和迭代拟合方法的局限性。我们的数据驱动方法能够在消费级硬件上以实时速度实现自动化、一致的厚度重建,为连续监测镜片前泪膜和非侵入性诊断干眼症等疾病提供了新的可能性。

论文及项目相关链接

PDF 6 pages, 2 figures, will be updated

Summary

薄膜干涉测量技术是一种非侵入性测量液体薄膜厚度的方法,在眼科有广泛应用。然而,从干涉图样重建厚度分布的挑战阻碍了其临床转化。为解决这一问题,本文提出了一种基于视觉Transformer的方法,可实时推断液体薄膜的厚度分布。该模型可直接从单幅干涉图进行推断,利用长程空间相关性解决相位歧义问题,并在动态干涉图上以实时速度实现时间上一致的厚度重建。此方法克服了传统相位解包裹和迭代拟合方法的局限性,为连续监测镜片前泪膜和非侵入性诊断干眼症等病症提供了新的可能性。

Key Takeaways

  1. 薄膜干涉测量技术可非侵入性地测量液体膜厚度,在眼科有广泛应用。
  2. 从干涉图案重建厚度轮廓是一个复杂的问题,受到相位周期性、成像噪声和环境干扰的影响。
  3. 传统重建方法计算量大、噪声敏感或依赖手动分析,不适用于实时诊断。
  4. 研究提出了一种基于视觉Transformer的实时推断方法,可直接从干涉图推断液体薄膜的厚度分布。
  5. 模型利用长程空间相关性解决相位歧义问题,以实时速度实现一致的厚度重建。
  6. 该方法克服了传统相位解包裹和迭代拟合方法的局限性,具有更高的鲁棒性和准确性。
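
下面是一个示意性的Python草图(所有结构与尺寸均为假设,与论文模型无关),用来说明“把干涉图切块送入Transformer编码器、再逐patch回归厚度值”这一单次前向传播的思路。

```python
# 示意性代码:用视觉Transformer从干涉图回归厚度分布的最小化草图。
import torch
import torch.nn as nn

class ThicknessViT(nn.Module):
    def __init__(self, img=64, patch=8, dim=64):
        super().__init__()
        self.patch = patch
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(1, dim, patch, stride=patch)   # 切块并线性嵌入
        self.pos = nn.Parameter(torch.zeros(1, n, dim))       # 可学习位置编码
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, patch * patch)              # 每个patch回归厚度值

    def forward(self, x):
        b = x.shape[0]
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos
        tokens = self.encoder(tokens)                          # 建模长程空间相关性
        patches = self.head(tokens)                            # (b, n, patch*patch)
        h = w = int(patches.shape[1] ** 0.5)
        return patches.view(b, h, w, self.patch, self.patch) \
                      .permute(0, 1, 3, 2, 4).reshape(b, 1, h * self.patch, w * self.patch)

interferogram = torch.randn(2, 1, 64, 64)
thickness_map = ThicknessViT()(interferogram)   # 单次前向传播得到厚度分布
print(thickness_map.shape)                       # torch.Size([2, 1, 64, 64])
```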

Cool Papers

点此查看论文截图

PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models

Authors:Patrick Haller, Fabio Barth, Jonas Golde, Georg Rehm, Alan Akbik

Vision-language models (VLMs) have demonstrated remarkable progress in multimodal reasoning. However, existing benchmarks remain limited in terms of high-quality, human-verified examples. Many current datasets rely on synthetically generated content by large language models (LLMs). Furthermore, most datasets are limited to English, as manual quality assurance of translated samples is time-consuming and costly. To fill this gap, we introduce PISA-Bench, a multilingual benchmark derived from English examples of the expert-created PISA tests, a unified framework for the assessment of student competencies in over eighty countries. Each example consists of human-extracted instructions, questions, answer options, and images, enriched with question type categories, and has been translated from English into five additional languages (Spanish, German, Chinese, French, and Italian), resulting in a fully parallel corpus covering six languages. We evaluate state-of-the-art vision-language models on PISA-Bench and find that especially small models (<20B parameters) fail to achieve high test scores. We further find substantial performance degradation on non-English splits as well as high error-rates when models are tasked with spatial and geometric reasoning. By releasing the dataset and evaluation framework, we provide a resource for advancing research on multilingual multimodal reasoning.

视觉语言模型(VLMs)在多模态推理方面取得了显著的进步。然而,现有基准测试在高质量、经人工核实的样例方面仍然有限。目前许多数据集依赖于大型语言模型(LLMs)合成生成的内容。此外,大多数数据集仅限于英语,因为对翻译样本进行人工质量保障既耗时又成本高昂。为了填补这一空白,我们推出了PISA-Bench,这是一个源自专家编制的PISA测试英语样例的多语言基准。PISA测试是一个统一框架,用于评估八十多个国家的学生能力。每个样例由人工提取的指令、问题、答案选项和图像组成,并附有题型类别标注,且已从英语翻译成另外五种语言(西班牙语、德语、中文、法语和意大利语),形成了一个涵盖六种语言的完全平行语料库。我们在PISA-Bench上评估了最先进的视觉语言模型,发现尤其是小型模型(参数量小于20B)难以取得较高的测试成绩。我们还发现各模型在非英语子集上的性能大幅下降,以及在空间与几何推理任务中的高错误率。通过发布数据集和评估框架,我们为推进多语言多模态推理研究提供了一项资源。

论文及项目相关链接

PDF 8 pages, 11 tables and figures

Summary

鉴于现有视觉语言模型(VLMs)在多模态推理上的突出进展,PISA-Bench的出现弥补了高质量多语言数据集的空缺。PISA-Bench源自专家编制的PISA测试的英语样例,并扩展为多语言数据集。它包括人工提取的指令、问题、答案选项和图像等,覆盖六种语言(英语、西班牙语、德语、中文、法语和意大利语)。研究发现,小型模型(参数量小于20B)难以取得高分,各模型在非英语子集上性能显著下降,且在空间与几何推理方面存在较高错误率。PISA-Bench的发布为推进多语言多模态推理研究提供了资源。

Key Takeaways

  1. PISA-Bench是基于专家创建的PISA测试英语示例的多语言基准测试。
  2. 数据集包含人工提取的指令、问题、答案选项和图像等丰富内容。
  3. 样例由英语翻译成另外五种语言,形成覆盖六种语言的完全平行语料库。
  4. 现有视觉语言模型在PISA-Bench上的表现参差不齐,小型模型性能较差。
  5. 非英语数据集上,视觉语言模型的性能显著下降。
  6. 模型在空间几何推理方面的错误率较高。
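
下面是一个示意性的Python片段(样本字段均为假设),用来说明在完全平行的多语言基准上按语言和题型分组统计准确率的评测思路。

```python
# 示意性代码:按(语言, 题型)分组统计准确率(字段名为假设,仅示意评测流程)。
from collections import defaultdict

examples = [
    {"lang": "en", "qtype": "spatial", "answer": "B", "prediction": "B"},
    {"lang": "de", "qtype": "spatial", "answer": "B", "prediction": "A"},
    {"lang": "zh", "qtype": "text",    "answer": "C", "prediction": "C"},
]

hits, totals = defaultdict(int), defaultdict(int)
for ex in examples:
    key = (ex["lang"], ex["qtype"])
    totals[key] += 1
    hits[key] += int(ex["prediction"] == ex["answer"])

for key in sorted(totals):
    print(key, f"accuracy = {hits[key] / totals[key]:.2f}")
```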

Cool Papers

点此查看论文截图

Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs

Authors:Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei

While Multimodal Large Language Models (MLLMs) excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, we introduce Latent Sketchpad, a framework that equips MLLMs with an internal visual scratchpad. The internal visual representations of MLLMs have traditionally been confined to perceptual understanding. We repurpose them to support generative visual thought without compromising reasoning ability. Building on frontier MLLMs, our approach integrates visual generation directly into their native autoregressive reasoning process. It allows the model to interleave textual reasoning with the generation of visual latents. These latents guide the internal thought process and can be translated into sketch images for interpretability. To realize this, we introduce two components: a Context-Aware Vision Head autoregressively produces visual representations, and a pretrained Sketch Decoder renders these into human-interpretable images. We evaluate the framework on our new dataset MazePlanning. Experiments across various MLLMs show that Latent Sketchpad delivers comparable or even superior reasoning performance to their backbone. It further generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending model’s textual reasoning to visual thinking, our framework opens new opportunities for richer human-computer interaction and broader applications. More details and resources are available on our project page: https://latent-sketchpad.github.io/.

多模态大型语言模型(MLLMs)在视觉理解方面表现出色,但在需要视觉规划和想象力的复杂场景中往往力不从心。受人类以草图作为视觉思维来发展和交流想法的启发,我们引入了“潜在草图板”(Latent Sketchpad)框架,为MLLMs配备内部视觉草稿板。MLLMs的内部视觉表示传统上仅限于感知理解,我们将其重新用于支持生成式的视觉思维,同时不损害推理能力。我们的方法建立在前沿MLLMs之上,将视觉生成直接集成到其原生的自回归推理过程中,使模型能够在文本推理与视觉潜变量生成之间交替进行。这些视觉潜变量引导内部思维过程,并可渲染为草图图像以便解释。为了实现这一点,我们引入了两个组件:上下文感知视觉头(Context-Aware Vision Head)自回归地生成视觉表示,预训练的草图解码器(Sketch Decoder)将这些表示渲染为人类可理解的图像。我们在新的数据集MazePlanning上评估了该框架。对多种MLLMs的实验表明,潜在草图板的推理性能与其骨干模型相当甚至更优,并可推广到不同的前沿MLLMs,包括Gemma3和Qwen2.5-VL。通过将模型的文本推理扩展到视觉思维,我们的框架为更丰富的人机交互和更广泛的应用打开了新的机会。更多细节和资源可在我们的项目页面找到:https://latent-sketchpad.github.io/。

论文及项目相关链接

PDF

Summary

MLLM在视觉理解方面表现出色,但在需要视觉规划和想象力的复杂场景中常常遇到困难。借鉴人类利用绘制草图作为视觉思维来表达和沟通思想的方式,研究团队推出了Latent Sketchpad框架,为MLLM提供了一个内部视觉草稿板。该框架使MLLM能够在不损害推理能力的情况下支持生成式视觉思维。该方法建立在前沿MLLM之上,将视觉生成直接融入其原生的自回归推理过程中,允许模型在文本推理与视觉潜变量生成之间交替进行;这些视觉潜变量可以指导内部思维过程,并可转化为草图图像以便于解释。Latent Sketchpad框架通过两个组件实现这一目标:自回归生成视觉表示的Context-Aware Vision Head,以及将视觉表示转化为人类可理解图像的预训练Sketch Decoder。实验表明,Latent Sketchpad框架在多个MLLM上的表现与其骨干模型相当甚至更优,并可以推广到不同的前沿MLLM模型,如Gemma3和Qwen2.5-VL等。通过将模型的文本推理扩展至视觉思维,该框架为更丰富的人机交互和更广泛的应用场景打开了新的机会。更多细节和资源可访问研究团队的项目页面。

Key Takeaways

  1. MLLM在视觉理解上表现出色,但在复杂场景下(如需要视觉规划和想象力的场景)存在挑战。
  2. 人类利用绘画作为视觉思维的方式启发研究团队开发Latent Sketchpad框架。
  3. Latent Sketchpad为MLLM提供了一个内部视觉草稿板,支持生成式视觉思维而不损害推理能力。
  4. 该框架建立在前沿MLLM之上,将视觉生成融入其原生的自回归推理过程中。
  5. Latent Sketchpad包括两个关键组件:Context-Aware Vision Head和预训练的Sketch Decoder。
  6. 实验表明Latent Sketchpad在多个MLLM上的表现优异,并可以广泛应用于不同的模型。
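
下面是一个示意性的Python草图(各模块均为占位实现,与论文组件无关),仅用来展示“在自回归过程中让文本推理与视觉潜变量生成交替进行,并把潜变量渲染为草图”的控制流。

```python
# 示意性代码:文本推理与视觉潜变量交替生成的控制流草图(模块均为假设的占位实现)。
import torch
import torch.nn as nn

backbone = nn.GRU(input_size=32, hidden_size=32, batch_first=True)  # 代替MLLM骨干
vision_head = nn.Linear(32, 32)      # 假设的 Context-Aware Vision Head:产生视觉latent
sketch_decoder = nn.Linear(32, 64)   # 假设的 Sketch Decoder:latent -> 8x8 草图

state = torch.zeros(1, 1, 32)
token = torch.randn(1, 1, 32)
trace = []
for step in range(6):
    out, state = backbone(token, state)
    if step % 2 == 0:                       # 偶数步:继续文本推理(此处仅示意)
        token = out
        trace.append(("text", out))
    else:                                   # 奇数步:生成视觉latent并渲染为草图
        latent = vision_head(out)
        sketch = sketch_decoder(latent).view(1, 8, 8)
        token = latent                      # latent 反馈回推理过程
        trace.append(("sketch", sketch))
print([kind for kind, _ in trace])
```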

Cool Papers

点此查看论文截图

Robust and Generalizable Background Subtraction on Images of Calorimeter Jets using Unsupervised Generative Learning

Authors:Yeonju Go, Dmitrii Torbunov, Yi Huang, Shuhang Li, Timothy Rinn, Haiwang Yu, Brett Viren, Meifeng Lin, Yihui Ren, Dennis Perepelitsa, Jin Huang

Accurate separation of signal from background is one of the main challenges for precision measurements across high-energy and nuclear physics. Conventional supervised learning methods are insufficient here because the required paired signal and background examples are impossible to acquire in real experiments. Here, we introduce an unsupervised unpaired image-to-image translation neural network that learns to separate the signal and background from the input experimental data using cycle-consistency principles. We demonstrate the efficacy of this approach using images composed of simulated calorimeter data from the sPHENIX experiment, where physics signals (jets) are immersed in the extremely dense and fluctuating heavy-ion collision environment. Our method outperforms conventional subtraction algorithms in fidelity and overcomes the limitations of supervised methods. Furthermore, we evaluated the model’s robustness in an out-of-distribution test scenario designed to emulate modified jets as in real experimental data. The model, trained on a simpler dataset, maintained its high fidelity on a more realistic, highly modified jet signal. This work represents the first use of unsupervised unpaired generative models for full detector jet background subtraction and offers a path for novel applications in real experimental data, enabling high-precision analyses across a wide range of imaging-based experiments.

从背景中准确分离信号是高能物理和核物理精确测量面临的主要挑战之一。传统的监督学习方法在这里并不适用,因为在真实实验中无法获取所需的成对信号和背景样本。在这里,我们引入了一种无监督的非配对图像到图像转换神经网络,该网络利用循环一致性原理,从输入的实验数据中学习分离信号和背景。我们使用由sPHENIX实验模拟的量能器数据构成的图像来证明这种方法的有效性,其中物理信号(喷注)浸没在极其致密且涨落剧烈的重离子碰撞环境中。我们的方法在保真度上优于传统的背景扣除算法,并克服了监督方法的局限性。此外,我们在一个分布外测试场景中评估了模型的稳健性,该场景旨在模拟真实实验数据中被介质修饰的喷注。在较简单数据集上训练的模型,在更加真实、被高度修饰的喷注信号上仍保持了高保真度。这项工作是无监督非配对生成模型在全探测器喷注背景扣除中的首次应用,为其在真实实验数据中的新型应用开辟了道路,可在广泛的基于成像的实验中实现高精度分析。

论文及项目相关链接

PDF

Summary

在高能物理和核物理的精密测量中,准确地将信号与背景分离是一个主要挑战。由于在实际实验中无法获取所需的成对信号和背景样本,传统的监督学习方法在这里并不适用。本研究引入了一种无监督、无需配对的图像到图像转换神经网络,该网络利用循环一致性原理从实验数据中学习信号和背景的分离。通过对模拟的sPHENIX实验量能器数据图像进行验证,本方法在保真度上优于传统扣除算法,并克服了监督方法的局限性。此外,在模拟真实实验数据的分布外测试场景中评估了模型的稳健性:即使在被高度修饰的喷注信号上,该模型依然保持了较高的保真度。本研究首次将无监督的非配对生成模型应用于全探测器喷注背景扣除,为真实实验数据中的新颖应用铺平了道路,使广泛的成像实验实现高精度分析成为可能。

Key Takeaways

  1. 论文面临的主要挑战是在高能和核物理学精密测量中准确区分信号和背景。
  2. 传统监督学习方法在此场景下不适用,因为无法获取实际实验中的配对信号和背景样本。
  3. 引入了一种无监督的、无需配对的图像到图像转换神经网络,该网络能利用循环一致性原理从实验数据中学习信号和背景的分离。
  4. 在模拟的sPHENIX实验数据中验证了该方法的效用,并显示了它在保真度上优于传统扣除算法。
  5. 模型在模拟真实实验数据的分布外测试场景中表现出稳健性,即使在被高度修饰的喷注信号上也能保持高保真度。
  6. 这是首次将无监督的非配对生成模型应用于全探测器喷注背景扣除。
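
下面是一个示意性的Python草图(网络为占位的单层卷积,与论文模型无关),用来说明非配对图像转换中的循环一致性约束:混合图与纯信号图之间的双向映射经过一个来回后应能还原输入。

```python
# 示意性代码:循环一致性损失的最小化草图("喷注+背景"图 <-> 纯信号图)。
import torch
import torch.nn as nn

to_signal = nn.Conv2d(1, 1, 3, padding=1)   # 假设:混合图(喷注+背景) -> 纯信号图
to_mixed = nn.Conv2d(1, 1, 3, padding=1)    # 假设:纯信号图 -> 混合图
mixed = torch.randn(4, 1, 32, 32)           # 未配对的量能器图像
signal_only = torch.randn(4, 1, 32, 32)

cycle_loss = nn.functional.l1_loss(to_mixed(to_signal(mixed)), mixed) \
           + nn.functional.l1_loss(to_signal(to_mixed(signal_only)), signal_only)
# 实际训练中还需对 to_signal(mixed) 与 to_mixed(signal_only) 各加一个对抗损失(此处省略)
cycle_loss.backward()
```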

Cool Papers

点此查看论文截图

Gut decisions based on the liver: A radiomics approach to boost colorectal cancer screening

Authors:Anna Hinterberger, Jonas Bohn, Dasha Trofimova, Nicolas Knabe, Julia Dettling, Tobias Norajitra, Fabian Isensee, Johannes Betge, Stefan O. Schönberg, Dominik Nörenberg, Sergio Grosu, Sonja Loges, Ralf Floca, Jakob Nikolas Kather, Klaus Maier-Hein, Freba Grawe

Non-invasive colorectal cancer (CRC) screening represents a key opportunity to improve colonoscopy participation rates and reduce CRC mortality. This study explores the potential of the gut-liver axis for predicting colorectal neoplasia through liver-derived radiomic features extracted from routine CT images as a novel opportunistic screening approach. In this retrospective study, we analyzed data from 1,997 patients who underwent colonoscopy and abdominal CT. Patients either had no colorectal neoplasia (n=1,189) or colorectal neoplasia (n_total=808; adenomas n=423, CRC n=385). Radiomics features were extracted from 3D liver segmentations using the Radiomics Processing ToolKit (RPTK), which performed feature extraction, filtering, and classification. The dataset was split into training (n=1,397) and test (n=600) cohorts. Five machine learning models were trained with 5-fold cross-validation on the 20 most informative features, and the best model ensemble was selected based on the validation AUROC. The best radiomics-based XGBoost model achieved a test AUROC of 0.810, clearly outperforming the best clinical-only model (test AUROC: 0.457). Subclassification between colorectal cancer and adenoma showed lower accuracy (test AUROC: 0.674). Our findings establish proof-of-concept that liver-derived radiomics from routine abdominal CT can predict colorectal neoplasia. Beyond offering a pragmatic, widely accessible adjunct to CRC screening, this approach highlights the gut-liver axis as a novel biomarker source for opportunistic screening and sparks new mechanistic hypotheses for future translational research.

非侵入性结直肠癌(CRC)筛查是提高结肠镜检查参与率并降低CRC死亡率的关键机会。本研究通过从常规CT图像中提取肝脏来源的影像组学特征,探索利用肠-肝轴预测结直肠新生物的可能性,作为一种新型的机会性筛查方法。在这项回顾性研究中,我们分析了1,997名接受结肠镜检查和腹部CT检查的患者数据。患者分为无结直肠新生物(n=1,189)和有结直肠新生物(总计n=808;腺瘤n=423,CRC n=385)两组。使用影像组学处理工具包(Radiomics Processing ToolKit, RPTK)从3D肝脏分割中提取影像组学特征,该工具包负责特征提取、筛选和分类。数据集分为训练集(n=1,397)和测试集(n=600)。在信息量最高的20个特征上,以五折交叉验证训练了五个机器学习模型,并根据验证集AUROC选择最佳模型集成。表现最佳的基于影像组学特征的XGBoost模型取得了0.810的测试AUROC,明显优于仅使用临床特征的最佳模型(测试AUROC:0.457)。结直肠癌与腺瘤之间的细分类准确度较低(测试AUROC:0.674)。我们的研究结果证明了利用常规腹部CT的肝脏影像组学特征预测结直肠新生物的概念可行性。除了为CRC筛查提供一个实用且广泛可及的辅助手段外,这种方法还突出了肠-肝轴作为机会性筛查新生物标志物来源的价值,并为未来的转化研究提出了新的机制假设。

论文及项目相关链接

PDF Equal contribution between first, second, fifteenth, and sixteenth authors

摘要
利用常规腹部CT图像中的肝脏来源影像组学特征,以非侵入性方法预测结直肠新生物。研究采用回顾性分析方法,对接受结肠镜检查和腹部CT扫描的1,997名患者数据进行分析,并根据是否存在结直肠新生物进行分组(无新生物患者1,189名;有新生物患者共808名,其中腺瘤423名、CRC 385名)。使用影像组学处理工具包(RPTK)从三维肝脏分割中提取影像组学特征,并进行特征提取、筛选和分类。数据集分为训练集(1,397人)和测试集(600人)。在信息量最高的20个特征上训练了五个机器学习模型,并通过验证集AUROC选择最佳模型集成。最佳的基于影像组学特征的XGBoost模型的测试AUROC为0.810,明显优于仅基于临床特征的模型(测试AUROC:0.457)。结直肠癌与腺瘤之间的细分类精度较低(测试AUROC:0.674)。研究证实了利用常规腹部CT的肝脏影像组学特征预测结直肠新生物的概念可行性。这不仅为CRC筛查提供了一个实用的辅助手段,而且突显了肠-肝轴作为机会性筛查的新生物标志物来源,并为未来的转化研究提供了新的机制假设。

关键见解

  1. 非侵入性结直肠癌筛查可通过改善结肠镜检查参与率和降低CRC死亡率来发挥作用。
  2. 肝脏来源的影像组学特征作为预测结直肠新生物的新方法,其预测性能通过机器学习方法得到验证。
  3. 从常规腹部CT图像中提取影像组学特征,在预测结直肠新生物方面具有潜力。
  4. XGBoost模型在预测结肠直肠新生物方面表现出较高的准确性(测试AUROC为0.810),相较于仅使用临床模型的测试AUROC(0.457),其表现更为出色。
  5. 在区分结肠直肠癌和腺瘤方面,尽管分类准确性稍低(测试AUROC:0.674),但这一方法仍然显示出潜力。
  6. 该研究证明了肠道-肝脏轴在预测结肠直肠新生物方面的作用,为后续机制研究提供了新的方向。
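
下面是一个示意性的Python流程(特征矩阵与标签为随机生成,假设已安装xgboost;RPTK的特征提取不在示意范围内),用来说明“影像组学特征 + XGBoost分类 + AUROC评估”的基本流程。

```python
# 示意性代码:影像组学特征 + XGBoost + AUROC 评估的流程草图(数据为随机占位)。
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier   # 假设已安装 xgboost

X = np.random.rand(1997, 20)        # 每位患者 20 个信息量最高的肝脏影像组学特征(示意)
y = np.random.randint(0, 2, 1997)   # 0 = 无结直肠新生物,1 = 有(示意标签)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=600, random_state=0)
model = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
model.fit(X_tr, y_tr)
print("test AUROC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```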

Cool Papers

点此查看论文截图

The Sun as an X-ray star V.: A new method to retrieve coronal filling factors

Authors:Wilhelmina Maryann Joseph, Beate Stelzer, Salvatore Orlando, Moritz Klawin

Context. Stellar coronae are unresolved in X-rays, so inferences about their structure rely on spectral analysis. The “Sun-as-an-X-ray-star” (SaXS) approach uses the Sun as a spatially resolved template to interpret stellar spectra, but previous SaXS implementations were indirect and computationally heavy. Aims. We present a new SaXS implementation that converts solar emission measure distributions (EMDs) of distinct coronal region types into XSPEC spectral components and test whether broad-band X-ray spectra alone can recover their filling factors. Methods. We built XSPEC multi-temperature spectral models for four solar region types (background/quiet corona, active regions, cores, and flares) by using EMDs derived from Yohkoh/SXT data and translating each EMD bin into an isothermal apec component. These models were fit (using PyXspec) to two one-hour DAXSS spectra representative of quiescent (2022-06-29) and flaring (2022-04-25) states. Best-fit normalizations were converted into projected areas and filling factors and compared with near-coincident Hinode/XRT full-disk images. Results. Using the Yohkoh/SXT EMDs, the quiescent Sun spectrum is dominated by active region emission (filling factor 22%), with the background corona poorly constrained. The flaring Sun spectrum is best described by a combination of active regions, cores, and flares with filling factors of ~47.5%, ~4.1%, and ~0.062%, respectively. The dominant components match spatial features seen in Hinode/XRT images. Limitations include the DAXSS low-energy cutoff (0.7 keV) and the small, non-uniform Yohkoh EMD sample. Conclusions. Our SaXS implementation enables direct retrieval of coronal filling factors from broad-band X-ray spectra and provides a physically motivated alternative to ad hoc few-temperature fits, suitable for stellar X-ray analyses.

背景:恒星的日冕在X射线波段无法被空间分辨,因此对其结构的推断依赖于光谱分析。“太阳作为X射线恒星”(Sun-as-an-X-ray-star, SaXS)方法把太阳作为空间可分辨的模板来解释恒星光谱,但之前的SaXS实现是间接的且计算量大。目标:我们提出了一种新的SaXS实现,将不同日冕区域类型的太阳发射量分布(EMD)转换为XSPEC光谱成分,并测试仅凭宽带X射线光谱能否恢复各区域的填充因子。方法:我们利用由Yohkoh/SXT数据导出的EMD,为四种太阳区域类型(背景/宁静日冕、活动区、活动区核心和耀斑)构建了多温度XSPEC光谱模型,将每个EMD区间转换为一个等温apec成分。这些模型使用PyXspec拟合了两段各一小时的DAXSS光谱,分别代表宁静状态(2022-06-29)和耀斑状态(2022-04-25)。最佳拟合归一化值被换算为投影面积和填充因子,并与时间上接近的Hinode/XRT全日面图像进行了比较。结果:使用Yohkoh/SXT的EMD,宁静太阳光谱主要由活动区发射主导(填充因子22%),背景日冕的约束较差;耀斑太阳光谱最好由活动区、核心和耀斑的组合来描述,填充因子分别约为47.5%、4.1%和0.062%。主要成分与Hinode/XRT图像中的空间特征相匹配。局限性包括DAXSS的低能截止(0.7 keV)以及Yohkoh EMD样本规模小且不均匀。结论:我们的SaXS实现能够直接从宽带X射线光谱中获取日冕填充因子,并为临时设定的少数温度成分拟合提供了一种有物理动机的替代方案,适用于恒星X射线分析。

论文及项目相关链接

PDF

Summary
本文提出了“太阳作为X射线恒星”(SaXS)方法的一种新实现,用于恒星日冕研究。通过将不同日冕区域类型的太阳发射量分布转化为XSPEC光谱成分,对太阳宁静状态和耀斑状态下的X射线光谱进行了拟合分析。最佳拟合的归一化值被转换为投影面积和填充因子,并与Hinode/XRT全日面图像进行了比较。结果显示,宁静太阳光谱主要由活动区发射主导,而背景日冕难以约束;耀斑太阳光谱则由活动区、核心和耀斑共同主导。我们的方法提供了从宽带X射线光谱直接获取日冕填充因子的可能,为恒星X射线分析提供了有物理动机的替代方法。由于以太阳作为参考模板,这种新方法为此前难以解析的恒星日冕结构提供了有效工具。

Key Takeaways:

  • 新的SaXS方法以太阳作为空间可分辨的模板,把不同类型日冕区域的太阳发射量分布转化为XSPEC光谱成分。
  • 对宁静和耀斑状态下的太阳X射线光谱进行了拟合分析,提取了日冕填充因子。宁静太阳光谱主要由活动区主导,而背景日冕难以约束;耀斑太阳光谱则涉及多个区域类型。
  • 新方法使得直接从宽带X射线光谱获取日冕填充因子成为可能,为恒星X射线分析提供了有物理动机的替代方案。这一方法对于解析恒星日冕结构具有重要的应用前景。
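
下面是一个示意性的Python片段(模板谱与观测谱均为随机生成,并非XSPEC/DAXSS的实际接口),用来说明“把观测光谱拟合为若干区域类型模板谱的非负线性组合,再把归一化系数换算成填充因子”的核心思想。

```python
# 示意性代码:用非负最小二乘把观测谱分解为区域类型模板谱的组合,换算填充因子。
import numpy as np
from scipy.optimize import nnls   # 非负最小二乘

n_energy_bins, region_types = 100, ["quiet", "active", "core", "flare"]
templates = np.abs(np.random.rand(n_energy_bins, len(region_types)))  # 各区域类型的单位面积模板谱(占位)
true_areas = np.array([0.5, 0.22, 0.04, 0.001])                       # 示意的"真实"投影面积占比
observed = templates @ true_areas + 0.01 * np.random.rand(n_energy_bins)

areas, _ = nnls(templates, observed)          # 拟合得到各成分的归一化(投影面积)
filling_factors = areas / areas.sum()
for name, f in zip(region_types, filling_factors):
    print(f"{name:>6s}: filling factor ≈ {f:.3f}")
```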

Cool Papers

点此查看论文截图

Alias-Free ViT: Fractional Shift Invariance via Linear Attention

Authors:Hagay Michaeli, Daniel Soudry

Transformers have emerged as a competitive alternative to convnets in vision tasks, yet they lack the architectural inductive bias of convnets, which may hinder their potential performance. Specifically, Vision Transformers (ViTs) are not translation-invariant and are more sensitive to minor image translations than standard convnets. Previous studies have shown, however, that convnets are also not perfectly shift-invariant, due to aliasing in downsampling and nonlinear layers. Consequently, anti-aliasing approaches have been proposed to certify convnets’ translation robustness. Building on this line of work, we propose an Alias-Free ViT, which combines two main components. First, it uses alias-free downsampling and nonlinearities. Second, it uses linear cross-covariance attention that is shift-equivariant to both integer and fractional translations, enabling a shift-invariant global representation. Our model maintains competitive performance in image classification and outperforms similar-sized models in terms of robustness to adversarial translations.

Transformer在视觉任务中已成为卷积神经网络(convnets)有力的替代方案,但它们缺少卷积神经网络的架构归纳偏置,这可能限制其潜在性能。具体来说,视觉Transformer(ViTs)不具有平移不变性,对轻微的图像平移比标准卷积神经网络更加敏感。然而,先前的研究表明,由于下采样和非线性层中的混叠,卷积神经网络也并非完全平移不变。因此,人们提出了抗混叠方法来保证卷积神经网络的平移鲁棒性。基于这一方向,我们提出了无混叠ViT(Alias-Free ViT),它结合了两个主要组件:首先,它使用无混叠的下采样和非线性;其次,它使用对整数平移和分数平移都保持平移等变的线性交叉协方差注意力,从而实现平移不变的全局表示。我们的模型在图像分类中保持了竞争力,并且在对对抗性平移的鲁棒性方面优于规模相近的模型。

论文及项目相关链接

PDF Accepted at NeurIPS 2025. Code is available at https://github.com/hmichaeli/alias_free_vit

Summary

视觉Transformer(ViTs)在视觉任务中已成为卷积神经网络(convnets)的有力竞争对手,但缺乏卷积神经网络的架构归纳偏置可能会限制其潜在性能。具体地说,ViTs对细微的图像平移敏感,并非平移不变。然而,先前的研究表明,由于下采样和非线性层中的混叠,卷积神经网络也并非完全平移不变,因此人们提出了抗混叠方法来保证其平移鲁棒性。在此基础上,我们提出了无混叠ViT模型,该模型结合了两种主要成分:一是使用无混叠的下采样和非线性;二是使用对整数平移和分数平移均保持等变的线性交叉协方差注意力机制,从而实现平移不变的全局表示。该模型在图像分类中保持竞争力,并在对抗性平移下的鲁棒性优于同类规模的模型。

Key Takeaways

  1. 视觉Transformer(ViTs)已成为视觉任务中卷积神经网络(convnets)的有力竞争对手。
  2. ViTs缺乏卷积神经网络的架构归纳偏置,可能会影响其性能。
  3. ViTs并非完全平移不变,对细微的图像平移敏感。
  4. 卷积神经网络也并非完全平移不变,存在混叠效应。
  5. 抗混叠方法被用来增强卷积神经网络的平移鲁棒性。
  6. 提出了无混叠ViT模型,结合了无混叠下采样、非线性以及线性交叉协方差注意力机制。
  7. 该模型在图像分类中表现优异,并且在对抗平移方面具有鲁棒性。
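
下面是一个示意性的PyTorch草图(细节与论文实现可能不同),用来说明线性交叉协方差注意力的核心思想:注意力矩阵建立在通道维(d×d)而非token维上,因此计算量随token数线性增长。

```python
# 示意性代码:交叉协方差(通道维)注意力的简化草图。
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossCovarianceAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.temperature = nn.Parameter(torch.ones(1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, d),N 为 token 数
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = F.normalize(q, dim=1)              # 沿 token 维做 L2 归一化(简化处理)
        k = F.normalize(k, dim=1)
        attn = (q.transpose(1, 2) @ k) * self.temperature   # (B, d, d):通道间的协方差
        attn = attn.softmax(dim=-1)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2)    # 注意力作用在通道维,输出 (B, N, d)
        return self.proj(out)

x = torch.randn(2, 196, 64)
print(CrossCovarianceAttention(64)(x).shape)   # torch.Size([2, 196, 64])
```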

Cool Papers

点此查看论文截图

Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge

Authors:Nimrod Berman, Omkar Joglekar, Eitan Kosman, Dotan Di Castro, Omri Azencot

Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.

近期生成建模的进展使扩散模型成为从复杂数据分布中采样的最先进工具。虽然这些模型在图像和音频等单模态领域取得了显著成功,但将其能力拓展到模态转换(Modality Translation, MT),即跨不同感官模态传递信息,仍然是一个开放挑战。现有方法通常依赖于限制性假设,包括共享维度、高斯源先验和特定于模态的架构,这限制了它们的通用性和理论基础。在这项工作中,我们提出了潜在去噪扩散桥梁模型(Latent Denoising Diffusion Bridge Model, LDDBM),这是一个基于去噪扩散桥梁模型潜变量扩展的通用模态转换框架。通过在共享潜空间中进行操作,我们的方法无需对齐维度即可在任意模态之间学习桥梁。我们引入了一种对比对齐损失来强制配对样本之间的语义一致性,并设计了一种面向潜空间噪声预测的域无关编码器-解码器架构。此外,我们提出了预测损失来引导训练实现准确的跨域转换,并探索了多种训练策略以提高稳定性。我们的方法支持任意模态对,并在多种模态转换任务上表现强劲,包括多视图到3D形状生成、图像超分辨率和多视图场景合成。全面的实验和消融研究验证了该框架的有效性,为通用模态转换建立了新的强基线。有关更多信息,请访问我们的项目页面:https://sites.google.com/view/lddbm/home。

论文及项目相关链接

PDF Accepted as a poster at NeurIPS 2025

Summary
扩散模型最近在生成建模领域取得了突破性进展,已成为复杂数据分布采样的顶尖工具。尽管它们在单模态领域(如图像和音频)取得了显著成功,但将它们的能力扩展到跨不同感官模态的翻译(模态转换,MT)仍然是一个挑战。本研究提出了潜在去噪扩散桥梁模型(LDDBM),这是一种基于去噪扩散桥梁模型的潜在变量扩展的通用模态转换框架。通过在共享潜在空间中进行操作,该方法能够在任意模态之间建立桥梁,无需对齐维度。

Key Takeaways

  1. 扩散模型在生成建模领域是最新顶尖工具,尤其擅长复杂数据分布采样。
  2. 模态转换(MT)是将信息从一种感官模态转换为另一种模态,现有方法在这方面存在挑战。
  3. 潜在去噪扩散桥梁模型(LDDBM)是一种新的通用模态转换框架,可在任意模态之间建立桥梁,无需对齐维度。
  4. LDDBM通过引入对比对齐损失来强制配对样本之间的语义一致性。
  5. 该框架设计了一个面向潜空间噪声预测的域无关(domain-agnostic)编码器-解码器架构。
  6. LDDBM支持任意模态对,并在多种MT任务上表现强劲,包括多视图到3D形状生成、图像超分辨率和多视图场景合成。
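
下面是一个示意性的Python片段(InfoNCE风格,并非论文损失的逐字实现),用来说明对比对齐损失如何在共享潜空间中把成对的源模态/目标模态编码拉近、把非配对样本推远。

```python
# 示意性代码:对比对齐损失(InfoNCE 风格)的最小化草图。
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z_src, z_tgt, temperature=0.07):
    """z_src, z_tgt: (B, d),同一行索引对应同一配对样本。"""
    z_src = F.normalize(z_src, dim=-1)
    z_tgt = F.normalize(z_tgt, dim=-1)
    logits = z_src @ z_tgt.t() / temperature          # (B, B) 相似度矩阵
    labels = torch.arange(z_src.shape[0])             # 对角线为正样本
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = contrastive_alignment_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```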

Cool Papers

点此查看论文截图

Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

Authors:Erik Riise, Mehmet Onurcan Kaya, Dim P. Papadopoulos

While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.

虽然通过搜索实现的推理时扩展已经给大型语言模型带来了革命性的变化,但将这些收益转化到图像生成上却被证明是困难的。最近将搜索策略应用于连续扩散模型的尝试收效有限,简单的随机采样往往表现最佳。我们证明,视觉自回归模型离散、序列化的特性使其能够在图像生成中进行有效的搜索。我们展示了束搜索(beam search)可以大幅改进文本到图像的生成,使一个2B参数的自回归模型在多个基准测试中超越12B参数的扩散模型。系统的消融实验表明,这一优势来自离散的token空间,它支持早期剪枝和计算复用;我们的验证器分析则凸显了速度与推理能力之间的权衡。这些发现表明,对于视觉生成中的推理时优化,模型架构(而不仅仅是规模)至关重要。

论文及项目相关链接

PDF

Summary

在大型语言模型中,基于搜索的推理时扩展技术取得了革命性进展,但在图像生成中应用这些技术却面临挑战。近期尝试将搜索策略应用于连续扩散模型的效果有限,简单随机采样通常表现最佳。本研究展示了离散、序列化的视觉自回归模型在图像生成中的有效搜索能力。通过应用束搜索(beam search),我们显著提高了文本到图像的生成质量,使得一个拥有20亿(2B)参数的自回归模型在基准测试中超越了拥有120亿(12B)参数的扩散模型。系统性消融实验表明,这一优势来源于离散token空间,它允许早期剪枝和计算复用。我们的验证器分析强调了速度与推理能力之间的权衡。这些发现表明,模型架构而非单纯的规模对于视觉生成的推理时优化至关重要。

Key Takeaways

  • 推理时间缩放搜索在大型语言模型中取得了显著进展,但在图像生成中的应用具有挑战性。
  • 离散序列视觉自回归模型在图像生成中展现出有效的搜索能力。
  • 束搜索(beam search)显著提高了文本到图像的生成质量。
  • 拥有20亿(2B)参数的自回归模型在基准测试中优于拥有120亿(12B)参数的扩散模型。
  • 消融实验表明优势来源于离散符号空间,允许早期修剪和计算复用。
  • 验证分析揭示了速度和推理能力之间的权衡。
  • 模型架构对于推理时间优化至关重要,而不仅仅是规模。
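
下面是一个示意性的Python草图(打分函数为随机占位,真实系统中由自回归视觉模型给出),用来说明在离散视觉token上进行束搜索的基本流程:每步只展开得分最高的若干分支,实现早期剪枝与前缀复用。

```python
# 示意性代码:离散 token 上束搜索的最小化草图。
import torch

def next_token_logits(prefix):                 # 占位:真实系统中由自回归视觉模型给出
    torch.manual_seed(hash(tuple(prefix)) % (2 ** 31))
    return torch.randn(16)                     # 假设词表大小为 16

def beam_search(seq_len=8, beam_size=3):
    beams = [([], 0.0)]                        # (token 序列, 累计 log 概率)
    for _ in range(seq_len):
        candidates = []
        for tokens, score in beams:
            logp = next_token_logits(tokens).log_softmax(dim=-1)
            topv, topi = logp.topk(beam_size)  # 只展开得分最高的若干分支(剪枝)
            for v, i in zip(topv.tolist(), topi.tolist()):
                candidates.append((tokens + [i], score + v))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

for tokens, score in beam_search():
    print(round(score, 3), tokens)
```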

Cool Papers

点此查看论文截图

Lightweight Facial Landmark Detection in Thermal Images via Multi-Level Cross-Modal Knowledge Transfer

Authors:Qiyi Tong, Olivia Nocentini, Marta Lagomarsino, Kuanqi Cai, Marta Lorenzini, Arash Ajoudani

Facial Landmark Detection (FLD) in thermal imagery is critical for applications in challenging lighting conditions, but it is hampered by the lack of rich visual cues. Conventional cross-modal solutions, like feature fusion or image translation from RGB data, are often computationally expensive or introduce structural artifacts, limiting their practical deployment. To address this, we propose Multi-Level Cross-Modal Knowledge Distillation (MLCM-KD), a novel framework that decouples high-fidelity RGB-to-thermal knowledge transfer from model compression to create both accurate and efficient thermal FLD models. A central challenge during knowledge transfer is the profound modality gap between RGB and thermal data, where traditional unidirectional distillation fails to enforce semantic consistency across disparate feature spaces. To overcome this, we introduce Dual-Injected Knowledge Distillation (DIKD), a bidirectional mechanism designed specifically for this task. DIKD establishes a connection between modalities: it not only guides the thermal student with rich RGB features but also validates the student’s learned representations by feeding them back into the frozen teacher’s prediction head. This closed-loop supervision forces the student to learn modality-invariant features that are semantically aligned with the teacher, ensuring a robust and profound knowledge transfer. Experiments show that our approach sets a new state-of-the-art on public thermal FLD benchmarks, notably outperforming previous methods while drastically reducing computational overhead.

热成像中的人脸关键点检测(Facial Landmark Detection, FLD)对具有挑战性光照条件下的应用至关重要,但它受到热成像缺乏丰富视觉线索的阻碍。传统的跨模态解决方案,如特征融合或基于RGB数据的图像转换,通常计算成本高昂或会引入结构伪影,限制了实际部署。为了解决这个问题,我们提出了多级跨模态知识蒸馏(MLCM-KD),这是一种新型框架,它将高保真的RGB到热成像知识迁移与模型压缩解耦,以构建既准确又高效的热成像FLD模型。知识迁移期间的一个核心挑战在于RGB和热成像数据之间巨大的模态差距,传统的单向蒸馏无法在差异巨大的特征空间之间保证语义一致性。为克服这一问题,我们引入了双向注入知识蒸馏(DIKD),这是一种专为此任务设计的双向机制。DIKD在模态之间建立了联系:它不仅用丰富的RGB特征引导热成像学生模型,还将学生学到的表示反馈到冻结的教师预测头中加以校验。这种闭环监督促使学生学习与教师语义对齐的模态不变特征,确保稳健且深入的知识迁移。实验表明,我们的方法在公开的热成像FLD基准测试中达到了新的最先进水平,显著优于以前的方法,同时大大降低了计算开销。

论文及项目相关链接

PDF

Summary
针对热成像中的人脸关键点检测(FLD),在复杂光照条件下具有重要应用价值,但缺乏丰富的视觉线索是一大挑战。本研究提出一种新型的多层次跨模态知识蒸馏(MLCM-KD)框架,实现了高效且准确地在热成像中实施人脸关键点检测。该研究引入了双向知识蒸馏机制——双向注入知识蒸馏(DIKD),缩小了RGB与热成像数据之间的模态差距,实现了跨特征空间的知识转移。实验表明,该方法在公开的热成像FLD基准测试中表现卓越,显著优于先前方法,同时大幅降低了计算开销。

Key Takeaways

  1. 面部关键点检测(FLD)在热成像领域因复杂光照条件而变得关键,但由于缺乏丰富的视觉线索而面临挑战。
  2. 传统的跨模态解决方案(如特征融合或RGB数据的图像翻译)计算成本高昂并可能引入结构伪影,限制了实际应用。
  3. 提出了新型的多层次跨模态知识蒸馏(MLCM-KD)框架,旨在创建高效且准确的热成像FLD模型。
  4. 双向知识蒸馏机制——双向注入知识蒸馏(DIKD)被引入以缩小RGB与热成像数据之间的模态差距。
  5. DIKD机制不仅利用RGB特征引导热成像学生模型,还通过反馈学生模型的预测结果来验证其学习到的表示,实现闭环监督。
  6. 闭环监督促使学生模型学习模态不变且语义对齐的特征,确保稳健且深入的知识转移。
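
下面是一个示意性的PyTorch草图(各网络均为占位的线性层,与论文结构无关),用来说明双向注入式蒸馏的两条监督路径:学生特征向教师特征对齐,同时学生特征被送入冻结的教师预测头做闭环校验。

```python
# 示意性代码:双向注入式知识蒸馏的两条监督路径草图(模块均为假设的占位实现)。
import torch
import torch.nn as nn

teacher_backbone = nn.Linear(64, 32)    # 假设:RGB 教师特征提取器(已训练、冻结)
teacher_head = nn.Linear(32, 10)        # 假设:教师的关键点预测头(冻结)
student_backbone = nn.Linear(64, 32)    # 热成像学生特征提取器(待训练)
for p in list(teacher_backbone.parameters()) + list(teacher_head.parameters()):
    p.requires_grad = False

rgb, thermal = torch.randn(4, 64), torch.randn(4, 64)   # 配对的 RGB / 热成像输入(展平示意)
landmarks_gt = torch.randn(4, 10)

t_feat = teacher_backbone(rgb)
s_feat = student_backbone(thermal)
feat_loss = nn.functional.mse_loss(s_feat, t_feat.detach())             # 教师 -> 学生:特征对齐
loop_loss = nn.functional.mse_loss(teacher_head(s_feat), landmarks_gt)  # 学生特征回注教师头做校验
(feat_loss + loop_loss).backward()
```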

Cool Papers

点此查看论文截图

ReXGroundingCT: A 3D Chest CT Dataset for Segmentation of Findings from Free-Text Reports

Authors:Mohammed Baharoon, Luyang Luo, Michael Moritz, Abhinav Kumar, Sung Eun Kim, Xiaoman Zhang, Miao Zhu, Mahmoud Hussain Alabbad, Maha Sbayel Alhazmi, Neel P. Mistry, Lucas Bijnens, Kent Ryan Kleinschmidt, Brady Chrisler, Sathvik Suryadevara, Sri Sai Dinesh Jaliparthi, Noah Michael Prudlo, Mark David Marino, Jeremy Palacio, Rithvik Akula, Di Zhou, Hong-Yu Zhou, Ibrahim Ethem Hamamci, Scott J. Adams, Hassan Rayhan AlOmaish, Pranav Rajpurkar

We introduce ReXGroundingCT, the first publicly available dataset linking free-text findings to pixel-level 3D segmentations in chest CT scans. The dataset includes 3,142 non-contrast chest CT scans paired with standardized radiology reports from CT-RATE. Construction followed a structured three-stage pipeline. First, GPT-4 was used to extract and standardize findings, descriptors, and metadata from reports originally written in Turkish and machine-translated into English. Second, GPT-4o-mini categorized each finding into a hierarchical ontology of lung and pleural abnormalities. Third, 3D annotations were produced for all CT volumes: the training set was quality-assured by board-certified radiologists, and the validation and test sets were fully annotated by board-certified radiologists. Additionally, a complementary chain-of-thought dataset was created to provide step-by-step hierarchical anatomical reasoning for localizing findings within the CT volume, using GPT-4o and localization coordinates derived from organ segmentation models. ReXGroundingCT contains 16,301 annotated entities across 8,028 text-to-3D-segmentation pairs, covering diverse radiological patterns from 3,142 non-contrast CT scans. About 79% of findings are focal abnormalities and 21% are non-focal. The dataset includes a public validation set of 50 cases and a private test set of 100 cases, both annotated by board-certified radiologists. The dataset establishes a foundation for enabling free-text finding segmentation and grounded radiology report generation in CT imaging. Model performance on the private test set is hosted on a public leaderboard at https://rexrank.ai/ReXGroundingCT. The dataset is available at https://huggingface.co/datasets/rajpurkarlab/ReXGroundingCT.

我们推出了ReXGroundingCT,这是首个将自由文本病灶描述与胸部CT扫描中的像素级3D分割相链接的公开数据集。该数据集包含3,142份平扫(非增强)胸部CT,并与来自CT-RATE的标准化放射学报告配对。构建过程遵循结构化的三阶段流程:第一阶段,使用GPT-4从报告中提取并标准化病灶描述、描述符和元数据,这些报告原本以土耳其语撰写并经机器翻译为英文;第二阶段,GPT-4o-mini将每项病灶归类到肺部和胸膜异常的层次化本体中;第三阶段,为所有CT体积生成3D标注,其中训练集由执业认证的放射科医生进行质量审核,验证集和测试集则完全由执业认证的放射科医生标注。此外,还利用GPT-4o和由器官分割模型得到的定位坐标,构建了一个补充的思维链(chain-of-thought)数据集,为在CT体积内定位病灶提供逐步的层次化解剖推理。ReXGroundingCT在8,028个文本到3D分割配对中包含16,301个标注实体,涵盖了来自3,142份平扫CT的多样化影像学表现,其中约79%的病灶为局灶性异常,21%为非局灶性。数据集包含由执业认证放射科医生标注的50例公开验证集和100例私有测试集。该数据集为CT成像中的自由文本病灶分割以及有图像依据(grounded)的放射学报告生成奠定了基础。模型在私有测试集上的表现公布在公开排行榜https://rexrank.ai/ReXGroundingCT上。数据集可在https://huggingface.co/datasets/rajpurkarlab/ReXGroundingCT获取。

论文及项目相关链接

PDF

Summary

本文介绍了ReXGroundingCT数据集,该数据集首次实现了自由文本病灶描述与胸部CT扫描像素级3D分割之间的链接。它包含与CT-RATE标准化放射学报告配对的平扫胸部CT,通过结构化三阶段流程构建而成:利用GPT-4提取并标准化报告中的病灶、描述符和元数据,再将病灶归类到肺部和胸膜异常的层次化本体中。所有CT体积均产生3D标注,其中训练集由执业认证放射科医生进行质量审核,验证集和测试集由其完整标注。此外还构建了思维链数据集,提供逐级的解剖推理,用于在CT体积中定位病灶。ReXGroundingCT包含覆盖多种影像学表现的大量标注实体和文本到3D分割配对,为CT影像中的自由文本病灶分割和有图像依据的放射学报告生成奠定了基础。模型在私有测试集上的性能已托管在公开排行榜上,数据集可通过链接获取。

Key Takeaways

  1. ReXGroundingCT是首个链接自由文本发现与胸部CT扫描像素级3D分割的数据集。
  2. 数据集包含与CT-RATE标准化放射学报告配对的非对比胸部CT扫描。
  3. 数据集构建采用结构化三阶段管道,包括使用GPT-4提取和分类报告发现。
  4. 所有CT体积均有3D标注:训练集由执业认证放射科医生质量审核,验证集与测试集由其完整标注。
  5. 数据集提供逐级的解剖推理,有助于定位CT体积中的异常发现。
  6. ReXGroundingCT包含多种放射学模式的多个注释实体和文本到3D分割对。
  7. 模型的性能已在公共排行榜上托管,数据集可通过指定链接获取。

Cool Papers

点此查看论文截图

DArFace: Deformation Aware Robustness for Low Quality Face Recognition

Authors:Sadaf Gulshad, Abdullah Aldahlawi

Facial recognition systems have achieved remarkable success by leveraging deep neural networks, advanced loss functions, and large-scale datasets. However, their performance often deteriorates in real-world scenarios involving low-quality facial images. Such degradations, common in surveillance footage or standoff imaging, include low resolution, motion blur, and various distortions, resulting in a substantial domain gap from the high-quality data typically used during training. While existing approaches attempt to address robustness by modifying network architectures or modeling global spatial transformations, they frequently overlook local, non-rigid deformations that are inherently present in real-world settings. In this work, we introduce DArFace, a Deformation-Aware robust Face recognition framework that enhances robustness to such degradations without requiring paired high- and low-quality training samples. Our method adversarially integrates both global transformations (e.g., rotation, translation) and local elastic deformations during training to simulate realistic low-quality conditions. Moreover, we introduce a contrastive objective to enforce identity consistency across different deformed views. Extensive evaluations on low-quality benchmarks including TinyFace, IJB-B, and IJB-C demonstrate that DArFace surpasses state-of-the-art methods, with significant gains attributed to the inclusion of local deformation modeling.

人脸识别系统通过利用深度神经网络、先进的损失函数和大规模数据集取得了显著的成功。然而,在涉及低质量人脸图像的现实场景中,其性能往往会下降。这种退化在监控录像或远距离成像中很常见,包括低分辨率、运动模糊和各种失真,与训练时通常使用的高质量数据之间存在巨大的领域差距。现有方法试图通过修改网络架构或建模全局空间变换来提高稳健性,但它们经常忽略现实环境中固有的局部非刚性形变。在这项工作中,我们引入了DArFace,一个形变感知(Deformation-Aware)的鲁棒人脸识别框架,该框架无需成对的高质量与低质量训练样本即可提高对此类退化的稳健性。我们的方法在训练过程中以对抗方式同时引入全局变换(例如旋转、平移)和局部弹性形变,以模拟真实的低质量条件。此外,我们引入了一个对比目标,以强制不同形变视图之间的身份一致性。在TinyFace、IJB-B和IJB-C等低质量基准测试上的广泛评估表明,DArFace超越了最先进的方法,其显著提升主要归功于局部形变建模的引入。

论文及项目相关链接

PDF

Summary

本文提出了一种形变感知的鲁棒人脸识别框架DArFace。该框架在训练中以对抗方式同时模拟低质量条件下的全局变换和局部弹性形变,并利用对比目标保持不同形变视图间的身份一致性,从而在无需成对高/低质量样本的情况下提高了对低质量人脸图像的鲁棒性。在TinyFace、IJB-B和IJB-C等低质量基准测试上的广泛评估表明,DArFace在引入局部形变建模后超越了最新方法,具有显著的优势。

Key Takeaways

  1. 面部识别系统利用深度神经网络、先进的损失函数和大规模数据集取得了显著成功。
  2. 在涉及低质量面部图像的现实世界场景中,面部识别系统的性能往往会下降。
  3. 低质量图像常见的退化包括低分辨率、运动模糊和各种扭曲。
  4. 现有方法主要通过修改网络架构或建模全局空间变换来提高鲁棒性,但忽略了现实环境中固有的局部非刚性变形。
  5. DArFace框架通过模拟低质量条件下的全局变换和局部弹性变形,提高了对低质量面部图像的鲁棒性。
  6. DArFace使用对比目标来强制执行不同变形视图之间的身份一致性。
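
下面是一个示意性的PyTorch草图(形变在此随机采样,而论文中为对抗式生成),用来说明训练时如何同时施加全局仿射变换和局部弹性形变来模拟低质量人脸,随后可在原图与形变图的特征之间施加对比损失。

```python
# 示意性代码:全局平移 + 局部弹性形变的数据增强草图(随机采样,仅示意数据流)。
import torch
import torch.nn.functional as F

def deform(images, max_shift=0.05):
    """images: (B, C, H, W)。先做小幅全局平移,再叠加平滑的局部位移场。"""
    b, _, h, w = images.shape
    theta = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]).repeat(b, 1, 1)
    theta[:, :, 2] = (torch.rand(b, 2) - 0.5) * 2 * max_shift       # 全局平移量
    grid = F.affine_grid(theta, images.shape, align_corners=False)
    noise = torch.randn(b, 2, h // 8, w // 8) * max_shift            # 低分辨率位移场
    local = F.interpolate(noise, size=(h, w), mode="bilinear", align_corners=False)
    grid = grid + local.permute(0, 2, 3, 1)                          # 叠加局部弹性形变
    return F.grid_sample(images, grid, align_corners=False)

faces = torch.randn(4, 3, 112, 112)
deformed = deform(faces)
# 随后可对 faces 与 deformed 的特征施加对比损失,强制同一身份的表示一致
print(deformed.shape)
```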

Cool Papers

点此查看论文截图

MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning

Authors:Swadhin Das, Raksha Sharma

Remote sensing images contain complex spatial patterns and semantic structures, which makes the captioning model difficult to accurately describe. Encoder-decoder architectures have become the widely used approach for RSIC by translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which weakens the model to accurately describe the image. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF) which improves the performance of RSIC by optimizing both the spatial representation and language generation of encoder-decoder architecture. The encoder fuses information from two complementary image encoders, thereby promoting feature diversity through the integration of multiscale and structurally distinct cues. To improve the capture of context-aware descriptions, we refine the input sequence’s semantic modeling on the decoder side using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.

遥感图像包含复杂的空间模式和语义结构,这使得图像描述模型难以对其进行准确描述。编码器-解码器架构通过将视觉内容转换为描述性文本,已成为遥感图像描述(RSIC)的常用方法。然而,许多现有方法依赖单流架构,削弱了模型准确描述图像的能力。这类单流架构通常难以提取多样的空间特征或捕捉复杂的语义关系,在类内相似度高或上下文模糊的场景中效果有限。在这项工作中,我们提出了一种新颖的多流编码器-解码器框架(MsEdF),通过同时优化编码器-解码器架构的空间表示和语言生成来提高RSIC的性能。编码器融合来自两个互补图像编码器的信息,通过整合多尺度且结构互异的线索来促进特征多样性。为了更好地生成上下文感知的描述,我们在解码器端采用带逐元素聚合方案的堆叠GRU架构,改进输入序列的语义建模。在三个基准RSIC数据集上的实验表明,MsEdF优于多种基线模型。

论文及项目相关链接

PDF

Summary
遥感图像的空间模式和语义结构复杂,导致图像描述模型难以准确描述。编码器-解码器架构通过将视觉内容转换为描述性文本,已成为遥感图像描述的常用方法,但许多现有方法依赖单流架构,难以准确描述图像:这类方法在提取多样空间特征和捕捉复杂语义关系方面存在局限,在类内相似度高或语境模糊的场景中效果受限。本研究提出一种新的多流编码器-解码器框架(MsEdF),通过优化编码器-解码器架构的空间表示和语言生成,提高遥感图像描述的性能。编码器融合两个互补图像编码器的信息,促进特征多样性的融合。为提高上下文感知描述的生成能力,解码器侧使用带逐元素聚合方案的堆叠GRU架构,对输入序列的语义建模进行细化。实验结果表明,MsEdF在三个基准遥感图像描述数据集上优于多种基线模型。

Key Takeaways

  1. 遥感图像描述面临复杂空间模式和语义结构的挑战。
  2. 现有单流编码器-解码器架构在描述图像时存在局限性。
  3. 提出的多流编码器-解码器框架(MsEdF)旨在提高遥感图像描述的性能。
  4. MsEdF通过融合两个互补图像编码器的信息,促进特征多样性。
  5. MsEdF通过细化解码器侧的语义建模,提高上下文感知描述的生成能力。
  6. 实验结果显示,MsEdF在多个数据集上的性能优于基线模型。
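
下面是一个示意性的PyTorch草图(编码器输出维度等均为假设),用来说明“双流图像特征逐元素聚合后作为堆叠GRU解码器初始状态”的结构。

```python
# 示意性代码:双流特征融合 + 堆叠GRU解码器的结构草图(模块均为占位实现)。
import torch
import torch.nn as nn

class MsEdFSketch(nn.Module):
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.enc_a = nn.Linear(2048, dim)   # 假设:编码器A输出的全局特征(如CNN)
        self.enc_b = nn.Linear(768, dim)    # 假设:编码器B输出的全局特征(如ViT)
        self.embed = nn.Embedding(vocab, dim)
        self.gru = nn.GRU(dim, dim, num_layers=2, batch_first=True)  # 堆叠GRU
        self.out = nn.Linear(dim, vocab)

    def forward(self, feat_a, feat_b, captions):
        fused = self.enc_a(feat_a) + self.enc_b(feat_b)          # 逐元素聚合两路特征
        h0 = fused.unsqueeze(0).repeat(2, 1, 1)                   # 作为两层GRU的初始状态
        x, _ = self.gru(self.embed(captions), h0)
        return self.out(x)                                        # 每个时间步的词表分布

logits = MsEdFSketch()(torch.randn(4, 2048), torch.randn(4, 768),
                       torch.randint(0, 1000, (4, 12)))
print(logits.shape)   # torch.Size([4, 12, 1000])
```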

Cool Papers

点此查看论文截图

Guided MRI Reconstruction via Schrödinger Bridge

Authors:Yue Wang, Yuanbiao Yang, Zhuo-xu Cui, Tian Zhou, Bingsheng Huang, Hairong Zheng, Dong Liang, Yanjie Zhu

Magnetic Resonance Imaging (MRI) is an inherently multi-contrast modality, where cross-contrast priors can be exploited to improve image reconstruction from undersampled data. Recently, diffusion models have shown remarkable performance in MRI reconstruction. However, they still struggle to effectively utilize such priors, mainly because existing methods rely on feature-level fusion in image or latent spaces, which lacks explicit structural correspondence and thus leads to suboptimal performance. To address this issue, we propose $\mathbf{I}^2$SB-Inversion, a multi-contrast guided reconstruction framework based on the Schrödinger Bridge (SB). The proposed method performs pixel-wise translation between paired contrasts, providing explicit structural constraints between the guidance and target images. Furthermore, an Inversion strategy is introduced to correct inter-modality misalignment, which often occurs in guided reconstruction, thereby mitigating artifacts and improving reconstruction accuracy. Experiments on paired T1- and T2-weighted datasets demonstrate that $\mathbf{I}^2$SB-Inversion achieves a high acceleration factor of up to 14.4 and consistently outperforms existing methods in both quantitative and qualitative evaluations.

磁共振成像(MRI)是一种固有的多对比度成像模式,可以利用跨对比度先验信息来改善从欠采样数据中重建图像。最近,扩散模型在MRI重建中表现出了卓越的性能。然而,它们仍然难以有效利用这类先验信息,主要是因为现有方法依赖于图像空间或潜空间中的特征级融合,缺乏明确的结构对应关系,从而导致性能不佳。为了解决这一问题,我们提出了基于薛定谔桥(Schrödinger Bridge, SB)的多对比度引导重建框架$\mathbf{I}^2$SB-Inversion。该方法在配对对比度之间执行像素级转换,为引导图像和目标图像之间提供明确的结构约束。此外,还引入了一种反转(Inversion)策略,以校正引导重建中经常出现的模态间不对齐问题,从而减轻伪影、提高重建精度。在配对的T1加权和T2加权数据集上的实验表明,$\mathbf{I}^2$SB-Inversion实现了高达14.4的加速因子,并且在定量和定性评估中均一致优于现有方法。

论文及项目相关链接

PDF

Summary

在MRI重建中,多对比度先验信息能有效提升欠采样数据的图像重建效果。针对现有扩散模型难以利用这类先验信息的问题,本文提出了一种基于薛定谔桥(Schrödinger Bridge, SB)的多对比度引导重建框架$\mathbf{I}^2$SB-Inversion。该方法实现了配对对比度之间的像素级转换,提供了明确的结构约束,并引入了一种校正模态间不对齐的反转(Inversion)策略,有效减轻了引导重建中的伪影问题,提高了重建准确性。在配对的T1和T2加权数据集上的实验表明,$\mathbf{I}^2$SB-Inversion在加速因子高达14.4的情况下,无论在定量还是定性评估中都优于现有方法。

Key Takeaways

  • $\mathbf{I}^2$SB-Inversion利用多对比度先验信息进行MRI重建,能有效提升欠采样数据的图像质量。
  • 通过像素级转换实现了配对对比度间的明确结构约束。
  • 引入反转(Inversion)策略校正模态间的不对齐,减轻伪影并提高了重建准确性。
  • 与现有方法相比,$\mathbf{I}^2$SB-Inversion在定量和定性评估中表现更优异。
  • $\mathbf{I}^2$SB-Inversion方法能够在高加速因子下实现MRI重建。
  • 特征级融合和图像或潜在空间中的融合在某些情况下可能效果不佳,缺乏明确的结构对应关系。

Cool Papers

点此查看论文截图

Bidirectional Regression for Monocular 6DoF Head Pose Estimation and Reference System Alignment

Authors:Sungho Chun, Boeun Kim, Hyung Jin Chang, Ju Yong Chang

Precise six-degree-of-freedom (6DoF) head pose estimation is crucial for safety-critical applications and human-computer interaction scenarios, yet existing monocular methods still struggle with robust pose estimation. We revisit this problem by introducing TRGv2, a lightweight extension of our previous Translation, Rotation, and Geometry (TRG) network, which explicitly models the bidirectional interaction between facial geometry and head pose. TRGv2 jointly infers facial landmarks and 6DoF pose through an iterative refinement loop with landmark-to-image projection, ensuring metric consistency among face size, rotation, and depth. To further improve generalization to out-of-distribution data, TRGv2 regresses correction parameters instead of directly predicting translation, combining them with a pinhole camera model for analytic depth estimation. In addition, we identify a previously overlooked source of bias in cross-dataset evaluations due to inconsistent head center definitions across different datasets. To address this, we propose a reference system alignment strategy that quantifies and corrects translation bias, enabling fair comparisons across datasets. Extensive experiments on ARKitFace, BIWI, and the challenging DD-Pose benchmarks demonstrate that TRGv2 outperforms state-of-the-art methods in both accuracy and efficiency. Code and newly annotated landmarks for DD-Pose will be publicly available.

精确的六自由度(6DoF)头部姿态估计对安全攸关的应用和人机交互场景至关重要,但现有的单目方法仍然难以实现稳健的姿态估计。我们通过引入TRGv2重新审视这一问题,它是我们先前的平移、旋转和几何(Translation, Rotation, and Geometry, TRG)网络的轻量级扩展,显式地建模面部几何和头部姿态之间的双向交互。TRGv2通过带有关键点到图像投影的迭代细化循环,联合推断面部关键点和6DoF姿态,确保面部尺寸、旋转和深度之间的度量一致性。为了进一步提高对分布外数据的泛化能力,TRGv2回归校正参数而不是直接预测平移,并将其与针孔相机模型结合进行解析深度估计。此外,我们还指出了跨数据集评估中一个此前被忽视的偏差来源:不同数据集对头部中心的定义不一致。为了解决这一问题,我们提出了一种参考系对齐策略,对平移偏差进行量化和校正,从而实现跨数据集的公平比较。在ARKitFace、BIWI和具有挑战性的DD-Pose基准上的大量实验表明,TRGv2在准确性和效率方面都优于现有最先进的方法。代码以及为DD-Pose新标注的关键点将公开发布。

论文及项目相关链接

PDF This version extends the previously published preprint and has been submitted to Pattern Recognition

Summary

文章介绍了TRGv2网络及其在精确六自由度(6DoF)头部姿态估计中的应用。TRGv2是先前的平移、旋转和几何(TRG)网络的轻量级扩展,通过对面部几何和头部姿态之间的双向交互进行显式建模来解决稳健的头部姿态估计问题。通过迭代优化循环和关键点到图像的投影,TRGv2联合推断面部关键点和6DoF姿态,确保面部尺寸、旋转和深度之间的度量一致性。此外,文章还提出回归校正参数并与针孔相机模型结合进行解析深度估计,以提高对分布外数据的泛化能力。同时,文章识别并解决了跨数据集评估中因不同数据集间头部中心定义不一致导致的偏差问题:通过引入参考系对齐策略量化并纠正平移偏差,使不同数据集之间的比较更加公平。经过广泛的实验验证,TRGv2在准确率和效率方面都优于最新的方法。代码以及为DD-Pose新标注的关键点将公开发布。这些改进有望提升安全攸关应用和人机交互场景中的表现。

Key Takeaways

  • TRGv2网络是平移、旋转和几何(TRG)网络的轻量级扩展,用于解决稳健的头部姿态估计问题。
  • TRGv2网络对面部几何与头部姿态之间的双向交互进行显式建模,并通过迭代优化循环进行关键点和姿态推断。
  • 通过回归校正参数并结合针孔相机模型,提高了深度估计精度以及对分布外数据的泛化能力。
  • 跨数据集评估中发现了由于头部中心定义不一致产生的偏差问题,并通过引入的参考系对齐策略加以解决。
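
下面用一个简单的Python算例说明针孔相机模型下由人脸像素尺寸解析估计深度的公式 Z = f · S / s(f为焦距、S为度量尺寸、s为像素尺寸;校正系数为假设的示意值,并非论文回归结果)。

```python
# 示意性代码:针孔相机模型的解析深度估计(数值均为示意)。
focal_px = 1200.0          # 相机焦距(像素)
face_metric_cm = 18.0      # 假设的人脸度量尺寸(厘米)
face_pixels = 240.0        # 图像中检测到的人脸尺寸(像素)
correction = 1.03          # 网络回归出的校正参数(示意值)

depth_cm = focal_px * face_metric_cm / face_pixels * correction
print(f"estimated head depth ≈ {depth_cm:.1f} cm")   # ≈ 92.7 cm
```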

Cool Papers

点此查看论文截图

SRAGAN: Saliency Regularized and Attended Generative Adversarial Network for Chinese Ink-wash Painting Style Transfer

Authors:Xiang Gao, Yuqi Zhang

Recent style transfer problems are still largely dominated by Generative Adversarial Network (GAN) from the perspective of cross-domain image-to-image (I2I) translation, where the pivotal issue is to learn and transfer target-domain style patterns onto source-domain content images. This paper handles the problem of translating real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though a wide range of I2I models tackle this problem, a notable challenge is that the content details of the source image could be easily erased or corrupted due to the transfer of ink-wash style elements. To remedy this issue, we propose to incorporate saliency detection into the unpaired I2I framework to regularize image content, where the detected saliency map is utilized from two aspects: (i) we propose saliency IOU (SIOU) loss to explicitly regularize object content structure by enforcing saliency consistency before and after image stylization; (ii) we propose saliency adaptive normalization (SANorm) which implicitly enhances object structure integrity of the generated paintings by dynamically injecting image saliency information into the generator to guide stylization process. Besides, we also propose saliency attended discriminator which harnesses image saliency information to focus generative adversarial attention onto the drawn objects, contributing to generating more vivid and delicate brush strokes and ink-wash textures. Extensive qualitative and quantitative experiments demonstrate superiority of our approach over related advanced image stylization methods in both GAN and diffusion model paradigms.

从跨域图像到图像(I2I)转换的角度来看,最近的风格迁移问题在很大程度上仍由生成对抗网络(GAN)主导,其关键问题是学习目标域的风格模式并将其迁移到源域的内容图像上。本文处理将真实照片转换为传统中国水墨画的问题,即水墨画风格迁移。尽管许多I2I模型都能处理这一问题,但一个显著的挑战是:在迁移水墨风格元素时,源图像的内容细节很容易被抹除或破坏。为了解决这一问题,我们提出将显著性检测融入非配对I2I框架来约束图像内容,并从两个方面利用检测到的显著性图:(1)我们提出显著性IOU(SIOU)损失,通过强制图像风格化前后的显著性一致性来显式约束物体内容结构;(2)我们提出显著性自适应归一化(SANorm),通过在生成器中动态注入图像显著性信息来引导风格化过程,从而隐式增强生成画作的物体结构完整性。此外,我们还提出了显著性关注判别器,利用图像显著性信息将生成对抗的注意力集中到所绘物体上,有助于生成更生动、更精细的笔触和水墨纹理。大量的定性和定量实验表明,无论在GAN范式还是扩散模型范式下,我们的方法均优于相关的先进图像风格化方法。

论文及项目相关链接

PDF Pattern Recognition, Volume 162, June 2025, 111344

Summary
本文研究了基于生成对抗网络(GAN)的跨域图像到图像(I2I)翻译中的风格转换问题,特别是在将真实图片转换为传统水墨画时的挑战。为解决源图像内容细节在转换水墨风格元素时容易丢失或损坏的问题,本文提出了结合显著性检测的无配对I2I框架来规范图像内容。通过引入显著性检测,我们提出了显著性IOU(SIOU)损失和显著性自适应归一化(SANorm)方法,分别从显性和隐性两方面增强对象内容的完整性。此外,我们还提出了显著性关注鉴别器,利用图像显著性信息将生成对抗的注意力集中在绘制对象上,从而生成更生动、精细的笔触和水墨纹理。

Key Takeaways

  1. 该研究关注生成对抗网络(GAN)在跨域图像到图像(I2I)翻译中的风格转换问题,特别是真实图片转换为传统水墨画。
  2. 在转换过程中,源图像的内容细节容易丢失或损坏是一个显著挑战。
  3. 为解决此问题,引入了显著性检测来规范图像内容,从显性和隐性两个方面增强对象内容的完整性。
  4. 提出了显著性IOU(SIOU)损失和显著性自适应归一化(SANorm)方法,分别用于显式和隐式地保护图像内容结构。
  5. 还提出了显著性关注鉴别器,利用图像显著性信息来生成更生动、精细的笔触和水墨纹理。
  6. 实验表明,该方法在GAN和扩散模型范式中均优于相关的高级图像风格化方法。
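
下面是一个示意性的PyTorch片段(显著性图用随机张量占位),用来说明显著性IOU(SIOU)损失的计算方式:对风格化前后的显著性图计算软IoU,并以1减之作为惩罚项。

```python
# 示意性代码:显著性IOU(SIOU)损失的最小化草图。
import torch

def soft_iou_loss(sal_src, sal_gen, eps=1e-6):
    """sal_src / sal_gen: (B, 1, H, W),取值在[0,1]的显著性图。"""
    inter = (sal_src * sal_gen).sum(dim=(1, 2, 3))
    union = (sal_src + sal_gen - sal_src * sal_gen).sum(dim=(1, 2, 3))
    return (1.0 - inter / (union + eps)).mean()

sal_before = torch.rand(4, 1, 64, 64)   # 源图像的显著性图(占位)
sal_after = torch.rand(4, 1, 64, 64)    # 风格化结果的显著性图(占位)
print(soft_iou_loss(sal_before, sal_after).item())
```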

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !