⚠️ All of the summaries below are generated by a large language model and may contain errors; they are provided for reference only, so use them with caution.
🔴 Note: never use these for serious academic work; they are meant only as an initial screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-02
Data-to-Energy Stochastic Dynamics
Authors:Kirill Tamogashev, Nikolay Malkin
The Schr"odinger bridge problem is concerned with finding a stochastic dynamical system bridging two marginal distributions that minimises a certain transportation cost. This problem, which represents a generalisation of optimal transport to the stochastic case, has received attention due to its connections to diffusion models and flow matching, as well as its applications in the natural sciences. However, all existing algorithms allow to infer such dynamics only for cases where samples from both distributions are available. In this paper, we propose the first general method for modelling Schr"odinger bridges when one (or both) distributions are given by their unnormalised densities, with no access to data samples. Our algorithm relies on a generalisation of the iterative proportional fitting (IPF) procedure to the data-free case, inspired by recent developments in off-policy reinforcement learning for training of diffusion samplers. We demonstrate the efficacy of the proposed data-to-energy IPF on synthetic problems, finding that it can successfully learn transports between multimodal distributions. As a secondary consequence of our reinforcement learning formulation, which assumes a fixed time discretisation scheme for the dynamics, we find that existing data-to-data Schr"odinger bridge algorithms can be substantially improved by learning the diffusion coefficient of the dynamics. Finally, we apply the newly developed algorithm to the problem of sampling posterior distributions in latent spaces of generative models, thus creating a data-free image-to-image translation method. Code: https://github.com/mmacosha/d2e-stochastic-dynamics
Paper and project links
Summary
This paper studies the Schrödinger bridge problem, which seeks a stochastic dynamical system that connects two marginal distributions while minimising a given transportation cost. The problem, a stochastic generalisation of optimal transport, has drawn attention for its connections to diffusion models and flow matching and for its applications in the natural sciences. However, existing algorithms all require samples from both distributions. This paper proposes the first general method for modelling Schrödinger bridges even when one (or both) of the distributions is described only by an unnormalised density, with no data samples available. The algorithm generalises the iterative proportional fitting (IPF) procedure to the data-free setting, drawing on recent advances in reinforcement learning. The authors further find that existing data-to-data Schrödinger bridge algorithms can be substantially improved by learning the diffusion coefficient of the dynamics. Finally, they apply the new algorithm to sampling posterior distributions in the latent spaces of generative models, yielding a data-free image-to-image translation method.
Key Takeaways
- The Schrödinger bridge problem seeks a transport-cost-minimising stochastic dynamical system bridging two marginal distributions.
- Existing algorithms require samples from both distributions; the proposed method models Schrödinger bridges even when such samples are unavailable.
- The algorithm generalises the iterative proportional fitting (IPF) procedure to the data-free case, inspired by off-policy reinforcement learning.
- The method successfully learns transports between multimodal distributions.
- Learning the diffusion coefficient of the dynamics substantially improves existing data-to-data algorithms.
- The new algorithm is applied to sampling posterior distributions in the latent spaces of generative models.
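To make the mechanics concrete, here is a minimal, hypothetical sketch (not the paper's released code) of one data-to-data IPF half-step for a discretised SDE with a learnable diffusion coefficient: trajectories are simulated from the forward policy, and the backward drift is fitted by maximising the Gaussian transition likelihood. The data-to-energy variant described in the abstract replaces the data-dependent half-step with an off-policy RL objective driven by the unnormalised density; all names and hyperparameters below are illustrative.

```python
import torch
import torch.nn as nn

T, dt, dim = 20, 0.05, 2  # assumed discretisation: 20 Euler-Maruyama steps

class Drift(nn.Module):
    """Time-conditioned drift network with a learnable diffusion scale."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(),
                                 nn.Linear(64, dim))
        self.log_sigma = nn.Parameter(torch.zeros(()))  # learnable diffusion coeff.

    def forward(self, x, t):
        t_feat = torch.full_like(x[:, :1], t)
        return self.net(torch.cat([x, t_feat], dim=-1))

def simulate(drift, x0):
    """Euler-Maruyama rollout of the forward SDE; returns the full trajectory."""
    xs, sigma = [x0], drift.log_sigma.exp()
    for k in range(T):
        x = xs[-1]
        x = x + drift(x, k * dt) * dt + sigma * dt ** 0.5 * torch.randn_like(x)
        xs.append(x)
    return torch.stack(xs)  # shape (T + 1, batch, dim)

fwd, bwd = Drift(), Drift()
opt = torch.optim.Adam(bwd.parameters(), lr=1e-3)

# One IPF half-step: fit the backward drift (and its diffusion coefficient) by
# maximum likelihood on trajectories sampled from the current forward model.
x0 = torch.randn(256, dim)               # stand-in for source-marginal samples
traj = simulate(fwd, x0).detach()
sigma = bwd.log_sigma.exp()
loss = 0.0
for k in range(T, 0, -1):
    x_next, x_prev = traj[k], traj[k - 1]
    mean = x_next + bwd(x_next, k * dt) * dt          # reverse-time Euler step
    nll = 0.5 * ((x_prev - mean) ** 2).sum(-1) / (sigma ** 2 * dt) \
          + dim * sigma.log()                          # Gaussian NLL up to consts
    loss = loss + nll.mean()
opt.zero_grad(); loss.backward(); opt.step()
```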

Evaluating Foundation Models with Pathological Concept Learning for Kidney Cancer
Authors:Shangqi Gao, Sihan Wang, Yibo Gao, Boming Wang, Xiahai Zhuang, Anne Warren, Grant Stewart, James Jones, Mireia Crispin-Ortuzar
To evaluate the translational capabilities of foundation models, we develop a pathological concept learning approach focused on kidney cancer. By leveraging TNM staging guidelines and pathology reports, we build comprehensive pathological concepts for kidney cancer. Then, we extract deep features from whole slide images using foundation models, construct pathological graphs to capture spatial correlations, and train graph neural networks to identify these concepts. Finally, we demonstrate the effectiveness of this approach in kidney cancer survival analysis, highlighting its explainability and fairness in identifying low- and high-risk patients. The source code has been released at https://github.com/shangqigao/RadioPath.
Paper and project links
PDF Best Paper Award at MICCAI AMAI 2025
Summary
To evaluate the translational capabilities of foundation models, the authors develop a pathological concept learning approach centred on kidney cancer. Using TNM staging guidelines and pathology reports, they build comprehensive pathological concepts for kidney cancer, extract deep features from whole slide images, construct pathological graphs to capture spatial correlations, and train graph neural networks to identify these concepts. They validate the approach in kidney cancer survival analysis, highlighting its explainability and fairness in identifying low- and high-risk patients. The source code has been released at https://github.com/shangqigao/RadioPath.
Key Takeaways
- Evaluates the translational capabilities of foundation models on kidney cancer.
- Builds a comprehensive set of pathological concepts from TNM staging guidelines and pathology reports.
- Extracts deep features from whole slide images and constructs pathological graphs.
- Identifies the pathological concepts with graph neural networks.
- Validates the approach in kidney cancer survival analysis.
- The approach is explainable and fair in identifying low- and high-risk patients.
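As an illustration of the pipeline's shape (a sketch under assumed inputs, not the released RadioPath code), the following builds a spatial k-NN graph over foundation-model patch features and trains a small mean-aggregation GNN to predict one slide-level concept:

```python
import torch
import torch.nn as nn

# Hypothetical inputs: N patch embeddings from a pathology foundation model,
# plus the (x, y) position of each patch on the whole slide image.
N, D = 500, 768
feats = torch.randn(N, D)           # stand-in for foundation-model patch features
coords = torch.rand(N, 2) * 10_000  # stand-in for patch centres in pixels

# Spatial k-NN adjacency: connect each patch to its 8 nearest neighbours.
k = 8
dist = torch.cdist(coords, coords)
knn = dist.topk(k + 1, largest=False).indices[:, 1:]  # drop self-match
adj = torch.zeros(N, N)
adj.scatter_(1, knn, 1.0)
adj = ((adj + adj.T) > 0).float()   # symmetrise
deg = adj.sum(1, keepdim=True).clamp(min=1)

class ConceptGNN(nn.Module):
    """Two rounds of mean-neighbour message passing, then mean pooling to a
    slide-level score for one pathological concept (e.g. a TNM-derived stage)."""
    def __init__(self, d, hidden=256):
        super().__init__()
        self.lin1 = nn.Linear(d, hidden)
        self.lin2 = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x, adj, deg):
        h = torch.relu(self.lin1(adj @ x / deg))
        h = torch.relu(self.lin2(adj @ h / deg))
        return self.head(h.mean(0))  # slide-level logit

model = ConceptGNN(D)
logit = model(feats, adj, deg)
label = torch.ones(1)  # stand-in concept label from report annotations
loss = nn.functional.binary_cross_entropy_with_logits(logit, label)
```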

Dynamic Novel View Synthesis in High Dynamic Range
Authors:Kaixuan Zhang, Zhipeng Xiong, Minxian Li, Mingwu Ren, Jiankang Deng, Xiatian Zhu
High Dynamic Range Novel View Synthesis (HDR NVS) seeks to learn an HDR 3D model from Low Dynamic Range (LDR) training images captured under conventional imaging conditions. Current methods primarily focus on static scenes, implicitly assuming all scene elements remain stationary and non-living. However, real-world scenarios frequently feature dynamic elements, such as moving objects, varying lighting conditions, and other temporal events, thereby presenting a significantly more challenging scenario. To address this gap, we propose a more realistic problem named HDR Dynamic Novel View Synthesis (HDR DNVS), where the additional dimension "Dynamic" emphasizes the necessity of jointly modeling temporal radiance variations alongside sophisticated 3D translation between LDR and HDR. To tackle this complex, intertwined challenge, we introduce HDR-4DGS, a Gaussian Splatting-based architecture featured with an innovative dynamic tone-mapping module that explicitly connects HDR and LDR domains, maintaining temporal radiance coherence by dynamically adapting tone-mapping functions according to the evolving radiance distributions across the temporal dimension. As a result, HDR-4DGS achieves both temporal radiance consistency and spatially accurate color translation, enabling photorealistic HDR renderings from arbitrary viewpoints and time instances. Extensive experiments demonstrate that HDR-4DGS surpasses existing state-of-the-art methods in both quantitative performance and visual fidelity. Source code will be released.
Paper and project links
Summary
This paper introduces HDR Dynamic Novel View Synthesis (HDR DNVS), which extends HDR novel view synthesis to dynamic scenes. The proposed HDR-4DGS architecture couples Gaussian Splatting with a dynamic tone-mapping module, achieving temporal radiance consistency and spatially accurate colour translation for HDR rendering of dynamic scenes.
Key Takeaways
- HDR Dynamic Novel View Synthesis (HDR DNVS) addresses the dynamic elements that real-world scenes introduce into HDR novel view synthesis.
- HDR-4DGS is a Gaussian Splatting-based architecture with a dynamic tone-mapping module that explicitly connects the HDR and LDR domains.
- Temporal radiance coherence is maintained by dynamically adapting the tone-mapping functions to radiance distributions that evolve over time.
- HDR-4DGS renders photorealistic HDR images from arbitrary viewpoints and time instances.
- It surpasses existing state-of-the-art methods in both quantitative performance and visual fidelity.
- HDR-4DGS performs strongly and is expected to find broad application across scene types.
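The core idea of the dynamic tone-mapping module can be illustrated with a small time-conditioned network that maps rendered HDR radiance to LDR colour. This is a hypothetical sketch, not the actual HDR-4DGS module (whose source code has not yet been released):

```python
import torch
import torch.nn as nn

class DynamicToneMapper(nn.Module):
    """Maps HDR radiance to LDR colour with a tone curve conditioned on time,
    so the mapping can track radiance distributions that evolve across frames."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # LDR output in [0, 1]
        )

    def forward(self, hdr_rgb, t):
        # hdr_rgb: (..., 3) linear radiance from the dynamic Gaussian renderer;
        # t: scalar timestamp normalised to [0, 1].
        log_rad = torch.log1p(hdr_rgb)         # compress the dynamic range
        t_feat = torch.full_like(log_rad[..., :1], t)
        return self.net(torch.cat([log_rad, t_feat], dim=-1))

mapper = DynamicToneMapper()
hdr = torch.rand(4, 128, 128, 3) * 100.0  # stand-in HDR render, arbitrary scale
ldr = mapper(hdr, t=0.25)                 # supervise against the captured LDR frame
```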

OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata
Authors:Oussema Dhaouadi, Riccardo Marin, Johannes Meier, Jacques Kaiser, Daniel Cremers
Accurate visual localization from aerial views is a fundamental problem with applications in mapping, large-area inspection, and search-and-rescue operations. In many scenarios, these systems require high-precision localization while operating with limited resources (e.g., no internet connection or GNSS/GPS support), making large image databases or heavy 3D models impractical. Surprisingly, little attention has been given to leveraging orthographic geodata as an alternative paradigm, which is lightweight and increasingly available through free releases by governmental authorities (e.g., the European Union). To fill this gap, we propose OrthoLoC, the first large-scale dataset comprising 16,425 UAV images from Germany and the United States with multiple modalities. The dataset addresses domain shifts between UAV imagery and geospatial data. Its paired structure enables fair benchmarking of existing solutions by decoupling image retrieval from feature matching, allowing isolated evaluation of localization and calibration performance. Through comprehensive evaluation, we examine the impact of domain shifts, data resolutions, and covisibility on localization accuracy. Finally, we introduce a refinement technique called AdHoP, which can be integrated with any feature matcher, improving matching by up to 95% and reducing translation error by up to 63%. The dataset and code are available at: https://deepscenario.github.io/OrthoLoC.
Paper and project links
PDF Accepted at NeurIPS 2025
Summary
This paper targets high-precision UAV localization under limited resources, proposing orthographic geodata as a lightweight alternative to image databases and 3D models. The authors build OrthoLoC, a large-scale dataset of 16,425 UAV images from Germany and the United States, and introduce a refinement technique, AdHoP, that substantially improves localization accuracy. The dataset and code are available from the project website.
Key Takeaways
- Accurate UAV localization with limited resources (e.g., no internet or GNSS/GPS support) is a key problem in mapping, large-area inspection, and search-and-rescue.
- Orthographic geodata is proposed as an alternative paradigm: lightweight and increasingly freely available.
- OrthoLoC is a large-scale dataset of 16,425 UAV images from Germany and the United States that addresses domain shifts between UAV imagery and geospatial data.
- Its paired structure decouples image retrieval from feature matching, enabling isolated evaluation of localization and calibration performance.
- The study examines how domain shifts, data resolutions, and covisibility affect localization accuracy.
- AdHoP integrates with any feature matcher, improving matching by up to 95% and reducing translation error by up to 63%.
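For context on what the localization benchmark measures, here is a minimal sketch of recovering a 6-DoF UAV pose from feature matches against orthographic geodata: matched orthophoto pixels are lifted to 3D using a co-registered digital surface model (DSM), then passed to PnP with RANSAC. All inputs below are synthetic stand-ins, and AdHoP itself (the paper's matcher refinement) is not reproduced here:

```python
import numpy as np
import cv2

# Hypothetical inputs: 2D keypoints in the UAV image matched (by any feature
# matcher) to orthophoto pixels, plus a DSM giving a height per ortho pixel.
uav_pts = (np.random.rand(100, 2) * 1000).astype(np.float32)  # image pixels
ortho_pts = np.random.rand(100, 2) * 2000                     # orthophoto pixels
gsd = 0.2                             # assumed metres per orthophoto pixel
dsm = np.random.rand(2048, 2048) * 50 # stand-in surface heights in metres

# Lift each matched orthophoto pixel to a 3D point: planimetric position from
# the pixel grid, height from the co-registered DSM.
cols = ortho_pts[:, 0].astype(int).clip(0, dsm.shape[1] - 1)
rows = ortho_pts[:, 1].astype(int).clip(0, dsm.shape[0] - 1)
world = np.stack([ortho_pts[:, 0] * gsd,
                  ortho_pts[:, 1] * gsd,
                  dsm[rows, cols]], axis=1).astype(np.float32)

K = np.array([[1200, 0, 512], [0, 1200, 512], [0, 0, 1]], np.float32)  # intrinsics
ok, rvec, tvec, inliers = cv2.solvePnPRansac(world, uav_pts, K, None,
                                             reprojectionError=4.0)
if ok:
    R, _ = cv2.Rodrigues(rvec)
    cam_pos = (-R.T @ tvec).ravel()  # 6-DoF camera pose in geodata coordinates
```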

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model
Authors:Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid, Shadid Intisar Quasem, Maheen Fatima, Franco Vidal, Mykola Maslych, Ketan Pravin More, Sanoojan Baliah, Hasindri Watawana, Yuhao Li, Fabian Farestam, Leon Schaller, Roman Tymtsiv, Simon Weber, Hisham Cholakkal, Ivan Laptev, Shin’ichi Satoh, Michael Felsberg, Mubarak Shah, Salman Khan, Fahad Shahbaz Khan
Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs are in the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual Video LMM benchmark, named ViMUL-Bench, to evaluate Video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories including eight culturally diverse categories, ranging from lifestyles and festivals to foods and rituals and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we also introduce a machine translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high-and low-resource languages for video understanding. We hope our ViMUL-Bench and multilingual video LMM along with a large-scale multilingual video training set will help ease future research in developing cultural and linguistic inclusive multilingual video LMMs. Our proposed benchmark, video LMM and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.
Paper and project links
Summary
Large multimodal models (LMMs) have drawn attention for their effectiveness in understanding and describing visual content, yet most existing LMMs support only English. While some work has explored multilingual image LMMs, video LMMs remain confined to English. Towards more inclusive video LMMs, this paper proposes ViMUL-Bench, a multilingual video LMM benchmark covering 14 languages, including English, Chinese, Spanish, and other high- and low-resource languages. The benchmark spans 15 categories, including lifestyles, festivals, and more, with both open-ended and multiple-choice questions. The authors also introduce a machine-translated multilingual video training set and develop a simple multilingual video LMM, ViMUL. They hope that ViMUL-Bench, ViMUL, and the training set will support future research on culturally and linguistically inclusive multilingual video LMMs.
Key Takeaways
- Large multimodal models (LMMs) effectively understand and describe visual content.
- Most current LMMs support only English, lacking multilingual coverage.
- ViMUL-Bench is a multilingual video LMM benchmark covering 14 high- and low-resource languages.
- ViMUL-Bench spans many video categories and several question types.
- A machine-translated multilingual video training set is introduced.
- ViMUL, a simple multilingual video LMM, achieves a good trade-off between high- and low-resource languages.
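Evaluation on a benchmark like this largely reduces to per-language aggregation. The sketch below (with made-up records, not ViMUL-Bench's actual schema) computes per-language and macro-averaged multiple-choice accuracy:

```python
from collections import defaultdict

# Minimal per-language accuracy aggregation for multiple-choice items, assuming
# each record carries a language tag, the gold option, and the option predicted
# by some video LMM; the field names here are hypothetical.
records = [
    {"lang": "en", "gold": "B", "pred": "B"},
    {"lang": "ta", "gold": "A", "pred": "C"},
    {"lang": "sv", "gold": "D", "pred": "D"},
]

hits, totals = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["lang"]] += 1
    hits[r["lang"]] += int(r["pred"] == r["gold"])

per_lang = {lang: hits[lang] / totals[lang] for lang in totals}
# Macro-averaging weights low- and high-resource languages equally, which is
# the kind of trade-off the paper reports for ViMUL.
macro_acc = sum(per_lang.values()) / len(per_lang)
print(per_lang, macro_acc)
```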

Binary Diffusion Probabilistic Model
Authors:Vitaliy Kinakh, Slava Voloshynovskiy
We propose the Binary Diffusion Probabilistic Model (BDPM), a generative framework specifically designed for data representations in binary form. Conventional denoising diffusion probabilistic models (DDPMs) assume continuous inputs, use mean squared error objectives and Gaussian perturbations, i.e., assumptions that are not suited to discrete and binary representations. BDPM instead encodes images into binary representations using multi bit-plane and learnable binary embeddings, perturbs them via XOR-based noise, and trains a model by optimizing a binary cross-entropy loss. These binary representations offer fine-grained noise control, accelerate convergence, and reduce inference cost. On image-to-image translation tasks, such as super-resolution, inpainting, and blind restoration, BDPM based on a small denoiser and multi bit-plane representation outperforms state-of-the-art methods on FFHQ, CelebA, and CelebA-HQ using a few sampling steps. In class-conditional generation on ImageNet-1k, BDPM based on learnable binary embeddings achieves competitive results among models with both low parameter counts and a few sampling steps.
Paper and project links
Summary
This paper proposes the Binary Diffusion Probabilistic Model (BDPM), a generative framework designed for data represented in binary form. Unlike conventional denoising diffusion probabilistic models (DDPMs), which assume continuous inputs, BDPM encodes images into multi-bit-plane binary representations, perturbs them with XOR-based noise, and trains by optimising a binary cross-entropy loss. The model offers fine-grained noise control, faster convergence, and lower inference cost. On image-to-image translation tasks such as super-resolution, inpainting, and blind restoration, BDPM outperforms state-of-the-art methods on FFHQ, CelebA, and CelebA-HQ with only a few sampling steps. For class-conditional generation on ImageNet-1k, BDPM with learnable binary embeddings achieves competitive results among models with low parameter counts and few sampling steps.
Key Takeaways
- BDPM is a generative framework designed for binary data representations.
- Images are encoded with multi-bit-plane and learnable binary embeddings.
- Noise is applied via XOR, and training optimises a binary cross-entropy loss.
- Binary representations give fine-grained noise control, fast convergence, and low inference cost.
- On image translation tasks such as super-resolution, inpainting, and blind restoration, BDPM performs strongly.
- BDPM outperforms other state-of-the-art methods on FFHQ, CelebA, and CelebA-HQ.
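The XOR noising and BCE objective are simple to state in code. Below is a minimal sketch under an assumed linear flip-probability schedule; the paper's exact schedule, denoiser architecture, and bit-plane encoder may differ:

```python
import torch
import torch.nn as nn

# Forward "noising" for binary data: XOR each bit with Bernoulli noise whose
# flip probability grows with the timestep, so t = T yields unbiased coin flips.
T = 100
def noise(x_bits, t):
    flip_prob = 0.5 * t / T                     # assumed linear schedule
    flips = torch.bernoulli(torch.full_like(x_bits, flip_prob))
    return torch.logical_xor(x_bits.bool(), flips.bool()).float()

class Denoiser(nn.Module):
    """Predicts per-bit logits of the clean bits from the noisy ones."""
    def __init__(self, n_bits, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bits + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bits))

    def forward(self, x, t):
        t_feat = torch.full_like(x[:, :1], t / T)
        return self.net(torch.cat([x, t_feat], dim=-1))  # logits per bit

n_bits = 8 * 32 * 32  # e.g. 8 bit-planes of a 32x32 single-channel image
model = Denoiser(n_bits)
x0 = torch.randint(0, 2, (16, n_bits)).float()   # stand-in binary representation
t = torch.randint(1, T + 1, (1,)).item()
xt = noise(x0, t)
loss = nn.functional.binary_cross_entropy_with_logits(model(xt, t), x0)
loss.backward()
```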
