Published: 2025-11-19

Updated: 2025-11-27

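⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use them in serious academic settings. They are suitable only for initial screening before reading a paper!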

2025-11-19 Update

Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification

Authors: Rifen Lin, Alex Jinpeng Wang, Jiawei Mo, Min Li

Multimodal pretraining has revolutionized visual understanding, but its impact on video-based person re-identification (ReID) remains underexplored. Existing approaches often rely on video-text pairs, yet suffer from two fundamental limitations: (1) lack of genuine multimodal pretraining, and (2) text poorly captures fine-grained temporal motion, an essential cue for distinguishing identities in video. In this work, we take a bold departure from text-based paradigms by introducing the first skeleton-driven pretraining framework for ReID. To achieve this, we propose Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID), a novel two-stage method that leverages skeleton sequences as a spatiotemporally informative modality aligned with video frames. In the first stage, we employ contrastive learning to align skeleton and visual features at sequence level. In the second stage, we introduce a dynamic Prototype Fusion Updater (PFU) to refine multimodal identity prototypes, fusing motion and appearance cues. Moreover, we propose a Skeleton Guided Temporal Modeling (SGTM) module that distills temporal cues from skeleton data and integrates them into visual features. Extensive experiments demonstrate that CSIP-ReID achieves new state-of-the-art results on standard video ReID benchmarks (MARS, LS-VID, iLIDS-VID). Moreover, it exhibits strong generalization to skeleton-only ReID tasks (BIWI, IAS), significantly outperforming previous methods. CSIP-ReID pioneers an annotation-free and motion-aware pretraining paradigm for ReID, opening a new frontier in multimodal representation learning.

Paper and project links

PDF

Summary

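This paper proposes a skeleton-driven pretraining framework for video-based person re-identification, named Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID). The method has two stages: the first uses contrastive learning to align skeleton and visual features at the sequence level; the second uses a dynamic Prototype Fusion Updater (PFU) to refine multimodal identity prototypes, fusing motion and appearance cues. It also introduces a Skeleton Guided Temporal Modeling (SGTM) module that distills temporal cues from skeleton data and integrates them into visual features. Experiments show that CSIP-ReID achieves state-of-the-art results on standard video ReID benchmarks and generalizes well to skeleton-only ReID tasks.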

Key Takeaways

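Introduces a skeleton-driven pretraining framework to address the lack of genuine multimodal pretraining in video-based person re-identification.
Proposes Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID), a two-stage method covering feature alignment and identity-prototype refinement.
Stage one uses contrastive learning to align skeleton and visual features at the sequence level.
Stage two uses the dynamic Prototype Fusion Updater (PFU) to fuse motion and appearance cues, refining identity prototypes for more accurate identification.
Introduces the Skeleton Guided Temporal Modeling (SGTM) module, which distills temporal cues from skeleton data and integrates them into visual features.
CSIP-ReID achieves state-of-the-art results on standard video ReID benchmarks.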
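
The sequence-level alignment in stage one can be illustrated with a CLIP-style symmetric contrastive (InfoNCE) loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not specify the loss form, and the function names, the temperature of 0.07, and the convention that row i of each input comes from the same tracklet are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(skeleton_feats, visual_feats, temperature=0.07):
    """Hypothetical sequence-level alignment loss.

    Assumption: skeleton_feats[i] and visual_feats[i] are pooled features
    of the same video tracklet; every other pairing is a negative.
    """
    s = [l2_normalize(f) for f in skeleton_feats]
    v = [l2_normalize(f) for f in visual_feats]
    # Cosine similarity between every skeleton/visual pair, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(si, vj)) / temperature for vj in v]
              for si in s]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the skeleton-to-visual and visual-to-skeleton directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

When matched pairs dominate the diagonal of the similarity matrix, the loss approaches zero; mismatched pairs raise it, which is the signal that pulls the two modalities together.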

ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video

Authors: Rajan Das Gupta, Md Yeasin Rahat, Nafiz Fahad, Abir Ahmed, Liew Tze Hui

This study investigates how large language models (LLMs) can be used to understand human behavior using motion and video data. We think that mixing both types is essential to completely capture the nuanced movements and meanings of human actions, in contrast to recent models that simply concentrate on motion data or films. To address this, we provide ViMoNet, a straightforward yet effective framework for comprehending, characterizing, and deducing human action. ViMoNet employs a joint training strategy that leverages the advantages of two data types: detailed motion-text data, which is more exact, and generic video-text data, which is more comprehensive but less detailed. This aids in the model’s acquisition of rich data regarding time and space in human behavior. Additionally, we provide a brand new dataset named VIMOS that contains a variety of films, motion sequences, instructions, and subtitles. We developed ViMoNet-Bench, a standardized benchmark with carefully labeled samples, to evaluate how well models understand human behavior. Our tests show that ViMoNet outperforms existing methods in caption generation, motion understanding, and behavior interpretation.

Paper and project links

PDF This is the preprint version of the manuscript. It is currently being prepared for submission to an academic conference

Summary

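This study investigates how large language models (LLMs) can understand human behavior from motion and video data. It argues that combining both data types is essential for fully capturing the nuanced movements and meanings of human actions, in contrast to recent models that focus only on motion data or only on films. To this end, it proposes ViMoNet, a simple yet effective framework for understanding, characterizing, and reasoning about human behavior. ViMoNet adopts a joint training strategy that exploits the strengths of two data types: detailed motion-text data, which is more precise, and generic video-text data, which is more comprehensive but less detailed. This helps the model acquire rich spatial and temporal information about human behavior. The study also introduces a new dataset, VIMOS, containing a variety of films, motion sequences, instructions, and subtitles. To evaluate how well models understand human behavior, it further develops ViMoNet-Bench, a standardized benchmark with carefully labeled samples. Tests show that ViMoNet outperforms existing methods in caption generation, motion understanding, and behavior interpretation.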

Key Takeaways

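Large language models (LLMs) can be used to understand human behavior from motion and video data.
Combining motion-text data with video-text data is essential for fully understanding human behavior.
ViMoNet uses a joint training strategy to exploit the strengths of both data types.
ViMoNet is designed for understanding, characterizing, and reasoning about human behavior.
The new VIMOS dataset contains several kinds of media content to support research on human behavior understanding.
The ViMoNet-Bench standardized benchmark evaluates how well models understand human behavior.
ViMoNet outperforms existing methods in caption generation, motion understanding, and behavior interpretation.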
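
One simple way to picture joint training on two data sources is a batch sampler that draws a fixed share of each batch from every source. The sketch below is illustrative only: the summary does not say how ViMoNet mixes its data, so the function name `mixed_batches`, the `motion_ratio` parameter, and the fixed-share scheme are assumptions.

```python
import random

def mixed_batches(motion_text, video_text, batch_size, motion_ratio=0.5, seed=0):
    """Yield batches that mix two data sources (hypothetical scheme).

    Assumption: each batch takes round(batch_size * motion_ratio) samples
    from the motion-text pool and the rest from the video-text pool.
    Each pool must hold at least as many items as its per-batch share.
    """
    rng = random.Random(seed)
    n_motion = round(batch_size * motion_ratio)
    n_video = batch_size - n_motion
    while True:
        batch = rng.sample(motion_text, n_motion) + rng.sample(video_text, n_video)
        rng.shuffle(batch)  # avoid a fixed source order inside the batch
        yield batch
```

For example, `next(mixed_batches(motion, video, batch_size=8, motion_ratio=0.25))` would return eight samples, two of them from the motion-text pool. A fixed ratio keeps the smaller, more precise source from being drowned out by the larger one.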

X-MoGen: Unified Motion Generation across Humans and Animals

Authors: Xuan Wang, Kai Ruan, Liyang Qian, Zhizhi Guo, Chang Su, Gaoang Wang

Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.

Paper and project links

PDF

Summary

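This paper presents X-MoGen, a two-stage framework for cross-species text-driven motion generation covering both humans and animals, with applications such as virtual reality. In the first stage, a conditional graph variational autoencoder learns canonical T-pose priors while an autoencoder encodes motion into a shared latent space regularized by a morphological loss. In the second stage, masked motion modeling generates motion embeddings conditioned on text. During training, a morphological consistency module promotes skeletal plausibility across species. The large-scale UniMo4D dataset supports joint training on human and animal motion. Experiments show that X-MoGen outperforms existing methods on both seen and unseen species.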

Key Takeaways

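X-MoGen is a cross-species text-driven motion generation method with broad applications in virtual reality, animation, and robotics.
X-MoGen uses a two-stage architecture: stage one learns canonical T-pose priors and a regularized shared latent space; stage two generates motion embeddings conditioned on textual descriptions.
X-MoGen addresses morphological differences across species with a morphological loss and a morphological consistency module, improving motion plausibility.
UniMo4D is a large-scale dataset that unifies human and animal motion under a shared skeletal topology, with 115 species and 119k motion sequences, and supports joint cross-species training.
Experiments on UniMo4D show that X-MoGen outperforms existing methods on both seen and unseen species.
X-MoGen has the potential to give virtual characters more realistic and diverse motion.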
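
The masked motion modeling used in stage two can be sketched as a corruption step: hide a random subset of discrete motion tokens and keep the originals as prediction targets. This is a generic illustration, not X-MoGen's actual procedure. The `MASK_ID` value, the `mask_ratio` default, and the assumption that motion is represented as a list of integer tokens are all hypothetical.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the [MASK] token

def mask_motion_tokens(tokens, mask_ratio=0.5, seed=None):
    """Replace a random subset of motion tokens with MASK_ID.

    Returns the corrupted sequence plus a dict {position: original token}
    that a model would be trained to predict, conditioned on the text.
    Assumes `tokens` is a non-empty list of integer token ids.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))  # mask at least one token
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for pos in masked_positions:
        corrupted[pos] = MASK_ID
    targets = {pos: tokens[pos] for pos in masked_positions}
    return corrupted, targets
```

A model trained this way only receives a loss on the masked positions, so the unmasked tokens and the text prompt act as context for filling in the gaps.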

RadarLLM: Empowering Large Language Models to Understand Human Motion from Millimeter-Wave Point Cloud Sequence

Authors: Zengyuan Lai, Jiarui Yang, Songpengcheng Xia, Lizhou Lin, Lan Sun, Renwen Wang, Jianran Liu, Qi Wu, Ling Pei

Millimeter-wave radar offers a privacy-preserving and environment-robust alternative to vision-based sensing, enabling human motion analysis in challenging conditions such as low light, occlusions, rain, or smoke. However, its sparse point clouds pose significant challenges for semantic understanding. We present RadarLLM, the first framework that leverages large language models (LLMs) for human motion understanding from radar signals. RadarLLM introduces two key innovations: (1) a motion-guided radar tokenizer based on our Aggregate VQ-VAE architecture, integrating deformable body templates and masked trajectory modeling to convert spatial-temporal radar sequences into compact semantic tokens; and (2) a radar-aware language model that establishes cross-modal alignment between radar and text in a shared embedding space. To overcome the scarcity of paired radar-text data, we generate a realistic radar-text dataset from motion-text datasets with a physics-aware synthesis pipeline. Extensive experiments on both synthetic and real-world benchmarks show that RadarLLM achieves state-of-the-art performance, enabling robust and interpretable motion understanding under privacy and visibility constraints, even in adverse environments. This paper has been accepted for presentation at AAAI 2026. This is an extended version with supplementary materials.

Paper and project links

PDF Accepted by AAAI 2026 (extended version with supplementary materials)

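Summary

Millimeter-wave radar offers a privacy-preserving, environment-robust alternative to vision-based sensing for human motion analysis, particularly under challenging conditions such as low light, occlusion, or rain. However, its sparse point clouds make semantic understanding difficult. RadarLLM is presented as the first framework that uses large language models (LLMs) to understand human motion from radar signals. It introduces two key innovations: (1) a motion-guided radar tokenizer based on the Aggregate VQ-VAE architecture, which integrates deformable body templates and masked trajectory modeling to convert spatiotemporal radar sequences into compact semantic tokens; and (2) a radar-aware language model that establishes cross-modal alignment between radar and text in a shared embedding space. To overcome the scarcity of paired radar-text data, a physics-aware synthesis pipeline generates a realistic radar-text dataset from motion-text datasets. Experiments show that RadarLLM achieves state-of-the-art performance on both synthetic and real-world benchmarks, enabling robust and interpretable motion understanding under privacy and visibility constraints, even in adverse environments. The paper has been accepted for presentation at AAAI 2026; this is the extended version with supplementary materials.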
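
The idea of turning continuous radar features into discrete tokens can be illustrated with the quantization step common to VQ-VAE models: each feature vector is replaced by the index of its nearest codebook entry. This is a minimal sketch of generic vector quantization, not the paper's Aggregate VQ-VAE; the function name `quantize` and the toy codebook are assumptions.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Generic VQ lookup under squared Euclidean distance. Assumes every
    feature and codebook vector has the same dimensionality.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]
```

The resulting integer indices are what a language model can consume as tokens, in the same way it consumes word-piece ids.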

Key Takeaways