⚠️ All of the summaries below are generated by a large language model and may contain errors; they are provided for reference only and should be used with caution.
🔴 Please note: never rely on them for serious academic work; they are intended only as a first-pass screen before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-18
HI-TransPA: Hearing Impairments Translation Personal Assistant
Authors: Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng
Hearing-impaired individuals often face significant barriers in daily communication due to the inherent challenges of producing clear speech. To address this, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with lip dynamics, enabling both translation and dialogue within a single multimodal framework. To address the distinctive pronunciation patterns of hearing-impaired speech and the limited adaptability of existing models, we develop a multimodal preprocessing and curation pipeline that detects facial landmarks, stabilizes the lip region, and quantitatively evaluates sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. Architecturally, we employ a novel unified 3D-Resampler to efficiently encode the lip dynamics, which is critical for accurate interpretation. Experiments on the purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. Our work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
Paper and project links
Summary
The Omni-Model paradigm is introduced into assistive technology through HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with lip dynamics to support both translation and dialogue within a single multimodal framework. To handle the distinctive pronunciation patterns of hearing-impaired speakers and the limited adaptability of existing models, a multimodal preprocessing and curation pipeline detects facial landmarks, stabilizes the lip region, and quantitatively scores sample quality. Experiments show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity.
Key Takeaways
- The Omni-Model paradigm is brought into assistive technology to help hearing-impaired users communicate in daily life.
- HI-TransPA is an instruction-driven audio-visual personal assistant that fuses speech with lip dynamics to support both translation and dialogue.
- A multimodal preprocessing and curation pipeline, tailored to the distinctive pronunciation patterns of hearing-impaired speakers, detects facial landmarks, stabilizes the lip region, and scores sample quality.
- The quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and then gradually introduces harder cases to strengthen robustness (a minimal sketch follows this list).
- A novel unified 3D-Resampler efficiently encodes lip dynamics, which is critical for accurate interpretation.
- HI-TransPA achieves state-of-the-art performance on the purpose-built HI-Dialogue dataset in both literal accuracy and semantic fidelity.
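The paper describes the quality-guided curriculum only at a high level, so the following is a minimal Python sketch of how per-sample quality scores might drive a staged training schedule. The `Sample` class, the score range, and the equal-thirds staging are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch of a quality-guided curriculum (assumed scoring scale and
# staging; not the authors' released code).
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    clip_id: str
    quality: float  # assumed in [0, 1]; higher = cleaner lip/audio sample


def curriculum_pools(samples: List[Sample], num_stages: int = 3) -> List[List[Sample]]:
    """Return one training pool per stage: stage 1 uses only the cleanest
    samples, later stages progressively admit lower-quality ones."""
    ranked = sorted(samples, key=lambda s: s.quality, reverse=True)
    pools = []
    for stage in range(1, num_stages + 1):
        cutoff = int(len(ranked) * stage / num_stages)
        pools.append(ranked[:cutoff])
    return pools


if __name__ == "__main__":
    data = [Sample(f"clip_{i:03d}", q)
            for i, q in enumerate([0.95, 0.42, 0.88, 0.67, 0.31, 0.79])]
    for stage, pool in enumerate(curriculum_pools(data), start=1):
        worst = min(s.quality for s in pool)
        print(f"stage {stage}: {len(pool)} samples, lowest quality {worst:.2f}")
```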
Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence
Authors: Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Hanzhe Shan, Zhenwei Niu, Zhaoyang Liu, Shuang Liu, Yue Zhao, Junbo Qi, Qinfan Zhang, Dengjie Li, Yidong Wang, Jiachen Luo, Yong Dai, Zenglin Xu, Bin Shen, Qifan Wang, Jian Tang, Xiaozhu Ju
This report presents Pelican-VL 1.0, a new family of open-source embodied brain models with parameter scales ranging from 7 billion to 72 billion. Our mission is explicitly stated as: to embed powerful intelligence into various embodiments. Pelican-VL 1.0 is currently the largest-scale open-source embodied multimodal brain model. Its core advantage lies in the in-depth integration of data power and intelligent adaptive learning mechanisms. Specifically, a metaloop distilled a high-quality dataset from a raw dataset containing 4+ billion tokens. Pelican-VL 1.0 is trained on a large-scale cluster of 1000+ A800 GPUs, consuming over 50k A800 GPU-hours per checkpoint. This yields a 20.3% performance uplift over its base model and outperforms 100B-level open-source counterparts by 10.6%, placing it on par with leading proprietary systems on well-known embodied benchmarks. We establish a novel framework, DPPO (Deliberate Practice Policy Optimization), inspired by human metacognition, to train Pelican-VL 1.0. We operationalize this as a metaloop that teaches the AI to practice deliberately: an RL-Refine-Diagnose-SFT loop.
Paper and project links
Summary
Pelican-VL 1.0 is a new family of open-source embodied brain models with parameter scales ranging from 7 billion to 72 billion. Its core advantage lies in the deep integration of data power with intelligent adaptive learning mechanisms. A high-quality dataset is distilled from a raw dataset of over 4 billion tokens, and the model is trained on a large-scale cluster of more than 1,000 A800 GPUs. Training uses the novel DPPO (Deliberate Practice Policy Optimization) framework, which is inspired by human metacognition and teaches the AI to practice deliberately. Pelican-VL 1.0 is currently the largest-scale open-source embodied multimodal brain model: it improves on its base model by 20.3%, outperforms other open-source large models, and is on par with leading proprietary systems on well-known embodied benchmarks.
Key Takeaways
- Pelican-VL 1.0 is a new family of open-source embodied brain models covering a wide range of parameter scales (7B to 72B).
- Its core advantage is the deep integration of data power with intelligent adaptive learning mechanisms.
- A high-quality dataset is distilled from a raw dataset containing over 4 billion tokens.
- Training runs on a large-scale cluster of more than 1,000 A800 GPUs.
- Pelican-VL 1.0 improves on its base model by 20.3% and performs strongly on well-known embodied benchmarks.
- The model is trained with the novel DPPO framework, which is inspired by human metacognition and operationalized as an RL-Refine-Diagnose-SFT metaloop (a schematic sketch of the loop follows this list).
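The report describes DPPO only at the level of an RL-Refine-Diagnose-SFT loop, so the Python sketch below merely illustrates that control flow. Every stage function is a placeholder standing in for an unreleased component; the names, signatures, and dummy data are assumptions, not Pelican-VL's training code.

```python
# Schematic sketch of the RL-Refine-Diagnose-SFT "metaloop"; all stage
# functions are placeholders, not the actual DPPO implementation.

def rl_step(model, tasks):
    """Placeholder: collect rollouts and apply a policy-optimization update."""
    return model

def refine(raw_pool):
    """Placeholder: distill a higher-quality subset from the raw data pool."""
    return raw_pool[: max(1, len(raw_pool) // 2)]

def diagnose(model, eval_suite):
    """Placeholder: probe the model and return its current failure cases."""
    return [case for case in eval_suite if not case["passed"]]

def sft_step(model, curated, failures):
    """Placeholder: supervised fine-tuning targeted at diagnosed weaknesses."""
    return model

def metaloop(model, raw_pool, eval_suite, iterations=3):
    """Run the deliberate-practice loop: RL -> Refine -> Diagnose -> SFT."""
    for it in range(iterations):
        model = rl_step(model, raw_pool)            # RL
        curated = refine(raw_pool)                  # Refine
        failures = diagnose(model, eval_suite)      # Diagnose
        model = sft_step(model, curated, failures)  # SFT
        print(f"iteration {it}: {len(curated)} curated samples, "
              f"{len(failures)} diagnosed failures")
    return model

if __name__ == "__main__":
    dummy_eval = [{"passed": i % 4 == 0} for i in range(20)]
    metaloop(model=object(), raw_pool=list(range(100)),
             eval_suite=dummy_eval, iterations=2)
```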
NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning
Authors: Sahil Shah, S P Sharan, Harsh Goel, Minkyu Choi, Mustafa Munir, Manvik Pasula, Radu Marculescu, Sandeep Chinchali
While vision-language models (VLMs) excel at tasks involving single images or short videos, they still struggle with Long Video Question Answering (LVQA) due to its demand for complex multi-step temporal reasoning. Vanilla approaches, which simply sample frames uniformly and feed them to a VLM along with the question, incur significant token overhead. This forces aggressive downsampling of long videos, causing models to miss fine-grained visual structure, subtle event transitions, and key temporal cues. Recent works attempt to overcome these limitations through heuristic approaches; however, they lack explicit mechanisms for encoding temporal relationships and fail to provide any formal guarantees that the sampled context actually encodes the compositional or causal logic required by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA first translates a natural language question into a logic specification that models the temporal relationship between frame-level events. Next, we construct a video automaton to model the video’s frame-by-frame event progression, and finally employ model checking to compare the automaton against the specification to identify all video segments that satisfy the question’s logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on the LongVideoBench and CinePile LVQA benchmarks show that NeuS-QA significantly improves performance by over 10%, particularly on questions involving event ordering, causality, and multi-step reasoning. We open-source our code at https://utaustin-swarmlab.github.io/NeuS-QA/.
Paper and project links
Summary
To address the challenges of Long Video Question Answering (LVQA), this work proposes NeuS-QA, a training-free neuro-symbolic pipeline. The method translates a natural language question into a logic specification that models the temporal relationships between frame-level events, and uses model checking to verify a video automaton of the frame-by-frame event progression against that specification. The approach improves LVQA performance, particularly on questions involving event ordering, causality, and multi-step reasoning.
Key Takeaways
- Long Video Question Answering (LVQA) demands complex multi-step temporal reasoning, which existing vision-language models (VLMs) struggle with.
- Existing approaches uniformly sample frames and feed them to a VLM along with the question, incurring heavy token overhead and forcing aggressive downsampling of long videos.
- Recent work tries to overcome these limits with heuristics, but lacks explicit mechanisms for encoding temporal relationships and cannot guarantee that the sampled context actually encodes the compositional or causal logic the question requires.
- NeuS-QA addresses this by translating the natural language question into a logic specification that models the temporal relationships between frame-level events.
- NeuS-QA builds a video automaton that models the video's frame-by-frame event progression and uses model checking to verify it against the specification.
- Only the logic-verified segments are submitted to the VLM, which improves interpretability, reduces hallucinations, and enables compositional reasoning without modifying or fine-tuning the model (a toy sketch of this check follows this list).
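As a rough illustration of the neuro-symbolic idea (and not the paper's temporal-logic formalism or code), the toy Python sketch below checks per-frame event labels against a simple "A before B" ordering specification and reports only the satisfying spans, which is the role the logic-verified segments play before being handed to the VLM. The event names and the two-state check are illustrative assumptions.

```python
# Toy stand-in for the logic-verification step: find segments in which event
# `first` is later followed by event `second` (an "A before B" specification).
# Event labels and the simple two-state check are illustrative assumptions.
from typing import List, Set, Tuple


def segments_satisfying_order(frame_events: List[Set[str]],
                              first: str, second: str) -> List[Tuple[int, int]]:
    """Return (start, end) frame spans satisfying: `first` occurs, then `second`."""
    spans = []
    start = None
    for i, events in enumerate(frame_events):
        if start is None and first in events:
            start = i                   # state: waiting for the second event
        elif start is not None and second in events:
            spans.append((start, i))    # accepting: first ... second observed
            start = None                # reset to find further matches
    return spans


if __name__ == "__main__":
    # Hypothetical per-frame detections for a short clip.
    frames = [{"person_enters"}, set(), {"picks_up_cup"}, set(),
              {"person_enters"}, {"sits_down"}, {"picks_up_cup"}]
    print(segments_satisfying_order(frames, "person_enters", "picks_up_cup"))
    # -> [(0, 2), (4, 6)]: only these spans would be forwarded to the VLM.
```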