


⚠️ All of the summaries below are generated by a large language model; they may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never use them in serious academic settings — they are only intended as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

2025-11-05 Update

LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection

Authors:Adam S. Jovine, Tinghan Ye, Francis Bahk, Jingjing Wang, David B. Shmoys, Peter I. Frazier

Human experts often struggle to select the best option from a large set of items with multiple competing objectives, a process bottlenecked by the difficulty of formalizing complex, implicit preferences. To address this, we introduce LISTEN, a framework that leverages a Large Language Model (LLM) as a zero-shot preference oracle, guided only by an expert’s high-level priorities in natural language. To operate within LLM constraints like context windows and inference costs, we propose two iterative algorithms: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that performs tournament-style selections over small batches of solutions. Evaluated on diverse tasks including flight booking, shopping, and exam scheduling, our results show LISTEN-U excels when preferences are parametrically aligned (a property we measure with a novel concordance metric), while LISTEN-T offers more robust performance. This work explores a promising direction for steering complex multi-objective decisions directly with natural language, reducing the cognitive burden of traditional preference elicitation.
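The abstract describes LISTEN-T only at a high level. Below is a minimal, illustrative sketch of tournament-style selection over small candidate batches; the `tournament_select` and `pick_best` names, the toy flight data, and all parameters are hypothetical stand-ins for the paper's LLM preference oracle, not the authors' implementation.

```python
import random
from typing import Callable, List, Sequence


def tournament_select(
    solutions: Sequence[dict],
    pick_best: Callable[[List[dict]], int],
    batch_size: int = 8,
) -> dict:
    """Tournament-style selection over small batches, in the spirit of LISTEN-T.

    `pick_best` stands in for the LLM preference oracle: given a small batch of
    candidates (with the expert's natural-language priorities baked into its
    prompt), it returns the index of the preferred one. Each round keeps one
    winner per batch, so no single call ever sees more than `batch_size` items.
    """
    pool = list(solutions)
    while len(pool) > 1:
        random.shuffle(pool)
        batches = [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
        pool = [batch[pick_best(batch)] for batch in batches]
    return pool[0]


# Toy usage with a stand-in oracle that simply prefers the cheapest flight; in
# LISTEN-T this callable would instead prompt an LLM with the expert's priorities.
flights = [{"price": 200 + 37 * k, "stops": k % 3} for k in range(20)]
cheapest = tournament_select(
    flights,
    pick_best=lambda batch: min(range(len(batch)), key=lambda i: batch[i]["price"]),
)
print(cheapest)  # {'price': 200, 'stops': 0}
```

Because each oracle call sees at most `batch_size` candidates, prompt length and inference cost stay bounded no matter how large the solution pool is.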


Paper and Project Links

PDF

Summary

The LISTEN framework uses a Large Language Model (LLM) as a zero-shot preference oracle, guided only by an expert's high-level priorities expressed in natural language, to tackle the problem of selecting the best option from a large set of items. It proposes two iterative algorithms, LISTEN-U and LISTEN-T, to work within LLM constraints such as context windows and inference costs. Across diverse tasks including flight booking, shopping, and exam scheduling, LISTEN-U excels when preferences are parametrically aligned, while LISTEN-T offers more robust performance. This work explores a promising direction for steering complex multi-objective decisions directly with natural language, reducing the cognitive burden of traditional preference elicitation.

Key Takeaways

  1. The LISTEN framework uses a Large Language Model (LLM) as a zero-shot preference oracle to help experts choose among large sets of candidate items.
  2. It proposes two iterative algorithms, LISTEN-U and LISTEN-T, to work within LLM context-window and inference-cost constraints.
  3. LISTEN-U excels when preferences are parametrically aligned, making it suited to such settings.
  4. LISTEN-T delivers more robust performance across a wider range of conditions.
  5. The work explores steering complex multi-objective decisions directly with natural language.
  6. The LISTEN framework reduces the cognitive burden of traditional preference elicitation.

Cool Papers

Click here to view paper screenshots

Listening without Looking: Modality Bias in Audio-Visual Captioning

Authors:Yuchi Ishikawa, Toranosuke Manabe, Tatsuya Komatsu, Yoshimitsu Aoki

Audio-visual captioning aims to generate holistic scene descriptions by jointly modeling sound and vision. While recent methods have improved performance through sophisticated modality fusion, it remains unclear to what extent the two modalities are complementary in current audio-visual captioning models and how robust these models are when one modality is degraded. We address these questions by conducting systematic modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model, in which we selectively suppress or corrupt the audio or visual streams to quantify sensitivity and complementarity. The analysis reveals a pronounced bias toward the audio stream in LAVCap. To evaluate how balanced audio-visual captioning models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps dataset. In our experiments, we report LAVCap baseline results on AudioVisualCaps. We also evaluate the model under modality robustness tests on AudioVisualCaps and the results indicate that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.
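As a rough illustration of the "selectively suppress or corrupt" protocol described above, the sketch below perturbs one feature stream while leaving the other intact; the function name, feature shapes, and noise model are assumptions for illustration, not LAVCap's actual test setup.

```python
import numpy as np


def perturb_modality(audio_feats, visual_feats, target="audio", mode="suppress",
                     noise_std=1.0, seed=0):
    """Return copies of the two feature streams with one of them degraded:
    'suppress' zeroes out the chosen stream, 'corrupt' adds Gaussian noise.
    Feeding the result to the captioning model and measuring the drop in
    caption quality indicates how much the model relies on that stream.
    """
    rng = np.random.default_rng(seed)
    a = np.array(audio_feats, dtype=float, copy=True)
    v = np.array(visual_feats, dtype=float, copy=True)
    chosen = a if target == "audio" else v
    if mode == "suppress":
        chosen[:] = 0.0
    elif mode == "corrupt":
        chosen += rng.normal(0.0, noise_std, size=chosen.shape)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return a, v


# Example: keep audio intact, corrupt the visual stream (shapes are made up).
audio = np.random.randn(64, 512)    # (time frames, audio feature dim)
visual = np.random.randn(64, 768)   # (time frames, visual feature dim)
audio_in, visual_in = perturb_modality(audio, visual, target="visual", mode="corrupt")
```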


Paper and Project Links

PDF under review

Summary

This paper studies audio-visual captioning by running systematic modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model, to examine how complementary the audio and visual modalities are and how robust the model is when one modality is degraded. By selectively suppressing or corrupting the audio or visual streams, the analysis reveals a pronounced bias toward the audio stream. To evaluate how balanced such models are in their use of both modalities, the authors augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps dataset. Experiments show that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.

Key Takeaways

  1. Audio-visual captioning aims to generate holistic scene descriptions by jointly modeling sound and vision.
  2. Modality robustness tests on LAVCap reveal a pronounced bias toward the audio stream.
  3. AudioCaps is augmented with joint audio-visual text annotations to form the AudioVisualCaps dataset, used to assess how balanced models are in their use of both modalities.
  4. LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.
  5. Recent audio-visual captioning models improve performance through sophisticated modality fusion.
  6. The systematic tests quantify how complementary the audio and visual modalities are in current audio-visual captioning models.

Cool Papers

Click here to view paper screenshots

How Pragmatics Shape Articulation: A Computational Case Study in STEM ASL Discourse

Authors:Saki Imai, Lee Kezar, Laurel Aichler, Mert Inan, Erin Walker, Alicia Wooten, Lorna Quandt, Malihe Alikhani

Most state-of-the-art sign language models are trained on interpreter or isolated vocabulary data, which overlooks the variability that characterizes natural dialogue. However, human communication dynamically adapts to contexts and interlocutors through spatiotemporal changes and articulation style. This specifically manifests itself in educational settings, where novel vocabularies are used by teachers, and students. To address this gap, we collect a motion capture dataset of American Sign Language (ASL) STEM (Science, Technology, Engineering, and Mathematics) dialogue that enables quantitative comparison between dyadic interactive signing, solo signed lecture, and interpreted articles. Using continuous kinematic features, we disentangle dialogue-specific entrainment from individual effort reduction and show spatiotemporal changes across repeated mentions of STEM terms. On average, dialogue signs are 24.6%-44.6% shorter in duration than the isolated signs, and show significant reductions absent in monologue contexts. Finally, we evaluate sign embedding models on their ability to recognize STEM signs and approximate how entrained the participants become over time. Our study bridges linguistic analysis and computational modeling to understand how pragmatics shape sign articulation and its representation in sign language technologies.
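The reported 24.6%-44.6% figure is a relative duration reduction. The small sketch below shows one way such a statistic could be computed from per-sign segment durations; the data format, helper name, and toy numbers are hypothetical, not the authors' motion-capture pipeline.

```python
from statistics import mean


def mean_duration_reduction(dialogue_durations, isolated_durations):
    """Average percentage by which dialogue signs are shorter than isolated signs.

    Both arguments map a sign gloss to a list of segment durations in seconds
    (a hypothetical format used only for this illustration).
    """
    reductions = []
    for gloss, iso in isolated_durations.items():
        dia = dialogue_durations.get(gloss)
        if dia:
            reductions.append(100.0 * (mean(iso) - mean(dia)) / mean(iso))
    return mean(reductions) if reductions else 0.0


# A sign averaging 0.80 s in isolation and 0.52 s in dialogue is ~35% shorter,
# which falls inside the reported 24.6%-44.6% range.
print(mean_duration_reduction({"ATOM": [0.50, 0.54]}, {"ATOM": [0.78, 0.82]}))  # ~35.0
```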


Paper and Project Links

PDF

Summary

This paper addresses the shortcoming that existing sign language models are trained on interpreter or isolated vocabulary data and therefore miss the variability of natural dialogue, which is especially pronounced in educational settings. The authors collect a motion capture dataset of American Sign Language (ASL) STEM (Science, Technology, Engineering, and Mathematics) dialogue and compare dialogue signing with isolated signs. Dialogue signs are on average 24.6%-44.6% shorter in duration than isolated signs, and show significant reductions that are absent in monologue contexts. The study also evaluates sign embedding models on recognizing STEM signs and on approximating how entrained the participants become over time, bridging linguistic analysis and computational modeling to understand how pragmatics shape sign articulation and its representation in sign language technologies.

Key Takeaways

  1. Most state-of-the-art sign language models are trained on interpreter or isolated vocabulary data, overlooking the variability of natural dialogue.
  2. Sign use in educational settings is distinctive, with teachers and students using novel vocabularies.
  3. A motion capture dataset of American Sign Language (ASL) STEM dialogue enables comparison between dyadic interactive signing, solo signed lecture, and interpreted articles.
  4. Dialogue signs are 24.6%-44.6% shorter in duration than isolated signs, with significant reductions absent in monologue contexts.
  5. Sign embedding models are evaluated on their ability to recognize STEM signs.
  6. The embedding models are also used to approximate how entrained the participants become over time.

Cool Papers

Click here to view paper screenshots

Binaural Signal Matching with Wearable Arrays for Near-Field Sources and Directional Focus

Authors:Sapir Goldring, Zamir Ben Hur, David Lou Alon, Chad McKell, Sebastian Prepelita, Boaz Rafaely

This paper investigates the performance of Binaural Signal Matching (BSM) methods for near-field sound reproduction using a wearable glasses-mounted microphone array. BSM is a flexible, signal-independent approach for binaural rendering with arbitrary arrays, but its conventional formulation assumes far-field sources. In our previous work, we proposed a near-field extension of BSM (NF-BSM) that incorporates distance-dependent modeling and showed improved performance over far-field BSM using analytic data, though degradation persisted for sources very close to the array. In this study, we extend that analysis by using realistic simulated data of near-field Head-Related Transfer Functions (HRTFs) and Acoustic Transfer Functions (ATFs) of the array, accounting for listener head rotation and evaluating binaural cues such as interaural level and time differences (ILD and ITD). A key contribution is the introduction of a Field of View (FoV) weighting, designed to emphasize perceptually relevant directions and improve robustness under challenging conditions. Results from both simulation and a listening test confirm that NF-BSM outperforms traditional far-field BSM in near-field scenarios, and that the proposed NF-FoV-BSM method achieves the best perceptual and objective quality among all tested methods, particularly at close source distances and under head rotation. These findings highlight the limitations for far-field models in near-field sources and demonstrate that incorporating source distance and directional weighting can significantly improve binaural reproduction performance for wearable spatial audio systems.
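For readers unfamiliar with BSM, the following is a generic sketch of the underlying signal-matching idea: per ear and per frequency bin, array filters are fit by regularized least squares so that the filtered array response matches the target HRTFs over a grid of source directions, and a Field-of-View style weighting simply up-weights perceptually relevant directions. The matrix shapes, regularization, and weighting values are assumptions for illustration; the paper's exact NF-BSM and NF-FoV-BSM formulations may differ.

```python
import numpy as np


def bsm_filters(V: np.ndarray, h: np.ndarray, w: np.ndarray, reg: float = 1e-3) -> np.ndarray:
    """Weighted, regularized least-squares estimate of matching filters for one
    ear and one frequency bin: minimize sum_q w_q |V[q] @ c - h[q]|^2 + reg*||c||^2.

    V : (Q, M) array transfer functions for Q source directions/distances and M mics
    h : (Q,)  target HRTFs for the same directions
    w : (Q,)  non-negative directional weights (FoV-style emphasis)
    """
    W = np.diag(w)
    A = V.conj().T @ W @ V + reg * np.eye(V.shape[1])
    b = V.conj().T @ W @ h
    return np.linalg.solve(A, b)


# Toy example with random complex data: 72 directions, 5 microphones, and the
# frontal +/-60 degrees weighted 4x more heavily than the remaining directions.
rng = np.random.default_rng(0)
Q, M = 72, 5
V = rng.standard_normal((Q, M)) + 1j * rng.standard_normal((Q, M))
h = rng.standard_normal(Q) + 1j * rng.standard_normal(Q)
angles = np.linspace(-180, 180, Q, endpoint=False)
w = np.where(np.abs(angles) <= 60, 4.0, 1.0)
c = bsm_filters(V, h, w)
```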


Paper and Project Links

PDF

Summary

This paper investigates the performance of Binaural Signal Matching (BSM) for near-field sound reproduction with a glasses-mounted wearable microphone array. It presents a near-field extension of the conventional far-field BSM formulation (NF-BSM) that incorporates distance-dependent modeling and improves handling of very close sources. The study uses realistic simulated data that account for listener head rotation and evaluates binaural cues such as interaural level and time differences (ILD and ITD). A key contribution is a Field of View (FoV) weighting designed to emphasize perceptually relevant directions and improve robustness under challenging conditions. Both simulations and a listening test confirm that NF-BSM outperforms conventional far-field BSM in near-field scenarios, and that the proposed NF-FoV-BSM method achieves the best perceptual and objective quality among all tested methods, particularly at close source distances and under head rotation. These findings highlight the limitations of far-field models for near-field sources and show that incorporating source distance and directional weighting can significantly improve binaural reproduction for wearable spatial audio systems.

Key Takeaways

  1. The study investigates the performance of Binaural Signal Matching (BSM) methods for near-field sound reproduction.
  2. A near-field extension of far-field BSM (NF-BSM) incorporates distance-dependent modeling and improves handling of close sources.
  3. Realistic simulated data account for listener head rotation, and binaural cues such as ILD and ITD are evaluated.
  4. A Field of View (FoV) weighting emphasizes perceptually relevant directions and improves robustness.
  5. NF-BSM outperforms the conventional far-field BSM method in near-field scenarios.
  6. NF-FoV-BSM achieves the best perceptual and objective quality among all tested methods, particularly at close source distances and under head rotation.

Cool Papers

Click here to view paper screenshots

OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

Authors:Cheng Luo, Jianghui Wang, Bing Li, Siyang Song, Bernard Ghanem

In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task designed to produce synchronized verbal and non-verbal listener feedback online, based on the speaker’s multimodal inputs. OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with listeners’ facial responses. To tackle these challenges, we incorporate text as an intermediate modality to connect audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, we offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available.
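The abstract does not specify the Chrono-Text Markup format, so the snippet below only illustrates the general idea of attaching explicit timestamps to generated tokens so that a downstream online TTS stage (TempoVoice in the paper) can stay aligned with frame-level facial responses; the tag syntax, class, and function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimedToken:
    text: str
    start_ms: int  # onset relative to the start of the listener response
    end_ms: int


def to_chrono_markup(tokens: List[TimedToken]) -> str:
    """Serialize tokens with explicit time tags so speech synthesis and facial
    response generation can be synchronized against a shared timeline."""
    return " ".join(f"<t {tok.start_ms}-{tok.end_ms}>{tok.text}" for tok in tokens)


print(to_chrono_markup([TimedToken("mm-hmm", 0, 350), TimedToken("right", 900, 1250)]))
```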


Paper and Project Links

PDF 25 pages, 9 figures

Summary

This paper introduces Online Multimodal Conversational Response Generation (OMCRG), a task that produces synchronized verbal and non-verbal listener feedback online from a speaker's multimodal inputs. To tackle its challenges, the authors propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse builds on a pretrained LLM with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module whose output is synchronized with facial responses. To advance OMCRG research, the authors release ResponseNet, a dataset of 696 detailed dyadic interactions with synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet show that OmniResponse outperforms baseline models in semantic speech content, audio-visual synchronization, and generation quality. The dataset, code, and models are publicly available.

Key Takeaways

  1. Introduces Online Multimodal Conversational Response Generation (OMCRG), a task that produces synchronized verbal and non-verbal listener feedback from a speaker's multimodal inputs.
  2. Proposes OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses.
  3. OmniResponse combines Chrono-Text Markup and TempoVoice to synchronize generated text, speech, and facial responses.
  4. Releases ResponseNet, a dataset of detailed dyadic interactions with synchronized split-screen videos and rich annotations, as a resource for OMCRG research.
  5. OmniResponse outperforms baselines in semantic speech content, audio-visual synchronization, and generation quality.
  6. The dataset, code, and models are publicly available for other researchers to use and extend.

Cool Papers

Click here to view paper screenshots

3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark

Authors:Ivan Sviridov, Amina Miftakhova, Artemiy Tereshchenko, Galina Zubkova, Pavel Blinov, Andrey Savchenko

Though Large Vision-Language Models (LVLMs) are being actively explored in medicine, their ability to conduct complex real-world telemedicine consultations combining accurate diagnosis with professional dialogue remains underexplored. This paper presents 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source framework for simulating and evaluating LVLM-driven telemedical consultations. 3MDBench simulates patient variability through temperament-based Patient Agent and evaluates diagnostic accuracy and dialogue quality via Assessor Agent. It includes 2996 cases across 34 diagnoses from real-world telemedicine interactions, combining textual and image-based data. The experimental study compares diagnostic strategies for widely used open and closed-source LVLMs. We demonstrate that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings, highlighting the importance of context-aware, information-seeking questioning. Moreover, injecting predictions from a diagnostic convolutional neural network into the LVLM’s context boosts F1 by up to 20%. Source code is available at https://github.com/univanxx/3mdbench.
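The "injecting predictions from a diagnostic convolutional neural network into the LVLM's context" step can be pictured as simple prompt construction. The template below is illustrative only, since the abstract does not give the exact prompt wording; the function name and phrasing are assumptions.

```python
def build_consult_prompt(dialogue_history: str, cnn_topk: list) -> str:
    """Compose an LVLM prompt that injects top-k predictions from an auxiliary
    diagnostic CNN as extra context, the ingredient reported to boost F1 by up
    to 20%. `cnn_topk` is a list of (diagnosis_name, probability) pairs.
    """
    hints = "\n".join(f"- {name}: p={prob:.2f}" for name, prob in cnn_topk)
    return (
        "You are a telemedicine doctor. An image classifier suggests these "
        f"candidate diagnoses (they may be wrong):\n{hints}\n\n"
        f"Consultation so far:\n{dialogue_history}\n\n"
        "Ask the next clarifying question, or state your diagnosis."
    )


print(build_consult_prompt(
    "Patient: I have an itchy rash on my forearm that appeared two days ago.",
    [("contact dermatitis", 0.46), ("psoriasis", 0.21), ("eczema", 0.17)],
))
```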


Paper and Project Links

PDF EMNLP 25 (main)

Summary

This paper presents 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source framework for simulating and evaluating telemedicine consultations driven by Large Vision-Language Models (LVLMs). The framework simulates patient variability through a temperament-based Patient Agent, evaluates diagnostic accuracy and dialogue quality via an Assessor Agent, and includes 2996 cases across 34 diagnoses drawn from real-world telemedicine interactions that combine textual and image data. Experiments show that multimodal dialogue with internal reasoning improves the F1 score by 6.5% over non-dialogue settings, and that injecting predictions from a diagnostic convolutional neural network into the LVLM's context boosts F1 by up to 20%.

Key Takeaways

  1. 3MDBench is an open-source framework for simulating and evaluating the performance of Large Vision-Language Models (LVLMs) in telemedicine consultations.
  2. The framework simulates patient variability with a temperament-based Patient Agent and evaluates diagnostic accuracy and dialogue quality with an Assessor Agent.
  3. 3MDBench includes 2996 cases across 34 diagnoses drawn from real-world telemedicine interactions.
  4. Dialogue with internal reasoning improves the diagnostic accuracy of LVLMs, raising F1 by 6.5% over non-dialogue settings.
  5. Multimodal dialogue that combines textual and image data benefits diagnosis.
  6. Injecting predictions from a diagnostic CNN into the LVLM's context boosts F1 by up to 20%.

Cool Papers

Click here to view paper screenshots

Talk, Listen, Connect: How Humans and AI Evaluate Empathy in Responses to Emotionally Charged Narratives

Authors:Mahnaz Roshanaei, Rezvaneh Rezapour, Magy Seif El-Nasr

Social interactions promote well-being, yet barriers like geographic distance, time limitations, and mental health conditions can limit face-to-face interactions. Emotionally responsive AI systems, such as chatbots, offer new opportunities for social and emotional support, but raise critical questions about how empathy is perceived and experienced in human-AI interactions. This study examines how empathy is evaluated in AI-generated versus human responses. Using personal narratives, we explored how persona attributes (e.g., gender, empathic traits, shared experiences) and story qualities affect empathy ratings. We compared responses from standard and fine-tuned AI models with human judgments. Results show that while humans are highly sensitive to emotional vividness and shared experience, AI responses are less influenced by these cues and often lack nuance in empathic expression. These findings highlight challenges in designing emotionally intelligent systems that respond meaningfully across diverse users and contexts, and inform the design of ethically aware tools to support social connection and well-being.


Paper and Project Links

PDF 18 pages, 4 figures, 6 tables. Title updated from “Talk, Listen, Connect: Navigating Empathy in Human-AI Interactions” to “Talk, Listen, Connect: How Humans and AI Evaluate Empathy in Responses to Emotionally Charged Narratives” in this version. This is version 3 (v3) of the paper. All previous citations of arXiv:2409.15550 with the old title still refer to the same paper

Summary

Social interaction supports well-being, but barriers such as geographic distance, time limitations, and mental health conditions can restrict face-to-face contact. Emotionally responsive AI systems such as chatbots offer new opportunities for social and emotional support, while raising key questions about how empathy is perceived and experienced in human-AI interaction. This study examines how empathy is evaluated in AI-generated versus human responses. Using personal narratives, it explores how persona attributes (e.g., gender, empathic traits, shared experiences) and story qualities affect empathy ratings, comparing responses from standard and fine-tuned AI models with human judgments. Results show that humans are highly sensitive to emotional vividness and shared experience, whereas AI responses are less influenced by these cues and often lack nuance in empathic expression. The findings highlight the challenges of designing emotionally intelligent systems that respond meaningfully across diverse users and contexts, and inform the design of ethically aware tools to support social connection and well-being.

Key Takeaways

  1. Social interaction benefits individual well-being, but geographic distance, time limitations, and mental health conditions can restrict face-to-face contact.
  2. Emotionally responsive AI systems such as chatbots create new opportunities for social and emotional support, while raising key questions about how empathy is perceived and experienced in human-AI interaction.
  3. A study based on personal narratives finds that persona attributes and story qualities affect empathy ratings.
  4. Comparing standard and fine-tuned AI models with human judgments shows that humans attend to emotional vividness and shared experience, whereas AI responses often lack nuance in expressing empathy.
  5. Designing emotionally intelligent systems that respond meaningfully across diverse users and contexts remains challenging.
  6. Designing ethically aware tools to support social connection and well-being is essential.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!