
Machine learning algorithms and large language models (LLMs), such as the models underpinning ChatGPT, have proved effective at tackling a wide range of tasks. These models are trained on various types of data (e.g., text, images, video and audio recordings), which are typically annotated by humans who label important features, including the emotions expressed in the data.
Researchers at Pennsylvania State University recently carried out a study aimed at better understanding the extent to which third-party annotators, both humans and LLMs, can identify the emotions that authors express in their written texts. Their findings, outlined in a paper published on the arXiv preprint server and set to be presented at the ACL 2025 conference in Vienna, suggest that third-party annotators often fail to pick up on the emotions that authors themselves report expressing in their texts.
“Many NLP tasks model authors’ private states (like emotions) using third-party annotations, assuming these labels align with the author’s actual experience,” Sarah Rajtmajer, senior author of the paper, told Tech Xplore. “However, this critical assumption is rarely examined.
“The misalignment between an author’s private state and its third-party interpretation is not merely a labeling error—it can propagate through learned models and undermine the reliability of downstream applications, leading to socially harmful consequences.”
The recent study, led by Ph.D. student Jiayi Li, focused on text-based emotion recognition, as many widely used AI platforms are accessed via chat interfaces and are designed to process written text. The key question the team hoped to answer was: how reliable are third-party annotations when it comes to recognizing the emotions expressed by humans?
“This question drove us to systematically study the alignment between self-reported (first-party) emotions and the interpretations of third-party annotators, including both humans and LLMs,” said Li. “We also explored ways to improve this alignment by incorporating factors, specifically examining the impact of shared demographics for human annotators and providing author demographics to LLMs.”
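To make that concrete, the snippet below is a minimal, hypothetical sketch of what "providing author demographics to LLMs" might look like at the prompt level. The emotion categories, demographic fields and prompt wording are illustrative assumptions, not the prompts used in the study.

```python
# Hypothetical sketch (not the study's actual prompts): building an emotion-
# annotation prompt for an LLM with and without the post author's demographics.
EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise"]  # illustrative taxonomy

def build_prompt(post, demographics=None):
    """Return an annotation prompt, optionally prefixed with author demographics."""
    lines = []
    if demographics:
        # Assumed demographic fields; the study's exact attributes may differ.
        context = ", ".join(f"{key}: {value}" for key, value in demographics.items())
        lines.append(f"The author of this post is described as ({context}).")
    lines.append(f'Post: "{post}"')
    lines.append(
        "Which of these emotions does the author express? "
        f"Choose all that apply from: {', '.join(EMOTIONS)}."
    )
    return "\n".join(lines)

post = "Finally submitted my thesis after three sleepless nights."
print(build_prompt(post))                                          # demographic-blind
print(build_prompt(post, {"age": "late 20s", "gender": "woman"}))  # demographic-aware
```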
To probe the ability of third-party annotators to infer the emotions of others from texts, the researchers carried out an experiment involving social media users recruited via the crowdsourcing platform Connect. The study participants were first asked to share their own social media posts and label the emotions they felt they were expressing in these posts.
“We then asked different groups of human annotators, considering their demographics relative to the authors, to label the same posts,” explained Li. “We also had several large language models (LLMs) perform the same labeling task. By comparing these third-party annotations (from both humans and LLMs) to the author’s self-reported emotions using evaluation metrics like F1 score and statistical tests, we examined the alignment between first- and third-party annotations.”
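As a rough illustration of that comparison, here is a minimal sketch of how first-party and third-party emotion labels could be scored against each other with a macro-averaged F1, assuming a multi-label setup and scikit-learn; the posts, labels and emotion categories below are hypothetical, not the study's data.

```python
# Minimal sketch (not the authors' code): measuring how well third-party
# emotion labels align with an author's self-reported (first-party) labels.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise"]  # illustrative taxonomy

# First-party labels: what each author said they expressed in their own post.
first_party = [["joy"], ["sadness", "anger"], ["fear"], ["joy", "surprise"]]

# Third-party labels: what an annotator (human or LLM) inferred from the text.
third_party = [["joy"], ["anger"], ["sadness"], ["surprise"]]

# Binarize both label sets over the same emotion vocabulary.
mlb = MultiLabelBinarizer(classes=EMOTIONS)
y_true = mlb.fit_transform(first_party)
y_pred = mlb.transform(third_party)

# Macro-averaged F1 treats the first-party labels as ground truth and asks
# how closely the third-party annotations align with them.
alignment = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"First-/third-party alignment (macro F1): {alignment:.2f}")
```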
In their analyses, Li, Rajtmajer and their colleagues also looked at demographic similarities between the users who wrote the posts and the third-party annotators. This allowed them to explore the possibility that people with similar demographics (e.g., a similar age or ethnic background) are better at picking up on each other’s emotions.
“The most notable finding from our study is the clear misalignment between third-party annotations and first-party self-reported emotions,” said Rajtmajer. “This finding challenges a common assumption in emotion recognition research that third-party annotators can reliably infer someone else’s emotional expression based solely on text.
“Notably, we found that human annotators who shared demographic traits with the post author were more aligned with first-party labels, and prompting LLMs with demographic context led to small but statistically significant improvements.”
Overall, the findings of this recent work suggest that human annotators may not be as effective at picking up on emotions expressed in texts as past studies have suggested, although they may be better at detecting the emotions of people who share characteristics with them.
This insight could guide the annotation of future text datasets for training natural language processing (NLP) models, including LLMs, and could in turn help these models generate responses that are better aligned with the emotions their users express.
“Our study highlights the need for researchers and developers to be precise about whose emotional perspective they are capturing—the author’s or an observer’s,” added Rajtmajer. “This distinction is especially critical in downstream applications like mental health support and empathetic dialogue systems, where understanding the author’s emotional state is often the goal.
“Moving forward, we are interested in more nuanced and user-centered models of emotion that go beyond basic emotion categories that are derived from biological responses and rigid third-party constructed taxonomies.”
More information:
Jiayi Li et al, Can Third-parties Read Our Emotions?, arXiv (2025). DOI: 10.48550/arXiv.2504.18673