
The ability to communicate effectively in spoken English is a key determinant of both academic and professional success. Traditionally, mastery of English grammar, vocabulary, pronunciation, and communication skills has been assessed through tedious and expensive human-administered tests.
However, with the advent of artificial intelligence (AI) and machine learning in recent years, automated spoken English assessment tests have gained immense popularity among researchers worldwide.
While monologue-based speaking assessments are prevalent, they lack real-world relevance, particularly in settings where dialogue or group interaction is crucial. Moreover, research on automated assessment of spoken English skills in interactive settings remains limited and often focuses on a single modality, such as text or audio.
In this light, a team of researchers led by Professor Shogo Okada and including Assistant Professor Candy Olivia Mawalim from the Japan Advanced Institute of Science and Technology (JAIST) has developed a multioutput learning framework that can simultaneously assess multiple aspects of spoken English proficiency. Their findings are published online in the journal Computers and Education: Artificial Intelligence.
The researchers utilized a novel spoken English evaluation (SEE) dataset comprising synchronized audio, video, and text transcripts from open-ended, high-stakes interviews with adolescents (9–16 years old) applying to high schools and universities. This dataset was collected through Vericant's real-world interview service and is particularly notable for incorporating expert-assigned scores, supervised by researchers from the Educational Testing Service (ETS), across a range of speaking skill dimensions, enabling a rich, multimodal analysis of English proficiency.
Dr. Mawalim says, “Our framework allows for the modeling and integration of different aspects of speaking proficiency, thereby improving our understanding of the various underlying factors. Also, by incorporating open-ended interview settings in our assessment framework, we can gauge an individual’s ability to engage in spontaneous and creative communication and their overall sociolinguistic competence.”
The multioutput learning framework developed by the team integrates acoustic features such as prosody, visual cues like facial action units, and linguistic patterns such as turn-taking. Compared with unimodal approaches, this multimodal strategy significantly improved predictions, reaching approximately 83% accuracy on overall SEE scores using the Light Gradient Boosting Machine (LightGBM) algorithm.
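The paper's exact pipeline is not reproduced here, but the general recipe it describes can be sketched briefly: concatenate per-interview features from the three modalities and fit one gradient-boosted regressor per skill dimension. The minimal Python sketch below uses synthetic stand-in features and score ranges; the feature dimensions, number of skill dimensions, and hyperparameters are illustrative assumptions, not the authors' settings.

```python
# Hedged sketch of multimodal early fusion with multioutput LightGBM.
# All features and targets below are synthetic placeholders; the study
# extracts prosody (audio), facial action units (video), and
# turn-taking/linguistic patterns (text) from real interviews.
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_interviews = 200

audio = rng.normal(size=(n_interviews, 32))  # e.g., prosodic statistics
video = rng.normal(size=(n_interviews, 17))  # e.g., facial action units
text = rng.normal(size=(n_interviews, 24))   # e.g., turn-taking features
X = np.hstack([audio, video, text])          # feature-level (early) fusion

# Several expert-scored speaking skill dimensions per interview
# (4 dimensions and the 1-5 scale here are assumptions for illustration).
y = rng.uniform(1, 5, size=(n_interviews, 4))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One LightGBM regressor per output dimension via the multioutput wrapper.
model = MultiOutputRegressor(LGBMRegressor(n_estimators=100, verbose=-1))
model.fit(X_train, y_train)
print(model.predict(X_test[:2]))  # predicted scores for two held-out interviews
```

Fitting all skill dimensions through one shared feature representation is what lets such a framework relate the different facets of proficiency to each other, rather than scoring each in isolation.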
“The findings of our study have broad implications, offering diverse applications for stakeholders across various fields,” says Prof. Okada. “Besides providing direct actionable insights for students to improve their spoken English proficiency, our approach can help teachers to tailor their instructions to address individual student needs. Moreover, our multi-output learning framework can aid the development of more transparent and interpretable models for the assessment of spoken language skills.”
The scientists also studied the importance of the utterance sequence in spoken English proficiency. Analysis with Bidirectional Encoder Representations from Transformers (BERT), a pre-trained deep learning model, revealed that the initial utterance was particularly significant in predicting spoken proficiency. Furthermore, the influence of external factors, such as interviewer behavior and the interview setting, on spoken English proficiency was also assessed.
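To make the utterance-sequence analysis concrete, here is a minimal sketch of how utterance-level BERT features might be extracted so that the contribution of each position (e.g., the first utterance) can be probed. It uses the public bert-base-uncased checkpoint from the Hugging Face transformers library; the example utterances and the choice of the [CLS] vector are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch: utterance-level BERT embeddings for sequence analysis.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical interviewee utterances in the order they were spoken.
utterances = [
    "Hello, my name is Alex and I am excited to be here.",
    "I enjoy reading science fiction in my free time.",
    "In the future I would like to study engineering.",
]

with torch.no_grad():
    embeddings = []
    for utt in utterances:
        inputs = tokenizer(utt, return_tensors="pt")
        outputs = model(**inputs)
        # Take the [CLS] token representation as a fixed-size utterance vector.
        embeddings.append(outputs.last_hidden_state[:, 0, :].squeeze(0))

# Feeding these vectors to a downstream scorer, then ablating or reordering
# positions (e.g., dropping the first utterance), probes how much each
# position in the sequence contributes to the predicted proficiency.
print(torch.stack(embeddings).shape)  # (num_utterances, 768)
```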
Their analyses showed that specific features, such as the interviewer's speech and gender, and whether the interview was conducted in person or remotely, significantly impacted the coherence of the interviewees' responses.
“With the rapid growth of AI-driven technologies and their expanding integration into our daily lives, multimodal assessments could become standard in educational settings in the near future. This can enable students to receive highly personalized feedback on their communication skills, not just language proficiency.
“This could lead to tailored curricula and teaching methods, helping students to hone and develop crucial soft skills like public speaking, presentation, and interpersonal communication more effectively,” says Dr. Mawalim, the lead author of the present study.
Taken together, the research offers a more nuanced and interpretable approach to automated spoken English assessment and lays the groundwork for developing intelligent, student-centered tools in educational and professional contexts.
More information:
Candy Olivia Mawalim et al., Beyond accuracy: Multimodal modeling of structured speaking skill indices in young adolescents, Computers and Education: Artificial Intelligence (2025). DOI: 10.1016/j.caeai.2025.100386