
For people who are blind or have low vision, the audio descriptions of action in movies and TV shows are essential to understanding what is happening. Networks and streaming services hire professionals to create audio descriptions, but that’s not the case for billions of YouTube and TikTok videos.
That doesn’t mean people don’t want access to the content.
Using AI vision language models (VLMs), researchers at Northeastern University are making audio descriptions available for user-generated videos as part of a crowdsourced platform called YouDescribe. The platform works like a library: blind and low-vision users can request descriptions for videos, and later rate and contribute to them.
“It’s understandable that a 20-second video on TikTok of somebody dancing may not get a professional description,” says Lana Do, who received her master’s in computer science from Northeastern’s Silicon Valley campus in May. “But blind and low-vision people might like to see that dancing video too.”
In fact, a 2020 video of the South Korean boy band BTS’s song “Dynamite” is at the top of YouDescribe’s wishlist, waiting to be described. The platform has 3,000 volunteer describers, but the wishlist is so long that they can’t keep up. Only 7% of requested videos on the wishlist have audio descriptions, Do says.
Do works in the lab of Ilmi Yoon, teaching professor of computer science on the Silicon Valley campus. Yoon joined YouDescribe’s team in 2018 to develop the platform’s machine learning elements.
This year, Do added new features to speed up YouDescribe’s human-in-the-loop workflow. New VLM technology provides better quality descriptions, and a new infobot tool will allow users to ask for more information about a specific video frame. Low-vision users can even correct mistakes in the descriptions with a collaborative editing interface, Do says.
The result is better video descriptions that are available more quickly. AI-generated drafts ease the burden on human describers, and users can easily engage in the process through ratings and comments, she says.
“They could say that they were watching a documentary set in a forest and they heard a flapping sound that wasn’t described,” Do says, “and they wondered what it was.”
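The article doesn’t name the model behind these AI drafts or the infobot, so the sketch below is only a rough illustration of the idea: pull a single frame from a clip and ask an off-the-shelf vision language model for a one-line caption that a volunteer could then edit and time. The BLIP model, the file path and the timestamp are assumptions, not YouDescribe’s actual stack.

```python
# Illustrative sketch only -- not YouDescribe's actual pipeline.
# Assumes opencv-python, transformers, torch and Pillow are installed,
# and that "dance_clip.mp4" is a local video file (hypothetical path).
import cv2
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

def grab_frame(video_path: str, second: float) -> Image.Image:
    """Pull one frame from the video at roughly the given timestamp."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, second * 1000)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise ValueError(f"Could not read a frame at {second}s")
    # OpenCV returns BGR; convert to RGB before handing the frame to the model.
    return Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

# Load an open image-captioning VLM (a stand-in for whatever model the team uses).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

frame = grab_frame("dance_clip.mp4", second=12.0)
inputs = processor(images=frame, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
draft = processor.decode(output_ids[0], skip_special_tokens=True)

print(draft)  # a one-line draft a volunteer describer could edit and time
```

In a workflow like the one Do describes, a draft of this kind would still pass through human describers and the collaborative editing interface before blind and low-vision users rate and comment on it.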
Do and her colleagues recently presented a paper at the Symposium on Human-Computer Interaction for Work in Amsterdam about the potential for AI to accelerate the development of audio descriptions. AI does a surprisingly good job, says Yoon, at describing human expressions and movements. In one example, an AI agent describes the steps that a chef takes while making cheese rolls.
But there are some consistent weaknesses, she says. AI isn’t as good at reading facial expressions in cartoons. And overall, humans are better at picking up on the most important details in a scene—a key skill in creating a helpful description.
“It’s very labor-intensive,” Yoon says.
Graduate students in her lab compare the AI first drafts to what human describers create.
“Then we measure the gaps so we can train the AI to do a better job,” she says. “Blind users don’t want to get distracted with too much verbal description. It’s an editorial art to verbalize the most important information in a concise way.”
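Yoon’s lab doesn’t spell out in this article how those gaps are measured, so the following is only a toy illustration of the comparison step: score how much an AI draft overlaps with what a human describer wrote, using Python’s standard difflib. The metric and the sample sentences are assumptions, not the lab’s method.

```python
# Illustrative only: one crude way to compare AI drafts with human descriptions.
from difflib import SequenceMatcher

def similarity(ai_draft: str, human_description: str) -> float:
    """Return a 0-1 ratio of how much the two texts overlap."""
    return SequenceMatcher(None, ai_draft.lower(), human_description.lower()).ratio()

ai_draft = "A person dances in a kitchen."
human_text = "A young woman in a red apron dances beside the stove, laughing."

print(f"overlap: {similarity(ai_draft, human_text):.2f}")  # a low score flags a gap to review
```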
YouDescribe was launched in 2013 by the San Francisco-based Smith-Kettlewell Eye Research Institute to train sighted volunteers in the creation of audio descriptions. With a focus on YouTube and TikTok videos, the platform offers tutorials for recording and timing narration that make user-generated video content accessible.
This story is republished courtesy of Northeastern Global News (news.northeastern.edu).