
How AI is leaving non-English speakers behind


New research explores the communities and cultures being excluded from AI tools, leading to missed opportunities and increased risks from bias and misinformation.

Scholars find that large language models suffer from a digital divide: the ChatGPTs and Geminis of the world work well for the 1.52 billion people who speak English, but they underperform for the world’s 97 million Vietnamese speakers, and perform even worse for the 1.5 million people who speak the Uto-Aztecan language Nahuatl.

The main culprit is data: these non-English languages lack the quantity and quality of data needed to build and train effective models. As a result, most major LLMs are trained predominantly on English (or other high-resource-language) data, or on poor-quality local-language data, and are not attuned to the rest of the world’s contexts and cultures.

The impact? Not just inconvenience, but systematic exclusion. Entire cultures and communities are being left out of the AI revolution, are at risk of harm from AI-generated misinformation and bias, and are losing crucial economic and educational opportunities that English speakers gain through effective technology.

In this conversation, Stanford School of Engineering Assistant Professor Sanmi Koyejo, senior author of a new policy white paper on this topic, discusses the risks of this divide and, importantly, what developers can do to close it.

What are low-resource languages, and why is it so hard to make LLMs work well for them?

Low-resource languages are languages with limited amounts of computer-readable data about them. That could mean few speakers of a language, or languages where there are speakers but not a lot of digitized language data, or languages where there might be speakers and digital data, but not the resources to engage in computational work around the data. For instance, Swahili has 200 million speakers but lacks sufficient digitized resources for AI models to learn from, while a language like Welsh, with fewer speakers, benefits from extensive documentation and digital preservation efforts.

All of machine learning is highly dependent on data as a resource. We consistently find that models do really well when the tasks they’re asked to solve are similar to their training data, and they do worse the further the data is from that training distribution. Because low-resource languages have less data, models perform poorly on them.

Why does this digital divide matter?

AI models, language models in particular, are having more and more impact on the world; they give people the potential for economic opportunity, to build businesses, or solve enterprise or individual problems. If we have language technology that doesn’t work for people in the language that they speak, those communities don’t see the technology boost that other people might have.

For example, there’s a lot of promise for AI models in health care delivery, such as helping with diagnosis or clinical support questions. There are assumptions that these models will have meaningful societal health benefits, long-term impacts on people’s well-being, and potential economic impacts for large communities.

But all these assumptions break if people can’t engage with the technology because it doesn’t work in a language they understand. In regions where universal health care remains a challenge, AI-powered diagnostic tools that function only in English create a new layer of health care inequality.

We anticipate these gaps will get bigger. Think about global citizenship, or the ability to engage across companies, across cultures. This could be a lever for economic development or for advocacy for individual or group rights. These things could be harder for people who don’t have access to AI tools in their languages.

Another potential growing gap is in employment. As AI transforms workplaces globally, workers fluent in English will advance while others face technological barriers to employment, widening economic inequality.

What approaches are developers taking to make LLMs perform better for low-resource languages?

I see a few techniques to close this gap. One way in which these techniques differ is in model size. Technologists can train very big models that capture lots of languages all at the same time; they can train smaller models that are tied to very specific languages; or there’s a mix between the two—regional, medium-sized models that capture a semantically similar group of languages.

We have both technical theory and observed practice suggesting that you can improve performance faster if models share information across different languages. For example, all of the Romance languages share words, phrasings, and linguistic structure. The particular languages can be very different, but there’s actually a lot of common ground between, say, Spanish and Italian. Just as bilingual humans learn new languages faster by recognizing patterns, AI models can leverage the similarities between Spanish and Portuguese to improve performance in both.
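
To make that concrete, here is a minimal sketch of cross-lingual transfer, assuming the Hugging Face transformers library and the publicly released xlm-roberta-base multilingual checkpoint; the tiny in-line Spanish and Portuguese examples are placeholders rather than a real dataset. The model is fine-tuned only on Spanish and then evaluated zero-shot on Portuguese, relying on the vocabulary and structure the two languages share.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Multilingual encoder: one model, many languages sharing a subword vocabulary.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# Placeholder Spanish training examples (1 = positive, 0 = negative).
spanish_texts = ["Me encanta este producto", "Es una pérdida de dinero"]
spanish_labels = torch.tensor([1, 0])

# One gradient step, just to show the shape of the fine-tuning loop.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tokenizer(spanish_texts, padding=True, return_tensors="pt")
loss = model(**batch, labels=spanish_labels).loss
loss.backward()
optimizer.step()

# Zero-shot evaluation on Portuguese: no Portuguese training data was used,
# but related structure lets the Spanish supervision transfer.
portuguese_texts = ["Adorei este produto", "Foi um desperdício de dinheiro"]
with torch.no_grad():
    eval_batch = tokenizer(portuguese_texts, padding=True, return_tensors="pt")
    predictions = model(**eval_batch).logits.argmax(dim=-1)
print(predictions)  # e.g. tensor([1, 0]) once the model is trained on more than a toy batch

The same pattern underlies regional, medium-sized models: pooling supervision across related languages so that each one needs less data of its own.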

People are also trying to use automatic translation as a way to fill the gap. The downside is error propagation—anything complicated is hard to translate. In fact, in a paper we wrote recently studying models and the Vietnamese language, we found that a lot of baselines had used automatic translation, and they failed often because the phrasings were highly unnatural for Vietnamese. Word by word, they made sense, but it was culturally completely incorrect. Translation is scalable, but it doesn’t capture the nuance of the way language is spoken and written. Because of this, I think translation can be a good bootstrap, but it is unlikely to solve the problem.
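
As a concrete illustration of the translation bootstrap and its failure mode, here is a short sketch that assumes the Hugging Face transformers pipeline API and that the open Helsinki-NLP OPUS-MT English-Vietnamese checkpoints are available; the round-trip overlap heuristic is purely illustrative, not a real quality-estimation method.

from transformers import pipeline

# Translation models used to bootstrap Vietnamese data from English (assumed checkpoints).
en_to_vi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-vi")
vi_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-vi-en")

def bootstrap_example(english_text):
    # Translate into the low-resource language, then translate back.
    vietnamese = en_to_vi(english_text)[0]["translation_text"]
    round_trip = vi_to_en(vietnamese)[0]["translation_text"]
    # Crude proxy for error propagation: how much of the original vocabulary
    # survives a round trip. Low overlap suggests unnatural or lossy phrasing.
    original_words = set(english_text.lower().split())
    returned_words = set(round_trip.lower().split())
    overlap = len(original_words & returned_words) / max(len(original_words), 1)
    return {"vi": vietnamese, "round_trip": round_trip, "overlap": overlap}

example = bootstrap_example("The patient should take the medication twice a day with food.")
if example["overlap"] < 0.5:
    print("Flag for human review:", example)

Checks like this can triage machine-translated data, but as the Vietnamese study suggests, they cannot substitute for speakers judging whether the phrasing is natural and culturally appropriate.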

Another way to solve this is to get more data on these languages from the communities. That’s actually a challenging problem. There’s a long history of people parachuting into different communities and taking data without any benefit for the local community. Some communities are developing new data licensing models where language contributors maintain rights to their data while allowing AI development, ensuring both technological advancement and cultural sovereignty. Other communities decide to build their own models. It can be a deeply political, societal problem; data use can often slip into exploitation when we’re not careful.

What’s the most promising of these solutions?

The honest answer is, we don’t know. My best sense right now is that the answer is context-dependent. What I mean is, what are the purposes for the model, and what is the societal and political landscape that we’re building in? In some cases, this will matter more than the technical aspects. Think about language preservation, when there are so few speakers that a language may become extinct. For those, there is an argument that a separate model just for that context is most productive.

Meanwhile, a company may want a large-scale model for the economies of scale. That company may be concerned about model governance—how does it keep all the models updated? This is much easier if it’s one big model that you have to maintain, rather than hundreds of models across languages.

Right now, I think the decisions are shaped by factors other than performance. However, I will highlight that we need more evaluation approaches specialized for low-resource languages that go beyond English-centric performance measures.
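
One simple way to move beyond a single English-centric score is to report per-language results alongside worst-case numbers. The sketch below, with placeholder accuracy figures, shows the kind of summary such an evaluation might produce; the specific metrics and numbers are assumptions for illustration, not a prescribed method.

from statistics import mean

def summarize(per_language_accuracy):
    # Report each language separately, the worst-performing language,
    # and the gap to English, instead of one aggregate that high-resource
    # languages dominate.
    worst_language = min(per_language_accuracy, key=per_language_accuracy.get)
    return {
        "macro_average": mean(per_language_accuracy.values()),
        "worst_language": worst_language,
        "worst_accuracy": per_language_accuracy[worst_language],
        "gap_to_english": per_language_accuracy["en"] - per_language_accuracy[worst_language],
    }

# Placeholder numbers for English, Vietnamese, Swahili, and Nahuatl.
print(summarize({"en": 0.91, "vi": 0.74, "sw": 0.62, "nah": 0.41}))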

Language is not the only challenge here. Cultural values are imbued in LLMs. Does it matter?

It matters a ton. We know that models out of the box often don’t capture cultural values appropriately. Sometimes it’s the awkward phrasing I mentioned before. A lot of older automatic-translation data comes from well-structured sources like political gatherings. This has a fascinating effect because it’s a very particular register of language, the kind used in congressional hearings or similar settings, which is very different from a conversational style and extremely awkward when applied out of the box. Models trained on it don’t capture how people actually speak.

There are other cases where this cultural gap can be bigger. There’s been excellent research showing that many language models pick up values that match the language they’ve been trained on. My colleague Tatsu Hashimoto asked language models to answer Pew Research surveys to see what political perspectives they align with, and showed that many of the models ended up aligning quite strongly with California political perspectives.

That makes sense when we think about who’s training the models and what they’re picking up. Diyi Yang has done some excellent work looking at how language models work with dialects of English, showing they can be systematically incorrect for, say, African American dialects of English.

Language models, when not designed carefully, run the risk of collapsing rich language and cultural diversity into one big blob, often a U.S.-centric culture blob. Arguably, a lot of culture gets shaped by technology. The way people think about problems and the way they think about culture will often get shaped by the way they engage with technology.

Many cultural leaders across the world are worried about the erasure of their cultures as language models become a dominant mode of technology. In response, the white paper recommends strategic investments, participatory research, and equitable data ownership frameworks as specific steps for stakeholders moving forward.

More information:
Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts. hai-production.s3.amazonaws.co … the-language-gap.pdf

Provided by Stanford University


