
Beyond translation: Multilingual benchmark makes AI multicultural

Overview of INCLUDE. Credit: arXiv (2024). DOI: 10.48550/arxiv.2411.19799

Imagine asking a conversational bot like Claude or ChatGPT a legal question in Greek about local traffic regulations. Within seconds, it replies in fluent Greek with an answer based on UK law. The model understood the language, but not the jurisdiction. This kind of failure shows how large language models (LLMs) can be proficient in many of the world's languages while still lacking regional, cultural and, in this case, legal knowledge.

Teams from EPFL's Natural Language Processing Lab, Cohere Labs and collaborators across the globe have developed INCLUDE, a benchmark that represents a significant step toward AI more attuned to local contexts.

The benchmark makes it possible to assess whether an LLM is not only accurate in a given language but also able to capture the cultural and sociocultural realities associated with it. This approach aligns with the goals of the Swiss AI Initiative to create models that reflect Swiss languages and values. The study is published on the arXiv preprint server.

“To be relevant and relatable, LLMs need to know cultural and regional nuances. It’s not just global knowledge; it’s about meeting user needs where they are,” says Angelika Romanou, doctoral assistant at the NLP Lab, EPFL and first author of the benchmark.

A blind spot in multilingual AI

LLMs like GPT-4 and LLaMA-3 have made impressive strides in generating and understanding text across dozens of languages. However, they often perform poorly even in widely spoken languages such as Urdu or Punjabi, largely because high-quality training data in those languages is scarce.

Most existing benchmarks for evaluating LLMs are either English-only or translated from English, introducing bias and cultural distortion. Translated benchmarks often suffer from issues such as translation errors or unnatural phrasing, commonly known as “translationese.” Furthermore, most existing benchmarks retain a Western-centric cultural bias, failing to reflect the unique linguistic and regional characteristics of the target language.

INCLUDE takes a different approach. Rather than relying on translations, the team gathered over 197,000 multiple-choice questions from local academic, professional, and occupational exams.

All questions were written natively in 44 languages and 15 scripts. The team worked directly with native speakers and sourced real exams from authentic local institutions, covering everything from literature and law to medicine and marine licensing.
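To make the evaluation setup concrete, here is a minimal sketch of how one might score a model on a multiple-choice benchmark of this kind. The dataset identifier, field names and the `my_llm_call` wrapper below are illustrative assumptions, not the official INCLUDE schema or the authors' code.

```python
# Minimal sketch: scoring an LLM on multiple-choice exam questions, in the style
# of INCLUDE. Dataset ID, field names and the model wrapper are assumptions.
from datasets import load_dataset  # pip install datasets

LETTERS = ["A", "B", "C", "D"]

def format_prompt(example):
    """Render one exam question as an A/B/C/D multiple-choice prompt."""
    options = "\n".join(f"{letter}. {choice}"
                        for letter, choice in zip(LETTERS, example["choices"]))
    return f"{example['question']}\n{options}\nAnswer with the letter of the correct option."

def accuracy(examples, ask_model):
    """Exact-match accuracy: the model's predicted letter vs. the gold answer."""
    correct = 0
    for ex in examples:
        prediction = ask_model(format_prompt(ex)).strip().upper()[:1]  # e.g. "B"
        gold = LETTERS[ex["answer"]]  # assumes the gold answer is stored as an index
        correct += prediction == gold
    return correct / len(examples)

# Hypothetical usage:
# data = load_dataset("some-org/include-benchmark", "greek", split="test")
# print(accuracy(data, my_llm_call))
```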

The benchmark captures both explicit regional knowledge (like local laws) and implicit cultural cues (like social norms or historical perspectives). In testing, models consistently performed worse on regional history than on general world history—even within the same language. In other words, AI doesn’t yet understand local context.

“For example, when asked about what kind of traditional attire is worn in India, you will consistently get sari as an answer, across languages. However, when asked ‘Why did Alexander the Great set Persepolis on fire in 330 BCE?’, current models do not represent regional nuances.

“A Persian-aligned narrative might view it as a disrespect of Persian culture and society, while a Greek-aligned narrative might describe it as revenge for the Persian invasion of Greece by Xerxes. Such culturally loaded interpretations pose real challenges for AI,” says Negar Foroutan, doctoral assistant at the NLP lab and co-author of the benchmark.

Mixed results for current models

The research team evaluated leading models, including GPT-4o, LLaMA-3, and Aya Expanse, assessing performance by topic within each language. GPT-4o performed best overall, with an average accuracy of around 77% across all domains.

While models did well in French and Spanish, they struggled in languages like Armenian, Greek, and Urdu—especially on culturally or professionally grounded topics. Often, they defaulted to Western assumptions or produced confident but incorrect answers.
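The per-language, per-topic breakdown described above can be tabulated straightforwardly once each question's outcome is recorded. The following sketch (illustrative column names and toy rows, not the authors' analysis code) shows one way to do it with pandas:

```python
# Sketch: accuracy broken down by language and domain. The rows here are toy
# placeholders; a real evaluation would have one row per scored question.
import pandas as pd

results = pd.DataFrame([
    {"language": "Greek",  "domain": "Regional history", "correct": 0},
    {"language": "Greek",  "domain": "World history",    "correct": 1},
    {"language": "French", "domain": "Law",              "correct": 1},
    {"language": "Urdu",   "domain": "Law",              "correct": 0},
])

# The mean of the 0/1 "correct" flag within each (language, domain) cell is its accuracy.
table = results.groupby(["language", "domain"])["correct"].mean().unstack()
print(table.round(2))
```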

Toward more inclusive AI

INCLUDE goes beyond a simple technical benchmark. As AI systems are increasingly used in education, health care, governance, and law, regional understanding becomes paramount. “With the democratization of AI, these models need to adapt to the worldviews and lived realities of different communities,” says Antoine Bosselut, head of the Natural Language Processing Laboratory.

Released publicly and already adopted by some of the largest LLM providers, INCLUDE offers a practical tool for rethinking how we evaluate and train AI models more fairly and inclusively. The team is already working on a new version of the benchmark, expanding to around 100 languages, including regional varieties such as Belgian, Canadian and Swiss French, as well as underrepresented languages from Africa and Latin America.

With broader adoption, benchmarks like INCLUDE could help shape international standards—and even regulatory frameworks—for responsible AI. They also pave the way for specialized models in critical domains like medicine, law, and education, where understanding local context is essential.

More information:
Angelika Romanou et al, INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge, arXiv (2024). DOI: 10.48550/arxiv.2411.19799

Journal information:
arXiv


Provided by
École Polytechnique Fédérale de Lausanne


Citation:
Beyond translation: Multilingual benchmark makes AI multicultural (2025, June 2)
retrieved 2 June 2025
from

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.
