
Can AI pass a Ph.D.-level history test? New study says ‘not yet’


For the past decade, complexity scientist Peter Turchin has been working with collaborators to bring together the most current and structured body of knowledge about human history in one place: the Seshat Global History Databank.

Over the past year, together with computer scientist Maria del Rio-Chanona, he has begun to wonder if artificial intelligence chatbots could help historians and archaeologists to gather data and better understand the past. As a first step, they wanted to assess the AI tools’ understanding of historical knowledge.

In collaboration with an international team of experts, they decided to evaluate the historical knowledge of advanced AI models such as ChatGPT-4, Llama, and Gemini.

“Large language models (LLMs), such as ChatGPT, have been enormously successful in some fields—for example, they have largely succeeded in replacing paralegals,” says Turchin, who leads the Complexity Science Hub’s (CSH) research group on social complexity and collapse.

“But when it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability to do so is much more limited.

“One surprising finding, which emerged from this study, was just how bad these models were. This result shows that artificial ‘intelligence’ is quite domain-specific. LLMs do well in some contexts, but very poorly, compared to humans, in others.”

The results of the study were presented recently at the NeurIPS conference in Vancouver. GPT-4 Turbo, the best-performing model, scored 46% on a four-choice question test.

According to Turchin and his team, although these results are an improvement over the 25% baseline expected from random guessing, they highlight considerable gaps in AI’s understanding of historical knowledge.
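The arithmetic behind that comparison is simple: on a four-choice test, a model guessing uniformly at random is expected to answer one question in four correctly. The short Python sketch below is purely illustrative (it is not the study’s evaluation code) and makes that baseline concrete.

```python
import random

# Illustrative arithmetic only -- not the study's evaluation code.
# On a four-choice test, uniform random guessing is right 1 in 4 times.
NUM_CHOICES = 4
RANDOM_BASELINE = 1 / NUM_CHOICES  # 0.25

def simulate_random_guesser(num_questions: int = 100_000) -> float:
    """Fraction of questions a uniform random guesser answers correctly."""
    correct = sum(random.randrange(NUM_CHOICES) == 0 for _ in range(num_questions))
    return correct / num_questions

print(f"Random-guessing baseline: {RANDOM_BASELINE:.0%}")        # 25%
print(f"Simulated random guesser: {simulate_random_guesser():.1%}")
print("Best model in the study (GPT-4 Turbo): 46% balanced accuracy")
```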

“I thought the AI chatbots would do a lot better,” says del Rio-Chanona, the study’s corresponding author. “History is often viewed as facts, but sometimes interpretation is necessary to make sense of it,” adds del Rio-Chanona, an external faculty member at CSH and an assistant professor at University College London.

Setting a benchmark for LLMs

This new assessment, the first of its kind, challenged the AI systems to answer graduate- and expert-level questions of the kind answered in Seshat, and the researchers used the knowledge in Seshat to check the accuracy of the AI answers. Seshat is a vast, evidence-based resource that compiles historical knowledge on 600 societies worldwide, spanning more than 36,000 data points and over 2,700 scholarly references.

“We wanted to set a benchmark for assessing the ability of these LLMs to handle expert-level history knowledge,” explains first author Jakob Hauser, a resident scientist at CSH.

“The Seshat Databank allows us to go beyond ‘general knowledge’ questions. A key component of our benchmark is that we not only test whether these LLMs can identify correct facts, but also explicitly ask whether a fact can be proven or inferred from indirect evidence.”
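To give a sense of what such a test item might involve, here is a minimal, hypothetical sketch of a Seshat-style four-choice question and how an answer could be checked against expert coding. The wording, the answer categories (“present,” “absent,” “inferred present,” “inferred absent”), and the scoring function are assumptions made for illustration, not the actual HiST-LLM benchmark format.

```python
from dataclasses import dataclass

# Hypothetical illustration of an expert-level, Seshat-style test item.
# The wording, answer categories, and scoring below are assumptions for
# this sketch, not the actual HiST-LLM benchmark format.
@dataclass
class BenchmarkItem:
    question: str
    choices: list[str]      # four answer options, as in the study's test
    correct_index: int      # index of the expert-coded answer

item = BenchmarkItem(
    question=(
        "For this society and time period, was a professional standing army "
        "present, absent, inferred present, or inferred absent?"
    ),
    choices=["present", "absent", "inferred present", "inferred absent"],
    correct_index=0,  # placeholder; real answers come from Seshat's expert coding
)

def is_correct(model_answer: str, item: BenchmarkItem) -> bool:
    """Check a model's chosen option against the expert-coded answer."""
    return model_answer.strip().lower() == item.choices[item.correct_index].lower()

print(is_correct("Present", item))  # True for this placeholder item
```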

Disparities across time periods and geographic regions

The benchmark also reveals other important insights into the ability of current chatbots—a total of seven models from the Gemini, OpenAI, and Llama families—to comprehend global history. For instance, they were most accurate in answering questions about ancient history, particularly from 8,000 BCE to 3,000 BCE.

However, their accuracy dropped sharply for more recent periods, with the largest gaps in understanding events from 1,500 CE to the present.

In addition, the results highlight the disparity in model performance across geographic regions. OpenAI’s models performed better for Latin America and the Caribbean, while Llama performed best for Northern America.

Both the OpenAI and Llama models performed worse for Sub-Saharan Africa, and Llama also performed poorly for Oceania. This suggests potential biases in the training data, which may overemphasize certain historical narratives while neglecting others, according to the study.

Better on legal systems, worse on discrimination

The benchmark also found differences in performance across categories. Models performed best on legal systems and social complexity. “But they struggled with topics such as discrimination and social mobility,” says del Rio-Chanona.

“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, Ph.D.-level historical inquiry, they’re not yet up to the task,” adds del Rio-Chanona.

According to the benchmark, the model that performed best was GPT-4 Turbo, with a balanced accuracy of 46%, while the weakest was Llama-3.1-8B with 33.6%.
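For readers unfamiliar with the metric: balanced accuracy averages the accuracy obtained on each answer category separately, rather than over all questions at once, so a model cannot score well simply by favoring the most common answer. The sketch below mirrors the standard definition of the metric and uses made-up labels; it is not the study’s own code.

```python
from collections import defaultdict

def balanced_accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Average of per-class accuracies: each answer category counts equally,
    regardless of how often it appears in the test set."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == guess)
    per_class = [correct[c] / total[c] for c in total]
    return sum(per_class) / len(per_class)

# Toy labels invented for illustration (not data from the study)
print(balanced_accuracy(
    ["present", "absent", "absent", "inferred present"],
    ["present", "absent", "present", "absent"],
))  # -> 0.5
```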

Next steps

Del Rio-Chanona and the other researchers—from CSH, the University of Oxford, and the Alan Turing Institute—are committed to expanding the dataset and improving the benchmark. They plan to include more data from underrepresented regions and incorporate more complex historical questions, according to Hauser.

“We plan to continue refining the benchmark by integrating additional data points from diverse regions, especially the Global South. We also look forward to testing more recent LLM models, such as o3, to see if they can bridge the gaps identified in this study,” says Hauser.

The CSH scientist emphasizes that the benchmark’s findings can be valuable to both historians and AI developers. For historians, archaeologists, and social scientists, knowing the strengths and limitations of AI chatbots can help guide their use in historical research.

For AI developers, these results highlight areas for improvement, particularly in mitigating regional biases and enhancing the models’ ability to handle complex, nuanced historical knowledge.

More information:
Large Language Models’ Expert-level Global History Knowledge Benchmark (HiST-LLM). nips.cc/virtual/2024/poster/97439

Provided by
Complexity Science Hub Vienna


Citation:
Can AI pass a Ph.D.-level history test? New study says ‘not yet’ (2025, January 21), retrieved 21 January 2025

