

AI isn’t very good at history, new paper finds
Jan 22, 2025
The discussion centers on a striking finding: top language models such as GPT-4 score only about 46% on a benchmark of historical questions. The episode examines the limitations of AI in grasping the nuances of history and introduces the HIST-LLM benchmark as a new measure of that ability. Despite these shortcomings, the guests offer an optimistic view of AI's potential contributions to historical research.
LLMs Struggle with History
- LLMs struggle with nuanced historical questions, achieving only around 46% accuracy on a PhD-level history exam.
- They excel at basic facts but lack the deeper understanding needed for complex historical inquiry, as shown by GPT-4's errors about ancient Egyptian armor and armies.
GPT-4's Inaccurate Claim about Scale Armor
- GPT-4 incorrectly claimed that scale armor existed in ancient Egypt during a specific period.
- The technology actually appeared about 1,500 years later, highlighting the model's factual inaccuracy.
LLMs Extrapolate from Prominent Data
- LLMs struggle with obscure historical facts, possibly because they over-rely on prominent historical data.
- They extrapolate from readily available information, as seen in GPT-4's incorrect answer about whether ancient Egypt had a standing army.