TechCrunch Industry News

AI isn’t very good at history, new paper finds

Jan 22, 2025
Jan 22, 2025
The discussion reveals a striking finding: top language models like GPT-4 score only about 46% on a benchmark of historical questions. It examines the limits of AI in grasping the nuances of history, introducing the HIST-LLM benchmark as a new measure. Despite these shortcomings, the episode closes on an optimistic note about AI's potential future contributions to historical research.
INSIGHT

LLMs Struggle with History

  • LLMs struggle with nuanced historical questions, achieving only around 46% accuracy on a PhD-level history exam.
  • They excel at basic facts but lack the deeper understanding needed for complex historical inquiry, as shown by GPT-4's errors on ancient Egyptian armor and armies.
ANECDOTE

GPT-4's Inaccurate Claim about Scale Armor

  • GPT-4 incorrectly claimed scale armor existed in ancient Egypt during a specific period.
  • The technology actually appeared 1,500 years later, highlighting the LLM's factual inaccuracy.
INSIGHT

LLMs Extrapolate from Prominent Data

  • LLMs struggle with obscure historical facts, possibly because they over-rely on prominent historical data.
  • They extrapolate from readily available information, as seen in GPT-4's incorrect answer about ancient Egypt's standing army.