

AI isn’t very good at history, new paper finds
Jan 22, 2025
The discussion centers on a striking finding: top language models such as GPT-4 score only about 46% on a benchmark of historical questions. The episode examines the limitations of AI in grasping the nuances of history and introduces the HIST-LLM benchmark as a new measure of that ability. Despite these shortcomings, the guests offer an optimistic view of AI's potential contributions to historical research.
LLMs Struggle with History
- LLMs struggle with nuanced historical questions, achieving only around 46% accuracy on a PhD-level history exam.
- They excel at basic facts but lack the deeper understanding needed for complex historical inquiry, as shown by GPT-4's errors about ancient Egyptian armor and armies.
GPT-4's Inaccurate Claim about Scale Armor
- GPT-4 incorrectly claimed that scale armor existed in ancient Egypt during a specific period.
- The technology actually appeared about 1,500 years later, highlighting the model's factual inaccuracy.
LLMs Extrapolate from Prominent Data
- LLMs struggle with obscure historical facts, possibly because they over-rely on prominent historical data.
- They extrapolate from readily available information, as seen in GPT-4's incorrect answer about whether ancient Egypt had a standing army.