Current AI models are strange. They can speak—often coherently, sometimes even eloquently—which is wild. They can predict the structure of proteins, beat the best humans at many games, recall more facts in most domains than human experts; yet they also struggle to perform simple tasks, like using computer cursors, maintaining basic logical consistency, or explaining what they know without wholesale fabrication.

Perhaps someday we will discover a deep science of intelligence, and this will teach us how to properly describe such strangeness. But for now we have nothing of the sort, so we are left merely gesturing in vague, heuristical terms; lately people have started referring to this odd mixture of impressiveness and idiocy as “spikiness,” for example, though there isn’t much agreement about the nature of the spikes.

Of course it would be nice to measure AI progress anyway, at least in some sense sufficient to help us [...]

---

Outline:

(03:48) Conceptual Coherence

(07:12) Benchmark Bias

(10:39) Predictive Value

The original text contained 4 footnotes which were omitted from this narration.

---

First published:
October 14th, 2025

Source:
https://www.lesswrong.com/posts/PzLSuaT6WGLQGJJJD/the-length-of-horizons

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Graph showing AI task duration over time from GPT-2 through GPT-5 releases


“The ‘Length’ of ‘Horizons’” by Adam Scholl

LessWrong (Curated & Popular)

