Held-out tasks and domain generalization

The speakers discuss validating benchmarks against held-out tasks that differ systematically from the original task set, as a way to measure generalization.

Episode notes

When METR says something like "Claude Opus 4.5 has a 50% time horizon of 4 hours and 50 minutes", what does that mean? In this episode, David Rein, a METR researcher and co-author of the paper "Measuring AI ability to complete long tasks", talks about METR's work on measuring time horizons, the methodology behind those numbers, and what work remains to be done in this domain.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2026/01/03/episode-47-david-rein-metr-time-horizons.html

Topics we discuss, and timestamps:

0:00:32 Measuring AI Ability to Complete Long Tasks

0:10:54 The meaning of "task length"

0:19:27 Examples of intermediate and hard tasks

0:25:12 Why the software engineering focus

0:32:17 Why task length as difficulty measure

0:46:32 Is AI progress going superexponential?

0:50:58 Is AI progress due to increased cost to run models?

0:54:45 Why METR measures model capabilities

1:04:10 How time horizons relate to recursive self-improvement

1:12:58 Cost of estimating time horizons

1:16:23 Task realism vs mimicking important task features

1:19:50 Excursus on "Inventing Temperature"

1:25:46 Return to task realism discussion

1:33:53 Open questions on time horizons

Links for METR:

Main website: https://metr.org/

X/Twitter account: https://x.com/METR_Evals/

Research we discuss:

Measuring AI Ability to Complete Long Tasks: https://arxiv.org/abs/2503.14499

RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts: https://arxiv.org/abs/2411.15114

HCAST: Human-Calibrated Autonomy Software Tasks: https://arxiv.org/abs/2503.17354

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity: https://arxiv.org/abs/2507.09089

Anthropic Economic Index: Tracking AI's role in the US and global economy: https://www.anthropic.com/research/anthropic-economic-index-september-2025-report

Bridging RL Theory and Practice with the Effective Horizon (i.e. the Cassidy Laidlaw paper): https://arxiv.org/abs/2304.09853

How Does Time Horizon Vary Across Domains?: https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-across-domains/

Inventing Temperature: https://global.oup.com/academic/product/inventing-temperature-9780195337389

Is there a Half-Life for the Success Rates of AI Agents? (by Toby Ord): https://www.tobyord.com/writing/half-life

Lawrence Chan's response to the above: https://nitter.net/justanotherlaw/status/1920254586771710009

AI Task Length Horizons in Offensive Cybersecurity: https://sean-peters-au.github.io/2025/07/02/ai-task-length-horizons-in-offensive-cybersecurity.html

Episode art by Hamish Doodles: hamishdoodles.com
