Why Static Benchmarks Fail for LLMs

Dan and Alex discuss the limits of static evaluations and why dynamic, game-based tests reveal more about model capabilities.

Play episode from 04:10

chevron_right

Transcript

chevron_right

Transcript

Episode notes

This episode is a little different from our usual fare: It’s a conversation with our head of AI training Alex Duffy about Good Start Labs, a company he incubated inside Every. Today, Good Start Labs is spinning out of Every as a separate company with $3.6 million in funding from General Catalyst, Inovia, Every, and a group of angel investors from top-tier AI labs like DeepMind. We get into how Alex learned some of his biggest lessons about the real world from games, starting with RuneScape, which taught him how markets work and how not to get scammed. He explains why the static benchmarks we use to evaluate LLMs today are breaking down, and how games like Diplomacy offer a richer, more dynamic way to test and train large language models. Finally, Alex shares where he sees the most promise in AI—software, life sciences, and education—and why he believes games can make the models we use smarter, while helping people understand and use AI more effectively.

If you found this episode interesting, please like, subscribe, comment, and share.

Want even more?

Sign up for Every to unlock our ultimate guide to prompting ChatGPT here: https://every.ck.page/ultimate-guide-to-prompting-chatgpt. It’s usually only for paying subscribers, but you can get it here for free.

To hear more from Dan Shipper:

Subscribe to Every: https://every.to/subscribe
Follow him on X: https://twitter.com/danshipper

Timestamps

00:00:00 - Start

00:01:48 - Introduction

00:04:14 - Why evals and benchmarks are broken

00:07:13 - The sneakiest LLMs in the market

00:13:00 - A competition that turns prompting into a sport

00:15:49 - Building a business around using games to make AI better

00:22:39 - Can language models learn how to be funny

00:25:31 - Why games are a great way to evaluate and train new models

00:26:58 - What child psychology tells us about games and AI

00:30:10 - Using games to unlock continual learning in AI

00:36:42 - Why Alex cares deeply about games

00:44:37 - Where Alex sees the most promise in AI

00:50:54 - Rethinking how young people start their careers in the age of AI

Links to resources mentioned in the episode:

Alex Duffy: alex duffy (@alxai_)
Good Start Labs: https://goodstartlabs.com/, good start (@goodstartlabs)
The book Alex is reading about the importance of games: Playing with Reality: How Games Shape Our World
The book Dan recommends by the psychoanalyst D.W. Winnicott: Playing and Reality

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app

Home Top podcasts Popular guests Top books