2-minute chapter


24 - Superalignment with Jan Leike

AXRP - the AI X-risk Research Podcast

CHAPTER

How to Train Misaligned Models to Be Consistent Liars

You're deliberately training misaligned models: the idea is that you'd train them to lie and potentially self-exfiltrate, but at a low enough capability level that it's fine. The scary systems are the really coherent and believable liars that lie super consistently, where it would just be really hard to interrogate them in a way that points out their lies or any sort of inconsistencies in what they're saying. I think actually being this kind of consistent liar is not that easy, so we want to make it really, really hard for the model to be this consistent liar.
