5min chapter

Neel Nanda - Mechanistic Interpretability

Machine Learning Street Talk (MLST)

CHAPTER

The Importance of Interpretability in a World Full of GPT4 Models

I'm very happy with there was a prompt saying to deceive someone or it learned that in this context people often output things that are intended to convince someone. My vision of what interpretability would look like is that we take some big foundation model, like the GPT-4 base model or the fine-tuned GPT-4 that's being used as a base for everything else. I think getting a deep understanding of a single model is kind of plausibly possible but do you think it doesn't change that much so no one's really checked? The more you're using weird techniques like reinforcement learning from human feedback, the less I'm confident in this claim, and yeah, if we discovered that every time you fine-
