AXRP - the AI X-risk Research Podcast

24 - Superalignment with Jan Leike



Automated Interpretability for Neurons

This is work that we kind of started last year, and we released a paper on automated interpretability earlier this year. The idea is basically: what you would want is a technique that works at the level of detail of individual neurons, so that you can make sure you don't miss any details. The way to then scale that to the entire model is automation, right? But you can do that: once you figure out how to do it at that level of detail, then you just record what you're doing.
