
19 - Mechanistic Interpretability with Neel Nanda

AXRP - the AI X-risk Research Podcast

CHAPTER

How to Train a Smaller Model to Grok Modular Addition

Sudden grokking actually broke down into three phases of training that I call memorization, where models memorize the training data; circuit formation, where it slowly transitions from the memorized solution to the trig-based generalizing solution while preserving train performance the entire time; and then cleanup, when it's suddenly gotten so good at generalizing that it's no longer worth keeping around the memorization parameters. These models are trained with weight decay, which incentivizes them to be simpler, so it decides to get rid of them. The kind of high-level principle from this is, I think, that it's a good proof of concept that a promising way to do the science of deep learning and understand these models is by building a model organism like…
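To make the setup concrete, here is a minimal sketch of a grokking experiment on modular addition in the spirit of what's described above: a tiny model trained on a fraction of all (a + b) mod P pairs with heavy weight decay, logging train loss and test accuracy so the memorization and grokking phases are visible. All hyperparameters (modulus, train fraction, model size, weight decay) are illustrative assumptions, and the TwoTokenMLP stand-in is simplified from the small one-layer transformer used in the original experiments.

```python
# Hedged sketch of a grokking setup on modular addition. Hyperparameters are
# illustrative, not the exact values from the work discussed in the episode.
import torch
import torch.nn as nn

P = 113            # prime modulus for (a + b) mod P (illustrative choice)
FRAC_TRAIN = 0.3   # small train fraction leaves room to memorize, then generalize

# Full dataset: every pair (a, b) with label (a + b) % P.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = int(FRAC_TRAIN * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

class TwoTokenMLP(nn.Module):
    """Tiny stand-in model: embed both input tokens, concatenate, MLP to logits.
    (The original experiments used a small one-layer transformer.)"""
    def __init__(self, p, d_model=128, d_mlp=512):
        super().__init__()
        self.embed = nn.Embedding(p, d_model)
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_mlp), nn.ReLU(), nn.Linear(d_mlp, p))

    def forward(self, x):  # x: (batch, 2) integer tokens
        return self.mlp(self.embed(x).flatten(1))

model = TwoTokenMLP(P)
# Heavy weight decay is the simplicity pressure that eventually makes the
# memorization parameters not worth keeping around (the "cleanup" phase).
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(20_000):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            preds = model(pairs[test_idx]).argmax(-1)
            test_acc = (preds == labels[test_idx]).float().mean().item()
        # The signature of grokking: train loss drops early (memorization),
        # test accuracy sits near chance for a long time, then jumps.
        print(f"step {step}: train loss {loss.item():.3f}, test acc {test_acc:.3f}")
```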
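Because circuit formation is invisible in the loss curves alone, this line of work also tracks progress measures computed from the model's internals. Below is one hedged, illustrative proxy, not the exact metric from the research: how concentrated the learned embedding's spectral power is in a few Fourier frequencies. A trig-based generalizing solution concentrates power in a handful of frequencies, while a memorized lookup table spreads it out. The helper name fourier_concentration is hypothetical.

```python
# Illustrative progress-measure sketch (an assumption, not the paper's metric).
import torch

def fourier_concentration(embed_weight: torch.Tensor, top_k: int = 5) -> float:
    """Fraction of the embedding's spectral power in its top_k frequencies.

    embed_weight: (P, d_model) embedding matrix; the DFT is taken over the
    token (residue) dimension.
    """
    freqs = torch.fft.rfft(embed_weight, dim=0)  # (P//2 + 1, d_model), complex
    power = freqs.abs().pow(2).sum(dim=1)        # total power per frequency
    return (power.topk(top_k).values.sum() / power.sum()).item()

# Example: log it alongside the loss during training to watch circuit formation.
# print(f"fourier conc.: {fourier_concentration(model.embed.weight.detach()):.3f}")
```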

