4min chapter

19 - Mechanistic Interpretability with Neel Nanda

AXRP - the AI X-risk Research Podcast

CHAPTER

How to Train a Smaller Model to Grok Modular Addition

Sudden grokking actually broke down into three phases of training that I call: memorization, where models memorize the training data; circuit formation, where the model slowly transitions from the memorized solution to the trick-based generalizing solution while preserving train performance the entire time; and cleanup, when it suddenly gets so good at generalizing that it's no longer worth keeping around the memorization parameters. These models are trained with weight decay, which incentivizes them to be simpler, so the model decides to get rid of them. The high-level principle from this is, I think, that it's a good proof of concept that a promising way to do science of deep learning and understand these models is by building a model organism like
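The setup described above — a small model trained on modular addition with weight decay and a limited train split — can be sketched roughly as follows. This is a minimal illustrative sketch, not the setup from the episode (Nanda's experiments used a one-layer transformer and a larger modulus); the modulus, architecture, train fraction, and all hyperparameters here are hypothetical choices, and a real grokking run needs far more training steps.

```python
import numpy as np

P = 13  # hypothetical small modulus; the actual experiments used a larger prime
rng = np.random.default_rng(0)

# Full dataset: every pair (a, b) labeled with (a + b) mod P, inputs one-hot encoded.
a, b = np.meshgrid(np.arange(P), np.arange(P), indexing="ij")
pairs = np.stack([a.ravel(), b.ravel()], axis=1)
labels = (pairs[:, 0] + pairs[:, 1]) % P

def one_hot(idx, n):
    out = np.zeros((len(idx), n))
    out[np.arange(len(idx)), idx] = 1.0
    return out

X = np.concatenate([one_hot(pairs[:, 0], P), one_hot(pairs[:, 1], P)], axis=1)

# Limited train split — grokking shows up when only a fraction of pairs are seen.
perm = rng.permutation(len(X))
n_train = int(0.3 * len(X))
train, test = perm[:n_train], perm[n_train:]

# One-hidden-layer ReLU MLP; weight decay supplies the "incentive to be simpler".
d_hidden = 64
W1 = rng.normal(0, 0.1, (2 * P, d_hidden))
W2 = rng.normal(0, 0.1, (d_hidden, P))
lr, wd = 0.5, 1e-3  # hypothetical learning rate and weight-decay strength

def forward(x):
    h = np.maximum(x @ W1, 0.0)
    return h, h @ W2

for step in range(200):  # far too few steps for real grokking; sketch only
    h, logits = forward(X[train])
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Cross-entropy gradient w.r.t. logits is (probs - one_hot) / batch size.
    grad_logits = probs
    grad_logits[np.arange(len(train)), labels[train]] -= 1.0
    grad_logits /= len(train)
    gW2 = h.T @ grad_logits
    gh = grad_logits @ W2.T
    gh[h <= 0] = 0.0  # backprop through ReLU
    gW1 = X[train].T @ gh
    # Gradient step plus weight decay, the pressure that drives cleanup.
    W1 -= lr * (gW1 + wd * W1)
    W2 -= lr * (gW2 + wd * W2)

train_acc = (forward(X[train])[1].argmax(1) == labels[train]).mean()
test_acc = (forward(X[test])[1].argmax(1) == labels[test]).mean()
print(f"train acc {train_acc:.2f}, test acc {test_acc:.2f}")
```

Tracking `train_acc` and `test_acc` over a long run is what exposes the three phases: train accuracy saturates early (memorization), test accuracy stays low through circuit formation, then jumps during cleanup.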

00:00
