How to Train a Smaller Model to Grok Modular Addition
Sudden grokking actually broke down into three phases of training that I call memorization, where the model memorizes the training data; circuit formation, where it slowly transitions from the memorized solution to the trig-based generalizing solution while preserving train performance the entire time; and then cleanup, when it suddenly gets so good at generalizing that it's no longer worth keeping around the memorization parameters. These models are trained with weight decay, which incentivizes them to be simpler, so the model decides to get rid of them. The high-level principle from this, I think, is that it's a good proof of concept that a promising way to do science of deep learning and understand these models is by building a model organism like this.
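As a rough illustration of the setup being described, here is a minimal sketch that trains a small network on (a + b) mod p with weight decay and logs train versus test accuracy, so the three phases show up as early train saturation followed by a much later jump in test accuracy. The architecture and hyperparameters (a small MLP, p = 97, weight decay 1.0, 30% train fraction) are illustrative assumptions, not the exact configuration from the talk, which concerned a small transformer.

```python
# Sketch of a grokking experiment on modular addition (a + b) mod p.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 97                      # modulus for (a + b) mod p
frac_train = 0.3            # fraction of all p*p pairs used for training

# Full dataset: every pair (a, b) and its label (a + b) mod p.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = int(frac_train * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Small MLP over concatenated embeddings of the two operands.
class ModAddNet(nn.Module):
    def __init__(self, p, d_embed=128, d_hidden=512):
        super().__init__()
        self.embed = nn.Embedding(p, d_embed)
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_embed, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, p),
        )

    def forward(self, ab):
        e = self.embed(ab)             # (batch, 2, d_embed)
        return self.mlp(e.flatten(1))  # (batch, p) logits

model = ModAddNet(p)
# Weight decay is the key ingredient: it penalizes the memorizing
# solution and drives the eventual "cleanup" phase.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(20000):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(pairs[train_idx]).argmax(-1)
                         == labels[train_idx]).float().mean()
            test_acc = (model(pairs[test_idx]).argmax(-1)
                        == labels[test_idx]).float().mean()
        # Memorization: train accuracy saturates early while test lags.
        # Circuit formation + cleanup: test accuracy jumps much later.
        print(f"step {step:6d}  loss {loss.item():.4f}  "
              f"train {train_acc:.3f}  test {test_acc:.3f}")
```

The printed train/test gap is what makes grokking visible: if the weight decay term is removed, the model can stay at the memorized solution indefinitely, which is why the cleanup phase depends on it.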