Scaling Laws in Language Models
The reason it works so well is just that you can scale transformers way up and they get better. So, to be precise about this picture: as compute increases, the loss decreases. Do you think the scaling law will last forever? Hold on, I'm not sure how to really grasp it. It's basically saying: compute go up, language model go brrr.
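The relationship the speakers are gesturing at is usually modeled as a power law in compute. A minimal sketch of that shape is below; the coefficients (`a`, `alpha`, `l_inf`) are hypothetical values chosen only to illustrate the curve, not fitted to any real model family.

```python
# Hedged sketch: scaling laws are commonly written as a power law,
#   L(C) = a * C**(-alpha) + L_inf
# where C is training compute and L_inf is an irreducible loss floor.
# All coefficients here are made up for illustration.

def predicted_loss(compute, a=10.0, alpha=0.05, l_inf=1.7):
    """Predicted LM loss at a given compute budget (hypothetical coefficients)."""
    return a * compute ** (-alpha) + l_inf

# More compute -> lower loss, with diminishing returns toward the floor l_inf.
budgets = [10.0 ** e for e in range(18, 25)]  # FLOPs, 1e18 .. 1e24
losses = [predicted_loss(c) for c in budgets]
assert all(hi > lo for hi, lo in zip(losses, losses[1:]))
```

The curve falls monotonically but flattens: each order of magnitude of compute buys a smaller loss improvement, which is exactly what "the loss decreases" with scale means in these plots.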