With GPT-3 we're talking about 175 billion parameters, at least in the original version of GPT-3. And I think that's a shockingly large number of things being learned, essentially numbers in the model being learned. So you could argue that even if we were in the regime that statisticians are used to, the situation for something like GPT-3 wouldn't be too bad, because the amount of data has been quite large compared to the number of parameters. But even when you have neural networks with far more parameters than labels, how can those generalize? That happens due to properties of the optimization algorithm, stochastic gradient descent.
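
A minimal sketch (not from the episode) of one well-known piece of this story: on an overparameterized linear least-squares problem, plain gradient descent started from zero converges to the minimum-norm solution that fits the data exactly, an implicit-regularization effect coming from the optimizer rather than from any explicit penalty. All names and values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_params = 20, 200          # far more parameters than labels
X = rng.standard_normal((n_samples, n_params))
y = rng.standard_normal(n_samples)

# Full-batch gradient descent on mean squared error, initialized at zero.
w = np.zeros(n_params)
lr = 1e-3
for _ in range(20_000):
    grad = X.T @ (X @ w - y) / n_samples
    w -= lr * grad

# The minimum-L2-norm interpolating solution, via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("training error:", np.linalg.norm(X @ w - y))                 # ~0: interpolates
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~0
```

Because the iterates never leave the row space of X when started from zero, the solution gradient descent finds is the lowest-norm one among the infinitely many that fit the labels, which is the standard toy illustration of how the optimizer itself can pick out well-generalizing solutions in overparameterized models.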
