Is the Model Doing Better Than the Baseline?

The test set for numerical reasoning has only 20 data points, so every answer you get right moves your score by 5 percentage points. So take a model that scores 25% versus another one that scores only 20%: maybe it's just getting one more answer right than the other one. It can look like a large performance jump, but it isn't really such a large jump.
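To make the arithmetic concrete, here is a minimal sketch in Python. It is an illustration, not something from the episode: the helper names `accuracy_step` and `wilson_interval` are hypothetical, and the choice of a 95% Wilson score interval is an assumption about how one might quantify the uncertainty. It shows that one item out of 20 is worth 5 percentage points, and that the intervals around 5/20 and 4/20 overlap heavily.

```python
import math
from typing import Tuple


def accuracy_step(n_items: int) -> float:
    """Change in accuracy caused by one extra correct answer."""
    return 1.0 / n_items


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%).

    Hypothetical helper for illustration; assumes each test item is an
    independent right/wrong trial.
    """
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


n = 20                      # size of the numerical reasoning test set
step = accuracy_step(n)     # one answer = 1/20 of the score

# Model A gets 5 of 20 right (25%), model B gets 4 of 20 right (20%).
lo_a, hi_a = wilson_interval(5, n)
lo_b, hi_b = wilson_interval(4, n)

# The intervals overlap if each one starts before the other ends.
intervals_overlap = lo_a < hi_b and lo_b < hi_a

print(f"one answer is worth {step:.0%} of the score")
print(f"model A: 25%, interval {lo_a:.2f} to {hi_a:.2f}")
print(f"model B: 20%, interval {lo_b:.2f} to {hi_b:.2f}")
print(f"intervals overlap: {intervals_overlap}")
```

With only 20 items, both intervals span tens of percentage points, so under these assumptions a 20% versus 25% gap is well within what a single lucky answer could produce.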
