How to Measure Academic Data Set Performance

Evaluation of these models is incredibly hard for a bunch of different reasons. The only reliable measure is actually putting two models in front of people and asking them which one they prefer. You can just measure that in academic dataset performance. If you throw a base model at these academic datasets and you throw a command model at them, the command model is going to perform leagues and leagues better, dramatically better.

How much effort is this step? Is it comparable in cost to training the base model? How long does it take? How expensive is it?

It's really hard to get right.
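The side-by-side human comparison described above is often summarized as a win rate. The following is only a minimal sketch under stated assumptions, not the speaker's actual evaluation setup: the vote labels ("command", "base", "tie"), the function name win_rate, and the sample votes are all hypothetical and chosen purely for illustration.

```python
from collections import Counter


def win_rate(votes, model):
    """Return the fraction of decided (non-tie) votes that prefer `model`.

    `votes` is a list of labels, one per human comparison. The labels
    used here are hypothetical, not taken from the episode.
    """
    counts = Counter(votes)
    decided = sum(n for label, n in counts.items() if label != "tie")
    if decided == 0:
        raise ValueError("no decided votes to compute a win rate from")
    return counts[model] / decided


# Made-up votes: each entry is one rater's preference between two outputs.
sample_votes = ["command", "command", "base", "command", "tie", "command"]
command_rate = win_rate(sample_votes, "command")
print(f"command win rate: {command_rate:.2f}")
```

Ties are excluded from the denominator here; counting a tie as half a win for each model is an equally common convention, and which one is used should be stated alongside any reported number.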

