How to Make a Good Benchmark for Language Models

I think a lot of the benchmarks we have are out of date. It's very difficult to make a good benchmark, especially in this space, because so much of our sense of what language is remains pretty fuzzy. So these benchmarks are very hard to put together. You also need so much data that they're difficult to generate in the first place. And because you need to generate them at very large scale, you end up with an enormous pile of data that you then need to assess and evaluate. It's just a really, really hard problem.
