NLP Benchmarks for Language Models

I think benchmarks are quite tricky, especially for language models and NLP, where if you just, like, change the beginning of a word... Even human judgment is sometimes not very useful, because humans cannot always tell whether the generated text is factual or not. So in that case, yeah, this can be a big problem. I think in the short term we can probably manage to put together some nice benchmarks using human evaluations. But at some point, even training models to evaluate models is not enough, because at that point the models are so much greater than human talent.
