
The fastest agent in the race has the best evals
The Stack Overflow Podcast
Problems with eval datasets and verification
Benjamin discusses errors in eval datasets, the need for human verification, and how synthetically or adversarially constructed datasets can mislead.
Transcript


