
The fastest agent in the race has the best evals
The Stack Overflow Podcast
Why standard search evals fall short and need freshness
Benjamin critiques simple QA benchmarks and describes a real-time search eval that pulls recent news to test novelty.
Transcript


