Can Benchmarks Be Used as a Benchmark?
This can be used as a sort of sanity check: okay, did my system actually do better than a super naive baseline? Or, I want to compare some systems head to head, so let's use this benchmark. You might also use test suites, which are put together to map out particular kinds of cases that you want to handle well. There's also adversarial testing, where people create test sets by collecting all the examples that previous systems did poorly on. And then another one is what we did in the Build It, Break It shared tasks. So, two examples that were minimally different from each other, but for which the systems would work for one but not