
60 - FEVER: a large-scale dataset for Fact Extraction and VERification, with James Thorne
NLP Highlights
The Artifacts Problem in a Data Set
In our dataset we report two types of scores: the label-only accuracy, and the accuracy conditioned on finding the right evidence. This hypothesis-only style of evaluation gives us a score of about 50%, which is comparable to the MultiNLI dataset. So yes, that is significantly above chance, but at least it's not way higher than that. It also shows that the artifacts problem is just as problematic for us as it is for that style of dataset if we ignore the requirement for evidence.
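To make the difference between the two scores concrete, here is a minimal sketch in Python. It is not the official FEVER scorer: the field names (`gold_label`, `predicted_label`, `gold_evidence_sets`, `predicted_evidence`), the representation of evidence as (page, sentence index) pairs, and the toy examples are all assumptions made for illustration. It assumes a prediction counts under the evidence-conditioned score only if the label is right and, for claims that need evidence, the predicted evidence covers at least one complete gold evidence set.

```python
from typing import Any, Dict, List

NOT_ENOUGH_INFO = "NOT ENOUGH INFO"


def label_accuracy(examples: List[Dict[str, Any]]) -> float:
    """Fraction of claims whose predicted label matches the gold label."""
    if not examples:
        return 0.0
    correct = sum(ex["predicted_label"] == ex["gold_label"] for ex in examples)
    return correct / len(examples)


def evidence_conditioned_accuracy(examples: List[Dict[str, Any]]) -> float:
    """Fraction of claims with the right label AND sufficient evidence.

    Assumption: claims labelled NOT ENOUGH INFO need no evidence; all other
    claims need the predicted evidence to contain one full gold evidence set.
    """
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        if ex["predicted_label"] != ex["gold_label"]:
            continue
        if ex["gold_label"] == NOT_ENOUGH_INFO:
            correct += 1
            continue
        predicted = set(ex["predicted_evidence"])
        if any(set(gold) <= predicted for gold in ex["gold_evidence_sets"]):
            correct += 1
    return correct / len(examples)


# Hypothetical toy data: (page title, sentence index) pairs stand in for evidence.
EXAMPLES = [
    {  # right label, right evidence
        "gold_label": "SUPPORTS", "predicted_label": "SUPPORTS",
        "gold_evidence_sets": [[("Page_A", 0)]],
        "predicted_evidence": [("Page_A", 0), ("Page_B", 1)],
    },
    {  # right label, wrong evidence
        "gold_label": "REFUTES", "predicted_label": "REFUTES",
        "gold_evidence_sets": [[("Page_C", 2)]],
        "predicted_evidence": [("Page_C", 3)],
    },
    {  # right label, no evidence required
        "gold_label": NOT_ENOUGH_INFO, "predicted_label": NOT_ENOUGH_INFO,
        "gold_evidence_sets": [], "predicted_evidence": [],
    },
    {  # wrong label
        "gold_label": "SUPPORTS", "predicted_label": "REFUTES",
        "gold_evidence_sets": [[("Page_D", 0)]],
        "predicted_evidence": [("Page_D", 0)],
    },
]

print(label_accuracy(EXAMPLES))                 # 0.75
print(evidence_conditioned_accuracy(EXAMPLES))  # 0.5
```

On this toy data the second claim is what separates the two numbers: a model that guesses the label from the claim alone still gets credit under label-only accuracy, but not once the evidence requirement is enforced.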