Is the Semantic String Doc ID the Best?

The best strategy varies depending on the metric you're looking at and the size of the corpus. The atomic doc ID results are a little bit all over the place: sometimes making the model larger actually hurts performance. I think the authors mention that they had some stability challenges, and apparently it's pretty hard to train a layer that large. So maybe the most naive assumption for representing documents would outperform the other strategies.
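To get a feel for why that layer gets so large, here is a rough sketch. It assumes the atomic doc ID scheme gives every document its own output token, so the output layer has one row per document, while a string-based doc ID reuses a fixed token vocabulary. The function name, the hidden size, and the vocabulary size are illustrative assumptions, not figures from the episode or the paper.

```python
def output_layer_params(num_output_ids: int, hidden_dim: int) -> int:
    """Weights in a linear layer that maps a hidden state to one logit per output ID."""
    return num_output_ids * hidden_dim


HIDDEN_DIM = 768       # hypothetical hidden size, for illustration only
SHARED_VOCAB = 32_000  # hypothetical token vocabulary for string-based doc IDs

for num_docs in (10_000, 100_000, 1_000_000):
    # Atomic doc IDs (assumed): one output row per document, grows with the corpus.
    atomic = output_layer_params(num_docs, HIDDEN_DIM)
    # String-based doc IDs (assumed): output rows fixed by the shared vocabulary.
    string_based = output_layer_params(SHARED_VOCAB, HIDDEN_DIM)
    print(f"{num_docs:>9,} docs: atomic={atomic:,} params, string-based={string_based:,} params")
```

Under these assumptions the atomic layer grows linearly with the corpus while the string-based one stays constant, which fits the point that the atomic layer becomes hard to train as the corpus gets bigger.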
