Semantic String Doc ID Is Better Than Atomic String ID

I find it hard to rub my head around this idea that the model would just learn one separate token for each document and that's it. I think I could imagine a world where atomic doc ID is better than that with perfect optimization. So at least to me, the big takeaway is semantic string doc ID does seem to perform best in the large corpus, large model case. It's unclear to what degree things could be optimized further and the results would change based on that.

Play episode from 32:56

Transcript

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app