
ColBERT + ColBERTv2: late interaction at a reasonable inference cost

Neural Search Talks — Zeta Alpha


The Similarity Matrix and Late Interaction

In late interaction, they create the query and document embeddings separately. Then, with those embeddings, they essentially max-pool over the query terms in the similarity matrix: for each query term they find the maximum similarity against the document terms, where similarity is cosine or dot product, and then sum these maximum values. It's a significant amount of computation, so you can only re-rank something like 100 or 1,000 candidate documents at a reasonable speed at inference time.
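As a rough illustration of the MaxSim scoring described above, here is a minimal PyTorch sketch (not the official ColBERT implementation); it assumes the per-token query and document embeddings have already been produced and L2-normalized, so the dot product equals cosine similarity.

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim: for each query token, take the max similarity over all
    document tokens, then sum those maxima over the query tokens.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    Embeddings are assumed to be L2-normalized.
    """
    # Similarity matrix of shape (num_query_tokens, num_doc_tokens)
    sim = query_emb @ doc_emb.T
    # Max over document tokens per query token, then sum over query tokens
    return sim.max(dim=1).values.sum()

# Toy re-ranking over a small candidate set (hypothetical random embeddings)
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
docs = [torch.nn.functional.normalize(torch.randn(n, 128), dim=-1) for n in (30, 50, 40)]
scores = [late_interaction_score(q, d).item() for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Because every candidate requires a full query-by-document similarity matrix, this scoring is applied only to a shortlist of candidates rather than the whole collection, which is why re-ranking is limited to roughly 100 to 1,000 documents per query.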
