#22 Nils Reimers on the Limits of Embeddings, Out-of-Domain Data, Long Context, Finetuning (and How We're Fixing It) | Search
Sep 19, 2024
Join Nils Reimers, a prominent researcher in dense embeddings and the driving force behind foundational search models at Cohere. He dives into the intriguing limitations of text embeddings, such as their struggles with long documents and out-of-domain data. Reimers shares insights on the necessity of fine-tuning to adapt models effectively. He also discusses innovative approaches like re-ranking to enhance search relevance, and the bright future of embeddings as new research avenues are explored. Don't miss this deep dive into the cutting edge of AI!
Text embeddings struggle with out-of-domain data and long documents, making fine-tuning essential for enhanced effectiveness in specific contexts.
The evolution of embeddings highlights the significance of leveraging existing models and employing two-stage retrieval processes for improved accuracy.
Deep dives
The Evolution of Embeddings
The discussion traces the development of embeddings since the introduction of BERT, highlighting early applications in argument mining and text clustering. Initially, pairwise classification was used to assess argument similarity, but its poor scalability led to the exploration of embeddings for more efficient clustering. This evolution enabled more effective semantic text processing and retrieval, marking a shift toward a unified view in which the same representations serve both clustering of similar texts and finding relevant answers to a user's query. As the field progressed, better models emerged, including InferSent, along with innovations like contrastive learning that improved how embeddings are trained and evaluated.
Performance Limitations of Embeddings
One significant discussion point centers on the limitations of text embeddings when applied outside their training domains. Notably, embeddings perform well only on data types and styles they have been trained on, encountering difficulties with unfamiliar or long-tail queries. This issue underscores the necessity for domain-specific evaluation sets, as users often struggle to define what constitutes 'similar' in various contexts. Fine-tuning or adapting embeddings to new contexts, rather than using them as out-of-the-box solutions, is essential to enhance their effectiveness and mitigate performance drops.
Methodologies for Effective Embedding Usage
The episode emphasizes that leveraging existing models, rather than training new ones from scratch, provides significant advantages for embedding tasks. Users should consider fine-tuning established embedding models on pairs of similar and dissimilar examples drawn from their particular use case. It also discusses combining multiple embeddings, which can be more robust than relying solely on the latest model and offers a practical strategy for classification and retrieval. Such approaches acknowledge the complexity of real tasks and the varying notions of similarity across users, making tailored implementations a necessity.
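To make the pair-based fine-tuning idea concrete, here is a minimal sketch using the sentence-transformers library (which Reimers created). The model name, the toy example pairs, and the choice of loss are illustrative assumptions, not the exact recipe from the episode.

```python
# Minimal sketch: fine-tune an existing embedding model on domain-specific pairs.
# Assumptions: sentence-transformers is installed; model name and pairs are placeholders.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from an existing pretrained embedding model instead of training from scratch.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Pairs of texts that should be close together in the embedding space.
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "Steps to recover account access"]),
    InputExample(texts=["What is your refund policy?",
                        "Returns are accepted within 30 days"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# A common contrastive objective: other examples in the batch act as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
model.save("my-domain-embedding-model")
```

In practice you would use far more pairs, ideally with hard negatives mined from your own corpus, but the structure stays the same: an existing model, pairs that encode your notion of "similar", and a contrastive loss.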
Addressing Challenges with Long Documents
Challenges in encoding and retrieving information from long documents are explored, particularly the limitations of current embedding methods. While models can technically handle longer contexts, they often fail to retain critical details as document length increases. This reinforces the point that a single embedding captures only a high-level gist of the text, missing the finer details needed for thorough comprehension. The recommended two-stage retrieval process, an initial embedding-based filtering pass followed by re-ranking, focuses on specific details in the shortlisted documents and maintains higher accuracy in information retrieval.
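Below is a minimal retrieve-then-rerank sketch of that two-stage process, again using sentence-transformers. The bi-encoder and cross-encoder model names and the toy documents are placeholder assumptions, not the specific models discussed in the episode.

```python
# Minimal sketch: two-stage retrieval (embedding search, then cross-encoder re-ranking).
# Assumptions: sentence-transformers is installed; models and documents are illustrative.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                   # fast first-stage retriever
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # slower, more precise re-ranker

documents = [
    "Embeddings compress an entire document into a single vector.",
    "A re-ranker scores each query-document pair by reading both texts together.",
    "Very long documents tend to lose fine-grained detail in a single embedding.",
]

query = "Why do long documents lose detail when embedded?"

# Stage 1: approximate retrieval by embedding similarity.
doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=3)[0]

# Stage 2: re-rank the shortlist with a cross-encoder that reads query and document jointly.
pairs = [(query, documents[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)

for (_, doc), score in sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```

The first stage keeps the candidate set cheap to compute over millions of documents; the second stage is where finer details, and signals like recency or trustworthiness, can be factored into the final ranking.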
Text embeddings have limitations when it comes to handling long documents and out-of-domain data.
Today, we are talking to Nils Reimers. He is one of the researchers who kickstarted the field of dense embeddings, developed sentence transformers, started HuggingFace's Neural Search team and now leads the development of foundational search models at Cohere. Tbh, he has too many accolades to list here.
We talk about the main limitations of embeddings:
Failing out of domain
Struggling with long documents
Very hard to debug
Hard to formalize what actually counts as similar
Are you still not sure whether to listen? Here are some teasers:
Interpreting embeddings can be challenging, and current models are not easily explainable.
Fine-tuning is necessary to adapt embeddings to specific domains, but it requires careful consideration of the data and objectives.
Re-ranking is an effective approach to handle long documents and incorporate additional factors like recency and trustworthiness.
The future of embeddings lies in addressing scalability issues and exploring new research directions.
text embeddings, limitations, long documents, interpretation, fine-tuning, re-ranking, future research
00:00 Introduction and Guest Introduction
00:43 Early Work with BERT and Argument Mining
02:24 Evolution and Innovations in Embeddings
03:39 Contrastive Learning and Hard Negatives
05:17 Training and Fine-Tuning Embedding Models
12:48 Challenges and Limitations of Embeddings
18:16 Adapting Embeddings to New Domains
22:41 Handling Long Documents and Re-Ranking
31:08 Combining Embeddings with Traditional ML
45:16 Conclusion and Upcoming Episodes