How AI Is Built

#022 The Limits of Embeddings, Out-of-Domain Data, Long Context, Finetuning (and How We're Fixing It)

Sep 19, 2024
Join Nils Reimers, a prominent researcher in dense embeddings and the driving force behind foundational search models at Cohere. He dives into the limitations of text embeddings, such as their struggles with long documents and out-of-domain data. Reimers shares insights on the necessity of fine-tuning to adapt models effectively. He also discusses approaches like re-ranking to enhance search relevance, and the bright future of embeddings as new research avenues are explored. Don't miss this deep dive into the cutting edge of AI!
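Re-ranking, mentioned in the description, is commonly implemented with a cross-encoder that reads query and document together. Below is a minimal sketch using the open-source sentence-transformers library, which Reimers started; the model name, query, and candidate documents are illustrative assumptions, not taken from the episode.

```python
# Minimal re-ranking sketch with a sentence-transformers CrossEncoder.
# Assumption: the model name and example texts are illustrative only.
from sentence_transformers import CrossEncoder

# A cross-encoder scores query and document jointly, so it can judge
# relevance more precisely than comparing two independent embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what limits text embeddings on long documents?"
candidates = [
    "Embedding models compress a whole document into a single vector.",
    "The weather in Berlin is mild in September.",
    "Long inputs get truncated, so late content never affects the vector.",
]

# Score every (query, candidate) pair and sort: highest score first.
scores = reranker.predict([(query, doc) for doc in candidates])
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```

In practice a fast first-stage retriever (lexical or dense) fetches candidates, and the slower cross-encoder re-ranks only that shortlist.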
ANECDOTE

Argument Mining with BERT

  • Nils Reimers's colleagues used BERT for argument mining, classifying arguments pairwise.
  • Pairwise classification did not scale, which led him to explore embeddings for clustering similar arguments (see the sketch after this section).
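The scaling fix in this anecdote can be sketched as follows: classifying all pairs of n arguments needs O(n²) classifier passes, while embedding each argument once is O(n), after which the vectors can be clustered. This is a sketch assuming the sentence-transformers and scikit-learn packages; the model name, sample arguments, and distance threshold are assumptions for illustration.

```python
# Sketch: replace O(n^2) pairwise BERT classification with O(n) embedding
# plus clustering. Model name, arguments, and threshold are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

arguments = [
    "Nuclear power has a low carbon footprint.",
    "Reactors emit far less CO2 than coal plants.",
    "Storing radioactive waste remains unsolved.",
    "There is no safe long-term disposal for spent fuel.",
]

# Encode each argument exactly once into a normalized vector.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(arguments, normalize_embeddings=True)

# Group arguments whose embeddings are close in cosine distance.
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.6, metric="cosine", linkage="average"
).fit_predict(embeddings)

for label, arg in zip(labels, arguments):
    print(label, arg)
```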
INSIGHT

Open-Sourcing Embedding Training

  • Early embedding models like Universal Sentence Encoder and InferSent lacked training details and code.
  • Reimers recreated InferSent's training using BERT and open-sourced it, advancing the field.
INSIGHT

Out-of-Domain Embedding Weakness

  • Embeddings excel in-domain but struggle out-of-domain, often performing worse than lexical search.
  • The limitation stems from training data: an embedding model can only place text accurately in vector space for domains resembling what it was trained on (a hybrid lexical-plus-dense fallback is sketched below).
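A common mitigation for this out-of-domain weakness is hybrid retrieval: blend lexical BM25 scores, which handle unseen vocabulary well, with dense embedding scores. This is a sketch assuming the rank_bm25 and sentence-transformers packages; the corpus, model name, and the 50/50 mixing weight are illustrative assumptions, not recommendations from the episode.

```python
# Hybrid retrieval sketch: mix lexical BM25 with dense cosine similarity
# so exact term matches can carry queries from domains the embedding model
# never saw in training. Corpus, model, and weights are assumptions.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = [
    "Acme FluxCapacitor v2 error E-417: coolant pressure out of range.",
    "How to reset a FluxCapacitor after an E-417 fault.",
    "General tips for maintaining industrial equipment.",
]
query = "FluxCapacitor E-417 reset procedure"

# Lexical scores: exact term overlap, robust to domain-specific tokens.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
lex = np.array(bm25.get_scores(query.lower().split()))

# Dense scores: cosine similarity of normalized embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(corpus, normalize_embeddings=True)
q_vec = model.encode(query, normalize_embeddings=True)
dense = doc_vecs @ q_vec

# Min-max normalize each score list, then mix with an assumed 50/50 weight.
def norm(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * norm(lex) + 0.5 * norm(dense)
for i in np.argsort(-hybrid):
    print(f"{hybrid[i]:.3f}  {corpus[i]}")
```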