
#22 Nils Reimers on the Limits of Embeddings, Out-of-Domain Data, Long Context, Finetuning (and How We're Fixing It) | Search
How AI Is Built
Understanding Embeddings: From Encoder to Contextualization
Embeddings are typically produced with encoder-only models, which tend to outperform decoder-only models at this task. The encoder outputs a contextualized embedding for every input token, and these must be pooled into a single vector; common strategies are taking the CLS token, taking the last token, or averaging all token embeddings. How well the resulting embeddings work depends heavily on the training data and on how similarity is defined, which varies widely across contexts. Earlier models such as LASER and LaBSE applied embedding models to tasks like identifying translated (parallel) texts.
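To make the pooling strategies concrete, here is a minimal sketch using the Hugging Face transformers library; the checkpoint name is illustrative, not one mentioned in the episode. It shows mean pooling over non-padding tokens, with CLS pooling noted as a one-line alternative:

```python
# Minimal sketch: pooling per-token encoder outputs into one embedding.
# Model name is illustrative; any encoder checkpoint works the same way.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    hidden = out.last_hidden_state            # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)  # zero out padding positions
    # CLS pooling would simply be: hidden[:, 0]
    # Mean pooling: average token embeddings over real (non-padding) tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["What are embeddings?", "Embeddings map text to vectors."])
print(vectors.shape)  # e.g. torch.Size([2, 384]) for this checkpoint
```

Which pooling works best is itself learned behavior: a model trained with mean pooling will usually degrade if queried with CLS pooling, which is one way the training setup shapes what "similarity" the embeddings capture.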