Arctic Embed with Luke Merrick, Puxuan Yu, and Charles Pierse - Weaviate Podcast #110!
Dec 18, 2024
Join Luke Merrick and Puxuan Yu from Snowflake, key players in Arctic Embed development, and Charles Pierse, head of Weaviate Labs, as they dive into the intricacies of multilingual text embeddings. They explore the evolution of Arctic Embed 2.0, emphasizing its open-source nature. The conversation covers technical strategies in model training, the economics of pre-training large models, and the challenges of integrating negative examples. They discuss the delicate balance between model simplicity and nuance in retrieval, promoting collaboration to enhance search quality.
The Arctic Embed series, stemming from Snowflake's acquisition of Neeva, emphasizes the pivotal role of embedding models in enhancing search quality.
A user-centric approach in selecting embedding models fosters community trust, balancing performance and parameter count to optimize various applications.
By evaluating on the CLEF dataset, the Arctic Embed models strive for improved multilingual capabilities, addressing inconsistencies in existing models for a global audience.
Challenges in synthetic data generation underscore the necessity for contextually rich queries that accurately reflect user intent for effective retrieval.
Deep dives
Introduction to Arctic Embed Models
The Arctic Embed text embedding model series originated at Snowflake after the acquisition of Neeva, a search company. This led to the development of Cortex Search for managed search solutions within Snowflake. Early experiments revealed that embedding models had the most significant impact on search quality, leading to the realization that focusing on these models was crucial. Subsequently, the team recognized the potential of open-source models, aiming to create a trusted community resource while still providing a premium managed service.
Emphasis on Retrieval and Community Trust
The development of Weaviate’s embedding services highlights the importance of community trust in selecting embedding models. Third-party embedding APIs, while convenient, come with limitations that prompted the need for an in-house embedding service. The Arctic Embed models were chosen for their balance of parameter count and performance, retaining significant recall even when embedding dimensions were reduced. This user-centric approach promotes confidence in using these models for various applications, showcasing the importance of relying on established community feedback.
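The dimension-reduction behavior described above can be sketched with a small, hypothetical example. Matryoshka-style models pack the most useful information into the leading coordinates, so truncating and re-normalizing vectors keeps them usable for similarity search. The function name and toy data below are illustrative, not Arctic Embed's actual API.

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the first `dim` coordinates, then re-normalize to unit length.

    Matryoshka-trained models concentrate information in the leading
    dimensions, so truncated vectors remain effective for retrieval.
    """
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Toy example: 4 documents with 8-dimensional "embeddings"
# (in practice this might be e.g. 768 -> 256).
rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))
docs_256 = truncate_and_normalize(docs, 4)
query = truncate_and_normalize(rng.normal(size=(1, 8)), 4)
scores = query @ docs_256.T  # cosine similarity, since vectors are unit-length
print(scores.shape)  # (1, 4)
```

Because both sides are unit-normalized after truncation, the dot product remains a valid cosine similarity at the reduced dimensionality.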
Performance and Benchmarking Considerations
Benchmarking standards like MTEB and BEIR are crucial for evaluating and comparing model performance across multiple tasks. However, users often face challenges in interpreting average scores due to potential model biases or insufficient training data. Reliance on training datasets that overlap with the benchmarks can produce misleading results, leading to skepticism about the effectiveness of certain models. Therefore, it is essential to assess both retrieval-focused and broader task performance to ensure models generalize well beyond their training contexts.
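Retrieval benchmarks like BEIR boil down to simple rank-based metrics computed per query and then averaged. As a minimal sketch (not the official evaluation code), recall@k can be computed like this:

```python
def recall_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    top_k = set(ranked_doc_ids[:k])
    hits = sum(1 for d in relevant_ids if d in top_k)
    return hits / len(relevant_ids)

# Hypothetical ranking for one query where docs 7 and 2 are relevant.
ranking = [7, 3, 9, 2, 5, 1]
print(recall_at_k(ranking, {7, 2}, k=3))  # 0.5 -> only doc 7 is in the top 3
```

Averaging such per-query scores across many datasets is exactly what makes a single benchmark number easy to report but hard to interpret, as discussed above.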
Transitioning Towards Multilingual Capabilities
As the demand for multilingual retrieval increases, the Arctic Embed models aimed to address it while avoiding overfitting to specific benchmarks. The CLEF dataset was introduced to evaluate multilingual performance while ensuring that the models remained generalizable across various languages. The use of diverse datasets allowed the team to identify inconsistencies in existing models and target improved results in less-represented languages. This approach is crucial for providing effective solutions to an increasingly global customer base.
Challenges in Synthetic Data Generation
While synthetic data generation presents a valuable opportunity to enhance model training, challenges remain in ensuring the quality and relevance of generated queries. Initial attempts at using LLMs to create synthetic queries often result in outputs that do not align well with actual retrieval tasks, requiring further refinement. Additionally, the need for contextually rich queries remains pivotal, as many generated examples do not reflect true user intent. Ensuring that synthetic queries align closely with practical retrieval scenarios requires ongoing exploration and development.
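One common refinement for synthetic queries, consistent with the quality concerns above, is a round-trip consistency filter: keep a generated query only if it actually retrieves the passage it was generated from. The sketch below assumes pre-computed embeddings; the function and data are hypothetical.

```python
import numpy as np

def consistency_filter(query_vecs, doc_vecs, source_idx, top_k=1):
    """Keep a synthetic query only if its source document ranks in its top-k.

    This round-trip check discards generated queries that fail to retrieve
    the very passage they were generated from.
    """
    keep = []
    for i, q in enumerate(query_vecs):
        scores = doc_vecs @ q
        top = np.argsort(-scores)[:top_k]
        if source_idx[i] in top:
            keep.append(i)
    return keep

rng = np.random.default_rng(1)
docs = rng.normal(size=(5, 16))
# Query 0 closely matches doc 2; query 1 is noise and will likely be dropped.
queries = np.stack([docs[2] + 0.01 * rng.normal(size=16),
                    rng.normal(size=16)])
print(consistency_filter(queries, docs, source_idx=[2, 4]))
```

Filters like this trade some yield for quality, which matters because, as noted, many raw LLM-generated queries do not reflect real user intent.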
Optimizing Embedding Models for Practical Use
Integrating features like Matryoshka representation learning into Arctic Embed models exemplifies the ongoing effort to create effective and efficient embedding solutions. The discussions around single versus multi-vector models highlight the trade-offs in performance, clarity in retrieval, and ease of implementation. Ultimately, the goal is to balance retrieval effectiveness with operational efficiency, ensuring that models remain scalable and accessible. Regular revisions to embedding strategies will be essential as user needs evolve and new techniques emerge.
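The single- versus multi-vector trade-off mentioned above can be made concrete. A single-vector model scores a document with one dot product, while a ColBERT-style multi-vector model uses late interaction (MaxSim): each query token is matched against its best document token, and the matches are summed. The code below is an illustrative sketch with random stand-in vectors, not any model's actual implementation.

```python
import numpy as np

def single_vector_score(q_vec, d_vec):
    """One dot product per document: cheap to index and search."""
    return float(q_vec @ d_vec)

def maxsim_score(q_tokens, d_tokens):
    """ColBERT-style late interaction: for each query token, take its best
    match among document tokens, then sum. More nuanced, but stores one
    vector per token instead of one per document."""
    sims = q_tokens @ d_tokens.T          # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())  # best doc token per query token

rng = np.random.default_rng(2)
q_tokens = rng.normal(size=(4, 8))   # 4 query tokens, 8-dim each
d_tokens = rng.normal(size=(12, 8))  # 12 document tokens
q_vec, d_vec = q_tokens.mean(axis=0), d_tokens.mean(axis=0)  # pooled vectors

print(single_vector_score(q_vec, d_vec))
print(maxsim_score(q_tokens, d_tokens))
```

The storage difference is the core operational trade-off: one vector per document versus one per token, which is why single-vector models remain the scalable default.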
The Vision for Future Retrieval Systems
Looking ahead, the evolution of retrieval systems will benefit from a blending of advanced techniques such as multi-vector representations alongside single-vector embeddings. The adoption of robust evaluation methodologies that encompass more varied retrieval tasks will enhance the effectiveness of all models developed. Ensuring that trained models are versatile enough to handle nuanced queries without being computationally prohibitive is a challenge that must be addressed. Collaboration within the open-source community to share insights and improvements will play a critical role in advancing the state of search technology.
Hey everyone! Thank you so much for watching the 110th episode of the Weaviate Podcast! Today we are diving into Snowflake’s Arctic Embed model series and their newly released open-source Arctic Embed 2.0 model, which adds support for multilingual text embeddings. The podcast covers the origin of Arctic Embed, pre-training embedding models, Matryoshka Representation Learning (MRL), fine-tuning embedding models, synthetic query generation, hard negative mining, and single-vector embedding models compared with multi-vector ColBERT, SPLADE, and re-rankers.
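Hard negative mining, one of the topics listed above, is often implemented by taking the highest-scoring non-positive documents for each training query, with a score ceiling to guard against false negatives. The following is a minimal sketch under those assumptions; the function name and threshold are illustrative, not Arctic Embed's actual recipe.

```python
import numpy as np

def mine_hard_negatives(query_vec, doc_vecs, positive_idx, n_neg=2, ceiling=0.95):
    """Pick the highest-scoring non-positive documents as hard negatives.

    The score ceiling (relative to the positive's score) is a common guard
    against false negatives: a document scoring almost as high as the labeled
    positive may in fact be relevant, so it is skipped.
    """
    scores = doc_vecs @ query_vec
    pos_score = scores[positive_idx]
    order = np.argsort(-scores)  # indices sorted by descending score
    negatives = [i for i in order
                 if i != positive_idx and scores[i] < ceiling * pos_score]
    return negatives[:n_neg]

rng = np.random.default_rng(3)
docs = rng.normal(size=(6, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[0] + 0.1 * rng.normal(size=8)  # doc 0 is the labeled positive
query /= np.linalg.norm(query)
print(mine_hard_negatives(query, docs, positive_idx=0))
```

Negatives mined this way are "hard" because they sit near the decision boundary, which is what makes them valuable (and tricky) to integrate into contrastive training.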