

Arctic Embed with Luke Merrick, Puxuan Yu, and Charles Pierse - Weaviate Podcast #110!
Dec 18, 2024
Join Luke Merrick from Snowflake, a key player in Arctic Embed development, and Charles Pierse, head of Weaviate Labs, as they dive into the intricacies of multilingual text embeddings. They explore the evolution of Arctic Embed 2.0, emphasizing its open-source nature. The conversation covers technical strategies in model training, the economics of pre-training large models, and the challenges of integrating negative examples. They discuss the delicate balance between model simplicity and nuance in retrieval, promoting collaboration to enhance search quality.
AI Snips
Origin of Arctic Embed
- Snowflake acquired a search company, Neeva, bringing search expertise into the company.
- This led to the development of Cortex Search and the realization of embedding models' importance.
Trust Production, Not Just Benchmarks
- MTEB benchmark scores can be misleading.
- Real-world production usage is a better indicator of a model's true performance.
Pretraining Embedding Models
- When pretraining embedding models, use web-scale data that reflects the desired behavior.
- Focus on large batch sizes and ensure data quality by removing irrelevant entries.
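
To illustrate why batch size matters here, the following is a minimal sketch of a contrastive loss with in-batch negatives (often called InfoNCE). This snip does not say which loss Arctic Embed uses, so the loss choice, the function name `info_nce_loss`, and the `temperature` value are all assumptions made for illustration. Under this setup, each query treats every other document in the batch as a negative, so a larger batch gives each query more negatives to learn from.

```python
import math


def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean contrastive loss with in-batch negatives (illustrative sketch).

    doc_vecs[i] is assumed to be the positive for query_vecs[i]; every other
    document in the batch acts as a negative for that query.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    queries = [normalize(q) for q in query_vecs]
    docs = [normalize(d) for d in doc_vecs]

    total = 0.0
    for i, query in enumerate(queries):
        # Scaled cosine similarity of query i against every document in the batch.
        logits = [
            sum(a * b for a, b in zip(query, doc)) / temperature for doc in docs
        ]
        # Numerically stable log-sum-exp over the whole batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Cross-entropy with the matching document as the correct class.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy batch: two queries, each paired with one matching document.
queries = [[1.0, 0.0], [0.0, 1.0]]
matched_docs = [[1.0, 0.0], [0.0, 1.0]]
mismatched_docs = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce_loss(queries, matched_docs)
misaligned_loss = info_nce_loss(queries, mismatched_docs)
```

In this toy batch the correctly paired documents give a loss near zero, while swapping the pairs gives a large loss. Noisy or irrelevant pairs in the training data behave like the swapped case, which is one way to see why the data-quality filtering mentioned above matters.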