Investing Time in Data Preparation for Model Training

Investing significant time in data preparation and cleaning is crucial for training embedding models. Training proceeds in two stages: large-scale contrastive pre-training on around 240 million pairs of semantically related sentences, followed by smaller-scale contrastive fine-tuning that includes hard negatives to improve retrieval performance.
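
To make the role of hard negatives concrete, here is a minimal sketch of a contrastive objective in plain Python. The episode summary does not say which loss was used, so the InfoNCE form, the temperature of 0.05, and every name here (`info_nce_loss`, `hard_negatives`) are assumptions for illustration only. Each query is scored against its own positive, against the other positives in the batch (in-batch negatives), and optionally against extra hard negatives supplied per query.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(queries, positives, hard_negatives=None, temperature=0.05):
    """Mean InfoNCE loss over a batch (illustrative, not the speakers' exact loss).

    positives[i] is the matching text for queries[i]; all other positives
    serve as in-batch negatives. hard_negatives[i], if given, is a list of
    additional negative vectors for query i.
    """
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, qi in enumerate(q):
        # Similarity of query i to every positive in the batch.
        logits = [dot(qi, pj) / temperature for pj in p]
        if hard_negatives is not None:
            # Hard negatives only enlarge the denominator of the softmax.
            logits += [dot(qi, normalize(n)) / temperature for n in hard_negatives[i]]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive (index i).
        total += log_sum - logits[i]
    return total / len(q)


# Toy batch: two queries with perfectly aligned positives.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical hard negatives: close to each query, but not its positive.
hard = [[[0.9, 0.1]], [[0.1, 0.9]]]

loss_in_batch = info_nce_loss(queries, positives)
loss_with_hard = info_nce_loss(queries, positives, hard_negatives=hard)
```

On this toy batch the in-batch negatives are easy to separate, so the loss is near zero and gives almost no training signal. The hard negatives sit close to the queries and raise the loss, which is the sense in which they push the model further during fine-tuning.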
