Machine Learning Guide

MLA 011 Practical Clustering Tools

Nov 8, 2020
ADVICE

Start With K-Means As A Baseline

  • Try K-means first for general clustering tasks; it is simple and gives you a baseline to compare other methods against.
  • Use scikit-learn's KMeans for small to medium datasets and Faiss's K-means for very large ones.
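
A minimal sketch of the baseline described above, using scikit-learn's KMeans. The synthetic blob data, the choice of 3 clusters, and the variable names are assumptions made for illustration; real data would need its own choice of cluster count.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# n_clusters=3 is an assumption that matches the synthetic data above.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(labels[:10])          # cluster id assigned to the first ten rows
print(km.cluster_centers_)  # one centroid per cluster
print(km.inertia_)          # within-cluster sum of squared distances
```

The inertia value is a convenient number to record for the baseline, since later methods or other cluster counts can be compared against it on the same data.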

INSIGHT

Euclidean Breaks Down In High Dimensions

  • Euclidean distance loses discriminating power in high dimensions because distances between points become nearly equal, so K-means degrades as embedding size grows.
  • For document embeddings (e.g., 768 dims), K-means often performs poorly compared to other methods.
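
The effect above can be illustrated with a small experiment. This is a sketch under stated assumptions, not a measurement on real embeddings: points are drawn uniformly from a unit hypercube, and `relative_contrast` is a hypothetical helper that reports how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random


def relative_contrast(dim, n_points=500, seed=0):
    """Return (farthest - nearest) / nearest for one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


# The contrast shrinks as the dimension grows: nearest and farthest
# neighbors end up at almost the same distance from the query.
for dim in (2, 32, 768):
    print(dim, round(relative_contrast(dim), 3))
```

When the contrast is close to zero, "nearest centroid" carries little information, which is one way to read the poor K-means results on large embeddings.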

ADVICE

Use ANN Libraries For Large-Scale Semantic Search

  • Use Faiss, Annoy, or HNSWlib for approximate nearest neighbor (ANN) search over millions of vectors.
  • Build an index with your chosen similarity metric (e.g., cosine) for fast semantic lookup.
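
As a reference point for the lookup described above, here is a minimal sketch of exact (brute-force) cosine search in NumPy. It is not an ANN index: it scans every vector, which is what Faiss, Annoy, or HNSWlib avoid at scale. It can still be useful as ground truth when checking the recall of an approximate index. The helper names `normalize_rows` and `cosine_top_k`, and the random data, are assumptions for illustration.

```python
import numpy as np


def normalize_rows(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def cosine_top_k(index_vectors, queries, k):
    """Return (ids, similarities) of the k most similar rows for each query."""
    scores = normalize_rows(queries) @ normalize_rows(index_vectors).T
    top = np.argsort(-scores, axis=1)[:, :k]  # highest similarity first
    return top, np.take_along_axis(scores, top, axis=1)


rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64)).astype(np.float32)  # stand-in embeddings
# Queries are slightly perturbed copies of the first five corpus rows.
queries = corpus[:5] + 0.01 * rng.normal(size=(5, 64)).astype(np.float32)

ids, sims = cosine_top_k(corpus, queries, k=3)
print(ids)
print(sims)
```

The normalization step matters for ANN libraries as well: once vectors have unit length, an inner-product index ranks results the same way cosine similarity does.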