
Machine Learning Guide MLA 011 Practical Clustering Tools
34:50
Start With K-Means As A Baseline
- Try K-means first for general clustering tasks as a simple baseline.
- Use scikit-learn's KMeans for small to medium row counts and Faiss's k-means for very large datasets.
Euclidean Breaks Down In High Dimensions
- Euclidean distance fails in high dimensions, so K-means degrades with large embedding sizes.
- For document embeddings (e.g., 768 dims) K-means often performs poorly compared to other methods.
Use ANN Libraries For Large-Scale Semantic Search
- Use Faiss, Annoy or HNSWlib for approximate nearest neighbor (ANN) search on millions of vectors.
- Build an index with your chosen similarity metric (e.g., cosine) for fast semantic lookup.
Introduction
00:00 • 2min
Scikit-Learn Clustering
02:05 • 3min
Using K-Means Clustering in a Machine Learning Environment
04:45 • 2min
Using Faiss for Large Data Sets: Number of Rows, Not Dimensions
06:49 • 2min
Using Faiss's Implementation of the K-Means Algorithm
09:06 • 3min
Using Agglomerative Clustering
11:54 • 4min
How Do I Compute the Cosine Similarity of Every Entry's Embedding?
16:04 • 2min
How to Find the Number of Clusters for Your Clustering Application
18:25 • 3min
kneed.KneeLocator
21:17 • 1min
How to Find the Right Elbow of a Graph
22:44 • 2min
The Silhouette Score Is Better Than Inertia for Clustering
25:05 • 3min
DBSCAN
28:09 • 4min
HDBSCAN
32:17 • 2min
Primary clustering tools for practical applications include K-means using scikit-learn or Faiss, agglomerative clustering leveraging cosine similarity with scikit-learn, and density-based methods like DBSCAN or HDBSCAN. For determining the optimal number of clusters, silhouette score is generally preferred over inertia-based visual heuristics, and it natively supports pre-computed distance matrices.
Links
- Notes and resources at ocdevel.com/mlg/mla-11
- Try a walking desk to stay healthy & sharp while you learn & code
- K-means is the most widely used clustering algorithm and is typically the first method to try for general clustering tasks.
- The scikit-learn KMeans implementation is suitable for small to medium-sized datasets, while Faiss's kmeans is more efficient and accurate for very large datasets.
- K-means requires the number of clusters to be specified in advance and relies on the Euclidean distance metric, which performs poorly in high-dimensional spaces.
- When document embeddings have high dimensionality (e.g., 768 dimensions from sentence transformers), K-means becomes less effective due to the limitations of Euclidean distance in such spaces.
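As a rough sketch of the two options above (assuming faiss-cpu is installed; the data, dimensionality, and cluster count below are placeholders, not values from the episode):

```python
import numpy as np
from sklearn.cluster import KMeans
import faiss

X = np.random.rand(10_000, 32).astype("float32")  # placeholder feature vectors
n_clusters = 8                                     # assumed; see the elbow/silhouette discussion below

# scikit-learn: fine for small to medium row counts
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
labels = km.fit_predict(X)

# Faiss: scales to very large row counts
kmeans = faiss.Kmeans(X.shape[1], n_clusters, niter=20, verbose=True)
kmeans.train(X)
_, assignments = kmeans.index.search(X, 1)  # index of the nearest centroid per row
```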
- For text embeddings with high dimensionality, agglomerative (hierarchical) clustering methods are preferable, particularly because they allow the use of different similarity metrics.
- Agglomerative clustering in scikit-learn accepts a pre-computed distance matrix (e.g., cosine distance, i.e., 1 − cosine similarity), which is more appropriate for natural language processing.
- Constructing the pre-computed distance (or similarity) matrix involves normalizing vectors and computing dot products, which can be efficiently achieved with linear algebra libraries like PyTorch.
- Hierarchical algorithms do not use inertia in the same way as K-means and instead rely on external metrics, such as silhouette score.
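A minimal sketch of this workflow, assuming sentence-transformers-style embeddings (placeholder random data here). Note that scikit-learn's "precomputed" option expects a distance matrix, so the cosine similarities are converted to distances first; recent scikit-learn versions name the parameter metric, while older ones use affinity:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# embeddings: (n_docs, 768), e.g. from a sentence-transformers model (placeholder here)
embeddings = np.random.rand(500, 768).astype("float32")

# Cosine similarity via normalized dot products (NumPy here; PyTorch works the same way)
norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
cos_sim = norm @ norm.T

# "precomputed" expects distances, so convert similarity -> distance
cos_dist = np.clip(1.0 - cos_sim, 0.0, None)

agg = AgglomerativeClustering(
    n_clusters=20,          # assumed; pick via silhouette score (see below)
    metric="precomputed",   # older scikit-learn releases use affinity="precomputed"
    linkage="average",      # "ward" requires Euclidean, so avoid it with cosine distances
)
labels = agg.fit_predict(cos_dist)
```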
- Other clustering algorithms exist, including spectral, mean shift, and affinity propagation, which are not covered in this episode.
- Libraries such as Faiss, Annoy, and HNSWlib provide approximate nearest neighbor search for efficient semantic search on large-scale vector data.
- These systems create an index of your embeddings to enable rapid similarity search, often with the ability to specify cosine similarity as the metric.
- Sample code using these libraries with sentence transformers can be found in the UKP Lab sentence-transformers examples directory.
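As a sketch of the index-then-search pattern with Faiss (embeddings are placeholders): cosine similarity is obtained by L2-normalizing the vectors and searching with an inner-product index. IndexFlatIP is exact; approximate index types such as HNSW or IVF can be swapped in for larger collections, and Annoy or HNSWlib expose similar build/query APIs.

```python
import numpy as np
import faiss

corpus_emb = np.random.rand(100_000, 384).astype("float32")  # placeholder corpus embeddings
query_emb = np.random.rand(5, 384).astype("float32")         # placeholder query embeddings

# Cosine similarity == inner product on L2-normalized vectors
faiss.normalize_L2(corpus_emb)
faiss.normalize_L2(query_emb)

index = faiss.IndexFlatIP(corpus_emb.shape[1])  # exact inner-product search
index.add(corpus_emb)

scores, ids = index.search(query_emb, 10)  # top-10 most similar corpus rows per query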
- Both K-means and agglomerative clustering require a predefined number of clusters, but this is often unknown beforehand.
- The "elbow" method involves running the clustering algorithm with varying cluster counts and plotting the inertia (sum of squared distances within clusters) to visually identify the point of diminishing returns; see kmeans.inertia_.
- The kneed package can automatically detect the "elbow" or "knee" in the inertia plot, eliminating subjective human judgment; sample code available here.
- The silhouette score, calculated via silhouette_score, considers both inter- and intra-cluster distances and allows for direct selection of the number of clusters with the maximum score.
- The silhouette score can be computed using a pre-computed distance matrix (such as from cosine similarities), making it well-suited for applications involving non-Euclidean metrics and hierarchical clustering.
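A sketch tying these together: sweep the cluster count, record inertia and silhouette score, let kneed's KneeLocator find the elbow automatically, and take the silhouette maximum directly (data and the k range are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from kneed import KneeLocator

X = np.random.rand(2_000, 16)  # placeholder data
k_range = list(range(2, 21))

inertias, sil_scores = [], []
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X, km.labels_))

# Elbow method: detect the point of diminishing returns in inertia
elbow_k = KneeLocator(k_range, inertias, curve="convex", direction="decreasing").elbow

# Silhouette method: simply take the k with the maximum score
best_k = k_range[int(np.argmax(sil_scores))]

# With a precomputed cosine-distance matrix (e.g. for agglomerative clustering):
# silhouette_score(cos_dist, labels, metric="precomputed")
```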
- DBSCAN is a density-based clustering method that does not require specifying the number of clusters, instead discovering clusters based on data density.
- HDBSCAN is a more popular and versatile implementation of density-based clustering, capable of handling various types of data without significant parameter tuning.
- DBSCAN and HDBSCAN can be preferable to K-means or agglomerative clustering when automatic determination of cluster count or robustness to noise is important.
- However, these algorithms may not perform well with all types of high-dimensional embedding data, as illustrated by the challenges faced when clustering 768-dimensional text embeddings.
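A minimal sketch of both density-based options (placeholder data; eps, min_samples, and min_cluster_size are illustrative and usually need tuning per dataset). This assumes the standalone hdbscan package; scikit-learn 1.3+ also ships sklearn.cluster.HDBSCAN.

```python
import numpy as np
from sklearn.cluster import DBSCAN
import hdbscan  # pip install hdbscan

X = np.random.rand(5_000, 16)  # placeholder data

# DBSCAN: eps and min_samples typically require tuning
db_labels = DBSCAN(eps=0.5, min_samples=5, metric="euclidean").fit_predict(X)

# HDBSCAN: usually needs little tuning beyond min_cluster_size
hdb_labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)

# Label -1 marks points treated as noise by both algorithms
n_clusters = len(set(hdb_labels)) - (1 if -1 in hdb_labels else 0)
```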
- For low- to medium-sized, low-dimensional data, use K-means with silhouette score to choose the optimal number of clusters: scikit-learn KMeans, silhouette_score.
- For very large data or vector search, use faiss.Kmeans.
- For high-dimensional data using cosine similarity, use AgglomerativeClustering with a pre-computed square matrix of cosine distances (1 − similarity); sample code.
- For density-based clustering, consider DBSCAN or HDBSCAN.
- Exploratory code and further examples can be found in the UKP Lab sentence-transformers examples.
