

Real-time Feature Generation at Lyft // Rakesh Kumar // #334
Jul 25, 2025
Rakesh Kumar, a Senior Staff Software Engineer at Lyft with a focus on Machine Learning platforms, dives into the intricacies of real-time feature generation. He discusses how Lyft evolved from naive pipelines to handling millions of events per minute, achieving low-latency delivery. Rakesh emphasizes balancing self-service and specialized data processing while navigating the challenges of geospatial data. He also shares insights on technology adoption and how YAML configurations streamline data processing efforts. This session is a treasure trove for anyone interested in MLOps and real-time data management!
AI Snips
Evolution from Cron Jobs to Streaming
- Lyft evolved from a cron-job-based pipeline to stream processing with Apache Beam and Flink for real-time feature generation.
- They addressed scalability issues by sharding data on geohashes instead of cities to evenly distribute load and avoid hot shards.
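A minimal sketch of why the sharding key matters, not Lyft's actual implementation. The event data, the shard count, and the helper names `shard_for` and `shard_load` are hypothetical. It routes the same synthetic events to shards twice, once keyed by city and once keyed by geohash, and counts the load on each shard.

```python
import hashlib
from collections import Counter

# Characters used by the geohash encoding (base32 without a, i, l, o).
GEOHASH_ALPHABET = "0123456789bcdefghjkmnpqrstuvwxyz"


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (hypothetical routing rule)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_load(events, key_field: str, num_shards: int) -> Counter:
    """Count how many events land on each shard for a given key field."""
    load = Counter()
    for event in events:
        load[shard_for(event[key_field], num_shards)] += 1
    return load


# Synthetic traffic: one busy city (900 events over 90 cells)
# and one quiet city (100 events over 10 cells).
events = []
for a in GEOHASH_ALPHABET[:9]:
    for b in GEOHASH_ALPHABET[:10]:
        events += [{"city": "big_city", "geohash": f"9q8y{a}{b}"}] * 10
for a in GEOHASH_ALPHABET[:10]:
    events += [{"city": "small_city", "geohash": f"dr5r{a}"}] * 10

NUM_SHARDS = 8  # assumed shard count for illustration
by_city = shard_load(events, "city", NUM_SHARDS)
by_geohash = shard_load(events, "geohash", NUM_SHARDS)

print("busiest shard, keyed by city:   ", max(by_city.values()))
print("busiest shard, keyed by geohash:", max(by_geohash.values()))
```

Keyed by city, every event from the busy city lands on a single shard, so that shard carries at least 900 of the 1,000 events. Keyed by geohash, the same traffic is split across many small cells, which the hash can spread over all shards.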
Real-Time vs Offline Feature Validation
- Lyft compares real-time features with offline 'ground truth' features and alerts if discrepancies exceed thresholds.
- This observability framework ensures real-time features maintain high data quality and reliability.
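The comparison described above could look roughly like the following sketch. It is an illustration under assumptions: the function name `find_discrepancies`, the use of relative error, and the 5% default threshold are hypothetical and are not taken from the episode.

```python
def find_discrepancies(realtime, offline, rel_threshold=0.05):
    """Return (key, reason) pairs where real-time values drift from offline ones.

    `offline` is treated as the ground truth. A key is flagged when it is
    missing from `realtime` or when its relative error exceeds the threshold.
    """
    alerts = []
    for key, truth in offline.items():
        observed = realtime.get(key)
        if observed is None:
            alerts.append((key, "missing from real-time features"))
            continue
        # Guard against division by zero when the ground truth is 0.
        denominator = max(abs(truth), 1e-9)
        relative_error = abs(observed - truth) / denominator
        if relative_error > rel_threshold:
            alerts.append((key, f"relative error {relative_error:.1%}"))
    return alerts


# Hypothetical feature values keyed by geohash cell.
offline_features = {"9q8yy": 102.0, "9q8yz": 100.0, "9q8yx": 5.0}
realtime_features = {"9q8yy": 100.0, "9q8yz": 80.0}

alerts = find_discrepancies(realtime_features, offline_features)
for key, reason in alerts:
    print(f"ALERT {key}: {reason}")
```

In this example the first cell is within tolerance (about 2% off), the second is flagged for a 20% gap, and the third is flagged because the real-time pipeline produced no value for it.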
Hierarchical Geospatial Feature Store
- Lyft uses a geospatial hierarchical feature store that supports aggregated features across various geohash levels.
- This flexible store allows different models to consume data at multiple regional granularities through a unified API.
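One way to picture such a store is the in-memory sketch below. It relies on a property of geohashes: truncating a geohash to a shorter prefix yields the larger cell that contains it. The class `HierarchicalGeoStore`, its methods, the chosen levels, and the feature name are all hypothetical and only illustrate the idea of one API serving several granularities.

```python
from collections import defaultdict


class HierarchicalGeoStore:
    """Toy feature store that keeps running sums at several geohash levels."""

    def __init__(self, levels=(4, 5, 6)):
        self.levels = tuple(levels)
        self._sums = defaultdict(float)

    def add(self, feature: str, geohash: str, value: float) -> None:
        """Roll a value up into every configured level via geohash prefixes."""
        for level in self.levels:
            if len(geohash) >= level:
                self._sums[(feature, geohash[:level])] += value

    def get(self, feature: str, geohash: str, level: int) -> float:
        """Read the aggregate for the cell containing `geohash` at `level`."""
        if level not in self.levels:
            raise ValueError(f"unsupported level: {level}")
        return self._sums.get((feature, geohash[:level]), 0.0)


store = HierarchicalGeoStore()
store.add("ride_requests", "9q8yyk", 3)
store.add("ride_requests", "9q8yym", 2)
store.add("ride_requests", "9q8yzz", 5)

fine = store.get("ride_requests", "9q8yyk", level=6)    # single small cell
medium = store.get("ride_requests", "9q8yyk", level=5)  # parent cell "9q8yy"
coarse = store.get("ride_requests", "9q8yyk", level=4)  # grandparent "9q8y"
print(fine, medium, coarse)
```

A model that needs block-level detail and a model that needs neighborhood-level totals can call the same `get` method and differ only in the `level` argument.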