

Scaling Model Training with Kubernetes at Stripe with Kelley Rivoire - TWIML Talk #272
Jun 6, 2019
Kelley Rivoire, an engineering manager at Stripe focused on machine learning infrastructure, shares her insights on scaling model training. She discusses Stripe's journey from production-focused ML systems to building the Railyard API for managing model training on Kubernetes. Kelley highlights the importance of cross-team collaboration, support for custom parameters and hyperparameter optimization, and the value of a dedicated infrastructure team in advancing machine learning. Tune in to discover how Stripe is navigating the complexities of AI implementation!
AI Snips
Stripe's Production-First ML
- Stripe's machine learning began with production-focused applications like fraud detection and risk management.
- This contrasts with many companies that start with offline analytics.
Collaboration with Orchestration Team
- Stripe's ML infrastructure team collaborates with its orchestration team for Kubernetes management.
- This allows the ML team to focus on model training without managing the infrastructure.
Railyard and Workflows
- Stripe uses a two-part system: Railyard API and flexible workflows.
- Railyard handles metadata and data location, while workflows allow custom Python code for training (see the sketch after this list).
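
To make the two-part split concrete, here is a minimal Python sketch of how an API-level job description might hand off to a team-owned workflow. The field names (model_name, data_path, hyperparameters), the TrainingJobRequest class, and the train method are illustrative assumptions for this episode summary, not Stripe's actual Railyard schema or workflow interface.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TrainingJobRequest:
    """Job metadata an API layer might track (illustrative, not Railyard's real schema)."""
    model_name: str
    owner_team: str
    data_path: str                                   # e.g. an S3 prefix holding training data
    hyperparameters: dict[str, Any] = field(default_factory=dict)


class FraudScoreWorkflow:
    """A team-owned workflow: the custom Python that does the actual training."""

    def train(self, request: TrainingJobRequest) -> dict[str, float]:
        # A real workflow would load data from request.data_path, fit a model
        # with request.hyperparameters, and return evaluation metrics.
        # The body here is a placeholder for that custom training code.
        print(f"Training {request.model_name} on {request.data_path}")
        return {"auc": 0.0}  # placeholder metric


if __name__ == "__main__":
    request = TrainingJobRequest(
        model_name="fraud-score",
        owner_team="risk",
        data_path="s3://example-bucket/training-data/",
        hyperparameters={"max_depth": 6, "learning_rate": 0.1},
    )
    print(FraudScoreWorkflow().train(request))
```

The point of the split is that the API layer only needs to understand job metadata and where the data lives, while each team keeps full control over the Python that runs inside the training job.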