

Scaling Model Training with Kubernetes at Stripe with Kelley Rivoire - TWIML Talk #272
Jun 6, 2019
Kelley Rivoire, an engineering manager at Stripe focused on machine learning infrastructure, shares her insights on scaling model training. She discusses Stripe's journey from production-focused ML systems to building the Railyard API for managing model training on Kubernetes. Kelley highlights the importance of cross-team collaboration, support for custom parameters and hyperparameter optimization, and the value of a dedicated infrastructure team in advancing machine learning. Tune in to discover how Stripe is navigating the complexities of AI implementation!
AI Snips
Stripe's Production-First ML
- Stripe's machine learning began with production-focused applications like fraud detection and risk management.
- This contrasts with many companies that start with offline analytics.
Collaboration with Orchestration Team
- Stripe's ML infrastructure team collaborates with its orchestration team for Kubernetes management.
- This allows the ML team to focus on model training without managing the infrastructure.
Railyard and Workflows
- Stripe uses a two-part system: Railyard API and flexible workflows.
- Railyard handles metadata and data location, while workflows allow custom Python code for training (see the sketch after this list).
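
To make the two-part split concrete, here is a minimal Python sketch of how an API-level job description might hand off to a team-owned workflow. The field names (model_name, data_path, hyperparameters), the TrainingJobRequest class, and the train method are illustrative assumptions for this episode summary, not Stripe's actual Railyard schema or workflow interface.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TrainingJobRequest:
    """Job metadata an API layer might track (illustrative, not Railyard's real schema)."""
    model_name: str
    owner_team: str
    data_path: str                                   # e.g. an S3 prefix holding training data
    hyperparameters: dict[str, Any] = field(default_factory=dict)


class FraudScoreWorkflow:
    """A team-owned workflow: the custom Python that does the actual training."""

    def train(self, request: TrainingJobRequest) -> dict[str, float]:
        # A real workflow would load data from request.data_path, fit a model
        # with request.hyperparameters, and return evaluation metrics.
        # The body here is a placeholder for that custom training code.
        print(f"Training {request.model_name} on {request.data_path}")
        return {"auc": 0.0}  # placeholder metric


if __name__ == "__main__":
    request = TrainingJobRequest(
        model_name="fraud-score",
        owner_team="risk",
        data_path="s3://example-bucket/training-data/",
        hyperparameters={"max_depth": 6, "learning_rate": 0.1},
    )
    print(FraudScoreWorkflow().train(request))
```

The point of the split is that the API layer only needs to understand job metadata and where the data lives, while each team keeps full control over the Python that runs inside the training job.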