Jeff Magnusson: How To Create A Self-Service Data Platform For Data Scientists
Mar 6, 2018
Explore the self-service data platform model discussed by Jeff Magnusson, VP of Data Platform at Stitch Fix. Learn about the integration of data engineering and data science roles, how the Flotilla API empowers data scientists to run batch jobs, and the technology stack deployed on AWS, including Spark and Presto.
Empowering data scientists with self-service tools reduces coordination costs and aligns goals effectively in data science work.
Flotilla, an open source API, streamlines batch-oriented tasks for data scientists, enhancing job scheduling and production workflow management.
Deep dives
Challenge with Handoffs in Data Science Departments
In the podcast, Jeff Magnusson discusses the challenges of handoffs within data science departments. He highlights that traditional pipelines involving handoffs between data engineers, data scientists, and production engineers create coordination costs and misaligned motivations. By using self-service tools and abstractions to let data scientists take full ownership of their pipelines, from data acquisition to productionization, organizations can increase velocity and innovation and align goals more effectively.
Evolving Data Engineering Role at Stitch Fix
Jeff Magnusson explains how, at Stitch Fix, the traditional data engineering role merges with the data science role to create full-stack data scientists. This approach shifts the focus to a data platform team responsible for maintaining the infrastructure that supports data scientists. The goal is to make the environment easier for data scientists to use, pushing data engineering responsibilities to the data platform team.
Introduction of Flotilla Job Execution Service
The podcast introduces Flotilla, an API developed by Stitch Fix to manage batch-oriented tasks in data science departments. Flotilla abstracts over ECS and handles job queuing, resource allocation, monitoring, and task execution within containers. The service lets data scientists interact with job scheduling through command-line tools and abstractions, giving them a way to manage production workflows efficiently.
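To make the queuing and resource-allocation idea concrete, here is a minimal in-memory sketch. It is not Flotilla's API and does not talk to ECS: the names Job, ToyJobService, submit, schedule, and complete, the status strings, the image names, and the memory figures are all hypothetical, chosen only to illustrate how a service might hold jobs in a queue until capacity is available.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical batch job: a container image, a command, a memory request."""
    name: str
    image: str
    command: str
    memory_mb: int
    status: str = "QUEUED"


@dataclass
class ToyJobService:
    """Toy stand-in for a job execution service (assumption, not Flotilla)."""
    capacity_mb: int
    queue: deque = field(default_factory=deque)
    running: list = field(default_factory=list)

    def submit(self, job: Job) -> Job:
        # Jobs wait in a FIFO queue until the scheduler finds room for them.
        self.queue.append(job)
        return job

    def used_mb(self) -> int:
        return sum(j.memory_mb for j in self.running)

    def schedule(self) -> list:
        # Start queued jobs in order while the next one fits in free capacity.
        started = []
        while self.queue and self.used_mb() + self.queue[0].memory_mb <= self.capacity_mb:
            job = self.queue.popleft()
            job.status = "RUNNING"
            self.running.append(job)
            started.append(job)
        return started

    def complete(self, job: Job) -> None:
        # Free the job's memory so that waiting jobs can be scheduled.
        self.running.remove(job)
        job.status = "COMPLETED"


service = ToyJobService(capacity_mb=4096)
etl = service.submit(Job("etl", "example/etl:latest", "python etl.py", 3072))
train = service.submit(Job("train", "example/train:latest", "python train.py", 2048))

first_batch = service.schedule()   # only "etl" fits; "train" stays queued
service.complete(etl)
second_batch = service.schedule()  # capacity is free again, so "train" starts
```

The point of the sketch is the division of labor: the data scientist only describes the job and its resource request, while the service decides when it can run.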
In this episode, Wayne Eckerson and Jeff Magnusson discuss a self-service model for data science work and the role of a data platform in that environment. Magnusson also talks about Flotilla, a new open source API that makes it easy for data scientists to execute tasks on the data platform.
Magnusson is the vice president of data platform at Stitch Fix. He leads a team responsible for building the data platform that supports the company's team of 80+ data scientists, as well as other business users. That platform is designed to facilitate self-service among data scientists and promote velocity and innovation that differentiate Stitch Fix in the marketplace. Before Stitch Fix, Magnusson managed the data platform architecture team at Netflix where he helped design and open source many of the components of the Hadoop-based infrastructure and big data platform.