Savin Goyal, Co-Founder & CTO at Outerbounds, discusses the rise of the full stack data scientist, a role that takes on software engineering tasks to get data science projects into production. Topics include challenges in ML deployment, success stories at companies like Netflix, Metaflow for ML management, and strategies for scalability and robustness in AI production.
Podcast summary created with Snipd AI
Quick takeaways
Full Stack Data Scientists integrate software engineering tasks into their role, focusing on deploying ML models into production systems.
The definition of 'in production' in data science varies based on organizational maturity, with examples like A/B testing at Netflix.
Not all ML models from data science projects make it to production, highlighting the iterative nature of model development.
Deep dives
Evolution of Data Scientist Role to Full Stack Data Scientist
The role of the data scientist is evolving, with some organizations narrowing the focus of the role while others are expanding it to create Full Stack Data Scientists. This approach mirrors the concept of Full Stack Software Engineers, where additional responsibilities, like software engineering tasks, are integrated into the role. Full Stack Data Scientists are particularly involved in the deployment of machine learning models into production software systems.
Defining 'In Production' in Data Science
The definition of 'in production' in data science varies with the organization's maturity and the goals of the specific project. At Netflix, for instance, running A/B tests against live user traffic is considered a key part of deploying a model into production. Data-informed decisions that shape business strategy can also count as production work, which shows how widely the definition ranges across organizations.
Transitioning Models to Production Worthy Status
Determining whether a model is worth deploying into production is a critical step in the process. Not all models created during data science projects will make it to production, and this is normal in the field. When building machine learning systems, the focus shifts from maintaining one model to progressing efficiently to the next version, which highlights the iterative nature of development.
Success Stories of Model Deployment and Business Impact
Organizations like Netflix have leveraged machine learning models, specifically their recommendation system, to significantly impact their bottom line. The scalability and application of machine learning extend beyond flagship projects to various internal processes and consumer interfaces. Companies like 23andMe and Medtronic demonstrate successful integration of machine learning and data science in their respective domains, showcasing tangible benefits in medical research and automated surgical AI.
The Role of Metaflow in Machine Learning Operations
Metaflow, an open-source ML platform, aids in training and deploying ML models while emphasizing the broader task of building complete systems rather than focusing on the models alone. It addresses the challenges data scientists face in navigating the complexities of ML projects and supports an end-to-end approach to data science. By enabling data scientists to control their own destiny and streamline their workflows, Metaflow improves productivity and eases the integration of machine learning into production systems.
The role of the data scientist is changing. Some organizations are splitting the role into more narrowly focused jobs, while others are broadening it. The latter approach, known as the Full Stack Data Scientist, is derived from the concept of a full stack software engineer, with this role often including software engineering tasks. In particular, one of the key functions of a full stack data scientist is to take machine learning models and get them into production inside software. So, what separates projects from production?
Savin Goyal is the Co-Founder & CTO at Outerbounds. In addition to his work at Outerbounds, Savin is the creator of the open source machine learning management platform Metaflow. Previously, Savin worked as a Software Engineer at Netflix and LinkedIn.
In the episode, Richie and Savin explore the definition of production in data science, steps to move from internal projects to production, the lifecycle of a machine learning project, success stories in data science, challenges in quality control, Metaflow, scalability and robustness in production, AI and MLOps, advice for organizations and much more.