171: Machine Learning Pipelines Are Still Data Pipelines with Sandy Ryza of Dagster
Jan 3, 2024
Guest Sandy Ryza, an expert in machine learning pipelines, discusses the role of orchestrators in the lifecycle of data, changes in data ops and MLOps, and data cleaning, and gives an overview of Dagster. The conversation also covers the difference between data assets and tasks in data pipelines, how lineage and data assets are defined in Dagster, and the benefits of a unified orchestration framework. It closes with orchestration in the development phase and the emergence of the analytics engineer role.
The boundaries between data engineering, ML engineering, and data science roles are becoming increasingly blurred, allowing individuals to explore different areas and follow their curiosity without needing a complete career change.
Orchestrators like Dagster play a crucial role in managing and executing data pipelines for both analytics and ML workloads, providing a flexible execution substrate for experimentation and reliability across the entire development lifecycle.
Deep dives
The Blurring Lines in Data Roles
The boundaries between data engineering, ML engineering, and data science roles are becoming increasingly blurred. In the past, each role had distinct responsibilities, but now there is more overlap and fluidity. Proficiency in data modeling and in infrastructure is central to all of these roles. Current tooling supports collaboration and crossover between Python and SQL, giving individuals the flexibility to explore different areas and follow their curiosity. This blurring of lines is sparking creativity and letting people pursue their interests without needing a complete career change.
Dagster's Usage in Data Workflows
Dagster is used by various types of users and teams. Data platform engineers often adopt Dagster to organize the computation within their data organizations, facilitating shared orchestration environments. Data engineering practitioners also use Dagster to define and create data assets, writing the logic for moving and transforming data. Additionally, many machine learning teams utilize Dagster to train models, generate features, and perform batch inference. Dagster's flexibility and adaptability make it valuable for a wide range of data-related workflows.
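To make the idea of "defining data assets" concrete, here is a minimal sketch in plain Python. It is not Dagster's API: the `asset` decorator, the `ASSETS` registry, and `materialize_all` are hypothetical names invented for this illustration, and the data is made up. It borrows one convention that Dagster's asset functions also follow, namely that a function's parameter names identify the upstream assets it depends on, and uses it to run a feature, training, and batch-inference step in dependency order.

```python
import inspect
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical registry for this sketch only; not part of Dagster.
ASSETS = {}


def asset(fn):
    """Register a function as a data asset named after the function."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_events():
    # Stand-in for data loaded from a warehouse or an event stream.
    return [
        {"user": "a", "clicks": 3},
        {"user": "b", "clicks": 9},
        {"user": "a", "clicks": 2},
    ]


@asset
def features(raw_events):
    # Feature generation: total clicks per user.
    totals = {}
    for row in raw_events:
        totals[row["user"]] = totals.get(row["user"], 0) + row["clicks"]
    return totals


@asset
def model(features):
    # Stand-in for training: the "model" is just the mean click count.
    return sum(features.values()) / len(features)


@asset
def predictions(features, model):
    # Batch inference: flag users at or above the learned threshold.
    return {user: clicks >= model for user, clicks in features.items()}


def materialize_all():
    """Compute every registered asset after its upstream assets."""
    # Map each asset to the set of assets named by its parameters.
    graph = {
        name: set(inspect.signature(fn).parameters)
        for name, fn in ASSETS.items()
    }
    results = {}
    for name in TopologicalSorter(graph).static_order():
        upstream = {dep: results[dep] for dep in graph[name]}
        results[name] = ASSETS[name](**upstream)
    return results


if __name__ == "__main__":
    for name, value in materialize_all().items():
        print(name, value)
```

The point of the sketch is that the analytics-style step (`features`) and the ML steps (`model`, `predictions`) live in one dependency graph, so a single tool can decide what to run and in what order.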
The Role of Orchestrators in the World of ML and Data Engineering
With the emergence of MLOps and the increasingly interconnected worlds of data engineering and ML, the role of orchestrators has become crucial. Orchestrators such as Dagster manage and execute data pipelines for both analytics and ML workloads. They provide a flexible execution substrate for heterogeneous compute environments and support experimentation and reliability across the entire development lifecycle. Although dedicated tools and frameworks exist for ML-specific tasks like feature engineering, there is a growing recognition that a unified orchestration tool like Dagster can bridge the gap and provide a consistent view of the entire data pipeline.
The Future of Data Roles
As the lines between data engineering, ML engineering, and data science continue to blur, the future of data roles is becoming more fluid and adaptable. The tooling available, such as Dagster, enables collaboration and integration between different roles and skill sets. Rather than being siloed, data practitioners can explore different areas and follow their curiosity without the need for career changes. This flexibility and fluidity spark creativity and allow individuals to pursue their interests in a more interconnected and collaborative manner.
The role of an orchestrator in the lifecycle of data (1:34)
Relevance of orchestration in data pipelines (2:45)
Changes around data ops and MLOps (3:37)
Data Cleaning (11:42)
Overview of Dagster (13:50)
Assets vs Tasks in Data Pipeline (19:15)
Building a Data Pipeline with Dagster (25:40)
Difference between Data Asset and Materialized Dataset (28:28)
Defining Lineage and Data Assets in Dagster (29:32)
The boundaries of software and organizational structures (37:25)
The benefits of a unified orchestration framework (39:56)
Orchestration in the development phase (45:29)
The emergence of analytics engineer role (51:53)
Fluidity in data pipeline and infrastructure roles (52:40)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.