Lessons in Data Engineering: Scaling, AI, and Open Source with Sandy Ryza
Feb 7, 2025
Sandy Ryza, a lead engineer on Dagster, shares his rich journey from software engineering to data science. He dives into the evolution of data engineering, emphasizing its increasing complexity and the vital role of AI in shaping data platforms. Sandy discusses best practices for managing data, highlighting the integration of software engineering principles. He also reflects on the future of open-source tools and the importance of data ownership in modern infrastructures. His insights offer great value for both seasoned professionals and newcomers in the field.
Sandy Ryza emphasizes the necessity of software engineering principles to manage increasing complexity in modern data pipelines and ensure scalable data platforms.
The rise of unstructured data and AI will reshape data engineering, making intentional platform design and interoperability critical for future success.
Deep dives
Sandy's Journey in Data Engineering
Sandy shares his multifaceted career journey that has revolved around data engineering, starting as a software engineer building tools for complex data sets. After transitioning to a data practitioner role, he faced various challenges that fueled his desire to refocus on creating better tools. This culminated in his current work on Dagster, an orchestration and data management tool aimed at improving data pipelines. His experiences reflect a holistic understanding of both the technical and practical aspects of data engineering, positioning him as a thought leader in the field.
Understanding Data Complexity
One key insight is that the difficulty of working with large amounts of data often goes beyond sheer size. Sandy emphasizes that many practitioners struggle to understand the interconnections among diverse datasets and the many processing steps needed to derive valuable insights. He recounts his experiences at Clover Health and KeepTruckin, where he built internal tools to manage this complexity, highlighting the importance of overcoming the traditional limitations of data handling. The increasingly intricate nature of data processing now calls for new ways to manage diverse and interrelated datasets effectively.
The Role of Data Pipelines and Dagster
Data pipelines move data from its raw form to actionable insights within organizations. They involve not only data movement but also the transformations that prepare data assets for decision-making and product features. Dagster orchestrates these pipelines, managing the flow from source data through intermediate assets to the final assets that people and products use. This orchestration reduces the disconnect often found in handling large datasets and supports a robust data management framework.
Future of Data Engineering and Best Practices
The future of data engineering is expected to see an increase in complexity due to the growth of unstructured data and AI applications, influencing how data is managed and utilized. Sandy advises aspiring data engineers to adopt software engineering practices for effective data management, highlighting the significance of intentional platform design from the outset. Organizations should strive for thoughtful data platforms that consider future scalability and complexity to preemptively address potential complications. Additionally, fostering 'software-defined data assets' through robust version control and testing will help streamline data architecture and maintain clarity as data ecosystems evolve.
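To make the idea of 'software-defined data assets' concrete, here is a minimal, dependency-free sketch of what defining assets in code might look like. It is an illustration only and is not Dagster's API: the `asset` decorator, the `ASSETS` registry, the `materialize` helper, and the sample order data are all hypothetical names invented for this example. Each asset is a plain function, and its parameter names are assumed to name the upstream assets it depends on.

```python
import inspect
from typing import Any, Callable, Dict, Optional

# Hypothetical registry for this sketch: asset name -> function that computes it.
ASSETS: Dict[str, Callable[..., Any]] = {}


def asset(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as an asset; its parameter names are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    # Stand-in for source data; a real pipeline would read from storage.
    return [
        {"id": 1, "amount": 20.0, "status": "paid"},
        {"id": 2, "amount": 35.5, "status": "refunded"},
        {"id": 3, "amount": 12.5, "status": "paid"},
    ]


@asset
def paid_orders(raw_orders):
    # Intermediate asset: keep only the paid orders.
    return [order for order in raw_orders if order["status"] == "paid"]


@asset
def revenue(paid_orders):
    # Final asset, the kind of value a report or product feature would use.
    return sum(order["amount"] for order in paid_orders)


def materialize(name: str, cache: Optional[Dict[str, Any]] = None) -> Any:
    """Compute an asset after recursively computing its upstream assets.

    Results are cached so each asset is computed once per run.
    This sketch does not detect dependency cycles.
    """
    cache = {} if cache is None else cache
    if name not in cache:
        fn = ASSETS[name]
        upstream = inspect.signature(fn).parameters
        inputs = {dep: materialize(dep, cache) for dep in upstream}
        cache[name] = fn(**inputs)
    return cache[name]


if __name__ == "__main__":
    print(materialize("revenue"))
```

Because every asset is an ordinary function kept in a source file, it can be reviewed and versioned like any other code, and a transformation such as `paid_orders` can be unit-tested by calling it directly with a small hand-built input instead of running the whole pipeline.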
In this episode of Product by Design, Kyle chats with Sandy Ryza, lead engineer on the Dagster project, author, and thought leader in data engineering. Sandy shares his journey through the world of data—from building big data tools at Cloudera to working as a data scientist, product manager, and engineer—and how those experiences led him to help create Dagster, an open-source data orchestration platform.
We discuss:
The evolution of data engineering and the growing complexity of modern data pipelines.
The role of AI and unstructured data in shaping the future of data platforms.
How organizations should think about data platforms to avoid costly rework.
Best practices for managing data complexity using software engineering principles.
The future of open-source tools in data infrastructure and the push toward interoperability.
Sandy Ryza
Sandy is a lead engineer, author, and thought leader in the domain of data engineering. Sandy co-wrote “Advanced Analytics with PySpark” and “Advanced Analytics with Spark”. He led ML and data science teams at Cloudera, Remix, Clover Health, and KeepTruckin.
Sandy is currently the lead engineer on the Dagster project, an open-source data orchestration platform used in MLOps, data science, IoT, and analytics. Sandy is a regular speaker at data engineering and ML conferences.