
Data Engineering Podcast
Seamless SQL And Python Transformations For Data Engineers And Analysts With SQLMesh
Episode guests
Podcast summary created with Snipd AI
Quick takeaways
- SQLMesh simplifies data transformations with Python/SQL, focusing on correctness and scalability.
- SQLMesh automates data ops practices to ensure correctness, scalability, and reliability.
- SQLMesh offers comprehensive testing capabilities and plans to integrate with tools like DAGster for metrics.
Deep dives
SQL Mesh: Transform Data Operations Effortlessly
SQL Mesh is a data ops framework designed by Toby Mao, the CTO of Dubica Data, to simplify data transformations. The framework allows writing transformations in either Python or SQL, enabling scalable and correct execution. The inspiration for SQL Mesh came from the need for good data for generating meaningful metrics. The goal of SQL Mesh is to automate data ops practices, allowing users to focus on defining business logic while ensuring correctness, scalability, and reliability.
Challenges with Existing Data Infrastructure Tools
Existing tools like DBT have limitations in handling complex transformation needs over time. SQL Mesh aims to improve upon DBT by providing a transformation framework with a fundamental understanding of the data transformation space. SQL Mesh handles transformations at a higher level, offering a more seamless experience and addressing the inefficiencies and complexities that arise from extensive use of existing tools like DBT.
Simplified Development Environments with SQL Mesh
SQL Mesh simplifies the management of development environments by offering a streamlined approach. Users can easily create representative development environments without unnecessary manual work. The virtual environment layer in SQL Mesh understands SQL and tracks changes at the column level, facilitating efficient development and testing. This approach minimizes manual effort and ensures that development environments are accurate and representative.
Enhancing Data Validation and Testing with SQL Mesh
SQL Mesh provides comprehensive testing capabilities, including unit tests with real data and audits for data quality checks. These functionalities ensure the correctness and reliability of data operations. SQL Mesh's unit tests enable users to define input and output expectations using YAML files, while audits offer data quality checks after processing. Additionally, SQL Mesh is exploring the implementation of table diffs for comparing data discrepancies between environments.
Future Plans and Integrations of SQL Mesh
In the future, SQL Mesh plans to integrate with tools like DAGster and enhance its semantics layer for metrics and experimentation. The goal is to provide a more flexible and powerful solution for defining and managing metrics. As SQL Mesh evolves, it aims to offer cloud or on-prem solutions, featuring enterprise-grade features like cost estimation and a query proxy layer for facilitating seamless data exploration and validation.
Addressing Data Infrastructure Fragmentation with SQL Mesh
SQL Mesh aims to address the fragmentation in data infrastructure by providing a homogenous environment for data practitioners. The tool simplifies complex workflows across data engineers, data analysts, and data scientists. By unifying the development stack and enabling seamless collaboration, SQL Mesh bridges the gap between different personas and fosters a more productive data development environment, tackling the challenges posed by diverse technology stacks within organizations.
Summary
Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack-
- Your host is Tobias Macey and today I'm interviewing Toby Mao about SQLMesh, an open source DataOps framework designed to scale data transformations with ease of collaboration and validation built in
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what SQLMesh is and the story behind it?
- DataOps is a term that has been co-opted and overloaded. What are the concepts that you are trying to convey with that term in the context of SQLMesh?
- What are the rough edges in existing toolchains/workflows that you are trying to address with SQLMesh?
- How do those rough edges impact the productivity and effectiveness of teams using those
- Can you describe how SQLMesh is implemented?
- How have the design and goals evolved since you first started working on it?
- What are the lessons that you have learned from dbt which have informed the design and functionality of SQLMesh?
- For teams who have already invested in dbt, what is the migration path from or integration with dbt?
- You have some built-in integration with/awareness of orchestrators (currently Airflow). What are the benefits of making the transformation tool aware of the orchestrator?
- What do you see as the potential benefits of integration with e.g. data-diff?
- What are the second-order benefits of using a tool such as SQLMesh that addresses the more mechanical aspects of managing transformation workfows and the associated dependency chains?
- What are the most interesting, innovative, or unexpected ways that you have seen SQLMesh used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLMesh?
- When is SQLMesh the wrong choice?
- What do you have planned for the future of SQLMesh?
Contact Info
- tobymao on GitHub
- @captaintobs on Twitter
- Website
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- SQLMesh
- Tobiko Data
- SAS
- AirBnB Minerva
- SQLGlot
- Cron
- AST == Abstract Syntax Tree
- Pandas
- Terraform
- dbt
- SQLFluff
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack:  RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.