
Data Engineering Podcast
Unlocking Your dbt Projects With Practical Advice For Practitioners
Podcast summary created with Snipd AI
Quick takeaways
- Proper planning and structure are essential for scaling a dbt project and preventing chaos and disorganization.
- Implementing continuous integration and continuous deployment (CI/CD) pipelines and testing strategies helps maintain quality and avoid technical debt in a dbt project.
- Encouraging the adoption of formal modeling strategies, providing training, and promoting collaboration and review processes are key to maintaining a scalable and well-organized dbt project.
Deep dives
Importance of Planning and Structuring
One of the key challenges in scaling a dbt project is a lack of up-front planning and structure. Before the first model is written, it is worth deciding on the project structure, folder organization, and governance. Without that plan, a project quickly becomes disorganized, producing multiple versions of the same truth and making specific components hard to locate. A thoughtful structure established at the beginning keeps the project manageable and scalable.
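As a purely illustrative sketch of what "planning the structure up front" can look like, a dbt_project.yml along these lines encodes a staging/intermediate/marts layering with default materializations per layer. The project name, folder names, and schema targets are assumptions for the example, not a layout prescribed in the episode or the book.

```yaml
# dbt_project.yml -- illustrative layering; names and schemas are assumptions.
name: analytics
profile: analytics

models:
  analytics:
    staging:          # one folder per source system, lightweight views
      +materialized: view
      +schema: staging
    intermediate:     # reusable business logic kept out of BI tools
      +materialized: ephemeral
    marts:            # dimensional models exposed to consumers
      +materialized: table
      +schema: marts
```

Agreeing on a convention like this before the first model lands keeps "where does this belong?" from becoming a recurring debate as the project grows.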
Setting up CI/CD Pipelines and Testing Strategy
Another challenge in scaling a dbt project is maintaining quality and avoiding technical debt. Setting up continuous integration and continuous deployment (CI/CD) pipelines early is crucial: CI checks ensure that changes to models are tested before they are merged into the main branch, preventing regressions and preserving trust in the data. A testing strategy that combines data quality tests with unit tests surfaces issues quickly and keeps results accurate and reliable. Together, these practices keep technical debt in check and prevent disruptions as the project scales.
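As one example of the testing side, generic data quality tests can be declared alongside the models in YAML and run on every CI build; the model and column names below are hypothetical and only illustrate the mechanism.

```yaml
# models/staging/stg_orders.yml -- hypothetical model and columns,
# shown only to illustrate where generic data quality tests live.
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_status
        tests:
          - accepted_values:
              values: ["placed", "shipped", "completed", "returned"]
      - name: customer_id
        tests:
          - not_null
```

Running `dbt build` in CI executes these tests together with the models they protect, so a broken assumption blocks the merge rather than surfacing later in a dashboard.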
Formal Modeling Strategies and Training
To encourage the adoption of formal modeling strategies and best practices, teams can provide training and resources that empower their members. That includes teaching the benefits of dimensional modeling and showing how reporting fits into the dbt flow. ERDs, source-to-target mapping documents, and written business requirements give everyone a shared understanding of the project's goals and make modeling more effective. Collaboration and review processes ensure that models are well designed and meet business needs before coding begins. By emphasizing planning, training, and structured development, teams can keep a dbt project scalable and well organized.
Importance of Thoughtful Model Design
Thoughtful model design is a key part of scaling a dbt project. Before coding begins, it is worth capturing the business needs, building ERDs, and creating source-to-target mapping documents. These artifacts define the dimensional models: master data tables for dimensions and measurement tables for facts. With a clear plan and North Star in mind, teams can build models that align with business requirements, and regular collaboration and reviews catch design problems before they force major changes later in the project. Establishing the dimension and fact tables early lays the foundation for a scalable, efficient dbt project.
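To make the dimension/fact split concrete, the hypothetical marts-level YAML below pairs a dimension with a fact and uses a relationships test to enforce the join key that a source-to-target mapping would have defined; all model and column names are invented for illustration.

```yaml
# models/marts/core.yml -- hypothetical dimension and fact models,
# showing how a designed relationship can be enforced in dbt.
version: 2

models:
  - name: dim_customers        # master data: one row per customer
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

  - name: fct_orders           # measurements: one row per order
    columns:
      - name: customer_id
        tests:
          - relationships:     # every fact row must reference a known dimension row
              to: ref('dim_customers')
              field: customer_id
```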
Importance of Standard Modeling Practices and Documentation
Standard modeling practices improve the developer experience and make it easier to onboard new engineers onto the dbt project. Writing proper documentation for every model is essential because it gives developers the context to understand each data model and its purpose. Being intentional about that documentation ensures new developers can quickly grasp the project and how it works.
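Documentation in dbt lives in the same YAML files as the tests, and descriptions written there are rendered by `dbt docs generate`. The model and column descriptions below are a hypothetical example of the level of detail that helps a new developer get oriented.

```yaml
# models/marts/dim_customers.yml -- hypothetical descriptions that surface
# in the docs site produced by `dbt docs generate`.
version: 2

models:
  - name: dim_customers
    description: >
      One row per customer, keyed by customer_id. Combines sign-up
      attributes from the application database with lifetime order
      metrics derived from fct_orders.
    columns:
      - name: customer_id
        description: Primary key; surrogate identifier for the customer.
      - name: first_order_date
        description: Date of the customer's first completed order.
```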
Tools and Processes to Alleviate Complexity and Ensure Stability
Beyond dbt itself, a range of tools and processes can help alleviate the incidental complexity of large data projects. Writing code against an agreed plan and putting every change through code review keeps the work aligned with standards and prevents anyone from pushing unvalidated code. Collaborative communication within the team contributes to project stability. Monitoring builds and watching model runtimes allows early detection of issues or changes that affect processing time. CI/CD tools such as GitHub Actions or dbt Cloud help streamline pipelines and development workflows.
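As one possible shape for such a pipeline, the GitHub Actions workflow below installs dbt and runs the project's models and tests on every pull request. The adapter (dbt-snowflake), the secret names, and the existence of a `ci` target in profiles.yml are placeholders, not details from the episode.

```yaml
# .github/workflows/dbt-ci.yml -- a minimal CI sketch. The adapter,
# secret names, and the `ci` target in profiles.yml are placeholders;
# adapt them to your warehouse and credential management.
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dbt
        run: pip install dbt-snowflake

      - name: Build and test the project
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
        run: |
          dbt deps
          dbt build --target ci
```

dbt Cloud offers a hosted version of the same idea, including CI jobs that build only the models affected by a pull request.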
Summary
The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Dustin Dorsey and Cameron Cyr about how to design your dbt projects
Interview
- Introduction
- How did you get involved in the area of data management?
- What was your path to adoption of dbt?
- What did you use prior to its existence?
- When/why/how did you start using it?
- What are some of the common challenges that teams experience when getting started with dbt?
- How does prior experience in analytics and/or software engineering impact those outcomes?
- You recently wrote a book to give a crash course in best practices for dbt. What motivated you to invest that time and effort?
- What new lessons did you learn about dbt in the process of writing the book?
- The introduction of dbt is largely responsible for catalyzing the growth of "analytics engineering". As practitioners in the space, what do you see as the net result of that trend?
- What are the lessons that we all need to invest in independent of the tool?
- For someone starting a new dbt project today, can you talk through the decisions that will be most critical for ensuring future success?
- As dbt projects scale, what are the elements of technical debt that are most likely to slow down engineers?
- What are the capabilities in the dbt framework that can be used to mitigate the effects of that debt?
- What tools or processes outside of dbt can help alleviate the incidental complexity of a large dbt project?
- What are the most interesting, innovative, or unexpected ways that you have seen dbt used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with dbt? (as engineers and/or as authors)
- What is on your personal wish-list for the future of dbt (or its competition)?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Biobot Analytics
- Breezeway
- dbt
- Synapse Analytics
- Snowflake
- Fivetran
- Analytics Power Hour
- DDL == Data Definition Language
- DML == Data Manipulation Language
- dbt codegen
- Unlocking dbt book (affiliate link)
- dbt Mesh
- dbt Semantic Layer
- GitHub Actions
- Metaplane
- DataTune Conference
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Miro:  Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at [dataengineeringpodcast.com/miro](https://www.dataengineeringpodcast.com/miro).
- Starburst:  This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Rudderstack:  Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize:  You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!