Due to health-related issues, we are on a temporary hiatus for new episodes. Please enjoy this rerelease of episode #65 with Abe Gong, all about how people are implementing data contracts in the wild. So many people implement only defensive data contracts, and I think that is such a missed opportunity. Maybe it's where you will have to start, but there's a much better way, and we talk a bit about why I find it so distressing that people aren't talking to each other.
Sign up for Data Mesh Understanding's free roundtable and introduction programs here: https://landing.datameshunderstanding.com/
Please Rate and Review us on your podcast app of choice!
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
Episode list and links to all available episode transcripts here.
Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.
Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here
Abe's Twitter: @AbeGong / https://twitter.com/AbeGong
Abe's LinkedIn: https://www.linkedin.com/in/abe-gong-8a77034/
Great Expectations Community Page: https://greatexpectations.io/community
In this episode, Scott interviewed Abe Gong, the co-creator of Great Expectations (an open source data quality / monitoring / observability tool) and co-founder/CEO of Superconductive.
One caveat before jumping in: Abe is passionate about the topic and has created tooling to help address it, so try to view his discussion of Great Expectations as an approach rather than a commercial for the project/product.
To start the conversation, Abe shared some of his background experience living the pain of unexpected upstream data changes: the data chaos and the significant work required to recover and adapt. Part of the goal with something like data contracts is to remove the need to recover entirely and move toward controlled, expected adaptation. Abe believes the best framing for data contracts is to think of them as a set of expectations.
To define expectations here: they include not just schema but also the content of the data, such as value ranges, types, distributions, and relationships across tables. For instance, a column may hold rankings from one to five, and then the application team changes the range to one to ten. The schema may not be broken - it is still passing whole numbers - but the new range is not within expectations, so the contract is broken.
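To make that concrete, here is a minimal sketch of what such an expectation could look like in Great Expectations, assuming its classic pandas-backed API (newer GX versions restructure this, and the exact result shape varies by version; the `ranking` column and values are hypothetical):

```python
import great_expectations as ge
import pandas as pd

# A batch arriving after the application team silently widened the scale to 1-10
df = pd.DataFrame({"ranking": [1, 3, 5, 8, 10]})

# Wrap the DataFrame so expectation methods are available on it
ge_df = ge.from_pandas(df)

# Schema-level check still passes: the column still holds whole numbers
type_check = ge_df.expect_column_values_to_be_of_type("ranking", "int64")
print(type_check["success"])  # True - the schema is intact

# Content-level check fails: 8 and 10 fall outside the agreed 1-5 range
range_check = ge_df.expect_column_values_to_be_between("ranking", min_value=1, max_value=5)
print(range_check["success"])  # False - the contract is broken
```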
At present, Abe sees the best way to avoid breaking social expectations as getting consumers and producers into a meeting to talk about upcoming changes and prepare, such as with versioning. But as tooling improves, Abe sees a world where we won't need many of those meetings going forward - either because data pipelines become "self-healing" and automatically adapt to upstream changes, or because metadata and context-sharing tools reduce the need for them.
Abe sees two distinct use cases for data contracts in general, or more specifically for how people use Great Expectations to implement data contracts. The first is purely defensive: put validation on the data you are ingesting to prevent data that doesn't match expectations from blowing up your own work. The second is when the consuming team shares their expectations with the producers and there is a more formal agreement - a contract - with a shared set of expectations. The first often leads to the second, via an agreement conversation that happens after an upstream breaking change.
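As a rough illustration of the defensive pattern, a consuming team might gate ingestion on a few expectations and halt the pipeline when any fail. This is a hypothetical sketch using the same classic pandas API; the column names and thresholds are invented:

```python
import great_expectations as ge
import pandas as pd

def validate_ingest(df: pd.DataFrame) -> None:
    """Defensive data contract: fail fast rather than let bad data flow downstream."""
    ge_df = ge.from_pandas(df)
    checks = [
        ge_df.expect_column_to_exist("order_id"),
        ge_df.expect_column_values_to_not_be_null("order_id"),
        ge_df.expect_column_values_to_be_between("ranking", min_value=1, max_value=5),
    ]
    failed = [c for c in checks if not c["success"]]
    if failed:
        # Halting here protects this team's work; the failure report is also
        # the natural artifact to bring to the producer conversation that
        # turns a one-sided defensive check into a shared contract.
        raise ValueError(f"{len(failed)} expectation(s) failed; refusing to ingest batch")

# This batch violates the agreed 1-5 ranking range, so ingestion is refused
validate_ingest(pd.DataFrame({"order_id": [101, 102], "ranking": [2, 9]}))
```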
Abe also mentioned there is a third constituent in the room for data contracts: the data itself. Sometimes consumers and producers agree on what they expect, but if that differs from what's actually in the data, it's hard or dangerous to move forward. The data has a veto.
There was an interesting discussion on the push versus pull of data contracts: should the producing team create an all-encompassing contract, or should we have consumer-driven contracts? Would producer-driven contracts be too restrictive, preventing the serendipitous insights data mesh aims to produce? Would consumer-driven contracts mean multiple contracts for each data product that the producer has to agree to? Is that sustainable?
So, to sum it up: a set of explicit expectations around a data product, created through collaboration between producers and consumers, sounds like where we should all head if possible. If the expectation set comes only from the producer side, it might be overly restrictive and miss much of the nuance necessary to actually create consumer trust. And exclusively consumer-driven contracts don't sound sustainable or scalable.
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
All music used in this episode was found on PixaBay and was created by (with slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf