Rerelease of #150 3 Years in, Data Mesh at eDreams: Small Data Products, Consumer Burden, and Iterating to Success, Oh My! - Interview w/ Carlos Saona

7 snips

Sep 8, 2023

Carlos Saona, Chief Architect at eDreams ODIGEO, shares unique implementation and key learnings from eDreams Odigio's approach to Data Mesh. Topics include iterative development, division of responsibilities between data consumers and producers, implementing DataMesh, defining single entity for customers, benefits of small data products, feedback loops and iteration, and use of streaming and batch processing in data analysis.

Ask episode

Chapters

Transcript

Episode notes

Introduction

00:00 • 2min

Unique Implementation and Key Learnings from eDreams Odigio

01:44 • 4min

Implementing DataMesh: Important Points

05:30 • 17min

Defining a Single Entity for Each Customer

22:08 • 19min

Benefits of Small Data Products

41:04 • 13min

Feedback Loops and Iteration in Data Mesh

54:12 • 27min

The Use of Streaming and Batch Processing in Data Analysis

01:21:11 • 3min

Due to health-related issues, we are on a temporary hiatus for new episodes. Please enjoy this rerelease of episode 150 with Carlos Saona. eDreams' approach is very unique and interesting because it was essentially all on its own so there are a ton of useful learnings to consider if they are the right fit for your own organizations.

Sign up for Data Mesh Understanding's free roundtable and introduction programs here: https://landing.datameshunderstanding.com/

Please Rate and Review us on your podcast app of choice!

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here

Episode list and links to all available episode transcripts here.

Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.

Carlos' LinkedIn: https://www.linkedin.com/in/carlos-saona-vazquez/

In this episode, Scott interviewed Carlos Saona, Chief Architect at eDreams ODIGEO.

As a caveat before jumping in, Carlos believes it's too hard to say their experience or learnings will apply to everyone or that he necessarily recommends anything they have done specifically but he has learned a lot of very interesting things to date. Keep that perspective in mind when reading this summary.

Some key takeaways/thoughts from Carlos' point of view:

eDreams' implementation is quite unique in that they were working on it without being in contact with other data mesh implementers for most of the last 3 years - until just recently. So they have learnings from non-typical approaches that are working for them.
You should not look to create a single data model upfront. That's part of what has caused such an issue for the data warehouse - it's inflexible and doesn't really end up fitting needs. But you should look to iterate towards that standard model as you learn more and more about your use cases.
?Controversial?: Look to push as much of the burden as is reasonable onto the data consumers. That means the stitching between data products, the compute costs of consuming, etc. They get the benefit so they should be taking on the burden. Things like data quality are still on the shoulders of producers.
You should provide default values for your data product SLAs. It makes the discussion between consumers and producers far easier - is the default good enough or not?
?Extremely Controversial?: At eDreams, you cannot publish data in your data product that you are not generating. In derived domains (e.g., customer history), “generate” includes the derived stitching. NOTE: Go about an hour into the interview - not episode - for more specifics.
When starting with data mesh, there must be a settling period - consumers must understand that things are subject to change while a new producer really figures things out for the first few weeks to months.
You want to avoid duplicating data. But you REALLY want to avoid duplicating business logic.
Be careful when selecting your initial data mesh use cases. If the use case requires a very fast time to market, while it has value, you likely won't have the time and space necessary to experiment and learn. You need to find repeatable patterns to scale in data mesh. Hurrying is a way to miss the necessary learning.
Look ahead and build ahead for obvious interoperability. E.g. create foreign keys for data products that don't exist yet but will.
Be clear about what early data mesh participation means - what will it net domains that are part of the early implementation? And be specific too about what your early implementation won't include or achieve. Don't over promise and under deliver.
Similarly, strongly emphasize that learning is a priority in your early implementation and that you are factoring in learning into promises and estimations. You can't promise you'll find the right solution to a challenge on day one, things need space to evolve as you learn more.
It's okay to not have everyone as part of your initial implementation - engagement or even buy-in - but set yourself on a path where their participation is inevitable.
Making data as a first class citizen doesn't just happen. There is incremental work to be done by the domains. Make sure you reserve time to actually do that work - the data quanta creation and maintenance.
It is not feasible to have your documentation be fully self-describing for everyone. eDreams chose to set the bar at documentation that is self-describing for readers that already know about the domain. For readers that do not know the domain, that introduction must happen somewhere else.
At the start of a data mesh journey, your central team will likely control all the use cases being served by the mesh. But at some point, self-serve needs to happen. Consumers need to be able to serve their needs without their use cases going through the central team.
With small-size data products and the data combination burden being on consumers, versioning tends to not be as much of a problem because the concepts of domain events don't change that often. Or when they do, the retention window has been typically short, making versioning easier.

To make data producers feel a better sense of ownership, 1) look for ways for producers to better leverage their own data; 2) maximize your number of consumers for their data quanta so there is a quicker time to identify issues with the data product - more eyes means more who can spot issues; and 3) create automation to easily/quickly let domains identify sources of data loss rather than searching: with proper setup, you can make it easy to identify if the data pipeline is the problem. If it's not, then the issue is in the domain.

When Carlos and team were looking at building out how to tackle their growing data challenges a few years ago, they were looking at request for proposals (RFPs) from a number of data consultancies around building out a data lake but just were not convinced it would work. Then they ran across Zhamak's first data mesh article and decided to give it a try themselves. Until more recently, Carlos was not aware of the mass upswing in hype and buzz around data mesh so their implementation is very interesting because it wasn't really influenced by other implementations.

When they were starting out, Carlos said they didn't want to try to create a single, overarching approach. It was very much about finding how to do data mesh incrementally. They started use case by use case and built it out organically, including the design principles and rules - they knew they couldn't start with a single data model for instance. But it was quite challenging iterating towards that standard data model.

When choosing their initial use cases to try for data mesh, Carlos and team had some specific criteria. They rejected anything that needed a very quick turnaround because it wouldn't let them have enough time/space to try things, learn, and iterate. They did plan ahead by creating foreign keys to data products that didn't exist to make interoperability down the road when they would exist easier. And they were very honest with stakeholders about what early participation meant - and what it didn't mean; that way, it was clear what benefits stakeholders could expect.

According to Carlos, while they had executive support and sponsorship for data mesh, that wasn't enough to move forward with confidence at the start. They needed to have a few key stakeholders that were engaged as well and wanted to participate. It was also okay to have some stakeholders not engaged but just informed of what they were trying to do with data mesh. You don't have to win everyone over before starting.

Five things Carlos thinks others embarking on a data mesh journey should really take from their learnings: 1) it's okay to not have everyone really bought in or especially engaged upfront but they will have to participate - make their eventual participation inevitable. 2) Really emphasize that you are learning in your early journey, not that you have it figured out - and factor in learning when doing estimations and promises. 3) Don't try to design your data model from the beginning; you need to learn via iteration - you will start to find your standards to make it easy to design new data products. 4) When treating data as a first class citizen, it's important to understand that will take additional time. Reserve the team's time to create and maintain their data quanta. 5) Let the use cases drive you forward and show you where to go.

Carlos' philosophy is, within reason push as much of the burden onto the consumer as you can. Obviously, we don't want consumers doing the data cleansing work - that's been one of the key issues with the data lake - but the costs of consumption should fall on the data consumers as they are the ones deriving the most benefit. So eDreams makes the consumers own stitching data products together for their queries and makes them pay for the consumption. This minimizes the costs - including maintenance costs - to producers.

One very interesting and somewhat unique - at least as far as Scott has seen - approach is how truly small Carlos and team's data quanta are. Thus far, they have really adhered to the concept that each data quantum should only be about sharing a single type of domain event and really nothing more in it. This again makes for lower complexity and maintenance costs for data producers. They are considering changes with upcoming BI-focused data products so that is to be determined.

Carlos believes - and Scott exceedingly strongly agrees - it is not feasible for your documentation for your data quanta to be fully self-describing. You can't know someone else's context. You need to write good documentation so people can still understand what the data product is and what it's trying to share but if you do not have knowledge of the domain, it would be a considerable amount of effort - essentially impossible to do it right - to fully explain the domain and how it works in the documentation of each data product. Getting to know how other domains exactly work is outside of the scope of the data mesh.

At the start of their journey, the data team was in control of all the use cases, who was consuming, and who was producing, according to Carlos. But, as they've gone wider and there is a self-service model for data consumers, more and more of the use cases are directly between the producers and consumers - or the consumers are consuming without much interaction with producers if they already know the domain. It could become an issue with people trying to understand data from lots of different domains for the sake of understanding but it hasn't been an issue so far.

To date, Carlos hasn't seen many problems around versioning. They thought they would have many more issues with versioning than they have which Carlos believes is from keeping their data products as small as possible and using domain events. When they have had versioning, the retention window for the data has been relatively short so the versioning has been relatively simple to move to the newer version. And because most people are getting their data from source-aligned data products, changes have a smaller blast radius - they won't affect data products that are downstream of a downstream of a downstream data product. Domain events have been enough because their main stakeholder has been machine learning. They are now working on a different kind of data quanta for consumers such as BI, and they plan to include more governed versioning there.

One of the biggest challenges early on according to Carlos was that domains didn't really feel the ownership over the data they shared. So to increase the feeling of ownership, they first looked for ways for producing domains to use their own data - as many other guests have mentioned. Second, they tried to maximize additional consumers of data products by looking for use cases. That led to faster feedback loops if there was a problem - more eyes on the data - so producers discovered issues sooner. And third, the platform team helped identify issues that might be in the system or in the data platform/pipeline process - if there was data loss, there is automation to help identify if it is on the platform side; if it's not on the platform side, then it is an issue with the domain. That one automation has led to a lot less time searching for the cause of data loss rather than fixing data loss.

Carlos and team built in a few different layers of governance. The first is a universal layer for standard metadata in each data product, like when something happened, who is the owner, the version of the schema, the existence of a schema, etc. These are enforced automatically by the data platform and you can't put a data product on the mesh without complying. Producers must also tag any PII or sensitive information like credit cards. Then, a second layer is policies for data contracts between producers and consumers. As many guests have suggested, they have found having default values for SLAs in data contracts provides a great starting point for discussions between data producers and consumers.

"You can have your cake and eat it too," using domain events per Carlos. You don't want direct operational path queries hitting your data quanta as they are designed for analytical queries - they will have a separate latency profile. But at eDreams, the pipeline that writes data quanta to the analytical repository is implemented with streams that can be consumed in real-time by operational consumers (microservices).

Other tidbits:

When launching a new data product, there must be a settling period - consumers must understand that things are subject to change while the producer really figures things out.

You want to avoid duplicating data. But you REALLY want to avoid duplicating business logic.

Data products should have customized SLAs based on use cases. You don't need to optimize for everything. Let the needs drive the SLAs.

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf