
Data Mesh Radio

Latest episodes

Mar 10, 2023 • 59min

#203 Panel: Making Privacy Practical and Scalable in Data and Data Mesh - Led by Debra Farber w/ Samia Rahman and Katharine Jarmul

Sign up for Data Mesh Understanding's free roundtable and introduction programs here: https://landing.datameshunderstanding.com/

Please Rate and Review us on your podcast app of choice!

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here. Episode list and links to all available episode transcripts here.

Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.

Debra's LinkedIn: https://www.linkedin.com/in/privacyguru/
Debra's Shifting Privacy Left Podcast: https://shiftingprivacyleft.com/
Katharine's LinkedIn: https://www.linkedin.com/in/katharinejarmul/
Katharine's book: https://www.oreilly.com/library/view/practical-data-privacy/9781098129453/
Samia's LinkedIn: https://www.linkedin.com/in/samia-rahman-b7b65216/

Quick acronyms to know: PETs - privacy enhancing technologies; SMEs - subject matter experts

Scott Note Warning: there is some nerding out about how awesome it could be if some advanced privacy approaches and PETs were implemented at a broad scale across the industry to protect individuals' privacy. It's pretty early days, so a warning about getting your hopes up :)

In this episode, guest host Debra Farber, privacy expert and host of the Shifting Privacy Left podcast, facilitated a discussion with Katharine Jarmul, the author of the upcoming book Practical Data Privacy and Principal Data Scientist at Thoughtworks (guest of episode #157), and Samia Rahman, Director of Data and AI Strategy and Architecture at life sciences company Seagen (guest of episode #67).

Scott note: given this is a newer area, I wanted to share my takeaways rather than trying to reflect the nuance of the panelists' views. This will be the standard for panels going forward.

Scott's Top Takeaways:
- Privacy has historically been dominated by risk compliance, but it's starting to move past the defensive governance aspects. Privacy has mostly been at the tail-end of the development cycle across systems and data but is starting to shift left across the board, much like aspects of data mesh.
- Regarding data mesh, given how many additional aspects of development we are asking domains to own, is it fair to ask them to own privacy as well? How can we train people to understand when to use privacy enhancing technology and then make it easy to implement those decisions? Much like other aspects of the self-serve platform.
- Privacy tech is emerging and maturing at a very significant rate. What was once a pipe dream or was prohibitively expensive is much closer to being available to the masses. Much like data mesh in general couldn't really have been done before cloud-native tooling/technologies started to mature, privacy is in a similar wave forward. If you want to hear more about specific tech, there's some talk in this episode and in Katharine's interview (episode #157).
- With the explosion in upcoming privacy-focused legislation across the world - much of which is at least slightly different from each other - we will see a large increase in the need for organizations to do privacy well/better. Shifting left is really the only way to do this scalably, or we'll potentially see organizations _stop_ doing a lot of currently valuable data work because the cost of privacy and risk compliance becomes prohibitive. Upcoming legislation may be the thing pushing privacy forward more than anything else.
- Privacy is extra crucial when dealing with data leaving the organization, whether in partnership or for those selling their data on a marketplace. Mozhgan Tavakolifard, PhD (episode #154) talked about this: many companies are opting to merely package and sell insights because they can't track usage deep into those partner or data purchaser systems. It's a bit like what Zhamak discussed in Zhamak's Corner 18 (episode #195) re data going into the data science area of an organization right now, where all governance and visibility at best gets hazy. Will cross-org data mesh help address that? Probably, but it's 5+ years away.
- Paraphrasing Debra: At the end of the day, privacy is not about compliance. Privacy is about respecting the humans behind the data, not just the data itself. Protecting the data itself is about risk to the organization - that's compliance. We need to encourage a mindset of "how do I respect these humans' choices and their desires in this context of collecting the data" - that's essential.
- If you don't do privacy well, there are risks to the company of course. But a big one is that people will still look for - and usually find - ways to get access to sensitive data. People will seek out the value. If you make data easily accessible with the right privacy levels, you can unlock many high-value new use cases in compliant and low-risk ways. Organizations should start to look at the rewards of doing privacy well, not only the risks of doing it poorly.

Other Important Takeaways (many touch on similar points from different aspects):
- New ways of doing privacy are going to mean "measurable, quantifiable, verifiable, and auditable tools and capabilities."
- We need to think of privacy like any other tooling in the development lifecycle. It's about providing abstractions so domain experts can make the easy/right calls, with a central support structure for when things are more tricky.
- Privacy isn't just about the data; it's not simply a metadata-like concept where people try to add privacy back to data at rest. Much of what matters in privacy is how the data flows through systems - privacy in those flows and in each of the systems, not just the end place the data gets stored.
- In data mesh, the self-serve platform will need - currently needs? - to provide privacy-as-code capabilities so people can easily build data products with privacy built in instead of added at the end of the process. We can't expose the raw tech, that's far too complex. How do we provide good abstractions to make this easy and thus scalable? (See the policy-as-code sketch after these takeaways.)
- We need the ability to almost have a privacy capability as an ingest mechanism - point at a data source and say "we need this anonymized" without a super custom build by the producers. We're just at the start of developing those types of capabilities, but we need to make it so it's not all on the producer; consumers should be able to consume with privacy on demand.
- We need policies as code, or other easily digestible forms of policies - and compliance - and we need to train our people well on what privacy means, why it's important, when to apply it, etc.
- We need tooling to help with federated privacy because otherwise there is too much privacy context/knowledge AND technology to learn, and it won't be scalable. It appears there are some tools emerging, but it's still seemingly early days.
- Anonymization is often pretty easy to overcome if you just add additional datasets. This is especially a risk in sharing data with other organizations. Anonymization isn't a wand you wave so that all your privacy risks get taken care of.
- How can we still derive the value of anonymized datasets? It's often much harder, so will companies do the ethical privacy aspects or only the required aspects of privacy? We need better, easier PETs to make it easier to still extract value from anonymized data.
- How do we balance enough privacy training against information overload? It's hard to get people to learn what's necessary because privacy is such a big topic. We need global and domain-level policies that are, again, actually digestible.
- Can we measure time to compliance, time to privacy, time to 'doing the right thing' ethically? That would be best for understanding where we need to improve, but we're probably just at the start of that. This is an interesting fitness function area.
- Subject matter experts (SMEs) have so much specific knowledge that you need to leverage them to discover privacy risks - and privacy rewards too. Much like any aspect of governance, having the central team make all the decisions just isn't scaling, so we need to make the people in the domain capable enough to handle privacy.
- Privacy rewards: in many organizations, there are very high value sets of data that cannot be leveraged for specific use cases due to privacy and other compliance restrictions. Getting to a place where we can easily leverage that high risk but high value data will potentially unlock large amounts of business value.
- There are lots of instances of teams finding those high value data sets and using shadow IT to get at them. If that's the only way people can get access, many will completely skirt any compliance and privacy. So getting to a place where they have access, but according to policy and tracked, is crucial to lowering organizational risk - both compliance-wise and ethics-wise.
- From Katharine: "But if you build easier ways to get access and safer, more responsible, more ethical ways to get access, then you have a win-win situation and people are not going to find shortcuts."
- Companies are starting to loosen the shackles on data, focusing on maintaining privacy while also enabling innovation around privacy-sensitive data. That's a great mindset, but there are still many questions on how to do that specifically.
- Too many, especially in blockchain, conflate privacy and confidentiality. Keeping something confidential is a security aim. So if you focus on confidentiality, you can't actually use the data you have - no one is allowed access, it's on lockdown.
- We need to get far better at risk modeling for privacy. What are the potential harms to the humans? We need to move beyond only thinking about what data might be exposed if there's a breach. We can free up data for far more uses if we do this right, but ethics around data usage is just not a common thought. We need to train people to think ethically and about potential harm.
- There are multiple issues with anonymized data. Are you taking away the utility? Are you fooling yourself into thinking it can't be de-anonymized? De-anonymization by joining in additional data sets is a pretty common outcome - especially a risk if sharing data externally. Don't treat anonymization as your hammer so that everything looks like nails.
- We need to teach developers about differential privacy, which is about "bounding the probability of someone learning" a specific thing. Differential privacy "got a bad reputation" but we can add noise and maintain accuracy now. It is the "gold standard for anonymization". (See the differential privacy sketch after these takeaways.)
- Healthcare patient data is one of the biggest challenges in privacy because you want to maximize the efficacy of care but also maximize privacy. And then how do companies take the data of the individual and extrapolate further to see broader trends?
- We need to get people upskilled so they can understand when to transform data in privacy preserving ways - and then the self-serve platform needs to make it easy for them to do that. But we don't have great industry-wide understanding of how to do either of those that well yet.
- Self-sovereign identity, while very interesting, is probably a long way away from being widely adopted. There needs to be a lot of industry collaboration and agreement, and in many areas it's not really a big benefit based on the legal requirements versus cost. It would be great for company-to-company interoperability with privacy, but who will build it? The 3 panelists were very excited about it though :)
- Privacy and data sovereignty are going to be intermingled in interesting ways in data mesh. Querying data where it is instead of piping it all over the world* will help maintain privacy and comply with laws - many countries don't allow data to be exported as is.

* See Zhamak's Corner 13 (episode #173), which covers some of what querying data where it is means - it's not necessarily about source systems, but it does mean not moving data without necessity.
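To make the policies-as-code takeaway a bit more concrete, here is a minimal sketch of a tag-based access policy evaluated at read time. Everything here - the tags, roles, and masking rule - is a hypothetical illustration, not how any panelist's platform works; real implementations would typically use a policy engine (e.g., Open Policy Agent) and a governance catalog rather than hand-rolled checks.

```python
# Hypothetical column-level policy: columns tagged "pii" may only be
# read in the clear by roles holding the "pii_reader" permission;
# everyone else gets a masked value.
POLICY = {"pii": {"allowed_roles": {"pii_reader"}, "fallback": "mask"}}

def apply_policy(row: dict, column_tags: dict, roles: set) -> dict:
    """Enforce tag-based policies on a single record at read time."""
    result = {}
    for column, value in row.items():
        rule = POLICY.get(column_tags.get(column))
        if rule is None or rule["allowed_roles"] & roles:
            result[column] = value  # untagged column or authorized reader
        else:
            result[column] = "***"  # fallback: mask the value
    return result

row = {"email": "ada@example.com", "country": "DE"}
tags = {"email": "pii"}  # 'country' carries no privacy tag

print(apply_policy(row, tags, roles={"analyst"}))     # email masked
print(apply_policy(row, tags, roles={"pii_reader"}))  # email in the clear
```

The point of expressing the rule as code rather than as a PDF policy document is that the platform can enforce it uniformly at consumption time and audit every decision.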
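And for the differential privacy takeaway - "bounding the probability of someone learning" something about an individual - here is a minimal sketch of the classic Laplace mechanism in Python. The records and epsilon value are invented for illustration; production use should rely on a vetted library (e.g., OpenDP or Google's differential privacy libraries) rather than hand-rolled noise.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Differentially private count: bounds how much any single
    individual's presence can change what an observer learns.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon satisfies epsilon-differential privacy.
    """
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical patient records: (age, has_condition)
patients = [(34, True), (51, False), (29, True), (62, True)]

# Smaller epsilon = stronger privacy, noisier answer.
private_count = laplace_count(patients, lambda r: r[1], epsilon=0.5)
print(f"Noisy count of patients with the condition: {private_count:.1f}")
```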
It is the "gold standard for anonymization".Healthcare patient data is one of the biggest challenges in privacy because you want to maximize the efficacy of care but also maximize privacy. And then how do companies take the data of the individual to extrapolate further to see broader trends?We need to get people upskilled so they can understand when to transform data in privacy preserving ways - and then the self-serve platform needs to make it easy for them to do that. But we don't have great industry-wide understanding on how to do either of those that well yet.Self-sovereign identity, while very interesting, is probably a long way away from being widely adopted. There needs to be a lot of industry collaboration and agreement and it's not really a big benefit in many areas based on the legal requirements versus cost. It would be great for company-to-company interoperability with privacy but who will build it? The 3 panelists were very excited about it though :)Privacy and data sovereignty are going to be intermingled in interesting ways in data mesh. Querying data where it is instead of piping it all over the world* will help maintain privacy and comply with laws - many countries don't allow data to be exported as is. * see Zhamak's Corner 13 episode #173 that covers some of what querying data where it is means and that's not necessarily about source systems but it does mean not moving it without necessityData Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see hereAll music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
Mar 6, 2023 • 1h 23min

#202 Creating a Balanced, Sustainable Approach to Your Data Mesh Journey - Interview w/ Kiran Prakash

Sign up for Data Mesh Understanding's free roundtable and introduction programs here: https://landing.datameshunderstanding.com/

Please Rate and Review us on your podcast app of choice!

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here. Episode list and links to all available episode transcripts here.

Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.

Kiran's LinkedIn: https://www.linkedin.com/in/kiran-prakash/
Kiran's article on the 'Curse of the Data Lake Monster': https://www.thoughtworks.com/insights/blog/curse-data-lake-monster

In this episode, Scott interviewed Kiran Prakash, Principal Engineer at Thoughtworks.

Some key takeaways/thoughts from Kiran's point of view:
- ?Controversial?: You MUST have exec sponsorship to move forward with your data mesh implementation. You need the top-down push for the necessary reorganization when the time comes. Scott note: only kinda controversial, really more often ignored :D
- ?Controversial?: Data mesh, if done well, doesn't need to have a huge barrier to entry. That's a misconception. If you think about gradual improvement/evolution, you'll be on the right track.
- "The Curse of the Data Lake Monster" was like the data field of dreams - there was an expectation that if you build a great data lake, value will just happen. If you ingest and process as much as you can, the use cases will just happen. And it really wasn't the case. So we should apply product thinking to data to focus on what matters.
- The 'Curse' was a manifestation of Conway's Law - the strong separation between IT and the business led to mismatched goals and subpar outcomes. With microservices, that started to be much less of an issue on the operational plane, so why not try with data?
- It's easy to lose sight of Conway's Law and aim for distributed architecture first, but the organizations doing data mesh well are changing their architectural and cultural approaches and patterns together. Don't try to do the architecture first - you really don't know your key challenges yet.
- It's very important to have a target operating model and get clear on your organizational vision and purpose around data - how will you use data?
- Once you have an organizational vision and purpose, domains should start setting goals aligned to that vision and purpose.
- As others have noted, don't get ahead of yourself - work in thin slices for your data mesh implementation. Stay balanced at an overall level between the data mesh principles as you add more and more thin slices, but don't try to solve all problems up front.
- If you modernize your legacy software but don't change the organization, expect to do the same type of modernization in about 5 years.
- To really get to a scalable approach to data mesh, you should look for organizational and process reuse as much as tech/architectural and data reuse.
- Move from measuring outputs to measuring value outcomes. Sounds simple - it isn't - and it's crucial to changing your mindset about how you approach data.
- ?Controversial?: Use the 4 key metrics from DORA (https://dora.dev/) to measure how well you are doing in your software engineering. And a key aspect of data mesh is about applying good software engineering practices to data, after all.
- If you want to measure the value of your data work, you need to break it down into tangible objectives. Ask the owners of those objectives to provide the value of meeting the objectives. Then look to measure how much the data work contributed to achieving those objectives.
- Think of a use case as a value hypothesis: you are making a bet that something will have value. It's okay to be wrong, that's the nature of betting. But limit the scope of your mistakes so you learn and adjust towards value instead of making big mistakes.
- ?Controversial?: If you don't have a culture where it's okay to fail, it will be very hard to do data mesh well. Scott note: it will essentially be impossible in my opinion.
- Many times what people consider minimum viable product is neither minimum nor viable. This is often due to a culture where you can't test things with users when they are still very rough. That will limit your success with data mesh. However, most people are reasonable, so if you read them in that this will be a rough sketch/first iteration, they usually are on board to help you iterate to good.
- "Architecture is about the important stuff. Whatever that is." - Ralph Johnson (via Martin Fowler)
- Always think about necessary capabilities and build to those. The most important are the capabilities you need now. Don't get ahead of yourself.
- The 'data platform' is really a misnomer - there will be multiple platforms. Users care about services, not whether you have one platform or five or more. Don't have platform sprawl, but don't over-centralize either; that usually leads to scaling and flexibility challenges.

Kiran started off by talking about a blog post of his with a colleague from 2019 called "The Curse of the Data Lake Monster" - lots of clients were building big data lakes and it wasn't providing the expected value. There was an expectation that if you ingested and processed as much of your data as you could, it would create great use cases and lots of value. But it didn't happen. Value doesn't just happen without concerted and concentrated effort. So Kiran asked why aren't we applying product thinking to data to figure out and focus on what matters to drive value. What would happen if we focused on outcomes instead of platforms? What if we measured value, not how many terabytes were processed and stored?

A key reason Kiran feels the 'Curse' happened was the strong separation between business and IT. That separation meant both were seeking different goals instead of collaborating. IT was focused on building things instead of solving business problems, and the business side was focused on doing what they could, not building out a scalable and robust data practice. Conway's Law in action. We saw microservices really help to tackle those same issues on the operational plane, so the data side of the house was definitely ripe for some product thinking-led innovation.

Data mesh avoids a lot of the issues of past data approaches by not leading with the technology - the first two principles are not tech focused. For Kiran, many (most?) of the organizations that are getting data mesh right are respecting Conway's Law and shifting their architecture and organizational approaches together, but in thin slices so as to not put too many eggs in one basket and to make reasonable progress. And they are getting the exec sponsorship because, while you don't want to reorganize your entire company upfront to do data mesh, you do need some top-down pushing to actually drive the necessary org changes when appropriate.

According to Kiran, while many people think data mesh has a high barrier to entry, that shouldn't be the case. There should definitely be a target operating model at the organizational level and organizations need to keep that in mind, but it's not as though, again, you reorganize the organization all upfront. Organizations also need to really answer the question of what they are trying to do with data and what doing data mesh well would drive for them - if that's not crisp, they probably aren't ready to do data mesh because their business strategy isn't aligned to or contingent on doing data well.

Once the organization has the target operating model and vision down, Kiran recommends that domains start to set their own specific goals aligned to the broader organizational vision. They should work on some hypotheses on how to achieve those goals and how they plan to measure their progress towards those goals. Start to build out your thin slice approach to making progress towards your vision and goals. Don't get super far ahead of yourself - look to progress at a meaningful but reasonable pace and tackle as little as is necessary now while still making sure you are aligning with the organizational vision. Keep your eyes on the prize, don't take on too much now. And yes, easier said than done.

Kiran pointed to a quote he read: if you modernize your legacy software stack but don't change your organization, you will need to do the same modernization in about five years. The same goes for data - if you are taking on data mesh from a tech-first approach, you'll just have to do all the same modernization again and you won't get a lot of the potential benefit from data mesh. Decentralizing the architecture will only really change things if you change the organizational aspects too. People, process, technology.

For Kiran, we need to really start to focus more on measuring value outcomes instead of inputs - how many terabytes or operations per second isn't directly tied to value. Teams need to have it made clear what is actually valued and valuable. In many large organizations, there are often less clear links between data work and business value, so you have to educate and incentivize teams to do the high-value data work.

When thinking about trying to measure the return on investment in data work, especially data mesh, Kiran recommends starting by breaking it down into more tangible goals and measuring the value of achieving those goals. Then you can start to say how the data work contributed to achieving those goals. But a data team can't really know the value themselves, whether inside the domain or not. And by breaking things into smaller goals and objectives, you can more quickly iterate towards value with tight feedback loops. Instead of large-scale projects, you build to larger and larger objectives by breaking things down and achieving meaningful micro progress that leads to large macro value.

Kiran talked about thinking of use cases as value hypotheses: you believe it will have value and thus you are making a bet. And it's okay to get things wrong - just limit the scope of the mistake so you can use the missteps as learning and the larger macro bet has a much higher chance of paying off. This is iterating to value, those tight feedback loops. If you don't have a culture where it's okay to be wrong, okay to fail, then data mesh is potentially (Scott note: almost definitely) not right for you.

Minimum viable product is often neither minimum nor viable in Kiran's experience. If you can't put something pretty rough in front of stakeholders, you waste far more time and effort building in wrong directions and are less likely to hit on success. But that's often out of the control of the product team. So it's a catch-22: do you put in a lot of effort to get it well past MVP, or do you risk losing face? We need a culture where we can actually do thin slicing well to really derive the most value out of building software, whether that is apps or data products. That incremental value delivery is really crucial to maintaining nimbleness as you scale. If you can't actually deliver in thin slices, it can significantly increase risk as you are making larger bets. But Kiran's seen that if you spend the time to explain the need for thin slicing - that what they are looking at is the 'sneak peek' and you just want feedback - most people get it and are reasonable. But you need to communicate about it :)

Kiran used a phrase Martin Fowler uses often, from Ralph Johnson: "Architecture is about the important stuff. Whatever that is." When thinking about decisions that are hard to reverse, spend a lot more time and care; the ones that are easy to reverse probably don't make up the core of your architecture. When building your architecture, it's again important to build incrementally instead of trying to get it perfect from the start. Think about necessary capabilities, not technologies. In data mesh, that is about data product production and then monitoring/observing, data product consumption, mesh level interoperability/querying, etc.* Start to map out what you need, then think about what level you need from a capability standpoint and when. You don't need to build out capabilities for when you have 20 data products when you have 1-2 data products.

* Scott note: Kiran went on for quite a bit here about necessary capabilities for data mesh late in the episode if you want to listen.

Quick tidbit:
Leverage the 4 key metrics from DORA to measure how well you are doing your software engineering as applied to data: 1) Lead Time to Changes (LTTC); 2) Deployment Frequency (DF); 3) Mean Time To Recovery (MTTR); and 4) Change Failure Rate (CFR). https://dora.dev/ (See the sketch below.)
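As a toy illustration of the DORA tidbit, here is a hedged Python sketch computing the four key metrics from deployment records. The records and field layout are hypothetical; real teams would derive these from their CI/CD and incident tooling rather than a hard-coded list.

```python
from statistics import mean

# Hypothetical deployment records for one data product over a 28-day
# window. Each tuple: (lead_time_hours, caused_failure, recovery_hours).
deploys = [
    (20.0, False, None),
    (48.0, True, 3.0),
    (12.5, False, None),
    (30.0, False, None),
]
window_days = 28

lead_time = mean(d[0] for d in deploys)       # 1) Lead Time to Changes
deploy_freq = len(deploys) / window_days      # 2) Deployment Frequency (per day)
failures = [d for d in deploys if d[1]]
mttr = mean(d[2] for d in failures) if failures else 0.0  # 3) Mean Time To Recovery
change_failure_rate = len(failures) / len(deploys)        # 4) Change Failure Rate

print(f"LTTC: {lead_time:.1f}h | DF: {deploy_freq:.2f}/day | "
      f"MTTR: {mttr:.1f}h | CFR: {change_failure_rate:.0%}")
```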
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here.

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
Mar 5, 2023 • 33min

Weekly Episode Summaries and Programming Notes – Week of March 5, 2023

Sign up for Data Mesh Understanding's free roundtable and introduction programs here: https://landing.datameshunderstanding.com/

Please Rate and Review us on your podcast app of choice!

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here. Episode list and links to all available episode transcripts here.

Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
Mar 3, 2023 • 1h 21min

#201 Choose Your Blast Radius and Other Lessons Learned Across 10s of Data Mesh Implementations - Interview w/ Vanya Seth

Sign up for Data Mesh Understanding's free roundtable and introduction programs here: https://landing.datameshunderstanding.com/

Please Rate and Review us on your podcast app of choice!

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here. Episode list and links to all available episode transcripts here.

Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.

Vanya's LinkedIn: https://www.linkedin.com/in/vanyaseth1809/

In this episode, Scott interviewed Vanya Seth, Head of Technology for Thoughtworks India and Global 'Data Mesh Guild' Lead for Thoughtworks. To be clear, Vanya was only representing her own views on the episode.

Some key takeaways/thoughts from Vanya's point of view:
- Data mesh is at a similar inflection point to where microservices was a decade ago. Let's not relearn all the hard lessons they already learned. We should adapt/contextualize to data of course, but we can skip a lot of the anti-patterns. Similarly, many people are stuck thinking "there's no way that could work" regarding data mesh, like they were when people suggested development and operations be combined in DevOps. It's understandable - it's hard to imagine a post-monolithic world when all you've known is monoliths.
- ?Controversial?: We should try hard to prevent creating fear of missing out (FOMO) for those not doing data mesh. If data mesh isn't right for your org, especially if it isn't right at this time, that's perfectly okay. Don't take on the overhead cost of data mesh if it won't bring more value than cost. Scott note: PREACH!
- ?Controversial?: For some CDOs or CAOs, their organizations don't really get the value of data, so they are implementing data mesh to try to prove out value and make their mark. That can obviously create issues if their organizations aren't ready.
- A few indicators an org is ready for data mesh (see below for expanded context): A) data/AI investments are not delivering the promised/expected returns and/or it's hard to point to the value delivered in general from data/AI investments; B) the organization is attempting to throw more people at centralized data management and it's not working (platform included); and C) there's extremely unclear ownership around many aspects of data, especially who owns aspects of hand-offs or who owns the end data asset - how can a consumer actually ask a question about the data with no clear owner?
- "Innovation in queue syndrome" = your innovation agenda is "in queue" and keeps getting deprioritized because you are dealing with everything else first just to keep your data practice flowing.
- Use value stream mapping to understand how your organization drives value from business processes and where there is value leakage. Especially useful if data work isn't driving value.
- We should take a lot of learnings from how microservices service discovery evolved, especially the tooling, for data mesh. There is no need to reinvent the wheel on this.
- Some existing tooling from the microservices space is just fine for data mesh too. We don't need to invent new tools when existing ones - which are already robust and mature - can be extended or even used as is.
- Platforms aren't about the tooling, they are about the holistic user experience - how do you stitch things together to automate the toil and let users focus on what matters? The tooling is under the hood, not the main interface.
- Users of your various data platforms should not be directly interacting with tools for the most part. It should be about abstracting away the tools and making it easy for them to interact with the data, not the tools of the platform.
- !Crucial!: "Choose your blast radius." Far too many are looking to change the entire organization at the start of a data mesh journey instead of limiting scope to a reasonable level. Find one "courageous" domain to move forward.
- "Nothing succeeds like success itself." Get to a data mesh win that you can tout quickly so others will get bought in, see the value, and want to participate. Incremental value delivery builds interest and momentum.
- Build your platform at the same time as you're building your initial data products. Far too many platforms are built with tools as the focus instead of automating away toil and focusing on necessary capabilities.
- !Crucial!: Evolvability should be a first class concern when building your platform, just like with any product. You must be able to continue to improve and change to meet needs.
- Focus on the abstractions and the ubiquitous language - e.g. business people don't care what the technical underpinnings of a data product are, they care about what it means for them and how they can access/leverage it.
- When starting your data mesh journey, look at the use cases to decide how much of each pillar you really need. Don't overbuild early. If you only need a minuscule amount of governance, great. If you don't actually need the producing team to be overly involved in ownership, awesome. Don't go for full data mesh at the start.
- What you should focus on relative to your early journey is unique to your own situation and use case. Don't worry about competitors or how others are starting - their circumstances are their own.

Vanya started with a bit about her background and how deeply entrenched she's been in the microservices space - that played into the overall conversation a lot. Both Vanya and Scott agree that if we want to do data mesh right, we really should take learnings from microservices and DevOps so we don't have to relearn what they already did the hard way. For Vanya, data mesh is at a similar inflection point to where microservices was a decade ago - people were extremely skeptical that developers and operations could even work together, much less be combined in a singular approach with DevOps. It's hard to imagine a post-monolith world when all your career and experience are with monoliths. We have to be somewhat kind to those people in understanding that change is hard and scary :)

But, as a counter, for data mesh Vanya believes (and Scott agrees) we must try to prevent creating the same fear of missing out (FOMO) that microservices had. For many, if your organization wasn't doing microservices, it wasn't seen as a cool place to work, and the sense was that all the best developers were at companies doing microservices. We don't want that in data mesh because it will lead to lots of wasted effort for companies that shouldn't be doing data mesh now or potentially ever.

According to Vanya, there are a few really good indicators an organization might be ready for data mesh. Before we get into the 3 she listed, a few things that might be indicative of indicators (Scott note: I know, I know, silly Scott phrasing): constant displeasure with the kinds of initiatives they've been doing in the data and AI space - constant pressure to prove the value of data and AI investments but really an inability to do so. Long and lengthening cycles to return on data work/projects. A biggie is an ever-growing platform that is trying to do too much and hasn't been delivered - trying to boil the ocean. The 3 indicators data mesh could be a good fit that Vanya listed were:

1) Investments in data and AI aren't delivering expected value and it's hard to actually point to the value that is being delivered. Users aren't getting "the right data at the right time with the right quality".

2) Large and growing central data teams where trying to scale is done by throwing more people at the problem and it just isn't working. When automation would be better, they add people.

3) Confusion around who owns data when and why. Who owns the handoff between systems? Who owns the documentation and metadata around data? When someone has a question, how hard is it to find who owns the data?

Vanya highly recommends using value stream mapping to understand how you drive value with business processes and especially where the value leakages are; this can be data-related or not, and should be applied to both analytical and operational data processes. You can better understand your business processes and expected outcomes - if something didn't meet expectations, was that because expectations were wrong or did something happen along the way to lose value? Value stream mapping gives you an objective and neutral starting point and helps identify problem areas - value leakage - where you can prioritize what to tackle first.

In microservices, Vanya pointed to how challenging service discovery started to become until tooling came along - she specifically mentioned Consul - so we really don't have to reinvent everything in data mesh. The tools out there, especially those in the open source space, are making really nice progress - she specifically mentioned DataHub - compared to where they were 2 years ago at the infancy of bleeding edge data mesh adoption. Overall, we should 1) look to existing tools to see if we can use them as is; 2) look to extend existing tools where possible to cover incremental needs specific to data mesh; and then 3) look to create new tooling that is required for data-specific challenges. Again, don't reinvent the wheel.

For Vanya, one thing many organizations struggle with in data mesh is the self-serve platform - what is the goal? Circling back to an earlier point, it's not about building the most amazing, ocean-boiling platform. It's about stitching tools together to automate the toil away - how can you create a holistic user experience so people focus on the value-add? The value of the platform to its users is the abstractions away from the tools that make it easy to focus on what needs to be done to drive value from data, not play with the shiny tools. Focus on enabling interacting with the data, not the tools of the platform.

"Choose your blast radius" is a key phrase for Vanya. Think about scope appropriately and don't try to bite off more than you can chew. You don't have to reorganize your entire organization on day one to do data mesh - that is far too much of an upfront cost and makes failure a massive cost. Look at how it was done well in microservices: thin slices, not taking a sledgehammer to the monolith. Gradual evolution is sustainable; a revolution either succeeds or it doesn't - don't take on risk that isn't actually beneficial!

"Nothing succeeds like success itself," was another line from Vanya. It's crucial to get to an early win or two to show off to the rest of the organization, proving data mesh delivers value and getting them interested in participating. 'Hey, we did this and it was a big win, who's next?!' It's not just about showing value, it's about showing there was a reasonable, encapsulated timeline, not just promises. That incremental value delivery creates momentum, and the more momentum you have, the more you can get people on board.

As many past guests have noted, it is a pretty bad (Scott note: fully terrible) idea to build the platform and then bring it to the users when it's done, in Vanya's view. There are far too many unexpected friction points, and finding those and tackling/automating away the actual friction is where the platform adds value, not bells and whistles. You want to find those as they emerge and work with tight feedback loops - that's product thinking! And if you don't make evolvability a first class concern, you are not building your platform as a product either.

For Vanya, it's pretty easy for tech people to focus on the tech, whether that is in data or not. But the overall organization doesn't care about the tech, they care about what they can do. So it's crucial to find the ubiquitous language and make your implementation and platform about what people are trying to do. The user isn't accessing S3, they are accessing the Inbound Marketing Conversion data product. S3 is simply a mechanism for accessing the data and insights. (See the sketch below.)
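As a rough sketch of that abstraction idea - consumers addressing a data product by name while the platform hides the storage details - consider the hypothetical Python client below. Every name, path, and field is invented for illustration; a real platform would back this with a catalog/discovery service rather than a hard-coded dict.

```python
import json

# Hypothetical registry mapping data product names to physical storage.
# In a real platform this would live in a discovery/catalog service.
_REGISTRY = {
    "inbound-marketing-conversion": {
        "location": "s3://acme-mesh/marketing/conversion/v3/",
        "owner": "marketing-domain-team",
        "format": "parquet",
    }
}

def get_data_product(name: str) -> dict:
    """Resolve a data product by its published name.

    Consumers never see S3 buckets or file formats directly; the
    platform resolves the name and returns a handle with the owner
    and access details attached.
    """
    try:
        return _REGISTRY[name]
    except KeyError:
        raise LookupError(f"No data product registered as '{name}'")

handle = get_data_product("inbound-marketing-conversion")
print(json.dumps(handle, indent=2))
```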
When considering your thin slice early in your data mesh journey, it's okay to have a very unbalanced slice in Vanya's view. This has been mentioned before but it's important to reiterate. If you only need a bit of one of the pillars but you do need more capability in another of the pillars, that's absolutely okay. Don't build today for all the problems of 6 months from now. You want to focus on tackling the toil of today.

Quick tidbits:
- Vanya's phrase "innovation in queue" is when an organization keeps putting off its innovation agenda for more immediate concerns - everything innovative ends up getting deprioritized in the queue.
- Most data mesh journeys are taking six to seven months to really prove out data mesh and its value. Scott note: this seems to be standard for larger organizations, but a complete POC means faster follow-on for additional use cases. It's a balance!
- In some organizations where data is not really valued, the CDOs or CAOs are looking to implement data mesh to show the value of data, but their organizations often aren't ready, and trying to do data mesh just creates more challenges than benefits.

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here.

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
Mar 2, 2023 • 11min

{Bonus} Zhamak's Corner 19.5 - How Does AI/ML Change When Trust is Automatic?

Sponsored by NextData, Zhamak's company that is helping ease data product creation.

For more great content from Zhamak, check out her book on data mesh, a book she collaborated on, her LinkedIn, and her Twitter.

This episode is part of the greater AI/ML conversation I had with Zhamak, but it's super important to emphasize the importance of trust - enough so that I created a separate quick episode on it. Not just trust in the data itself, but trust that there is easy access and that there will be going forward. A lot of the things we have done in data historically have been defensive in nature - especially grabbing a copy of the data now because who knows when you'll get access to it again. What if we could implicitly trust that there has been care and foresight in the preparation of the data I find, that there is an owner I can ask if I'm confused or curious, that my access won't suddenly go away, and that what's there won't suddenly change without my knowledge? In ML/AI, data scientists have done things in ways that made sense for their situation and challenges. What happens when we make trust inherent? What incremental value does that drive?

Please Rate and Review us on your podcast app of choice!

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here.

Data Mesh Radio episode list and links to all available episode transcripts here.

Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
Mar 1, 2023 • 17min

#200 Zhamak's Corner 19 - An AI/ML Future Without So Much (Needless?) Complexity

Sign up for Data Mesh Understanding's free roundtable and introduction programs here: https://landing.datameshunderstanding.com/

Sponsored by NextData, Zhamak's company that is helping ease data product creation.

For more great content from Zhamak, check out her book on data mesh, a book she collaborated on, her LinkedIn, and her Twitter.

This episode is part of the greater AI/ML conversation I had with Zhamak. To start, Zhamak recognizes we aren't where we want to be in terms of capabilities - ways of working or tooling - to make this a reality just yet. But if we can make it so data scientists can trust and easily consume from data products - if we create data products that don't care about the use case type, regular analytics or AI/ML - can we remove a lot of the complexity they face? Do they need feature stores for data they aren't transforming? If they can get continued access and know the quality, why create a separate, fragile process instead of trusting the data product owners upstream?

I wasn't smart enough in the moment to ask whether we need to keep a copy of the training data itself for reproducibility, but folks smarter on ML than I am can answer that one, probably in the affirmative. Overall, though, there is a lot of complexity in the way we do AI/ML because data scientists can't trust the sources of their data, and they feel the need to take control because if they don't, their models break. So we need to earn their trust and show them a better way. But again, we aren't there yet, so let's work to make this a reality in the future.

Please Rate and Review us on your podcast app of choice!

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here.

Data Mesh Radio episode list and links to all available episode transcripts here.

Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
Feb 27, 2023 • 1h 8min

#199 Finishing Your Data Marathon - Driving to Action from Data - Interview w/ Brent Dykes

Sign up for Data Mesh Understanding's free roundtable and introduction programs here: https://landing.datameshunderstanding.com/

Please Rate and Review us on your podcast app of choice!

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here. Episode list and links to all available episode transcripts here.

Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.

Brent's website and book: https://www.effectivedatastorytelling.com/
Brent's LinkedIn: https://www.linkedin.com/in/brentdykes/
Brent's Data Analytics Marathon Forbes article: https://www.forbes.com/sites/brentdykes/2022/01/12/data-analytics-marathon-why-your-organization-must-focus-on-the-finish/?sh=2af698743c3b

In this episode, Scott interviewed Brent Dykes, Chief of Data Storytelling at his own firm, AnalyticsHero. Scott asked Brent to be on after João Sousa pointed him to Brent's content.

Some key takeaways/thoughts from Brent's point of view:
- Focus on the so-what - what should people take away and do from the insights - not the sausage making of the insights. Execs want to eat the dang cake, not hear about how you made it!
- ?Controversial?: You want to get to a place where you remain more neutral until the data informs your view. It can cause more cognitive load to update our views than to wait for the data to speak first.
- Many organizations lose steam in actually driving action on analytics; they don't drive change with data - they fail at least one of the following: generating actual insights, communicating their insights well enough to drive action, and/or actually acting on the insights.
- The analytics marathon: data collection -> data processing -> data visualization/reporting -> data analysis -> insight communication -> take action.
- Many companies stop the data marathon after getting to the visualization/reporting step. They aren't driving the results they want, so they focus more time on collection and processing instead of finishing the race, and they get caught in a loop.
- Really consider why you are doing data work. It's not to simply do analytics, to build the dashboards and reports; it's to take action on the data and affect change through more informed decisions.
- It's easy to like the data when it supports your narrative. A strong analytics culture bends decisions and thinking to the data instead of the other way around.
- Companies with good analytics practices, that take action on data, typically have at least an executive sponsor around analytics if not buy-in from the entire leadership team. There is an understanding that actions are driven by data when possible. And there is a test and learn culture.
- Executive support and a test and learn culture are what drive results from analytics - many companies buy the same tools and have vastly different degrees of success with data.
- If your organization isn't doing analytics well, the best way to drive towards doing analytics well is to get to wins from analytics and build momentum to drive to higher and higher exec sponsors.
- ?Controversial?: Sometimes to drive necessary understanding, in documentation - or other ways of bringing data users up to speed - you really need to show the lineage all the way back to how the data is even collected.
- The more you share data, the more likely information is to be misunderstood. Beware the difference between what a metric means and what people _think_ it means.
- ?Controversial?: To get to scale, we use passive communication - read: mostly documentation - about our data. But to truly drive understanding, we also can't shy away from active - read: person-to-person - communication. Scott note: Episode 150 has some interesting insight on how far documentation should go.
- ?Controversial?: Storytelling is often the easiest way to sway people with data because human brains have evolved to accept information via stories. We've been doing it for thousands of years.
- Mature organizations understand the data can be wrong and prepare for that. Move fast and make incremental moves instead of a big bang approach. You learn more and can do better the next time, even on actions that weren't as valuable as expected.
- ?Controversial?: Business analyst roles should evolve to be more like personal trainers - helping people learn how to do good analysis and then communicate their insights. They won't work themselves out of a job, merely get to a place where they focus on the bigger, harder, deeper, more valuable questions.

Brent started with a bit about his background and why he titled his book "Effective Data Storytelling: How to Drive Change with Data, Narrative, and Visuals." There are a few places many organizations fall down in driving change via their analytics, whether that is a failure to generate actual insights, a failure to communicate insights well enough to drive action, or a failure to actually take action on the insights - he's most focused on the communication of insights, a place often overlooked. You can find the best insights in the world, but if you can't communicate those insights well enough, no one will understand them and/or understand the potential impact of acting on them. Communicate well enough to drive change!

The analytics marathon is one of Brent's big analogies for explaining where organizations fail along the path to taking action on their insights. There is data collection, which pretty much all organizations do. Then data processing and on into data visualization. But this is where many orgs fall off because they are simply reporting on what's happening - the descriptive analytics - and not actually driving to diagnostic analytics. Instead of doing deeper analysis, they believe their problems lie in what data is collected, so they try to collect more data, thinking it's simply a lack of information instead of a lack of analysis. And then of course, once you do the analysis, you still have to communicate and take action.

For Brent, a few common indicators an organization will likely have a good analytics practice include: 1) an executive sponsor for being or becoming data-driven, possibly the entire leadership team; 2) a general commitment to driving actions from data where possible - it's "how we do things"; and 3) a test and learn culture in the organization that's supported by data.

If an organization isn't yet data-driven, isn't doing analytics that well, Brent recommends getting to wins and slowly moving your executive sponsorship up the ladder. It might start at a Director level, and then after you build momentum, people will take notice and you can climb to VP level and then C-Suite level. It's about showing the value of analytics and plugging along so you have proof points when you move the conversation higher in the organization. Rome wasn't built in a day and neither is a good, organization-wide analytics practice.

As data initiatives have become more ambitious, ownership has often become more murky according to Brent. What was once data that was essentially only for the generating team is now a potential core value asset and driver for the organization. And that opens you up to much more misunderstanding. Focusing on making sure information is understood - not just that data is made available - is crucial to making good decisions with your data. There's often what the metric means and what others assume it means.

Brent shared his view that we need both active and passive ways of sharing context around data. Passive is the metadata, the documentation, and the like. If we want to scale, passive is crucial. Self-service can't just be a pipe dream. But too often, people in data want to only do passive and ignore the person-to-person conversation. Often that conversation is key to nuanced data or crucial to working with key people making big decisions based on data.

For thousands of years, humans have been passing information via stories - human brains have evolved to share information via stories. We inherently want to know where the story goes. For Brent, mastering that storytelling with data and about data is the best way to convey the information we generate and discover with our data. If you don't communicate insights to those who can take actions, in a way they can understand, they won't take those actions :)

For Brent, execs rarely want to hear how the sausage was made via data. You want to show them what you've discovered and what they should do with that, not how you came upon it. It can be important to show people the sausage making isn't that hard though, especially when trying to enable a team to do self-serve analytics. Really consider which is more appropriate to the situation.

It's pretty easy to like what the data is saying and point to the data as backing you up when you agree with it, in Brent's experience. But a truly data-driven culture will focus on updating their thoughts and processes based on what the data says, improving their understanding via the data instead of trying to bend the data to support their hypotheses. It's about getting to a place where you try to remain more neutral until you hear what the data says and shape your vision around that.

On data-driven versus data-informed as semantics of what we're trying to get to, Brent likes the idea of data-driven. For him, it means really leaning into the data. Part of that is understanding and accepting that sometimes the data is wrong or we didn't ask the question in the right way - that's just getting to data maturity. And data-driven companies recognize the value of learning - there is still value in digging into the why of experiments and moves that didn't have as much benefit as expected. You gain incremental understanding even if not direct incremental business value. And it sets you up to do better on the next iteration.

When asked where business analysts will fit in data storytelling in the future, Brent sees them like personal trainers - not doing the work, but showing people how and assisting them until they get to a level where the BAs aren't as needed. That goes for the analysis and especially the insight communication. Pure self-service analysis is nice in theory, but you need a way for people to get help and make sure they aren't hurting themselves :) If more people are far more capable, the BAs can focus on the more valuable, large-scale questions.

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here.

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
Feb 26, 2023 • 32min

Weekly Episode Summaries and Programming Notes – Week of February 26, 2023

Sign up for Data Mesh Understanding's free roundtable and introduction programs here: https://landing.datameshunderstanding.com/

Please Rate and Review us on your podcast app of choice!

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here. Episode list and links to all available episode transcripts here.

Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
Feb 24, 2023 • 1h 23min

#198 How Do We Make Data Contracts Easy, Scalable, and Meaningful - Interview w/ Ananth Packkildurai

Sign up for Data Mesh Understanding's free roundtable and introduction programs here: https://landing.datameshunderstanding.com/
Please Rate and Review us on your podcast app of choice!
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
Episode list and links to all available episode transcripts here.
Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.
Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.
Ananth's LinkedIn: https://www.linkedin.com/in/ananthdurai/
Schemata: https://schemata.app/
Data Engineering Weekly newsletter: https://www.dataengineeringweekly.com/

In this episode, Scott interviewed Ananth Packkildurai, author of Data Engineering Weekly and the creator of Schemata.

Scott note: we discuss Schemata quite a bit in this episode, but it's an open source offering that I think can fill in some of the major gaps in our tooling - and even in our ways of working collaboratively around data.

Some key takeaways/thoughts from Ananth's point of view:

!Important!: Collaboration around data is crucial. The best way to get people bought in on collaborating around data is to integrate into their existing workflow, not to create yet another one-off tool in yet another pane of glass.

?Controversial?: There is so much friction between initial data producers - the domain developers - and data consumers because they are constantly talking past each other. The data consumers have to learn too much about the domain, and the data producers rarely understand the context of most analytical asks.

Data creation is a human-in-the-loop problem. Autonomous data creation is unlikely to create significant value because the systems can't understand the context well enough right now.

As Zhamak has also pointed out, there is far too much tool fragmentation. It made sense when VC money was readily available and we were all finding our way with cloud, but now we need holistic approaches - not spot solutions - to things like data quality, observability, lineage, etc.

!Open Source Product!: Schemata was created to enforce certain rules around data sharing - specific to data schemas - in a cooperative platform, to help alleviate much of the above friction.

Data needs to take a lot of learning from the platform engineering for microservices space. Platform teams make it easy for service teams to test new services or changes, deploy, etc. In data, we are asking domains to own their data without giving them the tooling to easily do so. It's too much of an ask. Scott note: PREACH!

?Controversial?: In general, we need better ways to share what data is already available and what data we expect. This is where data contracts as a platform - instead of as tooling only - become important.

!Scott Controversial Note!: Many get data contracts woefully wrong. Data contracts aren't _only_ about the contract. They signify a relationship that has contractual terms. Think about a vendor: is the contract your only interaction, communication, and set of expectations, or is it one part of a relationship with strong guarantees?

Consumer-driven data contract testing is important. It is defensive in a way - if my upstream changes, I don't necessarily want to consume from it.
But consumer-driven testing can also be a great part of the conversation around how consumers are actually using the data - it's a programmatic way to describe usage to producers. Scott note: if we can make consumer-driven testing easy, great. But we need to reduce the burden on both producer and consumer to ensure data contract compliance.

We need to be able to take consumer requests and translate them for producers, as well as give producers guidance on effective - cost-wise or otherwise - ways of meeting those requests. E.g., including a user ID might seem easy to a consumer, but it could be very expensive for producers given their standard way of looking up user IDs. How can we put a prescriptive path in front of data producers to make it easy to meet requests?

?Controversial?: Pull requests > requirements gathering. A consumer can show exactly what they want, and a producer can approve or deny, but it generates a better conversation.

Teams need to figure out coordination and communication in a decentralized data modeling world. That is the federated aspect of data mesh - if it's all decentralized and nothing works together, you end up with garbage data at the organization level despite domains having well-modeled data for themselves.

!Controversial!: Schemata believes there needs to be a core data domain that links most other domains. Scott note: While not rare in data mesh, a core domain can become a bottleneck and may not give you the flexibility required. Adevinta (episode #40) discusses leveraging a core domain in depth.

It's very, very valuable for your platform to give automated feedback to people considering creating a new data asset - whether it will become a data product or not - including how well it fits in the organization's data landscape. Are you creating something that can actually be leveraged for other use cases? Does it integrate well with existing data assets/products?

It's crucial to have something that gets all the parameters of a data contract on paper - a minimal sketch of what that might look like follows below. Think about negotiating an agreement with a vendor: is it all verbal, or are you starting from something concrete and working from there? Have your platform provide the basic parameters that people can adjust.

Far too often, the first conversation a data producer has with a consumer is once something breaks for that consumer. These silent or stealth data consumers create expectations without ever telling the producer, and that causes many, many issues.

Schemas should be immutable - the only way to change your schema is by creating a new version.

Ananth started by sharing a bit about his background. Despite writing the Data Engineering Weekly newsletter, he sees his experience as somewhere between data engineer and data analyst. That gave him the ability to see the full end-to-end journey of how data was handled at many different organizations. He consistently saw that analytical data outside the application scope was an afterthought, because developers were focused singularly on their application, not how it fit into the greater scheme - especially on the analytics side.

For Ananth, the data marketplace is a useful concept for many organizations thinking about data contracts. It might be more of a data bazaar than an Amazon in certain ways, as there can be a bit of collaborative negotiation - 'oh, you have XYZ to offer, what about ABC, could you do that?'
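On the earlier takeaway about getting all the parameters of a data contract on paper, here is a minimal sketch of a contract captured as a typed, versioned record. Everything here - the field names, the SLA units, the consumers list - is an illustrative assumption, not Schemata's format or any real tool's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: a published contract version is never edited
class DataContract:
    dataset: str
    owner: str                 # a registered owner, never "nobody"
    schema_version: int        # schema changes mean a new version, not an edit
    schema: dict               # field name -> type name
    freshness_sla_hours: int   # e.g. new data lands within N hours
    consumers: tuple = ()      # known consumers - no stealth consumption

def next_version(old: DataContract, new_schema: dict) -> DataContract:
    """Schemas are immutable: any change is published as a new version."""
    return DataContract(
        dataset=old.dataset,
        owner=old.owner,
        schema_version=old.schema_version + 1,
        schema=new_schema,
        freshness_sla_hours=old.freshness_sla_hours,
        consumers=old.consumers,
    )

# Hypothetical example: adding a currency field ships as version 2
v1 = DataContract(
    dataset="orders",
    owner="orders-team",
    schema_version=1,
    schema={"order_id": "string", "order_total": "float"},
    freshness_sla_hours=24,
    consumers=("finance-analytics",),
)
v2 = next_version(v1, {**v1.schema, "currency": "string"})
print(v2.schema_version)  # -> 2
```

The frozen record mirrors the schema-immutability takeaway above: a published contract version is never edited in place; changes ship as a new version that consumers can adopt deliberately.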
We need standardized ways to discuss/document data to make it far easier to share data, or at least to start the conversation from an informed standpoint when collaborating to get the most useful data created and shared. We need programmatic ways for producers to share what data they have available - including expectations like SLAs - and for consumers to request the data they want along with their expectations. Scott note: It's crucial to understand that data contracts are less about the actual contractual terms and more about establishing a relationship that is covered by those terms. There are expectations, but the contract isn't the entire relationship between the data producer and the data consumer. Essentially, the relationship includes the contract; just having SLAs will not resolve many of the issues people have around data contracts/sharing.

Similar to something Chris Riccomini mentioned in episode #51, Schemata is looking to provide feedback to producers about what broke downstream when they made a change - or, more valuably, what will break before a commit is deployed. Data producers haven't historically had much of this feedback, e.g. "if you make this change, it will break your data contract expectations on the schema front because of…". Schemata is also designed for producers to see how well what they are offering fits with what other domains offer - how well does my domain, or a potential new data product, integrate into the overall organizational data-sharing landscape?

On consumer-driven testing in data contracts/agreements, Ananth thinks there are two aspects: structural and behavioral (see the sketch after this passage). Structural is what you'd expect and what most people discuss in data contracts - mainly schema validation: is it backward compatible, is it strongly typed, is the required metadata complete, is there a registered owner, are the SLAs defined and complete, etc. Behavioral is similar to what Abe Gong talked about in episode #65: what are the expectations, and does the data behave the way people expect, such that it can actually be leveraged for their use case? A key, widespread reason we need consumer-driven testing is that producers rarely understand how consumers will use - or are already using - their data. That behavioral testing, along with actual human-to-human conversations, can inform the producer about how consumers are or will be leveraging the data.

One general issue many teams have, according to Ananth, is that the consumer doesn't really understand the cost or complexity of a data creation ask. E.g., one domain's producer might not store the user ID, so fetching every user ID is an expensive database call. A consumer creating a pull request - instead of a demand/request for data - means you can start from a deeper conversation about what the data will be used for and why it's structured the way the PR proposes. It's also much more within a domain developer's normal git workflow. It's all just far less vague: even if the initial proposal is infeasible, the producer has far more information about how the data might be used and can iterate toward a workable solution.

According to Ananth, many people looking at Schemata have seen the need for years, but there hasn't been a great way to implement what Scott calls "making the implicit explicit" around data sharing/data contracts.
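As a rough illustration of the structural-versus-behavioral split discussed above, here is a hypothetical consumer-driven contract test. The field names, thresholds, and check functions are all invented for the example - this is not Schemata's API, just the general shape of checks a consumer might contribute to a producer's CI:

```python
# Structural expectation: the fields (and types) this consumer actually reads.
EXPECTED_FIELDS = {
    "order_id": str,
    "user_id": str,
    "order_total": float,
}

def check_structural(producer_schema: dict) -> list:
    """Flag fields the consumer relies on that went missing or changed type."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in producer_schema:
            problems.append(f"missing field: {field}")
        elif producer_schema[field] is not expected_type:
            problems.append(f"type change on {field}")
    return problems

def check_behavioral(rows: list) -> list:
    """Behavioral expectation: does the data behave the way the use case assumes?"""
    problems = []
    if rows:
        null_rate = sum(r.get("user_id") is None for r in rows) / len(rows)
        if null_rate > 0.01:  # assumed agreed tolerance: <= 1% null user_ids
            problems.append("user_id null rate above agreed threshold")
    if any(r.get("order_total", 0) < 0 for r in rows):
        problems.append("negative order_total violates expectations")
    return problems

# The producer runs these checks before deploying a schema or pipeline change:
schema = {"order_id": str, "user_id": str, "order_total": float}
sample = [{"order_id": "o1", "user_id": "u1", "order_total": 42.0}]
print(check_structural(schema) + check_behavioral(sample))  # -> []
```

Run in the producer's CI, the structural check describes exactly which fields this consumer depends on - a programmatic description of usage - while the behavioral check encodes expectations the schema alone can't express.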
This need isn't typical at a small company, but once you get to a certain scale, the need for decentralized data modeling becomes very evident. With decentralized data modeling, though, it's pretty easy to put yourself in a bad spot: with no collaboration layer, you create data silos / things that just don't interoperate well. It's much like thinking federated governance versus purely decentralized governance in data mesh.

Schemata has a concept of a core domain; for every incremental entity or event you model, it automatically assesses how well the new event or entity connects to that core domain (a rough sketch of this kind of scoring appears below). The idea is to quickly figure out how well what you are building will connect into the greater whole of the organization through the core domain. It gives you quick feedback while work is in process, and a producer can easily add more fields to better match the core domain if they want. It isn't a blocker; it's feedback to someone creating a pull request - data producer or consumer - about how well the resulting data model would fit the organizational data landscape.

Ananth discussed how data creation is really a human-in-the-loop challenge - autonomous data creation is just not very valuable now and might never be. We need a collaborative platform to create data that is truly valuable and understandable, but especially usable. The crucial aspect is to make a tool that integrates into people's workflows instead of yet another screen that further fractures the data management experience. Schemata is trying to be like Snyk - automatically scanning and giving people actionable advice with little effort on their part. Where are your likely pain points? How could you address them? You can more easily set a goal of remediation/improvement and track how well you are doing. What are the top 2-3 things you could focus on to make the data you share that much better/more valuable?

A big thing many overlook in creating data contracts, according to Ananth, is defining the value and/or cost of something happening. It's about getting people to the table to discuss something concrete and make sure they're on the same page. Instead of requirements, it's a collaborative discussion. Alla Hale, in episode #122, talked about how in every conversation you should have something to show the other party, whether a full prototype or a post-it note with a little drawing. Getting to a clear contract/agreement is far easier if you have a system that defines an owner, defines the parameters you need, and makes the implicit aspects explicit so both parties can fully agree.

One thing Ananth - and Scott - keep running across is stealth data consumers creating one-sided data contracts. Essentially, the consumer has created their consumer-side testing and is consuming, but the data producer has no idea. Or many don't even do the testing/contract work to protect themselves at all. The first the producer hears about the consumption is when something breaks for the consumer. With Schemata, at least there is a contract in place, and stealth data consumers simply have to inherit existing contractual bounds. Scott note: I hate stealth anything in data - let the producer know, or they will potentially make breaking changes that could have been prevented if they were just aware.

According to Ananth, we can really learn a LOT from the DevOps movement - which has become more the platform engineering movement - on the microservices side.
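Picking up the core-domain scoring mentioned above, here is one naive way such feedback could be computed. The heuristic - scoring a proposed entity by the fraction of its fields that reference core-domain entities - is invented for illustration and is not Schemata's actual algorithm:

```python
# Assumed core domain entities for this hypothetical organization.
CORE_DOMAIN_ENTITIES = {"User", "Order", "Product"}

def connectivity_score(new_entity_fields: dict) -> float:
    """Fraction of a proposed entity's fields that reference core entities.

    new_entity_fields maps field name -> referenced entity name,
    or None for plain scalar fields."""
    if not new_entity_fields:
        return 0.0
    linked = sum(1 for ref in new_entity_fields.values()
                 if ref in CORE_DOMAIN_ENTITIES)
    return linked / len(new_entity_fields)

# Non-blocking feedback on a pull request proposing a new "Shipment" event:
proposed = {"shipment_id": None, "order": "Order", "carrier": None}
score = connectivity_score(proposed)
print(f"connectivity: {score:.0%}")  # -> connectivity: 33%
if score < 0.5:
    print("hint: reference more core entities to improve integration")
```

The point is the workflow, not the math: the score surfaces as advisory feedback on the pull request, nudging the author toward better integration with the organizational data landscape before the model ships.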
Back to the platform engineering parallel: if we try to push ownership to domains/data producers without tooling that helps them verify they comply with governance and that things are working okay, that's a lot of extra work on the data producer's end. It's why we are seeing so damn much pushback from domains about owning their data - it's just way too much of an ask. Data producers don't have enough information about what might break when they try to make a change, and that causes unnecessary friction. We need to make both the producer and the consumer more productive, so people can develop and deploy without tons of manual intervention.

Far too many teams are using tooling to solve single problems; while each one-off tool addresses a singular issue, together they create an even more disjointed data management workflow, in Ananth's view. It's easy to focus too much on the spot challenge instead of the overall challenge of data management - the holistic process. Tooling fragmented with the move to cloud, and that made sense as we figured out new approaches and patterns - and VCs were quite free with their money - but we need to think about the whole process as one again now. Zhamak has made this point multiple times as well.

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
Feb 21, 2023 • 15min

#197 Explorers Needed, Experts Not (Yet) - Mesh Musings 44

Sign up for Data Mesh Understanding's free roundtable and introduction programs here: https://landing.datameshunderstanding.com/
All about why we need more explorers in our data mesh implementations and that it's too early for experts :)
Please Rate and Review us on your podcast app of choice!
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
Episode list and links to all available episode transcripts here.
Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
