Reliability engineering is often neglected in the data space, where organizations tend to focus on other aspects of data management. Reliability starts with observability: without the ability to observe a system, it is difficult to identify and address issues. Observability in data goes beyond data quality and covers the systems that store, move, and transform data. Measuring reliability should also account for how critical the data is and what its users need. SLOs (Service Level Objectives) play a crucial role in defining reliability, but rather than copying metrics from software engineering, teams should focus on what matters for data applications and systems.
Bringing SRE (Site Reliability Engineering) thinking into the data ecosystem involves considering data criticality and moving away from binary thinking about data quality, availability, and reliability. It is important to map which data is critical for the business and to understand the needs and expectations of users. Determining SLOs (Service Level Objectives) requires conversations with those users and prioritizing their needs based on return on investment. Keeping measurement simple and focusing on what systems must do to serve customer needs are key considerations. Obstacles include a control-oriented mindset and the economic barriers imposed by cloud providers, but understanding users and fostering communication across natural boundaries can help drive change.
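As a rough illustration of an SLO that is not a binary "good data / bad data" check, the minimal Python sketch below measures what fraction of deliveries of a dataset arrived within an agreed delay. Everything in it is an assumption made for illustration and does not come from the episode: the FreshnessSLO class, the criticality tiers, the dataset name, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FreshnessSLO:
    """Hypothetical freshness objective for one dataset (illustrative only)."""
    dataset: str
    criticality: str          # assumed tiers, e.g. "critical" or "standard"
    max_delay_minutes: float  # a delivery is "good" if it lands within this delay
    target: float             # fraction of deliveries that should be good, e.g. 0.99


def compliance(slo: FreshnessSLO, delays_minutes: list[float]) -> float:
    """Return the fraction of deliveries that met the delay threshold."""
    if not delays_minutes:
        return 1.0  # no deliveries observed, so nothing has violated the objective
    good = sum(1 for delay in delays_minutes if delay <= slo.max_delay_minutes)
    return good / len(delays_minutes)


def meets_slo(slo: FreshnessSLO, delays_minutes: list[float]) -> bool:
    """Compare measured compliance against the agreed target."""
    return compliance(slo, delays_minutes) >= slo.target


# A critical dataset gets a tighter target than a standard one would.
orders = FreshnessSLO("orders", "critical", max_delay_minutes=60, target=0.9)
observed_delays = [10, 20, 30, 40, 50, 55, 59, 60, 61, 120]

print(compliance(orders, observed_delays))
print(meets_slo(orders, observed_delays))
```

The point of the sketch is that the threshold and target are negotiated per dataset with its users, so a less critical dataset could carry a looser target instead of being held to the same bar.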
The lack of good tooling and observability hampers data engineering teams' ability to ensure reliability and scalability in data systems. Many existing systems do not provide easy-to-use hooks for observability and monitoring, making it difficult to identify and address issues. Investing time in better observability, including system logs, identity management, and easy access to metrics, can significantly improve data reliability. Tooling that offers richer metadata collection and distributed systems management would also improve data engineering processes. Stories and quantitative evidence can help bridge the gap and drive change within organizations.
Fostering a reliability mindset in data teams requires addressing organizational barriers and evolving leadership attitudes. Data engineering often operates as a disempowered, centralized team far removed from value generation, leading to silos and disjointed workflows. Breaking down these natural organizational boundaries is essential to bringing stakeholders closer to data systems and enabling data engineers to influence reliability. Organizations need to recognize the value generated by data teams and provide the necessary autonomy and resources to implement better practices. Additionally, decentralizing data management, focusing on metadata mediation across sources, and building a culture of shared responsibility can pave the way for more reliable and scalable data practices.
The industry is gradually moving toward more reliable data systems, driven by growing system complexity and the need to ensure data integrity and availability. This transition requires adopting SRE principles, much as the DevOps movement did, to create a culture of shared responsibility and scalability. Organizations must prioritize observability, change management, and incident response in data engineering so that teams can adapt systems to meet evolving needs. Challenges remain, including the control-oriented mindset and cloud provider lock-in, but technology advancements and emerging practices like data mesh present opportunities for distributed data management and improved reliability across the industry.
Please Rate and Review us on your podcast app of choice!
Get involved with Data Mesh Understanding's free community roundtables and introductions: https://landing.datameshunderstanding.com/
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
Episode list and links to all available episode transcripts here.
Provided as a free resource by Data Mesh Understanding. Get in touch with Scott on LinkedIn if you want to chat data mesh.
Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.
Emily's LinkedIn: https://www.linkedin.com/in/emily-gorcenski-0a3830200/
Amy's LinkedIn: https://www.linkedin.com/in/amytobey/
Alex's LinkedIn: https://www.linkedin.com/in/alex-hidalgo-6823971b7/
Alex's Book Implementing Service Level Objectives: https://www.alex-hidalgo.com/the-slo-book
In this episode, guest host Emily Gorcenski, Head of Data and AI for Thoughtworks Europe (guest of episode #72), facilitated a discussion with Amy Tobey, Senior Principal Engineer at Equinix, and Alex Hidalgo, Principal Reliability Advocate at Nobl9. As per usual, all guests were only reflecting their own views.
The topic for this panel was applying reliability engineering practices to data. This is different from engineering for data reliability, which focuses specifically on data quality.
The overall concept is taking what we've learned from reliability engineering across disciplines, mostly in software and especially SRE (site reliability engineering), and bringing those learnings to data to make data, especially data production and serving, more reliable and scalable. Scott note: this is probably one of the most frustrating topics in data for me because it feels like basic foundational work, yet most organizations aren't tackling it well, if at all. The best starting point for an organization is simple awareness and starting to have reliability engineering conversations around data. You will probably feel like you're behind after listening to this. Everyone is behind on this 😅 Most orgs aren't even doing SRE well, so it's no surprise they struggle to apply it to data.
Scott note: I wanted to share my takeaways rather than trying to reflect the nuance of the panelists' views individually.
Scott's Top Takeaways:
Other Important Takeaways (many touch on similar points from different aspects):
Learn more about Data Mesh Understanding: https://datameshunderstanding.com/about
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
All music used in this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf