88: What Is Data Observability? With Tristan Spaulding of Acceldata
May 25, 2022
Tristan Spaulding, data observability expert, discusses updating old technology, defining data observability, handling incidents, and early symptoms of data drift in an entertaining podcast conversation.
Data observability provides insights into the distribution and behavior of data systems, allowing for deep analysis of data pipelines and optimization of system performance.
Detecting data drift early is crucial for maintaining the performance of machine learning models and data pipelines, and data observability can help by monitoring data distribution throughout the pipeline and alerting to any significant drift.
Data observability and data governance are closely connected, as both aim to ensure the trustworthiness and reliability of data, and integrating observability into data governance practices allows for better automation, policy creation, and data management across the entire data stack.
Deep dives
Understanding Data Observability
Data observability is the practice of understanding the internal state of data systems and being able to quickly identify the cause of any issues. It goes beyond traditional monitoring by providing insights into the distribution and behavior of data. This is particularly important in modern data use cases where data is used to drive products and services, and any issues can result in loss of business. Observability allows for deep analysis of data pipelines, helping to pinpoint bottlenecks, identify drift in data distribution, and optimize system performance.
The Challenges of Data Drift
Data drift refers to changes in the distribution of data over time, which can degrade the performance of machine learning models and data pipelines. It is often a silent problem, discovered only once it has become a major issue. Detecting drift early is crucial, yet alerts typically fire only at the final stage, on the model itself, so problems that arise earlier in the data pipeline go unnoticed. Data observability can address this challenge by monitoring the distribution of data throughout the pipeline and alerting on any significant drift, allowing for proactive response and optimization.
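As a rough illustration of monitoring distributions at every pipeline stage rather than only at the model, the sketch below compares a baseline sample with a current sample per stage using the Population Stability Index (PSI). The episode does not name a specific metric, so PSI, the function names, the stage names, and the 0.2 threshold (a common rule of thumb) are all assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Return the PSI between two non-empty numeric samples.

    Bin edges are derived from the baseline; current values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    # Fall back to a width of 1.0 when the baseline is constant.
    width = (hi - lo) / bins or 1.0

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    p = fractions(baseline)
    q = fractions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def drifted_stages(
    baseline_by_stage: Dict[str, Sequence[float]],
    current_by_stage: Dict[str, Sequence[float]],
    threshold: float = 0.2,  # assumed rule-of-thumb cutoff, tune per dataset
) -> List[str]:
    """List the stages whose PSI exceeds the threshold, in baseline order."""
    return [
        stage
        for stage, baseline in baseline_by_stage.items()
        if population_stability_index(baseline, current_by_stage[stage]) > threshold
    ]
```

Running a check like this on each intermediate stage (for example, hypothetical "ingest" and "features" stages) would surface a shift where it first appears instead of waiting for model metrics to degrade.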
Bringing Together Data Observability and Data Governance
Data observability and data governance are closely connected, as both aim to ensure the trustworthiness and reliability of data. While data governance traditionally focuses on establishing policies, classifications, and data catalogs, observability provides a dynamic and performance-focused lens into the internal state of data systems. By integrating observability into data governance practices, organizations can gain insights into data drift, data quality issues, and the impact on system performance. This integration allows for better automation, intelligent policy creation, and improved data management across the entire data stack.
The Importance of Data Observability for Data Pipelines
Data observability plays a crucial role in data pipelines. There are two main entry points for implementing it: connecting to a data source or instrumenting the pipeline itself. Connecting to a data source involves analyzing the compute layer and the data itself, profiling distributions and looking for anomalies. Instrumenting the pipeline allows for monitoring and gathering information on data flow, query statistics, and database load. In practice, implementation means defining tests or quality checks, automating them wherever possible, and customizing rules for specific columns when necessary. The primary users of data observability tools are typically the data engineers responsible for maintaining and monitoring complex pipelines. However, the user base can expand to include machine learning engineers, data scientists, and analytics engineers.
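To make "defining tests or quality checks" and "customizing rules or columns" more concrete, here is a minimal sketch of column-level checks run against a batch of rows. It does not reflect any particular vendor's API; the Check class, run_checks function, column name, and tolerance values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass
class Check:
    """A named rule applied to one column of every row."""
    name: str
    column: str
    predicate: Callable[[Any], bool]  # returns True when the value is acceptable
    max_failure_rate: float = 0.0     # tolerated fraction of failing rows


def run_checks(rows: Iterable[Dict[str, Any]], checks: List[Check]) -> Dict[str, bool]:
    """Return a mapping of check name to pass/fail for the given batch."""
    batch = list(rows)
    results: Dict[str, bool] = {}
    for check in checks:
        failures = sum(
            1 for row in batch if not check.predicate(row.get(check.column))
        )
        rate = failures / len(batch) if batch else 0.0
        results[check.name] = rate <= check.max_failure_rate
    return results


# Example batch and rules for a hypothetical "amount" column.
rows = [{"amount": 10}, {"amount": None}, {"amount": -5}, {"amount": 3}]
checks = [
    # Tolerate up to 25% missing values.
    Check("amount_not_null", "amount", lambda v: v is not None, max_failure_rate=0.25),
    # Nulls are handled by the rule above, so only flag negative numbers here.
    Check("amount_non_negative", "amount", lambda v: v is None or v >= 0),
]
results = run_checks(rows, checks)
```

Keeping each rule as data (a name, a column, a predicate, and a tolerance) is one way to let generic checks be automated while still allowing per-column customization.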
The Challenges of Tuning Data Engines and the Role of Expertise
Tuning data engines and optimizing their performance is a complex and ongoing challenge for data engineers and MLOps professionals. With so many data processing platforms available, choosing the right engine for a specific use case can be overwhelming. Factors to consider include familiarity with the engine, customization options, and cost. While some commercial solutions offer automated tuning, the need for control and customization may drive organizations to explore other options. Expertise in different data engines and platforms helps in navigating the complexities of tuning, but striking a balance between performance optimization and cost efficiency remains difficult. The evolving landscape of data processing platforms presents both opportunities and difficulties, ultimately requiring data engineers to carefully evaluate and choose the most suitable engines for their infrastructure.
The primary user of a data observability tool (29:56)
Handling an incident (33:01)
Why multipliers for data observability (37:06)
Early symptoms of a data drift (43:12)
Tuning in the context of data engineering (50:11)
What keeps Tristan working with data (55:12)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.