Reliability Enablers

#58 Fixing Monitoring's Bad Signal-to-Noise Ratio

Sep 17, 2024
08:27

Monitoring in the software engineering world continues to grapple with poor signal-to-noise ratios. It’s a challenge that’s been around since the beginning of software development and will persist for years to come.

The core issue is the overwhelming noise from non-essential data, which floods systems with useless alerts.

This interrupts workflows, affects personal time, and even disrupts sleep.

Sebastian dove into this problem, highlighting that the issue isn't just about having meaningless pages but also the struggle to find valuable information amidst the noise.

When legitimate alerts get lost in a sea of irrelevant data, pinpointing the root cause becomes exceptionally hard.

Sebastian proposes a fundamental fix for this data overload: be deliberate with the data you emit.

When instrumenting your systems, be intentional about what data you collect and transport.

Overloading with irrelevant information makes it tough to isolate critical alerts and find the one piece of data that indicates a problem.
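One way to picture "being intentional" at the point of emission is an allowlist in which every field carries a stated reason for existing. The sketch below is only an illustration: the field names, the reasons, and the `build_event` helper are all hypothetical and do not come from the episode.

```python
# Hypothetical allowlist: each field is kept because someone has a use for it.
ALLOWED_FIELDS = {
    "service": "identifies the owning team",
    "route": "maps the event to a user journey",
    "status_code": "feeds the error-ratio alert",
    "duration_ms": "feeds the latency alert",
}


def build_event(raw: dict) -> dict:
    """Keep only the fields that have a documented purpose."""
    return {key: value for key, value in raw.items() if key in ALLOWED_FIELDS}


raw_event = {
    "service": "checkout",
    "route": "/pay",
    "status_code": 500,
    "duration_ms": 812,
    "thread_name": "worker-17",   # no consumer, so it is not emitted
    "debug_blob": "x" * 1000,     # large and unused, so it is not emitted
}
event = build_event(raw_event)
```

Adding a new field then requires writing down why it is needed, which is a small forcing function against emitting data "just in case".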

To combat this, focus on:

* Being Deliberate with Data. Make sure that every piece of telemetry data serves a clear purpose and aligns with your observability goals.

* Filtering Data Effectively. Improve how you filter incoming data to eliminate less relevant information and retain what's crucial.

* Refining Alerts. Tune your alert rules, for example by creating tiered alerts that distinguish critical issues from minor warnings.
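As a rough sketch of filtering and tiered alerts, the snippet below sorts incoming signals into "page", "ticket", or "drop". The `Signal` shape, the tier names, and both thresholds are assumptions made up for illustration; real values would come from your own service-level objectives.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PAGE = "page"      # interrupt a human now
    TICKET = "ticket"  # handle during working hours
    DROP = "drop"      # not worth keeping


@dataclass(frozen=True)
class Signal:
    name: str
    user_facing: bool   # does it affect the user journey?
    error_ratio: float  # fraction of failing requests, 0.0 to 1.0


# Hypothetical thresholds; tune them against your own objectives.
PAGE_THRESHOLD = 0.05
TICKET_THRESHOLD = 0.01


def classify(signal: Signal) -> Tier:
    """Assign a tier so that only severe, user-facing signals page a human."""
    if signal.error_ratio < TICKET_THRESHOLD:
        return Tier.DROP
    if signal.user_facing and signal.error_ratio >= PAGE_THRESHOLD:
        return Tier.PAGE
    return Tier.TICKET


def filter_signals(signals):
    """Discard low-value signals and pair the rest with their tier."""
    tiered = ((signal, classify(signal)) for signal in signals)
    return [(signal, tier) for signal, tier in tiered if tier is not Tier.DROP]
```

With these assumed thresholds, a user-facing signal at an 8% error ratio pages someone, an internal batch job at the same ratio becomes a ticket, and anything under 1% is dropped before it can add noise.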

Dan Ravenstone, who leads platform at Top Hat, discussed “triaging alerts” recently.

He shared that managing millions of alerts, often filled with noise, is a significant issue.

His advice: scrutinize alerts for value, ensuring they meet the criteria of a good alert, and discard those that don’t impact the user journey.

According to Dan, the anatomy of a good alert includes:

* A runbook

* A defined priority level

* A corresponding dashboard

* Consistent labels and tags

* Clear escalation paths and ownership
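The checklist above lends itself to a simple lint step over alert definitions. The following is a minimal sketch that assumes alerts are stored as plain dictionaries; the field names and the example URL are hypothetical and not tied to any particular monitoring tool.

```python
# Hypothetical field names mirroring the checklist above.
REQUIRED_FIELDS = ("runbook", "priority", "dashboard", "labels", "owner", "escalation")


def missing_fields(alert: dict) -> list:
    """Return the required fields that are absent or empty in an alert definition."""
    return [field for field in REQUIRED_FIELDS if not alert.get(field)]


draft_alert = {
    "name": "checkout_error_ratio_high",
    "runbook": "https://wiki.example.com/runbooks/checkout-errors",  # placeholder URL
    "priority": "P1",
    "labels": {"service": "checkout", "team": "payments"},
    "owner": "payments-oncall",
}
gaps = missing_fields(draft_alert)  # ["dashboard", "escalation"]
```

Running a check like this in code review or CI would flag alerts that cannot be acted on before they ever reach a pager.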

To elevate your approach, consider using aggregation and correlation techniques to link otherwise disconnected data, making it easier to uncover patterns and root causes.
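A simple form of correlation is to group alerts that share a label and fire close together in time, so one underlying problem shows up as one group rather than many pages. The sketch below assumes each alert is a dictionary with a numeric `timestamp` in seconds and a `labels` dictionary; the `service` key and the five-minute window are illustrative defaults, not recommendations from the episode.

```python
from collections import defaultdict


def correlate(alerts, key="service", window_seconds=300):
    """Group alerts that share a label value and fire close together in time.

    Returns a list of (label_value, alerts_in_group) tuples.
    """
    by_key = defaultdict(list)
    for alert in alerts:
        by_key[alert["labels"].get(key, "unknown")].append(alert)

    groups = []
    for value, items in by_key.items():
        items.sort(key=lambda a: a["timestamp"])
        current = [items[0]]
        for alert in items[1:]:
            # Start a new group when the gap to the previous alert exceeds the window.
            if alert["timestamp"] - current[-1]["timestamp"] > window_seconds:
                groups.append((value, current))
                current = [alert]
            else:
                current.append(alert)
        groups.append((value, current))
    return groups
```

Three alerts from the same service within a few minutes would collapse into a single group to investigate, while an alert from that service an hour later would start a new one.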

The takeaway is simple: aim for quality over quantity.

By refining your data practices and focusing on what's truly valuable, you can enhance the signal-to-noise ratio, ultimately allowing more time for deep work rather than constantly managing incidents.



