Charity Majors, CEO and co-founder of Honeycomb, dives into the transformative power of observability in software engineering. She advocates for minimizing complexity by consolidating multiple observability tools into a single source with rich context. This shift not only cuts costs but also accelerates debugging and enhances understanding of distributed systems. Additionally, Charity discusses the need for developers to be on call for their own code, creating a direct feedback loop that fosters reliability. The conversation spans innovative engineering practices and the evolving role of AI in observability.
01:05:52
ADVICE
Trust Production as Truth
Treat production as the ultimate source of truth, not your IDE or tests.
Instrument your code early to understand its real behavior in production.
INSIGHT
Power of Wide Structured Logs
Wide structured logs with rich context replace fragmented logs, enabling fast problem correlation.
Correlating multiple dimensions helps find issues in minutes, not days.
INSIGHT
Everything is Trace-Shaped
The growing complexity of systems makes observability trace-shaped by nature.
AI observability is part of software observability, all fundamentally trace-shaped problems.
Sarah Drasner's "Engineering Management for the Rest of Us" offers practical advice for engineering managers, focusing on building strong teams and fostering a positive work environment. The book emphasizes the importance of understanding personal and team values, promoting transparency, and creating a culture of accountability. It provides actionable strategies for managing engineers effectively, addressing common challenges, and navigating the complexities of modern software development. Drasner's insights are particularly relevant for managers who are new to the role or those seeking to improve their management skills. The book's focus on empathy and understanding makes it a valuable resource for anyone in a leadership position.
Brian Klaas's "Fluke: Chance, Chaos, and Why Everything We Do Matters"
Nicolay here,
Today I have the chance to talk to Charity Majors, CEO and co-founder of Honeycomb, who has recently been writing about the cost crisis in observability.
"Your source of truth is production, not your IDE - and if you can't understand your code there, you're flying blind."
The key insight is architecturally simple but operationally transformative: replace your 10-20 observability tools with wide structured events that capture everything about a request in one place. Most teams store the same request data across metrics, logs, traces, APM, and error tracking - creating a 20X cost multiplier while making debugging nearly impossible because you're reconstructing stories from fragments.
Charity's approach flips this: instrument once with rich context, derive everything else from that single source. This isn't just about cost - it's about giving engineers the connective tissue to understand distributed systems. When you can correlate "all requests failing from Android version X in region Y using language pack Z," you find problems in minutes instead of days.
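As a rough illustration of "instrument once with rich context", the sketch below accumulates everything known about a request into one wide event and emits it as a single JSON line. The `WideEvent` class, the `handle_checkout` handler, and every field name are hypothetical and chosen only for illustration; this is not Honeycomb's SDK or API.

```python
import json
import time
import uuid


class WideEvent:
    """Accumulates context for one unit of work and emits it as a single record."""

    def __init__(self, service, name):
        # Hypothetical base fields; a real system would propagate the trace id.
        self.fields = {
            "service": service,
            "name": name,
            "trace_id": uuid.uuid4().hex,
        }
        self._start = time.monotonic()

    def add(self, **context):
        """Attach more dimensions as they become known during the request."""
        self.fields.update(context)

    def finish(self, status):
        """Record outcome and duration, then return one JSON line."""
        self.fields["status"] = status
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        return json.dumps(self.fields, sort_keys=True)


def handle_checkout(request):
    """Hypothetical handler: one event per request, not scattered log lines."""
    event = WideEvent(service="checkout", name="POST /checkout")
    event.add(
        user_id=request["user_id"],
        os_version=request["os_version"],
        region=request["region"],
        language_pack=request["language_pack"],
    )
    try:
        if request["cart_items"] == 0:
            raise ValueError("empty cart")
        event.add(cart_items=request["cart_items"])
        return event.finish(status="ok")
    except ValueError as exc:
        # The error lands on the same event as the rest of the context.
        event.add(error=str(exc))
        return event.finish(status="error")
```

Because the outcome, the duration, and every dimension sit on the same record, later questions (error rate by region, latency by OS version) can be answered from this one dataset rather than stitched together from separate tools.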
The second key insight is putting developers on call for their own code. This creates the tight feedback loop that makes engineers write more reliable software - because nobody wants to get paged at 3am for their own bugs.
In the podcast, we also touch on:
Why deploy time is the foundational feedback loop (15 minutes vs 15 hours changes everything)
The controversial "developers on call" stance and why ops people rarely found companies
How microservices made everything trace-shaped and killed traditional metrics approaches
The "normal engineer" philosophy - building for 4am debugging, not peak performance
AI making "code of unknown quality" the new normal
Wide Structured Events: Capturing all request context in one instrumentation event instead of scattered log lines - enables correlation analysis that's impossible with fragmented data.
Observability 2.0: Moving from metrics-as-workhorse to structured-data-as-workhorse, where you instrument once and derive metrics/alerts/dashboards from the same rich dataset.
SLO-based Alerting: Replacing symptom alerts (CPU, memory, disk) with customer-impact alerts that measure whether you're meeting promises to users.
Progressive Deployment: Gradual rollout through staged environments (kibble → dogfood → production) that builds confidence without requiring 2X infrastructure.
Trace-shaped Systems: Architecture pattern recognizing that distributed systems problems are fundamentally about correlating events across time and services, not isolated metrics.
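To make the "instrument once, derive alerts from the same dataset" and SLO-based alerting ideas above more concrete, here is a minimal sketch that computes a service level indicator and error budget consumption directly from wide events. The `SLO` dataclass, the `status` and `duration_ms` field names, and the alert-when-the-budget-is-spent rule are all assumptions for illustration, not a description of any particular product.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    target: float                # e.g. 0.99 means 99% of requests must be good
    latency_threshold_ms: float  # slower requests count as bad for the user


def is_good(event, slo):
    """A request is good if it succeeded and was fast enough for the user."""
    return event["status"] == "ok" and event["duration_ms"] <= slo.latency_threshold_ms


def error_budget_report(events, slo):
    """Derive the SLI and error budget consumption from raw wide events."""
    total = len(events)
    if total == 0:
        return {"sli": None, "budget_consumed": 0.0, "alert": False}

    bad = sum(1 for e in events if not is_good(e, slo))
    sli = (total - bad) / total

    # The error budget is the number of bad requests the target tolerates.
    allowed_bad = (1 - slo.target) * total
    if allowed_bad > 0:
        budget_consumed = bad / allowed_bad
    else:
        budget_consumed = float("inf") if bad else 0.0

    # Alert on customer impact (budget spent), not on CPU, memory, or disk.
    return {
        "sli": sli,
        "budget_consumed": budget_consumed,
        "alert": budget_consumed >= 1.0,
    }
```

The alert here is driven only by whether users got what they were promised, which is the contrast with symptom alerts drawn in the list above.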
Gateway Drug to Engineering: [01:04] How IRC and bash tab completion sparked Charity's fascination with Unix command line possibilities
ADHD and Incident Response: [01:54] Why high-pressure outages brought out her best work - getting "dead calm" when everything's broken
Code vs. Production Reality: [02:56] Evolution from focusing on code beauty to understanding performance, behavior, and maintenance over time
The Alexander's Horse Principle: [04:49] Auto-deployment as daily practice - if you grow up deploying constantly, it feels natural by the time you scale
Production as Source of Truth: [06:32] Why your IDE output doesn't matter if you can't understand your code's intersection with infrastructure and users
The Logging Evolution: [08:03] Moving from debugger-style spam logs to fewer, wider structured events oriented around units of work
Bubble Up Anomaly Detection: [10:27] How correlating dimensions reveals that failures cluster around specific Android versions, regions, and feature combinations
Everything is Trace-Shaped: [12:45] Why microservices complexity is about locating problems in distributed systems, not just identifying them
AI as Acceleration of Automation: [15:57] Most AI panic would read the same with "automation" swapped in for "AI" - it's the same pattern, just with faster feedback loops
Non-determinism as Genuinely New: [16:51] The one aspect of AI that's actually novel in software systems, requiring new architectural patterns
The Cost Crisis: [22:30] How 10-20 observability tools create unsustainable cost multipliers as businesses scale
SLO Revolution: [28:40] Deleting 90% of alerts by focusing on customer impact instead of system symptoms
Shrinking Feedback Loops: [34:28] Keeping deploy-to-validation under one hour so engineers can connect actions to outcomes
Normal Engineer Design: [38:12] Building systems that work for tired humans at 4am, not just heroes during business hours
The Instrumentation Habit: [23:15] Always looking at your code in production after deployment to build informed instincts about system behavior
Progressive Deployment Strategy: [36:43] Kibble → Dogfood → Production pipeline for gradual confidence building
Real Engineering Bar: [49:00] Discussion on what actually makes exceptional vs normal engineers
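The dimension correlation described in the "Bubble Up Anomaly Detection" item above can be approximated with a simple comparison: for each dimension value, how much more common is it among failing events than among healthy ones? The sketch below is a simplified stand-in written for illustration, not Honeycomb's actual algorithm; the function name `rank_dimensions` and the event field names are assumptions.

```python
from collections import Counter


def rank_dimensions(events, is_failure, dimensions):
    """Rank (dimension, value) pairs by over-representation among failures.

    Returns a list of (score, dimension, value) tuples, highest score first,
    where score = share among failures - share among healthy events.
    """
    failures = [e for e in events if is_failure(e)]
    baseline = [e for e in events if not is_failure(e)]
    if not failures or not baseline:
        return []

    ranked = []
    for dim in dimensions:
        fail_counts = Counter(e.get(dim) for e in failures)
        base_counts = Counter(e.get(dim) for e in baseline)
        for value, count in fail_counts.items():
            fail_share = count / len(failures)
            base_share = base_counts.get(value, 0) / len(baseline)
            ranked.append((fail_share - base_share, dim, value))

    # Values that dominate failures but are rare in the baseline float up.
    ranked.sort(key=lambda row: row[0], reverse=True)
    return ranked
```

This only works when each event carries all of its dimensions together, which is why wide events matter: with the OS version in one tool and the region in another, there is nothing to group by.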
🛠️ Tools & Tech Mentioned
Honeycomb - Observability platform for structured events