Death by Uptime

Dec 8, 2025

Cliff Biffle, a firmware and systems engineer, and Matt Keeter, a hardware debugging expert on the Oxide team, dive into a perplexing issue where multiple service processors became unresponsive. They explore the surprising root cause linked to an Ethernet driver bug, analyzing network behavior and thermal metrics. Cliff explains how a management-counter interrupt triggered repeated task activations, revealing a hidden fault. The discussion wraps up with insights on the broader implications and how such fixes can benefit future projects.

AI Snips

INSIGHT

Partial Network Liveness Reveals Task Starvation

All service processors stopped responding over the management network but still answered ICMP pings.
That implied the network task was still running, while its lower-priority clients and other services were starved or blocked.

ANECDOTE

lbolt And Aggressive Clock Testing

Bryan recounts fixing the classic Unix 248-day lbolt uptime bug early in his career.
He used aggressive clock-rate testing to expose edge cases and even found a chip bug.
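The 248-day figure and the value of running the clock fast can be checked with a little arithmetic. The sketch below assumes the usual explanation of the bug, a signed 32-bit tick counter incremented at 100 Hz; neither number is stated in the snip, and `days_until_wrap` is a hypothetical helper written for illustration.

```rust
/// Days until a signed 32-bit tick counter overflows at `hz` ticks per second.
/// Assumption: the counter starts at zero and counts up to i32::MAX.
fn days_until_wrap(hz: u32) -> f64 {
    const SECONDS_PER_DAY: f64 = 86_400.0;
    (i32::MAX as f64) / (hz as f64) / SECONDS_PER_DAY
}

fn main() {
    // Assumed production tick rate of 100 Hz: roughly 248.55 days.
    println!("100 Hz:     {:.2} days", days_until_wrap(100));
    // Running the clock 1000x faster brings the same wrap within hours.
    println!("100,000 Hz: {:.2} hours", days_until_wrap(100_000) * 24.0);
}
```

Under these assumptions, speeding up the tick rate moves an overflow that would take most of a year into the span of a single test run.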

INSIGHT

Priority IPC And Net Task Can Dominate CPU

Hubris uses priority-ordered IPC and a separate network task that wins the CPU whenever it is runnable.
A runaway network task or an unacknowledged interrupt can monopolize the CPU and block clients such as IPCC.
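The starvation effect described above can be illustrated with a toy strict-priority scheduler. This is a minimal sketch, not Hubris code: the `Task` struct, the `pick_next` and `simulate` functions, the two-task layout, and the "interrupt every fourth tick" rate are all assumptions made for the example.

```rust
/// Hypothetical task record; lower `priority` value means more important.
struct Task {
    name: &'static str,
    priority: u8,
    runnable: bool,
}

/// Strict priority: always pick the most important runnable task.
fn pick_next(tasks: &[Task]) -> Option<usize> {
    tasks
        .iter()
        .enumerate()
        .filter(|(_, t)| t.runnable)
        .min_by_key(|(_, t)| t.priority)
        .map(|(i, _)| i)
}

/// Run `ticks` scheduling decisions and count how often each task ran.
/// With `irq_stuck`, the net task's interrupt is re-asserted on every tick,
/// modelling an interrupt that is never acknowledged.
fn simulate(ticks: usize, irq_stuck: bool) -> [usize; 2] {
    let mut tasks = [
        Task { name: "net", priority: 1, runnable: false },
        Task { name: "client", priority: 2, runnable: true },
    ];
    let mut runs = [0usize; 2];
    for tick in 0..ticks {
        // Healthy case (assumed): the interrupt fires every fourth tick.
        if irq_stuck || tick % 4 == 0 {
            tasks[0].runnable = true;
        }
        if let Some(i) = pick_next(&tasks) {
            runs[i] += 1;
            // The net task handles its notification and blocks again;
            // the client always has more work and stays runnable.
            if i == 0 {
                tasks[0].runnable = false;
            }
        }
    }
    println!(
        "irq_stuck={}: {}={} {}={}",
        irq_stuck, tasks[0].name, runs[0], tasks[1].name, runs[1]
    );
    runs
}

fn main() {
    simulate(100, false); // the client gets most of the CPU
    simulate(100, true); // the client never runs
}
```

In the healthy run the lower-priority client gets most of the ticks; once the interrupt is stuck, the net task is runnable at every decision point and the client is never scheduled, which matches the symptom of a network stack that stays alive while everything behind it goes silent.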
