Oxide and Friends

Death by Uptime

Dec 8, 2025
Cliff Biffle, a firmware and systems engineer, and Matt Keeter, a hardware debugging expert on the Oxide team, dive into a perplexing issue where multiple service processors became unresponsive. They explore the surprising root cause linked to an Ethernet driver bug, analyzing network behavior and thermal metrics. Cliff explains how a management-counter interrupt triggered repeated task activations, revealing a hidden fault. The discussion wraps up with insights on the broader implications and how such fixes can benefit future projects.
INSIGHT

Partial Network Liveness Reveals Task Starvation

  • All service processors stopped responding over the management network but still answered ICMP pings.
  • That implied the network task itself was still running, while the clients and services that depend on it were starved or blocked.
ANECDOTE

lbolt And Aggressive Clock Testing

  • Bryan recounts fixing the classic Unix 248-day lbolt uptime bug early in his career.
  • He used aggressive clock-rate testing to expose edge cases and even found a chip bug.
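The 248-day figure can be checked with a little arithmetic. The sketch below assumes a signed 32-bit tick counter incremented at 100 Hz, which is the usual explanation for that number; the episode summary does not state these parameters, and the function name is only illustrative. It also shows why running the clock faster helps testing: the overflow arrives sooner.

```python
SECONDS_PER_DAY = 86_400

def days_until_overflow(hz=100, bits=32, signed=True):
    """Days until a tick counter of the given width wraps at a tick rate of hz."""
    # A signed counter runs out of positive values after 2**(bits - 1) ticks.
    max_ticks = 2 ** (bits - 1 if signed else bits)
    return max_ticks / hz / SECONDS_PER_DAY

if __name__ == "__main__":
    # Assumed default: signed 32-bit counter at 100 Hz.
    print(f"{days_until_overflow():.2f} days at 100 Hz")
    # Raising the tick rate 1000x shrinks the wait from months to hours.
    print(f"{days_until_overflow(hz=100_000) * 24:.2f} hours at 100 kHz")
```

Under these assumptions the counter wraps after roughly 248.55 days, and a 1000x faster clock brings the same wrap within about six hours of test time.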
INSIGHT

Priority IPC And Net Task Can Dominate CPU

  • Hubris uses priority-ordered IPC and a separate network task that wins the CPU whenever it is runnable.
  • A runaway network task or an unacknowledged interrupt can therefore monopolize the CPU and block clients like IPCC.
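The starvation described above can be illustrated with a toy model of strict priority scheduling. This is not Hubris kernel code: the task names, the tick loop, and the `irq_acknowledged` flag are all hypothetical, and the model assumes that a lower number means a more important task.

```python
def simulate(ticks, irq_acknowledged):
    """Return how many ticks each task ran under strict priority scheduling."""
    # Hypothetical priorities: lower number = more important.
    priority = {"net": 0, "client": 1}
    ran = {name: 0 for name in priority}
    irq_pending = True  # one interrupt fires before the first tick

    for _ in range(ticks):
        runnable = ["client"]  # the client always has work to do
        if irq_pending:
            runnable.append("net")  # a pending interrupt keeps net runnable
        # Strict priority: the most important runnable task always wins.
        chosen = min(runnable, key=priority.__getitem__)
        ran[chosen] += 1
        if chosen == "net" and irq_acknowledged:
            irq_pending = False  # the handler clears the interrupt source

    return ran

if __name__ == "__main__":
    print(simulate(10, irq_acknowledged=True))
    print(simulate(10, irq_acknowledged=False))
```

With the interrupt acknowledged, the network task runs once and the client gets every remaining tick. With it left pending, the network task is chosen on every tick and the client never runs, which matches the symptom of a system that is alive at the network layer but unresponsive above it.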