On call work needs to be an extremely high priority for the business. It's a bunch of soft things and a bunch of actual real hard prioritization decisions as well. No one wants to see heroics, but there are times when it's like more stressful or more impactful than others,. When this work is happening.
In this deep-dive episode, Brian Scanlan, Principal Systems Engineer at Intercom, describes how the company’s on-call process works. He explains how the process started and key changes they’ve made over the years, including a new volunteer model, changes to compensation, and more.
Discussion points:
- (1:28) How on-call started at Intercom
- (10:11) Brian’s background and interest in being on-call
- (14:06) Getting engineers motivated to be on-call
- (16:37) Challenges Intercom saw with on-call as it grew
- (19:53) Having too many people on-call
- (23:20) Having alarms that aren’t useful
- (26:03) Recognizing uneven workload with compensation
- (27:22) Initiating changes to the on-call process
- (30:08) Creating a volunteer model
- (33:02) Addressing concerns that volunteers wouldn’t take action on alarms
- (34:40) Equitability in a volunteer model
- (36:36) Expectations of expertise for being on-call
- (40:56) How volunteers sign up
- (44:15) The Incident Commander role
- (46:19) Using code review for changes to alarms
- (50:02) On-call compensation
- (52:50) Other approaches to compensating on-call
- (55:08) Whether other companies should compensate on-call
- (57:32) How Intercom’s on-call process compares to other companies
- (1:00:46) Recent changes to the on-call process
- (1:04:13) Balancing responsiveness and burnout
- (1:07:12) Signals for evaluating the on-call process
Mentions and links: