This conversation features Pete and Chris, co-founders of incident.io, who specialize in building successful on-call teams. They discuss the critical role of on-call support in tech startups and share insights on creating a compassionate on-call culture. Key topics include who should be on-call, the dangers of relying on 'on-call heroes,' and the importance of fair compensation practices. They also introduce strategies to motivate engineers and discuss effective onboarding processes, making on-call responsibilities more manageable and inclusive.
Decentralizing the on-call responsibility enables faster incident resolution and fosters a deeper understanding of systems among engineers.
Building a compassionate on-call culture requires recognizing personal circumstances and transforming on-call duties into growth opportunities.
Compensation for on-call duties, such as additional pay or time off in lieu, plays a crucial role in maintaining team morale.
Deep dives
Understanding On-Call Responsibilities
On-call management is essential for services that require continuous online availability, such as banking and e-commerce. When regular hours conclude, having a support structure in place becomes vital to maintain operations, involving a group of designated individuals responsible for monitoring and addressing system issues. These individuals typically follow a rotation schedule, ensuring that someone is always available to respond to alerts. A well-defined escalation policy further enhances this framework by detailing how to reach additional team members if the primary contact is unavailable.
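As a rough illustration of the rotation and escalation ideas above, here is a minimal sketch. It assumes weekly round-robin shifts and an escalation policy that pages the next people in the rota. The names, start date, and shift length are all hypothetical and do not come from the episode.

```python
from datetime import date
from typing import List, Optional, Set

# Hypothetical rota; the names and dates are illustrative only.
ROTATION = ["alice", "bob", "carol"]
ROTATION_START = date(2024, 1, 1)  # assumed first day of the first shift
SHIFT_DAYS = 7                     # assumed weekly shifts


def on_call_for(day: date) -> str:
    """Return the primary responder for a given day under a round-robin rota."""
    shifts_elapsed = (day - ROTATION_START).days // SHIFT_DAYS
    return ROTATION[shifts_elapsed % len(ROTATION)]


def escalation_chain(day: date, levels: int = 1) -> List[str]:
    """Return the primary, followed by `levels` backups taken from the rota."""
    shifts_elapsed = (day - ROTATION_START).days // SHIFT_DAYS
    return [
        ROTATION[(shifts_elapsed + offset) % len(ROTATION)]
        for offset in range(levels + 1)
    ]


def first_responder(chain: List[str], available: Set[str]) -> Optional[str]:
    """Walk the chain in order and return the first person who can respond."""
    for person in chain:
        if person in available:
            return person
    return None  # nobody reachable; a real policy would need a final fallback
```

In this sketch, if the primary does not respond, the alert simply moves to the next person in the chain. Real escalation policies usually add timeouts and repeat notifications, which are left out here.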
Building Commitment to On-Call Culture
Engaging the entire engineering team in the on-call rotation fosters a deeper connection between employees and the systems they manage. Engineers who experience both the successes and pitfalls of their deployments through on-call duties gain invaluable feedback that can improve their work. Strategies to encourage participation include making on-call assignments approachable and acknowledging personal circumstances, which creates an inclusive environment. It’s important to strike a balance so that being on-call is seen as a professional growth opportunity rather than a burdensome task.
Philosophies for Effective On-Call Management
Decentralizing on-call responsibility can improve operational efficiency and tighten feedback cycles within teams. Companies often start with a central on-call team managing incidents but may find that distributing the responsibility leads to quicker resolutions and a better understanding of system components. At Monzo, for instance, the organization transitioned from a centralized model to having individual teams responsible for specific alerts, which significantly reduced incident resolution times. This transition highlights the value of engineers understanding the systems they are responsible for and sharing accountability for incidents.
Compensation and Incentives for On-Call Duties
The topic of on-call compensation often generates lively discussions within tech communities, with varied practices observed across different regions. In Europe, it’s common for employees to receive additional pay for on-call duties, while in the U.S., this is less frequent. Offering a flat rate for being on-call acknowledges the inconvenience of potential interruptions and creates a more balanced workplace dynamic. Additionally, implementing time off in lieu for hours worked during unexpected incidents can help mitigate the emotional burden of these responsibilities.
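To make the flat-rate-plus-time-off-in-lieu idea concrete, here is a small arithmetic sketch. The daily rate and the one-to-one lieu ratio are made-up defaults chosen for illustration, not figures mentioned in the episode.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Shift:
    days_on_call: int            # days spent carrying the pager
    out_of_hours_worked: float   # hours actually spent on incidents out of hours


def on_call_compensation(
    shift: Shift,
    flat_rate_per_day: float = 50.0,  # hypothetical flat rate
    lieu_ratio: float = 1.0,          # hypothetical: one hour off per hour worked
) -> Tuple[float, float]:
    """Return (flat pay, hours of time off in lieu) for one shift."""
    flat_pay = shift.days_on_call * flat_rate_per_day
    lieu_hours = shift.out_of_hours_worked * lieu_ratio
    return flat_pay, lieu_hours
```

The flat rate is paid whether or not anything breaks, which reflects the inconvenience of being available. The lieu hours scale only with time actually spent on incidents.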
Operationalizing an Effective On-Call Team
To operationalize an on-call team effectively, thorough and supportive onboarding is crucial. New team members should initially shadow more experienced colleagues during working hours to learn the ropes in a low-pressure environment before transitioning to overnight responsibilities. Encouraging the declaration of incidents, even for minor issues, promotes a culture of accountability and learning. Lastly, practices such as game days, where teams run simulated incident responses, give team members the experience and confidence to tackle real scenarios when they arise.
In this podcast, our panellists discuss what it means to build a successful on-call team. Drawing on their experiences at fast-growing start-ups and scale-ups, incident.io co-founders Pete and Chris cover everything from who should be on the rota and how to build a compassionate on-call culture, to compensation structures and tips for operationalising on-call.
In this episode, we cover:
(04:07) What is on-call and why is it important?
(06:25) Who should be on-call?
(09:13) Should all teams be responsible for their own on-call, or should there be a dedicated team?
(12:59) How can you build a compassionate on-call culture?