John Allspaw, a pioneer in resilience engineering known for his impactful work at Etsy and in the DevOps movement, dives into crucial discussions about safety in software. He addresses how traditional safety concepts clash with the realities of software development. The conversation highlights the necessity of psychological safety for innovation and explores the role of narrative control in how software is perceived. It also examines the dynamics of change management in development, including the risks associated with code freezes and the value of open communication during deployments.
Code freezes can heighten developer anxiety and lead to mistakes, suggesting a need for a more flexible 'slush' model.
Emphasizing psychological safety in software engineering encourages open discussions about risks, promoting a culture of resilience and continuous improvement.
Deep dives
The Impact of Code Freeze Practices
The discussion centers on the timing of code freezes in the software industry, especially as the holiday season approaches. Code freezes are often implemented to prevent changes during critical periods, but they can create problems by building up a backlog of changes, which leads to a spike in activity both before and after the freeze. Developers feel pressured to push changes through quickly, and the resulting anxiety can lead to mistakes and increased risk. As an alternative to code freezes, the speakers suggest a 'slush' model: changes are still allowed, but each one is considered carefully rather than stalled outright, which fosters a more proactive and calculated approach.
Understanding Safety in Software Engineering
The importance of safety in software engineering is emphasized, particularly as the industry continues to grow in complexity. Safety is not just about avoiding negative events; it also includes creating an environment conducive to successful and resilient operations. The speakers give examples illustrating that even companies like Netflix and Spotify can face serious implications if their software issues have broader consequences, such as outages that affect societal functions. This broader perspective on safety challenges traditional views and highlights the need for a culture of psychological safety among engineers, one that promotes open discussion about risks and improvements.
The Interconnectedness of Software Systems
The complexities of interconnected software systems are discussed, noting that operators often lack an understanding of how their work impacts larger systems. This disconnect can lead to unintended consequences when teams are unaware of the broader operational landscape, increasing the potential for significant failures. The conversation points out that while software engineers must trust their instincts and follow established processes, the complexity of systems means they must also be prepared for the unpredictability of outcomes. This reinforces the need for collaborative environments where concerns can be shared openly among team members, thereby fostering a culture of safety and awareness.
The Importance of Continuous Improvement
A critical moment in the conversation highlights the necessity of continuous improvement within organizations to maintain effective operational practices. Reflecting on past experiences where code freezes caused renewed cycles of development and broken workflows, the speakers stress that the approach to engineering should prioritize flexibility and learning from mistakes. By allowing room for experimentation and discussions about past errors, organizations can cultivate better workflows that lead to increased confidence in deployment. Throughout the episode, participants agree that recognizing the value of change, rather than restricting it, is vital for ensuring long-term stability and resilience in software development.
We talk to John Allspaw, the pioneer of resilience engineering in the software world, about how he discovered this field, and together we answer a reader question: does software need safety?
Correction: we thought this would be episode 3, but it ended up being 2, because of scheduling conflicts with guests...