Martin Jackson, a seasoned expert in deploying configuration management at scale, shares insights from his extensive experience at Walmart. He discusses the challenges of managing infrastructure across 5000 stores, each acting like a mini data center. The conversation explores the evolution of technology at Walmart, highlighting the shift from outdated systems to modern infrastructure. Martin emphasizes the importance of teamwork in deploying new applications and the critical nature of real-time data synchronization for operational efficiency.
The CrowdStrike incident underscored the human impact of IT failures, as disruptions severely affected organizations reliant on Windows systems, particularly in critical environments like hospitals.
The shift from manual processes to automated solutions for configuration management at Walmart illustrates how embracing modern practices can enhance scalability and operational efficiency.
The need for self-service capabilities in infrastructure as code was emphasized, highlighting the importance of collaboration between application teams and central management to prevent inefficiencies.
Deep dives
Impact of the CrowdStrike Incident
The podcast discusses the impact of the CrowdStrike incident and the challenges it created for organizations reliant on Windows systems. Roughly 8.5 million devices were affected, a figure reported as less than 1% of all Windows installs. Hospitals, including Seattle Children's Hospital, had to physically deploy personnel to deal with the issues, reflecting the gravity of the situation in critical environments. The response required not only technical fixes but also attention to the human component, as travelers faced real-life disruptions from canceled flights.
Evolution of Infrastructure Management
The hosts discuss the rapid evolution of infrastructure management over the past decade, comparing past methods with the practices used today. The discussion centers on the deployment of Puppet at Walmart, where the focus shifted from manual processes to automated solutions that improved scalability and efficiency. The hosts reflect on how Walmart became an innovator in infrastructure despite its traditional retail roots. They emphasize the importance of managing legacy systems while adopting new technologies to streamline and modernize existing workflows.
Challenges of Configuration Management
The podcast delves into the challenges of configuration management at Walmart, where a centralized approach was deemed necessary for consistency. That centralization carried risk, however, because a single bad configuration could cause widespread issues across multiple stores. The experience of deploying Puppet at scale showed that the organization is responsible for implementing changes safely and effectively. The accountability introduced by infrastructure as code also highlights the need for comprehensive testing and deployment strategies to prevent catastrophic failures.
Self-Service Infrastructure as Code
Emphasizing the need for self-service within infrastructure as code, the hosts discuss how the centralized Puppet system hindered overall efficiency. Other teams could not contribute their own code without extensive knowledge of the underlying system, which slowed the adoption of Puppet across the organization. This struggle underlined a significant lesson: the importance of involving application teams in configuration management processes. Improved collaboration and streamlined workflows were discussed as opportunities for enhancing operational efficiency.
The Ethical Responsibility of Tech Companies
The conversation wraps up with the ethical responsibilities tech companies face in the wake of incidents like CrowdStrike's. The hosts critically evaluate the consequences of rushing to deploy updates without adequate testing or a staggered rollout model, showing the repercussions of poor decision-making. The podcast connects this situation to broader concerns about the sustainability and environmental impact of tech practices. A consistent theme is the call for greater accountability, transparency, and dialogue between technical teams and business stakeholders to foster better solutions moving forward.
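As a rough illustration of the staggered rollout idea mentioned above, the following Python sketch applies an update in progressively larger waves and halts as soon as a wave reports an unhealthy target. It is not taken from the episode and does not describe how CrowdStrike or Walmart actually deploy changes; the function names (`plan_waves`, `staggered_rollout`), the `apply_update` and `is_healthy` callbacks, the store identifiers, and the wave sizes are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def plan_waves(targets: List[str], wave_sizes: Iterable[int]) -> List[List[str]]:
    """Split targets into successive waves; a final wave takes whatever remains."""
    waves: List[List[str]] = []
    start = 0
    for size in wave_sizes:
        if start >= len(targets):
            break
        waves.append(targets[start:start + size])
        start += size
    if start < len(targets):
        waves.append(targets[start:])
    return waves


def staggered_rollout(
    targets: List[str],
    wave_sizes: Iterable[int],
    apply_update: Callable[[str], None],  # hypothetical: pushes the change to one target
    is_healthy: Callable[[str], bool],    # hypothetical: post-update health check
) -> Tuple[List[str], bool]:
    """Apply an update wave by wave, halting after the first unhealthy wave.

    Returns (targets that received the update, whether the rollout was halted).
    """
    updated: List[str] = []
    for wave in plan_waves(targets, wave_sizes):
        for target in wave:
            apply_update(target)
            updated.append(target)
        # Check the whole wave before widening the blast radius.
        if not all(is_healthy(target) for target in wave):
            return updated, True
    return updated, False


if __name__ == "__main__":
    # Hypothetical fleet of 5,000 stores, rolled out as 1, then 10, then 100, then the rest.
    stores = [f"store-{i:04d}" for i in range(1, 5001)]
    updated, halted = staggered_rollout(
        stores,
        wave_sizes=[1, 10, 100],
        apply_update=lambda store: None,   # stand-in for a real deployment step
        is_healthy=lambda store: False,    # simulate a bad update
    )
    print(len(updated), halted)
```

With a simulated bad update, only the single store in the first wave receives the change before the rollout stops, which is the point of staggering: a faulty change is caught while its reach is still small.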
Deploying new applications can be tough. Deploying configuration management safely at scale with stores around the world is different. Martin Jackson joins us to discuss.