Transcript
Jason: — hilarious or stupid?
Mandi: [laugh]. I heard that; I listened to the J. Paul Reed episode and I was like, “Oh, there’s, like, a little, like, cold intro.” And I’m like, “Oh, okay.”
Jason: Welcome to Break Things on Purpose, a podcast about reliability and learning from failure. In this episode, we take a trip down memory lane with Mandi Walls to discuss how much technology, reliability practices, and chaos engineering have evolved over her extensive career in technology.
Jason: Everybody, welcome to the show, Julie Gunderson, who recently joined Gremlin on the developer advocacy team. How’s it going, Julie?
Julie: Great, Jason. Really excited to be here.
Jason: So, Mandi is actually a guest of yours. I mean, we both have been friends with Mandi for quite a while but you had the wonderful opportunity of working with Mandi.
Julie: I did, and I was really excited to have her on our podcast now as we ran a podcast together at PagerDuty when we worked there. Mandi has such a wealth of knowledge that I thought we should have her share it with the world.
Mandi: Oh, no. Okay.
Julie: [laugh].
Jason: “Oh, no?” Well, in that case, Mandi, why don’t you—
Mandi: [crosstalk 00:01:28]. I don’t know.
Jason: Well, in that case with that, “Oh no,” let’s have Mandi introduce herself. [laugh].
Mandi: Yeah hi. So, thanks for having me. I am Mandi Walls. I am currently a DevOps advocate at PagerDuty, Julie’s last place of employment before she left us to join Jason at Gremlin.
Julie: And Mandi, we worked on quite a few things over at PagerDuty. We actually worked on things together, joint projects between Gremlin, when it was just Jason and us where we would run joint workshops to talk about chaos engineering and actually how you can practice your incident response. And I’m sure we’ll get to that a little bit later in the episode, but will you kick us off with your background so everybody knows why we’re so excited to talk to you today?
Mandi: Oh, goodness. Well, so I feel like I’ve been around forever. [laugh]. Prior to joining PagerDuty, I spent eight-and-a-half years at Chef Software, doing all kinds of things there, so if I ever trained you on Chef, I hope it was good.
Prior to joining Chef, I was a systems administrator for AOL.com and a bunch of other platforms and sites at AOL for a long time. So, things like Moviefone, and the AOL Sports Channel, and dotcom, and all kinds of things. Most of them ran on one big platform because the monolith was a thing. So yeah, my background is largely in operations, and just systems administration on that side.
Jason: I’m laughing in the background because you mentioned Moviefone, and whenever I think of Moviefone, I think of the Seinfeld episode where Kramer decides to make a Moviefone competitor, and it’s literally just his own phone number, and people call up and he pretends to be that, like, robotic voice and has people, like, hit numbers for which movie they want to see and hear the times that it’s playing. Gives a new meaning to the term on-call.
Mandi: Indeed. Yes, absolutely.
Julie: And I’m laughing just because I recently watched Hackers and, you know, they needed that AOL.com disc.
Mandi: That’s one of my favorite movies. Like, it’s so ridiculous, but also has so many gems of just complete nonsense in it. Absolutely love Hackers. “Hack the planet.”
Julie: “Hack the planet.” So, with hacking the planet, Mandi, and your time working at AOL with the monolith, let’s talk a little bit because you’re in the incident business right now over at PagerDuty, but let’s talk about the before times, the before we practiced Chaos Engineering and before we really started thinking about reliability. What was it like?
Mandi: Yeah, so I’ll call this the Dark Ages, right? So before the Enlightenment. And, like, for folks listening at home, [laugh] the timeline here is probably—so between two-thousand-and-fi—four, five, and 2011. So, right before the beginning of cloud, right before the beginning of, like, Infrastructure as Code, and DevOps and all those things that kind of started at, like, the end of my tenure at AOL. So, before that, right—so in that time period, right, like, the web was, it wasn’t like it was just getting started, but, like, the Web 2.0 moniker was just kind of getting a grip, where you were going from the sort of generic sites like Yahoo and Yellow Pages and those kinds of things and AOL.com, which was kind of a collection of different community bits and news and things like that, into more personalized experiences, right?
So, we had a lot of hook up with the accounts on the AOL side, and you could personalize all of your stuff, and read your email and do all those things, but the sophistication of the systems that we were running was such that like, I mean, good luck, right? It was migration from commercial Unixes into Linux during that era, right? So, looking at when I first joined AOL, there were a bunch of Solaris boxes, and some SGIs, and some other weird stuff in the data center. You’re like, good luck on all that. And we migrated most of those platforms onto Linux at that time; 64 bit. Hurray.
At least I caught that. And there was an increase in the use of open-source software for big commercial ventures, right, and so less of a reliance on commercial software and COTS solutions for things, although we did have some very interesting commercial web servers that—God help them, they were there, but were not a joy, exactly, to work on because the goals were different, right? That time period was a huge acceleration. It was like a Cambrian explosion of software pieces, and tools, and improvements, and metrics, and monitoring, and all that stuff, as well as improvements on the platform side. Because you’re talking about that time period as also being the migration from bare metal and, like, ordering machines by the rack, which really only a handful of players need to do now, and that was what everybody was doing then.
And in through the earliest bits of virtualization and really thinking about only deploying the structures that you needed to meet the needs of your application, rather than saying, “Oh, well, I can only order gear, I can only do my capacity planning once a year when we do the budget, so like, I got to order as much as they’ll let me order and then it’s going to sit in the data center spinning until I need it because I have no ability to have any kind of elastic capacity.” So, it was a completely, [laugh] completely different paradigm from what things are now. We have so much more flexibility, and the ability to, you know, expand and contract when we need to, and to shape our infrastructures to meet the needs of the application in such a more sophisticated an...