AI-powered
podcast player
Listen to all your favourite podcasts with AI-powered features
In this episode, we cover:
Transcript
Jason: There’s a bunch of cruft that they’ll cut from the beginning, and plenty of stupid things to cold-open with, so.
Julie: I mean, I probably should have not said that I look forward to more incidents.
[audio break 00:00:12]
Jason: Hey, Julie. So, it’s been quite a year, and we’re going to do a year-end review episode here. As with everything, this feels like a year of a lot of incidents and outages. So, I’m curious, what is your favorite outage of the year?
Julie: Well, Jason, it has been fun. There’s been so many outages, it’s really hard to pick a favorite. I will say that one that sticks out as my favorite, I guess, you could say was the Fastly outage, basically because of a lot of the headlines that we saw such as, “Fastly slows down and stops the internet.” You know, “What is Fastly and why did it cause an outage?” And then I think that people started realizing that there’s a lot more that goes into operating the internet. So, I think from just a consumer side, that was kind of a fun one. I’m sure that the increases in Google searches for Fastly were quite large in the next couple of days following that.
Jason: That’s an interesting thing, right? Because I think for a lot of us in the industry, like, you know what Fastly is, I know what Fastly is; I’ve been friends with folks over there for quite a while and they’ve got a great service, but for everybody else out there in the general public, suddenly, this company, they never heard of that, you know, handles, like, 25% of the world’s internet traffic, like, is suddenly on the front page news and they didn’t realize how much of the internet runs through this service. And I feel it that way with a lot of the incidents that we’re seeing lately, right? We’re recording this in December, and a week ago, Amazon had a rather large outage, affecting us-east-1, which it seems like it’s always us-east-1. But that took down a bunch of stuff and similar, they are people, like you know, my dad, who’s just like, “I buy things from Amazon. How did this crash, like, the internet?”
Julie: I will tell you that my mom generally calls me—and I hate to throw her under the bus—anytime there is an outage. So, Hulu had some issues earlier this year and I got texts from my mom actually asking me if I could call any of my friends over at Hulu and, like, help her get her Hulu working. She does this similarly for Facebook. So, when that Facebook outage happened, I always—almost—know about an outage first because of my mother. She is my alerting mechanism.
Jason: I didn’t realize Hulu had an outage, and now it makes me think we’ve had J. Paul Reed and some other folks from Netflix on the show. We definitely need to have an engineer from Hulu come on the show. So, if you’re out there listening and you work for Hulu, and you’d like to be on the show and dish all the dirt on Hulu—actually don’t do that, but we’d love to talk with you about reliability and what you’re doing over there at Hulu. So, reach out to us at podcast@gremlin.com.
Julie: I’m sure my mother would appreciate their email address and phone number just in case—
Jason: [laugh].
Julie: —for the future. [laugh].
Jason: If you do reach out to us, we will connect you with Julie’s mother to help solve her streaming issues. You had mentioned one thing though. You said the phrase about throwing your mother under the bus, and that reminds me of one of my favorite outages from this year, which I don’t know if you remember, it’s all about throwing people under the bus, or one person in particular, and that’s the Salesforce outage. Do you remember that?
Julie: Oh. Yes, I do. So, I was not here at the time of the Salesforce outage, but I do remember the impact that that had on multiple organizations. And then—
Jason: Yes—
Julie: —the retro.
Jason: —the Salesforce outage was one where ,similarly ,Salesforce affects so much, and it is a major name. And so people like my dad or your mom probably knew like, “Oh, Salesforce. That’s a big thing.” The retro on it, I think, was what really stood out. I think, you know, most people understand, like, “Oh, you’re having DNS issues.” Like, obviously it’s always DNS, right? That’s the meme: It’s always DNS that causes your issues.
In this case it was, but their retro on this they publicly published was basically, “We had an engineer that went to update DNS, and this engineer decided to push things out using an EBF process, an Emergency Brake Fix process.” So, they sort of circumvented a lot of the slow rollout processes because they just wanted to get this change made and get it done without all the hassle. And turns out that they misconfigured it and it took everything down. And so the entire incident retro was basically throwing this one engineer under the bus. Not good.
Julie: No, it wasn’t. And I think that it’s interesting because especially when I was over at PagerDuty, right, we talked a lot about blamelessness. That was very not blameless. It doesn’t teach you to embrace failure, it doesn’t show that we really just want to take that and learn better ways of doing things, or how we can make our systems more resilient. But going back to the Fastly outage, I mean, the NPR headline was, “Tuesday’s Internet Outage was Caused by One Customer Changing a Setting, Fastly says.” So again, we could have better ways of communicating.
Jason: Definitely don’t throw your engineers on their bus, but even moreso, don’t throw your customers under the bus. I think for both of these, we have to realize, like, for the engineer at Salesforce, like, the blameless lesson learned here is, what safeguards are you going to put in place? Or what safeguards were there? Like, obviously, this engineer thought, like, “The regular process is a hassle; we don’t need to do that. What’s the quickest, most expedient way to resolve the issue or get this job done?” And so they took that.
And similarly with the customer at Fastly, they’re just like, “How can I get my systems working the way I want them to? Let’s roll out this configuration.” It’s really up to all of us, and particularly within our companies, to think about how are people using our products. How are they working on our systems? And, what are the guardrails that we need to put in place? Because people are going to try to make the best decisions that they can, and that obviously means getting the job done as quickly as possible and then moving on to the next thing.
Julie: Well, and I think you’re really onto something there, too, because I think it’s also about figuring out those unique ways that our customers can break our products, things that we didn’t think through. And I mean, that goes back to what we do here at Gremlin, right? Then that goes back to Chaos Engineering. Let’s think through a hypothesis. Let’s see, you know, what if ABC Company, somebody there does something. How can we test for that?
And I think that shouldn’t get lost in the whole aspect of now we’ve got this postmortem. But how do we recreate that? How do we make sure that these things don’t happen again? And then how do we get creative with trying to figure out, well, how can we break...