
Break Things on Purpose

Latest episodes

Feb 22, 2022 • 26min

Carissa Morrow: Learning to be Resilient

Carissa Morrow, a tech professional with experience in bootcamps and Chaos Engineering, shares her journey into the tech industry, the importance of resilience, and learning from mistakes. Topics include her first job in tech, lessons from breaking production, measuring metrics, Chaos Engineering experiences, advice for newcomers, and the value of constantly learning. The episode also emphasizes the significance of asking for help, supporting new colleagues, embracing failure, and showing empathy in learning.
Feb 8, 2022 • 30min

Gunnar Grosch: From User to Hero to Advocate

In this episode, we cover:00:00:00 - Intro00:01:45 - AWS Severless Hero and Gunnar’s history using AWS00:04:42 - Severless as reliability00:08:10 - How they are testing the connectivity in serverless00:12:47 - Gunnar shares a suprising result of Chaos Engineering00:16:00 - Strategy for improving and advice on tracing 00:20:10 - What Gunnar is excited about at AWS00:28:50 - What Gunnar has going on/OutroLinks:Twitter: https://twitter.com/GunnarGroschLinkedIn: https://www.linkedin.com/in/gunnargrosch/TranscriptGunnar: When I started out, I perhaps didn’t expect to find that many unexpected things that actually showed more resilience or more reliability than we actually thought.Jason: Welcome to the Break Things on Purpose podcast, a show about Chaos Engineering and building more reliable systems. In this episode, we chat with Gunnar Grosch, a Senior Developer Advocate at AWS about Chaos Engineering with serverless, and the new reliability-related projects at AWS that he’s most excited about.Jason: Gunnar, why don’t you say hello and introduce yourself.Gunnar: Hi, everyone. Thanks, Jason, for having me. As you mentioned that I’m Gunnar Grosch. I am a Developer Advocate at AWS, and I’m based in Sweden, in the Nordics. And I’m what’s called a Regional Developer Advocate, which means that I mainly cover the Nordics and try to engage with the developer community there to, I guess, inspire them on how to build with cloud and with AWS in different ways. And well, as you know, and some of the viewers might know, I’ve been involved in the Chaos Engineering and resilience community for quite some years as well. So, topics of real interest to me.Jason: Yeah, I think that’s where we actually met was around Chaos Engineering, but at the time, I think I knew you as just an AWS Serverless Hero, that’s something that you’d gotten into. I’m curious if you could tell us more about that. How did you begin that journey?Gunnar: Well, I guess I started out as an AWS user, built things on AWS. As a builder, developer, I’ve been through a bunch of different roles throughout my 20-plus something year career by now. But started out as an AWS user. I worked for a company, we were a consulting firm helping others build on AWS, and other platforms as well. And I started getting involved in the AWS community in different ways, by arranging and speaking at different meetups across the Nordics and Europe, also speaking at different conferences, and so on.And through that, I was able to combine that with my interest for resiliency or reliability, as someone who’s built systems for myself and for our customers. That has always been a big interest for me. Serverless, it came as I think a part of that because I saw the benefits of using serverless to perhaps remove that undifferentiated heavy lifting that we often talk about with running your own servers, with operating things in your own data centers, and so on. Serverless is really the opposite to that. But then I wanted to combine it with resilience engineering and Chaos Engineering, especially.So, started working with techniques, how to use Chaos Engineering with serverless. That gained some traction, it wasn’t a very common topic to talk about back then. Adrian Hornsby, as some people might know, also from AWS, he was previously a Developer Advocate at AWS, now in a different role within the organization. He also talked a bit about Chaos Engineering for serverless. 
So, teamed up a bit with him, and continue those techniques, started creating different tools and some open-source libraries for how to actually do that. And I guess that’s how, maybe, the AWS serverless team got their eyes opened for me as well. So somehow, I managed to become what’s known as an AWS Hero in the serverless space.Jason: I’m interested in that experience of thinking about serverless and reliability. I feel like when serverless was first announced, it was that idea of you’re not running any infrastructure, you’re just deploying code, and that code gets called, and it gets run. Talk to me about how does that change the perception or the approach to reliability within that, right? Because I think a lot of us when we first heard of serverless it’s like, “Great, there’s Nothing. So theoretically, if all you’re doing is calling my code and my code runs, as long as I’m being reliable on my end and, you know, doing testing on my code, then it should be fine, right?” But I think there’s some other bits in there or some other angles to reliability that you might want to tune us into.Gunnar: Yeah, for sure. And AWS Lambda really started it all as the compute service for serverless. And, as you said, it’s about having your piece of code running that on-demand; you don’t have to worry about any underlying infrastructure, it scales as you need it, and so on; the value proposition of serverless, truly. The serverless landscape has really evolved since then. So, now there is a bunch of different services in basically all different categories that are serverless.So, the thing that I started doing was to think about how—I wasn’t that concerned about not having my Lambda functions running; they did their job constantly. But then when you start building a system, it becomes a lot more complex. You need to have many different parts. And we know that the distributed systems we build today, they are very complex because they contain so many different moving parts. And that’s still the case for serverless.So, even though you perhaps don’t have to think about the underlying infrastructure, what servers you’re using, how that’s running, you still have all of these moving pieces that you’ve interconnected in different ways. So, that’s where the use case for Chaos Engineering came into play, even for serverless. So, testing how these different parts work together to then make sure that it actually works as you intended to. So, it’s a bit harder to create those experiments since you don’t have control of that underlying infrastructure. So instead, you have to do it in a few different ways, since you can’t install any agents to run on the platform, for instance, you can’t control the servers—shut down servers, the perhaps most basic of Chaos Engineering experiment.So instead, we’re doing it using different libraries, we’re doing it by changing configuration of services, and so on. So, it’s still apply the same principles, the principles of Chaos Engineering, we just have to be—well, we have to think about it in different way in how we actually create those experiments. So, for me, it’s a lot about testing how the different services work together. Since the serverless architectures that you build, they usually contain a bunch of different services that you stitch together to actually create the output that you’re looking for.Jason: Yeah. So, I’m curious, what does that actually look like then in testing, how these are stitched together, as you say? 
Because I know with traditional Chaos Engineering, you would run a blackhole attack or some sort of network attack to disrupt that connectivity between services. Obviously, with Lambdas, they work a little bit differently in the way that they’re called and they’re more event-driven. So, what does that look like to test the connectivity in serverless?Gunnar: So, what we started out with, both me...
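Gunnar's point in this episode is that with serverless you cannot shut down hosts or install agents, so faults get injected through libraries and service configuration instead. Below is a minimal sketch of that library-style approach: a decorator wraps a Lambda-style handler and injects latency or an error based on a configuration flag. The decorator name, the CHAOS_CONFIG environment variable, and the config fields are illustrative assumptions for this sketch, not the API of any particular chaos library.

```python
import json
import os
import random
import time


def chaos_wrapper(handler):
    """Wrap a Lambda-style handler and inject failures based on a config.

    The CHAOS_CONFIG environment variable (an assumption for this sketch)
    holds JSON such as:
      {"enabled": true, "rate": 0.2, "delay_ms": 500, "error": false}
    """
    def wrapped(event, context):
        config = json.loads(os.environ.get("CHAOS_CONFIG", "{}"))
        if config.get("enabled") and random.random() < config.get("rate", 0):
            # Inject latency to simulate a slow downstream dependency.
            time.sleep(config.get("delay_ms", 0) / 1000.0)
            if config.get("error"):
                # Inject a failure to exercise retries, timeouts, and fallbacks.
                raise RuntimeError("chaos experiment: injected error")
        return handler(event, context)
    return wrapped


@chaos_wrapper
def handler(event, context):
    # Normal business logic goes here.
    return {"statusCode": 200, "body": "ok"}
```

Because the experiment is driven entirely by configuration, turning it on or off is a config change rather than an agent deployment, which is what makes the technique workable on a platform you don't control.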
Jan 25, 2022 • 36min

Sam Rossoff: Data Centers Inside Data Centers

In this episode, we cover:00:00:00 - Intro00:02:23 - Iwata is the best, rest in peace00:06:45 - Sam sneaks some SNES emulators/Engineer prep00:08:20 - AWS, incidents, and China00:16:40 - Understanding the big picture and moving from project to product00:19:18 - Sam’s time at Snacphat00:26:40 - Sam’s work at Gremlin, and culture changes00:34:15 - Pokémon Go and OutroTranscriptSam: It’s like anything else: You can have good people and bad people. But I wouldn’t advocate for no people.Julie: [laugh].Sam: You kind of need humans involved.Julie: Welcome to the Break Things on Purpose podcast, a show about people, culture, and reliability. In this episode, we talk with Sam Rossoff, principal software engineer at Gremlin, about legendary programmers, data center disasters at AWS, going from 15 to 3000 engineers at Snapchat, and of course, Pokémon.Julie: Welcome to Break Things on Purpose. Today, Jason Yee and I are joined by Sam Rossoff, principal software engineer at Gremlin, and max level 100. Pokémon trainer. So Sam, why don’t you tell us real quick who you are.Sam: So, I’m Sam Rossoff. I’m an engineer here at Gremlin. I’ve been in engineering here for two years. It’s a good time. I certainly enjoyed it. And before that, I was at Snapchat for six years, and prior to that at Amazon for four years. And actually, before I was at Amazon, I was at Nokia Research Center in Palo Alto, and prior to that, I was at Activision. This was before they merged with Blizzard, all the way back in 2002. I worked in QA.Julie: And do you have any of those Nokia phones that are holding up your desk, or computer, or anything?Sam: I think I’ve been N95 around here somewhere. It’s, like, a phone circa 2009. Probably. I remember, it was like a really nice, expensive phone at the time and they just gave it to us. And I was like, “ oh, this is really nice.”And then the iPhone came out. And I was like [laugh], “I don’t know why I have this.” Also, I need to find a new job. That was my primary—I remember I was sitting in a meeting—this was lunch. It wasn’t a meeting.I was sitting at lunch with some other engineers at Nokia Research, and they were telling me the story about this app—because the App Store was brand new in those days—it was called iRich, and it was $10,000. It didn’t do anything. It was, like, a glowing—it was, like, NFTs, before NFTs—and it was just, like, a glowing thing on your phone. And you just, like, bought it to show you could waste $10,000 an app. And that was the moment where I was like, “I need to get out of this company. I need a new job.” It’s depressing at the time, I guess.Julie: So. Sam, you’re the best.Sam: No. False. Let me tell you story. There’s a guy, his name is Iwata, right? He’s a software developer. He works at a company called HAL Laboratories. You may recall, he built a game called Kirby. Very famous game; very popular.HAL Laboratories gets acquired by Nintendo. And Nintendo is like, “Hey, can you”—but Iwata, by the way, is the president of HAL Laboratories. Which is like, you know, ten people, so not—and they’re like, “Hey, can you, like, send someone over? We’re having trouble with this game we’re making.” Right, the game question, at the time they called it Pokémon 2, now we call it Gold and Silver, and Iwata just goes over himself because he’s a programmer in addition to be president of HAL Laboratories.And so he goes over there and he’s like, “How can I help?” And they’re like, “We’re over time. We’re over budget. We can’t fit all the data on the cart. 
We’re just, like, cutting features left and right.” He’s like, “Don’t worry. I got this.”And he comes up with this crazy compression algorithm, so they have so much space left, they put a second game inside of the game. They add back in features that weren’t there originally. And they released on time. And they called this guy the legendary programmer. As a kid, he was my hero.Also famous for building Super Smash Brothers, becoming the president of all of Nintendo later on in his life. And he died a couple years ago, of cancer, if I recall correctly. But he did this motion when he was president of Nintendo. So, you ever see somebody in Nintendo go like this, that’s a reference to Iwata, the legendary programmer.Jason: And since this is a podcast, Sam is two hands up, or just search YouTube for—Sam: Iwata.Jason: That’s the lesson. [laugh].Sam: [laugh]. His big console design after he became President of Nintendo was the Nintendo Wii, as you may recall, with the nunchucks and everything. Yeah. That’s Iwata. Crazy.Julie: We were actually just playing the Nintendo Wii the other day. It is still a high-quality game.Sam: Yeah.Jason: The original Wii? Not like the… whatever?Julie: Yeah. Like, the original Wii.Jason: Since you brought up the Wii, the Wii was the first console I ever owned because I grew up with parents that made it important to do schoolwork, and their entire argument was, if you get a Nintendo, you’ll stop doing your homework and school stuff, and your grades will suffer, and just play it all the time. And so they refuse to let me get a Nintendo. Until at one point I, like, hounded them enough-I was probably, like, eight or nine years old, and I’m like, “Can I borrow a friend’s Nintendo?” And they were like, sure you can borrow it for the weekend. So, of course, I borrowed it and I played it the whole weekend because, like, limited time. And then they used that as the proof of like, “See? All you did this weekend was play Nintendo. This is why we won’t get you one.” [laugh].Sam: So, I had the exact same problem growing up. My parents are also very strict. And firm believers in corporal punishment. And so no video games was very clear. And especially, you know, after Columbine, which was when I was in high school.That was like a hard line they held. But I had friends. I would go to their houses, I would play at their houses. And so I didn’t have any of those consoles growing up, but I did eventually get, like, my dad’s old hand-me-down computer for, like, schoolwork and stuff, and I remember—first of all, figuring out how to program, but also figuring out how to run SNES emulators on [laugh] on those machines. And, like, a lot of my experience playing video games was waking up at 2 a.m. in the morning, getting on emulators, playing that until about, you know, five, then turning it off and pretending to go back to bed.Julie: So see, you were just preparing to be an engineer who would get woken up at 2 a.m. with a page. I feel like you were just training yourself for incidents.Sam: What I did learn—which has been very useful—is I learned how to fall asleep very quickly. I can fall asleep anywhere, anytime, on, like, a moment’s notice. And that’s a fantastic skill to have, let me tell you. Especially when [crosstalk 00:07:53]—Julie: That’s a magic skill.Sam: Yeah.Julie: That is a magic skill. I’m so jealous of people that can just fall asleep when they want to. For me, it’s probably some Benadryl, maybe add in some melatonin. So, I’m very jealous of y...
Jan 11, 2022 • 6min

Unpopular Opinions

In this episode, we cover:00:00:00 - Intro00:00:38 - Death to VPNs00:02:45 - “I do not like React hooks.”00:03:50 - A Popular (?) OpinionTranscriptPat: Good thing you're putting that on our SRE focused pod.Brian: Yeah, well, they can take that to their front end developers and say, well, Brian Holt told me that hooks suck.Jason: Welcome to break things on purpose, an opinionated podcast about reliability and technology. As we launch into 2022, we thought it would be fun to ask some of our previous guests about their unpopular opinions.Zack butcher joined the show in August, 2021, to chat about his work on the Istio service mesh and its role in building more reliable, distributed systems. Here's his unpopular opinion on network security.Zack: I mean, can I talk about how I'm going to kill all the VPNs in the world? Uh, VPNs don't need to exist anymore. and that's stuff that I've actually been saying for years now. So it's so funny. We're finally realizing multi cluster Kubernetes. Right? I was so excited maybe two years ago at Kubecon and I finally heard people talk about multi cluster and I was like, oh, we finally arrived! It's not a toy anymore! Because when you have one, it's a toy, we have multiple, you're actually doing things. However, how do people facilitate that? I had demos four years ago of multicluster routing and traffic management on Istio. It was horrendous to write. It was awful. It's way better the way we do now. But, you know, the whole point that almost that entire time, I would tell people like, I'm going to kill VPN, there's no need for VPNs.There's a small need for like user privacy things. Right? That's a different category. But by and large, when organizations use a VPN, it's really about extending their network, right. It's about a network based trust model. And so I know that when you have reachability, that is that authorization, right? That's the old paradigm. VPNs enabled that. Fundamentally that doesn't work with the world that we live in anymore. it just doesn't, that's just not how security works, sorry. Uh, in, in these highly dynamic environments that we live in now. and so I actually think at this point in time, for the most part, actually VPNs probably cause more problems than solutions given the other tools that we have around.So yeah, so my unpopular opinion is that I want them to go away and be replaced with Envoy sidecars doing the encryption for all kinds of stuff. I would love to see that on your machine too. Right. I would love to see, you know, I'm, I'm talking to you on a Mac book. I would love for there to be a small sidebar there that actually is proxying that and doing things like identity and credential exchange in some way. Because that's a much stronger way to do security and to build your system, then things like a VPN.Jason: In April, 2021, Brian Holt shared some insightful, and hilarious, incidents and his perspective on Frontend Chaos Engineering. He shared his unpopular opinion with host Pat HigginsBrian: My unpopular opinion is that I do not like react hooks. And if you get people from the react community there's going to be some people that are legitimately going to be upset by that.I think they demo really well. And like the first time you show me some of that, it's just amazing and fascinating, but maintaining the large code bases full of hooks just quickly devolves into a performance mess, you get into like weird edge cases. 
And long-term, I think they actually have more cognitive load because you have to understand closures really well to understand hooks really well. Whereas the opposite way, which is doing it with react components, you have to understand this in context a little bit, but not a lot. So anyway, that's my very unpopular react opinion: I don't like hooks, and I wish we didn't have them. Pat: Good thing you're putting that on our SRE focused pod. Brian: Yeah, well, they can take that to their front end developers and say, well, Brian Holt told me that hooks suck. Jason: In November, Gustavo Franco dropped by to chat about building an SRE program at VMware and the early days of Chaos Engineering at Google; we suspect his strongly held opinion is, in fact, quite popular. Gustavo: About technology in general, the first thing that comes to mind, like the latest pet peeve in my head, is really AIOps, as a term. It really bothers me. I think it's giving a name to something that is not there yet. It may come one day. So, I could rant about AIOps forever. But the thing I would say is that, I dunno, folks selling AIOps solutions, like, look into improving statistics functions in your products first. Yeah, it's, it's just a pet peeve. I know it doesn't really change anything for me on a day-to-day basis; it's just that every time I see something related to AIOps, or people asking me, you know, if my teams ever implement AIOps, it bothers me. Maybe about technology at large, just quickly, it's kind of the same realm: how everything is artificial intelligence now, even when people are not using machine learning at all. So, everything quote-unquote is an AI, like queries and keyword matching for things. And people were like, oh, this is like an AI. This is more, like, for journalists, right? Like, I don't know if any journalists ever listen to this, but if they do, not everything that uses keyword matching is AI or machine learning. The computers are not learning, people! The computers are not learning! Calm down! Jason: The computers are not learning, but we are. And we hope that you'll learn along with us. To hear more from these guests and listen to all of our previous episodes, visit our website at gremlin.com/podcast. You can automatically receive all of our new episodes by subscribing to the Break Things on Purpose podcast on Apple Podcasts, Spotify, or your favorite podcast app. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.
Dec 28, 2021 • 18min

2021 Year in Review

In this episode, we cover:00:00:00 - Introduction00:30:00 - Fastly Outage00:04:05 - Salesforce Outage00:07:25 - Hypothesizing 00:10:00 - Julie Joins the Team!00:14:05 - Looking Forward/OutroTranscriptJason: There’s a bunch of cruft that they’ll cut from the beginning, and plenty of stupid things to cold-open with, so.Julie: I mean, I probably should have not said that I look forward to more incidents.[audio break 00:00:12]Jason: Hey, Julie. So, it’s been quite a year, and we’re going to do a year-end review episode here. As with everything, this feels like a year of a lot of incidents and outages. So, I’m curious, what is your favorite outage of the year?Julie: Well, Jason, it has been fun. There’s been so many outages, it’s really hard to pick a favorite. I will say that one that sticks out as my favorite, I guess, you could say was the Fastly outage, basically because of a lot of the headlines that we saw such as, “Fastly slows down and stops the internet.” You know, “What is Fastly and why did it cause an outage?” And then I think that people started realizing that there’s a lot more that goes into operating the internet. So, I think from just a consumer side, that was kind of a fun one. I’m sure that the increases in Google searches for Fastly were quite large in the next couple of days following that.Jason: That’s an interesting thing, right? Because I think for a lot of us in the industry, like, you know what Fastly is, I know what Fastly is; I’ve been friends with folks over there for quite a while and they’ve got a great service, but for everybody else out there in the general public, suddenly, this company, they never heard of that, you know, handles, like, 25% of the world’s internet traffic, like, is suddenly on the front page news and they didn’t realize how much of the internet runs through this service. And I feel it that way with a lot of the incidents that we’re seeing lately, right? We’re recording this in December, and a week ago, Amazon had a rather large outage, affecting us-east-1, which it seems like it’s always us-east-1. But that took down a bunch of stuff and similar, they are people, like you know, my dad, who’s just like, “I buy things from Amazon. How did this crash, like, the internet?”Julie: I will tell you that my mom generally calls me—and I hate to throw her under the bus—anytime there is an outage. So, Hulu had some issues earlier this year and I got texts from my mom actually asking me if I could call any of my friends over at Hulu and, like, help her get her Hulu working. She does this similarly for Facebook. So, when that Facebook outage happened, I always—almost—know about an outage first because of my mother. She is my alerting mechanism.Jason: I didn’t realize Hulu had an outage, and now it makes me think we’ve had J. Paul Reed and some other folks from Netflix on the show. We definitely need to have an engineer from Hulu come on the show. So, if you’re out there listening and you work for Hulu, and you’d like to be on the show and dish all the dirt on Hulu—actually don’t do that, but we’d love to talk with you about reliability and what you’re doing over there at Hulu. So, reach out to us at podcast@gremlin.com.Julie: I’m sure my mother would appreciate their email address and phone number just in case—Jason: [laugh].Julie: —for the future. [laugh].Jason: If you do reach out to us, we will connect you with Julie’s mother to help solve her streaming issues. You had mentioned one thing though. 
You said the phrase about throwing your mother under the bus, and that reminds me of one of my favorite outages from this year, which I don’t know if you remember, it’s all about throwing people under the bus, or one person in particular, and that’s the Salesforce outage. Do you remember that?Julie: Oh. Yes, I do. So, I was not here at the time of the Salesforce outage, but I do remember the impact that that had on multiple organizations. And then—Jason: Yes—Julie: —the retro.Jason: —the Salesforce outage was one where ,similarly ,Salesforce affects so much, and it is a major name. And so people like my dad or your mom probably knew like, “Oh, Salesforce. That’s a big thing.” The retro on it, I think, was what really stood out. I think, you know, most people understand, like, “Oh, you’re having DNS issues.” Like, obviously it’s always DNS, right? That’s the meme: It’s always DNS that causes your issues.In this case it was, but their retro on this they publicly published was basically, “We had an engineer that went to update DNS, and this engineer decided to push things out using an EBF process, an Emergency Brake Fix process.” So, they sort of circumvented a lot of the slow rollout processes because they just wanted to get this change made and get it done without all the hassle. And turns out that they misconfigured it and it took everything down. And so the entire incident retro was basically throwing this one engineer under the bus. Not good.Julie: No, it wasn’t. And I think that it’s interesting because especially when I was over at PagerDuty, right, we talked a lot about blamelessness. That was very not blameless. It doesn’t teach you to embrace failure, it doesn’t show that we really just want to take that and learn better ways of doing things, or how we can make our systems more resilient. But going back to the Fastly outage, I mean, the NPR headline was, “Tuesday’s Internet Outage was Caused by One Customer Changing a Setting, Fastly says.” So again, we could have better ways of communicating.Jason: Definitely don’t throw your engineers on their bus, but even moreso, don’t throw your customers under the bus. I think for both of these, we have to realize, like, for the engineer at Salesforce, like, the blameless lesson learned here is, what safeguards are you going to put in place? Or what safeguards were there? Like, obviously, this engineer thought, like, “The regular process is a hassle; we don’t need to do that. What’s the quickest, most expedient way to resolve the issue or get this job done?” And so they took that.And similarly with the customer at Fastly, they’re just like, “How can I get my systems working the way I want them to? Let’s roll out this configuration.” It’s really up to all of us, and particularly within our companies, to think about how are people using our products. How are they working on our systems? And, what are the guardrails that we need to put in place? Because people are going to try to make the best decisions that they can, and that obviously means getting the job done as quickly as possible and then moving on to the next thing.Julie: Well, and I think you’re really onto something there, too, because I think it’s also about figuring out those unique ways that our customers can break our products, things that we didn’t think through. And I mean, that goes back to what we do here at Gremlin, right? Then that goes back to Chaos Engineering. Let’s think through a hypothesis. Let’s see, you know, what if ABC Company, somebody there does something. 
How can we test for that? And I think that shouldn't get lost in the whole aspect of now we've got this postmortem. But how do we recreate that? How do we make sure that these things don't happen again? And then how do we get creative with trying to figure out, well, how can we break...
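Julie's "what if ABC Company does something, how can we test for that?" question is the hypothesis step of a chaos experiment: define the steady state, inject the suspected failure, and check whether the steady state holds. Below is a minimal sketch of that loop; steady_state_ok, inject_fault, and rollback_fault are placeholder names you would replace with your own monitoring queries and fault-injection tooling.

```python
import time


def steady_state_ok():
    """Placeholder: query your monitoring for error rate, latency, etc."""
    # e.g., return error_rate() < 0.01 and p99_latency_ms() < 300
    return True


def inject_fault():
    """Placeholder: apply the hypothesized failure, such as pushing a
    bad customer-style configuration into a staging environment."""


def rollback_fault():
    """Placeholder: undo the change so the system can recover."""


def run_experiment(duration_s=60):
    # 1. Confirm a baseline: the system is healthy before we start.
    assert steady_state_ok(), "abort: system not healthy before experiment"

    # 2. Inject the failure described in the hypothesis.
    inject_fault()
    try:
        # 3. Observe whether the steady state holds while the fault is active.
        deadline = time.time() + duration_s
        while time.time() < deadline:
            if not steady_state_ok():
                return "hypothesis disproved: steady state broke"
            time.sleep(5)
        return "hypothesis held: steady state survived the fault"
    finally:
        # 4. Always clean up so the experiment does not become an incident.
        rollback_fault()


if __name__ == "__main__":
    print(run_experiment(duration_s=30))
```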
Dec 14, 2021 • 37min

Mandi Walls

In this episode, we cover:00:00:00 - Introduction 00:04:30 - Early Dark Days in Chaos Engineering and Reliability00:08:27 - Anecdotes from the “Long Dark Time”00:16:00 - The Big Changes Over the Years00:20:50 - Mandi’s Work at PagerDuty00:27:40 - Mandi’s Tips for Better DevOps00:34:15 - OutroLinks:PagerDuty: https://www.pagerduty.comTranscriptJason: — hilarious or stupid?Mandi: [laugh]. I heard that; I listened to the J. Paul Reed episode and I was like, “Oh, there’s, like, a little, like, cold intro.” And I’m like, “Oh, okay.”Jason: Welcome to Break Things on Purpose, a podcast about reliability and learning from failure. In this episode, we take a trip down memory lane with Mandi Walls to discuss how much technology, reliability practices, and chaos engineering has evolved over her extensive career in technology.Jason: Everybody, welcome to the show, Julie Gunderson, who recently joined Gremlin on the developer advocacy team. How’s it going, Julie?Julie: Great, Jason. Really excited to be here.Jason: So, Mandi is actually a guest of yours. I mean, we both have been friends with Mandi for quite a while but you had the wonderful opportunity of working with Mandi.Julie: I did, and I was really excited to have her on our podcast now as we ran a podcast together at PagerDuty when we worked there. Mandi has such a wealth of knowledge that I thought we should have her share it with the world.Mandi: Oh, no. Okay.Julie: [laugh].Jason: “Oh, no?” Well, in that case, Mandi, why don’t you—Mandi: [crosstalk 00:01:28]. I don’t know.Jason: Well, in that case with that, “Oh no,” let’s have Mandi introduce herself. [laugh].Mandi: Yeah hi. So, thanks for having me. I am Mandi Walls. I am currently a DevOps advocate at PagerDuty, Julie’s last place of employment before she left us to join Jason at Gremlin.Julie: And Mandi, we worked on quite a few things over a PagerDuty. We actually worked on things together, joint projects between Gremlin, when it was just Jason and us where we would run joint workshops to talk about chaos engineering and actually how you can practice your incident response. And I’m sure we’ll get to that a little bit later in the episode, but will you kick us off with your background so everybody knows why we’re so excited to talk to you today?Mandi: Oh, goodness. Well, so I feel like I’ve been around forever. [laugh]. Prior to joining PagerDuty. I spent eight-and-a-half years at Chef Software, doing all kinds of things there, so if I ever trained you on Chef, I hope it was good.Prior to joining Chef, I was assistant administrator for AOL.com and a bunch of other platform and sites at AOL for a long time. So, things like Moviefone, and the AOL Sports Channel, and dotcom, and all kinds of things. Most of them ran on one big platform because the monolith was a thing. So yeah, my background is largely in operations, and just systems administration on that side.Jason: I’m laughing in the background because you mentioned Moviefone, and whenever I think of Moviefone, I think of the Seinfeld episode where Kramer decides to make a Moviefone competitor, and it’s literally just his own phone number, and people call up and he pretends to be that, like, robotic voice and has people, like, hit numbers for which movie they want to see and hear the times that it’s playing. Gives a new meaning to the term on-call.Mandi: Indeed. Yes, absolutely.Julie: And I’m laughing just because I recently watched Hackers and, you know, they needed that AOL.com disc.Mandi: That’s one of my favorite movies. 
Like, it’s so ridiculous, but also has so many gems of just complete nonsense in it. Absolutely love Hackers. “Hack the planet.”Julie: “Hack the planet.” So, with hacking the planet, Mandi, and your time working at AOL with the monolith, let’s talk a little bit because you’re in the incident business right now over at PagerDuty, but let’s talk about the before times, the before we practiced Chaos Engineering and before we really started thinking about reliability. What was it like?Mandi: Yeah, so I’ll call this the Dark Ages, right? So before the Enlightenment. And, like, for folks listening at home, [laugh] the timeline here is probably—so between two-thousand-and-fi—four, five, and 2011. So, right before the beginning of cloud, right before the beginning of, like, Infrastructure as Code, and DevOps and all those things that’s kind of started at, like, the end of my tenure at AOL. So, before that, right—so in that time period, right, like, the web was, it wasn’t like it was just getting started, but, like, the Web 2.0 moniker was just kind of getting a grip, where you were going from the sort of generic sites like Yahoo and Yellow Pages and those kinds of things and AOL.com, which was kind of a collection of different community bits and news and things like that, into more personalized experiences, right?So, we had a lot of hook up with the accounts on the AOL side, and you could personalize all of your stuff, and read your email and do all those things, but the sophistication of the systems that we were running was such that like, I mean, good luck, right? It was migration from commercial Unixes into Linux during that era, right? So, looking at when I first joined AOL, there were a bunch of Solaris boxes, and some SGIs, and some other weird stuff in the data center. You’re like, good luck on all that. And we migrated most of those platforms onto Linux at that time; 64 bit. Hurray.At least I caught that. And there was an increase in the use of open-source software for big commercial ventures, right, and so less of a reliance on commercial software and caught solutions for things, although we did have some very interesting commercial web servers that—God help them, they were there, but were not a joy, exactly, to work on because the goals were different, right? That time period was a huge acceleration. It was like a Cambrian explosion of software pieces, and tools, and improvements, and metrics, and monitoring, and all that stuff, as well as improvements on the platform side. Because you’re talking about that time period is also being the migration from bare metal and, like, ordering machines by the rack, which really only a handful of players need to do that now, and that was what everybody was doing then.And in through the earliest bits of virtualization and really thinking about only deploying the structures that you needed to meet the needs of your application, rather than saying, “Oh, well, I can only order gear, I can only do my capacity planning once a year when we do the budget, so like, I got to order as much as they’ll let me order and then it’s going to sit in the data center spinning until I need it because I have no ability to have any kind of elastic capacity.” So, it was a completely, [laugh] completely different paradigm from what things are now. We have so much more flexibility, and the ability to, you know, expand and contract when we need to, and to shape our infrastructures to meet the needs of the application in such a more sophisticated an...
Nov 30, 2021 • 15min

Itiel Shwartz

In this episode, we cover:00:00:00 - Introduction 00:05:00 - Itiel’s Background in Engineering00:08:25 - Improving Kubernetes Troubleshooting00:11:45 -  Improving Team Collaboration 00:14:00 - OutroLinks:Komodor: https://komodor.com/Twitter: https://twitter.com/Komodor_comTranscriptJason: Welcome back to another episode of Build Things On Purpose, a part of the Break Things On Purpose podcast where we talk with people who have built really cool software or systems. Today with us, we have Itiel Shwartz who is the CTO of a company called Komodor. Welcome to the show.Itiel: Thanks, happy to be here.Jason: If I go to Komodor’s website it really talks about debugging Kubernetes, and as many of our listeners know Kubernetes and complex systems are a difficult thing. Talk to me a little bit more—tell me what Komodor is. What does it do for us?Itiel: Sure. So, I don’t think I need to tell our listeners—your listeners that Kubernetes looks cool, it’s very easy to get started, but once you’re into it and you have a big company with complex, like, micros—it doesn’t have to be big, even, like, medium-size complex system company where you’re starting to hit a couple of walls or, like, issues when trying to troubleshoot Kubernetes.And that usually is due to the nature of Kubernetes which makes making complex systems very easy. Meaning you can deploy in multiple microservices, multiple dependencies, and everything looks like a very simple YAML file. But in the end of the day, when you have an issue, when one of the pods is starting to restart and you try to figure out, like, why the hell is my application is not running as it should have, you need to use a lot of different tools, methodologies, knowledge that most people don’t really have in order to solve the issue. So, Komodor focus on making the troubleshooting in Kubernetes an easy and maybe—may I dare say even fun experience by harnessing our knowledge in Kubernetes and align our users to get that digest view of the world.And so usually when you speak about troubleshooting, the first thing that come to mind is issues are caused due to changes. And the change might be deploying Kubernetes, it can be a [configurment 00:02:50] that changed, a secret that changed, or even some feature flag, or, like, LaunchDarkly feature that was just turned on and off. So, what Komodor does is we track and we collect all of the changes that happen across your entire system, and we put, like, for each one of your services a [unintelligible 00:03:06] that includes how did the service change over time and how did it behave? I mean, was it healthy? Was it unhealthy? Why wasn’t it healthy?So, by collecting the data from all across your system, plus we are sit on top of Kubernetes so we know the state of each one of the pods running in your application, we give our users the ability to understand how did the system behave, and once they have an issue we allow them to understand what changes might have caused this. So, instead of bringing down dozens of different tools, trying to build your own mental picture of how the world looks like, you just go into Komodor and see everything in one place.I would say that even more than that, once you have an issue, we try to give our best efforts on helping to understand why did it happen. We know Kubernetes, we saw a lot of issues in Kubernetes. 
We don’t try complex AI solution or something like that, but using our very deep knowledge of Kubernetes, we give our users, FYI, your pods that are unhealthy, but the node that they are running on just got restarted or is having this pressure.So, maybe they could look at the node. Like, don’t drill down into the pods logs, but instead, go look at the nodes. You just upgraded your Kubernetes version or things like that. So, basically we give you everything you need in order to troubleshoot an issue in Kubernetes, and we give it to you in a very nice and informative way. So, our user just spend less time troubleshooting and more time developing features.Jason: That sounds really extremely useful, at least from my experience, in operating things on Kubernetes. I’m guessing that this all stemmed from your own experience. You’re not typically a business guy, you’re an engineer. And so it sounds like you were maybe scratching your own itch. Tell us a little bit more about your history and experience with this?Itiel: I started computer science, I started working for eBay and I was there in the infrastructure team. From there I joined two Israeli startup and—I learned that the thing that I really liked or do quite well is to troubleshoot issues. I was in a very, very, like, production-downtime-sensitive systems. A system when the system is down, it just cost the business a lot of money.So, in these kinds of systems, you try to respond really fast through the incidents, and you spend a lot of time monitoring the system so once an issue occur you can fix it as soon as possible. So, I developed a lot of internal tools. For the companies I worked for that did something very similar, allow you once you have an issue to understand the root cause, or at least to get a better understanding of how the world looks like in those companies.And we started Komodor because I also try to give advice to people. I really like Kubernetes. I liked it, like, a couple of years ago before it was that cool, and people just consult with me. And I saw the lack of knowledge and the lack of skills that most people that are running Kubernetes have, and I saw, like—I’d have to say it’s like giving, like, a baby a gun.So, giving an operation person that doesn’t really understand Kubernetes tell him, “Yeah, you can deploy everything and everything is a very simple YAML. You want a load balancer, it’s easy. You want, like, a persistent storage, it’s easy. Just install like—Helm install Postgres or something like that.” I installed quite a lot of, like, Helm-like recipes, GA, highly available. But things are not really highly available most of the time.So, it’s definitely scratching my own itch. And my partner, Ben, is also a technical guy. He was in Google where they have a lot of Kubernetes experience. So, together both of us felt the pain. We saw that as more and more companies moved to Kubernetes, the pain became just stronger. And as the shift-left movement is also like taking off and we see more and more dev people that are not necessarily that technical that are expected to solve issues, then again we saw an issue.So, what we see is companies moving to Kubernetes and they don’t have the skills or knowledge to troubleshoot Kubernetes. And then they tell their developers, “You are now responsible for the production. You are deploying? You should troubleshoot,” and the developers really don’t know what to do. And we came to those companies and basically it makes everything a lot easier.You have any issue in Kubernetes? 
No issue, like, no issue. And no problem go to Komodor and understand what is the probable root cause. See what’s the status? Like, when did it change? When was it last restarted? When was it unhealthy before today? Maybe, like, an hour ago, maybe a month ago. So, Komodor just gives you all of this information in a very informative way.Jason: I like the idea of pulling everything into one place, but I th...
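The workflow Itiel describes, checking whether an unhealthy pod is really a symptom of the node it runs on, can be approximated by hand with the Kubernetes API. The sketch below uses the official Python client to find pods with many restarts and pull recent events for their nodes; it is only an illustration of the kind of correlation Komodor automates, not Komodor's own implementation, and the restart threshold is an arbitrary assumption.

```python
from kubernetes import client, config


def restarting_pods(v1, namespace="default", threshold=3):
    """Return (pod, node, restarts) for pods restarted more than threshold times."""
    noisy = []
    for pod in v1.list_namespaced_pod(namespace).items:
        restarts = sum(
            cs.restart_count for cs in (pod.status.container_statuses or [])
        )
        if restarts > threshold:
            noisy.append((pod.metadata.name, pod.spec.node_name, restarts))
    return noisy


def node_events(v1, node_name):
    """Fetch recent cluster events attached to the node a suspect pod runs on."""
    selector = f"involvedObject.kind=Node,involvedObject.name={node_name}"
    return v1.list_event_for_all_namespaces(field_selector=selector).items


if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() inside a cluster
    v1 = client.CoreV1Api()
    for pod_name, node_name, restarts in restarting_pods(v1):
        print(f"{pod_name} restarted {restarts} times on node {node_name}")
        for event in node_events(v1, node_name):
            # Node-level pressure or restarts often explain pod-level symptoms.
            print(f"  {event.reason}: {event.message}")
```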
Nov 16, 2021 • 31min

Tomas Fedor

In this episode, we cover:00:00:00 - Introduction00:02:45 - Adopting the Cloud00:08:15 - POC Process 00:12:40 - Infrastructure Team Building00:17:45 - “Disaster Roleplay”/Communicating to the Non-Technical Side 00:20:20 - Leadership00:22:45 - Tomas’ Horror Story/Dashboard Organziation00:29:20 - OutroLinks:Productboard: https://www.productboard.comScaling Teams: https://www.amazon.com/Scaling-Teams-Strategies-Successful-Organizations/dp/149195227XSeeking SRE: https://www.amazon.com/Seeking-SRE-Conversations-Running-Production/dp/1491978864/TranscriptJason: Welcome to Break Things on Purpose, a podcast about failure and reliability. In this episode, we chat with Tomas Fedor, Head of Infrastructure at Productboard. He shares his approach to testing and implementing new technologies, and his experiences in leading and growing technical teams.Today, we’ve got with us Tomas Fedor, who’s joining us all the way from the Czech Republic. Tomas, why don’t you say hello and introduce yourself?Tomas: Hello, everyone. Nice to meet you all, and my name is Tomas, or call me Tom. And I’ve been working for a Productboard for past two-and-a-half year as infrastructure leader. And all the time, my experience was in the areas of DevOps, and recently, three and four years is about management within infrastructure teams. What I’m passionate about, my main technologies-wise in cloud, mostly Amazon Web Services, Kubernetes, Infrastructure as Code such as Terraform, and recently, I also jumped towards security compliances, such as SOC 2 Type 2.Jason: Interesting. So, a lot of passions there, things that we actually love chatting about on the podcast. We’ve had other guests from HashiCorp, so we’ve talked plenty about Terraform. And we’ve talked about Kubernetes with some folks who are involved with the CNCF. I’m curious, with your experience, how did you first dive into these cloud-native technologies and adopting the cloud? Is that something you went straight for, or is that something you transitioned into?Tomas: I actually slow transition to cloud technologies because my first career started at university when I was like, say, half developer and half Unix administrator. And I had experience with building very small data center. So, those times were amazing to understand all the hardware aspects of how it’s going to be built. And then later on, I got opportunity to join a very famous startup at Czech Republic [unintelligible 00:02:34] called Kiwi.com [unintelligible 00:02:35]. And that time, I first experienced cloud technologies such as Amazon Web Services.Jason: So, as you adopted Amazon, coming from that background of a university and having physical servers that you had to deal with, what was your biggest surprise in adopting the cloud? Maybe something that you didn’t expect?Tomas: So, that’s great question, and what comes to my mind first, is switching to completely different [unintelligible 00:03:05] because during my university studies and career there, I mostly focused on networking [unintelligible 00:03:13], but later on, you start actually thinking about not how to build a service, but what service you need to use for your use case. And you don’t have, like, one service or one use case, but you have plenty of services that can suit your needs and you need to choose wisely. So, that was very interesting, and it needed—and it take me some time to actually adopt towards new thinking, new mindset, et cetera.Jason: That’s an excellent point. 
And I feel like it’s only gotten worse with the, “How do you choose?” If I were to ask you to set up a web service and it needs some sort of data store, at this point you’ve got, what, a half dozen or more options on Amazon? [laugh].Tomas: Exactly.Jason: So, with so many services on providers like Amazon, how do you go about choosing?Tomas: After a while, we came up with a thing like RFCs. That’s like ‘Request For Comments,’ where we tried to sum up all the goals, and all the principles, and all the problems and challenges we try to tackle. And with that, we also tried to validate all the alternatives. And once you went through all these information, you tried to sum up all the possible solutions. You typically had either one or two options, and those options were validated with all your team members or the whole engineering organization, and you made the decision then you try to run POC, and you either are confirmed, yeah this is the technology, or this is service you need and we are going to implement it, or you revised your proposal.Jason: I really like that process of starting with the RFC and defining your requirements and really getting those set so that as you’re evaluating, you have these really stable ideas of what you need and so you don’t get swayed by all of the hype around a certain technology. I’m curious, who is usually involved in the RFC process? Is it a select group in the engineering org? Is it broader? How do you get the perspectives that you need?Tomas: I feel we have very great established process at Productboard about RFCs. It’s transparent to the whole organization, that’s what I love the most. The first week, there is one or two reporters that are mainly focused on writing and summing up the whole proposal to write down goals, and also non-goals because that is going to define your focus and also define focus of reader. And then you’re going just to describe alternatives, possible options, or maybe to sum up, “Hey, okay, I’m still unsure about this specific decision, but I feel this is the right direction.” Maybe I have someone else in the organization who is already familiar with the technology or with my use case, and that person can help me.So, once—or we call it a draft state, and once you feel confident, you are going to change the status of RFC to open. The time is open to feedback to everyone, and they typically geared, like, two weeks or three weeks, so everyone can give a feedback. And you have also option to present it on engineering all-hands. So, many engineers, or everyone else joining the engineering all-hands is aware of this RFC so you can receive a lot of feedback. What else is important to mention there that you can iterate over RFCs.So, you mark it as resolved after through two or three weeks, but then you come up with a new proposal, or you would like to update it slightly with important change. So, you can reopen it and update version there. So, that also gives you a space to update your RFC, improve the proposal, or completely to change the context so it’s still up-to-date with what you want to resolve.Jason: I like that idea of presenting at engineering all-hands because, at least in my experience, being at a startup, you’re often super busy so you may know that the RFC is available, but you may not have time to actually read through it, spend the time to comment, so having that presentation where it’s nicely summarized for you is always nice. Moving from that to the POC, when you’ve selected a few and you want to try them out, tell me m...
Nov 3, 2021 • 37min

Gustavo Franco

In this episode, we cover:00:00:00 - Introduction00:03:20 - VMWare Tanzu00:07:50 - Gustavo’s Career in Security 00:12:00 - Early Days in Chaos Engineering00:16:30 - Catzilla 00:19:45 - Expanding on SRE00:26:40 - Learning from Customer Trends00:29:30 - Chaos Engineering at VMWare00:36:00 - OutroLinks:Tanzu VMware: https://tanzu.vmware.comGitHub for SREDocs: https://github.com/google/sredocsE-book on how to start your incident lifecycle program: https://tanzu.vmware.com/content/ebooks/establishing-an-sre-based-incident-lifecycle-programTwitter: https://twitter.com/stratusTranscriptJason: Welcome to Break Things on Purpose, a podcast about chaos engineering and building reliable systems. In this episode, Gustavo Franco, a senior engineering manager at VMware joins us to talk about building reliability as a product feature, and the journey of chaos engineering from its place in the early days of Google’s disaster recovery practices to the modern SRE movement. Thanks, everyone, for joining us for another episode. Today with us we have Gustavo Franco, who’s a senior engineering manager at VMware. Gustavo, why don’t you say hi, and tell us about yourself.Gustavo: Thank you very much for having me. Gustavo Franco; as you were just mentioning, I’m a senior engineering manager now at VMware. So, recently co-founded the VMware Tanzu Reliability Engineering Organization with Megan Bigelow. It’s been only a year, actually. And we’ve been doing quite a bit more than SRE; we can talk about like—we’re kind of branching out beyond SRE, as well.Jason: Yeah, that sounds interesting. For folks who don’t know, I feel like I’ve seen VMware Tanzu around everywhere. It just suddenly went from nothing into this huge thing of, like, every single Kubernetes-related event, I feel like there’s someone from VMware Tanzu on it. So, maybe as some background, give us some information; what is VMware Tanzu?Gustavo: Kubernetes is sort of the engine, and we have a Kubernetes distribution called Tanzu Kubernetes Grid. So, one of my teams actually works on Tanzu Kubernetes Grid. So, what is VMware Tanzu? What this really is, is what we call a modern application platform, really an end-to-end solution. So, customers expect to buy not just Kubernetes, but everything around, everything that comes with giving the developers a platform to write code, to write applications, to write workloads.So, it’s basically the developer at a retail company or a finance company, they don’t want to run Kubernetes clusters; they would like the ability to, maybe, but they don’t necessarily think in terms of Kubernetes clusters. They want to think about workloads, applications. So, VMWare Tanzu is end-to-end solution that the engine in there is Kubernetes.Jason: That definitely describes at least my perspective on Kubernetes is, I love running Kubernetes clusters, but at the end of the day, I don’t want to have to evaluate every single CNCF project and all of the other tools that are required in order to actually maintain and operate a Kubernetes cluster.Gustavo: I was just going to say, and we acquired Pivotal a couple of years ago, so that brought a ton of open-source projects, such as the Spring Framework. So, for Java developers, I think it’s really cool, too, just being able to worry about development and the Java layer and a little bit of reliability, chaos engineering perspective. So, kind of really gives me full tooling, the ability common libraries. 
It’s so important for reliable engineering and chaos engineering as well, to give people this common surface that we can actually use to inject faults, potentially, or even just define standards.Jason: Excellent point of having that common framework in order to do these reliability practices. So, you’ve explained what VMware Tanzu is. Tell me a bit more about how that fits in with VMware Tanzu?Gustavo: Yeah, so one thing that happened the past few years, the SRE organization grew beyond SRE. We’re doing quite a bit of horizontal work, so SRE being one of them. So, just an example, I got to charter a compliance engineering team and one team that we call ‘Customer Zero.’ I would call them partially the representatives of growth, and then quote-unquote, “Customer problems, customer pain”, and things that we have to resolve across multiple teams. So, SRE is one function that clearly you can think of.You cannot just think of SRE on a product basis, but you think of SRE across multiple products because we’re building a platform with multiple pieces. So, it’s kind of like putting the building blocks together for this platform. So then, of course, we’re going to have to have a team of specialists, but we need an organization of generalists, so that’s where SRE and this broader organization comes in.Jason: Interesting. So, it’s not just we’re running a platform, we need our own SREs, but it sounds like it’s more of a group that starts to think more about the product itself and maybe even works with customers to help their reliability needs?Gustavo: Yeah, a hundred percent. We do have SRE teams that invest the majority of their time running SaaS, so running Software as a Service. So, one of them is the Tanzu Mission Control. It’s purely SaaS, and what teams see Tanzu Mission Control does is allow the customers to run Kubernetes anywhere. So, if people have Kubernetes on-prem or they have Kubernetes on multiple public clouds, they can use TMC to be that common management surface, both API and web UI, across Kubernetes, really anywhere they have Kubernetes. So, that’s SaaS.But for TKG SRE, that’s a different problem. We don’t have currently a TKG SaaS offering, so customers are running TKG on-prem or on public cloud themselves. So, what does the TKG SRE team do? So, that’s one team that actually [unintelligible 00:05:15] to me, and they are working directly improving the reliability of the product. So, we build reliability as a feature of the product.So, we build a reliability scanner, which is a [unintelligible 00:05:28] plugin. It’s open-source. I can give you more examples, but that’s the gist of it, of the idea that you would hire security engineers to improve the security of a product that you sell to customers to run themselves. Why wouldn’t you hire SREs to do the same to improve the reliability of the product that customers are running themselves? So, kind of, SRE beyond SaaS, basically.Jason: I love that idea because I feel like a lot of times in organizations that I talk with, SRE really has just been a renamed ops team. And so it’s purely internal; it’s purely thinking about we get software shipped to us from developers and it’s our responsibility to just make that run reliably. And this sounds like it is that complete embrace of the DevOps model of breaking down silos and starting to move reliability, thinking of it from a developer perspective, a product perspective.Gustavo: Yeah. A lot of my work is spent on making analogies with security, basically. 
One example, several of the SREs in my org, yeah, they do spend time doing PRs with product developers, but al...
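To make the “reliability as a product feature” idea above a bit more concrete, here is a minimal, hypothetical sketch of a scanner-style check. It is not the reliability scanner Gustavo describes (the episode doesn’t spell out its plugin framework); it only assumes the official kubernetes Python client and a reachable kubeconfig, and the check names and output format are invented for illustration.

# Minimal sketch of a "reliability scanner"-style check. Hypothetical: not
# VMware's actual scanner. Assumes the official `kubernetes` Python client
# (pip install kubernetes) and a kubeconfig that points at a cluster.
from kubernetes import client, config


def scan_deployments(apps_v1):
    """Flag deployments whose available replicas lag behind the desired count."""
    findings = []
    for dep in apps_v1.list_deployment_for_all_namespaces().items:
        desired = dep.spec.replicas or 0
        available = dep.status.available_replicas or 0
        if available < desired:
            findings.append(
                f"{dep.metadata.namespace}/{dep.metadata.name}: "
                f"only {available}/{desired} replicas available"
            )
    return findings


def scan_pods(core_v1):
    """Flag pods that are not Running or Succeeded (e.g., stuck in CrashLoopBackOff)."""
    findings = []
    for pod in core_v1.list_pod_for_all_namespaces().items:
        if pod.status.phase not in ("Running", "Succeeded"):
            findings.append(
                f"{pod.metadata.namespace}/{pod.metadata.name}: phase={pod.status.phase}"
            )
    return findings


if __name__ == "__main__":
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    issues = scan_deployments(client.AppsV1Api()) + scan_pods(client.CoreV1Api())
    for issue in issues:
        print("RELIABILITY FINDING:", issue)
    print(f"{len(issues)} finding(s) in total")

The point isn’t the specific checks; it’s that the same kind of health signal an internal SRE team would watch can ship with the product itself, so customers running TKG on their own infrastructure get it too.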
Oct 19, 2021 • 35min

Leonardo Murillo

In this episode, we cover:00:00:00 - Introduction 00:03:30 - An Engineering Anecdote 00:08:10 - Lessons Learned from Putting Out Fires00:11:00 - Building “Guardrails”00:18:10 - Pushing the Chaos Envelope 00:23:35 - OpenGitOps Project00:30:37 - Where to Find Leo/Costa Rica CNCFLinks:Weaveworks: https://www.weave.worksGitOps Working Group: https://github.com/gitops-working-group/gitops-working-groupOpenGitOps Project: https://opengitops.devGithub.com/open-gitops: https://github.com/open-gitopsTwitter: https://twitter.com/murillodigitalLinkedIn: https://www.linkedin.com/in/leonardomurillo/Costa Rica CNCF: https://community.cncf.io/costa-rica/Cloudnative.tv: http://cloudnative.tvGremlin-certified chaos engineering practitioner: https://www.gremlin.com/certificationTranscriptJason: Welcome to the Break Things on Purpose podcast, a show about our often self-inflicted failures and what we learn from them. In this episode, Leonardo Murillo, a principal partner solutions architect at Weaveworks, joins us to talk about GitOps, automating reliability, and Pura Vida.Ana: I like letting our guests kind of say, like, “Who are you? What do you do? What got you into the world of DevOps, and cloud, and all this fun stuff that we all get to do?”Leo: Well, I guess I’ll do a little intro of myself. I’m Leonardo Murillo; everybody calls me Leo, which is fine because I realize that not everybody chooses to call me Leo, depending on where they’re from. Like, Ticos and Latinos, they’re like, “Oh, Leo,” like they already know me; I’m Leo already. But people in Europe and in other places, they’re, kind of like, more formal out there: ‘Leonardo.’ Everybody calls me Leo.I’m based out of Costa Rica, and my current professional role is principal solutions architect—principal partner solutions architect at Weaveworks. How I got started in DevOps. A lot of people have gotten started in DevOps without realizing that they just got started in DevOps, you know what I’m saying? Like, they did DevOps before it was a buzzword and it was, kind of like, cool. That was back—so I worked probably, like, three roles back, so I was CTO for a Colorado-based company before Weaveworks, and before that, I worked with a San Francisco-based startup called High Fidelity.And High Fidelity did virtual reality. So, it was actually founded by Philip Rosedale, the founder of Linden Lab, the builders of Second Life. And the whole idea was, let’s build—with the advent of the Oculus Rift and all this cool tech—build the new metaverse concept. We’re using the cloud because, I mean, when we’re talking about this distributed system, like a distributed system where you’re trying to, with very low latency, transmit positional audio, and a bunch of different degrees of freedom of your avatars and whatnot; that’s very massive scale, lots of traffic. So, the cloud was, kind of like, fit for purpose.And so we started using the cloud, and I started using Jenkins, as a—and figured out, like, Jenkins is a cron sort of thing; [unintelligible 00:02:48] oh, you can actually do a scheduled thing here. So, started using it almost to run just scheduled jobs. And then I realized its power, and all of a sudden, I started hearing this whole DevOps word, and I’m like, “What is this? That’s kind of like what we’re doing, right?” Like, we’re doing DevOps. And that’s how it all got started, back in San Francisco.Ana: That actually segues to one of the first questions that we love asking all of our guests.
We know that working in DevOps and engineering, sometimes it’s a lot of firefighting, sometimes we get to teach a lot of other engineers how to have better processes. But we know that those horror stories exist. So, what is one of those horrible incidents that you’ve encountered in your career? What happened?Leo: This is before the cloud and this is way before DevOps was even something. I used to be a DJ in my 20s. I used to mix drum and bass and jungle with vinyl. I never did the digital move. I used to DJ, and I was director for a colocation facility here in Costa Rica, one of the first few colocation facilities that existed in the [unintelligible 00:04:00].I partied a lot, like every night, [laugh] [unintelligible 00:04:05] party night and DJ night. We had 24/7 support because we were a colocation [unintelligible 00:04:12], so I had people doing support all the time. I was mixing in some bar someplace one night, and I don’t want to go into absolute detail of my state of consciousness, but it wasn’t, kind of like… accurate in its execution. So, I got a call, and they’re like, “We’re having some problem here with our network.” This is, like, back in Cisco PIX times for firewalls and you know, like… back then.I wasn’t fully there, so I [laugh], just drove back to the office in the middle of the night and had this assistant, Miguel was his name, and he looks at me and he’s like, “Are you okay? Are you really capable of solving this problem at [laugh] this very point in time?” And I’m like, “Yeah. Sure, sure. I can do this.”We had a rack full of networking hardware and there was, like, a big incident; we actually—one of the primary connections that we had was completely offline. And I went in and I started working on a device, and I spent about half an hour, like, “Well, this device is fine. There’s nothing wrong with the device.” I had been working for half an hour on the wrong device. They’re like, “Come on. You really got to focus.”And long story short, I eventually got to the right device and I was able to fix the problem, but that was like a bad incident, which wasn’t bad in the context of technicality, right? It was a relatively quick fix once I figured it out. It was just at the wrong time. [laugh]. You know what I’m saying?It wasn’t the best thing to occur that particular night. So, when you’re talking about firefighting, there’s a huge burden in terms of the on-call person, and I think that’s something that we had experienced, and that I think we should give out a lot of shout-outs and provide a lot of support for those that are on call. Because this is the exact price they pay for that responsibility. So, just as a side note that comes to mind. Here’s a lot of, like, shout-outs to all the people on-call that are listening to this right now, and I’m sorry you cannot go party. [laugh].So yeah, that’s telling one story of one incident way back. You want to hear another one because there’s a—this is back in High Fidelity times. I was—I don’t remember exactly what it was I was building, but it had to do with emailing users, basically, I had to do something, I can’t recall actually what it was. It was supposed to email all the users that were using the platform. For whatever reason—I really can’t recall why—I did not mock data on my development environment.What I did was just use—I didn’t mock...
