Break Things on Purpose

Gremlin

A podcast about site reliability engineering (SRE); Chaos Engineering; and the people, processes, and tools used to build resilient systems. Sponsored by Gremlin. Find us on Twitter at @BTOPpod.

Episodes

Mentioned books

Oct 5, 2021 • 21min

Maxim Fateev and Samar Abbas

Maxim Fateev and Samar Abbas are guests on this podcast episode. They discuss the challenges of orchestrating microservices and their journey in creating a simple workflow service. They talk about their background, transition from Cadence to Temporal, and the value proposition of Cadence. They also explore the concepts of workflows, choreography, and orchestration, and discuss the challenges of storing information in long-running applications.

Sep 21, 2021 • 33min

John Martinez

In this episode, we cover:00:00:00 - Introduction 00:03:15 - FinOps Foundation and Multicloud 00:07:00 - Costs 00:10:40 - John’s History in Reliability Engineering 00:16:30 - The Actual Cost of an Outages, Security, Etc.00:21:30 - What John Measures 00:28:00 - What John is Up To/Latinx in TechLinks:Palo Alto Networks: https://www.paloaltonetworks.com/FinOps Foundation: https://www.finops.orgTechqueria.org: https://techqueria.orgLinkedIn: https://www.linkedin.com/in/johnmartinez/TranscriptJohn: I would say a tip for better monitoring, uh, would be to, uh turn it on. [laugh]. [unintelligible 00:00:07] sounds, right?Jason: Welcome to the Break Things on Purpose podcast, a show about chaos engineering and operating reliable systems. In this episode we chat with John Martinez, Director of Cloud R&D at Palo Alto Networks. John’s had a long career in tech, and we discuss his new focus on FinOps and how it has been influenced by his past work in security and chaos engineering. Jason: So, John, welcome to the show. Tell us a little bit about yourself. Who are you? Where do you work? What do you do?John: Yeah. So, John Martinez. I am a director over at Palo Alto Networks. I have been in the cloud security space for the better of, I would say, seven, eight years or so. And currently, am in transition in my role at Palo Alto Networks.So, I’m heading headstrong into the FinOps world. So, turning back into the ops world to a certain degree and looking at what can we do, two things: better manage our cloud spend and gain a lot more optimization out of our usage in the cloud. So, very excited about new role.Jason: That’s an interesting new role. I’d imagine that at Palo Alto Networks, you’ve got quite a bit of infrastructure and that’s probably a massive bill.John: It can be. It can be. Yeah, [laugh] absolutely. We definitely have large amount of scale, in multi-cloud, too, so that’s the added bonus to it all. FinOps is kind of a new thing for me, so I’m pretty happy to, as I dig back into the operations world, very happy to discover that the FinOps Foundation exists and it kind of—there’s a lot of prescribed ways of both looking at FinOps, at optimization—specifically in the cloud, obviously—and as well as there’s a whole framework that I can go adopt.So, it’s not like I’m inventing the wheel, although having been in the cloud for a long time, and I haven’t talked about that part of it but a lot of times, it feels like—in my early days anyway—felt like I was inventing new wheels all the time. As being an engineer, the part that I am very excited about is looking at the optimization opportunities of it. Of course, the goal, from a finance perspective, is to either reduce our spend where we can, but also to take a look at where we’re investing in the cloud, and if it takes more of a shift as opposed to a straight-up just cut the bill kind of thing, it’s really all about making sure that we’re investing in the right places and optimizing in the right places when it comes down to it.Jason: I think one of the interesting perspectives of adopting multi-cloud is that idea of FinOps: let’s save money. And the idea, if I wanted to run a serverless function, I could take a look at AWS Lambda, I could take a look at Azure Functions to say, “Which one’s going to be cheaper for this particular use case,” and then go with that.John: I really liked how the FinOps Foundation has laid out the approach to the lifecycle of FinOps. So, they basically go from the crawl, walk, run approach which, in a lot of our world, is kind of like that. It’s very much about setting yourself up for success. Don’t expect to be cutting your bill by hundreds of thousands of dollars at the beginning. It’s really all about discovering not just how much we’re spending, but where we’re spending it.I would categorize the pitting the cloud providers against each other to be more on the run side of things, and that eventually helps, especially in the enterprise space; it helps enterprises to approach the cloud providers with more of a data-driven negotiation, I would say [laugh] to your enterprise spend.Jason: I think that’s an excellent point about the idea of that is very much a run. And I don’t know any companies within my sphere and folks that I know in the engineering space that are doing that because of that price competition. I think everybody gets into the idea of multi-cloud because of this idea of reliability, and—John: Mm-hm.Jason: One of my clouds may fail. Like, what if Amazon goes down? I’d still need to survive that.John: That’s the promise, right? At least that’s the promise that I’ve been operating under for the 11 years or so that I’ve been in the cloud now. And obviously, in the old days, there wasn’t a GCP or an Azure—I think they were in their infancy—there was AWS… and then there was AWS, right? And so I think eventually though you’re right, you’re absolutely right. Can I increase my availability and my reliability by adopting multiple clouds?As I talk to people, as I see how we’re adopting the multiple clouds, I think realistically though what it comes down to is you adopted cloud, or teams adopt a cloud specifically for, I wouldn’t say some of the foundational services, but mostly about those higher-level niche services that we like. For example, if you know large-scale data warehousing, a lot of people are adopting BigQuery and GCP because of that. If you like general purpose compute and you love the Lambdas, you’re adopting AWS and so on, and so forth. And that’s what I see more than anything is, I really like a cloud’s particular higher level service and we go and we adopt it, we love it, and then we build our infrastructure around it. From a practical perspective, that’s what I see.I’m still hopeful, though, that there is a future somewhere there where we can commoditize even the cloud providers, maybe [laugh]. And really go from Cloud A to Cloud B to Cloud C, and just adopt it based on pricing I get that’s cheaper, or more performant, or whatever other dimensions that are important to me. But maybe, maybe. We’ll remain hopeful. [laugh].Jason: Yeah, we’re still very much in that spot where everybody, despite even the basics of if I want to a virtual machine, those are still so different between all the clouds. And I mean even last week, I was working on some Terraform and the idea of building it modularly, and in my head thinking, “Well, at some point, we might want to use one of the other clouds so let’s build this module,” and thinking, “Realistically, that’s probably not going to happen.”John: [laugh]. Right. I would say that there’s the other hidden cost about this and it’s the operational costs. I don’t think we spend a whole lot of time talking about operational costs, necessarily, but what is it going to cost to retrain my DevOps team to move from AWS to GCP, as an example? W...

Sep 7, 2021 • 25min

Omar Marrero

In this episode, we cover:What Kessel Run is Doing: 00:01:27Failure Never has a Single Point: 00:05:50Lessons Learned: 00:10:50Working the DOD:00:13:40Automation and Tools: 00:18:02Links:Kessel Run: https://kesselrun.af.milKessel Run LinkedIn: https://www.linkedin.com/company/kesselrun/TranscriptOmar: But I’ll answer as much as I can. And we’ll go from there.Jason: Yeah. Awesome. No spilling state secrets or highly classified info.Omar: Yes.Jason: Welcome to Break Things on Purpose, a podcast about chaos engineering and building reliable systems.Jason: Welcome back to Break Things on Purpose. Today with us we have guest Omar Marrero. Omar, welcome to the show.Omar: Thank you. Thank you, man. Yeah, happy to be here.Jason: Yeah. So, you’ve been doing a ton of interesting work, and you’ve got a long history. For our listeners, why don’t you tell us a little bit more about yourself? Who are you? What do you do?Omar: I’ve been in the military, I guess, public service for a while. So, I was military before, left that and now I’ve joined as a government employee. I love what I do. I love serving the country and supporting the warfighters, making sure they have the tools. And throughout my career, it’s been basically building tools for them, everything they need to make their stuff happen.And that’s what drives me. That’s my passion. If you’ve got the tool to do your mission, I’m in and I’ll make that happen. That’s kind of what I’ve done for the whole of my career, and chaos has always been involved there in some fashion. Yeah, it’s been a pretty cool run.Jason: So, you’re currently doing this at a company called Kessel Run. Tell us a little bit more about Kessel Run.Omar: So, we deliver combat capability that can sense or respond to conflict in any domain, anywhere, any time. Or deliver award-winning software that our warfighters love. So, Kessel Run’s kind of… you might think of it as a software factory within the DOD. So, the whole creation of Kessel Run is to deliver quickly, fast. If you follow the news, you know DOD follows waterfall a little bit.So, the whole creation of Kessel Run was to change that model. And that’s what we do. We deliver continuously non-stop. Our users give us feedback and within hours, they got it. So, that’s the nature behind Kessel Run. It’s like a hybrid acquisition model within the government.Jason: So, I’m curious then, I mean, you obviously aren’t responsible for the company naming, but I’m sure many of our listeners being Star Wars fans are like, “Oh, that sounds familiar.” Omar: Yep, yep.Jason: If you haven’t checked out Kessel Run’s website, you should go do that; they have a really cool logo. I’m guessing that relates to just the story of Kessel Run being like, doing it really fast and having that velocity, and so bringing that to the DOD, is that the connection?Omar: Actually, it goes into the smuggling DevSecOps into the DOD, so the 12 parsecs. So, that’s where it comes from. So, we are smuggling that DevSecOps into the DOD; we’re changing that model. So, that’s where it comes from.Jason: I love that idea of we’re going to take this thing and smuggle it in, and that rebellious nature. I think that dovetails nicely into the work that you’ve been doing with chaos engineering. And I’m curious, how did you get into chaos engineering? Where did you get your start?Omar: I’ve been breaking things forever. So, part of that they deliver tools that our warfighters can use, that’s been my jam. So, I’ve been doing, you can say, chaos forever. I used to walk around, unplug power cables, network cables, turn down [WAN 00:03:24]. Yeah, that was it.Because we used to build these tools and they’re like, “Oh, I wonder if this happens.” “All right, let’s test it out. Why not?” Pull the cable and everybody would scream and say, “What are you doing?” It was like, “We figured it out.”But yeah, I’ve been following chaos engineering for a while, ever since Netflix started doing it and Chaos Monkey came out and whatnot, so that’s been something that’s always been on my mind. It’s like, “Ah, this would be cool to bring into the DOD.” And Kessel Run just made that happen. Kessel Run, the way we build tools, our distributed system was like, “Yep, this is the prime time to bring chaos into the DOD.” And Kessel Run just adopted it.I tossed the idea, I was like, “Hey, we should bring chaos into Kessel Run.” And we slowly started ramping up, and we build a team for it; team is called Bowcaster. So, we follow the breaking stuff. And that’s it. So, we’ve matured, and we’ve deployed and, of course, we’ve learned on how to deploy chaos in our different environments. And I mean, yeah, it’s been a cool run.Jason: Yeah, I’m curious. You mentioned starting off simply, and that’s always what we recommend to people to do. Tell us a little bit more about that. What were some of the tests that you ran then, and then maybe how have they matured, and what have you moved into?Omar: So, our first couple of tests were very simple. Hey, we’re going to test a database failover, and it was really manual at that point. We would literally go in and turn off Database A and see what happened. So, it was very basic, very manual work. We used to record them so we can show them off like, “Hey, check this out. This is what we did.”So, from there, we matured. We got a little bit more complex. We eventually got to the point where we were actually corrupting databases in production and seeing what happens. You should have seen everybody’s faces when we proposed that. So, from there, we’re running basically, we call it ‘Chaos Plus’ in Kessel Run.So, we’ve taken chaos engineering, the concept of chaos engineering, right, breaking things on purpose, but we’ve added performance engineering on top of it, and we’ve added cybersecurity testing on top of it. So, we can run a degraded system, and at the same time say, “All right, so we’re going to ramp up and see what a million users does to our app while it’s fully degraded.” And then we would bring in our cyber team and say, “All right, our system is degraded. See if you can find a vulnerability in it.” So, we’ve kind of evolved.And I call it, put chaos on a little bit of steroids here. But we call it Chaos Plus; that’s our thing. We’ve recently added fuzzing while we’re doing chaos. So, now we got performance chaos, our cyber team, and we’re fuzzing the systems. So, I’m just going to keep going until somebody screams at me and says, “Omar, that’s too much.” But that’s essentially a little bit of our ride in Kessel Run.Jason: That’s amazing. I love that idea of we’re going to do this test, and then we’re going to see what else can happen. One of the things that I’ve been chatting with a bunch of folks recently about is this idea, we always talk about, especially in the resilience engineering space, that failure never has a single point. It’s not a singular root cause; it’s always contributing factors. And the problem is, when you’re doing chaos eng...

Aug 26, 2021 • 30min

Carmen Saenz

In this episode, we cover:Intro and an Anecdote: 00:00:27Early Days of Chaos Engineering: 00:04:13Moving to the Cloud and Important Lessons: 00:07:22Always Learning and Teaching: 00:11:15Figuring Out Chaos: 00:16:30Advice: 00:20:24Links:Apex: https://www.apexclearing.comLinkedIn: https://www.linkedin.com/in/mdcsaenzTranscriptJason: Welcome to the Break Things on Purpose podcast, a show about chaos engineering and operating reliable systems. In this episode, Ana Medina is joined by Carmen Saenz, a senior DevOps engineer at Apex Clearing Corporation. Carmen shares her thoughts on what cloud-native engineers can learn from our on-prem past, how she learned to do DevOps work, and what reliable IT systems look like in higher education.Ana: Hey, everyone. We have a new podcast today, we have an amazing guest; we have Carmen Saenz joining us. Carmen, do you want to tell us a little bit about yourself, a quick intro?Carmen: Sure. I am Carmen Saenz. I live in Chicago, Illinois, born and raised on the south side. I am currently a senior DevOps engineer at Apex and I have been in high-frequency trading for 11 out of 12 years.Ana: DevOps engineers, those are definitely the type of work that we love diving in on, making sure that we’re keeping those systems up-to-date. But that really brings me into one of the questions we love asking about. We know that in technology, we sometimes are fighting fires, making sure our engineers can deploy quickly and keep collaboration around. What is one incident that you’ve encountered that has marked your career? What exactly happened that led up to it, and how is it that your team went ahead and discovered the issue?Carmen: One of the incidents that happened to us was, it was around—close to the beginning of the teens [over 00:01:23] 2008, 2009, and I was working at a high-frequency trading firm in which we had an XML configuration that needed to be deployed to all the machines that are on-prem at the time—this was before cloud—that needed to connect to the exchanges where we can trade. And one of the things that we had to do is that we had to add specific configurations in order for us to keep track of our trade position. One of the things that happened was, certain machines get a certain configuration, other machines get another configuration. That configuration wasn’t added for some machines, and so when it was deployed, we realized that they were able to connect to the exchange and they were starting to trade right away. Luckily, someone noticed from external system that we weren’t getting the positions updates.So, then we had to bring down all these on-prem machines by sending out a bash script to hit all these specific machines to kill the connection to the exchange. Luckily, it was just the beginning of the day and it wasn’t so crazy, so we were able to kill them within that minute timeframe before it went crazy. We realized that one of the big issues that we had was, one, we didn’t have a configuration management system in order to check to make sure that the configurations we needed were there. The second thing that we were missing is a second pair of eyes. We need someone to actually look at the configuration, PR it, and then push it.And once it’s pushed, then we should have had a third person as we were going through the deployment system to make sure that this was the new change that needed to be in place. So, we didn’t have the measures in place in order for us to actually make sure that these configurations were correct. And it was chaos because you can lose money because you’re down when the trading was starting in the day. And it was just a simple mistake of not knowing these machines needed a specific configuration. So, it was kind of intense, those five minutes. [laugh].Ana: [laugh]. So, amazing that y’all were able to catch it so quickly because the first thing that comes to mind, as you said, before the cloud—on-prem—and it’s like, do we start needing to making ‘BC’, like, ‘Before Cloud’ times when we talk about incidents? Because I think we do. When we look at the world that we live in now in a more cloud-native space, you tell someone about this incident, they’re going to look at us and say, “What do you mean? I have containers that manage all my config management. Everything’s going to roll out.”Or, “I have observability that’s going to make us be resilient to this so that we detect it earlier.” So, with something like chaos engineering, if something like this was to happen in an on-prem type of data center, is there something that chaos engineering could have done to help prepare y’all or to avoid a situation like this?Carmen: Yeah. One of the things that I believe—the chaos engineering, for what it’s worth, I didn’t actually know what chaos engineering was till 2012, and the specific thing that you mentioned is actually what they were testing. We had a test system, so we had all these on-prem machines and different co-locations in the country. And we would take some of our test systems—not the production because that was money-based but our test systems that were on simulated exchanges—and what would we do to test to make sure our code was up-to-date is we actually had a Chaos Monkey to break the configuration.We actually had a Chaos Monkey and it would just pick a random function to run that day. It would be either send a bad config to a machine or bring down a machine by disconnecting its connection, doing a networking change in the middle to see how we would react. [unintelligible 00:05:01] with any machine in our simulation. And then we had to see how it was going to react with the changes that was happening, we had to deduce, we had to figure out how to roll it back. And those are the things that we didn’t have at the time. In 2012—this was another company I was working for in high-frequency trading—and they implemented chaos engineering in that simulation, specifically for them, we would catch these problems before we hit production. So yeah, that’s definitely was needed.Ana: That’s super awesome that a failure encountered four years prior to your next company, you ended up realizing, wait, if this company actually follows what they do have of let’s roll out a bad deploy; how does our system actually engage with it? That’s such an amazing learning experience. Is there anything more recent that you’ve done in chaos engineering you’d want to share about?Carmen: Actually, since I’ve just started at this company a couple of months ago, I haven’t—thankfully—run into anything, so a lot of my stories are more like war stories from the PC days. So.Ana: Do you usually work now, mostly on-prem systems or do you find yourself in hybrid environments or cloud type of environments?Carmen: Recently, in the last three to four years I spent in cloud-only. I rarely have to encounter on-prem nowadays. But coming from an on-prem world to a cloud world, it was completely different. And I feel with the tools that we have now we have a lot of built-in checks and balances in which even with us trying to manually delete a node in our cluster, we can see our systems auto-heal because cloud engineering tries to attempt to take care of that for us, or with, you know, infrastructur...

Aug 10, 2021 • 18min

Zack Butcher

Welcome back to another edition of “Build Things on Purpose.” This time Jason is joined by Zack Butcher, a founding engineer at Tetrate. They also break down Istio’s ins and outs and the lessons learned there, the role of open source projects and their reception, and more. Tune in to this episode and others for all things chaos engineering!In this episode, we cover:Istio's History: (1:00)Lessons from Istio: (6:55)Implmenting Istio: (11:26)Links:Tetrate: http://tetrate.ioIstio: https://istio.ioTwitter: https://twitter.com/zackbutcherEpisode Transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-zack-butcher-founding-engineer-at-tetrate/

Jul 27, 2021 • 28min

Paul Marsicovetere

Podcast Twitter: https://twitter.com/BTOPPodPodcast email: podcast@gremlin.comPaul's Twitter: https://twitter.com/paulmarsicloudEpisode highlights:Accidental SRE (1:41)Migrating to AWS (4:18)Prod vs Non-prod (8:43)Mentoring and advice (12:57)Failure is normal (19:41)Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-paul-marsicovetere-senior-cloud-infrastructure-engineer-at-formidable/

Jul 13, 2021 • 25min

Taylor Dolezal

Podcast Twitter: https://twitter.com/BTOPPodPodcast email: podcast@gremlin.comTaylor's Twitter: https://twitter.com/onlydoleEpisode highlights:It's always DNS (2:28)Focus on learning (9:29)Chaos Engineering and improvement (11:25)Tips for learning (16:33)More Chaos Engineering (21:53)Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-taylor-dolezal-senior-developer-advocate-at-hashicorp

Jun 29, 2021 • 6min

The Hill You'll Die On

Podcast Twitter: https://twitter.com/BTOPPodPodcast email: podcast@gremlin.comEpisode guests:Brian Holt (@holtbt): https://www.gremlin.com/blog/podcast-break-things-on-purpose-brian-holt-principal-program-manager-at-microsoftJérôme Petazzoni (@jpetazzo): https://www.gremlin.com/blog/podcast-break-things-on-purpose-jerome-petazzoni-tinkerer-and-container-technology-educatorJ Paul Reed (@jpaulreed): https://www.gremlin.com/blog/blog/podcast-break-things-on-purpose-j-paul-reed-sr-applied-resilience-engineer-at-netflixEpisode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-the-hill-youll-die-on/

Jun 15, 2021 • 6min

Taylor Dolezal Terraform Special

Podcast Twitter: https://twitter.com/BTOPPodPodcast email: podcast@gremlin.comTaylor's Twitter: https://twitter.com/onlydoleFor more information about Terraform, see https://terraform.ioEpisode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-taylor-dolezal-terraform-special-episode/Listen to our episode with Armon Dadgar, CTO and co-founder of Hashicorp: https://www.gremlin.com/blog/podcast-break-things-on-purpose-armon-dadgar-cto-and-co-founder-of-hashicorp/

May 18, 2021 • 16min

Jose Nino

Podcast Twitter: https://twitter.com/BTOPPodPodcast email: podcast@gremlin.comJose's Twitter: https://twitter.com/junr03Episode highlights:Balancing consistency and complexity (1:22)Extending consistency to mobile clients (3:07)Know what you're trying to solve (9:50)Episode transcript: https://gremlin.com/blog/podcast-break-things-on-purpose-jose-nino-staff-software-engineer-at-lyft

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

App store banner

Play store banner