

AWS Morning Brief
Corey Quinn
The latest in AWS news, sprinkled with snark. Posts about AWS come out over sixty times a day. We filter through it all to find the hidden gems, the community contributions--the stuff worth hearing about! Then we summarize it with snark and share it with you--minus the nonsense.
Episodes

Jul 31, 2020 • 12min
Whiteboard Confessional: The Bootstrapping Problem
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH
@QuinnyPig

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.

Corey: This episode is brought to you by Trend Micro Cloud One™, a security services platform for organizations building in the Cloud. I know you're thinking that that's a mouthful, because it is, but what's easier to say? “I'm glad we have Trend Micro Cloud One™, a security services platform for organizations building in the Cloud,” or, “Hey, bad news. It's going to be a few more weeks. I kind of forgot about that security thing.” I thought so. Trend Micro Cloud One™ is an automated, flexible, all-in-one solution that protects your workloads and containers with cloud-native security. Identify and resolve security issues earlier in the pipeline, and access your cloud environments sooner, with full visibility, so you can get back to what you do best, which is generally building great applications. Discover Trend Micro Cloud One™, a security services platform for organizations building in the Cloud. Whew. At trendmicro.com/screaming.

Hello, and welcome to this edition of the AWS Morning Brief: Whiteboard Confessional, where we confess the various architectural sins that we and others have committed. Today, we're going to talk about, once upon a time, me taking a job at a web hosting provider. It was the thing to do at the time because AWS hadn't eaten the entire world yet, so everything that we talk about today was still a little ways in the future. Going ahead and building out something in a data center was a more reasonable approach back then, especially for those whose budgets didn't stretch to infinity, or who weren't willing to be early adopters of someone else's hosting nonsense. Now, they themselves were obviously not hosting on top of a cloud provider, because the economics made less than no sense back then. So, instead, they had multiple data centers built out that provided for customers' various hosting needs. Each one of these was relatively self-contained unless customers wound up building something themselves for failover. So, it wasn't really highly available so much as it was a bunch of different single points of failure; an outage of one would impact some subset of their customers, but not all of them. And that was a fairly reasonable approach, provided that you communicate that scenario to your customers, because that's an awful surprise to spring on them later.
Now, I was brought in as someone who had had some experience in the industry, unlike many of my colleagues, who had come up from the hosting provider’s support floor and been promoted into systems engineering roles. So, I was there to be the voice of industry best practices, which is a terrifying concept when you realize that I was nowhere near as empathetic or aware back then as I am now, but you get what you pay for. And my role was to apply all of those different best practices that I had observed, and osmosed, and had bluffed, to what this company was doing, and see how it fit, in a way that was responsible, engaging, and possibly entertaining.

So, relatively early on in my tenure, I was taking a tour of one of our local data centers and was asked what I thought could be improved. Now, as a sidebar, I want to point out that you can always start looking at things and pointing out how terrible they are, but let's not kid ourselves; we very much don't want to do that, because there are constraints that shape everything we do, and we aren't always aware of them. Making people feel bad for their choices is never a great approach if you want to stick around very long. So, instead, I started from the very beginning and played, “Hi. I'm going to ask the dumb questions and see where the answers lead me.”

I started off with, “Great, scenario time. The power has just gone out, so everything's dark. Now, how do we restart the entire environment?” And the response was, “Oh, that would never happen.” And to be clear, that's the equivalent of standing on top of a mountain during a thunderstorm, cursing God while waving a metal rake into the sky. After you say something like that, there is no disaster that is likelier. But all right, let's defuse that. “Humor me. Where's the runbook?” And the answer is, “Oh, it lives in Confluence,” which is Atlassian’s wiki offering. For those who aren't aware, wikis in general, and Confluence in particular, are where documentation and processes go to die. “These are living documents,” is a lie that everyone tells, because that's not how it actually works.

“Cool. Okay, so let's pretend that a single server, instead of your whole data center, explodes and melts. When everything's been powered off and you turn it back on, one server doesn't survive the inrush current and explodes. That server happens to be the Confluence server. Now what? How do we bootstrap the entire environment?” The answer was, “Okay, we started printing out that runbook and keeping it inside each data center,” which was a way better option. Now, the trick was to make sure that you revisited this every so often, whenever something changed, so that you weren't looking at how things were circa five years ago, but that's a separate problem. And this is fundamentally a microcosm of what I've started to think of as the bootstrapping problem. I'll talk to you a little bit more about what those look like in the context of my data center atrocities. But first:

This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. Whether you're looking to process multiple terabytes (even petabyte-scale) of data a day or just a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics.
Now, if you do a direct cost comparison, you're going to save 70 to 80 percent on the infrastructure costs, which does not include the actual expense of p...
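The bootstrapping problem above has a mostly mechanical fix: make the out-of-band runbook copy something a machine produces on a schedule, rather than something a person remembers to reprint. Below is a minimal sketch of that idea in Python against Confluence's REST content API; the domain, page ID, and credentials are hypothetical placeholders, and any real deployment would make its own auth and storage choices.

```python
# A minimal sketch, assuming a Confluence Cloud instance. The domain,
# page ID, and credentials below are hypothetical placeholders.
import requests

CONFLUENCE_BASE = "https://example.atlassian.net/wiki"  # hypothetical domain
RUNBOOK_PAGE_ID = "123456"                              # hypothetical page ID
AUTH = ("runbook-bot@example.com", "api-token-here")    # hypothetical API token


def export_runbook(dest_path: str) -> None:
    """Pull the runbook page out of the wiki and save a standalone copy."""
    resp = requests.get(
        f"{CONFLUENCE_BASE}/rest/api/content/{RUNBOOK_PAGE_ID}",
        params={"expand": "body.storage,version"},
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()
    page = resp.json()
    # Stamping the wiki version into the copy makes a stale printout
    # easy to spot, which was exactly the follow-on problem Corey names.
    version = page["version"]["number"]
    body = page["body"]["storage"]["value"]
    with open(dest_path, "w", encoding="utf-8") as f:
        f.write(f"<!-- {page['title']} / wiki version {version} -->\n")
        f.write(body + "\n")


if __name__ == "__main__":
    # Run this on a schedule from a host that does NOT live inside the
    # environment the runbook describes, then print or sync the output.
    export_runbook("datacenter-bootstrap-runbook.html")
```

The specific wiki doesn't matter; what matters is that the copy is version-stamped and lands somewhere that doesn't depend on the environment it describes being up.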

Jul 27, 2020 • 9min
AWS re:Lease The Kraken
AWS Morning Brief for the week of July 27, 2020.

Jul 24, 2020 • 12min
Whiteboard Confessional: The Worst Thing You’ll See on Any Whiteboard
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH
@QuinnyPig

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.

This episode is brought to you by Trend Micro Cloud One™, a security services platform for organizations building in the Cloud. I know you're thinking that that's a mouthful, because it is, but what's easier to say? “I'm glad we have Trend Micro Cloud One™, a security services platform for organizations building in the Cloud,” or, “Hey, bad news. It's going to be a few more weeks. I kind of forgot about that security thing.” I thought so. Trend Micro Cloud One™ is an automated, flexible, all-in-one solution that protects your workloads and containers with cloud-native security. Identify and resolve security issues earlier in the pipeline, and access your cloud environments sooner, with full visibility, so you can get back to what you do best, which is generally building great applications. Discover Trend Micro Cloud One™, a security services platform for organizations building in the Cloud. Whew. At trendmicro.com/screaming.

Welcome to the AWS Morning Brief: Whiteboard Confessional. I am Cloud Economist Corey Quinn, which means that I fix the horrifying AWS bill, both by making it more understandable and by making it lower. On today's episode, I looked around at the whiteboards in the backgrounds of the Zoom calls that I'm having with basically everyone these days, because going to the office during a pandemic is and remains a deadly risk, and it's amazing how much you can learn about people's companies from what they leave on their whiteboards, whether you happen to be visiting their office or it's left inattentively in the background because they forgot to turn on a Zoom virtual background. One of the most disturbing things that you'll see on any whiteboard, at any company you ever work at, is an org chart. And what makes it disturbing, first off, is that when you see an org chart, it generally means someone is considering reorganizing, which is a polite framing of shuffling the deck chairs on the Titanic. It ties into one of the great corporate delusions: that somehow you're going to start immediately making good decisions, and all of the previous poor decision-making you've done is going to fall away like dew in the new morning.
And the reason that that's the case is that everyone tends to be an optimist when looking forward, because otherwise we'd wake up crying and never go to work.

Have you ever noticed that you can take a look at an org chart or an architecture diagram, remove all of the labels, and find that you've accidentally drawn the exact same thing, just with different services rather than teams? Well, I'm certainly not the first person to make this observation. What I'm talking about is known as Conway's Law. Conway's Law is named after computer programmer Melvin Conway, who in 1967 introduced a phenomenal idea, for the time, that we still haven’t escaped from. Specifically: any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure. Effectively, what that means is you ship your culture as well as your org chart, and if we take a look at how different seminal software products of the ages have come out, it's pretty clear that there is at least some passing resemblance to reality.

Take a look at Amazon; they're effectively an entire microservices company. They have so many different small two-pizza teams building things, and sure enough, you take a look at AWS, for example, and they have 200-some-odd services that are ideally production-grade, but again, it's a mixed bag, because not every team is identical, and not every team has the same resources. Still, that is the good part of their culture. Well, what's bad? Anything that requires all of those teams to coordinate at once on something. Think of the billing system. Think of the AWS web console. You start to see where these things break down. These are the seams between services that AWS tends to miss out on. If you take a look at Google, for example, the entire model there, to my understanding, is that you want to get promoted and you want to get a raise, and that all comes down to certain metrics that don't necessarily align with what people want to be working on. So, you see people instead focusing on the things they're incentivized to do to move up in the org, and not maintaining the things that they built last year, which is why, I suspect at least, we see this never-ending wave of Google product deprecations. And the list goes on. I'm certainly not a corporate taxonomist; I'm a cloud economist, so I'm not going to go into too much depth on what that looks like in different places, but it does become telling. Let's get into that a bit more. But first:

This episode is sponsored in part by ChaosSearch. Now, their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before: ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.

Now, one thing that I sort of pride myself on being—because I have to be—is a data center archaeologist.
Frankly, these days it’s cloud archaeology. But when I go into a new client environment, I ask them to show me the...

Jul 20, 2020 • 8min
AI/ML Marketing Algorithm Continues to Malfunction
AWS Morning Brief for the week of July 20, 2020.

Jul 17, 2020 • 14min
Whiteboard Confessional: The Right and Wrong Way to Interview Engineers
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH
@QuinnyPig

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.

Sponsorships can be a lot of fun sometimes. ParkMyCloud asked, “Can we have one of our execs do a video webinar with you?” My response was, “Here’s a better idea. How about I talk to one of your customers instead, so you can pay me to make fun of you.” And it turns out, I’m super-convincing. So, that’s what’s happening. Join me and ParkMyCloud’s customer, Workfront, on July 23rd for a no-holds-barred discussion about how they’re optimizing AWS costs, and whatever other fights I manage to pick before ParkMyCloud realizes what’s going on and kills the feed. Visit parkmycloud.com/snark to register. That’s parkmycloud.com/snark.

Welcome. I am Cloud Economist Corey Quinn, and this is the AWS Morning Brief: Whiteboard Confessional; things that we see on whiteboards that we wish we could unsee. Today I want to talk about the worst whiteboard confessions of all time, and those invariably tend to circle around what we ask candidates to do on a whiteboard during job interviews. There are a whole bunch of objections, problems, and other varieties of crappy opinions around whiteboarding as part of engineering job interviews, but they're all part of a larger problem, which is that interviewing for engineering jobs fundamentally sucks. There are enough Medium articles on how trendy startups have cracked the interview to fill an S3 bucket. So, I'm going to take the contrarian position that all of these startups, and all of these people who claim to have solved the problem, suck at it.

And these terrible questions fall into a few common failure modes, most of which I saw when they were levied at me back in my engineering days, when I was exercising my core competency of getting rapidly ejected from other companies. So, I spent a lot of time doing job interviews, and I kept seeing some of the same things appear. And they're all, of course, different, but let’s start with some of the patterns. The most obnoxious one by far is the open-ended question of how you would solve a given problem. And as you start answering the question, they're paying more attention than you would expect. Maybe someone's on their laptop, quote-unquote ‘taking notes,’ an awful lot. And I can't ever prove it, but it feels an awful lot, based upon the question, like this is the kind of problem where you could suddenly walk out of the interview room, walk into the conference room next door, and find a bunch of engineers currently in a war room trying to solve the question you were just asked.
And what I hate about this pattern is that it's a way of weaseling free work out of interview candidates. If you want people to work on real-world problems, pay them. One of the best interviews I ever had was with a company that no longer exists called Three Rings Design. They wanted me to stand up some three-tiered web app, turn on monitoring, and do standard basic DevOps-style stuff; they gave me an AWS account credential set. But what made them stand out is that they said, “Well, this is a toy problem. We're not going to use this in production, surprise. It's a generic question. But we don't believe in asking people to do work for free, so when you submit your results, we'll pay you a few hundred bucks.” And this was a standard policy they had for everyone who made it to that part of the interview. It was phenomenal, and I loved that approach. It solved that problem completely. But it's the only time I've ever seen it in my entire career.

A variant of this horrible technique is to introduce the same type of problem, but with the proviso that this is a production problem that they had a few months ago. It's gone now, but how would you solve it? Now, on its face, this sounds like a remarkably decent interview question. It's real-world. They've already solved it. So, whatever answer you give is not likely to be something revolutionary that's going to change how they approach things. So, what's wrong with it? Well, the problem is that in most cases, the right answer is going to look suspiciously like whatever they did to solve the problem when it manifested. I answered a question like this once with, “Well, what would strace tell me?” And the response was, “What does strace do?” I explained that it attaches to processes and looks at the system calls that the process is making, and their response was, “Oh, yeah, that would have caught the problem. Huh. Okay, pretend strace doesn't exist.”

Now it's not just a question of how you would solve the problem, but how you would solve the problem while being limited to their particular, often myopic, view of how systems work and how infrastructure needs to play out. This manifests itself the same way, by the way, in the world of various programming languages and traditional developer engineering. It's awful because it winds up forcing you to see things through a perspective that you may not share. Part of the value of having you go and work somewhere is to bring your unique perspective. And, surprise, there are all these books on how to pass the technical interview. There are many fewer books on how to freaking give one that doesn't suck. And I wish that some people would write that book and that everyone else would read it. You can tell an awful lot about a company by how they go about hiring their people.

Corey: This episode is sponsored in part by ChaosSearch. Now, their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before: ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and th...
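As an aside for anyone who, like that interviewer, hasn't met strace: it attaches to a running process and records the system calls the process makes, which is exactly why it would have caught their problem. Here's a rough sketch of wrapping that diagnostic in Python; this is my illustration, not anything from the episode, and it assumes strace and coreutils' timeout are installed and that you're permitted to ptrace the target process.

```python
# A rough sketch of the strace diagnostic described above: attach to a
# live process, sample its system calls for a few seconds, and print
# strace's per-syscall summary table. Requires permission to trace the
# target (typically root, or the same user with a permissive
# kernel.yama.ptrace_scope setting).
import subprocess
import sys


def syscall_summary(pid: int, seconds: int = 10) -> str:
    """Attach `strace -c` to `pid` for `seconds`; return the summary table."""
    result = subprocess.run(
        ["timeout", str(seconds), "strace", "-c", "-f", "-p", str(pid)],
        capture_output=True,
        text=True,
    )
    # strace writes its output, including the -c summary, to stderr;
    # timeout's SIGTERM makes strace detach cleanly and print it.
    return result.stderr


if __name__ == "__main__":
    print(syscall_summary(int(sys.argv[1])))
```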

Jul 13, 2020 • 8min
AWS Machine Learning Your Business From Inside
AWS Morning Brief for the week of July 13, 2020.

Jul 10, 2020 • 13min
Whiteboard Confessional: The Curious Case of the 9,000% AWS Bill Increase
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH
@QuinnyPig
Chris Short’s LinkedIn: https://www.linkedin.com/in/thechrisshort/

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.

This episode is sponsored in part by ParkMyCloud, fellow worshipers at the altar of turn that [BLEEP] off. ParkMyCloud makes it easy for you to ensure you're using public cloud like the utility it's meant to be. Just like water and electricity, you pay for most cloud resources when they're turned on, whether or not you're using them. Also just like water and electricity, keep them away from the other computers. Use ParkMyCloud to automatically identify and eliminate wasted cloud spend from idle, oversized, and unnecessary resources. It's easy to use and to start reducing your cloud bills. Get started for free at parkmycloud.com/screaming.

When you're building on a given cloud provider, you're always going to have concerns. If you're building on top of Azure, for example, you're worried your licenses might lapse. If you're building on top of GCP, you're terrified that they're going to deprecate all of GCP before you get your application out the door. If you're building on Oracle Cloud, you're terrified they'll figure out where you live and send a squadron of attorneys to sue you just on general principle. And if you build on AWS, you're constantly living in fear, at least in a personal account, that they're going to surprise you with a bill that's monstrous.

Today, I want to talk about a particular failure that a friend of this podcast named Chris Short experienced. Chris is not exactly a rank neophyte to the world of cloud. He currently works at IBM Hat, which I'm told is the post-merger name. He was deep in the Ansible community. He's a Cloud Native Computing Foundation Ambassador, which means that every third word out of his mouth is now contractually obligated to be Kubernetes.

He was building out a static website hosting environment in his AWS account, and it was costing him between $10 and $30 a month. That's right in line with what mine tends to cost. And he wound up getting his bill at the end of the month: “Welcome to July, time to get your bill,” and it was a bit higher. Instead of $30, or even $40 a month, it was $2,700. And now there was actual poop found in his pants.

This is a trivial amount of money to most companies; even a small company, and I say this from personal experience, runs on burning piles of money. However, a personal account is a very different thing. This is more than most people's mortgage payments, if you don't make terrible decisions like I do and live in San Francisco.
This is an awful lot of money, and his immediate response was equivalent to mine. First, he opened a ticket with AWS support, which is an okay thing to do. Then he immediately turned to Twitter, which is the better thing to do, because it means that suddenly these stories wind up in the public eye.

I found out roughly 10 seconds later, as my notifications blew up with everyone saying, “Hey, have you met Corey?” Yes, Chris and I know each other. We're friends. He wrote the DevOps’ish newsletter for a long time, and the secret cabal of DevOps-y type newsletters runs deep. We secretly run all kinds of things that aren't the billing system for cloud providers.

So, he hits the batphone. I log into his account once we get a credential exchange going, and I start poking around because, yeah, generally speaking, a 100x bill increase isn't typical. And what I found was astonishing. He was effectively only running a static site with S3 in this account, making the contents publicly available, which is normal. This is a stated use case for S3, despite the fact that the console is going to shriek its damn fool head off at you at every opportunity that you have exposed an S3 bucket to the world.

Well, yes, that is one of its purposes. It is designed to stand there, or sit there depending on what a bucket does—lay there, perhaps—and provide a static website to the world. Now, in a two-day span, someone or something downloaded data from this bucket, which is normal, but it was 30 terabytes of data, which is not. At approximately nine cents a gigabyte, this adds up to something rather substantial after free tier limits are exhausted; that's right: $2,700.

Now, the typical responses to what people should do to avoid bill shocks like this don't actually work. “Well, he should have set up a billing alarm.” Yeah, aspirationally; the AWS billing system runs on an eight-hour eventual-consistency model, which means that from the time the bill starts spiking, he has at least 8 hours, and in some cases as many as 24 to 48, before those billing alarms would fire. The entire problem took less time than that. So, at that point, it would be alerting after something had already happened.

“Oh, he shouldn't have had the bucket available to the outside world.” Well, as it turns out, he was fronting this bucket with Cloudflare. But what he hadn't done is restrict bucket access to Cloudflare’s endpoints, and for good reason. There's no way to say, “Oh, Cloudflare’s identity is going to be defined in an IAM managed policy.” He'd have to explicitly list out all of Cloudflare’s IP ranges, and hope and trust that those IP ranges never change despite whatever networking enhancements Cloudflare makes. It's a game of guess-and-check, and of having to build an automated system around it. Again, all he wanted to do was share a static website. I've done this myself. I continue to do this myself, and it costs me, on a busy month, pennies. In some rare cases, dozens of pennies.

Corey: This episode is sponsored in part by ChaosSearch. Now, their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for y...
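As a sanity check on the arithmetic: 30 terabytes is roughly 30,000 gigabytes, and at about $0.09 per gigabyte of data transfer out, that comes to roughly $2,700, which matches the bill. As for the "automated system" Corey describes having to build around Cloudflare's published IP ranges, a minimal sketch might look like the following; the bucket name is a hypothetical placeholder, while the two URLs are the range lists Cloudflare actually publishes.

```python
# A minimal sketch, assuming boto3 credentials are configured and the
# bucket allows policy-based public reads. The bucket name is a
# hypothetical placeholder.
import json

import boto3
import requests

BUCKET = "example-static-site"  # hypothetical bucket


def cloudflare_ranges() -> list[str]:
    """Fetch Cloudflare's currently published IPv4 and IPv6 edge ranges."""
    ranges: list[str] = []
    for url in ("https://www.cloudflare.com/ips-v4",
                "https://www.cloudflare.com/ips-v6"):
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        ranges.extend(resp.text.split())
    return ranges


def apply_policy() -> None:
    """Allow anonymous s3:GetObject only from Cloudflare's IP ranges."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowCloudflareReadsOnly",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {"IpAddress": {"aws:SourceIp": cloudflare_ranges()}},
        }],
    }
    boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))


if __name__ == "__main__":
    # Re-run on a schedule (cron, Lambda) so the allow-list tracks
    # Cloudflare's published ranges instead of relying on guess-and-check.
    apply_policy()
```

Run on a schedule, this turns the guess-and-check into a standing job, though it still quietly trusts that Cloudflare updates those lists before shifting traffic, which is exactly the fragility the episode is complaining about.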

Jul 6, 2020 • 10min
Kicking AWS's ASS into Space
AWS Morning Brief for the week of July 6, 2020.

Jul 3, 2020 • 13min
Whiteboard Confessional: The Day IBM Cloud Dissipated
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH
@QuinnyPig

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.

This episode is sponsored in part by ParkMyCloud, fellow worshipers at the altar of turn that [BLEEP] off. ParkMyCloud makes it easy for you to ensure you're using public cloud like the utility it's meant to be. Just like water and electricity, you pay for most cloud resources when they're turned on, whether or not you're using them. Also just like water and electricity, keep them away from the other computers. Use ParkMyCloud to automatically identify and eliminate wasted cloud spend from idle, oversized, and unnecessary resources. It's easy to use and to start reducing your cloud bills. Get started for free at parkmycloud.com/screaming.

Welcome to the AWS Morning Brief’s Whiteboard Confessional series. I am Cloud Economist Corey Quinn, and today's topic is going to be slightly challenging to talk about. One of the core tenets that we've always had around technology companies and working with SRE or operations-type organizations is, full stop, you do not make fun of other people's downtime, because today it's their downtime, and tomorrow it's yours. It's important. That's why we see the hashtag #HugOps on Twitter start to—well, not trend. It's not that well known, but it definitely happens fairly frequently when there's a well-publicized multi-hour outage that affects a company that people are familiar with.

So, what we're going to talk about is an outage that happened several weeks ago for IBM Cloud. I want to point out some failings on IBM’s part, but this is in the quote-unquote “sober light of day.” They are not currently experiencing an outage. They've had ample time to make public statements about the cause of the outage. And I've had time to reflect a little bit on what message I want to carry forward, given that there are definitely lessons for the rest of us to learn. HugOps is important, but it only goes so far, and at some point it's important to talk about the failings of large companies and their associated responses to crises so the rest of us can learn. Now, I'm about to dunk on them fairly hard, but I stand by the position that I'm taking, and I hope that it's interpreted in the constructive spirit that I intend.

For background, IBM Cloud is IBM's purported hyperscale cloud offering. It was effectively stitched together from a variety of different acquisitions, most notable among them SoftLayer. I've had multiple consulting clients who are customers of IBM Cloud over the past few years, and their experience has been, to put it politely, a mixed bag.
In practice, the invective that they would lob against it was something worse. Now, a month ago, something strange happened to IBM Cloud. Specifically, it went down. I don't mean that a service started having problems in a region; that tends to happen to every cloud provider, and it's important that we don't wind up beating them up unnecessarily for these things. No, IBM Cloud went down. And when I say that IBM Cloud went down, I mean the entire thing effectively went off the internet. Their status page stopped working, for example. Every resource that people had inside of IBM Cloud was reportedly down. And this was relatively unheard of in the world of global cloud providers. Azure and GCP don't have the same isolated network boundary per region that AWS has, but even in those cases, we tend to see rolling outages far more frequently than global outages affecting everything simultaneously. It's a bit uncommon.

What's strange is that their status page was down. Every point of access you had into looking at what was going on with IBM Cloud was down. Their Twitter accounts fell silent, other than pre-scheduled promotional tweets that were set to go out. It looked for all the world like IBM had just decided to pack up early, turn everything off on the way out of the office, and enjoy the night off. That obviously isn't what happened, but it was notable in that there was no communication for the first hour or so of the outage, and this was causing people to go more than a little bonkers.

One of the pieces that was interesting to me while this was happening, since it was impossible to get anything substantive or authoritative out of them, was that I pulled up their marketing site. Now, the marketing site still worked (apparently, it does not live on top of IBM Cloud), but it listed a lot of their marquee customers and case studies. I went through a quick sampling, and American Airlines was the only site that had a big outage notification on the front of it. Everything else seemed to be working. So, either the outage was not as widespread as people thought, or a lot of their marquee customers are only using them for specific components. Either one of those is compelling and interesting, but we don't have a whole lot of data to feed back into the system to draw reasonable conclusions.

Their status page itself, as mentioned, was down, and that's super bad. One of the early things you learn when running a large-scale system of any kind is that the thing that tells you—and the world—that you're down cannot have a dependency on any of the things that you are personally running. The AWS status page had this problem, somewhat hilariously, during the S3 outage a few years ago, when they had trouble updating what was going on due to that outage. I would imagine that's no longer the case, but one does wonder.

And most damning, and the reason I bring this up, is that the following day they posted the following analysis on their site: “IBM is focused on external network provider issues as the cause of the disruption of IBM Cloud services on Tuesday, June 9th. All services have been restored. A detailed root cause analysis is underway. An investigation shows an external network provider flooded the IBM Cloud network with incorrect routing, resulting in severe congestion of traffic, and impacting IBM Cloud services, and our data centers. Mitigation steps have been taken to prevent a recurrence. Root ...

Jun 29, 2020 • 11min
Oh, Honey; Help the Cops in us-west-3
AWS Morning Brief for the week of June 29, 2020.


