

AWS Morning Brief
Corey Quinn
The latest in AWS news, sprinkled with snark. Posts about AWS come out over sixty times a day. We filter through it all to find the hidden gems, the community contributions--the stuff worth hearing about! Then we summarize it with snark and share it with you--minus the nonsense.
Episodes

Dec 23, 2019 • 15min
It's a Horrible Lyfebin
AWS Morning Brief for the week of December 23rd, 2019.

Dec 19, 2019 • 17min
Networking in the Cloud Fundamentals: Regions and Availability Zones in AWS
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Transcript
Corey: Hello, and welcome back to our Networking in the Cloud miniseries, sponsored by ThousandEyes. That's right: ThousandEyes' State of the Cloud Performance Benchmark Report is now available for your perusal. It provides a lot of the baseline data this miniseries draws from. It pointed us in a bunch of interesting directions and helps us tell stories that are actually, for a change, backed by data rather than pure sarcasm. To get your copy, visit snark.cloud/realclouds, because it only covers real cloud providers. Thanks again to ThousandEyes for their ridiculous support of this shockingly informative podcast miniseries.

It's a basic fact of cloud that things break all the time. I've been joking for a while that a big competitive advantage Microsoft brings to this space is that they have 40 years of experience apologizing for software failures, except that's not really a joke. It's true. There's something to be said for the idea that apologizing to both technical and business people about real or perceived failures is its own skillset, and they have more experience at it than anyone else in this space.

There are two schools of thought around how to avoid having to apologize for service or component failures to your customers. The first is to build super expensive but super durable things. You can kind of get away with this in typical data center environments right up until you can't, and then it turns out that your SAN just exploded. You're really not diversifying with most SANs; you're just putting all of your eggs in a really expensive basket, and of course, if you're hit with a power or networking outage, nothing can talk to the SAN, and you're back to square one.

The other approach is to come at it from the perspective of building redundancy into everything and eliminating single points of failure. That's usually the better path in the cloud. You don't ever want to have a single point of failure if you can reasonably avoid it, so going with multiple everythings starts to make sense, to a point. Going with a full-on multi-cloud story is a whole separate kettle of nonsense we'll get to another time. But at some point you realize you will have single points of failure, and you're not going to be able to solve for that. We still only have one planet going around one sun, for example. If either of those things explodes, well, computers aren't really anyone's concern anymore. However, betting the entire farm on one EC2 instance is generally something you'll want to avoid if at all possible.

In the world of AWS, there aren't data centers in the way that you or I would contextualize them. Instead, they have constructs known as availability zones, and those compose to form a different construct called regions.
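To make those two constructs concrete, here's a minimal sketch, assuming Python with boto3 installed and AWS credentials already configured (none of which comes from the episode itself), that walks the hierarchy: every region visible to your account, and the availability zones inside each one.

import boto3

# List every region this account can see, then the AZs inside each region.
ec2 = boto3.client("ec2", region_name="us-east-1")

for region in ec2.describe_regions()["Regions"]:
    region_name = region["RegionName"]
    regional_ec2 = boto3.client("ec2", region_name=region_name)
    zones = regional_ec2.describe_availability_zones()["AvailabilityZones"]
    print(region_name + ": " + ", ".join(z["ZoneName"] for z in zones))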
Presumably, other cloud providers have similar constructs over in non-AWS land, but we're focusing on AWS's implementation in this series, again, because they have a giant multi-year head start over every other cloud provider, and even that manifests in those other cloud providers comparing what they've built and how they operate to AWS. If that upsets you and you work at one of those other cloud providers, well, you should have tried harder. Let's dive into a discussion of data centers, availability zones, and regions today.

Take an empty warehouse and shove it full of server racks. Congratulations: you have built the bare minimum requirement for a data center at its most basic layer. Your primary constraint, and why it's a lot harder than it sounds, is power, and to a lesser extent, cooling. Computers aren't just crunching numbers; they're also throwing off waste heat, and you've got to think an awful lot about how to keep that heat out of the data center. At some point, you can't shove more capacity into that warehouse-style building simply because you can't cool it if it's all running at the same time. If your data center is particularly robust, meaning you didn't cheap out on it, you're going to have different power distribution substations that feed the building from different lines that enter the building at different corners. You'll see similar things with cooling as well: multiply redundant cooling systems.

One of the big challenges, of course, when dealing with this physical infrastructure is validating that what it says on the diagram is what's actually there in the physical environment. That can be a trickier thing to verify than you would hope. Also, if you have a whole bunch of systems sitting in that warehouse and you take a power outage, you have to plan for this thing known as inrush current. At steady state, computers generally draw a known quantity of power, but when you first turn them on, if you've ever dealt with data center servers, the first thing they do is power everything up to self-test. They sound like a jet fighter taking off as all the fans spin up. If you're not careful and all these things turn on at once, you'll see a giant power spike that winds up causing issues, blowing breakers, and maxing out consumption, so a staggered start becomes a concern as well. Having spent too much time in data centers, I am painfully familiar with the problem of how you safely and sanely recover from site-wide events, but that's a bit out of scope, thankfully, because in the cloud this is less of a problem.

Let's talk about the internet and getting connectivity to these things. This is the Networking in the Cloud podcast, after all. You're ideally going to have multiple providers running fiber lines to that data center, hoping to avoid fiber's natural predator, the noble backhoe. Now, ideally, all those fiber lines take different paths, but again, that's a hard thing to prove, so doing your homework is important. And here's something folks don't always consider: if you have 100-gigabit Ethernet links to each server, which is not cheap but doable, and you have 20 servers in a rack, each rack theoretically needs to be able to push at least two terabits at any given moment to every other rack, and most of them can't do that. They wind up having bottlenecking issues. As a result, when you have high-traffic applications speaking between systems, you need to make sure they're aware of something known as rack affinity.
In other words, are there bottlenecks between these systems, and how do you minimize them to make sure the crosstalk behaves responsibly? There are a lot of dragons in here, but let's hand-wave past all of it because we're talking about cloud here. The point is that there's an awful lot of nuance to running data centers, and AWS and other large cloud providers do a better job of it than you do. That's not me insulting your data center staff; that's just a fact. They have the scal...
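To put rough numbers on that rack-bandwidth point, here's a back-of-the-envelope sketch. The 20 servers and 100-gigabit links come from the episode; the uplink capacity is a purely hypothetical figure for illustration.

# Back-of-the-envelope oversubscription math for a single rack.
servers_per_rack = 20
nic_speed_gbps = 100          # 100-gigabit Ethernet per server (from the episode)
uplink_capacity_gbps = 400    # hypothetical rack uplink, purely illustrative

demand_gbps = servers_per_rack * nic_speed_gbps  # 2,000 Gbps, i.e., two terabits
ratio = demand_gbps / uplink_capacity_gbps

print(f"Worst-case demand leaving the rack: {demand_gbps} Gbps")
print(f"Oversubscription ratio: {ratio:.0f}:1")
# Anything above 1:1 means the servers can collectively saturate the uplink,
# which is why rack affinity matters for chatty, high-traffic workloads.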

Dec 16, 2019 • 11min
AWS Dep-Ric-ates Treasured Offering
AWS Morning Brief for the week of December 16th, 2019.

Dec 13, 2019 • 16min
re:Invent Wrap-up, Part 4
AWS Morning Brief for Friday, December 13th, 2019.

Dec 12, 2019 • 16min
Networking in the Cloud Fundamentals, Part 6
Transcript
Corey: Knock knock. Who's there? A DDoS attack. A DDoS a... Knock. Knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock.

Welcome to what we're calling Networking in the Cloud, episode six: How Things Break in the Cloud, sponsored by ThousandEyes. ThousandEyes recently launched their State of the Cloud Performance Benchmark Report, which effectively lets you compare and contrast performance and other aspects of the five large cloud providers: AWS, Azure, GCP, Alibaba, and IBM Cloud. Oracle Cloud was not invited because we are talking about real clouds here. You can get your copy of this report at snark.cloud/realclouds, and they compare and contrast an awful lot of interesting things. One thing that we're not going to compare and contrast, though, because of my own personal beliefs, is the outages of different cloud providers.

Making people feel crappy about their downtime (and companies are composed of people, by the way) is mean, first off. Secondly, if companies are shamed for outages, it in turn makes it far likelier that they won't disclose having suffered an outage. And when companies talk about their outages in constructive, blameless ways, there are incredibly valuable lessons we can all learn. So let's dive into this a bit.

If there's one thing that computers do well, better than almost anything else, it's break. And this is, and I'm not being sarcastic when I say this, a significant edge that Microsoft has when it comes to cloud. They have 40-some-odd years of experience in apologizing for software failures. That's not meant to insult Microsoft; it's what computers do: they break. And being able to explain that intelligently to business stakeholders is incredibly important. They're masters at that. They also have a 20-year head start on everyone else in the space. What makes this interesting and useful is that in the cloud, computers break differently than people would expect them to in a non-cloud environment.

Once upon a time, when you were running servers in data centers, if you saw everything suddenly go offline, you had some options. You could call the data center directly to see if someone cut the fiber; in case you were unaware, fiber optic cables' sole natural predator in the food chain is the mighty backhoe. So maybe something backhoed out some fiber lines, maybe the power is dead to the data center, maybe the entire thing exploded, burst into flames, and burned to the ground, but you can call people. In the cloud, it doesn't work that way. Here in the cloud, instead you check Twitter, because it's 3:00 AM and Nagios (the original Call of Duty) or PagerDuty calls you, because you didn't need that sleep anyway, telling you there is something amiss with your site.
So when a large cloud provider takes an outage and you're hanging out on Twitter at two in the morning, you can see DevOps Twitter come to life in the middle of the night as they chatter back and forth.

And incidentally, if that's you, understand a nuance of AWS availability zone naming. When people say things like "us-east-1a is having a problem" and someone else says, "No, I just see us-east-1c having a problem," you're probably talking about the same availability zone. Those letters change, non-deterministically, between accounts. You can pull zone IDs, and those are consistent. By and large, the shuffling was originally there to avoid problems like everyone picking A, as humans tend to do, or C getting a reputation as the crappy one.

So why would you check Twitter to figure out if your cloud provider is having a massive outage? Well, because honestly, the AWS status page is completely full of lies and gaslights you. It is as green as the healthiest Christmas tree you can imagine, even when things have been exploding for a disturbingly long period of time. If you visit the website stop.lying.cloud, you'll find a Lambda@Edge function that I've put there that cuts out some of the cruft, but it's not perfect. The reason behind this, as I learned after I gave them a bit too much crap one day and got a phone call that started with "Now you listen here," is that there are humans in the loop. They need to validate that there is in fact a systemic issue at AWS and what that issue might be, then come up with a way to report it that ideally doesn't get people sued, and then manually update the status page. Meanwhile, your site's on fire. So the status page is a trailing function, not a leading function.

Alternatively, you could always check ThousandEyes. That's right, this episode is sponsored by ThousandEyes. In addition to the report we mentioned earlier, you can think of them as the Google Maps of the internet, without the creepy privacy overreach issues. Just like you wouldn't necessarily want to commute during rush hour without checking where traffic is going to be and which route is faster, businesses rely on ThousandEyes to see the end-to-end paths their applications and services are taking in real time, to identify where the slowdowns are, where the outages are, and what's causing problems. They use ThousandEyes to see what's breaking where, and then, importantly, ThousandEyes shares that data directly with the offending service providers, not just to hold them accountable, but also to get them to fix the issue fast, ideally before it impacts users. But on this episode, it already has.

So let's say that you don't have the good sense to pay for ThousandEyes, or you're not on Twitter for whatever reason, watching people flail around helplessly trying to figure out what's going on. Instead, you're now trying desperately to figure out whether this issue is the last deploy your team did or a global problem. The first thing people try to do in the event of an issue is say, "Oh crap, what did we just change? Undo it." And often that's a knee-jerk response that can make things worse if it's not actually your code that caused the problem. Worse, it can eat up precious time at the beginning of an outage.
If you knew that it was a single availability zone or an entire AWS region having a problem, you could instead be working to fail over to a different location rather than wasting valuable incident response time checking Twitter or looking over your last 200 commits.

Part of the problem, and the reason this is the way that it is, is that unlike the rusting computers in your data center currently being savaged by raccoons, things in the cloud break differently. You don't have the same diagnostic tools, you don't have the same level of visibility into what the hardware is doing, and the behaviors themselves are radically different. I have a half dozen tips and tricks on how to monitor whether or not your data center's experiencing a problem r...
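As a footnote to the availability zone naming nuance above: those consistent zone IDs are exposed by the EC2 API, so you can check how your account's letters map. A minimal sketch, assuming Python with boto3 and configured credentials (my assumption, not anything from the episode):

import boto3

# Map this account's shuffled AZ letters to zone IDs, which are
# consistent across accounts.
ec2 = boto3.client("ec2", region_name="us-east-1")
for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
    print(zone["ZoneName"], "->", zone["ZoneId"])

# Output shape (values vary by account): us-east-1a -> use1-az4
# Your us-east-1a and my us-east-1a may well be different buildings.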

Dec 11, 2019 • 11min
re:Invent Wrap-up, Part 3
AWS Morning Brief for Wednesday, December 11th, 2019.

Dec 10, 2019 • 14min
re:Invent Wrap-up, Part 2
AWS Morning Brief for Tuesday, December 10th, 2019.

Dec 9, 2019 • 14min
re:Invent Wrap-up, Part 1
AWS Morning Brief for the week of December 9th, 2019.

Dec 2, 2019 • 14min
Wherever You May Rome
AWS Morning Brief for the week of December 2nd, 2019.

Nov 28, 2019 • 16min
Networking in the Cloud Fundamentals, Part 5
Transcript
Corey: As the world spins faster, it heats up because of friction. Therefore, for the good of humanity, the AWS Global Accelerator must be turned off. Welcome once again to Networking in the Cloud, a 12-week special on the AWS Morning Brief, sponsored by ThousandEyes. Think of ThousandEyes as the Google Maps of the internet, without the creepy privacy implications. Just like you wouldn't necessarily go from one place to another without checking which route was less congested during rush hour, businesses rely on ThousandEyes to see the end-to-end paths their applications and services are taking, from their servers to their end users, or between other servers, to identify where the slowdowns are, where the pile-ups live, and what's causing various issues. They use ThousandEyes to see what's breaking where, and then of course depend upon ThousandEyes to share that data directly with the offending providers, to shame them into accountability and get them to fix the issue. Learn more at thousandeyes.com.

So, today we talk about Global Accelerator, an offering that AWS announced at re:Invent last year. What is it? Well, when traffic passes through the internet from your computer en route to a cloud provider, or from your data center to a cloud provider, the provider has choices about how to route that traffic in. Remember, there's no cloud provider we're going to be talking about that doesn't have a global presence, so they have a number of different choices.

Some, such as GCP and Azure, will route that traffic directly into their networks right away, as close to the end user as possible. Others, like AWS and, interestingly, Alibaba, will have that traffic ride the public internet as long as possible, until it gets to the region that it's aimed at, and only then ingest it into the provider's network. And IBM has an interesting hybrid approach between the two that doesn't actually matter, because it's IBM Cloud.

Now, Global Accelerator offers a slightly different option here. By default, traffic bound for AWS rides the public internet until it hits the region at the end. That means the traffic is subject to latency driven by public internet congestion, and that latency is non-deterministic: some packets will get there faster than others as they take different routes, so jitter becomes a concern.

Global Accelerator flips that behavior on its head: instead of traveling across the entire internet until it smacks into a region, traffic now winds up landing on AWS's network far sooner, and then rides along AWS's backbone to where it needs to go. At the end, it smacks into one of a number of different endpoints. Today, at the time of this recording, it supports application load balancers (internal or external), network load balancers, elastic IPs and whatever you can tie those to, and of course EC2 instances, public or private. We'll mention a caveat about that a little later on.
On the other side, facing the internet, Global Accelerator gives out two IP addresses that are anycast. What that means is that, using BGP, those addresses are generally routed to the closest supported location to the customer. As a result, AWS can make a lot of changes to the network architecture in ways that are completely invisible to the end user. It supports, for example, shifting traffic to different regions or endpoints, and it can shape how that traffic manifests on the fly.

Other ways of managing this, such as DNS, leave you at the mercy of the client side: high TTLs mean traffic doesn't shift as quickly as you'd like, and once that DNS record is resolved, client IP caching means it may not get re-resolved at all. With anycast, none of that applies. You see this all over the place with, for example, public DNS resolvers: people around the globe use the same IP addresses to talk to the well-known resolvers, but strangely it's always super quick and never travels across the entire internet. Imagine that.

This is similar in some ways to AWS's CloudFront service. CloudFront is, as mentioned, a CDN with somewhat similar performance characteristics. It generally winds up being a slightly better answer when you're using a protocol like HTTP or HTTPS that the entire CDN service has been designed around. They have a whole bunch of locations scattered across the globe, and sure, it takes a year and a day to update a distribution or deploy a new one in CloudFront, but that's not really the point of this comparison.

Where Global Accelerator shines is where you have non-HTTP traffic, or you need that super-responsive failover behavior. You have a lot more control with Global Accelerator as well. So if, for example, data processing location is super important for you due to regulatory requirements, it's definitely worth highlighting that Global Accelerator grants additional flexibility here. But it's not all sunshine and roses.

There are some performance metrics that shine interesting lights on this. Where do those performance metrics come from, you might wonder? Well, I'm glad you asked. They come from the ThousandEyes State of the Cloud Performance Benchmark Report. As mentioned previously, they ran a whole series of tests across a variety of cloud providers from different networks, which in turn showcase where certain cloud providers shine, where certain cloud providers don't work as well in some contexts as others do, and, for lack of a better term, let you race the clouds. It's one of the fun things they're able to do because they serve the role of global observer: they have a whole bunch of locations they can monitor from, and they see customer traffic, so they understand what those use cases look like in real life.

Feel free to get your copy of the report today. They race GCP, Azure, AWS, Alibaba, and IBM Cloud. As mentioned on previous episodes, Oracle Cloud was not included because they only race real clouds. Get your copy today at snark.cloud/realclouds, that's snark.cloud/realclouds, and thanks again to ThousandEyes for their continuing support of this ridiculous miniseries.
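A quick aside before the results: if you want to see the client-side TTL behavior described above for yourself, here's a minimal sketch using the dnspython library, which is my choice for illustration and not something the episode mentions.

import dns.resolver  # pip install dnspython

# Resolve a record and inspect how long clients are told to cache it.
answer = dns.resolver.resolve("example.com", "A")
print("Addresses:", [record.address for record in answer])
print("TTL:", answer.rrset.ttl, "seconds")

# Until that TTL runs out (plus whatever extra caching the OS or browser
# layers on top), a DNS-based failover won't move this client anywhere.
# Anycast addresses never change, so there's nothing to wait out.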
Now, what did ThousandEyes learn? Well, this should be blindingly obvious, but in case it's not: Global Accelerator is not super useful if you and your customers aren't far apart. An example that came up in the report was that if you're in North America, which by and large has decent internet connectivity provided you're not somewhere rural (due to a variety of terrible things we'll get to in a future episode), then it...
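For a sense of the mechanics discussed in this episode, here's a minimal sketch of standing up an accelerator with boto3; the name is hypothetical, and note that, as far as I know, the Global Accelerator control plane API is homed in us-west-2 regardless of where your endpoints live. A real setup would go on to add a listener and endpoint groups pointing at the ALBs, NLBs, elastic IPs, or EC2 instances mentioned above.

import boto3

# The Global Accelerator control plane lives in us-west-2.
ga = boto3.client("globalaccelerator", region_name="us-west-2")

accelerator = ga.create_accelerator(
    Name="demo-accelerator",  # hypothetical name
    IpAddressType="IPV4",
    Enabled=True,
)["Accelerator"]

# The two static anycast IP addresses discussed in the episode:
for ip_set in accelerator["IpSets"]:
    print(ip_set["IpAddresses"])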


