AWS Morning Brief

Corey Quinn
Mar 13, 2020 • 11min

Whiteboard Confessional: Everything's a Database Except SQLite

About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH.io
SQLite

Transcript
Corey Quinn: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semipolite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than running massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data: not replicating it, not storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

Many things make fine databases: things that replicate data from one place to another, that take various bits of data and put them where they need to go. Other things do not make fine databases. Let’s talk about one of those today. For those who have never had the dubious pleasure of working with it, SQLite is a C library that implements a relational database engine. And it’s pretty awesome. It’s very clearly not designed to work in a client-server fashion, but rather to be embedded into existing programs for local use. In practice, that means that if you’re running SQLite, that’s S-Q-L-I-T-E, your database backend is going to be a flat file, or something very much like one, that lives locally. This technology is used all over the place: in mobile apps, in embedded systems, and in web apps for some very specific things. But that’s not quite the point. I once worked somewhere that decided to build a replicated environment that was active-active-active across three distinct data centers. You would really hope that that statement was a non sequitur. It’s not. If you were to picture Hacker News coming to life as a person, and that person decided to design a replication model for a database from first principles, you would be pretty close to what I have seen. You can get a replicated model running on top of SQLite to work, but because there’s no concept of client-server, as mentioned, the only way to handle it is to kick all of the replication and state logic up from the database layer, where it belongs, into the application code itself, where it most assuredly does not belong.
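To make “embedded, not client-server” concrete, here is a minimal sketch using Python’s standard-library sqlite3 module; the filename and schema are illustrative, not from the episode:

```python
import sqlite3

# SQLite is a library, not a server: "connecting" just opens a local file.
# There is no network listener, no user accounts, and no built-in replication,
# which is why replicating it forces that logic up into application code.
conn = sqlite3.connect("app.db")  # hypothetical local database file
conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("corey",))
conn.commit()

for row in conn.execute("SELECT id, name FROM users"):
    print(row)  # e.g. (1, 'corey')

conn.close()
```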
The downside of this... well, there are many downsides, but let’s start with a big one: this is not even slightly what SQLite was designed to do. However, take a startup that decides that if there’s one core competency it has, it’s knowing better than everyone else; this is that story. Now, I am obviously not a developer, and I’m certainly not a database administrator. I was an ops person, which means that a lot of the joy of various development decisions fell to whatever group I happened to be in at that point in time. It turns out that when you run replicated SQLite as a database, you have to get around an awful lot of architectural pain points by babying this thing something fierce. There are a number of operational problems that going down a path like this will expose. Let me explain what some of them look like, after this.

In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry, because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.

I’m not going to engage in a point-by-point teardown of this replicated-SQLite-as-primary-datastore Eldritch horror. My favorite database personally remains Route 53, and even that’s a better plan than this monstrosity. I’m not going to tackle, point by point, everything that made this horrifying thing, once it came to life, so awful to deal with. Anyone who runs this at any sort of scale for more than a week is going to discover a lot of these on their own. But I am going to cherry-pick a few things that were problematic about it. Remember back in the days of Windows, when things would get slow and crappy, and you had to basically restart your machine while the disk defragmented forever? Yeah, it turns out that most database systems have the same problem. The difference is that reasonable, adult-level database systems, built by human beings who are used to how this stuff works, tend to put that under the hood, so you don’t really have to think about it. SQLite wasn’t really designed for this sort of use case, so you get to play these games yourself, which is just an absolute pleasure and a joy, except the exact opposite of that. In our case, that meant every node periodically had to be taken out of rotation, after about a week or so, or it would start chewing disk, take forever to return results to some queries, and the performance of the entire site would slam to a halt. So you have to make people aware that this exists. When we first discovered it, that was fun. The problem here is that this speaks to a larger problematic pattern: namely, you’re forcing what has historically been a low-level function that even most operations people don’t need to know or care about into the forefront of every developer’s mental model of the application. And if they forget that this is one of the things that has to happen, woe be unto them.
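The “defragmentation” dance described above corresponds, in SQLite terms, to the VACUUM command, which rebuilds the database file to reclaim free pages. A minimal sketch of the maintenance step that node rotation presumably wrapped; the filename is assumed:

```python
import sqlite3

# VACUUM rebuilds the database file, reclaiming pages left behind by deletes
# and updates. On a large file it locks the database and can take a long
# time, which is why each node had to come out of rotation to run it.
conn = sqlite3.connect("app.db")  # hypothetical database file
conn.execute("VACUUM")
conn.close()
```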
Further, it should be pretty freakin’ obvious by now, by everything I’ve de...
Mar 9, 2020 • 11min

Nothing’s Certain but Death and Distinguished Engineers

AWS Morning Brief for the week of March 9, 2020.
Mar 6, 2020 • 12min

Whiteboard Confessional: Scaling Databases in a Single Bound

About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH.io
Amazon Route 53

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semipolite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

But first… On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than running massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data: not replicating it, not storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

So I’m going to deviate slightly from the format that I’ve established so far on these Friday morning Whiteboard Confessional stories, and talk instead about a pattern that has tripped me and others up more times than I care to remember. It’s my naive hope that by venting about this for the next 10 minutes or so, I will eventually encounter an environment where someone hasn’t made this particular mistake. And what mistake am I talking about? Well, as with so many terrifying architectural patterns, it goes back to databases. You decide that you’re going to write a small toy application. You’re probably not going to turn this into anything massive, and in all honesty, baby seals will probably get more hits than whatever application you’re about to build will. So you don’t really think too hard about what your database structure is going to look like. You spin up a database, you define the database endpoint inside the application, and you go about your merry way. Now, that’s great. Everything’s relatively happy, and everything we just described will work. But let’s say that you hit that edge or corner case where this app doesn’t fade away into obscurity. In fact, this thing turns out to have some legs; what you’re building has attained business viability, or is at least seeing enough user traffic that it now has to worry about load.

So you start taking a look at this application, because you get the worst possible bug report six to eight months later: “it’s slow.” Where do you start looking when something is slow?
Well, personally, I start looking at the bar, because that is a terribly obnoxious problem to have to troubleshoot. There are so many different ways that latency can get injected into an application. You discover the person reporting the slowness is on the other side of the world with a satellite internet connection that they’re apparently trying to reach the satellite with, using a tin can and a very long piece of string. There are a lot of failure states here that you get to start hunting down. The joys of latency hunting. But in many cases, the answer is going to come down to: oh, that database that you defined is no longer up to the task. You’re starting to bottleneck on that database. Now, you can generally buy your way out of this problem by scaling up whatever database you’re using. Terrific, great; it turns out that you can just add more hardware, which, in a time of cloud, just means more money and a bit of downtime while you scale the thing up. That gets you a little bit further down the road, until the cycle begins to rinse and repeat, and it turns out instances only come so large; eventually there is no bigger box to power your database with. Also, the big ones are not exactly inexpensive. Now, I would name the exact sizes those database instances come in, but this is AWS; they’re probably going to release at least five different instance families and sizes between the time I finish recording this and when it gets published at the end of the week. So instead, there is an alternative here, and it doesn’t take much from an engineering or design perspective when you’re building out one of these silly toy apps that will never have to scale. What is that fix, you might wonder? Terrific question. Let me tell you in just a minute.

In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry, because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.

So this is a pattern that, increasingly, modern frameworks recommend, though a number of them still don’t. And I’m not going to name names, because I don’t want to wind up in a slap-and-tickle fight around which frameworks are good versus which frameworks are crappy. You can all make your own decisions around that. But the pattern that makes sense for this is: even when you’re beginning with a toy app, go ahead and define two database endpoints, one for reads and one for writes. Invariably, this is going to solve a whole host of problems with most database technologies. If you take a look at most applications, and yes, I know there are going to be exceptions to this, they tend to bottleneck on reads. If you have just a single database or database cluster, then all of the read traffic gets in the way of being able to write to it. That includes things that don’t actually need to be in line with the rest of what the application is doing. If you can have a read replica that’s used for business analytics, great.
Your internal business teams can beat the living crap out of that database replica without damaging anything that’s in the critical path of serving users. And the writes can then go specifically to the primary node, w...
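A minimal sketch of that two-endpoint pattern in Python; the environment variable names and connection strings here are hypothetical, not from the episode. On day one, both endpoints point at the same database, so adding a read replica later is purely a configuration change:

```python
import os

# Hypothetical endpoints: before a replica exists, both resolve to the same
# host. Once a read replica is added, only DB_READ_DSN changes; no
# application code has to be rewritten.
WRITE_DSN = os.environ.get("DB_WRITE_DSN", "postgresql://primary.internal/app")
READ_DSN = os.environ.get("DB_READ_DSN", WRITE_DSN)  # defaults to the primary

def dsn_for(readonly: bool) -> str:
    """Route read-only work (queries, analytics) to the replica endpoint,
    and anything that writes to the primary."""
    return READ_DSN if readonly else WRITE_DSN

print(dsn_for(readonly=True))   # the replica, once one exists
print(dsn_for(readonly=False))  # always the primary
```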
Mar 2, 2020 • 11min

Amazon Transcribe Gets <REDACTED>

AWS Morning Brief for the week of March 2, 2020.
Feb 28, 2020 • 14min

Whiteboard Confessional: How Cluster SSH Almost Got Me Fired

About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH.io
Cluster SSH GitHub repository
AWS Systems Manager Session Manager
EC2 Instance Connect

Transcript
Corey: On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than running massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data: not replicating it, not storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

So, once upon a time, way back in the mists of antiquity, there was a year called 2006. This was before many folks listening to this podcast were involved in technology. And I admit as well that it is also several decades after other folks listening to this podcast got involved in technology. But that’s not the point of this story. It was my first real job working in anything resembling a production-style environment. I’d dabbled before this, doing various Windows desktop-style support. I’d played around with small business servers for running Windows-style environments. And then I decided there wasn’t much of a future in technology and spent some time as a technical recruiter, and a little more time working in a sales role, which I was disturbingly good at, but I was selling tape drives to people. But that’s not the interesting part of the story. What is, is that I somehow managed to luck my way into a job interview at a university, helping to run their Linux and Unix systems. Cool. Turns out that interviewing is a skill like any other. The technical reviewer was out sick that day, and they really liked both the confidence of my answers and my personality. That’s two mistakes right there. One: my personality is exactly what you would expect it to be. And two: hiring the person who sounds the most confident is exactly what you don’t want to do. It also tends to lend credence to people who look exactly like me. So, in the first few months of that role, I had converted some systems over to FreeBSD, which is like Linux, except it’s not Linux. It’s a Unix, and it’s far older, derived from the Berkeley Software Distribution. And managing a bunch of those systems at scale was a challenge. Now understand, in this era, scale meant something radically different than it does today. I had somewhere between 12 and 15 nodes that I had to care about. Some were mail servers. Some were NTP servers, of all things.
Utility boxes here and there, the omnipresent web servers that we all dealt with, the Cacti box whose primary job was to get compromised and serve as an attack vector for the rest of the environment, et cetera. This was a university. Mistakes didn’t necessarily mean the same thing there as they would in revenue-generating engineering activities. I was also young, foolish, and the statute of limitations has almost certainly expired by now. So, running the same command on every box was annoying. This was in the days before configuration management was really a thing. BCFG2 was out there, and incredibly complex. And CFEngine was also out there, which required an awful lot of in-depth arcane knowledge that I frankly didn’t have. Remember, I bluffed my way into this job and was learning on the fly. So I did a little digging and, lo and behold, I found a tool that solved my problem, called ClusterSSH. And oh, was it a cluster. The way this worked was that it would spin up different xterm windows on your screen: you would provide a list of hosts, and it would open one window for every host you gave it. Great. So now I’m logged into all of those boxes at once. If this is making you cringe already, it probably should, because this is not a great architectural pattern. But here we are, we’re telling this story, so you probably know how that worked out.

One of the intricacies of FreeBSD is how it decides which services start on boot. With Red Hat-derived systems, before the dark times of systemd, you could run things like chkconfig, that’s C-H-K, then the word config, give it a service, and tell it to turn that service on or off at certain run levels. This is how you would tell it to, for example, start the web server on boot; otherwise, you reboot the system, the web server does not start, and you wonder why TCP now terminates on the ground. On FreeBSD, this was all controlled via a single file, /etc/rc.conf, which determined both which services were allowed to start and which services would be started automatically on boot, generally as a boolean value assigned to the particular service name. Well, I was trying to do something, probably, I want to say, NTP-related, but don’t quote me on that, where I wanted to enable a certain service to start on all of the systems at once. So I typed a command, specifically echoing the exact string that I wanted, in quotes so it would be quoted appropriately, redirected with a right angle bracket to that file, /etc/rc.conf, and then I pressed enter. Now, for those who are unaware of Unix-isms and how things work in a shell: a single right angle bracket means “overwrite this file,” while two angle brackets mean “append to the end of this file.” I was trying for the second one, and instead I got the first. So suddenly, I had just rewritten all of those files across every server. Great plan, huh? Well, I realized what I’d done as soon as I checked my work to validate that the system had taken the update appropriately. It had not; it had taken something horrifying instead. What happened next? Great question.

But first, in the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry, because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH.
The data lives in your S3 buckets in ...
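The overwrite-versus-append distinction behind that incident is the shell’s > versus >>, and the same trap exists in most languages. A sketch in Python against a throwaway file, not anyone’s real /etc/rc.conf:

```python
# The shell's ">" truncates the target file; ">>" appends to it. Python's
# open() modes mirror the two: "w" is the single angle bracket, "a" the double.
path = "rc.conf.demo"  # throwaway demo file

with open(path, "w") as f:   # like: echo '...' > file  (clobbers the file)
    f.write('ntpd_enable="YES"\n')

with open(path, "a") as f:   # like: echo '...' >> file (appends safely)
    f.write('sshd_enable="YES"\n')

with open(path) as f:
    print(f.read())  # both lines survive only because the second write appended
```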
Feb 24, 2020 • 12min

RSA Thinks AWS Firewall Manager is a Job Title

AWS Morning Brief for the week of February 24, 2020.
Feb 21, 2020 • 14min

Whiteboard Confessional: Route 53 DB

About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semipolite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

But first… On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than running massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data: not replicating it, not storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

I frequently joke on Twitter about my favorite database being Route 53, which is AWS’s managed database service. It’s a fun joke, to the point where I’ve become Route 53’s de facto technical evangelist. But where did this whole joke come from? It turns out that this started life as an unfortunate architecture that was taken in a terrible direction. Let’s go back in time, at this point almost 15 years from the time of this recording, in the year of our Lord 2020. We had a data center that was running a whole bunch of instances. In fact, we had a few data centers, or datas center, depending upon how you chose to pluralize; that’s not the point of this ridiculous story. Instead, what we’re going to talk about is what was inside these data centers. In this case, servers. I know, serverless fans, clutch your pearls, because that was a thing that people had many, many, many years ago. Also known as roughly 2007. And on those servers there was this new technology running that was really changing our perspective of how we dealt with systems. I am, of course, referring to the amazing, transformative revelation known as virtualization. This solved the problem of computers being bored and not being able to process things in a parallelized fashion, because you didn’t want all of your applications running on all of your systems, by building artificial boundaries between different application containers, for lack of a better term. Now, in these days, these weren’t applications. These were full-on virtualized operating systems, so you had servers running inside of servers, and this was very early days. Cloud wasn’t really a thing.
It was something that was on the horizon, if you’ll pardon the pun. So, this led to an interesting question: all right, I wound up connecting to one of my virtual machines, and there’s no good way for me to tell which physical server that virtual machine is running on. How could we solve for this? Now, back in those days, the hypervisor technology we used was Xen, that’s X-E-N; incidentally, that’s the same virtualization technology AWS started out with for many years, before releasing their KVM-derived Nitro hypervisor a couple of years ago. Again, not the point of this particular story. One of the interesting pieces of how this worked was that Xen didn’t really expose anything, at least in those days, that you could use to query the physical host a guest was running on. So, how would we wind up doing this? Now, at very small scale, where you have two or three servers sitting somewhere, it’s pretty easy: you log in and you check. At significant scale, that starts to get a little more concerning. How do you figure out which physical host a virtual instance is running on? Well, there are a bunch of schools of thought you can approach this from. But what you’re trying to build is known, technically, as a configuration management database, or CMDB. This is, of course, radically different from configuration management, such as Puppet, Chef, Ansible, Salt, and other similar tooling. But, again, this is technology, and naming things has never been one of our collective strong suits. So, what did we wind up doing? You can have a database, or an Excel spreadsheet, or something like that, with all of these things listed, but what happens when you turn an old instance off and spin up a new one on a different physical server? These things become rapidly out of date. So, what we did was sort of the worst possible option. It didn’t solve all of these problems, but it at least let us address the perceived problem, in a way that is, of course, architecturally terrible, or it wouldn’t have been on this show.

DNS has a whole bunch of interesting capabilities. You can view it, more or less, as the phone book for the internet: it translates names to numbers. Fully qualified domain names, in most cases, to IP addresses. But it does more than that. You can query an IP address and get back the PTR, or reverse, record that tells you what the name of a given IP address is, assuming they match. You can set those to different things, but that’s a different pile of madness that I’m certain we will touch upon a different day. So, what we did was take advantage of a little-known record type: the TXT, or text, record. You can put arbitrary strings inside of TXT records and then consume them programmatically for a whole bunch of different things. One use of TXT records that isn’t patently ridiculous: domains generally have a TXT record containing their SPF record, which shows which systems are authorized to send mail on their behalf, as an anti-spam measure. So, if something else starts claiming to send email from your domain without being authorized, that gets flagged as spam by many receiving servers. We misused TXT records, because there is really no limit to how many TXT records you can have, and wound up using them as our configuration management database.
So, you could query a given instance, we’ll call it webserver003.production.losangeles.company.com, which was our naming scheme for these things, and it would return a record that was itself a fully qualified domain name, but it was the name of the physical host on top of which it was running. So, yeah, we could then propagate that, as we could with any other DNS records, to other places in the ...
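A sketch of that lookup pattern using the third-party dnspython package (an assumption; the episode never names any tooling), with the hostname taken from the story:

```python
import dns.resolver  # third-party "dnspython" package -- an assumption

# The TXT record on the instance's name holds the physical host's FQDN.
instance = "webserver003.production.losangeles.company.com"

answers = dns.resolver.resolve(instance, "TXT")
for record in answers:
    # TXT data arrives as a tuple of byte strings; join and decode them.
    physical_host = b"".join(record.strings).decode()
    print(f"{instance} runs on {physical_host}")
```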
Feb 17, 2020 • 13min

EBS Gets Overly Multi-Attached

AWS Morning Brief for the week of February 17, 2020.
Feb 10, 2020 • 12min

Polly Brand Voice Want a Platypus?

AWS Morning Brief for the week of February 10, 2020.
Feb 6, 2020 • 22min

Networking in the Cloud Fundamentals: BGP Revisited with Ivan Pepelnjak

About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Transcript
Corey: Hello and welcome to our Networking in the Cloud miniseries, sponsored by ThousandEyes. That’s right: there may be just one of you, but there are a thousand eyes. On a more serious note, ThousandEyes published their cloud performance benchmarking report for 2019 at the end of last year, talking about what it looks like when you race various cloud providers. They looked at all the big cloud providers and determined what performance looks like from an end-user perspective. What does the user experience look like among and between different cloud providers? To get your copy of this report, you can visit snark.cloud/realclouds. Why real clouds? Well, because they raced AWS, Azure, GCP, IBM Cloud, and Alibaba, all of which are real clouds. They did not include Oracle Cloud because, once again, they raced real clouds. Check out your copy of the report at snark.cloud/realclouds.

Welcome to week 12 of the Networking in the Cloud miniseries of the AWS Morning Brief, sponsored by ThousandEyes. One of the early episodes of this miniseries had me opining, in relatively uninformed broad brushstrokes, about the nature of BGP. Today I am joined by Ivan Pepelnjak, a former CCIE who wrote a fascinating blog post, which I will link to in the show notes, saying, in effect, “This is great, but this is what happens when someone who’s good at one thing steps completely out of their comfort zone into things they don’t fully understand and starts opining confidently, if not authoritatively.” Ivan, thank you for taking the time to speak with me.

Ivan: Thanks for having me on. And no, I was way more polite than your summary.

Corey: Absolutely. I believe that there’s a way to tell the hero’s journey that everyone talks about when they’re building a narrative arc. Instead, I go for the moron’s journey, and I always like to be the moron because, generally, I tend to be. As I walk through the world and get things sometimes right, occasionally wrong, I love being corrected when I stumble blindly into an area I don’t know. First, because it gives me an opportunity to learn something new, which is great, but it also gives me that opportunity to be the dumbest person in the room again, which is awesome. So...

Ivan: That’s exactly why I blog: to get your opinions.

Corey: Exactly. You have data, I have opinions, and mine are louder; that seems to be the way discourse works in the modern era. So, from a high level, what did I get wrong about BGP?

Ivan: Well, you got everything right about the mess that we are in and the fragility of the generic internet infrastructure. The only thing you got wrong was that you blamed the tool, not the people using the tool.

Corey: It always feels safer, on some level, to blame technology, because if the takeaway is, “Well, the user experience around tool X isn’t great, and that adds a contributing factor to why things break,” that seems to be a message that carries slightly better than, “And thus the answer is for everyone to be smarter and stop screwing up.” And that may very well be the answer.
It’s just a bitter pill to swallow sometimes. So I find blaming a tool is easy.

Ivan: Yeah, but it’s like blaming the knife when people get cut, or blaming the chainsaw when people cut off their arms because they were not properly trained.

Corey: One of my assertions was that BGP is more or less a hot mess because it was designed for an era when people on the internet fundamentally could trust one another, and that doesn’t seem to be the case today. The analogy in my mind, which I don’t think I mentioned, was SMTP, the email protocol, for lack of a better term. When that was built, the internet was more or less comprised of researchers, and who in the world would ever abuse a protocol like email? It’s not like there was any money involved in the internet. Fast-forward to today, and your spam folder is inherently a garbage fire.

Ivan: Yeah, but BGP has a slightly different history. It was redesigned a few times. There were several attempts to get the global routing protocol right. And BGP, the last attempt, already included the tools that allow entities that don’t trust each other, like commercial internet service providers, to exchange information and apply policies on inbound and outbound updates. So, for example: I don’t want to hear about your customers, because I hate you and I don’t want to peer with you. Or: I don’t want to tell you about my customer, because that customer has a special deal and their traffic can only go through some other transit providers, so I will not tell you about that customer. Those things were already a major requirement when BGP was designed, and it has always included the tools to implement the policies that individual commercial entities wanted to have. Which, by the way, never happened with SMTP. We have BGP version 4 now, and we are still on SMTP version 0.1 plus enhancements.

Corey: I guess the best analogy I can come up with, through my exposure to BGP, because I tend to handle internetworking between various groups about as well as I write code, is this: there are things I’m vaguely aware you should be doing here that I will almost certainly not get right, so I back away slowly and leave it to professionals. As a result, every time I really see how BGP works in any hands-on sense, or have it forced upon my awareness, it’s similar to how I become aware of plumbing. I don’t think about it. I don’t question it. I just expect that when I turn the faucet on or flush the toilet, water will do what it’s going to do. I don’t expect the toilet to explode. So the only time I think about BGP is when there is a peering dispute, or when there’s a flap, or, on one notable occasion, when I was at a security conference and, as a demo, some folks hijacked the conference’s entire ASN and rerouted it halfway around the world and back, which explained why everything was super latent and crappy.

Ivan: Yeah. You’re absolutely right, but all the incidents you mentioned are not the fault of the tool. They are the fault of the tool not being properly used. And also, let’s be honest, it took us hundreds of years to get plumbing to the point where you can just turn on the faucet and clean, drinkable water comes out of it. It’s not like that happened in the last year or two, and very probably it wouldn’t have happened without public pressure to bring us dri...
