
Cloud Engineering Archives - Software Engineering Daily

Latest episodes

Jan 18, 2018 • 54min

Kubernetes Storage with Bassam Tabbara

Modern applications store most of their data on hosted storage solutions. We use hosted block storage to back databases, hosted object storage for objects such as videos, and hosted file storage for file systems. Using a cloud provider for these storage systems can simplify scalability, durability, and availability–it can be less painful than taking care of storage yourself. One downside: the storage systems offered by the cloud providers are not open source. The APIs might vary from provider to provider. Wiring your application to a particular storage service on a particular cloud could tightly couple you to that cloud. Rook is a project for managing storage, built on Kubernetes. If you use a Rook cluster for your storage, you can port that storage model to any cloud, and have a consistent API for object, block, and file storage. In this episode, Bassam Tabbara describes the state of cloud storage, and why he started the Rook project. The post Kubernetes Storage with Bassam Tabbara appeared first on Software Engineering Daily.
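Rook's value proposition is essentially programming against an interface instead of a vendor. As a rough illustration of that idea (this is not Rook's actual API; it is a hypothetical Python sketch), application code can depend on a small storage interface, so the backing store can be swapped without touching the application:

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Minimal provider-neutral object storage interface (hypothetical)."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(ObjectStore):
    """Stand-in backend; a real deployment might back this with Rook or S3."""
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key):
        return self._blobs[key]

def archive_video(store: ObjectStore, video_id: str, payload: bytes) -> None:
    # Application code depends only on the interface, so swapping the
    # backing store (cloud object storage, Rook, local disk) does not
    # require rewriting the application.
    store.put(f"videos/{video_id}", payload)

store = InMemoryStore()
archive_video(store, "intro", b"\x00\x01")
print(store.get("videos/intro"))
```

A real backend could implement the same interface with an S3 client or a Rook-backed endpoint; the application code above would not change, which is the portability the episode discusses.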
Jan 17, 2018 • 49min

Kubernetes State Management with Niraj Tolia

A common problem in a distributed system: how do you take a snapshot of the global state of that system? Snapshotting is difficult because you need to tell every node in the system to simultaneously record its state. There are several reasons to take a snapshot. You might want to capture the global state for the purposes of debugging. Or you might want to take a comprehensive snapshot of your system (including the database) and port your system from one cloud to another. Or you might just need to take a snapshot for disaster recovery. When a Kubernetes application is deployed, its initial configuration is described in config files. After a deployment, the state of the application might change–some nodes die, some services get scaled up. At any given time, the current state of a Kubernetes cluster is described by etcd, a distributed key-value store. Niraj Tolia is CEO of Kasten, a company that provides data management, backups, and disaster recovery for Kubernetes applications. Niraj joins the show to describe how Kubernetes deployments manage state, and what the modern business environment around Kubernetes looks like. The post Kubernetes State Management with Niraj Tolia appeared first on Software Engineering Daily.
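The coordination problem behind that first question can be made concrete with a toy sketch (hypothetical Python, with threads standing in for nodes): every node pauses at a barrier and records its local state at the same cut. Real systems cannot simply pause the world, which is why marker-based algorithms such as Chandy-Lamport exist, but the toy version shows what "simultaneously record its state" demands:

```python
import threading

# Each "node" owns a counter that it keeps incrementing. To take a
# consistent snapshot, all nodes rendezvous at a barrier and record
# their state at the same cut, instead of recording whenever convenient.
NUM_NODES = 3
barrier = threading.Barrier(NUM_NODES)
snapshot = {}
snapshot_lock = threading.Lock()

def node(node_id: int, work: int) -> None:
    state = 0
    for _ in range(work):
        state += 1              # normal processing
    barrier.wait()              # global coordination point
    with snapshot_lock:
        snapshot[node_id] = state  # record local state for the global snapshot

threads = [threading.Thread(target=node, args=(i, (i + 1) * 10)) for i in range(NUM_NODES)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(snapshot.items()))  # [(0, 10), (1, 20), (2, 30)]
```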
Jan 16, 2018 • 47min

Kubernetes Operations with Brian Redbeard

In the last four years, CoreOS has been at the center of enterprise adoption of containers. During that time, Brian Harrington (or “Redbeard”) has seen a lot of deployments. In this episode, Brian discusses the patterns he has seen among successful Kubernetes deployments–and the pitfalls of the less successful. How should you manage configuration? How can you avoid IP address overlap between containers? How should you log and monitor your Kubernetes cluster–and whose responsibility is it to set all that stuff up? Brian also discusses the motivation for multi-cloud deployments, and how to implement multi-cloud Kubernetes. CoreOS offers a distributed systems management tool called Tectonic, which uses Kubernetes for container orchestration. In a time where there are lots of options to choose from when it comes to managed Kubernetes providers, it was great to hear Brian describe some of the architectural decisions for building Kubernetes into Tectonic. The post Kubernetes Operations with Brian Redbeard appeared first on Software Engineering Daily.
Jan 15, 2018 • 47min

FluentD with Eduardo Silva

A backend application can have hundreds of services written in different programming frameworks and languages. Across these different languages, log messages are produced in different formats. Some logs are produced in XML, some in JSON, and some in other formats. These logs need to be unified into a common format and centralized for any developer who wants to debug. The popularity of Kubernetes is making it easier for companies to build this kind of distributed application, where different services in different languages are communicating over a network, with a variety of log message types. Fluentd is a tool for solving this problem of log collection and unification. In today’s episode, Eduardo Silva joins the show to describe how Fluentd is deployed to Kubernetes, and what the role of Fluentd is within a Kubernetes logging pipeline. We also discuss the company where Eduardo works: Treasure Data. The story of Treasure Data is unusual. The team started out doing log management but has found itself moving up the stack, into marketing analytics, sales analytics, and customer data management. This story might be useful for any open-source developer thinking about how to evolve a project into a business. The post FluentD with Eduardo Silva appeared first on Software Engineering Daily.
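The unification step can be sketched in a few lines. This is not Fluentd's implementation (Fluentd is configured with parser plugins rather than hand-written code); it is a hypothetical Python illustration of coercing mixed-format log lines into one schema:

```python
import json
import xml.etree.ElementTree as ET

def normalize(line: str) -> dict:
    """Coerce a raw log line (JSON, XML, or plain text) into one schema,
    roughly what a collector like Fluentd does before forwarding logs."""
    line = line.strip()
    if line.startswith("{"):
        record = json.loads(line)
        return {"level": record.get("level", "INFO"), "message": record.get("msg", "")}
    if line.startswith("<"):
        node = ET.fromstring(line)
        return {"level": node.get("level", "INFO"), "message": node.text or ""}
    return {"level": "INFO", "message": line}

logs = [
    '{"level": "ERROR", "msg": "payment failed"}',
    '<log level="WARN">slow query</log>',
    'server started',
]
for entry in logs:
    print(normalize(entry))
```

Once every service's output is in the common shape, a single pipeline can centralize and index it for debugging.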
Jan 13, 2018 • 1h 3min

The Gravity of Kubernetes

Kubernetes has become the standard way of deploying new distributed applications. Most new internet businesses started in the foreseeable future will leverage Kubernetes (whether they realize it or not). Many old applications are migrating to Kubernetes too. Before Kubernetes, there was no standardization around a specific distributed systems platform. Just like Linux became the standard server-side operating system for a single node, Kubernetes has become the standard way to orchestrate all of the nodes in your application. With Kubernetes, distributed systems tools can have network effects. Every time someone builds a new tool for Kubernetes, it makes all the other tools better. And it further cements Kubernetes as the standard. Google, Microsoft, Amazon, and IBM each have a Kubernetes-as-a-service offering, making it easier to shift infrastructure between the major cloud providers. We are likely to see Digital Ocean, Heroku, and longer-tail cloud providers offer managed, hosted Kubernetes eventually. In this editorial, I explore the following questions: Why is this happening? What are the implications for developers? How are cloud providers affected? What are the new business models that are possible in a Kubernetes-standardized world?

Software Standards

Standardized software platforms are both good and bad. Standards allow developers to have expectations around how their software will run. If a developer builds something for a standardized platform, they can make smart estimations about the total addressable market for that piece of software. If you write a program in JavaScript, you know that it will run in everyone’s browser. If you create a game for iOS, you know that everyone with an iPhone will be able to download it. If you build a tool for profiling garbage collection in .NET, you know that there is a large community of Windows developers with memory issues who can buy your software.
Standardized proprietary platforms can lead to massive profit returns for the platform provider. In 1995, Windows was such a good platform that Microsoft could sell a CD in a cardboard box for $100. In 2018, the iPhone is so good that Apple can take 30% from all app sales on its platform. Proprietary standards lead to fragmentation. Your iPhone app does not run on my Kindle Fire. I can’t use your Snapchat augmented reality sticker on Facebook Messenger. My favorite digital audio workstation only runs on Windows–so I have to keep a Windows computer around just to make music. When developers see this fragmentation, they groan. They imagine the greedy capitalists, who are making money at the expense of software quality. Developers think, “why can’t we all just get along? Why can’t we have things be open and free?” Developers think, “we don’t need proprietary standards. We can have open standards.”

Growth of Apache, part of the LAMP (Linux, Apache, MySQL, PHP) Stack

This happened with Linux. These days, new server-side applications mostly run on Linux. There was a time when that was controversial (see the far left-hand side of the chart above). More recently, we saw a newer open standard with Docker. Docker gave us an open, standardized way of packaging, deploying, and distributing a single node. This was hugely valuable! But for all of the big problems that Docker solved, it highlighted a new problem: how should we be orchestrating these nodes together? After all, your application is not just a single node. You know you want to be deploying a Docker container–but how are your containers communicating with each other? How are you scaling up instances of these containers? How are you routing traffic between container instances?

Container Orchestration

After Docker became popular, a scramble of open source projects and proprietary platforms emerged to solve the problem of container orchestration.
Mesos, Docker Swarm, and Kubernetes each offered a different set of abstractions for managing containers. Amazon ECS offered a proprietary managed service that took care of installation and scaling of Docker containers for AWS users. Some developers did not adopt any orchestration platform, and used Bash, Puppet, Chef, and other tools to script their deployments. Whether a developer was deploying their container by using an orchestration platform or a script, Docker sped up development, and made things more harmonious between developers and operations. As more developers deployed containers with Docker, the importance of the orchestration platform was becoming clear. A container is as fundamental to a distributed system as an object is to an object-oriented program. Everyone wanted to be using a container platform, but there was a struggle between these platforms, and it was hard to see which would come out on top–or if it would be a decades-long struggle, like iOS vs. Android. This struggle between the different container orchestration platforms was causing fragmentation–not because any of the popular orchestration frameworks were proprietary (Swarm, Kubernetes, and Mesos are all open source), but because each container orchestration community had invested so much in their own system. So, from 2013 to 2016, there was some anxiety among Docker power users. Choosing a container orchestration framework is a huge bet–and if you chose the wrong orchestration system, it would be like opening a movie store and choosing HD DVD over Blu-ray.

These pictures of container ships falling over never get old. Image: Huffington Post

The war between container orchestrators felt like a winner-take-all affair. And as in any war, there was a fog that was hard to see through.
When I was reporting on the container orchestration wars, I recorded podcast after podcast with container orchestration experts, where I would ask some form of the question, “so, which container orchestration system is going to win?” I did this until someone I respect told me that asking who was going to “win the orchestration wars” was a less interesting question than evaluating the technical tradeoffs between the orchestrators. Looking back, I regret buying into the narrative of a war between container orchestrators. As the debates about container orchestrators raged on, the smartest people in the room were mostly having calm, scientific discussions–even when reporters like me were getting worked up, and thinking that this was a story about tribalism. The conflicts between container orchestrators were not about tribalism–they were more about differences of opinion and developer ergonomics. OK, maybe the container orchestration war wasn’t only about differences of opinion. There is a boatload of money to be made in this space. We are talking about contracts with billion-dollar legacy organizations–banks, and telcos, and insurance companies–who are making their way onto the cloud. If you are in the business of helping telcos move onto the winning platform, business is good. If you champion the wrong platform, you end up with a warehouse full of HD DVDs. The conflict was at its worst near the end of 2016, when there were rumors that Docker might fork, so that Docker the company could change the Docker standard to fit better with its container orchestration system, Docker Swarm–but even in those times, it would have made sense to be an optimist. Creative destruction is painful, but it is a sign of progress–and in the struggle for container orchestration dominance, there was a ton of creative destruction. And when the dust cleared around the end of 2016, Kubernetes was the clear winner.
Today, with Kubernetes becoming the safe choice, CIOs feel more comfortable adopting container orchestration at their enterprises–which makes vendors feel more comfortable investing in Kubernetes-specific tools to sell to those CIOs. This brings us to the present: The open source developers are rowing in the same direction, excited about what to build. Major enterprises (both new and legacy) are buying into Kubernetes. The major cloud providers are ready with low-cost Kubernetes-as-a-service. The numerous vendors of monitoring, logging, security, and compliance software are thrilled because the underlying software stack that they have to integrate with is becoming more predictable.

Going Multicloud

Today, the most lucrative provider of proprietary backend developer infrastructure is Amazon Web Services. Developers do not view AWS resentfully, because AWS is innovative, empowering, and cheap. If you are paying AWS a lot of money, that probably means your business is doing well. With AWS, developers do not feel the level of lock-in that was imposed by the dominant proprietary platforms of the 90s. But there is some lock-in. Once you are deeply embedded in the AWS ecosystem, using services like DynamoDB, Amazon Elastic Container Service, or Amazon Kinesis, it becomes a daunting task to move away from Amazon. With Kubernetes creating an open, common layer for infrastructure, it becomes theoretically possible to “lift and shift” your Kubernetes cluster from one cloud provider to another. If you decided to lift and shift your application, you would have to rewrite parts of it to stop using the Amazon-specific services (like Amazon S3). For example, if you wanted an S3 replacement that would run on any cloud, you could configure a Kubernetes cluster with Rook, and start to store objects on Rook with the same APIs that you would use to store them on S3.
This is a nice option to have, but I have not yet heard of anyone actually lifting and shifting their application away from a cloud–except for Dropbox, and their migration was so epic it took two and a half years. Certainly there is someone out there other than Dropbox who spends so much money on Amazon S3 that they will consider spinning up their own object store, but it would be a huge effort to do such a migration.

(Kubernetes can be used for lifting and shifting–but more likely will be used to provide a familiar operating layer in different clouds)

Kubernetes probably won’t be a tool for widespread lifting and shifting any time soon. A more likely scenario is that Kubernetes will become a ubiquitous control plane that enterprises will use on multiple clouds. NodeJS is a useful analogy. Why do people like NodeJS for their server-side applications? It’s not necessarily because Node is the fastest web server. It’s because people like to use the same language on both the client and the server. Just like NodeJS lets you move between your client and server code without having to switch languages, Kubernetes will let you switch between clouds without having to change how you think about operations. On each cloud, you will have some custom application code running on Kubernetes that interacts with the managed services provided by that cloud. Companies want to be multi-cloud–partly for disaster recovery, but also because there is actual upside to having access to managed services on the different clouds. One emerging pattern is to split infrastructure between AWS for user traffic and Google Cloud for data engineering. One company that uses this pattern is Thumbtack. At Thumbtack, the production infrastructure on AWS serves user requests. The log of transactions gets pushed from AWS to Google Cloud, where the data engineering occurs. On Google Cloud, the transaction records are queued in Cloud PubSub, a message queueing service.
Those transactions are pulled off the queue and stored in BigQuery, a system for storing and querying high volumes of data. BigQuery is used as the data lake to pull from when orchestrating machine learning jobs. These machine learning jobs are run in Cloud Dataproc, a managed service that runs Apache Spark. After training a model in Google Cloud, that model is deployed on the AWS side, where it serves user traffic. On the Google Cloud side, the orchestration of these different managed services is done by Apache Airflow, an open source tool that is one of the few pieces of infrastructure that Thumbtack does have to manage themselves on Google Cloud. Today, Thumbtack uses AWS for serving user requests and Google Cloud for data engineering and queueing in PubSub. Thumbtack trains its machine learning models in Google Cloud, and deploys them to AWS. This is just the way things are today. Thumbtack might eventually use Google Cloud for user-facing services as well.

(a multi-cloud data engineering pipeline from a Japanese company)

More companies will gradually move towards multiple clouds–and some of them will manage a unique Kubernetes cluster on each cloud. You might have a GKE Kubernetes cluster on Google to orchestrate workloads between BigQuery, Cloud PubSub, and Google Cloud ML, and you might have an Amazon EKS cluster to orchestrate workloads between DynamoDB, Amazon Aurora, and your production NodeJS application. Cloud providers are not replaceable commodities. The services provided by the different clouds will get increasingly exotic and differentiated. Enterprises will benefit from having access to the different services on the different cloud providers.

Distributed Systems Distribution

Services like Google BigQuery and AWS Redshift are popular because they give you a powerful, scalable, multi-node tool with a simple API. Developers often choose these managed services because they are so easy to use. You don’t see these types of paid tools for single nodes.
I don’t pay for NodeJS, or React, or Ruby on Rails. Tools for a single node are much easier to operate than tools for a distributed system. It’s harder to deploy Hadoop across servers than to run a Ruby on Rails application on my laptop. With Kubernetes, this is going to change. If you are using Kubernetes, you can use a distributed systems package manager called Helm. This is like npm for Kubernetes applications. If you are using Kubernetes, you can use Helm to easily install a complicated multi-node application, no matter what cloud provider you are on. Here’s a description of Helm: Helm helps you manage Kubernetes applications — Helm Charts help you define, install, and upgrade even the most complex Kubernetes application. Charts are easy to create, version, share, and publish — so start using Helm and stop the copy-and-paste madness. A package manager for distributed systems. Amazing! Let’s take a look at what I can install.

Not pictured: WordPress, Jenkins, Kafka

Distributed systems are hard to set up. Ask anyone who has set up their own Kafka cluster. Helm is going to make installing Kafka as easy as installing a new version of Photoshop on your MacBook. And you will be able to do this on any cloud. Before Helm, the closest thing to a distributed systems package manager (that I am aware of) was the marketplace that you find on AWS or Azure, or the Google Cloud Launcher. Here again we see how proprietary software can lead to fragmentation. Before Helm, there was no standard, platform-agnostic way of one-click installing Kafka. You could find a way to one-click install Kafka on AWS, Google, or Azure. But each of these installations had to be written independently, for that specific cloud provider. And to install Kafka on Digital Ocean, you need to follow a 10-step tutorial. Helm represents a cross-platform system for distributing multi-node software on any Kubernetes instance.
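For a sense of how lightweight the chart format is, here is a minimal Chart.yaml, the metadata file at the root of every Helm chart (the name, version, and description values here are hypothetical):

```yaml
# Chart.yaml: metadata for a hypothetical chart
apiVersion: v1
name: my-kafka
version: 0.1.0
description: A hypothetical chart that installs a small Kafka cluster
```

The chart directory would also contain templated Kubernetes manifests; helm install then deploys the whole set with one command, on any cluster.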
You could use an application installed with Helm in any cloud provider or on your own hardware. You could easily install Apache Spark or Cassandra–systems that are notoriously difficult to set up and operate. Helm is a package manager for Kubernetes–but it also looks like the beginnings of an app store for Kubernetes. With an app store, you could sell software for Kubernetes. What kind of software could you sell? You could sell enterprise distributions of distributed systems platforms like Cloudera Hadoop, Databricks Spark, and Confluent Kafka. You could sell hard-to-install software like the Prometheus monitoring system or a multi-node database like Cassandra. Maybe you could even sell higher-level, consumer-grade software like Zendesk. The idea of a self-hosted Zendesk sounds crazy, but someone could build that, and sell it in the form of a proprietary binary, for a flat fee instead of a subscription. If I sell you a $99 Zendesk-for-Kubernetes, and you can easily run it on your Kubernetes cluster on AWS, you are going to end up saving a lot of money on support ticketing software. Enterprises often run their own WordPress to manage a company blog. Is the software for Zendesk that much more complicated than WordPress? I don’t think so–but enterprises are much more scared of managing their own help desk software than managing their own blogging software. I run a pretty small business, but I subscribe to a LOT of different software-as-a-service tools: an expensive WordPress host, an expensive CRM for advertising sales, MailChimp for the newsletter. I pay for these services because they are super reliable and secure, and they are complex multi-node applications. I wouldn’t want to host my own. I wouldn’t want to support them. I don’t want to be responsible for technical troubleshooting when my newsletter fails to send. I want to run less software. In contrast, I’m not afraid that my single-node software is going to break.
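The savings claim is easy to make concrete with back-of-the-envelope numbers (all hypothetical: the per-agent price, team size, and time horizon are made up for illustration):

```python
# Hypothetical comparison: per-seat SaaS subscription vs. a one-time license.
seat_price_per_month = 25   # hypothetical SaaS price per agent per month
agents = 5                  # hypothetical support team size
months = 24                 # hypothetical time horizon

saas_cost = seat_price_per_month * agents * months  # recurring cost over the horizon
license_cost = 99                                   # one-time purchase

print(saas_cost, license_cost)  # 3000 99
```

Of course, the one-time license shifts the operational burden (hosting, upgrades, troubleshooting) onto the buyer, which is exactly the tradeoff weighed in the rest of this section.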
Software that I use from a single node tends to be much less expensive because I don’t have to buy it as a “service”. The software that I use to write music has a one-time fixed cost. Photoshop has a one-time fixed cost. I pay for the electricity to run my computer, but other than that I have no ongoing expense to run Photoshop. When multi-node applications are as reliable as single-node applications, we will see changes in the pricing models. Maybe someday I will be able to purchase a Zendesk-for-Kubernetes. The Zendesk-for-Kubernetes will give me everything I need from a help desk–it will spin up an email server, and it will give me a web frontend to manage tickets. And if anything goes wrong, I can pay for support when I need it.

Zendesk is a fantastic service–but it would be cool if it had a fixed pricing model

Metaparticle

With Kubernetes, it becomes easier to deploy and manage distributed applications. With Helm, it becomes easier to distribute those applications to other users. But it’s still pretty hard to develop distributed systems. This was the focus of a recent CloudNativeCon/KubeCon keynote by Brendan Burns, called “This Job is Too Hard: Building New Tools, Patterns, and Paradigms to Democratize Distributed Systems Development”. In his presentation, Brendan presented a project called Metaparticle. Metaparticle is a standard library for cloud-native development, with the goal of democratizing distributed systems. In an introduction to Metaparticle, Brendan wrote: Cloud native development is bespoke, complicated and limited to a small number of expert developers. Metaparticle changes this by introducing a set of utilities in familiar programming languages that meet the developer where they are and enables them to begin to develop cloud-native systems using familiar language features. Metaparticle builds on top of Kubernetes primitives to make distributed synchronization easier as well.
It supplies language-independent modules for locking and leader election as easy-to-use abstractions in familiar programming languages. After decades of distributed systems research and application, patterns have emerged about how we build these systems. We need a way to lock a variable, so that two nodes will not be able to write to that variable in a nondeterministic fashion. We need a way to do master election, so that if the master node dies, the other nodes can pick a new node to orchestrate the system. Today, we use tools like etcd and Zookeeper to help us with master election and locking–and these tools have an onboarding cost. Brendan illustrates this with the example of Zookeeper, which is used by both Hadoop and Kafka to do master election. Zookeeper takes significant time and effort to learn to operate. During the construction of both Hadoop and Kafka, the founding engineers of those projects engineered their systems to work with Zookeeper to maintain a master node. If I am writing a system to do distributed MapReduce, I would like to avoid thinking about node failures and race conditions. Brendan’s idea is to push those problems down into a standard library–so the next developer who comes along with a new idea for a multi-node application has an easier time. Important meta-point: Metaparticle is built with the assumption that you are on Kubernetes. This is a language-level abstraction that is built with an assumption about the underlying (distributed) operating system–which once again brings us back to standards. If everyone is on the same distributed operating system, we can make big assumptions about the downstream users of our projects. This is why my mind is blown by Kubernetes. It is a layer of standards across a world of heterogeneous systems.
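The two primitives in question can be illustrated with a toy sketch (hypothetical Python, with threads standing in for nodes; real implementations sit on Zookeeper or etcd, and Metaparticle's actual API differs):

```python
import threading

# Toy versions of the two primitives discussed above: a lock that keeps
# concurrent writers from clobbering shared state, and a crude "election"
# in which the first node to campaign becomes leader. Threads stand in
# for nodes; real systems get these primitives from Zookeeper or etcd.

class Cluster:
    def __init__(self):
        self._lock = threading.Lock()  # stands in for a distributed lock
        self.leader = None
        self.counter = 0

    def campaign(self, node_id):
        """First node to campaign wins leadership; everyone else loses."""
        with self._lock:
            if self.leader is None:
                self.leader = node_id
                return True
            return False

    def increment(self):
        """Without the lock, interleaved updates could be lost."""
        with self._lock:
            self.counter += 1

cluster = Cluster()

def node(node_id):
    cluster.campaign(node_id)
    for _ in range(1000):
        cluster.increment()

threads = [threading.Thread(target=node, args=(f"node-{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(cluster.leader is not None, cluster.counter)  # True 4000
```

The point of a standard library like Metaparticle is that application developers would get these guarantees as language-level abstractions, instead of wiring up an external coordination service by hand.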
Serverless Workloads

Functions as a service (often called “serverless” functions) are a powerful, cheap abstraction that developers can use alongside Kubernetes, on top of Kubernetes, and in some cases instead of Kubernetes altogether. Let’s quickly review the modern landscape of serverless applications, and then consider the relationship between serverless and Kubernetes. A quick review of functions as a service: functions as a service are deployable functions that run without an addressable server. Functions as a service deploy, execute, and scale with a single call by the developer. Until you make that call to the serverless function, your function is not running anywhere–so you are not using up resources other than the database that is storing your raw code. When you call a function as a service, your cluster will take care of scheduling and running that function. You don’t have to worry about spinning up a new machine and monitoring that machine, and spinning the machine down once it becomes idle. You just tell the cluster that you want to run a function, and the cluster executes it and returns the result. When you “deploy” a serverless function, the function code is not actually deployed. Your code sits in a database in plain text. When you call the function, your code is taken out of the database entry, loaded onto a Docker container, and executed.

(diagram from https://medium.com/openwhisk/uncovering-the-magic-how-serverless-platforms-really-work-3cb127b05f71)

AWS Lambda pioneered this idea of a function as a service in 2014. Since then, developers have been thinking of all kinds of use cases. For a comprehensive list of how developers are using serverless, there is a shared Google Doc created by the CNCF Serverless Working Group (34 pages at the time of this article).
From my conversations on Software Engineering Daily, these functions as a service have shown at least two clear applications:

- Compute that can scale up quickly and cheaply in response to bursty workloads (for example, Yubl’s social media scalability case study)
- Event-driven “glue code” with variable workload frequency (for example, an event sourcing model with a variety of database consumers)

To create a FaaS platform, a cloud provider provisions a cluster of Docker containers called invokers. Those invokers wait to get blobs of code scheduled onto them. When you make a request for your code to execute, there is a period of time that you have to wait for that code to be loaded onto an invoker and executed. This time spent waiting is the “cold start” problem. The cold start problem is one of the tradeoffs that you make if you decide to run part of your application on FaaS. You don’t pay for the uptime of a server that isn’t doing any work–but when you want to call your function, you have to wait for your code to be scheduled onto an invoker. On AWS there are invokers designated for incoming AWS Lambda requests. On Microsoft Azure there are invokers designated for Azure Functions requests. On Google Cloud there are invokers reserved for Google Cloud Functions. For most developers, using the function-as-a-service platforms from AWS, Microsoft, Google, or IBM will be fine. The costs are low, and the cold start problem is not an issue for most applications. But some developers will want to get costs even lower. Or they might want to write their own scheduler that defines how code gets scheduled onto invoker containers. These developers can roll their own serverless platform. Open source FaaS projects such as Kubeless let you provision your own serverless cluster on top of Kubernetes. You can define your own pool of invokers. You can determine how containers are scheduled against jobs.
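Rolling your own platform means owning these scheduling decisions. The invoker mechanics described above can be reduced to a toy model (hypothetical Python, with made-up latency numbers) showing why the first call to a function pays a cold-start penalty and warm calls do not:

```python
# Toy invoker pool illustrating the cold-start tradeoff. The latency
# numbers are invented; the point is that the first call pays a startup
# cost that subsequent warm calls avoid.

COLD_START_MS = 250   # time to load code onto an idle invoker (hypothetical)
EXEC_MS = 10          # time to actually run the function (hypothetical)

class InvokerPool:
    def __init__(self):
        self.warm = set()   # invokers that already have the code loaded

    def invoke(self, function_name: str) -> int:
        """Return simulated latency in ms for one invocation."""
        if function_name in self.warm:
            return EXEC_MS
        self.warm.add(function_name)      # code is now cached on an invoker
        return COLD_START_MS + EXEC_MS    # first call pays the cold start

pool = InvokerPool()
print(pool.invoke("resize-image"))  # cold: 260
print(pool.invoke("resize-image"))  # warm: 10
```

A real scheduler also has to decide when to evict warm code, how many invokers to keep idle, and how to bin-pack concurrent requests, which is where custom schedulers on Kubernetes get interesting.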
You can decide how to solve the cold start problem for your own cluster. The open source FaaS platforms for Kubernetes are just one type of resource scheduler. They are just a preview of other custom schedulers we will see on top of Kubernetes. Developers are always building new schedulers in order to build more efficient services on top of those schedulers. So–what other types of schedulers will be built on top of Kubernetes? Well, as they say, the future is already here, but it’s only available as an AWS managed service. AWS has a new service called Amazon Aurora Serverless–a database that scales storage and compute up and down automatically. From Jeff Barr’s post about Amazon Aurora Serverless: When you create an Aurora Database Instance, you choose the desired instance size and have the option to increase read throughput using read replicas. If your processing needs or your query rate changes you have the option to modify the instance size or to alter the number of read replicas as needed. This model works really well in an environment where the workload is predictable, with bounds on the request rate and processing requirement. In some cases the workloads can be intermittent and/or unpredictable, with bursts of requests that might span just a few minutes or hours per day or per week. Flash sales, infrequent or one-time events, online gaming, reporting workloads (hourly or daily), dev/test, and brand-new applications all fit the bill. Arranging to have just the right amount of capacity can be a lot of work; paying for it on a steady-state basis might not be sensible. Because storage and processing are separate, you can scale all the way down to zero and pay only for storage. I think this is really cool, and I expect it to lead to the creation of new kinds of instant-on, transient applications. Scaling happens in seconds, building upon a pool of “warm” resources that are raring to go and eager to serve your requests.
We are not surprised that AWS can build something like this, but it’s hard to imagine how it could emerge as an open source project—until Kubernetes. This is the type of system that random developers could build on top of Kubernetes. If you wanted to build your own serverless database on top of Kubernetes, you’ve got a number of scheduling problems to solve. You need different resource tiers for networking, storage, logging, buffering, and caching. For each of those resource tiers, you need to define how resources will get scheduled to scale up and down in response to demand. Just like Kubeless offers a scheduler for small bits of functional code, we might see other custom schedulers that people use as building blocks for bigger applications. And once you actually build your serverless database, perhaps you could sell it on the Helm app store as a $99 one-time purchase.

Conclusion

I hope you have enjoyed this tour through some Kubernetes history and speculation about the future. As 2018 begins, here are some of the areas we’d like to explore this year:

- How do people manage deployment of machine learning models on Kubernetes? We did a show with Matt Zeiler where he discussed this, and it sounded complicated.
- Is Kubernetes used for self-driving cars? If so, do you deploy one single cluster to manage the entire car?
- What does a Kubernetes IoT deployment look like? Does it make sense to run Kubernetes across a set of devices with intermittent network connections?
- What are the new infrastructure products and developer tools that will be built with Kubernetes? What are the new business opportunities?

Kubernetes is a fantastic tool for building modern application backends–but it is still just a tool. If Kubernetes fulfills its mission, it will eventually fade into the background. There will come a time when we talk about Kubernetes like we talk about compilers or operating system kernels.
Kubernetes will be lower-level plumbing that is not in the purview of the average application developer. But until then, we’ll continue to report on it. The post The Gravity of Kubernetes appeared first on Software Engineering Daily.
Jan 12, 2018 • 51min

Kubernetes Vision with Brendan Burns

Kubernetes has become the standard system for deploying and managing clusters of containers. But the vision of the project goes beyond managing containers. The long-term goal is to democratize the ability to build distributed systems. Brendan Burns is a co-founder of the Kubernetes project. He recently announced an open-source project called Metaparticle, a standard library for cloud-native development: Metaparticle builds on top of Kubernetes primitives to make distributed synchronization easier… It supplies language-independent modules for locking and leader election as easy-to-use abstractions in familiar programming languages. After decades of distributed systems research and application, patterns have emerged about how we build these systems. We need a way to lock a variable so that two nodes will not be able to write to that variable in a non-deterministic fashion. We need a way to do a master election so that if the master node dies, the other nodes can pick a new node to orchestrate the system. We know that just about every distributed application needs locking and leader election–so how can we build these features directly into our programming tools, rather than bolting them on? With Kubernetes providing a standard operating system for distributed applications, we can start to build standard libraries that assume we have access to underlying Kubernetes primitives. Instead of calling out to external tools like Zookeeper and etcd, a standard library like Metaparticle will abstract them away. An example: if I am writing a system to do distributed mapreduce, I would like to avoid thinking about node failures and race conditions. Brendan’s idea is to push those problems down into a standard library–so the next developer who comes along with a new idea for a multi-node application has an easier time. 
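To make those primitives concrete, here is a toy, in-memory simulation of lease-based locking and leader failover. The `LeaseLock` class is invented for illustration and is not Metaparticle's actual API; real implementations sit on top of a replicated store such as etcd or Kubernetes Leases:

```python
# Toy, in-memory simulation of a lease-based lock -- the kind of
# primitive a standard library like Metaparticle wraps. The names are
# invented for illustration; this is NOT Metaparticle's actual API.

class LeaseLock:
    """A single lock record, as a coordination store might hold it."""

    def __init__(self, ttl: float = 2.0):
        self.ttl = ttl          # how long a lease lasts without renewal
        self.holder = None      # which node currently holds the lock
        self.expires_at = 0.0   # when the current lease lapses

    def try_acquire(self, node: str, now: float) -> bool:
        # A node wins if the lock is free, expired, or already its own.
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False

# Three nodes race for leadership; exactly one wins.
lock = LeaseLock(ttl=2.0)
results = {n: lock.try_acquire(n, now=0.0)
           for n in ["node-a", "node-b", "node-c"]}
# If the leader dies and stops renewing, the lease expires and
# another node can take over.
failover = lock.try_acquire("node-b", now=3.0)
```

The point of a library like Metaparticle is that an application developer never writes this logic by hand; the standard library handles acquisition, renewal, and failover.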
Brendan Burns currently works as a distinguished engineer at Microsoft, and he joins the show to discuss why it is still hard to build distributed systems and what can be done to make it easier. This is the second time we have had Brendan on the show. The first time he came on, he discussed the history of Kubernetes and some of the design decisions of the system. This episode was more about the future. Full disclosure: Microsoft (where Brendan is employed) is a sponsor of Software Engineering Daily.

Show Notes

Get a free digital copy of Brendan’s new O’Reilly book, Designing Distributed Systems: http://aka.ms/sedaily

The post Kubernetes Vision with Brendan Burns appeared first on Software Engineering Daily.
Jan 11, 2018 • 58min

High Volume Distributed Tracing with Ben Sigelman

You are requesting a car from a ridesharing service such as Lyft. Your request hits the Lyft servers and begins trying to get you a car. It takes your geolocation, and passes the geolocation to a service that finds cars that are nearby, and puts all those cars into a list. The list of nearby cars is sent to another service, which sorts the list of cars by how close they are to you, and how high their star rating is. Finally, your car is selected, and sent back to your phone in a response from the server. In a “microservices” environment, multiple services often work together to accomplish a user task. In the example I just gave, one service took geolocation and turned it into a list, another service took a list and sorted it, and another service sent the actual response back to the user. This is a common pattern: Service A calls service B, which calls service C, and so on.  When one of those services fails along the way, how do you identify which one it was? When one of those services fails to deliver a response quickly, how do you know where your latency is coming from? The solution is distributed tracing. To implement distributed tracing, each user level request gets a request identifier associated with it. When service A calls service B, it also hands off the unique request ID, so that the overall request can be traced as it passes through the distributed system (and if that doesn’t make sense–don’t worry, we explain it again during the show). Ben Sigelman began working on distributed tracing when he was at Google and authored the “Dapper” paper. Dapper was implemented at Google to help debug some of the distributed systems problems faced by the engineers who work on Google infrastructure. A request that moves through several different services spends time processing each of those services. A distributed tracing system measures the time spent in each of those services–that time spent is called a span. 
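The trace-ID handoff can be sketched in a few lines. This is a deliberately simplified illustration, not the Dapper implementation; real tracers also record timings, parent-span relationships, and sampling decisions:

```python
# Simplified illustration of trace-ID propagation across services.
import uuid

def handle_request(service, trace, trace_id=None, downstream=()):
    """Record a span for this service, then pass the SAME trace ID to
    the next service in the call chain."""
    trace_id = trace_id or uuid.uuid4().hex  # minted once, at the edge
    trace.append({"trace_id": trace_id, "service": service})
    if downstream:  # service A calls B, which calls C, and so on
        handle_request(downstream[0], trace, trace_id, downstream[1:])
    return trace_id

# One user request fans through three services; every span shares the
# request's single trace ID, so the full path can be reassembled later.
spans = []
tid = handle_request("service-a", spans,
                     downstream=("service-b", "service-c"))
```

Because every span carries the same ID, a collector can later stitch the spans from all three services back into one end-to-end trace.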
A single request that has to hit 20 different services will have 20 spans associated with it. Those spans get collected into a trace. A trace can be evaluated to look at the latencies of each of those services. If you are trying to improve the speed of a distributed systems infrastructure, distributed tracing can be very helpful for choosing where to focus your attention. The published Google papers of ten years ago often turn out to be the companies of today. Some examples include MapReduce (which formed the basis of Cloudera), Spanner (which formed the basis of CockroachDB), and Dremel (which formed the basis of Dremio). Today, a decade after he started thinking about distributed tracing, Ben Sigelman is the CEO of Lightstep, a company that provides distributed tracing and other monitoring technologies. Lightstep’s distributed tracing model still bears a resemblance to the techniques described in the Dapper paper–so I was eager to learn the differences between open source versions of distributed tracing (such as OpenZipkin) and enterprise providers such as Lightstep. The key feature of Lightstep that we discussed: garbage collection. If you are using a distributed tracing system, you could be collecting a lot of traces. You could collect a trace for every single user request. Not all of these traces are useful–but some of them are very useful. Maybe you only want to keep traces with exceptionally high latency. Maybe you want to keep every trace from the last 5 days and destroy them over time. So, the question of how to manage the storage footprint of those traces was as interesting as the discussion of distributed tracing itself. Beyond the distributed tracing features of his product, Ben has a vision for how his company can provide other observability tools over time.
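Those retention choices can be made concrete with a toy policy: keep exceptionally slow traces indefinitely, and age everything else out after five days. This is an invented sketch, not how Lightstep's garbage collection actually works:

```python
# Invented sketch of a trace retention policy. The rule: slow traces
# are always interesting; everything else ages out after five days.
DAY = 86_400  # seconds

def keep_trace(latency_ms, age_seconds,
               slow_threshold_ms=1_000.0, max_age=5 * DAY):
    """Decide whether a stored trace survives garbage collection."""
    if latency_ms >= slow_threshold_ms:
        return True  # keep exceptionally slow traces regardless of age
    return age_seconds < max_age  # fast traces only kept while recent

# A fast, week-old trace is collected; a slow month-old one survives.
```

Real systems layer sampling on top of retention, but even this crude rule shows how the storage footprint can be bounded without losing the traces most worth debugging.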
I spoke to Ben at KubeCon–and although this conversation does not talk about Kubernetes specifically, this topic is undoubtedly interesting to people who are building Kubernetes technologies. The post High Volume Distributed Tracing with Ben Sigelman appeared first on Software Engineering Daily.
Jan 10, 2018 • 46min

Kubernetes on AWS with Arun Gupta

Since Kubernetes came out, engineers have been deploying clusters to Amazon. In the early years of Kubernetes, deploying to AWS meant that you had to manage the availability of the cluster yourself. You needed to configure etcd and your master nodes in a way that avoided having a single point of failure. Deploying Kubernetes on AWS became simpler with an open-source tool called kops (short for Kubernetes Operations). Kops automates the provisioning and high-availability deployment of Kubernetes. In late 2017, AWS released a managed Kubernetes service called EKS. EKS allows developers to run Kubernetes without having to manage the availability and scaling of a cluster. The announcement of EKS was exciting because it means that all of the major cloud providers are officially supporting Kubernetes. Arun Gupta is a principal open-source technologist at AWS, and he joins the show to explain what is involved in deploying and managing a Kubernetes cluster. Arun describes how to operate a Kubernetes cluster, including logging, monitoring, storage, and updates. If you are convinced that you want to use Kubernetes, but you aren’t sure yet how you want to deploy it, this will be useful information for you. We also discussed how Amazon built EKS, and some of the architectural decisions they made. AWS has had a managed container service called ECS since 2014. The development of ECS was instructive for the AWS engineers who built EKS. Amazon wanted to make EKS able to integrate with both open source tools and the Amazon managed services. The post Kubernetes on AWS with Arun Gupta appeared first on Software Engineering Daily.
Jan 9, 2018 • 52min

Istio Motivations with Louis Ryan

A single user request hits Google’s servers. A user is looking for search results. In order to deliver those search results, that request will have to hit several different internal services on the way to getting a response. These different services work together to satisfy the user’s request. All of these services need to communicate efficiently, they need to scale, and they need to be secure. Services need to have a consistent way of being “observable”–allowing logging and monitoring. Services need to have proper security. Since every service wants these different features (like communication, load balancing, security), it makes sense to build these features into a common system that can be deployed to every server. Louis Ryan has spent his years at Google working on service infrastructure. During that time, he has seen massive changes in the way traffic flows through Google. First, the rise of Android, and all of the user traffic from mobile phones. And second, the rise of the Google Cloud Platform, which meant that Google was now responsible for nodes deployed by users outside of Google. These two changes–mobile and cloud–led to an increase in the amount of traffic and the type of traffic. All of this traffic leads to more internal services communicating with each other. How does service networking change in such an environment? Google’s adaptation to the new networking conditions is to introduce a “service mesh”. A service mesh is a network for services. It provides observability, resiliency, traffic control, and other features to every service that plugs into it. Each service needs to plug into the service mesh. In Kubernetes, services connect to the mesh through a sidecar. Let me explain the term “sidecar.” Kubernetes manages its resources in pods, and each pod contains a set of containers. You might have a pod that is dedicated to responding to any user that is requesting a picture of a cat. 
Within that pod, you not only have the container that serves the cat picture–you also have other “sidecar” containers that help out the application container. You could have a sidecar that gets deployed next to your application container that handles logging, or a sidecar that helps out with monitoring, or network communications. If you are using the Istio service mesh, that means that you are using a sidecar called Envoy. Envoy is a “service proxy” that runs as a sidecar and provides configuration updates, load balancing, proxying, and lots of other benefits. If we get all that out of Envoy, why do we need a separate abstraction of a “service mesh”? Because it helps to have a tool that aggregates and centralizes all the different communications among these proxies. Every service gets a sidecar for a service proxy. Every service proxy communicates with the centralized service mesh. Louis Ryan joins this episode to explain the motivations for building the Istio service mesh, and the problems it solves for Kubernetes developers. For the next two weeks, we are covering exclusively the world of Kubernetes. Kubernetes is a project that is likely to have as much impact as Linux–and it is very early days. Whether you are an expert in Kubernetes or you are just starting out, we have lots of episodes to fit your learning curve. To find all of our old episodes about Kubernetes (including a previous show about Istio), download the Software Engineering Daily app for iOS or for Android. In other podcast players, only the 100 most recent episodes are available, but in our apps you can find all 650 episodes–and there is also plenty of content that is totally unrelated to Kubernetes! The post Istio Motivations with Louis Ryan appeared first on Software Engineering Daily.
Jan 8, 2018 • 1h 9min

Kubernetes Usability with Joe Beda

Docker was released in 2013 and popularized the use of containers. A container is an abstraction for isolating a well-defined portion of an operating system. Developers quickly latched onto containers as a way to cut down on the cost of virtual machines–as well as isolate code and simplify deployments. Developers began deploying so many containers that they needed a centralized way to manage the containers. Then came the rise of the container orchestration framework. A container orchestration framework is used to manage operating system resources. Twitter had been using the open-source orchestration framework Apache Mesos to manage resources since 2010. So when developers started looking for a way to manage containers, many of them chose Mesos. Meanwhile, another container orchestration system came onto the scene: Docker Swarm. Swarm is a tool for managing containers that came from the same people who originally created Docker. Shortly thereafter, Kubernetes came out of Google. Google had been using containers internally with their cluster manager Borg. Kubernetes is a container orchestration system that was built with the lessons of managing resources at Google. The reason I recount this history (as I have in a few past episodes) is to underscore that there were a few years when there was a lot of debate around the best container orchestration system to use. Some people preferred Swarm, some preferred Mesos, and some preferred Kubernetes. Because of the lack of standardization, the community of developers who were building infrastructure in this space were put in a tough position: should you pick a specific orchestration framework and go all in, building only tools for that one framework? Should you try to build tools compatible with all three frameworks, and attempt to satisfy Kubernetes, Mesos, and Swarm users? Or should you sit out altogether and build nothing?
The fracturing of the community led to healthy debates, but it slowed down innovation because different developers were moving in different directions. Today, the container community has centralized: Kubernetes has become the most popular container orchestration framework. With the community centralizing on Kubernetes, developers are able to comfortably bet big on open source projects like Istio, Conduit, Rook, Fluentd, and Helm, each of which we will be covering in the next few weeks. The centralization on Kubernetes also makes it easier to build enterprise companies, who are no longer trying to think about which container orchestration to support. There is a wide array of Kubernetes-as-a-service providers offering a highly available runtime–and a variety of companies offering observability tools to make it easier to debug distributed systems problems. Despite all of these advances–Kubernetes is less usable than it should be. It still feels like operating a distributed system. Hopefully someday, operating a Kubernetes cluster will be as easy as operating your laptop computer. To get there, we need improvements in Kubernetes usability. Today’s guest Joe Beda was one of the original creators of the Kubernetes project. He is a founder of Heptio, a company that provides Kubernetes tools and services for enterprises. I caught up with Joe at KubeCon 2017, and he told me about where Kubernetes is today, where it is going, and what he is building at Heptio. Full disclosure–Heptio is a sponsor of Software Engineering Daily. For the next two weeks, we are covering exclusively the world of Kubernetes. Kubernetes is a project that is likely to have as much impact as Linux. Whether you are an expert in Kubernetes or you are just starting out, we have lots of episodes to fit your learning curve. To find all of our old episodes about Kubernetes, download the Software Engineering Daily app for iOS or for Android. 
In other podcast players, only the 100 most recent episodes are available, but in our apps, you can find all 650 episodes–and there is also plenty of content that is totally unrelated to Kubernetes! The post Kubernetes Usability with Joe Beda appeared first on Software Engineering Daily.
