
Cloud Engineering Archives - Software Engineering Daily
Episodes about building and scaling large software projects
Latest episodes

Apr 27, 2018 • 45min
Google Cluster Evolution with Brian Grant
Google’s central system for managing compute resources is called Borg. On Borg, millions of Linux containers process a wide variety of workloads. When a new application is spun up, Borg provides that application with the resources it needs.
Workloads at Google usually fall into one of two distinct categories: long-running application workloads (such as Gmail) and batch workloads (such as a MapReduce job). In the early days of Google, the long-lived workloads were scheduled onto a system called “BabySitter” and the batch workloads were scheduled onto a system called “Global Work Queue.”
Borg was the first cluster manager at Google designed to service both long-running and batch workloads from a single system. The second cluster manager at Google was Omega, a project that was created to improve the engineering behind Borg. The innovations of Omega improved the efficiency and architecture of Borg.
More recently, Kubernetes was created as an open source implementation of the ideas pioneered in Borg and Omega. Google has also built a Kubernetes as a service offering that companies use to run their infrastructure in the same way that Google does.
Brian Grant is an engineer at Google who has seen the evolution of all three cluster management systems that have come out of Google. He joins the show to discuss how the workloads at Google have changed over time, and how his perspective on how to build and architect distributed systems has evolved. Full disclosure: Google is a sponsor of Software Engineering Daily.
The post Google Cluster Evolution with Brian Grant appeared first on Software Engineering Daily.

Apr 24, 2018 • 59min
NATS Messaging with Derek Collison
A message broker is an architectural component that sends messages between different nodes in a distributed system.
Message brokers are useful because the sender of a message does not always know who might want to receive that message. Message brokers can be used to implement the “publish/subscribe” pattern, and centralizing message traffic within a pub/sub system lets operators scale the performance of the messaging infrastructure by scaling that single system.
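The decoupling described above is easy to see in miniature. Here is an in-process sketch of the publish/subscribe pattern in Python (illustrative only, not a real NATS client; the subject name and message shape are made up for the example):

```python
from collections import defaultdict

class PubSub:
    """Minimal in-process pub/sub broker. Subscribers register a
    callback per subject; publishers never need to know who listens."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, subject, callback):
        self._subscribers[subject].append(callback)

    def publish(self, subject, message):
        # Fan the message out to every subscriber of this subject.
        for callback in self._subscribers[subject]:
            callback(message)

# Two decoupled consumers of the same event stream.
bus = PubSub()
received = []
bus.subscribe("orders.created", lambda m: received.append(("billing", m)))
bus.subscribe("orders.created", lambda m: received.append(("shipping", m)))
bus.publish("orders.created", {"order_id": 42})
```

A real broker like NATS adds the parts this sketch omits: network transport, wildcard subjects, and delivery guarantees, all of which can be scaled independently of the publishers and subscribers.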
Derek Collison has worked on messaging infrastructure for 25 years. He started at TIBCO, then spent time at Google and VMware. While at VMware, he architected the open source platform Cloud Foundry. While working on Cloud Foundry, Derek developed NATS, a messaging control plane.
Since that time, Derek has started two companies–Apcera and Synadia Communications. In our conversation, Derek and I discussed the history of message brokers, how NATS compares to Kafka, and his ideas for how NATS could scale in the future to become something much more than a centralized message bus.

Apr 23, 2018 • 1h 2min
Stripe Observability Pipeline with Cory Watson
Stripe processes payments for thousands of businesses. A single payment could involve 10 different networked services. If a payment fails, engineers need to be able to diagnose what happened. The root cause could lie in any of those services.
Distributed tracing is used to find the causes of failures and latency within networked services. In a distributed trace, each period of time associated with a request is recorded as a span. The spans can be connected together because they share a trace ID.
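The span/trace relationship described above can be sketched in a few lines. This is a simplified illustration of the data model (field names are hypothetical, not the API of any particular tracing library):

```python
import time
import uuid

class Span:
    """One timed operation within a request. Spans that share a
    trace_id belong to the same distributed trace; parent_id links
    them into a call tree."""

    def __init__(self, name, trace_id, parent_id=None):
        self.name = name
        self.trace_id = trace_id
        self.parent_id = parent_id
        self.span_id = uuid.uuid4().hex
        self.start = time.time()
        self.end = None

    def finish(self):
        self.end = time.time()

# A payment request touching two services: the shared trace_id is
# what lets a tracing backend stitch the spans back together.
trace_id = uuid.uuid4().hex
root = Span("checkout", trace_id)
child = Span("charge-card", trace_id, parent_id=root.span_id)
child.finish()
root.finish()
```

In a real system the trace ID is propagated across the network, typically in request headers, so that spans recorded by different services can be joined after the fact.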
The spans of a distributed trace are one element of observability. Others include metrics and logs. Each of these components of observability makes its way into services like Lightstep and Datadog. The path traveled by different elements of observability is called the observability pipeline.
In an episode last year, Cory Watson explained how observability works at Stripe. In today’s episode, Cory describes how observability is created and aggregated. It’s a useful discussion for anyone working at a company that is figuring out how to instrument their systems for better monitoring.

Apr 16, 2018 • 47min
Monitoring Kubernetes with Ilan Rabinovitch
Monitoring a Kubernetes cluster allows operators to track the resource utilization of the containers within that cluster. In today’s episode, Ilan Rabinovitch joins the show to explore the different options for setting up monitoring, and some common design patterns around Kubernetes logging and metrics gathering.
Ilan is the VP of product and community at Datadog. Earlier in his career, Ilan spent much of his time working with Linux and taking part in the Linux community. We discussed the similarities and differences between the evolution of Linux and that of Kubernetes.
In previous episodes, we have explored some common open source solutions for monitoring Kubernetes–including Prometheus and the EFK stack. Since Ilan works at Datadog, we explored how hosted solutions compare to self-managed monitoring. We also talked about how to assess different hosted solutions–such as those from a large cloud provider like AWS versus vendors that are specifically focused on monitoring. Full disclosure: Datadog is a sponsor of Software Engineering Daily.

Apr 11, 2018 • 54min
Go Systems with Erik St. Martin
Go is a language designed to improve systems programming. Go includes abstractions that simplify aspects of low level engineering that are historically difficult—concurrency, resource allocation, and dependency management. In that light, it makes sense that the Kubernetes container orchestration system was written in Go.
Erik St. Martin is a cloud developer advocate at Microsoft, where he focuses on Go and Kubernetes. He also hosts the podcast “Go Time,” and has co-written a book on Go called Go in Action.
Recently, Erik helped build the Virtual Kubelet project, which allows Kubernetes nodes to be backed by services outside of the cluster. If you want your Kubernetes cluster to leverage abstractions such as serverless functions and standalone container instances, you can use Virtual Kubelet to treat those abstractions as nodes.
Erik also discussed his experience using Kubernetes at Comcast—which was a great case study. Near the end of the show, he also talked about organizing Gophercon, a popular conference for Go programmers—if you are organizing a conference or thinking about organizing one, it will be useful information to you. Full disclosure: Microsoft, where Erik works, is a sponsor of Software Engineering Daily.

Apr 10, 2018 • 55min
Database Chaos with Tammy Butow
Tammy Butow has worked at Digital Ocean and Dropbox, where she built out infrastructure and managed engineering teams. At both of these companies, the customer base was at a massive scale.
At Dropbox, Tammy worked on the database that holds metadata used by Dropbox users to access their files. To call this metadata system simply a “database” is an understatement–it is actually a multi-tiered system of caches and databases. This metadata is extremely sensitive–this is metadata that tells you where the objects across Dropbox are located–so it has to be highly available.
To encourage that reliability, Tammy helped institute chaos engineering–inducing random failures across the Dropbox infrastructure, and making sure that the Dropbox systems could automatically respond to those failures. If you are unfamiliar with the topic, we have covered chaos engineering in two previous episodes of Software Engineering Daily.
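The core loop of chaos engineering described above, inducing a random failure and verifying the system still serves, can be sketched with a toy replicated store (a simplified stand-in; the class and helper names are invented for this illustration):

```python
import random

class ReplicatedStore:
    """Toy replicated key-value store: reads succeed as long as
    at least one replica holding the key is alive."""

    def __init__(self, n_replicas=3):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.alive = [True] * n_replicas

    def write(self, key, value):
        for i, replica in enumerate(self.replicas):
            if self.alive[i]:
                replica[key] = value

    def read(self, key):
        for i, replica in enumerate(self.replicas):
            if self.alive[i] and key in replica:
                return replica[key]
        raise RuntimeError("all replicas down")

def kill_random_replica(store, rng):
    """The 'chaos' step: fail one live replica at random."""
    live = [i for i, up in enumerate(store.alive) if up]
    store.alive[rng.choice(live)] = False

store = ReplicatedStore()
store.write("file-123", "block-location-A")
kill_random_replica(store, random.Random(0))  # induce a failure...
value = store.read("file-123")                # ...reads still succeed
```

A real chaos experiment does the same thing against production-like infrastructure: inject the failure, then assert that user-facing behavior is unaffected.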
Tammy now works at Gremlin, a company that does chaos engineering as a service. In this show we talked about her experiences at Dropbox, and how to institute chaos engineering across databases. We also explored how her work at Gremlin–a smaller startup–compares to Dropbox and Digital Ocean, which are larger companies.
Show Notes
Tammy Butow Chaos Engineering Bootcamp
Information to run your own Chaos Day
How to Create a Kubernetes Cluster on Ubuntu 16.04 with kubeadm and Weave Net | Gremlin Community

Apr 9, 2018 • 43min
Site Reliability Management with Mike Hiraga
Software engineers have interacted with operations teams for as long as software has been written. In the 1990s, most operations teams worked with physical infrastructure. They made sure that servers were provisioned correctly and installed with the proper software. When software engineers shipped bad code that took down a software company, the operations teams had to help recover the system—which often meant dealing with the physical servers.
During the 90s and early 2000s, these operations engineers were often called “sysadmins,” “database admins” (if they worked on databases), or “infrastructure engineers.” Over the last decade, virtualization has led to many more logical servers across a company. Cloud computing has made infrastructure remote and programmable.
The progression of infrastructure led to a change in how operations engineers work. Since infrastructure can be interacted with through code, operations engineers are now writing a lot more code.
The “DevOps” movement can be seen through this lens. Operations teams were now writing software—and this meant that software engineers could now work on operations. Both software engineers and operators could create deployment pipelines, monitor application health, and improve the system scalability—all through written code.
Site reliability engineering (or SRE) is a newer point along the evolutionary timeline of operations. Web applications can be unstable, and SRE is focused on making a site work more reliably. This is especially important for a company that makes business applications that other companies rely on.
Mike Hiraga is the head of site reliability engineering at Atlassian. Atlassian makes several products that many businesses rely on—such as JIRA, Confluence, HipChat, and Bitbucket. Since the infrastructure is at a massive scale, Mike has a broad set of experiences from his work managing SRE at Atlassian.
One particularly interesting topic is Atlassian’s migration to the cloud. Atlassian was started in 2002, before the cloud was widely used, and they have more recently made a push to move applications into the cloud. Full disclosure: Atlassian is a sponsor of Software Engineering Daily—and they are hiring, so if you are looking for a job, check out Atlassian jobs, or send me an email directly and I’m happy to introduce you to the team.

Feb 23, 2018 • 58min
Cloud and Edge with Steve Herrod
Steve Herrod led engineering at VMware as the company scaled from 30 engineers to 3,000 engineers. After 11 years, he left to become a managing director for General Catalyst, a venture capital firm. Since he has both operating experience and a wide view of the technology landscape as an investor, he is well-equipped to discuss a topic that we have been covering on Software Engineering Daily: the integration of cloud and edge computing.
Today, we think of the cloud as a network of large data centers operated by big players like Google, Amazon, and Microsoft. The cloud is where most of the computation across the world takes place. My smartphone and laptop are “edge” devices. They are lightweight computers that don’t perform much complex processing. I would not be able to run a large production database or a 3 terabyte MapReduce job on my laptop.
The current division of labor makes sense in this world of smart clouds and low-power, low-bandwidth devices. But the devices are getting cheaper, smarter, and more numerous. Cars, drones, security cameras, sensors, and other devices can serve as points of computation that sit geographically between the edge devices and the cloud. With more devices between you and the cloud, there is an opportunity to put computation on those devices.
Everyone knows that cloud and edge computing will become intermingled in the coming years. But predicting just how it will play out is nearly impossible. And as an investor, if you bet on something too early, you get the same result as someone who was wrong altogether.
A good analogy for the “cloud and edge” space of investments might be the “smart home.” Everyone knows the smart home is coming eventually, but it’s very hard to tell how long it will be before smart home systems are in widespread use–so it is an open question of how to invest in the space.
Summer internship applications to Software Engineering Daily are also being accepted. If you are interested in working with us on the Software Engineering Daily open source project full-time this Summer, send an application to internships@softwareengineeringdaily.com. We’d love to hear from you.
If you haven’t seen what we are building, check out softwaredaily.com, or download the Software Engineering Daily app for iOS or Android. These apps have all 650 of our episodes in a searchable format–we have recommendations, categories, related links and discussions around the episodes. It’s all free and also open source–if you are interested in getting involved in our open source community, we have lots of people working on the project and we do our best to be friendly and inviting to new people coming in looking for their first open source project. You can find that project at Github.com/softwareengineeringdaily.

Feb 22, 2018 • 56min
Serverless Systems with Eduardo Laureano
On Software Engineering Daily, we have been covering the “serverless” movement in detail. For people who don’t use serverless functions, it seems like a niche. Serverless functions are stateless, auto-scaling, event-driven blobs of code. You might say “serverless sounds kind of cool, but why don’t I just use a server? It’s a paradigm I’m used to.”
Serverless is exciting not because of what it adds but because of what it subtracts. The potential of serverless technology is to someday not have to worry about scalability at all.
Today, we take for granted that if you start a new company, you are building it on cloud infrastructure. The problem of maintaining server hardware disappeared for 99% of startups, which unlocked a wealth of innovation.
The cloud also simplified scalability for most startups–but there are still plenty of companies that struggle to scale. Significant mental energy is spent on the following questions: How many database replicas do I need? How do I configure my load balancer? How many nodes should I put in my Kafka cluster?
Serverless functions are important because they are an auto-scaling component that sits at a low level. This makes it easy to build auto-scaling systems on top of them: auto-scaling databases, queueing systems, machine learning tools, and user applications.
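The auto-scaling property at the heart of this argument amounts to a simple rule a platform applies on your behalf. Here is a toy version of such a scaling rule (the numbers and function name are hypothetical, not how any particular cloud provider computes it):

```python
import math

def desired_instances(queued_events, per_instance_throughput,
                      max_instances=100):
    """Toy serverless scaling rule: run just enough instances to
    drain the pending events, including scaling to zero when there
    is no work at all."""
    if queued_events == 0:
        return 0  # scale to zero: the defining serverless property
    needed = math.ceil(queued_events / per_instance_throughput)
    return min(needed, max_instances)  # clamp at a platform limit
```

With 25 queued events and a throughput of 10 events per instance, the rule asks for 3 instances; with an empty queue it asks for 0. Because the platform owns this decision, the application author never provisions capacity at all.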
And since the problem is being solved at such a low level, pricing competition will also take place at that low level, meaning that systems built on serverless functions will probably see steep declines in costs in the coming years. Serverless computing could eventually become free or nearly free, with the major cloud providers using it as a loss leader to onboard developers to higher level services.
All of this makes for an exciting topic of discussion that we will keep covering. Today’s show is with Eduardo Laureano, the principal program manager of Azure Functions. It was a fantastic conversation and we covered applications of serverless, improvements to the “cold start problem,” and how the Azure Functions platform is built and operated. Full disclosure: Microsoft is a sponsor of Software Engineering Daily.
Meetups for Software Engineering Daily are being planned! Go to softwareengineeringdaily.com/meetup if you want to register for an upcoming Meetup. In March, I’ll be visiting Datadog in New York and Hubspot in Boston, and in April I’ll be at Telesign in LA.

Feb 21, 2018 • 57min
Cloud Foundry Overview with Mike Dalessio
Earlier this year we did several shows about Cloud Foundry, followed by several shows about Kubernetes. Both of these projects allow you to build scalable, multi-node applications–but they serve different types of users.
Cloud Foundry encompasses a larger scope of the application experience than Kubernetes. Kubernetes is lower level and is actually being used within newer versions of Cloud Foundry to give Cloud Foundry users access to the Kubernetes abstractions.
Recording those shows gave me a wide understanding of how infrastructure is managed and how it has evolved. Today’s episode provides more context on Cloud Foundry–how the project got started, how people use it, and where Cloud Foundry is going. Today’s guest Mike Dalessio is a VP of engineering on Pivotal Cloud Foundry, and we had a great time talking about his work. Engineering leadership is a fine art, and conversations with engineering leaders are consistently interesting–this was no exception.