
Cloud Engineering Archives - Software Engineering Daily
Episodes about building and scaling large software projects
Latest episodes

Feb 13, 2018 • 50min
Box Kubernetes Migration with Sam Ghods
Over 12 years of engineering, Box has developed a complex architecture of services. Whenever a user uploads a file to Box, that upload might cause 5 or 6 different services to react to the event. Each of these services is managed by a set of servers, and managing all of these different servers is a challenge.
Sam Ghods is the cofounder and services architect of Box. In 2014, Sam was surveying the landscape of different resource managers, deciding which tool should be the underlying scheduler for deploying services at Box. He chose Kubernetes because it was based on Google’s internal Borg scheduling system.
For years, engineering teams at companies like Facebook and Twitter had built internal scheduling systems modeled after Borg. When Kubernetes arrived, it provided an out-of-the-box tool for managing infrastructure like Google would.
In today’s episode, Sam describes how Box began its migration to Kubernetes, and what the company has learned along the way. It’s a great case study for people who are looking at migrating their own systems to Kubernetes.

Feb 12, 2018 • 42min
Scaling Box with Jeff Quiesser
When Box started in 2006, the small engineering team had a lot to learn. Box was one of the earliest cloud storage companies, with a product that allowed companies to securely upload files to remote storage.
This was two years before Amazon Web Services introduced on-demand infrastructure, so the Box team managed their own servers, which they learned how to do as they went along. In the early days, the backup strategy was not so sophisticated. The founders did not know how to properly set up hardware in a colocated data center. The front-end interface was not the most beautiful product.
But the product was so useful that eventually, it started to catch on. Box’s distributed file system became the backbone of many enterprises. Employees began to use it to interact with and share data across organizations.
The increase in usage raised the stakes for Box’s small engineering team. If Box’s service went down, it could cripple an enterprise’s productivity, which meant that Box needed to hire experienced engineers to build resilient systems with higher availability. And to accommodate the growth in usage, Box needed to predict how much hardware to purchase, and how much space in a data center to rent–a process known as capacity planning.
As Box went from 3 engineers to 300, the different areas of the company went from being managed by individuals, to teams, to entire departments with VPs and C-level executives.
Jeff Quiesser is an SVP at Box, and one of the co-founders. He joins the show today to describe how Box changed as the company scaled. We covered engineering, management, operations, and culture.
In previous shows, we have explored the stories of companies like Slack, Digital Ocean, Giphy, Uber, Tinder, and Spotify. It’s always fun to hear how a company works–from engineering the first product to enterprises with millions of users. To find all of our episodes about how companies are built, download the Software Engineering Daily app for iOS or Android. These apps have all 650 of our episodes in a searchable format–we have recommendations, categories, related links, and discussions around the episodes. It’s all free and also open source–if you are interested in getting involved in our open source community, we have lots of people working on the project and we do our best to be friendly and inviting to new people coming in looking for their first open source project. You can find that project at Github.com/softwareengineeringdaily.

Feb 8, 2018 • 58min
Load Testing Mobile Applications with Paulo Costa and Rodrigo Coutinho
Applications need to be ready to scale in response to high-load events. With mobile applications, this can be even more important, because people rely on mobile apps for tasks such as banking, ride sharing, and GPS navigation.
During Black Friday, a popular ecommerce application could be bombarded by user requests, and you might not be able to complete a request to buy an item at the Black Friday discount. If you attend the Super Bowl and try to catch an Uber after leaving, all the other people around you might be summoning a car at the same time, and the system might not scale.
To prepare infrastructure for high volume, mobile development teams often create end-to-end load tests. Incoming mobile traffic can be recorded, then replicated and replayed to measure how the backend responds to the mobile workload.
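As a rough sketch of that record-and-replay idea (this is not OutSystems' actual tooling; the capture format, staging endpoint, and concurrency level below are all assumptions), a replay harness might read captured requests from a log and fire them at a staging backend while tracking error rates and tail latency:

```typescript
// replay.ts - hypothetical load-replay harness, not OutSystems' actual tooling.
// Reads recorded requests from a JSON-lines capture file and replays them
// against an assumed staging backend with a pool of concurrent workers.
import { readFileSync } from "node:fs";

interface RecordedRequest {
  method: string;
  path: string;
  body?: string;
}

const BACKEND = "https://staging.example.com"; // assumed staging endpoint
const CONCURRENCY = 50;                        // assumed number of simulated clients

async function replay(requests: RecordedRequest[]): Promise<void> {
  const queue = [...requests];
  const latencies: number[] = [];
  let failed = 0;

  // Each worker pulls the next recorded request until the queue is drained.
  const worker = async (): Promise<void> => {
    let req: RecordedRequest | undefined;
    while ((req = queue.shift()) !== undefined) {
      const start = Date.now();
      try {
        const res = await fetch(BACKEND + req.path, {
          method: req.method,
          body: req.method === "GET" ? undefined : req.body,
        });
        if (!res.ok) failed++;
      } catch {
        failed++; // a network error counts as a failed replayed request
      }
      latencies.push(Date.now() - start);
    }
  };
  await Promise.all(Array.from({ length: CONCURRENCY }, () => worker()));

  // Report the numbers a load test is after: error count and tail latency.
  latencies.sort((a, b) => a - b);
  const p99 = latencies[Math.floor(latencies.length * 0.99)];
  console.log(`replayed=${latencies.length} failed=${failed} p99=${p99}ms`);
}

const recorded: RecordedRequest[] = readFileSync("capture.jsonl", "utf8")
  .trim()
  .split("\n")
  .map((line) => JSON.parse(line));

replay(recorded).catch(console.error);
```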
Paulo Costa and Rodrigo Coutinho are engineers at OutSystems, a company that makes a platform for building low-code mobile applications. In this episode, Paulo and Rodrigo discuss the process of performing end-to-end scalability testing for mobile applications backed by cloud infrastructure. We talked about the high level process of architecting the load test, and explored the tools used to implement it. Full disclosure: OutSystems is a sponsor of Software Engineering Daily.

Feb 6, 2018 • 54min
Serverless at the Edge with Kenton Varda
Over the last decade, computation and storage have moved from on-premise hardware into the cloud data center. Instead of having large servers “on-premise,” companies started to outsource their server workloads to cloud service providers.
At the same time, there has been a proliferation of devices at the “edge.” The most common edge device is your smartphone, but there are many other smart devices that are growing in number–drones, smart cars, Nest thermostats, smart refrigerators, IoT sensors, and next generation centrifuges. Each of these devices contains computational hardware.
Another class of edge devices is the edge server. Edge servers are used to deliver faster response times than your core application can. For example, Software Engineering Daily uses a content delivery network for audio files. These audio files are distributed throughout the world on edge servers. The core application logic of Software Engineering Daily runs on a WordPress site, and that WordPress application is distributed to far fewer servers than our audio files.
“Cloud computing” and “edge computing” both refer to computers that can serve requests. The “edge” is commonly used to refer to devices that are closer to the user–so they will deliver faster responses. The “cloud” refers to big, bulky servers that can do heavy duty processing workloads–such as training machine learning models or issuing a large distributed MapReduce query.
As the volume of computation and data increases, we look for better ways to utilize our resources, and we are realizing that the devices at the edge are underutilized.
In today’s episode, Kenton Varda explains how and why to deploy application logic to the edge. He works at Cloudflare on a project called Cloudflare Workers, which is a way to deploy JavaScript to edge servers, such as the hundreds of data centers around the world that are used by Cloudflare for caching.
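To give a sense of the programming model, here is a minimal Worker in the Service Worker style that Cloudflare Workers expose; it runs in whichever edge data center is closest to the user, forwards the request to the origin, and annotates the response (the header name is just for illustration):

```typescript
// A minimal Cloudflare Worker: it intercepts each request at the edge,
// fetches the response from the origin, and modifies it before replying.
addEventListener("fetch", (event: FetchEvent) => {
  event.respondWith(handle(event.request));
});

async function handle(request: Request): Promise<Response> {
  const response = await fetch(request);
  // Responses returned by fetch are immutable; copy one to edit its headers.
  const modified = new Response(response.body, response);
  modified.headers.set("X-Served-From", "edge"); // illustrative header
  return modified;
}
```

Because the same script is deployed to every edge location, logic like caching, rewriting, or A/B routing executes close to the user instead of at the origin.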
Kenton was previously on the show to discuss protocol buffers, a project he led while he was at Google. To find that episode, and many other episodes about serverless, download the Software Engineering Daily app for iOS or Android.

Feb 5, 2018 • 45min
LinkedIn Resilience with Bhaskaran Devaraj and Xiao Li
How do you build resilient, failure tested systems? Redundancy, backups, and testing are all important. But there is also an increasing trend towards chaos engineering–the technique of inducing controlled failures in order to prove that a system is fault tolerant in the way that you expect.
In last week’s episode with Kolton Andrus, we discussed one way to build chaos engineering as a routine part of testing a distributed system. Kolton discussed his company Gremlin, which injects failures by spinning up a Gremlin container and having that container induce network failures, memory errors, and filled-up disks. In this episode, we explore another insertion point for testing controlled failures, this time from the point of view of LinkedIn.
LinkedIn is a social network for working professionals. As LinkedIn has grown, the increased number of services has led to more interdependency between those services. The more dependencies a given service has, the more partial failure cases there are. That’s not to say there is anything wrong with having a lot of service dependencies; this is just the way we build modern applications. But it does suggest that we should try to test the failures that can emerge from so many dependencies.
Bhaskaran Devaraj and Xiao Li are engineers at LinkedIn, and are working on a project called Waterbear, with the goal of making the infrastructure more resilient.
LinkedIn’s backend system consists of a large distributed application with thousands of microservices communicating with each other. Most of those services communicate over Rest.li, a framework that standardizes interactions between services. Rest.li can assist with routing, A/B testing, circuit breaking, and other aspects of service-to-service communication. This communication layer can also be used for executing controlled failures. As services communicate with each other, creating a controlled failure can be as simple as telling the proxy layer not to send traffic to a downstream service.
If that sounds confusing, don’t worry, we will explain it in more detail.
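As a rough illustration of the idea in the meantime (the class and method names here are hypothetical; Rest.li's actual API is different), the layer that every service call already passes through is a natural place to inject faults:

```typescript
// Hypothetical failure-injection wrapper around a service proxy.
// This is not Rest.li's API; it only sketches the idea that the layer
// mediating service-to-service calls can refuse traffic on purpose.
interface FailureRule {
  downstream: string;  // name of the service whose calls should be disrupted
  failureRate: number; // fraction of calls to fail, between 0.0 and 1.0
}

class FaultInjectingProxy {
  constructor(private rules: FailureRule[]) {}

  async call(downstream: string, path: string): Promise<Response> {
    const rule = this.rules.find((r) => r.downstream === downstream);
    // Controlled failure: simply do not send traffic to the downstream service.
    if (rule && Math.random() < rule.failureRate) {
      throw new Error(`injected failure calling ${downstream}`);
    }
    return fetch(`http://${downstream}.internal${path}`); // assumed addressing scheme
  }
}

// Fail 10% of calls to a hypothetical "profile" service; if callers degrade
// gracefully instead of cascading the error, the system passes the test.
const proxy = new FaultInjectingProxy([{ downstream: "profile", failureRate: 0.1 }]);
```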
In this episode, Bhaskaran and Xiao describe their approach to resilience engineering at LinkedIn, including the engineering projects and the cultural changes that are required to build a resilient software architecture.

Feb 2, 2018 • 53min
Chaos Engineering with Kolton Andrus
There are countless ways for an application to fail. Disks fail all the time. Servers overheat. Network connections get flaky. You assume that you are prepared for such a scenario because you have replicated your servers. You have the database backed up. Your core application is spread across multiple availability zones.
But are you really sure that your system is resilient? The only way to prove it is to experience failure, and to make a swift response to failure an integral part of your software.
Chaos engineering is the practice of routinely testing your system’s resilience by inducing controlled failures. Netflix was the first company to discuss chaos engineering widely, but more and more companies are starting to work it into their systems, and finding it tremendously useful. By inducing failures in your system, you can discover unknown dependencies, single points of failure, and problematic state conditions that can cause data corruption.
Kolton Andrus worked on chaos engineering at Netflix and Amazon, where he designed systems that would test resiliency through routine failures. He has since founded Gremlin, a company that provides chaos engineering as a service. In a previous episode, Kolton and I discussed why chaos engineering is useful, and he told some awesome war stories about working at Amazon and Netflix. In this show, we explore how to build a chaos engineering service, which involves standing up Gremlin containers that inject controlled failures.
To find the previous episode I recorded with Kolton, as well as other supplementary materials described in this show, download the Software Engineering Daily app for iOS or Android.

Feb 1, 2018 • 53min
How to Change an Enterprise’s Software and Culture with Zhamak Dehghani
On this show, we spend a lot of time talking about CI/CD, data engineering, and microservices. These technologies have only been widely discussed for the last 5-10 years, which makes them easy to adopt for startups founded in that window, but not necessarily for older enterprises.
Within a large enterprise, it can be challenging to make significant changes to how technology is used. Many listeners might even take it for granted that their source code is in Git. But if you work at an enterprise that started building software in 1981, you might be moving source code around on FTP servers or floppy disks.
The difficulty of changing the technology within an enterprise gets compounded by culture. Culture develops around specific technologies. That is one interpretation of “Conway’s Law”–that the way an organization uses software informs an organization’s communication structure. This is no surprise–if your organization manages code using FTP servers and floppy disks, it will slow down your innovation.
Zhamak Dehghani is an engineer at ThoughtWorks, where she consults with enterprises to modernize their software and culture. She works off of a blueprint that describes specific steps that an enterprise can take towards modernizing: continuous integration; building a data pipeline; building a system of experimentation. In some ways, this conversation fits nicely with our shows about DevOps a few years ago. Full disclosure: ThoughtWorks is a sponsor of Software Engineering Daily.
To find all of our shows about DevOps, as well as links to learn more about the topics described in the show, download the Software Engineering Daily app for iOS or Android.

Jan 25, 2018 • 50min
Serverless Containers with Sean McKenna
After two weeks of episodes about Kubernetes, our in-depth coverage of container orchestration is drawing to a close. We have a few more shows on the topic before we move on to other areas of software engineering. If you have feedback on this thematic format (whether you like it or not), send me an email: jeff@softwareengineeringdaily.com
Today’s episode fits nicely into some of the themes we have covered recently–Cloud Foundry, Kubernetes, and the changing landscape of managed services. Sean McKenna works on all three of these things at Microsoft.
We spent much of our time discussing the use cases of container instances versus Kubernetes. Container instances are individual managed containers–so you could spin up an application within a container instance without having to deal with the Kubernetes control plane. Container instances might be described as “serverless containers,” since you do not have to program against the underlying VM at all.
This raises the question: why would you want to use a managed Kubernetes service if you could just use individual managed containers? Sean explores this question and gives his thoughts on where this ecosystem is headed. Full disclosure: Microsoft is a sponsor of Software Engineering Daily.

Jan 22, 2018 • 48min
Container Instances with Gabe Monroy
In 2011, platform-as-a-service was in its early days. It was around that time that Gabe Monroy started a container platform called Deis, with the goal of making an open-source platform-as-a-service that anyone could deploy to whatever infrastructure they wanted.
Over the last six years, Gabe had a front-row seat to the rise of containers, the variety of container orchestration systems, and the changing open source landscape. Every container orchestration system consists of a control plane, a data plane, and a scheduler. In the last few weeks, we have been exploring these different aspects of Kubernetes in detail.
Last year, Microsoft acquired Deis, and Gabe began working on the Azure services that are related to Kubernetes–Azure Container Service, Kubernetes Service, and Container Instances. In this episode, Gabe talks about how containerized applications are changing, and what developments might come in the next few years.
Kubernetes, functions-as-a-service, and container instances are different cloud application runtimes, with different SLAs, interfaces, and economics. Gabe provided some thoughts on how different application types might use those different runtimes. Full disclosure: Microsoft is a sponsor of Software Engineering Daily.

Jan 19, 2018 • 53min
Service Mesh Design with Oliver Gould
Oliver Gould worked at Twitter from 2010 to 2014. Twitter’s popularity was taking off, and the engineering team was learning how to scale the product.
During that time, Twitter adopted Apache Mesos and began breaking up its monolithic architecture into different services. As more and more services were deployed, engineers at Twitter decided to standardize communications between those services with a tool called a service proxy.
A service proxy provides each service with features that every service would want: load balancing, routing, service discovery, retries, and visibility. It turns out that lots of other companies wanted this service proxy technology as well, which is why Oliver left Twitter to start Buoyant, a company that was focused on developing software around the service proxy–and eventually the service mesh.
If you are unfamiliar with service proxies and service mesh, check out our previous shows on Linkerd, Envoy, and Istio.
Kubernetes is often deployed with a service mesh. A service mesh consists of two parts: the data plane and the control plane.
The “data plane” refers to the sidecar containers that are deployed into each of your Kubernetes application pods. Each sidecar runs a service proxy. The “control plane” refers to a central service that aggregates data from across the data plane and can send instructions to the service proxies that make up that data plane.
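As a toy sketch of that split (this is not how Linkerd or Conduit is implemented; the ports and the control plane address are assumptions), a data-plane sidecar is essentially a small proxy that forwards traffic to the application container in its pod while reporting telemetry to the central control plane:

```typescript
// Toy data-plane sidecar: forwards each request to the application container
// listening on localhost and reports latency to an assumed control plane.
import http from "node:http";

const APP_PORT = 8080;                             // app container in the same pod
const CONTROL_PLANE = "http://control-plane:9000"; // assumed control plane address

http
  .createServer((req, res) => {
    const start = Date.now();
    const upstream = http.request(
      {
        host: "127.0.0.1",
        port: APP_PORT,
        path: req.url,
        method: req.method,
        headers: req.headers,
      },
      (appRes) => {
        res.writeHead(appRes.statusCode ?? 502, appRes.headers);
        appRes.pipe(res);
        // Fire-and-forget metric so the control plane can aggregate a
        // cluster-wide view of traffic across the whole data plane.
        fetch(`${CONTROL_PLANE}/metrics`, {
          method: "POST",
          body: JSON.stringify({ path: req.url, ms: Date.now() - start }),
        }).catch(() => {});
      }
    );
    req.pipe(upstream);
  })
  .listen(4140); // the pod's traffic is transparently routed through this port
```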
The Linkerd service mesh was built in Scala, and the project started before Kubernetes had become the standard for container orchestration. More recently, Buoyant built Conduit, a service mesh written in Rust and Go.
In this episode, we explore how to design a service mesh and what Oliver learned in his experience building Linkerd and Conduit.