
Cloud Engineering Archives - Software Engineering Daily

Latest episodes

Nov 29, 2017 • 1h 1min

How IBM Runs Its Cloud with Jason McGee

Cloud computing changed the economics of running a software company. A cloud is a network of data centers that offers compute resources to developers. In the 1990s, software companies purchased servers–an upfront capital expense that required tens of thousands of dollars. In the early 2000s, cloud computing emerged and turned that capital expense into an operational expense. Instead of a huge bulk purchase of servers, a software developer can pay as they go–which is much easier on the budget. This transformation of capital expense into operational expense significantly lowered the barrier to entry for starting a software company: cloud computing created an economic boom that continues today and shows no signs of slowing down.

Of course, the upfront capital expense did not disappear. It got aggregated into the cloud providers. Instead of individual software companies buying servers like they did in the 90’s, cloud providers buy thousands of servers and build data centers around the world. Because of this aggregation, cloud providers get economies of scale–it is much cheaper for a single cloud provider to buy 1000 servers than for 1000 software companies to each buy one.

The first wave of cloud computing was all about deploying our applications to servers that run in cloud data centers. The next wave is moving more and more of our application logic into managed services. In the first wave, we had to spin up a server to run our database. We needed a server for our search engine, and another for our load balancer. The logical unit of a server comes with the risk that the server will crash, will be hacked, or will develop a bug that is difficult to solve. Managed services abstract away the notion of a server. Database-as-a-service, search-as-a-service, load-balancing-as-a-service: these services are reliable Lego blocks that we can build entire applications out of. Developers pay a premium for a managed service–but they are happy to pay it, because it represents a promise that the service will not crash, will not be hacked due to a software vulnerability, and will not develop bugs that are the developer’s responsibility. And cloud providers are very happy to develop these managed services–because it turns the infrastructure-as-a-service business into a software-as-a-service business. Software-as-a-service has better margins, better retention, and better competitive differentiation than infrastructure-as-a-service.

As a developer, I’m looking forward to building applications entirely out of specialized managed services. But we are still pretty far from that time. Today, developers still need to write their own application logic and their own backend services. There is, however, a set of tools that lets developers write their own services while getting some of the resiliency and scalability of managed services: functions-as-a-service and Kubernetes. Functions-as-a-service let developers deploy stateless application logic that is cheap and scalable, though they still have problems to overcome in the areas of state management, function composition, usability, and developer education. Kubernetes is a tool for managing containerized infrastructure.
Developers put their apps into containers on Kubernetes, and Kubernetes provides a control plane for deployment, scalability, load balancing, and monitoring. All of the things you would want out of a managed service become much easier when you put applications into Kubernetes. This is why Kubernetes has become so popular–and it is why Kubernetes itself is being offered as a managed service by many cloud providers, including IBM.

For the last decade, IBM has been building out its cloud offerings–and for two of those years, Jason McGee has been CTO of IBM Cloud Platform. In this episode, Jason discusses what it is like to build and manage a cloud, from operations to economics to engineering.

If you like this episode, we have done many other shows about engineering cloud services at Microsoft, Google, Amazon, and DigitalOcean. To find all of our old episodes, you can download the Software Engineering Daily app for iOS and for Android. In other podcast players, you can only access the most recent 100 episodes. With these apps, we are building a new way to consume content about software engineering. They are open-sourced at github.com/softwareengineeringdaily. If you are looking for an open source project to get involved with, we would love to get your help. The post How IBM Runs Its Cloud with Jason McGee appeared first on Software Engineering Daily.
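To make the stateless-function idea from this episode concrete, here is a minimal sketch in the style of an AWS Lambda Python handler (the handler signature and boto3 calls are real; the table name, key schema, and event fields are hypothetical). The function keeps no state on the machine it runs on–state lives in a managed database:

```python
import boto3

# Managed state: a DynamoDB table stands in for "database-as-a-service".
# The table name "orders" and its key schema are hypothetical.
table = boto3.resource("dynamodb").Table("orders")

def handler(event, context):
    """Stateless function-as-a-service entry point (AWS Lambda signature).

    Nothing is kept in memory between invocations, so any copy of this
    function can serve any request--the resiliency and scalability come
    from the managed services around it.
    """
    order_id = event["orderId"]  # hypothetical event field
    item = table.get_item(Key={"id": order_id}).get("Item")
    if item is None:
        return {"statusCode": 404, "body": "order not found"}
    return {"statusCode": 200, "body": item["status"]}
```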
Nov 28, 2017 • 42min

Thumbtack Infrastructure with Nate Kupp

Thumbtack is a marketplace for real-world services. On Thumbtack, people get their house painted, their dog walked, and their furniture assembled. With 40,000 daily marketplace transactions, the company handles significant traffic. On yesterday’s episode, we explored how one aspect of Thumbtack’s marketplace recently changed, going from asynchronous matching to synchronous “instant” matching. In this episode, we zoom out to the larger architecture of Thumbtack, and how the company has grown through its adoption of managed services from both AWS and Google Cloud.

The word “serverless” has a few definitions. In the context of today’s episode, serverless is all about managed services like Google BigQuery, Google Cloud Pub/Sub, and Amazon ECS. The majority of infrastructure at Thumbtack is built using services that automatically scale up and down. Application deployment, data engineering, queueing, and databases are almost entirely handled by cloud providers. For the most part, Thumbtack is a “serverless” company. And it makes sense–if you are building a high-volume marketplace, you are not in the business of keeping servers running. You are in the business of improving your matching algorithms, your user experience, and your overall architecture. Paying for lots of managed services is more expensive than running virtual machines–but Thumbtack saves money by not having to hire site reliability engineers.

Nate Kupp leads the technical infrastructure team, and we met at QCon in San Francisco to talk about how to architect a modern marketplace. This was my third time attending QCon, and as always I was impressed by the quality of the presentations and conversations I had there. They were also kind enough to set up some dedicated space for podcasters like myself.

The most widely used cloud provider is AWS, but more and more companies that come on the show are starting to use some of the managed services from Google. The great news for developers is that integration between these managed services is pretty easy. At Thumbtack, the production infrastructure on AWS serves user requests. The log of transactions that occur gets pushed from AWS to Google Cloud, where the data engineering occurs. On Google Cloud, the transaction records are queued in Cloud Pub/Sub, a message queueing service. Those transactions are pulled off the queue and stored in BigQuery, a system for storing and querying high volumes of data. BigQuery is used as the data lake to pull from when orchestrating machine learning jobs. These machine learning jobs are run in Cloud Dataproc, a managed service that runs Apache Spark. After training a model in Google Cloud, that model is deployed on the AWS side, where it serves user traffic. On the Google Cloud side, the orchestration of these different managed services is done by Apache Airflow, an open source tool that is one of the few pieces of infrastructure that Thumbtack does have to manage itself on Google Cloud.

To find out more about Thumbtack’s infrastructure, check out the video of the talk Nate gave at QCon San Francisco, or the Thumbtack Engineering Blog. The post Thumbtack Infrastructure with Nate Kupp appeared first on Software Engineering Daily.
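To sketch what that orchestration can look like, here is a minimal Apache Airflow DAG shaped like the pipeline above–pull from Pub/Sub, load into BigQuery, train on Dataproc. The task names and commands are hypothetical placeholders, not Thumbtack’s actual jobs:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Placeholder commands; real tasks would call Cloud Pub/Sub, BigQuery,
# and Dataproc. Everything below is illustrative, not Thumbtack's code.
default_args = {"owner": "data-eng", "start_date": datetime(2017, 11, 1)}

dag = DAG("marketplace_pipeline",
          default_args=default_args,
          schedule_interval="@daily")

ingest = BashOperator(task_id="pull_transactions_from_pubsub",
                      bash_command="echo 'drain Pub/Sub into staging'",
                      dag=dag)

load = BashOperator(task_id="load_into_bigquery",
                    bash_command="echo 'load staged records into BigQuery'",
                    dag=dag)

train = BashOperator(task_id="train_model_on_dataproc",
                     bash_command="echo 'submit Spark job to Dataproc'",
                     dag=dag)

# Linear dependency: ingest, then load, then train.
ingest >> load >> train
```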
Nov 27, 2017 • 52min

Marketplace Matching with Xing Chen

The labor market is moving online. Taxi drivers are joining Uber and Lyft. Digital freelancers are selling their services through Fiverr. Experienced software contractors are leaving contract agencies to join Gigster. Online labor marketplaces create market efficiency by improving the communications between buyers and sellers. Workers make their own hours, and their performance is judged by customers and algorithms, rather than the skewed perspective of a human manager.

These marketplaces for human labor are in different verticals, but they share a common problem: how do you most efficiently match supply and demand? Perfect marketplace matching is an unsolved problem. Hundreds of computer science papers have been written about the problems of stable matching, which often turn out to be NP-complete. The stock market has been attempting to automate marketplace matching for decades, and inefficiencies are discovered every year.

Today’s show is about matching buyers and sellers on Thumbtack, a marketplace for local services. For its first seven years, Thumbtack was building liquidity in its two-sided market. During those years, the model for job requests was as follows: let’s say I was on Thumbtack looking for someone to paint my house. I would post a job saying that I am looking for house painters. The workers on Thumbtack who paint houses could see my job and place a bid on it. Then I would choose from the bids and get my house painted. This was the “asynchronous” model. The actions of the buyer and seller were not synchronized. There was a significant delay between the time when the buyer posted a job and the time when a seller placed a bid, and then another delay before the buyer selected from the sellers.

Thumbtack recently moved to an “instant matching” model. After gathering data about the people selling services on the platform, Thumbtack is now able to avoid the asynchronous bidding process. In the new experience, a buyer goes on the platform, requests a house painter, and is instantly matched to someone who has a history of accepting house painting tasks that fit the parameters of the buyer. From the user’s perspective, this is a simple improvement. From Thumbtack’s perspective, significant architectural change was required. In the asynchronous model, user requests lined up in a queue and were matched with pros who placed bids on the items in that queue. In the instant matching model, a user request becomes more like a search query: the request hits an index of pros and returns a response immediately.

Xing Chen is an engineer at Thumbtack, and joins the show to describe the re-architecture process–how Thumbtack went from an asynchronous matching system to synchronous, instant matching. We also explore some of the other architectural themes of Thumbtack, which we dive into in further detail in tomorrow’s episode about scaling Thumbtack’s infrastructure, which uses both AWS and Google Cloud.

On Software Engineering Daily, we have explored the software architecture and business models of different labor marketplaces–from Uber to Fiverr. To find these old episodes, you can download the Software Engineering Daily app for iOS and for Android. In other podcast players, you can only access the most recent 100 episodes. With these apps, we are building a new way to consume content about software engineering. They are open-sourced at github.com/softwareengineeringdaily.
If you are looking for an open source project to get involved with, we would love to get your help. The post Marketplace Matching with Xing Chen appeared first on Software Engineering Daily.
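To make the asynchronous-versus-instant distinction concrete, here is a toy sketch (all names and fields invented; Thumbtack’s real system is far more involved). Instant matching treats the request like a search query against an index of pros, rather than an item sitting in a queue waiting for bids:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Pro:
    name: str
    category: str
    accepts_instant: bool
    max_job_size: int

# A tiny in-memory "index" of pros; a real system would use a search index.
PROS = [
    Pro("A. Painter", "house_painting", True, 5),
    Pro("B. Brush", "house_painting", False, 3),
    Pro("C. Walker", "dog_walking", True, 1),
]

def instant_match(category: str, job_size: int) -> List[Pro]:
    """Synchronous matching: filter and rank pros immediately,
    instead of queueing the request and waiting for bids."""
    candidates = [p for p in PROS
                  if p.category == category
                  and p.accepts_instant
                  and p.max_job_size >= job_size]
    # Rank by capacity as a stand-in for a real relevance score.
    return sorted(candidates, key=lambda p: p.max_job_size, reverse=True)

print(instant_match("house_painting", 4))  # matches only A. Painter
```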
Nov 22, 2017 • 48min

Load Balancing at Scale with Vivek Panyam

Facebook serves interactive content to billions of users. Google serves query requests on the world’s biggest search engine. Uber handles a significant percentage of the transportation within the United States. These services handle radically different types of traffic, but many of the techniques they use to balance load are similar.

Vivek Panyam is an engineer at Uber, and he previously interned at Google and Facebook. In a popular blog post about load balancing at scale, he described how a large company scales up a popular service. The methods for scaling up load balancing are simple but effective–and they help to illustrate how load balancing works at different layers of the networking stack.

Let’s say you have a simple service where a user makes a request, and your service sends them a response with a cat picture. Your service starts to get popular, and begins timing out and failing to send responses to users. When your service starts to get overwhelmed, you can handle the increased load by creating another service instance that is a copy of your cat picture service. Now you have two service instances, and you can use a layer 7 load balancer to route traffic evenly between them. You can keep adding service instances as the load grows and have the load distributed among those new instances.

Eventually, your L7 load balancer is handling so much traffic itself that you can’t put any more service instances behind it. So you have to set up another L7 load balancer, and put an L4 load balancer in front of those L7 load balancers. You can scale up that tier of L7 load balancers, each of which is balancing traffic across a set of your service instances. But eventually, even your L4 load balancer gets overwhelmed with requests for cat pictures. You have to set up another tier, this time with L3 load balancing…

In this episode, Vivek gives a clear description of how load balancing works. We also review the seven networking layers before discussing why there are different types of load balancers associated with the different layers. The post Load Balancing at Scale with Vivek Panyam appeared first on Software Engineering Daily.
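As a sketch of the first step in that progression, here is a toy layer-7 balancer that rotates HTTP requests across service instances round-robin. The backend addresses are hypothetical, and a production L7 balancer would also handle health checks, retries, header forwarding, and connection pooling:

```python
import itertools
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical instances of the cat picture service.
BACKENDS = itertools.cycle([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
])

class RoundRobinProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Layer 7: we see the full HTTP request, so we could also route
        # by path or headers; here we simply rotate through backends.
        backend = next(BACKENDS)
        with urlopen(backend + self.path) as upstream:
            body = upstream.read()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), RoundRobinProxy).serve_forever()
```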
Nov 21, 2017 • 52min

Incident Response with Emil Stolarsky

As a system becomes more complex, the chance of failure increases. At a large enough scale, failures are inevitable. Incident response is the practice of preparing for and effectively recovering from these failures. An engineering team can use checklists and runbooks to minimize failures. They can put a plan in place for responding to failures. And they can use the process of post mortems to reflect on a failure and take full advantage of its lessons.

Emil Stolarsky is a production engineer at Shopify, where his role shares many similarities with that of Google’s site reliability engineers. In this episode, Emil argues that the academic study of emergency management, and industries such as aerospace and transportation, have a lot to teach software engineers about responding to production problems. In this interview with guest host Adam Bell, Emil makes the case that we need to move beyond tribal knowledge and incorporate practices such as an incident command system and rigorous use of checklists–moving away from a mindset of “move fast and break things” and toward a place of more deliberate preparation.

Show Notes

Incident Response Insights Talk
The Human Side Of Post Mortems

The post Incident Response with Emil Stolarsky appeared first on Software Engineering Daily.
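As a toy illustration of moving a runbook out of tribal knowledge and into something a responder can execute step by step (the steps below are invented for illustration, not Shopify’s actual process), a checklist can be plain data that a small script walks through:

```python
# Hypothetical incident checklist; each team would maintain its own.
INCIDENT_CHECKLIST = [
    "Page the incident commander",
    "Open a dedicated incident channel",
    "Declare severity and start the timeline",
    "Mitigate first, diagnose second",
    "Schedule the blameless post mortem",
]

def run_checklist(steps):
    """Walk the checklist in order, requiring an acknowledgement per step."""
    for i, step in enumerate(steps, 1):
        input(f"[{i}/{len(steps)}] {step} -- press Enter when done")

if __name__ == "__main__":
    run_checklist(INCIDENT_CHECKLIST)
```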
Nov 20, 2017 • 54min

Run Less Software with Rich Archbold

There is a quote from Jeff Bezos: “70% of the work of building a business today is undifferentiated heavy lifting. Only 30% is creative work. Things will be more exciting when those numbers are inverted.” That quote is from 2006, before Amazon Web Services had built most of its managed services. In 2006, you had no choice but to manage your own database, data warehouse, and search cluster. If your server crashed in the middle of the night, you had to wake up and fix it. And you had to deal with these engineering problems in addition to building your business.

Technology today evolves much faster than in 2006, partly because managed cloud services make operating a software company so much smoother. You can build faster, iterate faster, and there are fewer outages. If you are an insurance company, a t-shirt manufacturer, or an online education platform, much of your software engineering is undifferentiated heavy lifting. Your customers are not paying you for your expertise in databases or your ability to configure load balancers. As a business, you should focus on what your customers are paying you for, and spend minimal time rebuilding software that is available as a commodity cloud service.

Rich Archbold is the director of engineering at Intercom, a rapidly growing software company that enables communication between customers and businesses. At Intercom, the engineering teams have adopted a philosophy called Run Less Software. Running less software means reducing choices among engineering teams and standardizing on technologies wherever possible. In Intercom’s early days, the systems were more heterogeneous. Different teams could choose whatever relational database they wanted–MySQL or Postgres. They could choose whatever key/value store they were most comfortable with. The downside of all this choice was that engineers who moved from one team to another might not know how to use their new team’s tools. After switching teams, you would have to figure out how to onboard with those tools, and that onboarding was time not spent on work that impacted the business. By reducing the number of choices that engineering teams have, and opting for managed services wherever possible, Intercom ships code at an extremely fast pace with very few outages.

In our conversation, Rich contrasts his experience at Intercom with his experiences working at Amazon Web Services and Facebook. Amazon and Facebook were built in a time when there was not a wealth of managed services to choose from, and this discussion was a reminder of how much software engineering has changed because of cloud computing. To learn more about Intercom, you can check out the Inside Intercom podcast. The post Run Less Software with Rich Archbold appeared first on Software Engineering Daily.
Nov 16, 2017 • 57min

High Volume Event Processing with John-Daniel Trask

A popular software application serves billions of user requests. These requests could be for many different things, and they need to be routed to the correct destination, load balanced across different instances of a service, and queued for processing. Processing a request might require generating a detailed response to the user, making a write to a database, or creating a new file on a file system.

As a software product grows in popularity, it will need to scale these different parts of its infrastructure at different rates. You may not need to grow your database cluster at the same pace that you grow the number of load balancers at the front of your infrastructure. Your users might start making 70% of their requests to one specific part of your application, and you might need to scale up the services that power that portion of the infrastructure.

Today’s episode is a case study of a high-volume application: a monitoring platform called Raygun. Raygun’s software runs on client applications and delivers monitoring data and crash reports back to Raygun’s servers. If I have a podcast player application on my iPhone that runs the Raygun software, and that application crashes, Raygun takes a snapshot of the system state and reports that information along with the exception, so that the developer of that podcast player application can see the full picture of what was going on in the user’s device, along with the exception that triggered the application crash.

Throughout the day, applications all around the world are crashing and sending requests to Raygun’s servers. Even when crashes are not occurring, Raygun is receiving monitoring and health data from those applications. Raygun’s infrastructure routes those different types of requests to different services, queues them up, and writes the data to multiple storage layers–Elasticsearch, a relational SQL database, and a custom file server built on top of S3.

John-Daniel Trask is the CEO of Raygun, and he joins the show to describe the end-to-end architecture of Raygun’s request processing and storage system. We also explore specific refactoring changes that were made to save costs at the worker layer of the architecture–a useful memory management strategy for anyone working in a garbage-collected language. If you would like to see diagrams of the architecture and other technical decisions, the show notes include a video that explains what we talk about in this show. Full disclosure: Raygun is a sponsor of Software Engineering Daily. The post High Volume Event Processing with John-Daniel Trask appeared first on Software Engineering Daily.
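A minimal sketch of the client side of such a pipeline (the endpoint URL and payload fields are hypothetical, not Raygun’s actual API): catch the exception, snapshot some system state, and ship both to the collection servers:

```python
import platform
import traceback

import requests  # third-party: pip install requests

CRASH_ENDPOINT = "https://crashes.example.com/api/entries"  # hypothetical

def report_crash(exc: BaseException) -> None:
    """Send the exception plus a snapshot of the environment around it."""
    payload = {
        "error": {
            "type": type(exc).__name__,
            "message": str(exc),
            "stackTrace": traceback.format_exc(),
        },
        # The "full picture" of the device state, heavily simplified.
        "environment": {
            "os": platform.platform(),
            "python": platform.python_version(),
        },
    }
    requests.post(CRASH_ENDPOINT, json=payload, timeout=5)

try:
    1 / 0
except ZeroDivisionError as exc:
    report_crash(exc)
```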
Nov 15, 2017 • 54min

Fiverr Engineering with Gil Sheinfeld

As the gig economy grows, that growth necessitates innovations in the online infrastructure powering these new labor markets. In our previous episodes about Uber, we explored the systems that balance server load and gather geospatial data. In our coverage of Lyft, we studied Envoy, the service proxy that standardizes communications and load balancing among services. In shows about Airbnb, we talked about the data engineering pipeline that powers economic calculations, user studies, and everything else that requires a MapReduce. In today’s episode, we explore the business and engineering behind another online labor platform: Fiverr.

Fiverr is a marketplace for digital services. On Fiverr, I have purchased podcast editing, logo creation, music lyrics, videos, and sales leads. I have found people who will work for cheap and quickly finish a job to my exact specification. I have discovered visual artists who worked with me to craft a music video for a song I wrote. Workers on Fiverr post “gigs”–jobs that they can perform. Most of the workers on Fiverr specialize in knowledge work, like proofreading or gathering sales leads. The workers are all over the world; I have worked with people from Germany, the Philippines, and Africa through Fiverr.

Fiverr has become the leader in digital freelancing. The staggering growth of Fiverr’s marketplace has put the company in a position similar to an early Amazon. There is room for strategic expansion, but there is also an urgency to improve the infrastructure and secure the market lead. Gil Sheinfeld is the CTO at Fiverr, and he joins the show to explain how the teams at Fiverr are organized to fulfill the two goals of strategic, creative growth and continuous improvement to the platform.

One engineering topic we discussed at length was event sourcing. Event sourcing is a pattern for modeling each change to your application as an event. Each event is placed on a pub/sub messaging queue and made available to the different systems within your company. Event sourcing creates a centralized place to listen to all of the changes that are occurring within your company. For example, you might be working on a service that allows a customer to make a payment to a worker. The payment becomes an event, and several different systems might want to listen for it: Fiverr needs to call out to a credit card processing system; Fiverr needs to send an email to the worker, to let them know they have been paid; Fiverr also needs to update internal accounting records. Event sourcing is useful because the creator of the event is decoupled from all of the downstream consumers. As the platform engineering team builds out event sourcing, communications between different service owners will become more efficient. The post Fiverr Engineering with Gil Sheinfeld appeared first on Software Engineering Daily.
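A toy version of that decoupling (consumer behavior invented for illustration): the payment service publishes one event, each downstream system subscribes independently, and the producer never needs to know who is listening:

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for a pub/sub messaging queue."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, event):
        for handler in self.subscribers[event_type]:
            handler(event)

bus = EventBus()

# Downstream consumers, each unaware of the others (all hypothetical).
bus.subscribe("payment.completed", lambda e: print("charge card:", e["amount"]))
bus.subscribe("payment.completed", lambda e: print("email worker:", e["worker"]))
bus.subscribe("payment.completed", lambda e: print("update accounting records"))

# The producer emits a single event; the bus fans it out.
bus.publish("payment.completed", {"worker": "ana", "amount": 5})
```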
Nov 14, 2017 • 53min

Serverless Event-Driven Architecture with Danilo Poccia

In an event-driven application, each component of application logic emits events, which other parts of the application respond to. We have examined this pattern in previous shows that focus on pub/sub messaging, event sourcing, and CQRS. In today’s show, we examine the intersection of event-driven architecture and serverless architecture.

Serverless applications can be built by combining functions-as-a-service (like AWS Lambda) with backend-as-a-service tools like DynamoDB and Auth0. Functions-as-a-service give you cheap, flexible, scalable compute. Backend-as-a-service tools give you robust, fault-tolerant ways of managing state. By combining these sets of tools, we can build applications without thinking about the specific servers that manage large portions of our application logic. This is great–because managing servers, load balancing, and scaling is painful. With this shift in architecture, we also have to change how data flows through our applications.

Danilo Poccia is the author of AWS Lambda in Action, a book about building event-driven serverless applications. In today’s episode, Danilo and I discuss the connection between serverless architecture and event-driven architecture. We start by reviewing the evolution of the runtime unit–from physical machines to virtual machines to containers to functions-as-a-service. Then we dive into what it means for an application to be “event-driven.” We explore how to architect and scale a serverless application, and we finish by discussing the future of serverless–how IoT, edge computing, and on-premise architectures will take advantage of this new technology. The post Serverless Event-Driven Architecture with Danilo Poccia appeared first on Software Engineering Daily.
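To ground the pattern, here is a hedged sketch of a function reacting to events rather than being called directly–an AWS Lambda handler consuming DynamoDB stream records (the record structure follows the documented stream event format; the side effect is hypothetical):

```python
def handler(event, context):
    """AWS Lambda entry point, invoked by a DynamoDB stream.

    Each write to the table arrives here as an event record; the
    function reacts to state changes instead of serving direct calls.
    """
    for record in event.get("Records", []):
        if record["eventName"] == "INSERT":
            new_image = record["dynamodb"]["NewImage"]
            # React to the change--e.g., update a projection or notify
            # another service (hypothetical side effect).
            print("new item written:", new_image)
```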
Nov 7, 2017 • 53min

Netflix Serverless-like Platform with Vasanth Asokan

The Netflix API is accessed by developers who build for over 1000 device types: TVs, smartphones, VR headsets, laptops. If it has a screen, it can probably run Netflix. On each of these devices, the Netflix experience is different. Different screen sizes mean there is variable space to display the content. When you open up Netflix, you want to browse through movies efficiently. The frontend engineers who are building different experiences for different device types need to make different requests to the backend to fetch the right amount of data.

This was the engineering problem that Vasanth Asokan and his team at Netflix were tasked with solving: how do you enable lots of different frontend engineers to get whatever they need from the backend? This problem led to the development of a “serverless-like platform” within Netflix, which Vasanth wrote about in a few popular articles on Medium. This platform enables frontend developers to write and deploy backend scripts to fetch data, decoupling the responsibilities of frontend engineers and backend engineers. The tight coupling of frontend and backend engineering had been hurting Netflix’s development velocity.

We have done many shows about Netflix engineering, covering topics like data engineering, user interface design, and performance monitoring. To find these old episodes, you can download the Software Engineering Daily app for iOS and for Android. With these apps, we are building a new way to consume content about software engineering. They are open-sourced at github.com/softwareengineeringdaily. If you are looking for an open source project to get involved with, we would love to get your help. The post Netflix Serverless-like Platform with Vasanth Asokan appeared first on Software Engineering Daily.
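The pattern described here is essentially backend-for-frontend: each device team ships a small server-side script that reshapes backend data for its own screen. A hedged sketch of the idea (the field names and view functions are invented, not Netflix’s API):

```python
# What a backend catalog service might return (fields invented).
FULL_CATALOG_ENTRY = {
    "title": "Example Movie",
    "synopsis": "A very long synopsis that only big screens can show...",
    "artwork": {"small": "s.jpg", "large": "l.jpg", "ultra": "u.jpg"},
}

def tv_view(entry):
    """Script owned by the TV team: big artwork, full synopsis."""
    return {"title": entry["title"],
            "synopsis": entry["synopsis"],
            "artwork": entry["artwork"]["ultra"]}

def mobile_view(entry):
    """Script owned by the mobile team: trim the payload for small screens."""
    return {"title": entry["title"],
            "artwork": entry["artwork"]["small"]}

print(mobile_view(FULL_CATALOG_ENTRY))  # only what the device needs
```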
