
Cloud Engineering Archives - Software Engineering Daily
Episodes about building and scaling large software projects
Latest episodes

Sep 21, 2017 • 54min
Tinder Growth Engineering with Alex Ross
Tinder is a popular dating app where each user swipes through a sequence of other users in order to find a match. Swiping left means you are not interested. Swiping right means you would like to connect with the person. The simple premise of Tinder has led to massive growth, and the app is now also used to discover new friends and arrange casual meetups.
Every social network knows that if you are not growing, you are dying. Growth is so important to Tinder that it has a large engineering organization devoted to five facets of growth: new users, activation, retention, dropoff, and anti-spam.
These five segments cover the entire Tinder user lifecycle, and there is a sub-team in charge of each of the five areas. No matter what kind of Tinder user you are, there are growth engineers focused on your experience.
Alex Ross is the director of engineering for the growth team at Tinder. His job requires a mix of data science, data engineering, psychology, and setting proper KPIs (key performance indicators). Each subteam has KPIs that determine how well they are doing with growth–and if the wrong KPI is set, it can create bad incentives. For example, a growth team that is focused only on getting users to spend more time engaging with Tinder would have an incentive to create so-called “dark patterns” that trigger addiction.
If you like this episode, we have done many other shows about data science and data engineering. Download the Software Engineering Daily app for iOS to hear all of our old episodes, and easily discover new topics that might interest you. You can upvote the episodes you like and get recommendations based on your listening history. With 600 episodes, it is hard to find the episodes that appeal to you, and we hope the app helps with that.
The post Tinder Growth Engineering with Alex Ross appeared first on Software Engineering Daily.

Sep 18, 2017 • 53min
Spotify Event Delivery with Igor Maravic
Spotify is a streaming music company with more than 50 million users. Whenever a user listens to a song, Spotify records that event and uses it as input to learn more about the user’s preferences. Listening to a song is one type of event–there are hundreds of others. Opening the Spotify app, skipping a song, sharing a playlist with a friend–all of these are events that provide valuable insights to Spotify.
These are not the only types of events that Spotify cares about. There are also events that occur at the infrastructure level–for example a logging server that runs out of disk space. There are events that are relevant to all the users on Spotify–for example a new album release from Taylor Swift.
An “event” is a record of something that happened, which a system needs to capture and act on. Since there are so many events on a platform like Spotify, delivering and processing them reliably requires significant investment.
Modern Internet companies are built by connecting cloud services, databases, and internal tools together. These different systems might respond to different events in different ways. Each system subscribes to the types of events that it wants to hear. Since there are so many events, and they might be received at uneven bursts, a modern architecture has a scalable queueing system to buffer events.
To put an event on the queue, the event producer “publishes” that event to the queue. The event is then received by each “subscriber.” That’s why queueing is often known as pub/sub–publish/subscribe.
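The pattern can be sketched in a few lines of Python. This is a toy in-memory event bus, not Spotify's actual system; the event names and payload fields are illustrative:

```python
from collections import defaultdict

class EventBus:
    """Minimal in-memory pub/sub: subscribers register per event type."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Every subscriber to this event type receives the published event.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
received = []
bus.subscribe("song_played", lambda event: received.append(event))
bus.publish("song_played", {"track": "example", "user": 42})
print(received)  # [{'track': 'example', 'user': 42}]
```

A real system like Kafka or Google Cloud Pub/Sub adds the parts this sketch omits: durable storage, buffering of uneven bursts, and delivery across machines.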
Igor Maravic is an engineer with Spotify. In this episode, he explains why pub/sub is a key element of Spotify’s infrastructure–and he describes the migration that Spotify has made from Apache Kafka to Google Cloud Pub/Sub.

Aug 28, 2017 • 54min
Cloud-Native SQL with Alex Robinson
Applications built in the cloud are often serving requests from all around the world. A user in Hong Kong could write to a database entry just before a user in San Francisco and a user in Germany simultaneously try to read that entry. If the user in San Francisco is allowed to see a different value than the user in Germany, that database is not strongly consistent.
Strongly consistent databases work such that two users who read the same entry at the same time will receive the same result. Weakly consistent or “eventually consistent” databases are suitable for applications where transaction ordering is not important–photo sharing apps and ecommerce shopping carts, for example. Bank accounts, on the other hand, often need to be strongly consistent.
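A toy Python sketch of the problem, using two replicas with delayed replication (this illustrates eventual consistency in general, not CockroachDB's actual design):

```python
class EventuallyConsistentStore:
    """Toy two-replica store where replication lags behind writes,
    so a read from the lagging replica can return stale data."""
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = []

    def write(self, key, value):
        self.primary[key] = value
        self._pending.append((key, value))  # replication happens later

    def read_primary(self, key):
        return self.primary.get(key)

    def read_replica(self, key):
        return self.replica.get(key)

    def replicate(self):
        # Simulates asynchronous replication catching up.
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()

store = EventuallyConsistentStore()
store.write("balance", 100)
print(store.read_primary("balance"))  # 100 -- the fresh write
print(store.read_replica("balance"))  # None -- a stale read before replication
store.replicate()
print(store.read_replica("balance"))  # 100 -- replicas converge eventually
```

A strongly consistent database removes the window in which the two reads disagree, typically by coordinating replicas (for example with a consensus protocol) before acknowledging the write.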
CockroachDB is a scalable, survivable, strongly consistent database. Alex Robinson is an engineer at Cockroach Labs and he joins the show to explain the data model for CockroachDB and how it maintains strong consistency.

Aug 18, 2017 • 52min
Error Diagnosis with James Smith
When a user experiences an error in an application, the engineers who are building that application need to find out why that error occurred. The root cause of that error may be on the user’s device, or within a piece of server-side logic, or hidden behind a black box API. To fix a complex error, we need a stack trace and contextual information so that we can correlate events across all layers of an application.
James Smith is the CEO of Bugsnag, a company that makes crash reporting and error tracking software. In this episode, he describes how to diagnose errors in modern applications. He also explains how the company functions and how Bugsnag itself is built. The product consumes and stores millions of events which makes for a good discussion of software architecture. Full disclosure: Bugsnag is a sponsor of SE Daily.

Aug 14, 2017 • 53min
Open Compute Project with Steve Helvie
Facebook was rapidly outgrowing its infrastructure in 2009. Classic data center design could not keep up with the influx of new users and the data, photos, and streaming video hitting Facebook’s servers. A small team of engineers spent the next two years designing a data center from the ground up to be cheaper, more energy efficient, and more ergonomic for the engineers who worked within it.
That data center design was open sourced in 2011. Intel, Rackspace, and Goldman Sachs were the first three large organizations to join Facebook in the Open Compute Project, an effort to bring the benefits of open source collaboration to data centers.
Steve Helvie works on the Open Compute Project and he joins the show to describe how the project has evolved in the last six years–how it has affected data center design and the implications for the future.

Aug 7, 2017 • 56min
Serverless Continuous Delivery with Robin Weston
Serverless computing reduces the cost of using the cloud. Serverless also makes it easy to scale applications. The downside: building serverless apps requires a mindset shift. Serverless functions are deployed to transient units of computation that are spun up on demand. This is in contrast to the typical model of application delivery–the deployment of an application to a server or a container that stays running until you shut it down.
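A minimal Python sketch of a Lambda-style handler makes the contrast concrete: there is no server process to manage, only a function the platform invokes on demand. The payload fields here are illustrative, not from the episode:

```python
import json

def handler(event, context):
    """AWS Lambda-style entry point: the platform calls this per request.
    'event' carries the request payload; 'context' carries runtime metadata."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Locally we can simulate one invocation (context is unused in this sketch):
response = handler({"name": "serverless"}, None)
print(response["statusCode"])  # 200
```

Because the deployable unit is just this function, a continuous delivery pipeline can build, test, and ship it independently of the rest of the system.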
Robin Weston develops large projects with AWS Lambda, and he joined me for a discussion of how to build applications for serverless environments and how to do continuous delivery with serverless functions. One big appeal for continuous delivery fans is that serverless deployments are often smaller–the user is deploying something as small as a function.
Full disclosure: ThoughtWorks GoCD is a sponsor of Software Engineering Daily.
Show Notes
Serverless Architectures and Continuous Delivery by Robin Weston
Robin Weston at Pipeline Conf

Aug 4, 2017 • 52min
Serverless Startup with Yan Cui
After raising $18 million, social networking startup Yubl made a series of costly mistakes. Yubl hired an army of expensive contractors to build out its iOS and Android apps. Drama at the executive level hurt morale for the full-time employees. Most problematic, the company was bleeding cash due to a massive over-investment in cloud services.
This was the environment in which Yan Cui joined Yubl. The startup did have traction. There were social media stars who would announce on Twitter that they were about to go on Yubl, and Yubl would be hit by an avalanche of traffic. 50,000 users suddenly logging on to interact with their favorite celebrity was a significant traffic spike.
How do you deal with a traffic pattern like that? Serverless computing. AWS Lambda allowed the company to scale up quickly and cost efficiently. Yan began refactoring the entire backend infrastructure around it, heavily leveraging AWS Lambda to cut costs.
Unfortunately, Yan’s valiant effort was not enough to save the company. But there are some incredible engineering lessons from this episode–how to build cost-effective, scalable infrastructure. It’s also a case study worth looking at if you work at a startup, whether or not you are an engineer.

Aug 2, 2017 • 50min
Platform Continuous Delivery with Andy Appleton
Continuous delivery is a model for deploying small, frequent changes to an application. In a continuous delivery workflow, code changes that are pushed to a repository set off a build process that spins up a new version of the application. Testing is performed against that new build before it is merged into the existing codebase and promoted to production.
Many continuous delivery products are getting built today because it is a wide open space–much like cloud providers or monitoring tools. There are subjective product and engineering decisions to be made depending on the audience for the product.
Heroku Flow is a continuous delivery platform built on top of Heroku, a platform as a service. Andy Appleton is an engineer at Heroku and he joins the show to describe how Heroku Flow was built. Two years of work went into the project from initial conception to launch.
Full disclosure: Heroku is a sponsor of Software Engineering Daily.

Jul 21, 2017 • 38min
Reinforcement Learning with Michal Kempka
Reinforcement learning is a type of machine learning where a program learns how to take actions in an environment based on how that program has been rewarded for actions it took in the past. When a program takes an action, and it receives a reward for that action, it is likely to take that action again in the future because it was positively reinforced.
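A toy Python sketch of this reinforcement loop, using a two-armed bandit (far simpler than VizDoom, but the same core idea: rewarded actions become more likely):

```python
import random

random.seed(0)  # deterministic run for this sketch

values = {"left": 0.0, "right": 0.0}       # estimated reward per action
counts = {"left": 0, "right": 0}
true_reward = {"left": 0.2, "right": 0.8}  # hidden environment payoffs

for step in range(1000):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if random.random() < 0.1:
        action = random.choice(["left", "right"])
    else:
        action = max(values, key=values.get)
    reward = 1.0 if random.random() < true_reward[action] else 0.0
    counts[action] += 1
    # Incremental average: positive rewards pull the estimate up,
    # reinforcing the action that produced them.
    values[action] += (reward - values[action]) / counts[action]

print("learned best action:", max(values, key=values.get))
```

Platforms like VizDoom replace this two-action environment with a 3D game world, where the agent's observations are raw screen pixels and the reward comes from surviving and defeating enemies.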
Michal Kempka is a computer scientist who works on VizDoom, an AI research platform for reinforcement learning, with co-creators Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. VizDoom is based on the first-person shooter Doom. In VizDoom, an autonomous agent navigates through a maze avoiding enemies.
Reinforcement learning is a widely used tool for machine learning, and we will be doing more shows in the future that explain how it works in further detail.
Show Notes
Cornell University Library: VizDoom

Jul 20, 2017 • 50min
Apparel Machine Learning with Colan Connon and Thomas Bell
In its most basic definition, machine learning is a tool that takes a data set, finds a correlation in that data set, and uses that correlation to improve a system. Any complex system with well-defined behavior and clean data can be improved with machine learning.
Several precipitating forces have caused machine learning to become widely used: more data, cheaper storage, and better tooling. Two pieces of tooling that have been open sourced from Google help tremendously: Kubernetes and TensorFlow.
Kubernetes is not a tool for machine learning, but it simplifies distributed systems operations, unlocking more time for engineers to focus on things that are not as commodifiable–like tweaking machine learning parameters. TensorFlow is a framework for setting up machine learning systems.
Machine learning can affect every aspect of our lives–including tuxedo fitting. Generation Tux is a company that allows customers to rent apparel that historically has required in-person fitting. Using machine learning, the company has developed a system that allows customers to get fit for an outfit without entering a brick-and-mortar store.
In this episode, Colan Connon and Thomas Bell from Generation Tux join to explain how Generation Tux adopted Kubernetes and TensorFlow, and how the company’s infrastructure and machine learning pipeline work.