The InfoQ Podcast

InfoQ
Feb 16, 2017 • 39min

Jonas Bonér on the Actor Model, Akka, Reactive Programming, Microservices and Distributed Systems

Jonas Bonér, CTO of Lightbend and creator of Akka, discusses using Akka when developing distributed systems. He talks about the Actor Model, and how every microservice needs to be viewed as a system to be successful.

Why listen to this podcast:
- Akka is a JVM-based framework designed for developing distributed systems leveraging the Actor Model, an approach to writing concurrent systems that treats actors as universal primitives; the most successful abstraction built on top of it has been streaming.
- Circuit breakers in Akka are a backup and retry policy; they protect you by capturing failure data and allowing you to roll back (see the sketch of the general pattern below).
- Every microservice needs to be viewed as a system: it needs multiple parts that run on different machines in order to function and be fully resilient - it is thus a microsystem.
- Two different trends have emerged when it comes to hardware and environments: one is the trend toward multi-core; the other is a movement toward virtualized environments and the cloud.
- The Saga pattern for managing long-running transactions in a distributed system fits very well with messaging-style architectures.

Notes and links can be found on: http://bit.ly/2kwB2eB

Topics covered:
Akka
The Actor Model
When Akka and the Actor Model are the perfect choice
Circuit breaker patterns in distributed systems
Two trends toward multi-core
Reactive Manifesto
Event Driven vs. Message Driven
Reactive Programming and Streams
Microliths to Microsystems
What do you have to get right before you start trying to deploy a distributed system?
Working with ML/AI at Lightbend to understand tracing through distributed systems
Saga Pattern

More on this: Quick scan our curated show notes on InfoQ http://bit.ly/2kwB2eB

You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq
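Akka ships its own circuit breaker (akka.pattern.CircuitBreaker); the sketch below is not Akka's API but a minimal illustration of the general closed/open/half-open pattern Bonér refers to, written in Go for consistency with the other sketches on this page. The thresholds and names are invented, and the code is deliberately not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var ErrOpen = errors.New("circuit breaker is open")

// Breaker is a minimal circuit breaker: after maxFailures consecutive
// failures it "opens" and fails fast until resetTimeout has elapsed,
// then allows one trial call ("half-open") before closing again.
// Not goroutine-safe; a real implementation needs synchronization.
type Breaker struct {
	maxFailures  int
	resetTimeout time.Duration
	failures     int
	openedAt     time.Time
}

func (b *Breaker) Call(f func() error) error {
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.resetTimeout {
		return ErrOpen // open: protect the caller and the downstream service
	}
	if err := f(); err != nil {
		b.failures++ // capture the failure
		b.openedAt = time.Now()
		return err
	}
	b.failures = 0 // success closes the breaker
	return nil
}

func main() {
	b := &Breaker{maxFailures: 3, resetTimeout: 5 * time.Second}
	for i := 0; i < 5; i++ {
		fmt.Println(i, b.Call(func() error { return errors.New("downstream unavailable") }))
	}
}
```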
Jan 27, 2017 • 40min

Peter Bourgon on Gossip, Paxos, Microservices in Go, and CRDTs at SoundCloud

Peter Bourgon discusses his work at Weaveworks, discovering and implementing CRDTs for time-stamped events at SoundCloud, microservices in Go with Go Kit, and the state of package management in Go.

Why listen to this podcast:
- We’ve hit the limits of Moore’s law, so when we want to scale we have to think about how we do communication across unreliable links between unreliable machines.
- In an AP algorithm like Gossip you still make forward progress in case of a failure. In Paxos you stop and return failures.
- CRDTs give us a rigorous set of rules for accommodating failures in communication for maps, sets, etc., that result in an eventually consistent system (see the sketch below).
- Go is optimised for readers/maintainers rather than for making the programmer’s life easier. Go is closer to C than Java in that it allows you to lay out memory very precisely, letting you, for example, optimise for cache lines in your CPU.
- Bourgon started a project called Go Kit, which is designed for building microservices in Go. It takes inspiration from Twitter’s Scala-based Finagle, which solved a lot of microservice concerns.
- Go has a number of community-maintained package managers but no good solution yet; work is ongoing to try and resolve this.

Notes and links can be found on: http://bit.ly/2kaHC9k

Topics covered:
Work at Weaveworks
Gossip vs. Paxos
CRDTs at SoundCloud
Go
Go in large teams
Go and Java package management
Microservices in Go with Go Kit
Logging and tracing in a distributed environment

More on this: Quick scan our curated show notes on InfoQ. http://bit.ly/2kaHC9k

You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq
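At SoundCloud the CRDTs handled sets of time-stamped events; as a smaller, self-contained illustration of the rules Bourgon describes, here is a sketch of a grow-only counter (G-Counter) in Go. The merge is commutative, associative and idempotent, which is what makes replicas converge; this is illustrative only, not the SoundCloud implementation.

```go
package main

import "fmt"

// GCounter is a grow-only counter CRDT: each node increments only its
// own slot, and merging takes the per-node maximum, so replicas can
// exchange state in any order and still converge.
type GCounter struct {
	counts map[string]int // node ID -> that node's increments
}

func NewGCounter() *GCounter { return &GCounter{counts: map[string]int{}} }

func (g *GCounter) Increment(node string) { g.counts[node]++ }

// Value sums all per-node counts to give the logical counter value.
func (g *GCounter) Value() int {
	total := 0
	for _, c := range g.counts {
		total += c
	}
	return total
}

// Merge folds another replica's state into this one by taking the
// element-wise maximum; applying it twice, or in any order, is safe.
func (g *GCounter) Merge(other *GCounter) {
	for node, c := range other.counts {
		if c > g.counts[node] {
			g.counts[node] = c
		}
	}
}

func main() {
	a, b := NewGCounter(), NewGCounter()
	a.Increment("node-a")
	b.Increment("node-b")
	b.Increment("node-b")
	a.Merge(b)             // a now reflects both replicas
	fmt.Println(a.Value()) // 3
}
```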
Jan 6, 2017 • 29min

Neha Batra - Pivotal Labs Pair Programming

In this week’s podcast Wes Reisz talks to Neha Batra, a software engineer at Pivotal Labs. Neha spoke about pair programming in her recent QCon San Francisco 2016 presentation, and has taken time to discuss techniques to get started with the practice as well as tips for implementing it on your team. Neha also touches on vulnerability-based trust and how it can help effectively build a trusting team environment.

Why listen to this podcast:
- If you successfully start with pair programming, other tenets of XP are pulled along with you.
- Ways to get creative with remote pairing to make it work.
- The daily retro.
- Overcoming hesitance from managers when trying to implement pair programming full time.
- Vulnerability-based trust building.

Notes and links can be found on: http://bit.ly/2i2a0sJ

How has Pair Programming Evolved Over the Years?
6m:17s - A lot of the fundamentals are the same, but with XP we take it to the extreme to be able to do it eight hours a day.
6m:24s - To pair for eight hours a day we adopt a lot of other processes to create a simpler way of working, giving us an easier level to default to.
6m:44s - We use phrases in the team to make sure we agree on a test, that there are no false positives, when to refactor, etc. This helps us avoid accruing code debt since we don’t do code reviews.

More on this: Quick scan our curated show notes on InfoQ. http://bit.ly/2i2a0sJ

You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq
Dec 30, 2016 • 33min

Oliver Gould on Architecting to Avoid and Recover from Failure

In this week’s podcast, Robert Blumen talks to Oliver Gould at QCon San Francisco 2016. Oliver is the CTO of Buoyant, where he leads open source development efforts. Prior to Buoyant he was a Staff Infrastructure Engineer at Twitter, where he was technical lead on the Observability, Traffic, Configuration and Coordination teams.

Why listen to this podcast:
- Stratification allows applications to own their logic while libraries take care of the different mechanisms, such as service discovery and load balancing.
- Cascading failures can’t be tested or protected against, so having a fast time to recovery is important.
- Having developers own their services with on-call mechanisms improves the reliability of the service; it’s best to optimise automatic restarts so problems can be addressed during normal working hours.
- Post-mortem analysis of failures is important to improve run books or checklists and to share learning between teams.
- Incremental roll-out of features with feature flags or weighted routing provides agility while testing with production load, which highlights issues that aren’t seen during limited developer testing (a sketch of weighted routing follows at the end of this entry).

Notes and links can be found on: http://bit.ly/2ivoz9w

4m:05s - Each domain has different failure and operating modes, and the layered approach to resiliency means that the layer handles this automatically.
4m:30s - Large systems may fail in unexpected ways.
4m:35s - Twitter originally had the “Fail Whale”, but this has been phased out as the system has become more stable.
4m:50s - As Twitter grew, it needed to move quicker, with more engineers and less whale time.
5m:10s - Automation and social tools were needed to improve the situation.

More on this - Quick scan our curated show notes on InfoQ: http://bit.ly/2ivoz9w

You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq
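Weighted routing, one of the incremental roll-out mechanisms Gould mentions, amounts to picking a backend with probability proportional to its weight. Below is a minimal sketch in Go; the backend names and the 99/1 split are invented, and this is not the implementation used at Buoyant or Twitter.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Backend pairs a service version with a relative routing weight.
type Backend struct {
	Name   string
	Weight int
}

// pick selects a backend with probability proportional to its weight,
// e.g. 99% stable / 1% canary for an incremental rollout that tests
// the new version under real production load.
func pick(backends []Backend) Backend {
	total := 0
	for _, b := range backends {
		total += b.Weight
	}
	n := rand.Intn(total)
	for _, b := range backends {
		if n < b.Weight {
			return b
		}
		n -= b.Weight
	}
	return backends[len(backends)-1] // unreachable with positive weights
}

func main() {
	backends := []Backend{{"v1-stable", 99}, {"v2-canary", 1}}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pick(backends).Name]++
	}
	fmt.Println(counts) // roughly 9900 stable / 100 canary
}
```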
Dec 23, 2016 • 25min

Chris Richardson on Domain-Driven Microservices Design

In this week’s podcast, Thomas Betts talks with Chris Richardson, a developer, architect, Java Champion and author of POJOs in Action. Before his workshop on Microservices with Spring Boot and Docker at QCon San Francisco 2016, Richardson took time to discuss his ideas on how to use DDD and CQRS concepts as a guide for implementing a robust microservices architecture.

Why listen to this podcast:
- "Microservice architecture" is a better term than "microservices". The latter suggests that a single microservice is somehow interesting.
- The concepts discussed in Domain-Driven Design are an excellent guide for how to implement a microservices architecture.
- Bounded Contexts correspond well to individual microservices.
- Event sourcing and CQRS provide patterns for how to implement loosely coupled services (see the sketch at the end of this entry).
- When converting a monolith to microservices, avoid a big-bang rewrite in favor of an iterative approach.

Notes and links can be found on: http://bit.ly/2hZ8TM1

11m:51s - Microservices must be loosely coupled, usually creating a model with one database per service.
12m:45s - There is a business requirement to maintain data consistency across services, and using an event-driven architecture is a good way to achieve that.
13m:38s - Event sourcing is a specific technique for persisting domain objects as a series of events.
14m:11s - Just as transactions don’t like to be split across microservices, queries cannot simply join across multiple data sources. CQRS provides a solution that accommodates querying via microservices and materialized views.

More on this: Quick scan our curated show notes on InfoQ. http://bit.ly/2hZ8TM1

You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq
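As a minimal sketch of the event-sourcing idea Richardson describes, where state is persisted as a series of events and the current state is rebuilt by replaying them, here is a toy Go example. The Account aggregate and event types are invented for illustration; the same replayed history is what a CQRS read side would consume to build materialized views.

```go
package main

import "fmt"

// Event is a domain event; in event sourcing, state changes are stored
// as an append-only sequence of these rather than as mutable rows.
type Event interface{ apply(*Account) }

type Deposited struct{ Amount int }
type Withdrawn struct{ Amount int }

func (e Deposited) apply(a *Account) { a.Balance += e.Amount }
func (e Withdrawn) apply(a *Account) { a.Balance -= e.Amount }

// Account is the aggregate; its state is derived from its history.
type Account struct {
	Balance int
	history []Event
}

// Record applies an event and appends it to the aggregate's history.
func (a *Account) Record(e Event) {
	e.apply(a)
	a.history = append(a.history, e)
}

// Replay rebuilds an aggregate from stored events - the core of
// event sourcing.
func Replay(events []Event) *Account {
	a := &Account{}
	for _, e := range events {
		e.apply(a)
	}
	return a
}

func main() {
	a := &Account{}
	a.Record(Deposited{Amount: 100})
	a.Record(Withdrawn{Amount: 30})
	restored := Replay(a.history)
	fmt.Println(restored.Balance) // 70
}
```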
Dec 16, 2016 • 36min

Keith Adams on the Architecture of Slack, using MySQL, Edge Caching, & the backend Messaging Server

In this week’s podcast, QCon chair Wesley Reisz talks to Keith Adams, chief architect at Slack. Prior to that he was an engineer at Facebook, where he worked on the search backend, and he is well-known for the HipHop VM [hhvm.com]. Adams presented How Slack Works at QCon San Francisco 2016.

Why listen to this podcast:
- Group messaging succeeds when it feels like a place for members to gather, rather than just a tool.
- Having opt-in group membership scales better than having to define a group on the fly, like a mailing list instead of individually adding people to a mail.
- Choosing availability over consistency is sometimes the right choice for particular use cases.
- Consistency can be recovered after the fact with custom conflict-resolution tools.
- Latency is important and can be addressed by having proxies or edge applications closer to the user.

Notes and links can be found on: http://bit.ly/keith-adams

3m:30s - Voice and video interactions are impacted by latency; the same is true of messaging clients.
4m:00s - The user interface can provide indications of presence, through avatars indicating availability and typing indicators.
4m:15s - Latency is important; sometimes the difference is between 100ms and 200ms, so the message channel monitors ping timeout between server and client (a sketch of a percentile check follows at the end of this entry).
4m:40s - The 99th percentile is less than 100ms ping time.
5m:15s - If the 99th percentile is more than 100ms then it may be server-based, such as needing to tune the Java GC.
5m:25s - Network conditions of the mobile clients are highly variable.
6m:20s - Mobile clients can suffer intermittent connectivity.

More on this: Quick scan our curated show notes on InfoQ. http://bit.ly/keith-adams

You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq
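As a toy illustration of the kind of 99th-percentile latency check described in the notes above (not Slack’s code; the samples, the nearest-rank method and the 100ms budget are assumptions for the example):

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the nearest-rank p-th percentile of a set of
// latency samples, e.g. the p99 ping time between server and client.
func percentile(samples []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	return sorted[rank-1]
}

func main() {
	samples := []time.Duration{
		40 * time.Millisecond, 55 * time.Millisecond, 62 * time.Millisecond,
		70 * time.Millisecond, 85 * time.Millisecond, 250 * time.Millisecond,
	}
	if p99 := percentile(samples, 99); p99 > 100*time.Millisecond {
		// past the budget: suspect the server side, e.g. GC pauses
		fmt.Println("p99 ping", p99, "exceeds the 100ms budget")
	}
}
```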
Dec 9, 2016 • 25min

Haley Tucker on Responding to Failures in Playback Features at Netflix

In this week’s podcast, Thomas Betts talks with Haley Tucker, a Senior Software Engineer on the Playback Features team at Netflix. While at QCon San Francisco 2016, Tucker told some production war stories about trying to deliver content to 65 million members.

Why listen to this podcast:
- Distributed systems fail regularly, often due to unexpected reasons.
- Data canaries can identify invalid metadata before it can enter and corrupt the production environment.
- ChAP, the Chaos Automation Platform, can test failure conditions alongside the success conditions.
- Fallbacks are an important component of system stability, but the fallbacks must be fast and light to not cause secondary failures (see the sketch at the end of this entry).
- Distributed systems are fundamentally social systems, and require a blameless culture to be successful.

Notes and links can be found on: http://bit.ly/2hqzQ6K

2m:05s - The Video Metadata Service aggregates several sources into a consistent API consumed by other Netflix services.
2m:43s - Several checks and validations were in place within the Video Metadata Service, but it is impossible to predict every way consumers will be using the data.
3m:30s - The access pattern used by the playback service was different than that used in the checks, and led to unexpected results in production.
3m:58s - Now, the services consuming the data are also responsible for testing and verifying the data before it rolls out to production. The Video Metadata Service can orchestrate the testing process.

More on this: Quick scan our curated show notes on InfoQ. http://bit.ly/2hqzQ6K

You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq
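Not Netflix’s code, but a minimal Go sketch of the “fast and light” fallback property Tucker describes: the primary call runs under a tight deadline, and the fallback is a static value that does no I/O, so it cannot itself become a secondary failure. The timeout and function names are invented.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchWithFallback tries the primary call under a tight deadline and,
// on error or timeout, returns a cheap precomputed fallback instead.
func fetchWithFallback(ctx context.Context, primary func(context.Context) (string, error), fallback string) string {
	ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
	defer cancel()
	result, err := primary(ctx)
	if err != nil {
		return fallback // static, no further dependencies to fail
	}
	return result
}

func main() {
	slow := func(ctx context.Context) (string, error) {
		select {
		case <-time.After(500 * time.Millisecond): // simulated slow dependency
			return "personalized playback settings", nil
		case <-ctx.Done():
			return "", errors.New("deadline exceeded")
		}
	}
	fmt.Println(fetchWithFallback(context.Background(), slow, "default playback settings"))
}
```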
Dec 2, 2016 • 29min

Kolton Andrus on Lessons Learnt From Failure Testing at Amazon and Netflix and His New Venture Gremlin

In this week's podcast, QCon chair Wesley Reisz talks to Kolton Andrus. Andrus is the founder of Gremlin Inc. He was a Chaos Engineer at Netflix, focused on the resilience of the Edge services, and designed and built FIT, Netflix’s failure injection service. Prior to that, he improved the performance and reliability of the Amazon Retail website.

Why listen to this podcast:
- Gremlin, Kolton Andrus' new start-up, is focused on providing failure testing as a service. Version 1, currently in closed beta, is focused on infrastructure failures.
- Lineage-driven Fault Injection (LDFI) allowed Netflix to dramatically reduce the number of tests they needed to run in order to explore a problem space.
- You generally want to run failure tests in production, but you can't start there. Start in development and build up.
- Having failure testing at an application level, as Netflix does, allows request-level fault injection for a specific user or a specific device (see the sketch at the end of this entry).
- Being able to trace infrastructure with something like Dapper or Zipkin offers tremendous value. At Netflix, the failure injection system is integrated into the tracing system, which meant that when they caused a failure they could see all the points in the system that it touched.

More on this: Quick scan our curated show notes on InfoQ. http://bit.ly/2fT9YiM

You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq
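FIT itself is internal to Netflix; as a rough sketch of what request-level fault injection scoped to a single user can look like, here is minimal Go HTTP middleware. The header name and user ID are invented for the example.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// faultInjection wraps a handler and fails requests matching a
// targeting rule - here, one specific user - so a failure experiment
// can run in production without affecting everyone.
func faultInjection(next http.Handler, targetUser string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-User-ID") == targetUser {
			http.Error(w, "injected failure", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	http.Handle("/", faultInjection(ok, "test-user-42"))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```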
Nov 18, 2016 • 26min

Preslav Le on How Dropbox Moved off AWS and What They Have Been Able to Do Since

As InfoQ previously reported in March 2016, Dropbox announced that they had migrated away from Amazon Web Services (AWS). In this week's podcast Robert Blumen talks to Preslav Le. Preslav has been a software engineer at Dropbox for the past three years, contributing to various aspects of Dropbox’s infrastructure including traffic, performance and storage. He was part of the core on-call and storage on-call rotations, dealing with high-severity real-world issues, from bad code pushes to complete datacenter outages.

Why listen to this podcast:
- Dropbox migrated away from Amazon S3 to their own data centres to allow them to optimise for their specific use case.
- They are experimenting with Shingled Magnetic Recording (SMR) drives for primary storage to increase storage density. All writes go to an SSD cache and then get pushed asynchronously to the SMR disk.
- Their average block size is 1.6MB, with a maximum block size of 4MB. Knowing this allows the team to tune their storage system.
- Three languages are used for the backend infrastructure: Python is used mainly for business logic, Go is the primary language for heavy infrastructure services, and in some cases, for example where more direct control over memory is needed, Rust is also used.
- Dropbox invests very heavily in verification and automation. A verifier scans every byte on disk and checks that it matches the checksum in the index (see the sketch at the end of this entry).
- Verification is also used to check that each box has the block keys it should have.

Notes and links can be found on http://bit.ly/preslav-le

Dropbox’s motivation for moving off the cloud
2:40 - Dropbox used Amazon S3 and other services where it made sense, but they stored all the metadata in their own data centres.
3:30 - Initially this was done because Amazon had poor support for persistent storage at the time. This has since improved, but it didn’t make sense for Dropbox to move the metadata back.
4:01 - By that time the Dropbox team was ready to tackle the storage problem and built their own in-house replacement for S3, called Magic Pocket. Magic Pocket allowed Dropbox to move away from Amazon altogether.
4:30 - The move saved money, but also allowed Dropbox to optimise for their specific use case and be faster.

More on this: Quick scan our curated show notes on InfoQ. http://bit.ly/preslav-le

You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq
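Magic Pocket’s internals aren’t public at this level of detail, but the verifier idea Le describes - recompute each block’s checksum and compare it with the checksum recorded in the index - can be sketched in a few lines of Go. SHA-256 here is an assumption, standing in for whatever checksum the index actually stores.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifyBlock recomputes a block's checksum and compares it with the
// checksum recorded in the index; a mismatch flags the block for repair.
func verifyBlock(block []byte, indexedChecksum string) bool {
	sum := sha256.Sum256(block)
	return hex.EncodeToString(sum[:]) == indexedChecksum
}

func main() {
	block := []byte("example block contents")
	sum := sha256.Sum256(block)
	indexed := hex.EncodeToString(sum[:]) // what the index would store

	fmt.Println(verifyBlock(block, indexed)) // true: block is intact
	block[0] ^= 0xFF                         // simulate on-disk corruption
	fmt.Println(verifyBlock(block, indexed)) // false: schedule repair
}
```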
Nov 11, 2016 • 26min

Randy Shoup on Stitch Fix's Technology Stack, Data Science and Microservices

In this week's podcast QCon chair Wesley Reisz talks to Randy Shoup. Shoup is the vice president of engineering at Stitch Fix. Prior to Stitch Fix, he worked for Google as a director of engineering in cloud computing, was CTO and co-founder of Shopilly, and was chief engineer at eBay.

Why listen to this podcast:
- Stitch Fix's business is a combination of art and science. Humans are much better with the machines, and the machines are much better with the humans.
- Stitch Fix has 60 engineers and 80 data scientists and algorithm developers. This ratio of data science to engineering is unique.
- With Ruby on Rails on top of Postgres, the company maintains about 30 different applications on the same stack.
- The practice of Test-Driven Development makes Continuous Delivery work, and the practice of having the same people build the code as those who operate it makes both of these things much more powerful.
- Microservices give feature velocity: the ability for individual teams to move quickly and independently of each other, with independent deployments.
- Microservices solve a scaling problem - both an organisational scaling problem and a technological scaling problem. These are not problems that you have early on in a startup.
- In the monolithic world, you may not be able to continue to vertically scale the application or the database or whatever your monolith is, so for scaling reasons alone you might consider breaking it up into what we call microservices.

Notes and links can be found on http://bit.ly/randy-shoup-podcast

Data Science and Stitch Fix
1m:57s - Stitch Fix re-imagines retail, particularly for clothing. When you sign up, you fill out a survey of the kinds of things that you like and you don't like, and we choose what we think you're going to enjoy based on the millions of customers that we have already. And we use a ton of data science in that process.
3m:00s - That goes into our algorithms, and then our algorithms make personalised recommendations based on all the things we know about our other customers... there's a human element as well: we have 3,200 human stylists all around the United States, and they choose the five items that go into the box [of clothing].
3m:29s - What we like is that this is a combination of art and science. Modern companies combine what machines are really good at, such as chugging through the 60 to 70 questions times the millions of customers, with the human element of the stylists: figuring out what things go together, what things are trending, what things are appropriate... Humans are much better with the machines, and the machines are much better with the humans. [...]

More on this: Quick scan our curated show notes on InfoQ. http://bit.ly/randy-shoup-podcast

You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq
