Software Misadventures

Ronak Nathani, Guang Yang

A show about not just the technologies, but the people and stories behind them. In every episode, Ronak and Guang sit down with engineers, founders, and investors to chat about their paths, lessons they've learned and of course, the misadventures along the way.

Episodes

Sep 12, 2021 • 1h 6min

Bruno Connelly - Building and leading the global SRE org at LinkedIn - #14

Bruno Connelly is a VP of Engineering at LinkedIn. He leads the Site Engineering org responsible for LinkedIn's production infrastructure. He joins the show to talk about his journey in tech - from teaching himself how to code at a young age, building, maintaining and reverse engineering software as a teenager, building ISPs in the early part of his career (there are some fun stories that involve sleeping in the data center) to leading the SRE org at LinkedIn over the last decade. He talks about the early days at LinkedIn that involved a lot of firefighting to keep the site up, how the team built technical stability and scaled the platform. We also dive into how he grew the SRE org globally and overcame challenges that came with the growth. Throughout the conversation, he shares various nuggets of wisdom - like how to stay calm under pressure and how to make people feel at ease - as he describes his leadership style, people who have influenced him and what he thinks is a positive way to collaborate with people.

Website link: https://softwaremisadventures.com/bruno

Music Credits: Vlad Gluschenko - Forest. License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en

Aug 14, 2021 • 1h 24min

Lorin Hochstein - On how Netflix learns from incidents, software as socio-technical systems, writing persuasively and more - #13

With 5+ years of experience building resilient systems at the Netflix scale, Lorin joins the show to chat about his favorite incident story, the path that led him to doing chaos engineering (and later away from it), and advocating for a dedicated analyst to talk to people after an incident. Throughout the conversation, Lorin shares his philosophy and tips on how to learn from incidents, what engineers can gain from writing better, and why some metrics may not be as useful as you think.

Jul 9, 2021 • 1h 14min

Spoons (Daniel Spoonhower) - On building Lightstep, being customer focused, developing systems at Google scale and much more - #12

Spoons is the Co-founder and Chief Architect of Lightstep. He joins the show to talk about building systems at Google scale and the various aspects that make Google a weirder place than other companies. We talk about Spoons's journey of leaving Google and deciding to join Lightstep as a co-founder. We dig into the challenges during the early days of Lightstep and discuss the importance of speaking to customers to build the right product. We talk about what it's like to start a family while running a startup and how one can be intentional about building a company's culture. As always, we go through some of the misadventures, one of which involves a cable being cut under the English Channel.

Jun 11, 2021 • 1h 13min

Emmanuel Ameisen - On production ML at Stripe scale, leading 100+ ML projects, iterating fast, and much more - #11

Emmanuel Ameisen, a machine learning engineer at Stripe and former lead at Insight Data Science, shares invaluable insights on building and deploying ML products at scale. He highlights common pitfalls in launching ML projects, emphasizing practicality over complexity. Emmanuel discusses the challenges of transitioning from research to engineering roles and the necessity of effective data management. He also touches on validating models in production, exploring testing methodologies, and shares his experience writing a book for engineers.

May 7, 2021 • 1h 8min

Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be a ML SRE, challenges with generalized ML platforms and much more - #10

Todd Underwood, Sr Director of Engineering at Google, shares his extensive experience in Site Reliability Engineering for Machine Learning. He discusses how ML systems often fail due to issues unrelated to ML itself, the unique challenges of engineering reliable ML systems, and the crucial skills needed for hiring ML SREs. Todd also emphasizes the importance of empathy in tech during high-pressure scenarios and reflects on the balance between traditional software practices and the demands of ML pipelines, making the case for robust collaboration among teams.

Apr 23, 2021 • 1h 13min

Evan Estola - On recommendation systems going bad, hiring ML engineers, giving constructive feedback, filter bubbles and much more - #9

Evan Estola (https://twitter.com/estola) is a Director of Engineering at Flatiron Health where he's leading software engineering teams focused on building Machine Learning products. Throughout this episode, Evan shares stories of times when recommendation systems didn't work as expected, like the time members saw the mathematically worst recommendations for meetups near them. He also shares why Schenectady, NY pops up on some lists of most popular cities and the story behind the Wall Street Journal article titled 'Orbitz steers Mac users to pricier hotels'. We also discuss skills Evan looks for when hiring ML engineers, how to give constructive feedback, filter bubbles and much more.

Apr 9, 2021 • 1h 2min

Uma Chingunde - On managing migrations, growing engineering teams and much more - #8

Uma is a VP of Engineering at Render. We had a great time speaking with Uma! Our major focus in this episode was large-scale infrastructure migrations, and Uma shared many insights on how to manage them successfully. We discussed the importance of communicating the "why" behind a migration, identifying success metrics, creating a culture where migrations are identified as highly impactful projects and much more. Uma also shared stories where parts of a migration didn't go as planned, how the team fixed the issue and the kind of engineers she thinks would make good tech leads for these projects. There's a lot to learn from Uma's experience. Please enjoy this highly educational conversation with Uma Chingunde!

Mar 20, 2021 • 1h 7min

Charity Majors - On database outages, journey as a co-founder, thriving under pressure and growing as an engineer - #7

Charity Majors (https://twitter.com/mipsytipsy) is the co-founder and CTO of Honeycomb.io. Before this she worked at Facebook, Parse and Linden Lab on infrastructure and developer tools, and always seemed to wind up running the databases. She is the co-author of the book Database Reliability Engineering and also has an amazing blog at charity.wtf. We love her blog posts and have learned a lot from them. We had a lot of fun speaking with Charity in this lively conversation! We learned about her journey from being an engineer to co-founding Honeycomb, what it was like being on-call when she was only 17, and staying calm during production incidents. We talked about various production outages throughout the episode and our favorite involved driving to a datacenter to flip a DB switch. Charity also shares what it takes to build an awesome engineering culture, the engineer/manager pendulum, and the qualities she looks for when hiring senior engineers.

Mar 7, 2021 • 1h 4min

Tammy Bryant Butow - On failure injection, chaos engineering, extreme sports and being curious - #6

Tammy Bryant Butow is a Principal SRE at Gremlin where she works on Chaos Engineering. In this episode, we discuss how her curiosity led her to the world of infrastructure engineering, an outage from her early days where a core switch took down half the datacenter, her experience running a disaster recovery test and how it taught her about the importance of injecting failures into a system to make it more resilient. We also touch on advanced failure injection techniques, how chaos engineering is evolving and how extreme sports help Tammy keep calm under pressure. Lastly, Tammy has some great advice for teams looking to get started with chaos engineering.

Feb 19, 2021 • 1h 1min

Oliver Leaver-Smith - On how "just a monitoring change" took down the entire site and resilience engineering - #5

Oliver Leaver-Smith, better known as Ols, is a Senior DevOps Engineer at Sky Betting and Gaming. In this episode, we discuss how a seemingly simple monitoring change ended up taking down the entire site. We also talk about chaos and resilience engineering, including how the team at Sky Betting and Gaming conducts fire drills (chaos engineering exercises) that test the resiliency of not only their software systems but also their people systems. We walk through a recent example of a fire drill, how the drills have evolved over the past few years and the lessons learned in the process.