Adventures in DevOps

Will Button, Warren Parad

Join us in listening to the experienced experts discuss cutting edge challenges in the world of DevOps. From applying the mindset at your company, to career growth and leadership challenges within engineering teams, and avoiding the common antipatterns. Every episode you'll meet a new industry veteran guest with their own unique story.

Episodes

Mentioned books

Oct 20, 2025 • 50min

Solving incidents with one-time ephemeral runbooks

Share Episode ⸺ Episode Sponsor: Attribute - https://dev0ps.fyi/attributeIn the wake of one of the worst AWS incidents in history, we're joined by Lawrence Jones, Founding Engineer at Incident.io. The conversation focuses on the challenges of managing incidents in highly regulated environments like FinTech, where the penalties for downtime are harsh and require a high level of rigor and discipline in the response process. Lawrence details the company's evolution, from running a monolithic Go binary on Heroku to moving to a more secure, robust setup in GCP, prioritizing the use of native security primitives like GCP Secret Manager and Kubernetes to meet the obligations of their growing customer base.We spotlight exactly how a system can crawl GitHub pull requests, Slack channels, telemetry data, and past incident post-mortems to dynamically generate an ephemeral runbook for the current incident.Also discussed are the technical challenges of using RAG (Retrieval-Augmented Generation), noting that they rely heavily on pre-processing data with tags and a service catalog rather than relying solely on less consistent vector embeddings to ensure fast, accurate search results during a crisis.Finally, Lawrence stresses that frontier models are no longer the limiting factor in building these complex systems; rather, success hinges on building structured, modular systems, and doing the hard work of defining objective metrics for improvement.Notable FactsCloud Secrets management at scaleEpisode: Solving Time Travel in RAG DatabasesEpisode: Does RAG Replace keyword search?Picks:Warren - Anker Adpatable Wall-Charger - PowerPort Atom IIILawrence - Rocktopus & The Checklist Manifesto

Oct 2, 2025 • 30min

The IT Dictionary: Post-Mortems, Cargo Cults, and Dropped Databases

Share Episode ⸺ Episode Sponsor: Attribute - https://dev0ps.fyi/attributeWe're joined by 20 year industry veteran and DevOps advocate, Adam Korga, celebrating the release of his book IT Dictionary. In this episode we quickly get down to the inspiration behind postmortems as we review some cornerstone cases both in software and in general technology.Adam shares how he started in the industry, long before DevOps was a coined term, focused on making systems safer and avoiding mistakes like accidentally dropping a production database. we review the infamous incidents of accidental database deletion, by LLMs and human's alike.And of course we touch on the quintessential postmortems in civil engineering, flight, and survivorship bias from World War II through analyzing bullet holes on returning planes.Notable FactsAdam's book: IT DictionaryKnight Capital: the 45 minute nightmareWork Chronicles Comic: Will my architecture work for 1 Million users?Picks:Warren - Cuitisan CANDL storage containersAdam - FUBAR

Sep 24, 2025 • 55min

Vector Databases Explained: From E-commerce Search to Molecule Research

Share Episode ⸺ Episode Sponsor: Attribute - https://dev0ps.fyi/attributeJenna Pederson, Staff Developer Relations at Pinecone, joins us to close the loop on Vector Databases. Demystifies how they power semantic search, their role in RAG, and also unexpected applications.Jenna takes us beyond the buzzword bingo, explaining how vector databases are the secret sauce behind semantic search. Sharing just how "red shirt" gets converted into a query that returns things semantically similar. It's all about turning your data into high-dimensional numerical meaning, which, as Jenna clarifies, is powered by some seriously clever math to find those "closest neighbors."The conversation inevitably veers into Retrieval-Augmented Generation (RAG). Jenna reveals how databases are the unsung heroes giving LLMs real brains (and up-to-date info) when they’re prone to hallucinating or just don’t know your company’s secrets. They complete the connection from proprietary and generalist foundational models to business relevant answers.Notable FactsEpisode: MCP: The Model Context Protocol and Agent InteractionsCrossing the ChasmPicks:Warren - HanCenDa USB C Magnetic adapterJenna - Keychron Alice Layout Mechanical keyboard (And get a 5% discount on us)

Sep 17, 2025 • 53min

The Unspoken Challenges of Deploying to Customer Clouds

Share EpisodeThis episode we are joined by Andrew Moreland, co-founder of Chalk. Andrew explains how their company’s core business model is to deploy their software directly into their customers’ cloud environments. This decision was driven by the need to handle highly sensitive data, like PII and financial records, that customers don't want to hand over to a third-party startup. The conversation delves into the surprising and complex challenges of this approach, which include managing granular IAM permissions and dealing with hidden global policies that can block their application. Andrew and Warren also discuss the real-world network congestion issues that affect cross-cloud traffic, a problem they've encountered multiple times. Andrew shares Chalk's mature philosophy on software releases, where they prioritize backwards compatibility to prevent customer churn, which is a key learning from a competitor.Finally, the episode explores the advanced technical solutions Chalk has built, such as their unique approach to "bitemporal modeling" to prevent training bias in machine learning datasets. As well as, the decision to move from Python to C++ and Rust for performance, using a symbolic interpreter to execute customer code written in Python without a Python runtime. The episode concludes with picks, including a surprisingly popular hobby and a unique take on high-quality chocolate.Notable FactsFact - The $1M hidden Kubernetes spendGiraffe and Medical Ruler training data biasSOLID principles don't produce better code?Veritasium - The Hole at the Bottom of MathEpisode: Auth Showdown on backwards compatible changesPicks:Warren - Switzerland Grocery Store ChocolateAndrew - Trek E-Bikes

Sep 7, 2025 • 46min

How to build in Observability at Petabyte Scale

Share EpisodeWe welcome guest Ang Li and dive into the immense challenge of observability at scale, where some customers are generating petabytes of data per day. Ang explains that instead of building a database from scratch—a decision he says went "against all the instincts" of a founding engineer—Observe chose to build its platform on top of Snowflake, leveraging its separation of compute and storage on EC2 and S3.The discussion delves into the technical stack and architectural decisions, including the use of Kafka to absorb large bursts of incoming customer data and smooth it out for Snowflake's batch-based engine. Ang notes this choice was also strategic for avoiding tight coupling with a single cloud provider like AWS Kinesis, which would hinder future multi-cloud deployments on GCP or Azure. The discussion also covers their unique pricing model, which avoids surprising customers with high bills by providing a lower cost for data ingestion and then using a usage-based model for queries. This is contrasted with Warren's experience with his company's user-based pricing, which can lead to negative customer experiences when limits are exceeded.The episode also explores Observe’s "love-hate relationship" with Snowflake, as Observe's usage accounts for over 2% of Snowflake's compute, which has helped them discover a lot of bugs but also caused sleepless nights for Snowflake's on-call engineers. Ang discusses hedging their bets for the future by leveraging open data formats like Iceberg, which can be stored directly in customer S3 buckets to enable true data ownership and portability. The episode concludes with a deep dive into the security challenges of providing multi-account access to customer data using IAM trust policies, and a look at the personal picks from the hosts.Notable LinksFact - Passkeys: Phishing on Google's own domain and It isn't even newEpisode: All About OTELEpisode: Self Healing SystemsPicks:Warren - The Shadow (1994 film)Ang - XREAL Pro AR Glasses

Aug 24, 2025 • 59min

The Open-Source Product Leader Challenge: Navigating Community, Code, and Collaboration Chaos

In a special solo flight, Warren welcomes Meagan Cojocar, General Manager at Pulumi and a self-proclaimed graduate of “PM school” at AWS. They dive into what it’s like to own an entire product line and why giving up that startup hustle for the big leagues sometimes means you miss the direct signal from your users. The conversation goes deep on the paradox of open-source where direct feedback is gold, but dealing with license-shifting competitors can make you wary. From the notorious HashiCorp kerfuffle to the rise of OpenTofu, they explore how Pulumi maintains its commitment to the community amidst a wave of customer distrust.Meagan highlights the invaluable feedback loop provided by the community, allowing for direct interaction between users and the engineering team. This contrasts with the "telephone game" that can happen in proprietary product development. The conversation also addresses the recent industry shift and then immediate back-peddling from open-source licenses, discussing the subsequent customer distrust and how Pulumi maintains its commitment to the open-source model.And finally, the duo tackles the elephant in the cloud: LLMs, and extends on the early MCP episode. They debate the great code quality vs. speed trade-off, the risk of a "botched" infrastructure deployment, and whether these models can solve anything more than a glorified statistical guessing game. It's a candid look at the future of DevOps, where the real chaos isn't the code, but the tools that write it. The conversation concludes with a philosophical debate on the fundamental capabilities of LLMs, questioning whether they can truly solve "hard problems" or are merely powerful statistical next-word predictors.Notable LinksVeritasium - the Math that predicts everythingFact - Don't outsource your customer support: Clorox sues CognizantCloudFlare uses an LLM to generate an OAuth2 LibraryPicks:Warren - Rands Leadership CommunityMeagan - The Manager's Path by Camille Fournier

Jul 31, 2025 • 55min

FinOps: Holding engineering teams accountable for spend

Yasmin Rajabi, Chief Strategy Officer at CloudBolt and an expert in FinOps and Kubernetes cost optimization, discusses the crucial junction of financial accountability and engineering teams. She highlights the staggering waste from unused systems and resource mismanagement. Yasmin emphasizes the effectiveness of tools like the Horizontal and Vertical Pod Autoscalers. The conversation also delves into the rising complexities of cloud costs, especially with AI workloads, revealing that engineering salaries are no longer the only significant expense.

Jul 17, 2025 • 53min

The Auth Showdown: Single tenant versus Multitenant Architectures

Get ready for a lively debate on this episode of Adventures in DevOps. We're joined by Brian Pontarelli, founder of FusionAuth and CleanSpeak. Warren and Brian face off by diving into the controversial topic of multitenant versus single-tenant architecture. Expert co-host Aimee Knight joins to moderate the discussion. Ever wondered how someone becomes an "auth expert"? Warren spills the beans on his journey, explaining it's less about a direct path and more about figuring out what it means for yourself. Brian chimes in with his own "random chance" story, revealing how they fell into it after their forum-based product didn't pan out.Aimee confesses her "alarm bells" start ringing whenever multitenant architecture is mentioned, jokingly demanding "details" and admitting her preference for more separation when it comes to reliability. Brian makes a compelling case for his company's chosen path, explaining how their high-performance, downloadable single-tenant profanity filter, CleanSpeak, handles billions of chat messages a month with extreme low latency. This architectural choice became a competitive advantage, attracting companies that couldn't use cloud-based multitenant competitors due to their need to run solutions in their own data centers.We critique cloud providers' tendency to push users towards their most profitable services, citing AWS Cognito as an example of a cost-effective solution for small-scale use that becomes cost-prohibitive with scaling and feature enablement. The challenges of integrating with Cognito, including its reliance on numerous other AWS services and the need for custom Lambda functions for configuration, are also a point of contention. The conversation extends to the frustrations of managing upgrades and breaking changes in both multitenant and single-tenant systems and the inherent difficulties of ensuring compatibility across different software versions and integrations. The episode concludes with a humorous take on the current state and perceived limitations of AI in software development, particularly concerning security.PicksWarren - Scarpa Hiking shoes - Planet Mojito SuadeAimee - Peloton TreadBrian - Searchcraft and Fight or Flight

Jun 24, 2025 • 1h 7min

Should We Be Using Kubernetes: Did the Best Product Win?

Omer Hamerman, an architect at Zesty specializing in Kubernetes and AI, joins the discussion to delve into whether Kubernetes won the infrastructure race based on merit or other factors. They analyze its surprising adoption rate and the idea that human preference for gradual improvements may explain its popularity despite perceived complexity. Omer also considers the merits of serverless solutions like AWS Fargate and the balance between control and efficiency, alongside the environmental challenges posed by Kubernetes deployment and AI.

Jun 21, 2025 • 1h 18min

Mastering SRE: Insights in Scale and at Capacity with Aimee Knight

In this episode, Aimee Knight, an expert in Site Reliability Engineering (SRE) whose experience hails from Paramount and NPM, joins the podcast to discuss her journey into SRE, the challenges she faced, and the strategies she employed to succeed. Aimee shares her transition from a non-traditional background in JavaScript development to SRE, highlighting the importance of understanding both the programming and infrastructure sides of engineering. She also delves into the complexities of SRE at different scales, the role of playbooks in incident management, and the balance between speed and quality in software development.Aimee discusses the impact of AI and machine learning on SRE, emphasizing the need for responsible use of these tools. She touches on the importance of understanding business needs and how it affects decision-making in SRE roles. The conversation also covers the trade-offs in system design, the challenges of scaling applications, and the importance of resilience in distributed systems. Aimee provides valuable insights into the pros and cons of a career in SRE, including the importance of self-care and the satisfaction of mentoring others.The episode concludes with us discussing some of the hard problems such as the on-call burden for large teams, and the technical expertise an org needs to maintain higher complexity systems. Is the average tenure in tech decreasing, we discuss it and do a deep dive on the consequences in the SRE world.PicksThe Adventures In DevOps: SurveyWarren's Technical BlogWarren: The Fifth Discipline by Peter SengeAimee: Sleep Token (Band) - Caramel, GraniteWill: The Bear Grylls Celebrity Hunt on NetflixJillian: Horizon Zero Dawn Video Game

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

App store banner

Play store banner