KubeFM

Latest episodes

May 20, 2025 • 33min

Managing 100s of Kubernetes Clusters using Cluster API, with Zain Malik

Discover how to manage Kubernetes at scale with declarative infrastructure and automation principles.

Zain Malik shares his experience managing multi-tenant Kubernetes clusters with up to 30,000 pods across clusters capped at 950 nodes. He explains how his team transitioned from Terraform to Cluster API for declarative cluster lifecycle management, contributing upstream to improve AKS support while implementing GitOps workflows.

You will learn:
- How to address challenges in large-scale Kubernetes operations, including node pool management inconsistencies and lengthy provisioning times
- Why Cluster API provides a powerful foundation for multi-cloud cluster management, and how to extend it with custom operators for production-specific needs
- How implementing GitOps principles eliminates manual intervention in critical operations like cluster upgrades
- Strategies for handling production incidents and bugs when adopting emerging technologies like Cluster API

Sponsor
This episode is sponsored by Learnk8s: get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info
Find all the links and info for this episode here: https://ku.bz/5PLksqVlk
May 13, 2025 • 46min

Super-Scaling Open Policy Agent with Batch Queries, with Nicholaos Mouzourakis

Dive into the technical challenges of scaling authorization in Kubernetes with this in-depth conversation about Open Policy Agent (OPA).

Nicholaos Mouzourakis, Staff Product Security Engineer at Gusto, explains how his team re-architected Kubernetes native authorization using OPA to support scale, latency guarantees, and audit requirements across services. He shares detailed insights about their journey optimizing OPA performance through batch queries and solving unexpected interactions between Kubernetes resource limits and Go's runtime behavior.

You will learn:
- Why traditional authorization approaches (code-driven and data-driven) fall short in microservice architectures, and how OPA provides a more flexible, decoupled solution
- How batch authorization can improve performance by up to 18x by reducing network round-trips
- The unexpected interaction between Kubernetes CPU limits and Go's thread management (GOMAXPROCS) that can severely impact OPA performance
- Practical deployment strategies for OPA in production environments, including considerations for sidecars, DaemonSets, and WASM modules

Sponsor
This episode is sponsored by Learnk8s: get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info
Find all the links and info for this episode here: https://ku.bz/S-2vQ_j-4
May 6, 2025 • 34min

Kubernetes upgrades: beyond the one-click update, with Tanat Lokejaroenlarb

Discover how Adevinta manages Kubernetes upgrades at scale in this episode with Tanat Lokejaroenlarb.

Tanat shares his team's journey from time-consuming blue-green deployments to efficient in-place upgrades for their multi-tenant Kubernetes platform SHIP, detailing the engineering decisions and operational challenges they overcame.

You will learn:
- How to transition from blue-green to in-place Kubernetes upgrades while maintaining service reliability
- Techniques for tracking and addressing API deprecations using tools like Pluto and Kube-no-trouble
- Strategies for minimizing SLO impact during node rebuilds through serialized approaches and proper PDB configuration
- Why a phased upgrade approach with "cluster waves" provides safer production deployments even with thorough testing

Sponsor
This episode is sponsored by Learnk8s: get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info
Find all the links and info for this episode here: https://ku.bz/VVHFfXGl_
Apr 29, 2025 • 34min

From Fragile to Faultless: Kubernetes Self-Healing In Practice, with Grzegorz Głąb

Discover how to build resilient Kubernetes environments at scale with practical automation strategies from an engineer who's tackled complex production challenges.

Grzegorz Głąb, Kubernetes Engineer at Cloud Kitchens, shares his team's journey developing a comprehensive self-healing framework. He explains how they addressed issues ranging from spot node preemptions to network packet drops caused by unbalanced IRQs, providing concrete examples of automation that prevents downtime and improves reliability.

You will learn:
- How managed Kubernetes services like AKS provide benefits but require customization for specific use cases
- The architecture of an effective self-healing framework using DaemonSets and deployments with Kubernetes-native components
- Practical solutions for common challenges like StatefulSet pods stuck on unreachable nodes and cleaning up orphaned pods
- Techniques for workload-level automation, including throttling CPU-hungry pods and automating diagnostic data collection

Sponsor
This episode is sponsored by Learnk8s: get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info
Find all the links and info for this episode here: https://ku.bz/yg_fkP0LN
Apr 22, 2025 • 1h 3min

Replacing StatefulSets with a custom Kubernetes operator in our Postgres cloud platform, with Andrew Charlton

Discover why standard Kubernetes StatefulSets might not be sufficient for your database workloads and how custom operators can provide better solutions for stateful applications.

Andrew Charlton, Staff Software Engineer at Timescale, explains how they replaced Kubernetes StatefulSets with a custom operator called Popper for their PostgreSQL Cloud Platform. He details the technical limitations they encountered with StatefulSets and how their custom approach provides more intelligent management of database clusters.

You will learn:
- Why StatefulSets fall short for managing high-availability PostgreSQL clusters, particularly around pod ordering and volume management
- How Timescale's instance matching approach solves complex reconciliation challenges when managing heterogeneous database workloads
- The benefits of implementing discrete, idempotent actions rather than workflows in Kubernetes operators
- Real-world examples of operations that became possible with their custom operator, including volume downsizing and availability zone consolidation

Sponsor
This episode is brought to you by mirrord: run local code like in your Kubernetes cluster without deploying first.

More info
Find all the links and info for this episode here: https://ku.bz/fhZ_pNXM3
Mar 18, 2025 • 52min

Saving 10s of thousands of dollars deploying AI at scale with Kubernetes, with John McBride

Curious about running AI models on Kubernetes without breaking the bank? This episode delivers practical insights from someone who's done it successfully at scale.

John McBride, VP of Infrastructure and AI Engineering at the Linux Foundation, shares how his team at OpenSauced built StarSearch, an AI feature that uses natural language processing to analyze GitHub contributions and provide insights through semantic queries. By using open-source models instead of commercial APIs, the team saved tens of thousands of dollars.

You will learn:
- How to deploy vLLM on Kubernetes to serve open-source LLMs like Mistral and Llama, including configuration challenges with GPU drivers and DaemonSets
- Why smaller models (7-14B parameters) can achieve 95% effectiveness for many tasks compared to larger commercial models, with proper prompt engineering
- How running inference workloads on your own infrastructure with T4 GPUs can reduce costs from tens of thousands to just a couple thousand dollars monthly
- Practical approaches to monitoring GPU workloads in production, including handling unpredictable failures and VRAM consumption issues

Sponsor
This episode is brought to you by StackGen! Don't let infrastructure block your teams. StackGen deterministically generates secure cloud infrastructure from any input: existing cloud environments, IaC or application code.

More info
Find all the links and info for this episode here: https://ku.bz/wP6bTlrFs
Mar 4, 2025 • 31min

I just want mTLS on Kubernetes, with John Howard

Dive into the world of Kubernetes security with this insightful conversation about securing cluster traffic through encryption.

John Howard, Senior Software Engineer at Solo.io, explains the complexities of implementing Mutual TLS (mTLS) in Kubernetes. He discusses the evolution from DIY approaches to Service Mesh solutions, focusing on Istio's Ambient Mesh as a simplified path to workload encryption.

You will learn:
- Why DIY mTLS implementation in Kubernetes is challenging at scale, requiring certificate management, application updates, and careful transition planning
- How Service Mesh solutions offload security concerns from applications, allowing developers to focus on business logic while infrastructure handles encryption
- The advantages of Ambient Mesh's approach to simplifying mTLS implementation with its node proxy and waypoint proxy architecture

Sponsor
This episode is sponsored by Learnk8s: get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info
Find all the links and info for this episode here: https://ku.bz/sk-ZF1PG9
Feb 25, 2025 • 32min

Learned it the hard way: don't use Cilium's default Pod CIDR, with Isala Piyarisi

This episode examines how a default configuration in Cilium CNI led to silent packet drops in production after 8 months of stable operations.

Isala Piyarisi, Senior Software Engineer at WSO2, shares how his team discovered that Cilium's default Pod CIDR (10.0.0.0/8) was conflicting with their Azure Firewall subnet assignments, causing traffic disruptions in their staging environment.

You will learn:
- How Cilium's default CIDR allocation can create routing conflicts with existing infrastructure
- A methodical process for debugging network issues using packet tracing, routing table analysis, and firewall logs
- The procedure for safely changing Pod CIDR ranges in production clusters

Sponsor
This episode is sponsored by Learnk8s: get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info
Find all the links and info for this episode here: https://ku.bz/kJjXQlmTw
Feb 18, 2025 • 33min

Simplifying Kubernetes deployments with a unified Helm chart, with Calin Florescu

Managing microservices in Kubernetes at scale often leads to inconsistent deployments and maintenance overhead. This episode explores a practical solution that standardizes service deployments while maintaining team autonomy.

Calin Florescu discusses how a unified Helm chart approach can help platform teams support multiple development teams efficiently while maintaining consistent standards across services.

You will learn:
- Why inconsistent Helm chart configurations across teams create maintenance challenges and slow down deployments
- How to implement a unified Helm chart that balances standardization with flexibility through override functions
- How to maintain quality through automated documentation and testing with tools like Helm Docs and Helm unittest

Sponsor
This episode is sponsored by Learnk8s: get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info
Find all the links and info for this episode here: https://ku.bz/mcPtH5395
Feb 4, 2025 • 22min

5,000 pods/second and 60% utilization with Gödel and Katalyst, with Yue Yin

Learn how ByteDance manages computing resources at scale with custom Kubernetes scheduling solutions that handle millions of pods across thousands of nodes.

Yue Yin, Software Engineer at ByteDance, discusses their open-source Gödel scheduler and Katalyst resource management system. She explains how these tools address the challenges of managing online and offline workloads in large-scale Kubernetes deployments.

You will learn:
- How Gödel's distributed architecture with dispatcher, scheduler, and binder components enables the scheduling of 5,000 pods per second
- Why NUMA-aware scheduling and two-layer architecture are crucial for handling complex workloads at scale
- How Katalyst provides node-level resource insights to enable efficient workload co-location and improve CPU utilization

Sponsor
This episode is sponsored by Learnk8s: get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info
Find all the links and info for this episode here: https://ku.bz/lMpNng_33
