
Data Engineering Podcast

Troubleshooting Kafka In Production

Dec 24, 2023
Elad Eldor, author of 'Kafka: Troubleshooting in Production', discusses the challenges of operating Kafka at scale and ways to mitigate potential issues. Topics include the importance of Kafka in the data pipeline, doubling retention in Kafka, managed vs. self-managed Kafka clusters, data lake complexity, monitoring for Kafka, troubleshooting unreplicated partitions, the cost of running Kafka in the cloud, and the need for a correlation tool.
01:14:44


Podcast summary created with Snipd AI

Quick takeaways

  • Understanding the components of Kafka (data, OS, Kafka itself) is crucial for troubleshooting and maintaining cluster health.
  • On-prem and cloud deployments of Kafka have different considerations, including hardware failure management, scalability, and operational complexity.

Deep dives

Understanding the Three Legs of Kafka: Data, OS, and Kafka

The book emphasizes the importance of understanding the three components of Kafka: the data, the operating system (OS), and Kafka itself. The data section highlights how data is spread among partitions and its impact on cluster health. The OS section explores the critical role of monitoring disk utilization and reading the output of tools like iostat to detect production issues. The Kafka metrics section delves into producer and consumer metrics for efficient data transfer. The book provides insights into common problems, such as configuring retention policies, handling storage usage, reducing costs, and optimizing cluster scalability.
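
As a rough illustration of the point about data spread, the sketch below computes a simple skew ratio: the size of the largest partition divided by the mean partition size. This metric and the function name `partition_skew` are assumptions chosen for illustration, not something taken from the book or the episode, and the byte counts are made up.

```python
from statistics import mean


def partition_skew(bytes_per_partition: dict[int, int]) -> float:
    """Return the largest partition size divided by the mean partition size.

    A value near 1.0 suggests an even spread; larger values suggest that
    one partition holds a disproportionate share of the data.
    """
    sizes = list(bytes_per_partition.values())
    if not sizes:
        raise ValueError("no partitions given")
    avg = mean(sizes)
    if avg == 0:
        # All partitions are empty, so treat the spread as even.
        return 1.0
    return max(sizes) / avg


# Hypothetical per-partition sizes in bytes, keyed by partition id.
example_sizes = {0: 100, 1: 100, 2: 400}
skew = partition_skew(example_sizes)
```

In practice the per-partition sizes would come from whatever monitoring is already in place; what ratio counts as unhealthy depends on the cluster and is not specified here.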
