

Confluent Developer ft. Tim Berglund, Adi Polak & Viktor Gamov
Confluent
Hi, we’re Tim Berglund, Adi Polak, and Viktor Gamov, and we’re excited to bring you the Confluent Developer podcast (formerly “Streaming Audio”). Our hand-crafted weekly episodes feature in-depth interviews with our community of software developers (actual human beings - not AI) talking about some of the most interesting challenges they’ve faced in their careers. We aim to explore the conditions that gave rise to each person’s technical hurdles, as well as how their experiences transformed their understanding of and approach to building systems. Whether you’re a seasoned open source data streaming engineer or just someone who’s interested in learning more about Apache Kafka®, Apache Flink®, and real-time data, we hope you’ll appreciate the stories, the discussion, and our effort to bring you a high-quality show worth your time.
Episodes

Mar 10, 2022 • 45min
Why Data Mesh? ft. Ben Stopford
With experience in data infrastructure and distributed data technologies, Ben Stopford (Lead Technologist, Office of the CTO, Confluent), author of the book “Designing Event-Driven Systems,” explains the data mesh paradigm, the differences between traditional data warehouses and microservices, and how you can get started with data mesh.

Unlike standard data architecture, data mesh moves data away from a monolithic data warehouse into distributed data systems, so that data can be made available as a product. This is also one of the four principles of data mesh:
- Data ownership by domain
- Data as a product
- Data available everywhere for self-service
- Data governed wherever it is

These four principles are technology agnostic, which means they don’t restrict you to a particular programming language, to Apache Kafka®, or to any other database. Data mesh is all about building a point-to-point architecture that lets you evolve and accommodate real-time data needs, with governance tools in place.

Fundamentally, data mesh is more than a technological shift. It’s a mindset shift that requires the cultural adaptation of product thinking—treating data as a product instead of as an asset or resource. Data mesh vests ownership of data in the people who create it, with requirements that ensure quality and governance. Because data mesh consists of a map of interconnections, it’s important to have governance tools in place to identify data sources and provide data discovery capabilities.

There are many ways to implement data mesh, event streaming being one of them. You can ingest data sets from across organizations and sources into your own data system, then use stream processing to trigger an application response to the data set. By representing each data product as a data stream, you can tag it with sub-elements and secondary dimensions to make the data searchable. If you are using a managed service like Confluent Cloud for data mesh, you can visualize how data flows inside the mesh through a stream lineage graph. Ben also discusses the importance of keeping data architecture as simple as you can to avoid derivatives of data products.

EPISODE LINKS
- Data Mesh 101 course
- Data Mesh 101 with Live Walkthrough Exercise
- Introduction and Guide to Data Mesh
- The Definitive Guide to Building a Data Mesh with Event Streams
- What is Data Mesh, and How Does it Work? ft. Zhamak Dehghani
- Designing Event-Driven Systems

SEASON 2
Hosted by Tim Berglund, Adi Polak and Viktor Gamov
Produced and Edited by Noelle Gallagher, Peter Furia and Nurie Mohamed
Music by Coastal Kites
Artwork by Phil Vo

🎧 Subscribe to Confluent Developer wherever you listen to podcasts.
▶️ Subscribe on YouTube, and hit the 🔔 to catch new episodes.
👍 If you enjoyed this, please leave us a rating.
🎧 Confluent also has a podcast for tech leaders: "Life Is But A Stream" hosted by our friend, Joseph Morais.
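As a rough illustration of the “data product as a data stream” idea from this episode (not code from the show), here is a minimal Java producer sketch that publishes an event to a topic acting as a domain’s data product and tags the record with discovery metadata in headers. The topic name, header keys, and payload are illustrative assumptions rather than a prescribed data mesh API; in practice you would pair this with a schema registry and a catalog so the tags are centrally discoverable.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrdersDataProductProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The topic is the published interface of a hypothetical "orders" data product.
            ProducerRecord<String, String> event =
                new ProducerRecord<>("orders.order-placed.v1", "order-42",
                    "{\"orderId\":\"order-42\",\"amount\":99.95}");

            // Hypothetical discovery tags; a real deployment might rely on a schema
            // registry and stream catalog instead of ad hoc headers.
            event.headers().add("data-product-owner", "orders-team".getBytes());
            event.headers().add("data-product-domain", "sales".getBytes());

            producer.send(event);
        }
    }
}
```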

Mar 3, 2022 • 42min
Serverless Stream Processing with Apache Kafka ft. Bill Bejeck
What is serverless? Having worked as a software engineer for over 15 years and as a regular contributor to Kafka Streams, Bill Bejeck (Integration Architect, Confluent) is an Apache Kafka® committer and author of “Kafka Streams in Action.” In today’s episode, he explains what serverless is and the architectural concepts behind it. To clarify, serverless doesn’t mean you can run an application without a server—there are still servers in the architecture, but they are abstracted away from your application development. In other words, you can focus on building and running applications and services without any concerns over infrastructure management. Using a cloud provider such as Amazon Web Services (AWS) lets you allocate machine resources on demand while the provider handles provisioning, maintenance, and scaling of the server infrastructure.

There are a few important terms to know when implementing serverless functions with event stream processors:
- Functions as a service (FaaS)
- Stateless stream processing
- Stateful stream processing

Serverless commonly falls into the FaaS cloud computing service category—for example, AWS Lambda is the classic definition of a FaaS offering. You have a greater degree of control to run a discrete chunk of code in response to certain events, and it lets you write code to solve a specific issue or use case. Stateless processing is simpler than stateful processing, which is more complex because it involves keeping the state of an event stream and needs a key-value store. ksqlDB lets you perform both stateless and stateful processing, but its strength lies in stateful processing that answers complex questions, while AWS Lambda is better suited for stateless processing tasks. By integrating ksqlDB and AWS Lambda, you get serverless event streaming and analytics at scale.

EPISODE LINKS
- What is Serverless?
- Serverless Stream Processing with Apache Kafka, AWS Lambda, and ksqlDB
- Stateful Serverless Architectures with ksqlDB and AWS Lambda
- Serverless GitHub repository
- Kafka Streams in Action
- Watch the video version of this podcast
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Intro to Event-Driven Microservices with Confluent
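To make the stateless vs. stateful distinction from this episode concrete, here is a minimal Kafka Streams sketch (not from the show): the filter step is stateless because each record is handled on its own, while the count step is stateful because Kafka Streams keeps a key-value state store behind it. The topic names and payload format are assumptions.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class StatelessVsStateful {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments");

        // Stateless: each record is inspected independently; no state store is needed.
        KStream<String, String> large =
            payments.filter((customerId, json) -> json.contains("\"large\":true"));
        large.to("large-payments");

        // Stateful: counting per key requires Kafka Streams to maintain a state store.
        KTable<String, Long> countsPerCustomer = payments.groupByKey().count();
        countsPerCustomer.toStream()
            .to("payments-per-customer", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```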

Feb 24, 2022 • 47min
The Evolution of Apache Kafka: From In-House Infrastructure to Managed Cloud Service ft. Jay Kreps
When it comes to Apache Kafka®, there’s no one better to tell the story than Jay Kreps (Co-Founder and CEO, Confluent), one of the original creators of Kafka. In this episode, he talks about the evolution of Kafka from in-house infrastructure to a managed cloud service and discusses what’s next for infrastructure engineers who used to self-manage the workload.

Kafka started out at LinkedIn as a distributed stream processing framework and was core to their central data pipeline. At the time, the challenge was to address scalability for real-time data feeds. The social media platform’s initial data system was built on Apache™ Hadoop®, but the team later realized that operationalizing and scaling the system required a considerable amount of work. When they started re-engineering the infrastructure, Jay observed a big gap in data streaming—on one end, data was being looked at constantly for analytics, while on the other end, data was being looked at once a day—missing real-time data interconnection. This ushered in efforts to build a distributed system that connects applications, data systems, and organizations for real-time data. That goal led to the birth of Kafka and eventually a company around it—Confluent.

Over time, Confluent progressed from focusing solely on Kafka as a software product to a more holistic view—Kafka as a complete central nervous system for data, integrating connectors and stream processing with a fully managed cloud service.

Now, as organizations make a similar shift from in-house infrastructure to fully managed services, Jay outlines five guiding points to keep in mind:
- Cloud-native systems abstract away operational efforts for you without infrastructure concerns
- It’s important to have a complete ecosystem for Kafka, including connectors, a SQL layer, and data governance
- A distributed system should allow data to be accessible everywhere and across organizations
- Identifying a dependable storage infrastructure layer, such as Amazon S3, is critical
- Cost-effective models mean sustainability and systems that are easy to build around

EPISODE LINKS
- Building Real-Time Data Systems the Hard Way
- Kris Jenkins Twitter
- The Hitchhiker’s Guide to the Galaxy
- Hedonic treadmill
- Watch the video version of this podcast
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Intro to Event-Driven Microservices with Confluent

Feb 16, 2022 • 3min
What’s Next for the Streaming Audio Podcast ft. Kris Jenkins
Meet your new host of the Streaming Audio podcast: Kris Jenkins (Senior Developer Advocate, Confluent)! In this preview, Kris shares a few highlights from forthcoming episodes to look forward to, spanning topics from data mesh, cloud-native technologies, and serverless Apache Kafka®, to data modeling.

As a developer advocate, Kris is endlessly fascinated by software design, functional programming, real-time systems, and electronic music. He is a veteran software developer and engineer with a broad background, including roles as CTO of a Java/Oracle gold exchange and contract developer of several Haskell/PureScript-based event systems.

There is still a raft of data streaming narratives to tell and many community experts to feature. We’ll cover what’s new and emerging, real-life Kafka use cases, how people are currently using managed Kafka as a service, and the latest in the data streaming space. If there’s a subject you’d like to see covered on the show or if you know someone who should be featured, let us know via the Confluent Community engagement form.

EPISODE LINKS
- Get involved in the Confluent Community
- Subscribe on Apple Podcast
- Subscribe on Spotify
- Subscribe on Android
- Listen and Subscribe on PodLink
- Watch the video version of this podcast
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Intro to Event-Driven Microservices with Confluent
- Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

Feb 3, 2022 • 7min
On to the Next Chapter ft. Tim Berglund
After nearly 200 podcast episodes of Streaming Audio, Tim Berglund bids farewell in his last episode as host of the show. Tim reflects on the many great memories with guests who have appeared on the show—each memorable for its own reasons. He has covered a wide variety of topics, ranging from Apache Kafka® fundamentals, microservices, event stream processing, and use cases to cloud-native Kafka, data mesh, and more.

As Tim mentions, the Streaming Audio podcast will continue to explore all things Kafka and the cloud while featuring new voices and topics. You can subscribe to the Streaming Audio podcast on your podcast platform of choice to get the latest updates and news. Thank you for listening, and stay tuned.

EPISODE LINKS
- I Interviewed Nearly 200 Apache Kafka Experts and I Learned These 10 Things
- Watch the video version of this podcast
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Intro to Event-Driven Microservices with Confluent
- Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

Feb 1, 2022 • 30min
Intro to Event Sourcing with Apache Kafka ft. Anna McDonald
What is event sourcing, and how does it work? Event sourcing is often used interchangeably with event-driven architecture and event stream processing, but Anna McDonald (Principal Customer Success Technical Architect, Confluent) explains that it’s a specific category of its own—an event streaming pattern.

Anna is passionate about event-driven architectures and event patterns. She’s a tour de force in the Apache Kafka® community and is the presenter of the Event Sourcing and Event Storage with Apache Kafka course on Confluent Developer. In this episode, she previews the course by providing an overview of what event sourcing is and what you need to know in order to build event-driven systems.

Event sourcing is an architectural design pattern that defines an approach to handling data operations driven by a sequence of events. The pattern ensures that all changes to an application’s state are captured and stored as an immutable sequence of events, known as a log of events. The events are persisted in an event store, which acts as the system of record. Unlike a traditional database where only the latest status is saved, an event-based system saves every event into the database in sequential order. If you find that a past event is incorrect, you can replay each event from a certain timestamp up to the present to recreate the latest state of the data.

Event sourcing is commonly implemented with a command query responsibility segregation (CQRS) system to perform data computation tasks in response to events. To implement CQRS with Kafka, you can use Kafka Connect along with a database, or alternatively use Kafka with the streaming database ksqlDB.

Anna also shares about:
- Data at rest and data in motion techniques for event modeling
- The differences between event streaming and event sourcing
- How CQRS, change data capture (CDC), and event streaming help you leverage event-driven systems
- The primary qualities and advantages of an event-based storage system
- Use cases for event sourcing and how it integrates with your systems

EPISODE LINKS
- Event Sourcing course
- Event Streaming in 3 Minutes
- Introducing Derivative Event Sourcing
- Meetup: Event Sourcing and Apache Kafka
- Watch the video version of this podcast
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Intro to Event-Driven Microservices with Confluent
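As a minimal sketch of the event sourcing idea from this episode—the log of events is the system of record, and current state is rebuilt by replaying it—here is a plain Java consumer that replays a hypothetical single-partition `account-events` topic from the beginning and recomputes account balances. The topic, the single-partition assumption, and the event format are illustrative, not the architecture discussed in the episode.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class AccountBalanceProjection {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "balance-projection");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        Map<String, Long> balances = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // The event log is the system of record: replay it from the beginning
            // to rebuild the current state of every account.
            TopicPartition log = new TopicPartition("account-events", 0);
            consumer.assign(List.of(log));
            consumer.seekToBeginning(List.of(log));

            // A single poll keeps the sketch short; a real rebuild would poll until caught up.
            ConsumerRecords<String, String> events = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> event : events) {
                // Hypothetical event format: "DEPOSITED:25" or "WITHDRAWN:10".
                String[] parts = event.value().split(":");
                long delta = "DEPOSITED".equals(parts[0]) ? Long.parseLong(parts[1])
                                                          : -Long.parseLong(parts[1]);
                balances.merge(event.key(), delta, Long::sum);
            }
        }
        balances.forEach((account, balance) -> System.out.println(account + " -> " + balance));
    }
}
```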

Jan 27, 2022 • 31min
Expanding Apache Kafka Multi-Tenancy for Cloud-Native Systems ft. Anna Povzner and Anastasia Vela
In an effort to make Apache Kafka® cloud native, Anna Povzner (Principal Engineer, Confluent) and Anastasia Vela (Software Engineer I, Confluent) have been working to expand multi-tenancy to cloud-native systems with automated capacity planning and scaling in Confluent Cloud. They explain how cloud-native data systems differ from legacy databases and share the technical requirements needed to create multi-tenancy for managed Kafka as a service.

As a distributed system, Kafka is designed to support multi-tenant systems by:
- Isolating data with authentication, authorization, and encryption
- Isolating user namespaces
- Isolating performance with quotas

Traditionally, Kafka’s multi-tenant capabilities are used in on-premises data centers to make data available and accessible across the company—a single company would run a multi-tenant Kafka cluster with all its workloads to stream data across organizations. Some of the processes behind setting up multi-tenant Kafka clusters are manual, and resources and capacity must be over-provisioned to protect the cluster from unplanned demand increases. When Kafka runs on cloud instances, you can scale cloud resources on the fly for unplanned workloads and meet expectations instantaneously.

To shift multi-tenancy to the cloud, Anna and Anastasia identify the following as essential for the architectural design:
- Abstractions: requires minimal operational complexity of a cloud service
- Pay-per-use model: requires the system to use only the minimum required resources until more is necessary
- Uptime and performance SLAs/SLOs: requires support for unknown and unpredictable workloads with minimal operational work, while protecting the cluster from distributed denial-of-service (DDoS) attacks
- Cost efficiency: requires a lower cost of ownership

You can also read more about the shift from on-premises to cloud-native, multi-tenant services in Anna and Anastasia’s publication on the Confluent blog.

EPISODE LINKS
- From On-Prem to Cloud-Native: Multi-Tenancy in Confluent Cloud
- Cloud-Native Apache Kafka
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Intro to Event-Driven Microservices with Confluent
- Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
- Watch the video version of this podcast
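As a small sketch of the “isolating performance with quotas” point from this episode (assuming a self-managed Kafka cluster rather than Confluent Cloud), Kafka’s Admin API can set per-principal byte-rate quotas so one tenant cannot starve the others. The tenant name and the specific limits are made-up values.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class TenantQuotas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Quotas apply to the principal "tenant-a"; the entity could also be
            // scoped by client ID or by a (user, client ID) pair.
            ClientQuotaEntity tenantA =
                new ClientQuotaEntity(Map.of(ClientQuotaEntity.USER, "tenant-a"));

            // Cap tenant-a at roughly 1 MB/s produce and 2 MB/s fetch throughput.
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(tenantA, List.of(
                new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0),
                new ClientQuotaAlteration.Op("consumer_byte_rate", 2_097_152.0)));

            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}
```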

Jan 24, 2022 • 5min
Apache Kafka 3.1 - Overview of Latest Features, Updates, and KIPs
Apache Kafka® 3.1 is here with exciting new features and improvements! On behalf of the Kafka community, Danica Fine (Senior Developer Advocate, Confluent) shares release highlights that you won’t want to miss, including foreign-key joins in Kafka Streams and improvements that will provide consistency for Kafka latency metrics.

KAFKA-13439 deprecates the eager rebalancing protocol; the cooperative protocol has been the default since Kafka 2.4, and it’s advised to upgrade your applications since the eager protocol will no longer be supported in future releases.

Previously, foreign-key joins in Kafka Streams worked only if both the primary and foreign-key tables used the default partitioner. This release adds support for foreign-key joins on tables with custom partitioners, which are passed in as part of a new `TableJoined` object, comparable to the existing `Joined` and `StreamJoined` objects.

With the goal of making Kafka more intuitive, KIP-773 enhances naming consistency for three new client metrics with millis and nanos. For example, `io-waittime-total` is reintroduced as `io-wait-time-ns-total`. The previously introduced metrics without the ns unit are deprecated but remain available for backward compatibility.

KIP-768 continues the work started in KIP-255 to implement the necessary interfaces for a production-grade way to connect to an OpenID identity provider for authentication and token retrieval. This update provides an out-of-the-box implementation of an `AuthenticateCallbackHandler` that can be used to communicate with OAuth/OIDC.

Additionally, this Kafka release introduces two new broker-count metrics, `ActiveBrokerCount` and `FencedBrokerCount`, which expose the number of active brokers and the number of fenced brokers known by the controller.

Tune in to learn more about the Apache Kafka 3.1 release!

EPISODE LINKS
- Apache Kafka 3.1 release notes
- Read the blog to learn more
- Download Apache Kafka 3.1
- Watch the video version of this podcast
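As a rough Kafka Streams 3.1 sketch of the foreign-key join improvement described above, the custom partitioners are supplied through the new `TableJoined` object. The topic names, value formats, and partitioning logic are illustrative assumptions, not code from the release.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TableJoined;
import org.apache.kafka.streams.processor.StreamPartitioner;

public class ForeignKeyJoinSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // orders keyed by orderId with values of the form "customerId|amount";
        // customers keyed by customerId.
        KTable<String, String> orders =
            builder.table("orders", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> customers =
            builder.table("customers", Consumed.with(Serdes.String(), Serdes.String()));

        // Custom partitioners for topics that weren't written with the default partitioner.
        StreamPartitioner<String, Void> orderPartitioner =
            (topic, orderId, value, numPartitions) -> Math.abs(orderId.hashCode()) % numPartitions;
        StreamPartitioner<String, Void> customerPartitioner =
            (topic, customerId, value, numPartitions) -> Math.abs(customerId.hashCode()) % numPartitions;

        // New in 3.1: pass the partitioners via TableJoined so the foreign-key join
        // routes its internal subscription and response records to the right partitions.
        KTable<String, String> enrichedOrders = orders.join(
            customers,
            orderValue -> orderValue.split("\\|")[0],              // extract the foreign key (customerId)
            (orderValue, customerValue) -> orderValue + " / " + customerValue,
            TableJoined.with(orderPartitioner, customerPartitioner));

        enrichedOrders.toStream()
            .to("orders-enriched", Produced.with(Serdes.String(), Serdes.String()));

        builder.build(); // the topology would be handed to KafkaStreams in a real application
    }
}
```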

Jan 20, 2022 • 31min
Optimizing Cloud-Native Apache Kafka Performance ft. Alok Nikhil and Adithya Chandra
Maximizing cloud Apache Kafka® performance isn’t just about running data processes on cloud instances—there is a lot of engineering work required to set and maintain a high-performance standard for speed and availability. Alok Nikhil (Senior Software Engineer, Confluent) and Adithya Chandra (Staff Software Engineer II, Confluent) share how they optimize Kafka on Confluent Cloud, along with the three guiding principles they follow, whether you are self-managing Kafka or working on a cloud-native system:
- Know your users and plan for their workloads
- Infrastructure matters for performance as well as cost efficiency
- Effective observability—you can’t improve what you don’t see

A large part of setting and achieving performance standards is understanding that workloads vary and come with unique requirements. There are different dimensions of performance, such as the number of partitions and the number of connections. Alok and Adithya suggest starting by identifying the workload patterns that matter most to your business objectives, simulating and reproducing them, and using the results to optimize the software.

When identifying workloads, it’s essential to determine the infrastructure you’ll need to support a given workload economically. Infrastructure optimization is as important as performance optimization. It’s best practice to know the infrastructure available to you, choose the appropriate hardware, operating system, and JVM, and allocate processes so that workloads run efficiently.

With the necessary infrastructure patterns in place, it’s crucial to monitor metrics to ensure that your application runs as expected, consistently, with every release. Having the right observability metrics and logs allows you to identify and troubleshoot issues relatively quickly. Profiling and request sampling also help you dive deeper into performance issues, particularly during incidents. Alok and Adithya’s team uses tooling such as async-profiler for profiling CPU cycles, heap allocations, and lock contention.

Alok and Adithya summarize the learnings and processes they used to optimize managed Kafka as a service, which you can apply to your own cloud-native applications. You can also read more about their journey on the Confluent blog.

EPISODE LINKS
- Speed, Scale, Storage: Our Journey from Apache Kafka to Performance in Confluent Cloud
- Cloud-Native Apache Kafka
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Intro to Event-Driven Microservices with Confluent
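As a small illustration of the observability principle from this episode (not tooling discussed on the show), the Kafka Java clients already expose their internal metrics programmatically, which you can log or forward to your monitoring system. The producer configuration and the request-latency filter are assumptions for the sketch.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Map;
import java.util.Properties;

public class ProducerMetricsDump {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Every Kafka client keeps latency, throughput, and queueing metrics;
            // here we only print the request-latency ones as a starting point.
            for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
                if (entry.getKey().name().contains("request-latency")) {
                    System.out.printf("%s = %s%n", entry.getKey().name(), entry.getValue().metricValue());
                }
            }
        }
    }
}
```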

Jan 13, 2022 • 30min
From Batch to Real-Time: Tips for Streaming Data Pipelines with Apache Kafka ft. Danica Fine
Implementing an event-driven data pipeline can be challenging, but doing so within the context of a legacy architecture is even more complex. Having spent three years building a streaming data infrastructure and being on the first team at a financial organization to implement Apache Kafka® event-driven data pipelines, Danica Fine (Senior Developer Advocate, Confluent) shares the development process and how ksqlDB and Kafka Connect became instrumental to the implementation.

By moving away from batch processing to streaming data pipelines with Kafka, data can be distributed with increased scalability and resiliency. Kafka decouples the source from the target systems, so you can react to data as it changes while ensuring accurate data in the target system. In order to transition from monolithic micro-batching applications to real-time microservices that can integrate with a legacy system that has been around for decades, Danica and her team started developing Kafka connectors to connect to various source and target systems.
- Kafka connectors: Building two major connectors for the data pipeline—a source connector to stream data from the legacy data source into Kafka, and a target connector to pipe data from Kafka back into the legacy architecture.
- Algorithm: Implementing Kafka Streams applications to migrate data from a monolithic architecture to a stream processing architecture.
- Data join: Leveraging Kafka Connect and the JDBC source connector to bring in all data streams and complete the pipeline.
- Streams join: Using ksqlDB to join streams—the legacy data system continues to produce streams while the Kafka data pipeline is another stream of data.

As a final tip, Danica suggests breaking algorithms into process steps. She also describes how her experience relates to the data pipelines course on Confluent Developer and encourages anyone who is interested in learning more to check it out.

EPISODE LINKS
- Data Pipelines course
- Introduction to Streaming Data Pipelines with Apache Kafka and ksqlDB
- Guided Exercise on Building Streaming Data Pipelines
- Migrating from a Legacy System to Kafka Streams
- Watch the video version of this podcast
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Intro to Event-Driven Microservices with Confluent
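The streams join in this episode is done in ksqlDB; as an analogous minimal sketch using the Kafka Streams Java API instead (topic names, keys, and the join window are assumptions), joining the legacy system’s stream with the new pipeline’s stream might look roughly like this:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

import java.time.Duration;

public class LegacyAndPipelineJoin {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Both streams are keyed by the same account ID so their records can be joined.
        KStream<String, String> legacy =
            builder.stream("legacy-account-updates", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> pipeline =
            builder.stream("pipeline-account-updates", Consumed.with(Serdes.String(), Serdes.String()));

        // Windowed join: pair up records for the same key that arrive within 5 minutes
        // of each other, producing a combined record for the downstream target system.
        KStream<String, String> joined = legacy.join(
            pipeline,
            (legacyValue, pipelineValue) -> legacyValue + " | " + pipelineValue,
            JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)),
            StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        joined.to("reconciled-account-updates", Produced.with(Serdes.String(), Serdes.String()));

        builder.build(); // the topology would be handed to KafkaStreams in a real application
    }
}
```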


