How Denormalized is Building ‘DuckDB for Streaming’ with Apache DataFusion
Sep 13, 2024
Amey Chaugule and Matt Green, co-founders of Denormalized, share their extensive engineering backgrounds from top tech firms. They discuss the creation of an embedded stream processing engine designed to simplify real-time data workloads. The duo tackles challenges in existing systems like Spark and Kafka, emphasizing developer experience and state management. They also compare DuckDB and SQLite in the context of streaming data, highlighting the future of user-friendly data tools and the importance of fault tolerance in modern applications.
Denormalized is developing an embedded stream processing engine that simplifies real-time data workloads by leveraging Apache DataFusion's single-node capabilities.
The difficulty of achieving fault tolerance in streaming systems often leads practitioners to skip checkpointing altogether, a risky trade-off for applications that depend on continuous data processing.
Improving developer experience through accessible interfaces, such as TypeScript, is key to attracting a wider audience to modern streaming systems.
Deep dives
Evolution of Streaming Systems
Streaming systems have evolved significantly, with a succession of frameworks emerging to handle stream processing workloads. Amey and Matt of Denormalized have worked on several of these platforms, most notably at Uber, where they operated massive Kafka deployments. Those experiences exposed the complexities of real-time data processing and the difficulty of achieving fault tolerance, and led them to question whether traditional assumptions about streaming systems, particularly around fault tolerance, still align with how teams actually run them today.
Challenges of Fault Tolerance
Achieving fault tolerance in streaming systems is notably difficult, often requiring complex consensus mechanisms and checkpointing processes. The conversation reveals that many practitioners opt to run systems without checkpointing due to the back pressure introduced during that process. This raises challenges for maintaining continuous data processing, especially in critical applications requiring real-time results. By simplifying the architecture around fault tolerance, teams could streamline operations and reduce the likelihood of failure, particularly when leveraging single-node systems.
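To make that trade-off concrete, here is a minimal, purely illustrative Rust sketch of periodic state checkpointing in a single-node consumer loop; the event stream, state shape, and checkpoint format are hypothetical and are not Denormalized's API:

```rust
use std::collections::HashMap;
use std::fs;
use std::time::{Duration, Instant};

// Illustrative only: a single-node event counter that periodically snapshots
// its in-memory state to disk. Real engines coordinate these snapshots with
// source offsets so processing can resume exactly where it left off.
fn main() -> std::io::Result<()> {
    let mut counts: HashMap<String, u64> = HashMap::new();
    let checkpoint_interval = Duration::from_secs(10);
    let mut last_checkpoint = Instant::now();

    // Stand-in for an unbounded stream of events from Kafka or a socket.
    for event in ["click", "view", "click"] {
        *counts.entry(event.to_string()).or_insert(0) += 1;

        // Periodically persist state; while the snapshot is being written,
        // upstream consumption stalls, which is the back pressure that pushes
        // some teams to disable checkpointing entirely.
        if last_checkpoint.elapsed() >= checkpoint_interval {
            fs::write("checkpoint.txt", format!("{counts:?}"))?;
            last_checkpoint = Instant::now();
        }
    }
    Ok(())
}
```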
The Need for Simplicity in Stream Processing
The current landscape of stream processing is dominated by intricate distributed systems that many users find cumbersome to operate. Denormalized aims to simplify stream processing by building an embedded engine designed to run efficiently on a single node, since a significant portion of streaming workloads does not actually require the complexity of a distributed system. Simplifying the user experience while retaining performance could lead to wider adoption across diverse engineering teams.
Strengthening Developer Experience
Improving the developer experience of stream processing systems is critical for attracting a broader audience, especially as new roles like AI engineer emerge. Denormalized is considering a range of interfaces for its technology, including TypeScript bindings for better accessibility and usability. Better interfaces would let the company serve application developers who build user-facing features, bridging the gap between traditional streaming systems and modern application development and making streaming applications easier to implement and operate.
DataFusion's Role in Innovation
DataFusion is positioned as a versatile query engine that plays a crucial role in the development of new data systems. By enabling rapid prototyping and deep extensibility, it lets companies like Denormalized innovate quickly without building core query infrastructure from scratch. That flexibility makes it practical to experiment with new streaming designs without extensive overhead, and as the DataFusion community grows, it creates opportunities to expand capabilities and collaborate across projects in the emerging data landscape.
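As a rough illustration of that prototyping speed, the sketch below, which assumes the datafusion and tokio crates and a local events.csv file, registers a CSV source and runs a SQL aggregation in a handful of lines of Rust:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

// Minimal DataFusion example: register a CSV file as a table and query it
// with SQL. The file name and column names are assumptions for illustration.
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("events", "events.csv", CsvReadOptions::new())
        .await?;

    let df = ctx
        .sql("SELECT user_id, COUNT(*) AS events FROM events GROUP BY user_id")
        .await?;
    df.show().await?; // print the aggregated results to stdout
    Ok(())
}
```

The same extension points, such as custom table providers, user-defined functions, and custom physical operators, are what allow a streaming engine to reuse DataFusion's planner and execution model rather than building them from scratch.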
In this episode, Kostas and Nitay are joined by Amey Chaugule and Matt Green, co-founders of Denormalized. They delve into how Denormalized is building an embedded stream processing engine—think “DuckDB for streaming”—to simplify real-time data workloads. Drawing on their backgrounds at companies like Uber, Lyft, Stripe, and Coinbase, Amey and Matt discuss the challenges of existing stream processing systems such as Spark, Flink, and Kafka, and explain how their approach leverages Apache DataFusion to create a single-node solution that reduces the complexities inherent in distributed systems.
The conversation explores topics such as developer experience, fault tolerance, state management, and the future of stream processing interfaces. Whether you’re a data engineer, application developer, or simply interested in the evolution of real-time data infrastructure, this episode offers valuable insights into making stream processing more accessible and efficient.
Chapters
00:00 Introduction and Background
12:03 Building an Embedded Stream Processing Engine
18:39 The Need for Stream Processing in the Current Landscape
22:45 Interfaces for Interacting with Stream Processing Systems
26:58 The Target Persona for Stream Processing Systems
31:23 Simplifying Stream Processing Workloads and State Management
34:50 State and Buffer Management
37:03 Distributed Computing vs. Single-Node Systems
42:28 Cost Savings with Single-Node Systems
47:04 The Power and Extensibility of DataFusion
55:26 Integrating Data Store with DataFusion
57:02 The Future of Streaming Systems