175: The Parts, Pieces, and Future of Composable Data Systems, Featuring Wes McKinney, Pedro Pedreira, Chris Riccomini, and Ryan Blue
Jan 31, 2024
Data systems experts Wes McKinney, Pedro Pedreira, Chris Riccomini, and Ryan Blue discuss composable data systems: the challenges and incentives for composable components, specialization and modularity in data workloads, and the efficiency gains from common layers in data management systems. They also explore how data system composability has evolved, exciting new projects in the space, and the difficulty of standardizing APIs.
Composable data stacks allow for efficient interoperability through common open standards, enabling developers to assemble different components without excessive custom code.
Adopting composability and open standards in data management systems can be initially complex and costly, but the long-term benefits outweigh the challenges.
The development of composable data systems requires trade-offs, such as increased complexity and engineering effort, but it offers efficiency, reusability, and improved modularity.
Deep dives
Composable data stacks defined by common open source standards
A composable data stack refers to a project or collection of projects that address a data processing need. The key characteristic is that the components within the stack are built using common open source standards, allowing for efficient interoperability. This means that developers can assemble different pieces of the stack without the need for excessive custom code or glue to connect them. The interoperability is based on well-defined open standards that are agreed upon and shared among the component systems.
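As a concrete illustration of "interoperability without glue code," consider a toy sketch (hypothetical structures, not any specific library's API) in which a storage component and a compute component from different projects exchange the same agreed-upon columnar batch format, so neither needs conversion code for the other:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class ColumnarBatch:
    """A shared in-memory format both components agree on (a stand-in for
    something like Arrow's columnar format)."""
    columns: Dict[str, List[Any]]

    def num_rows(self) -> int:
        return len(next(iter(self.columns.values()), []))

# Component 1: a "storage engine" that produces batches in the common format.
def scan_orders() -> ColumnarBatch:
    return ColumnarBatch(columns={
        "order_id": [1, 2, 3, 4],
        "amount":   [10.0, 250.0, 99.5, 400.0],
    })

# Component 2: a "compute engine" from a different project that consumes the
# same format directly -- no per-pair conversion layer required.
def filter_large(batch: ColumnarBatch, min_amount: float) -> ColumnarBatch:
    keep = [i for i, a in enumerate(batch.columns["amount"]) if a >= min_amount]
    return ColumnarBatch(columns={
        name: [col[i] for i in keep] for name, col in batch.columns.items()
    })

result = filter_large(scan_orders(), 100.0)
```

Because both sides target the shared format rather than each other, adding a third component means implementing the standard once, not writing a converter for every existing pair.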
Challenges in adopting composability and embracing open standards
The adoption of composability and open standards in data management systems is not without its challenges. One challenge is the initial investment required to build composable systems, which can be higher than for a monolithic system. There is also a bias among developers toward building their own solutions rather than reusing existing components, whether from a belief that they can do it better or simply a preference for writing their own code. Furthermore, competing commercial interests and the desire to control proprietary solutions often hinder the widespread adoption of open standards. Even so, it is becoming increasingly apparent that the long-term benefits of composability outweigh the initial complexities and costs.
Trade-offs in composable data systems and the need to focus on storage, data models, and APIs
Composability in data systems comes with its own set of trade-offs. One of the trade-offs is the increased complexity and engineering effort required to build composable systems compared to monolithic ones. The need to coordinate and integrate various components and adhere to open standards can slow down development initially, but it provides long-term benefits in terms of efficiency and reusability. Another trade-off is the challenge of defining and aligning data models, storage formats, and APIs across different systems. While there are existing standards and projects, ensuring compatibility and widespread adoption can be a complex task. However, as the industry recognizes the value of composability, efforts are being made to standardize data models, storage formats, and APIs at different layers of the data stack, from storage and processing to language interfaces.
Advancements in Open Standards and Composable Data Systems
The podcast episode discusses advancements in open standards and the development of composable data systems. One key area of focus is the concept of an intermediate representation (IR), borrowed from compilers: an intermediate data structure that represents computations and allows them to be executed without ambiguity. The episode highlights the importance of a standardized IR in query engines to enable the exchange of query plans and improve modularity. Another significant point is the need for a coherent standard for data types across different components of the data stack, such as Iceberg and Arrow, to improve compatibility and ease of implementation. The conversation also touches on virtualization technologies and their potential to change how data systems are built, as well as the need for open standards for file formats and for the language APIs used to interact with data systems.
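To make the IR idea concrete, here is a minimal hypothetical sketch (loosely in the spirit of projects like Substrait, but not their actual format): plan nodes are plain data describing a computation, so a plan can be built by one system and executed unambiguously by any engine that understands the IR.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# IR nodes are pure data: they describe a computation without performing it.
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    child: Any
    predicate: Callable[[Dict[str, Any]], bool]  # real IRs encode expressions as data too

@dataclass
class Project:
    child: Any
    columns: List[str]

# Any engine that understands the IR can execute the same plan.
def execute(plan, catalog: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    if isinstance(plan, Scan):
        return catalog[plan.table]
    if isinstance(plan, Filter):
        return [row for row in execute(plan.child, catalog) if plan.predicate(row)]
    if isinstance(plan, Project):
        return [{c: row[c] for c in plan.columns} for row in execute(plan.child, catalog)]
    raise ValueError(f"unknown IR node: {plan!r}")

catalog = {"users": [{"id": 1, "age": 17, "name": "a"},
                     {"id": 2, "age": 30, "name": "b"}]}
plan = Project(Filter(Scan("users"), lambda r: r["age"] >= 18), ["id", "name"])
rows = execute(plan, catalog)
```

A real standardized IR would also encode the predicate as data (an expression tree) rather than a language-level function, which is what makes plans serializable and exchangeable between engines written in different languages.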
Challenges and Opportunities in Optimizers and Access Policies
The second part of the podcast focuses on challenges and opportunities in optimizers and access policies within data systems. The discussion acknowledges the complexity of building an optimizer that can be shared across different engines, but highlights the potential for rule-based optimizations and the possibility of standardizing APIs for defining physical capabilities and cost-based optimization. The topic of access policies is also explored, with a suggestion to move away from sharing policies and instead focus on sharing policy decisions. The podcast emphasizes the need for standardized ways to exchange policy decisions, such as defining user permissions at the table level and handling context-specific access rules. The conversation ends with the recognition that further exploration and investment in these areas can lead to more efficient and user-friendly data systems.
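The "share decisions, not policies" idea can be sketched as follows (hypothetical structures, not an existing standard): the governance system evaluates the rules in context and hands the engine an already-made decision, which the engine only has to enforce, without ever understanding the policy language itself.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass(frozen=True)
class PolicyDecision:
    """What a governance system hands to an engine: the outcome, not the rules."""
    user: str
    table: str
    allowed: bool
    visible_columns: Optional[List[str]] = None  # None means all columns

# Hypothetical policy service: the context-specific rules stay inside it.
def decide(user: str, table: str) -> PolicyDecision:
    if user == "analyst" and table == "orders":
        return PolicyDecision(user, table, allowed=True,
                              visible_columns=["order_id", "amount"])
    return PolicyDecision(user, table, allowed=False)

# Any engine can enforce a decision without knowing how it was reached.
def enforce(decision: PolicyDecision,
            rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    if not decision.allowed:
        raise PermissionError(f"{decision.user} may not read {decision.table}")
    if decision.visible_columns is None:
        return rows
    return [{c: row[c] for c in decision.visible_columns} for row in rows]

rows = [{"order_id": 1, "amount": 10.0, "customer_email": "x@example.com"}]
visible = enforce(decide("analyst", "orders"), rows)
```

The exchange format is just the `PolicyDecision`, so engines and policy systems can evolve independently as long as both sides agree on that small contract.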
Challenges and incentives for composable components (10:37)
Specialization and modularity in data workloads (13:05)
Organic evolution of composable systems (17:50)
Efficiency and common layers in data management systems (22:09)
The IR and Data Computation (23:00)
Components of the Storage Layer (26:16)
Decoupling Language and Execution (29:42)
Apache Calcite and Modular Frontend (36:46)
Data Types and Coercion (39:27)
Describing Data Sets and Schema (42:00)
Open Standards and Frontiers (46:22)
Challenges of standardizing APIs (48:15)
Trade-offs in building composable systems (54:04)
Evolution of data system composability (56:32)
Exciting new projects in data systems (1:01:57)
Final thoughts and takeaways (1:17:25)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.