Adam Ferrari, SVP of Engineering at Starburst, discusses what it takes to build a data lake analytics platform. The conversation covers the history and purpose of Starburst, the growing interest in data lakes, and the challenges of building and maintaining one. It also covers the scalability, performance, and architecture of Trino, the open-source project that forms the foundation of Starburst, and closes with the challenges of managing a data lake, including integrating with streaming services and keeping up with evolving lake formats.
Starburst is a data lake analytics platform that allows users to work with structured data at scale by leveraging the open source platform Trino.
Data lakes provide a scalable way to manage and analyze large and diverse datasets, filling the gap that opened as the volume and variety of data outgrew traditional data warehousing solutions.
Starburst, with Trino as its query engine, addresses the challenges of maintaining and organizing a data lake by providing a more complete and opinionated platform, simplifying data lake management and usability.
Deep dives
Starburst: A Data Lake Analytics Platform
Starburst is a data lake analytics platform built on the open-source query engine Trino. Adam Ferrari, the SVP of Engineering at Starburst, discusses the platform's capabilities and its use in working with structured data at scale. Starburst builds on open-source Trino to let users federate and analyze data across various sources, including object storage and structured databases. The platform offers a unified SQL interface for querying that data, so a single query can span several sources.
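To make the idea of one SQL interface over several sources concrete, here is a minimal sketch. It does not use Trino or Starburst: SQLite's `ATTACH DATABASE` stands in for two separate sources, and the names `lake`, `crm`, `orders`, and `customers` are hypothetical. In Trino the analogous pattern is addressing tables by catalog, schema, and table name within a single query.

```python
import sqlite3


def federated_revenue_by_region():
    """Join data held in two separate stores through one SQL statement."""
    # One connection plays the role of the single SQL entry point.
    conn = sqlite3.connect(":memory:")

    # Two independent in-memory databases stand in for two sources,
    # e.g. files in object storage and an operational database (assumption).
    conn.execute("ATTACH DATABASE ':memory:' AS lake")
    conn.execute("ATTACH DATABASE ':memory:' AS crm")

    conn.execute("CREATE TABLE lake.orders (customer_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE crm.customers (id INTEGER, region TEXT)")
    conn.executemany(
        "INSERT INTO lake.orders VALUES (?, ?)",
        [(1, 10.0), (2, 5.0), (1, 2.5)],
    )
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "EMEA"), (2, "APAC")],
    )

    # A single query joins across both sources by qualifying table names.
    rows = conn.execute(
        """
        SELECT c.region, SUM(o.amount) AS revenue
        FROM lake.orders AS o
        JOIN crm.customers AS c ON c.id = o.customer_id
        GROUP BY c.region
        ORDER BY c.region
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for region, revenue in federated_revenue_by_region():
        print(region, revenue)
```

The point of the sketch is only the shape of the query: the caller writes ordinary SQL and never handles the two stores separately.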
The Evolution of Data Lakes
Adam explains that the emergence of data lakes was driven by the need to handle the increasing volume and variety of data. While data warehouses were suitable for processing structured data, the rise of machine-generated and internet-scale data posed a challenge to traditional data warehousing solutions. Data lakes, with their ability to store and process structured and semi-structured data, provided a scalable solution for managing and analyzing large and diverse datasets. Technologies like Trino have bridged the gap: they expose the lake through SQL and a unified interface over various data sources, so it can be consumed much like a data warehouse.
Advantages and Challenges of Data Lakes
Data lakes offer benefits in terms of scalability, flexibility, and cost-effectiveness. They allow for the ingestion and exploration of raw data without the need for extensive upfront modeling and schema design. This agility enables organizations to quickly capture and analyze new data sources, fostering a data-driven culture. However, maintaining and organizing a data lake can present challenges. With its choose-your-own-adventure architecture, organizations must make decisions around data sourcing, formatting, and curation. Starburst aims to address these challenges by providing a more complete and opinionated platform, simplifying the management and usability of data lakes.
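The "no upfront schema" point above is often described as schema-on-read: raw records are stored as they arrive, and a schema is applied only when someone queries them. The following is a minimal sketch of that idea, not anything specific to Starburst; the sample events and the `read_with_schema` helper are hypothetical.

```python
import json

# Raw events stored as-is; fields and types vary between records.
RAW_EVENTS = [
    '{"user": "a", "ms": 120}',
    '{"user": "b", "ms": "95", "country": "DE"}',
    '{"user": "c"}',
]


def read_with_schema(lines, schema):
    """Project each raw JSON record onto `schema` at read time.

    `schema` maps field name -> cast function. Missing fields become None,
    and fields not named in the schema are ignored.
    """
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        yield row


# The schema is chosen by the reader, after ingestion.
latency_view = list(read_with_schema(RAW_EVENTS, {"user": str, "ms": int}))

if __name__ == "__main__":
    for row in latency_view:
        print(row)
```

A different consumer could read the same raw lines with a different schema (for example `{"user": str, "country": str}`), which is the flexibility, and also the curation burden, that the paragraph above describes.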
Trino: A Powerful Query Engine for Data Lakes
Trino, the query engine behind Starburst, is designed for efficient, scalable processing of data in a data lake environment. Built on a distributed architecture, Trino parallelizes and schedules tasks across a cluster, which is what makes fast query execution over large datasets possible. It abstracts the underlying storage systems, allowing users to query and join data from multiple sources, such as object storage and databases, while also applying query optimization techniques. Trino is extensible through connectors, which lets it work with different storage technologies. Its fault-tolerant execution mode and caching capabilities contribute to performance and resiliency.
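As a rough illustration of what parallelizing a query across a cluster means, the sketch below divides the input into splits, computes a partial aggregate per split in parallel, and merges the partials at the end. This is a toy model under stated assumptions, not Trino's scheduler: threads stand in for worker nodes, and `make_splits`, `partial_sum`, and `distributed_average` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def make_splits(rows, split_size):
    """Divide the input into fixed-size chunks of work."""
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]


def partial_sum(split):
    """Work done independently per split: a partial (sum, count)."""
    return sum(split), len(split)


def distributed_average(rows, split_size=3, workers=4):
    """Compute an average by merging per-split partial aggregates."""
    splits = make_splits(rows, split_size)

    # Threads stand in for worker nodes (assumption for illustration).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, splits))

    # Final merge step, done once after all partials are in.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else None


if __name__ == "__main__":
    print(distributed_average([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Carrying `(sum, count)` pairs rather than per-split averages is the design choice that matters here: partial averages cannot be merged correctly when splits have different sizes.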
Future Directions and Simplifying Data Lake Adoption
The future of data lakes lies in making them more accessible and user-friendly for organizations of all sizes. The focus is on lowering the barrier to entry and providing out-of-the-box solutions that simplify data lake adoption. This includes comprehensive access management, integration with streaming services like Kafka, and advancements in technologies like Iceberg tables. By combining performance, scalability, and ease of use, data lakes can become the go-to platform for handling complex and diverse data, facilitating data-driven decision-making and fueling innovation.
Starburst is a data lake analytics platform. It’s designed to help users work with structured data at scale, and is built on the open source platform Trino.
Adam Ferrari is the SVP of Engineering at Starburst. He joins the show to talk about Starburst, data engineering, and what it takes to build a data lake.
Full Disclosure: Starburst is a sponsor of Software Engineering Daily.
Sean’s been an academic, startup founder, and Googler. He has published works covering a wide range of topics from information visualization to quantum computing. Currently, Sean is Head of Marketing and Developer Relations at Skyflow and host of Partially Redacted, a podcast about privacy and security engineering. You can connect with Sean on Twitter @seanfalconer.