Alex Albu, tech lead for AI initiatives at Starburst, discusses integrating AI workloads with the lakehouse architecture. He shares his journey from software engineering to leading AI enhancements at Starburst. The discussion covers solutions like AI agents for data exploration and metadata enrichment, the challenges of combining AI with traditional data systems, and future plans for improved data formats and AI-driven tools.
Duration: 44:09
ANECDOTE
Alex Albu's Data Engineering Journey
Alex Albu shared how his data engineering journey began with rebuilding ETL pipelines and replacing Hadoop with Spark, achieving significant performance gains.
His experience led him to discover Starburst and eventually work there, moving from software engineer to AI initiative tech lead.
INSIGHT
AI Enhances Data Exploration
AI can enhance data exploration by using conversational interfaces connected to curated data products.
Enriching metadata with AI allows deeper insights and improves data discoverability beyond basic schema details.
INSIGHT
Metadata Crucial for AI Success
Traditional warehouses struggle with unstructured data and often lack rich metadata critical for successful AI use.
Effective AI depends more on quality metadata than on raw data access alone, highlighting a key limitation of current architectures.
Summary
In this episode of the Data Engineering Podcast, Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what "done" looks like, so you can stop fighting over column names and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today for a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week, starting June 9th.
This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across the AI, investment, HR tech, sales tech, and market intelligence industries.
A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
Your host is Tobias Macey and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads
Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?
What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?
What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?
Methods such as tool use (exemplified by MCP) are a means of bolting AI models onto systems like Trino. In what ways is that insufficient or cumbersome?
Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?
What are the foundational architectural modifications that you had to make to enable those capabilities?
For the vector storage and indexing, what modifications did you have to make to Iceberg?
What was your reasoning for not using a format like Lance?
For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?
What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?
What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?
When is Starburst/lakehouse the wrong choice for a given AI use case?
What do you have planned for the future of AI on Starburst?
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.