The Data Engineering Show

The Firebolt Data Bros

The Data Engineering Show is a podcast for data engineering and BI practitioners to go beyond theory. Learn from the biggest influencers in tech about their practical day-to-day data challenges and solutions in a casual and fun setting.

SEASON 1 DATA BROS
Eldad and Boaz Farkash shared the same stuffed toys growing up as well as a big passion for data. After founding Sisense and building it to become a high-growth analytics unicorn, they moved on to their next venture, Firebolt, a leading high-performance cloud data warehouse.

SEASON 2 DATA BROS
In season 2, Eldad adopted a brilliant new little brother, and with their shared love for query processing, the connection was immediate. After excelling in his MS in Computer Science, Benjamin Wagner joined Firebolt to lead its query processing team and is a rising star in the data space.

For inquiries contact tamar@firebolt.io
Website: https://www.firebolt.io

Episodes

Dec 16, 2025 • 26min

The $100M Problem: How Lyft's Data Platform Prevents ML Failures with Ritesh Varyani at Lyft

In this episode of the Data Engineering Show, host Benjamin Wagner sits down with Ritesh Varyani, Staff Software Engineer at Lyft, to explore how the company manages a sophisticated multi-engine data stack serving thousands of engineers while integrating AI across infrastructure and user-facing analytics.

What You'll Learn:

• How to architect a polyglot data platform that serves fundamentally different workloads: Spark for ML training and massive parallel processing, Trino for dashboarding and medium-scale ETL, and ClickHouse for sub-second OLAP queries, all without creating operational chaos
• Why unification matters more than expansion: Lyft's 2026 strategy prioritizes consolidating and simplifying the data stack rather than adding new tools, reducing maintenance burden and improving reliability for end users
• The dual-layer AI strategy that simultaneously enhances user analytics (semantic layer v2 with AI-native support) while automating platform operations (intelligent job failure diagnosis, adaptive resource allocation, and agentic workflow optimization)
• How to fund innovation from the bottom up: Lyft's model encourages individual engineers to experiment with AI on their own time, prove business value through POCs, and secure leadership buy-in through demonstrated alignment with company strategy
• Why vendor selection now includes AI explainability and debuggability as standard RFP requirements, even when AI isn't the primary driver of a purchasing decision
• The framework for deciding open-source investment vs. managed services: prioritize business-critical goals first, then determine whether in-house ownership or vendor solutions accelerate that mission. AI becomes the accelerant, not the decision driver

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

About the Guest(s)

Ritesh is a Staff Software Engineer at Lyft, bringing six years of experience architecting and scaling the company's data platform. With a background spanning Microsoft's data and cloud infrastructure, including work on Hadoop, Azure, and SaaS products, Ritesh leads Lyft's critical data systems including Trino, Spark, and ClickHouse. In this episode, Ritesh shares insights on building scalable, AI-native data platforms that serve diverse organizational needs, from batch processing and analytics to real-time marketplace operations. His strategic approach to unifying complex data stacks while integrating AI-driven reliability and user experience improvements provides actionable guidance for data engineers and platform leaders navigating infrastructure modernization at scale.

Quotes

"The goal of our platform is to give our users access to the data as fast as possible so that they can drive the meaning from the data that they are getting and take better data driven decisions." - Ritesh

"We are a Hive format shop. We are going to be moving to other open table formats in the future, but at this point, we are a hive table format." - Ritesh

"Our main goal at this point is primarily understanding how we see the data platform running five years from now, three years from now, and how we are able to future proof it." - Ritesh

"In this world of AI, we should not be falling behind in any way, and bringing AI in the right places within our platform." - Ritesh

"We want to make our semantic layer ready for the AI native side of things so that our teams are able to drive the best meaning possible from the data that they see." - Ritesh

"Big data systems are distributed systems by nature, and where AI can help you is very clearly understand how the patterns are changing and what is a good action to take." - Ritesh

"Rather than thinking of this as an AI versus an open source thing, it's about a question of what work is the most business critical and how do you go 100% behind it." - Ritesh

"Not everybody is working on AI initiatives at this point, but where it makes sense according to our business strategy, if it aligns with it, then obviously we go and invest." - Ritesh

"If you are the one who's going to take on the initiative, probably spend a few hours outside of what you're already working on, and that is how you will discover AI and the tooling for it." - Ritesh

"We are trying to consolidate into a single direction of providing different kinds of models so that you are easily able to integrate and focus on the value you want to provide to your customers." - Ritesh

Resources

Connect on LinkedIn:
• Ritesh Varyani - https://www.linkedin.com/in/riteshvaryani/
• Benjamin Wagner - https://www.linkedin.com/in/wagjamin/
• Eldad Farkash - https://www.linkedin.com/in/eldadfarkash/

Websites:
• Lyft - https://www.lyft.com

Tools & Platforms:
• Apache Spark - Batch processing engine for ML training jobs, large-scale data processing, and GDPR operations
• Trino - Query engine for BI dashboarding, ETL workflows, and SQL-based data access
• ClickHouse - Columnar database for sub-second query latency and real-time analytics
• Amazon S3 - Data lake storage for Parquet tables and offline data processing
• AWS EKS (Elastic Kubernetes Service) - Kubernetes infrastructure for hosting Spark and Trino
• ClickHouse Cloud - Managed ClickHouse offering used by Lyft
• Hive Table Format - Current table format for organizing Parquet files in S3
• Kubernetes Operators - Infrastructure for managing ClickHouse deployments

The Data Engineering Show is brought to you by firebolt.io and handcrafted by our friends over at: fame.so

Previous guests include: Joseph Machado of LinkedIn, Matthew Weingarten of Disney, Joe Reis and Matt Housley, authors of The Fundamentals of Data Engineering, Zach Wilson of Eczachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb and Xiaoxu Gao of Adyen.

Check out our three most downloaded episodes:
• Zach Wilson on What Makes a Great Data Engineer
• Joe Reis and Matt Housley on The Fundamentals of Data Engineering
• Bill Inmon, The Godfather of Data Warehousing

Nov 19, 2025 • 20min

60 Billion Predictions Daily: Inside Credit Karma’s Agentic Data Layer with Maddie Daianu

Maddie Daianu, Head of Data and AI at Intuit Credit Karma, brings a wealth of experience from academia to the forefront of financial technology. She dives into the monumental task of managing 60 billion daily predictions and the strategic shift to an 'Agentic Data Layer' for proactive financial management. Maddie shares insights on utilizing Google Cloud for real-time processing, the importance of the Unified Consumer Profile for personalized experiences, and how her team deploys 22,000 models each month, revolutionizing user interaction in finance.

Oct 7, 2025 • 20min

Block Bad Data Before the Write with Nike’s Ashok Singamaneni

In this episode of The Data Engineering Show, Benjamin and Eldad are joined by Ashok Singamaneni, a Principal Data Engineer at Nike. Ashok dives deep into his work on the open-source projects BrickFlow and Spark Expectations. He shares his journey from mechanical engineering to data engineering and the lessons learned over a decade of tackling production data quality issues that lead to costly recomputes.

Ashok explains the philosophy behind Spark Expectations: treating the ingestion and transformation layers of a data pipeline (Bronze/Silver) as a software product rather than just a data engineering product. This means implementing rigorous checks like data quality, unit testing, and integration testing before the data is written to the final layer. He details the implementation using a Python decorator pattern within Spark jobs, allowing engineers to define rules that check for everything from basic column validation to complex referential integrity and aggregation consistency. The discussion also covers the trade-offs of using generative AI tools like Cursor for data engineering and the growing industry trend of prioritizing upfront data quality due to the rise of AI-powered analytics and direct leadership access to data.

What You'll Learn:

• Why the ingestion and transformation layers (Bronze/Silver) of a data pipeline should be treated as a software product with rigorous testing.
• How Spark Expectations moves data quality checks to before data is written to the final tables to prevent mission-critical failures and recomputes.
• The three types of checks in Spark Expectations: row-level, aggregation-level, and query DQ (for referential integrity).
• How the tool handles failures with options to ignore, drop the record, or fail the entire job.
• Why big data quality is becoming a prime focus across the industry due to AI integrations and direct executive-level access to data.
• Ashok's lessons on using generative AI tools (like Cursor/Claude Code) in data engineering projects and the necessity of restrictive permissions.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

About the Guest(s)

Ashok Singamaneni is a Principal Data Engineer at Nike, with over twelve years of experience in the data space across the banking, healthcare, and retail domains. He is the creator of the popular open-source frameworks Spark Expectations and BrickFlow, which focus on improving data quality and pipeline reliability. Ashok advocates for treating data ingestion and transformation as a software product, ensuring checks and balances are in place early in the pipeline. He holds a background in mechanical engineering.

Quotes

"DLT expectations gave an idea to the industry that you can do data quality before actually writing the data into your final tables." - Ashok

"I think over the time, in my experience, what I learned is this ingestion layer and the transformation layer, you should treat that as a software product, not like a data engineering product." - Ashok

"If it's mission critical, then you fail the job, not process the data, and don't put that data into the final table so that you don't need to recompute that again." - Ashok

"As the scale of the product increases, it becomes even more difficult for us to find exactly where the issue went wrong... it takes time for you to debug and see, like, lot of human effort also involved." - Ashok

"Data observability and quality is becoming prime because of AI integrations that are happening." - Ashok

"Ultimately, at the end of the day, you are responsible when you're checking in the code. It's not Claude or Cursor that will be blamed if something goes wrong." - Ashok

"The leadership is directly looking at the data and if there is something wrong in the data, then there can be some serious repercussions happening on the business decisions." - Ashok

"Rather than having bad data in the tables and then recomputing or reclarifying things, let's not put that data first in the first place." - Ashok

"You can drop the record and put that in an error table and give that alert to the engineering team that there is some error in the error table you can look at." - Ashok

"The row DQ checks that happens are very fast. It should happen as a pretty standard checks that happens on the scale." - Ashok

Resources

Projects:
• Spark Expectations - Data quality framework
• BrickFlow - Open source project for data pipelines

Tools & Technologies:
• Apache Spark
• Databricks DLT (Delta Live Tables)
• Great Expectations - Post-processing data quality tool
• Cursor / Claude Code - Generative AI coding tools
• SQLMesh

For Feedback & Discussions on Firebolt Core:
• Join Firebolt Discord Community
• Join Firebolt GitHub Discussions
• Firebolt Core GitHub Repository
• Benjamin@Firebolt.io
• Eldad@Firebolt.io

Primary Speakers: Ashok Singamaneni, Benjamin Wagner, Eldad Farkash
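To make the pattern from the Nike episode concrete, here is a minimal sketch of decorator-based data quality checks with the three failure actions mentioned in the show notes (ignore, drop the record, fail the job). It is plain Python over a list of dictionaries, not Spark and not the Spark Expectations API: the decorator name `with_expectations`, the rule tuple layout, and `DataQualityError` are all hypothetical names chosen for illustration. Query DQ checks (referential integrity) are left out to keep the example short.

```python
from functools import wraps


class DataQualityError(Exception):
    """Raised when a rule with action 'fail' is violated (hypothetical name)."""


def with_expectations(rules):
    """Hypothetical decorator: validate the rows a job returns before they are written.

    Each rule is a tuple (name, level, check, action):
      level  - "row" (check receives one record) or "agg" (check receives all kept records)
      action - "ignore" keeps the record, "drop" diverts it to an error list,
               "fail" aborts the job so nothing reaches the final table
    """
    def decorator(transform):
        @wraps(transform)
        def wrapper(*args, **kwargs):
            rows = transform(*args, **kwargs)
            good, errors = [], []
            for row in rows:
                keep = True
                for name, level, check, action in rules:
                    if level != "row" or check(row):
                        continue  # not a row rule, or the row passed it
                    if action == "fail":
                        raise DataQualityError(f"{name} failed for {row}")
                    if action == "drop":
                        errors.append({"rule": name, "row": row})
                        keep = False
                        break
                    # action == "ignore": keep the record and move on
                if keep:
                    good.append(row)
            # Aggregation-level rules run once over the records that survived.
            for name, level, check, action in rules:
                if level == "agg" and action == "fail" and not check(good):
                    raise DataQualityError(f"{name} failed for the batch")
            return good, errors
        return wrapper
    return decorator


RULES = [
    ("order_id_present", "row", lambda r: r.get("order_id") is not None, "drop"),
    ("amount_non_negative", "row", lambda r: r["amount"] >= 0, "ignore"),
    ("batch_not_empty", "agg", lambda rows: len(rows) > 0, "fail"),
]


@with_expectations(RULES)
def load_orders():
    # Stand-in for a transformation step; real jobs would return a DataFrame.
    return [
        {"order_id": 1, "amount": 20.0},
        {"order_id": None, "amount": 5.0},
        {"order_id": 3, "amount": -1.0},
    ]


good, errors = load_orders()
```

In this sketch the record with a missing `order_id` ends up in `errors` (the "error table" from Ashok's quote), the negative amount is tolerated because its rule is set to ignore, and an empty batch would raise before anything is written.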

Sep 17, 2025 • 22min

Postgres vs. Elasticsearch: The Unexpected Winner in High-Stakes Search for Instacart with Ankit Mittal

Ankit Mittal, former Senior Engineer at Instacart and now at ParadeDB, shares his journey of enhancing search infrastructure by transitioning from Elasticsearch to PostgreSQL. He discusses the challenges of managing fast-moving grocery inventory and how consolidating search functions into one PostgreSQL cluster optimized performance. Ankit highlights the benefits of using PostgreSQL extensions for complex queries and the trade-offs between search systems, emphasizing improved efficiency and reduced latency in data retrieval.

Aug 28, 2025 • 21min

Is Self-Service BI a False Promise? Lei Tang of Fabi.ai Thinks So

Explore the future of AI-powered business intelligence with Lei Tang, CTO and Co-founder of Fabi.ai, as he discusses the evolution from traditional self-service BI to "Vibe-analytics." Learn how AI is transforming data accessibility, enabling anyone to perform sophisticated analytics without deep technical expertise. From building trust in AI-generated insights to creating intelligent semantic layers, discover how modern BI platforms are bridging the gap between data teams and business stakeholders. Tune in to understand why static dashboards are becoming obsolete and how AI agents will soon proactively surface business opportunities and insights.

Key points:

• The limitations of traditional self-service BI and how AI is addressing them
• Building secure, context-aware AI systems for data analysis
• The future of human-AI interaction in business intelligence
• Technical insights into modern BI platform architecture
• Vision for proactive, AI-driven business insights

What You'll Learn:

• Why traditional self-service BI has failed to deliver on its promises and how AI can bridge the gap
• How to build an AI-native BI platform that combines SQL, Python, and natural language processing
• The framework for implementing "Vibe-analytics", a new paradigm of AI-powered visual analytics
• Why context engineering and semantic understanding are crucial for accurate AI-driven analysis
• How to balance security and accessibility when deploying AI-powered analytics tools
• The future of BI platforms as proactive insight generators rather than passive dashboards
• Why caching and stateful environments are essential for responsive AI-powered analytics
• How to leverage AI to translate business questions into accurate technical queries while maintaining data integrity

About the Guest(s)

Lei is the Co-founder and CTO of Fabi.ai, where he leads the development of AI-native business intelligence solutions. With a PhD in machine learning and over a decade of experience in the data domain, Lei has held significant roles, including positions at Yahoo, Walmart, Lyft (as Director of Data Science), and Clari (as Chief Data Scientist). His expertise spans machine learning, data engineering, and business analytics, with a particular focus on making data analysis more accessible and efficient. In this episode, Lei shares insights on the evolution of self-service BI and how AI is transforming business intelligence, drawing from his experience building Fabi.ai, a platform that combines SQL, Python, and AI to democratize data analysis. His work in developing "Vibe AI" (AI-powered BI) represents a significant advancement in making complex data analysis accessible to non-technical users while maintaining data accuracy and trust.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Quotes

"For the past decade, it's really difficult to make sure the self-service BI can work. And then now with AI, the worst part is that it can run properly, but the numbers are wrong." - Lei

"If you talk to anybody working in the BI space, like self-service BI, that has been termed for maybe for the past decade. But I have to say that is a false promise." - Lei

"We're saying that we really want those data team to be able to, like, say, what type of data is exposed to, like, say, less technical folks." - Lei

"In order to build AI native BI, I would say the focus should be how human interact with AI." - Lei

"We believe that, essentially, this BI system or, like, AI BI system would be more like a agent, and then it'll actually looking for, like, business opportunities and insight and surface to you." - Lei

"The one common theme I have been experiencing is that normally would work with other business stakeholders, could be marketing, could be operations, could be sales." - Lei

"We strongly believe that BI should be stored as code." - Lei

"Enterprise data tends to be very noisy, very complex." - Lei

"The semantics of itself becomes part of the context for the AI engine." - Lei

"Most organizations, the data, like the schema, the kind of business, like metrics and logic, has been constantly evolving." - Lei

Resources

• Fabi.ai - AI-native BI platform
• Firebolt (firebolt.io) - Cloud data warehouse platform

Tools & Technologies:
• Firebolt Core - Free self-hosted query engine
• Looker - BI Platform
• Tableau - BI Platform
• Sisense - BI Platform
• Snowflake - Data Warehouse
• BigQuery - Data Warehouse
• PostgreSQL - Database
• SQLAlchemy - Database toolkit
• Pandas - Data analysis library

Primary Speakers: Lei Tang, Benjamin Wagner

Jul 22, 2025 • 26min

Building Uber's AI Assistant: How Genie Revolutionizes On-Call Support with Paarth Chothani from Uber

Journey inside Uber's innovative AI assistant "Genie" with Paarth Chothani, Staff Engineer at Uber, as he shares how they're revolutionizing on-call support using LLMs and vector search. From processing massive amounts of internal documentation to building scalable RAG pipelines, discover how Uber tackles the challenges of implementing AI assistants at scale. Get insights into the evolution from traditional chatbots to agent-based solutions, and learn practical lessons about staying current in the rapidly evolving AI landscape. Whether you're building AI-powered tools or scaling data infrastructure, this episode offers valuable perspectives on balancing innovation with real-world implementation.

• Building and scaling RAG pipelines at enterprise scale
• Evolution from traditional chatbots to AI agents
• Practical insights on data processing and vector search implementation
• Leveraging open-source technologies in production environments
• Navigating rapid technological changes in AI development

What You'll Learn:

• How Uber transformed its on-call support system by building an AI assistant that searches across internal documentation, wikis, and code
• Why combining multiple data sources with vector databases creates more accurate and contextual responses for enterprise support
• The evolution from basic RAG implementation to agent-based architecture for handling complex support scenarios
• How to scale AI processing pipelines using Apache Spark for large-scale data chunking and embedding generation
• Why customization and internal data sources are crucial for enterprise AI assistant effectiveness
• The future of AI assistants: moving from documentation lookup to automated problem resolution through multi-agent systems
• How to balance rapid AI innovation with setting realistic customer expectations in fast-moving tech environments

Paarth is a Staff Engineer at Uber, where he works on Michelangelo, Uber's machine learning platform. With over four years at Uber, he specializes in feature store development, online serving at scale, and GenAI implementations. He has been instrumental in developing Genie, an AI-powered on-call assistant that revolutionizes how Uber's engineering teams handle support requests and documentation access. In this episode, Paarth shares valuable insights on building and scaling RAG-based systems, vector search implementations, and the evolution of AI assistants from traditional chatbots to sophisticated agent-based solutions. His experience spanning both AWS chatbot development and current GenAI innovations at Uber offers listeners a unique perspective on the rapid advancement of AI-powered enterprise solutions.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Quotes

"Think of Genie as your on-call assistant. Different infra teams have their Slack channels, and because these technologies are widely used, you have to wait a lot." - Paarth

"What we realized is for our engineers to really get help, data sources really should be internal only because we customize lot of these open source engines for making it work at Uber scale." - Paarth

"Instead of building a mega scale pipeline that just ingest all data sources and then keeps a central data source solution, we instead are giving users the flexibility to ingest what data sources they want." - Paarth

"We had to scale our you can say the whole infra layer to chunk data faster to be able to create embeddings at scale." - Paarth

"It almost felt like they're doing what EMR was doing. You have your Hadoop and big data technology, and we needed these pipelines to basically process all this data quickly." - Paarth

"We've even evolved from just giving you the right documentation to starting to evolve into a situation where we'll also start taking actions on your behalf." - Paarth

"That intuition that comes from building this kind of bot, I feel like that intuition came again as we were starting to see this technology come, and we're like, hey, this looks like where you can pretty much fit all these pieces together." - Paarth

"What we have seen with several use cases is agentic Genie works well when designed well, when you've analyzed the problem of which type of subproblems the bot should resolve per channel, per use case." - Paarth

"I think having a problem in mind always helps that way, the energy is little bit focused and directed." - Paarth

"Whatever you're building is not enough because the expectation has already gone to the next level, so the pace is too fast right now." - Paarth

Resources

Companies & Platforms:
• Uber - ML Platform & Engineering
• Firebolt - Cloud Data Warehouse (firebolt.io)

Tools & Technologies:
• Michelangelo - Uber's ML Platform
• Genie - Uber's On-Call Assistant Bot
• Cursor - Developer IDE
• OpenSearch - Vector Database
• LangGraph - Agent Framework

Notable Projects Mentioned:
• MetaMate (Meta)
• Query Copilot (Uber)
• Scale at AI (Meta Meetup)

Company Blogs:
• Uber Engineering Blog - Genie and Query Optimization articles

Primary Speakers:
• Paarth Chothani - Staff Engineer, Uber
• Benjamin - Firebolt
• Eldad - Firebolt

Jun 10, 2025 • 22min

From Zero to 100M Users: Inside Notion’s Data Stack and AI Strategy with Sumit Gupta

In this discussion, Sumit Gupta, Lead BI Engineer at Notion, shares insights from his journey through tech giants like Snowflake and Dropbox. He highlights how modern data stacks are evolving with tools like dbt and Iceberg, while emphasizing the shift from technical skills to crucial transferable skills in the AI era. Sumit explains how AI is revolutionizing workflows and automating content creation, stressing the importance of balancing automation with genuine human connections. He also provides tips on adapting to the rapid changes in data and AI technologies.

May 7, 2025 • 32min

How Rising Wave Is Redefining Real-Time Data with Postgres Power

In this episode of The Data Engineering Show, host Benjamin and co-host Eldad sit down with Yingjun Wu, founder and CEO of Rising Wave, to explore the evolution of stream processing systems and the innovations his company is bringing to the space.

What you'll learn:

• Yingjun's journey from academic research in stream processing to founding Rising Wave, and the challenges of building trust in a new database system.
• How Rising Wave's architecture, using S3 as primary storage, scales in seconds, while other systems can take hours to scale.
• The competitive landscape of stream processing, with Rising Wave's Postgres compatibility providing a significant advantage in ease of use.
• How one major company reduced its CPU requirements from 20,000 to just 600 by switching from a traditional stream processing system to Rising Wave.
• The rising importance of Apache Iceberg as a destination for stream processing output, helping companies avoid vendor lock-in.
• How streaming systems fit into modern data stacks, especially as companies seek to avoid being locked into proprietary systems.

Yingjun Wu is the founder and CEO of Rising Wave, a stream processing system built in Rust and designed with a cloud-native architecture. With a PhD focused on stream processing and database systems, Yingjun previously worked at Redshift and IBM Research before founding Rising Wave. His company has developed a system that achieves significant performance and resource efficiency advantages over traditional stream processing solutions, while maintaining Postgres compatibility for ease of use.

Episode Highlights:

The Origins of Rising Wave (00:30)
Yingjun shares his background in stream processing from his PhD days and explains how his experience at Redshift revealed the need for better stream processing solutions, especially since many data warehouse workloads involve data ingested from streaming sources like Kinesis or Kafka.

Building a System from Scratch (04:10)
Yingjun describes the challenging first 2-3 years of developing Rising Wave without customers, highlighting how trust is a major barrier for new database systems. After 2.5 years, they secured their first customers, including a startup and several larger companies, which helped establish Rising Wave's credibility.

The Current Stream Processing Landscape (07:47)
Benjamin asks about the current stream processing space, with Yingjun positioning Rising Wave as a leader, particularly for SQL-based workloads. He highlights several key advantages of Rising Wave, including its Rust-based implementation and S3-based storage architecture.

S3 as Primary Storage (10:27)
Yingjun explains their decision to use S3 as primary storage from day one, despite its slowness and expense. He discusses how they've optimized for these challenges and would still make the same architectural choice today due to benefits like simplified state management and superior elastic scaling.

The Business Model (13:52)
Rising Wave offers open-source, cloud, and on-premise versions of its product. Yingjun notes that many highly regulated industries require on-premise deployment, including customers in the banking and aerospace sectors.

Typical Users and Competitive Advantages (15:01)
When asked about their typical users, Yingjun explains they directly compete with Flink but have advantages in ease of use due to Postgres compatibility. Their users are either new to stream processing or are migrating from systems like Spark Streaming or Flink due to performance issues or development complexity.

Apache Iceberg Integration (19:25)
Yingjun discusses how Apache Iceberg is emerging as an important destination for Rising Wave output, as companies seek to avoid vendor lock-in with proprietary data warehouses. He explains how Rising Wave typically performs ETL functions before data is sent to Iceberg tables.

The Future of Data Management (32:06)
The conversation concludes with a discussion about Iceberg becoming a "single source of truth" for data, with multiple specialized query engines potentially accessing the same data. Yingjun and Eldad share perspectives on how this shift away from proprietary data lock-in is changing the data ecosystem.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Episode Resources:
• Rising Wave Website
• Yingjun Wu LinkedIn

Apr 8, 2025 • 24min

Revolutionizing Data Governance with DataStrato’s Unified Open Source Approach

Lisa Cao, Product Manager at DataStrato, dives into the world of data governance, sharing her expertise in AI/ML and open-source frameworks. The discussion highlights Apache Gravitino's unique capabilities, enabling unified governance across diverse data systems. They tackle the 'Push-Down Permission Management' model, essential for security, and the growing trend towards open ecosystems that prioritize flexibility. Lisa also emphasizes the importance of real-world tool adoption versus social media hype, keeping data engineers agile in a fast-paced landscape.

Mar 19, 2025 • 31min

Database Technology in the Age of AI with DuckDB Labs co-creator Hannes Mühleisen

Hannes Mühleisen, CEO of DuckDB Labs and a professor in the Netherlands, discusses the innovative journey of DuckDB, an open-source analytical database that’s making waves with 10 million monthly downloads. He highlights how DuckDB differs from SQLite and its powerful analytical capabilities. Hannes also dives into the system's flexible ecosystem, allowing for custom functionalities. A fascinating discussion on AI’s impact on database management showcases the balance between traditional SQL usage and modern technological advancements.
