
Data Engineering Podcast

Latest episodes

Jul 31, 2022 • 41min

Interactive Exploratory Data Analysis On Petabyte Scale Data Sets With Arkouda

Summary Exploratory data analysis works best when the feedback loop is fast and iterative. This is easy to achieve when you are working on small datasets, but as they scale up beyond what can fit on a single machine those short iterations quickly become long and tedious. The Arkouda project is a Python interface built on top of the Chapel compiler to bring back those interactive speeds for exploratory analysis on horizontally scalable compute that parallelizes operations on large volumes of data. In this episode David Bader explains how the framework operates, the algorithms that are built into it to support complex analyses, and how you can start using it today. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today! RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. 
The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer. Your host is Tobias Macey and today I’m interviewing David Bader about Arkouda, a horizontally scalable parallel compute library for exploratory data analysis in Python Interview Introduction How did you get involved in the area of data management? Can you describe what Arkouda is and the story behind it? What are the main goals of the project? How does it address those goals? Who is the primary audience for Arkouda? What are some of the main points of friction that engineers and scientists encounter while conducting exploratory data analysis (EDA)? What kinds of behaviors are they engaging in during these exploration cycles? When data scientists run up against the limitations of their tools and environments how does that impact the work of data engineers/data platform owners? There have been a number of libraries/frameworks/utilities/etc. built to improve the experience and outcomes for EDA. What was missing that made Arkouda necessary/useful? Can you describe how Arkouda is implemented? What are some of the novel algorithms that you have had to design to support Arkouda’s objectives? How have the design/goals/scope of the project changed since you started working on it? How has the evolution of hardware capabilities impacted the set of processing algorithms that are viable for addressing considerations of scale? What are the relative factors of scale along space/time axes that you are optimizing for? What are some opportunities that are still unrealized for algorithmic optimizations to expand horizons for large-scale data manipulation? For teams/individuals who are working with Arkouda can you describe the implementation process and what the end-user workflow looks like? What are the most interesting, innovative, or unexpected ways that you have seen Arkouda used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Arkouda? When is Arkouda the wrong choice? What do you have planned for the future of Arkouda? Contact Info Website LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Arkouda NJIT == New Jersey Institute of Technology NumPy Pandas Podcast.__init__ Episode NetworkX Chapel Massive Graph Analytics Book Ray Podcast.__init__ Episode Dask Podcast Episode Bodo Podcast Episode Stinger Graph Analytics Bears-R-Us 0MQ Triangle Centrality Degree Centrality The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
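To make the interactive workflow described above concrete, here is a minimal sketch of what an Arkouda session can look like, assuming a separately launched arkouda_server; the host, port, array size, and grouping key are illustrative placeholders rather than details taken from the episode.

```python
import arkouda as ak

# Connect to a running arkouda_server (started separately on a workstation or
# cluster); host and port here are assumptions for the example.
ak.connect(server="localhost", port=5555)

# Create a large array server-side; the data never needs to fit in the Python
# client's memory.
a = ak.randint(0, 2**32, 10**8)

# NumPy-like operations are shipped to the Chapel server and run in parallel.
idx = ak.argsort(a)
largest = a[idx[idx.size - 10:]]

# Group-by and aggregation are core primitives for exploratory analysis at scale.
g = ak.GroupBy(a % 1000)
keys, counts = g.count()
print(largest, keys[:5], counts[:5])

ak.disconnect()
```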
Jul 31, 2022 • 1h 5min

What "Data Lineage Done Right" Looks Like And How They're Doing It At Manta

Summary Data lineage is the roadmap for your data platform, providing visibility into all of the dependencies for any report, machine learning model, or data warehouse table that you are working with. Because of its centrality to your data systems it is valuable for debugging, governance, understanding context, and myriad other purposes. This means that it is important to have an accurate and complete lineage graph so that you don’t have to perform your own detective work when time is in short supply. In this episode Ernie Ostic shares the approach that he and his team at Manta are taking to build a complete view of data lineage across the various data systems in your organization and the useful applications of that information in the work of every data stakeholder. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos. The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses. Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Your host is Tobias Macey and today I’m interviewing Ernie Ostic about Manta, an automated data lineage service for managing visibility and quality of your data workflows Interview Introduction How did you get involved in the area of data management? Can you describe what Manta is and the story behind it? What are the core problems that Manta aims to solve? Data lineage and metadata systems are a hot topic right now. What is your summary of the state of the market? What are the capabilities that would lead a team or organization to choose Manta in place of the other options? What are some examples of "data lineage done wrong"? (what does that look like?) What are the risks associated with investing in an incomplete solution for data lineage? What are the core attributes that need to be tracked consistently to enable a comprehensive view of lineage? How do the practices for collecting lineage and metadata differ between structured, semi-structured, and unstructured data assets and their movement? Can you describe how Manta is implemented? How have the design and goals of the product changed or evolved? What is involved in integrating Manta with an organization’s data systems? What are the biggest sources of friction/errors in collecting and cleaning lineage information? One of the interesting capabilities that you advertise is versioning and time travel for lineage information. Why is that a necessary and useful feature? Once an organization’s lineage information is available in Manta, how does it factor into the daily workflow of different roles/stakeholders? There are a variety of use cases for metadata in a data platform beyond lineage. What are the benefits that you see from focusing on that as a core competency? Beyond validating quality, identifying errors, etc., it seems that automated discovery of lineage could produce insights into the presence of data assets that shouldn’t exist. What are some examples of similar discoveries that you are aware of? What are the most interesting, innovative, or unexpected ways that you have seen Manta used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Manta? When is Manta the wrong choice? What do you have planned for the future of Manta? Contact Info LinkedIn @dsrealtime01 on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers Links Manta Egeria OpenLineage Podcast Episode Apache Atlas Neo4J Easytrieve The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Jul 24, 2022 • 58min

Re-Bundling The Data Stack With Data Orchestration And Software Defined Assets Using Dagster

Summary The current stage of evolution in the data management ecosystem has resulted in domain and use case specific orchestration capabilities being incorporated into various tools. This complicates the work involved in making end-to-end workflows visible and integrated. Dagster has invested in bringing insights about external tools’ dependency graphs into one place through its "software defined assets" functionality. In this episode Nick Schrock discusses the importance of orchestration and a central location for managing data systems, the road to Dagster’s 1.0 release, and the new features coming with Dagster Cloud’s general availability. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer. Your host is Tobias Macey and today I’m interviewing Nick Schrock about software defined assets and improving the developer experience for data orchestration with Dagster Interview Introduction How did you get involved in the area of data management? What are the notable updates in Dagster since the last time we spoke? (November, 2021) One of the core concepts that you introduced and then stabilized in recent releases is the "software defined asset" (SDA). 
How have your users reacted to this capability? What are the notable outcomes in development and product practices that you have seen as a result? What are the changes to the interfaces and internals of Dagster that were necessary to support SDA? How did the API design shift from the initial implementation once the community started providing feedback? You’re releasing the stable 1.0 version of Dagster as part of something called "Dagster Day" on August 9th. What do you have planned for that event and what does the release mean for users who have been refraining from using the framework until now? Along with your 1.0 commitment to a stable interface in the framework you are also opening your cloud platform for general availability. What are the major lessons that you and your team learned in the beta period? What new capabilities are coming with the GA release? A core thesis in your work on Dagster is that developer tooling for data professionals has been lacking. What are your thoughts on the overall progress that has been made as an industry? What are the sharp edges that still need to be addressed? A core facet of product-focused software development over the past decade+ is CI/CD and the use of pre-production environments for testing changes, which is still a challenging aspect of data-focused engineering. How are you thinking about those capabilities for orchestration workflows in the Dagster context? What are the missing pieces in the broader ecosystem that make this a challenge even with support from tools and frameworks? How has the situation improved in the recent past and looking toward the near future? What role does the SDA approach have in pushing on these capabilities? What are the most interesting, innovative, or unexpected ways that you have seen Dagster used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on bringing Dagster to 1.0 and cloud to GA? When is Dagster/Dagster Cloud the wrong choice? What do you have planned for the future of Dagster and Elementl? Contact Info @schrockn on Twitter schrockn on GitHub LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers Links Dagster Day Dagster 1st Podcast Episode 2nd Podcast Episode Elementl GraphQL Unbundling Airflow Feast Spark SQL Dagster Cloud Branch Deployments Dagster custom I/O manager LakeFS Iceberg Project Nessie Prefect Prefect Orion Astronomer Temporal The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
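For readers unfamiliar with the software-defined asset concept discussed above, the following is a minimal, illustrative sketch of declaring two dependent assets in Dagster; the asset names and the toy transformation are invented for the example and are not taken from the episode.

```python
from dagster import asset, materialize

@asset
def raw_orders():
    # Placeholder source; in practice this might read from an API or a warehouse table.
    return [{"id": 1, "amount": 30}, {"id": 2, "amount": 70}]

@asset
def order_summary(raw_orders):
    # Naming the upstream asset as a parameter declares the dependency, which is
    # what gives the orchestrator a global, queryable asset graph.
    return {
        "order_count": len(raw_orders),
        "total_amount": sum(o["amount"] for o in raw_orders),
    }

if __name__ == "__main__":
    # Materialize the graph locally; under a deployed Dagster instance or Dagster
    # Cloud these assets would instead be scheduled, tracked, and observed.
    result = materialize([raw_orders, order_summary])
    assert result.success
```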
Jul 24, 2022 • 1h 1min

Writing The Book That Offers A Single Reference For The Fundamentals Of Data Engineering

Summary Data engineering is a difficult job, requiring a large number of skills that often don’t overlap. Any effort to understand how to start a career in the role has required stitching together information from a multitude of resources that might not all agree with each other. In order to provide a single reference for anyone tasked with data engineering responsibilities Joe Reis and Matt Housley took it upon themselves to write the book "Fundamentals of Data Engineering". In this episode they share their experiences researching and distilling the lessons that will be useful to data engineers now and into the future, without being tied to any specific technologies that may fade from fashion. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos. Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect today. Your host is Tobias Macey and today I’m interviewing Joe Reis and Matt Housley about their new book on the Fundamentals of Data Engineering Interview Introduction How did you get involved in the area of data management? Can you explain what possessed you to write such an ambitious book? What are your goals with this book? What was your process for determining what subject areas to include in the book? How did you determine what level of granularity/detail to use for each subject area? Closely linked to what subjects are necessary to be effective as a data engineer is the concept of what that title encompasses. How have the definitions shifted over the past few decades? 
In your experiences working in industry and researching for the book, what is the prevailing view on what data engineers do? In the book you focus on what you term the "data lifecycle engineer". What are the skills and background that are needed to be successful in that role? Any discussion of technological concepts and how to build systems tends to drift toward specific tools. How did you balance the need to be agnostic to specific technologies while providing relevant and relatable examples? What are the aspects of the book that you anticipate needing to revisit over the next 2 – 5 years? Which elements do you think will remain evergreen? What are the most interesting, unexpected, or challenging lessons that you have learned while working on writing "Fundamentals of Data Engineering"? What are your predictions for the future of data engineering? Contact Info Joe LinkedIn Website Matt LinkedIn @doctorhousley on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Fundamentals of Data Engineering (affiliate link) Ternary Data Designing Data Intensive Applications James Webb Space Telescope Google Colossus Storage System DMBoK == Data Management Body of Knowledge DAMA Bill Inmon Apache Druid RTFM == Read The Fine Manual DuckDB Podcast Episode VisiCalc Ternary Data Newsletter Meroxa Podcast Episode Ruby on Rails Lambda Architecture The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Jul 17, 2022 • 1h 7min

Making The Total Cost Of Ownership For External Data Manageable With Crux

Summary There are extensive and valuable data sets that are available outside the bounds of your organization. Whether that data is public, paid, or scraped it requires investment and upkeep to acquire and integrate it with your systems. Crux was built to reduce the total cost of acquisition and ownership for integrating external data, offering a fully managed service for delivering those data assets in the manner that best suits your infrastructure. In this episode Crux CTO Mark Etherington discusses the different costs involved in managing external data, how to think about the total return on investment for your data, and how the Crux platform is architected to reduce the toil involved in managing third party data. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos. Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. 
Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today! Your host is Tobias Macey and today I’m interviewing Mark Etherington about Crux, a platform that helps organizations scale their most critical data delivery, operations, and transformation needs Interview Introduction How did you get involved in the area of data management? Can you describe what Crux is and the story behind it? What are the categories of information that organizations use external data sources for? What are the challenges and long-term costs related to integrating external data sources that are most often overlooked or underestimated? What are some of the primary risks involved in working with external data sources? How do you work with customers to help them understand the long-term costs associated with integrating various sources? How does that play into the broader conversation about assessing the value of a given data-set? Can you describe how you have architected the Crux platform? How have the design and goals of the platform changed or evolved since you started working on it? What are the design choices that have had the most significant impact on your ability to reduce operational complexity and maintenance overhead for the data you are working with? For teams who are relying on Crux to manage external data, what is involved in setting up the initial integration with your system? What are the steps to on-board new data sources? How do you manage data quality/data observability across your different data providers? What kinds of signals do you propagate to your customers to feed into their operational platforms? What are the most interesting, innovative, or unexpected ways that you have seen Crux used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Crux? When is Crux the wrong choice? What do you have planned for the future of Crux? Contact Info Email LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers Links Crux Thomson Reuters Goldman Sachs JP Morgan Avro ESG == Environmental, Social, Governance Data Selenium Google Cloud Platform Cadence Airflow The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: Shipyard: Shipyard is an orchestration platform that helps data teams build out solid data operations from the get-go by connecting data tools and streamlining data workflows. Shipyard offers low-code templates that are configured using a visual interface, replacing the need to write code to build workflows while enabling engineers to get their work into production faster. If a solution can’t be built with existing templates, engineers can always automate scripts in the language of their choice to bring any internal or external process into their workflows. Observability and alerting are built into the Shipyard platform, ensuring that breakages are identified before being discovered downstream by business teams. With a high level of concurrency, scalability, and end-to-end encryption, Shipyard enables data teams to accomplish more without relying on other teams or worrying about infrastructure challenges, while also ensuring that business teams trust the data made available to them. Go to dataengineeringpodcast.com/shipyard to get started automating powerful workflows with their free developer plan today! Support Data Engineering Podcast
Jul 17, 2022 • 57min

Joe Reis Flips The Script And Interviews Tobias Macey About The Data Engineering Podcast

Summary Data engineering is a large and growing subject, with new technologies, specializations, and "best practices" emerging at an accelerating pace. This podcast does its best to explore this fractal ecosystem, and has been at it for the past 5+ years. In this episode Joe Reis, founder of Ternary Data and co-author of "Fundamentals of Data Engineering", turns the tables and interviews the host, Tobias Macey, about his journey into podcasting, how he runs the show behind the scenes, and the other things that occupy his time. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer. Your host is Tobias Macey and today we’re flipping the script. Joe Reis of Ternary Data will be interviewing me about my time as the host of this show and my perspectives on the data ecosystem Interview Introduction How did you get involved in the area of data management? Now I’ll hand it off to Joe… Joe’s Notes You do a lot of podcasts. Why? Podcast.init started in 2015, and your first episode of Data Engineering was published January 14, 2017. Walk us through the start of these podcasts. why not a data science podcast? why DE? 
You’ve published 306 episodes of the Data Engineering Podcast, plus 370 for the init podcast, then you’ve got a new ML podcast. How have you kept the motivation over the years? What’s the process for the show (finding guests, topics, etc….recording, publishing)? It’s a lot of work. Walk us through this process. You’ve done a ton of shows and have a lot of context with what’s going on in the field of both data engineering and Python. What have been some of the major evolutions of topics you’ve covered? What’s been the most counterintuitive show or interesting thing you’ve learned while producing the show? How do you keep current with the data engineering landscape? You’ve got a very unique perspective of data engineering, having interviewed countless top people in the field. What are the big trends you see in data engineering over the next 3 years? What do you do besides podcasting? Is this your only gig, or do you do other work? What’s next? Contact Info LinkedIn Website Closing Announcements Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Podcast.__init__ The Machine Learning Podcast Ternary Data Fundamentals of Data Engineering book (affiliate link) The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Jul 10, 2022 • 40min

Charting the Path of Riskified's Data Platform Journey

Summary Building a data platform is a journey, not a destination. Beyond the work of assembling a set of technologies and building integrations across them, there is also the work of growing and organizing a team that can support and benefit from that platform. In this episode Inbar Yogev and Lior Winner share the journey that they and their teams at Riskified have been on for their data platform. They also discuss how they have established a guild system for training and supporting data professionals in the organization. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos. Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. 
So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today! Your host is Tobias Macey and today I’m interviewing Inbar Yogev and Lior Winner about the data platform that the team at Riskified are building to power their fraud management service Interview Introduction How did you get involved in the area of data management? What does Riskified do? Can you describe the role of data at Riskified? What are some of the core types and sources of information that you are dealing with? Who/what are the primary consumers of the data that you are responsible for? What are the team structures that you have tested for your data professionals? What is the composition of your data roles? (e.g. ML engineers, data engineers, data scientists, data product managers, etc.) What are the organizational constraints that have the biggest impact on the design and usage of your data systems? Can you describe the current architecture of your data platform? What are some of the most notable evolutions/redesigns that you have gone through? What is your process for establishing and evaluating selection criteria for any new technologies that you adopt? How do you facilitate knowledge sharing between data professionals? What have you found to be the most challenging technological and organizational complexities that you have had to address on the path to your current state? What are the methods that you use for staying up to date with the data ecosystem? (opportunity to discuss Haya Data conference) In your role as organizers of the Haya Data conference, what are some of the insights that you have gained into the present state and future trajectory of the data community? What are the most interesting, innovative, or unexpected ways that you have seen the Riskified data platform used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data platform for Riskified? What do you have planned for the future of your data platform? Contact Info Inbar LinkedIn Lior LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers Links Riskified ADABAS Aerospike Podcast Episode Neo4J Kafka Delta Lake Podcast Episode Databricks Snowflake Podcast Episode Tableau Looker Podcast Episode Redshift Event Sourcing Avro hayaData Conference Data Mesh Data Catalog Data Governance MLOps Podcast Episode The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Jul 10, 2022 • 1h 5min

Maintain Your Data Engineers' Sanity By Embracing Automation

Summary Building and maintaining reliable data assets is the prime directive for data engineers. While it is easy to say, it is endlessly complex to implement, requiring data professionals to be experts in a wide range of disparate topics while designing and implementing complex topologies of information workflows. In order to make this a tractable problem it is essential that engineers embrace automation at every opportunity. In this episode Chris Riccomini shares his experiences building and scaling data operations at WePay and LinkedIn, as well as the lessons he has learned working with other teams as they automated their own systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer. Your host is Tobias Macey and today I’m interviewing Chris Riccomini about building awareness of data usage into CI/CD pipelines for application development Interview Introduction How did you get involved in the area of data management? What are the pieces of data platforms and processing that have been most difficult to scale in an organizational sense? What are the opportunities for automation to alleviate some of the toil that data and analytics engineers get caught up in? 
The application delivery ecosystem has been going through ongoing transformation in the form of CI/CD, infrastructure as code, etc. What are the parallels in the data ecosystem that are still nascent? What are the principles that still need to be translated for data practitioners? Which are subject to impedance mismatch and may never make sense to translate? As someone with a software engineering background and extensive experience working in data, what are the missing links to make those teams/objectives work together more seamlessly? How can tooling and automation help in that endeavor? A key factor in the adoption of automation for application delivery is automated tests. What are some of the strategies you find useful for identifying scope and targets for testing/monitoring of data products? As data usage and capabilities grow and evolve in an organization, what are the junction points that are in greatest need of well-defined data contracts? How can automation aid in enforcing and alerting on those contracts in a continuous fashion? What are the most interesting, innovative, or unexpected ways that you have seen automation of data operations used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on automation for data systems? When is automation the wrong choice? What does the future of data engineering look like? Contact Info Website @criccomini on Twitter criccomini on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers Links WePay Enterprise Service Bus The Missing README Hadoop Confluent Schema Registry Podcast Episode Avro CDC == Change Data Capture Debezium Podcast Episode Data Mesh What the heck is a data mesh? blog post SRE == Site Reliability Engineer Terraform Chef configuration management tool Puppet configuration management tool Ansible configuration management tool BigQuery Airflow Pulumi Podcast.__init__ Episode Monte Carlo Podcast Episode Bigeye Podcast Episode Anomalo Podcast Episode Great Expectations Podcast Episode Schemata Data Engineering Weekly newsletter The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
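As one hedged illustration of the kind of automated check discussed above, the sketch below validates a simple data contract with Great Expectations (one of the tools linked in these show notes) so that a CI job can fail before bad data reaches production; the column names, thresholds, and pandas-based API usage are assumptions for the example, and the API differs across Great Expectations releases.

```python
import pandas as pd
import great_expectations as ge

# In CI this sample would typically come from a staging table or branch deployment
# rather than an inline DataFrame.
sample = pd.DataFrame({"order_id": [1, 2, 3], "amount": [30.0, 70.0, 12.5]})
dataset = ge.from_pandas(sample)

# Express the contract as expectations on the data itself.
dataset.expect_column_values_to_not_be_null("order_id")
dataset.expect_column_values_to_be_between("amount", min_value=0)

results = dataset.validate()
# Failing the build on a violated contract keeps bad data out of production.
if not results.success:
    raise SystemExit("data contract violated: " + str(results.statistics))
```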
Jul 3, 2022 • 1h 11min

Be Confident In Your Data Integration By Quickly Validating Matching Records With data-diff

Summary The perennial challenge of data engineers is ensuring that information is integrated reliably. While it is straightforward to know whether a synchronization process succeeded, it is not always clear whether every record was copied correctly. In order to quickly identify if and how two data systems are out of sync Gleb Mezhanskiy and Simon Eskildsen partnered to create the open source data-diff utility. In this episode they explain how the utility is implemented to run quickly and how you can start using it in your own data workflows to ensure that your data warehouse isn’t missing any records from your source systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! Random data doesn’t do it — and production data is not safe (or legal) for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it’s made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments. Your newly mimicked datasets are safe to share with developers, QA, data scientists—heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data, with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2-week sandbox account, go to dataengineeringpodcast.com/tonic today to give it a try! Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy and Simon Eskildsen about their work to open source the data-diff utility that they have been building at Datafold.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the data-diff tool is and the story behind it?
What was your motivation for going through the process of releasing your data-diff functionality as an open source utility?
What are some of the ways that data-diff composes with other data quality tools? (e.g. Great Expectations, Soda SQL, etc.)
Can you describe how data-diff is implemented?
Given the target of having a performant and scalable utility, how did you approach the question of language selection?
What are some of the ways that you have seen data-diff incorporated into the workflow of data teams?
What were the steps that you needed to take to get the project cleaned up and separated from your internal implementation for release as open source?
What are the most interesting, innovative, or unexpected ways that you have seen data-diff used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data-diff?
When is data-diff the wrong choice?
What do you have planned for the future of data-diff?

Contact Info
Gleb: LinkedIn, @glebmm on Twitter
Simon: Website, @Sirupsen on Twitter, sirupsen on GitHub, LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links
Datafold (Podcast Episode)
data-diff
Autodesk
Airbyte (Podcast Episode)
Debezium (Podcast Episode)
Napkin Math newsletter
Airflow
Dagster (Podcast Episode)
Great Expectations (Podcast Episode)
dbt (Podcast Episode)
Trino
Preql (Podcast.__init__ Episode)
Erez Shinan
Fivetran (Podcast Episode)
md5
CRC32
Merkle Tree
Locally Optimistic
Presto

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Special Guest: Gleb Mezhanskiy.
Support Data Engineering Podcast
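For listeners who want a concrete picture before diving into the episode: the link list above points at md5, CRC32, and Merkle trees, which hints at the general technique a cross-database diffing tool can use to avoid pulling every row over the network. The Python sketch below is purely illustrative and is not data-diff’s actual code; the function names and the in-memory dicts standing in for database queries are hypothetical. The idea is to checksum whole key ranges on each side and only subdivide and re-check the ranges whose checksums disagree.

# Illustrative sketch only (not data-diff's implementation): checksum key
# ranges on each side, and recurse into ranges whose digests differ.
import hashlib

def checksum(rows):
    # Hash a sorted list of (id, value) tuples. In a real tool this
    # aggregation would be pushed down to the database as one query.
    digest = hashlib.md5()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

def diff_ranges(source, target, lo, hi, min_size=2):
    # `source` and `target` are hypothetical stand-ins for "fetch rows with
    # lo <= id < hi" against two databases; here they are plain dicts.
    src_rows = sorted((k, v) for k, v in source.items() if lo <= k < hi)
    tgt_rows = sorted((k, v) for k, v in target.items() if lo <= k < hi)
    if checksum(src_rows) == checksum(tgt_rows):
        return []  # the whole segment matches, no need to look closer
    if hi - lo <= min_size:
        # Small enough: compare row by row and report the differences.
        return sorted(set(src_rows) ^ set(tgt_rows))
    mid = (lo + hi) // 2
    return (diff_ranges(source, target, lo, mid, min_size)
            + diff_ranges(source, target, mid, hi, min_size))

if __name__ == "__main__":
    source = {i: f"value-{i}" for i in range(16)}
    target = {**source, 5: "drifted", 11: None}  # two rows out of sync
    print(diff_ranges(source, target, 0, 16))

In a real system the per-segment checksums would be computed inside each database with a single aggregate query, so only a handful of digests travel over the network and the row-by-row comparison happens only for the small segments that actually differ.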
Jul 3, 2022 • 59min

The View From The Lakehouse Of Architectural Patterns For Your Data Platform

Summary
The ecosystem for data tools has been going through rapid and constant evolution over the past several years. These technological shifts have brought about corresponding changes in data and platform architectures for managing data and analytical workflows. In this episode Colleen Tartow shares her insights into the motivating factors and benefits of the most prominent patterns in the popular narrative: data mesh and the modern data stack. She also discusses her views on the role of the data lakehouse as a building block for these architectures and the ongoing influence that it will have as the technology matures.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying; you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does.
So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at: solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!
Your host is Tobias Macey and today I’m interviewing Colleen Tartow about her views on the forces shaping the current generation of data architectures.

Interview
Introduction
How did you get involved in the area of data management?
In your opinion as an astrophysicist, how well does the metaphor of a starburst map to your current work at the company of the same name?
Can you describe what you see as the dominant factors that influence a team’s approach to data architecture and design?
Two of the most repeated (and often misattributed) terms in the data ecosystem for the past couple of years are the "modern data stack" and the "data mesh". As someone working at a company that can be construed to provide solutions for either or both of those patterns, what are your thoughts on their lasting strength and long-term viability?
What do you see as the strengths of the emerging lakehouse architecture in the context of the "modern data stack"?
What are the factors that have prevented it from being a default choice compared to cloud data warehouses? (e.g. BigQuery, Redshift, Snowflake, Firebolt, etc.)
What are the recent developments that are contributing to its current growth?
What are the weak points/sharp edges that still need to be addressed? (both internal to the platforms and in the external ecosystem/integrations)
What are some of the implementation challenges that teams often experience when trying to adopt a lakehouse strategy as the core building block of their data systems?
What are some of the exercises that they should be performing to help determine their technical and organizational capacity to support that strategy over the long term?
One of the core requirements for a data mesh implementation is a common system that product teams can easily build their solutions on top of. How do lakehouse/data virtualization systems allow for that?
What are some of the lessons that need to be shared with engineers to help them make effective use of these technologies when building their own data products?
What are some of the supporting services that are helpful in these undertakings?
What do you see as the forces that will have the most influence on the trajectory of data architectures over the next 2 to 5 years?
What are the most interesting, innovative, or unexpected ways that you have seen lakehouse architectures used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Starburst product?
When is a lakehouse the wrong choice?
What do you have planned for the future of Starburst’s technology platform?

Contact Info
LinkedIn
@ctartow on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links
Starburst
Trino
Teradata
Cognos
Data Lakehouse
Data Virtualization
Iceberg (Podcast Episode)
Hudi (Podcast Episode)
Delta (Podcast Episode)
Snowflake (Podcast Episode)
AWS Lake Formation
Clickhouse (Podcast Episode)
Druid
Pinot (Podcast Episode)
Starburst Galaxy
Varada

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast
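To make the data virtualization idea discussed above a little more concrete, here is a small, hypothetical example of a federated query run through Trino, joining an Iceberg table in the lake with a table in an operational Postgres database. It is only a sketch: the hostname, catalog, schema, and table names are invented for illustration, and any real deployment will differ.

# Hypothetical federated query through Trino (Python client); all names
# below (host, catalogs, schemas, tables) are illustrative assumptions.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # assumed Trino coordinator hostname
    port=8080,
    user="analyst",
    catalog="lake",      # e.g. an Iceberg catalog configured in Trino
    schema="analytics",
)

query = """
SELECT o.customer_id,
       c.region,
       SUM(o.amount) AS total_spend
FROM lake.analytics.orders AS o          -- table stored in the lakehouse
JOIN postgres.public.customers AS c      -- table in an operational database
  ON o.customer_id = c.id
GROUP BY o.customer_id, c.region
"""

cur = conn.cursor()
cur.execute(query)
for customer_id, region, total_spend in cur.fetchall():
    print(customer_id, region, total_spend)

The point of a sketch like this is that the query engine, rather than a copy of the data, becomes the common layer that product teams build on: the same SQL interface reaches tables wherever they live.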
