Addressing The Challenges Of Component Integration In Data Platform Architectures
Nov 27, 2023
In this episode, the host discusses the challenges of integrating components in data platform architectures, including user experience, data sharing and delivery, and shadow IT. They explore event-driven pipelines, access control, data flow ownership, and metadata propagation. The importance of reliable integrations and extensible systems is emphasized, along with tools like OpenLineage and dbt. Python and open metadata platforms are highlighted for simplifying integration and managing permissions and roles across data tools.
Addressing the challenges of component integration is crucial for building a cohesive data platform architecture.
Providing efficient data delivery options and preventing unauthorized data exfiltration are key considerations in data platform management.
Deep dives
Challenges of Integrating Disparate Tools in Building a Data Platform
The podcast episode discusses the challenges of integrating disparate tools to build a comprehensive data platform. It explores the complexities of maintaining a single source of truth and a unified interface for defining platform concerns. The host shares their experience of building a data platform from scratch, focusing on the difficulties of integrating the chosen technologies and managing the friction that arises. The episode acknowledges that small teams building data platforms often opt for managed platforms or select from popular vendor combinations such as Fivetran, Snowflake, and dbt. The host emphasizes the need to onboard more users, provide a seamless user experience, and address data sharing with users outside the team or department.
Delivering Data and Managing Access in a Data Platform
The podcast episode addresses the challenges of data delivery and access management within a data platform. With data stored in a lakehouse architecture using the Iceberg table format on S3, the episode explores different methods of data delivery based on user sophistication and requirements. The host discusses options such as generating a CSV export and emailing it, providing access to an S3 bucket, or offering a simple dashboard. The episode also considers the importance of preventing unauthorized data exfiltration and the role of governance in data access. The host emphasizes the need to provide the best possible user experience and remove friction so that users are not tempted to export data into other systems.
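As one concrete illustration of the lower-friction delivery options mentioned above, a common pattern is to give less technical consumers a time-limited download link rather than broad bucket access. The following is a minimal sketch, assuming a pipeline has already written a CSV export to S3; the bucket and key names are hypothetical.

```python
import boto3

# Hypothetical bucket/key where a pipeline has already written a CSV export.
BUCKET = "analytics-exports"
KEY = "reports/monthly_summary.csv"

s3 = boto3.client("s3")

# Generate a time-limited, read-only link that can be emailed to a consumer
# instead of granting them direct access to the bucket.
url = s3.generate_presigned_url(
    ClientMethod="get_object",
    Params={"Bucket": BUCKET, "Key": KEY},
    ExpiresIn=3600,  # link is valid for one hour
)
print(url)
```

A link like this can be sent alongside (or instead of) an emailed CSV attachment, which keeps the data in governed storage while still meeting users where they are.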
Interoperability, Integration, and Building a Holistic Data Platform
The podcast episode delves into the complexities of interoperability, integration, and building a holistic data platform. The host explores the importance of adopting open standards, highlighting SQL as a long-standing standard that enables easy data exploration and self-service across different tools. The episode focuses on dbt as the de facto interface for managing transformations and discusses the challenges that potential competitors face in breaking into the market. It also touches on the benefits of integrating with widely adopted tools like Airflow and the importance of weighing the community and existing integrations when selecting tools. The host stresses the importance of extensibility and maintainability while striving for a comprehensive platform experience.
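To make the point about SQL as a lowest-common-denominator interface concrete, here is a minimal sketch of self-service exploration against Iceberg tables on S3 using DuckDB's iceberg and httpfs extensions; the table path and column names are hypothetical, S3 credentials are assumed to be configured, and the same query could just as well be run through Trino, a warehouse engine, or a BI tool.

```python
import duckdb

con = duckdb.connect()

# The iceberg and httpfs extensions let DuckDB read Iceberg metadata and S3 objects.
con.execute("INSTALL iceberg; LOAD iceberg;")
con.execute("INSTALL httpfs; LOAD httpfs;")

# Hypothetical Iceberg table location in the lakehouse.
result = con.execute(
    """
    SELECT order_date, count(*) AS orders
    FROM iceberg_scan('s3://lakehouse/warehouse/sales/orders')
    GROUP BY order_date
    ORDER BY order_date
    """
).fetchdf()

print(result.head())
```

Because the interface is plain SQL, the same exploration works regardless of which engine ultimately serves the query, which is the interoperability argument made in the episode.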
Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
Developing event-driven pipelines is going to be a lot easier - Meet Functions! Memphis functions enable developers and data engineers to build an organizational toolbox of functions to process, transform, and enrich ingested events “on the fly” in a serverless manner using AWS Lambda syntax, without boilerplate, orchestration, error handling, and infrastructure in almost any language, including Go, Python, JS, .NET, Java, SQL, and more. Go to dataengineeringpodcast.com/memphis today to get started!
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'll be sharing an update on my own journey of building a data platform, with a particular focus on the challenges of tool integration and maintaining a single source of truth
Interview
Introduction
How did you get involved in the area of data management?
data sharing
weight of history
existing integrations with dbt
switching cost for e.g. SQLMesh
de facto standard of Airflow
Single source of truth
permissions management across application layers
Database engine
Storage layer in a lakehouse
Presentation/access layer (BI)
Data flows
dbt -> table level lineage
orchestration engine -> pipeline flows
task based vs. asset based
Metadata platform as the logical place for horizontal view
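The outline above points to a metadata platform as the horizontal layer that ties together lineage from dbt and the orchestration engine. As a rough sketch of what that metadata propagation can look like, the following uses the openlineage-python client (OpenLineage is mentioned in the summary) to emit a run event for a pipeline step; the endpoint URL, namespace, and job name are hypothetical.

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

# Hypothetical OpenLineage-compatible metadata service (e.g. a Marquez instance).
client = OpenLineageClient(url="http://localhost:5000")

event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="data-platform", name="orders_daily_load"),
    producer="https://example.com/my-orchestrator",
)

# Emitting events like this from each pipeline component gives the metadata
# platform a cross-cutting view of data flows, independent of any single
# orchestrator or transformation tool.
client.emit(event)
```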
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers