

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes of the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes
Mentioned books

Apr 3, 2022 • 43min
Accelerate Development Of Enterprise Analytics With The Coalesce Visual Workflow Builder
Summary
The flexibility of software-oriented data workflows is useful for fulfilling complex requirements, but for simple and repetitious use cases it adds significant complexity. Coalesce is a platform designed to reduce repetitive work for common workflows by adopting a visual pipeline builder to support your data warehouse transformations. In this episode Satish Jayanthi explains how he is building a framework to allow enterprises to move quickly while maintaining guardrails for data workflows. This allows everyone in the business to participate in data analysis in a sustainable manner.
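For a concrete feel for the kind of repetitive warehouse work that template-driven tools like Coalesce aim to generate for you, here is a rough sketch of rendering a Type 2 slowly changing dimension update (see the Type 2 Dimensions link below) from a small column specification. The table and column names are invented for illustration, and this is not Coalesce’s actual template engine.

```python
# Hypothetical illustration of template-generated warehouse SQL; all table
# and column names are invented, and this is not Coalesce's template engine.

def type2_merge_sql(target: str, source: str, key: str, tracked: list[str]) -> str:
    """Render a Snowflake-style MERGE that expires the current version of
    changed rows in a Type 2 slowly changing dimension. A production
    template would also insert the new version of each changed row,
    typically by unioning it into the source; that step is omitted here."""
    changed = " OR ".join(f"t.{c} <> s.{c}" for c in tracked)
    cols = ", ".join(tracked)
    src_cols = ", ".join(f"s.{c}" for c in tracked)
    return f"""
MERGE INTO {target} AS t
USING {source} AS s
  ON t.{key} = s.{key} AND t.is_current
WHEN MATCHED AND ({changed}) THEN UPDATE
  SET t.is_current = FALSE, t.valid_to = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT ({key}, {cols}, valid_from, is_current)
  VALUES (s.{key}, {src_cols}, CURRENT_TIMESTAMP(), TRUE);
""".strip()

print(type2_merge_sql("dim_customer", "stg_customer", "customer_id", ["email", "segment"]))
```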
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
Your host is Tobias Macey and today I’m interviewing Satish Jayanthi about how organizations can use data architectural patterns to stay competitive in today’s data-rich environment
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Coalesce and the story behind it?
What are the core problems that you are focused on solving with Coalesce?
The platform appears to be fairly opinionated in the workflow. What are the design principles and philosophies that you have embedded into the user experience?
Can you describe how Coalesce is implemented?
What are the pitfalls in data architecture patterns that you commonly see organizations fall prey to?
How do the pre-built transformation templates in Coalesce help to guide users in a more maintainable direction?
The platform is currently tied to Snowflake as the underlying engine. How much effort will be involved in expanding your integrations and the scope of Coalesce?
What are the most interesting, innovative, or unexpected ways that you have seen Coalesce used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Coalesce?
When is Coalesce the wrong choice?
What do you have planned for the future of Coalesce?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Coalesce
Data Warehouse Toolkit
Wherescape
dbt
Podcast Episode
Type 2 Dimensions
Firebase
Kubernetes
Star Schema
Data Vault
Podcast Episode
Data Mesh
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 3, 2022 • 47min
Repeatable Patterns For Designing Data Platforms And When To Customize Them
Summary
Building a data platform for your organization is a challenging undertaking. Building multiple data platforms for other organizations as a service without burning out is another thing entirely. In this episode Brandon Beidel from Red Ventures shares his experiences as a data product manager in charge of helping his customers build scalable analytics systems that fit their needs. He explains the common patterns that have been useful across multiple use cases, as well as when and how to build customized solutions.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl.
Hey Data Engineering Podcast listeners, want to learn how the Joybird data team reduced their time spent building new integrations and managing data pipelines by 93%? Join our live webinar on April 20th. Joybird director of analytics, Brett Trani, will walk through how retooling their data stack with RudderStack, Snowflake, and Iterable made this possible. Visit www.rudderstack.com/joybird to register today.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Brandon Beidel about his data platform journey at Red Ventures
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Red Ventures is and your role there?
Given the relative newness of data product management, where do you draw inspiration and direction for how to approach your work?
What are the primary categories of data product that your data consumers are building/relying on?
What are the types of data sources that you are working with to power those downstream use cases?
Can you describe the size and composition/organization of your data team(s)?
How do you approach the build vs. buy decision while designing and evolving your data platform?
What are the tools/platforms/architectural and usage patterns that you and your team have developed for your platform?
What are the primary goals and constraints that have contributed to your decisions?
How have the goals and design of the platform changed or evolved since you started working with the team?
You recently went through the process of establishing and reporting on SLAs for your data products. Can you describe the approach you took and the useful lessons that were learned?
What are the technical and organizational components of the data work at Red Ventures that have proven most difficult?
What excites you most about the future of data engineering?
What are the most interesting, innovative, or unexpected ways that you have seen teams building more reliable data systems?
What aspects of data tooling or processes are still missing for most data teams?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data products at Red Ventures?
What do you have planned for the future of your data platform?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Red Ventures
Monte Carlo
Opportunity Cost
dbt
Podcast Episode
Apache Ranger
Privacera
Podcast Episode
Segment
Fivetran
Podcast Episode
Databricks
Bigquery
Redshift
Hightouch
Podcast Episode
Airflow
Astronomer
Podcast Episode
Airbyte
Podcast Episode
Clickhouse
Podcast Episode
Presto
Podcast Episode
Trino
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 27, 2022 • 47min
Eliminate The Bottlenecks In Your Key/Value Storage With SpeeDB
Summary
At the foundational layer many databases and data processing engines rely on key/value storage for managing the layout of information on the disk. RocksDB is one of the most popular choices for this component and has been incorporated into popular systems such as ksqlDB. As these systems are scaled to larger volumes of data and higher throughputs, the RocksDB engine can become a performance bottleneck. In this episode Adi Gelvan shares the work that he and his team at SpeeDB have put into building a drop-in replacement for RocksDB that eliminates that bottleneck. He explains how they redesigned the core algorithms and storage management features to deliver ten times faster throughput, how the lower latencies work to reduce the burden on platform engineers, and how they are working toward an open source offering so that you can try it yourself with no friction.
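For listeners who want a concrete picture of the LSM (log-structured merge) tree that underlies RocksDB and comes up throughout this conversation, here is a deliberately tiny Python sketch of the LSM write path, just to show where read and write amplification come from as runs accumulate. It bears no resemblance to how RocksDB or SpeeDB are actually implemented.

```python
# Toy log-structured merge (LSM) write path: writes land in a memtable, which
# flushes to immutable sorted runs; reads check the memtable and then runs
# from newest to oldest. A teaching sketch only -- real engines add a WAL,
# compaction, and Bloom filters, and this is not how RocksDB/SpeeDB work.

import bisect

class ToyLSM:
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict[str, str] = {}
        self.runs: list[list[tuple[str, str]]] = []  # newest run first
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.runs.insert(0, sorted(self.memtable.items()))  # flush
            self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        # Every extra run is another binary search: read amplification.
        for run in self.runs:
            i = bisect.bisect_left(run, (key, ""))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = ToyLSM()
for i in range(10):
    db.put(f"k{i}", f"v{i}")
print(db.get("k3"))  # -> "v3"
```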
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale
Your host is Tobias Macey and today I’m interviewing Adi Gelvan about his work on SpeeDB, the "next generation data engine"
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SpeeDB is and the story behind it?
What is your target market and customer?
What are some of the shortcomings of RocksDB that these organizations are running into and how do they manifest?
What are the characteristics of RocksDB that have led so many database engines to embed it or build on top of it?
Which of the systems that rely on RocksDB do you most commonly see running into its limitations?
How does the work you have done at SpeeDB compare to the efforts of the Terark project?
Can you describe how you approached the work of identifying areas for improvement in RocksDB?
What are some of the optimizations that you introduced?
What are some tradeoffs that you deemed acceptable in the process of optimizing for speed and scale?
What is the integration process for adopting SpeeDB?
In the event that an organization has a system with data resident in RocksDB, what is the migration process?
What are the most interesting, innovative, or unexpected ways that you have seen SpeeDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SpeeDB?
When is SpeeDB the wrong choice?
What do you have planned for the future of SpeeDB?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
SpeeDB
RocksDB
TerarkDB
EMC
Infinidat
LSM == Log-Structured Merge Tree
B+ Tree
LevelDB
LMDB
Bloom Filter
Badger
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 27, 2022 • 1h 3min
Building A Data Governance Bridge Between Cloud And Datacenters For The Enterprise At Privacera
Summary
Data governance is a practice that requires a high degree of flexibility and collaboration at the organizational and technical levels. The growing prominence of cloud and hybrid environments in data management adds additional stress to an already complex endeavor. Privacera is an enterprise-grade solution for cloud and hybrid data governance built on top of the robust and battle-tested Apache Ranger project. In this episode Balaji Ganesan shares how his experiences building and maintaining Ranger in previous roles helped him understand the needs of organizations and engineers as they define and evolve their data governance policies and practices.
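As context for the conversation, Ranger models authorization as resource-based policies managed through a REST API. The sketch below shows roughly what creating such a policy looks like; the host, credentials, service name, and the abbreviated policy shape are all placeholders, so check the Apache Ranger documentation before relying on it.

```python
# Rough sketch of creating a resource-based policy via Apache Ranger's public
# REST API. Host, credentials, and service/table names are placeholders, and
# the policy JSON is abbreviated -- consult the Ranger docs for the full shape.

import requests

policy = {
    "service": "hive_prod",  # hypothetical Ranger service name
    "name": "analysts-read-orders",
    "resources": {
        "database": {"values": ["sales"]},
        "table": {"values": ["orders"]},
        "column": {"values": ["*"]},
    },
    "policyItems": [
        {"groups": ["analysts"],
         "accesses": [{"type": "select", "isAllowed": True}]}
    ],
}

resp = requests.post(
    "http://ranger.example.com:6080/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "changeme"),  # placeholder credentials
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```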
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Balaji Ganesan about his work at Privacera and his view on the state of data governance, access control, and security in the cloud
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Privacera is and the story behind it?
What is your working definition of "data governance" and how does that influence your product focus and priorities?
What are some of the lessons that you learned from your work on Apache Ranger that helped with your efforts at Privacera?
How would you characterize your position in the market for data governance/data security tools?
What are the unique constraints and challenges that come into play when managing data in cloud platforms?
Can you explain how the Privacera platform is architected?
How have the design and goals of the system changed or evolved since you started working on it?
What is the workflow for an operator integrating Privacera into a data platform?
How do you provide feedback to users about the level of coverage for discovered data assets?
How does Privacera fit into the workflow of the different personas working with data?
What are some of the security and privacy controls that Privacera introduces?
How do you mitigate the potential for anyone to bypass Privacera’s controls by interacting directly with the underlying systems?
What are the most interesting, innovative, or unexpected ways that you have seen Privacera used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Privacera?
When is Privacera the wrong choice?
What do you have planned for the future of Privacera?
Contact Info
LinkedIn
@Balaji_Blog on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Privacera
Hadoop
Hortonworks
Apache Ranger
Oracle
Teradata
Presto/Trino
Starburst
Podcast Episode
Ahana
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By: Acryl
The modern data stack needs a reimagined metadata management platform. Acryl Data’s vision is to bring clarity to your data through its next generation multi-cloud metadata management platform. Founded by the leaders that created projects like LinkedIn DataHub and Airbnb Dataportal, Acryl Data enables delightful search and discovery, data observability, and federated governance across data ecosystems. Sign up for the SaaS product today at dataengineeringpodcast.com/acryl
Support Data Engineering Podcast

Mar 20, 2022 • 57min
Exploring Incident Management Strategies For Data Teams
Summary
Data assets and the pipelines that create them have become critical production infrastructure for companies. This adds a requirement for reliability and uptime management similar to that of application infrastructure. In this episode Francisco Alberini and Mei Tao share their insights on what incident management looks like for data platforms and the teams that support them.
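One recurring theme of the conversation is deciding which signals to monitor. As a minimal illustration, here is a sketch of a table freshness check that fires a webhook when data falls behind its SLA; the connection, table, and alert endpoint are all placeholders, and a real team would wire this into tools like the ones linked below.

```python
# Minimal sketch of one incident signal discussed in the episode: table
# freshness. Everything here is a placeholder -- swap sqlite3 for your
# warehouse driver and the webhook URL for your alerting tool of choice.

import datetime as dt
import sqlite3  # stand-in for a real warehouse connection

import requests

SLA = dt.timedelta(hours=6)

def check_freshness(conn, table: str, ts_column: str) -> None:
    (latest,) = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
    # Assumes timestamps are stored as naive UTC ISO-8601 strings.
    lag = dt.datetime.utcnow() - dt.datetime.fromisoformat(latest)
    if lag > SLA:
        # e.g. a PagerDuty/OpsGenie ingestion URL in a real deployment
        requests.post(
            "https://alerts.example.com/hook",
            json={"summary": f"{table} is {lag} behind its {SLA} freshness SLA"},
            timeout=5,
        )

# Usage: check_freshness(sqlite3.connect("warehouse.db"), "orders", "loaded_at")
```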
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
Your host is Tobias Macey and today I’m interviewing Francisco Alberini and Mei Tao about patterns and practices for incident management in data teams
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing some of the ways that an "incident" can manifest in a data system?
At a high level, what are the steps and participants required to bring an incident to resolution?
The principle of incident management is familiar to application/site reliability teams. What is the current state of the art/adoption for these practices among data teams?
What are the signals that teams should be monitoring to identify and alert on potential incidents?
Alerting is a subjective and nuanced practice, regardless of the context. What are some useful practices that you have seen and enacted to reduce alert fatigue and provide useful context in the alerts that do get sent?
Another aspect of this problem is the proper routing of alerts to ensure that the right person sees and acts on it. How have you seen teams deal with the challenge of delivering alerts to the right people?
When there is an active incident, what are the steps that you commonly see data teams take to understand the cause and scope of the issue?
How can teams augment their systems to make incidents faster to resolve?
What are the most interesting, innovative, or unexpected ways that you have seen teams approach incident response?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on incident management strategies?
What are the aspects of incident management for data teams that are still missing?
Contact Info
Mei
@tao_mei on Twitter
Email
Francisco
@falberini on Twitter
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Monte Carlo
Learn more about RCA best practices
Segment
Podcast Episode
Segment Protocols
Redshift
Airflow
dbt
Podcast Episode
The Goal by Eliyahu Goldratt
Data Mesh
Podcast Episode
Follow-Up Podcast Episode
PagerDuty
OpsGenie
Grafana
Prometheus
Sentry
Podcast.__init__ Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 20, 2022 • 1h 13min
Accelerate Your Embedded Analytics With Apache Pinot
Summary
Data and analytics are permeating every system, including customer-facing applications. The introduction of embedded analytics to an end-user product creates a significant shift in requirements for your data layer. The Pinot OLAP datastore was created for this purpose, optimized for low-latency queries over rapidly updating datasets under high concurrency. In this episode Kishore Gopalakrishna and Xiang Fu explain how it is able to achieve those characteristics, their work at StarTree to make it more easily available, and how you can start using it for your own high throughput data workloads today.
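For a sense of what user-facing analytics looks like in practice, queries are sent to a Pinot broker as SQL over HTTP. The sketch below assumes a broker at a placeholder URL and a hypothetical pageviews table; verify the endpoint and response shape against the Pinot documentation for your version.

```python
# Sketch of querying Apache Pinot through a broker's SQL endpoint over HTTP.
# The broker URL and the `pageviews` table are placeholders; verify the
# endpoint and response shape against the Pinot docs for your version.

import requests

resp = requests.post(
    "http://pinot-broker.example.com:8099/query/sql",
    json={"sql": ("SELECT country, COUNT(*) AS views FROM pageviews "
                  "GROUP BY country ORDER BY views DESC LIMIT 5")},
    timeout=5,
)
resp.raise_for_status()
for row in resp.json()["resultTable"]["rows"]:
    print(row)
```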
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product today at dataengineeringpodcast.com/acryl.
Your host is Tobias Macey and today I’m interviewing Kishore Gopalakrishna and Xiang Fu about Apache Pinot and its applications for powering user-facing analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Pinot is and the story behind it?
What are the primary use cases that Pinot is designed to support?
There are numerous OLAP engines available with varying tradeoffs and optimal use cases. What are the cases where Pinot is the preferred choice?
How does it compare to systems such as Clickhouse (for OLAP) or CubeJS/GoodData (for embedded analytics)?
How do the operational needs of a database engine change as you move from serving internal stakeholders to external end-users?
Can you describe how Pinot is architected?
What were the key design elements that were necessary to support low-latency queries with high concurrency?
Can you describe a typical end-to-end architecture where Pinot will be used for embedded analytics?
What are some of the tools/technologies/platforms/design patterns that Pinot might replace or obviate?
What are some of the useful lessons related to data modeling that users of Pinot should consider?
What are some edge cases that they might encounter due to details of how the storage layer is architected? (e.g. data tiering, tail latencies, etc.)
What are some heuristics that you have developed for understanding how to manage data lifecycles in a user-facing analytics application?
What are some of the ways that users might need to customize Pinot for their specific use cases and what options do they have for extending it?
What are the most interesting, innovative, or unexpected ways that you have seen Pinot used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pinot?
When is Pinot the wrong choice?
What do you have planned for the future of Pinot?
Contact Info
Kishore
LinkedIn
@KishoreBytes on Twitter
Xiang
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Apache Pinot
StarTree
Espresso
Apache Helix
Apache Gobblin
Apache S4
Kafka
Lucene
StarTree Index
Presto
Trino
Pulsar
Podcast Episode
Spark
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 14, 2022 • 1h 3min
Taking A Multidimensional Approach To Data Observability At Acceldata
Summary
Data observability is a term that has been co-opted by numerous vendors with varying ideas of what it should mean. At Acceldata, they view it as a holistic approach to understanding the computational and logical elements that power your analytical capabilities. In this episode Tristan Spaulding, head of product at Acceldata, explains the multi-dimensional nature of gaining visibility into your running data platform and how they have architected their platform to assist in that endeavor.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale
Your host is Tobias Macey and today I’m interviewing Tristan Spaulding about Acceldata, a platform offering multidimensional data observability for modern data infrastructure
Interview
Introduction
How did you get involved in the area of data?
Can you describe what Acceldata is and the story behind it?
What does it mean for a data observability platform to be "multidimensional"?
How do the architectural characteristics of the "modern data stack" influence the requirements and implementation of data observability strategies?
The data observability ecosystem has seen a lot of activity over the past ~2-3 years. What are the unique capabilities/use cases that Acceldata supports?
Who are your target users and how does that focus influence the way that you have approached feature and design priorities?
What are some of the ways that you are using the Acceldata platform to run Acceldata?
Can you describe how the Acceldata platform is implemented?
How have the design and goals of the system changed or evolved since you started working on it?
How are you managing the definition, collection, and correlation of events across stages of the data lifecycle?
What are some of the ways that performance data can feed back into the debugging and maintenance of an organization’s data ecosystem?
What are the challenges that data platform owners face when trying to interpret the metrics and events that are available in a system like Acceldata?
What are the most interesting, innovative, or unexpected ways that you have seen Acceldata used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Acceldata?
When is Acceldata the wrong choice?
What do you have planned for the future of Acceldata?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
Acceldata
Semantic Web
Hortonworks
dbt
Podcast Episode
Firebolt
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 14, 2022 • 54min
Accelerating Adoption Of The Modern Data Stack At 5X Data
Summary
The modern data stack is a constantly moving target, which makes it difficult to adopt without prior experience. To accelerate the time to deliver useful insights at organizations of all sizes looking to take advantage of these new and evolving architectures, Tarush Aggarwal founded 5X Data. In this episode he explains how he works with these companies to deploy the technology stack and pairs them with an experienced engineer who assists with the implementation and training to let them realize the benefits of this architecture. He also shares his thoughts on the current state of the ecosystem for modern data vendors and trends to watch as we move into the future.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all data users can apply software engineering best practices – git, tests, and continuous deployment – with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Your host is Tobias Macey and today I’m interviewing Tarush Aggarwal about how he and his team are helping organizations streamline adoption of the modern data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are doing at 5x and the story behind it?
How has your focus and operating model shifted since we spoke a year ago?
What are the biggest shifts in the market for data management that you have seen in that time?
What are the main challenges that your customers are facing when they start working with you?
What are the components that you are relying on to build repeatable data platforms for your customers?
What are the sharp edges that you have had to smooth out to scale your implementation of those systems?
What do you see as the white spaces that still exist in the offerings available for the "modern data stack"?
With the rapid introduction of so many new products in the data ecosystem, what are the categories that you see as being a long-term necessity?
What are the areas that you predict will merge and consolidate over the next 3 – 5 years?
What are the most interesting, innovative, or unexpected types of problems that you and your collaborators have had the opportunity to work on?
What are the most interesting, unexpected, or challenging lessons that you have learned while building the 5x organization?
When is 5x the wrong choice?
What do you have planned for the future of 5x?
Contact Info
LinkedIn
@tarush on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
5X Data
Podcast Interview
Snowflake
Podcast Interview
dbt
Podcast Interview
Fivetran
Podcast Interview
Looker
Podcast Interview
Matt Turck State of Data
Mixpanel
Amplitude
Heap
Podcast Episode
Bigquery
Narrator
Podcast Episode
Marquez
Podcast Episode
Atlan
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 5, 2022 • 1h 17min
Move Your Database To The Data And Speed Up Your Analytics With DuckDB
Summary
When you think about selecting a database engine for your project you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single user workloads. DuckDB is an in-process database engine optimized for OLAP applications, designed to speed up your analytical queries while meeting you where you are, whether that’s Python, R, Java, or even the web. In this episode, Hannes Mühleisen, co-creator and CEO of DuckDB Labs, shares the motivations for creating the project, the myriad ways that it can be used to speed up your data projects, and the detailed engineering efforts that go into making it adaptable to any environment. This is a fascinating and humorous exploration of a truly useful piece of technology.
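If you want to try the workflow Hannes describes, the whole database is one library import away. A minimal sketch, assuming `pip install duckdb` and a local Parquet file named events.parquet with a user_id column (both placeholders):

```python
# DuckDB runs in-process: no server to stand up, just import and query.
# Assumes `pip install duckdb` and a local Parquet file named events.parquet
# containing a user_id column (both are illustrative placeholders).

import duckdb

con = duckdb.connect()  # in-memory database
rows = con.execute(
    "SELECT user_id, COUNT(*) AS n "
    "FROM 'events.parquet' "
    "GROUP BY user_id ORDER BY n DESC LIMIT 10"
).fetchall()
print(rows)
```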
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing Hannes Mühleisen about DuckDB, an in-process embedded database engine for columnar analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what DuckDB is and the story behind it?
Where did the name come from?
What are some of the use cases that DuckDB is designed to support?
The interface for DuckDB is similar (at least in spirit) to SQLite. What are the deciding factors for when to use one vs. the other?
How might they be used in concert to take advantage of their relative strengths?
What are some of the ways that DuckDB can be used to better effect than options provided by different language ecosystems?
Can you describe how DuckDB is implemented?
How has the design and goals of the project changed or evolved since you began working on it?
What are some of the optimizations that you have had to make in order to support performant access to data that exceeds available memory?
Can you describe a typical workflow of incorporating DuckDB into an analytical project?
What are some of the libraries/tools/systems that DuckDB might replace in the scope of a project or team?
What are some of the overlooked/misunderstood/under-utilized features of DuckDB that you would like to highlight?
What is the governance model and plan long-term sustainability of the project?
What are the most interesting, innovative, or unexpected ways that you have seen DuckDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckDB?
When is DuckDB the wrong choice?
What do you have planned for the future of DuckDB?
Contact Info
Hannes Mühleisen
@hfmuehleisen on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
DuckDB
CWI
SQLite
OLAP == Online Analytical Processing
Duck Typing
ZODB
Teradata
HTAP == Hybrid Transactional/Analytical Processing
Pandas
Podcast.__init__ Episode
Apache Arrow
Julia Language
Voltron Data
Parquet
Thrift
Protobuf
Vectorized Query Processor
LLVM
DuckDB Labs
DuckDB Foundation
MIT Open Courseware (OCW)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 5, 2022 • 50min
Developer Friendly Application Persistence That Is Fast And Scalable With HarperDB
Summary
Databases are an important component of application architectures, but they are often difficult to work with. HarperDB was created with the core goal of being a developer-friendly database engine. In the process they ended up creating a scalable distributed engine that works across edge and datacenter environments to support a variety of novel use cases. In this episode co-founder and CEO Stephen Goldberg shares the history of the project, how it is architected to achieve their goals, and how you can start using it today.
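As context for the interview, HarperDB exposes its functionality as JSON “operations” posted to a single HTTP endpoint, so NoSQL-style writes and SQL reads go through the same door. The host, schema, table, and credentials below are placeholders; verify the operation shapes against the HarperDB documentation.

```python
# Sketch of HarperDB's operation-based HTTP API: each request is a JSON
# document posted to one endpoint. Host, schema, table, and credentials are
# placeholders -- verify the operation shapes against the HarperDB docs.

import requests

HDB_URL = "http://localhost:9925"
AUTH = ("HDB_ADMIN", "password")  # placeholder credentials

def hdb(operation: dict) -> dict:
    resp = requests.post(HDB_URL, json=operation, auth=AUTH, timeout=10)
    resp.raise_for_status()
    return resp.json()

# NoSQL-style insert; attributes can vary per record (dynamic schema).
hdb({
    "operation": "insert",
    "schema": "dev",
    "table": "dog",
    "records": [{"id": 1, "name": "Harper", "age": 7}],
})

# The same data is immediately queryable with SQL.
print(hdb({"operation": "sql", "sql": "SELECT * FROM dev.dog WHERE age > 5"}))
```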
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all data users can apply software engineering best practices – git, tests, and continuous deployment – with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
Your host is Tobias Macey and today I’m interviewing Stephen Goldberg about HarperDB, a developer-friendly distributed database engine designed to scale across edge and cloud environments
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what HarperDB is and the story behind it?
There has been an explosion of database engines over the past 5 – 10 years, with each entrant offering specific capabilities. What are the use cases that HarperDB is focused on addressing?
What are the issues that you experienced with existing database engines that led to the creation of HarperDB?
In what ways does HarperDB address those issues?
What are some of the ways that the focus on developers has influenced the interfaces and features of HarperDB?
What is your view on the role of the database in the near to medium future?
Can you describe how HarperDB is implemented?
How have the design and goals changed from when you first started working on it?
One of the common difficulties in document-oriented databases is being able to conduct performant joins. What are the considerations that users need to be aware of as they are designing their data models?
What are some examples of deployment topologies that HarperDB can support given the pub/sub replication model?
What are some of the data modeling/database design strategies that users of HarperDB should know in order to take full advantage of its capabilities?
With the dynamic schema capabilities allowing developers to add attributes and mutate the table structure at any point, what are the options for schema enforcement? (e.g., one record adds an integer attribute and another record then tries to write a string to that attribute)
What are the most interesting, innovative, or unexpected ways that you have seen HarperDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on HarperDB?
When is HarperDB the wrong choice?
What do you have planned for the future of HarperDB?
Contact Info
LinkedIn
@sgoldberg on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
HarperDB
@harperdbio on Twitter
Mulesoft
Zapier
LMDB
SocketIO
SocketCluster
MongoDB
CouchDB
PostgreSQL
VoltDB
Heroku
SAP/Hana
NodeJS
DynamoDB
CockroachDB
Podcast Episode
Fastify
HTAP == Hybrid Transactional/Analytical Processing
Splunk
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast