

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes
Mentioned books

Jul 31, 2021 • 51min
Adding Context And Comprehension To Your Analytics Through Data Discovery With SelectStar
Summary
Companies of all sizes and industries are trying to use the data that they and their customers generate to survive and thrive in the modern economy. As a result, they are relying on a constantly growing number of data sources being accessed by an increasingly varied set of users. In order to help data consumers find and understand the data that is available, and help the data producers understand how to prioritize their work, SelectStar has built a data discovery platform that brings everyone together. In this episode Shinji Kim shares her experience as a data professional struggling to collaborate with her colleagues and how that led her to founding a company to address that problem. She also discusses the combination of technical and social challenges that need to be solved for everyone to gain context and comprehension around their most valuable asset.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Shinji Kim about SelectStar, an intelligent data discovery platform that helps you understand your data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SelectStar is and the story behind it?
What are the core challenges that organizations are facing around data cataloging and discovery?
There has been a surge in tools and services for metadata collection, data catalogs, and data collaboration. How would you characterize the current state of the ecosystem?
What is SelectStar’s role in the space?
Who are your target customers and how does that shape your prioritization of features and the user experience design?
Can you describe how SelectStar is architected?
How have the goals and design of the platform shifted or evolved since you first began working on it?
I understand that you have built integrations with a number of BI and dashboarding tools such as Looker, Tableau, Superset, etc. What are the use cases that those integrations enable?
What are the challenges or complexities involved in building and maintaining those integrations?
What are the other categories of integration that you have had to implement to make SelectStar a viable solution?
Can you describe the workflow of a team that is using SelectStar to collaborate on data engineering and analytics?
What have been the most complex or difficult problems to solve for?
What are the most interesting, innovative, or unexpected ways that you have seen SelectStar used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SelectStar?
When is SelectStar the wrong choice?
What do you have planned for the future of SelectStar?
Contact Info
LinkedIn
@shinjikim on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
SelectStar
University of Waterloo
Kafka
Storm
Concord Systems
Akamai
Snowflake
Podcast Episode
BigQuery
Looker
Podcast Episode
Tableau
dbt
Podcast Episode
OpenLineage
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 28, 2021 • 1h
Building a Multi-Tenant Managed Platform For Streaming Data With Pulsar at Datastax
Summary
Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have still been high due to the level of technical understanding and operational capacity that have been required to run at scale. Datastax has recently introduced a new managed offering for Pulsar workloads in the form of Astra Streaming that lowers those barriers and makes streaming workloads accessible to a wider audience. In this episode Prabhat Jha and Jonathan Ellis share the work that they have been doing to integrate streaming data into their managed Cassandra service. They explain how Pulsar is being used by their customers, the work that they have done to scale the administrative workload for multi-tenant environments, and the challenges of operating such a data-intensive service at large scale. This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Prabhat Jha and Jonathan Ellis about Astra Streaming, a cloud-native streaming platform built on Apache Pulsar
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the Astra platform is and the story behind it?
How does streaming fit into your overall product vision and the needs of your customers?
What was your selection process/criteria for adopting a streaming engine to complement your existing technology investment?
What are the core use cases that you are aiming to support with Astra Streaming?
Can you describe the architecture and automation of your hosted platform for Pulsar?
What are the integration points that you have built to make it work well with Cassandra?
What are some of the additional tools that you have added to your distribution of Pulsar to simplify operation and use?
What are some of the sharp edges that you have had to sand down as you have scaled up your usage of Pulsar?
What is the process for someone to adopt and integrate with your Astra Streaming service?
How do you handle migrating existing projects, particularly if they are using Kafka currently?
One of the capabilities that you highlight on the product page for Astra Streaming is the ability to execute machine learning workflows on data in flight. What are some of the supporting systems that are necessary to power that workflow?
What are the capabilities that are built into Pulsar that simplify the operational aspects of streaming ML?
What are the ways that you are engaging with and supporting the Pulsar community?
What are the near to medium term elements of the Pulsar roadmap that you are working toward and excited to incorporate into Astra?
What are the most interesting, innovative, or unexpected ways that you have seen Astra used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Astra?
When is Astra the wrong choice?
What do you have planned for the future of Astra?
Contact Info
Prabhat
LinkedIn
@prabhatja on Twitter
prabhatja on GitHub
Jonathan
LinkedIn
@spyced on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pulsar
Podcast Episode
StreamNative Episode
Datastax Astra Streaming
Datastax Astra DB
Luna Streaming Distribution
Datastax
Cassandra
Kesque (formerly Kafkaesque)
Kafka
RabbitMQ
Prometheus
Grafana
Pulsar Heartbeat
Pulsar Summit
Pulsar Summit Presentation on Kafka Connectors
Replicated
Chaos Engineering
Fallout chaos engineering tools
Jepsen
Podcast Episode
Jack VanLightly
BookKeeper TLA+ Model
Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 23, 2021 • 1h 1min
Bringing The Metrics Layer To The Masses With Transform
Summary
Collecting and cleaning data is only useful if someone can make sense of it afterward. The latest evolution in the data ecosystem is the introduction of a dedicated metrics layer to help address the challenge of adding context and semantics to raw information. In this episode Nick Handel shares the story behind Transform, a new platform that provides a managed metrics layer for your data platform. He explains the challenges that occur when metrics are maintained across a variety of systems, the benefits of unifying them in a common access layer, and the potential that it unlocks for everyone in the business to confidently answer questions with data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Nick Handel about Transform, a platform providing a dedicated metrics layer for your data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Transform is and the story behind it?
How do you define the concept of a "metric" in the context of the data platform?
What are the general strategies in the industry for creating, managing, and consuming metrics?
How has that been changing in the past couple of years?
What is driving that shift?
What are the main goals that you have for the Transform platform?
Who are the target users? How does that focus influence your approach to the design of the platform?
How is the Transform platform architected?
What are the core capabilities that are required for a metrics service?
What are the integration points for a metrics service?
Can you talk through the workflow of defining and consuming metrics with Transform?
What are the challenges that teams face in establishing consensus or a shared understanding around a given metric definition?
What are the lifecycle stages that need to be factored into the long-term maintenance of a metric definition?
What are some of the capabilities or projects that are made possible by having a metrics layer in the data platform?
What are the capabilities in downstream tools that are currently missing or underdeveloped to support the metrics store as a core layer of the platform?
What are the most interesting, innovative, or unexpected ways that you have seen Transform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Transform?
When is Transform the wrong choice?
What do you have planned for the future of Transform?
Contact Info
LinkedIn
@nick_handel on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Transform
Transform’s Metrics Framework
Transform’s Metrics Catalog
Transform’s Metrics API
Nick’s experiences using Airbnb’s Metrics Store
Get Transform
BlackRock
AirBnB
Airflow
Superset
Podcast Episode
AirBnB Knowledge Repo
AirBnB Minerva Metric Store
OLAP Cube
Semantic Layer
Master Data Management
Podcast Episode
Data Normalization
OpenLineage
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 20, 2021 • 1h 1min
Strategies For Proactive Data Quality Management
Summary
Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy about strategies for proactive data quality management and his work at Datafold to help provide tools for implementing them
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Datafold and the story behind it?
What are the biggest factors that you see contributing to data quality issues?
How are teams identifying and addressing those failures?
How does the data platform architecture impact the potential for introducing quality problems?
What are some of the potential risks or consequences of introducing errors in data processing?
How can organizations shift to being proactive in their data quality management?
How much of a role does tooling play in addressing the introduction and remediation of data quality problems?
Can you describe how Datafold is designed and architected to allow for proactive management of data quality?
What are some of the original goals and assumptions about how to empower teams to improve data quality that have been challenged or changed as you have worked through building Datafold?
What is the workflow for an individual or team who is using Datafold as part of their data pipeline and platform development?
What are the organizational patterns that you have found to be most conducive to proactive data quality management?
Who is responsible for identifying and addressing quality issues?
What are the most interesting, innovative, or unexpected ways that you have seen Datafold used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold?
When is Datafold the wrong choice?
What do you have planned for the future of Datafold?
Contact Info
LinkedIn
@glebmm on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Datafold
Autodesk
Airflow
Podcast.__init__ Episode
Spark
Looker
Podcast Episode
Amundsen
Podcast Episode
dbt
Podcast Episode
Dagster
Podcast Episode
Podcast.__init__ Episode
Change Data Capture
Podcast Episodes
Delta Lake
Podcast Episode
Trino
Podcast Episode
Presto
Parquet
Podcast Episode
Data Quality Meetup
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Special Guest: Gleb Mezhanskiy.
Support Data Engineering Podcast

Jul 16, 2021 • 1h 13min
Low Code And High Quality Data Engineering For The Whole Organization With Prophecy
Summary
There is a wealth of tools and systems available for processing data, but the user experience of integrating them and building workflows is still lacking. This is particularly important in large and complex organizations where domain knowledge and context is paramount and there may not be access to engineers for codifying that expertise. Raj Bains founded Prophecy to address this need by creating a UI-first platform for building and executing data engineering workflows that orchestrates Airflow and Spark. Rather than locking your business logic into a proprietary storage layer and only exposing it through a drag-and-drop editor, Prophecy synchronizes all of your jobs with source control, allowing an easy bi-directional interaction between code-first and no-code experiences. In this episode he shares his motivations for creating Prophecy, how he is leveraging the magic of compilers to translate between UI and code oriented representations of logic, and the organizational benefits of having a cohesive experience designed to bring business users and domain experts into the same platform as data engineers and analysts.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Raj Bains about Prophecy, a low-code data engineering platform built on Spark and Airflow
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Prophecy and the story behind it?
There are a huge number of tools and recommended architectures for every variety of data need. Why is data engineering still such a complicated and challenging undertaking?
What features and capabilities does Prophecy provide to help address those issues?
What are the roles and use cases that you are focusing on serving with Prophecy?
What are the elements of the data platform that Prophecy can replace?
Can you describe how Prophecy is implemented?
What was your selection criteria for the foundational elements of the platform?
What would be involved in adopting other execution and orchestration engines?
Can you describe the workflow of building a pipeline with Prophecy?
What are the design and structural features that you have built to manage workflows as they scale in terms of technical and organizational complexity?
What are the options for data engineers/data professionals to build and share reusable components across the organization?
What are the most interesting, innovative, or unexpected ways that you have seen Prophecy used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Prophecy?
When is Prophecy the wrong choice?
What do you have planned for the future of Prophecy?
Contact Info
LinkedIn
@_raj_bains on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Prophecy
CUDA
Apache Hive
Hortonworks
NoSQL
NewSQL
Paxos
Apache Impala
AbInitio
Teradata
Snowflake
Podcast Episode
Presto
Podcast Episode
LinkedIn
Spark
Databricks
Cron
Airflow
Astronomer
Alteryx
Streamsets
Azure Data Factory
Apache Flink
Podcast Episode
Prefect
Podcast Episode
Dagster
Podcast Episode
Podcast.__init__ Episode
Kubernetes Operator
Scala
Kafka
Abstract Syntax Tree
Language Server Protocol
Amazon Deequ
dbt
Tecton
Podcast Episode
Informatica
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 13, 2021 • 49min
Exploring The Design And Benefits Of The Modern Data Stack
Summary
We have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and "best practices" to make that task manageable. With the growing popularity of cloud services a new pattern has emerged and been dubbed the "Modern Data Stack". In this episode members of the GoDataDriven team, Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan, explain the combinations of services that comprise this architecture, share their experiences working with clients to employ the stack, and the benefits of bringing engineers and business users together with data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan about their experiences with managed services in the modern data stack in their work as consultants at GoDataDriven
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving your definition of the modern data stack?
What are the key characteristics of a tool or platform that make it a candidate for the "modern" stack?
How does the modern data stack shift the responsibilities and capabilities of data professionals and consumers?
What are some difficulties that you face when working with customers to migrate to these new architectures?
What are some of the limitations of the components or paradigms of the modern stack?
What are some strategies that you have devised for addressing those limitations?
What are some edge cases that you have run up against with specific vendors that you have had to work around?
What are the "gotchas" that you don’t run up against until you’ve deployed a service and started using it at scale and over time?
How does data governance get applied across the various services and systems of the modern stack?
One of the core promises of cloud-based and managed services for data is the ability for data analysts and consumers to self-serve. What kinds of training have you found to be necessary/useful for those end-users?
What is the role of data engineers in the context of the "modern" stack?
What are the most interesting, innovative, or unexpected manifestations of the modern data stack that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to implement a modern data stack?
When is the modern data stack the wrong choice?
What new architectures or tools are you keeping an eye on for future client work?
Contact Info
Guillermo
LinkedIn
guillesd on GitHub
Bram
LinkedIn
bramochsendorf on GitHub
Juan
LinkedIn
jmperafan on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
GoDataDriven
Deloitte
RPA == Robotic Process Automation
Analytics Engineer
James Webb Space Telescope
Fivetran
Podcast Episode
dbt
Podcast Episode
Data Governance
Podcast Episodes
Azure Cloud Platform
Stitch Data
Airflow
Prefect
Argo Project
Looker
Azure Purview
Soda Data
Podcast Episode
Datafold
Materialize
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 9, 2021 • 1h 7min
Democratize Data Cleaning Across Your Organization With Trifacta
Summary
Every data project, whether it’s analytics, machine learning, or AI, starts with the work of data cleaning. This is a critical step and benefits from being accessible to the domain experts. Trifacta is a platform for managing your data engineering workflow to make curating, cleaning, and preparing your information more approachable for everyone in the business. In this episode CEO Adam Wilson shares the story behind the business, discusses the myriad ways that data wrangling is performed across the business, and how the platform is architected to adapt to the ever-changing landscape of data management tools. This is a great conversation about how deliberate user experience and platform design can make a drastic difference in the amount of value that a business can provide to their customers.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Adam Wilson about Trifacta, a platform for modern data workers to assess quality, transform, and automate data pipelines
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Trifacta is and the story behind it?
Across your site and material you focus on using the term "data wrangling". What is your personal definition of that term, and in what ways do you differentiate from ETL/ELT?
How does the deliberate use of that terminology influence the way that you think about the design and features of the Trifacta platform?
What is Trifacta’s role in the overall data platform/data lifecycle for an organization?
What are some examples of tools that Trifacta might replace?
What tools or systems does Trifacta integrate with?
Who are the target end-users of the Trifacta platform and how do those personas direct the design and functionality?
Can you describe how Trifacta is architected?
How have the goals and design of the system changed or evolved since you first began working on it?
Can you talk through the workflow and lifecycle of data as it traverses your platform, and the user interactions that drive it?
How can data engineers share and encourage proper patterns for working with data assets with end-users across the organization?
What are the limits of scale for volume and complexity of data assets that users are able to manage through Trifacta’s visual tools?
What are some strategies that you and your customers have found useful for pre-processing the information that enters your platform to increase the accessibility for end-users to self-serve?
What are the most interesting, innovative, or unexpected ways that you have seen Trifacta used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Trifacta?
When is Trifacta the wrong choice?
What do you have planned for the future of Trifacta?
Contact Info
LinkedIn
@a_adam_wilson on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Trifacta
Informatica
UC Berkeley
Stanford University
Citadel
Podcast Episode
Stanford Data Wrangler
DBT
Podcast Episode
Pig
Databricks
Sqoop
Flume
SPSS
Tableau
SDLC == Software Development Life Cycle
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 5, 2021 • 56min
Stick All Of Your Systems And Data Together With SaaSGlue As Your Workflow Manager
Summary
At the core of every data pipeline is a workflow manager (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy, so it is important to pick something that provides the power and flexibility that you need. SaaSGlue is a managed service that lets you connect all of your systems, across clouds and physical infrastructure, spanning all of your programming languages. In this episode Bart and Rich Wood explain how SaaSGlue is architected to allow for a high degree of flexibility in usage and deployment, their experience building a business with family, and how you can get started using it today. This is a fascinating platform with an endless set of use cases and a great team of people behind it.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Rich and Bart Wood about SaaSGlue, a SaaS-based integration, orchestration, and automation platform that lets you fill the gaps in your existing automation infrastructure
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SaaSGlue is and the story behind it?
I understand that you are building this company with your 3 brothers. What have been the pros and cons of working with your family on this project?
What are the main use cases that you are focused on enabling?
Who are your target users and how has that influenced the features and design of the platform?
Orchestration, automation, and workflow management are all areas that have a range of active products and projects. How do you characterize SaaSGlue’s position in the overall ecosystem?
What are some of the ways that you see it integrated into a data platform?
What are the core elements and concepts of the SaaSGlue platform?
How is the SaaSGlue platform architected?
How have the goals and design of the platform changed or evolved since you first began working on it?
What are some of the assumptions that you had at the beginning of the project which have been challenged or changed as you worked through building it?
Can you talk through the workflow of someone building a task graph with SaaSGlue?
How do you handle dependency management for custom code in the payloads for agent tasks?
How does SaaSGlue manage metadata propagation throughout the execution graph?
How do you handle the myriad failure modes that you are likely to encounter? (e.g. agent failure, network partitions, individual task failures, etc.)
What are some of the tools/platforms/architectural paradigms that you looked to for inspiration while designing and building SaaSGlue?
What are the most interesting, innovative, or unexpected ways that you have seen SaaSGlue used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SaaSGlue?
When is SaaSGlue the wrong choice?
What do you have planned for the future of SaaSGlue?
Contact Info
Rich
LinkedIn
Bart
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
SaaSGlue
Jenkins
Cron
Airflow
Ansible
Terraform
DSL == Domain Specific Language
Clojure
Gradle
Polymorphism
Dagster
Podcast Episode
Podcast.__init__ Episode
Martin Kleppmann
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 3, 2021 • 1h 5min
Leveling Up Open Source Data Integration With Meltano Hub And The Singer SDK
Summary
Data integration in the form of extract and load is the critical first step of every data project. There are a large number of commercial and open source projects that offer that capability but it is still far from being a solved problem. One of the most promising community efforts is that of the Singer ecosystem, but it has been plagued by inconsistent quality and design of plugins. In this episode the members of the Meltano project share the work they are doing to improve the discovery, quality, and capabilities of Singer taps and targets. They explain their work on the Meltano Hub and the Singer SDK and their long term goals for the Singer community.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Douwe Maan, Taylor Murphy, and AJ Steers about their work to level up the Singer ecosystem through projects like Meltano Hub and the Singer SDK
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what the Singer ecosystem is?
What are the current weak points/challenges in the ecosystem?
What is the current role of the Meltano project/community within the ecosystem?
What are the projects and activities related to Singer that you are focused on?
What are the main goals of the Meltano Hub?
What criteria are you using to determine which projects to include in the hub?
Why is the number of targets so small?
What additional functionality do you have planned for the hub?
What functionality does the SDK provide?
How does the presence of the SDK make it easier to write taps/targets?
What do you believe the long-term impacts of the SDK on the overall availability and quality of plugins will be?
Now that you have spun out your own business and raised funding, how does that influence the priorities and focus of your work?
How do you hope to productize what you have built at Meltano?
What are the most interesting, innovative, or unexpected ways that you have seen Meltano and Singer plugins used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with the Singer community and the Meltano project?
When is Singer/Meltano the wrong choice?
What do you have planned for the future of Meltano, Meltano Hub, and the Singer SDK?
Contact Info
Douwe
Website
Taylor
LinkedIn
@tayloramurphy on Twitter
Blog
AJ
LinkedIn
@aaronsteers on Twitter
aaronsteers on GitLab
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Singer
Meltano
Podcast Episode
Meltano Hub
Singer SDK
Concert Genetics
GitLab
Snowflake
dbt
Podcast Episode
Microsoft SQL Server
Airflow
Podcast Episode
Dagster
Podcast Episode
Podcast.__init__ Episode
Prefect
Podcast Episode
AWS Athena
Reverse ETL
REST (REpresentational State Transfer)
GraphQL
Meltano Interpretation of Singer Specification
Vision for the Future of Meltano blog post
Coalesce Conference
Running Your Data Team Like A Product Team
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 29, 2021 • 1h 6min
A Candid Exploration Of Timeseries Data Analysis With InfluxDB
Summary
While the overall concept of timeseries data is uniform, its usage and applications are far from uniform. One of the most demanding applications of timeseries data is application and server monitoring, due to the problem of high cardinality. In his quest to build a generalized platform for managing timeseries, Paul Dix keeps getting pulled back into the monitoring arena. In this episode he shares the history of the InfluxDB project, the business that he has helped to build around it, and the architectural aspects of the engine that allow for its flexibility in managing various forms of timeseries data. This is a fascinating exploration of the technical and organizational evolution of the Influx Data platform, with some promising glimpses of where they are headed in the near future.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Paul Dix about Influx Data and the different facets of the market for timeseries databases
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Influx Data and the story behind it?
Timeseries data is a fairly broad category with many variations in terms of storage volume, frequency, processing requirements, etc. This has led to an explosion of database engines and related tools to address these different needs. How do you think about your position and role in the ecosystem?
Who are your target customers and how does that focus inform your product and feature priorities?
What are the use cases that Influx is best suited for?
Can you give an overview of the different projects, tools, and services that comprise your platform?
How is InfluxDB architected?
How have the design and implementation of the DB engine changed or evolved since you first began working on it?
What are you optimizing for on the consistency vs. availability spectrum of CAP?
What is your approach to clustering/data distribution beyond a single node?
For the interface to your database engine you developed a custom query language. What was your process for deciding what syntax to use and how to structure the programmatic interface?
How do you handle the lifecycle of data in an Influx deployment? (e.g. aging out old data, periodic compaction/rollups, etc.)
With your strong focus on monitoring use cases, how do you handle the challenge of high cardinality in the data being stored?
What are some of the data modeling considerations that users should be aware of as they are designing a deployment of Influx?
What is the role of open source in your product strategy?
What are the most interesting, innovative, or unexpected ways that you have seen the Influx platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Influx?
When is InfluxDB and/or the associated tools the wrong choice?
What do you have planned for the future of Influx Data?
Contact Info
LinkedIn
pauldix on GitHub
@pauldix on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Influx Data
InfluxDB
Search and Information Retrieval
Datadog
Podcast Episode
New Relic
StackDriver
Scala
Cassandra
Redis
KDB
Latent Semantic Indexing
TICK Stack
ELK Stack
Prometheus
TSM storage engine
TSI Storage Engine
Golang
Rust Language
RAFT Protocol
Telegraf
Kafka
InfluxQL
Flux Language
DataFusion
Apache Arrow
Apache Parquet
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast