

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Oct 5, 2020 • 1h 1min
Self Service Real Time Data Integration Without The Headaches With Meroxa
Summary
Analytical workloads require a well engineered and well maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy. Meroxa is a new platform that aims to automate the heavy lifting of change data capture, monitoring, and data loading. In this episode founders DeVaris Brown and Ali Hamidi explain how their tenure at Heroku informed their approach to making data integration self service, how the platform is architected, and how they have designed their system to adapt to the continued evolution of the data ecosystem.
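For anyone new to change data capture, the sketch below shows the core mechanic in miniature: a stream of change events being applied in order to keep a downstream copy of a table in sync. The "op"/"before"/"after" field names follow the common Debezium convention and are an illustration, not Meroxa’s actual wire format.

    # Apply CDC events to an in-memory copy of a table keyed by "id".
    from typing import Optional

    def apply_change(event: dict, table: dict) -> None:
        op = event["op"]  # "c" = create, "u" = update, "d" = delete
        before: Optional[dict] = event.get("before")  # row image before the change
        after: Optional[dict] = event.get("after")    # row image after the change
        if op in ("c", "u"):
            table[after["id"]] = after      # upsert the new row image
        elif op == "d":
            table.pop(before["id"], None)   # drop the deleted row

    table: dict = {}
    apply_change({"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}}, table)
    apply_change({"op": "u", "before": {"id": 1}, "after": {"id": 1, "email": "b@example.com"}}, table)
    assert table == {1: {"id": 1, "email": "b@example.com"}}

A platform like Meroxa takes on everything around this loop: extracting the events from the source database, monitoring the stream, and loading the results into the destination.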
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing DeVaris Brown and Ali Hamidi about Meroxa, a new platform as a service for data integration
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Meroxa and what motivated you to turn it into a business?
What are the lessons that you learned from your time at Heroku which you are applying to your work on Meroxa?
Who are your target users and what are your guiding principles for designing the platform interface?
What are the common difficulties that engineers face in building and maintaining data infrastructure?
There are a variety of platforms that offer solutions for managing data integration, or powering end-to-end analytics, or building machine learning pipelines. What are the shortcomings of those existing options that might lead someone to choose Meroxa?
How is the Meroxa platform architected?
What are some of the initial assumptions that you had which have been challenged as you proceed with implementation?
What new capabilities does Meroxa bring to someone who uses it for integrating their application data?
What are the growth options for organizations that get started with Meroxa?
What are the core principles that you are focused on to allow for evolving your platform over the long run as the surrounding ecosystem continues to mature?
When is Meroxa the wrong choice?
What do you have planned for the future?
Contact Info
DeVaris Brown
Ali Hamidi
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Meroxa
Heroku
Heroku Kafka
Ascend
StreamSets
Nexus
Kafka Connect
Airflow
Podcast.__init__ Episode
Spark
Data Engineering Episode
Change Data Capture
Segment
Podcast Episode
Rudderstack
MParticle
Debezium
Podcast Episode
DBT
Podcast Episode
Materialize
Podcast Episode
Stitch Data
Fivetran
Podcast Episode
Elasticsearch
Podcast Episode
gRPC
GraphQL
REST == REpresentational State Transfer
Dagster/Elementl
Data Engineering Podcast Episode
Podcast.__init__ Episode
Prefect
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 29, 2020 • 60min
Speed Up And Simplify Your Streaming Data Workloads With Red Panda
Summary
Kafka has become a de facto standard interface for building decoupled systems and working with streaming data. Despite its widespread popularity, there are numerous accounts of the difficulty that operators face in keeping it reliable and performant, or trying to scale an installation. To make the benefits of the Kafka ecosystem more accessible and reduce the operational burden, Alexander Gallego and his team at Vectorized created the Red Panda engine. In this episode he explains how they engineered a drop-in replacement for Kafka, replicating the numerous APIs, that can scale more easily and deliver consistently low latencies with a much lower hardware footprint. He also shares some of the areas of innovation that they have found to help foster the next wave of streaming applications while working within the constraints of the existing Kafka interfaces. This was a fascinating conversation with an energetic and enthusiastic engineer and founder about the challenges and opportunities in the realm of streaming data.
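To make "drop-in replacement" concrete, here is a minimal sketch of unmodified Kafka client code pointed at a Red Panda broker; the only thing that changes is the bootstrap address. It assumes the kafka-python package and a Red Panda node listening on localhost:9092 (the address is an assumption for this example).

    from kafka import KafkaConsumer, KafkaProducer

    # A stock Kafka producer, unaware that the broker is Red Panda
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b"hello from an unmodified Kafka client")
    producer.flush()

    # A stock Kafka consumer reading the same topic back
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating after 5s of silence
    )
    for message in consumer:
        print(message.value)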
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
If you’re looking for a way to optimize your data engineering pipeline – with instant query performance – look no further than Qubz. Qubz is next-generation OLAP technology built for the scale of Big Data from UST Global, a renowned digital services provider. Qubz lets users and enterprises analyze data on the cloud and on-premise, with blazing speed, while eliminating the complex engineering required to operationalize analytics at scale. With an emphasis on visual data engineering, connectors for all major BI tools and data sources, Qubz allow users to query OLAP cubes with sub-second response times on hundreds of billions of rows. To learn more, and sign up for a free demo, visit dataengineeringpodcast.com/qubz.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Alexander Gallego about his work at Vectorized building Red Panda as a performance optimized, drop-in replacement for Kafka
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Red Panda is and what motivated you to create it?
What are the limitations of Kafka that make something like Red Panda necessary?
What are the current strengths of the Kafka ecosystem that make it a reasonable implementation target for Red Panda?
How is Red Panda architected?
How has the design or direction changed or evolved since you first began working on it?
What are the challenges that you face in automatically optimizing the runtime to take advantage of the hardware that it is deployed on?
How do cloud environments contribute to that complexity?
How are you handling the compatibility layer for the Kafka API?
What is your approach for managing versioning and ensuring that you maintain bug compatibility?
Beyond performance, what other areas of innovation or improvement in the capabilities and experience do you see while adhering to the Kafka protocol?
What are the opportunities for innovation in the streaming space that aren’t being explored yet?
What are some of the most interesting, innovative, or unexpected ways that you have seen Red Panda being used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Red Panda and Vectorized?
When is Red Panda the wrong choice?
What do you have planned for the future of the product and business?
What is your Hack The Planet diversity scholarship?
Contact Info
@emaxerrno on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Vectorized
Free Download Trial
@vectorizedio company Twitter account
Community Slack
Concord (alternative to Flink)
Apache Flink
Podcast Episode
FAANG == Facebook, Apple, Amazon, Netflix, and Google
Backblaze
Raft
NATS
Pulsar
Podcast Episode
StreamNative Podcast Episode
Open Messaging Specification
ScyllaDB
CockroachDB
MemSQL
WASM == WebAssembly
Debezium
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 22, 2020 • 48min
Cutting Through The Noise And Focusing On The Fundamentals Of Data Engineering With The Data Janitor
Summary
Data engineering is a constantly growing and evolving discipline. There are always new tools, systems, and design patterns to learn, which leads to a great deal of confusion for newcomers. Daniel Molnar has dedicated his time to helping data professionals get back to basics through presentations at conferences and meetups, and with his most recent endeavor of building the Pipeline Data Engineering Academy. In this episode he shares advice on how to cut through the noise, which principles are foundational to building a successful career as a data engineer, and his approach to educating the next generation of data practitioners. This was a useful conversation for anyone working with data who has found themselves spending too much time chasing the latest trends and wishes to develop a more focused approach to their work.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Daniel Molnar about being a data janitor and how to cut through the hype to understand what to learn for the long run
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing your thoughts on the current state of the data management industry?
What is your strategy for being effective in the face of so much complexity and conflicting needs for data?
What are some of the common difficulties that you see data engineers contend with, whether technical or social/organizational?
What are the core fundamentals that you think are necessary for data engineers to be effective?
What are the gaps in knowledge or experience that you have seen data engineers contend with?
You recently started down the path of building a bootcamp for training data engineers. What was your motivation for embarking on that journey?
How would you characterize your particular approach?
What are some of the reasons that your applicants have for wanting to become versed in data engineering?
What is the baseline of capabilities that you expect of your target audience?
What level of proficiency do you aim for when someone has completed your training program?
Who do you think would not be a good fit for your academy?
As a hiring manager, what are the core capabilities that you look for in a data engineering candidate?
What are some of the methods that you use to assess competence?
What are the overall trends in the data management space that you are worried by?
Which ones are you happy about?
What are your plans and overall goals for the pipeline academy?
Contact Info
LinkedIn
@soobrosa on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Pipeline Data Engineering Academy
Data Janitor 101
The Data Janitor Returns
Berlin, Germany
Hungary
Urchin (Google Analytics precursor)
AWS Redshift
Nassim Nicholas Taleb
Black Swans (affiliate link)
KISS == Keep It Simple Stupid
Dan McKinley
Ralph Kimball Data Warehousing design
Falsehoods Programmers Believe
Apache Kafka
AWS Kinesis
ETL/ELT
CI/CD
Telemetry
Depeche Mode
Designing Data Intensive Applications (affiliate link)
Stop Hiring DevOps Engineers and Start Growing Them
T Shaped Engineer
Pipeline Data Engineering Academy Curriculum
MPP == Massively Parallel Processing
Apache Flink
Podcast Episode
Flask web framework
YAGNI == You Ain’t Gonna Need It
Pair Programming
Clojure
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 15, 2020 • 44min
Distributed In Memory Processing And Streaming With Hazelcast
Summary
In memory computing provides significant performance benefits, but brings along challenges for managing failures and scaling up. Hazelcast is a platform for managing stateful in-memory storage and computation across a distributed cluster of commodity hardware. On top of this foundation, the Hazelcast team has also built a streaming platform for reliable high throughput data transmission. In this episode Dale Kim shares how Hazelcast is implemented, the use cases that it enables, and how it complements on-disk data management systems.
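As a concrete taste of the programming model, here is a minimal sketch of reading and writing a distributed map from the Hazelcast Python client (the 4.x client API, installed with pip install hazelcast-python-client); it assumes a cluster member is reachable on the default localhost address.

    import hazelcast

    client = hazelcast.HazelcastClient()  # connects to a member on localhost by default
    # get_map returns an async proxy; .blocking() gives a synchronous view
    ratings = client.get_map("movie-ratings").blocking()
    ratings.put("The Matrix", 8.7)     # the entry lives in memory, partitioned across the cluster
    print(ratings.get("The Matrix"))   # served from whichever node owns the key
    client.shutdown()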
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Tree Schema is a data catalog that is making metadata management accessible to everyone. With Tree Schema you can create your data catalog and have it fully populated in under five minutes when using one of the many automated adapters that can connect directly to your data stores. Tree Schema includes essential cataloging features such as first class support for both tabular and unstructured data, data lineage, rich text documentation, asset tagging and more. Built from the ground up with a focus on the intersection of people and data, your entire team will find it easier to foster collaboration around your data. With the most transparent pricing in the industry – $99/mo for your entire company – and a money-back guarantee for excellent service, you’ll love Tree Schema as much as you love your data. Go to dataengineeringpodcast.com/treeschema today to get your first month free, and mention this podcast to get 50% off your first three months after the trial.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Dale Kim about Hazelcast, a distributed in-memory computing platform for data intensive applications
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Hazelcast is and its origins?
What are the benefits and tradeoffs of in-memory computation for data-intensive workloads?
What are some of the common use cases for the Hazelcast in memory grid?
How is Hazelcast implemented?
How has the architecture evolved since it was first created?
How is the Jet streaming framework architected?
What was the motivation for building it?
How do the capabilities of Jet compare to systems such as Flink or Spark Streaming?
How has the introduction of hardware capabilities such as NVMe drives influenced the market for in-memory systems?
How is the governance of the open source grid and Jet projects handled?
What is the guiding heuristic for which capabilities or features to include in the open source projects vs. the commercial offerings?
What is involved in building an application or workflow on top of Hazelcast?
What are the common patterns for engineers who are building on top of Hazelcast?
What is involved in deploying and maintaining an installation of the Hazelcast grid or Jet streaming?
What are the scaling factors for Hazelcast?
What are the edge cases that users should be aware of?
What are some of the most interesting, innovative, or unexpected ways that you have seen Hazelcast used?
When is Hazelcast Grid or Jet the wrong choice?
What is in store for the future of Hazelcast?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Hazelcast
Istanbul
Apache Spark
OrientDB
CAP Theorem
NVMe
Memristors
Intel Optane Persistent Memory
Hazelcast Jet
Kappa Architecture
IBM Cloud Paks
Digital Integration Hub (Gartner)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 7, 2020 • 54min
Simplify Your Data Architecture With The Presto Distributed SQL Engine
Summary
Databases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across multiple sources and storage locations. This frequently requires cumbersome and time-consuming data integration. To address this problem Martin Traverso and his colleagues at Facebook built the Presto distributed query engine. In this episode he explains how it is designed to allow for querying and combining data where it resides, the use cases that such an architecture unlocks, and the innovative ways that it is being employed at companies across the world. If you need to work with data in your cloud data lake, your on-premise database, or a collection of flat files, then give this episode a listen and then try out Presto today.
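The core idea is easiest to see in a single query that joins data where it lives. The sketch below uses the presto-python-client package (pip install presto-python-client); the host, user, catalog names, and tables are placeholders for this example.

    import prestodb

    conn = prestodb.dbapi.connect(
        host="presto.example.com", port=8080, user="analyst",
        catalog="hive", schema="default",
    )
    cur = conn.cursor()
    cur.execute("""
        SELECT o.order_id, o.total, c.email
        FROM hive.default.orders o             -- files in the data lake
        JOIN postgresql.public.customers c     -- rows in an operational database
          ON o.customer_id = c.id
        WHERE o.total > 100
    """)
    for row in cur.fetchall():
        print(row)

No data is copied ahead of time; Presto pushes work down to each connector and combines the results.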
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Martin Traverso about PrestoSQL, a distributed SQL engine that queries data in place
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what Presto is and its origin story?
What was the motivation for releasing Presto as open source?
For someone who is responsible for architecting their organization’s data platform, what are some of the signals that Presto will be a good fit for them?
What are the primary ways that Presto is being used?
I interviewed your colleague at Starburst, Kamil, two years ago. How has Presto changed or evolved in that time, both technically and in terms of community and ecosystem growth?
What are some of the deployment and scaling considerations that operators of Presto should be aware of?
What are the best practices that have been established for working with data through Presto in terms of centralizing in a data lake vs. federating across disparate storage locations?
What are the tradeoffs of using Presto on top of a data lake vs a vertically integrated warehouse solution?
When designing the layout of a data lake that will be interacted with via Presto, what are some of the data modeling considerations that can improve the odds of success?
What are some of the most interesting, unexpected, or innovative ways that you have seen Presto used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building, growing, and supporting the Presto project?
When is Presto the wrong choice?
What is in store for the future of the Presto project and community?
Contact Info
LinkedIn
@mtraverso on Twitter
martint on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Presto
Starburst Data
Podcast Episode
Hadoop
Hive
Glue Metastore
BigQuery
Kinesis
Apache Pinot
Elasticsearch
ORC
Parquet
AWS Redshift
Avro
Podcast Episode
LZ4
Zstandard
KafkaSQL
Flink
Podcast Episode
PyTorch
Podcast.__init__ Episode
Tensorflow
Spark
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 1, 2020 • 1h 6min
Building A Better Data Warehouse For The Cloud At Firebolt
Summary
Data warehouse technology has been around for decades and has gone through several generational shifts in that time. The current trends in data warehousing are oriented around cloud native architectures that take advantage of dynamic scaling and the separation of compute and storage. Firebolt is taking that a step further with a core focus on speed and interactivity. In this episode CEO and founder Eldad Farkash explains how the Firebolt platform is architected for high throughput, their simple and transparent pricing model to encourage widespread use, and the use cases that it unlocks through interactive query speeds.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Eldad Farkash about Firebolt, a cloud data warehouse optimized for speed and elasticity on structured and semi-structured data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Firebolt is and your motivation for building it?
How does Firebolt compare to other data warehouse technologies, and what unique features does it provide?
The lines between a data warehouse and a data lake have been blurring in recent years. Where on that continuum does Firebolt lie?
What are the unique use cases that Firebolt allows for?
How do the performance characteristics of Firebolt change the ways that an engineer should think about data modeling?
What technologies might someone replace with Firebolt?
How is Firebolt architected and how has the design evolved since you first began working on it?
What are some of the most challenging aspects of building a data warehouse platform that is optimized for speed?
How do you handle support for nested and semi-structured data?
In what ways have you found it necessary/useful to extend SQL?
Due to the immutability of object storage, for data lakes the update or delete process involves reprocessing a potentially large amount of data. How do you approach that in Firebolt with your F3 format?
What have you found to be the most interesting, unexpected, or challenging lessons while building and scaling the Firebolt platform and business?
When is Firebolt the wrong choice?
What do you have planned for the future of Firebolt?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Firebolt
Sisense
SnowflakeDB
Podcast Episode
Redshift
Spark
Podcast Episode
Parquet
Podcast Episode
Hadoop
HDFS
S3
AWS Athena
BigQuery
Data Vault
Podcast Episode
Star Schema
Dimensional Modeling
Slowly Changing Dimensions
JDBC
TPC Benchmarks
DBT
Podcast Episode
Tableau
Looker
Podcast Episode
PrestoSQL
Podcast Episode
PostgreSQL
Podcast Episode
FoundationDB
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 25, 2020 • 51min
Metadata Management And Integration At LinkedIn With DataHub
Summary
In order to scale the use of data across an organization there are a number of challenges related to discovery, governance, and integration that need to be solved. The key to those solutions is a robust and flexible metadata management system. LinkedIn has gone through several iterations on the most maintainable and scalable approach to metadata, leading them to their current work on DataHub. In this episode Mars Lan and Pardhu Gunnam explain how they designed the platform, how it integrates into their data platforms, and how it is being used to power data discovery and analytics at LinkedIn.
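For a rough picture of the push-based ingestion model discussed in the episode, the sketch below emits a metadata change event onto a Kafka topic for DataHub to consume and index. The URN follows DataHub’s naming convention, but the JSON payload is a simplified illustration; a real deployment encodes events with Avro against DataHub’s MetadataChangeEvent schema.

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    event = {
        # DataHub-style URN identifying the dataset being described
        "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,tracking.page_views,PROD)",
        "aspect": {
            "ownership": {"owners": [{"owner": "urn:li:corpuser:jdoe"}]},
        },
    }
    producer.send("MetadataChangeEvent", event)  # topic name is illustrative
    producer.flush()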
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about DataHub, LinkedIn’s metadata management and data catalog platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what DataHub is and some of its back story?
What were you using at LinkedIn for metadata management prior to the introduction of DataHub?
What was lacking in the previous solutions that motivated you to create a new platform?
There are a large number of other systems available for building data catalogs and tracking metadata, both open source and proprietary. What are the features of DataHub that would lead someone to use it in place of the other options?
Who is the target audience for DataHub?
How do the needs of those end users influence or constrain your approach to the design and interfaces provided by DataHub?
Can you describe how DataHub is architected?
How has it evolved since you first began working on it?
What was your motivation for releasing DataHub as an open source project?
What have been the benefits of that decision?
What are the challenges that you face in maintaining changes between the public repository and your internally deployed instance?
What is the workflow for populating metadata into DataHub?
What are the challenges that you see in managing the format of metadata and establishing consistent models for the information being stored?
How do you handle discovery of data assets for users of DataHub?
What are the integration and extension points of the platform?
What is involved in deploying and maintaining an instance of the DataHub platform?
What are some of the most interesting or unexpected ways that you have seen DataHub used inside or outside of LinkedIn?
What are some of the most interesting, unexpected, or challenging lessons that you learned while building and working with DataHub?
When is DataHub the wrong choice?
What do you have planned for the future of the project?
Contact Info
Mars
LinkedIn
mars-lan on GitHub
Pardhu
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
DataHub
Map/Reduce
Apache Flume
LinkedIn Blog Post introducing DataHub
WhereHows
Hive Metastore
Kafka
CDC == Change Data Capture
Podcast Episode
PDL (LinkedIn’s schema language)
GraphQL
Elasticsearch
Neo4J
Apache Pinot
Apache Gobblin
Apache Samza
Open Sourcing DataHub Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 17, 2020 • 1h 6min
Exploring The TileDB Universal Data Engine
Summary
Most databases are designed to work with textual data, with some special purpose engines that support domain specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational primitive. In this episode the creator and founder of TileDB shares how he first started working on the underlying technology and the benefits of using a single engine for efficiently storing and querying any form of data. He also discusses the shifts in database architectures from vertically integrated monoliths to separately deployed layers, and the approach he is taking with TileDB Cloud to embed the authorization into the storage engine, while providing a flexible interface for compute. This was a great conversation about a different approach to database architecture and how that enables a more flexible way to store and interact with data to power better data sharing and new opportunities for blending specialized domains.
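To give a feel for arrays as the foundational primitive, here is a minimal sketch using the TileDB Python API (pip install tiledb): define a tiled two-dimensional domain with a typed attribute, write a block of values, and read back a slice. The array name is arbitrary.

    import numpy as np
    import tiledb

    # A 4x4 dense array, stored in 2x2 tiles, with one float64 attribute
    dom = tiledb.Domain(
        tiledb.Dim(name="rows", domain=(0, 3), tile=2, dtype=np.int32),
        tiledb.Dim(name="cols", domain=(0, 3), tile=2, dtype=np.int32),
    )
    schema = tiledb.ArraySchema(
        domain=dom, sparse=False,
        attrs=[tiledb.Attr(name="value", dtype=np.float64)],
    )
    tiledb.DenseArray.create("my_array", schema)

    with tiledb.DenseArray("my_array", mode="w") as A:
        A[:] = np.arange(16, dtype=np.float64).reshape(4, 4)

    with tiledb.DenseArray("my_array", mode="r") as A:
        print(A[0:2, 0:2]["value"])  # a slice only touches the tiles it needs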
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Stavros Papadopoulos about TileDB, the universal storage engine
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what TileDB is and the problem that you are trying to solve with it?
What was your motivation for building it?
What are the main use cases or problem domains that you are trying to solve for?
What are the shortcomings of existing approaches to database design that prevent them from being useful for these applications?
What are the benefits of using matrices for data processing and domain modeling?
What are the challenges that you have faced in storing and processing sparse matrices efficiently?
How does the usage of matrices as the foundational primitive affect the way that users should think about data modeling?
What are the benefits of unbundling the storage engine from the processing layer?
Can you describe how TileDB embedded is architected?
How has the design evolved since you first began working on it?
What is your approach to integrating with the broader ecosystem of data storage and processing utilities?
What does the workflow look like for someone using TileDB?
What is required to deploy TileDB in a production context?
How is the built in data versioning implemented?
What is the user experience for interacting with different versions of datasets?
How do you manage the lifecycle of versioned data to allow garbage collection?
How are you managing the governance and ongoing sustainability of the open source project, and the commercial offerings that you are building on top of it?
What are the most interesting, unexpected, or innovative ways that you have seen TileDB used?
What have you found to be the most interesting, unexpected, or challenging aspects of building TileDB?
What features or capabilities are you consciously deciding not to implement?
When is TileDB the wrong choice?
What do you have planned for the future of TileDB?
Contact Info
LinkedIn
stavrospapadopoulos on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
TileDB
GitHub
Data Frames
TileDB Cloud
MIT
Intel
Sparse Linear Algebra
Sparse Matrices
HDF5
Dask
Spark
MariaDB
PrestoDB
GDAL
PDAL
Turing Complete
Clustered Index
Parquet File Format
Podcast Episode
Serializability
Delta Lake
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 10, 2020 • 59min
Closing The Loop On Event Data Collection With Iteratively
Summary
Event based data is a rich source of information for analytics, but only if the event structures are consistent. The team at Iteratively are building a platform to manage the end to end flow of collaboration around what events are needed, how to structure the attributes, and how they are captured. In this episode founders Patrick Thompson and Ondrej Hrebicek discuss the problems that they have experienced as a result of inconsistent event schemas, how the Iteratively platform integrates the definition, development, and delivery of event data, and the benefits of elevating the visibility of event data for improving the effectiveness of the resulting analytics. If you are struggling with inconsistent implementations of event data collection, or with a lack of clarity on what attributes are needed and how they are being used, then this is definitely a conversation worth following.
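The failure mode, and the fix, are easy to demonstrate. The sketch below validates analytics events against a shared JSON Schema before they are sent, using the jsonschema package (pip install jsonschema); the event shape is invented for the example, and Iteratively’s own tooling generates typed tracking code rather than hand-written checks like this.

    from jsonschema import ValidationError, validate

    SONG_PLAYED_SCHEMA = {
        "type": "object",
        "properties": {
            "user_id": {"type": "string"},
            "song_id": {"type": "string"},
            "duration_ms": {"type": "integer", "minimum": 0},
        },
        "required": ["user_id", "song_id"],
        "additionalProperties": False,
    }

    def track(event: dict) -> None:
        validate(instance=event, schema=SONG_PLAYED_SCHEMA)  # raises on malformed events
        # ...forward the validated event to the analytics destination...

    track({"user_id": "u123", "song_id": "s456", "duration_ms": 30000})  # passes
    try:
        track({"user_id": "u123", "duration": "30s"})  # wrong field name, wrong type
    except ValidationError as err:
        print(f"rejected event: {err.message}")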
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Patrick Thompson and Ondrej Hrebicek about Iteratively, a platform for enforcing consistent schemas for your event data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Iteratively and your motivation for creating it?
What are some of the ways that you have seen inconsistent message structures cause problems?
What are some of the common anti-patterns that you have seen for managing the structure of event messages?
What are the benefits that Iteratively provides for the different roles in an organization?
Can you describe the workflow for a team using Iteratively?
How is the Iteratively platform architected?
How has the design changed or evolved since you first began working on it?
What are the difficulties that you have faced in building integrations for the Iteratively workflow?
How is schema evolution handled throughout the lifecycle of an event?
What are the challenges that engineers face in building effective integration tests for their event schemas?
What has been your biggest challenge in messaging for your platform and educating potential users of its benefits?
What are some of the most interesting or unexpected ways that you have seen Iteratively used?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Iteratively?
When is Iteratively the wrong choice?
What do you have planned for the future of Iteratively?
Contact Info
Patrick
LinkedIn
@Patrickt010 on Twitter
Website
Ondrej
LinkedIn
@ondrej421 on Twitter
ondrej on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Iteratively
Syncplicity
Locally Optimistic
DBT
Podcast Episode
Snowplow Analytics
Podcast Episode
JSON Schema
Master Data Management
Podcast Episode
SDLC == Software Development Life Cycle
Amplitude
Mixpanel
Mode Analytics
CRUD == Create, Read, Update, Delete
Segment
Podcast Episode
Schemaver (JSON Schema Versioning Strategy)
Great Expectations
Podcast.__init__ Interview
Data Engineering Podcast Interview
Confluence
Notion
Confluent Schema Registry
Podcast Episode
Snowplow Iglu Schema Registry
Pulsar Schema Registry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 4, 2020 • 1h 1min
A Practical Introduction To Graph Data Applications
Summary
Finding connections between data and the entities that they represent is a complex problem. Graph data models and the applications built on top of them are perfect for representing relationships and finding emergent structures in your information. In this episode Denise Gosnell and Matthias Broecheler discuss their recent book, the Practitioner’s Guide To Graph Data, including the fundamental principles that you need to know about graph structures, the current state of graph support in database engines, tooling, and query languages, as well as useful tips on potential pitfalls when putting them into production. This was an informative and enlightening conversation with two experts on graph data applications that will help you start on the right track in your own projects.
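For readers who have not worked with graph data before, the toy sketch below shows the basic mechanic behind the conversation: represent relationships as an adjacency list and find how two entities are connected with a breadth-first traversal. Production systems use a graph database and a query language such as Gremlin, but the underlying idea is the same.

    from collections import deque

    # Who follows whom, as an adjacency list
    follows = {
        "alice": ["bob", "carol"],
        "bob": ["dave"],
        "carol": ["dave"],
        "dave": [],
    }

    def shortest_path(graph: dict, start: str, goal: str) -> list:
        """Return the shortest chain of relationships from start to goal."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for neighbor in graph.get(path[-1], []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(path + [neighbor])
        return []

    print(shortest_path(follows, "alice", "dave"))  # ['alice', 'bob', 'dave']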
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Denise Gosnell and Matthias Broecheler about the recently published practitioner’s guide to graph data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what your goals are for the Practitioner’s Guide To Graph Data?
What was your motivation for writing a book to address this topic?
What do you see as the driving force behind the growing popularity of graph technologies in recent years?
What are some of the common use cases/applications of graph data and graph traversal algorithms?
What are the core elements of graph thinking that data teams need to be aware of to be effective in identifying those cases in their existing systems?
What are the fundamental principles of graph technologies that data engineers should be familiar with?
What are the core modeling principles that they need to know for designing schemas in a graph database?
Beyond databases, what are some of the other components of the data stack that can or should handle graphs natively?
Do you typically use a graph database as the primary or complementary data store?
What are some of the common challenges that you see when bringing graph applications to production?
What have you found to be some of the common points of confusion or error prone aspects of implementing and maintaining graph oriented applications?
When it comes to the specific technologies of different graph databases, what are some of the edge cases/variances in the interfaces or modeling capabilities that they present?
How does the variation in query languages impact the overall adoption of these technologies?
What are your thoughts on the recent standardization of GSQL as an ANSI specification?
What are some of the scaling challenges that exist for graph data engines?
What are the ongoing developments/improvements/trends in graph technology that you are most excited about?
What are some of the shortcomings in existing technology/ecosystem for graph applications that you would like to see addressed?
What are some of the cases where a graph is the wrong abstraction for a data project?
What are some of the other resources that you recommend for anyone who wants to learn more about the various aspects of graph data?
Contact Info
Denise
LinkedIn
@DeniseKGosnell on Twitter
Matthias
LinkedIn
@MBroecheler on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The Practitioner’s Guide To Graph Data
Datastax
Titan graph database
Goethe
Graph Database
NoSQL
Relational Database
Elasticsearch
Podcast Episode
Associative Array Data Structure
RDF Triple
Datastax Multi-model Graph Database
Semantic Web
Gremlin Graph Query Language
Super Node
Neuromorphic Computing
Datastax Desktop
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast


