

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes
Mentioned books

Nov 26, 2019 • 1h 1min
Building A Real Time Event Data Warehouse For Sentry
Summary
The team at Sentry has built a platform for anyone in the world to send software errors and events. As they scaled the volume of customers and data they began running into the limitations of their initial architecture. To address the needs of their business and continue to improve their capabilities they settled on Clickhouse as the new storage and query layer to power their business. In this episode James Cunningham and Ted Kaemming describe the process of rearchitecting a production system, what they learned in the process, and some useful tips for anyone else evaluating Clickhouse.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Ted Kaemming and James Cunningham about Snuba, the new open source search service at Sentry implemented on top of Clickhouse
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the internal and user-facing issues that you were facing at Sentry with the existing search capabilities?
What did the previous system look like?
What was your design criteria for building a new platform?
What was your initial list of possible system components and what was your evaluation process that resulted in your selection of Clickhouse?
Can you describe the system architecture of Snuba and some of the ways that it differs from your initial ideas of how it would work?
What have been some of the sharp edges of Clickhouse that you have had to engineer around?
How have you found the operational aspects of Clickhouse?
How did you manage the introduction of this new piece of infrastructure to a business that was already handling massive amounts of real-time data?
What are some of the downstream benefits of using Clickhouse for managing event data at Sentry?
For someone who is interested in using Snuba for their own purposes, how flexible is it for different domain contexts?
What are some of the other data challenges that you are currently facing at Sentry?
What is your next highest priority for evolving or rebuilding to address technical or business challenges?
Contact Info
James
@JTCunning on Twitter
JTCunning on GitHub
Ted
tkaemming on GitHub
Website
@tkaemming on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Sentry
Podcast.__init__ Episode
Snuba
Blog Post
Clickhouse
Podcast Episode
Disqus
Urban Airship
HBase
Google Bigtable
PostgreSQL
Redis
HyperLogLog
Riak
Celery
RabbitMQ
Apache Spark
Presto
Cassandra
Apache Kudu
Apache Pinot
Apache Druid
Flask
Apache Kafka
Cassandra Tombstone
Sentry Blog
XML
Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 18, 2019 • 56min
Escaping Analysis Paralysis For Your Data Platform With Data Virtualization
Summary
With the constant evolution of technology for data management it can seem impossible to make an informed decision about whether to build a data warehouse, or a data lake, or just leave your data wherever it currently rests. What’s worse is that any time you have to migrate to a new architecture, all of your analytical code has to change too. Thankfully it’s possible to add an abstraction layer to eliminate the churn in your client code, allowing you to evolve your data platform without disrupting your downstream data users. In this episode AtScale co-founder and CTO Matthew Baird describes how the data virtualization and data engineering automation capabilities that are built into the platform free up your engineers to focus on your business needs without having to waste cycles on premature optimization. This was a great conversation about the power of abstractions and appreciating the value of increasing the efficiency of your data team.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Matt Baird about AtScale, a platform that
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the AtScale platform and how it fits in the ecosystem of data tools?
What was your motivation for building the platform and what were some of the early challenges that you faced in achieving your current level of success?
How is the AtScale platform architected and what have been some of the main areas of evolution and change since you first began building it?
How has the surrounding data ecosystem changed since AtScale was founded?
How are current industry trends influencing your product focus?
Can you talk through the workflow for someone implementing AtScale?
What are some of the main use cases that benefit from data virtualization capabilities?
How does it influence the relevancy of data warehouses or data lakes?
What are some of the types of tools or patterns that AtScale replaces in a data platform?
What are some of the most interesting or unexpected ways that you have seen AtScale used?
What have been some of the most challenging aspects of building and growing the platform?
When is AtScale the wrong choice?
What do you have planned for the future of the platform and business?
Contact Info
LinkedIn
@zetty on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
AtScale
PeopleSoft
Oracle
Hadoop
PrestoDB
Impala
Apache Kylin
Apache Druid
Go Language
Scala
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 11, 2019 • 51min
Designing For Data Protection
Summary
The practice of data management is one that requires technical acumen, but there are also many policy and regulatory issues that inform and influence the design of our systems. With the introduction of legal frameworks such as the EU GDPR and California’s CCPA it is necessary to consider how to implement data protectino and data privacy principles in the technical and policy controls that govern our data platforms. In this episode Karen Heaton and Mark Sherwood-Edwards share their experience and expertise in helping organizations achieve compliance. Even if you aren’t subject to specific rules regarding data protection it is definitely worth listening to get an overview of what you should be thinking about while building and running data pipelines.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Karen Heaton and Mark Sherwood-Edwards about the idea of data protection, why you might need it, and how to include the principles in your data pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what is encompassed by the idea of data protection?
What regulations control the enforcement of data protection requirements, and how can we determine whether we are subject to their rules?
What are some of the conflicts and constraints that act against our efforts to implement data protection?
How much of data protection is handled through technical implementation as compared to organizational policies and reporting requirements?
Can you give some examples of the types of information that are subject to data protection?
One of the challenges in data management generally is tracking the presence and usage of any given information. What are some strategies that you have found effective for auditing the usage of protected information?
A corollary to tracking and auditing of protected data in the GDPR is the need to allow for deletion of an individual’s information. How can we ensure effective deletion of these records when dealing with multiple storage systems?
What are some of the system components that are most helpful in implementing and maintaining technical and policy controls for data protection?
How do data protection regulations impact or restrict the technology choices that are viable for the data preparation layer?
Who in the organization is responsible for the proper compliance to GDPR and other data protection regimes?
Downstream from the storage and management platforms that we build as data engineers are data scientists and analysts who might request access to protected information. How do the regulations impact the types of analytics that they can use?
Contact Info
Karen
Email
Website
Mark
Email
Website
GDPR Now Podcast
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Data Protection
GDPR
This Is DPO
Intellectual Property
European Convention Of Human Rights
CCPA == California Consumer Privacy Act
PII == Personally Identifiable Information
Privacy By Design
US Privacy Shield
Principle of Least Privilege
International Association of Privacy Professionals
Privacy Technology Vendor Report
Data Provenance
Chief Data Officer
UK ICO (Information Commissioner’s Office)
AI Audit Framework
Data Council
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 4, 2019 • 49min
Automating Your Production Dataflows On Spark
Summary
As data engineers the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data transformations that we write and maintain. Sean Knapp founded Ascend to address the operational challenges of running a production grade and scalable Spark infrastructure, allowing data engineers to focus on the problems that power their business. In this episode he explains the technical implementation of the Ascend platform, the challenges that he has faced in the process, and how you can use it to simplify your dataflow automation. This is a great conversation to get an understanding of all of the incidental engineering that is necessary to make your data reliable.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com today to find out more.
Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Sean Knapp about Ascend, which he is billing as an autonomous dataflow service
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what the Ascend platform is?
What was your inspiration for creating it and what keeps you motivated?
What was your criteria for determining the best execution substrate for the Ascend platform?
Can you describe any limitations that are imposed by your selection of Spark as the processing engine?
If you were to rewrite Spark from scratch today to fit your particular requirements, what would you change about it?
Can you describe the technical implementation of Ascend?
How has the system design evolved since you first began working on it?
What are some of the assumptions that you had at the beginning of your work on Ascend that have been challenged or updated as a result of working with the technology and your customers?
How does the programming interface for Ascend differ from that of a vanilla Spark deployment?
What are the main benefits that a data engineer would get from using Ascend in place of running their own Spark deployment?
How do you enforce the lack of side effects in the transforms that comprise the dataflow?
Can you describe the pipeline orchestration system that you have built into Ascend and the benefits that it provides to data engineers?
What are some of the most challenging aspects of building and launching Ascend that you have dealt with?
What are some of the most interesting or unexpected lessons learned or edge cases that you have encountered?
What are some of the capabilities that you are most proud of and which have gained the greatest adoption?
What are some of the sharp edges that remain in the platform?
When is Ascend the wrong choice?
What do you have planned for the future of Ascend?
Contact Info
LinkedIn
@seanknapp on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Ascend
Kubernetes
BigQuery
Apache Spark
Apache Beam
Go Language
SHA Hashes
PySpark
Delta Lake
Podcast Episode
DAG == Directed Acyclic Graph
PrestoDB
MinIO
Podcast Episode
Parquet
Snappy Compression
Tensorflow
Kafka
Druid
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

76 snips
Oct 28, 2019 • 1h 8min
Build Maintainable And Testable Data Applications With Dagster
Summary
Despite the fact that businesses have relied on useful and accurate data to succeed for decades now, the state of the art for obtaining and maintaining that information still leaves much to be desired. In an effort to create a better abstraction for building data applications Nick Schrock created Dagster. In this episode he explains his motivation for creating a product for data management, how the programming model simplifies the work of building testable and maintainable pipelines, and his vision for the future of data programming. If you are building dataflows then Dagster is definitely worth exploring.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Nick Schrock about Dagster, an open source system for building modern data applications
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Dagster is and the origin story for the project?
In the tagline for Dagster you describe it as "a system for building modern data applications". There are a lot of contending terms that one might use in this context, such as ETL, data pipelines, etc. Can you describe your thinking as to what the term "data application" means, and the types of use cases that Dagster is well suited for?
Can you talk through how Dagster is architected and some of the ways that it has evolved since you first began working on it?
What do you see as the current industry trends that are leading us away from full stack frameworks such as Airflow and Oozie for ETL and into an abstracted programming environment that is composable with different execution contexts?
What are some of the initial assumptions that you had which have been challenged or updated in the process of working with users of Dagster?
For someone who wants to extend Dagster, or integrate it with other components of their data infrastructure, such as a metadata engine, what interfaces do you provide for extensibility?
For someone who wants to get started with Dagster can you describe a typical workflow for writing a data pipeline?
Once they have something working, what is involved in deploying it?
One of the things that stands out about Dagster is the strong contracts that it enforces between computation nodes, or "solids". Why do you feel that those contracts are necessary, and what benefits do they provide during the full lifecycle of a data application?
Another difficult aspect of data applications is testing, both before and after deploying it to a production environment. How does Dagster help in that regard?
It is also challenging to keep track of the entirety of a DAG for a given workflow. How does Dagit keep track of the task dependencies, and what are the limitations of that tool?
Can you give an overview of where you see Dagster fitting in the overall ecosystem of data tools?
What are some of the features or capabilities of Dagster which are often overlooked that you would like to highlight for the listeners?
Your recent release of Dagster includes a built-in scheduler, as well as a built-in deployment capability. Why did you feel that those were necessary capabilities to incorporate, rather than continuing to leave that as end-user considerations?
You have built a new company around Dagster in the form of Elementl. How are you approaching sustainability and governance of Dagster, and what is your path to sustainability for the business?
What should listeners be keeping an eye out for in the near to medium future from Elementl and Dagster?
What is on your roadmap that you consider necessary before creating a 1.0 release?
Contact Info
@schrockn on Twitter
schrockn on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Dagster
Elementl
ETL
GraphQL
React
Matei Zaharia
DataOps Episode
Kafka
Fivetran
Podcast Episode
Spark
Supervised Learning
DevOps
Luigi
Airflow
Dask
Podcast Episode
Kubernetes
Ray
Maxime Beauchemin
Podcast Interview
Dagster Testing Guide
Great Expectations
Podcast.__init__ Interview
Papermill
Notebooks At Netflix Episode
DBT
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 22, 2019 • 43min
Data Orchestration For Hybrid Cloud Analytics
Summary
The scale and complexity of the systems that we build to satisfy business requirements is increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Dipti Borkark about data orchestration and how it helps in migrating data workloads to the cloud
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you mean by the term "Data Orchestration"?
How does it compare to the concept of "Data Virtualization"?
What are some of the tools and platforms that fit under that umbrella?
What are some of the motivations for organizations to use the cloud for their data oriented workloads?
What are they giving up by using cloud resources in place of on-premises compute?
For businesses that have invested heavily in their own datacenters, what are some ways that they can begin to replicate some of the benefits of cloud environments?
What are some of the common patterns for cloud migration projects and what challenges do they present?
Do you have advice on useful metrics to track for determining project completion or success criteria?
How do businesses approach employee education for designing and implementing effective systems for achieving their migration goals?
Can you talk through some of the ways that different data orchestration tools can be composed together for a cloud migration effort?
What are some of the common pain points that organizations encounter when working on hybrid implementations?
What are some of the missing pieces in the data orchestration landscape?
Are there any efforts that you are aware of that are aiming to fill those gaps?
Where is the data orchestration market heading, and what are some industry trends that are driving it?
What projects are you most interested in or excited by?
For someone who wants to learn more about data orchestration and the benefits the technologies can provide, what are some resources that you would recommend?
Contact Info
LinkedIn
@dborkar on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Alluxio
Podcast Episode
UC San Diego
Couchbase
Presto
Podcast Episode
Spark SQL
Data Orchestration
Data Virtualization
PyTorch
Podcast.init Episode
Rook storage orchestration
PySpark
MinIO
Podcast Episode
Kubernetes
Openstack
Hadoop
HDFS
Parquet Files
Podcast Episode
ORC Files
Hive Metastore
Iceberg Table Format
Podcast Episode
Data Orchestration Summit
Star Schema
Snowflake Schema
Data Warehouse
Data Lake
Teradata
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 15, 2019 • 47min
Keeping Your Data Warehouse In Order With DataForm
Summary
Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and table definitions, including unit testing SQL scripts, defining repeatable pipelines, and adding metadata to your warehouse to improve your team’s communication. In this episode CTO and co-founder of Dataform Lewis Hemens joins the show to explain his motivation for creating the platform and company, how it works under the covers, and how you can start using it today to get your data warehouse under control.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more.
Are you working on data, analytics, or AI using platforms such as Presto, Spark, or Tensorflow? Check out the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View. This one day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, Tensorflow, and you will also hear from creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! Use discount code PODCAST for 25% off of your ticket, and the first five people to register get free tickets! Register now as early bird tickets are ending this week! Attendees will takeaway learnings, swag, a free voucher to visit the museum, and a chance to win the latest ipad Pro!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Lewis Hemens about DataForm, a platform that helps analysts manage all data processes in your cloud data warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what DataForm is and the origin story for the platform and company?
What are the main benefits of using a tool like DataForm and who are the primary users?
Can you talk through the workflow for someone using DataForm and highlight the main features that it provides?
What are some of the challenges and mistakes that are common among engineers and analysts with regard to versioning and evolving schemas and the accompanying data?
How does CI/CD and change management manifest in the context of data warehouse management?
How is the Dataform SDK itself implemented and how has it evolved since you first began working on it?
Can you differentiate the capabilities between the open source CLI and the hosted web platform, and when you might need to use one over the other?
What was your selection process for an embedded runtime and how did you decide on javascript?
Can you talk through some of the use cases that having an embedded runtime enables?
What are the limitations of SQL when working in a collaborative environment?
Which database engines do you support and how do you reduce the maintenance burden for supporting different dialects and capabilities?
What is involved in adding support for a new backend?
When is DataForm the wrong choice?
What do you have planned for the future of DataForm?
Contact Info
LinkedIn
@lewishemens on Twitter
lewish on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
DataForm
YCombinator
DBT == Data Build Tool
Podcast Episode
Fishtown Analytics
Typescript
Continuous Integration
Continuous Delivery
BigQuery
Snowflake DB
UDF == User Defined Function
RedShift
PostgreSQL
Podcast Episode
AWS Athena
Presto
Podcast Episode
Apache Beam
Apache Kafka
Segment
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 8, 2019 • 55min
Fast Analytics On Semi-Structured And Structured Data In The Cloud
Summary
The process of exposing your data through a SQL interface has many possible pathways, each with their own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL analytics on semi-structured and structured data. In this episode CEO Venkat Venkataramani and SVP of Product Shruti Bhat explain the origins of Rockset, how it is architected to allow for fast and flexible SQL analytics on your data, and how their serverless platform can save you the time and effort of implementing portions of your own infrastructure.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Shruti Bhat and Venkat Venkataramani about Rockset, a serverless platform for enabling fast SQL queries across all of your data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Rockset is and your motivation for creating it?
What are some of the use cases that it enables which would otherwise be impractical or intractable?
How does Rockset fit into the infrastructure and workflow of data teams and what portions of a typical stack does it replace?
Can you describe how the Rockset platform is architected and how it has evolved as you onboard more customers?
Can you describe the flow of a piece of data as it traverses the full lifecycle in Rockset?
How is your storage backend implemented to allow for speed and flexibility in the query layer?
How does it manage distribution, balancing, and durability of the data?
What are your strategies for handling node and region failure in the cloud?
You have a whitepaper describing your architecture as being oriented around microservices on Kubernetes in order to be cloud agnostic. How do you handle the case where customers have data sources that span multiple cloud providers or regions and the latency that can result?
How is the query engine structured to allow for optimizing so many different query types (e.g. search, graph, timeseries, etc.)?
With Rockset handling a large portion of the underlying infrastructure work that a data engineer might be involved with, what are some ways that you have seen them use the time that they have gained and how has that benefitted the organizations that they work for?
What are some of the most interesting/unexpected/innovative ways that you have seen Rockset used?
When is Rockset the wrong choice for a given project?
What have you found to be the most challenging and the most exciting aspects of building the Rockset platform and company?
What do you have planned for the future of Rockset?
Contact Info
Venkat
LinkedIn
@iamveeve on Twitter
veeve on GitHub
Shruti
LinkedIn
@shrutibhat on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
Rockset
Blog
Oracle
VMWare
Facebook
Rube Goldberg Machine
SnowflakeDB
Protocol Buffers
Spark
Podcast Episode
Presto
Podcast Episode
Apache Kafka
RocksDB
InnoDB
Lucene
Log Structured Merge Tree (LSTM)
Kubernetes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 1, 2019 • 35min
Ship Faster With An Opinionated Data Pipeline Framework
Summary
Building an end-to-end data pipeline for your machine learning projects is a complex task, made more difficult by the variety of ways that you can structure it. Kedro is a framework that provides an opinionated workflow that lets you focus on the parts that matter, so that you don’t waste time on gluing the steps together. In this episode Tom Goldenberg explains how it works, how it is being used at Quantum Black for customer projects, and how it can help you structure your own. Definitely worth a listen to gain more understanding of the benefits that a standardized process can provide.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, Data Council in Barcelona, and the Data Orchestration Summit. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Tom Goldenberg about Kedro, an open source development workflow tool that helps structure reproducible, scaleable, deployable, robust and versioned data pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Kedro is and its origin story?
Who are the primary users of Kedro, and how does it fit into and impact the workflow of data engineers and data scientists?
Can you talk through a typical lifecycle for a project that is built using Kedro?
What are the overall features of Kedro and how do they compound to encourage best practices for data projects?
How does the culture and background of QuantumBlack influence the design and capabilities of Kedro?
What was the motivation for releasing it publicly as an open source framework?
What are some examples of ways that Kedro is being used within QuantumBlack and how has that experience informed the design and direction of the project?
Can you describe how Kedro itself is implemented and how it has evolved since you first started working on it?
There has been a recent trend away from end-to-end ETL frameworks and toward a decoupled model that focuses on a programming target with pluggable execution. What are the industry pressures that are driving that shift and what are your thoughts on how that will manifest in the long term?
How do the capabilities and focus of Kedro compare to similar projects such as Prefect and Dagster?
It has not yet reached a stable release. What are the aspects of Kedro that are still in flux and where are the changes most concentrated?
What is still missing for a stable 1.x release?
What are some of the most interesting/innovative/unexpected ways that you have seen Kedro used?
When is Kedro the wrong choice?
What do you have in store for the future of Kedro?
Contact Info
LinkedIn
@tomgoldenberg on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Kedro
GitHub
Quantum Black Labs
GitHub
Agolo
McKinsey
Airflow
Docker
Kubernetes
DataBricks
Formula 1
Kedro Viz
Dask
Podcast Interview
Py.Test
Azure Data Factory
Prefect
Podcast Interview
Dagster
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 23, 2019 • 1h 8min
Open Source Object Storage For All Of Your Data
Summary
Object storage is quickly becoming the unifying layer for data intensive applications and analytics. Modern, cloud oriented data warehouses and data lakes both rely on the durability and ease of use that it provides. S3 from Amazon has quickly become the de-facto API for interacting with this service, so the team at MinIO have built a production grade, easy to manage storage engine that replicates that interface. In this episode Anand Babu Periasamy shares the origin story for the MinIO platform, the myriad use cases that it supports, and the challenges that they have faced in replicating the functionality of S3. He also explains the technical implementation, innovative design, and broad vision for the project.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Anand Babu Periasamy about MinIO, the neutral, open source, enterprise grade object storage system.
Interview
Introduction
How did you get involved in the area of data management?
Can you explain what MinIO is and its origin story?
What are some of the main use cases that MinIO enables?
How does MinIO compare to other object storage options and what benefits does it provide over other open source platforms?
Your marketing focuses on the utility of MinIO for ML and AI workloads. What benefits does object storage provide as compared to distributed file systems? (e.g. HDFS, GlusterFS, Ceph)
What are some of the challenges that you face in terms of maintaining compatibility with the S3 interface?
What are the constraints and opportunities that are provided by adhering to that API?
Can you describe how MinIO is implemented and the overall system design?
How has that design evolved since you first began working on it?
What assumptions did you have at the outset and how have they been challenged or updated?
What are the axes for scaling that MinIO provides and how does it handle clustering?
Where does it fall on the axes of availability and consistency in the CAP theorem?
One of the useful features that you provide is efficient erasure coding, as well as protection against data corruption. How much overhead do those capabilties incur, in terms of computational efficiency and, in a clustered scenario, storage volume?
For someone who is interested in running MinIO, what is involved in deploying and maintaining an installation of it?
What are the cases where it makes sense to use MinIO in place of a cloud-native object store such as S3 or Google Cloud Storage?
How do you approach project governance and sustainability?
What are some of the most interesting/innovative/unexpected ways that you have seen MinIO used?
What do you have planned for the future of MinIO?
Contact Info
LinkedIn
@abperiasamy on Twitter
abperiasamy on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
MinIO
GlusterFS
Object Storage
RedHat
Bionics
AWS S3
Ceph
Swift Stack
POSIX
HDFS
Google BigQuery
AzureML
AWS SageMaker
AWS Athena
S3 Select
Azure Blob Store
BackBlaze
Round Robin DNS
Service Mesh
Istio
Envoy
SmartStack
Free Software
RocksDB
TanTan Blog Post
Presto
SparkML
MCAdmin Trace
DTrace
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast